I'm new to PyLucene. I managed to install it on Ubuntu and looked at [sample code][1] showing how a custom analyzer is implemented. I tried modifying it by adding an NGramTokenFilter, but I keep getting an error when printing out the result of the custom analyzer. If I remove the ngram filter, it works just fine.
Basically, what I'm trying to do is split all incoming text on whitespace, lowercase it, fold the characters to their ASCII equivalents, and generate ngrams.
The code is as follows:
```python
class myAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName):
        source = WhitespaceTokenizer()
        filter = LowerCaseFilter(source)
        filter = ASCIIFoldingFilter(filter)
        filter = NGramTokenFilter(filter, 1, 2)
        return self.TokenStreamComponents(source, filter)

    def initReader(self, fieldName, reader):
        return reader


analyzer = myAnalyzer()
stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
print(tokens)
```
The error I keep getting is:
```
InvalidArgsError: (<class 'org.apache.lucene.analysis.ngram.NGramTokenFilter'>, '__init__', (<ASCIIFoldingFilter: ASCIIFoldingFilter@192d74fb term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1>, 2, 3))
```
What am I doing wrong?
Looking at the JavaDoc for NGramTokenFilter, you have to use this:
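```java
NGramTokenFilter(TokenStream input, int gramSize)
```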
for a fixed ngram size; or this:
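```java
NGramTokenFilter(TokenStream input, int minGram, int maxGram, boolean preserveOriginal)
```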
for a range of ngram sizes, with a boolean for `preserveOriginal`, which determines:

> Whether or not to keep the original term when it is shorter than minGram or longer than maxGram

What you have is neither of those.
(Side note: I'm not sure an ngram of size 1 makes a lot of sense - but maybe it's what you need.)
**Update**
Just for completeness, here is my (somewhat modified) standalone version of the code in the question. The most relevant part is this line, which uses the four-argument constructor, with `preserveOriginal` set to `True`:
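```python
filter = NGramTokenFilter(filter, 1, 2, True)
```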
The program (using PyLucene 9.4.1 and Java 11):
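```python
import lucene

from java.io import StringReader
from org.apache.lucene.analysis import LowerCaseFilter
from org.apache.lucene.analysis.core import WhitespaceTokenizer
from org.apache.lucene.analysis.miscellaneous import ASCIIFoldingFilter
from org.apache.lucene.analysis.ngram import NGramTokenFilter
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute
from org.apache.pylucene.analysis import PythonAnalyzer

# Start the Java VM before touching any Lucene classes.
lucene.initVM()


class myAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName):
        source = WhitespaceTokenizer()
        filter = LowerCaseFilter(source)
        filter = ASCIIFoldingFilter(filter)
        # The fix: the range constructor is (input, minGram, maxGram,
        # preserveOriginal). True keeps the original term whenever it is
        # shorter than minGram or longer than maxGram.
        filter = NGramTokenFilter(filter, 1, 2, True)
        return self.TokenStreamComponents(source, filter)

    def initReader(self, fieldName, reader):
        return reader


analyzer = myAnalyzer()
stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
print(tokens)
```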
The output, where `preserveOriginal=True` causes the full terms `margin` and `wonderfule` to be emitted alongside the 1- and 2-grams:
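```
['m', 'ma', 'a', 'ar', 'r', 'rg', 'g', 'gi', 'i', 'in', 'n', 'margin', 'w', 'wo', 'o', 'on', 'n', 'nd', 'd', 'de', 'e', 'er', 'r', 'rf', 'f', 'fu', 'u', 'ul', 'l', 'le', 'e', 'wonderfule']
```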