I'm using NGramTokenFilter to process text before storing it in a PyLucene index. When I retrieve a document with IndexSearcher, is it possible to get the list of n-grams that represents the found document directly from the searcher/reader, or should I just rerun the analyzer over the stored text, like this:
analyzer = myAnalyzer()
stream = analyzer.tokenStream("", StringReader("FACEBK ADS"))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
stream.end()
stream.close()
print(tokens)
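For context, here is a plain-Python sketch (no Lucene needed) of the character n-grams an NGramTokenFilter would emit for one token; the minGram=2 and maxGram=3 values are assumptions for illustration, and recent Lucene versions order grams by start offset as shown here:

```python
def char_ngrams(token, min_gram=2, max_gram=3):
    """Emit all character n-grams of length min_gram..max_gram,
    grouped by start offset (the ordering used by recent Lucene
    NGramTokenFilter versions; assumed here for illustration)."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams

# One token from the sample text "FACEBK ADS":
print(char_ngrams("ADS"))  # → ['AD', 'ADS', 'DS']
```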
I use the following FieldType settings when building my index field:
field = FieldType()
field.setStored(True)
field.setTokenized(True)
field.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
field.setStoreTermVectors(True)  # writing field.setStoreTermVectors=True instead RESULTS IN ERROR
When searching:
reader = DirectoryReader.open(directory)
searcher = IndexSearcher(reader)
analyzer = myAnalyzer()
query = QueryParser("indexed_field", analyzer).parse('test')
scoreDocs = searcher.search(query, 1).scoreDocs