Lucene | How to find prefix matches at beginning of field?


I want to match prefixes near the start of a field. I have the code below, but it does not match the prefix; it only matches when the search term equals the whole word. There seems to be no way to combine SpanTermQuery and PrefixQuery.

        var nameTerm = new Term("name", searchTerm);

        var prefixName = new PrefixQuery(nameTerm);

        var prefixAtStart = new BooleanQuery
        {
            { prefixName, Occur.MUST },
            {  new SpanFirstQuery(new SpanTermQuery(nameTerm), 0), Occur.MUST }
        };

For example:

  • Search term: "Comp"
  • Want to find: "Computer science class" and "Comp Sci"
  • Only finding: "Comp Sci"
  • Don't want to find: "Apple's latest computer"

Can the RegexpQuery be made to understand positions?


There are 2 answers

Abhijit Bashetti On BEST ANSWER

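If you only need prefix matching, you can define the field's type with the analyzer below. KeywordTokenizerFactory keeps the whole field value as a single token, so a prefix query matches only from the start of the field.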

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

In this case the query would look like this:

field:comp*
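
For Lucene.Net, the same idea might be sketched as follows. This assumes Lucene.Net 4.8 with the Lucene.Net.Analysis.Common package; the names LowercaseKeywordAnalyzer, KeywordPrefixDemo and FindByPrefix are hypothetical and exist only for illustration. Note that a PrefixQuery term is not passed through the analyzer, so the sketch lowercases the search term by hand.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical analyzer: the whole field value becomes one lowercased token.
public sealed class LowercaseKeywordAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = new KeywordTokenizer(reader);
        TokenStream filter = new LowerCaseFilter(LuceneVersion.LUCENE_48, tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }
}

public static class KeywordPrefixDemo
{
    // Indexes the given names in memory and returns those whose value starts with searchTerm.
    public static List<string> FindByPrefix(string[] names, string searchTerm)
    {
        var hits = new List<string>();
        using (var dir = new RAMDirectory())
        {
            var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, new LowercaseKeywordAnalyzer());
            using (var writer = new IndexWriter(dir, config))
            {
                foreach (var name in names)
                {
                    var doc = new Document();
                    doc.Add(new TextField("name", name, Field.Store.YES));
                    writer.AddDocument(doc);
                }
            } // disposing the writer commits the documents

            using (var reader = DirectoryReader.Open(dir))
            {
                var searcher = new IndexSearcher(reader);
                // Prefix terms are not analyzed, so lowercase manually to match the index.
                var query = new PrefixQuery(new Term("name", searchTerm.ToLowerInvariant()));
                foreach (var hit in searcher.Search(query, 10).ScoreDocs)
                {
                    hits.Add(searcher.Doc(hit.Doc).Get("name"));
                }
            }
        }
        return hits;
    }
}
```

Called with the three names from the question and the term "Comp", this should return "Computer science class" and "Comp Sci" but not "Apple's latest computer", because the last value does not begin with "comp".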

For the second case you need an n-gram filter (here EdgeNGramFilterFactory), so you can use the field and field type below:

<field name="text_prefix" type="text_prefix" indexed="true" stored="false"/>

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>
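
To make the effect of the index-time filter concrete, here is a small dependency-free sketch of which front edge n-grams a single token yields for the minGramSize and maxGramSize values above. It only imitates the filter's output for illustration; EdgeNGramDemo and FrontEdgeNGrams are hypothetical names, not part of Solr or Lucene.

```csharp
using System;
using System.Collections.Generic;

public static class EdgeNGramDemo
{
    // Returns the leading substrings of token with lengths minGram..maxGram
    // (capped at the token length). Tokens shorter than minGram yield nothing.
    public static List<string> FrontEdgeNGrams(string token, int minGram, int maxGram)
    {
        var grams = new List<string>();
        int longest = Math.Min(maxGram, token.Length);
        for (int len = minGram; len <= longest; len++)
        {
            grams.Add(token.Substring(0, len));
        }
        return grams;
    }

    public static void Main()
    {
        // prints: com, comp, compu, comput, compute, computer
        Console.WriteLine(string.Join(", ", FrontEdgeNGrams("computer", 3, 15)));
    }
}
```

Because the query analyzer does not apply the n-gram filter, a query for comp is compared as a plain term against these indexed grams. Grams are produced for every token in the field, so a value such as "Apple's latest computer" would also match comp, and search terms shorter than three characters would match nothing.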

azulBonnet On

Translating Abhijit's response, here is the Lucene.Net way to set up the EdgeNGramTokenFilter:

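using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public class CustomAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Split the field value into word tokens.
        Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);

        // Emit the leading 3 to 10 characters of each token as additional terms.
        TokenFilter filter = new EdgeNGramTokenFilter(LuceneVersion.LUCENE_48, tokenizer, 3, 10);

        return new TokenStreamComponents(tokenizer, filter);
    }
}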
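
If the position constraint from the question is still needed on top of prefix matching, one possible approach is to wrap the PrefixQuery in a SpanMultiTermQueryWrapper, which lets a multi-term query be used where a span query is expected. The sketch below assumes Lucene.Net 4.8 and a field indexed with positions; PrefixAtStart and Build are hypothetical names, and the end value of 1 is an assumption meaning "the match must be the first token of the field".

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Spans;

public static class PrefixAtStart
{
    // Builds a query matching documents whose first token in `field` starts with searchTerm.
    // The term is not analyzed here, so it must already match the indexed casing.
    public static Query Build(string field, string searchTerm)
    {
        var prefix = new PrefixQuery(new Term(field, searchTerm));

        // Expose the prefix query as a span query so it can carry position information.
        var spanPrefix = new SpanMultiTermQueryWrapper<PrefixQuery>(prefix);

        // A span covering only the first token ends at position 1.
        return new SpanFirstQuery(spanPrefix, 1);
    }
}
```

A larger end value would widen "near the start" to the first few tokens instead of only the first one.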