Indexing original tokens in Solr


Let's say I have the following field type:

<fieldType name="text_body" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

My goal is to index, for each token, the original token as well as the token after passing all the token filters. For example, for the text:

"My dog is barking #DOGS"

The current field type (as mentioned above) will index the following tokens:

"my", "dog", "bark", "dogs", "#dogs" 

"is" will be dropped by the stop filter, and "barking" will become "bark" because of the stemming filter.

I would like the following tokens to be indexed:

"My", "my", "dog", "barking", "bark", "dogs", "#DOGS".

I considered the "preserveOriginal" parameter of WordDelimiterGraphFilterFactory, but it only works for tokens with special characters, and the preserved "original token" still passes through the remaining filters afterwards.

I know that the obvious way is to write a custom TokenFilter that indexes the tokens in their original form right after the tokenizer, but my question is whether there is something built into Solr that allows it.

I'm using Solr 6.5.1.

Thanks :)


There is 1 answer

S Jayesh

Nice question about maintaining search relevancy for natural language. The following will probably help.

If the fields you search on all use the field type mentioned above ("text_body"), and you want both stemmed and original tokens to be searchable for every one of those fields, then try creating an additional field for each of them (say, field_withoutStemmer) whose field type is a copy of "text_body" without the following filter:

<filter class="solr.PorterStemFilterFactory"/>
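
As a rough sketch of what that could look like in the schema (the names text_body_unstemmed, body and body_unstemmed are made up for illustration, and filling the second field with copyField is an assumption, not something stated in the question):

```xml
<!-- Hypothetical copy of text_body with the stemmer removed -->
<fieldType name="text_body_unstemmed" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- PorterStemFilterFactory intentionally left out, so "barking" stays "barking" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field names; the stemmed field keeps the original type -->
<field name="body" type="text_body" indexed="true" stored="true"/>
<field name="body_unstemmed" type="text_body_unstemmed" indexed="true" stored="false"/>

<!-- Send the same raw text through both analysis chains at index time -->
<copyField source="body" dest="body_unstemmed"/>
```

Note that this variant still lowercases and still drops stop words, so it only covers the stemmed-versus-unstemmed part of the question. Keeping "My" or "#DOGS" in their original case would mean removing those filters from the second field type as well.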

In addition, if you are using the dismax/edismax query parser, you may want to set the "tie" parameter to a non-zero value (probably tie=1.0).

With tie=1.0, a matched document's score is the sum of its scores from both fields. With tie=0.0 (the default) the query is a pure disjunction max, so only the highest of the two field scores counts.
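
For reference, dismax/edismax scores a term as the best field score plus tie times the sum of the other field scores, which is why tie=1.0 gives a plain sum. A minimal sketch of request handler defaults for solrconfig.xml, assuming two hypothetical field names (body for the stemmed field, body_unstemmed for the unstemmed one):

```xml
<!-- Hypothetical defaults; adjust the handler name and field names to your setup -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Search both the stemmed and the unstemmed field -->
    <str name="qf">body body_unstemmed</str>
    <!-- 1.0 adds the scores of both fields; 0.0 keeps only the best one -->
    <float name="tie">1.0</float>
  </lst>
</requestHandler>
```

The same parameters can also be passed per request (defType, qf and tie as query parameters) if you would rather not change the handler defaults.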