Solr: Text Analyzers

Text analysis

Text analysis turns raw field text into the terms that are indexed and searched. It typically includes:

- tokenization

- case normalization

- stemming

- query expansion using synonyms

Examples of analysis components (tokenizer and filter factories)

org.apache.lucene.analysis.standard.StandardTokenizerFactory

org.apache.lucene.analysis.core.StopFilterFactory

org.apache.lucene.analysis.synonym.SynonymFilterFactory

org.apache.lucene.analysis.core.LowerCaseFilterFactory

org.apache.lucene.analysis.en.EnglishPossessiveFilterFactory

org.apache.lucene.analysis.miscellaneous.KeywordMarkerFilterFactory

org.apache.lucene.analysis.en.PorterStemFilterFactory
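
As a rough sketch of how these factories fit together, the field type below chains them in the order listed. The field type name text_en_example and the file names stopwords.txt, synonyms.txt and protwords.txt are assumptions for illustration; the solr. prefix is Solr's shorthand for the analysis packages named above.

```xml
<fieldType name="text_en_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split text into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Drop stop words listed in the (assumed) stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Expand synonyms listed in the (assumed) synonyms.txt -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Case normalization -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Strip a trailing 's from tokens -->
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- Protect words in the (assumed) protwords.txt from stemming -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Reduce words to their stems -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```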

Character Filters (<charFilter>)

Character filters process the raw character stream before tokenization.

- MappingCharFilterFactory (maps characters to replacements, for example accented Unicode characters to ASCII)

- HTMLStripCharFilterFactory (strips HTML markup, leaving the text)

- PatternReplaceCharFilterFactory (regex-based replacement)
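
The three character filters above could be combined roughly as follows. This is only a sketch: the field type name, the mapping file name and the regex rule are assumptions chosen for illustration, not taken from a real schema.

```xml
<fieldType name="text_html_example" class="solr.TextField">
  <analyzer>
    <!-- Remove HTML markup, keeping the text content -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Fold accented characters to ASCII using an (assumed) mapping file -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <!-- Hypothetical rule: join hyphenated word pairs, e.g. i-pod becomes ipod -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w+)-(\w+)" replacement="$1$2"/>
    <!-- Character filters run first; the tokenizer sees the cleaned text -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```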

Tokenization (<tokenizer>)

A tokenizer takes text as a character stream and splits it into tokens, usually discarding insignificant characters such as whitespace and word-joining punctuation.

- KeywordTokenizerFactory

- WhitespaceTokenizerFactory

- StandardTokenizerFactory

- UAX29URLEmailTokenizerFactory (behaves like StandardTokenizerFactory, but also recognizes e-mail addresses and URLs as single tokens)

- ClassicTokenizerFactory

- LetterTokenizerFactory

- LowerCaseTokenizerFactory

- PatternTokenizerFactory

- PathHierarchyTokenizerFactory
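
Choosing a tokenizer is a matter of swapping the <tokenizer> element. The two field types below are a minimal sketch of that, with assumed field type names: one keeps e-mail addresses and URLs whole, the other indexes every ancestor of a path.

```xml
<!-- Keeps e-mail addresses and URLs intact as single tokens -->
<fieldType name="text_url_email_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Index time: /a/b/c is emitted as the tokens /a, /a/b and /a/b/c -->
<!-- Query time: the whole path is kept as one token -->
<fieldType name="path_example" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With the second type, a query for /a/b would match every document whose path lies at or below /a/b.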

For more details, see https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Filtering (<filter>)

Token filters consume one stream of tokens, known as a TokenStream, and generate another, so they can be chained one after another indefinitely. A token filter may perform complex analysis by processing multiple tokens in the stream at once, but in most cases it processes each token in turn and decides whether to keep, replace, or discard it.
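
To make the chaining concrete, the sketch below traces a sample input through a short chain. The field type name is made up, and the trace assumes that stopwords.txt contains the word "the".

```xml
<fieldType name="text_chain_example" class="solr.TextField">
  <analyzer>
    <!-- Input "The Quick Foxes" becomes: The | Quick | Foxes -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Replaces each token with its lowercase form: the | quick | foxes -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Discards stop words (assuming "the" is listed): quick | foxes -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

Order matters: each filter sees only what the previous stage emitted, so the stop filter here compares already-lowercased tokens.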

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form. For example, a stemming algorithm might reduce "running" and "runs" to just "run". Stemming improves recall, usually at some cost to precision. If you want to improve the precision of search results but retain the recall benefits, consider indexing the data in two fields, one stemmed and the other not stemmed. Stemming is language-specific.
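
The two-field approach might look like the sketch below. All field and field type names (title, title_stemmed, text_exact_example, text_stemmed_example) are hypothetical.

```xml
<!-- Unstemmed field type: favors precision -->
<fieldType name="text_exact_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Stemmed field type: favors recall -->
<fieldType name="text_stemmed_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text_exact_example" indexed="true" stored="true"/>
<field name="title_stemmed" type="text_stemmed_example" indexed="true" stored="false"/>

<!-- Index the same source text into the stemmed field as well -->
<copyField source="title" dest="title_stemmed"/>
```

A query can then search both fields and weight the unstemmed one higher, for instance with the eDisMax parameter qf=title^2 title_stemmed, so that exact matches rank above stem-only matches.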

Stemmers in English

- SnowballPorterFilterFactory

- PorterStemFilterFactory

- KStemFilterFactory (less aggressive than the Porter stemmer)

- EnglishMinimalStemFilterFactory
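
Switching between these stemmers is a one-line change in the analyzer. The sketch below uses KStem and shows the others commented out; the field type name and the protwords.txt file name are assumptions.

```xml
<fieldType name="text_en_light_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Words listed in the (assumed) protwords.txt are not stemmed -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Use exactly one stemmer; the alternatives are commented out -->
    <filter class="solr.KStemFilterFactory"/>
    <!-- <filter class="solr.PorterStemFilterFactory"/> -->
    <!-- <filter class="solr.SnowballPorterFilterFactory" language="English"/> -->
    <!-- <filter class="solr.EnglishMinimalStemFilterFactory"/> -->
  </analyzer>
</fieldType>
```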

Synonyms

Synonyms are generally applied at either query time or index time, but not both.

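Explicit mapping (terms on the left are replaced by the term on the right):

i-pod, i pod => ipod

Equivalent terms (any one of them is treated as a synonym of the others):

ipod, i-pod, i pod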
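
A query-time-only setup could be sketched as below, assuming rules like the ones above are stored in a file named synonyms.txt. The field type name is hypothetical. A whitespace tokenizer is used so that i-pod survives as a single token and can match its synonym rule.

```xml
<fieldType name="text_syn_example" class="solr.TextField">
  <!-- Index time: no synonym filter -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: synonyms are applied here only -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping synonyms on the query side means synonyms.txt can be edited without reindexing; moving the filter to the index analyzer instead trades that flexibility for a larger index.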

WordNet (http://wordnet.princeton.edu/) is a free thesaurus.