Solr: Text Analyzers
Text analysis
Text analysis typically includes the following steps:
- tokenization
- case normalization
- stemming
- query expansion using synonyms
Examples of text analysis components (tokenizer and filter factories)
org.apache.lucene.analysis.standard.StandardTokenizerFactory
org.apache.lucene.analysis.core.StopFilterFactory
org.apache.lucene.analysis.synonym.SynonymFilterFactory
org.apache.lucene.analysis.core.LowerCaseFilterFactory
org.apache.lucene.analysis.en.EnglishPossessiveFilterFactory
org.apache.lucene.analysis.miscellaneous.KeywordMarkerFilterFactory
org.apache.lucene.analysis.en.PorterStemFilterFactory
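As a rough sketch, the components listed above could be chained into a single field type in the schema. The `solr.` prefix is Solr's shorthand for the fully qualified class names. The field type name `text_en_example` and the file names `stopwords.txt` and `protwords.txt` are assumptions for illustration; the synonym filter is left out here because it is usually configured on only one side (index or query).

```xml
<!-- Hypothetical field type; the name and the file names are assumptions -->
<fieldType name="text_en_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Drop stop words listed in the (assumed) stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Case normalization -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Remove the trailing 's from possessives -->
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- Protect the words listed in the (assumed) protwords.txt from stemming -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Stemming -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because there is a single `<analyzer>` element without a `type` attribute, the same chain is used at both index time and query time.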
Character Filters (<charFilter>)
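Character filters process a stream of text prior to tokenization.
- MappingCharFilterFactory (maps characters to replacements, for example accented Unicode characters to their ASCII equivalents)
- HTMLStripCharFilterFactory (extracts text from HTML documents)
- PatternReplaceCharFilterFactory (replaces text that matches a regex pattern)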
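The character filters above might be declared as follows; they run in the order listed, before the tokenizer sees the text. This is only a sketch: the field type name `text_html_example`, the mapping file name, and the regex (which removes a hyphen between two groups of digits) are assumptions chosen for illustration.

```xml
<!-- Hypothetical field type; name, mapping file, and pattern are assumptions -->
<fieldType name="text_html_example" class="solr.TextField">
  <analyzer>
    <!-- Strip HTML markup and keep the text content -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Replace characters according to an (assumed) mapping file -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <!-- Regex replacement: "555-1234" becomes "5551234" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\d+)-(\d+)" replacement="$1$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```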
Tokenization (<tokenizer>)
A tokenizer takes text in the form of a character stream and splits it into tokens, usually skipping insignificant bits such as whitespace and word-joining punctuation. Each analyzer has exactly one tokenizer.
- KeywordTokenizerFactory
- WhitespaceTokenizerFactory
- StandardTokenizerFactory
- UAX29URLEmailTokenizerFactory (behaves like StandardTokenizerFactory, but additionally recognizes e-mail addresses and URLs as single tokens)
- ClassicTokenizerFactory
- LetterTokenizerFactory
- LowerCaseTokenizerFactory
- PatternTokenizerFactory
- PathHierarchyTokenizerFactory
Find more details at https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
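Some of the tokenizers listed above take attributes. The sketch below shows two of them; the field type names, the `/` delimiter, and the comma-splitting regex are assumptions for illustration.

```xml
<!-- Hypothetical: index "/a/b/c" as the tokens "/a", "/a/b", "/a/b/c" -->
<fieldType name="path_example" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <!-- Keep the whole query path as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical: split a comma-separated list, ignoring whitespace around commas -->
<fieldType name="csv_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
  </analyzer>
</fieldType>
```

With the first field type, a query for `/a/b` would match every document whose path lies at or below `/a/b`.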
Filtering (<filter>)
Token filters consume one stream of tokens, known as a TokenStream, and generate another, so they can be chained one after another indefinitely. A token filter may perform complex analysis by processing multiple tokens in the stream at once, but in most cases it processes each token sequentially and decides whether to keep, replace, or discard it.
Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form. For example, a stemming algorithm might reduce "running" and "runs" to just "run". Stemming improves recall at some cost to precision. If you want to improve the precision of search results while retaining the recall benefits, consider indexing the data in two fields, one stemmed and the other not stemmed. Stemming is language-specific.
Stemmers for English
- SnowballPorterFilterFactory
- PorterStemFilterFactory
- KStemFilterFactory (less aggressive than PorterStemmer)
- EnglishMinimalStemFilterFactory
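The two-field approach described above could look roughly like this in the schema. The field names `title` and `title_stemmed` and the field type names `text_plain` and `text_stemmed` are hypothetical; KStemFilterFactory is just one possible choice of stemmer.

```xml
<!-- Unstemmed field type (hypothetical name) -->
<fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Stemmed field type (hypothetical name) -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text_plain" indexed="true" stored="true"/>
<field name="title_stemmed" type="text_stemmed" indexed="true" stored="false"/>

<!-- Index the same source text into the stemmed field as well -->
<copyField source="title" dest="title_stemmed"/>
```

A query could then search both fields and weight the unstemmed one higher, for example with the eDisMax parameter `qf=title^2 title_stemmed`, so that exact word forms tend to rank above stem-only matches. The boost value is an arbitrary illustration.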
Synonyms
Synonyms are generally applied either at query time or at index time, but not both.
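Filter definition:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
Example entries in synonyms.txt:
# explicit mapping: terms on the left are replaced by the term on the right
i-pod, i pod => ipod
# equivalent synonyms: with expand="true", each term expands to all of them
ipod, i-pod, i pod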
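To apply synonyms on only one side, a field type can declare separate index and query analyzers. The following is a minimal sketch that expands synonyms at query time only; the field type name `text_syn_example` is hypothetical, and moving the synonym filter into the index analyzer instead would give index-time expansion.

```xml
<!-- Hypothetical field type: synonyms applied at query time only -->
<fieldType name="text_syn_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Expand query terms using synonyms.txt -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Query-time synonyms can be changed without reindexing, whereas index-time synonyms require the affected documents to be reindexed after synonyms.txt changes.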
WordNet (http://wordnet.princeton.edu/) is a free thesaurus.