When a document is indexed, its individual fields are passed through analyzing and tokenizing filters that can transform and normalize the data in those fields. Examples include removing blank spaces, stripping HTML markup, stemming, and replacing one character with another. You may need to perform some of these or similar operations at query time as well as at indexing time. For example, you might apply a Soundex transformation (a type of phonetic hashing) to a string to enable searches that match both the word and its sound-alikes.
The lists below provide an overview of some of the more heavily used Tokenizers and TokenFilters provided by Solr "out of the box", along with tips and examples for using them. They should by no means be considered a complete list of the Analysis classes available in Solr. New classes are added on an ongoing basis, and you can also load your own custom Analysis code as a Plugin.
For a more complete list of the Tokenizers and TokenFilters that come out of the box, please consult the javadocs for the analysis package. If you have any tips or tricks you'd like to mention about using any of these classes, please add them below.
Note: For a good background on Lucene Analysis, it's recommended that you read the following sections in Lucene In Action:
- 1.5.3 : Analyzer
- Chapter 4.0 through 4.7 at least
Try searches for "analyzer", "token", and "stemming".
Contents
- Analyzers, Tokenizers, and Token Filters
- Overview
- Stemming
- Analyzers
- Char Filters
- Tokens and Token Filters
- Specifying an Analyzer in the schema
- CharFilterFactories
- solr.MappingCharFilterFactory
- solr.HTMLStripCharFilterFactory
- TokenizerFactories
- solr.LetterTokenizerFactory
- solr.WhitespaceTokenizerFactory
- solr.LowerCaseTokenizerFactory
- solr.StandardTokenizerFactory
- solr.HTMLStripWhitespaceTokenizerFactory
- solr.HTMLStripStandardTokenizerFactory
- solr.PatternTokenizerFactory
- TokenFilterFactories
- solr.StandardFilterFactory
- solr.LowerCaseFilterFactory
- solr.TrimFilterFactory
- solr.StopFilterFactory
- solr.KeepWordFilterFactory
- solr.LengthFilterFactory
- solr.PorterStemFilterFactory
- solr.EnglishPorterFilterFactory
- solr.SnowballPorterFilterFactory
- solr.WordDelimiterFilterFactory
- solr.SynonymFilterFactory
- solr.RemoveDuplicatesTokenFilterFactory
- solr.ISOLatin1AccentFilterFactory
- solr.PhoneticFilterFactory
- solr.ShingleFilterFactory
- solr.PositionFilterFactory
- solr.ReversedWildcardFilterFactory
- CharFilterFactories
Analyzers
Analyzers are components that pre-process input text at index time and/or at search time. It is important that the analyzers used at index time and at query time are the same or similar, so that text is processed in a compatible manner. For example, if the indexing analyzer lowercases words, the query analyzer should do the same so that the indexed words can be found.
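As a minimal sketch of compatible index-time and query-time analysis, a field type in the schema might declare one analyzer for each phase, with both applying the same lowercasing step. The field type name `text_lower_example` is hypothetical and chosen only for illustration; the tokenizer and filter factories are the ones listed in the contents above.

```xml
<!-- Hypothetical field type: the name "text_lower_example" is an assumption. -->
<fieldType name="text_lower_example" class="solr.TextField">
  <!-- Applied to field values when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to query text; lowercases as well, so "Solr" matches the indexed token "solr" -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the query analyzer omitted the lowercase filter, a query for "Solr" would be compared against the indexed token "solr" and would not match.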
On wildcard and fuzzy searches, no text analysis is performed on the search word.
The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to specify a custom Analyzer in the Solr schema.
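One way such a custom analyzer could look, sketched under assumptions, is a tokenizer followed by a chain of token filters declared inside a field type. The names `text_custom_example` and `description`, and the stop word file `stopwords.txt`, are placeholders rather than anything defined in this document. Because the analyzer carries no `type` attribute, it is used at both index and query time.

```xml
<!-- Hypothetical field type; the name and the stop word file are assumptions. -->
<fieldType name="text_custom_example" class="solr.TextField">
  <analyzer>
    <!-- Split the input text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalize case so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop common words listed in the stop word file -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the field type above -->
<field name="description" type="text_custom_example" indexed="true" stored="true"/>
```

The order of the filters matters: each filter receives the token stream produced by the step before it.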