Dissolved gas analysis

When a document is indexed, its individual fields are subject to analyzing and tokenizing filters that can transform and normalize the data in the fields: for example, removing blank spaces, removing HTML code, stemming, or replacing a particular character with another. You may need to perform some of these or similar operations at query time as well as at indexing time. For example, you might apply a Soundex transformation (a type of phonetic hashing) to a string to enable a search based on the word and on its 'sound-alikes'.
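To make the 'sound-alikes' idea concrete, here is a minimal Python sketch of the classic American Soundex code. It is only an illustration of the hashing scheme, not the code Solr or Lucene actually runs, and the function name soundex is chosen for this example.

```python
def soundex(word: str) -> str:
    """Return the American Soundex code: first letter plus three digits."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels (and Y) have no code
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]  # pad with zeros or truncate to four characters


print(soundex("Robert"), soundex("Rupert"))
```

Words that hash to the same code, such as "Robert" and "Rupert", would be indexed under the same term, so a query for one can find the other.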

The lists below provide an overview of some of the more heavily used Tokenizers and TokenFilters provided by Solr "out of the box" along with tips/examples of using them. This list should by no means be considered the "complete" list of all Analysis classes available in Solr! In addition to new classes being added on an ongoing basis, you can load your own custom Analysis code as a Plugin.

For a more complete list of the Tokenizers and TokenFilters that come out of the box, please consult the javadocs for the analysis package. If you have any tips/tricks you'd like to mention about using any of these classes, please add them below.

Note: For good background on Lucene analysis, it is recommended that you read the following sections of Lucene in Action:

1.5.3 : Analyzer
Chapter 4.0 through 4.7 at least

Try searches for "analyzer", "token", and "stemming".

Contents

Analyzers, Tokenizers, and Token Filters
1. Overview
2. Stemming
3. Analyzers
4. Char Filters
5. Tokens and Token Filters
6. Specifying an Analyzer in the schema
  1. CharFilterFactories
    1. solr.MappingCharFilterFactory
    2. solr.HTMLStripCharFilterFactory
  2. TokenizerFactories
    1. solr.LetterTokenizerFactory
    2. solr.WhitespaceTokenizerFactory
    3. solr.LowerCaseTokenizerFactory
    4. solr.StandardTokenizerFactory
    5. solr.HTMLStripWhitespaceTokenizerFactory
    6. solr.HTMLStripStandardTokenizerFactory
    7. solr.PatternTokenizerFactory
  3. TokenFilterFactories
    1. solr.StandardFilterFactory
    2. solr.LowerCaseFilterFactory
    3. solr.TrimFilterFactory
    4. solr.StopFilterFactory
    5. solr.KeepWordFilterFactory
    6. solr.LengthFilterFactory
    7. solr.PorterStemFilterFactory
    8. solr.EnglishPorterFilterFactory
    9. solr.SnowballPorterFilterFactory
    10. solr.WordDelimiterFilterFactory
    11. solr.SynonymFilterFactory
    12. solr.RemoveDuplicatesTokenFilterFactory
    13. solr.ISOLatin1AccentFilterFactory
    14. solr.PhoneticFilterFactory
    15. solr.ShingleFilterFactory
    16. solr.PositionFilterFactory
    17. solr.ReversedWildcardFilterFactory

Analyzers

Analyzers are components that pre-process input text at index time and/or at search time. It is important to use analyzers that process text in a compatible way at index time and at query time. For example, if the indexing analyzer lowercases words, the query analyzer should do the same so that the indexed words can be found.

On wildcard and fuzzy searches, no text analysis is performed on the search word.
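The two points above can be illustrated with a toy inverted index in Python. This is a sketch under simplifying assumptions, not Solr's implementation: the analyzer is just a whitespace split plus lowercasing, and the helper names (analyze, build_index, term_search, wildcard_search) are invented for the example.

```python
import fnmatch


def analyze(text):
    """Toy analyzer: whitespace tokenizer followed by a lowercase filter."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Map each analyzed term to the ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def term_search(index, query, analyzed=True):
    """Look up query terms, optionally skipping query-time analysis."""
    terms = analyze(query) if analyzed else query.split()
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


def wildcard_search(index, pattern):
    """Match the pattern against indexed terms exactly as typed (no analysis)."""
    hits = set()
    for term, ids in index.items():
        if fnmatch.fnmatchcase(term, pattern):
            hits |= ids
    return hits


index = build_index({1: "Apache Solr Analyzers", 2: "Lucene token filters"})
print(term_search(index, "SOLR"))                  # analyzed like the index
print(term_search(index, "SOLR", analyzed=False))  # raw query term
print(wildcard_search(index, "Sol*"))              # pattern used as typed
print(wildcard_search(index, "sol*"))              # pattern typed in lowercase
```

Because the index holds only lowercased terms, the unanalyzed query "SOLR" and the wildcard pattern "Sol*" find nothing, while the analyzed query and the lowercase pattern "sol*" both find document 1.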

The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to specify a custom Analyzer in the Solr schema.

Char Filters

A Char Filter is a component that pre-processes input characters. Char Filters can be chained, like Token Filters, and are placed in front of a Tokenizer. They can add, change, or remove characters without breaking Token offsets, because the offsets are corrected to refer to positions in the original text.
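A small Python sketch may help show how a character-level filter can rewrite text in front of a tokenizer and still report offsets into the original input. This is an illustration of the idea only, not how Solr's factories are implemented; mapping_char_filter and whitespace_tokens are hypothetical names, and the mapping table is made up for the example.

```python
import re


def mapping_char_filter(text, mapping):
    """Replace every key of `mapping` found in `text` with its value.

    Returns the filtered text and a list `offsets`, where offsets[i] is the
    position in the original text that produced filtered character i.
    """
    out, offsets = [], []
    keys = sorted(mapping, key=len, reverse=True)  # prefer the longest match
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                replacement = mapping[key]
                out.append(replacement)
                offsets.extend([i] * len(replacement))
                i += len(key)
                break
        else:  # no mapping key starts here: copy the character unchanged
            out.append(text[i])
            offsets.append(i)
            i += 1
    offsets.append(len(text))  # sentinel so end offsets can be looked up
    return "".join(out), offsets


def whitespace_tokens(filtered, offsets):
    """Yield (token, start, end) with start/end mapped back to the original text."""
    for match in re.finditer(r"\S+", filtered):
        yield match.group(), offsets[match.start()], offsets[match.end()]


original = "caf&eacute; au lait"
filtered, offsets = mapping_char_filter(original, {"&eacute;": "e"})
for token, start, end in whitespace_tokens(filtered, offsets):
    print(token, start, end, repr(original[start:end]))
```

The tokenizer only ever sees "cafe au lait", yet the first token's offsets (0, 11) still cover "caf&eacute;" in the original string, which is what features such as highlighting rely on.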

Friday, November 20, 2009

Analyzers, Tokenizers, and Token Filters Overview
