Analyzers
An analyzer turns raw text into a token stream. Every text field has exactly one analyzer, fixed when the index is created. XERJ ships four built-ins: standard, whitespace, simple, and english. The [fts] default_analyzer setting picks the one that applies to any field that does not set its own.
standard
The default. Applies Unicode-aware segmentation, lowercasing, and an English stop-word list. Correct for most log and document workloads. Modeled on Elasticsearch's standard analyzer; note that Elasticsearch's version removes no stop words unless configured to.
"message": { "type": "text", "analyzer": "standard" }
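As a rough illustration of the three stages named above (segmentation, lowercasing, stop-word removal), here is a minimal Python sketch. It is not XERJ's implementation: the regex `\w+` only approximates Unicode-aware segmentation, and `STOP_WORDS` is a short made-up list, since the list XERJ ships is not shown in this section.

```python
import re

# Hypothetical stop words for illustration; not the list XERJ ships.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

# \w+ matches runs of Unicode letters, digits, and underscores.
# Real Unicode segmentation has more rules than this.
_WORD = re.compile(r"\w+")


def standard_like_analyze(text: str) -> list[str]:
    """Segment, lowercase, then drop stop words."""
    lowered = (token.lower() for token in _WORD.findall(text))
    return [token for token in lowered if token not in STOP_WORDS]


print(standard_like_analyze("The connection to db-01 is Closed"))
# ['connection', 'db', '01', 'closed']
```

Note that the hyphen in db-01 acts as a break, so the two halves become separate tokens.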
whitespace
Splits on whitespace only. No lowercasing, no stop words. Use when you want exact token matching on multi-word identifiers or code.
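The behaviour described above can be sketched in a few lines of Python. This is an illustration of whitespace-only splitting, not XERJ's code; the function name `whitespace_analyze` is made up for the example.

```python
def whitespace_analyze(text: str) -> list[str]:
    """Split on runs of whitespace; leave case and punctuation untouched."""
    # str.split() with no arguments splits on any Unicode whitespace
    # and discards the empty strings produced by repeated separators.
    return text.split()


print(whitespace_analyze("GET /api/v2/Users?id=42  HTTP/1.1"))
# ['GET', '/api/v2/Users?id=42', 'HTTP/1.1']
```

Because nothing is lowercased or stripped, a query must match case and punctuation exactly: Users and users are different tokens here.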
simple
Splits on any non-letter. Lowercases. Good for mixed-punctuation inputs where you don't want the standard analyzer's tokenizer rules.
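A minimal Python sketch of the same idea, assuming "letter" means roughly a Unicode letter. The function name `simple_analyze` is hypothetical, and the character class only approximates the letter category, so edge cases may differ from XERJ's actual tokenizer.

```python
import re

# [^\W\d_] is "a word character that is not a digit or underscore",
# which approximates a Unicode letter.
_LETTERS = re.compile(r"[^\W\d_]+")


def simple_analyze(text: str) -> list[str]:
    """Keep runs of letters, lowercase them, drop everything else."""
    return [run.lower() for run in _LETTERS.findall(text)]


print(simple_analyze("Disk-usage: 93% on node_7 (WARN)"))
# ['disk', 'usage', 'on', 'node', 'warn']
```

Digits and underscores count as separators here, so 93 disappears and node_7 is reduced to node. That is the main practical difference from the whitespace analyzer.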
english
The standard analyzer plus English stemming (Porter2) and a slightly larger stop-word list. Use when you want "running" and "runs" to match. Setting [indexing] turbo_fast_analyzer = true trades recall for ingest speed: the turbo path skips stemming for documents classified as log-shaped.
Per-field override
{
  "fields": {
    "message": { "type": "text", "analyzer": "english" },
    "raw": { "type": "text", "analyzer": "whitespace" }
  }
}
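The selection rule (a field's own analyzer wins, otherwise the configured default applies) can be sketched as follows. This is an assumption-laden illustration, not XERJ's lookup code: `resolve_analyzer` is a made-up helper, its `default_analyzer` argument stands in for the [fts] default_analyzer setting, and the `title` field is added here only to show the fallback.

```python
import json

# The mapping from this section, plus a hypothetical "title" field
# that sets no analyzer of its own.
MAPPING = json.loads("""
{
  "fields": {
    "message": { "type": "text", "analyzer": "english" },
    "raw": { "type": "text", "analyzer": "whitespace" },
    "title": { "type": "text" }
  }
}
""")


def resolve_analyzer(mapping: dict, field: str, default_analyzer: str = "standard") -> str:
    """Return the field's own analyzer, or the default if it sets none."""
    spec = mapping["fields"][field]
    return spec.get("analyzer", default_analyzer)


for name in ("message", "raw", "title"):
    print(name, "->", resolve_analyzer(MAPPING, name))
# message -> english
# raw -> whitespace
# title -> standard
```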
Source · engine/crates/fts/src/analyzer.rs