03 · DATA MODEL

Analyzers

An analyzer turns raw text into a token stream. Every text field has exactly one analyzer, chosen at index-create time. XERJ ships four built-ins: standard, whitespace, simple, and english. The [fts] default_analyzer setting picks which one applies unless a field overrides it.

standard

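The default. Unicode-aware segmentation, lowercasing, and an English stop-word list. Suitable for most log and document workloads. Comparable to Elasticsearch's standard analyzer, with one difference: Elasticsearch's version removes no stop words unless configured to.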
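
The three stages above can be approximated in a few lines of Python. This is a minimal sketch, not XERJ's implementation: the \w+ pattern is only a rough stand-in for Unicode word segmentation, the function name is hypothetical, and STOP_WORDS is a small illustrative subset rather than the list the engine actually ships.

```python
import re

# Illustrative subset only (assumption); the engine's real English
# stop-word list is not reproduced here.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def standard_analyze(text: str) -> list[str]:
    """Approximate the standard analyzer: segment, lowercase, drop stop words."""
    # \w+ keeps runs of letters, digits, and underscores in any script.
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]


print(standard_analyze("The connection to Zürich timed out"))
# ['connection', 'zürich', 'timed', 'out']
```

Note that "The" and "to" disappear from the stream, so a phrase query containing them cannot match on those words.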

"message": { "type": "text", "analyzer": "standard" }

whitespace

Splits on whitespace only, with no lowercasing and no stop-word removal. Use it when you want exact, case-sensitive token matching on multi-word identifiers or code, where punctuation inside a token must be preserved.
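
As a rough illustration of that behaviour, Python's str.split() with no arguments does the same thing: it breaks on runs of whitespace and leaves everything else alone. The function name is hypothetical and the sketch makes no claim about how the engine implements it.

```python
def whitespace_analyze(text: str) -> list[str]:
    # Split on runs of whitespace; case and punctuation are left untouched.
    return text.split()


print(whitespace_analyze("GET /api/v2/Users?id=42 HTTP/1.1"))
# ['GET', '/api/v2/Users?id=42', 'HTTP/1.1']
```

A query for "users" would not match this document, because the only token containing it is the full, case-preserved path.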

simple

Splits on any non-letter character, so digits and punctuation both act as separators, and lowercases the result. Good for mixed-punctuation inputs where you don't want the standard analyzer's tokenizer rules.
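
A minimal sketch of the same idea in Python, assuming "letter" means any Unicode letter; the function name is hypothetical and the engine's exact character classes may differ.

```python
import re


def simple_analyze(text: str) -> list[str]:
    # [^\W\d_]+ matches runs of letters only, so digits, underscores,
    # and punctuation all act as separators.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


print(simple_analyze("GET /api/v2/Users?id=42 HTTP/1.1"))
# ['get', 'api', 'v', 'users', 'id', 'http']
```

Digits are lost entirely under this rule ("v2" becomes "v", and "42" produces no token), which is worth keeping in mind for fields that carry version numbers or IDs.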

english

The standard analyzer plus English stemming (Porter2) and a slightly larger stop-word list. Use when you want "running" and "runs" to match. Setting [indexing] turbo_fast_analyzer = true trades recall for ingest speed: the turbo path skips stemming for documents classified as log-shaped.
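
To see why stemming affects recall, consider the sketch below. It uses a deliberately tiny suffix stripper, which is NOT the Porter2 algorithm and is far less accurate; all names are hypothetical, and the stem flag merely stands in for a document that went through a path without stemming. How the engine classifies documents as log-shaped is not covered here.

```python
def toy_stem(token: str) -> str:
    """Toy suffix stripper for illustration only; not Porter2."""
    for suffix in ("ing", "s"):
        # Strip at most one suffix, and only if at least 3 characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Undouble a trailing consonant pair: "runn" -> "run".
    if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token


def english_analyze(text: str, stem: bool = True) -> list[str]:
    tokens = [t.lower() for t in text.split()]
    return [toy_stem(t) for t in tokens] if stem else tokens


query = english_analyze("runs")
doc_stemmed = english_analyze("service is running")
doc_unstemmed = english_analyze("service is running", stem=False)

print(set(query) & set(doc_stemmed))    # {'run'}: the query matches
print(set(query) & set(doc_unstemmed))  # set(): the match is lost
```

With stemming, both "runs" and "running" reduce to the same token, so the query matches. Without it, the document keeps "running" verbatim and the match is lost, which is the recall cost of skipping the stemmer.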

Per-field override

{
  "fields": {
    "message": { "type": "text", "analyzer": "english" },
    "raw":     { "type": "text", "analyzer": "whitespace" }
  }
}
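
The precedence between a field's own analyzer and the index-wide default can be sketched as a simple lookup. This assumes that a field without an "analyzer" key falls back to the value of [fts] default_analyzer; the function name and the "title" field are hypothetical additions for illustration.

```python
import json

MAPPING = json.loads("""
{
  "fields": {
    "message": { "type": "text", "analyzer": "english" },
    "raw":     { "type": "text", "analyzer": "whitespace" },
    "title":   { "type": "text" }
  }
}
""")


def resolve_analyzer(field_name: str, mapping: dict, default_analyzer: str = "standard") -> str:
    # A per-field "analyzer" wins; otherwise fall back to the default.
    field = mapping["fields"][field_name]
    return field.get("analyzer", default_analyzer)


for name in MAPPING["fields"]:
    print(name, "->", resolve_analyzer(name, MAPPING))
# message -> english
# raw -> whitespace
# title -> standard
```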

Source · engine/crates/fts/src/analyzer.rs
