04 · OPERATIONS

Troubleshooting

The short list of things that go wrong in production, what they look like, and what to do about them. Every symptom links to the metric, the log line, and the config key to adjust.

Symptom: ingest throughput drops after a few hours

Likely cause: merge pressure. Segments are piling up faster than the merger can consolidate them, and the memtable is flushing small segments.

Check:

$ curl -s http://127.0.0.1:8080/v1/metrics | grep -E 'segment_count|merge_duration'
xerj_segment_count{index="logs"} 2847
xerj_merge_duration_seconds_count{index="logs"} 183
xerj_merge_duration_seconds_sum{index="logs"} 1847.2
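
The two merge_duration counters are easier to read as an average. A minimal sketch, assuming they follow the usual Prometheus convention (_sum is total seconds spent merging, _count is the number of merges) and that no timestamp follows the value. avg_merge_seconds is a hypothetical helper name, not an XERJ command:

```shell
# Hypothetical helper: mean merge duration from metrics text on stdin.
# Assumes *_sum is total seconds and *_count is the number of merges;
# values are added up across all indices present in the input.
avg_merge_seconds() {
  awk '
    /^xerj_merge_duration_seconds_sum/   { sum += $NF }
    /^xerj_merge_duration_seconds_count/ { count += $NF }
    END {
      if (count > 0) { printf "%.2f\n", sum / count }
      else           { print "n/a" }
    }'
}
```

Usage: curl -s http://127.0.0.1:8080/v1/metrics | avg_merge_seconds. With the sample output above it prints 10.09, about ten seconds per merge averaged over the counters' lifetime. A rising average alongside a rising segment_count is consistent with merge pressure.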

Fix: raise [merge] max_concurrent (1 → 2–4), raise io_rate_mb_per_sec (100 → 250–500 on NVMe), and raise [storage] flush_size_mb so flushes produce bigger starter segments.

Symptom: queries time out under load

Likely cause: too many concurrent queries or a single query that grew too large.

Check: look for "query cancelled: max_query_memory_mb exceeded" in the logs, or for active_searches pegged at max_concurrent_searches.

$ journalctl -u xerj --since "10 min ago" | grep -E 'cancel|timeout|rejected'
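
To see which failure dominates, the matches from the grep above can be tallied per keyword. A rough sketch; tally_query_failures is a hypothetical helper name, and the keywords are just the three from the grep pattern:

```shell
# Hypothetical helper: tally log lines on stdin by failure keyword.
# A line containing more than one keyword is counted once per keyword.
tally_query_failures() {
  awk '
    /cancel/   { cancel++ }
    /timeout/  { timeout++ }
    /rejected/ { rejected++ }
    END { printf "cancel=%d timeout=%d rejected=%d\n", cancel, timeout, rejected }'
}
```

Usage: journalctl -u xerj --since "10 min ago" | tally_query_failures. A high cancel count together with the max_query_memory_mb message points at memory. Which message XERJ logs when the concurrency limit is hit is not stated here, so treat the other two counts as hints only.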

Fix: if the cause is memory, raise [limits] max_query_memory_mb (512 → 1024 or 2048 for aggregation-heavy workloads). If it is concurrency, raise max_concurrent_searches, but first check that the host actually has headroom.

Symptom: node RAM climbs forever

Likely cause: HNSW index growing without quantization, or too many indices with large flush_size_mb.

Check:

$ curl -s http://127.0.0.1:8080/v1/metrics | grep memory_usage
xerj_memory_usage_bytes 14200000000  # 14 GB and climbing
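
A single reading does not show a trend, so it helps to sample the gauge repeatedly. A small sketch that converts the value to GB (10^9 bytes) for easier reading; mem_gb is a hypothetical helper name:

```shell
# Hypothetical helper: print xerj_memory_usage_bytes from stdin in GB (10^9 bytes).
mem_gb() {
  awk '/^xerj_memory_usage_bytes/ { printf "%.1f\n", $2 / 1000000000 }'
}
```

Usage: run curl -s http://127.0.0.1:8080/v1/metrics | mem_gb once a minute (for example inside a while sleep 60 loop) and watch whether the number levels off or keeps rising. The sample value above prints as 14.2.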

Fix: set [vector] hnsw_offload_threshold = 1000000 to switch automatically to scalar4 quantization once an index exceeds 1 M vectors. Or lower flush_size_mb so each memtable holds less in RAM before it flushes.

Symptom: "WAL replay failed" on restart

Likely cause: the server was killed mid-fsync (power loss, OOM kill). The tail WAL file is torn.

Check: the first log lines after boot. "wal replay: truncating torn tail at offset N" is benign (XERJ truncates the torn suffix and continues). A hard "wal replay: checksum mismatch at offset N, refusing to start" is not.
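
The two cases can be told apart mechanically. A minimal sketch that matches only the two message texts quoted above; wal_replay_status is a hypothetical helper name:

```shell
# Hypothetical helper: classify WAL replay messages read from stdin.
# A checksum mismatch takes precedence over a benign truncation.
wal_replay_status() {
  awk '
    /wal replay: checksum mismatch/    { fatal = 1 }
    /wal replay: truncating torn tail/ { truncated = 1 }
    END {
      if (fatal)          { print "refused: repair needed" }
      else if (truncated) { print "benign: torn tail truncated" }
      else                { print "no wal replay message" }
    }'
}
```

Usage: journalctl -u xerj --since "10 min ago" | wal_replay_status, run shortly after the restart attempt.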

Fix: if truncation worked on its own, there is nothing to do: the last few seconds of writes are lost but the index is consistent. If the server refused to start, run xerj verify --data-dir /var/lib/xerj --repair-wal, which truncates the WAL at the last valid entry.

Symptom: "disk full" during a merge

Likely cause: no reservation for merge scratch space. A merge of two N-sized segments needs 2N free until the merge completes.

Fix: lower [merge] max_segment_mb so individual merges stay smaller, or free up disk space. Once the merge retries and succeeds, the old segments are unlinked.
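
The 2N rule is easy to check before a merge runs into the wall. A sketch under the assumption that both segments are about the same size; merge_fits is a hypothetical helper name and takes whole megabytes:

```shell
# Hypothetical helper: is there room to merge two segments of the same size?
# Applies the 2N rule: two segments of N MB need 2N MB free during the merge.
# Usage: merge_fits <free_mb> <segment_mb>
merge_fits() {
  free_mb=$1
  segment_mb=$2
  needed_mb=$((2 * segment_mb))
  if [ "$free_mb" -ge "$needed_mb" ]; then
    echo "ok: need ${needed_mb} MB, have ${free_mb} MB"
  else
    echo "short: need ${needed_mb} MB, have ${free_mb} MB"
    return 1
  fi
}
```

Usage: take the free space from df -Pk /var/lib/xerj | awk 'NR==2 { print int($4 / 1024) }' and pass your max_segment_mb as the segment size. For example, merge_fits 3000 2048 reports that 4096 MB are needed and returns a non-zero status.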

Symptom: cluster flapping — leader changes every few seconds

Likely cause: network latency between peers is high enough that heartbeats miss. Raft responds by calling a new election.

Check:

$ journalctl -u xerj --since "5 min ago" | grep -E 'term|election|leader'
... raft: election timeout, starting new term 17
... raft: received higher term 18, stepping down

Fix: raise [cluster] tick_ms from 50 to 150 or 250, which gives heartbeats more room on a slow network. Never set the tick interval below the round-trip time (RTT) between your worst pair of nodes.
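
The RTT rule can be written as a one-line comparison. A minimal sketch; tick_ok is a hypothetical helper name, and the RTT is whatever you measured yourself (for example the max value from ping between the two slowest peers):

```shell
# Hypothetical helper: is tick_ms at or above the worst peer-to-peer RTT?
# Usage: tick_ok <tick_ms> <worst_rtt_ms>   (RTT may be fractional, e.g. 72.4)
tick_ok() {
  awk -v tick="$1" -v rtt="$2" 'BEGIN {
    if (tick + 0 >= rtt + 0) { print "ok"; exit 0 }
    print "too low"; exit 1
  }'
}
```

Usage: tick_ok 50 72.4 prints "too low" and returns non-zero; tick_ok 150 72.4 prints "ok". Passing the check only satisfies the lower bound stated above, so leave some margin over the measured RTT.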

Symptom: search returns stale results

Likely cause: recent docs are still in the memtable, and the query was routed to a replica that hasn't replicated them yet.

Fix: pass ?preference=primary on the search query to force routing to the primary shard, or lower [storage] flush_interval_secs so the memtable flushes more often.

Getting help

Collect diagnostics before filing a bug. This command builds a self-contained tarball with the config, recent logs, and a metrics snapshot:

$ xerj support-bundle --out /tmp/xerj-support-$(date +%s).tar.gz
wrote /tmp/xerj-support-1745000000.tar.gz (384 KiB)
contents:
  config.toml
  logs/xerj.log.gz
  metrics.txt
  cluster-health.json
  indices-stats.json

Source · engine/crates/server/src/main.rs · engine/crates/common/src/metrics.rs