Running in production

One binary, one config file, one data directory. That's the mental model; everything else is tuning. This page walks through an end-to-end deployment: the systemd unit, readiness and liveness probes, config reload, log levels, metrics scraping, capacity planning, and file descriptor limits.

Systemd unit

Run XERJ as a non-root user. Create the user and data directory first:

$ sudo useradd --system --home /var/lib/xerj --shell /usr/sbin/nologin xerj
$ sudo install -d -o xerj -g xerj -m 0750 /var/lib/xerj /etc/xerj
$ sudo install -m 0640 -o xerj -g xerj xerj.toml /etc/xerj/xerj.toml

Drop this unit at /etc/systemd/system/xerj.service:

[Unit]
Description=XERJ search + vector + log engine
Documentation=https://xerj.dev/docs/
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=xerj
Group=xerj
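ExecStart=/usr/local/bin/xerj --config /etc/xerj/xerj.toml
# Required for `systemctl reload xerj`; sends SIGHUP to the main process
ExecReload=/bin/kill -HUP $MAINPID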
Restart=on-failure
RestartSec=5
LimitNOFILE=1048576
LimitMEMLOCK=infinity

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/xerj
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
SystemCallArchitectures=native
LockPersonality=true

[Install]
WantedBy=multi-user.target

Then reload systemd, enable and start the service, and follow its logs:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now xerj
$ sudo systemctl status xerj
$ journalctl -u xerj -f

Readiness vs liveness

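Readiness and liveness are separate probes, so an orchestrator does not kill a healthy node that is still mid-recovery (for example, replaying its WAL). The readiness probe reports ready only once recovery has finished: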

$ curl -sf http://127.0.0.1:8080/v1/health/ready && echo OK
{"status":"ready","wal_replayed":true,"indices":4,"uptime_s":17}
OK
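
A deploy script usually wants to block until the node reports ready. The following is a minimal sketch, assuming the readiness body keeps the compact JSON shape shown above. `is_ready` and `wait_ready` are hypothetical helper names, and the default of 60 attempts is arbitrary.

```shell
#!/bin/sh
# is_ready: read a readiness body on stdin; succeed only if it reports
# status "ready" with the WAL replayed (field names taken from the sample above).
# Plain string matching is whitespace-sensitive; use a JSON parser if the
# body may be pretty-printed.
is_ready() {
    body=$(cat)
    case $body in
        *'"status":"ready"'*) ;;
        *) return 1 ;;
    esac
    case $body in
        *'"wal_replayed":true'*) return 0 ;;
        *) return 1 ;;
    esac
}

# wait_ready URL [ATTEMPTS]: poll once per second until the node is ready
# or the attempts run out. A failed curl yields an empty body, which
# is_ready treats as "not ready".
wait_ready() {
    url=$1
    attempts=${2:-60}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -sf "$url" | is_ready; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

Typical use would be `wait_ready http://127.0.0.1:8080/v1/health/ready 120 || exit 1` before shifting traffic to the node.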

Config reload

Most tunables take effect on SIGHUP without a restart. A few (ports, TLS certs, data_dir, cluster.enabled) require a full restart; a reload that changes one of them is rejected with a clear error in the log.

$ sudo systemctl reload xerj    # sends SIGHUP
$ journalctl -u xerj -n 10 --no-pager
... config reloaded: merge.io_rate_mb_per_sec 100 → 250
... config reloaded: limits.max_concurrent_searches 64 → 128
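
To confirm from a script that a particular key was reloaded, the journal lines can be filtered. This is a sketch that assumes the `config reloaded: <key> <old> → <new>` line format shown above stays stable; `reloaded_keys` is a hypothetical helper name.

```shell
#!/bin/sh
# reloaded_keys: read journal lines on stdin and print the key of every
# "config reloaded: <key> ..." entry, one per line. Other lines are ignored.
reloaded_keys() {
    sed -n 's/.*config reloaded: \([^ ]*\) .*/\1/p'
}

# Example (not run here): fail a deploy step if the expected key is missing.
#   journalctl -u xerj -n 20 --no-pager | reloaded_keys \
#       | grep -qxF 'merge.io_rate_mb_per_sec' || echo "reload did not apply"
```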

Log levels

Logs go to stdout in a structured format (JSON in production, pretty in a TTY). Controlled by RUST_LOG:

# everything at info, HNSW at debug
$ RUST_LOG="info,xerj_vector=debug" xerj --config /etc/xerj/xerj.toml

# quiet mode
$ RUST_LOG="warn" xerj --config /etc/xerj/xerj.toml

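# temporarily bump the service's level; the environment is read at process start,
# so set-environment only reaches the service after a restart, not a reload
$ sudo systemctl set-environment RUST_LOG="info,xerj_query=debug"
$ sudo systemctl restart xerj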

Levels, in order of increasing verbosity: error · warn · info · debug · trace. info is the production default.
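
To keep a non-default level across restarts and reboots, one option is a systemd drop-in that sets RUST_LOG for the unit. The sketch below only renders the drop-in text; the helper name `render_log_dropin` and the file name `10-log.conf` are assumptions for illustration, not something XERJ ships.

```shell
#!/bin/sh
# render_log_dropin FILTER: print a systemd drop-in that sets RUST_LOG
# for the service to the given filter string.
render_log_dropin() {
    printf '[Service]\nEnvironment="RUST_LOG=%s"\n' "$1"
}

# Example (not run here): install the drop-in and restart to apply it.
#   sudo install -d /etc/systemd/system/xerj.service.d
#   render_log_dropin 'info,xerj_query=debug' \
#       | sudo tee /etc/systemd/system/xerj.service.d/10-log.conf >/dev/null
#   sudo systemctl daemon-reload
#   sudo systemctl restart xerj
```

Deleting the drop-in file, then running daemon-reload and restarting, returns the service to its default level.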

Metrics scrape

Prometheus endpoint at GET /v1/metrics. See Metrics for the full list. Example scrape config:

# prometheus.yml
scrape_configs:
  - job_name: xerj
    metrics_path: /v1/metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          - xerj-a.internal:8080
          - xerj-b.internal:8080
          - xerj-c.internal:8080
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/xerj.token
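
Before pointing Prometheus at a node, it can help to check by hand that the endpoint answers with the same token. A minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format; `count_samples` is a hypothetical helper name.

```shell
#!/bin/sh
# count_samples: read Prometheus text exposition on stdin and print the
# number of sample lines (everything that is neither blank nor a "#" comment).
count_samples() {
    grep -v '^#' | grep -c '[^[:space:]]'
}

# Example (not run here): fetch with the bearer token from the scrape config.
#   curl -sf -H "Authorization: Bearer $(cat /etc/prometheus/xerj.token)" \
#       http://xerj-a.internal:8080/v1/metrics | count_samples
```

A count of 0, or a curl failure, points at an auth or routing problem rather than a Prometheus misconfiguration.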

Capacity planning

Three numbers dominate sizing. Work out which one binds first for your workload and plan around it.

Disk: 2.8× compression on SIEM-shaped data is the working number. Reserve 30% headroom for merges: a merge of two 5 GiB segments temporarily needs 10 GiB free.
RAM: ~400 MB idle baseline. HNSW adds ~m × 8 × num_vectors bytes per index (≈ 128 bytes/vector at the default m=16). Memtables are bounded by flush_size_mb per index.
CPU: One turbo-ingest pass pins one core per turbo_parallel worker. Leave at least 2 cores free for the query path on busy nodes.
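
The disk and RAM rules of thumb above can be turned into quick arithmetic. This is a sketch with hypothetical helper names; `disk_gib` reads "30% headroom" as "stored data fills at most 70% of the volume", which is one interpretation, not a documented formula.

```shell
#!/bin/sh
# hnsw_bytes M NUM_VECTORS: approximate HNSW overhead for one index,
# m * 8 * num_vectors bytes.
hnsw_bytes() {
    echo $(( $1 * 8 * $2 ))
}

# disk_gib RAW_GIB: raw data divided by the 2.8x working compression ratio,
# then sized so the result occupies at most 70% of the volume (assumption).
disk_gib() {
    awk -v raw="$1" 'BEGIN { printf "%.0f\n", raw / 2.8 / 0.7 }'
}

# Examples:
#   hnsw_bytes 16 1000000   # prints 128000000 (about 128 MB for 1M vectors)
#   disk_gib 1000           # prints 510 (GiB of volume for 1000 GiB raw)
```

Add the ~400 MB baseline and the per-index memtable bound to the HNSW figure to get a rough RAM floor.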

File descriptors

Every open segment holds 3 file descriptors (data, sidx, ids), so a 1000-segment index uses ~3000 fds. The systemd unit sets LimitNOFILE=1048576, which is plenty; if you run without systemd, set ulimit -n 1048576 in the shell that starts the binary.
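
To compare the estimate with what a running node actually holds, the process's descriptor table can be counted on Linux. A sketch with hypothetical helper names; `open_fds` relies on /proc and needs to run as root or as the xerj user.

```shell
#!/bin/sh
# fd_estimate SEGMENTS: 3 descriptors (data, sidx, ids) per open segment.
fd_estimate() {
    echo $(( $1 * 3 ))
}

# open_fds PID: count descriptors currently held by a process (Linux only).
open_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Example (not run here): look up the service's main PID and count its fds.
#   open_fds "$(systemctl show -p MainPID --value xerj)"
```

The live count will sit somewhat above the segment estimate, since sockets, the WAL, and log handles also consume descriptors.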

Source · engine/crates/server/src/main.rs · engine/crates/common/src/metrics.rs