Clustering
XERJ ships an embedded Raft implementation — no etcd, no ZooKeeper, no Consul. One binary is still one binary; cluster mode is a config switch, not a separate process. Raft replicates cluster metadata only: index schemas, shard assignments, and the node roster. Actual index data lives under the storage layer and is replicated per-shard outside the Raft group.
What you get
- Leader election — exactly one leader per term, automatic failover within an election timeout when a leader is lost.
- Log replication — every metadata change (create index, reassign shard, add node) is proposed by the leader and replicated to a quorum before it's committed.
- Log safety — commits are durable across crashes. A node rejoining the cluster catches up from the leader's log without losing data.
- Shard routing — writes are forwarded to the node holding the primary for that shard. Reads can hit any active replica.
- Region split / merge — for range-partitioned indices, regions split when they exceed a byte threshold and merge when neighbours drop below it.
Node lifecycle
Every node has a NodeState in the replicated metadata:
Bootstrapping a 3-node cluster
Start three nodes, each with its own data dir, and point them at each other. On first startup they hold a Raft election and one becomes leader:
# /etc/xerj/node-a.toml (run on 10.0.0.11) [server] data_dir = "/var/lib/xerj" bind_address = "10.0.0.11" [cluster] enabled = true port = 9300 # intra-cluster Raft + search port peers = [ "a=10.0.0.11:9300", # this node "b=10.0.0.12:9300", "c=10.0.0.13:9300", ] tick_ms = 50 # Raft tick interval
# /etc/xerj/node-b.toml (run on 10.0.0.12) [server] data_dir = "/var/lib/xerj" bind_address = "10.0.0.12" [cluster] enabled = true port = 9300 peers = [ "a=10.0.0.11:9300", "b=10.0.0.12:9300", # this node "c=10.0.0.13:9300", ] tick_ms = 50
The peers list uses "<node_id>=<host>:<port>". The current node's id is inferred from the entry whose address matches its own bind_address:port. Start all three boxes:
$ xerj --config /etc/xerj/node-a.toml # on 10.0.0.11 $ xerj --config /etc/xerj/node-b.toml # on 10.0.0.12 $ xerj --config /etc/xerj/node-c.toml # on 10.0.0.13
Within a second the cluster elects a leader. You can check the status on any node:
$ curl -sH "Authorization: ApiKey $XERJ_API_KEY" \
http://10.0.0.11:8080/v1/cluster/health | jq
{
"state": "active",
"leader": "b",
"term": 3,
"nodes": [
{ "id": "a", "address": "10.0.0.11:8100", "state": "active" },
{ "id": "b", "address": "10.0.0.12:8100", "state": "active" },
{ "id": "c", "address": "10.0.0.13:8100", "state": "active" }
],
"indices": 4,
"shards": 24
}
Creating a sharded index
Specify shards at create time. Shards are assigned to nodes round-robin; the leader replicates the assignment through Raft so every node sees the same mapping:
$ curl -sXPUT -H "Authorization: ApiKey $XERJ_API_KEY" \
-H "Content-Type: application/json" \
http://10.0.0.11:8080/v1/indices/events \
-d '{
"shards": 6,
"schema": {
"@timestamp": "date",
"host": "keyword",
"message": "text"
}
}'
Draining a node for maintenance
Before rebooting a node, mark it draining so the cluster migrates its shards to peers first:
$ curl -sXPOST -H "Authorization: ApiKey $XERJ_API_KEY" \
http://10.0.0.11:8080/v1/cluster/nodes/b/drain
{ "state": "draining", "shards_remaining": 8 }
# ... wait for shards_remaining to hit 0 ...
$ systemctl restart xerj # on node b
# After the reboot, mark it active again:
$ curl -sXPOST -H "Authorization: ApiKey $XERJ_API_KEY" \
http://10.0.0.11:8080/v1/cluster/nodes/b/activate
Regions — range partitioning
For range-partitioned indices (typically logs and time-series), XERJ splits the key space into regions. Each region is owned by exactly one node. When a region crosses split_bytes_threshold it splits into two; when two neighbouring regions together fall below merge_bytes_threshold they merge back. The region manager rebalances regions across nodes when load skew exceeds the configured tolerance.
Regions are transparent to clients — writes and reads use the normal index name; the cluster router looks up the correct region and forwards the request.
Failure modes
down with POST /v1/cluster/nodes/:id/remove. The router reassigns its shards to peers.When not to cluster
If one node handles your load, don't cluster. One box avoids network round-trips, needs no quorum, and is trivial to back up. Clustering earns its complexity when you need more CPU or disk than a single box can give you, when you need HA without a hot standby pair, or when you're geo-distributed. The default config ships with [cluster] enabled = false.
Source · engine/crates/cluster/src/node.rs · engine/crates/cluster/src/metadata.rs · engine/crates/cluster/src/regions.rs