Gateway High Availability

Stratum supports a redundant pair of Gateway nodes in a priority-based active/standby arrangement. One node is the active VIP owner; the other stands by, monitoring the active node over a dedicated, encrypted health channel. When the active Gateway fails, the standby promotes itself, assumes the VIP, and announces the new owner — without manual intervention.

How it works

The two Gateways are members of the same cluster and exchange a heartbeat every 100 ms. Each heartbeat is authenticated and replay-protected, so a forged or replayed message is rejected.
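To make "authenticated and replay-protected" concrete, here is a minimal sketch of one common way to get both properties: an HMAC tag over the message plus a strictly increasing sequence number. This is an illustration only, not Stratum's wire format. The names `seal` and `Receiver`, the use of HMAC-SHA256, and the 8-byte sequence prefix are all assumptions, and the sketch does not model the encryption of the health channel.

```python
import hashlib
import hmac
import struct

TAG_LEN = 32  # size of a SHA-256 digest in bytes


def seal(key: bytes, seq: int, payload: bytes) -> bytes:
    """Prefix the payload with a sequence number and append an HMAC tag."""
    body = struct.pack(">Q", seq) + payload
    return body + hmac.new(key, body, hashlib.sha256).digest()


class Receiver:
    """Accepts only correctly tagged messages with a strictly increasing sequence."""

    def __init__(self, key: bytes) -> None:
        self.key = key
        self.last_seq = -1  # nothing accepted yet

    def accept(self, message: bytes) -> bool:
        if len(message) < 8 + TAG_LEN:
            return False  # too short to hold a sequence number and a tag
        body, tag = message[:-TAG_LEN], message[-TAG_LEN:]
        expected = hmac.new(self.key, body, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return False  # forged or corrupted: tag does not match the shared key
        (seq,) = struct.unpack(">Q", body[:8])
        if seq <= self.last_seq:
            return False  # replayed or stale: sequence number already seen
        self.last_seq = seq
        return True
```

In this sketch a message sealed with the wrong key fails the tag check, and a message that is captured and sent again fails the sequence check.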

VIP ownership is decided by priority: while both peers are alive, the higher-priority node owns the VIP and the lower-priority node stays in standby. A standby that stops hearing its partner promotes itself regardless of priority, so the VIP is never orphaned, and yields the VIP back to a higher-priority peer once that peer returns. Ties between equal priorities are broken deterministically by node identity, so both nodes can never stay active under normal operation.
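The ownership rule above can be sketched as a single pure function. The `Node` type and `should_own_vip` are hypothetical names for illustration. The text only says that ties are broken deterministically by node identity, so the direction of the tie-break used here (lower identifier wins) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str   # hypothetical stable identity used for tie-breaking
    priority: int


def should_own_vip(local: Node, peer: Node, peer_alive: bool) -> bool:
    """Return True if the local node should hold the VIP."""
    if not peer_alive:
        # A silent partner means we promote regardless of priority.
        return True
    if local.priority != peer.priority:
        return local.priority > peer.priority
    # Equal priorities: assumed tie-break, the lower node_id wins.
    return local.node_id < peer.node_id
```

Because both nodes evaluate the same rule with the roles swapped, exactly one of them returns True whenever each can hear the other.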

On promotion, the new owner brings the VIP up on its own interface and sends a gratuitous ARP, so neighbours update their ARP caches and start sending to it within milliseconds rather than waiting for an entry to age out. On demotion it removes only the VIP; every other address on that interface is left untouched, so a node that loses the VIP keeps its own addressing. Failover detection takes at most interval × missed-threshold, which is 100 ms × 3 = 300 ms.

Forming an HA pair

An HA pair is two Gateway nodes in the same cluster configured to monitor each other. Both nodes must already be joined to the cluster.

HA settings are provisioned from the panel, not from the node. Because the pair's shared heartbeat key is a secret that both nodes must agree on, HA is configured in the management panel and delivered to each node inside its signed configuration. The node applies the settings on its next configuration sync. No local command sets them: ha set-peer and ha configure exist only to report this and direct you to the panel.

The four settings the panel delivers are:

Config key	Description
`gateway_peer_addr`	The partner Gateway's address on the management network
`gateway_vip`	The virtual IP the active node owns
`gateway_priority`	This node's priority; the higher-priority peer owns the VIP while both are alive
`gateway_shared_key`	The shared secret authenticating the heartbeat; identical on both nodes

Once both nodes have synced their configuration, confirm the pair from either node:

cenvero-str-ctl ha status

{
  "data": {
    "local_state": "active",
    "peer_state": "standby",
    "peer_addr": "10.0.0.13",
    "vip": "10.0.0.100",
    "last_heartbeat": "2026-07-24T18:10:40Z",
    "active_conns": 0,
    "uptime": "2h44m47s"
  },
  "status": "ok"
}

local_state and peer_state are each one of active (owns the VIP), standby (monitoring, ready to promote), or solo (no peer configured). A peer_state of solo together with an empty peer_addr means this node has not received HA settings yet.
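As a rough illustration of how the status output can be consumed by a script, the sketch below parses the sample JSON shown above and applies the "not provisioned yet" rule. The function name `summarize` and the wording of its return strings are hypothetical; only the field names come from the sample output.

```python
import json

# Sample copied from the ha status output shown above.
SAMPLE = """
{
  "data": {
    "local_state": "active",
    "peer_state": "standby",
    "peer_addr": "10.0.0.13",
    "vip": "10.0.0.100",
    "last_heartbeat": "2026-07-24T18:10:40Z",
    "active_conns": 0,
    "uptime": "2h44m47s"
  },
  "status": "ok"
}
"""


def summarize(raw: str) -> str:
    """Turn one ha status JSON document into a one-line summary."""
    data = json.loads(raw)["data"]
    # solo plus an empty peer address: HA settings have not arrived yet.
    if data["peer_state"] == "solo" and not data["peer_addr"]:
        return "HA settings not received yet"
    return (
        f"local={data['local_state']} "
        f"peer={data['peer_state']} vip={data['vip']}"
    )


print(summarize(SAMPLE))
```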

Heartbeat timing

The heartbeat interval and the failover threshold are fixed and not operator-tunable:

Parameter	Value	Description
Heartbeat interval	`100ms`	Time between heartbeat probes
Failover threshold	`3`	Consecutive missed probes before declaring the peer dead

Failover detection therefore takes at most 100ms × 3 = 300 ms.
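The arithmetic can be seen in a small missed-probe counter. This is a sketch of the general technique, not the Gateway's implementation; `PeerMonitor` and `tick` are hypothetical names, and the model assumes exactly one check per heartbeat interval.

```python
INTERVAL_MS = 100       # heartbeat interval from the table above
MISSED_THRESHOLD = 3    # consecutive missed probes before the peer is declared dead


class PeerMonitor:
    """Counts consecutive missed heartbeats, one check per interval."""

    def __init__(self) -> None:
        self.missed = 0

    def tick(self, heartbeat_received: bool) -> bool:
        """Record one interval; return True once the peer is considered dead."""
        if heartbeat_received:
            self.missed = 0  # any heartbeat resets the count
        else:
            self.missed += 1
        return self.missed >= MISSED_THRESHOLD


# Upper bound on detection time in this model, in milliseconds.
WORST_CASE_MS = INTERVAL_MS * MISSED_THRESHOLD
```

A single heartbeat that gets through resets the counter, so only three misses in a row trigger a failover.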

VIP and route takeover

When the standby detects that the active node has failed (three consecutive missed heartbeats), it:

1. Promotes itself to active and assumes the VIP on its own interface.
2. Sends a gratuitous ARP for the VIP so upstream neighbours redirect traffic to it.
3. Marks the partner as unavailable in its local state.
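The order of the takeover steps above can be sketched as follows. The class `StandbyNode` and the two injected hooks `assign_vip` and `send_garp` are hypothetical stand-ins for whatever the Gateway does at the operating-system level; the sketch only shows the sequencing.

```python
from typing import Callable


class StandbyNode:
    """Models the standby side of the pair during a takeover."""

    def __init__(
        self,
        vip: str,
        assign_vip: Callable[[str], None],   # hypothetical: adds the VIP to the interface
        send_garp: Callable[[str], None],    # hypothetical: emits a gratuitous ARP
    ) -> None:
        self.vip = vip
        self.assign_vip = assign_vip
        self.send_garp = send_garp
        self.state = "standby"
        self.peer_available = True

    def on_peer_dead(self) -> None:
        """Run the takeover steps in the documented order."""
        self.state = "active"
        self.assign_vip(self.vip)    # step 1: own the VIP locally
        self.send_garp(self.vip)     # step 2: tell neighbours where it lives now
        self.peer_available = False  # step 3: remember the partner is gone
```

Assigning the address before announcing it matters in this sketch: a neighbour that reacts to the gratuitous ARP immediately should find the VIP already configured.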

The VIP on the surviving node begins accepting new connections immediately. Connections that the failed node was handling are dropped and the client must reconnect; this is inherent to a stateless L4 failover.

When the failed node returns, arbitration over the heartbeat converges the pair back to a single owner: if the returning node has higher priority it preempts and reclaims the VIP; otherwise it stays in standby.

Split-brain on a full partition

A 2-node HA pair has no third arbiter, witness, or quorum. If the two Gateways stop hearing each other's heartbeats simultaneously — for example a management-network partition where each side is otherwise up — both nodes will promote and assume the VIP, because each believes its partner is dead. This residual split-brain window is an accepted limitation of a 2-node pair without a third arbiter.

When the partition heals, the heartbeat resumes and arbitration converges the pair back to a single owner: the higher-priority node keeps the VIP and the other releases it. If you need to eliminate the split-brain window entirely, front the pair with a third arbiter at the network layer (the platform does not provide one for the 2-node case).
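The partition and heal behaviour described in this section can be walked through with a toy model. The `Gateway` type and `settle` function are hypothetical, the model skips the detection delay, and it assumes the two priorities differ so that no tie-break is needed.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    priority: int
    owns_vip: bool = False


def settle(a: Gateway, b: Gateway, link_up: bool) -> None:
    """Update VIP ownership once each node has reacted to the link state."""
    if not link_up:
        # Neither side hears the other, so each believes its partner is dead.
        a.owns_vip = True
        b.owns_vip = True
        return
    # Heartbeats flow again: the higher-priority node keeps the VIP.
    winner = a if a.priority > b.priority else b
    a.owns_vip = winner is a
    b.owns_vip = winner is b


gw_a = Gateway("gw-a", priority=200, owns_vip=True)
gw_b = Gateway("gw-b", priority=100)

settle(gw_a, gw_b, link_up=False)   # partition: both nodes now hold the VIP
split_brain = gw_a.owns_vip and gw_b.owns_vip

settle(gw_a, gw_b, link_up=True)    # heal: a single owner remains
```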

Monitoring

# On either Gateway
cenvero-str-ctl ha status

The two fields worth alerting on are local_state and last_heartbeat. If both nodes of the pair report a local_state of active at the same time, that indicates the partition case described above. last_heartbeat should stay within a few hundred milliseconds of the current time while the peer is healthy. active_conns reports the connections the local node is currently handling, and uptime is how long its HA manager has been running.
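A minimal sketch of such an alert check, given the data object from ha status on each node, might look like the following. The function names, the alert strings, and the one-second staleness limit are assumptions; the limit is deliberately loose because the sample timestamp above has only second resolution.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp such as 2026-07-24T18:10:40Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_pair(
    a: dict,
    b: dict,
    now: datetime,
    max_age: timedelta = timedelta(seconds=1),  # assumed staleness limit
) -> List[str]:
    """Return alert messages for a pair of ha status data objects."""
    found = []
    if a["local_state"] == "active" and b["local_state"] == "active":
        found.append("both nodes report active")
    for name, status in (("a", a), ("b", b)):
        if now - parse_ts(status["last_heartbeat"]) > max_age:
            found.append(f"stale heartbeat on node {name}")
    return found


now = datetime(2026, 7, 24, 18, 10, 40, tzinfo=timezone.utc)
node_a = {"local_state": "active", "last_heartbeat": "2026-07-24T18:10:40Z"}
node_b = {"local_state": "active", "last_heartbeat": "2026-07-24T18:10:40Z"}
alerts = check_pair(node_a, node_b, now)
```

A stale heartbeat is expected on the survivor after a real failover, so in practice that alert is most useful when the peer is believed to be healthy.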