Gateway High Availability
Stratum supports a redundant pair of Gateway nodes in a priority-based active/standby arrangement. One node is the active VIP owner; the other stands by, monitoring the active node over a dedicated UDP heartbeat channel. When the active Gateway fails, the standby promotes itself, assumes the VIP, and announces the new owner — without manual intervention.
How it works
The two Gateways are members of the same cluster and exchange heartbeats over UDP port 7074 at a short interval (default 100 ms). Each datagram is authenticated with AES-GCM under a shared key and carries a monotonically increasing sequence number, so forged or replayed heartbeats are rejected.
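The replay check can be pictured with a minimal sketch. It assumes the datagram has already passed AES-GCM authentication and looks only at the sequence number; the class name ReplayGuard and the strict "newer than the last accepted value" rule are illustrative assumptions, not Stratum's actual implementation.

```python
class ReplayGuard:
    """Accept a heartbeat only if its sequence number is strictly newer.

    Hypothetical helper: assumes the datagram was already authenticated.
    """

    def __init__(self) -> None:
        self.last_seq = -1  # nothing accepted yet

    def accept(self, seq: int) -> bool:
        if seq <= self.last_seq:
            return False  # replayed or stale datagram, reject it
        self.last_seq = seq
        return True


guard = ReplayGuard()
# Sequence 2 arrives twice (a replay) and 4 arrives after 5 (stale).
results = [guard.accept(s) for s in (1, 2, 2, 5, 4)]
print(results)  # [True, True, False, True, False]
```

A gap in the sequence (2 to 5 above) is accepted, because UDP may drop datagrams; only values that do not advance the counter are rejected.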
VIP ownership is decided by priority: while both peers are alive, the higher-priority node owns the VIP and the lower-priority node stays in standby. A standby that stops hearing its partner promotes itself regardless of priority, so the VIP is never orphaned, and yields the VIP back to a higher-priority peer once that peer returns. Ties between equal priorities are broken deterministically by node identity, so both nodes never remain active under normal operation.
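The ownership rule can be expressed as a small decision function. This is a sketch under stated assumptions: the names Node and owns_vip are hypothetical, and the tie-break direction (the lexicographically smaller node ID wins) is an assumption, since the text only says the tie-break is deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    priority: int


def owns_vip(local: Node, peer: Node, peer_alive: bool) -> bool:
    """Return True if `local` should hold the VIP."""
    if not peer_alive:
        return True  # never leave the VIP orphaned
    if local.priority != peer.priority:
        return local.priority > peer.priority
    # Assumed tie-break: smaller node ID wins. Any fixed total order works.
    return local.node_id < peer.node_id


gw01 = Node("gw-01", priority=200)
gw02 = Node("gw-02", priority=100)

print(owns_vip(gw02, gw01, peer_alive=False))  # True: standby promotes
print(owns_vip(gw02, gw01, peer_alive=True))   # False: yields to gw-01
print(owns_vip(gw01, gw02, peer_alive=True))   # True: gw-01 preempts
```

Because both nodes evaluate the same rule with the roles swapped, exactly one of them returns True whenever each can see the other.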
On promotion, the new owner installs the VIP on its interface (ip addr add) and sends a gratuitous ARP so neighbours update their ARP caches to the new owner. On demotion it removes the VIP with ip addr del (a targeted delete, not an address flush). Failover detection takes at most interval × missed-threshold; with the defaults that is 100 ms × 3 = 300 ms.
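For orientation, the sketch below builds equivalent commands that an operator could run by hand to mimic promotion and demotion. The VIP 10.0.0.100/32 and the interface eth0 are placeholders, and the use of iputils arping for the gratuitous ARP is an assumption; the text does not say which mechanism Stratum uses internally. The functions only build argument lists and do not execute anything.

```python
import shlex


def promote_commands(vip_cidr: str, iface: str) -> list[list[str]]:
    """Commands roughly equivalent to taking over the VIP (illustrative)."""
    vip = vip_cidr.split("/")[0]
    return [
        ["ip", "addr", "add", vip_cidr, "dev", iface],
        # Assumption: iputils arping in unsolicited (gratuitous) ARP mode.
        ["arping", "-U", "-c", "3", "-I", iface, vip],
    ]


def demote_commands(vip_cidr: str, iface: str) -> list[list[str]]:
    """Targeted delete of the VIP only; other addresses are untouched."""
    return [["ip", "addr", "del", vip_cidr, "dev", iface]]


# Placeholder VIP and interface, not values from a real deployment.
for cmd in promote_commands("10.0.0.100/32", "eth0"):
    print(shlex.join(cmd))
for cmd in demote_commands("10.0.0.100/32", "eth0"):
    print(shlex.join(cmd))
```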
Forming an HA pair
An HA pair is two Gateway nodes in the same cluster configured to monitor each other. Both nodes must already be joined to the cluster.
Set the HA peer on each Gateway:
# On gw-01
sudo cenvero-str-ctl ha set-peer --peer gw-02 --peer-ip 10.0.0.13
# On gw-02
sudo cenvero-str-ctl ha set-peer --peer gw-01 --peer-ip 10.0.0.12
Check HA status:
cenvero-str-ctl ha status
LOCAL gw-01 (10.0.0.12) state: active
PEER gw-02 (10.0.0.13) state: standby
HEARTBEAT 100ms last seen 47ms ago
The local node's state is one of active (owns the VIP), standby (monitoring, ready to promote), or solo (no peer configured).
Heartbeat configuration
| Parameter | Default | Description |
|---|---|---|
| --interval | 100ms | Time between heartbeat probes |
| --missed-threshold | 3 | Consecutive missed probes before declaring peer dead |
| --recovery-threshold | 3 | Consecutive received probes before declaring peer recovered |
sudo cenvero-str-ctl ha configure \
--interval 100ms \
--missed-threshold 3 \
--recovery-threshold 3
With the defaults, failover detection takes at most 100 ms × 3 = 300 ms. A shorter interval detects a failure faster at the cost of more CPU overhead.
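How the two thresholds interact can be seen in a minimal counter sketch. PeerMonitor and max_detection_ms are hypothetical names, and the sketch assumes one call per heartbeat interval with a flag saying whether a probe arrived; the real daemon's internals are not described here.

```python
class PeerMonitor:
    """Track consecutive missed/received probes (illustrative only)."""

    def __init__(self, missed_threshold: int = 3, recovery_threshold: int = 3):
        self.missed_threshold = missed_threshold
        self.recovery_threshold = recovery_threshold
        self.peer_alive = True
        self._missed = 0
        self._received = 0

    def on_interval(self, heartbeat_received: bool) -> bool:
        """Call once per interval; returns the current view of the peer."""
        if heartbeat_received:
            self._missed = 0
            self._received += 1
            if not self.peer_alive and self._received >= self.recovery_threshold:
                self.peer_alive = True
        else:
            self._received = 0
            self._missed += 1
            if self.peer_alive and self._missed >= self.missed_threshold:
                self.peer_alive = False
        return self.peer_alive


def max_detection_ms(interval_ms: int, missed_threshold: int) -> int:
    return interval_ms * missed_threshold


monitor = PeerMonitor()
probes = (True, False, False, False, True, True, True)
trace = [monitor.on_interval(ok) for ok in probes]
print(trace)  # [True, True, True, False, False, False, True]
print(max_detection_ms(100, 3))  # 300
```

The peer is declared dead on the third consecutive miss and recovered only on the third consecutive probe after that, which keeps a single stray datagram from flapping the state.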
VIP and route takeover
When the standby misses missed-threshold consecutive heartbeats, it treats the active node as failed and:
- Promotes itself to active and assumes the VIP on its own interface.
- Sends a gratuitous ARP for the VIP so upstream neighbours redirect traffic to it.
- Marks the partner as unavailable in its local state.
VIPs on the surviving node begin accepting new connections immediately. Existing connections that were being handled by the failed node are dropped (the client must reconnect); this is inherent to a stateless L4 failover.
When the failed node returns, arbitration over the heartbeat converges the pair back to a single owner: if the returning node has higher priority it preempts and reclaims the VIP; otherwise it stays in standby.
Split-brain on a full partition
A 2-node HA pair has no third arbiter, witness, or quorum. If the two Gateways stop hearing each other's heartbeats simultaneously — for example a management-network partition where each side is otherwise up — both nodes will promote and assume the VIP, because each believes its partner is dead. This residual split-brain window is an accepted limitation of a 2-node pair without a third arbiter.
When the partition heals, the heartbeat resumes and arbitration converges the pair back to a single owner: the higher-priority node keeps the VIP and the other releases it. If you need to eliminate the split-brain window entirely, front the pair with a third arbiter at the network layer (the platform does not provide one for the 2-node case).
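The partition-and-heal sequence can be walked through with a toy model. It assumes two nodes with distinct priorities (the values 200 and 100 match the status example in this page) and collapses heartbeat timing into a single partitioned flag; active_set is a hypothetical helper.

```python
def active_set(priorities: dict[str, int], partitioned: bool) -> set[str]:
    """Nodes that consider themselves VIP owner (distinct priorities assumed)."""
    if partitioned:
        # Each side hears nothing, assumes its partner is dead, and promotes.
        return set(priorities)
    # Heartbeats flow: arbitration leaves only the higher-priority node active.
    return {max(priorities, key=lambda node: priorities[node])}


priorities = {"gw-01": 200, "gw-02": 100}
timeline = [("healthy", False), ("partitioned", True), ("healed", False)]

owners = {phase: sorted(active_set(priorities, cut)) for phase, cut in timeline}
for phase, nodes in owners.items():
    print(f"{phase:12s} active: {nodes}")
# healthy      active: ['gw-01']
# partitioned  active: ['gw-01', 'gw-02']
# healed       active: ['gw-01']
```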
Monitoring
# On either Gateway
cenvero-str-ctl ha status --verbose
LOCAL gw-01 (10.0.0.12) state: active
PEER gw-02 (10.0.0.13) state: standby
HEARTBEAT 100ms last seen 62ms ago seq 184213
LAST-FAILOVER none
PRIORITY local 200 peer 100
VIP-OWNERSHIP local (active)
The LAST-FAILOVER field records the time and direction of the most recent takeover event, which is useful for post-incident review.
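For scripted monitoring, the verbose output can be parsed into a structure. The sketch below is written against the sample output shown above and assumes that layout is stable, which this page does not guarantee; parse_status and heartbeat_is_stale are hypothetical helpers, not part of cenvero-str-ctl.

```python
import re

SAMPLE = """\
LOCAL gw-01 (10.0.0.12) state: active
PEER gw-02 (10.0.0.13) state: standby
HEARTBEAT 100ms last seen 62ms ago seq 184213
LAST-FAILOVER none
PRIORITY local 200 peer 100
VIP-OWNERSHIP local (active)
"""

NODE_RE = re.compile(r"^(LOCAL|PEER) (\S+) \((\S+)\) state: (\w+)$")
HB_RE = re.compile(r"^HEARTBEAT (\d+)ms last seen (\d+)ms ago(?: seq (\d+))?$")


def parse_status(text: str) -> dict:
    """Extract node states and heartbeat timing from the status text."""
    status: dict = {}
    for line in text.splitlines():
        line = line.strip()
        if m := NODE_RE.match(line):
            role, name, ip, state = m.groups()
            status[role.lower()] = {"name": name, "ip": ip, "state": state}
        elif m := HB_RE.match(line):
            interval, age, seq = m.groups()
            status["heartbeat"] = {
                "interval_ms": int(interval),
                "last_seen_ms": int(age),
                "seq": int(seq) if seq else None,
            }
    return status


def heartbeat_is_stale(status: dict, missed_threshold: int = 3) -> bool:
    """True if the last heartbeat is older than interval x missed-threshold."""
    hb = status["heartbeat"]
    return hb["last_seen_ms"] > hb["interval_ms"] * missed_threshold


status = parse_status(SAMPLE)
print(status["local"]["state"], status["peer"]["state"])  # active standby
print(heartbeat_is_stale(status))  # False
```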
See also
- Clustering Overview — Raft membership and the management network ports.
- BGP Edge Routing — configuring the BGP peer sessions that HA relies on.
- Moving Workloads Between Nodes — moving an endpoint to another node before maintenance.
- Configuration — port reference and the management network.