Node Incident Playbooks
These playbooks target sentry and full-node operators supporting Sei validators. Each scenario lists detection signals, immediate actions, and verification steps.
1. P2P Partition
Symptoms: Height stagnates, logs show repeated `dial timeout` or missing peers.
Actions:
- Check network connectivity (`ping`/`traceroute`) to known peers.
- Validate `persistent_peers` and `seeds` in `config.toml`.
- Restart Tendermint process to re-open connections:

      systemctl restart seid

- If behind firewalls, ensure inbound/outbound ports (`26656`) are open.
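The first two actions can be combined into one pass over the configured peer list. The following is a minimal sketch, not an official tool: `list_peers` and `probe` are hypothetical helper names, the config path in the usage comment is an assumption, and the parser assumes entries of the form `nodeid@host:port` separated by commas. It accepts both `persistent_peers` and `persistent-peers` spellings, since the key style can differ between Tendermint versions.

```bash
#!/usr/bin/env bash
# Sketch: list the peers configured in config.toml and probe each over TCP.
# Helper names are illustrative; adjust the key names to match your config.

# list_peers CONFIG: print one host:port per line from the peer and seed keys.
list_peers() {
  grep -E '^[[:space:]]*(persistent[_-]peers|seeds)[[:space:]]*=' "$1" \
    | sed -E 's/^[^=]*=[[:space:]]*//; s/"//g' \
    | tr ',' '\n' \
    | sed -E 's/^[[:space:]]*[^@]*@//; s/[[:space:]]+$//' \
    | grep -E '^[^[:space:]]+:[0-9]+$' || true
}

# probe HOST:PORT: succeed if a TCP connection opens within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and the coreutils `timeout` command.
probe() {
  local host=${1%:*} port=${1##*:}
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example usage (config path is an assumption):
# list_peers ~/.sei/config/config.toml | while read -r p; do
#   if probe "$p"; then echo "ok   $p"; else echo "FAIL $p"; fi
# done
```

A peer that fails the probe from the node but succeeds from another host points at local firewall rules rather than a dead peer.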
Verify: `seid status` shows increasing height and peer count > 0.
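The verification can be scripted by sampling the height twice and reading the peer count. This is a sketch under assumptions: `extract_field` and `height_advanced` are hypothetical helpers, the status JSON is assumed to contain a `latest_block_height` field, and the peer count is assumed to be available as `n_peers` from the node's RPC `net_info` endpoint on the default local port `26657`.

```bash
#!/usr/bin/env bash
# Sketch: confirm the node is advancing and has peers after a restart.

# extract_field NAME: read JSON on stdin, print the first integer value of NAME.
# Handles both quoted ("123") and bare (123) integers.
extract_field() {
  grep -oE "\"$1\"[[:space:]]*:[[:space:]]*\"?[0-9]+" | grep -oE '[0-9]+$' | head -n1
}

# height_advanced PREV CUR: succeed if CUR is strictly greater than PREV.
height_advanced() { [ "$2" -gt "$1" ]; }

# Example usage (RPC address is an assumption):
# h1=$(seid status 2>&1 | extract_field latest_block_height)
# sleep 30
# h2=$(seid status 2>&1 | extract_field latest_block_height)
# height_advanced "$h1" "$h2" && echo "height advancing" || echo "height stalled"
# peers=$(curl -s http://localhost:26657/net_info | extract_field n_peers)
# [ "${peers:-0}" -gt 0 ] && echo "peers: $peers" || echo "no peers"
```

The 30-second gap is arbitrary; it only needs to be longer than a few block intervals so that a healthy node shows a difference.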
2. State Sync Failure
Symptoms: State sync stalls or crashes with `snapshot not found`.
Actions:
- Confirm snapshot providers are reachable.
- Clear data directory and attempt re-sync:

      systemctl stop seid
      rm -rf ~/.sei/data
      systemctl start seid

- If issue persists, switch to trusted snapshot provider or use backup snapshot.
Verify: Node progresses past the snapshot height and enters normal sync mode.
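Removing the whole data directory also removes any signing state stored in it. The sketch below wraps the wipe in a hypothetical `reset_data_dir` helper that keeps `data/priv_validator_state.json` if it exists. That file location is the usual Tendermint layout and is an assumption here; on a pure sentry or full node the file carries no meaningful state, so preserving it is simply harmless.

```bash
#!/usr/bin/env bash
# Sketch: clear the data directory while preserving the signing-state file.

# reset_data_dir HOME_DIR: wipe HOME_DIR/data, keep priv_validator_state.json.
reset_data_dir() {
  local home=$1 state="$1/data/priv_validator_state.json" keep=""
  [ -d "$home/data" ] || { echo "no data dir under $home" >&2; return 1; }
  if [ -f "$state" ]; then
    keep=$(mktemp)
    cp -p "$state" "$keep"          # stash the file outside the data dir
  fi
  rm -rf "${home:?}/data"           # :? aborts if HOME_DIR is empty
  mkdir -p "$home/data"
  if [ -n "$keep" ]; then
    mv "$keep" "$state"             # restore it into the fresh data dir
  fi
}

# Example usage:
# systemctl stop seid
# reset_data_dir "$HOME/.sei"
# systemctl start seid
```

Stop the service before calling the helper; wiping files under a running node can leave the databases in an inconsistent state.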
3. Snapshot Corruption
Symptoms: Restored snapshot fails to start or panics on boot.
Actions:
- Validate checksum of the snapshot archive.
- Re-extract snapshot to a clean directory.
- Consider using SeiDB’s built-in pruning to regenerate snapshot post-migration.
Verify: Node completes boot sequence without panics.
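The checksum and re-extraction steps can be sketched as two small helpers. The names `verify_sha256` and `extract_clean` are hypothetical, the archive is assumed to be a tar file that the local `tar` can read directly, and the expected digest is assumed to be a SHA-256 value published by the snapshot provider. Archives compressed with a format `tar` does not detect natively (for example lz4) need to be piped through the matching decompressor first.

```bash
#!/usr/bin/env bash
# Sketch: verify a snapshot archive, then unpack it into an empty directory.

# sha256_of FILE: print the SHA-256 digest, using whichever tool is installed.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# verify_sha256 FILE EXPECTED: succeed only if the digest matches exactly.
verify_sha256() { [ "$(sha256_of "$1")" = "$2" ]; }

# extract_clean ARCHIVE DEST: unpack into DEST, refusing if DEST has content.
extract_clean() {
  local archive=$1 dest=$2
  mkdir -p "$dest"
  if [ -n "$(ls -A "$dest")" ]; then
    echo "refusing to extract: $dest is not empty" >&2
    return 1
  fi
  tar -xf "$archive" -C "$dest"
}

# Example usage (file name and digest variable are placeholders):
# verify_sha256 snapshot.tar "$EXPECTED_SHA256" && extract_clean snapshot.tar /var/tmp/sei-restore
```

Refusing to unpack into a non-empty directory avoids mixing files from a previous, possibly partial, extraction with the new one.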
4. High Disk Usage
Symptoms: Disk usage exceeds alert thresholds; pruning ineffective.
Actions:
- Run `seidadmin prune` (if available) or enable state-store pruning in `app.toml`.
- Rotate logs frequently; implement logrotate.
- Offload old snapshots to external storage.
Verify: Disk usage returns to acceptable levels; monitoring alerts clear.
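Before and after cleanup, it helps to measure usage the same way the alert does. The sketch below uses hypothetical helpers (`disk_usage_pct`, `over_threshold`, `largest_dirs`) built on POSIX `df -P` and `du`; the 85 percent threshold and the `~/.sei` path in the usage comment are assumptions, not values from this playbook.

```bash
#!/usr/bin/env bash
# Sketch: report filesystem usage and the largest directories under a path.

# disk_usage_pct PATH: print the used percentage (integer) of PATH's filesystem.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# over_threshold PCT LIMIT: succeed if PCT is at or above LIMIT.
over_threshold() { [ "$1" -ge "$2" ]; }

# largest_dirs PATH [N]: list the N largest entries under PATH, sizes in KiB.
largest_dirs() {
  du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}

# Example usage (threshold and path are assumptions):
# pct=$(disk_usage_pct "$HOME/.sei")
# if over_threshold "$pct" 85; then
#   echo "disk at ${pct}%, largest entries:"
#   largest_dirs "$HOME/.sei/data"
# fi
```

Comparing the `largest_dirs` output before and after pruning shows whether the space is held by chain data, snapshots, or something else entirely.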
Quick Reference
Error | Cause | Fix |
---|---|---|
P2P partition | Peers unreachable or misconfigured. | Restart node, verify peer list, ensure ports open. |
State sync stuck | Snapshot provider issue. | Purge data, retry with alternate provider. |
Snapshot corrupted | Checksum mismatch or incomplete extraction. | Re-download snapshot, verify integrity. |
Disk usage spike | Pruning disabled or logs growing uncontrolled. | Enable pruning, rotate logs, offload snapshots. |
Logging & Escalation
- Collect `journalctl -u seid --since "15 minutes ago"` for escalation tickets.
- Include `config.toml`, `app.toml`, and latest snapshot metadata when contacting core teams.
- Document incident timeline and resolution for internal postmortems.
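The collection steps above can be bundled into a single archive for the ticket. This is a sketch only: `collect_bundle` is a hypothetical helper, and it assumes the config files live under a `config/` subdirectory of the node home. Review the copied files for sensitive values before sharing them outside your organization.

```bash
#!/usr/bin/env bash
# Sketch: gather recent logs and config files into one tarball for escalation.

# collect_bundle HOME_DIR OUT_TGZ: write a gzip'd tar with configs and logs.
collect_bundle() {
  local home=$1 out=$2 stage f rc
  stage=$(mktemp -d)
  for f in config.toml app.toml; do
    if [ -f "$home/config/$f" ]; then
      cp "$home/config/$f" "$stage/"
    fi
  done
  # Log capture is skipped on hosts without journalctl.
  if command -v journalctl >/dev/null 2>&1; then
    journalctl -u seid --since "15 minutes ago" --no-pager \
      > "$stage/seid.log" 2>&1 || true
  fi
  tar -czf "$out" -C "$stage" .
  rc=$?
  rm -rf "$stage"
  return "$rc"
}

# Example usage (output path is a placeholder):
# collect_bundle "$HOME/.sei" "/tmp/seid-incident-$(date +%Y%m%dT%H%M%S).tgz"
```

Snapshot metadata is not collected here because its location depends on how snapshots are produced; add it to the staging directory by hand.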