Monitoring
Monitoring
UltraDAG provides built-in monitoring endpoints for health checks, Prometheus metrics, and JSON diagnostics. This guide covers setting up monitoring for production nodes.
Health Endpoints
Simple Health Check
curl http://localhost:10333/health
Returns HTTP 200 with {"status": "ok"} if the node is running. Suitable for Kubernetes liveness probes and simple uptime monitors.
Detailed Health Check
curl http://localhost:10333/health/detailed
Returns component-level diagnostics:
{
"status": "healthy",
"components": {
"dag": {
"status": "healthy",
"round": 15420,
"vertices": 77100
},
"finality": {
"status": "healthy",
"finalized_round": 15418,
"lag": 2
},
"state": {
"status": "healthy",
"accounts": 142,
"total_supply": 1050154200000000
},
"mempool": {
"status": "healthy",
"size": 7,
"capacity": 10000
},
"network": {
"status": "healthy",
"peers": 4
},
"checkpoints": {
"status": "healthy",
"latest_round": 15400
}
}
}
Status levels:
| Status | Meaning |
|---|---|
healthy | Component operating normally |
warning | Component functional but degraded |
unhealthy | Component has issues requiring attention |
degraded | Component partially operational |
Prometheus Metrics
Endpoint
curl http://localhost:10333/metrics
Returns checkpoint metrics in Prometheus exposition format. The /metrics endpoint currently exports checkpoint-related metrics:
# HELP checkpoint_produced_total Total number of checkpoints produced by this node
# TYPE checkpoint_produced_total counter
checkpoint_produced_total 154
# HELP checkpoint_production_duration_ms Duration of last checkpoint production in milliseconds
# TYPE checkpoint_production_duration_ms gauge
checkpoint_production_duration_ms 42
# HELP checkpoint_size_bytes Size of last checkpoint in bytes
# TYPE checkpoint_size_bytes gauge
checkpoint_size_bytes 5000
# HELP checkpoint_cosigned_total Total number of checkpoints co-signed by this node
# TYPE checkpoint_cosigned_total counter
checkpoint_cosigned_total 300
# HELP checkpoint_quorum_reached_total Total number of checkpoints that reached quorum
# TYPE checkpoint_quorum_reached_total counter
checkpoint_quorum_reached_total 152
# HELP checkpoint_validation_failures_total Total number of checkpoint validation failures
# TYPE checkpoint_validation_failures_total counter
checkpoint_validation_failures_total 0
# HELP fast_sync_success_total Total number of successful fast-syncs
# TYPE fast_sync_success_total counter
fast_sync_success_total 1
# HELP checkpoint_persist_success_total Total number of successful checkpoint persists
# TYPE checkpoint_persist_success_total counter
checkpoint_persist_success_total 154
/status endpoint (JSON) or /health/detailed endpoint. These are not currently exported in Prometheus format but can be scraped via a custom exporter wrapping /status.JSON Metrics
curl http://localhost:10333/metrics/json
Returns the same data as JSON, suitable for custom dashboards:
{
"dag_round": 15420,
"finalized_round": 15418,
"finality_lag": 2,
"peer_count": 4,
"mempool_size": 7,
"total_supply": 1050154200000000,
"total_staked": 50000000000000,
"active_validators": 5,
"checkpoint_latest": 15400,
"uptime_seconds": 86400
}
Key Metrics
Critical Metrics
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
finality_lag | <= 3 | 4-10 | > 10 |
peer_count | >= 3 | 2 | 0-1 |
mempool_size | < 5000 | 5000-8000 | > 8000 |
Supply Monitoring
| Metric | Description |
|---|---|
total_supply | Total UDAG minted (should never exceed 21M) |
total_staked | Total UDAG locked in stakes |
liquid + staked + delegated + treasury != total_supply, the node will exit with code 101. This should never happen in normal operation.Grafana Dashboard Setup
Prometheus Configuration
Add UltraDAG to your prometheus.yml:
scrape_configs:
- job_name: 'ultradag'
scrape_interval: 15s
static_configs:
- targets:
- 'ultradag-node-1:10333'
- 'ultradag-node-2:10333'
- 'ultradag-node-3:10333'
metrics_path: '/metrics'
Dashboard Panels
Recommended Grafana dashboard layout:
Row 1: Overview
- Finality lag (gauge, threshold colors)
- Peer count (gauge)
- DAG round (counter)
- Mempool size (gauge)
Row 2: Economics
- Total supply over time (graph)
- Total staked over time (graph)
- Active validators (gauge)
Row 3: Checkpoints
- Checkpoints produced (counter)
- Checkpoint quorum rate (percentage)
- Fast-sync operations (counter)
Row 4: System
- Node uptime (counter)
- Memory usage (gauge, if available from host metrics)
Alert Thresholds
Recommended Alerts
/status JSON endpoint and re-exports the values as Prometheus gauges with an ultradag_ prefix. The built-in /metrics endpoint only exports checkpoint metrics.groups:
- name: ultradag
rules:
- alert: HighFinalityLag
expr: ultradag_finality_lag > 10
for: 5m
labels:
severity: critical
annotations:
summary: "Finality lag is {{ $value }} rounds"
- alert: LowPeerCount
expr: ultradag_peer_count < 2
for: 2m
labels:
severity: warning
annotations:
summary: "Only {{ $value }} peers connected"
- alert: MempoolNearFull
expr: ultradag_mempool_size > 8000
for: 5m
labels:
severity: warning
annotations:
summary: "Mempool at {{ $value }}/10000"
- alert: CheckpointStale
expr: (ultradag_dag_round - ultradag_checkpoint_latest) > 500
for: 10m
labels:
severity: warning
annotations:
summary: "No checkpoint in {{ $value }} rounds"
Circuit Breaker Behavior
UltraDAG has two fatal exit conditions that trigger graceful shutdown:
| Exit Code | Condition | Meaning |
|---|---|---|
| 100 | Finality rollback detected | A previously finalized round was un-finalized. Should never happen in normal operation. Indicates a critical consensus bug. |
| 101 | Supply invariant violated | liquid + staked + delegated + treasury != total_supply. Indicates a critical state engine bug. |
In both cases, the node:
- Signals all components to stop
- Saves all state to disk
- Exits with the appropriate code
Configure your process manager to distinguish these:
[Service]
Restart=on-failure
# Don't restart on exit codes 100 or 101
RestartPreventExitStatus=100 101
Log Levels
Control verbosity with RUST_LOG:
| Level | Use Case |
|---|---|
error | Production (errors only) |
warn | Production (errors + warnings) |
info | Normal operation (recommended) |
debug | Troubleshooting specific issues |
trace | Deep debugging (very verbose, may affect performance) |
Fine-grained control:
RUST_LOG=ultradag_coin=debug,ultradag_network=info,ultradag_node=info
Next Steps
- Node Operator Guide — installation and configuration
- Validator Handbook — validator-specific operations
- CLI Reference — all configuration options