agentd-monitor
System health monitoring and alerting daemon. Watches CPU, memory, disk, and load average metrics at configurable intervals and exposes a REST API for querying current state, triggering on-demand collection, and exporting Prometheus metrics.
Base URL
The listen port defaults to 17003 in development and 7003 in production. Set the AGENTD_PORT environment variable to override it.
Environment Variables

| Variable | Default | Description |
|---|---|---|
| AGENTD_PORT | 17003 | HTTP listen port |
| AGENTD_COLLECTION_INTERVAL_SECS | 30 | Seconds between automatic metric collections |
| AGENTD_CPU_ALERT_THRESHOLD | 90.0 | CPU usage percentage that triggers an alert |
| AGENTD_MEMORY_ALERT_THRESHOLD | 90.0 | Memory usage percentage that triggers an alert |
| AGENTD_DISK_ALERT_THRESHOLD | 90.0 | Disk usage percentage that triggers an alert |
| AGENTD_HISTORY_SIZE | 120 | Number of metric snapshots to retain in memory |
| AGENTD_LOG_FORMAT | text | Log format: text or json |
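
As an illustration of how the variables above could be consumed, here is a minimal Python sketch that reads them from an environment mapping and falls back to the documented defaults. The `MonitorConfig` class and `load_config` function are hypothetical names introduced for this example, and rejecting an unknown log format is an assumption, not documented daemon behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MonitorConfig:
    """Settings mirroring the documented AGENTD_* variables and defaults."""
    port: int = 17003
    collection_interval_secs: int = 30
    cpu_alert_threshold: float = 90.0
    memory_alert_threshold: float = 90.0
    disk_alert_threshold: float = 90.0
    history_size: int = 120
    log_format: str = "text"


def load_config(env: Optional[Mapping[str, str]] = None) -> MonitorConfig:
    """Build a MonitorConfig from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    d = MonitorConfig()  # source of the documented defaults
    log_format = env.get("AGENTD_LOG_FORMAT", d.log_format)
    if log_format not in ("text", "json"):
        # Assumption: this sketch rejects unknown formats outright.
        raise ValueError(f"unsupported AGENTD_LOG_FORMAT: {log_format!r}")
    return MonitorConfig(
        port=int(env.get("AGENTD_PORT", d.port)),
        collection_interval_secs=int(
            env.get("AGENTD_COLLECTION_INTERVAL_SECS", d.collection_interval_secs)
        ),
        cpu_alert_threshold=float(
            env.get("AGENTD_CPU_ALERT_THRESHOLD", d.cpu_alert_threshold)
        ),
        memory_alert_threshold=float(
            env.get("AGENTD_MEMORY_ALERT_THRESHOLD", d.memory_alert_threshold)
        ),
        disk_alert_threshold=float(
            env.get("AGENTD_DISK_ALERT_THRESHOLD", d.disk_alert_threshold)
        ),
        history_size=int(env.get("AGENTD_HISTORY_SIZE", d.history_size)),
        log_format=log_format,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in isolation, for example `load_config({"AGENTD_PORT": "7003"})`.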
Endpoints
GET /health
Standard health check. Returns service name, version, and metrics collection count.
GET /metrics
Returns the latest system metrics snapshot as JSON. Returns 503 Service Unavailable if no collection has run yet.
POST /collect
Triggers an immediate metrics collection and returns the snapshot along with any threshold alerts.
Response:

    {
      "metrics": { "...": "..." },
      "alerts": [
        {
          "metric": "cpu",
          "current_value": 95.2,
          "threshold": 90.0,
          "message": "CPU usage 95.2% exceeds threshold 90.0%",
          "raised_at": "2026-03-31T12:00:00Z"
        }
      ]
    }
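
A client might unpack this response roughly as follows. This is a sketch that assumes the body matches the shape shown above. `CollectAlert`, `parse_collect_response`, and `SAMPLE_BODY` are hypothetical names for this example, and treating a missing `alerts` key as an empty list is an assumption.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CollectAlert:
    """One entry of the "alerts" array in the POST /collect response."""
    metric: str
    current_value: float
    threshold: float
    message: str
    raised_at: str


def parse_collect_response(body: str) -> Tuple[dict, List[CollectAlert]]:
    """Split a POST /collect response body into (metrics, alerts)."""
    payload = json.loads(body)
    alerts = [
        CollectAlert(
            metric=item["metric"],
            current_value=float(item["current_value"]),
            threshold=float(item["threshold"]),
            message=item["message"],
            raised_at=item["raised_at"],
        )
        # Assumption: a missing "alerts" key means no alerts were raised.
        for item in payload.get("alerts", [])
    ]
    return payload["metrics"], alerts


# The documented example response, used here as sample input.
SAMPLE_BODY = """
{
  "metrics": { "...": "..." },
  "alerts": [
    {
      "metric": "cpu",
      "current_value": 95.2,
      "threshold": 90.0,
      "message": "CPU usage 95.2% exceeds threshold 90.0%",
      "raised_at": "2026-03-31T12:00:00Z"
    }
  ]
}
"""

metrics, alerts = parse_collect_response(SAMPLE_BODY)
for alert in alerts:
    print(f"{alert.metric}: {alert.message}")
```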
GET /history
Returns all retained metrics snapshots (up to AGENTD_HISTORY_SIZE) as a JSON array, newest first.
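
The retention behavior described here (a bounded number of snapshots, returned newest first) can be modeled with a fixed-size buffer. The sketch below uses Python's `collections.deque` and is not necessarily how the daemon stores its history; `MetricsHistory` is a hypothetical name.

```python
from collections import deque
from typing import List


class MetricsHistory:
    """Keeps at most max_size snapshots, discarding the oldest first."""

    def __init__(self, max_size: int = 120) -> None:
        # deque(maxlen=...) drops items from the opposite end when full.
        self._snapshots = deque(maxlen=max_size)

    def record(self, snapshot: dict) -> None:
        """Append the most recent snapshot."""
        self._snapshots.append(snapshot)

    def newest_first(self) -> List[dict]:
        """Return retained snapshots ordered from newest to oldest."""
        return list(reversed(self._snapshots))


history = MetricsHistory(max_size=3)
for i in range(5):
    history.record({"sequence": i})
print(history.newest_first())
```

With a capacity of 3 and five recorded snapshots, only the last three remain, listed from newest to oldest.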
GET /status
Health assessment based on configured thresholds. Returns overall status (healthy, degraded, or critical) and any active alerts.
GET /prom-metrics
Prometheus exposition format metrics for scraping.
Data Models
SystemMetrics

| Field | Type | Description |
|---|---|---|
| collected_at | string (datetime) | Collection timestamp |
| cpu | CpuMetrics | CPU utilization metrics |
| memory | MemoryMetrics | Memory usage metrics |
| disks | DiskMetrics[] | Per-disk usage metrics |
| load_average | LoadAverage | System load averages |

CpuMetrics

| Field | Type | Description |
|---|---|---|
| usage_percent | float | Global CPU usage, 0.0-100.0 |
| core_count | integer | Number of logical CPU cores |
| per_core | float[] | Per-core usage percentages |

MemoryMetrics

| Field | Type | Description |
|---|---|---|
| total_bytes | integer | Total physical memory in bytes |
| used_bytes | integer | Used memory in bytes |
| available_bytes | integer | Available memory in bytes |
| usage_percent | float | Memory usage percentage, 0.0-100.0 |

DiskMetrics

| Field | Type | Description |
|---|---|---|
| name | string | Disk device name or label |
| mount_point | string | Mount point path |
| total_bytes | integer | Total disk space in bytes |
| available_bytes | integer | Free space in bytes |
| used_bytes | integer | Used space in bytes |
| usage_percent | float | Usage percentage, 0.0-100.0 |

LoadAverage

| Field | Type | Description |
|---|---|---|
| one | float | 1-minute load average |
| five | float | 5-minute load average |
| fifteen | float | 15-minute load average |
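
The metrics models above map naturally onto typed records on the client side. The sketch below mirrors the documented field names with Python dataclasses. It assumes `collected_at` is an ISO 8601 string with a trailing `Z`, as in the alert timestamp example, and that the JSON keys match the field names exactly. `SAMPLE_SNAPSHOT` is invented sample data, not real output.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class CpuMetrics:
    usage_percent: float
    core_count: int
    per_core: List[float]


@dataclass(frozen=True)
class MemoryMetrics:
    total_bytes: int
    used_bytes: int
    available_bytes: int
    usage_percent: float


@dataclass(frozen=True)
class DiskMetrics:
    name: str
    mount_point: str
    total_bytes: int
    available_bytes: int
    used_bytes: int
    usage_percent: float


@dataclass(frozen=True)
class LoadAverage:
    one: float
    five: float
    fifteen: float


@dataclass(frozen=True)
class SystemMetrics:
    collected_at: datetime
    cpu: CpuMetrics
    memory: MemoryMetrics
    disks: List[DiskMetrics]
    load_average: LoadAverage

    @classmethod
    def from_dict(cls, raw: dict) -> "SystemMetrics":
        """Build a typed snapshot from a decoded JSON object."""
        # Assumption: timestamps are ISO 8601 with a trailing "Z" for UTC.
        stamp = raw["collected_at"].replace("Z", "+00:00")
        return cls(
            collected_at=datetime.fromisoformat(stamp),
            cpu=CpuMetrics(**raw["cpu"]),
            memory=MemoryMetrics(**raw["memory"]),
            disks=[DiskMetrics(**disk) for disk in raw["disks"]],
            load_average=LoadAverage(**raw["load_average"]),
        )


# Invented sample data for illustration only.
SAMPLE_SNAPSHOT = {
    "collected_at": "2026-03-31T12:00:00Z",
    "cpu": {"usage_percent": 42.0, "core_count": 4,
            "per_core": [40.0, 44.0, 41.0, 43.0]},
    "memory": {"total_bytes": 8000, "used_bytes": 6000,
               "available_bytes": 2000, "usage_percent": 75.0},
    "disks": [{"name": "sda1", "mount_point": "/", "total_bytes": 1000,
               "available_bytes": 400, "used_bytes": 600,
               "usage_percent": 60.0}],
    "load_average": {"one": 1.2, "five": 0.9, "fifteen": 0.7},
}

snapshot_model = SystemMetrics.from_dict(SAMPLE_SNAPSHOT)
print(snapshot_model.cpu.core_count, snapshot_model.disks[0].mount_point)
```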

Alert

| Field | Type | Description |
|---|---|---|
| metric | string | Metric that triggered the alert (e.g., cpu, memory, disk:/) |
| current_value | float | Current metric value |
| threshold | float | Threshold that was exceeded |
| message | string | Human-readable description |
| raised_at | string (datetime) | Alert timestamp |
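
To show how Alert records relate to the configured thresholds, here is a sketch of a threshold check over a snapshot. It assumes an alert is raised only when usage is strictly greater than the threshold ("exceeds"), and that disk alerts use the `disk:<mount_point>` naming suggested by the `disk:/` example. The memory and disk message wording is modeled on the documented CPU message and is an assumption. `evaluate_alerts` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import List, Optional


def evaluate_alerts(
    metrics: dict,
    cpu_threshold: float = 90.0,
    memory_threshold: float = 90.0,
    disk_threshold: float = 90.0,
    now: Optional[datetime] = None,
) -> List[dict]:
    """Return Alert-shaped dicts for every metric above its threshold."""
    now = now or datetime.now(timezone.utc)
    raised_at = now.strftime("%Y-%m-%dT%H:%M:%SZ")

    # (metric key, label used in the message, current value, threshold)
    checks = [
        ("cpu", "CPU", metrics["cpu"]["usage_percent"], cpu_threshold),
        ("memory", "Memory", metrics["memory"]["usage_percent"], memory_threshold),
    ]
    for disk in metrics["disks"]:
        mount = disk["mount_point"]
        checks.append(
            (f"disk:{mount}", f"Disk {mount}", disk["usage_percent"], disk_threshold)
        )

    alerts = []
    for metric, label, value, threshold in checks:
        # Assumption: strict comparison, so a value equal to the threshold is fine.
        if value > threshold:
            alerts.append({
                "metric": metric,
                "current_value": value,
                "threshold": threshold,
                "message": f"{label} usage {value:.1f}% exceeds threshold {threshold:.1f}%",
                "raised_at": raised_at,
            })
    return alerts


reading = {
    "cpu": {"usage_percent": 95.2},
    "memory": {"usage_percent": 90.0},
    "disks": [{"mount_point": "/", "usage_percent": 97.0}],
}
for alert in evaluate_alerts(reading):
    print(alert["metric"], "-", alert["message"])
```

Accepting `now` as a parameter keeps the timestamp deterministic when the check is exercised with fixed inputs.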