Metrics & Observability
Monitor agent health and system metrics with Prometheus
Agentfield provides comprehensive health checks and Prometheus-compatible metrics for monitoring your multi-agent system. Track execution performance, agent availability, and system health in real-time.
Quick Start
Check system health:
curl http://localhost:8080/api/v1/health
Response:
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0",
  "checks": {
    "storage": {
      "status": "healthy",
      "message": "storage is responsive",
      "response_time": 12
    },
    "cache": {
      "status": "healthy",
      "message": "cache is responsive",
      "response_time": 3
    }
  }
}
Health Check Endpoint
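The endpoint reports an overall status plus per-component checks, so it can be consumed from scripts as well as from the orchestration probes shown below. A minimal Python sketch, assuming the default host/port and the response shape from the Quick Start above:

```python
import requests

def agentfield_healthy(base_url: str = "http://localhost:8080") -> bool:
    """Return True only if the server and every component check report healthy."""
    try:
        resp = requests.get(f"{base_url}/api/v1/health", timeout=5)
        resp.raise_for_status()
        body = resp.json()
    except requests.RequestException:
        return False
    checks = body.get("checks", {})
    return body.get("status") == "healthy" and all(
        check.get("status") == "healthy" for check in checks.values()
    )

if __name__ == "__main__":
    print("healthy" if agentfield_healthy() else "unhealthy")
```

This works as a simple deploy gate or cron check; exit non-zero on failure if you wire it into CI.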
Use Cases
Container Orchestration:
# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /api/v1/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

# Docker Compose healthcheck
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/api/v1/health"]
  interval: 30s
  timeout: 10s
  retries: 3
Load Balancer Health Checks:
# NGINX upstream health check
upstream agentfield_servers {
    server agentfield1:8080 max_fails=3 fail_timeout=30s;
    server agentfield2:8080 max_fails=3 fail_timeout=30s;
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /api/v1/health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx;
}
Prometheus Metrics
Agentfield exposes Prometheus-compatible metrics for detailed observability:
Key Metrics
| Metric | Type | Description |
|---|---|---|
| `agentfield_executions_total` | Counter | Total executions by node, type, and status |
| `agentfield_execution_duration_seconds` | Histogram | Execution latency distribution |
| `agentfield_memory_operations_total` | Counter | Memory operations by type and scope |
| `agentfield_workflow_dag_depth` | Histogram | Workflow complexity metrics |
| `agentfield_queue_depth` | Gauge | Current queue backlog |
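Before pointing Prometheus at the server, you can spot-check that these metrics are being exposed by fetching the metrics endpoint directly. A small sketch, assuming the server runs on localhost:8080 and exposes the `/metrics` path used in the scrape config below:

```python
import requests

# Fetch the raw Prometheus exposition text and keep only Agentfield series.
resp = requests.get("http://localhost:8080/metrics", timeout=5)
resp.raise_for_status()

for line in resp.text.splitlines():
    if line.startswith("agentfield_"):
        print(line)
```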
Prometheus Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'agentfield'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s
Monitoring Patterns
Pattern 1: Execution Performance
Track agent execution latency and success rates:
# Average execution duration by agent
rate(agentfield_execution_duration_seconds_sum[5m])
  / rate(agentfield_execution_duration_seconds_count[5m])

# Success rate by agent
sum by (node_id) (rate(agentfield_executions_total{status="succeeded"}[5m]))
  / sum by (node_id) (rate(agentfield_executions_total[5m]))

# P95 latency
histogram_quantile(0.95,
  rate(agentfield_execution_duration_seconds_bucket[5m]))
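The same expressions can be evaluated outside Grafana, for example in a nightly report or an SLO script, by calling the Prometheus HTTP API. A sketch, assuming a Prometheus server at localhost:9090 is already scraping Agentfield:

```python
import requests

PROMETHEUS = "http://localhost:9090"

def instant_query(expr: str) -> float:
    """Run a PromQL instant query and return the first sample's value."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

success_rate = instant_query(
    'sum(rate(agentfield_executions_total{status="succeeded"}[5m]))'
    " / sum(rate(agentfield_executions_total[5m]))"
)
p95_latency = instant_query(
    "histogram_quantile(0.95, "
    "sum by (le) (rate(agentfield_execution_duration_seconds_bucket[5m])))"
)
print(f"success rate: {success_rate:.2%}, p95 latency: {p95_latency:.2f}s")
```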
Pattern 2: System Health
Monitor queue depth and resource utilization:
# Queue backlog
agentfield_queue_depth
# Memory operation rate
rate(agentfield_memory_operations_total[5m])
# Workflow complexity trend
avg(agentfield_workflow_dag_depth)
Pattern 3: Alerting
Set up alerts for critical conditions:
# Prometheus alerting rules
groups:
  - name: agentfield_alerts
    rules:
      - alert: HighExecutionFailureRate
        expr: |
          sum by (node_id) (rate(agentfield_executions_total{status="failed"}[5m]))
            / sum by (node_id) (rate(agentfield_executions_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High execution failure rate"
          description: "{{ $labels.node_id }} failure rate is {{ $value | humanizePercentage }}"
      - alert: HighQueueDepth
        expr: agentfield_queue_depth > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Execution queue is backing up"
          description: "Queue depth: {{ $value }}"
      - alert: SlowExecutions
        expr: |
          histogram_quantile(0.95,
            rate(agentfield_execution_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow agent executions detected"
          description: "P95 latency: {{ $value }}s"
Grafana Dashboards
Visualize Agentfield metrics with Grafana:
{
  "dashboard": {
    "title": "Agentfield Multi-Agent System",
    "panels": [
      {
        "title": "Execution Rate",
        "targets": [
          {
            "expr": "rate(agentfield_executions_total[5m])"
          }
        ]
      },
      {
        "title": "Success Rate",
        "targets": [
          {
            "expr": "sum(rate(agentfield_executions_total{status=\"succeeded\"}[5m])) / sum(rate(agentfield_executions_total[5m]))"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(agentfield_execution_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Queue Depth",
        "targets": [
          {
            "expr": "agentfield_queue_depth"
          }
        ]
      }
    ]
  }
}
Integration Examples
Datadog
from datadog import initialize, statsd
# Initialize Datadog
initialize(api_key='your-api-key')
# Send custom metrics
statsd.increment('agentfield.execution.count',
                 tags=['agent:support', 'status:succeeded'])
statsd.histogram('agentfield.execution.duration',
                 duration_ms,  # execution time measured by your client code
                 tags=['agent:support'])
New Relic
const newrelic = require('newrelic');
// Track execution
newrelic.recordMetric('Custom/Agentfield/Execution/Count', 1);
newrelic.recordMetric('Custom/Agentfield/Execution/Duration', durationMs);
// Track errors
newrelic.noticeError(new Error('Execution failed'), {
agent: 'support-agent',
execution_id: 'exec_123'
});
CloudWatch
import boto3
cloudwatch = boto3.client('cloudwatch')
# Put custom metrics
cloudwatch.put_metric_data(
    Namespace='Agentfield',
    MetricData=[
        {
            'MetricName': 'ExecutionCount',
            'Value': 1,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Agent', 'Value': 'support-agent'},
                {'Name': 'Status', 'Value': 'succeeded'}
            ]
        },
        {
            'MetricName': 'ExecutionDuration',
            'Value': duration_ms,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {'Name': 'Agent', 'Value': 'support-agent'}
            ]
        }
    ]
)
Best Practices
1. Set Up Alerts
Monitor critical metrics:
- Execution failure rate > 10%
- Queue depth > 100
- P95 latency > 10s
- Storage/cache unhealthy
2. Track SLOs
Define Service Level Objectives (a sketch for checking them follows the examples):
# Example SLOs
- Availability: 99.9% uptime
- Latency: P95 < 2s
- Success Rate: > 99%
- Queue Depth: < 50
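One way to track these objectives is to compare live measurements against the targets on a schedule, for example by reusing the Prometheus query helper from Pattern 1 to produce the measurements. A hypothetical sketch of the comparison step; the thresholds simply mirror the list above:

```python
# Hypothetical SLO targets mirroring the examples above.
SLO_TARGETS = {
    "availability": 0.999,   # fraction of time healthy
    "p95_latency_s": 2.0,    # seconds
    "success_rate": 0.99,    # fraction of executions that succeeded
    "queue_depth": 50,       # maximum acceptable backlog
}

def check_slos(measured: dict) -> dict:
    """Return True/False per SLO; latency and queue depth are upper bounds, the rest lower bounds."""
    return {
        "availability": measured["availability"] >= SLO_TARGETS["availability"],
        "p95_latency_s": measured["p95_latency_s"] <= SLO_TARGETS["p95_latency_s"],
        "success_rate": measured["success_rate"] >= SLO_TARGETS["success_rate"],
        "queue_depth": measured["queue_depth"] <= SLO_TARGETS["queue_depth"],
    }

print(check_slos({"availability": 0.9995, "p95_latency_s": 1.4,
                  "success_rate": 0.993, "queue_depth": 12}))
```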
3. Use Distributed Tracing
Combine metrics with distributed tracing for complete observability:
// OpenTelemetry integration
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('agentfield-client');

const span = tracer.startSpan('execute-agent');
span.setAttribute('agent.id', 'support-agent');
span.setAttribute('workflow.id', workflowId);

try {
  const result = await executeAgent();
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR });
  span.recordException(error);
} finally {
  span.end();
}
4. Monitor Resource Usage
Track system resources alongside application metrics:
# CPU usage
rate(process_cpu_seconds_total[5m])
# Memory usage
process_resident_memory_bytes
# Goroutines (for Go-based Agentfield)
go_goroutines
Related
- Agent Execution - Execute agents with automatic metrics
- Async Execution - Monitor async queue depth
- Workflow Management - Track workflow complexity
- REST API Overview - Complete API reference