Metrics & Observability

Monitor agent health and system metrics with Prometheus

Agentfield provides comprehensive health checks and Prometheus-compatible metrics for monitoring your multi-agent system. Track execution performance, agent availability, and system health in real time.

Quick Start

Check system health:

curl http://localhost:8080/api/v1/health

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0",
  "checks": {
    "storage": {
      "status": "healthy",
      "message": "storage is responsive",
      "response_time": 12
    },
    "cache": {
      "status": "healthy",
      "message": "cache is responsive",
      "response_time": 3
    }
  }
}
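
The same endpoint works for scripted checks. A minimal sketch in Python (assuming the response shape above and the default localhost:8080 address):

import sys

import requests

# Query the health endpoint and fail fast if the system reports anything but "healthy".
resp = requests.get("http://localhost:8080/api/v1/health", timeout=5)
resp.raise_for_status()
health = resp.json()

print(f"overall status: {health['status']}")
for name, check in health.get("checks", {}).items():
    print(f"  {name}: {check['status']} ({check['message']})")

# A non-zero exit lets cron jobs or CI pipelines act on an unhealthy report.
if health["status"] != "healthy":
    sys.exit(1)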

Health Check Endpoint

Use Cases

Container Orchestration:

# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /api/v1/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

# Docker Compose healthcheck
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/api/v1/health"]
  interval: 30s
  timeout: 10s
  retries: 3

Load Balancer Health Checks:

# NGINX upstream health check (active checks require the nginx_upstream_check_module)
upstream agentfield_servers {
  server agentfield1:8080 max_fails=3 fail_timeout=30s;
  server agentfield2:8080 max_fails=3 fail_timeout=30s;

  check interval=3000 rise=2 fall=3 timeout=1000 type=http;
  check_http_send "GET /api/v1/health HTTP/1.0\r\n\r\n";
  check_http_expect_alive http_2xx;
}

Prometheus Metrics

Agentfield exposes Prometheus-compatible metrics for detailed observability:

Key Metrics

Metric                                    Type        Description
agentfield_executions_total               Counter     Total executions by node, type, and status
agentfield_execution_duration_seconds     Histogram   Execution latency distribution
agentfield_memory_operations_total        Counter     Memory operations by type and scope
agentfield_workflow_dag_depth             Histogram   Workflow complexity metrics
agentfield_queue_depth                    Gauge       Current queue backlog
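
To confirm these series are actually being exposed before pointing Prometheus at the server, you can scrape the endpoint directly. A quick sketch (assuming the /metrics path used in the scrape config below):

import requests

# Fetch the raw exposition text and list the agentfield_* series names.
body = requests.get("http://localhost:8080/metrics", timeout=5).text
names = sorted({
    line.split("{")[0].split(" ")[0]
    for line in body.splitlines()
    if line.startswith("agentfield_")
})
for name in names:
    print(name)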

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'agentfield'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s

Monitoring Patterns

Pattern 1: Execution Performance

Track agent execution latency and success rates:

# Average execution duration by agent
rate(agentfield_execution_duration_seconds_sum[5m])
  / rate(agentfield_execution_duration_seconds_count[5m])

# Success rate by agent
sum by (node_id) (rate(agentfield_executions_total{status="succeeded"}[5m]))
  / sum by (node_id) (rate(agentfield_executions_total[5m]))

# P95 latency
histogram_quantile(0.95,
  rate(agentfield_execution_duration_seconds_bucket[5m]))
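
These expressions can also be evaluated programmatically through the Prometheus HTTP API, for example to feed a custom report or canary check. A sketch, assuming Prometheus is reachable at localhost:9090:

import requests

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address

# Per-agent success rate over the last 5 minutes, via an instant query.
query = (
    'sum by (node_id) (rate(agentfield_executions_total{status="succeeded"}[5m]))'
    ' / sum by (node_id) (rate(agentfield_executions_total[5m]))'
)
resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    node = sample["metric"].get("node_id", "unknown")
    value = float(sample["value"][1])
    print(f"{node}: {value:.2%} success rate")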

Pattern 2: System Health

Monitor queue depth and resource utilization:

# Queue backlog
agentfield_queue_depth

# Memory operation rate
rate(agentfield_memory_operations_total[5m])

# Workflow complexity trend (average DAG depth)
sum(rate(agentfield_workflow_dag_depth_sum[5m]))
  / sum(rate(agentfield_workflow_dag_depth_count[5m]))

Pattern 3: Alerting

Set up alerts for critical conditions:

# Prometheus alerting rules
groups:
  - name: agentfield_alerts
    rules:
      - alert: HighExecutionFailureRate
        expr: |
          sum by (node_id) (rate(agentfield_executions_total{status="failed"}[5m]))
          / sum by (node_id) (rate(agentfield_executions_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High execution failure rate"
          description: "{{ $labels.node_id }} has {{ $value }}% failure rate"

      - alert: HighQueueDepth
        expr: agentfield_queue_depth > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Execution queue is backing up"
          description: "Queue depth: {{ $value }}"

      - alert: SlowExecutions
        expr: |
          histogram_quantile(0.95,
            rate(agentfield_execution_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow agent executions detected"
          description: "P95 latency: {{ $value }}s"

Grafana Dashboards

Visualize Agentfield metrics with Grafana:

{
  "dashboard": {
    "title": "Agentfield Multi-Agent System",
    "panels": [
      {
        "title": "Execution Rate",
        "targets": [
          {
            "expr": "rate(agentfield_executions_total[5m])"
          }
        ]
      },
      {
        "title": "Success Rate",
        "targets": [
          {
            "expr": "rate(agentfield_executions_total{status=\"succeeded\"}[5m]) / rate(agentfield_executions_total[5m])"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(agentfield_execution_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Queue Depth",
        "targets": [
          {
            "expr": "agentfield_queue_depth"
          }
        ]
      }
    ]
  }
}
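
Dashboards like this can also be provisioned programmatically through Grafana's HTTP API. A sketch, assuming Grafana at localhost:3000, an API token with dashboard write access, and the JSON above saved as agentfield-dashboard.json:

import json

import requests

GRAFANA = "http://localhost:3000"      # assumed Grafana address
API_TOKEN = "your-grafana-api-token"   # assumed token; keep it out of source control

with open("agentfield-dashboard.json") as f:
    dashboard = json.load(f)["dashboard"]

# POST /api/dashboards/db creates the dashboard, or replaces it when overwrite is true.
resp = requests.post(
    f"{GRAFANA}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())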

Integration Examples

Datadog

from datadog import initialize, statsd

# Initialize Datadog
initialize(api_key='your-api-key')

# Send custom metrics (duration_ms is measured by your client code)
statsd.increment('agentfield.execution.count',
                 tags=['agent:support', 'status:succeeded'])
statsd.histogram('agentfield.execution.duration',
                 duration_ms,
                 tags=['agent:support'])

New Relic

const newrelic = require('newrelic');

// Track execution (durationMs is measured by your client code)
newrelic.recordMetric('Custom/Agentfield/Execution/Count', 1);
newrelic.recordMetric('Custom/Agentfield/Execution/Duration', durationMs);

// Track errors
newrelic.noticeError(new Error('Execution failed'), {
  agent: 'support-agent',
  execution_id: 'exec_123'
});

CloudWatch

import boto3

cloudwatch = boto3.client('cloudwatch')

# Put custom metrics (duration_ms is measured by your client code)
cloudwatch.put_metric_data(
    Namespace='Agentfield',
    MetricData=[
        {
            'MetricName': 'ExecutionCount',
            'Value': 1,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Agent', 'Value': 'support-agent'},
                {'Name': 'Status', 'Value': 'succeeded'}
            ]
        },
        {
            'MetricName': 'ExecutionDuration',
            'Value': duration_ms,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {'Name': 'Agent', 'Value': 'support-agent'}
            ]
        }
    ]
)

Best Practices

1. Set Up Alerts

Monitor critical metrics:

  • Execution failure rate > 10%
  • Queue depth > 100
  • P95 latency > 10s
  • Storage/cache unhealthy

2. Track SLOs

Define Service Level Objectives:

# Example SLOs
- Availability: 99.9% uptime
- Latency: P95 < 2s
- Success Rate: > 99%
- Queue Depth: < 50
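
Targets like these translate directly into error budgets. A quick sketch of the arithmetic, assuming a 30-day window and an illustrative traffic volume:

# Error-budget arithmetic for the example SLOs above.
WINDOW_MINUTES = 30 * 24 * 60            # 43,200 minutes in a 30-day window

availability_target = 0.999
downtime_budget = WINDOW_MINUTES * (1 - availability_target)
print(f"Allowed downtime: {downtime_budget:.1f} minutes per window")    # ~43.2 minutes

success_target = 0.99
executions_per_window = 1_000_000        # assumed traffic volume
failure_budget = executions_per_window * (1 - success_target)
print(f"Allowed failed executions: {failure_budget:,.0f} per window")   # 10,000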

3. Use Distributed Tracing

Combine metrics with distributed tracing for complete observability:

// OpenTelemetry integration
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('agentfield-client');

const span = tracer.startSpan('execute-agent');
span.setAttribute('agent.id', 'support-agent');
span.setAttribute('workflow.id', workflowId);

try {
  const result = await executeAgent();
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR });
  span.recordException(error);
} finally {
  span.end();
}

4. Monitor Resource Usage

Track system resources alongside application metrics:

# CPU usage
rate(process_cpu_seconds_total[5m])

# Memory usage
process_resident_memory_bytes

# Goroutines (for Go-based Agentfield)
go_goroutines