Prometheus Metrics Endpoint
The auth service exposes a GET /metrics endpoint in Prometheus exposition format:
Counters
| Metric | Labels | Description |
|---|---|---|
| grantex_token_exchange_total | status | Token exchange attempts |
| grantex_authorize_total | status | Authorization requests |
| grantex_grants_revoked_total | none | Grants revoked (including cascade) |
| grantex_webhook_deliveries_total | status | Webhook delivery outcomes |
| grantex_anomalies_detected_total | type, severity | Anomalies detected |
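As a rough illustration of how the status label on these counters can be used, the sketch below parses a scraped /metrics body and computes the share of non-successful token exchanges. The label values "success" and "error" and the sample numbers are assumptions made for the example; the service's actual label values are not listed here. The parser is deliberately simplified and does not handle commas or escaped quotes inside label values.

```typescript
// Sum the samples of one metric, grouped by the value of a single label.
// Handles exposition lines such as:  metric_name{label="value"} 123
function sumByLabel(
  exposition: string,
  metric: string,
  label: string,
): Record<string, number> {
  const totals: Record<string, number> = {};
  const sampleLine = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  const labelValue = new RegExp(`(?:^|,)\\s*${label}="([^"]*)"`);
  for (const rawLine of exposition.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comments.
    if (line === "" || line[0] === "#") continue;
    const match = sampleLine.exec(line);
    if (match === null || match[1] !== metric) continue;
    const value = Number(match[3]);
    if (isNaN(value)) continue;
    const found = labelValue.exec(match[2] ?? "");
    const key = found ? found[1] : ""; // "" when the label is absent
    totals[key] = (totals[key] ?? 0) + value;
  }
  return totals;
}

// Fraction of attempts whose label value differs from successValue.
function failureRatio(
  totals: Record<string, number>,
  successValue: string,
): number {
  let all = 0;
  for (const key of Object.keys(totals)) all += totals[key];
  if (all === 0) return 0;
  return (all - (totals[successValue] ?? 0)) / all;
}

// Hypothetical scrape body; the label values and counts are made up.
const sampleScrape = [
  "# TYPE grantex_token_exchange_total counter",
  'grantex_token_exchange_total{status="success"} 95',
  'grantex_token_exchange_total{status="error"} 5',
  'grantex_authorize_total{status="success"} 40',
  "grantex_grants_revoked_total 7",
].join("\n");

const exchangeTotals = sumByLabel(
  sampleScrape,
  "grantex_token_exchange_total",
  "status",
);
const exchangeFailureRatio = failureRatio(exchangeTotals, "success");
```

Counters only ever increase, so a ratio computed from one scrape covers the whole process lifetime. For alerting, the same calculation would normally be applied to the difference between two scrapes, or expressed in PromQL with rate().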
Histograms
| Metric | Labels | Description |
|---|---|---|
| grantex_authorize_duration_seconds | none | Authorization request duration |
| grantex_token_exchange_duration_seconds | none | Token exchange duration |
| grantex_http_request_duration_seconds | method, route, status_code | HTTP request duration (all routes) |
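Prometheus histograms are exposed as cumulative _bucket samples with an le (upper bound) label, and percentiles such as p50 or p99 are estimated from those buckets. The sketch below shows one way to do that estimate by linear interpolation inside the matching bucket, similar in spirit to PromQL's histogram_quantile. The bucket boundaries and counts are illustrative assumptions, not the service's configured buckets.

```typescript
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

// Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length === 0) return NaN;
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].count < rank) continue;
    // The quantile lies in the +Inf bucket: report the highest finite bound.
    if (sorted[i].le === Infinity) {
      return i > 0 ? sorted[i - 1].le : NaN;
    }
    // The lowest bucket is assumed to start at 0.
    const lower = i > 0 ? sorted[i - 1].le : 0;
    const below = i > 0 ? sorted[i - 1].count : 0;
    const inBucket = sorted[i].count - below;
    if (inBucket === 0) return lower;
    // Assume observations are spread evenly within the bucket.
    return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
  }
  return NaN;
}

// Hypothetical buckets for grantex_authorize_duration_seconds.
const authorizeBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];

const authorizeP50 = estimateQuantile(0.5, authorizeBuckets);
const authorizeP99 = estimateQuantile(0.99, authorizeBuckets);
```

The estimate can never be more precise than the bucket layout: any quantile that falls into the +Inf bucket is reported as the largest finite bound.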
Gauges
| Metric | Description |
|---|---|
| grantex_active_grants | Current active grants count |
| grantex_anomalies_unacknowledged | Unacknowledged anomalies |
Environment Variables
| Variable | Default | Description |
|---|---|---|
| METRICS_ENABLED | true | Set to false to disable metrics collection |
Grafana Dashboards
Pre-built Grafana dashboards are available at deploy/grafana/:
| Dashboard | Description |
|---|---|
| overview-dashboard.json | Token exchange rate, success rate gauge, latency p50/p99, grants revoked, active grants, webhook deliveries, anomalies, HTTP error rate |
| per-agent-dashboard.json | Per-agent drill-down with a $agent_id template variable |
Import Instructions
- In Grafana, go to Dashboards > Import
- Upload the JSON file or paste its contents
- Select your Prometheus data source when prompted (${DS_PROMETHEUS})
- Click Import
Health Check Endpoint
The auth service exposes a GET /health endpoint that returns the service status. Use it for:
- Load balancer health checks: poll /health every 10–30 seconds
- Uptime monitoring: UptimeRobot, Pingdom, Cloud Monitoring
- Kubernetes liveness probes: set livenessProbe.httpGet.path to /health
Alerting Thresholds
Recommended thresholds for production alerting:
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Token exchange failure rate | > 5% | > 15% | Check auth service logs |
| Token refresh failure rate | > 5% | > 15% | Check for refresh token reuse or clock skew |
| Anomalies detected | > 5/hour | > 10/hour | Review anomaly details |
| Webhook delivery success | < 98% | < 95% | Verify endpoint availability |
| 429 rate | > 50/min | > 200/min | Client misconfiguration or abuse |
| Auth request latency (p99) | > 500ms | > 2s | Database or Redis performance issue |
| Health check failures | 1 consecutive | 3 consecutive | Service restart |
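The table mixes two kinds of thresholds: most fire when a value rises above a limit, while webhook delivery success fires when it drops below one. A minimal sketch of evaluating a measured value against such a threshold is shown below. The type and function names are hypothetical, and rates are expressed as fractions (0.05 for 5%) and latency in seconds purely as a convention for this example.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
  // "above": alert when the value exceeds the limit (failure rates, latency).
  // "below": alert when the value falls under the limit (delivery success).
  direction: "above" | "below";
}

// Return the most severe level whose limit the value has crossed.
function classify(value: number, threshold: Threshold): Severity {
  const breached = (limit: number): boolean =>
    threshold.direction === "above" ? value > limit : value < limit;
  if (breached(threshold.critical)) return "critical";
  if (breached(threshold.warning)) return "warning";
  return "ok";
}

// Thresholds transcribed from the table above.
const tokenExchangeFailureRate: Threshold = {
  warning: 0.05,
  critical: 0.15,
  direction: "above",
};
const webhookDeliverySuccess: Threshold = {
  warning: 0.98,
  critical: 0.95,
  direction: "below",
};
const authLatencyP99Seconds: Threshold = {
  warning: 0.5,
  critical: 2,
  direction: "above",
};

const exchangeSeverity = classify(0.08, tokenExchangeFailureRate);
const webhookSeverity = classify(0.97, webhookDeliverySuccess);
```

The comparisons are strict, matching the > and < signs in the table, so a failure rate of exactly 5% is still classified as ok.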
Alertmanager Rules
Logging
Structured Logging
The auth service uses Pino for JSON-structured logging.
What to Log
| Event | Log Level | Key Fields |
|---|---|---|
| Grant created | info | grantId, agentId, principalId, scopes |
| Grant revoked | info | grantId, revokedBy, cascadeCount |
| Token exchanged | info | grantId, agentId |
| Token refreshed | info | grantId, agentId |
| Token verification failed | warn | reason, tokenId |
| Auth request denied | warn | agentId, principalId, reason |
| Rate limit hit | warn | ip, endpoint, retryAfter |
| Anomaly detected | warn | type, severity, agentId |
| Webhook delivery failed | error | webhookId, url, statusCode, attempt |
| Database connection error | error | error, pool |
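To keep the key fields in this table consistent across call sites, one option is a small typed helper per event. The sketch below is an illustration, not the service's actual logging code: the EventLogger interface is a hypothetical minimal shape that mirrors Pino's call style (fields object first, message second), and the event names, the message strings, and the choice of a number for retryAfter are assumptions.

```typescript
// Minimal logger shape used by the helpers. Pino's logger methods accept the
// same (fields, message) argument order, so a Pino instance should fit here.
interface EventLogger {
  info(fields: object, msg: string): void;
  warn(fields: object, msg: string): void;
  error(fields: object, msg: string): void;
}

interface GrantRevokedFields {
  grantId: string;
  revokedBy: string;
  cascadeCount: number;
}

interface RateLimitHitFields {
  ip: string;
  endpoint: string;
  retryAfter: number; // assumed to be seconds
}

interface WebhookDeliveryFailedFields {
  webhookId: string;
  url: string;
  statusCode: number;
  attempt: number;
}

// One helper per event keeps the level and key fields from the table together.
function logGrantRevoked(log: EventLogger, fields: GrantRevokedFields): void {
  log.info({ event: "grant.revoked", ...fields }, "Grant revoked");
}

function logRateLimitHit(log: EventLogger, fields: RateLimitHitFields): void {
  log.warn({ event: "rate_limit.hit", ...fields }, "Rate limit hit");
}

function logWebhookDeliveryFailed(
  log: EventLogger,
  fields: WebhookDeliveryFailedFields,
): void {
  log.error({ event: "webhook.delivery_failed", ...fields }, "Webhook delivery failed");
}
```

Because each helper takes a typed fields object, a call site that omits a key field such as cascadeCount fails at compile time instead of producing an incomplete log line.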