Story 19.4: System Metrics & Monitoring

Field         Value
Story Points  10
Sprint        Sprint 84

User Story

As a DevOps engineer
I want system health metrics
So that I can ensure platform reliability

Metrics Categories

Infrastructure

CPU, memory, disk I/O, network, container health

Application

Request rate, response time (p50/p95/p99), error rate, connections
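As an illustration of what the p50/p95/p99 response-time figures mean, here is a minimal sketch that computes them from raw samples with the nearest-rank method. The function name `percentile` and the sample values are hypothetical, and the story does not say how percentiles should be computed; a production setup would more likely derive them from histogram buckets in the monitoring backend.

```python
import math


def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds for one reporting window.
response_times_ms = [12, 15, 11, 240, 14, 13, 18, 16, 900, 17]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
```

The two slow requests barely move p50 but dominate p95 and p99, which is why tail percentiles are tracked alongside the median.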

Database

Query latency, connection pool, slow queries, replication lag

Redis

Memory, hit rate, clients, ops/second
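The four Redis values could plausibly be derived from the output of the Redis `INFO` command. The sketch below assumes that source (the story does not name one) and takes the parsed `INFO` fields as a plain dictionary; `summarize_redis_info` and the sample numbers are hypothetical. Hit rate is computed as `keyspace_hits / (keyspace_hits + keyspace_misses)`.

```python
def summarize_redis_info(info):
    """Reduce a parsed Redis INFO mapping to memory, hit rate, clients, ops/second."""
    hits = info["keyspace_hits"]
    misses = info["keyspace_misses"]
    lookups = hits + misses
    return {
        "memory_bytes": info["used_memory"],
        # Hit rate is undefined before any lookups; report None rather than divide by zero.
        "hit_rate": hits / lookups if lookups else None,
        "clients": info["connected_clients"],
        "ops_per_second": info["instantaneous_ops_per_sec"],
    }


# Hypothetical snapshot of the relevant INFO fields.
sample_info = {
    "used_memory": 1_048_576,
    "keyspace_hits": 900,
    "keyspace_misses": 100,
    "connected_clients": 12,
    "instantaneous_ops_per_sec": 350,
}

print(summarize_redis_info(sample_info))
```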

AI / External

Claude API latency, token usage, rate limits

Prometheus Metrics

  • http_requests_total
  • http_request_duration_seconds
  • db_query_duration_seconds
  • ai_requests_total
  • queue_depth
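
To show what a scrape of the metrics listed above might return, here is a dependency-free sketch that renders them in the Prometheus text exposition format. The metric types are assumptions inferred from the naming (`_total` suggests a counter, `_seconds` a duration histogram, `queue_depth` a gauge), and the bucket bounds, help strings, and class names are hypothetical. A real service would normally use a Prometheus client library instead of hand-rolled classes like these.

```python
from bisect import bisect_left

# Assumed histogram bucket upper bounds, in seconds (not specified by the story).
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)


class Counter:
    """A value that only goes up, e.g. http_requests_total."""
    kind = "counter"

    def __init__(self, name, help_text):
        self.name, self.help_text, self.value = name, help_text, 0

    def inc(self, amount=1):
        self.value += amount

    def samples(self):
        return [f"{self.name} {self.value}"]


class Gauge:
    """A value that can go up or down, e.g. queue_depth."""
    kind = "gauge"

    def __init__(self, name, help_text):
        self.name, self.help_text, self.value = name, help_text, 0

    def set(self, value):
        self.value = value

    def samples(self):
        return [f"{self.name} {self.value}"]


class Histogram:
    """Observations counted into cumulative buckets, e.g. request durations."""
    kind = "histogram"

    def __init__(self, name, help_text, buckets=BUCKETS):
        self.name, self.help_text, self.buckets = name, help_text, buckets
        self.counts = [0] * (len(buckets) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, seconds):
        # bisect_left finds the first bucket whose upper bound is >= seconds.
        self.counts[bisect_left(self.buckets, seconds)] += 1
        self.total += seconds

    def samples(self):
        lines, running = [], 0
        for bound, count in zip([*self.buckets, "+Inf"], self.counts):
            running += count  # buckets in the exposition format are cumulative
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return lines


def render(metrics):
    """Render metrics in the Prometheus text exposition format."""
    out = []
    for m in metrics:
        out.append(f"# HELP {m.name} {m.help_text}")
        out.append(f"# TYPE {m.name} {m.kind}")
        out.extend(m.samples())
    return "\n".join(out) + "\n"


http_requests_total = Counter("http_requests_total", "HTTP requests handled.")
http_request_duration_seconds = Histogram("http_request_duration_seconds", "HTTP request latency.")
db_query_duration_seconds = Histogram("db_query_duration_seconds", "Database query latency.")
ai_requests_total = Counter("ai_requests_total", "Requests sent to the AI provider.")
queue_depth = Gauge("queue_depth", "Jobs currently waiting in the queue.")

ALL_METRICS = [http_requests_total, http_request_duration_seconds,
               db_query_duration_seconds, ai_requests_total, queue_depth]

# Simulate two requests and one queue reading, then print what a scrape would return.
http_requests_total.inc()
http_requests_total.inc()
http_request_duration_seconds.observe(0.25)
http_request_duration_seconds.observe(0.5)
queue_depth.set(3)
print(render(ALL_METRICS))
```

Keeping durations in histograms rather than precomputed averages lets the monitoring backend estimate p50/p95/p99 at query time, which matches the response-time percentiles listed under Application.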