Production incidents have a way of revealing exactly how much you don’t know about your system. The database is slow. Is it slow for everyone, or just this endpoint? Is the slowdown correlated with a deployment, a traffic spike, or a background job? Without observability, you’re guessing.
OpenTelemetry has become the standard for observability in 2026. It provides a unified API for traces, metrics, and logs, with exporters that send data to any backend — Jaeger, Grafana, Datadog, or your own pipeline.
Traces: The Spine of Observability
A trace follows a request through your system. In a FastAPI application instrumented with OpenTelemetry, every HTTP request generates a trace. Each database query, Redis call, and external API request becomes a span within that trace. When something is slow, you can see exactly which component is the bottleneck.
The auto-instrumentation packages do most of the work. Install opentelemetry-instrumentation-fastapi and opentelemetry-instrumentation-sqlalchemy with pip, add a few lines of initialization code, and your application emits traces for every request and database call. Manual instrumentation fills the gaps: wrap a critical business logic function in a span so it appears in the trace timeline.
Structured Logging
Print statements and unstructured log messages are the enemy of debugging. When something goes wrong at 2 AM, searching through log lines for error messages is slow and error-prone.
Structured logging with python-json-logger or structlog emits JSON log lines with consistent field names. Configure every log line to include a timestamp, severity level, service name, and trace ID. When the alert fires, you search your log store for the trace ID and see every log line from every service that handled that request, in chronological order.
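To make the shape of such a log line concrete, here is a minimal sketch that uses only the standard library rather than python-json-logger or structlog. The field names, the service name checkout-api, and the current_trace_id context variable are illustrative assumptions. In a real service the trace ID would come from the active trace rather than being set by hand.

```python
import io
import json
import logging
from contextvars import ContextVar
from datetime import datetime, timezone

# Hypothetical holder for the trace ID of the request being handled.
current_trace_id: ContextVar[str] = ContextVar("current_trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "trace_id": current_trace_id.get(),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(stream, service_name="checkout-api"):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service_name))
    logger = logging.getLogger(service_name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for stdout so the output can be inspected
log = build_logger(buffer)
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("payment authorized for order %s", 1234)
current_trace_id.reset(token)
entry = json.loads(buffer.getvalue())
```

Because every line carries the same trace_id field, filtering the log store on that one value is enough to reassemble the full request.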
Metrics: Beyond CPU and Memory
Application-level metrics tell you what’s happening inside your code. Request latency by endpoint. Error rate by status code. Database connection pool utilization. Queue depth for background tasks. These metrics are more useful than CPU and memory graphs during an incident — they tell you what the system is actually doing, not just how hard it’s working.
Prometheus remains the standard metrics backend for Python applications. The prometheus-client library ships ASGI and WSGI apps that expose a /metrics endpoint for Prometheus to scrape. In FastAPI you can mount the ASGI app on a route, and Django projects commonly use a wrapper such as django-prometheus.
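A minimal sketch of application-level metrics with prometheus-client might look like the following. It assumes the prometheus-client package is installed. The metric names, label names, and the record_request helper are hypothetical, and a private registry is used so the example does not touch global state.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# Assumption: a dedicated registry keeps the sketch isolated.
# Most applications simply use the library's default registry.
registry = CollectorRegistry()

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency by endpoint",
    ["endpoint"],
    registry=registry,
)
REQUEST_ERRORS = Counter(
    "http_request_errors_total",
    "Server errors by endpoint and status code",
    ["endpoint", "status_code"],
    registry=registry,
)


def record_request(endpoint, status_code, duration_seconds):
    """Hypothetical hook that a request middleware would call once per request."""
    REQUEST_LATENCY.labels(endpoint=endpoint).observe(duration_seconds)
    if status_code >= 500:
        REQUEST_ERRORS.labels(
            endpoint=endpoint, status_code=str(status_code)
        ).inc()


record_request("/orders", 200, 0.042)
record_request("/orders", 503, 1.3)

# Text in the exposition format that a Prometheus scrape would receive.
exposition = generate_latest(registry).decode("utf-8")
```

To serve these over HTTP from a FastAPI app, one option is app.mount("/metrics", make_asgi_app(registry=registry)), using make_asgi_app from prometheus_client.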
The Implementation Order
Don’t try to implement everything at once. Start with structured logging, which delivers the most value for the least effort. Add traces next, focusing on the critical path through your application. Metrics come last, driven by the questions you actually ask during incidents.
Observability is insurance. You hope you never need it, but when you do, you need it badly. A few days of instrumenting your application pays for itself in the first incident it helps you solve.