Datadog figures out how to route LLM requests across Kubernetes clusters
Datadog published a deep dive into its new Kubernetes inference extension, which solves a problem that's been quietly gnawing at anyone running LLMs in the wild: you don't just send requests to a single API anymore. You're juggling OpenAI, Anthropic, your own fine-tuned models, maybe a local Llama 3 instance — and each one has different costs, latency, and failure modes.
The extension lets you define routing rules in Kubernetes: send a request to the cheapest model first, fail over to the expensive one if it takes too long, route image-heavy prompts to a vision model, and send routine queries to a smaller model to save cash. You can control this per service, per workload, even per user. It's basically a load balancer that understands what your models can actually do.
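To make the routing idea concrete, here is a minimal Python sketch of "cheapest first, fail over on timeout, vision prompts to a vision model." It is not Datadog's API or configuration format, which this piece does not show. The `Backend` class, the `route` function, the backend names, and the costs are all hypothetical, and the stub functions stand in for real model calls.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Backend:
    """Hypothetical description of one model endpoint (illustration only)."""
    name: str
    cost_per_1k_tokens: float
    supports_vision: bool
    # Returns the reply text, or None if the call did not finish in time.
    call: Callable[[str, float], Optional[str]]


def route(prompt: str, has_image: bool, backends: list[Backend],
          timeout_s: float = 2.0) -> tuple[str, str]:
    """Try eligible backends from cheapest to most expensive."""
    # Image prompts may only go to vision-capable backends.
    eligible = [b for b in backends if b.supports_vision or not has_image]
    for backend in sorted(eligible, key=lambda b: b.cost_per_1k_tokens):
        reply = backend.call(prompt, timeout_s)
        if reply is not None:
            return backend.name, reply
        # Timed out: fall through to the next, more expensive backend.
    raise RuntimeError("no eligible backend answered in time")


# Stubs standing in for real model calls (assumptions for the example).
def timed_out(prompt: str, timeout_s: float) -> Optional[str]:
    return None


def echo(prompt: str, timeout_s: float) -> Optional[str]:
    return f"echo: {prompt}"


backends = [
    Backend("small-local", 0.1, False, timed_out),
    Backend("hosted-large", 2.0, False, echo),
    Backend("hosted-vision", 3.0, True, echo),
]

print(route("summarize this ticket", has_image=False, backends=backends))
print(route("describe this chart", has_image=True, backends=backends))
```

In this sketch the text prompt falls through from the timed-out cheap backend to the pricier one, while the image prompt skips both text-only backends entirely. A real router would also need per-service or per-user policy, which is left out here.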
Datadog has been building in this direction for a while. The company started as a monitoring tool for infrastructure, then expanded into observability, and now it's putting its infrastructure data to work for the LLM layer. The same metrics that tell you a pod is unhealthy also tell you which model endpoint is lagging. The same dashboards that caught your deployment going sideways now catch your API call timing out on a Friday afternoon.
This is the kind of work that doesn't make headlines but quietly reshapes how teams build. When routing intelligence lives in your cluster instead of scattered across three different API clients, you stop paying for the mistakes you keep making.
Why this matters for us: the companies building these tools are the ones deciding what infrastructure gets maintained, what costs stay predictable, and who gets left paying for the chaos when the routing breaks.