Teaching an AI Router to Learn from Its Own Mistakes: The LLM-as-Judge Feedback Loop

Cognitive Router picks which LLM should handle each request. A coding question goes to a model that is good at code. A creative writing prompt goes to a model that is good at prose. A simple greeting goes to the cheapest model that can string a sentence together.

To make those decisions, the router needs to know how good each model actually is. The obvious approach is to look at benchmark scores — MMLU, HumanEval, MT-Bench — and route based on those. I tried that. It works badly.

Benchmarks measure how models perform on benchmark questions. They do not measure how models perform on your questions. A model that scores well on MT-Bench might produce mediocre creative writing for the specific genre you care about. A model that aces HumanEval might fail on the exact programming language and framework combination in your codebase.

The fix is to measure performance on real traffic and feed those measurements back into routing. But that creates a new problem: who rates the responses?

The Problem With Human Ratings

The gold standard for evaluating LLM output is human preference data. A person reads the prompt, reads the response, and scores it. This is what RLHF is built on. It is also expensive and slow, and it does not scale.

For a personal project or a small team, human rating is a non-starter. You are not going to read and score 200 model responses per day. Nobody has that kind of time. And even if you did, your ratings would be inconsistent — you would score the same response differently on Monday morning than on Friday afternoon.

I needed a rating system that runs automatically, costs nothing, and produces consistent scores. That is what LLM-as-judge gives you.

The Design

Here is how the feedback loop works in Cognitive Router.

Sample 10% of requests. Not every response gets evaluated. That would be wasteful. Instead, one in ten responses is selected for review. This is enough signal to track performance trends without doubling your API costs.
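A minimal sketch of the sampling step, assuming a plain random draw per response. The names SAMPLE_RATE and should_evaluate are illustrative, not taken from Cognitive Router itself.

```python
import random

# Fraction of responses forwarded to the judge (10%, per the text above).
SAMPLE_RATE = 0.10


def should_evaluate(rng: random.Random, sample_rate: float = SAMPLE_RATE) -> bool:
    """Return True for roughly `sample_rate` of calls.

    rng.random() is uniform on [0.0, 1.0), so a rate of 0.0 never
    samples and a rate of 1.0 always samples.
    """
    return rng.random() < sample_rate
```

Passing in a seeded random.Random keeps the sampling reproducible in tests. A hash of the request ID would work too if you want the same request to always get the same decision.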

Send the prompt-response pair to a judge model. The judge receives the original user prompt, the response that was generated, and a scoring rubric. The rubric asks for scores on a few dimensions: correctness, completeness, clarity, and adherence to instructions. Each dimension gets a 1-5 rating plus a short justification.
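One way the rubric prompt and reply parsing could look. This is a sketch, not the router's actual prompt: it assumes the judge is asked to answer in JSON, and the function names, the "adherence" key, and the reply layout are all hypothetical.

```python
import json

# The four rubric dimensions named in the text; "adherence" is shorthand
# for adherence to instructions.
DIMENSIONS = ("correctness", "completeness", "clarity", "adherence")


def build_judge_prompt(user_prompt: str, response: str) -> str:
    """Assemble the text sent to the judge model."""
    dims = ", ".join(DIMENSIONS)
    return (
        "You are grading an AI response.\n"
        f"Score each of these dimensions from 1 to 5: {dims}.\n"
        "Give a short justification for each score.\n"
        "Reply with JSON only, shaped like "
        '{"clarity": {"score": 4, "justification": "..."}}, '
        "with one entry per dimension.\n\n"
        f"PROMPT:\n{user_prompt}\n\nRESPONSE:\n{response}\n"
    )


def parse_judge_reply(raw: str) -> dict:
    """Validate the judge's JSON reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    parsed = {}
    for dim in DIMENSIONS:
        entry = data.get(dim)
        if not isinstance(entry, dict):
            raise ValueError(f"missing dimension: {dim}")
        score = entry.get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"score for {dim} must be an integer from 1 to 5")
        parsed[dim] = {
            "score": score,
            "justification": str(entry.get("justification", "")),
        }
    return parsed
```

Rejecting malformed replies outright, rather than guessing at a score, keeps bad judge output from leaking into the performance data.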

Use a free model for judging. OpenRouter has free tiers for several models. The judge does not need to be the smartest model available — it needs to be consistent. A mid-tier model that scores reliably is better than a frontier model that is brilliant but inconsistent. The cost is zero.

Store the scores in SQLite. Each evaluation is a row: model name, prompt hash, scores per dimension, timestamp, the justification text. This becomes the empirical performance database.
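A sketch of what that table could look like with Python's built-in sqlite3 module. The table and column names are my assumptions for illustration. The category column is also an assumption, added because the routing described later works per category.

```python
import hashlib
import sqlite3
import time

# Hypothetical schema: one row per judged response.
SCHEMA = """
CREATE TABLE IF NOT EXISTS evaluations (
    id            INTEGER PRIMARY KEY,
    model         TEXT    NOT NULL,
    category      TEXT    NOT NULL,
    prompt_hash   TEXT    NOT NULL,
    correctness   INTEGER NOT NULL,
    completeness  INTEGER NOT NULL,
    clarity       INTEGER NOT NULL,
    adherence     INTEGER NOT NULL,
    justification TEXT,
    created_at    REAL    NOT NULL
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def prompt_hash(prompt: str) -> str:
    """Store a SHA-256 digest instead of the raw prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def record_evaluation(conn, model, category, prompt, scores, justification):
    """Insert one evaluation row; `scores` maps dimension name to a 1-5 int."""
    conn.execute(
        "INSERT INTO evaluations (model, category, prompt_hash, correctness, "
        "completeness, clarity, adherence, justification, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            model,
            category,
            prompt_hash(prompt),
            scores["correctness"],
            scores["completeness"],
            scores["clarity"],
            scores["adherence"],
            justification,
            time.time(),
        ),
    )
    conn.commit()
```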

Blending Scores With EMA

Raw scores are noisy. A model might get a 4.5 on one response and a 2 on the next because the second prompt was genuinely harder, not because the model got worse. You cannot route based on the last single score. You need a trend.

The solution is an exponential moving average (EMA). Each new score updates the model's running average, but recent scores weigh more heavily than old ones. The formula is simple:

ema_score = (alpha * new_score) + ((1 - alpha) * previous_ema)

With an alpha of 0.3, each new score moves the average 30% of the way toward that score. A single bad response does not tank a model's rating, but a sustained decline is reflected within a few evaluations.
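The update is a one-liner in code. The sketch below implements the formula above. Seeding the average with the first observed score when there is no previous value is my assumption (the router described here seeds from benchmark-derived scores instead).

```python
from typing import Optional


def update_ema(previous_ema: Optional[float], new_score: float, alpha: float = 0.3) -> float:
    """Blend a new judge score into the running average.

    Implements ema = alpha * new_score + (1 - alpha) * previous_ema.
    If there is no previous value, the new score becomes the starting
    point (an assumption made for this sketch).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if previous_ema is None:
        return new_score
    return alpha * new_score + (1.0 - alpha) * previous_ema
```

Worked through by hand: a model sitting at 4.0 that receives a single 2 moves to 0.3 * 2 + 0.7 * 4.0 = 3.4. Three 2s in a row take it to roughly 2.69, which is the "sustained decline within a few evaluations" behavior.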

Why EMA instead of a simple average? Because models change. A provider updates a model and it gets better at coding but worse at creative writing. A simple average remembers every old score equally, which means the rating lags behind reality. EMA forgets old performance gradually. The router adapts.

The EMA scores are stored per model, per category. A model might have a high EMA for coding tasks and a low one for creative writing. The router checks the relevant category EMA when deciding where to send each request.
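A rough sketch of the per-model, per-category lookup, using an in-memory dictionary keyed by (model, category). The names EmaTable, record_score, and pick_model are hypothetical, and so is the neutral default of 3.0 for pairs that have no score yet.

```python
from typing import Dict, Sequence, Tuple

# Maps (model, category) -> current EMA score.
EmaTable = Dict[Tuple[str, str], float]


def record_score(table: EmaTable, model: str, category: str,
                 score: float, alpha: float = 0.3) -> float:
    """Fold a new judge score into the EMA for one model/category pair."""
    key = (model, category)
    previous = table.get(key)
    if previous is None:
        table[key] = score  # first observation seeds the average
    else:
        table[key] = alpha * score + (1.0 - alpha) * previous
    return table[key]


def pick_model(table: EmaTable, category: str, candidates: Sequence[str],
               default_score: float = 3.0) -> str:
    """Return the candidate with the highest EMA for this category.

    Models with no data for the category fall back to `default_score`.
    Ties go to the earliest candidate in the list.
    """
    if not candidates:
        raise ValueError("no candidate models")
    return max(candidates, key=lambda m: table.get((m, category), default_score))
```

A real router would likely weigh cost and latency alongside quality. This sketch only shows the quality lookup.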

How It Changes Routing Behavior

Before the feedback loop, routing was based on static capability estimates. I seeded each model with benchmark-derived scores. The router picked the model with the best estimated score for the task category.

After the feedback loop ran for a week, the scores started diverging from the static estimates. Some findings:

One model was quietly better than benchmarks suggested. A mid-tier model that did not score especially well on MT-Bench consistently produced high-quality responses for analytical writing tasks. The judge rated it above the "better" models in that category. The router started sending it more analytical work, and the scores held up.

One model was quietly worse. A frontier model that benchmarks love was producing verbose, over-cautious responses on simple tasks. The judge docked it for clarity and conciseness. The router started routing simple requests to a cheaper, more direct model.

Category-specific strengths emerged. Two models with similar overall scores had very different profiles once you broke it down by category. One was good at factual accuracy but bad at following formatting instructions. The other was the reverse. The router learned to pick based on what the specific prompt needed.

Benchmarks tell you which model is best on average. Real traffic tells you which model is best for you. The gap between those two is larger than you think.

What Makes a Good Judge Model

I experimented with a few models as the judge before settling on one. Here is what matters:

Consistency over brilliance. The judge needs to score similar responses similarly. A model that gives a 4 to a response today and a 2 to the same response tomorrow is useless as a judge, no matter how insightful its individual ratings are.

Low cost. This runs on 10% of traffic. If the judge model costs $0.05 per call and you route 500 requests per day, that is 50 judge calls, or $2.50 per day, just for evaluation. A free model eliminates that concern entirely.

Calibrated rubric adherence. The judge needs to follow the scoring rubric, not invent its own. I tested this by sending the same prompt-response pair multiple times and checking the variance. Models that varied by more than one point on a 5-point scale were disqualified.
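The repeat-scoring check is easy to automate. A minimal sketch, assuming you have already collected the scores a candidate judge gave to the same prompt-response pair over several runs. Here "varied by more than one point" is read as the gap between the highest and lowest score. The function names are hypothetical.

```python
from typing import Sequence


def score_spread(repeat_scores: Sequence[float]) -> float:
    """Gap between the highest and lowest score for one repeated pair."""
    if len(repeat_scores) < 2:
        raise ValueError("need at least two repeated scores")
    return max(repeat_scores) - min(repeat_scores)


def is_consistent(repeat_scores: Sequence[float], max_spread: float = 1.0) -> bool:
    """True if the judge stayed within `max_spread` points across repeats."""
    return score_spread(repeat_scores) <= max_spread
```

For example, repeated scores of 4, 4, 5, 4 pass (a spread of 1), while 4, 2, 4 fail (a spread of 2).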

Resistance to length bias. Many LLMs rate longer responses higher regardless of quality. The rubric explicitly addresses this — "do not reward verbosity for its own sake" — but some models follow that instruction better than others. Test for it.
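One possible way to test for it, sketched under an assumption the text does not spell out: for a handful of prompts, prepare a concise response and a padded version that adds length but no information, have the judge score both, and compare. The function name and the paired-score input format are hypothetical.

```python
from typing import Sequence, Tuple


def length_bias_gap(paired_scores: Sequence[Tuple[float, float]]) -> float:
    """Mean of (padded score - concise score) over paired responses.

    Each tuple holds the judge's score for the concise response and for
    the padded response with the same content. A gap near zero suggests
    the judge ignores padding; a clearly positive gap suggests it
    rewards length.
    """
    if not paired_scores:
        raise ValueError("need at least one pair of scores")
    diffs = [padded - concise for concise, padded in paired_scores]
    return sum(diffs) / len(diffs)
```

What counts as "clearly positive" is a judgment call. With only a few pairs, a small gap is just as likely to be noise.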

When the Judge Is Wrong

The judge is an LLM. It makes mistakes. It will occasionally rate a good response poorly or a bad response well. This is why EMA matters — individual bad ratings get absorbed into the trend.

The bigger risk is systematic bias. If the judge consistently underrates a particular style of response, a model that writes in that style will have an artificially low EMA, and the router will avoid it. Every evaluation system has this problem: the evaluator's biases become the system's biases.

Mitigation is straightforward. First, monitor the score distributions per model. If one model's average score is consistently half a point below the others across all categories, investigate whether that reflects real quality differences or judge bias. Second, periodically swap in a different judge model and compare the scores. If the rankings change significantly, you have a judge problem, not a model problem.
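The judge-swap comparison can be reduced to a simple question: which pairs of models change order when the judge changes? A sketch, assuming you have each judge's average score per model as a dictionary. The function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def ranking(avg_scores: Dict[str, float]) -> List[str]:
    """Model names ordered from highest to lowest average score."""
    return sorted(avg_scores, key=avg_scores.get, reverse=True)


def flipped_pairs(judge_a: Dict[str, float],
                  judge_b: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pairs of models that the two judges rank in opposite order.

    Only models scored by both judges are compared. Ties under either
    judge are not counted as flips.
    """
    shared = sorted(set(judge_a) & set(judge_b))
    flipped = []
    for m1, m2 in combinations(shared, 2):
        diff_a = judge_a[m1] - judge_a[m2]
        diff_b = judge_b[m1] - judge_b[m2]
        if diff_a * diff_b < 0:  # opposite signs: the order reversed
            flipped.append((m1, m2))
    return flipped
```

An empty result means both judges agree on every pairwise ordering. A long list of flips is the signal that the scores reflect the judge more than the models.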

The Principle

The general principle: static evaluations decay. Any system that picks between options based on quality estimates needs a mechanism to update those estimates from real outcomes. Benchmarks are a starting point. They are not a maintenance strategy.

The feedback loop does not need to be sophisticated. A free model scoring 10% of responses, blended with EMA, stored in SQLite. No infrastructure. No human raters. No ML pipeline. Just a loop that runs, learns, and quietly makes better decisions over time.

A router that learns from its own traffic will usually beat one that relies on last quarter's benchmark scores. Build the loop. Let it run.