All articles

Hope that prompt works...

Test your prompts like you test your software if you want AI to actually help you.

I was sitting with a senior engineer who was very happy about his team's "AI adoption." They went from five pull requests a week to eight, sometimes nine. He was showing me the sprint velocity chart and honestly, it looked great.

Then I asked if I could look at their DORA metrics instead.

Code review time had nearly doubled. The number of lines under review had grown as well (roughly +90%). And the rework rate - the percentage of code that gets deleted within 21 days of being written - had climbed from 5% to 25%.

That's one in four lines his team wrote last month. Already gone.

So the "speed" increase was actually just code that gets thrown away. More PRs, sure, but also more bugs, more time reviewing, and a lot of code that only existed long enough to create problems before someone deleted it.

That changed the conversation pretty quickly.

What happened there is actually not uncommon. The developers weren't doing anything wrong on purpose. They were sharing prompt snippets in Slack, copying Cursor rules from blog posts, using ChatGPT templates they found on Reddit. Nobody tested any of it. And by the time the metrics made the problem obvious, weeks of engineering time and money were already lost.

It's actually sad how often teams realise this too late. So I hope this post helps you avoid the expensive version of the same lesson.

The reality check: engineering or gambling?

At this point you have probably written some prompts. Maybe to generate unit tests, or as part of a customer support chatbot. They work most of the time, but sometimes produce different results. That's not surprising given there's a non-deterministic system under the hood, but it doesn't feel right either. We want certainty in our software, especially in its critical paths.

So for starters I want to look at some of the reasons why testing prompts is important. Apart from obvious ones like unpredictable regressions caused by API changes (OpenAI retired GPT-4o and three other models from ChatGPT in February 2026 alone) or the general consequences of baked-in non-determinism, there are things like:

  • Hidden biases the model brings from its training data (when the model "ignores" instructions in the prompt). Your prompt says one thing, the model's priors say another.
  • Debugging difficulty: it's hard to isolate a root cause without tracking the full context (which is a lot - input, prompt version, model version, parameters).
  • Garbage that's hard to catch: an ambiguous prompt still produces an output, it's just unreliable. Unlike, say, syntax errors that break code immediately, these failures don't surface quickly. In April 2025, OpenAI pushed a system prompt update that made GPT-4o excessively flattering for its 500 million weekly users. They later admitted they "focused too much on short-term feedback." It took days and a social media firestorm before they rolled it back. That system prompt change wasn't treated as a release candidate. Nobody tested it.

GitClear's 2025 analysis of 211 million lines of code actually found that AI output creates what they called an "illusion of correctness" - the visual neatness and consistent style of generated code caused developers to trust it without thorough validation. Their data showed review participation fell nearly 30%.

And well, if OpenAI doesn't always get this right, the "prompts" your team shares without any control probably need some curation too.

Prompts are code. Yes, even that ChatGPT one.

Here are a few simple principles I've formulated for myself that help me a lot, and hopefully will be useful for you too:

  1. Prompts are code. Version them, review them and run automated evals on every change. Like you always do with your code.
  2. You haven't written all the instructions. When working with a black-box system, your input and instructions (while critical) do not define 100% of the output. Hidden biases often surface on edge cases. Diverse test inputs catch what prompt tweaking alone doesn't.
  3. "Works on my machine" proves nothing. A prompt that works today may silently break after a model update. Only automated, repeatable tests can give you confidence.
  4. Garbage in, garbage out. Prompt quality is the most critical success factor. Test your inputs even more than your outputs.
  5. Can't reproduce - can't debug. Keep the full context of every run - inputs, params, model version and more - under your control, so you can catch an issue reliably (see the sketch after this list).
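
As a tiny illustration of that last point, here's what keeping the full context can look like in practice: every run is recorded with its inputs, parameters and model version, so a failing output can be replayed later. The record_run helper and the field names here are hypothetical - they only show the shape of the data.

from dataclasses import dataclass, asdict
import json

@dataclass
class PromptRun:
    # Everything needed to reproduce a single call, kept in one place
    prompt_version: str   # e.g. git hash or revision of the prompt file
    model: str            # exact model identifier returned by the API
    params: dict          # temperature, max_tokens, etc.
    inputs: dict          # the variables rendered into the prompt
    output: str           # raw model response

def record_run(run: PromptRun, log_path: str = "prompt_runs.jsonl") -> None:
    # Append-only JSONL log: cheap to write, easy to replay in a regression test
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(run)) + "\n")

A failing entry from a log like this can be turned directly into a regression test case.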

And there's one important thing most people miss: these principles apply not just to product prompts.

Think about it. The ChatGPT instruction you pasted into your team's wiki. The Cursor rules file your tech lead shared on Slack. The "system prompt for code review" your org adopted from a conference talk. If multiple people use it and nobody's actually measured whether it works - well, you have untested code running in production. You just don't call it that.

That gap between perception and reality is the whole problem.

Rule of thumb I use: if it lives in a file and runs more than once - it needs testing.

Prompt testing techniques

So if we need to start testing prompts, where do we begin?

Unfortunately direct assertions like assertEqual(LM_response, expected) won't work. You may have many valid outputs, and even at temperature=0 you are not guaranteed consistent outputs across runs. Quality lives on a spectrum - relevance, coherence, accuracy - not a binary pass/fail.

What still works from traditional testing?

Good news - many "traditional" principles are still applicable:

  • CI/CD integration. Automated pipelines, shifting tests left, running evals on every PR are still your best friends.
  • Test setup & structure. This didn't change much either: fixtures, datasets, parameterised tests and pytest-style structure are still applicable.
  • Structure (code-based) assertions. Validating output format (JSON/XML/CSV), mandatory fields, type constraints - fast, cheap, and they catch a lot of issues early. Do not underestimate them.
  • Heuristic (rule-based) checks. Also very effective - in my experience they catch ~30-40% of issues. They are good for things like output length constraints (too short = incomplete, too long = verbose), required elements (must contain XYZ, must have 3+ bullet points), etc.

Some quick examples:

# Heuristic checks
def test_response_basics(response: str):
    assert len(response) > 50, "Response too short - likely incomplete"
    assert len(response) < 2000, "Response too verbose"
    assert "disclaimer" in response.lower(), "Missing required disclaimer"
    assert response.count("•") >= 3, "Must contain at least 3 bullet points"

# Code-based (structure) assertions
import json
from jsonschema import validate

schema = {
    "type": "object",
    "required": ["summary", "action_items", "priority"],
    "properties": {
        "summary": {"type": "string", "minLength": 10},
        "action_items": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "string"}
        },
        "priority": {"enum": ["low", "medium", "high"]}
    }
}

def test_output_structure(llm_response: str):
    parsed = json.loads(llm_response)  # fails if not valid JSON
    validate(instance=parsed, schema=schema)  # fails if schema mismatch

The "new" stuff

So the gap that traditional testing can't cover gets addressed with other techniques. Here are the main ones:

Semantic similarity lets you validate whether the response conveys the same meaning as the expected one. To simplify: it converts the actual response and the expected output into vectors and measures the cosine similarity between them. BERTScore handles this locally - no API calls, no cost per evaluation.

from bert_score import score as bert_score

def test_semantic_similarity(actual: str, expected: str):
    P, R, F1 = bert_score(
        [actual], [expected],
        lang="en",
        model_type="microsoft/deberta-xlarge-mnli"
    )
    assert F1.item() > 0.78, f"Semantic drift detected: {F1.item():.3f}"

Use it when you have "golden sample" outputs and need to detect drift. Typical threshold sits around 0.7-0.85. Tuning that number is where the art comes in (I usually start at 0.75 and adjust based on what my golden samples actually score).
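
If you're not sure where to put that threshold, one way to ground it (a sketch, assuming you already have a handful of golden pairs and a few known-bad outputs) is to score both groups and pick a value that separates them:

from bert_score import score as bert_score

def suggest_threshold(golden_pairs, bad_pairs):
    # Each pair is (actual, expected); score both groups and look at the gap
    def f1_scores(pairs):
        actuals, expecteds = zip(*pairs)
        _, _, f1 = bert_score(list(actuals), list(expecteds), lang="en",
                              model_type="microsoft/deberta-xlarge-mnli")
        return f1.tolist()

    good = f1_scores(golden_pairs)   # should score high
    bad = f1_scores(bad_pairs)       # should score noticeably lower
    # Naive split: halfway between the worst good score and the best bad score
    return (min(good) + max(bad)) / 2

Even if you never automate this, just looking at the two score distributions tells you whether 0.75 is realistic for your data.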

LLM-as-judge is a method where another LLM scores output against certain criteria. It's mostly used to evaluate response quality aspects like relevancy, correctness, tone, etc. The judge evaluates the response and returns a score in the [0, 1] range, which can be tracked and used as a quality gate (with a threshold).
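
Before reaching for a framework, it's worth seeing how little is behind the idea. Here's a minimal, framework-free sketch (assuming an OpenAI-style client; the judge prompt, the model name and the generate_reply function under test are all illustrative):

import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a customer support reply.
Criteria: relevance to the question, factual correctness, polite tone.
Return JSON: {{"score": <float between 0 and 1>, "reason": "<one sentence>"}}

Question: {question}
Reply: {reply}"""

def judge(question: str, reply: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, reply=reply)}],
    )
    return json.loads(response.choices[0].message.content)["score"]

def test_support_reply_quality():
    question = "How do I reset my password?"
    reply = generate_reply(question)  # the prompt under test (hypothetical helper)
    score = judge(question, reply)
    assert score >= 0.7, f"Judge scored the reply below threshold: {score:.2f}"

A framework like DeepEval essentially industrialises this loop, and that's what the next section uses.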

From my experience other techniques are a bit more intuitive, and LLM-as-judge raises more questions, which is why I want to go deeper into it.

LLM-as-judge: a closer look

For this section I want to use the example of a dummy SDLC "agent" that does TDD (it's really just two small prompts). For demonstration I'm going to use DeepEval (no specific reason - I'm just used to it; other solutions are just as good).

Defining metrics

First, we define what we measure. Instead of asking "is this good?" we evaluate against specific criteria so "the judge" knows what to score.

For example, for the code-gen prompt I'm checking whether the generated code satisfies the unit tests (as it's kind of a TDD process):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

code_satisfies_tests = GEval(
    name="CodeSatisfiesTests",
    criteria="Implementation would make all provided test cases pass.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.7
)

GEval is a DeepEval built-in that lets you create custom metrics. The technique comes from a paper called G-Eval, which showed a 0.514 Spearman correlation with human judgments - the highest of any automated method at the time.

Another measurement worth showing is alignment - at a high level, it checks whether the prompt's output follows the instructions:

from deepeval.metrics import PromptAlignmentMetric

aaa_pattern = PromptAlignmentMetric(
    prompt_instructions=[
        "Each test follows Arrange-Act-Assert pattern",
        "Each test has a single assertion"
    ],
    threshold=0.7
)

The score here is calculated as instructions_followed / total_instructions. I.e. if the prompt says "one assertion per test" and "use AAA pattern," the metric checks both explicitly; if only one of the two is followed, the score is 0.5. The threshold is the pass/fail line below which the test fails (I always start at 0.5-0.7 and tighten based on data).

Writing the test

This part might actually feel more familiar, as the structure closely mirrors pytest. assert_test evaluates your LLM output against the metrics:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from src.metrics.code_gen import code_satisfies_tests
from src.prompt_runner import run_prompt
from datasets.tdd_prompts import create_code_gen_dataset

def test_code_generation():
    dataset = create_code_gen_dataset()

    for golden in dataset.goldens:
        actual_output = run_prompt("code-gen.md", {
            "task": golden.input,
            "design": golden.additional_metadata.get("design", ""),
            "spec": golden.additional_metadata.get("spec", "")
        })

        test_case = LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            expected_output=golden.expected_output
        )

        assert_test(test_case, [code_satisfies_tests])

But here's the trade-off: LLM-as-judge is slower and more expensive than heuristics and code assertions. So don't reach for it first. Use heuristics and code assertions for the cheap wins. Add semantic similarity when you have golden samples. Reserve LLM-as-judge for the quality criteria you genuinely can't check with code.
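
In practice that ordering can live inside a single test: run the cheap checks first and only pay for the judge if they pass. Here's a sketch combining the pieces from above (reusing run_prompt, the golden dataset entries and the code_satisfies_tests metric defined earlier; the heuristics themselves are illustrative):

from bert_score import score as bert_score
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from src.metrics.code_gen import code_satisfies_tests
from src.prompt_runner import run_prompt

def check_code_gen_layered(golden):
    output = run_prompt("code-gen.md", {
        "task": golden.input,
        "design": golden.additional_metadata.get("design", ""),
        "spec": golden.additional_metadata.get("spec", "")
    })

    # Layer 1: heuristics - fail fast, zero cost (assuming Python output)
    assert len(output) > 50, "Response too short - likely incomplete"
    assert "def " in output, "No function definition found"

    # Layer 2: semantic similarity against the golden sample - cheap, runs locally
    _, _, f1 = bert_score([output], [golden.expected_output], lang="en")
    assert f1.item() > 0.75, "Semantic drift from the golden sample"

    # Layer 3: LLM-as-judge - only reached if the cheap layers pass
    assert_test(
        LLMTestCase(input=golden.input,
                    actual_output=output,
                    expected_output=golden.expected_output),
        [code_satisfies_tests],
    )

Call it inside the same golden loop as the earlier test; most failures never reach the expensive layer.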

Component-level: when one prompt isn't enough

While end-to-end evals are fine for a single prompt, there are cases where they're not enough - multi-step pipelines, agents, RAG systems, etc. When one of those fails, you need to know which exact component broke.

So here's another important rule I formulated for myself and try to follow:

Monolithic prompts that do multiple things are hard to debug. Split them into focused steps (prompt chaining), each with clear inputs and outputs. Test the parts independently, so that when something fails, you know exactly where.

DeepEval's @observe decorator turns individual steps into spans, each of which can get its own metrics. A full execution creates a trace containing all spans:

from deepeval.tracing import observe
from src.prompt_runner import run_prompt
# code_satisfies_tests and aaa_pattern are the metric objects defined earlier

@observe(type="llm")
def generate_tests(task, design):
    return run_prompt("test-gen.md", {"task": task, "design": design})

@observe(type="llm")
def generate_code(task, design, spec):
    return run_prompt("code-gen.md", {"task": task, "design": design, "spec": spec})

@observe(type="agent", metrics=[code_satisfies_tests, aaa_pattern])
def tdd_flow(task, design):
    spec = generate_tests(task, design)
    code = generate_code(task, design, spec)
    return code
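
And because each span is just a function, you can also point the earlier assert_test pattern at a single step. For example, checking only the test-generation span against the AAA metric (a sketch; the task and design strings are made up):

from deepeval import assert_test
from deepeval.test_case import LLMTestCase

def test_generate_tests_follows_aaa():
    task = "Implement a slugify(title) helper"
    design = "Single pure function, no I/O"

    # generate_tests and aaa_pattern come from the snippets above
    spec = generate_tests(task, design)  # the span under test, in isolation

    assert_test(LLMTestCase(input=task, actual_output=spec), [aaa_pattern])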

Start here, not everywhere

Don't try to boil the ocean. Pick one prompt - the one that matters most. Pick one or two metrics. Start with a threshold of 0.5-0.7 and tighten based on actual data, not gut feel.

To keep from getting lost, here's the rule of thumb I rely on: use code assertions and heuristics first - they are fast, cheap and catch ~30% of issues. Use embedding-based semantic similarity (e.g. BERTScore) once you have a "golden sample" dataset plus edge cases. Reserve LLM-as-judge for the quality criteria you can't check with code - and make those criteria specific.

And avoid these mistakes, because I've made every single one:

Overfitting to test cases. Your prompt works flawlessly with your five examples but falls apart on real inputs. Test data ≠ all possible inputs.

Thresholds too high. Setting 0.95 as your pass/fail line and then wondering why good outputs keep failing. LLM-as-judge has variance. Start at 0.5-0.7, tighten based on data.

Testing only happy paths. Clean inputs pass. Production inputs break. Include edge cases: empty inputs, malformed data, languages you didn't plan for (the sketch after this list shows what that can look like).

Skipping human review. Automation catches regressions, not nuance. Use both: automated evals for CI, periodic expert review for calibration.

Testing too late. Finding issues after your prompt is deployed to users, not before. Shift left. Run evals on every PR.

Measuring everything at once. Ten metrics from day one, none properly tuned. Start with one or two critical criteria. Add more when those are stable.
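
To make the edge-case point concrete: the parameterised tests from the traditional toolbox are all you need. A sketch (the inputs and the summary.md prompt file are illustrative; run_prompt is the same helper as above):

import pytest
from src.prompt_runner import run_prompt

EDGE_CASES = [
    "",                                      # empty input
    "   \n\t ",                              # whitespace only
    "{'broken': json,,}",                    # malformed data pasted by a user
    "Résumé en français, s'il vous plaît",   # a language you didn't plan for
    "word " * 5000,                          # absurdly long input
]

@pytest.mark.parametrize("user_input", EDGE_CASES)
def test_summary_prompt_survives_edge_cases(user_input):
    output = run_prompt("summary.md", {"text": user_input})
    # Heuristic floor: the prompt should degrade gracefully, not produce garbage
    assert output is not None
    assert len(output) < 2000, "Verbose failure mode on an edge-case input"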

The hard part isn't writing the test. It's defining what "good" means for your specific case. Once you have that, the tooling is the easy part.

So here's the simple rule: test any prompt that is shared with someone or will be used more than once.

That's it. Not just your product prompts. Your Cursor rules. Your ChatGPT templates. The "just paste this into Claude" messages on Slack.

This is a new reality, and I'd probably call it AI hygiene. Just as you wouldn't share untested code with your team, you probably shouldn't share untested prompts either.