When does data help automated agent engineering?

May 20, 2026 · Andrew Jesson

Claude Code can often improve another agent with no training data at all. Across seven applications, data helps only where Claude Code’s own prior knowledge of the task runs out, and how far its guesses drift from the real data mostly tells you which cases those are. The catch is that drift reveals when Claude Code is guessing, not whether guessing actually costs anything.

An agent is more than a model. It is also the prompts, tools, context management, guardrails, orchestration, and compute infrastructure designed around the model.¹,² When the agent does something wrong, an agent engineer tunes one or more of these knobs so that the error never happens again. Automated agent engineering goes meta by putting an AI agent in the role of agent engineer.³,⁴

Claude Code and Codex can improve agent prompts and perform engineering practices like iteratively evaluating their changes. But what happens without training data? Surprisingly, Claude Code’s improvements performed as well without training data as with it on several of the applications I tested. On three (Wordle, scientific paper reproduction, business management) data improved task success rate by between five and twenty percentage points, but on the other four (entity extraction, contract extraction, customer service, software engineering) the without-data version performed roughly the same.

Figure: Dataset novelty vs. data-ablation gap (Spearman ρ = +0.79, n = 7).

Looking into the conversation histories explained why. Even without training data, Claude Code still ran ad hoc evaluations: it made inferences on inputs it generated itself and judged whether the output matched its prompt edit. So the operative question is not whether Claude Code has data; it is whether Claude Code already knows what the data looks like. Data helps exactly when Claude Code’s prior knowledge of the application runs out, and that prior runs out in specific, identifiable ways: the corpus is not memorized, the task shape is unfamiliar, or the harness never says what the agent will actually be shown.

That suggests a way to measure the missing prior from the outside: how far Claude Code’s self-generated inputs drift from the real data. Across the seven applications, this drift tracks the data-ablation gap (the test-set success rate with data minus the rate without it) at a Spearman rank correlation of +0.79 (exact two-sided p = 0.040; Pearson r = +0.96). The lone clean exception is the instructive part: drift measures whether Claude Code is guessing, not whether guessing hurts the metric. Those are two different axes, and one task scores high on the first while shrugging off the second.

The experiment

The seven applications I explored are: named entity extraction (NER), NDA clause extraction, Wordle, customer service, software engineering, business management, and scientific paper reproduction (Science).

For each application agent I ran the same experiment under two conditions, with data and without, using five independent seeds per condition. Claude Code was given configuration files for the agent (prompts, models, tool lists, …). It was instructed to improve the initial prompts. The new prompts were scored on a held-out set of test tasks. The only difference was whether Claude Code was given 100 real traces as training data. In the without-data condition, the same baseline-data paths existed but the trace files were empty.

Claude Code as Agent Harness Engineer

The optimizer is Claude Code running against an internal harness that lets it edit the application agent configuration file. Its tool access is limited to Read, Bash, Edit, and Write, with no external MCP servers.
The optimizer runs inside an isolated Docker container (base image node:24-slim) containing only the Claude Code CLI, curl, and git. There is no Python and no eval source code on the filesystem. The container shares a Docker network with the gateway, so Bash can curl http://gateway:3000/inference ... to test prompts but has no other route to the application code.
Claude Code is running claude-sonnet-4-6. The application agent model is gpt-5.4-mini across all seven applications.

Claude Code is given the following instruction at the start of every run, with placeholders like {config_dir} and {function_name} resolved per application before the run begins. The contents are held constant across all conditions.

# TensorZero Function Optimizer

You are optimizing a TensorZero function to improve its performance metric.

## Environment

- T0 config files: {config_dir}/ (only these and the baseline data below are relevant — don't explore elsewhere)
- Gateway URL: {gateway_url}
- Pre-dumped baseline data: {baseline_data_dir}/ (read-only; direct DB access is not available)
- Restart after config edits: `curl -sf -X POST http://eval:5111/restart-gateway`
- Isolated container. No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e "..."` for JSONL parsing (`readline` + `JSON.parse` + project to stdout) — prefer it over shell pipelines when you need fields per row.
- Don't set `temperature` on any variant (some models reject non-default values). Keep an `initial` variant as a baseline reference.
- Don't run evaluation episodes yourself — the harness does that after you exit.

## Task

- Function: `{function_name}`
- Metric: `{metric_name}`. Check the metric's `optimize` field in `tensorzero.toml` for direction (boolean and float metrics may minimize or maximize).
- Baseline performance: {baseline_metrics}

## Available Models

{model_list}

## Baseline data

- `{baseline_data_dir}/inferences.jsonl` — one row per inference (what the model said per task).
- `{baseline_data_dir}/feedback.jsonl` — one row per metric value.
- `{baseline_data_dir}/initial_config/` — read-only copy of the starting T0 config tree.

Files are often 20+ MB. Don't `cat` them whole. Start by `head -3` on each to learn the row shape (field names and nesting vary by env), then project out the fields you need.

### The projection pattern

`grep` first to narrow, then `node -e` to project:

```bash
grep $TARGET_ID {baseline_data_dir}/inferences.jsonl \
  | node -e "
      require('readline').createInterface({input: process.stdin}).on('line', l => {
        const r = JSON.parse(l);
        console.log(r.id, r.variant_name, JSON.stringify(r.output).slice(0,200));
      });"
```

`cat inferences.jsonl | ...` loads the whole file; `grep`-first keeps the pipeline cheap.

### Cross-record one-liners

Adapt the failure predicate to your metric — boolean uses `"value":0` / `"value":1`; float values depend on `optimize` direction.

```bash
# Inferences per episode
grep -o '"episode_id":"[^"]*"' {baseline_data_dir}/inferences.jsonl | sort | uniq -c | sort -rn | head

# Last inference of a failing episode
grep $FAIL_ID {baseline_data_dir}/inferences.jsonl | tail -1

# Which metrics are present
grep -o '"metric_name":"[^"]*"' {baseline_data_dir}/feedback.jsonl | sort | uniq -c

# target_ids of failures (boolean example — adapt the predicate for float metrics)
grep '"metric_name":"{metric_name}"' {baseline_data_dir}/feedback.jsonl \
  | node -e "
      require('readline').createInterface({input: process.stdin}).on('line', l => {
        const r = JSON.parse(l);
        if (r.value === 0 || r.value === false) console.log(r.target_id);
      });" > /tmp/failed.txt
head -5 /tmp/failed.txt | while read id; do grep "$id" {baseline_data_dir}/inferences.jsonl | head -1; done
```

### Templates, schemas, and the required `content` shape

TensorZero has two co-existing config styles. Check which one the function uses in `tensorzero.toml`:

**Legacy** (per-role):

```toml
[functions."my_fn"]
user_schema = "functions/my_fn/user_schema.json"   # and system_schema, assistant_schema

[functions."my_fn".variants.initial]
user_template = "functions/my_fn/initial/user_template.minijinja"
```

**New** (named):

```toml
[functions."my_fn"]
schemas.user_query.path = "functions/my_fn/user_query_schema.json"

[functions."my_fn".variants.initial]
templates.user_query.path = "functions/my_fn/initial/user_query.minijinja"
```

**Canonical `content` block for a templated message** (both styles):

```json
"content": [{
  "type": "template",
  "name": "<template_name>",
  "arguments": { /* object matching the schema */ }
}]
```

For legacy, `"name"` is the role (`"user"` / `"system"` / `"assistant"`). For new, it's the key under `schemas.` / `templates.`.

For a role with no schema: `"content": "Hello"` or `[{"type":"text","text":"Hello"}]`.

**Example** — τ-retail `user_schema.json` and the matching curl body:

```json
// user_schema.json
{ "properties": { "observation": { "type": "string" } },
   "required": ["observation"], "type": "object" }

// curl body
{ "function_name": "tau_bench_retail_v0::act",
   "variant_name": "your_new_variant",
   "input": { "messages": [{ "role": "user", "content": [{
     "type": "template", "name": "user",
     "arguments": { "observation": "Hello, I need to cancel my order." }
   }] }] } }
```

## Methodology

The core loop is: survey the baseline → add variants → test one → iterate. The decisions worth getting right:

- **Metric direction defines "failure."** Don't assume `value:0` is bad; read the metric's `optimize` field.
- **Judge manual variant tests by the `curl /inference` output itself** — right tool call, right JSON, right content.
- **Multi-turn agentic envs** (customer service, business management, coding) need real conversational state to be representative. Pick a real episode from `inferences.jsonl`, copy its first 2–3 messages into your curl body, check how the variant continues. A turn-0 probe alone tells you little.
- **When done, leave the best config in place** with the experimentation section below, and exit.

## Routing: Experimentation Config

After creating new variants, add an experimentation section — otherwise the gateway round-robins and wastes test episodes on bad variants. Keep candidates to your best ~3–4, including `initial` as a baseline.

```toml
[functions."{function_name}".experimentation]
type = "track_and_stop"
metric = "{metric_name}"
candidate_variants = ["initial", "your_new_variant_1", "your_new_variant_2"]
fallback_variants = []
min_samples_per_variant = 5
delta = 0.1
epsilon = 0.0
update_period_s = 5
min_prob = 0.0
max_samples_per_variant = 10000
```

I report the gap in success rate between two runs, score(with data) − score(without data), on the y-axis of the chart above. If data helps, the gap is positive. If data is unnecessary, the gap is around zero.

Evaluation of optimized variants

The per-application metric is binary (success / no success), measured on a held-out test set of up to 100 episodes per (seed, variant). What counts as success depends on the application:
- NER: exact match. The agent correctly identifies and classifies every named entity in the input sentence.
- NDA: exact match. The agent correctly extracts the four target fields (effective_date, jurisdiction, party, term) from the document.
- Wordle: the 5-letter target word is guessed within six attempts.
- Customer service (τ-bench retail): the agent completes the user’s request with the correct database changes (verified by state comparison) and communicates all required information.
- Software engineering (terminal-bench): every test case in the task suite passes.
- Business management (YC bench): the company survives to the simulation horizon without bankruptcy.
- Scientific paper reproduction (replicationbench): every per-paper verifier reward component evaluates to ≥ 1.0 (Harbor’s resolved metric).

Five seeds per (application, condition).
The y-axis of the chart is the per-application gap in success rate: score(with data) − score(without data), with the median across seeds plotted as the point and the inter-quartile range as the vertical whisker.
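As a rough illustration of how the plotted point and whisker could be computed, here is a minimal sketch. It assumes seeds are paired by index across the two conditions and uses the "inclusive" quartile convention; the post does not state either choice, and the success rates are hypothetical placeholders rather than results.

```python
from statistics import median, quantiles


def gap_summary(with_data, without_data):
    """Return (median, q1, q3) of per-seed gaps in percentage points."""
    # Assumption: seed i with data is paired with seed i without data.
    gaps = [100 * (w - wo) for w, wo in zip(with_data, without_data)]
    q1, _, q3 = quantiles(gaps, n=4, method="inclusive")
    return median(gaps), q1, q3


# Hypothetical success rates for five seeds; not results from this post.
with_data = [0.60, 0.55, 0.65, 0.58, 0.62]
without_data = [0.50, 0.52, 0.48, 0.51, 0.49]

point, q1, q3 = gap_summary(with_data, without_data)
print(f"gap = {point:+.1f} pp (IQR {q1:+.1f} to {q3:+.1f})")
```

A positive median with a whisker that stays above zero would correspond to the "data helps" case; a whisker straddling zero corresponds to "roughly the same".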

How to estimate novelty?

Claude Code constructed examples to test the prompts it wrote. It did this several times per run by making inferences on example user inputs and checking the responses. With access to real data, it copied those examples from the trace dump. Without access, it generated synthetic examples with no additional prompting, which surprised me. Immediately, I wanted to know how different the synthetic examples were from the real ones.

Real: baseline (episode c5a7cb5d)

user

Japan July refined zinc imports off 47.5 pct yr / yr .

assistant

{
  "person": [],
  "location": [
    "Japan"
  ],
  "organization": [],
  "miscellaneous": [
    "July",
    "refined zinc",
    "refined zinc imports",
    "47.5 pct yr / yr"
  ]
}

Synthetic: synth seed4 row20

user

The COP30 climate summit in Belém, Brazil drew delegates from 190 nations. UN Climate Chief Simon Stiell praised pledges from China, India, and the European Union. US Climate Envoy John Podesta and Brazilian President Luiz Inácio Lula da Silva co-chaired the closing session.

assistant

{
  "person": [
    "Simon Stiell",
    "John Podesta",
    "Luiz In\u00e1cio Lula da Silva"
  ],
  "organization": [
    "UN",
    "European Union"
  ],
  "location": [
    "Bel\u00e9m",
    "Brazil",
    "China",
    "India"
  ],
  "miscellaneous": [
    "COP30"
  ]
}

To investigate the difference, I devised a dataset-synthesis pipeline and ran it on the seven applications. Given just the application agent’s config and the dataset schema, Claude Code was instructed to generate 20–40 example conversations. Across five independent seeds per application, I computed the squared maximum mean discrepancy ($\mathrm{MMD}^2$) in embedding space (Voyage voyage-3-large) between the synthetic corpus and the real-trace corpus, and I report the median across seeds as the per-application novelty score. In other words, novelty measures how different the real traces are from what Claude Code guesses the data should look like without access to those traces. Across the seven applications this score tracks the data-ablation gap at Spearman ρ = +0.79 (exact two-sided p = 0.040; at n = 7 the asymptotic approximation is unreliable, so I report the exact permutation value). MMD² was the first and only drift estimator I tried: a standard non-parametric two-sample distance, fixed before I looked at the gaps. So this is a single pre-chosen statistic, not the best of a search over estimators.
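For readers unfamiliar with exact permutation p-values, the following is a minimal sketch of the idea: compute Spearman's ρ on the ranks, then enumerate all 7! = 5040 orderings of one rank vector and count how many give a correlation at least as extreme. The function names and the input numbers are hypothetical; this is not the analysis code behind the post.

```python
from itertools import permutations


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_exact(x, y):
    """Spearman rho and its exact two-sided permutation p-value."""
    rx, ry = ranks(x), ranks(y)
    rho = pearson(rx, ry)
    hits = total = 0
    for perm in permutations(ry):
        total += 1
        # Small slack guards against floating-point ties with rho itself.
        if abs(pearson(rx, perm)) >= abs(rho) - 1e-12:
            hits += 1
    return rho, hits / total


# Hypothetical scores for seven applications; not the values behind the chart.
novelty = [0.02, 0.03, 0.05, 0.08, 0.12, 0.20, 0.35]
gap = [0.0, 1.0, -1.0, 0.5, 12.0, 20.0, 6.0]
rho, p = spearman_exact(novelty, gap)
print(f"rho = {rho:+.2f}, exact two-sided p = {p:.3f}")
```

Full enumeration is only practical because n is tiny; for larger n a sampled permutation test or the asymptotic approximation would be the usual substitute.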

Dataset-novelty estimator (MMD²)

The goal is to estimate how surprising a dataset is to a coding agent like Claude Code or Codex that is instructed to be an agent engineer. To do this, I compare a real dataset to a dataset generated by the coding agent.

Each application is an LLM function with a defined input/output contract, like answering a customer-service ticket, extracting entities from a sentence, or playing a turn of Wordle. An inference is one call to that function: the input it received plus the output it returned, recorded as one row in a JSONL file. An episode is one logical interaction with the function, identified by a shared episode_id. A single-turn application like NER has exactly one inference per episode. A multi-turn application like Wordle chains several inferences into one episode (one inference per turn of the game). For each application I assume two corpora of such rows:

| Corpus | Source | Size |
| --- | --- | --- |
| Real baseline $\mathcal{B}$ | actual rows logged from prior runs of the function on real users / tasks | hundreds to ~20k rows |
| Synthetic $\mathcal{S}$ | rows invented by an agent given only the function’s spec (no real data seen) | 25–170 rows per seed |
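To make the row and episode vocabulary concrete, here is a small sketch that groups JSONL inference rows into episodes by `episode_id` and orders each episode's turns by `created_at`. The sample rows use short placeholder ids instead of real UUIDs and are purely hypothetical.

```python
import json
from collections import defaultdict


def group_by_episode(jsonl_lines):
    """Map each episode_id to its inference rows, ordered by created_at."""
    episodes = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():
            row = json.loads(line)
            episodes[row["episode_id"]].append(row)
    for rows in episodes.values():
        # ISO 8601 UTC strings in one format sort chronologically as text.
        rows.sort(key=lambda r: r["created_at"])
    return dict(episodes)


# Hypothetical rows: short placeholder ids stand in for real UUIDs.
lines = [
    '{"id": "i1", "episode_id": "ner-1", "created_at": "2026-05-15T18:42:11.123Z"}',
    '{"id": "i3", "episode_id": "wordle-1", "created_at": "2026-05-15T18:43:02.000Z"}',
    '{"id": "i2", "episode_id": "wordle-1", "created_at": "2026-05-15T18:43:01.000Z"}',
]
episodes = group_by_episode(lines)
for episode_id, rows in episodes.items():
    print(episode_id, [r["id"] for r in rows])
```

The single-turn episode holds one row, while the multi-turn episode holds one row per turn, which is why row counts per seed can exceed the episode budget.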

Because the coding agent is not conditioned on real data when it generates the synthetic dataset, the divergence between its distribution over datasets given the task and the distribution over real datasets is an indicator of novelty. I therefore want a scalar that measures the divergence between the distribution of rows in $\mathcal{S}$ and the distribution of rows in $\mathcal{B}$. I chose the squared Maximum Mean Discrepancy ($\mathrm{MMD}^2$)⁵, a standard non-parametric two-sample statistic. It compares the kernel mean embeddings of two finite samples, and the estimate tends to zero as the samples grow when both are drawn from the same underlying distribution. A larger MMD² means the coding agent’s knowledge of the application, given the config, covers less of the actual deployment.
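As a concrete reference point, with a kernel $k$, synthetic rows $s_1, \dots, s_m$ and baseline rows $b_1, \dots, b_n$, the biased estimate is $\frac{1}{m^2}\sum_{i,j} k(s_i, s_j) + \frac{1}{n^2}\sum_{i,j} k(b_i, b_j) - \frac{2}{mn}\sum_{i,j} k(s_i, b_j)$. The sketch below implements that formula. The Gaussian (RBF) kernel, the fixed bandwidth `gamma`, and the choice of the biased rather than unbiased estimator are all assumptions made for illustration, since the post does not specify them, and the toy 2-D vectors stand in for real embeddings.

```python
import math
from statistics import median


def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two embedding vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(synthetic, baseline, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        mean_kernel(synthetic, synthetic, gamma)
        + mean_kernel(baseline, baseline, gamma)
        - 2 * mean_kernel(synthetic, baseline, gamma)
    )


# Toy 2-D vectors standing in for embeddings; hypothetical, not real data.
baseline = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near_seed = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.9)]
far_seed = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

per_seed = [mmd2(seed, baseline) for seed in (near_seed, far_seed)]
novelty = median(per_seed)  # median across seeds, as described in the post
print(per_seed, novelty)
```

The seed whose points sit on top of the baseline scores close to zero and the displaced seed scores much higher, which is the behaviour the novelty score relies on. The pairwise sums are quadratic in corpus size, so a baseline of ~20k rows would likely need a vectorized implementation or subsampling.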

Generating the synthetic corpus

The synthetic corpus $\mathcal{S}$ is produced by a coding agent (Claude Code or Codex) given only:

- the function’s machine-readable specification: input schema, output schema, system prompt, available tools, and the set of defined evaluation metrics;
- the schemas of the two output files (inferences.jsonl row schema and feedback.jsonl row schema).

The agent has no access to real data during synthesis.

The procedure has five steps (read the spec, plan input coverage, generate inputs and outputs, calibrate periodically with a few live probe calls, then emit feedback values), reproduced verbatim in the SKILL.md and methodology.md instruction files below.

The output is two files: inferences.jsonl (one row per inference) and feedback.jsonl (one row per metric value, linked to the inference or episode it scores). Both are schema-validated before the run exits. The episode budget is a parameter set per run; in this analysis it was set to 20–40 episodes per application.
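To illustrate what linking feedback to inferences involves, here is a hedged sketch of two checks on the pair of files: the episode budget and the requirement that every feedback `target_id` points at an existing inference or episode. It is not the validator used in the experiments; `check_outputs` and the `metric_levels` mapping (metric name to `"inference"` or `"episode"`) are hypothetical names introduced for this example.

```python
def check_outputs(inferences, feedback, metric_levels, min_episodes, max_episodes):
    """Return a list of problems found; an empty list means the checks passed."""
    errors = []
    inference_ids = {row["id"] for row in inferences}
    episode_ids = {row["episode_id"] for row in inferences}

    # Budget: the number of distinct episodes must fall inside the range.
    if not min_episodes <= len(episode_ids) <= max_episodes:
        errors.append(
            f"{len(episode_ids)} episodes outside [{min_episodes}, {max_episodes}]"
        )

    # Referential integrity: each feedback row must point at an existing target.
    for row in feedback:
        level = metric_levels.get(row["metric_name"])
        if level is None:
            errors.append(f"unknown metric {row['metric_name']!r}")
            continue
        targets = inference_ids if level == "inference" else episode_ids
        if row["target_id"] not in targets:
            errors.append(f"dangling target_id {row['target_id']!r}")
    return errors


# Hypothetical rows with placeholder ids and a made-up metric name.
inferences = [
    {"id": "i1", "episode_id": "e1"},
    {"id": "i2", "episode_id": "e1"},
    {"id": "i3", "episode_id": "e2"},
]
feedback = [
    {"metric_name": "solved", "target_id": "e1", "value": True},
    {"metric_name": "solved", "target_id": "e9", "value": False},
]
problems = check_outputs(inferences, feedback, {"solved": "episode"}, 2, 40)
print(problems)
```

Here the second feedback row refers to an episode that was never written, so it is reported as dangling while the budget check passes.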

I run K independent agent seeds per application (K = 5 in this analysis), so the dataset-novelty estimator can be aggregated across runs.

The instruction files the synthesis agent reads are reproduced below.

SKILL.md

The top-level instruction the synthesis agent receives:

---
name: dataset-synthesis
description: Synthesize representative inferences and feedback for an LLM application described by a TensorZero configuration. Use when a plausible baseline corpus is needed for a function that has not yet collected real data.
---

# TensorZero Dataset Synthesis

You are synthesizing a _plausible_ dataset for a TensorZero function. You will produce two JSONL files that look like what `inferences.jsonl` and `feedback.jsonl` _would_ contain after the function had run live for a while. Crucially, you do **not** have any real baseline data to draw from, but the configuration files should provide you with enough information about the application to generate sensible examples.

## Environment

- T0 config files: `{config_dir}/`
- Gateway URL: `{gateway_url}` (you may POST to `/inference` to spot-check your understanding of the input structure)
- Output directory: `{output_dir}/` — write `inferences.jsonl` and `feedback.jsonl` here
- Isolated container. No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e "..."` for JSONL parsing.
- Emit rows with `variant_name: "initial"` only.

## Task

- Function: `{function_name}`
- Metrics defined for this function: `{metric_name_list}` (read their `kind`, `level`, and `optimize` fields from `tensorzero.toml`)
- Budget: at least `{min_episodes}`, but no more than `{max_episodes}` episodes. An episode is one logical interaction with the function — a single inference for single-turn functions, a chain of inferences sharing one `episode_id` for multi-turn.
- Output files:
  - `{output_dir}/inferences.jsonl` — one row per inference call (see reference/inferences_schema.md)
  - `{output_dir}/feedback.jsonl` — one row per metric value, with `target_id` referring to the `inference_id` or `episode_id` of a row in the inferences file (see reference/feedback_schema.md)

## Workflow

Five steps. See reference/methodology.md for the long form; the short version:

1. **Read the spec.** Open `{config_dir}/tensorzero.toml` and the linked schema / template files. Note: input schema, output schema, function type (chat vs tool), defined metrics, and whether the function is one-shot or part of a multi-turn episode.
2. **Hypothesize the input distribution.** What kinds of users / states does this function see in deployment? Sketch a coverage plan: how many length buckets, which schema slots vary, which edge cases matter. Aim for diversity, not just a single canonical mode.
3. **Generate inputs.** Plan out at least `{min_episodes}` but no more than `{max_episodes}` distinct episodes. For multi-turn functions, decide each episode's length up front based on what's realistic for the task (a 4-turn episode contributes 4 rows sharing one `episode_id`).
4. **Spot-check via the gateway.** Periodically POST a synthetic input to `{gateway_url}/inference` to confirm your understanding of the input structure is correct and to see what the `initial` variant's output actually looks like. Gateway calls are expensive — treat this as a calibration step, not as the way to generate every row. Generate outputs yourself in between checks.
5. **Generate feedback rows.** For each metric in `{metric_name_list}`, emit one feedback row per appropriate target (per-inference or per-episode based on the metric's `level`).

After generating, validate:

```bash
node /skill/scripts/validate.js {output_dir} \
  --config {config_dir}/tensorzero.toml \
  --min-episodes {min_episodes} --max-episodes {max_episodes}
```

The validator checks schema compliance, referential integrity, budget, and the `variant_name == "initial"` invariant. Fix any errors it reports before exiting.

## Output contract

When you exit, `{output_dir}/` must contain exactly:

- `inferences.jsonl` — every row conforms to the schema in `reference/inferences_schema.md`; the rows span at least `{min_episodes}` and at most `{max_episodes}` distinct `episode_id`s
- `feedback.jsonl` — one or more rows per metric, with every `target_id` referring to an `id` (for inference-level metrics) or `episode_id` (for episode-level metrics) that exists in `inferences.jsonl`

Do not write any other files in `{output_dir}/`. Do not modify `{config_dir}/`. Stay within budget — don't issue gateway calls indefinitely.

## Principles

- **Quality of coverage beats quantity of duplicates.**
- **Use the gateway as a calibration tool.** A periodic `/inference` call confirms your understanding of the input structure and shows you what the `initial` variant actually emits. It's not a way to generate every row — gateway calls are expensive, and it's fine to generate outputs yourself between checks.
- **Don't peek.** You don't have baseline data. If you find yourself wanting to "look at a real example," that's the signal to make a better-reasoned guess from the spec instead.
- **Plausibility includes failure.** Some inferences will fail their metric. Your feedback distribution should reflect a realistic failure rate for the task — not 100% success.

reference/methodology.md

The longer methodology the agent can consult:

# Synthesis methodology

The recipe for producing a faithful `inferences.jsonl` + `feedback.jsonl` from spec alone. Five steps, each with concrete things to look for.

## 1. Read the spec carefully

Open `{config_dir}/tensorzero.toml`. For the target function, capture:

- **Type**: `chat` vs `tool` / `json`. This determines `output` shape.
- **Schemas**: input (per-role or named), output (for tool-call functions). Read every referenced `.json` and every `.minijinja` template.
- **Metrics**: which are defined, their `type`, `level`, `optimize`. These dictate the `feedback.jsonl` rows you'll write.
- **System prompt**: usually inside the variant's template. Read it — this is the strongest signal about what the function is for.
- **Tool list** (for tool functions): names, descriptions, argument schemas.

Two patterns to check early:

```bash
# What kind of function?
grep -A2 "^\[functions\.\"{function_name}\"\]" {config_dir}/tensorzero.toml

# Which metrics are defined?
grep -E "^\[metrics\." {config_dir}/tensorzero.toml
```

## 2. Hypothesize the input distribution

Before generating anything, sketch a plan. For each schema slot in the input:

- What value ranges / shapes does it plausibly take in deployment?
- Are there subpopulations (long vs short, simple vs nested, single vs multi-entity)?
- What's the realistic length / complexity distribution?

Write the plan as a comment-level outline before the first row. Something like:

```
Plan for {function_name}
(budget: at least {min_episodes}, at most {max_episodes} episodes)
  - Target ~N episodes × K turns each
  - Mix of the major user intents the function supports
  - Vary user tone / register across episodes
  - Cover authentication / setup steps the function expects before the main action
```

Don't skip this step. Generating without a plan reliably produces a stack of near-duplicates of the same canonical input.

## 3. Generate inputs

For each row in your plan:

- Construct the `input.messages` array per the schema rules in inferences_schema.md.
- For multi-turn: episode by episode. Within one episode, mint a fresh `episode_id`, then chain inferences — each turn's `input.messages` is the previous turns' `input` plus `assistant` reply plus next user turn.

Tools you'll use:

- `node -e` for any structured generation (writing JSON bodies, looping, minting UUIDs).
- A working directory in `/tmp` for intermediate files (probe bodies, response captures).
- `curl` to call the gateway.

## 4. Spot-check via the gateway

Periodically — not on every row — POST a synthetic input to the gateway and look at the response. The purpose is calibration, not generation:

- Confirm the input shape you've been building actually parses (template name correct, schema arguments well-formed).
- See what the `initial` variant's output structure looks like for that input, so the outputs you generate yourself stay faithful to it.
- Catch drift early — if the first spot-check shows your `arguments` object missing a required field, fix the generator before producing more rows.

```bash
node -e "
  const body = { /* function_name, variant_name: 'initial', input: ... */ };
  process.stdout.write(JSON.stringify(body));
" > /tmp/req.json

curl -sf {gateway_url}/inference \
     -H 'Content-Type: application/json' \
     --data @/tmp/req.json > /tmp/resp.json
```

A reasonable cadence: one spot-check before you start generating, one after the first episode, and one every ~5 episodes thereafter. Cheaper than per-row, sufficient to catch most schema mistakes.

Assemble each inference row from:

- Your minted `id` and `episode_id`
- The current `created_at`
- `"initial"` as `variant_name`
- Your `input` from step 3
- An `output` you write yourself, matching the structure you saw in the spot-checks (gateway response → guide for your own generation)

For multi-turn: within one episode, each turn's `input.messages` extends the previous turn's by appending the assistant reply and the next user message. Keep the chain coherent across turns of the same `episode_id`.

Write each row to `{output_dir}/inferences.jsonl` immediately — don't batch, so a crash mid-run preserves progress.

## 5. Generate feedback rows

For each metric in `{metric_name_list}`:

- Determine its `level` (inference vs episode) from the TZ config.
- For inference-level: walk every inference row and emit one feedback row per (inference, metric).
- For episode-level: walk every distinct `episode_id` and emit one feedback row per (episode, metric).

For the `value`:

- **If the metric is verifiable from the row alone** (e.g. exact_match against a known gold answer, or a length / format check), compute it programmatically.
- **Otherwise**, predict the value from input + output using your understanding of the task. Stay calibrated — see "realistic value distributions" in feedback_schema.md.

Write to `{output_dir}/feedback.jsonl`.

## Iteration / self-audit

After roughly a third of the planned episodes, stop and inspect what you've produced:

```bash
# Count rows and unique episodes
wc -l {output_dir}/inferences.jsonl
grep -o '"episode_id":"[^"]*"' {output_dir}/inferences.jsonl | sort -u | wc -l

# Variety of input templates / first 80 chars
node -e "
  require('readline').createInterface({input: require('fs').createReadStream('{output_dir}/inferences.jsonl')}).on('line', l => {
    const r = JSON.parse(l);
    const m = r.input.messages[0];
    const c = Array.isArray(m.content) ? m.content[0] : m.content;
    const s = typeof c === 'string' ? c : JSON.stringify(c.arguments || c);
    console.log(s.slice(0, 80));
  });" | sort -u | head -20
```

Ask:

- Am I converging on one mode? (lots of near-identical first lines)
- Did I cover all the schema slots I planned for?
- Does my feedback distribution look reasonable?

If yes to mode collapse — diversify the remaining episodes by deliberately picking cases that look different from what's there. You have room to add more episodes up to `{max_episodes}`; you do not have to stop at the planned count if your coverage feels thin.

## When to stop and validate

Once you've reached at least `{min_episodes}` episodes (and no more than `{max_episodes}`) with each metric covered, run:

```bash
node /skill/scripts/validate.js {output_dir}
```

Read its output and fix any errors. Then exit.

## Anti-patterns

- **Skipping step 2.** "I'll just start generating" gives mode collapse 100% of the time.
- **Skipping step 4 entirely.** Without any spot-checks you have no signal that your input shape parses or that your outputs resemble what the model actually emits.
- **Treating all episodes as length 1.** For multi-turn functions, single-turn episodes are _unrepresentative_.
- **Generating one giant batch and writing at the end.** Write incrementally so a crash doesn't lose work.
- **Ignoring the metric `level`.** Inference-level vs episode-level changes which `target_id` you reference.

reference/inferences_schema.md

The schema for inferences.jsonl rows:

# `inferences.jsonl` row schema

One JSON object per line. Every row represents one call to `/inference` against the function. Multiple rows can share an `episode_id` (multi-turn episodes).

## Fields

| field          | type                   | required | notes                                                                                                                                  |
| -------------- | ---------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `id`           | string (UUID v7)       | yes      | Unique per inference. UUID v7 sorts by timestamp — see "minting UUIDs" below.                                                          |
| `episode_id`   | string (UUID v7)       | yes      | One UUID per logical episode. For single-turn functions, this is fresh per row. For multi-turn, all rows in the same episode share it. |
| `created_at`   | string (ISO 8601, UTC) | yes      | E.g. `"2026-05-15T18:42:11.123Z"`. Should be monotonic within an episode.                                                              |
| `variant_name` | string                 | yes      | Always `"initial"` for this skill.                                                                                                     |
| `input`        | object                 | yes      | `{"messages": [...]}` — the request body's `input` field. See "input shape" below.                                                     |
| `output`       | array                  | yes      | The gateway's response content blocks. Shape depends on function type. See "output shape" below.                                       |

## Minting UUIDs

UUID v7 is required because TensorZero uses the embedded timestamp to order rows.

```js
// Node-only UUID v7 minter (no external deps)
function uuidv7() {
  const ts = BigInt(Date.now());
  const tsHex = ts.toString(16).padStart(12, "0");
  const rand = crypto.randomBytes(10);
  rand[0] = (rand[0] & 0x0f) | 0x70; // version 7
  rand[2] = (rand[2] & 0x3f) | 0x80; // RFC 4122 variant
  const r = rand.toString("hex");
  return `${tsHex.slice(0, 8)}-${tsHex.slice(8, 12)}-${r.slice(0, 4)}-${r.slice(4, 8)}-${r.slice(8, 20)}`;
}
```

For an episode of N turns, mint one `episode_id`, then mint N `id`s, advancing `created_at` by ~1s between them.
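
The recipe above can be sketched as follows. This is a minimal illustration, not part of the skill: `mintEpisode` and its `startMs` / `stepMs` parameters are hypothetical names, and the minter is a variation of the one above that takes the timestamp as an argument so each `id` embeds the same instant as its row's `created_at`.

```js
const crypto = require("crypto");

// Variation of the minter above: the timestamp is a parameter (assumption),
// so ids can be minted for instants other than "now".
function uuidv7(ms = Date.now()) {
  const tsHex = BigInt(ms).toString(16).padStart(12, "0");
  const rand = crypto.randomBytes(10);
  rand[0] = (rand[0] & 0x0f) | 0x70; // version 7
  rand[2] = (rand[2] & 0x3f) | 0x80; // variant bits
  const r = rand.toString("hex");
  return `${tsHex.slice(0, 8)}-${tsHex.slice(8, 12)}-${r.slice(0, 4)}-${r.slice(4, 8)}-${r.slice(8, 20)}`;
}

// Hypothetical helper: one episode_id shared by nTurns rows, one fresh id
// per row, and created_at advanced by stepMs between consecutive rows.
function mintEpisode(nTurns, startMs = Date.now(), stepMs = 1000) {
  const episodeId = uuidv7(startMs);
  const rows = [];
  for (let t = 0; t < nTurns; t++) {
    const ms = startMs + t * stepMs;
    rows.push({
      id: uuidv7(ms),
      episode_id: episodeId,
      created_at: new Date(ms).toISOString(),
      variant_name: "initial",
    });
  }
  return rows;
}

const rows = mintEpisode(3, Date.UTC(2026, 4, 15, 18, 42, 11, 123));
console.log(rows.map((r) => r.created_at));
```

Each row still needs its `input` and `output` filled in before it is a valid `inferences.jsonl` line.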

## Input shape

The `input.messages` field follows the standard chat-message format. Each message is `{role, content}` where:

- `role`: `"system" | "user" | "assistant"`
- `content`: either a string (rare) or an array of content blocks

The most common content block for a templated function is:

```json
{
  "type": "template",
  "name": "<template_name>",
  "arguments": {
    /* object matching the schema */
  }
}
```

`<template_name>` and the `arguments` shape come from the TZ config. Two co-existing styles:

**Legacy (per-role schemas):**

```toml
[functions."my_fn"]
user_schema = "functions/my_fn/user_schema.json"

[functions."my_fn".variants.initial]
user_template = "functions/my_fn/initial/user_template.minijinja"
```

`<template_name>` is the role name (`"user"`, `"system"`, `"assistant"`).

**New (named schemas):**

```toml
[functions."my_fn"]
schemas.user_query.path = "functions/my_fn/user_query_schema.json"

[functions."my_fn".variants.initial]
templates.user_query.path = "functions/my_fn/initial/user_query.minijinja"
```

`<template_name>` is the key under `schemas.` / `templates.` (e.g. `"user_query"`).

For roles that have no schema, use either `"content": "Hello"` or `[{"type":"text","text":"Hello"}]`.

> **Filesystem path mangling**: function and tool names containing `::` (e.g. `"my_function::act"`) appear in the TZ config as `[functions."my_function::act"]`, but on disk the corresponding directory is `functions/my_function____act/` (four underscores). When reading template / schema files, translate `::` → `____` in the path. A quick `find /config -type f` confirms the actual layout if you're unsure.
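
The translation described in the note above is a one-line string replacement. A minimal sketch, where `functionDir` is a hypothetical helper name and `/config` is the config root used in the note:

```js
const path = require("path");

// Hypothetical helper: map a function name from the TZ config to its
// on-disk directory by replacing every "::" with four underscores.
function functionDir(configRoot, fnName) {
  const mangled = fnName.split("::").join("____");
  return path.posix.join(configRoot, "functions", mangled);
}

console.log(functionDir("/config", "my_function::act"));
// prints /config/functions/my_function____act
```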

### Example: a templated user input

```json
"input": {
  "messages": [{
    "role": "user",
    "content": [{
      "type": "template",
      "name": "user",
      "arguments": { "observation": "Hello, this is a sample user message." }
    }]
  }]
}
```

For a multi-turn episode, append assistant + tool result messages between user turns. The third turn's `input.messages` will hold 5 entries (user₀, assistant₀, user₁, assistant₁, user₂), or 6 if the function has a leading system message.
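
A minimal sketch of how the message list grows across turns, assuming a legacy-style `user` template, no system message, and no tool results. `messagesForTurn` is a hypothetical helper, not a TensorZero API.

```js
// Hypothetical helper: build input.messages for turn k (0-indexed) from the
// user template arguments and the assistant outputs produced so far.
function messagesForTurn(userArgs, assistantOutputs, k) {
  const messages = [];
  for (let t = 0; t <= k; t++) {
    messages.push({
      role: "user",
      content: [{ type: "template", name: "user", arguments: userArgs[t] }],
    });
    // Every turn before k has already been answered; replay that answer.
    if (t < k) {
      messages.push({ role: "assistant", content: assistantOutputs[t] });
    }
  }
  return messages;
}

const userArgs = [
  { observation: "first" },
  { observation: "second" },
  { observation: "third" },
];
const assistantOutputs = [
  [{ type: "text", text: "reply 0" }],
  [{ type: "text", text: "reply 1" }],
];
const thirdTurn = messagesForTurn(userArgs, assistantOutputs, 2);
console.log(thirdTurn.map((m) => m.role).join(", "));
// prints user, assistant, user, assistant, user
```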

## Output shape

Depends on the function's `type` in the TZ config — there are three forms.

**`type = "chat"`** — list of content blocks:

```json
"output": [
  { "type": "text", "text": "The model's reply." }
]
```

Tools and text can mix in the same list (a text block followed by a `tool_call`, or several `tool_call`s).

**`type = "chat"` with tools** — same list, with `tool_call` blocks:

```json
"output": [
  {
    "type": "tool_call",
    "name": "<tool_name>",
    "arguments": { /* matching tool schema */ }
  }
]
```

Real rows often include extra fields like `id`, `raw_name`, `raw_arguments` carried back from the underlying model API. Reproduce only `type` + `name` + `arguments` unless you also call the gateway; the extras are post-hoc.

**`type = "json"`** — a single object with `raw` (the unparsed string) and `parsed` (the matched JSON):

```json
"output": {
  "raw": "{\"person\": [], \"location\": [\"Japan\"]}",
  "parsed": {
    "person": [],
    "location": ["Japan"]
  }
}
```

`parsed` must conform to the function's `output_schema`. `raw` is the literal string the model emitted; usually it's just `JSON.stringify(parsed)` with whatever whitespace the model used.

If you're not sure which form applies, look at `[functions."<fn>"]` in `tensorzero.toml` — the `type` field tells you.

## Common mistakes

- **`id == episode_id`.** They must be distinct UUIDs even for single-turn functions.
- **String content where the schema expects template.** If the function has a `user_schema.json`, the user message MUST use `{"type":"template", "name":"user", "arguments":{...}}` — a plain string will be rejected.
- **`variant_name` set to something other than `"initial"`.** This skill only emits `initial`-variant rows; we're characterizing the baseline distribution.
- **Outputs invented by hand.** Always ground via the gateway (see methodology.md). A hand-written `tool_call` argument is very likely to drift from how the model actually phrases things.
- **`created_at` in the wrong format.** ISO 8601 UTC, either with the `Z` suffix (e.g. `"2026-05-15T18:42:11.123Z"`) or the explicit `+00:00` offset. Non-UTC timezone offsets are rejected.

reference/feedback_schema.md

The schema for feedback.jsonl rows:

# `feedback.jsonl` row schema

One JSON object per line. Every row represents one piece of feedback associated with either a single inference or a whole episode.

## Fields

| field         | type          | required | notes                                                                                                                |
| ------------- | ------------- | -------- | -------------------------------------------------------------------------------------------------------------------- |
| `kind`        | string enum   | yes      | One of `"boolean"`, `"float"`, `"comment"`, `"demonstration"`. Determines the `value` type.                          |
| `metric_name` | string        | yes      | Must match a metric defined under `[metrics.<name>]` in `tensorzero.toml`.                                           |
| `target_id`   | string (UUID) | yes      | Resolves to an `inferences.id` (for inference-level metrics) or `inferences.episode_id` (for episode-level metrics). |
| `value`       | varies        | yes      | Type depends on `kind`. See below.                                                                                   |

## Reading the metric definition

For each metric you emit feedback for, locate its definition in the TZ config:

```toml
[metrics.exact_match]
type     = "boolean"        # → kind in feedback row
level    = "inference"      # → target_id resolves to inferences.id
optimize = "max"            # informational; bigger value is better

[metrics.cost]
type     = "float"
level    = "episode"        # → target_id resolves to inferences.episode_id
optimize = "min"
```

Three rules that drop out of this:

- **`kind`** in the feedback row matches **`type`** in the metric definition.
- **`level = "inference"`** ⇒ `target_id` is one of the `id`s in `inferences.jsonl`. One feedback row per (inference, metric) pair.
- **`level = "episode"`** ⇒ `target_id` is one of the `episode_id`s. One feedback row per (episode, metric) pair.
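
The three rules can be sketched as a single row builder. This is an illustration only: `feedbackRow` is a hypothetical helper, and `metricDef` is assumed to be an object holding the `type` and `level` values read from the TZ config.

```js
// Hypothetical helper: derive a feedback row from a metric definition and
// the inference row it scores.
function feedbackRow(metricName, metricDef, inference, value) {
  if (metricDef.level !== "inference" && metricDef.level !== "episode") {
    throw new Error(`unknown level for ${metricName}: ${metricDef.level}`);
  }
  return {
    kind: metricDef.type, // rule 1: kind mirrors metric.type
    metric_name: metricName,
    // rules 2 and 3: the metric's level picks which id the row targets
    target_id:
      metricDef.level === "inference" ? inference.id : inference.episode_id,
    value,
  };
}

// Placeholder ids; real rows carry UUID v7 strings.
const inference = { id: "<inference_id>", episode_id: "<episode_id>" };
const perInference = feedbackRow(
  "exact_match",
  { type: "boolean", level: "inference" },
  inference,
  false,
);
const perEpisode = feedbackRow(
  "cost",
  { type: "float", level: "episode" },
  inference,
  12.4,
);
console.log(JSON.stringify(perInference));
console.log(JSON.stringify(perEpisode));
```

For an episode-level metric, call the builder once per episode rather than once per row, so that each (episode, metric) pair gets exactly one feedback row.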

## `value` shape by `kind`

| kind            | type        | example                      | notes                                                                |
| --------------- | ----------- | ---------------------------- | -------------------------------------------------------------------- |
| `boolean`       | bool or 0/1 | `true`, `false`, `1`, `0`    | Both forms are accepted; prefer `true` / `false`.                    |
| `float`         | number      | `0.73`, `12.4`               | Range is metric-defined — read its bounds from the TZ config if any. |
| `comment`       | string      | `"Failed: incorrect output"` | Natural-language feedback from users or developers.                  |
| `demonstration` | object      | `{ "output": [...] }`        | Edited drafts, labels, human-generated content.                      |

For this skill, focus on `boolean` and `float` — they're the metrics that drive optimization.

## Examples

**Inference-level boolean:**

```json
{
  "kind": "boolean",
  "metric_name": "exact_match",
  "target_id": "<inference_id>",
  "value": false
}
```

**Episode-level float:**

```json
{
  "kind": "float",
  "metric_name": "reward",
  "target_id": "<episode_id>",
  "value": 0.42
}
```

## Realistic value distributions

You don't have ground-truth labels, but you should produce a feedback distribution that's _plausible_ for the task — not 100% success and not 100% failure.

For a boolean metric:

- A 100% success rate is a red flag — it suggests you tilted your synthetic inputs toward easy cases. Re-balance.

For a float metric:

- Bound by the metric's natural range (often [0, 1] for accuracy-style metrics, unbounded for cost or reward).
- Distribute across the range — don't pile everything at the mean.
- If you don't know what the natural range is, generate a few real outputs first via the gateway and inspect them.

The point of this corpus is to be a _prior_ over what the function's baseline behavior looks like — it does not need to be correct, but it must be plausible. The downstream measurement (input/output/feedback novelty against the real baseline) will surface where the prior was wrong.
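
One way to catch a degenerate boolean distribution before handing off the corpus is a quick rate check. A minimal sketch: `booleanRate` is a hypothetical helper and is not part of the validator, which does not inspect value distributions.

```js
// Hypothetical helper: success rate for one boolean metric, flagged as
// degenerate when every row succeeds or every row fails.
function booleanRate(feedbackRows, metricName) {
  const vals = feedbackRows
    .filter((r) => r.kind === "boolean" && r.metric_name === metricName)
    .map((r) => r.value === true || r.value === 1);
  if (vals.length === 0) return null; // metric never emitted
  const rate = vals.filter(Boolean).length / vals.length;
  return { n: vals.length, rate, degenerate: rate === 0 || rate === 1 };
}

// Illustrative rows only, not data from the experiments.
const sample = [
  { kind: "boolean", metric_name: "exact_match", value: true },
  { kind: "boolean", metric_name: "exact_match", value: 0 },
  { kind: "boolean", metric_name: "exact_match", value: false },
  { kind: "float", metric_name: "cost", value: 1.2 },
];
console.log(booleanRate(sample, "exact_match"));
```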

## Common mistakes

- **`target_id` points at an `episode_id` for an inference-level metric (or vice versa).** Read the metric's `level` first.
- **`kind` mismatched with `metric.type`.** A `float` metric must receive `kind: "float"` feedback rows, even if the values look 0/1.
- **`metric_name` not in the TZ config.** Emitting feedback for a metric the function doesn't define will fail validation.
- **Missing rows.** Every inference should be covered by at least one feedback row from an inference-level metric, and every episode by at least one episode-level metric (if any are defined). The validator counts coverage.

scripts/validate.js

The validator the agent runs before exiting:

#!/usr/bin/env node
/**
 * Validate a dataset-synthesis run's output. Mirrors the contract from the
 * skill's reference docs.
 *
 * Usage:
 *   node validate.js <output_dir> [--config <tensorzero.toml>]
 *                                 [--min-episodes <N>] [--max-episodes <N>]
 *
 * Exits 0 on success, 1 on any validation error, 2 on a usage error. Errors
 * and the FAIL summary line go to stderr; per-file counts, warnings, and
 * the PASS line go to stdout. The agent can `> validate.log 2>&1` to
 * capture both streams in a single file.
 */
"use strict";

const fs = require("fs");
const path = require("path");

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const ALLOWED_FEEDBACK_KINDS = new Set([
  "boolean",
  "float",
  "comment",
  "demonstration",
]);
const ALLOWED_OUTPUT_TYPES = new Set([
  "text",
  "tool_call",
  "raw_text",
  "thought",
]);
const REQUIRED_INF_FIELDS = [
  "id",
  "episode_id",
  "created_at",
  "variant_name",
  "input",
  "output",
];
const REQUIRED_FB_FIELDS = ["kind", "metric_name", "target_id", "value"];

// ── Arg parsing ──────────────────────────────────────────────────────────

function parseArgs(argv) {
  const args = {
    outputDir: null,
    config: null,
    minEpisodes: null,
    maxEpisodes: null,
  };
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (a === "--config") args.config = argv[++i];
    else if (a === "--min-episodes") args.minEpisodes = Number(argv[++i]);
    else if (a === "--max-episodes") args.maxEpisodes = Number(argv[++i]);
    else if (a === "-h" || a === "--help") {
      console.log(
        "usage: validate.js <output_dir> [--config <toml>] [--min-episodes N] [--max-episodes N]",
      );
      process.exit(0);
    } else if (!args.outputDir) args.outputDir = a;
    else {
      console.error(`unexpected arg: ${a}`);
      process.exit(2);
    }
  }
  if (!args.outputDir) {
    console.error("output_dir is required");
    process.exit(2);
  }
  return args;
}

// ── JSONL loader ─────────────────────────────────────────────────────────

function loadJsonl(filePath, errors) {
  if (!fs.existsSync(filePath)) {
    errors.push(`missing file: ${filePath}`);
    return [];
  }
  const raw = fs.readFileSync(filePath, "utf8");
  const rows = [];
  raw.split("\n").forEach((line, idx) => {
    if (!line.trim()) return;
    try {
      const obj = JSON.parse(line);
      if (typeof obj !== "object" || obj === null || Array.isArray(obj)) {
        errors.push(
          `${path.basename(filePath)}:${idx + 1}: top-level value must be an object`,
        );
        return;
      }
      rows.push(obj);
    } catch (e) {
      errors.push(
        `${path.basename(filePath)}:${idx + 1}: bad JSON (${e.message})`,
      );
    }
  });
  return rows;
}

// ── Inferences ───────────────────────────────────────────────────────────

function validateInferences(rows, errors) {
  if (rows.length === 0) {
    errors.push("inferences.jsonl is empty");
    return;
  }
  const idsSeen = new Set();
  rows.forEach((r, i) => {
    const tag = `inferences.jsonl:${i + 1}`;
    for (const k of REQUIRED_INF_FIELDS) {
      if (!(k in r)) errors.push(`${tag}: missing required field '${k}'`);
    }
    const rid = r.id;
    const eid = r.episode_id;
    if (typeof rid === "string") {
      if (!UUID_RE.test(rid))
        errors.push(`${tag}: id is not a valid UUID: '${rid}'`);
      if (idsSeen.has(rid)) errors.push(`${tag}: duplicate id '${rid}'`);
      idsSeen.add(rid);
    }
    if (typeof eid === "string" && !UUID_RE.test(eid)) {
      errors.push(`${tag}: episode_id is not a valid UUID: '${eid}'`);
    }
    if (typeof rid === "string" && rid === eid) {
      errors.push(
        `${tag}: id and episode_id are identical (must be distinct UUIDs)`,
      );
    }

    if (r.variant_name !== "initial") {
      errors.push(
        `${tag}: variant_name must be 'initial', got ${JSON.stringify(r.variant_name)}`,
      );
    }

    const inp = r.input;
    if (typeof inp !== "object" || inp === null || !("messages" in inp)) {
      errors.push(`${tag}: input must be an object with a 'messages' array`);
    } else {
      const msgs = inp.messages;
      if (!Array.isArray(msgs) || msgs.length === 0) {
        errors.push(`${tag}: input.messages must be a non-empty array`);
      }
    }

    const out = r.output;
    if (Array.isArray(out)) {
      // chat / tool function: list of content blocks
      out.forEach((blk, j) => {
        if (typeof blk !== "object" || blk === null) {
          errors.push(`${tag}: output[${j}] must be an object`);
          return;
        }
        const t = blk.type;
        if (!ALLOWED_OUTPUT_TYPES.has(t)) {
          const allowed = [...ALLOWED_OUTPUT_TYPES].sort();
          errors.push(
            `${tag}: output[${j}].type '${t}' not in [${allowed.map((x) => `'${x}'`).join(", ")}]`,
          );
        }
      });
    } else if (typeof out === "object" && out !== null) {
      // json function: {raw, parsed}
      if (!("raw" in out) && !("parsed" in out)) {
        errors.push(
          `${tag}: output is an object but has neither 'raw' nor 'parsed' ` +
            `(json-function output expects both)`,
        );
      }
    } else {
      errors.push(
        `${tag}: output must be a list of content blocks (chat/tool) ` +
          `or an object with 'raw'+'parsed' (json), got ${typeof out}`,
      );
    }
  });
}

// ── Feedback ─────────────────────────────────────────────────────────────

function validateFeedback(rows, errors) {
  rows.forEach((r, i) => {
    const tag = `feedback.jsonl:${i + 1}`;
    for (const k of REQUIRED_FB_FIELDS) {
      if (!(k in r)) errors.push(`${tag}: missing required field '${k}'`);
    }
    const kind = r.kind;
    if (!ALLOWED_FEEDBACK_KINDS.has(kind)) {
      const allowed = [...ALLOWED_FEEDBACK_KINDS].sort();
      errors.push(
        `${tag}: kind '${kind}' not in [${allowed.map((x) => `'${x}'`).join(", ")}]`,
      );
    }
    const tid = r.target_id;
    if (typeof tid === "string" && !UUID_RE.test(tid)) {
      errors.push(`${tag}: target_id is not a valid UUID: '${tid}'`);
    }
    const v = r.value;
    if (
      kind === "boolean" &&
      // accept true/false or exactly 0/1, as the error message promises
      !(typeof v === "boolean" || v === 0 || v === 1)
    ) {
      errors.push(
        `${tag}: boolean feedback value must be bool or 0/1, got ${typeof v}`,
      );
    }
    if (kind === "float" && typeof v !== "number") {
      errors.push(
        `${tag}: float feedback value must be a number, got ${typeof v}`,
      );
    }
  });
}

// ── Cross-validation (referential integrity, metric resolution) ──────────

function validateCross(inferences, feedback, metricDefs, errors, warnings) {
  const inferenceIds = new Set(
    inferences.filter((r) => typeof r.id === "string").map((r) => r.id),
  );
  const episodeIds = new Set(
    inferences
      .filter((r) => typeof r.episode_id === "string")
      .map((r) => r.episode_id),
  );

  const targetsInference = new Map(); // inference_id → Set<metric_name>
  const targetsEpisode = new Map(); // episode_id   → Set<metric_name>

  feedback.forEach((r, i) => {
    const tag = `feedback.jsonl:${i + 1}`;
    const mname = r.metric_name;
    const tid = r.target_id;
    const kind = r.kind;

    if (metricDefs && !(mname in metricDefs)) {
      const defined = Object.keys(metricDefs).sort().join(", ") || "(none)";
      errors.push(
        `${tag}: metric_name '${mname}' not defined in tensorzero.toml ` +
          `(defined metrics: ${defined})`,
      );
      return;
    }
    const mdef = metricDefs ? metricDefs[mname] : null;

    if (mdef) {
      if (kind && mdef.type && kind !== mdef.type) {
        errors.push(
          `${tag}: kind '${kind}' mismatches metric.type '${mdef.type}' ` +
            `for metric '${mname}'`,
        );
      }
      const level = mdef.level;
      if (level === "inference") {
        if (!inferenceIds.has(tid)) {
          const hint = episodeIds.has(tid)
            ? "this might be an episode_id — try matching against inferences.id instead"
            : "the value does not appear as any row's id in inferences.jsonl";
          errors.push(
            `${tag}: target_id '${tid}' does not match any inference id ` +
              `(metric '${mname}' is inference-level; ${hint})`,
          );
        } else {
          if (!targetsInference.has(tid)) targetsInference.set(tid, new Set());
          targetsInference.get(tid).add(mname);
        }
      } else if (level === "episode") {
        if (!episodeIds.has(tid)) {
          const hint = inferenceIds.has(tid)
            ? "this looks like an inference id — try matching against episode_id instead"
            : "the value does not appear as any row's episode_id in inferences.jsonl";
          errors.push(
            `${tag}: target_id '${tid}' does not match any episode_id ` +
              `(metric '${mname}' is episode-level; ${hint})`,
          );
        } else {
          if (!targetsEpisode.has(tid)) targetsEpisode.set(tid, new Set());
          targetsEpisode.get(tid).add(mname);
        }
      }
    } else {
      // No metric defs → just verify target_id exists somewhere
      if (!inferenceIds.has(tid) && !episodeIds.has(tid)) {
        errors.push(
          `${tag}: target_id '${tid}' does not match any inference id or ` +
            `episode_id in inferences.jsonl`,
        );
      }
    }
  });

  if (metricDefs) {
    const infMetrics = Object.entries(metricDefs)
      .filter(([, d]) => d.level === "inference")
      .map(([n]) => n);
    const epMetrics = Object.entries(metricDefs)
      .filter(([, d]) => d.level === "episode")
      .map(([n]) => n);
    if (infMetrics.length) {
      const uncovered = [...inferenceIds].filter(
        (id) => !targetsInference.has(id),
      );
      if (uncovered.length) {
        warnings.push(
          `${uncovered.length}/${inferenceIds.size} inferences have no inference-level feedback`,
        );
      }
    }
    if (epMetrics.length) {
      const uncovered = [...episodeIds].filter((id) => !targetsEpisode.has(id));
      if (uncovered.length) {
        warnings.push(
          `${uncovered.length}/${episodeIds.size} episodes have no episode-level feedback`,
        );
      }
    }
  }
}

// ── Budget ───────────────────────────────────────────────────────────────

function validateBudget(inferences, args, errors) {
  const nEpisodes = new Set(
    inferences
      .filter((r) => typeof r.episode_id === "string")
      .map((r) => r.episode_id),
  ).size;
  if (args.minEpisodes !== null && nEpisodes < args.minEpisodes) {
    errors.push(
      `episode count ${nEpisodes} is below the minimum ${args.minEpisodes}`,
    );
  }
  if (args.maxEpisodes !== null && nEpisodes > args.maxEpisodes) {
    errors.push(
      `episode count ${nEpisodes} exceeds the maximum ${args.maxEpisodes}`,
    );
  }
}

// ── Minimal TOML parser for [metrics.*] blocks ───────────────────────────

function parseMetricDefs(configPath) {
  const metrics = {};
  if (!fs.existsSync(configPath)) return metrics;
  const blockRe = /^\s*\[metrics\.["']?([^"'\]]+?)["']?\]\s*$/;
  const kvRe = /^\s*(type|level|optimize)\s*=\s*["']?([^"'\s#]+)/;
  const sectionRe = /^\s*\[.+\]\s*$/;
  let current = null;
  for (const line of fs.readFileSync(configPath, "utf8").split("\n")) {
    const m = line.match(blockRe);
    if (m) {
      current = m[1];
      metrics[current] = {};
      continue;
    }
    if (sectionRe.test(line) && !blockRe.test(line)) {
      current = null;
      continue;
    }
    if (current) {
      const kv = line.match(kvRe);
      if (kv) metrics[current][kv[1]] = kv[2];
    }
  }
  return metrics;
}

// ── Driver ───────────────────────────────────────────────────────────────

function main() {
  const args = parseArgs(process.argv.slice(2));
  if (
    !fs.existsSync(args.outputDir) ||
    !fs.statSync(args.outputDir).isDirectory()
  ) {
    console.error(`output_dir does not exist: ${args.outputDir}`);
    process.exit(1);
  }

  const errors = [];
  const warnings = [];
  const inferences = loadJsonl(
    path.join(args.outputDir, "inferences.jsonl"),
    errors,
  );
  const feedback = loadJsonl(
    path.join(args.outputDir, "feedback.jsonl"),
    errors,
  );

  // Extras: ignore subdirectories (orchestrator's _meta/ sits there).
  const extras = fs.readdirSync(args.outputDir).filter((name) => {
    if (name === "inferences.jsonl" || name === "feedback.jsonl") return false;
    return fs.statSync(path.join(args.outputDir, name)).isFile();
  });
  if (extras.length)
    warnings.push(
      `unexpected files in ${args.outputDir}: [${extras.join(", ")}]`,
    );

  validateInferences(inferences, errors);
  validateFeedback(feedback, errors);

  const metricDefs = args.config ? parseMetricDefs(args.config) : {};
  validateCross(
    inferences,
    feedback,
    Object.keys(metricDefs).length ? metricDefs : null,
    errors,
    warnings,
  );
  validateBudget(inferences, args, errors);

  // Count only string episode_ids, matching validateBudget.
  const nEpisodes = new Set(
    inferences.map((r) => r.episode_id).filter((e) => typeof e === "string"),
  ).size;
  const fbByMetric = {};
  for (const r of feedback)
    fbByMetric[r.metric_name] = (fbByMetric[r.metric_name] || 0) + 1;

  console.log(
    `inferences:        ${inferences.length} rows  (${nEpisodes} unique episodes)`,
  );
  console.log(`feedback:          ${feedback.length} rows`);
  if (Object.keys(fbByMetric).length) {
    console.log(`  per metric:      ${JSON.stringify(fbByMetric)}`);
  }
  if (warnings.length) {
    console.log("\nWARNINGS:");
    for (const w of warnings) console.log(`  · ${w}`);
  }
  if (errors.length) {
    console.error(`\nFAIL — ${errors.length} error(s):`);
    for (const e of errors) console.error(`  ✗ ${e}`);
    process.exit(1);
  }
  console.log("\nPASS");
  process.exit(0);
}

main();

Embedding step

The MMD² analysis runs offline on the eval host (outside the Claude Code sandbox), where Python is available. Each row is rendered to a single string via $\,\mathtt{json.dumps(\{input,\,output\},\,sort\_keys=True)}\,$ , which keeps the model’s actual outputs alongside the inputs and drops per-row bookkeeping (id, episode_id, created_at, variant_name). That string is then passed through a text-embedding model to produce a fixed-dimensional vector.

The same embedding function is applied to both corpora, and the output vectors are L2-normalized so that pairwise squared distances fall in $[0, 4]$. The per-input truncation cap depends on the embedder’s context window (32k tokens for Voyage and ZeroEntropy, 8k tokens for OpenAI and Gemini); the same cap is applied to both the synthetic and baseline corpora so that each embedder sees the same content from both.
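
The $[0, 4]$ range is a standard consequence of unit-norm vectors, spelled out here for reference. The symbol $\theta_{xy}$ (the angle between $x$ and $y$) is notation introduced only for this note:

\lVert x - y \rVert^2 \;=\; \lVert x \rVert^2 + \lVert y \rVert^2 - 2\langle x, y \rangle \;=\; 2 - 2\cos\theta_{xy} \;\in\; [0, 4].

Identical rows give 0, orthogonal rows give 2, and diametrically opposed rows give 4.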

The MMD² estimator

Maximum Mean Discrepancy (MMD) is a kernel-based two-sample distance for testing whether two finite samples $X = \{x_i\}_{i=1}^n$ and $Y = \{y_j\}_{j=1}^m$ come from the same underlying distribution.

Fix a positive-definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with associated reproducing-kernel Hilbert space $\mathcal{H}$ and feature map $\phi(x) := k(\cdot, x)$ . Provided the kernel is measurable and satisfies the moment condition $\mathbb{E}_{x \sim p}\bigl[\sqrt{k(x, x)}\bigr] < \infty$ for the distributions $p$ and $q$ being compared, the mean embeddings

\mu_p := \mathbb{E}_{x \sim p}[\phi(x)] \in \mathcal{H}, \qquad \mu_q := \mathbb{E}_{y \sim q}[\phi(y)] \in \mathcal{H}

are well-defined elements of $\mathcal{H}$ . Bounded kernels (such as the Gaussian RBF below, where $0 \le k(x, y) \le 1$ ) automatically satisfy this condition for every $p, q$ . The population MMD² is then defined as the squared RKHS distance between the two mean embeddings:

\mathrm{MMD}^2(p, q) \;:=\; \lVert \mu_p - \mu_q \rVert_{\mathcal{H}}^2.

Given finite samples $X = \{x_i\}_{i=1}^n$ and $Y = \{y_j\}_{j=1}^m$ from $p$ and $q$ , plug in the empirical mean embeddings $\hat{\mu}_X := \tfrac{1}{n}\sum_i \phi(x_i)$ and $\hat{\mu}_Y := \tfrac{1}{m}\sum_j \phi(y_j)$ , expand the squared norm, and apply the reproducing-kernel identity $\langle \phi(a), \phi(b)\rangle_{\mathcal{H}} = k(a, b)$ to get an estimator written purely in terms of pairwise kernel evaluations:

\begin{aligned} \lVert \hat{\mu}_X - \hat{\mu}_Y \rVert_{\mathcal{H}}^2 \;=\; &\tfrac{1}{n^2}\!\sum_{i,j} k(x_i, x_j) \\ &\;-\; \tfrac{2}{nm}\!\sum_{i,j} k(x_i, y_j) \\ &\;+\; \tfrac{1}{m^2}\!\sum_{i,j} k(y_i, y_j). \end{aligned}

With a characteristic kernel (such as the Gaussian RBF used below), $\mathrm{MMD}^2(p, q) = 0$ if and only if $p = q$ . The metric is then a faithful distributional distance on the space of probability measures, not just a moment comparison.

For my use case (single-sample novelty against a fixed baseline), I treat the synthetic corpus $\mathcal{S}$ as $X$ and the deployment baseline $\mathcal{B}$ as $Y$ , and report the resulting per-(env, seed) $\widehat{\mathrm{MMD}^2}$ as the novelty score.

Kernel choice

I use the Gaussian radial basis function (RBF) kernel:

k(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)

The median heuristic sets $\sigma^2$ per (env, seed) to the median squared pairwise distance over a random 500-row subsample of the aggregate sample $\mathcal{S} \cup \mathcal{B}$ :

\sigma^2 = \mathrm{median}\bigl\{\lVert z_i - z_j \rVert^2 \,:\, z_i, z_j \in Z_{\text{sub}} \subset \mathcal{S} \cup \mathcal{B},\; i \ne j\bigr\}.

U-statistic MMD²

For finite samples $X = \{x_i\}_{i=1}^n$ and $Y = \{y_j\}_{j=1}^m$ of any sizes $n$ and $m$ , I use the unbiased U-statistic estimator:

\begin{aligned} \widehat{\mathrm{MMD}^2_U}(X, Y) = \;&\frac{1}{n(n-1)}\!\sum_{i \ne j} k(x_i, x_j) \\ &\;-\; \frac{2}{nm}\!\sum_{i,j} k(x_i, y_j) \\ &\;+\; \frac{1}{m(m-1)}\!\sum_{i \ne j} k(y_i, y_j). \end{aligned}

Being unbiased matters in this setup specifically because $n$ varies substantially across envs (25–174 synthetic rows depending on env and seed): a biased estimator would introduce a per-env offset that contaminates cross-env comparisons. The unbiased estimator can return slightly negative values when the two distributions are nearly identical, which is sample variance around a true MMD² of zero, not an error.

I report this estimator as the per-(env, seed) novelty score:

\mathrm{novelty}(\text{env},\,\text{seed}) = \widehat{\mathrm{MMD}^2_U}\bigl(\mathcal{S}_{\text{env},\text{seed}},\;\mathcal{B}_{\text{env}}\bigr).
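
The kernel, bandwidth, and estimator above fit in a few lines. The analysis in this post ran in Python; the sketch below is a JavaScript transcription of the formulas, written to match the other code in this document, and is not the original implementation. The function names are hypothetical, and the bandwidth helper uses every row it is given rather than a 500-row subsample.

```js
// Squared Euclidean distance between two equal-length vectors.
function sqDist(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i] - b[i]) ** 2;
  return s;
}

// Median heuristic: sigma^2 = median squared distance over all pairs i < j.
function medianSigma2(Z) {
  const d = [];
  for (let i = 0; i < Z.length; i++)
    for (let j = i + 1; j < Z.length; j++) d.push(sqDist(Z[i], Z[j]));
  d.sort((p, q) => p - q);
  const mid = d.length >> 1;
  return d.length % 2 ? d[mid] : (d[mid - 1] + d[mid]) / 2;
}

// Unbiased U-statistic estimate of MMD^2 with a Gaussian RBF kernel.
function mmd2U(X, Y, sigma2) {
  const k = (a, b) => Math.exp(-sqDist(a, b) / (2 * sigma2));
  const n = X.length;
  const m = Y.length;
  let xx = 0;
  let yy = 0;
  let xy = 0;
  for (let i = 0; i < n; i++)
    for (let j = 0; j < n; j++) if (i !== j) xx += k(X[i], X[j]);
  for (let i = 0; i < m; i++)
    for (let j = 0; j < m; j++) if (i !== j) yy += k(Y[i], Y[j]);
  for (let i = 0; i < n; i++)
    for (let j = 0; j < m; j++) xy += k(X[i], Y[j]);
  return xx / (n * (n - 1)) - (2 * xy) / (n * m) + yy / (m * (m - 1));
}

// Toy unit vectors, purely illustrative.
const X = [[1, 0], [0, 1]];
const ySame = [[1, 0], [0, 1]];
const yFar = [[-1, 0], [0, -1]];
const same = mmd2U(X, ySame, 1);
const far = mmd2U(X, yFar, 1);
console.log({ same, far });
```

On this toy input the identical-sample score comes out negative while the opposed-sample score is positive, which is the small-sample behavior of the unbiased estimator described above.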

Aggregating across seeds

For each (env, embedder) there are K MMD² values, one per synthesis seed. I report the median across the K seeds as the per-env point estimate, with the inter-quartile range $[Q_1, Q_3]$ as the seed-spread error bar:

\mathrm{novelty}_{\text{env}} = \mathrm{median}_k\,\mathrm{novelty}(\text{env},\,k)

The IQR captures variation across agent runs. Median + IQR is robust to a single anomalous seed (e.g. one Wordle run that happens to land in a less typical region of the distribution).
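
A minimal sketch of the per-env aggregation. The post does not state which quantile convention was used, so the linear-interpolation rule below is an assumption, and `quantile` / `summarizeSeeds` are hypothetical names. The scores are made-up values chosen to show one anomalous seed.

```js
// Quantile of an ascending-sorted array by linear interpolation between
// the two nearest order statistics (an assumed convention).
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Median as the point estimate, [Q1, Q3] as the seed-spread error bar.
function summarizeSeeds(scores) {
  const s = [...scores].sort((a, b) => a - b);
  return {
    median: quantile(s, 0.5),
    q1: quantile(s, 0.25),
    q3: quantile(s, 0.75),
  };
}

// Illustrative values only: four typical seeds and one outlier.
const summary = summarizeSeeds([0.3, 0.1, 5.0, 0.2, 0.4]);
console.log(summary);
// prints { median: 0.3, q1: 0.2, q3: 0.4 }
```

The outlier at 5.0 moves neither the median nor the quartiles here, whereas a mean would be pulled to 1.2.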

In the chart, the median is the X-axis point position and the IQR is the horizontal whisker. The same convention is used for the Y-axis (Δ success rate across eval seeds) so both axes display the same kind of error bar.

Inside the chart

Two axes organize the chart. The first is visibility: can Claude Code see (or correctly guess) the real input distribution? That is what drift measures, so low drift means high visibility. The second is tolerance: even when Claude Code guesses wrong, does the task care? Data helps only when an application scores low on both: Claude Code is guessing and the metric punishes it for it.

Read application by application, the seven sort into a few groups. NER is the clean no-effect pole: the optimization model has the corpus memorized and its broader knowledge of NER covers the rest, so visibility is high and data has nothing to add. YC bench is the data-helps pole: the simulator postdates the model’s training cutoff and the harness underspecifies what the agent observes, so visibility is low, and the metric is unforgiving. NDA is the outlier that forces the second axis onto the page: its visibility is just as low as YC bench’s, but the task tolerates the gap, so data barely moves the needle. Two more no-ops fall out for a reason the two axes don’t cover, which I flag below.

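| Application              | Sees the real distribution?                       | Does the gap hurt?               | Data helps?       |
| ------------------------ | ------------------------------------------------- | -------------------------------- | ----------------- |
| NER                      | Yes: corpus memorized, generic task shape         | n/a                              | No                |
| Customer service         | Yes: model already recognizes τ-bench             | n/a                              | No                |
| Software Eng.            | n/a: capability ceiling (no prompt could move it) | n/a                              | No                |
| NDA                      | No: invents the wrong document genre              | No: extraction is genre-agnostic | No                |
| Wordle                   | Partly                                            | Somewhat                         | A little          |
| Science                  | No                                                | Yes                              | Yes               |
| Business mgmt (YC bench) | No                                                | Yes                              | Yes (largest gap) |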

The first three rows are no-ops because Claude Code is not guessing, or (for software engineering) because the agent model cannot improve no matter what the prompt says. The bottom three are the cases where data earns its keep. NDA is the row that does the conceptual work, sitting between them: guessing, but on a task that does not punish guessing.

Entity extraction (NER)

With or without 100 real traces, Claude Code improved the NER agent by around 60 percentage points. The data made no measurable difference, and once I started looking at why, I could see that NER is cooked into Claude Code at two levels: corpus memorization and task-shape knowledge. Either alone would have been enough to make data ablation a no-op.

Claude Code has the specific corpus memorized. In the without-data condition, when Claude Code constructed probes to test its prompt edits, I noticed something striking: 24 of 31 probes across seeds were character-for-character copies of CoNLL++ rows (three examples below). Claude Code was pulling these sentences straight from its training distribution and using them to check the prompts it was writing. 100 real traces have nothing to add.

Three verbatim probes vs. their CoNLL++ matches

These are the top-three highest-similarity probes from the without-data run, paired with the baseline row each probe’s nearest-neighbor search points at. Cosine similarity tops out at 0.95 (rather than 1.00) only because the role prefix differs ([user] vs [user:text]); the body text is character-identical.

Probe (seed 1):

West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .

CoNLL++ match (validation split): identical.

Probe (seed 2):

Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .

CoNLL++ match (training split): identical.

Probe (seed 1):

The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .

CoNLL++ match (training split): identical.
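A minimal sketch of how a verbatim-copy count like the one above could be computed, assuming each row carries a bracketed role prefix that should be ignored before comparing. The `ROLE_PREFIX` pattern and the helper names are hypothetical, and the two-probe list is a toy built from sentences quoted in this post, not the actual probe set.

```python
import re

# Assumed prefix shape: a bracketed role tag such as "[user]" or "[user:text]".
ROLE_PREFIX = re.compile(r"^\[[^\]]*\]\s*")


def body(row: str) -> str:
    """Return the row text with any leading role prefix removed."""
    return ROLE_PREFIX.sub("", row).strip()


def count_verbatim(probes: list[str], corpus: list[str]) -> int:
    """Count probes whose body is character-identical to some corpus row."""
    corpus_bodies = {body(row) for row in corpus}
    return sum(body(probe) in corpus_bodies for probe in probes)


sentence = (
    "The European Commission said on Thursday it disagreed with German advice "
    "to consumers to shun British lamb until scientists determine whether "
    "mad cow disease can be transmitted to sheep ."
)
corpus = ["[user:text] " + sentence]  # toy stand-in for the real corpus
probes = [
    "[user] " + sentence,  # same body, different role prefix
    "[user] The COP30 climate summit in Belém, Brazil drew delegates from 190 nations.",
]

verbatim = count_verbatim(probes, corpus)
print(f"{verbatim} of {len(probes)} probes are verbatim corpus rows")
```

Exact string matching after prefix stripping is stricter than the cosine threshold: a probe that paraphrases a corpus row would score high on similarity but would not be counted here.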

Even without that memorization, Claude Code’s knowledge of what NER data looks like is good enough. When I asked Claude Code to synthesize an NER corpus from the application spec alone, it produced 0 verbatim CoNLL++ rows out of 130. Yet the synthetic and real CoNLL++ corpora still land in the same embedding-space neighborhood, which is what produces the low novelty score. The synthetic rows are modern, naturally-punctuated prose (“The COP30 climate summit in Belém, Brazil drew delegates from 190 nations.”); CoNLL++ is Reuters-style August-1996 news with Penn-tokenization (“BRUSSELS 1996-08-22 . EU rejects German call to boycott British lamb .”). The two corpora share zero sentences and zero persons; overlap concentrates in geopolitical place names and perennial organizations. Claude Code’s knowledge covers the shape of NER data, namely entity-rich short news prose in the four CoNLL categories, even when it does not reproduce the specific corpus.

Business management (YC bench)

YC bench sits at the opposite extreme. Without 100 real traces, the optimized CEO agent averaged 0.6 successful tasks per episode (a task is a contract the CEO accepted, assigned to an employee, and completed before its deadline); with the traces, it averaged 8. This operational lift translated to a 20 percentage point increase in survival rate, the simulation’s primary metric. The data was doing essentially all the work, and when I dug in, none of the things that made NER a no-op were in place: Claude Code did not memorize the benchmark, the knowledge it inherits from the configuration only covers half the application, and the optimization-time instruction barely elicits even that half.

Claude Code almost certainly did not see this benchmark. YC bench was published April 6, 2026, three months after Sonnet 4.6’s training data cutoff of January 2026. Barring some pre-release artifact that found its way into training, the CLI grammar, the structured opener, and the state schema are not in Claude Code’s training corpus.

The knowledge Claude Code inherits from the configuration covers actions, not observations. When I asked Claude Code to synthesize a YC bench corpus from the application spec alone, it used the correct yc-bench CLI vocabulary on its output side. Every major subcommand (market browse, task accept, task assign, task dispatch, sim resume) appeared at a frequency within ten percentage points of its frequency in the real traces, because the application’s system prompt lists every command and flag verbatim. The user-side observation schema, however, is just observation: string, a pass-through with no structure documented anywhere, so Claude Code had to guess. It guessed a plausible JSON-event format:

{
  "event": "simulation_started",
  "funds_cents": ...,
  "employee_count": ...
}

The format is internally consistent with the spec’s hints (“All commands return JSON”, “Funds are in cents”) but disjoint from the real Markdown opener (## Simulation Start — Take Immediate Action). 0 of 636 synth rows reproduced that canonical header. Claude Code knew what to do; it did not know what the environment would show it.
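The two checks in this comparison can be sketched as follows. This is an illustration under assumptions, not the analysis code: it assumes "frequency" means the share of rows that mention a subcommand, it matches only a prefix of the real header, and all rows below are toy values.

```python
SUBCOMMANDS = ("market browse", "task accept", "task assign", "task dispatch", "sim resume")
HEADER_PREFIX = "## Simulation Start"  # prefix of the real Markdown opener


def share(rows: list[str], needle: str) -> float:
    """Percentage of rows that contain `needle`."""
    if not rows:
        return 0.0
    return 100.0 * sum(needle in row for row in rows) / len(rows)


def subcommand_gaps(real: list[str], synth: list[str]) -> dict[str, float]:
    """Absolute gap, in percentage points, between real and synthetic shares."""
    return {cmd: abs(share(real, cmd) - share(synth, cmd)) for cmd in SUBCOMMANDS}


# Toy action rows (agent side).
real_actions = [
    "yc-bench market browse --required-prestige-lte 1",
    "yc-bench task accept --task-id <UUID>",
    "yc-bench sim resume",
    "yc-bench sim resume",
]
synth_actions = [
    "yc-bench market browse",
    "yc-bench task accept --task-id <UUID>",
    "yc-bench task assign --task-id <UUID> --employee-id <UUID>",
    "yc-bench sim resume",
]
gaps = subcommand_gaps(real_actions, synth_actions)

# Toy observation rows (environment side): the guessed JSON never carries the header.
synth_observations = ['{"event": "simulation_started"}']
header_hits = sum(obs.startswith(HEADER_PREFIX) for obs in synth_observations)

print(gaps, header_hits)
```

Keeping the action-side and observation-side checks separate mirrors the split in the text: the first can pass while the second fails completely.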

The optimization-time instruction elicited even less. In the without-data condition, Claude Code constructed probes to test the prompts it wrote, but its instruction did not ask for full episode simulation. 0 of 11 probes across seeds contained any yc-bench-specific token (see below); the 11 collapsed to 5 generic “Simulation started. You are the CEO” strings. Even the action-side knowledge, which is right there in the system template, never surfaced. With access to real traces, Claude Code copied the structured opener nearly verbatim (top NN cosine = 0.95) and wrote a prompt that handled the actual CLI workflow. The data closes a gap the harness underspecifies and the optimization instruction cannot bridge.

A real opener vs. the entire without-data probe set

The with-data run copies real baseline rows nearly verbatim (top NN cosine = 0.95). The without-data run fabricates generic CEO roleplay.

Real opener (also reproduced by the with-data run):

## Simulation Start — Take Immediate Action
- current_time: 2025-01-01T00:00:00
- horizon_end: 2026-01-01T00:00:00
- funds: $250,000.00
- monthly_payroll: $22,340.00
- runway: ~11.2 months
- employees: 3
- active_tasks: 0
- planned_tasks: 0

**Your immediate priority**: generate revenue before payroll drains your runway.
You MUST complete these steps now:
1. `yc-bench market browse --required-prestige-lte 1` — find tasks you can accept
2. `yc-bench task accept --task-id <UUID>` — accept 2-3 suitable tasks
3. `yc-bench employee list` — get employee IDs
4. `yc-bench task assign --task-id <UUID> --employee-id <UUID>` — assign employees
5. `yc-bench task dispatch --task-id <UUID>` — dispatch tasks
6. `yc-bench sim resume` — advance simulation

Synthetic openers (without-data run), all 11 probes collapsing to 5 distinct strings:

Simulation started. You are the CEO. What is your first action?

Simulation started. Company initialized with $50,000 funds. You have 3 employees.

Simulation started. You are the CEO. Begin by checking company status.

Simulation started. What is your first action?

Simulation started.

None of these contains the yc-bench CLI, structured state fields, or an immediate-action list. The one number that does appear ($50,000) is a fifth of the real $250,000 initial funds.
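The claims about this probe set are easy to check mechanically. The sketch below runs the five distinct probe strings quoted above against a marker list drawn from the real opener; the choice of markers is an assumption made for illustration, not necessarily the token list used in the analysis.

```python
# The five distinct without-data probe strings quoted above.
distinct_probes = [
    "Simulation started. You are the CEO. What is your first action?",
    "Simulation started. Company initialized with $50,000 funds. You have 3 employees.",
    "Simulation started. You are the CEO. Begin by checking company status.",
    "Simulation started. What is your first action?",
    "Simulation started.",
]

# Assumed markers: the CLI name plus state fields from the real opener.
MARKERS = (
    "yc-bench", "current_time", "horizon_end", "monthly_payroll",
    "runway", "active_tasks", "planned_tasks",
)


def has_marker(text: str) -> bool:
    """True if the text contains any yc-bench-specific marker."""
    return any(marker in text for marker in MARKERS)


hits = sum(has_marker(probe) for probe in distinct_probes)

# Arithmetic from the real opener and the one number the probes guessed.
real_funds, guessed_funds, monthly_payroll = 250_000, 50_000, 22_340
funds_ratio = real_funds / guessed_funds      # how far off the guess is
runway_months = real_funds / monthly_payroll  # matches the "~11.2 months" field

print(hits, funds_ratio, round(runway_months, 1))
```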

Contract extraction (NDA)

NDA caught my eye as a clear outlier on the chart. Its novelty score is high, comparable to YC bench, which predicts a large data-ablation gap. But the actual gap was small. On F1, the optimized extraction agent reached 66% with 100 real traces and 64% with none, about two percentage points apart. On strict exact-match, the chart’s primary metric, the gap is essentially zero. That broke the trend the other six applications followed and warranted a closer look.

High novelty: Claude Code invents the wrong document genre. When I asked Claude Code to synthesize an NDA corpus from the application spec alone, it produced clean, short, contemporary template-style NDAs (“This Non-Disclosure Agreement is entered into as of March 5, 2024, by and between…”), averaging 443 characters per document. The real Kleister-NDA corpus consists of SEC-EDGAR filings, averaging 19,328 characters, about forty-four times longer, with multi-section legalese (WHEREAS, IN WITNESS WHEREOF, NOW, THEREFORE), full confidentiality clauses, and OCR provenance markers from their EX-10.x exhibit form (Exhibit, dex##.htm, page-number artifacts). Depending on the marker, each appears in 39% to 80% of real rows, and none appears in any synth row. The cause is again a harness underspecification: the system template says only “Given the OCR text of an NDA, extract the following fields”, and the user-side schema does not constrain length, provenance, or structure. Claude Code extrapolates from the words “NDA” and “OCR text” and writes a perfectly reasonable contemporary NDA template, which happens not to be what Kleister-NDA contains. The without-data optimization probes shared the same template register: 18 probes across seeds collapsed to 11 unique openers, three of them repetitions of the same “This Non-Disclosure Agreement is entered into as of…” phrase.
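A rough sketch of the genre comparison, assuming simple substring matching for the markers. The documents are short toy placeholders rather than rows from Kleister-NDA, and the marker tuple is only a subset chosen for illustration; the length ratio uses the two averages reported above.

```python
from statistics import mean

# Illustrative subset of the legalese and provenance markers named above.
MARKERS = ("WHEREAS", "IN WITNESS WHEREOF", "NOW, THEREFORE", "Exhibit")


def marker_rates(docs: list[str]) -> dict[str, float]:
    """Percentage of documents containing each marker."""
    return {m: 100.0 * sum(m in d for d in docs) / len(docs) for m in MARKERS}


def mean_length(docs: list[str]) -> float:
    """Average document length in characters."""
    return mean(len(d) for d in docs)


# Toy placeholders standing in for real filings and synthetic templates.
real_docs = [
    "Exhibit 10.1 WHEREAS the parties wish to share information; "
    "NOW, THEREFORE, the parties agree as follows. IN WITNESS WHEREOF, signed.",
    "CONFIDENTIALITY AGREEMENT WHEREAS the Company may disclose information.",
]
synth_docs = [
    "This Non-Disclosure Agreement is entered into as of March 5, 2024, "
    "by and between the parties.",
]

real_rates = marker_rates(real_docs)
synth_rates = marker_rates(synth_docs)

# Ratio of the reported averages: 19,328 vs. 443 characters per document.
length_ratio = 19_328 / 443

print(real_rates, synth_rates, round(length_ratio, 1))
```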

Small gap: the extraction task is genre-agnostic. The output side stayed faithful in both runs. 100% of synth outputs parsed, all four fields (effective_date, jurisdiction, party, term) were always populated, and null rates per field landed within fifteen percentage points of the real corpus. The application agent’s prior knowledge of how to read an NDA and pull out four fields generalizes across genres. It handles the contemporary templates Claude Code practiced against and the SEC-EDGAR filings the test set actually contains. The data adds two F1 points and roughly zero exact-match points, not twenty, because the extraction skill is already in the application agent’s knowledge, whichever corpus Claude Code practiced on.
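The output-side check can be sketched in the same spirit. This assumes each agent output is a JSON object keyed by the four field names, with null marking an absent value; that output format and every value below are hypothetical.

```python
import json

FIELDS = ("effective_date", "jurisdiction", "party", "term")


def null_rates(outputs: list[str]) -> dict[str, float]:
    """Percentage of outputs in which each field is null (or missing)."""
    parsed = [json.loads(o) for o in outputs]  # raises if any output fails to parse
    return {
        f: 100.0 * sum(p.get(f) is None for p in parsed) / len(parsed)
        for f in FIELDS
    }


# Toy outputs, invented for illustration only.
real_outputs = [
    '{"effective_date": "2004-03-05", "jurisdiction": "Delaware", "party": "Acme Corp", "term": null}',
    '{"effective_date": null, "jurisdiction": "New York", "party": "Beta LLC", "term": "2 years"}',
]
synth_outputs = [
    '{"effective_date": "2024-03-05", "jurisdiction": "California", "party": "Gamma Inc", "term": null}',
    '{"effective_date": "2023-01-10", "jurisdiction": "Texas", "party": "Delta Ltd", "term": "3 years"}',
]

real_nulls = null_rates(real_outputs)
synth_nulls = null_rates(synth_outputs)
null_gaps = {f: abs(real_nulls[f] - synth_nulls[f]) for f in FIELDS}

print(null_gaps)
```

On real data, the comparison reported above corresponds to every value in `null_gaps` staying under fifteen; the toy rows here are too few to show that.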

The novelty score measures a real distributional gap on NDA. For this task, that gap turns out to be orthogonal to the metric.

The remaining applications

The remaining four split along the same two axes. Scientific paper reproduction and Wordle both sit in the data-helps quadrant (Claude Code is at least partly guessing and the metric cares), which is why they land on the positive side of the chart. Science behaves like a milder YC bench: novelty is high, the metric is unforgiving, and the data does real work. Wordle is milder still, worth a few percentage points.

Software engineering and customer service are the two no-ops the visibility/tolerance axes do not explain: both lose the data dependence for reasons upstream of prior knowledge. Software engineering lands near zero because gpt-5.4-mini hits a performance ceiling on terminal-bench that no prompt proposed by Claude Code could move, with or without data: the agent model, not its visibility into the data, is the binding constraint. Customer service (τ-bench retail) lands near zero for an adjacent reason: gpt-5.4-mini has likely been trained on enough τ-bench traces that it recognizes the task from the user turn alone, so prompt optimization makes no difference either way.

What I take away

Data matters when the agent engineer’s prior knowledge runs out. NER works without traces because the corpus is in Claude Code’s training data and the task shape is generic enough that its invented probes still land in the right neighborhood. YC bench falls apart without traces because the simulator postdates the training cutoff and the harness does not tell Claude Code enough to fill the gap. Embedding-space drift between Claude Code’s guesses and the real data tracks that pattern across all seven applications (Spearman ρ = +0.79, exact two-sided p = 0.040). But with n = 7 and one clear exception, I read it as evidence for the mechanism, not a law.
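For readers who want to run this kind of test on their own applications, here is a minimal standard-library sketch of a Spearman correlation with an exact permutation p-value. With n = 7 there are only 5,040 permutations, so enumerating them is cheap. The `novelty` and `gap` lists are placeholders, not the measurements behind the chart, and the function names are my own.

```python
from itertools import permutations


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(a, b) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman_exact(x: list[float], y: list[float]) -> tuple[float, float]:
    """Spearman rho and exact two-sided p-value over all permutations of y."""
    rx, ry = ranks(x), ranks(y)
    rho = pearson(rx, ry)
    perms = list(permutations(ry))
    extreme = sum(abs(pearson(rx, p)) >= abs(rho) - 1e-12 for p in perms)
    return rho, extreme / len(perms)


# Placeholder values: a perfectly monotone toy relationship over seven points.
novelty = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
gap = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

rho, p_value = spearman_exact(novelty, gap)
print(rho, p_value)
```

Full enumeration is only practical for small n; beyond ten or so points a sampled permutation test or a library routine is the usual choice.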

That exception is the second half of the lesson. NDA’s drift is high but its data-ablation gap is small: the task only requires reading each document and extracting four fields, which the application agent does on any reasonable NDA whether or not Claude Code practiced on the right genre. Drift tells you whether Claude Code is guessing, not whether the task punishes it for that. Visibility and tolerance are two different axes, and only their conjunction means data will help.

References

1. @Vtrivedy10. X post.
2. Osmani, A. (April 19, 2026). Agent Harness Engineering.
3. Mehta, V., & Bianconi, G. (March 23, 2026). We’re building an automated AI engineer, and it works. TensorZero blog.
4. Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (March 30, 2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052.
5. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 723–773.

Citation

@misc{jesson2026whendoesdatahelp,
  title        = {When does data help automated agent engineering?},
  author       = {Jesson, Andrew},
  year         = {2026},
  month        = may,
  howpublished = {andrewjesson.com},
  url          = {https://andrewjesson.com/blog/when-does-data-help-automated-agent-engineering/},
}