# Code Graders

Code graders (the assertion type `code-grader`; `code-judge` is still accepted for backward compatibility) are scripts that evaluate agent responses deterministically. Write them in any language: Python, TypeScript, or anything else that runs as an executable.

Code graders communicate via stdin/stdout JSON:

**Input (stdin):**

```json
{
  "input_text": "What is 15 + 27?",
  "criteria": "Correctly calculates 15 + 27 = 42",
  "output_text": "The answer is 42.",
  "expected_output_text": "42"
}
```

**Output (stdout):**

```json
{
  "score": 1.0,
  "assertions": [
    { "text": "Answer contains correct value (42)", "passed": true }
  ]
}
```
| Output field | Type | Description |
| --- | --- | --- |
| `score` | `number` | 0.0 to 1.0 |
| `assertions` | `Array<{ text, passed, evidence? }>` | Per-aspect results with verdict and optional evidence |

**validators/check_answer.py**

```python
import json, sys

data = json.load(sys.stdin)
output_text = data.get("output_text", "")

assertions = []
if "42" in output_text:
    assertions.append({"text": "Output contains correct value (42)", "passed": True})
else:
    assertions.append({"text": "Output does not contain expected value (42)", "passed": False})

passed = sum(1 for a in assertions if a["passed"])
score = passed / len(assertions) if assertions else 0.0

print(json.dumps({
    "score": score,
    "assertions": assertions,
}))
```
**validators/check_answer.ts**

```typescript
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const outputText: string = data.output_text ?? "";

const assertions: Array<{ text: string; passed: boolean }> = [];
if (outputText.includes("42")) {
  assertions.push({ text: "Output contains correct value (42)", passed: true });
} else {
  assertions.push({ text: "Output does not contain expected value (42)", passed: false });
}

const passed = assertions.filter(a => a.passed).length;
console.log(JSON.stringify({
  score: passed > 0 ? 1.0 : 0.0,
  assertions,
  reasoning: `Passed ${passed} check(s)`,
}));
```
Wire the script into an eval as a `code-grader` assertion:

```yaml
assertions:
  - name: my_validator
    type: code-grader
    command: [./validators/check_answer.py]
```

The `@agentv/eval` package provides a declarative API with automatic stdin/stdout handling. Use `defineCodeGrader` (formerly `defineCodeJudge`) to skip the boilerplate:

```typescript
#!/usr/bin/env bun
import { defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(({ outputText, criteria }) => {
  const assertions: Array<{ text: string; passed: boolean }> = [];
  if (outputText.includes(criteria)) {
    assertions.push({ text: 'Output matches expected outcome', passed: true });
  } else {
    assertions.push({ text: 'Output does not match expected outcome', passed: false });
  }
  const passed = assertions.filter(a => a.passed).length;
  return {
    score: assertions.length === 0 ? 0 : passed / assertions.length,
    assertions,
  };
});
```

SDK exports: `defineCodeGrader`, `Message`, `ToolCall`, `TraceSummary`, `CodeGraderInput`, `CodeGraderResult`

Code graders can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).

Add a target block to the evaluator config:

```yaml
assertions:
  - name: contextual-precision
    type: code-grader
    command: [bun, scripts/contextual-precision.ts]
    target:
      max_calls: 10  # Default: 50
```

Use createTargetClient from the SDK:

```typescript
#!/usr/bin/env bun
import { createTargetClient, defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(async ({ inputText, outputText }) => {
  const target = createTargetClient();
  if (!target) {
    return { score: 0, assertions: [{ text: 'Target not configured', passed: false }] };
  }
  const response = await target.invoke({
    question: `Is this relevant to: ${inputText}? Response: ${outputText}`,
    systemPrompt: 'Respond with JSON: { "relevant": true/false }',
  });
  const result = JSON.parse(response.rawText ?? '{}');
  return { score: result.relevant ? 1.0 : 0.0 };
});
```

Use `target.invokeBatch(requests)` to issue multiple calls in parallel.
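
As a sketch of how one batch round trip can drive several assertions at once: the request and response shapes below mirror the `invoke` example above, but treat them as assumptions rather than the exact SDK types.

```typescript
// Hypothetical batch-judging helper. BatchRequest/BatchResponse mirror the
// { question, systemPrompt } / { rawText } shapes used with target.invoke above.
type BatchRequest = { question: string; systemPrompt: string };
type BatchResponse = { rawText?: string };
type BatchClient = { invokeBatch(reqs: BatchRequest[]): Promise<BatchResponse[]> };

async function judgeAspects(
  target: BatchClient,
  outputText: string,
  aspects: string[],
): Promise<{ score: number; assertions: { text: string; passed: boolean }[] }> {
  // One LLM call per aspect, issued as a single parallel batch.
  const requests = aspects.map((aspect) => ({
    question: `Does this response satisfy "${aspect}"?\n\n${outputText}`,
    systemPrompt: 'Respond with JSON: { "passed": true/false }',
  }));
  const responses = await target.invokeBatch(requests);
  const assertions = responses.map((r, i) => ({
    text: aspects[i],
    passed: Boolean(JSON.parse(r.rawText ?? "{}").passed),
  }));
  const passed = assertions.filter((a) => a.passed).length;
  return { score: aspects.length ? passed / aspects.length : 0, assertions };
}
```

Inside a real grader, `target` would come from `createTargetClient()` as shown above.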

Environment variables (set automatically when a target is configured):

| Variable | Description |
| --- | --- |
| `AGENTV_TARGET_PROXY_URL` | Local proxy URL |
| `AGENTV_TARGET_PROXY_TOKEN` | Bearer token for authentication |
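
If you need to bypass the SDK, the proxy can in principle be called with plain `fetch` using these variables. The request body shape below is an assumption, and `createTargetClient` remains the supported path:

```typescript
// Sketch: talking to the target proxy directly via the documented env vars.
// The POST body shape ({ question }) is an assumption, not a documented contract.
function proxyHeaders(
  token: string = process.env.AGENTV_TARGET_PROXY_TOKEN ?? "",
): Record<string, string> {
  return {
    "Content-Type": "application/json",
    Authorization: `Bearer ${token}`, // bearer auth, per the table above
  };
}

async function rawInvoke(question: string): Promise<string> {
  const url = process.env.AGENTV_TARGET_PROXY_URL;
  if (!url) throw new Error("target proxy not configured");
  const res = await fetch(url, {
    method: "POST",
    headers: proxyHeaders(),
    body: JSON.stringify({ question }), // assumed request shape
  });
  return res.text();
}
```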

Beyond the basic text fields (`input_text`, `output_text`, `expected_output_text`, `criteria`), code graders receive additional structured context:

| Field | Type | Description |
| --- | --- | --- |
| `input_files` | `string[]` | Paths to input files referenced in the eval |
| `input` | `Message[]` | Full resolved input message array |
| `expected_output` | `Message[]` | Expected agent behavior, including tool calls |
| `output` | `Message[]` | Actual agent execution trace with tool calls |
| `trace` | `TraceSummary` | Lightweight execution metrics (tool calls, errors) |
| `token_usage` | `{ input, output }` | Token consumption |
| `cost_usd` | `number` | Estimated cost in USD |
| `duration_ms` | `number` | Total execution duration |
| `start_time` | `string` | ISO timestamp of the first event |
| `end_time` | `string` | ISO timestamp of the last event |
| `file_changes` | `string \| null` | Unified diff of workspace file changes (when `workspace_template` is configured) |
| `workspace_path` | `string \| null` | Absolute path to the workspace directory (when `workspace_template` is configured) |
The `trace` summary looks like:

```json
{
  "event_count": 5,
  "tool_names": ["fetch", "search"],
  "tool_calls_by_name": { "search": 2, "fetch": 1 },
  "error_count": 0,
  "llm_call_count": 2
}
```

| Field | Type | Description |
| --- | --- | --- |
| `event_count` | `number` | Total tool invocations |
| `tool_names` | `string[]` | Unique tool names used |
| `tool_calls_by_name` | `Record<string, number>` | Count per tool |
| `error_count` | `number` | Failed tool calls |
| `llm_call_count` | `number` | Number of LLM calls (assistant messages) |

Use `expected_output` for the retrieval context in RAG evals (tool calls with outputs) and `output` for the actual agent execution trace from live runs.
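
A grader can also turn the trace summary into assertions. The policy below (require certain tools, forbid tool errors) and the helper name `scoreTrace` are hypothetical:

```typescript
// Hypothetical trace-based grading: assert on the TraceSummary fields
// documented above rather than on the response text.
type TraceSummary = {
  event_count: number;
  tool_names: string[];
  tool_calls_by_name: Record<string, number>;
  error_count: number;
  llm_call_count: number;
};

function scoreTrace(trace: TraceSummary, requiredTools: string[]) {
  const assertions = [
    // No failed tool calls anywhere in the run.
    { text: "No failed tool calls", passed: trace.error_count === 0 },
    // Each required tool was invoked at least once.
    ...requiredTools.map((tool) => ({
      text: `Agent used the ${tool} tool`,
      passed: (trace.tool_calls_by_name[tool] ?? 0) > 0,
    })),
  ];
  const passed = assertions.filter((a) => a.passed).length;
  return { score: passed / assertions.length, assertions };
}
```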

When `workspace_template` is configured on a target, code graders receive the workspace path in two ways:

1. **JSON payload:** the `workspace_path` field in the stdin input
2. **Environment variable:** `AGENTV_WORKSPACE_PATH`

This enables functional grading: running commands like `npm test`, `pytest`, or `cargo test` directly in the agent's workspace.

```typescript
#!/usr/bin/env bun
import { readFileSync } from "fs";
import { execFileSync } from "child_process";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const cwd = input.workspace_path;
const assertions: Array<{ text: string; passed: boolean }> = [];

// Stage 1: Install dependencies
try {
  execFileSync("npm", ["install"], { cwd, stdio: "pipe" });
  assertions.push({ text: "npm install passed", passed: true });
} catch {
  assertions.push({ text: "npm install failed", passed: false });
}

// Stage 2: Typecheck
try {
  execFileSync("npx", ["tsc", "--noEmit"], { cwd, stdio: "pipe" });
  assertions.push({ text: "typecheck passed", passed: true });
} catch {
  assertions.push({ text: "typecheck failed", passed: false });
}

// Stage 3: Run tests
try {
  execFileSync("npm", ["test"], { cwd, stdio: "pipe" });
  assertions.push({ text: "tests passed", passed: true });
} catch {
  assertions.push({ text: "tests failed", passed: false });
}

const passed = assertions.filter(a => a.passed).length;
console.log(JSON.stringify({
  score: assertions.length > 0 ? passed / assertions.length : 0,
  assertions,
}));
```
**targets.yaml**

```yaml
targets:
  - name: my_agent
    provider: cli
    command: "my-agent --task {INPUT_FILE} --output {OUTPUT_FILE}"
    workspace_template: ./workspace-template
```

**dataset.eval.yaml**

```yaml
tests:
  - id: implement-feature
    criteria: Agent implements the feature correctly
    input: "Implement the TODO functions in src/index.ts"
    assertions:
      - name: functional-check
        type: code-grader
        command: [bun, scripts/functional-check.ts]
```

See `examples/features/functional-grading/` for a complete working example.

Run a grader from `.agentv/graders/` by name, with no manual JSON piping required:

```sh
# Pass agent output and input directly
agentv eval assert rouge-score --agent-output "The fox jumps over the dog" --agent-input "Summarise this"

# Or pass a JSON file with { output, input } fields
agentv eval assert rouge-score --file result.json
```

The command:

1. Discovers the grader script by walking up directories looking for `.agentv/graders/<name>.{ts,js,mts,mjs}`
2. Passes `{ output_text, output, input, input_text }` to the script via stdin
3. Prints the grader's JSON result to stdout
4. Exits 0 if `score >= 0.5`, and 1 otherwise
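
For the `--file` form, the JSON file uses the `{ output, input }` field names; a hypothetical `result.json` might look like:

```json
{
  "output": "The fox jumps over the dog",
  "input": "Summarise this"
}
```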

This is the same interface that agent-orchestrated evals use: the `EVAL.yaml` transpiler emits `agentv eval assert` instructions for code graders so external grading agents can run them directly.

Pipe JSON directly to the grader script for full control:

```sh
echo '{"input_text":"What is 15 + 27?","criteria":"Correctly calculates 15 + 27 = 42","output_text":"The answer is 42.","expected_output_text":"42"}' | python validators/check_answer.py
```
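
Since `check_answer.py` checks for `"42"` in `output_text`, any payload whose output contains it prints (modulo whitespace):

```json
{"score": 1.0, "assertions": [{"text": "Output contains correct value (42)", "passed": true}]}
```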