Benchmarking
This page explains how to create and run benchmarks to evaluate agent performance.
What are Benchmarks?
Benchmarks in Prompt Spec are structured test cases that evaluate an agent’s performance on specific tasks. They allow you to:
- Test how well an agent responds to particular queries
- Verify that agents use tools correctly
- Evaluate responses based on custom criteria
- Compare performance across different agent configurations
Creating Benchmarks
Benchmarks are defined in the benchmarks section of an agent specification YAML file:
benchmarks:
  - name: "Password Reset Inquiry"
    description: "Tests how the agent handles a password reset request"
    messages:
      - role: "user"
        content: "I need to reset my password. Can you help?"
    evaluationCriteria:
      - key: "helpfulness"
        description: "Does the agent provide a clear solution?"
        type: "boolean"
      - key: "completeness"
        description: "Does the agent cover all necessary steps?"
        type: "scale"
        min: 1
        max: 5
Benchmark Components
Messages
The messages section defines the conversation that will be sent to the agent:
messages:
  - role: "user"
    content: "What's the status of my order #12345?"
  - role: "assistant"
    content: "I'll check that for you right away."
  - role: "user"
    content: "Thanks, I'm also wondering about shipping options."
Expected Tool Calls
The expectedToolCalls section specifies which tools the agent should use and with what arguments:
expectedToolCalls:
  - tool: "checkOrderStatus"
    expectedArgs: { orderId: "12345" }
  - tool: "getShippingOptions"
    expectedArgs: { region: "any" }
Evaluation Criteria
The evaluationCriteria section defines how the agent’s performance will be evaluated:
evaluationCriteria:
  - key: "correctTool"
    description: "Did the agent use the correct tool?"
    type: "boolean"
  - key: "informationCompleteness"
    description: "How complete was the information provided?"
    type: "scale"
    min: 1
    max: 5
  - key: "responseTime"
    description: "Response time metric"
    type: "custom"
    evaluator: "timeEvaluator"
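The spec above only refers to the custom evaluator by name; how "timeEvaluator" is implemented and wired up is not covered on this page. As a rough illustration only (the function signature, the context fields, and the result shape below are assumptions, not documented Prompt Spec APIs), a latency-based evaluator could look like this:
// Hypothetical custom evaluator sketch. The signature, the context object,
// and the returned shape are assumptions for illustration, not a documented API.
type EvaluatorResult = { score: number };

export async function timeEvaluator(context: {
  responseTimeMs: number; // assumed: latency reported by the test harness
}): Promise<EvaluatorResult> {
  // Map latency onto a 0..1 score: under 2 s scores 1, over 10 s scores 0,
  // linear in between.
  const score = Math.max(0, Math.min(1, (10_000 - context.responseTimeMs) / 8_000));
  return { score };
}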
Advanced Benchmark Features
Conditional Benchmarks
You can create benchmarks that conditionally run based on the results of other benchmarks:
benchmarks:
  - name: "Initial Query"
    id: "initial"
    # ... benchmark definition ...
  - name: "Follow-up Query"
    runCondition: "benchmarks.initial.result.score > 0.8"
    # ... benchmark definition ...
Random Variations
You can define benchmarks with random variations to test robustness:
benchmarks:
  - name: "Product Query"
    variations:
      - content: "Tell me about your premium plan."
      - content: "What features are in your premium tier?"
      - content: "What do I get with the premium subscription?"
    variationStrategy: "random"
    # ... rest of benchmark ...
Running Benchmarks
Using the CLI
To run benchmarks using the command line:
# Run all benchmarks in a specification
prompt-spec test path/to/spec.yaml
# Run a specific benchmark
prompt-spec test path/to/spec.yaml --benchmark "Password Reset Inquiry"
# Run with detailed output
prompt-spec test path/to/spec.yaml --verbose
Programmatic API
To run benchmarks programmatically:
import { testSpec } from "prompt-spec";
const results = await testSpec("path/to/spec.yaml", {
  benchmarks: ["Password Reset Inquiry"], // Optional: specific benchmarks
  verbose: true, // Optional: detailed output
  outputPath: "./results.json", // Optional: save results
});
console.log(`Overall score: ${results.score}`);
Analyzing Results
Benchmark results include:
- Overall score
- Individual benchmark scores
- Detailed evaluation for each criterion
- Tool usage analysis
- Response content analysis
Example results structure:
{
  "score": 0.85,
  "benchmarks": [
    {
      "name": "Password Reset Inquiry",
      "score": 0.85,
      "criteria": [
        { "key": "helpfulness", "score": 1, "max": 1 },
        { "key": "completeness", "score": 4, "max": 5 }
      ],
      "conversationLength": 3,
      "toolCalls": [{ "tool": "resetPassword", "score": 1 }]
    }
  ]
}
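Beyond the overall score, you can walk this structure programmatically, for example to flag weak criteria. A minimal sketch, assuming the object returned by testSpec matches the JSON shape shown above (the field names are taken from that example):
import { testSpec } from "prompt-spec";

const results = await testSpec("path/to/spec.yaml");

// Print each benchmark's score and flag criteria scoring below 80% of their maximum.
for (const benchmark of results.benchmarks) {
  console.log(`${benchmark.name}: ${benchmark.score}`);
  for (const criterion of benchmark.criteria) {
    if (criterion.score / criterion.max < 0.8) {
      console.warn(`  needs attention: ${criterion.key} (${criterion.score}/${criterion.max})`);
    }
  }
}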
Next Steps
- Learn about Optimization using benchmark results
- Explore Examples of different benchmark configurations
- Check out the CLI Reference for all available commands