Benchmarking
This page explains how to create and run benchmarks to evaluate agent performance.
What are Benchmarks?
Benchmarks in Prompt Spec are structured test cases that evaluate an agent’s performance on specific tasks. They allow you to:
- Test how well an agent responds to particular queries
- Verify that agents use tools correctly
- Evaluate responses based on custom criteria
- Compare performance across different agent configurations
Creating Benchmarks
Benchmarks are defined in the benchmarks section of an agent specification YAML file:
benchmarks:
  - name: "Password Reset Inquiry"
    description: "Tests how the agent handles a password reset request"
    messages:
      - role: "user"
        content: "I need to reset my password. Can you help?"
    evaluationCriteria:
      - key: "helpfulness"
        description: "Does the agent provide a clear solution?"
        type: "boolean"
      - key: "completeness"
        description: "Does the agent cover all necessary steps?"
        type: "scale"
        min: 1
        max: 5
Benchmark Components
Messages
The messages section defines the conversation that will be sent to the agent:
messages:
  - role: "user"
    content: "What's the status of my order #12345?"
  - role: "assistant"
    content: "I'll check that for you right away."
  - role: "user"
    content: "Thanks, I'm also wondering about shipping options."
Expected Tool Calls
The expectedToolCalls section specifies which tools the agent should use and with what arguments:
expectedToolCalls:
  - tool: "checkOrderStatus"
    expectedArgs: { orderId: "12345" }
  - tool: "getShippingOptions"
    expectedArgs: { region: "any" }
Evaluation Criteria
The evaluationCriteria section defines how the agent’s performance will be evaluated:
evaluationCriteria:
  - key: "correctTool"
    description: "Did the agent use the correct tool?"
    type: "boolean"
  - key: "informationCompleteness"
    description: "How complete was the information provided?"
    type: "scale"
    min: 1
    max: 5
  - key: "responseTime"
    description: "Response time metric"
    type: "custom"
    evaluator: "timeEvaluator"
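The spec above only refers to the custom evaluator by name; how "timeEvaluator" is implemented and wired up is not covered on this page. As a rough illustration only (the function signature, the context fields, and the result shape below are assumptions, not documented Prompt Spec APIs), a latency-based evaluator could look like this:
// Hypothetical custom evaluator sketch. The signature, the context object,
// and the returned shape are assumptions for illustration, not a documented API.
type EvaluatorResult = { score: number };

export async function timeEvaluator(context: {
  responseTimeMs: number; // assumed: latency reported by the test harness
}): Promise<EvaluatorResult> {
  // Map latency onto a 0..1 score: under 2 s scores 1, over 10 s scores 0,
  // linear in between.
  const score = Math.max(0, Math.min(1, (10_000 - context.responseTimeMs) / 8_000));
  return { score };
}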
Advanced Benchmark Features
Conditional Benchmarks
You can create benchmarks that conditionally run based on the results of other benchmarks:
benchmarks:
  - name: "Initial Query"
    id: "initial"
    # ... benchmark definition ...
  - name: "Follow-up Query"
    runCondition: "benchmarks.initial.result.score > 0.8"
    # ... benchmark definition ...
Random Variations
You can define benchmarks with random variations to test robustness:
benchmarks:
  - name: "Product Query"
    variations:
      - content: "Tell me about your premium plan."
      - content: "What features are in your premium tier?"
      - content: "What do I get with the premium subscription?"
    variationStrategy: "random"
    # ... rest of benchmark ...
Running Benchmarks
Using the CLI
To run benchmarks using the command line:
# Run all benchmarks in a specification
prompt-spec test path/to/spec.yaml
# Run a specific benchmark
prompt-spec test path/to/spec.yaml --benchmark "Password Reset Inquiry"
# Run with detailed output
prompt-spec test path/to/spec.yaml --verbose
Programmatic API
To run benchmarks programmatically:
import { testSpec } from "prompt-spec";
const results = await testSpec("path/to/spec.yaml", {
  benchmarks: ["Password Reset Inquiry"], // Optional: specific benchmarks
  verbose: true, // Optional: detailed output
  outputPath: "./results.json", // Optional: save results
});
console.log(`Overall score: ${results.score}`);
Analyzing Results
Benchmark results include:
- Overall score
- Individual benchmark scores
- Detailed evaluation for each criterion
- Tool usage analysis
- Response content analysis
Example results structure:
{
  "score": 0.85,
  "benchmarks": [
    {
      "name": "Password Reset Inquiry",
      "score": 0.85,
      "criteria": [
        { "key": "helpfulness", "score": 1, "max": 1 },
        { "key": "completeness", "score": 4, "max": 5 }
      ],
      "conversationLength": 3,
      "toolCalls": [{ "tool": "resetPassword", "score": 1 }]
    }
  ]
}
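Beyond the overall score, you can walk this structure programmatically, for example to flag weak criteria. A minimal sketch, assuming the object returned by testSpec matches the JSON shape shown above (the field names are taken from that example):
import { testSpec } from "prompt-spec";

const results = await testSpec("path/to/spec.yaml");

// Print each benchmark's score and flag criteria scoring below 80% of their maximum.
for (const benchmark of results.benchmarks) {
  console.log(`${benchmark.name}: ${benchmark.score}`);
  for (const criterion of benchmark.criteria) {
    if (criterion.score / criterion.max < 0.8) {
      console.warn(`  needs attention: ${criterion.key} (${criterion.score}/${criterion.max})`);
    }
  }
}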
Next Steps
- Learn about Optimization using benchmark results
- Explore Examples of different benchmark configurations
- Check out the CLI Reference for all available commands