Introducing BenchmarkLlm: A Comprehensive Framework for Benchmarking LLM Agents
Learn how to systematically benchmark, evaluate, and compare LLM-based agents and workflows with BenchmarkLlm - featuring metrics collection, LLM-as-Judge evaluation, and comparative analysis.
When building LLM-based agents and workflows, one of the biggest challenges is understanding how different approaches compare. Is a multi-agent pipeline better than a single agent? Does prompt chaining improve quality? How much does it cost in tokens and latency?
BenchmarkLlm is a comprehensive benchmarking and evaluation framework that answers these questions systematically. It provides attribute-based benchmark discovery, detailed metrics collection, LLM-as-Judge quality evaluation, and comparative analysis - all in a clean, extensible package.
Why BenchmarkLlm?
Traditional benchmarking approaches for LLM applications often fall short:
- Manual comparison is time-consuming and inconsistent
- Token counting requires instrumenting every call
- Quality assessment is subjective without structured evaluation
- Side-by-side analysis is difficult to reproduce
BenchmarkLlm solves these problems with a unified framework that handles discovery, execution, metrics collection, evaluation, and reporting automatically.
Key Features
1. Attribute-Based Benchmark Discovery
Define benchmarks using simple attributes - no manual registration required:
[WorkflowBenchmark("content-generation",
Prompt = "The benefits of test-driven development",
Description = "Compare different content generation approaches")]
public class ContentBenchmarks
{
[BenchmarkLlm("multi-agent", Description = "3-agent pipeline")]
public async Task<BenchmarkOutput> MultiAgent(string prompt)
{
var (workflow, agentModels) = MultiAgentPipeline.Create();
var content = await WorkflowRunner.RunAsync(workflow, prompt);
return BenchmarkOutput.WithModels(content, agentModels);
}
[BenchmarkLlm("single-agent", Baseline = true, Description = "Single agent baseline")]
public async Task<BenchmarkOutput> SingleAgent(string prompt)
{
var (workflow, agentModels) = SingleAgentPipeline.Create();
var content = await WorkflowRunner.RunAsync(workflow, prompt);
return BenchmarkOutput.WithModels(content, agentModels);
}
}
The [WorkflowBenchmark] attribute marks a class as containing benchmarks and defines the shared prompt. Individual methods are marked with [BenchmarkLlm], and one can be designated as the Baseline for comparison.
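Under the hood, discovery is reflection-based: the runner scans for these attributes automatically. Conceptually it boils down to something like the sketch below, which assumes the standard Attribute-suffix naming for the attribute classes and is an illustration of the idea, not the framework's internal code:
using System.Reflection;
// Sketch: find every class marked [WorkflowBenchmark] and every method
// inside it marked [BenchmarkLlm]. Attribute type names are assumed to
// follow the usual C# "Attribute" suffix convention.
var benchmarkMethods =
    from type in Assembly.GetExecutingAssembly().GetTypes()
    where type.GetCustomAttribute<WorkflowBenchmarkAttribute>() is not null
    from method in type.GetMethods(BindingFlags.Public | BindingFlags.Instance)
    where method.GetCustomAttribute<BenchmarkLlmAttribute>() is not null
    select (Type: type, Method: method);
foreach (var (type, method) in benchmarkMethods)
    Console.WriteLine($"{type.Name}.{method.Name}");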
2. Automatic Metrics Collection
BenchmarkLlm transparently wraps your chat clients to collect detailed metrics without modifying your code:
- API call count - How many LLM calls were made
- Token usage - Input and output tokens per call and in aggregate
- Latency - Execution time for each call and total duration
- Response details - Character count, streaming status, timestamps
The MetricsCollectingChatClient decorator intercepts every call (both standard and streaming responses) and records everything automatically.
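To make the decorator idea concrete, here is a minimal sketch of the pattern using a simplified stand-in interface. The real framework wraps your actual chat client abstraction; the types below are illustrative only:
using System.Diagnostics;
// Simplified stand-ins for illustration; not the framework's actual types.
public interface ISimpleChatClient
{
    Task<string> CompleteAsync(string prompt, CancellationToken ct = default);
}
public sealed record CallMetrics(TimeSpan Duration, int ResponseChars, DateTimeOffset Timestamp);
public sealed class MetricsCollectingChatClientSketch : ISimpleChatClient
{
    private readonly ISimpleChatClient _inner;
    private readonly List<CallMetrics> _calls = new();

    public MetricsCollectingChatClientSketch(ISimpleChatClient inner) => _inner = inner;

    public IReadOnlyList<CallMetrics> Calls => _calls;

    public async Task<string> CompleteAsync(string prompt, CancellationToken ct = default)
    {
        var stopwatch = Stopwatch.StartNew();
        var response = await _inner.CompleteAsync(prompt, ct); // delegate to the wrapped client
        stopwatch.Stop();

        // Record per-call metrics; the caller's code is unchanged.
        _calls.Add(new CallMetrics(stopwatch.Elapsed, response.Length, DateTimeOffset.UtcNow));
        return response;
    }
}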
3. LLM-as-Judge Quality Evaluation
Quality assessment is powered by an LLM judge that evaluates content across 8 dimensions on a 1-5 scale:
| Dimension | What It Measures |
|---|---|
| Completeness | Coverage of the topic |
| Structure | Organization and logical flow |
| Accuracy | Factual correctness |
| Engagement | Readability and writing quality |
| Evidence Quality | Statistics, examples, citations |
| Balance | Coverage of different perspectives |
| Actionability | Practical implementation guidance |
| Depth | Analysis depth vs surface-level summary |
Each dimension includes a reasoning explanation, giving you insight into why scores were assigned.
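To picture what a judge result looks like, here is a hypothetical shape for per-dimension scores with reasoning. The type names are assumptions for illustration, not the framework's actual evaluation API:
using System.Linq;
// Hypothetical result shapes; names are illustrative only.
public sealed record DimensionScore(string Dimension, int Score, string Reasoning);
public sealed record QualityEvaluation(IReadOnlyList<DimensionScore> Dimensions)
{
    // Simple aggregate: the mean of the eight 1-5 dimension scores.
    public double OverallScore => Dimensions.Average(d => d.Score);
}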
4. Comparative Analysis
When running multiple benchmarks, BenchmarkLlm generates comprehensive comparative analysis:
var settings = new BenchmarkLlmSettings
{
Model = "gpt-4o",
EvaluationModel = "gpt-4o",
Evaluate = true,
Exporters = ["console", "markdown", "analysis"]
};
await BenchmarkLlmHost.RunAsync(settings);
The comparative evaluator:
- Collects metrics across all benchmarks
- Identifies strengths and weaknesses for each approach
- Generates an LLM-powered verdict
- Calculates delta metrics against the baseline
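The delta step is essentially a percentage change per metric relative to the baseline. A rough sketch of the idea, using simplified metric types that are assumptions rather than the framework's actual model:
using System.Linq;
// Simplified metric shape, assumed for illustration.
public sealed record RunMetrics(string Name, int TotalTokens, double DurationMs);
public static class BaselineDeltas
{
    // Percentage change of each run relative to the baseline benchmark.
    public static IEnumerable<string> Compute(RunMetrics baseline, IEnumerable<RunMetrics> runs) =>
        runs.Select(run =>
            $"{run.Name}: tokens {Percent(run.TotalTokens, baseline.TotalTokens)}, " +
            $"duration {Percent(run.DurationMs, baseline.DurationMs)}");

    private static string Percent(double current, double baseline) =>
        $"{100.0 * (current - baseline) / baseline:+0.0;-0.0}%";
}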
Running Benchmarks
Basic Usage
var settings = new BenchmarkLlmSettings
{
Model = "gpt-4o", // Model for benchmarks
Filter = "*", // Run all benchmarks (glob pattern)
ArtifactsPath = "./runs", // Where to save results
Exporters = ["console"] // Output format
};
await BenchmarkLlmHost.RunAsync(settings);
With Quality Evaluation
var settings = new BenchmarkLlmSettings
{
Model = "gpt-4o",
EvaluationModel = "gpt-4o-mini", // Judge model (can be different)
Evaluate = true,
Exporters = ["console", "markdown", "json"]
};
await BenchmarkLlmHost.RunAsync(settings);
Filtering Benchmarks
Use glob patterns to run specific benchmarks:
// Run only multi-agent benchmarks
settings.Filter = "*multi*";
// Run all benchmarks in a category
settings.Filter = "content-generation/*";
// Run a specific benchmark
settings.Filter = "content-generation/single-agent";
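Conceptually, a glob filter like this reduces to a regular expression. The sketch below illustrates the idea; it is not BenchmarkLlm's actual matcher:
using System.Text.RegularExpressions;
// Minimal glob matcher sketch: '*' matches any sequence of characters.
static bool MatchesFilter(string benchmarkId, string filter)
{
    var pattern = "^" + Regex.Escape(filter).Replace(@"\*", ".*") + "$";
    return Regex.IsMatch(benchmarkId, pattern, RegexOptions.IgnoreCase);
}
// "content-generation/multi-agent" matches both "*multi*" and "content-generation/*".
Console.WriteLine(MatchesFilter("content-generation/multi-agent", "*multi*"));               // True
Console.WriteLine(MatchesFilter("content-generation/multi-agent", "content-generation/*"));  // True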
Output Structure
Each benchmark run creates a structured output directory:
./runs/2025-01-06_120000_test-driven-development/
├── run-config.json # Input configuration
├── environment.json # Runtime environment details
├── content-generation/
│ ├── multi-agent/
│ │ ├── output.md # Generated content
│ │ └── metrics.json # Detailed metrics
│ └── single-agent/
│ ├── output.md
│ └── metrics.json
├── results.json # All results in JSON
├── comparison.md # Comparison report
└── analysis.md # LLM comparative analysis
Export Formats
BenchmarkLlm supports multiple export formats:
| Exporter | Output | Use Case |
|---|---|---|
| console | Formatted table | Quick review during development |
| markdown | Human-readable report | Documentation and sharing |
| json | Structured data | Programmatic analysis and CI integration |
| analysis | LLM comparative analysis | Deep insights on approach differences |
Architectural Patterns
BenchmarkLlm uses several design patterns that make it extensible and maintainable:
- Attribute Discovery - Reflection-based scanning for automatic registration
- Decorator Pattern - MetricsCollectingChatClient wraps clients non-intrusively
- Strategy Pattern - IResultExporter enables pluggable output formats
- LLM-as-Judge - Consistent, reproducible quality assessment
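To illustrate the strategy pattern at work, a custom exporter might look roughly like the sketch below. The interface and result types here are assumed shapes for illustration; consult the framework for the real IResultExporter signature:
// Hypothetical interface and result shapes, assumed for illustration.
public interface IResultExporterSketch
{
    Task ExportAsync(IReadOnlyList<BenchmarkResultSketch> results, string artifactsPath);
}
public sealed record BenchmarkResultSketch(string Name, int TotalTokens, TimeSpan Duration);
// A minimal CSV exporter: one row per benchmark with headline metrics.
public sealed class CsvExporterSketch : IResultExporterSketch
{
    public async Task ExportAsync(IReadOnlyList<BenchmarkResultSketch> results, string artifactsPath)
    {
        var lines = new List<string> { "name,total_tokens,duration_ms" };
        lines.AddRange(results.Select(r => $"{r.Name},{r.TotalTokens},{r.Duration.TotalMilliseconds:F0}"));
        await File.WriteAllLinesAsync(Path.Combine(artifactsPath, "results.csv"), lines);
    }
}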
Integration with DotNetAgents.Infrastructure
BenchmarkLlm integrates seamlessly with the DotNetAgents infrastructure:
// The benchmark runner automatically wraps ChatClientFactory
// All clients created during benchmark execution are instrumented
var chatClient = ChatClientFactory.Create("gpt-4o");
This means your existing workflows work without modification - just add the benchmark attributes and run.
Re-Evaluating Previous Runs
Need to re-evaluate a previous run with a different judge model? BenchmarkLlm supports that:
await BenchmarkLlmHost.EvaluateRunAsync(
runPath: "./runs/2025-01-06_120000_test-driven-development",
model: "claude-3-opus"
);
This reads the saved outputs and generates new quality scores without re-running the benchmarks.
Best Practices
1. Define a Clear Baseline
Always mark one benchmark as baseline for meaningful comparisons:
[BenchmarkLlm("simple-approach", Baseline = true)]
2. Use Consistent Prompts
The [WorkflowBenchmark] attribute ensures all benchmarks in a class use the same prompt, enabling fair comparison.
3. Return Agent Metadata
Use BenchmarkOutput.WithModels() to track which models each agent used:
return BenchmarkOutput.WithModels(content, new Dictionary<string, string>
{
["Researcher"] = "gpt-4o",
["Writer"] = "gpt-4o-mini"
});
4. Run Multiple Times
LLM outputs are non-deterministic. Consider running benchmarks multiple times and analyzing variance.
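A rough sketch of that practice, assuming each run writes its artifacts to its own timestamped directory as shown in the output structure above:
// Run the same suite several times, then compare variance across runs.
var settings = new BenchmarkLlmSettings
{
    Model = "gpt-4o",
    EvaluationModel = "gpt-4o-mini",
    Evaluate = true,
    Exporters = ["json"]
};
for (var i = 0; i < 5; i++)
{
    await BenchmarkLlmHost.RunAsync(settings);
}
// Illustrative helper: given one headline score per run (e.g. the judge's
// overall score), report mean and standard deviation across runs.
static (double Mean, double StdDev) Summarize(IReadOnlyList<double> scores)
{
    var mean = scores.Average();
    var variance = scores.Sum(s => Math.Pow(s - mean, 2)) / scores.Count;
    return (mean, Math.Sqrt(variance));
}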
5. Use Different Evaluation Models
Try different judge models to reduce evaluation bias. A cheaper model like gpt-4o-mini often works well for evaluation.
Conclusion
BenchmarkLlm provides a comprehensive solution for understanding how your LLM agents and workflows perform. With automatic metrics collection, structured quality evaluation, and comparative analysis, you can make data-driven decisions about which approaches work best for your use cases.
Whether you’re comparing prompt engineering strategies, evaluating multi-agent architectures, or optimizing for cost and latency, BenchmarkLlm gives you the tools to measure what matters.
Get started by adding [WorkflowBenchmark] and [BenchmarkLlm] attributes to your existing workflows and run your first benchmark today.