Introducing BenchmarkLlm: A Comprehensive Framework for Benchmarking LLM Agents
Learn how to systematically benchmark, evaluate, and compare LLM-based agents and workflows with BenchmarkLlm - featuring metrics collection, LLM-as-Judge evaluation, and comparative analysis.
When building LLM-based agents and workflows, one of the biggest challenges is understanding how different approaches compare. Is a multi-agent pipeline better than a single agent? Does prompt chaining improve quality? How much does it cost in tokens and latency?
BenchmarkLlm is a comprehensive benchmarking and evaluation framework that answers these questions systematically. It provides attribute-based benchmark discovery, detailed metrics collection, LLM-as-Judge quality evaluation, and comparative analysis - all in a clean, extensible package.
Why BenchmarkLlm?
Traditional benchmarking approaches for LLM applications often fall short:
- Manual comparison is time-consuming and inconsistent
- Token counting requires instrumenting every call
- Quality assessment is subjective without structured evaluation
- Side-by-side analysis is difficult to reproduce
BenchmarkLlm solves these problems with a unified framework that handles discovery, execution, metrics collection, evaluation, and reporting automatically.
Key Features
1. Attribute-Based Benchmark Discovery
Define benchmarks using simple attributes - no manual registration required:
[WorkflowBenchmark("content-generation", Description = "Compare content generation approaches")]
public class ContentBenchmarks
{
    [BenchmarkLlm("multi-agent", Description = "3-agent pipeline")]
    public async Task<string> MultiAgent(string prompt)
    {
        var workflow = MultiAgentPipeline.Create();
        return await WorkflowRunner.RunAsync(workflow, prompt);
    }

    [BenchmarkLlm("single-agent", Baseline = true, Description = "Single agent baseline")]
    public async Task<string> SingleAgent(string prompt)
    {
        var workflow = SingleAgentPipeline.Create();
        return await WorkflowRunner.RunAsync(workflow, prompt);
    }
}
The [WorkflowBenchmark] attribute marks a class as containing benchmarks. Individual methods are marked with [BenchmarkLlm], and one can be designated as the Baseline for comparison. The prompt is defined via the attribute’s Prompt property or passed at runtime.
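For example, a fixed prompt can be supplied on the class attribute so that every method receives identical input (the topic string here is illustrative):

```csharp
// Every [BenchmarkLlm] method in this class receives the same prompt,
// so approaches are compared on identical input.
[WorkflowBenchmark("content-generation",
    Description = "Compare content generation approaches",
    Prompt = "Write an article about test-driven development")]
public class ContentBenchmarks { /* ... */ }
```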
2. Automatic Metrics Collection
BenchmarkLlm uses OpenTelemetry to collect detailed metrics without modifying your code:
- API call count - How many LLM calls were made
- Token usage - Input and output tokens per call and in aggregate
- Latency - Execution time for each call and total duration
- Response details - Character count, streaming status, timestamps
The BenchmarkTelemetryCollector captures all spans from Microsoft.Extensions.AI’s built-in instrumentation automatically.
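As a rough illustration, a per-benchmark metrics.json might contain entries along these lines (the field names below are assumptions, not the framework's exact schema):

```json
{
  "apiCallCount": 3,
  "totalInputTokens": 2480,
  "totalOutputTokens": 1912,
  "totalDurationMs": 14230,
  "calls": [
    { "model": "gpt-4o", "inputTokens": 812, "outputTokens": 640, "durationMs": 4870, "streaming": false }
  ]
}
```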
3. LLM-as-Judge Quality Evaluation
Quality assessment is powered by an LLM judge. Two evaluator types are available:
- Content Evaluator (default) - For articles, summaries, and text generation tasks
- Agent Task Evaluator - For tool-use and multi-step agent workflows, focusing on task completion
Both evaluate across 8 dimensions on a 1-5 scale:
| Dimension | What It Measures |
|---|---|
| Completeness | Coverage of the topic / task completion |
| Structure | Organization and logical flow |
| Accuracy | Factual / decision correctness |
| Engagement | Communication quality |
| Evidence Quality | Use of data, examples, tool responses |
| Balance | Appropriate scope |
| Actionability | Practical guidance / actions taken |
| Depth | Handling of details and edge cases |
Each dimension includes a reasoning explanation, giving you insight into why scores were assigned.
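Conceptually, each scored dimension pairs the score with its reasoning, something like this (a hypothetical shape, not the exact output format):

```json
{
  "dimension": "Accuracy",
  "score": 4,
  "reasoning": "Claims are factually consistent with the prompt; one statistic is uncited."
}
```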
4. Comparative Analysis
When running multiple benchmarks, BenchmarkLlm generates comprehensive comparative analysis:
var settings = new BenchmarkLlmSettings
{
    EvaluationProvider = "azure",
    EvaluationModel = "gpt-4o",
    Evaluate = true,
    Exporters = ["console", "markdown", "analysis"]
};
await BenchmarkLlmHost.RunAsync(settings);
The comparative evaluator:
- Collects metrics across all benchmarks
- Identifies strengths and weaknesses for each approach
- Generates an LLM-powered verdict
- Calculates delta metrics against the baseline
Running Benchmarks
Basic Usage
var settings = new BenchmarkLlmSettings
{
    Filter = "*",             // Run all benchmarks (glob pattern)
    ArtifactsPath = "./runs", // Where to save results
    Exporters = ["console"]   // Output format
};
await BenchmarkLlmHost.RunAsync(settings);
Benchmarks create their own chat clients via ChatClientFactory, so no model configuration is needed at the settings level.
With Quality Evaluation
var settings = new BenchmarkLlmSettings
{
    EvaluationProvider = "azure",    // Provider for judge
    EvaluationModel = "gpt-4o-mini", // Judge model
    EvaluatorType = "content",       // "content" or "task"
    Evaluate = true,
    Exporters = ["console", "markdown", "json"]
};
await BenchmarkLlmHost.RunAsync(settings);
Filtering Benchmarks
Use glob patterns to run specific benchmarks:
// Run only multi-agent benchmarks
settings.Filter = "*multi*";
// Run all benchmarks in a category
settings.Filter = "content-generation/*";
// Run a specific benchmark
settings.Filter = "content-generation/single-agent";
Output Structure
Each benchmark run creates a structured output directory:
./runs/2025-01-06_120000_test-driven-development/
├── run-config.json # Input configuration
├── environment.json # Runtime environment details
├── content-generation/
│ ├── multi-agent/
│ │ ├── output.md # Generated content
│ │ └── metrics.json # Detailed metrics
│ └── single-agent/
│ ├── output.md
│ └── metrics.json
├── results.json # All results in JSON
├── comparison.md # Comparison report
└── analysis.md # LLM comparative analysis
Export Formats
BenchmarkLlm supports multiple export formats:
| Exporter | Output | Use Case |
|---|---|---|
| console | Formatted table | Quick review during development |
| markdown | Human-readable report | Documentation and sharing |
| json | Structured data | Programmatic analysis and CI integration |
| analysis | LLM comparative analysis | Deep insights on approach differences |
Architectural Patterns
BenchmarkLlm uses several design patterns that make it extensible and maintainable:
- Attribute Discovery - Reflection-based scanning for automatic registration
- OpenTelemetry Integration - BenchmarkTelemetryCollector captures metrics via standard instrumentation
- Strategy Pattern - IResultExporter and IContentEvaluator enable pluggable outputs and evaluators
- LLM-as-Judge - Consistent, reproducible quality assessment with specialized evaluators
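Because exporters follow the strategy pattern, supporting a new output format means implementing the exporter interface. A minimal sketch, assuming IResultExporter exposes a name and a single export method (the exact signature and result types may differ):

```csharp
// Hypothetical custom exporter that writes benchmark results as CSV.
// Assumes IResultExporter has a Name property and an ExportAsync method.
public class CsvExporter : IResultExporter
{
    public string Name => "csv";

    public async Task ExportAsync(BenchmarkRunResults results, string artifactsPath)
    {
        var lines = results.Benchmarks.Select(b =>
            $"{b.Name},{b.ApiCallCount},{b.TotalTokens},{b.TotalDurationMs}");
        await File.WriteAllLinesAsync(
            Path.Combine(artifactsPath, "results.csv"),
            lines.Prepend("name,calls,tokens,duration_ms"));
    }
}
```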
Integration with DotNetAgents.Infrastructure
BenchmarkLlm integrates seamlessly with the DotNetAgents infrastructure:
// The benchmark runner automatically wraps ChatClientFactory
// All clients created during benchmark execution are instrumented
var chatClient = ChatClientFactory.Create("gpt-4o");
This means your existing workflows work without modification - just add the benchmark attributes and run.
Re-Evaluating Previous Runs
Need to re-evaluate a previous run with a different judge model? BenchmarkLlm supports that:
await BenchmarkLlmHost.EvaluateRunAsync(
    runPath: "./runs/2025-01-06_120000_test-driven-development",
    model: "gpt-4o-mini"
);
This reads the saved outputs and generates new quality scores without re-running the benchmarks.
Best Practices
1. Define a Clear Baseline
Always mark one benchmark as baseline for meaningful comparisons:
[BenchmarkLlm("simple-approach", Baseline = true)]
2. Use Consistent Prompts
The [WorkflowBenchmark] attribute ensures all benchmarks in a class use the same prompt, enabling fair comparison.
3. Use ChatClientFactory for Model Configuration
Create chat clients in your benchmarks via ChatClientFactory to ensure proper instrumentation:
var client = ChatClientFactory.Create("azure", "gpt-4o");
4. Run Multiple Times
LLM outputs are non-deterministic. Consider running benchmarks multiple times and analyzing variance.
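One simple way to do this is to invoke the host in a loop, writing each iteration to its own artifacts directory and comparing the saved metrics afterwards (a sketch; the framework may also offer a built-in iteration setting):

```csharp
// Run the same benchmark set several times to observe run-to-run variance.
for (var i = 0; i < 5; i++)
{
    await BenchmarkLlmHost.RunAsync(new BenchmarkLlmSettings
    {
        Filter = "content-generation/*",
        ArtifactsPath = $"./runs/variance/iteration-{i}",
        Exporters = ["json"]
    });
}
```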
5. Use Different Evaluation Models
Try different judge models to reduce evaluation bias. A cheaper model like gpt-4o-mini often works well for evaluation.
Conclusion
BenchmarkLlm provides a comprehensive solution for understanding how your LLM agents and workflows perform. With automatic metrics collection, structured quality evaluation, and comparative analysis, you can make data-driven decisions about which approaches work best for your use cases.
Whether you’re comparing prompt engineering strategies, evaluating multi-agent architectures, or optimizing for cost and latency, BenchmarkLlm gives you the tools to measure what matters.
Get started by adding [WorkflowBenchmark] and [BenchmarkLlm] attributes to your existing workflows and run your first benchmark today.