Tags: benchmarking, evaluation, llm, agents, metrics, testing

Introducing BenchmarkLlm: A Comprehensive Framework for Benchmarking LLM Agents

Learn how to systematically benchmark, evaluate, and compare LLM-based agents and workflows with BenchmarkLlm - featuring metrics collection, LLM-as-Judge evaluation, and comparative analysis.

By Roman Weis

When building LLM-based agents and workflows, one of the biggest challenges is understanding how different approaches compare. Is a multi-agent pipeline better than a single agent? Does prompt chaining improve quality? How much does it cost in tokens and latency?

BenchmarkLlm is a comprehensive benchmarking and evaluation framework that answers these questions systematically. It provides attribute-based benchmark discovery, detailed metrics collection, LLM-as-Judge quality evaluation, and comparative analysis - all in a clean, extensible package.

Why BenchmarkLlm?

Traditional benchmarking approaches for LLM applications often fall short:

  • Manual comparison is time-consuming and inconsistent
  • Token counting requires instrumenting every call
  • Quality assessment is subjective without structured evaluation
  • Side-by-side analysis is difficult to reproduce

BenchmarkLlm solves these problems with a unified framework that handles discovery, execution, metrics collection, evaluation, and reporting automatically.

Key Features

1. Attribute-Based Benchmark Discovery

Define benchmarks using simple attributes - no manual registration required:

[WorkflowBenchmark("content-generation",
    Prompt = "The benefits of test-driven development",
    Description = "Compare different content generation approaches")]
public class ContentBenchmarks
{
    [BenchmarkLlm("multi-agent", Description = "3-agent pipeline")]
    public async Task<BenchmarkOutput> MultiAgent(string prompt)
    {
        var (workflow, agentModels) = MultiAgentPipeline.Create();
        var content = await WorkflowRunner.RunAsync(workflow, prompt);
        return BenchmarkOutput.WithModels(content, agentModels);
    }

    [BenchmarkLlm("single-agent", Baseline = true, Description = "Single agent baseline")]
    public async Task<BenchmarkOutput> SingleAgent(string prompt)
    {
        var (workflow, agentModels) = SingleAgentPipeline.Create();
        var content = await WorkflowRunner.RunAsync(workflow, prompt);
        return BenchmarkOutput.WithModels(content, agentModels);
    }
}

The [WorkflowBenchmark] attribute marks a class as containing benchmarks and defines the shared prompt. Individual methods are marked with [BenchmarkLlm], and one can be designated as the Baseline for comparison.

2. Automatic Metrics Collection

BenchmarkLlm transparently wraps your chat clients to collect detailed metrics without modifying your code:

  • API call count - How many LLM calls were made
  • Token usage - Input and output tokens per call and in aggregate
  • Latency - Execution time for each call and total duration
  • Response details - Character count, streaming status, timestamps

The MetricsCollectingChatClient decorator intercepts all calls (both standard and streaming) and records everything automatically.
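
The exact instrumentation lives inside the framework; the sketch below only illustrates the decorator idea. The IChatClient interface, CallMetrics record, and method names here are simplifying assumptions, not the framework's actual types:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Illustrative decorator sketch - interface and type names are assumptions.
public interface IChatClient
{
    Task<string> CompleteAsync(string prompt, CancellationToken ct = default);
}

public sealed record CallMetrics(TimeSpan Duration, int ResponseChars, DateTimeOffset StartedAt);

public sealed class MetricsCollectingChatClient : IChatClient
{
    private readonly IChatClient _inner;
    private readonly List<CallMetrics> _calls = new();

    public MetricsCollectingChatClient(IChatClient inner) => _inner = inner;

    public IReadOnlyList<CallMetrics> Calls => _calls;

    public async Task<string> CompleteAsync(string prompt, CancellationToken ct = default)
    {
        var startedAt = DateTimeOffset.UtcNow;
        var stopwatch = Stopwatch.StartNew();

        // Delegate to the wrapped client - calling code stays unchanged.
        var response = await _inner.CompleteAsync(prompt, ct);

        stopwatch.Stop();
        _calls.Add(new CallMetrics(stopwatch.Elapsed, response.Length, startedAt));
        return response;
    }
}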

3. LLM-as-Judge Quality Evaluation

Quality assessment is powered by an LLM judge that evaluates content across 8 dimensions on a 1-5 scale:

Dimension          What It Measures
Completeness       Coverage of the topic
Structure          Organization and logical flow
Accuracy           Factual correctness
Engagement         Readability and writing quality
Evidence Quality   Statistics, examples, citations
Balance            Coverage of different perspectives
Actionability      Practical implementation guidance
Depth              Analysis depth vs surface-level summary

Each dimension includes a reasoning explanation, giving you insight into why scores were assigned.
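
The framework's own result schema isn't shown here, but conceptually each verdict can be represented as structured data along these lines (a hypothetical shape with assumed field names, for illustration only):

using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a judge verdict - field names are assumptions.
public sealed record DimensionScore(
    string Dimension,   // e.g. "Completeness"
    int Score,          // 1-5
    string Reasoning);  // why the judge assigned this score

public sealed record QualityEvaluation(IReadOnlyList<DimensionScore> Scores)
{
    // Overall quality as the mean of the eight dimension scores.
    public double Overall => Scores.Average(s => s.Score);
}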

4. Comparative Analysis

When running multiple benchmarks, BenchmarkLlm generates comprehensive comparative analysis:

var settings = new BenchmarkLlmSettings
{
    Model = "gpt-4o",
    EvaluationModel = "gpt-4o",
    Evaluate = true,
    Exporters = ["console", "markdown", "analysis"]
};

await BenchmarkLlmHost.RunAsync(settings);

The comparative evaluator:

  • Collects metrics across all benchmarks
  • Identifies strengths and weaknesses for each approach
  • Generates an LLM-powered verdict
  • Calculates delta metrics against the baseline (see the sketch below)
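
The delta calculation against the baseline is simple arithmetic. As an illustration (BenchmarkResult here is an assumed type, not taken from BenchmarkLlm's API), percentage deltas might be computed like this:

using System;

// Illustrative only - BenchmarkResult is an assumed result type.
public sealed record BenchmarkResult(string Name, int TotalTokens, TimeSpan Duration);

public static class DeltaMetrics
{
    // Percentage change relative to the baseline, e.g. +35.2 means the candidate
    // used 35.2% more tokens than the baseline.
    public static double TokenDeltaPercent(BenchmarkResult baseline, BenchmarkResult candidate) =>
        (candidate.TotalTokens - baseline.TotalTokens) / (double)baseline.TotalTokens * 100.0;

    public static double LatencyDeltaPercent(BenchmarkResult baseline, BenchmarkResult candidate) =>
        (candidate.Duration.TotalMilliseconds - baseline.Duration.TotalMilliseconds)
            / baseline.Duration.TotalMilliseconds * 100.0;
}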

Running Benchmarks

Basic Usage

var settings = new BenchmarkLlmSettings
{
    Model = "gpt-4o",           // Model for benchmarks
    Filter = "*",               // Run all benchmarks (glob pattern)
    ArtifactsPath = "./runs",   // Where to save results
    Exporters = ["console"]     // Output format
};

await BenchmarkLlmHost.RunAsync(settings);

With Quality Evaluation

var settings = new BenchmarkLlmSettings
{
    Model = "gpt-4o",
    EvaluationModel = "gpt-4o-mini",  // Judge model (can be different)
    Evaluate = true,
    Exporters = ["console", "markdown", "json"]
};

await BenchmarkLlmHost.RunAsync(settings);

Filtering Benchmarks

Use glob patterns to run specific benchmarks:

// Run only multi-agent benchmarks
settings.Filter = "*multi*";

// Run all benchmarks in a category
settings.Filter = "content-generation/*";

// Run a specific benchmark
settings.Filter = "content-generation/single-agent";

Output Structure

Each benchmark run creates a structured output directory:

./runs/2025-01-06_120000_test-driven-development/
├── run-config.json          # Input configuration
├── environment.json         # Runtime environment details
├── content-generation/
│   ├── multi-agent/
│   │   ├── output.md       # Generated content
│   │   └── metrics.json    # Detailed metrics
│   └── single-agent/
│       ├── output.md
│       └── metrics.json
├── results.json             # All results in JSON
├── comparison.md            # Comparison report
└── analysis.md              # LLM comparative analysis

Export Formats

BenchmarkLlm supports multiple export formats:

Exporter    Output                     Use Case
console     Formatted table            Quick review during development
markdown    Human-readable report      Documentation and sharing
json        Structured data            Programmatic analysis and CI integration
analysis    LLM comparative analysis   Deep insights on approach differences

Architectural Patterns

BenchmarkLlm uses several design patterns that make it extensible and maintainable:

  • Attribute Discovery - Reflection-based scanning for automatic registration
  • Decorator Pattern - MetricsCollectingChatClient wraps clients non-intrusively
  • Strategy Pattern - IResultExporter enables pluggable output formats (see the sketch after this list)
  • LLM-as-Judge - Consistent, reproducible quality assessment
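
The IResultExporter contract isn't spelled out above, so the sketch below assumes a minimal interface (and reuses the assumed BenchmarkResult record from the delta sketch earlier). A custom CSV exporter might look roughly like this:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Assumed minimal contract - the real IResultExporter interface may differ.
public interface IResultExporter
{
    string Name { get; }
    Task ExportAsync(IReadOnlyList<BenchmarkResult> results, string outputDirectory);
}

public sealed class CsvExporter : IResultExporter
{
    public string Name => "csv";

    public async Task ExportAsync(IReadOnlyList<BenchmarkResult> results, string outputDirectory)
    {
        var lines = new List<string> { "name,total_tokens,duration_ms" };
        lines.AddRange(results.Select(r =>
            $"{r.Name},{r.TotalTokens},{r.Duration.TotalMilliseconds:F0}"));
        await File.WriteAllLinesAsync(Path.Combine(outputDirectory, "results.csv"), lines);
    }
}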

Integration with DotNetAgents.Infrastructure

BenchmarkLlm integrates seamlessly with the DotNetAgents infrastructure:

// The benchmark runner automatically wraps ChatClientFactory
// All clients created during benchmark execution are instrumented
var chatClient = ChatClientFactory.Create("gpt-4o");

This means your existing workflows work without modification - just add the benchmark attributes and run.
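
One way such factory-level instrumentation can work, sketched here under the assumption of a simple creation hook (not DotNetAgents.Infrastructure's actual design) and reusing the MetricsCollectingChatClient sketch from earlier, is to decorate every client as it is created:

using System;
using System.Collections.Generic;

// Sketch only - the hook and factory shown here are assumptions.
public static class InstrumentedClientFactory
{
    private static readonly List<MetricsCollectingChatClient> _instrumented = new();

    public static IChatClient Create(string model, Func<string, IChatClient> createInner)
    {
        // Wrap the real client so every call made through it is recorded.
        var wrapped = new MetricsCollectingChatClient(createInner(model));
        _instrumented.Add(wrapped);
        return wrapped;
    }

    // The runner can aggregate metrics from every instrumented client after a benchmark.
    public static IReadOnlyList<MetricsCollectingChatClient> Instrumented => _instrumented;
}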

Re-Evaluating Previous Runs

Need to re-evaluate a previous run with a different judge model? BenchmarkLlm supports that:

await BenchmarkLlmHost.EvaluateRunAsync(
    runPath: "./runs/2025-01-06_120000_test-driven-development",
    model: "claude-3-opus"
);

This reads the saved outputs and generates new quality scores without re-running the benchmarks.

Best Practices

1. Define a Clear Baseline

Always mark one benchmark as baseline for meaningful comparisons:

[BenchmarkLlm("simple-approach", Baseline = true)]

2. Use Consistent Prompts

The [WorkflowBenchmark] attribute ensures all benchmarks in a class use the same prompt, enabling fair comparison.

3. Return Agent Metadata

Use BenchmarkOutput.WithModels() to track which models each agent used:

return BenchmarkOutput.WithModels(content, new Dictionary<string, string>
{
    ["Researcher"] = "gpt-4o",
    ["Writer"] = "gpt-4o-mini"
});

4. Run Multiple Times

LLM outputs are non-deterministic. Consider running benchmarks multiple times and analyzing variance.
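
Single-run numbers can be noisy, so aggregate before drawing conclusions. A minimal sketch using plain LINQ (no BenchmarkLlm API involved) for the mean and sample standard deviation of per-run totals:

using System;
using System.Collections.Generic;
using System.Linq;

public static class RunStatistics
{
    // Mean and sample standard deviation over per-run totals (tokens, milliseconds, ...).
    public static (double Mean, double StdDev) Summarize(IReadOnlyList<double> values)
    {
        var mean = values.Average();
        var variance = values.Count > 1
            ? values.Sum(v => (v - mean) * (v - mean)) / (values.Count - 1)
            : 0.0;
        return (mean, Math.Sqrt(variance));
    }
}

// Example: total token counts from five runs of the same benchmark.
// var (mean, std) = RunStatistics.Summarize(new double[] { 5120, 4980, 5305, 5012, 5150 });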

5. Use Different Evaluation Models

Try different judge models to reduce evaluation bias. A cheaper model like gpt-4o-mini often works well for evaluation.

Conclusion

BenchmarkLlm provides a comprehensive solution for understanding how your LLM agents and workflows perform. With automatic metrics collection, structured quality evaluation, and comparative analysis, you can make data-driven decisions about which approaches work best for your use cases.

Whether you’re comparing prompt engineering strategies, evaluating multi-agent architectures, or optimizing for cost and latency, BenchmarkLlm gives you the tools to measure what matters.

Get started by adding [WorkflowBenchmark] and [BenchmarkLlm] attributes to your existing workflows and run your first benchmark today.