Tags: benchmarking, evaluation, llm, agents, metrics, testing

Introducing BenchmarkLlm: A Comprehensive Framework for Benchmarking LLM Agents

Learn how to systematically benchmark, evaluate, and compare LLM-based agents and workflows with BenchmarkLlm - featuring metrics collection, LLM-as-Judge evaluation, and comparative analysis.

By Roman Weis

When building LLM-based agents and workflows, one of the biggest challenges is understanding how different approaches compare. Is a multi-agent pipeline better than a single agent? Does prompt chaining improve quality? How much does it cost in tokens and latency?

BenchmarkLlm is a comprehensive benchmarking and evaluation framework that answers these questions systematically. It provides attribute-based benchmark discovery, detailed metrics collection, LLM-as-Judge quality evaluation, and comparative analysis - all in a clean, extensible package.

Why BenchmarkLlm?

Traditional benchmarking approaches for LLM applications often fall short:

  • Manual comparison is time-consuming and inconsistent
  • Token counting requires instrumenting every call
  • Quality assessment is subjective without structured evaluation
  • Side-by-side analysis is difficult to reproduce

BenchmarkLlm solves these problems with a unified framework that handles discovery, execution, metrics collection, evaluation, and reporting automatically.

Key Features

1. Attribute-Based Benchmark Discovery

Define benchmarks using simple attributes - no manual registration required:

[WorkflowBenchmark("content-generation",
    Prompt = "The benefits of test-driven development",
    Description = "Compare different content generation approaches")]
public class ContentBenchmarks
{
    [BenchmarkLlm("multi-agent", Description = "3-agent pipeline")]
    public async Task<BenchmarkOutput> MultiAgent(string prompt)
    {
        var (workflow, agentModels) = MultiAgentPipeline.Create();
        var content = await WorkflowRunner.RunAsync(workflow, prompt);
        return BenchmarkOutput.WithModels(content, agentModels);
    }

    [BenchmarkLlm("single-agent", Baseline = true, Description = "Single agent baseline")]
    public async Task<BenchmarkOutput> SingleAgent(string prompt)
    {
        var (workflow, agentModels) = SingleAgentPipeline.Create();
        var content = await WorkflowRunner.RunAsync(workflow, prompt);
        return BenchmarkOutput.WithModels(content, agentModels);
    }
}

The [WorkflowBenchmark] attribute marks a class as containing benchmarks and defines the shared prompt. Individual methods are marked with [BenchmarkLlm], and one can be designated as the Baseline for comparison.

2. Automatic Metrics Collection

BenchmarkLlm transparently wraps your chat clients to collect detailed metrics without modifying your code:

  • API call count - How many LLM calls were made
  • Token usage - Input and output tokens per call and in aggregate
  • Latency - Execution time for each call and total duration
  • Response details - Character count, streaming status, timestamps

The MetricsCollectingChatClient decorator intercepts all calls (both standard and streaming) and records everything automatically.
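
The exact instrumentation lives inside the framework; the sketch below only illustrates the decorator idea. The IChatClient interface, CallMetrics record, and method names here are simplifying assumptions, not the framework's actual types:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Illustrative decorator sketch - interface and type names are assumptions.
public interface IChatClient
{
    Task<string> CompleteAsync(string prompt, CancellationToken ct = default);
}

public sealed record CallMetrics(TimeSpan Duration, int ResponseChars, DateTimeOffset StartedAt);

public sealed class MetricsCollectingChatClient : IChatClient
{
    private readonly IChatClient _inner;
    private readonly List<CallMetrics> _calls = new();

    public MetricsCollectingChatClient(IChatClient inner) => _inner = inner;

    public IReadOnlyList<CallMetrics> Calls => _calls;

    public async Task<string> CompleteAsync(string prompt, CancellationToken ct = default)
    {
        var startedAt = DateTimeOffset.UtcNow;
        var stopwatch = Stopwatch.StartNew();

        // Delegate to the wrapped client - calling code stays unchanged.
        var response = await _inner.CompleteAsync(prompt, ct);

        stopwatch.Stop();
        _calls.Add(new CallMetrics(stopwatch.Elapsed, response.Length, startedAt));
        return response;
    }
}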

3. LLM-as-Judge Quality Evaluation

Quality assessment is powered by an LLM judge that evaluates content across 8 dimensions on a 1-5 scale:

Dimension          What It Measures
Completeness       Coverage of the topic
Structure          Organization and logical flow
Accuracy           Factual correctness
Engagement         Readability and writing quality
Evidence Quality   Statistics, examples, citations
Balance            Coverage of different perspectives
Actionability      Practical implementation guidance
Depth              Analysis depth vs surface-level summary

Each dimension includes a reasoning explanation, giving you insight into why scores were assigned.
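
The framework's own result schema isn't shown here, but conceptually each verdict can be represented as structured data along these lines (a hypothetical shape with assumed field names, for illustration only):

using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a judge verdict - field names are assumptions.
public sealed record DimensionScore(
    string Dimension,   // e.g. "Completeness"
    int Score,          // 1-5
    string Reasoning);  // why the judge assigned this score

public sealed record QualityEvaluation(IReadOnlyList<DimensionScore> Scores)
{
    // Overall quality as the mean of the eight dimension scores.
    public double Overall => Scores.Average(s => s.Score);
}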

4. Comparative Analysis

When running multiple benchmarks, BenchmarkLlm generates comprehensive comparative analysis:

var settings = new BenchmarkLlmSettings
{
    Model = "gpt-4o",
    EvaluationModel = "gpt-4o",
    Evaluate = true,
    Exporters = ["console", "markdown", "analysis"]
};

await BenchmarkLlmHost.RunAsync(settings);

The comparative evaluator:

  • Collects metrics across all benchmarks
  • Identifies strengths and weaknesses for each approach
  • Generates an LLM-powered verdict
  • Calculates delta metrics against the baseline (see the sketch below)
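
The delta calculation against the baseline is simple arithmetic. As an illustration (BenchmarkResult here is an assumed type, not taken from BenchmarkLlm's API), percentage deltas might be computed like this:

using System;

// Illustrative only - BenchmarkResult is an assumed result type.
public sealed record BenchmarkResult(string Name, int TotalTokens, TimeSpan Duration);

public static class DeltaMetrics
{
    // Percentage change relative to the baseline, e.g. +35.2 means the candidate
    // used 35.2% more tokens than the baseline.
    public static double TokenDeltaPercent(BenchmarkResult baseline, BenchmarkResult candidate) =>
        (candidate.TotalTokens - baseline.TotalTokens) / (double)baseline.TotalTokens * 100.0;

    public static double LatencyDeltaPercent(BenchmarkResult baseline, BenchmarkResult candidate) =>
        (candidate.Duration.TotalMilliseconds - baseline.Duration.TotalMilliseconds)
            / baseline.Duration.TotalMilliseconds * 100.0;
}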

Running Benchmarks

Basic Usage

var settings = new BenchmarkLlmSettings
{
    Model = "gpt-4o",           // Model for benchmarks
    Filter = "*",               // Run all benchmarks (glob pattern)
    ArtifactsPath = "./runs",   // Where to save results
    Exporters = ["console"]     // Output format
};

await BenchmarkLlmHost.RunAsync(settings);

With Quality Evaluation

var settings = new BenchmarkLlmSettings
{
    Model = "gpt-4o",
    EvaluationModel = "gpt-4o-mini",  // Judge model (can be different)
    Evaluate = true,
    Exporters = ["console", "markdown", "json"]
};

await BenchmarkLlmHost.RunAsync(settings);

Filtering Benchmarks

Use glob patterns to run specific benchmarks:

// Run only multi-agent benchmarks
settings.Filter = "*multi*";

// Run all benchmarks in a category
settings.Filter = "content-generation/*";

// Run a specific benchmark
settings.Filter = "content-generation/single-agent";

Output Structure

Each benchmark run creates a structured output directory:

./runs/2025-01-06_120000_test-driven-development/
├── run-config.json          # Input configuration
├── environment.json         # Runtime environment details
├── content-generation/
│   ├── multi-agent/
│   │   ├── output.md       # Generated content
│   │   └── metrics.json    # Detailed metrics
│   └── single-agent/
│       ├── output.md
│       └── metrics.json
├── results.json             # All results in JSON
├── comparison.md            # Comparison report
└── analysis.md              # LLM comparative analysis

Export Formats

BenchmarkLlm supports multiple export formats:

Exporter    Output                     Use Case
console     Formatted table            Quick review during development
markdown    Human-readable report      Documentation and sharing
json        Structured data            Programmatic analysis and CI integration
analysis    LLM comparative analysis   Deep insights on approach differences

Architectural Patterns

BenchmarkLlm uses several design patterns that make it extensible and maintainable:

  • Attribute Discovery - Reflection-based scanning for automatic registration
  • Decorator Pattern - MetricsCollectingChatClient wraps clients non-intrusively
  • Strategy Pattern - IResultExporter enables pluggable output formats (see the sketch after this list)
  • LLM-as-Judge - Consistent, reproducible quality assessment
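
The IResultExporter contract isn't spelled out above, so the sketch below assumes a minimal interface (and reuses the assumed BenchmarkResult record from the delta sketch earlier). A custom CSV exporter might look roughly like this:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Assumed minimal contract - the real IResultExporter interface may differ.
public interface IResultExporter
{
    string Name { get; }
    Task ExportAsync(IReadOnlyList<BenchmarkResult> results, string outputDirectory);
}

public sealed class CsvExporter : IResultExporter
{
    public string Name => "csv";

    public async Task ExportAsync(IReadOnlyList<BenchmarkResult> results, string outputDirectory)
    {
        var lines = new List<string> { "name,total_tokens,duration_ms" };
        lines.AddRange(results.Select(r =>
            $"{r.Name},{r.TotalTokens},{r.Duration.TotalMilliseconds:F0}"));
        await File.WriteAllLinesAsync(Path.Combine(outputDirectory, "results.csv"), lines);
    }
}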

Integration with DotNetAgents.Infrastructure

BenchmarkLlm integrates seamlessly with the DotNetAgents infrastructure:

// The benchmark runner automatically wraps ChatClientFactory
// All clients created during benchmark execution are instrumented
var chatClient = ChatClientFactory.Create("gpt-4o");

This means your existing workflows work without modification - just add the benchmark attributes and run.
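
One way such factory-level instrumentation can work, sketched here under the assumption of a simple creation hook (not DotNetAgents.Infrastructure's actual design) and reusing the MetricsCollectingChatClient sketch from earlier, is to decorate every client as it is created:

using System;
using System.Collections.Generic;

// Sketch only - the hook and factory shown here are assumptions.
public static class InstrumentedClientFactory
{
    private static readonly List<MetricsCollectingChatClient> _instrumented = new();

    public static IChatClient Create(string model, Func<string, IChatClient> createInner)
    {
        // Wrap the real client so every call made through it is recorded.
        var wrapped = new MetricsCollectingChatClient(createInner(model));
        _instrumented.Add(wrapped);
        return wrapped;
    }

    // The runner can aggregate metrics from every instrumented client after a benchmark.
    public static IReadOnlyList<MetricsCollectingChatClient> Instrumented => _instrumented;
}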

Re-Evaluating Previous Runs

Need to re-evaluate a previous run with a different judge model? BenchmarkLlm supports that:

await BenchmarkLlmHost.EvaluateRunAsync(
    runPath: "./runs/2025-01-06_120000_test-driven-development",
    model: "claude-3-opus"
);

This reads the saved outputs and generates new quality scores without re-running the benchmarks.

Best Practices

1. Define a Clear Baseline

Always mark one benchmark as baseline for meaningful comparisons:

[BenchmarkLlm("simple-approach", Baseline = true)]

2. Use Consistent Prompts

The [WorkflowBenchmark] attribute ensures all benchmarks in a class use the same prompt, enabling fair comparison.

3. Return Agent Metadata

Use BenchmarkOutput.WithModels() to track which models each agent used:

return BenchmarkOutput.WithModels(content, new Dictionary<string, string>
{
    ["Researcher"] = "gpt-4o",
    ["Writer"] = "gpt-4o-mini"
});

4. Run Multiple Times

LLM outputs are non-deterministic. Consider running benchmarks multiple times and analyzing variance.
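
Single-run numbers can be noisy, so aggregate before drawing conclusions. A minimal sketch using plain LINQ (no BenchmarkLlm API involved) for the mean and sample standard deviation of per-run totals:

using System;
using System.Collections.Generic;
using System.Linq;

public static class RunStatistics
{
    // Mean and sample standard deviation over per-run totals (tokens, milliseconds, ...).
    public static (double Mean, double StdDev) Summarize(IReadOnlyList<double> values)
    {
        var mean = values.Average();
        var variance = values.Count > 1
            ? values.Sum(v => (v - mean) * (v - mean)) / (values.Count - 1)
            : 0.0;
        return (mean, Math.Sqrt(variance));
    }
}

// Example: total token counts from five runs of the same benchmark.
// var (mean, std) = RunStatistics.Summarize(new double[] { 5120, 4980, 5305, 5012, 5150 });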

5. Use Different Evaluation Models

Try different judge models to reduce evaluation bias. A cheaper model like gpt-4o-mini often works well for evaluation.

Conclusion

BenchmarkLlm provides a comprehensive solution for understanding how your LLM agents and workflows perform. With automatic metrics collection, structured quality evaluation, and comparative analysis, you can make data-driven decisions about which approaches work best for your use cases.

Whether you’re comparing prompt engineering strategies, evaluating multi-agent architectures, or optimizing for cost and latency, BenchmarkLlm gives you the tools to measure what matters.

Get started by adding [WorkflowBenchmark] and [BenchmarkLlm] attributes to your existing workflows and run your first benchmark today.