# Testing FabrCore Agents — Test Harness, Fake Chat Clients, and LLM Evaluation
FabrCore agents run inside Orleans grains, which makes them powerful but also raises a question: how do you test them without spinning up a full silo? The answer is FabrCoreTestHarness — a lightweight in-memory test host that replaces Orleans entirely, gives you deterministic control over LLM responses, and integrates with Microsoft's evaluation libraries for scoring real agent output.
## The Test Architecture
The testing stack replaces the production Orleans infrastructure with in-memory equivalents while keeping your agent code untouched:
| Component | Purpose |
|---|---|
| FabrCoreTestHarness | Wires up DI, creates agents, and provides InitializeAgent/SendMessage helpers |
| TestFabrCoreAgentHost | In-memory IFabrCoreAgentHost that replaces the Orleans grain |
| FakeChatClient | Deterministic IChatClient with sequential-response support |
| TestChatClientService | Dual-mode IFabrCoreChatClientService: mock or live LLM |
Your agent extends FabrCoreAgentProxy in production and the same class is instantiated directly in tests. The harness feeds it a TestFabrCoreAgentHost instead of the real Orleans grain, and a FakeChatClient instead of a real LLM connection. The agent code has no idea it is running in a test — it calls CreateChatClientAgent, RunAsync, and RunStreamingAsync exactly as it would in production.
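As a reference point for the tests that follow, a minimal agent might look something like the sketch below. This is illustrative only — the exact FabrCoreAgentProxy surface, the OnMessage signature, the AgentResponse type, and the CreateChatClientAgent call are assumptions inferred from the description above, not the actual FabrCore API:

```csharp
// Hypothetical minimal agent, for illustration only. The base-class
// members used here (OnMessage, CreateChatClientAgent, AgentResponse)
// are assumed from the prose above; check them against the real API.
public class MyAgent : FabrCoreAgentProxy
{
    public override async Task<AgentResponse> OnMessage(string message)
    {
        // CreateChatClientAgent resolves a chat client from the host:
        // FakeChatClient in tests, a real LLM connection in production.
        var chat = CreateChatClientAgent("default");
        var reply = await chat.RunAsync(message);
        return new AgentResponse { Message = reply.Text };
    }
}
```

Because the agent only talks to the host through these abstractions, the same class runs unchanged under both the Orleans grain and the test harness.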
## Mock Mode: Deterministic Tests Without an LLM
FakeChatClient lets you script exactly what the LLM returns. This is the foundation for fast, offline, deterministic tests that validate routing logic, JSON parsing, error handling, and tool invocation patterns.
```csharp
[TestClass]
public class MyAgentTests
{
    [TestMethod]
    public async Task OnMessage_ReturnsExpectedResponse()
    {
        using var harness = new FabrCoreTestHarness();

        // Configure sequential LLM responses
        var chatClient = FakeChatClient.WithSequentialResponses(
            """{"effort": "small", "reasoning": "Simple question"}""",
            "The answer is 42.");

        var agent = harness.CreateMockAgent<MyAgent>(chatClient);
        await harness.InitializeAgent(agent);

        var response = await harness.SendMessage(agent, "What is the answer?");

        Assert.IsNotNull(response.Message);
        Assert.IsTrue(response.Message.Contains("42"));
    }
}
```
FakeChatClient provides three factory methods:
| Method | Behavior |
|---|---|
| WithTextResponse(text) | Always returns the same text on every call |
| WithJsonResponse(json) | Alias for WithTextResponse; returns the given JSON content |
| WithSequentialResponses(r1, r2, ...) | Returns r1 on the first call, r2 on the second, and so on |
Sequential responses are particularly useful for agents that make multiple LLM calls in a single turn — for example, a planning step followed by an execution step. You script both responses up front, and the test verifies the full flow end to end.
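For instance, a two-step agent could be exercised like this. The sketch assumes a hypothetical agent that first asks the LLM for a plan and then makes a second call to produce the final answer — the plan JSON shape and the agent's behavior are invented for illustration:

```csharp
[TestMethod]
public async Task OnMessage_PlansBeforeExecuting()
{
    using var harness = new FabrCoreTestHarness();

    // The first scripted response answers the agent's planning call;
    // the second answers its execution call, all within a single turn.
    var chatClient = FakeChatClient.WithSequentialResponses(
        """{"plan": ["look up the population", "format the answer"]}""",
        "Paris has about 2.1 million inhabitants.");

    var agent = harness.CreateMockAgent<MyAgent>(chatClient);
    await harness.InitializeAgent(agent);

    var response = await harness.SendMessage(agent, "How big is Paris?");

    // The final response should come from the second scripted reply.
    Assert.IsTrue(response.Message!.Contains("2.1 million"));
}
```

If the agent makes fewer or more LLM calls than you scripted, the mismatch surfaces immediately in the assertion, which makes sequential responses a cheap way to pin down an agent's call pattern.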
## Live Mode: Integration Tests with Real LLMs
When you need to verify actual agent behavior against a real model, switch to live mode. The CreateLiveAgent method reads model configuration and API keys directly from a fabrcore.json file in your test project — no running FabrCore host required.
```csharp
[TestClass]
[TestCategory("Integration")]
public class MyAgentIntegrationTests
{
    [TestMethod]
    public async Task OnMessage_ProducesCoherentResponse()
    {
        using var harness = new FabrCoreTestHarness();
        var agent = harness.CreateLiveAgent<MyAgent>();
        if (agent is null)
        {
            Assert.Inconclusive("Requires fabrcore.json with valid API keys.");
            return;
        }

        await harness.InitializeAgent(agent);

        var response = await harness.SendMessage(agent, "What is the capital of France?");

        Assert.IsNotNull(response.Message);
        Assert.IsTrue(response.Message.Contains("Paris",
            StringComparison.OrdinalIgnoreCase));
    }
}
```
If the fabrcore.json file is missing or contains placeholder API keys, CreateLiveAgent returns null and the test skips gracefully with Assert.Inconclusive(). This means integration tests can live in the same project as mock tests — they simply skip on machines without credentials. Run them separately with dotnet test --filter TestCategory=Integration.
The TestFabrCoreAgentHost also captures everything the agent does during a test, giving you assertion points beyond just the response text:
```csharp
// Check messages sent by the agent to other agents
Assert.AreEqual(1, harness.AgentHost.SentMessages.Count);

// Check events sent
Assert.AreEqual(0, harness.AgentHost.SentEvents.Count);

// Check timers registered
CollectionAssert.Contains(harness.AgentHost.RegisteredTimers, "my-timer");

// Check status message set by agent or plugins
Assert.AreEqual("Processing..", harness.AgentHost.CurrentStatusMessage);
```
## Testing OnMessageBusy
The harness also supports testing busy-state behavior. SendBusyMessage calls your agent's OnMessageBusy method directly, so you can verify what callers see when the agent is already processing another request:
```csharp
[TestMethod]
public async Task OnMessageBusy_ReturnsDefaultBusyResponse()
{
    using var harness = new FabrCoreTestHarness();
    var chatClient = FakeChatClient.WithTextResponse("ok");
    var agent = harness.CreateMockAgent<MyAgent>(chatClient);
    await harness.InitializeAgent(agent);

    var response = await harness.SendBusyMessage(agent, "Are you there?");

    Assert.IsNotNull(response.Message);
    Assert.IsTrue(response.Message.Contains("currently processing",
        StringComparison.OrdinalIgnoreCase));
}
```
## LLM Evaluation with Microsoft.Extensions.AI.Evaluation
Beyond pass/fail assertions, you may want to measure how well your agent's responses actually perform. Microsoft's Microsoft.Extensions.AI.Evaluation libraries provide AI-judged and algorithmic metrics that score real LLM output without manual review.
The quality evaluators — RelevanceEvaluator, FluencyEvaluator, CoherenceEvaluator, GroundednessEvaluator, and others — use a separate LLM as a judge to score responses on a 1–5 scale. You can run multiple evaluators concurrently using CompositeEvaluator:
```csharp
[TestMethod]
[TestCategory("Evaluation")]
public async Task Agent_Response_MeetsQualityThresholds()
{
    using var harness = new FabrCoreTestHarness();
    var agent = harness.CreateLiveAgent<MyAgent>();
    if (agent is null) { Assert.Inconclusive("No API key"); return; }

    await harness.InitializeAgent(agent);
    var response = await harness.SendMessage(agent,
        "Explain how photosynthesis works.");

    // Set up the evaluator's judge LLM
    var evalClient = await harness.GetChatClient("default");
    var chatConfig = new ChatConfiguration(evalClient);

    var evaluator = new CompositeEvaluator(
        new RelevanceEvaluator(),
        new FluencyEvaluator(),
        new CoherenceEvaluator());

    var result = await evaluator.EvaluateAsync(
        "Explain how photosynthesis works.",
        response.Message!,
        chatConfig);

    // Assert quality thresholds (1-5 scale, 3+ is Good)
    Assert.IsTrue(result.Get<NumericMetric>("Relevance").Value >= 3.0);
    Assert.IsTrue(result.Get<NumericMetric>("Fluency").Value >= 3.0);
    Assert.IsTrue(result.Get<NumericMetric>("Coherence").Value >= 3.0);
}
```
For RAG agents that retrieve context before answering, the GroundednessEvaluator checks whether the response stays grounded in the retrieved data. Algorithmic evaluators like BLEUEvaluator and F1Evaluator provide fast, deterministic scoring against reference texts without any LLM calls at all.
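An algorithmic check might look like the fragment below, continuing from the live-agent setup in the previous test. The exact context type (F1EvaluatorContext) and metric name ("F1") follow the Microsoft.Extensions.AI.Evaluation.NLP package but should be verified against the version you reference; the ground-truth text and threshold are illustrative:

```csharp
// Deterministic NLP scoring against a reference answer — no judge LLM,
// so this runs offline and always produces the same score for the same
// input. F1EvaluatorContext and the "F1" metric name are assumptions
// about the NLP evaluation package; check them against your version.
var f1 = new F1Evaluator();
var nlpResult = await f1.EvaluateAsync(
    "Explain how photosynthesis works.",
    response.Message!,
    additionalContext:
    [
        new F1EvaluatorContext(groundTruth:
            "Photosynthesis converts light, water, and carbon dioxide " +
            "into glucose and oxygen inside chloroplasts.")
    ]);

// F1 is on a 0–1 scale; the threshold here is an arbitrary example.
Assert.IsTrue(nlpResult.Get<NumericMetric>("F1").Value >= 0.3);
```

Because no judge LLM is involved, these evaluators are cheap enough to run on every CI build rather than only in scheduled evaluation passes.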
The evaluation libraries also include a reporting pipeline with DiskBasedReportingConfiguration that stores results across runs, caches LLM judge responses, and generates HTML reports showing scores, trends, and comparisons across test executions. Run dotnet tool run aieval report after your eval tests to generate the report.
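Wiring the evaluators into that pipeline might look roughly like this. The method names (DiskBasedReportingConfiguration.Create, CreateScenarioRunAsync) follow the Microsoft.Extensions.AI.Evaluation.Reporting package, but the exact signatures and parameter names should be checked against the version you use; chatConfig and response are assumed to come from a live-agent test like the one above:

```csharp
// Sketch of the disk-based reporting pipeline. Results and cached judge
// responses land under ./eval-results, which the aieval tool reads when
// generating the HTML report. Signatures are assumptions — verify them
// against the Reporting package you reference.
var reporting = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./eval-results",
    evaluators: [new RelevanceEvaluator(), new CoherenceEvaluator()],
    chatConfiguration: chatConfig,
    enableResponseCaching: true);   // reuse judge LLM responses across runs

await using ScenarioRun run =
    await reporting.CreateScenarioRunAsync("MyAgent.Photosynthesis");

// Scores recorded through the run are persisted for the report,
// in addition to being available for in-test assertions.
var result = await run.EvaluateAsync(
    "Explain how photosynthesis works.",
    response.Message!);
```

Naming scenario runs consistently (for example, by agent and prompt) is what lets the report show trends for the same scenario across test executions.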
| Evaluator Category | Examples | Scale | Requires LLM |
|---|---|---|---|
| Quality | Relevance, Fluency, Coherence, Groundedness, Completeness | 1–5 | Yes |
| NLP | BLEU, GLEU, F1 | 0–1 | No |
| Safety | Hate, Violence, Self-Harm, Sexual | 0–7 (lower is safer) | Azure AI Foundry |
## Running Tests
All three categories of tests — mock, integration, and evaluation — live in the same MSTest project. Separate them using test categories:
```bash
# All mock tests (fast, no API key needed)
dotnet test --filter "TestCategory!=Integration&TestCategory!=Evaluation"

# Integration tests only (requires fabrcore.json with API key)
dotnet test --filter "TestCategory=Integration"

# Evaluation tests only (requires API key for both agent and judge LLM)
dotnet test --filter "TestCategory=Evaluation"

# Everything
dotnet test
```
Mock tests run in milliseconds with no external dependencies. Integration and evaluation tests require a fabrcore.json in the test project with valid API keys, but skip gracefully with Assert.Inconclusive() on machines without credentials. This makes the test suite safe to run in CI environments where API keys may not be available.
Built with FabrCore on .NET 10.
Builder of FabrCore and OpenCaddis.