Chat History Compaction Strategies

Eric Brasher · February 21, 2026 at 10:18 AM · 5 min read

Unbounded chat history will eventually exceed your LLM's context window. FabrCore's compaction infrastructure handles this — here's how to use it.

Every message exchanged between a user and an agent adds to the conversation history. In a long-running session, that history grows until it hits the model's context limit. At that point, you lose messages, get truncation artifacts, or the model simply refuses to respond. FabrCore provides a purpose-built compaction system so you never have to deal with this manually.

Compaction Infrastructure

CompactionService is the core engine that handles the heavy lifting. It performs three operations in sequence: token estimation across the current message history, message splitting to identify which messages to preserve and which to summarize, and LLM-based summarization that condenses older messages into a compact summary while retaining critical context.
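To make the three operations concrete, here is a minimal sketch of the estimate-split-summarize pipeline. The names (ChatMessage, EstimateTokens, Compact) and the 4-characters-per-token heuristic are illustrative assumptions, not FabrCore's actual API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record ChatMessage(string Role, string Content);

static class CompactionSketch
{
    // Step 1: rough token estimate (~4 characters per token, plus per-message overhead).
    public static int EstimateTokens(IEnumerable<ChatMessage> history) =>
        history.Sum(m => m.Content.Length / 4 + 4);

    // Steps 2 and 3: split the history at KeepLastN, then summarize everything older.
    public static List<ChatMessage> Compact(
        List<ChatMessage> history,
        int keepLastN,
        Func<List<ChatMessage>, string> summarize) // stands in for the LLM summarization call
    {
        if (history.Count <= keepLastN) return history;

        var older  = history.Take(history.Count - keepLastN).ToList();
        var recent = history.Skip(history.Count - keepLastN).ToList();

        // The condensed summary replaces the older messages as a single entry,
        // so the model can still reference earlier context at a fraction of the cost.
        var compacted = new List<ChatMessage>
        {
            new("system", $"Summary of earlier conversation: {summarize(older)}")
        };
        compacted.AddRange(recent);
        return compacted;
    }
}
```

The key property: the most recent keepLastN messages pass through untouched, so the model's short-term context is always verbatim.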

CompactionConfig controls how and when compaction triggers. Here are the key properties:

Property          Default             Description
Enabled           true                Master switch for compaction. When false, TryCompactAsync returns immediately.
KeepLastN         20                  Number of recent messages to always preserve verbatim. These are never summarized.
MaxContextTokens  (model-dependent)   Maximum token budget for the conversation context. Combined with Threshold, this determines the trigger point.
Threshold         0.75                Compaction triggers when token usage exceeds this fraction of MaxContextTokens.

Together, these components give you a compaction pipeline that preserves recent context while condensing older exchanges into summaries that the model can still reference.
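The trigger condition implied by MaxContextTokens and Threshold can be sketched as a one-liner. ShouldCompact is an illustrative helper, not a documented FabrCore method:

```csharp
// Illustrative trigger check; the parameter names mirror CompactionConfig properties.
static bool ShouldCompact(int estimatedTokens, int maxContextTokens, double threshold, bool enabled) =>
    enabled && estimatedTokens > maxContextTokens * threshold;

// With an 8,000-token budget and the default 0.75 threshold, compaction
// fires once the history passes 6,000 estimated tokens.
```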

The TryCompactAsync Pattern

The recommended approach is to call TryCompactAsync() before each model invocation. This ensures the context is within budget before you send the next request to the LLM:

Agent OnMessage Pattern
public override async Task<AgentMessage> OnMessage(AgentMessage message)
{
    // Compact before model invocation
    var compaction = await TryCompactAsync();
    if (compaction?.WasCompacted == true)
    {
        _logger.LogInformation("Compacted: {Original} -> {Compacted} messages",
            compaction.OriginalMessageCount, compaction.CompactedMessageCount);
    }

    // Now invoke the model with compacted history
    var response = await _agent.GetResponseAsync(message.Message);
    return response;
}

TryCompactAsync() is a no-op when compaction is disabled or when the token usage is below the threshold. When it does compact, it returns a result object with WasCompacted, OriginalMessageCount, and CompactedMessageCount so you can log the operation for observability.

The pattern is intentionally explicit: you call it, you see the result, you decide what to log. No hidden side effects.

Configuration

Compaction settings live in your agent configuration. Add a CompactionConfig block to control behavior per agent:

fabrcore.json
{
    "CompactionConfig": {
        "Enabled": true,
        "KeepLastN": 20,
        "MaxContextTokens": 8000,
        "Threshold": 0.75
    }
}

With these settings, compaction triggers when the conversation reaches 75% of 8,000 tokens (6,000 tokens). The last 20 messages are always kept verbatim, and everything older is summarized by the LLM into a condensed history block.

You can tune these values per agent. A customer support agent might keep more messages (KeepLastN: 40) to maintain detailed context, while a quick-turnaround coding assistant might keep fewer (KeepLastN: 10) to maximize the token budget for code output.
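As a sketch, the support-agent profile described above would just override KeepLastN in its own configuration (values here are illustrative; the layout mirrors the fabrcore.json example earlier):

```json
{
    "CompactionConfig": {
        "Enabled": true,
        "KeepLastN": 40,
        "MaxContextTokens": 8000,
        "Threshold": 0.75
    }
}
```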

Why Explicit Over Automatic

You might wonder why compaction isn't automatic — why not trigger it whenever a message is added to history? Three reasons:

Compaction calls an LLM. The summarization step sends older messages to the model and asks for a condensed version. That's a real API call with real latency and real cost. Auto-triggering during every message add would introduce unpredictable latency spikes in what should be a fast operation.

Auto-triggering would be surprising. If adding a message to history silently triggered an LLM call, you'd have hidden side effects in what looks like a simple append operation. Debugging latency issues becomes much harder when message storage has invisible dependencies on model availability.

Explicit calls give you control over timing. You might want to compact before a model call but not before a tool call. You might want to skip compaction for short conversations. You might want to compact on a schedule rather than per-message. The explicit TryCompactAsync() pattern lets you place compaction exactly where it makes sense for your agent's workflow.
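As one sketch of that control, an agent might compact only on the model-invocation path and skip it for tool dispatch. IsToolRequest and DispatchToolAsync are hypothetical helpers used for illustration, not FabrCore APIs:

```csharp
public override async Task<AgentMessage> OnMessage(AgentMessage message)
{
    if (IsToolRequest(message)) // hypothetical predicate for this example
    {
        // Tool dispatch doesn't consume model context, so skip compaction here
        // and avoid paying the summarization LLM call's latency.
        return await DispatchToolAsync(message); // hypothetical tool path
    }

    // Only model invocations pay for context, so compact just before them.
    await TryCompactAsync();
    return await _agent.GetResponseAsync(message.Message);
}
```

The same reasoning extends to scheduled compaction: a background timer could call TryCompactAsync periodically so the per-message path never pays summarization latency at all.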

Next Steps

Chat history compaction is one piece of FabrCore's persistence infrastructure. For the full picture — including storage backends, session management, and message retrieval — check the persistence documentation.


Eric Brasher

Builder of FabrCore and OpenCaddis.