Managing Agent Lifecycles at Scale with Orleans Grains
Every FabrCore agent is an Orleans grain — a lightweight virtual actor with its own identity, state, and single-threaded execution guarantee. This means agents activate on demand, deactivate when idle, survive failures, and scale across a cluster without any thread-safety gymnastics on your part. In this post we walk through the lifecycle methods that make it all work.
The Grain Foundation
At its core, FabrCoreAgentProxy is the base class every agent extends. It wires together the Orleans grain runtime, LLM chat clients, tool resolution, and inter-agent messaging behind a simple set of overridable methods. The constructor takes exactly three parameters — AgentConfiguration, IServiceProvider, and IFabrCoreAgentHost — and does nothing async. All setup happens later in the lifecycle.
```csharp
using System.ComponentModel; // DescriptionAttribute for [Description]
using FabrCore.Core;
using FabrCore.Sdk;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;

[AgentAlias("support-agent")]
[Description("Handles customer support inquiries")]
[FabrCoreCapabilities("Lookup orders, check status, process returns.")]
public class SupportAgent : FabrCoreAgentProxy
{
    private AIAgent? _agent;
    private AgentSession? _session;

    public SupportAgent(
        AgentConfiguration config,
        IServiceProvider serviceProvider,
        IFabrCoreAgentHost fabrcoreAgentHost)
        : base(config, serviceProvider, fabrcoreAgentHost) { }

    public override async Task OnInitialize() { /* ... */ }
    public override async Task<AgentMessage> OnMessage(AgentMessage message) { /* ... */ }
    public override Task OnEvent(EventMessage eventMessage) { /* ... */ }
}
```
When a message targets "user1:support-agent", Orleans activates the grain if it is not already in memory. The grain stays resident while it is in use; once it has been idle long enough, Orleans deactivates it, and custom state and chat history are flushed automatically. If the agent is needed again, it re-activates and OnInitialize runs once more. This activation/deactivation cycle is entirely transparent to calling code.
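The activate, deactivate, re-activate cycle can be made concrete with a small simulation in plain C#. This is not Orleans or FabrCore code: the in-memory registry, the `persisted` dictionary standing in for flushed state, and the 15-minute `idleTimeout` are hypothetical choices for illustration. It shows the two properties described above: state flushed on deactivation is restored on the next activation, and the initialization step runs once per activation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins: in Orleans the runtime owns activations and idle collection.
var active = new Dictionary<string, (DateTime LastUsed, List<string> History)>();
var persisted = new Dictionary<string, List<string>>(); // survives deactivation
var initCount = new Dictionary<string, int>();           // how often "OnInitialize" ran per handle
var idleTimeout = TimeSpan.FromMinutes(15);              // assumed value for illustration

// Route a message: activate on demand, running "OnInitialize" once per activation.
void Send(string handle, string text, DateTime now)
{
    if (!active.TryGetValue(handle, out var activation))
    {
        // Activation: restore previously flushed state (or start empty), then initialize.
        var history = persisted.TryGetValue(handle, out var saved)
            ? new List<string>(saved)
            : new List<string>();
        initCount[handle] = initCount.GetValueOrDefault(handle) + 1;
        activation = (now, history);
    }
    activation.History.Add(text);
    active[handle] = (now, activation.History); // refresh the last-used time
}

// Idle collection: deactivate anything unused for longer than idleTimeout, flushing state first.
void CollectIdle(DateTime now)
{
    foreach (var handle in new List<string>(active.Keys))
    {
        var (lastUsed, history) = active[handle];
        if (now - lastUsed > idleTimeout)
        {
            persisted[handle] = new List<string>(history); // flush on deactivation
            active.Remove(handle);
        }
    }
}

var t0 = new DateTime(2025, 1, 1, 9, 0, 0);
Send("user1:support-agent", "Where is order 42?", t0);
CollectIdle(t0.AddMinutes(30));                                // idle too long: deactivated
Send("user1:support-agent", "Any update?", t0.AddMinutes(31)); // re-activated and re-initialized

Console.WriteLine(
    $"active={active.Count}, inits={initCount["user1:support-agent"]}, " +
    $"history={active["user1:support-agent"].History.Count}");
```

Running it prints `active=1, inits=2, history=2`: the second message re-activated the agent, initialization ran a second time, and the first message survived the deactivation.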
Lifecycle Methods in Detail
FabrCore exposes a clear set of lifecycle hooks. Understanding when each runs is key to building reliable agents.
| Method | When It Runs | Purpose |
|---|---|---|
| Constructor | Grain activation | DI wiring only; no async work |
| OnInitialize() | Before the first message, or on reconfigure | Set up the LLM client, resolve tools, create threads |
| OnMessage(AgentMessage) | Request or OneWay message received | Core message processing; return a response |
| OnMessageBusy(AgentMessage) | Message arrives while OnMessage is running | Handle concurrent messages (default: "busy" response) |
| OnEvent(EventMessage) | Fire-and-forget event | React to stream event notifications |
| GetHealth(HealthDetailLevel) | Health check request | Return custom health metrics |
OnInitialize — Wiring Up the Agent
OnInitialize is called once before the first message is processed, or when an agent is reconfigured. This is where you resolve tools from configured plugins, MCP servers, and standalone tool aliases, then create the chat client agent that connects to your LLM provider.
```csharp
public override async Task OnInitialize()
{
    // Step 1: Resolve tools from plugins, standalone tools, and MCP servers
    var tools = await ResolveConfiguredToolsAsync();

    // Step 2: Add local tool methods defined in this class
    tools.Add(AIFunctionFactory.Create(LookupOrder));

    // Step 3: Create the chat client agent
    var result = await CreateChatClientAgent(
        chatClientConfigName: config.Models ?? "default",
        threadId: config.Handle ?? fabrcoreAgentHost.GetHandle(),
        tools: tools);

    _agent = result.Agent;
    _session = result.Session;
}
```
The call to ResolveConfiguredToolsAsync() is required before CreateChatClientAgent — tools are not auto-resolved. It discovers plugins via [PluginAlias], standalone tools via [ToolAlias], and connects any configured MCP servers.
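The OnInitialize example wraps a LookupOrder method with AIFunctionFactory.Create but does not show the method itself. A minimal sketch of what such a tool might look like follows; the order table, the ID format, and the reply strings are hypothetical. AIFunctionFactory in Microsoft.Extensions.AI uses [Description] attributes on a method and its parameters to describe the tool to the model, so the sketch annotates both. It is written as a local function so it runs on its own; inside the agent it would presumably be an ordinary method on SupportAgent.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

// Hypothetical order data; a real agent would call an order service here.
var orders = new Dictionary<string, string>
{
    ["A100"] = "Shipped",
    ["A101"] = "Processing",
};

// The descriptions become the tool and parameter descriptions the model sees.
[Description("Looks up the current status of a customer order by its ID.")]
string LookupOrder([Description("The order ID, for example A100.")] string orderId) =>
    orders.TryGetValue(orderId, out var status)
        ? $"Order {orderId}: {status}"
        : $"Order {orderId} was not found.";

Console.WriteLine(LookupOrder("A100"));
Console.WriteLine(LookupOrder("Z999"));
```

Returning a readable "not found" string instead of throwing gives the model something it can relay to the user.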
OnMessage — Processing Requests
Every request or one-way message enters OnMessage. Orleans runs each grain's code on one thread at a time, and a message that arrives while OnMessage is still running is routed to OnMessageBusy instead, so two OnMessage calls never overlap on the same grain. The recommended pattern consumes the LLM's streaming output and accumulates it into a single response:
```csharp
public override async Task<AgentMessage> OnMessage(AgentMessage message)
{
    var response = message.Response();
    var chatMessage = new ChatMessage(ChatRole.User, message.Message);

    await foreach (var update in _agent!.RunStreamingAsync(chatMessage, _session!))
    {
        response.Message += update.Text;
    }

    return response;
}
```
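The loop above follows a general pattern: consume an IAsyncEnumerable of partial updates and concatenate their text into one final reply. The sketch below reproduces that pattern with only the base class library. FakeStreamAsync is a hypothetical stand-in for RunStreamingAsync, and a StringBuilder is used instead of repeated string concatenation, which avoids re-copying the growing reply for every chunk.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

// Hypothetical stand-in for _agent.RunStreamingAsync: yields a canned reply in chunks.
// The prompt is ignored by this fake.
async IAsyncEnumerable<string> FakeStreamAsync(string prompt)
{
    foreach (var chunk in new[] { "Your order ", "shipped ", "yesterday." })
    {
        await Task.Delay(10); // simulate latency between chunks
        yield return chunk;
    }
}

// Accumulate streamed chunks into one reply, as OnMessage does with update.Text.
async Task<string> CollectAsync(IAsyncEnumerable<string> updates)
{
    var text = new StringBuilder();
    await foreach (var chunk in updates)
    {
        text.Append(chunk);
    }
    return text.ToString();
}

var reply = await CollectAsync(FakeStreamAsync("Where is my order?"));
Console.WriteLine(reply); // Your order shipped yesterday.
```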
Token counts are automatically captured and attached to the response Args (e.g., _tokens_input, _tokens_output). Chat history is auto-flushed after OnMessage completes, and compaction runs if the configured token threshold is exceeded.
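The compaction algorithm itself is not described here, so the following is only a hypothetical sketch of a threshold-triggered policy: when the estimated token total exceeds a configured limit, older messages are folded into one summary entry and the most recent ones are kept verbatim. The four-characters-per-token estimate, the keepRecent parameter, and the placeholder summary text are all assumptions; a real implementation would typically ask the LLM to write the summary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Rough token estimate (assumption): about four characters per token, rounded up.
int EstimateTokens(string text) => (text.Length + 3) / 4;

// Hypothetical policy: once the estimated total exceeds tokenThreshold, replace all
// but the last keepRecent messages with a single summary entry.
List<string> CompactIfNeeded(List<string> history, int tokenThreshold, int keepRecent)
{
    var total = history.Sum(message => EstimateTokens(message));
    if (total <= tokenThreshold || history.Count <= keepRecent)
    {
        return history; // under budget, or nothing old enough to fold away
    }

    var older = history.Count - keepRecent;
    var summary = $"[Summary of {older} earlier messages]"; // an LLM would write this text
    return history.Skip(older).Prepend(summary).ToList();
}

// Ten messages of 13 estimated tokens each: 130 in total.
var conversation = Enumerable.Range(1, 10)
    .Select(i => $"Message {i}: " + new string('x', 40))
    .ToList();

var compacted = CompactIfNeeded(conversation, tokenThreshold: 100, keepRecent: 3);
Console.WriteLine(string.Join(Environment.NewLine, compacted));
```

With a threshold of 100, the result is the summary line followed by messages 8, 9, and 10.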
OnEvent — Fire-and-Forget Events
Events use the EventMessage class (CloudEvents-inspired) and arrive via the AgentEvent stream. They are one-way — no response is expected.
```csharp
public override Task OnEvent(EventMessage eventMessage)
{
    switch (eventMessage.Type)
    {
        case "order.status-changed":
            logger.LogInformation("Order status changed: {Data}", eventMessage.Data);
            break;
    }

    return Task.CompletedTask;
}
```
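As the number of event types grows, the switch statement can give way to a lookup table from event type to handler. The sketch below shows that design with plain delegates. The (type, data) string pair stands in for EventMessage, and every event type except order.status-changed is hypothetical. Unknown types are ignored, matching the switch above, which has no default case; with fire-and-forget delivery there is no caller to receive an error anyway.

```csharp
using System;
using System.Collections.Generic;

var log = new List<string>();

// Map CloudEvents-style type strings (stand-in for EventMessage.Type) to handlers.
var handlers = new Dictionary<string, Action<string>>(StringComparer.Ordinal)
{
    ["order.status-changed"] = data => log.Add($"Order status changed: {data}"),
    ["order.refunded"] = data => log.Add($"Refund issued: {data}"), // hypothetical type
};

// Fire-and-forget dispatch: nothing is returned, and unknown types are skipped.
void Dispatch(string type, string data)
{
    if (handlers.TryGetValue(type, out var handle))
    {
        handle(data);
    }
}

Dispatch("order.status-changed", "A100 -> Shipped");
Dispatch("inventory.restocked", "SKU-9"); // no handler registered: ignored
Dispatch("order.refunded", "A101");

Console.WriteLine(string.Join(Environment.NewLine, log));
```

Adding a new event type then means adding one dictionary entry rather than another case label.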
Health Monitoring
Every agent grain exposes health information through GetHealth. The base implementation returns basic status, uptime, and message counts. Override it to surface domain-specific metrics:
```csharp
public override AgentHealthStatus GetHealth(HealthDetailLevel level)
{
    var health = base.GetHealth(level);

    if (level >= HealthDetailLevel.Detailed)
    {
        // _isReady is assumed to be a bool field this agent sets once OnInitialize completes
        health = health with
        {
            Message = _isReady ? "Ready" : "Initializing"
        };
    }

    return health;
}
```
Health states include Healthy, Degraded, Unhealthy, and NotConfigured. The diagnostics API at /fabrcoreapi/diagnostics/agents aggregates health across the cluster, making it straightforward to monitor hundreds of agents from a single dashboard. At the Detailed level, health responses include agent type, uptime, messages processed, active timer count, and reminder count.
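How the diagnostics endpoint combines per-agent results is not specified here. As a hypothetical sketch of that kind of roll-up, the code below tallies agents by health state and lists every agent that is not Healthy. The handles and states are made-up sample data; only the four state names come from the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up sample data: (agent handle, reported health state).
var reports = new List<(string Handle, string State)>
{
    ("user1:support-agent", "Healthy"),
    ("user2:support-agent", "Healthy"),
    ("user3:support-agent", "Degraded"),
    ("user4:billing-agent", "NotConfigured"),
    ("user5:billing-agent", "Unhealthy"),
};

// Count agents per state.
var counts = reports
    .GroupBy(r => r.State)
    .ToDictionary(g => g.Key, g => g.Count());

// Everything that is not Healthy is flagged for a closer look.
var needsAttention = reports
    .Where(r => r.State != "Healthy")
    .Select(r => $"{r.Handle} ({r.State})")
    .ToList();

foreach (var (state, count) in counts)
{
    Console.WriteLine($"{state}: {count}");
}
Console.WriteLine("Needs attention: " + string.Join(", ", needsAttention));
```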
Why Orleans Grains Matter for AI Agents
The grain model solves several problems that emerge when running AI agents at scale:
- Isolation: Each agent grain runs single-threaded. No locks, no race conditions, no shared mutable state between agents. Custom state is persisted automatically on deactivation.
- Location transparency: Agents can live on any silo in the cluster. Orleans routes messages to the correct machine transparently. Add silos to scale horizontally.
- Automatic lifecycle: Grains activate on first message and deactivate when idle. No manual resource management. Chat history and custom state flush automatically.
- Failure recovery: If a silo goes down, its grains re-activate on a healthy silo when they are next called. Persistent reminders survive restarts. Timers resume on activation.
- Concurrent message handling: OnMessage is marked [AlwaysInterleave], allowing OnMessageBusy to handle messages that arrive while the agent is already processing. Stale message protection kicks in after 5 minutes.
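Because [AlwaysInterleave] lets a second call enter the grain while the first is still awaiting the LLM, something has to decide whether that call goes to OnMessage or OnMessageBusy. The sketch below models that decision as a busy marker with a start time. It encodes one hypothetical reading of "stale message protection": a marker older than five minutes is treated as abandoned, so the next message is processed normally instead of receiving a busy reply. FabrCore's actual rule may differ, and Route and Complete are illustrative names.

```csharp
using System;

// Hypothetical model: when the current OnMessage started, or null if idle.
DateTime? busySince = null;
var staleAfter = TimeSpan.FromMinutes(5);

// Decide how a message arriving at `now` is handled.
string Route(DateTime now)
{
    if (busySince is DateTime started && now - started < staleAfter)
    {
        return "OnMessageBusy"; // still processing: reply "busy"
    }
    busySince = now;            // idle, or the previous marker is stale: take the message
    return "OnMessage";
}

// Mark processing as finished.
void Complete() => busySince = null;

var t0 = new DateTime(2025, 1, 1, 12, 0, 0);
var first = Route(t0);                 // idle -> OnMessage
var second = Route(t0.AddSeconds(30)); // first call still running -> OnMessageBusy
var third = Route(t0.AddMinutes(6));   // marker is 6 minutes old -> treated as stale
Complete();
var fourth = Route(t0.AddMinutes(7));  // idle again -> OnMessage

Console.WriteLine($"{first}, {second}, {third}, {fourth}");
```

The output is `OnMessage, OnMessageBusy, OnMessage, OnMessage`. Timestamping the marker, rather than using a plain boolean, is what lets a hung call eventually stop blocking new work.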
This combination means you can go from a single-server development setup to a multi-silo production cluster without changing your agent code. The lifecycle methods stay the same — Orleans and FabrCore handle the rest.
Built with FabrCore on .NET 10.
Builder of FabrCore and OpenCaddis.