Introducing FabrCore Swarm — Distributed Multi-Agent Task Orchestration

What happens when a single request requires the coordinated effort of five, ten, or twenty specialized AI agents — each with its own tools, domain knowledge, and operational constraints? We built FabrCore Swarm to find out. It is an experimental orchestration layer that takes a natural-language goal, builds a dependency-aware task plan, fans work out across your existing agents, and adapts in real time when things go wrong.

Swarm is not a standalone product. It is a library that sits on top of FabrCore and treats your pre-existing agents as the workforce. You bring the domain experts; Swarm brings the coordination.

What FabrCore Swarm Is

FabrCore Swarm is a hierarchical multi-agent orchestration system. At its core, it manages three things:

Planning — decomposing a user’s goal into a directed acyclic graph (DAG) of tasks, each assigned to an agent best suited to execute it.
Execution — dispatching tasks in dependency-resolved waves, passing results from completed tasks forward as context for downstream work.
Recovery — detecting failures, consulting subject-matter experts, retrying, skipping, or replanning mid-execution when the original plan hits a wall.

The system is built entirely on FabrCore primitives — Orleans grains, agent messaging, event streaming, and persistent state — so it inherits all of FabrCore’s durability, scalability, and observability guarantees out of the box.

The Swarm Constellation: Six Permanent Agents

Every swarm instance is composed of six specialized agent roles that form a permanent coordination layer (the worker role is instantiated once per client agent). None of these agents performs domain work itself; they exist solely to plan, delegate, monitor, and recover.

Agent	Role	Uses LLM?
Orchestrator	Owns the full plan lifecycle. Receives the user’s goal, gates approval, drives execution, handles failure decisions, and enforces termination policy.	Yes — for failure reasoning
Planner	Builds and revises the task plan. Discovers available agents, creates tasks, links dependencies, and optionally consults subject-matter experts before committing.	Yes — for plan construction
Supervisor	Manages task dispatch. Receives batches of ready tasks from the orchestrator, pushes them to the correct workers, collects results, and escalates roadblocks.	No — pure state machine
Workers	One per client agent. Each worker is permanently paired with a specific domain agent. It delegates tasks, validates completion, and manages shared-state subscriptions.	Yes — for delegation and validation
Blackboard	Per-plan shared key-value store with push-event subscriptions. Workers read and write results here; downstream tasks consume them as input context.	No — pure state machine
Factory	Registry discovery. Exposes the live agent registry so the planner can see what agents are available and what they can do.	No

This separation matters. The orchestrator never touches domain data. The planner never executes tasks. The supervisor never reasons about failures. Each agent has a narrow scope, which keeps the system predictable and debuggable even when dozens of tasks are in flight.

How Planning Works

When a user submits a goal, the orchestrator forwards it to the planner. The planner’s first move is discovery — it queries the live agent registry to see which client agents are currently online, what capabilities they advertise, what plugins and tools they have loaded, and what model they’re running.

This is not a static manifest. Capabilities are projected in real time from two sources: class-level metadata attributes on the agent (descriptions, capability tags, notes) and the agent’s actual runtime health state (loaded plugins, available tools, configured model). The planner sees a live, accurate picture of what each agent can actually do right now.
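As a rough illustration of that two-source projection, the sketch below merges declared metadata with runtime health into the view a planner might consume. Every name here (`AgentMetadata`, `AgentHealth`, `project_registry`) is hypothetical and is not FabrCore's API; dropping offline agents from the view is also an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentMetadata:
    """Static, class-level declarations (hypothetical shape)."""
    description: str
    capabilities: tuple[str, ...] = ()
    notes: str = ""


@dataclass(frozen=True)
class AgentHealth:
    """Runtime state reported by a running agent (hypothetical shape)."""
    online: bool
    plugins: tuple[str, ...] = ()
    tools: tuple[str, ...] = ()
    model: str = ""


def project_registry(metadata: dict[str, AgentMetadata],
                     health: dict[str, AgentHealth]) -> dict[str, dict]:
    """Merge both sources into one planner-facing view.

    Assumption: agents that are offline or have no health report are left
    out, so the view reflects what can run now, not what was declared.
    """
    view: dict[str, dict] = {}
    for alias, meta in metadata.items():
        state = health.get(alias)
        if state is None or not state.online:
            continue
        view[alias] = {
            "description": meta.description,
            "capabilities": list(meta.capabilities),
            "notes": meta.notes,
            "plugins": list(state.plugins),
            "tools": list(state.tools),
            "model": state.model,
        }
    return view
```

The point of the design is that the declared half changes only when code changes, while the health half is re-read each time the planner asks, so a tool that failed to load never appears in the plan.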

With that information, the planner constructs a task graph:

Each task has a description, an assigned agent alias, and an optional list of dependencies.
Dependencies are data flows — if Task B depends on Task A, then Task A’s result is passed to Task B as input context.
Tasks with no unmet dependencies form a “wave” and execute in parallel.
The planner validates that every assigned agent alias actually exists in the live registry before finalizing.
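The task graph rules above can be sketched in a few lines. This is an illustrative model only, assuming a hypothetical `Task` record and helper names; it shows alias validation against a set of live aliases and a layered topological sort that groups tasks into waves.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    description: str
    agent_alias: str
    depends_on: tuple[str, ...] = ()


def validate_plan(tasks: list[Task], live_aliases: set[str]) -> None:
    """Reject plans that reference unknown agents or unknown dependencies."""
    ids = {t.task_id for t in tasks}
    for t in tasks:
        if t.agent_alias not in live_aliases:
            raise ValueError(f"{t.task_id}: agent '{t.agent_alias}' is not in the registry")
        missing = set(t.depends_on) - ids
        if missing:
            raise ValueError(f"{t.task_id}: unknown dependencies {sorted(missing)}")


def partition_into_waves(tasks: list[Task]) -> list[list[str]]:
    """Group task ids into waves; each wave depends only on earlier waves."""
    remaining = {t.task_id: set(t.depends_on) for t in tasks}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # A task is ready when all of its dependencies are already done.
        ready = sorted(tid for tid, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"cycle detected among {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for tid in ready:
            del remaining[tid]
    return waves
```

For a plan where `b` and `c` both depend on `a`, and `d` depends on `b` and `c`, this yields three waves: `a`, then `b` and `c` together, then `d`. A dependency cycle leaves no task ready, which is how the sketch detects that the graph is not a DAG.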

Optionally, the planner can consult subject-matter expert (SME) agents before building the plan. These are domain-knowledgeable agents that can answer questions about constraints, data formats, or operational rules — giving the planner richer context to make better task assignments.

Once the plan is built, it goes to the user for approval. The orchestrator presents a formatted view showing every task, its assigned agent, dependency chains, and execution waves. Nothing executes until the user says “go.”

How Execution Works

After approval, the orchestrator provisions a per-plan blackboard and starts a recurring drive loop. Every few seconds, the drive loop checks the plan state and pushes the next wave of ready tasks forward.

The execution path for a single task:

Dependency resolution. The orchestrator identifies tasks whose dependencies are all satisfied. Results from completed upstream tasks are propagated as input context, pulled from the blackboard.
Approval gate. If a task is flagged as requiring approval, the orchestrator pauses and asks the user before dispatch.
Dispatch to supervisor. Ready tasks are sent as a batch. The supervisor resolves each task’s agent alias to the correct worker and pushes the assignment.
Worker delegation. The worker receives the task, subscribes to the blackboard for any shared data it needs, and delegates the work to its paired client agent. The delegation message includes the task description and all input context from upstream dependencies.
Client execution. The client agent runs its own tools and logic — completely unaware that it’s part of a swarm. It receives a message, does its work, and returns a result.
Completion validation. The worker validates that the client’s response actually accomplishes the task — not just that it returned something. If the response is partial or analysis-only (describing what should be done instead of doing it), the worker re-delegates with clarified instructions.
Result propagation. The completed result is written to the blackboard. The supervisor reports back to the orchestrator. Downstream tasks can now see this result in their input context.

All of this happens across parallel waves. While one set of tasks is executing, the orchestrator is already identifying the next set of tasks that will become ready once the current wave completes.
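A single-process sketch of that drive loop follows. It is a simplification under stated assumptions: the real system is timer-driven and asynchronous, whereas here each cycle runs synchronously, the `dispatch` callable stands in for the whole supervisor, worker, and client chain, and a plain dict stands in for the blackboard. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlanTask:
    task_id: str
    description: str
    depends_on: tuple[str, ...] = ()
    status: str = "pending"  # "pending" or "completed" in this sketch


# Stand-in for supervisor + worker + client: (task, input context) -> result.
Dispatch = Callable[[PlanTask, dict[str, str]], str]


def drive_once(tasks: dict[str, PlanTask], blackboard: dict[str, str],
               dispatch: Dispatch) -> list[str]:
    """Run one cycle: dispatch every task whose dependencies are complete."""
    # Readiness is decided before anything runs, so a task finished in this
    # cycle only unblocks its dependents in the next cycle.
    ready = [t for t in tasks.values()
             if t.status == "pending"
             and all(tasks[d].status == "completed" for d in t.depends_on)]
    for task in ready:
        # Input context is pulled from upstream results on the blackboard.
        context = {dep: blackboard[dep] for dep in task.depends_on}
        blackboard[task.task_id] = dispatch(task, context)
        task.status = "completed"
    return [t.task_id for t in ready]


def drive(tasks: dict[str, PlanTask], blackboard: dict[str, str],
          dispatch: Dispatch, max_iterations: int = 50) -> list[list[str]]:
    """Repeat cycles until the plan completes or the iteration limit is hit."""
    waves: list[list[str]] = []
    iterations = 0
    while any(t.status != "completed" for t in tasks.values()):
        if iterations >= max_iterations:
            raise RuntimeError("drive-loop iteration limit reached")
        iterations += 1
        wave = drive_once(tasks, blackboard, dispatch)
        if not wave:
            raise RuntimeError("no task is ready; the plan is stuck")
        waves.append(wave)
    return waves
```

Unlike a precomputed wave list, readiness here is re-evaluated every cycle from current task status, which is what lets a revised plan or a retried task slot in without restarting execution.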

How Client Agents Adapt into the Swarm

One of the most important design decisions in Swarm is that client agents do not need to know they are part of a swarm. There is no swarm SDK for clients to implement, no special interface to adopt, and no protocol to follow.

Clients register their agents with FabrCore as they normally would — specifying an alias, description, plugins, and tools. To make an agent available to the swarm, the client adds the agent’s alias to a swarm definition file. That’s it.

The swarm discovers capabilities from two sources automatically:

Class metadata — description, capability tags, and notes declared as attributes on the agent class.
Runtime health — the actual plugins, tools, and model currently loaded by the running agent instance.

This zero-registration discovery means the planner always sees what agents can actually do right now, not what a static config file says they should be able to do. If an agent goes offline or changes its tool configuration, the planner sees the updated state on its next planning cycle.

Workers bridge the gap between swarm coordination and client execution. Each worker is permanently paired with one client agent and manages all communication with it. The worker handles context injection (passing upstream results into the delegation message), completion validation, blackboard subscriptions, and roadblock escalation. The client agent just receives a well-formed message and responds to it.
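To make the worker's role concrete, here is a minimal sketch of context injection plus completion validation with a bounded number of re-delegations. It assumes the client is any callable that maps a message to a reply, and that validation is a pluggable callable (in Swarm it is LLM-based; a simple predicate stands in for it here). The function names, the hint text, and the choice to raise on exhaustion are all illustrative assumptions.

```python
from typing import Callable

# A client is anything that takes a message and returns a reply.
Client = Callable[[str], str]
# A validator decides whether a reply actually accomplishes the task.
Validator = Callable[[str, str], bool]


def build_message(description: str, context: dict[str, str], hint: str = "") -> str:
    """Compose the delegation message: task, upstream results, optional hint."""
    lines = [f"Task: {description}"]
    for source, result in context.items():
        lines.append(f"Input from {source}: {result}")
    if hint:
        lines.append(f"Note: {hint}")
    return "\n".join(lines)


def delegate(description: str, context: dict[str, str], client: Client,
             is_complete: Validator, max_attempts: int = 3) -> tuple[str, int]:
    """Delegate until the reply validates or the attempt budget runs out.

    Returns the accepted reply and the number of attempts used.
    """
    hint = ""
    for attempt in range(1, max_attempts + 1):
        reply = client(build_message(description, context, hint))
        if is_complete(description, reply):
            return reply, attempt
        # Partial or analysis-only reply: clarify and try again.
        hint = "Perform the task and return the result; do not only describe the steps."
    # Assumption: exhausting the budget surfaces as a roadblock to escalate.
    raise RuntimeError(f"roadblock: no complete result after {max_attempts} attempts")
```

The client only ever sees a plain message, which is why it needs no swarm-specific interface; everything swarm-aware lives on the worker's side of the call.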

The swarm also supports optional behavioral overlays — lightweight guidance that shapes how workers delegate and how clients interpret their tasks. These overlays are configured at the swarm definition level and injected at runtime, allowing the same client agent to behave differently depending on which swarm it’s participating in.

When Plans Break: Replanning and Recovery

Real-world execution rarely follows the happy path. Tools fail, agents encounter unexpected data, external services go down, or a task’s result reveals that the original plan was based on incorrect assumptions. Swarm is built for this.

Recovery happens at multiple levels:

Worker level. If a client returns a partial result, the worker re-delegates with clearer instructions. A stall guard limits re-delegation attempts to prevent infinite loops.
Worker SME consultation. Before escalating a roadblock, workers can consult subject-matter expert agents for guidance on how to proceed.
Supervisor level. The supervisor tracks roadblock fingerprints. If the same issue repeats, it escalates rather than retrying the same failing approach.
Orchestrator reasoning. When a task fails, the orchestrator consults its own LLM session with full plan context. It can choose to retry the task, skip it, consult an SME, ask the user for input, or trigger a full replan.
Replanning. The orchestrator sends the current plan state — including what succeeded, what failed, and why — back to the planner. The planner can add new tasks, remove failed ones, reassign work to different agents, or restructure dependencies. The execution engine picks up from where it left off with the revised plan.
Human escalation. When automated recovery is exhausted, the orchestrator escalates to the user with a specific question. The user’s answer is routed back through the system to the blocked task, which resumes from where it stopped.

This creates a recovery hierarchy: worker → SME → supervisor → orchestrator → human. Problems are handled at the lowest possible level and escalate only when cheaper options are exhausted. The orchestrator prefers retries and skips over replanning, and prefers replanning over asking the user, which minimizes human interruption while still ensuring that stuck situations get resolved.
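Roadblock fingerprinting, mentioned at the supervisor level, could look something like the sketch below. The normalization rule (lowercasing and replacing digits), the hash choice, and the `RoadblockTracker` class are assumptions made for illustration; the source does not describe how fingerprints are actually computed.

```python
import hashlib
import re
from collections import Counter


def fingerprint(task_id: str, error: str) -> str:
    """Hash a roadblock so repeats match even when numbers or spacing differ."""
    normalized = re.sub(r"\d+", "#", error.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(f"{task_id}|{normalized}".encode()).hexdigest()[:16]


class RoadblockTracker:
    """Counts roadblocks per fingerprint and decides when to stop retrying."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, task_id: str, error: str) -> str:
        """Return 'retry' for a fresh roadblock, 'escalate' once it repeats."""
        key = fingerprint(task_id, error)
        self.seen[key] += 1
        return "escalate" if self.seen[key] >= self.max_repeats else "retry"
```

With `max_repeats=2`, "Timeout after 30 s" and "timeout after 45 s" on the same task count as the same roadblock, so the second occurrence escalates instead of triggering another identical retry.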

Shared State: The Blackboard

Tasks in a swarm often need to share data beyond simple dependency chains. The blackboard is a per-plan key-value store that provides this coordination layer.

Any worker can read from or write to the blackboard during task execution. Workers subscribe to the blackboard at the start of their task and receive push notifications when other workers write new entries. This means a worker does not need to poll — if an upstream task writes a result that a downstream worker is waiting for, the downstream worker is notified immediately.
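The push model can be illustrated with a small in-memory store. This is a sketch, not the grain-backed implementation: per-key subscriptions, synchronous callbacks, and the immediate replay for keys that were already written are all assumptions of the example.

```python
from collections import defaultdict
from typing import Callable

Listener = Callable[[str, object], None]


class Blackboard:
    """In-memory key-value store that notifies subscribers on every write."""

    def __init__(self) -> None:
        self._entries: dict[str, object] = {}
        self._listeners: dict[str, list[Listener]] = defaultdict(list)

    def subscribe(self, key: str, listener: Listener) -> None:
        """Register a listener; fire at once if the key was already written."""
        self._listeners[key].append(listener)
        if key in self._entries:
            listener(key, self._entries[key])

    def write(self, key: str, value: object) -> None:
        """Store a value and push it to everyone subscribed to that key."""
        self._entries[key] = value
        for listener in list(self._listeners.get(key, [])):
            listener(key, value)

    def read(self, key: str, default: object = None) -> object:
        return self._entries.get(key, default)
```

Replaying the current value on subscribe closes the race where an upstream task finishes before the downstream worker has subscribed; without it, the late subscriber would wait for a notification that already happened.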

The blackboard is scoped to a single plan execution and is torn down when the plan completes. This prevents data leakage between runs and keeps the coordination surface clean.

Large results — datasets, documents, or analysis outputs — can be written to the blackboard instead of being embedded directly in messages, keeping the message channel lightweight while still making the data available to any task that needs it.

Termination and Safety Guarantees

Autonomous multi-agent systems must terminate. A swarm that loops forever or burns through resources without converging is worse than no swarm at all. FabrCore Swarm enforces termination at multiple levels:

Guard	What It Limits
Wall-clock timeout	Total plan execution time.
Per-task timeout	Individual task duration. Timed-out tasks are detected by the dependency resolver.
Task retry limit	Maximum attempts per task before it is marked as failed.
Drive-loop iteration limit	Maximum number of execution cycles per plan.
Roadblock limit	Maximum total roadblocks before the plan is considered failed.
Replan attempt limit	Maximum number of times the plan can be revised mid-execution.
Worker delegation guard	Maximum re-delegation attempts per task to prevent infinite delegation loops.
Completion validation guard	Maximum validation attempts before force-completing a task.
Roadblock fingerprinting	Repeated identical roadblocks are escalated rather than retried, preventing loops on the same error.

These bounds are configurable but enforced by default, so every plan terminates: successfully, partially, or with a clear failure reason.
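One way to picture the plan-level guards is as a policy object checked once per drive-loop cycle. The sketch below covers four of the guards from the table; the field names and every default value are placeholders invented for the example, not Swarm's actual settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TerminationPolicy:
    # All defaults are illustrative placeholders, not FabrCore Swarm's values.
    max_wall_clock_seconds: float = 1800.0
    max_drive_iterations: int = 500
    max_roadblocks: int = 10
    max_replans: int = 3


@dataclass
class PlanCounters:
    elapsed_seconds: float = 0.0
    drive_iterations: int = 0
    roadblocks: int = 0
    replans: int = 0


def termination_reason(policy: TerminationPolicy,
                       counters: PlanCounters) -> Optional[str]:
    """Return the first guard that has tripped, or None if the plan may continue."""
    checks = [
        (counters.elapsed_seconds >= policy.max_wall_clock_seconds, "wall-clock timeout"),
        (counters.drive_iterations >= policy.max_drive_iterations, "drive-loop iteration limit"),
        (counters.roadblocks >= policy.max_roadblocks, "roadblock limit"),
        (counters.replans >= policy.max_replans, "replan attempt limit"),
    ]
    for exceeded, reason in checks:
        if exceeded:
            return reason
    return None
```

Because every counter only increases and every limit is finite, at least one check eventually trips even if no task ever succeeds, which is the property that makes termination a guarantee rather than a hope.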

State Persistence and Resumption

All swarm state is persisted through FabrCore’s standard Orleans grain persistence. The orchestrator persists the full plan — task list, results, dependencies, counters, timestamps. The supervisor persists its worker mappings and task tracking. The blackboard persists its entries and subscriber list.

If the host process restarts mid-execution, the orchestrator reactivates, restores its plan state, and resumes the drive loop from where it left off. Tasks that were in flight at the time of the restart are detected as timed out and re-dispatched. There is no manual recovery step — the swarm picks up and continues.
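The resumption behaviour for in-flight tasks can be sketched as a pass over restored task records. The record shape, the status names, and the rule that a task with no attempts left is marked failed rather than re-queued are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    task_id: str
    status: str                        # "pending", "running", "completed", "failed"
    started_at: Optional[float] = None  # seconds, same clock as `now`
    attempts: int = 0                   # incremented each time the task is dispatched


def requeue_stale_tasks(tasks: list[TaskRecord], now: float,
                        task_timeout: float, max_attempts: int) -> list[str]:
    """After a restart, treat overdue running tasks as timed out.

    Overdue tasks go back to "pending" for re-dispatch, unless they have
    used up their attempts, in which case they are marked "failed".
    Returns the ids that were re-queued.
    """
    requeued: list[str] = []
    for task in tasks:
        if task.status != "running" or task.started_at is None:
            continue
        if now - task.started_at < task_timeout:
            continue  # still within its time budget; leave it alone
        if task.attempts >= max_attempts:
            task.status = "failed"
        else:
            task.status = "pending"
            task.started_at = None
            requeued.append(task.task_id)
    return requeued
```

Reusing the ordinary per-task timeout for this means a restart needs no special recovery path: a task lost in the crash looks the same as one whose client simply never answered.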

Human-in-the-Loop by Design

Swarm is not a fully autonomous system. It is designed with explicit human checkpoints:

Plan approval. No tasks execute until the user reviews and approves the plan.
Task-level approval. Individual tasks can be flagged as requiring approval before dispatch.
Roadblock escalation. When automated recovery fails, the user receives a specific, contextualized question — not a generic “something went wrong.”
Progress notifications. The orchestrator sends throttled status updates so the user can monitor execution without being overwhelmed.

The goal is to keep the human in control of what matters — approving the plan, unblocking genuine roadblocks — while the swarm handles the mechanical work of dispatching, monitoring, and recovering.

What Makes This Sophisticated

Plenty of systems can dispatch tasks to agents. The sophistication in FabrCore Swarm comes from the layers of intelligence and resilience built around that dispatch:

Multi-level LLM reasoning. The planner reasons about task decomposition. The orchestrator reasons about failure recovery. Workers reason about delegation and completion. Each uses a specialized LLM session with only the context it needs.
DAG-based parallel execution. Tasks are not executed sequentially. The dependency resolver partitions the plan into waves that execute in parallel, with automatic context propagation between waves.
Zero-registration agent discovery. No swarm-side registry to maintain. Capabilities are projected from live agent metadata and health state, ensuring the planner always sees ground truth.
Completion validation. Workers do not blindly accept client responses. A separate validation step catches partial or analytical responses and re-delegates for actual execution.
Graduated recovery. Five levels of recovery before a task is marked as failed: worker re-delegation, SME consultation, supervisor escalation, orchestrator reasoning, and human escalation.
Mid-execution replanning. The planner can revise the task graph while execution is in progress, adding, removing, or reassigning tasks based on what has succeeded and what has failed.
Blackboard event streaming. Workers coordinate through a shared store with push notifications, enabling real-time data sharing without polling or re-dispatch.
Full state persistence. Every agent in the constellation persists its state. The swarm can resume from a host restart without data loss or manual intervention.
Configurable termination policy. Nine distinct guards ensure every plan terminates, with sensible defaults and full configurability for production tuning.

Experimental Status

FabrCore Swarm is experimental. The API surface may change, the coordination patterns may evolve, and we are actively testing it against increasingly complex real-world scenarios. So far, the core architecture (hierarchical agents, dependency-aware planning, graduated recovery, and zero-registration client adaptation) has held up well in that testing.

If you are building multi-agent systems on FabrCore and want to coordinate work across your agents at scale, Swarm is where we are exploring what that looks like. We will share more as it matures.

Learn More

FabrCore Swarm is built on the FabrCore framework. Visit the documentation site for guides on building agents, plugins, and tools.
