
Multi-Agent Parallel Orchestrator

This skill turns multi-agent collaboration into a durable coordination board instead of a chat-memory workflow.

Current Capabilities

  • Dual coordination modes: shared-board for lower token cost and easier human handoff, centralized-dispatch for higher-throughput ranking and automation.
  • Topology-aware auto mode selection using optimize_for, agent_count, machine_count, window_count, and shared_storage.
  • while loop execution with one-time workspace and branch confirmation reused through --while-session.
  • Stable agent identity with agent_id, plus host_id and window_id tracking for multi-machine and multi-window teams.
  • SQLite-backed source of truth with coordination.db, exported state.json, and human-readable STATUS.md / AGENTS.md / SESSIONS.md.
  • Shared-board local projections now also include NOTIFICATIONS.md, so direct CLI users can see actionable board signals without running the dispatch service.
  • Dependency-aware ranking with priority, specialty, critical-path pressure, downstream pressure, and path-scope ownership checks.
  • Semi-automatic owned_paths inference from git changes, entry files, and workspace scan candidates.
  • Generated handoff bundles under project-plan/handoffs/ for cross-window takeover and review handoff.
  • Optional lightweight HTTP dispatch service through coordination_dispatch_server.py and coordination_dispatch_client.py, now running coordination commands in-process instead of shelling out to a new Python subprocess per request.
  • Structured dispatch APIs for next-action, claim-next, and claim-review-next, so centralized mode can use JSON scheduling calls without replacing shared-board CLI flows.
  • Structured session and lease APIs for centralized-dispatch, so service users can register windows, receive a stable registration contract with agent_id, session_id, and expected_session_epoch, inspect SESSIONS.md lease state as JSON, renew leases safely, reap stale idle sessions, and reclaim stale work without going back through raw CLI text parsing.
  • Optional persisted event stream for centralized-dispatch, so service users can poll restart-safe coordination events without changing shared-board behavior.
  • Structured centralized notifications plus event-retention pruning, so service users can subscribe to actionable review or blocker signals targeted at workers, reviewers, workers:<specialty>, reviewers:<specialty>, or agent:<id>, and keep coordination.db history bounded over long runs.
  • Event-retention policy is projected back into state.json, STATUS.md, and doctor output, so later agents can see the current pruning rules without inspecting server startup flags.
  • Shared-board direct CLI can now read local event history and prune it with update_coordination_status.py events and update_coordination_status.py prune-events, so event-backed workflows do not require centralized-dispatch.
  • Those local shared-board events now cover preflight, intake scan, validate / validate-repair, and normal board mutations, so the persisted timeline is useful even without the dispatch service.
  • Shared-board notifications now also support local subscription-style views by target, specialty, kind, priority, and agent:<id>, so separate windows can poll just the reviewer or worker slice they care about.
  • Shared-board local notifications and events can now watch with --watch-seconds, so a local window can long-poll for the next matching signal without needing centralized-dispatch.
  • Shared-board next-action, next-claimable, and next-reviewable now also support --watch-seconds, so idle windows can wait for the next actionable work item without busy looping.
  • Session registry is now promoted into state.json, coordination.db, and SESSIONS.md, so host/window/agent activity is visible as a first-class coordination surface.
  • Handoff inbox state is now projected into HANDOFFS.md and handoff-index.json, and shared-board users can list, claim, acknowledge, and watch the next ready handoff without the dispatch service.
  • Event cursors now let both local CLI users and centralized readers resume from the last seen event or notification, then acknowledge what they have consumed.
  • Event pruning now archives deleted rows into project-plan/archive/events/*.jsonl, so retention stays bounded without losing audit history.
  • Centralized-dispatch now exposes lightweight SSE feeds for /events/stream and /notifications/stream, plus read-only /event-archives and /mode-advisor.
  • State mutations now support optimistic concurrency through revision and --expected-revision, so stale writers can fail fast instead of overwriting newer board state.
  • Session-bound centralized writes now also carry session_epoch, so a host can reject stale windows after idle-session recovery instead of letting an old binding keep writing.
  • Centralized lease mutations can now return and consume lease_token, so a reclaimed or re-claimed module can reject stale heartbeats and stale completion/release attempts instead of silently accepting an older holder.
  • claim, claim-next, start-review, and claim-review-next now auto-reclaim expired leases inside the same state mutation, so stale work can be taken over without a separate reclaim pass.
  • Empty-board reads no longer rewrite revision behind the scenes; read-only summary, state, and other load paths now keep optimistic-concurrency counters stable until a real state mutation occurs.
  • A stdio MCP entrypoint now exposes state, notifications, handoffs, dispatch, and generic coordination commands without creating a second coordination backend.
  • Intake-created tasks now split candidate_* dependency and ownership hints from confirmed depends_on / owned_paths, so rough scan suggestions stop over-serializing the board by default.
  • Preflight now supports safe, balanced, and throughput safety profiles, and projects the chosen policy into STATUS.md, state.json, and preflight.json.
  • validate --repair now checks and repairs not only markdown projections but also session_rows, handoff_rows, and cursor_rows, so stale derived bindings, orphan handoff entries, and broken cursor metadata can be normalized without manual SQLite edits.
  • doctor now reports repository-level counts and health signals for sessions, handoffs, cursors, archive files, stale leases, and current mode advice, so operators can inspect board integrity from one command before deciding whether to repair.
  • Verification coverage across workflow tests, enhancement tests, shared-board smoke, centralized-dispatch smoke, and bundle verification.
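
The revision-based optimistic concurrency above (revision plus --expected-revision) follows a compare-and-swap pattern. A minimal in-memory sketch of that pattern, assuming a simplified board and conflict shape that are illustrative only and not the skill's actual internals:

```python
class RevisionConflict(Exception):
    """Raised when a writer's expected revision is stale."""
    def __init__(self, current_revision):
        super().__init__(f"stale revision, board is at {current_revision}")
        self.current_revision = current_revision

class Board:
    def __init__(self):
        self.revision = 0
        self.state = {}

    def write(self, updates, expected_revision):
        # Compare-and-swap: stale writers fail fast instead of
        # overwriting newer board state.
        if expected_revision != self.revision:
            raise RevisionConflict(self.revision)
        self.state.update(updates)
        self.revision += 1
        return self.revision

def write_with_retry(board, updates, max_attempts=3):
    expected = board.revision
    for _ in range(max_attempts):
        try:
            return board.write(updates, expected_revision=expected)
        except RevisionConflict as conflict:
            # Re-read the latest revision, then retry the mutation.
            expected = conflict.current_revision
    raise RuntimeError("gave up after repeated revision conflicts")
```

The same read-retry shape applies to the session_epoch and lease_token checks described in the later bullets: each is a token the writer must refresh after a conflict.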

Best Fit

  • One machine with multiple terminals, windows, or chat threads
  • Multiple machines sharing the same visible project-plan/ directory
  • Projects that need recoverable, auditable, handoff-friendly coordination

Current Boundaries

  • This is still not a full centralized scheduler.
  • The lightweight dispatch server is a thin coordination entrypoint, not an independent orchestration engine.
  • For long-running, high-concurrency, cross-machine setups, the next step is still a real central coordination service.

Quick Start

0. Bootstrap the board and optionally emit host-ready MCP config

python scripts/bootstrap_coordination.py --project-root "<project-root>"
python scripts/bootstrap_coordination.py --project-root "<project-root>" --emit-mcp-config

When --emit-mcp-config is used, the skill writes project-plan/mcp-host-config.json with a ready-to-import stdio MCP server entry for the current project root. Hosts should then register a worker or reviewer window once and cache the canonical registration.binding fields (agent_id, session_id, expected_session_epoch) before making later dispatch or lease calls.
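
One way a host could cache that registration contract, sketched as a pair of helpers. The response shape below is an assumption based on the field names documented here (registration.binding with agent_id, session_id, expected_session_epoch), not a guaranteed schema:

```python
# Canonical binding fields a host should persist after registering once.
BINDING_FIELDS = ("agent_id", "session_id", "expected_session_epoch")

def cache_binding(registration_response: dict) -> dict:
    """Extract the canonical binding fields from a register-session response."""
    binding = registration_response["registration"]["binding"]
    return {field: binding[field] for field in BINDING_FIELDS}

def with_binding(payload: dict, cached_binding: dict) -> dict:
    # Later dispatch or lease calls replay the exact cached fields.
    return {**payload, **cached_binding}
```

The point of caching once is that every later session-bound call replays the same binding instead of re-registering per request.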

1. Run preflight once before a while session

python scripts/preflight_coordination.py --project-root "<project-root>" --confirm-workspace --confirm-branch --session-kind while-loop --mode auto --optimize-for token --agent-count 2 --window-count 2
python scripts/preflight_coordination.py --project-root "<project-root>" --confirm-workspace --confirm-branch --session-kind while-loop --mode auto --optimize-for token --safety-profile safe

For cross-machine work that should optimize for throughput:

python scripts/preflight_coordination.py --project-root "<project-root>" --confirm-workspace --confirm-branch --session-kind while-loop --mode auto --optimize-for efficiency --agent-count 6 --machine-count 2 --window-count 4 --cross-machine

2. Initialize the first agent and seed intake

python scripts/update_coordination_status.py init-agent --project-root "<project-root>" --thread-key "worker-a" --role worker --while-session
python scripts/scan_project_intake.py --project-root "<project-root>" --thread-key "worker-a" --seed-tasks

3. Keep looping with next-action

python scripts/update_coordination_status.py next-action --project-root "<project-root>" --agent-id "<agent-id>" --specialty "backend"
python scripts/update_coordination_status.py next-action --project-root "<project-root>" --agent-id "<agent-id>" --specialty "backend" --watch-seconds 5
python scripts/update_coordination_status.py claim-next --project-root "<project-root>" --agent-id "<agent-id>" --reviewer-id "<reviewer-id>" --specialty "backend" --while-session
python scripts/update_coordination_status.py claim-review-next --project-root "<project-root>" --agent-id "<reviewer-id>" --while-session

4. Generate a handoff bundle

python scripts/update_coordination_status.py handoff-bundle --project-root "<project-root>" --module "<module>" --agent-id "<agent-id>"

The bundle is written to:

project-plan/handoffs/<module>.handoff.md

Lightweight Centralized Entry Point

Start the proxy:

python scripts/coordination_dispatch_server.py --project-root "<project-root>" --host 127.0.0.1 --port 8765

Call it through the client:

python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" preflight --confirm-workspace --confirm-branch --session-kind while-loop --mode centralized-dispatch --optimize-for efficiency --agent-count 6 --machine-count 2 --window-count 4 --cross-machine
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" update summary
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" validate --repair
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" state
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" state format=summary
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" events since=0 limit=50
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" events-stream since=0 limit=20 timeout=5
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" notifications since=0 limit=50 target=reviewers:backend kind=review-ready
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" notifications-stream since=0 limit=20 timeout=5 target=reviewers:backend kind=review-ready
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" mode-advisor
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" prune-events --max-rows 2000
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" event-archives
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" sessions
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" register-session --thread-key "worker-a" --role worker --host-id machine-a --window-id window-1
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" lease-heartbeat --agent-id "<agent-id>" --session-id "<session-id>" --expected-session-epoch 1 --expected-revision 12 --lease-token "<lease-token>" --module "core-auth" --ttl-seconds 600
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" dispatch claim-next --agent-id "<agent-id>" --session-id "<session-id>" --expected-session-epoch 1 --expected-revision 12 --specialty backend
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" lease-heartbeat --agent-id "<agent-id>" --session-id "<session-id>" --expected-session-epoch 1 --expected-revision 13 --lease-token "<lease-token>" --module "core-auth" --ttl-seconds 600
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" reap-sessions --max-age-seconds 86400
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" lease-reclaim
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" dispatch next-action --agent-id "<agent-id>" --specialty backend
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" dispatch claim-next --agent-id "<agent-id>" --reviewer-id "<reviewer-id>" --specialty backend --while-session
python scripts/coordination_dispatch_client.py --server-url "http://127.0.0.1:8765" dispatch claim-next --agent-id "<agent-id>" --specialty backend --expected-revision 12

Structured session-bound writes should replay the full registration/lease contract, especially session_id, expected_session_epoch, expected_revision, and the latest lease_token where required.
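
A helper that assembles such a call could look like the sketch below. The flag names come from the client examples above; the helper itself (client_argv) is hypothetical glue, not part of the skill:

```python
def client_argv(server_url, command, *, agent_id, session_id,
                expected_session_epoch, expected_revision,
                lease_token=None, extra=()):
    """Build a coordination_dispatch_client.py argv that replays the
    full registration/lease contract on a session-bound write."""
    argv = [
        "python", "scripts/coordination_dispatch_client.py",
        "--server-url", server_url,
        *command.split(),
        "--agent-id", agent_id,
        "--session-id", session_id,
        "--expected-session-epoch", str(expected_session_epoch),
        "--expected-revision", str(expected_revision),
    ]
    if lease_token is not None:
        # Lease mutations also carry the latest lease_token.
        argv += ["--lease-token", lease_token]
    return argv + list(extra)
```

Keeping argv construction in one place makes it harder to forget a contract field on any single call site.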

This does not replace shared-board. In local or lower-overhead runs, direct CLI remains first-class:

python scripts/update_coordination_status.py next-action --project-root "<project-root>" --agent-id "<agent-id>" --specialty "backend"
python scripts/update_coordination_status.py next-reviewable --project-root "<project-root>" --agent-id "<reviewer-id>" --specialty "backend" --watch-seconds 5
python scripts/update_coordination_status.py claim-next --project-root "<project-root>" --agent-id "<agent-id>" --reviewer-id "<reviewer-id>" --specialty "backend" --while-session
python scripts/update_coordination_status.py notifications --project-root "<project-root>"
python scripts/update_coordination_status.py notifications --project-root "<project-root>" --target "reviewers:backend" --since 0 --limit 20
python scripts/update_coordination_status.py notifications --project-root "<project-root>" --target "reviewers:backend" --kind "review-ready" --cursor-id "review-feed"
python scripts/update_coordination_status.py ack-notifications --project-root "<project-root>" --cursor-id "review-feed"
python scripts/update_coordination_status.py notifications --project-root "<project-root>" --target "agent:agent-1234abcd" --since 0 --limit 20
python scripts/update_coordination_status.py notifications --project-root "<project-root>" --target "reviewers:backend" --kind "review-ready" --since 0 --limit 20 --watch-seconds 5
python scripts/update_coordination_status.py events --project-root "<project-root>" --since 0 --limit 50 --notification-only --target "reviewers:backend"
python scripts/update_coordination_status.py events --project-root "<project-root>" --cursor-id "event-feed"
python scripts/update_coordination_status.py ack-events --project-root "<project-root>" --cursor-id "event-feed"
python scripts/update_coordination_status.py events --project-root "<project-root>" --source "validate" --since 0 --limit 20 --watch-seconds 5
python scripts/update_coordination_status.py handoffs --project-root "<project-root>"
python scripts/update_coordination_status.py next-handoff --project-root "<project-root>" --agent-id "<agent-id>" --watch-seconds 5
python scripts/update_coordination_status.py claim-handoff --project-root "<project-root>" --module "<module>" --agent-id "<agent-id>"
python scripts/update_coordination_status.py ack-handoff --project-root "<project-root>" --module "<module>" --agent-id "<agent-id>"
python scripts/update_coordination_status.py mode-advisor --project-root "<project-root>"
python scripts/update_coordination_status.py prune-events --project-root "<project-root>" --max-rows 2000
python scripts/update_coordination_status.py event-archives --project-root "<project-root>"
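
The cursor-based read-then-ack pattern behind events --cursor-id and ack-events can be sketched with a small in-memory log. The EventLog class below is illustrative only, not the skill's actual SQLite-backed storage:

```python
class EventLog:
    def __init__(self):
        self.events = []   # append-only event rows
        self.cursors = {}  # cursor_id -> index of first unread event

    def append(self, event):
        self.events.append(event)

    def read(self, cursor_id, limit=50):
        # Resume from the last acknowledged position for this cursor.
        start = self.cursors.get(cursor_id, 0)
        return self.events[start:start + limit]

    def ack(self, cursor_id, count):
        # Advance only after the batch has actually been processed,
        # so a crashed reader re-reads instead of losing events.
        self.cursors[cursor_id] = self.cursors.get(cursor_id, 0) + count
```

Separating read from ack is what makes the feed safe to resume: an unacknowledged batch is delivered again on the next read.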

In other words:

  • shared-board: direct CLI is the normal path, with local notifications, cursor-aware event history, handoff inbox commands, pruning, archive inspection, and NOTIFICATIONS.md / HANDOFFS.md projections.
  • centralized-dispatch: service endpoints, structured dispatch, persisted event polling, SSE streams, notification polling, archive inspection, and mode advice become the higher-efficiency path.

MCP Entry Point

Run the stdio MCP bridge when another host wants structured coordination tools:

python scripts/coordination_mcp_server.py --project-root "<project-root>"

Or generate a host-importable config directly from bootstrap:

python scripts/bootstrap_coordination.py --project-root "<project-root>" --emit-mcp-config

That writes:

project-plan/mcp-host-config.json

The file contains a ready mcpServers entry pointing at scripts/coordination_mcp_server.py with the current project root and folder name.
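
As an illustration, the emitted entry plausibly has a shape like the following; the exact keys are an assumption, since only the mcpServers entry, script path, project root, and folder name are documented here:

```json
{
  "mcpServers": {
    "<folder-name>": {
      "command": "python",
      "args": [
        "scripts/coordination_mcp_server.py",
        "--project-root", "<project-root>"
      ]
    }
  }
}
```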

It exposes:

  • coordination_state
  • coordination_events
  • coordination_notifications
  • coordination_handoffs
  • coordination_sessions
  • coordination_mode_advisor
  • coordination_session_register
  • coordination_session_reap
  • coordination_lease_heartbeat
  • coordination_lease_reclaim
  • coordination_dispatch
  • coordination_command

For hosts that keep a long-lived worker or reviewer window, prefer the structured MCP or dispatch path that carries registration.binding (agent_id, session_id, expected_session_epoch) and then the latest lease_token returned after a claim, review claim, or heartbeat. That gives the host an explicit session binding plus per-lease compare-and-swap style protection for renew/release/finish flows.

Host lifecycle rules:

  • Persist registration.binding from register-session or coordination_session_register and send those exact fields on later dispatch, lease-heartbeat, and session-bound status-mutation calls.
  • registration.follow_up_fields is canonical for which fields a host must replay on structured follow-up calls; dispatch reuses agent_id, session_id, expected_session_epoch, while lease and status mutations also carry lease_token.
  • Structured conflict payloads use shared keys current_session_epoch, current_revision, and lease_token without changing the success envelope.
  • Structured HTTP endpoints and MCP tool calls both treat state-revision-conflict, session-binding-conflict, session-epoch-conflict, and lease-token-conflict as the canonical conflict set; HTTP maps them to 409, including /leases/reclaim, while malformed payloads still use invalid-arguments semantics.
  • A session cannot be rebound to a new host/window while it still holds a live lease; use session-reap only for stale idle sessions, and lease-reclaim for expired module leases.
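
Given those rules, a host's conflict handler mostly needs to decide which cached contract fields to refresh before retrying. A hedged sketch, where the conflict codes and shared keys (current_session_epoch, current_revision, lease_token) come from the rules above but the error payload shape is an assumption:

```python
# Canonical conflict set; HTTP maps these to 409, per the rules above.
CONFLICT_CODES = {
    "state-revision-conflict",
    "session-binding-conflict",
    "session-epoch-conflict",
    "lease-token-conflict",
}

def plan_recovery(error: dict) -> dict:
    """Map a structured conflict payload to the contract fields a host
    should refresh before retrying the session-bound write."""
    code = error.get("error")
    if code not in CONFLICT_CODES:
        raise ValueError(f"not a coordination conflict: {code!r}")
    refresh = {}
    if "current_revision" in error:
        refresh["expected_revision"] = error["current_revision"]
    if "current_session_epoch" in error:
        refresh["expected_session_epoch"] = error["current_session_epoch"]
    if "lease_token" in error:
        refresh["lease_token"] = error["lease_token"]
    return refresh
```

Malformed payloads stay on the invalid-arguments path and should not be retried with refreshed fields.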

Verification

Run the full bundle check:

python scripts/verify_skill_bundle.py

The verifier now runs the split regression suites, both smoke paths, and the demo / scan / doctor validation chain.

Keep references/release-checklist.md as the canonical verification matrix and command list.

References

  • SKILL.md
  • references/coordination-files.md
  • references/example-walkthrough.md
  • references/example-snapshots.md
  • references/troubleshooting.md
