GoatFlow 0.8.2 is half about the MCP server and half about the plugin manager. The MCP half is the planned work: dynamic tool discovery, SSE transport, and the admin SQL endpoint promoted to a first-class REST resource. The plugin manager half started as a bug investigation and grew into three resilience features, because the root cause turned out to be a class of failure rather than a single incident.
MCP v2: Tools From the REST API
The Problem
The MCP server shipped in 0.6.5 maintained its own tool registry. Every /api/v1/ endpoint had a corresponding hand-written MCP tool implementation, an input schema handcrafted against the OpenAPI spec, and a route handler that duplicated the real REST handler’s logic with a different shape. Fourteen tools, three files of registry glue, and a per-endpoint tax on every new feature. Adding a field meant touching two schemas and one handler; keeping them in sync was manual discipline.
The second problem was authorisation. MCP handlers re-implemented the admin middleware logic inline — they had to, because the real handlers ran behind Gin middleware the MCP server didn’t execute. A subtle divergence in role resolution meant API tokens couldn’t resolve to admin even when the underlying account was an admin; the MCP side didn’t look up the database role, it trusted a synthesised one.
The Solution
The MCP server in 0.8.2 generates its tool list dynamically from two sources: the routing engine’s own YAML spec (so every /api/v1/ endpoint becomes a tool automatically, with input schema derived from OpenAPI) and an optional MCPToolSpec on GKRegistration (so plugins declare tools with rich schemas when the auto-generated one isn’t enough).
The key design move was the API bridge. Instead of MCP handlers, the server now executes each tool by invoking the real Gin handler with a synthetic request context. That means MCP tool calls traverse exactly the same middleware stack as a real HTTP request: auth, RBAC, org scoping, rate limiting, audit logging. There’s no parallel permission system. If a REST endpoint is admin-only, the MCP tool is admin-only by construction.
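The bridge idea can be sketched with the standard library alone: synthesise an HTTP request for the tool call and run it through the real handler chain, so auth decisions happen exactly once. This is a minimal sketch — the real server uses Gin contexts and GoatFlow's actual middleware; the handler, path, and token here are illustrative.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newMux builds a stand-in for the real REST handler behind its middleware
// stack. The route and auth check are placeholders, not GoatFlow's own.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/ping", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for the auth middleware the real stack runs.
		if r.Header.Get("Authorization") != "Bearer secret" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, `{"ok":true}`)
	})
	return mux
}

// bridgeCall is the core move: build a synthetic request for the tool
// invocation and execute the real handler, so every middleware decision
// (auth, RBAC, audit) happens in one place, not in a parallel MCP copy.
func bridgeCall(h http.Handler, method, path, body, token string) (int, string) {
	req := httptest.NewRequest(method, path, strings.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+token)
	rec := httptest.NewRecorder() // captures the handler's response in memory
	h.ServeHTTP(rec, req)
	b, _ := io.ReadAll(rec.Result().Body)
	return rec.Code, string(b)
}

func main() {
	code, body := bridgeCall(newMux(), "GET", "/api/v1/ping", "", "secret")
	fmt.Println(code, body) // the authorised call reaches the handler
	code, _ = bridgeCall(newMux(), "GET", "/api/v1/ping", "", "wrong")
	fmt.Println(code) // rejected by the same middleware path, not a copy of it
}
```

The point of the pattern is visible in the second call: the rejection comes from the one real auth check, so there is nothing to keep in sync.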
Two new YAML fields give fine-grained control per route: mcp_description overrides the auto-generated tool description with LLM-friendly text, and mcp: false opts a route out of MCP tool generation entirely (useful for destructive endpoints where “let the model call it” is never the right answer).
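In a route spec, the two fields might look like the following — only mcp_description and mcp are the documented fields; the surrounding keys are an assumed, illustrative shape:

```yaml
# Hypothetical route entries; key names other than mcp_description
# and mcp are illustrative, not GoatFlow's actual spec format.
- path: /api/v1/widgets
  method: GET
  mcp_description: "List widgets for the current org, with id, name, and status."

- path: /api/v1/admin/purge
  method: POST
  mcp: false   # destructive endpoint; never generated as an MCP tool
```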
Transport moved to MCP’s 2025-03-26 Streamable HTTP spec with three endpoints — POST for JSON-RPC requests, GET for a server-to-client SSE notification stream with 30-second heartbeat, DELETE for session termination. ListChanged: true is advertised so the client is notified when plugin enable/disable/upload changes the tool list. The stdio proxy shim that 0.6.5 required is gone.
The Benefits
internal/mcp/server.go shrank from roughly 1050 lines to 130. The 14 hardcoded tool implementations are gone along with their glue. The maintenance tax moved from “per endpoint” to “per novel MCP behaviour” — and most endpoints have no novel behaviour, so they cost nothing.
The API-bridge approach also means a compromised or buggy tool implementation is impossible. There is no tool implementation. There’s only the REST handler, which is already audited and covered by the REST test suite.
The admin SQL capability, previously only reachable via MCP, is now a standard REST endpoint at POST /api/v1/admin/sql with the same statement allowlist (SELECT, DESCRIBE, EXPLAIN, SHOW TABLES, SHOW COLUMNS) and admin middleware. MCP picks it up automatically through the dynamic discovery path. Any HTTP client can use it now, not just MCP clients.
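The allowlist check itself reduces to a prefix match over the normalised statement. This is a sketch of the idea, not the endpoint's actual parsing, which may be stricter:

```go
package main

import (
	"fmt"
	"strings"
)

// allowedStatement sketches the read-only allowlist: the trimmed,
// upper-cased statement must begin with one of the permitted prefixes.
// A production check would also guard word boundaries and multi-statements.
func allowedStatement(sql string) bool {
	upper := strings.ToUpper(strings.TrimSpace(sql))
	for _, prefix := range []string{
		"SELECT", "DESCRIBE", "EXPLAIN", "SHOW TABLES", "SHOW COLUMNS",
	} {
		if strings.HasPrefix(upper, prefix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(allowedStatement("select id from users")) // read-only: allowed
	fmt.Println(allowedStatement("  SHOW TABLES"))        // allowed
	fmt.Println(allowedStatement("DROP TABLE users"))     // mutating: rejected
}
```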
Plugin Manager: Three Features, One Bug
The Problem
A downstream product built on GoatFlow’s plugin API maintains a long-lived peer table — the kind of state that only exists in memory. The product was silently losing entries. Not crashing, not logging errors, just… missing peers the operator knew had been added. Restarting the plugin got them back for a while. Then they’d disappear again.
Tracing the failure led to GoatFlow’s plugin loader. EnsureLoaded — called from AllWidgets and a handful of other places on dashboard loads — trusted a boolean discovered[].Loaded cache flag to decide whether a plugin needed to be spawned. Several different reload/replace/unregister code paths set that flag, and at least one of them was failing to keep it in sync with the manager’s actual registry. When the flag said “not loaded” but a gRPC process was still running, the loader spawned a duplicate. The duplicate died within ~300ms — socket collision with the original, visible as acceptAndServe error: timeout waiting for accept in the logs — and the plugin manager routed to the ghost. Any state the original held was effectively forgotten.
The user-visible symptom was “peers disappear.” The underlying issue was that stateful gRPC plugins are invisibly fragile to any kind of discovery-cache desync. This applies to anything a plugin holds in memory: scheduler state, connection pools, websocket registries, retry queues. One dashboard load away from a silent reset.
The Solution
Three changes together, because any one alone leaves a gap.
Ground truth. EnsureLoaded now checks manager.Get(name) — the actual registry — before deciding to load. The discovered[].Loaded flag still exists as a fast-path hint that skips the lookup when it agrees with reality, but it never makes policy decisions on its own. When a desync is detected, a warn-level log entry is emitted with the plugin name so operators can spot which reload path is buggy.
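The decision logic fits in a few lines. This sketch uses simplified stand-in types (the real manager.Get returns a plugin handle, not a bool, and spawning is elided); the shape of the check is what matters:

```go
package main

import "fmt"

// Manager stands in for the plugin manager's registry: the ground truth
// for what is actually running. The real Get returns a plugin handle.
type Manager struct{ running map[string]bool }

func (m *Manager) Get(name string) bool { return m.running[name] }

// discovered mirrors the cache entry whose Loaded flag used to make the
// load decision on its own.
type discovered struct{ Loaded bool }

// ensureLoaded consults the registry before spawning. The cached flag is
// only a fast-path hint; when it disagrees with reality it is repaired,
// never trusted — closing the duplicate-spawn window.
func ensureLoaded(m *Manager, cache map[string]*discovered, name string) (spawned bool) {
	d := cache[name]
	if d == nil {
		d = &discovered{}
		cache[name] = d
	}
	if d.Loaded && m.Get(name) {
		return false // fast path: hint and registry agree
	}
	if m.Get(name) {
		// Desync: the flag says "not loaded" but a process is running.
		fmt.Printf("warn: discovery cache desync for plugin %q\n", name)
		d.Loaded = true // repair the hint instead of spawning a duplicate
		return false
	}
	// Genuinely not running: spawn (elided here) and record it.
	m.running[name] = true
	d.Loaded = true
	return true
}

func main() {
	m := &Manager{running: map[string]bool{"peers": true}}
	cache := map[string]*discovered{"peers": {Loaded: false}} // stale flag
	fmt.Println(ensureLoaded(m, cache, "peers")) // no duplicate spawn
}
```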
Bounded shutdown. GRPCPlugin.Shutdown(ctx) used to ignore its context and block on the Shutdown RPC indefinitely. Manager.ShutdownAll called it serially with context.Background(). A single hung plugin wedged every subsequent plugin’s shutdown turn and prevented process exit. Now the Shutdown RPC runs in a goroutine with a select { case <-ch: ... case <-ctx.Done(): ... } guard, each plugin gets its declared ResourcePolicy.ShutdownTimeout (10s default) as a deadline, and cmd/goats wraps the whole pass in a 30s overall ceiling. client.Kill() always runs afterwards as a supervised teardown — the Kill is a safety net, not conditional.
Periodic health probes. A background goroutine probes every loaded plugin every 60s via a reserved function name on the existing Call path. The plugin doesn’t need to implement a handler — any response within 5s (including the “unknown function” error most plugins return for the reserved name) means the gRPC channel is alive. Only a context-deadline-exceeded counts as a failure. Three consecutive failures flip HealthStatus.Healthy to false with a warn log; a subsequent success flips it back. State is exposed via HealthStatus(name) and AllHealthStatuses() so admin UIs can render per-plugin health.
Deliberately not included: auto-restart. The interaction between automated restarts, hot-reload-on-binary-change, and crash-loop backoff needs a design pass rather than a late-release addition. It’s queued for 0.8.3.
The Benefits
A plugin holding long-lived network state can now survive indefinite dashboard traffic without losing it, because the ground-truth check closes the duplicate-spawn window that was silently killing it. Process exit is bounded whether plugins cooperate or not, so operational churn (rolling deploys, Helm upgrades, container orchestrator resets) doesn’t get stuck on a misbehaving plugin. And zombie plugins — processes that are alive but not responding — are now detectable within ~3 minutes rather than “whenever someone notices something’s off,” which for in-memory plugin state is often never.
The health-check contract deserves special note. The choice to treat “unknown function” errors as healthy was deliberate. The failure we care about is “the plugin can’t respond at all” — a wedged event loop, a crashed goroutine leaking file descriptors, a stalled TLS session. If the plugin’s RPC stack can return any response in 5 seconds, the underlying process is alive and responsive. Rich health payloads (DB connectivity checks, external dependency pings, etc.) are a natural extension for 0.8.3, but the base signal of “can the Call path round-trip” covers most zombie detection without requiring plugin authors to implement anything new.
Go 1.25 and the Dependency Audit
The Problem
Dependabot flagged two vulnerabilities on the default branch: a high-severity panic in go-jose/go-jose/v3 JWE decryption (fixed in v3.0.5), and a medium-severity out-of-memory in golang.org/x/image’s TIFF decoder (fixed in v0.38.0). The go-jose upgrade was a clean one-line bump. The x/image upgrade wouldn’t apply — v0.38.0 requires Go 1.25, and the project was still on 1.24.
The Solution
The Go floor moved to 1.25. The change touched more files than it would have in a cleaner project: every Go-using Dockerfile (backend, toolbox, tests, goatkit, route-tools, config-manager, and a Playwright runner that installs Go via curl), the go.mod directive, the SDK toolchain declaration, the Makefile’s GO_IMAGE default, the .env.development and .env.example files (the Makefile reads GO_IMAGE from .env as its single source of truth, so the env templates had to move too), three helper scripts with their own fallback images, the GitHub Actions setup-go version, and the README badge. Nine places in total before anything compiles.
The toolbox dev-tool pins needed a second round of bumps because their older versions transitively depended on golang.org/x/tools@v0.25.x, which has constant-arithmetic source that fails to compile under Go 1.25 (invalid array length -delta * delta in tokeninternal.go). goimports, gosec, staticcheck, and golangci-lint all moved to versions published after Go 1.25’s release. golangci-lint v2 changed its Go import path to /v2/cmd/..., so the Dockerfile RUN line needed adjusting too.
With the floor raised, the govulncheck version pin (added to an earlier toolbox change as a 1.24 workaround, then documented as “revisit when the base image bumps”) was reverted to latest. Every indirect dep that go mod tidy had been holding back because of 1.24 constraints got a natural refresh — x/sys, x/text, x/tools.
The Benefits
Two Dependabot alerts cleared. The project is on the current Go release rather than trailing behind. The workaround-pin on govulncheck — which had been a source of subtle CI drift because latest and v1.1.4 have different vulnerability databases — is gone. And the toolchain now supports the language features that newer ecosystem libraries are starting to assume, so future dependency upgrades are less likely to hit compile walls.
One follow-up remains: the WASM-builder stage in the main Dockerfile uses tinygo/tinygo:0.32.0, which only supports Go source up to roughly 1.22. If a WASM plugin declares go 1.25 in its own go.mod, that stage will fail. TinyGo 0.34+ is needed for current Go source support, and the bump is scheduled for 0.8.3 alongside the WASM plugin rebuild.
Lessons
The EnsureLoaded fix is the most instructive piece of work in this release, and not because the bug was hard. It was a nine-line change once diagnosed. What made it instructive was how well it hid. A cache flag that desyncs is a classic failure mode — caches always desync eventually; that’s the point of calling them caches — but it hid for weeks because the symptom was “peers disappear slowly” rather than a crash or an error log. Silent state loss is the worst observability class because nothing alarms on it. The health-check additions exist partly because finding this bug took too long, and a “plugin zombie detected” log would have pointed at the area in a fraction of the time.
The MCP rewrite is a reminder that API bridges beat parallel implementations. The question “how do we expose this REST endpoint via MCP?” had fourteen hand-written answers before this release; there is now one, and it’s “don’t, just run the Gin handler.” The same pattern works for GraphQL, gRPC gateways, and any other “make this API shape available over that protocol” problem. Any time you find yourself reimplementing authorisation or validation for the B-protocol version of an A-protocol endpoint, stop. Bridge instead.
The Go 1.25 bump is the smallest story technically and the largest operationally. A toolchain upgrade touches every build artefact, every developer’s environment, every CI runner. The single line in go.mod is the tip of an iceberg. Writing the migration commit’s body out longhand — listing every file that moved — took longer than the actual edits, and it was worth every minute. Future-you (or your replacement) needs to know which nine places they’ve already checked when the next bump comes around.