State of MCP Reliability

I pointed one CLI at the official MCP reference servers and gave each a grade. Not a security audit — this asks the two questions nobody scores: do the tools actually work, and can an agent use them?

4 servers graded · 2 earned an A · 1 C · 1 F · 5 real agent tool-confusions surfaced

Four servers, one A–F grade each

Every grade is the weighted blend of the layers that ran: L1 static (schema quality), L2 behavioral (does it run without crashing), L3 agent-usability (can an LLM pick the right tool). Read-only, run locally.

memory

9 tools · knowledge-graph store

94 / 100

L1 static: 89

L2 behavioral: 100

L3 agent-usability: 89

Top confusion

an agent reached for read_graph when asked to delete_relations

everything

13 tools · reference test server

92 / 100

L1 static: 87

L2 behavioral: 96

L3 agent-usability: 92

Top confusion

a resource fetch got mistaken for echo

filesystem

14 tools · local file access

71 / 100

L1 static: 79

L2 behavioral: 64

L3 agent-usability: 75

Top confusion · the headline finding

an agent asked for read_file, a tool that doesn't exist; the real one is read_text_file. Three of the five confusions surfaced came from filesystem tools.

sequential-thinking

1 tool · structured reasoning

57 / 100

L1 static: 100

L2 behavioral: 0 · not exercised

L3 agent-usability: 100

Why the F — honestly

Its single tool isn't safe to call read-only, so L2 couldn't exercise it and scored 0, dragging the blend to an F. Not a broken server — a conservative grader that won't vouch for what it can't verify.
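
The arithmetic behind the blend can be checked by hand. The sketch below is an assumption-heavy illustration, not mcp-vitals' own code: the layer scores are the ones published in the cards above, but the weights (0.30 / 0.43 / 0.27) were back-fitted so that all four published totals come out right, and the letter cutoffs are the conventional 90/80/70/60 bands. The tool's real weights and cutoffs may differ.

```python
# Layer scores as published in the cards above:
# (L1 static, L2 behavioral, L3 agent-usability).
SCORES = {
    "memory": (89, 100, 89),
    "everything": (87, 96, 92),
    "filesystem": (79, 64, 75),
    "sequential-thinking": (100, 0, 100),
}

# ASSUMPTION: back-fitted to the four published totals, not read from mcp-vitals.
WEIGHTS = (0.30, 0.43, 0.27)

# ASSUMPTION: conventional letter bands; the tool's cutoffs are not stated here.
CUTOFFS = ((90, "A"), (80, "B"), (70, "C"), (60, "D"))


def blend(layer_scores, weights=WEIGHTS):
    """Weighted average of the layer scores, normalized by the total weight."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(layer_scores, weights)) / total_weight


def letter(score):
    """Map a 0-100 score to a letter using the assumed cutoffs."""
    for cutoff, grade in CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"


for name, layers in SCORES.items():
    overall = round(blend(layers))
    print(f"{name}: {overall} / 100 ({letter(overall)})")
```

Under these assumed weights, a 0 on L2 alone removes 43 points, which is why a server with perfect L1 and L3 scores still lands at 57.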

Where the model reached for the wrong tool

L3 hands an LLM the tool catalog and a realistic task, then checks which tool it picks — without ever calling the server. A wrong pick is a description-clarity signal: the schema is technically valid but reads ambiguously to an agent.

filesystem · read_file → model picked → read_text_file · the finding

filesystem · read_multiple_files → search_files

filesystem · edit_file → read_text_file

memory · delete_relations → read_graph

everything · gzip resource → echo
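
The L3 check described above can be sketched in a few lines. This is a minimal illustration, not the harness itself: `score_selection`, the catalog descriptions, the task wording, and `canned_picker` are all hypothetical. The picker is a stand-in for the LLM call and simply replays the memory confusion from the table, so the example runs without a model or a server.

```python
def score_selection(catalog, tasks, pick):
    """Run each task through `pick` and record wrong tool choices.

    catalog: list of {"name": ..., "description": ...} dicts
    tasks:   list of (task_text, expected_tool_name) pairs
    pick:    callable(catalog, task_text) -> tool name (the LLM stand-in)
    """
    known = {tool["name"] for tool in catalog}
    confusions = []
    for task, expected in tasks:
        picked = pick(catalog, task)
        if picked != expected:
            confusions.append({
                "expected": expected,
                "picked": picked,
                # False means the model named a tool that is not in the catalog.
                "exists": picked in known,
            })
    accuracy = (len(tasks) - len(confusions)) / len(tasks)
    return accuracy, confusions


# Hypothetical descriptions; only the tool names come from the table above.
CATALOG = [
    {"name": "read_graph", "description": "Read the entire knowledge graph."},
    {"name": "delete_relations", "description": "Delete relations from the graph."},
]

# Hypothetical tasks, one per tool.
TASKS = [
    ("Show me everything stored in the graph.", "read_graph"),
    ("Remove the link between Alice and Bob.", "delete_relations"),
]


def canned_picker(catalog, task):
    # Stand-in for an LLM: always answers read_graph, replaying the confusion.
    return "read_graph"


accuracy, confusions = score_selection(CATALOG, TASKS, canned_picker)
print(accuracy)
print(confusions)
```

Tracking `exists` separately keeps two failure modes apart: picking a real but wrong tool, and naming a tool the catalog never offered.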

Five layers. Three ran in this table.

mcp-vitals grades a server the way you'd grade a dependency you're about to trust — from static hygiene up to how it behaves under an agent and an attacker.

Static

Schema quality: descriptions, typed & documented params, naming, required fields.

✓ ran

Behavioral

Generates inputs from the schema, calls read-only tools, and watches for crashes versus graceful errors, plus latency.

✓ ran

Agent-usability

An LLM picks a tool for a task. Measures selection accuracy & argument validity. Never calls the server.

✓ ran

Adversarial

Scans descriptions & prompts for injection, concealment and over-permission patterns.

in CLI

Ops

Transport security, type coverage, graceful-error rate under load.

in CLI
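
To make the Static layer concrete, here is a rough sketch of the kind of schema checks it describes. It assumes tool definitions carry the `name`, `description`, and `inputSchema` fields used by MCP tool listings; `lint_tool`, the issue wording, and the example tool are hypothetical, and the real layer may check more (naming, for instance) and weigh issues differently.

```python
def lint_tool(tool):
    """Return a list of human-readable schema-quality issues for one tool."""
    issues = []
    if not (tool.get("description") or "").strip():
        issues.append("tool has no description")

    schema = tool.get("inputSchema") or {}
    props = schema.get("properties") or {}
    for name, spec in props.items():
        if "type" not in spec:
            issues.append(f"param '{name}' has no type")
        if not spec.get("description"):
            issues.append(f"param '{name}' has no description")

    # A required field that is never defined is a schema bug, not a style nit.
    for name in schema.get("required") or []:
        if name not in props:
            issues.append(f"required param '{name}' is not defined")
    return issues


# Hypothetical tool definition with deliberate gaps.
EXAMPLE_TOOL = {
    "name": "example_tool",
    "description": "",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "head": {},
        },
        "required": ["path", "tail"],
    },
}

for issue in lint_tool(EXAMPLE_TOOL):
    print(issue)
```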

Honest limitations

L2 doesn't know what a meaningful argument is — it builds inputs from JSON Schema alone. A tool needing a specific well-formed payload (sequential-thinking's single tool) scores low even though the server is fine. Read that as "needs tool-aware test cases," not "broken."
L3 tasks are auto-generated, one per tool — a starting signal, not a human-labeled benchmark. The confusions are real; a curated calibration set (next milestone) would make the accuracy numbers authoritative.
This is a starter table. Official reference servers, read-only, run locally. The harness grades any list of targets — the point is the method, reproducible on your own server.
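
The first limitation above is easier to see with a toy generator. This sketch builds arguments from JSON Schema alone, roughly the way the L2 description suggests; `sample_value`, `sample_args`, the placeholder values, and the example schema are all hypothetical and are not taken from mcp-vitals or from any real server.

```python
def sample_value(spec):
    """Pick a placeholder value that satisfies the declared type, nothing more."""
    if "enum" in spec:
        return spec["enum"][0]
    if "default" in spec:
        return spec["default"]
    kind = spec.get("type")
    if kind == "string":
        return "test"
    if kind == "integer":
        return spec.get("minimum", 0)
    if kind == "number":
        return float(spec.get("minimum", 0))
    if kind == "boolean":
        return False
    if kind == "array":
        return [sample_value(spec.get("items", {}))]
    if kind == "object":
        return sample_args(spec)
    return None  # unknown or missing type


def sample_args(schema):
    """Build an argument dict covering only the required properties."""
    props = schema.get("properties", {})
    return {
        name: sample_value(props.get(name, {}))
        for name in schema.get("required", [])
    }


# Hypothetical schema, loosely shaped like a single-tool reasoning server.
SCHEMA = {
    "type": "object",
    "properties": {
        "thought": {"type": "string"},
        "thoughtNumber": {"type": "integer", "minimum": 1},
        "nextThoughtNeeded": {"type": "boolean"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["thought", "thoughtNumber", "nextThoughtNeeded"],
}

print(sample_args(SCHEMA))
```

The generated payload validates against the schema, yet "test" is not a meaningful thought. That gap between schema-valid and meaningful is what tool-aware test cases would close.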

Grade your own MCP server

# install from the public repo — any stdio or http server
pipx install git+https://github.com/enached134-ctrl/mcp-vitals

mcpvitals grade "npx -y @modelcontextprotocol/server-memory" \
  --behavioral --agent --min-grade B

# → report.html + score.json, exits non-zero below the gate

Or gate it in CI

Drop the composite Action into a workflow and fail the build when a server regresses below a grade:

- uses: enached134-ctrl/mcp-vitals@v1
  with:
    target: "npx -y your-server"
    min-grade: B