mcp-vitals
Live report · graded locally · v0.2

The state of MCP reliability.

I pointed one CLI at the official MCP reference servers and gave each a grade. Not a security audit — this asks the two questions nobody scores: do the tools actually work, and can an agent use them?

4 servers graded · 2 earned an A · 1 C · 1 F · 5 real agent tool-confusions surfaced
The league table

Four servers, one A–F grade each

Every grade is the weighted blend of the layers that ran: L1 static (schema quality), L2 behavioral (does it run without crashing), L3 agent-usability (can an LLM pick the right tool). Read-only, run locally.
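As a rough illustration of "weighted blend of the layers that ran", here is a minimal sketch. The weights and the letter cutoffs are assumptions (the report does not publish them); the 90/80/70/60 bands are merely consistent with the grades in the table below. The names `blend`, `letter`, `ASSUMED_WEIGHTS` and `ASSUMED_CUTOFFS` are hypothetical and not part of mcp-vitals.

```python
from typing import Dict

# ASSUMPTION: equal weights. The real mcp-vitals weights are not published here.
ASSUMED_WEIGHTS: Dict[str, float] = {"L1": 1.0, "L2": 1.0, "L3": 1.0, "L4": 1.0, "L5": 1.0}

# ASSUMPTION: conventional letter bands, highest cutoff first.
ASSUMED_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def blend(layer_scores: Dict[str, float], weights: Dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted average over only the layers that appear in layer_scores."""
    if not layer_scores:
        raise ValueError("at least one layer must have run")
    total_weight = sum(weights[layer] for layer in layer_scores)
    weighted_sum = sum(score * weights[layer] for layer, score in layer_scores.items())
    return weighted_sum / total_weight


def letter(score: float) -> str:
    """Map a 0-100 score to a letter using the assumed cutoffs."""
    for cutoff, grade in ASSUMED_CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"


if __name__ == "__main__":
    score = blend({"L1": 80, "L2": 90, "L3": 70})  # illustrative scores, not from the table
    print(score, letter(score))
```

Because layers that did not run are simply absent from the input, their weights drop out of the denominator. A layer that ran but scored 0 still counts in full, which is how a single unexercised layer can pull a blend down sharply.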

memory
9 tools · knowledge-graph store
A
94 / 100
L1 static: 89
L2 behavioral: 100
L3 agent-usability: 89
Top confusion
an agent reached for read_graph when asked to delete_relations
everything
13 tools · reference test server
A
92 / 100
L1 static: 87
L2 behavioral: 96
L3 agent-usability: 92
Top confusion
a resource fetch got mistaken for echo
filesystem
14 tools · local file access
C
71 / 100
L1 static: 79
L2 behavioral: 64
L3 agent-usability: 75
Top confusion · the headline finding
an agent asked for read_file — a tool that doesn't exist. The real one is read_text_file. Three tools tripped the model this way.
sequential-thinking
1 tool · structured reasoning
F
57 / 100
L1 static: 100
L2 behavioral: 0 · not exercised
L3 agent-usability: 100
Why the F — honestly
Its single tool isn't safe to call read-only, so L2 couldn't exercise it and scored 0, dragging the blend to an F. Not a broken server — a conservative grader that won't vouch for what it can't verify.
Agent-usability · L3

Where the model reached for the wrong tool

L3 hands an LLM the tool catalog and a realistic task, then checks which tool it picks — without ever calling the server. A wrong pick is a description-clarity signal: the schema is technically valid but reads ambiguously to an agent.
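The check described above can be sketched as a small scoring loop. This is an assumption-laden outline, not the harness's actual code: `Task`, `score_selection` and `stub_picker` are hypothetical names, and the stub stands in for the real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    prompt: str         # the realistic task shown to the model
    expected_tool: str  # the tool the task was generated for


def score_selection(
    tasks: List[Task],
    tool_names: List[str],
    pick_tool: Callable[[str, List[str]], str],
) -> Tuple[float, List[Tuple[str, str]]]:
    """Return (accuracy, confusions); each confusion is an (expected, picked) pair."""
    confusions: List[Tuple[str, str]] = []
    for task in tasks:
        picked = pick_tool(task.prompt, tool_names)
        if picked != task.expected_tool:
            confusions.append((task.expected_tool, picked))
    accuracy = (len(tasks) - len(confusions)) / len(tasks) if tasks else 0.0
    return accuracy, confusions


def stub_picker(prompt: str, tool_names: List[str]) -> str:
    """Stand-in for the LLM: always picks the first tool in the catalog."""
    return tool_names[0]


if __name__ == "__main__":
    catalog = ["read_graph", "delete_relations"]
    tasks = [
        Task("Show me the whole graph", "read_graph"),
        Task("Remove the link between two entities", "delete_relations"),
    ]
    print(score_selection(tasks, catalog, stub_picker))
```

Nothing in this loop touches the server: the only inputs are the catalog and the task text, so a wrong pick reflects how the names and descriptions read, not how the tools behave.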

filesystem
read_file → model picked → read_text_file · the finding
filesystem
read_multiple_files → search_files
filesystem
edit_file → read_text_file
memory
delete_relations → read_graph
everything
gzip resource → echo
How the grade is computed

Five layers. Three ran in this table.

mcp-vitals grades a server the way you'd grade a dependency you're about to trust — from static hygiene up to how it behaves under an agent and an attacker.

L1

Static

Schema quality: descriptions, typed & documented params, naming, required fields.

✓ ran
L2

Behavioral

Generates inputs from the schema, calls read-only tools, and records whether each call crashes or fails gracefully, plus its latency.

✓ ran
L3

Agent-usability

An LLM picks a tool for a task. Measures selection accuracy & argument validity. Never calls the server.

✓ ran
L4

Adversarial

Scans descriptions & prompts for injection, concealment and over-permission patterns.

in CLI
L5

Ops

Transport security, type coverage, graceful-error rate under load.

in CLI
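To make the L1 layer concrete, here is a minimal sketch of a static lint over one tool definition. It assumes the standard MCP tool shape (`name`, `description`, `inputSchema` as JSON Schema). The specific rules and the `lint_tool` name are hypothetical and are not mcp-vitals' actual checks.

```python
from typing import Any, Dict, List


def lint_tool(tool: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for one tool definition."""
    findings: List[str] = []
    if not (tool.get("description") or "").strip():
        findings.append("tool has no description")
    schema = tool.get("inputSchema") or {}
    properties = schema.get("properties") or {}
    for name, spec in properties.items():
        if "type" not in spec:
            findings.append(f"param '{name}' is untyped")
        if not (spec.get("description") or "").strip():
            findings.append(f"param '{name}' is undocumented")
    # Every required field should also be declared under properties.
    for name in schema.get("required") or []:
        if name not in properties:
            findings.append(f"required field '{name}' is not declared in properties")
    return findings


if __name__ == "__main__":
    # Hypothetical tool definition, not taken from any reference server.
    example_tool = {
        "name": "example_read",
        "description": "Read a file as text.",
        "inputSchema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path", "encoding"],
        },
    }
    for finding in lint_tool(example_tool):
        print(finding)
```

A real static layer would presumably turn findings like these into a 0 to 100 score; how mcp-vitals does that conversion is not described here.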
Honest limitations
  • L2 doesn't know what a meaningful argument is — it builds inputs from JSON Schema alone. A tool needing a specific well-formed payload (sequential-thinking's single tool) scores low even though the server is fine. Read that as "needs tool-aware test cases," not "broken."
  • L3 tasks are auto-generated, one per tool — a starting signal, not a human-labeled benchmark. The confusions are real; a curated calibration set (next milestone) would make the accuracy numbers authoritative.
  • This is a starter table. Official reference servers, read-only, run locally. The harness grades any list of targets — the point is the method, reproducible on your own server.
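The first limitation is easier to see with a sketch of schema-only input generation. This is an assumed, simplified generator covering a small JSON Schema subset. `sample_from_schema` and the example field names are hypothetical, and the real L2 generator may work differently.

```python
from typing import Any, Dict


def sample_from_schema(schema: Dict[str, Any]) -> Any:
    """Build one schema-shaped value; only a small JSON Schema subset is handled."""
    if "enum" in schema:
        return schema["enum"][0]
    kind = schema.get("type")
    if kind == "object":
        props = schema.get("properties", {})
        required = schema.get("required", [])
        # Fill required fields only; optional ones are left out.
        return {name: sample_from_schema(props.get(name, {})) for name in required}
    if kind == "array":
        return []
    if kind == "string":
        return "example"
    if kind == "integer":
        return 0
    if kind == "number":
        return 0.0
    if kind == "boolean":
        return False
    return None  # unknown or missing type


if __name__ == "__main__":
    # Hypothetical schema for a tool that expects a meaningful, ordered payload.
    schema = {
        "type": "object",
        "properties": {
            "step_text": {"type": "string"},
            "step_number": {"type": "integer"},
        },
        "required": ["step_text", "step_number"],
    }
    print(sample_from_schema(schema))
```

The generated payload has the right types but carries no meaning: a placeholder string and a step number of 0 may be rejected by a tool that expects a coherent sequence. That is the gap tool-aware test cases would close.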
Reproduce it

Grade your own MCP server

# install from the public repo — any stdio or http server
pipx install git+https://github.com/enached134-ctrl/mcp-vitals

mcpvitals grade "npx -y @modelcontextprotocol/server-memory" \
  --behavioral --agent --min-grade B

# → report.html + score.json, exits non-zero below the gate

Or gate it in CI

Drop the composite Action into a workflow and fail the build when a server regresses below a grade:

- uses: enached134-ctrl/mcp-vitals@v1
  with:
    target: "npx -y your-server"
    min-grade: B
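The CLI flag and the Action input express the same gate: pass when the grade is at or above the minimum, fail otherwise. A minimal sketch of that comparison follows, assuming a five-letter A/B/C/D/F scale; `GRADE_ORDER` and `gate_exit_code` are hypothetical names, not the tool's internals.

```python
# ASSUMPTION: grades are ordered F < D < C < B < A, with no E.
GRADE_ORDER = "FDCBA"


def gate_exit_code(grade: str, min_grade: str) -> int:
    """Return 0 when grade meets the gate, 1 when it falls below it."""
    if grade not in GRADE_ORDER or min_grade not in GRADE_ORDER:
        raise ValueError("grades must be one of A, B, C, D, F")
    return 0 if GRADE_ORDER.index(grade) >= GRADE_ORDER.index(min_grade) else 1


if __name__ == "__main__":
    for grade in "ABCDF":
        print(grade, gate_exit_code(grade, "B"))
```

With a minimum of B, only A and B pass. A CI step that propagates a non-zero exit code will then fail the build for C, D or F.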