We scanned 15,000 MCP servers. Here's where our scanner breaks down.
We run SpiderRating, an open-source security scanner for MCP (Model Context Protocol) servers. Over the past month, we scanned 15,674 MCP servers and skills using static analysis (46 rules + YARA supply chain patterns) and built a calibration database of 10,970 manually verified findings.
Here's what surprised us -- and where we got it wrong.
Our scanner is least accurate on security tools
We grouped MCP servers by purpose and compared our scanner's accuracy. Security-themed servers -- scanners, firewalls, pentest tools, CTF challenges -- had a 2.05x higher false positive rate than the rest of the ecosystem.
| Group | FP rate | Real vulns per repo | Sample |
|---|---|---|---|
| Security-themed MCP servers | 55.5% | 1.6 | 226 FP / 181 TP across 110 repos |
| Everything else | 27.0% | 1.8 | 2,855 FP / 7,708 TP across 4,213 repos |
To be clear: security tools don't have more real vulnerabilities. They actually have slightly fewer (1.6 vs 1.8 true positives per repo). The problem is that our scanner produces far more noise on them.
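The 2.05x figure falls straight out of the raw counts in the table. A quick check:

```python
# Recompute the false positive rates in the table above from raw counts.
def fp_rate(fp: int, tp: int) -> float:
    """Fraction of all verified findings that were false positives."""
    return fp / (fp + tp)

security = fp_rate(226, 181)          # security-themed MCP servers
everything_else = fp_rate(2855, 7708)  # rest of the ecosystem

print(f"security-themed: {security:.1%}")           # 55.5%
print(f"everything else: {everything_else:.1%}")    # 27.0%
print(f"ratio: {security / everything_else:.2f}x")  # 2.05x
```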
Why? Security tools legitimately contain attack patterns in their source code. A scanner's detection rules look identical to the patterns it's detecting. An MCP firewall's test suite contains the exact exploit payloads it's supposed to block. Our regex-based scanner can't distinguish "code that detects SQL injection" from "code that is vulnerable to SQL injection."
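To make the failure mode concrete, here is a sketch (the rule name and regex are illustrative, not our actual rule set): a naive SQL injection regex fires on both a vulnerable query and the signature list of a tool built to detect that exact vulnerability.

```python
import re

# Illustrative rule, NOT an actual SpiderRating rule: flag string
# formatting that splices a value into a SQL statement.
SQLI_RULE = re.compile(r"(SELECT|INSERT|UPDATE|DELETE)\b.*%s", re.IGNORECASE)

vulnerable = 'cursor.execute("SELECT * FROM users WHERE name = \'%s\'" % name)'
detector   = 'SQLI_SIGNATURES = ["SELECT * FROM users WHERE name = \'%s\'"]'

# Both lines match: a regex scanner cannot tell code that is
# vulnerable to SQL injection from code that detects it.
print(bool(SQLI_RULE.search(vulnerable)))  # True
print(bool(SQLI_RULE.search(detector)))    # True
```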
The extreme case: cisco-ai-defense/mcp-scanner triggered 145 false positives against only 1 true positive (99.3% FP rate). Its code is full of injection patterns -- because detecting injection patterns is what it does.
This is a fundamental limitation of static analysis, not a problem with security tools. It forced us to build a "by-design" detection system: when the scanner recognizes that a repo is a security tool, it downgrades confidence on findings that match the tool's core function rather than flagging them as vulnerabilities.
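A minimal sketch of that idea (the hint list, field names, and logic here are illustrative, not our production implementation): classify the repo from its own metadata, then downgrade findings that overlap the tool's stated purpose instead of reporting them at full confidence.

```python
# Illustrative keywords; the real classifier uses more signals.
SECURITY_HINTS = {"scanner", "firewall", "pentest", "ctf", "waf", "exploit"}

def is_security_tool(description: str, topics: list[str]) -> bool:
    """Heuristic: does the repo describe itself as a security tool?"""
    words = description.lower().split() + [t.lower() for t in topics]
    return bool(SECURITY_HINTS & set(words))

def adjust(finding: dict, security_tool: bool) -> dict:
    """Downgrade findings that match a security tool's core function."""
    if security_tool and finding.get("matches_tool_purpose"):
        return {**finding, "confidence": "low", "by_design": True}
    return finding

f = {"rule": "sql_injection", "confidence": "high", "matches_tool_purpose": True}
print(adjust(f, is_security_tool("An MCP firewall for LLM agents", ["security"])))
```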
The numbers
From 15,674 scanned servers:
- 11.8% earned a RECOMMENDED verdict (safe to use without caveats)
- 49.0% were CONSIDER (usable, but with risks)
- 25.5% were ALLOW_WITH_RISK
- 13.6% were NOT_RECOMMENDED
- Average security score: 5.28/10 (median 5.71)
What we found (calibrated true positive rates)
Not all scanner rules are equally reliable. We tracked every finding against manual verification to get real accuracy numbers:
| Vulnerability | TP rate | Verified sample |
|---|---|---|
| Path traversal | 76.1% | 67 findings |
| SSRF | 66.2% | 859 findings |
| SQL injection | 66.0% | 1,024 findings |
| Child process injection | 50.3% | 1,107 findings |
| Prototype pollution | 50.2% | 214 + 216 findings |
| Dangerous eval | 44.5% | 654 findings |
| Timing attack | 36.8% | 76 findings |
| Command injection | 32.1% | 327 findings |
| Hardcoded credential | 2.7% | 263 findings |
| Prompt injection | 1.2% | 168 findings |
| Data exfiltration | 0.5% | 217 findings |
The bottom three are basically broken. Most "hardcoded credential" findings are mock data in tutorials. Most "data exfiltration" hits are documentation files. We've since added confidence markers to every risk flag so downstream consumers know which findings to trust.
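Those confidence markers are derived directly from the calibrated rates. A sketch of the mapping (the tier cutoffs here are illustrative, not our exact thresholds):

```python
def tp_rate(tp: int, total: int) -> float:
    """Calibrated true-positive rate for a rule."""
    return tp / total

def confidence_tier(rate: float) -> str:
    """Map a rule's calibrated TP rate to a confidence marker.
    Cutoffs (0.60 / 0.30) are illustrative, not SpiderRating's exact values."""
    if rate >= 0.60:
        return "high"
    if rate >= 0.30:
        return "medium"
    return "low"

print(confidence_tier(0.761))  # path traversal       -> high
print(confidence_tier(0.321))  # command injection    -> medium
print(confidence_tier(0.027))  # hardcoded credential -> low
```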
We fixed what we found
We didn't just scan -- we submitted fix PRs to upstream projects. Six have been merged so far:
- [upstash/context7#2235](https://github.com/upstash/context7/pull/2235) -- path traversal (critical) + command injection (high). Context7 has 7,700+ stars.
- [topoteretes/cognee#2423](https://github.com/topoteretes/cognee/pull/2423) -- command injection in API handler.
- [agentic-community/mcp-gateway-registry#655](https://github.com/agentic-community/mcp-gateway-registry/pull/655) -- command injection (critical) in the gateway registry.
- Plus 3 more across moeru-ai/airi, Flux159/mcp-server-kubernetes, and others.
Our overall merge rate is 35.3% (6/17). We treat each merged PR as ground truth validation that our scanner found a real issue, and each rejection as calibration data for reducing false positives.
What we got wrong
We're being transparent about our limitations:
- Three scanner rules are essentially broken (hardcoded_credential at 2.7% TP, data_exfiltration at 0.5%, prompt_injection at 1.2%). We've marked these as low-confidence rather than removing them, so the data stays complete but flagged.
- "Score" doesn't mean "safe." A high score means our scanner didn't find issues, not that there aren't any. We only do static analysis -- runtime behavior, supply chain depth, and authentication bypass are blind spots.
- Category bias exists. Database MCP servers get more SQL injection flags, shell tools get more command injection flags. This is expected but inflates findings in certain categories.
- Our description quality scoring (mean 2.61/10) may be too harsh. The 7-dimension rubric penalizes terse descriptions that are technically correct but don't meet our "AI-agent-friendly" standard. This is a calibration issue we're still tuning.
The data
Everything is public:
- Ratings: spiderrating.com -- every server has a detail page with score breakdown, risk flags, and decision verdict
- Scanner: github.com/teehooai/spidershield (MIT license)
- Statistics: spiderrating.com/statistics -- 50+ aggregate stats from the scan data
- Decision API: `GET /api/v1/decide/mcp-tool?slug={owner}/{repo}` -- structured JSON verdict for any rated server
- Comparisons: spiderrating.com/compare -- side-by-side security comparison of any two servers
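A sketch of calling the decision endpoint. The path is from the list above; the base URL is assumed from the site, and the response fields (`verdict`, `score`) are assumptions about the JSON shape, not documented guarantees -- the stubbed payload lets the sketch run offline.

```python
import json
from urllib.parse import quote

# Base URL assumed from spiderrating.com; only the path is documented above.
API = "https://spiderrating.com/api/v1/decide/mcp-tool"

def decide_url(owner: str, repo: str) -> str:
    """Build the decision endpoint URL for a server's {owner}/{repo} slug."""
    return f"{API}?slug={quote(f'{owner}/{repo}', safe='/')}"

def summarize(payload: str) -> str:
    """Pull the verdict out of a decision response.
    Field names ('verdict', 'score') are assumed, not documented."""
    data = json.loads(payload)
    return f"{data['verdict']} (score {data['score']}/10)"

# Stubbed response with the assumed shape:
sample = '{"verdict": "CONSIDER", "score": 5.7}'
print(decide_url("upstash", "context7"))
print(summarize(sample))  # CONSIDER (score 5.7/10)
```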
We're considering open-sourcing the full calibration dataset (10,970 verified TP/FP observations with per-category accuracy rates). If that would be useful to you, let us know.
---
*Built by a small team that got frustrated with installing MCP servers without knowing if they were safe. We scan daily and publish everything publicly. No signup required for any of the above.*