There are dozens of blog posts comparing AI agent frameworks. Most of them read the docs, list features in a table, and declare a winner. I wanted to do something different.
I've spent the past year building production LLM systems — RAG pipelines, multi-agent architectures, agentic document processors — across pharma, banking, and audit. In that time, I've used DSPy, LangGraph, and CrewAI extensively. Not in toy demos. In systems that run daily, process real documents, and get scrutinized by domain experts.
What I've learned is that choosing between these frameworks isn't about which has more GitHub stars or a nicer API. It's about how you think about the problem. Each framework embodies a fundamentally different philosophy.
These aren't competing on the same axis. They're different answers to the question: “What should a developer control, and what should the framework handle?”
What makes this comparison different
Instead of abstract feature tables, this series builds the same pipeline in all three frameworks and compares them across dimensions that matter in real projects:
MCP Integration
The Model Context Protocol is quickly gaining traction as a way to connect agents to external tools. I'll integrate each framework with existing MCP servers, including mcp-python-repl — a Python REPL server I built and maintain — to show how cleanly each framework adopts the protocol.
Agent Skills (agentskills.io)
Agent capabilities should be structured, reusable, and documented. I'll define each agent's skills using the open Agent Skills specification, the standard format maintained by Anthropic for giving agents new capabilities via SKILL.md files. That said, recent evaluations, notably Vercel's findings, show that agents don't always trigger skills reliably. An interesting counterpoint: DSPy's prompt optimizers (like GEPA, a reflective optimizer that uses textual feedback to evolve prompts) can address this directly by treating skill-trigger reliability as an optimizable metric. We'll test whether optimization closes the gap with passive context (like AGENTS.md), which currently comes out ahead in those evaluations.
LLM-as-a-Judge Evaluation
Every team building agents struggles with evaluation. I'll implement LLM-as-a-Judge in each framework — not just as a simple scorer, but as a full evaluation pipeline. In DSPy, the judge is a module you can optimize. In LangGraph, it's a node in your graph. In CrewAI, it's an agent with a reviewer role.
Along the way, I'll also compare dependency weight, lines of code for the same result, and how each framework handles the inevitable moment when output quality isn't good enough.
The use case: Company Research Agent
We need a pipeline complex enough to reveal meaningful differences but simple enough that the framework — not the business logic — stays in focus. Here's what we're building:
Four steps, each testing a different framework capability:
| Step | What it does | What it tests |
|---|---|---|
| Researcher | Searches the web for recent news, financials, key events | MCP tool integration, structured extraction |
| Writer | Transforms raw facts into a concise analyst-style summary | Core LLM interaction, structured output |
| Reviewer | Scores the summary on accuracy, completeness, conciseness | Evaluation, conditional branching |
| Feedback loop | If score < threshold, sends feedback back to the Writer | Control flow, retry logic, state management |
We'll use real companies — Apple, Tesla, Nvidia — so the outputs are verifiable and the pipeline faces real-world messiness.
The key insight is that this pipeline exercises everything we want to compare: tool usage (MCP), structured agent capabilities (Agent Skills), quality assessment (LLM-as-a-Judge), and the improvement loop that separates toy demos from production systems.
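As a rough, framework-agnostic sketch of the control flow in the table above (not the implementation from Part 2), the four steps reduce to a bounded retry loop. Everything named here is hypothetical and exists only for illustration: `run_pipeline`, the `Review` dataclass, the `threshold` and `max_attempts` defaults, and the stub functions standing in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    """Hypothetical reviewer verdict: an aggregate score in [0, 1] plus feedback."""
    score: float
    feedback: str


def run_pipeline(
    company: str,
    research: Callable[[str], str],
    write: Callable[[str, Optional[str]], str],
    review: Callable[[str, str], Review],
    threshold: float = 0.8,  # assumed value, not taken from the article
    max_attempts: int = 3,   # assumed cap so the loop always terminates
) -> tuple[str, Review, int]:
    """Researcher -> Writer -> Reviewer, looping back to the Writer on low scores."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    facts = research(company)
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        summary = write(facts, feedback)
        verdict = review(facts, summary)
        if verdict.score >= threshold:
            break
        # Below threshold: carry the reviewer's feedback into the next draft.
        feedback = verdict.feedback
    return summary, verdict, attempt


# Stub steps so the sketch runs without any LLM or network access.
def fake_research(company: str) -> str:
    return f"{company}: placeholder facts"


def fake_write(facts: str, feedback: Optional[str]) -> str:
    return facts if feedback is None else f"{facts} (revised: {feedback})"


def fake_review(facts: str, summary: str) -> Review:
    revised = "revised" in summary
    return Review(score=0.9 if revised else 0.5,
                  feedback="" if revised else "add risks")


summary, verdict, attempts = run_pipeline(
    "Nvidia", fake_research, fake_write, fake_review
)
print(attempts, verdict.score, summary)
```

With these stubs the first draft is rejected and the second passes. Each framework expresses this same loop differently, which is exactly what the series compares.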
The shared foundation
To keep the comparison fair, all three implementations share the same building blocks. We define them once in a common/ package — same tools, same models, same eval criteria. Only the framework changes.
Pydantic models
Identical data contracts across all frameworks:
from pydantic import BaseModel, Field


class CompanyFacts(BaseModel):
    """Structured output from the Researcher step."""
    company_name: str
    sector: str
    recent_news: list[str] = Field(
        description="3-5 recent news headlines with dates"
    )
    financial_highlights: list[str] = Field(
        description="Key financial metrics or events"
    )
    key_events: list[str] = Field(
        description="Notable recent events"
    )
    sources: list[str] = Field(
        description="URLs of sources used"
    )


class AnalystSummary(BaseModel):
    """Structured output from the Writer step."""
    summary_text: str = Field(description="200-word max analyst summary")
    key_risks: list[str]
    outlook: str = Field(
        description="One sentence: bullish, bearish, or neutral"
    )
    confidence_score: float = Field(ge=0, le=1)


# ── Pre-validation gate (no LLM needed) ──────────────────
def structural_check(summary: AnalystSummary) -> list[str]:
    """Cheap checks before calling the LLM judge.

    Returns a list of issues. Empty = pass.
    """
    issues = []
    word_count = len(summary.summary_text.split())
    if word_count > 200:
        issues.append(f"Summary too long: {word_count} words (max 200)")
    if not summary.key_risks:
        issues.append("Missing key_risks")
    if not summary.outlook:
        issues.append("Missing outlook")
    return issues


# ── LLM judge models (only what needs intelligence) ──────
class ClaimVerification(BaseModel):
    """One verified claim from the summary."""
    claim: str = Field(description="The factual claim made in the summary")
    source_url: str = Field(description="Source URL that should support this claim")
    supported: bool = Field(description="Does the source actually support the claim?")
    reasoning: str = Field(description="Why supported or not")


class ReviewResult(BaseModel):
    """What the LLM judge actually evaluates; no vague 0-10 scores."""
    # Accuracy: verified claims / total claims
    claim_verifications: list[ClaimVerification]
    accuracy_ratio: float = Field(
        ge=0, le=1,
        description="verified_claims / total_claims"
    )
    # Completeness: facets covered / expected facets
    expected_facets: list[str] = Field(
        description="Facets we expected (news, financials, risks, outlook, events)"
    )
    covered_facets: list[str] = Field(
        description="Facets actually present in the summary"
    )
    completeness_ratio: float = Field(
        ge=0, le=1,
        description="covered_facets / expected_facets"
    )
    # Conciseness (the only subjective score, constrained to 1-5 with a rubric)
    conciseness_rating: int = Field(
        ge=1, le=5,
        description="1=verbose, 5=tight"
    )
    # Decision
    feedback: str
    issues: list[str]
    approved: bool
Agent Skills
Each agent's capabilities are defined as an Agent Skill — not a flat config file, but a directory with scripts, references, and assets that the agent can navigate progressively. Here's the Researcher skill structure:
company-researcher/
├── SKILL.md # Main instructions + frontmatter
├── scripts/
│ ├── extract_financials.py # Parses financial data from raw HTML
│ └── validate_sources.py # Checks that cited URLs are accessible
├── references/
│ ├── output-schema.md # Full CompanyFacts model documentation
│ ├── search-strategies.md # Sector-specific search patterns
│ └── quality-checklist.md # Self-evaluation checklist before submission
└── assets/
  └── sector-taxonomy.json # Standard sector classification
And the SKILL.md itself references these files; the agent loads them on demand:
---
name: company-researcher
description: >
  Researches a public company using web search and extracts structured
  financial facts with source attribution. Use when given a company name
  and asked to produce a CompanyFacts report.
metadata:
  author: faunaris-ai
  version: "1.0"
allowed-tools: web_search execute_python
---
# Company Researcher
## When to use
Activate when the user provides a company name and needs structured
research output (news, financials, key events).
## Instructions
1. Classify the company sector using [sector-taxonomy.json](assets/sector-taxonomy.json)
2. Apply the appropriate search strategy from [search-strategies.md](references/search-strategies.md)
3. Search for recent news (past 30 days), financial highlights, and key events
4. Run [extract_financials.py](scripts/extract_financials.py) on raw financial pages
5. Validate all sources with [validate_sources.py](scripts/validate_sources.py)
6. Return a CompanyFacts object — see [output-schema.md](references/output-schema.md)
## Quality checklist
Before submitting, verify against [quality-checklist.md](references/quality-checklist.md):
- [ ] At least 3 recent news items with dates
- [ ] At least 2 financial highlights with specific numbers
- [ ] Every claim has a verified source URL
- [ ] Sector classification matches taxonomy
This isn't a toy example. The skill references scripts the agent must run, docs it must read, and a checklist it must verify. The real test in Part 3 will be: does each framework's agent actually navigate this structure, or does it ignore the references and wing it?
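Before asking whether an agent follows those references, it helps to confirm that they resolve at all. Here is a minimal preflight sketch using only the standard library. It is not part of the Agent Skills tooling; `split_frontmatter`, `referenced_files`, and `missing_references` are hypothetical helper names, and the link regex assumes plain `[text](relative/path)` Markdown links like the ones in the SKILL.md above.

```python
import re
import tempfile
from pathlib import Path

# Matches [text](target); checkbox items like "- [ ] ..." do not match
# because no "(" directly follows the closing bracket.
LINK_PATTERN = re.compile(r"\[[^\]]+\]\(([^)\s]+)\)")


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a SKILL.md into (frontmatter, body) on the leading '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text  # no closing '---': treat everything as body


def referenced_files(body: str) -> list[str]:
    """Relative link targets in first-seen order, without duplicates."""
    seen: list[str] = []
    for target in LINK_PATTERN.findall(body):
        if "://" in target or target.startswith("#"):
            continue  # skip external URLs and in-page anchors
        if target not in seen:
            seen.append(target)
    return seen


def missing_references(skill_dir: Path) -> list[str]:
    """Referenced paths that do not exist under the skill directory."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    _, body = split_frontmatter(text)
    return [p for p in referenced_files(body) if not (skill_dir / p).is_file()]


# Demo on a throwaway skill directory with one deliberately missing file.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "company-researcher"
    (skill_dir / "assets").mkdir(parents=True)
    (skill_dir / "assets" / "sector-taxonomy.json").write_text("{}", encoding="utf-8")
    (skill_dir / "SKILL.md").write_text(
        "---\nname: company-researcher\n---\n"
        "1. Classify using [sector-taxonomy.json](assets/sector-taxonomy.json)\n"
        "2. See [output-schema.md](references/output-schema.md)\n"
        "- [ ] At least 3 recent news items with dates\n",
        encoding="utf-8",
    )
    missing = missing_references(skill_dir)

print(missing)
```

A check like this only proves the files exist. Whether the agent opens them at the right moment is a separate question, and that is the one Part 3 tests.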
A note on evaluation: why we don't use 0-10 scores
Most LLM-as-a-Judge implementations ask the judge to “rate accuracy from 0 to 10.” This is a trap. A score of 7 means nothing — it's subjective, non-reproducible across runs, and impossible to debug. If two runs of the same input give 6 and 8, what did you learn? Nothing.
Our approach separates evaluation into two stages:
Stage 1: structural checks (no LLM, no token cost). Before the judge even fires, we run deterministic checks in plain Python: word count, schema validation, required fields. If the summary runs to 500 words or is missing its key risks, why spend tokens on an LLM review? Send it back immediately with concrete feedback.
Stage 2: the LLM judge (only what needs intelligence). For accuracy, the judge verifies each claim against its cited source — not “rate accuracy 0-10” but “does source X support claim Y? yes or no.” Accuracy becomes verified_claims / total_claims. For completeness, we define expected facets upfront and check which ones appear: covered_facets / expected_facets. The only subjective score we allow is conciseness (1-5 with a rubric), because conciseness is genuinely subjective.
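One way to keep those ratios honest is to compute them in code from the judge's per-claim verdicts rather than trusting the model's arithmetic. The sketch below uses plain dataclasses as simplified stand-ins for the Pydantic models. The `is_approved` function and all of its thresholds are assumptions for illustration, as is the choice to score an empty claim list as 0.0.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """Simplified stand-in for ClaimVerification (only the fields used here)."""
    claim: str
    supported: bool


def accuracy_ratio(claims: list[Claim]) -> float:
    """verified_claims / total_claims."""
    if not claims:
        return 0.0  # assumption: a summary with no checkable claims scores 0
    return sum(c.supported for c in claims) / len(claims)


def completeness_ratio(expected: list[str], covered: list[str]) -> float:
    """covered_facets / expected_facets, ignoring facets nobody asked for."""
    expected_set = set(expected)
    if not expected_set:
        return 1.0  # nothing expected, so nothing can be missing
    return len(expected_set & set(covered)) / len(expected_set)


def is_approved(
    accuracy: float,
    completeness: float,
    conciseness: int,
    *,
    min_accuracy: float = 0.9,      # hypothetical threshold
    min_completeness: float = 0.8,  # hypothetical threshold
    min_conciseness: int = 3,       # hypothetical threshold on the 1-5 rubric
) -> bool:
    """Approve only when every dimension clears its own bar."""
    return (
        accuracy >= min_accuracy
        and completeness >= min_completeness
        and conciseness >= min_conciseness
    )


# Placeholder verdicts: three of four claims are supported by their sources.
claims = [
    Claim("claim A", True),
    Claim("claim B", True),
    Claim("claim C", False),
    Claim("claim D", True),
]
expected = ["news", "financials", "risks", "outlook", "events"]
covered = ["news", "financials", "risks", "outlook"]

accuracy = accuracy_ratio(claims)
completeness = completeness_ratio(expected, covered)
approved = is_approved(accuracy, completeness, conciseness=4)
print(accuracy, completeness, approved)
```

Here the summary is rejected because one unsupported claim pulls accuracy below the bar, and the failing claim is right there in the list to debug. That is the practical difference from a bare "7 out of 10".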
What's next
In Part 2, we build the core pipeline in all three frameworks — same use case, same models, same tools, different paradigms.
In Part 3, we integrate MCP servers, adopt Agent Skills from agentskills.io, and run a full dependency audit.
In Part 4, we build LLM-as-a-Judge evaluation in each framework, unleash DSPy's optimizer, and deliver the final verdict with a decision matrix.
The companion code will be available in the GitHub repository as each part is published.