There are dozens of blog posts comparing AI agent frameworks. Most of them read the docs, list features in a table, and declare a winner. I wanted to do something different.

I've spent the past year building production LLM systems — RAG pipelines, multi-agent architectures, agentic document processors — across pharma, banking, and audit. In that time, I've used DSPy, LangGraph, and CrewAI extensively. Not in toy demos. In systems that run daily, process real documents, and get scrutinized by domain experts.

What I've learned is that choosing between these frameworks isn't about which has more GitHub stars or a nicer API. It's about how you think about the problem. Each framework embodies a fundamentally different philosophy.

DSPy
“Define what you want. I'll figure out the best prompts.”
Treats prompts as learnable parameters and compiles your pipeline to optimize them automatically. Your job is to write modules and metrics. DSPy's job is to make the LLM perform.
LangGraph
“Draw your workflow. I'll execute it reliably.”
Models agent pipelines as state graphs with explicit nodes, edges, and conditional branching. Your job is to engineer the control flow. LangGraph gives you checkpoints, time-travel debugging, and full visibility into every state transition.
CrewAI
“Describe your team. I'll coordinate them.”
Uses a role-based metaphor where agents have backstories, goals, and tasks — like assembling a crew of specialists. Your job is to define who does what. CrewAI handles delegation and communication.

These aren't competing on the same axis. They're different answers to the question: “What should a developer control, and what should the framework handle?”


What makes this comparison different

Instead of abstract feature tables, this series builds the same pipeline in all three frameworks and compares them across dimensions that matter in real projects:

01. MCP Integration

The Model Context Protocol is quickly gaining traction as a way to connect agents to external tools. I'll integrate each framework with existing MCP servers, including mcp-python-repl — a Python REPL server I built and maintain — to show how cleanly each framework adopts the protocol.

02. Agent Skills (agentskills.io)

Agent capabilities should be structured, reusable, and documented. I'll define each agent's skills using the open Agent Skills specification, the standard format maintained by Anthropic for giving agents new capabilities via SKILL.md files. That said, recent evaluations (notably Vercel's findings) show that agents don't always trigger skills reliably. An interesting counterpoint: DSPy's prompt optimizers (like GEPA, a reflective optimizer that uses textual feedback to evolve prompts) can address this directly by treating skill-trigger reliability as an optimizable metric. We'll test whether optimization closes the gap with passive context (like AGENTS.md), which currently comes out ahead.

03. LLM-as-a-Judge Evaluation

Every team building agents struggles with evaluation. I'll implement LLM-as-a-Judge in each framework — not just as a simple scorer, but as a full evaluation pipeline. In DSPy, the judge is a module you can optimize. In LangGraph, it's a node in your graph. In CrewAI, it's an agent with a reviewer role.

Along the way, I'll also compare dependency weight, lines of code for the same result, and how each framework handles the inevitable moment when output quality isn't good enough.

What we deliberately skip: Security — all three are open-source Python libraries; security depends on your deployment. Pricing — all free. Generic benchmarks — unless you run them yourself on your use case, they're noise.

The use case: Company Research Agent

We need a pipeline complex enough to reveal meaningful differences but simple enough that the framework, not the business logic, stays in focus. The pipeline has four steps, each testing a different framework capability:

| Step | What it does | What it tests |
| --- | --- | --- |
| Researcher | Searches the web for recent news, financials, key events | MCP tool integration, structured extraction |
| Writer | Transforms raw facts into a concise analyst-style summary | Core LLM interaction, structured output |
| Reviewer | Scores the summary on accuracy, completeness, conciseness | Evaluation, conditional branching |
| Feedback loop | If score < threshold, sends feedback back to the Writer | Control flow, retry logic, state management |
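
Before any framework enters the picture, the four steps can be sketched as plain Python. This is a framework-agnostic illustration, not code from the series: `research`, `write`, and `review` are hypothetical stand-ins for the three agents, and `threshold` and `max_rounds` are assumed parameters with made-up defaults.

```python
from typing import Callable, Optional


def run_pipeline(
    company: str,
    research: Callable[[str], dict],
    write: Callable[[dict, Optional[str]], str],
    review: Callable[[str, dict], tuple[float, str]],
    threshold: float = 0.8,   # assumed value, not from the article
    max_rounds: int = 3,      # assumed cap so the loop always terminates
) -> tuple[str, float, int]:
    """Research once, then write and review until the score clears the threshold.

    Returns (summary, score, rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    facts = research(company)
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        summary = write(facts, feedback)          # Writer sees prior feedback, if any
        score, feedback = review(summary, facts)  # Reviewer returns score + feedback
        if score >= threshold:
            break
    return summary, score, round_no


# Stub agents, purely to exercise the loop.
def fake_research(company: str) -> dict:
    return {"company_name": company}


def fake_write(facts: dict, feedback: Optional[str]) -> str:
    suffix = " (revised)" if feedback else ""
    return f"Summary of {facts['company_name']}{suffix}"


def fake_review(summary: str, facts: dict) -> tuple[float, str]:
    ok = summary.endswith("(revised)")
    return (0.9 if ok else 0.5), ("" if ok else "Add sources")


summary, score, rounds = run_pipeline("Nvidia", fake_research, fake_write, fake_review)
print(summary, score, rounds)
```

With these stubs the first draft is rejected and the second is accepted. Each framework in the series expresses this same loop in its own idiom; the sketch only fixes the behavior being compared.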

We'll use real companies — Apple, Tesla, Nvidia — so the outputs are verifiable and the pipeline faces real-world messiness.

The key insight is that this pipeline exercises everything we want to compare: tool usage (MCP), structured agent capabilities (Agent Skills), quality assessment (LLM-as-a-Judge), and the improvement loop that separates toy demos from production systems.


The shared foundation

To keep the comparison fair, all three implementations share the same building blocks. We define them once in a common/ package — same tools, same models, same eval criteria. Only the framework changes.

Pydantic models

Identical data contracts across all frameworks:

common/models.py (Python)
from pydantic import BaseModel, Field

class CompanyFacts(BaseModel):
    """Structured output from the Researcher step."""
    company_name: str
    sector: str
    recent_news: list[str] = Field(
        description="3-5 recent news headlines with dates"
    )
    financial_highlights: list[str] = Field(
        description="Key financial metrics or events"
    )
    key_events: list[str] = Field(
        description="Notable recent events"
    )
    sources: list[str] = Field(
        description="URLs of sources used"
    )

class AnalystSummary(BaseModel):
    """Structured output from the Writer step."""
    summary_text: str = Field(description="200-word max analyst summary")
    key_risks: list[str]
    outlook: str = Field(
        description="One sentence: bullish, bearish, or neutral"
    )
    confidence_score: float = Field(ge=0, le=1)


# ── Pre-validation gate (no LLM needed) ──────────────────

def structural_check(summary: AnalystSummary) -> list[str]:
    """Cheap checks before calling the LLM judge.
    Returns a list of issues. Empty = pass."""
    issues = []
    word_count = len(summary.summary_text.split())
    if word_count > 200:
        issues.append(f"Summary too long: {word_count} words (max 200)")
    if not summary.key_risks:
        issues.append("Missing key_risks")
    if not summary.outlook:
        issues.append("Missing outlook")
    return issues


# ── LLM judge models (only what needs intelligence) ──────

class ClaimVerification(BaseModel):
    """One verified claim from the summary."""
    claim: str = Field(description="The factual claim made in the summary")
    source_url: str = Field(description="Source URL that should support this claim")
    supported: bool = Field(description="Does the source actually support the claim?")
    reasoning: str = Field(description="Why supported or not")

class ReviewResult(BaseModel):
    """What the LLM judge actually evaluates — no vague 0-10 scores."""
    # Accuracy: verified claims / total claims
    claim_verifications: list[ClaimVerification]
    accuracy_ratio: float = Field(
        ge=0, le=1,
        description="verified_claims / total_claims"
    )
    # Completeness: facets covered / expected facets
    expected_facets: list[str] = Field(
        description="Facets we expected (news, financials, risks, outlook, events)"
    )
    covered_facets: list[str] = Field(
        description="Facets actually present in the summary"
    )
    completeness_ratio: float = Field(
        ge=0, le=1,
        description="covered_facets / expected_facets"
    )
    # Conciseness (only subjective score — constrained 1-5 with rubric)
    conciseness_rating: int = Field(
        ge=1, le=5,
        description="1=verbose, 5=tight"
    )
    # Decision
    feedback: str
    issues: list[str]
    approved: bool

Agent Skills

Each agent's capabilities are defined as an Agent Skill — not a flat config file, but a directory with scripts, references, and assets that the agent can navigate progressively. Here's the Researcher skill structure:

common/skills/company-researcher/ (directory tree)
company-researcher/
├── SKILL.md                      # Main instructions + frontmatter
├── scripts/
│   ├── extract_financials.py     # Parses financial data from raw HTML
│   └── validate_sources.py       # Checks that cited URLs are accessible
├── references/
│   ├── output-schema.md          # Full CompanyFacts model documentation
│   ├── search-strategies.md      # Sector-specific search patterns
│   └── quality-checklist.md      # Self-evaluation checklist before submission
└── assets/
    └── sector-taxonomy.json      # Standard sector classification

And the SKILL.md itself references these files — the agent loads them on demand:

common/skills/company-researcher/SKILL.md (Markdown)
---
name: company-researcher
description: >
  Researches a public company using web search and extracts structured
  financial facts with source attribution. Use when given a company name
  and asked to produce a CompanyFacts report.
metadata:
  author: faunaris-ai
  version: "1.0"
allowed-tools: web_search execute_python
---

# Company Researcher

## When to use
Activate when the user provides a company name and needs structured
research output (news, financials, key events).

## Instructions
1. Classify the company sector using [sector-taxonomy.json](assets/sector-taxonomy.json)
2. Apply the appropriate search strategy from [search-strategies.md](references/search-strategies.md)
3. Search for recent news (past 30 days), financial highlights, and key events
4. Run [extract_financials.py](scripts/extract_financials.py) on raw financial pages
5. Validate all sources with [validate_sources.py](scripts/validate_sources.py)
6. Return a CompanyFacts object — see [output-schema.md](references/output-schema.md)

## Quality checklist
Before submitting, verify against [quality-checklist.md](references/quality-checklist.md):
- [ ] At least 3 recent news items with dates
- [ ] At least 2 financial highlights with specific numbers
- [ ] Every claim has a verified source URL
- [ ] Sector classification matches taxonomy

This isn't a toy example. The skill references scripts the agent must run, docs it must read, and a checklist it must verify. The real test in Part 3 will be: does each framework's agent actually navigate this structure, or does it ignore the references and wing it?
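
Because the skill depends on relative links, it can help to confirm that every file SKILL.md points to actually exists before handing the skill to an agent. The sketch below is an assumed pre-flight check, not part of the Agent Skills specification or of the series code; `missing_skill_references` is a hypothetical helper name, and it uses only the standard library.

```python
import re
import tempfile
from pathlib import Path

# Matches Markdown links of the form [label](target) and captures the target.
LINK_RE = re.compile(r"\[[^\]]+\]\(([^)]+)\)")


def missing_skill_references(skill_dir: Path) -> list[str]:
    """Return relative link targets in SKILL.md that do not exist on disk."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    missing: list[str] = []
    for target in LINK_RE.findall(text):
        if "://" in target or target.startswith("#"):
            continue  # skip external URLs and in-page anchors
        if not (skill_dir / target).is_file() and target not in missing:
            missing.append(target)
    return missing


# Demo: a skill directory where one referenced file is present and one is not.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "company-researcher"
    (skill / "assets").mkdir(parents=True)
    (skill / "assets" / "sector-taxonomy.json").write_text("{}", encoding="utf-8")
    (skill / "SKILL.md").write_text(
        "1. Classify using [sector-taxonomy.json](assets/sector-taxonomy.json)\n"
        "2. Apply [search-strategies.md](references/search-strategies.md)\n",
        encoding="utf-8",
    )
    missing = missing_skill_references(skill)

print(missing)
```

A check like this only proves the references resolve. Whether an agent actually follows them is a separate question.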


A note on evaluation: why we don't use 0-10 scores

Most LLM-as-a-Judge implementations ask the judge to “rate accuracy from 0 to 10.” This is a trap. A score of 7 means nothing — it's subjective, non-reproducible across runs, and impossible to debug. If two runs of the same input give 6 and 8, what did you learn? Nothing.

Our approach separates evaluation into two stages:

Stage 1: structural checks (no LLM, no cost). Before the judge even fires, we run deterministic checks in plain Python: word count, schema validation, required fields. If the summary runs to 500 words or is missing its key risks or outlook, why spend tokens on an LLM review? Send it back immediately with concrete feedback.
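
A minimal sketch of that gate follows, under stated assumptions: `Summary` is a dataclass stand-in for `AnalystSummary` so the example runs without pydantic, `gated_review` is a hypothetical wrapper name, and `judge` is any callable standing in for the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Summary:
    """Stand-in for AnalystSummary (dataclass here to avoid the pydantic dependency)."""
    summary_text: str
    key_risks: list[str] = field(default_factory=list)
    outlook: str = ""


def structural_check(summary: Summary, max_words: int = 200) -> list[str]:
    """Same checks as common/models.py: length, key_risks, outlook."""
    issues = []
    word_count = len(summary.summary_text.split())
    if word_count > max_words:
        issues.append(f"Summary too long: {word_count} words (max {max_words})")
    if not summary.key_risks:
        issues.append("Missing key_risks")
    if not summary.outlook:
        issues.append("Missing outlook")
    return issues


def gated_review(summary: Summary, judge: Callable[[Summary], dict]) -> dict:
    """Run the free checks first; call the (hypothetical) judge only if they pass."""
    issues = structural_check(summary)
    if issues:
        return {"approved": False, "issues": issues, "judge_called": False}
    result = dict(judge(summary))
    result["judge_called"] = True
    return result


# Demo with a fake judge that records how often it is invoked.
calls = []


def fake_judge(summary: Summary) -> dict:
    calls.append(summary)
    return {"approved": True, "issues": []}


too_long = Summary("word " * 500, key_risks=["supply chain"], outlook="neutral")
rejected = gated_review(too_long, fake_judge)
ok = gated_review(Summary("Short and sourced.", ["supply chain"], "neutral"), fake_judge)
print(rejected["issues"], len(calls))
```

The 500-word draft is rejected without the judge being called, so only the second summary costs tokens.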

Stage 2: the LLM judge (only what needs intelligence). For accuracy, the judge verifies each claim against its cited source — not “rate accuracy 0-10” but “does source X support claim Y? yes or no.” Accuracy becomes verified_claims / total_claims. For completeness, we define expected facets upfront and check which ones appear: covered_facets / expected_facets. The only subjective score we allow is conciseness (1-5 with a rubric), because conciseness is genuinely subjective.
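
The two ratios are simple enough to compute in code from the judge's per-claim verdicts and facet lists. The sketch below is one way to do that, an assumption here rather than something the series prescribes: `Claim` is a trimmed stand-in for `ClaimVerification`, the helper names are hypothetical, and returning 0.0 for an empty input is an assumed policy (a summary with no verifiable claims fails).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Claim:
    """Trimmed stand-in for ClaimVerification: just the claim and the verdict."""
    claim: str
    supported: bool


# Facet names taken from the ReviewResult field description.
EXPECTED_FACETS = ("news", "financials", "risks", "outlook", "events")


def accuracy_ratio(claims: list[Claim]) -> float:
    """verified_claims / total_claims; 0.0 when there are no claims (assumed policy)."""
    if not claims:
        return 0.0
    return sum(c.supported for c in claims) / len(claims)


def completeness_ratio(
    covered: Iterable[str], expected: Iterable[str] = EXPECTED_FACETS
) -> float:
    """covered_facets / expected_facets, ignoring facets that were never expected."""
    expected_set = set(expected)
    if not expected_set:
        return 0.0
    return len(expected_set & set(covered)) / len(expected_set)


claims = [
    Claim("Revenue grew year over year", True),
    Claim("A new product line was announced", True),
    Claim("The CFO resigned", False),
    Claim("Guidance was raised", True),
]
accuracy = accuracy_ratio(claims)
completeness = completeness_ratio(["news", "financials", "outlook"])
print(accuracy, completeness)
```

Computing the ratios in code rather than trusting the values the judge reports keeps the arithmetic deterministic, and the judge's own `accuracy_ratio` field can then be compared against it as a sanity check.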

The deeper question: how can a judge evaluate without ground truth? For this series, we use source-grounded evaluation: the judge checks whether the summary's claims are supported by the cited sources. This works well for prototyping and catches hallucinations effectively.

However, in an enterprise context (pharma, banking, audit) there is no shortcut: you need to build evaluation datasets with domain experts who define what a correct output looks like for your specific use case. Source-grounded evaluation won't catch subtle errors in interpretation, missing context, or domain-specific nuance.

As Vercel demonstrated with their agent evals, the quality of your evaluation suite determines everything: they found that targeting APIs outside model training data was the only way to measure real capability. The same principle applies here: test what the LLM can't already guess. In Part 4, we'll implement both source-grounded and reference-based evaluation and show how each framework handles them.

What's next

In Part 2, we build the core pipeline in all three frameworks — same use case, same models, same tools, different paradigms.

In Part 3, we integrate MCP servers, adopt Agent Skills from agentskills.io, and run a full dependency audit.

In Part 4, we build LLM-as-a-Judge evaluation in each framework, unleash DSPy's optimizer, and deliver the final verdict with a decision matrix.

The companion code will be available in the GitHub repository as each part is published.