Tutorial: Research Review Pipeline¶

Build a pipeline that extracts claims from research papers, checks the methodology, and runs a blind adversarial review — in three phases, each building on the last.

What you'll learn:

Phase	What you build	Heddle concepts
1	A claim extractor you can test immediately	Worker configs, Workshop test bench, eval suites
2	A three-stage review pipeline	Pipeline configs, stage dependencies, input mappings
3	Blind adversarial review with parallel branches	Knowledge silos, blind workers, parallel stages, synthesis

Prerequisites: Heddle installed and configured (heddle setup completed). If you haven't done this yet, see the Getting Started guide.

The Problem¶

You read research papers and need to evaluate them systematically. Not just "is this interesting?" but: What claims are being made? What evidence supports them? Are there methodological gaps the authors don't acknowledge? Would an independent reviewer reach the same conclusions?

Doing this by hand is slow. Asking a single AI to both summarize and critique produces shallow, self-confirming output — the same model that extracted the claims will rubber-stamp them when asked to review.

This tutorial builds a pipeline that solves each problem in a separate, testable step.

How It Maps to Heddle¶

What you want	Heddle concept	Why
"Extract the claims from this paper"	Worker with an extraction prompt	One worker = one job. Focused prompts produce better output than general-purpose ones.
"Check whether the methodology supports the claims"	Second worker, different prompt	A separate worker can apply different criteria without the extraction context contaminating its judgment.
"Summarize the review"	Third worker in a pipeline	The summarizer sees structured output from prior stages, not raw text.
"Get a genuinely independent opinion"	Blind worker (empty knowledge silo)	A worker with no access to domain knowledge can't pattern-match against the analytical frame. It evaluates from first principles.
"Did my prompt change make things better?"	Workshop eval suite	Run the same test cases before and after. Compare scores. No guessing.
"Run extraction → review → summary automatically"	Pipeline config	Define the stages in YAML. Heddle handles the data flow and parallelism.

Phase 1: Extract Claims from a Paper¶

By the end of this phase, you'll have a working claim extractor that you can test on any research abstract.

What you're building¶

A single worker — claim_extractor — that takes a research abstract and returns structured output:

A list of claims, each tagged with its type (empirical, methodological, interpretive, recommendation)
The evidence supporting each claim
Limitations the authors acknowledge
Potential issues the authors don't acknowledge

That last point is important. A good extractor doesn't just parrot the abstract — it flags gaps.

Step 1: Set up the example¶

Copy the example configs into your Heddle project:

# From your heddle directory
cp -r examples/research-review/phase-1/workers/claim_extractor.yaml \
      configs/workers/claim_extractor.yaml

Or create the file yourself. Here's the full config:

# configs/workers/claim_extractor.yaml

name: "claim_extractor"

description: "Extracts structured claims from research text with confidence basis and limitations."

system_prompt: |
  You are a research claim extractor. You read academic text and identify
  the distinct claims being made, what evidence supports each one, and
  what limitations the authors acknowledge.

  INPUT FORMAT:
  - text (string): Research abstract or paper section to analyze
  - domain (string, optional): Research domain for context

  OUTPUT FORMAT:
  Respond with ONLY a JSON object:
  {
    "claims": [
      {
        "claim": "A clear statement of what is being asserted",
        "type": "empirical | methodological | interpretive | recommendation",
        "evidence": "What evidence or data supports this claim",
        "confidence_basis": "How confident the claim is and why",
        "limitations": "Any caveats, qualifications, or acknowledged weaknesses"
      }
    ],
    "methodology_summary": "Brief description of the study design or approach",
    "stated_limitations": ["Limitations the authors explicitly acknowledge"],
    "unstated_concerns": ["Potential issues the authors do NOT acknowledge"]
  }

  RULES:
  - Extract ALL distinct claims, not just the main finding
  - type must be one of: empirical, methodological, interpretive, recommendation
  - Do not invent evidence — only report what the text states
  - unstated_concerns should flag real methodological issues the authors missed
  - If the text is too short or vague to extract meaningful claims, return
    {"error": "Insufficient content for claim extraction"}

input_schema:
  type: object
  required: [text]
  properties:
    text:
      type: string
      minLength: 50
    domain:
      type: string

output_schema:
  type: object
  required: [claims, methodology_summary, stated_limitations, unstated_concerns]
  properties:
    claims:
      type: array
      items:
        type: object
        required: [claim, type, evidence, confidence_basis]
        properties:
          claim:
            type: string
          type:
            type: string
            enum: [empirical, methodological, interpretive, recommendation]
          evidence:
            type: string
          confidence_basis:
            type: string
          limitations:
            type: string
    methodology_summary:
      type: string
    stated_limitations:
      type: array
      items:
        type: string
    unstated_concerns:
      type: array
      items:
        type: string

default_model_tier: "standard"
max_input_tokens: 16000
max_output_tokens: 4000
reset_after_task: true
timeout_seconds: 90

Take a moment to read the config. Notice:

system_prompt is the only instructions the LLM sees. It defines INPUT FORMAT, OUTPUT FORMAT, and RULES — three sections that keep the worker focused.
input_schema and output_schema are JSON Schema. Heddle validates input before calling the LLM and output after. If either fails, you get a clear error — not garbage output.
default_model_tier: "standard" means this worker uses Claude Sonnet by default. Claim extraction benefits from stronger reasoning than a local model provides. You can override this in Workshop.

Step 2: Validate the config¶

Before testing, check that the config is well-formed:

heddle validate configs/workers/claim_extractor.yaml

You should see:

✓ configs/workers/claim_extractor.yaml — valid worker config

If you get errors, check your YAML indentation. The most common mistake is mixing tabs and spaces.

Step 3: Test in Workshop¶

Start Workshop:

heddle workshop

Open http://localhost:8080 in your browser. You'll see claim_extractor in the Workers list. Click Test.

The test bench shows input fields matching the schema: a text box and an optional domain field. Paste this abstract from the sample data:

This study examines the relationship between urban green space coverage and respiratory health outcomes across 12 mid-size cities (population 100,000-500,000) in the midwestern United States over a 5-year period (2019-2024). Using satellite-derived NDVI measurements cross-referenced with county-level hospitalization records for asthma, COPD, and bronchitis, we found a statistically significant inverse correlation (r=-0.67, p<0.01) between green space coverage percentage and age-adjusted respiratory hospitalization rates...

Set domain to public health / urban planning and click Run.

You'll see structured JSON output with:

Multiple claims extracted (the main correlation finding, the socioeconomic moderation effect, the seasonal interaction, the policy recommendation)
Each tagged with type, evidence, and confidence basis
Stated limitations (socioeconomic confound, seasonal effect)
Unstated concerns (e.g., ecological fallacy from county-level data, no individual-level exposure measurement, potential selection bias in which cities were studied)

Try different tiers. Switch to local and run the same abstract. Then frontier. Compare: Does the local model miss claims? Does the frontier model catch more unstated concerns? This comparison is a core Heddle workflow — the tier system exists to make it trivial.

Step 4: Run the eval suite¶

Testing one abstract at a time tells you if the worker works. An eval suite tells you if it works consistently.

The example includes a test suite with all four sample abstracts. To run it in Workshop:

Go to the Workers list → claim_extractor → Eval
Load the test cases from examples/research-review/phase-1/eval/test_suite.json (or paste them manually)
Choose field_match scoring — this checks whether specific output fields match expected values
Click Run Suite

The eval checks that methodology_summary captures the study design accurately for each abstract. In a real project, you'd add expected values for claim count, claim types, and specific limitations you expect the extractor to catch.

Set a baseline. Once you have a passing eval run, click Set as Golden Baseline. Now any future changes to the system prompt or model tier will be compared against this baseline. If your "improvement" makes things worse, you'll see it immediately.

Step 5: Iterate on the prompt¶

This is where the real work happens. Look at the eval results:

Are claims too vague? Add a RULE: "Each claim must be specific enough to be falsifiable."
Missing unstated concerns? Add examples of common methodological gaps to the system prompt.
Wrong claim types? Add clearer definitions for each type.

Edit the system prompt in Workshop, save (creating a new version), and re-run the eval. Compare against the golden baseline. This is the Workshop flywheel: edit → test → evaluate → compare → repeat.

What you have now¶

A working, tested claim extractor with:

A YAML config you can version-control
An eval suite that catches regressions
A golden baseline for comparison
Zero infrastructure (no NATS, no message bus, no deployment)

You can already use this as a standalone tool: paste any abstract into Workshop, get structured claims out. But one worker working alone hits a ceiling — it can extract claims, but it can't evaluate whether those claims are well-supported. That's Phase 2.

Phase 2: The Review Pipeline¶

Phase 1 gave you a claim extractor. But extracted claims sitting in JSON aren't a review. You need:

A methodology reviewer that checks whether the study design actually supports the claims being made
A review summarizer that synthesizes the extraction and review into a structured report

Phase 2 chains these three workers into a pipeline — data flows automatically from one stage to the next.

What you're building¶

  claim_extractor ──► methodology_reviewer ──► review_summarizer
  (extract claims)    (check the evidence)      (write the report)

Heddle concepts introduced¶

Pipeline config — a YAML file defining stages, their order, and how data flows between them
Input mappings — how one stage's output becomes the next stage's input (dot-notation paths like extract.output.claims)
Stage dependencies — Heddle infers the execution order from the input mappings. If stage B reads from stage A's output, A runs first. Stages that don't depend on each other run in parallel automatically.
Pipeline editor — Workshop's visual graph of stage dependencies

Step 1: Create the methodology reviewer¶

Copy the Phase 2 configs:

cp examples/research-review/phase-2/workers/*.yaml configs/workers/
cp examples/research-review/phase-2/orchestrators/*.yaml configs/orchestrators/

Open configs/workers/methodology_reviewer.yaml. This worker takes three inputs: the original text, the claims array from the extractor, and the methodology_summary. For each claim, it assesses whether the described methodology can logically support it.

The key output field is strength — an enum of strong, moderate, weak, or unsupported. This forces the LLM into a concrete assessment rather than vague hedging like "the evidence is somewhat supportive."

Test it in Workshop: construct an input with a text, a claims array (you can paste output from the claim extractor), and a methodology summary. Run it and check whether the strength assessments are defensible.

Step 2: Create the review summarizer¶

Open configs/workers/review_summarizer.yaml. This worker receives structured data from both prior stages — it never sees raw text. Its job is synthesis, not re-analysis.

The output includes a verdict (accept / revise / reject) and a report field — a narrative review written as if addressing the paper's authors. The prompt explicitly says "Do not add new analysis — synthesize what the extraction and review found."

Step 3: Define the pipeline¶

Open configs/orchestrators/research_review.yaml. Three stages:

pipeline_stages:
  - name: "extract"
    worker_type: "claim_extractor"
    input_mapping:
      text: "goal.context.text"
      domain: "goal.context.domain"

  - name: "review"
    worker_type: "methodology_reviewer"
    input_mapping:
      text: "goal.context.text"
      claims: "extract.output.claims"
      methodology_summary: "extract.output.methodology_summary"

  - name: "summarize"
    worker_type: "review_summarizer"
    input_mapping:
      claims: "extract.output.claims"
      methodology_summary: "extract.output.methodology_summary"
      stated_limitations: "extract.output.stated_limitations"
      unstated_concerns: "extract.output.unstated_concerns"
      claim_reviews: "review.output.claim_reviews"
      overall_methodology_score: "review.output.overall_methodology_score"
      design_strengths: "review.output.design_strengths"
      design_weaknesses: "review.output.design_weaknesses"

Read the input_mapping entries. claims: "extract.output.claims" means "take the claims field from the extract stage's output and pass it as the claims input to this stage." Heddle sees that review references extract.output.* and infers the dependency: extract must finish before review starts.

The summarize stage reads from both extract and review — it waits for both to complete. Since review already depends on extract, the chain is strictly sequential: extract → review → summarize.

Validate the pipeline:

heddle validate configs/orchestrators/research_review.yaml

Step 4: Test the pipeline in Workshop¶

Open Workshop and navigate to the Pipeline List. Click research_review to open the pipeline editor. You'll see the dependency graph — three stages in a straight line.

To test, you can submit a goal through the pipeline editor or via CLI:

heddle submit "Review this paper" \
    --context text="<paste abstract here>" \
    --context domain="public health"

The output from summarize is a complete review report with verdict, strong and weak claims, key concerns, and a narrative report.

Step 5: Evaluate each stage independently¶

The pipeline runs end-to-end, but you should evaluate each worker separately. If the final report is bad, you need to know which stage produced the problem:

Did the extractor miss a claim? → Fix the extractor prompt.
Did the reviewer misjudge methodology strength? → Fix the reviewer.
Did the summarizer produce a misleading verdict? → Fix the summarizer.

Create eval suites for the methodology reviewer and summarizer the same way you did for the claim extractor in Phase 1. Test each worker in isolation in Workshop before relying on the pipeline.

Phases 1 and 2 give you automated extraction and review. But there's a structural problem: the methodology reviewer reads the same claims that the extractor produced. If the extractor framed a weak claim favorably, the reviewer is primed to accept that framing.

Phase 3 fixes this with Heddle's blind audit pattern — the same approach used in the Adversarial Review guide, applied to research review.

What you're building¶

  claim_extractor ──┬──► methodology_reviewer ──┐
                    │                            ├──► review_synthesizer
                    └──► neutralizer ──► blind_reviewer ──┘
                         (strips framing)   (no domain knowledge)

Two review paths run in parallel:

The sighted path (methodology reviewer) evaluates claims with full context
The blind path (neutralizer → blind reviewer) strips the original framing and evaluates from first principles

A review synthesizer merges both reports, explicitly flagging where the sighted and blind reviewers disagree. Those disagreements are the most valuable output — they're where the original framing may have hidden a weakness.

Heddle concepts introduced¶

Knowledge silos — the blind reviewer has an empty knowledge_silos list. It cannot access domain-specific reference material.
Terminology neutralizer — a worker that strips loaded academic language so the blind reviewer isn't primed by disciplinary framing
Parallel pipeline branches — the sighted and blind paths run simultaneously since neither depends on the other
Synthesis — a final worker merges parallel results into one report

Step 1: Create the terminology neutralizer¶

Copy the Phase 3 configs:

cp examples/research-review/phase-3/workers/*.yaml configs/workers/
cp examples/research-review/phase-3/orchestrators/*.yaml configs/orchestrators/

Open configs/workers/terminology_neutralizer.yaml. This worker rewrites claims in plain language, preserving all factual content while removing:

Discipline-specific jargon ("statistically significant" → "the statistical test indicates this is unlikely to be random")
Hedging that minimizes limitations ("somewhat" → removed)
Framing that presupposes conclusions ("as expected" → removed)
Rhetorical emphasis ("critically", "notably" → removed)

The output includes framing_flags — specific phrases that were neutralized. These flags feed into the final synthesizer so it can assess whether loaded language affected the sighted review.

Test it in Workshop: paste the claims output from Phase 1 and check whether the neutralized versions preserve all the numbers and measurements while stripping the evaluative language.

Open configs/workers/blind_reviewer.yaml. Two things make this worker "blind":

knowledge_silos: [] — explicitly empty. No domain reference material is injected into its prompt.
Its input mapping only provides neutralized claims. It never sees the original text, the paper's title, or the authors' framing.

The prompt tells it to evaluate from first principles: does the logic hold? Does the evidence support the claim? What alternative interpretations could explain the same data?

The most important output field is alternative_interpretations — explanations that the original paper didn't consider. Because the blind reviewer doesn't know the field's conventions, it's more likely to propose interpretations that domain experts would overlook or dismiss as "not how we do things."

Step 3: Create the review synthesizer¶

Open configs/workers/review_synthesizer.yaml. This worker replaces the Phase 2 review_summarizer — it merges two independent reviews instead of one.

The key output fields are agreements (where both reviewers concur) and disagreements (where they diverge). Each disagreement includes likely_cause — why the reviewers reached different conclusions — and resolution — which assessment is more defensible.

The prompt also checks for framing_effects: cases where the loaded language flagged by the neutralizer may have biased the sighted review's judgment.

Step 4: Update the pipeline¶

Open configs/orchestrators/research_review_blind.yaml. The pipeline now has five stages with parallel branches:

# After extract, two paths run IN PARALLEL:
- name: "sighted_review"         # Path A: sighted
  input_mapping:
    text: "goal.context.text"     # Gets original text
    claims: "extract.output.claims"

- name: "neutralize"             # Path B: blind (step 1)
  input_mapping:
    claims: "extract.output.claims"

- name: "blind_review"           # Path B: blind (step 2)
  input_mapping:
    neutralized_claims: "neutralize.output.neutralized_claims"

Heddle sees that sighted_review and neutralize both depend only on extract — not on each other. They run concurrently. blind_review waits for neutralize. synthesize waits for everything.

Validate:

heddle validate configs/orchestrators/research_review_blind.yaml

Run the pipeline on one of the sample abstracts and examine the synthesize output. Look at the disagreements array:

Does the blind reviewer flag logical gaps the sighted reviewer accepted?
Does the sighted reviewer defend claims that the blind reviewer calls "weak"?
Do the framing_effects connect loaded language to specific sighted review assessments?

The disagreements are the most valuable output. They don't mean the sighted review is wrong — they mean there's something worth a human looking at. The pipeline surfaces these points; the human decides.

What's Next¶

You now have a five-worker pipeline that extracts claims, reviews them from two independent perspectives, and synthesizes the results. Here are two directions to push it further:

Idea 1: Multi-Model Comparison¶

Run the blind reviewer on all three model tiers — local, standard, and frontier — and compare where they agree and disagree. This surfaces model-specific biases: a local model might miss subtle logical gaps that a frontier model catches, but it might also flag "issues" that are actually fine.

In practice: duplicate the blind reviewer stage three times in the pipeline, each targeting a different tier. The synthesizer then has three independent reviews to merge. Where all three agree on a problem, you have high confidence. Where they disagree, you have something worth investigating.

This is a natural use of Heddle's tier system — the infrastructure for running the same prompt against different models already exists.

Idea 2: Batch Processing with RAG¶

Ingest a collection of papers into Heddle's vector store, then run the review pipeline on each one. After individual reviews are complete, use a goal-decomposition orchestrator to synthesize cross-paper findings: "What methodological patterns appear across these 20 papers? Where do multiple papers make conflicting claims?"

This requires:

A custom ingestor (or use the CSV/text ingestor from the Document Intake example) to feed papers into the RAG pipeline
An orchestrator config with a synthesis goal
A larger knowledge silo for the cross-paper analyst

This moves you from "reviewing one paper" to "reviewing a literature" — the kind of work that takes a human researcher weeks and that no single LLM conversation can hold in context.

This tutorial uses the example configs in examples/research-review/. Each phase directory contains the complete working configs for that phase — you can copy them directly or build them step by step following the walkthrough above.

Tutorial: Research Review Pipeline¶

The Problem¶

How It Maps to Heddle¶

Phase 1: Extract Claims from a Paper¶

What you're building¶

Step 1: Set up the example¶

Step 2: Validate the config¶

Step 3: Test in Workshop¶

Step 4: Run the eval suite¶

Step 5: Iterate on the prompt¶

What you have now¶

Phase 2: The Review Pipeline¶

What you're building¶

Heddle concepts introduced¶

Step 1: Create the methodology reviewer¶

Step 2: Create the review summarizer¶

Step 3: Define the pipeline¶

Step 4: Test the pipeline in Workshop¶

Step 5: Evaluate each stage independently¶

Phase 3: Blind Adversarial Review¶

What you're building¶

Heddle concepts introduced¶

Step 1: Create the terminology neutralizer¶

Step 2: Create the blind reviewer¶

Step 3: Create the review synthesizer¶

Step 4: Update the pipeline¶

Step 5: Compare sighted vs. blind results¶

What's Next¶

Idea 1: Multi-Model Comparison¶

Idea 2: Batch Processing with RAG¶