The personal research operator in the principal's voice.
Saigar is a single-user agentic research system: eighteen agents, sixty-six stages, four pipelines, and one principal. It runs the literature, debates the position, drafts the artefact, designs the layout, and never breaches the principal's voice. This is what Sagar is building.
Saigar is what happens when you take an autonomous research pipeline derived from AutoResearchClaw, give it a typed harness, layer eighteen specialist agents into it, build a learning loop that improves across runs, render the outputs in a broadsheet aesthetic, and constrain everything to one principal's voice and one principal's editorial sign-off.
It produces four artefact types today. The Brief is a bi-weekly payments competitive intelligence document. The Paper is a long-form research article published to sagarbharambe.com with full citation auditability. The Wire snippet is a 95 to 110 word morning note from regulator filings and central-bank releases, shipped at 07:00 CEST. The Predictive Intel assessment is a quarterly company report with a bear/bull red-team and a ratification scorecard tracking the prior edition's predictions. Brief uses 23 stages, Paper 25, Wire 5, Predictive Intel 13. The eight senior agents (Orchestrator, Researcher, Synthesizer, Writer, Critic, Reflector, Designer, Publisher) span every pipeline; ten junior specialists handle Wire and Predictive Intel.
The principal stays in the loop at three to four gates per pipeline: literature screen approval, analytical approach approval, publish approval, and design approval where it applies. Everything else is autonomous, recorded, and reviewable.
Saigar does not autopublish. The principal reviews and signs the artefact before it becomes editorial output. Saigar handles literature, debate, draft. The principal handles judgment.
The autonomy is real (the system can pivot a hypothesis, refine an analysis, regenerate a violating paragraph, all without intervention). The supervision is real too. Both, in the same system.
What this showcase covers
This site is Saigar in eight pages. Each one explains a layer; together they answer "what did Sagar build, how does it work, and what comes next."
Who Saigar is for and the problem it solves. The shape of one principal's research life and where the system inserts itself.
A live mock of the operating surface: today's lead, runs in flight, gates awaiting decision. The product as the principal will use it.
The four pipelines side by side. Brief, Paper, Wire snippet, and Predictive Intel as instantiations. Where each agent enters, where each gate appears.
Eighteen specialist agents, each with a domain, a state machine, and a failure taxonomy. Plus the cross-run loop the Reflector closes and the seven-archetype layout choice the Designer makes.
Five processes on one VPS. Postgres-as-queue. Anthropic API direct. The deployment posture and the six reasoning checkpoints.
How Saigar improves across runs. The Reflector at stage 24, the signal lifecycle, the principal-supervised /learning view.
The reference artefact. The agentic commerce identity-layer read. With the mechanism diagram animating in newspaper or slides.
v1 to v2 to v3. From four pipelines to many. From single-pipeline runs to cross-pipeline orchestration. From principal-only to a small editorial team.
Saigar is currently running a Brief on agentic commerce identity infrastructure. The Researcher retained 22 of 30 candidate sources. The Synthesizer's hypothesis at stage 8 took the structural framing: agent identity attestation, not the rails layer, is the contested surface. Gate B was approved at 09:43 this morning. The Writer is currently drafting slide 4 (the mechanism diagram).
One reader. One operator. One signature on every artefact.
Saigar is built around a single person and the shape of his research life. That constraint is the most important architectural choice in the system.
Sagar Bharambe is a payments practitioner based in Delft, the Netherlands. His professional terrain is European retail payments: scheme dynamics, regulatory shifts at the AFM and DNB, the open banking spillovers from PSD3 and the Instant Payments Regulation, the slow churn at the long tail of acquirers and PSPs, and what the next layer of European stablecoin and tokenisation efforts means for any of it.
The work is reading-heavy. The output is opinion-heavy. The volume of regulator filings, scheme press releases, banking association statements, market data, EBA consultation papers, AFM guidance letters, ECB working papers, and ISO 20022 progression notes is large enough that staying current is a job in itself. The synthesis required to turn it into useful editorial output, the kind that can land on sagarbharambe.com or in front of a counterparty, is another job.
Saigar is not a research assistant. It is the operator who runs the research, in the principal's voice, on the principal's schedule, with the principal's standards.
The shape of the problem
The conventional answer to "I read too much, I write too little" is a reading queue and a discipline of writing. The conventional answer fails for two reasons. The first is that the reading is unbounded; whatever discipline you bring, the literature outpaces you. The second is that synthesis is the hard part, not the reading; once you have decided what you think, the writing is shorter and easier.
The unconventional answer, and the one Saigar takes, is to delegate the literature scanning, the source ingestion, the multi-perspective debate that turns evidence into a position, and the first draft. The principal does what only the principal can do: judge whether the position is right, edit the draft into the principal's voice, sign the artefact, publish.
What Saigar carries
Four deliverables in v1:
The Brief. A bi-weekly payments competitive intelligence document. Eight to nine slides. Cover, TLDR, market state with a chart, mechanism diagram (animated SVG comparing flow mechanics across payment paths), reads (corporate or regulatory interpretations), implications by audience, sources, closing. Rendered as a single self-contained HTML file with two reading modes (newspaper scroll and slides). Internal cross-references navigate with a back chip. The Brief is private to the principal; it is operational intelligence, not editorial output.
The Paper. A long-form research article, ~1500 to 2500 words, published to sagarbharambe.com. Full 25-stage pipeline including the experiment subsystem (stages 10 to 13) for any quantitative analysis. Markdown with frontmatter, hyperlinked citations, references auto-generated.
The Wire snippet. A 95 to 110 word morning note. The wire curator picks up to five items from the last 24 hours of regulator filings, central-bank releases, working papers, and payments-domain news; the wire drafter writes one note per item; the wire critic enforces voice and citation discipline; the publisher commits them to the "On the wire" column at desk.sagarbharambe.com. Five stages, one source per snippet, capped to one per channel per day.
The Predictive Intel assessment. A quarterly company report. Seven specialists run in sequence: a company researcher pulls filings and IR feeds; a metrics extractor lifts a standard KPI schema; a bear analyst and a bull analyst each generate falsifiable predictions in their own voice; a reconciler reads both and writes the house view; from edition two onward, a ratification agent scores the prior edition's predictions against what actually happened. Thirteen stages, single Gate B for publish approval, MDX output to sagarbharambe.com/desk/predictive-intel.
The audit trail. Every artefact traces to a run. Every run traces to a trigger. Every LLM invocation records its system prompt hash, request, response, cost. The principal can ask "why does this Brief say what it says" and follow the chain back. This is not a feature; it is the foundational commitment of the system.
The constraints that shape the architecture
Saigar is built for one person. Not "starts with one person, scales to teams later." One person, ever. This eliminates entire categories of architectural complexity: no role-based access control, no multi-tenancy plumbing, no organisation-level abstractions. The data model has a singleton principal table. The auth boundary is one passkey. The infrastructure budget reflects this; €30/month for hosting plus €60-170/month in LLM cost is workable for personal use, untenable for ten users, irrelevant for a hundred.
Saigar is hosted on Hetzner Falkenstein. Anthropic API calls go to the EU endpoint where available. Object storage stays in the same region. This matches the regulatory environment the principal works in (payments, EU jurisdiction) and the principal's own data residency preferences.
The principal has prose constraints that are non-negotiable. No em-dashes. No "not X but Y" framing. A list of banned phrases ("doing the heavy lifting," "the real question is," "what most people miss"). These are enforced at three checkpoints: the Writer's system prompt, the post-generation validator, and the Critic. The v1 eval target is zero voice violations leaking past all three. This is the most idiosyncratic thing about Saigar, and the most important.
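A minimal sketch of what the post-generation validator could look like, assuming a plain string scan. The phrase list holds only the three examples quoted above, the `NOT_BUT` pattern is a rough heuristic of our own, and the names `Violation` and `find_voice_violations` are hypothetical.

```python
import re
from dataclasses import dataclass

# Only the three banned phrases quoted in the text; the real list is longer.
BANNED_PHRASES = (
    "doing the heavy lifting",
    "the real question is",
    "what most people miss",
)

# Heuristic for "not X but Y" inside one clause (an assumption, not the real rule).
NOT_BUT = re.compile(r"\bnot\b[^.;!?]{1,60}?\bbut\b", re.IGNORECASE)

EM_DASH = "\u2014"


@dataclass(frozen=True)
class Violation:
    rule: str
    excerpt: str


def find_voice_violations(text: str) -> list[Violation]:
    """Return every voice violation found; an empty list means the text passes."""
    violations = []
    if EM_DASH in text:
        violations.append(Violation("em_dash", EM_DASH))
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            violations.append(Violation("banned_phrase", phrase))
    for match in NOT_BUT.finditer(text):
        violations.append(Violation("not_x_but_y", match.group(0)))
    return violations
```

A validator of this shape is cheap enough to run on every paragraph, which is what makes "regenerate a violating paragraph" a local repair instead of a full redraft.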
There is no separate admin role, no operator role, no support tier. The principal sees the metrics dashboard, the run logs, the failed run queue, the configuration surfaces. Saigar is a tool for one person who builds it and uses it; the principal-operator distinction collapses. This makes the UI simpler and the failure modes more honest.
What "in the principal's voice" actually means
Voice is not a property of words. It is a property of habitual choices: which framings the principal prefers, which sources the principal weights, which interpretive moves the principal accepts and rejects. Saigar makes those choices visible, persistent, and supervised.
The voice constraint table is the visible layer. The learning loop is the deeper layer. After every run, the Reflector reads what the principal accepted, edited, rejected, and produces signals that future runs consume. After ten runs, Saigar knows that the principal prefers contrarian framings on payments adoption topics; after thirty runs, Saigar knows which corporate filings to weight up and which regulatory press releases to weight down. The principal supervises this learning; signals are reviewable, rejectable, weakenable on the /learning view.
The result, projected forward, is a system that does not just take instruction; it inherits taste.
What the principal sees, when the principal looks.
The Desk is Saigar's operating surface. Five destinations, four approval flows, one principal. This is a live mock of the Today view with a Brief 47 cycle in flight.
A Brief is awaiting publish.
Brief 47 on agent identity infrastructure is at gate C. The Critic flagged 2 minor findings (unverified URL on slide 5, weak relevance on one citation). Citation pass rate: 96%. Voice violations: 0. The Reflector has not yet run; it will fire on publish approval.
Gate C · Publish approval
Auto-advance: never (gate C is affirmative-only)
Bi-weekly · Wednesday
Edition 47 awaiting publish. Edition 48 schedules next Wednesday.
Live
On-demand · idle
Last published: 'Web Bot Auth, the substrate everyone is quietly building on' (28 Apr 2026)
Idle
Continuous · v2
Not yet active. Ships in v2.
Inactive
Recent activity
The Desk does not push notifications. The principal opens it on their own schedule. Live runs propagate through Postgres LISTEN/NOTIFY to server-sent events; the page updates incrementally without refresh. Failed runs surface "needs your attention" cards. Cost-cap warnings surface inline.
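The last hop of that path, turning a notification payload into a server-sent event frame, can be sketched as below. The function name `format_sse` and the payload fields are hypothetical; the frame layout (`id:`, `event:`, `data:` lines closed by a blank line) follows the standard SSE wire format.

```python
import json
from typing import Optional


def format_sse(event: str, payload: dict, event_id: Optional[str] = None) -> str:
    """Serialise one event as an SSE frame the browser's EventSource can parse."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload stays on a single data line.
    data = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    lines.append(f"data: {data}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"
```

For example, `format_sse("cost.warning", {"run_id": "r1", "spent_usd": 18.4})` yields a frame a page can apply incrementally without a refresh.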
Sixty-six stages. Four pipelines. One operating model.
Saigar's central abstraction is the Pipeline. Adding a new artefact type means writing YAML, not modifying code. Every pipeline draws from the same canonical flow: problem decomposition, literature, synthesis, drafting, voice review, design, publish, reflection.
Four pipelines, one operating model
Stages execute in order. Some are skippable; some open gates that pause for principal review. The Brief uses 23 stages, the Paper 25 (with the experiment subsystem at stages 10 to 13 producing the analytical block), the Wire snippet 5 (curate, draft, critique, archive, publish), and Predictive Intel 13 (with bear/bull/reconciler producing the house view, plus a ratification scorecard from edition two onward). The Reflector closes the cross-run learning loop; the Designer picks one of seven archetype layouts for Briefs before publish.
The Brief pipeline
Bi-weekly payments competitive intelligence. Default cadence: Wednesday 09:00 Europe/Amsterdam. Topic Ranker proposes three candidates 48 hours before; the principal accepts, overrides, or skips the cycle.
Brief skips stages 9 through 13 (experiment design plus the experiment subsystem). Brief is synthesis-driven, not empirical; the Synthesizer's stage 8 debate produces the hypothesis, the Writer drafts directly from there. Stage 14 result analysis still runs because the Synthesizer needs to produce the qualifications and surprises that go into the Implications slide.
Output is a single self-contained HTML file with two reading modes: newspaper (continuous scroll, expandable sections) and slides (one-per-viewport). Internal cross-references navigate with a back chip. The MechanismDiagram component is the distinctive visual element; it animates packets along SVG paths at relative settlement speeds.
name: brief
display_name: EU Payments Bi-weekly Brief
version: 1.0.0
cadence:
  scheduled:
    cron: "0 9 * * 3"
    timezone: Europe/Amsterdam
source_profile:
  privileged_types:
    - regulator_filing
    - scheme_disclosure
  jurisdictions: [NL, EU]
  recency: 90d
  min_total_sources: 12
  max_total_sources: 30
output:
  type: brief
  format: html_dual_mode
  exportable_to: [pdf]
stages:
  - { ord: 1, name: subject_resolve }
  - { ord: 2, name: problem_decompose, agent: researcher }
  - { ord: 3, name: search_strategy, agent: researcher }
  - { ord: 4, name: literature_collect, agent: researcher }
  - { ord: 5, name: literature_screen, gate: A }
  - { ord: 6, name: knowledge_extract, agent: researcher }
  - { ord: 7, name: synthesis, agent: synthesizer }
  - { ord: 8, name: hypothesis_gen, agent: synthesizer }
  - { ord: 9, name: experiment_design, gate: B, skip_if: brief }
  - { ord: 14, name: result_analysis, agent: synthesizer }
  - { ord: 15, name: research_decision, agent: synthesizer }
  - { ord: 16, name: outline, agent: writer }
  - { ord: 17, name: draft, agent: writer }
  - { ord: 18, name: peer_review, agent: critic }
  - { ord: 19, name: revision, agent: writer }
  - { ord: 20, name: principal_review, gate: C }
  - { ord: 22, name: export }
  - { ord: 23, name: citation_verify, agent: critic }
  - { ord: 24, name: reflection, agent: reflector }
cost:
  cap_usd: 25.00
  warning_threshold_usd: 18.00
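The cost block above pairs a warning threshold with a hard cap. The deterministic policy described for the cost checkpoint (emit cost.warning at threshold, halt at hard cap) can be sketched as follows. The names `CostAction` and `cost_fallback` are hypothetical, and treating the boundaries as inclusive is an assumption.

```python
from enum import Enum


class CostAction(Enum):
    CONTINUE = "continue"
    WARN = "warn"   # emit cost.warning and keep going
    HALT = "halt"   # hard cap reached, stop the run


def cost_fallback(
    spent_usd: float,
    warning_threshold_usd: float = 18.00,  # Brief default from the YAML above
    cap_usd: float = 25.00,                # Brief default from the YAML above
) -> CostAction:
    """Deterministic cost policy; boundaries are inclusive by assumption."""
    if spent_usd >= cap_usd:
        return CostAction.HALT
    if spent_usd >= warning_threshold_usd:
        return CostAction.WARN
    return CostAction.CONTINUE
```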
The Paper pipeline
On-demand long-form research articles. The principal types a topic, the Paper runs end-to-end (60-90 minutes typical wall clock) and publishes to sagarbharambe.com on gate C approval.
Paper uses all 25 stages including the experiment subsystem (stages 10 to 13). For quantitative analysis, the Synthesizer designs the analytical approach at stage 9, generates code at stage 10, plans resources at stage 11, runs the experiment in a sandboxed runner at stage 12, and iteratively refines based on the analysis at stage 13. Stage 14 produces the result analysis with a multi-perspective debate.
Output is Markdown with frontmatter, ready for sagarbharambe.com. Hyperlinked citations resolve to a references section. Internal cross-references work in standard Markdown anchor syntax. The publishing path is principal-supplied (REST API, git push, custom CMS).
name: paper
output:
  type: article
  format: markdown
  publish_to: sagarbharambe.com
# All 25 stages declared, including:
stages:
  # ... (1-8 same as brief) ...
  - { ord: 9, name: experiment_design, gate: B }
  - { ord: 10, name: code_generation, agent: synthesizer }
  - { ord: 11, name: resource_planning, agent: synthesizer }
  - { ord: 12, name: experiment_run, agent: synthesizer }
  - { ord: 13, name: iterative_refine, agent: synthesizer }
  - { ord: 14, name: result_analysis, agent: synthesizer }
  - { ord: 15, name: research_decision, agent: synthesizer }
  - { ord: 16, name: outline, agent: writer }
  - { ord: 17, name: draft, agent: writer }
  - { ord: 18, name: peer_review, agent: critic }
  - { ord: 19, name: revision, agent: writer }
  - { ord: 20, name: principal_review, gate: C }
  - { ord: 21, name: external_publish }
  - { ord: 22, name: export }
  - { ord: 23, name: citation_verify }
  - { ord: 24, name: reflection, agent: reflector }
cost:
  cap_usd: 25.00
  warning_threshold_usd: 20.00
Pipeline as primitive
Adding a new pipeline is a configuration change, not a code change. v2 candidates already drafted: regulatory monitoring (continuous), investment thesis stress-test (on-demand with adversarial debate), quarterly market wrap (scheduled monthly), deal/M&A memos (on-demand with comparable transactions), strategic competitor assessments (scheduled quarterly), org design choices (on-demand).
Each declares its cadence, sources, voice profile, gates, and output. The Orchestrator dispatches against any pipeline through the same code path. New specialist agents register via plugin. The build cost of a new pipeline is roughly 1-2 weeks: author the YAML, register specialist agents if novel, register the artefact renderer if novel, run an end-to-end test against a fixture topic.
This is the architectural commitment that makes Saigar future-proof: the abstraction over pipeline definition rather than the proliferation of pipeline-specific code.
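To make "one code path for any pipeline" concrete, here is a hedged sketch of a stage walker. The `Stage` fields mirror the keys in the YAML stage entries (`ord`, `name`, `agent`, `gate`, `skip_if`); reading `skip_if` as "skip when the value equals the pipeline name" is our assumption, and `plan_run` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    ord: int
    name: str
    agent: Optional[str] = None
    gate: Optional[str] = None
    skip_if: Optional[str] = None  # assumed to hold a pipeline name


def plan_run(pipeline_name: str, stages: list[Stage]) -> list[str]:
    """Walk declared stages in ord order and list what the harness would do."""
    steps = []
    for stage in sorted(stages, key=lambda s: s.ord):
        if stage.skip_if == pipeline_name:
            steps.append(f"skip {stage.ord}:{stage.name}")
            continue
        steps.append(f"run {stage.ord}:{stage.name}")
        if stage.gate:
            # A gate pauses the run until the principal decides.
            steps.append(f"gate {stage.gate}")
    return steps


# A four-stage excerpt, enough to show the same walker serving two pipelines.
EXCERPT = [
    Stage(8, "hypothesis_gen", agent="synthesizer"),
    Stage(9, "experiment_design", gate="B", skip_if="brief"),
    Stage(14, "result_analysis", agent="synthesizer"),
    Stage(20, "principal_review", gate="C"),
]
```

Calling `plan_run("brief", EXCERPT)` skips stage 9, while `plan_run("paper", EXCERPT)` runs it and opens gate B; nothing in the walker knows which artefact it is producing.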
Eighteen specialists. No generalist.
Each agent owns specific stages, a state machine, and a failure taxonomy. The Orchestrator coordinates; the others execute. Eight senior generalists run on every pipeline; ten junior specialists run on Wire and Predictive Intel only. Click any senior agent for the deep panel.
Eight senior generalists
Active on every pipeline.
Orchestrator. The harness. Owns the run state machine, dispatches to agents, manages gates, enforces cost. Six LLM-mediated reasoning checkpoints with deterministic fallback. The most architecturally important agent.
Researcher. Reads the world. Decomposes the subject into sub-questions, executes searches across OpenAlex/S2/arXiv/web/RSS, screens for relevance, extracts structured knowledge cards.
Synthesizer. The analytical core. Multi-agent debate at stage 8 (3 positions × 2 rounds + synthesis = 7 calls). Stage 15 PROCEED/REFINE/PIVOT verdicts. The experiment subsystem for Paper runs.
Critic. The agent that says no. Three reviewer personas at stage 18 (Evidence, Voice, Structure). Four-layer citation verification at stage 23. Voice constraint third checkpoint.
Writer. Authors the artefact. Article (Markdown) or Brief (SlidePlan). Voice constraint first checkpoint. Cross-reference authoring. The agent the principal sees most.
Reflector. The closing-the-loop agent. Stage 24 reads the run trace and produces structured learning_signal records that future runs consume. Principal-supervised on the /learning view.
Designer. The form-follows-evidence agent. Stage 25 picks one of seven archetypes (regulatory_dispatch, data_deep_dive, comparison, explainer, opinion, decision_tree, wind_tunnel) and emits a layout_plan that weaves block components into Writer sections. Opus 4.7. Gate D approves before publish.
Publisher. The shipping agent. Stage 27 serialises approved sections to MDX, applies the Designer layout plan, and pushes to the personal-site repo with figure binaries. The GitHub Action deploys; desk.sagarbharambe.com goes live.
Three wire specialists
Active only on the morning Wire snippet pipeline. Each tuned to the 95-110 word format.
Wire curator. Picks up to five items from the last 24 hours of regulator filings, central-bank releases, working papers, and payments-domain news. Scores materiality against principal interest signals; dedupes against the prior week. One per channel per day, hard cap.
Wire drafter. Writes one 95 to 110 word note per curated item: a tight kicker line, the substance, the why-it-matters. One source per snippet, dated, in voice. Target band is enforced by a hard validator before handoff.
Wire critic. Scans every snippet for voice violations, citation discipline, and band drift. Returns approved or revise-with-reason. Softer than the senior Critic on style nits because the format is short, harder on factual claims because there is nowhere to hide.
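The hard validator on the 95 to 110 word band can be sketched in a few lines. Counting words by whitespace splitting is an assumption (the site does not say how hyphenated terms or figures are counted), and `in_wire_band` is a hypothetical name.

```python
def in_wire_band(note: str, low: int = 95, high: int = 110) -> bool:
    """True when the note's whitespace-delimited word count sits inside the band."""
    return low <= len(note.split()) <= high
```

A drafter loop would call this before handoff and regenerate on `False`, so the critic only ever sees notes that are already the right length.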
Seven predictive-intel specialists
Active only on the quarterly company-assessment pipeline. Together they produce the bear/bull red-team and the ratification scorecard.
Company researcher. Reads everything filed by one specific company. Pulls the last four quarters of SEC filings, IR feeds, regulatory disclosures, and major press releases for the subject company. Hydrates thin bodies via direct fetch when the channel index does not carry full text.
Metrics extractor. Lifts a standard KPI schema from the filings: revenue, TPV, take rate, net revenue margin, geographic split, year-on-year growth, guidance. The structured output is the ground truth that next quarter's ratification agent grades against.
Ratification agent. Skipped on edition one. From edition two onward, loads the prior edition's predictions and grades each as confirmed, partial, wrong, or too-early against current metrics. The scorecard opens the report. Self-correcting research, on the record.
Bear analyst. Argues the bear case in its own voice. Generates four to ten falsifiable predictions, each with a one-sentence basis. Adversarial by construction; the system prompt forbids hedging. The bear and the bull never see each other's drafts.
Bull analyst. Mirror of the bear. Argues the bull case in its own voice with its own four to ten falsifiable predictions plus key catalysts. Independence is the design point; if both agree on a prediction, the reconciler treats that as a high-confidence house view.
Reconciler. Reads both red-team drafts and writes the house view. Surfaces the predictions where the two disagreed (six to eight key disagreements per edition) and adjudicates each. The output is a single integrated synthesis with the structured prediction set that next quarter's ratification agent will grade.
Reads the revised draft and writes the headline. Atelier register, declarative, no question-mark titles, no buzzwords. One-shot Haiku call; output is the title that lands on desk.sagarbharambe.com. Prevents the "Q2 2026 Assessment" generic fallback from ever being the published title.
How an agent invocation actually works
Every agent invocation is a structured exchange: typed task packet in, typed result out, recorded.
The Orchestrator builds a task packet from the run state, the pipeline definition, the principal's interest signals, and any active learning_signal rows that scope-match. The agent receives the packet, validates it against its Pydantic schema, and refuses fast on mismatch. The agent makes one or more LLM calls, validates each response against its output schema, and persists the result.
If the agent fails, the Orchestrator's recovery routing checkpoint reasons over the failure code, the failure history of this run, and active signals about this failure pattern. Recovery options range from retry-same-config to spawn-pivot-child to escalate-to-principal. Each option has a deterministic fallback if the LLM call fails.
Every step is recorded: the system prompt hash, the request, the response, the cost, the latency, the consumed signals. This is the audit trail that makes "why did Saigar say this" answerable.
from typing import Literal
from uuid import UUID

from pydantic import BaseModel, Field

# SubQuestion, VoiceProfile, LearningSignalRef, FindingCluster, Pattern,
# Contradiction, and KnowledgeGap are models defined elsewhere in the harness.

# Typed task packet (example: stage 7 synthesis)
class SynthesisTask(BaseModel):
    schema_version: Literal["1.0"]
    run_id: UUID
    knowledge_cards: list[UUID]
    sub_questions: list[SubQuestion]
    voice_profile: VoiceProfile
    learning_signals: list[LearningSignalRef] = Field(default_factory=list)

# Typed output
class SynthesisOutput(BaseModel):
    schema_version: Literal["1.0"]
    clusters: list[FindingCluster]
    patterns: list[Pattern]
    contradictions: list[Contradiction]
    gaps: list[KnowledgeGap]
    confidence: float

# Schema mismatch is a SEVERE FAILURE.
# No silent degradation, no defaults, no
# autocorrection. Validation is strict.
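The "every step is recorded" commitment can be illustrated with a small audit record. This is a sketch under stated assumptions: the field names follow the list in the text (system prompt hash, request, response, cost, latency, consumed signals), SHA-256 is our choice of hash, and `InvocationRecord`, `prompt_hash`, and `record_invocation` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class InvocationRecord:
    run_id: str
    stage_ord: int
    system_prompt_hash: str
    request: str
    response: str
    cost_usd: float
    latency_ms: int
    consumed_signal_ids: tuple = ()


def prompt_hash(system_prompt: str) -> str:
    """Stable fingerprint of a system prompt (SHA-256 is an assumption)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def record_invocation(run_id, stage_ord, system_prompt, request, response,
                      cost_usd, latency_ms, consumed_signal_ids=()):
    """Build the immutable record for one LLM call; persistence is out of scope."""
    return InvocationRecord(
        run_id=run_id,
        stage_ord=stage_ord,
        system_prompt_hash=prompt_hash(system_prompt),
        request=request,
        response=response,
        cost_usd=cost_usd,
        latency_ms=latency_ms,
        consumed_signal_ids=tuple(consumed_signal_ids),
    )
```

Storing the hash instead of the full prompt keeps rows small while still letting two runs be compared for prompt drift.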
Orchestrator
The Orchestrator runs every registered pipeline by walking its declared stages, dispatching tasks, opening gates, enforcing cost, and emitting events. It is the only component that knows what a pipeline is.
The Orchestrator is a hybrid. The run lifecycle is a deterministic state machine: pending → running → gated → running → complete. The judgement-heavy decisions are LLM-mediated at six named checkpoints with deterministic fallback. The combination is the real architectural innovation.
The six reasoning checkpoints
Each checkpoint produces a structured decision recorded to orchestrator_decision with the input context, the LLM response, and the deterministic fallback that would have run otherwise. Per-checkpoint disable flag, per-checkpoint cost ceiling (€0.26), per-checkpoint timeout (30s).
Reads the subject, the principal's interest signals, the corpus state, the pipeline definition. Produces: estimated cost band, recommended cost cap, model routing overrides per stage, source budget override, risk flags.
A subject the principal has covered three times before doesn't need full source breadth; the Researcher can run with min_total_sources=8 instead of 12. A controversial topic benefits from Opus on the synthesis stage. A quantitative subject benefits from a higher source budget. The deterministic fallback (use pipeline defaults) works fine; the LLM checkpoint makes the choice better when it works.
Fallback: use pipeline defaults across the board.
When a stage fails, instead of the hardcoded recipe the Orchestrator reasons over the failure code, the failure history of this run, and active signals about this failure pattern. Output: retry-same / retry-modified / skip / escalate / abort-with-partial.
A second timeout on stage 6 extraction with a 200kb source body gets a smarter response than "retry with same config." The Orchestrator can suggest "chunk the source into halves and extract each independently."
Fallback: the deterministic recovery recipes from the failure taxonomy.
After declarative skip_if evaluation, the Orchestrator reasons over whether the stage can be skipped or partially executed given prior outputs and corpus state. Conservative: confidence threshold for skip is 0.7 (higher than the default 0.55).
Stage 6 extraction can be partial when 60% of the candidate sources are already extracted from prior runs on the same subject. The Orchestrator can decide to re-extract only the new 40%, saving cost and time.
Fallback: respect declarative skip_if only.
When crossing the cost warning threshold, instead of binary continue-or-halt, the Orchestrator evaluates: continue, skip cosmetic stages (second revision, citation re-verify), halt with current draft, or request principal cap raise.
The deterministic fallback halts at hard cap regardless. The reasoning checkpoint can decide that skipping a second revision pass leaves the artefact 95% as good for 60% of the remaining cost. Or it can recommend escalating to the principal for a cap raise rather than abandoning.
Fallback: emit cost.warning at threshold, halt at hard cap.
Second-opinion on the Synthesizer's PROCEED/REFINE/PIVOT verdict. The Orchestrator can downgrade (PIVOT → REFINE → PROCEED) but cannot upgrade. Considers pivot count, remaining budget, principal preference for assertiveness.
The Synthesizer reasons within a stage; the Orchestrator reasons across the run. The Synthesizer might want to PIVOT on marginal evidence; the Orchestrator can see "we're at pivot count 1 with 60% of cost cap consumed" and override to REFINE. The downgrade-only constraint prevents the Orchestrator from manufacturing pivots the Synthesizer didn't see grounds for.
Fallback: affirm the Synthesizer's verdict.
Decides whether stage 24 reflection runs at all (some runs carry too thin a signal to be worth learning from), what signal kinds to expect, which existing signals might be superseded. Routes priority for the Reflector.
A Brief on a familiar topic with no surprises, no Critic findings, no principal overrides has thin signal; running stage 24 produces low-quality reflection that pollutes the signal space. A Paper that pivoted twice and was edited at gate C has rich signal worth processing. The reflection routing checkpoint lets the Orchestrator give the Reflector the right context.
Fallback: always run stage 24 with default scope.
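Among the checkpoints above, the downgrade-only rule on the Synthesizer's verdict is simple enough to state as code. A minimal sketch, assuming the three verdicts are ordered PROCEED < REFINE < PIVOT by how drastic they are; `Verdict` and `apply_second_opinion` are hypothetical names.

```python
from enum import IntEnum


class Verdict(IntEnum):
    PROCEED = 0
    REFINE = 1
    PIVOT = 2


def apply_second_opinion(synthesizer: Verdict, orchestrator: Verdict) -> Verdict:
    """Keep the less drastic of the two verdicts.

    The Orchestrator can pull PIVOT down to REFINE or PROCEED, but an attempt
    to go above the Synthesizer's verdict is ignored.
    """
    return min(synthesizer, orchestrator)
```

Because the result can never exceed the Synthesizer's verdict, the Orchestrator cannot manufacture a pivot, which is the property the text asks for.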
The state machine
Run states: pending, running, gated, paused, complete, failed, cancelled. Transitions are guarded by Postgres CHECK constraints; the database refuses invalid transitions. The Orchestrator must be idempotent across crashes; on restart, it finds the highest-ord stage with status complete-or-skipped and resumes from the next ord. State changes write to Postgres before the next step.
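The resume rule and a transition guard can be sketched together. The resume logic follows the text (highest-ord stage that is complete or skipped, then the next declared ord). The `ALLOWED` table is an assumption: the text lists the states and the happy path only, so the other edges are illustrative.

```python
from typing import Optional

# Assumed transition table; only pending -> running -> gated -> running -> complete
# is spelled out in the text.
ALLOWED = {
    "pending": {"running", "cancelled"},
    "running": {"gated", "paused", "complete", "failed", "cancelled"},
    "gated": {"running", "cancelled"},
    "paused": {"running", "cancelled"},
    "complete": set(),
    "failed": set(),
    "cancelled": set(),
}


def can_transition(current: str, target: str) -> bool:
    """True when the move is in the (assumed) transition table."""
    return target in ALLOWED.get(current, set())


def resume_ord(stages: list[tuple[int, str]]) -> Optional[int]:
    """Given (ord, status) pairs, return the ord to resume from after a crash.

    Returns None when every declared stage is already complete or skipped.
    """
    done = [o for o, status in stages if status in {"complete", "skipped"}]
    last_done = max(done, default=0)
    remaining = sorted(o for o, _ in stages if o > last_done)
    return remaining[0] if remaining else None
```

Note that "next ord" means the next declared ord, so a Brief whose stage 9 was skipped resumes at 14, not 10.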
The kill switch
If a checkpoint is producing bad decisions, the principal can disable it from the Desk's settings. A disabled checkpoint always uses its fallback. This is the safety valve: when reasoning calibration drifts, fall back to deterministic policy while investigating.
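The pattern behind every checkpoint (try the LLM, fall back deterministically, honour the disable flag) can be sketched as a small wrapper. The names `Decision` and `run_checkpoint` are hypothetical, and the cost ceiling and timeout mentioned earlier are left out of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass
class Decision(Generic[T]):
    value: T
    source: str   # "llm" or "fallback"
    reason: str


def run_checkpoint(
    enabled: bool,
    reason_with_llm: Callable[[], T],
    fallback: Callable[[], T],
) -> Decision[T]:
    """Return the LLM decision when possible, the deterministic one otherwise."""
    if not enabled:
        # Kill switch: a disabled checkpoint never calls the model.
        return Decision(fallback(), "fallback", "checkpoint disabled")
    try:
        return Decision(reason_with_llm(), "llm", "ok")
    except Exception as exc:
        # Any failure of the reasoning call degrades to the fixed policy.
        return Decision(fallback(), "fallback", f"llm error: {exc}")
```

Recording `source` alongside the value is what lets a later review see how often a checkpoint actually reasoned and how often it fell back.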
Researcher
The Researcher reads the world for the run. Multi-source retrieval, source ranking, screening, structured knowledge extraction.
Five stages, one mission: turn a subject into a knowledge graph the Synthesizer can reason over. The Researcher is also the read-side gateway for the system; nothing else fetches from external sources.
Stage 2 · Problem decompose
One Sonnet call. Reads the subject, the pipeline source profile, the principal's interest signals, the voice profile. Produces 3 to 7 sub-questions, 2 to 3 framing angles, an explicit out-of-scope list. The decomposition shapes everything downstream.
Stage 3 · Search strategy
One Sonnet call producing query specifications per channel. OpenAlex queries (keywords, year range, type filters), arXiv queries with category filters, web search queries (3-6 specific phrasings, not paraphrases), regulator feed selectors. Each query annotated with the sub-question(s) it targets.
Stage 4 · Literature collect
The mechanical heavy-lifting stage. Per channel: issue queries with rate-limit awareness, parse responses into normalised CandidateSource shape, dedup against existing source rows, insert new sources, fetch full bodies where available, chunk, embed (Anthropic embeddings primary, OpenAI text-embedding-3-small as fallback). Failed channels degrade gracefully; circuit breaker opens after 5 consecutive failures within 60 seconds, stays open for 5 minutes.
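The circuit breaker numbers above (5 consecutive failures within 60 seconds, open for 5 minutes) map onto a small class. This is a sketch, not the project's code: `CircuitBreaker` and its method names are hypothetical, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, window_s: float = 60.0,
                 open_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self.clock = clock
        self._failures: list[float] = []
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when the channel may be called."""
        if self._opened_at is None:
            return True
        if self.clock() - self._opened_at >= self.open_s:
            # Cool-down elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_success(self) -> None:
        # A success breaks the run of consecutive failures.
        self._failures.clear()

    def record_failure(self) -> None:
        now = self.clock()
        # Keep only failures that still fall inside the rolling window.
        self._failures = [t for t in self._failures if now - t <= self.window_s]
        self._failures.append(now)
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
```

One breaker per channel keeps a failing arXiv endpoint from slowing OpenAlex or RSS collection in the same run.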
Stage 5 · Literature screen (Gate A)
Per candidate, a Haiku-tier call producing a relevance score and one-line rationale per sub-question. Below threshold or matching the out-of-scope list → rejected. Cost-conscious: well under one cent per source. If retained count drops below min_total_sources, emit source.insufficient and trigger search-strategy refinement (jump back to stage 3).
Gate A: principal sees retained vs rejected sets and approves before stage 6 proceeds. Auto-advance after 12 hours.
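The screening decision itself is deterministic once scores exist, and can be sketched as below. Assumptions are labelled: the per-sub-question scores are collapsed into one `relevance` number, the threshold value is a caller-supplied parameter because the text does not state it, and `Candidate`, `ScreenResult`, and `screen` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    source_id: str
    relevance: float           # collapsed score; the text scores per sub-question
    out_of_scope: bool = False


@dataclass(frozen=True)
class ScreenResult:
    retained: list
    rejected: list
    events: list
    next_stage: int            # 6 (after Gate A approval) or 3 (refine the search)


def screen(candidates: list, threshold: float, min_total_sources: int) -> ScreenResult:
    """Split candidates and decide whether the search strategy must be refined."""
    retained, rejected = [], []
    for c in candidates:
        keep = c.relevance >= threshold and not c.out_of_scope
        (retained if keep else rejected).append(c.source_id)
    if len(retained) < min_total_sources:
        return ScreenResult(retained, rejected, ["source.insufficient"], 3)
    return ScreenResult(retained, rejected, [], 6)
```

The retained and rejected lists are exactly what Gate A shows the principal, so the same structure serves both the harness and the review screen.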
Stage 6 · Knowledge extract
Per retained source, a Sonnet call (Haiku for short sources). Produces a KnowledgeCard with discrete findings, entity mentions, quantitative claims, temporal anchors, methodology notes, observed biases, extraction confidence. The KnowledgeCard collection is the knowledge graph the Synthesizer reads.
Channels in v1
OpenAlex
Free, broad coverage. Primary academic channel.
Semantic Scholar
Better quality scoring, optional API key.
arXiv
Free, current preprints.
Web search
Tavily / Brave / Anthropic web_search; resolved in Phase 2.
RSS
Feed parsing for press releases, regulator publications.
Scrape
Structured HTML extraction with selectors declared in pipeline YAML.
Continuous mode (v2 hook)
For pipelines with cadence.continuous.enabled: true (none in v1, regulatory pipeline in v2), the Researcher polls registered source feeds. Each cycle creates a short run that polls, ingests new items, applies materiality scoring, and produces zero or more corpus_update artefacts. The continuous trigger code path exists in v1 as a no-op; activating it requires only a YAML registration.
Synthesizer
Where independent claims become a synthesised position. The Synthesizer's verdicts shape what the Writer eventually says.
The Synthesizer's load-bearing innovation is the multi-agent debate at stage 8. Three debater personas argue from different framings, two rounds of cross-examination, then a synthesiser integrates. Seven Sonnet calls for one hypothesis. The principal's prior multi-agent debate work, made into a first-class stage.
Stage 8 · Hypothesis with multi-agent debate
import asyncio

async def hypothesis_gen_with_debate(self, task):
    # Round 1: each debater states its opening position, in parallel
    positions = await asyncio.gather(*[
        self.invoke_debater(task, position_index=i)
        for i in range(task.debate_config.positions)  # default 3
    ])
    # Later rounds: cross-examination. rounds counts the opening round too
    # (default 2), so the default is 3 + 3 debater calls plus 1 synthesis = 7.
    for round_idx in range(1, task.debate_config.rounds):
        positions = await asyncio.gather(*[
            self.invoke_debater_with_context(
                task, position_index=i, prior_positions=positions
            )
            for i in range(task.debate_config.positions)
        ])
    # Final phase: the synthesiser integrates the last set of positions
    consensus = await self.invoke_synthesiser(
        task, final_positions=positions
    )
    return Hypothesis(
        statement=consensus.statement,
        debate_positions=positions,
        consensus_rationale=consensus.rationale,
        confidence=consensus.confidence,
        falsifiability=consensus.falsifiability,
    )
Three debater personas as system prompts: Position 1 the conventional read (what consensus says), Position 2 the contrarian read (what dissenting evidence implies), Position 3 the structural read (what the underlying mechanics suggest). Round 2 lets each debater update positions in light of others' arguments. The synthesiser is a fourth call that integrates.
For pipelines with require_disconfirming_position true (v2 thesis stress-test), Position 2 is constrained to find the strongest disconfirming case. The debate is real: each debater receives the same KnowledgeCards but argues from its assigned framing.
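The seven-call figure follows from simple arithmetic, sketched here under one assumption: the two rounds include the opening statements as round 1, so each debater is called once per round and the synthesiser once at the end. The function name is hypothetical.

```python
def debate_call_count(positions=3, rounds=2):
    """Model calls for one hypothesis: one call per debater per round, plus synthesis."""
    return positions * rounds + 1
```

With the defaults this gives 3 × 2 + 1 = 7 calls, and raising either setting scales the cost linearly.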
Stages 10-13 · The experiment subsystem (Paper only)
For Paper pipelines that involve quantitative analysis or empirical claims, stages 10 to 13 mirror AutoResearchClaw's experiment subsystem. Code generation produces the analytical script. Resource planning sizes the run. Experiment run executes in a sandboxed Python runner with bounded time and memory. Iterative refine handles failures, partial results, and dimensional issues. The Synthesizer here is the strategic owner; the actual code execution happens in a separate runner process.
Brief skips this subsystem entirely (skip_if: brief in the pipeline YAML). Brief is synthesis-driven, not empirical.
Stage 15 · The PROCEED / REFINE / PIVOT decision
The decision the Orchestrator executes. Inputs: the hypothesis, the analysis, the remaining cost budget, the pivot count. Decision logic:
- PROCEED if hypothesis holds with high confidence and no critical surprises. Confidence threshold: 0.6.
- REFINE if results are partial, ambiguous, or methodologically weak. Jump to stage 13 with refinement instructions. Confidence threshold: 0.5.
- PIVOT if results refute the hypothesis or surface a meaningfully different finding. Spawn a child run with a new hypothesis. Capped at 2 pivots per chain. Confidence threshold: 0.65.
The Orchestrator's pivot evaluation checkpoint can downgrade the verdict (PIVOT → REFINE, REFINE → PROCEED) but cannot upgrade. This prevents manufactured pivots while preserving the Synthesizer's ability to flag genuine refutation.
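The downgrade-only rule can be sketched with an ordered enum. The names `Verdict` and `apply_checkpoint` are hypothetical. Two behaviours are assumptions rather than statements from the text: the checkpoint may downgrade by more than one step, and a PIVOT that would exceed the cap falls back to REFINE.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Ordered by how disruptive the verdict is to the run.
    PROCEED = 0
    REFINE = 1
    PIVOT = 2


MAX_PIVOTS_PER_CHAIN = 2  # from the text


def apply_checkpoint(synth_verdict, checkpoint_verdict, pivot_count):
    """Return the verdict the Orchestrator executes."""
    # The checkpoint may only downgrade, so take the less disruptive of the two.
    final = min(synth_verdict, checkpoint_verdict)
    # Assumption: once the pivot cap is reached, a PIVOT falls back to REFINE.
    if final is Verdict.PIVOT and pivot_count >= MAX_PIVOTS_PER_CHAIN:
        final = Verdict.REFINE
    return final
```

Taking the minimum is what makes an upgrade impossible: a checkpoint that returns PIVOT against a Synthesizer REFINE still yields REFINE.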
Critic
The agent that says no. Peer review, citation verification, voice constraint enforcement, materiality scoring. The Critic's function is to find what the Writer or Synthesizer missed.
Three reviewer personas at stage 18, four-layer citation verification at stage 23, third voice checkpoint when the Writer's first two fail. The Critic is the most conservative agent in Saigar; auto-admit happens only when the peer review score is high and citation verification is clean.
Stage 18 · Three-reviewer peer review
Three reviewer personas run in parallel:
Evidence reviewer
Cross-references every claim against the KnowledgeCards. Flags unsupported assertions, weak evidence chains, missing corroboration.
Voice reviewer
Scans for the principal's prose constraints: em-dashes, "not X but Y" framing, banned phrases, AI-slop patterns, Atelier register adherence.
Structure reviewer
Checks artefact coherence, the load-bearing of each section, whether the lead actually leads.
After three independent reviews, a synthesis call produces a unified PeerReview output: overall score 0-100, dimension scores, findings list with severity (blocker/major/minor/nit), recommendation (accept/revise/reject). Findings include location, excerpt, issue, and optional suggestion.
A revise recommendation triggers stage 19 (Writer revision); after revision, stage 18 runs again at most once. A second revise becomes reject, surfaced to the principal at gate C.
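The revise-once rule can be written as a small routing function. This is a sketch: the function name and the returned labels are hypothetical, and only the accept/revise/reject values and the one-revision limit come from the text.

```python
def next_step(recommendation, prior_revisions):
    """Map a stage 18 recommendation to what happens next."""
    if recommendation == "accept":
        return "proceed"
    if recommendation == "revise" and prior_revisions == 0:
        return "stage_19_revision"  # Writer revises, then stage 18 reruns once
    # A reject, or a second revise, goes to the principal at gate C.
    return "gate_c_reject"
```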
Stage 23 · Four-layer citation verification
Layer 1 · Primary identifier
Mechanical. arXiv ID resolvable, DOI valid via CrossRef, regulator filing reference resolvable. No LLM call.
Layer 2 · Title and metadata match
Mechanical. Semantic Scholar lookup verifies title and authors match the citation's metadata. No LLM call.
Layer 3 · Body presence check
Mechanical. The cited content exists in the source body at the specified position. Text matching, no LLM call.
Layer 4 · Relevance assessment
Haiku call. An LLM call confirms the source actually supports the claim being made. One Haiku call per citation. Threshold: relevance ≥ 0.6.
A citation passes if it clears layers 1-3 (mechanically verified) AND layer 4 (relevance ≥ 0.6). Failures: weak (verified but low relevance, removed and paragraph regenerated) or fabricated (failed layer 1 or 2, removed entirely, surfaced to principal). Cost: ~€0.00-0.00 per citation, cached by (source_id, claim_hash) so repeat citations across runs reuse the verdict.
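The pass, weak, and fabricated outcomes can be sketched as a pure function over the four layer results. `LayerResults`, `classify`, and `cache_key` are hypothetical names. Two details are assumptions: a citation that fails only layer 3 is treated as weak (the text does not say), and `claim_hash` is taken to be a SHA-256 of the whitespace-normalised, lower-cased claim.

```python
import hashlib
from dataclasses import dataclass

RELEVANCE_THRESHOLD = 0.6  # from the text


@dataclass
class LayerResults:
    identifier_ok: bool  # layer 1
    metadata_ok: bool    # layer 2
    body_ok: bool        # layer 3
    relevance: float     # layer 4 score


def classify(r):
    """Return 'pass', 'weak', or 'fabricated' for one citation."""
    if not (r.identifier_ok and r.metadata_ok):
        return "fabricated"  # failed layer 1 or 2
    if r.body_ok and r.relevance >= RELEVANCE_THRESHOLD:
        return "pass"
    return "weak"            # assumption: a layer 3 failure alone also lands here


def cache_key(source_id, claim):
    """Build the (source_id, claim_hash) key used to reuse a verdict across runs."""
    normalised = " ".join(claim.lower().split())
    return (source_id, hashlib.sha256(normalised.encode("utf-8")).hexdigest())
```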
Voice constraint third checkpoint
When the Writer's post-generation validator detects a voice violation that survives one regeneration, the Critic runs a focused review on the offending paragraph. Three options: auto_fix (mechanical replacement, e.g. em-dash → comma), accept_with_flag (borderline, surfaced to principal), block (unrecoverable, gate C rejection with violation).
v1's eval target: zero voice violations leak past the three checkpoints.
Writer
The agent the principal sees most. Every word that lands on the Desk's archive or sagarbharambe.com flows from the Writer. Voice is policy, not preference.
The Writer authors in two modes. Article mode produces Markdown for sagarbharambe.com publishing. Brief mode produces a SlidePlan that the renderer turns into dual-mode HTML. The Writer is the first checkpoint in the three-layer voice constraint enforcement.
Stage 16 · Outline / SlidePlan
For Article mode: one Sonnet call produces an ArticleOutline with 4-7 sections, each with purpose, findings to use, target word count.
For Brief mode: one Sonnet call produces a SlidePlan with 7-9 slides selected from the Brief Component Library catalogue (cover, TLDR, chart, mechanism, read, implications, sources, closing). The Writer maps findings to specific slides, identifies whether a mechanism diagram is appropriate, authors the cover headline and deck line.
Stage 17 · Section-by-section drafting
The draft is produced section-by-section, not in one call. Each section is one Sonnet invocation with the section's purpose, the findings to use, the voice constraints, the target word count, and the drafts of preceding sections (for coherence). For Brief mode, "section" is "slide". Cost: 6-12 Sonnet calls per Brief stage 17, ~€0.42-1.27.
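The section-by-section loop can be sketched as below. The model call is injected as `draft_section`, a hypothetical callable standing in for one Sonnet invocation. The outline keys (`purpose`, `findings`, `target_words`) are assumed names for the inputs the text lists.

```python
def draft_artefact(outline, voice_constraints, draft_section):
    """Draft each section in order, passing earlier drafts along for coherence."""
    drafts = []
    for section in outline:
        text = draft_section(
            purpose=section["purpose"],
            findings=section["findings"],
            voice_constraints=voice_constraints,
            target_words=section["target_words"],
            preceding=list(drafts),  # copy, so the callee cannot mutate our state
        )
        drafts.append(text)
    return drafts
```

Because each call sees the preceding drafts, the prompt grows with position in the artefact, which is one reason the per-call cost is a range.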
Voice constraint authoring (first checkpoint)
The Writer's system prompt incorporates voice constraints as hard rules:
Voice constraints (these are policy, not suggestions):
- Do not use em-dashes (—) anywhere. Use commas, colons, parentheses, or sentence breaks.
- Do not use en-dashes (–) anywhere.
- Do not use the construction "not X, but Y" or "not just X but Y". State the position directly.
- Do not use these phrases: "this is the hard truth", "doing the heavy lifting", "the real question is", "what most people miss", "here's the thing", "this is where it gets interesting", "let that sink in".
- ... [full list from voice_constraint table]
After generation, a deterministic post-generation validator scans for violations. If found, the Writer regenerates the offending paragraph once with the violation embedded as a forbidden example. If the regeneration still violates, the Critic's third checkpoint takes over.
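A deterministic validator of this kind can be sketched with plain string checks and one regular expression. The function name and violation labels are hypothetical, the phrase list is only the subset quoted above, and the "not X, but Y" pattern is a rough heuristic that will produce some false positives.

```python
import re

BANNED_PHRASES = [
    "this is the hard truth", "doing the heavy lifting", "the real question is",
    "what most people miss", "here's the thing",
    "this is where it gets interesting", "let that sink in",
]
# Rough pattern for "not X, but Y" / "not just X but Y" within one sentence.
NOT_BUT = re.compile(r"\bnot\b[^.!?]*?\bbut\b", re.IGNORECASE)


def find_violations(paragraph):
    """Return a list of violation labels; an empty list means the paragraph is clean."""
    violations = []
    if "\u2014" in paragraph:  # em-dash
        violations.append("em-dash")
    if "\u2013" in paragraph:  # en-dash
        violations.append("en-dash")
    if NOT_BUT.search(paragraph):
        violations.append("not-x-but-y")
    lowered = paragraph.lower()
    violations += [f"phrase:{p}" for p in BANNED_PHRASES if p in lowered]
    return violations
```

The returned labels could be embedded in the regeneration prompt as the forbidden example.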
Cross-reference authoring (v0.3)
The Writer authors inline reference markers between sections of the artefact:
- [see Mechanism, slide 4] for slide-level reference
- [see Chart: agent traffic share, slide 3] for component-level reference
- [Source: Mastercard Oct 2025 disclosure] for source reference
- [see "Mechanism analysis"](#mechanism-analysis) for Article anchor
Cardinality target: 2-5 internal references per Brief, not "every other sentence is a link." Cross-link only when the reference adds navigational value. Broken cross-references are flagged as minor Critic findings at stage 18.
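The cardinality target can be checked mechanically. This sketch assumes that internal references are the markers beginning with `[see `, as in the examples above, and that `[Source: ...]` markers do not count toward the 2-5 target. Both function names are hypothetical.

```python
import re

# Matches [see ...] markers; [Source: ...] markers are deliberately excluded.
INTERNAL_REF = re.compile(r"\[see [^\]]+\]")


def internal_reference_count(text):
    """Count internal cross-reference markers in a drafted artefact."""
    return len(INTERNAL_REF.findall(text))


def cardinality_ok(text, low=2, high=5):
    """True when the count falls inside the 2-5 target from the text."""
    return low <= internal_reference_count(text) <= high
```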
Expandable content rule
The Writer does not generate expanded_content reflexively. The system prompt instructs: only produce expanded content when there is substantively more to say that a serious reader would want. If the surface body says everything worth saying, leave expanded_content null. The render layer omits the "read more" affordance when the field is null. This prevents the lazy expansion failure mode where the Writer pads the brief with restated material.
Reflector
Saigar's cross-run improvement loop owner. After every artefact-producing run, the Reflector reads the trace and produces structured learning_signal records that future runs consume.
Without the Reflector, every Saigar run is a fresh start. With it, the system inherits taste over time: which framings the principal accepts, which sources behave consistently, which voice constraints get violated on which topic patterns. Principal-supervised throughout.
Six signal kinds in v1
topic_acceptance
Which topic patterns the principal accepts vs overrides at scheduled cycle proposals.
source_credibility
Per-publisher credibility patterns derived from citation verification outcomes and principal feedback.
debate_position_preference
Which debate positions the principal accepts in stage 8 outputs, scoped by topic pattern.
voice_violation_pattern
Which voice constraints get violated on which topic patterns, used to strengthen Writer enforcement.
critic_finding_pattern
Recurring finding categories that surface across runs (e.g., consistent unsupported claims about a topic).
cost_outcome
Cost-vs-quality patterns: when did high-cost stages produce better outcomes vs. when did frugal runs match.
Signal lifecycle
A signal is born from a Reflector stage 24 invocation, a recurring Critic finding pattern, or principal manual entry. It accumulates strength as more runs reinforce it (initial 0.4-0.6, capped at 1.0). It weakens when contradicting runs occur. It expires if not reinforced for 6 runs (default decay). It can be rejected, weakened, or strengthened by the principal at any time.
At task creation, each agent queries learning_signal for active rows where subject matches and scope matches the task context. Active signals condition the system prompt. Every agent_invocation records which signals it consumed in metadata.signals_consumed.
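The lifecycle can be sketched as a small state update per run. The names `Signal` and `observe_run` are hypothetical, and the step size of 0.1 is an assumption: the text gives the initial range, the cap, and the six-run decay, but not how much each run adds or removes.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    subject: str
    strength: float               # initial 0.4-0.6 per the text
    runs_since_reinforced: int = 0
    expires_after_runs: int = 6   # default decay from the text

    @property
    def active(self):
        return self.strength > 0 and self.runs_since_reinforced < self.expires_after_runs


def observe_run(signal, outcome, step=0.1):
    """Update a signal after one run; outcome is 'confirm', 'contradict', or 'neutral'."""
    if outcome == "confirm":
        signal.strength = min(1.0, signal.strength + step)  # capped at 1.0
        signal.runs_since_reinforced = 0
    else:
        if outcome == "contradict":
            signal.strength = max(0.0, signal.strength - step)
        signal.runs_since_reinforced += 1
    return signal
```

Only signals whose `active` property is true would be selected to condition an agent's system prompt.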
Drift mitigation: novelty injection
An opt-in flag in the principal's preferences causes each agent to ignore one randomly-selected matching signal per N runs (default off, principal-enabled per agent). This guards against Saigar collapsing toward narrow framings. The principal observes drift on the /learning view and enables novelty injection on specific agents as needed.
This was a deliberate addition. The risk of cross-run learning is over-fitting to past principal preferences in ways that narrow Saigar's range. The mitigation is supervised: novelty injection is opt-in, not default. The principal sees recent injections on the /learning view and can disable if injections are causing more harm than they prevent.
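Novelty injection can be sketched as a filter applied to the matching signals before they reach the prompt. The function name is hypothetical, and "one signal per N runs" is interpreted here as "on every Nth run", which is an assumption.

```python
import random


def signals_for_run(matching, run_number, every_n_runs, enabled=False, rng=random):
    """Return (signals to apply, signal ignored for novelty or None)."""
    if not enabled or not matching or run_number % every_n_runs != 0:
        return list(matching), None
    ignored = rng.choice(matching)  # one randomly-selected matching signal
    kept = [s for s in matching if s is not ignored]
    return kept, ignored
```

Returning the ignored signal alongside the kept ones is what would let the /learning view report which injection happened on which run.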
Sample signal
{
  "signal_kind": "debate_position_preference",
  "subject": "synthesizer",
  "scope": {
    "pipeline": "brief",
    "topic_pattern": "agentic commerce / payment infrastructure"
  },
  "observation": "On agentic commerce topics across 4 runs, the principal accepted structural framings (identity-layer focus) over rails-layer consensus by a margin of 3-1. Edits at gate C consistently strengthened the structural reading.",
  "prescription": {
    "weight_adjustment": {
      "position_3_structural": 0.15,
      "position_1_consensus": -0.10
    }
  },
  "initial_strength": 0.55,
  "expires_after_runs": 6,
  "rationale": "Four runs is enough to suggest a pattern but not enough for high confidence. Set strength mid-range; reinforce on next confirming run."
}
Five processes. One VPS. Postgres-as-everything.
Saigar runs on Hetzner CX31 in Falkenstein. Caddy, Next.js, FastAPI, Brief renderer, Worker. Postgres for queue, events, storage. EU residency throughout.
System topology
The five processes
Caddy
Reverse proxy with automatic TLS via Let's Encrypt. Routes to Next.js on /, FastAPI on /api/*, Brief renderer on internal port only.
Next.js
App Router. Server components for static surfaces. Client components only for interactive (gate approvals, live progression). SSE for run-level event streaming.
FastAPI
REST + SSE. Pydantic 2 for validation. SQLAlchemy 2 async with asyncpg. Authentication via passkey (WebAuthn) or bearer token for CLI.
Brief renderer
Internal-only service. Accepts SlidePlan JSON, validates against schema, renders to single self-contained HTML with bundled toggle and navigation JavaScript. Deterministic, cached by SlidePlan hash.
Worker
Long-running. Hosts Orchestrator + 6 agents + APScheduler + continuous trigger loop + event subscriber + nightly learning_signal sweeper. systemd-managed, auto-restart on crash.
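The Brief renderer's "cached by SlidePlan hash" property can be sketched as follows. The function names and the in-memory cache are hypothetical. The sketch assumes the hash is a SHA-256 over a canonical JSON serialisation, so that two plans differing only in key order share one cache entry.

```python
import hashlib
import json


def slideplan_hash(slide_plan):
    """Hash a SlidePlan dict so that key order and whitespace do not matter."""
    canonical = json.dumps(
        slide_plan, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


_cache = {}  # hash -> rendered HTML


def render_cached(slide_plan, render):
    """Render once per distinct SlidePlan; `render` stands in for the real renderer."""
    key = slideplan_hash(slide_plan)
    if key not in _cache:
        _cache[key] = render(slide_plan)
    return _cache[key]
```

Caching by content hash is only safe because the renderer is deterministic: the same plan must always produce the same HTML.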
Why no Redis, no Celery
v1 single-user load is trivial: a few runs per week. One process is simpler than a distributed worker pool.
The queue is implemented via Postgres SELECT FOR UPDATE SKIP LOCKED (the pgmq pattern). LISTEN/NOTIFY drives the event bus. The scheduler runs in-process. No Redis to operate, no Celery workers to scale, no broker to monitor.
The Orchestrator's interface to the queue is abstracted (a JobQueue interface). If v2 or v3 load demands it, the worker can be split into queue-consumer instances and the queue migrated to Redis as a configuration change. v1 doesn't need it.
# In-process queue via Postgres
async def claim_next_task(conn):
    # Claim and mark in one transaction so no two workers take the same task.
    async with conn.transaction():
        row = await conn.fetchrow("""
            SELECT id, run_id, stage_ord, payload
            FROM task
            WHERE status = 'pending'
            ORDER BY priority DESC, created_at ASC
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        """)
        if row:
            await conn.execute(
                "UPDATE task SET status = 'claimed' WHERE id = $1",
                row['id']
            )
    return row

# Event bus via LISTEN/NOTIFY
async def emit_event(conn, event):
    # Event row, outbox row, and notification commit together or not at all.
    async with conn.transaction():
        await conn.execute(
            "INSERT INTO event ... VALUES ...",
            ...
        )
        await conn.execute(
            "INSERT INTO event_outbox ... VALUES ...",
            ...
        )
        # pg_notify() accepts bound parameters; a bare NOTIFY statement does not,
        # and interpolating values into the SQL string invites injection.
        await conn.execute(
            "SELECT pg_notify($1, $2)",
            event.channel, str(event.id)
        )
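The JobQueue abstraction mentioned above could look like the sketch below. The method names `enqueue` and `claim_next` are assumptions, and the in-memory class is a toy backend for illustration, not the Postgres implementation. Its ordering mirrors the priority DESC, created_at ASC rule in the claim query.

```python
import asyncio
from typing import Optional, Protocol


class JobQueue(Protocol):
    """What the Orchestrator would depend on, whatever the backend."""
    async def enqueue(self, payload: dict, priority: int = 0) -> None: ...
    async def claim_next(self) -> Optional[dict]: ...


class InMemoryJobQueue:
    """Toy backend; a Postgres or Redis backend would expose the same two methods."""

    def __init__(self):
        self._items = []
        self._seq = 0

    async def enqueue(self, payload, priority=0):
        # Store (-priority, sequence) so sorting yields priority DESC, then FIFO.
        self._items.append((-priority, self._seq, payload))
        self._seq += 1

    async def claim_next(self):
        if not self._items:
            return None
        self._items.sort(key=lambda item: item[:2])
        return self._items.pop(0)[2]
```

Because callers see only the two methods, swapping the backend stays a wiring change rather than a rewrite of the Orchestrator.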
Cost model
Per-run cost composition
| Component | Brief | Paper |
|---|---|---|
| Researcher | €1.27-3.40 | €2.55-6.80 |
| Synthesizer | €1.27-2.55 | €2.55-5.10 |
| Critic | €0.42-0.85 | €0.68-1.27 |
| Writer | €0.68-1.70 | €0.51-1.27 |
| Orchestrator (incl. 6 reasoning checkpoints) | €0.42-1.02 | €0.51-1.19 |
| Reflector (stage 24) | €0.34-0.68 | €0.42-0.85 |
| Total per run | €4.42-10.20 | €7.22-16.49 |
Disaster recovery posture
RTO 4 hours
From disaster to running Saigar. Provision new VPS, restore latest dump, replay WAL.
RPO 24 hours
Source data loss bound. Run history loss bound: 7 days. Published artefacts: near-zero (they live on sagarbharambe.com).
Quarterly tested
Test restore to a separate VPS verifies procedures work. The RTO and RPO targets are not heroic: v1 is single-user, so downtime is annoying, not catastrophic.
The system inherits taste.
Saigar's cross-run improvement loop. The Reflector at stage 24 produces structured signals. Future runs consume them. The principal supervises throughout.
The three-phase lifecycle
Observation
During every run, signals are captured passively: principal feedback, gate decisions, citation outcomes, Critic findings, cost-vs-quality outcomes, topic acceptance vs override. No special instrumentation; these flow into existing tables.
Prescription
At stage 24, the Reflector reads the run trace, queries existing signals, and produces zero or more new learning_signal rows or updates to existing ones. Deduplicates: similar signals reinforce rather than duplicate.
Consumption
At task creation, each agent queries learning_signal for relevant active rows and incorporates them into the system prompt. Every agent_invocation records consumed signals for audit.
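The "reinforce rather than duplicate" rule in the Prescription phase can be sketched as an upsert. This is a simplification under stated assumptions: "similar" is reduced to an identical (kind, subject, scope) key, the reinforcement step of 0.1 is invented for illustration, and a plain dict stands in for the learning_signal table.

```python
def upsert_signal(store, kind, subject, scope, strength, step=0.1):
    """Insert a new signal, or reinforce the existing one with the same key."""
    # Scope dicts are not hashable, so freeze them into a sorted tuple of items.
    key = (kind, subject, tuple(sorted(scope.items())))
    if key in store:
        store[key] = min(1.0, store[key] + step)  # reinforce, capped at 1.0
    else:
        store[key] = strength
    return store[key]
```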
The /learning view
The principal's supervision surface for learning_signal rows. Every active signal is reviewable. The principal can accept, reject, strengthen, weaken, or expire any signal at any time.
Active learning signals
3 unreviewed · 18 active · 2 expiring this week
Structural framings on agentic commerce topics
On agentic commerce topics across 4 runs, the principal accepted structural framings (identity-layer focus) over rails-layer framings by a margin of 3-1.
Network agent-pay announcements overstate transaction volumes
Across 3 runs, network press releases on Agent Pay and Intelligent Commerce transaction counts have been corroborated at 60-80% of stated figures by independent reporting. Down-weight in screening.
'Doing the heavy lifting' violations on financial topics
3 violations across 2 runs. Strengthen this constraint's emphasis in Writer system prompt for financial pipelines.
Agentic commerce topics accepted; pure crypto topics overridden
The principal has accepted 4 of 4 agentic commerce topic proposals; overridden 3 of 3 pure crypto topic proposals in favor of payments-AI intersection topics.
The novelty injection toggle
Drift mitigation, opt-in per agent.
The risk of cross-run learning is over-fitting to past principal preferences in ways that narrow Saigar's range. After ten runs of contrarian framings on payments, will Saigar still produce a non-contrarian framing when the evidence genuinely calls for one?
The mitigation is the novelty injection toggle. When enabled per agent, the agent ignores one randomly-selected matching signal per N runs. The principal sees recent injections on the /learning view: "On run 47, the Synthesizer ignored signal X (debate position preference on payments topics)."
Default: off across all agents. The principal observes drift on /learning and enables novelty injection where useful. Most likely first candidate: the Topic Ranker, where drift toward past topic preferences is most visible.
The reference artefact.
Brief 47 on agentic commerce identity infrastructure. The visual reference for the spec set, the Phase 2 acceptance test for the build, the principal's own first published Brief.
What follows is a condensed, embedded preview. The full Brief is a single self-contained HTML file (~250 KB) with mode toggle, internal cross-references, animated mechanism diagram, expandable sections.
Bi-weekly · Edition 47 · 02 May 2026
The agentic commerce question, settled at the wrong layer.
Why platform-vs-network framing misses the actual contest, and what Web Bot Auth's quiet adoption tells you about where trust will live in 2026 and 2027.
TL;DR
Volume · Production
Five concrete agentic commerce deployments live by April 2026: Mastercard Agent Pay, Visa Intelligent Commerce, Amazon Buy for Me, Coinbase x402, and the ChatGPT retailer apps that replaced Instant Checkout in March. The pilot phase is over.
Mechanism · Latency
Three settlement paths exist in production. Browser-driven flows trip legacy fraud loops at ~3 seconds. Network-tokenized routes (Agentic Tokens, TAP) settle at ~1.5 seconds. Agent-native protocols (ACP, UCP) settle at under one second. The differential is structural.
Trajectory · Identity
Web Bot Auth (Cloudflare-led, built on HTTP Message Signatures, IETF RFC 9421) has quietly become the substrate every commercial protocol builds on. The networks tokenise above it. The platforms compose protocols above it. Whoever attests agent identity defines the trust assumption.
How the three agent-checkout paths actually settle
Body · Where this lands
The agentic commerce question through 2024 and into early 2025 was whether this was real. Whether AI agents could actually be trusted to spend money. Whether merchants would accept traffic that did not come from humans. Whether fraud engines could even tell the difference. By September 29, 2025, when the first live agentic transaction settled on Mastercard's network, that question had a public answer. The first wave of pilots converted to production through Q4 2025 and Q1 2026. Five concrete deployments now exist: Mastercard Agent Pay, Visa Intelligent Commerce, Amazon Buy for Me, Coinbase x402, and the ChatGPT retailer apps that replaced Instant Checkout in March 2026.
The volume question is settled. The interesting question moved up the stack.
The party that controls agent identity attestation defines the trust assumption everything else builds on. Through 2026, that party is converging on Cloudflare's Web Bot Auth standard, and almost nobody is talking about it.
Read 1 · Mastercard's disclosure, October 2025
On the Q3 earnings call, Mastercard CEO Michael Miebach told analysts that the company processed its "first agentic transaction" during the quarter. The transaction settled September 29, 2025. He committed Mastercard to being "at the center" of agentic commerce going forward. The infrastructure positioning was specific: Ethoca's real-time dispute data and Mastercard Threat Intelligence as the fraud-prevention layer beneath Agent Pay, Mastercard Agentic Tokens scoped per agent and per session as the credential model, Web Bot Auth as the underlying agent identity layer.
The strategic read is straightforward. Mastercard is betting that consumers will trust the network's brand over the LLM vendor's brand to adjudicate what an agent can and cannot do with their money. The Microsoft Copilot Checkout integration, announced in parallel, is the validation channel for this thesis. If consumers click "buy with Mastercard Agent Pay" inside Copilot more often than they grant ACP-flavoured authorisations inside ChatGPT, the network wins the trust positioning. If they do not, the platforms do.
The bet worth watching: Mastercard's Multi-Token Network and the Chainlink tie-up are positioning the network as the trusted conversion layer between fiat card acceptance and onchain settlement. Stablecoin-native agents (Coinbase Agent.market with x402, Stripe and Tempo's MPP) operate on rails Mastercard does not control. The Multi-Token Network is the bridge that keeps Mastercard relevant in agent-to-agent flows. It is also the most architecturally interesting product the network shipped in 2025.
Read 2 · Google's UCP launch, NRF January 2026
The Universal Commerce Protocol launched on January 12, 2026 at NRF with a deliberate choice of partners. Etsy, Shopify, Target, Wayfair, and Walmart on the merchant side. Adyen, American Express, Mastercard, PayPal, Stripe, Visa, and Worldpay on the payments side. The protocol composes over Google's earlier A2A and AP2 (Agent Payments Protocol) and over Anthropic's MCP, which had been donated to the Linux Foundation's Agentic AI Foundation in December 2025. UCP initially powers a checkout feature inside Google's AI Mode and the Gemini app, using Google Pay and payment methods saved in Google Wallet, with PayPal arriving in the months following.
Two things are worth noting about the launch. First, every major card network endorsed UCP while each was also actively building its own protocol. Visa has TAP. Mastercard has Agent Pay's Acceptance Framework. Both networks publicly endorse UCP because the alternative, refusing to interoperate with the platform that owns the consumer agent surface, is worse than the cost of supporting a competing standard. The endorsement is hedging.
Second, Google launched UCP with the merchants. OpenAI launched Instant Checkout without them, beyond Etsy and a handful of Shopify partners. By March 2026, OpenAI shuttered Instant Checkout and replaced it with retailer apps inside ChatGPT that route payments through merchant-native checkout. The protocol-led model hit the merchant willingness-to-integrate boundary at single-digit merchant counts. The merchant-led model launched with five anchor retailers already integrated. This is a positioning lesson, not a technology one. A protocol with five major retailers behind it on day one creates gravity that a protocol without them does not.
Implications by audience
The pragmatic stack as of mid-2026 is Mastercard Agent Pay plus Visa TAP for network-attested traffic, plus ACP and UCP support for direct integration with the major AI platforms. Stripe shipped one-line ACP integration. Worldpay's MCP server is live. Commercetools has agent commerce built in. The decision is no longer "should we accept agent traffic." It is "in what order do we add protocol support, and which integrations get prioritised this quarter."
Merchants who block AI crawlers are blocking revenue. Amazon's decision to block AI crawlers cost them roughly 600 million product listings from AI search results and an 18% month-over-month drop in ChatGPT referral traffic. Walmart now captures around 20% of ChatGPT referral traffic. The directional read is that AI discoverability is the new SEO and the cost of opting out is now measurable.
Visa and Mastercard have placed parallel bets at different scopes. Visa went broad early with Intelligent Commerce. Skyfire, Nekuda, PayOS, and Ramp as pilot partners. Geographic rollout starting in Europe and the UK before Asia Pacific and LATAM. Mastercard went deep with Agent Pay and the Microsoft partnership. Both networks are building the agent identity layer themselves rather than ceding it.
The bet that requires monitoring is stablecoin rails. Coinbase x402 and Stripe MPP with Tempo and Visa are eating B2B agent-to-agent payments where neither network has clear positioning. The conversion layer (Mastercard's Multi-Token Network) is the hedge. Whether the hedge is sufficient depends on how much B2B agent volume actually materialises in the next eighteen months.
Platforms are converging on the merchant-led integration model after OpenAI's protocol-only experiment failed. Google's UCP is the working version of this approach. Anthropic's MCP, donated to the Linux Foundation, is the substrate underneath. Whichever AI platform best disambiguates the consent flow (which agent acted, on whose authority, with what permissions, across which sessions) wins the trust argument. The technical substrate is largely neutral. Identity disambiguation is the differentiator.
Processors face the question of whether agent commerce flows go through them at all. If the networks issue tokens directly to agents and merchants accept those tokens, Stripe and Adyen become the compatibility layer rather than the primary flow. Stripe's MPP positioning (agent-to-agent payments, with Tempo and Visa as design partners) is the hedge against this disintermediation. The success criterion is whether MPP captures meaningful B2B agent-to-agent volume, or whether it becomes a strategic curiosity that the company supports without scaling.
The substrate question
Web Bot Auth is the substrate on which the network and platform layers depend. Cloudflare led the work. The IETF standard it builds on, RFC 9421 (HTTP Message Signatures), was developed before agentic commerce became urgent. Visa and Mastercard adopted Web Bot Auth because issuing each network's own agent identity attestation independently was technically and politically uncompetitive. Once Web Bot Auth became the substrate, network tokens became "Web Bot Auth plus brand attestation" and protocol tokens became "Web Bot Auth plus delegation flow." Everything above the substrate is differentiation. The substrate itself is the trust assumption.
The party that controls Web Bot Auth's evolution controls the trust assumption that everything builds on. Cloudflare's positioning here is quieter than Mastercard's or Google's, and more interesting because of it. Cloudflare partnered with Microsoft, Shopify, Checkout.com, Worldpay, and Adyen on the early Web Bot Auth work. The partnerships extended to Visa and Mastercard adoption through 2025. Cloudflare's revenue from this layer is small. Cloudflare's strategic positioning from this layer is large.
The question for 2026 through 2027 is whether the standards body keeps Web Bot Auth sufficiently open that the networks and platforms have to keep competing on top of it, or whether one party captures enough of the extension surface to make the substrate effectively their substrate. The first scenario keeps the layered stack interesting and competitive. The second scenario is where the architecture starts to consolidate uncomfortably under whichever party gets there first.
The current trajectory looks like the first scenario. Watch for that to change.
How Saigar produced this Brief
The mechanism diagram above is the Brief's distinctive visual element. The Writer authors the lane data (paths, nodes, packet timing). The renderer turns it into animated SVG. Saigar produces this from a topic prompt, through the 24-stage pipeline, in roughly 14 minutes wall clock, at a per-run cost of €7.16.
The full Brief, in its native format, includes additional components not shown in this preview: a market-state line chart of agent-initiated transaction volume over the last six quarters, hyperlinked source citations with reverse "cited in" anchoring, closing statement. The full version uses dual-mode rendering (newspaper or slides) and the back-chip for cross-references between sections.
From four pipelines, to many.
v1 shipped four pipelines. v2 adds further specialised pipelines and continuous monitoring. v3 evolves the architecture from pipelines-as-primitives to pipelines-as-orchestrated. Each step is opt-in, governed by registration not rebuild.
Four pipelines, eighteen agents, one loop
Brief, Paper, Wire snippet, and Predictive Intel. Eight senior agents (Orchestrator, Researcher, Synthesizer, Writer, Critic, Reflector, Designer, Publisher) plus ten junior specialists across the wire and predictive-intel pipelines. The cross-run learning loop. The seven-archetype layout palette for Briefs. Atelier register throughout.
What shipped
- Brief pipeline: weekly payments competitive intelligence
- Paper pipeline: long-form research articles to sagarbharambe.com
- Wire snippet pipeline: morning notes from regulator filings and central-bank releases
- Predictive Intel pipeline: quarterly company assessments with bear/bull red-team and ratification scorecard
- Eighteen specialist agents with reasoning checkpoints and the learning loop
- Dual-mode Brief HTML with mechanism diagrams and back chip
- The /learning view for principal supervision
- Full audit trail and reproducibility
What's deferred
- Continuous trigger fully active (code path exists, no continuous pipelines yet)
- Cross-pipeline orchestration
- Multi-tenancy (any code path)
- Public sharing of artefacts
- Auto-discovery of new sources
- Multilingual output (Dutch articles)
Continuous monitoring and specialist memos
v2 expands the pipeline catalogue. Each new pipeline is YAML registration plus optional specialist agents. The architecture absorbs them without core changes.
Regulatory monitoring
Polls EBA, ECB, AFM, DNB feeds. Scores materiality per item against principal interest signals. Material items surface as briefing cards on the Today view. Activates the v1 continuous trigger no-op.
activates continuous
Investment thesis stress-test
Adversarial debate variant. The Synthesizer's Position 2 is constrained to find the strongest disconfirming case. Output: a stress-test report with confidence intervals on the thesis.
debate variant
Quarterly market wrap
Aggregates 90 days of Briefs and continuous regulatory updates into a synthesized quarterly view. Identifies trajectory shifts and emerging themes. Output: a Brief-format artefact at quarterly cadence.
aggregating
Deal / M&A memo
Specialised pipeline for transaction analysis. Comparable transactions retrieval, synergy analysis, regulatory friction assessment. Output: a structured memo.
specialist agents
Cross-pipeline orchestration
The pipeline-as-primitive abstraction holds through v2. v3 introduces meta-pipelines: a pipeline whose stages are other pipelines. A regulatory event triggers a Brief; the Brief's findings trigger a Paper; the Paper's publication seeds a quarterly market wrap. The Reflector at this scale produces meta-signals: what kinds of pipeline cascades produced the highest-quality artefacts.
This is the moment when Saigar transitions from "an operator that runs research pipelines" to "an editorial system that orchestrates a body of work."
Editorial team expansion
The single-tenant constraint loosens. Saigar can be operated by a small editorial team where each member has their own voice profile, their own learning signals, their own approval gates. Cross-team signals become a possibility: the senior editor's voice constraints take precedence in shared pipelines; the staff editors learn faster because they consume the senior's signal set.
This is speculative. It would require revisiting almost every architectural decision in v1: the singleton principal, the auth boundary, the cost model, the audit trail. Not a v1 question.
Wire Curator
First agent in the morning wire pipeline. Reads the last 24 hours of inbound channels and decides what is worth a 95-110 word note.
Saigar's channel substrate ingests around forty regulator and macro feeds continuously. By 07:00 CEST every weekday the Curator faces a pile of fifty to two hundred new items. Its job is to pick at most five.
What it actually does
The Curator scores each new item on three axes: materiality (does this move a real thing in payments or financial-sector regulation), fit (does it match the principal's current interest signals from the /learning view), and freshness (would the principal already know this from another source). Items below thresholds are dropped silently; the rest are ranked.
The output is not a draft yet. It is a structured pick list: source URL, channel, kicker hint, why-it-matters hint, materiality score, dedupe key. The drafter consumes this list one item at a time.
Hard rules
- One item per channel per day, no exceptions
- Maximum five items per morning, no exceptions
- Dedupe against the prior seven days using a normalised title + source-host hash
- If fewer than three Tier-A items are available, the Curator picks fewer rather than diluting the column
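A minimal sketch of how the hard rules above might be applied to an already-scored candidate list. The Candidate fields, the normalisation inside dedupe_key, and the select_picks function are illustrative assumptions, not Saigar's actual code. The Tier-A rule is left out because the tier definition is not given here.

```python
import hashlib
import re
from dataclasses import dataclass
from urllib.parse import urlparse

MAX_ITEMS_PER_MORNING = 5  # hard rule from the text


@dataclass(frozen=True)
class Candidate:
    # Hypothetical shape of one scored item; field names are assumptions.
    title: str
    source_url: str
    channel: str
    materiality: float


def dedupe_key(title: str, source_url: str) -> str:
    """Hash of a normalised title plus the source host (assumed normalisation)."""
    normalised = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    host = (urlparse(source_url).hostname or "").removeprefix("www.")
    return hashlib.sha256(f"{normalised}|{host}".encode("utf-8")).hexdigest()


def select_picks(candidates: list[Candidate], seen_keys: set[str]) -> list[Candidate]:
    """Apply dedupe, one-per-channel, and the five-item cap, best score first.

    seen_keys stands in for the dedupe keys of the prior seven days.
    """
    picks: list[Candidate] = []
    used_channels: set[str] = set()
    used_keys = set(seen_keys)
    for item in sorted(candidates, key=lambda c: c.materiality, reverse=True):
        key = dedupe_key(item.title, item.source_url)
        if key in used_keys or item.channel in used_channels:
            continue  # duplicate story, or channel already used today
        picks.append(item)
        used_keys.add(key)
        used_channels.add(item.channel)
        if len(picks) == MAX_ITEMS_PER_MORNING:
            break
    return picks
```

Sorting before filtering means that when two items collide on channel or dedupe key, the higher-scored one survives.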
Why Haiku
Curation is a high-volume triage problem with a tight schema. Sonnet costs five times more for marginal quality gains. Haiku at this stage runs at roughly €0.04 per morning across all candidate items, which makes the wire pipeline economically viable as a daily product.
Wire Drafter
Writes one tight 95-110 word note per curated item. Voice in, citation in, kicker in, why-it-matters in. No room to hide.
The wire format is the constraint. A note that runs 80 words feels rushed; a note that runs 130 words turns into a brief. The Drafter targets the band exactly because the band is what makes the wire column readable as a column.
The shape of one note
Every snippet has the same structure: a kicker line of two to four words categorising the item (regulator | data | market | academic), a one-sentence factual statement of what happened, two to three sentences of substance, and a closing why-it-matters that connects it to the principal's beat without editorialising.
Hard validators
- Word count must be 95 to 110 inclusive; out-of-band drafts are rejected before the Critic sees them
- Exactly one source link, dated, with publisher attribution
- No em-dashes, no rhetorical questions, no contractions, no "not X but Y" framing
- No second person; the principal's voice is third-person observer
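The validators above lend themselves to a plain pre-Critic gate. The sketch below is an assumption about how such a gate could look: the regular expressions are rough heuristics, a question mark stands in for "rhetorical question", and the date and publisher checks on the source link are omitted. None of the names come from Saigar itself.

```python
import re

WORD_BAND = (95, 110)  # inclusive band from the text

# Heuristic patterns, assumed for illustration only.
CONTRACTION = re.compile(r"\b\w+(?:n['\u2019]t|['\u2019](?:re|ve|ll|d|m))\b", re.IGNORECASE)
SECOND_PERSON = re.compile(r"\b(?:you|your|yours|yourself)\b", re.IGNORECASE)
NOT_X_BUT_Y = re.compile(r"\bnot\b[^.?!]*\bbut\b", re.IGNORECASE)


def count_words(text: str) -> int:
    """Whitespace-delimited word count (the counting rule is an assumption)."""
    return len(text.split())


def validate_note(body: str, source_links: list[str]) -> list[str]:
    """Return the list of violated rules; an empty list means the draft may proceed."""
    errors: list[str] = []
    n = count_words(body)
    if not WORD_BAND[0] <= n <= WORD_BAND[1]:
        errors.append(f"word count {n} outside 95-110")
    if len(source_links) != 1:
        errors.append("exactly one source link required")
    if "\u2014" in body:
        errors.append("em-dash")
    if "?" in body:
        errors.append("rhetorical question")
    if CONTRACTION.search(body):
        errors.append("contraction")
    if NOT_X_BUT_Y.search(body):
        errors.append("not X but Y framing")
    if SECOND_PERSON.search(body):
        errors.append("second person")
    return errors
```

Returning every violation at once, rather than failing on the first, gives a regeneration prompt the full list of objections in one pass.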
Why Sonnet here, not Haiku
Curation is volume work; drafting is tone work. The drafter has to compress a regulator filing or a working paper down to the principal's voice register without losing the load-bearing facts. Haiku produces band-correct drafts that read as generic explainer copy. Sonnet produces band-correct drafts that pass the Critic on the first try about eighty percent of the time.
Wire Critic
A scaled-down Critic tuned to the wire format. Softer on style nits because the format is short, harder on factual claims because there is nowhere to hide.
Senior Critic peer review takes a thousand-word section apart over multiple personas. The Wire Critic gets a hundred words and one shot. The trade-off is reframed: do not argue about cadence, do argue about whether the verb describing what the regulator did is the right verb.
The four checks
- Voice rules. Banned punctuation, banned phrases, register drift. Hard fail on any violation.
- Citation discipline. Source link resolves, date present, publisher attributed.
- Factual posture. The drafter's verbs match what the source actually says. A regulator that "issued guidance" is different from one that "proposed a rule."
- Band drift. Word count revalidated; the drafter's count must match the Critic's recount.
Output contract
The Critic returns one of three verdicts: approved (Publisher commits the snippet), revise-with-reason (Drafter regenerates one time with the specific objection inlined into the prompt), or reject (snippet dropped from the morning column entirely). Reject is rare; the Curator has already filtered out the genuinely unfit material.
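The three-verdict contract with a single regeneration can be sketched as a small control loop. The critic and redraft callables are hypothetical stand-ins for the two agents. Dropping the snippet when the second pass is still not approved is an assumption; the text does not say what happens in that case.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    APPROVED = "approved"
    REVISE = "revise-with-reason"
    REJECT = "reject"


def review_loop(
    draft: str,
    critic: Callable[[str], tuple[Verdict, str]],
    redraft: Callable[[str, str], str],
) -> Optional[str]:
    """Return the snippet to publish, or None if it is dropped from the column."""
    verdict, reason = critic(draft)
    if verdict is Verdict.APPROVED:
        return draft
    if verdict is Verdict.REJECT:
        return None
    # revise-with-reason: exactly one regeneration, with the objection passed along.
    revised = redraft(draft, reason)
    verdict, _ = critic(revised)
    # Assumption: anything short of approval on the second pass drops the snippet.
    return revised if verdict is Verdict.APPROVED else None
```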
Why softer than the senior Critic
The senior Critic enforces rhetorical structure, drop-cap discipline, section-heading conventions, and four-layer citation verification. None of those apply to a 100-word note. Optimising the Wire Critic for the right surface area is what keeps the morning ship-time under twenty minutes end to end.
Company Researcher
First agent in the predictive intel pipeline. Reads everything filed, said, or disclosed by one specific company in the last four quarters.
The senior Researcher casts a wide net across topics. The Company Researcher does the opposite: scoped tight to one subject, but exhaustive within that scope. Visa for a Visa run, PayPal for a PayPal run.
What it pulls
- SEC EDGAR filings for the company's CIK (10-K, 10-Q, 8-K, proxy statements)
- Investor relations RSS feeds for press releases and earnings transcripts
- Regulatory disclosures from the company's home jurisdiction (Euronext, LSE, equivalents)
- Major payments-domain news mentioning the subject company
Body hydration
SEC filings carry only metadata in the channel index; the EDGAR adapter stores the URL but not the filing body. The Company Researcher detects thin bodies (under 600 characters of content) and re-fetches the full filing via direct HTTP, capped at eight refetches per run to stay inside the cost ceiling. Without this step the metrics extractor sees only filing titles and the assessment becomes a press-release summary rather than analysis.
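The hydration step reduces to a threshold check and a counter. A sketch under stated assumptions: the Filing shape is hypothetical, and fetch is an injected stand-in for the direct HTTP call so the logic can be exercised without a network.

```python
from dataclasses import dataclass
from typing import Callable

THIN_BODY_CHARS = 600  # threshold from the text
MAX_REFETCHES = 8      # per-run cap from the text


@dataclass
class Filing:
    # Hypothetical record as the channel index might hold it.
    url: str
    body: str = ""


def hydrate(filings: list[Filing], fetch: Callable[[str], str]) -> int:
    """Refetch thin bodies in place and return the number of refetches performed."""
    refetched = 0
    for filing in filings:
        if refetched >= MAX_REFETCHES:
            break  # stay inside the cost ceiling; remaining thin bodies stay thin
        if len(filing.body.strip()) < THIN_BODY_CHARS:
            filing.body = fetch(filing.url)
            refetched += 1
    return refetched
```

Filings are processed in list order, so a caller that cares which eight get hydrated would sort by priority first.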
Output
A structured context blob: financial signals (revenue lines, segment breakdowns), strategy signals (product launches, partnerships, executive commentary), product launches with dates, regulatory notes. The metrics extractor and the bear/bull analysts all read from the same blob, which means they see the same evidence and can be compared on interpretation alone.
Metrics Extractor
Reads the company researcher's output and lifts a fixed KPI schema. Without this stage, ratification is impossible.
The bear and bull analysts make claims in prose. The ratification agent next quarter has to grade those claims against numbers. The Metrics Extractor is the bridge: it produces the structured ground truth that the next edition will measure against.
The standard schema
Every company assessment fills the same JSON shape regardless of subject:
- revenue — net revenue, total revenue, year-on-year growth
- tpv — total payment volume processed, where disclosed
- take_rate — blended and per-segment
- net_revenue_margin — operating margin, EBIT, EBITDA where reported
- geographic_split — revenue by reporting region
- guidance — management's forward statements with confidence labels
Strict mode
The Metrics Extractor is forbidden from estimating. If a company does not disclose a KPI, the field is null with a reason code. Estimation is bear-and-bull territory. Mixing extraction and estimation in one stage was an early-prototype mistake that produced predictions ratifying themselves.
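Strict mode can be made structural rather than a prompt instruction. In the sketch below the six field names come from the schema above; the KpiValue wrapper, the build_metrics helper, and the "not_disclosed" reason code are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

KPI_FIELDS = (
    "revenue",
    "tpv",
    "take_rate",
    "net_revenue_margin",
    "geographic_split",
    "guidance",
)


@dataclass(frozen=True)
class KpiValue:
    value: Optional[object]
    reason: Optional[str] = None  # reason code, required when value is None

    def __post_init__(self) -> None:
        if self.value is None and not self.reason:
            raise ValueError("a null KPI needs a reason code")
        if self.value is not None and self.reason:
            raise ValueError("a disclosed KPI must not carry a reason code")


def build_metrics(disclosed: dict[str, object]) -> dict[str, KpiValue]:
    """Fill the fixed schema from disclosed values only; never estimate."""
    unknown = set(disclosed) - set(KPI_FIELDS)
    if unknown:
        raise ValueError(f"fields outside the schema: {sorted(unknown)}")
    metrics: dict[str, KpiValue] = {}
    for name in KPI_FIELDS:
        value = disclosed.get(name)
        if value is None:
            metrics[name] = KpiValue(None, "not_disclosed")  # hypothetical code
        else:
            metrics[name] = KpiValue(value)
    return metrics
```

Every run yields the same six keys, so the next edition's grading can rely on the shape without checking for missing fields.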
Why this is the most boring agent
It is supposed to be. The Extractor is a schema-mapping function with an LLM running it. The intelligence sits downstream in the bear/bull/reconciler red-team. The Extractor's job is to never be the source of disagreement.
Ratification Agent
Self-correcting research, on the record. From edition two onward, this agent grades the prior edition's predictions before the new edition is written.
Most research products have no track record because no one keeps score. The ratification scorecard at the top of every edition from Q2 onward forces the system to face what it got right and what it got wrong, in public, before it gets to make new claims.
The four-class verdict
- Confirmed: the prediction was directionally and quantitatively correct
- Partial: direction right, magnitude off, or threshold crossed in an unexpected way
- Wrong: direction was wrong, full stop
- Too early: the time horizon has not yet elapsed; carried forward to the next edition
How the grade is reached
The agent loads the prior edition's structured prediction set from the database (written by the reconciler one quarter earlier). It loads the current metrics extractor output for the same company. For each prediction it cites the specific number or event that confirmed or falsified it, and writes a one-sentence verdict note.
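For a quantitative prediction, the four-class verdict can be sketched as a pure function. This is an assumption about the grading logic, not the agent's actual rules: it separates Partial from Wrong by comparing the actual figure to a baseline recorded when the prediction was made, and it treats any grading date before the due date as Too early. The Prediction fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "Confirmed"
    PARTIAL = "Partial"
    WRONG = "Wrong"
    TOO_EARLY = "Too early"


@dataclass(frozen=True)
class Prediction:
    metric: str
    direction: str    # "below" or "above" the threshold
    threshold: float
    baseline: float   # value of the metric when the prediction was made
    due: date         # end of the prediction horizon


def grade(pred: Prediction, actual: float, as_of: date) -> Verdict:
    """Grade one quantitative prediction against the current extracted figure."""
    if as_of < pred.due:
        return Verdict.TOO_EARLY
    if pred.direction == "below":
        crossed = actual < pred.threshold
        moved_as_predicted = actual < pred.baseline
    elif pred.direction == "above":
        crossed = actual > pred.threshold
        moved_as_predicted = actual > pred.baseline
    else:
        raise ValueError(f"unknown direction: {pred.direction}")
    if crossed:
        return Verdict.CONFIRMED
    if moved_as_predicted:
        return Verdict.PARTIAL  # direction right, magnitude off
    return Verdict.WRONG
```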
Why this stage exists
The principal's view: a research operation that does not grade itself eventually drifts into being a comfortable story-telling machine. The ratification scorecard is the forcing function that keeps the bear and bull analysts honest, because next quarter's agent will read what they wrote and grade it against reality.
Bear Analyst
Half of the red-team. Writes the bearish case in its own voice, with falsifiable predictions, no hedging allowed.
The Bear Analyst's prompt forbids balance. Its job is not to weigh both sides; the reconciler does that. Its job is to construct the strongest disconfirming case the evidence supports, and to commit to predictions that can be graded next quarter.
What "falsifiable" means here
Every prediction must include a horizon (12 months or 3 years), a kind (quantitative or qualitative), and either a numerical threshold or a binary event. "Take rate will compress" is not a prediction. "Blended take rate will fall below 1.87% within four quarters" is a prediction.
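The falsifiability rule is mechanical enough to check before a prediction reaches the reconciler. A sketch, assuming a hypothetical Prediction shape and the horizon labels "12m" and "3y":

```python
from dataclasses import dataclass
from typing import Optional

HORIZONS = {"12m", "3y"}                  # assumed labels for 12 months and 3 years
KINDS = {"quantitative", "qualitative"}


@dataclass(frozen=True)
class Prediction:
    statement: str
    horizon: str
    kind: str
    threshold: Optional[float] = None     # numerical threshold, if any
    event: Optional[str] = None           # binary event, if any


def falsifiability_errors(p: Prediction) -> list[str]:
    """Return the reasons a prediction cannot be graded; empty means gradable."""
    errors: list[str] = []
    if p.horizon not in HORIZONS:
        errors.append("missing or unknown horizon")
    if p.kind not in KINDS:
        errors.append("missing or unknown kind")
    if p.threshold is None and p.event is None:
        errors.append("needs a numerical threshold or a binary event")
    return errors
```

Under this check, "Take rate will compress" with no threshold and no event fails, while the same claim with a threshold of 1.87 and a 12-month horizon passes.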
Independence
The Bear and the Bull never see each other's drafts. They each read the same metrics extractor output and the same company researcher output, and they each produce their case in isolation. The reconciler downstream reads both and adjudicates. If both arrive at the same prediction independently, it becomes a high-confidence house view.
What it produces
- A bear thesis paragraph, atelier register, no hedging
- A 12-month outlook prose section with stated assumptions
- A 3-year trajectory section
- Four to ten structured predictions with horizon, metric, threshold, and basis
- Key risks list — what would have to be true for the bear case to be wrong
Why Haiku
Adversarial argument is a cheap LLM use case. Haiku produces the bear at roughly one-fifth the cost of Sonnet and at comparable quality on this specific task, because the task constrains the agent so tightly. The reconciler is where Sonnet's reasoning matters.
Bull Analyst
The other half of the red-team. Writes the bullish case in its own voice, with falsifiable predictions, no hedging allowed.
Mirror of the Bear. The same constraints, the same independence, the same falsifiability discipline. The Bull's prompt is also forbidden from balance; its task is the strongest constructive case the evidence supports.
What it produces
- A bull thesis paragraph in voice
- A 12-month outlook with stated assumptions
- A 3-year trajectory section
- Four to ten structured predictions, schema identical to the Bear
- Key catalysts list — what would have to happen for the bull case to play out
Where the disagreement comes from
Two analysts reading the same evidence often produce different predictions for one of two reasons: differing weights on the same data points (where margin of safety matters), or differing forward assumptions (where the Bear sees a regulatory headwind and the Bull sees a competitive moat). The reconciler reads both and surfaces those exact disagreements as a section in the published assessment.
Why this design
A single analyst writing a balanced view is structurally biased toward the centre. Two adversarial analysts forced to commit to falsifiable predictions, then a third agent forced to adjudicate, produce predictions sharper than any centre-view could. The output is more disagreeable, which is the point.
Reconciler
The most architecturally important agent in the predictive intel pipeline. Reads both the bear and the bull, writes the house view, and commits to the prediction set that next quarter's ratification agent will grade.
The bear and bull each commit to a position. The Reconciler reads both positions, decides which prediction wins, which loses, which gets re-cast as a probability distribution over both, and writes the synthesis as a single integrated draft.
Why predictions come first
The Reconciler's tool schema is deliberately ordered with the structured prediction array first, prose sections second. LLMs tend to emit fields in declaration order, so putting predictions first means that token budget exhaustion costs prose, not the prediction set. An assessment with strong prose and zero predictions is unratifiable next quarter; an assessment with a strong prediction set and shorter prose still keeps the track record.
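A sketch of what a predictions-first tool schema could look like, written as a JSON Schema held in a Python dict. The property names are assumptions drawn from the outputs this section describes, not the real schema. Python dicts and json.dumps both preserve insertion order, so declaration order survives serialisation.

```python
import json

# Hypothetical item shape: horizon, kind, metric, threshold, basis.
PREDICTION_ITEM = {
    "type": "object",
    "properties": {
        "horizon": {"type": "string", "enum": ["12m", "3y"]},
        "kind": {"type": "string", "enum": ["quantitative", "qualitative"]},
        "metric": {"type": "string"},
        "threshold": {"type": "string"},
        "basis": {"type": "string"},
    },
    "required": ["horizon", "kind", "metric", "threshold", "basis"],
}

RECONCILER_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        # Declared first so the prediction set is emitted before any prose.
        "predictions": {"type": "array", "items": PREDICTION_ITEM, "minItems": 4},
        "house_thesis": {"type": "string"},
        "outlook_12m": {"type": "string"},
        "trajectory_3y": {"type": "string"},
        "key_disagreements": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["predictions", "house_thesis"],
}


def field_order(schema: dict) -> list[str]:
    """Top-level property names in declaration order."""
    return list(schema["properties"])


serialised = json.dumps(RECONCILER_TOOL_SCHEMA)  # key order is preserved
```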
What it surfaces
- A house thesis paragraph that does not split the difference
- A 12-month outlook that integrates both arguments
- A 3-year trajectory
- Six to eight key disagreements between the bear and the bull, each with the Reconciler's adjudication note
- A consolidated prediction set with horizon, kind, metric, threshold, and basis — written to the database, not just the document
Hard validators
The Reconciler must produce at least four predictions and ideally six or more. A run with fewer than four is rejected and rerun with a stronger output requirement, because under-prediction breaks the ratification design.
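The reject-and-rerun rule can be sketched as a thin wrapper around the Reconciler call. The run callable and its strict flag are hypothetical, and capping reruns at one is an assumption; the text does not state how many reruns are allowed.

```python
from typing import Callable

MIN_PREDICTIONS = 4  # hard floor from the text


def reconcile(run: Callable[[bool], list[dict]], max_reruns: int = 1) -> list[dict]:
    """Call the Reconciler, rerunning with a stronger requirement if it under-predicts.

    run(strict) is a stand-in for the agent call; strict=True represents the
    stronger output requirement.
    """
    predictions = run(False)
    reruns = 0
    while len(predictions) < MIN_PREDICTIONS and reruns < max_reruns:
        predictions = run(True)
        reruns += 1
    if len(predictions) < MIN_PREDICTIONS:
        raise ValueError(
            f"only {len(predictions)} predictions; the edition would be unratifiable"
        )
    return predictions
```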
Title Generator
Writes the headline that lands on desk.sagarbharambe.com. One-shot Haiku. Atelier register, declarative, no question marks.
Until this agent existed, the publisher fell back to "Q2 2026 Assessment" or the subject's display name as the artefact title. That is fine for an internal draft and unacceptable for a published front-page item.
The constraints
- Declarative, not interrogative — no question-mark headlines
- No buzzwords (innovate, transform, paradigm, leverage, ecosystem, robust)
- No "X faces Y": one of the lazy headline patterns
- Should connect a specific company to a specific structural claim from the assessment
- Ten to fourteen words is the comfortable band; shorter is allowed when the claim is strong
How it reads the draft
The Title Generator runs after the writer's revision pass and before the designer. It receives the synthesis output (where the house view sits) and the revision output (the polished prose). It writes a title that names the company and asserts a position — for example, "PayPal Faces Durable Structural Pressure Amid Profitable Growth" rather than "PayPal Q2 2026 Update".
Fallback path
If the title fails the validator (banned words, question mark, length out of band), the publisher does not retry the title agent — it falls back to the prior pattern of subject_display_name + edition_label. The title is a polish stage, not a load-bearing one. A failed title generates an internal log line; the artefact still ships.
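The validator and its fallback can be sketched in a few lines. This covers only the three checks named in the fallback path (banned words, question mark, length). Enforcing just the fourteen-word upper bound, matching buzzwords by prefix, and joining the fallback with a space are assumptions.

```python
import re

BANNED_WORDS = ("innovate", "transform", "paradigm", "leverage", "ecosystem", "robust")
MAX_WORDS = 14  # upper end of the comfortable band; shorter titles are allowed


def title_errors(title: str) -> list[str]:
    """Return validator failures for a candidate title; empty means it passes."""
    errors: list[str] = []
    lowered = title.lower()
    if "?" in title:
        errors.append("question mark")
    for word in BANNED_WORDS:
        # Prefix match, so "transforms" and "robustness" are caught as well.
        if re.search(rf"\b{word}", lowered):
            errors.append(f"buzzword: {word}")
    if len(title.split()) > MAX_WORDS:
        errors.append("too long")
    return errors


def final_title(candidate: str, subject_display_name: str, edition_label: str) -> str:
    """Use the generated title if it validates; otherwise fall back without retrying."""
    if title_errors(candidate):
        return f"{subject_display_name} {edition_label}"
    return candidate
```

There is no retry branch by design: a failed title costs a log line, not a pipeline stage.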