RAG Chatbot — Solution Architecture & Technical Specification

Architecture

Five components: a WordPress bridge, a sync worker, one Postgres with pgvector, a Fastify API that runs the retrieve → gate → generate → verify pipeline, and an embeddable chat widget. WordPress remains the source of truth; the RAG store holds embeddings, metadata, and verbatim scripture only.

Fig. 1 — System architecture. Scripture and disclaimers are enforced inside the API pipeline (stages 4–5), after the model and before the client.

Component summary

Component	Technology	Responsibility
WP bridge	Small additions to the existing knowledgebase plugin (PHP)	Delta endpoint (modified_after, paginated) + HMAC-signed save_post / trashed_post webhook. Read-only; WP stays the source of truth.
Sync worker	TypeScript (Node 22), Docker	Bulk load, 15-min incremental poll with content-hash gating, nightly reconciliation. Cleaning, chunking, scripture extraction, embedding calls.
Storage	Postgres 16 + pgvector (halfvec), PgBouncer	Two isolated tenant schemas, each with its own HNSW + GIN indexes. RLS as defense in depth. Holds vectors, metadata, verbatim scripture — not full mirrored content beyond cleaned text.
API	Fastify + zod (fastify-type-provider-zod), OpenAPI 3.1, Anthropic TS SDK	The 5-stage answer pipeline; typed-event SSE; abuse stack; per-tenant config (system prompts, disclaimers, thresholds).
Widget	Vanilla JS, ~15KB, tolerant reader	Embeds in both WP sites; fetch()+ReadableStream SSE; renders segments, citations, disclaimer; suggested-questions panel.
Sidecars	Haiku 4.5 (API), bge-reranker-v2-m3 (self-host)	Safety classification, MPA groundedness judging, cross-encoder reranking.

Design principles

Grounding is structural. Scripture verbatim-ness, citations-from-retrieved-set, and the MPA disclaimer are enforced by the data layer and DTO validation — never delegated to the prompt.
Tenant isolation below the application. Separate schemas with independent indexes; a forgotten WHERE clause cannot leak medical content into the Islamic-guidance brand.
One store. Vectors live next to metadata in Postgres — no dual-write consistency problem, no second database to operate. Sized for 170k articles / ~350k vectors without structural change.
Fail closed. Every uncertainty path (low retrieval score, hash mismatch, rail failure, timeout) degrades to the templated "not covered + closest articles" response — never to an ungrounded answer.

Data & Ingestion

Three ingestion modes share one cleaning-and-chunking core. Re-embedding is gated on a SHA-256 hash of the cleaned text, because WordPress bumps a post's modified timestamp even on metadata-only saves, so the timestamp alone would trigger needless re-embeds.

Fig. 2 — Ingestion: bulk load, 15-minute incremental poll with hash gating, nightly reconciliation.
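The hash gate described above can be sketched in a few lines. This is a minimal illustration, not the worker's actual code: the StoredArticle shape and both function names are hypothetical, and only the content_sha256 column comes from the schema in this document.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: only the fields the gate needs.
interface StoredArticle {
  postId: number;
  contentSha256: Buffer; // mirrors article.content_sha256 (bytea)
}

/** SHA-256 over the cleaned text, returned as 32 raw bytes. */
function hashCleanText(cleanText: string): Buffer {
  return createHash("sha256").update(cleanText, "utf8").digest();
}

/** True when the article is new or its cleaned text changed since the last index. */
function needsReembed(stored: StoredArticle | undefined, cleanText: string): boolean {
  if (stored === undefined) return true;
  return !stored.contentSha256.equals(hashCleanText(cleanText));
}
```

A metadata-only save yields the same cleaned text and therefore the same digest, so the poller can update modified_at without calling the embedding API.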

Cleaning & chunking rules

Publish-only: status='publish' — drops MPA's 14,531 drafts from the index by construction.
Strippers: [elementor-template] shortcodes (present in nearly every article — CTA widgets, not content), Gutenberg comment markup, spacers.
Junk denylist: tester authors and zero-count test categories found in the real exports.
Chunking: ~450–500 tokens, ~50 overlap, on Gutenberg block boundaries; title › h2 › h3 breadcrumb prepended; verse-reference paragraph + its blockquote always kept in the same chunk; 65k-word MPA outliers split on heading boundaries.
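One way to honor the "verse reference stays with its blockquote" rule is to glue those two blocks into an indivisible unit before packing. The sketch below assumes blocks arrive already parsed with a token count attached (the Block type and chunkBlocks name are hypothetical), and it deliberately leaves out the ~50-token overlap and the breadcrumb prefix.

```typescript
// Hypothetical pre-parsed Gutenberg block with a precomputed token count.
interface Block {
  kind: "heading" | "paragraph" | "quote";
  text: string;
  tokens: number;
}

/** Greedy packing on block boundaries; a quote never separates from the paragraph before it. */
function chunkBlocks(blocks: Block[], maxTokens = 500): Block[][] {
  // Step 1: glue each blockquote to the paragraph immediately preceding it.
  const units: Block[][] = [];
  for (const block of blocks) {
    const previous = units[units.length - 1];
    const last = previous?.[previous.length - 1];
    if (block.kind === "quote" && previous !== undefined && last?.kind === "paragraph") {
      previous.push(block);
    } else {
      units.push([block]);
    }
  }
  // Step 2: pack whole units into chunks without exceeding the token target.
  const chunks: Block[][] = [];
  let current: Block[] = [];
  let used = 0;
  for (const unit of units) {
    const size = unit.reduce((sum, b) => sum + b.tokens, 0);
    if (current.length > 0 && used + size > maxTokens) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(...unit);
    used += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

A unit larger than maxTokens still lands in a single chunk here; oversize handling (such as the heading-boundary split for the long MPA outliers) would sit on top of this.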

Postgres schema (applied to kb_wc and kb_mpa)

pg-sql

CREATE TABLE article (
  post_id         bigint PRIMARY KEY,
  title           text NOT NULL,
  slug            text NOT NULL,
  url             text NOT NULL,
  status          text NOT NULL CHECK (status = 'publish'),
  published_at    timestamptz,
  modified_at     timestamptz NOT NULL,
  author_id       bigint,
  reviewer_ids    bigint[] DEFAULT '{}',     -- MPA clinician credentials
  category_ids    int[]    DEFAULT '{}',
  clean_text      text NOT NULL,
  word_count      int,
  content_sha256  bytea NOT NULL,             -- re-embed gate
  last_indexed_at timestamptz
);
CREATE INDEX ON article USING btree (modified_at);
CREATE INDEX ON article USING gin (category_ids);

CREATE TABLE chunk (
  id           bigserial PRIMARY KEY,
  post_id      bigint NOT NULL REFERENCES article ON DELETE CASCADE,
  seq          int    NOT NULL,
  heading_path text,
  body         text   NOT NULL,
  token_count  int    NOT NULL,
  embedding    halfvec(1024) NOT NULL,        -- voyage-4
  tsv          tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED,
  UNIQUE (post_id, seq)
);
CREATE INDEX ON chunk USING hnsw (embedding halfvec_cosine_ops) WITH (m=16, ef_construction=64);
CREATE INDEX ON chunk USING gin (tsv);

CREATE TABLE scripture (
  id         bigserial PRIMARY KEY,
  post_id    bigint NOT NULL REFERENCES article ON DELETE CASCADE,
  chunk_id   bigint REFERENCES chunk,
  seq        int,
  ref_label  text,                             -- "Surah Al Tawbah (9), Verse 128" — from the PRECEDING paragraph
  quote_html text NOT NULL,                   -- byte-for-byte wp-block-quote HTML
  sha256     bytea NOT NULL
);

CREATE TABLE ingest_run (
  id bigserial PRIMARY KEY, kind text NOT NULL,  -- 'bulk' | 'poll' | 'reconcile'
  watermark timestamptz, upserted int, deleted int, re_embedded int,
  started_at timestamptz DEFAULT now(), finished_at timestamptz, error text
);

Retrieval & Generation

One hybrid SQL query per request — no separate narrowing layer. A calibrated answerability gate sits before the LLM: below threshold, the model is never called and the API returns the templated "not covered + closest articles" response.

Fig. 3 — Request lifecycle. Two exits before the model: the safety classifier (step 2, MPA) and the answerability gate (step 5).
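The gate itself is a single comparison; the work is in choosing the threshold. The sketch below shows one possible calibration rule, assumed here for illustration: pick the lowest threshold at which the share of out-of-domain queries that would pass stays at or below a chosen leak rate. The LabeledQuery type, both function names, and the leak-rate criterion are hypothetical; the document specifies only that the threshold is calibrated per tenant on labeled queries.

```typescript
// Hypothetical labeled sample: the top reranker score for a query and its label.
interface LabeledQuery {
  topScore: number;
  inDomain: boolean;
}

/** Lowest threshold whose out-of-domain pass rate is at or under maxLeakRate. */
function calibrateThreshold(samples: LabeledQuery[], maxLeakRate: number): number {
  const outScores = samples.filter(s => !s.inDomain).map(s => s.topScore);
  const candidates = [...new Set(samples.map(s => s.topScore))].sort((a, b) => a - b);
  for (const threshold of candidates) {
    const leaked = outScores.filter(score => score >= threshold).length;
    if (outScores.length === 0 || leaked / outScores.length <= maxLeakRate) return threshold;
  }
  return Number.POSITIVE_INFINITY; // nothing qualifies: fail closed
}

/** Below threshold (or with no retrieved chunk at all) the model is never called. */
function gate(topScore: number | undefined, threshold: number): "generate" | "not_covered" {
  return topScore !== undefined && topScore >= threshold ? "generate" : "not_covered";
}
```

A stricter leak rate pushes the threshold up and trades in-domain coverage for fewer confident answers on off-topic questions.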

The hybrid query

pg-sql · Reciprocal Rank Fusion, tenant scoped by schema

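-- $1 = query embedding, $2 = raw query text.
-- pgvector's default hnsw.ef_search is 40, which caps the vector arm below 50 rows;
-- run SET LOCAL hnsw.ef_search = 100 in the same transaction before this query.
WITH vec AS (
  SELECT id, post_id, row_number() OVER (ORDER BY embedding <=> $1::halfvec(1024)) AS r
  FROM kb_mpa.chunk ORDER BY embedding <=> $1::halfvec(1024) LIMIT 50
),
fts AS (
  SELECT id, post_id, row_number() OVER (ORDER BY ts_rank_cd(tsv, q) DESC) AS r
  FROM kb_mpa.chunk, websearch_to_tsquery('english', $2) q
  WHERE tsv @@ q
  ORDER BY r LIMIT 50   -- without ORDER BY, LIMIT keeps an arbitrary 50 matches
)
SELECT id, post_id, SUM(1.0 / (60 + r)) AS rrf_score
FROM (SELECT * FROM vec UNION ALL SELECT * FROM fts) fused
GROUP BY id, post_id ORDER BY rrf_score DESC LIMIT 20;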

Pipeline numbers: top-50 per arm → RRF fuse (k=60) → top-20 → cross-encoder rerank → 6–8 chunks, max 2–3 per article into generation. Categories are an optional explicit facet only — never an inferred pre-filter.
"Not covered" gate: reranker-score threshold calibrated per tenant on ~200 labeled in/out-of-domain queries; a structured insufficient_context flag from the model is the second gate.
Latency budget: embed ~80ms · hybrid query ~30ms · rerank ~120ms · generation 1.5–3s streamed → p95 < 4s, first token < 1.5s on the streaming path.
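For readers who prefer to see the fusion arithmetic outside SQL, here is a small in-memory equivalent of the RRF step plus the per-article cap applied before generation. It is an illustrative sketch: the Ranked type and both function names are hypothetical, and in the described design the fusion runs inside Postgres, with the cap applied after reranking.

```typescript
// Hypothetical row shape: a chunk id and the article it belongs to.
interface Ranked {
  id: number;
  postId: number;
}

/** Reciprocal Rank Fusion: score(id) = sum over arms of 1 / (k + rank), with rank starting at 1. */
function rrfFuse(arms: Ranked[][], k = 60, limit = 20): Array<Ranked & { score: number }> {
  const fused = new Map<number, Ranked & { score: number }>();
  for (const arm of arms) {
    arm.forEach((row, index) => {
      const entry = fused.get(row.id) ?? { ...row, score: 0 };
      entry.score += 1 / (k + index + 1);
      fused.set(row.id, entry);
    });
  }
  return [...fused.values()].sort((a, b) => b.score - a.score).slice(0, limit);
}

/** Keeps at most maxPerArticle chunks per post, preserving input order, up to total chunks. */
function capPerArticle<T extends { postId: number }>(rows: T[], maxPerArticle: number, total: number): T[] {
  const seen = new Map<number, number>();
  const kept: T[] = [];
  for (const row of rows) {
    const count = seen.get(row.postId) ?? 0;
    if (count >= maxPerArticle) continue;
    seen.set(row.postId, count + 1);
    kept.push(row);
    if (kept.length === total) break;
  }
  return kept;
}
```

A chunk that appears in both arms accumulates two reciprocal terms, which is why modest ranks in both lists can outrank a single first place.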

The verbatim-scripture pipeline

Fig. 4 — Verbatim scripture is a data-layer guarantee: 7,898 Wise Compass articles (90%) contain Gutenberg blockquotes; none pass through the model.
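The substitution step can be sketched as follows, assuming the model emits [[SCRIPTURE:id]] placeholders and the server swaps in the stored blockquote after re-checking its hash. The ScriptureRow shape and function names are hypothetical; the placeholder syntax and the sha256 column are taken from this document. Returning null is this sketch's way of signaling the fail-closed fallback.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory view of a scripture table row.
interface ScriptureRow {
  id: number;
  postId: number;
  refLabel: string;
  quoteHtml: string; // byte-for-byte wp-block-quote HTML
  sha256: Buffer;    // digest stored at ingest time
}

type Segment =
  | { type: "prose"; text: string }
  | { type: "scripture"; html: string; reference: string; articleId: number };

function sha256(text: string): Buffer {
  return createHash("sha256").update(text, "utf8").digest();
}

/** Splits model output on placeholders and substitutes stored scripture; null means fall back. */
function toSegments(modelText: string, rows: Map<number, ScriptureRow>): Segment[] | null {
  // split() with a capture group yields prose at even indexes and ids at odd indexes.
  const parts = modelText.split(/\[\[SCRIPTURE:(\d+)\]\]/);
  const segments: Segment[] = [];
  for (let index = 0; index < parts.length; index++) {
    const part = parts[index] ?? "";
    if (index % 2 === 0) {
      if (part.length > 0) segments.push({ type: "prose", text: part });
      continue;
    }
    const row = rows.get(Number(part));
    // Unknown id or a stored quote that no longer matches its digest: refuse to answer.
    if (row === undefined || !sha256(row.quoteHtml).equals(row.sha256)) return null;
    segments.push({ type: "scripture", html: row.quoteHtml, reference: row.refLabel, articleId: row.postId });
  }
  return segments;
}
```

Because the model only ever sees and emits an id, there is no code path through which it could alter the quoted text.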

DTOs & Streams

Contracts defined once as zod schemas → OpenAPI 3.1 → generated client types. Versioned at /v1, additive-only — cached WordPress embeds cannot be force-upgraded.

typescript · zod DTOs

import { z } from 'zod';

const ChatRequest = z.object({
  message:        z.string().min(1).max(1000),
  conversationId: z.string().uuid().optional(),   // multi-turn (phase 2)
  stream:         z.boolean().default(true),
});
// tenant resolved SERVER-SIDE from X-Site-Key + Origin — never from the body

const Segment = z.discriminatedUnion('type', [
  z.object({ type: z.literal('prose'),     text: z.string() }),
  z.object({ type: z.literal('scripture'), html: z.string(),
             reference: z.string(), articleId: z.number() }),   // server-substituted, hash-verified
]);

const Citation = z.object({
  articleId: z.number(), title: z.string(), url: z.string().url(),
  author: z.string().nullable(), reviewer: z.string().nullable(),   // MPA credentials surfaced
  score: z.number(),
});

const ChatResponse = z.object({
  requestId:       z.string().uuid(),
  status:          z.enum(['answered', 'not_covered', 'safety_redirect']),
  segments:        z.array(Segment),
  citations:       z.array(Citation),      // non-empty + from retrieved set, or API returns fallback
  closestArticles: z.array(Citation),      // populated when not_covered
  disclaimer:      z.string().nullable(),  // NON-NULLABLE for MPA — server-populated
  safetyFlags:     z.array(z.enum(['emergency','crisis','dosage_seeking'])),
  usage:           z.object({ inputTokens: z.number(), outputTokens: z.number() }),
});
// Errors: RFC 9457 application/problem+json
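The two comments above that schema validation alone cannot express (citations must come from the retrieved set, and the disclaimer must be present for MPA) could be enforced by a small post-validation check like the one below. It is a sketch in plain TypeScript so it runs without zod; the Tenant keys reuse the schema names kb_wc and kb_mpa as an assumption, and checkResponse is a hypothetical name.

```typescript
// Assumption: tenants are keyed by their schema names.
type Tenant = "kb_wc" | "kb_mpa";

// Hypothetical subset of the response draft that this check needs.
interface Draft {
  citations: { articleId: number }[];
  disclaimer: string | null;
}

/** "ok" only if every citation was retrieved and MPA carries a non-empty disclaimer. */
function checkResponse(tenant: Tenant, draft: Draft, retrievedPostIds: ReadonlySet<number>): "ok" | "fallback" {
  const cited =
    draft.citations.length > 0 &&
    draft.citations.every(c => retrievedPostIds.has(c.articleId));
  const disclaimerOk =
    tenant !== "kb_mpa" || (draft.disclaimer !== null && draft.disclaimer.trim().length > 0);
  return cited && disclaimerOk ? "ok" : "fallback";
}
```

In a zod-based codebase the same rules would likely live in a refinement on the response schema; the standalone function just makes the logic easy to see and test.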

Streaming — typed events, tenant-specific release policy

Wise Compass: message_start → prose_delta ×n → hold at placeholder → scripture_block → prose_delta ×n → citations → status → done

MPA (medical): message_start → buffered: rails run (entailment · number+unit · disclaimer) → prose_delta burst → citations → status → done

Not covered: message_start → status: not_covered + closestArticles[] → done (LLM never called · ~$0.0002)

Fig. 5 — SSE event order per tenant. Raw model deltas never reach the client; the widget shows a "checking sources" state during the MPA buffer.

Transport: POST /v1/chat with Accept: text/event-stream; same endpoint with stream:false returns full JSON. Widget uses fetch()+ReadableStream (EventSource can't POST).
Connection discipline: PgBouncer transaction mode — retrieve, release the DB connection, then open the Anthropic stream. 15s heartbeats, 30s hard timeout → "not covered" fallback, per-IP concurrent-stream caps, X-Accel-Buffering: no.
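On the widget side, reading SSE over a POST means parsing frames by hand. The sketch below shows a minimal frame parser and a read loop; parseSse and streamChat are hypothetical names, the parser assumes "\n" line endings, and the request shape follows the ChatRequest DTO and X-Site-Key header described in this document.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

/** Parses complete SSE frames from a buffer; returns events plus the unparsed remainder. */
function parseSse(buffer: string): { events: SseEvent[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? ""; // last piece may be an incomplete frame
  const events: SseEvent[] = [];
  for (const frame of frames) {
    let event = "message";
    const data: string[] = [];
    for (const line of frame.split("\n")) {
      if (line.startsWith(":")) continue; // comment line, e.g. a heartbeat
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) data.push(line.slice(5).replace(/^ /, ""));
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return { events, rest };
}

/** Streams a chat answer, handing each parsed event to onEvent as it arrives. */
async function streamChat(
  url: string,
  siteKey: string,
  message: string,
  onEvent: (e: SseEvent) => void,
): Promise<void> {
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "text/event-stream", "X-Site-Key": siteKey },
    body: JSON.stringify({ message, stream: true }),
  });
  if (!response.ok || response.body === null) throw new Error(`chat request failed: ${response.status}`);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSse(buffer);
    buffer = parsed.rest;
    parsed.events.forEach(onEvent);
  }
}
```

A tolerant-reader widget would switch on the event name inside onEvent and silently ignore names it does not recognize, which is what lets older cached embeds survive additive API changes.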

API surface

Endpoint	Purpose
POST /v1/chat	Answer path — SSE + JSON
GET /v1/articles/:id	Citation hover-preview for the widget
POST /v1/feedback	Thumbs up/down keyed by requestId — feeds the eval set
POST /v1/ingest/webhook	WP save_post hook, HMAC-signed
GET /v1/healthz · GET /v1/admin/sync-status	Liveness + ingest watermarks/drift
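Verification of the HMAC-signed webhook might look like the sketch below. The document only states that the webhook is HMAC-signed; the choice of HMAC-SHA256, hex encoding, a shared secret, and the function names are all assumptions made for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/** Hex HMAC-SHA256 of the raw request body (assumed signing scheme). */
function signBody(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

/** Constant-time check of a received hex signature against the expected one. */
function verifyWebhook(secret: string, rawBody: string, signatureHex: string): boolean {
  const expected = Buffer.from(signBody(secret, rawBody), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

The signature has to be computed over the raw bytes of the body, before any JSON parsing, or re-serialization differences will break verification.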

AI Models

Role	Model	Price /MTok (in / out)	Why
Answer generation	claude-sonnet-5	$3 / $15 (intro: $2 / $10 through 31 Aug 2026)	Best grounding-instruction adherence per dollar. Static system prompt is prompt-cached, so it reads at ~0.1×.
Safety classifier	claude-haiku-4-5	$1 / $5	Pre-retrieval emergency / crisis / dosage routing, fixed signposting templates. ~650 tok/call.
MPA groundedness judge	claude-haiku-4-5	$1 / $5	Sentence-level entailment vs retrieved chunks + deterministic number+unit byte-match rule.
Escalation tier (optional)	claude-opus-4-8	$5 / $25	Only if eval shows Sonnet gaps on multi-article synthesis. Not in the base budget.
Embeddings	voyage-4 · 1024-dim	$0.06 /MTok (200M free tokens)	Anthropic's recommended partner. Strong on Islamic transliteration + medical vocabulary. Matryoshka → 512-dim later without re-embedding. Corpus embeds for $0 under the free tier. Fallback: OpenAI text-embedding-3-small behind a provider-agnostic interface.
Reranker	bge-reranker-v2-m3 (self-host) or Voyage rerank	≈ $0	Its calibrated score is the "not covered" gate.

Token budget per query (single-turn)

Component	Tokens	Notes
System prompt (grounding rules, placeholder protocol, tenant config)	~1,300	Prompt-cached — ~0.1× after first request per 5-min window
Retrieved context (6–8 chunks, max 2–3/article)	~3,000	Fresh input
Scripture metadata + question + formatting	~250	Fresh input
Total input	~4,550	of which ~1,300 cached
Output	~450	Placeholders keep scripture out of output tokens
Query embedding (voyage-4)	~30	≈ $0.000002 — noise
Multi-turn follow-ups (phase 2)	+30–60% input	History rides in messages; cache absorbs most of it

Per-Query Price

Generation model	Per query	5k q/mo	10k q/mo	30k q/mo
claude-haiku-4-5	$0.0056	$28	$56	$168
claude-sonnet-5 (intro, → Aug 2026)	$0.0112	$56	$112	$336
claude-sonnet-5 (standard)	$0.0167	$84	$167	$501
claude-opus-4-8	$0.0279	$140	$279	$837

Basis: ~4,550 input (1,300 cached at 0.1×) + ~450 output. Add-ons: MPA safety rails ≈ +$0.003/query on the medical tenant only; multi-turn ≈ +40% input. Three dampeners stack: prompt caching (30–50% of input spend), exact-match Redis response cache (repeat queries → $0), and the "not covered" gate (gated queries skip the LLM, ~$0.0002).
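The stated basis can be written out as arithmetic. The sketch below is illustrative (perQueryUsd and the Price type are hypothetical names) and models only cache reads at 0.1×, ignoring any cache-write surcharge. With these exact inputs it gives about $0.0056 for Haiku and about $0.0169 for Sonnet standard; the table's figures sit within roughly 1% of that, which presumably reflects slightly different rounding of the basis.

```typescript
// Price per million tokens, in USD.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

/** Per-query cost: fresh input at full price, cached input at the read factor, plus output. */
function perQueryUsd(
  price: Price,
  inputTokens = 4550,
  cachedTokens = 1300,
  outputTokens = 450,
  cacheReadFactor = 0.1,
): number {
  const freshInput = inputTokens - cachedTokens;
  const inputCost = (freshInput + cachedTokens * cacheReadFactor) * price.inputPerMTok;
  const outputCost = outputTokens * price.outputPerMTok;
  return (inputCost + outputCost) / 1e6;
}

const haiku: Price = { inputPerMTok: 1, outputPerMTok: 5 };
const sonnetStandard: Price = { inputPerMTok: 3, outputPerMTok: 15 };
console.log(perQueryUsd(haiku).toFixed(4), perQueryUsd(sonnetStandard).toFixed(4));
```

Multiplying the result by monthly query volume reproduces the shape of the volume columns before the cache and gate dampeners are applied.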

One-time item	Cost	Notes
Full-corpus embedding — ~24M tokens incl. overlap	$0 – $1.43	voyage-4 free tier → $0; list $1.28–1.43
Full re-embed (model or chunking change)	< $1.50	Sub-hour, sub-$2 — never a reason to defer a fix

Growth Tier & Expected Monthly Costs

The Growth tier buys three things the Lean tier (~$31/mo, single instance) doesn't have: a redundant API pair behind a load balancer, an isolated staging environment for safe releases, and bigger DB compute to keep the HNSW index fully in RAM as the corpus grows toward 170k articles.

Fig. 6 — Growth-tier topology, ~$119/mo infrastructure. Every component EU-resident.

Environments & promotion flow (staging areas)

Fig. 7 — Three staging areas. Dev costs nothing (local + free Neon branches); staging idles to near-zero when unused; prod carries the redundancy.

Growth-tier infrastructure — line items

Component	Plan · Region	$/mo
Prod database — Postgres 16 + pgvector, both tenant schemas, PITR, failover	Supabase Pro + Small compute · Frankfurt	30
Staging database — full corpus copy, scale-to-zero	Neon Launch · Frankfurt	~8
Prod API — redundant pair, zero-downtime deploys, autoscaling	2× Render Standard · Frankfurt	50
Prod sync worker	Render Starter · Frankfurt	7
Staging API + worker	2× Render Starter · Frankfurt	14
Cache + rate-limit state	Upstash Redis fixed 250MB · eu-central-1	10
CDN / WAF / TLS · uptime · errors · tracing	Cloudflare Free · UptimeRobot · Sentry free · Langfuse self-host	0
Infrastructure total		~119

Development-phase AI spend (one-time, during the ~6-week build)

Item	Est. cost	Notes
Corpus embedding + ~5 re-embeds during chunking iteration	$0 – $8	voyage-4 200M free tokens absorb ~9 full re-embeds; list price shown
Retrieval calibration + golden-set iterations (~4–5k Sonnet queries)	$60 – 90	Threshold tuning on ~200 labeled queries/tenant × iterations
CI regression runs during build (~100 runs × ~50-query suite)	$60 – 90	Mix of Haiku (rails) and Sonnet (generation) calls
Safety-rail tuning (Haiku classifier + entailment judge, ~5k calls)	~$5	Fixed-template routing verified against sensitive probes
AI-assisted development tooling (Claude Code, build window)	$200 – 400	The lever behind the 6-week timeline; ~2 months of a Max-tier plan or metered API equivalent
Total development-phase AI spend	~$325 – 590	One-time; separate from the build fee

Expected all-in monthly — Growth tier

Line	10k q/mo	30k q/mo	60k q/mo
Infrastructure (above)	$119	$119	~$135 (DB compute step-up)
Prod generation — Sonnet 5 std, incl. MPA rails, ~25% cache/gate savings	~$175	~$520	~$1,040
Staging LLM — smoke + regression traffic	~$15	~$15	~$20
Embedding — incremental re-index traffic	~$0	~$0	~$1
Total expected monthly	~$309	~$654	~$1,196

At Sonnet 5 intro pricing (through Aug 2026), subtract ~33% from the generation line. The per-tenant daily spend circuit breaker converts these projections into a contractual ceiling: when tripped, the tenant degrades to retrieval-only "closest articles" mode instead of overspending. Lean-tier reference for comparison: ~$31/mo infra, single instance, schemas-as-staging — appropriate until launch traffic is proven.
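A per-tenant daily spend breaker reduces to a counter and a comparison. The sketch below keeps state in memory purely for illustration; in the described deployment that state would more plausibly live in the shared Redis instance so both API nodes see the same total. The names makeSpendBreaker, record, and mode are hypothetical, and the day boundary is assumed to be UTC.

```typescript
type Mode = "full" | "retrieval_only";

/** Tracks spend per tenant per UTC day and reports which mode the tenant should run in. */
function makeSpendBreaker(dailyCapUsd: Map<string, number>) {
  const spent = new Map<string, number>();
  const key = (tenant: string, now: Date): string => `${tenant}:${now.toISOString().slice(0, 10)}`;
  return {
    record(tenant: string, costUsd: number, now: Date): void {
      const k = key(tenant, now);
      spent.set(k, (spent.get(k) ?? 0) + costUsd);
    },
    mode(tenant: string, now: Date): Mode {
      const cap = dailyCapUsd.get(tenant);
      if (cap === undefined) return "retrieval_only"; // unknown tenant: fail closed
      return (spent.get(key(tenant, now)) ?? 0) >= cap ? "retrieval_only" : "full";
    },
  };
}
```

Checking mode before the generation call, and recording cost after it, means a tripped breaker costs at most one query beyond the cap.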

Risk Register

Every risk carries a concrete, testable defense. The four criticals are release-blocking.

Critical · LLM-rephrased Quran/Hadith: the stated project-failure condition (90% of WC articles contain scripture)

Defense — scripture never passes through the model: ingest-time extraction (quote + preceding reference paragraph), [[SCRIPTURE:id]] placeholders, server-side byte substitution, exact-hash output validation, CI byte-equality suite that blocks deploy on mismatch.

Critical · Stale or retracted medical content served after WordPress edits

Defense — 15-min modified_after poller with content-hash gating, explicit publish/unpublish/trash handling with cascade deletes, nightly full reconciliation; webhook for latency, polling as source of truth. Live edits indexed < 20 min.

Critical · Confidently-wrong answers: nearest-neighbor search always returns something

Defense — deterministic gate on the reranker score, calibrated per tenant on ~200 labeled queries; below threshold the LLM is never called. Second gate: structured insufficient_context flag. Adversarial off-topic eval set as a release-blocking metric.

Critical · Cross-tenant leakage: medical chunks under the Islamic-guidance brand, or vice versa

Defense — separate schemas with independent indexes (isolation by construction), RLS backstop, tenant resolved server-side from site key + Origin, per-tenant system prompts, CI cross-tenant probe (100 cross-domain queries → zero foreign post_ids).
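Server-side tenant resolution from site key plus Origin can be sketched as a lookup that fails closed. The TenantConfig shape, the resolveTenant name, and the idea of a per-key origin allowlist are assumptions for illustration; the document states only that the tenant is resolved from X-Site-Key and Origin and never from the request body.

```typescript
// Hypothetical per-tenant configuration keyed by site key.
interface TenantConfig {
  schema: "kb_wc" | "kb_mpa";
  allowedOrigins: string[];
}

/** Returns the tenant only when the site key is known AND the Origin is on its allowlist. */
function resolveTenant(
  registry: Map<string, TenantConfig>,
  siteKey: string | undefined,
  origin: string | undefined,
): TenantConfig | null {
  if (siteKey === undefined || origin === undefined) return null;
  const tenant = registry.get(siteKey);
  if (tenant === undefined) return null;
  return tenant.allowedOrigins.includes(origin) ? tenant : null;
}
```

Because the schema name comes from this lookup rather than from client input, a valid key presented from the other site's origin resolves to nothing instead of to the wrong corpus.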

High · Index pollution: 14,531 MPA drafts, universal Elementor CTAs, tester records, 65k-word outliers

Defense — ingestion gate: published-only, shortcode stripper, junk denylist, block-boundary chunking, per-article top-k cap (max 2–3 chunks via DISTINCT ON / MMR), chunk-count assertion.

High · GDPR Article 9: free-text health queries are sensitive-category data

Defense — pre-launch DPIA, consent capture in the widget, EU residency end-to-end, pseudonymized session-scoped logs (30–90-day raw-query retention), PII scrub before embedding/LLM calls, self-hosted tracing.

High · Public widget as wallet-attack & prompt-injection surface

Defense — Cloudflare in front; per-tenant CORS allowlist; per-IP token bucket (10 req/min) + concurrent-SSE caps; 1,000-char input cap; per-tenant daily spend circuit breaker → retrieval-only mode; Turnstile after ~10 req/session; retrieved chunks framed as untrusted data, zero tools on the generation call; DTO validation that citations come from the retrieved set.
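The per-IP token bucket in that list can be sketched as below, using the 10 requests per minute figure from the defense. State is kept in memory for illustration only; with two API nodes it would need to live in shared storage such as the Redis instance. makeLimiter and the Bucket type are hypothetical names.

```typescript
interface Bucket {
  tokens: number;
  updatedMs: number;
}

/** Returns an allow(ip, nowMs) function implementing a refilling token bucket per IP. */
function makeLimiter(capacity = 10, refillPerMinute = 10) {
  const buckets = new Map<string, Bucket>();
  return function allow(ip: string, nowMs: number): boolean {
    const bucket = buckets.get(ip) ?? { tokens: capacity, updatedMs: nowMs };
    // Refill in proportion to elapsed time, never above capacity.
    const refill = ((nowMs - bucket.updatedMs) / 60000) * refillPerMinute;
    bucket.tokens = Math.min(capacity, bucket.tokens + refill);
    bucket.updatedMs = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    buckets.set(ip, bucket);
    return allowed;
  };
}
```

Passing the clock in as an argument keeps the limiter deterministic under test; a caller in production would supply Date.now().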

Medium · Widget/API contract drift: cached WP embeds outliving deploys

Defense — /v1 versioning with additive-only DTO evolution; tolerant-reader widget; cache-busting embed snippet; WP page-cache exclusion rules documented per site.

Compliance, Safety & QA Gates

Medical tenant (MPA) rails

Pre-retrieval classifier (regex + Haiku 4.5): routes to emergency | crisis | dosage_seeking | normal — the first three get fixed, clinician-approved signposting templates; the LLM is never invoked.
Extractive-first answers — "what our articles say", verbatim passages with attribution; never personalised advice, never diagnosis.
Groundedness rail — sentence-level entailment (Haiku judge) + deterministic rule: any number+unit span (mg, ml, doses, weeks) must appear byte-identical in a retrieved chunk.
Server-appended disclaimer — non-nullable DTO field; wording owned by the client's clinical/legal review; impossible for the model to omit.
Buffered release — MPA answers stream only after all rails pass.
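The deterministic number+unit rule from the groundedness rail lends itself to a short sketch. The unit list below extends the examples given (mg, ml, doses, weeks) with a few assumed extras, and ungroundedNumberSpans is a hypothetical name; a production rule would need a clinically reviewed unit vocabulary.

```typescript
// Number followed by a unit. Units beyond mg, ml, doses, weeks are assumptions.
const NUMBER_UNIT = /\d+(?:[.,]\d+)?\s?(?:mg|mcg|g|ml|doses?|tablets?|days?|weeks?|months?)\b/gi;

/** Returns every number+unit span in the answer that is not byte-identical in some retrieved chunk. */
function ungroundedNumberSpans(answer: string, chunks: string[]): string[] {
  const spans: string[] = answer.match(NUMBER_UNIT) ?? [];
  // String.includes is an exact, case-sensitive substring test, so "500mg" does not satisfy "500 mg".
  return spans.filter(span => !chunks.some(chunk => chunk.includes(span)));
}
```

A non-empty result would block release of the buffered MPA answer and route to the fallback, in line with the fail-closed principle.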

GDPR package (pre-launch)

DPIA covering Article 9 health-query processing
Consent capture in the widget; privacy-policy addendum
EU residency: DB + API + inference path; SCC-backed Anthropic DPA, no-training confirmed
Pseudonymized logs, 30–90-day raw-query retention, PII scrub pre-embedding

Release-blocking QA gates (CI, run on staging)

Gate	Pass condition
Scripture byte-equality	100% exact match on sampled scripture articles — zero tolerance
Golden query set (per tenant)	Recall@8 ≥ target on client-provided questions (15–20/site: simple · overlap · out-of-scope)
Adversarial off-topic set	"Not covered" precision ≥ target; zero confident answers on out-of-domain probes
Cross-tenant probe	100 cross-domain queries → zero foreign post_ids
MPA sensitive-query suite	Disclaimer on 100%; emergency templates fire on crisis probes
Chunk-count assertion	Chunk count lands within tolerance of the sized ~59.4k

Observability

Langfuse (self-hosted): per-request trace — retrieval scores, gate decision, tokens, cost — keyed by requestId
Sentry (errors) · UptimeRobot (uptime) · structured per-tenant/day cost logging feeding the spend circuit breaker

Roadmap — AI-Accelerated Build

Six weeks to full launch with AI-assisted development (a conventional team would quote 10–12). The compression comes from generating the ingestion pipeline, DTO layer, and eval harness against the real exports already in hand — no waiting on sample data.

P0 · Discovery & sign-offs (gate): blockers D1, C1, E1 + Zaheer briefing
P1 · Ingestion & storage (build): schemas, bulk load, scripture extraction
P2 · Retrieval & eval harness (build): hybrid query, rerank, gate calibration
P3 · Generation & safety rails (build): orchestrator, substitution, rails, DTO API
P4 · Widget, SSE & sync (build): WP bridge, abuse stack, staging deploy
P5 · QA & launch (launch): acceptance gates, DPIA, WC → MPA

Fig. 8 — Build timeline. Overlapping phases run as parallel tracks; the two amber bars are client-gated, not engineering-gated.

Phase	Deliverables	Exit criterion
P0 · Discovery	Blocker sign-offs: disclaimer ownership (D1), public vs logged-in (C1), API account ownership (E1); Zaheer technical briefing; WP REST + app-password access confirmed	All blockers resolved in writing
P1 · Ingestion & storage	Tenant schemas migrated; cleaning pipeline (shortcodes, junk, published-only); block-aware chunker; scripture extraction; full bulk load + embeddings	Chunk count ≈ 59.4k; scripture table covers 7,898 WC articles; spot-audit passes
P2 · Retrieval & eval	Hybrid RRF query; reranker; answerability threshold calibrated on ~200 labeled queries/tenant; golden-query CI harness	Recall@8 target on golden set; "not covered" fires correctly on out-of-domain probes
P3 · Generation & rails	Fastify /v1 API with zod DTOs; Sonnet 5 orchestration with cached system prompt; scripture substitution + hash validator; MPA classifier, entailment rail, server disclaimer	Scripture byte-equality suite green; MPA sensitive-query suite green
P4 · Widget, SSE & sync	Typed-event SSE; embeddable widget (fetch + ReadableStream); WP delta endpoint + HMAC webhook (with Zaheer); poller + nightly reconciliation; abuse stack; staging environment	Live edit on WP reflected in index < 20 min; abuse limits verified under load
P5 · QA & launch	Full acceptance run against client test questions (H1); DPIA + consent UI; Wise Compass launch first; MPA launches after clinical sign-off of disclaimer + blocklist	All six CI gates green; client sign-off per tenant

Sequencing note: Wise Compass launches first — its risk profile is scripture-integrity, which is fully machine-verifiable. MPA follows once the human-owned items (disclaimer wording, topic blocklist, DPIA) clear clinical/legal review — those are calendar-bound, not engineering-bound.

Investment & Quote

One fixed price for the complete, launched system — both knowledge-base assistants, live on your sites. No hourly billing, no surprises.

Fixed-price build · one-time

$9,750

Design, build, testing, and launch of both assistants — delivered end to end.

⏱ 6–7 weeks to launch

Everything included

Two separate, isolated assistants — Wise Compass & My Patient Advice
Answers drawn only from your published articles, each linked to its source
Quran & Hadith reproduced word-for-word, never paraphrased
Medical disclaimer & safety guardrails on the health assistant
Embeddable chat widget for both WordPress sites
Automatic sync — new and edited articles picked up continuously
EU-hosted for GDPR · staging & production · acceptance-tested against your own questions

Fixed-price scope. Optional later enhancements — multi-turn conversation, Arabic-query tuning, high-availability — are quoted separately as Phase 2.

Payment milestones

MILESTONE 1 · 30%

$2,925

Project start

Kick-off, discovery sign-off, and secure ingestion of both knowledge bases into the search index.

MILESTONE 2 · 40%

$3,900

Working system

The assistant answers live from your content with correct source links — demonstrated to you before release.

MILESTONE 3 · 30%

$2,925

Launch acceptance

All quality & safety gates green — scripture accuracy, medical disclaimer, "not covered" handling — live on both sites.

After launch — running costs

Item	Basis	Typical
Cloud infrastructure — EU-hosted database, API & sync	Fixed monthly, accounts in your name	~$119/mo
AI usage — answering questions	Pay-as-you-go, ~$0.01–0.02 per question, with a spend cap you set	usage-based
Support & maintenance	Optional retainer — monitoring, updates, priority fixes	from $300/mo

Quote valid 30 days. Running costs are billed to your own provider accounts for full spend visibility — we set everything up and hand over the keys.

Open Items

Must resolve before build

Blocker Corpus discrepancy: the brief claims 100k+/70k+ articles; the real exports hold 8.7k (Wise Compass) and 34.5k (MPA, of which 19,977 are published). All costs here use the real corpus; confirm whether the exports are partial.
Blocker D1 — medical disclaimer wording from the client's clinical/legal review; we inject it, we don't author it.
Blocker C1 — public vs logged-in per site: drives abuse budget and MPA liability posture.
Blocker E1 — API account ownership (Anthropic + Voyage): client-owned or agency-managed with pass-through billing.
High B1/B2 — WP REST access + webhook: app passwords on both sites; Zaheer implements the delta endpoint and save hook.
High H1 — test questions: 15–20/site incl. 3–5 verbatim-scripture and 3–5 sensitive-medical probes — the backbone of the QA gates.
High D4 — MPA topic blocklist (dosages, mental-health crisis, emergencies) needs clinical sign-off.

Working assumptions

WordPress remains the source of truth; the RAG store holds embeddings + metadata + verbatim scripture only.
English-dominant queries at launch; Arabic input works via the multilingual embedder but is not quality-guaranteed until eval'd (C7).
Single-turn Q&A at launch; multi-turn is a phase-2 flag already present in the DTO.
Scales to 170k articles / ~350k vectors with no structural change — same HNSW, bigger DB compute.
Prices are July 2026 USD list ex-VAT; Sonnet 5 intro pricing expires 31 Aug 2026 — both figures quoted throughout.

Deliberately excluded

Dedicated vector DB (Pinecone/Weaviate) — $600–1,200/yr for a network hop and a dual-write problem at this scale.
Fine-tuning — grounding quality here is a retrieval problem, not a model problem.
True multi-AZ Postgres HA — deferred until traffic justifies ~$180+/mo.