Phase C: Vector RAG
Status: ✅ Completed 30 April 2026 (15 tasks: spine + expand + observability)
Duration: 1 day (29 April → 30 April)
Key artifacts: chunks table, hybrid retrieval, Cohere Rerank 4.0, retrieval_log
Phase goal
Build a production-ready vector RAG pipeline:
- pgvector 0.8 + halfvec(1536) embedding storage
- Worker embed_target stage, building on Phase B (5-stage pipeline)
- Hybrid retrieval: vector (HNSW) + BM25 (czech_unaccent) + RRF fusion
- Cohere Rerank 4.0 cross-encoder
- Conditional HyDE for short queries
- Per-source diversity in folder scope
- Retrieval observability (retrieval_log)
- Golden set ready for the Phase C.12 eval (deferred to the next session)
Architectural decisions
C1: Embedding model = Gemini Embedding 2 GA
Released 2026-03-10 as a preview; GA around 04/2026.
Properties:
- Natively multimodal (text + image + video + audio + PDF in a unified embedding space)
- 3072 native dimensions
- MTEB multilingual #1, V-NQ 93.4 for text-image, trained on Czech
- ENV-driven model ID (EMBEDDING_MODEL) for a future re-embed flow
C2: halfvec(1536), Matryoshka truncation
User decision (30 April): storage over quality.
Reasons:
- Build time is ~2× faster than with the full 3072 dimensions
- Storage is 4× smaller (~9 GB at 3M chunks instead of 36 GB for float32 vectors at 3072 dimensions)
- MTEB recall is roughly the same thanks to Gemini-2's Matryoshka training
- Recoverable through the re-embed worker if we ever want to upgrade to full quality
Implementation: Gemini returns 3072 dimensions → truncate [:1536] → L2-normalize (required after truncation, because the truncated vector is no longer unit length and similarity scores that assume unit vectors would be skewed) → halfvec literal [v1,v2,...].
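The truncate → normalize → literal chain can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `to_halfvec_literal` and the six-decimal formatting are assumptions.

```python
import math


def to_halfvec_literal(embedding: list[float], dims: int = 1536) -> str:
    """Truncate a Matryoshka embedding, L2-normalize it, format it as [v1,v2,...]."""
    if len(embedding) < dims:
        raise ValueError(f"expected at least {dims} dimensions, got {len(embedding)}")
    truncated = embedding[:dims]
    # Truncation drops part of the vector's length, so re-normalize to unit length.
    norm = math.sqrt(sum(v * v for v in truncated))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    # pgvector accepts vectors as a text literal of the form [v1,v2,...].
    return "[" + ",".join(f"{v / norm:.6f}" for v in truncated) + "]"


# Tiny worked case: [3, 4] has norm 5, so the unit vector is [0.6, 0.8].
print(to_halfvec_literal([3.0, 4.0, 100.0], dims=2))  # [0.600000,0.800000]
```

Six decimals is more precision than a 16-bit float can hold; the database rounds on insert.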
C3: Single chunks table with inline embedding
No separate embeddings table. Reasons:
- No JOIN at retrieval time
- Simpler FK cascade
- Simpler re-embed migration
C4: czech_unaccent BM25 (simple + unaccent)
ispell_czech would require self-hosted Postgres (it is blocked on managed Supabase), which means operational overhead. The vector leg covers semantics; BM25 with unaccent covers keyword matches.
Known trade-off: ~3-8 % recall loss on keyword-heavy queries (no stemmer → "občanská" ≠ "občanskou"). Mitigated by:
1. Cohere Rerank 4.0 does the heavy lifting
2. Prefix wildcards (obcansk:*) in the _build_prefix_tsquery() helper reduce the loss
C5: Worker stage placement
embed_target runs between annotated and ready (NOT after finalize). Reason: if chunks were missing, the report would be marked ready but retrieval would fail, which would break chat.
C6: Multimodal figure embedding (priority)
Explicit user decision (29 April): we are going multimodal; it is the key value-add of Phase C.
Implementation:
- Filter should_embed_figure(fig): annotation ≥ 80 chars + entities + image_type ∉ {decorative, logo, footer}
- 1 chunk per qualifying figure
- text leg = build_figure_text(fig) (caption + summary + entities + observations)
- image leg = raw bytes from the nemoreport-figures bucket
- Combined Gemini-2 multimodal call → unified vector
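The figure filter above could look roughly like this. The `Figure` dataclass and its field names are hypothetical stand-ins for the real figure record, and "+ entities" is read here as "at least one entity", which is an assumption.

```python
from dataclasses import dataclass, field

EXCLUDED_IMAGE_TYPES = {"decorative", "logo", "footer"}
MIN_ANNOTATION_CHARS = 80


@dataclass
class Figure:
    # Hypothetical shape; the real figure record may use different field names.
    annotation: str = ""
    entities: list[str] = field(default_factory=list)
    image_type: str = "chart"


def should_embed_figure(fig: Figure) -> bool:
    """Return True only for figures worth a multimodal embedding call."""
    return (
        len(fig.annotation.strip()) >= MIN_ANNOTATION_CHARS
        and bool(fig.entities)
        and fig.image_type not in EXCLUDED_IMAGE_TYPES
    )
```

Keeping the filter a pure function of the figure record makes it cheap to unit-test and to re-tune without touching the worker.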
C7: Soft-fail (D7)
If embed_target fails → parsed_metadata.embedding_status='failed', but the report still moves to status='ready'. Phase D chat falls back to a full-report dump for reports without chunks.
The same pattern applies to the Cohere rerank: if the API is down or rate-limited, fall back to the hybrid RRF order and return reranked=false in the response.
C8: Hybrid retrieval, RRF fusion (k=60)
Postgres function hybrid_search_chunks_by_folder:
WITH vector_leg AS (
    SELECT id, row_number() OVER (ORDER BY embedding <=> p_query_vec)::int AS rank
    FROM chunks
    WHERE report_id = p_report_id AND embedding IS NOT NULL
    ORDER BY embedding <=> p_query_vec
    LIMIT p_pre_fusion_n
),
bm25_leg AS (
    SELECT id, row_number() OVER (ORDER BY ts_rank_cd(tsv, q) DESC)::int AS rank
    FROM chunks
    WHERE report_id = p_report_id AND tsv @@ q
    ORDER BY ts_rank_cd(tsv, q) DESC
    LIMIT p_pre_fusion_n
),
fusion AS (
    SELECT COALESCE(v.id, b.id) AS id,
           (COALESCE(1.0/(60 + v.rank), 0.0) +
            COALESCE(1.0/(60 + b.rank), 0.0))::float AS rrf_score
    FROM vector_leg v
    FULL OUTER JOIN bm25_leg b ON v.id = b.id
)
SELECT ... FROM fusion ORDER BY rrf_score DESC LIMIT p_top_k;
RRF consistently outperforms weighted fusion in benchmarks and needs no weight tuning (k=60 is the standard constant).
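To make the fusion arithmetic concrete, here is a small Python sketch of the same reciprocal rank fusion. It mirrors the SQL above but is not the production code; `rrf_fuse` and its arguments are illustrative names.

```python
def rrf_fuse(
    vector_ids: list[str],
    bm25_ids: list[str],
    k: int = 60,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Fuse two ranked ID lists; each leg contributes 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = {}
    for leg in (vector_ids, bm25_ids):
        for rank, chunk_id in enumerate(leg, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # A chunk missing from one leg simply gets no contribution from it (the COALESCE in SQL).
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


fused = rrf_fuse(["a", "b", "c", "d", "e"], ["a"])
print(round(fused[0][1], 4))       # 0.0328 -> rank 1 in both legs: 2 / 61
print(round(dict(fused)["e"], 4))  # 0.0154 -> vector rank 5 only: 1 / 65
```

These two values match the rrf scores reported in the E2E test for a chunk ranked first by both legs and for a chunk found only at vector rank 5.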
C9: Cohere Rerank 4.0 (ENV-flagged)
Released 2026-04-06. rerank-v4.0-pro (default) / rerank-v4.0-fast (cost fallback). 32K context, 100+ languages incl. Czech, $0.0025/search.
Flow:
- If COHERE_RERANK_ENABLED=true: fetch top_k * 4 from hybrid → Cohere → top_k
- Graceful fallback to the hybrid RRF order on API outage, rate limit, or timeout (8 s)
- ENV COHERE_RERANK_MODEL switches between pro and fast
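The fallback flow can be sketched without committing to any SDK. In this sketch the reranker is an injected async callable (a hypothetical stand-in for the Cohere wrapper in rerank.py) that returns candidate indices in relevance order; the function name and the return shape are assumptions.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical reranker signature: (query, documents) -> indices, best first.
Reranker = Callable[[str, list[str]], Awaitable[list[int]]]


async def rerank_or_fallback(
    query: str,
    candidates: list[str],
    top_k: int,
    rerank: Reranker,
    enabled: bool = True,
    timeout_s: float = 8.0,
) -> tuple[list[str], bool]:
    """Return (chunks, reranked). `candidates` must already be in hybrid RRF order."""
    if enabled:
        try:
            order = await asyncio.wait_for(rerank(query, candidates), timeout=timeout_s)
            return [candidates[i] for i in order[:top_k]], True
        except Exception:
            # API down, rate limit, or timeout: keep the RRF order instead of failing.
            pass
    return candidates[:top_k], False
```

The boolean in the result maps onto the reranked flag in the response, so callers can tell which ordering they received.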
C10: Conditional HyDE
Activates when: query < 4 words OR scope=multi_report.
Usage: gemini-3-flash-preview generates a 2-4 sentence hypothetical answer → that text is embedded instead of the query → better recall on sparse text.
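The activation rule is small enough to show directly. This sketch assumes the scope is passed as its type string and that words are whitespace-separated; the real should_use_hyde(query, scope) may take a richer scope object.

```python
def should_use_hyde(query: str, scope_type: str, min_words: int = 4) -> bool:
    """HyDE is used for short queries or for multi-report scope."""
    word_count = len(query.split())
    return word_count < min_words or scope_type == "multi_report"


print(should_use_hyde("povodne", "folder"))  # True: 1 word
print(should_use_hyde("občanská vybavenost obchody dostupnost", "folder"))  # False: 4 words
```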
C11: Per-source diversity in folder scope
If the folder contains > 1 source_type and top_k ≥ 4, post-process:
- Detect the dominant and the missing types
- Swap the lowest-scored chunk of the dominant type for the highest-scored chunk of a missing type
- Cap at 2 swaps (so that irrelevant chunks are not promoted)
Use case: a folder has a main NemoReport + 2 attachments. Without diversity, the top 5 would all come from the main report. With diversity, the AI sees cross-source context.
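One way to read the swap rule is sketched below. The `Chunk` shape, the `diversify` name, and the source_type values are hypothetical, and the sketch infers "folder contains > 1 source_type" from the candidate pool, which is an assumption about where that information comes from.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    id: str
    source_type: str
    score: float


def diversify(candidates: list[Chunk], top_k: int, max_swaps: int = 2) -> list[Chunk]:
    """Swap low-scored chunks of the dominant type for the best chunks of missing types."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    top, rest = ranked[:top_k], ranked[top_k:]
    if top_k < 4 or len({c.source_type for c in ranked}) < 2:
        return top
    for _ in range(max_swaps):
        present = Counter(c.source_type for c in top)
        missing = [c for c in rest if c.source_type not in present]
        if not missing:
            break
        dominant = present.most_common(1)[0][0]
        if present[dominant] < 2:
            break  # never remove the only chunk of a type
        newcomer = missing[0]  # rest is sorted, so this is the best missing-type chunk
        victim = min((c for c in top if c.source_type == dominant), key=lambda c: c.score)
        top[top.index(victim)] = newcomer
        rest.remove(newcomer)
    return top
```

With five main-report chunks ahead of two attachment chunks, the two weakest main chunks are replaced and the three strongest stay in place.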
Key bugs / hotfixes
Bug C.1: Type mismatch in RRF score
Postgres inferred numeric for the 1.0 / (k + rank) division instead of float. The function was declared RETURNS TABLE(... rrf_score float) → runtime error "Returned type numeric does not match expected type double precision in column 21".
Fix: explicit ::float cast: (... )::float AS rrf_score.
Bug C.2: czech_unaccent declension matching
After the C.13 backfill of 290 chunks, the BM25 leg returned 0 results for "občanská vybavenost" because the TSV contained only "obcanskou" (accusative), not "obcanska" (nominative).
Fix: migration 0017. The _build_prefix_tsquery() helper splits the query into tokens, applies the :* prefix wildcard, and ORs them. websearch_to_tsquery does not support the ':*' syntax, so the helper builds a raw tsquery string.
'občanská vybavenost obchody'
→ 'obcanska:* | vybavenost:* | obchody:*'
→ matches 'obcanskou', 'obcanska', 'obcanskemu', ...
OR (rather than AND) maximizes recall for RRF: chunks matching any token enter the BM25 leg, and ts_rank_cd internally down-ranks them according to the number of matched tokens.
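The string transformation can be illustrated in Python. The real helper lives in SQL in migration 0017, so this is a sketch of the idea only. A prefix term such as `obcanska:*` matches only lexemes that begin with the whole token, so covering forms like `obcanskou` needs a shorter stem, as in the `obcansk:*` example from C4. The `trim` and `min_stem` parameters below are hypothetical and show one way to get there.

```python
import re
import unicodedata


def unaccent(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_prefix_tsquery(query: str, trim: int = 0, min_stem: int = 4) -> str:
    """Split into tokens, optionally trim trailing characters, add :* and OR the terms."""
    tokens = re.findall(r"[a-z0-9]+", unaccent(query.lower()))
    terms = []
    for token in tokens:
        # Hypothetical crude stemming: drop `trim` trailing characters from long tokens.
        if trim and len(token) - trim >= min_stem:
            token = token[:-trim]
        terms.append(f"{token}:*")
    return " | ".join(terms)


print(build_prefix_tsquery("občanská vybavenost obchody"))
# obcanska:* | vybavenost:* | obchody:*
print(build_prefix_tsquery("občanská vybavenost obchody", trim=1))
# obcansk:* | vybavenos:* | obchod:*
```

Restricting tokens to `[a-z0-9]+` also keeps tsquery operators in user input out of the raw query string.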
Stage 4: embed_target worker stage (per Phase C C.4)
Runs between annotated and ready. Per-target (kind ∈ {report, attachment}). Idempotent.
@broker.task(retry_on_error=True, max_retries=3)
async def embed_target(target_type: str, target_id: str):
    target = _fetch_target(...)
    _set_target_status(target, "embedding")

    # Skip path: GEMINI_API_KEY is missing → soft-fail
    if not embedding_available():
        db.mark_embedding_status(target.kind, target.target_id, "skipped", reason="...")
        return {"stage": "embed", "ok": True, "skipped": True}

    # Idempotence: delete existing chunks before re-inserting
    db.delete_chunks_for_target(target.report_id, target.attachment_id)

    # Load + chunk
    sections = db.list_chunkable_sections(target.report_id, target.attachment_id)
    tables = db.list_chunkable_tables(target.report_id, target.attachment_id)
    figures = db.list_chunkable_figures(target.report_id, target.attachment_id)
    chunk_specs = chunk_target(sections, tables, figures, ...)

    # Embed batch
    provider = get_embedding_provider()
    rows = []
    for spec in chunk_specs:
        if spec.expected_embedding_type == "multimodal":
            img_bytes = await storage.download_bytes(...)
            result = await provider.embed_multimodal(spec.content, img_bytes, mime, "RETRIEVAL_DOCUMENT")
        else:
            result = await provider.embed_text(spec.content, "RETRIEVAL_DOCUMENT")
        rows.append(_chunk_spec_to_row(target, spec, result))

    db.insert_chunks(rows)
    # total_cost_cents: summed over the embedding calls (accumulation elided in this excerpt)
    _add_target_cost(db, target, total_cost_cents)
    db.mark_embedding_status(target.kind, target.target_id, "ok")
/retrieve endpoint
POST /retrieve
{
  "query": "občanská vybavenost obchody dostupnost",
  "scope": { "type": "folder", "report_id": "uuid" },
  "top_k": 5
}
Stages:
- Tenant scope verify (pre-flight): the report must belong to the user's tenant
- Standalone rewrite (a no-op in the Phase C MVP; Phase D will add a full LLM rewrite)
- Conditional HyDE: should_use_hyde(query, scope) → if yes, generate a hypothetical document
- Embed query: embed_text(input, RETRIEVAL_QUERY) (asymmetric: queries are embedded with a different task type than documents, which use RETRIEVAL_DOCUMENT)
- Hybrid retrieval RPC: fetch top_k×4 candidates with vector_rank, bm25_rank, rrf_score
- Cohere Rerank (optional, ENV-flagged): top_k×4 → top_k
- Per-source diversity (folder + top_k ≥ 4)
- Insert retrieval_log (best-effort observability)
- Return chunks with rerank_score, vector_rank, bm25_rank, rrf_score
End-to-end latency (from real measurements):
- Long query (5 words, no HyDE): ~1085 ms (embed 360 + retrieve 58 + rerank 667)
- Short query (1 word, HyDE active): ~700 ms (embed 287 + rerank 350)
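Per-stage timings like embed_ms and rerank_ms, together with a best-effort log insert, can be collected with very little code. The sketch below is an assumption about how such instrumentation might look; `timed` and `log_best_effort` are hypothetical helpers, not the service's actual functions.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(timings: dict[str, int], stage: str) -> Iterator[None]:
    """Record the wall-clock duration of a stage as `<stage>_ms`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[f"{stage}_ms"] = round((time.perf_counter() - start) * 1000)


def log_best_effort(insert_log: Callable[[dict], object], record: dict) -> bool:
    """Try to persist a retrieval_log record; never let a logging failure break retrieval."""
    try:
        insert_log(record)
        return True
    except Exception:
        return False


timings: dict[str, int] = {}
with timed(timings, "embed"):
    time.sleep(0.01)  # stand-in for the embedding call
log_best_effort(print, timings)
```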
C.13: Re-embed existing reports (backfill)
After deploy, I ran the backfill for all 19 ready reports without chunks:
while IFS= read -r rid; do
  curl -X POST "https://nemoreport-ai-backend-v2.sliplane.app/admin/embed/report/${rid}" \
    -H "X-Admin-Hash: ..."
done < report_ids.txt
The worker processed 19/19 + 1 from earlier = 20 reports with chunks, 290 chunks in total (238 text + 51 multimodal), cost 2.81 CZK.
The status was flipped from embedding → ready with a SQL UPDATE (the admin endpoint does not call finalize_target):
UPDATE nemoreport.reports
SET status = 'ready',
    ingestion_finished_at = COALESCE(ingestion_finished_at, NOW())
WHERE status = 'embedding'
  AND id IN (SELECT DISTINCT report_id FROM nemoreport.chunks);
E2E test (after deploy)
Live test through the admin diagnostic endpoint POST /admin/retrieve/{report_id}:
Query: "občanská vybavenost obchody dostupnost služeb"
Result:
- Top 1: rrf=0.0328, vector_rank=1, bm25_rank=1 (both legs agree), rerank_score=0.7151; Cohere kept it at #1
- Top 2: rrf=0.0154, vector_rank=5, bm25_rank=null, rerank_score=0.6545; Cohere promoted it from vector #5 because it contains "Dostupnost služeb v místě"
- Total: 1085 ms (embed 360 + retrieve 58 + rerank 667)
Query: "povodne" (1 word → HyDE active)
- used_hyde=true, rewritten_query=383 chars (HyDE generated a 2-4 sentence hypothetical answer)
- embed_ms=287, rerank_ms=350
pgTAP tests (26 new)
tests/db/06_chunks.sql: 26/26 ✓
- RLS visibility User A/B + cross-tenant deny (T1-T3)
- Defense-in-depth grants (T4-T8) — authenticated SELECT-only, service_role full CRUD
- FK cascade from attachments + reports + cross-tenant isolation on delete (T9-T12)
- Check constraints: content_type / source_type / embedding_type reject invalid values (T13-T15)
- Generated tsvector strips diacritics (T16): povodně → povodne
- Existence of all indexes incl. partial conditions (T17-T21)
- pgvector + unaccent extensions installed (T22-T23)
- czech_unaccent ts config exists (T24)
- chunks is NOT in the Realtime publication (T25; Phase D pulls, not live)
- Exactly 1 RLS policy (T26)
The sweep test verifies that all 21 nemoreport tables (incl. chunks) grant service_role full CRUD.
Total pgTAP suite after Phase C: 134 tests.
What was created
Migrations (5 new)
- 0014 chunks setup (pgvector + unaccent + czech_unaccent ts config + chunks table + HNSW)
- 0015 search_chunks_by_folder RPC (vector-only, MVP)
- 0016 hybrid_search_chunks_by_folder RPC (BM25 + vector + RRF)
- 0017 hybrid_search_prefix_match (_build_prefix_tsquery helper, declension fix)
- 0018 retrieval_log
Backend Python modules
- app/embedding/: EmbeddingProvider ABC, GeminiEmbedding2Provider, EmbeddingResult
- app/retrieval/: service.py (orchestration), schemas.py (Pydantic), rerank.py (Cohere wrapper), rewrite.py (HyDE + standalone)
- app/ingestion/chunking.py: pure helpers (chunk_target, should_embed_figure, build_figure_text)
- app/routers/retrieve.py: POST /retrieve
Worker stages
- embed_target added between annotate_target and finalize_target
- 6 new DB methods (list_chunkable_*, delete_chunks_for_target, insert_chunks, mark_embedding_status, insert_retrieval_log)
Admin diagnostic endpoints
- POST /admin/embed/{kind}/{id}: backfill for existing reports
- POST /admin/retrieve/{report_id}: bypasses JWT for testing
What Phase C leaves for Phase D
- POST /retrieve is production-ready and returns top-K chunks with rerank_score
- retrieval_log captures all calls → ready for A/B variant comparison
- Chunks carry denormalized source_type + attachment_filename + source_label → the AI can cite per chunk without a JOIN
- 21 reports with chunks in production for testing the chat flow
Deferred from Phase C
- C.9 image resize (1568 px long edge before the multimodal embed): the current raw bytes (~50-300 KB) work; this is an optimization for production
- C.12 golden set + eval harness: 40-60 annotated queries over 13 MHTML reports; recall@10 / MRR / faithfulness / p95 latency; ~30-60 min of work; deferred to the next session
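For the deferred eval harness, recall@10 and MRR are straightforward to compute once each golden query has a set of relevant chunk IDs. The sketch below assumes that data shape and uses illustrative function names; faithfulness and p95 latency need other tooling and are not covered.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Share of the relevant chunks that appear in the first k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    """Average both metrics over (retrieved_ids, relevant_ids) pairs."""
    if not golden:
        raise ValueError("golden set is empty")
    n = len(golden)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in golden) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in golden) / n,
    }


golden = [(["a", "b", "c"], {"b", "z"}), (["x", "y"], {"x"})]
print(evaluate(golden))  # {'recall@10': 0.75, 'mrr': 0.75}
```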
Known limitations
- Cohere trial: ~1000 calls/month, 10 RPM. Fine for pilot/dev; production needs a production key.
- No stemmer: czech_unaccent is simple + unaccent → 3-8 % recall loss on keyword-heavy queries. Prefix wildcards reduce it and the Cohere rerank compensates, but ispell_czech would be better (requires self-hosted Postgres).
- Phase D chat does not use /retrieve yet: /chat sends the full parsed_markdown (Phase D rewrite TBD).