Phase C — Vector RAG

Status: ✅ Completed 2026-04-30 (15 tasks: spine + expand + observability)
Duration: 1 day (2026-04-29 → 2026-04-30)
Key artifacts: chunks table, hybrid retrieval, Cohere Rerank 4.0, retrieval_log

Phase goal

Build a production-ready vector RAG pipeline:

  1. pgvector 0.8 + halfvec(1536) embedding storage
  2. Worker embed_target stage following on from Phase B (5-stage pipeline)
  3. Hybrid retrieval: vector (HNSW) + BM25 (czech_unaccent) + RRF fusion
  4. Cohere Rerank 4.0 cross-encoder
  5. Conditional HyDE for short queries
  6. Per-source diversity in folder scope
  7. Retrieval observability (retrieval_log)
  8. Golden set ready for the Phase C.12 eval (deferred to the next session)

Architectural decisions

C1: Embedding model = Gemini Embedding 2 GA

Released 2026-03-10 as a preview; GA since roughly April 2026.

Properties:

  • Native multimodal (text + image + video + audio + PDF in a unified embedding space)
  • 3072 native dimensions
  • MTEB multilingual #1, V-NQ 93.4 for text-image, trained on Czech
  • ENV-driven model ID (EMBEDDING_MODEL) for a future re-embed flow

C2: halfvec(1536) — Matryoshka truncation

User decision (2026-04-30): storage over quality.

Rationale:

  • Build time ~2× faster than with the full 3072 dimensions
  • Storage 4× smaller (~9 GB for 3M chunks instead of ~36 GB)
  • MTEB recall stays about the same thanks to the Matryoshka training of Gemini-2
  • Recoverable via the re-embed worker if we later want to upgrade to full quality
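The storage figures can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes the 36 GB baseline refers to float32 components at 3072 dimensions and the 9 GB figure to float16 (halfvec) components at 1536 dimensions; only the raw vector payload is counted.

```python
# Raw embedding payload only: row headers, the HNSW index, and TOAST overhead are ignored.
CHUNKS = 3_000_000


def payload_gb(dims: int, bytes_per_component: int, rows: int = CHUNKS) -> float:
    """Raw embedding payload in decimal gigabytes."""
    return rows * dims * bytes_per_component / 1e9


full_gb = payload_gb(3072, 4)  # float32 components, as in vector(3072)
half_gb = payload_gb(1536, 2)  # float16 components, as in halfvec(1536)

print(f"vector(3072):  {full_gb:.1f} GB")
print(f"halfvec(1536): {half_gb:.1f} GB")
print(f"ratio: {full_gb / half_gb:.1f}x")
```

Half of the saving comes from halving the dimensions and the other half from halving the bytes per component.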

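Implementation: Gemini returns 3072 dimensions → truncate to [:1536] → L2-normalize (mandatory after truncation: the truncated vector is no longer unit-length, which skews any similarity computed as a plain dot product) → halfvec literal [v1,v2,...].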
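The truncate → normalize → literal steps can be sketched in plain Python as follows. The function names and the six-decimal formatting are illustrative assumptions, not the project's actual helpers.

```python
import math
from typing import List, Sequence

TARGET_DIMS = 1536


def truncate_and_normalize(embedding: Sequence[float], dims: int = TARGET_DIMS) -> List[float]:
    """Keep the first `dims` components and rescale them to unit L2 norm."""
    if len(embedding) < dims:
        raise ValueError(f"expected at least {dims} dimensions, got {len(embedding)}")
    head = list(embedding[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def to_halfvec_literal(vector: Sequence[float]) -> str:
    """Format a vector as the '[v1,v2,...]' text form that pgvector parses."""
    return "[" + ",".join(f"{x:.6f}" for x in vector) + "]"
```

For example, `to_halfvec_literal(truncate_and_normalize([3.0, 4.0, 12.0], dims=2))` keeps `[3.0, 4.0]`, divides by the norm 5, and yields `[0.600000,0.800000]`.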

C3: Single chunks table with inline embedding

No separate embeddings table. Reasons:

  • No JOIN at retrieval time
  • Simpler FK cascade
  • Simpler re-embed migration

C4: czech_unaccent BM25 (simple + unaccent)

ispell_czech would require self-hosted Postgres (it is blocked on managed Supabase), which means operational overhead. The vector leg covers semantics; BM25 with unaccent covers keyword matching.

Known trade-off: ~3-8 % recall loss on keyword-heavy queries (no stemmer, so "občanská" ≠ "občanskou"). Mitigations:

  1. Cohere Rerank 4.0 does the heavy lifting
  2. Prefix wildcards (obcansk:*) in the _build_prefix_tsquery() helper reduce the loss

C5: Worker stage placement

embed_target runs between annotated and ready (NOT after finalize). Reason: if chunks were missing, the report would be marked ready but retrieval would fail, which would break chat.

C6: Multimodal figure embedding (priority)

Explicit user decision (2026-04-29): we go multimodal; it is the key value-add of Phase C.

Implementation:

  • Filter should_embed_figure(fig): annotation ≥ 80 chars + entities + image_type ∉ {decorative, logo, footer}
  • 1 chunk per qualifying figure
  • Text leg = build_figure_text(fig) (caption + summary + entities + observations)
  • Image leg = raw bytes from the nemoreport-figures bucket
  • Combined Gemini-2 multimodal call → unified vector
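A minimal sketch of what the figure filter described above might look like. The `Figure` dataclass and its field names are hypothetical stand-ins for the real figure record, and "+ entities" is read here as "at least one entity".

```python
from dataclasses import dataclass, field
from typing import List

MIN_ANNOTATION_CHARS = 80
SKIPPED_IMAGE_TYPES = frozenset({"decorative", "logo", "footer"})


@dataclass
class Figure:
    # Hypothetical shape; the real record likely carries more fields.
    annotation: str = ""
    entities: List[str] = field(default_factory=list)
    image_type: str = "chart"


def should_embed_figure(fig: Figure) -> bool:
    """True only for figures with enough annotated substance to be worth a chunk."""
    return (
        len(fig.annotation.strip()) >= MIN_ANNOTATION_CHARS
        and len(fig.entities) > 0
        and fig.image_type not in SKIPPED_IMAGE_TYPES
    )
```

Filtering before the multimodal call keeps logos and decorative images out of the index and avoids paying for embeddings that could never be retrieved usefully.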

C7: Soft-fail (D7)

If embed_target fails, parsed_metadata.embedding_status is set to 'failed', but the report still moves to status='ready'. Phase D chat falls back to a full-report dump for reports without chunks.

The same pattern applies to the Cohere rerank: if the API is down or rate-limited, retrieval falls back to the hybrid RRF order and returns reranked=false in the response.

C8: Hybrid retrieval — RRF fusion (k=60)

Postgres function hybrid_search_chunks_by_folder:

WITH vector_leg AS (
  SELECT id, row_number() OVER (ORDER BY embedding <=> p_query_vec)::int AS rank
  FROM chunks
  WHERE report_id = p_report_id AND embedding IS NOT NULL
  ORDER BY embedding <=> p_query_vec
  LIMIT p_pre_fusion_n
),
bm25_leg AS (
  SELECT id, row_number() OVER (ORDER BY ts_rank_cd(tsv, q) DESC)::int AS rank
  FROM chunks WHERE report_id = p_report_id AND tsv @@ q
  ORDER BY ts_rank_cd(tsv, q) DESC
  LIMIT p_pre_fusion_n
),
fusion AS (
  SELECT COALESCE(v.id, b.id) AS id,
         (COALESCE(1.0/(60 + v.rank), 0.0) +
          COALESCE(1.0/(60 + b.rank), 0.0))::float AS rrf_score
  FROM vector_leg v FULL OUTER JOIN bm25_leg b ON v.id = b.id
)
SELECT ... FROM fusion ORDER BY rrf_score DESC LIMIT p_top_k;

In benchmarks RRF consistently outperforms weighted fusion, and it needs no per-leg weight tuning (k=60 is the standard constant).
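The fusion step of the SQL above can be mirrored in a few lines of Python, which is handy for checking scores by hand. This is an illustrative sketch, not the production code path; `rrf_fuse` and the chunk IDs are made up for the example.

```python
from typing import Dict, List, Tuple

RRF_K = 60


def rrf_fuse(vector_ids: List[str], bm25_ids: List[str], k: int = RRF_K) -> List[Tuple[str, float]]:
    """Reciprocal Rank Fusion: each leg contributes 1 / (k + rank), ranks starting at 1."""
    scores: Dict[str, float] = {}
    for ranked in (vector_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # A chunk missing from one leg simply gets no contribution from it,
    # which is what the COALESCE(..., 0.0) over the FULL OUTER JOIN does.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = dict(rrf_fuse(["a", "b", "c", "d", "e"], ["a", "x"]))
print(round(fused["a"], 4))  # rank 1 in both legs: 2/61
print(round(fused["e"], 4))  # vector rank 5 only: 1/65
```

A chunk ranked first by both legs scores 2/61 ≈ 0.0328, and a chunk found only by the vector leg at rank 5 scores 1/65 ≈ 0.0154.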

C9: Cohere Rerank 4.0 (ENV-flagged)

Released 2026-04-06. rerank-v4.0-pro (default) / rerank-v4.0-fast (cost fallback). 32K context, 100+ languages including Czech, $0.0025/search.

Flow:

  • If COHERE_RERANK_ENABLED=true: fetch top_k * 4 from hybrid → Cohere → top_k
  • Graceful fallback to the hybrid RRF order on API down / rate limit / timeout (8 s)
  • ENV COHERE_RERANK_MODEL switches between pro and fast
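The fallback behaviour in the flow above can be sketched independently of the Cohere SDK. In this sketch `rerank_fn` is a hypothetical injected coroutine that returns candidate indices in reranked order; the real wrapper in rerank.py may look different.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical contract: (query, documents) -> indices into `documents`, best first.
RerankFn = Callable[[str, List[str]], Awaitable[List[int]]]


async def rerank_or_fallback(
    query: str,
    candidates: List[str],
    top_k: int,
    rerank_fn: RerankFn,
    enabled: bool = True,
    timeout_s: float = 8.0,
) -> Tuple[List[str], bool]:
    """Return (chunks, reranked). `candidates` is already in hybrid RRF order."""
    if not enabled:
        return candidates[:top_k], False
    try:
        order = await asyncio.wait_for(rerank_fn(query, candidates), timeout=timeout_s)
    except Exception:
        # API down, rate limit, or timeout: keep the RRF order and flag it.
        return candidates[:top_k], False
    return [candidates[i] for i in order[:top_k]], True
```

Returning the `reranked` flag alongside the chunks lets the caller surface reranked=false in the response instead of failing the request.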

C10: HyDE conditional

Activates when the query has fewer than 4 words OR scope=multi_report.

Usage: gemini-3-flash-preview generates a 2-4 sentence hypothetical answer → that text is embedded instead of the query → better recall on sparse text.
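The activation rule is simple enough to express directly. The sketch below takes the scope as a plain string, which is an assumption; the real should_use_hyde(query, scope) presumably receives a scope object.

```python
HYDE_MIN_WORDS = 4  # queries with fewer words than this trigger HyDE


def should_use_hyde(query: str, scope_type: str) -> bool:
    """HyDE is used for short queries or whenever the scope spans multiple reports."""
    word_count = len(query.split())
    return word_count < HYDE_MIN_WORDS or scope_type == "multi_report"
```

With this rule a one-word query such as "povodne" triggers HyDE, while a five-word query in folder scope is embedded as-is.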

C11: Per-source diversity in folder scope

If the folder contains more than one source_type and top_k ≥ 4, a post-processing step:

  • Detects the dominant and the missing types
  • Swaps the lowest-scored chunk of the dominant type for the highest-scored chunk of a missing type
  • Caps at 2 swaps (so that irrelevant chunks are not promoted)

Use case: a folder has a main NemoReport + 2 attachments. Without diversity, the top 5 would all come from the main report. With diversity, the AI sees cross-source context.
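One way the swap rule could be implemented is sketched below. The `Chunk` dataclass is hypothetical, and the set of source types in the folder is approximated by the types present in the candidate pool, which is an assumption of this sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

MIN_TOP_K = 4
MAX_SWAPS = 2


@dataclass(frozen=True)
class Chunk:
    id: str
    source_type: str
    score: float


def diversify(candidates: List[Chunk], top_k: int, max_swaps: int = MAX_SWAPS) -> List[Chunk]:
    """`candidates` must be sorted by score, best first."""
    selected = list(candidates[:top_k])
    pool_types = {c.source_type for c in candidates}
    if top_k < MIN_TOP_K or len(pool_types) < 2:
        return selected

    swaps = 0
    for candidate in candidates[top_k:]:  # leftovers, best-scored first
        if swaps >= max_swaps:
            break
        counts = Counter(c.source_type for c in selected)
        if candidate.source_type in counts:
            continue  # this type is already represented
        dominant, dominant_count = counts.most_common(1)[0]
        if dominant_count < 2:
            break  # no type can spare a slot
        # Dominant-type chunks keep their original order, so the last one is the lowest-scored.
        victim = max(i for i, c in enumerate(selected) if c.source_type == dominant)
        selected[victim] = candidate
        swaps += 1
    return sorted(selected, key=lambda c: c.score, reverse=True)
```

With five main-report chunks in the top 5 and two other source types among the leftovers, the two lowest-scored main chunks are replaced and the promoted chunks land at the bottom of the result.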

Key bugs / hotfixes

Bug C.1: Type mismatch in RRF score

Postgres inferred numeric, not float, for the 1.0 / (k + rank) division. The function was declared with RETURNS TABLE(... rrf_score float), which produced the runtime error "Returned type numeric does not match expected type double precision in column 21".

Fix: explicit ::float cast: (...)::float AS rrf_score.

Bug C.2: czech_unaccent declension matching

After the C.13 backfill of 290 chunks, the BM25 leg returned 0 results for "občanská vybavenost" because the TSV contained only "obcanskou" (accusative), not "obcanska" (nominative).

Fix: migration 0017. The _build_prefix_tsquery() helper splits the query into tokens, applies the :* prefix wildcard to each, and ORs them together. websearch_to_tsquery does not support the ':*' syntax, so the helper builds a raw tsquery string.

'občanská vybavenost obchody'
  → 'obcanska:* | vybavenost:* | obchody:*'
  → matches 'obcanskou', 'obcanska', 'obcanskemu', ...

OR (rather than AND) maximizes recall for RRF: chunks matching any token enter the BM25 leg, and ts_rank_cd ranks them down internally according to the number of matched tokens.
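The token → prefix → OR construction can be illustrated in Python. This is a sketch of the idea rather than a port of the SQL helper from migration 0017: the accent stripping via Unicode decomposition stands in for unaccent, and passing the result to to_tsquery with the czech_unaccent configuration is an assumption.

```python
import re
import unicodedata
from typing import Optional


def strip_accents(text: str) -> str:
    """Approximate unaccent: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_prefix_tsquery(query: str) -> Optional[str]:
    """Turn free text into an OR of prefix terms, e.g. 'a:* | b:*'."""
    # Only letters and digits survive, so tsquery operators in user input are dropped.
    tokens = re.findall(r"[^\W_]+", strip_accents(query.lower()))
    if not tokens:
        return None
    return " | ".join(f"{token}:*" for token in tokens)


print(build_prefix_tsquery("občanská vybavenost obchody"))
```

A prefix term matches only lexemes that begin with that exact string, so how much of each token is kept decides which declensions are covered. This sketch keeps the whole token; a shorter stem such as the obcansk:* form cited under C4 is what covers both "obcanska" and "obcanskou".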

Stage 4 — embed_target worker stage (per Phase C C.4)

Runs between annotated and ready. Per-target (kind ∈ {report, attachment}). Idempotent.

@broker.task(retry_on_error=True, max_retries=3)
async def embed_target(target_type: str, target_id: str):
    target = _fetch_target(...)
    _set_target_status(target, "embedding")

    # Skip path: GEMINI_API_KEY is missing → soft-fail
    if not embedding_available():
        db.mark_embedding_status(target.kind, target.target_id, "skipped", reason="...")
        return {"stage": "embed", "ok": True, "skipped": True}

    # Idempotence: delete existing chunks before re-inserting
    db.delete_chunks_for_target(target.report_id, target.attachment_id)

    # Load + chunk
    sections = db.list_chunkable_sections(target.report_id, target.attachment_id)
    tables = db.list_chunkable_tables(target.report_id, target.attachment_id)
    figures = db.list_chunkable_figures(target.report_id, target.attachment_id)
    chunk_specs = chunk_target(sections, tables, figures, ...)

    # Embed batch
    provider = get_embedding_provider()
    rows = []
    for spec in chunk_specs:
        if spec.expected_embedding_type == "multimodal":
            img_bytes = await storage.download_bytes(...)
            result = await provider.embed_multimodal(spec.content, img_bytes, mime, "RETRIEVAL_DOCUMENT")
        else:
            result = await provider.embed_text(spec.content, "RETRIEVAL_DOCUMENT")
        rows.append(_chunk_spec_to_row(target, spec, result))

    db.insert_chunks(rows)
    # total_cost_cents: sum of the per-call embedding costs (accumulation elided in this excerpt)
    _add_target_cost(db, target, total_cost_cents)
    db.mark_embedding_status(target.kind, target.target_id, "ok")

/retrieve endpoint

POST /retrieve
{
  "query": "občanská vybavenost obchody dostupnost",
  "scope": { "type": "folder", "report_id": "uuid" },
  "top_k": 5
}

Stages:

  1. Tenant scope verify (pre-flight): the report must belong to the user's tenant
  2. Standalone rewrite (a no-op in the Phase C MVP; Phase D will add a full LLM rewrite)
  3. HyDE conditional: should_use_hyde(query, scope) → if yes, generate a hypothetical document
  4. Embed query: embed_text(input, RETRIEVAL_QUERY) (asymmetric: queries use a different task type than the RETRIEVAL_DOCUMENT one used for chunks)
  5. Hybrid retrieval RPC: fetch top_k×4 candidates with vector_rank, bm25_rank, rrf_score
  6. Cohere Rerank (optional, ENV-flagged): top_k×4 → top_k
  7. Per-source diversity (folder + top_k ≥ 4)
  8. Insert retrieval_log (best-effort observability)
  9. Return chunks with rerank_score, vector_rank, bm25_rank, rrf_score
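"Best-effort" in step 8 means a failed log insert must never fail the retrieval itself. A minimal sketch of that pattern, assuming a hypothetical `insert_fn` callable that writes one row to retrieval_log:

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("retrieval")


def log_retrieval_best_effort(
    insert_fn: Callable[[Dict[str, Any]], None],
    row: Dict[str, Any],
) -> bool:
    """Try to persist one retrieval_log row; swallow and report any failure."""
    try:
        insert_fn(row)
        return True
    except Exception:
        # Observability must not break the request: log locally and move on.
        logger.warning("retrieval_log insert failed; continuing", exc_info=True)
        return False
```

The boolean return value is only there so that callers and tests can tell whether the row was written.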

End-to-end latency (from real measurements):

  • Long query (5 words, no HyDE): ~1085 ms (embed 360 + retrieve 58 + rerank 667)
  • Short query (1 word, HyDE active): ~700 ms (of which embed 287 + rerank 350)

C.13 — Re-embed existing reports (backfill)

After the deploy, I ran the backfill for all 19 ready reports that had no chunks:

while IFS= read -r rid; do
  curl -X POST "https://nemoreport-ai-backend-v2.sliplane.app/admin/embed/report/${rid}" \
    -H "X-Admin-Hash: ..."
done < report_ids.txt

The worker processed 19/19; together with 1 from earlier that makes 20 reports with chunks, 290 chunks in total (238 text + 51 multimodal), at a cost of 2.81 CZK.

The status was flipped from embedding to ready with a SQL UPDATE (the admin endpoint does not call finalize_target):

UPDATE nemoreport.reports
SET status = 'ready', ingestion_finished_at = COALESCE(ingestion_finished_at, NOW())
WHERE status = 'embedding' AND id IN (SELECT DISTINCT report_id FROM nemoreport.chunks);

E2E test (after deploy)

Live test via the admin diagnostic endpoint POST /admin/retrieve/{report_id}:

Query: "občanská vybavenost obchody dostupnost služeb"

Result:

  • Top 1: rrf=0.0328, vector_rank=1, bm25_rank=1 (both legs agree), rerank_score=0.7151; Cohere kept it at #1
  • Top 2: rrf=0.0154, vector_rank=5, bm25=null, rerank_score=0.6545; Cohere promoted it from vector #5 because it contains "Dostupnost služeb v místě"
  • Total: 1085 ms (embed 360 + retrieve 58 + rerank 667)

Query: "povodne" (1 word → HyDE active)

  • used_hyde=true, rewritten_query=383 chars (HyDE generated a 2-4 sentence hypothetical answer)
  • embed_ms=287, rerank_ms=350

pgTAP tests (26 new)

tests/db/06_chunks.sql: 26/26 ✓

  • RLS visibility User A/B + cross-tenant deny (T1-T3)
  • Defense-in-depth grants (T4-T8): authenticated SELECT-only, service_role full CRUD
  • FK cascade from attachments + reports + cross-tenant isolation on delete (T9-T12)
  • Check constraints: content_type / source_type / embedding_type reject invalid values (T13-T15)
  • Generated tsvector strips diacritics (T16): povodně → povodne
  • Existence of all indexes, including partial conditions (T17-T21)
  • pgvector + unaccent extensions installed (T22-T23)
  • czech_unaccent ts config exists (T24)
  • chunks is NOT in the Realtime publication (T25; Phase D pulls, no live updates)
  • RLS policy count is 1 (T26)

A sweep test verifies that all 21 nemoreport tables (including chunks) grant service_role full CRUD.

Total pgTAP suite after Phase C: 134 tests.

What was created

Migrations (5 new)

  • 0014 chunks setup (pgvector + unaccent + czech_unaccent ts config + chunks table + HNSW)
  • 0015 search_chunks_by_folder RPC (vector-only, MVP)
  • 0016 hybrid_search_chunks_by_folder RPC (BM25 + vector + RRF)
  • 0017 hybrid_search_prefix_match (_build_prefix_tsquery helper, declension fix)
  • 0018 retrieval_log

Backend Python modules

  • app/embedding/: EmbeddingProvider ABC, GeminiEmbedding2Provider, EmbeddingResult
  • app/retrieval/: service.py (orchestration), schemas.py (Pydantic), rerank.py (Cohere wrapper), rewrite.py (HyDE + standalone)
  • app/ingestion/chunking.py: pure helpers (chunk_target, should_embed_figure, build_figure_text)
  • app/routers/retrieve.py: POST /retrieve

Worker stages

  • embed_target added between annotate_target and finalize_target
  • 6 new DB methods (list_chunkable_*, delete_chunks_for_target, insert_chunks, mark_embedding_status, insert_retrieval_log)

Admin diagnostic endpoints

  • POST /admin/embed/{kind}/{id}: backfill for existing reports
  • POST /admin/retrieve/{report_id}: bypasses JWT for testing

What Phase C left for Phase D

  • POST /retrieve is production-ready and returns top-K chunks with rerank_score
  • retrieval_log captures all calls → ready for A/B variant comparison
  • Chunks carry denormalized source_type + attachment_filename + source_label → the AI can cite per chunk without a JOIN
  • 21 reports with chunks in production for testing the chat flow

Deferred from Phase C

  • C.9 image resize (1568 long-edge before multimodal embed): the current raw bytes (~50-300 KB) work; this is an optimization for production
  • C.12 golden set + eval harness: 40-60 annotated queries over 13 MHTML reports; recall@10 / MRR / faithfulness / p95 latency; ~30-60 min of work; deferred to the next session

Known limitations

  • Cohere trial: ~1000 calls/month, 10 RPM. Fine for pilot/dev; production needs a production key.
  • No stemmer: czech_unaccent is simple + unaccent → 3-8 % recall loss on keyword-heavy queries. Prefix wildcards reduce it and the Cohere rerank compensates, but ispell_czech would be better (requires self-hosted PG).
  • Phase D chat does not use /retrieve yet: /chat sends the full parsed_markdown (Phase D rewrite TBD).