Datový model¶
DB schema nemoreport (Supabase Postgres 17). 22 tabulek, 18 migrací (0001-0018).
ER diagram (high-level)¶
erDiagram
auth_users ||--|| user_profiles : "owns"
auth_users ||--o{ tenant_members : "joins"
user_profiles ||--|| tenants : "personal_tenant"
tenants ||--o{ tenant_members : "members"
tenants ||--o{ reports : "owns"
reports ||--o{ attachments : "has many"
reports ||--o{ parsed_sections : "main + per-attachment"
reports ||--o{ figures : "extracted images"
reports ||--o{ parsed_tables : "extracted tables"
reports ||--o{ ingestion_jobs : "audit log"
reports ||--o{ chunks : "vector + BM25"
attachments ||--o{ parsed_sections : "may have"
attachments ||--o{ figures : "may have"
parsed_sections ||--o{ figures : "section context"
parsed_sections ||--o{ chunks : "source"
figures ||--o{ chunks : "multimodal source"
parsed_tables ||--o{ chunks : "table source"
tenants ||--o{ retrieval_log : "scoped"
reports ||--o{ conversations : "chat"
conversations ||--o{ messages : "thread"
messages ||--o{ feedback : "thumbs"
Klíčové tabulky¶
tenants¶
Multi-tenant container. Každý user má personal_tenant (auto-created), v budoucnu mohou být team workspaces.
user_profiles¶
1:1 mapping na auth.users. Drží personal_tenant_id pro custom JWT hook.
user_id uuid PK REFERENCES auth.users(id) ON DELETE CASCADE
personal_tenant_id uuid FK → tenants
display_name text
created_at timestamptz
tenant_members¶
M:N user × tenant s rolí (owner / member / viewer).
tenant_id uuid FK → tenants
user_id uuid FK → auth.users
role text CHECK in ('owner', 'member', 'viewer')
created_at timestamptz
PRIMARY KEY (tenant_id, user_id)
RLS helper private.user_has_tenant_access(p_tenant_id) kontroluje membership přes JWT claims (personal_tenant_id + active_tenant_id) NEBO tenant_members row.
reports¶
Top-level container = "folder". Statusový FSM ovládá worker pipeline.
id uuid PK
tenant_id uuid FK → tenants
title text
source text CHECK in ('v1-import', 'user-upload', 'nette')
external_id text -- pro Nette reports
filename text
content_type text
storage_path text -- R2 path
size_bytes int
clean_text text -- legacy v1, plain text
parsed_markdown text -- v2, markdown z OCR/parser
parsed_metadata jsonb -- {addresses, parcel_numbers, municipalities, embedding_status}
ingestion_cost_cents int default 0
ingestion_started_at timestamptz
ingestion_finished_at timestamptz
ingestion_error text
status text CHECK in (
'uploaded','parsing','parsed','annotating','annotated',
'embedding','ready','failed'
)
created_by uuid FK → auth.users
created_at timestamptz
updated_at timestamptz
deleted_at timestamptz -- soft delete (zatím nepoužité)
Status FSM:
uploaded → parsing → parsed → annotating → annotated → embedding → ready
↓ ↓ ↓
└─────────── failed ──┴───────────────────────┘
attachments (Phase B)¶
Přílohy reportu. Folder model — 1 report má 0..N attachments.
id uuid PK
tenant_id uuid FK → tenants
report_id uuid FK → reports ON DELETE CASCADE
source text CHECK in ('nette', 'user_upload')
nette_id text -- Nette přidělené ID (idempotence)
attachment_type text -- "vyjadreni_cetin" / "geo_plan" / "foto" / atd.
filename text
content_type text
size_bytes int
storage_path text
parsed_markdown text
parsed_metadata jsonb
ingestion_cost_cents int default 0
ingestion_started_at timestamptz
ingestion_finished_at timestamptz
ingestion_error text
status text CHECK ... -- stejný FSM jako reports
note text
created_by uuid FK → auth.users
created_at timestamptz
updated_at timestamptz
Idempotence partial index pro Nette:
CREATE UNIQUE INDEX idx_attachments_nette_unique
ON attachments (report_id, nette_id)
WHERE source = 'nette' AND nette_id IS NOT NULL;
parsed_sections (Phase B)¶
Markdown sekce z OCR/parser. Per report (main = attachment_id IS NULL) nebo per attachment.
id uuid PK
tenant_id uuid FK → tenants
report_id uuid FK → reports
attachment_id uuid FK → attachments -- NULL = main report content
source_type text CHECK in ('main','attachment','user_upload')
name text -- "Riziko povodní"
slug text -- "riziko-povodni" (diakritika-safe)
markdown text -- plný markdown sekce
tokens int default 0
order_in_doc int default 0
created_at timestamptz
figures (Phase B)¶
Extrahované obrázky / mapy / výkresy s AI anotacemi.
id uuid PK
tenant_id uuid FK → tenants
report_id uuid FK → reports -- VŽDY parent report
attachment_id uuid FK → attachments -- optional
section_id uuid FK → parsed_sections ON DELETE SET NULL
source_type text CHECK in ('main','attachment','user_upload')
section_name text
page_number int
bbox jsonb -- {x, y, width, height}
storage_path text -- nemoreport-figures bucket
content_type text -- image/png, image/jpeg, ...
width_px, height_px int
size_bytes int
caption text
annotation_json jsonb NOT NULL -- FigureAnnotation Pydantic struct
annotation_source text CHECK in ('mistral','gemini','manual','pending')
annotation_quality_score float -- 0-1, heuristic per Phase B
created_at timestamptz
updated_at timestamptz
annotation_json schema (FigureAnnotation):
{
image_type: 'map_zoning' | 'map_flood' | 'map_utility' | 'photo_property'
| 'technical_drawing' | 'scan_document' | 'graph' | 'decorative' | 'other',
summary: string, // CZ popis
entities: { kind: string, value: string }[], // adresy, parcely, obce
key_observations: string[],
legend: string,
dimensions: string,
}
parsed_tables (Phase B, currently unused)¶
Reservováno pro tabulky z Mistral OCR. Phase B to extract zatím nepopuluje.
id uuid PK
tenant_id uuid FK
report_id uuid FK
section_id uuid FK → parsed_sections ON DELETE SET NULL
caption text
markdown text -- markdown table format
rows_count int
created_at timestamptz
ingestion_jobs (Phase B)¶
Audit ledger pro retry observability + cost tracking per stage.
id uuid PK
tenant_id uuid FK
report_id uuid FK
attachment_id uuid FK
stage text CHECK in ('scan','parse','annotate','finalize') -- (Note: 'embed' added in Phase C, but jobs table doesn't enforce)
status text CHECK in ('queued','running','done','failed')
worker_id text
taskiq_task_id text
attempt int default 1
started_at timestamptz
finished_at timestamptz
error_text text
metrics jsonb default '{}' -- {cost_cents, ms, ...}
created_at timestamptz
updated_at timestamptz
chunks (Phase C)¶
Klíčová tabulka pro vector RAG. Single table s inline embedding (D2 decision).
id uuid PK
tenant_id uuid FK → tenants
report_id uuid FK → reports -- VŽDY parent (folder retrieval)
attachment_id uuid FK → attachments -- NULL = main report
section_id uuid FK → parsed_sections
table_id uuid FK → parsed_tables
figure_id uuid FK → figures
section_name text -- denormalized
section_slug text
attachment_filename text -- denormalized pro chat citaci
source_label text -- "Riziko povodní" / "map_flood: ulice Klíny"
content_type text CHECK in ('text','table','figure')
source_type text CHECK in ('main','attachment','figure')
content text NOT NULL
content_tokens int default 0
order_in_doc int NOT NULL
order_in_section int default 0
embedding halfvec(1536) -- nullable: D7 soft-fail
embedding_type text CHECK in ('text','multimodal')
embedding_model text -- "gemini-embedding-2"
embedding_version text -- "ga-2026-04"
tsv tsvector GENERATED ALWAYS AS (
to_tsvector('nemoreport.czech_unaccent', coalesce(content, ''))
) STORED
created_at timestamptz
Indexy:
- chunks_pkey (id) — primary key
- idx_chunks_report (report_id, order_in_doc) — folder retrieval ordering
- idx_chunks_tenant (tenant_id)
- idx_chunks_section_filter (report_id, section_slug, content_type) PARTIAL where section_slug NOT NULL — section scope retrieval
- idx_chunks_attachment (attachment_id) PARTIAL where attachment_id NOT NULL
- idx_chunks_tsv_gin USING gin(tsv) — BM25 leg
- idx_chunks_embedding_hnsw USING hnsw(embedding halfvec_cosine_ops) WITH (m=16, ef_construction=64) PARTIAL where embedding NOT NULL — vector leg
RLS: SELECT pro authenticated přes private.user_has_tenant_access(tenant_id). INSERT/UPDATE/DELETE jen service_role (defense-in-depth).
retrieval_log (Phase C)¶
Per-call observability log pro /retrieve endpoint.
id uuid PK
tenant_id uuid FK → tenants
user_id uuid FK → auth.users ON DELETE SET NULL
scope_type text CHECK in ('folder','section','multi_report')
scope_payload jsonb -- {report_id, section_slug, report_ids[]}
query_text text NOT NULL
rewritten_query text -- HyDE doc nebo standalone rewrite, NULL = passthrough
used_hyde boolean default false
fusion text CHECK in ('vector_only','hybrid_rrf') default 'hybrid_rrf'
reranked boolean default false
rerank_model text -- "rerank-v4.0-pro"
top_k int NOT NULL
result_count int NOT NULL
result_chunk_ids uuid[] default array[]::uuid[]
embed_ms int
retrieval_ms int
rerank_ms int
user_feedback int -- 1 / -1 / NULL — Phase D
created_at timestamptz
Konverzační tabulky (Phase A, used by /chat — Phase D rewrite)¶
conversations¶
id uuid PK
tenant_id uuid FK → tenants
report_id uuid FK → reports -- 1 conversation = 1 report (zatím)
title text
created_by uuid FK → auth.users
created_at timestamptz
updated_at timestamptz
messages¶
id bigserial PK
tenant_id uuid FK → tenants
conversation_id uuid FK → conversations
role text CHECK in ('user','assistant','system')
content text NOT NULL
source_message_id text -- legacy v1 mapping (B&G hash bez fragility)
created_at timestamptz
feedback¶
id uuid PK
tenant_id uuid FK → tenants
report_id uuid FK → reports
conversation_id uuid FK → conversations
message_id bigint FK → messages
section text
rating int CHECK in (1, -1)
feedback_text text
llm_model text
created_at timestamptz
Admin / migrace tabulky (Phase A)¶
llm_providers¶
Per-tenant LLM config. Admin endpointy umožňují přepnout default model.
id uuid PK
tenant_id uuid FK → tenants
provider text -- 'google-gla', 'openai', 'anthropic'
display_name text
api_key text -- ! sensitive, defense-in-depth REVOKE z authenticated
config jsonb -- {models[], default_model, defaults}
is_active boolean
created_at timestamptz
handle_claims_preauth, handle_claims¶
Migrace v1 → v2. Tester claim flow:
1. Admin volá POST /admin/migrate/preauth s emailem + handle (předregistruje)
2. User signup → handle_new_user trigger detekuje preauth row a importuje data v1
3. handle_claims audit log
migration_events, user_events¶
Audit logy.
import_manifest¶
Bookkeeping pro v1 → v2 import.
report_snapshots, report_images¶
Legacy v1 tables (zatím not migrated — read-only fallback).
RLS — defense-in-depth pattern¶
Všechny nemoreport.* tabulky mají ENABLE ROW LEVEL SECURITY. SELECT policy enforces tenant scoping. Pro citlivé tabulky (chunks, figures, parsed_, ingestion_jobs, retrieval_log, llm_providers) je INSERT/UPDATE/DELETE explicitly REVOKED* od role authenticated.
Důvod: i kdyby se omylem přidala permissive UPDATE policy, table grant by byla blokovala. Tj. 2 vrstvy ochrany: RLS policy + table grant.
Typická tabulka:
ALTER TABLE nemoreport.chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY "chunks_select" ON nemoreport.chunks FOR SELECT
TO authenticated USING (private.user_has_tenant_access(tenant_id));
REVOKE INSERT, UPDATE, DELETE ON nemoreport.chunks FROM authenticated;
GRANT SELECT, INSERT, UPDATE, DELETE ON nemoreport.chunks TO service_role;
Sequence grants (Phase A.1 lesson): explicit GRANT USAGE ON SEQUENCE ... TO authenticated pro každý PK sequence — alter default privileges to nepokrývá pro budoucí SERIAL columns.
Detail viz Bezpečnost.
Migrace¶
Aplikované migrace 0001-0018 (vždy nemoreport.* schema, NIKDY public):
| # | Název | Co dělá |
|---|---|---|
| 0001 | auth_tenants | tenants + tenant_members + user_profiles + handle_new_user trigger |
| 0002 | reports_conversations | reports + conversations + messages + feedback |
| 0003 | llm_providers | llm_providers admin tabulka |
| 0004 | storage_buckets | 3 R2 buckety + path-based RLS |
| 0005 | migration_tables | handle_claims, migration_events, user_events, import_manifest |
| 0006 | import_v1_function | nemoreport.import_v1() PL/pgSQL function |
| 0007 | jwt_hook | private.custom_access_token_hook() — injektuje claims |
| 0009 | grant_private_usage | grant USAGE ON private TO supabase_auth_admin |
| 0010 | grant_service_role_nemoreport | service_role full CRUD na celém schema |
| 0011 | revoke_authenticated_admin_tables | defense-in-depth REVOKE od authenticated |
| 0012 | grant_messages_seq | USAGE on messages_id_seq pro authenticated |
| 0013 | ingestion_tables | Phase B: attachments + parsed_* + figures + ingestion_jobs |
| 0014 | chunks_setup | Phase C: pgvector + halfvec + chunks tabulka + HNSW |
| 0015 | search_chunks_rpc | search_chunks_by_folder() RPC |
| 0016 | hybrid_search_chunks | hybrid_search_chunks_by_folder() RPC (BM25 + vector + RRF) |
| 0017 | hybrid_search_prefix_match | _build_prefix_tsquery() helper, fix declension matching |
| 0018 | retrieval_log | Phase C C.11: retrieval_log table |
Detail viz Databáze.
Storage buckety¶
4 buckety v Supabase Storage (managed backend, ne přímo Cloudflare R2):
nemoreport-uploads/
└── {tenant_id}/
└── reports/
└── {report_id}/
└── original.{pdf|docx|mhtml|...}
nemoreport-attachments/
└── {tenant_id}/
└── reports/
└── {report_id}/
└── attachments/
└── {attachment_id}.{ext}
nemoreport-figures/
└── {tenant_id}/
└── reports/
└── {report_id}/
└── figures/
└── {figure_id}.{png|jpg|...}
RLS path-based scoping {tenant_id}/... — user vidí jen vlastní.
Stav v produkci (30.4.2026)¶
- 22 tabulek
- 21 reportů s chunks (po C.13 backfill)
- 290 chunks (238 text + 51 multimodal)
- 116 figures (99 Gemini AI + 17 Mistral AI)
- 134 pgTAP testů (Phase A 80 + Phase B 28 + Phase C 26)
- Total ingestion cost: 281 halířů (2.81 Kč) za 21 reportů