Přeskočit obsah

Datový model

DB schema nemoreport (Supabase Postgres 17). 22 tabulek, 18 migrací (0001-0018).

ER diagram (high-level)

erDiagram
    auth_users ||--|| user_profiles : "owns"
    auth_users ||--o{ tenant_members : "joins"
    user_profiles ||--|| tenants : "personal_tenant"
    tenants ||--o{ tenant_members : "members"
    tenants ||--o{ reports : "owns"
    reports ||--o{ attachments : "has many"
    reports ||--o{ parsed_sections : "main + per-attachment"
    reports ||--o{ figures : "extracted images"
    reports ||--o{ parsed_tables : "extracted tables"
    reports ||--o{ ingestion_jobs : "audit log"
    reports ||--o{ chunks : "vector + BM25"
    attachments ||--o{ parsed_sections : "may have"
    attachments ||--o{ figures : "may have"
    parsed_sections ||--o{ figures : "section context"
    parsed_sections ||--o{ chunks : "source"
    figures ||--o{ chunks : "multimodal source"
    parsed_tables ||--o{ chunks : "table source"
    tenants ||--o{ retrieval_log : "scoped"
    reports ||--o{ conversations : "chat"
    conversations ||--o{ messages : "thread"
    messages ||--o{ feedback : "thumbs"

Klíčové tabulky

tenants

Multi-tenant container. Každý user má personal_tenant (auto-created), v budoucnu mohou být team workspaces.

id              uuid PK
name            text
type            text CHECK in ('personal', 'team')
created_at      timestamptz

user_profiles

1:1 mapping na auth.users. Drží personal_tenant_id pro custom JWT hook.

user_id              uuid PK REFERENCES auth.users(id) ON DELETE CASCADE
personal_tenant_id   uuid FK  tenants
display_name         text
created_at           timestamptz

tenant_members

M:N user × tenant s rolí (owner / member / viewer).

tenant_id    uuid FK  tenants
user_id      uuid FK  auth.users
role         text CHECK in ('owner', 'member', 'viewer')
created_at   timestamptz
PRIMARY KEY (tenant_id, user_id)

RLS helper private.user_has_tenant_access(p_tenant_id) kontroluje membership přes JWT claims (personal_tenant_id + active_tenant_id) NEBO tenant_members row.

reports

Top-level container = "folder". Statusový FSM ovládá worker pipeline.

id                       uuid PK
tenant_id                uuid FK  tenants
title                    text
source                   text CHECK in ('v1-import', 'user-upload', 'nette')
external_id              text  -- pro Nette reports
filename                 text
content_type             text
storage_path             text  -- R2 path
size_bytes               int
clean_text               text  -- legacy v1, plain text
parsed_markdown          text  -- v2, markdown z OCR/parser
parsed_metadata          jsonb -- {addresses, parcel_numbers, municipalities, embedding_status}
ingestion_cost_cents     int default 0
ingestion_started_at     timestamptz
ingestion_finished_at    timestamptz
ingestion_error          text
status                   text CHECK in (
                           'uploaded','parsing','parsed','annotating','annotated',
                           'embedding','ready','failed'
                         )
created_by               uuid FK  auth.users
created_at               timestamptz
updated_at               timestamptz
deleted_at               timestamptz  -- soft delete (zatím nepoužité)

Status FSM:

uploaded → parsing → parsed → annotating → annotated → embedding → ready
                  ↓                     ↓                      ↓
                  └─────────── failed ──┴───────────────────────┘

attachments (Phase B)

Přílohy reportu. Folder model — 1 report má 0..N attachments.

id                       uuid PK
tenant_id                uuid FK  tenants
report_id                uuid FK  reports ON DELETE CASCADE
source                   text CHECK in ('nette', 'user_upload')
nette_id                 text  -- Nette přidělené ID (idempotence)
attachment_type          text  -- "vyjadreni_cetin" / "geo_plan" / "foto" / atd.
filename                 text
content_type             text
size_bytes               int
storage_path             text
parsed_markdown          text
parsed_metadata          jsonb
ingestion_cost_cents     int default 0
ingestion_started_at     timestamptz
ingestion_finished_at    timestamptz
ingestion_error          text
status                   text CHECK ... -- stejný FSM jako reports
note                     text
created_by               uuid FK  auth.users
created_at               timestamptz
updated_at               timestamptz

Idempotence partial index pro Nette:

CREATE UNIQUE INDEX idx_attachments_nette_unique
  ON attachments (report_id, nette_id)
  WHERE source = 'nette' AND nette_id IS NOT NULL;

parsed_sections (Phase B)

Markdown sekce z OCR/parser. Per report (main = attachment_id IS NULL) nebo per attachment.

id            uuid PK
tenant_id     uuid FK  tenants
report_id     uuid FK  reports
attachment_id uuid FK  attachments  -- NULL = main report content
source_type   text CHECK in ('main','attachment','user_upload')
name          text  -- "Riziko povodní"
slug          text  -- "riziko-povodni" (diakritika-safe)
markdown      text  -- plný markdown sekce
tokens        int default 0
order_in_doc  int default 0
created_at    timestamptz

figures (Phase B)

Extrahované obrázky / mapy / výkresy s AI anotacemi.

id                          uuid PK
tenant_id                   uuid FK  tenants
report_id                   uuid FK  reports  -- VŽDY parent report
attachment_id               uuid FK  attachments  -- optional
section_id                  uuid FK  parsed_sections ON DELETE SET NULL
source_type                 text CHECK in ('main','attachment','user_upload')
section_name                text
page_number                 int
bbox                        jsonb  -- {x, y, width, height}
storage_path                text   -- nemoreport-figures bucket
content_type                text   -- image/png, image/jpeg, ...
width_px, height_px         int
size_bytes                  int
caption                     text
annotation_json             jsonb NOT NULL  -- FigureAnnotation Pydantic struct
annotation_source           text CHECK in ('mistral','gemini','manual','pending')
annotation_quality_score    float  -- 0-1, heuristic per Phase B
created_at                  timestamptz
updated_at                  timestamptz

annotation_json schema (FigureAnnotation):

{
  image_type: 'map_zoning' | 'map_flood' | 'map_utility' | 'photo_property'
            | 'technical_drawing' | 'scan_document' | 'graph' | 'decorative' | 'other',
  summary: string,           // CZ popis
  entities: { kind: string, value: string }[],   // adresy, parcely, obce
  key_observations: string[],
  legend: string,
  dimensions: string,
}

parsed_tables (Phase B, currently unused)

Reservováno pro tabulky z Mistral OCR. Phase B to extract zatím nepopuluje.

id          uuid PK
tenant_id   uuid FK
report_id   uuid FK
section_id  uuid FK  parsed_sections ON DELETE SET NULL
caption     text
markdown    text  -- markdown table format
rows_count  int
created_at  timestamptz

ingestion_jobs (Phase B)

Audit ledger pro retry observability + cost tracking per stage.

id              uuid PK
tenant_id       uuid FK
report_id       uuid FK
attachment_id   uuid FK
stage           text CHECK in ('scan','parse','annotate','finalize')  -- (Note: 'embed' added in Phase C, but jobs table doesn't enforce)
status          text CHECK in ('queued','running','done','failed')
worker_id       text
taskiq_task_id  text
attempt         int default 1
started_at      timestamptz
finished_at     timestamptz
error_text      text
metrics         jsonb default '{}'  -- {cost_cents, ms, ...}
created_at      timestamptz
updated_at      timestamptz

chunks (Phase C)

Klíčová tabulka pro vector RAG. Single table s inline embedding (D2 decision).

id                    uuid PK
tenant_id             uuid FK  tenants
report_id             uuid FK  reports  -- VŽDY parent (folder retrieval)
attachment_id         uuid FK  attachments  -- NULL = main report
section_id            uuid FK  parsed_sections
table_id              uuid FK  parsed_tables
figure_id             uuid FK  figures
section_name          text  -- denormalized
section_slug          text
attachment_filename   text  -- denormalized pro chat citaci
source_label          text  -- "Riziko povodní" / "map_flood: ulice Klíny"
content_type          text CHECK in ('text','table','figure')
source_type           text CHECK in ('main','attachment','figure')
content               text NOT NULL
content_tokens        int default 0
order_in_doc          int NOT NULL
order_in_section      int default 0
embedding             halfvec(1536)  -- nullable: D7 soft-fail
embedding_type        text CHECK in ('text','multimodal')
embedding_model       text  -- "gemini-embedding-2"
embedding_version     text  -- "ga-2026-04"
tsv                   tsvector GENERATED ALWAYS AS (
                        to_tsvector('nemoreport.czech_unaccent', coalesce(content, ''))
                      ) STORED
created_at            timestamptz

Indexy: - chunks_pkey (id) — primary key - idx_chunks_report (report_id, order_in_doc) — folder retrieval ordering - idx_chunks_tenant (tenant_id) - idx_chunks_section_filter (report_id, section_slug, content_type) PARTIAL where section_slug NOT NULL — section scope retrieval - idx_chunks_attachment (attachment_id) PARTIAL where attachment_id NOT NULL - idx_chunks_tsv_gin USING gin(tsv) — BM25 leg - idx_chunks_embedding_hnsw USING hnsw(embedding halfvec_cosine_ops) WITH (m=16, ef_construction=64) PARTIAL where embedding NOT NULL — vector leg

RLS: SELECT pro authenticated přes private.user_has_tenant_access(tenant_id). INSERT/UPDATE/DELETE jen service_role (defense-in-depth).

retrieval_log (Phase C)

Per-call observability log pro /retrieve endpoint.

id                  uuid PK
tenant_id           uuid FK  tenants
user_id             uuid FK  auth.users ON DELETE SET NULL
scope_type          text CHECK in ('folder','section','multi_report')
scope_payload       jsonb  -- {report_id, section_slug, report_ids[]}
query_text          text NOT NULL
rewritten_query     text  -- HyDE doc nebo standalone rewrite, NULL = passthrough
used_hyde           boolean default false
fusion              text CHECK in ('vector_only','hybrid_rrf') default 'hybrid_rrf'
reranked            boolean default false
rerank_model        text  -- "rerank-v4.0-pro"
top_k               int NOT NULL
result_count        int NOT NULL
result_chunk_ids    uuid[] default array[]::uuid[]
embed_ms            int
retrieval_ms        int
rerank_ms           int
user_feedback       int  -- 1 / -1 / NULL — Phase D
created_at          timestamptz

Konverzační tabulky (Phase A, used by /chat — Phase D rewrite)

conversations

id           uuid PK
tenant_id    uuid FK  tenants
report_id    uuid FK  reports  -- 1 conversation = 1 report (zatím)
title        text
created_by   uuid FK  auth.users
created_at   timestamptz
updated_at   timestamptz

messages

id              bigserial PK
tenant_id       uuid FK  tenants
conversation_id uuid FK  conversations
role            text CHECK in ('user','assistant','system')
content         text NOT NULL
source_message_id text  -- legacy v1 mapping (B&G hash bez fragility)
created_at      timestamptz

feedback

id              uuid PK
tenant_id       uuid FK  tenants
report_id       uuid FK  reports
conversation_id uuid FK  conversations
message_id      bigint FK  messages
section         text
rating          int CHECK in (1, -1)
feedback_text   text
llm_model       text
created_at      timestamptz

Admin / migrace tabulky (Phase A)

llm_providers

Per-tenant LLM config. Admin endpointy umožňují přepnout default model.

id              uuid PK
tenant_id       uuid FK  tenants
provider        text  -- 'google-gla', 'openai', 'anthropic'
display_name    text
api_key         text  -- ! sensitive, defense-in-depth REVOKE z authenticated
config          jsonb  -- {models[], default_model, defaults}
is_active       boolean
created_at      timestamptz

handle_claims_preauth, handle_claims

Migrace v1 → v2. Tester claim flow: 1. Admin volá POST /admin/migrate/preauth s emailem + handle (předregistruje) 2. User signup → handle_new_user trigger detekuje preauth row a importuje data v1 3. handle_claims audit log

migration_events, user_events

Audit logy.

import_manifest

Bookkeeping pro v1 → v2 import.

report_snapshots, report_images

Legacy v1 tables (zatím not migrated — read-only fallback).

RLS — defense-in-depth pattern

Všechny nemoreport.* tabulky mají ENABLE ROW LEVEL SECURITY. SELECT policy enforces tenant scoping. Pro citlivé tabulky (chunks, figures, parsed_, ingestion_jobs, retrieval_log, llm_providers) je INSERT/UPDATE/DELETE explicitly REVOKED* od role authenticated.

Důvod: i kdyby se omylem přidala permissive UPDATE policy, table grant by byla blokovala. Tj. 2 vrstvy ochrany: RLS policy + table grant.

Typická tabulka:

ALTER TABLE nemoreport.chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY "chunks_select" ON nemoreport.chunks FOR SELECT
  TO authenticated USING (private.user_has_tenant_access(tenant_id));

REVOKE INSERT, UPDATE, DELETE ON nemoreport.chunks FROM authenticated;
GRANT SELECT, INSERT, UPDATE, DELETE ON nemoreport.chunks TO service_role;

Sequence grants (Phase A.1 lesson): explicit GRANT USAGE ON SEQUENCE ... TO authenticated pro každý PK sequence — alter default privileges to nepokrývá pro budoucí SERIAL columns.

Detail viz Bezpečnost.

Migrace

Aplikované migrace 0001-0018 (vždy nemoreport.* schema, NIKDY public):

# Název Co dělá
0001 auth_tenants tenants + tenant_members + user_profiles + handle_new_user trigger
0002 reports_conversations reports + conversations + messages + feedback
0003 llm_providers llm_providers admin tabulka
0004 storage_buckets 3 R2 buckety + path-based RLS
0005 migration_tables handle_claims, migration_events, user_events, import_manifest
0006 import_v1_function nemoreport.import_v1() PL/pgSQL function
0007 jwt_hook private.custom_access_token_hook() — injektuje claims
0009 grant_private_usage grant USAGE ON private TO supabase_auth_admin
0010 grant_service_role_nemoreport service_role full CRUD na celém schema
0011 revoke_authenticated_admin_tables defense-in-depth REVOKE od authenticated
0012 grant_messages_seq USAGE on messages_id_seq pro authenticated
0013 ingestion_tables Phase B: attachments + parsed_* + figures + ingestion_jobs
0014 chunks_setup Phase C: pgvector + halfvec + chunks tabulka + HNSW
0015 search_chunks_rpc search_chunks_by_folder() RPC
0016 hybrid_search_chunks hybrid_search_chunks_by_folder() RPC (BM25 + vector + RRF)
0017 hybrid_search_prefix_match _build_prefix_tsquery() helper, fix declension matching
0018 retrieval_log Phase C C.11: retrieval_log table

Detail viz Databáze.

Storage buckety

4 buckety v Supabase Storage (managed backend, ne přímo Cloudflare R2):

nemoreport-uploads/
└── {tenant_id}/
    └── reports/
        └── {report_id}/
            └── original.{pdf|docx|mhtml|...}

nemoreport-attachments/
└── {tenant_id}/
    └── reports/
        └── {report_id}/
            └── attachments/
                └── {attachment_id}.{ext}

nemoreport-figures/
└── {tenant_id}/
    └── reports/
        └── {report_id}/
            └── figures/
                └── {figure_id}.{png|jpg|...}

RLS path-based scoping {tenant_id}/... — user vidí jen vlastní.

Stav v produkci (30.4.2026)

  • 22 tabulek
  • 21 reportů s chunks (po C.13 backfill)
  • 290 chunks (238 text + 51 multimodal)
  • 116 figures (99 Gemini AI + 17 Mistral AI)
  • 134 pgTAP testů (Phase A 80 + Phase B 28 + Phase C 26)
  • Total ingestion cost: 281 halířů (2.81 Kč) za 21 reportů