Glossary
Domain-specific terms and abbreviations used across the project.
Architecture and tech
- ADR (Architecture Decision Record)
- A record of a decision and the reasoning behind it. Helps explain in retrospect why we did what we did. For details see Decisions.
- Defense-in-depth
- Multiple layers of protection. If one layer fails, the next one blocks. Example: RLS policy + table grant + sequence USAGE, three layers against unauthorized access.
- Folder model
- A Phase B post-deploy concept. A report is a container, not a single file. One report contains a main document + N attachments + cross-source figures. Implemented via the `attachments.report_id` FK, with `figures.report_id` always pointing to the parent.
- Idempotent stage
- A worker stage that can be repeated without side effects. Implemented by deleting existing data (DELETE) before a re-run. Important for retry semantics.
- JWT custom hook
- A Supabase Postgres function (`private.custom_access_token_hook`) called when a JWT is issued. Injects custom claims (`personal_tenant_id`, `active_tenant_id`).
- Multi-tenant
- An architecture in which one Postgres instance + one Supabase project serves multiple logical "tenants" isolated via RLS. Every user has a `personal_tenant` (auto-created).
- RLS (Row Level Security)
- A Postgres feature for per-row authorization. A policy defines who can see which rows. Applied per `tenant_id`.
- SECURITY DEFINER
- A Postgres function modifier: the function runs with the owner's privileges, not the caller's. Used for the custom JWT hook and for RPC functions where the caller (`authenticated`) needs access to private resources.
- Soft-fail (D7)
- A Phase C decision. If the embed stage fails, the report still transitions to `ready`, and `parsed_metadata.embedding_status='failed'` is recorded. Cost: chat may fall back to full text instead of chunks.
- Tenant
- A logical grouping of data per user/team. Personal tenant = 1:1 with a user. Team tenant = N:M.
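The delete-before-rerun approach from the "Idempotent stage" entry can be sketched as follows. This is a minimal illustration, not the project's worker code: `chunk_store` and `run_chunk_stage` are hypothetical names, and an in-memory dict stands in for the database table.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a table of stage output keyed by report_id.
chunk_store: dict[str, list[str]] = defaultdict(list)


def run_chunk_stage(report_id: str, sections: list[str]) -> int:
    """Delete-then-insert: rerunning the stage leaves the same final state."""
    chunk_store.pop(report_id, None)         # DELETE existing rows for this report
    chunk_store[report_id].extend(sections)  # INSERT the freshly computed output
    return len(chunk_store[report_id])


run_chunk_stage("r1", ["intro", "table 1"])
run_chunk_stage("r1", ["intro", "table 1"])  # a retry does not duplicate rows
```

Because the stage first clears its own previous output, a retry after a partial failure converges to the same rows as a single clean run.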
AI / RAG
- BM25
- Best Matching 25, a keyword-based search algorithm. Implemented in Postgres as a BM25-like leg via `tsvector` + `ts_rank_cd`. The Phase C BM25 leg uses the `czech_unaccent` text search config.
- Chunk
- A part of a report after splitting (section, table, figure). 1 chunk = 1 row in `nemoreport.chunks`. Phase C chunk size: 800-1200 tokens of text, 1 chunk per table, 1 chunk per qualifying figure.
- Cohere Rerank 4.0
- A cross-encoder rerank model from Cohere. Released 2026-04-06. `rerank-v4.0-pro` (default) / `rerank-v4.0-fast` (cost). 32K context, multilingual including Czech.
- Cross-encoder
- Simply put: a model that takes the query + candidate together (concatenated) and produces a single relevance score. Slower but more accurate than a bi-encoder (= embedding similarity).
- Embedding
- A vector representation of text or an image. Phase C: `gemini-embedding-2` GA, natively multimodal (text + image into a unified 3072-dim space). Truncated to 1536 dims (Matryoshka) + L2-normalized → `halfvec(1536)`.
- Gemini Embedding 2
- Google's multimodal embedding model. Released 2026-03-10 as a preview, GA ~04/2026. Natively multimodal: text + image + video + audio + PDF into a single embedding space.
- halfvec(1536)
- A pgvector type for a 16-bit float vector (vs `vector(1536)` at 32-bit). 50% smaller storage, roughly the same recall for most use cases.
- HNSW (Hierarchical Navigable Small World)
- A vector index algorithm for fast approximate nearest neighbor search. Native support in pgvector 0.8+. Phase C: `m=16, ef_construction=64, ef_search=80-150` per scope.
- Hybrid retrieval
- A combination of vector + BM25 + RRF fusion. Phase C signature flow: `hybrid_search_chunks_by_folder` RPC → top-K candidates → optional Cohere rerank → final top-K.
- HyDE (Hypothetical Document Embeddings)
- A technique for short/vague queries. An LLM generates a hypothetical answer, and the embedding of that answer is used instead of the query embedding → better recall.
- Matryoshka embeddings
- An embedding model trained so that the prefix [:K] of the dimensions represents the text nearly as well as the full vector. Allows truncation with minimal recall loss. Supported by Gemini Embedding 2.
- Multimodal embedding
- A vector representing text + image in a unified vector space. Phase C C.6: figure chunks with `embedding_type='multimodal'` (text annotation + image bytes into Gemini Embedding 2).
- OCR (Optical Character Recognition)
- Text extraction from an image/PDF. Phase B uses Mistral OCR (`mistral-ocr-latest`) for PDFs + images.
- RAG (Retrieval-Augmented Generation)
- Pattern: first retrieve the relevant context from a vector DB, then inject it into the LLM prompt instead of the full text. Reduces token cost and improves "needle in a haystack" performance on large documents.
- RRF (Reciprocal Rank Fusion)
- A hybrid retrieval fusion algorithm. Score = `1 / (k + rank)` per leg, summed across legs. Phase C k=60. Tends to be more robust than weighted score fusion because it works on ranks and needs no score normalization.
- Standalone rewrite
- Converts a conversational query (with history context) into a self-contained query. Phase C MVP: no-op (no history). Phase D will be fully LLM-based.
- tsvector / tsquery
- Postgres full-text search types. `tsvector` = lexemes + positions. `tsquery` = parsed query. `tsv @@ q` = match operator.
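The RRF formula from the entry above (score = 1 / (k + rank) per leg, summed across legs, k=60) can be sketched in a few lines. This is an illustrative in-memory version, not the `hybrid_search_chunks_by_folder` RPC itself. The function name `rrf_fuse` and the chunk ids are hypothetical, and ranks are assumed to start at 1.

```python
from collections import defaultdict


def rrf_fuse(legs: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each leg contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in legs:
        for rank, chunk_id in enumerate(ranking, start=1):  # assumed 1-based ranks
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_leg = ["c3", "c1", "c7"]  # hypothetical vector-search ranking
bm25_leg = ["c1", "c9", "c3"]    # hypothetical keyword ranking
fused = rrf_fuse([vector_leg, bm25_leg])
```

In this example `c1` ends up ahead of `c3`: both appear in both legs, but `c1` holds ranks 2 and 1 while `c3` holds ranks 1 and 3. Ids present in only one leg land below both.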
DB
- `auth.users`
- Supabase-managed table for user authentication. NemoReport has `nemoreport.user_profiles` as a 1:1 mapping for extra fields.
- `nemoreport.chunks`
- The central Phase C table. 290 rows in production (after the C.13 backfill). Per row: `halfvec(1536)` embedding + tsvector + content + source metadata.
- `nemoreport.reports`
- Top-level container. Status FSM: `uploaded → parsing → parsed → annotating → annotated → embedding → ready / failed`.
- `private` schema
- Supabase-managed schema for internal helpers (`custom_access_token_hook`, `user_has_tenant_access`, etc.). Not accessible to the `authenticated` role by default. Migration 0009 added an explicit USAGE grant for `supabase_auth_admin`.
- `service_role`
- A Supabase Postgres role with the `BYPASSRLS` attribute. Used by the backend + worker for write operations. NEVER exposed to the frontend.
- RPC (Remote Procedure Call)
- PostgREST exposes Postgres functions over a REST API. In Phase C, `search_chunks_by_folder` and `hybrid_search_chunks_by_folder` are RPC functions called from the backend via `supabase-py`.
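The report status FSM listed under `nemoreport.reports` can be written down as a transition table. This is a sketch under one stated assumption: the glossary does not say from which statuses `failed` is reachable, so the code below assumes every non-terminal status may move to `failed`. `can_transition` is a hypothetical helper, not a function from the codebase.

```python
# Happy path exactly as listed in the glossary entry.
PIPELINE = [
    "uploaded", "parsing", "parsed",
    "annotating", "annotated", "embedding", "ready",
]

# Assumption: any non-terminal status may also move to "failed".
TRANSITIONS: dict[str, set[str]] = {
    current: {following, "failed"}
    for current, following in zip(PIPELINE, PIPELINE[1:])
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the status change is allowed by the table above."""
    return target in TRANSITIONS.get(current, set())
```

For example, `can_transition("embedding", "ready")` holds, while skipping stages (`"uploaded"` to `"ready"`) or leaving a terminal status does not.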
Infra
- Cloudflare R2
- Object storage compatible with the S3 API. We do NOT use it directly: for file storage we use Supabase Storage (their managed backend). CF serves only as a CDN/edge proxy in front of the Supabase API.
- Supabase Storage
- Managed S3-compatible object storage from Supabase. The bytes live on their infrastructure. NemoReport has 4 buckets. Files are uploaded via `supabase.storage.from_(bucket).upload(path, bytes)` and retrieved via signed URLs.
- Cloudflare Workers
- Serverless edge runtime (V8 isolates). The frontend is deployed as a Worker via `@opennextjs/cloudflare`.
- JWKS (JSON Web Key Set)
- Supabase exposes public keys for JWT verification at `{supabase_url}/auth/v1/.well-known/jwks.json`. The backend caches them with a 1h TTL.
- Sliplane
- CZ-based Docker hosting (Hetzner backend). 3 services: backend, worker, redis.
- Supabase
- Postgres + Auth + Storage + Realtime BaaS. Project ref `cubdrgjdkatyecrgckwp`.
- taskiq
- An async task queue framework for Python. Used with a Redis Stream broker. From Phase B onward it orchestrates the worker pipeline.
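The 1h TTL cache mentioned in the JWKS entry could look roughly like this. It is a generic sketch, not the backend's actual implementation. `TtlCache` and `fetch_jwks_stub` are hypothetical names, and the stub returns an empty key set instead of issuing an HTTP request to the JWKS endpoint.

```python
import time
from typing import Any, Callable, Optional


class TtlCache:
    """Caches one fetched value and refetches it once the TTL has elapsed."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_seconds: float = 3600.0,  # 1h, as stated in the JWKS entry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at: Optional[float] = None  # None: nothing cached yet

    def get(self) -> Any:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


def fetch_jwks_stub() -> dict:
    # Stand-in for an HTTP GET of {supabase_url}/auth/v1/.well-known/jwks.json.
    return {"keys": []}


jwks_cache = TtlCache(fetch_jwks_stub)
keys = jwks_cache.get()["keys"]
```

Injecting the clock keeps the expiry logic testable without waiting an hour. A production variant would likely also refetch when a token carries an unknown key id.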
Domain (real estate, NemoReport-specific)
- KOMPLET (KOMPLET / TECHNICKÁ ČÁST)
- Denotes the scope of a report. KOMPLET = full report (technical + valuation). KOMPLET TECHNICKÁ ČÁST = technical section only.
- KN (Katastr nemovitostí)
- The Czech government real estate register (cadastre). Reports compare the KM (katastrální mapa, the cadastral map) with orthophotos.
- Mawis Utility
- A Czech system for information on inženýrské sítě (utility infrastructure). Phase B ingestion typically contains data from it.
- MHTML
- Web archive format (a single file containing HTML + linked resources, base64-encoded). The format in which Nette stores reports for NemoReport AI. Phase B ingest path: `_parse_via_bs4_mhtml`.
- Nette
- A PHP framework + the real application that generates NemoReport reports. Phase E plans an integration (JWT bridge + HMAC webhooks).
- Parcela
- A real estate parcel (a specific land area). Reports reference parcel numbers.
- Q100
- Hydrology: a flood zone with a 100-year return period. "Pozemek je v zóně Q100" ("the plot lies in the Q100 zone") = 1% probability of a flood per year.
- Územní plán
- Master plan for municipal land use (zoning).
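The 1% per year figure in the Q100 entry is easy to misread as "one flood every 100 years". A short calculation shows what it implies over a longer horizon. This is a sketch that assumes flood years are statistically independent, a simplification not stated in the glossary. `flood_probability` is a hypothetical helper.

```python
def flood_probability(years: int, return_period: int = 100) -> float:
    """Probability of at least one flood within `years`, assuming independent years."""
    annual = 1.0 / return_period          # Q100 → 0.01 per year
    return 1.0 - (1.0 - annual) ** years  # complement of "no flood in any year"


one_year = flood_probability(1)
thirty_years = flood_probability(30)  # e.g. over a typical mortgage term
```

Under this assumption a plot in the Q100 zone has roughly a 26% chance of at least one such flood over 30 years.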
Czech-specific terms
- `ts_rank_cd`
- Postgres function for BM25-like scoring of a tsvector match. CD = "cover density", a ranking variant that takes the proximity of matching lexemes into account.
- `websearch_to_tsquery`
- User-friendly tsquery parser (handles natural language better than `to_tsquery`). Used in the Phase C BM25 leg.
- `czech_unaccent`
- Custom Postgres text search configuration (Phase C migration 0014). `simple` + `unaccent` parser → strips diacritics, no stemmer.
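The effect of a `simple` + `unaccent` configuration (lowercase, strip diacritics, no stemming) can be approximated in Python for illustration. This is not how Postgres implements it: the `unaccent` extension uses a rules file, whereas the sketch below relies on Unicode decomposition. `normalize_token` is a hypothetical helper.

```python
import unicodedata


def normalize_token(text: str) -> str:
    """Lowercase and strip diacritics, roughly what simple + unaccent yields."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop combining marks (háček, čárka, kroužek) and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = normalize_token("Územní plán")
# No stemmer: inflected forms such as "plánu" stay distinct from "plán".
inflected = normalize_token("plánu")
```

Queries typed without diacritics therefore match accented text, but different grammatical forms of the same word do not match each other.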
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ADR | Architecture Decision Record |
| AGPL | Affero General Public License (copyleft) |
| BM25 | Best Matching 25 (search algorithm) |
| BS4 | BeautifulSoup4 (HTML parser) |
| CF | Cloudflare |
| CORS | Cross-Origin Resource Sharing |
| CRUD | Create / Read / Update / Delete |
| CTE | Common Table Expression (SQL WITH) |
| DR | Disaster Recovery |
| ENV | Environment variable |
| FE / BE | Frontend / Backend |
| GDPR | General Data Protection Regulation (EU) |
| HNSW | Hierarchical Navigable Small World (ANN algorithm) |
| HMAC | Hash-based Message Authentication Code |
| JWT | JSON Web Token |
| JWKS | JSON Web Key Set |
| MIME | Multipurpose Internet Mail Extensions (file type) |
| MAU | Monthly Active Users |
| MRR | Mean Reciprocal Rank (eval metric) |
| MVP | Minimum Viable Product |
| OCR | Optical Character Recognition |
| OWASP | Open Web Application Security Project |
| PITR | Point-in-Time Recovery |
| RAG | Retrieval-Augmented Generation |
| RBAC | Role-Based Access Control |
| RLS | Row Level Security |
| RPC | Remote Procedure Call |
| RPM | Requests Per Minute |
| RRF | Reciprocal Rank Fusion |
| RTL | React Testing Library |
| RTO / RPO | Recovery Time Objective / Recovery Point Objective |
| SaaS | Software as a Service |
| SDK | Software Development Kit |
| SLA | Service Level Agreement |
| SOC2 | System and Organization Controls 2 (security audit framework) |
| SSE | Server-Sent Events |
| SSL / TLS | Secure Sockets Layer / Transport Layer Security |
| SSPL | Server Side Public License (problematic for SaaS) |
| T&C | Terms and Conditions |
| TBD | To Be Determined |
| TS | TypeScript |
| TTL | Time To Live (cache lifetime) |
| UUID | Universally Unique Identifier |
| VAT | Value-Added Tax |
| VC | Venture Capital |