Retrieval-Augmented Generation (RAG)

RAG is a pattern that retrieves passages from a private knowledge base before an LLM generates an answer, so responses cite your own documents instead of fabricating facts. In ONOXIA, RAG is the default answering mode for every site.

Purpose

Generic LLMs hallucinate when asked about proprietary content. RAG anchors each answer in a verifiable source you control, which is what makes a chat widget trustworthy enough to deploy on a public site without supervision.

Scope

RAG applies to visitor questions that arrive at the ONOXIA widget. It runs before model inference on each such turn, regardless of language. The exception is small talk and off-topic messages, which are handled by the persona's fallback behaviour instead.

Components

  • Ingestion — PDFs, FAQ pairs, URLs and plain text are split into overlapping chunks (~500 tokens).
  • Embedding — each chunk is encoded as a vector by a multilingual embedding model.
  • Store — vectors live in a Qdrant index, scoped per site.
  • Retrieval — at query time the visitor's question is embedded, the top-k nearest chunks are fetched, and they are inserted into the prompt.
  • Generation — Mistral AI (European languages), Qwen (Chinese and other Asia-Pacific languages) or Gemini Flash (global fallback) produces the final answer, constrained by the retrieved context.
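
The ingestion and retrieval steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ONOXIA's implementation: an in-memory list stands in for the Qdrant index, the vectors are assumed to come from some embedding model that is not shown, and the function names (chunk_tokens, top_k, build_prompt), the 50-token overlap and the prompt wording are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def chunk_tokens(tokens: List[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Vector, index: List[Tuple[str, Vector]], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k (score, chunk_text) pairs most similar to the query vector.

    `index` is a plain list of (chunk_text, vector) pairs, used here only as a
    stand-in for a per-site vector store.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Insert the retrieved passages into the prompt ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In use, the question would be embedded with the same model as the chunks, passed to top_k, and the returned chunk texts handed to build_prompt before the generation step. A real vector store would replace the linear scan with an approximate nearest-neighbour search.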

Outputs

  • A grounded answer in the visitor's language (28 supported).
  • An internal trace showing which chunks were retrieved, available in the dashboard for audit.
  • Token consumption metered against the site's monthly budget.

Relationships

RAG is implemented by ONOXIA's SoftwareApplication, configured per site by a Persona, and routed across providers via Multi-LLM-Routing. When RAG cannot answer with high enough confidence, the bot may trigger Human-Handover.
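
The source does not say how confidence is measured. One common proxy, shown here purely as an assumption, is the similarity score of the best retrieved chunk: if nothing in the knowledge base is close enough to the question, the answer is unlikely to be well grounded. The function name should_hand_over and the 0.75 threshold are hypothetical.

```python
from typing import List


def should_hand_over(scores: List[float], min_score: float = 0.75) -> bool:
    """Return True when no retrieved chunk is similar enough to trust.

    `scores` are the similarity scores of the retrieved chunks; an empty
    list means retrieval found nothing at all.
    """
    return not scores or max(scores) < min_score
```

For example, should_hand_over([0.91, 0.40]) is False because one chunk clears the threshold, while should_hand_over([0.52, 0.48]) is True.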

Authority

Defined by OCENOX LTD as the canonical retrieval pattern used across the ONOXIA platform.

Version

1.0 — 2026-05-22

Related terms

  • Shadow DOM Chat Widget — A chat widget loaded inside a Shadow DOM root so its styles and DOM are isolated from the host page, guaranteeing no CSS conflicts and no script collisions regardless of the host CMS or theme.
  • Persona Configuration — A per-site bundle that defines who the bot is, how it speaks, which tools it may call, and what it must refuse. Personas sit between the visitor's question and the LLM, shaping the system prompt and tool registry for every conversation on that site.
  • Multi-LLM Language Routing — A request-time decision that picks the inference provider best suited for the visitor's language pair — Mistral AI in Paris for European languages (GDPR-compliant), Qwen for Chinese and other Asia-Pacific languages, Gemini Flash as a global fallback.
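
The routing decision described in the last entry can be pictured as a lookup on the visitor's language code. The sketch below is illustrative only: the language sets are small assumed subsets, the provider identifiers are made-up strings, and pick_provider is a hypothetical name rather than part of any ONOXIA API.

```python
# Illustrative subsets only; the real language lists are not given in the source.
EUROPEAN = {"en", "de", "fr", "nl", "es", "it"}
ASIA_PACIFIC = {"zh", "ja", "ko"}


def pick_provider(lang: str) -> str:
    """Map a language tag such as 'fr-FR' or 'zh_CN' to a provider label."""
    code = lang.lower().replace("_", "-").split("-")[0]
    if code in EUROPEAN:
        return "mistral"
    if code in ASIA_PACIFIC:
        return "qwen"
    return "gemini-flash"  # global fallback for everything else
```

With these assumed sets, pick_provider("fr-FR") selects the European provider and an unlisted language falls through to the global fallback.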