Build a Custom Knowledge Base
The Alauda Hyperflux plugin ships with a built-in system knowledge base covering Alauda Container Platform (ACP) and Alauda product documentation. Alongside it, Hyperflux maintains a separate user knowledge base that you can fill with your own content — internal runbooks, SRE playbooks, design documents, versions of Alauda product docs that don't ship in the bundled corpus, or any other private Git repositories. Hyperflux retrieves from both knowledge bases on every question and merges the results, so adding a custom corpus augments the bundled product knowledge rather than replacing it.
This guide covers the admin-driven path: bulk-ingest a list of Git repositories into the user knowledge base using the in-pod builder. (For end-user file uploads through the chat UI, use the BYO Knowledge tool added in v1.3.1 — same destination, different entry point.)
The workflow runs inside the deployed smart-doc pod. The image already contains the builder source,
the embedding model, and the right Python environment, and the pod already has connectivity to the user
knowledge-base PostgreSQL. You only need a manifest of repositories to ingest and kubectl exec access to
the global cluster.
TOC
- How the embedding pipeline works
- Prerequisites
- Step 1 — Describe your corpus
- Step 2 — Copy the manifest into the smart-doc pod
- Step 3 — Run prepare inside the pod
- Step 4 — Run embed inside the pod
- Step 5 — Create the BM25 index on the user knowledge base (one-time)
- (Optional) Multi-cluster: dump and restore
- Replacing the bundled system knowledge base (advanced)
- Re-roll cadence
- Caveats
How the embedding pipeline works
The smart-doc builder turns a list of Git repositories into a vectorised knowledge base in two stages:
- Prepare — clone the repos, split each `.md`/`.mdx` document by heading and chunk size, then call an LLM (the same one already configured for the running Hyperflux plugin) to generate a one-paragraph summary and a few representative questions per document. Output: `documents.json`.
- Embed — load the `gte-multilingual-base` embedding model, embed both the chunks and the per-document summaries, and write them to the user-knowledge-base PostgreSQL (`docvec_user_kb`) under the collection name the running server reads (`user_defined_kb` by default).
Once embedded, the running server picks up the new content on the next query — no chart edit, no pod restart. The system knowledge base is untouched.
NOTE: The BM25 index used by hybrid retrieval is not built by the embed step. The bundled system-KB dumps include a pre-built BM25 index, but the user-KB database is created empty during install. If your deployment has hybrid retrieval enabled (the default), you must create the BM25 index on the user-KB collection once — see Step 5.
Prerequisites
- `kubectl` access to the cluster running Hyperflux (the global cluster) with `exec` permission on the `smart-doc` pod in the `cpaas-system` namespace.
- Read access to every Git repository you want to ingest. If they are private, have an HTTPS user/token pair ready.
- An LLM endpoint with sufficient quota. The prepare phase calls the LLM once per source document (roughly 1,000–3,000 calls per ACP-sized knowledge base). By default the in-pod commands reuse the LLM endpoint already configured for Hyperflux (the one set during install via LLM Base URL / LLM API Key / LLM Model Name). If your account is rate-limited, prepare a separate higher-quota endpoint and override the env vars in the exec session (Step 3).
You do not need a separate workstation Python environment, a downloaded copy of gte-multilingual-base,
a separate PostgreSQL instance, or a clone of the smart-doc repository — the smart-doc container has all of
this baked in.
Step 1 — Describe your corpus
Create a JSON manifest listing every Git repository you want to ingest. Save it on your local machine as
my-kb.json (any name works — you'll copy it into the pod in Step 2).
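The manifest schema is defined by the in-pod builder, so the sketch below is illustrative only: `repos`, `sub_dirs`, and `doc_type` are the names this guide references elsewhere, while `url` and `branch` are assumed field names. Check the builder's documentation or `--help` inside the pod for the authoritative schema.

```bash
# Write an illustrative manifest locally; field names marked as assumed in the
# lead-in may differ in your builder version.
cat > my-kb.json <<'EOF'
{
  "repos": [
    {
      "url": "https://<user>:<token>@git.example.com/sre/runbooks.git",
      "branch": "main",
      "sub_dirs": ["docs"],
      "doc_type": [".md", ".mdx"]
    }
  ]
}
EOF
```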
Add more `repos` entries to ingest multiple repositories in a single run.
Field reference:
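(Inferred from the filters and prerequisites this guide mentions; only `repos`, `sub_dirs`, and `doc_type` are named elsewhere in the guide, the other names are assumptions.)
- `repos`: the list of repositories to ingest, one entry per repository.
- `url` (assumed name): HTTPS clone URL, with the user/token pair embedded for private repositories.
- `sub_dirs`: restricts ingestion to the listed directories inside the repository.
- `doc_type`: file-extension filter; the prepare phase only splits `.md`/`.mdx` documents.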
Step 2 — Copy the manifest into the smart-doc pod
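A minimal sketch: the `app=smart-doc` label selector and the `/tmp` landing path are assumptions, and `serve` is the container name used in Step 3.

```bash
# The label selector is an assumption; run `kubectl -n cpaas-system get pods`
# and pick the smart-doc pod manually if it doesn't match.
POD=$(kubectl -n cpaas-system get pods -l app=smart-doc -o jsonpath='{.items[0].metadata.name}')
kubectl cp my-kb.json "cpaas-system/${POD}:/tmp/my-kb.json" -c serve
```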
Step 3 — Run prepare inside the pod
Open an interactive shell in the serve container and run the prepare phase:
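This guide doesn't reproduce the builder's exact CLI; the sketch below assumes a `prepare` subcommand on the in-pod `smart_doc_builder.py` and illustrative `--config`/`--output` flags. Confirm the real entry point and flag spellings with the script's `--help` before a full run.

```bash
kubectl -n cpaas-system exec -it "${POD}" -c serve -- bash

# Inside the pod. If you need a higher-quota LLM endpoint for this run only,
# find the variable names the server actually uses and override them for this
# shell session (names vary by chart version):
env | grep -i llm
# export <LLM_BASE_URL_VAR>=... <LLM_API_KEY_VAR>=... <LLM_MODEL_NAME_VAR>=...

# Assumed entry point and flags; verify with --help first.
python smart_doc_builder.py prepare \
  --config /tmp/my-kb.json \
  --output /tmp/documents.json \
  --dryrun        # drop --dryrun once the printed document list looks right
```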
Useful flags:
- `--dryrun` — clone the repos and print the document list, but don't call the LLM. Use this first to confirm your `sub_dirs` and `doc_type` filters match what you expect.
NOTE: `prepare` does not cache LLM responses across runs — every invocation re-issues one LLM call per source document. Plan for the full per-document cost on each re-roll, and only re-run prepare when the source corpus actually changed (otherwise re-run only `embed` against the existing `documents.json`).
Step 4 — Run embed inside the pod
Stay in the same shell (so the env vars from Step 3 are still in effect):
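A sketch in the same spirit as Step 3: only `--pg-conn-str`, `--collection-name`, and `--vector-types` are flags this guide names; the subcommand and `--input` flag are assumed.

```bash
# --pg-conn-str targets the user KB (the script defaults to the system KB);
# --collection-name matches what the running server reads. The "chunk" value
# for --vector-types is an assumed name; doc_summary comes from the warning below.
python smart_doc_builder.py embed \
  --input /tmp/documents.json \
  --pg-conn-str "${PG_USER_KB_CONN_STR}" \
  --collection-name user_defined_kb \
  --vector-types chunk,doc_summary
```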
The chart already exports EMB_MODEL_NAME=/opt/gte-multilingual-base, DEVICE=cpu, and
PG_USER_KB_CONN_STR (pointing at the user-KB database docvec_user_kb), so you don't need to pass
--emb-model or --device. Pass --pg-conn-str "${PG_USER_KB_CONN_STR}" explicitly because the script's
default is the system-KB connection string (PG_SYS_KB_CONN_STR).
What the flags do:
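(Descriptions inferred from the behaviour this guide describes; confirm exact semantics with the script's `--help`.)
- `--pg-conn-str`: which PostgreSQL the vectors are written to; pass `"${PG_USER_KB_CONN_STR}"` so they land in the user KB instead of the script's system-KB default.
- `--collection-name`: the collection the rows are written under; keep `user_defined_kb` so the running server picks them up without any configuration change.
- `--vector-types`: which vector sets to produce; including `doc_summary` only pays off if prepare ran with `ENABLE_GEN_QUESTIONS=true` (see the warning below).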
WARNING: When `--vector-types` includes `doc_summary` but `documents.json` was produced by a `prepare` run that didn't have `ENABLE_GEN_QUESTIONS=true` set, embed silently emits zero summary vectors and only logs `no_summary=N` at the end. If you see a high `no_summary` count, re-run `prepare` with `ENABLE_GEN_QUESTIONS=true` before re-running embed.
NOTE: The serve container runs on CPU by default (`DEVICE=cpu`). Embedding 5,000 chunks on CPU takes roughly an hour; for a small internal corpus (a few hundred docs) it's a few minutes. If you need GPU throughput, run the build on a GPU-scheduled pod (out of scope for this guide) and use the multi-cluster dump-and-restore path described later.
The embed step does not call the LLM; it can be re-run cheaply against the same documents.json to try a
different chunk size or to extend an existing collection with new documents.
Step 5 — Create the BM25 index on the user knowledge base (one-time)
The user-KB database is created empty during install — unlike the bundled system-KB dumps, it does not ship with a BM25 index pre-built. If hybrid retrieval is enabled (the default), create the index once after your first embed run:
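The exact `CREATE INDEX` statement depends on which BM25 extension your PostgreSQL ships, so rather than guess it here, copy the definition the bundled system-KB database already carries. A sketch, assuming `psql` is reachable from inside the pod and that the exported connection strings are plain libpq URIs (if they are SQLAlchemy-style, e.g. `postgresql+psycopg2://`, strip the driver suffix first):

```bash
# Look up the BM25 index definition that ships with the system KB
# (ignore the primary-key and vector indexes in the output) ...
psql "${PG_SYS_KB_CONN_STR}" -c \
  "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'langchain_pg_embedding';"

# ... then run the equivalent CREATE INDEX against the user KB, renaming the
# index if needed so it doesn't collide.
psql "${PG_USER_KB_CONN_STR}" -c "<CREATE INDEX statement adapted from above>"
```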
Without this index, the hybrid-retrieval path silently falls back to dense-only on the user KB (which still returns results, just without exact-keyword recall).
That's it — there is no chart edit and no pod restart required. The running server already reads from
docvec_user_kb / user_defined_kb on every query, so your custom corpus starts contributing to answers
immediately. Run a representative question through the chat UI to confirm.
(Optional) Multi-cluster: dump and restore
If you want to build the corpus once and push it to several clusters, take a pg_dump of the user KB after
Step 4 and restore it on each target cluster:
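A minimal sketch with plain `pg_dump`/`pg_restore`; run it from wherever you can reach both clusters' knowledge-base PostgreSQL instances, and adjust connection strings as in Step 5 if they carry a SQLAlchemy driver suffix.

```bash
# Against the build cluster: custom-format dump of the whole user-KB database.
pg_dump --format=custom --file=user_kb.dump "${PG_USER_KB_CONN_STR}"

# Against each target cluster: restore into its (still empty) user-KB database.
pg_restore --no-owner --dbname="<target cluster user-KB connection string>" user_kb.dump
```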
Then perform Step 5 (BM25 index) on each target cluster — the index is per-database and doesn't survive a
restore-into-empty-DB on its own (a restore into a populated DB merges BM25 if present in the dump; for a
fresh user KB, run the CREATE INDEX after the restore).
NOTE: `pg_restore` of the user-KB dump into a target cluster that already has user-KB content (e.g., previous BYO uploads) will conflict on the langchain table CREATEs. For a clean overlay, use the chart's swap mechanism instead, or first drop and recreate `docvec_user_kb` on the target. Plain `pg_restore` is safest only when the target user KB is empty.
Replacing the bundled system knowledge base (advanced)
The workflow above adds to the user KB and leaves the bundled product knowledge intact. If you have a different requirement — for example, the bundled Alauda product docs are wrong for your environment and you want to replace them with an in-house variant — repeat the same in-pod workflow but target the system KB:
- `--pg-conn-str "${PG_SYS_KB_CONN_STR}"`
- `--collection-name <your-system-kb-collection-name>`
- Update `pgconnect.pgCollectionName` in the chart to point at your collection name and re-roll.
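For example, with the same caveats about assumed subcommand and flag spellings as in Step 4 (the collection name and the helm release/chart names below are placeholders):

```bash
# Same assumed embed invocation as Step 4, retargeted at the system KB.
python smart_doc_builder.py embed \
  --input /tmp/documents.json \
  --pg-conn-str "${PG_SYS_KB_CONN_STR}" \
  --collection-name my_system_kb

# Then point the chart at the new collection and re-roll; release/chart names
# are placeholders, so use however you normally manage the Hyperflux chart.
helm upgrade <release> <chart> --reuse-values --set pgconnect.pgCollectionName=my_system_kb
```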
This path requires a chart edit and disables automatic system-KB upgrades on future Hyperflux releases (the auto-swap rule scan won't match your custom collection name). Use it only if user-KB augmentation isn't enough.
Re-roll cadence
Custom knowledge bases drift faster than product documentation. Plan to re-run prepare → embed on a schedule that matches your source repos:
- Daily-changing runbooks → nightly cron (a Kubernetes `CronJob` that runs the same in-pod commands as a one-shot job) re-embedding into the same `user_defined_kb` collection.
- Quarterly architecture docs → manual re-roll on each release boundary.
Re-running embed against the same collection name uses deterministic row IDs (make_row_id at
smart_doc_builder.py:38-48), so unchanged chunks upsert idempotently and changed chunks overwrite — but
rows for documents that were removed from the manifest become orphans. For a clean replacement, drop and
recreate the collection before re-embedding:
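A sketch, assuming the standard langchain-postgres layout, where deleting the `langchain_pg_collection` row cascade-deletes its `langchain_pg_embedding` rows; verify the cascade on your schema before relying on it.

```bash
# Removes the user-defined collection and (via the FK cascade) all of its
# embedding rows; the next embed run recreates the collection.
psql "${PG_USER_KB_CONN_STR}" -c \
  "DELETE FROM langchain_pg_collection WHERE name = 'user_defined_kb';"
```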
Then re-run Step 4 to repopulate. The BM25 index on langchain_pg_embedding continues to apply to the new
rows automatically, so you do not need to re-run Step 5 — unless you also dropped the database itself
(e.g., for the multi-cluster pg_restore overlay above).
Caveats
- The custom corpus shares the user KB with BYO Knowledge tool uploads. Files end users upload through the chat UI (introduced in v1.3.1) land in the same `user_defined_kb` collection as your admin-curated corpus. They coexist as long as URLs don't collide; in practice this is fine because BYO uploads use a distinct URL scheme. If you ever drop and recreate the collection (Re-roll cadence section), BYO uploads in that collection are wiped along with your corpus — back them up first if you care about them.
- System KB version filters do not apply to the user KB. When a user asks a version-specific question (e.g., about ACP 4.3), the system KB is filtered to that version but the user KB returns matches across all versions. If your custom corpus is multi-version, expect cross-version chunks to surface in answers.
- Embedding-model mismatch is the most common silent failure: vectors built with anything other than the in-pod `/opt/gte-multilingual-base` will be retrievable but always score near zero, so answers collapse to "I don't have enough information" without an obvious error. The in-pod workflow described here uses the baked-in model by default — only override `EMB_MODEL_NAME` if you know what you're doing.