The context wiki: brand, ICP & voice — what the AI writes from
Every blog post Nukipa generates is grounded in a per-tenant knowledge base called the context wiki. It's where the company profile, ICPs, USPs, and writing voice live. Before the writer agent drafts anything, it reads from this wiki; if the wiki is thin or wrong, the output is thin or wrong. This article explains the model, how onboarding populates it, and exactly which documents the writer reads.
The document model
The context wiki is a path-addressed virtual filesystem of markdown documents, one per tenant. There are no folder rows in the database — folders are derived from the /-separated paths. A document at /profile/company.md implies a /profile/ folder; nothing else creates it.
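Because folders are purely derived, a folder listing can be computed from the document paths alone. The following is a minimal sketch of that idea; `deriveFolders` is a hypothetical helper name used for illustration, not a function from the service.

```javascript
// Hypothetical helper: derive the folder set from document paths.
// Folders are never stored; each one exists only because a path passes through it.
function deriveFolders(paths) {
  const folders = new Set();
  for (const path of paths) {
    const parts = path.split("/").filter(Boolean); // drop the empty leading segment
    parts.pop(); // the last segment is the document itself
    let current = "/";
    for (const part of parts) {
      current += part + "/";
      folders.add(current);
    }
  }
  return [...folders].sort();
}

console.log(deriveFolders(["/profile/company.md", "/cms/posts/launch.md"]));
```

Removing the last document under a folder makes the folder disappear from the derived listing, which matches the "nothing else creates it" rule.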
Each document row looks like this:
{
id, tenant_id, campaign_id, path, slug, title, kind, is_system,
source_url, original_path, original_mime,
body, body_preview, metadata,
created_at, updated_at
}
Rules worth internalising:
- path is the natural key. It's unique per tenant (per campaign, where a campaign is set), /-prefixed, and ends in .md. Writing to the same path twice updates the document rather than creating a duplicate, which is why the write tool is called context_upsert_document.
- body is plain markdown. No proprietary format. You can read and edit any document by hand.
- Originals are preserved, but only HTML today. When the onboarding crawler imports a page, the raw HTML is kept in the private context-originals storage bucket (per-tenant prefix). The markdown body is the working copy; the original HTML is the receipt. There is no PDF/DOCX ingestion path yet: POST /import/file returns 501. If you want a PDF or DOCX in the wiki, upload it to storage directly and write the markdown body yourself; nothing produces those originals automatically.
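The upsert rule can be illustrated with an in-memory model. This is a sketch under stated assumptions, not the service's storage code: the `store` map and the `upsertDocument` function are hypothetical stand-ins, and the key is assumed to be the (tenant, campaign, path) triple described above.

```javascript
// In-memory stand-in for the documents table; illustrative only.
const store = new Map();

// Hypothetical upsert keyed on (tenant, campaign, path), the natural key described above.
function upsertDocument({ tenantId, campaignId = null, path, body }) {
  if (!path.startsWith("/") || !path.endsWith(".md")) {
    throw new Error("path must be /-prefixed and end in .md");
  }
  const key = JSON.stringify([tenantId, campaignId, path]);
  const existing = store.get(key);
  const now = new Date().toISOString();
  const doc = existing
    ? { ...existing, body, updated_at: now } // same path: update in place
    : { tenant_id: tenantId, campaign_id: campaignId, path, body, created_at: now, updated_at: now };
  store.set(key, doc);
  return { doc, created: !existing };
}
```

Calling it twice with the same tenant and path returns `created: true` the first time and `created: false` the second, and the store still holds a single row.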
Kinds
Every document has a kind. It's a loose classifier — search and the writer use it to filter — not a rigid schema. The current set:
| kind | Typical path | What it holds |
|---|---|---|
| profile | /profile/company.md | What the company does, who it serves, how it positions itself |
| industry | /profile/industry.md | Primary sector, broader category, adjacent industries |
| product | /products/overview.md | Catalogue: per-product name, function, audience, features |
| icp | /icp/overview.md | 1–3 ideal customer profiles |
| usp | /usp/overview.md | 3–6 unique selling points |
| writing_style | /writing-style/voice.md | Tone, vocabulary, rhythm, taboos |
| competitor | /competitors/<slug>.md | Confirmed competitors to track |
| page | /web/<slug>.md | Raw crawled site pages |
| cms_post | /cms/posts/<slug>.md | Mirror of a published blog post (see below) |
| note, doc, pdf | anywhere | Free-form additions you upload or write |
| system | - | Internal |
Onboarding also writes two social-channel documents — /social/voice.md (a LinkedIn-tuned voice, often punchier than long-form) and /social/visual-style.md (mood, colour direction, typography for generated graphics, seeded with the detected brand hex codes). When a blog post is published, the CMS mirrors a flattened copy into the wiki at /cms/posts/<slug>.md (kind cms_post) so the writer can answer "have we already covered this?" without hitting the CMS database.
The brand document is virtual
/settings/brand.md looks like a normal document, but it is not stored in context.documents. It is synthesised on read from the tenant's brand_theme_* settings and parsed back into those settings on write. It carries kind: 'brand', is_system: true, and a virtual: true flag, and its id is the same fixed UUID for every tenant. It has no original file.
Because it's a projection, the generic document rules above do not apply to it:
- Editing the body only accepts colour rows of the form `- **Primary**: #1f2937` (also Accent, Background, Text) and a Font family line. Those values flow into the brand-theme override, per field.
- Anything else in the body (a custom title, narrative paragraphs) is discarded on save. A body with no recognisable hex rows is rejected with a 400.
- You cannot move, recategorise, or delete it. PATCH attempts on its path/kind are refused.
[!NOTE] The brand theme has two surfaces that write to the same place: the /settings/brand.md body and the dashboard's Brand theme tab (also reachable via context_set_brand_theme). Editing colours in either is equivalent. For anything other than colours and the font, use the structured tools; the markdown body has nowhere to store it.
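The write-side rules for the brand document can be sketched as a small parser. This is an illustration only: the row labels come from the text above, but the exact syntax of the Font family row, the six-digit hex restriction, the error message, and the `parseBrandBody` name and return shape are all assumptions.

```javascript
// Sketch of parsing the virtual brand document body on write.
// Assumptions: six-digit hex values, and a "- **Font family**: <name>" row syntax.
const COLOUR_ROW = /^- \*\*(Primary|Accent|Background|Text)\*\*:\s*(#[0-9a-fA-F]{6})\s*$/;
const FONT_ROW = /^- \*\*Font family\*\*:\s*(.+?)\s*$/;

function parseBrandBody(body) {
  const theme = {};
  for (const line of body.split("\n")) {
    const colour = line.match(COLOUR_ROW);
    if (colour) {
      theme[colour[1].toLowerCase()] = colour[2].toLowerCase();
      continue;
    }
    const font = line.match(FONT_ROW);
    if (font) theme.fontFamily = font[1];
    // Any other line (custom titles, narrative paragraphs) is dropped.
  }
  const hasColour = ["primary", "accent", "background", "text"].some((k) => k in theme);
  if (!hasColour) {
    return { status: 400, error: "no recognisable hex rows" }; // hypothetical message
  }
  return { status: 200, theme };
}
```

Only the fields that appear in the body end up in `theme`, which mirrors the per-field override behaviour described above.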
System documents
Documents written by the onboarding pipeline carry is_system: true. They're the profile/industry/ICP/USP/voice set the writer depends on, so they're partly protected:
- You can edit their body, title, and metadata, by hand in the dashboard or via context_patch_document.
- You cannot delete them. Deletion is unconditionally refused with a 403.
- Moving them (changing path or kind) is refused with a 403, but only when the patch contains nothing else. A patch that also includes a body, title, or metadata change skips that guard and the move succeeds.
[!WARNING] The "system documents can't be moved" protection is partial. The 403 fires only on a path/kind-only patch; bundling any content edit alongside the move bypasses it (patchDocument, services/context/src/modules/documents/service.js). This is a code gap, not a designed escape hatch, so don't rely on either behaviour. Deletion, by contrast, is always refused.
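To make the gap concrete, here is a small model of the behaviour as described. It is not the code in patchDocument: both function names are hypothetical, a "move" is simplified to "the patch contains a path or kind key", and the stricter variant is only one possible way the guard could be closed.

```javascript
// Illustrative model of the described behaviour, not the code in patchDocument.
function checkSystemDocPatch(doc, patch) {
  const isMove = ["path", "kind"].some((k) => k in patch);
  const hasContent = ["body", "title", "metadata"].some((k) => k in patch);
  // As described: the guard fires only when the patch is a move and nothing else.
  if (doc.is_system && isMove && !hasContent) {
    return { status: 403 };
  }
  return { status: 200 };
}

// One possible stricter check (an assumption, not a planned fix):
// refuse any move on a system document, whatever else the patch carries.
function checkSystemDocPatchStrict(doc, patch) {
  const isMove = "path" in patch || "kind" in patch;
  return doc.is_system && isMove ? { status: 403 } : { status: 200 };
}
```

With a system document, `{ path: "/x.md" }` is refused by both checks, while `{ path: "/x.md", title: "X" }` passes the first and is refused only by the stricter one.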
Linking documents
Documents can reference each other, and links are extracted from the body on every write. Two forms:
See our [main ICP](nukipa:///icp/overview.md) for details.
Or the wiki form: [[/icp/overview.md|main ICP]]
Link rows are a derived projection — rebuilt every upsert/patch. If the target exists, the link resolves to its id; if not, the unresolved path is retained so it resolves later when the target is created. context_document_links returns both outbound and inbound links, which is how you navigate the wiki as a graph.
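Link extraction and resolution can be sketched with two regular expressions, one per link form. The patterns and the `extractLinks` and `resolveLinks` names are assumptions made for illustration; in particular, whether the wiki form may omit its label is not stated above, so the sketch simply tolerates it.

```javascript
// Sketch of extracting both link forms from a markdown body (patterns are assumptions).
const MD_LINK = /\[([^\]]*)\]\(nukipa:\/\/(\/[^)\s]+)\)/g;
const WIKI_LINK = /\[\[(\/[^\]|]+)(?:\|([^\]]*))?\]\]/g;

function extractLinks(body) {
  const links = [];
  for (const m of body.matchAll(MD_LINK)) {
    links.push({ targetPath: m[2], label: m[1] });
  }
  for (const m of body.matchAll(WIKI_LINK)) {
    links.push({ targetPath: m[1], label: m[2] ?? null });
  }
  return links;
}

// Resolve each link to a document id when the target exists;
// otherwise keep the path so it can resolve once the target is created.
function resolveLinks(links, idByPath) {
  return links.map((link) => ({ ...link, targetId: idByPath.get(link.targetPath) ?? null }));
}
```

Since link rows are rebuilt on every write, running both functions over the full body each time is enough; no incremental bookkeeping is needed.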
Onboarding: how the wiki gets populated
You don't build the wiki by hand from scratch. Onboarding does a first pass from the company's own website. It's a fire-and-forget background pipeline kicked off by a single call:
context_onboarding_start { company_url: "https://acme.com" }
Then poll context_onboarding_status to track progress. Note that this returns the tenant's latest onboarding run, not specifically the run you just started. When only one onboarding run is in flight (the normal case), the two are the same.
The pipeline runs these phases in order, each writing one or more system documents:
1. crawling: crawl the site, up to 80 pages, each saved as a /web/<slug>.md document (kind page) with the original HTML stored. The homepage is detected as the shortest source URL (robust against www/locale-path redirects); nav, footer, logo, and hero image are extracted from its HTML onto the document's metadata.
2. profile → /profile/company.md
3. industry → /profile/industry.md
4. brand: infers colours and typography from the homepage HTML into the virtual brand doc. Best-effort, never fatal.
5. products → /products/overview.md
6. icp → /icp/overview.md
7. usp → /usp/overview.md
8. writing_style → /writing-style/voice.md
9. social → /social/voice.md and /social/visual-style.md
10. competitors: LLM-driven discovery; up to 3 suggestions land in the run result for you to confirm separately via context_confirm_competitors. Nothing is written to /competitors/ until you confirm.
11. complete
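The homepage heuristic from the crawling phase is simple enough to show directly. A minimal sketch, assuming each crawled page carries the source_url field from the document row; `detectHomepage` is a hypothetical name.

```javascript
// Sketch: pick the homepage as the crawled page with the shortest source URL.
// A redirect to a www host or a locale path lengthens every URL by the same prefix,
// so the landing page still ends up shortest.
function detectHomepage(pages) {
  if (pages.length === 0) return null;
  return pages.reduce((shortest, page) =>
    page.source_url.length < shortest.source_url.length ? page : shortest
  );
}

const pages = [
  { source_url: "https://www.acme.com/en/pricing" },
  { source_url: "https://www.acme.com/en/" },
  { source_url: "https://www.acme.com/en/about" },
];
console.log(detectHomepage(pages).source_url);
```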
Each text artifact is one constrained LLM call against an aggregated, ~40k-character extract of the crawled pages (model gpt-4.1-mini, temperature 0.4, max 1500 tokens). The prompts are deliberately tight — a skimmable wiki page, not an essay.
[!WARNING] The crawl is the one place onboarding can hard-fail. If the crawler fetches zero pages, the run aborts with status: 'failed' and the error 'no pages could be fetched from the URL'. Every later phase is best-effort and never aborts the run. If onboarding fails outright, check the URL is reachable and not blocking crawlers.
[!WARNING] When the LLM is unreachable, several phases produce nothing rather than failing. If no LLM provider is configured (or the LLM service is down), each text artifact is written with a placeholder body ("_Generated context will appear here…_"), and under the same condition the competitors phase yields zero suggestions and brand detection silently no-ops. Each text artifact records the source in its metadata as generated_by: 'llm' or generated_by: 'placeholder'; that flag is the reliable, machine-readable signal that a phase ran without a model. Re-run onboarding once a provider is configured, or write the documents yourself.
[!NOTE] Competitor discovery returns an empty list (not an error) on several paths beyond "LLM down": empty model output, unparseable JSON, or when every candidate scores below confidence 0.7. An empty competitors step is normal and non-fatal.
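The empty-list paths can be sketched as a tolerant parser. The 0.7 threshold and the cap of 3 come from the text above; the JSON shape (an array of objects with a numeric confidence field), the ordering by confidence, and the `parseCompetitorSuggestions` name are assumptions.

```javascript
// Sketch of the non-fatal paths: every failure mode yields [] rather than throwing.
// Assumed model output: a JSON array of { name, url, confidence } objects.
function parseCompetitorSuggestions(rawModelOutput, { minConfidence = 0.7, max = 3 } = {}) {
  if (!rawModelOutput || !rawModelOutput.trim()) return []; // empty model output
  let candidates;
  try {
    candidates = JSON.parse(rawModelOutput);
  } catch {
    return []; // unparseable JSON
  }
  if (!Array.isArray(candidates)) return [];
  return candidates
    .filter((c) => c && typeof c.confidence === "number" && c.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, max); // at most three suggestions reach the run result
}
```

A caller treating `[]` as "nothing to confirm" needs no special error handling for this phase.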
This is a starting point, not the finish line
Onboarding gets you a usable wiki from public marketing copy. It cannot know what your website doesn't say — internal positioning, the segment you're actually pivoting toward, the phrases your founder refuses to use. Read the generated documents and edit them. They are the single biggest lever on output quality, and they're just markdown. Add documents too: context_import_url pulls a single page into /web/<slug>.md (overridable with target_path), and context_upsert_document writes anything you want the writer to know.
What the writer actually reads
When the CMS writer generates a post, the context resolver (services/cms/src/lib/contextResolver.js) assembles the prompt from several surfaces. They split into a fixed company layer, a per-post author layer, and an on-demand search layer.
Fixed company documents
Six well-known paths are fetched by exact path (each independently; a missing one is skipped, not fatal):
| Field in prompt | Path |
|---|---|
| company_profile | /profile/company.md |
| industry | /profile/industry.md |
| products | /products/overview.md |
| icp | /icp/overview.md |
| usp | /usp/overview.md |
| writing_style | /writing-style/voice.md |
Each present document is pasted into the user prompt under its own # (H1) heading. This is the baseline grounding on every post.
The per-post author byline
There is a per-post override layer. A post can carry an author_id (cms.posts.author_id); when it does, the resolver pulls that author's name, job_title, and writing_style and injects a # Author: … block. The author's writing style is layered after the company writing style and, in the prompt's own words, "overrides company defaults where they conflict." Because author_id is per-post, two posts in the same tenant can be written in two different voices.
Campaign addenda, merged in place
When a post is tagged into a campaign (covered below), the campaign's profile_addendum and writing_style_addendum are merged into the same company_profile and writing_style strings — appended under a ## Campaign-specific … sub-heading inside those fields, not emitted as separate top-level sections. The campaign's attached_context_paths are fetched and injected separately, each under a ## heading in an "Attached context (campaign-pinned)" block.
So the writer's user prompt, in resolver order, is: an optional campaign header, then company_profile (with any profile addendum folded in), industry, products, icp, usp, writing_style (with any writing-style addendum folded in), then the author byline + author style, then the campaign-pinned attached documents.
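The ordering can be sketched as a single assembly function. This is an illustration of the described order, not the code in contextResolver.js: the `assembleContext` name, the input shape, the campaign header text, the use of field names as H1 titles, and everything after "# Author:" are assumptions. Only the two "Campaign-specific" sub-headings and the "Attached context (campaign-pinned)" label come from the text.

```javascript
// Sketch of the resolver order described above; heading strings are mostly placeholders.
function assembleContext({ docs, campaign = null, author = null, attached = [] }) {
  const sections = [];
  if (campaign) sections.push(`# Campaign: ${campaign.name}`); // assumed header format

  const withAddendum = (text, heading, addendum) =>
    addendum ? `${text}\n\n## ${heading}\n${addendum}` : text;

  const fields = ["company_profile", "industry", "products", "icp", "usp", "writing_style"];
  for (const field of fields) {
    let text = docs[field];
    if (!text) continue; // a missing document is skipped, not fatal
    if (field === "company_profile") {
      text = withAddendum(text, "Campaign-specific context", campaign?.profile_addendum);
    }
    if (field === "writing_style") {
      text = withAddendum(text, "Campaign-specific writing style", campaign?.writing_style_addendum);
    }
    sections.push(`# ${field}\n${text}`);
  }

  if (author) {
    // The exact byline format is a guess; the source only shows "# Author: …".
    sections.push(`# Author: ${author.name} (${author.job_title})\n${author.writing_style}`);
  }
  if (attached.length > 0) {
    const pinned = attached.map((d) => `## ${d.path}\n${d.body}`).join("\n\n");
    sections.push(`# Attached context (campaign-pinned)\n${pinned}`);
  }
  return sections.join("\n\n");
}
```

One simplification to note: when a company document is missing, this sketch also drops its campaign addendum, and the real resolver may handle that case differently.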
Live search
On top of all that, the writer has a search_context tool (hybrid text + vector over the wiki) and is instructed to call it first, before drafting, to pull in anything relevant to the specific topic — prior published posts, and any wiki document beyond the fixed six. The fixed company docs are the always-on floor; search_context is how the writer reaches the rest of the wiki on demand, including the cms_post mirrors, which it uses as anti-references so it differentiates from what's already published rather than repeating it.
[!TIP] If generated posts get the audience or the value proposition wrong, fix /icp/overview.md and /usp/overview.md first; they're injected verbatim into every prompt. If the tone is off, fix /writing-style/voice.md, or set a per-post author with their own writing_style. You don't need to retrain anything; the next generation reads the updated documents.
Campaigns: scoped overrides on top of the wiki
A campaign is a time-bounded editorial container that groups posts and re-shapes the writer's context for that group. Campaigns are owned by the context module (context.campaigns) precisely because they change what the AI writes from. A campaign contributes three things on top of the company-level context:
| Field | Effect on the prompt |
|---|---|
| writing_style_addendum | Merged into writing_style under a "Campaign-specific writing style" sub-heading |
| profile_addendum | Merged into company_profile under a "Campaign-specific context" sub-heading |
| attached_context_paths | An array of wiki paths fetched and injected as pinned attached_documents |
When a post's campaign_id is set, the resolver merges these on top of the company documents — the campaign augments, it doesn't replace. Use it to push posts toward a launch, a season, or a sub-audience without rewriting the company-level wiki.
context_create_campaign {
name: "Q3 launch",
start_date: "2026-07-01",
end_date: "2026-09-30",
profile_addendum: "Lead with the new compliance module; de-emphasise pricing.",
writing_style_addendum: "Slightly more technical; assume the reader is an ops lead.",
attached_context_paths: ["/web/compliance-module.md", "/notes/launch-faq.md"]
}
Deleting a campaign (context_delete_campaign) is a soft-delete that unlinks its posts — the posts survive, their campaign_id is just cleared.
Search: hybrid text + vector
The wiki supports three search modes. The default and the one the writer uses is hybrid.
| Tool | Backing | Use when |
|---|---|---|
| context_search | Hybrid (text ∪ vector) | Default: general "find relevant docs for this query" |
| context_search_text | Postgres tsvector full-text | You need exact keyword matches; cheaper |
| context_search_vector | pgvector cosine similarity | Semantic / paraphrase matching |
How they work, concretely:
- Text queries a stored search_tsv column (a generated tsvector over title + body, simple config) with plainto_tsquery, so you don't need search operators. The simple config keeps it language-agnostic. Always available.
- Vector depends on embeddings. Every document write fires a background job (embedDocument) that chunks the markdown, embeds each chunk with openai/text-embedding-3-small (1536-dim), and stores the vectors in context.document_chunks. Vector search embeds your query and ranks chunks by cosine similarity. There's a language guard: a chunk whose dominant script doesn't match the query's is filtered out (so an English query won't surface an Arabic document just because they're semantically near). Chunks whose script is mixed or unknown (short, number-heavy, or multilingual) always pass, so the guard only drops clear script mismatches.
- Hybrid runs both and merges, vector-first (semantic), filling the remainder from text, deduped by document.
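The script guard and the hybrid merge can both be sketched in a few lines. The function names, the hit shape (a document_id per hit), the script labels other than mixed and unknown, and the default limit of 10 are assumptions. The sketch also lets a query with no clear script pass the guard, a case the description above does not cover.

```javascript
// Sketch of the language guard: only clear script mismatches are dropped.
function passesScriptGuard(queryScript, chunkScript) {
  if (chunkScript === "mixed" || chunkScript === "unknown") return true; // always pass
  if (queryScript === "mixed" || queryScript === "unknown") return true; // assumption, see lead-in
  return chunkScript === queryScript;
}

// Sketch of the hybrid merge: vector hits first, remainder from text, deduped by document.
function mergeHybrid(vectorHits, textHits, limit = 10) {
  const seen = new Set();
  const merged = [];
  for (const hit of [...vectorHits, ...textHits]) {
    if (merged.length >= limit) break;
    if (seen.has(hit.document_id)) continue; // one entry per document
    seen.add(hit.document_id);
    merged.push(hit);
  }
  return merged;
}
```

Because vector hits are consumed first, a document found by both paths keeps its vector-side entry, and text results only fill whatever room is left under the limit.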
[!WARNING] A document's embedding job is best-effort and swallows its failures (LLM down, partial-embed mismatch, insert error; all logged only). A document whose embed failed stays text-searchable but vector-invisible until it's re-embedded. So a write doesn't guarantee the doc is semantically searchable, only lexically. This is exactly the gap POST /embeddings/backfill exists to close.
[!NOTE] Vector search returns empty if no embeddings exist yet for the tenant. For a tenant that predates embedding, or after a database reset, run POST /embeddings/backfill to seed the chunks; it's synchronous, idempotent, and returns { scanned, embedded, failed, skipped }. One caveat: a single call scans a bounded page of the most-recently-updated documents (limit, default 200, max 500), not necessarily every document. A tenant with more than ~200 non-empty docs needs more than one call to cover everything.
Current limits — stated plainly
The vector path is basic for now. It reads up to 2,000 chunks for the tenant and computes cosine similarity in application code, not in the database. This is fine at MVP volumes and holds to roughly 5–10k chunks. Past that it becomes a hot loop, and the planned fix is a match_chunks SQL function using pgvector's <=> operator with an IVFFlat index. Until a tenant is large, you won't notice; it's a known ceiling, not a silent one.
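For reference, application-side ranking of this kind looks roughly like the sketch below. It is a generic cosine-similarity ranker under stated assumptions, not the service's code: the function names, the chunk shape (an embedding array per chunk), and the topK default are hypothetical; only the 2,000-chunk bound comes from the text.

```javascript
// Generic cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score a bounded set of chunks in application code and keep the best ones.
function rankChunks(queryEmbedding, chunks, { maxChunks = 2000, topK = 5 } = {}) {
  return chunks
    .slice(0, maxChunks) // the bounded read described above
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Every query scores every loaded chunk, so the cost grows linearly with chunk count and vector width, which is why the approach stops scaling once a tenant gets large.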
FAQ
Where do I edit these documents? In the dashboard's context section, or programmatically with context_patch_document (by id) and context_upsert_document (by path). They're plain markdown — except the virtual brand doc, which only accepts colour/font rows.
Can I delete the profile/ICP/USP documents? No — they're system documents (is_system: true) and deletion is always refused with a 403. You can edit their contents freely. Moving them (changing path/kind) is also meant to be refused, though that guard is bypassable today if you bundle a content edit — don't rely on it either way.
Which documents are injected into every post? The six fixed company docs (/profile/company.md, /profile/industry.md, /products/overview.md, /icp/overview.md, /usp/overview.md, /writing-style/voice.md), each under an # heading. On top of that: a per-post author byline if the post has an author_id, and any campaign addenda + attached docs if the post is tagged into a campaign. Everything else is reached on demand via search_context.
Do I need to re-generate or re-embed after editing a document? No. Every write re-extracts links and fires the embedder in the background, and the writer re-reads the documents on the next generation. There's no separate "publish the wiki" step. If an embed silently failed, run POST /embeddings/backfill to make the doc vector-searchable again.
The writer is ignoring something I put in the wiki — why? Check three things. First, is it at one of the six fixed paths? Only those are injected on every post; everything else is reached via search_context, so if it's not findable by search (no embeddings yet, a failed embed, or weak keyword overlap) the writer may not pull it. Second, if it's author-specific, confirm the post carries the right author_id. Third, if it's campaign-specific, confirm the post is tagged into the campaign and the path is in attached_context_paths.
What happens if the context service is down during generation? The resolver returns a partial context — each missing document is simply skipped, and the prompt rules tolerate empty fields. The post still generates; it's just less grounded. The fixed-path fetch and the published-post mirror are both best-effort by design.