
The context wiki: brand, ICP & voice — what the AI writes from

Every blog post Nukipa generates is grounded in a per-tenant knowledge base called the context wiki. It's where the company profile, ICPs, USPs, and writing voice live. Before the writer agent drafts anything, it reads from this wiki; if the wiki is thin or wrong, the output is thin or wrong. This article explains the model, how onboarding populates it, and exactly which documents the writer reads.

The document model

The context wiki is a path-addressed virtual filesystem of markdown documents, one per tenant. There are no folder rows in the database — folders are derived from the /-separated paths. A document at /profile/company.md implies a /profile/ folder; nothing else creates it.
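As a rough illustration of "folders are derived from paths", the sketch below computes the implied folder set from a list of document paths. The function name deriveFolders is hypothetical and is not part of the platform's API; it only shows the idea that no folder rows are needed.

```javascript
// Hypothetical helper: derive the implied folders from document paths.
// Nothing here is the platform's real code; it only illustrates the model.
function deriveFolders(paths) {
  const folders = new Set();
  for (const path of paths) {
    const parts = path.split("/").filter(Boolean); // e.g. ["profile", "company.md"]
    // Every prefix except the final segment (the file itself) is an implied folder.
    for (let i = 1; i < parts.length; i++) {
      folders.add("/" + parts.slice(0, i).join("/") + "/");
    }
  }
  return [...folders].sort();
}

console.log(deriveFolders(["/profile/company.md", "/cms/posts/hello.md"]));
// ["/cms/", "/cms/posts/", "/profile/"]
```

Deleting the last document under a prefix makes the folder disappear, because the folder was never stored anywhere.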

Each document row looks like this:

{
  id, tenant_id, campaign_id, path, slug, title, kind, is_system,
  source_url, original_path, original_mime,
  body, body_preview, metadata,
  created_at, updated_at
}

Rules worth internalising:

  • path is the natural key. It's unique per tenant (per campaign, where a campaign is set), /-prefixed, and ends in .md. Writing to the same path twice updates the document rather than creating a duplicate — that's why the write tool is called context_upsert_document.
  • body is plain markdown. No proprietary format. You can read and edit any document by hand.
  • Originals are preserved — but only HTML, today. When the onboarding crawler imports a page, the raw HTML is kept in the private context-originals storage bucket (per-tenant prefix). The markdown body is the working copy; the original HTML is the receipt. There is no PDF/DOCX ingestion path yet — POST /import/file returns 501. If you want a PDF or DOCX in the wiki, upload it to storage directly and write the markdown body yourself; nothing produces those originals automatically.
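The upsert rule can be sketched with an in-memory stand-in for the documents table. Everything here is an assumption made for illustration: the WikiStore class, the way the key is composed, and the integer ids are hypothetical, and the real context_upsert_document may validate and store things differently.

```javascript
// Hypothetical in-memory model of "path is the natural key".
class WikiStore {
  constructor() {
    this.docs = new Map();
    this.nextId = 1;
  }

  // Assumed key shape: tenant + campaign (empty when unset) + path.
  static key(tenantId, campaignId, path) {
    return [tenantId, campaignId ?? "", path].join("|");
  }

  upsert({ tenantId, campaignId = null, path, title, body }) {
    // Paths are /-prefixed and end in .md.
    if (!path.startsWith("/") || !path.endsWith(".md")) {
      throw new Error(`invalid path: ${path}`);
    }
    const k = WikiStore.key(tenantId, campaignId, path);
    const now = new Date().toISOString();
    const existing = this.docs.get(k);
    if (existing) {
      // Same path again: update in place, keeping id and created_at.
      Object.assign(existing, { title, body, updated_at: now });
      return existing;
    }
    const doc = {
      id: this.nextId++, tenant_id: tenantId, campaign_id: campaignId,
      path, title, body, created_at: now, updated_at: now,
    };
    this.docs.set(k, doc);
    return doc;
  }
}
```

Writing /icp/overview.md twice for the same tenant leaves one row with the second body; writing it for another tenant creates a separate row.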

Kinds

Every document has a kind. It's a loose classifier — search and the writer use it to filter — not a rigid schema. The current set:

| kind | Typical path | What it holds |
| --- | --- | --- |
| profile | /profile/company.md | What the company does, who it serves, how it positions itself |
| industry | /profile/industry.md | Primary sector, broader category, adjacent industries |
| product | /products/overview.md | Catalogue: per-product name, function, audience, features |
| icp | /icp/overview.md | 1–3 ideal customer profiles |
| usp | /usp/overview.md | 3–6 unique selling points |
| writing_style | /writing-style/voice.md | Tone, vocabulary, rhythm, taboos |
| competitor | /competitors/<slug>.md | Confirmed competitors to track |
| page | /web/<slug>.md | Raw crawled site pages |
| cms_post | /cms/posts/<slug>.md | Mirror of a published blog post (see below) |
| note, doc, pdf | anywhere | Free-form additions you upload or write |
| system | | Internal |

Onboarding also writes two social-channel documents — /social/voice.md (a LinkedIn-tuned voice, often punchier than long-form) and /social/visual-style.md (mood, colour direction, typography for generated graphics, seeded with the detected brand hex codes). When a blog post is published, the CMS mirrors a flattened copy into the wiki at /cms/posts/<slug>.md (kind cms_post) so the writer can answer "have we already covered this?" without hitting the CMS database.

The brand document is virtual

/settings/brand.md looks like a normal document, but it is not stored in context.documents. It is synthesised on read from the tenant's brand_theme_* settings and parsed back into those settings on write. It carries kind: 'brand', is_system: true, and a virtual: true flag, and its id is the same fixed UUID for every tenant. It has no original file.

Because it's a projection, the generic document rules above do not apply to it:

  • Editing the body only accepts colour rows of the form `- **Primary**: #1f2937` (also Accent, Background, Text) and a `Font family` line. Those values flow into the brand-theme override, per field.
  • Anything else in the body — a custom title, narrative paragraphs — is discarded on save. A body with no recognisable hex rows is rejected with a 400.
  • You cannot move, recategorise, or delete it. PATCH attempts on its path/kind are refused.
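The write-side behaviour described above can be sketched as a small parser. This is an assumption-heavy illustration: parseBrandBody, the returned field names, the six-digit hex requirement, and the exact shape of the Font family row are all hypothetical, and the real parser may accept more or less than this.

```javascript
// Hypothetical sketch of parsing a brand document body back into theme values.
const COLOUR_FIELDS = ["Primary", "Accent", "Background", "Text"];

function parseBrandBody(body) {
  const theme = {};
  for (const line of body.split("\n")) {
    // Colour rows look like "- **Primary**: #1f2937" (six-digit hex assumed).
    const colour = line.match(/^\s*-\s*\*\*(\w+)\*\*:\s*(#[0-9a-fA-F]{6})\s*$/);
    if (colour && COLOUR_FIELDS.includes(colour[1])) {
      theme[colour[1].toLowerCase()] = colour[2].toLowerCase();
      continue;
    }
    // Font row, assumed to look like "- **Font family**: Inter".
    const font = line.match(/^\s*-?\s*\*{0,2}Font family\*{0,2}:\s*(.+?)\s*$/);
    if (font) theme.fontFamily = font[1];
    // Any other line (custom title, narrative text) is silently dropped.
  }
  const hasColour = COLOUR_FIELDS.some((f) => theme[f.toLowerCase()] !== undefined);
  if (!hasColour) {
    return { status: 400, error: "no recognisable hex rows" };
  }
  return { status: 200, theme };
}
```

The point of the sketch is the asymmetry: recognised rows survive a save, everything else is discarded, and a body with no colour rows at all is rejected rather than stored.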

[!NOTE] The brand theme has two surfaces that write to the same place: the /settings/brand.md body and the dashboard's Brand theme tab (also reachable via context_set_brand_theme). Editing colours in either is equivalent. For anything other than colours and the font, use the structured tools — the markdown body has nowhere to store it.

System documents

Documents written by the onboarding pipeline carry is_system: true. They're the profile/industry/ICP/USP/voice set the writer depends on, so they're partly protected:

  • You can edit their body, title, and metadata — by hand in the dashboard or via context_patch_document.
  • You cannot delete them. Deletion is unconditionally refused with a 403.
  • Moving them (changing path or kind) is refused with a 403 — but only when the patch contains nothing else. A patch that also includes a body, title, or metadata change skips that guard and the move succeeds.

[!WARNING] The "system documents can't be moved" protection is partial. The 403 fires only on a path/kind-only patch; bundling any content edit alongside the move bypasses it (patchDocument, services/context/src/modules/documents/service.js). This is a code gap, not a designed escape hatch — don't rely on either behaviour. Deletion, by contrast, is always refused.
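To make the gap concrete, here is a hypothetical model of the guard as described, plus a caller-side helper that avoids depending on it. This is not the code in patchDocument; guardSystemPatch and stripMoveKeys are illustrative names, and only the described behaviour (403 on a path/kind-only patch, pass-through otherwise) is taken from the text above.

```javascript
// Hypothetical model of the described behaviour, not the real patchDocument.
const MOVE_KEYS = ["path", "kind"];
const CONTENT_KEYS = ["body", "title", "metadata"];

function guardSystemPatch(doc, patch) {
  if (!doc.is_system) return { status: 200 };
  const keys = Object.keys(patch);
  const isMove = keys.some((k) => MOVE_KEYS.includes(k));
  const hasContent = keys.some((k) => CONTENT_KEYS.includes(k));
  // The guard fires only when the patch is a pure move.
  if (isMove && !hasContent) return { status: 403 };
  return { status: 200 };
}

// Caller-side safety: drop path/kind before patching a system document,
// so the outcome does not depend on whether the gap is still open.
function stripMoveKeys(patch) {
  return Object.fromEntries(
    Object.entries(patch).filter(([k]) => !MOVE_KEYS.includes(k))
  );
}
```

Stripping the move keys on the client side gives the same result whether or not the server-side gap is later closed.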

Linking documents

Documents can reference each other, and links are extracted from the body on every write. Two forms:

See our [main ICP](nukipa:///icp/overview.md) for details.
Or the wiki form: [[/icp/overview.md|main ICP]]

Link rows are a derived projection — rebuilt every upsert/patch. If the target exists, the link resolves to its id; if not, the unresolved path is retained so it resolves later when the target is created. context_document_links returns both outbound and inbound links, which is how you navigate the wiki as a graph.
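A minimal sketch of link extraction and resolution, assuming the two syntaxes shown above are the only ones. The function names, the returned field names (target_path, target_id, label), and the choice to make the wiki-form label optional are assumptions for illustration, not the service's actual schema.

```javascript
// Hypothetical extractor for the two link forms shown above.
function extractLinks(body) {
  const links = [];
  // Markdown form: [label](nukipa:///icp/overview.md)
  const mdForm = /\[([^\]]*)\]\(nukipa:\/\/(\/[^)\s]+)\)/g;
  // Wiki form: [[/icp/overview.md|label]] (label treated as optional here)
  const wikiForm = /\[\[(\/[^\]|]+)(?:\|([^\]]*))?\]\]/g;
  for (const m of body.matchAll(mdForm)) {
    links.push({ target_path: m[2], label: m[1] });
  }
  for (const m of body.matchAll(wikiForm)) {
    links.push({ target_path: m[1], label: m[2] ?? null });
  }
  return links;
}

// Resolve against the known documents; unresolved links keep their path
// and a null id so they can resolve once the target is created.
function resolveLinks(links, idByPath) {
  return links.map((l) => ({ ...l, target_id: idByPath.get(l.target_path) ?? null }));
}
```

Because the link rows are rebuilt from the body on every write, removing a link from the markdown is enough to remove it from the graph.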

Onboarding: how the wiki gets populated

You don't build the wiki by hand from scratch. Onboarding does a first pass from the company's own website. It's a fire-and-forget background pipeline kicked off by a single call:

context_onboarding_start { company_url: "https://acme.com" }

Then poll context_onboarding_status to track progress. It returns the tenant's latest onboarding run, not specifically the run you just started; when only one onboarding is in flight (the normal case), the two are the same.

The pipeline runs these phases in order, each writing one or more system documents:

  1. crawling — crawl the site, up to 80 pages, each saved as a /web/<slug>.md document (kind page) with the original HTML stored. The homepage is detected as the shortest source URL (robust against www/locale-path redirects); nav, footer, logo, and hero image are extracted from its HTML onto the document's metadata.
  2. profile → /profile/company.md
  3. industry → /profile/industry.md
  4. brand — infers colours and typography from the homepage HTML into the virtual brand doc. Best-effort, never fatal.
  5. products → /products/overview.md
  6. icp → /icp/overview.md
  7. usp → /usp/overview.md
  8. writing_style → /writing-style/voice.md
  9. social → /social/voice.md and /social/visual-style.md
  10. competitors — LLM-driven discovery; up to 3 suggestions land in the run result for you to confirm separately via context_confirm_competitors. Nothing is written to /competitors/ until you confirm.
  11. complete
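The homepage heuristic from the crawling phase ("the shortest source URL") is simple enough to sketch. The function name pickHomepage and the tie-breaking rule (first seen wins) are assumptions; the crawler's real implementation may normalise URLs before comparing.

```javascript
// Hypothetical sketch: pick the crawled page with the shortest source URL.
// pages: [{ source_url, path }]. Ties keep the first page seen (an assumption).
function pickHomepage(pages) {
  return pages.reduce(
    (best, page) =>
      best === null || page.source_url.length < best.source_url.length ? page : best,
    null
  );
}
```

If a site redirects https://acme.com to https://www.acme.com/en/, the redirected root is still shorter than any deeper page such as https://www.acme.com/en/pricing, which is why the heuristic survives www and locale-path redirects.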

Each text artifact is one constrained LLM call against an aggregated, ~40k-character extract of the crawled pages (model gpt-4.1-mini, temperature 0.4, max 1500 tokens). The prompts are deliberately tight — a skimmable wiki page, not an essay.

[!WARNING] The crawl is the one place onboarding can hard-fail. If the crawler fetches zero pages, the run aborts with status: 'failed' and the error 'no pages could be fetched from the URL'. Every later phase is best-effort and never aborts the run. If onboarding fails outright, check the URL is reachable and not blocking crawlers.

[!WARNING] When the LLM is unreachable, several phases produce nothing rather than failing. If no LLM provider is configured (or the LLM service is down), each text artifact is written with a placeholder body ("_Generated context will appear here…_"), and under the same condition the competitors phase yields zero suggestions and brand detection silently no-ops. Each text artifact records the source in its metadata as generated_by: 'llm' or generated_by: 'placeholder' — that flag is the reliable, machine-readable signal that a phase ran without a model. Re-run onboarding once a provider is configured, or write the documents yourself.
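Since generated_by is the machine-readable signal, a post-onboarding check can be as small as the sketch below. It assumes you already have the document rows in hand (for example from a listing call); findPlaceholderDocs is a hypothetical helper, not a platform tool.

```javascript
// Hypothetical check: list documents that were written without a model.
function findPlaceholderDocs(docs) {
  return docs
    .filter((d) => d.metadata && d.metadata.generated_by === "placeholder")
    .map((d) => d.path);
}
```

A non-empty result means those paths hold placeholder bodies and need either a re-run of onboarding or a hand-written replacement.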

[!NOTE] Competitor discovery returns an empty list (not an error) on several paths beyond "LLM down": empty model output, unparseable JSON, or when every candidate scores below confidence 0.7. An empty competitors step is normal and non-fatal.
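The empty-list paths can be sketched as one defensive parsing function. The candidate shape ({ name, confidence }), treating exactly 0.7 as passing, and sorting by confidence before taking the top three are assumptions made for this sketch; only the threshold, the cap of three, and the "return empty, never throw" behaviour come from the text above.

```javascript
// Hypothetical sketch of turning raw model output into competitor suggestions.
const MIN_CONFIDENCE = 0.7;
const MAX_SUGGESTIONS = 3;

function parseCompetitorSuggestions(rawModelOutput) {
  // Empty model output: nothing to suggest.
  if (!rawModelOutput || !rawModelOutput.trim()) return [];
  let candidates;
  try {
    candidates = JSON.parse(rawModelOutput);
  } catch {
    return []; // unparseable JSON is not an error, just an empty step
  }
  if (!Array.isArray(candidates)) return [];
  return candidates
    .filter((c) => c && typeof c.confidence === "number" && c.confidence >= MIN_CONFIDENCE)
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, MAX_SUGGESTIONS);
}
```

Every failure mode collapses to the same empty array, which is why an empty competitors step on its own tells you nothing about which path was taken.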

This is a starting point, not the finish line

Onboarding gets you a usable wiki from public marketing copy. It cannot know what your website doesn't say — internal positioning, the segment you're actually pivoting toward, the phrases your founder refuses to use. Read the generated documents and edit them. They are the single biggest lever on output quality, and they're just markdown. Add documents too: context_import_url pulls a single page into /web/<slug>.md (overridable with target_path), and context_upsert_document writes anything you want the writer to know.

What the writer actually reads

When the CMS writer generates a post, the context resolver (services/cms/src/lib/contextResolver.js) assembles the prompt from several surfaces. They split into a fixed company layer, a per-post author layer, and an on-demand search layer.

Fixed company documents

Six well-known paths are fetched by exact path (each independently; a missing one is skipped, not fatal):

| Field in prompt | Path |
| --- | --- |
| company_profile | /profile/company.md |
| industry | /profile/industry.md |
| products | /products/overview.md |
| icp | /icp/overview.md |
| usp | /usp/overview.md |
| writing_style | /writing-style/voice.md |

Each present document is pasted into the user prompt under its own # (H1) heading. This is the baseline grounding on every post.

The per-post author byline

There is a per-post override layer. A post can carry an author_id (cms.posts.author_id); when it does, the resolver pulls that author's name, job_title, and writing_style and injects a # Author: … block. The author's writing style is layered after the company writing style and, in the prompt's own words, "overrides company defaults where they conflict." Because author_id is per-post, two posts in the same tenant can be written in two different voices.

Campaign addenda, merged in place

When a post is tagged into a campaign (covered below), the campaign's profile_addendum and writing_style_addendum are merged into the same company_profile and writing_style strings — appended under a ## Campaign-specific … sub-heading inside those fields, not emitted as separate top-level sections. The campaign's attached_context_paths are fetched and injected separately, each under a ## heading in an "Attached context (campaign-pinned)" block.

So the writer's user prompt, in resolver order, is: an optional campaign header, then company_profile (with any profile addendum folded in), industry, products, icp, usp, writing_style (with any writing-style addendum folded in), then the author byline + author style, then the campaign-pinned attached documents.
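The ordering above can be sketched as a single assembly function. This is not the code in contextResolver.js: buildUserPrompt, the shape of the ctx object, and the exact heading strings (for example the campaign header and the author line) are assumptions. Only the order of sections, the in-place merge of addenda, and the skipping of missing documents follow the description.

```javascript
// Hypothetical sketch of the resolver's section order.
function buildUserPrompt(ctx) {
  const sections = [];
  const campaign = ctx.campaign ?? null;
  if (campaign) sections.push(`# Campaign: ${campaign.name}`); // header wording assumed

  // Fold a campaign addendum into a company field under a ## sub-heading.
  const withAddendum = (base, addendum, heading) =>
    addendum ? [base, `## ${heading}`, addendum].filter(Boolean).join("\n\n") : base;

  const fields = {
    company_profile: withAddendum(ctx.company_profile, campaign?.profile_addendum, "Campaign-specific context"),
    industry: ctx.industry,
    products: ctx.products,
    icp: ctx.icp,
    usp: ctx.usp,
    writing_style: withAddendum(ctx.writing_style, campaign?.writing_style_addendum, "Campaign-specific writing style"),
  };
  for (const [name, text] of Object.entries(fields)) {
    if (text) sections.push(`# ${name}\n\n${text}`); // a missing document is skipped
  }

  if (ctx.author) {
    const { name, job_title, writing_style } = ctx.author;
    sections.push([`# Author: ${name}, ${job_title}`, writing_style].filter(Boolean).join("\n\n"));
  }

  const attached = ctx.attached_documents ?? [];
  if (attached.length > 0) {
    const docs = attached.map((d) => `## ${d.path}\n\n${d.body}`);
    sections.push(["# Attached context (campaign-pinned)", ...docs].join("\n\n"));
  }
  return sections.join("\n\n");
}
```

Placing the author style after the company writing style is what lets it override company defaults where they conflict.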

Live search

On top of all that, the writer has a search_context tool (hybrid text + vector over the wiki) and is instructed to call it first, before drafting, to pull in anything relevant to the specific topic — prior published posts, and any wiki document beyond the fixed six. The fixed company docs are the always-on floor; search_context is how the writer reaches the rest of the wiki on demand, including the cms_post mirrors, which it uses as anti-references so it differentiates from what's already published rather than repeating it.

[!TIP] If generated posts get the audience or the value proposition wrong, fix /icp/overview.md and /usp/overview.md first — they're injected verbatim into every prompt. If the tone is off, fix /writing-style/voice.md, or set a per-post author with their own writing_style. You don't need to retrain anything; the next generation reads the updated documents.

Campaigns: scoped overrides on top of the wiki

A campaign is a time-bounded editorial container that groups posts and re-shapes the writer's context for that group. Campaigns are owned by the context module (context.campaigns) precisely because they change what the AI writes from. A campaign contributes three things on top of the company-level context:

| Field | Effect on the prompt |
| --- | --- |
| writing_style_addendum | Merged into writing_style under a "Campaign-specific writing style" sub-heading |
| profile_addendum | Merged into company_profile under a "Campaign-specific context" sub-heading |
| attached_context_paths | An array of wiki paths fetched and injected as pinned attached_documents |

When a post's campaign_id is set, the resolver merges these on top of the company documents — the campaign augments, it doesn't replace. Use it to push posts toward a launch, a season, or a sub-audience without rewriting the company-level wiki.

context_create_campaign {
  name: "Q3 launch",
  start_date: "2026-07-01",
  end_date: "2026-09-30",
  profile_addendum: "Lead with the new compliance module; de-emphasise pricing.",
  writing_style_addendum: "Slightly more technical; assume the reader is an ops lead.",
  attached_context_paths: ["/web/compliance-module.md", "/notes/launch-faq.md"]
}

Deleting a campaign (context_delete_campaign) is a soft-delete that unlinks its posts — the posts survive, their campaign_id is just cleared.

Search: hybrid text + vector

The wiki supports three search modes. The default and the one the writer uses is hybrid.

| Tool | Backing | Use when |
| --- | --- | --- |
| context_search | Hybrid (text ∪ vector) | Default: general "find relevant docs for this query" |
| context_search_text | Postgres tsvector full-text | You need exact keyword matches; cheaper |
| context_search_vector | pgvector cosine similarity | Semantic / paraphrase matching |

How they work, concretely:

  • Text queries a stored search_tsv column (a generated tsvector over title + body, simple config) with plainto_tsquery, so you don't need search operators. The simple config keeps it language-agnostic. Always available.
  • Vector depends on embeddings. Every document write fires a background job (embedDocument) that chunks the markdown, embeds each chunk with openai/text-embedding-3-small (1536-dim), and stores the vectors in context.document_chunks. Vector search embeds your query and ranks chunks by cosine similarity. There's a language guard: a chunk whose dominant script doesn't match the query's is filtered out (so an English query won't surface an Arabic document just because they're semantically near). Chunks whose script is mixed or unknown — short, number-heavy, or multilingual — always pass, so the guard only drops clear script mismatches.
  • Hybrid runs both and merges, vector-first (semantic), filling the remainder from text, deduped by document.
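The hybrid merge rule (vector-first, fill from text, dedupe by document) can be sketched as below. The hit shape ({ document_id, ... }), the mergeHybrid name, and the default limit are assumptions for illustration; the service's actual merge may weight or score results differently.

```javascript
// Hypothetical sketch of the hybrid merge: vector hits first, text hits fill
// the remainder, and each document appears at most once.
function mergeHybrid(vectorHits, textHits, limit = 10) {
  const seen = new Set();
  const merged = [];
  // Vector hits are chunk-level, so several may point at the same document.
  for (const hit of [...vectorHits, ...textHits]) {
    if (merged.length >= limit) break;
    if (seen.has(hit.document_id)) continue;
    seen.add(hit.document_id);
    merged.push(hit);
  }
  return merged;
}
```

With no embeddings for the tenant, vectorHits is empty and the merge degrades to plain text search, which matches the "text is always available" guarantee.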

[!WARNING] A document's embedding job is best-effort and swallows its failures (LLM down, partial-embed mismatch, insert error — all logged only). A document whose embed failed stays text-searchable but vector-invisible until it's re-embedded. So a write doesn't guarantee the doc is semantically searchable, only lexically. This is exactly the gap POST /embeddings/backfill exists to close.

[!NOTE] Vector search returns empty if no embeddings exist yet for the tenant. For a tenant that predates embedding — or after a database reset — run POST /embeddings/backfill to seed the chunks; it's synchronous, idempotent, and returns { scanned, embedded, failed, skipped }. One caveat: a single call scans a bounded page of the most-recently-updated documents (limit, default 200, max 500), not necessarily every document. A tenant with more than ~200 non-empty docs needs more than one call to cover everything.

Current limits — stated plainly

The vector path is basic for now. It reads up to 2,000 chunks for the tenant and computes cosine similarity in application code, not in the database. This is fine at MVP volumes and holds to roughly 5–10k chunks. Past that it becomes a hot loop, and the planned fix is a match_chunks SQL function using pgvector's <=> operator with an IVFFlat index. Until a tenant is large, you won't notice; it's a known ceiling, not a silent one.
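For a sense of what "cosine similarity in application code" means, here is a minimal sketch of the loop. The function names, the chunk shape ({ embedding, ... }), and the topK parameter are assumptions; the 2,000-chunk cap is the only number taken from the text above.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical ranking pass: score at most maxChunks chunks, keep the best topK.
function rankChunks(queryEmbedding, chunks, { maxChunks = 2000, topK = 5 } = {}) {
  return chunks
    .slice(0, maxChunks)
    .map((c) => ({ ...c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Each query touches every loaded chunk once, so the cost grows linearly with chunk count; that linear scan is the hot loop a database-side index is meant to replace.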

FAQ

Where do I edit these documents?
In the dashboard's context section, or programmatically with context_patch_document (by id) and context_upsert_document (by path). They're plain markdown, except the virtual brand doc, which only accepts colour/font rows.

Can I delete the profile/ICP/USP documents?
No. They're system documents (is_system: true) and deletion is always refused with a 403. You can edit their contents freely. Moving them (changing path/kind) is also meant to be refused, though that guard is bypassable today if you bundle a content edit; don't rely on it either way.

Which documents are injected into every post?
The six fixed company docs (/profile/company.md, /profile/industry.md, /products/overview.md, /icp/overview.md, /usp/overview.md, /writing-style/voice.md), each under an # heading. On top of that: a per-post author byline if the post has an author_id, and any campaign addenda + attached docs if the post is tagged into a campaign. Everything else is reached on demand via search_context.

Do I need to re-generate or re-embed after editing a document?
No. Every write re-extracts links and fires the embedder in the background, and the writer re-reads the documents on the next generation. There's no separate "publish the wiki" step. If an embed silently failed, run POST /embeddings/backfill to make the doc vector-searchable again.

The writer is ignoring something I put in the wiki. Why?
Check three things. First, is it at one of the six fixed paths? Only those are injected on every post; everything else is reached via search_context, so if it's not findable by search (no embeddings yet, a failed embed, or weak keyword overlap) the writer may not pull it. Second, if it's author-specific, confirm the post carries the right author_id. Third, if it's campaign-specific, confirm the post is tagged into the campaign and the path is in attached_context_paths.

What happens if the context service is down during generation?
The resolver returns a partial context: each missing document is simply skipped, and the prompt rules tolerate empty fields. The post still generates; it's just less grounded. The fixed-path fetch and the published-post mirror are both best-effort by design.
