Troubleshooting & gotchas
This is the cross-cutting failure-mode reference for a Nukipa-managed site: the things that break at the seams between your site, the gateway, and the platform services. Each section names the symptom, the cause in the actual request path, and the fix. Where the platform is thin, it says so.
The shape to keep in your head: your site → the Nukipa gateway (/public/v1/*) → the owning service (CMS, signals, newsletters, …). Almost every tenant-resolution and rate-limit problem lives in that first hop.
Host resolution: 404s and empty content
Every public read and write resolves the tenant from one input: the visitor host. Your site forwards it; the gateway passes it through as X-Forwarded-Host; the owning service runs resolveTenantByHost(host). If that host doesn't map to a tenant, you get a 404 (single post / form) or an empty list (/posts, /folders) — never another tenant's content.
The resolver tries two things, in order:
- Custom domain: a lowercased, port-stripped match against a tenant_domains row that has verified_at set. A row that exists but isn't verified yet does not match.
- Subdomain on a platform host: <slug>.<platform> (e.g. acme.nukipa.com), where <slug> matches tenants.slug. The platform-host list comes from PUBLIC_PLATFORM_HOSTS (defaults to nukipa.com, localhost).
If neither matches, the resolver returns null and the caller decides 404-vs-empty.
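The two-step order above can be sketched as follows. This is a minimal illustration, assuming in-memory arrays in place of the real tenant_domains and tenants tables; the type names, the normalizeHost helper, and the sample data are hypothetical, and only the ordering, the verified_at requirement, and the single-label rule come from the description above.

```typescript
// Hypothetical in-memory stand-ins for the tenant_domains and tenants tables.
type DomainRow = { domain: string; tenantId: string; verifiedAt: string | null };
type Tenant = { id: string; slug: string };

// Documented default of PUBLIC_PLATFORM_HOSTS.
const PLATFORM_HOSTS = ["nukipa.com", "localhost"];

// Lowercase the host and strip a trailing :port.
function normalizeHost(host: string): string {
  return host.trim().toLowerCase().replace(/:\d+$/, "");
}

function resolveTenantByHost(host: string, domains: DomainRow[], tenants: Tenant[]): Tenant | null {
  const h = normalizeHost(host);
  if (!h) return null;

  // 1. Custom domain: exact match on a verified row only.
  const row = domains.find((d) => d.domain === h && d.verifiedAt !== null);
  if (row) return tenants.find((t) => t.id === row.tenantId) ?? null;

  // 2. Subdomain on a platform host: <slug>.<platform>, single label only.
  for (const platform of PLATFORM_HOSTS) {
    const suffix = "." + platform;
    if (!h.endsWith(suffix)) continue;
    const slug = h.slice(0, -suffix.length);
    if (!slug || slug.includes(".")) continue; // multi-label slugs are skipped
    const tenant = tenants.find((t) => t.slug === slug);
    if (tenant) return tenant;
  }

  return null; // the caller decides between 404 and an empty list
}
```

With one tenant whose slug is acme and a verified example.com row, both Example.com:443 and acme.nukipa.com resolve to it, while an unverified domain or blog.acme.nukipa.com returns null.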
[!WARNING] There are three copies of this resolver and they don't all agree. The shared resolver (@nukipa/shared, used by the signals visit-ingest + confirm path) toggles www.: it builds the candidate set [host, apex, www.apex] and matches any verified row, so www.example.com and example.com resolve to the same tenant there. The CMS resolver (reads, gating, forms) and the newsletters resolver do exact-match only (.eq('domain', hostname)); on those paths www.example.com and example.com are different hosts. If your site serves both and you rely on CMS reads, register both as verified tenant_domains rows, or redirect one to the other at the edge before the request reaches Nukipa.
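To make the disagreement concrete, here is a small sketch of the two matching styles side by side. The function names are hypothetical, and "apex" is approximated by stripping a leading www. label, which is an assumption about how the shared resolver derives it.

```typescript
function stripPortAndCase(host: string): string {
  return host.trim().toLowerCase().replace(/:\d+$/, "");
}

// Shared-resolver style: try host, apex, and www.apex (duplicates removed).
function hostCandidates(host: string): string[] {
  const h = stripPortAndCase(host);
  const apex = h.startsWith("www.") ? h.slice(4) : h;
  return Array.from(new Set([h, apex, "www." + apex]));
}

function matchesWithWwwToggle(host: string, verifiedDomains: string[]): boolean {
  return hostCandidates(host).some((c) => verifiedDomains.includes(c));
}

// CMS / newsletters style: exact match only.
function matchesExact(host: string, verifiedDomains: string[]): boolean {
  return verifiedDomains.includes(stripPortAndCase(host));
}
```

With only example.com verified, matchesWithWwwToggle("www.example.com", ...) succeeds while matchesExact("www.example.com", ...) does not, which is the split a visitor would see between signals and CMS reads.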
How the host actually travels:
| Layer | What sets the host | Gotcha |
|---|---|---|
| Your site (SDK) | getHost() reads inbound x-forwarded-host, then host, then NUKIPA_TENANT_HOST | If getHost() returns '', the gateway falls back to its own host → wrong tenant or 404 |
| Gateway | visitorHost(req) = x-forwarded-host ∥ host | req.get('host') is the literal Host header even behind a proxy, so the gateway does the X-Forwarded-Host check explicitly |
| Service | resolveTenantByHost(X-Forwarded-Host) | Unverified custom domain or multi-label subdomain → no match (on CMS/shared; see below) |
Symptom → cause:
- Live custom domain 404s, staging subdomain works. The tenant_domains row has no verified_at set. Until DNS verifies, only the <slug>.<platform> subdomain resolves.
- Everything is empty in build/edge contexts but fine in the browser. Build-time and edge renders have no inbound request headers. That's exactly what NUKIPA_TENANT_HOST is for: it's the fallback host when x-forwarded-host / host are absent.
- A multi-label subdomain (blog.acme.nukipa.com) won't resolve. On the CMS and shared resolvers the subdomain branch skips any slug containing a dot (slug.includes('.') → continue), so only a single label resolves. The newsletters resolver omits that guard, but a multi-label slug still won't match a real tenants.slug, so the practical result is the same (no resolution). Either way: use a single-label subdomain.
[!NOTE]
NUKIPA_TENANT_HOST is a fallback, not an override. The SDK resolver is ordered visitor-host-first on purpose: when a custom domain is live, the visitor host (example.com) must win so the gateway can return that domain's per-domain google_verification_token. Forcing NUKIPA_TENANT_HOST (your <slug>.sites.… host) routes through the subdomain branch, which doesn't carry the token, and Google Search Console verification stalls at "verification token could not be found." Set NUKIPA_TENANT_HOST to your platform subdomain and leave it as the fallback.
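The visitor-host-first ordering can be sketched like this. It is not the SDK's getHost() source; the HeaderBag type, the explicit env parameter, and taking the first entry of a comma-separated x-forwarded-host are assumptions made for the illustration.

```typescript
type HeaderBag = Record<string, string | undefined>;

// Documented order: x-forwarded-host, then host, then the NUKIPA_TENANT_HOST fallback.
function getHost(headers: HeaderBag, env: Record<string, string | undefined>): string {
  // Assumption: if several proxies appended values, the first one is the visitor host.
  const forwarded = headers["x-forwarded-host"]?.split(",")[0]?.trim();
  const direct = headers["host"]?.trim();
  return forwarded || direct || env["NUKIPA_TENANT_HOST"] || "";
}
```

In a build or edge render the header bag is empty, so the env value is returned; on a live request the visitor host wins even when the env var is set. An empty string is the case to watch for, since that is when the gateway falls back to its own host.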
Why the gateway relay is mandatory
Don't call services directly, and don't fetch('/public/v1/*') by hand from your site — go through the SDK (getNukipaClient() / getMiddlewareClient(req)). The relay isn't decoration; the services depend on header normalisation the gateway performs:
- It sets X-Forwarded-Host (tenant resolution), X-Forwarded-For (visitor IP / fingerprint), and forwards User-Agent + Referer so signals classifies the real browser, not the gateway.
- It threads visitor identity for gating (X-Visitor-Email, and X-Visitor-IP-Hash if the upstream set it) and version pinning (X-Nukipa-Site-Version) through verbatim.
- For the Resend delivery webhook it verifies the svix signature at the gateway, so the RESEND_WEBHOOK_SECRET never leaves the gateway. The Post for Me callback is different: the gateway forwards the Authorization: Bearer header verbatim and does not check it; that bearer is verified downstream in the social service against POSTFORME_WEBHOOK_SECRET. So "signing secret stays at the gateway" is true for Resend, not for Post for Me.
The CMS/signals public routes are mounted before internal auth and self-resolve the tenant from request data — so a raw call that omits these headers won't 401, it'll just silently resolve the wrong tenant (or none) and drop your data. That failure is invisible until you notice missing visits or unattributed leads.
[!TIP] One known IP-loss trap is already handled for you: serverless runtimes strip X-Forwarded-For on outbound calls, so recordVisit injects visitor_ip into the request body (signals prefers body.visitor_ip over the header). If you reimplement visit ingest with raw fetch, you lose the Visitors KPI (the visitor_fingerprint column stays NULL and the dashboard shows a dash placeholder). Use the SDK.
Rate limits and shared-IP collapse
The gateway runs three buckets, all per-IP over a 60-second window:
| Bucket | Limit (per IP / min) | Applies to | Env override |
|---|---|---|---|
| submitLimiter | 2000 | form submit, inline contact-form submit, audit run/gate, newsletter subscribe, ingestion webhooks, Post for Me webhook | PUBLIC_SUBMIT_RATE_LIMIT |
| visitIngestLimiter | 1200 (~20/s) | visit ingest, visit confirm, CTA clicks, Resend webhook | none (code constant) |
| readLimiter | unlimited (no-op passthrough) | all blog/tenant/folder/event/data reads | none |
The thing to understand: behind a CDN or SSR layer, the gateway's req.ip is the proxy egress IP, not the visitor's. Every tenant's traffic collapses onto a small set of Vercel/Cloudflare function outbound IPs. A naive per-IP counter would 429 legitimate cross-tenant load constantly — which is why:
- Reads are intentionally unlimited. The read limiter is a no-op. Reads are anonymous, cheap, idempotent public data; there's nothing to protect against in a per-IP express counter. If real abuse protection is ever needed it belongs at the edge (Cloudflare / Vercel WAF), not here.
- The submit bucket is deliberately huge (2000/min) for the same reason: it has to absorb every tenant sharing one egress IP, including long-running audit polls. It is not a per-visitor anti-abuse limit.
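The shared-IP collapse is easier to see with a toy limiter. The class below is a hand-rolled fixed-window counter written for this illustration; it is not the gateway's limiter, and the limit of 3 and the IP address are hypothetical values chosen to keep the example short.

```typescript
// Toy fixed-window, per-IP counter (illustration only).
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if it would be answered with 429.
  allow(ip: string, now: number): boolean {
    const entry = this.counts.get(ip);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.counts.set(ip, { windowStart: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

// Two tenants behind the same proxy egress IP drain one bucket.
const limiter = new FixedWindowLimiter(3);
const egressIp = "203.0.113.7"; // documentation-range address
const results = [
  limiter.allow(egressIp, 0), // tenant A
  limiter.allow(egressIp, 10), // tenant B
  limiter.allow(egressIp, 20), // tenant A
  limiter.allow(egressIp, 30), // tenant B: rejected, the shared bucket is spent
];
```

Tenant B's second request is rejected even though it has only sent two, because the counter is keyed on the egress IP and cannot tell the tenants apart. That is the reason the real submit limit is sized for the whole platform rather than for one visitor.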
[!NOTE] The gateway's req.ip only reflects the egress IP because the gateway app sets Express trust proxy. The public routes file documents this in comments but doesn't set it (the app.set('trust proxy', …) lives in the gateway bootstrap); the reasoning above holds regardless of the exact value.
Symptom → cause:
- 429 {"error":{"code":"too_many"}} on form submits under load. You've hit the shared submit bucket across all tenants on one egress IP. Raise PUBLIC_SUBMIT_RATE_LIMIT; it's an env override, so no deploy is needed. Don't expect it to isolate one tenant; that's an edge-layer concern.
- Visits silently missing during traffic spikes, no visible error. The visit and confirm endpoints answer 204 on any failure (including a 429) so a tracking misfire never breaks a page load. A hot client degrades to gaps in analytics, not 4xxs the visitor sees. There is no per-tenant alarm for this today; it's a known tradeoff.
[!NOTE] Per-form and per-IP abuse limits do exist, but only on the inline contact-form path inside the CMS (contact_form components): 5 submissions per IP+post per hour and 10 per email+tenant per day; exceeding either returns 429. The slug-based lead form (/forms/:slug/submit) and the audit gates have no honeypot, bot-fill, or per-visitor caps of their own; they rely solely on the gateway's shared submit bucket.
Forms not syncing to CRM
There are two form paths, and they record outcomes differently. The submission row is saved either way — the lead is never lost, but it may not reach the CRM, and only one path records a CRM status.
| Path | Endpoint | Honeypot / bot-fill | status set after submit |
|---|---|---|---|
| Slug-based lead form | POST /public/v1/forms/:slug/submit | none | always stays received |
| Inline contact form | POST /public/v1/posts/:postId/contact-form-submissions | yes (hp_field + looksLikeBotFillAll) | crm_synced / crm_failed / spam |
Both resolve the tenant from X-Forwarded-Host, insert a cms.form_submissions row, and call services/crm createLead best-effort. The difference is what they write back:
- Slug lead form (submitForm): on a successful CRM sync it patches only crm_contact_id. It never touches status, so the row sits at the default received whether the CRM call succeeded, failed, or was skipped. To tell whether a slug-form lead synced, check crm_contact_id, not status.
- Inline contact form (submitContactForm): patches status to one of:
| status | Meaning (inline contact form only) |
|---|---|
| received | Inserted, before the CRM patch (transient) |
| crm_synced | Lead created in crm.contacts, crm_contact_id set |
| crm_failed | Submission saved, but createLead returned null / threw |
| spam | Honeypot or bot-fill heuristic tripped; no CRM call at all |
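If you reconcile submissions in your own tooling, the per-path rule can be captured in one helper. This is a sketch under the assumption that a submission row exposes status and crm_contact_id as described above; the FormPath tag and the return labels are hypothetical.

```typescript
type SubmissionStatus = "received" | "crm_synced" | "crm_failed" | "spam";
type FormPath = "slug" | "inline";
type Submission = { status: SubmissionStatus; crm_contact_id: string | null };
type SyncState = "synced" | "not_synced" | "spam" | "pending";

function crmSyncState(path: FormPath, s: Submission): SyncState {
  if (path === "slug") {
    // Slug lead form: status never leaves "received", so crm_contact_id is the only signal.
    return s.crm_contact_id ? "synced" : "not_synced";
  }
  // Inline contact form: status is authoritative.
  switch (s.status) {
    case "crm_synced":
      return "synced";
    case "crm_failed":
      return "not_synced";
    case "spam":
      return "spam";
    default:
      return "pending"; // "received" is transient on the inline path
  }
}
```

The point of the helper is that the same status value, received, means "look elsewhere" on the slug path and "not finished yet" on the inline path.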
Symptom → cause:
- A slug-form submission shows in the inbox but status is still received. Expected: the slug path doesn't advance status. Look at crm_contact_id: populated means it synced; null means the CRM call didn't return an id (CRM down, or createLead returned null). The submission row is intact and can be reconciled.
- An inline contact-form submission never reaches the CRM. Look at status. crm_failed means the CMS→CRM call failed. spam means it was classified as a bot and intentionally never forwarded.
- A real visitor's inline submission is marked spam. The honeypot field hp_field was non-empty (some autofill extensions fill hidden fields), or the looksLikeBotFillAll heuristic tripped (every text field identical). Honeypot hits deliberately return 200 so the bot can't tell it tripped a wire, which means the visitor sees success too. You can override the status to received from the dashboard inbox.
- Leads land unowned. Expected. When the actor is any internal service:* caller (a form submit forwarded by the CMS qualifies), createContact forces owner_user_id = null, so leads arrive unowned (actor_kind: system). Assign an owner in the CRM.
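For debugging false-positive spam, it can help to reproduce the two checks locally. The code below is a hypothetical reimplementation based only on the one-line descriptions above (non-empty hp_field, or every text field identical); the real looksLikeBotFillAll may use different thresholds or field selection.

```typescript
type Fields = Record<string, string>;

// Hypothetical version of the "every text field identical" heuristic.
function looksLikeBotFillAll(fields: Fields): boolean {
  const values = Object.values(fields)
    .map((v) => v.trim())
    .filter((v) => v.length > 0);
  if (values.length < 2) return false; // assumption: a single field cannot be "all identical"
  return values.every((v) => v === values[0]);
}

function isSpam(fields: Fields): boolean {
  // hp_field is the hidden honeypot input; real visitors leave it empty.
  const { hp_field = "", ...visible } = fields;
  return hp_field.trim() !== "" || looksLikeBotFillAll(visible);
}
```

Feeding a rejected visitor's payload through isSpam shows which of the two wires tripped, for instance an autofill extension that wrote into hp_field.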
[!NOTE] Gating unlock is matched by email or IP hash in the CMS, but a managed apps/public site only delivers the email path. After a visitor submits a gate form, the CMS unlocks the full post on the next read only if it can match their email or ip_hash against a cms.form_submissions row for the post's gated_form_id; either match unlocks. On a Nukipa-managed apps/public site this is automatic but email-only: the app persists the visitor's email in the nk_lead_email cookie and forwards it as X-Visitor-Email. IP-hash unlock is intentionally not wired in apps/public (the CMS salts IPs per-tenant and the public app doesn't have the salt), so don't expect IP-hash unlock to work on the managed site. On a custom site pointing directly at the gateway, you own visitor identity: if you don't persist the email-to-visitor mapping and forward it as X-Visitor-Email, the visitor re-hits the gate on every page load even though their lead synced fine. The lead-capture half works; the unlock half is your site's job.
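On a custom site, the email half of the unlock comes down to reading a cookie and turning it into a header. The sketch below assumes you reuse the nk_lead_email cookie name from apps/public and that the value is URL-encoded; both are assumptions, and how the resulting header is handed to your SDK client is not shown.

```typescript
// Return the named cookie from a raw Cookie header, or null if absent or malformed.
function readCookie(cookieHeader: string | undefined, name: string): string | null {
  if (!cookieHeader) return null;
  for (const part of cookieHeader.split(";")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    if (part.slice(0, idx).trim() !== name) continue;
    try {
      return decodeURIComponent(part.slice(idx + 1).trim());
    } catch {
      return null; // malformed percent-encoding
    }
  }
  return null;
}

// Build the identity header to forward on reads of gated posts.
function visitorIdentityHeaders(cookieHeader: string | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  const email = readCookie(cookieHeader, "nk_lead_email");
  if (email) headers["X-Visitor-Email"] = email;
  return headers;
}
```

If the cookie is missing, no header is sent and the CMS has nothing to match, which is the re-gating symptom described above.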
Lead classification (fit/intent scoring) runs as a fire-and-forget job on lead creation. If scores are missing, the classify job hasn't run or failed — see the next section; it's a job, not a synchronous step.
Sends and sequences stuck
Newsletters and nurturing share the same backbone: an in-process setInterval(30s) scheduler (no cron) that scans for due work and enqueues a pg-boss job, with workers living in the owning service. If sends are stuck, the failure is in one of those three links.
The scheduler tick (every 30s):
- Newsletters: scans issues WHERE status='scheduled' AND scheduled_for<=now(), caps at 25 per tick, race-safely flips to sending via a guarded UPDATE, then enqueues newsletters.send-issue. On enqueue failure it rolls the row back to scheduled and retries next tick.
- Nurturing: next_send_at is the only scheduler input. Edits re-anchor it in the same transaction; the next 30s tick picks up the new time.
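The newsletter half of the tick can be modelled in memory to show the claim, cap, and rollback behaviour. This is a sketch, not the scheduler's code: the Issue shape and the Enqueue callback are hypothetical, and the in-memory status check stands in for the guarded UPDATE ... WHERE status='scheduled'.

```typescript
type Issue = { id: string; status: "scheduled" | "sending"; scheduledFor: number };
// Hypothetical enqueue callback: returns false when the enqueue fails.
type Enqueue = (jobName: string, issueId: string) => boolean;

const MAX_PER_TICK = 25;

function schedulerTick(issues: Issue[], now: number, enqueue: Enqueue | null): string[] {
  if (!enqueue) return []; // no jobs client: return early, nothing is claimed

  const due = issues
    .filter((i) => i.status === "scheduled" && i.scheduledFor <= now)
    .slice(0, MAX_PER_TICK);

  const enqueued: string[] = [];
  for (const issue of due) {
    if (issue.status !== "scheduled") continue; // guarded claim: skip rows someone else took
    issue.status = "sending";
    if (enqueue("newsletters.send-issue", issue.id)) {
      enqueued.push(issue.id);
    } else {
      issue.status = "scheduled"; // roll back so the next tick retries
    }
  }
  return enqueued;
}
```

Three outcomes fall out of this shape: a missing jobs client leaves every row in scheduled, a failed enqueue puts the row back in scheduled, and only a successful enqueue leaves a row in sending.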
Symptom → cause:
- A scheduled issue never sends and stays scheduled. The scheduler enqueues via SERVICE_JOBS_URL. If that env var is unset, the scheduler tick returns early (if (!jobs) return) and never claims the row, so it stays scheduled. Confirm SERVICE_JOBS_URL is set and the jobs service is up.
- A send-now issue is stuck in sending, not scheduled. Different code path. POST /:id/send flips the issue to status='sending' before attempting the enqueue, then returns 202 { job_id: null } if the jobs client is unavailable, so the row is left in sending with no job behind it. Same root cause (SERVICE_JOBS_URL / jobs service), different stuck state. Fix the jobs config; the row won't self-recover from sending here.
- The issue flipped to sending but nothing went out. The newsletters.send-issue worker isn't running (no @nukipa/jobs-client worker registered, or the jobs/pg-boss connection is down). The row is claimed but never advanced.
- A job shows running then dies as failed { code: 'orphaned' }. The jobs sweeper marks any running row whose heartbeat_at is older than 6 minutes (default JOBS_ORPHAN_AFTER_MS = 360_000) as orphaned. The worker process stopped sending its ~5-second heartbeat because it crashed or was killed mid-send. (The threshold is sized for long image-gen workers; a stale comment in the sweeper still says 60s, but the config value wins.) Re-enqueue.
- A nurturing step won't fire at the time I expect. Only next_send_at matters; position and delay are inputs to computing it. If you edited a step, check that the re-anchor wrote a new next_send_at; the tick won't act until that timestamp passes.
- An unschedule returns 409. /issues/:id/unschedule only works while status='scheduled'. Once the tick has flipped it to sending, you can't pull it back; that's the escape hatch's boundary.
What you won't see is a double send. The nurturing send worker inserts the sends row (state='queued') with a UNIQUE (enrollment_id, step_id) key before calling Resend; a conflict short-circuits to idempotent recovery. Newsletter deliveries are similarly UNIQUE(send_id, email).
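The insert-before-send pattern can be sketched with an in-memory table. The SendsTable class and the deliver callback are hypothetical stand-ins for the sends table and the Resend call, and the "skipped" result simplifies what the text calls idempotent recovery.

```typescript
type SendRow = { enrollmentId: string; stepId: string; state: "queued" | "sent" };

// In-memory stand-in for a table with UNIQUE (enrollment_id, step_id).
class SendsTable {
  private rows = new Map<string, SendRow>();

  // Returns the new row, or null when the unique key already exists.
  insert(enrollmentId: string, stepId: string): SendRow | null {
    const key = enrollmentId + "\u0000" + stepId;
    if (this.rows.has(key)) return null;
    const row: SendRow = { enrollmentId, stepId, state: "queued" };
    this.rows.set(key, row);
    return row;
  }
}

function sendStep(
  table: SendsTable,
  enrollmentId: string,
  stepId: string,
  deliver: () => void,
): "sent" | "skipped" {
  const row = table.insert(enrollmentId, stepId); // claim the key BEFORE calling the provider
  if (!row) return "skipped"; // conflict: another attempt already owns this send
  deliver();
  row.state = "sent";
  return "sent";
}
```

Because the row is written first, a retried job hits the conflict and never reaches the provider a second time, which is why a crash and re-enqueue does not produce a duplicate email.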
[!WARNING] Cancellations propagate from CRM and unsubscribes: moving a contact to customer/disqualified/unqualified/opted-out, or a public unsubscribe, cancels every in-flight enrollment for that contact across all sequences. If a nurture sequence "stopped on its own," check the contact's stage/status changes — that's usually the cause, and it's intended.
Deploys stuck
A tenant site deploy is a pg-boss job (deploy-tenant-site) that reconciles a Vercel project against the tenant's github_repo, sets env vars, attaches the domain, and explicitly triggers a production build. The job is idempotent — re-running it is safe. The deployment row moves pending → building → ready/failed; building → ready/failed is driven by the Vercel webhook.
Symptom → cause:
- Repo pushed, but the deploy sits idle / never starts. Vercel's project-create call only links the git source for future pushes; commits that already exist on the branch don't auto-build. The worker handles this with an explicit triggerDeployment. If that step never ran, the row hangs in building waiting for a webhook that never fires.
- failed with "project is not linked to GitHub repo …". The trigger needs the numeric GitHub repo id (project.link.repoId), which Vercel only records when its GitHub App can see the repo. The fix is in the error message: install Vercel's GitHub App on the org that owns the repo, grant access to that specific repo, then re-run. This is the "host not registered" class of failure: Vercel doesn't know the repo exists.
- The deploy builds the old repo after I changed tenants.github_repo. The worker re-links the existing Vercel project (DELETE + POST on /link) when the linked repo no longer matches. If that re-link didn't happen, you keep deploying old commits. Re-run the job; the relink is idempotent.
- I changed NUKIPA_TENANT_HOST (or NUKIPA_GATEWAY_URL) in Vercel but the site still uses the old value. The worker sets these env vars only when they don't already exist: Vercel rejects a duplicate key, and the worker treats that "already exists" error as success and does not overwrite. So a value you set manually persists across re-runs (good), but equally, the deployer will never "fix" a stale one for you (the gotcha). To change it, update the env var in the Vercel project directly and redeploy. The domain it attaches is <slug>.<PLATFORM_SITES_HOST> (default <slug>.sites.<apex>), which is also what it writes into NUKIPA_TENANT_HOST.
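The set-if-absent behaviour in the last bullet is worth seeing in miniature, because it explains both the upside and the gotcha. FakeEnvStore and the "already_exists" code below are hypothetical stand-ins for the provider API and its duplicate-key error; only the "treat the duplicate as success, never overwrite" rule comes from the text.

```typescript
type CreateResult = { ok: boolean; code?: string };

// Hypothetical stand-in for a provider API that rejects duplicate keys.
class FakeEnvStore {
  private vars = new Map<string, string>();

  create(key: string, value: string): CreateResult {
    if (this.vars.has(key)) return { ok: false, code: "already_exists" };
    this.vars.set(key, value);
    return { ok: true };
  }

  get(key: string): string | undefined {
    return this.vars.get(key);
  }
}

// "Already exists" counts as success, and the existing value is left untouched.
function ensureEnvVar(store: FakeEnvStore, key: string, value: string): boolean {
  const res = store.create(key, value);
  if (res.ok) return true;
  return res.code === "already_exists";
}
```

Running ensureEnvVar twice with different values reports success both times, yet the store still holds the first value. A re-run of the deploy job therefore looks healthy while the stale host stays in place.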
[!NOTE] The deployer maps Vercel webhook events bluntly: deployment.error and deployment.canceled both land as status='failed' with a generic message pointing you at the Vercel project log. The webhook does not capture the build's stderr, so for a build that compiles locally but fails on Vercel, the real error is in the Vercel dashboard, not in tenant_deployments.error.
Cookieless visitor tracking: where it lives
Nukipa ships a cookieless proof-of-JS beacon: a per-pageview nonce (client_nonce) goes out with the server-side visit row, and a tiny inline <script> confirms a real browser rendered the page by POSTing the same nonce. The signals service then flips beacon_fired on the matching visit, which powers the "verified human" count and lets no-JS scrapers be excluded. The nonce is ephemeral, per-pageview, not stored, and not a fingerprint — so it stays out of ePrivacy Art 5(3) / GDPR scope.
[!WARNING] This beacon is implemented only on the Nuxt apps/public surface, not in the Next.js starter. On apps/public, server/middleware/visit.ts mints the nonce and stashes it on event.context.nkBeaconNonce, and server/plugins/beacon.ts injects the <script> via the Nitro render:html hook (it POSTs to the same-origin relay /api/visit-confirm). The Next.js starter (sites/nukipa/src/middleware.ts) mints only the nk_sid session cookie and calls recordVisit; it does not mint a per-pageview nonce, does not set any x-nukipa-nonce header, and its layout.tsx contains no beacon <script>. So on a custom Next site there is no confirmed-human count unless you implement the nonce + beacon yourself.
If you do want the beacon on a custom Next site, the SDK ships the inline-script builder nukipaBeaconScript({ nonce, endpoint }) (in @nukipa/site-sdk). You mint and thread the nonce yourself; the x-nukipa-nonce header referenced in that helper's doc comment is a suggested convention — nothing in the platform emits it for you.
The two pieces the Next starter's layout.tsx does treat as platform contracts (don't remove them):
- generateMetadata returning verification: { google: <token> }. Removing it stalls Google Search Console meta-tag verification forever.
- <NukipaFeedback /> inside <body>. The design-review feedback loop depends on it.
Separately, the starter middleware sets one first-party cookie regardless of the beacon: nk_sid, a session id with a 30-minute sliding TTL. Minting it server-side and setting it synchronously is what keeps every pageview from looking like a brand-new session.
[!WARNING] The "cookieless" claim covers the proof-of-JS beacon (where it exists), not your whole site. The moment you add any non-essential cookie (analytics, ads, a tracker) you've moved the page into ePrivacy/GDPR consent scope and the cookieless framing no longer applies to the site as a whole. The beacon (and nk_sid, a strictly-functional session cookie) stay compliant; your additions are on you.
FAQ
A post's body just stops mid-paragraph with no form. Is it broken?
No — it's gated and your site is missing the gate branch. When cms.posts.gated_form_id is set, the public read truncates the body to the first gate_after_paragraph paragraphs and returns is_gated: true with the form metadata. Render the gate form (the starter does this for you). The visitor unlocks by submitting; the CMS matches them by email (or IP hash, where the site forwards it) on the next read.
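The truncation itself can be sketched as below. The field names follow the answer above, but the paragraph splitting (blank-line separated), the omission of form metadata, and returning is_gated: false once unlocked are all assumptions; the real CMS body format and response shape may differ.

```typescript
type Post = { body: string; gated_form_id: string | null; gate_after_paragraph: number };
type PublicPost = { body: string; is_gated: boolean };

// Assumption: paragraphs are separated by one or more blank lines.
function applyGate(post: Post, unlocked: boolean): PublicPost {
  if (!post.gated_form_id || unlocked) {
    return { body: post.body, is_gated: false };
  }
  const paragraphs = post.body.split(/\n\s*\n/);
  const visible = paragraphs.slice(0, post.gate_after_paragraph).join("\n\n");
  return { body: visible, is_gated: true };
}
```

A site that renders body but ignores is_gated shows exactly the symptom in the question: text that ends after gate_after_paragraph paragraphs with nothing below it.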
A slug-form lead synced to the CRM but the submission status still says received. Bug?
No. The slug-based /forms/:slug/submit path never advances status — it only patches crm_contact_id. Check that field to confirm the sync. crm_synced/crm_failed/spam are written only by the inline contact-form path.
My form submit got a 429 but only sometimes. Rate limit on my form?
Almost certainly the shared submit bucket (2000/min per egress IP), not a per-form limit — under load all tenants on one CDN egress IP share it. Raise PUBLIC_SUBMIT_RATE_LIMIT (no deploy needed). The only true per-form/per-IP limits are on the inline contact-form path (5/IP/post/hour, 10/email/tenant/day).
My visits aren't being recorded. Are they landing with a NULL host?
No — a non-resolving host doesn't produce a NULL-host row, it produces no row. recordPublicVisit returns null (caller 204s) when the host doesn't resolve; when it does resolve it stores that host string. The real failure mode is dropped visits — getHost() returned empty and the gateway fell back to its own host, which then either failed tenant resolution (visit dropped) or, if the gateway host happens to map to a tenant, attributed the visit to the wrong host. Make sure your middleware client reads x-forwarded-host/host from the request, and set NUKIPA_TENANT_HOST as the build/edge fallback.
A job is stuck in running. Will it ever resolve?
Yes — the jobs sweeper marks any running row with no heartbeat for 6 minutes as failed { code: 'orphaned' } (default JOBS_ORPHAN_AFTER_MS = 360_000, sweep interval 30s). If a worker dies mid-job it won't hang forever; it'll flip to failed and you re-enqueue.
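The sweep rule reduces to a timestamp comparison. The following is a sketch with a hypothetical in-memory Job shape; only the threshold default and the failed { code: 'orphaned' } outcome are taken from the answer above.

```typescript
const JOBS_ORPHAN_AFTER_MS = 360_000; // documented default: 6 minutes

type Job = {
  id: string;
  state: "running" | "failed" | "completed";
  heartbeatAt: number; // epoch milliseconds of the last heartbeat
  error?: { code: string };
};

// Mark running jobs whose heartbeat is older than the threshold as failed/orphaned.
function sweepOrphans(jobs: Job[], now: number, orphanAfterMs: number = JOBS_ORPHAN_AFTER_MS): string[] {
  const orphaned: string[] = [];
  for (const job of jobs) {
    if (job.state === "running" && now - job.heartbeatAt > orphanAfterMs) {
      job.state = "failed";
      job.error = { code: "orphaned" };
      orphaned.push(job.id);
    }
  }
  return orphaned;
}
```

A healthy worker that heartbeats every few seconds never comes close to the threshold, so only a crashed or killed worker is swept.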
Resend webhook events aren't updating delivery rows. Where's the break?
The signature is verified at the gateway (svix), not the newsletters service. A 401 bad signature at the gateway means RESEND_WEBHOOK_SECRET is wrong or missing. Past the gateway, every event is archived into newsletters.events regardless of whether the delivery row correlates — so if events are arriving but rows aren't moving, the resend_message_id isn't matching a delivery (check that the send actually recorded message ids).