
Troubleshooting & gotchas

This is the cross-cutting failure-mode reference for a Nukipa-managed site: the things that break at the seams between your site, the gateway, and the platform services. Each section names the symptom, the cause in the actual request path, and the fix. Where the platform is thin, it says so.

The shape to keep in your head: your site → the Nukipa gateway (/public/v1/*) → the owning service (CMS, signals, newsletters, …). Almost every tenant-resolution and rate-limit problem lives in that first hop.

Host resolution: 404s and empty content

Every public read and write resolves the tenant from one input: the visitor host. Your site forwards it; the gateway passes it through as X-Forwarded-Host; the owning service runs resolveTenantByHost(host). If that host doesn't map to a tenant, you get a 404 (single post / form) or an empty list (/posts, /folders) — never another tenant's content.

The resolver tries two things, in order:

  1. Custom domain — a lowercased, port-stripped match against a tenant_domains row that has verified_at set. A row that exists but isn't verified yet does not match.
  2. Subdomain on a platform host: <slug>.<platform> (e.g. acme.nukipa.com), where <slug> matches tenants.slug. The platform-host list comes from PUBLIC_PLATFORM_HOSTS (defaults to nukipa.com,localhost).

If neither matches, the resolver returns null and the caller decides 404-vs-empty.
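
The two-step order above can be sketched roughly as follows. This is a minimal illustration, not the platform's resolver: the in-memory arrays stand in for the tenant_domains and tenants tables, and the names normalizeHost and resolveTenantByHostSketch are hypothetical.

```typescript
// Hypothetical in-memory stand-ins for the tenant_domains and tenants tables.
interface TenantDomain { domain: string; tenantId: string; verifiedAt: string | null }
interface Tenant { id: string; slug: string }

// Lowercase and strip a trailing :port, as the custom-domain match is described.
function normalizeHost(host: string): string {
  return host.trim().toLowerCase().replace(/:\d+$/, "");
}

function resolveTenantByHostSketch(
  rawHost: string,
  domains: TenantDomain[],
  tenants: Tenant[],
  platformHosts: string[] = ["nukipa.com", "localhost"],
): string | null {
  const host = normalizeHost(rawHost);
  if (!host) return null;

  // 1. Custom domain: only rows with verified_at set can match.
  const custom = domains.find((d) => d.domain === host && d.verifiedAt !== null);
  if (custom) return custom.tenantId;

  // 2. Subdomain on a platform host: <slug>.<platform>, single label only.
  for (const platform of platformHosts) {
    const suffix = "." + platform;
    if (!host.endsWith(suffix)) continue;
    const slug = host.slice(0, -suffix.length);
    if (!slug || slug.includes(".")) continue; // multi-label slugs are skipped
    const tenant = tenants.find((t) => t.slug === slug);
    if (tenant) return tenant.id;
  }

  // Neither branch matched: the caller decides 404 vs. empty list.
  return null;
}
```

With one verified row for example.com and a tenant whose slug is acme, both Example.com:443 and acme.nukipa.com resolve in this sketch, while an unverified domain or blog.acme.nukipa.com returns null.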

[!WARNING] There are three copies of this resolver and they don't all agree. The shared resolver (@nukipa/shared, used by the signals visit-ingest + confirm path) toggles www.: it builds the candidate set [host, apex, www.apex] and matches any verified row, so www.example.com and example.com resolve to the same tenant there. The CMS resolver (reads, gating, forms) and the newsletters resolver do exact-match only (.eq('domain', hostname)) — on those paths www.example.com and example.com are different hosts. If your site serves both and you rely on CMS reads, register both as verified tenant_domains rows, or redirect one to the other at the edge before the request reaches Nukipa.
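
To make the www. difference concrete, here is a small sketch of the two matching styles. It assumes the apex is derived by stripping a leading www. label (the warning above gives the candidate set but not how the apex is computed), and the function names are hypothetical.

```typescript
// Candidate set in the style of the shared resolver: [host, apex, www.apex].
// Assumption: "apex" means the host with a leading "www." removed.
function candidateHosts(host: string): string[] {
  const h = host.trim().toLowerCase().replace(/:\d+$/, "");
  const apex = h.startsWith("www.") ? h.slice(4) : h;
  return Array.from(new Set([h, apex, "www." + apex])); // de-duplicated
}

// Shared-resolver style: any candidate that is a verified domain matches.
function matchesWithWwwToggle(host: string, verified: Set<string>): boolean {
  return candidateHosts(host).some((c) => verified.has(c));
}

// CMS / newsletters style: exact match on the normalized host only.
function matchesExact(host: string, verified: Set<string>): boolean {
  return verified.has(host.trim().toLowerCase().replace(/:\d+$/, ""));
}
```

With only example.com verified, www.example.com matches under the toggle but not under exact match, which is why CMS reads need both rows or an edge redirect.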

How the host actually travels:

| Layer | What sets the host | Gotcha |
| --- | --- | --- |
| Your site (SDK) | getHost() reads inbound x-forwarded-host, then host, then NUKIPA_TENANT_HOST | If getHost() returns '', the gateway falls back to its own host → wrong tenant or 404 |
| Gateway | visitorHost(req) = x-forwarded-host, falling back to host | req.get('host') is the literal Host header even behind a proxy, so the gateway does the X-Forwarded-Host check explicitly |
| Service | resolveTenantByHost(X-Forwarded-Host) | Unverified custom domain or multi-label subdomain → no match (on CMS/shared; see below) |

Symptom → cause:

  • Live custom domain 404s, staging subdomain works. The tenant_domains row has no verified_at set yet. Until DNS verifies, only the <slug>.<platform> subdomain resolves.
  • Everything is empty in build/edge contexts but fine in the browser. Build-time and edge renders have no inbound request headers. That's exactly what NUKIPA_TENANT_HOST is for — it's the fallback host when x-forwarded-host/host are absent.
  • A multi-label subdomain (blog.acme.nukipa.com) won't resolve. On the CMS and shared resolvers the subdomain branch skips any slug containing a dot (slug.includes('.') → continue), so only a single label resolves. The newsletters resolver omits that guard — but a multi-label slug still won't match a real tenants.slug, so the practical result is the same (no resolution). Either way: use a single-label subdomain.

[!NOTE] NUKIPA_TENANT_HOST is a fallback, not an override. The SDK resolver is ordered visitor-host-first on purpose: when a custom domain is live, the visitor host (example.com) must win so the gateway can return that domain's per-domain google_verification_token. Forcing NUKIPA_TENANT_HOST (your <slug>.sites.… host) routes through the subdomain branch, which doesn't carry the token, and Google Search Console verification stalls at "verification token could not be found." Set NUKIPA_TENANT_HOST to your platform subdomain and leave it as the fallback.
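
The visitor-host-first ordering can be illustrated with a short sketch. This is not the SDK's getHost(); the function name, the plain header record, and passing the environment as an argument are assumptions made to keep the example self-contained.

```typescript
type HeaderBag = Record<string, string | undefined>;

// Resolution order described above: x-forwarded-host, then host,
// then NUKIPA_TENANT_HOST as the fallback (never as an override).
function getHostSketch(headers: HeaderBag, env: { NUKIPA_TENANT_HOST?: string }): string {
  const forwarded = headers["x-forwarded-host"]?.trim();
  if (forwarded) return forwarded;

  const host = headers["host"]?.trim();
  if (host) return host;

  // Build-time and edge renders have no inbound headers, so only this branch is left.
  return env.NUKIPA_TENANT_HOST?.trim() ?? "";
}
```

An empty string from this function corresponds to the failure case in the table above: the gateway would fall back to its own host.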

Why the gateway relay is mandatory

Don't call services directly, and don't fetch('/public/v1/*') by hand from your site — go through the SDK (getNukipaClient() / getMiddlewareClient(req)). The relay isn't decoration; the services depend on header normalisation the gateway performs:

  • It sets X-Forwarded-Host (tenant resolution), X-Forwarded-For (visitor IP / fingerprint), and forwards User-Agent + Referer so signals classifies the real browser, not the gateway.
  • It threads visitor identity for gating (X-Visitor-Email, and X-Visitor-IP-Hash if the upstream set it) and version pinning (X-Nukipa-Site-Version) through verbatim.
  • For the Resend delivery webhook it verifies the svix signature at the gateway, so the RESEND_WEBHOOK_SECRET never leaves the gateway. The Post for Me callback is different: the gateway forwards the Authorization: Bearer header verbatim and does not check it — that bearer is verified downstream in the social service against POSTFORME_WEBHOOK_SECRET. So "signing secret stays at the gateway" is true for Resend, not for Post for Me.

The CMS/signals public routes are mounted before internal auth and self-resolve the tenant from request data — so a raw call that omits these headers won't 401, it'll just silently resolve the wrong tenant (or none) and drop your data. That failure is invisible until you notice missing visits or unattributed leads.

[!TIP] One known IP-loss trap is already handled for you: serverless runtimes strip X-Forwarded-For on outbound calls, so recordVisit injects visitor_ip into the request body (signals prefers body.visitor_ip over the header). If you reimplement visit ingest with raw fetch, you lose the Visitors KPI (the visitor_fingerprint column stays NULL and the dashboard shows "—"). Use the SDK.
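
The body-over-header preference can be sketched as below. The function name is hypothetical, and taking the first entry of a comma-separated X-Forwarded-For chain as the client address is an assumption based on the usual convention, not something confirmed for the signals service.

```typescript
interface VisitBody { visitor_ip?: string }

// Prefer the IP the site injected into the body; fall back to the header.
function pickVisitorIp(
  body: VisitBody,
  headers: Record<string, string | undefined>,
): string | null {
  const fromBody = body.visitor_ip?.trim();
  if (fromBody) return fromBody;

  // Assumption: the first entry of the X-Forwarded-For chain is the client.
  const first = headers["x-forwarded-for"]?.split(",")[0]?.trim();
  return first ? first : null;
}
```

When neither source is present the sketch returns null, which mirrors the case where the fingerprint cannot be computed and the Visitors KPI stays empty.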

Rate limits and shared-IP collapse

The gateway runs three buckets, all per-IP over a 60-second window:

| Bucket | Limit (per IP / min) | Applies to | Env override |
| --- | --- | --- | --- |
| submitLimiter | 2000 | form submit, inline contact-form submit, audit run/gate, newsletter subscribe, ingestion webhooks, Post for Me webhook | PUBLIC_SUBMIT_RATE_LIMIT |
| visitIngestLimiter | 1200 (~20/s) | visit ingest, visit confirm, CTA clicks, Resend webhook | none (code constant) |
| readLimiter | unlimited (no-op passthrough) | all blog/tenant/folder/event/data reads | |

The thing to understand: behind a CDN or SSR layer, the gateway's req.ip is the proxy egress IP, not the visitor's. Every tenant's traffic collapses onto a small set of Vercel/Cloudflare function outbound IPs. A naive per-IP counter would 429 legitimate cross-tenant load constantly — which is why:

  • Reads are intentionally unlimited. The read limiter is a no-op. Reads are anonymous, cheap, idempotent public data; there's nothing to protect against in a per-IP express counter. If real abuse protection is ever needed it belongs at the edge (Cloudflare / Vercel WAF), not here.
  • The submit bucket is deliberately huge (2000/min) for the same reason — it has to absorb every tenant sharing one egress IP, including long-running audit polls. It is not a per-visitor anti-abuse limit.
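
A toy fixed-window counter shows why a small per-IP limit would misfire behind shared egress IPs. It is a sketch only; the gateway's actual limiter implementation is not described here, and the class name and the tiny limit of 3 are chosen purely for illustration.

```typescript
// Minimal per-key fixed-window counter (illustrative, not the gateway's limiter).
class FixedWindowLimiter {
  private readonly hits = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if it would be a 429.
  allow(ip: string, now: number): boolean {
    const entry = this.hits.get(ip);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(ip, { windowStart: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

// Three tenants, two submits each, all arriving from one proxy egress IP.
function simulateSharedEgress(limit: number): number {
  const limiter = new FixedWindowLimiter(limit);
  let allowed = 0;
  for (const _tenant of ["a", "b", "c"]) {
    for (let i = 0; i < 2; i++) {
      if (limiter.allow("198.51.100.1", 1_000)) allowed += 1;
    }
  }
  return allowed;
}
```

With a limit of 3, only half of the six requests get through even though no single tenant sent more than two, which is the collapse the large submit bucket is sized to absorb.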

[!NOTE] The gateway's req.ip only reflects the egress IP because the gateway app sets Express trust proxy. The public routes file documents this in comments but doesn't set it (the app.set('trust proxy', …) lives in the gateway bootstrap) — the reasoning above holds regardless of the exact value.

Symptom → cause:

  • 429 {"error":{"code":"too_many"}} on form submits under load. You've hit the shared submit bucket across all tenants on one egress IP. Raise PUBLIC_SUBMIT_RATE_LIMIT — it's crankable without a deploy. Don't expect it to isolate one tenant; that's an edge-layer concern.
  • Visits silently missing during traffic spikes, no visible error. The visit and confirm endpoints answer 204 on any failure (including a 429) so a tracking misfire never breaks a page load. A hot client degrades to gaps in analytics, not 4xxs the visitor sees. There is no per-tenant alarm for this today — it's a known tradeoff.

[!NOTE] Per-form and per-IP abuse limits do exist, but only on the inline contact-form path inside the CMS (contact_form components): 5 submissions per IP+post per hour and 10 per email+tenant per day, over → 429. The slug-based lead form (/forms/:slug/submit) and the audit gates have no honeypot, bot-fill, or per-visitor caps of their own — they rely solely on the gateway's shared submit bucket.

Forms not syncing to CRM

There are two form paths, and they record outcomes differently. The submission row is saved either way — the lead is never lost, but it may not reach the CRM, and only one path records a CRM status.

| Path | Endpoint | Honeypot / bot-fill | status set after submit |
| --- | --- | --- | --- |
| Slug-based lead form | POST /public/v1/forms/:slug/submit | none | always stays received |
| Inline contact form | POST /public/v1/posts/:postId/contact-form-submissions | yes (hp_field + looksLikeBotFillAll) | crm_synced / crm_failed / spam |

Both resolve the tenant from X-Forwarded-Host, insert a cms.form_submissions row, and call services/crm createLead best-effort. The difference is what they write back:

  • Slug lead form (submitForm): on a successful CRM sync it patches only crm_contact_id. It never touches status, so the row sits at the default received whether the CRM call succeeded, failed, or was skipped. To tell whether a slug-form lead synced, check crm_contact_id, not status.
  • Inline contact form (submitContactForm): patches status to one of:
    | status | Meaning (inline contact form only) |
    | --- | --- |
    | received | Inserted, before the CRM patch (transient) |
    | crm_synced | Lead created in crm.contacts, crm_contact_id set |
    | crm_failed | Submission saved, but createLead returned null / threw |
    | spam | Honeypot or bot-fill heuristic tripped; no CRM call at all |
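
The inline-form outcomes above can be read as a small decision function. This is a sketch under assumptions: the real createLead call is presumably asynchronous (it is modelled as a synchronous callback here to keep the example short), and the bot-fill rule is a guess at "every text field identical" that requires at least two non-empty fields. All function names are hypothetical.

```typescript
type InlineStatus = "crm_synced" | "crm_failed" | "spam";

interface InlineSubmission {
  hp_field?: string;              // honeypot: real visitors leave it empty
  fields: Record<string, string>; // visible text fields
}

// Assumed reading of the heuristic: all non-empty text fields hold the same value.
function looksLikeBotFillAllSketch(fields: Record<string, string>): boolean {
  const values = Object.keys(fields)
    .map((k) => fields[k].trim())
    .filter((v) => v.length > 0);
  return values.length >= 2 && values.every((v) => v === values[0]);
}

function statusAfterInlineSubmit(
  sub: InlineSubmission,
  createLead: (fields: Record<string, string>) => string | null,
): { status: InlineStatus; crm_contact_id: string | null } {
  if ((sub.hp_field ?? "") !== "" || looksLikeBotFillAllSketch(sub.fields)) {
    return { status: "spam", crm_contact_id: null }; // no CRM call at all
  }
  try {
    const id = createLead(sub.fields);
    return id
      ? { status: "crm_synced", crm_contact_id: id }
      : { status: "crm_failed", crm_contact_id: null };
  } catch {
    return { status: "crm_failed", crm_contact_id: null };
  }
}
```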

Symptom → cause:

  • A slug-form submission shows in the inbox but status is still received. Expected — the slug path doesn't advance status. Look at crm_contact_id: populated means it synced; null means the CRM call didn't return an id (CRM down, or createLead returned null). The submission row is intact and can be reconciled.
  • An inline contact-form submission never reaches the CRM. Look at status. crm_failed means the CMS→CRM call failed. spam means it was classified as a bot and intentionally never forwarded.
  • A real visitor's inline submission is marked spam. The honeypot field hp_field was non-empty (some autofill extensions fill hidden fields), or the looksLikeBotFillAll heuristic tripped (every text field identical). Honeypot hits deliberately return 200 so the bot can't tell it tripped a wire — which means the visitor sees success too. You can override the status to received from the dashboard inbox.
  • Leads land unowned. Expected. When the actor is any internal service:* caller (a form submit forwarded by the CMS qualifies), createContact forces owner_user_id = null — leads arrive unowned (actor_kind: system). Assign an owner in the CRM.

[!NOTE] Gating unlock is matched by email or IP hash in the CMS, but a managed apps/public site only delivers the email path. After a visitor submits a gate form, the CMS unlocks the full post on the next read only if it can match their email or ip_hash against a cms.form_submissions row for the post's gated_form_id — either match unlocks. On a Nukipa-managed apps/public site this is automatic but email-only: the app persists the visitor's email in the nk_lead_email cookie and forwards it as X-Visitor-Email. IP-hash unlock is intentionally not wired in apps/public (the CMS salts IPs per-tenant and the public app doesn't have the salt), so don't expect IP-hash unlock to work on the managed site. On a custom site pointing directly at the gateway, you own visitor identity — if you don't persist the email-to-visitor mapping and forward it as X-Visitor-Email, the visitor re-hits the gate on every page load even though their lead synced fine. The lead-capture half works; the unlock half is your site's job.
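
The "either match unlocks" rule can be sketched like this. The row shape is a simplified stand-in for cms.form_submissions, the function name is hypothetical, and comparing emails case-insensitively is an assumption.

```typescript
interface GateSubmission { form_id: string; email?: string; ip_hash?: string }
interface VisitorIdentity { email?: string; ipHash?: string }

// Unlock if any submission for the post's gated form matches by email OR by IP hash.
function isUnlocked(
  gatedFormId: string,
  visitor: VisitorIdentity,
  submissions: GateSubmission[],
): boolean {
  const email = visitor.email?.trim().toLowerCase() ?? "";
  const ipHash = visitor.ipHash ?? "";
  return submissions.some(
    (s) =>
      s.form_id === gatedFormId &&
      ((email !== "" && s.email?.toLowerCase() === email) ||
        (ipHash !== "" && s.ip_hash === ipHash)),
  );
}
```

A site that forwards neither identity passes an empty visitor here, so the function returns false on every read. That is the "re-hits the gate on every page load" symptom.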

Lead classification (fit/intent scoring) runs as a fire-and-forget job on lead creation. If scores are missing, the classify job hasn't run or failed — see the next section; it's a job, not a synchronous step.

Sends and sequences stuck

Newsletters and nurturing share the same backbone: an in-process setInterval(30s) scheduler (no cron) that scans for due work and enqueues a pg-boss job, with workers living in the owning service. If sends are stuck, the failure is in one of those three links.

The scheduler tick (every 30s):

  • Newsletters: scans issues WHERE status='scheduled' AND scheduled_for<=now(), caps 25 per tick, race-safely flips to sending via a guarded UPDATE, then enqueues newsletters.send-issue. On enqueue failure it rolls the row back to scheduled and retries next tick.
  • Nurturing: next_send_at is the only scheduler input. Edits re-anchor it in the same transaction; the next 30s tick picks up the new time.
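
The newsletter half of the tick can be modelled in memory as follows. It is a sketch, not the scheduler's code: the array replaces the issues table, the status check inside the loop stands in for the guarded UPDATE, and enqueue is a hypothetical synchronous callback in place of the jobs client.

```typescript
type IssueStatus = "scheduled" | "sending";
interface Issue { id: string; status: IssueStatus; scheduled_for: number }

// One tick: claim up to `cap` due issues, enqueue each, roll back on enqueue failure.
function schedulerTick(
  issues: Issue[],
  now: number,
  enqueue: ((issueId: string) => void) | null,
  cap: number = 25,
): string[] {
  if (!enqueue) return []; // no jobs client configured: nothing is claimed

  const due = issues
    .filter((i) => i.status === "scheduled" && i.scheduled_for <= now)
    .slice(0, cap);

  const enqueued: string[] = [];
  for (const issue of due) {
    if (issue.status !== "scheduled") continue; // guarded flip: skip rows already claimed
    issue.status = "sending";
    try {
      enqueue(issue.id);
      enqueued.push(issue.id);
    } catch {
      issue.status = "scheduled"; // roll back; the next tick retries
    }
  }
  return enqueued;
}
```

Passing null for enqueue models a missing jobs client: the tick returns early and every row stays scheduled.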

Symptom → cause:

  • A scheduled issue never sends and stays scheduled. The scheduler enqueues via SERVICE_JOBS_URL. If that env var is unset, the scheduler tick returns early (if (!jobs) return) and never claims the row — it stays scheduled. Confirm SERVICE_JOBS_URL is set and the jobs service is up.
  • A send-now issue is stuck in sending, not scheduled. Different code path. POST /:id/send flips the issue to status='sending' before attempting the enqueue, then returns 202 { job_id: null } if the jobs client is unavailable — so the row is left in sending with no job behind it. Same root cause (SERVICE_JOBS_URL / jobs service), different stuck state. Fix the jobs config; the row won't self-recover from sending here.
  • The issue flipped to sending but nothing went out. The newsletters.send-issue worker isn't running (no @nukipa/jobs-client worker registered, or the jobs/pg-boss connection is down). The row is claimed but never advanced.
  • A job shows running then dies as failed { code: 'orphaned' }. The jobs sweeper marks any running row whose heartbeat_at is older than 6 minutes (default JOBS_ORPHAN_AFTER_MS = 360_000) as orphaned. The worker process stopped sending its ~5-second heartbeat — it crashed or was killed mid-send. (The threshold is sized for long image-gen workers; a stale comment in the sweeper still says 60s — the config value wins.) Re-enqueue.
  • A nurturing step won't fire at the time I expect. Only next_send_at matters — position and delay are inputs to computing it. If you edited a step, check that the re-anchor wrote a new next_send_at; the tick won't act until that timestamp passes.
  • An unschedule returns 409. /issues/:id/unschedule only works while status='scheduled'. Once the tick has flipped it to sending, you can't pull it back — that's the escape hatch's boundary.
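
The orphan rule from the list above (a running job whose heartbeat is older than JOBS_ORPHAN_AFTER_MS becomes failed with code orphaned) can be sketched as a pure function. The row shape and function name are hypothetical; only the 360_000 ms default and the resulting state come from the description above.

```typescript
interface JobRow {
  id: string;
  state: "running" | "failed" | "done";
  heartbeat_at: number; // epoch milliseconds of the last heartbeat
  error?: { code: string };
}

const JOBS_ORPHAN_AFTER_MS = 360_000; // 6 minutes, the documented default

// Marks stale running jobs as failed/orphaned and returns their ids.
function sweepOrphans(
  jobs: JobRow[],
  now: number,
  orphanAfterMs: number = JOBS_ORPHAN_AFTER_MS,
): string[] {
  const orphaned: string[] = [];
  for (const job of jobs) {
    if (job.state === "running" && now - job.heartbeat_at > orphanAfterMs) {
      job.state = "failed";
      job.error = { code: "orphaned" };
      orphaned.push(job.id);
    }
  }
  return orphaned;
}
```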

What you won't see is a double send. The nurturing send worker inserts the sends row (state='queued') with a UNIQUE (enrollment_id, step_id) key before calling Resend; a conflict short-circuits to idempotent recovery. Newsletter deliveries are similarly UNIQUE(send_id, email).
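
The insert-before-send pattern can be illustrated with an in-memory table. This sketch only shows why a duplicate attempt cannot reach the provider; the real worker's recovery path on conflict is not reproduced, and the class and function names are hypothetical.

```typescript
// In-memory stand-in for a sends table with UNIQUE (enrollment_id, step_id).
class SendsTable {
  private readonly rows = new Map<string, "queued" | "sent">();

  // Returns false on a unique-key conflict instead of inserting a second row.
  insertQueued(enrollmentId: string, stepId: string): boolean {
    const key = enrollmentId + ":" + stepId;
    if (this.rows.has(key)) return false;
    this.rows.set(key, "queued");
    return true;
  }

  markSent(enrollmentId: string, stepId: string): void {
    this.rows.set(enrollmentId + ":" + stepId, "sent");
  }
}

// Claim the row first, deliver second: a duplicate job never reaches deliver().
function sendStep(
  table: SendsTable,
  enrollmentId: string,
  stepId: string,
  deliver: () => void,
): "sent" | "duplicate" {
  if (!table.insertQueued(enrollmentId, stepId)) return "duplicate";
  deliver();
  table.markSent(enrollmentId, stepId);
  return "sent";
}
```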

[!WARNING] Cancellations propagate from CRM and unsubscribes: moving a contact to customer/disqualified/unqualified/opted-out, or a public unsubscribe, cancels every in-flight enrollment for that contact across all sequences. If a nurture sequence "stopped on its own," check the contact's stage/status changes — that's usually the cause, and it's intended.

Deploys stuck

A tenant site deploy is a pg-boss job (deploy-tenant-site) that reconciles a Vercel project against the tenant's github_repo, sets env vars, attaches the domain, and explicitly triggers a production build. The job is idempotent — re-running it is safe. The deployment row moves pending → building → ready/failed; building → ready/failed is driven by the Vercel webhook.

Symptom → cause:

  • Repo pushed, but the deploy sits idle / never starts. Vercel's project-create call only links the git source for future pushes — commits that already exist on the branch don't auto-build. The worker handles this with an explicit triggerDeployment. If that step never ran, the row hangs in building waiting for a webhook that never fires.
  • failed with "project is not linked to GitHub repo …". The trigger needs the numeric GitHub repo id (project.link.repoId), which Vercel only records when its GitHub App can see the repo. The fix is in the error message: install Vercel's GitHub App on the org that owns the repo, grant access to that specific repo, then re-run. This is the "host not registered" class of failure — Vercel doesn't know the repo exists.
  • The deploy builds the old repo after I changed tenants.github_repo. The worker re-links the existing Vercel project (DELETE + POST on /link) when the linked repo no longer matches. If that re-link didn't happen, you keep deploying old commits. Re-run the job; the relink is idempotent.
  • I changed NUKIPA_TENANT_HOST (or NUKIPA_GATEWAY_URL) in Vercel but the site still uses the old value. The worker sets these env vars only when they don't already exist — Vercel rejects a duplicate key, and the worker treats that "already exists" error as success and does not overwrite. So a value you set manually persists across re-runs (good), but equally, the deployer will never "fix" a stale one for you (the gotcha). To change it, update the env var in the Vercel project directly and redeploy. The domain it attaches is <slug>.<PLATFORM_SITES_HOST> (default <slug>.sites.<apex>), which is also what it writes into NUKIPA_TENANT_HOST.
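
The create-only behaviour in the last bullet can be sketched as below. The Map stands in for the Vercel project's env vars, and the error code ENV_ALREADY_EXISTS is a made-up marker for the duplicate-key rejection, not Vercel's actual error shape.

```typescript
// Hypothetical stand-in for the create call: rejects a key that already exists.
function createEnvVar(store: Map<string, string>, key: string, value: string): void {
  if (store.has(key)) {
    throw Object.assign(new Error("env var already exists: " + key), {
      code: "ENV_ALREADY_EXISTS", // made-up code for this sketch
    });
  }
  store.set(key, value);
}

// Treats "already exists" as success and never overwrites the stored value.
function ensureEnvVar(
  store: Map<string, string>,
  key: string,
  value: string,
): "created" | "kept" {
  try {
    createEnvVar(store, key, value);
    return "created";
  } catch (err) {
    if ((err as { code?: string }).code === "ENV_ALREADY_EXISTS") return "kept";
    throw err; // anything else is a real failure
  }
}
```

A re-run against a project that already has NUKIPA_TENANT_HOST returns "kept" and leaves the old value in place, which is both the safety property and the gotcha.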

[!NOTE] The deployer maps Vercel webhook events bluntly: deployment.error and deployment.canceled both land as status='failed' with a generic message pointing you at the Vercel project log. The webhook does not capture the build's stderr — for a build that compiles locally but fails on Vercel, the real error is in the Vercel dashboard, not in tenant_deployments.error.

Cookieless visitor tracking: where it lives

Nukipa ships a cookieless proof-of-JS beacon: a per-pageview nonce (client_nonce) goes out with the server-side visit row, and a tiny inline <script> confirms a real browser rendered the page by POSTing the same nonce. The signals service then flips beacon_fired on the matching visit, which powers the "verified human" count and lets no-JS scrapers be excluded. The nonce is ephemeral, per-pageview, not stored, and not a fingerprint — so it stays out of ePrivacy Art 5(3) / GDPR scope.
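
The two halves of the beacon (mint a nonce with the visit, confirm it from the browser) can be modelled in a few lines. This is an illustrative sketch: the array stands in for the signals visits table, the function names are hypothetical, and using a random UUID as the nonce is an assumption.

```typescript
import { randomUUID } from "node:crypto";

interface VisitRow { client_nonce: string; beacon_fired: boolean }

// Server side: record the visit with a fresh per-pageview nonce.
function recordVisitWithNonce(visits: VisitRow[]): string {
  const nonce = randomUUID(); // assumption: any unguessable per-pageview value works
  visits.push({ client_nonce: nonce, beacon_fired: false });
  return nonce; // handed to the inline <script> for this pageview only
}

// Confirm endpoint: flip beacon_fired on the visit whose nonce matches.
function confirmVisit(visits: VisitRow[], nonce: string): boolean {
  const row = visits.find((v) => v.client_nonce === nonce);
  if (!row) return false;
  row.beacon_fired = true;
  return true;
}
```

A scraper that never runs the inline script never calls the confirm step, so its visit row keeps beacon_fired = false and stays out of the verified count.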

[!WARNING] This beacon is implemented only on the Nuxt apps/public surface — not in the Next.js starter. On apps/public, server/middleware/visit.ts mints the nonce and stashes it on event.context.nkBeaconNonce, and server/plugins/beacon.ts injects the <script> via the Nitro render:html hook (it POSTs to the same-origin relay /api/visit-confirm). The Next.js starter (sites/nukipa/src/middleware.ts) mints only the nk_sid session cookie and calls recordVisit — it does not mint a per-pageview nonce, does not set any x-nukipa-nonce header, and its layout.tsx contains no beacon <script>. So on a custom Next site there is no confirmed-human count unless you implement the nonce + beacon yourself.

If you do want the beacon on a custom Next site, the SDK ships the inline-script builder nukipaBeaconScript({ nonce, endpoint }) (in @nukipa/site-sdk). You mint and thread the nonce yourself; the x-nukipa-nonce header referenced in that helper's doc comment is a suggested convention — nothing in the platform emits it for you.

The two pieces the Next starter's layout.tsx does treat as platform contracts (don't remove them):

  1. generateMetadata returning verification: { google: <token> } — removing it stalls Google Search Console meta-tag verification forever.
  2. <NukipaFeedback /> inside <body> — the design-review feedback loop depends on it.

Separately, the starter middleware sets one first-party cookie regardless of the beacon: nk_sid, a session id with a 30-minute sliding TTL. Minting it server-side and setting it synchronously is what keeps every pageview from looking like a brand-new session.

[!WARNING] The "cookieless" claim covers the proof-of-JS beacon (where it exists), not your whole site. The moment you add any non-essential cookie — analytics, ads, a tracker — you've moved the page into ePrivacy/GDPR consent scope and the cookieless framing no longer applies to the site as a whole. The beacon (and nk_sid, a strictly-functional session cookie) stay compliant; your additions are on you.

FAQ

A post's body just stops mid-paragraph with no form. Is it broken? No — it's gated and your site is missing the gate branch. When cms.posts.gated_form_id is set, the public read truncates the body to the first gate_after_paragraph paragraphs and returns is_gated: true with the form metadata. Render the gate form (the starter does this for you). The visitor unlocks by submitting; the CMS matches them by email (or IP hash, where the site forwards it) on the next read.
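
For a site that has to handle the gated response itself, the truncation can be pictured roughly as below. This is a sketch under assumptions: paragraphs are taken to be blank-line-separated blocks, an unlocked read is assumed to return is_gated: false, and the function name is hypothetical.

```typescript
interface PublicPostBody { body: string; is_gated: boolean }

// Truncate to the first N paragraphs when the post is gated and the visitor is not unlocked.
function applyGate(
  body: string,
  gatedFormId: string | null,
  gateAfterParagraph: number,
  unlocked: boolean,
): PublicPostBody {
  if (!gatedFormId || unlocked) return { body, is_gated: false };

  // Assumption: paragraphs are separated by one or more blank lines.
  const paragraphs = body.split(/\n\s*\n/);
  return {
    body: paragraphs.slice(0, gateAfterParagraph).join("\n\n"),
    is_gated: true,
  };
}
```

A renderer should branch on is_gated and show the form after the truncated body; printing the body alone produces the "stops mid-paragraph" symptom.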

A slug-form lead synced to the CRM but the submission status still says received. Bug? No. The slug-based /forms/:slug/submit path never advances status — it only patches crm_contact_id. Check that field to confirm the sync. crm_synced/crm_failed/spam are written only by the inline contact-form path.

My form submit got a 429 but only sometimes. Rate limit on my form? Almost certainly the shared submit bucket (2000/min per egress IP), not a per-form limit — under load all tenants on one CDN egress IP share it. Raise PUBLIC_SUBMIT_RATE_LIMIT (no deploy needed). The only true per-form/per-IP limits are on the inline contact-form path (5/IP/post/hour, 10/email/tenant/day).

My visits aren't being recorded. Are they landing with a NULL host? No. A non-resolving host doesn't produce a NULL-host row; it produces no row. recordPublicVisit returns null (caller 204s) when the host doesn't resolve; when it does resolve it stores that host string. The real failure mode is dropped visits: getHost() returned empty and the gateway fell back to its own host, which then either failed tenant resolution (visit dropped) or, if the gateway host happens to map to a tenant, attributed the visit to the wrong host. Make sure your middleware client reads x-forwarded-host/host from the request, and set NUKIPA_TENANT_HOST as the build/edge fallback.

A job is stuck in running. Will it ever resolve? Yes — the jobs sweeper marks any running row with no heartbeat for 6 minutes as failed { code: 'orphaned' } (default JOBS_ORPHAN_AFTER_MS = 360_000, sweep interval 30s). If a worker dies mid-job it won't hang forever; it'll flip to failed and you re-enqueue.

Resend webhook events aren't updating delivery rows. Where's the break? The signature is verified at the gateway (svix), not the newsletters service. A 401 bad signature at the gateway means RESEND_WEBHOOK_SECRET is wrong or missing. Past the gateway, every event is archived into newsletters.events regardless of whether the delivery row correlates — so if events are arriving but rows aren't moving, the resend_message_id isn't matching a delivery (check that the send actually recorded message ids).
