railway-serverless-diagnose
Why this skill exists: On 2026-04-24, AgentPact had three Paperclip tickets (WIS-253, WIS-258, WIS-310) blocked for days with the diagnosis "Railway deployment returns 404 'Application not found' on ALL endpoints. Entire API unreachable. All code is correct and merged to main, but no external consumer can reach it." This was wrong. The services were healthy. They were sleeping. The first probe got the wake-window 404; nobody retried, so the 404 became "dead" in Paperclip and downstream tickets all chained off that misdiagnosis.
The pattern repeats whenever a low-traffic Railway project is probed by an autonomous agent that doesn't know about serverless sleep.
When to use
- An autonomous agent or human reports "Railway service returns 404 on every endpoint"
- Paperclip ticket says "Application not found" with no successful probes attempted at intervals
- Service was working "yesterday" and now isn't, but nothing was deployed
- Multiple downstream tickets are blocked on a single "Railway is dead" claim
- You're tempted to redeploy without first verifying the service still exists
What Railway serverless mode actually does
Railway projects on the serverless tier (default for hobby/Pro) sleep services after N minutes of zero traffic. The wake sequence:
- First HTTP request to a sleeping service hits Railway's edge → returns `404 {"status":"error","code":404,"message":"Application not found","request_id":"..."}` while the container boots
- Container boots in ~2-4 seconds (Node) or ~5-10s (Python with heavy deps)
- Subsequent requests resolve normally to the running app
The 404 body Railway returns during cold-start is identical to the 404 returned for a deleted application. There is no header or body field that distinguishes "sleeping" from "deleted." The only way to tell them apart is:
- API state check via `backboard.railway.app/graphql/v2` (auth required) — looks at the `service` and `deployments` records
- Retry with a wait — sleeping services resolve in seconds, deleted ones don't
The diagnosis runbook
Step 1 — never trust a single 404
If the only evidence you have is one HTTP probe returning `Application not found`, stop. That is not enough to declare a service dead. Run step 2.
Step 2 — wake-and-retry probe
# First request triggers wake; ignore its result.
curl -sS -o /dev/null --max-time 30 "https://api.example.up.railway.app/api/health"
# Wait for boot
sleep 5
# Real probe — this is the one that tells you if the service exists
curl -sS -w "\nHTTP %{http_code} time=%{time_total}s\n" --max-time 30 "https://api.example.up.railway.app/api/health"
If the second probe returns 2xx/4xx that is NOT "Application not found", the service is alive. The first 404 was a cold-miss.
If the second probe still returns Application not found, proceed to step 3.
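If a single 5s wait isn't enough (heavier Python boots), spacing the retries out costs little. A sketch — same placeholder URL as above, and the `Application not found` match is a crude string check:

```bash
URL="https://api.example.up.railway.app/api/health"
alive=""
for wait in 3 5 10; do
  sleep "$wait"
  body=$(curl -sS --max-time 30 "$URL" || true)
  case "$body" in
    *'"Application not found"'*) echo "still cold after +${wait}s" ;;
    *) alive=1; break ;;   # any other response (even an app-level error) proves the service exists
  esac
done
[ -n "$alive" ] && echo "service answered — the first 404 was a cold-miss" \
                || echo "still 'Application not found' after ~18s of retries — go to step 3"
```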
Step 3 — Railway API truth check (requires token)
You need a Railway account token (not a project token) to list projects; create one at https://railway.com/account/tokens. Account-token shape: a UUID, sent as an `Authorization: Bearer <token>` header. Project-token shape: also a UUID, but sent as a `Project-Access-Token: <token>` header, and it only works against the project it was minted for.
TOK='your-account-token-uuid'
# 1. Confirm the token is valid and identify the account
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOK" \
-d '{"query":"query { me { id email name } }"}'
# 2. List projects + services + environments
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOK" \
-d '{"query":"query { me { workspaces { name projects { edges { node { id name services { edges { node { id name } } } environments { edges { node { id name } } } } } } } } }"}'
# 3. Inspect deployments for a specific service
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOK" \
-d '{
"query": "query($p:String!,$e:String!,$s:String!){deployments(first:3,input:{projectId:$p,environmentId:$e,serviceId:$s}){edges{node{id status createdAt updatedAt staticUrl url canRedeploy}}}}",
"variables": {"p":"PROJECT_ID","e":"ENV_ID","s":"SERVICE_ID"}
}'
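Chaining query 2 into query 3 means extracting three UUIDs by hand; a jq flattener saves the round-trip (a sketch — the paths mirror the query shape above and assume `jq` is installed):

```bash
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOK" \
  -d '{"query":"query { me { workspaces { name projects { edges { node { id name services { edges { node { id name } } } environments { edges { node { id name } } } } } } } } }"}' \
  | jq -r '.data.me.workspaces[].projects.edges[].node as $p
           | $p.services.edges[].node as $s
           | $p.environments.edges[].node as $e
           | [$p.name, $s.name, $e.name, $p.id, $s.id, $e.id] | @tsv'
```

Each output row is `project service environment PROJECT_ID SERVICE_ID ENV_ID`, ready to paste into query 3's variables.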
Status interpretation:
| GraphQL status | Meaning | Action |
|---|---|---|
| SUCCESS | Last deploy succeeded, currently running | Service is alive — re-probe |
| SLEEPING | Idle, will wake on next request | Service is alive — re-probe |
| BUILDING | Build in progress | Wait, then probe |
| DEPLOYING | Build done, container starting | Wait, then probe |
| CRASHED | Container booted then died | Read logs, fix, redeploy |
| FAILED | Build failed | Read build logs |
| REMOVED | Deployment was deleted | Older versions only — check newer rows in same query |
| (no rows at all) | Service truly deleted or token lacks scope | Verify in dashboard |
If status is `SUCCESS` or `SLEEPING`, the service exists and the 404 was a cold-miss — re-probe and confirm. If status is `CRASHED`/`FAILED`, the work is real (logs + fix), not a redeploy of "what's gone."
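For scripted triage, the table collapses to a shell `case` (a sketch; `STATUS` holds the value from the deployments query above):

```bash
case "$STATUS" in
  SUCCESS|SLEEPING)   echo "alive — re-probe the public URL" ;;
  BUILDING|DEPLOYING) echo "in flight — wait, then probe again" ;;
  CRASHED)            echo "read runtime logs, fix, redeploy" ;;
  FAILED)             echo "read build logs" ;;
  REMOVED)            echo "old row — check newer rows in the same query" ;;
  *)                  echo "no usable rows — verify in dashboard" ;;
esac
```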
Pitfalls
1. **`me` query fails with project tokens.** `me` is account-scoped. If `query { me }` returns "Not Authorized" but `query { projectToken { projectId environmentId } }` works (with the `Project-Access-Token` header), you have a project token, not an account token. Account tokens see all projects; project tokens see only one.
2. **Cached CLI tokens (`~/.railway/config.json`) expire silently.** A `rw_*` 300-char token in the CLI config is the CLI session token, not an API token. They expire after long inactivity and you cannot tell from the file. Test with `railway whoami` first; if it says "Unauthorized", do not waste time debugging — get a fresh API token.
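   When a token's provenance is unknown (pitfalls 1 and 2 combined), a quick classifier helps — a sketch using the same queries and headers as above; the `grep` checks are crude:

   ```bash
   TOK='token-of-unknown-provenance'
   ME=$(curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
     -H "Content-Type: application/json" -H "Authorization: Bearer $TOK" \
     -d '{"query":"query { me { id } }"}')
   if ! printf '%s' "$ME" | grep -q '"errors"'; then
     echo "account token — sees all projects"
   else
     PT=$(curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
       -H "Content-Type: application/json" -H "Project-Access-Token: $TOK" \
       -d '{"query":"query { projectToken { projectId environmentId } }"}')
     printf '%s' "$PT" | grep -q '"errors"' \
       && echo "expired or invalid — get a fresh token" \
       || echo "project token — scoped to one project"
   fi
   ```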
3. **`*.up.railway.app` static URLs lie.** A service can have a `staticUrl` set in the API even when no deployment exists. Always check `deployments` rows, not just the static URL.
4. **Custom domains with no DNS update keep showing the old target.** When you add a new custom domain via `customDomainCreate`, Railway issues a NEW `requiredValue` CNAME target like `2x68toyo.up.railway.app`. The DNS record at the registrar still points to the OLD target, so cert provisioning silently never finishes. Always read the `dnsRecords` array returned by `customDomainCreate` and update the registrar accordingly.
5. **Apex domain on a third-party DNS host.** Don't assume a domain is in the user's Cloudflare account just because they have CF tokens. `dig +short NS domain.tld` will tell you. agentpact.xyz uses ns1/ns2.dns-parking.com — those nameservers are Hostinger's default parking DNS (NOT Namecheap, despite that domain string looking generic). Confirm the registrar via RDAP:

   ```bash
   curl -sS -H "Accept: application/json" "https://rdap.centralnic.com/xyz/domain/agentpact.xyz" \
     | jq -r '.entities[] | select(.roles[]=="registrar") | .vcardArray[1][] | select(.[0]=="fn") | .[3]'
   ```

   returns `HOSTINGER operations, UAB`. wisechef.ai is the only Adam-domain on Namecheap. DNS edits go to the appropriate registrar panel — or, when available, to the registrar's API (see companion skill `hostinger-dns-api`).
6. **Misdiagnosis chains in Paperclip.** If one ticket reports "Railway is dead" and four other tickets get blocked citing the first, fixing the first ticket unblocks all of them — but only if you re-comment and re-status each downstream one. Don't just close the upstream — propagate the correction.
7. **`watchPatterns` blocks auto-deploys.** Railway's GitHub webhook respects per-service `watchPatterns` in the deploy config. If a service is configured with `watchPatterns: ["/apps/api/**"]`, a PR that touches only `/migrations/`, `/scripts/`, or any path outside that glob will land on `main` but never trigger an auto-deploy. The merge succeeds, CI passes, GitHub shows the commit on main, and Railway sits on the prior deploy as if nothing happened. Symptom: `git log main --oneline` shows the new commit, the live API still serves old behavior, and `query { deployments }` shows the latest deploy is on the OLD commit hash.

   Verify the watch pattern by inspecting `meta.serviceManifest.build.watchPatterns` on any deployment row:

   ```
   query { deployment(id: "<dep-id>") { meta } }
   # → meta.serviceManifest.build.watchPatterns: ["/apps/api/**"]
   ```

   Either widen the pattern in Railway's service settings (Settings → Source → Watch Paths) — the permanent fix, since schema migrations and deploy scripts should auto-deploy — or trigger a one-off manual deploy with the GraphQL mutation below.
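   A quick way to confirm the symptom without the dashboard: compare Railway's deployed commit to `origin/main` (a sketch; reuses the deployments query and `meta.commitHash` field from the poll loop later in this doc — IDs are placeholders):

   ```bash
   LIVE=$(curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
     -H "Content-Type: application/json" -H "Authorization: Bearer $TOK" \
     -d "{\"query\":\"query { deployments(first: 1, input: { projectId: \\\"$PROJ\\\", environmentId: \\\"$ENV\\\", serviceId: \\\"$SVC\\\" }) { edges { node { meta } } } }\"}" \
     | python3 -c "import sys,json; n=json.load(sys.stdin)['data']['deployments']['edges'][0]['node']; print((n.get('meta') or {}).get('commitHash',''))")
   git fetch --quiet origin main
   [ "$LIVE" = "$(git rev-parse origin/main)" ] && echo "in sync" || echo "Railway is behind main — suspect watchPatterns"
   ```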
8. **`serviceInstanceDeployV2` with `commitSha` races against GitHub propagation.** When you've just merged a PR and want to force-deploy that specific commit, this fails:

   ```
   mutation { serviceInstanceDeployV2(serviceId: "...", environmentId: "...", commitSha: "abc1234") }
   # → deployment created but immediately FAILED:
   # "Failed to fetch specific commit: couldn't find remote ref \"abc1234\""
   ```

   Railway's internal git mirror has a 30-90 second propagation delay after GitHub. The deployment row shows `status=FAILED` with `meta.configErrors=["Failed to fetch specific commit..."]`.

   Fix: use `serviceInstanceDeploy` (V1) with `latestCommit: true` instead — it re-resolves HEAD on Railway's side at deploy time and works even when GitHub propagation is in flight:

   ```
   mutation { serviceInstanceDeploy(serviceId: "...", environmentId: "...", latestCommit: true) }
   # → returns true, builds correctly with the new commit
   ```

   Use V2+`commitSha` only when you specifically need to re-deploy a known-good older commit (rollback). For "deploy whatever's on main right now," V1+`latestCommit` is more reliable.
9. **Two-`migrations/`-folder footgun (deploy-script-vs-conventions drift).** A repo can accumulate two migration directories — one read by the runtime migrate script, one where some contributor put new files thinking it was the canonical location. Files in the second dir silently never run, and you only find out when an endpoint that depends on the missing schema returns `relation "X" does not exist` in production.

   Quick check before declaring "code is fine, must be Railway":

   ```bash
   # find every migrations/ dir in the repo
   find . -type d -name 'migrations' -not -path '*/node_modules/*'
   # what does the migrate script actually read?
   grep -nE "migrationsDir|readdir.*migration" scripts/migrate.ts apps/*/scripts/migrate.* 2>/dev/null
   # diff the dirs to find orphan migrations
   diff <(ls migrations/) <(ls apps/api/migrations/) 2>/dev/null
   ```

   If you find orphan migrations, copy them into the canonical dir at the next free sequence number; don't rename or move them (other deploy artifacts may reference the old paths). Add an `IF NOT EXISTS` guard if the migration doesn't already have one. Document the canonical dir in a `migrations/README.md` so future contributors don't re-create the drift.

   Validated 2026-04-25 on AgentPact: `/api/concierge/stats` returned 500 (`relation "concierge_messages" does not exist`) for ~2 weeks because `apps/api/migrations/020_concierge_relay.sql` was never run — the migrate script only reads the root `migrations/`. Fixed in PR #8 by copying the file to `migrations/033_concierge_relay.sql`.
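   A convenience helper for the "next free sequence number" step (a sketch; assumes the `NNN_name.sql` convention seen above):

   ```bash
   # highest existing numeric prefix in the canonical dir, plus one (10# strips leading zeros)
   last=$(ls migrations/ | grep -oE '^[0-9]+' | sort -n | tail -1)
   printf 'next free slot: %03d\n' $(( 10#$last + 1 ))
   ```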
The OTHER misdiagnosis: 503 timeout that LOOKS like Railway/cron but is app-level pool starvation
Validated 2026-04-27 on AgentPact. Cron agentpact-morning-check failed with SCRIPT TIMEOUT after 120s and the obvious read was "Railway is down again" — wrong. Pattern:
Symptoms that mimic Railway death but aren't
- `/health` returns 200 in ~0.5s (Railway is alive — it would 404 if dead)
- Every data endpoint returns HTTP 503 with body `{"error":"Request timeout — server is under load, please retry"}` after exactly 30s (or whatever your `REQUEST_TIMEOUT_MS` is set to)
- Even trivial endpoints (`/health/pool` running `SELECT 1`) hit the 30s ceiling
- Per-endpoint `statement_timeout`s (e.g. WIS-250's 4s `withBrowseStatementTimeout`) are bypassed — requests hang before reaching SQL
- Cron probes time out at the cron's own ceiling, not the app's, hiding the "exactly 30s" tell
Why 503 here ≠ Railway
The 503 is the app's own `onRequest` timeout middleware (`apps/api/src/index.ts` in Fastify projects):
```ts
const REQUEST_TIMEOUT_MS = 30_000;
app.addHook('onRequest', async (_request, reply) => {
  const timer = setTimeout(() => {
    if (!reply.sent) reply.code(503).send({ error: 'Request timeout — server is under load, please retry' });
  }, REQUEST_TIMEOUT_MS);
  // ...
});
```
Railway never sees a hang — the app emits the 503 itself. Restarting via Railway does fix it (clears leaked connections), which makes the misdiagnosis sticky: "restart = Railway was bad."
Diagnosis runbook
- Probe `/health` AND a data endpoint side-by-side (see the probe sketch after this list). If `/health` is 200 fast and data endpoints 503 at exactly the timeout ceiling → app-level, not Railway.
- Probe the pool canary. If `/health/pool` (or whatever runs `SELECT 1`) also times out → Postgres connection-pool starvation, not slow queries.
- `pg_stat_activity` snapshot on Supabase/Postgres:

  ```sql
  SELECT pid, state, now() - xact_start AS age, wait_event, left(query, 120)
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY age DESC NULLS LAST
  LIMIT 20;
  ```

  Anything > 2 min holding a tx is the leak. `pg_terminate_backend(pid)` for emergency relief.
- Audit `sql.begin()` exit paths in suspect files. Common leak sources:
  - Single-flight queues that don't release on exception
  - Concierge / outbound-relay code that does external HTTP inside a transaction
  - Recently-merged batch/embedding work that holds connections across iterations
  - `try { await sql.begin(...) }` with no `finally` releasing on cancel/abort
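A minimal side-by-side probe for the first two steps (a sketch; hostname and data path are placeholders):

```bash
BASE="https://api.example.up.railway.app"
for p in /api/health /health/pool /api/concierge/stats; do
  curl -sS -o /dev/null -w "$p → HTTP %{http_code} in %{time_total}s\n" --max-time 35 "$BASE$p"
done
# /api/health 200 fast + the others pegged at ~30s → pool starvation; everything 404 → back to the cold-start runbook
```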
Quick triage fix vs. permanent fix
- Mitigation: restart the API service (any platform — Railway, Fly, Heroku). Clears leaked connections, site comes back. Does not fix the leak.
- Permanent: find the code path that opens a tx and exits without release. Pool `max=N` means the issue is inevitable after N leaks — only the cadence varies with traffic.
Why this matters for autonomous agents
Cron agents tend to write "SCRIPT TIMEOUT → Railway issue" when their probe budget exceeds the app's own 503 ceiling. The 30s-on-the-nose timing is the giveaway. Add a side-by-side `/health` probe to any monitoring script — it's the cheapest possible discriminator between platform-down (everything 404/5xx fast) and app-pool-starved (`/health` fine, data endpoints timeout-pegged).
Pitfall: don't restart blind on recurrence
If a service needs restarting more than ~weekly to come back, the connection leak is real and getting worse. Don't normalize the restart — that's evolution-via-bandaid. Profile `pg_stat_activity` during the next degradation window, identify the leaking code path, and fix it.
Bonus: keep-alive cron pattern (optional)
If cold-start 404s are a recurring source of misdiagnosis or a real customer-experience problem, a 4-hourly cron pinging each service's `/health` endpoint keeps them warm. Add to `~/.hermes/cron/jobs.json` with:
for svc in api mcp web; do
curl -sS -o /dev/null --max-time 15 "https://${svc}.example.com/health" || true
done
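If the hermes runner isn't available, a plain crontab entry does the same job (standard cron syntax; hostnames are placeholders):

```
# every 4 hours, warm each service
0 */4 * * * for svc in api mcp web; do curl -sS -o /dev/null --max-time 15 "https://${svc}.example.com/health" || true; done
```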
Cost of the cron itself: zero. Trade-off: lose some serverless cost savings on truly-idle services, gain consistent customer-facing latency and one less failure mode for autonomous agents to misdiagnose.
Force a deploy via GraphQL (when auto-deploy is blocked)
When `watchPatterns`, GitHub webhook delays, or any other reason prevents auto-deploy after a merge to `main`, force a deploy from the API:
TOK='your-account-token-uuid'
PROJ='project-uuid' ; ENV='env-uuid' ; SVC='service-uuid'
# Preferred: deploy whatever's currently on main (HEAD-resolves on Railway side)
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Authorization: Bearer $TOK" -H "Content-Type: application/json" \
-d "{\"query\":\"mutation { serviceInstanceDeploy(serviceId: \\\"$SVC\\\", environmentId: \\\"$ENV\\\", latestCommit: true) }\"}"
# returns: { "data": { "serviceInstanceDeploy": true } }
# Then poll deployment status until SUCCESS:
for i in 1 2 3 4 5 6 7 8 9 10; do
sleep 30
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Authorization: Bearer $TOK" -H "Content-Type: application/json" \
-d "{\"query\":\"query { deployments(first: 1, input: { projectId: \\\"$PROJ\\\", environmentId: \\\"$ENV\\\", serviceId: \\\"$SVC\\\" }) { edges { node { status meta } } } }\"}" \
| python3 -c "import sys,json; d=json.load(sys.stdin); n=d['data']['deployments']['edges'][0]['node']; print(n['status'], (n.get('meta') or {}).get('commitHash','')[:8])"
done
Avoid `serviceInstanceDeployV2` with `commitSha` for fresh-merge deploys (see pitfall #8 above). It is correct only for known-good older commits.
If the manual deploy lands on `FAILED` immediately, query the deployment object for `meta.configErrors` — that's where Railway puts the human-readable reason:
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Authorization: Bearer $TOK" -H "Content-Type: application/json" \
-d "{\"query\":\"query { deployment(id: \\\"DEP_ID\\\") { id status meta } }\"}" \
| python3 -c "import sys,json; d=json.load(sys.stdin); m=d['data']['deployment'].get('meta') or {}; print(m.get('configErrors')); print(m.get('serviceManifest',{}).get('build',{}).get('watchPatterns'))"
`buildLogs` and `deploymentLogs` queries return empty for deployments that didn't reach the build stage — the `configErrors` in `meta` is your only signal in those cases.
Verification — definition of done
- Re-probed the suspected-dead service after a wake delay
- Confirmed via Railway GraphQL that the service exists and `status=SUCCESS|SLEEPING`
- If applicable, fixed up Paperclip tickets that chained off the misdiagnosis (status, comments, downstream blockers)
- If the issue recurs, added a keep-alive cron OR documented the cold-miss behavior in the team's diagnosis playbook