railway-serverless-diagnose
Why this skill exists: On 2026-04-24, AgentPact had three Paperclip tickets (WIS-253, WIS-258, WIS-310) blocked for days with the diagnosis "Railway deployment returns 404 'Application not found' on ALL endpoints. Entire API unreachable. All code is correct and merged to main, but no external consumer can reach it." This was wrong. The services were healthy. They were sleeping. The first probe got the wake-window 404; nobody retried, so the 404 became "dead" in Paperclip and downstream tickets all chained off that misdiagnosis.
The pattern repeats whenever a low-traffic Railway project is probed by an autonomous agent that doesn't know about serverless sleep.
When to use
- An autonomous agent or human reports "Railway service returns 404 on every endpoint"
- Paperclip ticket says "Application not found" with no successful probes attempted at intervals
- Service was working "yesterday" and now isn't, but nothing was deployed
- Multiple downstream tickets are blocked on a single "Railway is dead" claim
- You're tempted to redeploy without first verifying the service still exists
What Railway serverless mode actually does
Railway projects on the serverless tier (default for hobby/Pro) sleep services after N minutes of zero traffic. The wake sequence:
- First HTTP request to a sleeping service hits Railway's edge → returns `404 {"status":"error","code":404,"message":"Application not found","request_id":"..."}` while the container boots
- Container boots in ~2-4 seconds (Node) or ~5-10s (Python with heavy deps)
- Subsequent requests resolve normally to the running app
The 404 body Railway returns during cold-start is identical to the 404 returned for a deleted application. There is no header or body field that distinguishes "sleeping" from "deleted." The only way to tell them apart is:
- API state check via `backboard.railway.app/graphql/v2` (auth required) — looks at the `service` and `deployments` records
- Retry with a wait — sleeping services resolve in seconds, deleted ones don't
The diagnosis runbook
Step 1 — never trust a single 404
If the only evidence you have is one HTTP probe returning `Application not found`, stop. That is not enough to declare a service dead. Run step 2.
Step 2 — wake-and-retry probe
# First request triggers wake; ignore its result.
curl -sS -o /dev/null --max-time 30 "https://api.example.up.railway.app/api/health"
# Wait for boot
sleep 5
# Real probe — this is the one that tells you if the service exists
curl -sS -w "\nHTTP %{http_code} time=%{time_total}s\n" --max-time 30 "https://api.example.up.railway.app/api/health"
If the second probe returns 2xx/4xx that is NOT "Application not found", the service is alive. The first 404 was a cold-miss.
If the second probe still returns Application not found, proceed to step 3.
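If a single 5s wait isn't enough (heavier Python boots), spacing the retries out costs little. A sketch — same placeholder URL as above, and the `Application not found` match is a crude string check:

```bash
URL="https://api.example.up.railway.app/api/health"
alive=""
for wait in 3 5 10; do
  sleep "$wait"
  body=$(curl -sS --max-time 30 "$URL" || true)
  case "$body" in
    *'"Application not found"'*) echo "still cold after +${wait}s" ;;
    *) alive=1; break ;;   # any other response (even an app-level error) proves the service exists
  esac
done
[ -n "$alive" ] && echo "service answered — the first 404 was a cold-miss" \
                || echo "still 'Application not found' after ~18s of retries — go to step 3"
```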
Step 3 — Railway API truth check (requires token)
You need a Railway account token (not a project token) to list projects; create one at https://railway.com/account/tokens. Account-token shape: a UUID, sent as an `Authorization: Bearer <token>` header. Project-token shape: also a UUID, but sent as a `Project-Access-Token: <token>` header, and it only works against the project it was minted for.
TOK='your-account-token-uuid'
# 1. Confirm the token is valid and identify the account
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOK" \
-d '{"query":"query { me { id email name } }"}'
# 2. List projects + services + environments
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOK" \
-d '{"query":"query { me { workspaces { name projects { edges { node { id name services { edges { node { id name } } } environments { edges { node { id name } } } } } } } } }"}'
# 3. Inspect deployments for a specific service
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOK" \
-d '{
"query": "query($p:String!,$e:String!,$s:String!){deployments(first:3,input:{projectId:$p,environmentId:$e,serviceId:$s}){edges{node{id status createdAt updatedAt staticUrl url canRedeploy}}}}",
"variables": {"p":"PROJECT_ID","e":"ENV_ID","s":"SERVICE_ID"}
}'
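Chaining query 2 into query 3 means extracting three UUIDs by hand; a jq flattener saves the round-trip (a sketch — the paths mirror the query shape above and assume `jq` is installed):

```bash
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOK" \
  -d '{"query":"query { me { workspaces { name projects { edges { node { id name services { edges { node { id name } } } environments { edges { node { id name } } } } } } } } }"}' \
  | jq -r '.data.me.workspaces[].projects.edges[].node as $p
           | $p.services.edges[].node as $s
           | $p.environments.edges[].node as $e
           | [$p.name, $s.name, $e.name, $p.id, $s.id, $e.id] | @tsv'
```

Each output row is `project service environment PROJECT_ID SERVICE_ID ENV_ID`, ready to paste into query 3's variables.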
Status interpretation:
| GraphQL status | Meaning | Action |
|---|---|---|
| SUCCESS | Last deploy succeeded, currently running | Service is alive — re-probe |
| SLEEPING | Idle, will wake on next request | Service is alive — re-probe |
| BUILDING | Build in progress | Wait, then probe |
| DEPLOYING | Build done, container starting | Wait, then probe |
| CRASHED | Container booted then died | Read logs, fix, redeploy |
| FAILED | Build failed | Read build logs |
| REMOVED | Deployment was deleted | Older versions only — check newer rows in same query |
| (no rows at all) | Service truly deleted or token lacks scope | Verify in dashboard |
If status is `SUCCESS` or `SLEEPING`, the service exists and the 404 was a cold-miss — re-probe and confirm. If status is `CRASHED`/`FAILED`, the work is real (logs + fix), not a redeploy of "what's gone."
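For scripted triage, the table collapses to a shell `case` (a sketch; `STATUS` holds the value from the deployments query above):

```bash
case "$STATUS" in
  SUCCESS|SLEEPING)   echo "alive — re-probe the public URL" ;;
  BUILDING|DEPLOYING) echo "in flight — wait, then probe again" ;;
  CRASHED)            echo "read runtime logs, fix, redeploy" ;;
  FAILED)             echo "read build logs" ;;
  REMOVED)            echo "old row — check newer rows in the same query" ;;
  *)                  echo "no usable rows — verify in dashboard" ;;
esac
```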
Pitfalls
1. **`me` query fails with project tokens.** `me` is account-scoped. If `query { me }` returns "Not Authorized" but `query { projectToken { projectId environmentId } }` works (with the `Project-Access-Token` header), you have a project token, not an account token. Account tokens see all projects; project tokens see only one.
2. **Cached CLI tokens (`~/.railway/config.json`) expire silently.** A `rw_*` 300-char token in the CLI config is the CLI session token, not an API token. They expire after long inactivity and you cannot tell from the file. Test with `railway whoami` first; if it says "Unauthorized", do not waste time debugging — get a fresh API token.
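   When a token's provenance is unknown (pitfalls 1 and 2 combined), a quick classifier helps — a sketch using the same queries and headers as above; the `grep` checks are crude:

   ```bash
   TOK='token-of-unknown-provenance'
   ME=$(curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
     -H "Content-Type: application/json" -H "Authorization: Bearer $TOK" \
     -d '{"query":"query { me { id } }"}')
   if ! printf '%s' "$ME" | grep -q '"errors"'; then
     echo "account token — sees all projects"
   else
     PT=$(curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
       -H "Content-Type: application/json" -H "Project-Access-Token: $TOK" \
       -d '{"query":"query { projectToken { projectId environmentId } }"}')
     printf '%s' "$PT" | grep -q '"errors"' \
       && echo "expired or invalid — get a fresh token" \
       || echo "project token — scoped to one project"
   fi
   ```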
3. **`*.up.railway.app` static URLs lie.** A service can have a `staticUrl` set in the API even when no deployment exists. Always check `deployments` rows, not just the static URL.
4. **Custom domains with no DNS update keep showing the old target.** When you add a new custom domain via `customDomainCreate`, Railway issues a NEW `requiredValue` CNAME target like `2x68toyo.up.railway.app`. The DNS record at the registrar still points to the OLD target, so cert provisioning silently never finishes. Always read the `dnsRecords` array returned by `customDomainCreate` and update the registrar accordingly.
5. **Apex domain on a third-party DNS host.** Don't assume a domain is in the user's Cloudflare account just because they have CF tokens. `dig +short NS domain.tld` will tell you. agentpact.xyz uses ns1/ns2.dns-parking.com — those nameservers are Hostinger's default parking DNS (NOT Namecheap, despite that domain string looking generic). Confirm the registrar via RDAP:

   ```bash
   curl -sS -H "Accept: application/json" "https://rdap.centralnic.com/xyz/domain/agentpact.xyz" \
     | jq -r '.entities[] | select(.roles[]=="registrar") | .vcardArray[1][] | select(.[0]=="fn") | .[3]'
   ```

   returns `HOSTINGER operations, UAB`. wisechef.ai is the only Adam-domain on Namecheap. DNS edits go to the appropriate registrar panel — or, when available, to the registrar's API (see companion skill `hostinger-dns-api`).
6. **Misdiagnosis chains in Paperclip.** If one ticket reports "Railway is dead" and four other tickets get blocked citing the first, fixing the first ticket unblocks all of them — but only if you re-comment and re-status each downstream one. Don't just close the upstream — propagate the correction.
7. **`watchPatterns` blocks auto-deploys.** Railway's GitHub webhook respects per-service `watchPatterns` in the deploy config. If a service is configured with `watchPatterns: ["/apps/api/**"]`, a PR that touches only `/migrations/`, `/scripts/`, or any path outside that glob will land on `main` but never trigger an auto-deploy. The merge succeeds, CI passes, GitHub shows the commit on main, and Railway sits on the prior deploy as if nothing happened. Symptom: `git log main --oneline` shows the new commit, the live API still serves old behavior, and `query { deployments }` shows the latest deploy is on the OLD commit hash.

   Verify the watch pattern by inspecting `meta.serviceManifest.build.watchPatterns` on any deployment row:

   ```
   query { deployment(id: "<dep-id>") { meta } }
   # → meta.serviceManifest.build.watchPatterns: ["/apps/api/**"]
   ```

   Either widen the pattern in Railway's service settings (Settings → Source → Watch Paths) — the permanent fix, since schema migrations and deploy scripts should auto-deploy — or trigger a one-off manual deploy with the GraphQL mutation below.
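   A quick way to confirm the symptom without the dashboard: compare Railway's deployed commit to `origin/main` (a sketch; reuses the deployments query and `meta.commitHash` field from the poll loop later in this doc — IDs are placeholders):

   ```bash
   LIVE=$(curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
     -H "Content-Type: application/json" -H "Authorization: Bearer $TOK" \
     -d "{\"query\":\"query { deployments(first: 1, input: { projectId: \\\"$PROJ\\\", environmentId: \\\"$ENV\\\", serviceId: \\\"$SVC\\\" }) { edges { node { meta } } } }\"}" \
     | python3 -c "import sys,json; n=json.load(sys.stdin)['data']['deployments']['edges'][0]['node']; print((n.get('meta') or {}).get('commitHash',''))")
   git fetch --quiet origin main
   [ "$LIVE" = "$(git rev-parse origin/main)" ] && echo "in sync" || echo "Railway is behind main — suspect watchPatterns"
   ```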
8. **`serviceInstanceDeployV2` with `commitSha` races against GitHub propagation.** When you've just merged a PR and want to force-deploy that specific commit, this fails:

   ```
   mutation { serviceInstanceDeployV2(serviceId: "...", environmentId: "...", commitSha: "abc1234") }
   # → deployment created but immediately FAILED:
   # "Failed to fetch specific commit: couldn't find remote ref \"abc1234\""
   ```

   Railway's internal git mirror has a 30-90 second propagation delay after GitHub. The deployment row shows `status=FAILED` with `meta.configErrors=["Failed to fetch specific commit..."]`.

   Fix: use `serviceInstanceDeploy` (V1) with `latestCommit: true` instead — it re-resolves HEAD on Railway's side at deploy time and works even when GitHub propagation is in flight:

   ```
   mutation { serviceInstanceDeploy(serviceId: "...", environmentId: "...", latestCommit: true) }
   # → returns true, builds correctly with the new commit
   ```

   Use V2+`commitSha` only when you specifically need to re-deploy a known-good older commit (rollback). For "deploy whatever's on main right now," V1+`latestCommit` is more reliable.
9. **Two-`migrations/`-folder footgun (deploy-script-vs-conventions drift).** A repo can accumulate two migration directories — one read by the runtime migrate script, one where some contributor put new files thinking it was the canonical location. Files in the second dir silently never run, and you only find out when an endpoint that depends on the missing schema returns `relation "X" does not exist` in production.

   Quick check before declaring "code is fine, must be Railway":

   ```bash
   # find every migrations/ dir in the repo
   find . -type d -name 'migrations' -not -path '*/node_modules/*'
   # what does the migrate script actually read?
   grep -nE "migrationsDir|readdir.*migration" scripts/migrate.ts apps/*/scripts/migrate.* 2>/dev/null
   # diff the dirs to find orphan migrations
   diff <(ls migrations/) <(ls apps/api/migrations/) 2>/dev/null
   ```

   If you find orphan migrations, copy them into the canonical dir at the next free sequence number; don't rename or move them (other deploy artifacts may reference the old paths). Add an `IF NOT EXISTS` guard if the migration doesn't already have one. Document the canonical dir in a `migrations/README.md` so future contributors don't re-create the drift.

   Validated 2026-04-25 on AgentPact: `/api/concierge/stats` returned 500 (`relation "concierge_messages" does not exist`) for ~2 weeks because `apps/api/migrations/020_concierge_relay.sql` was never run — the migrate script only reads the root `migrations/`. Fixed in PR #8 by copying the file to `migrations/033_concierge_relay.sql`.
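   A convenience helper for the "next free sequence number" step (a sketch; assumes the `NNN_name.sql` convention seen above):

   ```bash
   # highest existing numeric prefix in the canonical dir, plus one (10# strips leading zeros)
   last=$(ls migrations/ | grep -oE '^[0-9]+' | sort -n | tail -1)
   printf 'next free slot: %03d\n' $(( 10#$last + 1 ))
   ```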
The OTHER misdiagnosis: 503 timeout that LOOKS like Railway/cron but is app-level pool starvation
Validated 2026-04-27 on AgentPact. Cron agentpact-morning-check failed with SCRIPT TIMEOUT after 120s and the obvious read was "Railway is down again" — wrong. Pattern:
Symptoms that mimic Railway death but aren't
- `/health` returns 200 in ~0.5s (Railway is alive — it would 404 if dead)
- Every data endpoint returns HTTP 503 with body `{"error":"Request timeout — server is under load, please retry"}` after exactly 30s (or whatever your `REQUEST_TIMEOUT_MS` is set to)
- Even trivial endpoints (`/health/pool` running `SELECT 1`) hit the 30s ceiling
- Per-endpoint `statement_timeout`s (e.g. WIS-250's 4s `withBrowseStatementTimeout`) are bypassed — requests hang before reaching SQL
- Cron probes time out at the cron's own ceiling, not the app's, hiding the "exactly 30s" tell
Why 503 here ≠ Railway
The 503 is the app's own `onRequest` timeout middleware (`apps/api/src/index.ts` in Fastify projects):
```ts
const REQUEST_TIMEOUT_MS = 30_000;
app.addHook('onRequest', async (_request, reply) => {
  const timer = setTimeout(() => {
    if (!reply.sent) reply.code(503).send({ error: 'Request timeout — server is under load, please retry' });
  }, REQUEST_TIMEOUT_MS);
  // ...
});
```
Railway never sees a hang — the app emits the 503 itself. Restarting via Railway does fix it (clears leaked connections), which makes the misdiagnosis sticky: "restart = Railway was bad."
Diagnosis runbook
- Probe `/health` AND a data endpoint side-by-side (see the probe sketch after this list). If `/health` is 200 fast and data endpoints 503 at exactly the timeout ceiling → app-level, not Railway.
- Probe the pool canary. If `/health/pool` (or whatever runs `SELECT 1`) also times out → Postgres connection-pool starvation, not slow queries.
- `pg_stat_activity` snapshot on Supabase/Postgres:

  ```sql
  SELECT pid, state, now() - xact_start AS age, wait_event, left(query, 120)
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY age DESC NULLS LAST
  LIMIT 20;
  ```

  Anything > 2 min holding a tx is the leak. `pg_terminate_backend(pid)` for emergency relief.
- Audit `sql.begin()` exit paths in suspect files. Common leak sources:
  - Single-flight queues that don't release on exception
  - Concierge / outbound-relay code that does external HTTP inside a transaction
  - Recently-merged batch/embedding work that holds connections across iterations
  - `try { await sql.begin(...) }` with no `finally` releasing on cancel/abort
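A minimal side-by-side probe for the first two steps (a sketch; hostname and data path are placeholders):

```bash
BASE="https://api.example.up.railway.app"
for p in /api/health /health/pool /api/concierge/stats; do
  curl -sS -o /dev/null -w "$p → HTTP %{http_code} in %{time_total}s\n" --max-time 35 "$BASE$p"
done
# /api/health 200 fast + the others pegged at ~30s → pool starvation; everything 404 → back to the cold-start runbook
```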
Quick triage fix vs. permanent fix
- Mitigation: restart the API service (any platform — Railway, Fly, Heroku). Clears leaked connections, site comes back. Does not fix the leak.
- Permanent: find the code path that opens a tx and exits without release. Pool `max=N` means the issue is inevitable after N leaks — only the cadence varies with traffic.
Why this matters for autonomous agents
Cron agents tend to write "SCRIPT TIMEOUT → Railway issue" when their probe budget exceeds the app's own 503 ceiling. The 30s-on-the-nose timing is the giveaway. Add a side-by-side `/health` probe to any monitoring script — it's the cheapest possible discriminator between platform-down (everything 404/5xx fast) and app-pool-starved (`/health` fine, data endpoints timeout-pegged).
Pitfall: don't restart blind on recurrence
If a service needs restarting more than ~weekly to come back, the connection leak is real and getting worse. Don't normalize the restart — that's evolution-via-bandaid. Profile `pg_stat_activity` during the next degradation window, identify the leaking code path, and fix it.
Bonus: keep-alive cron pattern (optional)
If cold-start 404s are a recurring source of misdiagnosis or a real customer-experience problem, a 4-hourly cron pinging each service's `/health` endpoint keeps them warm. Add to `~/.hermes/cron/jobs.json` with:
for svc in api mcp web; do
curl -sS -o /dev/null --max-time 15 "https://${svc}.example.com/health" || true
done
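If the hermes runner isn't available, a plain crontab entry does the same job (standard cron syntax; hostnames are placeholders):

```
# every 4 hours, warm each service
0 */4 * * * for svc in api mcp web; do curl -sS -o /dev/null --max-time 15 "https://${svc}.example.com/health" || true; done
```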
Cost of the cron itself: zero. Trade-off: lose some serverless cost savings on truly-idle services, gain consistent customer-facing latency and one less failure mode for autonomous agents to misdiagnose.
Force a deploy via GraphQL (when auto-deploy is blocked)
When `watchPatterns`, GitHub webhook delays, or any other reason prevents auto-deploy after a merge to `main`, force a deploy from the API:
TOK='your-account-token-uuid'
PROJ='project-uuid' ; ENV='env-uuid' ; SVC='service-uuid'
# Preferred: deploy whatever's currently on main (HEAD-resolves on Railway side)
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Authorization: Bearer $TOK" -H "Content-Type: application/json" \
-d "{\"query\":\"mutation { serviceInstanceDeploy(serviceId: \\\"$SVC\\\", environmentId: \\\"$ENV\\\", latestCommit: true) }\"}"
# returns: { "data": { "serviceInstanceDeploy": true } }
# Then poll deployment status until SUCCESS:
for i in 1 2 3 4 5 6 7 8 9 10; do
sleep 30
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Authorization: Bearer $TOK" -H "Content-Type: application/json" \
-d "{\"query\":\"query { deployments(first: 1, input: { projectId: \\\"$PROJ\\\", environmentId: \\\"$ENV\\\", serviceId: \\\"$SVC\\\" }) { edges { node { status meta } } } }\"}" \
| python3 -c "import sys,json; d=json.load(sys.stdin); n=d['data']['deployments']['edges'][0]['node']; print(n['status'], (n.get('meta') or {}).get('commitHash','')[:8])"
done
Avoid `serviceInstanceDeployV2` with `commitSha` for fresh-merge deploys (see pitfall #8 above). It is correct only for known-good older commits.
If the manual deploy lands on `FAILED` immediately, query the deployment object for `meta.configErrors` — that's where Railway puts the human-readable reason:
curl -sS -X POST "https://backboard.railway.app/graphql/v2" \
-H "Authorization: Bearer $TOK" -H "Content-Type: application/json" \
-d "{\"query\":\"query { deployment(id: \\\"DEP_ID\\\") { id status meta } }\"}" \
| python3 -c "import sys,json; d=json.load(sys.stdin); m=d['data']['deployment'].get('meta') or {}; print(m.get('configErrors')); print(m.get('serviceManifest',{}).get('build',{}).get('watchPatterns'))"
`buildLogs` and `deploymentLogs` queries return empty for deployments that didn't reach the build stage — the `configErrors` in `meta` is your only signal in those cases.
Verification — definition of done
- Re-probed the suspected-dead service after a wake delay
- Confirmed via Railway GraphQL that the service exists and `status=SUCCESS|SLEEPING`
- If applicable, fixed up Paperclip tickets that chained off the misdiagnosis (status, comments, downstream blockers)
- If the issue recurs, added a keep-alive cron OR documented the cold-miss behavior in the team's diagnosis playbook