PostgreSQL Outage Blast Radius
How a single database failure cascades across the platform's domains.
- 1. User request enters the portal
- 2. NestJS API handles the request (sync_call)
- 3. Ring 0: PostgreSQL unavailable
Madhuri Ganta's org
Architecture dossier
The root board of this workspace, serving as the central hub for viewing, organizing, and navigating all systems, entities, and connections.
9 findings · 3 traced paths across this system.
3 findings: 1 critical, 2 high, 0 medium, 0 low. Polarities: 3 risks, 0 strengths, 0 opportunities, 0 observations. 2 paths traced, 1 suggestion.
Two structural single points of failure dominate the rootboard: the unreplicated PostgreSQL instance and the single Lambda handler that fronts the entire API.
Element: postgres-db
Nine of ten backend domains plus NextAuth depend on one PostgreSQL/pgvector instance. No replica or standby appears in the graph.
Recommendation: Run multi-AZ with a read replica; route dashboard reads to the replica; define and test RTO/RPO.
Element: lambda-runtime · File: apps/server/serverless.yml
The whole NestJS app is one dist/main.handler behind ANY /{proxy+}. Every domain shares one function's concurrency and cold-start budget.
Recommendation: Reserved/provisioned concurrency, per-route timeouts; split auth + billing webhooks into dedicated functions.
Element: sso-platform-auth
Every authenticated request depends on sso-platform-auth, which depends on the IdP and the database. A DB or auth outage blocks all new logins.
Recommendation: Circuit breakers + timeouts on IdP/Cognito calls; graceful-degradation messaging.
web → nestjs-api → postgres-db (Ring 0) branching into auth lockout, dashboard failure, billing/webhook backup, and onboarding loss. Nothing stops the cascade — the blast boundary is the whole platform.
web → nestjs-api → lambda-runtime. One function, no per-domain isolation.
add node — Add a PostgreSQL read replica: removes the DB as an absolute read SPOF and shrinks the Ring 1 blast radius.
How a single database failure cascades across the platform's domains.
All portal traffic funnels through one Lambda function.
Nine of ten backend domains plus NextAuth read or write the single PostgreSQL/pgvector instance (billing, deployment, container-images, sso, onboarding, portal, org-members, user-profile, curation, and nextjs-web all have edges into postgres-db). There is no evidence of a read replica or multi-AZ standby in the graph.
Impact: A database outage takes down auth, dashboard, billing, and onboarding simultaneously — a full platform outage with no partial degradation.
→ Run PostgreSQL with a multi-AZ standby and a read replica; route dashboard reads to the replica and define an RTO/RPO target with a tested restore runbook.
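The read-routing part of this recommendation can be sketched as a small selection function. This is a minimal illustration, not the platform's code: `DbTargets`, `QueryIntent`, and `pickTarget` are hypothetical names, and the health flag is assumed to come from some external check. Note that replica reads can lag the primary, so only reads that tolerate slightly stale data (such as dashboards) should be routed this way.

```typescript
// Hypothetical names throughout; the source graph shows no replica today.
type QueryIntent = "read" | "write";

interface DbTargets<T> {
  primary: T;
  replica?: T; // absent until a replica is provisioned
}

// Writes always go to the primary. Reads prefer a healthy replica and fall
// back to the primary, so losing the replica does not break dashboards.
function pickTarget<T>(
  targets: DbTargets<T>,
  intent: QueryIntent,
  replicaHealthy: boolean,
): T {
  if (intent === "read" && targets.replica !== undefined && replicaHealthy) {
    return targets.replica;
  }
  return targets.primary;
}
```

With this shape, a primary outage still blocks writes, but read-only views can keep serving from the replica if the caller routes them there explicitly.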
The whole NestJS app is bundled as a single dist/main.handler behind an httpApi catch-all (ANY /{proxy+}), confirmed in apps/server/serverless.yml. Every domain shares one function's concurrency, memory, and cold-start budget.
Impact: A poison request, memory leak, or concurrency throttle in any one domain degrades every endpoint at once; there is no per-domain isolation.
→ Set reserved/provisioned concurrency and per-route timeouts; consider splitting hot or risky domains (auth, billing webhooks) into dedicated functions.
sso-platform-auth sits between nextjs-web and both the IdP and PostgreSQL. Because every authenticated request depends on it, an auth or DB failure blocks all logins — existing sessions may survive but no new access is possible.
Impact: During a DB or auth outage, users cannot sign in at all; the blast radius is the entire user base.
→ Add circuit breakers and timeouts on the IdP and Cognito calls, and graceful-degradation messaging so the portal fails informatively rather than hanging.
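One way to combine a timeout with a circuit breaker is sketched below. It is an illustration under stated assumptions rather than the platform's implementation: `withTimeout`, `CircuitBreaker`, and `BreakerOptions` are hypothetical names, the thresholds are arbitrary, and the wrapped function stands in for whatever IdP or Cognito call the service makes.

```typescript
// Rejects if `work` does not settle within `ms`. The underlying call is not
// cancelled; only the caller stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the circuit opens
  resetAfterMs: number;     // how long to fail fast before allowing a trial call
  timeoutMs: number;        // per-call timeout
  now?: () => number;       // injectable clock, useful in tests
}

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
    this.now = opts.now ?? (() => Date.now());
  }

  get isOpen(): boolean {
    return this.openedAt !== null && this.now() - this.openedAt < this.opts.resetAfterMs;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // While open, fail fast without touching the dependency.
    if (this.isOpen) throw new Error("circuit open: failing fast");
    try {
      const result = await withTimeout(fn(), this.opts.timeoutMs);
      this.failures = 0;
      this.openedAt = null; // a success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.opts.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

When the breaker rejects with the fail-fast error, the portal can show a "sign-in is temporarily unavailable" message immediately instead of leaving the request hanging.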
In one sentence: Onboarding is a client-driven, four-step REST flow on OnboardingController (/user/*) that progressively provisions a user's profile, an organisation + first-user membership, a scored qualification record, and finally a subscription + deployment — sending a fire-and-forget notification to the support team at the end.
| Step | Endpoint | What OnboardingService does |
|---|---|---|
| 1 | POST /user/save-profile-precis | Sets name + userHandle on the user profile. (GET /user/isUserHandleAvailable backs the live check — a handle is rejected if taken as a user or org handle.) |
| 2 | POST /user/setup-solo-handle or setup-organisation-handle | Creates the organisation (kind solo-user or org) and the first org_user membership, then lists the public profile. |
| 3 | POST /user/save-qualification | Scores the answers (computeQualificationScore), inserts an onboarding_qualification row, mirrors the role onto userProfile.title, and writes a summary onto organisation.businessDescription. (save-business-information is the deprecated predecessor.) |
| 4 | POST /user/setup-deployment-choice | Creates a subscription + deployment, updates subscriptionStatus, and fires the onboarding-complete email. |
Every mutating step first runs assertNotInvitedMember — invited (non-admin) members are rejected with 403.
- BILLING_MODE env: stripe-trial -> FREE plan, immediately ACTIVE; founder (default) -> FOUNDER plan with a 6-month preview, status WAITLIST unless the email is in ROOT_ADMIN_EMAILS (then ACTIVE).
- SAAS -> deployment COMPLETED with a platform URL; self-hosted -> PENDING.
- Step 2 re-implements OrganisationService.createOrganisation inline against the repositories (how-002).
- Partial failure orphans rows (how-003).
- organisation.businessDescription doubles as a progress flag (how-006): a semantic field overloaded with flow-control meaning.
- The qualification scoring (how-004) is a clean, pure, testable function, and the completion email (how-005) is correctly non-blocking.
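The billing-mode and deployment-status rules above can be written as two pure functions. This is a sketch under assumptions: the function names, the `PlanDecision` shape, and the exact-match email comparison are hypothetical, and any BILLING_MODE value other than stripe-trial is treated as the founder default.

```typescript
interface PlanDecision {
  plan: "FREE" | "FOUNDER";
  status: "ACTIVE" | "WAITLIST";
  previewMonths?: number;
}

// Maps the BILLING_MODE value and the user's email to a plan and status.
function resolvePlan(
  billingMode: string | undefined,
  email: string,
  rootAdminEmails: readonly string[],
): PlanDecision {
  if (billingMode === "stripe-trial") {
    return { plan: "FREE", status: "ACTIVE" };
  }
  // "founder" is the default, so every other value falls through to it (assumption).
  const isRootAdmin = rootAdminEmails.indexOf(email) !== -1; // exact match is an assumption
  return {
    plan: "FOUNDER",
    status: isRootAdmin ? "ACTIVE" : "WAITLIST",
    previewMonths: 6,
  };
}

type DeploymentChoice = "SAAS" | "self-hosted";

// SAAS deployments complete immediately; self-hosted ones stay pending.
function resolveDeploymentStatus(choice: DeploymentChoice): "COMPLETED" | "PENDING" {
  return choice === "SAAS" ? "COMPLETED" : "PENDING";
}
```

Keeping these rules free of I/O would let them be unit-tested the same way as the qualification scoring.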
setupSoloHandle does organisationRepository.create then organisationUserRepository.create as two separate awaits; setupDeploymentChoice does subscription.create -> deployment.create -> profile.update as three. A failure on any later write leaves the earlier rows committed and orphaned (org with no user, subscription with no deployment).
Impact: Orphaned organisations/subscriptions on partial failure; the user can land in a half-provisioned state that later steps assume is complete.
→ Wrap each step's writes in a single transaction via DatabaseService.runEffect() so partial provisioning rolls back atomically.
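The all-or-nothing behaviour this recommendation asks for can be illustrated with an in-memory stand-in. The signature of DatabaseService.runEffect() is not shown in the source, so the sketch below does not use it: `InMemoryDb`, `Tx`, and the `failMembership` flag are hypothetical and exist only to show that a failure on the second write leaves no organisation row behind.

```typescript
type Row = Record<string, unknown>;
type Tables = Record<string, Row[]>;

interface Tx {
  insert(table: string, row: Row): void;
}

// In-memory stand-in for a database. A real implementation would delegate
// to the driver's or ORM's transaction support instead of copying tables.
class InMemoryDb {
  private tables: Tables = {};

  count(table: string): number {
    return (this.tables[table] ?? []).length;
  }

  // Runs `work` against a staged copy and commits only if it resolves.
  async transaction<T>(work: (tx: Tx) => Promise<T>): Promise<T> {
    const staged: Tables = {};
    for (const name of Object.keys(this.tables)) {
      staged[name] = [...(this.tables[name] ?? [])];
    }
    const tx: Tx = {
      insert: (table, row) => {
        (staged[table] = staged[table] ?? []).push(row);
      },
    };
    const result = await work(tx); // a throw here skips the commit below
    this.tables = staged;
    return result;
  }
}

// Both writes share one transaction, so the org row never outlives a failed
// membership write.
async function setupSoloHandle(
  db: InMemoryDb,
  handle: string,
  failMembership = false, // test hook that simulates the second write failing
): Promise<void> {
  await db.transaction(async (tx) => {
    tx.insert("organisation", { handle, kind: "solo-user" });
    if (failMembership) throw new Error("simulated membership write failure");
    tx.insert("org_user", { handle });
  });
}
```

The same wrapping would apply to the three writes in setupDeploymentChoice, so a failed deployment insert would not leave a subscription behind.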
setupSoloHandle/setupOrganisationHandle build the org via OrganisationFactory and persist with OrganisationRepository.create + OrganisationUserRepository.create directly, duplicating the exact logic in OrganisationService.createOrganisation. The nested module's service is used only for the findOrganisation read in assertNotInvitedMember.
Impact: Two copies of provisioning logic can drift; the module's public service no longer owns its own writes.
→ Route provisioning through OrganisationService.createOrganisation so the org/first-user creation logic lives in one place and the module boundary is respected.
saveQualification persists structured answers to onboarding_qualification, then also writes a human-readable summary string into organisation.businessDescription. The code comment states this is to keep an existing state-machine check (!org?.description) working — so the field doubles as an onboarding-progress signal.
Impact: A semantic field is coupled to flow-control; clearing or editing the description could silently reset the user's perceived onboarding state.
→ Track onboarding step completion explicitly (e.g. a status/step column) instead of inferring progress from whether businessDescription is populated.
computeQualificationScore takes the answers and returns {score, tier} with no I/O: role (+1..+3), each qualifying system type (+2), team size (+1..+2), qualifying challenge (+2), tooling signal (+1). Tier is HIGH at >=11, MEDIUM at >=7, else LOW. The service persists this alongside the raw answers.
Being side-effect-free makes the scoring trivially unit-testable and safe to re-run; only OnboardingService.saveQualification touches the database around it.
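A pure scoring function with the weights and tier thresholds listed above might look like the following. The input shape is an assumption: the source names the scoring factors but not the answer fields or which answers earn which points, so `QualificationAnswers` takes the per-factor points and flags directly, and the function is named `computeQualificationScoreSketch` to mark it as illustrative.

```typescript
type Tier = "HIGH" | "MEDIUM" | "LOW";

// Hypothetical input shape; the real answer fields are not shown in the source.
interface QualificationAnswers {
  rolePoints: 1 | 2 | 3;           // role contributes +1..+3
  qualifyingSystemTypes: number;   // count of qualifying system types, +2 each
  teamSizePoints: 1 | 2;           // team size contributes +1..+2
  hasQualifyingChallenge: boolean; // +2 when true
  hasToolingSignal: boolean;       // +1 when true
}

// No I/O: the same answers always produce the same score and tier.
function computeQualificationScoreSketch(
  a: QualificationAnswers,
): { score: number; tier: Tier } {
  const score =
    a.rolePoints +
    2 * a.qualifyingSystemTypes +
    a.teamSizePoints +
    (a.hasQualifyingChallenge ? 2 : 0) +
    (a.hasToolingSignal ? 1 : 0);
  const tier: Tier = score >= 11 ? "HIGH" : score >= 7 ? "MEDIUM" : "LOW";
  return { score, tier };
}
```

For example, a top-weighted role (3) with two qualifying system types (4), the larger team size (2), a qualifying challenge (2), and a tooling signal (1) totals 12, which lands in the HIGH tier.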
OnboardingService exposes four independent POST steps under /user: save-profile-precis (name + handle), setup-solo/organisation-handle (provision org + first user), save-qualification (score the lead), and setup-deployment-choice (subscription + deployment). Each is a separate stateless call; the client decides ordering — there is no server-side step sequencing.
Step order is enforced only by the frontend. The backend accepts any step in any order, relying on data from earlier steps (e.g. the org must already exist before save-qualification looks it up by orgCode).
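If the backend were to enforce ordering itself, one option is a guard that checks an explicit record of completed steps before each mutating call. The sketch below is hypothetical: the step names, `STEP_ORDER`, and `assertStepAllowed` are illustrative, and where the completed-step list is stored (for example a status or step column) is left open.

```typescript
// Hypothetical step identifiers, in the order the four-step flow describes.
const STEP_ORDER = ["profile", "handle", "qualification", "deployment"] as const;
type OnboardingStep = (typeof STEP_ORDER)[number];

// Throws if any step that precedes `requested` has not been completed yet.
function assertStepAllowed(
  completed: ReadonlyArray<OnboardingStep>,
  requested: OnboardingStep,
): void {
  const index = STEP_ORDER.indexOf(requested);
  const missing = STEP_ORDER.slice(0, index).filter(
    (step) => completed.indexOf(step) === -1,
  );
  if (missing.length > 0) {
    throw new Error(`Cannot run "${requested}" before: ${missing.join(", ")}`);
  }
}
```

A controller could call this guard next to assertNotInvitedMember and translate the error into a 4xx response, so an out-of-order request fails clearly instead of failing later on a missing organisation lookup.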
After the final step, setupDeploymentChoice calls sendOnboardingNotification (EmailService.sendEmail -> Resend) to support@contextdx.com without awaiting it; the promise is .catch()'d and only logged. Email delivery failure never blocks or fails onboarding completion.
Good degradation behaviour for a non-critical side effect, but it also means a silently dropped notification has no retry or alerting beyond a log line.
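The non-blocking behaviour can be kept while adding a bounded retry and a hook for alerting. The sketch below is an assumption-laden illustration: `sendWithRetry`, its parameters, and the exponential backoff schedule are hypothetical, and `send` stands in for the existing email call.

```typescript
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Tries `send` up to `maxAttempts` times with exponential backoff. It never
// rejects on a send failure: after the last attempt it reports the error to
// `onGiveUp` (for example an alerting hook) and resolves to false.
async function sendWithRetry(
  send: () => Promise<void>,
  maxAttempts: number,
  baseDelayMs: number,
  onGiveUp: (err: unknown) => void,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await send();
      return true;
    } catch (err) {
      if (attempt === maxAttempts) {
        onGiveUp(err);
        return false;
      }
      // Wait baseDelayMs, then 2x, then 4x, and so on between attempts.
      await sleep(baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
  return false;
}
```

The caller can still invoke this without awaiting it, so onboarding completion stays unaffected. In a Lambda runtime, work that is not awaited may be cut short when the invocation ends, so handing the notification to a queue may be more reliable than in-process retries.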