09 — Deployment & Operations¶

As of 2026-05-28.

What is actually deployed today, how it's wired together, and how an operator restarts / rolls back / debugs it. Code-grounded against QuantaTradeAI/platform, QuantaTradeAI/admin-panel, and the EC2 deployment captured in docs/deployment-state.md.

Bottom line up front: a single EC2 instance hosts every running service (the platform monorepo containers and the matching engine via docker-compose, plus five PM2-managed Next.js front-ends). Cloudflare provides DNS only (not proxying). nginx terminates TLS and reverse-proxies into either localhost ports or the docker-compose network. There is no UAT / staging environment, no CD pipeline, and secrets live in plain .env files on the host. This is fit for an M1 internal demo. It is not fit for paying customers, and the gaps are itemised at the end.

1. Production topology¶

%%{init: {'theme':'base','themeVariables':{'background':'#ffffff','primaryColor':'#ddf4ff','primaryBorderColor':'#0969da','primaryTextColor':'#0a0a0a','lineColor':'#1f2328','secondaryColor':'#fff8c5','tertiaryColor':'#dafbe1','clusterBkg':'#f6f8fa','clusterBorder':'#d0d7de'}}}%%
graph TB
    subgraph Internet
        OPS[Operator browser]
        CLIENT[Trader browser]
    end

    subgraph CF[Cloudflare]
        DNS[Cloudflare DNS<br/>DNS-only, NOT proxying]
    end

    subgraph EC2["1× EC2 t3.2xlarge — 34.199.105.99 — us-east-1b"]
        NGINX[nginx<br/>TLS terminator<br/>Let's Encrypt certs]

        subgraph PM2["pm2-managed Next.js front-ends"]
            QT[quantatrade :3000<br/>main frontend]
            ADM[qt-admin :3012<br/>admin-panel]
            TRD[qt-trade :3013<br/>trading-ui]
            PRE[qt-presale :3010]
            INV[qt-dashboard :3011]
        end

        subgraph DC[docker-compose stack]
            PG[postgres:16-alpine :5432]
            RD[redis:7-alpine :6379]
            NT[nats:2-alpine :4222]
            AG[api-gateway :3001]
            WS[ws-gateway :3002]
            OR[order-router :3006]
            LED[ledger-service :3007]
            PMS[pms-service :3008]
            RSK[risk-service :3009]
            SUB[subscription-service :3060]
            ME[matching-engine :8090 / :9090]
            ENV[envoy gRPC-web :8088]
        end

        DOCS[/var/www/docs.quanta.emoment.tech<br/>static MkDocs/]
    end

    OPS --> DNS
    CLIENT --> DNS
    DNS -.A record.-> NGINX

    NGINX --> QT
    NGINX --> ADM
    NGINX --> TRD
    NGINX --> PRE
    NGINX --> INV
    NGINX --> AG
    NGINX --> WS
    NGINX --> ENV
    NGINX --> ME
    NGINX --> DOCS

    AG --> PG
    AG --> RD
    AG -.NATS.-> NT
    OR -.NATS.-> NT
    LED -.NATS.-> NT
    PMS -.NATS.-> NT
    WS -.NATS.-> NT
    SUB -.NATS.-> NT

    OR -.gRPC.-> ME
    ENV -.gRPC.-> ME

    LED --> PG
    PMS --> PG
    RSK --> PG

    style EC2 fill:#f6f8fa
    style DC fill:#ddf4ff
    style PM2 fill:#ddf4ff

Host facts¶

From docs/deployment-state.md:14-21:

Resource	Value
AWS account	`094969483885`
Instance	`i-077d5f14e17fb052c`, t3.2xlarge (8 vCPU, 32 GB), us-east-1b, Ubuntu 24.04, 200 GB gp3 root
Elastic IP	`34.199.105.99` (alloc `eipalloc-02511dc727ab251a9`)
SSH	`ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99`
Source tree	`/home/ubuntu/qt/` (multi-repo working area)
Docs static	`/var/www/docs.quanta.emoment.tech/` (rsync target)

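t3.2xlarge is a burstable instance. Under sustained load above its baseline it exhausts its CPU credits; it is then throttled to baseline in standard mode, or billed for surplus credits in unlimited mode (the T3 launch default). Acceptable for demo traffic. Not acceptable for live trading: moving to a c7i.2xlarge or m7i.2xlarge (non-burstable) is on the M5 ops list.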

Public surface¶

Ten subdomains, all pointing at the same EIP. From deployment-state.md:24-36:

Subdomain	Routed to	Serves
`quanta.emoment.tech`	pm2 `quantatrade` :3000	Main Next.js frontend (login / register / dashboard scaffold). Source missing locally — only the deployed `.next/` build exists on the host. Recovery is on the M1 follow-up list
`docs.quanta.emoment.tech`	nginx static → `/var/www/docs.quanta.emoment.tech/`	MkDocs build of this repo's `docs/`
`api.quanta.emoment.tech`	docker `api-gateway` :3001	NestJS REST API
`ws.quanta.emoment.tech`	docker `ws-gateway` :3002	WebSocket gateway (`/health` returns 200)
`presale.quanta.emoment.tech`	pm2 `qt-presale` :3010	Presale Next.js (`/home/ubuntu/qt/presale-app/`)
`dashboard.quanta.emoment.tech`	pm2 `qt-dashboard` :3011	Investor portal Next.js (`/home/ubuntu/qt/investor-dashboard/`)
`admin.quanta.emoment.tech`	pm2 `qt-admin` :3012	Admin panel Next.js (`/home/ubuntu/qt/admin-panel/`) — see 06-admin-panel.md
`trade.quanta.emoment.tech`	pm2 `qt-trade` :3013	Trading UI (`/home/ubuntu/qt/trade-ui/`)
`grpc.quanta.emoment.tech`	docker `envoy` :8088 → matching-engine :9090	gRPC-web endpoint for the admin panel
`matching.quanta.emoment.tech`	matching-engine :8090, but path-locked to `/api/v1/accounts/deposit`	CI self-heal balance top-up only — all other paths return 404

Domain story¶

The platform runs on emoment.tech today. quantatrade.tech is owned but not on the same Cloudflare account as the rest of the zone. A migration is queued but not gating any other work — internal links use emoment.tech for now.

TLS¶

Four Let's Encrypt certs, all expiring between 2026-07-24 and 2026-07-26. Renewed by the certbot systemd timer (daily check, renews when < 30 days remain). The last certbot renew --dry-run was clean on 2026-04-25.

Multi-SAN consolidation: api.quanta.emoment.tech cert also covers ws, presale, dashboard, admin (5 hosts on one cert). grpc.quanta.emoment.tech is single-SAN, separate.
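A quick way to see how close each certificate is to that 30-day renewal window is sketched below. It assumes openssl and GNU date are available; the function names cert_enddate and days_until are illustrative and do not exist in any repo.

```shell
# Sketch only: helper names are hypothetical, not from the platform repos.

# Print the notAfter date of the certificate served for a host on :443.
cert_enddate() {
  local host="$1"
  openssl s_client -connect "${host}:443" -servername "${host}" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate \
    | cut -d= -f2
}

# Whole days between a reference time (epoch seconds) and an end-date string.
# Relies on GNU date for `-d`.
days_until() {
  local enddate="$1" now_epoch="$2" end_epoch
  end_epoch=$(date -u -d "${enddate}" +%s) || return 1
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Usage (performs a live TLS handshake, so it is left commented out):
#   days_until "$(cert_enddate api.quanta.emoment.tech)" "$(date -u +%s)"
```

Running the commented usage line against api.quanta.emoment.tech and grpc.quanta.emoment.tech would cover both the multi-SAN and the single-SAN certificate.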

2. Docker-compose service graph¶

The platform compose stack lives at /home/ubuntu/qt/infrastructure/docker-compose.yml with patches in docker-compose.override.yml alongside it. Neither file is in this repo or the quantatrade-slippage working tree — they are host-only artefacts. The compose service list is inferred from deployment-state.md:47-65 and the per-service Dockerfile declarations in services/*/Dockerfile.

Containers¶

12 containers, all running with restart: unless-stopped, all networked on the default compose bridge.

Service	Image	Host port	Container port	Healthcheck	Depends on
`postgres`	`postgres:16-alpine`	127.0.0.1:5432	5432	`pg_isready`	—
`redis`	`redis:7-alpine`	127.0.0.1:6379	6379	`redis-cli ping`	—
`nats`	`nats:2-alpine`	127.0.0.1:4222	4222	TCP probe	—
`api-gateway`	`ghcr.io/quantatradeai/platform-api-gateway:latest`	127.0.0.1:3001	3001	`GET /api/v1/health` (overridden — image default `/health` 404s)	postgres, redis, nats
`ws-gateway`	`ghcr.io/.../ws-gateway:latest`	127.0.0.1:3002	8080 (image) → overridden to 3002	`node -e` probe on `:3002` (image had no `wget`/`curl`)	nats
`order-router`	`ghcr.io/.../order-router:latest`	127.0.0.1:3006	3006 (+ 9092 gRPC)	image default `/health`	nats, redis, matching-engine
`ledger-service`	`ghcr.io/.../ledger-service:latest`	127.0.0.1:3007	3007	image default `/health`	nats, postgres, temporal
`pms-service`	`ghcr.io/.../pms-service:latest`	127.0.0.1:3008	3008 (overridden — image set `:3007`)	overridden `:3008`	nats, postgres
`risk-service`	`ghcr.io/.../risk-service:latest`	127.0.0.1:3009	3009	image default `/health`	nats, postgres
`subscription-service`	`ghcr.io/.../subscription-service:latest`	127.0.0.1:3060	3060	image default `/health`	nats
`matching-engine`	`quantatrade-matching-engine:local` (built on EC2 from `QuantaTradeAI/exchange-core`)	127.0.0.1:8090 (HTTP), 127.0.0.1:9090 (gRPC)	8090 / 9090	Spring Actuator `/actuator/health`	postgres
`grpc-web-proxy` (Envoy)	`envoyproxy/envoy:v1.29-latest`	127.0.0.1:8088 (data), 127.0.0.1:9901 (admin)	8088 / 9901	TCP probe	matching-engine

All container ports are bound to 127.0.0.1, so public access goes only through nginx. The matching-engine HTTP port is the one backend exposed without a gateway in front of it, and only via matching.quanta.emoment.tech, which nginx path-locks to the deposit endpoint.
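The loopback-only claim can be checked on the host with a filter like the following. This is a minimal sketch: find_public_bindings is a hypothetical helper, and it assumes the usual `docker ps` port notation (`0.0.0.0:`, `:::` or `[::]:` for wildcard bindings).

```shell
# Reads port listings on stdin and prints every line that publishes a port
# on a wildcard address. Exit status 0 means at least one was found.
find_public_bindings() {
  grep -E '(0\.0\.0\.0|\[::\]|::):[0-9]+->'
}

# Usage on the host (left commented out because it needs the Docker daemon):
#   docker ps --format '{{.Names}} {{.Ports}}' | find_public_bindings \
#     && echo "WARNING: container port bound to a public address"
```

No output and a non-zero exit status is the expected result when every binding is on 127.0.0.1.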

`docker-compose.override.yml` patches¶

Four image-baked defaults required overrides at deploy time (deployment-state.md:66-75):

api-gateway healthcheck — image used wget … /health (404). Overridden to /api/v1/health (the real path).
ws-gateway — env was missing JWT_SECRET (caused crash loop); image healthcheck on :8080 overridden to :3002 via node -e (image has no wget/curl).
pms-service healthcheck — image used :3007 but PORT env sets :3008. Overridden to :3008.
ledger-service — image was built without the @quantatrade/logger workspace package compiled (Dockerfile builds 5 of 6 workspace packages, skips logger). Locally-built dist/ bind-mounted from /home/ubuntu/qt/platform/packages/logger/dist/. Image was rebuilt on the host 2026-04-26; the bind-mount remains as belt-and-braces.

The override file lives at /home/ubuntu/qt/infrastructure/docker-compose.override.yml on the instance. It is not yet in any GitHub repo, so rebuilding the host without a snapshot would require recreating the override by hand.

Image source¶

Most images are pulled from ghcr.io/quantatradeai/platform-<service>:latest. No CI workflow builds or pushes them today; they are built and pushed by hand from the operator's laptop (see §6). The matching-engine image is built on the EC2 host from a checkout of QuantaTradeAI/exchange-core because the Java build is heavyweight and no GHA matrix has been wired for it yet.

Dockerfile shape per TS service¶

All seven TS service Dockerfiles follow the same pattern (verified across all of services/*/Dockerfile):

# Build stage
FROM node:22-alpine AS builder
WORKDIR /app
RUN apk add --no-cache python3 make g++ openssl  # native deps + Prisma engine
COPY package.json package-lock.json tsconfig.base.json ./
COPY packages ./packages
COPY services/<svc> ./services/<svc>
RUN npm ci --workspace=@quantatrade/<svc> --include-workspace-root
RUN npx prisma generate --schema=packages/db/prisma/schema.prisma   # api-gateway only
RUN npm run build --workspace=@quantatrade/common
# … build each workspace package, then the service …
RUN npm run build --workspace=@quantatrade/<svc>

# Runtime
FROM node:22-alpine AS production
WORKDIR /app
RUN apk add --no-cache openssl  # Prisma engine
RUN addgroup -g 1001 -S nodejs && adduser -S nestjs -u 1001 -G nodejs
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/packages ./packages
COPY --from=builder /app/services/<svc>/dist ./services/<svc>/dist
COPY --from=builder /app/services/<svc>/package.json ./services/<svc>/
ENV NODE_ENV=production
ENV PORT=<svc-port>
USER nestjs
EXPOSE <svc-port>
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
  CMD wget -q --spider http://localhost:<svc-port>/health || exit 1
WORKDIR /app/services/<svc>
CMD ["node", "dist/main.js"]

Citation: services/api-gateway/Dockerfile:1-72 (the most fleshed-out; others differ only in package list and port).

Non-root user (nestjs:1001), unprivileged ports, Node 22 alpine. No multi-arch builds (linux/amd64 only — the EC2 host is x86_64).

3. Environment variables — per service¶

Each service's src/config/index.ts (or config.validation.ts) defines a Zod schema. Required + commonly-set vars below; defaults are from those schemas.

api-gateway (`services/api-gateway/src/config/config.validation.ts:37-92`)¶

Variable	Required	Default	Purpose
`JWT_SECRET`	yes (≥ 32 chars; enforced at `:46`)	—	HS256 token signing
`JWT_EXPIRES_IN`	no	`8h`	Access token TTL
`DATABASE_PASSWORD`	yes in prod (`:117`)	`exchange_dev` in dev	Postgres credential
`DATABASE_HOST` / `_PORT` / `_USER` / `_NAME`	no	`localhost` / `5432` / `exchange` / `exchange`	Postgres connection
`REDIS_HOST` / `_PORT` / `_PASSWORD`	no	`localhost` / `6379` / —	Cache, token blacklist
`NATS_URL`	no	`nats://localhost:4222`	Inter-service bus
`MATCHING_ENGINE_URL`	no	`http://localhost:8090`	REST fallback to engine
`MATCHING_ENGINE_GRPC_URL`	no	`localhost:9090`	gRPC to engine
`SUMSUB_APP_TOKEN` / `_SECRET_KEY` / `_BASE_URL`	optional	— / — / `https://api.sumsub.com`	KYC provider (M4 — not yet wired in prod)
`CORS_ORIGINS`	no	`http://localhost:3000`	Comma-separated allowlist
`RATE_LIMIT_TTL` / `_MAX`	no	60 / 100	NestJS throttler
`BODY_SIZE_LIMIT_KB`	no	100	Default body limit (10 MB for uploads via `_UPLOAD_KB`)
`REQUEST_TIMEOUT_MS`	no	30000	Per-request timeout
`LOG_LEVEL`	no	`info` (prod) / `debug` (dev)	Pino log level
`LOG_REQUESTS` / `LOG_REQUEST_BODY` / `LOG_RESPONSE_BODY`	no	`true` / `false` / `false`	Request audit logging
`AUDIT_ASYNC_LOGGING`	no	`false`	Buffer audit writes (see `audit/audit.module.ts:25`)

Hard-fail at boot if JWT_SECRET or (in prod) DATABASE_PASSWORD is missing — validateConfig() throws and the process exits (config.validation.ts:109-126).
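Because a bad value only surfaces as a crash at container start, the two hard-fail rules can be checked against the env file before restarting anything. The sketch below assumes a plain KEY=VALUE dotenv file; env_value and preflight_env are hypothetical helpers, not part of validateConfig(). It checks DATABASE_PASSWORD unconditionally, which is the prod rule.

```shell
# Print the value of KEY from a dotenv-style file (last assignment wins),
# stripping one pair of surrounding quotes if present.
env_value() {
  local key="$1" file="$2"
  sed -n "s/^${key}=//p" "$file" | tail -n 1 \
    | sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Mirror the boot-time rules: JWT_SECRET must be at least 32 characters
# and DATABASE_PASSWORD must be non-empty. Returns 1 if either fails.
preflight_env() {
  local file="$1" jwt db_pw status=0
  jwt=$(env_value JWT_SECRET "$file")
  db_pw=$(env_value DATABASE_PASSWORD "$file")
  if [ "${#jwt}" -lt 32 ]; then
    echo "JWT_SECRET missing or shorter than 32 characters" >&2
    status=1
  fi
  if [ -z "$db_pw" ]; then
    echo "DATABASE_PASSWORD missing" >&2
    status=1
  fi
  return "$status"
}

# Usage on the host:
#   preflight_env /home/ubuntu/qt/infrastructure/.env && echo "env looks sane"
```

The check never prints the secret values themselves, only which rule failed.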

ledger-service (`services/ledger-service/src/config/index.ts:22-32`)¶

Variable	Required	Default	Purpose
`PORT`	no	`3007`	HTTP health
`NODE_ENV`	no	`development`	Mode
`NATS_URL`	no	`nats://localhost:4222`	RPC bus for `ledger.credit/debit/lock/unlock/settleTrade`
`TEMPORAL_ADDRESS`	no	`localhost:7233`	Temporal cluster for `tradeSettlementWorkflow`
`TEMPORAL_NAMESPACE`	no	`default`	Temporal namespace
`DATABASE_URL`	yes (via `@quantatrade/db` shared schema)	—	Postgres connection — same instance as api-gateway

Note: ledger-service runs a Temporal worker (services/ledger-service/src/worker.ts) in addition to the HTTP server. Both are launched from main.ts. The Temporal server itself is not in the compose stack today — TEMPORAL_ADDRESS defaults to localhost:7233 which has nothing listening in prod. Settlement is functionally NATS-only right now; the Temporal workflow is the planned-durable path (see §8).

order-router (`services/order-router/src/config/index.ts:26-79`)¶

Variable	Required	Default	Purpose
`PORT`	no	`3006`	HTTP health
`NATS_URL`	no	`nats://localhost:4222`	Order events
`REDIS_HOST` / `_PORT` / `_PASSWORD`	no	`localhost` / `6379` / —	Order state persistence + dedupe
`MATCHING_ENGINE_URL`	no	`http://localhost:8090`	REST to engine
`MATCHING_ENGINE_WS_URL`	no	`ws://localhost:8090/ws`	Engine event stream
`MATCHING_ENGINE_GRPC_URL`	no	— (see schema)	gRPC PlaceOrder/Cancel
`RISK_MAX_ORDER_VALUE_USD`	no	`1_000_000`	Per-order cap
`RISK_MAX_DAILY_VOLUME_USD`	no	`10_000_000`	Per-user daily cap
`SERVICE_API_KEY` / `SERVICE_API_SECRET`	yes (per `.env.example:6-7`)	—	Service-auth to matching engine. Hardcoded fallbacks were removed for security

pms-service (`services/pms-service/src/config/index.ts`)¶

Variable	Required	Default	Purpose
`PORT`	no	`3007` (collides with ledger; the compose `PORT` env sets `:3008`, see §2)	HTTP
`JWT_SECRET`	yes in prod	`pms-secret-change-in-production` 🔴	Hard-coded fallback is a known weakness
`JWT_EXPIRES_IN`	no	`8h`	—
`MATCHING_ENGINE_URL` / `_WS_URL`	no	`http://localhost:8090` / `ws://localhost:8090/ws`	Engine connection
`LEDGER_SERVICE_URL`	no	`http://localhost:3004` (wrong — should be `:3007`)	Ledger calls (latent bug — works only because pms doesn't call ledger over HTTP in practice; NATS is used instead)
`PRICE_FEED_URL`	no	`http://localhost:8090/api/prices`	Mark-to-market
`PNL_UPDATE_INTERVAL_MS` / `PNL_SNAPSHOT_INTERVAL_MS`	no	`5000` / `60000`	P&L refresh cadence
`BVI_WEBHOOK_ID`	optional	—	BVI Financial Services Commission reporting
`TIMESCALEDB_URL`	optional	`postgresql://marketdata:marketdata_dev@localhost:5434/marketdata`	Time-series for tick history (not yet provisioned in prod)
`FIFO_BASE_CURRENCY`	no	`USD`	FIFO P&L accounting

ws-gateway (`services/ws-gateway/src/config/index.ts:28-53`)¶

Variable	Required	Default	Purpose
`PORT`	no	`3002`	WebSocket listener
`JWT_SECRET`	yes (≥ 32 chars; comes from `jwtConfigSchema`)	—	WS handshake auth
`NATS_URL`	no	`nats://localhost:4222`	Source of trade / order events
`WS_COMPRESSION`	no	`true`	`permessage-deflate`
`WS_MAX_PAYLOAD_KB`	no	`16`	Per-frame limit
`WS_IDLE_TIMEOUT`	no	`120` (seconds)	Drop idle connections
`WS_MAX_SUBSCRIPTIONS`	no	`50`	Per-connection cap
`WS_RATE_LIMIT_PER_SECOND`	no	`100`	Per-connection msg rate
`WS_MAX_MESSAGE_LENGTH`	no	`4096` (chars)	Inbound message size
`WS_MAX_CHANNEL_LENGTH`	no	`100`	Subscription channel name length

risk-service, subscription-service¶

Smaller surfaces — see services/risk-service/src/config/ and services/subscription-service/.env.example. The subscription service uses SQLite (DATABASE_URL=file:./subscription.db, subscription-service/.env.example:4) — not Postgres. Its persistence is container-local (the file is inside the container's writable layer, not a volume). This is a defect — restarting the container loses subscription state. Migration to Postgres is on the M4 list.

Admin panel envs (`admin-panel/next.config.js:5-13`)¶

These get baked into the static Next.js build at deploy time:

Variable	Default	Purpose
`MATCHING_ENGINE_URL`	`http://localhost:8090`	REST fallback path (currently unused in prod — gRPC-web is primary)
`GRPC_WEB_URL`	`http://localhost:8088`	Connect-protocol endpoint — in prod set to `https://grpc.quanta.emoment.tech`
`USE_GRPC`	`true`	Set to `false` to force REST-only
`API_GATEWAY_URL`	`http://localhost:3001`	For pages that call platform REST (positions, treasury)
`LEDGER_SERVICE_URL`	`http://localhost:3004` (wrong default — should be `:3007`)	Not yet called by any page
`RISK_SERVICE_URL`	`http://localhost:3005` (wrong — should be `:3009`)	Not yet called
`CUSTODY_SERVICE_URL`	`http://localhost:3006` (collides with order-router)	Treasury page (M2)
`KYC_SERVICE_URL`	`http://localhost:3009` (collides with risk)	KYC page (M4)

The default URLs are dev-machine values. Production overrides come from /home/ubuntu/qt/admin-panel/.env.production. These defaults must not ship to prod; review the override file when redeploying.

4. Secrets management¶

🔴 Current state: every secret lives in a per-host .env file. There is no Vault, no AWS Secrets Manager, no rotation. The CLAUDE.md rule "never copy .env between hosts" is the only thing protecting prod from dev-credentials leakage.

Where secrets live on the EC2 host¶

File	Holds	Read by
`/home/ubuntu/qt/infrastructure/.env`	`DATABASE_PASSWORD`, `JWT_SECRET`, `NATS_URL`, `REDIS_PASSWORD`, `SERVICE_API_KEY`, `SERVICE_API_SECRET`	docker-compose `env_file:` for every container
`/home/ubuntu/qt/admin-panel/.env.production`	`GRPC_WEB_URL=https://grpc.quanta.emoment.tech`, `NEXT_PUBLIC_*`	`next build` bakes them into the static bundle
`/home/ubuntu/qt/trade-ui/.env.production`	`NEXT_PUBLIC_API_BASE`, `NEXT_PUBLIC_WS_BASE`	Trade UI build
`/home/ubuntu/qt/presale-app/.env.production`	`NEXT_PUBLIC_WC_PROJECT_ID`, contract addresses	Missing in prod — see `deployment-state.md:248`

What's missing¶

No rotation. JWT_SECRET has been the same since the instance was provisioned. KYC/AML compliance will eventually require a quarterly rotation.
No central audit. We don't know which secrets exist on which host — there's no inventory beyond grepping the host.
Plain text on disk. Each .env is chmod 600 and owned by the ubuntu user, but there is no LUKS and no envelope encryption.
SUMSUB_* / BITGO_* envs are unset in prod today because we don't have credentials yet. When they arrive (M2 / M4), this section needs a refresh.

Path forward¶

Smallest defensible change: move to AWS Secrets Manager + IAM role on the instance, fetched at container start via an entrypoint shim. ~2 days of work; on the M5 ops list.
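One possible shape for that entrypoint shim is sketched below. It assumes the secret is stored as KEY=VALUE lines in a single SecretString, that the instance role is allowed secretsmanager:GetSecretValue, and that the secret id arrives in QT_SECRET_ID. All three are assumptions, as are the function names.

```shell
# Export KEY=VALUE lines read from stdin into the current shell.
# Blank lines and comments are skipped; anything else malformed aborts.
# Values are never echoed, so nothing secret reaches the logs.
export_dotenv() {
  local line key value
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;
      *=*) ;;
      *) echo "malformed line (no '=')" >&2; return 1 ;;
    esac
    key="${line%%=*}"
    value="${line#*=}"
    case "$key" in
      ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*)
        echo "invalid variable name" >&2; return 1 ;;
    esac
    export "${key}=${value}"
  done
}

# Fetch the secret, load it, then replace the shell with the real command.
# QT_SECRET_ID is a hypothetical variable name.
load_secrets_and_exec() {
  local secret
  secret=$(aws secretsmanager get-secret-value \
    --secret-id "$QT_SECRET_ID" --query SecretString --output text) || return 1
  export_dotenv <<EOF || return 1
$secret
EOF
  exec "$@"
}

# Usage as a container entrypoint (hypothetical):
#   load_secrets_and_exec node dist/main.js
```

Loading through a here-document rather than a pipe keeps export_dotenv in the current shell, so the exported variables survive until exec.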

5. nginx reverse proxy¶

🔴 nginx config files are not in this repo or any working repo. They live on the EC2 instance at /etc/nginx/sites-available/ and have never been version-controlled. The only copies are on the live host and in its EBS snapshots; if both were lost, the config would have to be rebuilt from memory.

What we know from deployment-state.md:25-37 (the Backend (host:port) column gives the proxy targets) and :191-199 (the only nginx snippet committed anywhere):

Routing rules (inferred)¶

# Each subdomain is a separate server block at /etc/nginx/sites-available/
server {
    listen 443 ssl http2;
    server_name api.quanta.emoment.tech;
    ssl_certificate     /etc/letsencrypt/live/api.quanta.emoment.tech/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.quanta.emoment.tech/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3001;   # api-gateway
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

The one nginx fragment we do have¶

For matching.quanta.emoment.tech (CI self-heal subdomain) — deployment-state.md:193-199:

location = /api/v1/accounts/deposit {
    limit_except POST OPTIONS { deny all; }
    proxy_pass http://127.0.0.1:8090;
}
location / { return 404; }

This is the entire pattern: scope the path narrowly, deny everything else. Worth porting back into a versioned infrastructure/nginx/ directory.
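The path lock can be smoke-tested from outside the host. The sketch below assumes nginx answers 403 for a method rejected by `deny all` (its normal behaviour) and 404 for every other path, as the fragment specifies. http_status and check_path_lock are hypothetical helper names.

```shell
# Print only the HTTP status code for METHOD URL.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' -X "$1" "$2"
}

# Verify the two properties of the lock: unknown paths return 404 and
# the deposit path rejects methods other than POST / OPTIONS.
check_path_lock() {
  local base="$1" rc=0
  [ "$(http_status GET "${base}/")" = "404" ] \
    || { echo "expected 404 on /" >&2; rc=1; }
  [ "$(http_status GET "${base}/api/v1/accounts/deposit")" = "403" ] \
    || { echo "expected 403 on GET deposit" >&2; rc=1; }
  return "$rc"
}

# Usage:
#   check_path_lock https://matching.quanta.emoment.tech && echo "path lock intact"
```

Neither request carries service-auth headers or a body, so the check cannot alter any balance.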

TLS termination¶

All TLS terminates at nginx via Let's Encrypt certs. nginx → upstream containers is plain HTTP on 127.0.0.1. The matching-engine gRPC port is also 127.0.0.1-only: for grpc.quanta.emoment.tech, nginx terminates TLS and forwards to Envoy on 127.0.0.1:8088, which is the only bridge from the public endpoint to internal gRPC.

Rate limiting¶

nginx-level rate limiting is not configured today (limit_req_zone does not appear in any committed config). Per-route rate limiting exists at NestJS throttler level inside api-gateway (RATE_LIMIT_TTL=60, RATE_LIMIT_MAX=100 per IP per minute — config.validation.ts:75-76).

CORS¶

api-gateway sets CORS via the NestJS enableCors() call, reading from the CORS_ORIGINS env (config.validation.ts:41). Default is http://localhost:3000 — in prod the production frontend domains are added explicitly. nginx is not in the CORS path — it transparently forwards Origin and Access-Control-* headers.

6. CI/CD posture¶

CI today (`.github/workflows/ci.yml`)¶

Full content of the workflow at /Users/pk/ws/quantatrade-slippage/.github/workflows/ci.yml:

name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-test-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci || npm install
      - name: Lint (best effort — non-blocking until normalized)
        run: npm run lint --if-present || true
      - name: Type check
        run: npx tsc --noEmit --skipLibCheck || true
      - name: Test (non-blocking until each service defines real test scripts)
        run: npm test --if-present || true
      - name: Build
        run: npm run build --if-present

🟡 Lint, type-check, and test all swallow failures (|| true). Only build failures block merges. Hardening to remove || true is M1 follow-up — every service needs a clean lint + tsc pass first.

🔴 No image push step. The compose stack pulls ghcr.io/quantatradeai/platform-<service>:latest but nothing in this workflow builds or pushes those images. The images are built separately — historically by manual docker build && docker push from the deploy operator's laptop. This is fragile and is on the immediate fix list.

🔴 No deploy step. There is no CD. Deploys are manual SSH + docker compose pull + docker compose up -d.

CD today¶

Manual. The workflow:

PR merged to main on QuantaTradeAI/platform.
CI passes (or doesn't — only build is enforced).
Operator builds images locally and pushes to GHCR (or relies on a stale :latest).
Operator SSHes to 34.199.105.99.
cd /home/ubuntu/qt/infrastructure && docker compose pull && docker compose up -d --remove-orphans.
Operator manually verifies via /services admin page or curl https://api.quanta.emoment.tech/api/v1/health.
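Steps 5 and 6 above could be chained so that a deploy that never becomes healthy is at least visible in the exit status. This is a sketch, not an existing script: wait_for_health and deploy_stack are hypothetical names, and the attempt count and delay are arbitrary defaults.

```shell
# Poll a health URL until curl succeeds (no connection error and no
# HTTP 4xx/5xx), or give up after the given number of attempts.
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after ${i} attempt(s): ${url}"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after ${attempts} attempts: ${url}" >&2
  return 1
}

# Pull, restart, then verify. The subshell keeps the cd from leaking
# into the caller's working directory.
deploy_stack() {
  (
    cd /home/ubuntu/qt/infrastructure &&
    docker compose pull &&
    docker compose up -d --remove-orphans
  ) && wait_for_health "https://api.quanta.emoment.tech/api/v1/health"
}
```

This only covers api-gateway; the other services would still need `docker compose ps` or the /services admin page.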

Existing e2e CI for the trading-ui repo¶

The one bright spot — QuantaTradeAI/trading-ui/.github/workflows/e2e.yml runs Playwright e2e tests against the live trade.quanta.emoment.tech on every push/PR. Self-heals test balance via the matching.quanta.emoment.tech deposit subdomain. First green run 2026-04-29 — 26 specs in 1m12s. See deployment-state.md:177-203.

Path forward¶

Smallest defensible change (1 week of work):

Add a docker/build-push-action matrix to ci.yml — push per-service images on tag push.
Add a deploy workflow triggered by tag: docker compose pull && docker compose up -d over SSH (using appleboy/ssh-action).
Remove the || true swallowing once each service has a real lint + test target.

7. Observability¶

Metrics¶

🟡 @quantatrade/metrics exists but is currently a stub (packages/metrics/src/index.ts:4):

Stub metrics package providing Prometheus-compatible metric types. All metric operations are no-ops until a real implementation (e.g. prom-client) is wired in.

The shape is correct — Counter, Gauge, Histogram, Summary, and service-specific bundles (createApiGatewayMetrics, createOrderRouterMetrics, createWsGatewayMetrics) are all defined and consumed by the services. But .inc() / .observe() / .set() are all no-ops (packages/metrics/src/index.ts:20-33). The /metrics endpoint on api-gateway returns a header-only response (# HELP stub metrics for api-gateway).

Wiring prom-client into the registry is a one-day task. No Prometheus / Grafana / Datadog target is configured today — there is nowhere for metrics to flow even if the stub were replaced.

Logs¶

🟢 Structured logging via @quantatrade/logger (packages/logger/src/). Pino-backed. Two methods that get used everywhere: .error(msg, err, meta) and .logTrade(meta). The .logError / .logTrade calls were a recent fix (see docs/milestone-1-status.md — "Resolved 2026-05-27: ledger logger bug").

Container logs end up at the Docker daemon's default location (/var/lib/docker/containers/<id>/<id>-json.log). No log shipping → no Loki, CloudWatch, or Splunk target. Operator reads logs with docker compose logs -f <service> over SSH.
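With no log shipping, filtering has to happen on the host. The sketch below assumes @quantatrade/logger keeps Pino's default numeric level field (50 = error, 60 = fatal); if the package emits level labels instead, the pattern needs adjusting. pino_errors is a hypothetical helper name.

```shell
# Keep only log lines whose Pino level is error (50) or fatal (60).
# Works on raw JSON lines and on lines carrying a compose prefix.
pino_errors() {
  grep -E '"level":(50|60)[,}]'
}

# Usage on the host:
#   docker compose logs --no-color --tail 1000 ledger-service | pino_errors
```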

Tracing¶

🔴 None. No OpenTelemetry exporters, no Jaeger, no Datadog APM.

Healthchecks¶

Every container has a docker-level HEALTHCHECK. docker compose ps shows status as (healthy) / (unhealthy) / (starting). The admin panel's /services page polls these endpoints from the browser side via api.getServiceHealth(serviceKey).

Alerting¶

🔴 None. No PagerDuty, no Slack alert pipeline, no on-call rotation. If the host goes down, the next person to look at trade.quanta.emoment.tech finds out.

Cost / value trade-off¶

For the current M1 demo posture this is acceptable. Before any paying-customer traffic: wire prom-client + push to Grafana Cloud (free tier covers 10K series), add CloudWatch container insights, write 4 alarms (instance down / disk > 80% / matching-engine 5xx / no trades in 5 min).
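Until real alarms exist, the disk threshold from that list can be approximated with a cron-able check. This is a stopgap sketch with hypothetical helper names and no delivery channel; it only reports through its exit status.

```shell
# Percentage of the filesystem holding PATH that is in use, as an integer.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Exit 0 when usage is strictly above the threshold (default 80).
disk_over_threshold() {
  local pct
  pct=$(disk_usage_pct "$1") || return 2
  [ "$pct" -gt "${2:-80}" ]
}

# Usage (e.g. from cron on the host):
#   disk_over_threshold / 80 && echo "root disk above 80%" >&2
```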

8. Backups¶

EBS snapshots (host-level)¶

🟢 DLM (Data Lifecycle Manager) is configured (deployment-state.md:255-262):

Item	Value
IAM role	`arn:aws:iam::094969483885:role/AWSDataLifecycleManagerDefaultRole`
Policy	`policy-0066a67ecb6c3daa7` (ENABLED)
Schedule	Daily, 03:00 UTC
Retention	7 days
Target	EBS volumes tagged `Backup=daily` (currently the root disk `vol-0ddc7e9d1de5a2b59`)
Tagging	Snapshots tagged `SnapshotType=DLM-Daily`
Baseline	`snap-07e001a69835f1973` (manual, pre-DLM)

Restore = create new EBS volume from a snapshot, attach, fsck, mount, fix /etc/fstab UUID. ~15 minutes to a running new host.

Postgres backups (logical)¶

🔴 None today. No pg_dump cron, no continuous archiving, no AWS RDS (Postgres runs in-container with a Docker volume). The EBS snapshot is the only Postgres backup — sufficient for crash recovery, insufficient for point-in-time recovery.

Adding a nightly pg_dump → S3 with 30-day retention is a 30-minute task. It's not yet done.
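A minimal sketch of that nightly dump follows. The bucket variable QT_BACKUP_BUCKET and both function names are hypothetical. The database user and name default to the `exchange` values from the api-gateway schema and may differ in prod. The 30-day retention would come from an S3 lifecycle rule on the bucket, not from this script.

```shell
# Object key for one night's dump, e.g. pg/exchange-2026-05-28.sql.gz
backup_key() {
  local db="$1" day="${2:-$(date -u +%F)}"
  echo "pg/${db}-${day}.sql.gz"
}

# Stream a compressed logical dump from the postgres container to S3.
# Caveat: without `pipefail` the exit status is that of `aws s3 cp`, so a
# failed pg_dump can still upload a truncated file; verify size or restore.
nightly_pg_dump() {
  local db="${PGDATABASE:-exchange}" user="${PGUSER:-exchange}"
  (
    cd /home/ubuntu/qt/infrastructure &&
    docker compose exec -T postgres pg_dump -U "$user" "$db" \
      | gzip \
      | aws s3 cp - "s3://${QT_BACKUP_BUCKET}/$(backup_key "$db")"
  )
}
```

The `-T` flag disables TTY allocation, which is what lets the dump stream cleanly through the pipe when run from cron.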

Temporal-based settlement durability¶

🟡 tradeSettlementWorkflow (services/ledger-service/src/workflows/trade-settlement.ts) is implemented and wired up. Steps (:39-50):

settleTradeActivity — calls LedgerService.settleTrade({...}). Idempotent via trade ID.
notifyTradeSettledActivity — publishes trade.settled on NATS for downstream services.

Activity options (trade-settlement.ts:6-14): startToCloseTimeout: 30s, with retry policy maximumAttempts: 3, initialInterval: 1s, backoffCoefficient: 2.
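To make those numbers concrete, the backoff schedule and the worst-case duration of one activity can be worked out as below. This is plain arithmetic under the stated values, not Temporal code: it assumes the wait before retry n is initialInterval × backoffCoefficient^(n-1) and ignores any maximumInterval cap. Both function names are hypothetical.

```shell
# Print the wait (seconds) before each retry, space-separated.
# Integer inputs only: initial interval, coefficient, maximum attempts.
retry_delays() {
  local initial="$1" coeff="$2" max_attempts="$3" n=1 delay="$1" out=""
  while [ "$n" -lt "$max_attempts" ]; do
    out="${out}${out:+ }${delay}"
    delay=$((delay * coeff))
    n=$((n + 1))
  done
  echo "$out"
}

# Upper bound on wall time for one activity: every attempt runs to the
# start-to-close timeout and every backoff is waited in full.
worst_case_seconds() {
  local timeout="$1" initial="$2" coeff="$3" max_attempts="$4" total d
  total=$((timeout * max_attempts))
  for d in $(retry_delays "$initial" "$coeff" "$max_attempts"); do
    total=$((total + d))
  done
  echo "$total"
}

retry_delays 1 2 3           # prints: 1 2
worst_case_seconds 30 1 2 3  # prints: 93
```

So a settlement activity that keeps timing out is abandoned after at most about a minute and a half (3 × 30 s plus 1 s and 2 s of backoff).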

The worker is launched from services/ledger-service/src/worker.ts:1-23 — maxConcurrentActivityTaskExecutions: 20, maxConcurrentWorkflowTaskExecutions: 20.

But there's no Temporal server in production today: TEMPORAL_ADDRESS defaults to localhost:7233 (config/index.ts:47) and nothing answers there. The worker's connection attempt fails silently, so the workflow path is never exercised. Settlement actually happens via the NATS trade.executed listener inside ledger.ts directly.

Wiring a real Temporal cluster (Temporal Cloud or self-hosted Temporal in compose) is part of the M3 "make settlement crash-safe" workstream.

9. Operational runbooks¶

Restart a service¶

ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure
docker compose restart <service>   # graceful; honours stop_grace_period
docker compose ps                  # confirm (healthy)

For pm2-managed front-ends (admin-panel, trade-ui, presale-app, investor-dashboard, main frontend):

pm2 restart qt-admin              # or qt-trade / qt-presale / qt-dashboard / quantatrade
pm2 status                        # confirm
pm2 logs qt-admin --lines 50      # check post-restart logs

Tail logs¶

# Docker container
docker compose logs -f --tail 100 api-gateway

# Multiple containers
docker compose logs -f api-gateway order-router ledger-service

# pm2 frontend
pm2 logs qt-trade --lines 200

# nginx access log
sudo tail -f /var/log/nginx/access.log

# nginx error log
sudo tail -f /var/log/nginx/error.log

Run a Prisma migration¶

The Prisma schema is in packages/db/prisma/schema.prisma (24 models). Migrations live in packages/db/prisma/migrations/:

20240201000000_add_address_pool
20260206000000_add_order_internal_id
20260208000000_add_user_password_hash

To apply migrations against the live database:

ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure   # docker compose must run where docker-compose.yml lives

# Option A: from inside the api-gateway container (the image has Prisma CLI)
docker compose exec api-gateway npx prisma migrate deploy --schema=/app/packages/db/prisma/schema.prisma

# Option B: db push for non-versioned changes (development convenience — never in prod)
docker compose exec api-gateway npx prisma db push --schema=/app/packages/db/prisma/schema.prisma

⚠ History note (deployment-state.md:109): The initial deploy had no migrations — only the address_pool migration existed; User / Order / Trade / etc. tables were missing. Recovery was prisma db push from inside the api-gateway container, which created 25 tables from the schema. db push is dev-only — never run it on prod once we have customers. The two follow-up migrations (order_internal_id, user_password_hash) were added properly.

Roll back a bad deploy¶

# Tag-based pin (preferred — once we tag images properly)
ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure
# Edit docker-compose.yml: pin <service>'s image: tag from `:latest` to the last-known-good tag
docker compose pull <service>
docker compose up -d <service>

# Or, if rollback requires DB state too: restore from EBS snapshot
# 1. Stop services that depend on Postgres
docker compose stop api-gateway order-router ledger-service pms-service risk-service matching-engine
# 2. Create EBS volume from yesterday's DLM snapshot
aws ec2 create-volume --snapshot-id snap-<id> --availability-zone us-east-1b --volume-type gp3 --region us-east-1
# 3. Stop instance, detach old volume, attach new, fsck, mount
# 4. Restart instance and bring the stack up

There is no automated rollback — every step above is manual.

Renew a TLS certificate¶

certbot runs daily on a systemd timer. Manual force-renewal:

sudo certbot renew --dry-run                  # safety check
sudo certbot renew                            # actual renewal (only renews if < 30 days remain)
sudo certbot renew --force-renewal --cert-name api.quanta.emoment.tech   # nuclear option
sudo systemctl reload nginx

Redeploy docs (this repo's `docs/`)¶

cd /Users/pk/ws/quantatrade
~/Library/Python/3.9/bin/mkdocs build
rsync -az --delete -e "ssh -i ~/.ssh/quantatrade-key.pem" \
  site/ ubuntu@34.199.105.99:/var/www/docs.quanta.emoment.tech/

Inspect the docker-compose override¶

ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cat /home/ubuntu/qt/infrastructure/docker-compose.override.yml

(This file is host-only — see §2.)

Self-heal CI test balance¶

order-pipeline.spec.ts in the trading-ui repo deposits 1 B USDT + 1 k BTC into the test user before every CI run. The endpoint is gated by service-auth headers:

curl -X POST https://matching.quanta.emoment.tech/api/v1/accounts/deposit \
  -H "Content-Type: application/json" \
  -H "x-api-key: $SERVICE_API_KEY" \
  -H "x-api-secret: $SERVICE_API_SECRET" \
  -H "x-participant-type: SYSTEM" \
  -d '{"userId":"cmohcayzs0000n57cskqsutdc","currency":"USDT","amount":"1000000000"}'
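The curl call above covers only the USDT leg. A sketch of topping up both balances in one go follows. It assumes the same endpoint accepts `"currency":"BTC"` with a string amount, which the source does not show explicitly, and the function names are hypothetical.

```shell
API_URL="https://matching.quanta.emoment.tech/api/v1/accounts/deposit"
TEST_USER_ID="cmohcayzs0000n57cskqsutdc"

# deposit_payload USER_ID CURRENCY AMOUNT -> JSON body in the shape used above
deposit_payload() {
  printf '{"userId":"%s","currency":"%s","amount":"%s"}' "$1" "$2" "$3"
}

# deposit CURRENCY AMOUNT -> POST one deposit for the CI test user
deposit() {
  curl --fail --silent --show-error -X POST "$API_URL" \
    -H "Content-Type: application/json" \
    -H "x-api-key: $SERVICE_API_KEY" \
    -H "x-api-secret: $SERVICE_API_SECRET" \
    -H "x-participant-type: SYSTEM" \
    -d "$(deposit_payload "$TEST_USER_ID" "$1" "$2")"
}

# 1 B USDT and 1 k BTC, matching what order-pipeline.spec.ts deposits
top_up_test_user() {
  deposit USDT 1000000000 && deposit BTC 1000
}
```

`--fail` makes curl exit non-zero on an HTTP error, so a rejected USDT deposit stops the BTC one from being attempted.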

10. Known operational gaps¶

🔴 Single host = SPOF. One EC2 instance dies → everything goes down. No multi-AZ, no auto-scaling, no failover. RTO is roughly 15 minutes (restore from EBS snapshot). RPO is 24 hours (daily DLM cadence).

🔴 Manual deploys. No CD pipeline. The operator SSHes in and runs docker compose pull && docker compose up -d. No deploy log, no atomic swap, no canary.

🔴 No UAT environment. Every change goes straight to prod. The trading UI runs its e2e suite against prod (which is brave but works for a demo).

🔴 Secrets in plain .env. No Vault, no rotation. JWT_SECRET hasn't rotated since provisioning.

🔴 Metrics stub. @quantatrade/metrics is no-ops. No Prometheus target.

🔴 No alerting. Host down / disk full / matching-engine 5xx — nothing pages anyone.

🔴 No logical Postgres backup. Only EBS-level. PITR is impossible.

🔴 No Temporal in prod. The settlement workflow exists in code, but there is no Temporal server in prod for its worker to run against.

🔴 nginx config not version-controlled. Reconstructing it from scratch would be from memory + grep on the host.

🔴 docker-compose.override.yml not committed. Same problem.

🟡 pms-service JWT secret has a hard-coded fallback (pms-secret-change-in-production) — must be overridden in prod env.

🟡 subscription-service uses SQLite inside the container — no volume mount, state lost on restart.

🟡 order-router MATCHING_ENGINE_URL/_WS_URL default to localhost:8090, which is unreachable from inside the order-router container. Fixed via override env vars pointing at http://matching-engine:8090. The default in the schema is misleading.

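🟡 Burstable t3.2xlarge. Under sustained load the CPU credit balance runs out; the instance is then throttled to its baseline (standard mode) or billed for surplus credits (unlimited mode). Move to c7i.2xlarge or m7i.2xlarge before paying-customer traffic.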

🟡 CI swallows lint/type/test failures. Only build is enforced.

🟢 DLM EBS snapshots — working. Tested on 2026-04-26. 7-day retention.

🟢 Certbot auto-renewal — working. Last dry-run successful 2026-04-25.

🟢 pm2 survives reboots — pm2 startup was run on provision; the four front-ends come back up automatically.

🟢 Docker restart: unless-stopped on every service — containers come back after reboot or crash.
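For the logical-backup gap listed above, a first step could look like the sketch below. It assumes the compose service is named `postgres` (the topology diagram shows a postgres:16-alpine container, but the service name is not confirmed here). The backup directory, function names, and the database and user arguments are all hypothetical.

```shell
BACKUP_DIR="${BACKUP_DIR:-/home/ubuntu/pg-backups}"   # hypothetical location

# backup_name DB YYYYMMDD -> file name for that day's dump
backup_name() {
  printf '%s-%s.dump' "$1" "$2"
}

# pg_logical_backup DB USER -> custom-format dump of DB into BACKUP_DIR
# Run from /home/ubuntu/qt/infrastructure so docker compose finds the stack.
pg_logical_backup() {
  db="$1"; user="$2"
  out="${BACKUP_DIR}/$(backup_name "$db" "$(date -u +%Y%m%d)")"
  mkdir -p "$BACKUP_DIR" || return 1
  # -T disables TTY allocation so the binary dump is not mangled on stdout.
  # Write to .partial first so an interrupted dump never looks complete.
  docker compose exec -T postgres pg_dump -U "$user" -Fc "$db" > "$out.partial" \
    && mv "$out.partial" "$out"
}
```

A dump like this restores with `pg_restore` and gives a copy that is independent of the EBS volume, but it does not provide PITR; that still requires WAL archiving. The dumps would also need to be shipped off the host to help with the single-host failure case.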

Related¶

01-architecture.md: what the services do; this doc is "where they run"
03-ledger-accounting.md: Temporal settlement design (the worker that has no server yet)
06-admin-panel.md: how the operator UI lands on the matching engine via Envoy
docs/deployment-state.md: the authoritative day-to-day operations log; this doc summarises and structures it
docs/forward-plan.md: the prioritised list of operational improvements
