09 — Deployment & Operations¶
As of 2026-05-28.
What is actually deployed today, how it's wired together, and how an operator restarts / rolls back / debugs it. Code-grounded against QuantaTradeAI/platform, QuantaTradeAI/admin-panel, and the EC2 deployment captured in docs/deployment-state.md.
Bottom line up front: a single EC2 instance hosts every running service (platform monorepo containers via docker-compose, five PM2-managed Next.js front-ends, and the matching engine). Cloudflare provides DNS only (not proxying). nginx terminates TLS and reverse-proxies into either localhost ports or the docker-compose network. There is no UAT / staging environment, no CD pipeline, and secrets live in plain .env files on the host. This is fit for an M1 internal demo. It is not fit for paying customers, and the gaps are itemised at the end.
1. Production topology¶
%%{init: {'theme':'base','themeVariables':{'background':'#ffffff','primaryColor':'#ddf4ff','primaryBorderColor':'#0969da','primaryTextColor':'#0a0a0a','lineColor':'#1f2328','secondaryColor':'#fff8c5','tertiaryColor':'#dafbe1','clusterBkg':'#f6f8fa','clusterBorder':'#d0d7de'}}}%%
graph TB
subgraph Internet
OPS[Operator browser]
CLIENT[Trader browser]
end
subgraph CF[Cloudflare]
DNS[Cloudflare DNS<br/>DNS-only, NOT proxying]
end
subgraph EC2["1× EC2 t3.2xlarge — 34.199.105.99 — us-east-1b"]
NGINX[nginx<br/>TLS terminator<br/>Let's Encrypt certs]
subgraph PM2["pm2-managed Next.js front-ends"]
QT[quantatrade :3000<br/>main frontend]
ADM[qt-admin :3012<br/>admin-panel]
TRD[qt-trade :3013<br/>trading-ui]
PRE[qt-presale :3010]
INV[qt-dashboard :3011]
end
subgraph DC[docker-compose stack]
PG[postgres:16-alpine :5432]
RD[redis:7-alpine :6379]
NT[nats:2-alpine :4222]
AG[api-gateway :3001]
WS[ws-gateway :3002]
OR[order-router :3006]
LED[ledger-service :3007]
PMS[pms-service :3008]
RSK[risk-service :3009]
SUB[subscription-service :3060]
ME[matching-engine :8090 / :9090]
ENV[envoy gRPC-web :8088]
end
DOCS[/var/www/docs.quanta.emoment.tech<br/>static MkDocs/]
end
OPS --> DNS
CLIENT --> DNS
DNS -.A record.-> NGINX
NGINX --> QT
NGINX --> ADM
NGINX --> TRD
NGINX --> PRE
NGINX --> INV
NGINX --> AG
NGINX --> WS
NGINX --> ENV
NGINX --> ME
NGINX --> DOCS
AG --> PG
AG --> RD
AG -.NATS.-> NT
OR -.NATS.-> NT
LED -.NATS.-> NT
PMS -.NATS.-> NT
WS -.NATS.-> NT
SUB -.NATS.-> NT
OR -.gRPC.-> ME
ENV -.gRPC.-> ME
LED --> PG
PMS --> PG
SUB --> PG
style EC2 fill:#f6f8fa
style DC fill:#ddf4ff
style PM2 fill:#ddf4ff
Host facts¶
From docs/deployment-state.md:14-21:
| Resource | Value |
|---|---|
| AWS account | 094969483885 |
| Instance | i-077d5f14e17fb052c, t3.2xlarge (8 vCPU, 32 GB), us-east-1b, Ubuntu 24.04, 200 GB gp3 root |
| Elastic IP | 34.199.105.99 (alloc eipalloc-02511dc727ab251a9) |
| SSH | ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99 |
| Source tree | /home/ubuntu/qt/ (multi-repo working area) |
| Docs static | /var/www/docs.quanta.emoment.tech/ (rsync target) |
t3.2xlarge is a burstable instance. Under sustained load it will throttle once CPU credits exhaust. Acceptable for demo traffic. Not acceptable for live trading — moving to a c7i.2xlarge or m7i.2xlarge (non-burstable) is on the M5 ops list.
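The throttling risk can be put into numbers. The sketch below is a back-of-envelope calculation, not a measurement of this host. It assumes the instance runs in standard credit mode (in unlimited mode a T3 is billed for surplus credits rather than throttled), and the credit figures are the t3.2xlarge values as recalled from AWS's published T3 table, so they should be confirmed against current AWS documentation. The names BurstableProfile, baselinePercent and hoursUntilThrottled are hypothetical.

```typescript
// Figures for t3.2xlarge as recalled from AWS's published T3 credit table;
// treat them as assumptions and confirm against current AWS documentation.
interface BurstableProfile {
  vcpus: number;
  creditsEarnedPerHour: number; // one credit = one vCPU at 100% for one minute
  maxCreditBalance: number;
}

const T3_2XLARGE: BurstableProfile = {
  vcpus: 8,
  creditsEarnedPerHour: 192,
  maxCreditBalance: 4608,
};

// Average utilisation (percent, across all vCPUs) that earned credits can sustain.
function baselinePercent(profile: BurstableProfile): number {
  return (profile.creditsEarnedPerHour * 100) / (profile.vcpus * 60);
}

// Hours until the credit balance reaches zero at a constant utilisation (standard mode).
function hoursUntilThrottled(
  profile: BurstableProfile,
  utilizationPercent: number,
  startingCredits: number = profile.maxCreditBalance
): number {
  const spentPerHour = (profile.vcpus * 60 * utilizationPercent) / 100;
  const netDrainPerHour = spentPerHour - profile.creditsEarnedPerHour;
  return netDrainPerHour <= 0 ? Infinity : startingCredits / netDrainPerHour;
}

console.log(baselinePercent(T3_2XLARGE));          // sustainable average utilisation, in percent
console.log(hoursUntilThrottled(T3_2XLARGE, 100)); // hours at full load from a full balance
```

Under these assumptions the baseline is 40% and a full credit balance lasts 16 hours at 100% load, which is why demo traffic is fine and a sustained trading day is not.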
Public surface¶
Ten subdomains, all pointing at the same EIP. From deployment-state.md:24-36:
| Subdomain | Routed to | Serves |
|---|---|---|
| quanta.emoment.tech | pm2 quantatrade :3000 | Main Next.js frontend (login / register / dashboard scaffold). Source missing locally: only the deployed .next/ build exists on the host. Recovery is on the M1 follow-up list |
| docs.quanta.emoment.tech | nginx static → /var/www/docs.quanta.emoment.tech/ | MkDocs build of this repo's docs/ |
| api.quanta.emoment.tech | docker api-gateway :3001 | NestJS REST API |
| ws.quanta.emoment.tech | docker ws-gateway :3002 | WebSocket gateway (/health returns 200) |
| presale.quanta.emoment.tech | pm2 qt-presale :3010 | Presale Next.js (/home/ubuntu/qt/presale-app/) |
| dashboard.quanta.emoment.tech | pm2 qt-dashboard :3011 | Investor portal Next.js (/home/ubuntu/qt/investor-dashboard/) |
| admin.quanta.emoment.tech | pm2 qt-admin :3012 | Admin panel Next.js (/home/ubuntu/qt/admin-panel/); see 06-admin-panel.md |
| trade.quanta.emoment.tech | pm2 qt-trade :3013 | Trading UI (/home/ubuntu/qt/trade-ui/) |
| grpc.quanta.emoment.tech | docker envoy :8088 → matching-engine :9090 | gRPC-web endpoint for the admin panel |
| matching.quanta.emoment.tech | matching-engine :8090, but path-locked to /api/v1/accounts/deposit | CI self-heal balance top-up only; all other paths return 404 |
Domain story¶
The platform runs on emoment.tech today. quantatrade.tech is owned but not on the same Cloudflare account as the rest of the zone. A migration is queued but not gating any other work — internal links use emoment.tech for now.
TLS¶
Four Let's Encrypt certs, all expire on or around 2026-07-24 to 2026-07-26. Renewed by certbot systemd timer (daily check, renews when < 30 days remain). Last certbot renew --dry-run was clean on 2026-04-25.
Multi-SAN consolidation: api.quanta.emoment.tech cert also covers ws, presale, dashboard, admin (5 hosts on one cert). grpc.quanta.emoment.tech is single-SAN, separate.
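The renewal rule above (renew once fewer than 30 days remain) can be sketched as a date calculation. This is an illustration of the threshold only, not certbot's code; daysUntilExpiry and isDueForRenewal are hypothetical names, and the expiry date used in the usage lines is the earliest one quoted in this section.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining before a certificate expires.
function daysUntilExpiry(expiresAt: Date, now: Date): number {
  return Math.floor((expiresAt.getTime() - now.getTime()) / MS_PER_DAY);
}

// True once the remaining lifetime drops below the threshold (30 days here).
function isDueForRenewal(expiresAt: Date, now: Date, thresholdDays: number = 30): boolean {
  return daysUntilExpiry(expiresAt, now) < thresholdDays;
}

// Earliest expiry quoted above, checked from this document's "as of" date.
const earliestExpiry = new Date(Date.UTC(2026, 6, 24)); // month is zero-based: 6 = July
const asOfDate = new Date(Date.UTC(2026, 4, 28));       // 4 = May
console.log(daysUntilExpiry(earliestExpiry, asOfDate), isDueForRenewal(earliestExpiry, asOfDate));
```

On the document date the earliest certificate has 57 days left, so the daily timer check would be expected to start renewing around 2026-06-25.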
2. Docker-compose service graph¶
The platform compose stack lives at /home/ubuntu/qt/infrastructure/docker-compose.yml with patches in docker-compose.override.yml alongside it. Neither file is in this repo or the quantatrade-slippage working tree — they are host-only artefacts. The compose service list is inferred from deployment-state.md:47-65 and the per-service Dockerfile declarations in services/*/Dockerfile.
Containers¶
12 services, all running with restart: unless-stopped, all networked on the default compose bridge.
| Service | Image | Host port | Container port | Healthcheck | Depends on |
|---|---|---|---|---|---|
| postgres | postgres:16-alpine | 127.0.0.1:5432 | 5432 | pg_isready | none |
| redis | redis:7-alpine | 127.0.0.1:6379 | 6379 | redis-cli ping | none |
| nats | nats:2-alpine | 127.0.0.1:4222 | 4222 | TCP probe | none |
| api-gateway | ghcr.io/quantatradeai/platform-api-gateway:latest | 127.0.0.1:3001 | 3001 | GET /api/v1/health (overridden; image default /health 404s) | postgres, redis, nats |
| ws-gateway | ghcr.io/.../ws-gateway:latest | 127.0.0.1:3002 | 8080 (image) → overridden to 3002 | node -e probe on :3002 (image had no wget/curl) | nats |
| order-router | ghcr.io/.../order-router:latest | 127.0.0.1:3006 | 3006 (+ 9092 gRPC) | image default /health | nats, redis, matching-engine |
| ledger-service | ghcr.io/.../ledger-service:latest | 127.0.0.1:3007 | 3007 | image default /health | nats, postgres, temporal |
| pms-service | ghcr.io/.../pms-service:latest | 127.0.0.1:3008 | 3008 (overridden; image set :3007) | overridden :3008 | nats, postgres |
| risk-service | ghcr.io/.../risk-service:latest | 127.0.0.1:3009 | 3009 | image default /health | nats, postgres |
| subscription-service | ghcr.io/.../subscription-service:latest | 127.0.0.1:3060 | 3060 | image default /health | nats |
| matching-engine | quantatrade-matching-engine:local (built on EC2 from QuantaTradeAI/exchange-core) | 127.0.0.1:8090 (HTTP), 127.0.0.1:9090 (gRPC) | 8090 / 9090 | Spring Actuator /actuator/health | postgres |
| grpc-web-proxy (Envoy) | envoyproxy/envoy:v1.29-latest | 127.0.0.1:8088 (data), 127.0.0.1:9901 (admin) | 8088 / 9901 | TCP probe | matching-engine |
All container ports are bound to 127.0.0.1, so public access goes only through nginx. The matching engine's HTTP port is reachable from outside only through the matching.quanta.emoment.tech subdomain, which nginx path-locks to the deposit endpoint.
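The "Depends on" column implies a start order. As a rough sketch, the dependency data from the table can be sorted topologically to show which containers must come up first. The function name composeStartOrder is hypothetical, and the temporal entry listed for ledger-service is left out on purpose because §3 states that no Temporal server is in the compose stack. Whether compose also waits for a dependency to be healthy depends on the depends_on condition in the host-only compose file, which is not visible from this repo.

```typescript
// Dependency data copied from the Containers table ("temporal" omitted, see lead-in).
const composeDependsOn: Record<string, string[]> = {
  "postgres": [],
  "redis": [],
  "nats": [],
  "api-gateway": ["postgres", "redis", "nats"],
  "ws-gateway": ["nats"],
  "order-router": ["nats", "redis", "matching-engine"],
  "ledger-service": ["nats", "postgres"],
  "pms-service": ["nats", "postgres"],
  "risk-service": ["nats", "postgres"],
  "subscription-service": ["nats"],
  "matching-engine": ["postgres"],
  "grpc-web-proxy": ["matching-engine"],
};

// Depth-first topological sort: every service appears after all of its dependencies.
function composeStartOrder(graph: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (service: string): void => {
    const seen = state.get(service);
    if (seen === "done") return;
    if (seen === "visiting") throw new Error("dependency cycle at " + service);
    state.set(service, "visiting");
    const deps = graph[service] || [];
    for (const dep of deps) visit(dep);
    state.set(service, "done");
    order.push(service);
  };

  for (const service of Object.keys(graph)) visit(service);
  return order;
}

console.log(composeStartOrder(composeDependsOn).join(" -> "));
```

The sort makes one operational point visible: postgres and nats sit under almost everything, and order-router cannot start cleanly until matching-engine is up.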
docker-compose.override.yml patches¶
Four image-baked defaults required overrides at deploy time (deployment-state.md:66-75):
- api-gateway healthcheck: image used wget … /health (404). Overridden to /api/v1/health (the real path).
- ws-gateway: env was missing JWT_SECRET (caused crash loop); image healthcheck on :8080 overridden to :3002 via node -e (image has no wget/curl).
- pms-service healthcheck: image used :3007 but PORT env sets :3008. Overridden to :3008.
- ledger-service: image was built without the @quantatrade/logger workspace package compiled (Dockerfile builds 5 of 6 workspace packages, skips logger). Locally-built dist/ bind-mounted from /home/ubuntu/qt/platform/packages/logger/dist/. Image was rebuilt on the host 2026-04-26; the bind-mount remains as belt-and-braces.
The override file lives at /home/ubuntu/qt/infrastructure/docker-compose.override.yml on the instance. It is not yet in any GitHub repo, so rebuilding the host from anything other than a disk snapshot requires manually recreating the override.
Image source¶
Most images come from ghcr.io/quantatradeai/platform-<service>:latest. The CI workflow (.github/workflows/ci.yml, see §6) does not build or push them today; they are pushed manually from the deploy operator's laptop. The matching-engine image is built on the EC2 host from a checkout of QuantaTradeAI/exchange-core because the Java build is heavyweight and no GHA matrix has been wired for it yet.
Dockerfile shape per TS service¶
All seven TS service Dockerfiles follow the same pattern (verified across all of services/*/Dockerfile):
# Build stage
FROM node:22-alpine AS builder
WORKDIR /app
RUN apk add --no-cache python3 make g++ openssl # native deps + Prisma engine
COPY package.json package-lock.json tsconfig.base.json ./
COPY packages ./packages
COPY services/<svc> ./services/<svc>
RUN npm ci --workspace=@quantatrade/<svc> --include-workspace-root
RUN npx prisma generate --schema=packages/db/prisma/schema.prisma # api-gateway only
RUN npm run build --workspace=@quantatrade/common
# … build each workspace package, then the service …
RUN npm run build --workspace=@quantatrade/<svc>
# Runtime
FROM node:22-alpine AS production
WORKDIR /app
RUN apk add --no-cache openssl # Prisma engine
RUN addgroup -g 1001 -S nodejs && adduser -S nestjs -u 1001 -G nodejs
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/packages ./packages
COPY --from=builder /app/services/<svc>/dist ./services/<svc>/dist
COPY --from=builder /app/services/<svc>/package.json ./services/<svc>/
ENV NODE_ENV=production
ENV PORT=<svc-port>
USER nestjs
EXPOSE <svc-port>
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
CMD wget -q --spider http://localhost:<svc-port>/health || exit 1
WORKDIR /app/services/<svc>
CMD ["node", "dist/main.js"]
Citation: services/api-gateway/Dockerfile:1-72 (the most fleshed-out; others differ only in package list and port).
Non-root user (nestjs:1001), unprivileged ports, Node 22 alpine. No multi-arch builds (linux/amd64 only — the EC2 host is x86_64).
3. Environment variables — per service¶
Each service's src/config/index.ts (or config.validation.ts) defines a Zod schema. Required + commonly-set vars below; defaults are from those schemas.
api-gateway (services/api-gateway/src/config/config.validation.ts:37-92)¶
| Variable | Required | Default | Purpose |
|---|---|---|---|
| JWT_SECRET | yes (≥ 32 chars; enforced at :46) | none | HS256 token signing |
| JWT_EXPIRES_IN | no | 8h | Access token TTL |
| DATABASE_PASSWORD | yes in prod (:117) | exchange_dev in dev | Postgres credential |
| DATABASE_HOST / _PORT / _USER / _NAME | no | localhost / 5432 / exchange / exchange | Postgres connection |
| REDIS_HOST / _PORT / _PASSWORD | no | localhost / 6379 / none | Cache, token blacklist |
| NATS_URL | no | nats://localhost:4222 | Inter-service bus |
| MATCHING_ENGINE_URL | no | http://localhost:8090 | REST fallback to engine |
| MATCHING_ENGINE_GRPC_URL | no | localhost:9090 | gRPC to engine |
| SUMSUB_APP_TOKEN / _SECRET_KEY / _BASE_URL | optional | none / none / https://api.sumsub.com | KYC provider (M4; not yet wired in prod) |
| CORS_ORIGINS | no | http://localhost:3000 | Comma-separated allowlist |
| RATE_LIMIT_TTL / _MAX | no | 60 / 100 | NestJS throttler |
| BODY_SIZE_LIMIT_KB | no | 100 | Default body limit (10 MB for uploads via _UPLOAD_KB) |
| REQUEST_TIMEOUT_MS | no | 30000 | Per-request timeout |
| LOG_LEVEL | no | info (prod) / debug (dev) | Pino log level |
| LOG_REQUESTS / LOG_REQUEST_BODY / LOG_RESPONSE_BODY | no | true / false / false | Request audit logging |
| AUDIT_ASYNC_LOGGING | no | false | Buffer audit writes (see audit/audit.module.ts:25) |
Hard-fail at boot if JWT_SECRET or (in prod) DATABASE_PASSWORD is missing — validateConfig() throws and the process exits (config.validation.ts:109-126).
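The fail-fast behaviour can be illustrated without the real schema. The sketch below is a hand-rolled stand-in, not the Zod schema in config.validation.ts: validateGatewayEnv, GatewayConfigSketch and EnvMap are hypothetical names, and only the rules stated in the table above (JWT_SECRET of at least 32 characters, DATABASE_PASSWORD mandatory in production, the listed defaults) are modelled.

```typescript
type EnvMap = Record<string, string | undefined>;

interface GatewayConfigSketch {
  jwtSecret: string;
  jwtExpiresIn: string;
  databasePassword: string;
  natsUrl: string;
}

// Collects every violation, then throws once so the operator sees all problems together.
function validateGatewayEnv(env: EnvMap): GatewayConfigSketch {
  const errors: string[] = [];
  const isProd = env["NODE_ENV"] === "production";

  const jwtSecret = env["JWT_SECRET"] || "";
  if (jwtSecret.length < 32) {
    errors.push("JWT_SECRET must be set and at least 32 characters long");
  }

  // Outside production the documented dev default applies; production has no fallback.
  const rawPassword = env["DATABASE_PASSWORD"] || "";
  const databasePassword = rawPassword !== "" ? rawPassword : isProd ? "" : "exchange_dev";
  if (databasePassword === "") {
    errors.push("DATABASE_PASSWORD is required in production");
  }

  if (errors.length > 0) {
    // An uncaught throw during bootstrap is what stops the process before it serves traffic.
    throw new Error("Invalid configuration: " + errors.join("; "));
  }

  return {
    jwtSecret,
    jwtExpiresIn: env["JWT_EXPIRES_IN"] || "8h",
    databasePassword,
    natsUrl: env["NATS_URL"] || "nats://localhost:4222",
  };
}
```

Calling a function of this shape first thing at startup means a container with a bad .env crash-loops visibly in docker compose ps instead of serving requests with a weak or missing secret.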
ledger-service (services/ledger-service/src/config/index.ts:22-32)¶
| Variable | Required | Default | Purpose |
|---|---|---|---|
| PORT | no | 3007 | HTTP health |
| NODE_ENV | no | development | Mode |
| NATS_URL | no | nats://localhost:4222 | RPC bus for ledger.credit/debit/lock/unlock/settleTrade |
| TEMPORAL_ADDRESS | no | localhost:7233 | Temporal cluster for tradeSettlementWorkflow |
| TEMPORAL_NAMESPACE | no | default | Temporal namespace |
| DATABASE_URL | yes (via @quantatrade/db shared schema) | none | Postgres connection; same instance as api-gateway |
Note: ledger-service runs a Temporal worker (services/ledger-service/src/worker.ts) in addition to the HTTP server. Both are launched from main.ts. The Temporal server itself is not in the compose stack today — TEMPORAL_ADDRESS defaults to localhost:7233 which has nothing listening in prod. Settlement is functionally NATS-only right now; the Temporal workflow is the planned-durable path (see §8).
order-router (services/order-router/src/config/index.ts:26-79)¶
| Variable | Required | Default | Purpose |
|---|---|---|---|
| PORT | no | 3006 | HTTP health |
| NATS_URL | no | nats://localhost:4222 | Order events |
| REDIS_HOST / _PORT / _PASSWORD | no | localhost / 6379 / none | Order state persistence + dedupe |
| MATCHING_ENGINE_URL | no | http://localhost:8090 | REST to engine |
| MATCHING_ENGINE_WS_URL | no | ws://localhost:8090/ws | Engine event stream |
| MATCHING_ENGINE_GRPC_URL | no | see schema | gRPC PlaceOrder/Cancel |
| RISK_MAX_ORDER_VALUE_USD | no | 1_000_000 | Per-order cap |
| RISK_MAX_DAILY_VOLUME_USD | no | 10_000_000 | Per-user daily cap |
| SERVICE_API_KEY / SERVICE_API_SECRET | yes (per .env.example:6-7) | none | Service-auth to matching engine. Hardcoded fallbacks were removed for security |
pms-service (services/pms-service/src/config/index.ts)¶
| Variable | Required | Default | Purpose |
|---|---|---|---|
| PORT | no | 3007 (collides with ledger; see Dockerfile override) | HTTP |
| JWT_SECRET | yes in prod | pms-secret-change-in-production 🔴 | Hard-coded fallback is a known weakness |
| JWT_EXPIRES_IN | no | 8h | - |
| MATCHING_ENGINE_URL / _WS_URL | no | http://localhost:8090 / ws://localhost:8090/ws | Engine connection |
| LEDGER_SERVICE_URL | no | http://localhost:3004 (wrong; should be :3007) | Ledger calls (latent bug; works only because pms doesn't call ledger over HTTP in practice; NATS is used instead) |
| PRICE_FEED_URL | no | http://localhost:8090/api/prices | Mark-to-market |
| PNL_UPDATE_INTERVAL_MS / PNL_SNAPSHOT_INTERVAL_MS | no | 5000 / 60000 | P&L refresh cadence |
| BVI_WEBHOOK_ID | optional | none | BVI Financial Services Commission reporting |
| TIMESCALEDB_URL | optional | postgresql://marketdata:marketdata_dev@localhost:5434/marketdata | Time-series for tick history (not yet provisioned in prod) |
| FIFO_BASE_CURRENCY | no | USD | FIFO P&L accounting |
ws-gateway (services/ws-gateway/src/config/index.ts:28-53)¶
| Variable | Required | Default | Purpose |
|---|---|---|---|
| PORT | no | 3002 | WebSocket listener |
| JWT_SECRET | yes (≥ 32 chars; comes from jwtConfigSchema) | none | WS handshake auth |
| NATS_URL | no | nats://localhost:4222 | Source of trade / order events |
| WS_COMPRESSION | no | true | permessage-deflate |
| WS_MAX_PAYLOAD_KB | no | 16 | Per-frame limit |
| WS_IDLE_TIMEOUT | no | 120 (seconds) | Drop idle connections |
| WS_MAX_SUBSCRIPTIONS | no | 50 | Per-connection cap |
| WS_RATE_LIMIT_PER_SECOND | no | 100 | Per-connection msg rate |
| WS_MAX_MESSAGE_LENGTH | no | 4096 (chars) | Inbound message size |
| WS_MAX_CHANNEL_LENGTH | no | 100 | Subscription channel name length |
risk-service, subscription-service¶
Smaller surfaces — see services/risk-service/src/config/ and services/subscription-service/.env.example. The subscription service uses SQLite (DATABASE_URL=file:./subscription.db, subscription-service/.env.example:4) — not Postgres. Its persistence is container-local (the file is inside the container's writable layer, not a volume). This is a defect — restarting the container loses subscription state. Migration to Postgres is on the M4 list.
Admin panel envs (admin-panel/next.config.js:5-13)¶
These get baked into the static Next.js build at deploy time:
| Variable | Default | Purpose |
|---|---|---|
| MATCHING_ENGINE_URL | http://localhost:8090 | REST fallback path (currently unused in prod; gRPC-web is primary) |
| GRPC_WEB_URL | http://localhost:8088 | Connect-protocol endpoint; in prod set to https://grpc.quanta.emoment.tech |
| USE_GRPC | true | Set to false to force REST-only |
| API_GATEWAY_URL | http://localhost:3001 | For pages that call platform REST (positions, treasury) |
| LEDGER_SERVICE_URL | http://localhost:3004 (wrong default; should be :3007) | Not yet called by any page |
| RISK_SERVICE_URL | http://localhost:3005 (wrong; should be :3009) | Not yet called |
| CUSTODY_SERVICE_URL | http://localhost:3006 | Treasury page (M2) |
| KYC_SERVICE_URL | http://localhost:3009 (collides with risk) | KYC page (M4) |
The default URLs are dev-machine values. Production overrides come from /home/ubuntu/qt/admin-panel/.env.production. These defaults must not ship to prod; review the override file when redeploying.
4. Secrets management¶
🔴 Current state: every secret lives in a per-host .env file. There is no Vault, no AWS Secrets Manager, no rotation. The CLAUDE.md rule "never copy .env between hosts" is the only thing protecting prod from dev-credentials leakage.
Where secrets live on the EC2 host¶
| File | Holds | Read by |
|---|---|---|
| /home/ubuntu/qt/infrastructure/.env | DATABASE_PASSWORD, JWT_SECRET, NATS_URL, REDIS_PASSWORD, SERVICE_API_KEY, SERVICE_API_SECRET | docker-compose env_file: for every container |
| /home/ubuntu/qt/admin-panel/.env.production | GRPC_WEB_URL=https://grpc.quanta.emoment.tech, NEXT_PUBLIC_* | next build bakes them into the static bundle |
| /home/ubuntu/qt/trade-ui/.env.production | NEXT_PUBLIC_API_BASE, NEXT_PUBLIC_WS_BASE | Trade UI build |
| /home/ubuntu/qt/presale-app/.env.production | NEXT_PUBLIC_WC_PROJECT_ID, contract addresses | Missing in prod; see deployment-state.md:248 |
What's missing¶
- No rotation. JWT_SECRET has been the same since the instance was provisioned. KYC/AML compliance will eventually require a quarterly rotation.
- No central audit. We don't know which secrets exist on which host; there's no inventory beyond grepping the host.
- Plain text on disk. chmod 600 .env, ubuntu user owns; no LUKS, no envelope encryption.
- SUMSUB_* / BITGO_* envs are unset in prod today because we don't have credentials yet. When they arrive (M2 / M4), this section needs a refresh.
Path forward¶
Smallest defensible change: move to AWS Secrets Manager + IAM role on the instance, fetched at container start via an entrypoint shim. ~2 days of work; on the M5 ops list.
5. nginx reverse proxy¶
🔴 nginx config files are not in this repo or any working repo. They live on the EC2 instance at /etc/nginx/sites-available/ and have not been version-controlled. The only copies are on the live host and in its EBS snapshots. If both were lost, the config would have to be rewritten by hand.
What we know from deployment-state.md:25-37 (the Backend (host:port) column gives the proxy targets) and :191-199 (the only nginx snippet committed anywhere):
Routing rules (inferred)¶
# Each subdomain is a separate server block at /etc/nginx/sites-available/
server {
listen 443 ssl http2;
server_name api.quanta.emoment.tech;
ssl_certificate /etc/letsencrypt/live/api.quanta.emoment.tech/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/api.quanta.emoment.tech/privkey.pem;
location / {
proxy_pass http://127.0.0.1:3001; # api-gateway
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
The one nginx fragment we do have¶
For matching.quanta.emoment.tech (CI self-heal subdomain) — deployment-state.md:193-199:
location = /api/v1/accounts/deposit {
limit_except POST OPTIONS { deny all; }
proxy_pass http://127.0.0.1:8090;
}
location / { return 404; }
This is the entire pattern: scope the path narrowly, deny everything else. Worth porting back into a versioned infrastructure/nginx/ directory.
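To make the pattern concrete, the decision that fragment encodes can be written as a small pure function. This models nginx's behaviour as understood from its documentation (an exact-match location, limit_except rejecting other methods with 403, a catch-all returning 404); it is not code that runs anywhere in the stack, and gateMatchingRequest and GateDecision are hypothetical names.

```typescript
type GateDecision =
  | { action: "proxy"; upstream: string }
  | { action: "reject"; status: 403 | 404 };

// Models the two location blocks for matching.quanta.emoment.tech.
// `path` is the request path without the query string.
function gateMatchingRequest(method: string, path: string): GateDecision {
  // `location = /api/v1/accounts/deposit` is an exact match; everything else
  // falls through to `location / { return 404; }`.
  if (path !== "/api/v1/accounts/deposit") {
    return { action: "reject", status: 404 };
  }
  // `limit_except POST OPTIONS { deny all; }` denies every other method.
  if (method !== "POST" && method !== "OPTIONS") {
    return { action: "reject", status: 403 };
  }
  return { action: "proxy", upstream: "http://127.0.0.1:8090" };
}

console.log(gateMatchingRequest("POST", "/api/v1/accounts/deposit"));
console.log(gateMatchingRequest("GET", "/api/v1/orders"));
```

Writing the rule down this way also gives a ready-made table of cases to replay with curl after any nginx change: the deposit POST passes, a GET on the same path is refused, and every other path is a 404.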
TLS termination¶
All TLS terminates at nginx via Let's Encrypt certs. nginx → upstream containers is plain HTTP on 127.0.0.1. The matching-engine gRPC port is also 127.0.0.1-only — Envoy (which is itself 127.0.0.1:8088) is the only proxy that bridges public TLS to internal gRPC.
Rate limiting¶
nginx-level rate limiting is not configured today (limit_req_zone does not appear in any committed config). Per-route rate limiting exists at NestJS throttler level inside api-gateway (RATE_LIMIT_TTL=60, RATE_LIMIT_MAX=100 per IP per minute — config.validation.ts:75-76).
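What "100 requests per IP per 60 seconds" means in practice can be shown with a fixed-window counter. This is a minimal sketch of the limit, not the NestJS throttler's implementation (which may count differently); FixedWindowLimiter is a hypothetical class and the clock is passed in explicitly so the behaviour is easy to test.

```typescript
// Fixed-window limiter: at most `maxRequests` per key in each `ttlSeconds` window.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { startedAt: number; count: number }>();

  constructor(
    private readonly ttlSeconds: number,
    private readonly maxRequests: number
  ) {}

  // Returns true if the request is allowed; `nowMs` is the current time in milliseconds.
  allow(key: string, nowMs: number): boolean {
    const current = this.windows.get(key);
    if (current === undefined || nowMs - current.startedAt >= this.ttlSeconds * 1000) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { startedAt: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.maxRequests) {
      return false;
    }
    current.count += 1;
    return true;
  }
}

// Mirrors the documented defaults: RATE_LIMIT_TTL=60, RATE_LIMIT_MAX=100.
const demoLimiter = new FixedWindowLimiter(60, 100);
console.log(demoLimiter.allow("203.0.113.7", Date.now()));
```

Because this limit lives inside api-gateway, it protects only that service. Requests to the front-ends, ws-gateway and Envoy pass through nginx unthrottled until a limit_req_zone is added.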
CORS¶
api-gateway sets CORS via the NestJS enableCors() call, reading from the CORS_ORIGINS env (config.validation.ts:41). Default is http://localhost:3000 — in prod the production frontend domains are added explicitly. nginx is not in the CORS path — it transparently forwards Origin and Access-Control-* headers.
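A comma-separated allowlist of this kind is typically split and trimmed before it is handed to the CORS layer. The sketch below assumes that shape; parseCorsOrigins and isOriginAllowed are hypothetical helpers, not functions from the api-gateway source, and the production origins in the usage lines are examples built from the subdomains listed in §1.

```typescript
// Splits a CORS_ORIGINS value into an exact-match allowlist.
// Falls back to the documented dev default when the variable is unset.
function parseCorsOrigins(raw: string | undefined): string[] {
  const value = raw === undefined ? "http://localhost:3000" : raw;
  return value
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// Origins are compared as exact strings: scheme, host and port must all match.
function isOriginAllowed(requestOrigin: string, allowlist: string[]): boolean {
  return allowlist.indexOf(requestOrigin) !== -1;
}

const prodAllowlist = parseCorsOrigins(
  "https://trade.quanta.emoment.tech, https://admin.quanta.emoment.tech"
);
console.log(isOriginAllowed("https://trade.quanta.emoment.tech", prodAllowlist));
console.log(isOriginAllowed("http://localhost:3000", prodAllowlist));
```

The second call shows why the default must not reach prod: with CORS_ORIGINS unset, only http://localhost:3000 is allowed and every real front-end is blocked by the browser.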
6. CI/CD posture¶
CI today (.github/workflows/ci.yml)¶
Full content of the workflow at /Users/pk/ws/quantatrade-slippage/.github/workflows/ci.yml:
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  lint-test-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci || npm install
      - name: Lint (best effort — non-blocking until normalized)
        run: npm run lint --if-present || true
      - name: Type check
        run: npx tsc --noEmit --skipLibCheck || true
      - name: Test (non-blocking until each service defines real test scripts)
        run: npm test --if-present || true
      - name: Build
        run: npm run build --if-present
🟡 Lint, type-check, and test all swallow failures (|| true). Only build failures block merges. Hardening to remove || true is M1 follow-up — every service needs a clean lint + tsc pass first.
🔴 No image push step. The compose stack pulls ghcr.io/quantatradeai/platform-<service>:latest but nothing in this workflow builds or pushes those images. The images are built separately — historically by manual docker build && docker push from the deploy operator's laptop. This is fragile and is on the immediate fix list.
🔴 No deploy step. There is no CD. Deploys are manual SSH + docker compose pull + docker compose up -d.
CD today¶
Manual. The workflow:
- PR merged to main on QuantaTradeAI/platform.
- CI passes (or doesn't; only build is enforced).
- Operator builds images locally and pushes to GHCR (or relies on a stale :latest).
- Operator SSHes to 34.199.105.99.
- cd /home/ubuntu/qt/infrastructure && docker compose pull && docker compose up -d --remove-orphans.
- Operator manually verifies via /services admin page or curl https://api.quanta.emoment.tech/api/v1/health.
Existing CD for the trading-ui repo¶
The one bright spot — QuantaTradeAI/trading-ui/.github/workflows/e2e.yml runs Playwright e2e tests against the live trade.quanta.emoment.tech on every push/PR. Self-heals test balance via the matching.quanta.emoment.tech deposit subdomain. First green run 2026-04-29 — 26 specs in 1m12s. See deployment-state.md:177-203.
Path forward¶
Smallest defensible change (1 week of work):
- Add a docker/build-push-action matrix to ci.yml: push per-service images on tag push.
- Add a deploy workflow triggered by tag: docker compose pull && up -d over SSH (using appleboy/ssh-action).
- Remove the || true swallowing once each service has a real lint + test target.
7. Observability¶
Metrics¶
🟡 @quantatrade/metrics exists but is currently a stub (packages/metrics/src/index.ts:4):
Stub metrics package providing Prometheus-compatible metric types. All metric operations are no-ops until a real implementation (e.g. prom-client) is wired in.
The shape is correct — Counter, Gauge, Histogram, Summary, and service-specific bundles (createApiGatewayMetrics, createOrderRouterMetrics, createWsGatewayMetrics) are all defined and consumed by the services. But .inc() / .observe() / .set() are all no-ops (packages/metrics/src/index.ts:20-33). The /metrics endpoint on api-gateway returns a header-only response (# HELP stub metrics for api-gateway).
Wiring prom-client into the registry is a one-day task. No Prometheus / Grafana / Datadog target is configured today — there is nowhere for metrics to flow even if the stub were replaced.
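The reason the swap is cheap is that services depend on the metric shape, not on what sits behind it. The sketch below illustrates that idea with hypothetical types (CounterLike, NoopCounter, InMemoryCounter) and a hypothetical metric name; it is neither the @quantatrade/metrics source nor prom-client, only a minimal counter that renders the Prometheus text exposition format.

```typescript
// The shape a service codes against.
interface CounterLike {
  inc(amount?: number): void;
}

// Behaves like the stub described above: calls succeed but nothing is recorded.
class NoopCounter implements CounterLike {
  inc(_amount?: number): void {
    // intentionally empty
  }
}

// A minimal real counter that can render itself in Prometheus text format.
class InMemoryCounter implements CounterLike {
  private value = 0;

  constructor(
    private readonly metricName: string,
    private readonly help: string
  ) {}

  inc(amount: number = 1): void {
    this.value += amount;
  }

  render(): string {
    return (
      "# HELP " + this.metricName + " " + this.help + "\n" +
      "# TYPE " + this.metricName + " counter\n" +
      this.metricName + " " + this.value + "\n"
    );
  }
}

// Call sites look identical for both implementations.
function recordOrderAccepted(counter: CounterLike): void {
  counter.inc();
}

const ordersAccepted = new InMemoryCounter("orders_accepted_total", "Orders accepted by the router");
recordOrderAccepted(ordersAccepted);
recordOrderAccepted(new NoopCounter());
console.log(ordersAccepted.render());
```

Replacing the stub therefore should not require touching service code, only the package internals and the /metrics handler.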
Logs¶
🟢 Structured logging via @quantatrade/logger (packages/logger/src/). Pino-backed. Two methods that get used everywhere: .error(msg, err, meta) and .logTrade(meta). The .logError / .logTrade calls were a recent fix (see docs/milestone-1-status.md — "Resolved 2026-05-27: ledger logger bug").
Container logs end up at the Docker daemon's default location (/var/lib/docker/containers/<id>/<id>-json.log). No log shipping → no Loki, CloudWatch, or Splunk target. Operator reads logs with docker compose logs -f <service> over SSH.
Tracing¶
🔴 None. No OpenTelemetry exporters, no Jaeger, no Datadog APM.
Healthchecks¶
Every container has a docker-level HEALTHCHECK. docker compose ps shows status as (healthy) / (unhealthy) / (starting). The admin panel's /services page polls these endpoints from the browser side via api.getServiceHealth(serviceKey).
Alerting¶
🔴 None. No PagerDuty, no Slack alert pipeline, no on-call rotation. If the host goes down, the next person to look at trade.quanta.emoment.tech finds out.
Cost / value trade-off¶
For the current M1 demo posture this is acceptable. Before any paying-customer traffic: wire prom-client + push to Grafana Cloud (free tier covers 10K series), add CloudWatch container insights, write 4 alarms (instance down / disk > 80% / matching-engine 5xx / no trades in 5 min).
8. Backups¶
EBS snapshots (host-level)¶
🟢 DLM (Data Lifecycle Manager) is configured (deployment-state.md:255-262):
| Item | Value |
|---|---|
| IAM role | arn:aws:iam::094969483885:role/AWSDataLifecycleManagerDefaultRole |
| Policy | policy-0066a67ecb6c3daa7 (ENABLED) |
| Schedule | Daily, 03:00 UTC |
| Retention | 7 days |
| Target | EBS volumes tagged Backup=daily (currently the root disk vol-0ddc7e9d1de5a2b59) |
| Tagging | Snapshots tagged SnapshotType=DLM-Daily |
| Baseline | snap-07e001a69835f1973 (manual, pre-DLM) |
Restore = create new EBS volume from a snapshot, attach, fsck, mount, fix /etc/fstab UUID. ~15 minutes to a running new host.
Postgres backups (logical)¶
🔴 None today. No pg_dump cron, no continuous archiving, no AWS RDS (Postgres runs in-container with a Docker volume). The EBS snapshot is the only Postgres backup — sufficient for crash recovery, insufficient for point-in-time recovery.
Adding a nightly pg_dump → S3 with 30-day retention is a 30-minute task. It's not yet done.
Temporal-based settlement durability¶
🟡 tradeSettlementWorkflow (services/ledger-service/src/workflows/trade-settlement.ts) is implemented and wired up. Steps (:39-50):
- settleTradeActivity: calls LedgerService.settleTrade({...}). Idempotent via trade ID.
- notifyTradeSettledActivity: publishes trade.settled on NATS for downstream services.
Retry policy: maximumAttempts: 3, initialInterval: 1s, backoffCoefficient: 2, startToCloseTimeout: 30s (trade-settlement.ts:6-14).
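The retry schedule those four numbers produce can be worked out directly. The sketch below assumes Temporal's documented semantics, where maximumAttempts counts the first attempt and each retry waits initialInterval multiplied by backoffCoefficient raised to the number of retries already made. RetryPolicySketch and retryDelaysMs are hypothetical names, not Temporal SDK types.

```typescript
interface RetryPolicySketch {
  maximumAttempts: number;     // total attempts, including the first one
  initialIntervalMs: number;   // wait before the first retry
  backoffCoefficient: number;  // multiplier applied after each retry
}

// Waits (in ms) between consecutive attempts; one fewer entry than attempts.
function retryDelaysMs(policy: RetryPolicySketch): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < policy.maximumAttempts; retry++) {
    delays.push(policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry - 1));
  }
  return delays;
}

// Values quoted from trade-settlement.ts:6-14.
const settlementPolicy: RetryPolicySketch = {
  maximumAttempts: 3,
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
};
console.log(retryDelaysMs(settlementPolicy));
```

With these values an activity is tried at most three times, with waits of 1 s and 2 s in between. If every attempt ran to its full 30 s startToCloseTimeout, the activity would give up after roughly 93 s.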
The worker is launched from services/ledger-service/src/worker.ts:1-23 — maxConcurrentActivityTaskExecutions: 20, maxConcurrentWorkflowTaskExecutions: 20.
But there's no Temporal server in production today: TEMPORAL_ADDRESS defaults to localhost:7233 (config/index.ts:47) and nothing answers there. The worker's connection attempt fails silently, so the workflow path is never exercised. Settlement actually happens via the NATS trade.executed listener inside ledger.ts directly.
Wiring a real Temporal cluster (Temporal Cloud or self-hosted Temporal in compose) is part of the M3 "make settlement crash-safe" workstream.
9. Operational runbooks¶
Restart a service¶
ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure
docker compose restart <service> # graceful; honours stop_grace_period
docker compose ps # confirm (healthy)
For pm2-managed front-ends (admin-panel, trade-ui, presale-app, investor-dashboard, main frontend):
pm2 restart qt-admin # or qt-trade / qt-presale / qt-dashboard / quantatrade
pm2 status # confirm
pm2 logs qt-admin --lines 50 # check post-restart logs
Tail logs¶
# Docker container
docker compose logs -f --tail 100 api-gateway
# Multiple containers
docker compose logs -f api-gateway order-router ledger-service
# pm2 frontend
pm2 logs qt-trade --lines 200
# nginx access log
sudo tail -f /var/log/nginx/access.log
# nginx error log
sudo tail -f /var/log/nginx/error.log
Run a Prisma migration¶
The Prisma schema is in packages/db/prisma/schema.prisma (24 models). Migrations live in packages/db/prisma/migrations/:
- 20240201000000_add_address_pool
- 20260206000000_add_order_internal_id
- 20260208000000_add_user_password_hash
To apply migrations against the live database:
ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure # docker compose exec must run where docker-compose.yml lives (see §2)
# Option A: from inside the api-gateway container (the image has Prisma CLI)
docker compose exec api-gateway npx prisma migrate deploy --schema=/app/packages/db/prisma/schema.prisma
# Option B: db push for non-versioned changes (development convenience — never in prod)
docker compose exec api-gateway npx prisma db push --schema=/app/packages/db/prisma/schema.prisma
⚠ History note (deployment-state.md:109): The initial deploy had no migrations — only the address_pool migration existed; User / Order / Trade / etc. tables were missing. Recovery was prisma db push from inside the api-gateway container, which created 25 tables from the schema. db push is dev-only — never run it on prod once we have customers. The two follow-up migrations (order_internal_id, user_password_hash) were added properly.
Roll back a bad deploy¶
# Tag-based pin (preferred — once we tag images properly)
ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure
# Edit docker-compose.yml: pin <service>'s image: tag from `:latest` to the last-known-good tag
docker compose pull <service>
docker compose up -d <service>
# Or, if rollback requires DB state too: restore from EBS snapshot
# 1. Stop services that depend on Postgres
docker compose stop api-gateway order-router ledger-service pms-service risk-service
# 2. Create EBS volume from yesterday's DLM snapshot
aws ec2 create-volume --snapshot-id snap-<id> --availability-zone us-east-1b --volume-type gp3 --region us-east-1
# 3. Stop instance, detach old volume, attach new, fsck, mount
# 4. Restart instance and bring the stack up
There is no automated rollback — every step above is manual.
Renew a TLS certificate¶
certbot runs daily on a systemd timer. Manual force-renewal:
sudo certbot renew --dry-run # safety check
sudo certbot renew # actual renewal (only renews if < 30 days remain)
sudo certbot renew --force-renewal --cert-name api.quanta.emoment.tech # nuclear option
sudo systemctl reload nginx
Redeploy docs (this repo's docs/)¶
cd /Users/pk/ws/quantatrade
~/Library/Python/3.9/bin/mkdocs build
rsync -az --delete -e "ssh -i ~/.ssh/quantatrade-key.pem" \
site/ ubuntu@34.199.105.99:/var/www/docs.quanta.emoment.tech/
Inspect the docker-compose override¶
ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cat /home/ubuntu/qt/infrastructure/docker-compose.override.yml
(This file is host-only — see §2.)
Self-heal CI test balance¶
order-pipeline.spec.ts in the trading-ui repo deposits 1 B USDT + 1 k BTC into the test user before every CI run. The endpoint is gated by service-auth headers:
curl -X POST https://matching.quanta.emoment.tech/api/v1/accounts/deposit \
-H "Content-Type: application/json" \
-H "x-api-key: $SERVICE_API_KEY" \
-H "x-api-secret: $SERVICE_API_SECRET" \
-H "x-participant-type: SYSTEM" \
-d '{"userId":"cmohcayzs0000n57cskqsutdc","currency":"USDT","amount":"1000000000"}'
10. Known operational gaps¶
🔴 Single host = SPOF. One EC2 instance dies → everything goes down. No multi-AZ, no auto-scaling, no failover. RTO is roughly 15 minutes (restore from EBS snapshot). RPO is 24 hours (daily DLM cadence).
🔴 Manual deploys. No CD pipeline. Operator SSHes in, runs docker compose pull && up -d. No deploy log, no atomic swap, no canary.
🔴 No UAT environment. Every change goes to prod direct. Trading UI has its e2e suite running against prod (which is brave but works for a demo).
🔴 Secrets in plain .env. No Vault, no rotation. JWT_SECRET hasn't rotated since provisioning.
🔴 Metrics stub. @quantatrade/metrics is no-ops. No Prometheus target.
🔴 No alerting. Host down / disk full / matching-engine 5xx — nothing pages anyone.
🔴 No logical Postgres backup. Only EBS-level. PITR is impossible.
🔴 No Temporal in prod. The settlement workflow exists in code, but there is no Temporal server for its worker to connect to.
🔴 nginx config not version-controlled. Reconstructing it from scratch would be from memory + grep on the host.
🔴 docker-compose.override.yml not committed. Same problem.
🟡 pms-service JWT secret has a hard-coded fallback (pms-secret-change-in-production) — must be overridden in prod env.
🟡 subscription-service uses SQLite inside the container — no volume mount, state lost on restart.
🟡 order-router MATCHING_ENGINE_URL/_WS_URL defaulted to localhost:8090 — unreachable from inside its container. Fixed via override env vars pointing at http://matching-engine:8090. The default in the schema is misleading.
🟡 Burstable t3.2xlarge. Under sustained load CPU credits will throttle. Move to c7i.2xlarge or m7i.2xlarge before paying-customer traffic.
🟡 CI swallows lint/type/test failures. Only build is enforced.
🟢 DLM EBS snapshots — working. Tested on 2026-04-26. 7-day retention.
🟢 Certbot auto-renewal — working. Last dry-run successful 2026-04-25.
🟢 pm2 survives reboots: pm2 startup was run on provision, so the five front-ends come back up automatically.
🟢 Docker restart: unless-stopped on every service — containers come back after reboot or crash.
Related¶
- 01-architecture.md: what the services do; this doc is "where they run"
- 03-ledger-accounting.md: Temporal settlement design (the worker that has no server yet)
- 06-admin-panel.md: how the operator UI lands on the matching engine via Envoy
- docs/deployment-state.md: the authoritative day-to-day operations log; this doc summarises and structures it
- docs/forward-plan.md: the prioritised list of operational improvements