
09 — Deployment & Operations

As of 2026-05-28.

What is actually deployed today, how it's wired together, and how an operator restarts / rolls back / debugs it. Code-grounded against QuantaTradeAI/platform, QuantaTradeAI/admin-panel, and the EC2 deployment captured in docs/deployment-state.md.

Bottom line up front: a single EC2 instance hosts every running service. That covers the platform monorepo containers and the matching engine under docker-compose, plus five PM2-managed Next.js front-ends. Cloudflare provides DNS only (not proxying). nginx terminates TLS and reverse-proxies into either localhost ports or the docker-compose network. There is no UAT / staging environment, no CD pipeline, and secrets live in plain .env files on the host. This is fit for an M1 internal demo. It is not fit for paying customers, and the gaps are itemised at the end.


1. Production topology

%%{init: {'theme':'base','themeVariables':{'background':'#ffffff','primaryColor':'#ddf4ff','primaryBorderColor':'#0969da','primaryTextColor':'#0a0a0a','lineColor':'#1f2328','secondaryColor':'#fff8c5','tertiaryColor':'#dafbe1','clusterBkg':'#f6f8fa','clusterBorder':'#d0d7de'}}}%%
graph TB
    subgraph Internet
        OPS[Operator browser]
        CLIENT[Trader browser]
    end

    subgraph CF[Cloudflare]
        DNS[Cloudflare DNS<br/>DNS-only, NOT proxying]
    end

    subgraph EC2["1× EC2 t3.2xlarge — 34.199.105.99 — us-east-1b"]
        NGINX[nginx<br/>TLS terminator<br/>Let's Encrypt certs]

        subgraph PM2["pm2-managed Next.js front-ends"]
            QT[quantatrade :3000<br/>main frontend]
            ADM[qt-admin :3012<br/>admin-panel]
            TRD[qt-trade :3013<br/>trading-ui]
            PRE[qt-presale :3010]
            INV[qt-dashboard :3011]
        end

        subgraph DC[docker-compose stack]
            PG[postgres:16-alpine :5432]
            RD[redis:7-alpine :6379]
            NT[nats:2-alpine :4222]
            AG[api-gateway :3001]
            WS[ws-gateway :3002]
            OR[order-router :3006]
            LED[ledger-service :3007]
            PMS[pms-service :3008]
            RSK[risk-service :3009]
            SUB[subscription-service :3060]
            ME[matching-engine :8090 / :9090]
            ENV[envoy gRPC-web :8088]
        end

        DOCS[/var/www/docs.quanta.emoment.tech<br/>static MkDocs/]
    end

    OPS --> DNS
    CLIENT --> DNS
    DNS -.A record.-> NGINX

    NGINX --> QT
    NGINX --> ADM
    NGINX --> TRD
    NGINX --> PRE
    NGINX --> INV
    NGINX --> AG
    NGINX --> WS
    NGINX --> ENV
    NGINX --> ME
    NGINX --> DOCS

    AG --> PG
    AG --> RD
    AG -.NATS.-> NT
    OR -.NATS.-> NT
    LED -.NATS.-> NT
    PMS -.NATS.-> NT
    WS -.NATS.-> NT
    SUB -.NATS.-> NT

    OR -.gRPC.-> ME
    ENV -.gRPC.-> ME

    LED --> PG
    PMS --> PG
    RSK --> PG

    style EC2 fill:#f6f8fa
    style DC fill:#ddf4ff
    style PM2 fill:#ddf4ff

Host facts

From docs/deployment-state.md:14-21:

| Resource | Value |
|---|---|
| AWS account | 094969483885 |
| Instance | i-077d5f14e17fb052c, t3.2xlarge (8 vCPU, 32 GB), us-east-1b, Ubuntu 24.04, 200 GB gp3 root |
| Elastic IP | 34.199.105.99 (alloc eipalloc-02511dc727ab251a9) |
| SSH | ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99 |
| Source tree | /home/ubuntu/qt/ (multi-repo working area) |
| Docs static | /var/www/docs.quanta.emoment.tech/ (rsync target) |

t3.2xlarge is a burstable instance. Under sustained load it will throttle once CPU credits exhaust. Acceptable for demo traffic. Not acceptable for live trading — moving to a c7i.2xlarge or m7i.2xlarge (non-burstable) is on the M5 ops list.
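To put a number on the throttling risk, here is a back-of-envelope sketch in Python. The credit figures (192 credits earned per hour, 4,608 maximum balance, 40% baseline per vCPU) are AWS's published values for t3.2xlarge and are not taken from deployment-state.md. The sketch also assumes the instance runs in standard credit mode; in unlimited mode it would accrue surplus-credit charges instead of throttling.

```python
# Published T3 figures for t3.2xlarge (assumption: taken from AWS docs, not from this repo).
VCPUS = 8
CREDITS_EARNED_PER_HOUR = 192   # 8 vCPUs x 40% baseline x 60 minutes
MAX_CREDIT_BALANCE = 4608       # 24 hours of earned credits


def hours_until_throttle(balance, utilization):
    """Hours of sustained load before the credit balance reaches zero.

    utilization is the average fraction (0.0 to 1.0) across all vCPUs.
    One CPU credit is one vCPU at 100% for one minute.
    Returns None when the load does not drain the balance.
    """
    spent_per_hour = VCPUS * utilization * 60
    net_drain = spent_per_hour - CREDITS_EARNED_PER_HOUR
    if net_drain <= 0:
        return None
    return balance / net_drain
```

Starting from a full balance, hours_until_throttle(MAX_CREDIT_BALANCE, 1.0) gives 16 hours of all-core load, and a steady 50% load lasts 96 hours. Demo traffic sits well below that; a live order flow may not.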

Public surface

Ten subdomains, all pointing at the same EIP. From deployment-state.md:24-36:

| Subdomain | Routed to | Serves |
|---|---|---|
| quanta.emoment.tech | pm2 quantatrade :3000 | Main Next.js frontend (login / register / dashboard scaffold). Source missing locally: only the deployed .next/ build exists on the host. Recovery is on the M1 follow-up list |
| docs.quanta.emoment.tech | nginx static → /var/www/docs.quanta.emoment.tech/ | MkDocs build of this repo's docs/ |
| api.quanta.emoment.tech | docker api-gateway :3001 | NestJS REST API |
| ws.quanta.emoment.tech | docker ws-gateway :3002 | WebSocket gateway (/health returns 200) |
| presale.quanta.emoment.tech | pm2 qt-presale :3010 | Presale Next.js (/home/ubuntu/qt/presale-app/) |
| dashboard.quanta.emoment.tech | pm2 qt-dashboard :3011 | Investor portal Next.js (/home/ubuntu/qt/investor-dashboard/) |
| admin.quanta.emoment.tech | pm2 qt-admin :3012 | Admin panel Next.js (/home/ubuntu/qt/admin-panel/); see 06-admin-panel.md |
| trade.quanta.emoment.tech | pm2 qt-trade :3013 | Trading UI (/home/ubuntu/qt/trade-ui/) |
| grpc.quanta.emoment.tech | docker envoy :8088 → matching-engine :9090 | gRPC-web endpoint for the admin panel |
| matching.quanta.emoment.tech | matching-engine :8090, but path-locked to /api/v1/accounts/deposit | CI self-heal balance top-up only; all other paths return 404 |

Domain story

The platform runs on emoment.tech today. quantatrade.tech is owned but not on the same Cloudflare account as the rest of the zone. A migration is queued but not gating any other work — internal links use emoment.tech for now.

TLS

Four Let's Encrypt certs, all expiring between 2026-07-24 and 2026-07-26. They are renewed by the certbot systemd timer (daily check; renews when fewer than 30 days remain). The last certbot renew --dry-run was clean on 2026-04-25.

Multi-SAN consolidation: api.quanta.emoment.tech cert also covers ws, presale, dashboard, admin (5 hosts on one cert). grpc.quanta.emoment.tech is single-SAN, separate.
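The renewal rule above is simple date arithmetic. The Python sketch below works out when the daily timer should first act on a cert; the function names are illustrative, and the expiry date would in practice come from something like the output of certbot certificates rather than being typed in.

```python
from datetime import date, timedelta

RENEW_WHEN_DAYS_LEFT_BELOW = 30  # threshold described in this section


def days_left(expiry, today):
    return (expiry - today).days


def renewal_due(expiry, today):
    # True once fewer than 30 days remain, which is when the timer renews.
    return days_left(expiry, today) < RENEW_WHEN_DAYS_LEFT_BELOW


def first_renewal_day(expiry):
    # First calendar day on which fewer than 30 days remain.
    return expiry - timedelta(days=RENEW_WHEN_DAYS_LEFT_BELOW - 1)
```

For the earliest expiry (2026-07-24), 57 days remain as of 2026-05-28, and the first day the timer should renew is 2026-06-25. A cert still showing the old expiry a few days after that date is the signal to investigate.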


2. Docker-compose service graph

The platform compose stack lives at /home/ubuntu/qt/infrastructure/docker-compose.yml with patches in docker-compose.override.yml alongside it. Neither file is in this repo or the quantatrade-slippage working tree — they are host-only artefacts. The compose service list is inferred from deployment-state.md:47-65 and the per-service Dockerfile declarations in services/*/Dockerfile.

Containers

12 containers, all running with restart: unless-stopped, all networked on the default compose bridge.

| Service | Image | Host port | Container port | Healthcheck | Depends on |
|---|---|---|---|---|---|
| postgres | postgres:16-alpine | 127.0.0.1:5432 | 5432 | pg_isready | |
| redis | redis:7-alpine | 127.0.0.1:6379 | 6379 | redis-cli ping | |
| nats | nats:2-alpine | 127.0.0.1:4222 | 4222 | TCP probe | |
| api-gateway | ghcr.io/quantatradeai/platform-api-gateway:latest | 127.0.0.1:3001 | 3001 | GET /api/v1/health (overridden; image default /health 404s) | postgres, redis, nats |
| ws-gateway | ghcr.io/.../ws-gateway:latest | 127.0.0.1:3002 | 8080 (image) → overridden to 3002 | node -e probe on :3002 (image had no wget/curl) | nats |
| order-router | ghcr.io/.../order-router:latest | 127.0.0.1:3006 | 3006 (+ 9092 gRPC) | image default /health | nats, redis, matching-engine |
| ledger-service | ghcr.io/.../ledger-service:latest | 127.0.0.1:3007 | 3007 | image default /health | nats, postgres, temporal |
| pms-service | ghcr.io/.../pms-service:latest | 127.0.0.1:3008 | 3008 (overridden; image set :3007) | overridden :3008 | nats, postgres |
| risk-service | ghcr.io/.../risk-service:latest | 127.0.0.1:3009 | 3009 | image default /health | nats, postgres |
| subscription-service | ghcr.io/.../subscription-service:latest | 127.0.0.1:3060 | 3060 | image default /health | nats |
| matching-engine | quantatrade-matching-engine:local (built on EC2 from QuantaTradeAI/exchange-core) | 127.0.0.1:8090 (HTTP), 127.0.0.1:9090 (gRPC) | 8090 / 9090 | Spring Actuator /actuator/health | postgres |
| grpc-web-proxy (Envoy) | envoyproxy/envoy:v1.29-latest | 127.0.0.1:8088 (data), 127.0.0.1:9901 (admin) | 8088 / 9901 | TCP probe | matching-engine |

All container ports are bound to 127.0.0.1, so public access goes only through nginx. The nearest thing to an exception is matching.quanta.emoment.tech, which exposes the matching engine's HTTP port through nginx but is path-locked to a single endpoint.

docker-compose.override.yml patches

Four image-baked defaults required overrides at deploy time (deployment-state.md:66-75):

  1. api-gateway healthcheck — image used wget … /health (404). Overridden to /api/v1/health (the real path).
  2. ws-gateway — env was missing JWT_SECRET (caused crash loop); image healthcheck on :8080 overridden to :3002 via node -e (image has no wget/curl).
  3. pms-service healthcheck — image used :3007 but PORT env sets :3008. Overridden to :3008.
  4. ledger-service — image was built without the @quantatrade/logger workspace package compiled (Dockerfile builds 5 of 6 workspace packages, skips logger). Locally-built dist/ bind-mounted from /home/ubuntu/qt/platform/packages/logger/dist/. Image was rebuilt on the host 2026-04-26; the bind-mount remains as belt-and-braces.

The override file lives at /home/ubuntu/qt/infrastructure/docker-compose.override.yml on the instance only. It is not yet in any GitHub repo, so a host rebuilt without a usable snapshot would need the override recreated by hand.

Image source

Most images are pulled from ghcr.io/quantatradeai/platform-<service>:latest. The CI workflow (.github/workflows/ci.yml, see §6) does not build or push them yet; today they are built and pushed by hand from the deploy operator's laptop. The matching-engine image is built on the EC2 host from a checkout of QuantaTradeAI/exchange-core because the Java build is heavyweight and no GHA matrix has been wired for it yet.

Dockerfile shape per TS service

All seven TS service Dockerfiles follow the same pattern (verified across all of services/*/Dockerfile):

# Build stage
FROM node:22-alpine AS builder
WORKDIR /app
RUN apk add --no-cache python3 make g++ openssl  # native deps + Prisma engine
COPY package.json package-lock.json tsconfig.base.json ./
COPY packages ./packages
COPY services/<svc> ./services/<svc>
RUN npm ci --workspace=@quantatrade/<svc> --include-workspace-root
RUN npx prisma generate --schema=packages/db/prisma/schema.prisma   # api-gateway only
RUN npm run build --workspace=@quantatrade/common
# … build each workspace package, then the service …
RUN npm run build --workspace=@quantatrade/<svc>

# Runtime
FROM node:22-alpine AS production
WORKDIR /app
RUN apk add --no-cache openssl  # Prisma engine
RUN addgroup -g 1001 -S nodejs && adduser -S nestjs -u 1001 -G nodejs
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/packages ./packages
COPY --from=builder /app/services/<svc>/dist ./services/<svc>/dist
COPY --from=builder /app/services/<svc>/package.json ./services/<svc>/
ENV NODE_ENV=production
ENV PORT=<svc-port>
USER nestjs
EXPOSE <svc-port>
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
  CMD wget -q --spider http://localhost:<svc-port>/health || exit 1
WORKDIR /app/services/<svc>
CMD ["node", "dist/main.js"]

Citation: services/api-gateway/Dockerfile:1-72 (the most fleshed-out; others differ only in package list and port).

Non-root user (nestjs:1001), unprivileged ports, Node 22 alpine. No multi-arch builds (linux/amd64 only — the EC2 host is x86_64).


3. Environment variables — per service

Each service's src/config/index.ts (or config.validation.ts) defines a Zod schema. Required + commonly-set vars below; defaults are from those schemas.

api-gateway (services/api-gateway/src/config/config.validation.ts:37-92)

| Variable | Required | Default | Purpose |
|---|---|---|---|
| JWT_SECRET | yes (≥ 32 chars; enforced at :46) | | HS256 token signing |
| JWT_EXPIRES_IN | no | 8h | Access token TTL |
| DATABASE_PASSWORD | yes in prod (:117) | exchange_dev in dev | Postgres credential |
| DATABASE_HOST / _PORT / _USER / _NAME | no | localhost / 5432 / exchange / exchange | Postgres connection |
| REDIS_HOST / _PORT / _PASSWORD | no | localhost / 6379 / (none) | Cache, token blacklist |
| NATS_URL | no | nats://localhost:4222 | Inter-service bus |
| MATCHING_ENGINE_URL | no | http://localhost:8090 | REST fallback to engine |
| MATCHING_ENGINE_GRPC_URL | no | localhost:9090 | gRPC to engine |
| SUMSUB_APP_TOKEN / _SECRET_KEY / _BASE_URL | optional | (none) / (none) / https://api.sumsub.com | KYC provider (M4; not yet wired in prod) |
| CORS_ORIGINS | no | http://localhost:3000 | Comma-separated allowlist |
| RATE_LIMIT_TTL / _MAX | no | 60 / 100 | NestJS throttler |
| BODY_SIZE_LIMIT_KB | no | 100 | Default body limit (10 MB for uploads via _UPLOAD_KB) |
| REQUEST_TIMEOUT_MS | no | 30000 | Per-request timeout |
| LOG_LEVEL | no | info (prod) / debug (dev) | Pino log level |
| LOG_REQUESTS / LOG_REQUEST_BODY / LOG_RESPONSE_BODY | no | true / false / false | Request audit logging |
| AUDIT_ASYNC_LOGGING | no | false | Buffer audit writes (see audit/audit.module.ts:25) |

Hard-fail at boot if JWT_SECRET or (in prod) DATABASE_PASSWORD is missing — validateConfig() throws and the process exits (config.validation.ts:109-126).
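The fail-fast behaviour can be illustrated outside the codebase. The sketch below is a Python analogue of the two rules stated above, not the Zod schema itself; validate_config here is a stand-in with the same name, and using NODE_ENV == "production" to detect prod is an assumption.

```python
def validate_config(env):
    """Return a config dict, or raise ValueError listing every violated rule."""
    errors = []
    secret = env.get("JWT_SECRET", "")
    if len(secret) < 32:
        errors.append("JWT_SECRET must be set and at least 32 characters long")
    # Assumption: production is detected via NODE_ENV.
    is_prod = env.get("NODE_ENV") == "production"
    if is_prod and not env.get("DATABASE_PASSWORD"):
        errors.append("DATABASE_PASSWORD is required in production")
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "jwt_secret": secret,
        "jwt_expires_in": env.get("JWT_EXPIRES_IN", "8h"),
        "database_password": env.get("DATABASE_PASSWORD", "exchange_dev"),
        "nats_url": env.get("NATS_URL", "nats://localhost:4222"),
    }
```

Collecting every violation before raising means one failed boot reports all missing variables instead of one per restart.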

ledger-service (services/ledger-service/src/config/index.ts:22-32)

| Variable | Required | Default | Purpose |
|---|---|---|---|
| PORT | no | 3007 | HTTP health |
| NODE_ENV | no | development | Mode |
| NATS_URL | no | nats://localhost:4222 | RPC bus for ledger.credit/debit/lock/unlock/settleTrade |
| TEMPORAL_ADDRESS | no | localhost:7233 | Temporal cluster for tradeSettlementWorkflow |
| TEMPORAL_NAMESPACE | no | default | Temporal namespace |
| DATABASE_URL | yes (via @quantatrade/db shared schema) | | Postgres connection; same instance as api-gateway |

Note: ledger-service runs a Temporal worker (services/ledger-service/src/worker.ts) in addition to the HTTP server. Both are launched from main.ts. The Temporal server itself is not in the compose stack today: TEMPORAL_ADDRESS defaults to localhost:7233, which has nothing listening in prod. Settlement is functionally NATS-only right now; the Temporal workflow is the planned durable path (see §8).

order-router (services/order-router/src/config/index.ts:26-79)

| Variable | Required | Default | Purpose |
|---|---|---|---|
| PORT | no | 3006 | HTTP health |
| NATS_URL | no | nats://localhost:4222 | Order events |
| REDIS_HOST / _PORT / _PASSWORD | no | localhost / 6379 / (none) | Order state persistence + dedupe |
| MATCHING_ENGINE_URL | no | http://localhost:8090 | REST to engine |
| MATCHING_ENGINE_WS_URL | no | ws://localhost:8090/ws | Engine event stream |
| MATCHING_ENGINE_GRPC_URL | no | (none; see schema) | gRPC PlaceOrder/Cancel |
| RISK_MAX_ORDER_VALUE_USD | no | 1_000_000 | Per-order cap |
| RISK_MAX_DAILY_VOLUME_USD | no | 10_000_000 | Per-user daily cap |
| SERVICE_API_KEY / SERVICE_API_SECRET | yes (per .env.example:6-7) | | Service-auth to matching engine. Hardcoded fallbacks were removed for security |

pms-service (services/pms-service/src/config/index.ts)

| Variable | Required | Default | Purpose |
|---|---|---|---|
| PORT | no | 3007 (collides with ledger; set to 3008 in prod, see the compose override in §2) | HTTP |
| JWT_SECRET | yes in prod | pms-secret-change-in-production | 🔴 Hard-coded fallback is a known weakness |
| JWT_EXPIRES_IN | no | 8h | |
| MATCHING_ENGINE_URL / _WS_URL | no | http://localhost:8090 / ws://localhost:8090/ws | Engine connection |
| LEDGER_SERVICE_URL | no | http://localhost:3004 (wrong; should be :3007) | Ledger calls (latent bug; works only because pms doesn't call ledger over HTTP in practice, NATS is used instead) |
| PRICE_FEED_URL | no | http://localhost:8090/api/prices | Mark-to-market |
| PNL_UPDATE_INTERVAL_MS / PNL_SNAPSHOT_INTERVAL_MS | no | 5000 / 60000 | P&L refresh cadence |
| BVI_WEBHOOK_ID | optional | | BVI Financial Services Commission reporting |
| TIMESCALEDB_URL | optional | postgresql://marketdata:marketdata_dev@localhost:5434/marketdata | Time-series for tick history (not yet provisioned in prod) |
| FIFO_BASE_CURRENCY | no | USD | FIFO P&L accounting |

ws-gateway (services/ws-gateway/src/config/index.ts:28-53)

| Variable | Required | Default | Purpose |
|---|---|---|---|
| PORT | no | 3002 | WebSocket listener |
| JWT_SECRET | yes (≥ 32 chars; comes from jwtConfigSchema) | | WS handshake auth |
| NATS_URL | no | nats://localhost:4222 | Source of trade / order events |
| WS_COMPRESSION | no | true | permessage-deflate |
| WS_MAX_PAYLOAD_KB | no | 16 | Per-frame limit |
| WS_IDLE_TIMEOUT | no | 120 (seconds) | Drop idle connections |
| WS_MAX_SUBSCRIPTIONS | no | 50 | Per-connection cap |
| WS_RATE_LIMIT_PER_SECOND | no | 100 | Per-connection msg rate |
| WS_MAX_MESSAGE_LENGTH | no | 4096 (chars) | Inbound message size |
| WS_MAX_CHANNEL_LENGTH | no | 100 | Subscription channel name length |

risk-service, subscription-service

Smaller surfaces — see services/risk-service/src/config/ and services/subscription-service/.env.example. The subscription service uses SQLite (DATABASE_URL=file:./subscription.db, subscription-service/.env.example:4) — not Postgres. Its persistence is container-local (the file is inside the container's writable layer, not a volume). This is a defect — restarting the container loses subscription state. Migration to Postgres is on the M4 list.

Admin panel envs (admin-panel/next.config.js:5-13)

These get baked into the static Next.js build at deploy time:

| Variable | Default | Purpose |
|---|---|---|
| MATCHING_ENGINE_URL | http://localhost:8090 | REST fallback path (currently unused in prod; gRPC-web is primary) |
| GRPC_WEB_URL | http://localhost:8088 | Connect-protocol endpoint; in prod set to https://grpc.quanta.emoment.tech |
| USE_GRPC | true | Set to false to force REST-only |
| API_GATEWAY_URL | http://localhost:3001 | For pages that call platform REST (positions, treasury) |
| LEDGER_SERVICE_URL | http://localhost:3004 (wrong default; should be :3007) | Not yet called by any page |
| RISK_SERVICE_URL | http://localhost:3005 (wrong; should be :3009) | Not yet called |
| CUSTODY_SERVICE_URL | http://localhost:3006 | Treasury page (M2) |
| KYC_SERVICE_URL | http://localhost:3009 (collides with risk) | KYC page (M4) |

The default URLs are dev-machine values. Production overrides come from /home/ubuntu/qt/admin-panel/.env.production. These defaults must not ship to prod; review the override file when redeploying.
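A review of the defaults can be mechanised. The Python sketch below compares the admin-panel defaults from the table above against the host ports listed in §2 and reports mismatches and collisions. The mapping from each variable to its intended service is my reading of the variable names, and audit_defaults is an illustrative helper, not something that exists in the repo.

```python
from urllib.parse import urlparse

# Host ports from the compose table in section 2.
COMPOSE_PORTS = {
    "api-gateway": 3001, "ws-gateway": 3002, "order-router": 3006,
    "ledger-service": 3007, "pms-service": 3008, "risk-service": 3009,
    "subscription-service": 3060, "matching-engine": 8090, "grpc-web-proxy": 8088,
}

# Admin-panel defaults, paired with the service each is meant to reach.
# None marks a service with no container in the compose stack yet.
ADMIN_DEFAULTS = {
    "MATCHING_ENGINE_URL": ("http://localhost:8090", "matching-engine"),
    "GRPC_WEB_URL": ("http://localhost:8088", "grpc-web-proxy"),
    "API_GATEWAY_URL": ("http://localhost:3001", "api-gateway"),
    "LEDGER_SERVICE_URL": ("http://localhost:3004", "ledger-service"),
    "RISK_SERVICE_URL": ("http://localhost:3005", "risk-service"),
    "CUSTODY_SERVICE_URL": ("http://localhost:3006", None),
    "KYC_SERVICE_URL": ("http://localhost:3009", None),
}


def audit_defaults(defaults, compose_ports):
    """List defaults that point at the wrong port or at another service's port."""
    owner_of = {port: name for name, port in compose_ports.items()}
    findings = []
    for var, (url, target) in defaults.items():
        port = urlparse(url).port
        if target is not None and compose_ports[target] != port:
            findings.append(f"{var}: port {port}, but {target} listens on {compose_ports[target]}")
        elif target is None and port in owner_of:
            findings.append(f"{var}: port {port} is already used by {owner_of[port]}")
    return findings
```

Run against the values in this document it reports four findings: the two wrong defaults already flagged in the table, the KYC collision with risk-service, and a CUSTODY_SERVICE_URL collision with order-router on :3006.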


4. Secrets management

🔴 Current state: every secret lives in a per-host .env file. There is no Vault, no AWS Secrets Manager, no rotation. The CLAUDE.md rule "never copy .env between hosts" is the only thing protecting prod from dev-credentials leakage.

Where secrets live on the EC2 host

| File | Holds | Read by |
|---|---|---|
| /home/ubuntu/qt/infrastructure/.env | DATABASE_PASSWORD, JWT_SECRET, NATS_URL, REDIS_PASSWORD, SERVICE_API_KEY, SERVICE_API_SECRET | docker-compose env_file: for every container |
| /home/ubuntu/qt/admin-panel/.env.production | GRPC_WEB_URL=https://grpc.quanta.emoment.tech, NEXT_PUBLIC_* | next build bakes them into the static bundle |
| /home/ubuntu/qt/trade-ui/.env.production | NEXT_PUBLIC_API_BASE, NEXT_PUBLIC_WS_BASE | Trade UI build |
| /home/ubuntu/qt/presale-app/.env.production | NEXT_PUBLIC_WC_PROJECT_ID, contract addresses | Missing in prod; see deployment-state.md:248 |

What's missing

  • No rotation. JWT_SECRET has been the same since the instance was provisioned. KYC/AML compliance will eventually require a quarterly rotation.
  • No central audit. We don't know which secrets exist on which host — there's no inventory beyond grepping the host.
  • Plain text on disk. Each .env is chmod 600 and owned by the ubuntu user; there is no LUKS and no envelope encryption.
  • SUMSUB_* / BITGO_* envs are unset in prod today because we don't have credentials yet. When they arrive (M2 / M4), this section needs a refresh.

Path forward

Smallest defensible change: move to AWS Secrets Manager + IAM role on the instance, fetched at container start via an entrypoint shim. ~2 days of work; on the M5 ops list.
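A minimal sketch of such a shim, assuming the secret is stored as a single JSON object of key/value pairs and that boto3 is available where the shim runs. QT_SECRET_ID is a hypothetical variable name, and in the Node-based images the same logic would more naturally be written against the AWS SDK for JavaScript; Python is used here only to show the shape.

```python
import json
import os


def fetch_from_secrets_manager(secret_id):
    # boto3 is imported lazily so the merge logic can be exercised without AWS access.
    # Credentials come from the instance's IAM role; no keys are stored on disk.
    import boto3
    client = boto3.client("secretsmanager", region_name="us-east-1")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def build_env(base_env, secret_json):
    """Overlay the key/value pairs of a JSON secret onto an existing environment."""
    secrets = json.loads(secret_json)
    merged = dict(base_env)
    merged.update({key: str(value) for key, value in secrets.items()})
    return merged


def main(argv):
    # QT_SECRET_ID (hypothetical) names the secret to load.
    secret_id = os.environ["QT_SECRET_ID"]
    env = build_env(dict(os.environ), fetch_from_secrets_manager(secret_id))
    # Replace this process with the real command, e.g. node dist/main.js.
    os.execvpe(argv[1], argv[1:], env)
```

The shim would be invoked as the container entrypoint with the original command as its arguments, so secrets exist only in the process environment and never in a .env file.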


5. nginx reverse proxy

🔴 nginx config files are not in this repo or any working repo. They live on the EC2 instance at /etc/nginx/sites-available/ and have not been version-controlled. The only copies are on the host and in its EBS snapshots; if both were lost, every server block would have to be rewritten from memory.

What we know from deployment-state.md:25-37 (the Backend (host:port) column gives the proxy targets) and :191-199 (the only nginx snippet committed anywhere):

Routing rules (inferred)

# Each subdomain is a separate server block at /etc/nginx/sites-available/
server {
    listen 443 ssl http2;
    server_name api.quanta.emoment.tech;
    ssl_certificate     /etc/letsencrypt/live/api.quanta.emoment.tech/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.quanta.emoment.tech/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3001;   # api-gateway
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

The one nginx fragment we do have

For matching.quanta.emoment.tech (CI self-heal subdomain) — deployment-state.md:193-199:

location = /api/v1/accounts/deposit {
    limit_except POST OPTIONS { deny all; }
    proxy_pass http://127.0.0.1:8090;
}
location / { return 404; }

This is the entire pattern: scope the path narrowly, deny everything else. Worth porting back into a versioned infrastructure/nginx/ directory.

TLS termination

All TLS terminates at nginx via Let's Encrypt certs. Traffic from nginx to the upstream containers is plain HTTP on 127.0.0.1. The matching-engine gRPC port is also 127.0.0.1-only. Envoy (itself bound to 127.0.0.1:8088) is the only route from the public gRPC-web endpoint to that port: nginx terminates TLS and Envoy translates gRPC-web to internal gRPC.

Rate limiting

nginx-level rate limiting is not configured today (limit_req_zone does not appear in any committed config). Per-route rate limiting exists at NestJS throttler level inside api-gateway (RATE_LIMIT_TTL=60, RATE_LIMIT_MAX=100 per IP per minute — config.validation.ts:75-76).
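To make the 100-requests-per-60-seconds limit concrete, here is a small Python illustration of a per-IP counter with a time window. It is a sketch of the general technique only; it does not reproduce the NestJS throttler's internals, and FixedWindowLimiter is a made-up name.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per client within each `ttl_seconds` window."""

    def __init__(self, limit=100, ttl_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.ttl = ttl_seconds
        self.clock = clock
        self._windows = {}  # client IP -> (window start, hits in window)

    def allow(self, client_ip):
        now = self.clock()
        start, hits = self._windows.get(client_ip, (now, 0))
        if now - start >= self.ttl:
            start, hits = now, 0  # window expired, start a new one
        if hits >= self.limit:
            self._windows[client_ip] = (start, hits)
            return False
        self._windows[client_ip] = (start, hits + 1)
        return True
```

Because the state is in process memory, a limiter like this protects one api-gateway instance only. That is fine on a single host, but it is one reason an nginx-level limit_req_zone would be worth adding before scaling out.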

CORS

api-gateway sets CORS via the NestJS enableCors() call, reading from the CORS_ORIGINS env (config.validation.ts:41). Default is http://localhost:3000 — in prod the production frontend domains are added explicitly. nginx is not in the CORS path — it transparently forwards Origin and Access-Control-* headers.


6. CI/CD posture

CI today (.github/workflows/ci.yml)

Full content of the workflow at /Users/pk/ws/quantatrade-slippage/.github/workflows/ci.yml:

name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-test-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci || npm install
      - name: Lint (best effort — non-blocking until normalized)
        run: npm run lint --if-present || true
      - name: Type check
        run: npx tsc --noEmit --skipLibCheck || true
      - name: Test (non-blocking until each service defines real test scripts)
        run: npm test --if-present || true
      - name: Build
        run: npm run build --if-present

🟡 Lint, type-check, and test all swallow failures (|| true). Only build failures block merges. Hardening to remove || true is M1 follow-up — every service needs a clean lint + tsc pass first.

🔴 No image push step. The compose stack pulls ghcr.io/quantatradeai/platform-<service>:latest but nothing in this workflow builds or pushes those images. The images are built separately — historically by manual docker build && docker push from the deploy operator's laptop. This is fragile and is on the immediate fix list.

🔴 No deploy step. There is no CD. Deploys are manual SSH + docker compose pull + docker compose up -d.

CD today

Manual. The workflow:

  1. PR merged to main on QuantaTradeAI/platform.
  2. CI passes (or doesn't — only build is enforced).
  3. Operator builds images locally and pushes to GHCR (or relies on a stale :latest).
  4. Operator SSHes to 34.199.105.99.
  5. cd /home/ubuntu/qt/infrastructure && docker compose pull && docker compose up -d --remove-orphans.
  6. Operator manually verifies via /services admin page or curl https://api.quanta.emoment.tech/api/v1/health.

Existing e2e CI for the trading-ui repo

The one bright spot — QuantaTradeAI/trading-ui/.github/workflows/e2e.yml runs Playwright e2e tests against the live trade.quanta.emoment.tech on every push/PR. Self-heals test balance via the matching.quanta.emoment.tech deposit subdomain. First green run 2026-04-29 — 26 specs in 1m12s. See deployment-state.md:177-203.

Path forward

Smallest defensible change (1 week of work):

  1. Add a docker/build-push-action matrix to ci.yml — push per-service images on tag push.
  2. Add a deploy workflow triggered by tag: docker compose pull && up -d over SSH (using appleboy/ssh-action).
  3. Remove the || true swallowing once each service has a real lint + test target.

7. Observability

Metrics

🟡 @quantatrade/metrics exists but is currently a stub (packages/metrics/src/index.ts:4):

Stub metrics package providing Prometheus-compatible metric types. All metric operations are no-ops until a real implementation (e.g. prom-client) is wired in.

The shape is correct — Counter, Gauge, Histogram, Summary, and service-specific bundles (createApiGatewayMetrics, createOrderRouterMetrics, createWsGatewayMetrics) are all defined and consumed by the services. But .inc() / .observe() / .set() are all no-ops (packages/metrics/src/index.ts:20-33). The /metrics endpoint on api-gateway returns a header-only response (# HELP stub metrics for api-gateway).

Wiring prom-client into the registry is a one-day task. No Prometheus / Grafana / Datadog target is configured today — there is nowhere for metrics to flow even if the stub were replaced.

Logs

🟢 Structured logging via @quantatrade/logger (packages/logger/src/). Pino-backed. Two methods that get used everywhere: .error(msg, err, meta) and .logTrade(meta). The .logError / .logTrade calls were a recent fix (see docs/milestone-1-status.md — "Resolved 2026-05-27: ledger logger bug").

Container logs end up at the Docker daemon's default location (/var/lib/docker/containers/<id>/<id>-json.log). No log shipping → no Loki, CloudWatch, or Splunk target. Operator reads logs with docker compose logs -f <service> over SSH.

Tracing

🔴 None. No OpenTelemetry exporters, no Jaeger, no Datadog APM.

Healthchecks

Every container has a docker-level HEALTHCHECK. docker compose ps shows status as (healthy) / (unhealthy) / (starting). The admin panel's /services page polls these endpoints from the browser side via api.getServiceHealth(serviceKey).
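The same endpoints can be polled from the host with a few lines of Python, which is a reasonable starting point for a cron-driven check until real alerting exists. The URL map below uses only ports and paths stated in this document (pms-service is omitted because its health path is not given here); unhealthy and http_status are illustrative names.

```python
import urllib.error
import urllib.request

HEALTH_URLS = {
    "api-gateway": "http://127.0.0.1:3001/api/v1/health",
    "ws-gateway": "http://127.0.0.1:3002/health",
    "order-router": "http://127.0.0.1:3006/health",
    "ledger-service": "http://127.0.0.1:3007/health",
    "risk-service": "http://127.0.0.1:3009/health",
    "subscription-service": "http://127.0.0.1:3060/health",
    "matching-engine": "http://127.0.0.1:8090/actuator/health",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code, or 0 if the service cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure


def unhealthy(urls, get_status=http_status):
    """Names of services whose health endpoint did not answer 200."""
    return sorted(name for name, url in urls.items() if get_status(url) != 200)
```

Calling unhealthy(HEALTH_URLS) on the host returns an empty list when everything answers; the get_status parameter exists so the logic can be tested without a running stack.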

Alerting

🔴 None. No PagerDuty, no Slack alert pipeline, no on-call rotation. If the host goes down, the next person to look at trade.quanta.emoment.tech finds out.

Cost / value trade-off

For the current M1 demo posture this is acceptable. Before any paying-customer traffic: wire prom-client + push to Grafana Cloud (free tier covers 10K series), add CloudWatch container insights, write 4 alarms (instance down / disk > 80% / matching-engine 5xx / no trades in 5 min).


8. Backups

EBS snapshots (host-level)

🟢 DLM (Data Lifecycle Manager) is configured (deployment-state.md:255-262):

| Item | Value |
|---|---|
| IAM role | arn:aws:iam::094969483885:role/AWSDataLifecycleManagerDefaultRole |
| Policy | policy-0066a67ecb6c3daa7 (ENABLED) |
| Schedule | Daily, 03:00 UTC |
| Retention | 7 days |
| Target | EBS volumes tagged Backup=daily (currently the root disk vol-0ddc7e9d1de5a2b59) |
| Tagging | Snapshots tagged SnapshotType=DLM-Daily |
| Baseline | snap-07e001a69835f1973 (manual, pre-DLM) |

Restore = create new EBS volume from a snapshot, attach, fsck, mount, fix /etc/fstab UUID. ~15 minutes to a running new host.

Postgres backups (logical)

🔴 None today. No pg_dump cron, no continuous archiving, no AWS RDS (Postgres runs in-container with a Docker volume). The EBS snapshot is the only Postgres backup — sufficient for crash recovery, insufficient for point-in-time recovery.

Adding a nightly pg_dump → S3 with 30-day retention is a 30-minute task. It's not yet done.
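A sketch of what that job could look like, in Python. The database user and name are the schema defaults from §3 and may differ in prod; the object-key layout is hypothetical; and the S3 upload itself is left out (piping the dump into aws s3 cp - s3://<bucket>/<key> is one option).

```python
from datetime import date, timedelta

RETENTION_DAYS = 30


def dump_key(day):
    # Hypothetical object-key layout; any sortable date-based name would do.
    return f"postgres/exchange-{day.isoformat()}.dump"


def pg_dump_command():
    # -T disables TTY allocation so the dump can be piped from cron.
    # -Fc selects pg_dump's custom archive format, restorable with pg_restore.
    # Assumption: user and database are both "exchange" (the schema defaults).
    return ["docker", "compose", "exec", "-T", "postgres",
            "pg_dump", "-U", "exchange", "-Fc", "exchange"]


def expired_keys(keys, today):
    """Keys whose embedded date is older than the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    expired = []
    for key in keys:
        stamp = key[-15:-5]  # the YYYY-MM-DD part before ".dump"
        if date.fromisoformat(stamp) < cutoff:
            expired.append(key)
    return expired
```

An S3 lifecycle rule could replace expired_keys entirely; the function is shown to make the 30-day cutoff explicit.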

Temporal-based settlement durability

🟡 tradeSettlementWorkflow (services/ledger-service/src/workflows/trade-settlement.ts) is implemented and wired up. Steps (:39-50):

  1. settleTradeActivity — calls LedgerService.settleTrade({...}). Idempotent via trade ID.
  2. notifyTradeSettledActivity — publishes trade.settled on NATS for downstream services.

Retry policy: maximumAttempts: 3, initialInterval: 1s, backoffCoefficient: 2, startToCloseTimeout: 30s (trade-settlement.ts:6-14).
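The retry policy translates into a short, bounded schedule. The Python sketch below computes it from the four values above, on the understanding that maximumAttempts counts the first attempt and that startToCloseTimeout applies to each attempt separately.

```python
INITIAL_INTERVAL_S = 1.0
BACKOFF_COEFFICIENT = 2.0
MAXIMUM_ATTEMPTS = 3
START_TO_CLOSE_TIMEOUT_S = 30.0


def retry_waits(initial=INITIAL_INTERVAL_S, coefficient=BACKOFF_COEFFICIENT,
                max_attempts=MAXIMUM_ATTEMPTS):
    # One wait between each pair of consecutive attempts.
    return [initial * coefficient ** n for n in range(max_attempts - 1)]


def worst_case_seconds():
    # Every attempt runs to its timeout, plus the waits in between.
    return MAXIMUM_ATTEMPTS * START_TO_CLOSE_TIMEOUT_S + sum(retry_waits())
```

So an activity is retried after 1 s and again after 2 s, and gives up after at most 93 seconds. Anything that needs to survive a longer outage (for example a Postgres restart during an EBS restore) would need a larger maximumAttempts.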

The worker is launched from services/ledger-service/src/worker.ts:1-23 with maxConcurrentActivityTaskExecutions: 20 and maxConcurrentWorkflowTaskExecutions: 20.

But there's no Temporal server in production today: TEMPORAL_ADDRESS defaults to localhost:7233 (config/index.ts:47) and nothing answers there. The worker's connection attempt fails silently, and the workflow path is not exercised. Settlement actually happens via the NATS trade.executed listener inside ledger.ts directly.

Wiring a real Temporal cluster (Temporal Cloud or self-hosted Temporal in compose) is part of the M3 "make settlement crash-safe" workstream.


9. Operational runbooks

Restart a service

ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure
docker compose restart <service>   # graceful; honours stop_grace_period
docker compose ps                  # confirm (healthy)

For pm2-managed front-ends (admin-panel, trade-ui, presale-app, investor-dashboard, main frontend):

pm2 restart qt-admin              # or qt-trade / qt-presale / qt-dashboard / quantatrade
pm2 status                        # confirm
pm2 logs qt-admin --lines 50      # check post-restart logs

Tail logs

# Docker container
docker compose logs -f --tail 100 api-gateway

# Multiple containers
docker compose logs -f api-gateway order-router ledger-service

# pm2 frontend
pm2 logs qt-trade --lines 200

# nginx access log
sudo tail -f /var/log/nginx/access.log

# nginx error log
sudo tail -f /var/log/nginx/error.log

Run a Prisma migration

The Prisma schema is in packages/db/prisma/schema.prisma (24 models). Migrations live in packages/db/prisma/migrations/:

  • 20240201000000_add_address_pool
  • 20260206000000_add_order_internal_id
  • 20260208000000_add_user_password_hash

To apply migrations against the live database:

ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure   # docker compose must run from the directory holding docker-compose.yml

# Option A: from inside the api-gateway container (the image has Prisma CLI)
docker compose exec api-gateway npx prisma migrate deploy --schema=/app/packages/db/prisma/schema.prisma

# Option B: db push for non-versioned changes (development convenience — never in prod)
docker compose exec api-gateway npx prisma db push --schema=/app/packages/db/prisma/schema.prisma

History note (deployment-state.md:109): The initial deploy had no migrations — only the address_pool migration existed; User / Order / Trade / etc. tables were missing. Recovery was prisma db push from inside the api-gateway container, which created 25 tables from the schema. db push is dev-only — never run it on prod once we have customers. The two follow-up migrations (order_internal_id, user_password_hash) were added properly.

Roll back a bad deploy

# Tag-based pin (preferred — once we tag images properly)
ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cd /home/ubuntu/qt/infrastructure
# Edit docker-compose.yml: pin <service>'s image: tag from `:latest` to the last-known-good tag
docker compose pull <service>
docker compose up -d <service>

# Or, if rollback requires DB state too: restore from EBS snapshot
# 1. Stop services that depend on Postgres
docker compose stop api-gateway order-router ledger-service pms-service risk-service
# 2. Create EBS volume from yesterday's DLM snapshot
aws ec2 create-volume --snapshot-id snap-<id> --availability-zone us-east-1b --volume-type gp3 --region us-east-1
# 3. Stop instance, detach old volume, attach new, fsck, mount
# 4. Restart instance and bring the stack up

There is no automated rollback — every step above is manual.
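The hand edit in the tag-based path is the step most likely to go wrong under pressure. A small Python helper along these lines could do the pin instead; pin_image_tag is an illustrative name, the tag v1.4.2 in the usage note is made up, and the regex assumes unquoted single-line image: entries.

```python
import re


def pin_image_tag(compose_text, image_repo, tag):
    """Rewrite `image: <image_repo>:<anything>` so it uses the given tag."""
    pattern = re.compile(
        r"^([ \t]*image:[ \t]*" + re.escape(image_repo) + r"):\S+[ \t]*$",
        re.MULTILINE,
    )
    # A function replacement avoids backslash-escape surprises in the tag.
    return pattern.sub(lambda m: f"{m.group(1)}:{tag}", compose_text)
```

For example, pin_image_tag(text, "ghcr.io/quantatradeai/platform-api-gateway", "v1.4.2") changes only the api-gateway image line and leaves every other service untouched. The result still has to be written back and followed by docker compose pull and docker compose up -d for that service.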

Renew a TLS certificate

certbot runs daily on a systemd timer. Manual force-renewal:

sudo certbot renew --dry-run                  # safety check
sudo certbot renew                            # actual renewal (only renews if < 30 days remain)
sudo certbot renew --force-renewal --cert-name api.quanta.emoment.tech   # nuclear option
sudo systemctl reload nginx

Redeploy docs (this repo's docs/)

cd /Users/pk/ws/quantatrade
~/Library/Python/3.9/bin/mkdocs build
rsync -az --delete -e "ssh -i ~/.ssh/quantatrade-key.pem" \
  site/ ubuntu@34.199.105.99:/var/www/docs.quanta.emoment.tech/

Inspect the docker-compose override

ssh -i ~/.ssh/quantatrade-key.pem ubuntu@34.199.105.99
cat /home/ubuntu/qt/infrastructure/docker-compose.override.yml

(This file is host-only — see §2.)

Self-heal CI test balance

order-pipeline.spec.ts in the trading-ui repo deposits 1 B USDT + 1 k BTC into the test user before every CI run. The endpoint is gated by service-auth headers:

curl -X POST https://matching.quanta.emoment.tech/api/v1/accounts/deposit \
  -H "Content-Type: application/json" \
  -H "x-api-key: $SERVICE_API_KEY" \
  -H "x-api-secret: $SERVICE_API_SECRET" \
  -H "x-participant-type: SYSTEM" \
  -d '{"userId":"cmohcayzs0000n57cskqsutdc","currency":"USDT","amount":"1000000000"}'

10. Known operational gaps

🔴 Single host = SPOF. One EC2 instance dies → everything goes down. No multi-AZ, no auto-scaling, no failover. RTO is roughly 15 minutes (restore from EBS snapshot). RPO is 24 hours (daily DLM cadence).

🔴 Manual deploys. No CD pipeline. Operator SSHes in, runs docker compose pull && up -d. No deploy log, no atomic swap, no canary.

🔴 No UAT environment. Every change goes to prod direct. Trading UI has its e2e suite running against prod (which is brave but works for a demo).

🔴 Secrets in plain .env. No Vault, no rotation. JWT_SECRET hasn't rotated since provisioning.

🔴 Metrics stub. @quantatrade/metrics is no-ops. No Prometheus target.

🔴 No alerting. Host down / disk full / matching-engine 5xx — nothing pages anyone.

🔴 No logical Postgres backup. Only EBS-level. PITR is impossible.

🔴 No Temporal in prod. The settlement workflow and its worker exist in code, but there is no Temporal server for the worker to connect to.

🔴 nginx config not version-controlled. Reconstructing it from scratch would be from memory + grep on the host.

🔴 docker-compose.override.yml not committed. Same problem.

🟡 pms-service JWT secret has a hard-coded fallback (pms-secret-change-in-production) — must be overridden in prod env.

🟡 subscription-service uses SQLite inside the container — no volume mount, state lost on restart.

🟡 order-router MATCHING_ENGINE_URL/_WS_URL defaulted to localhost:8090 — unreachable from inside its container. Fixed via override env vars pointing at http://matching-engine:8090. The default in the schema is misleading.

🟡 Burstable t3.2xlarge. Under sustained load CPU credits will throttle. Move to c7i.2xlarge or m7i.2xlarge before paying-customer traffic.

🟡 CI swallows lint/type/test failures. Only build is enforced.

🟢 DLM EBS snapshots — working. Tested on 2026-04-26. 7-day retention.

🟢 Certbot auto-renewal — working. Last dry-run successful 2026-04-25.

🟢 pm2 survives reboots. pm2 startup was run on provision; the five front-ends come back up automatically.

🟢 Docker restart: unless-stopped on every service — containers come back after reboot or crash.


Related documents

  • 01-architecture.md — what the services do; this doc is "where they run"
  • 03-ledger-accounting.md — Temporal settlement design (the worker that has no server yet)
  • 06-admin-panel.md — how the operator UI lands on the matching engine via Envoy
  • docs/deployment-state.md — the authoritative day-to-day operations log; this doc summarises and structures it
  • docs/forward-plan.md — the prioritised list of operational improvements