Monitoring

Health checks and monitoring for Scaleway containers, Cloudflare Workers, and AWS App Runner

Overview

SanMarcSoft services are monitored via health check endpoints, Scaleway container status checks, Cloudflare Worker analytics, and AWS App Runner service status.

Health Check Endpoints

Verifieddit (Scaleway Container)

curl -s -o /dev/null -w "%{http_code}" https://verifieddit.com/
# Expected: 200

Stripe Backend (Scaleway Container)

curl -s https://<stripe-backend-url>/health
# Expected: {"status": "ok"}

Badge Signer (Scaleway Container)

curl -s -o /dev/null -w "%{http_code}" https://<badge-signer-url>/health
# Expected: 200

Badges Worker (Cloudflare)

curl -s https://verifieddit.com/api/__debug | jq .
# Expected: {"version": "...", "deployed": "..."}

Phenom Drop (AWS App Runner)

curl -s -o /dev/null -w "%{http_code}" https://<phenom-drop-url>/health
# Expected: 200

Percy TTS (ai.matthewstevens.org)

curl -s http://ai.matthewstevens.org:8086/health
# Expected: {"status": "ok"}
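For scripting, it can help to reduce a status code from the checks above to a pass/fail verdict. The following is a minimal sketch, assuming that any 2xx response counts as healthy. The `is_healthy` and `probe` names are hypothetical and not part of the existing tooling.

```shell
#!/bin/bash
# is_healthy CODE
# Hypothetical helper: succeeds (exit 0) when CODE is a 2xx HTTP status.
# Assumption: any 2xx response means the service is healthy.
is_healthy() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *)           return 1 ;;
  esac
}

# probe URL
# Hypothetical helper: prints the HTTP status code for URL.
# curl reports 000 when no HTTP response is received.
probe() {
  curl -s -o /dev/null -w "%{http_code}" "$1"
}
```

A possible use is `is_healthy "$(probe https://verifieddit.com/)" && echo up || echo down`.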

Scaleway Container Monitoring

Check Container Status

SCW_TOKEN=$(pass sanmarcsoft/scaleway/api-secret)

# List all containers
curl -s -H "X-Auth-Token: ${SCW_TOKEN}" \
  "https://api.scaleway.com/containers/v1beta1/regions/fr-par/containers" | \
  jq '.containers[] | {name, status, domain_name, min_scale, max_scale}'

Container Status Values

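| Status   | Meaning                           | Action                  |
|----------|-----------------------------------|-------------------------|
| ready    | Container is deployed and serving | Normal                  |
| pending  | Container is being deployed       | Wait                    |
| error    | Container failed to deploy        | Investigate (see below) |
| locked   | Container is locked               | Contact Scaleway support |
| deleting | Container is being removed        | Wait                    |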
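The status-to-action mapping above can be encoded for use in scripts. This is a minimal sketch that covers only the statuses listed in this section. The `action_for_status` name is hypothetical, and any other status value is reported as unknown rather than guessed at.

```shell
#!/bin/bash
# action_for_status STATUS
# Hypothetical helper mirroring the status list in this section:
# prints the suggested action for a Scaleway container status.
action_for_status() {
  case "$1" in
    ready)            echo "Normal" ;;
    pending|deleting) echo "Wait" ;;
    error)            echo "Investigate" ;;
    locked)           echo "Contact Scaleway support" ;;
    *)                echo "Unknown status: $1" ;;
  esac
}
```

For example, `action_for_status error` prints `Investigate`.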

Diagnosing Error State

# Get detailed error info
curl -s -H "X-Auth-Token: ${SCW_TOKEN}" \
  "https://api.scaleway.com/containers/v1beta1/regions/fr-par/containers/<container-id>" | \
  jq '{status, error_message, description}'

Common error causes:

  • Image not found in registry
  • Port mismatch
  • Entrypoint crash
  • Memory exceeded during startup

Resolution: Redeploy

cd infra
pulumi up --stack <environment>

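If the container is still in the error state, delete and recreate it. Note that pulumi destroy removes every resource in the stack, not only the failing container: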

pulumi destroy --stack <environment>
pulumi up --stack <environment>

Cloudflare Worker Analytics

View Worker Analytics via API

CF_TOKEN=$(pass cloudflare/api-token)
ACCOUNT_ID=$(pass cloudflare/account-id)

# Get worker analytics (last 24 hours)
# Note: `date -d` is GNU date syntax; on macOS/BSD use `date -u -v-24H` instead.
SINCE=$(date -u -d '-24 hours' +%Y-%m-%dT%H:%M:%SZ)
UNTIL=$(date -u +%Y-%m-%dT%H:%M:%SZ)

curl -s -X POST \
  "https://api.cloudflare.com/client/v4/graphql" \
  -H "Authorization: Bearer ${CF_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{
    "query": "query { viewer { accounts(filter: {accountTag: \"'"${ACCOUNT_ID}"'\"}) { workersInvocationsAdaptive(limit: 10, filter: {datetime_geq: \"'"${SINCE}"'\", datetime_leq: \"'"${UNTIL}"'\"}) { sum { requests errors subrequests } dimensions { scriptName status } } } } }"
  }' | jq '.data.viewer.accounts[0].workersInvocationsAdaptive'

Tail Worker Logs (Real-time)

npx wrangler tail verifieddit-badges
npx wrangler tail verifieddit-badges --status error

AWS App Runner Monitoring

Check Service Status

SERVICE_ARN=$(pass aws/phenom-drop/apprunner-arn)
aws apprunner describe-service --service-arn "${SERVICE_ARN}" \
  --query 'Service.{Status:Status,URL:ServiceUrl,Updated:UpdatedAt,Instance:InstanceConfiguration}'

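View Recent Service Operations

# list-operations returns recent deployments and other service operations.
# It does not return application logs, which are available in CloudWatch Logs.
aws apprunner list-operations --service-arn "${SERVICE_ARN}" \
  --query 'OperationSummaryList[0:5]'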

Automated Monitoring Checklist

Run this script to check all services:

#!/bin/bash
# Prints the HTTP status code for each service, then the state of each Scaleway container.
# Replace <phenom-drop-url> with the real hostname before running.

check() {
  # $1 = label, $2 = URL; prints "label: status-code" (000 if no response is received).
  # The URL is quoted so that placeholder angle brackets are not parsed as redirections.
  echo -n "$1: "
  curl -s -o /dev/null -w "%{http_code}" "$2"
  echo ""
}

echo "=== SanMarcSoft Service Health Check ==="
echo ""

check "Verifieddit"   "https://verifieddit.com/"
check "Badges Worker" "https://verifieddit.com/api/__debug"
check "Phenom Drop"   "https://<phenom-drop-url>/health"
check "Percy TTS"     "http://ai.matthewstevens.org:8086/health"

echo ""
echo "=== Scaleway Containers ==="
SCW_TOKEN=$(pass sanmarcsoft/scaleway/api-secret)
curl -s -H "X-Auth-Token: ${SCW_TOKEN}" \
  "https://api.scaleway.com/containers/v1beta1/regions/fr-par/containers" | \
  jq -r '.containers[] | "\(.name): \(.status)"'
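When the checklist runs unattended (for example from cron), a non-zero exit status is easier to act on than printed codes. The sketch below tallies failures from `NAME=CODE` pairs, assuming that anything other than a 2xx code is a failure. The `summarize` name and the pair format are hypothetical and not part of the script above.

```shell
#!/bin/bash
# summarize NAME=CODE...
# Hypothetical helper: prints one line per service and returns the number
# of services whose status code is not 2xx (0 means everything passed).
summarize() {
  local failures=0 pair name code
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    code=${pair#*=}    # text after the first "="
    case "$code" in
      2[0-9][0-9]) echo "OK   $name ($code)" ;;
      *)           echo "FAIL $name ($code)"; failures=$((failures + 1)) ;;
    esac
  done
  return "$failures"
}
```

Calling `summarize "Verifieddit=200" "Percy TTS=502"` prints one OK line and one FAIL line and returns 1, so `summarize ... || exit 1` would fail a cron job.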

Troubleshooting

  • Health check timeout: The service may be in a cold start. Wait 10 seconds and retry.
  • 502 from Scaleway: The container is starting or has crashed. Check the container status and logs.
  • Worker returning old data: Likely a version mismatch. Check the deployed version with the /api/__debug endpoint. See the Cloudflare Workers SOP.
  • All services down: Check the Cloudflare, Scaleway, and AWS status pages.
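The "wait and retry" advice for cold starts can be automated with a small retry wrapper. This is a minimal sketch. The `retry` name and its argument order are hypothetical, and the 10-second delay in the usage note simply follows the guidance above.

```shell
#!/bin/bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS is
# exhausted, sleeping DELAY_SECONDS between tries.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry 3 10 curl -sf -o /dev/null https://verifieddit.com/` tries up to three times, 10 seconds apart. The `-f` flag makes curl exit non-zero on HTTP errors, so a 502 triggers another attempt.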