
🚦 Status Health Service - Monitoring Stack

A comprehensive, single-container monitoring solution for tracking HTTPS availability of services deployed on Railway. Combines Prometheus, Grafana, and Blackbox Exporter with automatic provisioning and persistent storage.

✨ Features

  • ✅ HTTPS Availability Monitoring - Track uptime and response times of your services
  • ✅ Unified Dashboard - Auto-provisioned Grafana dashboard with datasource configuration
  • ✅ Single Service Deployment - Everything runs in one Railway container
  • ✅ Persistent Storage - Single volume setup with automatic symlink management
  • ✅ Built-in Documentation - /docs endpoint with setup instructions and credentials
  • ✅ Zero Configuration - Works out of the box with provisioned datasources and dashboards

πŸ—οΈ Architecture ​

This monitoring stack combines three powerful tools in a single container, managed by Supervisor:

```
┌─────────────────────────────────────────┐
│         Railway Container               │
│                                         │
│  ┌──────────────────────────────────┐   │
│  │  Reverse Proxy (Go)              │   │
│  │  Port: 3000                      │   │
│  │  Routes: /docs → Docs Server     │   │
│  │          /*    → Grafana         │   │
│  └──────────────────────────────────┘   │
│         ▲            ▲                  │
│         │            │                  │
│  ┌──────────┐  ┌─────────────────┐      │
│  │ Docs     │  │  Grafana        │      │
│  │ Server   │  │  Visualization  │      │
│  │ (Go)     │  │  & Dashboards   │      │
│  └──────────┘  └─────────────────┘      │
│                      ▲                  │
│                      │                  │
│  ┌──────────────────────────────────┐   │
│  │  Prometheus                      │   │
│  │  Port: 9090                      │   │
│  │  Metrics Collection & Storage    │   │
│  └──────────────────────────────────┘   │
│              ▲                          │
│              │                          │
│  ┌──────────────────────────────────┐   │
│  │  Blackbox Exporter               │   │
│  │  Port: 9115                      │   │
│  │  HTTP/HTTPS Probes               │   │
│  └──────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

Technology Stack:

  • Prometheus v2.48.0 - Metrics collection and time-series database
  • Grafana v10.2.0 - Visualization and dashboards (auto-provisioned)
  • Blackbox Exporter v0.24.0 - HTTP/HTTPS endpoint probing
  • Supervisord - Process management for all services
  • Custom Go Services:
    • Reverse Proxy - Listens on port 3000; routes /docs to the docs server and everything else to Grafana
    • Docs Server - Static file server for instructions and credentials
    • Go Probe - Diagnostic tool for debugging probe issues from inside container

πŸ“ Project Structure ​

```
.
├── Dockerfile                      # Multi-stage build with 3 Go services + monitoring stack
├── entrypoint.sh                   # Volume setup (creates /data subdirectories + symlinks)
├── supervisord.conf                # Process manager configuration
├── prometheus.yml                  # Prometheus scrape configuration (targets here)
├── blackbox.yml                    # Blackbox exporter modules
├── railway.json                    # Railway configuration
├── cmd/
│   ├── proxy/main.go              # Reverse proxy: :3000/docs → docs-server, /* → Grafana
│   ├── docs-server/main.go        # Static docs server
│   └── go-probe/main.go           # Diagnostic probe tool
├── docs/
│   ├── index.html                 # Docs endpoint content (edit to customize)
│   └── services.json              # Service list (unverified - confirm with team)
├── grafana-provisioning/
│   ├── datasources/               # Auto-provisioned Prometheus datasource
│   ├── dashboards/                # Auto-provisioned dashboard definitions
│   │   └── default/
│   │       └── all-servers.json   # Unified dashboard
│   └── alerting/                  # Alert rules and contact points
│       └── alerting.yml           # Alert configuration (requires Slack webhook)
├── debug-probe.sh                 # Debug script for Railway probe issues
├── ADD_NEW_SERVICE.md             # Guide for adding new services (unverified - confirm with team)
└── SERVICE_OWNER_GUIDE.md         # Service owner documentation (unverified - confirm with team)
```

🚀 Quick Start

Prerequisites

  • A Railway account (railway.app)
  • GitHub repository with this code
  • Services to monitor (deployed with public HTTPS endpoints)

Deployment Steps

  1. Push to GitHub

     ```bash
     git add .
     git commit -m "Add monitoring stack"
     git push origin master
     ```
  2. Deploy on Railway

    • Go to Railway Dashboard
    • Click New Project
    • Select Deploy from GitHub repo
    • Choose this repository (Wellysa/status-health-service)
    • Railway will auto-detect the Dockerfile and start building
  3. Add Persistent Volume (Required for data persistence)

    • In Railway project, go to your service → Volumes tab
    • Click + New Volume
    • Create volume:
      • Name: monitoring-data (or any name)
      • Mount Path: /data ⚠️ Must be exactly /data
    • ⚠️ Important: Without this volume, all metrics and dashboards are lost on redeploy
    • 📝 Note: Railway allows only one volume per service - /data contains subdirectories for all persistent data
  4. Expose Public Port

    • In Railway project, go to Settings → Networking
    • Click Generate Domain or use existing domain
    • Ensure Public Port is exposed:
      • Port: 3000
      • Protocol: HTTP
  5. Access Grafana

    • Open your Railway domain: https://<your-project>.up.railway.app
    • Default credentials:
      • Username: admin
      • Password: admin
    • ⚠️ Important: Change the password on first login
  6. Access Documentation

    • Same domain, path /docs: https://<your-project>.up.railway.app/docs
    • Contains: setup instructions, credentials, how to add services
    • Edit docs/index.html to customize content
  7. Verify Auto-Provisioning

    • ✅ Prometheus datasource - automatically configured at http://localhost:9090
    • ✅ Unified dashboard - automatically imported (check Dashboards → "All Servers")
    • No manual import needed - provisioned on startup from grafana-provisioning/

🔧 Configuration

Adding Services to Monitor

Edit prometheus.yml to add services. The file defines two job types:

Job 1: Availability Monitoring (HTTPS Probes)

The availability job uses Blackbox Exporter to probe HTTPS endpoints:

```yaml
- job_name: "availability"
  metrics_path: /probe
  params:
    module: [https_2xx]
  static_configs:
    - targets:
        - https://service-a.up.railway.app/health
        - https://service-b.up.railway.app/health
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - target_label: __address__
      replacement: localhost:9115
```

Requirements for monitored endpoints:

  • Must be public HTTPS URLs
  • Should return HTTP 200 (or any 2xx) status code
  • Response body must contain "ok" (case-insensitive) - e.g., plain OK or JSON {"status":"ok"}
  • Railway domains work: *.up.railway.app

Optional labels for alerting (unverified - confirm with team):

  • responsible: Display name for dashboard (e.g., "@Michał Basznianin")
  • slack_id: Slack Member ID for @mentions in alerts (e.g., "U0ABC123")

Example with labels:

```yaml
- targets:
    - https://service-a.up.railway.app/health
  labels:
    responsible: "@John Doe"
    slack_id: "U0ABC123"
```

Job 2: Coverage Metrics (unverified - confirm with team)

The coverage job scrapes test coverage metrics from your services (if implemented):

```yaml
- job_name: "coverage"
  scheme: https
  metrics_path: /metrics
  static_configs:
    - targets:
        - service-a.up.railway.app
        - service-b.up.railway.app
```

Requirements:

  • Each service must expose a /metrics endpoint
  • Metrics must follow Prometheus format
  • Required metric: unit_test_coverage{service="service-name"} 0.82 (value 0.0-1.0)

Blackbox Exporter Configuration

The blackbox.yml defines probe modules (example configuration):

```yaml
modules:
  https_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
```

This module probes HTTP/HTTPS endpoints, expecting a 200 status code within a 5-second timeout.
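If you also want the probe itself to enforce the "ok" body requirement described above, Blackbox Exporter supports body regexp checks. A hedged example module (an assumption, not part of the shipped blackbox.yml):

```yaml
modules:
  https_2xx_body:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      # Fail the probe unless the body contains "ok" (case-insensitive).
      fail_if_body_not_matches_regexp:
        - "(?i)ok"
```

To use it, reference `module: [https_2xx_body]` in the availability job's params.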

📊 Grafana Dashboard

Pre-Provisioned Dashboard

The service includes an auto-provisioned unified dashboard located at:

  • grafana-provisioning/dashboards/default/all-servers.json

Provisioned automatically on startup:

  1. Prometheus datasource - configured to http://localhost:9090
  2. Unified dashboard - imported from all-servers.json
  3. Alert rules - loaded from grafana-provisioning/alerting/
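The provisioned datasource follows Grafana's standard provisioning schema; the file under grafana-provisioning/datasources/ should look roughly like this (illustrative sketch, check the repo for the exact contents):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # Prometheus runs in the same container
    isDefault: true
```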

Access the dashboard:

  1. Login to Grafana
  2. Navigate to Dashboards
  3. Look for "All Servers - Unified Dashboard" (or similar name)

Common PromQL Queries

Service Availability:

```promql
probe_success{job="availability"}
```

Response Time:

```promql
probe_http_duration_seconds{job="availability"}
```

Service Uptime (last 24h):

```promql
avg_over_time(probe_success{job="availability"}[24h]) * 100
```

Coverage Percentage (if coverage job configured):

```promql
unit_test_coverage * 100
```
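One more query of the same shape that can be useful for a "currently down" table panel (a suggestion, not part of the provisioned dashboard):

```promql
probe_success{job="availability"} == 0
```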

Creating Custom Dashboards

Refer to the current README.md section "Grafana Dashboard Setup - Complete Guide" for detailed instructions on creating custom panels with availability tables, response time graphs, and coverage metrics.

🚨 Alerting (Optional)

Alert configuration is provisioned from grafana-provisioning/alerting/:

Setup requirements:

  1. Set SLACK_WEBHOOK_URL environment variable in Railway
  2. Uncomment alerting provisioning in Dockerfile if disabled
  3. Configure contact points in grafana-provisioning/alerting/contact-points.yml

Alert behavior:

  • Repeat interval: Set in policies.yml (e.g., 5m to match evaluation interval)
  • Slack @mentions: Use slack_id label on targets (Slack Member ID, e.g., U0ABC123)
    • Get ID: Slack → User profile → More → Copy member ID
  • No slack_id: No @mention added to notification

Example alert rules (unverified - confirm with team):

  • Service Down: probe_success == 0 for 2+ minutes
  • Low Coverage: unit_test_coverage < 0.80 for 10+ minutes

💾 Persistent Volumes

Why Volumes Are Critical

Without a volume, all data is lost on:

  • Service redeploy
  • Container restart
  • Railway maintenance

Data requiring persistence:

  • Prometheus: Historical metrics (grows ~100MB-1GB)
  • Grafana: Dashboards, datasources, users (~50-200MB)
  • Logs: Application logs (~10-100MB)

Single Volume Setup

⚠️ Railway allows only ONE volume per service

Configuration:

  1. Mount point: /data (exactly)
  2. Entrypoint script creates subdirectories:
    • /data/prometheus → symlinked to /var/lib/prometheus
    • /data/grafana → symlinked to /var/lib/grafana
    • /data/grafana-logs → symlinked to /var/log/grafana

How it works:

  • entrypoint.sh runs on container start
  • Creates subdirectories in /data if missing
  • Creates symlinks from standard paths to volume subdirectories
  • Services use standard paths transparently
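The steps above amount to a few lines of shell. A simplified sketch of what entrypoint.sh does (illustrative, not the verbatim script from this repo; the demonstration runs against a throwaway directory instead of the real /var paths):

```shell
#!/bin/sh
set -eu

# Keep real data under the single Railway volume and point the standard
# path each service expects at it via a symlink.
link_dir() {
  src="$1"   # subdirectory on the volume, e.g. /data/grafana
  dest="$2"  # standard path the service expects, e.g. /var/lib/grafana
  mkdir -p "$src"
  mkdir -p "$(dirname "$dest")"
  rm -rf "$dest"
  ln -sfn "$src" "$dest"
}

# In the real container this would run:
#   link_dir /data/prometheus   /var/lib/prometheus
#   link_dir /data/grafana      /var/lib/grafana
#   link_dir /data/grafana-logs /var/log/grafana
# Demonstrate on a throwaway directory instead:
tmp="$(mktemp -d)"
link_dir "$tmp/data/grafana" "$tmp/var/lib/grafana"
readlink "$tmp/var/lib/grafana"
```

Because the symlinks are recreated on every start, the services keep reading and writing their standard paths while the data actually lives on the volume.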

To add volume:

  1. Railway dashboard → Your service → Volumes tab
  2. + New Volume
  3. Mount Path: /data (must be exact)
  4. Name: monitoring-data (or any name)
  5. Redeploy service

Verifying Volume Works

After adding volume and redeploying:

  • Prometheus: Query metrics in Grafana → if historical data survives restart, volume works
  • Grafana: Create test dashboard → redeploy → if dashboard persists, volume works

πŸ” Accessing Services ​

ServiceURLAccessPort
Grafanahttps://<railway-domain>Public (Railway domain)3000
Documentationhttps://<railway-domain>/docsPublic (same domain)3000
Prometheus UIhttp://localhost:9090Internal only (within container)9090
Blackbox Exporterhttp://localhost:9115Internal only (within container)9115

Default Grafana credentials:

  • Username: admin
  • Password: admin (change on first login)

Overriding credentials: Set Railway environment variables:

  • GF_SECURITY_ADMIN_USER
  • GF_SECURITY_ADMIN_PASSWORD

πŸ› οΈ Troubleshooting ​

Issue: probe_http_status_code is 0 but service responds ​

Symptom: Blackbox probe shows probe_http_status_code 0 but go-probe tool succeeds inside container.

Diagnosis from inside container:

```bash
# Run the exact request Prometheus sends to Blackbox
curl -sS 'http://localhost:9115/probe?target=https%3A%2F%2Fyour-service.up.railway.app%2Fhealth&module=https_2xx' \
  | grep -E "probe_success|probe_http_status_code"
```

If probe_http_status_code is 200:

  • Blackbox works when called manually
  • Check Prometheus Targets page: http://localhost:9090/targets
  • Look for availability job scrape errors

If probe_http_status_code is 0:

  • Blackbox is failing (encoding, module, or target issue)
  • Try without URL encoding: curl -sS 'http://localhost:9115/probe?target=https://your-service.up.railway.app/health&module=https_2xx'

Use debug-probe.sh:

```bash
# Inside container (Railway CLI or exec)
/usr/local/bin/debug-probe.sh
```

Issue: No metrics appearing in Grafana

Check:

  1. Prometheus targets: http://localhost:9090/targets (internal only - check from inside the container)
  2. Verify prometheus.yml has correct target URLs
  3. Ensure monitored services are publicly accessible
  4. Check Railway logs for Prometheus scrape errors

Issue: Dashboard not auto-provisioned

Check:

  1. Verify grafana-provisioning/dashboards/default/all-servers.json exists
  2. Check Grafana logs in Railway for provisioning errors
  3. Ensure volume is mounted at /data (provisioning may fail without write access)

Issue: Data lost after redeploy

Solution:

  1. Verify volume mount path is exactly /data (not /var/lib/prometheus or other path)
  2. Check Railway logs for entrypoint script errors (symlink creation)
  3. Verify volume is attached to the service in Railway dashboard

πŸ“ Environment Variables ​

Configure via Railway environment variables:

| Variable                   | Description                                 | Default             |
|----------------------------|---------------------------------------------|---------------------|
| GF_SERVER_HTTP_ADDR        | Grafana bind address (internal)             | (set by entrypoint) |
| GF_SERVER_HTTP_PORT        | Grafana port (internal, accessed via proxy) | (set by entrypoint) |
| GF_SECURITY_ADMIN_USER     | Grafana admin username                      | admin               |
| GF_SECURITY_ADMIN_PASSWORD | Grafana admin password                      | admin               |
| SLACK_WEBHOOK_URL          | Slack webhook for alerts                    | (unset)             |

Note: The reverse proxy listens on port 3000 and routes to Grafana internally. Grafana's actual port is configured by supervisord/entrypoint.

πŸ” Security Notes ​

  • ⚠️ Change default Grafana password immediately after first login
  • πŸ”’ Monitored services should use HTTPS endpoints
  • πŸ›‘οΈ Prometheus (9090) and Blackbox (9115) are not exposed publicly by default (internal use only)
  • πŸ”‘ Use Railway environment variables for sensitive configuration (passwords, webhooks)
  • πŸ“ Edit docs/index.html to remove or secure sensitive documentation

📄 Status

Production: Deployed on Railway as single-container monitoring stack

Known limitations:

  • Railway allows only one volume per service (handled by /data subdirectory structure)
  • Coverage monitoring requires services to expose /metrics endpoint (optional feature)
  • Slack alerting requires manual webhook configuration

Built for Wellysa • Deployed on Railway

Wellysa Consigliere — internal use only.