# Status Health Service - Monitoring Stack
A comprehensive, single-container monitoring solution for tracking HTTPS availability of services deployed on Railway. Combines Prometheus, Grafana, and Blackbox Exporter with automatic provisioning and persistent storage.
## Features

- **HTTPS Availability Monitoring** - Track uptime and response times of your services
- **Unified Dashboard** - Auto-provisioned Grafana dashboard with datasource configuration
- **Single Service Deployment** - Everything runs in one Railway container
- **Persistent Storage** - Single volume setup with automatic symlink management
- **Built-in Documentation** - `/docs` endpoint with setup instructions and credentials
- **Zero Configuration** - Works out of the box with provisioned datasources and dashboards
## Architecture
This monitoring stack combines three powerful tools in a single container, managed by Supervisor:
```
+------------------------------------------+
|            Railway Container             |
|                                          |
|  +------------------------------------+  |
|  |        Reverse Proxy (Go)          |  |
|  |            Port: 3000              |  |
|  |   Routes: /docs -> Docs Server     |  |
|  |           /*    -> Grafana         |  |
|  +------------------------------------+  |
|         ^                  ^             |
|         |                  |             |
|  +------------+   +-----------------+    |
|  |    Docs    |   |     Grafana     |    |
|  |   Server   |   |  Visualization  |    |
|  |    (Go)    |   |  & Dashboards   |    |
|  +------------+   +-----------------+    |
|                            ^             |
|                            |             |
|  +------------------------------------+  |
|  |            Prometheus              |  |
|  |            Port: 9090              |  |
|  |   Metrics Collection & Storage     |  |
|  +------------------------------------+  |
|                    ^                     |
|                    |                     |
|  +------------------------------------+  |
|  |         Blackbox Exporter          |  |
|  |            Port: 9115              |  |
|  |         HTTP/HTTPS Probes          |  |
|  +------------------------------------+  |
+------------------------------------------+
```

Technology Stack:
- Prometheus v2.48.0 - Metrics collection and time-series database
- Grafana v10.2.0 - Visualization and dashboards (auto-provisioned)
- Blackbox Exporter v0.24.0 - HTTP/HTTPS endpoint probing
- Supervisord - Process management for all services
- Custom Go Services:
  - Reverse Proxy - Routes `/docs` to the docs server and everything else to Grafana, on port 3000
  - Docs Server - Static file server for instructions and credentials
  - Go Probe - Diagnostic tool for debugging probe issues from inside the container
## Project Structure
```
.
├── Dockerfile              # Multi-stage build with 3 Go services + monitoring stack
├── entrypoint.sh           # Volume setup (creates /data subdirectories + symlinks)
├── supervisord.conf        # Process manager configuration
├── prometheus.yml          # Prometheus scrape configuration (targets here)
├── blackbox.yml            # Blackbox exporter modules
├── railway.json            # Railway configuration
├── cmd/
│   ├── proxy/main.go       # Reverse proxy: :3000/docs -> docs-server, /* -> Grafana
│   ├── docs-server/main.go # Static docs server
│   └── go-probe/main.go    # Diagnostic probe tool
├── docs/
│   ├── index.html          # Docs endpoint content (edit to customize)
│   └── services.json       # Service list (unverified - confirm with team)
├── grafana-provisioning/
│   ├── datasources/        # Auto-provisioned Prometheus datasource
│   ├── dashboards/         # Auto-provisioned dashboard definitions
│   │   └── default/
│   │       └── all-servers.json  # Unified dashboard
│   └── alerting/           # Alert rules and contact points
│       └── alerting.yml    # Alert configuration (requires Slack webhook)
├── debug-probe.sh          # Debug script for Railway probe issues
├── ADD_NEW_SERVICE.md      # Guide for adding new services (unverified - confirm with team)
└── SERVICE_OWNER_GUIDE.md  # Service owner documentation (unverified - confirm with team)
```

## Quick Start
### Prerequisites
- A Railway account (railway.app)
- GitHub repository with this code
- Services to monitor (deployed with public HTTPS endpoints)
### Deployment Steps
1. **Push to GitHub**

   ```bash
   git add .
   git commit -m "Add monitoring stack"
   git push origin master
   ```

2. **Deploy on Railway**
   - Go to the Railway Dashboard
   - Click **New Project**
   - Select **Deploy from GitHub repo**
   - Choose this repository (Wellysa/status-health-service)
   - Railway will auto-detect the `Dockerfile` and start building

3. **Add Persistent Volume** (required for data persistence)
   - In the Railway project, go to your service → **Volumes** tab
   - Click **+ New Volume**
   - Create the volume:
     - Name: `monitoring-data` (or any name)
     - Mount Path: `/data` (must be exactly `/data`)
   - **Important:** Without this volume, all metrics and dashboards are lost on redeploy
   - Note: Railway allows only one volume per service; `/data` contains subdirectories for all persistent data

4. **Expose Public Port**
   - In the Railway project, go to **Settings → Networking**
   - Click **Generate Domain** or use an existing domain
   - Ensure the public port is exposed:
     - Port: `3000`
     - Protocol: `HTTP`

5. **Access Grafana**
   - Open your Railway domain: `https://<your-project>.up.railway.app`
   - Default credentials:
     - Username: `admin`
     - Password: `admin`
   - **Important:** Change the password on first login

6. **Access Documentation**
   - Same domain, path `/docs`: `https://<your-project>.up.railway.app/docs`
   - Contains: setup instructions, credentials, how to add services
   - Edit `docs/index.html` to customize the content

7. **Verify Auto-Provisioning**
   - Prometheus datasource: automatically configured at `http://localhost:9090`
   - Unified dashboard: automatically imported (check Dashboards → "All Servers")
   - No manual import needed; everything is provisioned on startup from `grafana-provisioning/`
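For reference, Grafana's datasource provisioning format looks like the following. This is a typical example of the format, not necessarily the exact file shipped in `grafana-provisioning/datasources/`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```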
## Configuration
### Adding Services to Monitor
Edit `prometheus.yml` to add services. The file defines two job types:
#### Job 1: Availability Monitoring (HTTPS Probes)
The availability job uses Blackbox Exporter to probe HTTPS endpoints:
```yaml
- job_name: "availability"
  metrics_path: /probe
  params:
    module: [https_2xx]
  static_configs:
    - targets:
        - https://service-a.up.railway.app/health
        - https://service-b.up.railway.app/health
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - target_label: __address__
      replacement: localhost:9115
```

Requirements for monitored endpoints:
- Must be public HTTPS URLs
- Should return HTTP 200 (or any 2xx) status code
- Response body must contain "ok" (case-insensitive), e.g. plain `OK` or JSON `{"status":"ok"}`
- Railway domains work: `*.up.railway.app`
Optional labels for alerting (unverified - confirm with team):
- `responsible`: Display name for the dashboard (e.g. `"@Michał Basznianin"`)
- `slack_id`: Slack Member ID for @mentions in alerts (e.g. `"U0ABC123"`)
Example with labels:
```yaml
- targets:
    - https://service-a.up.railway.app/health
  labels:
    responsible: "@John Doe"
    slack_id: "U0ABC123"
```

#### Job 2: Coverage Metrics (unverified - confirm with team)
The coverage job scrapes test coverage metrics from your services (if implemented):
```yaml
- job_name: "coverage"
  scheme: https
  metrics_path: /metrics
  static_configs:
    - targets:
        - service-a.up.railway.app
        - service-b.up.railway.app
```

Requirements:
- Each service must expose a `/metrics` endpoint
- Metrics must follow the Prometheus exposition format
- Required metric: `unit_test_coverage{service="service-name"} 0.82` (value 0.0-1.0)
### Blackbox Exporter Configuration
The `blackbox.yml` defines probe modules (example configuration):

```yaml
modules:
  https_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
```

This module probes HTTP/HTTPS endpoints, expecting a 200 status code within a 5s timeout.
## Grafana Dashboard
### Pre-Provisioned Dashboard
The service includes an auto-provisioned unified dashboard located at:
`grafana-provisioning/dashboards/default/all-servers.json`
Provisioned automatically on startup:
- Prometheus datasource - configured to `http://localhost:9090`
- Unified dashboard - imported from `all-servers.json`
- Alert rules - loaded from `grafana-provisioning/alerting/`
Access the dashboard:
- Log in to Grafana
- Navigate to Dashboards
- Look for "All Servers - Unified Dashboard" (or similar name)
### Common PromQL Queries
Service Availability:

```promql
probe_success{job="availability"}
```

Response Time:

```promql
probe_http_duration_seconds{job="availability"}
```

Service Uptime (last 24h):

```promql
avg_over_time(probe_success{job="availability"}[24h]) * 100
```

Coverage Percentage (if the coverage job is configured):

```promql
unit_test_coverage * 100
```

### Creating Custom Dashboards
Refer to the current README.md section "Grafana Dashboard Setup - Complete Guide" for detailed instructions on creating custom panels with availability tables, response time graphs, and coverage metrics.
## Alerting (Optional)
Alert configuration is provisioned from `grafana-provisioning/alerting/`:
Setup requirements:
- Set the `SLACK_WEBHOOK_URL` environment variable in Railway
- Uncomment the alerting provisioning in the Dockerfile if it is disabled
- Configure contact points in `grafana-provisioning/alerting/contact-points.yml`
Alert behavior:
- Repeat interval: set in `policies.yml` (e.g. `5m` to match the evaluation interval)
- Slack @mentions: use the `slack_id` label on targets (Slack Member ID, e.g. `U0ABC123`)
  - Get the ID: Slack → User profile → More → Copy member ID
- Without a `slack_id`, no @mention is added to the notification
Example alert rules (unverified - confirm with team):
- Service Down: `probe_success == 0` for 2+ minutes
- Low Coverage: `unit_test_coverage < 0.80` for 10+ minutes
## Persistent Volumes
### Why Volumes Are Critical
Without a volume, all data is lost on:
- Service redeploy
- Container restart
- Railway maintenance
Data requiring persistence:
- Prometheus: Historical metrics (grows ~100MB-1GB)
- Grafana: Dashboards, datasources, users (~50-200MB)
- Logs: Application logs (~10-100MB)
### Single Volume Setup
**Railway allows only ONE volume per service.**
Configuration:
- Mount point: `/data` (exactly)
- The entrypoint script creates subdirectories:
  - `/data/prometheus` → symlinked to `/var/lib/prometheus`
  - `/data/grafana` → symlinked to `/var/lib/grafana`
  - `/data/grafana-logs` → symlinked to `/var/log/grafana`
How it works:
1. `entrypoint.sh` runs on container start
2. It creates subdirectories in `/data` if they are missing
3. It creates symlinks from the standard paths to the volume subdirectories
4. Services use the standard paths transparently
To add the volume:
1. Railway dashboard → your service → **Volumes** tab
2. Click **+ New Volume**
3. Mount Path: `/data` (must be exact)
4. Name: `monitoring-data` (or any name)
5. Redeploy the service
### Verifying the Volume Works
After adding volume and redeploying:
- Prometheus: query metrics in Grafana → if historical data survives a restart, the volume works
- Grafana: create a test dashboard → redeploy → if the dashboard persists, the volume works
## Accessing Services
| Service | URL | Access | Port |
|---|---|---|---|
| Grafana | `https://<railway-domain>` | Public (Railway domain) | 3000 |
| Documentation | `https://<railway-domain>/docs` | Public (same domain) | 3000 |
| Prometheus UI | `http://localhost:9090` | Internal only (within container) | 9090 |
| Blackbox Exporter | `http://localhost:9115` | Internal only (within container) | 9115 |
Default Grafana credentials:
- Username: `admin`
- Password: `admin` (change on first login)
Overriding credentials: set the Railway environment variables:
- `GF_SECURITY_ADMIN_USER`
- `GF_SECURITY_ADMIN_PASSWORD`
## Troubleshooting
### Issue: `probe_http_status_code` is 0 but the service responds
Symptom: The Blackbox probe reports `probe_http_status_code` 0, but the go-probe tool succeeds inside the container.
Diagnosis from inside container:
```bash
# Run the exact request Prometheus sends to Blackbox
curl -sS 'http://localhost:9115/probe?target=https%3A%2F%2Fyour-service.up.railway.app%2Fhealth&module=https_2xx' \
  | grep -E "probe_success|probe_http_status_code"
```

If `probe_http_status_code` is 200:
- Blackbox works when called manually
- Check Prometheus Targets page:
http://localhost:9090/targets - Look for
availabilityjob scrape errors
If `probe_http_status_code` is 0:
- Blackbox is failing (encoding, module, or target issue)
- Try the probe without URL encoding: `curl -sS 'http://localhost:9115/probe?target=https://your-service.up.railway.app/health&module=https_2xx'`
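Since encoding mistakes are a frequent culprit, building the probe URL programmatically sidesteps them. A small Go sketch (illustrative; this is not the bundled go-probe tool, and `probeURL` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"net/url"
)

// probeURL builds the Blackbox probe request with url.Values, so the
// target URL is percent-encoded correctly in the query string.
func probeURL(blackbox, target, module string) string {
	q := url.Values{}
	q.Set("target", target)
	q.Set("module", module)
	return blackbox + "/probe?" + q.Encode() // Encode sorts parameters by key
}

func main() {
	fmt.Println(probeURL("http://localhost:9115",
		"https://your-service.up.railway.app/health", "https_2xx"))
}
```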
Use `debug-probe.sh`:

```bash
# Inside the container (Railway CLI or exec)
/usr/local/bin/debug-probe.sh
```

### Issue: No metrics appearing in Grafana
Check:
- Prometheus targets page: `http://localhost:9090/targets` (accessible from inside the container)
- Verify `prometheus.yml` has the correct target URLs
- Ensure the monitored services are publicly accessible
- Check Railway logs for Prometheus scrape errors
### Issue: Dashboard not auto-provisioned
Check:
- Verify that `grafana-provisioning/dashboards/default/all-servers.json` exists
- Check the Grafana logs in Railway for provisioning errors
- Ensure the volume is mounted at `/data` (provisioning may fail without write access)
### Issue: Data lost after redeploy
Solution:
- Verify the volume mount path is exactly `/data` (not `/var/lib/prometheus` or another path)
- Check the Railway logs for entrypoint script errors (symlink creation)
- Verify volume is attached to the service in Railway dashboard
## Environment Variables
Configure via Railway environment variables:
| Variable | Description | Default |
|---|---|---|
| `GF_SERVER_HTTP_ADDR` | Grafana bind address (internal) | (set by entrypoint) |
| `GF_SERVER_HTTP_PORT` | Grafana port (internal, accessed via the proxy) | (set by entrypoint) |
| `GF_SECURITY_ADMIN_USER` | Grafana admin username | `admin` |
| `GF_SECURITY_ADMIN_PASSWORD` | Grafana admin password | `admin` |
| `SLACK_WEBHOOK_URL` | Slack webhook for alerts | (unset) |
Note: The reverse proxy listens on port 3000 and routes to Grafana internally. Grafana's actual port is configured by supervisord/entrypoint.
## Security Notes
- Change the default Grafana password immediately after first login
- Monitored services should use HTTPS endpoints
- Prometheus (9090) and Blackbox (9115) are not exposed publicly by default (internal use only)
- Use Railway environment variables for sensitive configuration (passwords, webhooks)
- Edit `docs/index.html` to remove or secure sensitive documentation
## Additional Resources
- Prometheus Documentation
- Grafana Documentation
- Blackbox Exporter
- Railway Documentation
- Internal guides (unverified - confirm with team):
  - `ADD_NEW_SERVICE.md` - How to add services to monitoring
  - `SERVICE_OWNER_GUIDE.md` - Guide for service owners
## Status
Production: Deployed on Railway as single-container monitoring stack
Known limitations:
- Railway allows only one volume per service (handled by the `/data` subdirectory structure)
- Coverage monitoring requires services to expose a `/metrics` endpoint (optional feature)
- Slack alerting requires manual webhook configuration
Built for Wellysa • Deployed on Railway