# Status Health Service - Monitoring Stack
A comprehensive, single-container monitoring solution for tracking HTTPS availability of services deployed on Railway. Combines Prometheus, Grafana, and Blackbox Exporter with automatic provisioning and persistent storage.
## Features

- **HTTPS Availability Monitoring** - Track uptime and response times of your services
- **Unified Dashboard** - Auto-provisioned Grafana dashboard with datasource configuration
- **Single Service Deployment** - Everything runs in one Railway container
- **Persistent Storage** - Single volume setup with automatic symlink management
- **Built-in Documentation** - `/docs` endpoint with setup instructions and credentials
- **Zero Configuration** - Works out of the box with provisioned datasources and dashboards
## Architecture
This monitoring stack combines three powerful tools in a single container, managed by Supervisor:
```
+------------------------------------------+
|            Railway Container             |
|                                          |
|  +------------------------------------+  |
|  |        Reverse Proxy (Go)          |  |
|  |            Port: 3000              |  |
|  |   Routes: /docs -> Docs Server     |  |
|  |           /*    -> Grafana         |  |
|  +------------------------------------+  |
|         ^                  ^             |
|         |                  |             |
|  +------------+   +-----------------+    |
|  |    Docs    |   |     Grafana     |    |
|  |   Server   |   |  Visualization  |    |
|  |    (Go)    |   |  & Dashboards   |    |
|  +------------+   +-----------------+    |
|                            ^             |
|                            |             |
|  +------------------------------------+  |
|  |            Prometheus              |  |
|  |            Port: 9090              |  |
|  |   Metrics Collection & Storage     |  |
|  +------------------------------------+  |
|                    ^                     |
|                    |                     |
|  +------------------------------------+  |
|  |         Blackbox Exporter          |  |
|  |            Port: 9115              |  |
|  |         HTTP/HTTPS Probes          |  |
|  +------------------------------------+  |
+------------------------------------------+
```

Technology Stack:
- Prometheus v2.48.0 - Metrics collection and time-series database
- Grafana v10.2.0 - Visualization and dashboards (auto-provisioned)
- Blackbox Exporter v0.24.0 - HTTP/HTTPS endpoint probing
- Supervisord - Process management for all services
- Custom Go Services:
  - Reverse Proxy - Routes `/docs` to the docs server and everything else to Grafana, on port 3000
  - Docs Server - Static file server for instructions and credentials
  - Go Probe - Diagnostic tool for debugging probe issues from inside the container
## Project Structure
```
.
├── Dockerfile              # Multi-stage build with 3 Go services + monitoring stack
├── entrypoint.sh           # Volume setup (creates /data subdirectories + symlinks)
├── supervisord.conf        # Process manager configuration
├── prometheus.yml          # Prometheus scrape configuration (targets here)
├── blackbox.yml            # Blackbox exporter modules
├── railway.json            # Railway configuration
├── cmd/
│   ├── proxy/main.go       # Reverse proxy: :3000/docs -> docs-server, /* -> Grafana
│   ├── docs-server/main.go # Static docs server
│   └── go-probe/main.go    # Diagnostic probe tool
├── docs/
│   ├── index.html          # Docs endpoint content (edit to customize)
│   └── services.json       # Service list (unverified - confirm with team)
├── grafana-provisioning/
│   ├── datasources/        # Auto-provisioned Prometheus datasource
│   ├── dashboards/         # Auto-provisioned dashboard definitions
│   │   └── default/
│   │       └── all-servers.json  # Unified dashboard
│   └── alerting/           # Alert rules and contact points
│       └── alerting.yml    # Alert configuration (requires Slack webhook)
├── debug-probe.sh          # Debug script for Railway probe issues
├── ADD_NEW_SERVICE.md      # Guide for adding new services (unverified - confirm with team)
└── SERVICE_OWNER_GUIDE.md  # Service owner documentation (unverified - confirm with team)
```

## Quick Start
### Prerequisites
- A Railway account (railway.app)
- GitHub repository with this code
- Services to monitor (deployed with public HTTPS endpoints)
### Deployment Steps
1. **Push to GitHub**

   ```bash
   git add .
   git commit -m "Add monitoring stack"
   git push origin master
   ```

2. **Deploy on Railway**
   - Go to the Railway Dashboard
   - Click **New Project**
   - Select **Deploy from GitHub repo**
   - Choose this repository (Wellysa/status-health-service)
   - Railway will auto-detect the `Dockerfile` and start building

3. **Add Persistent Volume** (required for data persistence)
   - In the Railway project, go to your service → **Volumes** tab
   - Click **+ New Volume**
   - Create the volume:
     - Name: `monitoring-data` (or any name)
     - Mount Path: `/data` (must be exactly `/data`)
   - **Important:** Without this volume, all metrics and dashboards are lost on redeploy
   - Note: Railway allows only one volume per service; `/data` contains subdirectories for all persistent data

4. **Expose Public Port**
   - In the Railway project, go to **Settings → Networking**
   - Click **Generate Domain** or use an existing domain
   - Ensure the public port is exposed:
     - Port: `3000`
     - Protocol: `HTTP`

5. **Access Grafana**
   - Open your Railway domain: `https://<your-project>.up.railway.app`
   - Default credentials:
     - Username: `admin`
     - Password: `admin`
   - **Important:** Change the password on first login

6. **Access Documentation**
   - Same domain, path `/docs`: `https://<your-project>.up.railway.app/docs`
   - Contains: setup instructions, credentials, how to add services
   - Edit `docs/index.html` to customize the content

7. **Verify Auto-Provisioning**
   - Prometheus datasource: automatically configured at `http://localhost:9090`
   - Unified dashboard: automatically imported (check Dashboards → "All Servers")
   - No manual import needed; everything is provisioned on startup from `grafana-provisioning/`
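For reference, Grafana's datasource provisioning format looks like the following. This is a typical example of the format, not necessarily the exact file shipped in `grafana-provisioning/datasources/`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```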
## Configuration
### Adding Services to Monitor
Edit `prometheus.yml` to add services. The file defines two job types:
#### Job 1: Availability Monitoring (HTTPS Probes)
The availability job uses Blackbox Exporter to probe HTTPS endpoints:
```yaml
- job_name: "availability"
  metrics_path: /probe
  params:
    module: [https_2xx]
  static_configs:
    - targets:
        - https://service-a.up.railway.app/health
        - https://service-b.up.railway.app/health
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - target_label: __address__
      replacement: localhost:9115
```

Requirements for monitored endpoints:
- Must be public HTTPS URLs
- Should return HTTP 200 (or any 2xx) status code
- Response body must contain "ok" (case-insensitive), e.g. plain `OK` or JSON `{"status":"ok"}`
- Railway domains work: `*.up.railway.app`
Optional labels for alerting (unverified - confirm with team):
- `responsible`: Display name for the dashboard (e.g. `"@Michał Basznianin"`)
- `slack_id`: Slack Member ID for @mentions in alerts (e.g. `"U0ABC123"`)
Example with labels:
```yaml
- targets:
    - https://service-a.up.railway.app/health
  labels:
    responsible: "@John Doe"
    slack_id: "U0ABC123"
```

#### Job 2: Coverage Metrics (unverified - confirm with team)
The coverage job scrapes test coverage metrics from your services (if implemented):
```yaml
- job_name: "coverage"
  scheme: https
  metrics_path: /metrics
  static_configs:
    - targets:
        - service-a.up.railway.app
        - service-b.up.railway.app
```

Requirements:
- Each service must expose a `/metrics` endpoint
- Metrics must follow the Prometheus exposition format
- Required metric: `unit_test_coverage{service="service-name"} 0.82` (value 0.0-1.0)
### Blackbox Exporter Configuration
The `blackbox.yml` defines probe modules (example configuration):

```yaml
modules:
  https_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
```

This module probes HTTP/HTTPS endpoints, expecting a 200 status code within a 5s timeout.
## Grafana Dashboard
### Pre-Provisioned Dashboard
The service includes an auto-provisioned unified dashboard located at:
`grafana-provisioning/dashboards/default/all-servers.json`
Provisioned automatically on startup:
- Prometheus datasource - configured to `http://localhost:9090`
- Unified dashboard - imported from `all-servers.json`
- Alert rules - loaded from `grafana-provisioning/alerting/`
Access the dashboard:
- Log in to Grafana
- Navigate to Dashboards
- Look for "All Servers - Unified Dashboard" (or similar name)
### Common PromQL Queries
Service Availability:

```promql
probe_success{job="availability"}
```

Response Time:

```promql
probe_http_duration_seconds{job="availability"}
```

Service Uptime (last 24h):

```promql
avg_over_time(probe_success{job="availability"}[24h]) * 100
```

Coverage Percentage (if the coverage job is configured):

```promql
unit_test_coverage * 100
```

### Creating Custom Dashboards
Refer to the current README.md section "Grafana Dashboard Setup - Complete Guide" for detailed instructions on creating custom panels with availability tables, response time graphs, and coverage metrics.
## Alerting (Optional)
Alert configuration is provisioned from `grafana-provisioning/alerting/`:
Setup requirements:
- Set the `SLACK_WEBHOOK_URL` environment variable in Railway
- Uncomment the alerting provisioning in the Dockerfile if it is disabled
- Configure contact points in `grafana-provisioning/alerting/contact-points.yml`
Alert behavior:
- Repeat interval: set in `policies.yml` (e.g. `5m` to match the evaluation interval)
- Slack @mentions: use the `slack_id` label on targets (Slack Member ID, e.g. `U0ABC123`)
  - Get the ID: Slack → User profile → More → Copy member ID
- Without a `slack_id`, no @mention is added to the notification
Example alert rules (unverified - confirm with team):
- Service Down: `probe_success == 0` for 2+ minutes
- Low Coverage: `unit_test_coverage < 0.80` for 10+ minutes
## Persistent Volumes
### Why Volumes Are Critical
Without a volume, all data is lost on:
- Service redeploy
- Container restart
- Railway maintenance
Data requiring persistence:
- Prometheus: Historical metrics (grows ~100MB-1GB)
- Grafana: Dashboards, datasources, users (~50-200MB)
- Logs: Application logs (~10-100MB)
### Single Volume Setup
**Railway allows only ONE volume per service.**
Configuration:
- Mount point: `/data` (exactly)
- The entrypoint script creates subdirectories:
  - `/data/prometheus` → symlinked to `/var/lib/prometheus`
  - `/data/grafana` → symlinked to `/var/lib/grafana`
  - `/data/grafana-logs` → symlinked to `/var/log/grafana`
How it works:
1. `entrypoint.sh` runs on container start
2. It creates subdirectories in `/data` if they are missing
3. It creates symlinks from the standard paths to the volume subdirectories
4. Services use the standard paths transparently
To add the volume:
1. Railway dashboard → your service → **Volumes** tab
2. Click **+ New Volume**
3. Mount Path: `/data` (must be exact)
4. Name: `monitoring-data` (or any name)
5. Redeploy the service
### Verifying the Volume Works
After adding volume and redeploying:
- Prometheus: query metrics in Grafana → if historical data survives a restart, the volume works
- Grafana: create a test dashboard → redeploy → if the dashboard persists, the volume works
## Accessing Services
| Service | URL | Access | Port |
|---|---|---|---|
| Grafana | `https://<railway-domain>` | Public (Railway domain) | 3000 |
| Documentation | `https://<railway-domain>/docs` | Public (same domain) | 3000 |
| Prometheus UI | `http://localhost:9090` | Internal only (within container) | 9090 |
| Blackbox Exporter | `http://localhost:9115` | Internal only (within container) | 9115 |
Default Grafana credentials:
- Username: `admin`
- Password: `admin` (change on first login)
Overriding credentials: set the Railway environment variables:
- `GF_SECURITY_ADMIN_USER`
- `GF_SECURITY_ADMIN_PASSWORD`
## Troubleshooting
### Issue: `probe_http_status_code` is 0 but the service responds
Symptom: The Blackbox probe reports `probe_http_status_code` 0, but the go-probe tool succeeds inside the container.
Diagnosis from inside container:
```bash
# Run the exact request Prometheus sends to Blackbox
curl -sS 'http://localhost:9115/probe?target=https%3A%2F%2Fyour-service.up.railway.app%2Fhealth&module=https_2xx' \
  | grep -E "probe_success|probe_http_status_code"
```

If `probe_http_status_code` is 200:
- Blackbox works when called manually
- Check Prometheus Targets page:
http://localhost:9090/targets - Look for
availabilityjob scrape errors
If `probe_http_status_code` is 0:
- Blackbox is failing (encoding, module, or target issue)
- Try the probe without URL encoding: `curl -sS 'http://localhost:9115/probe?target=https://your-service.up.railway.app/health&module=https_2xx'`
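Since encoding mistakes are a frequent culprit, building the probe URL programmatically sidesteps them. A small Go sketch (illustrative; this is not the bundled go-probe tool, and `probeURL` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"net/url"
)

// probeURL builds the Blackbox probe request with url.Values, so the
// target URL is percent-encoded correctly in the query string.
func probeURL(blackbox, target, module string) string {
	q := url.Values{}
	q.Set("target", target)
	q.Set("module", module)
	return blackbox + "/probe?" + q.Encode() // Encode sorts parameters by key
}

func main() {
	fmt.Println(probeURL("http://localhost:9115",
		"https://your-service.up.railway.app/health", "https_2xx"))
}
```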
Use `debug-probe.sh`:

```bash
# Inside the container (Railway CLI or exec)
/usr/local/bin/debug-probe.sh
```

### Issue: No metrics appearing in Grafana
Check:
- Prometheus targets page: `http://localhost:9090/targets` (accessible from inside the container)
- Verify `prometheus.yml` has the correct target URLs
- Ensure the monitored services are publicly accessible
- Check Railway logs for Prometheus scrape errors
### Issue: Dashboard not auto-provisioned
Check:
- Verify that `grafana-provisioning/dashboards/default/all-servers.json` exists
- Check the Grafana logs in Railway for provisioning errors
- Ensure the volume is mounted at `/data` (provisioning may fail without write access)
### Issue: Data lost after redeploy
Solution:
- Verify the volume mount path is exactly `/data` (not `/var/lib/prometheus` or another path)
- Check the Railway logs for entrypoint script errors (symlink creation)
- Verify volume is attached to the service in Railway dashboard
## Environment Variables
Configure via Railway environment variables:
| Variable | Description | Default |
|---|---|---|
| `GF_SERVER_HTTP_ADDR` | Grafana bind address (internal) | (set by entrypoint) |
| `GF_SERVER_HTTP_PORT` | Grafana port (internal, accessed via the proxy) | (set by entrypoint) |
| `GF_SECURITY_ADMIN_USER` | Grafana admin username | `admin` |
| `GF_SECURITY_ADMIN_PASSWORD` | Grafana admin password | `admin` |
| `SLACK_WEBHOOK_URL` | Slack webhook for alerts | (unset) |
Note: The reverse proxy listens on port 3000 and routes to Grafana internally. Grafana's actual port is configured by supervisord/entrypoint.
## Security Notes
- Change the default Grafana password immediately after first login
- Monitored services should use HTTPS endpoints
- Prometheus (9090) and Blackbox (9115) are not exposed publicly by default (internal use only)
- Use Railway environment variables for sensitive configuration (passwords, webhooks)
- Edit `docs/index.html` to remove or secure sensitive documentation
## Additional Resources
- Prometheus Documentation
- Grafana Documentation
- Blackbox Exporter
- Railway Documentation
- Internal guides (unverified - confirm with team):
  - `ADD_NEW_SERVICE.md` - How to add services to monitoring
  - `SERVICE_OWNER_GUIDE.md` - Guide for service owners
## Status
Production: Deployed on Railway as single-container monitoring stack
Known limitations:
- Railway allows only one volume per service (handled by the `/data` subdirectory structure)
- Coverage monitoring requires services to expose a `/metrics` endpoint (optional feature)
- Slack alerting requires manual webhook configuration
Built for Wellysa • Deployed on Railway