Web Infra Lab — Pierre Baroni

Architecture

Multi-layer infrastructure

From the visitor's request to the application backend: every layer is secured, monitored, and redundant.

// 01 · Edge & CDN

Visitor

Browser · HTTPS

──▶

Cloudflare

CDN · WAF · DDoS · SSL Edge

↓

// 02 · Load Balancing

HAProxy

Load Balancer · :80/:443

──▶

Nginx

Reverse Proxy · SSL Termination

──▶

Let's Encrypt

Certbot · Auto-renew · 12 certs

↓

// 03 · Application Layer

Flask API

Python · Gunicorn · :808X

·

AI worker

Groq LLaMA · Claude API

·

n8n

Orchestration · 24 workflows

·

Drupal ×12

CMS · Cloud hosted

↓

// 04 · Data & Async

Redis

Queue · Cache · Sessions

──▶

Async jobs

Workers · Retry · Backoff

──▶

SQLite / DB

Persistence · Backups

// 05 · Observability & CI/CD

GitLab CI

Pipelines · rsync

·

Monitoring

zensar-check.sh · Cron 07:00

·

Logs

Nginx · journald · Loki

·

Alerting

Email · SSL expiry · HTTP check

Live System

Service status

Real-time view of the simulated infrastructure: 18 active services, automatic daily monitoring.

8,400

Requests / day

↑ +3.2% vs yesterday

24

Active n8n workflows

Brevo · Groq · Twilio

12

Valid SSL certs

Next renewal: 45 days

18

K8s services

K3s · Traefik · ArgoCD

Service	Stack	Uptime	Status	Last check
Nginx reverse proxy	Debian · systemd	99.98%	● RUNNING	07:00:01 UTC
HAProxy load balancer	HAProxy 2.6 · :80/:443	99.94%	● RUNNING	07:00:01 UTC
Cloudflare CDN	Full (strict) · WAF ON	100%	● ACTIVE	07:00:02 UTC
SSL certificates (×12)	Let's Encrypt · certbot	100%	● VALID	07:00:03 UTC
n8n orchestration	K3s · NodePort 30678	99.91%	● RUNNING	07:00:04 UTC
GitLab CI pipelines	rsync · SSH deploy	99.87%	● PASSING	06:52:14 UTC
zensar-check.sh	Bash · cron 0 7 * * *	100%	● SCHEDULED	07:00:01 UTC
Disk /	Debian · ext4	42% used	● OK	07:00:01 UTC

Incidents & Resilience

Resolved tickets · Full SRE workflow

Diagnosis → root cause → fix → verification → post-mortem → prevention. Every ticket is documented.

WEB-023

Task · ✓ Resolved

New Drupal site: Nginx vhost + SSL certificate

~15 min

1

DNS pre-check, before anything else

Check that the DNS record points to this server before running certbot. A failed ACME challenge blocks the rest of the procedure.

bash

dig drupal-client9.webinfra-lab.com +short
# → expected: this server's IP
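
The pre-check above can be scripted so that certbot only runs once the record resolves to the expected address. A minimal sketch, assuming a hypothetical helper named `dns_points_to` and a documentation-range address (203.0.113.10) standing in for the real server IP. The comparison is kept separate from the `dig` call so it can be exercised without network access.

```shell
# Sketch only: dns_points_to and the sample IP are illustrative,
# not part of the original runbook.

# Succeeds when the expected IP appears as a full line in the resolver output.
dns_points_to() {
    local expected_ip="$1"
    local dig_output="$2"
    # -x: match the whole line, -F: treat the dots literally
    printf '%s\n' "$dig_output" | grep -qxF -- "$expected_ip"
}

# Usage (needs network access and dig):
#   answer="$(dig +short drupal-client9.webinfra-lab.com)"
#   dns_points_to "203.0.113.10" "$answer" \
#       || { echo "DNS not ready, skipping certbot" >&2; exit 1; }
```

Matching whole lines avoids a false positive when one address is a prefix of another, for example 203.0.113.1 against 203.0.113.10.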

2

HTTP-only vhost, no SSL yet

Certbot needs to answer on :80 for the ACME challenge, so the first step is an HTTP-only vhost with proxy_pass and rules that block sensitive files.

nginx

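server {
    listen 80;
    server_name drupal-client9.webinfra-lab.com;
    # ACME HTTP-01 challenge files served from the webroot
    location /.well-known/acme-challenge/ { root /var/www/html; }
    # Everything else goes to the Drupal backend
    location / {
        proxy_pass         http://127.0.0.1:8082;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   X-Forwarded-Proto $scheme;
    }
    # Block sensitive files by extension
    location ~* \.(git|htaccess|yml|sql)$ { deny all; }
    # Long-lived cache headers for static assets. proxy_pass is repeated here
    # because it is not inherited from "location /"; without it these requests
    # would be served from the local filesystem instead of the backend.
    location ~* \.(js|css|png|svg|woff2)$ {
        proxy_pass         http://127.0.0.1:8082;
        proxy_set_header   Host $host;
        expires max;
        access_log off;
    }
}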

3

Enable → nginx -t → certbot → verification

bash

sudo ln -s /etc/nginx/sites-available/drupal-client9.webinfra-lab.com /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx          # always run nginx -t before a reload
sudo certbot --nginx -d drupal-client9.webinfra-lab.com  # certbot rewrites the vhost for SSL
sudo nginx -t && sudo systemctl reload nginx
sudo certbot renew --dry-run                          # check that auto-renewal works
curl -I https://drupal-client9.webinfra-lab.com       # → HTTP/2 200 ✅

// Jira comment

Completed Nginx vhost + SSL for drupal-client9.webinfra-lab.com.
- Vhost HTTP only first (ACME challenge) → certbot --nginx → auto rewrite SSL
- proxy_pass 127.0.0.1:8082 · security rules · static cache · separate logs
- certbot renew --dry-run: OK · curl: HTTP/2 200 ✅
Site live and ready for QA.

✅ ~15 min · Nginx · SSL/TLS · Let's Encrypt · Linux

WEB-031

Task · ✓ Resolved

502 Bad Gateway: Python/Flask backend unreachable

~10 min

1

Nginx logs → Connection refused on port 8084

bash

sudo tail -50 /var/log/nginx/error.log
# → connect() failed (111: Connection refused) upstream: http://127.0.0.1:8084
sudo ss -tlnp | grep 8084    # → no output: nothing is listening on the port

2

systemd service failed → journalctl → root cause

bash

systemctl list-units | grep drupal   # → drupal-client4: FAILED
sudo journalctl -u drupal-client4.service -n 50
# → ModuleNotFoundError: No module named 'flask'
# Python dependency missing from the venv

3

Fix: install the full requirements.txt, not just flask

⚠ Always run pip install -r requirements.txt, never just the missing module. Other dependencies may be missing too.

bash

cd /var/www/drupal-client4 && source venv/bin/activate
pip install -r requirements.txt
sudo systemctl restart drupal-client4.service
sudo systemctl reload nginx
curl -I http://127.0.0.1:8084    # → HTTP 200 ✅
curl -I https://drupal-client4.webinfra-lab.com  # → HTTP/2 200 ✅

✓ Prevention: add pip install -r requirements.txt to the CI/CD pipeline so that deployments do not ship with missing dependencies.

// Jira comment

Root cause: drupal-client4.service FAILED. journalctl: ModuleNotFoundError: No module named 'flask'. Python dep missing from venv.
Fix: source venv/bin/activate → pip install -r requirements.txt → systemctl restart → reload nginx
Verification: HTTP 200 backend ✅ · HTTP/2 200 frontend ✅
Preventive: add pip install -r requirements.txt to CI/CD deploy step.
Status: ✅ Resolved.

✅ ~10 min · Nginx · Python/Flask · venv · systemd

INC-001

🚨 P1 · ✓ Resolved

504 Gateway Timeout: CPU saturated · Python workers stuck

~20 min · SLA OK

IMPACT: Business application unavailable · All users blocked · SLA <20 min

1

10:14 UTC: Nginx logs → upstream timed out

bash

sudo tail -100 /var/log/nginx/error.log | grep "client-portal"
# → upstream timed out (110) upstream: http://127.0.0.1:8082
# Port is listening → backend is alive but not answering → hung, not crashed

2

10:14 UTC: direct curl → 5 s timeout · port up but silent

bash

curl -v --max-time 5 http://127.0.0.1:8082
# → Connected (port is listening) BUT Operation timed out after 5001ms
# The process is alive → no kill -9 yet → find out why first
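
The "hung, not crashed" distinction also shows up in curl's exit status: 7 means the connection could not be established (the situation in WEB-031), while 28 means the operation timed out (the situation here). A minimal sketch of that first triage, assuming a hypothetical helper named `classify_backend`; the labels it prints are illustrative.

```shell
# Sketch only: classify_backend and its labels are illustrative assumptions.

# Maps a curl exit status to a first diagnosis.
classify_backend() {
    case "$1" in
        0)  echo "responding" ;;  # request completed
        7)  echo "refused" ;;     # nothing listening: check the service state
        28) echo "timeout" ;;     # no answer within --max-time: suspect hung workers
        *)  echo "other" ;;       # DNS, TLS, etc.: read curl's error message
    esac
}

# Usage:
#   curl -s -o /dev/null --max-time 5 http://127.0.0.1:8082
#   classify_backend "$?"
```

A "refused" result points at systemctl and journalctl, while a "timeout" result points at top and the workers themselves.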

3

10:15 UTC: root cause: 3 Python workers at 99% CPU

bash

top -bn1 | head -10
# → www-data python3 app.py : 99.3% CPU, 45.2% RAM
# → www-data python3 app.py : 98.7% CPU, 44.8% RAM
# → www-data python3 app.py : 97.9% CPU, 44.5% RAM
# Blocking loop → Gunicorn workers saturated → Nginx timeout → 504

4

10:16 UTC: P1 fix: systemctl restart (restore > analyze)

⚠ P1 rule: restore first, analyze afterwards. Why not kill -HUP? With Python/Gunicorn its outcome is not guaranteed when workers are already stuck, and it can leave the service in an even less stable state. Why restart rather than reload? A reload leaves the stuck workers running; a restart kills them cleanly and starts fresh ones.

bash

sudo systemctl restart client-portal.service    # stuck workers killed and recreated cleanly
curl -I http://127.0.0.1:8082                   # → HTTP 200 ✅
curl -I https://client-portal.webinfra-lab.com  # → HTTP/2 200 ✅
sudo journalctl -u client-portal.service -n 50  # post-fix analysis

✓ Prevention: run Gunicorn with --timeout 30 so that stuck workers are killed automatically, which avoids the CPU saturation and the P1.

// Jira P1 post-mortem

🚨 P1 INCIDENT RESOLVED: client-portal.webinfra-lab.com
Duration: ~20 min | SLA respected
Impact: All users unable to access application during incident window.
Root cause: Python workers (app.py) saturated CPU (~99%) and RAM (~45% each). Port 8082 was listening but not responding: workers stuck in blocking loop → Nginx upstream timeout → 504.
Timeline:
- 10:14 UTC: 504 reported · Nginx logs: upstream timed out (127.0.0.1:8082)
- 10:14 UTC: curl direct: Connected but timed out (port alive, workers frozen)
- 10:15 UTC: top: 3x python3 at ~99% CPU / ~45% RAM
- 10:16 UTC: systemctl restart client-portal → service restored
- 10:16 UTC: HTTP/2 200 confirmed
Immediate mitigation: sudo systemctl restart client-portal.service
Note: kill -HUP not used (Python/Gunicorn behavior non-guaranteed). restart chosen over reload (kills frozen workers cleanly).
Post-fix: curl backend HTTP 200 ✅ · curl frontend HTTP/2 200 ✅ · CPU normalized ✅
Next steps:
- Add --timeout 30 to Gunicorn (auto-kill frozen workers)
- Investigate blocking condition in app.py
- Monitor 24h
Status: ✅ Resolved · no data loss · SLA respected.

✅ ~20 min · ⚠ P1 SLA met · P1 · SRE · 504 · Gunicorn · ITIL

Fail Safe

Resilience & protection patterns

The mechanisms that keep the infrastructure running unattended: what separates a fragile system from a production-grade one.

🔁

systemd restart policy

Every service is configured with Restart=on-failure and RestartSec=5s. A crash triggers an automatic restart with no human intervention.

Restart=on-failure · RestartSec=5s
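
One way to apply these two settings without editing the packaged unit file is a systemd drop-in. A minimal sketch, assuming a hypothetical helper named `write_restart_dropin` and the file name `restart.conf`; how the lab's units are actually managed is not shown in the source.

```shell
# Sketch only: write_restart_dropin and restart.conf are illustrative names.

# Writes a drop-in that enables automatic restart for one unit.
# $1: unit name, $2: base directory (defaults to /etc/systemd/system)
write_restart_dropin() {
    local unit="$1"
    local base_dir="${2:-/etc/systemd/system}"
    local dropin_dir="$base_dir/$unit.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5s
EOF
}

# Usage (as root), then reload unit files so the override is picked up:
#   write_restart_dropin client-portal.service
#   systemctl daemon-reload
```

Restart=on-failure reacts to a process that exits abnormally. A worker that is alive but hung, as in INC-001, does not trigger it unless a watchdog is configured, which is why the Gunicorn timeout is a separate safeguard.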

⏱

Gunicorn worker timeout

Python workers run with --timeout 30: a worker stuck for more than 30 s is killed and replaced automatically, which avoids the CPU saturation and the P1.

gunicorn --timeout 30 --workers 3

🛡

HAProxy health checks

HTTP check every 3 s on each backend. A backend that fails 3 checks in a row is taken out of the pool, and traffic shifts to the others automatically.

check inter 3s rise 2 fall 3
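
To make the `rise 2 fall 3` semantics concrete, the sketch below replays a sequence of check results through the same kind of counters: a server that is UP goes DOWN after 3 consecutive failures and comes back after 2 consecutive successes. It is a simplified model for illustration, not HAProxy's implementation, and the function name `simulate_checks` is hypothetical.

```shell
# Simplified model of rise/fall counters (illustration only).
# Args: rise fall result...   where each result is "ok" or "ko".
# Prints the server state after each check, space-separated.
simulate_checks() {
    local rise="$1"
    local fall="$2"
    shift 2
    local state="UP"
    local streak=0
    local out=""
    local result
    for result in "$@"; do
        if [ "$state" = "UP" ]; then
            # While UP, count consecutive failures; any success resets the streak.
            if [ "$result" = "ko" ]; then streak=$((streak + 1)); else streak=0; fi
            if [ "$streak" -ge "$fall" ]; then state="DOWN"; streak=0; fi
        else
            # While DOWN, count consecutive successes; any failure resets the streak.
            if [ "$result" = "ok" ]; then streak=$((streak + 1)); else streak=0; fi
            if [ "$streak" -ge "$rise" ]; then state="UP"; streak=0; fi
        fi
        out="${out:+$out }$state"
    done
    echo "$out"
}

simulate_checks 2 3 ok ko ko ko ok ok
# → UP UP UP DOWN DOWN UP
```

With `inter 3s`, three consecutive failures means a dead backend keeps receiving traffic for roughly nine seconds before it is removed, so the interval and `fall` value together set the detection delay.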

🔒

SSL auto-renew

certbot.timer checks the certificates every day and renews them automatically 30 days before expiry. zensar-check.sh raises an alert if a certificate drops below 30 days.

certbot renew --nginx · certbot.timer · daily
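
The 30-day threshold comes down to plain date arithmetic. A minimal sketch, assuming GNU `date` (for the `-d` option) and hypothetical helper names `days_until` and `cert_status`; how zensar-check.sh actually computes this is not shown in the source.

```shell
# Sketch only: helper names are illustrative; requires GNU date.

# Whole days from a reference date (default: now) to an expiry date.
# Both arguments must be parseable by `date -d`.
days_until() {
    local expiry="$1"
    local now="${2:-now}"
    local expiry_s now_s
    expiry_s=$(date -u -d "$expiry" +%s) || return 1
    now_s=$(date -u -d "$now" +%s) || return 1
    echo $(( (expiry_s - now_s) / 86400 ))
}

# Prints ALERT when fewer than $2 days remain (default 30), OK otherwise.
cert_status() {
    local days_left="$1"
    local threshold="${2:-30}"
    if [ "$days_left" -lt "$threshold" ]; then echo "ALERT"; else echo "OK"; fi
}

# Usage against a live certificate (needs openssl and network access):
#   end=$(echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
#         | openssl x509 -noout -enddate | cut -d= -f2)
#   cert_status "$(days_until "$end")"
```

Because certbot also renews at 30 days, an alert from this check suggests that a renewal attempt has not succeeded yet.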

👁

Daily monitoring

Bash script at 07:00: HTTP check on every site, SSL expiry, disk, services. Automatic email on any anomaly. No manual intervention needed.

zensar-check.sh · 0 7 * * *
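
The "email only on anomaly" behaviour can be sketched as a filter that prints nothing when every site is healthy. This is an illustration under stated assumptions, not the contents of zensar-check.sh: the helper names `http_ok` and `collect_anomalies`, the report format, and the recipient address are all hypothetical.

```shell
# Sketch only: names, report format and recipient are assumptions.

# Succeeds for 2xx and 3xx HTTP status codes.
http_ok() {
    case "$1" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}

# Reads "site status_code" lines on stdin and prints one line per failing site.
# Prints nothing when everything is healthy.
collect_anomalies() {
    local site code
    while read -r site code; do
        http_ok "$code" || echo "HTTP $code on $site"
    done
}

# Usage: gather status codes with curl, send mail only when the report is non-empty.
#   report=$(for s in $SITES; do
#       echo "$s $(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$s")"
#   done | collect_anomalies)
#   [ -n "$report" ] && printf '%s\n' "$report" | mail -s "infra check" ops@example.com
```

curl reports a status of 000 when it cannot connect at all, which this filter also treats as an anomaly.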

📦

HAProxy backup server

A backup server is configured in HAProxy and takes over automatically when all active servers are DOWN, which keeps the service available.

server drupal-backup :808X backup