This commit is contained in:
spham 2025-09-13 23:47:30 +02:00
parent 534f45fda7
commit 645f7d2208
7 changed files with 3 additions and 236 deletions

## 📈 Roadmap
### v1.0
- ✅ Complete Hetzner infrastructure
- ✅ GPU auto-scaling
- ✅ Production-ready monitoring
- ✅ Automated tests
### v1.1
- 🔄 Multi-region (Nuremberg + Helsinki)
- 🔄 Kubernetes support (optional)
- 🔄 Advanced cost optimization
- 🔄 Intelligent model caching
### v2.0
- 🆕 Support H100 servers
- 🆕 Edge deployment
- 🆕 Fine-tuning pipeline
- 🆕 Advanced observability
---
📖 **Read the full article**: [Production-Ready AI Infrastructure with Hetzner](article.md)

# Deployment Guide
## Quick Start
### Prerequisites
- Ubuntu 24.04 on all servers
- Terraform 1.12+
- Ansible 8.0+
- Python 3.12+
- Hetzner Cloud + Robot API access
### Development Deployment
```bash
# 1. Initial setup
git clone <repository>
cd ai-infrastructure-hetzner
# 2. Environment variables
export HCLOUD_TOKEN="your-hetzner-cloud-token"
export HETZNER_ROBOT_USER="your-robot-username"
export HETZNER_ROBOT_PASSWORD="your-robot-password"
# 3. Terraform Development
cd terraform/environments/development
terraform init
terraform plan -var-file="dev.tfvars"
terraform apply -var-file="dev.tfvars"
# 4. Generate the Ansible inventory
cd ../../../inventories
python3 generate_inventory.py development
# 5. Configure the servers
cd ../ansible
ansible-playbook -i inventories/development/hosts.yml site.yml --limit development
```
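Before running step 3, the three credentials from step 2 can be sanity-checked so Terraform does not fail mid-plan. A minimal sketch, assuming only that the variable names match step 2 above:

```python
import os
import sys

# The three credentials exported in step 2 of the Quick Start.
REQUIRED_VARS = [
    "HCLOUD_TOKEN",
    "HETZNER_ROBOT_USER",
    "HETZNER_ROBOT_PASSWORD",
]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        sys.exit(f"Missing environment variables: {', '.join(missing)}")
    print("All Hetzner credentials are set.")
```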
### File Structure
```
inventories/
├── development/
│ ├── requirements.yml # Dev business requirements
│ ├── hosts.yml # Auto-generated
│ └── ssh_config # Generated SSH config
├── staging/
│ ├── requirements.yml # Staging business requirements
│ └── ...
├── production/
│ ├── requirements.yml # Production business requirements
│ └── ...
└── generate_inventory.py # Inventory generator
```
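The source of `generate_inventory.py` is not part of this guide; its core idea is turning `terraform output -json` into a `hosts.yml`. A minimal sketch of that transformation, where the `server_ips` output key and the group layout are hypothetical — the real generator reads the actual Terraform outputs:

```python
import json

def render_hosts_yml(tf_outputs: dict, environment: str) -> str:
    """Render a minimal Ansible hosts.yml from `terraform output -json`.

    Assumes a hypothetical `server_ips` output mapping hostname -> IP.
    """
    ips = tf_outputs["server_ips"]["value"]
    lines = ["all:", "  children:", f"    {environment}:", "      hosts:"]
    for host, ip in sorted(ips.items()):
        lines.append(f"        {host}:")
        lines.append(f"          ansible_host: {ip}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Expects the file produced by: terraform output -json > outputs.json
    with open("outputs.json") as f:
        outputs = json.load(f)
    print(render_hosts_yml(outputs, "development"))
```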
## Deployment Workflow
### Development → Staging → Production
```mermaid
graph LR
A[develop branch] --> B[Auto Deploy DEV]
B --> C[Tests Integration]
C --> D[main branch]
D --> E[Manual Deploy STAGING]
E --> F[Tests Load]
F --> G[v*.*.* tag]
G --> H[Manual Deploy PROD]
H --> I[Health Checks]
```
### Commands per Environment
```bash
# Development (automatic on push to develop)
terraform -chdir=terraform/environments/development apply -auto-approve
python3 inventories/generate_inventory.py development
ansible-playbook -i inventories/development/hosts.yml site.yml
# Staging (manual on main)
terraform -chdir=terraform/environments/staging apply
python3 inventories/generate_inventory.py staging
ansible-playbook -i inventories/staging/hosts.yml site.yml --check
ansible-playbook -i inventories/staging/hosts.yml site.yml
# Production (manual on tag)
terraform -chdir=terraform/environments/production apply
python3 inventories/generate_inventory.py production
ansible-playbook -i inventories/production/hosts.yml site.yml --check
# Manual confirmation required
ansible-playbook -i inventories/production/hosts.yml site.yml
```
## Configuration per Environment
### Development
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 1x CX31 (CPU-only)
- **Model**: DialoGPT-small (lightweight)
- **Deployment**: Automatic on develop
- **Tests**: Integration only
### Staging
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 1x GEX44 + 1x CX21
- **Model**: Mixtral-8x7B (quantized)
- **Deployment**: Manual on main
- **Tests**: Integration + Load
### Production
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 3x GEX44 + 2x CX31 + 1x CX21
- **Model**: Mixtral-8x7B (optimized)
- **Deployment**: Manual on tag + confirmation
- **Tests**: Smoke + health checks
## Rollback Procedures
### Rollback Application
```bash
# Via MLflow (recommended)
python3 scripts/rollback_model.py --environment production --version previous
# Via Ansible tags
ansible-playbook -i inventories/production/hosts.yml site.yml --tags "vllm" --extra-vars "model_version=v1.2.0"
```
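`rollback_model.py` itself is not shown in this guide. One way its `--version previous` flag could resolve to a concrete version, sketched against a plain ordered version list — the MLflow registry lookup is omitted and the function name is hypothetical:

```python
def resolve_rollback_version(versions, current, target="previous"):
    """Pick the version to roll back to.

    `versions` is ordered oldest -> newest. `target` is either an explicit
    version string or "previous", meaning the version deployed before
    `current`.
    """
    if target != "previous":
        return target
    idx = versions.index(current)
    if idx == 0:
        raise ValueError("no earlier version to roll back to")
    return versions[idx - 1]
```

For example, with `v1.3.0` live, `--version previous` would resolve to `v1.2.0`, matching the explicit `model_version=v1.2.0` used in the Ansible variant above.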
### Rollback Infrastructure
```bash
# Terraform state rollback
terraform -chdir=terraform/environments/production state pull > backup.tfstate
terraform -chdir=terraform/environments/production import <resource> <id>
# Ansible configuration rollback
git checkout <previous-commit> ansible/
ansible-playbook -i inventories/production/hosts.yml site.yml --check
```
## Troubleshooting
### Diagnostic Commands
```bash
# Ubuntu 24.04 system check
ansible all -i inventories/production/hosts.yml -m setup -a "filter=ansible_distribution*"
# Service status
ansible gex44_production -i inventories/production/hosts.yml -m systemd -a "name=vllm-api"
# Application logs
ansible gex44_production -i inventories/production/hosts.yml -m shell -a "journalctl -u vllm-api --since '1 hour ago'"
# GPU status
ansible gex44_production -i inventories/production/hosts.yml -m shell -a "nvidia-smi"
# Test endpoints
curl https://ai-api.company.com/health
curl https://ai-api.company.com/v1/models
```
### Common Issues
#### GPU not detected
```bash
# Check the NVIDIA driver on Ubuntu 24.04
sudo nvidia-smi
sudo dkms status
# Reinstall if needed
sudo apt purge nvidia-* -y
sudo apt install nvidia-driver-545 -y
sudo reboot
```
#### Service vLLM failed
```bash
# Check logs
journalctl -u vllm-api -f
# Common issues:
# - OOM: reduce gpu_memory_utilization
# - Model not found: check the MLflow path
# - Port conflict: netstat -tulpn | grep 8000
```
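For the OOM case, `gpu_memory_utilization` has to leave room for the model weights plus KV cache. A back-of-the-envelope helper for picking a value — the 20 GB default matches the RTX 4000 Ada in the GEX44, but the margin and the function itself are illustrative, not a vLLM API:

```python
def suggest_gpu_memory_utilization(model_gb, kv_cache_gb,
                                   gpu_total_gb=20.0, margin_gb=1.0):
    """Suggest a vLLM gpu_memory_utilization fraction.

    Returns the share of GPU memory needed for weights + KV cache plus a
    safety margin, capped at 0.95. Raises if the model cannot fit at all.
    """
    needed = model_gb + kv_cache_gb + margin_gb
    if needed > gpu_total_gb:
        raise ValueError(f"model needs {needed:.1f} GB, GPU has {gpu_total_gb:.1f} GB")
    return min(round(needed / gpu_total_gb, 2), 0.95)
```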
#### Inventory generation failed
```bash
# Debug mode
python3 inventories/generate_inventory.py production --debug
# Manual verification
terraform -chdir=terraform/environments/production output -json > outputs.json
jq '.' outputs.json
```
## Security Checklist
### Pre-deployment
- [ ] SSH keys deployed on Ubuntu 24.04
- [ ] Firewall rules configured
- [ ] Secrets in Ansible Vault
- [ ] SSL certificates ready
### Post-deployment
- [ ] SSH access working
- [ ] Services running (systemctl status)
- [ ] Endpoints responding
- [ ] Monitoring active
- [ ] Log aggregation working
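The "Endpoints responding" box can be scripted against the two public endpoints from the Troubleshooting section. A stdlib-only sketch; the base URL is the one used above:

```python
import json
import urllib.request

def check_endpoint(url, timeout=5):
    """Return (ok, parsed_body) for a JSON endpoint; (False, None) on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, json.load(resp)
    except Exception:
        return False, None

if __name__ == "__main__":
    for path in ("/health", "/v1/models"):
        ok, _ = check_endpoint("https://ai-api.company.com" + path)
        print(f"{path}: {'OK' if ok else 'FAILED'}")
```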
## Performance Validation
### Load Testing
```bash
# Development - CPU only
python3 tests/load_test.py --endpoint https://dev-ai-api.internal --concurrent 5
# Staging - 1 GPU
python3 tests/load_test.py --endpoint https://staging-ai-api.company.com --concurrent 20
# Production - 3 GPU
python3 tests/load_test.py --endpoint https://ai-api.company.com --concurrent 100
```
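The implementation of `tests/load_test.py` is not shown; the tokens/sec figure it reports can be aggregated as below. A sketch with assumed result fields (`tokens`, `start_s`, `end_s`) — the key point is dividing by wall-clock time, since concurrent requests overlap:

```python
def throughput_tokens_per_sec(results):
    """Aggregate throughput from per-request load-test results.

    Each result is a dict with 'tokens' generated plus 'start_s'/'end_s'
    timestamps. Concurrent requests overlap, so total tokens are divided
    by wall-clock duration rather than summing per-request rates.
    """
    total_tokens = sum(r["tokens"] for r in results)
    wall_clock = max(r["end_s"] for r in results) - min(r["start_s"] for r in results)
    return total_tokens / wall_clock
```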
### Expected Performance
- **Development**: 1-5 tokens/sec (CPU simulation)
- **Staging**: 80-90 tokens/sec (1x RTX 4000 Ada)
- **Production**: 240-270 tokens/sec (3x RTX 4000 Ada)