This commit is contained in:
spham 2025-09-13 23:47:30 +02:00
parent 534f45fda7
commit 645f7d2208
7 changed files with 3 additions and 236 deletions

## 📈 Roadmap
### v1.0
- ✅ Complete Hetzner infrastructure
- ✅ GPU auto-scaling
- ✅ Production-ready monitoring
- ✅ Automated tests
### v1.1
- 🔄 Multi-region (Nuremberg + Helsinki)
- 🔄 Kubernetes support (optional)
- 🔄 Advanced cost optimization
- 🔄 Intelligent model caching
### v2.0
- 🆕 Support H100 servers
- 🆕 Edge deployment
- 🆕 Fine-tuning pipeline
- 🆕 Advanced observability
---
📖 **Read the full article**: [Production-Ready AI Infrastructure with Hetzner](article.md)

# Deployment Guide
## Quick Start
### Prerequisites
- Ubuntu 24.04 on all servers
- Terraform 1.12+
- Ansible 8.0+
- Python 3.12+
- Hetzner Cloud + Robot API access
### Development Deployment
```bash
# 1. Initial setup
git clone <repository>
cd ai-infrastructure-hetzner
# 2. Environment variables
export HCLOUD_TOKEN="your-hetzner-cloud-token"
export HETZNER_ROBOT_USER="your-robot-username"
export HETZNER_ROBOT_PASSWORD="your-robot-password"
# 3. Terraform Development
cd terraform/environments/development
terraform init
terraform plan -var-file="dev.tfvars"
terraform apply -var-file="dev.tfvars"
# 4. Generate the Ansible inventory
cd ../../../inventories
python3 generate_inventory.py development
# 5. Configure the servers
cd ../ansible
ansible-playbook -i inventories/development/hosts.yml site.yml --limit development
```
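Before running step 3, the three credentials from step 2 can be sanity-checked so Terraform does not fail mid-plan. A minimal sketch, assuming only that the variable names match step 2 above:

```python
import os
import sys

# The three credentials exported in step 2 of the Quick Start.
REQUIRED_VARS = [
    "HCLOUD_TOKEN",
    "HETZNER_ROBOT_USER",
    "HETZNER_ROBOT_PASSWORD",
]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        sys.exit(f"Missing environment variables: {', '.join(missing)}")
    print("All Hetzner credentials are set.")
```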
### File Structure
```
inventories/
├── development/
│ ├── requirements.yml # Dev business requirements
│ ├── hosts.yml # Auto-generated
│ └── ssh_config # Generated SSH config
├── staging/
│ ├── requirements.yml # Staging business requirements
│ └── ...
├── production/
│ ├── requirements.yml # Production business requirements
│ └── ...
└── generate_inventory.py # Inventory generator
```
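The source of `generate_inventory.py` is not part of this guide; its core idea is turning `terraform output -json` into a `hosts.yml`. A minimal sketch of that transformation, where the `server_ips` output key and the group layout are hypothetical — the real generator reads the actual Terraform outputs:

```python
import json

def render_hosts_yml(tf_outputs: dict, environment: str) -> str:
    """Render a minimal Ansible hosts.yml from `terraform output -json`.

    Assumes a hypothetical `server_ips` output mapping hostname -> IP.
    """
    ips = tf_outputs["server_ips"]["value"]
    lines = ["all:", "  children:", f"    {environment}:", "      hosts:"]
    for host, ip in sorted(ips.items()):
        lines.append(f"        {host}:")
        lines.append(f"          ansible_host: {ip}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Expects the file produced by: terraform output -json > outputs.json
    with open("outputs.json") as f:
        outputs = json.load(f)
    print(render_hosts_yml(outputs, "development"))
```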
## Deployment Workflow
### Development → Staging → Production
```mermaid
graph LR
A[develop branch] --> B[Auto Deploy DEV]
B --> C[Tests Integration]
C --> D[main branch]
D --> E[Manual Deploy STAGING]
E --> F[Tests Load]
F --> G[v*.*.* tag]
G --> H[Manual Deploy PROD]
H --> I[Health Checks]
```
### Commands per Environment
```bash
# Development (automatic on push to develop)
terraform -chdir=terraform/environments/development apply -auto-approve
python3 inventories/generate_inventory.py development
ansible-playbook -i inventories/development/hosts.yml site.yml
# Staging (manual on main)
terraform -chdir=terraform/environments/staging apply
python3 inventories/generate_inventory.py staging
ansible-playbook -i inventories/staging/hosts.yml site.yml --check
ansible-playbook -i inventories/staging/hosts.yml site.yml
# Production (manual on tag)
terraform -chdir=terraform/environments/production apply
python3 inventories/generate_inventory.py production
ansible-playbook -i inventories/production/hosts.yml site.yml --check
# Manual confirmation required
ansible-playbook -i inventories/production/hosts.yml site.yml
```
## Configuration per Environment
### Development
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 1x CX31 (CPU-only)
- **Model**: DialoGPT-small (lightweight)
- **Deployment**: Automatic on develop
- **Tests**: Integration only
### Staging
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 1x GEX44 + 1x CX21
- **Model**: Mixtral-8x7B (quantized)
- **Deployment**: Manual on main
- **Tests**: Integration + Load
### Production
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 3x GEX44 + 2x CX31 + 1x CX21
- **Model**: Mixtral-8x7B (optimized)
- **Deployment**: Manual on tag + confirmation
- **Tests**: Smoke + health checks
## Rollback Procedures
### Rollback Application
```bash
# Via MLflow (recommended)
python3 scripts/rollback_model.py --environment production --version previous
# Via Ansible tags
ansible-playbook -i inventories/production/hosts.yml site.yml --tags "vllm" --extra-vars "model_version=v1.2.0"
```
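`rollback_model.py` itself is not shown in this guide. One way its `--version previous` flag could resolve to a concrete version, sketched against a plain ordered version list — the MLflow registry lookup is omitted and the function name is hypothetical:

```python
def resolve_rollback_version(versions, current, target="previous"):
    """Pick the version to roll back to.

    `versions` is ordered oldest -> newest. `target` is either an explicit
    version string or "previous", meaning the version deployed before
    `current`.
    """
    if target != "previous":
        return target
    idx = versions.index(current)
    if idx == 0:
        raise ValueError("no earlier version to roll back to")
    return versions[idx - 1]
```

For example, with `v1.3.0` live, `--version previous` would resolve to `v1.2.0`, matching the explicit `model_version=v1.2.0` used in the Ansible variant above.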
### Rollback Infrastructure
```bash
# Terraform state rollback
terraform -chdir=terraform/environments/production state pull > backup.tfstate
terraform -chdir=terraform/environments/production import <resource> <id>
# Ansible configuration rollback
git checkout <previous-commit> ansible/
ansible-playbook -i inventories/production/hosts.yml site.yml --check
```
## Troubleshooting
### Diagnostic Commands
```bash
# Ubuntu 24.04 system check
ansible all -i inventories/production/hosts.yml -m setup -a "filter=ansible_distribution*"
# Service status
ansible gex44_production -i inventories/production/hosts.yml -m systemd -a "name=vllm-api"
# Application logs
ansible gex44_production -i inventories/production/hosts.yml -m shell -a "journalctl -u vllm-api --since '1 hour ago'"
# GPU status
ansible gex44_production -i inventories/production/hosts.yml -m shell -a "nvidia-smi"
# Test endpoints
curl https://ai-api.company.com/health
curl https://ai-api.company.com/v1/models
```
### Common Issues
#### GPU not detected
```bash
# Check the NVIDIA driver on Ubuntu 24.04
sudo nvidia-smi
sudo dkms status
# Reinstall if needed
sudo apt purge nvidia-* -y
sudo apt install nvidia-driver-545 -y
sudo reboot
```
#### Service vLLM failed
```bash
# Check logs
journalctl -u vllm-api -f
# Common issues:
# - OOM: reduce gpu_memory_utilization
# - Model not found: check the MLflow path
# - Port conflict: netstat -tulpn | grep 8000
```
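For the OOM case, `gpu_memory_utilization` has to leave room for the model weights plus KV cache. A back-of-the-envelope helper for picking a value — the 20 GB default matches the RTX 4000 Ada in the GEX44, but the margin and the function itself are illustrative, not a vLLM API:

```python
def suggest_gpu_memory_utilization(model_gb, kv_cache_gb,
                                   gpu_total_gb=20.0, margin_gb=1.0):
    """Suggest a vLLM gpu_memory_utilization fraction.

    Returns the share of GPU memory needed for weights + KV cache plus a
    safety margin, capped at 0.95. Raises if the model cannot fit at all.
    """
    needed = model_gb + kv_cache_gb + margin_gb
    if needed > gpu_total_gb:
        raise ValueError(f"model needs {needed:.1f} GB, GPU has {gpu_total_gb:.1f} GB")
    return min(round(needed / gpu_total_gb, 2), 0.95)
```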
#### Inventory generation failed
```bash
# Debug mode
python3 inventories/generate_inventory.py production --debug
# Manual verification
terraform -chdir=terraform/environments/production output -json > outputs.json
jq '.' outputs.json
```
## Security Checklist
### Pre-deployment
- [ ] SSH keys deployed on Ubuntu 24.04
- [ ] Firewall rules configured
- [ ] Secrets in Ansible Vault
- [ ] SSL certificates ready
### Post-deployment
- [ ] SSH access working
- [ ] Services running (systemctl status)
- [ ] Endpoints responding
- [ ] Monitoring active
- [ ] Log aggregation working
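The "Endpoints responding" box can be scripted against the two public endpoints from the Troubleshooting section. A stdlib-only sketch; the base URL is the one used above:

```python
import json
import urllib.request

def check_endpoint(url, timeout=5):
    """Return (ok, parsed_body) for a JSON endpoint; (False, None) on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, json.load(resp)
    except Exception:
        return False, None

if __name__ == "__main__":
    for path in ("/health", "/v1/models"):
        ok, _ = check_endpoint("https://ai-api.company.com" + path)
        print(f"{path}: {'OK' if ok else 'FAILED'}")
```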
## Performance Validation
### Load Testing
```bash
# Development - CPU only
python3 tests/load_test.py --endpoint https://dev-ai-api.internal --concurrent 5
# Staging - 1 GPU
python3 tests/load_test.py --endpoint https://staging-ai-api.company.com --concurrent 20
# Production - 3 GPU
python3 tests/load_test.py --endpoint https://ai-api.company.com --concurrent 100
```
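The implementation of `tests/load_test.py` is not shown; the tokens/sec figure it reports can be aggregated as below. A sketch with assumed result fields (`tokens`, `start_s`, `end_s`) — the key point is dividing by wall-clock time, since concurrent requests overlap:

```python
def throughput_tokens_per_sec(results):
    """Aggregate throughput from per-request load-test results.

    Each result is a dict with 'tokens' generated plus 'start_s'/'end_s'
    timestamps. Concurrent requests overlap, so total tokens are divided
    by wall-clock duration rather than summing per-request rates.
    """
    total_tokens = sum(r["tokens"] for r in results)
    wall_clock = max(r["end_s"] for r in results) - min(r["start_s"] for r in results)
    return total_tokens / wall_clock
```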
### Expected Performance
- **Development**: 1-5 tokens/sec (CPU simulation)
- **Staging**: 80-90 tokens/sec (1x RTX 4000 Ada)
- **Production**: 240-270 tokens/sec (3x RTX 4000 Ada)