Commit 645f7d2208 (parent 534f45fda7): wip

Changed: README.md (12 lines)

@@ -282,26 +282,20 @@ make cost-report

## 📈 Roadmap

-### v1.0 (Current)
+### v1.0
- ✅ Complete Hetzner infrastructure
- ✅ GPU auto-scaling
- ✅ Production-ready monitoring
- ✅ Automated tests

-### v1.1 (Q4 2024)
+### v1.1
- 🔄 Multi-region (Nuremberg + Helsinki)
- 🔄 Kubernetes support (optional)
- 🔄 Advanced cost optimization
- 🔄 Intelligent model caching

-### v2.0 (Q1 2025)
+### v2.0
- 🆕 H100 server support
- 🆕 Edge deployment
- 🆕 Fine-tuning pipeline
- 🆕 Advanced observability

---

📖 **Read the full article**: [Production-Ready AI Infrastructure with Hetzner](article.md)

@@ -1,227 +0,0 @@

# Deployment Guide

## Quick Start

### Prerequisites
- Ubuntu 24.04 on all servers
- Terraform 1.12+
- Ansible 8.0+
- Python 3.12+
- Hetzner Cloud + Robot API access
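
These versions can be verified before running anything; below is a minimal preflight sketch (not part of the repository) that prints each tool's own version output for manual comparison against the list above.

```python
#!/usr/bin/env python3
"""Preflight sketch: report whether the tools required by this guide are installed."""
import shutil
import subprocess
import sys

# Commands that print a version string for each required tool.
REQUIRED = {
    "terraform": ["terraform", "version"],
    "ansible-playbook": ["ansible-playbook", "--version"],
    "python3": ["python3", "--version"],
}


def main() -> int:
    missing = [name for name in REQUIRED if shutil.which(name) is None]
    for name, cmd in REQUIRED.items():
        if name in missing:
            print(f"MISSING: {name}")
            continue
        # Show only the first line of the tool's version output.
        out = subprocess.run(cmd, capture_output=True, text=True).stdout.splitlines()
        print(f"{name}: {out[0] if out else 'unknown'}")
    return 1 if missing else 0


if __name__ == "__main__":
    sys.exit(main())
```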

### Development Deployment

```bash
# 1. Initial setup
git clone <repository>
cd ai-infrastructure-hetzner

# 2. Environment variables
export HCLOUD_TOKEN="your-hetzner-cloud-token"
export HETZNER_ROBOT_USER="your-robot-username"
export HETZNER_ROBOT_PASSWORD="your-robot-password"

# 3. Terraform (development environment)
cd terraform/environments/development
terraform init
terraform plan -var-file="dev.tfvars"
terraform apply -var-file="dev.tfvars"

# 4. Generate the Ansible inventory
cd ../../../inventories
python3 generate_inventory.py development

# 5. Configure the servers
cd ../ansible
ansible-playbook -i inventories/development/hosts.yml site.yml --limit development
```

### File Structure

```
inventories/
├── development/
│   ├── requirements.yml        # Dev business requirements
│   ├── hosts.yml               # Generated automatically
│   └── ssh_config              # Generated SSH config
├── staging/
│   ├── requirements.yml        # Staging business requirements
│   └── ...
├── production/
│   ├── requirements.yml        # Production business requirements
│   └── ...
└── generate_inventory.py       # Inventory generator
```
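
`hosts.yml` is generated from the Terraform outputs of the chosen environment. The real `generate_inventory.py` is not reproduced here; the sketch below only illustrates the idea (parse `terraform output -json`, write an Ansible YAML inventory), and the output key name `server_ips` and the `ansible_user` value are assumptions.

```python
#!/usr/bin/env python3
"""Illustrative sketch of an inventory generator (assumed Terraform output names)."""
import json
import subprocess
import sys

import yaml  # pip install pyyaml


def generate(environment: str) -> None:
    tf_dir = f"terraform/environments/{environment}"
    # Read the Terraform outputs as JSON (same data as `terraform output -json`).
    raw = subprocess.run(
        ["terraform", f"-chdir={tf_dir}", "output", "-json"],
        capture_output=True, text=True, check=True,
    ).stdout
    outputs = json.loads(raw)
    server_ips = outputs["server_ips"]["value"]  # assumed output name: hostname -> IP

    # Build a minimal Ansible inventory: one group per environment.
    inventory = {
        "all": {
            "children": {
                environment: {
                    "hosts": {
                        name: {"ansible_host": ip, "ansible_user": "root"}
                        for name, ip in server_ips.items()
                    }
                }
            }
        }
    }
    with open(f"inventories/{environment}/hosts.yml", "w") as fh:
        yaml.safe_dump(inventory, fh, sort_keys=False)


if __name__ == "__main__":
    generate(sys.argv[1] if len(sys.argv) > 1 else "development")
```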

## Deployment Workflow

### Development → Staging → Production

```mermaid
graph LR
    A[develop branch] --> B[Auto Deploy DEV]
    B --> C[Integration Tests]
    C --> D[main branch]
    D --> E[Manual Deploy STAGING]
    E --> F[Load Tests]
    F --> G[v*.*.* tag]
    G --> H[Manual Deploy PROD]
    H --> I[Health Checks]
```

### Commands per Environment

```bash
# Development (automatic on push to develop)
terraform -chdir=terraform/environments/development apply -auto-approve
python3 inventories/generate_inventory.py development
ansible-playbook -i inventories/development/hosts.yml site.yml

# Staging (manual, from main)
terraform -chdir=terraform/environments/staging apply
python3 inventories/generate_inventory.py staging
ansible-playbook -i inventories/staging/hosts.yml site.yml --check
ansible-playbook -i inventories/staging/hosts.yml site.yml

# Production (manual, from a tag)
terraform -chdir=terraform/environments/production apply
python3 inventories/generate_inventory.py production
ansible-playbook -i inventories/production/hosts.yml site.yml --check
# Manual confirmation required
ansible-playbook -i inventories/production/hosts.yml site.yml
```
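
The same three-step sequence (Terraform apply, inventory generation, playbook run) repeats for every environment, so it lends itself to a small wrapper. The sketch below is a hypothetical helper, not part of the repository; it simply chains the documented commands and adds the dry run and production confirmation described above.

```python
#!/usr/bin/env python3
"""Sketch of a deploy wrapper chaining the per-environment commands above."""
import subprocess
import sys


def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def deploy(environment: str) -> None:
    tf_dir = f"terraform/environments/{environment}"
    inventory = f"inventories/{environment}/hosts.yml"

    # Development is fully automatic; staging and production get a dry run first.
    auto = environment == "development"
    run(["terraform", f"-chdir={tf_dir}", "apply"] + (["-auto-approve"] if auto else []))
    run(["python3", "inventories/generate_inventory.py", environment])
    if not auto:
        run(["ansible-playbook", "-i", inventory, "site.yml", "--check"])
    if environment == "production":
        # Manual confirmation required before touching production.
        if input("Apply to production? [y/N] ").strip().lower() != "y":
            sys.exit("aborted")
    run(["ansible-playbook", "-i", inventory, "site.yml"])


if __name__ == "__main__":
    deploy(sys.argv[1] if len(sys.argv) > 1 else "development")
```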

## Configuration per Environment

### Development
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 1x CX31 (CPU-only)
- **Model**: DialoGPT-small (lightweight)
- **Deployment**: Automatic on develop
- **Tests**: Integration only

### Staging
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 1x GEX44 + 1x CX21
- **Model**: Mixtral-8x7B (quantized)
- **Deployment**: Manual from main
- **Tests**: Integration + Load

### Production
- **OS**: Ubuntu 24.04 LTS
- **Servers**: 3x GEX44 + 2x CX31 + 1x CX21
- **Model**: Mixtral-8x7B (optimized)
- **Deployment**: Manual from a tag + confirmation
- **Tests**: Smoke + Health checks

## Rollback Procedures

### Application Rollback
```bash
# Via MLflow (recommended)
python3 scripts/rollback_model.py --environment production --version previous

# Via Ansible tags
ansible-playbook -i inventories/production/hosts.yml site.yml --tags "vllm" --extra-vars "model_version=v1.2.0"
```
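
`scripts/rollback_model.py` is not reproduced here; the sketch below shows one way an MLflow-based rollback could work, moving the registry's "Production" stage back to the previous model version. The tracking URI and registered model name are assumptions, and it presumes the serving layer loads whatever version is in the Production stage.

```python
#!/usr/bin/env python3
"""Sketch of an MLflow-based rollback (illustrative, not the repository's script)."""
import mlflow
from mlflow.tracking import MlflowClient

MODEL_NAME = "mixtral-8x7b"                              # assumed registry name
mlflow.set_tracking_uri("http://mlflow.internal:5000")   # assumed tracking URI


def rollback_to_previous() -> None:
    client = MlflowClient()
    versions = client.search_model_versions(f"name='{MODEL_NAME}'")
    # Sort newest first; this assumes the newest version is the one currently serving.
    ordered = sorted(versions, key=lambda v: int(v.version), reverse=True)
    if len(ordered) < 2:
        raise SystemExit("nothing to roll back to")
    previous = ordered[1]
    # Promote the previous version and archive the current Production version.
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=previous.version,
        stage="Production",
        archive_existing_versions=True,
    )
    print(f"Production now points at {MODEL_NAME} v{previous.version}")


if __name__ == "__main__":
    rollback_to_previous()
```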

### Infrastructure Rollback
```bash
# Terraform state rollback
terraform -chdir=terraform/environments/production state pull > backup.tfstate
terraform -chdir=terraform/environments/production import <resource> <id>

# Ansible configuration rollback
git checkout <previous-commit> ansible/
ansible-playbook -i inventories/production/hosts.yml site.yml --check
```

## Troubleshooting

### Diagnostic Commands
```bash
# Ubuntu 24.04 system check
ansible all -i inventories/production/hosts.yml -m setup -a "filter=ansible_distribution*"

# Service status
ansible gex44_production -i inventories/production/hosts.yml -m systemd -a "name=vllm-api"

# Application logs
ansible gex44_production -i inventories/production/hosts.yml -m shell -a "journalctl -u vllm-api --since '1 hour ago'"

# GPU status
ansible gex44_production -i inventories/production/hosts.yml -m shell -a "nvidia-smi"

# Test the endpoints
curl https://ai-api.company.com/health
curl https://ai-api.company.com/v1/models
```
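
When you need a pass/fail exit code (for CI or cron) instead of eyeballing curl output, the two endpoint checks can be wrapped in a small script. A sketch, assuming only the `/health` and `/v1/models` routes shown above and an OpenAI-compatible model listing:

```python
#!/usr/bin/env python3
"""Minimal endpoint check for the API above (sketch, standard library only)."""
import json
import sys
import urllib.request

BASE = "https://ai-api.company.com"


def check(path: str) -> bool:
    try:
        with urllib.request.urlopen(f"{BASE}{path}", timeout=10) as resp:
            body = resp.read().decode()
            print(f"{path}: HTTP {resp.status}")
            if path == "/v1/models":
                # List the model ids exposed by the OpenAI-compatible endpoint.
                ids = [m.get("id") for m in json.loads(body).get("data", [])]
                print("  models:", ", ".join(map(str, ids)) or "none")
            return resp.status == 200
    except Exception as exc:  # network error, TLS failure, non-2xx response, ...
        print(f"{path}: FAILED ({exc})")
        return False


if __name__ == "__main__":
    ok = all([check("/health"), check("/v1/models")])
    sys.exit(0 if ok else 1)
```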

### Common Issues

#### GPU not detected
```bash
# Check the NVIDIA driver on Ubuntu 24.04
sudo nvidia-smi
sudo dkms status

# Reinstall if needed
sudo apt purge nvidia-* -y
sudo apt install nvidia-driver-545 -y
sudo reboot
```

#### vLLM service failed
```bash
# Check the logs
journalctl -u vllm-api -f

# Common causes:
# - OOM: lower gpu_memory_utilization
# - Model not found: check the MLflow path
# - Port conflict: netstat -tulpn | grep 8000
```
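
`gpu_memory_utilization` is the vLLM engine parameter that caps how much of each GPU's memory the engine reserves for weights and KV cache (default 0.90). The sketch below lowers it via the vLLM Python API; the model path and values are placeholders, and the actual `vllm-api` service may set the equivalent option in its own launch configuration.

```python
"""Sketch: lowering gpu_memory_utilization to avoid OOM at engine startup."""
from vllm import LLM, SamplingParams

llm = LLM(
    model="/opt/models/mixtral-8x7b",   # placeholder path; here it would come from MLflow
    gpu_memory_utilization=0.80,        # lower than the 0.90 default if the engine OOMs
    max_model_len=8192,                 # a shorter context also shrinks the KV cache
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```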

#### Inventory generation failed
```bash
# Debug mode
python3 inventories/generate_inventory.py production --debug

# Manual verification
terraform -chdir=terraform/environments/production output -json > outputs.json
jq '.' outputs.json
```

## Security Checklist

### Pre-deployment
- [ ] SSH keys deployed on Ubuntu 24.04
- [ ] Firewall rules configured
- [ ] Secrets in Ansible Vault
- [ ] SSL certificates ready

### Post-deployment
- [ ] SSH access working
- [ ] Services running (systemctl status)
- [ ] Endpoints responding
- [ ] Monitoring active
- [ ] Log aggregation working

## Performance Validation

### Load Testing
```bash
# Development - CPU only
python3 tests/load_test.py --endpoint https://dev-ai-api.internal --concurrent 5

# Staging - 1 GPU
python3 tests/load_test.py --endpoint https://staging-ai-api.company.com --concurrent 20

# Production - 3 GPUs
python3 tests/load_test.py --endpoint https://ai-api.company.com --concurrent 100
```
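
`tests/load_test.py` is not reproduced here; the sketch below shows the general shape of such a test (concurrent completion requests, tokens/sec reported at the end). The `/v1/completions` path, payload fields, served model name, and token accounting are assumptions about an OpenAI-compatible API; only the `--endpoint` and `--concurrent` flags mirror the calls above.

```python
#!/usr/bin/env python3
"""Sketch of a load test against an OpenAI-compatible completions endpoint."""
import argparse
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def one_request(endpoint: str) -> int:
    payload = json.dumps({
        "model": "mixtral-8x7b",          # assumed served model name
        "prompt": "Write one sentence about GPUs.",
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        f"{endpoint}/v1/completions", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.loads(resp.read())
    # Count generated tokens as reported by the server.
    return body.get("usage", {}).get("completion_tokens", 0)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--endpoint", required=True)
    parser.add_argument("--concurrent", type=int, default=5)
    parser.add_argument("--requests", type=int, default=50)
    args = parser.parse_args()

    start = time.time()
    with ThreadPoolExecutor(max_workers=args.concurrent) as pool:
        tokens = sum(pool.map(lambda _: one_request(args.endpoint), range(args.requests)))
    elapsed = time.time() - start
    print(f"{args.requests} requests, {tokens} tokens in {elapsed:.1f}s "
          f"-> {tokens / elapsed:.1f} tokens/sec")


if __name__ == "__main__":
    main()
```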

### Expected Performance
- **Development**: 1-5 tokens/sec (CPU simulation)
- **Staging**: 80-90 tokens/sec (1x RTX 4000 Ada)
- **Production**: 240-270 tokens/sec (3x RTX 4000 Ada)

Production throughput is roughly three times the staging figure because requests are spread across three GPU servers instead of one.