diff --git a/README.md b/README.md
index 14eb295..c87b41e 100644
--- a/README.md
+++ b/README.md
@@ -282,26 +282,20 @@ make cost-report
 
 ## 📈 Roadmap
 
-### v1.0 (Actuel)
+### v1.0
 - ✅ Infrastructure Hetzner complète
 - ✅ Auto-scaling GPU
 - ✅ Monitoring production-ready
 - ✅ Tests automatisés
 
-### v1.1 (Q4 2024)
+### v1.1
 - 🔄 Multi-région (Nuremberg + Helsinki)
 - 🔄 Support Kubernetes (optionnel)
 - 🔄 Advanced cost optimization
 - 🔄 Model caching intelligent
 
-### v2.0 (Q1 2025)
+### v2.0
 - 🆕 Support H100 servers
 - 🆕 Edge deployment
 - 🆕 Fine-tuning pipeline
 - 🆕 Advanced observability
-
-
----
-
-
-📖 **Lire l'article complet** : [Infrastructure IA Production-Ready avec Hetzner](article.md)
diff --git a/docs/ARCHITECTURE.md b/docs/01_architecture.md
similarity index 100%
rename from docs/ARCHITECTURE.md
rename to docs/01_architecture.md
diff --git a/docs/DEPLOYMENT.md b/docs/02_deployment.md
similarity index 100%
rename from docs/DEPLOYMENT.md
rename to docs/02_deployment.md
diff --git a/docs/APPLICATIONS.md b/docs/03_applications.md
similarity index 100%
rename from docs/APPLICATIONS.md
rename to docs/03_applications.md
diff --git a/docs/tools.md b/docs/04_tools.md
similarity index 100%
rename from docs/tools.md
rename to docs/04_tools.md
diff --git a/docs/TROUBLESHOOTING.md b/docs/05_troubleshooting.md
similarity index 100%
rename from docs/TROUBLESHOOTING.md
rename to docs/05_troubleshooting.md
diff --git a/docs/deployment.md b/docs/deployment.md
deleted file mode 100644
index f695fe7..0000000
--- a/docs/deployment.md
+++ /dev/null
@@ -1,227 +0,0 @@
-# Deployment Guide
-
-## Quick Start
-
-### Prérequis
-- Ubuntu 24.04 sur tous les serveurs
-- Terraform 1.12+
-- Ansible 8.0+
-- Python 3.12+
-- Accès API Hetzner Cloud + Robot
-
-### Déploiement Development
-
-```bash
-# 1. Configuration initiale
-git clone
-cd ai-infrastructure-hetzner
-
-# 2. Variables d'environnement
-export HCLOUD_TOKEN="your-hetzner-cloud-token"
-export HETZNER_ROBOT_USER="your-robot-username"
-export HETZNER_ROBOT_PASSWORD="your-robot-password"
-
-# 3. Terraform Development
-cd terraform/environments/development
-terraform init
-terraform plan -var-file="dev.tfvars"
-terraform apply -var-file="dev.tfvars"
-
-# 4. Génération inventaire Ansible
-cd ../../../inventories
-python3 generate_inventory.py development
-
-# 5. Configuration serveurs
-cd ../ansible
-ansible-playbook -i inventories/development/hosts.yml site.yml --limit development
-```
-
-### Structure des Fichiers
-
-```
-inventories/
-├── development/
-│   ├── requirements.yml     # Besoins métier dev
-│   ├── hosts.yml            # Généré automatiquement
-│   └── ssh_config           # Config SSH générée
-├── staging/
-│   ├── requirements.yml     # Besoins métier staging
-│   └── ...
-├── production/
-│   ├── requirements.yml     # Besoins métier production
-│   └── ...
-└── generate_inventory.py    # Générateur d'inventaire
-```
-
-## Workflow de Déploiement
-
-### Development → Staging → Production
-
-```mermaid
-graph LR
-    A[develop branch] --> B[Auto Deploy DEV]
-    B --> C[Tests Integration]
-    C --> D[main branch]
-    D --> E[Manual Deploy STAGING]
-    E --> F[Tests Load]
-    F --> G[v*.*.* tag]
-    G --> H[Manual Deploy PROD]
-    H --> I[Health Checks]
-```
-
-### Commandes par Environnement
-
-```bash
-# Development (auto sur push develop)
-terraform -chdir=terraform/environments/development apply -auto-approve
-python3 inventories/generate_inventory.py development
-ansible-playbook -i inventories/development/hosts.yml site.yml
-
-# Staging (manuel sur main)
-terraform -chdir=terraform/environments/staging apply
-python3 inventories/generate_inventory.py staging
-ansible-playbook -i inventories/staging/hosts.yml site.yml --check
-ansible-playbook -i inventories/staging/hosts.yml site.yml
-
-# Production (manuel sur tag)
-terraform -chdir=terraform/environments/production apply
-python3 inventories/generate_inventory.py production
-ansible-playbook -i inventories/production/hosts.yml site.yml --check
-# Confirmation manuelle requise
-ansible-playbook -i inventories/production/hosts.yml site.yml
-```
-
-## Configuration par Environnement
-
-### Development
-- **OS** : Ubuntu 24.04 LTS
-- **Serveurs** : 1x CX31 (CPU-only)
-- **Modèle** : DialoGPT-small (léger)
-- **Déploiement** : Automatique sur develop
-- **Tests** : Integration uniquement
-
-### Staging
-- **OS** : Ubuntu 24.04 LTS
-- **Serveurs** : 1x GEX44 + 1x CX21
-- **Modèle** : Mixtral-8x7B (quantized)
-- **Déploiement** : Manuel sur main
-- **Tests** : Integration + Load
-
-### Production
-- **OS** : Ubuntu 24.04 LTS
-- **Serveurs** : 3x GEX44 + 2x CX31 + 1x CX21
-- **Modèle** : Mixtral-8x7B (optimized)
-- **Déploiement** : Manuel sur tag + confirmation
-- **Tests** : Smoke + Health checks
-
-## Rollback Procedures
-
-### Rollback Application
-```bash
-# Via MLflow (recommandé)
-python3 scripts/rollback_model.py --environment production --version previous
-
-# Via Ansible tags
-ansible-playbook -i inventories/production/hosts.yml site.yml --tags "vllm" --extra-vars "model_version=v1.2.0"
-```
-
-### Rollback Infrastructure
-```bash
-# Terraform state rollback
-terraform -chdir=terraform/environments/production state pull > backup.tfstate
-terraform -chdir=terraform/environments/production import
-
-# Ansible configuration rollback
-git checkout ansible/
-ansible-playbook -i inventories/production/hosts.yml site.yml --check
-```
-
-## Troubleshooting
-
-### Diagnostic Commands
-```bash
-# Vérification système Ubuntu 24.04
-ansible all -i inventories/production/hosts.yml -m setup -a "filter=ansible_distribution*"
-
-# Status services
-ansible gex44_production -i inventories/production/hosts.yml -m systemd -a "name=vllm-api"
-
-# Logs applicatifs
-ansible gex44_production -i inventories/production/hosts.yml -m shell -a "journalctl -u vllm-api --since '1 hour ago'"
-
-# GPU status
-ansible gex44_production -i inventories/production/hosts.yml -m shell -a "nvidia-smi"
-
-# Test endpoints
-curl https://ai-api.company.com/health
-curl https://ai-api.company.com/v1/models
-```
-
-### Common Issues
-
-#### GPU non détecté
-```bash
-# Vérifier driver NVIDIA sur Ubuntu 24.04
-sudo nvidia-smi
-sudo dkms status
-
-# Réinstaller si nécessaire
-sudo apt purge nvidia-* -y
-sudo apt install nvidia-driver-545 -y
-sudo reboot
-```
-
-#### Service vLLM failed
-```bash
-# Check logs
-journalctl -u vllm-api -f
-
-# Common issues:
-# - OOM: Réduire gpu_memory_utilization
-# - Model not found: Vérifier path MLflow
-# - Port conflict: Netstat -tulpn | grep 8000
-```
-
-#### Inventory generation failed
-```bash
-# Debug mode
-python3 inventories/generate_inventory.py production --debug
-
-# Manual verification
-terraform -chdir=terraform/environments/production output -json > outputs.json
-cat outputs.json | jq '.'
-```
-
-## Security Checklist
-
-### Pre-deployment
-- [ ] SSH keys deployed sur Ubuntu 24.04
-- [ ] Firewall rules configured
-- [ ] Secrets in Ansible Vault
-- [ ] SSL certificates ready
-
-### Post-deployment
-- [ ] SSH access working
-- [ ] Services running (systemctl status)
-- [ ] Endpoints responding
-- [ ] Monitoring active
-- [ ] Log aggregation working
-
-## Performance Validation
-
-### Load Testing
-```bash
-# Development - CPU only
-python3 tests/load_test.py --endpoint https://dev-ai-api.internal --concurrent 5
-
-# Staging - 1 GPU
-python3 tests/load_test.py --endpoint https://staging-ai-api.company.com --concurrent 20
-
-# Production - 3 GPU
-python3 tests/load_test.py --endpoint https://ai-api.company.com --concurrent 100
-```
-
-### Expected Performance
-- **Development** : 1-5 tokens/sec (CPU simulation)
-- **Staging** : 80-90 tokens/sec (1x RTX 4000 Ada)
-- **Production** : 240-270 tokens/sec (3x RTX 4000 Ada)
\ No newline at end of file