diff --git a/README.md b/README.md index 945a3ee..59a889b 100644 --- a/README.md +++ b/README.md @@ -229,11 +229,21 @@ k6 run tests/load/k6_inference_test.js ## 📚 Documentation -- [**Architecture**](docs/01_architecture.md) : Diagrammes et décisions -- [**Deployment**](docs/02_deployment.md) : Guide étape par étape -- [**Troubleshooting**](docs/05_troubleshooting.md) : Solutions aux problèmes courants -- [**Applications**](docs/03_applications.md) : Guide des applications -- [**Tools**](docs/04_tools.md) : Outils disponibles +> 🇫🇷 **Documentation complète en français** - [**INDEX Complet**](docs/INDEX.md) + +### Guides Principaux +- [**🏗️ Architecture**](docs/01_architecture.md) : Architecture complète et composants +- [**⚡ Déploiement**](docs/02_deployment.md) : Guide étape par étape +- [**🔧 Dépannage**](docs/05_troubleshooting.md) : Résolution de problèmes +- [**👥 Applications**](docs/03_applications.md) : Organisation multi-projets +- [**🛠️ Outils**](docs/04_tools.md) : Stack technologique +- [**🔒 VPN Setup**](docs/vpn-setup.md) : Configuration accès externe + +### Navigation Rapide +- **[📖 INDEX Complet](docs/INDEX.md)** - Navigation thématique et index par mots-clés +- **Démarrage Rapide** : [Quick Start](#-quick-start-5-minutes) + [Déploiement](docs/02_deployment.md) +- **Architecture** : [Vue d'ensemble](docs/01_architecture.md#architecture-de-haut-niveau) + [Coûts](docs/01_architecture.md#répartition-des-coûts) +- **Problèmes** : [Guide de dépannage](docs/05_troubleshooting.md) + [Diagnostics](docs/05_troubleshooting.md#commandes-de-diagnostic) ## 📈 Roadmap diff --git a/ansible/group_vars/all/main.yml b/ansible/group_vars/all/main.yml index 472762b..b1813aa 100644 --- a/ansible/group_vars/all/main.yml +++ b/ansible/group_vars/all/main.yml @@ -92,6 +92,13 @@ firewall_rules: proto: tcp src: "{{ private_network_cidr }}" comment: "Node exporter from private network" + - rule: allow + port: "{{ wireguard_port }}" + proto: 
udp + comment: "WireGuard VPN" + - rule: allow + from: "{{ wireguard_network }}" + comment: "Allow traffic from VPN clients" # Logging configuration rsyslog_enabled: true @@ -120,6 +127,26 @@ net_core_somaxconn: 32768 net_core_netdev_max_backlog: 5000 tcp_max_syn_backlog: 8192 +# WireGuard VPN configuration +wireguard_enabled: true +wireguard_port: 51820 +wireguard_interface: wg0 +wireguard_network: "10.0.10.0/24" +wireguard_server_ip: "10.0.10.1" + +# External client networks allowed to access via VPN +wireguard_clients: + - name: "external_company" + ip: "10.0.10.10" + allowed_networks: + - "10.0.1.0/24" # GEX44 GPU servers + - "10.0.2.0/24" # Cloud services + public_key: "{{ external_company_public_key | default('') }}" + +# WireGuard server configuration +wireguard_server_private_key: "{{ wireguard_server_private_key | default('') }}" +wireguard_server_public_key: "{{ wireguard_server_public_key | default('') }}" + # Memory tuning (for ML workloads) transparent_hugepage: "madvise" oom_kill_allocating_task: 1 diff --git a/ansible/inventory/production.yml b/ansible/inventory/production.yml index 13efbe1..2c82c37 100644 --- a/ansible/inventory/production.yml +++ b/ansible/inventory/production.yml @@ -129,4 +129,14 @@ all: min_gex44_count: 1 max_gex44_count: 10 scale_up_threshold: 0.8 - scale_down_threshold: 0.3 \ No newline at end of file + scale_down_threshold: 0.3 + + # VPN Gateway (runs on load balancer or dedicated server) + vpn_gateway: + vars: + wireguard_enabled: true + wireguard_server_role: true + hosts: + load-balancer: + wireguard_gateway: true + wireguard_public_endpoint: "{{ load_balancer_public_ip }}" \ No newline at end of file diff --git a/ansible/playbooks/vpn-setup.yml b/ansible/playbooks/vpn-setup.yml new file mode 100644 index 0000000..46265e5 --- /dev/null +++ b/ansible/playbooks/vpn-setup.yml @@ -0,0 +1,76 @@ +--- +# VPN Setup Playbook +# Sets up WireGuard VPN for external company access +- name: Configure WireGuard VPN Gateway + hosts: 
vpn_gateway + become: yes + vars: + # Override defaults for VPN gateway + wireguard_enabled: true + + pre_tasks: + - name: Verify VPN gateway configuration + debug: + msg: | + Setting up WireGuard VPN on {{ inventory_hostname }} + Public endpoint: {{ wireguard_public_endpoint | default('NOT_SET') }} + Network: {{ wireguard_network }} + Port: {{ wireguard_port }} + + - name: Ensure public IP is configured + fail: + msg: "wireguard_public_endpoint must be set for VPN gateway" + when: wireguard_public_endpoint is not defined or wireguard_public_endpoint == '' + + roles: + - role: wireguard + when: wireguard_enabled | default(false) + + post_tasks: + - name: Display client configuration instructions + debug: + msg: | + WireGuard VPN setup complete! + + Server public key: {{ wireguard_server_public_key }} + Server endpoint: {{ wireguard_public_endpoint }}:{{ wireguard_port }} + + Client configurations have been generated in: + /etc/wireguard/clients/ + + Next steps: + 1. Share server public key with external company + 2. Get external company's public key + 3. Update inventory with external_company_public_key variable + 4. 
Re-run this playbook to update server configuration + when: wireguard_server_public_key is defined + + - name: Display routing configuration + debug: + msg: | + VPN Routing Configuration: + - VPN Network: {{ wireguard_network }} + - GEX44 GPU Access: {{ gex44_subnet }} + - Cloud Services Access: {{ cloud_subnet }} + - Private Network: {{ private_network_cidr }} + + The external company will be able to access: + {% for client in wireguard_clients %} + - {{ client.name }}: {{ client.allowed_networks | join(', ') }} + {% endfor %} + +- name: Update firewall rules on all servers + hosts: all + become: yes + tasks: + - name: Allow VPN traffic to reach internal services + ufw: + rule: allow + from_ip: "{{ wireguard_network }}" + comment: "Allow VPN clients access" + when: firewall_enabled | default(true) and wireguard_enabled | default(false) + + - name: Reload firewall + ufw: + state: reloaded + when: firewall_enabled | default(true) \ No newline at end of file diff --git a/ansible/roles/wireguard/handlers/main.yml b/ansible/roles/wireguard/handlers/main.yml new file mode 100644 index 0000000..740c403 --- /dev/null +++ b/ansible/roles/wireguard/handlers/main.yml @@ -0,0 +1,19 @@ +--- +# WireGuard handlers +- name: restart wireguard + systemd: + name: "wg-quick@{{ wireguard_interface }}" + state: restarted + become: yes + listen: restart wireguard + +- name: reload wireguard + shell: "wg-quick down {{ wireguard_interface }} && wg-quick up {{ wireguard_interface }}" + become: yes + listen: reload wireguard + +- name: save iptables + shell: iptables-save > /etc/iptables/rules.v4 + become: yes + listen: save iptables + when: ansible_os_family == "Debian" \ No newline at end of file diff --git a/ansible/roles/wireguard/tasks/main.yml b/ansible/roles/wireguard/tasks/main.yml new file mode 100644 index 0000000..2600816 --- /dev/null +++ b/ansible/roles/wireguard/tasks/main.yml @@ -0,0 +1,124 @@ +--- +# WireGuard VPN Setup +- name: Install WireGuard + apt: + name: + - wireguard 
+ - wireguard-tools + state: present + update_cache: yes + become: yes + +- name: Enable IP forwarding + sysctl: + name: net.ipv4.ip_forward + value: '1' + state: present + reload: yes + become: yes + +- name: Enable IP forwarding for IPv6 + sysctl: + name: net.ipv6.conf.all.forwarding + value: '1' + state: present + reload: yes + become: yes + when: wireguard_ipv6_enabled | default(false) + +- name: Generate WireGuard server private key + shell: wg genkey + register: wireguard_server_private_key_generated + when: wireguard_server_private_key == '' + no_log: true + +- name: Generate WireGuard server public key + shell: echo "{{ wireguard_server_private_key_generated.stdout | default(wireguard_server_private_key) }}" | wg pubkey + register: wireguard_server_public_key_generated + when: wireguard_server_public_key == '' or wireguard_server_private_key == '' + +- name: Set WireGuard server keys facts + set_fact: + wireguard_server_private_key: "{{ wireguard_server_private_key_generated.stdout | default(wireguard_server_private_key) }}" + wireguard_server_public_key: "{{ wireguard_server_public_key_generated.stdout | default(wireguard_server_public_key) }}" + +- name: Create WireGuard configuration directory + file: + path: /etc/wireguard + state: directory + mode: '0700' + owner: root + group: root + become: yes + +- name: Generate WireGuard server configuration + template: + src: wg0.conf.j2 + dest: "/etc/wireguard/{{ wireguard_interface }}.conf" + mode: '0600' + owner: root + group: root + become: yes + notify: restart wireguard + +- name: Enable and start WireGuard service + systemd: + name: "wg-quick@{{ wireguard_interface }}" + enabled: yes + state: started + daemon_reload: yes + become: yes + +- name: Configure firewall rules for WireGuard + ufw: + rule: "{{ item.rule }}" + port: "{{ item.port | default(omit) }}" + proto: "{{ item.proto | default(omit) }}" + from_ip: "{{ item.from | default(omit) }}" + comment: "{{ item.comment | default(omit) }}" + become: yes 
+ loop: + - rule: allow + port: "{{ wireguard_port }}" + proto: udp + comment: "WireGuard VPN" + - rule: allow + from: "{{ wireguard_network }}" + comment: "Allow traffic from VPN clients" + when: firewall_enabled | default(true) + +- name: Configure NAT rules for WireGuard + iptables: + table: nat + chain: POSTROUTING + source: "{{ wireguard_network }}" + out_interface: "{{ ansible_default_ipv4.interface }}" + jump: MASQUERADE + comment: "WireGuard NAT" + become: yes + notify: save iptables + +- name: Display WireGuard server public key + debug: + msg: "WireGuard server public key: {{ wireguard_server_public_key }}" + when: wireguard_server_public_key is defined + +- name: Create client configuration directory + file: + path: /etc/wireguard/clients + state: directory + mode: '0700' + owner: root + group: root + become: yes + +- name: Generate client configurations + template: + src: client.conf.j2 + dest: "/etc/wireguard/clients/{{ item.name }}.conf" + mode: '0600' + owner: root + group: root + become: yes + loop: "{{ wireguard_clients }}" + when: wireguard_clients is defined \ No newline at end of file diff --git a/ansible/roles/wireguard/templates/client.conf.j2 b/ansible/roles/wireguard/templates/client.conf.j2 new file mode 100644 index 0000000..da4ce12 --- /dev/null +++ b/ansible/roles/wireguard/templates/client.conf.j2 @@ -0,0 +1,27 @@ +# WireGuard Client Configuration for {{ item.name }} +# Generated by Ansible - Do not edit manually + +[Interface] +# Client private key (generate with: wg genkey) +PrivateKey = CLIENT_PRIVATE_KEY_HERE +Address = {{ item.ip }}/32 +DNS = 8.8.8.8, 8.8.4.4 + +[Peer] +# Server public key +PublicKey = {{ wireguard_server_public_key }} +# Server endpoint (replace with actual public IP) +Endpoint = YOUR_SERVER_PUBLIC_IP:{{ wireguard_port }} +# Networks accessible through VPN +AllowedIPs = {% if item.allowed_networks is defined %}{{ item.allowed_networks | join(', ') }}{% else %}{{ private_network_cidr }}{% endif %} + +# Keep 
connection alive +PersistentKeepalive = 25 + +# Instructions for client setup: +# 1. Generate client key pair: +# wg genkey | tee private.key | wg pubkey > public.key +# 2. Replace CLIENT_PRIVATE_KEY_HERE with contents of private.key +# 3. Replace YOUR_SERVER_PUBLIC_IP with server's public IP address +# 4. Add the public key to server configuration +# 5. Import this config to WireGuard client \ No newline at end of file diff --git a/ansible/roles/wireguard/templates/wg0.conf.j2 b/ansible/roles/wireguard/templates/wg0.conf.j2 new file mode 100644 index 0000000..07d50cd --- /dev/null +++ b/ansible/roles/wireguard/templates/wg0.conf.j2 @@ -0,0 +1,26 @@ +# WireGuard Server Configuration +# Generated by Ansible - Do not edit manually +[Interface] +PrivateKey = {{ wireguard_server_private_key }} +Address = {{ wireguard_server_ip }}/{{ wireguard_network.split('/')[1] }} +ListenPort = {{ wireguard_port }} + +# Enable packet forwarding +PostUp = iptables -A FORWARD -i {{ wireguard_interface }} -j ACCEPT; iptables -A FORWARD -o {{ wireguard_interface }} -j ACCEPT; iptables -t nat -A POSTROUTING -o {{ ansible_default_ipv4.interface }} -j MASQUERADE +PostDown = iptables -D FORWARD -i {{ wireguard_interface }} -j ACCEPT; iptables -D FORWARD -o {{ wireguard_interface }} -j ACCEPT; iptables -t nat -D POSTROUTING -o {{ ansible_default_ipv4.interface }} -j MASQUERADE + +{% if wireguard_clients is defined %} +{% for client in wireguard_clients %} +# Client: {{ client.name }} +[Peer] +PublicKey = {{ client.public_key }} +AllowedIPs = {{ client.ip }}/32 +{% if client.allowed_networks is defined %} +# Routes for client access to internal networks +{% for network in client.allowed_networks %} +# Access to {{ network }} +{% endfor %} +{% endif %} + +{% endfor %} +{% endif %} \ No newline at end of file diff --git a/ansible/roles/wireguard/vars/main.yml b/ansible/roles/wireguard/vars/main.yml new file mode 100644 index 0000000..cc3340d --- /dev/null +++ 
b/ansible/roles/wireguard/vars/main.yml @@ -0,0 +1,16 @@ +--- +# WireGuard default variables +wireguard_interface: "wg0" +wireguard_port: 51820 +wireguard_network: "10.0.10.0/24" +wireguard_server_ip: "10.0.10.1" +wireguard_ipv6_enabled: false + +# Package dependencies +wireguard_packages: + - wireguard + - wireguard-tools + - iptables-persistent + +# Firewall integration +wireguard_firewall_enabled: true \ No newline at end of file diff --git a/docs/01_architecture.md b/docs/01_architecture.md index ba1db59..78dc232 100644 --- a/docs/01_architecture.md +++ b/docs/01_architecture.md @@ -1,29 +1,29 @@ -# Infrastructure Architecture +# Architecture de l'Infrastructure -## Overview +## Aperçu -This document describes the architecture of the AI Infrastructure running on Hetzner Cloud and dedicated servers. The system is designed for high-performance AI inference with cost optimization, automatic scaling, and production-grade reliability. +Ce document décrit l'architecture de l'Infrastructure IA fonctionnant sur Hetzner Cloud et serveurs dédiés. Le système est conçu pour l'inférence IA haute performance avec optimisation des coûts, mise à l'échelle automatique et fiabilité de niveau production. -## High-Level Architecture +## Architecture de Haut Niveau ```mermaid graph TB Internet[Internet] - CF[CloudFlare Proxy
Optional CDN/DDoS protection] + CF[CloudFlare Proxy
Protection CDN/DDoS optionnelle] subgraph Cloud[Hetzner Cloud] - LB[HAProxy LB
cx31 - 8CPU/32GB
€22.68/month] - GW[API Gateway
cx31 - 8CPU/32GB
€22.68/month] - MON[Monitoring
cx21 - 4CPU/16GB
€11.76/month] + LB[HAProxy LB
cx31 - 8CPU/32GB
22,68€/mois] + GW[API Gateway
cx31 - 8CPU/32GB
22,68€/mois] + MON[Monitoring
cx21 - 4CPU/16GB
11,76€/mois] end - subgraph Dedicated[Hetzner Dedicated Servers] - GEX1[GEX44 #1
vLLM API
Mixtral-8x7B
€184/month] - GEX2[GEX44 #2
vLLM API
Llama-70B
€184/month] - GEX3[GEX44 #3
vLLM API
CodeLlama
€184/month] + subgraph Dedicated[Serveurs Dédiés Hetzner] + GEX1[GEX44 #1
API vLLM
Mixtral-8x7B
184€/mois] + GEX2[GEX44 #2
API vLLM
Llama-70B
184€/mois] + GEX3[GEX44 #3
API vLLM
CodeLlama
184€/mois] end - PrivateNet[Hetzner Private Network
10.0.0.0/16 - VXLAN overlay] + PrivateNet[Réseau Privé Hetzner
10.0.0.0/16 - Overlay VXLAN] Internet --> CF CF --> LB @@ -45,21 +45,21 @@ graph TB MON -.-> PrivateNet ``` -## Component Details +## Détails des Composants -### 1. Load Balancer (HAProxy) +### 1. Répartiteur de Charge (HAProxy) -**Hardware**: Hetzner Cloud cx31 (8 vCPU, 32GB RAM) -**Location**: Private IP 10.0.2.10 -**Role**: Traffic distribution, SSL termination, health checks +**Matériel**: Hetzner Cloud cx31 (8 vCPU, 32GB RAM) +**Localisation**: IP privée 10.0.2.10 +**Rôle**: Distribution du trafic, terminaison SSL, contrôles de santé -**Features**: -- Round-robin load balancing with health checks -- SSL/TLS termination with automatic certificate renewal -- Statistics dashboard (port 8404) -- Request routing based on URL patterns -- Rate limiting and DDoS protection -- Prometheus metrics export +**Fonctionnalités**: +- Répartition de charge round-robin avec contrôles de santé +- Terminaison SSL/TLS avec renouvellement automatique des certificats +- Tableau de bord statistiques (port 8404) +- Routage des requêtes basé sur les patterns d'URL +- Limitation de débit et protection DDoS +- Export des métriques Prometheus **Configuration**: ```haproxy @@ -71,338 +71,338 @@ backend vllm_backend server gex44-3 10.0.1.12:8000 check ``` -### 2. API Gateway (Nginx) +### 2. 
Passerelle API (Nginx) -**Hardware**: Hetzner Cloud cx31 (8 vCPU, 32GB RAM) -**Location**: Private IP 10.0.2.11 -**Role**: API management, authentication, rate limiting +**Matériel**: Hetzner Cloud cx31 (8 vCPU, 32GB RAM) +**Localisation**: IP privée 10.0.2.11 +**Rôle**: Gestion API, authentification, limitation de débit -**Features**: -- Request/response transformation -- API versioning and routing -- Authentication and authorization -- Request/response logging -- API analytics and metrics -- Caching for frequently requested data +**Fonctionnalités**: +- Transformation requête/réponse +- Versioning et routage API +- Authentification et autorisation +- Journalisation requête/réponse +- Analytics et métriques API +- Mise en cache des données fréquemment demandées -### 3. GPU Servers (GEX44) +### 3. Serveurs GPU (GEX44) -**Hardware per server**: -- CPU: Intel i5-13500 (12 cores, 20 threads) +**Matériel par serveur**: +- CPU: Intel i5-13500 (12 cœurs, 20 threads) - GPU: NVIDIA RTX 4000 Ada Generation (20GB VRAM) - RAM: 64GB DDR4 -- Storage: 2x 1.92TB NVMe SSD (RAID 1) -- Network: 1 Gbit/s +- Stockage: 2x 1.92TB NVMe SSD (RAID 1) +- Réseau: 1 Gbit/s -**Software Stack**: +**Stack Logiciel**: - OS: Ubuntu 22.04 LTS - CUDA: 12.3 - Python: 3.11 - vLLM: 0.3.0+ - Docker: 24.0.5 -**Network Configuration**: -- Private IPs: 10.0.1.10, 10.0.1.11, 10.0.1.12 -- vLLM API: Port 8000 -- Metrics: Port 9835 (nvidia-smi-exporter) -- Node metrics: Port 9100 (node-exporter) +**Configuration Réseau**: +- IPs privées: 10.0.1.10, 10.0.1.11, 10.0.1.12 +- API vLLM: Port 8000 +- Métriques: Port 9835 (nvidia-smi-exporter) +- Métriques nœud: Port 9100 (node-exporter) -### 4. Monitoring Stack +### 4. 
Stack de Monitoring -**Hardware**: Hetzner Cloud cx21 (4 vCPU, 16GB RAM) -**Location**: Private IP 10.0.2.12 +**Matériel**: Hetzner Cloud cx21 (4 vCPU, 16GB RAM) +**Localisation**: IP privée 10.0.2.12 -**Components**: -- **Prometheus**: Metrics collection and storage -- **Grafana**: Visualization and dashboards -- **AlertManager**: Alert routing and notification -- **Node Exporter**: System metrics -- **nvidia-smi-exporter**: GPU metrics +**Composants**: +- **Prometheus**: Collection et stockage des métriques +- **Grafana**: Visualisation et tableaux de bord +- **AlertManager**: Routage et notification des alertes +- **Node Exporter**: Métriques système +- **nvidia-smi-exporter**: Métriques GPU -## Network Architecture +## Architecture Réseau -### Private Network +### Réseau Privé **CIDR**: 10.0.0.0/16 -**Subnets**: -- Cloud servers: 10.0.2.0/24 -- GEX44 servers: 10.0.1.0/24 +**Sous-réseaux**: +- Serveurs cloud: 10.0.2.0/24 +- Serveurs GEX44: 10.0.1.0/24 -### Security Groups +### Groupes de Sécurité -1. **SSH Access**: Port 22 (restricted IPs) +1. **Accès SSH**: Port 22 (IPs restreintes) 2. **HTTP/HTTPS**: Ports 80, 443 (public) -3. **API Access**: Port 8000 (internal only) -4. **Monitoring**: Ports 3000, 9090 (restricted) -5. **Internal Communication**: All ports within private network +3. **Accès API**: Port 8000 (interne uniquement) +4. **Monitoring**: Ports 3000, 9090 (restreint) +5. 
**Communication Interne**: Tous ports dans le réseau privé -### Firewall Rules +### Règles de Pare-feu ```yaml -# Public access -- HTTP (80) from 0.0.0.0/0 -- HTTPS (443) from 0.0.0.0/0 +# Accès public +- HTTP (80) depuis 0.0.0.0/0 +- HTTPS (443) depuis 0.0.0.0/0 -# Management access (restrict to office IPs) -- SSH (22) from office_cidr -- Grafana (3000) from office_cidr -- Prometheus (9090) from office_cidr +# Accès de gestion (restreindre aux IPs bureau) +- SSH (22) depuis office_cidr +- Grafana (3000) depuis office_cidr +- Prometheus (9090) depuis office_cidr -# Internal communication -- All traffic within 10.0.0.0/16 +# Communication interne +- Tout trafic dans 10.0.0.0/16 ``` -## Data Flow +## Flux de Données -### Inference Request Flow +### Flux de Requête d'Inférence -1. **Client** → **Load Balancer** (HAProxy) - - SSL termination - - Request routing - - Health check validation +1. **Client** → **Répartiteur de Charge** (HAProxy) + - Terminaison SSL + - Routage des requêtes + - Validation des contrôles de santé -2. **Load Balancer** → **GPU Server** (vLLM) - - HTTP request to /v1/chat/completions - - Model selection and processing - - Response generation +2. **Répartiteur de Charge** → **Serveur GPU** (vLLM) + - Requête HTTP vers /v1/chat/completions + - Sélection et traitement du modèle + - Génération de réponse -3. **GPU Server** → **Load Balancer** → **Client** - - JSON response with completion - - Usage metrics included +3. **Serveur GPU** → **Répartiteur de Charge** → **Client** + - Réponse JSON avec complétion + - Métriques d'utilisation incluses -### Monitoring Data Flow +### Flux de Données de Monitoring -1. **GPU Servers** → **Prometheus** - - nvidia-smi metrics (GPU utilization, temperature, memory) - - vLLM metrics (requests, latency, tokens) - - System metrics (CPU, memory, disk) +1. 
**Serveurs GPU** → **Prometheus** + - Métriques nvidia-smi (utilisation GPU, température, mémoire) + - Métriques vLLM (requêtes, latence, tokens) + - Métriques système (CPU, mémoire, disque) -2. **Load Balancer** → **Prometheus** - - HAProxy metrics (requests, response times, errors) - - Backend server health status +2. **Répartiteur de Charge** → **Prometheus** + - Métriques HAProxy (requêtes, temps de réponse, erreurs) + - État de santé des serveurs backend 3. **Prometheus** → **Grafana** - - Time-series data visualization - - Dashboard rendering - - Alert evaluation + - Visualisation des données de séries temporelles + - Rendu des tableaux de bord + - Évaluation des alertes -## Storage Architecture +## Architecture de Stockage -### Model Storage +### Stockage des Modèles -**Location**: Each GEX44 server -**Path**: `/opt/vllm/models/` -**Size**: ~100GB per model +**Localisation**: Chaque serveur GEX44 +**Chemin**: `/opt/vllm/models/` +**Taille**: ~100GB par modèle -**Models Stored**: +**Modèles Stockés**: - Mixtral-8x7B-Instruct (87GB) -- Llama-2-70B-Chat (140GB, quantized) +- Llama-2-70B-Chat (140GB, quantifié) - CodeLlama-34B (68GB) -### Shared Storage +### Stockage Partagé -**Type**: Hetzner Cloud Volume -**Size**: 500GB -**Mount**: `/mnt/shared` -**Purpose**: Configuration, logs, backups +**Type**: Volume Hetzner Cloud +**Taille**: 500GB +**Montage**: `/mnt/shared` +**Objectif**: Configuration, journaux, sauvegardes -### Backup Strategy +### Stratégie de Sauvegarde -**What is backed up**: -- Terraform state files -- Ansible configurations -- Grafana dashboards -- Prometheus configuration -- Application logs (last 7 days) +**Ce qui est sauvegardé**: +- Fichiers d'état Terraform +- Configurations Ansible +- Tableaux de bord Grafana +- Configuration Prometheus +- Journaux d'application (7 derniers jours) -**What is NOT backed up**: -- Model files (re-downloadable) -- Prometheus metrics (30-day retention) -- Large log files (rotated 
automatically) +**Ce qui n'est PAS sauvegardé**: +- Fichiers de modèles (re-téléchargeables) +- Métriques Prometheus (rétention 30 jours) +- Gros fichiers de journaux (rotation automatique) -## Scaling Architecture +## Architecture de Mise à l'Échelle -### Horizontal Scaling +### Mise à l'Échelle Horizontale -**Auto-scaling triggers**: -- GPU utilization > 80% for 10 minutes → Scale up -- GPU utilization < 30% for 30 minutes → Scale down -- Queue depth > 50 requests → Immediate scale up +**Déclencheurs d'auto-scaling**: +- Utilisation GPU > 80% pendant 10 minutes → Monter en échelle +- Utilisation GPU < 30% pendant 30 minutes → Réduire l'échelle +- Profondeur de file > 50 requêtes → Montée immédiate en échelle -**Scaling process**: -1. Monitor metrics via Prometheus -2. Autoscaler service evaluates conditions -3. Order new GEX44 via Robot API -4. Ansible configures new server -5. Add to load balancer pool +**Processus de mise à l'échelle**: +1. Surveiller les métriques via Prometheus +2. Le service d'autoscaler évalue les conditions +3. Commande nouveau GEX44 via API Robot +4. Ansible configure le nouveau serveur +5. 
Ajout au pool du répartiteur de charge -### Vertical Scaling +### Mise à l'Échelle Verticale -**Model optimization**: -- Quantization (AWQ, GPTQ) -- Tensor parallelism for large models -- Memory optimization techniques +**Optimisation des modèles**: +- Quantification (AWQ, GPTQ) +- Parallélisme tensoriel pour gros modèles +- Techniques d'optimisation mémoire -## High Availability +## Haute Disponibilité -### Redundancy +### Redondance -- **Load Balancer**: Single point (acceptable for cost/benefit) -- **GPU Servers**: 3 servers minimum (N+1 redundancy) -- **Monitoring**: Single instance with backup configuration +- **Répartiteur de Charge**: Point unique (acceptable pour coût/bénéfice) +- **Serveurs GPU**: 3 serveurs minimum (redondance N+1) +- **Monitoring**: Instance unique avec configuration de sauvegarde -### Failure Scenarios +### Scénarios de Panne -1. **Single GPU server failure**: - - Automatic removal from load balancer - - 66% capacity maintained - - Automatic replacement order +1. **Panne d'un serveur GPU**: + - Suppression automatique du répartiteur de charge + - 66% de capacité maintenue + - Commande de remplacement automatique -2. **Load balancer failure**: - - Manual failover to backup - - DNS change required - - ~10 minute downtime +2. **Panne du répartiteur de charge**: + - Basculement manuel vers sauvegarde + - Changement DNS requis + - ~10 minutes d'arrêt -3. **Network partition**: - - Private network redundancy - - Automatic retry logic - - Graceful degradation +3. 
**Partition réseau**: + - Redondance du réseau privé + - Logique de retry automatique + - Dégradation gracieuse -## Security Architecture +## Architecture de Sécurité -### Network Security +### Sécurité Réseau -- Private network isolation -- Firewall rules at multiple levels -- No direct internet access to GPU servers -- VPN for administrative access +- Isolation du réseau privé +- Règles de pare-feu à plusieurs niveaux +- Pas d'accès internet direct aux serveurs GPU +- VPN pour accès administratif -### Application Security +### Sécurité Application -- API rate limiting -- Request validation -- Input sanitization -- Output filtering +- Limitation de débit API +- Validation des requêtes +- Sanitisation des entrées +- Filtrage des sorties -### Infrastructure Security +### Sécurité Infrastructure -- SSH key-based authentication -- Regular security updates -- Intrusion detection -- Log monitoring +- Authentification basée sur clés SSH +- Mises à jour de sécurité régulières +- Détection d'intrusion +- Surveillance des journaux -## Performance Characteristics +## Caractéristiques de Performance -### Latency +### Latence -- **P50**: <1.5 seconds -- **P95**: <3 seconds -- **P99**: <5 seconds +- **P50**: <1.5 secondes +- **P95**: <3 secondes +- **P99**: <5 secondes -### Throughput +### Débit -- **Total**: ~255 tokens/second (3 servers) -- **Per server**: ~85 tokens/second -- **Max RPS**: ~50 requests/second +- **Total**: ~255 tokens/seconde (3 serveurs) +- **Par serveur**: ~85 tokens/seconde +- **RPS Max**: ~50 requêtes/seconde -### Resource Utilization +### Utilisation des Ressources -- **GPU**: 65-75% average utilization -- **CPU**: 30-40% average utilization -- **Memory**: 70-80% utilization (model loading) -- **Network**: <100 Mbps typical +- **GPU**: 65-75% utilisation moyenne +- **CPU**: 30-40% utilisation moyenne +- **Mémoire**: 70-80% utilisation (chargement modèle) +- **Réseau**: <100 Mbps typique -## Cost Breakdown +## 
Répartition des Coûts -### Monthly Costs (EUR) +### Coûts Mensuels (EUR) -| Component | Quantity | Unit Cost | Total | -|-----------|----------|-----------|--------| -| GEX44 Servers | 3 | €184 | €552 | -| cx31 (LB) | 1 | €22.68 | €22.68 | -| cx31 (API GW) | 1 | €22.68 | €22.68 | -| cx21 (Monitor) | 1 | €11.76 | €11.76 | -| Storage | 500GB | €0.05/GB | €25 | -| **Total** | | | **€634.12** | +| Composant | Quantité | Coût Unitaire | Total | +|-----------|----------|---------------|--------| +| Serveurs GEX44 | 3 | 184€ | 552€ | +| cx31 (LB) | 1 | 22,68€ | 22,68€ | +| cx31 (API GW) | 1 | 22,68€ | 22,68€ | +| cx21 (Monitor) | 1 | 11,76€ | 11,76€ | +| Stockage | 500GB | 0,05€/GB | 25€ | +| **Total** | | | **634,12€** | -### Cost per Request +### Coût par Requête -At 100,000 requests/day: -- Monthly requests: 3,000,000 -- Cost per request: €0.0002 -- Cost per token: €0.0000025 +À 100 000 requêtes/jour: +- Requêtes mensuelles: 3 000 000 +- Coût par requête: 0,0002€ +- Coût par token: 0,0000025€ -## Disaster Recovery +## Reprise après Sinistre -### Backup Procedures +### Procédures de Sauvegarde -1. **Daily**: Configuration backup to cloud storage -2. **Weekly**: Full system state backup -3. **Monthly**: Disaster recovery test +1. **Quotidien**: Sauvegarde configuration vers stockage cloud +2. **Hebdomadaire**: Sauvegarde complète état système +3. **Mensuel**: Test de reprise après sinistre -### Recovery Procedures +### Procédures de Récupération -1. **Infrastructure**: Terraform state restoration -2. **Configuration**: Ansible playbook execution -3. **Models**: Re-download from HuggingFace -4. **Data**: Restore from backup storage +1. **Infrastructure**: Restauration état Terraform +2. **Configuration**: Exécution playbooks Ansible +3. **Modèles**: Re-téléchargement depuis HuggingFace +4. 
**Données**: Restauration depuis stockage de sauvegarde -### RTO/RPO Targets +### Objectifs RTO/RPO -- **RTO**: 2 hours (Recovery Time Objective) -- **RPO**: 24 hours (Recovery Point Objective) +- **RTO**: 2 heures (Objectif Temps de Récupération) +- **RPO**: 24 heures (Objectif Point de Récupération) -## Monitoring and Alerting +## Surveillance et Alertes -### Key Metrics +### Métriques Clés **Infrastructure**: -- GPU utilization and temperature -- Memory usage and availability -- Network throughput -- Storage usage +- Utilisation et température GPU +- Utilisation et disponibilité mémoire +- Débit réseau +- Utilisation stockage **Application**: -- Request rate and latency -- Error rate and types -- Token generation rate -- Queue depth +- Taux et latence des requêtes +- Taux et types d'erreurs +- Taux de génération de tokens +- Profondeur de file **Business**: -- Cost per request -- Revenue per request -- SLA compliance -- User satisfaction +- Coût par requête +- Revenus par requête +- Conformité SLA +- Satisfaction utilisateur -### Alert Levels +### Niveaux d'Alerte -1. **Info**: Cost optimization opportunities -2. **Warning**: Performance degradation -3. **Critical**: Service outage or severe issues +1. **Info**: Opportunités d'optimisation des coûts +2. **Warning**: Dégradation des performances +3. **Critique**: Panne de service ou problèmes graves -## Future Architecture Considerations +## Considérations Architecturales Futures -### Planned Improvements +### Améliorations Prévues -1. **Multi-region deployment** (Q4 2024) - - Nuremberg + Helsinki regions - - Cross-region load balancing - - Improved latency for global users +1. **Déploiement multi-région** (T4 2024) + - Régions Nuremberg + Helsinki + - Répartition de charge inter-régions + - Latence améliorée pour utilisateurs globaux -2. 
**Advanced auto-scaling** (Q1 2025) - - Predictive scaling based on usage patterns - - Spot instance integration - - More sophisticated cost optimization +2. **Auto-scaling avancé** (T1 2025) + - Mise à l'échelle prédictive basée sur patterns d'usage + - Intégration instances spot + - Optimisation coûts plus sophistiquée -3. **Edge deployment** (Q2 2025) - - Smaller models at edge locations - - Reduced latency for simple requests - - Hybrid edge-cloud architecture +3. **Déploiement edge** (T2 2025) + - Modèles plus petits aux emplacements edge + - Latence réduite pour requêtes simples + - Architecture hybride edge-cloud -### Technology Evolution +### Évolution Technologique -- **Hardware**: Migration to H100 when cost-effective -- **Software**: Continuous optimization of inference stack -- **Networking**: 10 Gbit/s upgrade for high-throughput scenarios +- **Matériel**: Migration vers H100 quand rentable +- **Logiciel**: Optimisation continue de la stack d'inférence +- **Réseau**: Upgrade 10 Gbit/s pour scénarios haut débit -This architecture provides a solid foundation for scaling from thousands to millions of requests per day while maintaining cost efficiency and performance. \ No newline at end of file +Cette architecture fournit une base solide pour passer de milliers à millions de requêtes par jour tout en maintenant l'efficacité coût et les performances. \ No newline at end of file diff --git a/docs/02_deployment.md b/docs/02_deployment.md index 7b44535..2b15f72 100644 --- a/docs/02_deployment.md +++ b/docs/02_deployment.md @@ -1,37 +1,37 @@ -# Deployment Guide +# Guide de Déploiement -This guide provides step-by-step instructions for deploying the AI Infrastructure on Hetzner Cloud and dedicated servers. +Ce guide fournit des instructions étape par étape pour déployer l'Infrastructure IA sur Hetzner Cloud et les serveurs dédiés. 
-## Prerequisites
+## Prérequis

-Before starting the deployment, ensure you have:
+Avant de commencer le déploiement, assurez-vous d'avoir :

-### Required Accounts and Access
+### Comptes et AccĂšs Requis

-1. **Hetzner Cloud Account**
-   - API token with read/write permissions
-   - Budget sufficient for cloud resources (~€60/month)
+1. **Compte Hetzner Cloud**
+   - Token API avec permissions lecture/écriture
+   - Budget suffisant pour les ressources cloud (~60€/mois)

-2. **Hetzner Robot Account**
-   - API credentials for dedicated server management
-   - Budget for GEX44 servers (€184/month each)
+2. **Compte Hetzner Robot**
+   - Identifiants API pour la gestion des serveurs dédiés
+   - Budget pour les serveurs GEX44 (184€/mois chacun)

-3. **GitLab Account** (for CI/CD)
-   - Project with CI/CD pipelines enabled
-   - Variables configured for secrets
+3. **Compte GitLab** (pour CI/CD)
+   - Projet avec pipelines CI/CD activés
+   - Variables configurées pour les secrets

-### Local Development Environment
+### Environnement de Développement Local

 ```bash
-# Required tools
+# Outils requis
 terraform >= 1.5.0
 ansible >= 8.0.0
-kubectl >= 1.28.0  # Optional
+kubectl >= 1.28.0  # Optionnel
 docker >= 24.0.0
 python >= 3.11
-go >= 1.21  # For testing
+go >= 1.21  # Pour les tests

-# Install tools on Ubuntu/Debian
+# Installation des outils sur Ubuntu/Debian
 sudo apt update
 sudo apt install -y software-properties-common
 curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
@@ -39,96 +39,96 @@ sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(l
 sudo apt update
 sudo apt install terraform ansible python3-pip docker.io

-# Install additional tools
+# Installation d'outils supplémentaires
 pip3 install ansible-lint molecule[docker]
 ```

-### SSH Key Setup
+### Configuration des Clés SSH

 ```bash
-# Generate SSH key for server access
+# Générer une clé SSH pour l'accÚs au serveur
 ssh-keygen -t rsa -b 4096 -f ~/.ssh/hetzner_key -C "ai-infrastructure"

-# Add to SSH agent
+# Ajouter Ă  l'agent SSH
 ssh-add ~/.ssh/hetzner_key

-# Copy public key content
+# Copier le contenu de la clé publique
 cat ~/.ssh/hetzner_key.pub
 ```

-## Pre-Deployment Checklist
+## Liste de Vérification Pré-Déploiement

-### 1. Order GEX44 Servers
+### 1. Commander les Serveurs GEX44

-**Important**: GEX44 servers must be ordered manually through Hetzner Robot portal or API.
+**Important** : Les serveurs GEX44 doivent ĂȘtre commandĂ©s manuellement via le portail Hetzner Robot ou l'API.

 ```bash
-# Order via Robot API (optional)
+# Commander via l'API Robot (optionnel)
 curl -X POST https://robot-ws.your-server.de/order/server \
   -H "Authorization: Basic $(echo -n 'username:password' | base64)" \
   -d "product_id=GEX44&location=FSN1-DC14&os=ubuntu-22.04"
 ```

-**Manual ordering steps**:
-1. Login to [Robot Console](https://robot.your-server.de/)
-2. Navigate to "Order" → "Dedicated Servers"
-3. Select GEX44 configuration:
-   - Location: FSN1-DC14 (Frankfurt)
-   - OS: Ubuntu 22.04 LTS
-   - Quantity: 3 (for production)
-4. Complete payment and wait for provisioning (2-24 hours)
+**Étapes de commande manuelle** :
+1. Se connecter Ă  la [Console Robot](https://robot.your-server.de/)
+2. Naviguer vers "Order" → "Dedicated Servers"
+3. SĂ©lectionner la configuration GEX44 :
+   - Localisation : FSN1-DC14 (Frankfurt)
+   - OS : Ubuntu 22.04 LTS
+   - Quantité : 3 (pour la production)
+4. Finaliser le paiement et attendre le provisioning (2-24 heures)

-### 2. Configure Environment Variables
+### 2. Configurer les Variables d'Environnement

-Create environment file:
+Créer le fichier d'environnement :

 ```bash
-# Copy example environment file
+# Copier le fichier d'environnement exemple
 cp .env.example .env

-# Edit with your credentials
+# Éditer avec vos identifiants
 vim .env
 ```

-Required variables:
+Variables requises :

 ```bash
-# Hetzner credentials
+# Identifiants Hetzner
 HCLOUD_TOKEN=your_hcloud_token_here
 ROBOT_API_USER=your_robot_username
 ROBOT_API_PASSWORD=your_robot_password

-# SSH configuration
+# Configuration SSH
 SSH_PUBLIC_KEY="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQ..."
 SSH_PRIVATE_KEY_PATH=~/.ssh/hetzner_key

-# Domain configuration (optional)
+# Configuration du domaine (optionnel)
 API_DOMAIN=api.yourdomain.com
 MONITORING_DOMAIN=monitoring.yourdomain.com

-# Monitoring credentials
+# Identifiants de surveillance
 GRAFANA_ADMIN_PASSWORD=secure_password_here

 # GitLab CI/CD
 GITLAB_TOKEN=your_gitlab_token
 ANSIBLE_VAULT_PASSWORD=secure_vault_password

-# Cost tracking
+# Suivi des coûts
 PROJECT_NAME=ai-infrastructure
 COST_CENTER=engineering

-# Auto-scaling configuration
+# Configuration d'auto-scaling
 MIN_GEX44_COUNT=1
 MAX_GEX44_COUNT=5
 SCALE_UP_THRESHOLD=0.8
 SCALE_DOWN_THRESHOLD=0.3
 ```

-### 3. Configure Terraform Backend
+### 3. Configurer le Backend Terraform

-Choose your state backend:
+Choisir votre backend d'état :

-#### Option A: GitLab Backend (Recommended)
+#### Option A : Backend GitLab (Recommandé)

 ```hcl
 # terraform/backend.tf
@@ -146,7 +146,7 @@ terraform {
 }
 ```

-#### Option B: S3-Compatible Backend
+#### Option B : Backend Compatible S3

 ```hcl
 # terraform/backend.tf
@@ -163,172 +163,172 @@ terraform {
 }
 ```

-## Deployment Process
+## Processus de Déploiement

-### Step 1: Initial Setup
+### Étape 1 : Configuration Initiale

 ```bash
-# Clone the repository
+# Cloner le dépÎt
 git clone https://github.com/yourorg/ai-infrastructure.git
 cd ai-infrastructure

-# Install dependencies
+# Installer les dépendances
 make setup

-# Validate configuration
+# Valider la configuration
 make validate
 ```

-### Step 2: Development Environment
+### Étape 2 : Environnement de DĂ©veloppement

-Start with a development deployment to test the configuration:
+Commencer par un déploiement de développement pour tester la configuration :

 ```bash
-# Deploy development environment
+# Déployer l'environnement de développement
 make deploy-dev

-# Wait for completion (15-20 minutes)
-# Check deployment status
+# Attendre la finalisation (15-20 minutes)
+# Vérifier le statut du déploiement
 make status ENV=dev

-# Test the deployment
+# Tester le déploiement
 make test ENV=dev
 ```

-### Step 3: Staging Environment
+### Étape 3 : Environnement de Staging

-Once development is working, deploy staging:
+Une fois que le développement fonctionne, déployer le staging :

 ```bash
-# Plan staging deployment
+# Planifier le déploiement staging
 make plan ENV=staging

-# Review the plan carefully
-# Deploy staging
+# Examiner attentivement le plan
+# Déployer le staging
 make deploy-staging

-# Run integration tests
+# Exécuter les tests d'intégration
 make test-load API_URL=https://api-staging.yourdomain.com
 ```

-### Step 4: Production Deployment
+### Étape 4 : DĂ©ploiement Production

-**Warning**: Production deployment should be done during maintenance windows.
+**Attention** : Le dĂ©ploiement en production doit ĂȘtre effectuĂ© pendant les fenĂȘtres de maintenance.

 ```bash
-# Create backup of current state
+# Créer une sauvegarde de l'état actuel
 make backup ENV=production

-# Plan production deployment
+# Planifier le déploiement production
 make plan ENV=production

-# Review plan with team
-# Get approval for production deployment
+# Examiner le plan avec l'équipe
+# Obtenir l'approbation pour le déploiement production

-# Deploy production (requires manual confirmation)
+# Déployer en production (nécessite confirmation manuelle)
 make deploy-prod

-# Verify deployment
+# Vérifier le déploiement
 make status ENV=production
 make test ENV=production
 ```

-## Detailed Deployment Steps
+## Étapes DĂ©taillĂ©es de DĂ©ploiement

-### Infrastructure Deployment (Terraform)
+### Déploiement d'Infrastructure (Terraform)

 ```bash
-# Navigate to terraform directory
+# Naviguer vers le répertoire terraform
 cd terraform/environments/production

-# Initialize Terraform
+# Initialiser Terraform
 terraform init

-# Create execution plan
+# Créer le plan d'exécution
 terraform plan -out=production.tfplan

-# Review the plan
+# Examiner le plan
 terraform show production.tfplan

-# Apply the plan
+# Appliquer le plan
 terraform apply production.tfplan
 ```

-Expected resources to be created:
-- 1x Private network (10.0.0.0/16)
-- 2x Subnets (cloud and GEX44)
-- 4x Firewall rules
-- 3x Cloud servers (LB, API GW, Monitoring)
+Ressources attendues à créer :
+- 1x Réseau privé (10.0.0.0/16)
+- 2x Sous-réseaux (cloud et GEX44)
+- 4x RĂšgles de pare-feu
+- 3x Serveurs cloud (LB, API GW, Monitoring)
 - 1x Volume (500GB)
-- Various security groups
+- Divers groupes de sécurité

-### Server Configuration (Ansible)
+### Configuration des Serveurs (Ansible)

 ```bash
-# Navigate to ansible directory
+# Naviguer vers le répertoire ansible
 cd ansible

-# Test connectivity
+# Tester la connectivité
 ansible all -i inventory/production.yml -m ping

-# Run full configuration
+# Exécuter la configuration complÚte
 ansible-playbook -i inventory/production.yml playbooks/site.yml

-# Verify services are running
+# Vérifier que les services fonctionnent
 ansible all -i inventory/production.yml -a "systemctl status vllm-api"
 ```

-### GEX44 Configuration
+### Configuration GEX44

-The GEX44 servers require special handling due to their dedicated nature:
+Les serveurs GEX44 nécessitent une manipulation spéciale due à leur nature dédiée :

 ```bash
-# Configure GEX44 servers specifically
+# Configurer spécifiquement les serveurs GEX44
 ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml

-# Wait for model downloads (can take 1-2 hours)
-# Monitor progress
+# Attendre les téléchargements de modÚles (peut prendre 1-2 heures)
+# Surveiller le progrĂšs
 ansible gex44 -i inventory/production.yml -a "tail -f /var/log/vllm/model-download.log"

-# Verify GPU accessibility
+# Vérifier l'accessibilité GPU
 ansible gex44 -i inventory/production.yml -a "nvidia-smi"

-# Test vLLM API
+# Tester l'API vLLM
 ansible gex44 -i inventory/production.yml -a "curl -f http://localhost:8000/health"
 ```

-### Load Balancer Configuration
+### Configuration du Load Balancer

 ```bash
-# Configure HAProxy load balancer
+# Configurer le load balancer HAProxy
 ansible-playbook -i inventory/production.yml playbooks/load-balancer-setup.yml

-# Test load balancer
+# Tester le load balancer
 curl -f http://LOAD_BALANCER_IP/health

-# Check HAProxy stats
+# Vérifier les statistiques HAProxy
 curl http://LOAD_BALANCER_IP:8404/stats
 ```

-### Monitoring Setup
+### Configuration de la Surveillance

 ```bash
-# Configure monitoring stack
+# Configurer la pile de surveillance
 ansible-playbook -i inventory/production.yml playbooks/monitoring-setup.yml

-# Access Grafana (after DNS setup)
+# Accéder à Grafana (aprÚs configuration DNS)
 open https://monitoring.yourdomain.com
-# Default credentials:
-# Username: admin
-# Password: (from GRAFANA_ADMIN_PASSWORD)
+# Identifiants par défaut :
+# Nom d'utilisateur : admin
+# Mot de passe : (depuis GRAFANA_ADMIN_PASSWORD)
 ```

-## Post-Deployment Configuration
+## Configuration Post-Déploiement

-### 1. DNS Configuration
+### 1. Configuration DNS

-Update your DNS records to point to the deployed infrastructure:
+Mettre à jour vos enregistrements DNS pour pointer vers l'infrastructure déployée :

 ```dns
 api.yourdomain.com.        300  IN  A     LOAD_BALANCER_IP
@@ -336,233 +336,233 @@ monitoring.yourdomain.com. 300  IN  A     MONITORING_IP
 *.api.yourdomain.com.      300  IN  A     LOAD_BALANCER_IP
 ```

-### 2. SSL Certificate Setup
+### 2. Configuration des Certificats SSL

 ```bash
-# Let's Encrypt certificates (automatic)
+# Certificats Let's Encrypt (automatique)
 ansible-playbook -i inventory/production.yml playbooks/ssl-setup.yml

-# Or manually with certbot
+# Ou manuellement avec certbot
 sudo certbot --nginx -d api.yourdomain.com -d monitoring.yourdomain.com
 ```

-### 3. Monitoring Configuration
+### 3. Configuration de la Surveillance

-#### Grafana Dashboards
+#### Tableaux de Bord Grafana

-1. Login to Grafana at https://monitoring.yourdomain.com
-2. Import pre-built dashboards from `monitoring/grafana/dashboards/`
-3. Configure alert channels (email, Slack, etc.)
+1. Se connecter Ă  Grafana sur https://monitoring.yourdomain.com
+2. Importer les tableaux de bord pré-construits depuis `monitoring/grafana/dashboards/`
+3. Configurer les canaux d'alerte (email, Slack, etc.)

-#### Prometheus Alerts
+#### Alertes Prometheus

-Alerts are automatically configured, but you may want to customize:
+Les alertes sont automatiquement configurées, mais vous pourriez vouloir personnaliser :

 ```bash
-# Edit alert rules
+# Éditer les rùgles d'alerte
 vim monitoring/prometheus/alerts.yml

-# Reload Prometheus configuration
+# Recharger la configuration Prometheus
 ansible monitoring -i inventory/production.yml -a "systemctl reload prometheus"
 ```

-### 4. Backup Configuration
+### 4. Configuration de Sauvegarde

 ```bash
-# Setup automated backups
+# Configurer les sauvegardes automatisées
 ansible-playbook -i inventory/production.yml playbooks/backup-setup.yml

-# Test backup process
+# Tester le processus de sauvegarde
 make backup ENV=production

-# Verify backup files
+# Vérifier les fichiers de sauvegarde
 ls -la backups/$(date +%Y%m%d)/
 ```

-## Validation and Testing
+## Validation et Tests

-### Health Checks
+### ContrÎles de Santé

 ```bash
-# Infrastructure health
+# Santé de l'infrastructure
 make status ENV=production

-# API health
+# Santé de l'API
 curl -f https://api.yourdomain.com/health

-# Monitoring health
+# Santé de la surveillance
 curl -f https://monitoring.yourdomain.com/api/health
 ```

-### Load Testing
+### Tests de Charge

 ```bash
-# Basic load test
+# Test de charge basique
 make test-load API_URL=https://api.yourdomain.com

-# Extended load test
+# Test de charge étendu
 k6 run tests/load/k6_inference_test.js --env API_URL=https://api.yourdomain.com
 ```

-### Contract Testing
+### Tests de Contrat

 ```bash
-# API contract tests
+# Tests de contrat API
 python tests/contracts/test_inference_api.py --api-url=https://api.yourdomain.com
 ```

-## Troubleshooting Deployment Issues
+## Dépannage des ProblÚmes de Déploiement

-### Common Issues
+### ProblĂšmes Courants

-#### 1. Terraform State Lock
+#### 1. Verrouillage d'État Terraform

 ```bash
-# If state is locked
+# Si l'état est verrouillé
 terraform force-unlock LOCK_ID

-# Or reset state (dangerous)
+# Ou réinitialiser l'état (dangereux)
 terraform state pull > backup.tfstate
-terraform state rm # problematic resource
-terraform import # re-import resource
+terraform state rm # ressource problématique
+terraform import # ré-importer la ressource
 ```

-#### 2. Ansible Connection Issues
+#### 2. ProblĂšmes de Connexion Ansible

 ```bash
-# Test SSH connectivity
+# Tester la connectivité SSH
 ansible all -i inventory/production.yml -m ping

-# Check SSH agent
+# Vérifier l'agent SSH
 ssh-add -l

-# Debug connection
+# Déboguer la connexion
 ansible all -i inventory/production.yml -m ping -vvv
 ```

-#### 3. GEX44 Not Accessible
+#### 3. GEX44 Non Accessible

 ```bash
-# Check server status in Robot console
-# Verify network configuration
-# Ensure servers are in same private network
+# Vérifier le statut du serveur dans la console Robot
+# Vérifier la configuration réseau
+# S'assurer que les serveurs sont dans le mĂȘme rĂ©seau privĂ©

-# Manual SSH to debug
+# SSH manuel pour déboguer
 ssh -i ~/.ssh/hetzner_key ubuntu@GEX44_IP
 ```

-#### 4. Model Download Failures
+#### 4. Échecs de TĂ©lĂ©chargement de ModĂšles

 ```bash
-# Check disk space
+# Vérifier l'espace disque
 ansible gex44 -i inventory/production.yml -a "df -h"

-# Check download logs
+# Vérifier les logs de téléchargement
 ansible gex44 -i inventory/production.yml -a "tail -f /var/log/vllm/model-download.log"

-# Retry download
+# Réessayer le téléchargement
 ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=models
 ```

-### Debug Commands
+### Commandes de Débogage

 ```bash
-# Check all service statuses
+# Vérifier tous les statuts de service
 ansible all -i inventory/production.yml -a "systemctl list-units --failed"

-# View logs
+# Voir les logs
 ansible all -i inventory/production.yml -a "journalctl -u vllm-api -n 50"

-# Check GPU status
+# Vérifier le statut GPU
 ansible gex44 -i inventory/production.yml -a "nvidia-smi"

-# Check network connectivity
+# Vérifier la connectivité réseau
 ansible all -i inventory/production.yml -a "ping -c 3 8.8.8.8"
 ```

-## Rollback Procedures
+## Procédures de Rollback

-### Emergency Rollback
+### Rollback d'Urgence

 ```bash
-# Stop accepting new traffic
-# Update load balancer to maintenance mode
+# ArrĂȘter l'acceptation de nouveau trafic
+# Mettre le load balancer en mode maintenance
 ansible load_balancers -i inventory/production.yml -a "systemctl stop haproxy"

-# Rollback Terraform changes
+# Rollback des changements Terraform
 cd terraform/environments/production
 terraform plan -destroy -out=rollback.tfplan
 terraform apply rollback.tfplan

-# Restore from backup
+# Restaurer depuis une sauvegarde
 make restore BACKUP_DATE=20241201 ENV=production
 ```

-### Gradual Rollback
+### Rollback Graduel

 ```bash
-# Remove problematic servers from load balancer
-# Update HAProxy configuration to exclude failed servers
+# Retirer les serveurs problématiques du load balancer
+# Mettre à jour la configuration HAProxy pour exclure les serveurs défaillants
 ansible-playbook -i inventory/production.yml playbooks/load-balancer-setup.yml --extra-vars="exclude_servers=['gex44-3']"

-# Fix issues on excluded servers
-# Re-add to load balancer when ready
+# Corriger les problĂšmes sur les serveurs exclus
+# Les rajouter au load balancer quand prĂȘts
 ```

-## Maintenance Procedures
+## Procédures de Maintenance

-### Regular Maintenance
+### Maintenance RéguliÚre

 ```bash
-# Weekly: Update all packages
+# Hebdomadaire : Mettre Ă  jour tous les paquets
 ansible all -i inventory/production.yml -a "apt update && apt upgrade -y"

-# Monthly: Restart services
+# Mensuelle : Redémarrer les services
 ansible all -i inventory/production.yml -a "systemctl restart vllm-api"

-# Quarterly: Full system reboot (during maintenance window)
+# Trimestrielle : RedĂ©marrage complet du systĂšme (pendant la fenĂȘtre de maintenance)
 ansible all -i inventory/production.yml -a "reboot" --become
 ```

-### Cost Optimization
+### Optimisation des Coûts

 ```bash
-# Generate cost report
+# Générer un rapport de coûts
 make cost-report ENV=production

-# Review unused resources
+# Examiner les ressources inutilisées
 python scripts/cost-analysis.py --find-unused

-# Implement recommendations
-# Scale down during low usage periods
+# Implémenter les recommandations
+# Réduire l'échelle pendant les périodes de faible utilisation
 ```

-## Security Hardening
+## Durcissement de Sécurité

-### Post-Deployment Security
+### Sécurité Post-Déploiement

 ```bash
-# Run security hardening playbook
+# Exécuter le playbook de durcissement de sécurité
 ansible-playbook -i inventory/production.yml playbooks/security-hardening.yml

-# Update firewall rules
+# Mettre Ă  jour les rĂšgles de pare-feu
 ansible-playbook -i inventory/production.yml playbooks/firewall-setup.yml

-# Rotate SSH keys
+# Rotation des clés SSH
 ansible-playbook -i inventory/production.yml playbooks/ssh-key-rotation.yml
 ```

-### Security Monitoring
+### Surveillance de Sécurité

 ```bash
-# Enable fail2ban
+# Activer fail2ban
 ansible all -i inventory/production.yml -a "systemctl enable fail2ban"

-# Setup log monitoring
+# Configurer la surveillance des logs
 ansible-playbook -i inventory/production.yml playbooks/log-monitoring.yml

-# Configure intrusion detection
+# Configurer la détection d'intrusion
 ansible-playbook -i inventory/production.yml playbooks/ids-setup.yml
 ```

-This deployment guide provides a comprehensive path from initial setup to production deployment. Always test changes in development and staging environments before applying to production.
\ No newline at end of file
+Ce guide de déploiement fournit un chemin complet depuis la configuration initiale jusqu'au déploiement en production. Testez toujours les changements dans les environnements de développement et de staging avant de les appliquer en production.
\ No newline at end of file
diff --git a/docs/04_tools.md b/docs/04_tools.md
index 2f6b45a..3e0cfe2 100644
--- a/docs/04_tools.md
+++ b/docs/04_tools.md
@@ -1,249 +1,238 @@
-# Tools & Technologies
+# Outils et Technologies

-## Core Infrastructure
+## Infrastructure de Base

-### Infrastructure as Code
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **Terraform** | 1.12+ | Infrastructure provisioning | MPL-2.0 |
-| **Hetzner Provider** | 1.45+ | Hetzner Cloud resources | MPL-2.0 |
+### Infrastructure en tant que Code
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **Terraform** | 1.12+ | Provisioning d'infrastructure | MPL-2.0 |
+| **Hetzner Provider** | 1.45+ | Ressources Hetzner Cloud | MPL-2.0 |

-### Configuration Management
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **Ansible** | 8.0+ | Server configuration | GPL-3.0 |
-| **Ansible Vault** | Included | Secrets management | GPL-3.0 |
+### Gestion de Configuration
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **Ansible** | 8.0+ | Configuration de serveurs | GPL-3.0 |
+| **Ansible Vault** | Inclus | Gestion des secrets | GPL-3.0 |

-## Operating System & Runtime
+## SystĂšme d'Exploitation et Runtime

-### Base System
-| Component | Version | Purpose | Support |
-|-----------|---------|---------|---------|
-| **Ubuntu Server** | 24.04 LTS | Base operating system | Until 2034 |
-| **Docker** | 24.0.x | Container runtime | Docker Inc. |
-| **systemd** | 253+ | Service management | Built-in |
+### SystĂšme de Base
+| Composant | Version | Objectif | Support |
+|-----------|---------|----------|---------|
+| **Ubuntu Server** | 24.04 LTS | SystĂšme d'exploitation de base | Jusqu'en 2034 |
+| **Docker** | 24.0.x | Runtime de conteneurs | Docker Inc. |
+| **systemd** | 253+ | Gestion des services | Intégré |

-### GPU Stack
-| Component | Version | Purpose | Support |
-|-----------|---------|---------|---------|
-| **NVIDIA Driver** | 545.23.08 | GPU driver | NVIDIA |
-| **CUDA Toolkit** | 12.3+ | GPU computing | NVIDIA |
-| **NVIDIA Container Toolkit** | 1.14+ | Docker GPU support | NVIDIA |
+### Stack GPU
+| Composant | Version | Objectif | Support |
+|-----------|---------|----------|---------|
+| **NVIDIA Driver** | 545.23.08 | Pilote GPU | NVIDIA |
+| **CUDA Toolkit** | 12.3+ | Computing GPU | NVIDIA |
+| **NVIDIA Container Toolkit** | 1.14+ | Support GPU Docker | NVIDIA |

-## AI/ML Stack
+## Stack IA/ML

-### Inference Engine
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **vLLM** | Latest | LLM inference server | Apache-2.0 |
-| **PyTorch** | 2.5.0+ | Deep learning framework | BSD-3 |
-| **Transformers** | 4.46.0+ | Model library | Apache-2.0 |
-| **Accelerate** | 0.34.0+ | Training acceleration | Apache-2.0 |
+### Moteur d'Inférence
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **vLLM** | Latest | Serveur d'inférence LLM | Apache-2.0 |
+| **PyTorch** | 2.5.0+ | Framework d'apprentissage profond | BSD-3 |
+| **Transformers** | 4.46.0+ | BibliothĂšque de modĂšles | Apache-2.0 |
+| **Accelerate** | 0.34.0+ | Accélération d'entraßnement | Apache-2.0 |

-### Model Management
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **MLflow** | 2.8+ | Model lifecycle management | Apache-2.0 |
-| **Hugging Face Hub** | 0.25.0+ | Model repository | Apache-2.0 |
+### Gestion des ModĂšles
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **MLflow** | 2.8+ | Gestion du cycle de vie des modĂšles | Apache-2.0 |
+| **Hugging Face Hub** | 0.25.0+ | DépÎt de modÚles | Apache-2.0 |

-### Quantization
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **AWQ** | Latest | 4-bit quantization | MIT |
-| **GPTQ** | Latest | Alternative quantization | MIT |
-| **TorchAO** | Nightly | Advanced optimizations | BSD-3 |
+### Quantification
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **AWQ** | Latest | Quantification 4-bit | MIT |
+| **GPTQ** | Latest | Quantification alternative | MIT |
+| **TorchAO** | Nightly | Optimisations avancées | BSD-3 |

-## Networking & Load Balancing
+## Réseau et Répartition de Charge

-### Load Balancing
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
+### Répartition de Charge
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
 | **HAProxy** | 2.8+ | Load balancer | GPL-2.0 |
-| **Keepalived** | 2.2+ | High availability | GPL-2.0 |
+| **Keepalived** | 2.2+ | Haute disponibilité | GPL-2.0 |

 ### SSL/TLS
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **Let's Encrypt** | Current | Free SSL certificates | ISRG |
-| **Certbot** | 2.7+ | Certificate automation | Apache-2.0 |
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **Let's Encrypt** | Actuel | Certificats SSL gratuits | ISRG |
+| **Certbot** | 2.7+ | Automatisation de certificats | Apache-2.0 |

-## Monitoring & Observability
+## Surveillance et Observabilité

-### Core Monitoring
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **Prometheus** | 2.47+ | Metrics collection | Apache-2.0 |
-| **Grafana** | 10.2+ | Metrics visualization | AGPL-3.0 |
-| **AlertManager** | 0.26+ | Alert routing | Apache-2.0 |
+### Surveillance de Base
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **Prometheus** | 2.47+ | Collection de métriques | Apache-2.0 |
+| **Grafana** | 10.2+ | Visualisation de métriques | AGPL-3.0 |
+| **AlertManager** | 0.26+ | Routage d'alertes | Apache-2.0 |

-### Exporters
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **Node Exporter** | 1.7+ | System metrics | Apache-2.0 |
-| **nvidia-smi Exporter** | Custom | GPU metrics | MIT |
-| **HAProxy Exporter** | 0.15+ | Load balancer metrics | Apache-2.0 |
+### Exporteurs
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **Node Exporter** | 1.7+ | Métriques systÚme | Apache-2.0 |
+| **nvidia-smi Exporter** | Personnalisé | Métriques GPU | MIT |
+| **HAProxy Exporter** | 0.15+ | Métriques load balancer | Apache-2.0 |

-### Log Management
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **systemd-journald** | Built-in | Log collection | GPL-2.0 |
-| **Logrotate** | 3.21+ | Log rotation | GPL-2.0 |
+### Gestion des Logs
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **systemd-journald** | Intégré | Collection de logs | GPL-2.0 |
+| **Logrotate** | 3.21+ | Rotation des logs | GPL-2.0 |

-## CI/CD & Development
+## CI/CD et Développement

-### CI/CD Platform
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **GitLab** | 16.0+ | CI/CD pipeline | MIT |
-| **GitLab Runner** | 16.0+ | Job execution | MIT |
+### Plateforme CI/CD
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **GitLab** | 16.0+ | Pipeline CI/CD | MIT |
+| **GitLab Runner** | 16.0+ | Exécution de tùches | MIT |

-### Development Tools
-| Tool | Version | Purpose | License |
-|------|---------|---------|---------|
-| **Python** | 3.12+ | Scripting language | PSF |
-| **pip** | 23.0+ | Package manager | MIT |
-| **Poetry** | 1.7+ | Dependency management | MIT |
+### Outils de Développement
+| Outil | Version | Objectif | Licence |
+|-------|---------|----------|---------|
+| **Python** | 3.12+ | Langage de script | PSF |
+| **pip** | 23.0+ | Gestionnaire de paquets | MIT |
+| **Poetry** | 1.7+ | Gestion des dĂ©pendances | MIT | -### Testing -| Tool | Version | Purpose | License | -|------|---------|---------|---------| -| **pytest** | 7.4+ | Python testing | MIT | -| **requests** | 2.31+ | HTTP testing | Apache-2.0 | -| **locust** | 2.17+ | Load testing | MIT | +### Tests +| Outil | Version | Objectif | Licence | +|-------|---------|----------|---------| +| **pytest** | 7.4+ | Tests Python | MIT | +| **requests** | 2.31+ | Tests HTTP | Apache-2.0 | +| **locust** | 2.17+ | Tests de charge | MIT | -## Security & Compliance +## SĂ©curitĂ© et ConformitĂ© -### Firewall & Security -| Tool | Version | Purpose | License | -|------|---------|---------|---------| -| **ufw** | 0.36+ | Firewall management | GPL-3.0 | -| **fail2ban** | 1.0+ | Intrusion prevention | GPL-2.0 | -| **SSH** | OpenSSH 9.3+ | Secure access | BSD | +### Pare-feu et SĂ©curitĂ© +| Outil | Version | Objectif | Licence | +|-------|---------|----------|---------| +| **ufw** | 0.36+ | Gestion du pare-feu | GPL-3.0 | +| **fail2ban** | 1.0+ | PrĂ©vention d'intrusion | GPL-2.0 | +| **SSH** | OpenSSH 9.3+ | AccĂšs sĂ©curisĂ© | BSD | -### Secrets Management -| Tool | Version | Purpose | License | -|------|---------|---------|---------| -| **Ansible Vault** | Built-in | Configuration secrets | GPL-3.0 | -| **GitLab CI Variables** | Built-in | CI/CD secrets | MIT | +### Gestion des Secrets +| Outil | Version | Objectif | Licence | +|-------|---------|----------|---------| +| **Ansible Vault** | IntĂ©grĂ© | Secrets de configuration | GPL-3.0 | +| **GitLab CI Variables** | IntĂ©grĂ© | Secrets CI/CD | MIT | -## Cloud Provider APIs +## APIs Fournisseur Cloud -### Hetzner Services -| Service | API Version | Purpose | Pricing | -|---------|-------------|---------|---------| -| **Hetzner Cloud** | v1 | Cloud resources | Pay-per-use | -| **Hetzner Robot** | v1 | Dedicated servers | Monthly | -| **Hetzner DNS** | v1 | DNS management | Free | +### Services Hetzner +| Service | Version API | 
Objectif | Tarification | +|---------|-------------|----------|--------------| +| **Hetzner Cloud** | v1 | Ressources cloud | Paiement Ă  l'usage | +| **Hetzner Robot** | v1 | Serveurs dĂ©diĂ©s | Mensuel | +| **Hetzner DNS** | v1 | Gestion DNS | Gratuit | -## Backup & Storage +## Sauvegarde et Stockage -### Storage Solutions -| Tool | Version | Purpose | License | -|------|---------|---------|---------| -| **rsync** | 3.2+ | File synchronization | GPL-3.0 | -| **tar** | 1.34+ | Archive creation | GPL-3.0 | +### Solutions de Stockage +| Outil | Version | Objectif | Licence | +|-------|---------|----------|---------| +| **rsync** | 3.2+ | Synchronisation de fichiers | GPL-3.0 | +| **tar** | 1.34+ | CrĂ©ation d'archives | GPL-3.0 | | **gzip** | 1.12+ | Compression | GPL-3.0 | -### Cloud Storage -| Service | Purpose | Pricing | -|---------|---------|---------| -| **Hetzner Storage Box** | Backup storage | €0.0104/GB/month | -| **Hetzner Cloud Volumes** | Block storage | €0.0476/GB/month | +### Stockage Cloud +| Service | Objectif | Tarification | +|---------|----------|--------------| +| **Hetzner Storage Box** | Stockage de sauvegarde | 0,0104€/GB/mois | +| **Hetzner Cloud Volumes** | Stockage bloc | 0,0476€/GB/mois | -## Performance & Optimization +## Performance et Optimisation -### System Optimization -| Tool | Version | Purpose | License | -|------|---------|---------|---------| -| **htop** | 3.2+ | Process monitoring | GPL-2.0 | -| **iotop** | 0.6+ | I/O monitoring | GPL-2.0 | -| **nvidia-smi** | Included | GPU monitoring | NVIDIA | +### Optimisation SystĂšme +| Outil | Version | Objectif | Licence | +|-------|---------|----------|---------| +| **htop** | 3.2+ | Surveillance des processus | GPL-2.0 | +| **iotop** | 0.6+ | Surveillance I/O | GPL-2.0 | +| **nvidia-smi** | Inclus | Surveillance GPU | NVIDIA | -### Network Optimization -| Tool | Version | Purpose | License | -|------|---------|---------|---------| -| **iperf3** | 3.12+ | Network testing | BSD-3 | -| 
**tc** | Built-in | Traffic control | GPL-2.0 | +### Optimisation Réseau +| Outil | Version | Objectif | Licence | +|-------|---------|----------|---------| +| **iperf3** | 3.12+ | Tests réseau | BSD-3 | +| **tc** | Intégré | Contrôle du trafic | GPL-2.0 | -## Documentation & Collaboration +## Documentation et Collaboration -### Documentation -| Tool | Version | Purpose | License | -|------|---------|---------|---------| -| **Markdown** | CommonMark | Documentation format | BSD | -| **Mermaid** | 10.6+ | Diagram generation | MIT | -### Version Control -| Tool | Version | Purpose | License | -|------|---------|---------|---------| -| **Git** | 2.40+ | Version control | GPL-2.0 | -| **Git LFS** | 3.4+ | Large file storage | MIT | +## Commandes d'Installation -## Installation Commands - -### Ubuntu 24.04 Setup +### Configuration Ubuntu 24.04 ```bash -# Update system +# Mettre à jour le système sudo apt update && sudo apt upgrade -y -# Install core tools +# Installer les outils de base sudo apt install -y curl wget git python3-pip -# Install Docker +# Installer Docker curl -fsSL https://get.docker.com -o get-docker.sh sudo sh get-docker.sh -# Install NVIDIA drivers (sur GEX44) +# Installer les pilotes NVIDIA (sur GEX44) sudo apt install -y nvidia-driver-545 sudo nvidia-smi -# Install Terraform +# Installer Terraform wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list sudo apt update && sudo apt install -y terraform -# Install Ansible +# Installer Ansible sudo apt install -y ansible -# Install Python dependencies +# Installer les dépendances Python pip3 install mlflow requests prometheus-client ``` -### Verification Commands +### Commandes de Vérification ```bash -# Verify versions +# Vérifier les versions 
terraform version ansible --version docker version python3 --version -# Verify GPU (sur GEX44) +# Vérifier le GPU (sur GEX44) nvidia-smi docker run --rm --gpus all nvidia/cuda:12.3-runtime-ubuntu22.04 nvidia-smi ``` -## Architecture Compatibility +## Compatibilité d'Architecture -### Supported Hardware +### Matériel Supporté - **CPU** : Intel x86_64, AMD x86_64 - **GPU** : NVIDIA RTX 4000 Ada (Compute Capability 8.9) -- **Memory** : 64GB DDR4 minimum -- **Storage** : NVMe SSD minimum +- **Mémoire** : 64GB DDR4 minimum +- **Stockage** : SSD NVMe minimum -### Network Requirements -- **Bandwidth** : 1 Gbps minimum -- **Latency** : < 10ms intra-datacenter -- **Ports** : 22 (SSH), 80/443 (HTTP/HTTPS), 8000 (vLLM), 9090-9100 (Monitoring) +### Exigences Réseau +- **Bande passante** : 1 Gbps minimum +- **Latence** : < 10ms intra-datacenter +- **Ports** : 22 (SSH), 80/443 (HTTP/HTTPS), 8000 (vLLM), 9090-9100 (Surveillance) -## License Compliance +## Conformité de Licence -### Open Source Components -- **GPL-licensed** : Linux kernel, systemd, Ansible -- **Apache-licensed** : Terraform, MLflow, Prometheus -- **MIT-licensed** : Docker, GitLab, pytest -- **BSD-licensed** : PyTorch, OpenSSH +### Composants Open Source +- **Licence GPL** : Noyau Linux, systemd, Ansible +- **Licence Apache** : Terraform, MLflow, Prometheus +- **Licence MIT** : Docker, GitLab, pytest +- **Licence BSD** : PyTorch, OpenSSH -### Proprietary Components -- **NVIDIA drivers** : NVIDIA License (redistribution restrictions) -- **Hetzner services** : Commercial terms -- **GitLab Enterprise** : Commercial (si utilisé) \ No newline at end of file +### Composants Propriétaires +- **Pilotes NVIDIA** : Licence NVIDIA (restrictions de redistribution) +- **Services Hetzner** : Conditions commerciales +- **GitLab Enterprise** : Commercial (si utilisé) diff --git a/docs/05_troubleshooting.md b/docs/05_troubleshooting.md index 55d3cc2..7e993b3 100644 --- a/docs/05_troubleshooting.md +++ 
b/docs/05_troubleshooting.md @@ -1,659 +1,659 @@ -# Troubleshooting Guide +# Guide de Dépannage -This guide helps diagnose and resolve common issues with the AI Infrastructure deployment. +Ce guide aide à diagnostiquer et résoudre les problèmes courants avec le déploiement de l'Infrastructure IA. -## Quick Diagnostic Commands +## Commandes de Diagnostic Rapide ```bash -# Overall system health +# Santé globale du système make status ENV=production -# Check all services +# Vérifier tous les services ansible all -i inventory/production.yml -a "systemctl list-units --failed" -# View recent logs +# Voir les logs récents ansible all -i inventory/production.yml -a "journalctl --since '10 minutes ago' --no-pager" -# Check GPU status +# Vérifier le statut GPU ansible gex44 -i inventory/production.yml -a "nvidia-smi" -# Test API endpoints +# Tester les endpoints API curl -f https://api.yourdomain.com/health curl -f https://api.yourdomain.com/v1/models ``` -## Infrastructure Issues +## Problèmes d'Infrastructure -### Server Not Responding +### Serveur qui ne Répond Pas -**Symptoms**: Server unreachable via SSH or API +**Symptômes** : Serveur injoignable via SSH ou API -**Diagnosis**: +**Diagnostic** : ```bash -# Check server status in Hetzner Console -# Ping test +# Vérifier le statut du serveur dans la Console Hetzner +# Test de ping ping SERVER_IP -# SSH connectivity test +# Test de connectivité SSH ssh -v -i ~/.ssh/hetzner_key ubuntu@SERVER_IP -# Check from other servers +# Vérifier depuis d'autres serveurs ansible other_servers -i inventory/production.yml -a "ping -c 3 SERVER_IP" ``` -**Solutions**: +**Solutions** : +1. 
**Problèmes Réseau** : ```bash - # Restart networking + # Redémarrer le réseau ansible TARGET_SERVER -i inventory/production.yml -a "systemctl restart networking" - - # Check firewall status + + # Vérifier le statut du pare-feu ansible TARGET_SERVER -i inventory/production.yml -a "ufw status" - - # Reset firewall if needed + + # Réinitialiser le pare-feu si nécessaire ansible TARGET_SERVER -i inventory/production.yml -a "ufw --force reset" ``` -2. **Server Overload**: +2. **Surcharge du Serveur** : ```bash - # Check resource usage + # Vérifier l'utilisation des ressources ansible TARGET_SERVER -i inventory/production.yml -a "top -bn1 | head -20" - - # Check disk space + + # Vérifier l'espace disque ansible TARGET_SERVER -i inventory/production.yml -a "df -h" - - # Check memory + + # Vérifier la mémoire ansible TARGET_SERVER -i inventory/production.yml -a "free -h" ``` -3. **Hardware Issues**: - - Contact Hetzner support - - Check Robot console for hardware alerts - - Consider server replacement +3. 
**Problèmes Matériels** : + - Contacter le support Hetzner + - Vérifier la console Robot pour les alertes matérielles + - Envisager le remplacement du serveur -### Private Network Issues +### Problèmes de Réseau Privé -**Symptoms**: Servers can't communicate over private network +**Symptômes** : Les serveurs ne peuvent pas communiquer sur le réseau privé -**Diagnosis**: +**Diagnostic** : ```bash -# Check private network configuration +# Vérifier la configuration du réseau privé ansible all -i inventory/production.yml -a "ip route show" -# Test private network connectivity +# Tester la connectivité du réseau privé ansible all -i inventory/production.yml -a "ping -c 3 10.0.2.10" -# Check network interfaces +# Vérifier les interfaces réseau ansible all -i inventory/production.yml -a "ip addr show" ``` -**Solutions**: +**Solutions** : ```bash -# Restart network interfaces +# Redémarrer les interfaces réseau ansible all -i inventory/production.yml -a "systemctl restart networking" -# Re-apply network configuration +# Ré-appliquer la configuration réseau ansible-playbook -i inventory/production.yml playbooks/network-setup.yml -# Check Hetzner Cloud network status +# Vérifier le statut réseau Hetzner Cloud terraform show | grep network ``` -## GPU Issues +## Problèmes GPU -### GPU Not Detected +### GPU Non Détecté -**Symptoms**: `nvidia-smi` command fails or shows no GPUs +**Symptômes** : La commande `nvidia-smi` échoue ou n'affiche aucun GPU -**Diagnosis**: +**Diagnostic** : ```bash -# Check GPU hardware detection +# Vérifier la détection matérielle du GPU ansible gex44 -i inventory/production.yml -a "lspci | grep -i nvidia" -# Check NVIDIA driver status +# Vérifier le statut du pilote NVIDIA ansible gex44 -i inventory/production.yml -a "nvidia-smi" -# Check driver version +# Vérifier la version du pilote ansible gex44 -i inventory/production.yml -a "cat /proc/driver/nvidia/version" -# Check kernel modules +# Vérifier les modules du 
noyau ansible gex44 -i inventory/production.yml -a "lsmod | grep nvidia" ``` -**Solutions**: -1. **Driver Issues**: +**Solutions** : +1. **Problèmes de Pilote** : ```bash - # Reinstall NVIDIA drivers + # Réinstaller les pilotes NVIDIA ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=cuda - - # Reboot after driver installation + + # Redémarrer après l'installation du pilote ansible gex44 -i inventory/production.yml -a "reboot" --become ``` -2. **Hardware Issues**: +2. **Problèmes Matériels** : ```bash - # Check hardware detection + # Vérifier la détection matérielle ansible gex44 -i inventory/production.yml -a "lshw -C display" - - # Check BIOS settings (requires physical access) - # Contact Hetzner support for hardware issues + + # Vérifier les paramètres BIOS (nécessite un accès physique) + # Contacter le support Hetzner pour les problèmes matériels ``` -### GPU Memory Issues +### Problèmes de Mémoire GPU -**Symptoms**: CUDA out of memory errors, poor performance +**Symptômes** : Erreurs CUDA de manque de mémoire, performances dégradées -**Diagnosis**: +**Diagnostic** : ```bash -# Check GPU memory usage +# Vérifier l'utilisation de la mémoire GPU ansible gex44 -i inventory/production.yml -a "nvidia-smi --query-gpu=memory.used,memory.total --format=csv" -# Check running processes on GPU +# Vérifier les processus en cours d'exécution sur le GPU ansible gex44 -i inventory/production.yml -a "nvidia-smi pmon" -# Check vLLM memory configuration +# Vérifier la configuration mémoire vLLM ansible gex44 -i inventory/production.yml -a "cat /etc/vllm/config.env | grep MEMORY" ``` -**Solutions**: -1. **Reduce Memory Usage**: +**Solutions** : +1. 
**Réduire l'Utilisation Mémoire** : ```bash - # Lower GPU memory utilization + # Réduire l'utilisation de la mémoire GPU ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_GPU_MEMORY_UTILIZATION=0.8' regexp='^VLLM_GPU_MEMORY_UTILIZATION='" - - # Restart vLLM + + # Redémarrer vLLM ansible gex44 -i inventory/production.yml -a "systemctl restart vllm-api" ``` -2. **Clear GPU Memory**: +2. **Libérer la Mémoire GPU** : ```bash - # Kill all GPU processes + # Tuer tous les processus GPU ansible gex44 -i inventory/production.yml -a "pkill -f python" - - # Reset GPU + + # Réinitialiser le GPU ansible gex44 -i inventory/production.yml -a "nvidia-smi --gpu-reset" ``` -### GPU Temperature Issues +### Problèmes de Température GPU -**Symptoms**: High GPU temperatures, thermal throttling +**Symptômes** : Températures élevées du GPU, limitation thermique -**Diagnosis**: +**Diagnostic** : ```bash -# Check current temperatures +# Vérifier les températures actuelles ansible gex44 -i inventory/production.yml -a "nvidia-smi --query-gpu=temperature.gpu,temperature.memory --format=csv" -# Check temperature history in Grafana -# Navigate to GPU Metrics dashboard +# Vérifier l'historique des températures dans Grafana +# Naviguer vers le tableau de bord Métriques GPU ``` -**Solutions**: -1. **Immediate Cooling**: +**Solutions** : +1. **Refroidissement Immédiat** : ```bash - # Reduce GPU workload - # Scale down inference requests temporarily - - # Check cooling system + # Réduire la charge GPU + # Réduire temporairement les requêtes d'inférence + + # Vérifier le système de refroidissement ansible gex44 -i inventory/production.yml -a "sensors" ``` -2. **Long-term Solutions**: - - Contact Hetzner for datacenter cooling issues - - Reduce GPU utilization limits - - Implement better load balancing +2. 
**Solutions à Long Terme** : + - Contacter Hetzner pour les problèmes de refroidissement du datacenter + - Réduire les limites d'utilisation du GPU + - Implémenter une meilleure répartition de charge -## vLLM Service Issues +## Problèmes du Service vLLM -### vLLM Service Won't Start +### Le Service vLLM ne Démarre Pas -**Symptoms**: `systemctl status vllm-api` shows failed state +**Symptômes** : `systemctl status vllm-api` affiche un état d'échec -**Diagnosis**: +**Diagnostic** : ```bash -# Check service status +# Vérifier le statut du service ansible gex44 -i inventory/production.yml -a "systemctl status vllm-api" -# Check service logs +# Vérifier les logs du service ansible gex44 -i inventory/production.yml -a "journalctl -u vllm-api -n 50" -# Check vLLM configuration +# Vérifier la configuration vLLM ansible gex44 -i inventory/production.yml -a "cat /etc/vllm/config.env" -# Test manual start +# Tester le démarrage manuel ansible gex44 -i inventory/production.yml -a "sudo -u vllm python -m vllm.entrypoints.openai.api_server --help" ``` -**Solutions**: -1. **Configuration Issues**: +**Solutions** : +1. **Problèmes de Configuration** : ```bash - # Validate configuration + # Valider la configuration ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=config --check - - # Regenerate configuration + + # Régénérer la configuration ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=config ``` -2. **Permission Issues**: +2. **Problèmes de Permissions** : ```bash - # Fix file permissions + # Corriger les permissions de fichiers ansible gex44 -i inventory/production.yml -a "chown -R vllm:vllm /opt/vllm" ansible gex44 -i inventory/production.yml -a "chmod 755 /opt/vllm" ``` -3. **Dependency Issues**: +3. 
**Problèmes de Dépendances** : ```bash - # Reinstall vLLM + # Réinstaller vLLM ansible gex44 -i inventory/production.yml -a "pip install --force-reinstall vllm" ``` -### Model Loading Issues +### Problèmes de Chargement de Modèles -**Symptoms**: vLLM starts but models fail to load +**Symptômes** : vLLM démarre mais les modèles ne se chargent pas -**Diagnosis**: +**Diagnostic** : ```bash -# Check model files +# Vérifier les fichiers de modèles ansible gex44 -i inventory/production.yml -a "ls -la /opt/vllm/models/" -# Check disk space +# Vérifier l'espace disque ansible gex44 -i inventory/production.yml -a "df -h /opt/vllm/models/" -# Check model loading logs +# Vérifier les logs de chargement des modèles ansible gex44 -i inventory/production.yml -a "tail -f /var/log/vllm/model-loading.log" -# Test model access +# Tester l'accès aux modèles ansible gex44 -i inventory/production.yml -a "sudo -u vllm python -c \"from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('/opt/vllm/models/mixtral-8x7b')\"" ``` -**Solutions**: -1. **Missing Models**: +**Solutions** : +1. **Modèles Manquants** : ```bash - # Re-download models + # Re-télécharger les modèles ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=models - - # Check HuggingFace connectivity + + # Vérifier la connectivité HuggingFace ansible gex44 -i inventory/production.yml -a "curl -f https://huggingface.co" ``` -2. **Corrupted Models**: +2. **Modèles Corrompus** : ```bash - # Remove corrupted models + # Supprimer les modèles corrompus ansible gex44 -i inventory/production.yml -a "rm -rf /opt/vllm/models/mixtral-8x7b" - - # Re-download + + # Re-télécharger ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=models ``` -3. **Insufficient Resources**: +3. 
**Ressources Insuffisantes** : ```bash - # Use smaller model or quantization - # Update configuration to use quantized models + # Utiliser un modèle plus petit ou la quantification + # Mettre à jour la configuration pour utiliser des modèles quantifiés ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_QUANTIZATION=awq' regexp='^VLLM_QUANTIZATION='" ``` -### High Latency Issues +### Problèmes de Latence Élevée -**Symptoms**: API responses take too long +**Symptômes** : Les réponses API prennent trop de temps -**Diagnosis**: +**Diagnostic** : ```bash -# Check current latency +# Vérifier la latence actuelle curl -w "@curl-format.txt" -o /dev/null -s https://api.yourdomain.com/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model":"mixtral-8x7b","messages":[{"role":"user","content":"Hello"}],"max_tokens":10}' -# Check queue size +# Vérifier la taille de la file d'attente curl -s https://api.yourdomain.com/metrics | grep vllm_queue_size -# Check GPU utilization +# Vérifier l'utilisation GPU ansible gex44 -i inventory/production.yml -a "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits" ``` -**Solutions**: -1. **Scale Up**: +**Solutions** : +1. **Augmenter l'Échelle** : ```bash - # Add more GPU servers + # Ajouter plus de serveurs GPU make scale-up ENV=production - - # Or manually order new servers + + # Ou commander manuellement de nouveaux serveurs python scripts/autoscaler.py --action=scale-up --count=1 ``` -2. **Optimize Configuration**: +2. 
**Optimiser la Configuration** : ```bash - # Reduce model precision + # Réduire la précision du modèle ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_DTYPE=float16' regexp='^VLLM_DTYPE='" - - # Increase batch size + + # Augmenter la taille des lots ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_MAX_NUM_SEQS=512' regexp='^VLLM_MAX_NUM_SEQS='" ``` -3. **Load Balancing**: +3. **Répartition de Charge** : ```bash - # Check load balancer configuration + # Vérifier la configuration du load balancer ansible load_balancers -i inventory/production.yml -a "curl -s http://localhost:8404/stats" - - # Verify all backends are healthy + + # Vérifier que tous les backends sont en bonne santé curl -s http://LOAD_BALANCER_IP:8404/stats | grep UP ``` -## Load Balancer Issues +## Problèmes de Load Balancer -### Load Balancer Not Routing Traffic +### Load Balancer ne Route pas le Trafic -**Symptoms**: Requests fail to reach backend servers +**Symptômes** : Les requêtes n'atteignent pas les serveurs backend -**Diagnosis**: +**Diagnostic** : ```bash -# Check HAProxy status +# Vérifier le statut HAProxy ansible load_balancers -i inventory/production.yml -a "systemctl status haproxy" -# Check HAProxy configuration +# Vérifier la configuration HAProxy ansible load_balancers -i inventory/production.yml -a "haproxy -f /etc/haproxy/haproxy.cfg -c" -# Check backend health +# Vérifier la santé des backends curl -s http://LOAD_BALANCER_IP:8404/stats -# Test direct backend access +# Tester l'accès direct aux backends curl -f http://10.0.1.10:8000/health ``` -**Solutions**: -1. **Configuration Issues**: +**Solutions** : +1. 
**Problèmes de Configuration** : ```bash - # Regenerate HAProxy configuration + # Régénérer la configuration HAProxy ansible-playbook -i inventory/production.yml playbooks/load-balancer-setup.yml - - # Restart HAProxy + + # Redémarrer HAProxy ansible load_balancers -i inventory/production.yml -a "systemctl restart haproxy" ``` -2. **Backend Health Issues**: +2. **Problèmes de Santé des Backends** : ```bash - # Check why backends are failing health checks + # Vérifier pourquoi les backends échouent aux contrôles de santé ansible gex44 -i inventory/production.yml -a "curl -f http://localhost:8000/health" - - # Fix unhealthy backends + + # Corriger les backends défaillants ansible gex44 -i inventory/production.yml -a "systemctl restart vllm-api" ``` -### SSL Certificate Issues +### Problèmes de Certificats SSL -**Symptoms**: HTTPS requests fail with certificate errors +**Symptômes** : Les requêtes HTTPS échouent avec des erreurs de certificat -**Diagnosis**: +**Diagnostic** : ```bash -# Check certificate validity +# Vérifier la validité du certificat openssl s_client -connect api.yourdomain.com:443 -servername api.yourdomain.com -# Check certificate files +# Vérifier les fichiers de certificats ansible load_balancers -i inventory/production.yml -a "ls -la /etc/ssl/certs/" -# Check certificate expiration +# Vérifier l'expiration du certificat ansible load_balancers -i inventory/production.yml -a "openssl x509 -in /etc/ssl/certs/haproxy.pem -text -noout | grep 'Not After'" ``` -**Solutions**: -1. **Renew Certificates**: +**Solutions** : +1. **Renouveler les Certificats** : ```bash - # Renew Let's Encrypt certificates + # Renouveler les certificats Let's Encrypt ansible load_balancers -i inventory/production.yml -a "certbot renew" - - # Reload HAProxy + + # Recharger HAProxy ansible load_balancers -i inventory/production.yml -a "systemctl reload haproxy" ``` -2. **Fix Certificate Configuration**: +2. 
**Corriger la Configuration des Certificats** : ```bash - # Regenerate certificate bundle + # Régénérer le bundle de certificats ansible load_balancers -i inventory/production.yml -a "cat /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem /etc/letsencrypt/live/api.yourdomain.com/privkey.pem > /etc/ssl/certs/haproxy.pem" ``` -## Monitoring Issues +## Problèmes de Surveillance -### Prometheus Not Collecting Metrics +### Prometheus ne Collecte pas les Métriques -**Symptoms**: Missing data in Grafana dashboards +**Symptômes** : Données manquantes dans les tableaux de bord Grafana -**Diagnosis**: +**Diagnostic** : ```bash -# Check Prometheus status +# Vérifier le statut Prometheus ansible monitoring -i inventory/production.yml -a "systemctl status prometheus" -# Check Prometheus configuration +# Vérifier la configuration Prometheus ansible monitoring -i inventory/production.yml -a "promtool check config /etc/prometheus/prometheus.yml" -# Check target status +# Vérifier le statut des cibles curl -s http://MONITORING_IP:9090/api/v1/targets | jq . -# Test metric endpoints +# Tester les endpoints de métriques curl -s http://10.0.1.10:9835/metrics | head -10 ``` -**Solutions**: -1. **Configuration Issues**: +**Solutions** : +1. **Problèmes de Configuration** : ```bash - # Regenerate Prometheus configuration + # Régénérer la configuration Prometheus ansible-playbook -i inventory/production.yml playbooks/monitoring-setup.yml --tags=prometheus - - # Restart Prometheus + + # Redémarrer Prometheus ansible monitoring -i inventory/production.yml -a "systemctl restart prometheus" ``` -2. **Target Connectivity**: +2. 
**Connectivité des Cibles** : ```bash - # Check network connectivity to targets + # Vérifier la connectivité réseau vers les cibles ansible monitoring -i inventory/production.yml -a "curl -f http://10.0.1.10:9835/metrics" - - # Check firewall rules + + # Vérifier les règles de pare-feu ansible gex44 -i inventory/production.yml -a "ufw status | grep 9835" ``` -### Grafana Dashboard Issues +### Problèmes de Tableaux de Bord Grafana -**Symptoms**: Dashboards show no data or errors +**Symptômes** : Les tableaux de bord n'affichent aucune donnée ou montrent des erreurs -**Diagnosis**: +**Diagnostic** : ```bash -# Check Grafana status +# Vérifier le statut Grafana ansible monitoring -i inventory/production.yml -a "systemctl status grafana-server" -# Check Grafana logs +# Vérifier les logs Grafana ansible monitoring -i inventory/production.yml -a "journalctl -u grafana-server -n 50" -# Test Prometheus data source +# Tester la source de données Prometheus curl -s http://MONITORING_IP:3000/api/datasources ``` -**Solutions**: -1. **Data Source Issues**: +**Solutions** : +1. **Problèmes de Source de Données** : ```bash - # Reconfigure Grafana data sources + # Reconfigurer les sources de données Grafana ansible-playbook -i inventory/production.yml playbooks/monitoring-setup.yml --tags=grafana - - # Restart Grafana + + # Redémarrer Grafana ansible monitoring -i inventory/production.yml -a "systemctl restart grafana-server" ``` -2. **Dashboard Import Issues**: +2. 
**Problèmes d'Import de Tableaux de Bord** : ```bash - # Re-import dashboards + # Ré-importer les tableaux de bord ansible-playbook -i inventory/production.yml playbooks/monitoring-setup.yml --tags=dashboards ``` -## Performance Issues +## Problèmes de Performance -### High CPU Usage +### Utilisation Élevée du CPU -**Symptoms**: Server becomes slow, high load average +**Symptômes** : Le serveur devient lent, charge moyenne élevée -**Diagnosis**: +**Diagnostic** : ```bash -# Check CPU usage +# Vérifier l'utilisation du CPU ansible all -i inventory/production.yml -a "top -bn1 | head -20" -# Check process list +# Vérifier la liste des processus ansible all -i inventory/production.yml -a "ps aux --sort=-%cpu | head -10" -# Check load average +# Vérifier la charge moyenne ansible all -i inventory/production.yml -a "uptime" ``` -**Solutions**: -1. **Identify Resource-Heavy Processes**: +**Solutions** : +1. **Identifier les Processus Gourmands en Ressources** : ```bash - # Kill problematic processes + # Tuer les processus problématiques ansible TARGET_SERVER -i inventory/production.yml -a "pkill -f PROCESS_NAME" - - # Restart services + + # Redémarrer les services ansible TARGET_SERVER -i inventory/production.yml -a "systemctl restart SERVICE_NAME" ``` -2. **Scale Resources**: +2. 
**Augmenter les Ressources** : ```bash - # Add more servers or upgrade existing ones - # Consider upgrading cloud server types in Terraform + # Ajouter plus de serveurs ou mettre à niveau les existants + # Envisager de mettre à niveau les types de serveurs cloud dans Terraform ``` -### High Memory Usage +### Utilisation Élevée de la Mémoire -**Symptoms**: Out of memory errors, swap usage +**Symptômes** : Erreurs de manque de mémoire, utilisation du swap -**Diagnosis**: +**Diagnostic** : ```bash -# Check memory usage +# Vérifier l'utilisation de la mémoire ansible all -i inventory/production.yml -a "free -h" -# Check swap usage +# Vérifier l'utilisation du swap ansible all -i inventory/production.yml -a "swapon --show" -# Check memory-heavy processes +# Vérifier les processus gourmands en mémoire ansible all -i inventory/production.yml -a "ps aux --sort=-%mem | head -10" ``` -**Solutions**: -1. **Free Memory**: +**Solutions** : +1. **Libérer la Mémoire** : ```bash - # Clear caches + # Vider les caches ansible all -i inventory/production.yml -a "sync && echo 3 > /proc/sys/vm/drop_caches" - - # Restart memory-heavy services + + # Redémarrer les services gourmands en mémoire ansible gex44 -i inventory/production.yml -a "systemctl restart vllm-api" ``` -2. **Optimize Configuration**: +2. 
**Optimiser la Configuration** : ```bash - # Reduce model cache size + # Réduire la taille du cache de modèles ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_SWAP_SPACE=2' regexp='^VLLM_SWAP_SPACE='" ``` -## Network Issues +## Problèmes Réseau -### High Latency Between Servers +### Latence Élevée entre Serveurs -**Symptoms**: Slow inter-server communication +**Symptômes** : Communication inter-serveurs lente -**Diagnosis**: +**Diagnostic** : ```bash -# Test latency between servers +# Tester la latence entre serveurs ansible all -i inventory/production.yml -a "ping -c 10 10.0.1.10" -# Check network interface statistics +# Vérifier les statistiques d'interfaces réseau ansible all -i inventory/production.yml -a "cat /proc/net/dev" -# Test bandwidth +# Tester la bande passante ansible all -i inventory/production.yml -a "iperf3 -c 10.0.1.10 -t 10" ``` -**Solutions**: -1. **Network Optimization**: +**Solutions** : +1. **Optimisation Réseau** : ```bash - # Optimize network settings + # Optimiser les paramètres réseau ansible-playbook -i inventory/production.yml playbooks/network-optimization.yml - - # Check for network congestion - # Consider upgrading network interfaces + + # Vérifier la congestion réseau + # Envisager de mettre à niveau les interfaces réseau ``` -### DNS Resolution Issues +### Problèmes de Résolution DNS -**Symptoms**: Domain names not resolving correctly +**Symptômes** : Les noms de domaine ne se résolvent pas correctement -**Diagnosis**: +**Diagnostic** : ```bash -# Test DNS resolution +# Tester la résolution DNS ansible all -i inventory/production.yml -a "nslookup api.yourdomain.com" -# Check DNS configuration +# Vérifier la configuration DNS ansible all -i inventory/production.yml -a "cat /etc/resolv.conf" -# Test external DNS +# Tester le DNS externe ansible all -i inventory/production.yml -a "nslookup google.com 8.8.8.8" ``` -**Solutions**: +**Solutions** : ```bash -# Update 
DNS configuration +# Mettre à jour la configuration DNS ansible all -i inventory/production.yml -m lineinfile -a "path=/etc/resolv.conf line='nameserver 8.8.8.8'" -# Restart networking +# Redémarrer le réseau ansible all -i inventory/production.yml -a "systemctl restart systemd-resolved" ``` -## Emergency Procedures +## Procédures d'Urgence -### Complete Service Outage +### Panne Complète de Service -1. **Immediate Response**: +1. **Réponse Immédiate** : ```bash - # Check all critical services + # Vérifier tous les services critiques make status ENV=production - - # Enable maintenance mode + + # Activer le mode maintenance ansible load_balancers -i inventory/production.yml -a "systemctl stop haproxy" - - # Notify stakeholders + + # Notifier les parties prenantes ``` -2. **Diagnosis**: +2. **Diagnostic** : ```bash - # Check recent changes + # Vérifier les changements récents git log --since="2 hours ago" --oneline - - # Check system logs + + # Vérifier les logs système ansible all -i inventory/production.yml -a "journalctl --since '2 hours ago' --no-pager" - - # Check monitoring alerts + + # Vérifier les alertes de surveillance curl -s http://MONITORING_IP:9090/api/v1/alerts ``` -3. **Recovery**: +3. 
**Récupération** : ```bash - # Rollback recent changes if necessary + # Rollback des changements récents si nécessaire make rollback ENV=production BACKUP_DATE=YYYYMMDD - - # Or restart all services + + # Ou redémarrer tous les services ansible all -i inventory/production.yml -a "systemctl restart vllm-api haproxy prometheus grafana-server" - - # Re-enable load balancer + + # Réactiver le load balancer ansible load_balancers -i inventory/production.yml -a "systemctl start haproxy" ``` -### Data Loss Prevention +### Prévention de Perte de Données ```bash -# Immediate backup +# Sauvegarde immédiate make backup ENV=production -# Snapshot critical volumes -# Use Hetzner Cloud console to create snapshots +# Instantané des volumes critiques +# Utiliser la console Hetzner Cloud pour créer des snapshots -# Document the incident -# Create incident report with timeline and actions taken +# Documenter l'incident +# Créer un rapport d'incident avec chronologie et actions entreprises ``` -For issues not covered in this guide, contact the infrastructure team or create an issue in the project repository with: -- Detailed problem description -- Error messages and logs -- Steps already taken -- Current system status \ No newline at end of file +Pour les problèmes non couverts dans ce guide, contactez l'équipe d'infrastructure ou créez un ticket dans le dépôt du projet avec : +- Description détaillée du problème +- Messages d'erreur et logs +- Étapes déjà entreprises +- Statut actuel du système \ No newline at end of file diff --git a/docs/INDEX.md b/docs/INDEX.md new file mode 100644 index 0000000..3e20c9c --- /dev/null +++ b/docs/INDEX.md @@ -0,0 +1,174 @@ +# Index de la Documentation + +## 📚 Infrastructure IA Production-Ready avec Hetzner + +Cette documentation couvre l'infrastructure complète pour déployer des modèles IA sur serveurs Hetzner GEX44 avec GitLab CI/CD, Terraform et Ansible. 
+
+### 🎯 Quick Navigation
+
+| Document | Description | Status |
+|----------|-------------|--------|
+| [**01_architecture.md**](./01_architecture.md) | Complete infrastructure architecture | ✅ Complete |
+| [**02_deployment.md**](./02_deployment.md) | Step-by-step deployment guide | ✅ Complete |
+| [**03_applications.md**](./03_applications.md) | Multi-project and team organization | ✅ Complete |
+| [**04_tools.md**](./04_tools.md) | Tools and technologies used | ✅ Complete |
+| [**05_troubleshooting.md**](./05_troubleshooting.md) | Troubleshooting and resolution guide | ✅ Complete |
+| [**vpn-setup.md**](./vpn-setup.md) | WireGuard VPN configuration | ✅ Complete |
+
+---
+
+## 🚀 Quick Start
+
+### Prerequisites
+- Hetzner account (Robot + Cloud)
+- GitLab account for CI/CD
+- 3x GEX44 servers ordered
+
+### Installation in 5 minutes
+```bash
+# 1. Clone and set up
+git clone https://github.com/spham/hetzner-ai-infrastructure.git
+cd hetzner-ai-infrastructure
+make setup
+
+# 2. Configure secrets
+cp .env.example .env
+# Edit .env with your Hetzner tokens
+
+# 3. Deploy development
+make deploy-dev
+
+# 4. 
Verify deployment
+make test
+```
+
+---
+
+## 📖 Guides by Topic
+
+### 🏗️ **Infrastructure & Architecture**
+- **[Architecture](./01_architecture.md)** - Overall design, components, networks
+  - High-level architecture
+  - Component details (Load Balancer, API Gateway, GPU Servers)
+  - Network architecture and security
+  - Performance and costs
+
+### ⚡ **Deployment & Configuration**
+- **[Deployment](./02_deployment.md)** - Complete installation guide
+  - Prerequisites and preparation
+  - Automated deployment
+  - Validation and tests
+  - Rollback procedures
+
+### 👥 **Management & Organization**
+- **[Applications](./03_applications.md)** - Multi-project organization
+  - Organizational structure
+  - Team management
+  - Development workflows
+  - Best practices
+
+### 🛠️ **Tools & Technologies**
+- **[Tools](./04_tools.md)** - Complete technology stack
+  - Infrastructure as Code (Terraform, Ansible)
+  - Containerization (Docker)
+  - Monitoring (Prometheus, Grafana)
+  - CI/CD (GitLab CI)
+
+### 🔧 **Maintenance & Troubleshooting**
+- **[Troubleshooting](./05_troubleshooting.md)** - Problem resolution
+  - System diagnostics
+  - GPU and vLLM problems
+  - Network and performance issues
+  - Emergency procedures
+
+### 🔒 **Security & Access**
+- **[VPN Configuration](./vpn-setup.md)** - Secure external access
+  - WireGuard setup
+  - Client/server configuration
+  - External company access
+  - Security rules
+
+---
+
+## 🎮 Main Commands
+
+| Command | Description | Documentation |
+|----------|-------------|---------------|
+| `make setup` | Install dependencies | [02_deployment.md](./02_deployment.md#prerequisites) |
+| `make test` | Full test suite | [02_deployment.md](./02_deployment.md#testing) |
+| `make deploy-dev` | Dev deployment | [02_deployment.md](./02_deployment.md#development) |
+| `make deploy-prod` | Production deployment | 
[02_deployment.md](./02_deployment.md#production) |
+| `make cost-report` | Cost report | [01_architecture.md](./01_architecture.md#costs) |
+| `make scale-up` | Add a GPU server | [01_architecture.md](./01_architecture.md#scaling) |
+
+---
+
+## 📊 Technical Overview
+
+### **Architecture**
+```
+Internet → HAProxy → 3x GEX44 GPU Servers → vLLM APIs
+              ↓
+    Monitoring Stack (Prometheus/Grafana)
+```
+
+### **Monthly Costs**
+- **Infrastructure**: 634€/month vs 10570€ AWS (12x cheaper)
+- **Performance**: 255 tokens/sec, P95 latency <2s
+- **ROI**: 2.7x more efficient than AWS
+
+### **GPU Specifications**
+- **3x GEX44**: RTX 4000 Ada, 20GB VRAM each
+- **Models**: Mixtral-8x7B, Llama2-70B, CodeLlama-34B
+- **Auto-scaling**: Based on GPU utilization
+
+---
+
+## 🔍 Keyword Index
+
+### A-C
+- **Ansible**: [02_deployment.md](./02_deployment.md), [04_tools.md](./04_tools.md)
+- **Architecture**: [01_architecture.md](./01_architecture.md)
+- **Auto-scaling**: [01_architecture.md](./01_architecture.md#scaling)
+- **Costs**: [01_architecture.md](./01_architecture.md#costs)
+
+### D-H
+- **Deployment**: [02_deployment.md](./02_deployment.md)
+- **Troubleshooting**: [05_troubleshooting.md](./05_troubleshooting.md)
+- **Docker**: [04_tools.md](./04_tools.md)
+- **GPU**: [01_architecture.md](./01_architecture.md#gpu), [05_troubleshooting.md](./05_troubleshooting.md#gpu)
+- **Grafana**: [04_tools.md](./04_tools.md), [05_troubleshooting.md](./05_troubleshooting.md#monitoring)
+- **HAProxy**: [01_architecture.md](./01_architecture.md#load-balancer), [05_troubleshooting.md](./05_troubleshooting.md#load-balancer)
+- **Hetzner**: [01_architecture.md](./01_architecture.md), [02_deployment.md](./02_deployment.md)
+
+### I-P
+- **Infrastructure**: [01_architecture.md](./01_architecture.md)
+- **Monitoring**: [01_architecture.md](./01_architecture.md#monitoring), [04_tools.md](./04_tools.md), [05_troubleshooting.md](./05_troubleshooting.md#monitoring)
+- 
**Performance**: [01_architecture.md](./01_architecture.md#performance), [05_troubleshooting.md](./05_troubleshooting.md#performance)
+- **Prometheus**: [04_tools.md](./04_tools.md), [05_troubleshooting.md](./05_troubleshooting.md#monitoring)
+
+### R-Z
+- **Network**: [01_architecture.md](./01_architecture.md#network), [05_troubleshooting.md](./05_troubleshooting.md#network)
+- **Security**: [01_architecture.md](./01_architecture.md#security), [vpn-setup.md](./vpn-setup.md)
+- **Terraform**: [02_deployment.md](./02_deployment.md), [04_tools.md](./04_tools.md)
+- **vLLM**: [01_architecture.md](./01_architecture.md#gpu), [05_troubleshooting.md](./05_troubleshooting.md#vllm)
+- **VPN**: [vpn-setup.md](./vpn-setup.md)
+
+---
+
+## 📞 Support & Contribution
+
+### Getting Help
+1. **Troubleshooting**: See [05_troubleshooting.md](./05_troubleshooting.md)
+2. **Issues**: Create an issue on GitLab
+3. **Documentation**: Refer to the specific guides above
+
+### Contributing
+- Fork the repository
+- Follow the conventions in [03_applications.md](./03_applications.md)
+- Test your changes with `make test`
+- Submit a merge request
+
+---
+
+*Documentation maintained by the AI Infrastructure team - Last updated: {{ ansible_date_time.iso8601 }}*
\ No newline at end of file
diff --git a/docs/vpn-setup.md b/docs/vpn-setup.md
new file mode 100644
index 0000000..e69588b
--- /dev/null
+++ b/docs/vpn-setup.md
@@ -0,0 +1,82 @@
+# VPN Setup for an External Company
+
+## WireGuard Configuration
+
+This documentation explains how to configure VPN access for an external company to your Hetzner AI infrastructure.
+
+### Architecture
+
+```
+External Company → Internet → VPN Gateway (Load Balancer) → Internal Networks
+                                                              ↓
+                                                    ┌─ GEX44 GPU (10.0.1.0/24)
+                                                    └─ Cloud Services (10.0.2.0/24)
+```
+
+### Deployment
+
+1. 
**Variable configuration**:
+```bash
+# In ansible/group_vars/all/main.yml or via environment variables
+export external_company_public_key="PUBLIC_KEY_FROM_EXTERNAL_COMPANY"
+export load_balancer_public_ip="YOUR_LOAD_BALANCER_PUBLIC_IP"
+```
+
+2. **VPN deployment**:
+```bash
+cd ansible
+ansible-playbook -i inventory/production.yml playbooks/vpn-setup.yml
+```
+
+### Client Configuration (External Company)
+
+1. **Generate the keys on the client side**:
+```bash
+# On the external company's system
+wg genkey | tee private.key | wg pubkey > public.key
+```
+
+2. **Client configuration** (`wg0.conf`):
+```ini
+[Interface]
+PrivateKey = CONTENTS_OF_private.key
+Address = 10.0.10.10/32
+DNS = 8.8.8.8
+
+[Peer]
+PublicKey = SERVER_PUBLIC_KEY
+Endpoint = YOUR_PUBLIC_IP:51820
+AllowedIPs = 10.0.1.0/24, 10.0.2.0/24
+PersistentKeepalive = 25
+```
+
+### Authorized Access
+
+The external company will be able to access:
+- **GPU servers (GEX44)**: `10.0.1.10-12` (vLLM port 8000)
+- **Cloud services**: `10.0.2.0/24`
+- **Monitoring**: `10.0.2.12:3000` (Grafana)
+
+### Security
+
+- WireGuard encryption (ChaCha20Poly1305)
+- Public-key authentication
+- UFW firewall configured automatically
+- Routing limited to authorized networks
+
+### Verification
+
+```bash
+# On the VPN server
+sudo wg show
+
+# Connectivity test from the external company
+ping -c 4 10.0.1.10
+curl http://10.0.1.10:8000/health
+```
+
+### Troubleshooting
+
+- Check that UDP port 51820 is open
+- Inspect the logs: `sudo journalctl -u wg-quick@wg0`
+- Test connectivity: `sudo wg show`
\ No newline at end of file
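
Review note: the client `wg0.conf` template in `docs/vpn-setup.md` is filled in by hand, which invites copy-paste mistakes in the key and endpoint fields. As a minimal sketch, assuming the addresses, port, and networks stay exactly as written in that file, a small shell function could render it instead. The name `render_client_conf` and its argument order are hypothetical and not part of the repository.

```bash
# Hypothetical helper (not part of this repository): prints a client wg0.conf
# that mirrors the template in docs/vpn-setup.md.
# Arguments: 1) client private key, 2) server public key, 3) server public IP.
render_client_conf() {
  local private_key="$1" server_public_key="$2" server_ip="$3"
  # The fixed values below (client address, DNS, port 51820, allowed networks)
  # are copied from the documented template; adjust them if the template changes.
  cat <<EOF
[Interface]
PrivateKey = ${private_key}
Address = 10.0.10.10/32
DNS = 8.8.8.8

[Peer]
PublicKey = ${server_public_key}
Endpoint = ${server_ip}:51820
AllowedIPs = 10.0.1.0/24, 10.0.2.0/24
PersistentKeepalive = 25
EOF
}
```

A possible invocation on the external company's machine would be `render_client_conf "$(cat private.key)" SERVER_PUBLIC_KEY 203.0.113.10 > wg0.conf`, where the key and the IP are placeholders. Keeping the private key as an argument rather than a file path means the function never touches the disk itself, so it can be checked without real keys.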