Improve documentation and add VPN configuration
- Complete French translation of the architecture documentation
- Improved README navigation with a full INDEX
- Added WireGuard VPN configuration for secure external access
- Ansible configuration for VPN support in all environments
- Added VPN setup documentation with detailed guides
parent 5c050b2443
commit 9d8383ee37
20
README.md
@@ -229,11 +229,21 @@ k6 run tests/load/k6_inference_test.js

 ## 📚 Documentation

-- [**Architecture**](docs/01_architecture.md) : Diagrammes et décisions
-- [**Deployment**](docs/02_deployment.md) : Guide étape par étape
-- [**Troubleshooting**](docs/05_troubleshooting.md) : Solutions aux problèmes courants
-- [**Applications**](docs/03_applications.md) : Guide des applications
-- [**Tools**](docs/04_tools.md) : Outils disponibles
+> 🇫🇷 **Documentation complète en français** - [**INDEX Complet**](docs/INDEX.md)
+
+### Guides Principaux
+- [**🏗️ Architecture**](docs/01_architecture.md) : Architecture complète et composants
+- [**⚡ Déploiement**](docs/02_deployment.md) : Guide étape par étape
+- [**🔧 Dépannage**](docs/05_troubleshooting.md) : Résolution de problèmes
+- [**👥 Applications**](docs/03_applications.md) : Organisation multi-projets
+- [**🛠️ Outils**](docs/04_tools.md) : Stack technologique
+- [**🔒 VPN Setup**](docs/vpn-setup.md) : Configuration accès externe
+
+### Navigation Rapide
+- **[📖 INDEX Complet](docs/INDEX.md)** - Navigation thématique et index par mots-clés
+- **Démarrage Rapide** : [Quick Start](#-quick-start-5-minutes) + [Déploiement](docs/02_deployment.md)
+- **Architecture** : [Vue d'ensemble](docs/01_architecture.md#architecture-de-haut-niveau) + [Coûts](docs/01_architecture.md#répartition-des-coûts)
+- **Problèmes** : [Guide de dépannage](docs/05_troubleshooting.md) + [Diagnostics](docs/05_troubleshooting.md#commandes-de-diagnostic)

 ## 📈 Roadmap

@@ -92,6 +92,13 @@ firewall_rules:
     proto: tcp
     src: "{{ private_network_cidr }}"
     comment: "Node exporter from private network"
+  - rule: allow
+    port: "{{ wireguard_port }}"
+    proto: udp
+    comment: "WireGuard VPN"
+  - rule: allow
+    from: "{{ wireguard_network }}"
+    comment: "Allow traffic from VPN clients"

 # Logging configuration
 rsyslog_enabled: true
@@ -120,6 +127,26 @@ net_core_somaxconn: 32768
 net_core_netdev_max_backlog: 5000
 tcp_max_syn_backlog: 8192

+# WireGuard VPN configuration
+wireguard_enabled: true
+wireguard_port: 51820
+wireguard_interface: wg0
+wireguard_network: "10.0.10.0/24"
+wireguard_server_ip: "10.0.10.1"
+
+# External client networks allowed to access via VPN
+wireguard_clients:
+  - name: "external_company"
+    ip: "10.0.10.10"
+    allowed_networks:
+      - "10.0.1.0/24"  # GEX44 GPU servers
+      - "10.0.2.0/24"  # Cloud services
+    public_key: "{{ external_company_public_key | default('') }}"
+
+# WireGuard server configuration
+wireguard_server_private_key: "{{ vault_wireguard_server_private_key | default('') }}"  # vault_* name is an assumed placeholder; referencing the variable itself here triggers Ansible's recursive-template error
+wireguard_server_public_key: "{{ vault_wireguard_server_public_key | default('') }}"  # same assumption as above
+
 # Memory tuning (for ML workloads)
 transparent_hugepage: "madvise"
 oom_kill_allocating_task: 1

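The WireGuard addresses introduced in these group variables can be sanity-checked before a deployment. The following is a minimal Python sketch and not part of the commit: it uses only the standard `ipaddress` module, copies the literal values from the hunk above, and takes the `10.0.0.0/16` private range from the architecture document. The helper name `check_address_plan` is hypothetical.

```python
import ipaddress

# Values copied from the group variables in this commit.
WIREGUARD_NETWORK = "10.0.10.0/24"
WIREGUARD_SERVER_IP = "10.0.10.1"
CLIENTS = [
    {
        "name": "external_company",
        "ip": "10.0.10.10",
        "allowed_networks": ["10.0.1.0/24", "10.0.2.0/24"],
    },
]
# Assumption: private range as stated in the architecture document.
PRIVATE_NETWORK = "10.0.0.0/16"


def check_address_plan(vpn_cidr, server_ip, clients, private_cidr):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    vpn = ipaddress.ip_network(vpn_cidr)
    private = ipaddress.ip_network(private_cidr)
    server = ipaddress.ip_address(server_ip)

    if server not in vpn:
        problems.append(f"server {server} is outside {vpn}")

    seen = {server}
    for client in clients:
        addr = ipaddress.ip_address(client["ip"])
        if addr not in vpn:
            problems.append(f"{client['name']}: {addr} is outside {vpn}")
        if addr in seen:
            problems.append(f"{client['name']}: {addr} is already in use")
        seen.add(addr)
        for cidr in client.get("allowed_networks", []):
            net = ipaddress.ip_network(cidr)
            if net.overlaps(vpn):
                problems.append(f"{client['name']}: {net} overlaps the VPN range")
            if not net.subnet_of(private):
                problems.append(f"{client['name']}: {net} is outside {private}")
    return problems


problems = check_address_plan(
    WIREGUARD_NETWORK, WIREGUARD_SERVER_IP, CLIENTS, PRIVATE_NETWORK
)
vpn_inside_private = ipaddress.ip_network(WIREGUARD_NETWORK).subnet_of(
    ipaddress.ip_network(PRIVATE_NETWORK)
)
print(problems)
print(vpn_inside_private)
```

With the values from this commit the problem list is empty. The second value shows that the VPN range itself sits inside the private /16, so any rule that trusts all of `10.0.0.0/16` also covers VPN clients.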
@@ -129,4 +129,14 @@ all:
         min_gex44_count: 1
         max_gex44_count: 10
         scale_up_threshold: 0.8
-        scale_down_threshold: 0.3
+        scale_down_threshold: 0.3
+
+    # VPN Gateway (runs on load balancer or dedicated server)
+    vpn_gateway:
+      vars:
+        wireguard_enabled: true
+        wireguard_server_role: true
+      hosts:
+        load-balancer:
+          wireguard_gateway: true
+          wireguard_public_endpoint: "{{ load_balancer_public_ip }}"
76
ansible/playbooks/vpn-setup.yml
Normal file
@@ -0,0 +1,76 @@
+---
+# VPN Setup Playbook
+# Sets up WireGuard VPN for external company access
+- name: Configure WireGuard VPN Gateway
+  hosts: vpn_gateway
+  become: yes
+  vars:
+    # Override defaults for VPN gateway
+    wireguard_enabled: true
+
+  pre_tasks:
+    - name: Verify VPN gateway configuration
+      debug:
+        msg: |
+          Setting up WireGuard VPN on {{ inventory_hostname }}
+          Public endpoint: {{ wireguard_public_endpoint | default('NOT_SET') }}
+          Network: {{ wireguard_network }}
+          Port: {{ wireguard_port }}
+
+    - name: Ensure public IP is configured
+      fail:
+        msg: "wireguard_public_endpoint must be set for VPN gateway"
+      when: wireguard_public_endpoint is not defined or wireguard_public_endpoint == ''
+
+  roles:
+    - role: wireguard
+      when: wireguard_enabled | default(false)
+
+  post_tasks:
+    - name: Display client configuration instructions
+      debug:
+        msg: |
+          WireGuard VPN setup complete!
+
+          Server public key: {{ wireguard_server_public_key }}
+          Server endpoint: {{ wireguard_public_endpoint }}:{{ wireguard_port }}
+
+          Client configurations have been generated in:
+          /etc/wireguard/clients/
+
+          Next steps:
+          1. Share server public key with external company
+          2. Get external company's public key
+          3. Update inventory with external_company_public_key variable
+          4. Re-run this playbook to update server configuration
+      when: wireguard_server_public_key is defined
+
+    - name: Display routing configuration
+      debug:
+        msg: |
+          VPN Routing Configuration:
+          - VPN Network: {{ wireguard_network }}
+          - GEX44 GPU Access: {{ gex44_subnet }}
+          - Cloud Services Access: {{ cloud_subnet }}
+          - Private Network: {{ private_network_cidr }}
+
+          The external company will be able to access:
+          {% for client in wireguard_clients %}
+          - {{ client.name }}: {{ client.allowed_networks | join(', ') }}
+          {% endfor %}
+
+- name: Update firewall rules on all servers
+  hosts: all
+  become: yes
+  tasks:
+    - name: Allow VPN traffic to reach internal services
+      ufw:
+        rule: allow
+        from_ip: "{{ wireguard_network }}"
+        comment: "Allow VPN clients access"
+      when: firewall_enabled | default(true) and wireguard_enabled | default(false)
+
+    - name: Reload firewall
+      ufw:
+        state: reloaded
+      when: firewall_enabled | default(true)
19
ansible/roles/wireguard/handlers/main.yml
Normal file
@@ -0,0 +1,19 @@
+---
+# WireGuard handlers
+- name: restart wireguard
+  systemd:
+    name: "wg-quick@{{ wireguard_interface }}"
+    state: restarted
+  become: yes
+  listen: restart wireguard
+
+- name: reload wireguard
+  shell: "wg-quick down {{ wireguard_interface }} && wg-quick up {{ wireguard_interface }}"
+  become: yes
+  listen: reload wireguard
+
+- name: save iptables
+  shell: iptables-save > /etc/iptables/rules.v4
+  become: yes
+  listen: save iptables
+  when: ansible_os_family == "Debian"
124
ansible/roles/wireguard/tasks/main.yml
Normal file
@@ -0,0 +1,124 @@
+---
+# WireGuard VPN Setup
+- name: Install WireGuard
+  apt:
+    # role var wireguard_packages adds iptables-persistent, needed by the "save iptables" handler
+    name: "{{ wireguard_packages }}"
+    state: present
+    update_cache: yes
+  become: yes
+
+- name: Enable IP forwarding
+  sysctl:
+    name: net.ipv4.ip_forward
+    value: '1'
+    state: present
+    reload: yes
+  become: yes
+
+- name: Enable IP forwarding for IPv6
+  sysctl:
+    name: net.ipv6.conf.all.forwarding
+    value: '1'
+    state: present
+    reload: yes
+  become: yes
+  when: wireguard_ipv6_enabled | default(false)
+
+- name: Generate WireGuard server private key
+  shell: wg genkey
+  register: wireguard_server_private_key_generated
+  when: wireguard_server_private_key == ''
+  no_log: true
+
+- name: Generate WireGuard server public key
+  shell: echo "{{ wireguard_server_private_key_generated.stdout | default(wireguard_server_private_key) }}" | wg pubkey
+  register: wireguard_server_public_key_generated
+  when: wireguard_server_public_key == '' or wireguard_server_private_key == ''
+  no_log: true
+
+- name: Set WireGuard server keys facts
+  set_fact:
+    wireguard_server_private_key: "{{ wireguard_server_private_key_generated.stdout | default(wireguard_server_private_key) }}"
+    wireguard_server_public_key: "{{ wireguard_server_public_key_generated.stdout | default(wireguard_server_public_key) }}"
+
+- name: Create WireGuard configuration directory
+  file:
+    path: /etc/wireguard
+    state: directory
+    mode: '0700'
+    owner: root
+    group: root
+  become: yes
+
+- name: Generate WireGuard server configuration
+  template:
+    src: wg0.conf.j2
+    dest: "/etc/wireguard/{{ wireguard_interface }}.conf"
+    mode: '0600'
+    owner: root
+    group: root
+  become: yes
+  notify: restart wireguard
+
+- name: Enable and start WireGuard service
+  systemd:
+    name: "wg-quick@{{ wireguard_interface }}"
+    enabled: yes
+    state: started
+    daemon_reload: yes
+  become: yes
+
+- name: Configure firewall rules for WireGuard
+  ufw:
+    rule: "{{ item.rule }}"
+    port: "{{ item.port | default(omit) }}"
+    proto: "{{ item.proto | default(omit) }}"
+    from_ip: "{{ item.from | default(omit) }}"
+    comment: "{{ item.comment | default(omit) }}"
+  become: yes
+  loop:
+    - rule: allow
+      port: "{{ wireguard_port }}"
+      proto: udp
+      comment: "WireGuard VPN"
+    - rule: allow
+      from: "{{ wireguard_network }}"
+      comment: "Allow traffic from VPN clients"
+  when: firewall_enabled | default(true)
+
+- name: Configure NAT rules for WireGuard
+  iptables:
+    table: nat
+    chain: POSTROUTING
+    source: "{{ wireguard_network }}"
+    out_interface: "{{ ansible_default_ipv4.interface }}"
+    jump: MASQUERADE
+    comment: "WireGuard NAT"
+  become: yes
+  notify: save iptables
+
+- name: Display WireGuard server public key
+  debug:
+    msg: "WireGuard server public key: {{ wireguard_server_public_key }}"
+  when: wireguard_server_public_key is defined
+
+- name: Create client configuration directory
+  file:
+    path: /etc/wireguard/clients
+    state: directory
+    mode: '0700'
+    owner: root
+    group: root
+  become: yes
+
+- name: Generate client configurations
+  template:
+    src: client.conf.j2
+    dest: "/etc/wireguard/clients/{{ item.name }}.conf"
+    mode: '0600'
+    owner: root
+    group: root
+  become: yes
+  loop: "{{ wireguard_clients }}"
+  when: wireguard_clients is defined
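The role accepts keys from variables (`external_company_public_key`, the server key pair) and falls back to an empty string when they are missing. WireGuard keys are 32 bytes encoded as base64, which gives 44 characters. The sketch below is a hedged illustration of a format check that could run before the playbook; the function name `is_wireguard_key` is hypothetical and the sample key is all zero bytes, used only to show the format.

```python
import base64
import binascii


def is_wireguard_key(value: str) -> bool:
    """Return True if value is the base64 encoding of exactly 32 bytes."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        # Bad alphabet, bad padding, or non-ASCII input.
        return False
    return len(raw) == 32


# All-zero sample, for format demonstration only (not a usable key).
sample_key = base64.b64encode(bytes(32)).decode("ascii")

print(len(sample_key))               # length of an encoded key
print(is_wireguard_key(sample_key))  # well-formed
print(is_wireguard_key(""))          # the default('') placeholder
```

The check only covers the encoding. It does not prove that a key belongs to the intended peer, so the key exchange described in the playbook's next steps is still required.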
27
ansible/roles/wireguard/templates/client.conf.j2
Normal file
@@ -0,0 +1,27 @@
+# WireGuard Client Configuration for {{ item.name }}
+# Generated by Ansible - Do not edit manually
+
+[Interface]
+# Client private key (generate with: wg genkey)
+PrivateKey = CLIENT_PRIVATE_KEY_HERE
+Address = {{ item.ip }}/32
+DNS = 8.8.8.8, 8.8.4.4
+
+[Peer]
+# Server public key
+PublicKey = {{ wireguard_server_public_key }}
+# Server endpoint (replace with actual public IP)
+Endpoint = YOUR_SERVER_PUBLIC_IP:{{ wireguard_port }}
+# Networks accessible through VPN
+AllowedIPs = {% if item.allowed_networks is defined %}{{ item.allowed_networks | join(', ') }}{% else %}{{ private_network_cidr }}{% endif %}
+
+# Keep connection alive
+PersistentKeepalive = 25
+
+# Instructions for client setup:
+# 1. Generate client key pair:
+#    wg genkey | tee private.key | wg pubkey > public.key
+# 2. Replace CLIENT_PRIVATE_KEY_HERE with contents of private.key
+# 3. Replace YOUR_SERVER_PUBLIC_IP with server's public IP address
+# 4. Add the public key to server configuration
+# 5. Import this config to WireGuard client
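The `AllowedIPs` line in the client template decides which destinations the client routes through the tunnel: the client's own `allowed_networks` when that key is defined, otherwise the whole private network. The same selection can be written as a small Python sketch; the function name `allowed_ips` is hypothetical, and the `10.0.0.0/16` fallback value is taken from the architecture document.

```python
def allowed_ips(client: dict, private_network_cidr: str) -> str:
    """Mirror the template: join the client's networks, else fall back to the private CIDR."""
    if "allowed_networks" in client:
        return ", ".join(client["allowed_networks"])
    return private_network_cidr


external_company = {
    "name": "external_company",
    "ip": "10.0.10.10",
    "allowed_networks": ["10.0.1.0/24", "10.0.2.0/24"],
}

restricted = allowed_ips(external_company, "10.0.0.0/16")
fallback = allowed_ips({"name": "no_list", "ip": "10.0.10.11"}, "10.0.0.0/16")
print(restricted)
print(fallback)
```

Because only these ranges are listed, the generated client configuration is a split tunnel: traffic to other destinations does not go through the VPN.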
26
ansible/roles/wireguard/templates/wg0.conf.j2
Normal file
@@ -0,0 +1,26 @@
+# WireGuard Server Configuration
+# Generated by Ansible - Do not edit manually
+[Interface]
+PrivateKey = {{ wireguard_server_private_key }}
+Address = {{ wireguard_server_ip }}/{{ wireguard_network.split('/')[1] }}
+ListenPort = {{ wireguard_port }}
+
+# Enable packet forwarding
+PostUp = iptables -A FORWARD -i {{ wireguard_interface }} -j ACCEPT; iptables -A FORWARD -o {{ wireguard_interface }} -j ACCEPT; iptables -t nat -A POSTROUTING -o {{ ansible_default_ipv4.interface }} -j MASQUERADE
+PostDown = iptables -D FORWARD -i {{ wireguard_interface }} -j ACCEPT; iptables -D FORWARD -o {{ wireguard_interface }} -j ACCEPT; iptables -t nat -D POSTROUTING -o {{ ansible_default_ipv4.interface }} -j MASQUERADE
+
+{% if wireguard_clients is defined %}
+{% for client in wireguard_clients if client.public_key | default('') %}
+# Client: {{ client.name }}
+[Peer]
+PublicKey = {{ client.public_key }}
+AllowedIPs = {{ client.ip }}/32
+{% if client.allowed_networks is defined %}
+# Routes for client access to internal networks
+{% for network in client.allowed_networks %}
+# Access to {{ network }}
+{% endfor %}
+{% endif %}
+
+{% endfor %}
+{% endif %}
16
ansible/roles/wireguard/vars/main.yml
Normal file
@@ -0,0 +1,16 @@
+---
+# WireGuard default variables
+wireguard_interface: "wg0"
+wireguard_port: 51820
+wireguard_network: "10.0.10.0/24"
+wireguard_server_ip: "10.0.10.1"
+wireguard_ipv6_enabled: false
+
+# Package dependencies
+wireguard_packages:
+  - wireguard
+  - wireguard-tools
+  - iptables-persistent
+
+# Firewall integration
+wireguard_firewall_enabled: true
@@ -1,29 +1,29 @@
-# Infrastructure Architecture
+# Architecture de l'Infrastructure

-## Overview
+## Aperçu

-This document describes the architecture of the AI Infrastructure running on Hetzner Cloud and dedicated servers. The system is designed for high-performance AI inference with cost optimization, automatic scaling, and production-grade reliability.
+Ce document décrit l'architecture de l'Infrastructure IA fonctionnant sur Hetzner Cloud et serveurs dédiés. Le système est conçu pour l'inférence IA haute performance avec optimisation des coûts, mise à l'échelle automatique et fiabilité de niveau production.

-## High-Level Architecture
+## Architecture de Haut Niveau

 ```mermaid
 graph TB
     Internet[Internet]
-    CF[CloudFlare Proxy<br/>Optional CDN/DDoS protection]
+    CF[CloudFlare Proxy<br/>Protection CDN/DDoS optionnelle]

     subgraph Cloud[Hetzner Cloud]
-        LB[HAProxy LB<br/>cx31 - 8CPU/32GB<br/>€22.68/month]
-        GW[API Gateway<br/>cx31 - 8CPU/32GB<br/>€22.68/month]
-        MON[Monitoring<br/>cx21 - 4CPU/16GB<br/>€11.76/month]
+        LB[HAProxy LB<br/>cx31 - 8CPU/32GB<br/>22,68€/mois]
+        GW[API Gateway<br/>cx31 - 8CPU/32GB<br/>22,68€/mois]
+        MON[Monitoring<br/>cx21 - 4CPU/16GB<br/>11,76€/mois]
     end

-    subgraph Dedicated[Hetzner Dedicated Servers]
-        GEX1[GEX44 #1<br/>vLLM API<br/>Mixtral-8x7B<br/>€184/month]
-        GEX2[GEX44 #2<br/>vLLM API<br/>Llama-70B<br/>€184/month]
-        GEX3[GEX44 #3<br/>vLLM API<br/>CodeLlama<br/>€184/month]
+    subgraph Dedicated[Serveurs Dédiés Hetzner]
+        GEX1[GEX44 #1<br/>API vLLM<br/>Mixtral-8x7B<br/>184€/mois]
+        GEX2[GEX44 #2<br/>API vLLM<br/>Llama-70B<br/>184€/mois]
+        GEX3[GEX44 #3<br/>API vLLM<br/>CodeLlama<br/>184€/mois]
     end

-    PrivateNet[Hetzner Private Network<br/>10.0.0.0/16 - VXLAN overlay]
+    PrivateNet[Réseau Privé Hetzner<br/>10.0.0.0/16 - Overlay VXLAN]

     Internet --> CF
     CF --> LB
@@ -45,21 +45,21 @@ graph TB
     MON -.-> PrivateNet
 ```

-## Component Details
+## Détails des Composants

-### 1. Load Balancer (HAProxy)
+### 1. Répartiteur de Charge (HAProxy)

-**Hardware**: Hetzner Cloud cx31 (8 vCPU, 32GB RAM)
-**Location**: Private IP 10.0.2.10
-**Role**: Traffic distribution, SSL termination, health checks
+**Matériel**: Hetzner Cloud cx31 (8 vCPU, 32GB RAM)
+**Localisation**: IP privée 10.0.2.10
+**Rôle**: Distribution du trafic, terminaison SSL, contrôles de santé

-**Features**:
-- Round-robin load balancing with health checks
-- SSL/TLS termination with automatic certificate renewal
-- Statistics dashboard (port 8404)
-- Request routing based on URL patterns
-- Rate limiting and DDoS protection
-- Prometheus metrics export
+**Fonctionnalités**:
+- Répartition de charge round-robin avec contrôles de santé
+- Terminaison SSL/TLS avec renouvellement automatique des certificats
+- Tableau de bord statistiques (port 8404)
+- Routage des requêtes basé sur les patterns d'URL
+- Limitation de débit et protection DDoS
+- Export des métriques Prometheus

 **Configuration**:
 ```haproxy
@@ -71,338 +71,338 @@ backend vllm_backend
     server gex44-3 10.0.1.12:8000 check
 ```

-### 2. API Gateway (Nginx)
+### 2. Passerelle API (Nginx)

-**Hardware**: Hetzner Cloud cx31 (8 vCPU, 32GB RAM)
-**Location**: Private IP 10.0.2.11
-**Role**: API management, authentication, rate limiting
+**Matériel**: Hetzner Cloud cx31 (8 vCPU, 32GB RAM)
+**Localisation**: IP privée 10.0.2.11
+**Rôle**: Gestion API, authentification, limitation de débit

-**Features**:
-- Request/response transformation
-- API versioning and routing
-- Authentication and authorization
-- Request/response logging
-- API analytics and metrics
-- Caching for frequently requested data
+**Fonctionnalités**:
+- Transformation requête/réponse
+- Versioning et routage API
+- Authentification et autorisation
+- Journalisation requête/réponse
+- Analytics et métriques API
+- Mise en cache des données fréquemment demandées

-### 3. GPU Servers (GEX44)
+### 3. Serveurs GPU (GEX44)

-**Hardware per server**:
-- CPU: Intel i5-13500 (12 cores, 20 threads)
+**Matériel par serveur**:
+- CPU: Intel i5-13500 (12 cœurs, 20 threads)
 - GPU: NVIDIA RTX 4000 Ada Generation (20GB VRAM)
 - RAM: 64GB DDR4
-- Storage: 2x 1.92TB NVMe SSD (RAID 1)
-- Network: 1 Gbit/s
+- Stockage: 2x 1.92TB NVMe SSD (RAID 1)
+- Réseau: 1 Gbit/s

-**Software Stack**:
+**Stack Logiciel**:
 - OS: Ubuntu 22.04 LTS
 - CUDA: 12.3
 - Python: 3.11
 - vLLM: 0.3.0+
 - Docker: 24.0.5

-**Network Configuration**:
-- Private IPs: 10.0.1.10, 10.0.1.11, 10.0.1.12
-- vLLM API: Port 8000
-- Metrics: Port 9835 (nvidia-smi-exporter)
-- Node metrics: Port 9100 (node-exporter)
+**Configuration Réseau**:
+- IPs privées: 10.0.1.10, 10.0.1.11, 10.0.1.12
+- API vLLM: Port 8000
+- Métriques: Port 9835 (nvidia-smi-exporter)
+- Métriques nœud: Port 9100 (node-exporter)

-### 4. Monitoring Stack
+### 4. Stack de Monitoring

-**Hardware**: Hetzner Cloud cx21 (4 vCPU, 16GB RAM)
-**Location**: Private IP 10.0.2.12
+**Matériel**: Hetzner Cloud cx21 (4 vCPU, 16GB RAM)
+**Localisation**: IP privée 10.0.2.12

-**Components**:
-- **Prometheus**: Metrics collection and storage
-- **Grafana**: Visualization and dashboards
-- **AlertManager**: Alert routing and notification
-- **Node Exporter**: System metrics
-- **nvidia-smi-exporter**: GPU metrics
+**Composants**:
+- **Prometheus**: Collection et stockage des métriques
+- **Grafana**: Visualisation et tableaux de bord
+- **AlertManager**: Routage et notification des alertes
+- **Node Exporter**: Métriques système
+- **nvidia-smi-exporter**: Métriques GPU

-## Network Architecture
+## Architecture Réseau

-### Private Network
+### Réseau Privé

 **CIDR**: 10.0.0.0/16
-**Subnets**:
-- Cloud servers: 10.0.2.0/24
-- GEX44 servers: 10.0.1.0/24
+**Sous-réseaux**:
+- Serveurs cloud: 10.0.2.0/24
+- Serveurs GEX44: 10.0.1.0/24

-### Security Groups
+### Groupes de Sécurité

-1. **SSH Access**: Port 22 (restricted IPs)
+1. **Accès SSH**: Port 22 (IPs restreintes)
 2. **HTTP/HTTPS**: Ports 80, 443 (public)
-3. **API Access**: Port 8000 (internal only)
-4. **Monitoring**: Ports 3000, 9090 (restricted)
-5. **Internal Communication**: All ports within private network
+3. **Accès API**: Port 8000 (interne uniquement)
+4. **Monitoring**: Ports 3000, 9090 (restreint)
+5. **Communication Interne**: Tous ports dans le réseau privé

-### Firewall Rules
+### Règles de Pare-feu

 ```yaml
-# Public access
-- HTTP (80) from 0.0.0.0/0
-- HTTPS (443) from 0.0.0.0/0
+# Accès public
+- HTTP (80) depuis 0.0.0.0/0
+- HTTPS (443) depuis 0.0.0.0/0

-# Management access (restrict to office IPs)
-- SSH (22) from office_cidr
-- Grafana (3000) from office_cidr
-- Prometheus (9090) from office_cidr
+# Accès de gestion (restreindre aux IPs bureau)
+- SSH (22) depuis office_cidr
+- Grafana (3000) depuis office_cidr
+- Prometheus (9090) depuis office_cidr

-# Internal communication
-- All traffic within 10.0.0.0/16
+# Communication interne
+- Tout trafic dans 10.0.0.0/16
 ```

-## Data Flow
+## Flux de Données

-### Inference Request Flow
+### Flux de Requête d'Inférence

-1. **Client** → **Load Balancer** (HAProxy)
-   - SSL termination
-   - Request routing
-   - Health check validation
+1. **Client** → **Répartiteur de Charge** (HAProxy)
+   - Terminaison SSL
+   - Routage des requêtes
+   - Validation des contrôles de santé

-2. **Load Balancer** → **GPU Server** (vLLM)
-   - HTTP request to /v1/chat/completions
-   - Model selection and processing
-   - Response generation
+2. **Répartiteur de Charge** → **Serveur GPU** (vLLM)
+   - Requête HTTP vers /v1/chat/completions
+   - Sélection et traitement du modèle
+   - Génération de réponse

-3. **GPU Server** → **Load Balancer** → **Client**
-   - JSON response with completion
-   - Usage metrics included
+3. **Serveur GPU** → **Répartiteur de Charge** → **Client**
+   - Réponse JSON avec complétion
+   - Métriques d'utilisation incluses

-### Monitoring Data Flow
+### Flux de Données de Monitoring

-1. **GPU Servers** → **Prometheus**
-   - nvidia-smi metrics (GPU utilization, temperature, memory)
-   - vLLM metrics (requests, latency, tokens)
-   - System metrics (CPU, memory, disk)
+1. **Serveurs GPU** → **Prometheus**
+   - Métriques nvidia-smi (utilisation GPU, température, mémoire)
+   - Métriques vLLM (requêtes, latence, tokens)
+   - Métriques système (CPU, mémoire, disque)

-2. **Load Balancer** → **Prometheus**
-   - HAProxy metrics (requests, response times, errors)
-   - Backend server health status
+2. **Répartiteur de Charge** → **Prometheus**
+   - Métriques HAProxy (requêtes, temps de réponse, erreurs)
+   - État de santé des serveurs backend

 3. **Prometheus** → **Grafana**
-   - Time-series data visualization
-   - Dashboard rendering
-   - Alert evaluation
+   - Visualisation des données de séries temporelles
+   - Rendu des tableaux de bord
+   - Évaluation des alertes

-## Storage Architecture
+## Architecture de Stockage

-### Model Storage
+### Stockage des Modèles

-**Location**: Each GEX44 server
-**Path**: `/opt/vllm/models/`
-**Size**: ~100GB per model
+**Localisation**: Chaque serveur GEX44
+**Chemin**: `/opt/vllm/models/`
+**Taille**: ~100GB par modèle

-**Models Stored**:
+**Modèles Stockés**:
 - Mixtral-8x7B-Instruct (87GB)
-- Llama-2-70B-Chat (140GB, quantized)
+- Llama-2-70B-Chat (140GB, quantifié)
 - CodeLlama-34B (68GB)

-### Shared Storage
+### Stockage Partagé

-**Type**: Hetzner Cloud Volume
-**Size**: 500GB
-**Mount**: `/mnt/shared`
-**Purpose**: Configuration, logs, backups
+**Type**: Volume Hetzner Cloud
+**Taille**: 500GB
+**Montage**: `/mnt/shared`
+**Objectif**: Configuration, journaux, sauvegardes

-### Backup Strategy
+### Stratégie de Sauvegarde

-**What is backed up**:
-- Terraform state files
-- Ansible configurations
-- Grafana dashboards
-- Prometheus configuration
-- Application logs (last 7 days)
+**Ce qui est sauvegardé**:
+- Fichiers d'état Terraform
+- Configurations Ansible
+- Tableaux de bord Grafana
+- Configuration Prometheus
+- Journaux d'application (7 derniers jours)

-**What is NOT backed up**:
-- Model files (re-downloadable)
-- Prometheus metrics (30-day retention)
-- Large log files (rotated automatically)
+**Ce qui n'est PAS sauvegardé**:
+- Fichiers de modèles (re-téléchargeables)
+- Métriques Prometheus (rétention 30 jours)
+- Gros fichiers de journaux (rotation automatique)

-## Scaling Architecture
+## Architecture de Mise à l'Échelle

-### Horizontal Scaling
+### Mise à l'Échelle Horizontale

-**Auto-scaling triggers**:
-- GPU utilization > 80% for 10 minutes → Scale up
-- GPU utilization < 30% for 30 minutes → Scale down
-- Queue depth > 50 requests → Immediate scale up
+**Déclencheurs d'auto-scaling**:
+- Utilisation GPU > 80% pendant 10 minutes → Monter en échelle
+- Utilisation GPU < 30% pendant 30 minutes → Réduire l'échelle
+- Profondeur de file > 50 requêtes → Montée immédiate en échelle

-**Scaling process**:
-1. Monitor metrics via Prometheus
-2. Autoscaler service evaluates conditions
-3. Order new GEX44 via Robot API
-4. Ansible configures new server
-5. Add to load balancer pool
+**Processus de mise à l'échelle**:
+1. Surveiller les métriques via Prometheus
+2. Le service d'autoscaler évalue les conditions
+3. Commande nouveau GEX44 via API Robot
+4. Ansible configure le nouveau serveur
+5. Ajout au pool du répartiteur de charge

-### Vertical Scaling
+### Mise à l'Échelle Verticale

-**Model optimization**:
-- Quantization (AWQ, GPTQ)
-- Tensor parallelism for large models
-- Memory optimization techniques
+**Optimisation des modèles**:
+- Quantization (AWQ, GPTQ)
+- Parallélisme tensoriel pour gros modèles
+- Techniques d'optimisation mémoire

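The auto-scaling triggers listed under horizontal scaling, together with the `min_gex44_count`, `max_gex44_count`, `scale_up_threshold` and `scale_down_threshold` values from the inventory, can be read as a small decision function. The sketch below is an assumption about how such an autoscaler might combine them, not the project's implementation: the `Metrics` class, the function name and the one-server step size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    gpu_utilization: float   # average GPU utilization, 0.0 to 1.0
    minutes_above_up: int    # consecutive minutes above the scale-up threshold
    minutes_below_down: int  # consecutive minutes below the scale-down threshold
    queue_depth: int         # pending requests


def desired_server_count(
    current: int,
    m: Metrics,
    *,
    scale_up_threshold: float = 0.8,
    scale_down_threshold: float = 0.3,
    min_count: int = 1,
    max_count: int = 10,
) -> int:
    """Apply the documented triggers, moving one server at a time (assumed step)."""
    if m.queue_depth > 50:
        target = current + 1  # immediate scale up
    elif m.gpu_utilization > scale_up_threshold and m.minutes_above_up >= 10:
        target = current + 1
    elif m.gpu_utilization < scale_down_threshold and m.minutes_below_down >= 30:
        target = current - 1
    else:
        target = current
    # Never leave the inventory bounds.
    return max(min_count, min(max_count, target))


busy = desired_server_count(3, Metrics(0.85, 12, 0, 5))
idle = desired_server_count(3, Metrics(0.20, 0, 45, 0))
capped = desired_server_count(10, Metrics(0.90, 20, 0, 80))
print(busy, idle, capped)
```

Checking the queue depth first reflects the "immediate" wording of that trigger; the duration conditions apply only to the utilization thresholds.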
-## High Availability
+## Haute Disponibilité

-### Redundancy
+### Redondance

-- **Load Balancer**: Single point (acceptable for cost/benefit)
-- **GPU Servers**: 3 servers minimum (N+1 redundancy)
-- **Monitoring**: Single instance with backup configuration
+- **Répartiteur de Charge**: Point unique (acceptable pour coût/bénéfice)
+- **Serveurs GPU**: 3 serveurs minimum (redondance N+1)
+- **Monitoring**: Instance unique avec configuration de sauvegarde

-### Failure Scenarios
+### Scénarios de Panne

-1. **Single GPU server failure**:
-   - Automatic removal from load balancer
-   - 66% capacity maintained
-   - Automatic replacement order
+1. **Panne d'un serveur GPU**:
+   - Suppression automatique du répartiteur de charge
+   - 66% de capacité maintenue
+   - Commande de remplacement automatique

-2. **Load balancer failure**:
-   - Manual failover to backup
-   - DNS change required
-   - ~10 minute downtime
+2. **Panne du répartiteur de charge**:
+   - Basculement manuel vers sauvegarde
+   - Changement DNS requis
+   - ~10 minutes d'arrêt

-3. **Network partition**:
-   - Private network redundancy
-   - Automatic retry logic
-   - Graceful degradation
+3. **Partition réseau**:
+   - Redondance du réseau privé
+   - Logique de retry automatique
+   - Dégradation gracieuse

-## Security Architecture
+## Architecture de Sécurité

-### Network Security
+### Sécurité Réseau

-- Private network isolation
-- Firewall rules at multiple levels
-- No direct internet access to GPU servers
-- VPN for administrative access
+- Isolation du réseau privé
+- Règles de pare-feu à plusieurs niveaux
+- Pas d'accès internet direct aux serveurs GPU
+- VPN pour accès administratif

-### Application Security
+### Sécurité Application

-- API rate limiting
-- Request validation
-- Input sanitization
-- Output filtering
+- Limitation de débit API
+- Validation des requêtes
+- Sanitisation des entrées
+- Filtrage des sorties

-### Infrastructure Security
+### Sécurité Infrastructure

-- SSH key-based authentication
-- Regular security updates
-- Intrusion detection
-- Log monitoring
+- Authentification basée sur clés SSH
+- Mises à jour de sécurité régulières
+- Détection d'intrusion
+- Surveillance des journaux

-## Performance Characteristics
+## Caractéristiques de Performance

-### Latency
+### Latence

-- **P50**: <1.5 seconds
-- **P95**: <3 seconds
-- **P99**: <5 seconds
+- **P50**: <1.5 secondes
+- **P95**: <3 secondes
+- **P99**: <5 secondes

-### Throughput
+### Débit

-- **Total**: ~255 tokens/second (3 servers)
-- **Per server**: ~85 tokens/second
-- **Max RPS**: ~50 requests/second
+- **Total**: ~255 tokens/seconde (3 serveurs)
+- **Par serveur**: ~85 tokens/seconde
+- **RPS Max**: ~50 requêtes/seconde

-### Resource Utilization
+### Utilisation des Ressources

-- **GPU**: 65-75% average utilization
-- **CPU**: 30-40% average utilization
-- **Memory**: 70-80% utilization (model loading)
-- **Network**: <100 Mbps typical
+- **GPU**: 65-75% utilisation moyenne
+- **CPU**: 30-40% utilisation moyenne
+- **Mémoire**: 70-80% utilisation (chargement modèle)
+- **Réseau**: <100 Mbps typique

|
||||
## Cost Breakdown
|
||||
## Répartition des Coûts
|
||||
|
||||
### Monthly Costs (EUR)
|
||||
### Coûts Mensuels (EUR)
|
||||
|
||||
| Component | Quantity | Unit Cost | Total |
|
||||
|-----------|----------|-----------|--------|
|
||||
| GEX44 Servers | 3 | €184 | €552 |
|
||||
| cx31 (LB) | 1 | €22.68 | €22.68 |
|
||||
| cx31 (API GW) | 1 | €22.68 | €22.68 |
|
||||
| cx21 (Monitor) | 1 | €11.76 | €11.76 |
|
||||
| Storage | 500GB | €0.05/GB | €25 |
|
||||
| **Total** | | | **€634.12** |
|
||||
| Composant | Quantité | Coût Unitaire | Total |
|
||||
|-----------|----------|---------------|--------|
|
||||
| Serveurs GEX44 | 3 | 184€ | 552€ |
|
||||
| cx31 (LB) | 1 | 22,68€ | 22,68€ |
|
||||
| cx31 (API GW) | 1 | 22,68€ | 22,68€ |
|
||||
| cx21 (Monitor) | 1 | 11,76€ | 11,76€ |
|
||||
| Stockage | 500GB | 0,05€/GB | 25€ |
|
||||
| **Total** | | | **634,12€** |
|
||||
|
||||
### Cost per Request
|
||||
### Coût par Requête
|
||||
|
||||
At 100,000 requests/day:
|
||||
- Monthly requests: 3,000,000
|
||||
- Cost per request: €0.0002
|
||||
- Cost per token: €0.0000025
|
||||
À 100 000 requêtes/jour:
|
||||
- Requêtes mensuelles: 3 000 000
|
||||
- Coût par requête: 0,0002€
|
||||
- Coût par token: 0,0000025€
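
The figures above can be reproduced with a few lines of arithmetic. This sketch copies quantities and unit costs from the cost table and uses a 30-day month, which matches the 3,000,000 monthly requests quoted. The average of 80 tokens per request is an assumption (the source does not state it); it is chosen because it lands close to the quoted per-token cost.

```python
# Monthly costs in EUR, copied from the table above: (quantity, unit cost).
MONTHLY_COSTS = {
    "GEX44 servers": (3, 184.00),
    "cx31 (LB)": (1, 22.68),
    "cx31 (API GW)": (1, 22.68),
    "cx21 (Monitor)": (1, 11.76),
    "Storage (GB)": (500, 0.05),
}

REQUESTS_PER_DAY = 100_000
DAYS_PER_MONTH = 30        # 100,000 x 30 = 3,000,000 requests, as quoted above
TOKENS_PER_REQUEST = 80    # assumption, not stated in the source


def monthly_total(costs=MONTHLY_COSTS):
    """Sum quantity x unit cost over all components, rounded to cents."""
    return round(sum(qty * unit for qty, unit in costs.values()), 2)


def cost_per_request(total, requests_per_day=REQUESTS_PER_DAY, days=DAYS_PER_MONTH):
    """Spread the monthly total evenly over all requests in the month."""
    return total / (requests_per_day * days)


if __name__ == "__main__":
    total = monthly_total()
    per_request = cost_per_request(total)
    per_token = per_request / TOKENS_PER_REQUEST
    print(f"monthly total:    EUR {total:.2f}")
    print(f"cost per request: EUR {per_request:.6f}")
    print(f"cost per token:   EUR {per_token:.8f}")
```

Because the infrastructure cost is fixed, the per-request cost scales inversely with traffic: halving the daily request count doubles it.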
|
||||
|
||||
## Disaster Recovery
|
||||
## Reprise après Sinistre
|
||||
|
||||
### Backup Procedures
|
||||
### Procédures de Sauvegarde
|
||||
|
||||
1. **Daily**: Configuration backup to cloud storage
|
||||
2. **Weekly**: Full system state backup
|
||||
3. **Monthly**: Disaster recovery test
|
||||
1. **Quotidien**: Sauvegarde configuration vers stockage cloud
|
||||
2. **Hebdomadaire**: Sauvegarde complète état système
|
||||
3. **Mensuel**: Test de reprise après sinistre
|
||||
|
||||
### Recovery Procedures
|
||||
### Procédures de Récupération
|
||||
|
||||
1. **Infrastructure**: Terraform state restoration
|
||||
2. **Configuration**: Ansible playbook execution
|
||||
3. **Models**: Re-download from HuggingFace
|
||||
4. **Data**: Restore from backup storage
|
||||
1. **Infrastructure**: Restauration état Terraform
|
||||
2. **Configuration**: Exécution playbooks Ansible
|
||||
3. **Modèles**: Re-téléchargement depuis HuggingFace
|
||||
4. **Données**: Restauration depuis stockage de sauvegarde
|
||||
|
||||
### RTO/RPO Targets
|
||||
### Objectifs RTO/RPO
|
||||
|
||||
- **RTO**: 2 hours (Recovery Time Objective)
|
||||
- **RPO**: 24 hours (Recovery Point Objective)
|
||||
- **RTO**: 2 heures (Objectif Temps de Récupération)
|
||||
- **RPO**: 24 heures (Objectif Point de Récupération)
|
||||
|
||||
## Monitoring and Alerting
|
||||
## Surveillance et Alertes
|
||||
|
||||
### Key Metrics
|
||||
### Métriques Clés
|
||||
|
||||
**Infrastructure**:
|
||||
- GPU utilization and temperature
|
||||
- Memory usage and availability
|
||||
- Network throughput
|
||||
- Storage usage
|
||||
- Utilisation et température GPU
|
||||
- Utilisation et disponibilité mémoire
|
||||
- Débit réseau
|
||||
- Utilisation stockage
|
||||
|
||||
**Application**:
|
||||
- Request rate and latency
|
||||
- Error rate and types
|
||||
- Token generation rate
|
||||
- Queue depth
|
||||
- Taux et latence des requêtes
|
||||
- Taux et types d'erreurs
|
||||
- Taux de génération de tokens
|
||||
- Profondeur de file
|
||||
|
||||
**Business**:
|
||||
- Cost per request
|
||||
- Revenue per request
|
||||
- SLA compliance
|
||||
- User satisfaction
|
||||
- Coût par requête
|
||||
- Revenus par requête
|
||||
- Conformité SLA
|
||||
- Satisfaction utilisateur
|
||||
|
||||
### Alert Levels
|
||||
### Niveaux d'Alerte
|
||||
|
||||
1. **Info**: Cost optimization opportunities
|
||||
2. **Warning**: Performance degradation
|
||||
3. **Critical**: Service outage or severe issues
|
||||
1. **Info**: Opportunités d'optimisation des coûts
|
||||
2. **Warning**: Dégradation des performances
|
||||
3. **Critique**: Panne de service ou problèmes graves
|
||||
|
||||
## Future Architecture Considerations
|
||||
## Considérations Architecturales Futures
|
||||
|
||||
### Planned Improvements
|
||||
### Améliorations Prévues
|
||||
|
||||
1. **Multi-region deployment** (Q4 2024)
|
||||
- Nuremberg + Helsinki regions
|
||||
- Cross-region load balancing
|
||||
- Improved latency for global users
|
||||
1. **Déploiement multi-région** (T4 2024)
|
||||
- Régions Nuremberg + Helsinki
|
||||
- Répartition de charge inter-régions
|
||||
- Latence améliorée pour utilisateurs globaux
|
||||
|
||||
2. **Advanced auto-scaling** (Q1 2025)
|
||||
- Predictive scaling based on usage patterns
|
||||
- Spot instance integration
|
||||
- More sophisticated cost optimization
|
||||
2. **Auto-scaling avancé** (T1 2025)
|
||||
- Mise à l'échelle prédictive basée sur patterns d'usage
|
||||
- Intégration instances spot
|
||||
- Optimisation coûts plus sophistiquée
|
||||
|
||||
3. **Edge deployment** (Q2 2025)
|
||||
- Smaller models at edge locations
|
||||
- Reduced latency for simple requests
|
||||
- Hybrid edge-cloud architecture
|
||||
3. **Déploiement edge** (T2 2025)
|
||||
- Modèles plus petits aux emplacements edge
|
||||
- Latence réduite pour requêtes simples
|
||||
- Architecture hybride edge-cloud
|
||||
|
||||
### Technology Evolution
|
||||
### Évolution Technologique
|
||||
|
||||
- **Hardware**: Migration to H100 when cost-effective
|
||||
- **Software**: Continuous optimization of inference stack
|
||||
- **Networking**: 10 Gbit/s upgrade for high-throughput scenarios
|
||||
- **Matériel**: Migration vers H100 quand rentable
|
||||
- **Logiciel**: Optimisation continue de la stack d'inférence
|
||||
- **Réseau**: Upgrade 10 Gbit/s pour scénarios haut débit
|
||||
|
||||
This architecture provides a solid foundation for scaling from thousands to millions of requests per day while maintaining cost efficiency and performance.
|
||||
Cette architecture fournit une base solide pour passer de milliers à millions de requêtes par jour tout en maintenant l'efficacité coût et les performances.
|
||||
@ -1,37 +1,37 @@
|
||||
# Deployment Guide
|
||||
# Guide de Déploiement
|
||||
|
||||
This guide provides step-by-step instructions for deploying the AI Infrastructure on Hetzner Cloud and dedicated servers.
|
||||
Ce guide fournit des instructions étape par étape pour déployer l'Infrastructure IA sur Hetzner Cloud et les serveurs dédiés.
|
||||
|
||||
## Prerequisites
|
||||
## Prérequis
|
||||
|
||||
Before starting the deployment, ensure you have:
|
||||
Avant de commencer le déploiement, assurez-vous d'avoir :
|
||||
|
||||
### Required Accounts and Access
|
||||
### Comptes et Accès Requis
|
||||
|
||||
1. **Hetzner Cloud Account**
|
||||
- API token with read/write permissions
|
||||
- Budget sufficient for cloud resources (~€60/month)
|
||||
1. **Compte Hetzner Cloud**
|
||||
- Token API avec permissions lecture/écriture
|
||||
- Budget suffisant pour les ressources cloud (~60€/mois)
|
||||
|
||||
2. **Hetzner Robot Account**
|
||||
- API credentials for dedicated server management
|
||||
- Budget for GEX44 servers (€184/month each)
|
||||
2. **Compte Hetzner Robot**
|
||||
- Identifiants API pour la gestion des serveurs dédiés
|
||||
- Budget pour les serveurs GEX44 (184€/mois chacun)
|
||||
|
||||
3. **GitLab Account** (for CI/CD)
|
||||
- Project with CI/CD pipelines enabled
|
||||
- Variables configured for secrets
|
||||
3. **Compte GitLab** (pour CI/CD)
|
||||
- Projet avec pipelines CI/CD activés
|
||||
- Variables configurées pour les secrets
|
||||
|
||||
### Local Development Environment
|
||||
### Environnement de Développement Local
|
||||
|
||||
```bash
|
||||
# Required tools
|
||||
# Outils requis
|
||||
terraform >= 1.5.0
|
||||
ansible >= 8.0.0
|
||||
kubectl >= 1.28.0 # Optional
|
||||
kubectl >= 1.28.0 # Optionnel
|
||||
docker >= 24.0.0
|
||||
python >= 3.11
|
||||
go >= 1.21 # For testing
|
||||
go >= 1.21 # Pour les tests
|
||||
|
||||
# Install tools on Ubuntu/Debian
|
||||
# Installation des outils sur Ubuntu/Debian
|
||||
sudo apt update
|
||||
sudo apt install -y software-properties-common
|
||||
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
|
||||
@ -39,96 +39,96 @@ sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(l
|
||||
sudo apt update
|
||||
sudo apt install terraform ansible python3-pip docker.io
|
||||
|
||||
# Install additional tools
|
||||
# Installation d'outils supplémentaires
|
||||
pip3 install ansible-lint molecule[docker]
|
||||
```
|
||||
|
||||
### SSH Key Setup
|
||||
### Configuration des Clés SSH
|
||||
|
||||
```bash
|
||||
# Generate SSH key for server access
|
||||
# Générer une clé SSH pour l'accès au serveur
|
||||
ssh-keygen -t rsa -b 4096 -f ~/.ssh/hetzner_key -C "ai-infrastructure"
|
||||
|
||||
# Add to SSH agent
|
||||
# Ajouter à l'agent SSH
|
||||
ssh-add ~/.ssh/hetzner_key
|
||||
|
||||
# Copy public key content
|
||||
# Copier le contenu de la clé publique
|
||||
cat ~/.ssh/hetzner_key.pub
|
||||
```
|
||||
|
||||
## Pre-Deployment Checklist
|
||||
## Liste de Vérification Pré-Déploiement
|
||||
|
||||
### 1. Order GEX44 Servers
|
||||
### 1. Commander les Serveurs GEX44
|
||||
|
||||
**Important**: GEX44 servers must be ordered manually through Hetzner Robot portal or API.
|
||||
**Important** : Les serveurs GEX44 doivent être commandés manuellement via le portail Hetzner Robot ou l'API.
|
||||
|
||||
```bash
|
||||
# Order via Robot API (optional)
|
||||
# Commander via l'API Robot (optionnel)
|
||||
curl -X POST https://robot-ws.your-server.de/order/server \
|
||||
-H "Authorization: Basic $(echo -n 'username:password' | base64)" \
|
||||
-d "product_id=GEX44&location=FSN1-DC14&os=ubuntu-22.04"
|
||||
```
|
||||
|
||||
**Manual ordering steps**:
|
||||
1. Login to [Robot Console](https://robot.your-server.de/)
|
||||
2. Navigate to "Order" → "Dedicated Servers"
|
||||
3. Select GEX44 configuration:
|
||||
   - Location: FSN1-DC14 (Falkenstein)
|
||||
- OS: Ubuntu 22.04 LTS
|
||||
- Quantity: 3 (for production)
|
||||
4. Complete payment and wait for provisioning (2-24 hours)
|
||||
**Étapes de commande manuelle** :
|
||||
1. Se connecter à la [Console Robot](https://robot.your-server.de/)
|
||||
2. Naviguer vers "Order" → "Dedicated Servers"
|
||||
3. Sélectionner la configuration GEX44 :
|
||||
   - Localisation : FSN1-DC14 (Falkenstein)
|
||||
- OS : Ubuntu 22.04 LTS
|
||||
- Quantité : 3 (pour la production)
|
||||
4. Finaliser le paiement et attendre le provisioning (2-24 heures)
|
||||
|
||||
### 2. Configure Environment Variables
|
||||
### 2. Configurer les Variables d'Environnement
|
||||
|
||||
Create environment file:
|
||||
Créer le fichier d'environnement :
|
||||
|
||||
```bash
|
||||
# Copy example environment file
|
||||
# Copier le fichier d'environnement exemple
|
||||
cp .env.example .env
|
||||
|
||||
# Edit with your credentials
|
||||
# Éditer avec vos identifiants
|
||||
vim .env
|
||||
```
|
||||
|
||||
Required variables:
|
||||
Variables requises :
|
||||
|
||||
```bash
|
||||
# Hetzner credentials
|
||||
# Identifiants Hetzner
|
||||
HCLOUD_TOKEN=your_hcloud_token_here
|
||||
ROBOT_API_USER=your_robot_username
|
||||
ROBOT_API_PASSWORD=your_robot_password
|
||||
|
||||
# SSH configuration
|
||||
# Configuration SSH
|
||||
SSH_PUBLIC_KEY="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQ..."
|
||||
SSH_PRIVATE_KEY_PATH=~/.ssh/hetzner_key
|
||||
|
||||
# Domain configuration (optional)
|
||||
# Configuration du domaine (optionnel)
|
||||
API_DOMAIN=api.yourdomain.com
|
||||
MONITORING_DOMAIN=monitoring.yourdomain.com
|
||||
|
||||
# Monitoring credentials
|
||||
# Identifiants de surveillance
|
||||
GRAFANA_ADMIN_PASSWORD=secure_password_here
|
||||
|
||||
# GitLab CI/CD
|
||||
GITLAB_TOKEN=your_gitlab_token
|
||||
ANSIBLE_VAULT_PASSWORD=secure_vault_password
|
||||
|
||||
# Cost tracking
|
||||
# Suivi des coûts
|
||||
PROJECT_NAME=ai-infrastructure
|
||||
COST_CENTER=engineering
|
||||
|
||||
# Auto-scaling configuration
|
||||
# Configuration d'auto-scaling
|
||||
MIN_GEX44_COUNT=1
|
||||
MAX_GEX44_COUNT=5
|
||||
SCALE_UP_THRESHOLD=0.8
|
||||
SCALE_DOWN_THRESHOLD=0.3
|
||||
```
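
Before running `make validate`, it can help to catch missing or inconsistent settings early. The sketch below is an illustration, not the project's own validation: the variable names come from the listing above, but which of them are mandatory, the fallback defaults, and the consistency rules for the auto-scaling values are assumptions.

```python
import os

# Names taken from the .env listing above; treating these as mandatory is an assumption.
REQUIRED_VARS = [
    "HCLOUD_TOKEN",
    "ROBOT_API_USER",
    "ROBOT_API_PASSWORD",
    "SSH_PUBLIC_KEY",
    "SSH_PRIVATE_KEY_PATH",
    "GRAFANA_ADMIN_PASSWORD",
    "GITLAB_TOKEN",
    "ANSIBLE_VAULT_PASSWORD",
]


def validate_env(env):
    """Return a list of problems; an empty list means the settings look sane."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    try:
        # Defaults mirror the example values shown in the listing.
        min_count = int(env.get("MIN_GEX44_COUNT", "1"))
        max_count = int(env.get("MAX_GEX44_COUNT", "5"))
        scale_up = float(env.get("SCALE_UP_THRESHOLD", "0.8"))
        scale_down = float(env.get("SCALE_DOWN_THRESHOLD", "0.3"))
    except ValueError as exc:
        problems.append(f"auto-scaling value is not numeric: {exc}")
        return problems

    if not 1 <= min_count <= max_count:
        problems.append("expected 1 <= MIN_GEX44_COUNT <= MAX_GEX44_COUNT")
    if not 0.0 < scale_down < scale_up <= 1.0:
        problems.append("expected 0 < SCALE_DOWN_THRESHOLD < SCALE_UP_THRESHOLD <= 1")
    return problems


if __name__ == "__main__":
    for problem in validate_env(os.environ):
        print(f"config problem: {problem}")
```

Keeping the scale-down threshold strictly below the scale-up threshold leaves a dead band between them, which avoids adding and removing servers in quick succession when load hovers around a single value.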
|
||||
|
||||
### 3. Configure Terraform Backend
|
||||
### 3. Configurer le Backend Terraform
|
||||
|
||||
Choose your state backend:
|
||||
Choisir votre backend d'état :
|
||||
|
||||
#### Option A: GitLab Backend (Recommended)
|
||||
#### Option A : Backend GitLab (Recommandé)
|
||||
|
||||
```hcl
|
||||
# terraform/backend.tf
|
||||
@ -146,7 +146,7 @@ terraform {
|
||||
}
|
||||
```
|
||||
|
||||
#### Option B: S3-Compatible Backend
|
||||
#### Option B : Backend Compatible S3
|
||||
|
||||
```hcl
|
||||
# terraform/backend.tf
|
||||
@ -163,172 +163,172 @@ terraform {
|
||||
}
|
||||
```
|
||||
|
||||
## Deployment Process
|
||||
## Processus de Déploiement
|
||||
|
||||
### Step 1: Initial Setup
|
||||
### Étape 1 : Configuration Initiale
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
# Cloner le dépôt
|
||||
git clone https://github.com/yourorg/ai-infrastructure.git
|
||||
cd ai-infrastructure
|
||||
|
||||
# Install dependencies
|
||||
# Installer les dépendances
|
||||
make setup
|
||||
|
||||
# Validate configuration
|
||||
# Valider la configuration
|
||||
make validate
|
||||
```
|
||||
|
||||
### Step 2: Development Environment
|
||||
### Étape 2 : Environnement de Développement
|
||||
|
||||
Start with a development deployment to test the configuration:
|
||||
Commencer par un déploiement de développement pour tester la configuration :
|
||||
|
||||
```bash
|
||||
# Deploy development environment
|
||||
# Déployer l'environnement de développement
|
||||
make deploy-dev
|
||||
|
||||
# Wait for completion (15-20 minutes)
|
||||
# Check deployment status
|
||||
# Attendre la finalisation (15-20 minutes)
|
||||
# Vérifier le statut du déploiement
|
||||
make status ENV=dev
|
||||
|
||||
# Test the deployment
|
||||
# Tester le déploiement
|
||||
make test ENV=dev
|
||||
```
|
||||
|
||||
### Step 3: Staging Environment
|
||||
### Étape 3 : Environnement de Staging
|
||||
|
||||
Once development is working, deploy staging:
|
||||
Une fois que le développement fonctionne, déployer le staging :
|
||||
|
||||
```bash
|
||||
# Plan staging deployment
|
||||
# Planifier le déploiement staging
|
||||
make plan ENV=staging
|
||||
|
||||
# Review the plan carefully
|
||||
# Deploy staging
|
||||
# Examiner attentivement le plan
|
||||
# Déployer le staging
|
||||
make deploy-staging
|
||||
|
||||
# Run integration tests
|
||||
# Exécuter les tests d'intégration
|
||||
make test-load API_URL=https://api-staging.yourdomain.com
|
||||
```
|
||||
|
||||
### Step 4: Production Deployment
|
||||
### Étape 4 : Déploiement Production
|
||||
|
||||
**Warning**: Production deployment should be done during maintenance windows.
|
||||
**Attention** : Le déploiement en production doit être effectué pendant les fenêtres de maintenance.
|
||||
|
||||
```bash
|
||||
# Create backup of current state
|
||||
# Créer une sauvegarde de l'état actuel
|
||||
make backup ENV=production
|
||||
|
||||
# Plan production deployment
|
||||
# Planifier le déploiement production
|
||||
make plan ENV=production
|
||||
|
||||
# Review plan with team
|
||||
# Get approval for production deployment
|
||||
# Examiner le plan avec l'équipe
|
||||
# Obtenir l'approbation pour le déploiement production
|
||||
|
||||
# Deploy production (requires manual confirmation)
|
||||
# Déployer en production (nécessite confirmation manuelle)
|
||||
make deploy-prod
|
||||
|
||||
# Verify deployment
|
||||
# Vérifier le déploiement
|
||||
make status ENV=production
|
||||
make test ENV=production
|
||||
```
|
||||
|
||||
## Detailed Deployment Steps
|
||||
## Étapes Détaillées de Déploiement
|
||||
|
||||
### Infrastructure Deployment (Terraform)
|
||||
### Déploiement d'Infrastructure (Terraform)
|
||||
|
||||
```bash
|
||||
# Navigate to terraform directory
|
||||
# Naviguer vers le répertoire terraform
|
||||
cd terraform/environments/production
|
||||
|
||||
# Initialize Terraform
|
||||
# Initialiser Terraform
|
||||
terraform init
|
||||
|
||||
# Create execution plan
|
||||
# Créer le plan d'exécution
|
||||
terraform plan -out=production.tfplan
|
||||
|
||||
# Review the plan
|
||||
# Examiner le plan
|
||||
terraform show production.tfplan
|
||||
|
||||
# Apply the plan
|
||||
# Appliquer le plan
|
||||
terraform apply production.tfplan
|
||||
```
|
||||
|
||||
Expected resources to be created:
|
||||
- 1x Private network (10.0.0.0/16)
|
||||
- 2x Subnets (cloud and GEX44)
|
||||
- 4x Firewall rules
|
||||
- 3x Cloud servers (LB, API GW, Monitoring)
|
||||
Ressources attendues à créer :
|
||||
- 1x Réseau privé (10.0.0.0/16)
|
||||
- 2x Sous-réseaux (cloud et GEX44)
|
||||
- 4x Règles de pare-feu
|
||||
- 3x Serveurs cloud (LB, API GW, Monitoring)
|
||||
- 1x Volume (500GB)
|
||||
- Various security groups
|
||||
- Divers groupes de sécurité
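
The expected resource counts can be compared with a saved plan before applying it. `terraform show -json production.tfplan` prints a JSON document whose `resource_changes` array lists each resource with its `type` and planned `change.actions`. The sketch below counts the resources a plan would create. The `hcloud_*` type names belong to the Hetzner Cloud provider, but the assumption that this project's modules use exactly these types, and the helper names, are illustrative only.

```python
import json
from collections import Counter

# Counts derived from the list above; the mapping to hcloud_* types is an assumption.
EXPECTED = {
    "hcloud_network": 1,
    "hcloud_network_subnet": 2,
    "hcloud_server": 3,
    "hcloud_volume": 1,
}


def created_resources(plan):
    """Count, per resource type, how many resources the plan would create."""
    counts = Counter()
    for change in plan.get("resource_changes", []):
        if "create" in change.get("change", {}).get("actions", []):
            counts[change["type"]] += 1
    return counts


def unexpected_counts(plan, expected=EXPECTED):
    """Return {type: (expected, actual)} for every type whose count differs."""
    actual = created_resources(plan)
    return {
        rtype: (wanted, actual.get(rtype, 0))
        for rtype, wanted in expected.items()
        if actual.get(rtype, 0) != wanted
    }


def load_plan(path):
    """Read a plan previously exported with `terraform show -json`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A possible workflow is to export the plan with `terraform show -json production.tfplan > plan.json` and then call `unexpected_counts(load_plan("plan.json"))`; an empty dictionary means the plan matches the list. Firewall rules are not counted here because they are defined inside firewall resources rather than as separate resources.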
|
||||
|
||||
### Server Configuration (Ansible)
|
||||
### Configuration des Serveurs (Ansible)
|
||||
|
||||
```bash
|
||||
# Navigate to ansible directory
|
||||
# Naviguer vers le répertoire ansible
|
||||
cd ansible
|
||||
|
||||
# Test connectivity
|
||||
# Tester la connectivité
|
||||
ansible all -i inventory/production.yml -m ping
|
||||
|
||||
# Run full configuration
|
||||
# Exécuter la configuration complète
|
||||
ansible-playbook -i inventory/production.yml playbooks/site.yml
|
||||
|
||||
# Verify services are running
|
||||
# Vérifier que les services fonctionnent
|
||||
ansible all -i inventory/production.yml -a "systemctl status vllm-api"
|
||||
```
|
||||
|
||||
### GEX44 Configuration
|
||||
### Configuration GEX44
|
||||
|
||||
The GEX44 servers require special handling due to their dedicated nature:
|
||||
Les serveurs GEX44 nécessitent une manipulation spéciale due à leur nature dédiée :
|
||||
|
||||
```bash
|
||||
# Configure GEX44 servers specifically
|
||||
# Configurer spécifiquement les serveurs GEX44
|
||||
ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml
|
||||
|
||||
# Wait for model downloads (can take 1-2 hours)
|
||||
# Monitor progress
|
||||
# Attendre les téléchargements de modèles (peut prendre 1-2 heures)
|
||||
# Surveiller le progrès
|
||||
ansible gex44 -i inventory/production.yml -a "tail -n 50 /var/log/vllm/model-download.log"
|
||||
|
||||
# Verify GPU accessibility
|
||||
# Vérifier l'accessibilité GPU
|
||||
ansible gex44 -i inventory/production.yml -a "nvidia-smi"
|
||||
|
||||
# Test vLLM API
|
||||
# Tester l'API vLLM
|
||||
ansible gex44 -i inventory/production.yml -a "curl -f http://localhost:8000/health"
|
||||
```
|
||||
|
||||
### Load Balancer Configuration
|
||||
### Configuration du Load Balancer
|
||||
|
||||
```bash
|
||||
# Configure HAProxy load balancer
|
||||
# Configurer le load balancer HAProxy
|
||||
ansible-playbook -i inventory/production.yml playbooks/load-balancer-setup.yml
|
||||
|
||||
# Test load balancer
|
||||
# Tester le load balancer
|
||||
curl -f http://LOAD_BALANCER_IP/health
|
||||
|
||||
# Check HAProxy stats
|
||||
# Vérifier les statistiques HAProxy
|
||||
curl http://LOAD_BALANCER_IP:8404/stats
|
||||
```
|
||||
|
||||
### Monitoring Setup
|
||||
### Configuration de la Surveillance
|
||||
|
||||
```bash
|
||||
# Configure monitoring stack
|
||||
# Configurer la pile de surveillance
|
||||
ansible-playbook -i inventory/production.yml playbooks/monitoring-setup.yml
|
||||
|
||||
# Access Grafana (after DNS setup)
|
||||
# Accéder à Grafana (après configuration DNS)
|
||||
open https://monitoring.yourdomain.com
|
||||
|
||||
# Default credentials:
|
||||
# Username: admin
|
||||
# Password: (from GRAFANA_ADMIN_PASSWORD)
|
||||
# Identifiants par défaut :
|
||||
# Nom d'utilisateur : admin
|
||||
# Mot de passe : (depuis GRAFANA_ADMIN_PASSWORD)
|
||||
```
|
||||
|
||||
## Post-Deployment Configuration
|
||||
## Configuration Post-Déploiement
|
||||
|
||||
### 1. DNS Configuration
|
||||
### 1. Configuration DNS
|
||||
|
||||
Update your DNS records to point to the deployed infrastructure:
|
||||
Mettre à jour vos enregistrements DNS pour pointer vers l'infrastructure déployée :
|
||||
|
||||
```dns
|
||||
api.yourdomain.com. 300 IN A LOAD_BALANCER_IP
|
||||
@ -336,233 +336,233 @@ monitoring.yourdomain.com. 300 IN A MONITORING_IP
|
||||
*.api.yourdomain.com. 300 IN A LOAD_BALANCER_IP
|
||||
```
|
||||
|
||||
### 2. SSL Certificate Setup
|
||||
### 2. Configuration des Certificats SSL
|
||||
|
||||
```bash
|
||||
# Let's Encrypt certificates (automatic)
|
||||
# Certificats Let's Encrypt (automatique)
|
||||
ansible-playbook -i inventory/production.yml playbooks/ssl-setup.yml
|
||||
|
||||
# Or manually with certbot
|
||||
# Ou manuellement avec certbot
|
||||
sudo certbot --nginx -d api.yourdomain.com -d monitoring.yourdomain.com
|
||||
```
|
||||
|
||||
### 3. Monitoring Configuration
|
||||
### 3. Configuration de la Surveillance
|
||||
|
||||
#### Grafana Dashboards
|
||||
#### Tableaux de Bord Grafana
|
||||
|
||||
1. Login to Grafana at https://monitoring.yourdomain.com
|
||||
2. Import pre-built dashboards from `monitoring/grafana/dashboards/`
|
||||
3. Configure alert channels (email, Slack, etc.)
|
||||
1. Se connecter à Grafana sur https://monitoring.yourdomain.com
|
||||
2. Importer les tableaux de bord pré-construits depuis `monitoring/grafana/dashboards/`
|
||||
3. Configurer les canaux d'alerte (email, Slack, etc.)
|
||||
|
||||
#### Prometheus Alerts
|
||||
#### Alertes Prometheus
|
||||
|
||||
Alerts are automatically configured, but you may want to customize:
|
||||
Les alertes sont automatiquement configurées, mais vous pourriez vouloir personnaliser :
|
||||
|
||||
```bash
|
||||
# Edit alert rules
|
||||
# Éditer les règles d'alerte
|
||||
vim monitoring/prometheus/alerts.yml
|
||||
|
||||
# Reload Prometheus configuration
|
||||
# Recharger la configuration Prometheus
|
||||
ansible monitoring -i inventory/production.yml -a "systemctl reload prometheus"
|
||||
```
|
||||
|
||||
### 4. Backup Configuration
|
||||
### 4. Configuration de Sauvegarde
|
||||
|
||||
```bash
|
||||
# Setup automated backups
|
||||
# Configurer les sauvegardes automatisées
|
||||
ansible-playbook -i inventory/production.yml playbooks/backup-setup.yml
|
||||
|
||||
# Test backup process
|
||||
# Tester le processus de sauvegarde
|
||||
make backup ENV=production
|
||||
|
||||
# Verify backup files
|
||||
# Vérifier les fichiers de sauvegarde
|
||||
ls -la backups/$(date +%Y%m%d)/
|
||||
```
|
||||
|
||||
## Validation and Testing
|
||||
## Validation et Tests
|
||||
|
||||
### Health Checks
|
||||
### Contrôles de Santé
|
||||
|
||||
```bash
|
||||
# Infrastructure health
|
||||
# Santé de l'infrastructure
|
||||
make status ENV=production
|
||||
|
||||
# API health
|
||||
# Santé de l'API
|
||||
curl -f https://api.yourdomain.com/health
|
||||
|
||||
# Monitoring health
|
||||
# Santé de la surveillance
|
||||
curl -f https://monitoring.yourdomain.com/api/health
|
||||
```
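
The same checks can be scripted for use in a pipeline. This is a minimal sketch using only the Python standard library; the URLs repeat the placeholders from the `curl` commands above, and the function names are hypothetical. It treats only 2xx responses as healthy, which is slightly stricter than `curl -f`.

```python
import urllib.error
import urllib.request

# Endpoints from the curl commands above; yourdomain.com is the document's placeholder.
ENDPOINTS = {
    "api": "https://api.yourdomain.com/health",
    "monitoring": "https://monitoring.yourdomain.com/api/health",
}


def http_status(url, timeout=5.0):
    """Return the HTTP status code, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, and similar.
        return None


def check_all(endpoints=ENDPOINTS, fetch=http_status):
    """Map each endpoint name to True when it answers with a 2xx status."""
    results = {}
    for name, url in endpoints.items():
        status = fetch(url)
        results[name] = status is not None and 200 <= status < 300
    return results
```

Calling `check_all()` with no arguments performs real HTTP requests. The `fetch` parameter exists so that the logic can be exercised with a stub, for example `check_all({"api": "u"}, fetch=lambda url: 503)`.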
|
||||
|
||||
### Load Testing
|
||||
### Tests de Charge
|
||||
|
||||
```bash
|
||||
# Basic load test
|
||||
# Test de charge basique
|
||||
make test-load API_URL=https://api.yourdomain.com
|
||||
|
||||
# Extended load test
|
||||
# Test de charge étendu
|
||||
k6 run tests/load/k6_inference_test.js --env API_URL=https://api.yourdomain.com
|
||||
```
|
||||
|
||||
### Contract Testing
|
||||
### Tests de Contrat
|
||||
|
||||
```bash
|
||||
# API contract tests
|
||||
# Tests de contrat API
|
||||
python tests/contracts/test_inference_api.py --api-url=https://api.yourdomain.com
|
||||
```
|
||||
|
||||
## Troubleshooting Deployment Issues
|
||||
## Dépannage des Problèmes de Déploiement
|
||||
|
||||
### Common Issues
|
||||
### Problèmes Courants
|
||||
|
||||
#### 1. Terraform State Lock
|
||||
#### 1. Verrouillage d'État Terraform
|
||||
|
||||
```bash
|
||||
# If state is locked
|
||||
# Si l'état est verrouillé
|
||||
terraform force-unlock LOCK_ID
|
||||
|
||||
# Or reset state (dangerous)
|
||||
# Ou réinitialiser l'état (dangereux)
|
||||
terraform state pull > backup.tfstate
|
||||
terraform state rm # problematic resource
|
||||
terraform import # re-import resource
|
||||
terraform state rm # ressource problématique
|
||||
terraform import # ré-importer la ressource
|
||||
```
|
||||
|
||||
#### 2. Ansible Connection Issues
|
||||
#### 2. Problèmes de Connexion Ansible
|
||||
|
||||
```bash
|
||||
# Test SSH connectivity
|
||||
# Tester la connectivité SSH
|
||||
ansible all -i inventory/production.yml -m ping
|
||||
|
||||
# Check SSH agent
|
||||
# Vérifier l'agent SSH
|
||||
ssh-add -l
|
||||
|
||||
# Debug connection
|
||||
# Déboguer la connexion
|
||||
ansible all -i inventory/production.yml -m ping -vvv
|
||||
```
|
||||
|
||||
#### 3. GEX44 Not Accessible
|
||||
#### 3. GEX44 Non Accessible
|
||||
|
||||
```bash
|
||||
# Check server status in Robot console
|
||||
# Verify network configuration
|
||||
# Ensure servers are in same private network
|
||||
# Vérifier le statut du serveur dans la console Robot
|
||||
# Vérifier la configuration réseau
|
||||
# S'assurer que les serveurs sont dans le même réseau privé
|
||||
|
||||
# Manual SSH to debug
|
||||
# SSH manuel pour déboguer
|
||||
ssh -i ~/.ssh/hetzner_key ubuntu@GEX44_IP
|
||||
```
|
||||
|
||||
#### 4. Model Download Failures
|
||||
#### 4. Échecs de Téléchargement de Modèles
|
||||
|
||||
```bash
|
||||
# Check disk space
|
||||
# Vérifier l'espace disque
|
||||
ansible gex44 -i inventory/production.yml -a "df -h"
|
||||
|
||||
# Check download logs
|
||||
# Vérifier les logs de téléchargement
|
||||
ansible gex44 -i inventory/production.yml -a "tail -n 50 /var/log/vllm/model-download.log"
|
||||
|
||||
# Retry download
|
||||
# Réessayer le téléchargement
|
||||
ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=models
|
||||
```
|
||||
|
||||
### Debug Commands
|
||||
### Commandes de Débogage
|
||||
|
||||
```bash
|
||||
# Check all service statuses
|
||||
# Vérifier tous les statuts de service
|
||||
ansible all -i inventory/production.yml -a "systemctl list-units --failed"
|
||||
|
||||
# View logs
|
||||
# Voir les logs
|
||||
ansible all -i inventory/production.yml -a "journalctl -u vllm-api -n 50"
|
||||
|
||||
# Check GPU status
|
||||
# Vérifier le statut GPU
|
||||
ansible gex44 -i inventory/production.yml -a "nvidia-smi"
|
||||
|
||||
# Check network connectivity
|
||||
# Vérifier la connectivité réseau
|
||||
ansible all -i inventory/production.yml -a "ping -c 3 8.8.8.8"
|
||||
```
|
||||
|
||||
## Rollback Procedures
|
||||
## Procédures de Rollback
|
||||
|
||||
### Emergency Rollback
|
||||
### Rollback d'Urgence
|
||||
|
||||
```bash
|
||||
# Stop accepting new traffic
|
||||
# Update load balancer to maintenance mode
|
||||
# Arrêter l'acceptation de nouveau trafic
|
||||
# Mettre le load balancer en mode maintenance
|
||||
ansible load_balancers -i inventory/production.yml -a "systemctl stop haproxy"
|
||||
|
||||
# Rollback Terraform changes
|
||||
# Rollback des changements Terraform
|
||||
cd terraform/environments/production
|
||||
terraform plan -destroy -out=rollback.tfplan
|
||||
terraform apply rollback.tfplan
|
||||
|
||||
# Restore from backup
|
||||
# Restaurer depuis une sauvegarde
|
||||
make restore BACKUP_DATE=20241201 ENV=production
|
||||
```
|
||||
|
||||
### Gradual Rollback
|
||||
### Rollback Graduel
|
||||
|
||||
```bash
|
||||
# Remove problematic servers from load balancer
|
||||
# Update HAProxy configuration to exclude failed servers
|
||||
# Retirer les serveurs problématiques du load balancer
|
||||
# Mettre à jour la configuration HAProxy pour exclure les serveurs défaillants
|
||||
ansible-playbook -i inventory/production.yml playbooks/load-balancer-setup.yml --extra-vars="exclude_servers=['gex44-3']"
|
||||
|
||||
# Fix issues on excluded servers
|
||||
# Re-add to load balancer when ready
|
||||
# Corriger les problèmes sur les serveurs exclus
|
||||
# Les rajouter au load balancer quand prêts
|
||||
```
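
Ansible interprets `key=value` extra vars as strings, so a list such as `exclude_servers` is more reliably passed in JSON form. The sketch below builds the equivalent command line with proper shell quoting. The inventory path, playbook path, and variable name are taken from the command above; the helper name is hypothetical, and whether the playbook consumes `exclude_servers` as a list is an assumption.

```python
import json
import shlex


def rollback_command(excluded,
                     inventory="inventory/production.yml",
                     playbook="playbooks/load-balancer-setup.yml"):
    """Build an ansible-playbook command that passes exclude_servers as a JSON list."""
    extra_vars = json.dumps({"exclude_servers": list(excluded)})
    argv = ["ansible-playbook", "-i", inventory, playbook, "--extra-vars", extra_vars]
    # shlex.join quotes each argument so the result can be pasted into a shell.
    return shlex.join(argv)


if __name__ == "__main__":
    print(rollback_command(["gex44-3"]))
```

Passing the same argument list to `subprocess.run` instead of joining it would avoid shell quoting altogether.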
|
||||
|
||||
## Maintenance Procedures
|
||||
## Procédures de Maintenance
|
||||
|
||||
### Regular Maintenance
|
||||
### Maintenance Régulière
|
||||
|
||||
```bash
|
||||
# Weekly: Update all packages
|
||||
# Hebdomadaire : Mettre à jour tous les paquets
|
||||
ansible all -i inventory/production.yml -m shell -a "apt update && apt upgrade -y" --become
|
||||
|
||||
# Monthly: Restart services
|
||||
# Mensuelle : Redémarrer les services
|
||||
ansible all -i inventory/production.yml -a "systemctl restart vllm-api"
|
||||
|
||||
# Quarterly: Full system reboot (during maintenance window)
|
||||
# Trimestrielle : Redémarrage complet du système (pendant la fenêtre de maintenance)
|
||||
ansible all -i inventory/production.yml -a "reboot" --become
|
||||
```
|
||||
|
||||
### Cost Optimization
|
||||
### Optimisation des Coûts
|
||||
|
||||
```bash
|
||||
# Generate cost report
|
||||
# Générer un rapport de coûts
|
||||
make cost-report ENV=production
|
||||
|
||||
# Review unused resources
|
||||
# Examiner les ressources inutilisées
|
||||
python scripts/cost-analysis.py --find-unused
|
||||
|
||||
# Implement recommendations
|
||||
# Scale down during low usage periods
|
||||
# Implémenter les recommandations
|
||||
# Réduire l'échelle pendant les périodes de faible utilisation
|
||||
```
|
||||
|
||||
## Security Hardening
|
||||
## Durcissement de Sécurité
|
||||
|
||||
### Post-Deployment Security
|
||||
### Sécurité Post-Déploiement
|
||||
|
||||
```bash
|
||||
# Run security hardening playbook
|
||||
# Exécuter le playbook de durcissement de sécurité
|
||||
ansible-playbook -i inventory/production.yml playbooks/security-hardening.yml
|
||||
|
||||
# Update firewall rules
|
||||
# Mettre à jour les règles de pare-feu
|
||||
ansible-playbook -i inventory/production.yml playbooks/firewall-setup.yml
|
||||
|
||||
# Rotate SSH keys
|
||||
# Rotation des clés SSH
|
||||
ansible-playbook -i inventory/production.yml playbooks/ssh-key-rotation.yml
|
||||
```
|
||||
|
||||
### Security Monitoring
|
||||
### Surveillance de Sécurité
|
||||
|
||||
```bash
|
||||
# Enable fail2ban
|
||||
# Activer fail2ban
|
||||
ansible all -i inventory/production.yml -a "systemctl enable fail2ban"
|
||||
|
||||
# Setup log monitoring
|
||||
# Configurer la surveillance des logs
|
||||
ansible-playbook -i inventory/production.yml playbooks/log-monitoring.yml
|
||||
|
||||
# Configure intrusion detection
|
||||
# Configurer la détection d'intrusion
|
||||
ansible-playbook -i inventory/production.yml playbooks/ids-setup.yml
|
||||
```
|
||||
|
||||
This deployment guide provides a comprehensive path from initial setup to production deployment. Always test changes in development and staging environments before applying to production.
|
||||
Ce guide de déploiement fournit un chemin complet depuis la configuration initiale jusqu'au déploiement en production. Testez toujours les changements dans les environnements de développement et de staging avant de les appliquer en production.
|
||||
335
docs/04_tools.md
@ -1,249 +1,238 @@
|
||||
# Tools & Technologies
|
||||
# Outils et Technologies
|
||||
|
||||
## Core Infrastructure
|
||||
## Infrastructure de Base
|
||||
|
||||
### Infrastructure as Code
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Terraform** | 1.12+ | Infrastructure provisioning | MPL-2.0 |
|
||||
| **Hetzner Provider** | 1.45+ | Hetzner Cloud resources | MPL-2.0 |
|
||||
### Infrastructure en tant que Code
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **Terraform** | 1.12+ | Provisioning d'infrastructure | MPL-2.0 |
|
||||
| **Hetzner Provider** | 1.45+ | Ressources Hetzner Cloud | MPL-2.0 |
|
||||
|
||||
### Configuration Management
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Ansible** | 8.0+ | Server configuration | GPL-3.0 |
|
||||
| **Ansible Vault** | Included | Secrets management | GPL-3.0 |
|
||||
### Gestion de Configuration
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **Ansible** | 8.0+ | Configuration de serveurs | GPL-3.0 |
|
||||
| **Ansible Vault** | Inclus | Gestion des secrets | GPL-3.0 |
|
||||
|
||||
## Operating System & Runtime
|
||||
## Système d'Exploitation et Runtime
|
||||
|
||||
### Base System
|
||||
| Component | Version | Purpose | Support |
|
||||
|-----------|---------|---------|---------|
|
||||
| **Ubuntu Server** | 24.04 LTS | Base operating system | Until 2029 (2034 with Ubuntu Pro ESM) |
|
||||
| **Docker** | 24.0.x | Container runtime | Docker Inc. |
|
||||
| **systemd** | 253+ | Service management | Built-in |
|
||||
### Système de Base
|
||||
| Composant | Version | Objectif | Support |
|
||||
|-----------|---------|----------|---------|
|
||||
| **Ubuntu Server** | 24.04 LTS | Système d'exploitation de base | Jusqu'en 2029 (2034 avec Ubuntu Pro ESM) |
|
||||
| **Docker** | 24.0.x | Runtime de conteneurs | Docker Inc. |
|
||||
| **systemd** | 253+ | Gestion des services | Intégré |
|
||||
|
||||
### GPU Stack
|
||||
| Component | Version | Purpose | Support |
|
||||
|-----------|---------|---------|---------|
|
||||
| **NVIDIA Driver** | 545.23.08 | GPU driver | NVIDIA |
|
||||
| **CUDA Toolkit** | 12.3+ | GPU computing | NVIDIA |
|
||||
| **NVIDIA Container Toolkit** | 1.14+ | Docker GPU support | NVIDIA |
|
||||
### Stack GPU
|
||||
| Composant | Version | Objectif | Support |
|
||||
|-----------|---------|----------|---------|
|
||||
| **NVIDIA Driver** | 545.23.08 | Pilote GPU | NVIDIA |
|
||||
| **CUDA Toolkit** | 12.3+ | Computing GPU | NVIDIA |
|
||||
| **NVIDIA Container Toolkit** | 1.14+ | Support GPU Docker | NVIDIA |
|
||||
|
||||
## AI/ML Stack
|
||||
## Stack IA/ML
|
||||
|
||||
### Inference Engine
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **vLLM** | Latest | LLM inference server | Apache-2.0 |
|
||||
| **PyTorch** | 2.5.0+ | Deep learning framework | BSD-3 |
|
||||
| **Transformers** | 4.46.0+ | Model library | Apache-2.0 |
|
||||
| **Accelerate** | 0.34.0+ | Training acceleration | Apache-2.0 |
|
||||
### Moteur d'Inférence
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **vLLM** | Latest | Serveur d'inférence LLM | Apache-2.0 |
|
||||
| **PyTorch** | 2.5.0+ | Framework d'apprentissage profond | BSD-3 |
|
||||
| **Transformers** | 4.46.0+ | Bibliothèque de modèles | Apache-2.0 |
|
||||
| **Accelerate** | 0.34.0+ | Accélération d'entraînement | Apache-2.0 |
|
||||
|
||||
### Model Management
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **MLflow** | 2.8+ | Model lifecycle management | Apache-2.0 |
|
||||
| **Hugging Face Hub** | 0.25.0+ | Model repository | Apache-2.0 |
|
||||
### Gestion des Modèles
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **MLflow** | 2.8+ | Gestion du cycle de vie des modèles | Apache-2.0 |
|
||||
| **Hugging Face Hub** | 0.25.0+ | Dépôt de modèles | Apache-2.0 |
|
||||
|
||||
### Quantization
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **AWQ** | Latest | 4-bit quantization | MIT |
|
||||
| **GPTQ** | Latest | Alternative quantization | MIT |
|
||||
| **TorchAO** | Nightly | Advanced optimizations | BSD-3 |
|
||||
### Quantification
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **AWQ** | Latest | Quantification 4-bit | MIT |
|
||||
| **GPTQ** | Latest | Quantification alternative | MIT |
|
||||
| **TorchAO** | Nightly | Optimisations avancées | BSD-3 |
|
||||
|
||||
## Networking & Load Balancing
|
||||
## Réseau et Répartition de Charge
|
||||
|
||||
### Load Balancing
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
### Répartition de Charge
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **HAProxy** | 2.8+ | Load balancer | GPL-2.0 |
|
||||
| **Keepalived** | 2.2+ | High availability | GPL-2.0 |
|
||||
| **Keepalived** | 2.2+ | Haute disponibilité | GPL-2.0 |
|
||||
|
||||
### SSL/TLS
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Let's Encrypt** | Current | Free SSL certificates | ISRG |
|
||||
| **Certbot** | 2.7+ | Certificate automation | Apache-2.0 |
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **Let's Encrypt** | Actuel | Certificats SSL gratuits | ISRG |
|
||||
| **Certbot** | 2.7+ | Automatisation de certificats | Apache-2.0 |
|
||||
|
||||
## Monitoring & Observability
|
||||
## Surveillance et Observabilité
|
||||
|
||||
### Core Monitoring
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Prometheus** | 2.47+ | Metrics collection | Apache-2.0 |
|
||||
| **Grafana** | 10.2+ | Metrics visualization | AGPL-3.0 |
|
||||
| **AlertManager** | 0.26+ | Alert routing | Apache-2.0 |
|
||||
### Surveillance de Base
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **Prometheus** | 2.47+ | Collection de métriques | Apache-2.0 |
|
||||
| **Grafana** | 10.2+ | Visualisation de métriques | AGPL-3.0 |
|
||||
| **AlertManager** | 0.26+ | Routage d'alertes | Apache-2.0 |
|
||||
|
||||
### Exporters
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Node Exporter** | 1.7+ | System metrics | Apache-2.0 |
|
||||
| **nvidia-smi Exporter** | Custom | GPU metrics | MIT |
|
||||
| **HAProxy Exporter** | 0.15+ | Load balancer metrics | Apache-2.0 |
|
||||
### Exporteurs
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **Node Exporter** | 1.7+ | Métriques système | Apache-2.0 |
|
||||
| **nvidia-smi Exporter** | Personnalisé | Métriques GPU | MIT |
|
||||
| **HAProxy Exporter** | 0.15+ | Métriques load balancer | Apache-2.0 |
|
||||
|
||||
### Log Management
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **systemd-journald** | Built-in | Log collection | GPL-2.0 |
|
||||
| **Logrotate** | 3.21+ | Log rotation | GPL-2.0 |
|
||||
### Gestion des Logs
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **systemd-journald** | Intégré | Collection de logs | GPL-2.0 |
|
||||
| **Logrotate** | 3.21+ | Rotation des logs | GPL-2.0 |
|
||||
|
||||
## CI/CD & Development
|
||||
## CI/CD et Développement
|
||||
|
||||
### CI/CD Platform
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **GitLab** | 16.0+ | CI/CD pipeline | MIT |
|
||||
| **GitLab Runner** | 16.0+ | Job execution | MIT |
|
||||
### Plateforme CI/CD
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **GitLab** | 16.0+ | Pipeline CI/CD | MIT |
|
||||
| **GitLab Runner** | 16.0+ | Exécution de tâches | MIT |
|
||||
|
||||
### Development Tools
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Python** | 3.12+ | Scripting language | PSF |
|
||||
| **pip** | 23.0+ | Package manager | MIT |
|
||||
| **Poetry** | 1.7+ | Dependency management | MIT |
|
||||
### Outils de Développement
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **Python** | 3.12+ | Langage de script | PSF |
|
||||
| **pip** | 23.0+ | Gestionnaire de paquets | MIT |
|
||||
| **Poetry** | 1.7+ | Gestion des dépendances | MIT |
|
||||
|
||||
### Testing
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **pytest** | 7.4+ | Python testing | MIT |
|
||||
| **requests** | 2.31+ | HTTP testing | Apache-2.0 |
|
||||
| **locust** | 2.17+ | Load testing | MIT |
|
||||
### Tests
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **pytest** | 7.4+ | Tests Python | MIT |
|
||||
| **requests** | 2.31+ | Tests HTTP | Apache-2.0 |
|
||||
| **locust** | 2.17+ | Tests de charge | MIT |
|
||||
|
||||
## Security & Compliance
|
||||
## Sécurité et Conformité
|
||||
|
||||
### Firewall & Security
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **ufw** | 0.36+ | Firewall management | GPL-3.0 |
|
||||
| **fail2ban** | 1.0+ | Intrusion prevention | GPL-2.0 |
|
||||
| **SSH** | OpenSSH 9.3+ | Secure access | BSD |
|
||||
### Pare-feu et Sécurité
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **ufw** | 0.36+ | Gestion du pare-feu | GPL-3.0 |
|
||||
| **fail2ban** | 1.0+ | Prévention d'intrusion | GPL-2.0 |
|
||||
| **SSH** | OpenSSH 9.3+ | Accès sécurisé | BSD |
|
||||
|
||||
### Secrets Management
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Ansible Vault** | Built-in | Configuration secrets | GPL-3.0 |
|
||||
| **GitLab CI Variables** | Built-in | CI/CD secrets | MIT |
|
||||
### Gestion des Secrets
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **Ansible Vault** | Intégré | Secrets de configuration | GPL-3.0 |
|
||||
| **GitLab CI Variables** | Intégré | Secrets CI/CD | MIT |
|
||||
|
||||
## Cloud Provider APIs
|
||||
## APIs Fournisseur Cloud
|
||||
|
||||
### Hetzner Services
|
||||
| Service | API Version | Purpose | Pricing |
|
||||
|---------|-------------|---------|---------|
|
||||
| **Hetzner Cloud** | v1 | Cloud resources | Pay-per-use |
|
||||
| **Hetzner Robot** | v1 | Dedicated servers | Monthly |
|
||||
| **Hetzner DNS** | v1 | DNS management | Free |
|
||||
### Services Hetzner
|
||||
| Service | Version API | Objectif | Tarification |
|
||||
|---------|-------------|----------|--------------|
|
||||
| **Hetzner Cloud** | v1 | Ressources cloud | Paiement à l'usage |
|
||||
| **Hetzner Robot** | v1 | Serveurs dédiés | Mensuel |
|
||||
| **Hetzner DNS** | v1 | Gestion DNS | Gratuit |
|
||||
|
||||
## Backup & Storage
|
||||
## Sauvegarde et Stockage
|
||||
|
||||
### Storage Solutions
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **rsync** | 3.2+ | File synchronization | GPL-3.0 |
|
||||
| **tar** | 1.34+ | Archive creation | GPL-3.0 |
|
||||
### Solutions de Stockage
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **rsync** | 3.2+ | Synchronisation de fichiers | GPL-3.0 |
|
||||
| **tar** | 1.34+ | Création d'archives | GPL-3.0 |
|
||||
| **gzip** | 1.12+ | Compression | GPL-3.0 |
|
||||
|
||||
### Cloud Storage
|
||||
| Service | Purpose | Pricing |
|
||||
|---------|---------|---------|
|
||||
| **Hetzner Storage Box** | Backup storage | €0.0104/GB/month |
|
||||
| **Hetzner Cloud Volumes** | Block storage | €0.0476/GB/month |
|
||||
### Stockage Cloud
|
||||
| Service | Objectif | Tarification |
|
||||
|---------|----------|--------------|
|
||||
| **Hetzner Storage Box** | Stockage de sauvegarde | 0,0104€/GB/mois |
|
||||
| **Hetzner Cloud Volumes** | Stockage bloc | 0,0476€/GB/mois |
|
||||
|
||||
## Performance & Optimization
|
||||
## Performance et Optimisation
|
||||
|
||||
### System Optimization
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **htop** | 3.2+ | Process monitoring | GPL-2.0 |
|
||||
| **iotop** | 0.6+ | I/O monitoring | GPL-2.0 |
|
||||
| **nvidia-smi** | Included | GPU monitoring | NVIDIA |
|
||||
### Optimisation Système
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **htop** | 3.2+ | Surveillance des processus | GPL-2.0 |
|
||||
| **iotop** | 0.6+ | Surveillance I/O | GPL-2.0 |
|
||||
| **nvidia-smi** | Inclus | Surveillance GPU | NVIDIA |
|
||||
|
||||
### Network Optimization
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **iperf3** | 3.12+ | Network testing | BSD-3 |
|
||||
| **tc** | Built-in | Traffic control | GPL-2.0 |
|
||||
### Optimisation Réseau
|
||||
| Outil | Version | Objectif | Licence |
|
||||
|-------|---------|----------|---------|
|
||||
| **iperf3** | 3.12+ | Tests réseau | BSD-3 |
|
||||
| **tc** | Intégré | Contrôle du trafic | GPL-2.0 |
|
||||
|
||||
## Documentation & Collaboration
|
||||
## Documentation et Collaboration
|
||||
|
||||
### Documentation
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Markdown** | CommonMark | Documentation format | BSD |
|
||||
| **Mermaid** | 10.6+ | Diagram generation | MIT |
|
||||
|
||||
### Version Control
|
||||
| Tool | Version | Purpose | License |
|
||||
|------|---------|---------|---------|
|
||||
| **Git** | 2.40+ | Version control | GPL-2.0 |
|
||||
| **Git LFS** | 3.4+ | Large file storage | MIT |
|
||||
## Commandes d'Installation
|
||||
|
||||
## Installation Commands
|
||||
|
||||
### Ubuntu 24.04 Setup
|
||||
### Configuration Ubuntu 24.04
|
||||
```bash
|
||||
# Update system
|
||||
# Mettre à jour le système
|
||||
sudo apt update && sudo apt upgrade -y
|
||||
|
||||
# Install core tools
|
||||
# Installer les outils de base
|
||||
sudo apt install -y curl wget git python3-pip
|
||||
|
||||
# Install Docker
|
||||
# Installer Docker
|
||||
curl -fsSL https://get.docker.com -o get-docker.sh
|
||||
sudo sh get-docker.sh
|
||||
|
||||
# Install NVIDIA drivers (on GEX44)
|
||||
# Installer les pilotes NVIDIA (sur GEX44)
|
||||
sudo apt install -y nvidia-driver-545
|
||||
sudo nvidia-smi
|
||||
|
||||
# Install Terraform
|
||||
# Installer Terraform
|
||||
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
|
||||
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
|
||||
sudo apt update && sudo apt install -y terraform
|
||||
|
||||
# Install Ansible
|
||||
# Installer Ansible
|
||||
sudo apt install -y ansible
|
||||
|
||||
# Install Python dependencies
|
||||
# Installer les dépendances Python
|
||||
pip3 install mlflow requests prometheus-client
|
||||
```
|
||||
|
||||
### Verification Commands
|
||||
### Commandes de Vérification
|
||||
```bash
|
||||
# Verify versions
|
||||
# Vérifier les versions
|
||||
terraform version
|
||||
ansible --version
|
||||
docker version
|
||||
python3 --version
|
||||
|
||||
# Verify GPU (on GEX44)
|
||||
# Vérifier le GPU (sur GEX44)
|
||||
nvidia-smi
|
||||
docker run --rm --gpus all nvidia/cuda:12.3.2-runtime-ubuntu22.04 nvidia-smi
|
||||
```
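
The versions printed by the commands above can be checked against the minimums listed in the tables. A minimal sketch, assuming GNU `sort -V` is available (it ships with coreutils on Ubuntu); the helper name `version_ge` is hypothetical and not part of this repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on the -V (version sort) flag of GNU sort.
version_ge() {
  local current=$1 minimum=$2
  # The smaller version sorts first. If that is the minimum (or both are
  # equal), the current version is new enough.
  [ "$(printf '%s\n%s\n' "$minimum" "$current" | sort -V | head -n 1)" = "$minimum" ]
}

# Example usage (commented out because it needs Terraform installed):
# tf_version=$(terraform version | head -n 1 | sed 's/^Terraform v//')
# version_ge "$tf_version" "1.12" || echo "Terraform is older than 1.12" >&2
```

The same helper could be pointed at the other tools, provided the version string is first extracted from each tool's own output format.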
|
||||
|
||||
## Architecture Compatibility
|
||||
## Compatibilité d'Architecture
|
||||
|
||||
### Supported Hardware
|
||||
### Matériel Supporté
|
||||
- **CPU** : Intel x86_64, AMD x86_64
|
||||
- **GPU** : NVIDIA RTX 4000 Ada (Compute Capability 8.9)
|
||||
- **Memory** : 64GB DDR4 minimum
|
||||
- **Storage** : NVMe SSD minimum
|
||||
- **Mémoire** : 64GB DDR4 minimum
|
||||
- **Stockage** : SSD NVMe minimum
|
||||
|
||||
### Network Requirements
|
||||
- **Bandwidth** : 1 Gbps minimum
|
||||
- **Latency** : < 10ms intra-datacenter
|
||||
- **Ports** : 22 (SSH), 80/443 (HTTP/HTTPS), 8000 (vLLM), 9090-9100 (Monitoring)
|
||||
### Exigences Réseau
|
||||
- **Bande passante** : 1 Gbps minimum
|
||||
- **Latence** : < 10ms intra-datacenter
|
||||
- **Ports** : 22 (SSH), 80/443 (HTTP/HTTPS), 8000 (vLLM), 9090-9100 (Surveillance)
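
The port list above maps directly onto `ufw` rules. The sketch below only prints the rules so they can be reviewed first; the helper name `print_ufw_rules` is hypothetical, and opening every port to any source is an assumption that a production setup would likely tighten (for example by limiting 8000 and the monitoring range to the private network):

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) one ufw rule per port spec.
# Port ranges use ufw's "start:end" syntax, which requires a protocol.
print_ufw_rules() {
  local spec
  for spec in "$@"; do
    echo "ufw allow ${spec}/tcp"
  done
}

# Ports taken from the list above; 9090-9100 is written as 9090:9100 for ufw.
print_ufw_rules 22 80 443 8000 9090:9100
```

Printing instead of executing keeps the example safe to run on a workstation; the output can be applied on a server once it has been checked.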
|
||||
|
||||
## License Compliance
|
||||
## Conformité de Licence
|
||||
|
||||
### Open Source Components
|
||||
- **GPL/LGPL-licensed** : Linux kernel (GPL-2.0), systemd (LGPL-2.1+), Ansible (GPL-3.0)
|
||||
- **Apache-licensed** : Docker, MLflow, Prometheus
- **Source-available (BUSL-1.1)** : Terraform 1.6+
|
||||
- **MIT-licensed** : GitLab (Community Edition), pytest
|
||||
- **BSD-licensed** : PyTorch, OpenSSH
|
||||
### Composants Open Source
|
||||
- **Licence GPL/LGPL** : Noyau Linux (GPL-2.0), systemd (LGPL-2.1+), Ansible (GPL-3.0)
|
||||
- **Licence Apache** : Docker, MLflow, Prometheus
- **Source disponible (BUSL-1.1)** : Terraform 1.6+
|
||||
- **Licence MIT** : GitLab (Community Edition), pytest
|
||||
- **Licence BSD** : PyTorch, OpenSSH
|
||||
|
||||
### Proprietary Components
|
||||
- **NVIDIA drivers** : NVIDIA License (redistribution restrictions)
|
||||
- **Hetzner services** : Commercial terms
|
||||
- **GitLab Enterprise** : Commercial (if used)
|
||||
### Composants Propriétaires
|
||||
- **Pilotes NVIDIA** : Licence NVIDIA (restrictions de redistribution)
|
||||
- **Services Hetzner** : Conditions commerciales
|
||||
- **GitLab Enterprise** : Commercial (si utilisé)
|
||||
|
||||
@ -1,659 +1,659 @@
|
||||
# Troubleshooting Guide
|
||||
# Guide de Dépannage
|
||||
|
||||
This guide helps diagnose and resolve common issues with the AI Infrastructure deployment.
|
||||
Ce guide aide à diagnostiquer et résoudre les problèmes courants avec le déploiement de l'Infrastructure IA.
|
||||
|
||||
## Quick Diagnostic Commands
|
||||
## Commandes de Diagnostic Rapide
|
||||
|
||||
```bash
|
||||
# Overall system health
|
||||
# Santé globale du système
|
||||
make status ENV=production
|
||||
|
||||
# Check all services
|
||||
# Vérifier tous les services
|
||||
ansible all -i inventory/production.yml -a "systemctl list-units --failed"
|
||||
|
||||
# View recent logs
|
||||
# Voir les logs récents
|
||||
ansible all -i inventory/production.yml -a "journalctl --since '10 minutes ago' --no-pager"
|
||||
|
||||
# Check GPU status
|
||||
# Vérifier le statut GPU
|
||||
ansible gex44 -i inventory/production.yml -a "nvidia-smi"
|
||||
|
||||
# Test API endpoints
|
||||
# Tester les endpoints API
|
||||
curl -f https://api.yourdomain.com/health
|
||||
curl -f https://api.yourdomain.com/v1/models
|
||||
```
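
The two `curl -f` checks above can be wrapped in a loop that reports every endpoint and returns a single exit status, which is convenient in CI jobs. This is a sketch only: the function names `probe` and `check_endpoints` are hypothetical, and the 5-second timeout is an assumed value.

```bash
#!/usr/bin/env bash
# probe: returns 0 when the URL answers with a non-error HTTP status.
# -f fails on HTTP >= 400, -sS hides progress but keeps error messages,
# --max-time bounds the whole request.
probe() {
  curl -fsS --max-time 5 -o /dev/null "$1"
}

# check_endpoints: probes every URL given, prints one line per URL,
# and returns non-zero if any probe failed.
check_endpoints() {
  local url failed=0
  for url in "$@"; do
    if probe "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Example usage (commented out because it needs network access):
# check_endpoints https://api.yourdomain.com/health https://api.yourdomain.com/v1/models
```

Keeping `probe` separate from the loop makes it easy to swap in a different check, such as one that also validates the response body.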
|
||||
|
||||
## Infrastructure Issues
|
||||
## Problèmes d'Infrastructure
|
||||
|
||||
### Server Not Responding
|
||||
### Serveur qui ne Répond Pas
|
||||
|
||||
**Symptoms**: Server unreachable via SSH or API
|
||||
**Symptômes** : Serveur injoignable via SSH ou API
|
||||
|
||||
**Diagnosis**:
|
||||
**Diagnostic** :
|
||||
```bash
|
||||
# Check server status in Hetzner Console
|
||||
# Ping test
|
||||
# Vérifier le statut du serveur dans la Console Hetzner
|
||||
# Test de ping
|
||||
ping SERVER_IP
|
||||
|
||||
# SSH connectivity test
|
||||
# Test de connectivité SSH
|
||||
ssh -v -i ~/.ssh/hetzner_key ubuntu@SERVER_IP
|
||||
|
||||
# Check from other servers
|
||||
# Vérifier depuis d'autres serveurs
|
||||
ansible other_servers -i inventory/production.yml -a "ping -c 3 SERVER_IP"
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. **Network Issues**:
|
||||
**Solutions** :
|
||||
1. **Problèmes Réseau** :
|
||||
```bash
|
||||
# Restart networking
|
||||
# Redémarrer le réseau
|
||||
ansible TARGET_SERVER -i inventory/production.yml -a "systemctl restart systemd-networkd" --become
|
||||
|
||||
# Check firewall status
|
||||
|
||||
# Vérifier le statut du pare-feu
|
||||
ansible TARGET_SERVER -i inventory/production.yml -a "ufw status"
|
||||
|
||||
# Reset firewall if needed
|
||||
|
||||
# Réinitialiser le pare-feu si nécessaire
|
||||
ansible TARGET_SERVER -i inventory/production.yml -a "ufw --force reset"
|
||||
```
|
||||
|
||||
2. **Server Overload**:
|
||||
2. **Surcharge du Serveur** :
|
||||
```bash
|
||||
# Check resource usage
|
||||
# Vérifier l'utilisation des ressources
|
||||
ansible TARGET_SERVER -i inventory/production.yml -m shell -a "top -bn1 | head -20"
|
||||
|
||||
# Check disk space
|
||||
|
||||
# Vérifier l'espace disque
|
||||
ansible TARGET_SERVER -i inventory/production.yml -a "df -h"
|
||||
|
||||
# Check memory
|
||||
|
||||
# Vérifier la mémoire
|
||||
ansible TARGET_SERVER -i inventory/production.yml -a "free -h"
|
||||
```
|
||||
|
||||
3. **Hardware Issues**:
|
||||
- Contact Hetzner support
|
||||
- Check Robot console for hardware alerts
|
||||
- Consider server replacement
|
||||
3. **Problèmes Matériels** :
|
||||
- Contacter le support Hetzner
|
||||
- Vérifier la console Robot pour les alertes matérielles
|
||||
- Envisager le remplacement du serveur
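
For the disk-space check in the overload case above, the `df` output can be filtered so that only nearly full filesystems are shown. A minimal sketch, assuming POSIX-format output (`df -P`) and mount points without spaces; the helper name `disks_over` and the 85% threshold are illustrative choices:

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `df -P` output on stdin and prints every
# mount point whose usage is at or above the given percentage.
disks_over() {
  local threshold=$1
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)          # strip the trailing percent sign
    if (use + 0 >= t) print $6, use "%"
  }'
}

# Example usage: df -P | disks_over 85
```

Through Ansible, the pipeline would have to run with `-m shell`, since the default `command` module does not interpret pipes.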
|
||||
|
||||
### Private Network Issues
|
||||
### Problèmes de Réseau Privé
|
||||
|
||||
**Symptoms**: Servers can't communicate over private network
|
||||
**Symptômes** : Les serveurs ne peuvent pas communiquer sur le réseau privé
|
||||
|
||||
**Diagnosis**:
|
||||
**Diagnostic** :
|
||||
```bash
|
||||
# Check private network configuration
|
||||
# Vérifier la configuration du réseau privé
|
||||
ansible all -i inventory/production.yml -a "ip route show"
|
||||
|
||||
# Test private network connectivity
|
||||
# Tester la connectivité du réseau privé
|
||||
ansible all -i inventory/production.yml -a "ping -c 3 10.0.2.10"
|
||||
|
||||
# Check network interfaces
|
||||
# Vérifier les interfaces réseau
|
||||
ansible all -i inventory/production.yml -a "ip addr show"
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
**Solutions** :
|
||||
```bash
|
||||
# Restart network interfaces
|
||||
# Redémarrer les interfaces réseau
|
||||
ansible all -i inventory/production.yml -a "systemctl restart systemd-networkd" --become
|
||||
|
||||
# Re-apply network configuration
|
||||
# Ré-appliquer la configuration réseau
|
||||
ansible-playbook -i inventory/production.yml playbooks/network-setup.yml
|
||||
|
||||
# Check Hetzner Cloud network status
|
||||
# Vérifier le statut réseau Hetzner Cloud
|
||||
terraform show | grep network
|
||||
```
|
||||
|
||||
## GPU Issues
|
||||
## Problèmes GPU
|
||||
|
||||
### GPU Not Detected
|
||||
### GPU Non Détecté
|
||||
|
||||
**Symptoms**: `nvidia-smi` command fails or shows no GPUs
|
||||
**Symptômes** : La commande `nvidia-smi` échoue ou n'affiche aucun GPU
|
||||
|
||||
**Diagnosis**:
|
||||
**Diagnostic** :
|
||||
```bash
|
||||
# Check GPU hardware detection
|
||||
# Vérifier la détection matérielle du GPU
|
||||
ansible gex44 -i inventory/production.yml -m shell -a "lspci | grep -i nvidia"
|
||||
|
||||
# Check NVIDIA driver status
|
||||
# Vérifier le statut du pilote NVIDIA
|
||||
ansible gex44 -i inventory/production.yml -a "nvidia-smi"
|
||||
|
||||
# Check driver version
|
||||
# Vérifier la version du pilote
|
||||
ansible gex44 -i inventory/production.yml -a "cat /proc/driver/nvidia/version"
|
||||
|
||||
# Check kernel modules
|
||||
# Vérifier les modules du noyau
|
||||
ansible gex44 -i inventory/production.yml -m shell -a "lsmod | grep nvidia"
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. **Driver Issues**:
|
||||
**Solutions** :
|
||||
1. **Problèmes de Pilote** :
|
||||
```bash
|
||||
# Reinstall NVIDIA drivers
|
||||
# Réinstaller les pilotes NVIDIA
|
||||
ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=cuda
|
||||
|
||||
# Reboot after driver installation
|
||||
|
||||
# Redémarrer après l'installation du pilote
|
||||
ansible gex44 -i inventory/production.yml -a "reboot" --become
|
||||
```
|
||||
|
||||
2. **Hardware Issues**:
|
||||
2. **Problèmes Matériels** :
|
||||
```bash
|
||||
# Check hardware detection
|
||||
# Vérifier la détection matérielle
|
||||
ansible gex44 -i inventory/production.yml -a "lshw -C display"
|
||||
|
||||
# Check BIOS settings (requires physical access)
|
||||
# Contact Hetzner support for hardware issues
|
||||
|
||||
# Vérifier les paramètres BIOS (nécessite un accès physique)
|
||||
# Contacter le support Hetzner pour les problèmes matériels
|
||||
```
|
||||
|
||||
### GPU Memory Issues
|
||||
### Problèmes de Mémoire GPU
|
||||
|
||||
**Symptoms**: CUDA out of memory errors, poor performance
|
||||
**Symptômes** : Erreurs CUDA de manque de mémoire, performances dégradées
|
||||
|
||||
**Diagnosis**:
|
||||
**Diagnostic** :
|
||||
```bash
|
||||
# Check GPU memory usage
|
||||
# Vérifier l'utilisation de la mémoire GPU
|
||||
ansible gex44 -i inventory/production.yml -a "nvidia-smi --query-gpu=memory.used,memory.total --format=csv"
|
||||
|
||||
# Check running processes on GPU
|
||||
# Vérifier les processus en cours d'exécution sur le GPU
|
||||
ansible gex44 -i inventory/production.yml -a "nvidia-smi pmon -c 1"
|
||||
|
||||
# Check vLLM memory configuration
|
||||
# Vérifier la configuration mémoire vLLM
|
||||
ansible gex44 -i inventory/production.yml -a "grep MEMORY /etc/vllm/config.env"
|
||||
```
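
To compare the memory figures above with a utilization limit such as `VLLM_GPU_MEMORY_UTILIZATION`, the used/total pair can be turned into a percentage. A minimal sketch, assuming `nvidia-smi` is queried with `--format=csv,noheader,nounits` so that each line is just `used, total` in MiB; the helper name `gpu_mem_percent` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "used, total" pairs (MiB, one GPU per line)
# on stdin and prints the percentage of memory in use for each GPU.
gpu_mem_percent() {
  awk -F', *' '$2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# Example usage on a GPU host:
# nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | gpu_mem_percent
```

The result is truncated to a whole number, so a reading of 80 corresponds to a utilization setting of roughly 0.8.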
|
||||
|
||||
**Solutions**:
|
||||
1. **Reduce Memory Usage**:
|
||||
**Solutions** :
|
||||
1. **Réduire l'Utilisation Mémoire** :
|
||||
```bash
|
||||
# Lower GPU memory utilization
|
||||
# Réduire l'utilisation de la mémoire GPU
|
||||
ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_GPU_MEMORY_UTILIZATION=0.8' regexp='^VLLM_GPU_MEMORY_UTILIZATION='"
|
||||
|
||||
# Restart vLLM
|
||||
|
||||
# Redémarrer vLLM
|
||||
ansible gex44 -i inventory/production.yml -a "systemctl restart vllm-api"
|
||||
```
|
||||
|
||||
2. **Clear GPU Memory**:
|
||||
2. **Libérer la Mémoire GPU** :
|
||||
```bash
|
||||
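# Stop the vLLM service to release GPU memory (avoids killing unrelated Python processes, including Ansible's own)
|
||||
# Arrêter le service vLLM pour libérer la mémoire GPU (évite de tuer des processus Python sans rapport, y compris ceux d'Ansible)
|
||||
ansible gex44 -i inventory/production.yml -a "systemctl stop vllm-api" --become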
|
||||
|
||||
# Reset GPU
|
||||
|
||||
# Réinitialiser le GPU
|
||||
ansible gex44 -i inventory/production.yml -a "nvidia-smi --gpu-reset"
|
||||
```
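
The `lineinfile` call used above replaces a `KEY=` line if it exists and appends it otherwise. For hosts where Ansible is not at hand, the same idea can be sketched in plain shell. The helper name `set_env_var` is hypothetical, and the sketch assumes simple `KEY=VALUE` lines with values that contain no backslashes:

```bash
#!/usr/bin/env bash
# Hypothetical helper: sets KEY=VALUE in an env file, replacing an existing
# KEY= line or appending one when the key is absent.
set_env_var() {
  local file=$1 key=$2 value=$3 tmp
  touch "$file"
  tmp=$(mktemp)
  awk -v k="$key" -v v="$value" '
    index($0, k "=") == 1 { print k "=" v; found = 1; next }  # replace in place
    { print }
    END { if (!found) print k "=" v }                          # or append
  ' "$file" > "$tmp"
  # Write back through the original file so its owner and mode are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example usage (run as a user allowed to write the file):
# set_env_var /etc/vllm/config.env VLLM_GPU_MEMORY_UTILIZATION 0.8
```

Running the function twice with the same arguments leaves the file unchanged, which mirrors the idempotent behavior expected from the Ansible module.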
|
||||
|
||||
### GPU Temperature Issues
|
||||
### Problèmes de Température GPU
|
||||
|
||||
**Symptoms**: High GPU temperatures, thermal throttling
|
||||
**Symptômes** : Températures élevées du GPU, limitation thermique
|
||||
|
||||
**Diagnosis**:
|
||||
**Diagnostic** :
|
||||
```bash
|
||||
# Check current temperatures
|
||||
# Vérifier les températures actuelles
|
||||
ansible gex44 -i inventory/production.yml -a "nvidia-smi --query-gpu=temperature.gpu,temperature.memory --format=csv"
|
||||
|
||||
# Check temperature history in Grafana
|
||||
# Navigate to GPU Metrics dashboard
|
||||
# Vérifier l'historique des températures dans Grafana
|
||||
# Naviguer vers le tableau de bord Métriques GPU
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. **Immediate Cooling**:
|
||||
**Solutions** :
|
||||
1. **Refroidissement Immédiat** :
|
||||
```bash
|
||||
# Reduce GPU workload
|
||||
# Scale down inference requests temporarily
|
||||
|
||||
# Check cooling system
|
||||
# Réduire la charge GPU
|
||||
# Réduire temporairement les requêtes d'inférence
|
||||
|
||||
# Vérifier le système de refroidissement
|
||||
ansible gex44 -i inventory/production.yml -a "sensors"
|
||||
```
|
||||
|
||||
2. **Long-term Solutions**:
|
||||
- Contact Hetzner for datacenter cooling issues
|
||||
- Reduce GPU utilization limits
|
||||
- Implement better load balancing
|
||||
2. **Solutions à Long Terme** :
|
||||
- Contacter Hetzner pour les problèmes de refroidissement du datacenter
|
||||
- Réduire les limites d'utilisation du GPU
|
||||
- Implémenter une meilleure répartition de charge
|
||||
|
||||
## vLLM Service Issues
|
||||
## Problèmes du Service vLLM
|
||||
|
||||
### vLLM Service Won't Start
|
||||
### Le Service vLLM ne Démarre Pas
|
||||
|
||||
**Symptoms**: `systemctl status vllm-api` shows failed state
|
||||
**Symptômes** : `systemctl status vllm-api` affiche un état d'échec
|
||||
|
||||
**Diagnosis**:
|
||||
**Diagnostic** :
|
||||
```bash
|
||||
# Check service status
|
||||
# Vérifier le statut du service
|
||||
ansible gex44 -i inventory/production.yml -a "systemctl status vllm-api"
|
||||
|
||||
# Check service logs
|
||||
# Vérifier les logs du service
|
||||
ansible gex44 -i inventory/production.yml -a "journalctl -u vllm-api -n 50"
|
||||
|
||||
# Check vLLM configuration
|
||||
# Vérifier la configuration vLLM
|
||||
ansible gex44 -i inventory/production.yml -a "cat /etc/vllm/config.env"
|
||||
|
||||
# Test manual start
|
||||
# Tester le démarrage manuel
|
||||
ansible gex44 -i inventory/production.yml -a "sudo -u vllm python -m vllm.entrypoints.openai.api_server --help"
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. **Configuration Issues**:
|
||||
**Solutions** :
|
||||
1. **Problèmes de Configuration** :
|
||||
```bash
|
||||
# Validate configuration
|
||||
# Valider la configuration
|
||||
ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=config --check
|
||||
|
||||
# Regenerate configuration
|
||||
|
||||
# Régénérer la configuration
|
||||
ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=config
|
||||
```
|
||||
|
||||
2. **Permission Issues**:
|
||||
2. **Problèmes de Permissions** :
|
||||
```bash
|
||||
# Fix file permissions
|
||||
# Corriger les permissions de fichiers
|
||||
ansible gex44 -i inventory/production.yml -a "chown -R vllm:vllm /opt/vllm"
|
||||
ansible gex44 -i inventory/production.yml -a "chmod 755 /opt/vllm"
|
||||
```
|
||||
|
||||
3. **Dependency Issues**:
|
||||
3. **Problèmes de Dépendances** :
|
||||
```bash
|
||||
# Reinstall vLLM
|
||||
# Réinstaller vLLM
|
||||
ansible gex44 -i inventory/production.yml -a "pip install --force-reinstall vllm"
|
||||
```
|
||||
|
||||
### Model Loading Issues
|
||||
### Problèmes de Chargement de Modèles
|
||||
|
||||
**Symptoms**: vLLM starts but models fail to load
|
||||
**Symptômes** : vLLM démarre mais les modèles ne se chargent pas
|
||||
|
||||
**Diagnosis**:
|
||||
**Diagnostic** :
|
||||
```bash
|
||||
# Check model files
|
||||
# Vérifier les fichiers de modèles
|
||||
ansible gex44 -i inventory/production.yml -a "ls -la /opt/vllm/models/"
|
||||
|
||||
# Check disk space
|
||||
# Vérifier l'espace disque
|
||||
ansible gex44 -i inventory/production.yml -a "df -h /opt/vllm/models/"
|
||||
|
||||
# Check model loading logs
|
||||
# Vérifier les logs de chargement des modèles
|
||||
ansible gex44 -i inventory/production.yml -a "tail -n 100 /var/log/vllm/model-loading.log"
|
||||
|
||||
# Test model access
|
||||
# Tester l'accès aux modèles
|
||||
ansible gex44 -i inventory/production.yml -a "sudo -u vllm python -c \"from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('/opt/vllm/models/mixtral-8x7b')\""
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. **Missing Models**:
|
||||
**Solutions** :
|
||||
1. **Modèles Manquants** :
|
||||
```bash
|
||||
# Re-download models
|
||||
# Re-télécharger les modèles
|
||||
ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=models
|
||||
|
||||
# Check HuggingFace connectivity
|
||||
|
||||
# Vérifier la connectivité HuggingFace
|
||||
ansible gex44 -i inventory/production.yml -a "curl -f https://huggingface.co"
|
||||
```
|
||||
|
||||
2. **Corrupted Models**:
|
||||
2. **Modèles Corrompus** :
|
||||
```bash
|
||||
# Remove corrupted models
|
||||
# Supprimer les modèles corrompus
|
||||
ansible gex44 -i inventory/production.yml -a "rm -rf /opt/vllm/models/mixtral-8x7b"
|
||||
|
||||
# Re-download
|
||||
|
||||
# Re-télécharger
|
||||
ansible-playbook -i inventory/production.yml playbooks/gex44-setup.yml --tags=models
|
||||
```
|
||||
|
||||
3. **Insufficient Resources**:
|
||||
3. **Ressources Insuffisantes** :
|
||||
```bash
|
||||
# Use smaller model or quantization
|
||||
# Update configuration to use quantized models
|
||||
# Utiliser un modèle plus petit ou la quantification
|
||||
# Mettre à jour la configuration pour utiliser des modèles quantifiés
|
||||
ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_QUANTIZATION=awq' regexp='^VLLM_QUANTIZATION='"
|
||||
```

### High Latency Issues

**Symptoms**: API responses take too long

**Diagnosis**:
```bash
# Check current latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mixtral-8x7b","messages":[{"role":"user","content":"Hello"}],"max_tokens":10}'

# Check queue size
curl -s https://api.yourdomain.com/metrics | grep vllm_queue_size

# Check GPU utilization
ansible gex44 -i inventory/production.yml -a "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits"
```
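
The latency check reads its output template from `curl-format.txt`, which is not shown in this guide. A minimal sketch of such a file follows; the layout is an assumption, while the `%{...}` names are standard curl `--write-out` variables.

```shell
# Hypothetical contents for curl-format.txt (the layout is an assumption).
# The quoted heredoc keeps the \n escapes literal so that curl interprets them.
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s\n
time_connect:       %{time_connect}s\n
time_appconnect:    %{time_appconnect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total:         %{time_total}s\n
EOF
```

A large gap between `time_appconnect` and `time_starttransfer` points at the backend (queueing or inference), whereas high `time_namelookup` or `time_connect` values point at DNS or the network.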

**Solutions**:

1. **Scale Up**:
```bash
# Add more GPU servers
make scale-up ENV=production

# Or manually order new servers
python scripts/autoscaler.py --action=scale-up --count=1
```

2. **Optimize Configuration**:
```bash
# Reduce model precision
ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_DTYPE=float16' regexp='^VLLM_DTYPE='"

# Increase batch size
ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_MAX_NUM_SEQS=512' regexp='^VLLM_MAX_NUM_SEQS='"
```

3. **Load Balancing**:
```bash
# Check load balancer configuration
ansible load_balancers -i inventory/production.yml -a "curl -s http://localhost:8404/stats"

# Verify all backends are healthy
curl -s http://LOAD_BALANCER_IP:8404/stats | grep UP
```
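
Grepping the stats page for `UP` only shows that something is up. A slightly stricter check can be sketched on top of HAProxy's CSV export (the same stats URL with `;csv` appended). The function name `list_unhealthy_backends` is hypothetical, and the sketch assumes the stats endpoint is reachable as in the commands above.

```shell
# Reads HAProxy CSV statistics on stdin and prints "proxy/server status" for every
# real server row (not FRONTEND/BACKEND) whose status does not start with UP.
list_unhealthy_backends() {
  awk -F',' '
    NR == 1 {
      sub(/^# /, "", $1)                        # the header line starts with "# pxname"
      for (i = 1; i <= NF; i++) col[$i] = i     # map column names to positions
      next
    }
    NF > 1 && $col["svname"] != "FRONTEND" && $col["svname"] != "BACKEND" && $col["status"] !~ /^UP/ {
      print $col["pxname"] "/" $col["svname"], $col["status"]
    }
  '
}
```

Typical usage would be `curl -s "http://LOAD_BALANCER_IP:8404/stats;csv" | list_unhealthy_backends`; empty output means every server row reports `UP`.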

## Load Balancer Issues

### Load Balancer Not Routing Traffic

**Symptoms**: Requests fail to reach backend servers

**Diagnosis**:
```bash
# Check HAProxy status
ansible load_balancers -i inventory/production.yml -a "systemctl status haproxy"

# Check HAProxy configuration
ansible load_balancers -i inventory/production.yml -a "haproxy -f /etc/haproxy/haproxy.cfg -c"

# Check backend health
curl -s http://LOAD_BALANCER_IP:8404/stats

# Test direct backend access
curl -f http://10.0.1.10:8000/health
```

**Solutions**:

1. **Configuration Issues**:
```bash
# Regenerate HAProxy configuration
ansible-playbook -i inventory/production.yml playbooks/load-balancer-setup.yml

# Restart HAProxy
ansible load_balancers -i inventory/production.yml -a "systemctl restart haproxy"
```

2. **Backend Health Issues**:
```bash
# Check why backends are failing health checks
ansible gex44 -i inventory/production.yml -a "curl -f http://localhost:8000/health"

# Fix unhealthy backends
ansible gex44 -i inventory/production.yml -a "systemctl restart vllm-api"
```

### SSL Certificate Issues

**Symptoms**: HTTPS requests fail with certificate errors

**Diagnosis**:
```bash
# Check certificate validity (stdin is closed so s_client exits after the handshake)
openssl s_client -connect api.yourdomain.com:443 -servername api.yourdomain.com < /dev/null

# Check certificate files
ansible load_balancers -i inventory/production.yml -a "ls -la /etc/ssl/certs/"

# Check certificate expiration (the pipe requires the shell module)
ansible load_balancers -i inventory/production.yml -m shell -a "openssl x509 -in /etc/ssl/certs/haproxy.pem -text -noout | grep 'Not After'"
```

**Solutions**:

1. **Renew Certificates**:
```bash
# Renew Let's Encrypt certificates
ansible load_balancers -i inventory/production.yml -a "certbot renew"

# Reload HAProxy
ansible load_balancers -i inventory/production.yml -a "systemctl reload haproxy"
```

2. **Fix Certificate Configuration**:
```bash
# Regenerate certificate bundle (the redirection requires the shell module)
ansible load_balancers -i inventory/production.yml -m shell -a "cat /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem /etc/letsencrypt/live/api.yourdomain.com/privkey.pem > /etc/ssl/certs/haproxy.pem"
```

## Monitoring Issues

### Prometheus Not Collecting Metrics

**Symptoms**: Missing data in Grafana dashboards

**Diagnosis**:
```bash
# Check Prometheus status
ansible monitoring -i inventory/production.yml -a "systemctl status prometheus"

# Check Prometheus configuration
ansible monitoring -i inventory/production.yml -a "promtool check config /etc/prometheus/prometheus.yml"

# Check target status
curl -s http://MONITORING_IP:9090/api/v1/targets | jq .

# Test metric endpoints
curl -s http://10.0.1.10:9835/metrics | head -10
```

**Solutions**:

1. **Configuration Issues**:
```bash
# Regenerate Prometheus configuration
ansible-playbook -i inventory/production.yml playbooks/monitoring-setup.yml --tags=prometheus

# Restart Prometheus
ansible monitoring -i inventory/production.yml -a "systemctl restart prometheus"
```

2. **Target Connectivity**:
```bash
# Check network connectivity to targets
ansible monitoring -i inventory/production.yml -a "curl -f http://10.0.1.10:9835/metrics"

# Check firewall rules (the pipe requires the shell module)
ansible gex44 -i inventory/production.yml -m shell -a "ufw status | grep 9835"
```

### Grafana Dashboard Issues

**Symptoms**: Dashboards show no data or errors

**Diagnosis**:
```bash
# Check Grafana status
ansible monitoring -i inventory/production.yml -a "systemctl status grafana-server"

# Check Grafana logs
ansible monitoring -i inventory/production.yml -a "journalctl -u grafana-server -n 50"

# Test Prometheus data source
curl -s http://MONITORING_IP:3000/api/datasources
```

**Solutions**:

1. **Data Source Issues**:
```bash
# Reconfigure Grafana data sources
ansible-playbook -i inventory/production.yml playbooks/monitoring-setup.yml --tags=grafana

# Restart Grafana
ansible monitoring -i inventory/production.yml -a "systemctl restart grafana-server"
```

2. **Dashboard Import Issues**:
```bash
# Re-import dashboards
ansible-playbook -i inventory/production.yml playbooks/monitoring-setup.yml --tags=dashboards
```

## Performance Issues

### High CPU Usage

**Symptoms**: Server becomes slow, high load average

**Diagnosis**:
```bash
# Check CPU usage (pipes require the shell module)
ansible all -i inventory/production.yml -m shell -a "top -bn1 | head -20"

# Check process list
ansible all -i inventory/production.yml -m shell -a "ps aux --sort=-%cpu | head -10"

# Check load average
ansible all -i inventory/production.yml -a "uptime"
```
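
Whether a load average counts as "high" depends on how many cores the host has. As a rough sketch, the load can be normalized by the core count. The helper names below are hypothetical, and the rule "load greater than the core count means overloaded" is a common rule of thumb rather than a threshold taken from this project.

```shell
# Prints the load average divided by the number of CPU cores.
# Arguments: load average (e.g. first field of /proc/loadavg) and core count (e.g. from nproc).
load_per_core() {
  awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f\n", l / c }'
}

# Succeeds (exit 0) when the load exceeds the core count,
# i.e. more runnable work than there are CPUs (rule-of-thumb assumption).
is_overloaded() {
  awk -v l="$1" -v c="$2" 'BEGIN { exit !(l + 0 > c + 0) }'
}
```

On a host this could be called as `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.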

**Solutions**:

1. **Identify Resource-Heavy Processes**:
```bash
# Kill problematic processes
ansible TARGET_SERVER -i inventory/production.yml -a "pkill -f PROCESS_NAME"

# Restart services
ansible TARGET_SERVER -i inventory/production.yml -a "systemctl restart SERVICE_NAME"
```

2. **Scale Resources**:
```bash
# Add more servers or upgrade existing ones
# Consider upgrading cloud server types in Terraform
```

### High Memory Usage

**Symptoms**: Out of memory errors, swap usage

**Diagnosis**:
```bash
# Check memory usage
ansible all -i inventory/production.yml -a "free -h"

# Check swap usage
ansible all -i inventory/production.yml -a "swapon --show"

# Check memory-heavy processes (the pipe requires the shell module)
ansible all -i inventory/production.yml -m shell -a "ps aux --sort=-%mem | head -10"
```
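
To turn the `free` output into a single number that can be compared across hosts, the "available" column can be expressed as a percentage of total RAM. This is a minimal sketch that assumes the procps-ng column layout (total, used, free, shared, buff/cache, available); `mem_available_pct` is a hypothetical helper name.

```shell
# Reads `free -m` output on stdin and prints the percentage of RAM still available.
# Assumes the procps-ng layout, where "available" is the 7th field of the Mem: row.
mem_available_pct() {
  awk '/^Mem:/ { printf "%d\n", $7 * 100 / $2 }'
}
```

For example, `free -m | mem_available_pct` run on a GPU host prints an integer percentage; a persistently low value alongside swap usage matches the symptoms described above.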

**Solutions**:

1. **Free Memory**:
```bash
# Clear caches (&& and the redirection require the shell module; needs root)
ansible all -i inventory/production.yml -m shell -a "sync && echo 3 > /proc/sys/vm/drop_caches"

# Restart memory-heavy services
ansible gex44 -i inventory/production.yml -a "systemctl restart vllm-api"
```

2. **Optimize Configuration**:
```bash
# Reduce model cache size
ansible gex44 -i inventory/production.yml -m lineinfile -a "path=/etc/vllm/config.env line='VLLM_SWAP_SPACE=2' regexp='^VLLM_SWAP_SPACE='"
```

## Network Issues

### High Latency Between Servers

**Symptoms**: Slow inter-server communication

**Diagnosis**:
```bash
# Test latency between servers
ansible all -i inventory/production.yml -a "ping -c 10 10.0.1.10"

# Check network interface statistics
ansible all -i inventory/production.yml -a "cat /proc/net/dev"

# Test bandwidth
ansible all -i inventory/production.yml -a "iperf3 -c 10.0.1.10 -t 10"
```

**Solutions**:

1. **Network Optimization**:
```bash
# Optimize network settings
ansible-playbook -i inventory/production.yml playbooks/network-optimization.yml

# Check for network congestion
# Consider upgrading network interfaces
```

### DNS Resolution Issues

**Symptoms**: Domain names not resolving correctly

**Diagnosis**:
```bash
# Test DNS resolution
ansible all -i inventory/production.yml -a "nslookup api.yourdomain.com"

# Check DNS configuration
ansible all -i inventory/production.yml -a "cat /etc/resolv.conf"

# Test external DNS
ansible all -i inventory/production.yml -a "nslookup google.com 8.8.8.8"
```

**Solutions**:
```bash
# Update DNS configuration
ansible all -i inventory/production.yml -m lineinfile -a "path=/etc/resolv.conf line='nameserver 8.8.8.8'"

# Restart networking
ansible all -i inventory/production.yml -a "systemctl restart systemd-resolved"
```

## Emergency Procedures

### Complete Service Outage

1. **Immediate Response**:
```bash
# Check all critical services
make status ENV=production

# Enable maintenance mode
ansible load_balancers -i inventory/production.yml -a "systemctl stop haproxy"

# Notify stakeholders
```

2. **Diagnosis**:
```bash
# Check recent changes
git log --since="2 hours ago" --oneline

# Check system logs
ansible all -i inventory/production.yml -a "journalctl --since '2 hours ago' --no-pager"

# Check monitoring alerts
curl -s http://MONITORING_IP:9090/api/v1/alerts
```

3. **Recovery**:
```bash
# Roll back recent changes if necessary
make rollback ENV=production BACKUP_DATE=YYYYMMDD

# Or restart all services
ansible all -i inventory/production.yml -a "systemctl restart vllm-api haproxy prometheus grafana-server"

# Re-enable load balancer
ansible load_balancers -i inventory/production.yml -a "systemctl start haproxy"
```

### Data Loss Prevention

```bash
# Immediate backup
make backup ENV=production

# Snapshot critical volumes
# Use Hetzner Cloud console to create snapshots

# Document the incident
# Create incident report with timeline and actions taken
```

For issues not covered in this guide, contact the infrastructure team or create an issue in the project repository with:
- Detailed problem description
- Error messages and logs
- Steps already taken
- Current system status

174
docs/INDEX.md
Normal file
@ -0,0 +1,174 @@
# Documentation Index

## 📚 Production-Ready AI Infrastructure with Hetzner

This documentation covers the complete infrastructure for deploying AI models on Hetzner GEX44 servers with GitLab CI/CD, Terraform, and Ansible.

### 🎯 Quick Navigation

| Document | Description | Status |
|----------|-------------|--------|
| [**01_architecture.md**](./01_architecture.md) | Complete infrastructure architecture | ✅ Complete |
| [**02_deployment.md**](./02_deployment.md) | Step-by-step deployment guide | ✅ Complete |
| [**03_applications.md**](./03_applications.md) | Multi-project and team organization | ✅ Complete |
| [**04_tools.md**](./04_tools.md) | Tools and technologies used | ✅ Complete |
| [**05_troubleshooting.md**](./05_troubleshooting.md) | Troubleshooting and resolution guide | ✅ Complete |
| [**vpn-setup.md**](./vpn-setup.md) | WireGuard VPN configuration | ✅ Complete |

---

## 🚀 Quick Start

### Prerequisites
- Hetzner account (Robot + Cloud)
- GitLab account for CI/CD
- 3x GEX44 servers ordered

### Installation in 5 Minutes
```bash
# 1. Clone and set up
git clone https://github.com/spham/hetzner-ai-infrastructure.git
cd hetzner-ai-infrastructure
make setup

# 2. Configure secrets
cp .env.example .env
# Edit .env with your Hetzner tokens

# 3. Deploy development
make deploy-dev

# 4. Verify the deployment
make test
```

---

## 📖 Guides by Topic

### 🏗️ **Infrastructure & Architecture**
- **[Architecture](./01_architecture.md)** - Overall design, components, networks
  - High-level architecture
  - Component details (Load Balancer, API Gateway, GPU Servers)
  - Network architecture and security
  - Performance and costs

### ⚡ **Deployment & Configuration**
- **[Deployment](./02_deployment.md)** - Complete installation guide
  - Prerequisites and preparation
  - Automated deployment
  - Validation and tests
  - Rollback procedures

### 👥 **Management & Organization**
- **[Applications](./03_applications.md)** - Multi-project organization
  - Organizational structure
  - Team management
  - Development workflows
  - Best practices

### 🛠️ **Tools & Technologies**
- **[Tools](./04_tools.md)** - Complete technology stack
  - Infrastructure as Code (Terraform, Ansible)
  - Containerization (Docker)
  - Monitoring (Prometheus, Grafana)
  - CI/CD (GitLab CI)

### 🔧 **Maintenance & Troubleshooting**
- **[Troubleshooting](./05_troubleshooting.md)** - Problem resolution
  - System diagnostics
  - GPU and vLLM problems
  - Network and performance issues
  - Emergency procedures

### 🔒 **Security & Access**
- **[VPN Configuration](./vpn-setup.md)** - Secure external access
  - WireGuard setup
  - Client/server configuration
  - External company access
  - Security rules

---

## 🎮 Main Commands

| Command | Description | Documentation |
|---------|-------------|---------------|
| `make setup` | Install dependencies | [02_deployment.md](./02_deployment.md#prerequisites) |
| `make test` | Full test suite | [02_deployment.md](./02_deployment.md#testing) |
| `make deploy-dev` | Development deployment | [02_deployment.md](./02_deployment.md#development) |
| `make deploy-prod` | Production deployment | [02_deployment.md](./02_deployment.md#production) |
| `make cost-report` | Cost report | [01_architecture.md](./01_architecture.md#costs) |
| `make scale-up` | Add a GPU server | [01_architecture.md](./01_architecture.md#scaling) |

---

## 📊 Technical Overview

### **Architecture**
```
Internet → HAProxy → 3x GEX44 GPU Servers → vLLM APIs
↓
Monitoring Stack (Prometheus/Grafana)
```

### **Monthly Costs**
- **Infrastructure**: €634/month vs €10,570 on AWS (12x cheaper)
- **Performance**: 255 tokens/sec, P95 latency <2s
- **ROI**: 2.7x more efficient than AWS

### **GPU Specifications**
- **3x GEX44**: RTX 4000 Ada, 20GB VRAM each
- **Models**: Mixtral-8x7B, Llama2-70B, CodeLlama-34B
- **Auto-scaling**: Based on GPU utilization

---

## 🔍 Keyword Index

### A-C
- **Ansible**: [02_deployment.md](./02_deployment.md), [04_tools.md](./04_tools.md)
- **Architecture**: [01_architecture.md](./01_architecture.md)
- **Auto-scaling**: [01_architecture.md](./01_architecture.md#scaling)
- **Costs**: [01_architecture.md](./01_architecture.md#costs)

### D-H
- **Deployment**: [02_deployment.md](./02_deployment.md)
- **Docker**: [04_tools.md](./04_tools.md)
- **GPU**: [01_architecture.md](./01_architecture.md#gpu), [05_troubleshooting.md](./05_troubleshooting.md#gpu)
- **Grafana**: [04_tools.md](./04_tools.md), [05_troubleshooting.md](./05_troubleshooting.md#monitoring)
- **HAProxy**: [01_architecture.md](./01_architecture.md#load-balancer), [05_troubleshooting.md](./05_troubleshooting.md#load-balancer)
- **Hetzner**: [01_architecture.md](./01_architecture.md), [02_deployment.md](./02_deployment.md)

### I-P
- **Infrastructure**: [01_architecture.md](./01_architecture.md)
- **Monitoring**: [01_architecture.md](./01_architecture.md#monitoring), [04_tools.md](./04_tools.md), [05_troubleshooting.md](./05_troubleshooting.md#monitoring)
- **Network**: [01_architecture.md](./01_architecture.md#network), [05_troubleshooting.md](./05_troubleshooting.md#network)
- **Performance**: [01_architecture.md](./01_architecture.md#performance), [05_troubleshooting.md](./05_troubleshooting.md#performance)
- **Prometheus**: [04_tools.md](./04_tools.md), [05_troubleshooting.md](./05_troubleshooting.md#monitoring)

### R-Z
- **Security**: [01_architecture.md](./01_architecture.md#security), [vpn-setup.md](./vpn-setup.md)
- **Terraform**: [02_deployment.md](./02_deployment.md), [04_tools.md](./04_tools.md)
- **Troubleshooting**: [05_troubleshooting.md](./05_troubleshooting.md)
- **vLLM**: [01_architecture.md](./01_architecture.md#gpu), [05_troubleshooting.md](./05_troubleshooting.md#vllm)
- **VPN**: [vpn-setup.md](./vpn-setup.md)

---

## 📞 Support & Contribution

### Getting Help
1. **Troubleshooting**: See [05_troubleshooting.md](./05_troubleshooting.md)
2. **Issues**: Create an issue on GitLab
3. **Documentation**: Refer to the specific guides above

### Contributing
- Fork the repository
- Follow the conventions in [03_applications.md](./03_applications.md)
- Test your changes with `make test`
- Submit a merge request

---

*Documentation maintained by the AI Infrastructure team - Last updated: {{ ansible_date_time.iso8601 }}*

82
docs/vpn-setup.md
Normal file
@ -0,0 +1,82 @@
# VPN Setup for an External Company

## WireGuard Configuration

This document explains how to configure VPN access for an external company to your Hetzner AI infrastructure.

### Architecture

```
External Company → Internet → VPN Gateway (Load Balancer) → Internal Networks
↓
┌─ GEX44 GPU (10.0.1.0/24)
└─ Cloud Services (10.0.2.0/24)
```

### Deployment

1. **Configure the variables**:
```bash
# In ansible/group_vars/all/main.yml or via environment variables
export external_company_public_key="PUBLIC_KEY_FROM_EXTERNAL_COMPANY"
export load_balancer_public_ip="YOUR_LOAD_BALANCER_PUBLIC_IP"
```

2. **Deploy the VPN**:
```bash
cd ansible
ansible-playbook -i inventory/production.yml playbooks/vpn-setup.yml
```

### Client Configuration (External Company)

1. **Generate the keys on the client side**:
```bash
# On the external company's system
wg genkey | tee private.key | wg pubkey > public.key
```

2. **Client configuration** (`wg0.conf`):
```ini
[Interface]
PrivateKey = CONTENTS_OF_private.key
Address = 10.0.10.10/32
DNS = 8.8.8.8

[Peer]
PublicKey = SERVER_PUBLIC_KEY
Endpoint = YOUR_PUBLIC_IP:51820
AllowedIPs = 10.0.1.0/24, 10.0.2.0/24
PersistentKeepalive = 25
```
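
Because keys are exchanged by hand between the two companies, a quick format check before pasting a key into the configuration can catch placeholders and truncated copies. The sketch below relies only on the fact that a WireGuard key is 32 bytes encoded in base64 (43 characters plus one `=`); `is_wg_key` is a hypothetical helper, and it validates the format only, not that the key belongs to the expected peer.

```shell
# Succeeds (exit 0) if the argument has the shape of a WireGuard key:
# 32 bytes in base64, i.e. 43 base64 characters followed by one "=" of padding.
is_wg_key() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9+/]{43}=$'
}
```

For instance, `is_wg_key "$(cat public.key)" || echo "public.key does not look like a WireGuard key"` could be run before sending the public key to the other party.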

### Authorized Access

The external company will be able to reach:
- **GPU servers (GEX44)**: `10.0.1.10-12` (vLLM port 8000)
- **Cloud services**: `10.0.2.0/24`
- **Monitoring**: `10.0.2.12:3000` (Grafana)

### Security

- WireGuard encryption (ChaCha20-Poly1305)
- Public-key authentication
- UFW firewall configured automatically
- Routing limited to the authorized networks

### Verification

```bash
# On the VPN server
sudo wg show

# Connectivity test from the external company
ping -c 4 10.0.1.10
curl http://10.0.1.10:8000/health
```
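
`wg show` also exposes the time of each peer's last handshake, which is often the quickest way to tell whether a tunnel is actually alive. The sketch below parses the output of `wg show wg0 latest-handshakes` (public key, tab, Unix timestamp). The `stale_peers` name and the 180-second default threshold are assumptions for illustration, not values from this project.

```shell
# Reads `wg show <interface> latest-handshakes` output on stdin and prints the public key
# of every peer that has never completed a handshake (timestamp 0) or whose last handshake
# is older than max_age seconds. Arguments: current Unix time, max_age (default 180).
stale_peers() {
  awk -v now="$1" -v max="${2:-180}" '$2 == 0 || now - $2 > max { print $1 }'
}
```

On the VPN server this could be run as `sudo wg show wg0 latest-handshakes | stale_peers "$(date +%s)"`; empty output suggests every configured peer has handshaken recently.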

### Troubleshooting

- Check that UDP port 51820 is open
- Inspect the logs: `sudo journalctl -u wg-quick@wg0`
- Test connectivity: `sudo wg show`