Add reproducible-workflows lab with complete documentation

- Implemented the lab.ipynb notebook with complete code for building reproducible AI workflows
- Added a 600+ line pedagogical README.md in French
- Configured random seeds for reproducibility
- Implemented data generation, splitting, normalization, and saving
- Built and trained a neural network with TensorFlow/Keras
- Demonstrated reloading the model and verifying reproducibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)
spham 2025-11-14 12:13:54 +01:00
commit 8dfd897cac
4 changed files with 1340 additions and 0 deletions


@@ -0,0 +1,102 @@
{% steps %}
{% step title="Introduction to Reproducible Workflows" %}
### Introduction
Welcome to the "Introduction to Reproducible Workflows" lab!
This lab is designed to give you a foundational understanding of how to create reproducible workflows for training an AI model,
why reproducibility matters, and where in model training to define fixed seeds.
### Learning Objectives
- Identify the key areas within AI model workflows where fixed seeds should be defined
- Review saving datasets after train/test splits
- Practice recovering models and training datasets to repeat results
### Prerequisites
- A high-level understanding of neural networks
- Experience training models
{% /step %}
{% step title="Creating Reproducible Datasets" %}
### Creating Reproducible Datasets
To create a repeatable workflow, you will need to start by defining the random seeds. The random seed determines how data is
generated, how it is split, and how the resulting train/test datasets are fed into the model. Fixing the seed at the start
helps ensure that every stage of the workflow is repeatable. The code below sets the random seed for the different libraries
used in model development.
```python
import random
import numpy as np
import tensorflow as tf

random.seed(42)         # Python's built-in random module
np.random.seed(42)      # NumPy
tf.random.set_seed(42)  # TensorFlow
```
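Fixing these three seeds covers the main sources of randomness in this lab. As an aside that goes beyond the lab's provided code: recent TensorFlow versions also offer a single helper that seeds all three libraries at once, plus an opt-in deterministic mode that removes residual run-to-run variation from parallel ops (this sketch assumes TensorFlow 2.9 or later; the deterministic mode may slow training):
```python
import tensorflow as tf

# Seeds Python's random, NumPy, and TensorFlow in one call (TF >= 2.7)
tf.keras.utils.set_random_seed(42)

# Force deterministic op implementations (TF >= 2.9); may slow training
tf.config.experimental.enable_op_determinism()
```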
From here you can define the synthetic data generation similar to how it is defined in other labs.
```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=42
)
```
{% /step %}
{% step title="Training Reproducible Model" %}
### Training Reproducible Models
For the next step of model creation you will need to split your data into training and test datasets. The code provided below
allows for the initial dataset to be split into training and test sets. The key parameter for you to focus on here is the
```random_state``` which defines the random seed for the split. When you define the random seed, you are
ensuring future instances of the models training will result in the same training dataset being used, and therefore the
same model being created.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training data, then transform it
X_test = scaler.transform(X_test)        # reuse the same scaling for the test data
```
You can also save the `train` and `test` data splits so the workflow remains reproducible even in cases where the dataset you
are using is a subset of a larger dataset.
```python
import joblib

joblib.dump((X_train, y_train), 'train_data.pkl')
joblib.dump((X_test, y_test), 'test_data.pkl')
```
With the datasets saved and reproducible, you can move on to defining and training your model. You will define and train
the model in this lab so you can evaluate it and compare it against a model trained on the same training dataset, confirming
that it is properly reproducible. All code for training and evaluating the model is provided within the lab; for reference,
the notebook's model definition is shown below.
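This is the small fully connected network that the lab notebook defines, trains, and evaluates (the same code you will run in `lab.ipynb`):
```python
from tensorflow import keras

# Same architecture as in lab.ipynb: two hidden layers, sigmoid output
model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```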
{% /step %}
{% step title="Saving Model and Dataset" %}
### Saving Model and Dataset
Now that your model has been trained and evaluated, you can save it with the following code:
```python
model.save('my_model.keras')
```
From there you can move on to reproducing the model, which mimics loading it into an entirely new environment.
You can load the previously saved model and datasets with the code below, including the required imports.
```python
from tensorflow.keras.models import load_model
import joblib
modelReloaded = load_model('my_model.keras')
X_train_reloaded, y_train_reloaded = joblib.load('train_data.pkl')
X_test_reloaded, y_test_reloaded = joblib.load('test_data.pkl')
```
When you re-evaluate the model on the exact same test set using the previously defined evaluation code, you should
see exactly the same results as for the previously evaluated model.
```python
loss, accuracy = modelReloaded.evaluate(X_test_reloaded, y_test_reloaded)
print(f"Test accuracy: {accuracy:.2f}")
```
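Matching aggregate accuracy is a good sign, but a stricter check is to compare per-example predictions. This sketch goes beyond the lab's provided code and assumes the original `model` object is still available in the same session:
```python
import numpy as np

# Compare per-example predictions of the original and reloaded models;
# with a reproducible workflow they match to floating-point tolerance.
# Assumes the original `model` is still in memory in this session.
preds_original = model.predict(X_test)
preds_reloaded = modelReloaded.predict(X_test_reloaded)

if np.allclose(preds_original, preds_reloaded):
    print("✓ Per-example predictions match: the workflow is reproducible.")
```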
{% /step %}
{% /steps %}

22
.vscode/settings.json vendored Normal file

@@ -0,0 +1,22 @@
{
    "terminal.integrated.fontSize": 15,
    "editor.fontSize": 15,
    "terminal.integrated.defaultProfile.linux": "bash",
    "workbench.colorTheme": "Default Dark Modern",
    "workbench.startupEditor": "none",
    "files.associations": {
        "*.md": "markdoc"
    },
    "workspace": {
        "view": "readme",
        "terminals": [
            {
                "name": "Terminal",
                "active": false
            }
        ],
        "files": [
            "./lab.ipynb"
        ]
    }
}

1050
README.md Normal file

File diff suppressed because it is too large

166
lab.ipynb Normal file

@@ -0,0 +1,166 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"import random\n",
"import joblib"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating Reproducible Datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# 1. Set random seeds for reproducibility\n# Définir les graines aléatoires pour la reproductibilité\nrandom.seed(42) # Pour le module random\nnp.random.seed(42) # Pour NumPy\ntf.random.set_seed(42) # Pour TensorFlow\n\nprint(\"✓ Graines aléatoires définies avec succès !\")"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# 2. Generate synthetic data\n# Générer des données synthétiques pour la classification\nX, y = make_classification(\n n_samples=1000, # 1000 exemples\n n_features=20, # 20 caractéristiques\n n_informative=15, # 15 caractéristiques utiles\n n_redundant=5, # 5 caractéristiques redondantes\n n_classes=2, # 2 classes (classification binaire)\n random_state=42 # Graine aléatoire pour reproductibilité\n)\n\nprint(f\"✓ Données générées : {X.shape[0]} exemples, {X.shape[1]} caractéristiques\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training Reproducible Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# 3. Split and scale the data\n# Diviser les données en ensembles d'entraînement (80%) et de test (20%)\nX_train, X_test, y_train, y_test = train_test_split(\n X, y, \n test_size=0.2, # 20% pour le test\n random_state=42 # Important pour la reproductibilité !\n)\n\n# Normaliser les données (StandardScaler centre les données autour de 0)\nscaler = StandardScaler()\nX_train = scaler.fit_transform(X_train) # Apprendre et transformer les données d'entraînement\nX_test = scaler.transform(X_test) # Transformer les données de test\n\nprint(f\"✓ Données divisées :\")\nprint(f\" - Entraînement : {X_train.shape[0]} exemples\")\nprint(f\" - Test : {X_test.shape[0]} exemples\")"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Sauvegarder les données d'entraînement et de test pour la reproductibilité\njoblib.dump((X_train, y_train), 'train_data.pkl')\njoblib.dump((X_test, y_test), 'test_data.pkl')\n\nprint(\"✓ Données sauvegardées :\")\nprint(\" - train_data.pkl\")\nprint(\" - test_data.pkl\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Initalization and Training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 4. Build a neural network\n",
"model = keras.Sequential([\n",
" keras.layers.Dense(32, activation='relu', input_shape=(X_train.shape[1],)),\n",
" keras.layers.Dense(16, activation='relu'),\n",
" keras.layers.Dense(1, activation='sigmoid')\n",
"])\n",
"\n",
"model.compile(\n",
" optimizer='adam',\n",
" loss='binary_crossentropy',\n",
" metrics=['accuracy']\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 5. Train the model\n",
"model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 6. Evaluate the model\n",
"loss, accuracy = model.evaluate(X_test, y_test)\n",
"print(f\"Test accuracy: {accuracy:.2f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Saving Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# 7. Save the model and scaler\n# Sauvegarder le modèle entraîné\nmodel.save('my_model.keras')\n\nprint(\"✓ Modèle sauvegardé : my_model.keras\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reproducing Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "#8. Reloading model later\n# Recharger le modèle et les données sauvegardées\nfrom tensorflow.keras.models import load_model\n\nmodelReloaded = load_model('my_model.keras')\nX_train_reloaded, y_train_reloaded = joblib.load('train_data.pkl')\nX_test_reloaded, y_test_reloaded = joblib.load('test_data.pkl')\n\nprint(\"✓ Modèle et données rechargés avec succès !\")"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Vérifier que le modèle rechargé donne les mêmes résultats\nloss_reloaded, accuracy_reloaded = modelReloaded.evaluate(X_test_reloaded, y_test_reloaded)\nprint(f\"\\n🎯 Précision du modèle rechargé : {accuracy_reloaded:.2f}\")\nprint(\"\\n💡 Si la précision est identique à celle obtenue plus haut,\")\nprint(\" votre workflow est reproductible ! ✓\")"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}