init

commit 74650b6912

.instructions/INSTRUCTIONS.mdoc (new file, 210 lines)
@@ -0,0 +1,210 @@
{% steps %}

{% step title="Introduction to ML Model Generalization" %}

### Introduction

Welcome to the "Introduction to ML Model Generalization" lab!

This lab is designed to give you a foundational understanding of generalization in machine learning models and of how to prevent overfitting and underfitting.

### Learning Objectives

- Review generalization and why it is important not to overfit or underfit models.
- Practice implementing early stopping and learning rate decay.

### Prerequisites

Familiarity with basic ML principles and key concepts such as learning rates and model structure.

{% /step %}
{% step title="Synthetic Data Generation" %}
|
||||||
|
|
||||||
|
|
||||||
|
### Synthetic Data Generation
|
||||||
|
Provided below is a basic function to create some synthetic data for classification. This data will have 2000 samples, each with 20
|
||||||
|
features, where 5 features do not affect the outcome of the classification and 15 are directly correlated to the classification. The
|
||||||
|
options for correct classification will only be between two options. Most importantly, the random state is defined to allow repeatability
|
||||||
|
of the model generation.
|
||||||
|
```python
|
||||||
|
X, y = make_classification(n_samples=2000,
|
||||||
|
n_features=20,
|
||||||
|
n_classes=2,
|
||||||
|
n_informative=15,
|
||||||
|
n_redundant=5,
|
||||||
|
random_state=42)
|
||||||
|
scaler = StandardScaler()
|
||||||
|
X = scaler.fit_transform(X)
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|

### Train/Validation/Test Splits

To split the data, you will first separate a test set from a combined training/validation set. You will then split the training/validation data into its own training and validation sets, resulting in a data distribution of train (64%), validation (16%), and test (20%).

```python
from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=42)
```

In practice, it is best to define your test dataset before model development begins and to keep it entirely separate from the rest of the workflow, to ensure there is no data leakage or overfitting to the test data.

{% /step %}
{% step title="Model and Generalization Feature Setup" %}
|
||||||
|
|
||||||
|
### Introduction
|
||||||
|
For this lab you will use a basic feed forward neural network because neural networks allow you to implement additional features
|
||||||
|
such as learning rate schedulers, and early stopping, that more traditional models such as linear regression do not have.
|
||||||
|
|
||||||
|
### Model Setup
|
||||||
|
For this
|
||||||
|
model you will use two dense layers of Relu activation functions, allowing for
|
||||||
|
more complex patterns to be learned, and ending the model with a sigmoid.
|
||||||
|
When setting up the model you could also include additional generalization techniques such as drop out, which
|
||||||
|
selectively turns off a certain percentage of neurons to ensure no single neuron within the neural net
|
||||||
|
learns to perform a single aspect of prediction.
|
||||||
|
|
||||||
|
**Note:** RELU is used to introduce non linearity into a neural networks learning, and sigmoid is used as a classification function
|
||||||
|
|
||||||
|
|
||||||
|
```python
|
||||||
|
model = tf.keras.Sequential([
|
||||||
|
tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
|
||||||
|
tf.keras.layers.Dense(32, activation='relu'),
|
||||||
|
tf.keras.layers.Dense(1, activation='sigmoid')
|
||||||
|
])
|
||||||
|
```
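
If you want to experiment with dropout as well, here is a minimal sketch of the same architecture with dropout layers added. The 0.3 rate and the ```model_with_dropout``` name are illustrative choices, not part of the lab code.

```python
import tensorflow as tf

# Same architecture as above, with dropout after each hidden layer.
# rate=0.3 randomly zeroes 30% of the layer's outputs on each training step.
model_with_dropout = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```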

### Loss Function and Model Initialization Parameters

For this lab you will be using the Adam optimizer, as Adam is a good starting optimizer for most problems. Here Adam is configured with a single argument, the initial learning rate; most models begin with a learning rate well under 0.3, and usually closer to 0.1 at most. You also define the loss function as ```binary_crossentropy```, which compares the predicted probabilities against the actual labels.

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
```
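
For intuition, binary cross-entropy penalizes confident wrong predictions far more heavily than confident correct ones. A minimal sketch of the per-example computation (illustrative only, not part of the lab code):

```python
import numpy as np

def binary_crossentropy(y_true, p_pred):
    """Per-example binary cross-entropy for true label y_true (0 or 1)
    and predicted probability p_pred."""
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(binary_crossentropy(1, 0.9))  # ~0.105: confident and correct -> small loss
print(binary_crossentropy(1, 0.1))  # ~2.303: confident and wrong -> large loss
```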

#### Learning Rate Scheduler

For your learning rate scheduler in this lab you will use ```ReduceLROnPlateau``` from Keras, which reduces the learning rate whenever the monitored metric stops improving (plateaus), so the optimizer can keep making finer-grained progress without the learning rate ever dropping below a set floor. The parameters are defined below:

- ```monitor``` set to ```val_loss``` means the callback watches the loss on the validation set.
- ```factor``` is the factor by which the learning rate is multiplied when it is reduced.
- ```patience``` is the number of epochs without improvement to wait before reducing the learning rate.
- ```min_lr``` defines the lowest possible value of the learning rate.
- ```verbose``` set to 1 prints a message each time the learning rate is reduced; 0 keeps the callback silent.

```python
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',   # metric to monitor
    factor=0.5,           # reduce by this factor
    patience=2,           # wait 2 epochs before reducing LR
    min_lr=1e-5,          # don't reduce below this
    verbose=1
)
```
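
To see what this schedule does in the worst case, here is a small illustrative sketch (not part of the lab code) of how the learning rate halves from the initial 0.05 each time the plateau condition triggers, until it reaches the ```min_lr``` floor:

```python
# Illustrative only: successive learning rates if ReduceLROnPlateau keeps firing.
lr, factor, min_lr = 0.05, 0.5, 1e-5
rates = [lr]
while lr * factor >= min_lr:
    lr *= factor
    rates.append(lr)
print(rates[:5])  # [0.05, 0.025, 0.0125, 0.00625, 0.003125]
```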

#### Early Stopping

Next you will implement the early stopping aspect of the model. This callback helps prevent overfitting by stopping training when the monitored value stops improving by a meaningful amount, and by rolling back to the best weights seen so far. In your case the monitored value is again the validation loss. A patience of 3 means the model is allowed 3 consecutive epochs without a large enough improvement before training stops. ```min_delta``` defines how much the monitored value needs to change to be counted as an improvement at all. Finally, setting ```restore_best_weights``` to true restores the weights from the epoch with the best monitored value, rather than keeping the weights from the final (worse) epochs. This functionality is important to ensure the model does not overfit to the training data and keeps some ability to generalize; a small sketch of the bookkeeping follows the code block.

```python
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=3,
    min_delta=0.01,            # minimum change to be considered an improvement
    restore_best_weights=True,
    verbose=1
)
```
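
As a rough mental model, the patience/min_delta bookkeeping works roughly as below. This is a simplified sketch, not Keras's actual implementation, and the loss values are rounded from the training log produced later in this lab.

```python
# Simplified sketch of the early-stopping decision, not the real Keras code.
def should_stop(val_losses, patience=3, min_delta=0.01):
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:      # improved by at least min_delta
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch             # stop training at this epoch
    return None                          # never triggered

val_losses = [0.227, 0.164, 0.148, 0.186, 0.170, 0.119, 0.105, 0.121, 0.121, 0.128]
print(should_stop(val_losses))  # -> 10, matching the lab run (best weights came from epoch 7)
```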

{% /step %}

{% step title="Model Training" %}

### Model Training

Finally, on to model training. You will use the basic ```fit``` method and pass the validation set as an argument; this lets ```val_loss``` be computed each epoch so it can be used by the learning rate scheduler and the early stopping mechanism. Here the number of epochs is set to 100 and ```verbose``` is set to 2: this gives early stopping plenty of epochs to end training early, and the line-by-line training output helps you see how ```val_loss``` changes per epoch. As you run the model, pay close attention to the changes in ```val_loss``` and how they correlate with when the model reduces the learning rate, stops early, and rolls back to the best weights.

```python
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stop, lr_scheduler],  # your custom early stopping + LR scheduler
    verbose=2
)
```

{% /step %}

{% step title="Evaluating Model Results" %}

### Evaluating Model Results

The following code provides a basic metric report for your neural network. Different levels of accuracy are acceptable depending on the model's domain. It is more important to see a considerable increase in prediction accuracy compared to existing methods than it is to hit a particular accuracy threshold. Validation accuracy above 99.5% can be a bit concerning, as it may be a sign of overfitting, while accuracy below previous methods may be a sign of underfitting.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Convert predicted probabilities to class labels with a 0.5 threshold
y_pred_probs = model.predict(X_test).flatten()
y_pred = (y_pred_probs >= 0.5).astype(int)

print("\n Test Set Evaluation:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```
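
One quick overfitting check that is not part of the lab code, but fits the discussion above, is to compare accuracy on the training set against accuracy on the test set: a large gap suggests overfitting, while low accuracy on both suggests underfitting. A minimal sketch, assuming the variables defined earlier in the lab:

```python
# Illustrative follow-up check using the model and splits defined above.
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"train accuracy: {train_acc:.3f}  test accuracy: {test_acc:.3f}")
```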

{% /step %}

{% /steps %}

.vscode/settings.json (new file, vendored, 22 lines)
@@ -0,0 +1,22 @@
{
    "terminal.integrated.fontSize": 15,
    "editor.fontSize": 15,
    "terminal.integrated.defaultProfile.linux": "bash",
    "workbench.colorTheme": "Default Dark Modern",
    "workbench.startupEditor": "none",
    "files.associations": {
        "*.md": "markdoc"
    },
    "workspace": {
        "view": "readme",
        "terminals": [
            {
                "name": "Terminal",
                "active": false
            }
        ],
        "files": [
            "./lab.ipynb"
        ],
    }
}

README.md (new file, 136 lines)
@@ -0,0 +1,136 @@
# ML Model Generalization Lab

## Lab Objective

This lab demonstrates essential techniques for improving the **generalization** of a Machine Learning model and avoiding **overfitting**.

## Key Concepts

### 1. Data Split (Train/Validation/Test)

```
Total: 2000 samples
├── Train: 64% (1280 samples) - model training
├── Validation: 16% (320 samples) - hyperparameter tuning
└── Test: 20% (400 samples) - final evaluation
```

**Why 3 splits?**
- **Train**: learns the patterns
- **Validation**: detects overfitting during training
- **Test**: measures real performance on never-seen data

### 2. Early Stopping

```python
EarlyStopping(
    monitor='val_loss',
    patience=3,
    min_delta=0.01,
    restore_best_weights=True
)
```

**Role**: Stops training when the `val_loss` stops improving
- Avoids overfitting by stopping before the model "memorizes" the data
- Restores the best weights (epoch 7 in our case)

### 3. Learning Rate Scheduler

```python
ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    min_lr=1e-5
)
```

**Role**: Reduces the learning rate when learning stagnates
- Initial learning rate: 0.05
- Halved after 2 epochs without improvement
- Allows finer convergence toward the optimum

### 4. Network Architecture

```
Input (20 features)
        ↓
Dense(64, relu)
        ↓
Dense(32, relu)
        ↓
Dense(1, sigmoid) → binary probability
```

A simple but effective architecture for binary classification.

## Results Obtained

### Performance Metrics

| Metric | Value |
|--------|-------|
| Accuracy | 97% |
| Precision (class 0) | 95% |
| Precision (class 1) | 99% |
| Recall (class 0) | 99% |
| Recall (class 1) | 95% |

### Confusion Matrix

```
              Predicted
               0     1
Actual  0   [205     2]
        1   [ 10   183]
```

- **Correct predictions** (diagonal): 205 + 183 = 388
- **Errors** (off-diagonal): 2 + 10 = 12
- **Error rate**: only 3% (recomputed in the sketch below)
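
These headline numbers can be recomputed directly from the confusion matrix counts reported above; a quick sketch:

```python
# Recomputing the reported metrics from the confusion matrix counts above.
tn, fp, fn, tp = 205, 2, 10, 183

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 388 / 400 = 0.97
precision_class1 = tp / (tp + fp)            # 183 / 185 ≈ 0.99
recall_class1 = tp / (tp + fn)               # 183 / 193 ≈ 0.95
print(accuracy, precision_class1, recall_class1)
```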

## Key Takeaways

### ✅ Good Practices Applied

1. **Always split the data** into 3 distinct sets
2. **Use the validation set** to monitor overfitting in real time
3. **Early stopping** is crucial to avoid overfitting
4. **An adaptive learning rate** improves convergence
5. **Feature normalization** with StandardScaler stabilizes learning

### 📊 Signs of Good Generalization

- ✅ Similar performance on train and test
- ✅ Val_loss stabilizes without diverging
- ✅ The model stops before overfitting (epoch 10/100)
- ✅ Balanced metrics across classes

### ⚠️ Signs of Overfitting (absent here)

- ❌ Train accuracy >> Test accuracy
- ❌ Val_loss increases while train_loss decreases
- ❌ Degraded performance on new data

## Running the Lab

```bash
# Activate the virtual environment
source venv/bin/activate

# Launch Jupyter
jupyter notebook lab.ipynb
```

## Technologies Used

- **TensorFlow/Keras**: building and training the neural network
- **Scikit-learn**: data generation, preprocessing, metrics
- **Python 3.12**: programming language

## Conclusion

This lab shows that a well-regularized model with early stopping and learning rate scheduling can achieve excellent performance (97%) while generalizing correctly to unseen data.

**Fundamental principle**: A good model does not memorize the data, it learns the general patterns.

lab.ipynb (new file, 291 lines)
@@ -0,0 +1,291 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-11-12 17:18:24.255077: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.\n",
      "2025-11-12 17:18:24.312342: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
      "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
      "2025-11-12 17:18:25.689783: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import classification_report, confusion_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Synthetic Data Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# synthetic data generation of 2000 samples\n",
    "X, y = make_classification(n_samples=2000,\n",
    "                           n_features=20,\n",
    "                           n_classes=2,\n",
    "                           n_informative=15,\n",
    "                           n_redundant=5,\n",
    "                           random_state=42)\n",
    "scaler = StandardScaler()\n",
    "X = scaler.fit_transform(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train/Validation/Test Splits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split into train (64%), val (16%), test (20%)\n",
    "X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feed Forward Neural Network Initialization\n",
    "model = tf.keras.Sequential([\n",
    "    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),\n",
    "    tf.keras.layers.Dense(32, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Hyperparameters Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# optimizer and loss setup\n",
    "model.compile(\n",
    "    optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),\n",
    "    loss='binary_crossentropy',\n",
    "    metrics=['accuracy']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning Rate Scheduler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Learning rate scheduler\n",
    "lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(\n",
    "    monitor='val_loss',   # metric to monitor\n",
    "    factor=0.5,           # reduce by a factor\n",
    "    patience=2,           # wait 2 epochs before reducing LR\n",
    "    min_lr=1e-5,          # don't reduce below this\n",
    "    verbose=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Early Stopping Logic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. Early stopping callback with patience and loss threshold\n",
    "early_stop = tf.keras.callbacks.EarlyStopping(\n",
    "    monitor='val_loss',\n",
    "    patience=3,\n",
    "    min_delta=0.01,            # minimum change to be considered an improvement\n",
    "    restore_best_weights=True,\n",
    "    verbose=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1/100\n",
      "40/40 - 1s - 29ms/step - accuracy: 0.8359 - loss: 0.3691 - val_accuracy: 0.9187 - val_loss: 0.2269 - learning_rate: 0.0500\n",
      "Epoch 2/100\n",
      "40/40 - 0s - 3ms/step - accuracy: 0.9102 - loss: 0.2240 - val_accuracy: 0.9438 - val_loss: 0.1643 - learning_rate: 0.0500\n",
      "Epoch 3/100\n",
      "40/40 - 0s - 3ms/step - accuracy: 0.9477 - loss: 0.1400 - val_accuracy: 0.9531 - val_loss: 0.1484 - learning_rate: 0.0500\n",
      "Epoch 4/100\n",
      "40/40 - 0s - 3ms/step - accuracy: 0.9547 - loss: 0.1338 - val_accuracy: 0.9344 - val_loss: 0.1857 - learning_rate: 0.0500\n",
      "Epoch 5/100\n",
      "\n",
      "Epoch 5: ReduceLROnPlateau reducing learning rate to 0.02500000037252903.\n",
      "40/40 - 0s - 3ms/step - accuracy: 0.9555 - loss: 0.1402 - val_accuracy: 0.9219 - val_loss: 0.1695 - learning_rate: 0.0500\n",
      "Epoch 6/100\n",
      "40/40 - 0s - 3ms/step - accuracy: 0.9688 - loss: 0.0904 - val_accuracy: 0.9656 - val_loss: 0.1186 - learning_rate: 0.0250\n",
      "Epoch 7/100\n",
      "40/40 - 0s - 3ms/step - accuracy: 0.9812 - loss: 0.0491 - val_accuracy: 0.9688 - val_loss: 0.1048 - learning_rate: 0.0250\n",
      "Epoch 8/100\n",
      "40/40 - 0s - 4ms/step - accuracy: 0.9922 - loss: 0.0317 - val_accuracy: 0.9563 - val_loss: 0.1213 - learning_rate: 0.0250\n",
      "Epoch 9/100\n",
      "\n",
      "Epoch 9: ReduceLROnPlateau reducing learning rate to 0.012500000186264515.\n",
      "40/40 - 0s - 3ms/step - accuracy: 0.9922 - loss: 0.0220 - val_accuracy: 0.9625 - val_loss: 0.1212 - learning_rate: 0.0250\n",
      "Epoch 10/100\n",
      "40/40 - 0s - 3ms/step - accuracy: 0.9953 - loss: 0.0177 - val_accuracy: 0.9563 - val_loss: 0.1283 - learning_rate: 0.0125\n",
      "Epoch 10: early stopping\n",
      "Restoring model weights from the end of the best epoch: 7.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.src.callbacks.history.History at 0x7f6c3ff320c0>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 4. Train the model\n",
    "model.fit(\n",
    "    X_train, y_train,\n",
    "    validation_data=(X_val, y_val),\n",
    "    epochs=100,\n",
    "    callbacks=[early_stop, lr_scheduler],  # your custom early stopping + LR scheduler\n",
    "    verbose=2\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 4ms/step \n",
      "\n",
      " Test Set Evaluation:\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.95      0.99      0.97       207\n",
      "           1       0.99      0.95      0.97       193\n",
      "\n",
      "    accuracy                           0.97       400\n",
      "   macro avg       0.97      0.97      0.97       400\n",
      "weighted avg       0.97      0.97      0.97       400\n",
      "\n",
      "Confusion Matrix:\n",
      "[[205   2]\n",
      " [ 10 183]]\n"
     ]
    }
   ],
   "source": [
    "# 5. Evaluate on test set\n",
    "y_pred_probs = model.predict(X_test).flatten()\n",
    "y_pred = (y_pred_probs >= 0.5).astype(int)\n",
    "\n",
    "\n",
    "print(\"\\n Test Set Evaluation:\")\n",
    "print(classification_report(y_test, y_pred))\n",
    "print(\"Confusion Matrix:\")\n",
    "print(confusion_matrix(y_test, y_pred))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}