This commit is contained in:
spham 2025-11-12 17:34:59 +01:00
commit 74650b6912
4 changed files with 659 additions and 0 deletions


@@ -0,0 +1,210 @@
{% steps %}
{% step title="Introduction to ML Model Generalization" %}
### Introduction
Welcome to the "Introduction to ML Model Generalization" lab!
This lab is designed to give you a foundational understanding of generalization in machine learning models, and how to prevent
overfitting and underfitting in your models.
### Learning Objectives
- Review generalization and why models should neither overfit nor underfit.
- Practice implementing early stopping and learning rate decay.
### Prerequisites
Familiarity with basic ML principles and key concepts such as learning rates and model structure.
{% /step %}
{% step title="Synthetic Data Generation" %}
### Synthetic Data Generation
Provided below is a basic function to create some synthetic data for classification. This dataset will have 2000 samples, each with 20
features: 15 informative features that directly influence the class label and 5 redundant features generated as linear combinations of
the informative ones. Classification is binary, so there are only two possible labels. Most importantly, the random state is fixed to
make the data generation repeatable.
```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Generate a repeatable synthetic binary-classification dataset
X, y = make_classification(n_samples=2000,
                           n_features=20,
                           n_classes=2,
                           n_informative=15,
                           n_redundant=5,
                           random_state=42)

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)
```
### Train/Validation/Test Splits
To split the data you will first separate the test set from the combined training/validation set. From there you will split the
training/validation set into its separate training and validation sets, resulting in a data distribution of train (64%), val (16%), test (20%).
```python
from sklearn.model_selection import train_test_split

# First hold out the test set (20%), then split the remainder into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=42)
```
In practice it is best to set aside your test dataset before model development begins and to keep it completely separate from training
and tuning, ensuring there is no data leakage and no overfitting to the test set.
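If you want to confirm the 64/16/20 proportions, a quick sanity check (a minimal sketch, assuming the variables from the snippets above are already defined) is to print the resulting shapes:
```python
# Expect roughly 1280 / 320 / 400 rows for train / val / test out of 2000 samples
print("Train:", X_train.shape, " Val:", X_val.shape, " Test:", X_test.shape)
print("Fractions:",
      len(X_train) / len(X),   # 0.64
      len(X_val) / len(X),     # 0.16
      len(X_test) / len(X))    # 0.20
```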
{% /step %}
{% step title="Model and Generalization Feature Setup" %}
### Introduction
For this lab you will use a basic feed-forward neural network, because neural networks let you add generalization features
such as learning rate schedulers and early stopping that more traditional models such as linear regression do not have.
### Model Setup
For this model you will use two dense hidden layers with ReLU activation functions, allowing more complex patterns to be learned,
and end the model with a single sigmoid output unit.
When setting up the model you could also include additional generalization techniques such as dropout, which randomly disables a
fraction of the neurons during each training step so that no single neuron becomes solely responsible for one aspect of the
prediction (a sketch of this is shown after the model definition below).
**Note:** ReLU is used to introduce non-linearity into the network's learning, and the sigmoid turns the final output into a probability for binary classification.
```python
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
```
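If you wanted to experiment with the dropout technique mentioned above, one possible variant of the same architecture (not used in the rest of this lab; the 0.2 rate is just an illustrative choice) could look like this:
```python
# Same network, with Dropout layers that randomly disable 20% of the
# previous layer's outputs on each training step
model_with_dropout = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```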
### Loss function and Model Initialization Parameters
For this lab you will be using the Adam optimizer as Adam is a good starting optimizer for most problems.
Adam takes one parameter which is the starting learning rate, Most
models begin with a learning rate under 0.3 and usually closer to 0.1 at most. Here you also define the loss function
as ```binary_crossentropy``` which is a simple loss function that just compares is the predicted values match
the actual values.
```python
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
loss='binary_crossentropy',
metrics=['accuracy']
)
```
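To make the loss more concrete, here is a small sketch (the label and probability values are made up for illustration) that computes binary cross-entropy by hand and compares it with the Keras implementation:
```python
import numpy as np
import tensorflow as tf

y_true = np.array([1.0, 0.0, 1.0])   # actual labels
y_pred = np.array([0.9, 0.2, 0.6])   # predicted probabilities from the sigmoid

# Binary cross-entropy: average of -[y*log(p) + (1-y)*log(1-p)]
manual_bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Keras' built-in loss should give (almost) the same number
keras_bce = tf.keras.losses.BinaryCrossentropy()(y_true, y_pred).numpy()

print(manual_bce, keras_bce)   # both around 0.28
```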
#### Learning Rate Scheduler
For your learning rate scheduler in this lab you will be using ```ReduceLROnPlateau``` from the Keras callbacks, which reduces the
learning rate whenever the monitored metric stops improving (plateaus), so training can keep making fine-grained progress instead of
overshooting the optimum. Below the function parameters are defined:
- ```monitor='val_loss'``` is the metric to watch, here the loss on the validation set
- ```factor``` is the multiplier applied to the learning rate each time it is reduced
- ```patience``` is the number of epochs without improvement to wait before reducing the learning rate
- ```min_lr``` defines the lowest possible value of the learning rate
- ```verbose``` set to 1 prints a message each time the learning rate is reduced; 0 keeps the callback silent
```python
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
monitor='val_loss', # metric to monitor
factor=0.5, # reduce by a factor
patience=2, # wait 2 epochs before reducing LR
min_lr=1e-5, # don't reduce below this
verbose=1
)
```
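To build intuition for how far this schedule can go, the sketch below simply applies the 0.5 factor repeatedly from the 0.05 starting point until ```min_lr``` is reached (this mirrors the arithmetic only, not the callback's internal logic):
```python
lr = 0.05
reductions = 0
while lr > 1e-5:
    print(f"after {reductions} reductions: lr = {lr:.6f}")
    lr = max(lr * 0.5, 1e-5)   # ReduceLROnPlateau never goes below min_lr
    reductions += 1
print(f"after {reductions} reductions: lr = {lr:.6f}")  # floor at min_lr = 1e-5
```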
#### Early Stopping
Next in the code you will implement the early stopping callback. This prevents the model from overfitting by halting training when the
monitored value stops improving by a sufficient amount and rolling back to the best weights seen so far. In your case the monitored
value is again the validation loss. A patience of 3 means training stops after 3 consecutive epochs without a sufficient improvement.
```min_delta``` defines how much the monitored value must improve (here, how much ```val_loss``` must decrease) for the change to
count as an improvement. Finally, ```restore_best_weights``` set to true restores the weights from the epoch with the best monitored
value, rather than keeping the weights from the final epoch. This functionality is important to ensure the model does not overfit to
the training data and keeps some ability to generalize.
```python
early_stop = tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=3,
min_delta=0.01, # minimum change to be considered an improvement
restore_best_weights=True,
verbose=1
)
```
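To see how patience and ```min_delta``` interact, here is a toy re-implementation of the callback's bookkeeping (a sketch only; the real callback also handles weight restoration), applied to an illustrative sequence of ```val_loss``` values similar to what you will see during training:
```python
val_losses = [0.227, 0.164, 0.148, 0.186, 0.170, 0.119, 0.105, 0.121, 0.121, 0.128]

best, wait, patience, min_delta = float('inf'), 0, 3, 0.01
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best - min_delta:   # counts as an improvement only if it beats the best by more than min_delta
        best, wait = loss, 0
    else:
        wait += 1
    print(f"epoch {epoch}: val_loss={loss:.3f}, best={best:.3f}, epochs without improvement={wait}")
    if wait >= patience:
        print(f"early stopping at epoch {epoch}; best val_loss was {best:.3f}")
        break
```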
{% /step %}
{% step title="Model training" %}
### Model Training
Finally, on to the model training. You will use the standard ```fit``` method and pass the validation set via ```validation_data```;
this is what allows ```val_loss``` to be computed and used by the learning rate scheduler and the early stopping callback. The epochs
are set to 100 and ```verbose``` is set to 2: this gives early stopping plenty of epochs to cut training short, and the line-by-line
output helps you follow how ```val_loss``` changes per epoch. As you run the model, pay close attention to the changes in ```val_loss```
and how they correlate with when the learning rate is reduced and when early stopping triggers and rolls back to the best weights.
```python
model.fit(
X_train, y_train,
validation_data=(X_val, y_val),
epochs=100,
callbacks=[early_stop, lr_scheduler], # your custom early stopping + LR scheduler
verbose=2
)
```
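If you want to see these dynamics visually, you can capture the ```History``` object returned by ```fit``` and plot the training and validation loss curves side by side. The sketch below assumes matplotlib is installed and re-runs training in order to keep the history:
```python
import matplotlib.pyplot as plt

# Re-run training, this time keeping the returned History object
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stop, lr_scheduler],
    verbose=0
)

# When val loss flattens or rises while train loss keeps falling, the model is starting to overfit
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('binary cross-entropy loss')
plt.legend()
plt.show()
```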
{% /step %}
{% step title="Evaluating Model Results" %}
### Evaluating Model Results
The following code provides a basic metric report for your neural network. Depending on the domain of the model, different levels of
accuracy are acceptable; it is more important to see a considerable improvement in predictions compared to existing methods than to hit
a particular accuracy threshold. Validation accuracy above 99.5% can be a bit concerning, as it may be a sign of overfitting,
and accuracy below previous methods may be a sign of underfitting.
```python
from sklearn.metrics import classification_report, confusion_matrix

# Convert predicted probabilities into hard 0/1 labels at a 0.5 threshold
y_pred_probs = model.predict(X_test).flatten()
y_pred = (y_pred_probs >= 0.5).astype(int)

print("\nTest Set Evaluation:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```
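As an additional overfitting check, you could compare accuracy on the training set with accuracy on the held-out test set; a large gap between the two is one of the clearest warning signs. A minimal sketch using ```evaluate```:
```python
# evaluate returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
# A much higher train accuracy than test accuracy suggests the model has memorized the training data
```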
{% /step %}
{% /steps %}

22
.vscode/settings.json vendored Normal file

@@ -0,0 +1,22 @@
{
"terminal.integrated.fontSize": 15,
"editor.fontSize": 15,
"terminal.integrated.defaultProfile.linux": "bash",
"workbench.colorTheme": "Default Dark Modern",
"workbench.startupEditor": "none",
"files.associations": {
"*.md": "markdoc"
},
"workspace": {
"view": "readme",
"terminals": [
{
"name": "Terminal",
"active": false
}
],
"files": [
"./lab.ipynb"
],
}
}

136
README.md Normal file

@@ -0,0 +1,136 @@
# ML Model Generalization Lab
## Lab Objective
This lab demonstrates the essential techniques for improving the **generalization** of a Machine Learning model and avoiding **overfitting**.
## Key Concepts
### 1. Data Split (Train/Validation/Test)
```
Total: 2000 samples
├── Train: 64% (1280 samples) - model training
├── Validation: 16% (320 samples) - hyperparameter tuning
└── Test: 20% (400 samples) - final evaluation
```
**Why 3 splits?**
- **Train**: learns the patterns
- **Validation**: detects overfitting during training
- **Test**: measures real performance on never-seen data
### 2. Early Stopping
```python
EarlyStopping(
monitor='val_loss',
patience=3,
min_delta=0.01,
restore_best_weights=True
)
```
**Role**: Stops training when the `val_loss` no longer improves
- Avoids overfitting by stopping before the model "memorizes" the data
- Restores the best weights (epoch 7 in our case)
### 3. Learning Rate Scheduler
```python
ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=2,
min_lr=1e-5
)
```
**Role**: Reduces the learning rate when learning stagnates
- Initial learning rate: 0.05
- Halved after 2 epochs without improvement
- Allows finer convergence toward the optimum
### 4. Network Architecture
```
Input (20 features)
Dense(64, relu)
Dense(32, relu)
Dense(1, sigmoid) → Binary probability
```
A simple but effective architecture for binary classification.
## Results
### Performance Metrics
| Metric | Value |
|----------|--------|
| Accuracy | 97% |
| Precision (class 0) | 95% |
| Precision (class 1) | 99% |
| Recall (class 0) | 99% |
| Recall (class 1) | 95% |
### Confusion Matrix
```
            Predicted
              0     1
Actual  0  [205     2]
        1  [ 10   183]
```
- **Correct predictions**: 205 + 183 = 388
- **Misclassifications**: 2 + 10 = 12
- **Error rate**: only 3%
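These percentages can be recomputed directly from the matrix; for example (a quick illustrative snippet, not part of the notebook):
```python
# Confusion matrix: rows = actual class, columns = predicted class
tn, fp, fn, tp = 205, 2, 10, 183

accuracy = (tn + tp) / (tn + fp + fn + tp)   # (205 + 183) / 400 = 0.97
precision_1 = tp / (tp + fp)                 # 183 / 185 ≈ 0.99
recall_1 = tp / (tp + fn)                    # 183 / 193 ≈ 0.95
print(accuracy, precision_1, recall_1)
```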
## Key Takeaways
### ✅ Good Practices Applied
1. **Always split the data** into 3 distinct sets
2. **Use the validation set** to monitor overfitting in real time
3. **Early stopping** is crucial to avoid overfitting
4. **An adaptive learning rate** improves convergence
5. **Normalizing the features** with StandardScaler stabilizes training
### 📊 Signs of Good Generalization
- ✅ Similar performance on train and test
- ✅ `val_loss` stabilizes without diverging
- ✅ The model stops before it starts overfitting (epoch 10/100)
- ✅ Balanced metrics across the classes
### ⚠️ Signs of Overfitting (absent here)
- ❌ Train accuracy >> Test accuracy
- ❌ `val_loss` increases while `train_loss` decreases
- ❌ Degraded performance on new data
## Running the Lab
```bash
# Activate the virtual environment
source venv/bin/activate
# Launch Jupyter
jupyter notebook lab.ipynb
```
## Technologies Used
- **TensorFlow/Keras**: building and training the neural network
- **Scikit-learn**: data generation, preprocessing, metrics
- **Python 3.12**: programming language
## Conclusion
This lab shows that a well-regularized model with early stopping and learning rate scheduling can achieve excellent performance (97% accuracy) while generalizing correctly to unseen data.
**Fundamental principle**: A good model does not memorize the data, it learns the general patterns.

291
lab.ipynb Normal file

@@ -0,0 +1,291 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-11-12 17:18:24.255077: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.\n",
"2025-11-12 17:18:24.312342: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
"To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"2025-11-12 17:18:25.689783: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.\n"
]
}
],
"source": [
"import tensorflow as tf\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics import classification_report, confusion_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Synthetic Data Generation"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# synthetic data generation of 2000 samples\n",
"X, y = make_classification(n_samples=2000,\n",
" n_features=20, \n",
" n_classes=2, \n",
" n_informative=15, \n",
" n_redundant=5, \n",
" random_state=42)\n",
"scaler = StandardScaler()\n",
"X = scaler.fit_transform(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train/Validation/Test Splits"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Split into train (64%), val (16%), test (20%)\n",
"X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model Setup"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Feed Forward Neural Network Initalization\n",
"model = tf.keras.Sequential([\n",
" tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),\n",
" tf.keras.layers.Dense(32, activation='relu'),\n",
" tf.keras.layers.Dense(1, activation='sigmoid')\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training Hyperparameters Setup"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"#optimizer and loss setup\n",
"model.compile(\n",
" optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),\n",
" loss='binary_crossentropy',\n",
" metrics=['accuracy']\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learning Rate Scheduler"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Learning rate scheduler\n",
"lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(\n",
" monitor='val_loss', # metric to monitor\n",
" factor=0.5, # reduce by a factor\n",
" patience=2, # wait 2 epochs before reducing LR\n",
" min_lr=1e-5, # don't reduce below this\n",
" verbose=1\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Early Stopping Logic"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# 3. Early stopping callback with patience and loss threshold\n",
"early_stop = tf.keras.callbacks.EarlyStopping(\n",
" monitor='val_loss',\n",
" patience=3,\n",
" min_delta=0.01, # minimum change to be considered an improvement\n",
" restore_best_weights=True,\n",
" verbose=1\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Model Training"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/100\n",
"40/40 - 1s - 29ms/step - accuracy: 0.8359 - loss: 0.3691 - val_accuracy: 0.9187 - val_loss: 0.2269 - learning_rate: 0.0500\n",
"Epoch 2/100\n",
"40/40 - 0s - 3ms/step - accuracy: 0.9102 - loss: 0.2240 - val_accuracy: 0.9438 - val_loss: 0.1643 - learning_rate: 0.0500\n",
"Epoch 3/100\n",
"40/40 - 0s - 3ms/step - accuracy: 0.9477 - loss: 0.1400 - val_accuracy: 0.9531 - val_loss: 0.1484 - learning_rate: 0.0500\n",
"Epoch 4/100\n",
"40/40 - 0s - 3ms/step - accuracy: 0.9547 - loss: 0.1338 - val_accuracy: 0.9344 - val_loss: 0.1857 - learning_rate: 0.0500\n",
"Epoch 5/100\n",
"\n",
"Epoch 5: ReduceLROnPlateau reducing learning rate to 0.02500000037252903.\n",
"40/40 - 0s - 3ms/step - accuracy: 0.9555 - loss: 0.1402 - val_accuracy: 0.9219 - val_loss: 0.1695 - learning_rate: 0.0500\n",
"Epoch 6/100\n",
"40/40 - 0s - 3ms/step - accuracy: 0.9688 - loss: 0.0904 - val_accuracy: 0.9656 - val_loss: 0.1186 - learning_rate: 0.0250\n",
"Epoch 7/100\n",
"40/40 - 0s - 3ms/step - accuracy: 0.9812 - loss: 0.0491 - val_accuracy: 0.9688 - val_loss: 0.1048 - learning_rate: 0.0250\n",
"Epoch 8/100\n",
"40/40 - 0s - 4ms/step - accuracy: 0.9922 - loss: 0.0317 - val_accuracy: 0.9563 - val_loss: 0.1213 - learning_rate: 0.0250\n",
"Epoch 9/100\n",
"\n",
"Epoch 9: ReduceLROnPlateau reducing learning rate to 0.012500000186264515.\n",
"40/40 - 0s - 3ms/step - accuracy: 0.9922 - loss: 0.0220 - val_accuracy: 0.9625 - val_loss: 0.1212 - learning_rate: 0.0250\n",
"Epoch 10/100\n",
"40/40 - 0s - 3ms/step - accuracy: 0.9953 - loss: 0.0177 - val_accuracy: 0.9563 - val_loss: 0.1283 - learning_rate: 0.0125\n",
"Epoch 10: early stopping\n",
"Restoring model weights from the end of the best epoch: 7.\n"
]
},
{
"data": {
"text/plain": [
"<keras.src.callbacks.history.History at 0x7f6c3ff320c0>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 4. Train the model\n",
"model.fit(\n",
" X_train, y_train,\n",
" validation_data=(X_val, y_val),\n",
" epochs=100,\n",
" callbacks=[early_stop, lr_scheduler], # your custom early stopping + LR scheduler\n",
" verbose=2\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"evaluation metrics"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 4ms/step \n",
"\n",
" Test Set Evaluation:\n",
" precision recall f1-score support\n",
"\n",
" 0 0.95 0.99 0.97 207\n",
" 1 0.99 0.95 0.97 193\n",
"\n",
" accuracy 0.97 400\n",
" macro avg 0.97 0.97 0.97 400\n",
"weighted avg 0.97 0.97 0.97 400\n",
"\n",
"Confusion Matrix:\n",
"[[205 2]\n",
" [ 10 183]]\n"
]
}
],
"source": [
"# 5. Evaluate on test set\n",
"y_pred_probs = model.predict(X_test).flatten()\n",
"y_pred = (y_pred_probs >= 0.5).astype(int)\n",
"\n",
"\n",
"print(\"\\n Test Set Evaluation:\")\n",
"print(classification_report(y_test, y_pred))\n",
"print(\"Confusion Matrix:\")\n",
"print(confusion_matrix(y_test, y_pred))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}