commit 74650b6912b36c6f685c07b7ad594cbe0cf8ba70
Author: spham
Date:   Wed Nov 12 17:34:59 2025 +0100

    init

diff --git a/.instructions/INSTRUCTIONS.mdoc b/.instructions/INSTRUCTIONS.mdoc
new file mode 100644
index 0000000..92aeba6
--- /dev/null
+++ b/.instructions/INSTRUCTIONS.mdoc
@@ -0,0 +1,210 @@
+{% steps %}
+{% step title="Introduction to ML Model Generalization" %}
+
+### Introduction
+
+Welcome to the "Introduction to ML Model Generalization" lab! This lab is designed to give you a foundational understanding of generalization in machine learning models and of how to prevent overfitting or underfitting.
+
+### Learning Objectives
+
+- Review generalization and why models should neither overfit nor underfit.
+- Practice implementing early stopping and learning rate decay.
+
+### Prerequisites
+
+Familiarity with basic ML principles and key concepts such as learning rates and model structure.
+
+{% /step %}
+
+{% step title="Synthetic Data Generation" %}
+
+### Synthetic Data Generation
+
+Provided below is a basic call that creates synthetic classification data. The dataset has 2000 samples, each with 20 features: 15 are informative (directly related to the class label) and 5 are redundant (linear combinations of the informative ones). The target is binary, so there are only two possible classes. Most importantly, the random state is fixed so that the data generation is repeatable.
+
+```python
+X, y = make_classification(n_samples=2000,
+                           n_features=20,
+                           n_classes=2,
+                           n_informative=15,
+                           n_redundant=5,
+                           random_state=42)
+scaler = StandardScaler()
+X = scaler.fit_transform(X)
+```
+
+### Train/Validation/Test Splits
+
+To split the data, you first separate the test set from the training/validation pool. You then split that pool into separate training and validation sets, giving a final distribution of train (64%), validation (16%), and test (20%).
+
+```python
+X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=42)
+```
+
+In practice it is best to set your test dataset aside before model development begins and to keep it entirely out of the training process, so that there is no data leakage and the model cannot overfit to it.
+
+{% /step %}
+
+{% step title="Model and Generalization Feature Setup" %}
+
+### Introduction
+
+For this lab you will use a basic feed-forward neural network, because neural networks let you add features such as learning rate schedulers and early stopping that more traditional models such as linear regression do not have.
+
+### Model Setup
+
+For this model you will use two dense hidden layers with ReLU activation functions, allowing more complex patterns to be learned, and finish the model with a sigmoid output layer. When setting up the model you could also include additional generalization techniques such as dropout, which randomly disables a fraction of neurons during training so that no single neuron becomes solely responsible for one aspect of the prediction (see the optional sketch after the model definition below).
+
+**Note:** ReLU is used to introduce non-linearity into the network's learning, and sigmoid squashes the output into a probability used for binary classification.
+
+```python
+model = tf.keras.Sequential([
+    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
+    tf.keras.layers.Dense(32, activation='relu'),
+    tf.keras.layers.Dense(1, activation='sigmoid')
+])
+```
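+
+As a quick illustration of the dropout idea mentioned above, the sketch below shows what the same architecture could look like with dropout layers added. This variant is not used in the rest of the lab, and the 0.3 rate is only an illustrative assumption.
+
+```python
+# Hypothetical dropout variant of the model above (not used later in this lab).
+model_with_dropout = tf.keras.Sequential([
+    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
+    tf.keras.layers.Dropout(0.3),  # randomly drop 30% of this layer's outputs each training step
+    tf.keras.layers.Dense(32, activation='relu'),
+    tf.keras.layers.Dropout(0.3),
+    tf.keras.layers.Dense(1, activation='sigmoid')
+])
+```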
+
+### Loss function and Model Initialization Parameters
+
+For this lab you will use the Adam optimizer, as Adam is a good starting optimizer for most problems. The only parameter set here is the initial learning rate; starting values are typically small (the Keras default for Adam is 0.001), and this lab deliberately starts higher, at 0.05, so that the learning rate scheduler defined below has room to work. Here you also define the loss function as ```binary_crossentropy```, which measures how far the predicted probabilities are from the actual labels.
+
+```python
+model.compile(
+    optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
+    loss='binary_crossentropy',
+    metrics=['accuracy']
+)
+```
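+
+To make the loss concrete, here is a minimal sketch (not part of the lab's code) of how binary cross-entropy scores a single prediction: a confident correct prediction gets a small loss, while a confident wrong prediction is penalized heavily. The ```bce``` helper below is purely illustrative.
+
+```python
+import numpy as np
+
+# Illustrative only: binary cross-entropy for one true label and one predicted probability.
+def bce(y_true, y_prob):
+    y_prob = np.clip(y_prob, 1e-7, 1 - 1e-7)  # avoid log(0)
+    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
+
+print(bce(1, 0.95))  # ~0.05: confident and correct -> small loss
+print(bce(1, 0.05))  # ~3.00: confident and wrong  -> large loss
+```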
+
+#### Learning Rate Scheduler
+
+For your learning rate scheduler in this lab you will use the ```ReduceLROnPlateau``` callback from Keras, which reduces the learning rate whenever the monitored metric stops improving (plateaus). Smaller steps late in training allow finer convergence, and the ```min_lr``` floor ensures the learning rate never shrinks to the point where training effectively halts. The parameters are:
+- ```monitor``` is the metric to watch, here ```val_loss```, the loss computed on the validation set
+- ```factor``` is the multiplier applied to the learning rate each time it is reduced (0.5 halves it)
+- ```patience``` is the number of epochs without improvement to wait before reducing the learning rate
+- ```min_lr``` defines the lowest learning rate the callback will reduce to
+- ```verbose``` set to 1 prints a message each time the learning rate is reduced, while 0 keeps the callback silent
+
+```python
+lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
+    monitor='val_loss',  # metric to monitor
+    factor=0.5,          # reduce by a factor
+    patience=2,          # wait 2 epochs before reducing LR
+    min_lr=1e-5,         # don't reduce below this
+    verbose=1
+)
+```
+
+#### Early Stopping
+
+Next you will implement the early stopping aspect of the model. This callback helps prevent overfitting by stopping training once the monitored value, again the validation loss, stops improving by a sufficient amount. A patience of 3 means training is allowed 3 consecutive epochs without a large enough improvement before it stops. ```min_delta``` defines how much the monitored value must change to count as an improvement. Finally, ```restore_best_weights``` set to true restores the weights from the epoch with the best monitored value, rather than keeping the weights from the final, worse epochs. This behaviour is important to ensure the model does not overfit to the training data and keeps its ability to generalize.
+
+```python
+early_stop = tf.keras.callbacks.EarlyStopping(
+    monitor='val_loss',
+    patience=3,
+    min_delta=0.01,  # minimum change to be considered an improvement
+    restore_best_weights=True,
+    verbose=1
+)
+```
+
+{% /step %}
+
+{% step title="Model training" %}
+
+### Model Training
+
+Finally, on to model training. You will use the basic ```fit``` method and pass in the validation set; this is what lets ```val_loss``` be computed and used by the learning rate scheduler and the early stopping mechanism. Here ```epochs``` is set to 100 and ```verbose``` to 2: 100 epochs gives the run plenty of room to stop early, and the line-by-line training output helps you follow how ```val_loss``` changes each epoch. As you run the model, pay close attention to how the changes in ```val_loss``` line up with when the learning rate is reduced, when early stopping triggers, and when the best weights are restored.
+
+```python
+model.fit(
+    X_train, y_train,
+    validation_data=(X_val, y_val),
+    epochs=100,
+    callbacks=[early_stop, lr_scheduler],  # your custom early stopping + LR scheduler
+    verbose=2
+)
+```
+
+{% /step %}
+
+{% step title="Evaluating Model Results" %}
+
+### Evaluating Model Results
+
+The following code provides a basic metric report for your neural network. Depending on the model's domain, different levels of accuracy are acceptable; it matters more that predictions improve considerably over existing methods than that they hit a particular accuracy threshold. Validation accuracy above 99.5% can be a bit concerning, as it may be a sign of overfitting, while accuracy below previous methods may be a sign of underfitting.
+
+```python
+y_pred_probs = model.predict(X_test).flatten()
+y_pred = (y_pred_probs >= 0.5).astype(int)
+
+print("\n Test Set Evaluation:")
+print(classification_report(y_test, y_pred))
+print("Confusion Matrix:")
+print(confusion_matrix(y_test, y_pred))
+```
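+
+One more optional way to check generalization directly is to compare accuracy on the training set with accuracy on the held-out test set; a training accuracy far above the test accuracy is the classic symptom of overfitting. The snippet below is a sketch rather than part of the lab's committed code, and it assumes the ```model``` and data splits from the previous steps are still in scope.
+
+```python
+# Optional generalization check (sketch): compare train vs. test accuracy.
+# Assumes model, X_train, y_train, X_test, y_test from the earlier steps.
+train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
+test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
+print(f"Train accuracy: {train_acc:.3f}")
+print(f"Test accuracy:  {test_acc:.3f}")
+# A large gap (train much higher than test) would indicate overfitting.
+```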
+
+{% /step %}
+{% /steps %}
diff --git a/.vscode/settings.json b/.vscode/settings.json
new file mode 100644
index 0000000..5d8e9de
--- /dev/null
+++ b/.vscode/settings.json
@@ -0,0 +1,22 @@
+{
+    "terminal.integrated.fontSize": 15,
+    "editor.fontSize": 15,
+    "terminal.integrated.defaultProfile.linux": "bash",
+    "workbench.colorTheme": "Default Dark Modern",
+    "workbench.startupEditor": "none",
+    "files.associations": {
+        "*.md": "markdoc"
+    },
+    "workspace": {
+        "view": "readme",
+        "terminals": [
+            {
+                "name": "Terminal",
+                "active": false
+            }
+        ],
+        "files": [
+            "./lab.ipynb"
+        ],
+    }
+}
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..5ee5da3
--- /dev/null
+++ b/README.md
@@ -0,0 +1,136 @@
+# ML Model Generalization Lab
+
+## Lab Objective
+
+This lab demonstrates the essential techniques for improving the **generalization** of a Machine Learning model and avoiding **overfitting**.
+
+## Key Concepts
+
+### 1. Data Split (Train/Validation/Test)
+
+```
+Total: 2000 samples
+├── Train: 64% (1280 samples) - model training
+├── Validation: 16% (320 samples) - hyperparameter tuning
+└── Test: 20% (400 samples) - final evaluation
+```
+
+**Why 3 splits?**
+- **Train**: learns the patterns
+- **Validation**: detects overfitting during training
+- **Test**: measures real performance on never-seen data
+
+### 2. Early Stopping
+
+```python
+EarlyStopping(
+    monitor='val_loss',
+    patience=3,
+    min_delta=0.01,
+    restore_best_weights=True
+)
+```
+
+**Role**: stops training when `val_loss` stops improving
+- Avoids overfitting by stopping before the model "memorizes" the data
+- Restores the best weights (epoch 7 in our case)
+
+### 3. Learning Rate Scheduler
+
+```python
+ReduceLROnPlateau(
+    monitor='val_loss',
+    factor=0.5,
+    patience=2,
+    min_lr=1e-5
+)
+```
+
+**Role**: reduces the learning rate when learning stagnates
+- Initial learning rate: 0.05
+- Halved after 2 epochs without improvement
+- Allows finer convergence towards the optimum
+
+### 4. Network Architecture
+
+```
+Input (20 features)
+    ↓
+Dense(64, relu)
+    ↓
+Dense(32, relu)
+    ↓
+Dense(1, sigmoid) → binary probability
+```
+
+A simple but effective architecture for binary classification.
+
+## Results Obtained
+
+### Performance Metrics
+
+| Metric | Value |
+|----------|--------|
+| Accuracy | 97% |
+| Precision (class 0) | 95% |
+| Precision (class 1) | 99% |
+| Recall (class 0) | 99% |
+| Recall (class 1) | 95% |
+
+### Confusion Matrix
+
+```
+              Predictions
+               0    1
+Actual  0   [205    2]
+        1   [ 10  183]
+```
+
+- **Correct predictions**: 205 + 183 = 388
+- **Errors**: 2 + 10 = 12 (2 false positives, 10 false negatives)
+- **Error rate**: only 3%
+
+## Key Takeaways
+
+### ✅ Good Practices Applied
+
+1. **Always split the data** into 3 distinct sets
+2. **Use the validation set** to monitor overfitting in real time
+3. **Early stopping** is crucial to avoid overtraining
+4. **An adaptive learning rate** improves convergence
+5. **Normalizing** the features with StandardScaler stabilizes learning
+
+### 📊 Signs of Good Generalization
+
+- ✅ Similar performance on train and test
+- ✅ `val_loss` stabilizes without diverging
+- ✅ The model stops before it starts overfitting (epoch 10/100)
+- ✅ Balanced metrics across the classes
+
+### ⚠️ Signs of Overfitting (absent here)
+
+- ❌ Train accuracy >> Test accuracy
+- ❌ `val_loss` rises while `train_loss` keeps falling
+- ❌ Degraded performance on new data
+
+## Running the Lab
+
+```bash
+# Activate the virtual environment
+source venv/bin/activate

+# Launch Jupyter
+jupyter notebook lab.ipynb
+```
+
+## Technologies Used
+
+- **TensorFlow/Keras**: building and training the neural network
+- **Scikit-learn**: data generation, preprocessing, metrics
+- **Python 3.12**: programming language
+
+## Conclusion
+
+This lab shows that a well-regularized model with early stopping and learning rate scheduling can reach excellent performance (97%) while still generalizing correctly to unseen data.
+
+**Fundamental principle**: a good model does not memorize the data, it learns the general patterns.
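+
+## Going Further (optional)
+
+As an optional extra, not part of the committed notebook, you can visualize the generalization behaviour described above by plotting the training and validation loss curves from the `History` object returned by `model.fit`. The `plot_learning_curves` helper below is a hypothetical sketch and assumes matplotlib is available in the environment.
+
+```python
+import matplotlib.pyplot as plt
+
+# Hypothetical helper: assumes the notebook captured `history = model.fit(...)`.
+def plot_learning_curves(history):
+    plt.plot(history.history['loss'], label='train loss')
+    plt.plot(history.history['val_loss'], label='val loss')
+    plt.xlabel('epoch')
+    plt.ylabel('binary cross-entropy loss')
+    plt.title('Train vs. validation loss')
+    plt.legend()
+    plt.show()
+```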
diff --git a/lab.ipynb b/lab.ipynb new file mode 100644 index 0000000..30e5d81 --- /dev/null +++ b/lab.ipynb @@ -0,0 +1,291 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-11-12 17:18:24.255077: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.\n", + "2025-11-12 17:18:24.312342: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "2025-11-12 17:18:25.689783: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.\n" + ] + } + ], + "source": [ + "import tensorflow as tf\n", + "from sklearn.datasets import make_classification\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.metrics import classification_report, confusion_matrix" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Synthetic Data Generation" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# synthetic data generation of 2000 samples\n", + "X, y = make_classification(n_samples=2000,\n", + " n_features=20, \n", + " n_classes=2, \n", + " n_informative=15, \n", + " n_redundant=5, \n", + " random_state=42)\n", + "scaler = StandardScaler()\n", + "X = scaler.fit_transform(X)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Train/Validation/Test Splits" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Split into train (64%), val (16%), test (20%)\n", + "X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", + "X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=42)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Model Setup" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# Feed Forward Neural Network Initalization\n", + "model = tf.keras.Sequential([\n", + " tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),\n", + " tf.keras.layers.Dense(32, activation='relu'),\n", + " tf.keras.layers.Dense(1, activation='sigmoid')\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Training Hyperparameters Setup" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "#optimizer and loss setup\n", + "model.compile(\n", + " optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),\n", + " loss='binary_crossentropy',\n", + " metrics=['accuracy']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learning Rate Scheduler" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# Learning rate scheduler\n", + "lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(\n", + " monitor='val_loss', # metric to monitor\n", + " factor=0.5, # reduce by a factor\n", + " patience=2, # wait 2 epochs before reducing LR\n", + " 
min_lr=1e-5, # don't reduce below this\n", + " verbose=1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Early Stopping Logic" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# 3. Early stopping callback with patience and loss threshold\n", + "early_stop = tf.keras.callbacks.EarlyStopping(\n", + " monitor='val_loss',\n", + " patience=3,\n", + " min_delta=0.01, # minimum change to be considered an improvement\n", + " restore_best_weights=True,\n", + " verbose=1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Model Training" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1/100\n", + "40/40 - 1s - 29ms/step - accuracy: 0.8359 - loss: 0.3691 - val_accuracy: 0.9187 - val_loss: 0.2269 - learning_rate: 0.0500\n", + "Epoch 2/100\n", + "40/40 - 0s - 3ms/step - accuracy: 0.9102 - loss: 0.2240 - val_accuracy: 0.9438 - val_loss: 0.1643 - learning_rate: 0.0500\n", + "Epoch 3/100\n", + "40/40 - 0s - 3ms/step - accuracy: 0.9477 - loss: 0.1400 - val_accuracy: 0.9531 - val_loss: 0.1484 - learning_rate: 0.0500\n", + "Epoch 4/100\n", + "40/40 - 0s - 3ms/step - accuracy: 0.9547 - loss: 0.1338 - val_accuracy: 0.9344 - val_loss: 0.1857 - learning_rate: 0.0500\n", + "Epoch 5/100\n", + "\n", + "Epoch 5: ReduceLROnPlateau reducing learning rate to 0.02500000037252903.\n", + "40/40 - 0s - 3ms/step - accuracy: 0.9555 - loss: 0.1402 - val_accuracy: 0.9219 - val_loss: 0.1695 - learning_rate: 0.0500\n", + "Epoch 6/100\n", + "40/40 - 0s - 3ms/step - accuracy: 0.9688 - loss: 0.0904 - val_accuracy: 0.9656 - val_loss: 0.1186 - learning_rate: 0.0250\n", + "Epoch 7/100\n", + "40/40 - 0s - 3ms/step - accuracy: 0.9812 - loss: 0.0491 - val_accuracy: 0.9688 - val_loss: 0.1048 - learning_rate: 0.0250\n", + "Epoch 8/100\n", + "40/40 - 0s - 4ms/step - accuracy: 0.9922 - loss: 0.0317 - val_accuracy: 0.9563 - val_loss: 0.1213 - learning_rate: 0.0250\n", + "Epoch 9/100\n", + "\n", + "Epoch 9: ReduceLROnPlateau reducing learning rate to 0.012500000186264515.\n", + "40/40 - 0s - 3ms/step - accuracy: 0.9922 - loss: 0.0220 - val_accuracy: 0.9625 - val_loss: 0.1212 - learning_rate: 0.0250\n", + "Epoch 10/100\n", + "40/40 - 0s - 3ms/step - accuracy: 0.9953 - loss: 0.0177 - val_accuracy: 0.9563 - val_loss: 0.1283 - learning_rate: 0.0125\n", + "Epoch 10: early stopping\n", + "Restoring model weights from the end of the best epoch: 7.\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# 4. 
Train the model\n", + "model.fit(\n", + " X_train, y_train,\n", + " validation_data=(X_val, y_val),\n", + " epochs=100,\n", + " callbacks=[early_stop, lr_scheduler], # your custom early stopping + LR scheduler\n", + " verbose=2\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "evaluation metrics" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 4ms/step \n", + "\n", + " Test Set Evaluation:\n", + " precision recall f1-score support\n", + "\n", + " 0 0.95 0.99 0.97 207\n", + " 1 0.99 0.95 0.97 193\n", + "\n", + " accuracy 0.97 400\n", + " macro avg 0.97 0.97 0.97 400\n", + "weighted avg 0.97 0.97 0.97 400\n", + "\n", + "Confusion Matrix:\n", + "[[205 2]\n", + " [ 10 183]]\n" + ] + } + ], + "source": [ + "# 5. Evaluate on test set\n", + "y_pred_probs = model.predict(X_test).flatten()\n", + "y_pred = (y_pred_probs >= 0.5).astype(int)\n", + "\n", + "\n", + "print(\"\\n Test Set Evaluation:\")\n", + "print(classification_report(y_test, y_pred))\n", + "print(\"Confusion Matrix:\")\n", + "print(confusion_matrix(y_test, y_pred))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}