{% steps %}
{% step title="Introduction to Reproducible Workflows" %}
### Introduction
Welcome to the "Introduction to Reproducible Workflows" lab!

This lab is designed to give you a foundational understanding of how to create reproducible workflows for training an AI model, why reproducibility matters, and where in model training to define fixed seeds.
### Learning Objectives

- Identify key areas within AI model workflows where fixed seeds should be defined
- Review how to save datasets after train/test splits
- Practice reloading saved models and datasets to reproduce results
### Prerequisites

- A high-level understanding of neural networks
- Experience training models

{% /step %}
{% step title="Creating Reproducible Datasets" %}
### Creating Reproducible Datasets

To create a repeatable workflow, you will need to start by defining the random seeds. The random seed determines how data is generated, how it is split, and how the resulting train/test datasets are fed into the model. Fixing your starting seed helps ensure that all aspects of the workflow are repeatable. The code below sets the random seed for several different libraries used in model development.

```python
import random
import numpy as np
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
```
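
Note that fixed seeds alone may not be enough on some hardware, since certain operations (particularly on GPUs) use non-deterministic kernels. If you need stricter guarantees, recent versions of TensorFlow offer an opt-in deterministic mode; it is optional for this lab and can slow training down.

```python
# Optional: force TensorFlow to use deterministic kernels where available.
# Ops without a deterministic implementation will raise an error instead.
tf.config.experimental.enable_op_determinism()
```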

From here, you can define the synthetic data generation, similar to how it is defined in other labs.

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=42
)
```
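
As a quick, optional sanity check, regenerating the data with the same `random_state` should produce identical arrays (the `X_again`/`y_again` names here are just for illustration):

```python
# Regenerating with the same random_state yields identical data.
X_again, y_again = make_classification(
    n_samples=1000, n_features=20, n_informative=15,
    n_redundant=5, n_classes=2, random_state=42
)
assert (X == X_again).all() and (y == y_again).all()
```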
{% /step %}
{% step title="Training Reproducible Model" %}
### Training Reproducible Models

For the next step of model creation, you will need to split your data into training and test datasets. The code below splits the initial dataset into training and test sets. The key parameter to focus on here is `random_state`, which defines the random seed for the split. Fixing this seed ensures that future runs of training use the same training dataset, and therefore produce the same model.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

You can also save the `train` and `test` data splits to preserve reproducibility in cases where the dataset you are using may be a subset of a larger dataset.

```python
import joblib

joblib.dump((X_train, y_train), 'train_data.pkl')
joblib.dump((X_test, y_test), 'test_data.pkl')
```
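
If you plan to preprocess new data in a fresh environment, it can also help to persist the fitted scaler alongside the splits (the filename here is just an example):

```python
# Saving the fitted scaler lets a new environment apply the exact
# same normalization the model was trained with.
joblib.dump(scaler, 'scaler.pkl')
```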

With the datasets saved and reproducible, you can move on to defining and training your model. In this lab you will define and train a model, evaluate it, and compare it to a model trained on the same training dataset to confirm that it is properly reproducible. All of the code for training and evaluating the model is provided within the lab; a minimal sketch of what it might look like follows.
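
The architecture and hyperparameters below are illustrative assumptions, not the lab's actual model. With the seeds fixed earlier, weight initialization and batch shuffling are deterministic, so repeated runs produce the same trained model.

```python
from tensorflow import keras

# Illustrative architecture only; the lab provides the actual model code.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Deterministic given the fixed seeds: same initial weights, same shuffling.
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```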
{% /step %}
{% step title="Saving Model and Dataset" %}
### Saving Model and Dataset

Now that your model has been trained and evaluated, you can save it with the following code:

```python
model.save('my_model.keras')
```

From there, you can move on to reproducing the model, which is meant to mimic loading the model into an entirely new environment. You can load the previously saved model and datasets with the code below, including the required imports.

```python
from tensorflow.keras.models import load_model
import joblib

# Reload the saved model and the exact train/test splits
model_reloaded = load_model('my_model.keras')
X_train_reloaded, y_train_reloaded = joblib.load('train_data.pkl')
X_test_reloaded, y_test_reloaded = joblib.load('test_data.pkl')
```

As you re-evaluate the model on the exact same training and test sets using the previously defined evaluation code, you should see results that exactly match those of the previously evaluated model.

```python
loss, accuracy = model_reloaded.evaluate(X_test_reloaded, y_test_reloaded)
print(f"Test accuracy: {accuracy:.2f}")
```
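
If the original model object is still in memory, you can make the comparison explicit. This is a minimal sketch, assuming `model`, `X_test`, and `y_test` from the earlier steps are still defined:

```python
# The reloaded model and data should reproduce the original metrics exactly.
orig_loss, orig_accuracy = model.evaluate(X_test, y_test)
assert abs(accuracy - orig_accuracy) < 1e-9
```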
{% /step %}
{% /steps %}