{% steps %}

{% step title="Introduction to Reproducible Workflows" %}

### Introduction

Welcome to the "Introduction to Reproducible Workflows" lab! This lab is designed to give you a foundational understanding of reproducible workflows for training an AI model, why they matter, and the key points in model training where fixed seeds need to be defined.

### Learning Objectives

- Identify key areas within AI model workflows where fixed seeds should be defined
- Review saving datasets after train/test splits
- Practice recovering models and training datasets to repeat results

### Prerequisites

- A high-level understanding of neural networks
- Experience training models

{% /step %}

{% step title="Creating Reproducible Datasets" %}

### Creating Reproducible Datasets

To create a repeatable workflow, you will need to start by defining the random seeds. The random seed determines how data is generated, how it is split into train/test datasets, and how those datasets are fed into the model. Fixing the seed at the start helps ensure that every stage of the workflow is repeatable.

The code below sets the random seed for the different libraries used in model development.

```python
import random
import numpy as np
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
```

From here you can define the synthetic data generation, similar to how it is defined in other labs.

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=42
)
```

{% /step %}

{% step title="Training Reproducible Models" %}

### Training Reproducible Models

For the next step of model creation, you will need to split your data into training and test datasets. The code below splits the initial dataset and standardizes the features. The key parameter to focus on here is `random_state`, which defines the random seed for the split. By fixing this seed, you ensure that future training runs use the same training dataset, and therefore produce the same model.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

You can also save the `train` and `test` data splits, which keeps the workflow reproducible even when the dataset you are using is a subset of a larger dataset.

```python
import joblib

joblib.dump((X_train, y_train), 'train_data.pkl')
joblib.dump((X_test, y_test), 'test_data.pkl')
```

With the datasets saved and reproducible, you can move forward with defining and training your model. In this lab you will train the model yourself so that you can evaluate it and compare it to a model trained on the same training dataset, confirming that the results are reproducible. All of the training and evaluation code is provided within the lab; a minimal sketch of what it might look like follows below.
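
As an illustration only, the sketch below assumes a small Keras `Sequential` binary classifier; the layer sizes, optimizer, and training settings are placeholder values rather than the lab's actual configuration. The point is that, with the seeds fixed above and the same saved training split, repeating this training run should yield the same trained model.

```python
import tensorflow as tf

# Hypothetical architecture -- the lab provides the actual model definition.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Because the seeds were fixed earlier, refitting on the same
# X_train/y_train should reproduce the same weights.
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Note that strict bit-for-bit reproducibility can also depend on the hardware and TensorFlow build; if you need fully deterministic training, TensorFlow additionally offers `tf.keras.utils.set_random_seed` and `tf.config.experimental.enable_op_determinism()`.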
{% /step %}

{% step title="Saving Model and Dataset" %}

### Saving Model and Dataset

Now that your model has been trained and evaluated, you can save it with the following code.

```python
model.save('my_model.keras')
```

From there you can move on to reproducing the model, which mimics loading the model into an entirely new environment. You can load the previously saved model and datasets with the code below, which includes the required imports.

```python
from tensorflow.keras.models import load_model
import joblib

modelReloaded = load_model('my_model.keras')
X_train_reloaded, y_train_reloaded = joblib.load('train_data.pkl')
X_test_reloaded, y_test_reloaded = joblib.load('test_data.pkl')
```

When you re-evaluate the reloaded model on the exact same test set using the previously defined evaluation code, you should see results that are identical to the original evaluation.

```python
loss, accuracy = modelReloaded.evaluate(X_test_reloaded, y_test_reloaded)
print(f"Test accuracy: {accuracy:.2f}")
```
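
If you run both halves of the lab in a single session, the original objects are still in memory, and you can add an optional sanity check like the hypothetical sketch below to confirm that the reloaded datasets and model match the originals.

```python
import numpy as np

# Optional sanity checks -- assumes the original model, X_test, and y_test
# are still in memory (e.g. when running the whole lab in one session).
assert np.allclose(X_test, X_test_reloaded)
assert np.array_equal(y_test, y_test_reloaded)

# The reloaded model should produce the same predictions as the original.
assert np.allclose(model.predict(X_test), modelReloaded.predict(X_test_reloaded))
print("Reloaded model and datasets match the originals.")
```

{% /step %}

{% /steps %}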