General Purpose Tensorflow 2.x Script to Train Any CSV File

Zahash Z
Published in The Startup
8 min read · Dec 1, 2020


A Google Colab link to the complete code is at the end of the article.

This article walks through a TensorFlow 2.x script, written in Python, that you can use to train a model on almost any CSV file.

Before getting started, here is what this script Can and Cannot do

What it Can do

  • Automatically decide the number of layers based on the input size
  • Automatically decide the number of nodes in each layer
  • Automatically decide the number of output nodes based on whether it's a regression task (one output node) or a classification task (the number of output nodes depends on the number of class labels)
  • Automatically decide the loss function based on the task
  • Model Checkpointing
  • Early Stopping
  • All the preprocessing is built right into the layers of the model so that the model just accepts raw data and preprocesses it by itself without the need to have separate functions. (makes deployment a hell of a lot easier)
  • The code can handle very large CSV files too (greater than 10 GB).
  • Keras-tuner is used to determine the optimal kernel regularization values and activation functions.

What it Cannot do

  • It can’t clean the dataset. You have to deal with any missing values or outliers beforehand.
  • It can’t perform feature engineering. If you have any irrelevant columns, like ID, name, etc., then it's your job to remove them (a quick example follows this list).
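For instance, dropping such columns beforehand could look like the snippet below. Pandas is assumed here, and "id" and "name" are placeholder column names.

import pandas as pd

# Remove columns that carry no predictive signal before handing the CSV to the script.
# "id" and "name" are placeholders; replace them with your own column names.
df = pd.read_csv("dataset.csv")
df = df.drop(columns=["id", "name"])
df.to_csv("dataset_clean.csv", index=False)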

Although it can detect data types automatically, you need to make sure that they are represented appropriately in the dataset.

For example, if you have a CSV file with Age and Gender columns where a row looks like 25,Male, then the script will detect that Age is an integer column and Gender is a string column, which is good.

But if you have a CSV file where the values of the Age column are inside quotes, so a row looks like "25",Male, then the script will think that Age is a string column, which is bad.

With that out of the way, let's get started.

First, you just need to specify a few parameters.

  • DATASET_PATH — where your dataset is located in the file system
  • LABEL_NAME — target label column name
  • TASK — “r” for regression and “c” for classification
  • DUMMY_BATCH_SIZE — don’t worry about this. Just set it to 5
  • BATCH_SIZE — batch size used for training
  • EPOCHS — maximum number of training epochs
  • TRAIN_FRAC — the fraction of data to be used for training. float between 0 and 1.
  • CHECKPOINT_DIR — folder to store checkpointed models
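Put together, the configuration block might look like this (the values below are only placeholders; point them at your own dataset):

DATASET_PATH = "/content/dataset.csv"   # placeholder path to your CSV file
LABEL_NAME = "traffic_volume"           # placeholder target column name
TASK = "r"                              # "r" for regression, "c" for classification
DUMMY_BATCH_SIZE = 5
BATCH_SIZE = 64
EPOCHS = 100
TRAIN_FRAC = 0.8                        # 80% of batches for training, 20% for validation
CHECKPOINT_DIR = "/content/checkpoints"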

Imports

from collections import defaultdict
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import kerastuner as kt
import IPython

Throughout the whole process, we will be using tf.data to handle the dataset. With tf.data you can load massive datasets one chunk at a time. It also has a lot of useful functionality, like caching and prefetching the next chunk of data while the model is training on the current chunk.

The function below returns a fresh dataset object that you can iterate over one chunk (batch) at a time.

def get_dataset(batch_size=5):
    return tf.data.experimental.make_csv_dataset(
        DATASET_PATH,
        batch_size=batch_size,
        label_name=LABEL_NAME,
        num_epochs=1,
    )
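As a quick sanity check (this is not part of the original script), you can peek at a single batch to confirm that the columns and the label are parsed as expected:

# make_csv_dataset yields (features_dict, label) pairs.
for features, label in get_dataset(batch_size=DUMMY_BATCH_SIZE).take(1):
    print(list(features.keys()))
    print(label.numpy())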

If the task is regression, the number of output nodes is one

if TASK == "r":
    OUTPUT_NODES = 1

If it is a classification task, then we need to find the number of labels (classes) by iterating over the whole dataset, in order to determine the number of output nodes.

elif TASK == "c":
    unique_labels = set()
    for _, label in get_dataset(batch_size=BATCH_SIZE):
        for ele in label.numpy():
            unique_labels.add(ele)
    num_labels = len(unique_labels)

    if num_labels <= 2:
        OUTPUT_NODES = 1
    else:
        OUTPUT_NODES = num_labels

As I mentioned before, this script will do all the preprocessing on its own without the need to have external functions because all the preprocessing logic is embedded as one of the layers in the model. This enables you to just give the raw data to the deployed model.

The first step in achieving that is to define what is known as model inputs.

It's simply a dictionary where the keys are the column names and the values are tf.keras.Input objects.

model_inputs = {}

for batch, _ in get_dataset(batch_size=DUMMY_BATCH_SIZE).take(1):
    for col_name, col_values in batch.items():
        model_inputs[col_name] = tf.keras.Input(
            shape=(1,), name=col_name, dtype=col_values.dtype
        )

Sample Output:

{"clouds_all": <tf.Tensor "clouds_all:0" shape=(None, 1) dtype=int32>,"holiday": <tf.Tensor "holiday:0" shape=(None, 1) dtype=string>,"rain_1h": <tf.Tensor "rain_1h:0" shape=(None, 1) dtype=float32>,"snow_1h": <tf.Tensor "snow_1h:0" shape=(None, 1) dtype=float32>,"temp": <tf.Tensor "temp:0" shape=(None, 1) dtype=float32>,"weather_description": <tf.Tensor "weather_description:0" shape=(None, 1) dtype=string>,"weather_main": <tf.Tensor "weather_main:0" shape=(None, 1) dtype=string>}

One feature of tf.data is that it can automatically detect the data type of each column. Let's use that to our advantage and split the model_inputs dictionary into multiple dictionaries according to data type.

integer_inputs = {}
float_inputs = {}
string_inputs = {}

for col_name, col_input in model_inputs.items():
    if col_input.dtype == tf.int32:
        integer_inputs[col_name] = col_input
    elif col_input.dtype == tf.float32:
        float_inputs[col_name] = col_input
    elif col_input.dtype == tf.string:
        string_inputs[col_name] = col_input

Sample integer_inputs:

{'clouds_all': <tf.Tensor 'clouds_all:0' shape=(None, 1) dtype=int32>}

Sample float_inputs:

{
    'rain_1h': <tf.Tensor 'rain_1h:0' shape=(None, 1) dtype=float32>,
    'snow_1h': <tf.Tensor 'snow_1h:0' shape=(None, 1) dtype=float32>,
    'temp': <tf.Tensor 'temp:0' shape=(None, 1) dtype=float32>
}

Sample string_inputs:

{
    'holiday': <tf.Tensor 'holiday:0' shape=(None, 1) dtype=string>,
    'weather_description': <tf.Tensor 'weather_description:0' shape=(None, 1) dtype=string>,
    'weather_main': <tf.Tensor 'weather_main:0' shape=(None, 1) dtype=string>
}

Now, for the preprocessing: the general flow for the integer and float columns is to first concatenate the inputs with tf.keras.layers.Concatenate and then normalize them through a normalization layer. But to calculate the mean and std for normalization, we have to iterate through the whole dataset once.
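The full numerical_input_processor lives in the linked colab. A minimal sketch of the idea, assuming the get_dataset helper above, might look like this:

def numerical_input_processor(inputs):
    # Sketch only; the complete version is in the linked colab.
    if not inputs:
        return None

    # Concatenate the numeric input layers (they all share one dtype per call)
    # and cast to float32 so they can be normalized.
    input_layers = list(inputs.values())
    if len(input_layers) > 1:
        x = tf.keras.layers.Concatenate()(input_layers)
    else:
        x = input_layers[0]
    x = tf.cast(x, tf.float32)

    # Compute the mean and std by iterating over the whole dataset once.
    normalizer = tf.keras.layers.experimental.preprocessing.Normalization()
    feature_ds = get_dataset(batch_size=BATCH_SIZE).map(
        lambda features, label: tf.stack(
            [tf.cast(features[name], tf.float32) for name in inputs], axis=-1
        )
    )
    normalizer.adapt(feature_ds)

    return normalizer(x)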

When it comes to preprocessing the string columns, we first get the vocabulary (list of unique values) for each column. Then we pass the input through a string lookup layer (tf.keras.layers.experimental.preprocessing.StringLookup) and then through a one-hot encoding layer (tf.keras.layers.experimental.preprocessing.CategoryEncoding). You can also perform some simple operations, like converting everything to lowercase and stripping leading and trailing white spaces.

StringLookup layer maps strings from a vocabulary to integer indices. CategoryEncoding then takes those integer indices and creates a one-hot vector.
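Again, the full string_input_processor is in the linked colab. A minimal sketch of the approach, assuming the get_dataset helper above, might look like this:

def string_input_processor(inputs):
    # Sketch only; the complete version is in the linked colab.
    if not inputs:
        return None

    encoded_columns = []
    for col_name, col_input in inputs.items():
        # Build the vocabulary (unique values) of this column by scanning the dataset.
        vocab = set()
        for batch, _ in get_dataset(batch_size=BATCH_SIZE):
            for value in batch[col_name].numpy():
                vocab.add(value.decode("utf-8").strip().lower())

        # Lowercase and strip the raw input, map it to an integer index,
        # then one-hot encode that index.
        lookup = tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=sorted(vocab)
        )
        one_hot = tf.keras.layers.experimental.preprocessing.CategoryEncoding(
            max_tokens=lookup.vocab_size()
        )

        x = tf.strings.lower(tf.strings.strip(col_input))
        encoded_columns.append(one_hot(lookup(x)))

    if len(encoded_columns) > 1:
        return tf.keras.layers.Concatenate()(encoded_columns)
    return encoded_columns[0]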

Pass the inputs through their corresponding functions to get preprocessed layers

integer_layer = numerical_input_processor(integer_inputs)
float_layer = numerical_input_processor(float_inputs)
string_layer = string_input_processor(string_inputs)

Add all the inputs to a list

preprocessed_inputs = []

if integer_layer is not None:
    preprocessed_inputs.append(integer_layer)
if float_layer is not None:
    preprocessed_inputs.append(float_layer)
if string_layer is not None:
    preprocessed_inputs.append(string_layer)

preprocessed_inputs might look something like this

[
    <tf.Tensor 'normalization/truediv:0' shape=(None, 1) dtype=float32>,
    <tf.Tensor 'normalization_1/truediv:0' shape=(None, 3) dtype=float32>,
    <tf.Tensor 'concatenate_1/concat:0' shape=(None, 66) dtype=float32>
]

Concatenate all the inputs

if len(preprocessed_inputs) > 1:
    preprocessed_inputs_cat = tf.keras.layers.Concatenate()(preprocessed_inputs)
else:
    preprocessed_inputs_cat = preprocessed_inputs[0]

Finally, create a tf.keras.Model that takes the model_inputs and returns the preprocessed outputs

preprocessing_head = tf.keras.Model(model_inputs, preprocessed_inputs_cat)

You can also plot the preprocessing model to see what the individual layers look like. Increase the value of dpi if you want the plot to have higher clarity.

tf.keras.utils.plot_model(model = preprocessing_head, rankdir="LR", dpi=72, show_shapes=True, expand_nested=True, to_file="preprocessing_head.png")

You might get an image something like this

You can also see the preprocessed outputs for the given inputs

Eg:

preprocessing_head({
    "clouds_all": np.array([0]),
    "holiday": np.array(["yes"]),
    "rain_1h": np.array([23.9]),
    "snow_1h": np.array([15.2]),
    "temp": np.array([50.5]),
    "weather_description": np.array(["cloudy"]),
    "weather_main": np.array(["cloudy"]),
})

Now that we have created the preprocessing head, let's pass the model_inputs through it to create the preprocessed outputs.

preprocessed_outputs = preprocessing_head(model_inputs)

Find out how many batches should be used for training and how many for validation

dataset_size = 0

for _ in get_dataset(batch_size=BATCH_SIZE):
    dataset_size += 1

train_size = int(TRAIN_FRAC * dataset_size)

Split the training and validation datasets

dataset = get_dataset(batch_size=BATCH_SIZE)

train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)

As mentioned earlier, tf.data comes with some nifty tools to cache and prefetch batches of data while the model is training

AUTOTUNE = tf.data.experimental.AUTOTUNE

dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
train_dataset = train_dataset.cache().prefetch(buffer_size=AUTOTUNE)
val_dataset = val_dataset.cache().prefetch(buffer_size=AUTOTUNE)

Overfitting is a real pain if you don't know how to deal with it. Luckily, tf.keras has support for callbacks, and one of the callbacks it supports is EarlyStopping. You can use this to stop the training if the validation loss stops decreasing.

early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

One more problem with training ML models is that sometimes it takes too long to train. If there is a problem or if the machine accidentally shuts off, you lose all your progress. So, it is logical to save the model periodically during training. The ModelCheckpoint callback does exactly that.

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=os.path.join(CHECKPOINT_DIR, "M.{epoch:02d}-{val_loss:.2f}"))

The “M.{epoch:02d}-{val_loss:.2f}” part is just the format of the directory name so that you have a new name for each saved model and don't overwrite the previously saved model.
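If a run does get interrupted, you can load one of the saved checkpoints back and continue from there. The directory name below is just an illustration of the format above:

# "M.03-0.45" is a placeholder; use the name of an actual checkpoint directory.
restored_model = tf.keras.models.load_model(os.path.join(CHECKPOINT_DIR, "M.03-0.45"))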

Add the two callbacks to a list so that they can be used when it's time to train the model.

callbacks = [
early_stopping_callback,
checkpoint_callback,
]

Keras Tuner

One of the most amazing things to ever exist is the Keras Tuner library. It lets you find the best hyperparameters for your models so that you don’t have to spend any time trying out different combinations. And the best part is that it is fast AF.

The way to implement Keras tuner is to define a model builder function that takes in the hyperparameters as an argument and returns a model. Keras tuner will then use that function to create multiple models and test them to see which hyperparameter combination is the best.

In this instance, we will just try to find the best kernel regularization value and the activation functions. But you can easily extend this to search for the optimal number of layers, the number of nodes in each layer, and so on…
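The model_builder function itself is defined in the linked colab. A minimal sketch of what it could look like, reusing model_inputs, preprocessed_outputs, TASK, and OUTPUT_NODES from above (the layer sizes here are purely illustrative), is:

def model_builder(hp):
    # Hyperparameters searched by the tuner (names match the sample output further below).
    activation = hp.Choice("activation", ["relu", "elu", "tanh"])
    kernel_regularization = hp.Choice("kernel_regularization", [0.01, 0.001, 0.0001])

    # Stack dense layers on top of the preprocessed inputs.
    x = preprocessed_outputs
    for units in (128, 64, 32):  # illustrative layer sizes
        x = tf.keras.layers.Dense(
            units,
            activation=activation,
            kernel_regularizer=tf.keras.regularizers.l2(kernel_regularization),
        )(x)

    if TASK == "r":
        outputs = tf.keras.layers.Dense(OUTPUT_NODES)(x)
        loss = "mse"
        metrics = []
    elif OUTPUT_NODES == 1:
        # Binary classification (assumes 0/1 labels).
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
        metrics = ["accuracy"]
    else:
        # Multi-class classification (assumes integer class labels).
        outputs = tf.keras.layers.Dense(OUTPUT_NODES, activation="softmax")(x)
        loss = "sparse_categorical_crossentropy"
        metrics = ["accuracy"]

    model = tf.keras.Model(model_inputs, outputs)
    model.compile(optimizer="adam", loss=loss, metrics=metrics)
    return model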

Keras tuner has multiple tuners available — RandomSearch, Hyperband, BayesianOptimization, and Sklearn.

Hyperband is usually a great default because it quickly discards poorly performing trials, so that is what we use here.

tuner = kt.Hyperband(
    model_builder,
    objective="val_loss",
    max_epochs=10,
    factor=3,
    directory="tuner_dir",
    project_name="tuner",
)

tuner.search(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
)

If you want to learn more about Keras Tuner, you can check out one of the tutorials on the TensorFlow website here.

Get the best hyperparameter values

best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print(best_hps.values)

Sample output:

{
"activation": "elu",
"kernel_regularization": 0.001,
"tuner/bracket": 2,
"tuner/epochs": 2,
"tuner/initial_epoch": 0,
"tuner/round": 0,
}

You can now use these values to build a model with the best hyperparameters

model = tuner.hypermodel.build(best_hps)

Plot the model if you want

tf.keras.utils.plot_model(model = model, rankdir="LR", dpi=96, show_shapes=True, expand_nested=True, to_file="model.png")

You might see an image like this

Training the model

It's always a good practice to run the untrained model on the data and take a look at the training and validation loss

print(model.evaluate(train_dataset))
print(model.evaluate(val_dataset))

Train the model

history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    callbacks=callbacks,
    epochs=EPOCHS,
)

Find out the training and validation loss with the trained model

print(model.evaluate(train_dataset))
print(model.evaluate(val_dataset))

If you observe a decrease in the loss value after training, it means that the model trained successfully.

Here are some helper functions to generate simple plots showing how the loss and accuracy (in the case of classification) changed over time.

def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Error')
    plt.legend()
    plt.grid(True)
    plt.show()

def plot_acc(history):
    plt.plot(history.history['accuracy'], label='accuracy')
    plt.plot(history.history['val_accuracy'], label='val_accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.grid(True)
    plt.show()

plot loss and accuracy

plot_loss(history)

if TASK == "c":
    plot_acc(history)

You can also run some predictions on dummy data to see the results for yourself

model.predict({
    "clouds_all": np.array([0]),
    "holiday": np.array(["yes"]),
    "rain_1h": np.array([23.9]),
    "snow_1h": np.array([15.2]),
    "temp": np.array([50.5]),
    "weather_description": np.array(["cloudy"]),
    "weather_main": np.array(["cloudy"]),
})

Here is the whole working code in Google Colab.
