A Simple Guide to creating Predictive Models in Python, Part-2b

“Artificial Intelligence is the new Electricity” ― Andrew Ng

Published in DataDrivenInvestor · Nov 30, 2018 · 7 min read

This guide is the second segment and a continuation of the first segment of the second part in the two-part series (now let that sink in a little): one part covered Preprocessing and Exploration of Data, and the other covers the actual Modelling. This segment (Part-2b) of Part-2 deals with the “Deep Learning” models, while the other segment (Part-2a) deals with the “Machine Learning” models. The dataset used here comes from superdatascience.com. A huge shout-out to them for providing amazing courses and content on their website, which motivates people like me to pursue a career in Data Science.

Don’t focus too much on the code throughout the course of this article but rather get the general idea of what happens during the Modelling stage.

Part 2b: Modelling of Data: (Deep Learning)

Creating a really good deep learning model can sometimes be very challenging because, unlike machine learning, we don’t yet have very high-level frameworks or libraries that do most of the work for us.

In this article, we will take a look at how to model using ‘TensorFlow’, a deep/machine learning framework developed by Google. The reason I chose TensorFlow (and you should too) is that it is ‘made for deployment’, supports multiple platforms and has seen some major improvements over the past year. The developers working on TensorFlow are constantly adding new high-level APIs to make it easy for anyone to create deep learning models.

Without any further delay, let's get started

import tensorflow as tf

Remember that we copied the un-processed dataframe as ‘deep_df’? Here’s why.

The reason for using an un-processed dataframe is that the tensorflow.estimator API will take care of most of the preprocessing hard work. We don’t even have to decide which features to use (we did a lot of work deciding the relevance of ‘Age’, ‘Gender’ and ‘Geography’ in Part-1).

print(deep_df.info())
deep_df.head()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
CreditScore 10000 non-null int64
Geography 10000 non-null object
Gender 10000 non-null object
Age 10000 non-null int64
Tenure 10000 non-null int64
Balance 10000 non-null float64
NumOfProducts 10000 non-null int64
HasCrCard 10000 non-null int64
IsActiveMember 10000 non-null int64
EstimatedSalary 10000 non-null float64
Exited 10000 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
# separating the features and labels
deep_feat = deep_df.drop(columns=["Exited"], axis=1)
deep_label = deep_df["Exited"]

List out and separate the continuous (numerical) and categorical columns of the dataframe.

This can be done by manually copying and pasting the column names, but what if the data had, say, 60 columns? The method below is easier and scales better.

# first just take a look at all the columns
list(deep_feat.columns)

Output:

['CreditScore',
'Geography',
'Gender',
'Age',
'Tenure',
'Balance',
'NumOfProducts',
'HasCrCard',
'IsActiveMember',
'EstimatedSalary']

Make a list of columns (excluding ‘Exited’) where the number of unique elements is 2 (i.e., 0 or 1) or the data type is ‘object’ (represented by ‘O’), and store them as categorical_columns

categorical_columns = [col for col in deep_feat.columns if len(deep_feat[col].unique())==2 or deep_feat[col].dtype=='O']

Make a list of columns (excluding ‘Exited’) where the number of unique elements is greater than 2 and the data type is either ‘int64’ or ‘float64’, and store them as continuous_columns

continuous_columns = [col for col in deep_feat.columns if len(deep_feat[col].unique())>2 and (deep_feat[col].dtype=='int64' or deep_feat[col].dtype=='float64')]

See what the lists look like

print("categorical columns : ", categorical_columns)
print("continuous columns : ", continuous_columns)

Output:

categorical columns :  ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
continuous columns : ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

We included the ‘Age’ column in continuous_columns for now but will later bucketize it (using tensorflow API) and change it into a categorical column.

Remember all the hard work we did previously? Well, we don’t have to do it anymore because TensorFlow will take care of everything

# making a train test split
from sklearn.model_selection import train_test_split

X_T, X_t, y_T, y_t = train_test_split(deep_feat, deep_label, test_size=0.3)

Let’s now scale the data

First, create a list of columns to scale that contains all the columns in continuous_columns except ‘Age’, because we want to convert the ‘Age’ column into a bucketized column

cols_to_scale = continuous_columns[:]
cols_to_scale.remove("Age")

Apply the scaling:

# scaling the listed columns
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_T.loc[:, cols_to_scale] = scaler.fit_transform(X_T.loc[:, cols_to_scale])
# use transform (not fit_transform) on the test set so it is scaled with the training-set statistics
X_t.loc[:, cols_to_scale] = scaler.transform(X_t.loc[:, cols_to_scale])

The code below is a little confusing at first, so let’s break it down

We are basically creating a TensorFlow feature column corresponding to each of the columns in the dataframe

“tf.feature_column.categorical_column_with_hash_bucket()” takes in a categorical string column like ‘Gender’ and hashes its values into integer bucket IDs, so we don’t have to build a vocabulary or one-hot encode it by hand

“hash_bucket_size” is the number of hash buckets; it just needs to be at least as large as the number of categories in the column (in the case of ‘Gender’ there are only 2 (Male, Female), which is definitely less than 1000)

This is then passed into “tf.feature_column.embedding_column()”

The “dimension” parameter sets the size of the embedding vector; here we simply use the exact number of categories in the column (2 for ‘Gender’). An embedding column is only needed for a dense neural network (a simple linear model doesn’t require this step, as the sketch after the next code block shows)

categorical_object_feat_cols = [
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket(key=col, hash_bucket_size=1000),
        dimension=len(deep_df[col].unique()))
    for col in categorical_columns if deep_df[col].dtype == 'O']
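As a side note, here is a minimal sketch (TensorFlow 1.x API; the names are illustrative and not used elsewhere in this post) of how a plain linear model could consume the sparse hash-bucket columns directly, without the embedding step:

# a linear model can use the sparse hash-bucket columns as-is (no embedding needed)
raw_categorical_cols = [
    tf.feature_column.categorical_column_with_hash_bucket(key=col, hash_bucket_size=1000)
    for col in categorical_columns if deep_df[col].dtype == 'O']

# 'linear_model' is just an illustrative name; in practice you would also pass
# your numeric feature columns to feature_columns before training it
linear_model = tf.estimator.LinearClassifier(feature_columns=raw_categorical_cols, n_classes=2)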

“tf.feature_column.categorical_column_with_identity()” is used for categorical columns that already hold integer IDs (like ‘HasCrCard’ and ‘IsActiveMember’)

categorical_integer_feat_cols = [
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_identity(key=col, num_buckets=2),
        dimension=len(deep_df[col].unique()))
    for col in categorical_columns if deep_df[col].dtype == 'int64']

If you split these statements across multiple lines yourself, either keep the line breaks inside the brackets (as shown above) or end each line with the ‘\’ continuation character; otherwise, just type everything on a single line

“tf.feature_column.numeric_column()” takes in a continuous (numeric) column like ‘Balance’

Notice that we excluded the ‘Age’ column because we still need to bucketize it

continuous_feat_cols = [tf.feature_column.numeric_column(key=col) for col in continuous_columns if col != "Age"]

We now take the ‘Age’ column, first make a TensorFlow numeric column out of it, and then pass that into “tf.feature_column.bucketized_column()” to bucketize it with the given boundaries

age_bucket = tf.feature_column.bucketized_column(tf.feature_column.numeric_column(key="Age"), boundaries=[20,30,40,50,60,70,80,90])

Combining all the lists to create a new list of feature columns

feat_cols = categorical_object_feat_cols + \
categorical_integer_feat_cols + \
continuous_feat_cols + \
[age_bucket]
# '\' is a line-continuation character that lets a single statement span multiple lines.

Creating input functions | pipelines to feed the dataset into TensorFlow (one for training and one for testing)

input_fun = tf.estimator.inputs.pandas_input_fn(X_T, y_T, batch_size=50, num_epochs=1000, shuffle=True)
pred_input_fun = tf.estimator.inputs.pandas_input_fn(X_t, batch_size=50, shuffle=False)

Let’s now make the deep neural network model, where ‘hidden_units’ is the size | shape | architecture of the network (3 hidden layers, each containing 10 neurons, in our case)

There is no fixed rule for deciding the number of layers and nodes: there shouldn’t be so many that the model overfits, nor so few that it underfits. It largely comes down to experimentation and experience.

‘n_classes’ is the number of different labels to classify the data into

DNN_model = tf.estimator.DNNClassifier(hidden_units=[10,10,10], feature_columns=feat_cols, n_classes=2)

Training the DNN model (this will take some time)

DNN_model.train(input_fn=input_fun, steps=5000)

Drawing out the predictions of the test set from the trained model

‘predictions’ is a tensorflow object

predictions = DNN_model.predict(pred_input_fun)

We got the predictions as a tensorflow object. Now let's convert it into a list of dictionaries and look at the very first prediction to assess the format of the output

res_pred = list(predictions)
res_pred[0]

Output:

{'logits': array([-2.9706442], dtype=float32),
'logistic': array([0.04876983], dtype=float32),
'probabilities': array([0.95123017, 0.04876983], dtype=float32),
'class_ids': array([0], dtype=int64),
'classes': array([b'0'], dtype=object)}

Extracting only the “class_ids” from the list of dictionaries because it is the final verdict on which class each row belongs to

y_pred = []
for i in range(len(res_pred)):
    y_pred.append(res_pred[i]["class_ids"][0])

Using the sklearn ‘classification_report’ as a metric for evaluation

from sklearn.metrics import classification_report

rep = classification_report(y_t, y_pred)

The ‘avg / total’ gives the final reading of ‘precision’, ‘recall’, ‘f1-score’ and ‘support’

print(rep)

Output:

             precision    recall  f1-score   support

          0       0.87      0.96      0.91      2383
          1       0.74      0.46      0.57       617

avg / total       0.84      0.85      0.84      3000

It is confusing at first that even ‘state of the art’ models like XGBoost and deep neural networks gave only around 85–86% accuracy.

The reason for this is clearly seen in the above results

It is because the support for the “Exited = 1” is very small compared to the support of “Exited = 0”

support means the number of rows corresponding to a value (category) in the label

which means our model (any model, really) can predict “Exited = 0” very accurately because of the large support, but when it comes to predicting “Exited = 1” it does not perform very well because the support is low

i.e., the number of rows with “Exited = 1” is low, which results in poor/insufficient training for that class. You can verify the imbalance directly with the quick check below.
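A minimal check, using pandas’ value_counts() on the label we separated earlier:

# how many rows belong to each class of the label
print(deep_label.value_counts())

# the same counts as proportions, to see the imbalance at a glance
print(deep_label.value_counts(normalize=True))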

Final conclusion:

The accuracy of the model cannot be increased much beyond what we have now, because the data itself is not sufficient (especially for the minority class) and doesn’t seem to have much correlation between the features and the label

If you want to save the model and deploy it sometime in the future, then check out my other notebook | post, which is dedicated to that task.
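For reference, here is a minimal sketch (TensorFlow 1.x estimator API; the export directory name is just an example) of what exporting the trained estimator could look like:

# build a serving input function from the same feature columns used for training
feature_spec = tf.feature_column.make_parse_example_spec(feat_cols)
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)

# export the trained estimator as a SavedModel ("saved_dnn_model" is just an example path)
DNN_model.export_savedmodel("saved_dnn_model", serving_input_fn)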
