Preparing a Dataset for Machine Learning with scikit-learn

Kevin Jolly ⏰ 7 Minutes 📅 Jan 19, 2019

The first step in implementing any machine learning algorithm with scikit-learn is data preparation, because scikit-learn places several constraints on the data it will accept. The dataset we will be using is based on mobile payments and is hosted on Kaggle, the world's most popular competitive machine learning website. You can download it from https://www.kaggle.com/ntnu-testimon/paysim1.

Once downloaded, launch a Jupyter Notebook with the following command in a Terminal (macOS/Linux) or Anaconda Prompt/PowerShell (Windows):

$ jupyter notebook

The goal with this dataset is to predict whether a mobile transaction is fraudulent. To do that, we first need a brief understanding of its contents, and we will explore it with the pandas package in Python. You can install pandas with the following command in a Terminal (macOS/Linux) or PowerShell (Windows):

$ pip3 install pandas

On Windows machines with Anaconda, pandas can also be installed from an Anaconda Prompt using the following command:

$ conda install pandas

We can now read the dataset into our Jupyter Notebook using the following code:

#Package Imports

import pandas as pd

#Reading in the dataset

df = pd.read_csv('PS_20174392719_1491204439457_log.csv')

#Viewing the first 5 rows of the dataset

df.head()

This produces an output as illustrated in the following screenshot:

[Screenshot: the first five rows of the dataset in the Jupyter Notebook]
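Before moving on, it can also help to get a quick summary of the columns and their types. A minimal check using standard pandas methods:

#Viewing column names, data types, and non-null counts

df.info()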

Dropping features that are redundant

From the dataset seen previously, there are a few columns that are redundant to the machine learning process:

  • nameOrig: This column is a unique identifier that belongs to each customer. Since each identifier is unique with every row of the dataset, the machine learning algorithm will not be able to discern any patterns from this feature.
  • nameDest: This column is also a unique identifier that belongs to each customer and as such provides no value to the machine learning algorithm.
  • isFlaggedFraud: This column flags a transaction as fraudulent if a person tries to transfer more than 200,000 in a single transaction. Since we already have a feature called isFraud that flags a transaction as fraud, this feature becomes redundant (a quick check of this overlap follows below).
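Before dropping these columns, we can sanity-check the claim that isFlaggedFraud adds nothing beyond isFraud. The following is a minimal sketch using the standard pandas crosstab function; in the full PaySim data, only a small number of transactions are ever flagged:

#Cross-tabulating isFraud against isFlaggedFraud to confirm the overlap

pd.crosstab(df['isFraud'], df['isFlaggedFraud'])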

We can drop these features from the dataset using the following code:

#Dropping the redundant features

df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

Reducing the size of the data

The dataset that we are working with contains over 6 million rows of data. Most machine learning algorithms take a long time to train on data of this size, so to keep execution times short we will reduce the dataset to 20,000 rows. We can do this using the following code:

#Storing the fraudulent data into a dataframe

df_fraud = df[df['isFraud'] == 1]

#Storing the non-fraudulent data into a dataframe

df_nofraud = df[df['isFraud'] == 0]

#Storing 12,000 rows of non-fraudulent data

df_nofraud = df_nofraud.head(12000)

#Joining both dataframes and resetting the index so rows are numbered consecutively

df = pd.concat([df_fraud, df_nofraud], axis = 0).reset_index(drop = True)

In the preceding code, the fraudulent rows are stored in one dataframe. This dataframe contains a little over 8,000 rows. The 12,000 non-fraudulent rows are stored in another dataframe, and the two dataframes are joined together using the concat method from pandas. Resetting the index after the join matters later: when we concatenate the one-hot encoded columns side by side, pandas aligns rows by index, and mismatched indexes would introduce spurious NaN values.

This results in a dataframe with a little over 20,000 rows, over which we can now execute our algorithms relatively quickly.
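As a quick sanity check, we can confirm the new size and class balance. A minimal sketch:

#Verifying the size and the fraud/non-fraud split of the reduced dataset

print(df.shape)
print(df['isFraud'].value_counts())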

Encoding the categorical variables

One of the main constraints of scikit-learn is that its machine learning algorithms cannot work directly with columns that are categorical in nature. For example, the type column in our dataset has five categories:

  • CASH-IN
  • CASH-OUT
  • DEBIT
  • PAYMENT
  • TRANSFER

These categories will have to be encoded into numbers that scikit-learn can make sense of. In order to do this, we have to implement a two-step process. The first step is to convert each category into a number: CASH-IN = 0, CASH-OUT = 1, DEBIT = 2, PAYMENT = 3, TRANSFER = 4. We can do this using the following code:

#Package Imports

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#Converting the type column to categorical

df['type'] = df['type'].astype('category')

#Initializing a LabelEncoder object

type_encode = LabelEncoder()

#Integer encoding the 'type' column

df['type'] = type_encode.fit_transform(df.type)

First, the code converts the type column to a categorical feature. We then use LabelEncoder() to initialize an integer encoder object called type_encode. Finally, we apply the fit_transform method to the type column to convert each category into a number.
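To verify which integer each category received, we can inspect the encoder's classes_ attribute; LabelEncoder assigns integers to categories in sorted order. A minimal check:

#Viewing the category-to-integer mapping learned by the encoder

print(dict(zip(type_encode.classes_, type_encode.transform(type_encode.classes_))))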

Broadly speaking, there are two types of categorical variables:

  • Nominal
  • Ordinal

Nominal categorical variables have no inherent order to them. An example of the nominal type of categorical variable is the type column.

Ordinal categorical variables have an inherent order to them. An example of the ordinal type of categorical variable is Education Level, in which people with a Master’s degree will have a higher order/value compared to people with an Undergraduate degree only.
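For an ordinal variable, we could instead map each level to an integer by hand so that the numbers respect the inherent order. The sketch below uses a hypothetical education column that is not part of this dataset:

#Hypothetical example: an ordinal 'education' column (not in the PaySim data)

education = pd.Series(['Undergraduate', 'Masters', 'PhD', 'Masters'])

#Mapping each level to an integer that preserves the inherent order

education_order = {'Undergraduate': 0, 'Masters': 1, 'PhD': 2}
education_encoded = education.map(education_order)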

In the case of ordinal categorical variables, integer encoding, as illustrated previously, is sufficient and we do not need to one-hot encode them. Since the type column is a nominal categorical variable, we have to one-hot encode it after integer encoding it. This is done using the following code:

#One hot encoding the 'type' column

type_one_hot = OneHotEncoder()

type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()

#Adding the one hot encoded variables to the dataset

ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(int(i)) for i in range(type_one_hot_encode.shape[1])])

df = pd.concat([df, ohe_variable], axis=1)

#Dropping the original type variable

df = df.drop('type', axis = 1)

#Viewing the new dataframe after one-hot-encoding

df.head()

In the code, we first create a one-hot encoding object called type_one_hot. We then transform the type column into one-hot encoded columns using the fit_transform method. Because scikit-learn expects a two-dimensional array, we reshape the column with reshape(-1, 1) before encoding, and convert the sparse result to a dense array with toarray.

We have five categories that are represented by the integers 0 to 4, and each category now gets its own column. Therefore, we create five columns called type_0, type_1, type_2, type_3, and type_4. Each of these columns holds one of two values: 1, which indicates the presence of that category, and 0, which indicates its absence.

This information is stored in the ohe_variable dataframe. Since it holds the five new columns, we join it to the original dataframe using the concat method from pandas.

The original type column is then dropped from the dataframe, as it is now redundant after one-hot encoding. The final dataframe looks like this:

[Screenshot: the dataframe after one-hot encoding]
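As an aside, pandas can produce the same nominal encoding in a single step with get_dummies, applied directly to the original string-valued type column. Since that column has already been dropped here, the sketch below demonstrates the idea on a small example series:

#One-step alternative: pd.get_dummies on a string-valued column

example = pd.Series(['PAYMENT', 'TRANSFER', 'CASH-OUT'], name = 'type')
pd.get_dummies(example, prefix = 'type')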

Missing values

Another constraint of scikit-learn is that it cannot handle data with missing values, so we must first check whether any column of our dataset contains them. We can do this using the following code:

#Checking every column for missing values

df.isnull().any()

This produces the following output:

[Screenshot: output of df.isnull().any() for each column]
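For a per-column count rather than a simple yes/no flag, we can sum the boolean mask instead:

#Counting the number of missing values in each column

df.isnull().sum()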

With the indexes reset before the join, this dataset should report no missing values, but it is good practice to check. If missing values do appear in any column, they can be handled in a variety of ways, such as the following:

  • Median imputation
  • Mean imputation
  • Filling them with the majority value

The number of available techniques is large, and the right choice depends on the nature of your dataset. This process of filling in missing values is known as imputation, and it forms part of feature engineering, which can be done for both categorical and numerical columns. As a simple, defensive default, we will impute any missing values with zero.

We can do this using the following code:

#Imputing the missing values with a 0

df = df.fillna(0)
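If median or mean imputation suited a numeric column better, pandas makes either straightforward. A minimal sketch using the amount column:

#Alternative: imputing a numeric column with its median (use .mean() for mean imputation)

df['amount'] = df['amount'].fillna(df['amount'].median())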

We now have a dataset that is ready for machine learning with scikit-learn. You can also export this dataset as a .csv file in the same directory that you are working in with the Jupyter Notebook using the following code; passing index = False prevents the pandas row index from being written as an extra column:

df.to_csv('fraud_prediction.csv', index = False)

This will create a .csv file of this dataset in the directory that you are working in, which you can load into the notebook again using pandas.
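For example, you can reload it later as follows:

#Loading the exported dataset back into a dataframe

df = pd.read_csv('fraud_prediction.csv')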

Conclusion

If you found this article interesting, you can explore Machine Learning with scikit-learn Quick Start Guide, which teaches you how to deploy supervised and unsupervised machine learning algorithms with scikit-learn to perform classification, regression, and clustering, and helps you build your own models for accurate predictions.