Preparing a Dataset for Machine Learning with scikit-learn Image Preparing a Dataset for Machine Learning with scikit-learn

Kevin Jolly Kevin Jolly ā° 7 Minutes šŸ“… Jan 19, 2019

The first step to implementing anyĀ machineĀ learning algorithm with scikit-learn is data preparation. Scikit-learn comes with a set of constraints to implementation. The dataset that we will be using is based on mobile payments and is found on the world’s most popular competitive machine learning website ā€“ Kaggle. You canĀ downloadĀ the dataset from:Ā https://www.kaggle.com/ntnu-testimon/paysim1.

Once downloaded, open a new Jupyter Notebook using the following code in Terminal (macOS/Linux) or Anaconda Prompt/PowerShell (Windows):

$ Jupyter Notebook

The fundamental goal of this dataset is to predict whether a mobile transaction is fraudulent. In order to do this, we need to first have a brief understanding of the contents of our data. In order to explore the dataset, we will use theĀ pandasĀ package in Python. You can install pandas using the following code in Terminal (macOS/Linux) or PowerShell (Windows):

$ pip3 install pandas

Pandas can be installed on Windows machines in an Anaconda Prompt using the following code:

$ conda install pandas

We can now read in the datasetĀ intoĀ our Jupyter Notebook using the following code:

#Package Imports

import pandas as pd

#Reading in the dataset

df = pd.read_csv('PS_20174392719_1491204439457_log.csv')

#Viewing the first 5 rows of the dataset

df.head()

This produces an output as illustrated in theĀ followingĀ screenshot:

jupter notebook output screenshot

Dropping features that are redundant

From the dataset seen previously, there are a few columns that areĀ redundantĀ to the machine learning process:

  • nameOrig:Ā This column is a unique identifier that belongs to each customer. Since each identifier is unique with every row of the dataset, the machine learning algorithm will not be able to discern any patterns from this feature.
  • nameDest:Ā This column is also a unique identifier that belongs to each customer and as such provides no value to the machine learning algorithm.
  • isFlaggedFraud:Ā This column flags a transaction as fraudulent if a person tries to transfer more than 200,000 in a single transaction. Since we already have a feature calledĀ isFraudĀ that flags a transaction as fraud, this feature becomes redundant.

We can drop these features from the dataset using the following code:

#Dropping the redundant features

df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

Reducing the size of the data

The dataset that we are working with contains over 6Ā millionĀ rows of data. Most machine learning algorithms will take a large amount of time to work with a dataset of this size. In order to make our execution time quicker, we will reduce the size of the dataset to 20,000 rows. We can do this using theĀ followingĀ code:

#Storing the fraudulent data into a dataframe

df_fraud = df[df['isFraud'] == 1]

#Storing the non-fraudulent data into a dataframe

df_nofraud = df[df['isFraud'] == 0]

#Storing 12,000 rows of non-fraudulent data

df_nofraud = df_nofraud.head(12000)

#Joining both datasets together

df = pd.concat([df_fraud, df_nofraud], axis = 0)

In the preceding code, the fraudulent rows are stored in one dataframe. This dataframe contains a little over 8,000 rows. The 12,000 non-fraudulent rows are stored in another dataframe, and the two dataframes are joined together using theĀ concatĀ method from pandas.

This results in a dataframe with a little over 20,000 rows, over which we can now execute our algorithms relatively quickly.

Encoding the categorical variables

One of the main constraints of scikit-learn isĀ thatĀ you cannot implement the machine learning algorithms on columns that are categorical in nature. For example, theĀ typeĀ column in our dataset has five categories:

  • CASH-IN
  • CASH-OUT
  • DEBIT
  • PAYMENT
  • TRANSFER

These categories will have to be encoded into numbers that scikit-learn can make sense of. In order to do this, we have to implement a two-step process.Ā The first step is to convert each category into a number:Ā CASH-IN = 0,Ā CASH-OUT = 1,Ā DEBIT = 2,Ā PAYMENT = 3,Ā TRANSFER = 4. We can do this using the following code:

#Package Imports

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#Converting the type column to categorical

df['type'] = df['type'].astype('category')

#Integer Encoding the 'type' column

type_encode = LabelEncoder()

#Integer encoding the 'type' column

df['type'] = type_encode.fit_transform(df.type)

First, the code coverts theĀ typeĀ column to a categorical feature. We then useĀ LabelEncoder()Ā in order to initialize an integer encoder objectĀ thatĀ is calledĀ type_encode.Ā Finally, we apply theĀ fit_transformĀ method on theĀ typeĀ column in order to convert each category into a number.

Broadly speaking, there are two types of categorical variables:

  • Nominal
  • Ordinal

Nominal categorical variables have no inherent order to them. An example of the nominal type of categorical variable is theĀ typeĀ column.

Ordinal categorical variables have an inherent order to them. An example of the ordinal type of categorical variable is Education Level, in which people with a Master’s degree will have a higher order/value compared to people with an Undergraduate degree only.

In the case of ordinal categorical variables, integer encoding, as illustrated previously, is sufficient and we do not need to one-hot encode them. Since theĀ typeĀ column is a nominal categorical variable, we have to one-hot encode it after integer encoding it. This is done using the following code:

#One hot encoding the 'type' column

type_one_hot = OneHotEncoder()

type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()

#Adding the one hot encoded variables to the dataset

ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(int(i)) for i in range(type_one_hot_encode.shape[1])])

df = pd.concat([df, ohe_variable], axis=1)

#Dropping the original type variable

df = df.drop('type', axis = 1)

#Viewing the new dataframe after one-hot-encoding

df.head()

In the code, we first create a one-hotĀ encoding object calledĀ type_one_hot.Ā We then transform theĀ typeĀ column into one-hot encoded columns using theĀ fit_transformĀ method.

We have five categoriesĀ thatĀ are represented by integers 0 to 4. Each of these five categories will now get its own column. Therefore, we create five columns calledĀ type_0,Ā type_1,Ā type_2,Ā type_3, andĀ type_4.Ā Each of these five columns is represented by two values:Ā 1, which indicates the presence of that category, andĀ 0, which indicates the absence of that category.

This information is stored in theĀ ohe_variable.Ā Since this variable holds the five columns, we will join this to the original dataframe using theĀ concatĀ method fromĀ pandas.

The ordinalĀ typeĀ column is then dropped from the dataframe as this column is now redundant post one hot encoding. The final dataframe now looks like this:

jupter notebook output screenshot

Missing values

Another constraint with scikit-learn isĀ thatĀ it cannot handle data with missing values. Therefore, we must check whether our dataset has any missing values in any of the columns to begin with. We can do this using the following code:

#Checking every column for missing values

df.isnull().any()

This produces this output:

jupter notebook output screenshot

Here we note that every column has some amount of missing values.Ā Missing values can be handled in a variety of ways, such as the following:

  • Median imputation
  • Mean imputation
  • Filling them with the majority value

The amount of techniques is quite large and varies depending on the nature of your dataset. This process of handling features with missingĀ valuesĀ is calledĀ feature engineering. Feature engineering can beĀ doneĀ for both categorical and numerical columns. We will impute all the missing values with a zero.

We can do this using the following code:

#Imputing the missing values with a 0

df = df.fillna(0)

We now have a dataset that is ready for machine learning with scikit-learn. You can also export this dataset as aĀ .csvĀ file and store it in the same directory that you are working in with the Jupyter Notebook using the following code:

df.to_csv('fraud_prediction.csv')

This will create aĀ .csvĀ file of this dataset in the directory that you are working in, which you can load into the notebook again using pandas.

Conclusion

If you found this article interesting, you can explore Machine Learning with scikit-learn Quick Start Guide to deploy supervised and unsupervised machine learning algorithms using scikit-learn to perform classification, regression, and clustering.

Machine Learning with scikit-learn Quick Start Guide teaches you how to use scikit-learn for machine learning and can help you in building your own machine learning models for accurate predictions.