
Efficiently Identify Prospects Using Machine Learning

Table of Contents

  1. Data ingestion
  2. Exploratory Data Analysis
  3. Data preprocessing
  4. Model Training
  5. Model Testing
  6. Conclusion
  7. Credits

1. Data Ingestion

1.1 Introduction

The purpose of this notebook is to create a model that identifies prospective clients, which in this dataset are the customers that have subscribed to a particular term deposit product.

Equipped with the information on the customer base and previous marketing campaign efforts, we can understand the effectiveness of the campaign. This model identifies subscribed customers based on a classification algorithm.

This notebook can be useful for any business that wishes to run a re-targeting or similar audience marketing campaign as well as cross-sell a new product.

With similar data to the columns in this dataset, one can identify customers most likely to subscribe to the product and plan subsequent campaigns accordingly by focusing on a subset of potential customers.

Narrowing the customers targeted by a particular campaign not only reduces costs but also the risk of a failed campaign and/or overall product failure.

Thanks to the University of California, Irvine, for providing the dataset.

Source:

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22–31, June 2014.

1.2 Understanding the data

  • job: Job category the customer belongs to (categorical: ’admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’)
  • marital: Marital status of the customer (categorical: ’divorced’,’married’,’single’,’unknown’; note: ’divorced’ means divorced or widowed)
  • education: Education level of customer (categorical: ’basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
  • default: Is the customer a defaulter? (categorical: ’no’,’yes’,’unknown’)
  • balance: Overall balance amount of the customer.
  • housing: Does the customer have any housing loan? (categorical: ’no’,’yes’,’unknown’)
  • loan: Does the customer have any personal loan? (categorical: ’no’,’yes’,’unknown’)

Related to the last contact of the current campaign:

  • contact: Customer contact communication type (categorical: ’cellular’,’telephone’)
  • day: Last day of the month on which the customer was contacted (numeric: 1, 2, 3, …, 31)
  • month: Last month in which the customer was contacted (categorical: ’jan’, ’feb’, ’mar’, …, ’nov’, ’dec’)
  • duration: Last contact duration with the customer, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration=0 then the target variable y=’no’, i.e. the customer has not subscribed). However, in a practical scenario the duration is not known before a call is made, and once the call has ended the outcome y is obviously known. This input should therefore only be included for benchmarking purposes and should be discarded if the intention is to build a realistic predictive model.

Other attributes

  • campaign: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • pdays: Number of days that passed since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
  • previous: Number of contacts performed before this campaign and for this client (numeric)
  • poutcome: Outcome of the previous marketing campaign (categorical: ’failure’,’nonexistent’,’success’)

Output variable (desired target)

  • y: The target variable. It indicates whether the client subscribed to the term deposit product for which the campaign was launched (binary: ’yes’,’no’)

1.3 Statistical Analysis

First, we hide the warnings in the output of the Python code. These warnings are not necessary and do not provide any additional useful information.

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

Now let's import the required libraries. pandas is used for easy manipulation of data frames, while seaborn and Matplotlib are used for visualization.

# importing the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

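The rest of the notebook works with a data frame called df_bank, which we load first. A minimal sketch of the loading step is shown below; the file name bank-full.csv, the semicolon separator and the renaming of the target column y to subscribed are assumptions based on how the data is referenced later in this notebook.

# load the dataset into a data frame (file name and separator are assumptions;
# the UCI CSV files are semicolon-separated)
df_bank = pd.read_csv('bank-full.csv', sep=';')

# the rest of the notebook refers to the target column as 'subscribed'
df_bank = df_bank.rename(columns={'y': 'subscribed'})
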
Performing statistical analysis on the fields in the dataset.

# statistical description of the fields
df_bank.describe()

The dataset contains 7 numeric variables and 10 categorical variables. We first have to convert the categorical variables to numeric in order to proceed with the analysis.

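A quick way to verify this split between numeric and categorical columns is to inspect the column dtypes; a minimal sketch (this check is an addition for illustration):

# count columns by data type; 'object' columns are the categorical ones
print(df_bank.dtypes.value_counts())
print(df_bank.select_dtypes(include='object').columns.tolist())
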
# checking for null data
print ("Presence of any null values: " +
str(df_bank.isnull().values.any()))

We observe that there are no null values in the dataset.

2. Exploratory Data Analysis

Now we will explore the non-numeric (i.e. categorical) variables to get some valuable insights. The code below plots four bar plots that help us analyze the customer data of the bank.

# Checking customer base i.e. clients of the bank
fig = plt.figure(figsize=(15,15))
plt.subplot(2,3,1)
pd.value_counts(df_bank['education']).plot.bar()
plt.title('EDUCATION')
plt.subplot(2,3,2)
pd.value_counts(df_bank['poutcome']).plot.bar()
plt.title('OUTCOME')
plt.subplot(2,3,3)
pd.value_counts(df_bank['contact']).plot.bar()
plt.title('CONTACT')
plt.subplot(2,3,4)
pd.value_counts(df_bank['job']).plot.bar()
plt.title('JOB')
plt.axis('tight')
plt.show()

The above plots show that many of the bank's customers have a secondary level of education, hold blue-collar jobs, were contacted via cellular, and have an unknown outcome from the previous marketing campaign.

Let us further analyze five more bar plots of the customer data of the bank.

# Analyzing the customer base using bar graphs
fig = plt.figure(figsize=(15,15))
plt.subplot(2,3,1)
pd.value_counts(df_bank['month']).plot.bar()
plt.title('MONTH')
plt.subplot(2,3,2)
pd.value_counts(df_bank['default']).plot.bar()
plt.title('DEFAULT')
plt.subplot(2,3,3)
pd.value_counts(df_bank['housing']).plot.bar()
plt.title('HOUSING')
plt.subplot(2,3,4)
pd.value_counts(df_bank['loan']).plot.bar()
plt.title('LOAN')
plt.subplot(2,3,5)
pd.value_counts(df_bank['subscribed']).plot.bar()
plt.title('SUBSCRIBED')
plt.axis('tight')
plt.show()

The above plots show that customers were mostly contacted during the month of May, and that most customers have not subscribed to the term deposit.

Also, many of the bank's customers have a housing loan, are not defaulters, and do not have a personal loan.

We will now look at the number of customers that have actually subscribed after the campaign. We will first calculate the percentage of subscribed customers and then plot it as a pie chart.

We will add appropriate labels and use the explode feature to pull the ’Subscribed’ slice out of the pie chart.

# Check the percentage of subscribed customers
df_not_subscribed = df_bank.loc[df_bank['subscribed'] == 'no']
df_subscribed = df_bank.loc[df_bank['subscribed'] == 'yes']
sub = len(df_not_subscribed.index)
nsub = len(df_subscribed.index)

# Pie chart
labels = ['Not Subscribed', 'Subscribed']
explode = (0, 0.1)

# add colors
colors = ['#ff9999', '#66b3ff']
sizes = [(sub/(sub+nsub))*100, (nsub/(sub+nsub))*100]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, explode=explode, colors=colors, autopct='%1.1f%%',
        shadow=True, startangle=90)

# Equal aspect ratio ensures that the pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.show()

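As a quick cross-check of the percentages shown in the pie chart, the subscription rate can also be computed directly with value_counts; a minimal sketch:

# share of subscribed vs. not subscribed customers, in percent
print(df_bank['subscribed'].value_counts(normalize=True) * 100)
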
The pie chart shows that only 11.7% of customers have subscribed to the term deposit after the campaign. For the numerical data, it is necessary to check for outliers [1].

We are going to plot two box plots: one for balance [2] and the other for duration [3].

# Box plots to check for outliers
fig = plt.figure(figsize=(15, 15))
plt.subplot(2,3,1)
sns.boxplot(x='balance', y='subscribed', data=df_bank)
plt.title('BALANCE')
plt.subplot(2,3,2)
sns.boxplot(x='duration', y='subscribed', data=df_bank)
plt.title('DURATION')
plt.show()

A few outliers [1] can be observed in balance [2] and duration [3]. The balance of most customers lies between 0 and 20,000. Also, the customers who subscribed had, on average, a longer contact duration with bank personnel than those who did not subscribe.

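The outliers can also be counted numerically rather than only read off the box plots. Below is a minimal sketch using the common 1.5 * IQR rule; this rule of thumb and the helper function are additions for illustration, not part of the original notebook.

# count values falling outside the 1.5*IQR whiskers of a numeric column
def count_outliers(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((series < lower) | (series > upper)).sum()

print('balance outliers:', count_outliers(df_bank['balance']))
print('duration outliers:', count_outliers(df_bank['duration']))
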
Contact duration plays an important role and highly affects the subscription decision.

3. Data Preprocessing

First, we have to convert the categorical variables to dummy variables [4] (a numeric representation).

To do this, we will create a function, convertToDummy, that uses pandas' get_dummies method to convert a categorical variable into dummy columns.

The function then deletes one of the dummy columns to avoid the dummy variable trap [5] (multicollinearity [6] issues).

# function to convert a categorical variable to dummy variables
def convertToDummy(df, column):
    # Create dummy variables for the categorical variable
    df_dummies = pd.get_dummies(column)

    # Delete one of the dummy variables to avoid multicollinearity issues
    del df_dummies[df_dummies.columns[-1]]

    # Add the new columns to the existing data frame
    df = pd.concat([df, df_dummies], axis=1)
    return df

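As a side note, pandas can achieve the same effect in a single step via the drop_first option of get_dummies; it drops the first dummy level instead of the last one, which equally avoids the dummy variable trap. A minimal sketch using the marital column as an example:

# equivalent one-step approach: drop the first dummy level to avoid the trap
marital_dummies = pd.get_dummies(df_bank['marital'], drop_first=True)
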
Now we convert the dependent variable, subscribed, from its ’yes’/’no’ values to binary 1/0 using the map function.

# Convert the target variable subscribed to binary for further processing
df_bank['subscribed'] = df_bank['subscribed'].map({'yes': 1, 'no': 0})

First we convert the categorical variables to dummy variables [4], and afterwards we delete the original columns, as they are no longer required.

We are also going to delete some columns which are not significant for the creation of the model.

The low correlation [7] of these columns can also be observed in the correlation matrix detailed in section 3.1, so deleting them will not have a noticeable effect on the accuracy of the model.

# delete unwanted columns (day and month)
del df_bank['day']
del df_bank['month']

# Create dummy variables for categorical variables
df_bank = convertToDummy(df_bank, df_bank['marital'])
df_bank = convertToDummy(df_bank, df_bank['job'])
df_bank = convertToDummy(df_bank, df_bank['education'])
df_bank = convertToDummy(df_bank, df_bank['poutcome'])
df_bank = convertToDummy(df_bank, df_bank['contact'])

# Delete the original categorical columns as new variables have been created
del df_bank['marital']
del df_bank['job']
del df_bank['education']
del df_bank['poutcome']
del df_bank['contact']

# Convert default, housing, loan to binary for further processing
df_bank['default'] = df_bank['default'].map({'yes': 1, 'no': 0})
df_bank['loan'] = df_bank['loan'].map({'yes': 1, 'no': 0})
df_bank['housing'] = df_bank['housing'].map({'yes': 1, 'no': 0})

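After these conversions every remaining column should be numeric. A quick sanity check (this verification step is an addition for illustration):

# should print an empty list if all categorical columns have been encoded
print(df_bank.select_dtypes(include='object').columns.tolist())
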
3.1 Checking correlation matrix (Heat Map)

We will plot a heat map to check the correlation [7] between the variables, using the heatmap method from the seaborn package.

# Plotting heat map for displaying correlation matrix
fig = plt.figure(figsize=(20,20))
corr = df_bank.corr()
sns.heatmap(corr, annot = True)

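Beyond eyeballing the heat map, strongly correlated feature pairs can also be listed programmatically; a minimal sketch (the 0.5 threshold is an arbitrary choice for illustration):

import numpy as np

# keep only one copy of each pair by masking the diagonal and the upper triangle
corr_abs = corr.abs()
mask = np.triu(np.ones(corr_abs.shape, dtype=bool))
pairs = corr_abs.mask(mask).stack().sort_values(ascending=False)

# show feature pairs whose absolute correlation exceeds the threshold
print(pairs[pairs > 0.5])
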
4. Model Training

Let’s import the libraries that are required for creating the model.

# Importing the required libraries
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Now we will convert the dependent [8] and independent variables [9] into matrix form, i.e. a matrix of features and a target vector, so that they can be used in the models.

# Creating the matrix of features (independent variables) and the target vector
x = df_bank.loc[:, df_bank.columns != 'subscribed'].values
y = df_bank['subscribed'].values

We will now divide our dataset into a training set, a test set and a validation set. This helps us detect overfitting [10] issues.

We first split the whole dataset, assigning 70% to the training set and 30% to the test set.

We then split that test set further, keeping 70% of it as the test set and 30% as the validation set (so roughly 70% / 21% / 9% of the full data).

We use the train_test_split method from the sklearn.model_selection library to split the dataset.

# Split dataset into training, test and validation sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
x_test, x_valid, y_test, y_valid = train_test_split(x_test, y_test, test_size=0.3, random_state=0)

We will now perform feature scaling, that is, standardizing [11] the variables before passing them to the classification models, using the StandardScaler method from the sklearn.preprocessing package. The scaler is fitted on the training set only and then applied to the test and validation sets.

# Feature Scaling: fit the scaler on the training data only,
# then apply the same transformation to the test and validation data
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
x_valid = sc_x.transform(x_valid)

The function below fits a set of classification models to the dataset that is passed to it.

We first initialize all the classification models, then fit each model to the dataset provided as a parameter (normally the training dataset).

def classify(x_train, y_train):

    # Initializing the models
    lr = LogisticRegression(random_state=0)
    xgb = XGBClassifier(random_state=0)
    dtc = DecisionTreeClassifier(random_state=0)
    rfc = RandomForestClassifier(random_state=0)
    nbc = GaussianNB()
    svm = SVC(kernel='rbf', C=1, gamma='auto')
    knn = KNeighborsClassifier(n_neighbors=3)

    # Fitting the models to the training data set
    lr.fit(x_train, y_train)
    xgb.fit(x_train, y_train)
    dtc.fit(x_train, y_train)
    rfc.fit(x_train, y_train)
    nbc.fit(x_train, y_train)
    svm.fit(x_train, y_train)
    knn.fit(x_train, y_train)

    classifiers = [lr, xgb, dtc, rfc, nbc, svm, knn]
    return classifiers

Invoke the above method, and pass the training dataset through to train the model.

# Passing training data to the classify function to fit the various classification models
classifiers = classify(x_train, y_train)

5. Testing the Model

The function below displays the accuracy of all the models in a graphical view.

# Function to plot the accuracy of the models
def plot_accuracy_plot(accuracy):
    dims = (11.7, 8.27)
    fig, ax = plt.subplots(figsize=dims)
    plt.xlabel('Accuracy')
    plt.title('Classifier Accuracy')
    sns.set_color_codes("muted")
    splot = sns.barplot(ax=ax, x='Accuracy', y='Classifier', data=accuracy, color="b")
    plt.show()

The function below performs the predictions and returns the accuracy scores as a data frame. We first call each classifier's predict method on the data passed in, calculate the accuracy using the accuracy_score [12] method, and save the result to a dictionary.

# This function performs prediction,
# plots the accuracy score of all classifiers and
# returns the accuracy score data frame
def predict(x_test, classifiers, y_test):
    lr_test_pred = classifiers[0].predict(x_test)
    xgb_test_pred = classifiers[1].predict(x_test)
    dtc_test_pred = classifiers[2].predict(x_test)
    rfc_test_pred = classifiers[3].predict(x_test)
    nbc_test_pred = classifiers[4].predict(x_test)
    svm_test_pred = classifiers[5].predict(x_test)
    knn_test_pred = classifiers[6].predict(x_test)

    # judge accuracy using the built-in accuracy_score function
    accuracy_test = dict()
    accuracy_test['Logistic Regression'] = accuracy_score(y_test, lr_test_pred)
    accuracy_test['XGBoost'] = accuracy_score(y_test, xgb_test_pred)
    accuracy_test['DecisionTree'] = accuracy_score(y_test, dtc_test_pred)
    accuracy_test['RandomForest'] = accuracy_score(y_test, rfc_test_pred)
    accuracy_test['Naive_bayes'] = accuracy_score(y_test, nbc_test_pred)
    accuracy_test['support_vector_Machines'] = accuracy_score(y_test, svm_test_pred)
    accuracy_test['KNN'] = accuracy_score(y_test, knn_test_pred)
    print(accuracy_test)

    # convert the dictionary to a data frame and sort by accuracy
    df_acc = pd.DataFrame([accuracy_test.keys(), accuracy_test.values()]).T
    df_acc.columns = ['Classifier', 'Accuracy']
    df_acc = df_acc.sort_values(by=['Accuracy'], ascending=False)

    # Plot accuracy plot
    plot_accuracy_plot(df_acc)
    return df_acc

We then convert the dictionary into a data frame, plot the accuracies by invoking the plot_accuracy_plot method, and return the accuracy data frame sorted in descending order.

We invoke the predict method and pass in the test dataset.

# Predicting the test data set
predict(x_test, classifiers, y_test)

Result: XGBoost provides the highest accuracy score [12] of 89.99% among the classifiers on the test dataset.

Now we invoke the predict method and pass in the validation dataset.

# Predicting the validation data set
predict(x_valid, classifiers, y_valid)

Result: XGBoost provides the highest accuracy score [12] of 89.83% among the classifiers on the validation dataset.

6. Conclusion

In summary, we have built a model with around 89% accuracy that can identify the customers most likely to subscribe to the product, enabling us to target those customers and focus our advertising efforts on them.

This will improve marketing and operational efficiency, as we will be focused on a particular section of the customer base which we now know is more likely to subscribe to our services.

For running your dataset in an automated environment and getting valuable insights, kindly try out our new product 5411!

7. Credits

Thanks again to the University of California, Irvine, for providing this dataset.

References

[1] outliers: A value that "lies outside" (is much smaller or larger than) most of the other values in a set of data. For example, in the scores 25, 29, 3, 32, 85, 33, 27, 28, both 3 and 85 are outliers.

[2] balance: Overall balance amount of the customer

[3] duration: Last contact duration with the customer, in seconds (numeric)

[4] dummy variables: Dummy variables are "proxy" variables, i.e. numeric stand-ins for qualitative facts (categorical variables).

[5] dummy variable trap: The dummy variable trap is a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated; in simple terms, one variable can be predicted from the others.

[6] multicollinearity: Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

[7] correlation: Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. Positive correlation exists when two variables move in the same direction. A basic example of positive correlation is height and weight: taller people tend to be heavier, and vice versa.

[8] dependent variable: The dependent variable is sometimes called the "outcome variable"; it is also known as the target variable.

[9] independent variable: An independent variable is a variable believed to affect the dependent variable

[10] overfitting: Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.

[11] standardizing: Standardization is the process of putting different variables on the same scale. This process allows you to compare values across variables with different ranges.

[12] accuracy score: Accuracy is the fraction of predictions the model got right. It is the number of correct predictions divided by the total number of predictions, multiplied by 100 to express it as a percentage. The higher the accuracy score, the better the model.
