Problem Statement

Aneesh R · Mar 2, 2021

This problem was part of Analytics Vidhya's Job-A-Thon.

The aim was to predict whether a lead would respond positively to a new policy. The given data has the following fields:

  • ID
  • City Code
  • Region Code
  • Accommodation Type
  • Recommended insurance type, coded as “Reco_Insurance_Type”: the type of insurance, either joint or individual, that was offered to the given ID.
  • Upper Age: the age at which the person’s holding policy matures.
  • Lower Age: the age at which the person started paying for the current policy.
  • Is_Spouse: whether the person is married or not.
  • Health Indicator: coded as X1, X2, …, X9; not much information is given about the nature of the encoding.
  • Holding Policy Duration.
  • Recommended Policy Category: not much information is provided about the encoding of the categories.
  • Recommended Policy Premium: the only continuous variable in the data.
  • Response: the outcome variable.

Note that this company had approached its own customer base with the new insurance schemes.

I did this analysis in Google Colab.
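The snippets below assume a setup along these lines; the file names are my assumption, not from the original post.

import pandas as pd
import seaborn as sns

# Load the hackathon data (hypothetical file names).
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')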


Part 1: Exploratory Analysis

The unique values in each column are given below. One thing I notice is that each row represents a unique individual.

pd.DataFrame({'col':train.nunique().index,'unique_val':train.nunique().values})
[Output: table of unique value counts per column]

Let's find the number of null values in each column. Fortunately, this data uniformly encodes missing values as NaN rather than in other possible ways, which reduces the work needed to clean the data. In the dataframe below, we can observe that null values are found exclusively in three columns: 'Health Indicator', 'Holding_Policy_Duration', and 'Holding_Policy_Type'.

pd.DataFrame({'col':train.isnull().sum().index,'null_val':train.isnull().sum().values})
[Output: table of null counts per column]

One thing I noticed very late in the hackathon was that Upper_Age - Lower_Age = Holding_Policy_Duration. I used this relation to fill in the missing values; in fact, this relation alone is enough to fill every null in the Holding_Policy_Duration column.

Also notice that the 'Holding_Policy_Duration' column contains '14+' as a value in many rows; using the same relation, we can replace '14+' with the actual duration, as sketched below.
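A minimal sketch of this fill, assuming the relation holds exactly (the helper name fill_duration is mine, not from the original code):

# Hypothetical helper: derive Holding_Policy_Duration from the age relation.
def fill_duration(df):
    df = df.copy()
    duration = (df['Upper_Age'] - df['Lower_Age']).astype(float)
    # Replace both NaNs and the capped '14+' entries with the derived value.
    mask = df['Holding_Policy_Duration'].isnull() | (df['Holding_Policy_Duration'] == '14+')
    df.loc[mask, 'Holding_Policy_Duration'] = duration[mask]
    df['Holding_Policy_Duration'] = df['Holding_Policy_Duration'].astype(float)
    return df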

Now let's plot some count plots.

Looking at the count plots of Health Indicator and Holding_Policy_Type, we can see that X1 and Type 3 are the most frequent values in those columns respectively, so X1 and Type 3 are possible candidates for imputing the missing values.

sns.countplot(x='Health Indicator', data=train)
[Figure: count plot of Health Indicator]
sns.countplot(x='Holding_Policy_Type', data=train)
[Figure: count plot of Holding_Policy_Type]

Let's check whether the given data is balanced.

print_dx_perc(train, 'Response')

OUTPUT
------
0 accounts for 76.01% of the Response column
1 accounts for 23.99% of the Response column
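print_dx_perc is a small custom helper not shown in the post; a minimal sketch of what it might look like:

def print_dx_perc(df, col):
    # Print what percentage of the column each class accounts for.
    counts = df[col].value_counts(normalize=True) * 100
    for val, pct in counts.items():
        print('{} accounts for {:.2f}% of the {} column'.format(val, pct, col))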

Therefore the given data set is imbalanced.

Part 2: Imputation

train,test=my_filler(train),my_filler(test)
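my_filler is a custom helper; a minimal sketch, assuming it applies the fills discussed in Part 1 (fill_duration is the hypothetical helper sketched earlier):

def my_filler(df):
    df = df.copy()
    # Derive missing and '14+' policy durations from the age relation.
    df = fill_duration(df)
    # Impute the remaining nulls with the most frequent category seen above.
    df['Health Indicator'] = df['Health Indicator'].fillna('X1')
    df['Holding_Policy_Type'] = df['Holding_Policy_Type'].fillna(3.0)
    return df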

Let's check for null values post-imputation.

pd.DataFrame({'col':train.isnull().sum().index,'null_val':train.isnull().sum().values})
[Output: table of null counts per column after imputation]

Some necessary cleaning steps.

trainx,Y=my_cleaner(train)
testx,_=my_cleaner(test)
X=pd.concat([trainx,testx]).reset_index(drop=True)
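my_cleaner is likewise a custom helper; a minimal sketch, assuming it separates the target and drops the identifier:

def my_cleaner(df):
    df = df.copy()
    # Split off the target when present (the test set has no Response column).
    y = df.pop('Response') if 'Response' in df.columns else None
    # ID uniquely identifies each row, so it carries no signal.
    df = df.drop(columns=['ID'])
    return df, y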

Part 3: Encoding the variables
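The encoding code itself isn't shown in the excerpt; a minimal sketch, assuming a simple label encoding of the object-typed columns, applied to the combined frame X so that train and test share one mapping:

from sklearn.preprocessing import LabelEncoder

for col in X.select_dtypes(include='object').columns:
    # Fit on the combined frame so unseen test categories cannot occur.
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))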

XTRAIN = X[:len(Y)]
XTEST = X[len(Y):]

LightGBM is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
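clf and the train/validation split are not defined in the excerpt; a minimal training sketch, with the split ratio, stratification, and default parameters as my assumptions:

from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Hold out a stratified validation set from the labelled data.
X_train, X_valid, y_train, y_valid = train_test_split(
    XTRAIN, Y, test_size=0.2, stratify=Y, random_state=42)

clf = LGBMClassifier()
clf.fit(X_train, y_train)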

y_pred = clf.predict(X_valid)
accuracy = accuracy_score(y_valid, y_pred)
print('LightGBM Model accuracy score: {0:0.4f}'.format(accuracy))

y_pred_train = clf.predict(X_train)
print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))

cm = confusion_matrix(y_valid, y_pred)
print('Confusion matrix\n\n', cm)
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])

OUTPUT
------
LightGBM Model accuracy score: 0.7613
Training-set accuracy score: 0.7666
Confusion matrix

 [[7706   29]
 [2400   42]]

True Positives(TP) =  42

True Negatives(TN) =  7706

False Positives(FP) =  29

False Negatives(FN) =  2400

Note that with a roughly 76/24 class split, always predicting 0 would already score about 76%, and the confusion matrix shows the model recovers only 42 of the 2,442 positive responses in the validation set, so accuracy alone overstates performance on this imbalanced data.

