This is a Kaggle competition hosted by the Inter-American Development Bank (IDB) to identify the families most in need of financial aid. The IDB currently uses the Proxy Means Test (PMT) to verify income qualification, and it is hosting this competition to find machine learning approaches that improve on the PMT's performance.
This is my first Kaggle competition, and it coincided with my plan to start with a charity-related project. Let us begin by cleaning the data.
Import & Clean Data¶
Let us import the data into a pandas DataFrame.
import pandas as pd
df = pd.read_csv('train.csv')
df.info()
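Before cleaning, it helps to see which columns carry the most missing values. A quick sketch using the standard pandas API:
# Columns with the most missing values, in descending order
df.isnull().sum().sort_values(ascending=False).head(10)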
Empirical Cumulative Distribution Function (ECDF)¶
I am a DataCamp student, and my exploratory analysis always starts with the ECDF. Plotting the ECDF is a simple and effective way to examine the distribution of the data.
import numpy as np
# Calculate ECDF for a series
def ecdf(data):
    """Compute the ECDF x and y values for a one-dimensional array."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
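As a quick sanity check on the helper, here is the ECDF of a tiny toy array (illustrative values, not competition data):
x_demo, y_demo = ecdf([3, 1, 2])
print(x_demo)  # [1 2 3]
print(y_demo)  # [0.333... 0.667 1.0]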
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
Before plotting the ECDF for rent (v2a1), I would like to check for missing values.
df.v2a1.isnull().sum()
Let us check how many of the rows with missing rent belong to rented houses (tipovivi3 == 1).
len(df[(df.v2a1.isnull()) & (df['tipovivi3'] == 1)])
There are no rented houses with a missing rent value, so let us fill the missing values with zero.
df.v2a1.fillna(0, inplace=True)
Plotting the rent by household category gives a clear picture of the distribution. The graph below shows that a few values are extreme for this dataset.
x_ep, y_ep = ecdf(df[df['Target']==1].v2a1)
x_mp, y_mp = ecdf(df[df['Target']==2].v2a1)
x_vh, y_vh = ecdf(df[df['Target']==3].v2a1)
x_nh, y_nh = ecdf(df[df['Target']==4].v2a1)
plt.figure(figsize=(15,8))
plt.plot(x_ep, y_ep, marker = '.', linestyle='none')
plt.plot(x_mp, y_mp, marker = '.', linestyle='none')
plt.plot(x_vh, y_vh, marker = '.', linestyle='none')
plt.plot(x_nh, y_nh, marker = '.', linestyle='none', color='y')
plt.legend(('Extreme Poverty', 'Moderate Poverty', 'Vulnerable Household', 'Non-vulnerable Household'))
plt.margins(0.02)
plt.xlabel('Rent')
plt.ylabel('ECDF')
plt.show()
Let us check how many outliers there are in the rent.
df[df['v2a1'] > 1000000].head()
It looks like there are only two such rows, so let us remove them.
df = df[df['v2a1'] <= 1000000].copy()  # drop only the rows flagged above
Let us clean the remaining feature variables.
# v18q1 (number of tablets owned) is NaN when the household owns none
df.v18q1.fillna(0, inplace=True)
# SQBmeaned is the square of meaneduc, so take the square root when filling
df.meaneduc.fillna(np.sqrt(df.SQBmeaned), inplace=True)
df.meaneduc.fillna(0, inplace=True)
df['meaneduc'] = pd.to_numeric(df['meaneduc'])
# rez_esc (years behind in school) is NaN for people outside school age
df.rez_esc.fillna(0, inplace=True)
# SQBdependency is the square of the dependency rate
df.dependency.fillna(np.sqrt(df.SQBdependency), inplace=True)
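As a sanity check, the imputed columns should now be free of missing values:
# Confirm the imputation left no NaNs behind (all counts should be 0)
df[['v2a1', 'v18q1', 'meaneduc', 'rez_esc', 'dependency']].isnull().sum()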
Some household members have a Target that differs from their head of household's. Since the Target is a household-level label, let us set those records to the head of household's value.
for item in df['idhogar'].unique():
    df_household = df[df['idhogar'] == item]
    head_target = df_household[df_household['parentesco1'] == 1]['Target'].values
    if len(head_target) == 0:
        continue  # a few households have no head-of-household row
    # Overwrite any member whose label disagrees with the head's
    df.loc[(df['idhogar'] == item) & (df['Target'] != head_target[0]), 'Target'] = head_target[0]
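A quick way to check the result is to count households that still carry more than one label (only head-less households, if any, could remain):
# Number of households with more than one distinct Target value
print((df.groupby('idhogar')['Target'].nunique() > 1).sum())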
Let us select features based on their Pearson correlation with the Target.
def pearson_r(x, y):
    """Compute the Pearson correlation coefficient between two arrays."""
    corr_mat = np.corrcoef(x, y)
    return corr_mat[0, 1]

for col in df.columns:
    if df[col].dtype != 'object':
        print('Column : {0}, Corr : {1}'.format(col, pearson_r(df[col], df.Target)))
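The same ranking can be done in one pass with pandas' corrwith, which also makes it easy to sort by absolute correlation; a minimal sketch:
# Rank numeric columns by absolute correlation with the Target
correlations = df.select_dtypes(exclude='object').corrwith(df['Target']).abs()
print(correlations.sort_values(ascending=False).head(30))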
Let us keep the features that are most strongly correlated with the Target and split the data for training and validation.
from sklearn.model_selection import train_test_split
X = df[['v2a1','rooms','refrig','v18q','v18q1','r4h2', 'escolari', 'paredblolad','pisomoscer','cielorazo','energcocinar2',
'elimbasu1', 'epared3', 'etecho3','eviv3','estadocivil3','hogar_adul','meaneduc','instlevel8','bedrooms','tipovivi2',
'computer','television','qmobilephone','lugar1','age']]
y= df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=0, oob_score=True, n_jobs=-1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
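Because the forest was fit with oob_score=True, the out-of-bag estimate and the feature importances give a quick read on the model before we look at the held-out set:
# Out-of-bag accuracy estimate from the training data
print('OOB score:', model.oob_score_)
# Selected features ranked by importance in the fitted forest
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))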
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
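Since this competition is scored on the macro-averaged F1 score, that metric is worth tracking alongside accuracy:
from sklearn.metrics import f1_score
print('Macro F1:', f1_score(y_test, y_pred, average='macro'))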
Let us import and clean the test data in the same way.
df_test = pd.read_csv('test.csv')
df_test.v2a1.fillna(0, inplace=True)
df_test.v18q1.fillna(0, inplace=True)
# Use the test set's own SQB columns here, not the train df's
df_test.meaneduc.fillna(np.sqrt(df_test.SQBmeaned), inplace=True)
df_test.meaneduc.fillna(0, inplace=True)
df_test['meaneduc'] = pd.to_numeric(df_test['meaneduc'])
df_test.rez_esc.fillna(0, inplace=True)
df_test.dependency.fillna(np.sqrt(df_test.SQBdependency), inplace=True)
ids = df_test['Id']
test_features = df_test[['v2a1','rooms','refrig','v18q','v18q1','r4h2', 'escolari', 'paredblolad','pisomoscer','cielorazo','energcocinar2',
'elimbasu1', 'epared3', 'etecho3','eviv3','estadocivil3','hogar_adul','meaneduc','instlevel8','bedrooms','tipovivi2',
'computer','television','qmobilephone','lugar1','age']]
test_pred = model.predict(test_features)
submit = pd.DataFrame({'Id' : ids, 'Target' : test_pred})
submit.to_csv('submit.csv', index=False)
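One last sanity check before submitting: the file should contain one row per test Id and a plausible label distribution.
print(submit.shape)
print(submit['Target'].value_counts())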