1.Introduction
The venerable insurance industry is no stranger to data driven decision making. Yet in today's rapidly transforming digital landscape, Insurance is struggling to adapt and benefit from new technologies compared to other industries. Extremely complex underwriting rule-sets that are radically different in different product lines, many non-KYC environments with a lack of centralized customer information base, complex relationship with consumers in traditional risk underwriting where sometimes customer centricity runs reverse to business profit, inertia of regulatory compliance - are some of the unique challenges faced by Insurance Business.
Inurance premium vary over person to person. Insurance companies decide the charges for a perticular person based on the probability of happening of an event prescribed in the helth insurance. To avoid loss companies take the account of personal attributes of the person to calculate the probability of happening of an event. In this project we want to predict the insurance charges using data which includes attributes of the people and charges applied to them
1.1 Brief discription of the attributes in the dataset
1) age: age of primary beneficiary
2) sex: gender of the beneficiary (male,female)
3) bmi: Body mass index, providing an understanding of body,weights that are relatively high or low relative to height, objective index of body weight(kg/m) using the ratio of height to weight, ideally 18.5 to 24.9
4) children: Number of children covered by health insurance/ Number of dependent
5) smoker: if the beneficiary is smoker or not
6) region: the beneficiary's residential area in the US,northeast,southeast,northwest,southwest
7) charges: individual medical premium billed by health insurance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('C:\\Users\\Avinash Godbole\\Documents\\Dataset\\issurance.csv')
df.head()
age | sex | bmi | children | smoker | region | charges | |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
df.info() # it gives the total rows, colums and datatype.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1338 entries, 0 to 1337 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 1338 non-null int64 1 sex 1338 non-null object 2 bmi 1338 non-null float64 3 children 1338 non-null int64 4 smoker 1338 non-null object 5 region 1338 non-null object 6 charges 1338 non-null float64 dtypes: float64(2), int64(2), object(3) memory usage: 73.3+ KB
We can see that varibles such as sex, smoker and region are of object type and other varibles are numerical so we need to convert the varibles with object type into numerical type
df.isna().sum()
age 0 sex 0 bmi 0 children 0 smoker 0 region 0 charges 0 dtype: int64
df.describe()
age | bmi | children | charges | |
---|---|---|---|---|
count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |
the described method will help to see how data has been spread for nemerical value we can clearly see the minumum value,mean,maximum value,percentile value.
df.hist(bins=10,figsize=(10,10))
plt.show()
# Scatterplot for Age vs Charges
plt.figure(figsize=(8,5))
sns.scatterplot(x='age',y='charges',data=df,hue='smoker')
plt.title('AGE vs Charges')
Text(0.5, 1.0, 'AGE vs Charges')
# scatterplot BMI vs Charges
plt.figure(figsize=(8,5))
sns.scatterplot(x='bmi',y='charges',data=df,hue='smoker')
plt.title('BMI vs Charges')
Text(0.5, 1.0, 'BMI vs Charges')
df.boxplot()
<AxesSubplot:>
df.boxplot(['age','bmi'])
<AxesSubplot:>
data_clip = df.iloc[:]
columns = ['bmi','charges']
def clipValues(data_clip,lower,upper):
for col in columns:
l=data_clip[col].quantile(lower)
u=data_clip[col].quantile(upper)
data_clip[col]=data_clip[col].clip(l,u)
return data_clip
data=clipValues(data_clip,0.05,0.9)
#cheak the all outliers are removed or not
data.boxplot(['age','bmi'])
<AxesSubplot:>
df.boxplot(['charges'])
<AxesSubplot:>
We need to do label encoding to convert object type variables into numerical type so that we can use those variables in the model
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data["sex"]=le.fit_transform((data['sex']))
data["smoker"]=le.fit_transform((data['smoker']))
data["region"]=le.fit_transform((data['region']))
data.head()
age | sex | bmi | children | smoker | region | charges | |
---|---|---|---|---|---|---|---|
0 | 19 | 0 | 27.900 | 0 | 1 | 3 | 16884.92400 |
1 | 18 | 1 | 33.770 | 1 | 0 | 2 | 1757.75340 |
2 | 28 | 1 | 33.000 | 3 | 0 | 2 | 4449.46200 |
3 | 33 | 1 | 22.705 | 0 | 0 | 1 | 21984.47061 |
4 | 32 | 1 | 28.880 | 0 | 0 | 1 | 3866.85520 |
Now all the variables are in the numerical type
plt.figure(figsize=(8,5))
sns.scatterplot(x='age',y='charges',data=df,hue='smoker')
<AxesSubplot:xlabel='age', ylabel='charges'>
plt.scatter(ytest.values,ypred)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
Text(0, 0.5, 'Residuals')
From the above plot we can conclude that we dont have homoscedasticity in the data
residuals = ytest.values - ypred
sns.histplot(residuals,kde=True)
<AxesSubplot:ylabel='Count'>
We can see that the Residuals have normal distribution
print("Mean of Residual is :", np.mean(residuals))
Mean of Residual is : -0.06781207178145235
We can see that the residuals have mean = 0
import statsmodels.api as sm
sm.graphics.tsa.plot_acf(residuals)
plt.show()
From the above acf plot we can conclude that the there is no autocorrelatiion is the data.
plt.figure(figsize = (15,8))
sns.heatmap(data.corr(),annot = True)
<AxesSubplot:>
from statsmodels.stats.outliers_influence import variance_inflation_factor
def get_vif(df):
d={}
for i in range(df.shape[1]):
d[df.columns[i]]=variance_inflation_factor(df.values,i)
return d
get_vif(data.drop(columns = ['target']))
{'age': 7.9129385711541875, 'sex': 2.008858925319928, 'bmi': 11.045882593807452, 'children': 1.8055690975035243, 'smoker': 1.258195185615136, 'region': 2.9348837902311247}
From the above heatmap we can conclude that there is no multicolinearity between the varibles Also except var bmi the vif of our variables are low
data['charges'].hist()
<AxesSubplot:>
Even after removing the ouliers from the variable charges, charges column is skewed. So deal with that we will be transforming the column charges using log transformation
data['target']=np.log(data.charges)
data = data.drop(columns = ['charges'] )
x=data.drop(columns=['target'])
y=data['target']
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=5)
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(xtrain,ytrain)
ypred=model.predict(xtest)
# Evaluation of Linear Regression Model
print('\033[1m' + 'Linear Regression model evaluation'+'\033[0m')
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
mse=mean_squared_error(ytest,ypred)
print('mse',mse)
rmse=np.sqrt(mse)
print('rmse',rmse)
mae=mean_absolute_error(ytest,ypred)
print('mae',mae)
score=r2_score(ytest,ypred)
print('r2_score',score)
Linear Regression model evaluation
mse 0.1577524445033292
rmse 0.3971806194961295
mae 0.2742527051726538
r2_score 0.809458802076108
data.head()
age | sex | bmi | children | smoker | region | target | |
---|---|---|---|---|---|---|---|
0 | 19 | 0 | 27.900 | 0 | 1 | 3 | 9.734176 |
1 | 18 | 1 | 33.770 | 1 | 0 | 2 | 7.471792 |
2 | 28 | 1 | 33.000 | 3 | 0 | 2 | 8.400538 |
3 | 33 | 1 | 22.705 | 0 | 0 | 1 | 9.998092 |
4 | 32 | 1 | 28.880 | 0 | 0 | 1 | 8.260197 |
As we can see that the bmi and age values are in two digits where as other variables are single digits which can be missleading for the model as it may weigh more to the variables with large even if they are not that important compared to variables with small values.
So to deal with that we will do feature scaling. Wich will convert data in same range. for feature sscaling we are using standard scalar
# Scalling the data with StandardScaler
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
scaled_xtrain=sc.fit_transform(xtrain)
scaled_xtest=sc.fit_transform(xtest)
# fit and train the model with scaled data
model=LinearRegression()
model.fit(scaled_xtrain,ytrain)
ypred=model.predict(scaled_xtest)
# evaluation of model
print('\033[1m' + 'Linear Regression model evaluation for scaled data'+'\033[0m')
linear_mse=mean_squared_error(ytest,ypred)
print("MSE:",linear_mse)
linear_rmse = np.sqrt(linear_mse)
print("RMSE:",linear_rmse)
linear_mae=mean_absolute_error(ytest,ypred)
print("MAE:",linear_mae)
linear_score=r2_score(ytest,ypred)
print("R-squared :",linear_score)
Linear Regression model evaluation for scaled data
MSE: 0.17465206334650324
RMSE: 0.4179139425126939
MAE: 0.31105894981226717
R-squared : 0.7890466073302599
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
poly_reg.fit(xtrain)
x_train_poly=poly_reg.transform(xtrain)
x_test_poly=poly_reg.transform(xtest)
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train_poly,ytrain)
y_pred=lr.predict(x_test_poly)
# Evaluation of Polynomial Regression model
print('\033[1m' + 'Polynomial Regression model evaluation'+'\033[0m')
poly_mse=mean_squared_error(ytest,y_pred)
print("MSE:",poly_mse)
poly_rmse = np.sqrt(poly_mse)
print("RMSE:",poly_rmse)
poly_mae=mean_absolute_error(ytest,y_pred)
print("MAE:",poly_mae)
poly_score=r2_score(ytest,y_pred)
print("R-squared :",poly_score)
Polynomial Regression model evaluation
MSE: 0.09710062938921961
RMSE: 0.3116097389190839
MAE: 0.18443064688024632
R-squared : 0.8827170615248671
from sklearn.linear_model import Ridge
ridge=Ridge(alpha=0.3)
ridge.fit(xtrain,ytrain)
rypred=ridge.predict(xtest)
# Evaluation of the Ridge model
print('\033[1m' + 'Ridge Regression model Evaluation'+'\033[0m')
ridge_mse=mean_squared_error(ytest,rypred)
print('mse',ridge_mse)
ridge_rmse=np.sqrt(ridge_mse)
print('rmse',ridge_rmse)
ridge_mae=mean_absolute_error(ytest,rypred)
print('mae',ridge_mae)
ridge_score=r2_score(ytest,rypred)
print('r2_score',ridge_score)
Ridge Regression model Evaluation
mse 0.15784565986426033
rmse 0.39729794847728617
mae 0.27440842992749637
r2_score 0.80934621195687
from sklearn.linear_model import Lasso
lasso=Lasso(alpha=0.0001)
lasso.fit(xtrain,ytrain)
lypred=lasso.predict(xtest)
# Evaluation of the Lasso Regression model
print('\033[1m' + 'Lasso Regression Model Evaluation'+'\033[0m')
lasso_mse=mean_squared_error(ytest,lypred)
print('mse',lasso_mse)
lasso_rmse=np.sqrt(lasso_mse)
print('rmse',lasso_rmse)
lasso_mae=mean_absolute_error(ytest,lypred)
print('mae',lasso_mae)
lasso_score=r2_score(ytest,lypred)
print('r2_score',lasso_score)
Lasso Regression Model Evaluation
mse 0.15776660307550305
rmse 0.3971984429419419
mae 0.2742557369031013
r2_score 0.8094417006529803
# function for getting best alpha value
from sklearn.linear_model import ElasticNet
alpha=[0.0001,0.001,0.01,0.1,0.3,0.5,1,10]
scores={}
def get_best_alpha(alpha):
for i in alpha:
model=ElasticNet(alpha=i)
model.fit(xtrain,ytrain)
ypred=model.predict(xtest)
elastic_mse=mean_squared_error(ytest,ypred)
elastic_rmse=np.sqrt(elastic_mse)
elastic_mae=mean_absolute_error(ytest,ypred)
scores[i]=model.score(xtest,ytest)
#print("For Alpha = {} | R-square :{} MSE :{} RMSE :{} MAE:{} ".format(i,model.score(xtest,ytest),mse,rmse,mae))
return max(scores, key= lambda x: scores[x])
# Best alpha value
print("Best Alpha is :",get_best_alpha(alpha))
print('R_square of the model is :',scores[0.0001])
Best Alpha is : 0.0001 R_square of the model is : 0.8094304520725136
# fitting model with best alpha value
model= ElasticNet(alpha=0.0001)
model.fit(xtrain,ytrain)
elypred=model.predict(xtest)
# evaluation of ElasticNet model
print('\033[1m' + 'ElasticNet Regression Model'+'\033[0m')
elastic_mse=mean_squared_error(ytest,elypred)
elastic_rmse=np.sqrt(elastic_mse)
elastic_mae=mean_absolute_error(ytest,elypred)
elastic_r2score=r2_score(ytest,elypred)
print('mse',elastic_mse)
print('rmse',elastic_rmse)
print('mae',elastic_mae)
print('r2_score',elastic_r2score)
ElasticNet Regression Model
mse 0.15777591597520732
rmse 0.3972101660018375
mae 0.2742817119107097
r2_score 0.8094304520725136
# Fitting model for ElasticNetCv
from sklearn.linear_model import ElasticNetCV
model=ElasticNetCV(alphas=alpha,cv=5)
model.fit(xtrain,ytrain)
ecv_ypred=model.predict(xtest)
# evaluation of ElasticNetCv model
print('\033[1m' + 'ElasticNetCV Regression model'+'\033[0m')
ecv_mse=mean_squared_error(ytest,ecv_ypred)
ecv_rmse=np.sqrt(ecv_mse)
ecv_mae=mean_absolute_error(ytest,ecv_ypred)
ecv_r2score=r2_score(ytest,ecv_ypred)
print('mse',ecv_mse)
print('rmse',ecv_rmse)
print('mae',ecv_mae)
print('r2_score',ecv_r2score)
ElasticNetCV Regression model
mse 0.1579988794397903
rmse 0.39749072874696123
mae 0.2745472466643872
r2_score 0.8091611457820872
models = ['LinearRegression','Polynomial','Ridge','Lasso','ElasticNet','ElasticNetCV']
rsquare = [score,poly_score,ridge_score,lasso_score,elastic_r2score,ecv_r2score]
mse = [mse,poly_mse,ridge_mse,lasso_mse,elastic_mse,ecv_mse]
rmse = [rmse,poly_rmse,ridge_rmse,lasso_rmse,elastic_rmse,ecv_rmse]
mae = [mae,poly_mae,ridge_mae,lasso_mae,elastic_mae,ecv_mae]
Evaluation = pd.DataFrame({'Model':models,'R-square':rsquare,'MSE':mse,'RMSE':rmse,'MAE':mae})
Evaluation.style.highlight_max(subset = ['R-square'],color = 'lightgreen')
Model | R-square | MSE | RMSE | MAE | |
---|---|---|---|---|---|
0 | LinearRegression | 0.809459 | 0.157752 | 0.397181 | 0.274253 |
1 | Polynomial | 0.882717 | 0.097101 | 0.311610 | 0.184431 |
2 | Ridge | 0.809346 | 0.157846 | 0.397298 | 0.274408 |
3 | Lasso | 0.809442 | 0.157767 | 0.397198 | 0.274256 |
4 | ElasticNet | 0.809430 | 0.157776 | 0.397210 | 0.274282 |
5 | ElasticNetCV | 0.809161 | 0.157999 | 0.397491 | 0.274547 |
Evaluation.sort_values(by='R-square',ascending=False,inplace=True)
plt.figure(figsize = (10,7))
sns.barplot(x='R-square', y='Model',data=Evaluation)
plt.xlabel('R-Square')
plt.ylabel('Models')
plt.title('Evaluation of models')
plt.xlim(0.7,1)
(0.7, 1.0)
Considering R2 score and other parameter it is observed that Polynomial regression for 2nd degree turned out to be suitable to predict the issurance charges.¶
- Get link
- X
- Other Apps
- Get link
- X
- Other Apps