Regression algorithm

1.Introduction

The venerable insurance industry is no stranger to data driven decision making. Yet in today's rapidly transforming digital landscape, Insurance is struggling to adapt and benefit from new technologies compared to other industries. Extremely complex underwriting rule-sets that are radically different in different product lines, many non-KYC environments with a lack of centralized customer information base, complex relationship with consumers in traditional risk underwriting where sometimes customer centricity runs reverse to business profit, inertia of regulatory compliance - are some of the unique challenges faced by Insurance Business.

Inurance premium vary over person to person. Insurance companies decide the charges for a perticular person based on the probability of happening of an event prescribed in the helth insurance. To avoid loss companies take the account of personal attributes of the person to calculate the probability of happening of an event. In this project we want to predict the insurance charges using data which includes attributes of the people and charges applied to them

1.1 Brief discription of the attributes in the dataset

1) age: age of primary beneficiary

2) sex: gender of the beneficiary (male,female)

3) bmi: Body mass index, providing an understanding of body,weights that are relatively high or low relative to height, objective index of body weight(kg/m) using the ratio of height to weight, ideally 18.5 to 24.9

4) children: Number of children covered by health insurance/ Number of dependent

5) smoker: if the beneficiary is smoker or not

6) region: the beneficiary's residential area in the US,northeast,southeast,northwest,southwest

7) charges: individual medical premium billed by health insurance.

1.2 Dataset source :

Dataset url: https://www.kaggle.com/awaiskaggler/insurance-csv/metadata

2. Problem Statement

The goal of this project is to predict the charges based on the attributes related to the person such as age, gender, bmi, region he belongs to, number of children and if he/she is a smoker.

3. Importing Libraries and Load Dataset

In [1]:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading Dataset

In [2]:

df = pd.read_csv('C:\\Users\\Avinash Godbole\\Documents\\Dataset\\issurance.csv')

4. Exploratory Data Analysis (EDA)

EDA is a step in data analysis process where a number of techniques are used to understand the dataset better.

In [3]:

df.head()

Out[3]:

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

In [4]:

df.info()  # it gives the total rows, colums and datatype. 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

We can see that varibles such as sex, smoker and region are of object type and other varibles are numerical so we need to convert the varibles with object type into numerical type

4.1 Checking present of Null values

In [5]:

df.isna().sum()

Out[5]:

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [6]:

df.describe()

Out[6]:

	age	bmi	children	charges
count	1338.000000	1338.000000	1338.000000	1338.000000
mean	39.207025	30.663397	1.094918	13270.422265
std	14.049960	6.098187	1.205493	12110.011237
min	18.000000	15.960000	0.000000	1121.873900
25%	27.000000	26.296250	0.000000	4740.287150
50%	39.000000	30.400000	1.000000	9382.033000
75%	51.000000	34.693750	2.000000	16639.912515
max	64.000000	53.130000	5.000000	63770.428010

the described method will help to see how data has been spread for nemerical value we can clearly see the minumum value,mean,maximum value,percentile value.

4.2 Data visulization

4.2.1 Histograms

Histogram are one of the most common graphs used to display numeric data distribution of the data whether the data is normally distributed or if it is skewed right or left

In [7]:

df.hist(bins=10,figsize=(10,10))
plt.show()

4.2.2 Scatter Plot

A scatterplot is a type of graph which plots the values of two variables along two axes, like age and height. Scatterplot are useful for many reason like correlation matrics.it allows you to quickly understand a relationship between two variable.

In [8]:

# Scatterplot for Age vs Charges
plt.figure(figsize=(8,5))
sns.scatterplot(x='age',y='charges',data=df,hue='smoker')
plt.title('AGE vs Charges')

Out[8]:

Text(0.5, 1.0, 'AGE vs Charges')

In [ ]:

In [9]:

# scatterplot BMI vs Charges
plt.figure(figsize=(8,5))
sns.scatterplot(x='bmi',y='charges',data=df,hue='smoker')
plt.title('BMI vs Charges')

Out[9]:

Text(0.5, 1.0, 'BMI vs Charges')

4.2.3 Checking if the outliers are present

In [10]:

df.boxplot()

Out[10]:

<AxesSubplot:>

In [11]:

df.boxplot(['age','bmi'])

Out[11]:

<AxesSubplot:>

We have to remove the outliers but we have limited data points so rather than removing the outliers we will use clipping method

In [12]:

data_clip = df.iloc[:]
columns = ['bmi','charges']
def clipValues(data_clip,lower,upper):
    for col in columns:
        l=data_clip[col].quantile(lower) 
        u=data_clip[col].quantile(upper)
        data_clip[col]=data_clip[col].clip(l,u)
    return data_clip
data=clipValues(data_clip,0.05,0.9)

In [13]:

#cheak the all outliers are removed or not

In [14]:

data.boxplot(['age','bmi'])

Out[14]:

<AxesSubplot:>

In [15]:

df.boxplot(['charges'])

Out[15]:

<AxesSubplot:>

from the above plots we can see that outliers are removed

5. Label Encoding

We need to do label encoding to convert object type variables into numerical type so that we can use those variables in the model

In [16]:

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data["sex"]=le.fit_transform((data['sex']))
data["smoker"]=le.fit_transform((data['smoker']))
data["region"]=le.fit_transform((data['region']))

In [17]:

data.head()

Out[17]:

	age	sex	bmi	children	smoker	region	charges
0	19	0	27.900	0	1	3	16884.92400
1	18	1	33.770	1	0	2	1757.75340
2	28	1	33.000	3	0	2	4449.46200
3	33	1	22.705	0	0	1	21984.47061
4	32	1	28.880	0	0	1	3866.85520

Now all the variables are in the numerical type

6. Assumtions of LinearRegression

6.1 Linear Relationship between the independent x and dependent y

Relationship between age and charges

In [18]:

plt.figure(figsize=(8,5))
sns.scatterplot(x='age',y='charges',data=df,hue='smoker')

Out[18]:

<AxesSubplot:xlabel='age', ylabel='charges'>

Variables age and charges are linearly associated with each other

6.2. Homoscedasticity

In [77]:

plt.scatter(ytest.values,ypred)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")

Out[77]:

Text(0, 0.5, 'Residuals')

From the above plot we can conclude that we dont have homoscedasticity in the data

6.3 Normality of Residuals

In [78]:

residuals = ytest.values - ypred
sns.histplot(residuals,kde=True)

Out[78]:

<AxesSubplot:ylabel='Count'>

We can see that the Residuals have normal distribution

6.4. Mean of Residuals

In [79]:

print("Mean of Residual is :", np.mean(residuals))

Mean of Residual is : -0.06781207178145235

We can see that the residuals have mean = 0

6.5.No Autocorrelation

In [80]:

import statsmodels.api as sm
sm.graphics.tsa.plot_acf(residuals)
plt.show()

From the above acf plot we can conclude that the there is no autocorrelatiion is the data.

6.6 No Perfect Multicolinearity

In [81]:

plt.figure(figsize = (15,8))
sns.heatmap(data.corr(),annot = True)

Out[81]:

<AxesSubplot:>

In [82]:

from statsmodels.stats.outliers_influence import variance_inflation_factor
def get_vif(df):
    d={}
    for i in range(df.shape[1]):
        d[df.columns[i]]=variance_inflation_factor(df.values,i)
    return d
get_vif(data.drop(columns = ['target']))

Out[82]:

{'age': 7.9129385711541875,
 'sex': 2.008858925319928,
 'bmi': 11.045882593807452,
 'children': 1.8055690975035243,
 'smoker': 1.258195185615136,
 'region': 2.9348837902311247}

From the above heatmap we can conclude that there is no multicolinearity between the varibles Also except var bmi the vif of our variables are low

7. Detect problems within variables

In [19]:

data['charges'].hist()

Out[19]:

<AxesSubplot:>

Even after removing the ouliers from the variable charges, charges column is skewed. So deal with that we will be transforming the column charges using log transformation

Transforming charges using log transformation

In [20]:

data['target']=np.log(data.charges)

In [21]:

data = data.drop(columns = ['charges'] )

8. dependent and indepedent variables

In [22]:

x=data.drop(columns=['target'])
y=data['target']

8.1 Splitting data into train and test

In [23]:

from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=5)

9. Regression Models

9.1 Linear Regression

Fit the Linear Regression model

In [24]:

from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(xtrain,ytrain)
ypred=model.predict(xtest)

In [25]:

# Evaluation of Linear Regression Model
print('\033[1m' + 'Linear Regression model evaluation'+'\033[0m')
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
mse=mean_squared_error(ytest,ypred)
print('mse',mse)
rmse=np.sqrt(mse)
print('rmse',rmse)
mae=mean_absolute_error(ytest,ypred)
print('mae',mae)
score=r2_score(ytest,ypred)
print('r2_score',score)

Linear Regression model evaluation
mse 0.1577524445033292
rmse 0.3971806194961295
mae 0.2742527051726538
r2_score 0.809458802076108

After applying the linear regression model to the data we get the R2 score of about 0.8094 and the errors are relatively low

Feature scaling

In [26]:

data.head()

Out[26]:

	age	sex	bmi	children	smoker	region	target
0	19	0	27.900	0	1	3	9.734176
1	18	1	33.770	1	0	2	7.471792
2	28	1	33.000	3	0	2	8.400538
3	33	1	22.705	0	0	1	9.998092
4	32	1	28.880	0	0	1	8.260197

As we can see that the bmi and age values are in two digits where as other variables are single digits which can be missleading for the model as it may weigh more to the variables with large even if they are not that important compared to variables with small values.

So to deal with that we will do feature scaling. Wich will convert data in same range. for feature sscaling we are using standard scalar

In [27]:

# Scalling the data with StandardScaler
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
scaled_xtrain=sc.fit_transform(xtrain)
scaled_xtest=sc.fit_transform(xtest)

Fit the linear regression model with scaled data

In [28]:

# fit and train the model with scaled data
model=LinearRegression()
model.fit(scaled_xtrain,ytrain)
ypred=model.predict(scaled_xtest)

In [29]:

# evaluation of model
print('\033[1m' + 'Linear Regression model evaluation for scaled data'+'\033[0m')
linear_mse=mean_squared_error(ytest,ypred)
print("MSE:",linear_mse)
linear_rmse = np.sqrt(linear_mse)
print("RMSE:",linear_rmse)
linear_mae=mean_absolute_error(ytest,ypred)
print("MAE:",linear_mae)
linear_score=r2_score(ytest,ypred)
print("R-squared :",linear_score)

Linear Regression model evaluation for scaled data
MSE: 0.17465206334650324
RMSE: 0.4179139425126939
MAE: 0.31105894981226717
R-squared : 0.7890466073302599

Even after applying feature scaling our model accuracy didnt change so we go the other regression types such as polynomial regression

9.2 Polynomial Regression

Fit the Polynomial Regression Model

In [30]:

from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
poly_reg.fit(xtrain)
x_train_poly=poly_reg.transform(xtrain)
x_test_poly=poly_reg.transform(xtest)

In [31]:

from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train_poly,ytrain)
y_pred=lr.predict(x_test_poly)

In [32]:

# Evaluation of Polynomial Regression model
print('\033[1m' + 'Polynomial Regression model evaluation'+'\033[0m')
poly_mse=mean_squared_error(ytest,y_pred)
print("MSE:",poly_mse)
poly_rmse = np.sqrt(poly_mse)
print("RMSE:",poly_rmse)
poly_mae=mean_absolute_error(ytest,y_pred)
print("MAE:",poly_mae)
poly_score=r2_score(ytest,y_pred)
print("R-squared :",poly_score)

Polynomial Regression model evaluation
MSE: 0.09710062938921961
RMSE: 0.3116097389190839
MAE: 0.18443064688024632
R-squared : 0.8827170615248671

Polynomial regression fits really well and we got the R2 score of 0.88 with relatively low errors

9.3 Ridge Regression

Fit the Ridge Regression model

In [33]:

from sklearn.linear_model import Ridge
ridge=Ridge(alpha=0.3)
ridge.fit(xtrain,ytrain)
rypred=ridge.predict(xtest)

In [34]:

# Evaluation of the Ridge model
print('\033[1m' + 'Ridge Regression model Evaluation'+'\033[0m')
ridge_mse=mean_squared_error(ytest,rypred)
print('mse',ridge_mse)
ridge_rmse=np.sqrt(ridge_mse)
print('rmse',ridge_rmse)
ridge_mae=mean_absolute_error(ytest,rypred)
print('mae',ridge_mae)
ridge_score=r2_score(ytest,rypred)
print('r2_score',ridge_score)

Ridge Regression model Evaluation
mse 0.15784565986426033
rmse 0.39729794847728617
mae 0.27440842992749637
r2_score 0.80934621195687

R2 score obtained from Ridge regression is same as linear regression

9.4 Lasso Regression Model

fit the Lasso Regression model

In [35]:

from sklearn.linear_model import Lasso
lasso=Lasso(alpha=0.0001)
lasso.fit(xtrain,ytrain)
lypred=lasso.predict(xtest)

In [36]:

# Evaluation of the Lasso Regression model
print('\033[1m' + 'Lasso Regression Model Evaluation'+'\033[0m')
lasso_mse=mean_squared_error(ytest,lypred)
print('mse',lasso_mse)
lasso_rmse=np.sqrt(lasso_mse)
print('rmse',lasso_rmse)
lasso_mae=mean_absolute_error(ytest,lypred)
print('mae',lasso_mae)
lasso_score=r2_score(ytest,lypred)
print('r2_score',lasso_score)

Lasso Regression Model Evaluation
mse 0.15776660307550305
rmse 0.3971984429419419
mae 0.2742557369031013
r2_score 0.8094417006529803

R2 score obtained from lasso regression is same as linear regression and ridge regression

9.5 ElasticNet Regression

Fit the ElasticNet Regression Model

In [ ]:

In [37]:

# function for getting best alpha value
​
from sklearn.linear_model import ElasticNet
alpha=[0.0001,0.001,0.01,0.1,0.3,0.5,1,10]
scores={}
def get_best_alpha(alpha):
    for i in alpha:
        model=ElasticNet(alpha=i)
        model.fit(xtrain,ytrain)
        ypred=model.predict(xtest)
        elastic_mse=mean_squared_error(ytest,ypred)
        elastic_rmse=np.sqrt(elastic_mse)
        elastic_mae=mean_absolute_error(ytest,ypred)
        scores[i]=model.score(xtest,ytest)
        #print("For Alpha = {} | R-square :{} MSE :{} RMSE :{} MAE:{} ".format(i,model.score(xtest,ytest),mse,rmse,mae))
    return max(scores, key= lambda x: scores[x])
​
# Best alpha value
print("Best Alpha is :",get_best_alpha(alpha))
print('R_square of the model is :',scores[0.0001])

Best Alpha is : 0.0001
R_square of the model is : 0.8094304520725136

In [38]:

# fitting model with best alpha value
model= ElasticNet(alpha=0.0001)
model.fit(xtrain,ytrain)
elypred=model.predict(xtest)

In [39]:

# evaluation of ElasticNet model
print('\033[1m' + 'ElasticNet Regression Model'+'\033[0m')
elastic_mse=mean_squared_error(ytest,elypred)
elastic_rmse=np.sqrt(elastic_mse)
elastic_mae=mean_absolute_error(ytest,elypred)
elastic_r2score=r2_score(ytest,elypred)
print('mse',elastic_mse)
print('rmse',elastic_rmse)
print('mae',elastic_mae)
print('r2_score',elastic_r2score) 

ElasticNet Regression Model
mse 0.15777591597520732
rmse 0.3972101660018375
mae 0.2742817119107097
r2_score 0.8094304520725136

9.6 ElasticNetCv Regression

In [40]:

# Fitting model for ElasticNetCv
from sklearn.linear_model import ElasticNetCV
model=ElasticNetCV(alphas=alpha,cv=5)
model.fit(xtrain,ytrain)
ecv_ypred=model.predict(xtest)

In [41]:

# evaluation of ElasticNetCv model
print('\033[1m' + 'ElasticNetCV Regression model'+'\033[0m')
ecv_mse=mean_squared_error(ytest,ecv_ypred)
ecv_rmse=np.sqrt(ecv_mse)
ecv_mae=mean_absolute_error(ytest,ecv_ypred)
ecv_r2score=r2_score(ytest,ecv_ypred)
print('mse',ecv_mse)
print('rmse',ecv_rmse)
print('mae',ecv_mae)
print('r2_score',ecv_r2score)

ElasticNetCV Regression model
mse 0.1579988794397903
rmse 0.39749072874696123
mae 0.2745472466643872
r2_score 0.8091611457820872

10. Final Evaluation

In [42]:

models = ['LinearRegression','Polynomial','Ridge','Lasso','ElasticNet','ElasticNetCV']
rsquare = [score,poly_score,ridge_score,lasso_score,elastic_r2score,ecv_r2score]
mse = [mse,poly_mse,ridge_mse,lasso_mse,elastic_mse,ecv_mse]
rmse = [rmse,poly_rmse,ridge_rmse,lasso_rmse,elastic_rmse,ecv_rmse]
mae = [mae,poly_mae,ridge_mae,lasso_mae,elastic_mae,ecv_mae]
Evaluation = pd.DataFrame({'Model':models,'R-square':rsquare,'MSE':mse,'RMSE':rmse,'MAE':mae})

In [43]:

Evaluation.style.highlight_max(subset = ['R-square'],color = 'lightgreen')

Out[43]:

	Model	R-square	MSE	RMSE	MAE
0	LinearRegression	0.809459	0.157752	0.397181	0.274253
1	Polynomial	0.882717	0.097101	0.311610	0.184431
2	Ridge	0.809346	0.157846	0.397298	0.274408
3	Lasso	0.809442	0.157767	0.397198	0.274256
4	ElasticNet	0.809430	0.157776	0.397210	0.274282
5	ElasticNetCV	0.809161	0.157999	0.397491	0.274547

In [55]:

Evaluation.sort_values(by='R-square',ascending=False,inplace=True)
plt.figure(figsize = (10,7))
sns.barplot(x='R-square', y='Model',data=Evaluation)
plt.xlabel('R-Square')
plt.ylabel('Models')
plt.title('Evaluation of models')
plt.xlim(0.7,1)

Out[55]:

(0.7, 1.0)

11. Conclusion

Considering R2 score and other parameter it is observed that Polynomial regression for 2nd degree turned out to be suitable to predict the issurance charges.¶

aviator coding