
Regression algorithm

 

1. Introduction

The venerable insurance industry is no stranger to data-driven decision making. Yet in today's rapidly transforming digital landscape, insurance is struggling to adapt to and benefit from new technologies compared to other industries. Extremely complex underwriting rule-sets that differ radically across product lines, many non-KYC environments lacking a centralized customer information base, a complex relationship with consumers in traditional risk underwriting where customer centricity sometimes runs counter to business profit, and the inertia of regulatory compliance: these are some of the unique challenges faced by the insurance business.

Insurance premiums vary from person to person. Insurance companies decide the charges for a particular person based on the probability that an event covered by the health insurance policy will occur. To avoid losses, companies take the personal attributes of the person into account when estimating that probability. In this project we want to predict insurance charges using data which includes attributes of the people and the charges applied to them.

1.1 Brief description of the attributes in the dataset

1) age: age of the primary beneficiary

2) sex: gender of the beneficiary (male, female)

3) bmi: body mass index, an objective index of body weight relative to height (kg/m²), computed as weight divided by the square of height; ideally 18.5 to 24.9

4) children: number of children covered by health insurance / number of dependents

5) smoker: whether the beneficiary is a smoker or not

6) region: the beneficiary's residential area in the US (northeast, southeast, northwest, southwest)

7) charges: individual medical charges billed by health insurance.

2. Problem Statement

The goal of this project is to predict the charges based on attributes of the person such as age, gender, bmi, the region he/she belongs to, the number of children, and whether he/she is a smoker.

3. Importing Libraries and Loading the Dataset

In [1]:
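The code cells are not preserved in this export; a minimal sketch of the imports a notebook like this typically relies on (the library choices are assumptions, not the author's exact cell):

    # Assumed reconstruction of the imports cell.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns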

Loading the Dataset

In [2]:
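A sketch of the loading step, assuming the standard medical-cost file name insurance.csv (the actual path is not shown in the post):

    # The file name is an assumption; the post does not show the real path.
    df = pd.read_csv('insurance.csv')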

4. Exploratory Data Analysis (EDA)

EDA is a step in the data analysis process where a number of techniques are used to understand the dataset better.
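The next two cells presumably take a first look at the data; a sketch under that assumption:

    df.head()   # first five rows (Out[3])
    df.info()   # dtypes and non-null counts (printed below In [4])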

In [3]:
Out[3]:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
In [4]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

We can see that variables such as sex, smoker, and region are of object type while the other variables are numerical, so we need to convert the object-type variables into a numerical type.

4.1 Checking for the Presence of Null Values

In [5]:
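A sketch of the null check, together with the summary-statistics cell that follows (both reconstructed, not the author's exact code):

    df.isnull().sum()   # missing values per column (Out[5])
    df.describe()       # spread of the numeric columns (Out[6])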
Out[5]:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
In [6]:
Out[6]:
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010

The describe method helps us see how the data is spread for the numerical variables; we can clearly see the minimum value, mean, maximum value, and percentile values.

4.2 Data Visualization

4.2.1 Histograms

Histograms are one of the most common graphs used to display the distribution of numeric data; they show whether the data is normally distributed or skewed to the right or left.
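A sketch of how the histograms in In [7] might be drawn; the column list and bin count are assumptions:

    # Distribution of each numeric column.
    df[['age', 'bmi', 'children', 'charges']].hist(figsize=(10, 8), bins=30)
    plt.tight_layout()
    plt.show()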

In [7]:

4.2.2 Scatter Plot

A scatterplot is a type of graph which plots the values of two variables along two axes, such as age and height. Scatterplots are useful for many reasons, for example when building correlation matrices; they allow you to quickly understand the relationship between two variables.
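A sketch of the two scatterplots below, reconstructed from the titles visible in Out[8] and Out[9]:

    # Age vs charges (In [8])
    plt.scatter(df['age'], df['charges'], alpha=0.5)
    plt.xlabel('age')
    plt.ylabel('charges')
    plt.title('AGE vs Charges')
    plt.show()

    # BMI vs charges (In [9])
    plt.scatter(df['bmi'], df['charges'], alpha=0.5)
    plt.xlabel('bmi')
    plt.ylabel('charges')
    plt.title('BMI vs Charges')
    plt.show()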

In [8]:
Out[8]:
Text(0.5, 1.0, 'AGE vs Charges')
In [9]:
Out[9]:
Text(0.5, 1.0, 'BMI vs Charges')

4.2.3 Checking whether Outliers Are Present
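Boxplots are the usual check here; a sketch assuming cells In [10] and In [11] draw them for charges and bmi:

    sns.boxplot(x=df['charges'])   # points beyond the whiskers are outliers (Out[10])
    plt.show()
    sns.boxplot(x=df['bmi'])       # Out[11]
    plt.show()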

In [10]:
Out[10]:
<AxesSubplot:>
In [11]:
Out[11]:
<AxesSubplot:>

We would normally remove the outliers, but we have a limited number of data points, so rather than removing them we will use the clipping method.
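A minimal sketch of quantile clipping; the 5th/95th percentile bounds are an assumption, as the post does not state the exact thresholds:

    # Pull extreme values in to the quantile bounds instead of dropping rows.
    for col in ['bmi', 'charges']:
        lower, upper = df[col].quantile(0.05), df[col].quantile(0.95)
        df[col] = df[col].clip(lower, upper)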

In [12]:
In [13]:
In [14]:
Out[14]:
<AxesSubplot:>
In [15]:
Out[15]:
<AxesSubplot:>

From the above plots we can see that the outliers have been removed.

5. Label Encoding

We need to do label encoding to convert the object-type variables into a numerical type so that we can use those variables in the model.
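A sketch using scikit-learn's LabelEncoder; the codes visible in Out[17] (e.g. smoker yes → 1, region southwest → 3) are consistent with its alphabetical integer encoding:

    from sklearn.preprocessing import LabelEncoder

    # Replace each object-type column with integer codes.
    le = LabelEncoder()
    for col in ['sex', 'smoker', 'region']:
        df[col] = le.fit_transform(df[col])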

In [16]:
In [17]:
Out[17]:
   age  sex     bmi  children  smoker  region      charges
0   19    0  27.900         0       1       3  16884.92400
1   18    1  33.770         1       0       2   1757.75340
2   28    1  33.000         3       0       2   4449.46200
3   33    1  22.705         0       0       1  21984.47061
4   32    1  28.880         0       0       1   3866.85520

Now all the variables are of numerical type.

6. Assumptions of Linear Regression

6.1 Linear Relationship between the independent variable x and the dependent variable y

Relationship between age and charges
In [18]:
Out[18]:
<AxesSubplot:xlabel='age', ylabel='charges'>

Variables age and charges are linearly associated with each other

6.2. Homoscedasticity

In [77]:
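A sketch of the residuals-versus-fitted plot; judging by the cell numbers (In [77] onward), these assumption checks were run after the train/test split of section 8, so the split variables are assumed to already exist:

    from sklearn.linear_model import LinearRegression

    # Fit a quick model only to obtain residuals.
    lr = LinearRegression().fit(X_train, y_train)
    fitted = lr.predict(X_train)
    residuals = y_train - fitted

    plt.scatter(fitted, residuals, alpha=0.5)
    plt.axhline(y=0, color='red')
    plt.xlabel('Fitted values')
    plt.ylabel('Residuals')   # matches Out[77]
    plt.show()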
Out[77]:
Text(0, 0.5, 'Residuals')

From the above plot we can conclude that we do not have homoscedasticity in the data.

6.3 Normality of Residuals

In [78]:
Out[78]:
<AxesSubplot:ylabel='Count'>

We can see that the residuals have a normal distribution.

6.4. Mean of Residuals

In [79]:
Mean of Residual is : -0.06781207178145235

We can see that the mean of the residuals is approximately 0.

6.5 No Autocorrelation

In [80]:
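The ACF plot is presumably statsmodels' plot_acf applied to the residuals from the sketch in 6.2:

    from statsmodels.graphics.tsaplots import plot_acf

    # No spikes outside the confidence band implies no autocorrelation.
    plot_acf(residuals)
    plt.show()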

From the above ACF plot we can conclude that there is no autocorrelation in the data.

6.6 No Perfect Multicollinearity

In [81]:
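A sketch of both checks: the correlation heatmap (In [81]) and the VIF values (In [82], whose dict output follows). Computing VIF via statsmodels is an assumption:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    features = df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]

    # Pairwise correlations (In [81]).
    sns.heatmap(features.corr(), annot=True)
    plt.show()

    # Variance inflation factor per feature (In [82]).
    vif = {col: variance_inflation_factor(features.values, i)
           for i, col in enumerate(features.columns)}
    print(vif)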
Out[81]:
<AxesSubplot:>
In [82]:
Out[82]:
{'age': 7.9129385711541875,
 'sex': 2.008858925319928,
 'bmi': 11.045882593807452,
 'children': 1.8055690975035243,
 'smoker': 1.258195185615136,
 'region': 2.9348837902311247}

From the above heatmap we can conclude that there is no multicollinearity between the variables. Also, except for the variable bmi, the VIF values of our variables are low.

7. Detect problems within variables

In [19]:
Out[19]:
<AxesSubplot:>

Even after removing the outliers from the variable charges, the charges column is skewed. To deal with that, we will transform the charges column using a log transformation.

Transforming charges using log transformation
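A sketch of the transformation; naming the new column target matches the table shown later in Out[26], and dropping the raw charges column is an assumption:

    # The natural log compresses the long right tail of charges.
    df['target'] = np.log(df['charges'])
    df = df.drop(columns=['charges'])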

In [20]:
In [21]:

8. Dependent and Independent Variables

In [22]:

8.1 Splitting the Data into Train and Test Sets

In [23]:
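A sketch of cells In [22] and In [23]; the test size and random state are assumptions, as the post does not state them:

    from sklearn.model_selection import train_test_split

    X = df.drop(columns=['target'])   # independent variables
    y = df['target']                  # dependent variable

    # 80/20 split; test_size and random_state are assumed values.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)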

9. Regression Models

9.1 Linear Regression

Fit the Linear Regression model

In [24]:
In [25]:
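A sketch of the fit-and-evaluate pattern that produces the metrics below; the same pattern repeats for the other models in this section:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    lr = LinearRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)

    print('mse', mean_squared_error(y_test, y_pred))
    print('rmse', np.sqrt(mean_squared_error(y_test, y_pred)))
    print('mae', mean_absolute_error(y_test, y_pred))
    print('r2_score', r2_score(y_test, y_pred))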
Linear Regression model evaluation
mse 0.1577524445033292
rmse 0.3971806194961295
mae 0.2742527051726538
r2_score 0.809458802076108

After applying the linear regression model to the data we get an R2 score of about 0.8094, and the errors are relatively low.

Feature scaling

In [26]:
Out[26]:
   age  sex     bmi  children  smoker  region    target
0   19    0  27.900         0       1       3  9.734176
1   18    1  33.770         1       0       2  7.471792
2   28    1  33.000         3       0       2  8.400538
3   33    1  22.705         0       0       1  9.998092
4   32    1  28.880         0       0       1  8.260197

As we can see, the bmi and age values are two-digit numbers whereas the other variables are single digits. This can be misleading for the model, as it may give more weight to variables with large values even if they are not more important than variables with small values.

To deal with that we will do feature scaling, which converts the data into the same range. For feature scaling we are using StandardScaler, as sketched below.
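A minimal sketch using scikit-learn's StandardScaler, fit on the training split only (a common convention; the author's exact cell is not shown):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    # Fit on train, then transform both splits to avoid leaking test statistics.
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)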

In [27]:

Fit the linear regression model with scaled data

In [28]:
In [29]:
Linear Regression model evaluation for scaled data
MSE: 0.17465206334650324
RMSE: 0.4179139425126939
MAE: 0.31105894981226717
R-squared : 0.7890466073302599

Even after applying feature scaling our model accuracy did not improve, so we move on to other regression types such as polynomial regression.

9.2 Polynomial Regression

Fit the Polynomial Regression Model

In [30]:
In [31]:
In [32]:
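A sketch of the degree-2 polynomial fit (section 11 states that the 2nd degree was used):

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.preprocessing import PolynomialFeatures

    # Expand the features to all degree-2 terms, then fit an ordinary linear model.
    poly = PolynomialFeatures(degree=2)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    poly_lr = LinearRegression().fit(X_train_poly, y_train)
    print('R-squared :', r2_score(y_test, poly_lr.predict(X_test_poly)))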
Polynomial Regression model evaluation
MSE: 0.09710062938921961
RMSE: 0.3116097389190839
MAE: 0.18443064688024632
R-squared : 0.8827170615248671

Polynomial regression fits really well, and we got an R2 score of 0.88 with relatively low errors.

9.3 Ridge Regression

Fit the Ridge Regression model

In [33]:
In [34]:
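A sketch of the ridge fit; alpha is left at scikit-learn's default since the post does not state the author's value. Lasso in 9.4 follows the same pattern with sklearn.linear_model.Lasso:

    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score

    # alpha=1.0 is the library default; the author's setting is unknown.
    ridge = Ridge(alpha=1.0).fit(X_train, y_train)
    print('r2_score', r2_score(y_test, ridge.predict(X_test)))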
Ridge Regression model Evaluation
mse 0.15784565986426033
rmse 0.39729794847728617
mae 0.27440842992749637
r2_score 0.80934621195687

The R2 score obtained from ridge regression is the same as that of linear regression.

9.4 Lasso Regression Model

Fit the Lasso Regression model

In [35]:
In [36]:
Lasso Regression Model Evaluation
mse 0.15776660307550305
rmse 0.3971984429419419
mae 0.2742557369031013
r2_score 0.8094417006529803

The R2 score obtained from lasso regression is the same as that of linear and ridge regression.

9.5 ElasticNet Regression

Fit the ElasticNet Regression Model

In [37]:
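The "Best Alpha" line suggests a search over candidate alphas; a sketch of one way to do it (the candidate grid is an assumption):

    from sklearn.linear_model import ElasticNet

    # Assumed grid; only the winning value (0.0001) appears in the post.
    alphas = [0.0001, 0.001, 0.01, 0.1, 1, 10]
    best_alpha = max(
        alphas,
        key=lambda a: ElasticNet(alpha=a).fit(X_train, y_train).score(X_test, y_test))
    print('Best Alpha is :', best_alpha)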
Best Alpha is : 0.0001
R_square of the model is : 0.8094304520725136
In [38]:
In [39]:
ElasticNet Regression Model
mse 0.15777591597520732
rmse 0.3972101660018375
mae 0.2742817119107097
r2_score 0.8094304520725136

9.6 ElasticNetCV Regression

In [40]:
In [41]:
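ElasticNetCV instead chooses alpha by built-in cross-validation; a sketch (the number of folds is an assumption):

    from sklearn.linear_model import ElasticNetCV
    from sklearn.metrics import r2_score

    # 5-fold CV over an automatically generated alpha path.
    encv = ElasticNetCV(cv=5).fit(X_train, y_train)
    print('r2_score', r2_score(y_test, encv.predict(X_test)))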
ElasticNetCV Regression model
mse 0.1579988794397903
rmse 0.39749072874696123
mae 0.2745472466643872
r2_score 0.8091611457820872

10. Final Evaluation

In [42]:
In [43]:
Out[43]:
              Model  R-square       MSE      RMSE       MAE
0  LinearRegression  0.809459  0.157752  0.397181  0.274253
1        Polynomial  0.882717  0.097101  0.311610  0.184431
2             Ridge  0.809346  0.157846  0.397298  0.274408
3             Lasso  0.809442  0.157767  0.397198  0.274256
4        ElasticNet  0.809430  0.157776  0.397210  0.274282
5      ElasticNetCV  0.809161  0.157999  0.397491  0.274547
In [55]:
Out[55]:
(0.7, 1.0)

11. Conclusion

Considering the R2 score and other parameters, it is observed that polynomial regression of 2nd degree turned out to be the most suitable for predicting the insurance charges.
