Skip to main content

Classification algorithm

 

Introduction

Email spam, also referred to as junk email or simply spam, is unsolicited messages sent in bulk by email. Email spam has steadily grown since the early 1990s, and by 2014 was estimated to account for around 90% of total email traffic.Most email spam messages are commercial in nature. Whether commercial or not, many are not only annoying as a form of attention theft, but also dangerous because they may contain links that lead to phishing web sites or sites that are hosting malware or include malware as file attachments. That's why its necessary to filter spam messages/mails to protect user. In this project we are going to use different classification algorithms to classify emails as spam or not spam. For that we have different features extracted from emails, these features include percentage of words, characters etc.

Variables discription

In [3]:
In [4]:
Out[4]:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657
Variablesword_freq_make:word_freq_address:word_freq_all:word_freq_3d:word_freq_our:word_freq_over:word_freq_remove:word_freq_internet:word_freq_order:word_freq_mail:word_freq_receive:word_freq_will:word_freq_people:word_freq_report:word_freq_addresses:word_freq_free:word_freq_business:word_freq_email:word_freq_you:word_freq_credit:word_freq_your:word_freq_font:word_freq_000:word_freq_money:word_freq_hp:word_freq_hpl:word_freq_george:word_freq_650:word_freq_lab:word_freq_labs:word_freq_telnet:word_freq_857:word_freq_data:word_freq_415:word_freq_85:word_freq_technology:word_freq_1999:word_freq_parts:word_freq_pm:word_freq_direct:word_freq_cs:word_freq_meeting:word_freq_original:word_freq_project:word_freq_re:word_freq_edu:word_freq_table:word_freq_conference:char_freq_;:char_freq_(:char_freq_[:char_freq_!:char_freq_$:char_freq_#:capital_run_length_average:capital_run_length_longest:capital_run_length_total:
typecontinuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.continuous.
  • Here first 48 variables are percentage of the words in the email.

    100*(Number of times word appears in the email / Total number of words in the email)

  • The second 6 variables are percentage of the symboles in the email.

    100*(Number of times character appears in the email / Total number of characters in the email)

  • capital_run_length_average defines the average length uninterrupted sequence of the capital letters.

  • capital_run_length_longest defines the length of longest uninterrupted sequence of the capital letters.

  • capital_run_length_total is the sum of lengths of uninterrupted sequence of the capital letters.

  • Class is the target variable [0: not spam,1: spam]

Columns are just the frequency of given words occured in the email and we predict the email is spam or not based on these columns

Objective :

To predict the email is spam or not using different features extracted from the emails

Importing required libraries

In [2]:
In [66]:
In [67]:
Out[67]:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758
00.210.280.500.00.140.280.210.070.000.940.210.790.650.210.140.140.070.283.470.001.590.00.430.430.00.00.00.00.00.00.00.00.00.00.00.00.070.00.00.000.00.00.000.00.000.000.00.00.000.1320.00.3720.1800.0485.11410110281
10.060.000.710.01.230.190.190.120.640.250.380.450.120.001.750.060.061.031.360.320.510.01.160.060.00.00.00.00.00.00.00.00.00.00.00.00.000.00.00.060.00.00.120.00.060.060.00.00.010.1430.00.2760.1840.0109.82148522591
20.000.000.000.00.630.000.310.630.310.630.310.310.310.000.000.310.000.003.180.000.310.00.000.000.00.00.00.00.00.00.00.00.00.00.00.00.000.00.00.000.00.00.000.00.000.000.00.00.000.1370.00.1370.0000.0003.537401911
30.000.000.000.00.630.000.310.630.310.630.310.310.310.000.000.310.000.003.180.000.310.00.000.000.00.00.00.00.00.00.00.00.00.00.00.00.000.00.00.000.00.00.000.00.000.000.00.00.000.1350.00.1350.0000.0003.537401911
40.000.000.000.01.850.000.001.850.000.000.000.000.000.000.000.000.000.000.000.000.000.00.000.000.00.00.00.00.00.00.00.00.00.00.00.00.000.00.00.000.00.00.000.00.000.000.00.00.000.2230.00.0000.0000.0003.00015541
In [8]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 58 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   make                        4600 non-null   float64
 1   address                     4600 non-null   float64
 2   all                         4600 non-null   float64
 3   3d                          4600 non-null   float64
 4   our                         4600 non-null   float64
 5   over                        4600 non-null   float64
 6   remove                      4600 non-null   float64
 7   internet                    4600 non-null   float64
 8   order                       4600 non-null   float64
 9   mail                        4600 non-null   float64
 10  receive                     4600 non-null   float64
 11  will                        4600 non-null   float64
 12  people                      4600 non-null   float64
 13  report                      4600 non-null   float64
 14  addresses                   4600 non-null   float64
 15  free                        4600 non-null   float64
 16  business                    4600 non-null   float64
 17  email                       4600 non-null   float64
 18  you                         4600 non-null   float64
 19  credit                      4600 non-null   float64
 20  your                        4600 non-null   float64
 21  font                        4600 non-null   float64
 22  000                         4600 non-null   float64
 23  money                       4600 non-null   float64
 24  hp                          4600 non-null   float64
 25  hpl                         4600 non-null   float64
 26  george                      4600 non-null   float64
 27  650                         4600 non-null   float64
 28  lab                         4600 non-null   float64
 29  labs                        4600 non-null   float64
 30  telnet                      4600 non-null   float64
 31  857                         4600 non-null   float64
 32  data                        4600 non-null   float64
 33  415                         4600 non-null   float64
 34  85                          4600 non-null   float64
 35  technology                  4600 non-null   float64
 36  1999                        4600 non-null   float64
 37  parts                       4600 non-null   float64
 38  pm                          4600 non-null   float64
 39  direct                      4600 non-null   float64
 40  cs                          4600 non-null   float64
 41  meeting                     4600 non-null   float64
 42  original                    4600 non-null   float64
 43  project                     4600 non-null   float64
 44  re                          4600 non-null   float64
 45  edu                         4600 non-null   float64
 46  table                       4600 non-null   float64
 47  conference                  4600 non-null   float64
 48  char_freq_;                 4600 non-null   float64
 49  char_freq_(                 4600 non-null   float64
 50  char_freq_[                 4600 non-null   float64
 51  char_freq_!                 4600 non-null   float64
 52  char_freq_$                 4600 non-null   float64
 53  char_freq_#                 4600 non-null   float64
 54  capital_run_length_average  4600 non-null   float64
 55  capital_run_length_longest  4600 non-null   int64  
 56  capital_run_length_total    4600 non-null   int64  
 57  Class                       4600 non-null   int64  
dtypes: float64(55), int64(3)
memory usage: 2.0 MB

dtypes: float64(57), int64(1)

In [ ]:
Out[61]:
(4601, 58)

Rows=214 , columns(features)=57

In [ ]:
Out[62]:
0

No null values present in the dataset

EDA plots

In [ ]:

From above bar plot we can see that 'you' world used most of the time in the mail.

In [ ]:

form the above bar plot chatracter '!' used most of the time in the mail

In [ ]:

Here we can see that word "you" is highly correlated with class.

In [ ]:
Out[113]:
Index(['make', 'address', 'all', '3d', 'our', 'over', 'remove', 'internet',
       'order', 'mail', 'receive', 'will', 'people', 'report', 'addresses',
       'free', 'business', 'email', 'you', 'credit', 'your', 'font', '000',
       'money', 'hp', 'hpl', 'george', '650', 'lab', 'labs', 'telnet', '857',
       'data', '415', '85', 'technology', '1999', 'parts', 'pm', 'direct',
       'cs', 'meeting', 'original', 'project', 're', 'edu', 'table',
       'conference', 'char_freq_;', 'char_freq_(', 'char_freq_[',
       'char_freq_!', 'char_freq_$', 'char_freq_#',
       'capital_run_length_average', 'capital_run_length_longest',
       'capital_run_length_total', 'Class'],
      dtype='object')
In [ ]:
Out[148]:
array([1., 0.])
In [ ]:
Out[42]:
makeaddressall3douroverremoveinternetordermail...char_freq_;char_freq_(char_freq_[char_freq_!char_freq_$char_freq_#capital_run_length_averagecapital_run_length_longestcapital_run_length_totalClass
count4600.0000004600.0000004600.0000004600.0000004600.0000004600.0000004600.0000004600.0000004600.0000004600.000000...4600.0000004600.0000004600.0000004600.0000004600.0000004600.0000004600.0000004600.0000004600.0000004600.000000
mean0.0231970.0149100.0551290.0015290.0312850.0163130.0157130.0094800.0171270.013172...0.0087990.0142710.0041610.0083080.0126320.0022310.0043270.0072970.0434720.393913
std0.0686140.0903850.0992400.0325930.0673930.0465730.0538490.0361040.0529740.035468...0.0555300.0277320.0268090.0251750.0409640.0216550.0380990.1059521.2925520.488669
min0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0005360.0005010.0021460.000000
50%0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0066650.0000000.0000000.0000000.0000000.0011590.0014020.0059340.000000
75%0.0000000.0000000.0823530.0000000.0390000.0000000.0000000.0000000.0000000.008801...0.0000000.0192780.0000000.0096990.0086620.0000000.0024580.0042050.0167301.000000
max1.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.000000...1.0000001.0000001.0000001.0000001.0000001.0000001.2500005.00000078.0000001.000000

8 rows × 58 columns

column 56 and 57 need to be scaled as mean and variance of these columns is higher than other columns

In [68]:
Out[68]:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758
00.0462560.0196080.0980390.00.0140.0476190.0288860.0063010.0000000.0517050.0804600.0816960.1171170.0210.0317460.00700.0098040.0308030.1850670.0000000.1431140.00.0788990.03440.00.00.00.00.00.00.00.00.00.00.00.00.010160.00.00.0000000.00.00.0000000.00.0000000.0000000.00.00.0000000.0135360.00.0114540.0299850.0024210.0037350.0100120.0648361.0
10.0132160.0000000.1392160.00.1230.0323130.0261350.0108010.1216730.0137510.1455940.0465360.0216220.0000.3968250.00300.0084030.1133110.0725330.0176020.0459050.00.2128440.00480.00.00.00.00.00.00.00.00.00.00.00.00.000000.00.00.0126050.00.00.0336130.00.0028010.0027210.00.00.0022810.0146640.00.0084980.0306510.0005040.0080080.0484580.1425511.0
20.0000000.0000000.0000000.00.0630.0000000.0426410.0567060.0589350.0346530.1187740.0320580.0558560.0000.0000000.01550.0000000.0000000.1696000.0000000.0279030.00.0000000.00000.00.00.00.00.00.00.00.00.00.00.00.00.000000.00.00.0000000.00.00.0000000.00.0000000.0000000.00.00.0000000.0140480.00.0042180.0000000.0000000.0023030.0039050.0119951.0
In [10]:
Out[10]:
0.0    2788
1.0    1812
Name: Class, dtype: int64
In [ ]:
Out[67]:
<seaborn.axisgrid.FacetGrid at 0x7feab0e48150>

from the above plot both bars not having huge difference between two classes.

In [69]:
In [70]:

models

1) LogesticRegression

In [71]:
In [23]:
Accuracy is : 0.7869565217391304
[[381 182]
 [ 14 343]]
              precision    recall  f1-score   support

         0.0       0.96      0.68      0.80       563
         1.0       0.65      0.96      0.78       357

    accuracy                           0.79       920
   macro avg       0.81      0.82      0.79       920
weighted avg       0.84      0.79      0.79       920

Here by Using Logistic Regression we got the accuracy 0.899022. to get the better accuracy we go with 'Hyper Parameter Tunning'.

hyper parameter tunning of logesticRegression

In [ ]:
0.9243662171083656
{'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}

Here by Hyper Parameter Tunning we got the accuracy 0.924366 with the best parameter

Retraining the logistic regression model on best parameters

In [72]:
In [73]:
Accuracy is : 0.9510869565217391
[[534  29]
 [ 44 313]]
              precision    recall  f1-score   support

         0.0       0.92      0.95      0.94       563
         1.0       0.92      0.88      0.90       357

    accuracy                           0.92       920
   macro avg       0.92      0.91      0.92       920
weighted avg       0.92      0.92      0.92       920

Retraining the logistic regression model on best parameters Accuracy is : 0.8990228013029316

2)Naive Bayes classification

In [74]:
In [75]:
Accuracy is : 0.7869565217391304
              precision    recall  f1-score   support

         0.0       0.96      0.68      0.80       563
         1.0       0.65      0.96      0.78       357

    accuracy                           0.79       920
   macro avg       0.81      0.82      0.79       920
weighted avg       0.84      0.79      0.79       920

Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4e6abffb90>

After the performing Naive Bayes classification we got the accuracy is 0.7926167209554832

3) Support Vector Machines

In [27]:
In [ ]:
Accuracy is:  0.9207383279044516
              precision    recall  f1-score   support

         0.0       0.90      0.98      0.94       564
         1.0       0.96      0.83      0.89       357

    accuracy                           0.92       921
   macro avg       0.93      0.90      0.91       921
weighted avg       0.92      0.92      0.92       921

Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x7feaa01e6110>

After Performing the Support Vector Machines algorithm Accuracy is: 0.9207383279044516

hyper parameter tunning of SVM

In [ ]:
{'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
0.9031702898550723

After the hyper parameter tunning of SVM we get best parameter 'C': 1, 'gamma': 'scale', 'kernel': 'rbf' with accuracy is 0.9031702898550723

Retraining the SVM model on best parameters

In [76]:
In [77]:
Accuracy is:  0.9043478260869565
              precision    recall  f1-score   support

         0.0       0.89      0.96      0.92       563
         1.0       0.93      0.82      0.87       357

    accuracy                           0.90       920
   macro avg       0.91      0.89      0.90       920
weighted avg       0.91      0.90      0.90       920

Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4e6ab97690>

After the Retraining the logistic regression model on best parameters Accuracy is: 0.9207383279044516

4) KNN

In [30]:
In [33]:
Accuracy is:  0.925
              precision    recall  f1-score   support

         0.0       0.92      0.96      0.94       563
         1.0       0.94      0.86      0.90       357

    accuracy                           0.93       920
   macro avg       0.93      0.91      0.92       920
weighted avg       0.93      0.93      0.92       920

Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4e7f2b1790>

using the KNN algorithm Accuracy is: 0.8946796959826275

hyper parameter tunning of KNN

In [78]:
{'metric': 'manhattan', 'n_neighbors': 6, 'weights': 'distance'}
0.9193840579710144

Retraining the KNN model on best parameters

In [79]:
In [80]:
Accuracy is:  0.925
              precision    recall  f1-score   support

         0.0       0.92      0.96      0.94       563
         1.0       0.94      0.86      0.90       357

    accuracy                           0.93       920
   macro avg       0.93      0.91      0.92       920
weighted avg       0.93      0.93      0.92       920

Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4e6a91d850>

Retraining the logistic regression model on best parameters Accuracy is: 0.9163952225841476

5)Decision Tree

In [36]:
In [ ]:
Accuracy is:  0.9174809989142236
              precision    recall  f1-score   support

         0.0       0.93      0.93      0.93       564
         1.0       0.90      0.89      0.89       357

    accuracy                           0.92       921
   macro avg       0.91      0.91      0.91       921
weighted avg       0.92      0.92      0.92       921

Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fea8eb2f310>

Using Decision Tree Algorithm Accuracy is: 0.9174809989142236

hyper parameter tunning of DecisionTree

In [ ]:

Retraining the model on best parameter

In [81]:
In [82]:
Accuracy is:  0.8804347826086957
              precision    recall  f1-score   support

         0.0       0.89      0.91      0.90       563
         1.0       0.86      0.83      0.84       357

    accuracy                           0.88       920
   macro avg       0.88      0.87      0.87       920
weighted avg       0.88      0.88      0.88       920

Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4e6ab97450>

Retraining the model on best parameter Accuracy is: 0.8957654723127035

6) Ensemble Learning

1) Bagging Metaestimator

In [39]:
In [ ]:
accuracy is : 0.9370249728555917
              precision    recall  f1-score   support

         0.0       0.94      0.96      0.95       564
         1.0       0.94      0.90      0.92       357

    accuracy                           0.94       921
   macro avg       0.94      0.93      0.93       921
weighted avg       0.94      0.94      0.94       921

Out[90]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fea8e9f38d0>

Using Bagging Metaestimator algorithm accuracy is : 0.9370249728555917

Hyper parameter tunning of metaestimator

In [ ]:

Retraining the model on best parameters

In [104]:
In [105]:
accuracy is : 0.9543478260869566
              precision    recall  f1-score   support

         0.0       0.96      0.97      0.96       563
         1.0       0.95      0.93      0.94       357

    accuracy                           0.95       920
   macro avg       0.95      0.95      0.95       920
weighted avg       0.95      0.95      0.95       920

After Retraining the model on best parameters accuracy is 0.9467969598262758

ii) RandomForest

In [42]:
In [ ]:
accuracy is : 0.9609120521172638
              precision    recall  f1-score   support

         0.0       0.96      0.98      0.97       564
         1.0       0.97      0.93      0.95       357

    accuracy                           0.96       921
   macro avg       0.96      0.96      0.96       921
weighted avg       0.96      0.96      0.96       921

Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0x7feabce4ccd0>

Using Random Forest Algorithm Accuracy is 0.9609120521172638

hyper parameter tuning of the random forest

In [ ]:
best parameters are : {'criterion': 'entropy', 'max_features': 'log2', 'n_estimators': 1000}
best accuracy is : 0.9520833333333332

By Hyper Parameter tunning of the random forest best Accuracy is best accuracy is : 0.9520833333333332 with best parameter are {'criterion': 'entropy', 'max_features': 'log2', 'n_estimators': 1000}

Retraining the data on best parameters:

In [102]:
In [103]:
Accuracy is:  0.9554347826086956
              precision    recall  f1-score   support

         0.0       0.95      0.98      0.96       563
         1.0       0.96      0.92      0.94       357

    accuracy                           0.96       920
   macro avg       0.96      0.95      0.95       920
weighted avg       0.96      0.96      0.96       920

Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4e6acc2510>

After Retraining the data on best parameters Accuracy is Accuracy is: 0.9598262757871878

7) Boosting

i) Adaboost

In [45]:
In [ ]:
Accuracy is : 0.9348534201954397
              precision    recall  f1-score   support

         0.0       0.95      0.95      0.95       564
         1.0       0.91      0.92      0.92       357

    accuracy                           0.93       921
   macro avg       0.93      0.93      0.93       921
weighted avg       0.93      0.93      0.93       921

After Using Adaboost Algorithm Accuracy is : 0.9348534201954397

In [ ]:

Retraining the Adaboost model on best parameters

In [87]:
In [88]:
Accuracy is : 0.9478260869565217
              precision    recall  f1-score   support

         0.0       0.95      0.96      0.96       563
         1.0       0.94      0.93      0.93       357

    accuracy                           0.95       920
   macro avg       0.95      0.94      0.94       920
weighted avg       0.95      0.95      0.95       920

After Retraining the logistic regression model on best parameters Accuracy is 0.9500542888165038

ii) GradientBoost

In [89]:
In [90]:
Accuracy is : 0.9445652173913044
              precision    recall  f1-score   support

         0.0       0.95      0.96      0.96       563
         1.0       0.94      0.91      0.93       357

    accuracy                           0.94       920
   macro avg       0.94      0.94      0.94       920
weighted avg       0.94      0.94      0.94       920

After using the Gradient Boost Classifier algorithm Accuracy is 0.9511400651465798

iii) XGBoost

In [91]:
In [92]:
Accuracy is : 0.9445652173913044
              precision    recall  f1-score   support

         0.0       0.95      0.96      0.96       563
         1.0       0.94      0.91      0.93       357

    accuracy                           0.94       920
   macro avg       0.94      0.94      0.94       920
weighted avg       0.94      0.94      0.94       920

After Using the XGboost Classifier Algorithm Accuracy is 0.9457111834961998

In [ ]:

Retraining the XGBoost regression model on best parameters

In [93]:
In [94]:
Accuracy is : 0.9456521739130435
              precision    recall  f1-score   support

         0.0       0.95      0.96      0.96       563
         1.0       0.93      0.92      0.93       357

    accuracy                           0.95       920
   macro avg       0.94      0.94      0.94       920
weighted avg       0.95      0.95      0.95       920

After Retraining the logistic regression model on best parameters Accuracy is 0.9565689467969598

7)Voting

In [95]:
In [96]:
In [64]:
Accuracy is : 0.95
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       563
           1       0.93      0.94      0.94       357

    accuracy                           0.95       920
   macro avg       0.95      0.95      0.95       920
weighted avg       0.95      0.95      0.95       920

After Using the Voting Classifier Accuracy is 0.9511400651465798

8) Stacking

In [97]:
In [98]:
Accuracy is  0.9271739130434783
              precision    recall  f1-score   support

         0.0       0.93      0.95      0.94       563
         1.0       0.92      0.89      0.90       357

    accuracy                           0.93       920
   macro avg       0.93      0.92      0.92       920
weighted avg       0.93      0.93      0.93       920

Out[98]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4e6a7c5910>

After using Stacking Algorithm Accuracy is 0.9305103148751357

Evaluation

In [106]:
In [107]:
Out[107]:
No.ModelAccuracyPrecisionRecallfscore
01Logistic Regression0.9206520.9195400.9126200.915793
12Naive Bayes0.7869570.8089450.8187580.786592
23SVM0.9043480.9097730.8885370.896853
34KNN0.9250000.9281270.9136110.919765
45Decision Tree0.8804350.8759420.8710490.873322
56Bagging metaestimator0.9543480.9532690.9504010.951788
67Random forest0.9554350.9562140.9497510.952758
78AdaBoost0.9478260.9459110.9440470.944958
89Gradient Boost0.9445650.9440950.9388210.941299
910XGBoost0.9456520.9436130.9417590.942665
1011Voting classifier0.9500000.9465900.9483860.947467
1112stacking Classifier0.9271740.9251170.9210240.922967
In [101]:
Out[101]:
Text(0.5, 1.0, 'Evaluation of models')

From the Above barplot it seems that Random Forest and Bagging are doing well on the data¶

Popular posts from this blog

deploying Machine learning Model : pkl, Flask,postman

1)Create model and train          #  importing Librarys         import pandas as pd         import numpy as np         import matplotlib . pyplot as plt         import seaborn as sns         import requests         from pickle import dump , load         # Load Dataset         url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"         names = [ "sepal_length" , "sepal_width" , "petal_length" , "petal_width" , "species" ]         # Loading Dataset         df = pd . read_csv ( url , names = names )         df . tail ( 11 )         df . columns         test = [         { 'sepal_length' : 5.1 , 'sepal_width' : 3.5 , 'peta...

Binomial Distribution

  The binomial distribution formula is:                                                    b(x; n, P) =  n C x  * P x  * (1 – P) n – x Where: b = binomial probability x = total number of “successes” (pass or fail, heads or tails etc.) P = probability of a success on an individual trial n = number of trials Note:  The binomial distribution formula can also be written in a slightly different way, because  n C x  = n! / x!(n – x)! (this binomial distribution formula uses factorials  (What is a factorial? ). “q” in this formula is just the probability of failure (subtract your probability of success from 1). Using the First Binomial Distribution Formula The binomial distribution formula can calculate the probability of success for binomial distributions. Often you’ll be told to “plug in” the numbers to the  formul...

cammand for installing library in python

 Command for installing in jupyter notebook:                pip install library_name                ex. pip install nump installing from anaconda prompt:           1. pip install numpy           2.   conda install -c conda-forge matplotlib search for conda command for matplotlib and go to official website. Installing from anaconda navigator easy. Somtime give error then open as administrator