
Naive Bayes
Naive Bayes is a probabilistic classifier that returns the probability of a test point belonging to a class:
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
where Ci denotes a class and X denotes the features of the data point.

Example:
The class C takes the values edible/poisonous (C1 = edible, C2 = poisonous).
The feature 'cap shape' is represented by X, and X can take the values CONVEX, FLAT, BELL, etc.
The probability of a CONVEX mushroom being edible, P(C = edible | X = CONVEX), is given by:
P(X = CONVEX | C = edible) . P(C = edible) / P(X = CONVEX)
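As an illustration, here is a minimal Python sketch of this rule; the posterior function name and its arguments are made up for this example, and the numbers plugged in are the values worked out from Table 1 below.

# A minimal sketch of Bayes' rule as used above. The numbers plugged in
# are the mushroom values worked out from Table 1 below.
def posterior(likelihood, prior, evidence):
    # P(C | X) = P(X | C) * P(C) / P(X)
    return likelihood * prior / evidence

# P(edible | CONVEX) = P(CONVEX | edible) * P(edible) / P(CONVEX)
print(posterior(likelihood=4/8, prior=8/12, evidence=8/12))  # 0.5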

Comprehension - Naive Bayes with One Feature
Table 1: Mushroom Dataset
S.No  Type of mushroom  Cap shape
1.    Poisonous         Convex
2.    Edible            Convex
3.    Poisonous         Convex
4.    Edible            Convex
5.    Edible            Convex
6.    Poisonous         Convex
7.    Edible            Bell
8.    Edible            Bell
9.    Edible            Convex
10.   Poisonous         Convex
11.   Edible            Flat
12.   Edible            Bell

1. What is the probability of a random mushroom being CONVEX, i.e. the value of P(X = CONVEX)? 8/12
2. What is the probability of the mushroom being CONVEX given it is edible?
P(X = CONVEX | C = edible) = 4/8
P(X = CONVEX | C = edible) asks: out of all the edible mushrooms, how many are CONVEX? Out of a total of 8 edible mushrooms, 4 are convex. Thus, it is 4/8.
3. What is the probability that a CONVEX mushroom is edible, P(C = edible | X = CONVEX)?
P(X = CONVEX | C = edible) . P(C = edible) / P(X = CONVEX) = (4/8)(8/12) / (8/12) = 4/8
4. What is the probability of a CONVEX mushroom being poisonous, P(C = poisonous | X = CONVEX)? = 4/8
5. What are the chances of a random mushroom being poisonous, i.e. P(C = poisonous)? 4/12 = 1/3
6. What are the chances of a mushroom being CONVEX given it is poisonous, i.e. P(X = CONVEX | C = poisonous)? 4/4 = 1
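These answers can be reproduced with a short Python sketch; the data list below is just Table 1 transcribed as (type, cap shape) pairs.

# Reproduces the answers above by counting rows of Table 1.
data = [
    ("poisonous", "convex"), ("edible", "convex"), ("poisonous", "convex"),
    ("edible", "convex"), ("edible", "convex"), ("poisonous", "convex"),
    ("edible", "bell"), ("edible", "bell"), ("edible", "convex"),
    ("poisonous", "convex"), ("edible", "flat"), ("edible", "bell"),
]

n = len(data)
edible = [d for d in data if d[0] == "edible"]
poisonous = [d for d in data if d[0] == "poisonous"]

p_convex = sum(1 for d in data if d[1] == "convex") / n                               # 8/12
p_convex_given_edible = sum(1 for d in edible if d[1] == "convex") / len(edible)      # 4/8
p_edible_given_convex = p_convex_given_edible * (len(edible) / n) / p_convex          # 4/8
p_convex_given_poisonous = sum(1 for d in poisonous if d[1] == "convex") / len(poisonous)  # 4/4 = 1
print(p_convex, p_convex_given_edible, p_edible_given_convex, p_convex_given_poisonous)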

Comprehension - Naive Bayes with Multiple Features

Table 2: Mushroom Dataset
No   Type of Mushroom  Cap.shape  Cap.surface
1.   Poisonous         Convex     Scaly
2.   Edible            Convex     Scaly
3.   Poisonous         Convex     Smooth
4.   Edible            Convex     Smooth
5.   Edible            Convex     Fibrous
6.   Poisonous         Convex     Scaly
7.   Edible            Bell       Scaly
8.   Edible            Bell       Scaly
9.   Edible            Convex     Scaly
10.  Poisonous         Convex     Scaly
11.  Edible            Flat       Scaly
12.  Edible            Bell       Smooth

Under the naive (conditional independence) assumption, the likelihood P(X | C) factorizes into a product of per-feature likelihoods.
1. What is the numerator of P(C = edible | X = CONVEX, SCALY)? = P(edible) x P(CONVEX | edible) x P(SCALY | edible)
2. What is the numerator of P(C = edible | X = CONVEX, SMOOTH)? = P(edible) x P(CONVEX | edible) x P(SMOOTH | edible)
3. What is P(CONVEX | edible)? = 4/8
4. What is P(SMOOTH | edible)? = 2/8
5. What is P(CONVEX | poisonous)? = 1
6. What is P(SMOOTH | poisonous)? = 1/4

Prior, Posterior and Likelihood:
You have been using three terms: P(Class), P(X | Class) and P(Class | X). Bayesian classification is based on the principle that 'you combine your prior knowledge or beliefs about a population with the case-specific information to get the actual (posterior) probability'.

Prior: P(Class = edible) or P(Class = poisonous) is called the prior probability.
Likelihood: P(X | Class) is called the likelihood.
Posterior: P(Class = edible | X) is called the posterior probability.

From the above table:
1. The values of P(X | Class) . P(Class), where X = (CONVEX, SCALY), for the two classes (edible and poisonous) are respectively:
Edible:    P(CONVEX | edible) . P(SCALY | edible) . P(edible) = (4/8)(5/8)(8/12) = 20.8%
Poisonous: P(CONVEX | poisonous) . P(SCALY | poisonous) . P(poisonous) = (4/4)(3/4)(4/12) = 25%
2. For the (CONVEX, SCALY) mushroom:
The prior is in favor of edible; the posterior is in favor of poisonous.
The priors are 8/12 and 4/12 for edible and poisonous respectively; the posteriors (numerators) are 20.8% and 25%.
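A short Python sketch that reproduces both numerators from Table 2; the numerator helper is a made-up name for P(X | C) . P(C) under the naive assumption.

# Posterior numerators for X = (CONVEX, SCALY) from Table 2.
# Rows are (type, cap shape, cap surface).
rows = [
    ("poisonous", "convex", "scaly"), ("edible", "convex", "scaly"),
    ("poisonous", "convex", "smooth"), ("edible", "convex", "smooth"),
    ("edible", "convex", "fibrous"), ("poisonous", "convex", "scaly"),
    ("edible", "bell", "scaly"), ("edible", "bell", "scaly"),
    ("edible", "convex", "scaly"), ("poisonous", "convex", "scaly"),
    ("edible", "flat", "scaly"), ("edible", "bell", "smooth"),
]

def numerator(cls, shape, surface):
    # P(C) * P(shape | C) * P(surface | C), estimated by counting
    in_class = [r for r in rows if r[0] == cls]
    prior = len(in_class) / len(rows)
    p_shape = sum(1 for r in in_class if r[1] == shape) / len(in_class)
    p_surface = sum(1 for r in in_class if r[2] == surface) / len(in_class)
    return prior * p_shape * p_surface

print(numerator("edible", "convex", "scaly"))     # (8/12)(4/8)(5/8) ~ 0.208
print(numerator("poisonous", "convex", "scaly"))  # (4/12)(4/4)(3/4) = 0.25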


Table 3: Spam/Ham Emails
S.No  Class  Keyword 1     Keyword 2     Keyword 3     Keyword 4
1.    Spam   free          buy           limited       hurry
2.    Ham    reply         data          report        presentation
3.    Ham    report        presentation  file          end of day
4.    Spam   limited       file          buy           click
5.    Ham    meeting       timelines     limited       documents
6.    Spam   hurry         data          buy           stock
7.    Spam   limited       sex           click         viagra
8.    Ham    presentation  end of day    data          report
9.    Ham    reply         data          presentation  click
10.   Spam   free          reply         weekend       click
11.   Spam   limited       click         free          hurry
12.   Ham    meeting       end of day    weekend       data
13.   Spam   hurry         weekend       stock         offer
14.   Ham    report        presentation  file          end of day
15.   Ham    free          timelines     reply         offer

Spam Keywords: buy, free, hurry, weekend, stock, offer, viagra, sex, limited, click
Ham Keywords: reply, data, report, presentation, file, end of day, meeting, timelines, delay, documents

1. What is the prior probability of a mail being spam, P(class = spam)? = 7/15
2. What does Naive Bayes assume while classifying spam or ham mails? = That the occurrences of keywords like hurry, free, offer, etc. are conditionally independent of each other, given the class
3. Consider an email with the feature vector X = (free, data, weekend, click). What is the likelihood, P(X | spam)?
2/7 x 1/7 x 1/7 x 2/7 = 4/2401
4. Consider an email with the feature vector X = (free, data, weekend, click). What is the likelihood, P(X | ham)?
1/8 x 2/8 x 1/8 x 1/8 = 2/4096
5. What is the value of P(X | Class) . P(Class) for class = spam and X = (free, data, weekend, click)?
P(X | spam) = 2/7 x 1/7 x 1/7 x 2/7 = 4/2401 and P(class = spam) = 7/15,
so the value is (7/15)(4/2401).
6. What is the numerator of the posterior for class = ham (i.e. without division by the denominator) for the feature vector X = (free, data, weekend, click)?
P(class = ham | X) is proportional to P(class = ham) . P(X | class = ham) = (8/15)(2/4096).
7. Into which class should the point X = (free, data, weekend, click) be classified? SPAM
The (numerators of the) posteriors, P(Class | X), for spam and ham are (7/15)(4/2401) and (8/15)(2/4096) respectively, and spam's is higher.
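The decision can be verified with a few lines of Python, plugging in the per-word conditional probabilities and priors quoted in the answers above.

# Compare the posterior numerators P(X | Class) . P(Class) for both classes.
from math import prod

likelihoods = {
    "spam": {"free": 2/7, "data": 1/7, "weekend": 1/7, "click": 2/7},
    "ham":  {"free": 1/8, "data": 2/8, "weekend": 1/8, "click": 1/8},
}
priors = {"spam": 7/15, "ham": 8/15}

x = ["free", "data", "weekend", "click"]
scores = {c: priors[c] * prod(likelihoods[c][w] for w in x) for c in priors}
print(scores)                       # spam ~ 7.8e-4, ham ~ 2.6e-4
print(max(scores, key=scores.get))  # spam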

Confusion Matrix:
                     Actual
Predicted       Spam        Ham
       Spam      440         20
       Ham        40        500

1. What is the accuracy of the model? (440 + 500) / (440 + 500 + 40 + 20) = 940/1000 = 94%
2. What is the sensitivity of the model? 440 / (440 + 40) = 440/480
3. What is the specificity of the model? 500 / (500 + 20) = 500/520
4. Given that you do not want to misclassify any genuine emails, which metric should be as high as possible?
Specificity: the fraction of correctly classified ham mails is measured by specificity (the true negative rate).
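A quick Python check of the three metrics, taking spam as the positive class.

# Metrics from the confusion matrix above (spam = positive class).
tp, fn = 440, 40   # spam predicted as spam / spam predicted as ham
fp, tn = 20, 500   # ham predicted as spam / ham predicted as ham

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 940/1000 = 0.94
sensitivity = tp / (tp + fn)                 # 440/480 ~ 0.917
specificity = tn / (tn + fp)                 # 500/520 ~ 0.962
print(accuracy, sensitivity, specificity)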


Multinomial Naive Bayes:

No.  Document                                 Class
0    Coffee Tea Soup Coffee Coffee            Hot
1    Coffee is hot and so is Soup and Tea     Hot
2    Espresso is a Hot Coffee and not a Tea   Hot
3    Coffee is neither Tea nor Soup           Hot
4    Sprite Pepsi Cold Coffee and cold Tea    Cold

1. What is the probability of the word "Coffee" appearing in a document classified as "Hot" if we are planning to do a Multinomial Naive Bayes classification? = 6/16
The word Coffee appears 6 times across all documents of class Hot (d0: 3, d1: 1, d2: 1, d3: 1).
Counting only content words (ignoring stop words such as 'is', 'and', 'a'), there are 16 words altogether in the Hot class of documents (d0: 5, d1: 4, d2: 4, d3: 3).
Hence the probability of the word Coffee in class Hot is 6/16.

2. What is binarization of a feature vector?
A. Converting all non-zero word counts of a feature vector to 1 and leaving zero counts as they are.
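In Python, binarization is a one-liner; the counts vector below is invented for illustration.

# Binarization: non-zero counts become 1, zero counts stay 0.
counts = [3, 0, 1, 0, 2]                      # e.g. word counts for one document
binary = [1 if c > 0 else 0 for c in counts]
print(binary)                                 # [1, 0, 1, 0, 1]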

3. What is the value of P("I love cold coffee" | Hot)? 1/24 * 7/24
Feedback:
P("I love cold coffee" | Hot) = P(cold | Hot) * P(coffee | Hot) using the naive independence assumption; the words 'I' and 'love' are not in the vocabulary, so they are ignored.
With add-1 (Laplace) smoothing over the 8-word vocabulary, P(cold | Hot) = (0 + 1)/(16 + 8) = 1/24 and P(coffee | Hot) = (6 + 1)/(16 + 8) = 7/24, so the net product is 1/24 * 7/24.
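A small Python sketch of these add-1 smoothed estimates, assuming (as the arithmetic above implies) that stop words are dropped and the vocabulary has 8 words.

# Add-1 (Laplace) smoothed word probabilities for the Hot class.
# Counts are over content words only (stop words like 'is', 'and', 'a' dropped).
hot_word_counts = {"coffee": 6, "tea": 4, "soup": 3, "hot": 2, "espresso": 1}
total_hot_words = 16
vocab_size = 8   # coffee, tea, soup, hot, espresso, sprite, pepsi, cold

def p_word_given_hot(word):
    return (hot_word_counts.get(word, 0) + 1) / (total_hot_words + vocab_size)

print(p_word_given_hot("coffee"))  # 7/24
print(p_word_given_hot("cold"))    # 1/24
# P("I love cold coffee" | Hot): 'I' and 'love' are outside the vocabulary,
# so only 'cold' and 'coffee' contribute.
print(p_word_given_hot("cold") * p_word_given_hot("coffee"))  # (1/24)(7/24)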

4. A bag A contains 3 red and 4 green balls, and another bag B contains 4 red and 6 green balls. One bag is selected at random and a ball is drawn from it.
If the ball drawn is green, find the probability that the bag chosen was A.
Let E1 and E2 denote the events of choosing bag A and bag B respectively; then P(E1) = P(E2) = 1/2.
By hypothesis, P(G | E1) = 4/7 and P(G | E2) = 6/10.
By Bayes' theorem, P(E1 | G) = P(G | E1) P(E1) / P(G)
= (4/7)(1/2) / [(1/2)(4/7) + (1/2)(6/10)] = (4/14) / (4/14 + 6/20) = 20/41.
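The same computation can be checked in Python with exact fractions.

# A quick check of the bag problem using exact fractions.
from fractions import Fraction as F

p_e1, p_e2 = F(1, 2), F(1, 2)        # each bag is equally likely
p_g_e1, p_g_e2 = F(4, 7), F(6, 10)   # P(G | E1), P(G | E2)

p_g = p_e1 * p_g_e1 + p_e2 * p_g_e2  # total probability of drawing green
print(p_e1 * p_g_e1 / p_g)           # P(E1 | G) = 20/41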

