Thought Leadership
April 07, 2020
Bayesian Machine Learning - Part 1: Intro
Ashutosh Vyas

Introduction


As a data scientist, I am curious about looking at different analytical processes from a probabilistic point of view. There are two popular ways of looking at any event, namely the Bayesian and the Frequentist. Where Frequentist researchers look at an event in terms of its frequency of occurrence, Bayesian researchers focus on the probability of the event happening, a degree of belief that is updated as evidence accumulates.

I am starting this series of blog posts to illustrate the Bayesian methods of performing analytics.

Defining Bayes' Rule

As we all know, Bayes' rule is one of the most popular equations in probability. It is defined as:


P(a given b) = P(a intersection b) / P(b) ..... (1)


Here a and b are events that have taken place.


In the above equation I have highlighted given and intersection, as these words carry the major significance in Bayes' rule. Given indicates that event b has already happened, and we now need to determine the probability of event a happening. Intersection indicates the occurrence of events a and b simultaneously.

Another form in which this above equation can be written is as follows:


P(a given b) = P(b given a) * P(a) / P(b) .... (2)

(this equation follows directly from equation 1, since P(a intersection b) = P(b given a) * P(a))

The above equation formulates the foundation of Bayesian inference.
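To see equation 2 in action, here is a minimal Python sketch with made-up probabilities (the numbers below are purely illustrative and not part of the original discussion):

```python
# Hypothetical probabilities, chosen only to exercise Bayes' rule
p_a = 0.3                # P(a)
p_b_given_a = 0.8        # P(b given a)
p_b_given_not_a = 0.1    # P(b given not a)

# P(b) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Equation 2: P(a given b) = P(b given a) * P(a) / P(b)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.774
```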


Understanding Bayes' Rule from an Analytics Perspective


In analytics we always try to identify real-world behavior through models. These models are mathematical equations with some parameters in them. These parameters are estimated based upon the behavior of events, or the evidence we collect from the world. This evidence is popularly known as data.

So the question now is - how does the Bayesian method help in identifying these parameters? Let us first see how Bayes' rule can incorporate these models.

Let’s take theta and X as our events in Bayes' rule and re-write equation 2.
P(theta given X) = P(X given theta) * P(theta) / P(X) ..... (3)
Defining all the different components of the above equation -
• P(theta given X) : Posterior Distribution**
• P(X given theta) : Likelihood
• P(theta) : Prior distribution**
• P(X) : Evidence



** We can use the term distribution as all these terms are probabilities ranging from 0 to 1. theta in the above case represents the parameters of the model we need to estimate. X is the data on which the model is trained.


Equation 3 can be re-written as:


posterior distribution = likelihood * prior distribution / evidence ..... (4)

Looking at all the above components individually, we have


Prior Distribution: We consider the prior distribution of theta as the information available about theta before even starting the model fitting process. This information is mostly based upon experience. Usually a Normal distribution with mean = 0 and variance = 1 is taken as the prior distribution of theta.


Posterior Distribution: Given the data, this is the solution distribution we obtain over theta. That is, once we have trained our model on the given data, we end up with tuned parameters; the posterior distribution is the distribution over these estimated values of theta. (This is again a big difference between the frequentist and Bayesian ways of inference.)


Likelihood: This term is not a probability distribution over theta, but the probability of occurrence of the data given theta. In other words, given some theta, how likely are we to observe the given data, or how well our model with the given theta as parameters can explain the given data.


Evidence: It is the probability of the occurrence of the data itself.
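To see how these four pieces combine, here is a small Python sketch that computes a posterior over a grid of candidate theta values for a coin-bias model. The data (7 heads out of 10 flips) and the flat prior are assumptions made only for this illustration:

```python
import numpy as np

heads, flips = 7, 10                          # hypothetical data for illustration

theta = np.linspace(0.01, 0.99, 99)           # candidate parameter values
prior = np.ones_like(theta) / theta.size      # flat prior: P(theta)
likelihood = theta**heads * (1 - theta)**(flips - heads)  # P(X given theta)

evidence = np.sum(likelihood * prior)         # P(X), the normalising constant
posterior = likelihood * prior / evidence     # equation 4

print(theta[np.argmax(posterior)])            # peaks near 0.7
```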


Let us now look at an example to see how the Bayesian approach can help in selecting between hypotheses, given the data.

Let us suppose we have the following data:

X = {2, 4, 8, 32, 64}

And we propose the following two hypotheses:

1) 2^n, where n ranges from 0 to 9

2) 2*n, where n ranges from 1 to 50


Applying Bayes' rule -

Note: as we have no prior information, we assign equal prior probability to both hypotheses.


----- Hypothesis 1: 2^n where n ranges from 0 to 9 -----


This hypothesis takes the following values: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512


Prior 1: 1/2
Likelihood 1: (1/10)*(1/10)*(1/10)*(1/10)*(1/10) = 10^-5
Evidence: constant for all hypotheses, as the input data is fixed
Posterior 1: 10^-5 * (1/2) / evidence

----- Hypothesis 2: 2*n where n ranges from 1 to 50 -----


This hypothesis takes the following values: 2, 4, 6, 8, 10, 12, 14, 16 ... 100.


Prior 2: 1/2
Likelihood 2: (1/50)*(1/50)*(1/50)*(1/50)*(1/50) = 3.2 * 10^-9
Evidence: constant for all hypotheses, as the input data is fixed
Posterior 2: 3.2 * 10^-9 * (1/2) / evidence


It can easily be seen from the above analysis that Posterior 1 >> Posterior 2, which means Hypothesis 1 explains the data much better than Hypothesis 2.
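A minimal Python sketch of the same comparison is shown below; since the evidence is identical for both hypotheses, it is dropped and only the unnormalised posteriors are compared:

```python
data = [2, 4, 8, 32, 64]

h1 = {2**n for n in range(10)}       # 2^n, n = 0..9  -> 10 possible values
h2 = {2*n for n in range(1, 51)}     # 2*n, n = 1..50 -> 50 possible values

def unnormalised_posterior(hypothesis, data, prior=0.5):
    # Each data point is assumed drawn uniformly from the hypothesis set,
    # and has probability zero if it lies outside the set.
    likelihood = 1.0
    for x in data:
        likelihood *= (1.0 / len(hypothesis)) if x in hypothesis else 0.0
    return likelihood * prior        # evidence omitted (same for both hypotheses)

print(unnormalised_posterior(h1, data))  # ~5e-06
print(unnormalised_posterior(h2, data))  # ~1.6e-09
```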


If we look closely at the evaluation of the posterior for both hypotheses, we note that the major source of the difference was the likelihood, and maximizing this likelihood helps in parameter tuning. This method is popularly known as Maximum Likelihood Estimation.
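As a quick, generic illustration of maximum likelihood estimation (a sketch with assumed measurements, not taken from the example above), the snippet below picks the value of a Gaussian mean that maximises the log-likelihood of some observations; the result coincides with the sample mean:

```python
import numpy as np

x = np.array([2.1, 1.9, 2.4, 2.0, 1.8])   # hypothetical measurements

mu_grid = np.linspace(0.0, 4.0, 401)       # candidate values of the mean
# Gaussian log-likelihood with known variance = 1, dropping constant terms
log_likelihood = np.array([-0.5 * np.sum((x - mu) ** 2) for mu in mu_grid])

mu_mle = mu_grid[np.argmax(log_likelihood)]
print(round(mu_mle, 2), round(x.mean(), 2))   # both are 2.04
```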


We now know how to use Bayes' rule. The next blog post is about how to use it to estimate the parameters of linear regression.


This point of view article was originally published on datasciencecentral.com. Data Science Central is the industry's online resource for data practitioners. From Statistics to Analytics to Machine Learning to AI, Data Science Central provides a community experience that includes a rich editorial platform, social interaction, forum-based support, plus the latest information on technology, tools, trends, and careers.




