2.10. Estimators for Distribution Parameters

Many statistical and machine learning methods are based on learning a probability distribution from data. Assume that a random variable \(X\) is distributed according to the probability mass function \(p_X(x;\theta)\) or probability density function \(f_X(x;\theta)\), where \(\theta\) are the parameters characterizing the distribution.

The question then is: can we estimate the parameters \(\theta\) given a sample of \(m\) values drawn from this distribution? In machine learning such a sample would be called the learning set.

A very common way to construct such an estimator is maximum likelihood estimation.

2.10.1. Maximum Likelihood Estimator

Consider a discrete distribution with parameters \(\theta\). The pmf is then written as \(\P(X=x) = p_X(x;\theta)\). We assume the values in the sample \(x\ls 1,\ldots,x\ls m\) are realizations of \(m\) independent and identically distributed random variables \(X_1,\ldots,X_m\). Therefore:

\[\P(X_1=x\ls 1, X_2=x\ls 2, \ldots, X_m=x\ls m) = p_{X_1 X_2\cdots X_m}(x\ls1,\ldots,x\ls m \,;\, \theta)\]

where \(p_{X_1 X_2\cdots X_m}\) is the joint probability mass function.

Since the random variables are iid, we have:

\[p_{X_1 X_2\cdots X_m}(x\ls1,\ldots,x\ls m \,;\, \theta) = \prod_{i=1}^m p_X(x\ls i\, ; \,\theta)\]

This probability can also be interpreted as a function of \(\theta\); it then gives the likelihood for the parameter \(\theta\) to explain the data:

\[\ell(\theta) = \prod_{i=1}^m p_X(x\ls i\, ; \,\theta)\]

The maximum likelihood estimator then is:

\[\hat \theta = \arg\max_{\theta} \ell(\theta)\]

In practice we often work with the log-likelihood:

\[\begin{split}\hat\theta &= \arg\max_{\theta} \log\ell(\theta)\\ &= \arg\max_{\theta} \sum_{i=1}^m \log\left(p_X(x\ls i\, ; \,\theta)\right)\end{split}\]

In the following sections we will look at a few distributions and their MLEs. Evidently the final expression for the estimator depends on the probability mass function \(p_X\) and its dependence on \(\theta\).

In case \(X\) is a continuous random variable the probability density function is used:

\[\hat\theta = \arg\max_{\theta} \sum_{i=1}^m \log\left(f_X(x\ls i\, ; \,\theta)\right)\]

Also note that the base of the logarithm is not important for our purpose.
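To make this recipe concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available; the function names are ours) that maximizes the log-likelihood numerically for a Bernoulli sample. The closed form derived in the next section serves as a check:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def log_likelihood(p, xs):
        # sum_i log p_X(x_i; p) for the Bernoulli pmf p^x (1-p)^(1-x)
        return np.sum(xs * np.log(p) + (1 - xs) * np.log(1 - p))

    xs = np.array([1, 0, 1, 1, 0, 1])  # the sample x(1), ..., x(m)

    # maximize log l(p) by minimizing its negative over 0 < p < 1
    res = minimize_scalar(lambda p: -log_likelihood(p, xs),
                          bounds=(1e-9, 1 - 1e-9), method="bounded")
    print(res.x)  # ~0.6667, equal to the closed form np.mean(xs)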

2.10.2. Bernoulli Distribution

The one parameter that needs to be estimated for the Bernoulli distribution is the probability \(p = \P(X=1)\). Using the sample \(x\ls 1,\ldots,x\ls m\), an obvious way to estimate it is:

\[\hat p = \frac{\sum_{i=1}^m x\ls i}{m}\]

i.e. we just calculate the relative frequency of the value \(X=1\) in our sample.

This estimator is the maximum likelihood estimator (MLE) for \(p\), as we will now show. The sample values are the outcomes of \(m\) iid random variables \(X_1,\ldots,X_m\). Assuming \(X_i\sim\Bernoulli(p)\) we have:

\[\P(X_1=x\ls 1,\ldots,X_m=x\ls m\,;\,p) = \prod_{i=1}^m \P(X_i=x\ls i\,;\,p) = \prod_{i=1}^m p^{x\ls i} (1-p)^{1-x\ls i}\]

This probability, when viewed as a function of \(p\) for a fixed sample \(x\ls 1,\ldots,x\ls m\), is called the likelihood function:

\[\ell(p) = \prod_{i=1}^m p^{x\ls i} (1-p)^{1-x\ls i}\]

The maximum likelihood estimator for \(p\) now finds the value of \(p\) that maximizes \(\ell(p)\). Instead of maximizing \(\ell\) we maximize the logarithm of \(\ell\), leading to the maximum likelihood estimator:

\[\hat p = \arg\max_{p} \log(\ell(p))\]

where

\[\begin{split}\log( \ell(p)) &= \log\left(\prod_{i=1}^m p^{x\ls i} (1-p)^{1-x\ls i}\right)\\ &= \sum_{i=1}^m \left({x\ls i}\log(p) + (1-x\ls i)\log(1-p) \right)\\ &= \left(\sum_{i=1}^m x\ls i\right) \log(p) + \left(\sum_{i=1}^m (1-x\ls i)\right) \log(1-p)\\ &= m_1\log(p) + m_0\log(1-p)\end{split}\]

where \(m_1\) is the sum of the \(x\ls i\), i.e. the number of values 1 in the sample, and \(m_0\) is the number of values 0 in the sample.

We can find the maximal value by calculating the derivative of \(\log\ell\) with respect to \(p\) and solving \(\frac{d}{dp}\log\ell(p)=0\). We have:

\[\frac{d}{dp}\log\ell(p) = \frac{m_1}{p} - \frac{m_0}{1-p}\]

Setting this equal to 0 and solving for \(p\) leads to:

\[\hat p = \frac{m_1}{m_1+m_0} = \frac{m_1}{m}\]
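As a quick check on this derivation we can let a computer algebra system redo the calculus; a sketch using SymPy (our choice of tool, not part of the text):

    import sympy as sp

    p, m0, m1 = sp.symbols('p m0 m1', positive=True)
    log_ell = m1 * sp.log(p) + m0 * sp.log(1 - p)  # log-likelihood from above

    # solve (log l)'(p) = 0 for p
    print(sp.solve(sp.diff(log_ell, p), p))  # [m1/(m0 + m1)], i.e. m1/m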

With relatively small sample sizes you might run into the situation where \(m_1=0\) or \(m_0=0\). This leads to \(\hat p=0\) or \(\hat p=1\) respectively, a situation that many machine learning applications need to avoid. In such cases add-one Laplace smoothing is often used. We then use

\[\hat p = \frac{m_1+1}{m+2}\]

Note that in case we have no data at all this estimator defaults to \(\hat p=\tfrac{1}{2}\), i.e. a uniform distribution.
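In code both estimators are one-liners; a minimal sketch assuming NumPy (the function names are ours):

    import numpy as np

    def bernoulli_mle(xs):
        # hat p = m1 / m: the relative frequency of the value 1
        return np.mean(xs)

    def bernoulli_laplace(xs):
        # add-one smoothed estimate: (m1 + 1) / (m + 2)
        xs = np.asarray(xs)
        return (np.sum(xs) + 1) / (len(xs) + 2)

    print(bernoulli_mle([1, 1, 1]))       # 1.0 -- degenerate
    print(bernoulli_laplace([1, 1, 1]))   # 0.8 -- pulled towards 1/2
    print(bernoulli_laplace([]))          # 0.5 -- no data: uniform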

2.10.3. Categorical Distribution

For the categorical distribution there are \(K\) probabilities to be estimated from the sample \(x\ls 1,\ldots,x\ls m\). The possible values range from \(1\) to \(K\). Let's denote the number of outcomes equal to \(i\) by \(m_i\), i.e.:

\[m_i = \sum_{j=1}^{m} \left[ x\ls j = i\right]\]

where \(\left[ B \right]\) is the Iverson bracket:

\[\begin{split}\left[ B \right] = \begin{cases} 1 &: B \text{ is True}\\ 0 &: \text{otherwise} \end{cases}\end{split}\]

Maximizing the log-likelihood \(\sum_{i=1}^{K} m_i \log(p_i)\) subject to the constraint \(\sum_{i=1}^{K} p_i = 1\) (e.g. with a Lagrange multiplier) gives the MLE for \(p_i\):

\[\hat p_i = \frac{m_i}{m}\]

The estimator using add-one Laplace smoothing is given by:

\[\hat p_i = \frac{m_i+1}{m+K}\]
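A sketch of both categorical estimators in Python (NumPy assumed; the outcomes are taken to be \(1,\ldots,K\) as above, so they are shifted to 0-based indices for np.bincount):

    import numpy as np

    def categorical_mle(xs, K, smoothing=False):
        # m_i = number of outcomes equal to i, for i = 1, ..., K
        counts = np.bincount(np.asarray(xs) - 1, minlength=K)
        if smoothing:
            return (counts + 1) / (len(xs) + K)  # add-one Laplace smoothing
        return counts / len(xs)                  # plain MLE: m_i / m

    xs = [1, 3, 3, 2, 3, 1]                      # sample with K = 3 categories
    print(categorical_mle(xs, K=3))                   # [0.333 0.167 0.5]
    print(categorical_mle(xs, K=3, smoothing=True))   # [0.333 0.222 0.444]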

2.10.4. Normal Distribution

Let \(X\sim\Normal(\mu,\sigma^2)\) and let \(x\ls 1,\ldots,x\ls m\) be a sample from \(X\).

The MLEs for \(\mu\) and \(\sigma^2\) are:

\[\hat \mu = \frac{1}{m}\sum_{i=1}^{m} x\ls i\]

and

\[\hat \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x\ls i-\hat\mu)^2\]
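In code (NumPy assumed) the only subtlety is that the MLE for the variance divides by \(m\) and not by \(m-1\); np.var does exactly this with its default ddof=0:

    import numpy as np

    def normal_mle(xs):
        xs = np.asarray(xs, dtype=float)
        mu_hat = np.mean(xs)             # hat mu = (1/m) sum_i x_i
        sigma2_hat = np.var(xs, ddof=0)  # divides by m, matching the MLE
        return mu_hat, sigma2_hat

    xs = [2.1, 1.9, 2.4, 2.0, 1.6]
    print(normal_mle(xs))  # approx (2.0, 0.068)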