======================================
Estimators for Distribution Parameters
======================================

A lot of statistical and machine learning methods are based on learning a
probability distribution from data. Assume that a random variable $X$ is
distributed according to the probability mass function $p_X(x;\theta)$ or
probability density function $f_X(x;\theta)$, where $\theta$ are the
parameters characterizing the distribution. The question then is: can we
estimate the parameters $\theta$ given a sample of $m$ values drawn from this
distribution? In machine learning such a sample would be called the learning
set. A very common way to come up with an estimator is the maximum likelihood
estimator.

Maximum Likelihood Estimator
============================

Consider a discrete distribution with parameters $\theta$. The pmf is then
written as $\P(X=x) = p_X(x;\theta)$. We assume the values in the sample
$x\ls 1,\ldots,x\ls m$ are realizations of $m$ independent and identically
distributed random variables $X_1,\ldots,X_m$. Therefore:

.. math:: \P(X_1=x\ls 1, X_2=x\ls 2, \ldots, X_m=x\ls m) = p_{X_1 X_2\cdots X_m}(x\ls1,\ldots,x\ls m \,;\, \theta)

where $p_{X_1 X_2\cdots X_m}$ is the joint probability mass function. Because
the random variables are iid we have:

.. math:: p_{X_1 X_2\cdots X_m}(x\ls1,\ldots,x\ls m \,;\, \theta) = \prod_{i=1}^m p_X(x\ls i\, ; \,\theta)

This probability can also be interpreted as a function of $\theta$; it then
gives the likelihood for the parameter $\theta$ to explain the data:

.. math:: \ell(\theta) = \prod_{i=1}^m p_X(x\ls i\, ; \,\theta)

The maximum likelihood estimator then is:

.. math:: \hat \theta = \arg\max_{\theta} \ell(\theta)

In practice we often look at the log likelihood instead:

.. math::
    \hat\theta &= \arg\max_{\theta} \log\ell(\theta)\\
    &= \arg\max_{\theta} \sum_{i=1}^m \log\left(p_X(x\ls i\, ; \,\theta)\right)

In the following sections we will look at a few distributions and their MLEs.
Evidently the final expression for the estimator depends on the probability
mass function $p_X$ and its dependence on $\theta$. In case $X$ is a
continuous random variable the probability density function is used:

.. math:: \hat\theta = \arg\max_{\theta} \sum_{i=1}^m \log\left(f_X(x\ls i\, ; \,\theta)\right)

Also note that the base of the logarithm is not important for our purpose.

Bernoulli Distribution
======================

The one parameter that needs to be estimated for the Bernoulli distribution
is the probability $p = \P(X=1)$. Using the sample $x\ls 1,\ldots,x\ls m$ an
obvious way to calculate this is:

.. math:: \hat p = \frac{\sum_{i=1}^m x\ls i}{m}

i.e. we just calculate the relative frequency of the value $X=1$ in our
sample. This estimator is the **maximum likelihood estimator** (MLE) for $p$,
as we will show. The sample values are the outcomes of $m$ iid random
variables $X_1,\ldots,X_m$. Assuming $X_i\sim\Bernoulli(p)$ we have:

.. math:: \P(X_1=x\ls 1,\ldots,X_m=x\ls m\,;\,p) = \prod_{i=1}^m \P(X_i=x\ls i\,;\,p) = \prod_{i=1}^m p^{x\ls i} (1-p)^{1-x\ls i}

This probability, viewed as a function of $p$ for a fixed sample
$x\ls 1,\ldots,x\ls m$, is called the likelihood function:

.. math:: \ell(p) = \prod_{i=1}^m p^{x\ls i} (1-p)^{1-x\ls i}

The maximum likelihood estimator for $p$ now finds the value of $p$ that
maximizes $\ell(p)$. Instead of maximizing $\ell$ we maximize the logarithm
of $\ell$, leading to the maximum likelihood estimator:

.. math:: \hat p = \arg\max_{p} \log(\ell(p))

where

.. math::
    \log( \ell(p)) &= \log\left(\prod_{i=1}^m p^{x\ls i} (1-p)^{1-x\ls i}\right)\\
    &= \sum_{i=1}^m \left({x\ls i}\log(p) + (1-x\ls i)\log(1-p) \right)\\
    &= \left(\sum_{i=1}^m x\ls i\right) \log(p) + \left(\sum_{i=1}^m (1-x\ls i)\right) \log(1-p)\\
    &= m_1\log(p) + m_0\log(1-p)

where $m_1$ is the sum of the $x\ls i$, i.e. the number of ones in the
sample, and $m_0$ is the number of zeros in the sample. We can find the
maximizing value by calculating the derivative of $\log\ell$ with respect to
$p$ and setting it equal to zero. We have:

.. math:: \frac{d}{dp}\log\ell(p) = \frac{m_1}{p} - \frac{m_0}{1-p}

Setting this equal to 0 and solving for $p$ leads to:

.. math:: \hat p = \frac{m_1}{m_1+m_0} = \frac{m_1}{m}

Since $\log\ell$ is concave in $p$, this stationary point is indeed the
maximum. With relatively small sample sizes you might run into the situation
where $m_1=0$ or $m_0=0$, leading to $\hat p=0$ or $\hat p=1$, a situation
that in a lot of machine learning applications you would like to avoid. In
such cases **add 1 Laplace smoothing** is often used. Now we use:

.. math:: \hat p = \frac{m_1+1}{m+2}

Note that in case we have no data at all this estimator gives
$\hat p=\tfrac{1}{2}$, i.e. a uniform distribution.
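The derivation above is easy to check numerically. Below is a minimal sketch
in Python (assuming NumPy and SciPy are available; all names are our own
illustrative choices, not part of any established API) that computes the
closed-form estimate $\hat p = m_1/m$ and its Laplace-smoothed variant, and
also maximizes the log likelihood numerically following the generic recipe
from the previous section:

.. code-block:: python

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.binomial(1, 0.3, size=50)   # sample of 50 Bernoulli(p=0.3) draws

    m = len(x)
    m1 = x.sum()                        # number of ones in the sample
    m0 = m - m1                         # number of zeros in the sample

    p_mle = m1 / m                      # closed-form MLE
    p_laplace = (m1 + 1) / (m + 2)      # add-1 Laplace smoothing

    # Generic recipe: maximize log l(p) = m1 log(p) + m0 log(1 - p)
    # by minimizing its negation on the open interval (0, 1).
    def neg_log_likelihood(p):
        return -(m1 * np.log(p) + m0 * np.log(1 - p))

    res = minimize_scalar(neg_log_likelihood,
                          bounds=(1e-9, 1 - 1e-9), method='bounded')

    print(p_mle, p_laplace, res.x)      # res.x agrees with p_mle

The numerical maximizer reproduces $m_1/m$ up to the tolerance of the
optimizer; when a closed form is available it is of course preferred.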
Categorical Distribution
========================

For the categorical distribution there are $K$ probabilities
$p_1,\ldots,p_K$ to be estimated from the sample $x\ls 1,\ldots,x\ls m$. The
possible values range from $1$ to $K$. Let's denote the number of outcomes
equal to $i$ with $m_i$, i.e.:

.. math:: m_i = \sum_{j=1}^{m} \left[ x\ls j = i\right]

where $\left[ B \right]$ is the Iverson bracket:

.. math:: \left[ B \right] = \begin{cases} 1 &: B \text{ is True}\\ 0 &: \text{otherwise} \end{cases}

The MLE for $p_i$ is:

.. math:: \hat p_i = \frac{m_i}{m}

The estimator using add 1 Laplace smoothing is given by:

.. math:: \hat p_i = \frac{m_i+1}{m+K}

Note that for $K=2$ the smoothed estimator reduces to the Bernoulli case. A
code sketch of both estimators follows at the end of this chapter.

Normal Distribution
===================

Let $X\sim\Normal(\mu,\sigma^2)$ and let $x\ls 1,\ldots,x\ls m$ be a sample
from $X$. The MLE estimators for $\mu$ and $\sigma^2$ are:

.. math:: \hat \mu = \frac{1}{m}\sum_{i=1}^{m} x\ls i

and

.. math:: \hat \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x\ls i-\hat\mu)^2

Note that the MLE for the variance divides by $m$ and not by $m-1$: it is the
biased variance estimator.
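As promised, here is a minimal sketch of the categorical estimators in Python
(assuming NumPy; the names are illustrative). The counts $m_i$ are computed
exactly as the Iverson-bracket sums above:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(1)
    K = 4
    x = rng.integers(1, K + 1, size=30)   # sample with values in 1..K

    m = len(x)
    # Iverson-bracket counts: m_i = number of outcomes equal to i
    m_counts = np.array([(x == i).sum() for i in range(1, K + 1)])

    p_mle = m_counts / m                  # MLE: m_i / m
    p_laplace = (m_counts + 1) / (m + K)  # add-1 Laplace smoothing

    print(p_mle, p_laplace)               # both vectors sum to 1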
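And a similar sketch for the normal distribution. Note that NumPy's
``np.var`` with its default ``ddof=0`` divides by $m$ and therefore computes
exactly the (biased) maximum likelihood estimate:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(loc=5.0, scale=2.0, size=100)  # sample from Normal(5, 4)

    m = len(x)
    mu_hat = x.sum() / m                          # MLE for the mean
    sigma2_hat = ((x - mu_hat) ** 2).sum() / m    # MLE for the variance

    # np.mean and np.var (default ddof=0) compute the same quantities.
    assert np.isclose(mu_hat, np.mean(x))
    assert np.isclose(sigma2_hat, np.var(x))

    print(mu_hat, sigma2_hat)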