Mean and Median Values

Consider a set of scalar observations \(v_1,\ldots,v_n\). Assume we have good reasons to believe that all these values really should be the same (and equal to some unknown value \(t\)), but due to measurement errors, natural variation or some other unknown disturbance the values are not the same.

How then do we calculate (or rather, estimate) the true value \(t\)?

The mean value

The intuitive way to estimate \(t\) is to calculate the mean of the set of observed values:

\[\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i\]

This is not a bad idea at all. In fact, if the deviations of the observed values from the true value, \(v_i - t\), are statistically independent and all normally distributed with the same standard deviation, the simple mean is the optimal statistical estimator.

The mean \(\bar v\) is also the value that minimizes the sum of squared differences:

\[\bar v = \arg\min_u \sum_{i=1}^n (v_i - u)^2\]
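As a quick numerical illustration (the observation values below are made up), the following Python sketch computes the mean and checks by a brute-force grid search that it indeed minimizes the sum of squared differences:

```python
import numpy as np

# made-up observations that "should" all equal one true value t
v = np.array([4.9, 5.1, 5.0, 4.8, 5.2])

v_bar = v.mean()                     # the arithmetic mean, here 5.0

# brute-force check: scan a grid of candidate values u and pick the one
# with the smallest sum of squared differences
def sum_sq(u):
    return np.sum((v - u) ** 2)

candidates = np.linspace(v.min(), v.max(), 1001)
best = candidates[np.argmin([sum_sq(u) for u in candidates])]

print(v_bar, best)                   # both are (approximately) 5.0
```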

The weighted mean value

In some situations the observations cannot be treated on an equal footing. For example, you might suspect some of the observations to be more accurate than others. In such cases you might use a weighted mean to estimate the true value \(t\):

\[\bar{v} = \frac{\sum_{i=1}^{n} w_i v_i}{\sum_{i=1}^{n} w_i}\]

The larger the (positive) weight factor \(w_i\), the more influence the value \(v_i\) has on the result. The division by \(n\) in the unweighted mean is replaced by a division by the sum of all weights (note that the unweighted mean corresponds to setting all weights equal to 1, in which case the sum of the weights equals \(n\)).

Indeed the weighted mean is the value resulting from the following minimization:

\[\bar v = \arg\min_u \sum_{i=1}^n w_i (v_i - u)^2\]
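A minimal Python sketch of the weighted mean (the values and weights below are made up for illustration); note that numpy already provides this computation as np.average:

```python
import numpy as np

def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * v_i) / sum(w_i)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * values) / np.sum(weights)

# made-up example: the last observation is trusted twice as much as the others
v = [4.9, 5.1, 5.4]
w = [1.0, 1.0, 2.0]

print(weighted_mean(v, w))        # (4.9 + 5.1 + 2 * 5.4) / 4 = 5.2
print(np.average(v, weights=w))   # numpy's built-in weighted mean, same result
```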

The median value

Now consider the situation where for most of your observations you may safely assume that they are normal i.i.d., but alas, a few of your observations are far off from the true value. These outliers will greatly influence the estimate of the true value. Remember that the mean is the value \(\bar v\) that minimizes the sum of squared differences. A value far away from the rest of the observations therefore shifts the outcome towards the outlier, because that reduces its quadratic contribution to the error.

Let us look instead at the following minimization problem:

\[m = \arg\min_u \sum_{i=1}^n | v_i - u |\]

So we look at the absolute difference instead of the squared difference. The shift towards the outlier should be less in this case.

It may come as a surprise (and really it should, if you think about it) that the value resulting from the above minimization problem is the median value of the \(n\) observations.

The classical definition of the median value is of course to first sort the values \(v_i\) and then take the middle one (for odd \(n\); for even \(n\) we take the mean of the two middle observations).
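The sketch below (again with made-up numbers, one of them a gross outlier) contrasts the mean with the median and checks numerically that the median minimizes the sum of absolute differences:

```python
import numpy as np

# made-up observations with one gross outlier
v = np.array([4.9, 5.1, 5.0, 4.8, 50.0])

print(np.mean(v))     # 13.96 -- dragged far towards the outlier
print(np.median(v))   # 5.0   -- the middle of the sorted set, barely affected

# brute-force check that the median minimizes the sum of absolute differences
def sum_abs(u):
    return np.sum(np.abs(v - u))

candidates = np.linspace(v.min(), v.max(), 10001)
best = candidates[np.argmin([sum_abs(u) for u in candidates])]
print(best)           # approximately 5.0 (up to the grid spacing)
```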

The weighted median value

You may of course wonder how to calculate a weighted median in the situation where it is known a priori that some observations are more reliable than others.

What will be the outcome of the following minimization:

\[m = \arg\min_u \sum_{i=1}^n w_i | v_i - u |\]

For positive integer weights the weighted median allows for a simple interpretation. Instead of sorting the original set and taking the middle number, we first duplicate each value \(v_i\) exactly \(w_i\) times, sort the resulting set and take its middle value. The set with duplicates is often denoted as \(w_1\diamond v_1, \ldots, w_n \diamond v_n\).
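This duplication interpretation is easy to express directly in Python (the values and integer weights below are made up for illustration):

```python
import numpy as np

# made-up values with positive integer weights
v = [1.0, 2.0, 3.0, 4.0]
w = [1,   1,   4,   1]

# duplicate each v_i exactly w_i times and take the ordinary median
expanded = np.repeat(v, w)    # [1., 2., 3., 3., 3., 3., 4.]
print(np.median(expanded))    # 3.0 -- the weighted median
```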

Sorting the set with all duplicates is not really necessary. A simple algorithm sorts the values \(v_1,\ldots,v_n\) and reorders the set of weights accordingly. Let \(v^\star_1,\ldots,v^\star_n\) be the sorted values and let \(w^\star_1,\ldots,w^\star_n\) be the correspondingly reordered weights. Let \(W\) be the sum of all weights. Then:

\[m = v^\star_k \quad\text{where}\quad k = \min\Big\{\, j : \sum_{i=1}^{j} w^\star_i \geq W/2 \,\Big\}\]
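A sketch of this algorithm in Python (my own implementation of the rule above, not taken from any particular library), reusing the made-up example from the duplication approach:

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: sort the values, reorder the weights along with them,
    and return the first sorted value at which the cumulative weight
    reaches half of the total weight W."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v_sorted = values[order]
    w_sorted = weights[order]
    half = np.sum(w_sorted) / 2.0
    # index of the first cumulative weight that is >= W/2
    k = np.searchsorted(np.cumsum(w_sorted), half)
    return v_sorted[k]

# same made-up example as above: agrees with the duplication approach
print(weighted_median([1.0, 2.0, 3.0, 4.0], [1, 1, 4, 1]))   # 3.0
```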