Averages, or means, are something that everyone understands, implications and all. Or do they?
A fundamental idea in applied statistics is that there is an ideal system that we can see only through a glass, darkly, by taking samples from it. Statisticians call it the population; engineers call it the process. This ideal process has a process mean, which we try to estimate by way of a sample mean. Within certain limitations, the more samples we take, the better our understanding of the properties of this infinite pool of numbers. One of the great advantages of modelling is that we can prescribe the properties of an artificial process and thereby examine the effectiveness of any tools we use in trying to examine other processes in the real world. Let us therefore create an imaginary, perfectly behaved process to illustrate the business of averaging.
An engineer is testing components for noise. (Don’t panic! We are not going into the physics.) Because the electrons in the components are permanently being jiggled about by thermal energy, there is at any instant of time a very small non-zero voltage across each component. Provided the temperature remains constant, the statistical properties of this phenomenon remain constant in time and the voltages are normally distributed.
Our engineer tests ten identical components at ten equally spaced times, say one second apart, and records the instantaneous voltage in microvolts. We label the components x1, x2, …, x10 and the times t1, t2, …, t10.

           t1    t2    t3    t4    t5    t6    t7    t8    t9    t10   Time
x1        0.30  1.28  0.24  1.28  1.20  1.73  2.18  0.23  1.10  1.09   0.05
x2        0.69  1.69  1.85  0.98  0.77  2.12  0.57  0.40  0.13  0.37   0.93
x3        0.33  0.37  1.34  0.09  0.19  0.51  1.97  0.87  2.38  0.65   0.44
x4        1.66  1.61  0.54  0.90  1.92  0.08  0.52  0.68  0.38  0.76   0.39
x5        1.44  0.85  1.52  0.36  0.03  0.03  0.32  2.19  1.74  0.74   0.48
x6        2.58  1.45  1.28  0.65  0.76  0.47  0.87  0.60  1.37  1.12   0.29
x7        0.69  0.32  0.94  0.24  0.13  0.56  0.14  0.91  1.88  0.49   0.21
x8        0.07  0.83  0.86  0.64  0.92  1.11  1.20  1.56  0.71  0.64   0.01
x9        2.21  1.44  1.30  0.11  0.00  0.45  0.03  1.05  1.77  0.83   0.35
x10       0.44  0.62  0.21  1.03  1.24  0.31  0.84  0.82  0.43  0.45   0.14
Ensemble  0.03  0.11  0.11  0.17  0.33  0.13  0.27  0.07  0.05  0.17   0.04
There are two ways in which our engineer can create an average or sample mean. He can take the numbers for the ten components at any one time (i.e. a vertical column), add them and divide by ten. This is known as the Ensemble Average. Or he can take one component and average the voltages at the ten different times, which produces the Time Average. Conventionally, the ensemble average is written E(x) and the time average is indicated by putting a bar over the x.
We note that both sets of averages appear less scattered than the original numbers, by a factor of about three.
When the time average of a process is equal to the ensemble average, it is said to be ergodic. Our simple model process is ergodic by definition. It is obviously very powerful and convenient to be able to substitute one average for the other, but it can be a serious error to assume ergodicity without sound grounds for doing so.
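The two kinds of average are easy to compute directly. Here is a minimal sketch, assuming the prescribed model (a unit-normal process) and generating a fresh simulated table rather than reusing the engineer's numbers; all names are illustrative:

```python
import random

random.seed(1)

# Simulate 10 components (rows) observed at 10 times (columns).
# The model process is prescribed: mean 0, standard deviation 1.
N_COMPONENTS, N_TIMES = 10, 10
data = [[random.gauss(0.0, 1.0) for _ in range(N_TIMES)]
        for _ in range(N_COMPONENTS)]

def ensemble_average(data, t):
    """Average over all components at one fixed time t (a column)."""
    return sum(row[t] for row in data) / len(data)

def time_average(data, i):
    """Average over all times for one fixed component i (a row)."""
    return sum(data[i]) / len(data[i])

ensemble = [ensemble_average(data, t) for t in range(N_TIMES)]
time_avgs = [time_average(data, i) for i in range(N_COMPONENTS)]
```

For this ergodic-by-construction process, both lists scatter around the same value (zero), and both would converge to it as the number of samples grows.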
A question often asked is whether one can improve accuracy by averaging. The answer is NO! If, for example, our engineer’s voltmeter had a bent pointer, giving a permanent offset, then all the averages would have the same offset. We can, however, increase the precision. If the process is well behaved and our measuring technique is consistent, then the purely statistical errors decrease as the square root of the number of samples used to create the average. Thus, under these restrictive conditions, if we have 100 samples we are entitled to express the average with one extra decimal place of precision. We have already noted that the excursions of the averages in our table were smaller than those of the raw numbers by a factor of about three, which is roughly the square root of ten.
Knowing the average voltage is obviously of limited value. If you stick your fingers into the mains socket, it is of no help to you to know that the average voltage is zero. With alternating voltages we need a number that represents the average excursions away from zero.
One possibility is to average the absolute values of the voltage (i.e. remove all the negative signs in our table). This is quite practicable, but it has a number of mathematical disadvantages associated with the fact that the absolute-value function has a discontinuity of slope at zero. The average of the squares has many mathematical advantages (going back to the theorem ascribed to Pythagoras) and, in particular, it is based on a smooth function. The only remaining problem is that this number has the dimensions of voltage squared, which causes difficulties, for example when we want to compare it with the ordinary average, so we take the square root to restore the units. The resulting number is the Root Mean Square, or RMS. When we quote the mains voltage as 110 or 230, the number is actually the RMS amplitude.
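The recipe is literally "root of the mean of the squares". A sketch, using a sampled sine wave whose peak amplitude is chosen so that the RMS comes out at 230 (for a sine, the RMS is the peak divided by the square root of two):

```python
import math

def rms(values):
    """Root Mean Square: square, average, then take the square root."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# One full cycle of a sine wave, sampled at 1000 equally spaced points.
peak = 230 * math.sqrt(2)   # about 325 V
samples = [peak * math.sin(2 * math.pi * k / 1000) for k in range(1000)]

print(round(rms(samples)))                  # 230: the quoted mains figure is the RMS
print(round(sum(samples) / len(samples)))   # 0: the plain average tells you nothing
```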
Note: The field of hi-fi is littered with foolish concepts that are lapped up by its naïve followers. One of them is the expression "watts RMS". Power in watts is proportional to voltage squared, so it is already a mean-square value. When hi-fi salesmen say "watts RMS" they actually mean watts of sine-wave power.
In statistics, the RMS value of the deviations from the average is known as the standard deviation and is a measure of the size of the excursions away from the average. Here is our engineer’s table with RMS values instead of plain averages:

           t1    t2    t3    t4    t5    t6    t7    t8    t9    t10   Time
x1        0.30  1.28  0.24  1.28  1.20  1.73  2.18  0.23  1.10  1.09   1.23
x2        0.69  1.69  1.85  0.98  0.77  2.12  0.57  0.40  0.13  0.37   0.69
x3        0.33  0.37  1.34  0.09  0.19  0.51  1.97  0.87  2.38  0.65   1.05
x4        1.66  1.61  0.54  0.90  1.92  0.08  0.52  0.68  0.38  0.76   1.01
x5        1.44  0.85  1.52  0.36  0.03  0.03  0.32  2.19  1.74  0.74   1.07
x6        2.58  1.45  1.28  0.65  0.76  0.47  0.87  0.60  1.37  1.12   1.23
x7        0.69  0.32  0.94  0.24  0.13  0.56  0.14  0.91  1.88  0.49   0.78
x8        0.07  0.83  0.86  0.64  0.92  1.11  1.20  1.56  0.71  0.64   0.93
x9        2.21  1.44  1.30  0.11  0.00  0.45  0.03  1.05  1.77  0.83   1.12
x10       0.44  0.62  0.21  1.03  1.24  0.31  0.84  0.82  0.43  0.45   0.70
Ensemble  1.33  1.15  1.13  0.72  0.87  0.98  1.08  1.08  1.39  0.73   1.05
The values chosen for our model were a mean (average) of zero and a standard deviation (RMS) of unity. The averages taken over the total 100 samples give a mean of –0.04 and a standard deviation of 1.05. Sample averages are never exactly equal to the population values. In fact, the expected standard deviation of the sample mean is equal to the standard deviation of the population divided by the square root of the number of samples, in this case 0.1, so the error is of the magnitude that would be anticipated.
This square-root relationship means that there is a law of diminishing returns in increasing the sample size to improve precision. For each extra decimal place of precision we need to multiply the number of samples by 100. It is a cardinal blunder to quote an average with greater precision than can be justified by this rule, but one that is committed with monotonous regularity, especially when comparing one set of samples with another. When one sample mean is subtracted from another, the standard deviations combine RMS-fashion to produce an increased scatter. In most applications both means are positive and of similar magnitude, so the difference is much smaller than either, while its scatter is greater, so the relative statistical errors are greatly multiplied. Far-reaching deductions are often then made on the basis of what are, in effect, random errors.
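The inflation of relative error when subtracting similar means can be made concrete; the numbers below are purely illustrative:

```python
import math

# Two hypothetical sample means of similar magnitude, each with its own scatter.
mean_a, sd_a = 102.0, 2.0
mean_b, sd_b = 100.0, 2.0

diff = mean_a - mean_b                   # 2.0 -- far smaller than either mean
sd_diff = math.sqrt(sd_a**2 + sd_b**2)   # scatters combine RMS-fashion: ~2.83

# Relative errors: about 2% on each mean, but over 100% on the difference.
print(sd_a / mean_a, sd_b / mean_b, sd_diff / diff)
```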
Sometimes we want to know how two signals (sets of numbers as functions of time) are related in time. For this we use the cross-correlation function. We do not need to go into this here, and our simple model involves only one variable, but it is useful to understand the autocorrelation function. In its basic form this is a function of two variables, which are instants of time. To determine the autocorrelation for, say, times t2 and t5, we go down the relevant columns and multiply the values together to create a new column of products. We then take the ensemble average of the products:
  t2     t5    product
 1.28   1.20    1.53
 1.69   0.77    1.31
 0.37   0.19    0.07
 1.61   1.92    3.09
 0.85   0.03    0.03
 1.45   0.76    1.10
 0.32   0.13    0.04
 0.83   0.92    0.77
 1.44   0.00    0.00
 0.62   1.24    0.77

R(t2,t5) = 0.21
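The column-multiplying procedure above can be sketched in code. The data matrix here is freshly simulated (not the engineer's table), with components as rows and times as columns; all names are illustrative:

```python
import random

random.seed(7)

# Ten components observed at ten times; the model process is unit normal.
data = [[random.gauss(0.0, 1.0) for _ in range(10)] for _ in range(10)]

def autocorrelation(data, t_a, t_b):
    """Ensemble average of the products of the values at two fixed times."""
    products = [row[t_a] * row[t_b] for row in data]
    return sum(products) / len(products)

# R(t2, t5), using zero-based column indices for t2 and t5.
r_25 = autocorrelation(data, 1, 4)
print(r_25)
```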
A process is said to be stationary when all its statistical properties are independent of time. This is, of course, difficult, indeed impossible, to determine with a finite number of samples. We therefore use a definition of stationarity in the wide sense and say that a process is stationary if the autocorrelation is a function only of the difference between the two times. In our model we have prescribed the process to be stationary, so in theory R(t3,t5) = R(t4,t6) and so on. In practice, when we do the calculations on our samples, they are not equal and never will be, but as we increase the number of samples they converge towards each other. If we are satisfied that the process is stationary, we can calculate the autocorrelation as a time average and write it as a function of a single variable, the time interval, conventionally τ; the function is then written R(τ).
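For a process accepted as stationary, R(τ) can be estimated as a time average along a single record; a sketch, with an illustrative record length and an independent-samples process for which R(0) is the mean square and R(τ) is near zero elsewhere:

```python
import random

random.seed(3)

# One long record from a single component of a stationary unit-normal process.
signal = [random.gauss(0.0, 1.0) for _ in range(5000)]

def r_tau(signal, tau):
    """Time-average autocorrelation: the average of x(t) * x(t + tau)."""
    n = len(signal) - tau
    return sum(signal[t] * signal[t + tau] for t in range(n)) / n

# R(0) is the mean square (about 1 here); R(tau) scatters about 0 otherwise.
print(r_tau(signal, 0), r_tau(signal, 3))
```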
It is important to remember that there is a strong subjective element in stationarity. If we are examining ripples, the waves are a non-stationarity. If we are examining waves, the tides are a non-stationarity. If we are examining the tides, the solar and lunar conjunctions are a non-stationarity. In the end nothing is stationary, as nothing lasts forever. We therefore need to satisfy ourselves that a process is effectively stationary over the time window we have prescribed before we start taking liberties with the data.
In the above we have deliberately chosen a simple, well-behaved model. In the real world such behaviour is rare. In particular, data from human populations are neither ergodic nor stationary. Also, the measuring systems are not necessarily consistent, so even the assumption that we can apply the square-root rule to increase precision might not be valid.
In the literature, averages are bandied around, often quoted to absurd precision, compared by subtraction or division, and significant results are claimed that could well be ascribed to calculation errors.
The greatest and most common problem arises where the process is non-stationary and the ensemble average varies with time. This is less problematic when we actually have access to an ensemble of systems to look at, but more often than not we have only one, and the ensemble is just a theoretical ideal that we use to describe the statistical properties of the process. Then we have to resort to trying to calculate the local average in the time sequence. Immediately we are faced with the uncertainty principle, because in signal-processing terms averaging is trying to extract the zero-frequency component, and the more accurately we know the average, the less accurately we know where it is in time. Attempting to track the average is what data-smoothing formulae do, and in selecting the smoothing parameter we have to trade accuracy of the average against spread in time. This is, perhaps, one of the most prominent causes of misleading results in the literature.
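A moving average is the simplest such smoothing formula, and it exposes the trade-off directly; a sketch, with an illustrative drifting signal and illustrative window widths:

```python
import random

random.seed(11)

# A non-stationary process: a slowly drifting mean buried in unit-normal noise.
signal = [0.01 * t + random.gauss(0.0, 1.0) for t in range(1000)]

def moving_average(signal, window):
    """Local average over a sliding window of the given width."""
    return [sum(signal[t:t + window]) / window
            for t in range(len(signal) - window + 1)]

# A wider window cuts the statistical scatter (as 1/sqrt(window)) but smears
# the estimate over a longer stretch of time: the smoothing parameter is a
# direct trade of precision of the average against localisation in time.
smooth_narrow = moving_average(signal, 10)
smooth_wide = moving_average(signal, 100)
```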