**Smoothing of data**

Data smoothing is one of those processes that is easy to
implement with a glib formula, but has much more profound implications than most
users realise. In the following we assume that we start off with a set of
numbers, *x*_{k}, that have resulted from sampling some process in
the real world, such as temperature, and that the interval between samples is *T*.

Technically, data smoothing is a form of low pass filtering, which means that it blocks out the high frequency components (short wiggles) in order to emphasise the low frequency ones (longer trends). There are two popular forms: (a) the running mean (or moving average) and (b) the exponentially weighted average. They are both implemented by means of efficient recursive formulae:
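
In the usual notation, and assuming the smoothed sequence is written *y*_{k} (a symbol introduced here for illustration), the two recursive forms can be sketched as:

```latex
\begin{align}
\text{(a) running mean:} \quad
  y_k &= y_{k-1} + \frac{x_k - x_{k-n}}{n} \\
\text{(b) exponentially weighted average:} \quad
  y_k &= b\,y_{k-1} + (1-b)\,x_k, \qquad 0 < b < 1
\end{align}
```

Form (b) is written with *b* weighting the previous output, which is the convention that gives a time constant of -*T*/ln(*b*); some texts put the weight on the new sample instead.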

In each case, from an original sequence of numbers, *x*_{k},
a new smoothed sequence is generated.

**SPECIAL NOTE**: It is **NOT** necessary to
recalculate a complete average for each new point. It is surprising how often
this is done. Even the Mathcad® statistical tutorial falls into this trap. In
the running mean, smoothing a sequence of length *L* then results in *(n-1)L*
unnecessary calculations, which can be a very large number with strong smoothing
of long sequences, resulting in long calculation times.
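
The difference between the two approaches can be sketched in Python. This is an illustration rather than the author's own code: the function names are invented here, the recursive forms assume the usual update rules (add the newest sample and drop the oldest for the running mean, weight the previous output by *b* for the exponentially weighted average), and samples before the start of the sequence are taken to be zero.

```python
def running_mean_naive(x, n):
    """Recalculate the complete n-point average for every output point."""
    out = []
    for k in range(len(x)):
        # Samples before the start of the sequence are assumed to be zero.
        window = [x[j] if j >= 0 else 0.0 for j in range(k - n + 1, k + 1)]
        out.append(sum(window) / n)
    return out


def running_mean_recursive(x, n):
    """Update the previous output: one sample enters, one leaves."""
    out = []
    y = 0.0  # assumed starting value of the smoothed sequence
    for k, xk in enumerate(x):
        oldest = x[k - n] if k >= n else 0.0  # sample leaving the window
        y += (xk - oldest) / n
        out.append(y)
    return out


def exp_weighted_average(x, b):
    """Exponentially weighted average, with b weighting the previous output."""
    out = []
    y = 0.0  # assumed starting value of the smoothed sequence
    for xk in x:
        y = b * y + (1.0 - b) * xk
        out.append(y)
    return out
```

Both running mean functions give the same output to within rounding, but the naive one performs *n* additions per point where the recursive one performs two. One caveat with the recursive running mean is that floating-point rounding errors can accumulate over very long sequences.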

Each of these methods has one parameter that must be
chosen. For the running mean, the value of *n* determines how many numbers from
the old sequence are averaged to produce each point in the new sequence. For the
exponentially weighted average, the value of *b* determines the effective time
constant of the filter (actually -*T*/ln(*b*)).
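
In practice it is often easier to start from the time constant that is wanted and work back to *b*. A minimal sketch, assuming the relation quoted above and using hypothetical function names:

```python
import math


def b_from_time_constant(tau, T):
    """Return the b that gives time constant tau at sampling interval T."""
    return math.exp(-T / tau)


def time_constant_from_b(b, T):
    """Return the effective time constant, -T / ln(b), for 0 < b < 1."""
    return -T / math.log(b)
```

For example, samples taken once a second (*T* = 1) and a desired time constant of 10 seconds give *b* = exp(-0.1), roughly 0.905.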

**Complications**

1. Transient response

It is one of the implications of the uncertainty principle
that, when we take a finite block of a process to represent the whole of it,
there are unavoidable errors. In this case they take the form of the transient
response. You can demonstrate the transient response by putting the step
function test sequence (*x*_{k} = 1, 1, 1, 1, 1, …) into each formula. As this sequence is
already smooth, the ideal output should be the same as the input, but the running
mean ramps up to the value 1 over the first *n* points.

Various methods are used to overcome this problem in the
running mean without discarding the first *n* output points. One is to
taper the average, so that the first output point is an average of one, the
second an average of two etc. up to the *n*^{th} point. This means
that the beginning of the output sequence is relatively unsmoothed, which can be
misleading. Another slightly better method is to precalculate the average of the
input sequence and pack *n* numbers equal to this value into the front of
the sequence. In either case, it is not desirable to make deductions from the
first *n* smoothed points.
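
The transient and the two remedies can be tried out with a short Python sketch. The function names are invented for illustration, and the plain version assumes that samples before the start of the sequence are zero, which is what produces the ramp.

```python
def running_mean_zero_start(x, n):
    """Plain recursive running mean; earlier samples are assumed to be zero."""
    out = []
    y = 0.0
    for k, xk in enumerate(x):
        oldest = x[k - n] if k >= n else 0.0
        y += (xk - oldest) / n
        out.append(y)
    return out


def running_mean_tapered(x, n):
    """Average over however many points are available so far, up to n."""
    out = []
    total = 0.0
    for k, xk in enumerate(x):
        total += xk
        if k >= n:
            total -= x[k - n]  # drop the sample leaving the window
        out.append(total / min(k + 1, n))
    return out


def running_mean_prepadded(x, n):
    """Pack n copies of the overall input average in front of the sequence."""
    mean = sum(x) / len(x)
    padded = [mean] * n + list(x)
    total = mean * n  # the window starts out full of the padding value
    out = []
    for k in range(n, len(padded)):
        total += padded[k] - padded[k - n]
        out.append(total / n)
    return out
```

With a step input of ones and *n* = 4, the plain version gives 0.25, 0.5, 0.75, 1.0 and then stays at 1.0, while the tapered and pre-padded versions give 1.0 throughout. On real data the first *n* outputs of either remedy still deserve suspicion, for the reasons given above.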

2. Frequency response

The frequency response of the running mean formula is actually rather complicated, taking the form of what is known as a sinc function. This goes through a number of zeroes and a number of maxima as the frequency increases. Here is the actual frequency response for *n* = 5 and *n* = 8:

We can see that some interfering frequencies can be completely eliminated, yet a higher frequency is reduced only by a factor of 5. Thus we have to be very careful about identifying apparent periodicities in data smoothed by this method.
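
For reference, the gain of an *n*-point running mean at frequency *f* can be written in closed form. This is a standard result quoted here as a sketch, with *T* taken to be the interval between samples:

```latex
\[
|H(f)| = \left| \frac{\sin(\pi f n T)}{n \, \sin(\pi f T)} \right|
\]
```

The zeroes fall at *f* = *m*/(*nT*) for whole numbers *m* that are not multiples of *n*, and the gain rises again between them. For *n* = 5 the gain at the highest representable frequency, *f* = 1/(2*T*), is exactly 1/5.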

The gain of the exponentially weighted average falls smoothly and steadily as the frequency increases, with no zeroes or secondary maxima, so it does not have these problems.
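
The two responses can be compared numerically with a small Python sketch. The function names are hypothetical; the running mean gain is summed directly from its *n* equal weights, and the exponentially weighted average is assumed to have the form in which *b* weights the previous output.

```python
import cmath
import math


def running_mean_gain(f, n, T=1.0):
    """Gain of an n-point running mean at frequency f (in cycles per unit time)."""
    w = 2.0 * math.pi * f * T
    h = sum(cmath.exp(-1j * w * m) for m in range(n)) / n
    return abs(h)


def exp_weighted_gain(f, b, T=1.0):
    """Gain of y[k] = b*y[k-1] + (1-b)*x[k] at frequency f."""
    w = 2.0 * math.pi * f * T
    return abs((1.0 - b) / (1.0 - b * cmath.exp(-1j * w)))
```

With *T* = 1 and *n* = 5, `running_mean_gain` is zero (to rounding error) at *f* = 0.2 and 0.4 but recovers to 0.2 at *f* = 0.5. `exp_weighted_gain` with *b* = 0.8 falls steadily from 1 at *f* = 0 to (1 - *b*)/(1 + *b*), about 0.11, at *f* = 0.5.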

3. Phase response

The running mean is what is known technically as a linear phase filter, which means that, though all the frequency components are treated with different gains, they are all delayed by the same length of time. The exponentially weighted average does not have this property, so there is an extra form of distortion of the shape of the sequence.
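
The difference in delay can be checked with a short Python sketch. The function names are invented here, the exponentially weighted average is assumed to have *b* weighting the previous output, and the check is restricted to frequencies below the first zero of the running mean, at 1/(*nT*), where the phase has not yet jumped.

```python
import cmath
import math


def running_mean_response(f, n, T=1.0):
    """Complex frequency response of an n-point running mean."""
    w = 2.0 * math.pi * f * T
    return sum(cmath.exp(-1j * w * m) for m in range(n)) / n


def exp_weighted_response(f, b, T=1.0):
    """Complex frequency response of y[k] = b*y[k-1] + (1-b)*x[k]."""
    w = 2.0 * math.pi * f * T
    return (1.0 - b) / (1.0 - b * cmath.exp(-1j * w))


def phase_delay(h, f):
    """Time delay implied by the phase of response h at frequency f > 0."""
    return -cmath.phase(h) / (2.0 * math.pi * f)
```

For *n* = 5 and *T* = 1 the running mean delays every frequency below 0.2 by the same amount, (*n* - 1)*T*/2 = 2 samples. The exponentially weighted average with *b* = 0.8 delays a component at *f* = 0.05 by more than one at *f* = 0.15, which is the source of the extra shape distortion.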

**Discussion**

Data smoothing is a very useful technique for emphasising apparent slow trends in sequences of data. We have to be very careful, however, not to push it too far, especially in trying to identify periodicities in the data. We must also avoid giving too much credence to variations at the beginning (or the end!) of the smoothed sequence. Given these provisos, both the exponentially weighted average and the running mean are effective and can be implemented by means of efficient recursive formulae, though surprisingly often extremely inefficient non-recursive forms are applied. These simple examples are of real-time (or one-sided) filters, which only use present and past values, a necessary constraint in many important applications. There are many more elaborate methods, which require a much higher level of precaution.

The illustrations are condensed from *Laboratory
online computing*, a very old (1975) and forgotten book by the author.