How can I describe and present numerical data?
Before launching into statistical tests of hypotheses on your data, it is important (both for yourself and for your reader) that the data be understood and summarised. Categorical data (see FAQs 12 and 13) can be summarised in tables and charts of various kinds, but for continuous variables (i.e. ordinary numbers) some idea of the distribution of each variable is needed. When we collect data, let us say the heights of men from a particular ethnic group, the distribution of the heights can be displayed as a histogram. To do this, the number of men with heights in each of a set of (usually) equal intervals is first tabulated and then displayed as bars whose heights are proportional to the count, or frequency, of men in each interval. An example is shown in Figure 1.
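The tabulation step behind a histogram can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_counts`, the bin limits and the twelve heights are made up for the example and are not the data behind Figure 1.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width intervals.

    Interval i covers start + i*width up to (but not including)
    start + (i+1)*width; values outside the overall range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


# Hypothetical heights in mm (illustrative only).
heights = [1602, 1715, 1748, 1790, 1812, 1835, 1860, 1874, 1901, 1950, 2010, 2130]
counts = histogram_counts(heights, start=1600, width=100, n_bins=6)

# A crude text histogram: one '#' per man in each interval.
for i, c in enumerate(counts):
    lo = 1600 + i * 100
    print(f"{lo}-{lo + 100} mm: {'#' * c} ({c})")
```

A plotting package would draw the bars for you, but the counts it plots are obtained in essentially this way.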
The rectangular pillars show the count of men, in the sample of 100 men, who have heights between the limits shown on the horizontal scale of heights. The information to the bottom right of the figure gives the mean or average height for these men and a quantity called the standard deviation. The mean (average) height was 1875.5mm. The standard deviation is a measure of the spread of the heights in the sample; it is 119.4mm. We may note that the range, i.e. the difference between the maximum and minimum heights in the sample, was 545mm (the maximum was 2141mm and the minimum was 1596mm), and the range is also a measure of spread. Technically the standard deviation is the square root of the mean squared deviation of the heights from their mean. So, to calculate it, subtract the mean from each height, square each of these differences, sum the squares and divide by the number of men, then take the square root of the result. In practice we divide by the number of men minus one rather than by the number of men, because this makes the quantity under the square root (the variance) an unbiased estimate of the variance of the population from which the sample of men was drawn. Unbiased estimates are discussed elsewhere.
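As a rough sketch, the calculation of the standard deviation can be written out step by step in Python and checked against the standard library. The five heights are hypothetical and chosen only to keep the arithmetic easy to follow; `sample_sd` is a name invented for this example.

```python
import math
import statistics


def sample_sd(values):
    """Standard deviation using the 'number of cases minus one' divisor."""
    n = len(values)
    mean = sum(values) / n
    # Squared deviation of each value from the mean, summed.
    sum_sq_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(sum_sq_dev / (n - 1))


# Hypothetical heights in mm (illustrative only).
heights = [1750, 1800, 1850, 1900, 1950]

sd = sample_sd(heights)
spread = max(heights) - min(heights)  # the range, another measure of spread

print(f"mean  = {sum(heights) / len(heights):.1f} mm")
print(f"sd    = {sd:.2f} mm")  # same as statistics.stdev(heights)
print(f"range = {spread} mm")
```

Dividing by the number of cases instead (as `statistics.pstdev` does) gives a slightly smaller value; the difference shrinks as the sample gets larger.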
The mean and standard deviation are all that is needed to define a normal (bell-shaped) curve that represents the population of men, and this curve is shown superimposed on the histogram in Figure 1.
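To illustrate how the two numbers fix the whole curve, the sketch below builds a normal distribution from the mean and standard deviation quoted above and asks how many of the 100 men it would place in a given interval. The interval limits are chosen for illustration and are not necessarily those used in Figure 1; `expected_count` is a hypothetical helper.

```python
from statistics import NormalDist

# Mean and standard deviation of the sample of 100 men, in mm.
mean, sd, n = 1875.5, 119.4, 100
curve = NormalDist(mu=mean, sigma=sd)


def expected_count(lo, hi):
    """Number of the n men the fitted normal curve places between lo and hi."""
    return n * (curve.cdf(hi) - curve.cdf(lo))


# Roughly 68% of a normal population lies within one standard deviation of the mean.
within_one_sd = expected_count(mean - sd, mean + sd)
print(f"expected within 1 sd of the mean: {within_one_sd:.1f} of {n}")

# Expected count in an illustrative 100 mm interval.
print(f"expected between 1800 and 1900 mm: {expected_count(1800, 1900):.1f}")
```

Comparing such expected counts with the observed bar heights is an informal way of judging how well the normal curve fits the histogram.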
Another way of displaying the data is shown in Figure 2. This is called a boxplot, for obvious reasons. Fifty percent of the sample have heights between the upper and lower edges of the box. Fifty percent have heights above (and fifty percent below) the heavy black line inside the box. So 25% of the heights are between the top edge of the box and the upper end of the top-most vertical line (sometimes called a whisker). Similarly, 25% of the heights are between the lower edge of the box and the end of the lower whisker. [For some data there are rogue values that do not seem to be part of the main sample, and there are rules that classify these as either outliers or extreme values. Outliers are cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box. Extreme values are cases with values more than 3 box lengths from the upper or lower edge of the box.]
The thick line inside the box is the median value; 50% of the cases have values above (and below) the median. The lower and upper edges of the box represent the lower and upper quartiles of the data. The difference between the upper and lower quartiles is the inter-quartile range. If the data is symmetrically distributed about the mean then the mean equals the median. This is true of the normal distribution.
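The quantities drawn in a boxplot, together with the outlier and extreme-value rules given above, can be sketched as follows. The nine heights are hypothetical, `box_summary` is a name made up for the example, and the quartiles come from Python's `statistics.quantiles` with its "inclusive" method; statistical packages use slightly different quartile definitions, so their box edges may not match these exactly. How values lying exactly on a boundary are treated is also an assumption here.

```python
import statistics


def box_summary(values):
    """Median, quartiles, inter-quartile range and flagged cases for a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # the box length
    outliers, extremes = [], []
    for v in values:
        # Distance beyond the nearer box edge (negative for values inside the box).
        distance = max(q1 - v, v - q3)
        if distance > 3 * iqr:
            extremes.append(v)
        elif distance > 1.5 * iqr:
            outliers.append(v)
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "outliers": outliers, "extremes": extremes}


# Hypothetical heights in mm, with two deliberately large values.
heights = [1700, 1780, 1820, 1850, 1870, 1890, 1920, 2100, 2300]
summary = box_summary(heights)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these made-up values the box runs from 1820 to 1920, so the box length is 100; the value 2100 lies 180 beyond the upper edge and is flagged as an outlier, while 2300 lies 380 beyond it and is flagged as an extreme value.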
There are a number of other descriptive statistics that are useful for summarising data, and some are shown in the following table, which was produced in SPSS from the data of heights of 100 men. There is insufficient space to describe fully those not already covered above, except to say that the standard error refers to the sampling uncertainty in the estimate of the population mean that is given by the sample mean. The skewness is a measure of whether the distribution has a longer tail in one direction than the other (which is also signalled by a difference between the mean and the median). The kurtosis measures the "peakiness" of the distribution relative to the normal distribution.
| Variable | Measure | Statistic | Std. error |
|---|---|---|---|
| heights | Mean | 1875.55 | 11.94 |
| | 95% Confidence Interval for Mean: Lower Bound | 1851.85 | |
| | 95% Confidence Interval for Mean: Upper Bound | 1899.24 | |
| | 5% Trimmed Mean | 1874.85 | |
| | Median | 1870.66 | |
| | Variance | 14260.92 | |
| | Std. Deviation | 119.419 | |
| | Minimum | 1596.47 | |
| | Maximum | 2141.32 | |
| | Range | 544.86 | |
| | Interquartile Range | 173.94 | |
| | Skewness | .091 | .241 |
| | Kurtosis | -.476 | .478 |
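Several entries in the table can be recovered from one another, which is a useful check on one's understanding. The sketch below starts from the tabulated mean, standard deviation and sample size. The multiplier 1.984 is the approximate 97.5th percentile of Student's t distribution with 99 degrees of freedom, and the formulas for the standard errors of skewness and kurtosis are the commonly documented ones that depend only on the sample size; that SPSS used exactly these is an assumption, although they do reproduce the tabulated values to the precision shown.

```python
import math

# Values taken from the table (heights in mm, sample of 100 men).
n = 100
mean = 1875.55
sd = 119.419
minimum, maximum = 1596.47, 2141.32

# Standard error of the mean: sd divided by the square root of the sample size.
se_mean = sd / math.sqrt(n)

# 95% confidence interval for the mean.
t_crit = 1.984  # assumed: approx. 97.5th percentile of t with n - 1 = 99 df
ci_lower = mean - t_crit * se_mean
ci_upper = mean + t_crit * se_mean

# The variance is the square of the standard deviation; the range is max - min.
variance = sd ** 2
data_range = maximum - minimum

# Assumed formulas for the standard errors of skewness and kurtosis.
se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * math.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))

print(f"std. error of mean : {se_mean:.2f}")
print(f"95% CI for mean    : {ci_lower:.2f} to {ci_upper:.2f}")
print(f"variance           : {variance:.2f}")
print(f"range              : {data_range:.2f}")
print(f"std. error skewness: {se_skew:.3f}")
print(f"std. error kurtosis: {se_kurt:.3f}")
```

Small discrepancies in the last decimal place against the table are to be expected, because the inputs used here are themselves rounded.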
There are also other ways of summarising simple data graphically, but the histogram and boxplot are the most useful (though this may be a matter of opinion). Both histogram and boxplot can be used to compare two samples, e.g. if data of the heights of a sample of women were available these could be shown side-by-side with the graphs of the men's data. Boxplots are particularly useful for this purpose.