| Neville Hunt's Homepage |
| Demonstrating sampling distributions in Excel |
| by
Neville Hunt and Houshang Mashhoudy, Coventry University Acknowledgement This article was published in The Spreadsheet User Volume 4, Number 1, May 1997.
Introduction In a recent strand of discussion on the edstat-l list it was suggested that one of the most difficult concepts to teach to an elementary statistics class is that of a sampling distribution. Many attempts have been made to use computer simulation to help explain this concept, employing resources as diverse as Minitab (Ryan et al, 1985) and graphic calculators (Graham, 1997). When adopting such an approach there is a danger that students will observe passively the results of the simulation without really understanding the underlying processes. This article will show how students with a basic knowledge of Excel can construct a spreadsheet to investigate the distributions of various sample statistics. Although "do-it-yourself" is the best way to develop students’ understanding, classes which lack Excel expertise can be provided with the spreadsheet ready-made by the teacher. Either way the workings of the spreadsheet should be plain to all. The basic structure can be transferred to other spreadsheet packages, although some of the refinements are only possible within Excel. We shall demonstrate sampling from four very different populations: Normal, uniform, exponential and binary. To provide some link between the four we have chosen the ubiquitous traffic survey as a common context.
Sampling from a Normal Population Suppose we are concerned about the speeds of cars on a local highway. It is reasonable to assume that the speeds follow a Normal distribution with mean 50 mph and standard deviation 16 mph. These can be simulated using the NORMINV function in Excel. We shall take 100 samples, in rows 16 to 115 of the worksheet.
Now, to simulate 100 such samples:
Figure 1: Initial spreadsheet The spreadsheet should look like Figure 1. Now we need to display and describe the sample means.
The completed spreadsheet is shown in Figure 2 where, to facilitate increasing the sample size, rows 19 to 113 have been hidden (using Format Rows Hide). To increase the sample size, simply select B15:B115 and use the drag-handle (or Copy and Paste) to copy the formulae and heading across to column C and beyond. To reduce the sample size, simply drag the selection back again (or Delete).
Figure 2: Completed spreadsheet Students should be encouraged to draw up a table showing the mean and s.d. (or standard error, as it is commonly referred to in this context) of the distribution of sample means (the sampling distribution) for different sample sizes. Some typical results are shown below. The results demonstrate the unbiasedness of the sample mean as an estimator of the population mean (50 here) and the inverse square root relationship between standard error and sample size (n). The Normality of the sampling distribution is unlikely to be questioned at this stage.
Sampling from a Uniform Population Now suppose that each motorist surveyed is asked to estimate what percentage of their journey distance has been completed. It is reasonable to assume that this is uniformly distributed between 0% and 100%. To simulate data from this population requires the simple formula =100*RAND( ).
Repeat the previous exercise, this time encouraging students to examine the shape of the sampling distribution carefully. The Central Limit Theorem states that if random samples of size n are taken from a population, the sample means will follow a Normal distribution, provided n is sufficiently large. Here, because of the symmetry of the population, the sampling distribution appears Normal even for very small sample sizes.
Sampling from a Skewed Population Suppose we now turn our attention to the time-gap between each car passing a certain point. These inter-arrival times could follow an Exponential distribution, which is skewed towards smaller values. We shall assume that one car passes every 30 seconds on average, so that the population mean and s.d. are both 30. These times can be simulated using a transformation of RAND( ).
Note that one or two means may exceed 100 and are therefore missing from the histogram. The previous exercise can now be repeated. Students will observe that as n increases the skewness of the sampling distribution disappears, in accordance with the Central Limit Theorem.
Sampling from a Binary Population Finally, consider the percentage of drivers who are female. Let 100 represent a female and 0 represent a male. Assuming an equal number of male and female drivers, we are thus sampling from a population consisting of an equal number of 0’s and 100’s, with mean 50 and s.d. 50.
The sample mean is now the percentage of females in the sample. Repeating the previous exercise we have found students to be genuinely astonished by the behaviour of this sampling distribution as n increases. Figure 3 shows the distribution when n=3. By n=16, normality has been restored.
Figure 3: Sampling from a binary population
Further Ideas If teachers decide to prepare the spreadsheet in advance for students to use, some enhancements are possible. For example, students might find it easier if the sample size could be changed by altering a value in a cell. Let the sample size be in B13 and name this cell n. In the first example, the formula in B16 should now be =IF(B$15<=n, NORMINV(RAND(),50,16),""). Select B15:B16 and copy across to Z15 then down to Z115. If the sample size is 3, for example, then columns E to Z remain blank. Pedants might wish to put a message in A14 to the effect =IF(n>25,"Maximum sample size 25","") in a bold red font! The parameters of each population have been carefully chosen to fit the range 0 to 100, so that we can keep the same fixed scale on the histogram. We believe it is advantageous to observe the "narrowing" of the distribution as n increases, although for larger sample sizes the shape of the histogram becomes rather ill-defined. On faster computers it may be feasible to increase both the number of samples simulated and the number of classes in the frequency distribution. Finally, by changing the formulae in A16:A115 the less well known sampling distributions of other statistics can be investigated. Martin et al (1996) have shown that the distributions of the sample median and sample mid-range give rise to some striking and thought-provoking results. Similarly Coles (1994) advocates investigating the sampling distributions of the minimum and maximum.
Conclusion Clearly a macro could be written to run these simulations at the click of a button, perhaps with a dialogue box offering different parent populations, parameters and sample sizes. However, there is a real danger of over-elaboration with students ever gasping at the undeniable cleverness but never grasping the underlying concept!
References Coles, S.G. (1994). Learning about extremes. Teaching Statistics,
16(1), 23-25. |
| All Contents © Neville Hunt 1999 |