Module 1.2: Describing Data with Pictures

Module 1.2 Notes
Describing Data with Pictures

Index to Module One Notes

1.2: Describing Data: Pictures

1.3: Describing Data: Number Summaries

This section of Module 1 will get us into Microsoft Excel, and take us to the point of creating a histogram summary of numerical data. Table 1.2.1 shows two and a half years of cycle time data from the manufacturing firm introduced in Module 1.1 Notes.

Table 1.2.1

	1997	1998	1999
January	19	20	21
February	21	22	21
March	20	19	29
April	16	20	25
May	18	23	21
June	23	19	29
July	22	27
August	24	22
September	17	18
October	26	20
November	20	19
December	21	18

Enter this data into a single column in a Microsoft Excel spreadsheet. For some extra practice with Excel features, you can place this column to the right of a column for Month/Yr, as shown in Worksheet 1.2.1. For example, I placed the months of the years in column A. Thus, column A, row 1 (cell A1), has the title and the specific months go from cell A2 to A31. In Microsoft Excel 98 and 2000, the default format for month/year is Mon-XX. So, when I typed January 1997 in Cell A2, and hit the enter key, Jan-97 appeared (you can change the default format of any cell by selecting Format on the Standard Toolbar, then Cells from the pulldown menu, then the Number tab, then whatever format is desired from the available list). Now, if you select cell A2, note a small square in the lower right of the highlighted cell border. Click and drag that square down to cell A31 and all of the months/years will enter successively - slick, huh?!

Cell B1 has the title for the variable "Time," and the data is entered from cell B2 to B31. There is no magic way of entering the data for the variable "Time" in column B, at least not until I get voice data entry capability. Of course, the typical FGCU MBA student has data entry employees who can do this for them at work - right?!!

An aside: if you have any questions about Excel as we go through the course, please e-mail/call me so we can get them resolved quickly. Another option for questions: I find the Excel Help feature getting better and better with each version, but it is getting so big that you seem to need basic search engine skills to locate your topic quickly.

I then copied the Time column into Column C and re-titled it "Sorted." One way to copy a range of data: select or highlight cells B1 to B31 by clicking on B1, then dragging to B31, select the Copy Icon on the Formatting Toolbar, select or highlight Cell C1, and click the Paste Icon on the Formatting Toolbar. Sort the data from lowest to highest by selecting or highlighting Cells C1 to C31, then click on the Sort Ascending Icon (A to Z) on the Formatting Toolbar. Note that his technique sorts the data in rows based on the content of one column. To sort the data in rows based on contents of two or more columns, select the rows and columns of your data range, click on Data on the Standard Toolbar, select Sort from the pulldown menu, and then follow the dialog screen entry instructions.

Did you replicate the columns in Worksheet 1.2.1? If not, you should e-mail or phone me for help. If so, save your work to a file before going on. Assignment 1 requires entry of your data, copying to a separate column, then sorting the duplicate column of data (Items 1 and 2 of Project Assignment 1).

Worksheet 1.2.1

Month/Yr	Time	Sorted
Jan-97	19	16
Feb-97	21	17
Mar-97	20	18
Apr-97	16	18
May-97	18	18
Jun-97	23	19
Jul-97	22	19
Aug-97	24	19
Sep-97	17	19
Oct-97	26	20
Nov-97	20	20
Dec-97	21	20
Jan-98	20	20
Feb-98	22	20
Mar-98	19	21
Apr-98	20	21
May-98	23	21
Jun-98	19	21
Jul-98	27	21
Aug-98	22	21
Sep-98	18	22
Oct-98	20	22
Nov-98	19	22
Dec-98	18	23
Jan-99	21	23
Feb-99	21	24
Mar-99	29	25
Apr-99	25	26
May-99	21	27
Jun-99	21	29

Now it's time to create a histogram to summarize the data in a picture. As we will see, the histogram tells us something about the center of the data, the spread of the data, and the shape of the data.

The fastest way to create a histogram is to use the Data Analysis Histogram Tool. To do this, you have to activate the Data Analysis Add-In feature of Excel. Select Tools on the Standard Menu Bar and check to see if Data Analysis is available on the pulldown menu. If not, select Add-Ins and the Add-Ins dialogue box will appear. Select the Analysis ToolPak check box. Click OK and exit Excel (CAUTION: Save your Excel File before exiting). When you open Excel, the Data Analysis Add-In feature should be available through the Tools pulldown menu on the Standard Toolbar.

Excel 2007 is slightly different. In Excel, go to top left Windows icon and click it once. At the bottom of the pop-down window click Excel Options. In the left column, select Add-Ins. On the right box, select Analysis Toolpak. Click on GO button at the bottom of window. When the Add-Inns windows popup, check the square for Analysis Toolpak and then OK. To confirm, go to the DATA tab in Excel. You should have a DataAnalysis menu window. For a demonstration on how to install the Add-in feature in Excel 2007, select this link.

If the Analysis ToolPak selection is not available in the Add-Ins, the Data Analysis component was probably not included when your copy of Excel was installed on your computer. Select Help on the Standard Toolbar, then select Microsoft Excel Help in the pulldown menu, select the Index tab, type keyword "Installation," select Search, and then select Install or Remove Individual Features, and follow the instructions. The Data Analysis component allows Excel to be somewhat competitive with statistical software packages. The component doesn't have the full functionality of software packages such as Minitab, SPSS or SAS, but it has the features commonly used in business applications... and these are on a spreadsheet used by most of the world.

After you have activated this important "Add In" that we will be using throughout this course, create a histogram. To do this, select Tools on the Standard Toolbar, then Data Analysis from the pulldown menu and respond to the dialog screen. I entered C1:C31 for the input range and checked Label in the Label Checkbox. This means that the first cell of my data range actually is a title rather than a data entry. If I would have entered C2:C31 for the input range, I would not have checked Label (if you are like me, you will make this mistake several times before you get the hang of Excel Data Analysis dialog screens). Skip bin range to let the software create a default set of frequency bars for the histogram. I like to put the histogram on the same page as my data so I select Output Range, then put the cursor in the rectangle next to Output Range, and type in a cell location for placement of my histogram. The cell location will identify the upper left corner of where the histogram will appear in the worksheet, so just make sure the worksheet is clear to the right and down from the selected cell location. I entered D1 as the worksheet was empty to the right and down from D1. Select Chart Output and OK and you should get something like Worksheet 1.2.2.

Worksheet 1.2.2

Bin	Frequency
16	1
18.6	4
21.2	15
23.8	5
26.4	3
More	2

The bins of the histogram represent a convenient way of summarizing the data by showing the frequency of observations at preselected intervals - the family name given to this type of chart then, is the frequency distribution. The histogram shows one observation in the Bin labeled 16. Next, there are 4 observations in the Bin with a value of 18.6 (observations with values of 17, 18, 18, and 18). This means that in the interval of numbers greater than 16 and less than 18.6 there are four numbers: 17, 18, 18, and 18. Likewise, there are 15 numbers in the interval of numbers greater than 18.6 and less than 21.2, and so forth. The bin labels are the upper limits of each class interval chosen by Excel.

How did Excel determine the interval of each bin should be 2.6? It's fairly simple, actually (you can skim this and the next paragraph if you want to go directly to interpretation of Center and Spread below). First, Excel computes the range of the data: highest number minus the lowest number = 29 - 16 = 13. Next, Excel divides the range by the number of intervals or bars desired to summarize and display the data. The default number of intervals is 5, so 13 divided by 5 gives a class interval range of 2.6. Now Excel divides the numbers 16 through 29 into five class intervals, each with a range of 2.6.

- The first interval includes the numbers < 16 (there is one 16)
- The second interval includes numbers > 16 to 18.6
- The third interval includes the numbers > 18.6 to 21.2
- The fourth interval includes the numbers > 21.2 to 23.8
- The fifth interval includes the numbers > 23.8 to 26.4
- The last interval is titled "More" by Excel: and includes all the numbers above the last upper limit (26.4 in this case) for placing in an extra "bucket" or bin.

Some notes on the location and shape of the chart: What if you do not like the location of the histogram? Select the histogram by pointing and clicking anywhere on the white area of the chart. You should see 8 small squares called "handles" around the border of the chart. Now, click again anywhere on the white area of the histogram and drag the chart wherever you wish. You can also change the shape of the histogram. Point and click on the middle handle on the left border and drag to the left to make the histogram longer to the left, or drag into the histogram to make it smaller - you can do the same to the right side. Point and click on the middle handle of the top border and drag up to make the histogram wider to the top - same for the bottom. If you point, click and drag a handle on the corner, the histogram gets larger if you drag out, or smaller if you drag in. If you point and click on the histogram, select File on the Standard Toolbar, and Print, you can print just the histogram. If you do not point and click on the histogram and select File and Print, you can print the histogram and whatever else is on the worksheet, such as the data. I like to select File and Print Preview to see where the page borders are to determine if I have to move the histogram so it doesn't get cut up when printing.

Creating a histogram is sometimes considered more "art" than "science." We are trying to develop a picture that best illustrates the center, spread and shape of the data distribution to communicate information about the underlying process. More intervals provide greater detail at the expense of losing the identification of the process center. Fewer intervals provide a more concise summary at the expense of losing the identification of the process shape. Excel does a very good job at creating the histogram as long as there is are at least 30 observations. You can change the number of intervals, the bin labels, and the frequency count by manipulating the Bin and Frequency cells of the frequency table that accompanies each histogram. A rule of thumb for the number of classes is between 5 and 15 classes.

In certain applications, you want to create a histogram with specific class limits for easier interpretation of the data (for example, you may not want to show decimal points, or you are interested in partitioning the data into specific income groups). First input class limits in a block of cells called Bin, for example 16 in cell B2, 19 in cell B3, etc (see figure 1.2.1 below). Second, in the Histogram pull down window, input this range into the ‘Bin Range’. Excel creates a histogram with your desired class limits as indicated in Worksheet 1.2.2.1

Figure 1.2.1

Worksheet 1.2.2.1

Bin	Frequency
16	1
19	8
22	14
25	4
28	2
More	1

The first interval includes the numbers < 16 (there is one 16)
- The second interval includes numbers > 16 to 19
- The third interval includes the numbers > 19 to 22
- The fourth interval includes the numbers > 22 to 25
- The fifth interval includes the numbers > 25 to 28
- The last interval is titled "More" includes all the numbers above the last upper limit (28 in this case).

The Center and Spread of a Set of Data

We can see that the center of the data appears at about 21, and the spread is from 16 to "More." You can change "More" to 29 where 29 is the highest number in the data set by typing 29 over the word "More" in the frequency table next to the histogram. This would then allow the reader to see that the spread of the data is from a low of 16 to a high of 29. Exact numerical measures of the center and spread of the data will be introduced in Module 1.3.

The Shape

One other thing we get from the histogram is the shape - the shape shown in Worksheet 1.2.2 is a symmetric bell-shaped distribution called the Normal Probability Distribution. The histograms of many distributions appear like this, especially if the continuous data in a sample under study is randomly selected, without bias to very high or very low numbers, and without multiple processes mixed together. I will illustrate the effects of very high or very low numbers next.

Suppose for a moment that the last observation (Jun 99) was 38 days rather than 29 days. Now look at the histogram in Worksheet 1.2.3.

Worksheet 1.2.3

Bin	Frequency
16	1
20.4	13
24.8	12
29.2	3
33.6	0
38	1

Note that the 38 lies outside and to the right of the range and center of the data - it is called an outlier. When outliers occur to the right of the range and center, we say the distribution is skewed right. Generally, outliers are events that are not created or expected from a process and should be removed and investigated. In statistical process control, an outlier is called a signal, and is an indication that a process is going out of control and must be shut down and examined. Once the outlier is removed, the histogram will return to its symmetric shape.

What happens if the June 1999 observation was an unexpected short cycle time, such as 7 days. Look at the resulting Histogram in Worksheet 1.2.4.

Worksheet 1.2.4

Bin	Frequency
7	1
11	0
15	0
19	9
23	16
27	4

This histogram shows an outlier or signal event to the left of the range of the data. When outliers occur to the left of the range and center, we say the distribution is skewed left.

Bi-Modal Shape

The next histogram, Worksheet 1.2.5, shows two distinct bell-shaped distributions, one with a center at about 10, the other between 21 and 22. The histogram would suggest that two processes are actually in the set of data - perhaps a set of data representing a process before improvements were made (center at 21.6 days), and one set of data representing a process after improvements were made (center at 10 days). This is a very common phenomenon in business. For example, in season variables may have much higher values than off season. As we will see in Modules 3 and 4, these type of Bi-Modal, or even multi-modal distributions should be separated or stratified before analysis of the data.

Worksheet 1.2.5

Bin	Frequency
3	1
6.7	10.0
10.4	17.0
14.1	2.0
17.9	2.0
21.6	18.0
25.3	7.0
29	3

There are some numerical summary measures that help us measure shape. These will be introduced in Module 1.3. The histogram is the third item required in your analysis of data for Project Assignment 1.

Pause and Reflect

Pictures, such as histograms, quickly communicate information about the center, spread and shape of distributions so that we begin to know about the underlying process that generates the data. This is true whether we are describing data from a population of interest, or data in a sample drawn from a population.

To complete this module notes, you can watch instructional videos on how to build Histograms with Excel through these 2 links:

1) http://ruby.fgcu.edu/courses/ekirche/jing/fa09/histogram.swf

2) http://ruby.fgcu.edu/courses/ekirche/jing/fa09/histogram2.swf

Sampling Concepts & Procedures

The backbone of statistical analysis entails understanding the extent to which the sample accurately represents the whole population. Many interesting and potentially problematic issues arise in taking and using samples, including bias, sample size too small, receiving dishonest response to questions that may be controversial or sensitive. And we are not just interested in the sample (or sample results); we really want information about the whole population from which the sample came from. Randomness is key component of obtaining a representative sample that reflects the characteristics in the population. In business, data usually arises from accounting transactions or management processes (i.e., inventory, sales, and payroll). Much of the data we analyze were recorded without explicit consideration, yet many decisions may depend on the data. Let's go over some important definitions before you collect the data for your assignments.

A subject or individual is a single member of a collection of items that we want to study. A variable is a characteristic of a subject or individual such as employee’s income or invoice amount. A data set consists of all the values of all the variables for all the individuals we have chosen to observe, that is, a collection of observations. Table 1.1 relates the type of dataset with the amount of variables you may want to study/present. In Asgn 1, for example, will conduct a univariate study; that is, we are interested in describing a process/construct through only one variable or characteristic process. As you may recall in my example for Module 1, I am interested in measuring customer service which of course is composed of many variables/characteristics (reliability, on time delivery, etc). However, I elected to present/describe it only in terms of cycle time, which is viewed by management as very important business characteristic to please our customers. Therefore, in Module 1, we'll deal with univariate study.

Table 1.1

Module 2 will study bivariate data through simple regression analysis. Modules 3, 4, and 5 will require the use of multivariate dataset. Figure 1.1.4 illustrate a multivariate data set.

Figure 1.1.4

Sample or Census?

A sample involves looking only at some items selected from the population. A census is an examination of all items in a defined population. The census can be contrasted with sampling in which information is obtained only from a subset of a population. Census data is commonly used for research, business marketing, and planning as well as a base for sampling surveys.

Situations Where a Sample May Be Preferred …

· Infinite Population: No census is possible if the population is infinite or of indefinite size (an assembly line can keep producing bolts, a doctor can keep seeing more patients).

· Destructive Testing: The act of sampling may destroy or devalue the item (measuring battery life, testing auto crashworthiness, or testing aircraft turbofan engine life).

· Timely Results: Sampling may yield more timely results than a census (checking wheat samples for moisture and protein content, checking peanut butter for aflatoxin contamination).

· Accuracy: Sample estimates can be more accurate than a census. Instead of spreading limited resources thinly to attempt a census, our budget of time and money might be better spent to hire experienced staff, improve training of field interviewers, and improve data safeguards.

· Cost: Even if it is feasible to take a census, the cost, either in time or money, may exceed our budget.

· Sensitive Information: Some kinds of information are better captured by a well-designed sample, rather than attempting a census. Confidentiality may also be improved in a carefully-done sample.

Situations Where a Census May Be Preferred …

· Small Population: If the population is small, there is little reason to sample, for the effort of data collection may be only a small part of the total cost.

· Large Sample Size: If the required sample size approaches the population size, we might as well go ahead and take a census.

· Database Exists: If the data are on disk we can examine 100% of the cases. But auditing or validating data against physical records may raise the cost.

· Legal Requirements: Banks must count all the cash in bank teller drawers at the end of each business day. The U.S. Congress forbade sampling in the 2000 decennial population census.

Pause and reflect: A parameter is any measurement that describes an entire population. Usually, the parameter value is unknown since we rarely can observe the entire population. Parameters are often (but not always) represented by Greek letters

Figure 1.1.5

Statistics are any measurement computed from a sample. Usually, the statistic is regarded as an estimate of a population parameter. Sample statistics are often (but not always) represented by Roman letters.

Let's review two situations in which samples provide estimates of population parameters.

1. A tire manufacturer developed a new tire designed to provide an increase in mileage over the firm's current line of tires. To estimate the mean number of miles provided by the new tire, the manufacturer selected a sample of 120 new tires for testing. The test provided a sample mean of 36,500 miles. Hence, an estimate of the mean tire mileage for the population of new tires was 36,500 miles

2. Members of a political party were considering supporting a particular candidate for election to the U.S. Senate, and party leaders wanted an estimate of the proportion of registered voters supporting the candidate. The time and cost associated with contacting every potential voter were prohibitive. Hence, a sample of 400 registered voters was selected and 160 of the 400 voters indicated a preference for the candidate. An estimate of the proportion of the population of registered voters supporting the candidate was 160/400 = 0.40.

These two examples illustrate why samples are used. It is important to realize that sample results provide only estimates of the value of the population characteristics. We do not expect the mean mileage for all tires in the population to be exactly 36,500 miles, nor do we expect exactly 0.40, or 40%, of the population of registered voters to support the candidate. However, proper sampling methods, the sample results will provide 'good' estimates of the population parameters. Let's briefly see some types of sampling procedures.

Sampling procedures:

· Simple Random Sample: Use random numbers to select items from a list (e.g., VISA cardholders).

· Systematic Sample: Select every kth item from a list or sequence (e.g., restaurant customers).

· Stratified Sample: Select randomly within defined strata (e.g., by age, occupation, gender).

· Cluster Sample: Like stratified sampling except strata are geographical areas (e.g., zip codes).

· Judgment Sample : Use expert knowledge to choose “typical” items (e.g., which employees to interview).

· Convenience Sample : Use a sample that happens to be available (e.g., ask co-worker opinions at lunch).

· Focus Groups : In-depth dialog with a representative panel of individuals (e.g. iPod users).

A very important sampling procedure is called Simple Random Sampling (SRS): Every item in the population of N items has the same chance of being chosen in the sample of n items. Thus, we rely on random numbers to select a name.

Example: There are 48 names in the list presented in figure 1.1.6 and we need to select one at random. Each student is associated with an ordinal number from 1 to 48. One simple way is to write each number in individual papers and place them in a box to be blindly selected by one person. Another way is to use random tables (as presented in many statistical texts) that can be used to select a random number. However, in this course we'll use Excel random number generators to accomplish this task. The Excel function

=RANDBETWEEN(bottom,top)

Returns a random integer number between the numbers you specify. A new random integer number is returned every time the worksheet is calculated (or F9 key is pressed).

Syntax:

=RANDBETWEEN(bottom,top)

Bottom: is the smallest integer RANDBETWEEN will return.

Top: is the largest integer RANDBETWEEN will return.

Example:

=RANDBETWEEN(1,100) returns a random number between 1 and 100 (varies)

In our example, by inserting the function =RANDBETWEEN(1,48) into a cell it returned the value 44. That means Stephanie (associated with number 44) was randomly selected from her class.

Figure 1.1.6

When the data are arranged in a rectangular array (or Table), we can randomly select an item by randomly selecting a row and a column.

For example, we want to randomly select a department store given in the table above with 3 columns and 4 rows. We use the function =RANDBETWEEN(1,3) function to randomly choose a column, the return in Excel was 3, and =RANDBEWTEEN(1,4) to choose a row, the return in Excel was 3. This way, each item has an equal chance of being selected. The final random selection is column 3 and row 3, pointing to Target department store.

Often, we have the need to randomize a List as indicated in Figure 1.1.7. The Excel function RAND() is very useful in many situations to generate random numbers. It returns an evenly distributed random real number greater than or equal to 0 and less than 1. A new random real number is returned every time the worksheet is calculated or F9 key is pressed.

Syntax:

=RAND( )

Remarks:

· To generate a random real number between a and b, use:

RAND()*(b-a)+a

· =RAND()*100 A random number greater than or equal to 0 but less than 100 (varies)

If you want to use RAND to generate a random number but don't want the numbers to change every time the cell is calculated, you can enter =RAND() in the formula bar, and then press F9 to change the formula to a random number.

In our example, use function =RAND() beside each row to create a column of random numbers between 0 and 1. Let's see an example.

Name		Major	Gender
Claudia		Accounting	F
Dan		Economics	M
Dave		Human Res	M
Kalisha		MIS	F
LaDonna		Finance	F
Marcia		Accounting	F
Matt		Undecided	M
Moira		Accounting	F
Rachel		Oper Mgt	F
Ryan		MIS	M
Tammy		Marketing	F
Victor		Marketing	M
Rand	Name	Major	Gender
0,382091	Claudia	Accounting	F
0,730061	Dan	Economics	M
0,143539	Dave	Human Res	M
0,906060	Kalisha	MIS	F
0,624378	LaDonna	Finance	F
0,229854	Marcia	Accounting	F
0,604377	Matt	Undecided	M
0,798923	Moira	Accounting	F
0,431740	Rachel	Oper Mgt	F
0,334449	Ryan	MIS	M
0,836594	Tammy	Marketing	F
0,402726	Victor	Marketing	M

Figure 1.1.7

Now you sort from in ascending order. The first n items are a random sample of the entire list (they are as likely as any others).

Rand	Name	Major	Gender
0,143539	Dave	Human Res	M
0,229854	Marcia	Accounting	F
0,334449	Ryan	MIS	M
0,382091	Claudia	Accounting	F
0,402726	Victor	Marketing	M
0,431740	Rachel	Oper Mgt	F
0,604377	Matt	Undecided	M
0,624378	LaDonna	Finance	F
0,730061	Dan	Economics	M
0,798923	Moira	Accounting	F
0,836594	Tammy	Marketing	F
0,906060	Kalisha	MIS	F

Systematic Sampling: Sample by choosing every k^th item from a list, starting from a randomly chosen entry on the list. Systematic sampling should yield acceptable results unless patterns in the population happen to recur at periodicity k. It can be used with unlistable or infinite populations. Systematic samples are well-suited to linearly organized physical populations.

A systematic sample of n items from a population of N items requires that periodicity k be approximately N/n. For example, out of 501 companies, we want to obtain a sample of 25. What should the periodicity k be? k = N/n , 501/25 » 20. So, we should choose every 20^th company from a random starting point.

For example, starting at item 2, we sample every k = 4 items to obtain a sample of n = 20 items from a list of N = 78 items.

Note that N/n = 78/20 » 4.

That is all for now folks. Try to apply some of these concepts when considering your data for the assignments in this course. Remember that a good sampling procedure helps eliminate bias and increase the chance of good estimates of population parameters. We'll see more about sampling when we get to module 5.

-x-x-x-x-x-x-

Reference:

Anderson, D., Sweeney, D., & Williams, T. (2006). Essentials of Modern Business Statistics with Microsoft Excel. Cincinnati, OH: 3^rd edition, South-Western. Chapter 2.

D. Groebner, P. Shannon, P. Fry & K. Smith. Business Statistics: A Decision Making Approach, Fifth Edition, Prentice Hall, Chapter 2

Ken Black. Business Statistics for Contemporary Decision Making. Fourth Edition, Wiley. Chapter 2

| Return to Module 1 Overview | Return to top of page |