54. AP Statistics
As you review the content in this book and work toward earning that 5 on your AP STATISTICS exam, here are five things that you MUST know:
1
Graders want to give you credit—help them! Make them understand what you are doing, why you are doing it, and how you are doing it. Don’t make the reader guess at what you are doing.
• Communication is just as important as statistical knowledge!
• Be sure you understand exactly what you are being asked to do or find or explain.
• Naked or bald answers will receive little or no credit! You must show where answers come from.
• On the other hand, don’t give more than one solution to the same problem—you will receive credit only for the weaker one.
2
Random sampling and random assignment are different ideas!
• Random sampling is use of chance in selecting a sample from a population.
– A simple random sample (SRS) is when every possible sample of a given size has the same chance of being selected.
– A stratified random sample is when the population is divided into homogeneous units called strata, and random samples are chosen from each stratum.
– A cluster sample is when the population is divided into heterogeneous units called clusters, and a random sample of the clusters is chosen.
• Random assignment in experiments is when subjects are randomly assigned to treatments.
– This randomization evens out effects over which we have no control.
– Randomized block design refers to when the randomization occurs only within groups of similar experimental units called blocks.
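The difference between selecting a sample and assigning treatments can be sketched in a few lines of Python. This is only an illustration under assumed data: the student ID numbers, the junior/senior strata, the homeroom clusters, and all function names are hypothetical and are not part of the AP materials.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every possible sample of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a separate SRS from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random and keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def random_assignment(subjects, treatments, rng):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

rng = random.Random(1)              # fixed seed so the run is repeatable
students = list(range(1, 41))       # hypothetical ID numbers 1..40
strata = {"juniors": students[:20], "seniors": students[20:]}
clusters = {f"homeroom {k}": students[10 * k:10 * k + 10] for k in range(4)}

srs = simple_random_sample(students, 8, rng)
strat = stratified_sample(strata, 4, rng)
clus = cluster_sample(clusters, 2, rng)
groups = random_assignment(srs, ["treatment", "control"], rng)

print("SRS:", sorted(srs))
print("Stratified:", strat)
print("Clusters chosen:", sorted(clus))
print("Assignment:", groups)
```

Note that the first three functions decide who is in the study, while the last one decides which treatment each subject already in the study receives.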
3
Distributions describe variability! Understand the difference between:
• a population distribution (variability in an entire population),
• a sample distribution (variability in a particular sample), and
• a sampling distribution (variability between samples).
• The larger the sample size, the more the sample distribution looks like the population distribution.
• Central Limit Theorem: the larger the sample size, the more the sampling distribution (probability distribution of the sample means) looks like a normal distribution.
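A small simulation can make the sampling distribution idea concrete. The sketch below is not from the AP materials: it assumes a hypothetical right-skewed population (exponential values with mean 1), draws many samples of each size, and summarizes the resulting sample means. The skewness helper is an ad hoc measure of asymmetry introduced only for this illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical, strongly right-skewed population
population = [rng.expovariate(1.0) for _ in range(20_000)]

def sample_means(pop, n, reps, rng):
    """Means of `reps` simple random samples of size n."""
    return [statistics.fmean(rng.sample(pop, n)) for _ in range(reps)]

def skewness(values):
    """Average cubed z-score: near 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

results = {}
for n in (2, 10, 50):
    means = sample_means(population, n, 1000, rng)
    results[n] = (statistics.stdev(means), skewness(means))
    print(f"n = {n:2d}: spread of sample means = {results[n][0]:.3f}, "
          f"skewness = {results[n][1]:.2f}")
```

As the sample size grows, the sample means should vary less from sample to sample and their distribution should look less skewed, which is the behavior the Central Limit Theorem describes.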
4
Check assumptions!
• Be sure the assumptions to be checked are stated correctly, but don’t just state them!
• Verifying assumptions and conditions means more than simply listing them with little check marks—you must show work or give some reason to confirm verification.
• If you refer to a graph, whether it is a histogram, boxplot, stemplot, scatterplot, residuals plot, normal probability plot, or some other kind of graph, you should roughly draw it. It is not enough to simply say, “I did a normal probability plot of the residuals on my calculator and it looked linear.”
5
Calculating the P-value is not the final step of a hypothesis test!
• There must be a decision to reject or fail to reject the null hypothesis.
• You must indicate how you interpret the P-value, that is, you need linkage. So, “Given that P = 0.007, I reject …” isn’t enough. You need something like, “Because P = 0.007 is less than 0.05, there is sufficient evidence to reject …”
• Finally, you need a conclusion in context of the problem.
The contents of this book cover the topics recommended by the AP Statistics Development Committee. A review of each of the 15 topics is followed by multiple-choice and free-response questions on that topic. Detailed explanations are provided for all answers. It should be noted that some of the topic questions are not typical AP exam questions but rather are intended to help review the topic. Finally, there is a diagnostic exam, and there are five full-length practice exams, totaling 276 questions, all with instructive, complete answers. An optional disk contains two new, full-length exams with 92 more questions.
Several points with regard to particular answers should be noted. First, step-by-step calculations using the given tables sometimes give minor differences from calculator answers due to round-off error. Second, calculator packages easily handle degrees of freedom that are not whole numbers, also resulting in minor answer differences. In the above cases, multiple-choice answers in this book have only one reasonable correct answer, and written explanations are necessary when answering free-response questions.
Students taking the AP Statistics Examination will be furnished with a list of formulas (from descriptive statistics, probability, and inferential statistics) and tables (including standard normal probabilities, t-distribution critical values, χ² critical values, and random digits). While students will be expected to bring a graphing calculator with statistics capabilities to the examination, answers should not be in terms of calculator syntax. Furthermore, many students have commented that calculator usage was less than they had anticipated. However, even though the calculator is simply a tool, to be used sparingly, as needed, students should be proficient with this technology.
The examination will consist of two parts: a 90-minute section with 40 multiple-choice problems and a 90-minute free-response section with five open-ended questions and an investigative task to complete. In grading, the two sections of the exam will be given equal weight. Students have remarked that the first section involves “lots of reading,” while the second section involves “lots of writing.” The percentage of questions from each content area is approximately 25% data analysis, 15% experimental design, 25% probability, and 35% inference. Questions in both sections may involve reading generic computer output.
Note that in the multiple-choice section the questions are much more conceptual than computational, and thus use of the calculator is minimal. The score on the multiple-choice section is based on the number of correct answers, with no points deducted for incorrect answers. Blank answers are ignored.
In the free-response section, students must show all their work, and communication skills go hand in hand with statistical knowledge. Methods must be clearly indicated, as the problems will be graded on the correctness of the methods as well as on the accuracy of the results and explanation. That is, the free-response answers should address why a particular test was chosen, not just how the test is performed. Even if a statistical test is performed on a calculator such as the TI-84, formulas should still be stated. Choice of test, in inference, must include confirmation of underlying assumptions, and answers must be stated in context, not just as numbers.
Free-response questions are scored on a 0 to 4 scale with 1 point for a minimal response, 2 points for a developing response, 3 points for a substantial response, and 4 points for a complete response. Individual parts of these questions are scored as E for essentially correct, P for partially correct, and I for incorrect. Note that essentially correct does not mean perfect. Work is graded holistically, that is, a student’s complete response is considered as a whole whenever scores do not fall precisely on an integral value on the 0 to 4 scale.
Each open-ended question counts 15% of the total free-response score and the investigative task counts 25% of the free-response score. The first open-ended question is typically the most straightforward, and after doing this one to build confidence, students might consider looking at the investigative task since it counts more. Each completed AP examination paper will receive a grade based on a 5-point scale, with 5 the highest score and 1 the lowest score. Most colleges and universities accept a grade of 3 or better for credit or advanced placement or both.
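The weights described above can be turned into a short calculation. This is a sketch only: the function names and the sample scores are hypothetical, and the conversion from the weighted result to the final 1-to-5 grade is not given in the text, so it is not attempted here.

```python
def free_response_fraction(open_ended_scores, task_score):
    """Fraction of the free-response section earned.

    Each of the five open-ended questions (scored 0 to 4) carries 15%,
    and the investigative task (scored 0 to 4) carries 25%.
    """
    if len(open_ended_scores) != 5:
        raise ValueError("expected scores for five open-ended questions")
    open_part = sum(0.15 * score / 4 for score in open_ended_scores)
    task_part = 0.25 * task_score / 4
    return open_part + task_part

def composite_fraction(mc_correct, open_ended_scores, task_score, mc_total=40):
    """Equal weighting of the multiple-choice and free-response sections."""
    mc_part = mc_correct / mc_total
    fr_part = free_response_fraction(open_ended_scores, task_score)
    return 0.5 * mc_part + 0.5 * fr_part

# Hypothetical student: 30 of 40 multiple-choice correct,
# open-ended scores 4, 3, 3, 2, 3 and a 2 on the investigative task
fr = free_response_fraction([4, 3, 3, 2, 3], 2)
overall = composite_fraction(30, [4, 3, 3, 2, 3], 2)
print(f"free-response fraction = {fr:.4f}, composite fraction = {overall:.4f}")
```

One point on the investigative task moves the free-response fraction by 0.0625, compared with 0.0375 for one point on an open-ended question, which is why the task deserves early attention.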
While a review book such as this can be extremely useful in helping prepare students for the AP exam (practice problems, practice more problems, and practice even more problems are the three strongest pieces of advice), nothing can substitute for a good high school teacher and a good textbook. This author personally recommends the following texts from among the many excellent books on the market: Stats: Modeling the World by Bock, Velleman, and DeVeaux; The Practice of Statistics by Starnes, Yates, and Moore; Workshop Statistics: Discovery with Data by Rossman and Chance; Introduction to Statistics and Data Analysis by Peck, Olsen, and Devore; and Statistics: The Art and Science of Learning from Data by Agresti and Franklin.
Other wonderful sources of information are the College Board’s websites: www.collegeboard.org for students and parents, and www.apcentral.collegeboard.com for teachers.
A good piece of advice is for the student from day one to develop critical practices (like checking assumptions and conditions), to acquire strong technical skills, and to always write clear and thorough, yet to the point, interpretations in context. Final answers to most problems should be not numbers, but rather sentences explaining and analyzing numerical results. To help develop skills and insights to tackle AP free response questions (which often choose contexts students haven’t seen before), pick up newspapers and magazines and figure out how to apply what you are learning to better understand articles in print that reference numbers, graphs, and statistical studies.
The student who uses this Barron’s review book should study the text and illustrative examples carefully and try to complete the practice problems before referring to the solution keys. Simply reading the detailed explanations to the answers without first striving to work through the problems on one’s own is not the best approach. There is an old adage: Mathematics is not a spectator sport! Teachers clearly may use this book with a class in many profitable ways. Ideally, each individual topic review, together with practice problems, should be assigned after the topic has been covered in class. The full-length practice exams should be reserved for final review shortly before the AP examination.
BAR CHARTS
DOTPLOTS
HISTOGRAMS
STEMPLOTS
CENTER AND SPREAD
CLUSTERS AND GAPS
OUTLIERS
MODES
SHAPE
CUMULATIVE RELATIVE FREQUENCY PLOTS
SKEWNESS
There are a variety of ways to organize and arrange data. Much information can be put into tables, but these arrays of bare figures tend to be spiritless and sometimes even forbidding. Some form of graphical display is often best for seeing patterns and shapes and for presenting an immediate impression of everything about the data. Among the most common visual representations of data are dotplots, bar charts, histograms, and stemplots. It is important to remember that all graphical displays should be clearly labeled, leaving no doubt what the picture represents—AP Statistics scoring guides harshly penalize the lack of titles and labels!
TIP
The first thing to do with data is to draw a picture—always.
TIP
Just because a variable has numerical values doesn’t necessarily mean that it’s quantitative.
Bar charts are useful with regard to categorical (or qualitative) variables, that is, variables that note the category to which each individual belongs. This is in contrast to quantitative variables, which take on numerical values. Sizes can be measured as frequencies or percents.
EXAMPLE 1.1
In a survey taken during the first week of January 2015, 1100 parents wanted to keep the school year to the current 180 days, 300 wanted to shorten it to 160 days, 500 wanted to extend it to 200 days, and 100 expressed no opinion. (Or noting that there were 2000 parents surveyed, percentages can be calculated.)
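The percentages mentioned in Example 1.1 can be computed and displayed as a rough, fully labeled text bar chart. This is a sketch rather than the book's figure; the function name, the chart title, and the choice of one symbol per 5 percentage points are assumptions made for illustration.

```python
counts = {
    "keep at 180 days": 1100,
    "shorten to 160 days": 300,
    "extend to 200 days": 500,
    "no opinion": 100,
}
total = sum(counts.values())  # 2000 parents surveyed
percents = {option: 100 * n / total for option, n in counts.items()}

def text_bar_chart(title, data, unit, per_symbol):
    """Return a labeled horizontal bar chart as plain text."""
    width = max(len(label) for label in data)
    lines = [title]
    for label, value in data.items():
        bar = "#" * round(value / per_symbol)
        lines.append(f"{label:<{width}} | {bar} {value:g}{unit}")
    lines.append(f"(each # = {per_symbol}{unit})")
    return "\n".join(lines)

chart = text_bar_chart(
    "Parents' preferred length of school year (January 2015 survey)",
    percents, "%", 5)
print(chart)
```

The title, category labels, and scale note are included deliberately, since an unlabeled chart would not earn full credit.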
TIP
Graphs must have appropriate labeling and scaling, or they will lose credit!
Dotplots can be used with categorical or quantitative variables.
EXAMPLE 1.2
When asked to choose their favorite dance music artist, 8 students chose Justin Timberlake, 5 picked Ray Dalton, 6 picked Nate Ruess, 3 picked Charli XCX, 5 picked Demi Lovato, and 3 picked Mikky Ekko. These data can be displayed in the following dotplot.
EXAMPLE 1.3
The dotplot below shows the lengths of stay (in days) for all patients admitted to a rural hospital during the first week in January 2015.
Histograms, useful for large data sets involving quantitative variables, show counts or percents falling either at certain values or between certain values. While the AP Statistics Exam does not stress construction of histograms, there are often questions on interpreting given histograms.
To construct a histogram using the TI-84, go to STAT → EDIT and put the data in a list, then turn a STAT PLOT on, choose the histogram icon under Type, specify the list where the data is, and use ZoomStat and/or adjust the WINDOW. Note that XSCL determines the width of the bin or class.
EXAMPLE 1.4
Suppose there are 2200 seniors in a city’s 6 high schools. Four hundred of the seniors are taking no AP classes, 500 are taking one, 900 are taking two, 300 are taking three, and 100 are taking four. These data can be displayed in the following histogram:
Sometimes, instead of labeling the vertical axis with frequencies, it is more convenient or more meaningful to use relative frequencies, that is, frequencies divided by the total number in the population.
Number of AP classes | Frequency | Relative frequency |
0 | 400 | 400/2200 = 0.18 |
1 | 500 | 500/2200 = 0.23 |
2 | 900 | 900/2200 = 0.41 |
3 | 300 | 300/2200 = 0.14 |
4 | 100 | 100/2200 = 0.05 |
Note that the shape of the histogram is the same whether the vertical axis is labeled with frequencies or with relative frequencies. Sometimes we show both frequencies and relative frequencies on the same graph.
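The relative frequency column of Example 1.4, and the claim that the shape does not change, can be checked with a short script. The variable names are illustrative only.

```python
# Frequencies from Example 1.4: number of AP classes -> number of seniors
frequencies = {0: 400, 1: 500, 2: 900, 3: 300, 4: 100}
total = sum(frequencies.values())

# Relative frequency = frequency divided by the total
relative = {k: n / total for k, n in frequencies.items()}

for k in frequencies:
    print(f"{k} | {frequencies[k]:4d} | {relative[k]:.2f}")

# Bar heights as a fraction of the tallest bar, on each vertical scale.
# Dividing every frequency by the same total leaves these ratios unchanged,
# which is why the two histograms have the same shape.
freq_shape = [n / max(frequencies.values()) for n in frequencies.values()]
rel_shape = [r / max(relative.values()) for r in relative.values()]
```

Because each entry is rounded to two decimal places, the rounded column adds to 1.01 rather than exactly 1; the unrounded relative frequencies do add to 1.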
EXAMPLE 1.5
Consider the following histogram of the numbers of pairs of shoes owned by 2000 women.
What can we learn from this histogram? For example, none of the women had fewer than 5 or more than 60 pairs of shoes. One hundred sixty of the women had 18 pairs of shoes. Twenty women had 5 pairs of shoes. Half the total area is less than or equal to 19, so half the women have 19 or fewer pairs of shoes. Fifteen percent of the area is more than 30, so 15 percent of the women have more than 30 pairs of shoes. Five percent of the area is more than 50, so 5 percent of the women have more than 50 pairs of shoes.
EXAMPLE 1.6
Consider the following histogram of exam scores, where the vertical axis has not been labeled.
What can we learn from this histogram?
Answer: It is impossible to determine the actual frequencies, that is, we have no idea if there were 25 students, 100 students, or any particular number of students who took the exam. However, we can determine the relative frequencies by noting the fraction of the total area that is over any interval.
We can divide the area into ten equal portions, and then note that 10% of the area is between 60 and 70, so 10% of the students scored between 60 and 70. Similarly, 40% scored between 70 and 80, 30% scored between 80 and 90, and 20% scored between 90 and 100.
Although it is usually not possible to divide histograms so nicely into ten equal areas, the principle of relative frequencies corresponding to relative areas still applies. Also note how this example shows the number of exam scores falling between certain values, whereas the previous two examples showed the number of AP classes taken and number of shoes owned for each value.
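The area principle in Example 1.6 can be sketched as follows. Since the vertical axis is unlabeled, the bar heights below are assumed, in arbitrary units, to be in the ratio 1 : 4 : 3 : 2, which is consistent with the percentages quoted in the example; the function name is hypothetical.

```python
# (lower bound, upper bound, bar height in arbitrary units)
bins = [(60, 70, 1), (70, 80, 4), (80, 90, 3), (90, 100, 2)]

def relative_frequencies(bins):
    """Relative frequency of each bin = its area / total area."""
    areas = [(high - low) * height for low, high, height in bins]
    total_area = sum(areas)
    return [area / total_area for area in areas]

rel = relative_frequencies(bins)
for (low, high, _), r in zip(bins, rel):
    print(f"{low}-{high}: {r:.0%}")

# Rescaling every height by the same factor changes nothing,
# which is why the missing axis labels do not matter here.
rescaled = [(low, high, 7 * height) for low, high, height in bins]
rel_rescaled = relative_frequencies(rescaled)
```

The same function works when the bins have unequal widths, because it compares areas rather than heights.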
TIP
Relative frequencies are the usual choice when comparing distributions of different size populations.
Although a histogram may show how many scores fall into each grouping or interval, the exact values of individual scores are lost. An alternative pictorial display, called a stemplot (also called a stem-and-leaf display), retains this individual information and is useful for giving a quick overview of a distribution, displaying the relative density and shape of the data. A stemplot contains two columns separated by a vertical line. The left column contains the stems, and the right column contains the leaves.
EXAMPLE 1.7
Bisphenol A (BPA) is an industrial chemical that is found in many hard plastic bottles. Recent studies have shown a possible link between BPA exposure and childhood obesity. In one study of 27 elementary school children, urinary BPA levels in nanograms/milliliter (ng/mL) were as follows: {0.2, 0.4, 0.7, 0.7, 0.8, 0.8, 0.9, 1.0, 1.0, 1.3, 1.4, 1.4, 1.4, 1.7, 1.9, 2.1, 2.4, 2.5, 2.8, 2.8, 3.0, 3.3, 3.3, 3.8, 4.2, 4.5, 5.2}
TIP
All stemplots must have keys!
Note: Those with urine BPA level of 2 ng/mL or higher had more than twice the risk of being overweight.
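One way to build a stemplot of the BPA data in Example 1.7 is sketched below. It is an illustration, not a reproduction of the book's figure: the function name is hypothetical, and the choice of whole ng/mL as stems and tenths as leaves is an assumption.

```python
from collections import defaultdict

bpa = [0.2, 0.4, 0.7, 0.7, 0.8, 0.8, 0.9, 1.0, 1.0, 1.3, 1.4, 1.4, 1.4, 1.7,
       1.9, 2.1, 2.4, 2.5, 2.8, 2.8, 3.0, 3.3, 3.3, 3.8, 4.2, 4.5, 5.2]

def stemplot(values, leaf_unit):
    """Return stemplot rows; each leaf is one digit worth `leaf_unit`."""
    # Work in whole leaf units to avoid floating-point surprises
    scaled = sorted(round(v / leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        leaves[s // 10].append(s % 10)
    rows = []
    # Walk through every stem from lowest to highest so gaps stay visible
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {row}")
    return rows

rows = stemplot(bpa, 0.1)
print("\n".join(rows))
print("Key: 2 | 1 means 2.1 ng/mL")
```

The key on the last line is what tells a reader how to combine a stem and a leaf into a data value.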
EXAMPLE 1.8
How many nonstop pushups can a 15–18-year-old teenager do? In one study in a mixed gender high school gym class, the numbers of pushups were {2, 5, 7, 10, 12, 12, 14, 16, 16, 18, 19, 20, 21, 29, 32, 34, 35, 37, 37, 38, 39, 39, 42, 44, 50}
TIP
Center and spread should always be described together.
Looking at a graphical display, we see that two important aspects of the overall pattern are
1. the center, which separates the values (or area under the curve in the case of a histogram) roughly in half, and
2. the spread, that is, the scope of the values from smallest to largest.
In the histogram of Example 1.4, the center is 2 AP classes while the spread is from 0 to 4 AP classes.
In the histogram of Example 1.5 the center is about 19, and the spread is from 5 to 60; in the histogram of Example 1.6, the center is about 80, and the spread is from 60 to 100.
In the stemplot of Example 1.7, the center is 1.7 (middle of the 27 values), and the spread is from 0.2 to 5.2; in the stemplot of Example 1.8, the center is 21 (middle of the 25 values), and the spread is from 2 to 50.
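The centers and spreads quoted for Examples 1.4, 1.7, and 1.8 can be verified directly, taking the middle value (median) as the center and the smallest and largest values as the spread. The helper name is illustrative only.

```python
import statistics

bpa = [0.2, 0.4, 0.7, 0.7, 0.8, 0.8, 0.9, 1.0, 1.0, 1.3, 1.4, 1.4, 1.4, 1.7,
       1.9, 2.1, 2.4, 2.5, 2.8, 2.8, 3.0, 3.3, 3.3, 3.8, 4.2, 4.5, 5.2]
pushups = [2, 5, 7, 10, 12, 12, 14, 16, 16, 18, 19, 20, 21, 29, 32, 34, 35,
           37, 37, 38, 39, 39, 42, 44, 50]
# Example 1.4: expand the frequency table into one value per senior
ap_classes = [0] * 400 + [1] * 500 + [2] * 900 + [3] * 300 + [4] * 100

def center_and_spread(values):
    """Return (median, smallest value, largest value)."""
    return statistics.median(values), min(values), max(values)

for name, data in [("AP classes", ap_classes), ("BPA", bpa), ("pushups", pushups)]:
    center, low, high = center_and_spread(data)
    print(f"{name}: center {center}, spread from {low} to {high}")
```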
Other important aspects of the overall pattern are
1. clusters, which show natural subgroups into which the values fall (for example, the salaries of teachers in Ithaca, NY, fall into three overlapping clusters, one for public school teachers, a higher one for Ithaca College professors, and an even higher one for Cornell University professors), and
2. gaps, which show holes where no values fall (for example, the Office of the Dean sends letters to students being put on the honor roll and to those being put on academic warning for low grades; thus the GPA distribution of students receiving letters from the Dean has a huge middle gap).
EXAMPLE 1.9
Hodgkin’s lymphoma is a cancer of the lymphatic system, the system that drains excess fluid from the blood and protects against infection. Consider the following histogram:
Simply saying that the average age at diagnosis for female cases is around 50 clearly misses something. The distribution of ages at diagnosis for female cases of Hodgkin’s lymphoma is bimodal with two distinct clusters, centered at 25 and 75.
TIP
Pay attention to outliers!
Extreme values, called outliers, are found in many distributions. Sometimes they are the result of errors in measurements and deserve scrutiny; however, outliers can also be the result of natural chance variation. Outliers may occur on one side or both sides of a distribution.
Some distributions have one or more major peaks, called modes. (The values with the peaks above them are the modes.) With exactly one or two such peaks, the distribution is said to be unimodal or bimodal, respectively. But not every little bump in the data is a mode! You should always look at the big picture and decide whether or not two (or more) phenomena are affecting the histogram.
TIP
Some distributions have many little ups (and downs), which should not be confused with modes.
EXAMPLE 1.10
The histogram below shows employee computer usage (number accessing the Internet) at given times at a company main office.
Note that this is a bimodal distribution. Computer usage at this company appears heaviest at midmorning and midafternoon, with a dip in usage during the noon lunch hour. There is an evening outlier possibly indicating employees returning after dinner (or perhaps custodial cleanup crews taking an Internet break!).
Note that, as illustrated above, it is usually instructive to look for reasons behind outliers and modes.
TIP
When describing a distribution, always comment on Shape, Outliers, Center, and Spread (SOCS). Or, alternatively, Center, Unusual values, Shape, and Spread (CUSS). And always describe in context.
Distributions come in an endless variety of shapes; however, certain common patterns are worth special mention:
1. A symmetric distribution is one in which the two halves are mirror images of each other. For example, the weights of all people in some organizations fall into symmetric distributions with two mirror-image bumps, one for men’s weights and one for women’s weights.
2. A distribution is skewed to the right if it spreads far and thinly toward the higher values. For example, the ages of nonagenarians (people in their 90s) form a distribution with sharply decreasing numbers as one moves from 90-year-olds to 99-year-olds.
3. A distribution is skewed to the left if it spreads far and thinly toward the lower values. For example, scores on an easy exam show a distribution bunched at the higher end with few low values.
4. A bell-shaped distribution is symmetric with a center mound and two sloping tails. For example, the distribution of IQ scores across the general population is roughly symmetric with a center mound at 100 and two sloping tails.
5. A distribution is uniform if its histogram is a horizontal line. For example, tossing a fair die and noting how many spots (pips) appear on top yields a uniform distribution with 1 through 6 all equally likely.
Even when a basic shape is noted, it is important also to note if some of the data deviate from this shape.
TIP
In the real world, distributions are rarely perfectly symmetric or perfectly uniform, so we usually say “roughly” or “approximately” symmetric or uniform.
CUMULATIVE RELATIVE FREQUENCY PLOTS
Sometimes we sum frequencies and show the result visually in a cumulative relative frequency plot (also known as an ogive).
EXAMPLE 1.11
The following graph shows 2015 school enrollment in the United States by age.
What can we learn from this cumulative relative frequency plot? For example, going up to the graph from age 5, we see that 0.15 or 15% of school enrollment is below age 5. Going over to the graph from 0.5 on the vertical axis, we see that 50% of the school enrollment is below and 50% is above a middle age of 11. Going up from age 30, we see that 0.95 or 95% of the enrollment is below age 30, and thus 5% is above age 30. Going over from 0.25 and 0.75 on the vertical axis, we see that the middle 50% of school enrollment is between ages 6 and 7 at the lower end and age 16 at the upper end.
CUMULATIVE RELATIVE FREQUENCY AND SKEWNESS
A distribution skewed to the left has a cumulative frequency plot that rises slowly at first and then steeply later, while a distribution skewed to the right has a cumulative frequency plot that rises steeply at first and then slowly later.
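The link between skewness and the way a cumulative plot rises can be checked numerically. The two frequency tables below are hypothetical and were chosen only to be clearly skewed; the function name is likewise an assumption.

```python
from itertools import accumulate

def cumulative_relative(freqs):
    """Running total of the frequencies divided by the overall total."""
    total = sum(freqs)
    return [running / total for running in accumulate(freqs)]

# Hypothetical counts for five equal-width classes, lowest class first
skewed_left = [1, 2, 3, 6, 12]    # long thin tail toward the low values
skewed_right = [12, 6, 3, 2, 1]   # long thin tail toward the high values

left = cumulative_relative(skewed_left)
right = cumulative_relative(skewed_right)

print("skewed left: ", [round(c, 3) for c in left])
print("skewed right:", [round(c, 3) for c in right])
```

After the first two classes the skewed-left distribution has accumulated only 3/24 of its values while the skewed-right distribution has accumulated 18/24, matching the slow-then-steep and steep-then-slow patterns described above.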
EXAMPLE 1.12
Consider the essay grading policies of three teachers: Abrams, who gives very high scores; Brown, who gives equal numbers of low and high scores; and Connors, who gives very low scores. Histograms of the grades (with 1 the highest score and 4 the lowest score) are as follows:
SUMMARY
The three keys to describing a distribution are shape, center, and spread.
Also consider clusters, gaps, modes, and outliers.
Always provide context.
Look for reasons behind any unusual features.
A few common shapes arise from symmetric, skewed to the right, skewed to the left, bell-shaped, and uniform distributions.
For categorical (qualitative) data, dotplots and bar charts give useful displays.
For quantitative data, histograms, cumulative relative frequency plots (ogives), and stemplots give useful displays.
In a histogram, relative area corresponds to relative frequency.
Multiple-Choice Questions
Directions: The questions or incomplete statements that follow are each followed by five suggested answers or completions. Choose the response that best answers the question or completes the statement.
1. The stemplot below shows ages of CEOs of a select group of corporations.
Which of the following is not a correct statement about this distribution?
(A) The distribution is bell-shaped.
(B) The distribution is skewed left and right.
(C) The center is around 60.
(D) The spread is from 22 to 90.
(E) There are no outliers.
2. Which of the following is a true statement?
(A) Stemplots are useful both for quantitative and categorical data sets.
(B) Stemplots are equally useful for small and very large data sets.
(C) Stemplots can show symmetry, gaps, clusters, and outliers.
(D) Stemplots may or may not show individual values.
(E) Stems may be skipped if there is no data value for a particular stem.
3. Which of the following is an incorrect statement?
(A) In histograms, relative areas correspond to relative frequencies.
(B) In histograms, frequencies can be determined from relative heights.
(C) Symmetric histograms may have multiple peaks.
(D) Two students working with the same set of data may come up with histograms that look different.
(E) Displaying outliers may be more problematic when using histograms than when using stemplots.
4. Following is a histogram of test scores.
Which of the following is a true statement?
(A) The middle (median) score was 75.
(B) The mean score was 70.
(C) The mean score is probably less than the median score.
(D) If the passing score was 60, most students failed.
(E) More students scored between 50 and 60 than between 90 and 100.
Questions 5–9 refer to the following five cumulative relative frequency plots:
5. To which of the above cumulative relative frequency plots does the following histogram correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
6. To which of the above cumulative relative frequency plots does the following histogram correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
7. To which of the above cumulative relative frequency plots does the following histogram correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
8. To which of the above cumulative relative frequency plots does the following histogram correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
9. To which of the above cumulative relative frequency plots does the following histogram correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
Free-Response Questions
Directions: You must show all work and indicate the methods you use. You will be graded on the correctness of your methods and on the accuracy of your final answers.
THREE OPEN-ENDED QUESTIONS
1. The dotplot below shows the numbers of goals scored by the 20 teams playing in a city’s high school soccer games on a particular day.
(a) Describe the distribution.
(b) One superstar scored six goals, but his team still lost. What are all possible final scores for that game? Explain.
(c) Is it possible that all the teams scoring exactly two goals won their games? Explain.
2. The winning percentages for a major league baseball team over the past 22 years are shown in the following stemplot:
(a) Interpret the lowest value.
(b) Describe the distribution.
(c) Give a reason that one might argue that the team is more likely to lose a given game than win it.
(d) Give a reason that one might argue that the team is more likely to win a given game than lose it.
3. A college basketball team keeps records of career average points per game of players playing at least 75% of team games during their college careers. The cumulative relative frequency plot below summarizes statistics of players graduating over the past 10 years.
(a) Interpret the point (20, 0.4) in context.
(b) Interpret the intersection of the plot with the horizontal axis in context.
(c) Interpret the horizontal section of plot from 5 to 7 points per game in context.
(d) The players with the top 10% of the career average points per game achievements will be listed on a plaque. What is the cutoff score for being included on the plaque?
(e) What proportion of the players averaged between 10 and 20 points per game?
AN INVESTIGATIVE TASK
A company engineer creates a diagnostic measurement, W, which should be at least 24.10 in a sample of size 12 if certain machinery is operating correctly. To explore this diagnostic measurement, the machine is perfectly calibrated. Then 100 random samples of size 12 of the product are taken from the assembly line. For each of these 100 samples, the diagnostic measurement W is calculated and shown plotted below.
Each day, one sample of size 12 is taken from the assembly line and the diagnostic measurement W is calculated. If W drops too low, a decision to recalibrate the machinery is made.
(a) From the dotplot above, estimate a measure of center and a measure of variability for the distribution.
(b) For the dotplot above, do there appear to be any outliers (no calculations required)? Justify your answer.
One day the random sample is {24.2, 24.84, 25.05, 23.43, 23.9, 25.01, 23.01, 24.5, 24.23, 23.76, 24.69, 23.21}.
(c) Based on the dotplot above, does the engineer have sufficient evidence to conclude that recalibration is necessary? Justify your answer.
MULTIPLE-CHOICE
1. (B) There is no such thing as being skewed both left and right.
2. (C) Stemplots are not used for categorical data sets, are too unwieldy to be used for very large data sets, and show every individual value. Stems should never be skipped over—gaps are important to see.
3. (B) Histograms give information about relative frequencies (relative areas correspond to relative frequencies) and may or may not have an axis with actual frequencies. Symmetric histograms can have any number of peaks. Choice of width and number of classes changes the appearance of a histogram. Stemplots clearly show outliers; however, in histograms outliers may be hidden in large class widths.
4. (E) The median score splits the area in half, and so the median is not 75. The median appears to be about 70 (with equal area on each side), and since the data are skewed right, the mean will be larger than the median, so the mean is greater than 70. The area between 50 and 60 is greater than the area between 90 and 100 but is less than the area between 60 and 100.
5. (B) A histogram with little area under the curve early and much greater area later results in a cumulative relative frequency plot which rises slowly at first and then at a much faster rate later.
6. (C) A histogram with large area under the curve early and much less area later results in a cumulative relative frequency plot which rises quickly at first and then at a much slower rate later.
7. (E) A histogram with little area under the curve in the middle and much greater area on both ends results in a cumulative relative frequency plot which rises quickly at first, then almost levels off, and finally rises quickly at the end.
8. (D) A histogram with little area under the curve on the ends and much greater area in the middle results in a cumulative relative frequency plot which rises slowly at first, then quickly in the middle, and finally slowly again at the end.
9. (A) Uniform distributions result in cumulative relative frequency plots which rise at constant rates, thus linear.
FREE-RESPONSE
1. (a) A complete answer considers shape, center, and spread.
Shape: unimodal, skewed right, outlier at 10
Center: around 2 or 3
Spread: from 0 to 10
(b) If the player scored six goals, his/her team must have scored either 7 or 10, but they lost, so they scored 7, and the only possible final score is that they lost by a score of 10 to 7.
(c) No, there were six teams that scored exactly two goals, but there were only five teams that scored fewer than two goals, so not all the two-goal teams could have won.
2. (a) The lowest winning percentage over the past 22 years is 46.0%.
(b) A complete answer considers shape, center, and spread.
Shape: two clusters, each somewhat bell-shaped
Center: around 50%
Spread: from 46.0 to 55.6%
(c) The team had more losing seasons (13) than winning seasons (9).
(d) The cluster of winning percentages is farther above 50% than the cluster of losing percentages is below 50%.
3. (a) 40% of the players averaged fewer than 20 points per game.
(b) All the players averaged at least 3 points per game.
(c) No players averaged between 5 and 7 points per game because the cumulative relative frequency was 10% for both 5 and 7 points.
(d) Go over to the plot from 0.9 on the vertical axis, and then down to the horizontal axis to result in 28 points per game.
(e) Reading up to the plot and then over from 10 and from 20 shows that 0.25 of the players averaged under 10 points per game and 0.4 of the players averaged under 20 points per game. Thus, 0.4 – 0.25 = 0.15 gives the proportion of players who averaged between 10 and 20 points per game.
AN INVESTIGATIVE TASK
(a) The center is roughly between 24.10 and 24.11, and the data are spread from 24.01 to 24.20.
(b) There seem to be two “low” data points, 24.01 and 24.02, and one “high” data point, 24.20. These three data points are distinctly separated from the other points.
(c) For this day’s sample, W = 24.03. A value of 24.03 or less occurred only twice in the 100 samples. Thus, if the machinery were operating properly, a W measurement of 24.03 would be very unusual. The conclusion should be to recalibrate the machine.
MEASURING THE CENTER
MEASURING SPREAD
MEASURING POSITION
EMPIRICAL RULE
HISTOGRAMS
BOXPLOTS
CHANGING UNITS
Given a raw set of data, often we can detect no overall pattern. Perhaps some values occur more frequently, a few extreme values may stand out, and the range of values is usually apparent. The presentation of data, including summarizations and descriptions, and involving such concepts as representative or average values, measures of dispersion, positions of various values, and the shape of a distribution, falls under the broad topic of descriptive statistics. This aspect of statistics is in contrast to statistical analysis, the process of drawing inferences from limited data, a subject discussed in later topics.
MEASURING THE CENTER: MEDIAN AND MEAN
The word average is used in phrases common to everyday conversation. People speak of bowling and batting averages or the average life expectancy of a battery or a human being. Actually the word average is derived from the French avarie, which refers to the money that shippers contributed to help compensate for losses suffered by other shippers whose cargo did not arrive safely (i.e., the losses were shared, with everyone contributing an average amount). In common usage average has come to mean a representative score or a typical value or the center of a distribution. Mathematically, there are a variety of ways to define the average of a set of data. In practice, we use whichever method is most appropriate for the particular case under consideration. However, beware of a headline with the word average; the writer has probably chosen the method that emphasizes the point he or she wishes to make.
In the following paragraphs we consider the two primary ways of denoting an average:
1. The median, which is the middle number of a set of numbers arranged in numerical order.
2. The mean, which is found by summing items in a set and dividing by the number of items.
EXAMPLE 2.1
Consider the following set of home run distances (in feet) to center field in 13 ballparks: {387, 400, 400, 410, 410, 410, 414, 415, 420, 420, 421, 457, 461}. What is the average?
Answer: The median is 414 (there are six values below 414 and six values above), while the mean is $\frac{387+400+\cdots+461}{13}=\frac{5425}{13}\approx 417.3$ feet.
REMEMBER
Don’t forget to put the data in order before finding the median.
Median
The word median is derived from the Latin medius which means “middle.” The values under consideration are arranged in ascending or descending order. If there is an odd number of values, the median is the middle one. If there is an even number, the median is found by adding the two middle values and dividing by 2. Thus the median of a set has the same number of elements above it as below it.
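The rule just described can be sketched in a few lines of Python. This is only an illustration using the distances from Example 2.1 (entered out of order on purpose); the function name `middle_value` is our own choice, and the result is checked against the standard library's `statistics.median`.

```python
from statistics import mean, median

# Home run distances (feet) from Example 2.1, deliberately entered out of order
# to show that the data must be sorted before the middle value is picked.
distances = [410, 387, 461, 400, 420, 414, 400, 457, 410, 415, 421, 420, 410]

def middle_value(values):
    """Median by the rule in the text: sort, then take the middle value
    (odd count) or the average of the two middle values (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(middle_value(distances))       # 414 (13 values, so the 7th in order)
print(middle_value(distances[:-1]))  # 414.5 (12 values: average of 414 and 415)
print(round(mean(distances), 1))     # 417.3
```

Dropping one value to get an even count shows the second half of the rule: the median then falls halfway between the two middle values.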
The median is not affected by exactly how large the larger values are or by exactly how small the smaller values are. Thus it is a particularly useful measurement when the extreme values, called outliers, are in some way suspicious or when we want to diminish their effect. For example, if ten mice try to solve a maze, and nine succeed in less than 15 minutes while one is still trying after 24 hours, the most representative value is the median (not the mean, which is over 2 hours). Similarly, if the salaries of four executives are each between \$240,000 and \$245,000 while a fifth is paid less than \$20,000, again the most representative value is the median (the mean is under \$200,000). It is often said that the median is “resistant” to extreme values.
In certain situations the median offers the most economical and quickest way to calculate an average. For example, suppose 10,000 lightbulbs of a particular brand are installed in a factory. An average life expectancy for the bulbs can most easily be found by noting how much time passes before exactly one-half of them have to be replaced. The median is also useful in certain kinds of medical research. For example, to compare the relative strengths of different poisons, a scientist notes what dosage of each poison will result in the death of exactly one-half the test animals. If one of the animals proves especially susceptible to a particular poison, the median lethal dose is not affected.
Mean
While the median is often useful in descriptive statistics, the mean, or more accurately, the arithmetic mean, is most important for statistical inference and analysis. Also, for the layperson, the average is usually understood to be the mean.
The mean of a whole population (the complete set of items of interest) is often denoted by the Greek letter µ (mu), while the mean of a sample (a part of a population) is often denoted by $\overline{x}$. For example, the mean value of the set of all houses in the United States might be µ = \$56,400, while the mean value of 100 randomly chosen houses might be $\overline{x}$ = \$52,100 or perhaps $\overline{x}$ = \$63,800 or even $\overline{x}$ = \$124,000.
In statistics we learn how to estimate a population mean from a sample mean. Throughout this book, the word sample often implies a simple random sample (SRS), that is, a sample selected in such a way that every possible sample of the desired size has an equal chance of being included. (It is also true that each element of the population will have an equal chance of being included.) In the real world, this process of random selection is often very difficult to achieve, and so we proceed, with caution, as long as we have good reason to believe that our sample is representative of the population.
Mathematically, the mean is $\frac{\sum x}{n}$, where $\sum x$ represents the sum of all the elements of the set under consideration and n is the actual number of elements. $\sum$ is the uppercase Greek letter sigma.
EXAMPLE 2.2
Suppose that the numbers of unnecessary procedures recommended by five doctors in a 1-month period are given by the set {2, 2, 8, 20, 33}. Note that the median is 8 and the mean is $\frac{2+2+8+20+33}{5}=13$. If it is discovered that the fifth doctor also recommended an additional 25 unnecessary procedures, how will the median and mean be affected?
Answer: The set is now {2, 2, 8, 20, 58}. The median is still 8; however, the mean changes to $\frac{2+2+8+20+58}{5}=18$.
The above example illustrates how the mean, unlike the median, is sensitive to a change in any value.
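A quick check of Example 2.2, assuming nothing beyond Python's standard `statistics` module, shows the median holding steady while the mean moves:

```python
from statistics import mean, median

# Unnecessary procedures per doctor, before and after the correction (Example 2.2)
before = [2, 2, 8, 20, 33]
after = [2, 2, 8, 20, 33 + 25]

# The median is unchanged; the mean responds to the change in a single value.
print("median:", median(before), "->", median(after))
print("mean:  ", mean(before), "->", mean(after))
```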
EXAMPLE 2.3
Suppose the salaries of six employees are \$3000, \$7000, \$15,000, \$22,000, \$23,000, and \$38,000, respectively.
a. What is the mean salary?
Answer:
$\frac{3000+7000+15000+22000+23000+38000}{6}=18000$
b. What will the new mean salary be if everyone receives a \$3000 increase?
Answer:
$\frac{6000+10000+18000+25000+26000+41000}{6}=21000$
Note that \$18,000 + \$3000 = \$21,000.
c. What if everyone receives a 10% raise?
Answer:
$\frac{3300+7700+16500+24200+25300+41800}{6}=19800$
Note that 110% of \$18,000 is \$19,800.
The above example illustrates how adding the same constant to each value increases the mean (and median) by a like amount. Similarly, multiplying each value by the same constant multiplies the mean (and median) by a like amount.
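The same observation can be verified numerically. The sketch below reuses the salaries of Example 2.3; the variable names are ours, and `math.isclose` is used because multiplying by 1.10 introduces tiny floating-point rounding.

```python
from math import isclose
from statistics import mean, median

salaries = [3000, 7000, 15000, 22000, 23000, 38000]  # Example 2.3

flat_raise = [s + 3000 for s in salaries]      # add the same constant to each value
percent_raise = [s * 1.10 for s in salaries]   # multiply each value by the same constant

# Adding 3000 shifts both the mean and the median by 3000.
print(mean(salaries), median(salaries))
print(mean(flat_raise), median(flat_raise))

# Multiplying by 1.10 multiplies both the mean and the median by 1.10.
print(round(mean(percent_raise)), round(median(percent_raise)))
```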
TIP
Understanding variation is the key to understanding statistics.
MEASURING SPREAD: RANGE, INTERQUARTILE RANGE, VARIANCE, AND STANDARD DEVIATION
In describing a set of numbers, not only is it useful to designate an average value but it is also important to be able to indicate the variability or the dispersion of the measurements. An explosion engineer in mining operations aims for small variability—it would not be good for his 30-minute fuses actually to have a range of 10–50 minutes before detonation. On the other hand, a teacher interested in distinguishing better students from poorer students aims to design exams with large variability in results—it would not be helpful if all her students scored exactly the same. The players on two basketball teams may have the same average height, but this observation doesn’t tell the whole story. If the dispersions are quite different, one team may have a 7-foot player, whereas the other has no one over 6 feet tall. Two Mediterranean holiday cruises may advertise the same average age for their passengers. One, however, may have only passengers between 20 and 25 years old, while the other has only middle-aged parents in their forties together with their children under age 10.
There are four primary ways of describing variability or dispersion:
1. The range, which is the difference between the largest and smallest values
2. The interquartile range, IQR, which is the difference between the largest and smallest values after removing the lower and upper quarters (i.e., IQR is the range of the middle 50%); that is, IQR = Q3 – Q1 = 75th percentile minus 25th percentile
3. The variance, which is determined by averaging the squared differences of all the values from the mean
4. The standard deviation, which is the square root of the variance
EXAMPLE 2.4
The monthly rainfall in Monrovia, Liberia, where May through October is the rainy season and November through April the dry season, is as follows:
Month: | Jan | Feb | Mar | Apr | May | June | July | Aug | Sept | Oct | Nov | Dec |
Rain (in.): | 1 | 2 | 4 | 6 | 18 | 37 | 31 | 16 | 28 | 24 | 9 | 4 |
The mean is
$\frac{1+2+4+6+18+37+31+16+28+24+9+4}{12}=15$ inches
What are the measures of variability?
Answer: Range: The maximum is 37 inches (June), and the minimum is 1 inch (January). Thus the range is 37 – 1 = 36 inches of rain.
Interquartile range: Removing the lower and upper quarters leaves 4, 6, 9, 16, 18, and 24. Thus the interquartile range is 24 – 4 = 20. [The interquartile range is sometimes calculated as follows: The median of the lower half is $Q_1=\frac{4+4}{2}=4$, the median of the upper half is $Q_3=\frac{24+28}{2}=26$, and the interquartile range is Q3 – Q1 = 22. When there is a large number of values in the set, the two methods give essentially the same answer.]
Variance:
$\frac{14^2+13^2+11^2+9^2+3^2+22^2+16^2+1^2+13^2+9^2+6^2+11^2}{12}=143.7$
Standard deviation: $\sqrt{143.7}=12.0$ inches
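For readers who want to check these figures, here is a minimal Python sketch of all four measures for the rainfall data. It treats the 12 months as a whole population (dividing by n, as in the calculation above) and computes the interquartile range both ways described in the answer; the variable names are our own.

```python
from statistics import mean, median, pstdev, pvariance

rain = [1, 2, 4, 6, 18, 37, 31, 16, 28, 24, 9, 4]  # inches, Jan through Dec
ordered = sorted(rain)
n = len(ordered)

data_range = max(rain) - min(rain)

# Method 1: remove the lowest and highest quarters, then take the range of the rest.
quarter = n // 4
middle = ordered[quarter:n - quarter]
iqr_trimmed = middle[-1] - middle[0]

# Method 2: Q1 and Q3 are the medians of the lower and upper halves.
half = n // 2
q1 = median(ordered[:half])
q3 = median(ordered[n - half:])
iqr_halves = q3 - q1

# Variance and standard deviation with divisor n (population formulas).
variance = pvariance(rain)
std_dev = pstdev(rain)

print(data_range, iqr_trimmed, iqr_halves)    # 36 20 22.0
print(round(variance, 1), round(std_dev, 1))  # 143.7 12.0
```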
Range
The simplest, most easily calculated measure of variability is the range. The difference between the largest and smallest values can be noted quickly, and the range gives some impression of the dispersion. However, it is entirely dependent on the two extreme values and is insensitive to the ones in the middle.
One use of the range is to evaluate samples with very few items. For example, some quality control techniques involve taking periodic small samples and basing further action on the range found in several such samples.
Interquartile Range
Finding the interquartile range is one method of removing the influence of extreme values on the range. It is calculated by arranging the data in numerical order, removing the upper and lower quarters of the values, and noting the range of the remaining values. That is, it is the range of the middle 50% of the values.
The numerical rule for designating outliers is to calculate 1.5 times the interquartile range (IQR) and then call a value an outlier if it is more than 1.5 × IQR below the first quartile or 1.5 × IQR above the third quartile.
EXAMPLE 2.5
Suppose that the starting salaries (in \$1000) for college graduates who took AP Statistics in high school and at least one additional statistics class in college have the following characteristics: the smallest value is 18.8, 10% of the values are below 25.6, 25% are below 41.1, the median is 59.3, 60% are below 84.3, 75% are below 101.9, 90% are below 118.0, and the top value is 201.7.
a. What is the range?
Answer: The range is 201.7 – 18.8 = 182.9 (thousand dollars) = \$182,900.
b. What is the interquartile range?
Answer: The interquartile range, that is, the range of the middle 50% of the values, is Q3 – Q1 = 101.9 – 41.1 = 60.8 (thousand dollars) = \$60,800.
c. When the numerical rule is used for outliers, should either the smallest or largest value be called an outlier?
Answer: 1.5 × IQR = 1.5 × 60.8 = 91.2. If a value is more than 91.2 below the first quartile, 41.1, or more than 91.2 above the third quartile, 101.9, then it will be called an outlier. Since the largest value, 201.7, is greater than 101.9 + 91.2 = 193.1, it is considered an outlier by the given numerical rule. The smallest value, 18.8, is well above 41.1 – 91.2 = –50.1, so it is not an outlier.
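The numerical rule is easy to automate. The following sketch applies it to the quartiles of Example 2.5; the helper names `outlier_fences` and `is_outlier` are hypothetical, chosen only for this illustration.

```python
from math import isclose

def outlier_fences(q1, q3, k=1.5):
    """Return the cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3, k=1.5):
    """True if the value lies outside the fences."""
    low, high = outlier_fences(q1, q3, k)
    return value < low or value > high

q1, q3 = 41.1, 101.9                  # quartiles from Example 2.5 (in $1000)
low, high = outlier_fences(q1, q3)
print(round(low, 1), round(high, 1))  # -50.1 193.1

print(is_outlier(18.8, q1, q3))       # smallest value: False
print(is_outlier(201.7, q1, q3))      # largest value: True
```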
Variance
Dispersion is often the result of various chance happenings. For example, consider the motion of microscopic particles suspended in a liquid. The unpredictable motion of any particle is the result of many small movements in various directions caused by random bumps from other particles. If we average the total displacements of all the particles from their starting points, the result will not increase in direct proportion to time. If, however, we average the squares of the total displacements of all the particles, this result will increase in direct proportion to time.
The same holds true for the movement of paramecia. Their seemingly random motions as seen under a microscope can be described by the observation that the average of the squares of the displacements from their starting points is directly proportional to time. Also, consider ping-pong balls dropped straight down from a high tower and subjected to chance buffeting in the air. We can measure the deviations from a center spot on the ground to the spots where the balls actually strike. As the height of the tower is increased, the average of the squared deviations increases proportionately.
In a wide variety of cases we are in effect trying to measure dispersion from the mean due to a multitude of chance effects. The proper tool in these cases is the average of the squared deviations from the mean; it is called the variance and is denoted by $\sigma^2$ ($\sigma$ is the lowercase Greek letter sigma):
$\sigma^2=\frac{\sum(x-\mu)^2}{n}$
For circumstances specified later, the variance of a sample, denoted by $s^2$, is calculated as
$s^2=\frac{\sum(x-\overline{x})^2}{n-1}$
TIP
Most calculators give the standard deviation, and this must be squared to find the variance.
EXAMPLE 2.6
The Points Per Game (PPG) during the 2012–2013 season of the New York Knicks players were {14.2, 28.7, 10.4, 1.8, 6.6, 13.9, 6.0, 18.1, 6.8, 7.0, 8.7, 3.5, 7.2}. What was the variance?
Answer: The variance can be quickly found on any calculator with a simple statistical package, or it can be found as follows:
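As one way to carry out the computation, the sketch below applies both variance formulas given above to the PPG values. Which divisor is appropriate depends on whether the 13 players are regarded as the whole population of interest (divisor n) or as a sample (divisor n – 1); that choice is an assumption the reader must make, so both results are shown.

```python
from statistics import mean, pvariance, variance

# Points per game, 2012-2013 New York Knicks (Example 2.6)
ppg = [14.2, 28.7, 10.4, 1.8, 6.6, 13.9, 6.0, 18.1, 6.8, 7.0, 8.7, 3.5, 7.2]

n = len(ppg)
center = mean(ppg)
sum_sq_dev = sum((x - center) ** 2 for x in ppg)  # sum of squared deviations

pop_var = sum_sq_dev / n           # sigma^2: divide by n
sample_var = sum_sq_dev / (n - 1)  # s^2: divide by n - 1

print(round(center, 2))      # 10.22
print(round(pop_var, 2))     # 47.19
print(round(sample_var, 2))  # 51.12
```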
Standard Deviation
Suppose we wish to pick a representative value for the variability of a certain population. The preceding discussions indicate that a natural choice is the value whose square is the average of the squared deviations from the mean. Thus we are led to consider the square root of the variance. This value is called the standard deviation, is denoted by $\sigma$, and is calculated on your calculator or as follows:
$\sigma=\sqrt{\frac{\sum(x-\mu)^2}{n}}$
Similarly, the standard deviation of a sample is denoted by s and is calculated on your calculator or as follows:
$s=\sqrt{\frac{\sum(x-\overline{x})^2}{n-1}}$
While variance is measured in square units, standard deviation is measured in the same units as are the data.
For the various x-values, the deviations $x-\overline{x}$ are called residuals, and s is a “typical value” for the residuals. While s is not the average of the residuals (the average of the residuals is always 0), s does give a measure of the spread of the x-values around the sample mean.
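To see that the residuals always average to 0 while s still captures their typical size, consider this short sketch. It reuses the six salaries of Example 2.3 purely as convenient numbers and treats them as a sample; that treatment is our assumption for the sake of illustration.

```python
from math import isclose, sqrt
from statistics import mean, stdev

values = [3000, 7000, 15000, 22000, 23000, 38000]  # salaries from Example 2.3

x_bar = mean(values)
residuals = [x - x_bar for x in values]  # deviations from the sample mean

# Positive and negative residuals cancel exactly.
print(sum(residuals))  # 0

# s is built from the squared residuals, so nothing cancels.
s = sqrt(sum(r ** 2 for r in residuals) / (len(values) - 1))
print(round(s, 1))
```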
EXAMPLE 2.7
The numbers of calories in 12-ounce servings of five popular beers are {95, 152, 188, 205, 131}. Using the TI-84, 1-Var Stats gives $\overline{x}=154.2$, $S_x\approx 44.08$, and $\sigma_x\approx 39.42$.
Since these data represent a sample of beers, the standard deviation is $s\approx 44.08$ calories.
MEASURING POSITION: SIMPLE RANKING, PERCENTILE RANKING, AND Z-SCORE
We have seen several ways of choosing a value to represent the center of a distribution. We also need to be able to talk about the position of any other values. In some situations, such as wine tasting, simple rankings are of interest. Other cases, for example, evaluating college applications, may involve positioning according to percentile rankings. There are also situations in which position can be specified by making use of measurements of both central tendency and variability.
There are three important, recognized procedures for designating position:
1. Simple ranking, which involves arranging the elements in some order and noting where in that order a particular value falls
2. Percentile ranking, which indicates what percentage of all values fall below the value under consideration
3. The z-score, which states very specifically by how many standard deviations a particular value varies from the mean.
EXAMPLE 2.8
It is recommended that the “good cholesterol,” high-density lipoprotein (HDL), be present in the blood at levels of at least 40 mg/dl. Suppose the 50 members of a high school football team are all tested with resulting HDL levels of {53, 26, 45, 33, 64, 29, 73, 29, 21, 58, 70, 41, 48, 55, 55, 39, 57, 48, 9, 59, 56, 39, 68, 50, 65, 30, 38, 54, 49, 35, 56, 70, 43, 86, 52, 40, 28, 40, 67, 50, 47, 54, 59, 29, 29, 42, 45, 37, 51, 40}. What is the position of the HDL score of 41?
Answer: Since there are 31 higher HDL levels on the list, the 41 has a simple ranking of 32 out of 50. Eighteen HDL levels are lower, so the percentile ranking is 18/50 = 36%. The above list has a mean of 47.22 with a standard deviation of 15.05, so the HDL score of 41 has a z-score of (41 – 47.22)/15.05 = –0.413.
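The three measures of position in this answer can be reproduced with a short Python sketch. It assumes the sample standard deviation (divisor n – 1), which is the version that reproduces the 15.05 quoted above; the variable names are ours.

```python
from statistics import mean, stdev

hdl = [53, 26, 45, 33, 64, 29, 73, 29, 21, 58, 70, 41, 48, 55, 55, 39, 57,
       48, 9, 59, 56, 39, 68, 50, 65, 30, 38, 54, 49, 35, 56, 70, 43, 86,
       52, 40, 28, 40, 67, 50, 47, 54, 59, 29, 29, 42, 45, 37, 51, 40]
value = 41

higher = sum(1 for x in hdl if x > value)
lower = sum(1 for x in hdl if x < value)

simple_rank = higher + 1            # position counting down from the top
percentile = 100 * lower / len(hdl) # percent of values below
z = (value - mean(hdl)) / stdev(hdl)

print(simple_rank, "out of", len(hdl))  # 32 out of 50
print(percentile)                       # 36.0
print(round(z, 3))                      # -0.413
```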
Simple Ranking
Simple ranking is easily calculated and easily understood. We know what it means for someone to graduate second in a class of 435, or for a player from a team of size 30 to have the seventh-best batting average. Simple ranking is useful even when no numerical values are associated with the elements. For example, detergents can be ranked according to relative cleansing ability without any numerical measurements of strength.
Percentile Ranking
Percentile ranking, another readily understood measurement of position, is helpful in comparing positions with different bases. We can more easily compare a rank of 176 out of 704 with a rank of 187 out of 935 by noting that the first has a rank of 75%, and the second, a rank of 80%. Percentile rank is also useful when the exact population size is not known or is irrelevant. For example, it is more meaningful to say that Jennifer scored in the 90th percentile on a national exam rather than trying to determine her exact ranking among some large number of test takers.
The quartiles, Q1 and Q3, lie one-quarter and three-quarters of the way up a list, respectively. Their percentile ranks are 25% and 75%, respectively. The interquartile range defined earlier can also be defined to be Q3 – Q1. The first and ninth deciles lie one-tenth and nine-tenths of the way up a list, respectively, and have percentile ranks of 10% and 90%.
z-Score
The z-score is a measure of position that takes into account both the center and the dispersion of the distribution. More specifically, the z-score of a value tells how many standard deviations the value is from the mean. Mathematically, x – µ gives the raw distance from µ to x; dividing by $\sigma$ converts this to number of standard deviations. Thus $z=\frac{x-\mu}{\sigma}$, where x is the raw score, µ is the mean, and $\sigma$ is the standard deviation. If the score x is greater than the mean µ, then z is positive; if x is less than µ, then z is negative.
Given a z-score, we can reverse the procedure and find the corresponding raw score. Solving for x gives $x=\mu+z\sigma$.
EXAMPLE 2.9
Suppose the average (mean) price of gasoline in a large city is \$3.80 per gallon with a standard deviation of \$0.05. Then \$3.90 has a z-score of $\frac{3.90-3.80}{0.05}=+2,$ while \$3.65 has a z-score of $\frac{3.65-3.80}{0.05}=-3$. Alternatively, a z-score of +2.2 corresponds to a raw score of 3.80 + 2.2(0.05) = 3.80 + 0.11 = 3.91, while a z-score of –1.6 corresponds to 3.80 – 1.6(0.05) = 3.72.
It is often useful to portray integer z-scores and the corresponding raw scores as follows:
EXAMPLE 2.10
Suppose the attendance at a movie theater averages 780 with a standard deviation of 40. Adding multiples of 40 to and subtracting multiples of 40 from the mean 780 gives
z-score: | –3 | –2 | –1 | 0 | 1 | 2 | 3 |
Attendance: | 660 | 700 | 740 | 780 | 820 | 860 | 900 |
A theater attendance of 835 is converted to a z-score as follows: $\frac{835-780}{40}=1.375$
A z-score of –2.15 is converted to a theater attendance as follows: 780 – 2.15(40) = 694.
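Both directions of the conversion can be written as one-line functions. The sketch below is only an illustration; the function names `z_score` and `raw_score` are our own, and results are rounded or compared with `math.isclose` because decimal inputs are not stored exactly in floating point.

```python
from math import isclose

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def raw_score(z, mu, sigma):
    """Reverse the procedure: x = mu + z * sigma."""
    return mu + z * sigma

# Gasoline prices (Example 2.9): mean 3.80, standard deviation 0.05
print(round(z_score(3.90, 3.80, 0.05), 2))   # 2.0
print(round(z_score(3.65, 3.80, 0.05), 2))   # -3.0
print(round(raw_score(2.2, 3.80, 0.05), 2))  # 3.91

# Theater attendance (Example 2.10): mean 780, standard deviation 40
print(z_score(835, 780, 40))                 # 1.375
print(round(raw_score(-2.15, 780, 40)))      # 694

# Raw scores matching the integer z-scores -3 through +3
print([raw_score(z, 780, 40) for z in range(-3, 4)])
```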
EMPIRICAL RULE
The empirical rule (also called the 68-95-99.7 rule) applies specifically to symmetric bell-shaped data (not to skewed data!). In this case, about 68% of the values lie within 1 standard deviation of the mean, about 95% of the values lie within 2 standard deviations of the mean, and about 99.7% of the values lie within 3 standard deviations of the mean.
In the following figure the horizontal axis shows z-scores:
EXAMPLE 2.11
Suppose that taxicabs in New York City are driven an average of 75,000 miles per year with a standard deviation of 12,000 miles. What information does the empirical rule give us?
Answer: Assuming that the distribution is bell-shaped, we can conclude that approximately 68% of the taxis are driven between 63,000 and 87,000 miles per year, approximately 95% are driven between 51,000 and 99,000 miles, and virtually all are driven between 39,000 and 111,000 miles.
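The intervals in this answer come from adding and subtracting 1, 2, and 3 standard deviations. A minimal sketch follows; the function name is hypothetical, and the percentages are only meaningful under the stated assumption of a bell-shaped distribution.

```python
def empirical_rule_intervals(mu, sigma):
    """Intervals mu +/- k*sigma for k = 1, 2, 3, keyed by approximate percent.
    Valid only for roughly symmetric, bell-shaped data."""
    return {pct: (mu - k * sigma, mu + k * sigma)
            for k, pct in [(1, 68), (2, 95), (3, 99.7)]}

# New York City taxicabs (Example 2.11): mean 75,000 miles, SD 12,000 miles
for pct, (low, high) in empirical_rule_intervals(75_000, 12_000).items():
    print(f"about {pct}% between {low:,} and {high:,} miles")
```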
The empirical rule also gives a useful quick estimate of the standard deviation in terms of the range. We can see in the figure above that 95% of the data fall within a span of 4 standard deviations (from –2 to +2 on the z-score line) and 99.7% of the data fall within 6 standard deviations (from –3 to +3 on the z-score line). It is therefore reasonable to conclude that for these data the standard deviation is roughly between one-fourth and one-sixth of the range. Since we can find the range of a set almost immediately, the empirical rule technique for estimating the standard deviation is often helpful in pointing out gross arithmetic errors.
EXAMPLE 2.12
If the range of a bell-shaped data set is 60, what is an estimate for the standard deviation?
Answer: By the empirical rule, the standard deviation is expected to be between $\frac{1}{6}\cdot 60=10$ and $\frac{1}{4}\cdot 60=15$. If the standard deviation is calculated to be 0.32 or 87, there is probably an arithmetic error; a calculation of 12, however, is reasonable.
However, it must be stressed that the above use of the range is not intended to provide an accurate value for the standard deviation. It is simply a tool for pointing out unreasonable answers rather than, for example, blindly accepting computer outputs.
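This kind of sanity check can be expressed as a small helper. The sketch below is a rough screening tool under the bell-shaped assumption, not an estimator; the function names are our own.

```python
def plausible_sd_bounds(data_range):
    """Rough bounds from the empirical rule: range/6 up to range/4."""
    return data_range / 6, data_range / 4

def sd_looks_reasonable(sd, data_range):
    """Flag a computed standard deviation that falls outside the rough bounds."""
    low, high = plausible_sd_bounds(data_range)
    return low <= sd <= high

print(plausible_sd_bounds(60))  # (10.0, 15.0)
for candidate in (0.32, 12, 87):
    print(candidate, sd_looks_reasonable(candidate, 60))
```

A value that fails the check is not necessarily wrong (the data may not be bell-shaped), but it is worth recomputing.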
HISTOGRAMS AND MEASURES OF CENTRAL TENDENCY
Suppose we have a detailed histogram such as
Our measures of central tendency fit naturally into such a diagram.
The median divides a distribution in half, so it is represented by a line that divides the area of the histogram in half.
The mean is affected by the spacing of all the values. Therefore, if the histogram is considered to be a solid region, the mean corresponds to a line passing through the center of gravity, or balance point.
The above distribution, spread thinly far to the low side, is said to be skewed to the left. Note that in this case the mean is usually less than the median. Similarly, a distribution spread far to the high side is skewed to the right, and its mean is usually greater than its median.
EXAMPLE 2.13
Suppose that the faculty salaries at a college have a median of \$82,500 and a mean of \$88,700. What does this indicate about the shape of the distribution of the salaries?
Answer: The median is less than the mean, and so the salaries are probably skewed to the right. There are a few highly paid professors, with the bulk of the faculty at the lower end of the pay scale.
It should be noted that the above principle is a useful, but not hard-and-fast, rule.
EXAMPLE 2.14
The set given by the dotplot below is skewed to the right; however, its median (3) is greater than its mean (2.97).
HISTOGRAMS, Z-SCORES, AND PERCENTILE RANKINGS
We have seen that relative frequencies are represented by relative areas, and so labeling the vertical axis is not crucial. If we know the standard deviation, the horizontal axis can be labeled in terms of z-scores. In fact, if we are given the percentile rankings of various z-scores, we can construct a histogram.
EXAMPLE 2.15
Suppose we are asked to construct a histogram from these data:
z-score: | –2 | –1 | 0 | 1 | 2 |
Percentile ranking: | 0 | 20 | 60 | 70 | 100 |
We note that the entire area is less than z-score +2 and greater than z-score –2. Also, 20% of the area is between z-scores –2 and –1, 40% is between –1 and 0, 10% is between 0 and 1, and 30% is between 1 and 2. Thus the histogram is as follows:
Now suppose we are given four in-between z-scores as well:
z-score | Ranking |
2.0 | 100 |
1.5 | 80 |
1.0 | 70 |
0.5 | 65 |
0.0 | 60 |
–0.5 | 30 |
–1.0 | 20 |
–1.5 | 5 |
–2.0 | 0 |
With 1000 z-scores perhaps the histogram would look like
The height at any point is meaningless; what is important is relative areas. For example, in the final diagram above, what percentage of the area is between z-scores of +1 and +2?
Answer: Still 30%.
What percent is to the left of 0?
Answer: Still 60%.
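The arithmetic behind Example 2.15 is simply subtraction of consecutive percentile rankings. The following sketch, with helper names of our own choosing, recovers the bar areas from both tables and confirms that refining the table does not change the area over a wider interval.

```python
# (z-score, percentile ranking) pairs from Example 2.15
coarse = [(-2, 0), (-1, 20), (0, 60), (1, 70), (2, 100)]
fine = [(-2.0, 0), (-1.5, 5), (-1.0, 20), (-0.5, 30), (0.0, 60),
        (0.5, 65), (1.0, 70), (1.5, 80), (2.0, 100)]

def bar_areas(table):
    """Percent of the total area between each pair of consecutive z-scores."""
    return [(z_lo, z_hi, p_hi - p_lo)
            for (z_lo, p_lo), (z_hi, p_hi) in zip(table, table[1:])]

def area_between(table, z_from, z_to):
    """Percent of the total area between two z-scores listed in the table."""
    ranks = dict(table)
    return ranks[z_to] - ranks[z_from]

print([area for _, _, area in bar_areas(coarse)])  # [20, 40, 10, 30]
print(area_between(fine, 1, 2))                    # 30, same as the coarse table
print(area_between(fine, -2, 0))                   # 60
```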
BOXPLOTS
A boxplot (also called a box and whisker display) is a visual representation of dispersion that shows the smallest value, the largest value, the middle (median), the middle of the bottom half of the set (Q1), and the middle of the top half of the set (Q3).
TIP
The IQR is the length of the box, not the box itself. So the median is in the box, or is between Q1 and Q3, but is not in the IQR.
EXAMPLE 2.16
After an AP Statistics teacher hears that every one of her 27 students received a 3 or higher on the AP exam, she treats the class to a game of bowling. The individual student scores are 210, 130, 150, 140, 150, 210, 150, 125, 85, 200, 70, 150, 75, 90, 150, 115, 120, 125, 160, 140, 100, 95, 100, 215, 130, 160, and 200. The students note that the greatest score is 215, the smallest is 70, the middle is 140, the middle of the top half is 160, and the middle of the bottom half is 100. A boxplot of these five numbers is
TIP
Be careful about describing the shape of a distribution when all that one has is a boxplot. For example, “approximately normal” is never a possible conclusion.
Note that the display consists of two “boxes” together with two “whiskers”—hence the alternative name. The boxes show the spread of the two middle quarters; the whiskers show the spread of the two outer quarters. This relatively simple display conveys information not immediately available from histograms or stem and leaf displays.
Putting the above data into a list, for example, L1, on the TI-84, not only gives the five-number summary
1-Var Stats
minX=70
Q1=100
Med=140
Q3=160
MaxX=215
but also gives the boxplot itself using STAT PLOT, choosing the boxplot from among the six type choices, and then using ZoomStat or in WINDOW letting Xmin=0 and Xmax=225.
TIP
Note that a boxplot gives one measure of center (the median) and two measures of variability (the range and the IQR).
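For comparison with the calculator, here is a minimal Python sketch of the five-number summary for the bowling scores of Example 2.16. It assumes the "median of each half" convention for the quartiles, leaving the overall median out of both halves when the count is odd; other software may use slightly different quartile rules. The function name is our own.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, max, with Q1 and Q3 taken as the medians of the
    lower and upper halves (the middle value is excluded when n is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[n - half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

scores = [210, 130, 150, 140, 150, 210, 150, 125, 85, 200, 70, 150, 75, 90,
          150, 115, 120, 125, 160, 140, 100, 95, 100, 215, 130, 160, 200]

print(five_number_summary(scores))  # (70, 100, 140, 160, 215)
```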
On the TI-Nspire the data can be put in a list (here called index), and then a simultaneous multiple view is possible.
When a distribution is strongly skewed, or when it has pronounced outliers, drawing a boxplot with its five-number summary including median, quartiles, and extremes, gives a more useful description than calculating a mean and a standard deviation.
Usually, values more than 1.5 × IQR (1.5 times the interquartile range) outside the two boxes are plotted separately as outliers. (The TI-84 has a modified boxplot option. Note the two options in the second row of Type in StatPlot.)
EXAMPLE 2.17
Inputting the lengths of words in a selection of Shakespeare’s plays results in a calculator output of
Outliers consist of any word lengths less than Q1 – 1.5(IQR) = 3 – 1.5(5 – 3) = 0 or greater than Q3 + 1.5(IQR) = 5 + 1.5(5 – 3) = 8. A boxplot indicating outliers, together with a histogram (on the TI-84 up to three different graphs can be shown simultaneously) is
Note: Some computer output shows two levels of outliers—mild (between 1.5 IQR and 3 IQR) and extreme (more than 3 IQR). In this example, the word length of 12 would be considered an extreme outlier since it is greater than 5 + 3(5 – 3) = 11.
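The two levels of outliers can be sketched as a simple classification. The function name and the returned labels below are our own; the cutoffs are the 1.5 IQR and 3 IQR distances described in the note, applied to the quartiles Q1 = 3 and Q3 = 5 of Example 2.17.

```python
def classify(value, q1, q3):
    """Label a value using the 1.5*IQR (mild) and 3*IQR (extreme) cutoffs."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme outlier"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

# Word lengths with Q1 = 3 and Q3 = 5, so the cutoffs are 0 and 8 (mild), 11 (extreme)
for length in (1, 8, 9, 12):
    print(length, classify(length, 3, 5))
```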
It should be noted that two sets can have the same five-number summary and thus the same boxplots but have dramatically different distributions.
EXAMPLE 2.18
Let A = {0, 5, 10, 15, 25, 30, 35, 40, 45, 50, 71, 72, 73, 74, 75, 76, 77, 78, 100} and B = {0, 22, 23, 24, 25, 26, 27, 28, 29, 50, 55, 60, 65, 70, 75, 85, 90, 95, 100}. Simple inspection indicates very different distributions; however, the TI-84 gives identical boxplots with Min = 0, Q1 = 25, Med = 50, Q3 = 75, and Max = 100 for each.
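A short check of Example 2.18, again assuming the "median of each half" convention for quartiles, confirms that the two sets share a five-number summary even though their values cluster in different places. The helper names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, max using medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return (ordered[0], median(ordered[:half]), median(ordered),
            median(ordered[n - half:]), ordered[-1])

def count_between(values, low, high):
    """How many values fall in the closed interval [low, high]."""
    return sum(1 for v in values if low <= v <= high)

A = [0, 5, 10, 15, 25, 30, 35, 40, 45, 50, 71, 72, 73, 74, 75, 76, 77, 78, 100]
B = [0, 22, 23, 24, 25, 26, 27, 28, 29, 50, 55, 60, 65, 70, 75, 85, 90, 95, 100]

print(five_number_summary(A))  # (0, 25, 50, 75, 100)
print(five_number_summary(B))  # (0, 25, 50, 75, 100)

# Same summary, different clustering: compare the counts from 20 through 30.
print(count_between(A, 20, 30), count_between(B, 20, 30))  # 2 8
```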
CHANGING UNITS
Changing units, for example, from dollars to rubles or from miles to kilometers, is common in a world that seems to become smaller all the time. It is instructive to note how measures of center and spread are affected by such changes.
Adding the same constant to every value increases the mean and median by that same constant; however, the distances between the increased values stay the same, and so the range and standard deviation are unchanged.
EXAMPLE 2.19
A set of experimental measurements of the freezing point of an unknown liquid yields a mean of 25.32 degrees Celsius with a standard deviation of 1.47 degrees Celsius. If all the measurements are converted to the Kelvin scale, what are the new mean and standard deviation?
Answer: Kelvins are equivalent to degrees Celsius plus 273.15. The new mean is thus 25.32 + 273.15 = 298.47 kelvins. However, the standard deviation remains numerically the same, 1.47 kelvins. Graphically, you should picture the whole distribution moving over by the constant 273.15; the mean moves, but the standard deviation (which measures spread) doesn’t change.
Multiplying every value by the same constant multiplies the mean, median, range, and standard deviation all by that constant.
EXAMPLE 2.20
Measurements of the sizes of farms in an upstate New York county yield a mean of 59.2 hectares with a standard deviation of 11.2 hectares. If all the measurements are converted from hectares (metric system) to acres (one acre was originally the area a yoke of oxen could plow in one day), what are the new mean and standard deviation?
Answer: One hectare is equivalent to 2.471 acres. The new mean is thus 2.471 × 59.2 = 146.3 acres with a standard deviation of 2.471 × 11.2 = 27.7 acres. Graphically, multiplying each value by the constant 2.471 both moves and spreads out the distribution.
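Both effects can be seen side by side in a short sketch. The farm sizes and the shift of 10 below are made-up values used only for illustration; the factor 2.471 acres per hectare is the one quoted in Example 2.20.

```python
from math import isclose
from statistics import mean, stdev

hectares = [44.0, 52.5, 58.0, 61.5, 66.0, 73.2]  # hypothetical farm sizes
ACRES_PER_HECTARE = 2.471                         # conversion factor from Example 2.20
SHIFT = 10.0                                      # hypothetical constant added to every value

shifted = [x + SHIFT for x in hectares]
acres = [x * ACRES_PER_HECTARE for x in hectares]

# Adding a constant moves the mean but leaves the standard deviation alone.
print(round(mean(shifted) - mean(hectares), 3), round(stdev(shifted) - stdev(hectares), 3))

# Multiplying by a constant multiplies both the mean and the standard deviation.
print(round(mean(acres) / mean(hectares), 3), round(stdev(acres) / stdev(hectares), 3))
```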
SUMMARY
The two principal measurements of the center of a distribution are the mean and the median.
The principal measurements of the spread of a distribution are the range (maximum value minus minimum value), the interquartile range (IQR = Q3 – Q1), the variance, and the standard deviation.
Adding the same constant to every value in a set adds the same constant to the mean and median but leaves all the above measures of spread unchanged.
Multiplying every value in a set by the same constant multiplies the mean, median, range, IQR, and standard deviation by that constant.
The mean, range, variance, and standard deviation are sensitive to extreme values, while the median and interquartile range are not.
The principal measurements of position are simple ranking, percentile ranking, and the z-score (which measures the number of standard deviations from the mean).
The empirical rule (the 68-95-99.7 rule) applies specifically to symmetric bell-shaped data.
In skewed left data, the mean is usually less than the median, while in skewed right data, the mean is usually greater than the median.
Boxplots visually show the five-number summary: the minimum value, the first quartile (Q1), the median, the third quartile (Q3), and the maximum value; and usually indicate outliers as distinct points.
Note that two sets can have the same five-number summary and thus the same boxplots but have dramatically different distributions.
Multiple-Choice Questions
Directions: The questions or incomplete statements that follow are each followed by five suggested answers or completions. Choose the response that best answers the question or completes the statement.
1. The graph below shows household income in Laguna Woods, California.
What can be said about the ratio $\frac{\text{Mean family income}}{\text{Median family income}}$?
(A) Approximately zero
(B) Less than one, but definitely above zero
(C) Approximately one
(D) Greater than one
(E) Cannot be answered without knowing the standard deviation
2. Which of the following are true statements?
I. The range of the sample data set is never greater than the range of the population.
II. The interquartile range is half the distance between the first quartile and the third quartile.
III. While the range is affected by outliers, the interquartile range is not.
(A) I only
(B) II only
(C) III only
(D) I and II
(E) I and III
3. Dieticians are concerned about sugar consumption in teenagers’ diets (a 12-ounce can of soft drink typically has 10 teaspoons of sugar). In a random sample of 55 students, the number of teaspoons of sugar consumed for each student on a randomly selected day is tabulated. Summary statistics are noted below:
Min = 10 Max = 60 First quartile = 25 Third quartile = 38
Median = 31 Mean = 31.4 n = 55 s = 11.6
Which of the following is a true statement?
(A) None of the values are outliers.
(B) The value 10 is an outlier, and there can be no others.
(C) The value 60 is an outlier, and there can be no others.
(D) Both 10 and 60 are outliers, and there may be others.
(E) The value 60 is an outlier, and there may be others at the high end of the data set.
Problems 4–6 refer to the following five boxplots.
4. To which of the above boxplots does the following histogram correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
5. To which of the above boxplots does the following histogram correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
6. To which of the above boxplots does the following histogram correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
Problems 7–9 refer to the following five histograms:
7. To which of the above histograms does the following boxplot correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
8. To which of the above histograms does the following boxplot correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
9. To which of the above histograms does the following boxplot correspond?
(A) A
(B) B
(C) C
(D) D
(E) E
10. Below is a boxplot of CO2 levels (in grams per kilometer) for a sampling of 2008 vehicles.
Suppose follow-up testing determines that the low outlier should be 10 grams per kilometer less and the two high outliers should each be 5 grams per kilometer greater. What effect, if any, will these changes have on the mean and median CO2 levels?
(A) Both the mean and median will be unchanged.
(B) The median will be unchanged, but the mean will increase.
(C) The median will be unchanged, but the mean will decrease.
(D) The mean will be unchanged, but the median will increase.
(E) Both the mean and median will change.
11. Below is a boxplot of yearly tuition and fees of all four-year colleges and universities in a Western state. The low outlier is from a private university that gives full scholarships to all accepted students, while the high outlier is from a private college catering to the very rich.
Removing both outliers will effect what changes, if any, on the mean and median costs for this state’s four-year institutions of higher learning?
(A) Both the mean and the median will be unchanged.
(B) The median will be unchanged, but the mean will increase.
(C) The median will be unchanged, but the mean will decrease.
(D) The mean will be unchanged, but the median will increase.
(E) Both the mean and median will change.
12. Suppose the average score on a national test is 500 with a standard deviation of 100. If each score is increased by 25, what are the new mean and standard deviation?
(A) 500, 100
(B) 500, 125
(C) 525, 100
(D) 525, 105
(E) 525, 125
13. Suppose the average score on a national test is 500 with a standard deviation of 100. If each score is increased by 25%, what are the new mean and standard deviation?
(A) 500, 100
(B) 525, 100
(C) 625, 100
(D) 625, 105
(E) 625, 125
14. If quartiles Q1 = 20 and Q3 = 30, which of the following must be true?
I. The median is 25.
II. The mean is between 20 and 30.
III. The standard deviation is at most 10.
(A) I only
(B) II only
(C) III only
(D) All are true.
(E) None are true.
15. A 1995 poll by the Program for International Policy asked respondents what percentage of the U.S. budget they thought went to foreign aid. The mean response was 18%, and the median was 15%. (The actual amount is less than 1%.) What do these responses indicate about the likely shape of the distribution of all the responses?
(A) The distribution is skewed to the left.
(B) The distribution is skewed to the right.
(C) The distribution is symmetric around 16.5%.
(D) The distribution is bell-shaped with a standard deviation of 3%.
(E) The distribution is uniform between 15% and 18%.
16. Assuming that batting averages have a bell-shaped distribution, arrange in ascending order:
I. An average with a z-score of –1.
II. An average with a percentile rank of 20%.
III. An average at the first quartile, Q1.
(A) I, II, III
(B) III, I, II
(C) II, I, III
(D) II, III, I
(E) III, II, I
17. Which of the following are true statements?
I. If the sample has variance zero, the variance of the population is also zero.
II. If the population has variance zero, the variance of the sample is also zero.
III. If the sample has variance zero, the sample mean and the sample median are equal.
(A) I and II
(B) I and III
(C) II and III
(D) I, II, and III
(E) None of the above gives the complete set of true responses.
18. When there are multiple gaps and clusters, which of the following is the best choice to give an overall picture of a distribution?
(A) Mean and standard deviation
(B) Median and interquartile range
(C) Boxplot with its five-number summary
(D) Stemplot or histogram
(E) None of the above are really helpful in showing gaps and clusters.
19. Suppose the starting salaries of a graduating class are as follows:
| Number of Students | Starting Salary ($) |
|---|---|
| 10 | 15,000 |
| 17 | 20,000 |
| 25 | 25,000 |
| 38 | 30,000 |
| 27 | 35,000 |
| 21 | 40,000 |
| 12 | 45,000 |
What is the mean starting salary?
(A) $30,000
(B) $30,533
(C) $32,500
(D) $32,533
(E) $35,000
20. When a set of data has suspect outliers, which of the following are preferred measures of central tendency and of variability?
(A) mean and standard deviation
(B) mean and variance
(C) mean and range
(D) median and range
(E) median and interquartile range
21. If the standard deviation of a set of observations is 0, you can conclude
(A) that there is no relationship between the observations.
(B) that the average value is 0.
(C) that all observations are the same value.
(D) that a mistake in arithmetic has been made.
(E) none of the above.
22. A teacher is teaching two AP Statistics classes. On the final exam, the 20 students in the first class averaged 92 while the 25 students in the second class averaged only 83. If the teacher combines the classes, what will the average final exam score be?
(A) 87
(B) 87.5
(C) 88
(D) None of the above
(E) More information is needed to make this calculation.
23. Suppose 10% of a data set lie between 40 and 60. If 5 is first added to each value in the set and then each result is doubled, which of the following is true?
(A) 10% of the resulting data will lie between 85 and 125.
(B) 10% of the resulting data will lie between 90 and 130.
(C) 15% of the resulting data will lie between 80 and 120.
(D) 20% of the resulting data will lie between 45 and 65.
(E) 30% of the resulting data will lie between 85 and 125.
24. The 70 highest dams in the world have an average height of 206 meters with a standard deviation of 35 meters. The Hoover and Grand Coulee dams have heights of 221 and 168 meters, respectively. The Russian dams, the Nurek and Charvak, have heights with z-scores of +2.69 and –1.13, respectively. List the dams in order of ascending size.
(A) Charvak, Grand Coulee, Hoover, Nurek
(B) Charvak, Grand Coulee, Nurek, Hoover
(C) Grand Coulee, Charvak, Hoover, Nurek
(D) Grand Coulee, Charvak, Nurek, Hoover
(E) Grand Coulee, Hoover, Charvak, Nurek
25. The first 115 Kentucky Derby winners by color of horse were as follows: roan, 1; gray, 4; chestnut, 36; bay, 53; dark bay, 17; and black, 4. (You should “bet on the bay!”) Which of the following visual displays is most appropriate?
(A) Bar chart
(B) Histogram
(C) Stemplot
(D) Boxplot
(E) Time plot
For Questions 26 and 27 consider the following: The graph below shows cumulative proportions plotted against grade point averages for a large public high school.
26. What is the median grade point average?
(A) 0.8
(B) 2.0
(C) 2.4
(D) 2.5
(E) 2.6
27. What is the interquartile range?
(A) 1.0
(B) 1.8
(C) 2.4
(D) 2.8
(E) 4.0
28. The following dotplot shows the speeds (in mph) of 100 fastballs thrown by a major league pitcher.
Which of the following is the best estimate of the standard deviation of these speeds?
(A) 0.5 mph
(B) 1.1 mph
(C) 1.6 mph
(D) 2.2 mph
(E) 6.0 mph
FREE-RESPONSE QUESTIONS
Directions: You must show all work and indicate the methods you use. You will be graded on the correctness of your methods and on the accuracy of your final answers.
FIVE OPEN-ENDED QUESTIONS
1. Victims spend from 5 to 5840 hours repairing the damage caused by identity theft with a mean of 330 hours and a standard deviation of 245 hours.
(a) What would be the mean, range, standard deviation, and variance for hours spent repairing the damage caused by identity theft if each of the victims spent an additional 10 hours?
(b) What would be the mean, range, standard deviation, and variance for hours spent repairing the damage caused by identity theft if each of the victims’ hours spent increased by 10%?
2. In a study of all school districts in a state, the median 4-year graduation rate was 78.0% with Q1 = 60.4% and Q3 = 82.6%. The only rates below Q1 or above Q3 were 26.4%, 32.2%, 49.0%, 57.9%, 88.3%, and 98.1%.
(a) Draw a boxplot.
(b) Describe the distribution.
(c) Is the mean 4-year graduation rate probably close to, below, or above 78.0%? Explain.
(d) Would a stemplot give more, less, or basically the same information?
3. The Children’s Health Insurance Program (CHIP) provides health benefits to children from families whose incomes exceed the eligibility for Medicaid. Each state sets its own eligibility criteria. The following boxplot shows recent yearly expenditures on this program by state.
(a) What are the median and interquartile range of the distribution of yearly state expenditures in the CHIP program?
(b) Suppose the federal government takes over three million dollars of administrative costs from the state CHIP expenditures. What are the median and interquartile range of the new reduced expenditure distribution?
(c) Suppose instead the federal government picks up the tab for half of all state CHIP expenditures. What are the median and interquartile range of this new reduced expenditure distribution?
(d) Based on the above boxplot, which of the following is the most reasonable value for the mean state expenditure (in millions of dollars): 78, 135, 325, 630, or 750? Explain.
4. Suppose a distribution has mean 300 and standard deviation 25. If the z-score of Q1 is –0.7 and the z-score of Q3 is 0.7, what values would be considered to be outliers?
5. (a) Draw the boxplot for the data set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Very different sets can have the same boxplot.
(b) Find the data set with sample size n = 11, with each value an integer between 0 and 10, and with the same boxplot as in (a), which has the smallest possible mean.
(c) Find the data set with sample size n = 11, with each value an integer between 0 and 10, and with the same boxplot as in (a), which has the largest possible standard deviation.
AN INVESTIGATIVE TASK
A measure of variability is the median absolute deviation (MAD) defined as the median deviation from the median, that is, as the median of the absolute values of the deviations from the median. For example, the median of {1, 3, 7, 10, 11, 12} is 8.5, the absolute deviations from the median are {7.5, 5.5, 1.5, 1.5, 2.5, 3.5}, and the median of these deviations, MAD, is (2.5 + 3.5)/2 = 3.
The 12 students in an AP Statistics class all score above 33 (the cutoff score that year for achieving a 3 or above): {35, 38, 38, 42, 44, 48, 50, 52, 56, 60, 62, 71}.
(a) Calculate the median of these data.
(b) Calculate the MAD for these data. Show your work.
(c) Show that half of these data values are closer than one MAD to the median and half are further than one MAD from the median.
(d) How would the calculation of MAD have changed if the top score was 76 rather than 71? Justify your answer.
(e) What does the answer to (d) say about one difference between the two measures of variability: MAD versus standard deviation?
MULTIPLE-CHOICE
1. (D) The distribution is clearly skewed right, so the mean is greater than the median, and the ratio is greater than one.
2. (E) All elements of the sample are taken from the population, and so the smallest value in the sample cannot be less than the smallest value in the population; similarly, the largest value in the sample cannot be greater than the largest value in the population. The interquartile range is the full distance between the first quartile and the third quartile. Outliers are extreme values, and while they may affect the range, they do not affect the interquartile range when the lower and upper quarters have been removed before calculation.
3. (E) Outliers are any values below Q1 – 1.5(IQR) = 5.5 or above Q3 + 1.5(IQR) = 57.5.
4. (A) The value 50 seems to split the area under the histogram in two, so the median is about 50. Furthermore, the histogram is skewed to the left with a tail from 0 to 30.
5. (B) Looking at areas under the curve, Q1 appears to be around 20, the median is around 30, and Q3 is about 40.
6. (C) Looking at areas under the curve, Q1 appears to be around 10, the median is around 30, and Q3 is about 50.
7. (C) The boxplot indicates that 25% of the data lie in each of the intervals 10–20, 20–35, 35–40, and 40–50. Counting boxes, only histogram C has this distribution.
8. (D) The boxplot indicates that 25% of the data lie in each of the intervals 10–15, 15–25, 25–35, and 35–50. Counting boxes, only histogram D has this distribution.
9. (E) The boxplot indicates that 25% of the data lie in each of the intervals 10–20, 20–30, 30–40, and 40–50. Counting boxes, only histogram E has this distribution.
10. (A) Subtracting 10 from one value and adding 5 to two values leaves the sum of the values unchanged, so the mean will be unchanged. Exactly what values the outliers take will not change what value is in the middle, so the median will be unchanged.
11. (C) The high outlier is further from the mean than is the low outlier, so removing both will decrease the mean. However, removing the lowest and highest values will not change what value is in the middle, so the median will be unchanged.
12. (C) Adding the same constant to every value increases the mean by that same constant; however, the distances between the increased values and the increased mean stay the same, and so the standard deviation is unchanged. Graphically, you should picture the whole distribution as moving over by a constant; the mean moves, but the standard deviation (which measures spread) doesn’t change.
13. (E) Multiplying every value by the same constant multiplies both the mean and the standard deviation by that constant. Graphically, increasing each value by 25% (multiplying by 1.25) both moves and spreads out the distribution.
14. (E) The median is somewhere between 20 and 30, but not necessarily at 25. Even a single very large score can result in a mean over 30 and a standard deviation over 10.
15. (B) The median is less than the mean, and so the responses are probably skewed to the right; there are a few high guesses, with most of the responses on the lower end of the scale.
16. (A) Given that the empirical rule applies, a z-score of –1 has a percentile rank of about 16%. The first quartile Q1 has a percentile rank of 25%.
17. (C) If the variance of a set is zero, all the values in the set are equal. If all the values of the population are equal, the same holds true for any subset; however, if all the values of a subset are the same, this may not be true of the whole population. If all the values in a set are equal, the mean and the median both equal this common value and so equal each other.
18. (D) Stemplots and histograms can show gaps and clusters that are hidden when one simply looks at calculations such as mean, median, standard deviation, quartiles, and extremes.
19. (B) There are a total of 10 + 17 + 25 + 38 + 27 + 21 + 12 = 150 students. Their total salary is 10(15,000) + 17(20,000) + 25(25,000) + 38(30,000) + 27(35,000) + 21(40,000) + 12(45,000) = \$4,580,000. The mean is $\frac{4{,}580{,}000}{150} \approx 30{,}533$ dollars.
20. (E) The mean, standard deviation, variance, and range are all affected by outliers; the median and interquartile range are not.
21. (C) Because of the squaring operation in the definition, the standard deviation (and also the variance) can be zero only if all the values in the set are equal.
22. (A) The sum of the scores in one class is 20 × 92 = 1840, while the sum in the other is 25 × 83 = 2075. The total sum is 1840 + 2075 = 3915. There are 20 + 25 = 45 students, and so the average score is $\frac{3915}{45} = 87$.
23. (B) Increasing every value by 5 gives 10% between 45 and 65, and then doubling gives 10% between 90 and 130.
24. (A) 206 + 2.69(35) = 300; 206 – 1.13(35) = 166.
25. (A) Bar charts are used for categorical variables.
26. (C) The median corresponds to the 0.5 cumulative proportion.
27. (A) The 0.25 and 0.75 cumulative proportions correspond to Q1 = 1.8 and Q3 = 2.8, respectively, and so the interquartile range is 2.8 – 1.8 = 1.0.
28. (B) With bell-shaped data the empirical rule applies, giving that the spread from 92 to 98 is roughly 6 standard deviations, and so one SD is about 1.
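The arithmetic in answers 19 and 22 is the same idea in both cases: a mean computed from values weighted by how often each occurs. A minimal sketch, assuming Python with no external libraries (the helper name `weighted_mean` is introduced here only for illustration):

```python
def weighted_mean(values, counts):
    """Mean of `values` where each value occurs `counts[i]` times."""
    total = sum(value * count for value, count in zip(values, counts))
    return total / sum(counts)


# Question 19: starting salaries and the number of students earning each.
salaries = [15_000, 20_000, 25_000, 30_000, 35_000, 40_000, 45_000]
students = [10, 17, 25, 38, 27, 21, 12]
mean_salary = weighted_mean(salaries, students)

# Question 22: class averages weighted by class size.
combined_average = weighted_mean([92, 83], [20, 25])

print(f"Mean starting salary: ${mean_salary:,.0f}")
print(f"Combined class average: {combined_average}")
```

Note that simply averaging the two class means, (92 + 83)/2 = 87.5, ignores the different class sizes, which is why choice (B) in question 22 is a tempting wrong answer.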
FREE-RESPONSE
1. (a) Adding 10 to each value increases the mean by 10, but leaves measures of variability unchanged, so the new mean is 340 hours while the range stays at 5835 hours, the standard deviation remains at 245 hours, and the variance remains at $245^2 = 60{,}025$ hr$^2$.
(b) Increasing each value by 10% (multiplying by 1.10) will increase the mean to 1.1(330) = 363 hours, the range to 1.1(5835) = 6418.5 hours, the standard deviation to 1.1(245) = 269.5 hours, and the variance to $(269.5)^2 = 72{,}630.25$ hr$^2$. (Note that the variance increases by a multiple of $(1.1)^2$, not by a multiple of 1.1.)
2. (a) Check for outliers: IQR = 82.6 – 60.4 = 22.2. Q1 – 1.5(IQR) = 27.1 while Q3 + 1.5(IQR) = 115.9, so the only outlier is 26.4.
(b) A complete answer considers shape, center, and spread.
Shape: appears skewed left with an outlier at 26.4
Center: median is 78.0
Spread: from 26.4 to 98.1
(c) When the distribution is skewed left, the mean is usually less than the median.
(d) A stemplot would show more information because it shows all the original data, not just the few values given above; a stemplot can show clusters and gaps which are hidden by a boxplot.
3. (a) The median is 77.5 (millions of dollars), and the IQR = Q3 – Q1 = 159.2 – 33.4 = 125.8 (millions of dollars).
(b) Reducing every value by 3 will reduce the median by 3 but leave measures of variability unchanged, so the new median is 77.5 – 3 = 74.5 (millions of dollars), and the IQR will still be 125.8 (millions of dollars).
(c) Reducing every value by 50% reduces the median to (0.5)(77.5) = 38.75 (millions of dollars) and reduces the IQR to (0.5)(125.8) = 62.9 (millions of dollars).
(d) The boxplot indicates that the distribution is skewed right, so the mean will be greater than the median. It is unlikely that the two outliers will pull the mean out as far as 325, so the most reasonable value for the mean is 135 (millions of dollars).
4. Z-scores give the number of standard deviations from the mean, so
Q1 = 300 – 0.7(25) = 282.5 and Q3 = 300 + 0.7(25) = 317.5.
The interquartile range is IQR = 317.5 – 282.5 = 35, and 1.5(IQR) = 1.5(35) = 52.5.
The standard definition of outliers encompasses all values less than Q1 – 52.5 = 230 and all values greater than Q3 + 52.5 = 370.
5. (a)
(b) Note that we must keep Min = 0, Q1 = 2, Med = 5, Q3 = 8, and Max = 10, with the same number of values between each pair of these, so move the in-between values as far to the left as possible, and the answer is
{0, 0, 2, 2, 2, 5, 5, 5, 8, 8, 10}.
(c) Note that we must keep Min = 0, Q1 = 2, Med = 5, Q3 = 8, and Max = 10, with the same number of values between each pair of these, so move the in-between values as far outward as possible, and the answer is
{0, 0, 2, 2, 2, 5, 8, 8, 8, 10, 10}.
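The answers to question 5 can be checked by computing the five-number summary of each set. The sketch below assumes Python and one common quartile convention (Q1 and Q3 are the medians of the lower and upper halves, excluding the overall median when n is odd); calculators and software that use other conventions may report slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data, leaving out the middle value when n is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )


original = list(range(11))                          # part (a)
smallest_mean = [0, 0, 2, 2, 2, 5, 5, 5, 8, 8, 10]   # part (b)
largest_sd = [0, 0, 2, 2, 2, 5, 8, 8, 8, 10, 10]     # part (c)

for label, data in [("(a)", original), ("(b)", smallest_mean), ("(c)", largest_sd)]:
    print(label, five_number_summary(data),
          f"mean={statistics.mean(data):.2f}",
          f"sd={statistics.stdev(data):.2f}")
```

All three sets should report the same five-number summary, while the mean and standard deviation differ, which is the point of the exercise.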
INVESTIGATIVE TASK
(a) The median of the data values is (48 + 50)/2 = 49.
(b) The absolute deviations from the median are {14, 11, 11, 7, 5, 1, 1, 3, 7, 11, 13, 22}.
In ascending order, these deviations are {1, 1, 3, 5, 7, 7, 11, 11, 11, 13, 14, 22}.
The median of these deviations is MAD = (7 + 11)/2 = 9.
(c) One MAD less than the median is 49 – 9 = 40, and one MAD greater than the median is 49 + 9 = 58. Half of the values (6 values) are between 40 and 58: {42, 44, 48, 50, 52, 56} are all between 40 and 58, whereas half of the values (6 values) are either less than 40 or greater than 58: {35, 38, 38, 60, 62, 71}.
(d) If the top score was 76 rather than 71, the median of the data values would still be 49. The greatest deviation would be 27 rather than 22, but the median deviation would still be 9.
(e) The presence of outliers does not change the value of the MAD (MAD is resistant to outliers). In contrast, the standard deviation (SD) is very sensitive to the presence of outliers (the square of every deviation from the mean enters the SD calculation).
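The MAD calculation in this task can be reproduced in a few lines. This is a minimal sketch in Python using the standard library; the function name `median_absolute_deviation` is introduced here for illustration and is not a built-in.

```python
import statistics


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])


# Worked example from the statement of the task.
print(median_absolute_deviation([1, 3, 7, 10, 11, 12]))

scores = [35, 38, 38, 42, 44, 48, 50, 52, 56, 60, 62, 71]
changed = scores[:-1] + [76]   # part (d): top score 76 instead of 71

for label, data in [("top score 71", scores), ("top score 76", changed)]:
    print(f"{label}: MAD={median_absolute_deviation(data)}, "
          f"SD={statistics.stdev(data):.2f}")
```

Comparing the two printed rows shows the contrast described in part (e): the MAD is the same for both sets, while the standard deviation grows when the top score increases.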
DOTPLOTS
DOUBLE BAR CHARTS
BACK-TO-BACK STEMPLOTS
PARALLEL BOXPLOTS
CUMULATIVE FREQUENCY PLOTS
Many real-life applications of statistics involve comparisons of two populations. Such comparisons can involve modifications of graphical displays such as dotplots, bar charts, stemplots, boxplots, and cumulative frequency plots to portray both sets simultaneously.
TIP
When asked for a comparison, don’t forget to address shape, outliers (unusual values), center and spread (SOCS or CUSS), and to refer to context. You must use comparative words, that is, you must state which center and which spread is larger (or if they are approximately the same). Simply making two separate lists is not enough and will be penalized.
EXAMPLE 3.1
The caloric intakes of 25 people on each of two weight loss programs are recorded as follows:
Program A: 1000, 1000, 1100, 1100, 1100, 1200, 1200, 1200, 1200, 1300, 1300, 1300, 1300, 1300, 1400, 1400, 1400, 1400, 1400, 1500, 1500, 1600, 1600, 1700, 1900
Program B: 1000, 1100, 1100, 1200, 1200, 1200, 1300, 1300, 1300, 1400, 1400, 1400, 1400, 1500, 1500, 1500, 1500, 1500, 1600, 1600, 1600, 1700, 1700, 1800, 1800
These data can be compared with dotplots, one above the other, using the same horizontal scale.
Program A appears to be associated with a lower average caloric intake than Program B. Comparing shape, center, and spread, we have:
TIP
Don’t forget to label and provide a scale for all graphs!
Shape: We see that both sets of data are roughly bell-shaped (the empirical rule applies), and Program A has an outlier at 1900 calories (while Program B has no outliers).
Center: Visually, or by counting dots, the centers of the two distributions are 1300 and 1400 calories, for Programs A and B respectively. (A calculator gives means of $\bar{x}_A = 1336$ and $\bar{x}_B = 1424$.) By any method, the center for Program B is higher.
Spread: The spreads are approximately the same, 1000 to 1900 calories for Program A and 1000 to 1800 calories for Program B. (A calculator gives standard deviations of $s_A = 218$ and $s_B = 218$.)
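The calculator values quoted in Example 3.1 can be reproduced directly from the listed data. The sketch below assumes Python with the standard library and also prints a rough text version of the two dotplots (rotated, with one row per intake value) on a shared scale; the layout is only an illustration, not the book's figure.

```python
import statistics
from collections import Counter

program_a = [1000, 1000, 1100, 1100, 1100, 1200, 1200, 1200, 1200,
             1300, 1300, 1300, 1300, 1300, 1400, 1400, 1400, 1400, 1400,
             1500, 1500, 1600, 1600, 1700, 1900]
program_b = [1000, 1100, 1100, 1200, 1200, 1200, 1300, 1300, 1300,
             1400, 1400, 1400, 1400, 1500, 1500, 1500, 1500, 1500,
             1600, 1600, 1600, 1700, 1700, 1800, 1800]

count_a, count_b = Counter(program_a), Counter(program_b)

# Rotated text dotplots on the same scale: one row per caloric intake.
for calories in range(1000, 2000, 100):
    dots_a = "*" * count_a[calories]
    dots_b = "*" * count_b[calories]
    print(f"{calories:>5} | A: {dots_a:<5} | B: {dots_b}")

mean_a, mean_b = statistics.mean(program_a), statistics.mean(program_b)
sd_a, sd_b = statistics.stdev(program_a), statistics.stdev(program_b)
print(f"Program A: mean={mean_a:.0f}, s={sd_a:.0f}")
print(f"Program B: mean={mean_b:.0f}, s={sd_b:.0f}")
```

Here `statistics.stdev` is the sample standard deviation (dividing by n – 1), which matches the s reported by a calculator.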
EXAMPLE 3.2
A study tabulated the percentages of young adults who recognized various photographs as follows: Joe Stalin (10%), Joe Camel (95%), Senator Simpson of Wyoming (5%), Bart Simpson (80%), Al Gore (30%), Al Bundy (60%), Mickey Mantle (20%), Mickey Mouse (100%), Charlie Chaplin (25%), Charlie the Tuna (90%). These data can be illustrated with a bar chart appropriately displayed in pairs of bars.
The pairs of bar graphs visually indicate the reason for concern felt by some educators.
EXAMPLE 3.3
In a 40-year study, survival years were measured for cancer patients undergoing one of two different chemotherapy treatments. The data for 25 patients on the first drug and 30 on the second were as follows:
Drug A: 5, 10, 17, 39, 29, 25, 20, 4, 8, 31, 21, 3, 12, 11, 19, 10, 4, 22, 17, 18, 13, 28, 11, 14, 21
Drug B: 19, 12, 20, 28, 22, 35, 1, 21, 21, 26, 18, 28, 29, 20, 15, 32, 31, 24, 22, 26, 18, 20, 22, 35, 30, 18, 25, 24, 19, 21
In drawing a back-to-back stemplot of the above data, we place a vertical line on each side of the column of stems and then arrange one set of leaves extending out to the right while the other extends out to the left.
Note that even though drug A showed the longest-surviving patient (39 years) and drug B showed the shortest-surviving patient (1 year), the back-to-back stemplot indicates that the bulk of patients on drug B survived longer than the bulk of patients on drug A.
Comparing shape, center, and spread, we have:
Shape: Both distributions are roughly bell-shaped (the empirical rule applies). The drug A distribution appears to have a high outlier at 39, while the drug B distribution appears to have a low outlier at 1.
Center: Visually, or by counting values, the centers of the two distributions are 17 and 22, respectively. (A calculator gives means of $\bar{x}_A = 16.48$ and $\bar{x}_B = 22.73$.) By either method, the drug B distribution has a greater center.
Spread: The spreads are 3 to 39 survival years for drug A and 1 to 35 survival years for drug B. (A calculator gives standard deviations of $s_A = 9.25$ and $s_B = 6.97$.) By either method, the drug A distribution has a greater spread than the drug B distribution.
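A back-to-back stemplot like the one described in Example 3.3 can be built as plain text. The following is a minimal sketch in Python, assuming two-digit data with tens as stems and units as leaves; the function name `back_to_back_stemplot` and the exact spacing are illustrative choices, not the book's layout.

```python
from collections import defaultdict

drug_a = [5, 10, 17, 39, 29, 25, 20, 4, 8, 31, 21, 3, 12, 11, 19,
          10, 4, 22, 17, 18, 13, 28, 11, 14, 21]
drug_b = [19, 12, 20, 28, 22, 35, 1, 21, 21, 26, 18, 28, 29, 20, 15,
          32, 31, 24, 22, 26, 18, 20, 22, 35, 30, 18, 25, 24, 19, 21]


def back_to_back_stemplot(left, right):
    """Return text lines: left leaves | stem | right leaves."""
    left_leaves, right_leaves = defaultdict(list), defaultdict(list)
    for x in sorted(left):
        left_leaves[x // 10].append(x % 10)
    for x in sorted(right):
        right_leaves[x // 10].append(x % 10)

    width = max(len(leaves) for leaves in left_leaves.values())
    stems = range(min(left + right) // 10, max(left + right) // 10 + 1)

    lines = []
    for stem in stems:
        # Left-hand leaves are reversed so they increase away from the stem.
        left_text = "".join(str(d) for d in reversed(left_leaves[stem]))
        right_text = "".join(str(d) for d in right_leaves[stem])
        lines.append(f"{left_text:>{width}} | {stem} | {right_text}")
    return lines


print("Drug A on the left, drug B on the right (stem = tens, leaf = units)")
for line in back_to_back_stemplot(drug_a, drug_b):
    print(line)
```

In the output, the long row of drug A leaves on the 1 stem and the long row of drug B leaves on the 2 stem reflect the observation that the bulk of patients on drug B survived longer.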
EXAMPLE 3.4
Mail-order labs and 1-hour minilabs were compared with regard to price for developing and printing one 24-exposure roll of 35-millimeter color-print film. Prices included shipping and handling charges where applicable. Following is a computer output describing the results:
For mail-order labs:
Mean = 5.37 Standard deviation = 1.92 Min = 3.51
Max = 8.00 N = 18 Median = 4.77
Quartiles = 3.92, 6.45
For 1-hour minilabs:
Mean = 10.11 Standard deviation = 1.32 Min = 8.58
Max = 11.95 N = 15 Median = 10.08
Quartiles = 8.97, 11.51
In drawing parallel boxplots (also called side-by-side boxplots) of the above data, we place both on the same diagram:
Boxplots show the minimum, maximum, median, and quartile values. The distribution of mail-order lab prices is lower and more spread out than that of prices of 1-hour minilabs. Both are slightly skewed toward the upper end (the skewness can also be noted from the computer output showing the mean to be greater than the median in both cases).
Parallel boxplots are useful in presenting a picture of the comparison of several distributions.
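Before drawing boxplots from summary output such as that in Example 3.4, it can help to check whether either extreme falls beyond the usual 1.5(IQR) fences. The sketch below assumes Python and uses only the summary statistics printed in the example; the helper names are introduced here for illustration. Because only the minimum and maximum are known, it can detect whether any outlier exists but not how many.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) fences Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def extremes_within_fences(summary):
    """True when neither the minimum nor the maximum lies beyond a fence."""
    low, high = outlier_fences(summary["q1"], summary["q3"])
    return low <= summary["min"] and summary["max"] <= high


# Summary statistics copied from the computer output in Example 3.4.
labs = {
    "mail-order": {"min": 3.51, "q1": 3.92, "median": 4.77,
                   "q3": 6.45, "max": 8.00},
    "1-hour minilab": {"min": 8.58, "q1": 8.97, "median": 10.08,
                       "q3": 11.51, "max": 11.95},
}

for name, summary in labs.items():
    low, high = outlier_fences(summary["q1"], summary["q3"])
    iqr = summary["q3"] - summary["q1"]
    print(f"{name}: IQR={iqr:.2f}, fences=({low:.3f}, {high:.3f}), "
          f"no outliers: {extremes_within_fences(summary)}")
```

The larger IQR for mail-order labs in the output is consistent with the remark that their prices are more spread out.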
EXAMPLE 3.5
Following are parallel boxplots showing the daily price fluctuations of a certain common stock over the course of 5 years. What trends do the boxplots show?
The parallel boxplots show that from year to year the median daily stock price has steadily risen 20 points from about \$58 to about \$78, the third quartile value has been roughly stable at about \$84, the yearly low has never decreased from that of the previous year, and the interquartile range has never increased from one year to the next.
EXAMPLE 3.6
The graph below compares cumulative frequency plotted against age for the U.S. population in 1860 and in 1980.
How do the medians and interquartile ranges compare?
Answer: Looking across from .5 on the vertical axis, we see that in 1860 half the population was under the age of 20, while in 1980 all the way up to age 32 must be included to encompass half the population. Looking across from .25 and .75 on the vertical axis, we see that for 1860, Q1 = 9 and Q3 = 35 and so the interquartile range is 35 – 9 = 26 years, while for 1980, Q1 = 16 and Q3 = 50 and so the interquartile range is 50 – 16 = 34 years. Thus, both the median and the interquartile range were greater in 1980 than in 1860.
SUMMARY
To visually compare two or more distributions use:
Dotplots, either one above the other or side-by-side
Double bar charts
Histograms, either one above the other or side-by-side
Back-to-back stemplots
Parallel boxplots
Cumulative frequency plots on the same grid
For all the above, make note of any similarities and differences in shape, center, and spread.
Multiple-Choice Questions
Directions: The questions or incomplete statements that follow are each followed by five suggested answers or completions. Choose the response that best answers the question or completes the statement.
1. The dotplots below show the yearly wages of all male and female executives at a large firm.
Which of the following conclusions cannot be drawn from the plots?
(A) A greater proportion of male employees than female employees are executives at this firm.
(B) No executive receives a salary less than $25,000.
(C) The median salary paid to male executives is less than the median salary paid to female executives.
(D) The range of salaries paid to male executives is less than the range of salaries paid to female executives.
(E) More male than female executives have salaries over $70,000.
2. Which of the following statements about the two histograms above is true?
(A) The empirical rule applies only to set A.
(B) The mean of set A looks to be greater than the mean of set B.
(C) The mean of set B looks to be greater than the mean of set A.
(D) Both sets have roughly the same variance.
(E) The standard deviation of set B is greater than 5.
3. Consider the following back-to-back stemplots comparing car battery lives (in months) of samples of two popular brands.
Which of the following are true statements?
I. The sample sizes are the same.
II. The ranges are the same.
III. The variances are the same.
IV. The means are the same.
V. The medians are the same.
(A) I and II
(B) I and IV
(C) II and V
(D) III and V
(E) I, II, and III
4. The following boxplots were constructed from SAT math scores of boys and girls at a high school:
Which of the following is a possible boxplot for the combined scores of all the students?
Questions 5–7 refer to the following population pyramids (source: U.S. Census Bureau).
5. What is the approximate median age of the Liberian population?
(A) 0–4
(B) 15–19
(C) 30–34
(D) 40–44
(E) There is insufficient information to approximate the median.
6. Which country has more children younger than 10 years of age?
(A) Liberia
(B) Canada
(C) You can’t tell without calculating means.
(D) You can’t tell without calculating medians.
(E) You can’t tell without calculating some measure of variability.
7. Which of the following statements are plausible, given the graphs?
I. Canadian women tend to live longer than men.
II. The recent civil war in Liberia, with the extensive use of child soldiers, has had an impact on the population age distribution.
III. Canadian demographics show a decreasing birth rate.
(A) I only
(B) I and II only
(C) I and III only
(D) II and III only
(E) All three are plausible.
8. Looking at very large sets of communications metadata, mainly phone call and e-mail logs, a government agency tracks connections starting with intelligence targets overseas and extending to “contact chains” of different lengths. During January 2014 and January 2015, the lengths of contact chains analyzed are shown in the following histograms.
In which month did the chain length distribution have the greater standard deviation?
(A) January 2014 because of its bell-shaped distribution
(B) January 2014 because the 2015 distribution is roughly uniform
(C) January 2015 because of similar means, but a later date
(D) January 2015 because more data are further from the mean
(E) This cannot be answered without more information.
Free-Response Questions
Directions: You must show all work and indicate the methods you use. You will be graded on the correctness of your methods and on the accuracy of your final answers.
FOUR OPEN-ENDED QUESTIONS
1. Are women better than men at multitasking? Suppose in one study of multitasking a random sample of 200 female and 200 male high school students were assigned several tasks at the same time, such as solving simple mathematics problems, reading maps, and answering simple questions while talking on a telephone. Total times taken to complete all the tasks are given in the histograms below.
Write a few sentences comparing the distributions of times to complete all tasks by females and by males.
2. In independent random samples of 20 men and 20 women, the number of minutes spent on grooming on a given day were:
Men: 27, 32, 82, 36, 43, 75, 45, 16, 23, 48, 51, 57, 60, 64, 39, 40, 69, 72, 54, 57
Women: 49, 50, 35, 69, 75, 35, 49, 54, 98, 58, 22, 34, 60, 38, 47, 65, 79, 38, 42, 87
Using back-to-back stemplots, compare the two distributions.
3. To analyze the social media behavior differences between boys and girls, Mrs. V’s FDA high school AP Statistics class was asked to count the number of text messages that they sent over a long three-day weekend. The following table summarizes the data:
| Values under Q1 | Q1 | Median | Q3 | Values over Q3 |
Females | 15, 43, 100 | 130 | 175 | 358 | 450, 573, 1098 |
Males | 3, 59 | 72 | 183 | 273 | 293, 337 |
(a) Construct parallel boxplots of this set of data.
(b) Do the data indicate that females or males had the greater mean number of texts? Explain.
4. Cumulative frequency graphs of the ages of people on three different Caribbean cruises (A, B, and C) are given below:
Write a few sentences comparing the distributions of ages of people on the three cruises.
AN INVESTIGATIVE TASK
The NFL quarterback rating formula provides a means of comparing passing performances. The graphs below show the quarterback ratings for two players during one 16-game season.
(a) Construct dotplots showing the frequencies of each rating for each player.
(b) Compare the distributions of quarterback ratings for the two players.
(c) What information is more apparent from the dotplots than from the above 16-game graphs?
(d) What information is more apparent from the above 16-game graphs than from the dotplots?
A central moving average is calculated using data equally spaced either side of the point in the series where the mean is calculated. The first few lines of the 3-game central moving averages for the quarterback ratings of the two players are as follows:
(e) Fill in the two blank spaces corresponding to the fourth game above. Show your calculations.
MULTIPLE-CHOICE
1. (A) The numbers of male and female employees are not given so proportions who are executives cannot be determined.
2. (E) The empirical rule applies to bell-shaped data like those found in set B, not in set A. Both sets are roughly symmetric around 150 and so both should have means about 150. Set A is much more spread out than set B, and so set A has the greater variance. For bell-shaped data, about 95% of the values fall within two standard deviations of the mean and 99.7% fall within three. However, in the histogram for set B, one sees that 95% of the data are not between 140 and 160, and 99.7% are not between 135 and 165. Thus, the standard deviation for set B must be greater than 5.
3. (A) Both sets have 20 elements. The ranges, 76 − 37 = 39 and 86 − 47 = 39, are equal. Brand A clearly has the larger mean and median, and with its skewness it also has the larger variance.
4. (E) The minimum of the combined set of scores must be the min of the boys since it is lower; the maximum of the combined set of scores must be the max of the girls since it is higher; the first quartile must be the same as the identical first quartiles of the two original distributions. There are no outliers (scores more than 1.5(IQR) from the first and third quartiles).
5. (B) Roughly 50% of total bar length is above and below the 15–19 interval.
6. (B) There are about 1.2 million younger than the age of 10 in Liberia (boys and girls) and roughly 3.5 million in Canada.
7. (E) In the Canadian graph, all higher age groups show greater numbers of women than men. In the Liberian graph, the smaller 15–19 age group shows a definite break with the overall pattern (a great number of child soldiers died in the fighting). In the Canadian graph, the narrowing base indicates a decreasing birth rate.
8. (D) The standard deviation is defined in terms of squared deviations from the mean. In the 2014 distribution, more data are concentrated closer to the mean, whereas in the 2015 distribution, more data are further from the mean.
FREE-RESPONSE
1. A complete answer compares shape, center, and spread and mentions context in at least one of the responses.
Shape: The distribution of times to complete all tasks by females is skewed right (toward the higher values), whereas the distribution of times to complete all tasks by males is roughly bell-shaped.
Center: The center of the distribution of female times (at around $2\frac{1}{4}$ minutes) is less than the center of the distribution of male times (at around $3\frac{3}{4}$ minutes).
Spread: The spreads of the two distributions are roughly the same; the range of the female times ($5-1\frac{1}{2}=3\frac{1}{2}$ minutes) equals the range of the male times ($5\frac{1}{2}-2=3\frac{1}{2}$ minutes).
2.
A complete answer compares shape, center, and spread and mentions context in at least one of the responses.
Shape: The men’s distribution of minutes spent grooming is roughly symmetric, whereas the women’s distribution of minutes spent grooming is skewed right (toward higher values).
Center: The center of the men’s distribution is about the same as the center of the women’s distribution, both about 50 min.
Spread: The spread of the men’s distribution (with a range of 82 – 16 = 66 min) is less than the spread of the women’s distribution (with a range of 98 – 22 = 76 min).
3. (a) For females, Q1 – 1.5(IQR) = 130 – 1.5(358 – 130) = –212 and Q3 + 1.5(IQR) = 700, so 1098 is an outlier. For males, Q1 – 1.5(IQR) = 72 – 1.5(273 – 72) = –229.5 and Q3 + 1.5(IQR) = 574.5, so there are no outliers.
(b) The medians are roughly equal. The male distribution appears roughly symmetric, so the mean is close to the median; however, the female distribution shows extreme right skewness, so the mean is much greater than the median. Thus, the females had a greater mean number of text messages than did the males.
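The fence arithmetic in part (a) can be checked with a short sketch. The function names `fences` and `outliers` are illustrative choices, not part of the original solution; the quartiles and tail values are those given in the table for Question 3.

```python
# Outlier fences from the summary values given in Question 3.
def fences(q1, q3):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(values, q1, q3):
    """List the values lying outside the 1.5*IQR fences."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Values below Q1 and above Q3, taken from the table in the question.
female_tails = [15, 43, 100, 450, 573, 1098]
male_tails = [3, 59, 293, 337]

print(fences(130, 358))                  # (-212.0, 700.0)
print(outliers(female_tails, 130, 358))  # [1098]
print(fences(72, 273))                   # (-229.5, 574.5)
print(outliers(male_tails, 72, 273))     # []
```

Only the values outside the quartiles need checking, since anything between Q1 and Q3 is automatically inside the fences.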
4. A complete answer compares shape, center, and spread and mentions context in at least one of the responses.
Shape: Cruise A, for which the cumulative frequency plot rises steeply at first, has more younger passengers, and thus a distribution skewed to the right (towards the higher ages). Cruise C, for which the cumulative frequency plot rises slowly at first and then steeply towards the end, has more older passengers, and thus a distribution skewed to the left (towards the younger ages). Cruise B, for which the cumulative frequency plot rises slowly at each end and steeply in the middle, has a more bell-shaped distribution.
Center: Considering the center to be a value separating the area under the histogram roughly in half, the centers will correspond to a cumulative frequency of 0.5. Reading across from 0.5 to the intersection of each graph, and then down to the x-axis, shows centers of approximately 18, 40, and 61 years, respectively. Thus, the center of distribution A is the least, and the center of distribution C is the greatest.
Spread: The spreads of the age distributions of all three cruises are the same: from 10 to 70 years.
AN INVESTIGATIVE TASK
(a)
(b) Shape: The Player A distribution is roughly uniform, whereas the Player B distribution is skewed right. (Also, the Player A distribution has no outliers, whereas the Player B distribution looks to have an outlier at 80.)
Center: The center of the Player A distribution (at about 110) is greater than the center of the Player B distribution (at about 35).
Spread: The variability in the Player B distribution is greater than the variability of the Player A distribution (for example, the range in the A distribution is 130 – 90 = 40, whereas the range in the B distribution is 80 – 25 = 55).
(c) The shapes (uniform for A and skewed right for B) are more apparent in the dotplots.
(d) With the dotplots, it’s impossible to see the game-to-game variability. Also, the dotplot for the B distribution doesn’t show the end-of-year upswing in ratings.
(e) For Player and for Player
SCATTERPLOTS
CORRELATION AND LINEARITY
LEAST SQUARES REGRESSION LINE
RESIDUAL PLOTS
OUTLIERS AND INFLUENTIAL POINTS
TRANSFORMATIONS TO ACHIEVE LINEARITY
Our studies so far have been concerned with measurements of a single variable. However, many important applications of statistics involve examining whether two or more variables are related to one another. For example, is there a relationship between the smoking histories of pregnant women and the birth weights of their children? Between SAT scores and success in college? Between amount of fertilizer used and amount of crop harvested?
Two questions immediately arise. First, how can the strength of an apparent relationship be measured? Second, how can an observed relationship be put into functional terms? For example, a real estate broker might not only wish to determine whether a relationship exists between the prime rate and the number of new homes sold in a month but might also find useful an expression with which to predict the number of home sales given a particular value of the prime rate.
A graphical display, called a scatterplot, gives an immediate visual impression of a possible relationship between two variables, while a numerical measurement, called a correlation coefficient, is often used as a quantitative value of the strength of a linear relationship. In either case, evidence of a relationship is not evidence of causation.
Suppose a relationship is perceived between two quantitative variables X and Y, and we graph the pairs (x, y). We are interested in the strength of this relationship, the scatterplot arising from the relationship, and any deviation from the basic pattern of this relationship. In this topic we examine whether the relationship can be reasonably explained in terms of a linear function, that is, one whose graph is a straight line.
TIP
Recognize explanatory (x) and response (y) variables in context.
For example, we might be looking at a plot such as
We need to know what the term best-fitting straight line means and how we can find this line. Furthermore, we want to be able to gauge whether the relationship between the variables is strong enough so that finding and making use of this straight line is meaningful.
Patterns in Scatterplots
When larger values of one variable are associated with larger values of a second variable, the variables are called positively associated. When larger values of one are associated with smaller values of the other, the variables are called negatively associated.
EXAMPLE 4.1
The strength of the association is gauged by how close the plotted points are to a straight line.
EXAMPLE 4.2
Sometimes different dots in a scatterplot are labeled with different symbols or different colors to show a categorical variable. The resulting labeled scatterplot might distinguish between men and women, between stocks and bonds, and so on.
EXAMPLE 4.3
The above diagram is a labeled scatterplot distinguishing men with plus signs and women with square dots to show a categorical variable.
When analyzing the overall pattern in a scatterplot, it is also important to note clusters and outliers.
EXAMPLE 4.4
An experiment was conducted to note the effect of temperature and light on the potency of a particular antibiotic. One set of vials of the antibiotic was stored under different temperatures, but under the same lighting, while a second set of vials was stored under different lightings, but under the same temperature.
In the first scatterplot note the linear pattern with one outlier far outside this pattern. A possible explanation is that the antibiotic is more potent at lower temperatures, but only down to a certain temperature at which it drastically loses potency.
TIP
Note when the data falls in distinct groups.
In the second scatterplot note the two clusters. It appears that below a certain light intensity the potency is one value, while above that intensity it is another value. In each cluster there seems to be no association between intensity and potency.
IMPORTANT
Correlation does not imply causation!
Although a scatter diagram usually gives an intuitive visual indication when a linear relationship is strong, in most cases it is quite difficult to visually judge the specific strength of a relationship. For this reason there is a mathematical measure called correlation (or the correlation coefficient). Important as correlation is, we always need to keep in mind that significant correlation does not necessarily indicate causation and that correlation measures the strength only of a linear relationship.
Correlation, designated by r, has the formula
$r=\frac{1}{n-1}\sum\left(\frac{x_i-\overline{x}}{s_x}\right)\left(\frac{y_i-\overline{y}}{s_y}\right)$
in terms of the means and standard deviations of the two sets. We note that the formula is actually the sum of the products of the corresponding z-scores divided by 1 less than the sample size. However, you should be able to quickly calculate correlation using the statistical package on your calculator. (Examining the formula helps you understand where correlation is coming from, but you will NOT have to use the formula to calculate r.)
Note from the formula that correlation does not distinguish between which variable is called x and which is called y. The formula is also based on standardized scores (z-scores), and so changing units does not change the correlation. Finally, since means and standard deviations can be strongly influenced by outliers, correlation is also strongly affected by extreme values.
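A minimal sketch of the z-score description of correlation, assuming the usual sample standard deviation (divisor n – 1). The data values and the names `hours` and `score` are hypothetical, made up only to illustrate that swapping x and y or changing units leaves r unchanged.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of products of corresponding z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations (n - 1 divisor)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, invented purely for illustration.
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

r = correlation(hours, score)
r_swapped = correlation(score, hours)                     # swap the roles of x and y
r_minutes = correlation([60 * h for h in hours], score)   # change units of x

# All three printed values agree (up to floating-point error).
print(round(r, 4), round(r_swapped, 4), round(r_minutes, 4))
```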
TIP
With standardized data (z-scores), r is the slope of the regression line.
The value of r always falls between –1 and +1, with –1 indicating perfect negative correlation and +1 indicating perfect positive correlation. It should be stressed that a correlation at or near zero doesn’t mean there isn’t a relationship between the variables; there may still be a strong nonlinear relationship.
EXAMPLE 4.5
NOTE
We can also say that $r^2$ gives the percentage of variation in y that can be explained by the regression line, with x as the explanatory variable.
It can be shown that $r^2$, called the coefficient of determination, is the ratio of the variance of the predicted values $ \widehat{y}$ to the variance of the observed values y. That is, there is a partition of the y-variance, and $r^2$ is the proportion of this variance that is predictable from a knowledge of x. Alternatively, we can say that $r^2$ gives the percentage of variation in y that is explained by the variation in x. In either case, always interpret $r^2$ in context of the problem. Remember when calculating r from $r^2$ that r may be positive or negative.
While the correlation r is given as a decimal between –1.0 and 1.0, the coefficient of determination $r^2$ is usually given as a percentage. An $r^2$ of 100% is a perfect fit, with all the variation in y explained by variation in x. How large a value of $r^2$ is desirable depends on the application under consideration. While scientific experiments often aim for an $r^2$ in the 90% or above range, observational studies with $r^2$ of 10% to 20% might be considered informative. Note that while a correlation of .6 is twice a correlation of .3, the corresponding $r^2$ of 36% is four times the corresponding $r^2$ of 9%.
What is the best-fitting straight line that can be drawn through a set of points?
TIP
If a scatterplot indicates a non-linear relationship, don’t try to force a straight-line fit.
On the basis of our experience with measuring variances, by best-fitting straight line we mean the straight line that minimizes the sum of the squares of the vertical differences between the observed values and the values predicted by the line.
TIP
A hat over a variable means it is a predicted version of the variable.
That is, in the above figure, we wish to minimize
$\sum(y_i-\widehat{y_i})^2$
It is reasonable, intuitive, and correct that the best-fitting line will pass through ($\overline{x},\overline{y}$) where $\overline{x}$ and $\overline{y}$ are the means of the variables X and Y. Then, from the basic expression for a line with a given slope through a given point, we have
$\widehat{y}=\overline{y}+b_1(x-\overline{x})$
The slope $b_1$ can be determined from the formula
$b_1=r\frac{s_y}{s_x}$
where r is the correlation and $s_x$ and $s_y$ are the standard deviations of the two sets. That is, each standard deviation change in x results in a predicted change of r standard deviations in y. If you graph z-scores for the y-variable against z-scores for the x-variable, the slope of the regression line is precisely r, and, in fact, the linear equation becomes $z_y=rz_x$.
This best-fitting straight line, that is, the line that minimizes the sum of the squares of the differences between the observed values and the values predicted by the line, is called the least squares regression line or simply the regression line. It can be calculated directly by entering the two data sets and using the statistics package on your calculator.
TIP
Just because we can calculate a regression line doesn’t mean it is useful.
EXAMPLE 4.6
An insurance company conducts a survey of 15 of its life insurance agents. The average number of minutes spent with each potential customer and the number of policies sold in a week are noted for each agent. Letting X and Y represent the average number of minutes and the number of sales, respectively, we have
X: 25 23 30 25 20 33 18 21 22 30 26 26 27 29 20
Y: 10 11 14 12 8 18 9 10 10 15 11 15 12 14 11
Find the equation of the best-fitting straight line for the data.
Answer: Plotting the 15 points (25, 10), (23, 11), …, (20, 11) gives an intuitive visual impression of the relationship:
TIP
Be sure to label axes and show number scales whenever possible.
This scatterplot indicates the existence of a relationship that appears to be linear; that is, the points lie roughly on a straight line. Furthermore, the linear relationship is positive; that is, as one variable increases, so does the other (the straight line slopes upward).
Using a calculator, we find the correlation to be r = .8836, the coefficient of determination to be $r^2$ = .78 (indicating that 78% of the variation in the number of policies is explained by the variation in the number of minutes spent), and the regression line to be
$\widehat{y}=\overline{y}+b_1(x-\overline{x})$
= 12 + 0.5492(x – 25)
= –1.73 + 0.5492x
We also write: $\widehat{\text{Policies}}=-1.73+0.5492$ Minutes
Adding this to our scatterplot yields
Thus, for example, we might predict that agents who average 24 minutes per customer will average 0.5492(24) – 1.73 = 11.45 sales per week. We also note that each additional minute spent seems to produce an average 0.5492 extra sale.
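The calculator results for this example can be reproduced by hand. The following is a sketch using only the data of Example 4.6; the variable names (`sxx`, `sxy`, `predict`, and so on) are illustrative choices, not calculator output.

```python
from math import sqrt

minutes = [25, 23, 30, 25, 20, 33, 18, 21, 22, 30, 26, 26, 27, 29, 20]
policies = [10, 11, 14, 12, 8, 18, 9, 10, 10, 15, 11, 15, 12, 14, 11]

n = len(minutes)
x_bar = sum(minutes) / n      # 25
y_bar = sum(policies) / n     # 12

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in minutes)
syy = sum((y - y_bar) ** 2 for y in policies)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(minutes, policies))

r = sxy / sqrt(sxx * syy)     # correlation
b1 = sxy / sxx                # slope, algebraically equal to r * s_y / s_x
b0 = y_bar - b1 * x_bar       # intercept: the line passes through (x_bar, y_bar)

def predict(x):
    """Mean predicted number of policies for x minutes per customer."""
    return b0 + b1 * x

print(round(r, 4), round(r ** 2, 2))   # 0.8836 0.78
print(round(b0, 2), round(b1, 4))      # -1.73 0.5492
print(round(predict(24), 2))           # 11.45
```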
EXAMPLE 4.7
Following are advertising expenditures and total sales for six detergent products:
Advertising (\$1000) (x): 2.3 5.7 4.8 7.3 5.9 6.2
Total sales (\$1000) (y): 77 105 96 118 102 95
Predict the total sales if \$5000 is spent on advertising and interpret the slope of the regression line. What if \$100,000 is spent on advertising?
Answer: With your calculator, the equation of the regression line is found to be
$\widehat{y}$ = 98.833 + 7.293(x – 5.367) = 59.691 + 7.293x
TIP
Use meaningful variable names.
(A calculator like the TI-84, with less round-off error, directly gives $\widehat{y}$ = 59.683 + 7.295x.)
It is also worthwhile to replace the x and y with more appropriately named variables, resulting, for example, in $\widehat{\text{Sales}}=59.691+7.293$ Advertising
The regression line predicts that if \$5000 is spent on advertising, the resulting total sales will be 7.293(5) + 59.691 = 96.156 thousands of dollars (\$96,156).
The slope of the regression line indicates that every extra \$1000 spent on advertising is associated with an average increase of \$7293 in predicted sales.
If \$100,000 is spent on advertising, we calculate 7.293(100) + 59.691 = 788.991 thousands of dollars (≈\$789,000). How much confidence should we have in this answer? Not much! We are trying to use the regression line to predict a value far outside the range of the data values. This procedure is called extrapolation and must be used with great care.
TIP
Be careful about extrapolation beyond the observed x-values.
It should be noted that when we use the regression line to predict a y-value for a given x-value, we are actually predicting the mean y-value for that given x-value. For any given x-value, there are many possible y-values, and we are predicting their mean. So if \$5000 is spent many times on advertising, various resulting total sales figures may result, but their predicted average is \$96,156.
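A sketch of the calculations in Example 4.7, with a simple check for extrapolation. The `predict` helper and its range check are illustrative assumptions, not a standard procedure. Because the coefficients are left unrounded here, they match the TI-84 values, and the prediction at \$5000 comes out near 96.16 rather than the 96.156 obtained from the rounded equation.

```python
adv = [2.3, 5.7, 4.8, 7.3, 5.9, 6.2]     # advertising, in $1000s
sales = [77, 105, 96, 118, 102, 95]      # total sales, in $1000s

n = len(adv)
x_bar = sum(adv) / n
y_bar = sum(sales) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(adv, sales))
      / sum((x - x_bar) ** 2 for x in adv))
b0 = y_bar - b1 * x_bar

def predict(x):
    """Return (mean predicted sales, True if x lies outside the observed x-values)."""
    extrapolating = x < min(adv) or x > max(adv)
    return b0 + b1 * x, extrapolating

print(round(b0, 3), round(b1, 3))   # 59.683 7.295
print(predict(5))                   # inside the data range
print(predict(100))                 # flagged as extrapolation
```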
The TI-Nspire gives
EXAMPLE 4.8
A random sample of 30 U.S. farm regions surveyed during the summer of 2003 produced the following statistics:
Average temperature (°F) during growing season: $\overline{x}$ = 81, sx = 3
Average corn yield per acre (bu.): $\overline{y}$ = 131, sy = 5
Correlation r = .32
Based on this study, what is the mean predicted corn yield for a region where the average growing season temperature is 76.5°F?
Answer: The z-score of 76.5 is $\frac{76.5-81}{3}=-1.5$, so 76.5 is 1.5 standard deviations below the average temperature reported in the study. With a correlation of r = .32, the predicted z-score for corn yield is .32(–1.5) = –0.48; that is, the predicted corn yield is 0.48 standard deviations below the average corn yield, or 131 – 0.48(5) = 128.6 bushels per acre.
Alternatively, we could have found the linear regression equation relating these variables: slope = $r\frac{s_y}{s_x}=.32\frac{5}{3}\approx 0.533$, intercept ≈ 131 – 81(0.533) ≈ 87.8, and thus $\widehat{\text{Yield}} $ = 0.533 Temp + 87.8. Then 0.533(76.5) + 87.8 ≈ 128.6.
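Both routes in this example can be sketched from the summary statistics alone. The function names below are illustrative; the numbers are those given in Example 4.8.

```python
x_bar, s_x = 81.0, 3.0     # average temperature (deg F) and its standard deviation
y_bar, s_y = 131.0, 5.0    # average corn yield (bu. per acre) and its standard deviation
r = 0.32

def predict_via_z(x):
    """Standardize x, multiply by r, then convert back to yield units."""
    z_x = (x - x_bar) / s_x
    z_y = r * z_x
    return y_bar + z_y * s_y

def predict_via_line(x):
    """Build the regression line from r, s_x, s_y and the two means."""
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return intercept + slope * x

print(round(predict_via_z(76.5), 1))      # 128.6
print(round(predict_via_line(76.5), 1))   # 128.6
```

The two functions are algebraically the same calculation, which is why the answers agree.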
CAREFUL!
Order is important; residual equals observed – predicted.
The difference between an observed and predicted value is called the residual. When the regression line is graphed on the scatterplot, the residual of a point is the vertical distance the point is from the regression line.
NOTE
When the data point is above the regression line, the residual is positive; a data point below the line gives a negative residual.
The regression line is the line that minimizes the sum of the squares of the residuals.
EXAMPLE 4.9
We calculate the predicted values from the regression line in Example 4.7 and subtract from the observed values to obtain the residuals:
Note that the sum of the residuals is
0.5 + 3.7 + 1.3 + 5.1 – 0.7 – 9.9 = 0.0
The TI-Nspire easily shows the “squares of the residuals,” the sum of which is minimized by the regression line.
The above equation is true in general; that is, the sum and thus the mean of the residuals is always zero.
The notation for residuals is $\widehat{e_i}=y_i-\widehat{y_i}$ and so $\sum_{i=1}^{n}\widehat{e_i}=0$. The standard deviation of the residuals is calculated as follows:
$s_e=\sqrt{\frac{\sum\widehat{e_i}^2}{n-2}}=\sqrt{\frac{\sum(y_i-\widehat{y_i})^2}{n-2}}$
$s_e$ gives a measure of how the points are spread around the regression line.
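As a sketch, the residuals of Example 4.9 and their standard deviation can be computed from the data of Example 4.7, assuming the usual n – 2 divisor for the standard deviation of the residuals. The rounded coefficients from Example 4.7 are used, so the residuals sum to zero only up to round-off.

```python
from math import sqrt

adv = [2.3, 5.7, 4.8, 7.3, 5.9, 6.2]    # advertising, in $1000s
sales = [77, 105, 96, 118, 102, 95]     # total sales, in $1000s
b0, b1 = 59.691, 7.293                  # rounded coefficients from Example 4.7

# residual = observed - predicted
residuals = [y - (b0 + b1 * x) for x, y in zip(adv, sales)]

n = len(residuals)
s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))   # assumed n - 2 divisor

print([round(e, 1) for e in residuals])   # [0.5, 3.7, 1.3, 5.1, -0.7, -9.9]
print(round(sum(residuals), 1))           # 0.0
print(round(s_e, 2))                      # typical size of a residual, in $1000s
```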
Plotting the residuals gives further information. In particular, a residual plot with a definite pattern is an indication that a nonlinear model will show a better fit to the data than the straight regression line. In addition to whether or not the residuals are randomly distributed, one should look at the balance between positive and negative residuals and also the size of the residuals in comparison to the associated y-values.
The residuals can be plotted against either the x-values or the $\widehat{y}$-values (since $\widehat{y}$ is a linear transformation of x, the plots are identical except for scale and a left-right reversal when the slope is negative).
It is also important to understand that a linear model may be appropriate, but weak, with a low correlation. And, alternatively, a linear model may not be the best model (as evidenced by the residual plot), but it still might be a very good model with high $r^2$.
EXAMPLE 4.10
Suppose the drying time of a paint product varies depending on the amount of a certain additive it contains.
Additive (oz), x: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Drying time (hr), y: | 4 | 2.1 | 1.5 | 1 | 1.2 | 1.7 | 2.5 | 3.6 | 4.9 | 6.1 |
Using a calculator, we find the regression line $\widehat{y}$ = 0.327x + 1.062, and we find the residuals:
x: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
y – $\widehat{y}$: | 2.61 | 0.38 | –0.54 | –1.37 | –1.5 | –1.32 | –0.85 | –0.08 | 0.89 | 1.77 |
The resulting residual plot shows a strong pattern:
This pattern indicates that a nonlinear model will be a better fit than a straight-line model. A scatterplot of the original data shows clearly what is happening:
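A sketch that recomputes the residuals of Example 4.10 and records only their signs. The sign string is an illustrative device, not a formal test. With unrounded coefficients the intercept comes out as 1.06 rather than 1.062.

```python
additive = list(range(1, 11))                              # oz
drying = [4, 2.1, 1.5, 1, 1.2, 1.7, 2.5, 3.6, 4.9, 6.1]    # hr

n = len(additive)
x_bar = sum(additive) / n
y_bar = sum(drying) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(additive, drying))
      / sum((x - x_bar) ** 2 for x in additive))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(additive, drying)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(round(b1, 3), round(b0, 2))   # 0.327 1.06
print(signs)                        # ++------++
```

Positive residuals at both ends with a long run of negative residuals in the middle is the curved pattern described above.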
The ability to interpret computer output is important not only to do well on the AP Statistics exam, but also to understand statistical reports in the business and scientific world.
EXAMPLE 4.11
Miles per gallon versus speed for a new model automobile is fitted with a least squares regression line. The graph of the residuals and some computer output for the regression are as follows:
Regression Analysis: MPG Versus Speed
The regression equation is
MPG = 38.9 – 0.218 Speed
Predictor | Coef | SE Coef | T | P |
Constant | 38.929 | 5.651 | 6.89 | 0.000 |
Speed | –0.2179 | 0.1119 | –1.95 | 0.099 |
S = 7.252
Analysis of Variance
Source | DF | SS | MS | F | P |
Regression | 1 | 199.34 | 199.34 | 3.79 | 0.099 |
Residual Error | 6 | 315.54 | 52.59 | | |
Total | 7 | 514.88 | | | |
a. Interpret the slope of the regression line in context.
Answer: The slope of the regression line is –0.2179, indicating that, on average, the MPG drops by 0.2179 for every increase of one mile per hour in speed.
b. What is the mean predicted MPG at a speed of 30 mph?
Answer: At 30 mph, the mean predicted MPG is –0.2179(30) + 38.929, or about 32.4 MPG.
c. What was the actual MPG at a speed of 30 mph?
Answer: The residual for 30 mph is about +3.5, and since residual = actual – predicted, we estimate the actual MPG to be 32.4 + 3.5, or about 36 MPG.
d. Is a line the most appropriate model? Explain.
Answer: The fact that the residuals show such a strong curved pattern indicates that a nonlinear model would be more appropriate.
e. What does “S = 7.252” refer to?
Answer: The standard deviation of the residuals, $s_e$ = 7.252, is a “typical value” of the residuals and gives a measure of how the points are spread around the regression line.
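Several entries in the computer output are related to one another, and a short sketch can make those links explicit. The variable names are illustrative; the numbers are copied from the output in Example 4.11. The R-Sq value is not quoted in this example and is computed here from the sums of squares.

```python
from math import sqrt

coef_speed, se_speed = -0.2179, 0.1119                    # slope and its standard error
ss_regression, ss_residual, ss_total = 199.34, 315.54, 514.88
df_residual = 6                                           # n - 2, with n = 8 data points

s = sqrt(ss_residual / df_residual)   # standard deviation of the residuals
r_sq = ss_regression / ss_total       # coefficient of determination
t = coef_speed / se_speed             # t-statistic for the slope

print(round(s, 3))      # 7.252
print(round(r_sq, 3))   # 0.387
print(round(t, 2))      # -1.95
```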
EXAMPLE 4.12
The number of youngsters playing Little League baseball in Ithaca, New York, during the years 1995–2003 is fitted with a least squares regression line. The graph of the residuals and some computer output for their regression are as follows:
Regression Analysis: Number of Players Versus Years Since 1995
Predictor | Coef | SE Coef | T | P |
Constant | 123.800 | 1.798 | 68.84 | 0.000 |
Years | 12.6333 | 0.3778 | 33.44 | 0.000 |
S = 2.926, R–Sq = 99.4%, R–Sq(adj) = 99.3%
Analysis of Variance
Source | DF | SS | MS | F | P |
Regression | 1 | 9576.1 | 9576.1 | 1118.45 | 0.000 |
Residual Error | 7 | 59.9 | 8.6 | | |
Total | 8 | 9636.0 | | | |
TIP
Simply using a calculator to find a regression line is not enough; you must understand it (for example, be able to interpret the slope and intercepts in context).
a. Does it appear that a line is an appropriate model for the data? Explain.
Answer: Yes. R–Sq = 99.4% is large, and the residual plot shows no pattern. Thus a linear model is appropriate.
b. What is the equation of the regression line (in context)?
Answer: Predicted # of players = 123.8 + 12.6 (years since 1995)
c. Interpret the slope of the regression line in the context of the problem.
Answer: The slope of the regression line is 12.6, indicating that, on average, the predicted number of children playing Little League baseball in Ithaca increased by 12 or 13 players per year during the 1995–2003 time period.
d. Interpret the y-intercept of the regression line in the context of the problem.
Answer: The y-intercept, 123.8, refers to the year 1995. Thus the number of players in Little League in Ithaca in 1995 was predicted to be around 124.
e. What is the predicted number of players in 1997?
Answer: For 1997, x = 2, so the predicted number of players is 12.6(2) + 123.8 = 149.
f. What was the actual number of players in 1997?
Answer: The residual for 1997 (x = 2) from the residual plot is +5, so actual – predicted = 5, and thus the actual number of players in 1997 must have been 5 + 149 = 154.
g. What years, if any, did the number of players decrease from the previous year? Explain.
Answer: The number would decrease if one residual were more than 12.6 greater than the next residual. This never happens, so the number of players never decreased.
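The reasoning in part (g) can be written out as a short derivation. The subscript t (years since 1995) and the symbol $e_t$ for the residual in year t are notation introduced here for illustration. Each observed count is its predicted value plus its residual:
$y_t=\widehat{y_t}+e_t=123.8+12.6t+e_t$
Subtracting consecutive years gives
$y_{t+1}-y_t=12.6+(e_{t+1}-e_t)$
so $y_{t+1}<y_t$ exactly when $e_t-e_{t+1}>12.6$, that is, when one residual exceeds the next by more than the slope.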
OUTLIERS AND INFLUENTIAL POINTS
In a scatterplot, regression outliers are indicated by points falling far away from the overall pattern. That is, a point is an outlier if its residual is an outlier in the set of residuals.
EXAMPLE 4.13
A scatterplot of grade point average (GPA) versus weekly television time for a group of high school seniors is as follows:
By direct observation of the scatterplot, we note that there are two outliers: one person who watches 5 hours of television weekly yet has only a 1.5 GPA, and another person who watches 25 hours weekly yet has a 3.0 GPA. Note also that while the value of 30 weekly hours of television may be considered an outlier for the television hours variable and the 0.5 GPA may be considered an outlier for the GPA variable, the point (30, 0.5) is not an outlier in the regression context because it does not fall off the straight-line pattern.
Scores whose removal would sharply change the regression line are called influential scores. Sometimes this description is restricted to points with extreme x-values. An influential score may have a small residual but still have a greater effect on the regression line than scores with possibly larger residuals but average x-values.
EXAMPLE 4.14
Consider the following scatterplot of six points and the regression line:
The heavy line in the scatterplot on the left below shows what happens when point A is removed, and the heavy line in the scatterplot on the right below shows what happens when point B is removed.
Note that the regression line is greatly affected by the removal of point A but not by the removal of point B. Thus point A is an influential score, while point B is not. This is true in spite of the fact that point A is closer to the original regression line than point B.
TRANSFORMATIONS TO ACHIEVE LINEARITY
Often a straight-line pattern is not the best model for depicting a relationship between two variables. A clear indication of this problem is when the scatterplot shows a distinctive curved pattern. Another indication is when the residuals show a distinctive pattern rather than a random scattering. In such a case, the nonlinear model can sometimes be revealed by transforming one or both of the variables and then noting a linear relationship. Useful transformations often result from using the log or ln buttons on your calculator to create new variables.
EXAMPLE 4.15
Consider the following years and corresponding populations:
Year, x: | 1950 | 1960 | 1970 | 1980 | 1990 |
Population (1000s), y: | 50 | 67 | 91 | 122 | 165 |
The scatterplot and residual plot indicate that a nonlinear relationship would be an even stronger model.
Letting log y be a new variable, we obtain
x: | 1950 | 1960 | 1970 | 1980 | 1990 |
log y: | 1.70 | 1.83 | 1.96 | 2.09 | 2.22 |
The scatterplot and residual plot now indicate a stronger linear relationship.
A regression analysis yields $\widehat{\log y}=0.013x-23.65.$ In context we have: $\widehat{\log(\text{Pop})}=-23.65+0.013(\text{Year})$. So, for example, the population predicted for the year 2000 would be calculated $\widehat{\log(\text{Pop})}=0.013(2000)-23.65=2.35$, and so $\text{Pop}=10^{2.35}\approx 224$ thousand, or 224,000. The “linear” equation $\widehat{\log y}=0.013x-23.65$ can be re-expressed as $\widehat{y}=10^{0.013x-23.65}$ $[=10^{-23.65}\times(10^{0.013})^{x}=2.2387\times 10^{-24}\times 1.0304^{x}]$.
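The fit in this example can be reproduced with a short Python sketch. It assumes the rounded log y values from the table above are used as the response; the function name `least_squares` is an illustrative helper, not a calculator command.

```python
# Years and the rounded log10(population) values from the table in Example 4.15.
years = [1950, 1960, 1970, 1980, 1990]
log_pop = [1.70, 1.83, 1.96, 2.09, 2.22]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

slope, intercept = least_squares(years, log_pop)

# Predict log(Pop) for the year 2000, then undo the log to get thousands of people.
log_pred_2000 = intercept + slope * 2000
pop_pred_2000 = 10 ** log_pred_2000

print("slope:", round(slope, 3), "intercept:", round(intercept, 2))
print("predicted population in 2000 (thousands):", round(pop_pred_2000))
```

Working with the unrounded logarithms of the populations would give slightly different coefficients, so small discrepancies from a calculator's output are to be expected.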
There are many useful transformations. For example:
Log y as a linear function of x, log y = ax + b, re-expresses as an exponential:
$y=10^{ax+b}$ or $y=b_1 10^{ax}$ where $b_1=10^{b}$
Log y as a linear function of log x, log y = a log x + b, re-expresses as a power:
$y=10^{a\log x+b}$ or $y=b_1 x^{a}$ where $b_1=10^{b}$
$\sqrt{y}$ as a linear function of x, $\sqrt{y}$ = ax + b, re-expresses as a quadratic:
$y=(ax+b)^{2}$
$\frac{1}{y}$ as a linear function of x, $\frac{1}{y}$ = ax + b, re-expresses as a reciprocal:
$y=\frac{1}{ax+b}$
y as a linear function of log x, y = a log x + b, is a logarithmic function.
Note: Although you need to be able to recognize the need for a transformation, justify its appropriateness (residuals plot), use whichever is appropriate from above to create a linear model, and use the model to make predictions, you do not have to be able to re-express the linear equation in the manner shown above.
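Although re-expression is not required, it is easy to check numerically that the two forms of a model agree. The sketch below does this for the exponential and power cases; the function names are illustrative, and the coefficients used in the checks are the ones quoted in Examples 4.15 and 4.16.

```python
import math

def exponential_forms(a, b, x):
    """log y = a*x + b written two ways: 10**(a*x + b) and b1 * (10**a)**x."""
    direct = 10 ** (a * x + b)
    b1 = 10 ** b
    re_expressed = b1 * (10 ** a) ** x
    return direct, re_expressed

def power_forms(a, b, x):
    """log y = a*log x + b written two ways: 10**(a*log10(x) + b) and b1 * x**a."""
    direct = 10 ** (a * math.log10(x) + b)
    b1 = 10 ** b
    re_expressed = b1 * x ** a
    return direct, re_expressed

# Exponential model from Example 4.15, evaluated at the year 2000.
exp_direct, exp_re = exponential_forms(0.013, -23.65, 2000)

# Power model from Example 4.16, evaluated at an arbitrary x-value of 3.
pow_direct, pow_re = power_forms(1.639, 1.295, 3)

print(exp_direct, exp_re)
print(pow_direct, pow_re)
print("growth factor 10**0.013:", round(10 ** 0.013, 4))
print("power-model constant 10**1.295:", round(10 ** 1.295, 2))
```

Each pair of printed values should agree up to floating-point rounding, which is all the re-expression amounts to.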
EXAMPLE 4.16
What are possible models for the following data?
Answer: A linear fit to x and y gives $\widehat{y}$ = 65x – 61 with r = 0.99.
A linear fit to x and log y gives log y = 0.279x + 1.139 with r = 0.98. This results in an exponential relationship:
$\widehat{y}=10^{0.279x+1.139}=13.77(10^{0.279x})=13.77(1.901^{x})$
A linear fit to log x and log y gives log y = 1.639 log x + 1.295, also with r = 0.99. This results in a power relationship:
$\widehat{y}=10^{1.639\log x+1.295}=19.72(x^{1.639})$
All three models give high correlation and are reasonable fits. Further analysis can be done by examining the residual plots:
The first two residual plots have distinct curved patterns. The third residual plot illustrates both a more random pattern and smaller residuals. (One can also create residual plots using the derived curved models, but we choose to restrict our attention to residual plots of the linear models.) Among the above three models, the power model, $\widehat{y}=19.72(x^{1.639})$, appears to be best.
SUMMARY
A scatterplot gives an immediate indication of the shape (linear or not), strength, and direction (positive or negative) of a possible relationship between two variables.
If the relationship appears roughly linear, then the correlation coefficient, r, is a useful measurement.
The value of r is always between –1 and +1, with positive values indicating positive association and negative values indicating negative association; and values close to –1 or +1 indicating a stronger linear association than values close to 0, which indicate a weaker linear association.
Evidence of an association is not evidence of a cause-and-effect relationship!
Correlation is not affected by which variable is called x and which y or by changing units.
Correlation can be strongly affected by extreme values.
The differences between the observed and predicted values are called residuals.
The best-fitting straight line, called the regression line, minimizes the sum of the squares of the residuals.
For the linear regression model, the mean of the residuals is always 0.
A definite pattern in the residual plot indicates that a nonlinear model may fit the data better than the straight regression line.
The coefficient of determination, $r^2$, gives the percentage of variation in y that is accounted for by the variation in x.
Influential scores are scores whose removal would sharply change the regression line.
Nonlinear models can sometimes be studied by transforming one or both variables and then noting a linear relationship.
It is very important to be able to interpret generic computer output.
Multiple-Choice Questions
Directions: The questions or incomplete statements that follow are each followed by five suggested answers or completions. Choose the response that best answers the question or completes the statement.
1. A study collects data on average combined SAT scores (math, critical reading, and writing) and percentage of students who took the exam at 100 randomly selected high schools. Following is part of the computer printout for regression:
Which of the following is a correct conclusion?
(A) SAT in the variable column indicates that SAT is the dependent (response) variable.
(B) The correlation is ±0.875, but the sign cannot be determined.
(C) The y-intercept indicates the mean combined SAT score if percent of students taking the exam has no effect on combined SAT scores.
(D) The $R^2$ value indicates that the residual plot does not show a strong pattern.
(E) Schools with lower percentages of students taking the exam tend to have higher average combined SAT scores.
2. A simple random sample of 35 world-ranked chess players provides the following statistics:
Number of hours of study per day: $\overline{x} $ = 6.2, sx = 1.3
Yearly winnings: $\overline{y} $ = $208,000, sy = $42,000
Correlation r = 0.15
Based on this data, what is the resulting linear regression equation?
(A) $\widehat{\text{Winnings}}$ = 178,000 + 4850 Hours
(B) $\widehat{\text{Winnings}}$ = 169,000 + 6300 Hours
(C) $\widehat{\text{Winnings}}$ = 14,550 + 31,200 Hours
(D) $\widehat{\text{Winnings}}$ = 7750 + 32,300 Hours
(E) $\widehat{\text{Winnings}}$ = –52,400 + 42,000 Hours
3. A rural college is considering constructing a windmill to generate electricity but is concerned over noise levels. A study is performed measuring noise levels (in decibels) at various distances (in feet) from the campus library, and a least squares regression line is calculated with a correlation of 0.74. Which of the following is a proper and most informative conclusion for an observation with a negative residual?
(A) The measured noise level is 0.74 times the predicted noise level.
(B) The predicted noise level is 0.74 times the measured noise level.
(C) The measured noise level is greater than the predicted noise level.
(D) The predicted noise level is greater than the measured noise level.
(E) The slope of the regression line at that point must also be negative.
4. Consider the following three scatterplots:
Which has the greatest correlation coefficient?
(A) I
(B) II
(C) III
(D) They all have the same correlation coefficient.
(E) This question cannot be answered without additional information.
5. Suppose the correlation is negative. Given two points from the scatterplot, which of the following is possible?
I. The first point has a larger x-value and a smaller y-value than the second point.
II. The first point has a larger x-value and a larger y-value than the second point.
III. The first point has a smaller x-value and a larger y-value than the second point.
(A) I only
(B) II only
(C) III only
(D) I and III
(E) I, II, and III
6. Consider the following residual plot:
Which of the following scatterplots could have resulted in the above residual plot? (The y-axis scales are not the same in the scatterplots as in the residual plot.)
(A)
(B)
(C)
(D)
(E) None of these could result in the given residual plot.
7. Suppose the regression line for a set of data, $\widehat{y}$ = 3x + b, passes through the point (2, 5). If $\overline{x}$ and $\overline{y}$ are the sample means of the x- and y-values, respectively, then $\overline{y}$ =
(A) $\overline{x}$.
(B) $\overline{x}$ – 2.
(C) $\overline{x}$ + 5.
(D) 3$\overline{x}$.
(E) 3$\overline{x}$ – 1.
8. Suppose a study finds that the correlation coefficient relating family income to SAT scores is r = +1. Which of the following are proper conclusions?
I. Poverty causes low SAT scores.
II. Wealth causes high SAT scores.
III. There is a very strong association between family income and SAT scores.
(A) I only
(B) II only
(C) III only
(D) I and II
(E) I, II, and III
9. A study of department chairperson ratings and student ratings of the performance of high school statistics teachers reports a correlation of r = 1.15 between the two ratings. From this information we can conclude that
(A) chairpersons and students tend to agree on who is a good teacher.
(B) chairpersons and students tend to disagree on who is a good teacher.
(C) there is little relationship between chairperson and student ratings of teachers.
(D) there is strong association between chairperson and student ratings of teachers, but it would be incorrect to infer causation.
(E) a mistake in arithmetic has been made.
10. Which of the following statements about correlation r is true?
(A) A correlation of 0.2 means that 20% of the points are highly correlated.
(B) Perfect correlation, that is, when the points lie exactly on a straight line, results in r = 0.
(C) Correlation is not affected by which variable is called x and which is called y.
(D) Correlation is not affected by extreme values.
(E) A correlation of 0.75 indicates a relationship that is 3 times as linear as one for which the correlation is only 0.25.
Questions 11–13 refer to the following:
The relationship between winning game proportions when facing the sun and when the sun is on one’s back is analyzed for a random sample of 10 professional players. The computer printout for regression is below:
11. What is the equation of the regression line, where face and back are the winning game proportions when facing the sun and with back to the sun, respectively?
12. What is the correlation?
(A) –0.984
(B) –0.986
(C) 0.984
(D) 0.986
(E) 0.993
13. For one player, the winning game proportions were 0.55 and 0.59 for facing and back, respectively. What was the associated residual?
(A) –0.028
(B) 0.028
(C) –0.0488
(D) 0.0488
(E) 0.3608
14. Which of the following statements about residuals are true?
I. The mean of the residuals is always zero.
II. The regression line for a residual plot is a horizontal line.
III. A definite pattern in the residual plot is an indication that a nonlinear model will show a better fit to the data than the straight regression line.
(A) I and II
(B) I and III
(C) II and III
(D) I, II, and III
(E) None of the above gives the complete set of true responses.
15. Data are obtained for a group of college freshmen examining their SAT scores (math plus writing plus critical reading) from their senior year of high school and their GPAs during their first year of college. The resulting regression equation is
$\widehat{\text{GPA}}=$ 0.55 + 0.00161 (SAT total) with r = 0.632
What percentage of the variation in GPAs can be explained by looking at SAT scores?
(A) 0.161%
(B) 16.1%
(C) 39.9%
(D) 63.2%
(E) This value cannot be computed from the information given.
16. In a study of whether the structure of the adult human brain changes when a new skill is learned, the gray matter volume of four individuals was measured before and after learning a new cognitive skill. The resulting scatterplot was:
The correlation above is 0. Three researchers each run the experiment on a new subject and each obtain an additional data point:
Match the above scatterplots with their new correlations.
(A) | I: –0.33 | II: 0 | III: 0.33 |
(B) | I: 0 | II: 0.33 | III: 0.64 |
(C) | I: 0 | II: 0.33 | III: 1.0 |
(D) | I: –0.33 | II: 0 | III: 1.0 |
(E) | I: 0 | II: 0.50 | III: 1.0 |
17. In a study of winning percentage in home games versus average home attendance for professional baseball teams, the resulting regression line is:
Predicted winning percentage = 44 + 0.0003 (Attendance)
What is the residual if a team has a winning percentage of 55% with an average attendance of 34,000?
(A) –11.0
(B) –0.8
(C) 0.8
(D) 11.0
(E) 23.0
18. Consider the following scatterplot of midterm and final exam scores for a class of 15 students.
Which of the following is incorrect?
(A) The same number of students scored 100 on the midterm exam as scored 100 on the final exam.
(B) Students who scored higher on the midterm exam tended to score higher on the final exam.
(C) The scatterplot shows a moderate negative correlation between midterm and final exam scores.
(D) The coefficient of determination here is positive.
(E) No one scored 90 or above on both exams.
19. If every woman married a man who was exactly 2 inches taller than she, what would the correlation between the heights of married men and women be?
(A) Somewhat negative
(B) 0
(C) Somewhat positive
(D) Nearly 1
(E) 1
20. Suppose the correlation between two variables is r = 0.23. What will the new correlation be if 0.14 is added to all values of the x-variable, every value of the y-variable is doubled, and the two variables are interchanged?
(A) 0.23
(B) 0.37
(C) 0.74
(D) –0.23
(E) –0.74
21. Suppose the correlation between two variables is –0.57. If each of the y-scores is multiplied by –1, which of the following is true about the new scatterplot?
(A) It slopes up to the right, and the correlation is –0.57.
(B) It slopes up to the right, and the correlation is +0.57.
(C) It slopes down to the right, and the correlation is –0.57.
(D) It slopes down to the right, and the correlation is +0.57.
(E) None of the above is true.
22. Consider the set of points {(2, 5), (3, 7), (4, 9), (5, 12), (10, n)}. What should n be so that the correlation between the x- and y-values is 1?
(A) 21
(B) 24
(C) 25
(D) A value different from any of the above.
(E) No value for n can make r = 1.
23. Consider the following three scatterplots:
Which of the following is a true statement about the correlations for the three scatterplots?
(A) None are 0.
(B) One is 0, one is negative, and one is positive.
(C) One is 0, and both of the others are positive.
(D) Two are 0, and the other is 1.
(E) Two are 0, and the other is close to 1.
24. Consider the three points (2, 11), (3, 17), and (4, 29). Given any straight line, we can calculate the sum of the squares of the three vertical distances from these points to the line. What is the smallest possible value this sum can be?
(A) 6
(B) 9
(C) 29
(D) 57
(E) None of these values
25. Suppose that the scatterplot of log X and log Y shows a strong positive correlation close to 1. Which of the following is true?
(A) The variables X and Y also have a correlation close to 1.
(B) A scatterplot of the variables X and Y shows a strong nonlinear pattern.
(C) The residual plot of the variables X and Y shows a random pattern.
(D) A scatterplot of X and log Y shows a strong linear pattern.
(E) A cause-and-effect relationship can be concluded between log X and log Y.
26. Consider n pairs of numbers. Suppose $\overline{x}$ = 2, $s_x$ = 3, $\overline{y}$ = 4, and $s_y$ = 5. Of the following, which could be the least squares line?
(A) $\widehat{y} $ = –2 + x
(B) $\widehat{y} $ = 2x
(C) $\widehat{y} $ = –2 + 3x
(D) $\widehat{y}=\frac{5}{3}-x$
(E) $\widehat{y} $ = 6 – x
Free-Response Questions
Directions: You must show all work and indicate the methods you use. You will be graded on the correctness of your methods and on the accuracy of your final answers.
TEN OPEN-ENDED QUESTIONS
1. Average home attendance and number of home wins for the 2009–2010 NBA Pacific Division teams were as follows:
| Lakers | Suns | Clippers | Warriors | Kings |
Average attendance | 18,997 | 17,648 | 16,343 | 18,027 | 13,254 |
Home wins | 34 | 32 | 21 | 18 | 18 |
(a) Does a winning team bring out the fans? Can average attendance be predicted from number of wins? Find the equation of the best-fitting straight line.
(b) Interpret the slope.
(c) Predict the average attendance for a team with 25 home wins.
(d) What number of home wins will predict an average of 17,000 fans?
(e) What is the residual for the Lakers average attendance?
2. The shoe sizes and the number of ties owned by ten corporate vice presidents are as follows:
Shoe size, x: | 8 | 9.5 | 9 | 11 | 9 | 9.5 | 8.5 | 9 | 9 | 9.5 |
Number of ties, y: | 10 | 10 | 8 | 15 | 12 | 13 | 16 | 7 | 12 | 4 |
(a) Draw a scatterplot for these data.
(b) Find the correlation r.
(c) Can we find the best-fitting straight-line approximation to the above data? Does it make sense to use this equation to predict the number of ties owned by a corporate executive who wears size 10 shoes? Explain.
3. Following is a scatterplot of the average life expectancies and per capita incomes (in thousands of dollars) for people in a sample of 50 countries.
(a) Estimate the mean for the set of 50 life expectancies and for the set of 50 per capita incomes.
(b) Estimate the standard deviation for the set of life expectancies and for the set of per capita incomes. Explain your reasoning.
(c) Does the scatterplot show a correlation between per capita income and life expectancy? Is it positive or negative? Is it weak or strong?
4. (a) Find the correlation r for each of the three sets:
A = {(5, 5), (5, 10), (10, 5), (10, 10)}
B = {(50, 50), (50, 55), (55, 50), (55, 55)}
C = {(90, 90), (90, 95), (95, 90), (95, 95)}
(b) Find the correlation for the set consisting of the 12 scores from A, B, and C.
(c) Comment on the above results.
5. An outlier can have a striking effect on the correlation r. For example, comment on the following three scatterplots:
6. Fuel economy y (in miles per gallon) is tabulated for various speeds x (in miles per hour) for a certain car model. A linear regression model gives Predicted fuel economy = 34.8 – 0.16 (Speed) with the following residual plot:
A quadratic regression model gives $\widehat{y}=-0.0032x^{2}+0.26x+23.8$ with the following residual plot:
(a) What does each model predict for fuel economy at 50 miles per hour?
(b) Which model is a better fit? Explain.
7. The following scatterplot shows the grades for research papers for a sociology professor’s class plotted against the lengths of the papers (in pages).
Mary turned in her paper late and was told by the professor that her grade would have been higher if she had turned it in on time. A computer printout fitting a straight line to the data (not including Mary’s score) by the method of least squares gives
Grade = 46.51 + 1.106 Length
R–sq = 74.6%
(a) Find the correlation coefficient for the relationship between grade and length of paper based on these data (excluding Mary’s paper).
(b) What is the slope of the regression line and what does it signify?
(c) How will the correlation coefficient change if Mary’s paper is included? Explain your answer.
(d) How will the slope of the regression line change if Mary’s paper is included? Explain your answer.
(e) What grade did Mary receive? Predict what she would have received if her paper had been on time.
8. Data show a trend in winning long jump distances for an international competition over the years 1972–92. With jumps recorded in inches and dates in years since 1900, a least squares regression line is fit to the data. The computer output and a graph of the residuals are as follows:
(a) Does a line appear to be an appropriate model? Explain.
(b) What is the slope of the least squares line? Give an interpretation of the slope.
(c) What is the correlation?
(d) What is the predicted winning distance for the 1980 competition?
(e) What was the actual winning distance in 1980?
9. A scatterplot of the number of accidents per day on a particular interstate highway during a 30-day month is as follows:
(a) Draw a histogram of the frequencies of the number of accidents.
(b) Draw a boxplot of the number of accidents.
(c) Name a feature apparent in the scatterplot but not in the histogram or boxplot.
(d) Name a feature clearly shown by the histogram and boxplot but not as obvious in the scatterplot.
10. The following scatterplot shows the pulse rate drop (in beats per minute) plotted against the amount of medication (in grams) of an experimental drug being field-tested in several hospitals.
A computer printout showing the results of fitting a straight line to the data by the method of least squares gives
PulseRateDrop = –1.68 + 8.5 Grams
R–sq = 81.9%
(a) Find the correlation coefficient for the relationship between pulse rate drop and grams of medication.
(b) What is the slope of the regression line and what does it signify?
(c) Predict the pulse rate drop for a patient given 2.25 grams of medication.
(d) A patient given 5 grams of medication has his pulse rate drop to zero. Does this invalidate the regression equation? Explain.
(e) How will the size of the correlation coefficient change if the 3-gram result is removed from the data set? Explain.
(f) How will the size of the slope of the least squares regression line change if the 3-gram result is removed from the data set? Explain.
MULTIPLE-CHOICE
1. (E) The variable column indicates the independent (explanatory) variable. The sign of the correlation is the same as the sign of the slope (negative here). In this example, the y-intercept is meaningless (predicted SAT result if no students take the exam). There can be a strong linear relation, with high $R^2$ value, but still a distinct pattern in the residual plot indicating that a nonlinear fit may be even stronger. The negative value of the slope (–2.84276) indicates that the predicted combined SAT score of a school is 2.84 points lower for each one-unit increase in the percentage of students taking the exam, on average.
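2. (A) Slope $b=r\frac{s_y}{s_x}=0.15\left(\frac{42{,}000}{1.3}\right)\approx 4850$ and intercept $a=\overline{y}-b\overline{x}=208{,}000-4850(6.2)\approx 178{,}000$.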
3. (D) Residual = Measured – Predicted, so if the residual is negative, the predicted must be greater than the measured (observed).
4. (D) The correlation coefficient is not changed by adding the same number to each value of one of the variables or by multiplying each value of one of the variables by the same positive number.
5. (E) A negative correlation shows a tendency for higher values of one variable to be associated with lower values of the other; however, given any two points, anything is possible.
6. (A) This is the only scatterplot in which the residuals go from positive to negative and back to positive.
7. (E) Since (2, 5) is on the line $\widehat{y}$ = 3x + b, we have 5 = 6 + b and b = –1. Thus the regression line is $\widehat{y}$ = 3x – 1. The point $(\overline{x},\overline{y})$ is always on the regression line, and so we have $\overline{y}$ = 3$\overline{x}$ – 1.
8. (C) The correlation r measures association, not causation.
9. (E) The correlation r cannot take a value greater than 1.
10. (C) If the points lie on a straight line, r = ±1. Correlation has the formula $r=\frac{1}{n-1}\sum\left(\frac{x_i-\overline{x}}{s_x}\right)\left(\frac{y_i-\overline{y}}{s_y}\right)$, so x and y are interchangeable, and r does not depend on which variable is called x or y. However, since means and standard deviations can be strongly influenced by outliers, r too can be strongly affected by extreme values. While r = 0.75 indicates a better fit with a linear model than r = 0.25 does, we cannot say that the linearity is threefold.
11. (B) The “Predictor” column indicates the independent variable with its coefficient to the right.
12. (E)
13. (B) $ \widehat{back}$ = 0.056 + 0.920 (0.55) = 0.562 and so the residual = 0.59 – 0.562 = 0.028
14. (D) The sum and thus the mean of the residuals are always zero. In a good straight-line fit, the residuals show a random pattern.
15. (C) The coefficient of determination $r^2$ gives the proportion of the y-variance that is predictable from a knowledge of x. In this case $r^2=(0.632)^2=0.399$ or 39.9%.
16. (B) The point I doesn’t contribute to a line with negative or positive slope. In none of the scatterplots do the points fall on a straight line, so none of them have correlation 1.0.
17. (C) Predicted winning percentage = 44 + 0.0003(34,000) = 54.2, and
Residual = Observed – Predicted = 55 – 54.2 = 0.8.
18. (B) On each exam, two students had scores of 100. There is a general negative slope to the data showing a moderate negative correlation. The coefficient of determination, $r^2$, is always $\geq 0$. While several students scored 90 or above on one or the other exam, no student did so on both exams.
19. (E) On the scatterplot all the points lie perfectly on a line sloping up to the right, and so r = 1.
20. (A) The correlation is not changed by adding the same number to every value of one of the variables, by multiplying every value of one of the variables by the same positive number, or by interchanging the x- and y-variables.
21. (B) The slope and the correlation coefficient have the same sign. Multiplying every y-value by –1 changes this sign.
22. (E) A scatterplot readily shows that while the first three points lie on a straight line, the fourth point does not lie on this line. Thus no matter what the fifth point is, all the points cannot lie on a straight line, and so r cannot be 1.
23. (E) All three scatterplots show very strong nonlinear patterns; however, the correlation r measures the strength of only a linear association. Thus r = 0 in the first two scatterplots and is close to 1 in the third.
24. (A) Using your calculator, find the regression line to be $\widehat{y}$ = 9x – 8. The regression line, also called the least squares regression line, minimizes the sum of the squares of the vertical distances between the points and the line. In this case (2, 10), (3, 19), and (4, 28) are on the line, and so the minimum sum is $(10-11)^2+(19-17)^2+(28-29)^2=6$.
25. (B) When transforming the variables leads to a linear relationship, the original variables have a nonlinear relationship, their correlation (which measures linearity) is not close to 1, and the residuals do not show a random pattern. While r close to 1 indicates strong association, it does not indicate cause and effect.
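26. (E) The least squares line passes through $(\overline{x},\overline{y})=(2,4)$, and the slope b satisfies $b=r\frac{s_y}{s_x}=\frac{5}{3}r$. Since $-1\leq r\leq 1$, we have $-\frac{5}{3}\leq b\leq\frac{5}{3}$. Only $\widehat{y}$ = 6 – x passes through (2, 4) with a slope in this range.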
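Two of the multiple-choice computations above can be checked with a short Python sketch: the regression line from summary statistics in question 2, and the minimum sum of squared vertical distances in question 24. The function names are illustrative helpers, not calculator commands.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Least squares (slope, intercept) from means, standard deviations, and r."""
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return slope, intercept

def least_squares(xs, ys):
    """Least squares (slope, intercept) computed directly from the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Question 2: hours of study (x) and yearly winnings (y) for the chess players.
slope2, intercept2 = line_from_summary(6.2, 1.3, 208_000, 42_000, 0.15)

# Question 24: the three points (2, 11), (3, 17), (4, 29).
xs24 = [2, 3, 4]
ys24 = [11, 17, 29]
slope24, intercept24 = least_squares(xs24, ys24)
sse24 = sum((y - (intercept24 + slope24 * x)) ** 2 for x, y in zip(xs24, ys24))

print("Q2 slope and intercept:", round(slope2), round(intercept2))
print("Q24 line:", slope24, intercept24, " sum of squared residuals:", sse24)
```

The unrounded slope and intercept for question 2 round to the values in choice (A).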
FREE-RESPONSE
1. (a) A calculator gives $\widehat{Attendance}=$ 12,416 + 180.4 (Wins).
(b) Each additional home win is associated with an increase of about 180 in predicted average attendance.
(c) 12,416 + 180.4(25) = 16,926
(d) 17,000 = 12,416 + 180.4(Wins) gives Wins = 25.4, so 26 home wins are needed to predict an average attendance of at least 17,000.
(e) With 34 wins, the predicted average attendance is 12,416 + 180.4(34) = 18,550 so the residual is 18,997 – 18,550 = 447.
2. (a)
(b) A calculator gives r = 0.1568.
(c) The correlation r is low for this number of data scores, and the scatterplot shows no linear pattern whatsoever. Although theoretically we could use our techniques to find the best-fitting straight-line approximation, the result would be meaningless and should not be used for predictions.
3. (a) By visual inspection $\overline{x}$ ≈ 68 and $\overline{y}$ ≈ 21.
(b) The range of the life expectancies is 80 – 54 = 26, and so the standard deviation is roughly 26/4=6.5. Similarly the standard deviation of the per capita incomes is roughly $\frac{30-10}{4}=5$.
(c) While the points generally fall from the lower left to the upper right, they are still widely scattered. Thus the scatterplot shows a weak positive correlation between per capita income and life expectancy.
4. (a) The correlation for each of the three sets is 0.
(b) The correlation for the set consisting of all 12 scores is 0.9948.
(c) The data from each set taken separately show no linear pattern. However, together they show a strong linear fit. Note the positions of the data from the separate sets in the complete scatterplot.
5. In the first scatterplot, the points fall exactly on a downward sloping straight line, so r = –1. In the second scatterplot, the isolated point is an influential point, and r is close to +1. In the third scatterplot, the isolated point is also influential, and r is close to 0.
6. (a) $\widehat{y}$ = –0.16(50) + 34.8 = 26.8 miles per gallon, and $\widehat{y}=-0.0032(50)^{2}+0.258(50)+23.8=28.7$ miles per gallon.
(b) Model 2 is the better fit. First, the residuals are much smaller for model 2, indicating that this model gives values much closer to the observed values. Second, a curved residual pattern like that in model 1 indicates that a nonlinear model would be better. A more uniform residual scatter as in model 2 indicates a better fit.
7. (a) The correlation coefficient is $r=\sqrt{0.746}\approx 0.864$. It is positive because the slope of the regression line is positive.
(b) The slope is 1.106, signifying that each additional page is associated with a predicted grade about 1.106 points higher, on average.
(c) Including Mary’s paper will lower the correlation coefficient because her result seems far off the regression line through the other points.
(d) Including Mary’s paper will swing the regression line down and lower the value of the slope.
(e) From the graph, Mary received an 82. From the regression line, Mary would have received $\widehat{y}$ = 46.51 + 1.106(45) = 96.3 if she had turned in her paper on time.
8. (a) Yes. The residual graph is not curved, does not show fanning, and appears to be random or scattered.
(b) The slope is 0.95893, indicating that the winning jump improves 0.95893 inches per year on average or about 3.8 inches every four years on average.
(c) With $r^2$ = 0.921, the correlation r is 0.96.
(d) 0.95893(80) + 256.576 ≈ 333.3 inches
(e) The residual for 1980 is +2, and so the actual winning distance must have been 333.3 + 2 = 335.3 inches.
9. (a)
(b)
(c) There is a roughly linear trend with daily accidents increasing during the month.
(d) The daily number of accidents is strongly skewed to the right.
10. (a) The correlation coefficient is $r=\sqrt{0.819}\approx 0.905$. It is positive because the slope of the regression line is positive.
(b) The slope is 8.5, signifying that each additional gram of medication lowers the pulse rate by an additional 8.5 beats per minute, on average.
(c) $\widehat{y}$ = –1.68 + 8.5(2.25) = 17.4 beats per minute.
(d) There is always danger in using a regression line to extrapolate beyond the values of x contained in the data. In this case, the 5 grams was an overdose, the patient died, and the regression line cannot be used for such values beyond the data set.
(e) Removing the 3-gram result from the data set will increase the correlation coefficient because the 3-gram result appears to be far off a regression line through the remaining points.
(f) Removing the 3-gram result from the data set will swing the regression line upward so that the slope will increase.
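The calculator results quoted in free-response answers 1, 2, and 4 can be reproduced with a short Python sketch that uses the data given in those questions. The helper names `least_squares` and `correlation` are illustrative assumptions.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def correlation(xs, ys):
    """Return the correlation coefficient r of the paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Question 1: home wins (x) and average attendance (y).
wins = [34, 32, 21, 18, 18]
attendance = [18997, 17648, 16343, 18027, 13254]
slope, intercept = least_squares(wins, attendance)
lakers_residual = attendance[0] - (intercept + slope * wins[0])

# Question 2: shoe size (x) and number of ties (y).
shoe = [8, 9.5, 9, 11, 9, 9.5, 8.5, 9, 9, 9.5]
ties = [10, 10, 8, 15, 12, 13, 16, 7, 12, 4]
r_shoes = correlation(shoe, ties)

# Question 4: each set alone, then all 12 points pooled.
A = [(5, 5), (5, 10), (10, 5), (10, 10)]
B = [(50, 50), (50, 55), (55, 50), (55, 55)]
C = [(90, 90), (90, 95), (95, 90), (95, 95)]
r_each = [correlation([x for x, _ in s], [y for _, y in s]) for s in (A, B, C)]
pooled = A + B + C
r_pooled = correlation([x for x, _ in pooled], [y for _, y in pooled])

print(round(slope, 1), round(intercept), round(lakers_residual))
print(round(r_shoes, 4), r_each, round(r_pooled, 4))
```

Because the sketch carries unrounded coefficients, the Lakers residual may differ slightly from a value worked out with the rounded equation.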
MARGINAL FREQUENCIES AND DISTRIBUTIONS
CONDITIONAL FREQUENCIES AND DISTRIBUTIONS
While many variables such as age, income, and years of education are quantitative or numerical in nature, others such as gender, race, brand preference, mode of transportation, and type of occupation are qualitative or categorical. Quantitative variables, too, are sometimes grouped into categorical classes.
MARGINAL FREQUENCIES AND DISTRIBUTIONS
Qualitative data often encompass two categorical variables that may or may not have a dependent relationship. These data can be displayed in a two-way contingency table.
EXAMPLE 5.1
The Cuteness Factor: A Japanese study had volunteers look at pictures of cute baby animals, adult animals, or tasty-looking foods, after which they tested their focus in solving puzzles.
Pictures viewed is the row variable, whereas level of focus is the column variable. One method of analyzing these data involves calculating the totals for each row and each column.
These totals are placed in the right and bottom margins of the table and thus are called marginal frequencies (or marginal totals). These marginal frequencies can then be put in the form of proportions or percentages. The marginal distribution of the level of focus is
This distribution can be displayed in a bar graph as follows:
Similarly, we can determine the marginal distribution for the pictures viewed:
The representative bar graph is
CONDITIONAL FREQUENCIES AND DISTRIBUTIONS
The marginal distributions described and calculated above do not describe or measure the relationship between the two categorical variables. For this we must consider the information in the body of the table, not just the sums in the margins.
EXAMPLE 5.2
We are interested in predicting the level of focus from the pictures viewed, and so we look at conditional frequencies for each row separately. For example, in Example 5.1 what proportion or percentage of the participants who viewed baby animals then had each of the levels of focus?
This conditional distribution can be displayed either with groupings of bars or by a segmented bar chart where each segment has a length corresponding to its relative frequency:
Similarly, the conditional distribution for the participants who viewed adult animals is
For the participants who viewed the tasty foods, we have
Both of the following bar charts give good visual pictures:
EXAMPLE 5.3
A study was made to compare year in high school with preference for vanilla or chocolate ice cream with the following results:
What are the conditional relative frequencies for each class?
In such a case, where all the conditional relative frequency distributions are identical, we say that the two variables show perfect independence. (However, it should be noted that even if the two variables are completely independent, the chance is very slim that a resulting contingency table will show perfect independence.)
EXAMPLE 5.4
Suppose you need heart surgery and are trying to decide between two surgeons, Dr. Fixit and Dr. Patch. You find out that each operated 250 times last year with the following results:
Whom should you go to? Among Dr. Fixit’s 250 patients 190 survived, for a survival rate of 190/250=0.76 or 76%, while among Dr. Patch’s 250 patients 200 survived, for a survival rate of 200/250=0.80 or 80%. Your choice seems clear.
However, everything may not be so clear-cut. Suppose that on further investigation you determine that the surgeons operated on patients who were in either good or poor condition with the following results:
Note that adding corresponding boxes from these two tables gives the original table above.
How do the surgeons compare when operating on patients in good health? Dr. Fixit’s 68 patients in good condition have a survival rate of 60/68=0.882 or 88.2%, while Dr. Patch’s 137 patients in good condition have a survival rate of 120/137=0.876 or 87.6%. Similarly, we note that Dr. Fixit’s 182 patients in poor condition have a survival rate of 130/182=0.714 or 71.4%, while Dr. Patch’s 113 patients in poor condition have a survival rate of 80/113=0.708 or 70.8%.
Thus Dr. Fixit does better with patients in good condition (88.2% versus Dr. Patch’s 87.6%) and also does better with patients in poor condition (71.4% versus Dr. Patch’s 70.8%). However, Dr. Fixit has a lower overall patient survival rate (76% versus Dr. Patch’s 80%)! How can this be?
This problem is an example of Simpson’s paradox, where a comparison can be reversed when several groups are combined to form a single group. The effect of another variable, sometimes called a lurking variable, is masked when the groups are combined. In this particular example, closer scrutiny reveals that Dr. Fixit operates on many more patients in poor condition than Dr. Patch does (182 versus 113), and these patients in poor condition are precisely the ones with lower survival rates. Thus even though Dr. Fixit does better within each group of patients, his overall rating is lower. Our original table hid the effect of the lurking variable related to the condition of the patients.
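The reversal in Example 5.4 can be checked with a few lines of arithmetic. The following is a minimal sketch using only the counts quoted in the example; the dictionary layout and the function names are illustrative choices, not part of the original tables.

```python
# Survival counts quoted in Example 5.4, stored as (survived, total patients).
results = {
    "Dr. Fixit": {"good": (60, 68), "poor": (130, 182)},
    "Dr. Patch": {"good": (120, 137), "poor": (80, 113)},
}


def rate(survived, total):
    """Survival rate as a proportion."""
    return survived / total


def overall(groups):
    """Pool the condition groups and return the combined survival rate."""
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return survived / total


for surgeon, groups in results.items():
    by_condition = ", ".join(
        f"{condition}: {rate(*counts):.1%}" for condition, counts in groups.items()
    )
    print(f"{surgeon}: {by_condition}, overall: {overall(groups):.1%}")
```

Because each overall rate is a weighted average of the two condition-specific rates, the surgeon with more patients in poor condition is pulled toward the lower rate.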
SUMMARY
Two-way contingency tables are useful in showing relationships between two categorical variables.
The row and column totals lead to calculations of the marginal distributions.
Focusing on single rows or columns leads to calculations of conditional distributions.
Segmented bar charts are a useful visual tool to show conditional distributions.
Simpson’s paradox occurs when the results from a combined grouping seem to contradict the results from the individual groups.
Multiple-Choice Questions
Directions: The questions or incomplete statements that follow are each followed by five suggested answers or completions. Choose the response that best answers the question or completes the statement.
Questions 1–5 are based on the following: To study the relationship between party affiliation and support for a balanced budget amendment, 500 registered voters were surveyed with the following results:
1. What percentage of those surveyed were Democrats?
(A) 10%
(B) 20%
(C) 30%
(D) 40%
(E) 50%
2. What percentage of those surveyed were for the amendment and were Republicans?
(A) 25%
(B) 38%
(C) 40%
(D) 62.5%
(E) 65.8%
3. What percentage of Independents had no opinion?
(A) 5%
(B) 10%
(C) 20%
(D) 25%
(E) 50%
4. What percentage of those against the amendment were Democrats?
(A) 30%
(B) 42%
(C) 50%
(D) 60%
(E) 71.4%
5. Voters of which affiliation were most likely to have no opinion about the amendment?
(A) Democrat
(B) Republican
(C) Independent
(D) Republican and Independent, equally
(E) Democrat, Republican, and Independent, equally
Questions 6–10 are based on the following: A study of music preferences in three geographic locations resulted in the following segmented bar chart:
6. What percentage of those surveyed from the Northeast prefer country music?
(A) 20%
(B) 30%
(C) 40%
(D) 50%
(E) 70%
7. Which of the following is greatest?
(A) The percentage of those from the Northeast who prefer classical.
(B) The percentage of those from the West who prefer country.
(C) The percentage of those from the South who prefer pop or rock.
(D) The above are all equal.
(E) It is impossible to determine the answer without knowing the actual numbers of people involved.
8. Which of the following is greatest?
(A) The number of people in the Northeast who prefer pop or rock.
(B) The number of people in the West who prefer classical.
(C) The number of people in the South who prefer country.
(D) The above are all equal.
(E) It is impossible to determine the answer without knowing the actual numbers of people involved.
9. All three bars have a height of 100%.
(A) This is a coincidence.
(B) This happened because each bar shows a complete distribution.
(C) This happened because there are three bars each divided into three segments.
(D) This happened because of the nature of musical patterns.
(E) None of the above is true.
10. Based on the given segmented bar chart, does there seem to be a relationship between geographic location and music preference?
(A) Yes, because the corresponding segments of the three bars have different lengths.
(B) Yes, because the heights of the three bars are identical.
(C) Yes, because there are three segments and three bars.
(D) No, because the heights of the three bars are identical.
(E) No, because summing the corresponding segments for classical, summing the corresponding segments for country, and summing the corresponding segments for pop or rock all give approximately the same total.
11. In the following table, what value for n results in a table showing perfect independence?
(A) 10
(B) 40
(C) 60
(D) 75
(E) 100
12. A company employs both men and women in its secretarial and executive positions. In reports filed with the government, the company shows that the percentage of female employees who receive raises is higher than the percentage of male employees who receive raises. A government investigator claims that the percentage of male secretaries who receive raises is higher than the percentage of female secretaries who receive raises, and that the percentage of male executives who receive raises is higher than the percentage of female executives who receive raises. Is this possible?
(A) No, either the company report is wrong or the investigator’s claim is wrong.
(B) No, if the company report is correct, then either a greater percentage of female secretaries than of male secretaries receive raises or a greater percentage of female executives than of male executives receive raises.
(C) No, if the investigator is correct, then by summation of the corresponding numbers, the total percentage of male employees who receive raises would have to be greater than the total percentage of female employees who receive raises.
(D) All of the above are true.
(E) It is possible for both the company report to be true and the investigator’s claim to be correct.
Free-Response Questions
Directions: You must show all work and indicate the methods you use. You will be graded on the correctness of your methods and on the accuracy of your final answers.
TWO OPEN-ENDED QUESTIONS
1. The following table gives the numbers (in thousands) of officers and enlisted personnel by military branch in the U.S. armed forces.
Career Path | Army | Navy | Marine Corps | Air Force |
Officers | 88 | 52 | 20 | 65 |
Enlisted | 452 | 276 | 178 | 258 |
(a) Calculate the percentage
i. of military men and women who are enlisted.
ii. of military men and women who are not Marine Corps officers.
iii. of officers who are in the Navy.
(b) Construct a graphical display showing the association between career path (officer vs. enlisted) and military branch.
(c) Summarize what the graphical display illustrates about the association between career path (officer vs. enlisted) and military branch.
2. The graduate school at the University of California at Berkeley reported that in 1973 they accepted 44% of 8442 male applicants and 35% of 4321 female applicants. Concerned that one of their programs was guilty of gender bias, the graduate school analyzed admissions to the six largest graduate programs and obtained the following results:
Program | Men Accepted | Men Rejected | Women Accepted | Women Rejected |
A | 511 | 314 | 89 | 19 |
B | 352 | 208 | 17 | 8 |
C | 120 | 205 | 202 | 391 |
D | 137 | 270 | 132 | 243 |
E | 53 | 138 | 95 | 298 |
F | 22 | 351 | 24 | 317 |
(a) Find the percentage of men and the percentage of women accepted by each program. Comment on any pattern or bias you see.
(b) Find the percentage of men and the percentage of women accepted overall by these six programs. Does this appear to contradict the results from part (a)?
(c) If you worked in the Graduate Admissions Office, what would you say to an inquiring reporter who is investigating gender bias in graduate admissions?
MULTIPLE-CHOICE
1. (E) Of the 500 people surveyed, 50 + 150 + 50 = 250 were Democrats, and 250/500=0.5 or 50%.
2. (A) Of the 500 people surveyed, 125 were both for the amendment and Republicans, and 125/500=0.25 or 25%.
3. (E) There were 15 + 10 + 25 = 50 Independents; 25 of them had no opinion, and 25/50=0.5 or 50%.
4. (E) There were 150 + 50 + 10 = 210 people against the amendment; 150 of them were Democrats, and 150/210=0.714 or 71.4%.
5. (C) The percentages of Democrats, Republicans, and Independents with no opinion are 20%, 12.5%, and 50%, respectively.
6. (A) In the bar corresponding to the Northeast, the segment corresponding to country music stretches from the 50% level to the 70% level, indicating a length of 20%.
7. (B) Based on lengths of indicated segments, the percentage from the West who prefer country is the greatest.
8. (E) The given bar chart shows percentages, not actual numbers.
9. (B) In a complete distribution, the probabilities sum to 1, and the relative frequencies total 100%.
10. (A) The different lengths of corresponding segments show that in different geographic regions different percentages of people prefer each of the music categories.
11. (D) Relative frequencies must be equal. Either looking at rows gives $\frac{20}{70}=\frac{30}{30+n}$ or looking at columns gives $\frac{20}{50}=\frac{50}{50+n}$. We could also set up a proportion $\frac{n}{30}=\frac{50}{20}$ or $\frac{n}{50}=\frac{30}{20}$. Solving any of these equations gives n = 75.
12. (E) It is possible for both to be correct, for example, if there were 11 secretaries (10 women, 3 of whom receive raises, and 1 man who receives a raise) and 11 executives (10 men, 1 of whom receives a raise, and 1 woman who does not receive a raise). Then 100% of the male secretaries receive raises while only 30% of the female secretaries do; and 10% of the male executives receive raises while 0% of the female executives do. However, overall 3 out of 11 women receive raises, while only 2 out of 11 men receive raises. This is an example of Simpson’s paradox.
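The proportion in answer 11 can be checked mechanically. The sketch below assumes the table has rows (20, 50) and (30, n), a layout inferred from the equations in answer 11 rather than read from the table itself; the function name is illustrative.

```python
from fractions import Fraction


def shows_perfect_independence(table):
    """Return True when every row has the same conditional distribution."""
    distributions = []
    for row in table:
        total = sum(row)
        # Exact fractions avoid floating-point round-off in the comparison.
        distributions.append([Fraction(cell, total) for cell in row])
    return all(dist == distributions[0] for dist in distributions)


# Assumed layout: first row (20, 50), second row (30, n).
for n in (10, 40, 60, 75, 100):
    print(n, shows_perfect_independence([[20, 50], [30, n]]))
```

Only one of the five answer choices makes the two row distributions identical, which is the definition of perfect independence used in Example 5.3.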
FREE-RESPONSE
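1. (a) The row totals are 88 + 52 + 20 + 65 = 225 thousand officers and 452 + 276 + 178 + 258 = 1164 thousand enlisted, for a grand total of 1389 thousand.
i. 1164/1389=0.838 or 83.8%
ii. (1389 - 20)/1389=0.986 or 98.6%
iii. 52/225=0.231 or 23.1%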
(b) Calculate row or column totals, and then show either a side-by-side bar graph or a segmented bar graph, showing percentages, and conditioned on either career path (officer vs. enlisted) or military branch:
(c) The Army and the Navy have about the same percentage of officers (16%), while the Air Force has a higher percentage of officers (20%), and the Marine Corps has a lower percentage of officers (10%).
OR
Officers and enlisted personnel have about the same percentage in the Army (39%) and about the same percentage in the Navy (23%), while officers have a lower percentage in the Marine Corps than enlisted personnel (9% vs. 15%) and a higher percentage in the Air Force (29% vs. 22%).
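The marginal totals and conditional percentages behind this answer can be reproduced from the table in Question 1. This is a minimal sketch; the variable names and the rounding to one decimal place are illustrative choices.

```python
# Counts (in thousands) from the table in Free-Response Question 1.
counts = {
    "Officers": {"Army": 88, "Navy": 52, "Marine Corps": 20, "Air Force": 65},
    "Enlisted": {"Army": 452, "Navy": 276, "Marine Corps": 178, "Air Force": 258},
}
branches = list(counts["Officers"])

# Marginal totals: one per row (career path) and one per column (branch).
row_totals = {path: sum(by_branch.values()) for path, by_branch in counts.items()}
col_totals = {b: sum(counts[path][b] for path in counts) for b in branches}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Part (a): three percentages, each with a different denominator.
part_a = {
    "enlisted": pct(row_totals["Enlisted"], grand_total),
    "not Marine Corps officers": pct(
        grand_total - counts["Officers"]["Marine Corps"], grand_total
    ),
    "officers in the Navy": pct(counts["Officers"]["Navy"], row_totals["Officers"]),
}

# Part (c), conditioned on branch: percentage of each branch who are officers.
officer_pct_by_branch = {b: pct(counts["Officers"][b], col_totals[b]) for b in branches}

# Part (c), conditioned on career path: branch distribution within each path.
branch_pct_by_path = {
    path: {b: pct(counts[path][b], row_totals[path]) for b in branches}
    for path in counts
}

print(part_a)
print(officer_pct_by_branch)
print(branch_pct_by_path)
```

Note how the denominator changes with the wording: "of military men and women" uses the grand total, while "of officers" uses only the officer row total.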
2. (a)
Program | Percentage of Men Accepted (%) | Percentage of Women Accepted (%) |
A | 62 | 82 |
B | 63 | 68 |
C | 37 | 34 |
D | 34 | 35 |
E | 28 | 24 |
F | 6 | 7 |
There doesn’t appear to be any real pattern; however, women seem to be favored in four of the programs, while men seem to be slightly favored in the other two programs.
(b) Overall, 1195 out of 2681 male applicants were accepted, for a 45% acceptance rate, while 559 out of 1835 female applicants were accepted, for a 30% acceptance rate. This appears to contradict the results from part (a).
(c) You should tell the reporter that while it is true that the overall acceptance rate for women in these six programs is 30% compared to the 45% acceptance rate for men, program by program women have either higher acceptance rates or only slightly lower acceptance rates than men. The reason behind this apparent paradox is that most men applied to programs A and B, which are easy to get into and have high acceptance rates. However, most women applied to programs C, D, E, and F, which are much harder to get into and have low acceptance rates.
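The program-by-program and pooled acceptance rates can be recomputed directly from the table in Question 2. The following is a minimal sketch; the tuple layout and names are illustrative.

```python
# Per program: (men accepted, men rejected, women accepted, women rejected),
# copied from the table in Free-Response Question 2.
admissions = {
    "A": (511, 314, 89, 19),
    "B": (352, 208, 17, 8),
    "C": (120, 205, 202, 391),
    "D": (137, 270, 132, 243),
    "E": (53, 138, 95, 298),
    "F": (22, 351, 24, 317),
}


def acceptance_rate(accepted, rejected):
    """Proportion of applicants who were accepted."""
    return accepted / (accepted + rejected)


# Part (a): rates within each program.
women_higher = []
for program, (m_acc, m_rej, w_acc, w_rej) in admissions.items():
    men = acceptance_rate(m_acc, m_rej)
    women = acceptance_rate(w_acc, w_rej)
    if women > men:
        women_higher.append(program)
    print(f"{program}: men {men:.1%}, women {women:.1%}")

# Part (b): pool the six programs before computing the rates.
men_accepted = sum(row[0] for row in admissions.values())
men_total = sum(row[0] + row[1] for row in admissions.values())
women_accepted = sum(row[2] for row in admissions.values())
women_total = sum(row[2] + row[3] for row in admissions.values())
overall_men = men_accepted / men_total
overall_women = women_accepted / women_total
print(f"overall: men {overall_men:.1%}, women {overall_women:.1%}")
```

Comparing applicant counts per program (the sums of accepted and rejected) shows where men and women tended to apply, which is the lurking variable discussed in part (c).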
CENSUS
SAMPLE SURVEY
EXPERIMENT
OBSERVATIONAL STUDY
In the real world, time and cost considerations usually make it impossible to analyze an entire population. Does the government question you and your parents before announcing the monthly unemployment rates? Does a television producer check every household’s viewing preferences before deciding whether a pilot program will be continued? In studying statistics we learn how to estimate population characteristics by considering a sample. For example, later in this book we will see how to estimate population means and proportions by looking at sample means and proportions.
To derive conclusions about the larger population, we need to be confident that the sample we have chosen represents that population fairly. Analyzing the data with computers is often easier than gathering the data, but the frequently quoted “Garbage in, garbage out” applies here. Nothing can help if the data are badly collected. Unfortunately, many of the statistics with which we are bombarded by newspapers, radio, and television are based on poorly designed data collection procedures.
A census is a complete enumeration of an entire population. In common use, it is often thought of as an official attempt to contact every member of the population, usually with details regarding age, marital status, race, gender, occupation, income, years of school completed, and so on. Every 10 years the U.S. Bureau of the Census divides the nation into nine regions and attempts to gather information about everyone in the country. A massive amount of data is obtained, but even with the resources of the U.S. government, the census is not complete. For example, many homeless people are always missed, or counted at two temporary residences, and there are always households that do not respond even after repeated requests for information. It is estimated that the 2010 census missed about 2.1% of Black Americans and 1.5% of Hispanics, together accounting for some 1.5 million people.
In most studies, both in the private and public sectors, a complete census is unreasonable because of time and cost involved. Furthermore, attempts to gather complete data have been known to lead to carelessness. Finally, and most important, a well-designed, well-conducted sample survey is far superior to a poorly designed study involving a complete census. For example, a poorly worded question might give meaningless data even if everyone in the population answers.
The census tries to count everyone; it is not a sample. A sample survey aims to obtain information about a whole population by studying a part of it, that is, a sample. The goal is to gather information without disturbing or changing the population. Numerous procedures are used to collect data through sampling, and much of the statistical information distributed to us comes from sample surveys. Often, controlled experiments are later undertaken to demonstrate relationships suggested by sample surveys.
However, the one thing that most quickly invalidates a sample and makes useful information impossible to obtain is bias. A sample is biased if in some critical way it does not represent the population. The main technique to avoid bias is to incorporate randomness into the selection process. Randomization protects us from effects and influences, both known and unknown. Finally, the larger the sample, the better the results, but what is critical is the sample size, not the percentage or fraction of the population. That is, a random sample of size 500 from a population of size 100,000 is just as representative as a random sample of size 500 from a population of size 1,000,000.
In a controlled study, called an experiment, the researcher should randomly divide subjects into appropriate groups. Some action is taken on one or more of the groups, and the response is observed. For example, patients may be randomly given unmarked capsules of either aspirin or acetaminophen and the effects of the medication measured. Experiments often have a treatment group and a control group; in the ideal situation, neither the subjects nor the researcher knows which group is which. The Salk vaccine experiment of the 1950s, in which half the children received the vaccine and half were given a placebo, with not even their doctors knowing who received what, is a classic example of this double-blind approach. Controlled experiments can indicate cause-and-effect relationships.
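Random assignment of subjects to treatments is easy to carry out by computer. The sketch below is one possible approach, not a prescribed procedure: the patient labels, the group size of 20, and the seed are hypothetical, and the function name is illustrative.

```python
import random


def randomly_assign(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {treatment: [] for treatment in treatments}
    for position, subject in enumerate(shuffled):
        groups[treatments[position % len(treatments)]].append(subject)
    return groups


# Hypothetical subject labels; a real study would use its own roster.
patients = [f"patient_{i:02d}" for i in range(1, 21)]
groups = randomly_assign(patients, ["aspirin", "acetaminophen"], seed=1)
for treatment, members in groups.items():
    print(treatment, len(members), sorted(members))
```

Chance alone decides who receives which capsule, so neither the researcher nor the subjects can steer particular patients toward a particular treatment.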
The critical principles behind good experimental design include control (outside of who receives what treatments, conditions should be as similar as possible for all involved groups), blocking (the subjects can be divided into representative groups to bring certain differences directly into the picture), randomization (unknown and uncontrollable differences are handled by randomizing who receives what treatments), replication (treatments need to be repeated on a sufficient number of subjects), and generalizability (ability to repeat an experiment in a variety of settings).
Sample surveys are one example of what are called observational studies. In observational studies there is no choice in regard to who goes into the treatment and control groups. For example, a researcher cannot ethically tell 100 people to smoke three packs of cigarettes a day and 100 others to smoke only one pack per day; the researcher can only observe people who habitually smoke these amounts. In observational studies the researcher strives to determine which variables affect the noted response. While results may suggest relationships, it is difficult to conclude cause and effect.
Observational studies are primary, vital sources of data; however, they are a poor method of measuring the effect of change. To evaluate responses to change, one must impose change, that is, perform an experiment. Furthermore, observational studies on the impact of some variable on another variable often fail because explanatory variables are confounded with other variables.
SUMMARY
A complete census is usually unreasonable because of time and cost constraints.
Estimate population characteristics (called parameters) by considering statistics from a sample.
Analysis of badly gathered sample data is usually a meaningless exercise.
A sample is biased if in some critical way it does not represent the population.
The main technique to avoid bias is to incorporate randomness into the selection process.
Experiments involve applying a treatment to one or more groups and observing the responses.
Observational studies involve observing responses to choices people make.
Multiple-Choice Questions
Directions: The questions or incomplete statements that follow are each followed by five suggested answers or completions. Choose the response that best answers the question or completes the statement.
1. When travelers change airlines during connecting flights, each airline receives a portion of the fare. Several years ago, the major airlines used a sample trial period to determine what percentage of certain fares each should collect. Using these statistical results to determine fare splits, the airlines now claim huge savings over previous clerical costs. Which of the following is true?
(A) The airlines ran an experiment using a trial period for the control group.
(B) The airlines ran an experiment using fare splits as treatments.
(C) The airlines ran an observational study using the calculations from a trial period as a sample.
(D) The airlines ran an observational study, but fare splits were a confounding variable.
(E) The airlines tried to gather a census but ended up with a sample.
2. Which of the following is not true?
(A) In an experiment some treatment is intentionally forced on one group to note the response.
(B) In an observational study information is gathered on an already existing situation.
(C) Sample surveys are observational studies, not experiments.
(D) While observational studies may suggest relationships, it is usually not possible to conclude cause and effect because of the lack of control over possible confounding variables.
(E) A complete census is the only way to establish a cause-and-effect relationship absolutely.
3. In one study on the effect of niacin on cholesterol level, 100 subjects who acknowledged being long-time niacin takers had their cholesterol levels compared with those of 100 people who had never taken niacin. In a second study, 50 subjects were randomly chosen to receive niacin and 50 were chosen to receive a placebo.
(A) The first study was a controlled experiment, while the second was an observational study.
(B) The first study was an observational study, while the second was a controlled experiment.
(C) Both studies were controlled experiments.
(D) Both studies were observational studies.
(E) Each study was part controlled experiment and part observational study.
4. In one study subjects were randomly given either 500 or 1000 milligrams of vitamin C daily, and the number of colds they came down with during a winter season was noted. In a second study people responded to a questionnaire asking about the average number of hours they sleep per night and the number of colds they came down with during a winter season.
(A) The first study was an experiment without a control group, while the second was an observational study.
(B) The first study was an observational study, while the second was a controlled experiment.
(C) Both studies were controlled experiments.
(D) Both studies were observational studies.
(E) None of the above is a correct statement.
5. In a 1992 London study, 12 out of 20 migraine sufferers were given chocolate whose flavor was masked by peppermint, while the remaining eight sufferers received a similar-looking, similar-tasting tablet that had no chocolate. Within 1 day, five of those receiving chocolate complained of migraines, while no complaints were made by any of those who did not receive chocolate. Which of the following is a true statement?
(A) This study was an observational study of 20 migraine sufferers in which it was noted how many came down with migraines after eating chocolate.
(B) This study was a sample survey in which 12 out of 20 migraine sufferers were picked to receive peppermint-flavored chocolate.
(C) A census of 20 migraine sufferers was taken, noting how many were given chocolate and how many developed migraines.
(D) A study was performed using chocolate as a placebo to study one cause of migraines.
(E) An experiment was performed comparing a treatment group that was given chocolate to a control group that was not.
6. Suppose you wish to compare the average class size of mathematics classes to the average class size of English classes in your high school. Which is the most appropriate technique for gathering the needed data?
(A) Census
(B) Sample survey
(C) Experiment
(D) Observational study
(E) None of these methods is appropriate.
7. Two studies are run to compare the experiences of families living in high-rise public housing to those of families living in townhouse subsidized rentals. The first study interviews 25 families who have been in each government program for at least 1 year, while the second randomly assigns 25 families to each program and interviews them after 1 year. Which of the following is a true statement?
(A) Both studies are observational studies because of the time period involved.
(B) Both studies are observational studies because there are no control groups.
(C) The first study is an observational study, while the second is an experiment.
(D) The first study is an experiment, while the second is an observational study.
(E) Both studies are experiments.
8. Two studies are run to determine the effect of low levels of wine consumption on cholesterol level. The first study measures the cholesterol levels of 100 volunteers who have not consumed alcohol in the past year and compares these values with their cholesterol levels after 1 year, during which time each volunteer drinks one glass of wine daily. The second study measures the cholesterol levels of 100 volunteers who have not consumed alcohol in the past year, randomly picks half the group to drink one glass of wine daily for a year while the others drink no alcohol for the year, and finally measures their levels again. Which of the following is a true statement?
(A) The first study is an observational study, while the second is an experiment.
(B) The first study is an experiment, while the second is an observational study.
(C) Both studies are observational studies, but only one uses both randomization and a control group.
(D) The first study is a census of 100 volunteers, while the second study is an experiment.
(E) Both studies are experiments.
MULTIPLE-CHOICE
1. (C) This study is not an experiment in which responses are being compared. It is an observational study in which the airlines use split fare calculations from a trial period as a sample to indicate the pattern of all split fare transactions. A census listing all possible connecting flights was not attempted.
2. (E) The first two sentences can be considered part of the definitions of experiment and observational study. A sample survey does not impose any treatment; it simply counts a certain outcome, and so it is an observational study, not an experiment. A complete census can provide much information about a population, but it doesn’t necessarily establish a cause-and-effect relationship among seemingly related population parameters.
3. (B) The first study was observational because the subjects were not chosen for treatment.
4. (A) The first study was an experiment with two treatment groups and no control group. The second study was observational; the researcher did not randomly divide the subjects into groups and have each group sleep a designated number of hours per night.
5. (E) This study was an experiment in which the researchers divided the subjects into treatment and control groups. A census would involve a study of all migraine sufferers, not a sample of 20. The response of the treatment group receiving chocolate was compared to the response of the control group receiving a placebo. The peppermint tablet with no chocolate was the placebo.
6. (A) The main office at your school should be able to give you the class sizes of every math and English class. If need be, you can check with every math and English teacher.
7. (C) In the first study the families were already in the housing units, while in the second study one of two treatments was applied to each family.
8. (E) Both studies apply treatments and measure responses, and so both are experiments.
SIMPLE RANDOM SAMPLING
CHARACTERISTICS OF A WELL-DESIGNED, WELL-CONDUCTED SURVEY
SAMPLING ERROR
SOURCES OF BIAS
OTHER SAMPLING METHODS
Most data collection involves observational studies, not controlled experiments. Furthermore, while most data collection has some purpose, many studies come to mind after the data have been assembled and examined. For data collection to be useful, the resulting sample must be representative of the population under consideration.
How can a good, that is, a representative, sample be chosen? One technique would be to write the name of each member of the population on a card, mix the cards thoroughly in a large box, and pull out a specified number of cards. This method would give everyone in the population an equal chance of being selected as part of the sample. Unfortunately, this method is usually too time-consuming and too costly, and bias might still creep in if the mixing is not thorough. A simple random sample, that is, one in which every possible sample of the desired size has an equal chance of being selected, can more easily be obtained by assigning a number to everyone in the population and using a random number table or having a computer generate random numbers to indicate choices.
EXAMPLE 7.1
Suppose 80 students are taking an AP Statistics course and the teacher wants to randomly pick out a sample of 10 students to try out a practice exam. She first assigns the students numbers 01, 02, 03, …, 80. Reading off two digits at a time from a random number table, she ignores any over 80 and ignores repeats, stopping when she has a set of ten. If the table began 75425 56573 90420 48642 27537 61036 15074 84675, she would choose the students numbered 75, 42, 55, 65, 73, 04, 27, 53, 76, and 10. Note that 90 and 86 are ignored because they are over 80, and the second and third occurrences of 42 are ignored because they are repeats.
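The table-reading procedure in Example 7.1 can be written out step by step. This is a minimal sketch of that procedure; the function name and parameters are illustrative.

```python
def select_from_table(digits, population_size, sample_size, width=2):
    """Read labels of `width` digits from a random number table.

    Labels outside 1..population_size and repeated labels are skipped.
    """
    digits = "".join(ch for ch in digits if ch.isdigit())  # drop the spaces
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        label = int(digits[start:start + width])
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
            if len(chosen) == sample_size:
                break
    return chosen


table = "75425 56573 90420 48642 27537 61036 15074 84675"
print(select_from_table(table, population_size=80, sample_size=10))
# → [75, 42, 55, 65, 73, 4, 27, 53, 76, 10]
```

When no printed table is required, Python's standard `random.sample(range(1, 81), 10)` draws a simple random sample of ten labels directly, without the need to skip repeats by hand.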
CHARACTERISTICS OF A WELL-DESIGNED, WELL-CONDUCTED SURVEY
A well-designed survey always incorporates chance, such as using random numbers from a table or a computer. However, the use of probability techniques is not enough to ensure a representative sample. Often we don’t have a complete listing of the population, and so we have to be careful about exactly how we are applying “chance.” Even when subjects are picked by chance, they may choose not to respond to the survey or they may not be available to respond, thus calling into question how representative the final sample really is. The wording of the questions must be neutral—subjects give different answers depending on the phrasing.
EXAMPLE 7.2
Suppose we are interested in determining the percentage of adults in a small town who eat a nutritious breakfast. How about randomly selecting 100 numbers out of the telephone book, calling each one, and asking whether the respondent is intelligent enough to eat a nutritious breakfast every morning?
Answer: Random selection is good, but a number of questions should be addressed. For example, are there many people in the town without telephones or with unlisted numbers? How will the time of day the calls are made affect whether the selected people are reachable? If people are unreachable, will replacements be randomly chosen in the same way or will this lead to a certain class of people being underrepresented? Finally, even if these issues are satisfactorily addressed, the wording of the question is clearly not neutral—unless the phrase intelligent enough is dropped, answers will be almost meaningless.
SAMPLING ERROR: THE VARIATION INHERENT IN A SURVEY
No matter how well-designed and well-conducted a survey is, it still gives a sample statistic as an estimate for a population parameter. Different samples give different sample statistics, all of which are estimates for the same population parameter, and so error, called sampling error, is naturally present. This error can be described using probability; that is, we can say how likely we are to have an error of a certain size. Generally, sampling error tends to be smaller when the sample size is larger. However, the way the data are obtained is crucial: a large sample size cannot make up for a poor survey design or faulty collection techniques. Only a complete census can prevent sampling error; that is, whenever a sample is taken, sampling error will be present.
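A small simulation can illustrate how sampling error shrinks as the sample size grows. The sketch below is in Python and rests on assumptions that are not in the text: a hypothetical population in which 53% favor a candidate, independent draws (as if from a very large population), and 200 repeated samples of each size.

```python
import random
import statistics

def spread_of_sample_proportions(true_p, n, repetitions, rng):
    """Standard deviation of the sample proportion over many samples of size n."""
    proportions = []
    for _ in range(repetitions):
        hits = sum(1 for _ in range(n) if rng.random() < true_p)
        proportions.append(hits / n)
    return statistics.stdev(proportions)

rng = random.Random(1)  # fixed seed only so that the run is repeatable
small = spread_of_sample_proportions(0.53, n=50, repetitions=200, rng=rng)
large = spread_of_sample_proportions(0.53, n=1000, repetitions=200, rng=rng)
print(f"spread of sample proportions with n=50:   {small:.3f}")
print(f"spread of sample proportions with n=1000: {large:.3f}")
```

Every sample proportion still differs somewhat from 0.53, but the proportions from the larger samples cluster much more tightly around it.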
EXAMPLE 7.3
Each of four major news organizations surveys likely voters and separately reports that the percentage favoring the incumbent candidate is 53.4%, 54.1%, 52.0%, and 54.2%, respectively. What is the correct percentage? Did three or more of the news organizations make a mistake?
Answer: There is no way of knowing the correct population percentage from the information given. The four surveys led to four statistics, each an estimate of the population parameter. No one made a mistake unless there was a bad survey, for example, one without the use of chance, or not representative of the population, or with poor wording of the question. Sampling differences are natural.
TIP
Sampling error is to be expected, while bias is to be avoided.
Poorly designed sampling techniques result in bias, that is, in a tendency to favor the selection of certain members of a population. If a study is biased, size doesn’t help: a large sample size will simply result in a large worthless study. Think about bias before running a study, because once all the data come in, there is no way to recover if the sample was biased. Sometimes pilot testing with a small sample will show bias that can be corrected before a larger sample is obtained. Although each of the following sources of bias is defined separately, there is overlap, and many if not most examples of bias involve more than one of the following.
HOUSEHOLD BIAS: When a sample includes only one member of any given household, members of large households are underrepresented. To respond to this, pollsters sometimes give greater weight to members of larger households.
NONRESPONSE BIAS: A good example is that of most mailed questionnaires, as they tend to have very low response percentages, and it is often unclear which part of the population is responding. Sometimes people chosen for a survey simply refuse to respond or are unreachable or too difficult to contact. Answering machines and caller ID prevent easy contacts. To maximize response rates, one can use multiple follow-up contacts and cash or other incentives. Also, short, easily understood surveys generally have higher response rates.
QUOTA SAMPLING BIAS: This results when interviewers are given free choice in picking people, for example, to obtain a particular percentage of men, of Catholics, or of African-Americans. This flawed technique produced misleading polls that led the Chicago Tribune to make an early, incorrect call of Thomas E. Dewey as the winner over Harry S. Truman in the 1948 presidential election.
RESPONSE BIAS: The very question itself can lead to misleading results. People often don’t want to be perceived as having unpopular or unsavory views and so may respond untruthfully when face to face with an interviewer or when filling out a questionnaire that is not anonymous. Patients may lie about following doctors’ orders, dieters may be dishonest about how strictly they’ve followed a weight loss program, students may shade the truth about how many hours they’ve studied for exams, and viewers may not want to admit that they watch certain television programs.
SELECTION BIAS: An often-cited example is the Literary Digest opinion poll that predicted a landslide victory for Alfred Landon over Franklin D. Roosevelt in the 1936 presidential election. The Digest surveyed people with cars and telephones, but in 1936 only the wealthy minority, who mainly voted Republican, had cars and telephones. In spite of obtaining more than two million responses, the Digest picked a landslide for the wrong man!
TIP
Think about potential bias before collecting data.
SIZE BIAS: Throwing darts at a map to decide in which states to sample would bias in favor of geographically large states. Interviewing people checking out of the hospital would bias in favor of patients with short stays, since due to costs, more people today have shorter stays. Having each student pick one coin out of a bag of 1000 coins to help estimate the total monetary value of the coins in the bag would bias in favor of large coins, for example, quarters over dimes.
UNDERCOVERAGE BIAS: This happens when there is inadequate representation. For example, telephone surveys simply ignore all those possible subjects who don’t have telephones. In the 2008 presidential election surveys, phone surveys went only to land line phones, leaving out many young adults who have only cell phones. Another example is convenience samples, like interviews at shopping malls, which are based on choosing individuals who are easy to reach. These interviews tend to produce data highly unrepresentative of the entire population. Door-to-door household surveys typically miss college students and prison inmates, as well as the homeless.
VOLUNTARY RESPONSE BIAS: Samples based on individuals who offer to participate typically give too much emphasis to people with strong opinions. For example, radio call-in programs about controversial topics such as gun control, abortion, and school segregation do not produce meaningful data on what proportion of the population favor or oppose related issues. Online surveys posted to websites are a modern source of voluntary response bias.
WORDING BIAS: Nonneutral or poorly worded questions may lead to answers that are very unrepresentative of the population. To avoid such bias, do not use leading questions, and write questions that are clear and relatively short. Also be careful of sequences of questions that lead respondents toward certain answers.
Note: Again, it should be understood that there is considerable overlap among the above classifications. For example, a nonneutral question may be said to have both response bias and wording bias. Selection bias and undercoverage bias often go hand in hand. Voluntary response bias and nonresponse bias are clearly related.
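The weighting idea mentioned under household bias can be illustrated with a short simulation. This Python sketch uses an entirely hypothetical town (5000 one-person households and 5000 four-person households, with different rates of answering "yes") and a simple weight equal to household size; real pollsters may weight differently.

```python
import random

rng = random.Random(7)  # fixed seed only so that the run is repeatable

# Hypothetical town: each household is a list of 0/1 answers, one per member.
households = []
for _ in range(5000):
    households.append([1 if rng.random() < 0.2 else 0])                    # lives alone
    households.append([1 if rng.random() < 0.8 else 0 for _ in range(4)])  # household of four

people = [answer for house in households for answer in house]
true_share = sum(people) / len(people)

# Interview one randomly chosen member of each household.
interviews = [(len(house), rng.choice(house)) for house in households]

# Unweighted: every respondent counts once, so large households are underrepresented.
unweighted = sum(answer for _, answer in interviews) / len(interviews)

# Weighted: each respondent stands in for everyone in his or her household.
weighted = (sum(size * answer for size, answer in interviews)
            / sum(size for size, _ in interviews))

print(f"true share {true_share:.3f}, unweighted {unweighted:.3f}, weighted {weighted:.3f}")
```

In this setup the unweighted estimate lands well below the true share, while the weighted estimate comes close to it.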
Time- and cost-saving modifications are often used to implement sampling procedures other than simple random samples.
Systematic sampling involves listing the population in some order (for example, alphabetically), choosing a random point to start, and then picking every tenth (or hundredth, or thousandth, or kth) person from the list. This gives a reasonable sample as long as the original order of the list is not in any way related to the variables under consideration.
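As a rough sketch of systematic sampling in Python, assuming a hypothetical ordered roster of 5000 students and k = 50 (the function name and roster labels are illustrative):

```python
import random

def systematic_sample(population, k, rng):
    """Pick a random start among the first k positions, then take every kth member."""
    start = rng.randrange(k)
    return population[start::k]

rng = random.Random(3)  # fixed seed only so that the run is repeatable
roster = [f"student_{i:04d}" for i in range(1, 5001)]  # hypothetical ordered list
sample = systematic_sample(roster, k=50, rng=rng)
print(len(sample))  # 100
```

Only the starting point is random; once it is chosen, the rest of the sample is completely determined.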
TIP
Know the difference between strata and clusters.
In stratified sampling the population is divided into homogeneous groups called strata, and random samples of persons from all strata are chosen. For example, we can stratify by age or gender or income level or race and pick a sample of people from each stratum. Note that all individuals in a given stratum have a characteristic in common. We could further do proportional sampling, where the sizes of the random samples from each stratum depend on the proportion of the total population represented by the stratum.
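Proportional stratified sampling might be sketched as follows in Python. The class sizes below are hypothetical, and the simple rounding rule is an assumption; with less convenient stratum sizes the rounded counts may not add up exactly to the intended total.

```python
import random

def proportional_stratified_sample(strata, total_size, rng):
    """Draw a random sample from each stratum, sized in proportion to the stratum."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, n)
    return sample

rng = random.Random(4)  # fixed seed only so that the run is repeatable
class_sizes = {"freshmen": 1500, "sophomores": 1300, "juniors": 1200, "seniors": 1000}
strata = {name: [f"{name}_{i}" for i in range(size)] for name, size in class_sizes.items()}

sample = proportional_stratified_sample(strata, total_size=100, rng=rng)
print({name: len(chosen) for name, chosen in sample.items()})
# {'freshmen': 30, 'sophomores': 26, 'juniors': 24, 'seniors': 20}
```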
In cluster sampling the population is divided into heterogeneous groups called clusters, and we then take a random sample of clusters from among all the clusters. For example, to survey high school seniors we could randomly pick several senior class homerooms in which to conduct our study. Note that each cluster should resemble the entire population.
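For contrast with stratified sampling, here is a minimal Python sketch of cluster sampling. It assumes a hypothetical school with 20 senior homerooms of 25 students each and surveys everyone in 4 randomly chosen homerooms.

```python
import random

rng = random.Random(5)  # fixed seed only so that the run is repeatable

# Hypothetical school: 20 senior homerooms (clusters) of 25 students each.
homerooms = {
    f"room_{r:02d}": [f"room_{r:02d}_student_{s:02d}" for s in range(1, 26)]
    for r in range(1, 21)
}

chosen_rooms = rng.sample(sorted(homerooms), 4)  # random sample of clusters
sample = [student for room in chosen_rooms for student in homerooms[room]]
print(chosen_rooms, len(sample))
```

Chance is applied to the clusters, not to individual students: a homeroom is either entirely in the sample or entirely out of it.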
Multistage sampling refers to a procedure involving two or more steps, each of which could involve any of the various sampling techniques. The Gallup organization often follows a procedure in which nationwide locations are randomly selected, then neighborhoods are randomly selected in each of these locations, and finally households are randomly selected in each of these neighborhoods.
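The general three-stage pattern described here can be sketched in Python. This is an illustration only, not Gallup's actual procedure; the sampling frame (10 locations, 8 neighborhoods each, 30 households each) and all names are hypothetical.

```python
import random

def multistage_sample(locations, n_locations, n_neighborhoods, n_households, rng):
    """Randomly select locations, then neighborhoods within them, then households."""
    selected = []
    for location in rng.sample(sorted(locations), n_locations):
        neighborhoods = locations[location]
        for neighborhood in rng.sample(sorted(neighborhoods), n_neighborhoods):
            households = neighborhoods[neighborhood]
            selected.extend(rng.sample(households, n_households))
    return selected

rng = random.Random(11)  # fixed seed only so that the run is repeatable
# Hypothetical frame: 10 locations x 8 neighborhoods x 30 households.
locations = {
    f"loc{a}": {
        f"loc{a}-nbhd{b}": [f"loc{a}-nbhd{b}-house{c}" for c in range(30)]
        for b in range(8)
    }
    for a in range(10)
}
sample = multistage_sample(locations, 3, 2, 5, rng)
print(len(sample))  # 30
```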
EXAMPLE 7.4
Suppose a sample of 100 high school students from a school of size 5000 is to be chosen to determine their views on the death penalty. One method would be to have each student write his or her name on a slip of paper, put the papers in a box, and have the principal reach in and pull out 100 of the papers. However, questions could arise regarding how well the papers are mixed up in the box. For example, how might the outcome be affected if all students in one homeroom toss in their names at the same time so that their papers are clumped together? Another method would be to assign each student a number from 0001 to 5000 and then use a random number table, picking out four digits at a time and tossing out repeats, 0000, and numbers over 5000 (simple random sampling). What are alternative procedures?
Answer: From a list of the students, the surveyor could simply note every fiftieth name (systematic sampling). Since students in each class have certain characteristics in common, the surveyor could use a random selection method to pick 25 students from each of the separate lists of freshmen, sophomores, juniors, and seniors (stratified sampling). The researcher could separate the homerooms by classes; then randomly pick five freshmen homerooms, five sophomore homerooms, five junior homerooms, and five senior homerooms (cluster sampling); and then randomly pick five students from each of the homerooms (multistage sampling). The surveyor could separately pick random samples of males and females (stratified sampling), the size of each of the two samples chosen according to the proportion of male and female students attending the school (proportional sampling).
It should be noted that none of the alternative procedures in the above example results in a simple random sample, because not every possible sample of size 100 has an equal chance of being selected.
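A tiny enumeration makes this point concrete. The Python sketch below uses a hypothetical population of four students (two females, two males) and compares the samples of size 2 that are possible under simple random sampling with those possible when one student is drawn from each gender stratum.

```python
from itertools import combinations, product

population = ["F1", "F2", "M1", "M2"]           # hypothetical: two females, two males
srs_samples = set(combinations(population, 2))  # every pair is possible under an SRS

strata = [["F1", "F2"], ["M1", "M2"]]
stratified_samples = set(product(*strata))      # one student from each stratum

impossible = srs_samples - stratified_samples
print(len(srs_samples), len(stratified_samples))  # 6 4
print(sorted(impossible))                          # [('F1', 'F2'), ('M1', 'M2')]
```

Two of the six possible pairs can never occur under the stratified design, so it cannot be a simple random sample even though each individual student has the same chance of being chosen.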
SUMMARY
A simple random sample (SRS) is one in which every possible sample of the desired size has an equal chance of being selected.
Sampling error is not an error, but rather refers to the natural variability between samples.
Bias is the tendency to favor the selection of certain members of a population.
Nonresponse bias occurs when a large fraction of those sampled do not respond (most mailed questionnaires are good examples).
Response bias happens when the question itself leads to misleading results (for example, people don’t want to be perceived as having unpopular, unsavory, or illegal views).
Undercoverage bias occurs when part of the population is ignored (for example, telephone surveys miss all those without phones).
Voluntary response bias occurs when individuals choose whether to respond (for example, radio call-in surveys).
Systematic sampling involves listing the population, choosing a random point to start, and then picking every nth person for some n.
Stratified sampling involves dividing the population into homogeneous groups called strata and then picking random samples from each of the strata.
Cluster sampling involves dividing the population into heterogeneous groups called clusters, and then picking everyone in a random sample of the clusters.
Multistage sampling refers to procedures involving two or more steps, each of which could involve any of the sampling techniques.
Multiple-Choice Questions
Directions: The questions or incomplete statements that follow are each followed by five suggested answers or completions. Choose the response that best answers the question or completes the statement.
1. Ann Landers, who wrote a daily advice column appearing in newspapers across the country, once asked her readers, “If you had it to do over again, would you have children?” Of the more than 10,000 readers who responded, 70% said no. What does this show?
(A) The survey is meaningless because of voluntary response bias.
(B) No meaningful conclusion is possible without knowing something more about the characteristics of her readers.
(C) The survey would have been more meaningful if she had picked a random sample of the 10,000 readers who responded.
(D) The survey would have been more meaningful if she had used a control group.
(E) This was a legitimate sample, randomly drawn from her readers and of sufficient size to allow the conclusion that most of her readers who are parents would have second thoughts about having children.
2. Which of the following is a true statement?
(A) If bias is present in a sampling procedure, it can be overcome by dramatically increasing the sample size.
(B) There is no such thing as a “bad sample.”
(C) Sampling techniques that use probability techniques effectively eliminate bias.
(D) Convenience samples often lead to undercoverage bias.
(E) Voluntary response samples often underrepresent people with strong opinions.
3. Two possible wordings for a questionnaire on gun control are as follows:
I. The United States has the highest rate of murder by handguns among all countries. Most of these murders are known to be crimes of passion or crimes provoked by anger between acquaintances. Are you in favor of a 7-day cooling-off period between the filing of an application to purchase a handgun and the resulting sale?
II. The United States has one of the highest violent crime rates among all countries. Many people want to keep handguns in their homes for self-protection. Fortunately, U.S. citizens are guaranteed the right to bear arms by the Constitution. Are you in favor of a 7-day waiting period between the filing of an application to purchase a needed handgun and the resulting sale?
One of these questions showed that 25% of the population favored a 7-day waiting period between application for purchase of a handgun and the resulting sale, while the other question showed that 70% of the population favored the waiting period. Which produced which result and why?
(A) The first question probably showed 70% and the second question 25% because of the lack of randomization in the choice of pro-gun and anti-gun subjects as evidenced by the wording of the questions.
(B) The first question probably showed 25% and the second question 70% because of a placebo effect due to the wording of the questions.
(C) The first question probably showed 70% and the second question 25% because of the lack of a control group.
(D) The first question probably showed 25% and the second question 70% because of response bias due to the wording of the questions.
(E) The first question probably showed 70% and the second question 25% because of response bias due to the wording of the questions.
4. Each of the 29 NBA teams has 12 players. A sample of 58 players is to be chosen as follows. Each team will be asked to place 12 cards with its players’ names into a hat and randomly draw out two names. The two names from each team will be combined to make up the sample. Will this method result in a simple random sample of the 348 basketball players?
(A) Yes, because each player has the same chance of being selected.
(B) Yes, because each team is equally represented.
(C) Yes, because this is an example of stratified sampling, which is a special case of simple random sampling.
(D) No, because the teams are not chosen randomly.
(E) No, because not each group of 58 players has the same chance of being selected.
5. To survey the opinions of bleacher fans at Wrigley Field, a surveyor plans to select every one-hundredth fan entering the bleachers one afternoon. Will this result in a simple random sample of Cub fans who sit in the bleachers?
(A) Yes, because each bleacher fan has the same chance of being selected.
(B) Yes, but only if there is a single entrance to the bleachers.
(C) Yes, because the 99 out of 100 bleacher fans who are not selected will form a control group.
(D) Yes, because this is an example of systematic sampling, which is a special case of simple random sampling.
(E) No, because not every sample of the intended size has an equal chance of being selected.
6. Which of the following is a true statement about sampling error?
(A) Sampling error can be eliminated only if a survey is both extremely well designed and extremely well conducted.
(B) Sampling error concerns natural variation between samples, is always present, and can be described using probability.
(C) Sampling error is generally larger when the sample size is larger.
(D) Sampling error implies an error, possibly very small, but still an error, on the part of the surveyor.
(E) Sampling error is higher when bias is present.
7. What fault do all these sampling designs have in common?
I. The Wall Street Journal plans to make a prediction for a presidential election based on a survey of its readers.
II. A radio talk show asks people to phone in their views on whether the United States should pay off its huge debt to the United Nations.
III. A police detective, interested in determining the extent of drug use by teenagers, randomly picks a sample of high school students and interviews each one about any illegal drug use by the student during the past year.
(A) All the designs make improper use of stratification.
(B) All the designs have errors that can lead to strong bias.
(C) All the designs confuse association with cause and effect.
(D) None of the designs satisfactorily controls for sampling error.
(E) None of the designs makes use of chance in selecting a sample.
8. A state auditor is given an assignment to choose and audit 26 companies. She lists all companies whose name begins with A, assigns each a number, and uses a random number table to pick one of these numbers and thus one company. She proceeds to use the same procedure for each letter of the alphabet and then combines the 26 results into a group for auditing. Which of the following is a true statement?
(A) Each company has an equal probability of being audited.
(B) Each set of 26 companies has an equal chance of being selected.
(C) Her procedure results in a simple random sample.
(D) Her procedure doesn’t truly make use of chance.
(E) She could have used a calculator random number generator in place of using a random number table to achieve similar results.
9. A researcher planning a survey of heads of households in a particular state has census lists for each of the 23 counties in that state. The procedure will be to obtain a random sample of heads of households from each of the counties rather than grouping all the census lists together and obtaining a sample from the entire group. Which of the following is an incorrect statement about the resulting stratified sample?
(A) It is not a simple random sample.
(B) It is easier and less costly to obtain than a simple random sample.
(C) It gives comparative information that a simple random sample wouldn’t give.
(D) A cluster sample would have been more appropriate.
(E) Differences in county sizes could be taken into account by making the size of the random sample from each county depend on the proportion of the total population represented by the county.
10. To find out the average occupancy size of student-rented apartments, a researcher picks a simple random sample of 100 such apartments. Even after one follow-up visit, the interviewer is unable to make contact with anyone in 27 of these apartments. Concerned about nonresponse bias, the researcher chooses another simple random sample and instructs the interviewer to continue this procedure until contact is made with someone in a total of 100 apartments. The average occupancy size in the final 100-apartment sample is 2.78. Is this estimate probably too low or too high?
(A) Too low, because of undercoverage bias.
(B) Too low, because convenience samples overestimate average results.
(C) Too high, because of undercoverage bias.
(D) Too high, because convenience samples overestimate average results.
(E) Too high, because voluntary response samples overestimate average results.
11. To conduct a survey of long-distance calling patterns, a researcher opens a telephone book to a random page, closes his eyes, puts his finger down on the page, and then reads off the next 50 names. Which of the following is incorrect?
(A) The survey design incorporates chance.
(B) Assuming the page and starting point on the page are randomly selected, each person in the phone book has an equal chance of being selected.
(C) The procedure could easily result in selection bias.
(D) The procedure does not result in a simple random sample.
(E) This is the typical methodology of a systematic sample.
12. Consider the following three events:
I. Although 18% of the student body are minorities, in a random sample of 20 students, 5 are minorities.
II. In a survey about sexual habits, an embarrassed student deliberately gives the wrong answers.
III. A surveyor mistakenly records answers to one question in the wrong space.
Which of the following correctly characterizes the above?
(A) I, sampling error; II, response bias; III, human mistake
(B) I, sampling error; II, nonresponse bias; III, hidden error
(C) I, hidden bias; II, voluntary sample bias; III, sampling error
(D) I, undercoverage error; II, voluntary error; III, unintentional error
(E) I, small sample error; II, deliberate error; III, mistaken error
13. A researcher plans a study to examine the depth of belief in God among the adult population. He obtains a simple random sample of 100 adults as they leave church one Sunday morning. All but one of them agree to participate in the survey. Which of the following is a true statement?
(A) Proper use of chance as evidenced by the simple random sample makes this a well-designed survey.
(B) The high response rate makes this a well-designed survey.
(C) Selection bias makes this a poorly designed survey.
(D) The validity of this survey depends on whether or not the adults attending this church are representative of all churches.
(E) The validity of this survey depends upon whether or not similar numbers of those surveyed are male and female.
Free-Response Questions
Directions: You must show all work and indicate the methods you use. You will be graded on the correctness of your methods and on the accuracy of your final answers.
SEVEN OPEN-ENDED QUESTIONS
1. Cell phones emit a form of electromagnetic radiation, and there is concern about how this affects the human body. A World Health Organization (WHO) study of 12,000 people found no connection between moderate cell phone use and brain cancer, although the report does mention a higher incidence of brain cancer for heavy users (defined as those who used their phone for at least half an hour a day). A study of 420,095 persons in Denmark found no correlation between length of cell phone subscriptions (in years) and brain tumor incidence.
(a) Were these studies observational studies or experiments or one of each? Explain.
(b) Does the WHO study definition of “heavy users” seem reasonable? Explain.
(c) Neither study tries to distinguish between voice and text messaging use of the cell phone. Should this have any effect on conclusions? Explain.
(d) What is a weakness in the Denmark study that the WHO study does take into account?
2. A questionnaire is being designed to determine whether most people are or are not in favor of legislation protecting the habitat of the spotted owl. Give two examples of poorly worded questions, one biased toward each response.
3. To obtain a sample of 25 students from among the 500 students present in school one day, a surveyor decides to pick every twentieth student waiting in line to attend a required assembly in the gym.
(a) Explain why this procedure will not result in a simple random sample of the students present that day.
(b) Describe a procedure that will result in a simple random sample of the students present that day.
4. A hot topic in government these days is welfare reform. Suppose a congresswoman wishes to survey her constituents concerning their opinions on whether the federal government should turn welfare over to the states. Discuss possible sources of bias with regard to the following four options: (1) conducting a survey via random telephone dialing into her district, (2) sending out a mailing using a registered voter list, (3) having a pollster interview everyone who walks past her downtown office, and (4) broadcasting a radio appeal urging interested citizens in her district to call in their opinions to her office.
5. You and nine friends go to a restaurant and check your coats. You all forget to pick up the ticket stubs, and so when you are ready to leave, Hilda, the hatcheck girl, randomly gives each of you one of the ten coats. You are surprised that one person actually receives the correct coat. You would like to explore this further and decide to use a random number table to simulate the situation. Describe how the random number table can be used to simulate one trial of the coat episode. Explain what each of the digits 0 through 9 will represent.
6. [Figure: sketch of five houses]
You are supposed to interview the residents of two of the above five houses.
(a) How would you choose which houses to interview?
(b) You plan to visit the homes at 9 a.m. If someone isn’t home, explain the reasons for and against substituting another house.
(c) Are there any differences you might expect to find among the residents based on the above sketch?
7. A cable company plans to survey potential customers in a small city currently served by satellite dishes. Two sampling methods are being considered. Method A is to randomly select a sample of 25 city blocks and survey every family living on those blocks. Method B is to randomly select a sample of families from each of the five natural neighborhoods making up the city.
(a) What is the statistical name for the sampling technique used in Method A, and what is a possible reason for using it rather than an SRS?
(b) What is the statistical name for the sampling technique used in Method B, and what is a possible reason for using it rather than an SRS?
AN INVESTIGATIVE TASK
A basic problem in ecology is to estimate the number of animals in a wildlife population. Suppose a wildlife management team captures and tags 27 deer in a state forest. The deer are released and given time to mingle with the deer population. One month later a second capture is made of 20 deer, among which 3 are noted to be tagged from the original capture.
(a) Assuming that the number of tagged individuals within the second sample is proportional to the number of tagged individuals in the whole population, estimate the total number of deer in the forest.
(b) A formula for the variance of this estimate is $\mathrm{Var}(N)=\frac{(a+1)(c+1)(a-b)(c-b)}{(b+1)^2(b+2)}$, where N = estimate of total population size, a = number of animals originally captured and tagged, b = number of tagged animals who are recaptured, and c = number of animals in second sample. What is the standard deviation for the population size estimate above?
(c) Given the answers to (a) and (b), would it be unexpected for the true deer population to be 225? Explain and show your work.
(d) What would a more accurate population estimate be, if it took into account that 6 of the tagged deer were shot by hunters before they could rejoin with the herd?
MULTIPLE-CHOICE
1. (A) This survey provides a good example of voluntary response bias, which often overrepresents negative opinions. The people who chose to respond were most likely parents who were very unhappy, and so there is very little chance that the 10,000 respondents were representative of the population. Knowing more about her readers, or taking a sample of the sample, would not have helped.
2. (D) If there is bias, taking a larger sample just magnifies the bias on a larger scale. If there is enough bias, the sample can be worthless. Even when the subjects are chosen randomly, there can be bias due, for example, to nonresponse or to the wording of the questions. Convenience samples, like shopping mall surveys, are based on choosing individuals who are easy to reach, and they typically miss a large segment of the population. Voluntary response samples, like radio call-in surveys, are based on individuals who offer to participate, and they typically overrepresent persons with strong opinions.
3. (E) The wording of the questions can lead to response bias. The neutral way of asking this question would simply have been: Are you in favor of a 7-day waiting period between the filing of an application to purchase a handgun and the resulting sale?
4. (E) In a simple random sample, every possible group of the given size has to be equally likely to be selected, and this is not true here. For example, with this procedure it will be impossible for all the Bulls to be together in the final sample. This procedure is an example of stratified sampling, but stratified sampling does not result in simple random samples.
5. (E) In a simple random sample, every possible group of the given size has to be equally likely to be selected, and this is not true here. For example, with this procedure it will be impossible for all the early arrivals to be together in the final sample. This procedure is an example of systematic sampling, but systematic sampling does not result in simple random samples.
6. (B) Different samples give different sample statistics, all of which are estimates of a population parameter. Sampling error relates to natural variation between samples, can never be eliminated, can be described using probability, and is generally smaller if the sample size is larger.
7. (B) The Wall Street Journal survey has strong selection bias; that is, people who read the Journal are not very representative of the general population. The talk show survey results in a voluntary response sample, which typically gives too much emphasis to persons with strong opinions. The police detective’s survey has strong response bias in that students may not give truthful responses to a police detective about their illegal drug use.
8. (E) While the auditor does use chance, each company will have the same chance of being audited only if the same number of companies have names starting with each letter of the alphabet. This will not result in a simple random sample because not every possible set of 26 companies has the same chance of being picked as the sample. For example, a group of companies whose names all start with A can never be chosen. Calculator random number generators and random number tables have similar uses and results.
9. (D) This is not a simple random sample because not all possible sets of the required size have the same chance of being picked. For example, a set of households all from just half the counties has no chance of being picked to be the sample. Stratified samples are often easier and less costly to obtain and also make comparative data available. In this case responses can be compared among various counties. There is no reason to assume that each county has heads of households with the same characteristics and opinions as the state as a whole, so cluster sampling is not appropriate. When conducting stratified sampling, proportional sampling is used when one wants to take into account the different sizes of the strata.
10. (C) It is most likely that the apartments at which the interviewer had difficulty finding someone home were apartments with fewer students living in them. Replacing these with other randomly picked apartments most likely replaces smaller-occupancy apartments with larger-occupancy ones.
11. (E) While the procedure does use some element of chance, not all possible groups of size 50 have the same chance of being picked, and so the result is not a simple random sample. There is a very real chance of selection bias. For example, a number of relatives with the same name and similar long-distance calling patterns might be selected. The typical methodology of a systematic sample involves picking every nth member from the list, where n is roughly the population size divided by the desired sample size.
12. (A) The natural variation in samples is called sampling error. Embarrassing questions and resulting untruthful answers are an example of response bias. Inaccuracies and mistakes due to human error are one of the real concerns of researchers.
13. (C) Surveying people coming out of any church results in a very unrepresentative sample of the adult population, especially given the question under consideration. Using chance and obtaining a high response rate will not change the selection bias and make this into a well-designed survey.
Free-Response
1. (a) Both studies were observational because no treatments were applied.
(b) Typical cell phone use today, especially among younger people, is well over half an hour, so half an hour does not seem to be a reasonable split between moderate and heavy use.
(c) This absolutely affects conclusions in that both studies look for relationships with brain cancer. While voice conversation involves holding the phone against one’s head, text messaging does not.
(d) The Denmark study looks at how many years individuals used their cell phones, but not at the extent of daily use, while the WHO study does consider daily usage.
2. There are many possible examples, such as “Are you in favor of protecting the habitat of the spotted owl, which is almost extinct and desperately in need of help from an environmentally conscious government?” and “Are you in favor of protecting the habitat of the spotted owl no matter how much unemployment and resulting poverty this causes among hard-working loggers?”
3. (a) To be a simple random sample, every possible group of size 25 has to be equally likely to be selected, and this is not true here. For example, if there are 40 students who always rush to be first in line, this procedure will allow for only 2 of them to be in the sample. Or if each homeroom of size 20 arrives as a unit, this procedure will allow for only 1 person from each homeroom to be in the sample.
(b) A simple random sample of the students can be obtained by numbering them from 001 to 500 and then picking three digits at a time from a random number table, ignoring numbers over 500 and ignoring repeats, until a group of 25 numbers is obtained. The students corresponding to these 25 numbers will be a simple random sample.
4. The direct telephone and mailing options will both suffer from undercoverage bias. For example, especially affected by the legislation under discussion are the homeless, and they do not have telephones or mailing addresses. The pollster interviews will result in a convenience sample, which can be highly unrepresentative of the population. In this case, there might be a real question concerning which members of her constituency spend any time in the downtown area where her office is located. The radio appeal will lead to a voluntary response sample, which typically gives too much emphasis to persons with strong opinions.
5. Number the people 1 through 10, and let each digit stand for whose coat someone receives, with 0 standing for person 10. Pick the digits, omitting repeats, until a group of ten different digits is obtained. Check for a match (1 appearing in the first position corresponding to person 1, or 2 appearing in the next position corresponding to person 2, and so on, up to 0 appearing in the last position corresponding to person 10).
6. (a) To obtain an SRS, you might use a random number table and note the first two different numbers between 1 and 5 that appear. Or you could use a calculator to generate numbers between 1 and 5, again noting the first two different numbers that result.
(b) Time and cost considerations would be the benefit of substitution. However, substitution rather than returning to the same home later could lead to selection bias because certain types of people are not and will not be home at 9 a.m. With substitution the sample would no longer be a simple random sample.
(c) Corner lot homes like homes 1 and 5 might have different residents (perhaps with higher income levels) than other homes.
7. (a) Method A is an example of cluster sampling, where the population is divided into heterogeneous groups called clusters and individuals from a random sample of the clusters are surveyed. It is often more practical to simply survey individuals from a random sample of clusters (in this case, a random sample of city blocks) than to try to randomly sample a whole population (in this case the entire city population).
(b) Method B is an example of stratified sampling, where the population is divided into homogeneous groups called strata and random individuals from each stratum are chosen. Stratified samples can often give useful information about each stratum (in this case, about each of the five neighborhoods) in addition to information about the whole population (the city population).
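The contrast between Methods A and B in answer 7 can be sketched in code. This is only an illustration under stated assumptions: the city layout (40 blocks of 10 residents, five neighborhoods of 80 residents), the sample sizes, and the function names cluster_sample and stratified_sample are all made up, and the sketch assumes that everyone in a chosen cluster is surveyed.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Method A style: randomly pick whole clusters, then survey everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

def stratified_sample(strata, n_per_stratum, rng):
    """Method B style: take a separate random sample from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(1)  # fixed seed only so the sketch is repeatable

# Hypothetical city: 40 blocks of 10 residents, 5 neighborhoods of 80 residents.
blocks = {f"block{b:02d}": [f"block{b:02d}-resident{r}" for r in range(10)]
          for b in range(40)}
neighborhoods = {f"neighborhood{k}": [f"n{k}-resident{r}" for r in range(80)]
                 for k in range(1, 6)}

cluster_result = cluster_sample(blocks, 4, rng)             # 4 whole blocks
stratified_result = stratified_sample(neighborhoods, 8, rng)  # 8 per neighborhood
```

Note that the cluster sample leaves most blocks out entirely, while the stratified sample is guaranteed to include residents from every neighborhood.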
AN INVESTIGATIVE TASK
(a) 27/N = 3/20 gives N = 180.
(b)
(c) No, this would not have been unexpected because 45, the absolute difference between 180 and 225, is less than the standard deviation of 54.76.
(d) 21/N = 3/20 gives N = 140.
EXPERIMENTS VERSUS OBSERVATIONAL STUDIES
CONFOUNDING, CONTROL GROUPS, PLACEBO EFFECTS, BLINDING
TREATMENTS, EXPERIMENTAL UNITS, RANDOMIZATION
REPLICATION, BLOCKING, GENERALIZABILITY OF RESULTS
There are several primary principles dealing with the proper planning and conducting of experiments. First, possible confounding variables must be controlled for, usually through the use of comparison. Second, chance should be used in assigning which subjects are to be placed in which groups for which treatment. Third, natural variation in outcomes can be lessened by using more subjects.
EXPERIMENTS VERSUS OBSERVATIONAL STUDIES VERSUS SURVEYS
In an experiment we impose some change or treatment and measure the result or response. In an observational study we simply observe and measure something that has taken place or is taking place, while trying not to cause any changes by our presence. A sample survey is an observational study in which we draw conclusions about an entire population by considering an appropriately chosen sample to look at. An experiment often suggests a causal relationship, while an observational study may show only the existence of associations.
EXAMPLE 8.1
A study is to be designed to determine whether daily calcium supplements benefit women by increasing bone mass. How can an observational study be performed? An experiment? Which is more appropriate here?
Answer: An observational study might interview and run tests on women seen purchasing calcium supplements in a pharmacy. Or perhaps all patients hospitalized during a particular time period could be interviewed with regard to taking calcium and then their bone mass measured. The bone mass measurements of those taking calcium supplements could then be compared with those of women not taking supplements.
An experiment could be performed by selecting some number of subjects, using chance to pick half to receive calcium supplements while the other half receives similar-looking placebos, and noting the difference in bone mass before and after treatment for each group.
The experimental approach is more appropriate here. With the observational study there could be many explanations for any bone mass difference noted between patients who take calcium and those who don’t. For example, women who have voluntarily been taking calcium supplements might be precisely those who take better care of themselves in general and thus have higher bone mass for other reasons. The experiment tries to control for lurking variables by randomly giving half the subjects calcium.
EXAMPLE 8.2
A study is to be designed to examine the life expectancies of tall people versus those of short people. Which is more appropriate, an observational study or an experiment?
Answer: An observational study, examining medical records of heights and ages at time of death, seems straightforward. An experiment where subjects are randomly chosen to be made short or tall, followed by recording age at death, would be groundbreaking (and, of course, nonsensical).
EXAMPLE 8.3
A study is to be designed to examine the GPAs of students who use marijuana regularly and those who don’t. Which is more appropriate, an observational study or an experiment?
Answer: As much as some researchers might want to randomly require half the subjects to take an illegal drug, this would be unethical. The proper procedure here is an observational study, having students anonymously fill out questionnaires asking about marijuana usage and GPA.
Experiments involve explanatory variables, called factors, that are believed to have an effect on response variables. A group is treated with some level of the explanatory variable, and the outcome on the response variable is measured.
EXAMPLE 8.4
To test the value of help sessions outside the classroom, students could be divided into three groups, with one group receiving 4 hours of help sessions per week outside the classroom, a second group receiving 2 hours of help sessions outside the classroom, and a third group receiving no help outside the classroom. What are the explanatory and response variables and what are the levels?
Answer: The explanatory variable, help sessions outside the classroom, is being given at three levels: 4 hours weekly, 2 hours weekly, and 0 hours weekly. The response variable is not specified but might be a final exam score or performance on a particular test.
The different factor-level combinations are called treatments. In Example 8.4, there are three treatments (corresponding to the three levels of the one factor). Suppose the students were further randomly divided into a morning class and an afternoon class. There would then be two factors, one with three levels and one with two levels, and a total of six treatments (AM class with 4 hours help, AM class with 2 hours help, AM class with 0 hours help, PM class with 4 hours help, PM class with 2 hours help, and PM class with 0 hours help).
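The counting of treatments above can be checked with a short sketch. It simply forms every combination of one level from each factor; the variable names are chosen here for illustration.

```python
from itertools import product

help_hours = [4, 2, 0]      # levels of the first factor (hours of help per week)
class_time = ["AM", "PM"]   # levels of the second factor

# Each treatment is one level of every factor, so the number of treatments
# is the product of the numbers of levels: 2 x 3 = 6.
treatments = [f"{time} class with {hours} hours help"
              for time, hours in product(class_time, help_hours)]

for t in treatments:
    print(t)
```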
CONFOUNDING, CONTROL GROUPS, PLACEBO EFFECTS, AND BLINDING
When there is uncertainty with regard to which variable is causing an effect, we say the variables are confounded. For example, suppose two fertilizers require different amounts of watering. In an experiment it might be difficult to determine if the difference in fertilizers or the difference in watering is the real cause of observed differences in plant growth. Sometimes we can control for confounding. For example, we can have many test plots using one or the other of the fertilizers, with equal numbers of sunny and shady plots for each fertilizer, so that fertilizer and sun are not confounded.
Sometimes a variable drives two other variables, creating the mistaken impression that the two other variables are related by cause and effect. For example, elementary school students with larger shoe sizes appear to have higher reading levels. However, there is a variable, age, which drives both the other variables. That is, older students tend to wear larger shoes than younger students, and older students also tend to have higher reading levels. Wearing larger shoes will not improve reading skills! There is a common response; that is, changes in both shoe size and reading level are caused by changes in age.
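A small simulation can make the idea of a common response concrete. The model below is entirely made up for illustration: age is drawn at random from 6 to 11, and shoe size and reading level are each set equal to age plus independent random noise, so shoe size has no effect on reading level by construction.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Made-up model: age drives both variables; shoe size never affects reading.
students = []
for _ in range(3000):
    age = rng.randint(6, 11)
    shoe_size = age + rng.gauss(0, 1)       # arbitrary units
    reading_level = age + rng.gauss(0, 1)   # arbitrary units
    students.append((age, shoe_size, reading_level))

overall = correlation([s[1] for s in students], [s[2] for s in students])
eight_year_olds = [s for s in students if s[0] == 8]
within_age = correlation([s[1] for s in eight_year_olds],
                         [s[2] for s in eight_year_olds])

print(f"all ages together: r = {overall:.2f}")
print(f"8-year-olds only:  r = {within_age:.2f}")
```

Under these assumptions the correlation is strong when all ages are pooled but close to zero among students of the same age, which is what one expects when age alone drives both variables.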
In an experiment there is a group that receives the treatment, and there is a control group that doesn’t. The experiment compares the responses in the treatment group to the responses in the control group. Randomly putting subjects into treatment and control groups can help reduce the problems posed by confounding and lurking variables. Thus these problems are easier to control for when doing experiments than when doing observational studies.
It is a fact that many people respond to any kind of perceived treatment. This is called the placebo effect. For example, when given a sugar pill after surgery but told that it is a strong pain reliever, many patients feel immediate relief from their pain. In many studies, subjects appear to consciously or subconsciously want to help the researcher prove a point. Thus when responses are noticed in any experiment, there is concern whether real physical responses are being caused by the psychological placebo effect. Blinding occurs when the subjects or the response evaluators don’t know which subjects are receiving which treatments (for example, which are receiving placebos).
TIP
Blinding and placebos in experiments are important but are not always feasible. You can still have “experiments” without these.
EXAMPLE 8.5
A study is intended to test the effects of vitamin E and beta carotene on heart attack rates. How should it be set up?
Answer: Using randomization, the subjects should be split into four groups: those who will be given just vitamin E, just beta carotene, both vitamin E and beta carotene, and neither vitamin E nor beta carotene. For example, as each subject joins the test, the next digit in a random number table can be read off, ignoring 0 and 5–9, and with 1, 2, 3, and 4 designating which group the subject is placed in. Or if the total number of subjects is known and available, for example, 800, then each can be assigned a number and three digits at a time be read off the random number table. With repeats and numbers over 800 thrown away, the first 200 numbers picked represent one group, the next 200 another, and so on. More meaningful results will be obtained if the study is double-blind, that is, if not only are the subjects unaware of what kind of tablets they are taking but so are the doctors evaluating whether or not they have heart problems. Many diagnoses are not clear-cut, and doctors can be influenced if they know exactly which potential preventive their patients are taking.
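When the 800 subjects are known in advance, a computer can stand in for the random number table. The following is a minimal sketch, assuming subjects are identified by the numbers 1 through 800; the function name assign_to_groups and the seed are illustrative choices, not part of the example.

```python
import random

def assign_to_groups(subject_ids, group_names, rng):
    """Shuffle the subjects, then deal them into equal-sized groups."""
    ids = list(subject_ids)
    if len(ids) % len(group_names) != 0:
        raise ValueError("subjects do not divide evenly among the groups")
    rng.shuffle(ids)
    size = len(ids) // len(group_names)
    return {name: ids[i * size:(i + 1) * size]
            for i, name in enumerate(group_names)}

rng = random.Random(2024)  # fixed seed only so the sketch is repeatable
groups = assign_to_groups(
    range(1, 801),
    ["vitamin E only", "beta carotene only", "both", "neither"],
    rng,
)
for name, members in groups.items():
    print(name, len(members))
```

Shuffling the full list and cutting it into four pieces has the same effect as reading numbers from the table and discarding repeats: every subject lands in exactly one group of 200.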
TREATMENTS, EXPERIMENTAL UNITS, AND RANDOMIZATION
An experiment is performed on objects called experimental units, and if the units are people, they are called subjects. The experimental units or subjects are often divided into two groups. One group receives a treatment and is called the treatment group. A comparison is made between the response noted in the treatment group and the response noted in the control group, the group that receives no treatment.
To help minimize the effect of lurking variables, and of confounding, it is important to use randomization, that is, to use chance in deciding which subjects go into which group. It is not sufficient to try to systematically match characteristics between the two groups. It seems reasonable, for example, to hand-sort subjects so that both the treatment group and the control group have the same number of women, the same number of Catholics, the same number of Hispanics, the same number of short people, and so on, but this method does not work well. There are always other variables that one might not think of considering until after the results of the experiment start coming in. The best method to use is randomization employing a computer, a hat with names in it, or a random number table.
Note that randomization usually refers to how given subjects are assigned to treatments, not to how a group of subjects is chosen from an entire population. The object of an experiment is to see if different treatments lead to different responses, and so we randomly assign subjects to treatments to balance unknown sources of variability. Random assignment to treatments is critical, especially if the subjects are not randomly selected, as is the case in medical/drug experiments. Generalizing the findings of the study is a separate question, one that depends on how the initial group of subjects was assembled.
COMPLETELY RANDOMIZED DESIGN FOR TWO TREATMENTS
Comparing two treatments using randomization is often the design of choice. To help minimize hidden bias, it is best if subjects do not know which treatment they are receiving. This is called single-blinding. Another precaution is the use of double-blinding, in which neither the subjects nor those evaluating their responses know who is receiving which treatment.
EXAMPLE 8.6
There is a pressure point on the wrist that some doctors believe can be used to help control the nausea experienced following certain medical procedures. The idea is to place a band containing a small marble firmly on a patient’s wrist so that the marble is located directly over the pressure point. Describe how an experiment might be run on 50 postoperative patients.
Answer: Assign each patient a number from 01 to 50. From a random number table read off two digits at a time, throwing away repeats, 00, and numbers over 50, until 25 numbers have been selected. Put wristbands with marbles over the pressure point on the patients with these assigned numbers. Put wristbands with marbles on the remaining patients also, but not over the pressure point. Have a researcher check by telephone with all 50 patients at designated time intervals to determine the degree of nausea being experienced. Neither the patients nor the researcher on the telephone should know which patients have the marbles over the correct pressure point.
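The table-reading procedure in this answer can be imitated step by step. In the sketch below a string of computer-generated digits plays the role of a row of the random number table; the function name pick_from_digit_stream, the seed, and the length of the digit string are assumptions made for illustration.

```python
import random

def pick_from_digit_stream(digits, population_size, sample_size):
    """Read two digits at a time; skip 00, numbers over population_size, and repeats.

    Because numbers are read two digits at a time, population_size must be at most 99.
    """
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        number = int(digits[i:i + 2])
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
            if len(chosen) == sample_size:
                return chosen
    raise ValueError("ran out of digits before the sample was complete")

rng = random.Random(7)  # fixed seed only so the sketch is repeatable
digit_row = "".join(rng.choice("0123456789") for _ in range(2000))

# Patients whose marble is placed over the pressure point, and the rest.
marble_on_point = pick_from_digit_stream(digit_row, 50, 25)
marble_off_point = [p for p in range(1, 51) if p not in marble_on_point]
```

For instance, reading the digits 005112071203 in pairs gives 00 (skipped), 51 (too large), 12, 07, 12 (a repeat), and 03, so the first three patients chosen would be 12, 7, and 3.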
EXAMPLE 8.7
A chemical fertilizer company wishes to test whether using their product results in superior vegetables. After dividing a large field into small plots, how might the experiment proceed?
Answer: If the company has one recommended fertilizer application level, half the plots can be randomly selected (assigning the plots numbers and using a random number table) to receive the prescribed dosage of fertilizer. This random selection of plots is to ensure that neither fertilized plants nor unfertilized plants are inadvertently given land with better rainfall, sunshine, soil type, and so on. To avoid possible bias on the part of employees who will weed and water the plants, they should not know which plots have received the fertilizer. It might be necessary to have containers, one for each plot, of a similar-looking, similar-smelling substance, half of which contain the fertilizer while the rest contain a chemically inactive material. Finally, if the vegetables are to be judged by quantity and size, the measurements will be less subject to bias. However, if they are to be judged qualitatively, for example, by taste, the judges should not know which vegetables were treated with the fertilizer and which were not.
If the researchers also wish to consider level, that is, the amount of fertilizer, randomization should be used for more groupings. For example, if there are 60 plots on which to test four levels of fertilizer, the first 12 different two-digit numbers in the range 01–60 appearing on a random number table might receive one level, the next 12 new two-digit numbers a second level, and so on, with the last 12 plots receiving the “placebo” treatment.
RANDOMIZED PAIRED COMPARISON DESIGN
Two treatments can be compared based on the responses of paired subjects, one of whom receives one treatment while the other receives the second treatment. Often the paired subjects are really single subjects who are given both treatments, one at a time.
EXAMPLE 8.8
The famous Pepsi-Coke tests had subjects compare the taste of samples of each drink. How could such a paired comparison test be set up?
Answer: It is crucial that such a test be blind, that is, that the subjects not know which cup contains which drink. Furthermore, to help avoid hidden bias, which drink the subjects taste first should be decided by chance. For example, as each subject arrives, the researcher could read off the next digit from a random number table, with the subject receiving Pepsi or Coke first depending on whether the digit is odd or even.
Note: Even though the subjects are being given a drink, and there is some randomization going on, some statisticians consider this to be a sample survey aimed at estimating a population proportion rather than a true experiment.
EXAMPLE 8.9
Does seeing pictures of accidents caused by drunk drivers influence one’s opinion on penalties for drunk drivers? How could a comparison test be designed?
Answer: The subjects could be asked questions about drunk driving penalties before and then again after seeing the pictures, and any change in answers noted. This would be a poor design because there is no control group, there is no use of randomization, and subjects might well change their answers because they realize that that is what is expected of them after seeing the pictures.
A better design is to use randomization to split the subjects into two groups, half of whom simply answer the questions while the other half first see the pictures and then answer the questions.
Another possibility is to use a group of twins as subjects. One of each set of twins is randomly picked (e.g., based on choosing an odd or even digit from a random number table) to answer the questions without seeing the pictures, while the other first sees the pictures and then answers the questions. The answers could be compared from each set of twins. This is a paired comparison test that might help minimize lurking variables due to family environment, heredity, and so on.
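The within-pair randomization for the twin design might be sketched as follows. The twin labels and the function name assign_within_pairs are hypothetical, and a computer-generated random number takes the place of the odd-or-even digit from the table.

```python
import random

def assign_within_pairs(pairs, rng):
    """For each pair, let chance decide which member sees the pictures."""
    assignment = []
    for first, second in pairs:
        # Equivalent to checking whether a random digit is odd or even.
        if rng.random() < 0.5:
            first, second = second, first
        assignment.append({"pictures": first, "no pictures": second})
    return assignment

rng = random.Random(11)  # fixed seed only so the sketch is repeatable
twin_pairs = [("twin 1A", "twin 1B"), ("twin 2A", "twin 2B"), ("twin 3A", "twin 3B")]
plan = assign_within_pairs(twin_pairs, rng)

for row in plan:
    print(row)
```

Each pair contributes exactly one subject to each condition, so the later comparison is always made between twins from the same family.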
REPLICATION, BLOCKING, AND GENERALIZABILITY OF RESULTS
When differences are observed in a comparison test, the researcher must decide whether these differences are statistically significant or whether they can be explained by natural variation. One important consideration is the size of the sample: the larger the sample, the less likely it is that a difference of a given size arose from chance variation alone. This is the principle of replication; that is, the treatment should be repeated on a sufficient number of subjects so that real response differences are more apparent.
Just as stratification in sampling design first divides the population into representative groups called strata, blocking in experiment design first divides the subjects into representative groups called blocks. One can think of blocking as running a separate experiment on each block. This technique helps control certain lurking variables by bringing them directly into the picture and helps make conclusions more specific. The paired comparison design is a special case of blocking in which each pair (or each subject if the subjects serve as their own controls) can be considered a block.
TIP
Use proper terminology! The language of experiments is different from the language of observational studies—you shouldn’t mix up blocking and stratification.
EXAMPLE 8.10
There is a rising trend for star college athletes to turn professional without finishing their degrees. A study is performed to assess whether reading an article about professional salaries has an impact on such decisions. Randomization can be used to split the subjects into two groups, with those in one group given the article before answering questions. How can a block design be incorporated into the design of this experiment?
Answer: The subjects can be split into two blocks, underclass and upperclass, before using randomization to assign some to read the article before questioning. With this design, the impact of the salary article on freshmen and sophomores can be distinguished from the impact on juniors and seniors.
Similarly, blocking can be used to separately analyze men and women, those with high GPAs and those with low GPAs, those in different sports, those with different majors, and so on.
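A randomized block design of the kind described in Example 8.10 can be sketched in a few lines. The roster of 12 athletes, the function names, and the choice to deal treatments in rotation after shuffling are assumptions made for illustration; the essential point is that the randomization is carried out separately inside each block.

```python
import random

def randomized_block_assignment(subjects, block_of, treatments, rng):
    """Randomize to treatments separately inside each block."""
    blocks = {}
    for subject in subjects:
        blocks.setdefault(block_of(subject), []).append(subject)

    assignment = {}
    for block, members in blocks.items():
        rng.shuffle(members)
        for i, subject in enumerate(members):
            # Dealing treatments in rotation keeps each block nearly balanced.
            assignment[subject] = (block, treatments[i % len(treatments)])
    return assignment

def underclass_or_upperclass(athlete):
    """Block label based on class year (second entry of the athlete tuple)."""
    return "underclass" if athlete[1] in ("freshman", "sophomore") else "upperclass"

rng = random.Random(5)  # fixed seed only so the sketch is repeatable

# Hypothetical roster: (name, class year) for 12 athletes.
athletes = [(f"athlete{i}", year)
            for i, year in enumerate(["freshman", "sophomore", "junior", "senior"] * 3)]

plan = randomized_block_assignment(
    athletes, underclass_or_upperclass, ["reads article", "no article"], rng)

for athlete, (block, treatment) in plan.items():
    print(athlete[0], block, treatment)
```

With this roster, each block of six athletes ends up with three who read the article and three who do not, so the effect of the article can be examined within each block.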
A major goal of experiments is to be able to generalize the results to broader populations. Often an experiment must be repeated in a variety of settings. For example, it is hard to generalize from the effect a television commercial has on students at a private midwestern high school to the effect the same commercial has on retired senior citizens in Florida. Generally, comparison and randomization are important, blinding is sometimes critical, and taking care to avoid hidden bias as much as possible is always indicative of a well-designed experiment. However, knowledge of the subject so that realistic situations can be created in testing should also be emphasized. Testing and experimenting on people does not put them in natural states, and this situation can lead to artificial responses.
SUMMARY
Experiments involve applying a treatment to one or more groups and observing the responses.
Experiments often have a treatment group and a control group.
Blocking is the process of dividing the subjects into representative groups to bring certain differences into the picture (for example, blocking by gender, age, or race).
Random assignment of subjects to treatment groups is extremely important in handling unknown and uncontrollable differences.
Random assignment refers to what is done with subjects after they’ve been picked for a study, whereas random sampling refers to how subjects are selected for a study.
Variables are said to be confounded when there is uncertainty as to which variable is causing an effect.
The placebo effect refers to the fact that many people respond to any kind of perceived treatment.
Blinding refers to subjects not knowing which treatment they are receiving.
Double-blinding refers to subjects and those evaluating their responses not knowing who received which treatments.
Completely randomized designs refer to experiments in which everyone has an equal chance of receiving any treatment.
Randomized block designs refer to experiments in which the randomization occurs only within blocks.
Randomized paired comparison designs refer to experiments in which subjects are paired and randomization is used to decide who in each pair receives what treatment.
Multiple-Choice Questions
Directions: The questions or incomplete statements that follow are each followed by five suggested answers or completions. Choose the response that best answers the question or completes the statement.
1. A study is made to determine whether taking AP Statistics in high school helps students achieve higher GPAs when they go to college. In comparing records of 200 college students, half of whom took AP Statistics in high school, it is noted that the average college GPA is higher for those 100 students who took AP Statistics than for those who did not. Based on this study, guidance counselors begin recommending AP Statistics for college bound students. Which of the following is incorrect?
(A) While this study indicates a relation, it does not prove causation.
(B) There could well be a confounding variable responsible for the seeming relationship.
(C) Self-selection here makes drawing the counselors’ conclusion difficult.
(D) A more meaningful study would be to compare an SRS from each of the two groups of 100 students.
(E) This is an observational study, not an experiment.
2. In a 1927–32 Western Electric Company study on the effect of lighting on worker productivity, productivity increased with each increase in lighting but then also increased with every decrease in lighting. If it is assumed that the workers knew a study was in progress, this is an example of
(A) the effect of a treatment unit.
(B) the placebo effect.
(C) the control group effect.
(D) sampling error.
(E) voluntary response bias.
3. When the estrogen-blocking drug tamoxifen was first introduced to treat breast cancer, there was concern that it would cause osteoporosis as a side effect. To test this concern, cancer subjects were randomly selected and given tamoxifen, and their bone density was measured before and after treatment. Which of the following is a true statement?
(A) This study was an observational study.
(B) This study was a sample survey of randomly selected cancer patients.
(C) This study was an experiment in which the subjects were used as their own controls.
(D) With the given procedure, there cannot be a placebo effect.
(E) Causation cannot be concluded without knowing the survival rates.
4. In designing an experiment, blocking is used
(A) to reduce bias.
(B) to reduce variation.
(C) as a substitute for a control group.
(D) as a first step in randomization.
(E) to control the level of the experiment.
5. Which of the following is incorrect?
(A) Blocking is to experiment design as stratification is to sampling design.
(B) By controlling certain variables, blocking can make conclusions more specific.
(C) The paired comparison design is a special case of blocking.
(D) Blocking results in increased accuracy because the blocks have smaller size than the original group.
(E) In a randomized block design, the randomization occurs within the blocks.
6. Consider the following studies being run by three different nursing home establishments.
I. One nursing home has pets brought in for an hour every day to see if patient morale is improved.
II. One nursing home allows hourly visits every day by kindergarten children to see if patient morale is improved.
III. One nursing home administers antidepressants to all patients to see if patient morale is improved.
Which of the following is true?
(A) None of these studies uses randomization.
(B) None of these studies uses control groups.
(C) None of these studies uses blinding.
(D) Important information can be obtained from all these studies, but none will be able to establish causal relationships.
(E) All of the above
7. A consumer product agency tests miles per gallon for a sample of automobiles using each of four different octanes of gasoline. Which of the following is true?
(A) There are four explanatory variables and one response variable.
(B) There is one explanatory variable with four levels of response.
(C) Miles per gallon is the only explanatory variable, but there are four response variables corresponding to the different octanes.
(D) There are four levels of a single explanatory variable.
(E) Each explanatory level has an associated level of response.
8. Is hot oatmeal with fruit or a Western omelet with home fries a more satisfying breakfast? Fifty volunteers are randomly split into two groups. One group is fed oatmeal with fruit, while the other is fed Western omelets with home fries. Each volunteer then rates his/her breakfast on a one to ten scale for satisfaction. If the Western omelet with home fries receives a substantially higher average score, what is a reasonable conclusion?
(A) In general, people find Western omelets with home fries more satisfying for breakfast than hot oatmeal with fruit.
(B) There is no reasonable conclusion because the subjects were volunteering rather than being randomly selected from the general population.
(C) There is no reasonable conclusion because of the small size of the sample.
(D) There is no reasonable conclusion because blinding was not used.
(E) There is no reasonable conclusion because there are too many possible confounding variables such as age, race, and ethnic background of the individual volunteers and season when the study was performed.
9. Which of the following is a true statement?
(A) In well-designed observational studies, responses are systematically influenced during the collection of data.
(B) In well-designed experiments, the treatments result in responses that are as similar as possible.
(C) A well-designed experiment always has a single treatment but may test that treatment at different levels.
(D) Causation and association are unrelated concepts.
(E) In well-designed, well-conducted experiments, strong association implies cause and effect.
10. Which of the following is not important in the design of experiments?
(A) Control of confounding variables
(B) Randomization in assigning subjects to different treatments
(C) Replication of the experiment using sufficient numbers of subjects
(D) Care in observing without imposing change
(E) Isolating variability due to differences between blocks
11. Which of the following is a true statement about the design of matched-pair experiments?
(A) Each subject might receive both treatments.
(B) Each pair of subjects receives the identical treatment, and differences in their responses are noted.
(C) Blocking is one form of matched-pair design.
(D) Stratification into two equal sized strata is an example of matched pairs.
(E) Randomization is unnecessary in true matched pair designs.
12. Do teenagers prefer sports drinks colored blue or green? Two different colorings, which have no effect on taste, are used on the identical drink to result in a blue and a green beverage; volunteer teenagers are randomly assigned to drink one or the other colored beverage; and the volunteers then rate the beverage on a one to ten scale. Because of concern that sports interest may affect the outcome, the volunteers are first blocked by whether or not they play on a high school team. Is blinding possible in this experiment?
(A) No, because the volunteers know whether they are drinking a blue or green drink.
(B) No, because the volunteers know whether or not they play on a high school team.
(C) Yes, by having the experimenter in a separate room randomly pick one of two containers and remotely have a drink poured from that container.
(D) Yes, by having the statistician analyzing the results not know which volunteer sampled which drink.
(E) Yes, by having the volunteers drink out of solid colored thermoses, so that they don’t know the color of the drink they are tasting.
13. Some researchers believe that too much iron in the blood can raise the level of cholesterol. The iron level in the blood can be lowered by making periodic blood donations. A study is performed by randomly selecting half of a group of volunteers to give periodic blood donations while the rest do not. Is this an experiment or an observational study?
(A) An experiment with a single factor
(B) An experiment with control group and blinding
(C) An experiment with blocking
(D) An observational study with comparison and randomization
(E) An observational study with little if any bias
Free-Response Questions
Directions: You must show all work and indicate the methods you use. You will be graded on the correctness of your methods and on the accuracy of your final answers.
ELEVEN OPEN-ENDED QUESTIONS
1. The belief that sugar causes hyperactivity is the most popular example of how people believe that food influences behavior.
(a) Many parents, witnessing the aftermath of cake and ice cream at birthday parties, attest to the relationship between sugar and hyperactivity. Are these observational studies or experiments? Explain.
(b) Name a confounding variable to the above and explain how it is confounded with sugar.
(c) Design a study to allow a parent to determine whether sugar causes hyperactivity in his/her child and explain why double blinding is so important here.
2. Suppose a new drug is developed that appears in laboratory settings to completely prevent people who test positive for human immunodeficiency virus (HIV) from ever developing full-blown acquired immunodeficiency syndrome (AIDS). Putting all ethical considerations aside, design an experiment to test the drug. What ethical considerations might arise during the testing that would force an early end to the experiment?
3. A new weight-loss supplement is to be tested at three different levels (once, twice, and three times a day). Design an experiment, including a control group and including blocking for gender, for 80 overweight volunteers, half of whom are men. Explain carefully how you will use randomization.
4. Two studies are run to measure the health benefits of long-time use of daily high doses of vitamin C. Researchers in the first study send a questionnaire to all 50,000 subscribers to a health magazine, asking whether they have taken large doses of vitamin C for at least a 2-year period and what they perceive to be the health benefits, if any. The response rate is 80%. The 10,000 people who did not respond to the first mailing receive follow-up telephone calls, and eventually responses are registered from 98% of the magazine subscribers. Researchers in a second study take a group of 200 volunteers and randomly select 100 to receive high doses of vitamin C while the others receive a similar-looking, similar-tasting placebo. The volunteers are not told whether they are receiving the vitamin, but their doctors know and are asked to note health changes during a 2-year period. Comment on the designs of the two studies, remarking on their good points and on possible sources of error.
5. Explain how you would design an experiment to evaluate whether subliminal advertising (flashing “BUY POPCORN” on the screen for a fraction of a second) results in more popcorn being sold in a movie theater. Show how you will incorporate comparison, randomization, and blinding.
6. Throughout history millions of people have used garlic to obtain a variety of perceived health benefits. A vitamin production company decides to run a scientific test to assess the value of garlic in promoting a general sense of well-being. They randomly pick 250 of their employees, and once a day for 2 months the employees fill out questionnaires about their sense of well-being that day. For the next 2 months the employees take garlic capsules daily and again fill out the same questionnaires. Finally, for 2 concluding months the employees stop taking the pills and continue to fill out the daily questionnaires. Comment on the design of this experiment.
7. A new pain control procedure has been developed in which the patient uses a small battery pack to vary the intensity and duration of electric signals to electrodes surgically embedded in the afflicted area. Putting all ethical considerations aside, design an experiment to test the procedure. What ethical considerations might arise during the testing that would force an early end to the experiment?
8. A new vegetable fertilizer is to be tested at two different levels (regular concentration and double concentration). Design an experiment, including a control, for 30 test plots, half of which are in shade. Explain carefully how you will use randomization.
9. Two studies are run to measure the extent to which taking zinc lozenges helps to shorten the duration of the common cold. Researchers in the first study send questionnaires to all 5000 employees of a major teaching hospital asking whether they have taken zinc lozenges to fight the common cold and what they perceive to be the benefits, if any. The response rate is 90%. The 500 people who did not respond to the first mailing receive follow-up telephone calls, and eventually responses are obtained from over 99% of the hospital employees. Researchers in the second study take a group of 100 volunteers and randomly select 50 to receive zinc lozenges while the others receive a similar-looking, similar-tasting placebo. The volunteers are not told whether they are taking the zinc lozenges, but their doctors know and are asked to accurately measure the duration of common cold symptoms experienced by the volunteers. Comment on the designs of the two studies, remarking on their good points and on possible sources of error.
10. Explain how you would design an experiment to evaluate whether praying for a hospitalized heart attack patient leads to a speedier recovery. Show how you would incorporate comparison, randomization, and blinding.
11. The computer science department plans to offer three introductory-level CS courses: one using Pascal, one using C++, and one using Java.
(a) The department chairperson plans to give all students the same general programming exam at the end of the year and to compare the relative effectiveness of using each of the programming languages by comparing the mean grades of the students from each course. What is wrong, if anything, with the chairperson’s plan?
(b) The chairperson also wishes to determine whether math majors or science majors do better in the courses. Suppose he calculates that the average grade of science majors was higher than the average grade of math majors in each of the courses. Does it follow that the average grade of all the science majors taking the three courses must be higher than the average grade of all the math majors? Explain.
(c) Suppose 300 students wish to take introductory programming. How would you randomly assign 100 students to each of the three courses?
(d) How would you randomly assign students to the three courses if you wanted the assignment to be independent from student to student, with each student in turn having a one-third probability of taking each of the three classes?
(e) Name a lurking variable that all the above methods miss.
AN INVESTIGATIVE TASK
A high school offers two precalculus courses, one that uses a traditional lecture and drill method, and a second that divides students into small groups to work on open-ended problems. To compare the effectiveness of the two methods, the administration proposes to compare average SAT math scores for the students in the two courses.
(a) What is wrong with the administration’s proposal?
(b) Suppose a group of 50 students are willing to take either course. Explain how you would use a random number table to set up an experiment comparing the effectiveness of the two courses.
(c) Apply your setup procedure to the given random number table:
84177 06757 17613 15582 51506 81435 41050 92031 06449
05059 59884 31180 53115 84469 94868 57967 05811 84514
75011 13006 63395 55041 15866 06589 13119 71020 85940
91932 06488 74987 54355 52704 90359 02649 47496 71567
94268 08844 26294 64759 08989 57024 97284 00637 89283
03514 59195 07635 03309 72605 29357 23737 67881 03668
33876 35841 52869 23114 15864 38942
(d) Discuss any variables that your setup doesn’t consider.
MULTIPLE-CHOICE
1. (D) It may well be that very bright students are the same ones who both take AP Statistics and have high college GPAs. If students could be randomly assigned to take or not take AP Statistics, the results would be more meaningful. Of course, ethical considerations might make it impossible to isolate the confounding variable in this way. Only using a sample from the observations gives less information.
2. (B) The desire of the workers for the study to be successful led to a placebo effect.
3. (C) In experiments on people, the subjects can be used as their own controls, with responses noted before and after the treatment. However, with such designs there is always the danger of a placebo effect. Thus the design of choice would involve a separate control group to be used for comparison.
4. (B) Blocking divides the subjects into groups, such as men and women, or political affiliations, and thus reduces variation.
5. (D) Blocking in experiment design first divides the subjects into representative groups called blocks, just as stratification in sampling design first divides the population into representative groups called strata. This procedure can control certain variables by bringing them directly into the picture, and thus conclusions are more specific. The paired comparison design is a special case of blocking in which each pair can be considered a block. Unnecessary blocking detracts from accuracy because of smaller sample sizes.
6. (E) None of the studies has any controls, such as randomization, control groups, or blinding, and so while they may give valuable information, they cannot establish cause and effect.
7. (D) Octane is the only explanatory variable, and it is being tested at four levels. Miles per gallon is the single response variable.
8. (A) There is nothing wrong with using volunteers—what is important is to randomly assign the volunteers into the two treatment groups. There is no way to use blinding in this study—the subjects will clearly know which breakfast they are eating. The main idea behind randomly assigning subjects to the different treatments is to control for various possible confounding variables—it is reasonable to assume that people of various ages, races, ethnic backgrounds, etc., are assigned to receive each of the treatments.
9. (E) In good observational studies, the responses are not influenced during the collecting of data. In good experiments, treatments are compared as to differences in responses. In an experiment, there can be many treatments, each at a different level. Well-designed experiments can show cause and effect.
10. (D) Control, randomization, and replication are all important aspects of well-designed experiments. Care in observing without imposing change refers to observational studies, not experiments.
11. (A) Each subject might receive both treatments, as, for example, in the Pepsi-Coke taste comparison study. The point is to give each subject in a matched pair a different treatment and note any difference in responses. Matched-pair experiments are a particular example of blocking, not vice versa. Stratification refers to a sampling method, not to experimental design. Randomization is used to decide which of a pair gets which treatment or which treatment is given first if one subject is to receive both.
12. (A) Blinding does have to do with whether or not the subjects know which treatment (color in this experiment) they are receiving. However, drinking out of solid colored thermoses makes no sense since the beverages are identical except for color and the point of the experiment is the teenagers’ reaction to color. Blinding has nothing to do with blocking (team participation in this experiment).
13. (A) This study is an experiment because a treatment (periodic removal of a pint of blood) is imposed. There is no blinding because the subjects clearly know whether or not they are giving blood. There is no blocking because the subjects are not divided into blocks before random assignment to treatments. For example, blocking would have been used if the subjects had been separated by gender or age before random assignment to give or not give blood donations. There is a single factor—giving or not giving blood.
FREE-RESPONSE
1. (a) These are observational studies as there is no randomization of treatments to subjects.
(b) The excitement of a birthday party is a confounding variable. Without conducting a proper experiment, there is no way of telling whether observed hyperactivity is caused by sugar or by the excitement of a party or by some other variable.
(c) The parent should randomly give the child sugar or sugar-free sweets at parties and observe the child’s behavior. It is important that the parent not know which the child is receiving (double blinding), because the parent might perceive a difference in behavior which is not really there if he/she knows whether or not the child is being given a sugary food.
2. Ask doctors, hospitals, or blood testing laboratories to make known that you are looking for HIV-positive volunteers. As the volunteers arrive, use a random number table to give each one the drug or a placebo (e.g., if the next digit in the table is odd, the volunteer gets the drug, while if the next digit is even, the volunteer gets a placebo). Use double-blinding; that is, neither the volunteers nor their doctors should know whether the volunteers are receiving the drug or the placebo. Ethical considerations will arise, for example, if the drug is very successful. If volunteers on the placebo are steadily developing full-blown AIDS while no one on the drug is, then ethically the test should be stopped and everyone put on the drug. Or if most of the volunteers on the drug are dying from an unexpected fatal side effect, the test should be stopped and everyone taken off the drug.
TIP
Simply saying to “randomly assign” subjects to treatment groups is usually an incomplete response. You need to explain how to make the assignments—for example, by using a random number table or through generating random numbers on a calculator.
3. To achieve blocking by gender, first separate the men and women. Label the 40 men 01 through 40. Use a random number table to pick two digits at a time, ignoring 00 and numbers greater than 40, and ignoring repeats, until a group of ten such numbers is obtained. These men will receive the supplement at the once-a-day level. Follow along in the table, continuing to ignore repeats, until another group of ten is selected. These men will receive the supplement at the twice-a-day level. Again ignore repeats until a third group of ten is selected to receive the supplement at the three-times-a-day level, while the remaining men will be a control group and not receive the supplement. Now repeat the entire procedure, starting by labeling the women 01 through 40. A decision should be made whether or not to use a placebo and have all participants take “something” three times a day. Weigh all 80 overweight volunteers before and after a predetermined length of time. Calculate the change in weight for each individual. Calculate the average change in weight among the ten people in each of the eight groups. Compare the four averages from each block (men and women) to determine the effect, if any, of different levels of the supplement for men and for women.
4. The first study, an observational study, does not suffer from nonresponse bias, as do most mailed questionnaires, because it involved follow-up telephone calls and achieved a high response rate. However, this study suffers terribly from selection bias because people who subscribe to a health magazine are not representative of the general population. One would expect most of them to strongly believe that vitamins improve their health. The second study, a controlled experiment, used comparison between a treatment group and a control group, used randomization in selecting who went into each group, and used blinding to control for a placebo effect on the part of the volunteers. However, it did not use double-blinding; that is, the doctors knew whether their patients were receiving the vitamin, and this could have introduced hidden bias when they made judgments regarding their patients’ health.
5. Every day for some specified period of time, look at the next digit on a random number table. If it is odd, flash the subliminal message all day on the screen, while if it is even, don’t flash the message that day (randomization). Don’t let the customers know what is happening (blinding) and don’t let the clerks selling the popcorn know what is happening (double-blinding). Compare the quantity of popcorn bought by the treatment group, that is, by the people who receive the subliminal message, to the quantity bought by the control group, the people who don’t receive the message (comparison).
6. Any conclusions would probably be meaningless. There is a substantial danger of the placebo effect here; that is, real physical responses could be caused by the psychological effect of knowing the intent of the research. The experiment would be considerably strengthened by using a control group taking a look-alike capsule. Any conclusions are further suspect because of the choice of subjects. Rather than making a random selection from the intended population, the company is using a sample from its own employees, a sample almost guaranteed to have concerns, interests, and backgrounds that will confound the responses or limit their generalizability.
7. Ask doctors and hospitals to make known that you are looking for volunteers from among intractable pain sufferers. As the volunteers arrive, use a random number table to decide which will have the electrodes properly embedded in their pain centers and which will have the electrodes harmlessly embedded in wrong positions. For example, if the next digit in the table is odd, the volunteer receives the proper embedding, while if the next digit is even, the volunteer does not. Use double-blinding; that is, neither the volunteers nor their doctors should know whether the volunteers are receiving the proper embedding. Ethical considerations will arise, for example, if the procedure is very successful. If volunteers with the wrong embeddings are in constant pain, while everyone with proper embedding is pain-free, then ethically the test should be stopped and everyone given the proper embedding. Or if most of the volunteers with proper embedding develop an unexpected side effect of the pain spreading to several nearby sites, then the test should be stopped and the procedure discontinued for everyone.
8. To achieve blocking by sunlight, first separate the sunlit and shaded plots. Label the 15 sunlit plots 01 through 15. Using a random number table, pick two digits at a time, ignoring 00 and numbers above 15 and ignoring repeats, until a group of five such numbers is obtained. These sunlit plots will receive the fertilizer at regular concentration. Continue in the table, ignoring repeats, until another group of five is selected. These sunlit plots will receive the fertilizer at double concentration, while the remaining sunlit plots will be a control group receiving no fertilizer. Now repeat the procedure, this time labeling the shaded plots 01 through 15. Assuming size is the pertinent outcome, weigh all vegetables at the end of the season, compare the average weights among the three sunlit groups, and compare the average weights among the three shaded groups to determine the effect of the fertilizer, if any, at different levels on sunlit plots and separately on shaded plots.
9. The first study, an observational study, does not suffer from nonresponse bias, as do most studies involving mailed questionnaires, because the researchers made follow-up telephone calls and achieved a very high response rate. However, the first study suffers terribly from selection bias. People who work at a teaching hospital are not representative of the general population. One would expect many of them to have heard about how zinc coats the throat to hinder the propagation of viruses. The second study, a controlled experiment, used comparison between a treatment group and a control group, used randomization in selecting who went into each group, and used blinding to control for a placebo effect on the part of the volunteers. However, they did not use double-blinding; that is, the doctors knew whether their patients were receiving the zinc lozenges, and this could introduce hidden bias as the doctors make judgments about their patients’ health.
10. For each new heart attack patient entering the hospital, look at the next digit from a random number table. If it is odd, give the name to a group of people who will pray for the patient throughout his or her hospitalization, while if it is even, don’t ask the group to pray (randomization). Don’t let the patients know what is happening (blinding) and don’t let the doctors know what is happening (double-blinding). Compare the lengths of hospitalization of patients who receive prayers with those of control group patients who don’t receive prayers (comparison).
11. (a) Allowing the students to self-select which class to take leads to confounding that could be significant. For example, perhaps the brighter students all want to learn a certain one of the three languages.
(b) It is possible for the average score of all science majors to be lower than the average for all math majors even though the science majors averaged higher in each class. For example, suppose that the students taking Java scored much higher than the students in the other two classes. Furthermore, only one science major took Java, and she scored tops in the class. Then the overall average of the math majors could well be higher. This is an example of Simpson’s paradox, in which a comparison can be reversed when more than one group is combined to form a single group.
(c) Number the students 001 through 300. Read off three digits at a time from a random number table, noting all triplets between 001 and 300 and ignoring repeats, until 100 such numbers have been selected. These students get Pascal. Keep reading off three digits, ignoring repeats, until 100 new numbers between 001 and 300 are selected. These students get C++, while the remaining 100 get Java. Even quicker would be to use a calculator to generate random integers between 1 and 300.
(d) Go through the list of students, rolling a die for each. If a 1 or a 2 shows, the student takes Pascal; if a 3 or a 4 shows, C++; and if a 5 or a 6 shows, Java.
(e) Another possible lurking variable is the teachers. For example, perhaps the better teachers teach Java.
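The assignments described in 11(c) and 11(d) can also be carried out with software instead of a random number table or a die. The following Python sketch is one possible way to do this; the function names, the seed values, and the use of Python's `random` module are illustrative assumptions and are not part of the original problem.

```python
import random

COURSES = ["Pascal", "C++", "Java"]

def assign_equal_groups(n_students=300, seed=None):
    """Part (c): shuffle the student numbers, then cut the list into three equal groups."""
    rng = random.Random(seed)
    students = list(range(1, n_students + 1))
    rng.shuffle(students)
    size = n_students // len(COURSES)
    return {course: sorted(students[i * size:(i + 1) * size])
            for i, course in enumerate(COURSES)}

def assign_independently(n_students=300, seed=None):
    """Part (d): each student independently gets each course with probability 1/3."""
    rng = random.Random(seed)
    return {student: rng.choice(COURSES) for student in range(1, n_students + 1)}

groups = assign_equal_groups(seed=1)          # exactly 100 students per course
independent = assign_independently(seed=1)    # group sizes will usually be unequal

for course in COURSES:
    n_independent = sum(1 for c in independent.values() if c == course)
    print(course, len(groups[course]), n_independent)
```

Note the design difference: the method in (c) guarantees three groups of 100, while the method in (d) makes each student's assignment independent of the others, so the class sizes are left to chance.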
AN INVESTIGATIVE TASK
(a) There is no reason to believe that there was anything random about which students took which course. Perhaps all the weaker students self-selected or were advised to choose the traditional course.
(b) The students could be labeled 01 through 50. Pairs of digits could then be read off a random number table, ignoring numbers over 50 and ignoring duplicates, until a set of 25 numbers is obtained. The students corresponding to these numbers could be enrolled in the traditional course, and the remaining students in the other.
(c) Applying the above procedure results in {17, 31, 14, 35, 41, 05, 09, 20, 06, 44, 50, 43, 11, 18, 45, 01, 13, 33, 04, 19, 02, 08, 40, 49, 03}. Enroll the students with these numbers in the traditional course.
(d) Which teachers teach which courses is not considered. Perhaps the more interesting, exciting teachers teach the new version. Even though a control group is selected, there is no blinding, and so students in the new version might work harder because they realize they are part of an experiment.
LAW OF LARGE NUMBERS
BASIC PROBABILITY RULES
MULTISTAGE PROBABILITY CALCULATIONS
BINOMIAL DISTRIBUTION
GEOMETRIC PROBABILITIES
SIMULATION
DISCRETE RANDOM VARIABLES, MEANS (EXPECTED VALUES), AND STANDARD DEVIATIONS
In the world around us, unlikely events sometimes take place. At other times, events that seem inevitable do not occur. Chance is everywhere! The cards you are dealt in a poker game, the particular genes you inherit from your parents, and the coin toss at the beginning of a tennis match to determine who serves first are examples of chance behavior that mathematics can help us understand. Even though we may not be able to foretell a specific result, we can sometimes assign what is called a probability to indicate the likelihood that a particular event will occur.
Probabilities are always between 0 and 1, with a probability close to 0 meaning that an event is unlikely to occur, and a probability close to 1 meaning that the event is likely to occur. The sum of the probabilities of all the separate outcomes of an experiment is always 1.
TIP
Calculators will express very small probabilities in scientific notation such as 3.4211073E–6. Know what this means, and remember that probabilities are never greater than 1.
The relative frequency of an event is the proportion of times the event happened, that is, the number of times the event happened divided by the total number of trials. Relative frequencies may change every time an experiment is performed. The law of large numbers states that when an experiment is performed a large number of times, the relative frequency of an event tends to become closer to the probability of the event. In other words, probability is long-term relative frequency.
The law of large numbers says nothing at all about short-run behavior. There is no such thing as a law of small numbers or a law of averages. Gamblers might say that “red” is due on a roulette table, a basketball player is due to make a shot, a player has a hot hand at the craps table, or a certain number is due to come up in a lottery, but if events are independent, the probability of the outcome of the next trial has nothing to do with what happened in previous trials. Even though casinos and life insurance companies might lose money in the short run, they make long-term profits because of their understanding of the law of large numbers.
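A short simulation can make the law of large numbers concrete. The sketch below flips a simulated fair coin and records the running relative frequency of heads; the function name, the seed, and the number of flips are arbitrary choices made for illustration.

```python
import random

def running_relative_frequency(n_flips, p_heads=0.5, seed=0):
    """Return the relative frequency of heads after each of n_flips simulated flips."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for i in range(1, n_flips + 1):
        if rng.random() < p_heads:   # this flip counts as heads
            heads += 1
        freqs.append(heads / i)      # proportion of heads so far
    return freqs

freqs = running_relative_frequency(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(freqs[n - 1], 4))
```

The early values can wander far from 0.5, while the later values typically settle close to it. Nothing in the simulation "corrects" an early streak; the streak is simply swamped by the large number of later flips.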
EXAMPLE 9.1
There are two games involving flipping a fair coin. In the first game, you win a prize if you can throw between 45% and 55% heads. In the second game, you win if you can throw more than 60% heads. For each game would you rather flip 20 times or 200 times?
Answer: The probability of throwing heads is 0.5. By the law of large numbers, the more times you throw the coin, the more the relative frequency tends to become closer to this probability. With fewer tosses, there is greater chance of wide swings in the relative frequency. Thus, in the first game you would rather have 200 flips, whereas in the second game you would rather have only 20 flips.
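The reasoning in Example 9.1 can be checked with exact calculations. The sketch below counts outcomes using binomial coefficients (the binomial setting is taken up later in this chapter). It assumes that "between 45% and 55%" includes the endpoints; the function names are illustrative.

```python
from math import comb

def prob_heads_in_range(n, lo_pct, hi_pct):
    """P(lo_pct% <= proportion of heads <= hi_pct%) in n flips of a fair coin."""
    favorable = sum(comb(n, k) for k in range(n + 1)
                    if lo_pct * n <= 100 * k <= hi_pct * n)
    return favorable / 2 ** n

def prob_heads_above(n, pct):
    """P(proportion of heads > pct%) in n flips of a fair coin."""
    favorable = sum(comb(n, k) for k in range(n + 1) if 100 * k > pct * n)
    return favorable / 2 ** n

for n in (20, 200):
    print(n,
          round(prob_heads_in_range(n, 45, 55), 4),   # first game
          round(prob_heads_above(n, 60), 4))          # second game
```

The first game's probability is larger with 200 flips, and the second game's probability is larger with 20 flips, in agreement with the answer above.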
EXAMPLE 9.2
A standard literacy test consists of 100 multiple-choice questions, each with five possible answers. There is no penalty for guessing. A score of 60 is considered passing and a score of 80 is considered superior. When an answer is completely unknown, test takers employ one of three strategies: guess, choose answer (c), or choose the longest answer. The table below summarizes the results of 1000 test takers.
Note that in analyzing tables such as the one above, it is usually helpful to sum rows and columns.
(a) What is the probability that someone in this group uses the “guess” strategy?
Answer: P(guess) = 300/1000 = 0.3
(b) What is the probability that someone in this group scores 60–79?
Answer: P(score 60–79) = 530/1000 = 0.53
(c) What is the probability that someone in this group does not score 60–79?
Answer: P(does not score 60–79) = 1 − P(score 60–79) = 1 − 0.53 = 0.47
[The probability that an event will not occur, that is, the probability of its complement, is equal to 1 minus the probability that the event will occur.]
(d) What is the probability that someone in this group chooses strategy “answer (c)” and scores 80–100?
Answer:
TIP
The word “or” here means one event or the other event or both events.
(e) What is the probability that someone in this group chooses strategy “longest answer” or scores 0–59?
Answer:
Note that
[For any pair of events A and B, $P(A\cup B)=P(A)+P(B)-P(A\cap B)$.]
(f) What is the probability that someone in this group chooses strategy “guess” given that his/her score was 0–59?
Answer: P(guess | score 0–59) = 40/120 ≈ 0.333
Note that we narrowed our attention to the 120 test takers who scored 0–59 to calculate this conditional probability.
(g) What is the probability that someone in this group scored 80–100 given that he/she chose strategy “longest answer”?
Answer:
Note that we narrowed our attention to the 380 test takers who chose “longest answer” to calculate this conditional probability.
[A formula for conditional probability is given by $P(A|B)=\frac{P(A\cap B)}{P(B)}$]
TIP
If A and B are independent we also have P(A|not B) = P(A).
(h) Are the strategy “guess” and scoring 0–59 independent events? That is, is whether a test taker used the strategy “guess” unaffected by whether he/she scored 0–59?
Answer: We must check if P(guess | score 0–59) = P(guess). From (f) and (a), we see that these probabilities are not equal (0.333 ≠ 0.3), so the strategy “guess” and scoring 0–59 are not independent events.
[Events A and B are independent if P(A|B) = P(A), or equivalently, if P(B|A) = P(B). It is also true that A and B are independent if and only if P(A $\cap$ B) = P(A) P(B), that is, if and only if the probability of both events happening is the product of their probabilities. Yet another insight into independence is that A and B are independent if and only if $P(A|B) = P(A|B^C)$, where $B^C$ is the complement of B.]
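The three equivalent tests for independence can be compared side by side. The sketch below uses only counts stated or implied in Example 9.2: 1000 test takers, 300 who guessed, 120 who scored 0–59, and 40 who did both (the value implied by the conditional probability 0.333 out of 120). The variable names are illustrative.

```python
from fractions import Fraction

total = 1000
n_guess = 300          # used the "guess" strategy, from part (a)
n_low = 120            # scored 0-59, from the note after part (f)
n_guess_and_low = 40   # assumption: implied by 0.333 of the 120 low scorers

p_guess = Fraction(n_guess, total)
p_low = Fraction(n_low, total)
p_both = Fraction(n_guess_and_low, total)

# Test 1: compare P(guess | low) with P(guess)
p_guess_given_low = p_both / p_low
# Test 2: compare P(guess and low) with P(guess) * P(low)
product = p_guess * p_low
# Test 3: compare P(guess | low) with P(guess | not low)
p_guess_given_not_low = (p_guess - p_both) / (1 - p_low)

print(p_guess_given_low, p_guess)                  # 1/3 versus 3/10
print(p_both, product)                             # 1/25 versus 9/250
print(p_guess_given_low, p_guess_given_not_low)    # 1/3 versus 13/44
```

All three comparisons disagree, so each test gives the same verdict: "guess" and scoring 0–59 are not independent.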
TIP
Don’t multiply probabilities unless the events are independent.
(i) Are the strategy “longest answer” and scoring 80–100 mutually exclusive events? That is, are these two events disjoint, so that they cannot occur simultaneously?
Answer: longest answer $\cap$ score 80–100 ≠ Ø and P(longest answer $\cap$ score 80–100) ≠ 0, so the strategy “longest answer” and scoring 80–100 are not mutually exclusive events.
TIP
Don’t confuse independence with mutual exclusivity.
Note that if two events are mutually exclusive, then the probability that at least one event will occur is equal to the sum of the respective probabilities of the two events:
If P(A $\cap$ B) = 0, then P(A $\cup$ B) = P(A) + P(B)
TIP
Don’t add probabilities unless the events are mutually exclusive.
[Whether events are mutually exclusive or are independent are two very different properties! One refers to events being disjoint; the other refers to an event having no effect on whether or not the other occurs. Note that mutually exclusive (disjoint) events are not independent (except in the special case that one of the events has probability 0). That is, mutually exclusive gives that P(A $\cap$ B) = 0, whereas independence gives that P(A $\cap$ B) = P(A) P(B) (and the only way these are ever simultaneously true is in the very special case when P(A) = 0 or P(B) = 0).]
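The difference between the two properties can be seen with one roll of a fair die. The events below are a hypothetical illustration, not taken from the examples above, and the helper function names are assumptions made for this sketch.

```python
from fractions import Fraction

OUTCOMES = {1, 2, 3, 4, 5, 6}   # one roll of a fair die

def prob(event):
    """Probability of an event (a set of faces) when all six faces are equally likely."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def mutually_exclusive(a, b):
    return prob(a & b) == 0

def independent(a, b):
    return prob(a & b) == prob(a) * prob(b)

low = {1, 2}
middle = {3, 4}
even = {2, 4, 6}

# Disjoint events: knowing one occurred rules out the other, so they are dependent.
print(mutually_exclusive(low, middle), independent(low, middle))   # True False
# Overlapping events: P(low and even) = 1/6 = (1/3)(1/2), so they are independent.
print(mutually_exclusive(low, even), independent(low, even))       # False True
```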
MULTISTAGE PROBABILITY CALCULATIONS
EXAMPLE 9.3
On a university campus, 60%, 30%, and 10% of the computers use Windows, Apple, and Linux operating systems, respectively. A new virus affects 3% of the Windows, 2% of the Apple, and 1% of the Linux operating systems. What is the probability a computer on this campus has the virus?
Answer: In such problems it is helpful to start with a tree diagram.
TIP
Tree diagrams can be very useful in working with conditional probabilities.
We then have
P(Windows $\cap$ virus) = (0.6)(0.03) = 0.018
P(Apple $\cap$ virus) = (0.3)(0.02) = 0.006
P(Linux $\cap$ virus) = (0.1)(0.01) = 0.001
At this stage a Venn diagram is helpful in finishing the problem:
P(virus) = P(Windows $\cap$ virus) + P(Apple $\cap$ virus) + P(Linux $\cap$ virus)
= 0.018 + 0.006 + 0.001 = 0.025
TIP
Naked or bald answers will receive little or no credit. You must show where answers come from.
We can take the above analysis one stage further and answer such questions as: If a randomly chosen computer on this campus has the virus, what is the probability it is a Windows machine? An Apple machine? A Linux machine?
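Each of these conditional probabilities is the corresponding joint probability divided by P(virus) = 0.025, which gives approximately 0.72, 0.24, and 0.04 for Windows, Apple, and Linux. A minimal Python sketch of that arithmetic follows; Python and the variable names are choices made here for illustration, not part of the text.

```python
# Shares of each operating system and infection rates, from Example 9.3.
share = {"Windows": 0.60, "Apple": 0.30, "Linux": 0.10}
virus_rate = {"Windows": 0.03, "Apple": 0.02, "Linux": 0.01}

# P(OS and virus) for each branch of the tree diagram.
joint = {name: share[name] * virus_rate[name] for name in share}

# P(virus) is the sum over the three disjoint branches.
p_virus = sum(joint.values())

# P(OS | virus) = P(OS and virus) / P(virus).
posterior = {name: joint[name] / p_virus for name in joint}

for name, p in posterior.items():
    print(f"P({name} | virus) = {p:.2f}")
```

The three conditional probabilities sum to 1, since a computer with the virus must run one of the three systems.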
A probability distribution is a listing or formula giving the probability of each outcome.
In many applications, such as coin tossing, there are only two possible outcomes. For applications in which a two-outcome situation is repeated a certain number of times, and the probability of each of the two outcomes remains the same for each repetition, the resulting calculations involve what are known as binomial probabilities.
EXAMPLE 9.4
Suppose the probability that a lightbulb is defective is 0.1 (so probability of being good is 0.9).
(a) What is the probability that four lightbulbs are all defective?
Answer: Because of independence (i.e., whether one lightbulb is defective is not influenced by whether any other lightbulb is defective), we can multiply individual probabilities of being defective to find the probability that all the bulbs are defective:
(0.1)(0.1)(0.1)(0.1) = $(0.1)^4$ = 0.0001.
(b) What is the probability that exactly two out of three lightbulbs are defective?
Answer: The probability that the first two bulbs are defective and the third is good is (0.1)(0.1)(0.9) = 0.009. The probability the first bulb is good and the other two are defective is (0.9)(0.1)(0.1) = 0.009. Finally, the probability that the second bulb is good and the other two are defective is (0.1)(0.9)(0.1) = 0.009. Summing, we find that the probability that exactly two out of three bulbs are defective is 0.009 + 0.009 + 0.009 = 0.027.
(c) What is the probability that exactly three out of eight lightbulbs are defective?
Answer: The probability of any particular arrangement of three defective and five good bulbs is $(0.1)^3(0.9)^5$ = 0.00059049. How many such arrangements are there?
The answer is given by combinations: $\binom{8}{3}=\frac{8!}{3!\,5!}=56$. Thus, the probability that exactly three out of eight lightbulbs are defective is 56 × 0.00059049 = 0.03306744. [On the TI-84, binompdf(8,.1,3) = 0.03306744.]
NOTE
On the exam, you should write binompdf (n = 8, p = 0.1, x = 3) = 0.033.
More generally, if an experiment has two possible outcomes, called success and failure, with the probability of success equal to p, and the probability of failure equal to q (of course, p + q = 1), and if the outcome at any particular time has no influence over the outcome at any other time, then if the experiment is repeated n times, the probability of exactly k successes (and thus n – k failures) is
$P(X = k) = \binom{n}{k}\,p^k q^{n-k}$
EXAMPLE 9.5
Super Mario cards were in one-third of cereal boxes advertising enclosed Magic Motion cards. If six boxes of cereal are purchased, what is the probability of exactly two Super Mario cards?
Answer: If the probability of a Mario card is 1/3, then the probability of no Mario card is 1 – 1/3 = 2/3. If two out of six boxes have a Mario card, then 6 – 2 = 4 do not. Thus, the desired probability is $\binom{6}{2}\left(\frac{1}{3}\right)^2\left(\frac{2}{3}\right)^4 = \frac{240}{729} \approx 0.329$. [On the TI-84, binompdf(6,1/3,2) ≈ 0.329.]
TIP
On the exam it is sufficient to write: Binomial, n = 6, p = 1/3, P(X = 2) = 0.329.
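For readers without a TI-84, the binomial probability of exactly k successes can be computed directly from the counting argument used above. The following is a minimal Python sketch; the function name binom_pmf is an illustrative choice made here, and Python is not a tool the text itself uses.

```python
from math import comb

def binom_pmf(n, p, k):
    """P(exactly k successes in n independent trials with success probability p)."""
    # comb(n, k) counts the arrangements; each arrangement has probability p^k (1-p)^(n-k).
    return comb(n, k) * p**k * (1 - p)**(n - k)

bulbs = binom_pmf(8, 0.1, 3)     # Example 9.4(c): three defective bulbs out of eight
mario = binom_pmf(6, 1 / 3, 2)   # Example 9.5: two Mario cards in six boxes

print(f"{bulbs:.8f}")
print(f"{mario:.3f}")
```

The two calls play the role of binompdf(8,.1,3) and binompdf(6,1/3,2) on the calculator.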
In some situations it is easier to calculate the probability of the complementary event and subtract this value from 1.
EXAMPLE 9.6
Joe DiMaggio had a career batting average of 0.325. What was the probability that he would get at least one hit in five official times at bat?
Answer: We could sum the probabilities of exactly one hit, two hits, three hits, four hits, and five hits. However, the complement of “at least one hit” is “zero hits.” The probability of no hit is $(0.675)^5 \approx 0.140$, and thus the probability of at least one hit in five times at bat is 1 – 0.140 = 0.860.
[Or binomcdf(5, .675, 4) ≈ 0.860.]
Many, perhaps most, applications of probability involve such phrases as at least, at most, less than, and more than. In these cases, solutions involve summing two or more cases. For such calculations, the TI-84 binomcdf is very useful. binomcdf(n,p,x) gives the probability of x or fewer successes in a binomial distribution with number of trials n and probability of success p.
EXAMPLE 9.7
A manufacturer has the following quality control check at the end of a production line: If at least eight of ten randomly picked articles meet all specifications, the whole shipment is approved. If, in reality, 85% of a particular shipment meet all specifications, what is the probability that the shipment will make it through the control check?
Answer: The probability of an item meeting specifications is 0.85, and so the probability of it not meeting specifications must be 0.15. We want to determine the probability that at least eight out of ten articles will meet specifications, that is, the probability that exactly eight or exactly nine or exactly ten articles will meet specifications. We sum the three binomial probabilities:
$\binom{10}{8}(0.85)^8(0.15)^2 + \binom{10}{9}(0.85)^9(0.15)^1 + \binom{10}{10}(0.85)^{10} \approx 0.820$
[On the TI-84 one can calculate 1 – binomcdf(10, .85, 7) ≈ 0.820 or binomcdf(10, .15, 2) ≈ 0.820.]
EXAMPLE 9.8
For the problem in Example 9.7, what is the probability that a shipment in which only 70% of the articles meet specifications will make it through the control check?
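Answer: $\binom{10}{8}(0.7)^8(0.3)^2 + \binom{10}{9}(0.7)^9(0.3)^1 + \binom{10}{10}(0.7)^{10} \approx 0.383$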
[Or binomcdf(10, .3, 2) ≈ 0.383.]
EXAMPLE 9.9
A grocery store manager notes that 35% of customers who buy a particular product make use of a store coupon to receive a discount. If seven people purchase the product, what is the probability that fewer than four will use a coupon?
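Answer: In this situation, “fewer than four” means zero, one, two, or three.
$(0.65)^7 + \binom{7}{1}(0.35)(0.65)^6 + \binom{7}{2}(0.35)^2(0.65)^5 + \binom{7}{3}(0.35)^3(0.65)^4 \approx 0.800$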
[Or binomcdf(7, .35, 3) ≈ 0.800.]
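The cumulative calculations in Examples 9.6 through 9.9 all sum consecutive binomial terms. A short Python sketch of that summation is given below; binom_cdf is a name chosen here to mirror the TI-84 binomcdf, and the code is an illustration rather than a required exam method.

```python
from math import comb

def binom_cdf(n, p, x):
    """P(x or fewer successes in n trials), the analogue of binomcdf(n, p, x)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

# "At least" questions use the complement: P(X >= a) = 1 - P(X <= a - 1).
at_least_one_hit = 1 - binom_cdf(5, 0.325, 0)   # Example 9.6
passes_at_85 = 1 - binom_cdf(10, 0.85, 7)       # Example 9.7
passes_at_70 = 1 - binom_cdf(10, 0.70, 7)       # Example 9.8
fewer_than_four = binom_cdf(7, 0.35, 3)         # Example 9.9

for label, value in [("9.6", at_least_one_hit), ("9.7", passes_at_85),
                     ("9.8", passes_at_70), ("9.9", fewer_than_four)]:
    print(f"Example {label}: {value:.3f}")
```

Counting failures instead of successes, as in binomcdf(10, .15, 2), gives the same result as the complement form used here.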
Sometimes we are asked to calculate the probability of each of the possible outcomes (the results should sum to 1).
EXAMPLE 9.10
If the probability that a male birth will occur is 0.51, what is the probability that a five-child family will have all boys? Exactly four boys? Exactly three boys? Exactly two boys? Exactly one boy? All girls?
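Answer:
P(5 boys) = $(0.51)^5$ ≈ 0.0345
P(4 boys) = $\binom{5}{4}(0.51)^4(0.49)$ ≈ 0.1657
P(3 boys) = $\binom{5}{3}(0.51)^3(0.49)^2$ ≈ 0.3185
P(2 boys) = $\binom{5}{2}(0.51)^2(0.49)^3$ ≈ 0.3060
P(1 boy) = $\binom{5}{1}(0.51)(0.49)^4$ ≈ 0.1470
P(0 boys) = $(0.49)^5$ ≈ 0.0282
[Or binompdf(5, .49) = {0.0345 0.1657 0.3185 0.3060 0.1470 0.0282}, which lists the probabilities of 0 through 5 girls.]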
A list such as the one in Example 9.10 shows the entire probability distribution, which in this case refers to a listing of all outcomes and their probabilities.
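A full distribution like this one can be generated in a single pass, and the requirement that the probabilities sum to 1 serves as a quick check. The Python sketch below is illustrative only; the variable names are assumptions made here.

```python
from math import comb

p_boy = 0.51   # probability of a male birth, from Example 9.10
n = 5          # children in the family

# Probability of each possible number of boys, listed from 5 down to 0.
distribution = {
    boys: comb(n, boys) * p_boy**boys * (1 - p_boy)**(n - boys)
    for boys in range(n, -1, -1)
}

for boys, probability in distribution.items():
    print(f"P({boys} boys) = {probability:.4f}")

total = sum(distribution.values())
print(f"Sum of probabilities = {total:.4f}")
```

Because the individual values are rounded when printed, the displayed figures may add to slightly more or less than 1 even though the unrounded total is 1.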
Suppose an experiment has two possible outcomes, called success and failure, with the probability of success equal to p and the probability of failure equal to q = 1 – p, and the trials are independent. Then the probability that the first success is on trial number X = k is
$q^{k-1}p$
EXAMPLE 9.11
Suppose only 12% of men in ancient Greece were honest. What is the probability that the first honest man Diogenes encounters will be the third man he meets?
Answer: $(0.88)^2(0.12)$ = 0.092928 [or geometpdf(.12, 3) = 0.092928]
What is the probability that the first honest man he encounters will be no later than the fourth man he meets?
Answer: (0.12) + (0.88)(0.12) + $(0.88)^2(0.12)$ + $(0.88)^3(0.12)$ = 0.40030464
[or geometcdf(.12, 4) = 0.40030464]
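The geometric calculations for Diogenes can be reproduced with two short functions. This is a Python sketch with names chosen here to echo the TI-84 commands; the closed form used for the cumulative probability, 1 minus the probability that the first k trials are all failures, is a standard identity rather than something stated in the text.

```python
def geomet_pdf(p, k):
    """P(first success occurs on trial k): k - 1 failures, then one success."""
    return (1 - p) ** (k - 1) * p

def geomet_cdf(p, k):
    """P(first success occurs on or before trial k) = 1 - P(k failures in a row)."""
    return 1 - (1 - p) ** k

third_man = geomet_pdf(0.12, 3)
by_fourth_man = geomet_cdf(0.12, 4)

print(f"{third_man:.6f}")
print(f"{by_fourth_man:.8f}")
```

Summing geomet_pdf(0.12, k) for k = 1 through 4 gives the same value as geomet_cdf(0.12, 4), matching the term-by-term sum in the answer.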
Instead of algebraic calculations, sometimes we can use simulation to answer probability questions.
EXAMPLE 9.12
If left alone, 70% of birthmarks gradually fade away. If ten children, 5 of each gender, are born with birthmarks, what is the probability that the same number of boys and girls will lose their birthmarks? Answer the question using simulation.
Answer: Let the digits 1–7 represent having a birthmark that fades away, and 8, 9, and 0 represent having a birthmark that doesn’t fade away. To simulate the 10 children, select 10 digits from the random number table, with the first 5 representing boys and the next 5 representing girls. Note the number of digits 1–7 in each group and see if there is a match. Underlining the digits 1–7 gives
TIP
Probabilities calculated through simulations should always be referred to as estimates or approximations.
For example, in the first set of 10 digits, there are 3 boys and 4 girls whose birthmarks fade, so no match. In the second set of 10 digits, there are 5 boys and 5 girls whose birthmarks fade, so a match. We have 3-4, 5-5, 5-3, 5-5, 4-3; 3-4, 4-5, 5-4, 2-3, 3-3; 4-3, 3-2, 4-4, 4-5, 4-3; 5-2, 3-3, 4-5, 4-4, 2-3; 4-5, 1-4, 4-3, 3-2, 3-4; 4-2, 3-5, 4-3, 4-2, 4-4; 4-4, 4-3, 5-4, 4-5, 5-5; 5-3, 4-3, 3-2, 2-3, 3-4; 5-4, 1-3, 5-5, 4-4, 3-2; 4-3, 2-4, 3-3, 4-4, 3-4.
We count 13 matches out of a possible 50, and so estimate the probability that the same number of boys and girls will lose their birthmarks to be 13/50 = 0.26.
TIP
Be able to describe simulations so that others can repeat your procedure.
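The birthmark simulation of Example 9.12 can also be run by computer. The sketch below is one possible version in Python, and it departs from the text in two labeled ways: it uses a seeded pseudorandom generator in place of the random number table, and it adds an exact calculation (assuming the boys' and girls' outcomes are independent) purely as a check on the estimate. All function names are illustrative.

```python
import random
from math import comb

def simulate_match_rate(trials, p_fade=0.7, per_group=5, seed=0):
    """Estimate P(same number of boys and girls lose their birthmarks)."""
    rng = random.Random(seed)  # fixed seed so others can repeat the procedure
    matches = 0
    for _ in range(trials):
        # Each child's birthmark fades with probability p_fade.
        boys = sum(rng.random() < p_fade for _ in range(per_group))
        girls = sum(rng.random() < p_fade for _ in range(per_group))
        if boys == girls:
            matches += 1
    return matches / trials

def exact_match_probability(p_fade=0.7, per_group=5):
    """Sum over k of P(k boys fade) * P(k girls fade), assuming independence."""
    pmf = [comb(per_group, k) * p_fade**k * (1 - p_fade)**(per_group - k)
           for k in range(per_group + 1)]
    return sum(q * q for q in pmf)

estimate = simulate_match_rate(100_000)
exact = exact_match_probability()
print(f"simulated estimate: {estimate:.3f}")
print(f"exact value:        {exact:.4f}")
```

Under the independence assumption the exact value is about 0.272, so the table-based estimate of 0.26 from 50 trials is reasonably close.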
EXAMPLE 9.13
Babe Ruth had a career batting average of 0.342, quite impressive for a home run hitter! Use simulation to estimate the probability that his first hit in a game is on the first at-bat? On the second at-bat? Not until the third at-bat? Fourth at-bat?
Answer: Using the random number table from the previous example, read off 3 digits at a time, with 001–342 representing a hit and anything else, not a hit, and starting all over again every time there is a hit. So, for example, reading off the first line gives
869-619-414-146 is a first hit on the fourth at-bat,
633-325 is a first hit on the second at-bat,
526-445-234 is a first hit on the third at-bat,
882-462-662-337 is a first hit on the fourth at-bat,
169 is a first hit on the first at-bat, and
214 is also a first hit on the first at-bat.
Continuing in this fashion and tabulating the results, we have
Number of the at-bat | Frequency | Estimated probability |
1 | 22 | 22/64 = 0.344 |
2 | 14 | 14/64 = 0.219 |
3 | 13 | 13/64 = 0.203 |
4 | 6 | 6/64 = 0.094 |
Over 4 | 9 | 9/64 = 0.141 |
Total | 64 | |
The actual probabilities are 0.342, (0.658)(0.342) = 0.225, $(0.658)^2(0.342)$ = 0.148, $(0.658)^3(0.342)$ = 0.097, and 1 – (0.342 + 0.225 + 0.148 + 0.097) = 0.188. [On the TI-84, one can calculate these geometric probabilities: geometpdf(0.342,1), …, geometpdf(0.342,4), and 1 – geometcdf(0.342,4).]
In performing a simulation, you must:
1. Set up a correspondence between outcomes and random numbers.
2. Give a procedure for choosing the random numbers (for example, pick three digits at a time from a designated row in a random number table).
3. Give a stopping rule.
4. Note what is to be counted (what is the purpose of the simulation), and give the count if requested.
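The four steps can be seen in a computer version of the Babe Ruth simulation from Example 9.13. This Python sketch is an illustration under stated assumptions: a seeded pseudorandom generator stands in for the random number table, and the function and parameter names are invented here.

```python
import random
from collections import Counter

def simulate_first_hit(games, hit_codes=342, seed=1):
    """Estimate the distribution of the at-bat on which the first hit occurs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(games):
        at_bat = 1
        # Step 1 (correspondence): three-digit numbers 001-342 are hits,
        #   all other three-digit numbers (343-999 and 000) are not.
        # Step 2 (procedure): draw one three-digit number per at-bat.
        # Step 3 (stopping rule): stop at the first hit.
        while not (1 <= rng.randint(0, 999) <= hit_codes):
            at_bat += 1
        # Step 4 (what is counted): the at-bat number of the first hit.
        counts[at_bat if at_bat <= 4 else "Over 4"] += 1
    return {key: counts[key] / games for key in (1, 2, 3, 4, "Over 4")}

estimates = simulate_first_hit(100_000)
for key, value in estimates.items():
    print(f"{key}: {value:.3f}")
```

With many more repetitions than the 64 in the table, the estimates should settle close to the geometric probabilities, although they remain estimates.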
DISCRETE RANDOM VARIABLES, MEANS (EXPECTED VALUES), AND STANDARD DEVIATIONS
Often each outcome of an experiment has not only an associated probability but also an associated real number. For example, the probability may be 0.5 that a student is taking 0 AP classes, 0.3 that he/she is taking 1 AP class, and 0.2 that he/she is taking 2 AP classes. If X represents the different numbers associated with the potential outcomes of some chance situation, we call X a random variable.
While the mean of the set {2, 7, 12, 15} is (2 + 7 + 12 + 15)/4 = 9 or 2(¼) + 7(¼) + 12(¼) + 15(¼) = 9, the expected value or mean of a random variable takes into account that the various outcomes may not be equally likely. The expected value or mean of a random variable X is given by $\mu_X = E(X) = \sum x\,P(x)$, where P(x) is the probability of outcome x. [We also write $\mu_X = \sum x_i p_i$.]
EXAMPLE 9.14
In a lottery, 10,000 tickets are sold at $1 each with a prize of $7500 for one winner. What is the average result for each bettor?
Answer: The actual winning payoff is $7499 because the winner paid $1 for a ticket, so we have
E(X) = 7499(1/10,000) + (–1)(9,999/10,000) = –0.25
Thus the average result for each person betting on the lottery is a $0.25 loss. Alternatively, we can say that the expected payoff to the lottery system is $0.25 for each ticket sold.
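The expected value is a probability-weighted sum, which makes it simple to compute once the distribution is written down. The Python sketch below applies it to the lottery of Example 9.14 and to the AP-class distribution mentioned earlier in this section; the function name and the dictionary layout are illustrative choices made here.

```python
def expected_value(distribution):
    """Mean of a discrete random variable given as {outcome: probability}."""
    return sum(x * p for x, p in distribution.items())

# Example 9.14: net result per ticket (win $7499 once in 10,000, otherwise lose $1).
lottery = {7499: 1 / 10_000, -1: 9_999 / 10_000}

# Number of AP classes taken: 0, 1, or 2 with probabilities 0.5, 0.3, 0.2.
ap_classes = {0: 0.5, 1: 0.3, 2: 0.2}

lottery_mean = expected_value(lottery)
ap_mean = expected_value(ap_classes)
print(f"lottery: {lottery_mean:.2f}")
print(f"AP classes: {ap_mean:.1f}")
```

The lottery mean reproduces the $0.25 average loss per ticket, and the AP-class mean works out to 0(0.5) + 1(0.3) + 2(0.2) = 0.7 classes.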
Suppose we have a binomial random variable, that is, a random variable whose values are the numbers of “successes” in some binomial probability distribution.
EXAMPLE 9.15
Of the automobiles produced at a particular plant, 40% had a certain defect. Suppose a company purchases five of these cars. What is the expected value for the number of cars with defects?
Answer: We might guess that the average or mean or expected value is 40% of 5 = 0.4 × 5 = 2, but let’s calculate from the definition. L