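Notation
Summary Measure | Parameter | Statistic |
Proportion | π | p̂ |
Mean | μ | x̄ |
Standard Deviation | σ | s |
Correlation Coefficient | ρ | r |
Slope | β₁ | b₁ |
Y-intercept | β₀ | b₀ |
Formulas
$R^2 = r^2 = \dfrac{s_y^2 - s_{RES}^2}{s_y^2}$
$IQR = Q_3 - Q_1$
1.5 × IQR rule: $Q_3 + (1.5 \times IQR)$ and $Q_1 - (1.5 \times IQR)$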
- In 2005, Alcohol and Drug Services of Gallatin County randomly sampled 789 DUI offenders. The average age of those sampled was 29.3 years. Of those DUI offenders, 22.94% were current MSU students. The average age of the MSU DUI offenders was 21.7 years, with 46.41% of MSU DUI offenders being under the age of 21. The average age of a non-MSU DUI offender was 31.7 years, with 13.16% of non-MSU DUI offenders being under the age of 21. Researchers would like to use these data to determine whether MSU student status can predict whether a DUI offender is under the age of 21, among all DUI offenders in Gallatin County.
- What are the observational units?
- List the variables in this data set and identify whether each variable is categorical or quantitative.
- Of the list of variables, which would be an explanatory variable, and which would be a response variable?
- Is the value 29.3 a statistic or a parameter? What is its appropriate notation?
- Is the value 0.1316 a statistic or a parameter? What is its appropriate notation?
- What is the target population?
- Is this an observational study or a randomized experiment? Explain.
- Which types of sampling bias may be present in this study? Explain.
- Draw a segmented bar plot of whether the DUI offender is under the age of 21 segmented by whether the DUI offender is an MSU student.
- Based on your segmented bar plot, is there an association between whether the DUI offender was an MSU student and whether the DUI offender is under the age of 21? Explain.
Define events: A = MSU student
B = Under the age of 21
- Identify what each numerical value given in the problem represents in probability notation.
0.2294 =
0.4641 =
0.1316 =
- Fill in the following hypothetical two-way table to represent the situation.
 | MSU Student | Non-MSU Student | Total |
Underage | | | |
Not Underage | | | |
Total | | | 100,000 |
Use your table from part l. to answer questions m. through t.
- What is the probability that a randomly selected DUI offender is not an MSU student? What is the notation used for this probability?
- What is the probability that a randomly selected DUI offender is underage? What is the notation used for this probability?
- What is the probability that a randomly selected DUI offender is not underage? What is the notation used for this probability?
- What is the probability that a randomly selected underage DUI offender is an MSU student? What is the notation used for this probability?
- What is the probability that a randomly selected DUI offender who is not underage is an MSU student? What is the notation used for this probability?
- If a randomly selected DUI offender is not underage, what is the probability they are an MSU student? What is the notation used for this probability?
- What is the probability that a randomly selected DUI offender is not an MSU student and is not underage? What is the notation used for this probability?
- What is the probability that a randomly selected DUI offender is an MSU student and is not underage? What is the notation used for this probability?
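One way to check a completed hypothetical two-way table is to build it from the three percentages given in the problem. The sketch below uses Python rather than R, and the variable names are illustrative only. It assumes the table is scaled to 100,000 offenders, as in the blank table, and uses the events A = MSU student and B = under the age of 21.

```python
# Hypothetical two-way table for the DUI problem, scaled to 100,000 offenders.
total = 100_000
p_msu = 0.2294               # P(A): offender is an MSU student
p_under_given_msu = 0.4641   # P(B | A): under 21 among MSU offenders
p_under_given_non = 0.1316   # P(B | not A): under 21 among non-MSU offenders

msu = total * p_msu
non_msu = total - msu

table = {
    "MSU": {"Underage": msu * p_under_given_msu},
    "Non-MSU": {"Underage": non_msu * p_under_given_non},
}
for group, size in (("MSU", msu), ("Non-MSU", non_msu)):
    table[group]["Not Underage"] = size - table[group]["Underage"]
    table[group]["Total"] = size

# Marginal and conditional probabilities read off the table
underage_total = table["MSU"]["Underage"] + table["Non-MSU"]["Underage"]
p_underage = underage_total / total                               # P(B)
p_msu_given_underage = table["MSU"]["Underage"] / underage_total  # P(A | B)

for group, row in table.items():
    print(group, {cell: round(count) for cell, count in row.items()})
print("P(B) =", round(p_underage, 4))
print("P(A | B) =", round(p_msu_given_underage, 4))
```

The remaining probabilities in the questions above can be read from the same dictionary by dividing the relevant cell by the appropriate row, column, or overall total.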
- Is there an association between the mother’s age and the amount of weight she gains during pregnancy? The following variables were collected on a random sample of 100 births for babies in North Carolina where the mother was not a smoker and another 50 where the mother was a smoker.
Variable | Description |
f_age | Father’s age (years) |
m_age | Mother’s age (years) |
weeks | Weeks at which the mother gave birth |
premature | Indicates whether the baby was premature or not |
visits | Number of hospital visits |
gained | Weight gained by mother (lbs) |
weight | Birth weight of the baby (lbs) |
sex_baby | Gender of the baby |
smoke | Whether or not the mother was a smoker |
- Identify the observational units.
- Which of the variables measured are categorical?
- Which of the variables measured are quantitative?
- Name one type of graph you could create to examine the association between the variables “premature” and “weight”.
- Name one type of graph you could create to examine the distribution of the variable “f_age”.
- For the research question presented in this problem, which variable is the explanatory variable and which is the response variable?
- What is the target population?
- Is this an observational study or a randomized experiment? Explain.
- If we find an association between a mother’s age and the weight gained during her pregnancy, can we conclude that changes in age cause changes in weight gain? Explain.
- To which population can we generalize the results from this study? Explain.
- Which types of sampling bias may be present in this study? Explain.
- Identify a potential confounding variable and explain why that variable meets the definition of a confounding variable.
A scatterplot of the amount of weight a woman gained during pregnancy and the woman’s age at the time of her child’s birth is below. The regression line has been added.
- Write a paragraph describing the four features of the scatterplot between a mother’s age and the amount of weight she gained during pregnancy.
- Based on the scatterplot, does there appear to be an association between a mother’s age and the amount of weight she gained during pregnancy? Explain.
Regression output from R is below.
 | Estimate | Std. Error | t value | Pr(>|t|) |
(Intercept) | 35.568 | 5.689 | 6.252 | 0.000 |
m_age | -0.117 | 0.209 | -0.562 | 0.575 |
R-squared: 0.002161
- Is the value -0.117 in the regression output a statistic or a parameter? What is its proper notation?
- Is the correlation between a mother’s age and the amount of weight she gained during pregnancy among all North Carolina births a statistic or a parameter? What is its proper notation?
- Use the linear model output above to write the least squares line in proper statistical notation.
- Interpret the value of slope in context of the problem. What is its proper notation?
- Interpret the value of R² in context of the problem.
- What is the value of the sample correlation between a mother’s age and the amount of weight she gained during pregnancy? What is its proper notation?
- Predict a value of the mother’s weight gain during pregnancy for mothers of age: 20, 25, and 40.
- The (age, weight gain) values for three mothers are: (27, 5), (31, 42), and (33, 25). Calculate the residuals for these three observations.
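The prediction, residual, and correlation questions above all follow from the fitted line. The following is a minimal sketch in Python (not the R session that produced the output); the function names `predict` and `residual` are made up for illustration, and the coefficients are copied from the regression output.

```python
import math

b0, b1 = 35.568, -0.117      # intercept and slope from the regression output
r_squared = 0.002161         # R-squared from the regression output

def predict(age):
    """Predicted weight gain (lbs) for a mother of the given age (years)."""
    return b0 + b1 * age

def residual(age, gained):
    """Observed weight gain minus predicted weight gain."""
    return gained - predict(age)

# The sample correlation has the same sign as the slope.
r = math.copysign(math.sqrt(r_squared), b1)

for age in (20, 25, 40):
    print(f"age {age}: predicted gain = {predict(age):.3f} lbs")
for age, gained in ((27, 5), (31, 42), (33, 25)):
    print(f"({age}, {gained}): residual = {residual(age, gained):.3f} lbs")
print(f"r = {r:.4f}")
```

A negative residual means the line overestimated the observed weight gain; a positive residual means it underestimated.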
Consider the scatterplot below.
- Identify the variables shown in the plot and their types.
- Does adding the variable “smoking status” affect the relationship between the mother’s age and the amount of weight gained? Explain.
- Does money make you happy? A social research group interested in the relationship between income and happiness surveyed 500 US adults with incomes from $15K to $75K and asked them to rank their happiness on a scale from 1 (lowest) to 10 (highest). A scatterplot of these data is shown below.
- What are the observational units?
- What is the sample size? What is its proper notation?
- What is the target population?
- What is the explanatory variable? Is it categorical or quantitative?
- What is the response variable? Is it categorical or quantitative?
- Based on the scatterplot, does there appear to be an association between happiness and income? Explain.
- Write a paragraph describing the four features of the scatterplot between happiness score and income.
- What type of study design was used?
- What is the scope of inference for this study?
Regression output from R is below.
 | Estimate | Std. Error | t value | Pr(>|t|) |
(Intercept) | 0.204 | 0.089 | 2.299 | 0.022 |
income | 0.714 | 0.019 | 38.505 | 0.000 |
- Use the linear model output above to write the least squares line in proper statistical notation.
- Is the value 0.204 a statistic or a parameter? What is its proper notation?
- Interpret the value of slope in context of the problem.
- Predict the happiness score for a person with an income of $67,000.
- Find the residual for a person with a happiness score of 5.38 and an income of $67,000. Did the regression line overestimate or underestimate the happiness score for this person?
- The sample variance of happiness is $s_y^2 = 2.053$. The variance of the residuals is $s_{RES}^2 = 0.757$. What is the value of the coefficient of determination? Interpret this value in context of the problem.
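The coefficient of determination can be computed from the two variances in the last question as the proportion of the variance in happiness that is not left over in the residuals. This is a short Python sketch of that arithmetic; the variable names are illustrative.

```python
var_y = 2.053        # sample variance of happiness scores
var_resid = 0.757    # variance of the residuals from the regression

# R^2 = (variance of response - variance of residuals) / variance of response
r_squared = (var_y - var_resid) / var_y
print(f"R^2 = {r_squared:.3f}")
```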
- Data were collected by the Planet Money podcast to test a theory about crowd-sourcing. Planet Money had a post on their website with pictures of Penelope, the cow, and asked people to guess how much she weighed (in pounds). Over 17,000 people gave responses. Penelope’s actual weight was 1,355 pounds.
The following summary statistics and plots were created with the data in R.
> favstats(weight)
min | Q1 | median | Q3 | max | mean | sd | n |
1 | 907.5 | 1245 | 1542 | 14555 | 1287.083 | 622.2028 | 17184 |
- What is the sample size?
- What are the observational units?
- Using the plots above, write a paragraph describing the distribution of guessed weight, including all four features that we look for in a histogram.
- Which is the better measure of center for this data? Explain your answer.
- Calculate the IQR. Interpret this value in context of the problem.
- Identify the standard deviation and interpret this value. What is its proper notation?
- Identify one of the quartile values and interpret that value in context of the problem.
- To which population can you generalize these results?
- Which types of sampling bias may be present in this study? Explain.
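The IQR question, and the choice of a measure of center, can be checked against the 1.5 × IQR rule using the summary statistics above. This is a small Python sketch with the values copied from the favstats output; the fence names are illustrative.

```python
# Quartiles and extremes taken from the favstats output for the guesses (lbs)
q1, q3 = 907.5, 1542.0
minimum, maximum = 1.0, 14555.0

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr} lbs")
print(f"fences: ({lower_fence}, {upper_fence})")
print("low outliers present: ", minimum < lower_fence)
print("high outliers present:", maximum > upper_fence)
```

Guesses beyond a fence are flagged as potential outliers, which is relevant when deciding between the mean and the median.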
- Cinnamon has been used for over 5,000 years in holistic medicine to treat a variety of digestive issues, including nausea, gas, diarrhea, and bad breath, all of which have also been speculated to be related to poor glucose management by the body. Cinnamon has also been linked to a decrease in blood glucose levels, which could prove helpful for those suffering from Type II diabetes, where blood glucose levels are too high.
Researchers interested in exploring whether cinnamon can reduce blood glucose levels better than diet changes alone randomly assigned 18 volunteers to take either a 1000 mg dose of cassia cinnamon or a placebo for 9 weeks, starting after a three-week control period. The participants were Caucasian, between the ages of sixty and seventy, with untreated Type II diabetes. All participants were instructed to follow a diabetic diet and to maintain their pre-study, normal activity levels. After a total of 12 weeks had passed, the change in blood sugar level (12-week measurement minus baseline measurement) was recorded for each participant. Note: this means negative values indicate decreases in blood sugar levels.
Summary statistics for the two samples are shown in the R output below.
> favstats(Difference~Treatment, data=cinnamon)
Treatment | min | Q1 | median | Q3 | max | mean | sd | n |
1 Cinnamon | -37.1 | -28.4 | -25.3 | -21.3 | -14.6 | -24.5333 | 6.7011 | 9 |
2 Placebo | 1.2 | 2.5 | 7.2 | 7.9 | 13.3 | 6.1778 | 3.8124 | 9 |
- What are the observational units?
- What variable(s) are recorded? What type is each (categorical or quantitative)?
- Identify the explanatory variable.
- Identify the response variable.
- What is the target population?
- What is the sample?
- What is the study design? Explain your reasoning.
- What is the scope of inference for this study?
- Explain the purpose of random assignment in this study.
- To which population can we generalize the results?
- Is the true difference in mean change in blood sugar levels for the two treatments a parameter or a statistic? What is its notation?
- Is the sample standard deviation in change in blood sugar levels for the Cinnamon treatment group a parameter or a statistic? What is its notation?
- Write a sentence interpreting the standard deviation of change in blood sugar levels for the Cinnamon group.
- Write a sentence interpreting the third quartile of change in blood sugar levels for the Cinnamon group.
- Write a sentence interpreting the median of change in blood sugar levels for the Placebo group.
- Write a sentence interpreting the first quartile of change in blood sugar levels for the Placebo group.
- Which group has the larger interquartile range (IQR), Cinnamon or Placebo?
- What type of plot could be used to examine the relationship between change in blood glucose levels and treatment group?
- The changes in blood sugar levels for the nine individuals in the Placebo group are (in increasing order): 1.2, 2.0, 2.5, 5.8, 7.2, 7.7, 7.9, 8.0, 13.3. Use this information and the summary statistics provided above to draw a boxplot of changes in blood sugar levels for the Placebo group. Include a scale and label the five number summary on the boxplot.
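Before drawing the boxplot, it can help to confirm the five-number summary and check for outliers. The sketch below is in Python rather than R. It uses `statistics.quantiles` with `method="inclusive"`, which for these nine values reproduces the quartiles in the favstats output. The Cinnamon quartiles are copied from that output for the IQR comparison.

```python
import statistics

placebo = [1.2, 2.0, 2.5, 5.8, 7.2, 7.7, 7.9, 8.0, 13.3]

# Quartiles of the Placebo group, then the five-number summary
q1, median, q3 = statistics.quantiles(placebo, n=4, method="inclusive")
five_number = (min(placebo), q1, median, q3, max(placebo))

# 1.5 x IQR rule: values beyond the fences would be plotted as outliers
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in placebo if x < lower_fence or x > upper_fence]

# IQR of the Cinnamon group from its reported quartiles (Q3 - Q1)
cinnamon_iqr = -21.3 - (-28.4)

print("five-number summary:", five_number)
print(f"Placebo IQR = {iqr:.1f}, Cinnamon IQR = {cinnamon_iqr:.1f}")
print("outliers:", outliers)
```

If `outliers` is empty, the whiskers of the boxplot extend to the minimum and maximum.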
A histogram of change in blood sugar levels for the Cinnamon group is shown below.
- Using the histogram above, write a paragraph describing the distribution of change in blood sugar levels for the Cinnamon group, including all four features that we look for in a histogram.
- Do storks bring babies? During the ten or twelve years following World War II, the populations of most western European cities steadily grew as a result of migrations from surrounding rural areas. There was also that spurt of fecundity known as the post-war baby boom. Data from the city of Copenhagen for each of the ten years following World War II show a correlation coefficient of +0.85 between (i) the annual number of storks nesting in the city, and (ii) the annual number of human babies born in the city. As population increased, there were more people to have babies, and therefore more babies were born. Also as population increased, there was more building construction to accommodate it, which in turn provided more nesting places for storks; hence increasing numbers of storks.
- Identify the observational units.
- What is the sample size? What is its proper notation?
- What variable(s) were recorded? What type is each (categorical or quantitative)?
- Which variable is the explanatory variable? Which is the response variable?
- Which type of plot is most appropriate to display the relationship between annual number of storks in the city and annual number of human babies born in the city?
- Are these data from an observational study or an experiment? How do you know?
- What is the scope of inference for this study?
- Based on this information, identify a possible confounding variable in the study. Explain why it meets the definition of a confounding variable.
- Is there an association between the annual number of storks nesting in the city, and the annual number of human babies born in the city? If so, is it positive or negative? Explain how you know.
- What is the value of R²? Interpret this value in context of the problem.
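For a simple linear regression with one explanatory variable, the coefficient of determination is the square of the correlation coefficient. Here is a one-step Python sketch using the correlation reported for the Copenhagen data.

```python
r = 0.85             # correlation between storks nesting and babies born
r_squared = r ** 2   # coefficient of determination
print(f"R^2 = {r_squared:.4f}")
```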