Statistics for Data Analysis and Regression

Problem 1:

Upload the dataset sleep.csv.

The data come from the American Time Use Survey. Each row represents a U.S. resident who was randomly sampled. People selected for the sample were called on a randomly chosen day and participated in an extensive interview designed to determine how they spent their time in the previous 24 hours. Your data are a random sample of n = 999 rows from the original data set (which contains tens of thousands of rows) and include only a selected subset of the variables.

Our interest in this problem is to understand what factors influence the amount of sleep we get each night, at least in terms of the few variables provided here that might address that issue.

1) Perform a two-sample t-test to test whether or not the mean number of minutes of sleep differs on holidays compared to other days:

a) What is the value of the t-statistic?
b) What is the p-value?
c) State your conclusion in the context of the data. Use a significance level of 5%.
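
As a rough illustration of the mechanics behind question 1, the sketch below computes a Welch (unequal-variance) two-sample t-statistic by hand. The numbers are made up and are not from sleep.csv, the function name `welch_t_test` is our own, and the p-value uses a normal approximation that is only reasonable when the degrees of freedom are large.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(x, y):
    """Return (t, df, approx_p) for a two-sided Welch two-sample t-test."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny  # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    # Normal approximation to the t distribution; reasonable only for large df
    approx_p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, approx_p


# Made-up minutes of sleep, NOT taken from sleep.csv
holiday_sleep = [540, 600, 480, 570, 630, 510]
other_sleep = [480, 450, 510, 420, 495, 465, 525, 440]

t_stat, df, p_approx = welch_t_test(holiday_sleep, other_sleep)
print(f"t = {t_stat:.3f}, df = {df:.1f}, approximate p = {p_approx:.4f}")
```

With samples this small the approximate p-value is only indicative. If SciPy is available, `scipy.stats.ttest_ind(x, y, equal_var=False)` computes the same statistic with a t-based p-value.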

2) Fit a model that uses only the day of the week (day) to predict the number of minutes of sleep.

a) What is the mean number of minutes of sleep on Friday?
b) Interpret the estimate for daySunday.
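
The coefficient name daySunday suggests treatment (dummy) coding, in which one level of day serves as the baseline and every other level gets its own coefficient. A minimal sketch with made-up data, assuming the alphabetically first level is the baseline: with a single categorical predictor, the least-squares intercept is the baseline group's mean and each coefficient is a difference of group means.

```python
from statistics import mean

# Made-up (day, minutes of sleep) pairs, NOT taken from sleep.csv
rows = [
    ("Friday", 480), ("Friday", 500),
    ("Monday", 470), ("Monday", 450),
    ("Sunday", 540), ("Sunday", 560),
]

# Group the response by level of the categorical predictor
by_day = {}
for day, minutes in rows:
    by_day.setdefault(day, []).append(minutes)

group_means = {day: mean(values) for day, values in by_day.items()}

# Treatment coding: the alphabetically first level is the baseline (assumption)
baseline = min(group_means)
intercept = group_means[baseline]
coefficients = {
    f"day{day}": m - intercept for day, m in group_means.items() if day != baseline
}

print("baseline:", baseline, "intercept:", intercept)
print(coefficients)
```

Check which level your software actually uses as the baseline before interpreting the intercept.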

3) Fit a full model using only age, eating, socializing, and homework to predict sleep.

Do not include categorical variables.

Comment on the model validity and support your comments with appropriate diagnostic plots. (Note we are not asking you to fix any problems you find; just note whether or not a problem exists.)

4) Do older people tend to sleep less, all other things being equal? Assume the model from question 3 is valid. State the value of the coefficient for age and provide the p-value and use these to answer the question. Use a 5% significance level.

5) Consider a transformation of the response variable and none of the predictor variables. Again, use only the age, eating, socializing, and homework variables as predictors.

a) What transformation does the inverse response plot suggest? Provide the plot and explain your answer.

b) What transformation does the Box-Cox transform suggest? Again, provide evidence based on your output.
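
For reference, the Box-Cox family of power transformations for a strictly positive response y is usually written as below. The symbol λ for the power is our notation, not the assignment's.

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\log y, & \lambda = 0.
\end{cases}
```

A "rounded" power typically means the nearest convenient value to the estimated λ (for example -1, 0, 1/2, or 1) that still lies within the reported confidence interval.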

6) Transform the sleep variable using the “rounded” power suggested by the Box-Cox method and fit the model using this transformed variable. Again, use only the age, eating, socializing, and homework variables as predictors.

In your opinion, did this improve the model? Explain in 1-2 sentences.

7) Is collinearity an issue with these predictors (untransformed)? Explain why or why not.
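
One common way to assess collinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below does this in plain Python on made-up values, not on sleep.csv. The helper names `solve`, `r_squared`, and `vifs` are our own, and the usual cutoffs (a VIF above roughly 5 or 10) are rules of thumb rather than formal tests.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 from an OLS regression of y on the predictor columns plus an intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


def vifs(columns):
    """Variance inflation factor for each named predictor column."""
    out = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        out[name] = 1 / (1 - r_squared(col, others))
    return out


# Made-up predictor values, NOT taken from sleep.csv
predictors = {
    "age": [23, 35, 47, 52, 61, 29, 44, 38],
    "eating": [60, 45, 90, 75, 80, 30, 65, 50],
    "socializing": [120, 0, 30, 60, 15, 200, 45, 90],
    "homework": [0, 30, 0, 0, 0, 90, 0, 15],
}
for name, value in vifs(predictors).items():
    print(f"{name}: VIF = {value:.2f}")
```

A VIF of exactly 1 means a predictor is uncorrelated with the others; larger values indicate that its coefficient's variance is inflated by overlap with the other predictors.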

8) For the sake of simplicity, once again fit a model with the untransformed response variable. Use as predictors age, socializing, homework, day, and holiday. (No interactions.) Perform best subsets regression. We will assume that all models are valid.

a) How many variables should be in the “best” model? Explain how you reached this decision.

b) State the BIC for your best model.
c) Which variables are in the best model?
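
As a reminder of how BIC trades fit against model size, here is a minimal sketch for least-squares models. The residual sums of squares are hypothetical, not computed from sleep.csv, and the function name `gaussian_bic` is our own. Software packages may report BIC on a scale that differs from this one by an additive constant, which leaves the ranking of models fitted to the same data unchanged.

```python
import math


def gaussian_bic(rss, n, num_coefficients):
    """BIC of a least-squares fit, up to an additive constant shared by all models.

    num_coefficients counts every estimated regression coefficient, including
    the intercept. Lower values are preferred.
    """
    return n * math.log(rss / n) + num_coefficients * math.log(n)


# Hypothetical residual sums of squares for candidate models, n = 999
n = 999
candidates = {
    ("age",): 1.10e7,
    ("age", "socializing"): 1.05e7,
    ("age", "socializing", "homework"): 1.049e7,
}
scores = {
    model: gaussian_bic(rss, n, len(model) + 1) for model, rss in candidates.items()
}
best = min(scores, key=scores.get)
print(best, round(scores[best], 1))
```

In this toy example the third variable reduces the residual sum of squares too little to offset the log(n) penalty, so the two-variable model has the lowest BIC.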

9) Use your final model to predict the number of minutes of sleep on a Sunday that is not a holiday, for John, who is average on all continuous variables in your model.

Your prediction should take the form of an interval at a 95% confidence level.
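
For reference, the standard 95% prediction interval for a single new observation from a linear model can be written as below. The notation is ours: X is the design matrix including the intercept column, p is the number of non-intercept columns, and x_0 is the predictor vector for the new case (a leading 1, the means of the continuous predictors, and the indicator values for the chosen day and holiday status).

```latex
\hat{y}_0 \;\pm\; t_{0.975,\; n-p-1}\; s \,\sqrt{1 + x_0^{\top} (X^{\top} X)^{-1} x_0},
\qquad s^2 = \frac{\mathrm{RSS}}{n-p-1}
```

The leading 1 under the square root accounts for the variability of an individual observation; dropping it gives the narrower confidence interval for the mean response.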

Problem 2:

The following questions refer to data collected to examine the factors that drive auto insurance rates (source: National Highway Traffic Safety Administration). The data are at the state level: each observation is a state. The output you will need is provided in a separate file, analysis.pdf.

The variables are:

state: states in the U.S. plus Washington, D.C.
num_drivers: number of drivers involved in fatal collisions per billion miles
perc_speeding: percentage of drivers involved in fatal collisions who were speeding
perc_alcohol: percentage of drivers involved in fatal collisions who were alcohol-impaired
perc_not_distracted: percentage of drivers involved in fatal collisions who were not distracted
perc_no_previous: percentage of drivers involved in fatal collisions who had not been in previous accidents
insurance_premiums: car insurance costs per driver in dollars
losses: amount of money insurance companies paid for collisions per insured driver

Which of the following is the best interpretation of the slope for perc_speeding (using the untransformed model)?

a) Controlling for the number of drivers involved in fatal collisions per billion miles, states with an additional percentage point of drivers in fatal collisions who were speeding pay, on average, an additional premium of about $0.81.
b) Controlling for all other variables in the model, states with an additional percentage point of drivers in fatal collisions who were speeding pay, on average, an additional premium of $0.81.
c) Controlling for all other variables in the model, states with an additional percentage point of drivers in fatal collisions are charged an additional premium of $0.81.
d) Controlling for the number of drivers involved in fatal collisions per billion miles, states with an additional percentage point of drivers in fatal collisions who were speeding pay an additional premium of $0.81.

Answer:

Suppose we examine a 95% confidence interval for the slope for perc_alcohol.

Which of the following is true?

a) The interval will contain 0.
b) The interval will not contain 0.
c) The interval will not contain 0 for 95% of all samples drawn from this population.
d) The interval will contain 0 for 95% of all samples drawn from this population.

Answer:

Consider the F-test given in the summary output for the model. (Shown below, but you can also find it in the output section.) Write the null and alternative hypotheses that this F-statistic is testing:

F-statistic: 5.658 on 6 and 44 DF, p-value: 0.0002014

a) Null hypothesis:

b) Alternative hypothesis:

c) What do we conclude from this F-statistic based on the p-value? (Assume the model is valid.)
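
As a consistency check on the quoted summary line, the degrees of freedom and the F-statistic determine both the sample size and the model's R^2. The short sketch below only rearranges the standard overall F-test formula; the variable names are our own.

```python
# Values copied from the summary line quoted above
f_stat = 5.658
df_model = 6      # numerator df: number of slope coefficients tested
df_resid = 44     # denominator df: n - number of slopes - 1

# Sample size implied by the degrees of freedom
n = df_model + df_resid + 1

# Invert F = (R^2 / df_model) / ((1 - R^2) / df_resid) to recover R^2
r_squared = f_stat * df_model / (f_stat * df_model + df_resid)

print(f"n = {n}, implied R^2 = {r_squared:.4f}")
```

The implied sample size agrees with the data description: six predictors and 51 observations (50 states plus Washington, D.C.).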

The mean of losses is $134.49. The 95% confidence interval for the mean insurance premium in states with losses of $134.49 is (847, 926). If we calculate a 95% confidence interval for any other amount of losses, the width of the new confidence interval will be:

a) wider
b) smaller
c) the same
d) it depends.

Answer:

Here is the output from the Box-Cox power transform:

What does this output suggest? Choose the best answer:

a) It is best to not do a transformation.
b) None of the predictors are associated with the response variable.
c) It is best to do a log transform.
d) It is best to do some sort of transformation.

Answer:

The output below shows two different ANOVA tables for the same linear model. Explain why the output is the same in both tables for the num_drivers variable but different for the losses variable.

Explain:

Last Updated on March 10, 2020