Exploratory data analysis (EDA)

Stat 311 Homework 2

This assignment has some problems related to Lesson 2 and emphasizes exploratory data analysis (EDA): visualization and numeric summaries for qualitative and quantitative data. We recommend that you create a new folder for this assignment.

Download the data files and HW2Template.Rmd to this folder before you begin. Check out the two .Rmd files that appear on the Lesson 2 Presentations page; they contain code examples for several types of summaries that were presented in the lectures.

To complete the coding parts of the assignment, copy, paste, and edit code from either CategoricalData.Rmd or QuantitativeData.Rmd as needed. Upload your final PDF file to Gradescope [do not forget to identify the page numbers for each part of each problem according to the outline].

Problems 1 – 3 do not require any code. Simply type your answers into the .Rmd file. Problems 4 and 5 require the use of R code.

To reinforce the concepts in the Lesson 2 lectures and for extra practice with R commands, I recommend that you try some of the OpenIntro tutorials that I linked on the Readings page for Lesson 2.

Problems

1. For each part, compare the distributions A and B based on means/SDs or medians/IQRs. Do not show any calculations; you do not need to calculate anything to answer this question. Simply by looking at the numbers, state how the means/SDs or medians/IQRs compare. Make sure to explain your reasoning.

a) Compare the means/SDs for A: −40, 0, 0, 0, 15, 25, 30, 30 and B: −20, 0, 0, 0, 15, 25, 30, 30

b) Compare the means/SDs for A: 0, 50, 300, 550, 600 and B: 100, 200, 300, 400, 500

c) Compare the medians/IQRs for A: 0, 100, 500, 600, 1000 and B: 0, 10, 50, 60, 100

d) Compare the medians/IQRs for A: 6, 7, 8, 9, 10 and B: 1, 2, 3, 4, 5

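Problem 1 asks for reasoning rather than computation, but once you have written your answers, the same kind of comparison can be checked in R. The sketch below is only an illustration: the vectors and the helper `summarize()` are hypothetical and are not the homework distributions. It shows how shifting or rescaling a toy vector affects the base R functions `mean()`, `sd()`, `median()`, and `IQR()`.

```r
# Hypothetical toy vectors (made up; NOT the distributions A and B above)
x       <- c(2, 4, 6, 8, 10)
shifted <- x + 3   # every value moved up by 3
scaled  <- x * 2   # every value doubled

# Hypothetical helper: collect center and spread measures for one vector
summarize <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), iqr = IQR(v))
}

# One row per vector, one column per statistic
stats <- rbind(x = summarize(x),
               shifted = summarize(shifted),
               scaled = summarize(scaled))
print(round(stats, 2))
```

Comparing the rows of `stats` shows which statistics respond to a shift and which respond to a change of scale.
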
2. Workers at a particular mining site receive an average of 35 days of paid vacation, which is lower than the national average. The manager of this site is under pressure from a local union to increase the amount of paid time off.

However, he does not want to give more days off to the workers because that would be costly. Instead, he decides he should fire 10 employees in such a way as to raise the average number of days off that are reported by his employees.

To achieve this goal, should he fire the employees who have the most days off, the fewest days off, or about the average number of days off?

3. In a class of 20 students, 19 of them took an exam in class and 1 student took a make-up exam the following day. The professor graded the first batch of 19 exams and found a mean score of 71 points with a standard deviation of 7.5 points. The student who took the make-up the following day scored 82 points on the exam.

a) Does the new student’s score increase or decrease the mean score? Give a qualitative argument without calculations.

b) Formally compute the new mean. Show your work. [For students who want to typeset fancier equations in R Markdown: use $$\frac{numerator}{denominator}$$ to define a fraction, $$\bar x$$ to put a bar over a letter, and $$s_{x}$$ to write a subscript (here, s with subscript x). These commands can be combined, as in $$\bar x = \frac{1+2+3}{3} = 2$$. Wrapping an equation in $$ puts it on its own line, whereas wrapping it in $ keeps it inline.]

c) Does the new student’s score increase or decrease the standard deviation of the scores? Explain.
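As an illustration of the equation syntax described in part b), the general rule for updating a mean when one observation is added could be typeset as a display equation. The symbols here are assumed notation, not taken from the lectures: n is the number of scores already averaged, and x with subscript n+1 is the added score.

```latex
$$\bar x_{new} = \frac{n \, \bar x_{old} + x_{n+1}}{n + 1}$$
```

Typed directly into the text of the .Rmd file (outside any R code chunk), this should render on its own line in the knitted PDF.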

4. This problem uses the same data regarding environmental policy versus economic policy that were presented in Lesson 2, Lecture 1, except the data are categorized by education level or party identity.

a) Read in GallupByEd.csv and GallupByPI.csv, creating two separate objects that store the data.

Convert variables in each object to factors as needed, then reorder and rename the factor levels. [The code for this is provided in the template, so there is nothing to do except read the code to understand how the factors were reordered and relabeled.]

b) How many observations are there in each data set?

c) Produce two two-way contingency tables, one for each data set, with education or party ID in rows and the response in the columns. You can just leave the tables as they display in the R output.

d) What is the joint percentage of people who favor an environment-first policy and are college graduates?

e) What are the marginal percentages for education? [Hint: you will be reporting three percentages].

Explain to a layperson what is meant by marginal percentages.

f) Produce two more tables, one for each data set, that show row conditional percentages instead of counts. What are the conditional percentages for Response for those participants who identify as Republicans? What is meant, in layman's terms, when we refer to conditional percentages?
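Parts c) through f) distinguish counts, joint percentages, marginal percentages, and row conditional percentages. The sketch below shows one way these might be computed in base R with `table()` and `prop.table()`. The data frame `toy` and its columns `Group` and `Response` are hypothetical, made-up values rather than the Gallup data, and the template may use different functions or object names.

```r
# Hypothetical toy data (made up; NOT the Gallup data or the template's names)
toy <- data.frame(
  Group    = factor(c("A", "A", "A", "B", "B", "B", "B", "B")),
  Response = factor(c("Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No"))
)

# Two-way contingency table of counts: Group in rows, Response in columns
counts <- table(toy$Group, toy$Response)

joint    <- 100 * prop.table(counts)              # each cell / grand total
marginal <- 100 * prop.table(rowSums(counts))     # each row total / grand total
rowcond  <- 100 * prop.table(counts, margin = 1)  # each cell / its own row total

print(counts)
print(round(joint, 1))
print(round(marginal, 1))
print(round(rowcond, 1))
```

The only difference between the three percentage tables is the denominator: the grand total for joint and marginal percentages, and the row total for row conditional percentages, which is why each row of `rowcond` sums to 100.
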

g) For each data set, create one bar graph (your choice of version) to explore the association between either education and Response, or PartyID and Response. Make sure the axes are appropriately labeled.

h) For each data set, does the row variable (education or party ID) appear to be associated with the response, or do the row variable and Response appear to be independent? Explain. [Note: this is a qualitative answer based only on data visualization.]
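For parts g) and h), one possible version of the bar graph is a side-by-side plot of row conditional percentages, which makes groups of different sizes comparable. The sketch below uses base R `barplot()` purely as an assumption; the lecture .Rmd files may use a different plotting approach, and the counts and labels here are hypothetical.

```r
# Hypothetical toy counts (made-up numbers; NOT the Gallup data):
# rows are groups, columns are response categories.
counts <- matrix(c(20, 30,
                   35, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Group = c("A", "B"),
                                 Response = c("No", "Yes")))

# Row conditional percentages: each row sums to 100
rowpct <- 100 * prop.table(counts, margin = 1)

# barplot() draws one cluster per column of its input, so transpose to get
# one cluster per group, with one bar per response category inside it.
barplot(t(rowpct), beside = TRUE,
        xlab = "Group", ylab = "Percent within group",
        ylim = c(0, 100), legend.text = TRUE)
```

If the bars within each cluster look roughly the same from group to group, that pattern is consistent with independence; clearly different patterns suggest an association.
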

5. Complete the following parts using a data set about popular diets (PopularDiets.csv). The data dictionary for the data set is found in the file DietDataDescription.pdf. The journal article that describes the study and its results is in the file JournalArticleForDietStudy_joc40214. You will need to browse the journal article to answer parts a) and c).

a) Was this study an observational study or an experiment? Briefly explain.

b) What participants were sampled for this study?

c) What do you believe to be the population of interest? Do you think the results can be generalized to the population of interest or some other population? Explain.

d) Read in the data. Set variables to factors as needed. [Code for this is provided in the template so there is nothing you need to do.]

e) How many observations are in the data set? How many of the subjects completed the study?

f) Explore the weight loss variable. Present summary statistics and two different graphs (your choice) that give you a complete picture of the sample distribution for this variable. Synthesize your overall findings to describe the sample distribution of weight loss by combining the information you get from the summary statistics and graphs. What numeric statistics do you think are best to use to summarize the location and variability of weight loss? Explain.
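As a rough sketch of the kind of exploration part f) describes, the base R code below computes numeric summaries and draws two graphs for a single quantitative variable. The vector `y` is hypothetical, made-up data, not the PopularDiets data. It includes one `NA` on the assumption that a real variable might contain missing values, in which case `na.rm = TRUE` is needed for most summary functions.

```r
# Hypothetical toy values (made up; NOT the PopularDiets data)
y <- c(-3.1, 0.4, 1.2, 2.0, 2.5, 3.3, 4.8, 5.1, 7.6, 14.9, NA)

# Center and spread, ignoring the missing value
center <- c(mean = mean(y, na.rm = TRUE), median = median(y, na.rm = TRUE))
spread <- c(sd = sd(y, na.rm = TRUE), iqr = IQR(y, na.rm = TRUE))

print(summary(y))   # five-number summary, mean, and count of NAs
print(center)
print(spread)

# Two complementary views of the same distribution
hist(y, main = "Histogram of toy values", xlab = "Value")
boxplot(y, horizontal = TRUE, xlab = "Value")
```

In this toy example the single large value pulls the mean above the median, which is the kind of evidence that bears on whether mean/SD or median/IQR is the better summary.
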

g) Explore how weight loss varies by type of diet. [Hint: use comparative box plots or faceted histograms by diet type]. Make a qualitative assessment, based on the graph, regarding differences in the distribution of weight loss by diet type.
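The comparative box plots mentioned in the hint for part g) can be sketched with the formula interface of base R `boxplot()`. The data frame `toy` and its columns `group` and `y` are hypothetical and made up for illustration; they are not the PopularDiets variables, and the lecture code may use a different plotting approach.

```r
# Hypothetical toy data (made up; NOT the PopularDiets data or its column names)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 5)),
  y     = c(1, 2, 3, 4, 5,   3, 4, 5, 6, 12,   0, 0, 1, 2, 2)
)

# One box per group, drawn on a shared vertical axis
boxplot(y ~ group, data = toy, xlab = "Group", ylab = "Value")

# Matching numeric summaries for each group
medians <- tapply(toy$y, toy$group, median)
iqrs    <- tapply(toy$y, toy$group, IQR)
print(rbind(median = medians, IQR = iqrs))
```

Reading the boxes side by side shows differences in center, spread, and outliers across groups, and the `tapply()` output gives the numbers behind the picture.
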

Last Updated on June 28, 2022