Write a report using the data set about. These are the instructions given.
Report 1: instructions and grading criteria
Overview. The first report is due at the end of the day on Wednesday, October 18. A draft of your report is due on October 11. I’ll give you feedback, with the goal of improving your final report (and its grade). I expect the draft to be mostly complete. Of course, you’ll edit it further in response to my comments and perhaps add new things to it, but the more complete it is, the more I can help you.
Your goal in this report is to describe your dataset in detail, with a special focus on a handful of variables and their relationships. The grading criteria below will explain what your report should contain.
Labs on September 27, October 4, and October 11 will be respectively on loading and manipulating dataframes, on plotting, and on statistical modeling. These will help give you the skills you need to write a good report, but there will probably be something you’d like to do in R that you haven’t been taught to do. Don’t be afraid! One resource is the Book of R. Another is the internet: it’s generally very easy to search online for instructions on how to accomplish something in R. And finally, you have me. Don’t be afraid to email me if you can’t figure out how to do something, or if some problem is tripping you up. I’m very happy to help. The sooner you get started, the better.
Nothing you’re asked to do here is especially hard, but it will be daunting if you leave it to the end.
One quirk of these reports is that the first will be based on half of your data, while the second will be based on the other half. Thus, you’ll need to randomly throw out half your data. Here’s how to do this in R:
set.seed(Last4DigitsFromYourStudentIDNumber)
data <- subset(data, 1:nrow(data) %in% sample.int(nrow(data), nrow(data)/2))
The second line replaces data with a random sample of half the observations in data. The first line makes it so that the same random sample is chosen each time you run your code.
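If you’d like to convince yourself that the seed really does fix the sample, here is a minimal sketch. It assumes R’s built-in mtcars data frame as a stand-in for your own dataset, uses 1234 as a placeholder for your own four digits, and wraps the two lines above in a helper called half_sample, a name invented just for this illustration.

```r
# mtcars (built into R) stands in for your own dataset here.
# half_sample is a hypothetical helper name, not part of the assignment.
half_sample <- function(data, seed) {
  set.seed(seed)  # fix the random number generator so the draw is repeatable
  subset(data, 1:nrow(data) %in% sample.int(nrow(data), nrow(data) / 2))
}

first_run  <- half_sample(mtcars, 1234)  # 1234 is a placeholder seed
second_run <- half_sample(mtcars, 1234)

nrow(mtcars)                      # number of rows before subsetting
nrow(first_run)                   # number of rows after subsetting
identical(first_run, second_run)  # TRUE when both runs chose the same rows
```

In your report itself, keep the two lines exactly as given above; the helper is only there to make the comparison easy to run twice.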
Criteria for F. Any of these things will earn you a grade of F:
• not handing your report in by the due date;
• omitting a major task completely;
• a report that does not knit in RStudio due to problems that are difficult to fix;
• a report riddled with major errors.
Criteria for D. A report earns a grade of D if it is not F-worthy but has other serious problems, including:
• the report does not knit, but the errors are fairly easy to fix;
• the level of English writing is so bad as to interfere with the comprehension of the report;
• all major tasks are carried out, but with major errors in many of them or minor errors in all of them, in a way that suggests a lack of comprehension.
Criteria for C. If the report attempts all of the tasks listed in the criteria for B but does so with significant failings, it earns a C.
Criteria for B. The report is written in grammatical, comprehensible English. It’s written in full sentences with the first word of each capitalized, as you’d do in your non-scientific writing. It’s submitted via Blackboard as an RMarkdown (.Rmd) file, together with the data files needed to knit the report into a finished text. It knits on the lab computers without errors. It includes the code given at the top of this document to subset your data, cutting your dataset in half.
The report describes general information about the dataset: Who collected the data? When? Where? Why? If you know what sorts of questions the dataset was originally meant to answer, your report should say what they are. If you don’t, your report should speculate about them.
The report describes the layout of the dataset, answering the following questions:
• How many observations are in your dataset, after subsetting your data?
• How many variables are there? If your dataset has a large number of variables (approximately fifteen or more), you can focus only on the variables that you find most important and will be analyzing further, ignoring the rest. Otherwise, for each variable, provide the following information:
  – What information does the variable contain?
  – Is it numerical? Discrete or continuous? Is it categorical? Nominal or ordinal?
  – Does the variable have missing values? How are they encoded? How many of the observations have missing values, either as a percentage or as a count?
  – If the variable is categorical, give an overview of its possible values. How many are there? If there are only a few, list them and explain their meanings if they’re not obvious.
  – If the variable is numerical, give an appropriate measure of the center and spread of its distribution.
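The questions above can be answered with a few lines of R. What follows is only a sketch: it assumes R’s built-in airquality data frame as a stand-in for your own data, assumes missing values are encoded as NA (yours may be encoded differently), and treats Month as a categorical variable purely for illustration.

```r
# airquality (built into R) stands in for your own dataset here.
data <- airquality

nrow(data)  # number of observations
ncol(data)  # number of variables
str(data)   # name and storage type of every variable

# Missing values per variable, as a count and as a percentage.
# This assumes missing values are coded as NA.
missing_count   <- colSums(is.na(data))
missing_percent <- round(100 * missing_count / nrow(data), 1)
missing_count
missing_percent

# Categorical variable: list its values and how often each occurs.
table(data$Month, useNA = "ifany")

# Numerical variable: center and spread, skipping missing values.
mean(data$Ozone, na.rm = TRUE)
sd(data$Ozone, na.rm = TRUE)
median(data$Ozone, na.rm = TRUE)
IQR(data$Ozone, na.rm = TRUE)
```

Whether the mean and standard deviation or the median and interquartile range are the more appropriate pair depends on the shape of the distribution, so it’s worth looking at both before you choose.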
After this, your report picks a smaller number of variables for more careful study. There should be no more than five of these; fewer is recommended. For each of these variables, your report should include a more detailed description of its distribution, including appropriate plots. If appropriate, discuss whether the variable is normally distributed, and explain your reasoning. (See Sections 3.1–3.2 of the OpenIntro textbook and the discussion of the qqnorm command on p. 354 of the Book of R; we’ll be discussing these in class and in lab the week of October 11th.)
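As a rough sketch of what examining one distribution might look like, the code below draws a histogram and a normal quantile plot. It assumes the Wind column of R’s built-in airquality data frame as a stand-in for one of your selected variables; the plot title and axis label are placeholders.

```r
# airquality$Wind stands in for one of your selected variables.
wind <- airquality$Wind

# Histogram: overall shape, skew, and possible outliers.
hist(wind, main = "Wind speed", xlab = "Wind (mph)")

# Normal quantile plot: sample quantiles against normal quantiles.
qqnorm(wind)
qqline(wind)  # reference line; points close to it suggest approximate normality
```

The plots alone don’t settle the question. Your report still needs to say what you see in them and why that does or doesn’t support normality.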
For each pair of variables selected, your report discusses their relationship. If appropriate, it includes plots, correlations, or two-way tables.
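Here is a minimal sketch of the three tools mentioned above, assuming R’s built-in mtcars data frame as a stand-in for your own data. The variable choices and the names wt_mpg_cor and cyl_gear are only illustrative.

```r
# mtcars (built into R) stands in for your own dataset here.

# Two numerical variables: scatterplot and correlation.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
wt_mpg_cor

# Two categorical variables: two-way table of counts.
cyl_gear <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
cyl_gear
```

If your variables contain missing values, cor will return NA unless you tell it how to handle them, for example with use = "complete.obs".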
Criteria for A. The report analyzes each pair of selected variables in more detail. For pairs of numerical variables, discuss whether it is appropriate to fit a linear model. If it is appropriate, do so; discuss the proportion of explained variance (i.e., the R^2 value; see Section 7.2.6 of OpenIntro Statistics) and evaluate the quality of the model. Provide plots to justify your statements. For pairs consisting of one categorical and one numerical variable, discuss the distribution of the numerical variable within each level of the categorical variable, if it makes sense to do so.
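One possible way to carry out this analysis is sketched below, again assuming R’s built-in mtcars data frame in place of your own data. The names fit, r_squared, and group_medians are placeholders, and cyl is treated as categorical only for the sake of the example.

```r
# mtcars (built into R) stands in for your own dataset here.

# Two numerical variables: fit a linear model and extract R^2.
fit <- lm(mpg ~ wt, data = mtcars)
r_squared <- summary(fit)$r.squared
r_squared

# Scatterplot with the fitted line.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)

# Residuals against fitted values: a curve or funnel shape here
# suggests that a linear model may not be appropriate.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# One categorical and one numerical variable:
# distribution of the numerical variable within each level.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon")
group_medians <- tapply(mtcars$mpg, mtcars$cyl, median)
group_medians
```

A high R^2 by itself doesn’t show that the model is a good one, which is why the residual plot is included alongside it.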
The report compares and contrasts possible presentations of the data: What options did you consider for plotting your data and measuring its center and spread? Why did you make the choices you did?
Zoom out and critique the dataset itself. Do you believe the data is accurate? Given your understanding of how it was gathered, does the dataset help answer the questions it was meant to answer? Putting feasibility aside, what other information do you wish was contained in the dataset, and what questions would you hope to answer with it?