Project 1

Due, in stat405 mailbox, Tuesday 25 Sep


The purpose of the project is to give you an opportunity to work on a larger analysis than what we’ve tackled in the homeworks, and for you to practice working in a group (each group will have three or four people). The main advantage to working in a group is that you can bounce ideas off one another, and hopefully uncover more interesting features of the data. The main problem, as most of you will discover, is co-ordinating time to meet together.

I expect the project to be approximately three times as much work as a normal homework. I think this is fair: you have three times as long, and you can share the work. Don’t confuse the effort you put in with the length of the final report. Here are a couple of quotes to get you thinking:

“I have made this letter longer than usual, because I lack the time to make it short.” Blaise Pascal

“If I am to speak ten minutes, I need a week for preparation; if fifteen minutes, three days; if half an hour, two days; if an hour, I am ready now.” Woodrow Wilson

Sample projects

We used different data sets in the past, but these samples should give you a good feel for what is expected: a, b, c, d, e, f.


In this project, you will explore a new, larger dataset: mpg2.csv.bz2.

The National Highway Traffic Safety Administration (NHTSA) has been setting fuel economy standards for cars and trucks sold in the U.S.A. since the passage of the Corporate Average Fuel Economy Act (CAFE) in 1975. The Environmental Protection Agency (EPA) is responsible for determining the best practices and methods for testing and reporting the average fuel economy of cars and trucks. Since 1978 the EPA has documented fuel economy estimates, calculated in controlled settings, and has reported them together with selected vehicle characteristics.


  • year: year of testing, 1984–2012

  • make: manufacturer

  • model: model name

  • vclass: vehicle class

  • displ: engine size, in liters

  • tran: human readable description of transmission

  • trans_dscr: more info about the transmission. This is not very well documented, but some of the abbreviations are:

    • EMS: Engine management system
    • SIL: Shift indicator light on instrument panel
    • CLKUP: Computer-controlled continuously variable lockup
    • VLKUP: Continuously variable, user-selectable lockup
    • nLKUP: User-selectable lockup with n (2 through 9) lockup ranges
    • CMODE: Computer controlled multimode transmission
    • VMODE: User-selectable continuously variable transmission
    • nMODE: Multimode, user-selectable transmission. n = number of gear ranges (2 through 9)
    • DC/FW or FW: Declutching and freewheeling
  • cyl: number of cylinders

  • drive: type of drive train

  • fueltype: type of fuel the car uses

  • eng_dscr: a description of the engine. I couldn’t find a comprehensive list of the acronyms, so you’ll need to do some detective work

  • tcharger: does the car have a turbocharger?

  • scharger: does the car have a supercharger?

  • guzzler:

    • G = the model is a gas guzzler
    • T = the model is equipped with turbocharger
    • S = the model is equipped with supercharger
  • cty: city miles per gallon

  • hwy: highway miles per gallon

  • cmb: combined cty + hwy miles per gallon

  • pv: passenger volume

  • lv: luggage volume

This data was kindly provided by the EPA.
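To get started, here is a minimal sketch of reading a bzip2-compressed CSV in R. Since mpg2.csv.bz2 is not reproduced here, the code first writes a tiny stand-in file with made-up rows (the file name toy-mpg.csv.bz2, the values and the object names are all assumptions for illustration). read.csv() decompresses bzip2 files transparently, so the same call should work when pointed at the real file.

```r
# Made-up rows that mimic a few of the columns described above.
toy <- data.frame(
  year  = c(1984L, 1998L, 2012L),
  make  = c("Honda", "Ford", "Toyota"),
  model = c("Civic", "F150", "Prius"),
  displ = c(1.5, 4.6, 1.8),
  cty   = c(30L, 13L, 51L),
  hwy   = c(35L, 17L, 48L),
  stringsAsFactors = FALSE
)

# Write a bzip2-compressed stand-in for mpg2.csv.bz2 in a temporary directory.
toy_path <- file.path(tempdir(), "toy-mpg.csv.bz2")
con <- bzfile(toy_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression, so no manual decompression is needed.
mpg <- read.csv(toy_path, stringsAsFactors = FALSE)

# First checks: size, column types and a quick numeric summary.
dim(mpg)
str(mpg)
summary(mpg[c("displ", "cty", "hwy")])
```

Running dim(), str() and summary() on the real file is a quick way to see how many rows you have and which variables will need cleaning.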

The data available is rather large, so you will need to read about what is available, discuss questions you aim to answer, and identify the data necessary to answer them (try to not use more than four or five variables; also think of ways to reduce the number of rows you’re dealing with). Here are some questions to think about:

  • Are there other sources of data that might be useful?
  • What do you want to learn from this data?
  • What data do you need to answer those questions?
  • What data is available?
  • What is your strategy for selecting data? Could you focus on a particular subset?
  • How will you structure the data?
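One possible way to act on these questions is to pick a subset of rows and a handful of columns before doing anything else. The sketch below uses base R's subset() on made-up rows; the vehicle class labels, the choice of 2008 and the object name compact08 are assumptions for illustration, not values taken from the real file.

```r
# Made-up rows standing in for the full data set.
mpg <- data.frame(
  year   = c(2008L, 2008L, 2008L, 1995L, 2008L),
  make   = c("Honda", "Ford", "Toyota", "Honda", "BMW"),
  vclass = c("Compact Cars", "Standard Pickup Trucks", "Compact Cars",
             "Compact Cars", "Compact Cars"),
  displ  = c(1.8, 5.4, 1.8, 1.6, 3.0),
  cyl    = c(4L, 8L, 4L, 4L, 6L),
  cty    = c(26L, 13L, 27L, 29L, 18L),
  hwy    = c(34L, 17L, 35L, 36L, 28L),
  stringsAsFactors = FALSE
)

# Keep one model year and one vehicle class, and only the needed columns.
compact08 <- subset(
  mpg,
  year == 2008 & vclass == "Compact Cars",
  select = c(make, displ, cyl, hwy)
)

compact08
nrow(compact08) / nrow(mpg)  # fraction of rows retained
```

Restricting to a single year is one way to reduce the number of rows while keeping time out of the analysis.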

The main theme of the paper MUST NOT be changes in fuel economy over time.

Deadlines & deliverables

  • 12 Sep-15 Sep. Meet with Hadley/Barrett/Yeshaya to go over initial questions and draft investigations. You should bring your main research questions and the first plots you have created to answer them.

  • Tuesday 25 Sep. Hand in 10-15 page report organised as described below, plus an appendix containing your R code. Please also email me a copy of the report as a PDF.

  • Tuesday 25 Sep. Peer rating of team members.

  • Thursday 4 Oct. Results of team discussion (homework)

Grading rubric

Project rubric

Overall grade breakdown:

  • Introduction: 10
  • Questions and findings: 60
  • Conclusion: 10
  • Presentation: 15
  • Code: 25


Introduction

The purpose of the introduction is to introduce the data set, provide some context, and guide me as to what to expect from the rest of the report. You may find it easiest to write the introduction last, after you write the rest of the report. It should be about a page in length.

On the first page (or cover page), please include the full names of all team members.

Questions and findings

You should have approximately four or five main questions and associated findings, each of which may be broken down further into more specific minor questions. Some of these questions will occur to you immediately upon looking at the data, and some will require considerable exploration before they occur to you. To get to the four or five questions that you report on, I’d expect you to have had 20 or more questions. A lot of the time you will run into a dead end, or the answer to your question will turn out to be uninteresting or obvious. It is always disappointing not to report on something that you spent time working on, but it does make for a better report. You might want to briefly mention some of the dead ends you went down to demonstrate that you’ve done more than just the obvious.

Like your homeworks, I will assess the questions and findings based on the three criteria of curiosity, scepticism and organisation.

In all real data sets you will need to spend a lot of time cleaning up the data - fixing incorrect values, dealing with missing values etc. Don’t forget to give a brief description of what you did - that could count as one of your 4-5 questions/findings.
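As a rough sketch of what such a clean-up step might look like, the code below counts missing values, recodes an implausible value as missing, and records how many rows were dropped so that the number can be quoted in the report. The rows are made up, and treating a zero engine size as missing is an assumption you would need to check against the real data.

```r
# Made-up rows with the kinds of problems real data tends to have.
mpg <- data.frame(
  model = c("A", "B", "C", "D"),
  displ = c(2.0, NA, 3.5, 0),      # one NA and one implausible zero
  cyl   = c(4L, 4L, NA, 6L),
  hwy   = c(30L, 32L, 24L, 26L),
  stringsAsFactors = FALSE
)

# Count missing values per column.
n_missing <- colSums(is.na(mpg))
n_missing

# Treat a zero engine size as missing rather than as a real measurement
# (an assumption about this toy data, to be checked against the real file).
mpg$displ[!is.na(mpg$displ) & mpg$displ == 0] <- NA

# Keep only rows complete on the variables the analysis needs, and record
# how many rows were dropped so the report can state it.
needed <- c("displ", "cyl", "hwy")
clean <- mpg[complete.cases(mpg[needed]), ]
n_dropped <- nrow(mpg) - nrow(clean)
n_dropped
```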


Conclusion

The conclusion should summarise your findings. Rather than just repeating what you’ve already said, try and weave your findings together into a consistent story. You should also reflect a little on other questions that the exploration raised, and what you would do next. Do you need to collect more data? Or collect data in a different way?


Presentation

I’ll also mark the general presentation of the project. This is divided into three parts: text, tables and graphics. Graphs should follow the guidelines we have discussed in class. We haven’t discussed tables in class, but here are some good guidelines from North Carolina State. Also make sure to look at the xtable package.
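For tables, a minimal sketch of turning a small summary into LaTeX with xtable might look like the following. The data are made up, the caption and column names are placeholders, and the code falls back to a plain-text table if the xtable package is not installed.

```r
# Made-up data: highway mpg by number of cylinders.
mpg <- data.frame(
  cyl = c(4L, 4L, 6L, 6L, 8L),
  hwy = c(34, 30, 26, 24, 17)
)

# Mean highway mpg per cylinder count, with readable column names.
tab <- aggregate(hwy ~ cyl, data = mpg, FUN = mean)
names(tab) <- c("Cylinders", "Mean highway mpg")

# Print the table as LaTeX if xtable is installed; otherwise as plain text.
# digits has one entry for the row names plus one per column.
if (requireNamespace("xtable", quietly = TRUE)) {
  print(
    xtable::xtable(tab,
                   caption = "Mean highway mpg by cylinders (toy data)",
                   digits = c(0, 0, 1)),
    include.rownames = FALSE
  )
} else {
  print(tab, row.names = FALSE)
}
```

Renaming the columns and limiting the number of digits before printing keeps the table readable without editing the LaTeX by hand.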

You are encouraged to explore more of the capabilities of LaTeX to produce a unique and attractive document. The documentation for the memoir package is comprehensive and provides many ideas. It’s also huge, so don’t try and read the whole thing, just pick out bits that look interesting.


Code

Last, but not least, your report should include an appendix which allows the reader to reproduce your findings. For this project, this would be an appendix containing the R code used to produce your graphics and perform your analysis. This appendix will be graded according to the code rubric.