stat405

Syllabus

This course will teach you to be a data analyst. You will learn how to take a large dataset, break it up into manageable pieces, and use a range of qualitative and quantitative tools to summarise it and learn what it has to tell you. You will learn the importance of scepticism and curiosity, and how to communicate your findings. Each section of the course is motivated by a particular dataset, and you will gain experience working with a wide variety of data sources varying in size and quality.

We will focus our efforts on the statistical programming environment R. You will learn how to program in R, learning both the basic syntax and a range of vocabulary to help solve common problems. The only way to become a competent programmer is practice, and many of the weekly homeworks will require substantial programming. That said, the focus of the course is on data analysis, and your technical abilities are only useful insofar as they allow you to explore the world and to craft useful, persuasive arguments.

There are few requirements for this course. You will need some basic statistical knowledge, particularly of the linear model and its extensions, but otherwise the course is largely self-contained. Being a skilled touch typist is a big plus!

Overview

Goals

  • Become a capable data analyst.
  • Learn how to program (in R).

Structure

Each topic in the course is motivated by a data problem. Some of the data sets we will use are:

  • Fuel economy of US cars
  • Characteristics and prices of 50,000 diamonds
  • Mortality in Mexico
  • Slot machine pay offs
  • US baby names from 1880 to 2008

We’ll also use some other interesting datasets for two projects, and you’ll have the opportunity to clean and compile your own data for the final project.

Topic outline

  • Week 1: Introduction to R and visualisation
  • Week 2: Introduction to data manipulation. Project 1 introduced.
  • Week 3: More data manipulation. Introduction to statistical reports.
  • Week 4: Loading and saving data. Control flow.
  • Week 5: Functions. Project 1 due.
  • Week 6: Group-wise data manipulation. (Hadley away)
  • Week 7: Text manipulation with regular expressions. Project 2 announced.
  • Week 8: (CC: only one class). Visualising spatial data.
  • Week 9: Perception and polishing.
  • Week 10: Data structures. Project 2 due. (Hadley away).
  • Week 11: Tidy data.
  • Week 12: Dates and times. Visualising time & space. Project 3 announced.
  • Week 13: Modelling.
  • Week 14: (TG: only one class) Poster presentation skills.
  • Week 15: Final topics. Final presentation. Project 3 due.

Assessment

Grading breakdown

  • Weekly homework: 40%. The lowest grade is dropped.
  • Team projects: 60% (15% + 20% + 25%).
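
As a rough illustration of how these weights combine, here is a minimal sketch in R. The function name `course_score` and the assumption that every grade is on a common scale (for example 0 to 100) are hypothetical and not part of the official grading procedure.

```r
# Hypothetical helper: combine homework and project grades into one score.
# Assumes all grades are on the same scale (for example 0-100).
course_score <- function(homework, projects) {
  stopifnot(length(homework) >= 2, length(projects) == 3)
  # Drop the single lowest homework grade, then average the rest.
  hw_mean <- mean(sort(homework)[-1])
  # Weights from the grading breakdown: 40% homework, 15% + 20% + 25% projects.
  0.40 * hw_mean + sum(c(0.15, 0.20, 0.25) * projects)
}

# Made-up grades: the homework grade of 60 is dropped before averaging.
course_score(homework = c(60, 80, 90, 100), projects = c(70, 80, 90))
```

Because the later projects carry more weight, a weak first project can be recovered from more easily than a weak final project.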

Please hand in a physical version of your homework and projects to the stat405 mailbox. We will write comments on it and give it back to you. An electronic version will be accepted only under exceptional circumstances.

All grades will be posted electronically on owl-space. It’s your responsibility to double-check that I have correctly entered your grade from your assignment. Please let me know if I’ve made a mistake.

Late policy

0% penalty if in the stat405 mailbox by 9am Friday, 20% if in by 9am the following Monday, and 100% after that.
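
The late policy can be written as a small R function. This is only a sketch: the name `late_penalty`, the treatment of a hand-in at exactly 9am as on time, and the example date are all assumptions made for illustration.

```r
# Hypothetical helper: fraction of the grade lost for a given hand-in time.
# `due` is assumed to be the Friday 9am deadline for that assignment.
late_penalty <- function(handed_in, due) {
  monday <- due + as.difftime(3, units = "days")  # 9am the following Monday
  if (handed_in <= due) {
    0      # on time: no penalty
  } else if (handed_in <= monday) {
    0.2    # over the weekend: 20% penalty
  } else {
    1      # after Monday 9am: 100% penalty
  }
}

# Arbitrary example deadline (a Friday), in UTC to avoid daylight-saving issues.
due <- as.POSIXct("2011-09-02 09:00", tz = "UTC")
late_penalty(as.POSIXct("2011-09-02 17:00", tz = "UTC"), due)
```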

Grading scale

The grading will be a little different to what you are used to in statistics. Most assignments will be graded according to a rubric. Each component (typically scepticism, curiosity and organisation) is graded between 1 (F) and 5 (A+). To get a 5 you will typically need to go above and beyond what I have covered in class and show me something new.

This grading scale is applied uniformly over the entire semester, so in the first few homeworks you may only receive grades of 5 or 6 out of 15.

A rough conversion between rubric grades and letter grades is:

  • 4.5–5.0 = A+
  • 3.5–4.5 = A
  • 2.5–3.5 = B
  • < 2.5 = F

These are minimum guaranteed grades, i.e. we may be more generous depending on the grading distribution this semester. Pluses and minuses will be awarded at our discretion.
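
The rough conversion can be sketched in R with `cut()`. The function name `letter_grade` is hypothetical, and because the listed ranges share their end points, this sketch assumes that a score exactly on a boundary (2.5, 3.5 or 4.5) receives the higher grade.

```r
# Hypothetical helper: map a rubric grade on the 1-5 scale to the
# minimum guaranteed letter grade from the rough conversion.
# Assumption: a score exactly on a boundary gets the higher grade.
letter_grade <- function(score) {
  stopifnot(all(score >= 1 & score <= 5))
  cut(score,
      breaks = c(-Inf, 2.5, 3.5, 4.5, Inf),
      labels = c("F", "B", "A", "A+"),
      right = FALSE)  # intervals are closed on the left: [2.5, 3.5), ...
}

as.character(letter_grade(c(2, 3, 4.5, 5)))
# "F"  "B"  "A+" "A+"
```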

Weekly homework

To do well in this course you will need to spend 4-5 hours a week (outside of class!), and the weekly homeworks are designed to encourage you to do that. For each homework you will need to revise the week’s work, as well as synthesise some new information from the help pages or the web.

Team projects

Each member of the team is responsible for every part of the project. I know team projects can be frustrating, but I hope to teach some skills that should make them less painful. More details will be provided when we start the first project, but expect to produce a 15-page report detailing the analysis of a large data set.

Each project will receive a single grade, but individual grades will be weighted by effort as judged by the entire team.

Teams will be assigned by Garrett and me, and after the first project teams will be rebuilt unless each team unanimously decides to stay together. Additionally, teams can choose to fire team members who are not performing well (after meeting with me as a team), and individuals can choose to quit if they feel they are doing all the work.

Final project

For your final project, you’ll be expected to find your own dataset. As well as writing a report, you’ll present at a formal poster session.

Model answers

Homeworks and projects are open-ended, and there are no right answers. To give you a feel for what good answers look like, I’ll publish (anonymously) a few of the best answers. If you don’t want yours to be published, please let me know.

Collaboration and citation

For homeworks (and obviously team projects) I encourage you to work together. Please discuss the data, code and problems with one another, but do your own exploration and write up. We expect everyone to hand in substantially different homeworks, and we will enforce this under the honour code.

Please use any resources available to you. Many homeworks will explicitly encourage you to use resources on the internet, but showing extra initiative will always be appreciated. You will find R programming tough at first, so feel free to email me questions or discuss your problems with other classmates.

Note that it is not acceptable to copy verbatim from outside sources, and in most assignments even quotes will not be appropriate. Use the ideas, not the particular details. Always give credit where credit is due, so all use of outside sources should be cited: for projects you will be expected to have a formal bibliography; for homeworks, a casual citation is fine; and for code, reference the source in a comment.

Disability statement

If you have a documented disability that will impact your work in this class, please contact me to discuss your needs. You’ll also need to register with the Disability Support Services Office in the Allen Center.