stat405

Plyr drills

Diamonds data set

  1. Make a new data set that has the average depth and price of the diamonds in the data set.

    Use summarise()

    new <- summarise(diamonds, avg_depth = mean(depth), avg_price = mean(price))
  2. Add a new column to the data set that records each diamond's price per carat.

    Use transform()

    new <- transform(diamonds, carat_price = price / carat)
  3. Make a data set that only includes diamonds with an Ideal cut.

    Use subset()

    new <- subset(diamonds, cut == "Ideal")
  4. Create a new data set that groups diamonds by their cut and displays the average price of each group.

    Use ddply()

    new <- ddply(diamonds, "cut", summarise, avg_price = mean(price))
  5. Create a new data set that groups diamonds by color and displays the average depth and average table for each group.

    new <- ddply(diamonds, "color", summarise, avg_depth = mean(depth), avg_table = mean(table))
  6. Add two columns to the diamonds data set. The first column should display the average depth of diamonds in the diamond's color group. The second column should display the average table of diamonds in the diamonds color group.

    new <- ddply(diamonds, "color", transform, avg_depth = mean(depth), avg_table = mean(table))
  7. Make a data set that contains all of the unique combinations of cut, color, and clarity, as well the average price of diamonds in each group.

    new <- ddply(diamonds, c("cut", "color", "clarity"), summarize, avg_price = mean(price))
  8. Add a column to the diamonds data set that displays the average price for all diamonds with a diamond's cut, color, and clarity.

    new <- ddply(diamonds, c("cut", "color", "clarity"), transform, avg_price = mean(price))
  9. Do diamonds with the best cut fetch the best price for a given amount of carats?

    Determine the average price per carat for diamonds in each cut group.

    new <- ddply(diamonds, "cut", summarise, carat_price = mean(price / carat))
    
    # no
  10. Which color diamonds seem to be largest on average (in terms of carats)?

    new <- ddply(diamonds, "color", summarise, avg_carat = mean(carat))
    
    # diamonds with color J (the worst color)
  11. What color of diamonds occurs the most frequently among diamonds with ideal cuts?

    ddply(subset(diamonds, cut = "Ideal"), "color", summarise, freq = length(color))
    
    # or
    
    table(subset(diamonds, cut = "Ideal")$color)
    
    # color G
  12. Which clarity of diamonds has the largest average table per carats?

    new <- ddply(diamonds, "clarity", summarise, rel_table = mean(table / carat))
    
    # clarity = VVS1
  13. Which diamond has the largest price per carat compared other diamonds with its cut, color, and clarity?

    new <- ddply(diamonds, c("cut", "color", "clarity"), transform, diff = price - mean(price))
    
    which(new$diff == max(new$diff))
    
    # Diamond 47423 (the diamond in row 47423)
  14. What is the average price per carat of diamonds that cost more than $10000?

    not_cheap <- subset(diamonds, price > 10000)
    summarise(not_cheap, carat_price = mean(price / carat))
    
    # $8,044.44 per carat
  15. Display the largest diamond depth observed for each clarity group.

    new <- ddply(diamonds, "clarity", summarise, max_depth = max(depth))

Players data set

  1. Create a new dataframe that only includes players less than 60 inches tall.

    new <- subset(players, height < 60)
  2. What percentage of players bat left handed? Right handed? Both?

    Use table()

    table(players$bats) / nrow(players) * 100
    
    #or 
    
    summarise(players, table(bats) / nrow(players) * 100)
  3. Make a dataframe that displays just the average height of players from each country.

    new <- ddply(players, "country", summarise, avg_height = mean(height))
  4. Create a new data set that displays the ratio of right handed batters to all batters for each country.

    new <- ddply(players, "country", summarise, ratio = sum(bats == "R") / length(bats))
  5. Create a dataframe that compares the average body mass index (BMI) of players from each state, nation combination. (BMI = (weight * 703) / height2).

    new <- ddply(players, c("state", "country"), summarise, 
      bmi = mean((weight * 703) / height ^ 2, na.rm = T))
  6. For each player, display the average height of all players with his weight. Put this new variable in a new column.

    heights <- ddply(players, "weight", transform, 
      avg_height = mean(height, na.rm = T))
  7. Make a dataframe that displays just the average weight of players from each state.

    Don't be surprised to see numbers in the states column. The players data set records some states as numbers.

    new <- ddply(players, "state", summarise, 
      avg_weight = mean(weight))
  8. Add a column to the players data set that displays the average weight for all players from the player's state who throw with the same hand as the player.

    new <- ddply(players, c("state", "throws"), transform, 
      avg_weight = mean(weight, na.rm = T))
  9. Make a data set that just includes players with nicknames.

    named <- subset(players, nickname != "")
  10. Do players with nicknames weigh more on average than players without nicknames?

    Save time: use the data set made above.

    Avoid NA's by using mean(weight, na.rm = T)

    mean(named$weight, na.rm = T)
    summarise(players[players$nickname == "", ], mean(weight, na.rm = T))
    
    # no
  11. Which country has the fattest players (on average)?

    Measure fatness by the body mass index.

    Use order() and subsetting to sort entries by weight. For example, players[order(-players$weight), ]

    new <- ddply(players, "country", summarise, avg_weight = mean(weight, na.rm = T))
    new[order(-new$avg_weight), ]
    
    # Indonesia
  12. From which states have the most baseball players come from?

    new <- ddply(players, "state", summarise, number = length(state))
    new[order(-new$number), ]
    
    # California, out of the country, Texas, etc.
  13. Do more American (USA) players or foreign players have a nicknames?

    Use the results of question 9.

    nrow(subset(named, country == "USA"))
    nrow(subset(named, country != "USA"))
    
    # American players
  14. Which player is the tallest when compared to the average height of players with his weight?

    Use the results of question 6.

    new <- transform(heights, diff = height - avg_height) 
    subset(new, diff == max(diff, na.rm = T)) 
    
    # Dave Pavlos
  15. Which are taller on average: players who throw with their right hand, or players who throw with their left?

    new <- ddply(players, "throws", summarise, avg_height = mean(height, na.rm = T))
    
    # Left

Baby names data set

  1. Create a new dataframe that only includes years since 2000.

    newest <- subset(names, year >= 2000)
  2. What name has been used for the most number of years (when used for a single gender)?

    years <- ddply(names, c("name", "sex"), summarise, year = length(name))
    subset(years, year == max(year))
    
    #It is a 329 way tie
  3. What names have only been used one year?

    Use the result above.

    subset(years, year == 1)
    
    # There are 876 of them
  4. Create a new column that displays the number of genders the name was used for each year. Note: not recommended for slower computers.

    boygirl <- ddply(names, c("year", "name"), transform, both = length(year))
    
  5. Make a data set of names that have been both boy and girl names.

    Use the results from above.

    unique() removes identical rows in a dataframe. Apply this to your results to get a concise list.

    new <- subset(boygirl, both == 2)
    boths <- unique(new[, 2])
    
    
    # or
    
    years2 <- ddply(names, "name", summarise, year2 = length(name))
    temp <- merge(years, years2, by = "name") #years from question 2
    boths <- subset(temp, year != year2)
    boths <- unique(boths$name)
    
  6. For each year, display the total number of names that were used. Treat boy and girl versions of the same name as two separate names.

    new <- ddply(names, "year", summarise, num_names = length(year))
    
    # 2000 names were recorded for each year
  7. Which name received the largest percentage of any name for any year (consider boy and girl names as distinct)

    subset(names, percent == max(percent))
    
    # John in 1880
  8. Which girl's name received the largest percentage of any girl's name for any year?

    girls <- subset(names, sex == "girl")
    subset(girls, percent == max(percent))
    
    # Mary in 1880
  9. Display the average percentage each name received during all the years it was used. Treat girl and boy versions of the same name as different names.

    percent <- ddply(names, c("name", "sex"), summarise, avg_per = mean(percent))
  10. Which name recorded in the data set has been out of use for the longest time?

    You could consider the last year each name was recorded

    last <- ddply(names, c("name"), summarise, last_use = max(year))
    subset(last, last_use == min(last_use))
    
    # 43 names have been out of use since 1880
  11. In a new column, display the total number of years a name has been used (as either a boys name or a girls name).

    new <- ddply(names, "name", transform, num_years = length(year))
  12. In general, are names that have been used for both boys and girls more popular for boys or girls?

    Compare the average percentage a name receives when used as a boy to the average percentage when its used for a girl.

    Use the results from questions 5 and 9.

    bg_names <- subset(percent, name %in% both) 
    new <- ddply(bg_names, "name", summarise, diff = avg_per[sex=="boy"] - avg_per[sex =="girl"])
    mean(new$diff $gt; 0)
    
    # More popular for girls (new$diff $gt; 0 is TRUE (i.e, 1) when the boys version is more popular. This occurred less than half the time (i.e. mean(new$diff $gt; 0) < 0.5)
  13. What name has earned the most percentage points in a year for any name since 2000?

    Use the results of question 1.

    subset(newest, percent == max(percent)) 
    
    # Jacob in 2000