Activity: Data Wrangling

Instructions

For this activity, work in groups of 2 to 3. Work toward a solution together, and help each other out when someone gets stuck! The goal is to use the process as a vehicle for your own and your peers’ learning, not to reach the ‘correct’ answer as fast as possible.

Problem 0: Load (and install) packages

# Packages (I got you started)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Lahman) # the baseball data
library(nycflights13) # the flights data
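
The chunk above only loads the packages; library() stops with an error if a package has not been installed yet. Below is a minimal sketch of one way to check what is missing before installing. The helper name missing_packages is made up for this example, and the install line is left commented out because it needs an internet connection.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "Lahman", "nycflights13")
to_install <- missing_packages(needed)
print(to_install)  # empty if everything is already installed

# Uncomment to install whatever is missing (requires an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

Installation only has to happen once per machine, while the library() calls need to run in every new R session.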

Problem 1: Baseball

The Major League Baseball Angels have at times been called the California Angels (CAL), the Anaheim Angels (ANA), and the Los Angeles Angels of Anaheim (LAA). Using the Teams data frame in the Lahman package:

  1. Find the 10 most successful seasons in Angels history, where “successful” is measured by the fraction of regular-season games won that year. In the table you create, include yearID, teamID, lgID, W, L, and WSWin. See the documentation for Teams (help(Teams)) for the definitions of these variables.

  2. Have the Angels ever won the World Series? If so, when?

data(Teams)
head(Teams)
  yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
1   1884   UA    ALT      ALT  <NA>   10  25    NA  6 19   <NA>  <NA>     N
2   1961   AL    LAA      ANA  <NA>    8 162    82 70 91   <NA>  <NA>     N
3   1962   AL    LAA      ANA  <NA>    3 162    81 86 76   <NA>  <NA>     N
4   1963   AL    LAA      ANA  <NA>    9 161    81 70 91   <NA>  <NA>     N
5   1964   AL    LAA      ANA  <NA>    5 162    81 82 80   <NA>  <NA>     N
6   1965   AL    CAL      ANA  <NA>    7 162    80 75 87   <NA>  <NA>     N
  WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO SV
1  <NA>  90  899  223  30   6   2  22  130  NA NA  NA NA 216 114 4.67 20   0  0
2     N 744 5424 1331 218  22 189 681 1068  37 28  NA NA 784 689 4.31 25   5 34
3     N 718 5499 1377 232  35 137 602  917  46 27  NA NA 706 603 3.70 23  15 47
4     N 597 5506 1378 208  38  95 448  916  43 30  NA NA 660 569 3.52 30  13 31
5     N 544 5362 1297 186  27 102 472  920  49 39  NA NA 551 469 2.91 30  28 41
6     N 527 5354 1279 200  36  92 443  973 107 59  NA NA 569 508 3.17 39  14 33
  IPouts   HA HRA BBA SOA   E  DP    FP                  name
1    659  292   3  52  93 156   4 0.862 Altoona Mountain City
2   4314 1391 180 713 973 192 154 0.969    Los Angeles Angels
3   4398 1412 118 616 858 175 153 0.972    Los Angeles Angels
4   4365 1317 120 578 889 163 155 0.974    Los Angeles Angels
5   4350 1273 100 530 965 138 168 0.978    Los Angeles Angels
6   4323 1259  91 563 847 123 149 0.981     California Angels
                park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
1               <NA>         NA 101 109      ALT            ALT         ALT
2 Wrigley Field (LA)     603510 111 112      LAA            LAA         LAA
3     Dodger Stadium    1144063  97  97      LAA            LAA         LAA
4     Dodger Stadium     821015  94  94      LAA            LAA         LAA
5     Dodger Stadium     760439  90  90      LAA            LAA         LAA
6     Dodger Stadium     566727  97  98      CAL            CAL         CAL
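
If you are unsure how to start Problem 1, here is a rough sketch of the general pattern on toy data: keep the rows for a team that has used several IDs, compute a win fraction, and keep the top rows. The team codes, records, and the names toy_teams, franchise_ids, and win_pct are all invented for illustration, and computing the fraction as W / (W + L) is an assumption (it ignores any games that were neither won nor lost).

```r
library(dplyr)

# Toy data with made-up teams and records; NOT taken from Lahman::Teams.
toy_teams <- data.frame(
  yearID = c(2001, 2002, 2003, 2001, 2002, 2003),
  teamID = c("AAA", "AAB", "AAB", "ZZZ", "ZZZ", "ZZZ"),
  W      = c(90, 75, 99, 100, 60, 81),
  L      = c(72, 87, 63, 62, 102, 81)
)

# One team that has played under two IDs (hypothetical codes).
franchise_ids <- c("AAA", "AAB")

top_seasons <- toy_teams |>
  filter(teamID %in% franchise_ids) |>  # keep only this team's rows
  mutate(win_pct = W / (W + L)) |>      # assumption: wins over wins plus losses
  arrange(desc(win_pct)) |>             # best seasons first
  slice_head(n = 2) |>                  # the activity asks for 10
  select(yearID, teamID, W, L, win_pct)

print(top_seasons)
```

The same filter() idea should carry over to the World Series question, for example by keeping only rows where WSWin indicates a win; check help(Teams) for how that column is coded.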

Problem 2: Flights

Use the nycflights13 package and the flights data frame to answer the following questions:

  1. Which plane (identified by the tailnum variable) made the most trips from New York City airports in 2013?

  2. Plot the number of trips per week over the year.
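
As a hint for the first question, the sketch below counts how often each value appears in a column and sorts the result, using made-up tail numbers. The data frame toy_flights and its values are invented; dropping missing values first is a precaution, in case tailnum is missing for some flights.

```r
library(dplyr)

# Toy flight log with made-up tail numbers; NA marks a missing tail number.
toy_flights <- data.frame(
  tailnum = c("N1AA", "N2BB", "N1AA", NA, "N1AA", "N2BB", NA, NA, NA)
)

trips_by_plane <- toy_flights |>
  filter(!is.na(tailnum)) |>   # otherwise NA could "win" the count
  count(tailnum, sort = TRUE)  # one row per plane, most frequent first

print(trips_by_plane)
```

In this toy example the missing value is the most common entry, which is why it is filtered out before counting.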

data(flights)
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
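
For the weekly plot, one possible approach is to turn each departure time into a week number, count rows per week, and draw a line. The sketch below does this on six made-up timestamps; the names toy_trips, trips_per_week, n_trips, and weekly_plot are invented, and using lubridate::week() on a date-time column like time_hour is only one of several reasonable ways to define a week.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Toy departure times (made up); in `flights`, `time_hour` plays this role.
toy_trips <- data.frame(
  time_hour = ymd_h(c("2013-01-01 06", "2013-01-03 09", "2013-01-09 14",
                      "2013-01-10 07", "2013-01-11 18", "2013-01-20 12"))
)

trips_per_week <- toy_trips |>
  mutate(week = week(time_hour)) |>  # week 1 = days 1 to 7 of the year, etc.
  count(week, name = "n_trips")      # one row per week

print(trips_per_week)

# Build the plot object; print(weekly_plot) draws it.
weekly_plot <- ggplot(trips_per_week, aes(x = week, y = n_trips)) +
  geom_line() +
  geom_point() +
  labs(x = "Week of year", y = "Number of trips")
```

Note that week() counts seven-day blocks starting on January 1, so the last "week" of a year is shorter than the others and will tend to show fewer trips.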