Different Classes in R






Grayson White

Math 241
Week 7 | Spring 2026

Announcements

  • Welcome to the admitted students!!

Week 7 Goals

Mon Lecture

  • Learn more about strings, factors, dates, and times in R!

Wed Lecture

  • Project 1 work day

Project 1 Check-In

  • If you haven’t already, make sure to read over the “Tips for getting started” section of the Project 1 instructions.
  • Everyone should have access to their project group repo.
  • Make sure to come by office hours with questions or to talk out your plan for your dashboard!

Timeline

  • 3/2: Project groups released.
  • 3/4: Receive project instructions and an invite to your group’s GitHub repo.
    • Please use your assigned Math 241 GitHub repo for this project.
  • 3/18 (noon): Post a working draft of your dashboard to https://www.shinyapps.io/
  • 3/18 (noon): Post the link to the group’s dashboard to this spreadsheet.
  • 3/18 - 3/20: Peer feedback period
    • Each person will provide feedback on the dashboards of two groups.
    • More guidance on providing feedback will be given in class that week.
    • Peer feedback is due 3/20 at 10pm.
  • 4/3 10pm: Add the link to the final version of your dashboard to this spreadsheet, and submit a PDF of your data scientist’s statement on Gradescope.
  • 4/5 10pm: Group member feedback form due.

Projects and Git/GitHub

  • GitHub Repo = RStudio Project / Positron folder

  • This means you need to create a new RStudio project that is synced with your group’s GitHub repo that I created.

    • Quick video tutorial available here

Workflow

Once your GitHub repo and RStudio project are synced, here’s your workflow:

  • Pull the most recent version of the repo from GitHub to your RStudio project.

  • Do some work on your project in RStudio.

  • Commit that work.
    • Committing takes a snapshot of all the files in the project.
    • Look over the Diff, which shows what has changed since your last commit.
    • Include a quick note, the Commit Message, to summarize the motivation for the changes.
  • Push your commit to GitHub from RStudio.

Git Collaboration: Merge conflicts

  • What if my collaborators and I both make changes?
    • Scenario: Your collaborator makes changes to a file, commits, and pushes to GitHub. You also modify that file, commit and push.
    • Result: Your push will fail because there’s a commit on GitHub that you don’t have.
    • Usual Solution: Pull, and Git will usually merge their work nicely with yours. Then push. If that doesn’t work, you have a merge conflict. Let’s cross that bridge when we get there.
  • How to avoid merge conflicts?
    • First, always pull when you are going to work on your project.
    • Then, always commit and push when you are done, even if you only made small changes.

Collaboration: Git Style

  • Projects: Use these to create to-do lists and stay organized.

  • Issues: Useful method to communicate with your group members.

  • Branches: A tool for taking a detour from the main stream of development.

Git Branches

  • Branch = Detour from main stream of development.
  • Workflow:
    • Create a new branch.
    • Check out (switch to) that branch.
    • Commit the work for that branch.
    • Merge it into the main branch.
      • Can also be done on GitHub via a Pull Request.
  • If you have Git experience or want to try out branches, check out Ch 22 in Happy Git with R.

  • For novices, I recommend staying on the main branch.

Now: dates and times in R with lubridate

Why do we need to talk about dates and times?

Question: When did the crashes happen?

library(tidyverse)
crashes <- read_csv("data/pdx_crash_2018_page1.csv")

crashes %>%
  count(CRASH_DT) %>%
  ggplot(mapping = 
           aes(x = CRASH_DT,
               y = n)) +
  geom_point()

Dates

head(crashes$CRASH_DT)
[1] "02/01/18 00:00:00" "02/11/18 00:00:00" "03/09/18 00:00:00"
[4] "04/09/18 00:00:00" "10/10/18 00:00:00" "05/24/18 00:00:00"
class(crashes$CRASH_DT)
[1] "character"

What class should it be?

Converting Strings to Dates

  • Identify the order of year, month, day, hour, minute, second

  • Pick the lubridate function that replicates that order.

class(crashes$CRASH_DT)
[1] "character"
head(crashes$CRASH_DT)
[1] "02/01/18 00:00:00" "02/11/18 00:00:00" "03/09/18 00:00:00"
[4] "04/09/18 00:00:00" "10/10/18 00:00:00" "05/24/18 00:00:00"
library(lubridate)

crashes <- crashes %>%
  mutate(crash_date_time = mdy_hms(CRASH_DT),
         crash_date = date(crash_date_time))

class(crashes$crash_date)
[1] "Date"
head(crashes$crash_date)
[1] "2018-02-01" "2018-02-11" "2018-03-09" "2018-04-09" "2018-10-10"
[6] "2018-05-24"
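
A minimal sketch of the matching rule, using made-up strings rather than the crash data: the letters in the lubridate function name follow the order of the pieces in the string. The object names here are illustrative only.

```r
library(lubridate)

# The same calendar day written in three different orders (made-up strings)
from_ymd <- ymd("2018-02-01")   # year, month, day
from_mdy <- mdy("02/01/18")     # month, day, two-digit year
from_dmy <- dmy("01-02-2018")   # day, month, year

# All three parse to the same Date
from_ymd == from_mdy
from_ymd == from_dmy

# When a time follows the date, add the _hms suffix
with_time <- mdy_hms("02/01/18 00:00:00")
class(with_time)
```

Picking the wrong order (say, dmy() on a month-first string) can silently give a different date, so it is worth checking a few parsed values against the raw strings.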

Why do we need to talk about dates and times?

Question: When did the crashes happen?

crashes %>%
  count(crash_date) %>%
  ggplot(mapping = 
           aes(x = crash_date,
               y = n)) +
  geom_point()

  • Hard to see daily patterns. Switch time interval?
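
One possible way to switch the time interval, sketched on a few made-up dates (not the crash data): floor_date() rounds each date down to the start of its week, so count() then tallies by week instead of by day. The names toy and week_start are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Four made-up dates; the first two fall in the same week
toy <- tibble(day = ymd(c("2018-02-01", "2018-02-02",
                          "2018-02-11", "2018-03-09")))

weekly <- toy %>%
  # Round each date down to the first day of its week
  mutate(week_start = floor_date(day, unit = "week")) %>%
  count(week_start)

weekly
```

The same idea works with unit = "month" if weekly counts are still too noisy.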

Why do we need to talk about dates and times?

Question: When did the crashes happen?

crashes %>%
  mutate(month = month(crash_date, label = TRUE)) %>%
  count(month) %>%
  ggplot(mapping = 
           aes(x = month,
               y = n)) +
  geom_col() + 
  labs(title = "Number of car crashes per month",
       subtitle = "Portland, OR (2018)",
       x = "", y = "") + 
  theme_bw()

  • Better! Chart junk?

Let’s Look at Portland’s Biketown Data

All check-outs for July - August of 2017

biketown <- read_csv("data/biketown.csv") %>%
  filter(Distance_Miles < 1000)

biketown_dt <- biketown %>%
  select(StartDate, StartTime, EndDate, EndTime, Distance_Miles,
         BikeID)

glimpse(biketown_dt)
Rows: 9,999
Columns: 6
$ StartDate      <chr> "8/17/2017", "7/22/2017", "7/27/2017", "7/12/2017", "7/…
$ StartTime      <time> 10:44:00, 14:49:00, 14:13:00, 13:23:00, 19:30:00, 10:0…
$ EndDate        <chr> "8/17/2017", "7/22/2017", "7/27/2017", "7/12/2017", "7/…
$ EndTime        <time> 10:56:00, 15:00:00, 14:42:00, 13:38:00, 20:30:00, 10:5…
$ Distance_Miles <dbl> 1.91, 0.72, 3.42, 1.81, 4.51, 5.54, 1.59, 1.03, 0.70, 1…
$ BikeID         <dbl> 6163, 6843, 6409, 7375, 6354, 6088, 6089, 5988, 6857, 6…

Let’s Look at Portland’s Biketown Data

  • Fix the class of the date columns.
  • Create date-time columns.
library(lubridate)
biketown_dt <- biketown_dt %>%
  mutate(StartDate = mdy(StartDate),
         EndDate = mdy(EndDate)) %>%
  mutate(StartDateTime = ymd_hms(paste(StartDate, StartTime, sep = " ")),
         EndDateTime = ymd_hms(paste(EndDate, EndTime, sep = " "))) 

glimpse(biketown_dt)
Rows: 9,999
Columns: 8
$ StartDate      <date> 2017-08-17, 2017-07-22, 2017-07-27, 2017-07-12, 2017-0…
$ StartTime      <time> 10:44:00, 14:49:00, 14:13:00, 13:23:00, 19:30:00, 10:0…
$ EndDate        <date> 2017-08-17, 2017-07-22, 2017-07-27, 2017-07-12, 2017-0…
$ EndTime        <time> 10:56:00, 15:00:00, 14:42:00, 13:38:00, 20:30:00, 10:5…
$ Distance_Miles <dbl> 1.91, 0.72, 3.42, 1.81, 4.51, 5.54, 1.59, 1.03, 0.70, 1…
$ BikeID         <dbl> 6163, 6843, 6409, 7375, 6354, 6088, 6089, 5988, 6857, 6…
$ StartDateTime  <dttm> 2017-08-17 10:44:00, 2017-07-22 14:49:00, 2017-07-27 1…
$ EndDateTime    <dttm> 2017-08-17 10:56:00, 2017-07-22 15:00:00, 2017-07-27 1…
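
One payoff of having real date-time columns is arithmetic. Here is a minimal sketch on two made-up rides that mirror the first two rows printed above; ride_minutes is a hypothetical column name, not part of the Biketown data.

```r
library(dplyr)
library(lubridate)

# Two made-up rides, mirroring the StartDateTime/EndDateTime columns
toy_rides <- tibble(
  StartDateTime = ymd_hms(c("2017-08-17 10:44:00", "2017-07-22 14:49:00")),
  EndDateTime   = ymd_hms(c("2017-08-17 10:56:00", "2017-07-22 15:00:00"))
)

# Differences only make sense once both columns are <dttm>, not <chr>
toy_rides <- toy_rides %>%
  mutate(ride_minutes = as.numeric(difftime(EndDateTime, StartDateTime,
                                            units = "mins")))

toy_rides$ride_minutes
```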

Grabbing Components

biketown_dt$StartDateTime[1000]
[1] "2017-08-26 17:26:00 UTC"
year(biketown_dt$StartDateTime[1000])
[1] 2017
month(biketown_dt$StartDateTime[1000], label = TRUE)
[1] Aug
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
day(biketown_dt$StartDateTime[1000])
[1] 26

Grabbing Components

week(biketown_dt$StartDateTime[1000])
[1] 34
wday(biketown_dt$StartDateTime[1000], label = TRUE)
[1] Sat
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
hour(biketown_dt$StartDateTime[1000])
[1] 17
minute(biketown_dt$StartDateTime[1000])
[1] 26

Grabbing Components

ggplot(data = biketown_dt, 
       mapping = 
         aes(month(StartDateTime,
                   label = TRUE))) +
  geom_bar()

Grabbing Components

ggplot(data = biketown_dt, 
       mapping = aes(wday(StartDateTime,
                          label = TRUE))) +
  geom_bar()
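
The bar charts above count rows per component inside ggplot. If you want the counts as a table instead, a sketch along these lines should work; toy_rides and weekday are made-up names, and the three date-times are illustrative.

```r
library(dplyr)
library(lubridate)

# Three made-up check-outs: two on one day, one on another
toy_rides <- tibble(
  start = ymd_hms(c("2017-08-26 17:26:00",
                    "2017-08-26 09:05:00",
                    "2017-07-27 14:13:00"))
)

rides_by_day <- toy_rides %>%
  # Pull out the day of the week as an ordered factor
  mutate(weekday = wday(start, label = TRUE)) %>%
  count(weekday)

rides_by_day
```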

And if you are in R and want to know the current date/time:

today()
[1] "2026-04-06"
now()
[1] "2026-04-06 00:56:53 PDT"

Topic Shift!

Factors with forcats

Motivation: Imposing Structure on Categorical Variables

library(pdxTrees)
pdxTrees <- get_pdxTrees_parks()

five_most_common <- c("Douglas-Fir", "Norway Maple",
                      "Western Redcedar", "Northern Red Oak",
                      "Pin Oak")

pdxCommon <- pdxTrees %>%
  filter(Common_Name %in% five_most_common)

Motivation: Imposing Structure on Categorical Variables

ggplot(data = pdxCommon,
       mapping = aes(x = Common_Name)) + 
  geom_bar() +
  coord_flip()

How might we want to restructure this graph?

Levels and Class

  • Why does Common_Name have no levels?
levels(pdxCommon$Common_Name)
NULL
class(pdxCommon$Common_Name)
[1] "character"
pdxCommon <- mutate(pdxCommon, Common_Name = factor(Common_Name))

levels(pdxCommon$Common_Name)
[1] "Douglas-Fir"      "Northern Red Oak" "Norway Maple"     "Pin Oak"         
[5] "Western Redcedar"
class(pdxCommon$Common_Name)
[1] "factor"
  • How is R deciding the order of the levels?
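
A small sketch of the ordering question, on a made-up vector: by default factor() sorts the unique values, which for text means alphabetical order, and the levels argument overrides that.

```r
# Made-up tree names, deliberately not in alphabetical order
trees <- c("Pin Oak", "Douglas-Fir", "Norway Maple", "Douglas-Fir")

# Default: levels are the sorted unique values
default_order <- factor(trees)
levels(default_order)

# Explicit: levels appear in exactly the order you give
custom_order <- factor(trees,
                       levels = c("Pin Oak", "Norway Maple", "Douglas-Fir"))
levels(custom_order)
```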

What Are the levels/categories?

fct_unique(pdxCommon$Common_Name)
[1] Douglas-Fir      Northern Red Oak Norway Maple     Pin Oak         
[5] Western Redcedar
5 Levels: Douglas-Fir Northern Red Oak Norway Maple ... Western Redcedar
unique(pdxCommon$Common_Name)
[1] Douglas-Fir      Northern Red Oak Norway Maple     Pin Oak         
[5] Western Redcedar
5 Levels: Douglas-Fir Northern Red Oak Norway Maple ... Western Redcedar

Reorder the Levels

pdxCommon %>%
  mutate(Common_Name = 
           fct_infreq(Common_Name)) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()

  • Note: This code didn’t permanently change the order in pdxCommon. Why?

  • How might we want to restructure this graph?
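
On the note about the order not being permanent, here is a sketch with a made-up tibble (toy and reordered are illustrative names): mutate() returns a new data frame, so the original keeps its levels unless you assign the result back.

```r
library(dplyr)
library(forcats)

toy <- tibble(tree = factor(c("Douglas-Fir", "Pin Oak", "Pin Oak")))
levels_before <- levels(toy$tree)

# Piping into mutate() builds a new tibble; toy itself is untouched
reordered <- toy %>%
  mutate(tree = fct_infreq(tree))

levels(toy$tree)        # still the original, alphabetical order
levels(reordered$tree)  # most frequent level first

# To keep the new order, assign the result back
toy <- reordered
```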

Reverse the Levels

pdxCommon %>%
  mutate(Common_Name = 
           fct_infreq(Common_Name),
         Common_Name = 
           fct_rev(Common_Name)) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()

Or, If You Love the Pipe…

pdxCommon %>%
  mutate(Common_Name = 
           fct_infreq(Common_Name) %>%
           fct_rev()) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()

Reorder the Levels

pdxCommon %>%
  mutate(Common_Name = 
           fct_relevel(Common_Name, 
                       five_most_common)) %>%
  ggplot(mapping = aes(x = Common_Name)) + 
  geom_bar() +
  coord_flip()

  • Can also relevel manually

Reorder the Levels

pdxCommon %>%
  mutate(Common_Name = 
           fct_relevel(Common_Name,
                       "Norway Maple",
                       "Pin Oak")) %>%
  ggplot(mapping = aes(x = Common_Name)) + 
  geom_bar() +
  coord_flip()

  • Or maybe I just want to bring one or two categories to the front

What Have We Wrangled Here?

DBH_by_name <- pdxCommon %>%
  group_by(Common_Name) %>%
  summarize(mean_DBH = mean(DBH),
            lb_DBH = mean_DBH - 1.96*sd(DBH)/sqrt(n()),
            ub_DBH = mean_DBH + 1.96*sd(DBH)/sqrt(n())) 
DBH_by_name
# A tibble: 5 × 4
  Common_Name      mean_DBH lb_DBH ub_DBH
  <fct>               <dbl>  <dbl>  <dbl>
1 Douglas-Fir          29.6   29.3   29.8
2 Northern Red Oak     29.4   28.3   30.5
3 Norway Maple         20.3   19.9   20.8
4 Pin Oak              25.6   24.8   26.4
5 Western Redcedar     18.1   17.3   18.9
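
One reading of the summary above is a mean with rough 95% bounds: mean plus or minus 1.96 standard errors. A sketch of that arithmetic on made-up diameters (not from pdxTrees), assuming the usual normal approximation:

```r
# Made-up diameters for one species
dbh <- c(10, 12, 14, 16, 18)

mean_dbh <- mean(dbh)
se_dbh   <- sd(dbh) / sqrt(length(dbh))  # standard error of the mean

# 1.96 is the normal-approximation multiplier for roughly 95% coverage
lb <- mean_dbh - 1.96 * se_dbh
ub <- mean_dbh + 1.96 * se_dbh

c(lower = lb, mean = mean_dbh, upper = ub)
```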

Reordering by Another Variable

ggplot(data = DBH_by_name, 
      mapping = aes(y = mean_DBH,
                    x = Common_Name)) +
  geom_point() +
  geom_errorbar(mapping =
                  aes(ymin = lb_DBH,
                      ymax = ub_DBH),
                width = 0.4)

  • How might we want to reorder Common_Name?

Reordering by Another Variable

DBH_by_name %>%
  mutate(Common_Name =
           fct_reorder(Common_Name,
                       -mean_DBH)) %>%
  ggplot(mapping = aes(y = mean_DBH,
                       x = Common_Name)) +
  geom_point() +
  geom_errorbar(mapping =
                  aes(ymin = lb_DBH,
                      ymax = ub_DBH),
                width = 0.4)
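
Negating mean_DBH is one way to get descending order. fct_reorder() also has a .desc argument that should give the same result; the sketch below checks this on the rounded means copied from the DBH_by_name output earlier (the vector names are illustrative).

```r
library(forcats)

# Names and rounded means copied from the printed DBH_by_name table
common_name <- factor(c("Douglas-Fir", "Northern Red Oak", "Norway Maple",
                        "Pin Oak", "Western Redcedar"))
mean_DBH <- c(29.6, 29.4, 20.3, 25.6, 18.1)

by_negation <- fct_reorder(common_name, -mean_DBH)
by_desc     <- fct_reorder(common_name, mean_DBH, .desc = TRUE)

levels(by_negation)
identical(levels(by_negation), levels(by_desc))
```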

Reordering by Other Variables

ggplot(data = pdxCommon,
       mapping = 
         aes(x = DBH,
             y = Total_Annual_Services,
             color = Condition)) +
  geom_smooth()

  • How might we want to reorder Condition?

Factors

Other useful functions in forcats:

  • fct_collapse(): Collapse some levels together
  • fct_drop(): Remove levels (useful after a filter()!)
  • fct_recode(): Change names of levels
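
A quick sketch of these three functions on a made-up factor (the object names are illustrative, not from the pdxTrees code):

```r
library(forcats)

trees <- factor(c("Douglas-Fir", "Pin Oak", "Northern Red Oak", "Pin Oak"))

# fct_collapse(): new level name = vector of old levels to merge
collapsed <- fct_collapse(trees, Oak = c("Pin Oak", "Northern Red Oak"))
levels(collapsed)

# fct_recode(): new name = old name
recoded <- fct_recode(trees, "Doug Fir" = "Douglas-Fir")
levels(recoded)

# fct_drop(): subsetting keeps unused levels until they are dropped
no_fir <- trees[trees != "Douglas-Fir"]
levels(no_fir)            # still lists "Douglas-Fir"
levels(fct_drop(no_fir))  # unused level removed
```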

And now:

Strings with stringr!

Language

String

x <- "lemur"

Character vector

x <- c("capybara", "lemur", "pigeon")

Factor vector

x <- factor(x)
levels(x)
[1] "capybara" "lemur"    "pigeon"  

String Manipulation with stringr

  • Learn how to handle character vectors!
    • Character manipulation
    • Pattern matching
  • Let’s look at some of the functionalities of stringr using a character vector of song lyrics.

Our Toy Lyric

lyric <- c("But I would walk 500 miles,",
              "And I would walk 500 more,",
              "Just to be the man who walks a 1000 miles,",
              "To fall down at your door")
lyric
[1] "But I would walk 500 miles,"               
[2] "And I would walk 500 more,"                
[3] "Just to be the man who walks a 1000 miles,"
[4] "To fall down at your door"                 
  • Song?
  • Artist?

String Length

length(lyric)
[1] 4
library(stringr)
str_length(lyric)
[1] 27 26 42 25
  • Most stringr functions start with str_

Accessing and Replacing

str_sub(string = lyric[1], start = 18, end = 20)
[1] "500"
str_sub(string = lyric[1], start = 18, end = 20) <- "2"
lyric
[1] "But I would walk 2 miles,"                 
[2] "And I would walk 500 more,"                
[3] "Just to be the man who walks a 1000 miles,"
[4] "To fall down at your door"                 

Change Cases

str_to_upper(lyric)
[1] "BUT I WOULD WALK 2 MILES,"                 
[2] "AND I WOULD WALK 500 MORE,"                
[3] "JUST TO BE THE MAN WHO WALKS A 1000 MILES,"
[4] "TO FALL DOWN AT YOUR DOOR"                 


str_to_title(lyric)
[1] "But I Would Walk 2 Miles,"                 
[2] "And I Would Walk 500 More,"                
[3] "Just To Be The Man Who Walks A 1000 Miles,"
[4] "To Fall Down At Your Door"                 


str_to_lower(lyric)
[1] "but i would walk 2 miles,"                 
[2] "and i would walk 500 more,"                
[3] "just to be the man who walks a 1000 miles,"
[4] "to fall down at your door"                 

Sorting

str_sort(lyric)
[1] "And I would walk 500 more,"                
[2] "But I would walk 2 miles,"                 
[3] "Just to be the man who walks a 1000 miles,"
[4] "To fall down at your door"                 

Pattern Matching

  • Learn to:
    • Detect pattern
    • Extract pattern
    • Replace pattern
    • Split pattern
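
A minimal sketch of the four tasks with a fixed pattern, on a two-element made-up vector (toy is an illustrative name):

```r
library(stringr)

toy <- c("And I would walk 500 more,", "To fall down at your door")

detected  <- str_detect(toy, "500")        # does each string contain it?
extracted <- str_extract(toy, "500")       # first match, NA if none
replaced  <- str_replace(toy, "500", "2")  # swap the first match
pieces    <- str_split(toy, " ")           # break apart at each space

detected
extracted
replaced
pieces[[1]]
```

Note that str_split() returns a list, with one character vector per input string.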

Common Goal: Match a particular pattern

I want to match the pattern 500 from lyric.

lyric
[1] "But I would walk 2 miles,"                 
[2] "And I would walk 500 more,"                
[3] "Just to be the man who walks a 1000 miles,"
[4] "To fall down at your door"                 


str_view_all(string = lyric, pattern = "500")
[1] │ But I would walk 2 miles,
[2] │ And I would walk <500> more,
[3] │ Just to be the man who walks a 1000 miles,
[4] │ To fall down at your door


or:

str_view(string = lyric, pattern = "500")
[2] │ And I would walk <500> more,

Let’s make it more general.

I want to locate all the numbers.

lyric
[1] "But I would walk 2 miles,"                 
[2] "And I would walk 500 more,"                
[3] "Just to be the man who walks a 1000 miles,"
[4] "To fall down at your door"                 


str_view_all(lyric, "500|1000|2")
[1] │ But I would walk <2> miles,
[2] │ And I would walk <500> more,
[3] │ Just to be the man who walks a <1000> miles,
[4] │ To fall down at your door

Trivia Time!

Name the artist and song title for each of the following!

lyrics <- c("But I would walk 500 miles",
            "Yeah, 360. When you're in the mirror, do you like what you see?", 
            "I have loved you for a 1000 years, I'll love you for a 1000 more",
            "Where 2 and 2 always makes a 5",
            "17-38, ay",
            "I'm so 3008, You so 2000 and late")

How should we modify the code to locate all the numbers from these lyrics of various songs?

lyrics
[1] "But I would walk 500 miles"                                      
[2] "Yeah, 360. When you're in the mirror, do you like what you see?" 
[3] "I have loved you for a 1000 years, I'll love you for a 1000 more"
[4] "Where 2 and 2 always makes a 5"                                  
[5] "17-38, ay"                                                       
[6] "I'm so 3008, You so 2000 and late"                               
str_view_all(lyrics, "500|1000|2")
[1] │ But I would walk <500> miles
[2] │ Yeah, 360. When you're in the mirror, do you like what you see?
[3] │ I have loved you for a <1000> years, I'll love you for a <1000> more
[4] │ Where <2> and <2> always makes a 5
[5] │ 17-38, ay
[6] │ I'm so 3008, You so <2>000 and late

How should we modify the code to locate all the numbers from these lyrics of various songs?

lyrics
[1] "But I would walk 500 miles"                                      
[2] "Yeah, 360. When you're in the mirror, do you like what you see?" 
[3] "I have loved you for a 1000 years, I'll love you for a 1000 more"
[4] "Where 2 and 2 always makes a 5"                                  
[5] "17-38, ay"                                                       
[6] "I'm so 3008, You so 2000 and late"                               
str_view_all(lyrics, "500|1000|2|360|5|17|38|3008|2000")
[1] │ But I would walk <500> miles
[2] │ Yeah, <360>. When you're in the mirror, do you like what you see?
[3] │ I have loved you for a <1000> years, I'll love you for a <1000> more
[4] │ Where <2> and <2> always makes a <5>
[5] │ <17>-<38>, ay
[6] │ I'm so <3008>, You so <2>000 and late
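
Why does line 6 still show <2>000? With |, the alternatives are tried in the order they are written, and "2" comes before "2000" in the pattern. A small sketch of that behavior, using a shortened pattern for illustration:

```r
library(stringr)

line <- "I'm so 3008, You so 2000 and late"

# "2" is listed before "2000", so it wins at the start of "2000"
short_first <- str_extract_all(line, "3008|2|2000")[[1]]
short_first

# Listing the longer alternative first lets it match in full
long_first <- str_extract_all(line, "3008|2000|2")[[1]]
long_first
```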

Need for More Sophisticated Pattern Matching

But now imagine you had a very long vector and wanted to locate any number.

str_view_all(lyrics, "1|2|3|4...")
  • Not a good approach!

  • Next time: Regular Expressions!
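
As a small preview only, and not a full treatment: the character class [0-9] matches any single digit, and + means "one or more", so one short pattern can stand in for the whole list of alternatives. The sketch below uses three of the trivia lyrics.

```r
library(stringr)

lyrics <- c("But I would walk 500 miles",
            "17-38, ay",
            "I'm so 3008, You so 2000 and late")

# One or more digits in a row, wherever they appear
numbers <- str_extract_all(lyrics, "[0-9]+")
numbers
```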