Final Exam Review Module 2

Git to Strings

Agenda

  • Git
  • Data Wrangling
  • Data Viz and Grammar of Graphics
  • Project 3: ER Visits
  • EDA and Explanatory DA
  • Joins and Pivots
  • Strings
  • Project 4: EV Power

Git

Git Basics

  • Git is a version control system that helps you track changes to files and collaborate with others. It is part of your computational toolbox, alongside R and the Unix shell
  • A repository (or “repo”) is a directory that contains your project files and the history of changes made to those files

Common Git commands:

  • git init: Initialize a new Git repository.
  • git clone <url>: Clone an existing repository from a URL.
  • git add <file>: Stage changes to a file for the next commit.
  • git commit -m "message": Commit staged changes with a message.
  • git push: Push committed changes to a remote repository.

Data Wrangling

Base R vs. dplyr

Base R lets you subset data frames using $, [], and logical expressions, but this often becomes messy and hard to read:

library(tidyverse)
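# height is recorded in centimeters, so no row satisfies height < 1.75;
# the six all-NA rows below come from characters whose height is missing
starwars[starwars$height < 1.75, ]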
# A tibble: 6 × 14
  name  height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 <NA>      NA    NA <NA>       <NA>       <NA>              NA <NA>  <NA>  
2 <NA>      NA    NA <NA>       <NA>       <NA>              NA <NA>  <NA>  
3 <NA>      NA    NA <NA>       <NA>       <NA>              NA <NA>  <NA>  
4 <NA>      NA    NA <NA>       <NA>       <NA>              NA <NA>  <NA>  
5 <NA>      NA    NA <NA>       <NA>       <NA>              NA <NA>  <NA>  
6 <NA>      NA    NA <NA>       <NA>       <NA>              NA <NA>  <NA>  
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
starwars$height_in <- starwars$height / 2.54  # centimeters to inches
mean(starwars$height[starwars$homeworld == "Tatooine"])
[1] NA

How is dplyr better?

  • Reusing clean data frames

  • Using readable verbs (filter(), select(), mutate(), etc.)

  • Referencing columns without quotes

  • Producing consistent data frames as output
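
One concrete difference is how missing values are handled when picking rows. The sketch below uses a made-up three-row tibble (the names and heights are hypothetical) to contrast base R subsetting with filter():

```r
library(dplyr)

# Toy data (hypothetical values) with one missing height, in centimeters
heights <- tibble(
  name   = c("A", "B", "C"),
  height = c(160, NA, 190)
)

# Base R: the NA comparison produces an extra all-NA row
base_result <- heights[heights$height < 175, ]
nrow(base_result)   # 2

# dplyr: filter() keeps only rows where the condition is TRUE
dplyr_result <- filter(heights, height < 175)
nrow(dplyr_result)  # 1
```
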

Review Question

Q: In Base R, how do you compute the average height of characters from Tatooine?

Answer

mean(starwars$height[starwars$homeworld == "Tatooine"], na.rm = TRUE)
[1] 169.8

Core dplyr functions

  • slice(): pick rows by number
library(dplyr)
slice(starwars, 2)
# A tibble: 1 × 15
  name  height  mass hair_color skin_color eye_color birth_year sex   gender   
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>    
1 C-3PO    167    75 <NA>       gold       yellow           112 none  masculine
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, height_in <dbl>
  • select(): pick columns by name or helper
select(starwars, homeworld)
# A tibble: 87 × 1
   homeworld
   <chr>    
 1 Tatooine 
 2 Tatooine 
 3 Naboo    
 4 Tatooine 
 5 Alderaan 
 6 Tatooine 
 7 Tatooine 
 8 Tatooine 
 9 Tatooine 
10 Stewjon  
# ℹ 77 more rows

  • filter(): pick rows by condition
filter(starwars, height < 1.75)  # 0 rows: height is in cm, and filter() drops NA comparisons
# A tibble: 0 × 15
# ℹ 15 variables: name <chr>, height <int>, mass <dbl>, hair_color <chr>,
#   skin_color <chr>, eye_color <chr>, birth_year <dbl>, sex <chr>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, height_in <dbl>
  • mutate(): create or transform columns
mutate(starwars, height_in = height / 2.54)  # height is in centimeters
# A tibble: 87 × 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, height_in <dbl>

  • summarize(): collapse many rows into a single summary row
summarize(starwars, avg_height = mean(height))  # NA: some heights are missing
# A tibble: 1 × 1
  avg_height
       <dbl>
1         NA
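
The NA above appears because mean() returns NA whenever any input is missing. A minimal sketch of the usual remedy, using a hypothetical three-row tibble rather than starwars:

```r
library(dplyr)

# Hypothetical data with one missing height
df <- tibble(height = c(150, NA, 170))

with_na    <- summarize(df, avg_height = mean(height))                # NA
without_na <- summarize(df, avg_height = mean(height, na.rm = TRUE))  # 160
```

Whether dropping missing values is appropriate depends on why they are missing, so na.rm = TRUE is a choice to make deliberately rather than a default.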

Review Question

Q: Using dplyr, how do you compute the average height of characters from Tatooine?

Answer

summarise(
  filter(starwars, homeworld == "Tatooine"),
  avg_height = mean(height)
)
# A tibble: 1 × 1
  avg_height
       <dbl>
1       170.

Chaining with the Pipe Operator

  • The pipe operator |> passes the result on its left as the first argument of the function on its right, so you can chain multiple dplyr functions in the order you would read them.
starwars |>
  filter(homeworld == "Tatooine") |>
  summarize(mean(height))
# A tibble: 1 × 1
  `mean(height)`
           <dbl>
1           170.
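
To make the equivalence concrete, here is a small sketch showing that a nested call and a piped chain give the same result. The tibble df is made up for illustration, and naming the summary column avoids the backticked `mean(height)` header seen above:

```r
library(dplyr)

# Hypothetical data
df <- tibble(
  homeworld = c("Tatooine", "Tatooine", "Naboo"),
  height    = c(172, 167, 96)
)

# Nested: read from the inside out
nested <- summarize(filter(df, homeworld == "Tatooine"), avg_height = mean(height))

# Piped: read from top to bottom
piped <- df |>
  filter(homeworld == "Tatooine") |>
  summarize(avg_height = mean(height))

piped$avg_height  # 169.5
```
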

Grouped Operations

  • Use group_by() to perform operations by group
  • It marks groups in the data, allowing functions like summarise() and mutate() to work group by group instead of on the whole dataset

Example

Without grouping:

starwars |>
  summarise(avg_height = mean(height))
# A tibble: 1 × 1
  avg_height
       <dbl>
1         NA

With grouping:

starwars |>
  group_by(homeworld) |>
  mutate(height_mean = mean(height))
# A tibble: 87 × 16
# Groups:   homeworld [49]
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, height_in <dbl>, height_mean <dbl>

Review Question

Q: Explain what happens if you do group_by() → mutate() → summarize().

Answer

A:

  • group_by() marks the groups; the data itself is unchanged

  • mutate() calculates new variables within each group (such as group means or group z-scores) and keeps every row

  • summarize(), when applied afterward, collapses each group to one row, producing one summary per group

  • After summarize(), the last grouping variable is dropped by default; with a single grouping variable the result is no longer grouped, because each group has been reduced to one row
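
A minimal sketch of this sequence, assuming a hypothetical four-row tibble and an illustrative column named deviation:

```r
library(dplyr)

# Hypothetical data: two homeworlds, two characters each
df <- tibble(
  homeworld = c("Tatooine", "Tatooine", "Naboo", "Naboo"),
  height    = c(170, 180, 100, 200)
)

result <- df |>
  group_by(homeworld) |>
  mutate(deviation = height - mean(height)) |>  # uses each group's own mean
  summarize(max_deviation = max(deviation))     # one row per homeworld

result                 # Naboo: 50, Tatooine: 5
is_grouped_df(result)  # FALSE, since the only grouping variable was dropped
```
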

Data Viz and Grammar of Graphics

What is the Grammar of Graphics?

A systematic way to describe and build visualizations from three core components:

  • Data

  • Aesthetic mappings (link variables → visual channels)

  • Geometries (how data becomes marks: points, lines, bars)

Review Question

Q: What is one example of an aesthetic mapping? What visual channel does it use?

Answer

A: An example is mapping species to color. This uses the color visual channel to encode differences between species.

Different geometries

Different geoms are different ways to display the same data:

  • geom_point(): scatterplots (two continuous variables)
  • geom_line(): line plots (a continuous variable over time)
  • geom_bar(): bar plots (categorical variable counts or summaries)
  • geom_histogram(): histograms (distribution of a single continuous variable)
  • geom_boxplot(): boxplots (distribution summaries by category)

… and more, depending on the use case
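
As a rough illustration, the same dataset can be drawn with different geoms just by swapping the layer. This sketch uses mpg, a dataset that ships with ggplot2, as a stand-in; the binwidth is an arbitrary choice:

```r
library(ggplot2)

# hwy = highway miles per gallon, class = vehicle type

# Distribution of one continuous variable
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Distribution summaries by category
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Counts of a categorical variable
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()
```

Printing any of these objects (for example, p_box) draws the plot.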

Example components of ggplot

library(palmerpenguins)
penguins |> 
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species,
             shape = sex)) +
  geom_point()

  • Data: palmerpenguins::penguins

  • Aesthetics: x, y, color, shape

  • Geometry: geom_point()

Project 3: ER Visits

Core skills:

  • Data cleaning (using dplyr)
  • EDA (including missing value analysis)
  • Communicating findings (using ggplot2)

EDA vs. Explanatory Data Analysis

EDA

Exploratory Data Analysis

  • Open-ended exploration of data to find patterns or anomalies
  • Often uses many plots and summary statistics

Example questions to answer:

  • What do the rows and columns represent?
  • Is there any missing data?
  • What are the distributions of key variables?
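
A first pass at these questions might look like the sketch below. It uses airquality, a dataset built into R, purely as a stand-in for a project dataset:

```r
# airquality ships with base R: daily air measurements with some missing values
eda_dim     <- dim(airquality)             # number of rows and columns
eda_missing <- colSums(is.na(airquality))  # missing values per column

str(airquality)            # column names and types
eda_missing                # Ozone and Solar.R contain NAs
summary(airquality$Ozone)  # distribution of one key variable
```
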

Explanatory Data Analysis

  • Focused on communicating specific findings or insights.
  • Uses clear, well-designed visualizations to tell a story.

Example tools to use:

  • Clear titles and labels
  • Setting aesthetics and themes
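
One way to apply these tools is sketched below, again with ggplot2's bundled mpg data as a stand-in. The title wording and the choice of theme_minimal() are illustrative, not prescribed:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    title = "Larger engines tend to get fewer highway miles per gallon",
    x     = "Engine displacement (liters)",
    y     = "Highway miles per gallon",
    color = "Vehicle class"
  ) +
  theme_minimal()  # a plain theme keeps attention on the data
```
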

Joins and Pivots

Joins

  • Joins combine two data frames based on a common key.
  • Types of joins:
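    • Inner join: keeps only rows whose key appears in both tables.
    • Left join: keeps all rows from the left table and adds matching columns from the right; unmatched rows get NA.
    • Right join: keeps all rows from the right table and adds matching columns from the left; unmatched rows get NA.
    • Full join: keeps all rows from both tables, matching where possible and filling the rest with NA.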
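
The four join types can be compared on a pair of tiny tables. Both tibbles below, including the patient_id key, are hypothetical:

```r
library(dplyr)

# Hypothetical tables sharing the key patient_id
visits   <- tibble(patient_id = c(1, 2, 3), reason = c("fall", "fever", "burn"))
patients <- tibble(patient_id = c(1, 2, 4), age = c(34, 71, 9))

inner <- inner_join(visits, patients, by = "patient_id")  # ids 1, 2
left  <- left_join(visits, patients, by = "patient_id")   # ids 1, 2, 3 (age is NA for 3)
right <- right_join(visits, patients, by = "patient_id")  # ids 1, 2, 4 (reason is NA for 4)
full  <- full_join(visits, patients, by = "patient_id")   # ids 1, 2, 3, 4
```
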

Pivots

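  • Pivoting reshapes data between wide and long formats.
  • pivot_longer(): converts wide format to long format by turning column names into values (more rows, fewer columns).
  • pivot_wider(): converts long format to wide format by turning values into column names (fewer rows, more columns).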
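
A round trip between the two shapes, sketched on a made-up table of station counts by state and year (all names and numbers are hypothetical):

```r
library(tidyr)
library(tibble)

# Wide: one column per year
wide <- tibble(
  state  = c("CA", "TX"),
  `2022` = c(10, 5),
  `2023` = c(12, 7)
)

# Long: one row per state-year combination
long <- pivot_longer(
  wide,
  cols      = c(`2022`, `2023`),
  names_to  = "year",
  values_to = "stations"
)

# Back to wide again
back <- pivot_wider(long, names_from = year, values_from = stations)
```
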

Review Question

Q: When would you use a left join instead of an inner join?

Answer

A: You would use a left join when you want to keep all rows from the left table, even if there are no matching rows in the right table. This is useful when you want to preserve all data from the primary dataset while adding information from a secondary dataset where available.

Strings

String Basics

  • Strings are sequences of characters enclosed in quotes.
  • Common stringr functions:
    • str_length(): gets the length of strings
    • str_c(): concatenates strings
    • str_sub(): extracts substrings
    • str_detect(): checks for pattern presence
    • str_view(): visualizes patterns in strings
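
A quick sketch of the first four functions on a hypothetical two-element vector:

```r
library(stringr)

x <- c("apple", "banana")

lens     <- str_length(x)              # 5 6
joined   <- str_c(x, "pie", sep = " ") # "apple pie" "banana pie"
prefixes <- str_sub(x, 1, 3)           # "app" "ban"
has_an   <- str_detect(x, "an")        # FALSE TRUE
```
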

Regular Expressions

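  • A regular expression (regex) is a sequence of characters that describes a pattern to search for within a longer piece of text.
  • Search for:
    • Literal characters (e.g., data)
    • Metacharacters (e.g., . ^ $ * +)
    • Character sets (e.g., [aeiou])
    • Character classes (e.g., \d for any digit, written as "\\d" in an R string)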
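
Each kind of pattern can be tried with str_detect(). The vector below is invented for illustration:

```r
library(stringr)

x <- c("EV-100", "ev 22", "Tesla", "a.b")

literal   <- str_detect(x, "EV")         # literal characters: TRUE FALSE FALSE FALSE
set_match <- str_detect(x, "^[Ee][Vv]")  # anchor plus character sets: TRUE TRUE FALSE FALSE
digits    <- str_detect(x, "\\d+")       # digit class: TRUE TRUE FALSE FALSE
dot       <- str_detect(x, "\\.")        # escaped metacharacter: FALSE FALSE FALSE TRUE
numbers   <- str_extract(x, "\\d+")      # "100" "22" NA NA
```

Note that an unescaped "." would match any character, which is why the literal dot is written as "\\.".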

Review Question

Q: How would you use str_detect() to find all strings that contain the word “data”?

library(stringr)
strings <- c("data science", "statistics", "big data", "analysis")

Answer

str_detect(strings, "data")
[1]  TRUE FALSE  TRUE FALSE
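
One caveat: the plain pattern "data" also matches longer words that merely contain those letters. If a whole-word match is wanted, the word-boundary anchor \b is one option. The vector words is a made-up example:

```r
library(stringr)

# Hypothetical examples: only the first contains "data" as a separate word
words <- c("big data", "database", "metadata")

anywhere   <- str_detect(words, "data")        # TRUE TRUE TRUE
whole_word <- str_detect(words, "\\bdata\\b")  # TRUE FALSE FALSE
```
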

Project 4: EV Power

Core skills:

  • Data cleaning (using stringr and regex)
  • Table joins and pivots
  • Mapping (using ggplot2, sf, leaflet)
  • Communicating findings (using Quarto Dashboards)

Summary

  • Git helps manage version control and collaboration.
  • dplyr simplifies data wrangling with readable functions and piping.
  • The Grammar of Graphics provides a framework for building visualizations.
  • EDA is exploratory, while Explanatory Data Analysis focuses on communication.
  • Joins and pivots are essential for combining and reshaping data.
  • String manipulation is crucial for handling text data in R.
  • Practice these concepts to prepare for the final exam!