Git to Strings
- git init: Initialize a new Git repository.
- git clone <url>: Clone an existing repository from a URL.
- git add <file>: Stage changes to a file for the next commit.
- git commit -m "message": Commit staged changes with a message.
- git push: Push committed changes to a remote repository.

Base R lets you subset data frames using $, [], and logical expressions, but this often becomes messy and hard to read:
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 <NA> NA NA <NA> <NA> <NA> NA <NA> <NA>
2 <NA> NA NA <NA> <NA> <NA> NA <NA> <NA>
3 <NA> NA NA <NA> <NA> <NA> NA <NA> <NA>
4 <NA> NA NA <NA> <NA> <NA> NA <NA> <NA>
5 <NA> NA NA <NA> <NA> <NA> NA <NA> <NA>
6 <NA> NA NA <NA> <NA> <NA> NA <NA> <NA>
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
[1] NA
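The code behind the output above is not shown. One plausible way to get this kind of result, assuming the dplyr::starwars data and a hypothetical height threshold of 300, is sketched below: comparing against a column with missing values yields NA, and bracket subsetting turns each NA into a row filled with NA.

```r
library(dplyr)  # used here only for the starwars dataset

# Hypothetical condition: no character is taller than 300 cm, but
# missing heights make the comparison NA rather than FALSE.
too_tall <- starwars[starwars$height > 300, ]

# Each NA in the logical index becomes a row filled with NA.
too_tall

# mean() also returns NA unless missing values are removed explicitly.
mean(starwars$height)
mean(starwars$height, na.rm = TRUE)
```

Neither result signals an error, which is part of what makes this style easy to get wrong.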
dplyr improves on this by:
- Reusing clean data frames
- Using readable verbs (filter(), select(), mutate(), etc.)
- Referencing columns without quotes
- Producing consistent data frames as output
Q: In Base R, how do you compute the average height of characters from Tatooine?
[1] 169.8
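One possible base R answer, assuming the dplyr::starwars data, is sketched here. The variable names are illustrative; which() is used so that characters with a missing homeworld are dropped instead of producing NA indices.

```r
library(dplyr)  # used here only for the starwars dataset

# which() keeps only the positions where the comparison is TRUE,
# discarding the NA results that come from missing homeworlds.
tatooine_rows <- which(starwars$homeworld == "Tatooine")

avg_height <- mean(starwars$height[tatooine_rows], na.rm = TRUE)
avg_height
```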
slice(): pick rows by number
# A tibble: 1 × 15
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none masculine
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, height_in <dbl>
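The output above is consistent with picking the second row. A minimal sketch, assuming the dplyr::starwars data (the height_in column in the output was presumably added in an earlier step and is not recreated here):

```r
library(dplyr)

# Pick a single row by position
second_row <- starwars |> slice(2)
second_row$name

# Pick a range of rows by position
first_three <- starwars |> slice(1:3)
first_three$name
```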
select(): pick columns by name or helper
# A tibble: 87 × 1
homeworld
<chr>
1 Tatooine
2 Tatooine
3 Naboo
4 Tatooine
5 Alderaan
6 Tatooine
7 Tatooine
8 Tatooine
9 Tatooine
10 Stewjon
# ℹ 77 more rows
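A short sketch of both selection styles, assuming the dplyr::starwars data. The first call matches the single-column output above; the second uses the ends_with() helper and is added only as an illustration.

```r
library(dplyr)

# Select by name
homeworlds <- starwars |> select(homeworld)

# Select with a helper: every column whose name ends in "_color"
color_cols <- starwars |> select(ends_with("_color"))
names(color_cols)
```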
filter(): pick rows by condition
# A tibble: 0 × 15
# ℹ 15 variables: name <chr>, height <int>, mass <dbl>, hair_color <chr>,
# skin_color <chr>, eye_color <chr>, birth_year <dbl>, sex <chr>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, height_in <dbl>
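The condition that produced the zero-row output above is not shown. As a hedged illustration on the dplyr::starwars data, the sketch below contrasts a condition that matches rows with a hypothetical one that matches none because string comparison is case-sensitive.

```r
library(dplyr)

# Rows where the condition is TRUE are kept; FALSE and NA rows are dropped.
from_tatooine <- starwars |> filter(homeworld == "Tatooine")
nrow(from_tatooine)

# Lowercase "tatooine" matches nothing, so the result has zero rows
# but still carries every column.
no_match <- starwars |> filter(homeworld == "tatooine")
nrow(no_match)
```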
mutate(): create or transform columns
# A tibble: 87 × 15
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, height_in <dbl>
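The extra height_in column in the output suggests a unit conversion. The sketch below assumes that height is recorded in centimetres and that height_in means height in inches; the exact formula used originally is not shown, so treat the division by 2.54 as an assumption.

```r
library(dplyr)

# Assumption: height_in = height in inches (1 inch = 2.54 cm)
starwars_in <- starwars |>
  mutate(height_in = height / 2.54)

# New columns are appended at the end, so select a few to see them side by side
starwars_in |> select(name, height, height_in)
```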
summarize(): collapse to single row
# A tibble: 1 × 1
avg_height
<dbl>
1 NA
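The NA above is what mean() returns when the column contains missing values. A minimal sketch on the dplyr::starwars data showing the same call with and without na.rm = TRUE:

```r
library(dplyr)

# Missing heights propagate, so the summary is NA.
with_na <- starwars |>
  summarize(avg_height = mean(height))

# na.rm = TRUE drops missing values before averaging.
without_na <- starwars |>
  summarize(avg_height = mean(height, na.rm = TRUE))

with_na
without_na
```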
Q: Using dplyr, how do you compute the average height of characters from Tatooine?
# A tibble: 1 × 1
avg_height
<dbl>
1 170.
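One way to write this with dplyr, assuming the dplyr::starwars data, is to filter first and then summarize. The name tatooine_avg is illustrative.

```r
library(dplyr)

tatooine_avg <- starwars |>
  filter(homeworld == "Tatooine") |>                  # keep Tatooine rows
  summarize(avg_height = mean(height, na.rm = TRUE))  # collapse to one row

tatooine_avg
tatooine_avg$avg_height  # full-precision value
```

Tibbles print three significant digits by default, which is why the result appears as 170. rather than 169.8.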
|> allows you to chain multiple dplyr functions together in a readable way.
# A tibble: 1 × 1
`mean(height)`
<dbl>
1 170.
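To make the readability point concrete, the sketch below writes the same computation as a nested call and as a pipeline, assuming the dplyr::starwars data and R 4.1 or later for the native pipe. Leaving the summary unnamed gives the `mean(height)` column name seen in the output above.

```r
library(dplyr)

# Nested: read from the inside out
nested <- summarize(filter(starwars, homeworld == "Tatooine"), mean(height))

# Piped: read top to bottom, one step per line
piped <- starwars |>
  filter(homeworld == "Tatooine") |>
  summarize(mean(height))

all.equal(nested, piped)
```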
group_by(): perform operations by group
It makes summarise() and mutate() work group by group instead of on the whole dataset
Without grouping:
# A tibble: 1 × 1
avg_height
<dbl>
1 NA
With grouping:
# A tibble: 87 × 16
# Groups: homeworld [49]
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, height_in <dbl>, height_mean <dbl>
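The grouped output above can be approximated with the sketch below, assuming the dplyr::starwars data. It omits the height_in column, so the result has one column fewer than the output shown.

```r
library(dplyr)

with_group_mean <- starwars |>
  group_by(homeworld) |>
  # mean() is now computed separately within each homeworld
  mutate(height_mean = mean(height, na.rm = TRUE))

# Every row is kept; each row carries its own group's mean
with_group_mean |> select(name, homeworld, height, height_mean)
```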
Q: Explain what happens if you do group_by() → mutate() → summarize().
A:
- group_by() creates groups.
- mutate() calculates new variables within each group (like group means or group z-scores).
- summarize(), when applied afterward, collapses each group to one row, producing one summary per group.
- After summarize(), the last level of grouping is dropped automatically. With a single grouping variable the result is ungrouped, because each group has been reduced to one row and grouping no longer makes sense.
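The sequence can be sketched on the dplyr::starwars data as follows. The column names above_group_mean, n, and n_above are hypothetical, chosen only for this illustration.

```r
library(dplyr)

by_world <- starwars |>
  group_by(homeworld) |>
  # Within each homeworld: is this character taller than the group mean?
  mutate(above_group_mean = height > mean(height, na.rm = TRUE)) |>
  # Collapse each homeworld to a single row
  summarize(
    n = n(),
    n_above = sum(above_group_mean, na.rm = TRUE)
  )

by_world
is_grouped_df(by_world)  # FALSE: the only grouping level was dropped
```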
The grammar of graphics is a systematic way to describe and build visualizations from three core components:
Data
Aesthetic mappings (link variables → visual channels)
Geometries (how data becomes marks: points, lines, bars)
Q: What is one example of an aesthetic mapping? What visual channel does it use?
A: An example is mapping species to color. This uses the color visual channel to encode differences between species.
Different geoms = different ways to summarize a variable
- geom_point(): scatterplots (two continuous variables)
- geom_line(): line plots (continuous variable over time)
- geom_bar(): bar plots (categorical variable counts or summaries)
- geom_histogram(): histograms (distribution of a single continuous variable)
- geom_boxplot(): boxplots (distribution summaries by category)
… and more! Depends on the use case

Data: palmerpenguins::penguins
Aesthetics: x, y, color, shape
Geometry: geom_point()
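A plot with these three components might look like the sketch below. It assumes the ggplot2 and palmerpenguins packages are installed; the specific columns mapped to x, y, color, and shape are assumptions, since the original plot is not shown.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,                      # Data
  mapping = aes(                        # Aesthetic mappings (assumed columns)
    x = flipper_length_mm,
    y = body_mass_g,
    color = species,
    shape = species
  )
) +
  geom_point(na.rm = TRUE)              # Geometry; silently skip rows with missing values

p
```

Swapping geom_point() for another geom changes how the same mappings are drawn without touching the data or the aesthetics.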
Exploratory Data Analysis
- Open-ended exploration of data to find patterns or anomalies
- Often uses many plots and summary statistics
Example questions to answer:
- What are the columns and rows?
- Is there any missing data?
- What are the distributions of key variables?
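These three questions can each be answered with a line or two of code. The sketch below uses the dplyr::starwars data as a stand-in dataset; that choice is an assumption made for illustration.

```r
library(dplyr)

# What are the columns and rows?
dim(starwars)
glimpse(starwars)

# Is there any missing data? Count NAs per column
# (name through species are the atomic columns; list columns are skipped).
missing_counts <- starwars |>
  summarize(across(name:species, ~ sum(is.na(.x))))
missing_counts

# What is the distribution of a key variable?
summary(starwars$height)
```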
Example tools to use:
- Clear titles and labels
- Setting aesthetics and themes
pivot_longer(): converts wide format to long format.
pivot_wider(): converts long format to wide format.
Q: When would you use a left join instead of an inner join?
A: You would use a left join when you want to keep all rows from the left table, even if there are no matching rows in the right table. This is useful when you want to preserve all data from the primary dataset while adding information from a secondary dataset where available.
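The difference is easiest to see on a tiny example. The two tables below are made-up toy data, not taken from the course material.

```r
library(dplyr)

# Toy data (hypothetical): one character has no entry in the planets table
characters <- tibble(
  name = c("Luke", "Leia", "Han"),
  homeworld = c("Tatooine", "Alderaan", "Corellia")
)
planets <- tibble(
  homeworld = c("Tatooine", "Alderaan"),
  climate = c("arid", "temperate")
)

# Left join: all three characters are kept; Han gets NA for climate
left <- left_join(characters, planets, by = "homeworld")
left

# Inner join: only characters with a matching planet remain
inner <- inner_join(characters, planets, by = "homeworld")
inner
```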
str_length(): gets the length of strings
str_c(): concatenates strings
str_sub(): extracts substrings
str_detect(): checks for pattern presence
str_view(): visualizes patterns in strings
Q: How would you use str_detect() to find all strings that contain the word “data”?
[1] TRUE FALSE TRUE FALSE
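The input vector behind the output above is not shown. The sketch below uses a hypothetical vector x that gives the same pattern of results.

```r
library(stringr)

# Hypothetical input strings
x <- c("data science", "statistics", "big data", "R programming")

# TRUE wherever the pattern "data" occurs anywhere in the string
has_data <- str_detect(x, "data")
has_data

# Use the logical vector to keep only the matching strings
x[has_data]

# To match "data" only as a whole word, add word boundaries;
# this would not match a string such as "database".
str_detect(x, "\\bdata\\b")
```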
