Explanatory Data Analysis

Focus your Story

Exploratory Data Analysis (EDA)

Features

  • consumer: the analyst
  • goal: uncover structure for further analysis
  • follow templates
  • prioritize quick and accurate interpretation

Common Exploratory Questions

  • What is the unit of observation?
  • What are the variables, their types, and levels?
  • How is missingness encoded? How prevalent is it?
  • What is the distribution of the variables?
  • Are there any interesting outliers?
  • Use View() in Positron

Explanatory Data Analysis

Features

  • consumer: stakeholder
  • goal: tell a specific story
  • customized
  • polished aesthetic

Common Explanatory Questions

  • Who is my audience?
  • What is the main story?
  • How can I facilitate comparisons?
  • Can my plot be clearly read?
  • Is it polished and customized for audience?

Agenda for today

  1. Clarify with Labels
  2. Facilitate Comparisons
  3. Polish Your Look
    1. Themes
    2. Setting aesthetics
  4. Electric Vehicles Revisited

Clarify with Labels

Labels with labs()

Labels are the pieces of text added to a plot to describe what is plotted. They can appear in several places.

  • title
  • subtitle
  • x-axis
  • y-axis
  • legends

In ggplot2, each of these is an argument to a labs() layer. A legend is labeled through the name of the aesthetic it describes.

  • title
  • subtitle
  • x
  • y
  • color
  • fill
  • size
  • shape

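library(tidyverse)      # ggplot2, dplyr, readr, forcats
library(palmerpenguins) # assumed source of the penguins data

penguins |>
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point()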

penguins |>
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  labs(title = "Penguin size differs by species",
       x = "Bill Length (mm)",
       y = "Flipper Length (mm)",
       color = "Species")

Labels without labs()

An alternative method uses a separate layer for each label.

  • ggtitle()
  • xlab()
  • ylab()

I recommend you just use labs().
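To see that the two approaches are interchangeable, here is a minimal sketch that labels the same scatterplot both ways. It assumes the penguins data come from the palmerpenguins package, which the slides do not state, and the title text is made up for the example.

```r
library(ggplot2)
library(palmerpenguins) # assumption: source of the penguins data

base <- ggplot(penguins,
               aes(x = bill_length_mm,
                   y = flipper_length_mm)) +
  geom_point()

# One labs() layer sets every label at once
with_labs <- base +
  labs(title = "Bill vs. flipper length",
       x = "Bill Length (mm)",
       y = "Flipper Length (mm)")

# The helper functions set one label each
with_helpers <- base +
  ggtitle("Bill vs. flipper length") +
  xlab("Bill Length (mm)") +
  ylab("Flipper Length (mm)")
```

Both objects draw the same labeled plot; labs() simply keeps all the text in one place.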

Facilitate Comparisons

Facilitating Comparison

  1. How does this compare to that?
  2. How does this compare to what’s typical?
  3. How big is this variability relative to that variability?

Compares one species to another species.

p2 <- penguins |>
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  labs(title = "Penguin size differs by species",
       x = "Bill Length (mm)",
       y = "Flipper Length (mm)",
       color = "Species") +
  lims(x = c(0, 60),
       y = c(0, 233))

p2

Compares each species to a baseline of zero.

# p1 is the same scatterplot without lims(); it is defined under Themes below
p1
#  scale_x_log10()

Pay attention to where the x = 40 and x = 60 lines are…

p1 +
  scale_x_log10()

Limits and Scales

Emphasize certain comparisons (and occlude others) by altering the limits of the axes to include a relevant baseline and by using non-linear scales.

  • lims()
  • scale_x_log10()
  • scale_y_log10()
  • General scale_ family of layers
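One detail worth knowing when picking among these: lims() sets hard limits, so any observation outside the range is dropped with a warning. If the goal is only to include a baseline such as zero, expand_limits() widens the axes and leaves the rest to the data. The sketch below contrasts the two. expand_limits() is not used in these slides, and the palmerpenguins package is an assumed data source.

```r
library(ggplot2)
library(palmerpenguins) # assumption: source of the penguins data

p <- ggplot(penguins,
            aes(x = bill_length_mm,
                y = flipper_length_mm,
                color = species)) +
  geom_point()

# Hard limits: both ends of each axis are fixed,
# and points outside the range would be removed
p_lims <- p +
  lims(x = c(0, 60),
       y = c(0, 233))

# Only guarantee that zero is shown; the upper ends follow the data
p_zero <- p +
  expand_limits(x = 0, y = 0)
```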

Facilitating Comparison

  1. How does this compare to that?
  2. How does this compare to what’s typical?
  3. How big is this variability relative to that variability?

How can I make it easiest to see how common each of the three species is relative to the others?

penguins |>
  ggplot(aes(x = species)) +
  geom_bar()

forcats

library(forcats) # useful functions for factors
penguins |> 
  mutate(species = fct_infreq(species)) |>
  ggplot(aes(x = species)) +
  geom_bar()
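geom_bar() draws its bars in the order of the factor levels, so the reordering happens entirely in the levels. A small sketch on a made-up vector (not the penguins data) shows what fct_infreq() changes:

```r
library(forcats)

# Made-up vector for illustration only
x <- factor(c("b", "a", "b", "c", "b", "a"))

levels(x)             # default, alphabetical: "a" "b" "c"
levels(fct_infreq(x)) # most frequent first:   "b" "a" "c"
```

The values themselves are untouched; only the order of the levels is different.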

Polish Your Look

Themes

Themes describe a set of visual design choices made for the plot on things like the background, the gridlines, the axes, the tick marks, and the fonts.

Some ggplot2 themes (each one can be added as a layer)

  • theme_gray(): the default
  • theme_bw()
  • theme_minimal()

p1 <- penguins |>
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  labs(title = "Penguin size differs by species",
       x = "Bill Length (mm)",
       y = "Flipper Length (mm)",
       color = "Species")
p1

p1 +
  theme_bw()

p1 +
  theme_minimal()

There are many more themes…

ggplot themes can be stored inside packages.

# install.packages("ggthemes")
library(ggthemes)
p1 +
  theme_wsj()

ggthemes::theme_economist()

ggthemes::theme_fivethirtyeight()

ggthemes::theme_solarized()

ggthemes::theme_excel()

Facilitate Comparisons

Your Turn

Write the ggplot2 code that has generated the following plot.

EV: Explanatory Plot A

The Plan

Audience: Residents and decision makers in all fifty states.

My story: There is considerable variability from state to state in EV registrations.

My strategy: Display the registration counts in a manner that makes it easy to see the counts of each individual state equally well.

Where we left off

ev <- read_csv("data/ev-registrations.csv")
ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5))

Labels

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count")

Labels

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count")

What next?

Facilitating Comparisons

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt()

Facilitating Comparisons

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt(labels = scales::comma)
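Two things happen in that last layer, and each can be checked on its own. The square root scale compresses large values, so a 100-fold difference in counts becomes a 10-fold difference in bar height. The labels argument takes a function that turns the break values into text. The numbers below are made up for illustration and are not from the EV data.

```r
library(scales)

# Made-up counts for illustration only
counts <- c(100, 10000, 1234567)

# What the scale does to bar heights: 100-fold apart becomes 10-fold apart
sqrt(c(100, 10000)) # 10 100

# What labels = scales::comma does to the axis text
comma(counts)       # "100" "10,000" "1,234,567"
```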

What next?

Themes

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") + 
  scale_y_sqrt(labels = scales::comma) +
  theme_minimal()

Themes

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col() +
  theme_minimal(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt(labels = scales::comma)
Error in theme_minimal(axis.text.x = element_text(angle = 270, hjust = 0, : unused arguments (axis.text.x = element_text(angle = 270, hjust = 0, vjust = 0.5), plot.title = element_text(hjust = 0.5))

Themes

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt(labels = scales::comma)

Themes

Themes describe a set of visual design choices made for the plot on things like the background, the gridlines, the axes, the tick marks, and the fonts.

Some ggplot2 themes (each one can be added as a layer)

  • theme_gray(): the default
  • theme_bw()
  • theme_minimal()

theme() modifies individual settings of the existing theme, so add it after any complete theme such as theme_minimal().
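The ordering rule can be seen on a tiny example. The data frame here is hypothetical toy data, not the EV registrations file, and the object names are made up for the sketch.

```r
library(ggplot2)

# Hypothetical toy data for illustration only
toy <- data.frame(State = c("AA", "BB", "CC"),
                  Count = c(100, 10000, 400))

base <- ggplot(toy, aes(x = State, y = Count)) +
  geom_col()

rotate <- theme(axis.text.x = element_text(angle = 270,
                                           hjust = 0,
                                           vjust = 0.5))

# Rotation is kept: theme() adjusts the complete theme added before it
kept <- base + theme_minimal() + rotate

# Rotation is lost: the complete theme replaces the earlier adjustment
lost <- base + rotate + theme_minimal()
```

Printing kept shows rotated state labels, while lost falls back to the horizontal labels of theme_minimal().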

What next?

Setting Aesthetics

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col(color = "blue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt(labels = scales::comma)

Setting Aesthetics

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col(fill = "blue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt(labels = scales::comma)

Setting Aesthetics

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt(labels = scales::comma)
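In these plots the color is set to a constant outside aes(). For bars, color controls the outline and fill controls the interior. A common slip is to put the constant inside aes(), which maps it as if it were data. The sketch below contrasts the two on hypothetical toy data (not the EV registrations file).

```r
library(ggplot2)

# Hypothetical toy data for illustration only
toy <- data.frame(State = c("AA", "BB", "CC"),
                  Count = c(100, 10000, 400))

# Setting: a constant outside aes() makes every bar steelblue
set_fill <- ggplot(toy, aes(x = State, y = Count)) +
  geom_col(fill = "steelblue")

# Mapping: inside aes() the string is treated as a data value, so the
# bars get the default palette color and a legend labeled "steelblue"
mapped_fill <- ggplot(toy, aes(x = State, y = Count)) +
  geom_col(aes(fill = "steelblue"))
```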

EV: Explanatory Plot B

The Plan

Audience: Residents and decision makers in all fifty states.

My story: Just a few states account for most of the electric vehicle registrations.

My strategy: Display the registration counts in a manner that draws attention to the discrepancy between the highest and lowest states.

ev |>
  ggplot(aes(x = State, y = Count)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt(labels = scales::comma)

ev |>
  mutate(State = fct_reorder(State, Count, .desc = TRUE)) |>
  ggplot(aes(x = State, y = Count)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count") +
  scale_y_sqrt(labels = scales::comma)

ev |>
  mutate(State = fct_reorder(State, Count, .desc = TRUE)) |>
  ggplot(aes(x = State, y = Count)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 270, 
                                   hjust = 0, 
                                   vjust = 0.5),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Electric Vehicle Registrations by State",
       x = "",
       y = "Registration Count")
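The reordering step works the same way as fct_infreq(), except that the levels are ordered by a second variable instead of by frequency. A minimal sketch with made-up values (not the EV data):

```r
library(forcats)

# Made-up values for illustration only
state <- c("AA", "BB", "CC")
count <- c(100, 10000, 400)

levels(factor(state))                           # alphabetical: "AA" "BB" "CC"
levels(fct_reorder(state, count, .desc = TRUE)) # largest first: "BB" "CC" "AA"
```

With the states sorted from largest to smallest and the y-axis back on a linear scale, the gap between the top few states and the rest is the first thing the reader sees.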