In this post we will use the gh package to search GitHub for uses of the different geoms in ggplot2.
We load some necessary packages.
library(tidyverse)
library(scales)
library(stringr)
library(gh)
Next, we find all the functions exported by ggplot2, keep only those of the form geom_*, and put them in a data frame.
geoms <- getNamespaceExports("ggplot2") %>%
  keep(str_detect, pattern = "^geom") %>%
  data_frame(geom = .)
geoms
## # A tibble: 44 x 1
## geom
## <chr>
## 1 geom_text
## 2 geom_vline
## 3 geom_col
## 4 geom_tile
## 5 geom_qq
## 6 geom_label
## 7 geom_line
## 8 geom_smooth
## 9 geom_path
## 10 geom_spoke
## # ... with 34 more rows
Since there are 44 different geoms, we will have to make 44 separate calls to the GitHub API. To avoid running into a rate limit, we define a function operator called delay, which takes as input a function f and a number delay, and returns that same function, modified so that each call is delayed by delay seconds.
delay <- function(f, delay = 3) {
  function(...) {
    Sys.sleep(delay)
    f(...)
  }
}
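To see what delay does, here is a small illustrative example (not part of the analysis itself): wrapping Sys.time() yields a function that pauses for a second before returning the current time.
# Illustrative only: a version of Sys.time() that waits one second
slow_time <- delay(Sys.time, delay = 1)
before <- Sys.time()
slow_time() - before   # difference is roughly one second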
When web scraping or making API calls, it's often nice to get some real-time feedback on which calls are being made. For this we define a second function operator, w_msg, which modifies a function so that it prints out its first argument whenever it is called.
w_msg <- function(f) {
  function(...) {
    args <- list(...)
    message("Processing: ", args[[1]])
    f(...)
  }
}
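Because both operators return ordinary functions, they compose; here is a toy example (toupper stands in for the real API call):
# Illustrative only: a chatty, slowed-down toupper()
noisy_upper <- w_msg(delay(toupper, delay = 1))
noisy_upper("geom_point")
## Processing: geom_point
## [1] "GEOM_POINT"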
Our workhorse function will be search_gh, which takes a search query parameter q and searches GitHub for that query. (NB: if you want to run this yourself, you should follow the instructions in ?gh::gh_whoami for how to handle your GitHub API token.)
search_gh <- function(q, ...) {
  gh("/search/code", q = q, ...)
}
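Before mapping over all 44 geoms, it is worth checking your credentials (gh typically picks up the token from the GITHUB_PAT environment variable) and trying a single query. The response is a list whose total_count field is what we extract below; the exact count you get back will of course change over time. A quick sanity check might look like this:
gh::gh_whoami()  # reports the account and token gh will use

# One test query; its total_count field is what we care about
test <- search_gh("geom_hex")
test$total_count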
With all this in place, we simply map our modified search_gh over all 44 geoms and extract the total_count field.
geoms <- geoms %>%
  mutate(
    result = map(geom, w_msg(delay(search_gh))),
    total_count = map_int(result, "total_count")
  )
## Processing: geom_text
## Processing: geom_vline
## Processing: geom_col
## Processing: geom_tile
## Processing: geom_qq
## ...
geoms
## # A tibble: 44 x 3
## geom result total_count
## <chr> <list> <int>
## 1 geom_text <S3: gh_response> 38912
## 2 geom_vline <S3: gh_response> 29418
## 3 geom_col <S3: gh_response> 5088
## 4 geom_tile <S3: gh_response> 13993
## 5 geom_qq <S3: gh_response> 482
## 6 geom_label <S3: gh_response> 7703
## 7 geom_line <S3: gh_response> 116986
## 8 geom_smooth <S3: gh_response> 46169
## 9 geom_path <S3: gh_response> 12378
## 10 geom_spoke <S3: gh_response> 512
## # ... with 34 more rows
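Before plotting, a quick look at the ranked counts gives a sense of which geoms dominate (optional, but a handy check that the numbers look sensible):
geoms %>%
  select(geom, total_count) %>%
  arrange(desc(total_count))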
Finally, we plot the results.
ggplot(geoms, aes(x = total_count,
                  y = reorder(geom, total_count))) +
  geom_point(color = "red") +
  geom_text(
    aes(label = str_replace(geom, "geom_(.*)", "\\1 ")),
    size = 3,
    hjust = 1,
    color = "grey30"
  ) +
  scale_y_discrete(expand = c(0.03, 0)) +
  scale_x_log10(
    limits = c(100, 1500000),
    expand = c(0, 0),
    breaks = 10 ^ c(1:6),
    labels = format(10 ^ c(1:6),
                    scientific = FALSE, big.mark = ",")
  ) +
  annotation_logticks(sides = "b") +
  theme_minimal() +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    plot.margin = grid::unit(rep(5, 4), "mm")
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Make geom_qq Great Again",
    subtitle = paste0("Most popular ggplot2 geoms,",
                      " by total count on GitHub")
  )
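If you want to keep the figure, ggsave() saves the last plot drawn; the filename and dimensions here are just placeholders.
ggsave("geom-popularity.png", width = 8, height = 6, dpi = 300)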