6 min read

Using purrr to refactor imperative code

Introduction

In a recent blog post, Nathan Eastwood solved the so-called Twitter Waterfall Problem using R. In his post, Nathan provides two solutions; one imperative approach using a large for-loop, and one more substantive approach using R6.

This post uses Nathan’s first solution as a case study of how to refactor imperative code using a more functional approach.

Problem

Consider this picture:

The blocks here represent walls, and we’re imagining what would happen if water were poured onto this structure. All water poured on the sides would run off, while some water would get trapped in the middle. In effect, it would end up looking like this:

So the problem is: given a set of walls, where would water accumulate?

Imperative solution

The walls are represented by a simple numerical vector wall. We then iterate over these walls from left to right. As Nathan explains,

The approach I took was one of many ways you could solve this problem. I chose to treat each column (wall) as an index and then for each index I implement a loop:

  • Find the height of the current index
  • Find the maximum height of the walls to the left of the current index
  • Find the maximum height of the walls to the right of the index

I then work out what the smallest maximum height is between the maximum heights to the left and right of the current index. If this smallest height minus the current index height is greater than zero, then I know how many blocks will fill with water for the current index. Of course, if the smallest maximum height to the left or right is less than the current height, then I get the run off.

Nathan offers the following solution (which I wrapped inside a function):

wall <- c(2, 5, 1, 2, 3, 4, 7, 7, 6)

get_water_original <- function(wall) {
  len <- length(wall)
  # pre-allocate memory to make the loop more efficient
  water <- rep(0, len)
  for (i in seq_along(wall)) {
    currentHeight <- wall[i]
    maxLeftHeight <- if (i > 1) {
      max(wall[1:(i - 1)])
    } else {
      0
    }
    maxRightHeight <- if (i == len) {
      0
    } else {
      max(wall[(i + 1):len])
    }
    smallestMaxHeight <- min(maxLeftHeight, maxRightHeight)
    water[i] <- if (smallestMaxHeight - currentHeight > 0) {
      smallestMaxHeight - currentHeight
    } else {
      0
    }
  }
  water
}

get_water_original(wall)
## [1] 0 0 4 3 2 1 0 0 0

The function outputs the correct solution, viz. that we would get water columns of height 4 in position 3, height 3 in position 4, and so on.

Functional solution

If we look closely at the for-loop we see that it really consists of three separate parts:

  1. For a given i, find the current height, maximum height to the left, and maximum height to the right.
  2. For a given set of heights, find the minimum of max heights to the left and the right, and find the difference between that and the current height.
  3. Machinery for iterating over the wall vector and populating the water vector.

Having thus identified the relevant parts, we should be able to define/reuse three corresponding functions. We can use purrr::map to take care of all the looping machinery (point 3), so we just need to define functions get_heights and get_depth.

We first set a seed and load some packages.

set.seed(1)
library(tidyverse)
library(forcats)
library(microbenchmark)

Now, get_heights takes a vector wall and an index i as input, and starts by splitting up wall into left, right, and (implicit) mid parts. It then finds the maximum values for each part, and returns a list with the results.

get_heights <- function(wall, i) {
  left  <- wall[seq_len(i - 1)]
  right <- wall[seq(i + 1, length(wall))]
  list(l = max(left, 0, na.rm = TRUE),
       m = wall[i],
       r = max(right, 0, na.rm = TRUE))
}

get_heights(wall, 2)
## $l
## [1] 2
## 
## $m
## [1] 5
## 
## $r
## [1] 7

Next, get_depth takes a list of heights h produced by get_heights as input and returns either their least difference, or 0.

get_depth <- function(h) {
  max(min(h$l, h$r) - h$m, 0)
}

get_depth(get_heights(wall, 3))
## [1] 4

Since the co-domain of get_heights matches up with the domain of get_depth, we can now compose the two functions with purrr::compose to create a function f which takes as input a wall and an i and returns the depth of water at that position.

# equivalent to
# f <- function(wall, i) get_depth(get_heights(wall, i))
f <- compose(get_depth, get_heights)

Finally, we let map_dbl take care of the looping/iterating for us.

get_water <- function(wall) {
  map_dbl(seq_along(wall), f, wall = wall)
}

get_water(wall)
## [1] 0 0 4 3 2 1 0 0 0

In summary, then, we took a large for-loop and

  1. split it up into two small, pure, encapsulated functions,
  2. composed together those functions, and
  3. mapped the composite function over the wall vector.

This, in my view, is the essence of functional programming.

Comparing results

We can test if the two solutions give identical results by generating a large wall and compare the results.

big_wall <- sample(1:1000, 1000, TRUE)
all(get_water_original(big_wall) == get_water(big_wall))
## [1] TRUE

We can also check whether one solution is faster than the other:

microbenchmark(get_water(big_wall), get_water_original(big_wall))
## Unit: milliseconds
##                          expr      min       lq     mean   median       uq
##           get_water(big_wall) 23.61028 25.01164 26.67536 25.36272 25.73384
##  get_water_original(big_wall) 10.32664 11.28836 11.99218 11.67553 11.84404
##       max neval
##  68.60116   100
##  53.56293   100

Unsurprisingly, the functional solution is somewhat slower, largely owing (presumably) to its greater number of function calls.

Plotting results

To plot a solution we first define a function that takes a wall as input, solves the problem, and returns a tidy data frame with all the necessary information for drawing the walls and water columns.

df_solution <- function(wall) {
  df <- data_frame(
    x = seq_along(wall),
    wall,
    water = get_water(wall)
  )
  gather(df, key, y, -x)
}

df_solution(wall)
## # A tibble: 18 x 3
##        x   key     y
##    <int> <chr> <dbl>
##  1     1  wall     2
##  2     2  wall     5
##  3     3  wall     1
##  4     4  wall     2
##  5     5  wall     3
##  6     6  wall     4
##  7     7  wall     7
##  8     8  wall     7
##  9     9  wall     6
## 10     1 water     0
## 11     2 water     0
## 12     3 water     4
## 13     4 water     3
## 14     5 water     2
## 15     6 water     1
## 16     7 water     0
## 17     8 water     0
## 18     9 water     0

We can then pass such a data frame into plot_solution to draw the walls and water.

plot_solution <- function(df) {
  ggplot(df, aes(x + 0.5, y, fill = fct_rev(key))) +
    geom_col(position = "stack", show.legend = FALSE, width = 1) +
    scale_fill_manual(values = c("steelblue", "grey")) +
    scale_x_continuous(breaks = seq_along(wall)) +
    scale_y_continuous(breaks = seq(0, max(wall), 1)) +
    theme_void() +
    theme(
      strip.text = element_blank(),
      panel.ontop = TRUE,
      panel.grid.major.x = element_line(colour = "white", size = 0.1),
      panel.grid.major.y = element_line(colour = "white", size = 0.1),
      plot.margin = unit(rep(0.1, 4), "cm")
    ) +
    coord_equal()
}

plot_solution(df_solution(wall))

Finally, we can generate a large number of random walls and plot each solution as a separate facet.

walls <- rerun(25, df_solution(sample(1:10, 10, TRUE))) %>% 
  bind_rows(.id = "draw")

plot_solution(walls) +
  facet_wrap(~draw) +
  ggtitle("Twitter Waterfall Challenge")