François Michonneau, PhD

How to work with remote Parquet files with the duckdb R package?

2023-06-19T00:00:00+00:00

For large datasets, it is sometimes convenient to explore them without downloading them locally. With Arrow, you can work with these remotes files if they are stored in AWS S3 or Google Cloud Storage. It is however not yet possible for files stored over HTTPS (it is on the roadmap). On the other hand, with the “httpfs” extension, DuckDB allows you to query over the wire these Parquet files.

You can even set things up so you can use dplyr verbs to work with these remote files. I will demonstrate this using a Parquet version of the penguins dataset hosted on my site.

Let’s start by loading the required packages:

library(DBI)
library(duckdb)
library(dplyr)

We are creating a con object to hold our DuckDB connection:

con <- duckdb::duckdb()

Let’s install (only needed once) and load the httpfs extension:

dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")

At this point, we could use DuckDB’s SQL syntax to work with our remote dataset:

dbGetQuery(con,
  "SELECT species,
          AVG(bill_length_mm) AS avg_bill_length,
          AVG(bill_depth_mm) AS avg_bill_depth
   FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet')
   GROUP BY species;")

# A tibble: 3 × 3
  species   avg_bill_length avg_bill_depth
                           
1 Adelie               38.8           18.3
2 Gentoo               47.5           15.0
3 Chinstrap            48.8           18.4

However, you can create a view using this remote file, which in turn, will allow you to use dplyr to query your file:

dbExecute(con,
  "CREATE VIEW penguins AS
   SELECT * FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet');
")

You can check it worked by running:

dbListTables(con)

[1] "penguins"

Now you can work with this remote data with dplyr:

tbl(con, "penguins") |>
  group_by(species) |>
  summarize(
    avg_bill_length = mean(bill_length_mm),
    avg_bill_depth = mean(bill_depth_mm)
  )

# Source:   SQL [3 x 3]
# Database: DuckDB 0.8.1 [francois@Linux 6.2.0-20-generic:R 4.3.0/:memory:]
  species   avg_bill_length avg_bill_depth
                           
1 Adelie               38.8           18.3
2 Gentoo               47.5           15.0
3 Chinstrap            48.8           18.4

How to use Arrow to work with large CSV files?

2022-10-13T00:00:00+00:00

Some background

Lucky you! You just got hold of a largish CSV file (let’s say 15 GB, about 140 million rows). How do you handle this file to be able to work with it using Apache Arrow?

Going through the documentation of Arrow, you might notice that several ways are mentioned to import data. They fall into two families:

one that I will refer to as the Single file API¹;
the other is the Dataset API.

The Single file API contains functions for each supported file format (CSV, JSON, Parquet, Feather/Arrow, ORC). They work on one file at a time, and they load the data in memory. So depending on the size of your file and the amount of memory you have available on your system, it might not be possible to load the dataset this way. If you can load the dataset in memory queries will run faster because the data will be readily accessible to the query engine.

The Dataset API is very flexible. It can read multiple file formats, you can point to a folder with multiple files and create a dataset from them, and it can read datasets from multiple sources (even combining remote and local sources). This API can also be used to read single files that are too large to fit in memory. This works because the files are not actually loaded in memory. The functions scan the content so they know where to look for the data and what the schema is (the data types and names of each column). When you query the data, there is some overhead because the query engine needs to first read the data before it can operate on it. (If you want to see some examples of what the Dataset API can do, check out the two previous posts on datasets with Arrow: Part 1, and Part 2)

In this post, we will explore how to convert a large CSV file to the Apache Parquet format using the Single file and the Dataset APIs with code examples in R and Python. We do the conversion from CSV to Parquet, because in a previous post we found that the Parquet format provided the best compromise between disk space usage and query performance. Having the content of this file in the Apache Parquet format will ensure that we can read and operate on this data quickly.

The Single file API in R

The functions in the Single file API in R start with read_ or write_ followed by the name of the file format. For instance, read_csv_arrow(), read_parquet(), and read_feather() belong to what I refer here as the Single file API.

To read the data with our 15 GB CSV file, we would use:

library(arrow)

data <- read_csv_arrow(
  "~/dataset/path_to_file.csv",
  as_data_frame = FALSE
)

Using as_data_frame = FALSE keeps the result as an Arrow table which is a better representation for a file of this size. Attempting to convert it into a data frame will take longer to load, and you will most likely run out of memory.

This step takes about 15 seconds on my system. As far as I can tell, the arrow R package is the only way to load a file of this size in memory. Both readr/vroom and data.table ran out of memory after several minutes and before being able to finish reading the file.

At this point, you have an Arrow formatted table loaded in memory that is ready for you to work with.

To convert this file into the Apache Parquet format using the Single file API, you would use:

write_parquet(data, "~/dataset/data.parquet")

Creating this file takes about 85 seconds on my system. The resulting file is about 9.5 GB, reducing the amount of hard drive space needed to store the data by about 60%.

The read_parquet() function will load this dataset the next time you need to work with it:

data <- read_parquet("~/dataset/data.parquet", as_data_frame = FALSE)

Let’s count the number of unique values in one of the columns of this dataset:

data %>%
  count(variable) %>%
  collect()

This query takes only half a second on my laptop. Half a second to summarize the content of 140 million rows: this is fast! Very fast!

Whether you use read_csv_arrow() or read_parquet(), the dataset is loaded in memory using the same representation: an Arrow table. The performance of queries would therefore be the same regardless of the format used to store the data. In this case, the decision to storing the data as a CSV or a Parquet file will be based on the amount of storage, how fast reading from CSV or Parquet compares to the overhead associated with the conversion from one format to the other.

Let’s now use the Dataset API.

The Dataset API in R

We will read the large CSV file with open_dataset(). This function can be pointed to a folder with several files but it can also be used to read a single file.

data <- open_dataset("~/dataset/path_to_file.csv")

With our 15 GB file, it takes 0.05 seconds to “read” the file. It is fast because the data does not get loaded in memory. open_dataset() scans the content of the file to identify the name of the columns and their data types.

Running the same query as above, which counts the number of unique values in a column, takes 18 seconds compared to the 0.5 seconds when the data is loaded in memory. It is slower because the query engine needs to read the data. It is the same result that we had found in a previous post: running queries directly on a CSV file is slow. In that post, we also found that storing the data in the Parquet format sped things up. Let’s now convert this dataset to Parquet using the Dataset API.

Instead of using a single Parquet file as we did above when we looked at the Single file API, we will also partition the Parquet dataset to see how it could help with query performance. The particular dataset I have on hand does not have any obvious variable we can use to partition the data. If you are dealing with a dataset that has timestamps for data collected at regular intervals, partitioning on a temporal dimension could make sense (that’s what the NYC taxi dataset does by partitioning by year and month). Instead, here, we can use the max_rows_per_file argument of the write_dataset() function to limit how large each Parquet file is. At least for this dataset, I found that limiting the number of rows to 10 million per file seemed like a good compromise. Each file is about 720 MB which is close to the file sizes in the NYC taxi dataset. The PyArrow documentation has a good overview of strategies for partitioning a dataset. The general recommendation is to avoid individual Parquet files smaller than 20 MB and larger than 2 GB while avoiding a partition layout that would create more than 10,000 partitions.

write_dataset(
  data,
  format = "parquet",
  path = "~/datasets/my-data/",
  max_rows_per_file = 1e7
)

Writing these files on my system takes about 50 seconds. We end up with 14 Parquet files totaling 9.9 GB.

Next time we want to work with this data, we can load these files with:

data <- open_dataset(
  "~/datasets/my-data"
  )

It takes about the same amount of time as scanning the CSV files. It is almost instantaneous taking only 0.02 seconds. Again, this is fast because the data is not loaded in memory. We saw that with this approach it took almost 20 seconds to run this query on our CSV file. So what is the performance of a query on this dataset split into multiple Parquet files?

Counting the unique values in a column takes just 1 second. You read that correctly. One second to summarize 140 million rows. It is a little slower than doing it when the entire dataset is loaded in memory but scanning the files is faster. And because the dataset is not loaded in memory, you are not limited by the amount of memory you have available. With the Single File API, a file of 15 GB is the upper limit of what my laptop with 32 GB of RAM can handle.

One of the advantages of the Arrow ecosystem is that it is polyglot. The approach we described with R also works with Python. And because both languages use the same C++ backend, the code looks very similar.

Single file API in Python

There are two functions in the PyArrow Single API to read CSV files: read_csv() and open_csv(). While read_csv() loads all the data in memory and does it fast by using multiple threads to read different parts of the files, open_csv() reads the data in batches and uses a single thread.

If the CSV file is small enough, you should use read_csv(). The code to read the CSV file and write it to a Parquet file would then look like this:

import pyarrow as pa
import pyarrow.csv
import pyarrow.parquet as pq

in_path = '~/datasets/data.csv'
out_path = '~/datasets/data.parquet'

data =  pa.csv.read_csv(in_path)

pq.write_table(data, out_path)

In our case, the file is too large to fit in memory². So instead of using read_csv(), we need to use open_csv(). Because, the CSV file is read in chunks, the code is a little more complex. We need to loop through each chunk, read it, and write it to the Parquet file. This uses little memory but is not as fast as using read_csv(). While read_csv() is multi-threaded, open_csv() uses a single thread. When using open_csv(), the data types need to be consistent in your columns. The function infers the data types on the first chunk of data read, and if the type changes halfway through your dataset in one of your columns, you will run into errors. You can avoid this by specifying the data types manually.

# from 
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv

in_path = '~/datasets/data.csv'
out_path = '~/datasets/data.parquet'

writer = None
with pyarrow.csv.open_csv(in_path) as reader:
    for next_chunk in reader:
        if next_chunk is None:
            break
        if writer is None:
            writer = pq.ParquetWriter(out_path, next_chunk.schema)
        next_table = pa.Table.from_batches([next_chunk])
        writer.write_table(next_table)
writer.close()

On my system, the conversion from file to Parquet takes about 190 seconds. Reading the Parquet file can be done with:

data = pq.ParquetDataset(out_path).read()

With this approach, the dataset is in memory, just like when we were using R. Again with 32 GB of RAM in my laptop, I need to be careful with what is running on my system to be able to load this dataset without running out of memory and crashing my Python session.

The Dataset API in Python

To load the CSV file with the Dataset API, we use the dataset() function:

import pyarrow.dataset as ds

in_path = "~/datasets/data.csv"
out_path = "~/datasets/my-data/"

data = ds.dataset(in_path)

Just like with R, importing this file takes about 0.02 seconds.

To convert it to a collection of Parquet files, you use the write_dataset() function. This function takes the same max_rows_per_file argument to control the size of the Parquet file in each partition.

ds.write_dataset(data, out_path, format = "parquet",
                 max_rows_per_file = 1e7)

Reading this collection of Parquet files can also be done with the dataset() function, just like when we used the function to read the single CSV file above. The dataset() function is very flexible and can be used to import data in a variety of formats, and structures, and even combines files from local and remote locations. The format argument is optional as the function detects automatically the file type.

data = ds.dataset(out_path, format = "parquet")

Given the current functionalities implemented PyArrow, querying datasets of this size is possible but it is neither blazing fast nor convenient. A good alternative is to use Ibis with DuckDB as a backend. Ibis provides a single interface to work with data stored in memory or in databases. DuckDB is a self-contained database designed for data analytics. These tools deserve a lot more than a one-sentence summary but this is beyond the scope of this post.

To count the number of unique values, you could use the following approach:

import ibis

ibis.options.interactive = True

con = ibis.duckdb.connect()
data = con.register("parquet:///home/user/datasets/my-data/*.parquet", table_name = "table")

con.table("table").variable.value_counts()

Just like with using R, this takes about a second to count the unique values in one column of our 140 million row dataset.

What this post didn’t mention

I focused on the reading of a CSV file and its conversion to Parquet. I didn’t talk about all the options that both the Single file and the Dataset APIs have to customize the format of the files that are being imported. For instance, both APIs can be used to specify a different column-separator, and cell content that should be treated as missing data.

Conclusion

For a 15 GB data file, the dataset API is better suited to read, convert, and query the data. There is an overhead associated with not having the data in memory but it is greatly reduced if the data is stored as Parquet files. Another advantage is that the approach developed here would scale to much larger datasets where the Single file API would not be able to serialize the data in memory.

With the dataset in this example, the Single file API did not have an opportunity to shine given the hardware constraints of a modern laptop. However, if you are dealing with datasets that fit easily in memory, working with data directly in memory will lead to better query performance.

To summarize what we learned in this post, this brief decision guide to help you choose the correct API to import your data.

Decision tree to help you choose the appropriate Apache Arrow API for your dataset.

Acknowledgments

Thank you to Kae Suarez and Danielle Navarro for reviewing this post and providing feedback that improved its content.

Footnotes

This is not an official name for it but found it is helpful to group these functions that work on a file type at a time. ↩
I am not sure why it fit in memory when I was loading it in R but not with Python. ↩

Creating an Arrow dataset (part 2)

2022-09-06T00:00:00+00:00

Background

In this follow-up post (see part 1 if you missed it), we will explore what happens to the query performance if we read the files straight into Arrow instead of downloading them locally first.

Reading remote CSV files

In the first part, we first downloaded the compressed CSV files locally (using the download.file() function) and then we used the open_dataset() function on this set of files to make it available to Arrow.

However, it is possible to bypass the local download. We can import the files directly over an Internet connection using the read_csv_arrow() function and providing the file URL as the first argument. Once the file is loaded in memory, we can then write it to disk in the parquet format (given that we learned in part 1 that this format provided the best compromise of disk space usage and query performance).

We can then modify the code from the download_daily_package_logs_csv() function from part 1 to the following (lines changed have comments indicated by # <--- at the end of the line).

library(tidyverse)
library(arrow)

## Download the data set for a given date from the RStudio CRAN log website.
## `date` is a single date for which we want the data
## `path` is where we want the data to live
download_daily_package_logs_parquet <- function(date,
                                                path = "~/datasets/cran-logs-parquet-by-day") {

  ## build the URL for the download
  date <- parse_date(date)
  url <- paste0(
    'https://cran-logs.rstudio.com/', date$year, '/', date$date_chr, '.csv.gz'
  )

  ## build the path for the destination of the download
  file <- file.path(
    path,
    paste0("year=", date$year),
    paste0("month=", date$month),
    paste0(date$date_chr, ".parquet")   # <--- change extension to .parquet
  )

  ## create the folder if it doesn't exist
  if (!dir.exists(dirname(file))) {
    dir.create(dirname(file), recursive = TRUE)
  }

  ## download the file
  message("Downloading data for ", date$date_chr, " ... ", appendLF = FALSE)
    arrow::read_csv_arrow(url) %>%      # <--- read directly from URL
      arrow::write_parquet(sink = file) # <--- convert to parquet on disk
  message("done.")

  ## quick check to make sure that the file was created
  if (!file.exists(file)) {
    stop("Download failed for ", date$date_chr, call. = FALSE)
  }

  ## return the path
  file
}

## This function is unchanged from part 1
## and extract the year and month from it
parse_date <- function(date) {
  stopifnot(
    "`date` must be a date" = inherits(date, "Date"),
    "provide only one date" = identical(length(date), 1L),
    "date must be in the past" = date < Sys.Date()
  )
  list(
    date_chr = as.character(date),
    year = as.POSIXlt(date)$year + 1900L, 
    month = as.POSIXlt(date)$mon + 1L
  )
}

Now that we are set up, we can create the file system the same way we did, in part 1.

dates_to_get <- seq(
  as.Date("2022-06-01"),
  as.Date("2022-08-15"),
  by = "day"
)

purrr::walk(dates_to_get, download_daily_package_logs_parquet)

The result is similar to what we achieved in part 1. We have one file for each day placed in a folder corresponding to their month. Except that this time, instead of having compressed CSV files, we have parquet files:

~/datasets/cran-logs-parquet-by-day/
└── year=2022
    ├── month=6
    │   ├── 2022-06-01.parquet
    │   ├── 2022-06-02.parquet
    │   ├── 2022-06-03.parquet
    │   ├── ...
    │   └── 2022-06-30.parquet
    ├── month=7
    │   ├── 2022-07-01.parquet
    │   ├── 2022-07-02.parquet
    │   ├── 2022-07-03.parquet
    │   ├── ...
    │   └── 2022-07-31.parquet
    └── month=8
        ├── 2022-08-01.parquet
        ├── 2022-08-02.parquet
        ├── 2022-08-03.parquet
        ├── ...
        └── 2022-08-15.parquet

Let’s check how large this data is compared to the datasets we created in part 1:

dataset_size <- function(path) {
  fs::dir_info(path, recurse = TRUE) %>%
    filter(type == "file") %>%
    pull(size) %>%
    sum()
}

tribble(
  ~ Format, ~ size,
  "Compressed CSV", dataset_size("~/datasets/cran-logs-csv/"),
  "Arrow", dataset_size("~/datasets/cran-logs-arrow/"),
  "Parquet", dataset_size("~/datasets/cran-logs-parquet/"),
  "Parquet by day",  dataset_size("~/datasets/cran-logs-parquet-by-day/")
) 

# A tibble: 4 × 2
  Format                size
            
1 Compressed CSV       5.01G
2 Arrow               29.67G
3 Parquet              5.06G
4 Parquet by day       4.63G

The dataset with one parquet file per day, is slightly smaller than when we let write_dataset() do its own partitioning which led to one file per month.

We can now compare how quickly Arrow can read these datasets.

bench::mark(
  parquet = open_dataset("~/datasets/cran-logs-parquet", format = "parquet"),
  parquet_by_day = open_dataset("~/datasets/cran-logs-parquet-by-day", format = "parquet"),
  check = FALSE
)

# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
                  
1 parquet        139.43ms 143.66ms      6.62    7.91MB     0   
2 parquet_by_day   3.52ms   3.82ms    254.      4.28KB     6.45

Even though there are more files to parse (76 vs. 3), loading the dataset with a parquet file per day is a bit faster.

cran_logs_parquet <- open_dataset("~/datasets/cran-logs-parquet",  format = "parquet")
cran_logs_parquet_by_day <- open_dataset("~/datasets/cran-logs-parquet-by-day",  format = "parquet")

Let’s now explore the performance of a few queries on these datasets.

First, how long does it take to compute the number of rows in these datasets:

bench::mark(
  parquet = nrow(cran_logs_parquet),
  parquet_by_day = nrow(cran_logs_parquet)
)

# A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
                  
1 parquet           743µs    773µs     1267.    4.74KB     8.48
2 parquet_by_day    745µs    773µs     1273.    1.97KB    10.7 

Not much of a difference.

Let’s now compare the performance of the query we ran in part 1, where we computed the 10 most downloaded packages in the period covered by our dataset.

top_10_packages <- function(data) {
  data %>%
    count(package, sort = TRUE) %>%
    head(10) %>%
    mutate(n_million_downloads = n/1e6) %>%
    select( - n) %>% 
    collect()
}

bench::mark(
  top_10_packages(cran_logs_parquet),
  top_10_packages(cran_logs_parquet_by_day)
)

Warning: Some expressions had a GC in every iteration; so filtering is disabled.

# A tibble: 2 × 6
  expression                                     min   median `itr/sec` mem_al…¹
                                         
1 top_10_packages(cran_logs_parquet)           3.58s    3.58s     0.279   7.19MB
2 top_10_packages(cran_logs_parquet_by_day)    5.76s    5.76s     0.174 165.36KB
# … with 1 more variable: `gc/sec` , and abbreviated variable name
#   ¹​mem_alloc
# ℹ Use `colnames()` to see all variable names

This query runs 1.5 seconds faster on the dataset with one parquet file per month compared to the dataset with one parquet file per day.

The way a dataset is partitioned has an impact on the performance of queries. If you are filtering your dataset along a variable used in the partitioning, some of the files can be skipped. Arrow can directly and only read the file(s) with the relevant information for your query. For instance, if you are performing a query that only touches the month of July, Arrow does not need to look at the files for June or August, leading to potential speed-ups.

Would the partitioning by day help us run our query faster if we were to compute the 10 most downloaded packages for a single day? After all, in this case, we would only need to look at one of the files in our folder of parquet files, and the file in question would be smaller than one that has all the data for the month. Let’s compare the performance of this query for August 1st, 2022:

top_10_packages_by_day <- function(data) {
  data %>%
    filter(date == as.Date("2022-08-01")) %>%
    count(package, sort = TRUE) %>%
    head(10) %>%
    collect()
}

bench::mark(
  top_10_packages_by_day(cran_logs_parquet),
  top_10_packages_by_day(cran_logs_parquet_by_day)
)

# A tibble: 2 × 6
  expression                                          min median itr/s…¹ mem_a…²
                                              
1 top_10_packages_by_day(cran_logs_parquet)         304ms  348ms    2.87   222KB
2 top_10_packages_by_day(cran_logs_parquet_by_day)  354ms  354ms    2.82   167KB
# … with 1 more variable: `gc/sec` , and abbreviated variable names
#   ¹​`itr/sec`, ²​mem_alloc
# ℹ Use `colnames()` to see all variable names

Interestingly, running the query on the monthly parquet file is still faster. It takes about 30% longer to run the queries on the one parquet file per day. The overhead associated with having too many small files in this situation does not compensate for having to look inside a single file to perform this operation. For the benefits of partitioning to be visible, we would need to have more data in each parquet file.

We don’t see a performance benefit of having many small files even when we try to get the result on a single day. But how does this partitioning impact the performance of a query that needs to access multiple random rows? Let’s compare how a query that looks at the number of downloads per day for a given package.

package_downloads_by_day <- function(data, pkg = "arrow") {
  data %>%
    filter(package == pkg) %>%
    count(date) %>%
    arrange(date) %>%
    collect()
}

bench::mark(
  package_downloads_by_day(cran_logs_parquet),
  package_downloads_by_day(cran_logs_parquet_by_day)
)

Warning: Some expressions had a GC in every iteration; so filtering is disabled.

# A tibble: 2 × 6
  expression                                              min   median `itr/sec`
                                                 
1 package_downloads_by_day(cran_logs_parquet)           3.31s    3.31s     0.302
2 package_downloads_by_day(cran_logs_parquet_by_day)    4.46s    4.46s     0.224
# … with 2 more variables: mem_alloc , `gc/sec` 
# ℹ Use `colnames()` to see all variable names

In this case, it takes about 45% longer to perform this query. In this situation, the performance is affected by having to look inside many more files in the dataset with one parquet file per day.

Conclusion

This small example illustrates that it might be worth exploring how best to partition your dataset to benefit the most from the speed that Arrow brings to your queries. In this example, the partitioning that seemed the most “natural” based on the format the data is provided (one parquet file per day) is not the best to make queries run fast.

The variables you include in your queries have also a role to play when deciding how to partition your dataset. It might be best to partition your dataset according to variables you use most often in your queries.

The useR!2022 Arrow tutorial has a convincing demonstration that taking advantage of partitioning for your queries makes them run much faster.

Expand for Session Info

Creating an Arrow dataset

2022-08-22T00:00:00+00:00

Background

While getting started with Apache Arrow, I was intrigued by the variety of formats Arrow supports. Arrow tutorials tend to start with already prepared datasets ready to be ingested by open_dataset(). I wanted to explore what it takes to create your dataset aimed to be analyzed with Arrow and understand the respective benefits of the different file formats it supports.

Arrow can read in a variety of formats: parquet, arrow (also known as ipc and feather)¹, and text-based formats like csv (as well as tsv). Additionally, Arrow provides tools to convert between these formats.

Having the possibility to import datasets in a variety of formats is helpful as you are less constrained by the type of data you can start your analysis on. However, if you are building a dataset from scratch, which one should you choose?

To try to answer this question, we will be using the {arrow} R package to compare the amount of hard drive space these file formats use and the performance of a query in a multi-file dataset using these different formats. This is not a formal evaluation of the performance of Arrow or how best to optimize the partitioning of a dataset, rather it is a brief exploration of the tradeoffs that come with using the different datasets supported by Arrow. I also don’t explain the differences in the data structure of these different formats.

The dataset

We will be using data from https://cran-logs.rstudio.com/. This site gives you access to the log files for all hits to the CRAN² mirror hosted by RStudio. For each day since October 1st, 2012, there is a compressed CSV file (file with the extension .csv.gz) that records the downloaded packages. Each row contains the date, the time, the name of the R package downloaded, the R version used, the architecture (32-bit or 64-bit), the operating system, the country inferred from the IP address, and a daily unique identifier assigned to each IP address. This website has also similar data for the daily downloads of R itself but I will not be using this data in this post.

For this exploration, we are going to limit ourselves to a couple of months of data which will be providing enough data for our purpose. We will download the data for the period from June 1st, 2022 to August 15th, 2022.

Arrow is designed to read data that is split across multiple files. So, you can point open_dataset() to a directory that contains all the files that make up your dataset. There is no need to loop over each file to build your dataset in memory. Splitting your datasets across multiple files can even make queries on your dataset faster, as only some of the files might need to be accessed to get the results needed. Depending on the type of queries you perform most often on your dataset, it can be worth considering how best to partition your files to accelerate your analyses (but this is beyond the scope of this post). Here, the files are provided by date, and we will keep a time-based file organization.

We will use a Hive-style partitioning by year and month. We will have a directory for each year (there is only one year in our example), and within it, a directory for each month. The directory are named according to the convention =. So we will want to organize the files as illustrated below:

└── year=2022
    ├── month=6
    │   └── 
    ├── month=7
    │   └── 
    └── month=8
        └── 

Import the data as it is provided

library(arrow)
library(tidyverse)
library(fs)
library(bench)

The open_dataset() function in the {arrow} package can directly read compressed CSV files³ (with the extension .csv.gz) as they are provided on the RStudio CRAN logs website.

As a first step, we can download the files from the site and organize them using the Hive-style directory structure as shown above.

## Check that the date is really a date,
## and extract the year and month from it
parse_date <- function(date) {
  stopifnot(
    "`date` must be a date" = inherits(date, "Date"),
    "provide only one date" = identical(length(date), 1L),
    "date must be in the past" = date < Sys.Date()
  )
  list(
    date_chr = as.character(date),
    year = as.POSIXlt(date)$year + 1900L, 
    month = as.POSIXlt(date)$mon + 1L
  )
}

## Download the data set for a given date from the RStudio CRAN log website.
## `date` is a single date for which we want the data
## `path` is where we want the data to live
download_daily_package_logs_csv <- function(date,
                                            path = "~/datasets/cran-logs-csv") {

  ## build the URL for the download
  date <- parse_date(date)
  url <- paste0(
    'https://cran-logs.rstudio.com/', date$year, '/', date$date_chr, '.csv.gz'
  )

  ## build the path for the destination of the download
  file <- file.path(
    path,
    paste0("year=", date$year),
    paste0("month=", date$month),
    paste0(date$date_chr, ".csv.gz")
  )

  ## create the folder if it doesn't exist
  if (!dir.exists(dirname(file))) {
    dir.create(dirname(file), recursive = TRUE)
  }

  ## download the file
  message("Downloading data for ", date$date_chr, " ... ", appendLF = FALSE)
  download.file(
    url = url,
    destfile = file,
    method = "libcurl",
    quiet = TRUE,
    mode = "wb"
  )
  message("done.")

  ## quick check to make sure that the file was created
  if (!file.exists(file)) {
    stop("Download failed for ", date$date_chr, call. = FALSE)
  }

  ## return the path
  file
}

## build sequence of dates for which we want the data
dates_to_get <- seq(
  as.Date("2022-06-01"),
  as.Date("2022-08-15"),
  by = "day"
)

## download the data
walk(dates_to_get, download_daily_package_logs_csv)

Let’s check the content of the folder that holds the data we downloaded:

~/datasets/cran-logs-csv/
└── year=2022
    ├── month=6
    │   ├── 2022-06-01.csv.gz
    │   ├── 2022-06-02.csv.gz
    │   ├── 2022-06-03.csv.gz
    │   ├── ...
    │   └── 2022-06-30.csv.gz
    ├── month=7
    │   ├── 2022-07-01.csv.gz
    │   ├── 2022-07-02.csv.gz
    │   ├── 2022-07-03.csv.gz
    │   ├── ...
    │   └── 2022-07-31.csv.gz
    └── month=8
        ├── 2022-08-01.csv.gz
        ├── 2022-08-02.csv.gz
        ├── 2022-08-03.csv.gz
        ├── ...
        └── 2022-08-15.csv.gz

We have one file for each day, placed in a folder corresponding to their month. We can now read this data using {arrow}’s open_dataset() function:

cran_logs_csv <- open_dataset(
  "~/datasets/cran-logs-csv/",
  format = "csv",
  partitioning = c("year", "month")
)
cran_logs_csv

FileSystemDataset with 76 csv files
date: date32[day]
time: time32[s]
size: int64
r_version: string
r_arch: string
r_os: string
package: string
version: string
country: string
ip_id: int64
year: int32
month: int32

The partitioning has been taken into consideration as the output shows that the dataset contains the variables year and month which are not part of the data we downloaded. They are coming from the way we organized the downloaded files.

Convert to Arrow and Parquet files

Now that we have the compressed CSV files on disk, and that we opened the dataset with open_dataset(), we can convert it to the other file formats supported by Arrow using {arrow}’s write_dataset() function. We are going to convert our collection of .csv.gz files into the Arrow and Parquet formats.

## Convert the dataset into the Arrow format
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-arrow",
  format = "arrow",
  partitioning = c("year", "month")
)

## Convert the dataset into the Parquet format
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-parquet",
  format = "parquet",
  partitioning = c("year", "month")
)

Let’s inspect the content of the directories that contain these datasets.

fs::dir_tree("~/datasets/cran-logs-arrow/")

~/datasets/cran-logs-arrow/
└── year=2022
    ├── month=6
    │   └── part-0.arrow
    ├── month=7
    │   └── part-0.arrow
    └── month=8
        └── part-0.arrow

fs::dir_tree("~/datasets/cran-logs-parquet/")

~/datasets/cran-logs-parquet/
└── year=2022
    ├── month=6
    │   └── part-0.parquet
    ├── month=7
    │   └── part-0.parquet
    └── month=8
        └── part-0.parquet

These two directories have the same layout organized by year and month as with our CSV files given that we kept the same partitioning. The files within the directories have an extension that matches their file format. One difference is that there is a single file for each month. We used the default values for write_dataset() and the number of rows for each month is smaller than the threshold this function uses to split the dataset into multiple files.

Comparison of the different formats

Let’s compare how much space these different file formats take on disk:

dataset_size <- function(path) {
  fs::dir_info(path, recurse = TRUE) %>%
    filter(type == "file") %>%
    pull(size) %>%
    sum()
}

tribble(
  ~ Format, ~ size,
  "Compressed CSV", dataset_size("~/datasets/cran-logs-csv/"),
  "Arrow", dataset_size("~/datasets/cran-logs-arrow/"),
  "Parquet", dataset_size("~/datasets/cran-logs-parquet/")
) 

# A tibble: 3 × 2
  Format                size
            
1 Compressed CSV       5.01G
2 Arrow               29.67G
3 Parquet              5.06G

The Arrow format takes the most space with almost 30GB while both the compressed CSV and the Parquet files use about 5GB of hard drive.

We are now set up to compare the performance of doing computation of these different dataset formats.

Let’s open these datasets with the different formats:

cran_logs_csv <- open_dataset("~/datasets/cran-logs-csv/", format = "csv")
cran_logs_arrow <- open_dataset("~/datasets/cran-logs-arrow/", format = "arrow")
cran_logs_parquet <- open_dataset("~/datasets/cran-logs-parquet/", format = "parquet")

We will compare how long it takes for Arrow to compute the 10 most downloaded packages in the time period our dataset covers using each file format.

top_10_packages <- function(data) {
  data %>%
    count(package, sort = TRUE) %>%
    head(10) %>%
    mutate(n_million_downloads = n/1e6) %>%
    select( - n) %>% 
    collect()
}

bench::mark(
  top_10_packages(cran_logs_csv),
  top_10_packages(cran_logs_arrow),
  top_10_packages(cran_logs_parquet)
)

Warning: Some expressions had a GC in every iteration; so filtering is disabled.

# A tibble: 3 × 6
  expression                              min   median itr/se…¹ mem_al…² gc/se…³
                                    
1 top_10_packages(cran_logs_csv)       29.57s   29.57s   0.0338   8.19MB   0    
2 top_10_packages(cran_logs_arrow)       2.1s     2.1s   0.475  165.39KB   0.475
3 top_10_packages(cran_logs_parquet)    3.32s    3.32s   0.301  137.11KB   0    
# … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`

While it takes about 4 seconds to perform this task on the Arrow or Parquet files, it takes more than 30 seconds to do it on the CSV files.

Conclusion

Having Arrow point directly to the folder of compressed CSV file might be the most convenient but it comes with a high-performance cost. Arrow and Parquet have similar performance but the Parquet files take less space on disk and would be more suitable for long-term storage. This is why large datasets like the NYC taxi data is distributed as a series of Parquet files.

In the future, I might explore how using different variables for partitioning or how the number of files in the partitions affects the performance of the queries (EDIT: this post is now available. If you have other ideas of topics that you would me to explore, do not hesitate to leave a comment below.

Going further

If you would like to learn more about the different formats, check out the Arrow workshop (especially Part 3: Data Storage) that Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt taught at useR!2022.

Acknowledgments

Thank you to Kae Suarez and Danielle Navarro for reviewing this post.

Post Scriptum

I wrote a follow-up post that explores the impact of partitioning the dataset on performance.

Expand for Session Info

`foghorn` 1.3.1 released

2020-09-08T00:00:00+00:00

A new version of foghorn (version 1.3.1) was just accepted on CRAN.

foghorn is an R package that allows you to:

browse the results of the CRAN checks on your package (with cran_results() and cran_details());
check where your package stands when submitted to CRAN (with cran_incoming());
and starting with version 1.3.1, check whether your package is in the Win builder queue (with winbuilder_queue()).

The idea of inspecting the Win-builder queue was proposed by Kirill Müller.

If you would like to start using foghorn, check out the vignette that comes with the package.

Feedback and suggestions for foghorn are welcome!

Migrate from Gmail to HelpScout with R

2020-04-17T00:00:00+00:00

Preamble

This is a long and somewhat dense post. Even if you do not have to migrate emails from Gmail to HelpScout, I hope this post will be useful to you, as the general approach could be interesting to other problems that involve working with APIs.
The full code I actually used for the email migration is available at: https://github.com/carpentries/emailmigration and I include links pointing to functions in the GitHub repo¹ throughout the post below to illustrate my points.

The problem and its solution

At The Carpentries, Regional Coordinators help us organize workshops across the globe. In the past, each Regional Coordinator was set up with a Gmail account (through The Carpentries’s GSuite plan). However, as the number of Regional Coordinators grew, and as some geographic areas have more than one Regional Coordinator, the Gmail account model was starting to cause some issues.

The Carpentries Core Team has been using HelpScout for a while and is a much more suitable tool to manage emails and inboxes as a team.

The main challenge with transitioning the Regional Coordinators to using HelpScout was to import the old messages from Gmail to HelpScout. To tackle this problem, I used R and this blog post describes the approach I took.

Technical overview

Before doing anything else, we used the GSuite data migration tool to transfer all emails for each Regional Coordinator account into a single account. Having all the emails to import in the same place makes things easier.

This post goes through the steps I took to perform this migration:

Figure out authentication with the Gmail API, and with the HelpScout API
Get familiar with the HelpScout API and write R functions to perform the tasks needed
Convert Gmail threads into HelpScout conversations
Test migration on 100 Gmail threads
Perform the full migration

Choice of packages and approach:

Working with the Gmail API is made much easier with the wonderful gmailr package.
I didn’t find an already made package to work with the HelpScout web API so I wrote a few functions to interact with the endpoints I needed using the httr package.
The mechanics of converting the data coming from the Gmail web API into the format needed by the HelpScout API to import the conversation was done using the R6 package. The R6 classes and methods made it easier to separate storing each element needed by the HelpScout API as private elements and the actual formatting that was handled with methods.
When working with web APIs a lot can go wrong: there is a weird data format that your code didn’t know how to handle, your internet connection goes down, you reach the rate limit, etc. Therefore, I used the storr package to cache (1) the R6 objects that act as the bridge between the 2 APIs; (2) the responses from the HelpScout API to make sure all the threads were converted correctly.
I organized all the code as a barebone package. It makes code management easier and is a good habit to take. Here it was a one-off task but if it was something that I’d use regularly, it means that I could develop tests, write documentation, and enable continuous testing. I could then write and update my code, and rely on devtools::load_all().

1. Authentication

1.1. Gmail API

The instructions in the gmailr package’s README are clear. You can use the gm_threads() function, for instance, to check that the authentication is working as expected.

1.2. The HelpScout API

The HelpScout API uses the OAuth 2.0 protocol. The httr package handles this well.

Create a new app within HelpScout, and use https://localhost:1410/ for the redict URL. Take note of the key and secret. Use this information to create a new app object in R with httr:

hs_app <- httr::oauth_app(
  "helpscout",
  key = "",
  secret = ""
)

and then use this object to do the authentication online:

hs_token <- httr::oauth2.0_token(
  httr::oauth_endpoint(
    authorize = "https://secure.helpscout.net/authentication/authorizeClientApplication",
    access = "https://api.helpscout.net/v2/oauth2/token"),
  app = hs_app)

htoken <- httr::config(token = hs_token)

We can then use the htoken object across all our calls to the HelpScout web API.

2. Getting started with the HelpScout web API

When working with a new web API, first read the documentation to understand how things are set up. From this initial reading, it became clear that Gmail and HelpScout use different words for related concepts.

HelpScout	Gmail
thread	message
conversation	thread

Keeping this straight in my mind took some time… and because I’m more used to the terms used by Gmail, I used this vocabulary in my function names (for the most part).

Another thing that I needed was HelpScout’s internal identifier for the mailbox into which the emails were being imported. So the first function I wrote against HelpScout’s API was hs_mailbox_id() which returned the internal identifier for the mailbox that was of interest to me.

The second thing I needed to do was to make sure I understood how to use the API to import an actual conversation. I started with fake data I could control to ensure that I had something simple that I knew worked and I could compare against when things didn’t work with real data. Even if the documentation of an API is good, there are, more often than not, small details that are not described that you need to figure out. Having this data as a starting point is useful for these tests.

The actual code to create a new ~~thread~~ conversation in HelpScout ended up being:

hs_create_thread <- function(thread, hstoken) {
  body <- jsonlite::toJSON(thread, auto_unbox = TRUE)

  httr::POST(
    "https://api.helpscout.net",
    path = "/v2/conversations",
    body = body,
    htoken,
    httr::content_type("application/json; charset=UTF-8")
  )
}

This is not the code I would have written if it was part of a package intended for others to use. For instance, I would have wanted to check the response of the API after each request. But for my particular use case, it made it easier to return this response and inspect manually after the fact once I had confirmed that this code was working for most requests.

2. Extracting the content of the emails from Gmail

This was the most time-consuming part as there were lots of unexpected details that came up to get a smooth conversion between the two APIs.

2.1. Things that were easy

The gmailr::gm_subject() worked every time to get the subject of the threads for all the messages.

2.2. Things that were almost easy

Extracting the people involved in the conversation. The gmailr::gm_to() and gmailr::gm_from() worked well to extract the email addresses. The small catch was that some email addresses were formatted as FirstName LastName, others had only email@address.rr, and when multiple people were involved a comma separated them. However, in some cases, people have a comma in their names.
Extracting the date. The gmailr::date() returns the date from the email in Unix time. The anytime package is useful at converting Unix time into other formats, including the iso8601 that was expected by the HelpScout API. I still had to manually add a final Z to the character string.

2.3. Things that were not so easy

Extracting the email attachments. The attachments themselves are not returned by the API. Instead, the API returns an URL that points to the address where the attachments can be retrieved. The HelpScout’s API accepts the attachments as base64-encoded strings. The gmailr package helped to retrieve this data, but the data returned by the Gmail API is base64url encoded. Thankfully, converting to regular base64 is a short regular expression substitution away once you know the difference between the two.
The thing that was the most puzzling was parsing the actual body of the emails. The gmailr::gm_body() worked for only a small fraction of the emails I had to deal with. After many trials and errors, I wrote a function to reliably retrieve the content of the emails². There were many situations to deal with as the messages can be:
- “multipart” the body of the email is provided both in plain text format or in HTML format which allows for email clients that don’t support HTML-formatting to provide the plain text version of the message;
- either only plain text or in HTML format
- provided as attachments (what some email clients do when you forward a message).
Depending on the situation, the location of the body of the email within the deeply nested list that was returned by the Gmail API could vary. I ended up writing a recursive algorithm that traversed the list to find and retrieve the relevant content of the emails.

The last catch was that plain text messages that included an URL were interpreted by the HelpScout API as being HTML-formatted. It meant that the whitespace to indicate the line breaks were ignored making the body of the messages large blocks of texts that were very hard to read and follow. I relied on the commonmark::markdown_html() to convert these plain text messages into HTML that then looked good once they were uploaded onto HelpScout using the API.

3. Conversion between Gmail and HelpScout

Now that I had access to all the relevant information from the emails, I needed to format it so it could be imported by the HelpScout API. For this, I used the R6 object-oriented programming system.

Each element coming from the Gmail API was individually stored as a private field, and an accessor method ($get()) created the list in the format needed to be ingested by HelpScout’s API.

I used 3 classes for this:

one for the HelpScout conversations (the Gmail threads)
one for the HelpScout threads (the Gmail messages)
one for the attachments

This modularity helped debugging and limited the complexity of each class.

Because all the emails are going to be in the same inbox in HelpScout, I wanted an easy way to tag the conversations based on the team of Regional Coordinators that were involved. The R6 system was useful for this because once the email information was stored within the object, I could use a private method called by the accessor to extract all the people involved, and add tags in HelpScout to help Regional Coordinators find past conversations that are relevant to them.

It was one of the first times I used R6³ for a real task and I could see its potential. If the code written here were for public consumption, it would have provided a good framework to add more tests on the data structure of the individual elements that were coming from the Gmail API to ensure that the output from the accessor method was always formatted correctly before trying to convert it in the format required by HelpScout’s API.

4. Caching

My previous experience working with web APIs have taught me that things can go wrong, and it is always a good idea to keep track (on disk and not only on memory) of the requests that have been tried and the ones that have not, and the requests that succeeded and the ones that failed. Especially, when your scripts do thousands of API calls, you don’t want to have to run everything again once your script fails because your internet connection goes down for a short while, or the data is not formatted properly because you are dealing with an edge case.

For this, I use the storr package and its functionality to rely on hooks to retrieve external data. storr is a key-value store. It is not that different than using variable names to store objects in memory as you normally do in your R terminal:

## setting a variable
cat_name <- "Felix"

## getting the content of the variable
cat_name

When using a storr store:

## defining the storr
st <- storr::storr_rds(path = "cache")

## setting a variable
st$set("cat_name", "Felix")

## getting the variable name
st$get("cat_name")

The difference is that storr provides different backends for storing your object and, if like in this example, you use storr_rds, your objects are stored as rds files on your disk and are available beyond your current R session. How does that help with the problem here?

A great feature of storr is that you can set up your store to call a function to create the object instead of providing it directly with $set().

It means that you store the content of a variable, your key into the store, and you can retrieve it:

## the hook function
fetch_hook_random_cat_name <- function(key, namespace) {
  sample(c("Felix", "Garfield", "Tigger", "Mowgli"), 1)
}

## defining the storr
st <- storr::storr_external(
  storr::driver_rds(path = "cache"),
  fetch_hook_random_cat_name
)

## the first time you call a key, it will run the hook function
st$get("cat_name")

## subsenquently, it will return the value stored in the store
st$get("cat_name")

The hook function always takes the two arguments key and namespace but they don’t need to be used in the body of the function as in the example above.

We can extend this approach to store the output of time-consuming computations or the results of API calls⁴. For instance, here, I created a store to keep the output of the function convert_gmail_thread(), and used get_gmail_thread() as a wrapper to access the store.

fetch_hook_gmail_threads <- function(key, namespace) {
  convert_gmail_thread(key)
}

store_gmail_threads <- function(path = "cache/threads") {
  storr::storr_external(
    storr::driver_rds(path),
    fetch_hook_gmail_threads
  )
}

get_gmail_thread <- function(thread_id, namespace) {
  store_gmail_threads()$get(thread_id, namespace)
}

When calling get_gmail_thread(), using a thread_id that had not been retrieved using the Gmail API before, the function convert_gmail_thread() will be called, getting all the information needed for this particular thread, and storing it in an R6-class object. If another part of the script fails, we do not need to redo the calls to the Gmail API, instead the cached copy within the store will be retrieved.

I used a similar approach to store the responses from the HelpScout API, and wrapped at the same time the call to the get_gmail_thread() function above. A slightly simplified version of what I used is:

fetch_hook_hs_response <- function(key, namespace) {
  res <- get_gmail_thread(key, namespace)
  hs_create_thread(res$get(), htoken)
}

store_hs_responses <- function(path = "cache/hs_responses") {
  storr::storr_external(
    storr::driver_rds(path),
    fetch_hook_hs_response
  )
}

get_hs_response <- function(thread_id, namespace) {
  store_hs_responses()$get(thread_id, namespace)
}

So, what’s happening here? I use the Gmail thread ID as a single point of entry for the entire script (retrieve the thread from the Gmail API, convert it to the format expected by the HelpScout API, upload the thread to HelpScout). Depending on whether the queries have already been made and stored in the cache, the script will retrieve the data from the API or the objects stored on disk in the cache.

What does the namespace argument do? Using namespacing in storr allows you to organize your objects in your store. Especially, it allows you to have objects with the same name but with different values. Here, I planned to use namespaces to keep track of my different attempts. If the first attempt would have failed for some threads, I could fix the problem in the code, and re-attempt the HelpScout API calls just for the ones that failed under a different namespace.

5. Putting it all together

Once I had most of the pieces together, I started by testing the code on the first 100 threads (as it’s the default number of threads that gmailr returns). That was a manageable number to see how the script behaved while being large enough that many different types of messages would be encountered. At that time, I didn’t use the caching system.

Once the first 100 threads could be imported successfully in HelpScout, I wrote a function to retrieve the identifiers for all the threads in the inbox that needed to be imported, and iterated on these identifiers to call the get_hs_response function:

get_all_threads <- function() {
  
  first_it <- gm_threads()
  next_token <- first_it[[1]]$nextPageToken
  
  res <- append(list(), first_it)
  
  while (length(next_token) > 0) {
    tmp <- gm_threads(page_token = next_token)
    res <- append(
      res, tmp
    )
    next_token <- tmp[[1]]$nextPageToken
    message("next token: ", next_token)
  }
  
  res
}

threads <- get_all_threads()

threads_ids <- purrr::map(
  threads,
  ~ purrr::map_chr(.$threads, ~ .$id)
) %>%
  unlist()

hs_res <- purrr::walk(
  threads_ids,
  ~ get_hs_response(., namespace = "v2020-04-10.1")
)

As part of the hook function that takes care of uploading conversations to HelpScout, I check whether the upload was successful and based on that I created and assigned a Gmail label to the thread. This was an additional safeguard that I could use to flag threads that didn’t import successfully.

Once the upload completed, I could then inspect the content of the store:

## Retrieve the threads_ids from the store
idx <- store_hs_responses()$list(namespace = "v2020-04-10.1")

## Retrieve the status code for the HelpScout API responses
is_error <- purrr::map_lgl(
  idx,
  ~ httr::status_code(
    store_hs_responses()$get(., namespace = "v2020-04-10.1")
  ) >=  400
)

## How many calls failed?
sum(is_error)

## Which thread_ids failed?
idx[is_error]

and double check that it was the same threads that were labeled with failure- in Gmail.

Lessons learned

As often with using programming to solve problems, what might seem like a simple task: “Transfering emails from one system to an other” is a collection of small problems. Being able to break down the big problems into small ones, and knowing how to address them comes with experience. Experience will help you recognize problems similar to some you have already solved, and reflecting on these past experiences will help you identify the algorithms, packages, and general code organization that are most likely to help you solve your problem.

In The Carpentries Instructor Training, when we teach about expertise, we talk about how the mental model of experts is denser and more connected. These features make it more difficult for experts to teach beginners because they have forgotten what it is like to not know how to break down a large problem into multiple small ones. The problem here is not just “migrate a bunch of emails between two systems”, there is a lot more to it. I wrote this blog post with the intent to demonstrate the approach I took to break down a problem into small ones and, in the process, describe the tools and techniques I chose to address them.

Expertise is subjective and relative, and I certainly do not claim that the approach I chose here is the best, the most efficient or the most elegant. There is certainly room for improvement. For instance, parts of the code could be re-factored to make it more organized, parts could be rewritten to be more defensive, and there is no documentation (besides this blog post) and barely any comments.

I am interested in hearing your perspective and thoughts on how the problem could have been approached differently and the tools you would have chosen to address it. If this post was useful to you to help you solve a different problem, I would also love to hear about it! Leave a comment below or contact me using the info provided on the left of this page.

Footnotes

You may notice that the Git history for the repo includes the key and secret for the HelpScout OAuth authentication. By themselves, these are not enough to access any data, as you also need to authenticate with a valid HelpScout account within our organization. These credentials have also been revoked. ↩
I’ll be submitting a pull request to gmailr soon. ↩
If you are interested in learning more about the object-oriented programming R6 system, the chapter about it in the “Advanced R” book by Hadley Wickham is a great place to start. ↩
If you are interested in learning more about storr, read the documentation for the package and the vignette on external data that initially helped me get started with this amazingly useful package. ↩

Advent of Code 2018

2018-12-01T00:00:00+00:00

I’m going to try to complete the Advent of Code again this year. I’ll put all the exercises I complete in this post.

Links to the puzzles are at https://adventofcode.com/2018

Day 1

library(tidyverse)


## part 1
readr::read_lines("advent-data/2018-12-01-day1.txt") %>%
  as.numeric() %>%
  sum()

## [1] 408

## part 2
input <- readr::read_lines("advent-data/2018-12-01-day1.txt") %>%
  as.numeric()

already_seen <- function(input) {

  i <- 1

  while (TRUE) {
    v_sum <- cumsum(rep(input, i))
    has_dup <- any(duplicated(v_sum))
    if (has_dup) {
      return(v_sum[which(duplicated(v_sum))[1]])
    }
    i <- i + 1
  }
  
}

already_seen(input)

## [1] 55250

Day 2

input <- readr::read_lines("advent-data/2018-12-02-day2.txt") 

## part 1
count_letters <- function(input) {

  n_letters <- strsplit(input, "") %>%
    purrr::map(table)
  
  has_2 <- function(x) {
    as.integer(any(x == 2))
  }
  has_3 <- function(x) {
    as.integer(any(x == 3))
  }

  has_2_vec <- purrr::map_int(n_letters, has_2)
  has_3_vec <- purrr::map_int(n_letters, has_3)

  sum(has_2_vec) * sum(has_3_vec)
}

count_letters(input)

## [1] 6000

## part 2
all_in <- crossing(in1 = input, in2 = input) %>%
  mutate(
    split1 = strsplit(in1, ""),
    split2 = strsplit(in2, ""),     
    n_diff = map2_int(split1, split2, ~ sum(.x != .y))    
  )

all_in %>%
  filter(n_diff == 1) %>%
  slice(1) %>%
  mutate(word = map2_chr(
    split1,
    split2,
    function(x, y) {
      x <- unlist(x)
      y <- unlist(y)
      paste(x[x == y], collapse = "")
    })) %>%
  pull(word)

## [1] "pbykrmjmizwhxlqnasfgtycdv"

Day 3

That’s far from the prettiest code I’ve written! But it gets the job done.

extract_coords <- function(input) {

  readr::read_delim(
    input, delim = " ", col_names = FALSE
  ) %>%
    tidyr::extract(X1, into = "id", regexp = "([[:digit]]+)") %>%
    tidyr::extract(X3, into = c("x_begin", "y_begin"), regex = "([[:digit:]]+),([[:digit:]]+):") %>%
    tidyr::extract(X4, into = c("width", "height"), regex = "([[:digit:]]+)x([[:digit:]]+)") %>%
    dplyr::select(-X2) %>%
    dplyr::mutate_all(as.numeric)
  
}


find_total_dim <- function(coords) {
  res <- coords %>%
    dplyr::mutate(total_width = x_begin + width,
                  total_height = y_begin + height)

  c(total_width = max(res$total_width),
    total_height = max(res$total_height))
  
}


fill_matrix <- function(input) {

  c <- extract_coords(input)
  m_dim <- find_total_dim(c)

  M <- matrix(0,
              nrow = m_dim[1],
              ncol = m_dim[2])


  for (i in seq_len(nrow(c))) {
    i_s <- (c$x_begin[i] + 1):(c$x_begin[i] + c$width[i])
    j_s <- (c$y_begin[i] + 1):(c$y_begin[i] + c$height[i])
    M[i_s, j_s] <- M[i_s, j_s] + 1
  }

  M
  
}


more_two_claims <- function(input) {
  M <- fill_matrix(input)
  sum(M >= 2) 
}

## part 1 answer
more_two_claims("advent-data/2018-12-03-day3.txt")

## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )

## [1] 109716

overlaps <- function(x_begin, y_begin, width, height, M) {
  i <- (x_begin + 1):(x_begin + width)
  j <- (y_begin + 1):(y_begin + height)
  all(M[i, j] == 1) 
}

no_overlap <- function(input) {
  M <- fill_matrix(input)

  c <- extract_coords(input)
  res <- logical(nrow(c))
  
  for (i in seq_len(nrow(c))) {
    res[i] <- overlaps(c$x_begin[i], c$y_begin[i],
                       c$width[i], c$height[i], M)
  }

  c$id[res]
  
}

## part 2 answer
no_overlap("advent-data/2018-12-03-day3.txt")

## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )

## [1] 124

A list of resources to learn more about R programming (and other things)

2018-02-04T00:00:00+00:00

Data Visualization

Data Visualization a Practical Introduction by Kieran Healy
Fundamental of Data Visualization by Claus O. Wilke

Text analysis

Text Mining with R by Julia Silge and David Robinson

Git

Happy Git and GitHub for the useR by Jenny Bryan

UNIX

The Linux Workbench by Sean Kross

Rcpp/C++

Rcpp for Everyone by Masaki E. Tsuda

Random

Disposable Laptops With Docker Compose And NPM

How to setup magithub if you have GitHub 2-factor authentication enabled?

2018-01-25T00:00:00+00:00

If you are trying to set up magithub when you have 2 factor authentication enabled, here are the steps you need to take:

Go to https://github.com/settings/tokens and create a personal token, and give it the name that the prompt suggest. For me it was: “Emacs package magithub @ francois-XPS-15-9560”, and give it the following scopes: “notification”, “repo” and “user”.

create a file ~/.authinfo with the following:

machine api.github.com login YOUR_GITHUB_USERNAME^magithub password

encrypt the file (assumes you have GPG setup) by running: M-x epa-encrypt-file and give it ~/.authinfo.
Make sure that ~/.authinfo.gpg was created and that its content is right.
Delete the unencrypted ~/.authinfo
Do M-x customize-variable RET auth-sources and put ~/.autoinfo.gpg first in the list of files inspected.

Advent of Code: Day 21

2017-12-21T00:00:00+00:00

Problem

Parts 1 and 2

library(tidyverse)

# rotate matrix 90° clockwise
rotate <- function(mat, n = 1) {
    i <- 0
    while (i < n) {
        mat <- t(apply(mat, 2, rev))
        i <- i + 1
    }
    mat
}

# create a mirror image of the matrix
flip <- function(mat) {
    t(apply(mat, 1, rev))
}

# convert enhancement rule string into matrix
string_to_mat <- function(ii) {
    uu <- strsplit(ii, "/")[[1]]
    s <- nchar(uu[1])
    uu <- unlist(strsplit(uu, ""))
    matrix(uu, ncol = s, byrow = TRUE)
}

# convert matrix into enhancement rule string
mat_to_string <- function(m) {
    paste(apply(m, 1, paste, collapse = ""),
          collapse = "/")
}

# from an enhacement string, find all possible combinations
expand_combinations <- function(ii) {
    m <- string_to_mat(ii)
    s <- dim(m)[1]
    mr <- vector("list", 4)
    for (i in 0:3) {
        mr[[(i*2) + 1]] <- rotate(m, i + 1)        
        mr[[(i*2) + 2]] <- rotate(flip(m), i + 1)
    }
    map_chr(mr, mat_to_string)
}

# create data frame from input, and include all unique combinations
read_rules <- function(input) {
    read_delim(input, delim = "=",
               col_names = FALSE) %>%
        mutate_all(~ gsub(">?\\s+", "", .)) %>%
        set_names(c("ii", "oo")) %>%
        mutate(comb = map(ii, ~ expand_combinations(.))) %>%
        unnest() %>%
        distinct(comb, .keep_all = TRUE)
}

# split a matrix into a list that contains 2x2 or 3x3 
split_mat <- function(m, n) {
    n_mat <- dim(m)[1] / n
    res <- vector("list", n_mat^2)
    idx_start <- seq(1, to = dim(m)[1], by = n)
    k <- 1
    for (i in idx_start) {
        for (j in idx_start) {
            res[[k]] <- m[i:(i + n - 1), j:(j + n - 1)]
            k <- k + 1
        }
    }
    res
}

# reassamble a matrix from a list of 2x2 or 3x3 matrices
list_to_mat <- function(lst) {
    si <- nrow(lst[[1]])  
    s <- si * sqrt(length(lst))
    res <- array(, dim = c(s, s))
    idx_start <- seq(1, to = s, by = si)
    k <- 1
    for (i in idx_start) {
        for (j in idx_start) {
            res[i:(i + si - 1), j:(j + si - 1)] <- lst[[k]]
            k <- k + 1
        }
    }
    res
}

# use the enhancement rule book to grow the matrix
convert_rule <- function(pattern, rules) {
    res <- rules$oo[rules$comb == pattern]
    if (nchar(res) ==  0) stop("problem...")
    string_to_mat(res)       
}

# apply the enhacement algo for a set of rules, and n iterations
enhance <- function(rules, n) {
    i <- 0
    d <- c(2, 3)
    start <- c(".#./..#/###")
    m <- string_to_mat(start)

    while (i < n) {
        s <- dim(m)[1]
        sp <- d[s %% d == 0][1]
        smat <- split_mat(m, sp)
        sstr <- map(smat, mat_to_string)
        mlst <- map(sstr, convert_rule, rules)
        m <- list_to_mat(mlst)
        i <- i + 1
    }   
    m    
}

rules <- read_rules("advent-data/2017-12-21-data.txt")

## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character()
## )

part1 <- enhance(rules, 5)
sum(part1 == "#")

## [1] 190

part2 <- enhance(rules, 18)
sum(part2 == "#")

## [1] 2335049