<h1>How to work with remote Parquet files with the duckdb R package?</h1>
<p>François Michonneau, 2023-06-19, https://francoismichonneau.net/2023/06/duckdb-r-remote-data</p>
<p>For large datasets, it is sometimes convenient to explore them without
downloading them locally. With Arrow, you can work with these remote files if
they are stored in AWS S3 or Google Cloud Storage. It is, however, not yet
possible for files served over HTTPS (it is on the roadmap). On the other hand,
with the “httpfs” extension, DuckDB allows you to query these Parquet files
over the wire.</p>
<p>You can even set things up so you can use dplyr verbs to work with these remote
files. I will demonstrate this using a Parquet version of the <a href="https://allisonhorst.github.io/palmerpenguins/">penguins
dataset</a> hosted on my site.</p>
<p>Let’s start by loading the required packages:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">DBI</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">duckdb</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>We create a <code class="language-plaintext highlighter-rouge">con</code> object to hold our DuckDB connection:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">())</span><span class="w">
</span></code></pre></div></div>
<p>Let’s install (only needed once) and load the <code class="language-plaintext highlighter-rouge">httpfs</code> extension:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbExecute</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"INSTALL httpfs;"</span><span class="p">)</span><span class="w">
</span><span class="n">dbExecute</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"LOAD httpfs;"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>At this point, we could use DuckDB’s SQL syntax to work with our remote dataset:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w">
</span><span class="s2">"SELECT species,
AVG(bill_length_mm) AS avg_bill_length,
AVG(bill_depth_mm) AS avg_bill_depth
FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet')
GROUP BY species;"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 3 × 3
species avg_bill_length avg_bill_depth
<chr> <dbl> <dbl>
1 Adelie 38.8 18.3
2 Gentoo 47.5 15.0
3 Chinstrap 48.8 18.4
</code></pre></div></div>
<p>You can also create a view on this remote file, which in turn will allow
you to use dplyr to query it:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbExecute</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w">
</span><span class="s2">"CREATE VIEW penguins AS
SELECT * FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet');
"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>You can check it worked by running:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbListTables</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] "penguins"
</code></pre></div></div>
<p>Now you can work with this remote data with dplyr:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tbl</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"penguins"</span><span class="p">)</span><span class="w"> </span><span class="o">|></span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">species</span><span class="p">)</span><span class="w"> </span><span class="o">|></span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_bill_length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">bill_length_mm</span><span class="p">),</span><span class="w">
</span><span class="n">avg_bill_depth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">bill_depth_mm</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Source: SQL [3 x 3]
# Database: DuckDB 0.8.1 [francois@Linux 6.2.0-20-generic:R 4.3.0/:memory:]
species avg_bill_length avg_bill_depth
<chr> <dbl> <dbl>
1 Adelie 38.8 18.3
2 Gentoo 47.5 15.0
3 Chinstrap 48.8 18.4
</code></pre></div></div>
<h1>How to use Arrow to work with large CSV files?</h1>
<p>François Michonneau, 2022-10-13, https://francoismichonneau.net/2022/10/import-big-csv</p>
<h2 id="some-background">Some background</h2>
<p>Lucky you! You just got hold of a largish CSV file (let’s say 15 GB,
about 140 million rows). How do you handle this file to be able to
work with it using Apache Arrow?</p>
<p>Going through the documentation of Arrow, you might notice that
several ways are mentioned to import data. They fall into two
families:</p>
<ul>
<li>one that I will refer to as the <strong>Single file API</strong><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>;</li>
<li>the other is the <strong>Dataset API</strong>.</li>
</ul>
<p>The Single file API contains functions for each supported file format
(CSV, JSON, Parquet, Feather/Arrow, ORC). They work on one file at a
time, and they load the data in memory. So depending on the size of
your file and the amount of memory you have available on your system,
it might not be possible to load the dataset this way. If you <em>can</em>
load the dataset in memory, queries will run faster because the data
will be readily accessible to the query engine.</p>
<p>The Dataset API is very flexible. It can read multiple file formats,
you can point to a folder with multiple files and create a dataset
from them, and it can read datasets from multiple sources (even
combining remote and local sources). This API can also be used to read
single files that are too large to fit in memory. This works because
the files are not actually loaded in memory. The functions scan the
content so they know where to look for the data and what the schema is
(the data types and names of each column). When you query the data,
there is some overhead because the query engine needs to first read
the data before it can operate on it. (If you want to see some
examples of what the Dataset API can do, check out the two previous
posts on datasets with Arrow: <a href="/2022/08/arrow-dataset-creation/">Part 1</a>, and <a href="/2022/09/arrow-dataset-part-2/">Part 2</a>)</p>
<p>In this post, we will explore how to convert a large CSV file to the
Apache Parquet format using the Single file and the Dataset APIs with
code examples in R and Python. We do the conversion from CSV to
Parquet, because in a <a href="/2022/08/arrow-dataset-creation/">previous post</a> we found that the Parquet format
provided the best compromise between disk space usage and query
performance. Having the content of this file in the Apache Parquet
format will ensure that we can read and operate on this data quickly.</p>
<h2 id="the-single-file-api-in-r">The Single file API in R</h2>
<p>The functions in the Single file API in R start with <code class="language-plaintext highlighter-rouge">read_</code> or
<code class="language-plaintext highlighter-rouge">write_</code> followed by the name of the file format. For instance,
<code class="language-plaintext highlighter-rouge">read_csv_arrow()</code>, <code class="language-plaintext highlighter-rouge">read_parquet()</code>, and <code class="language-plaintext highlighter-rouge">read_feather()</code> belong to
what I refer to here as the Single file API.</p>
<p>To read our 15 GB CSV file, we would use:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">arrow</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_csv_arrow</span><span class="p">(</span><span class="w">
</span><span class="s2">"~/dataset/path_to_file.csv"</span><span class="p">,</span><span class="w">
</span><span class="n">as_data_frame</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Using <code class="language-plaintext highlighter-rouge">as_data_frame = FALSE</code> keeps the result as an Arrow table which
is a better representation for a file of this size. Attempting to
convert it into a data frame will take longer to load, and you will
most likely run out of memory.</p>
<p>This step takes about 15 seconds on my system. As far as I can tell,
the arrow R package is the only way to load a file of this size in
memory. Both readr/vroom and data.table ran out of memory after
several minutes and before being able to finish reading the file.</p>
<p>At this point, you have an Arrow formatted table loaded in memory that
is ready for you to work with.</p>
<p>To convert this file into the Apache Parquet format using the Single
file API, you would use:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">write_parquet</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="s2">"~/dataset/data.parquet"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Creating this file takes about 85 seconds on my system. The resulting
file is about 9.5 GB, reducing the amount of hard drive space needed
to store the data by a little over a third (about 37%).</p>
<p>The <code class="language-plaintext highlighter-rouge">read_parquet()</code> function will load this dataset the next time you
need to work with it:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">"~/dataset/data.parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">as_data_frame</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Let’s count the number of unique values in one of the columns of this
dataset:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">variable</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">collect</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<p>This query takes only <strong>half a second</strong> on my laptop. Half a second to
summarize the content of 140 million rows: this is fast! Very fast!</p>
<p>Whether you use <code class="language-plaintext highlighter-rouge">read_csv_arrow()</code> or <code class="language-plaintext highlighter-rouge">read_parquet()</code>, the dataset is
loaded in memory using the same representation: an Arrow table. The
performance of queries is therefore the same regardless of the
format used to store the data. In this case, the decision to store
the data as a CSV or a Parquet file will depend on the amount of
storage available, and on how the time saved by reading Parquet rather
than CSV compares to the overhead of converting from one format to the other.</p>
<p>Let’s now use the Dataset API.</p>
<h2 id="the-dataset-api-in-r">The Dataset API in R</h2>
<p>We will read the large CSV file with <code class="language-plaintext highlighter-rouge">open_dataset()</code>. This function
can be pointed to a folder with several files but it can also be used
to read a single file.</p>
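<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/dataset/path_to_file.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">)</span><span class="w">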
</span></code></pre></div></div>
<p>With our 15 GB file, it takes 0.05 seconds to “read” the file. It is
fast because the data does not get loaded in memory. <code class="language-plaintext highlighter-rouge">open_dataset()</code>
scans the content of the file to identify the name of the columns and
their data types.</p>
<p>Running the same query as above, which counts the number of unique
values in a column, takes 18 seconds compared to the 0.5 seconds when
the data is loaded in memory. It is slower because the query engine
needs to read the data. It is the same result that we had found in a
<a href="/2022/08/arrow-dataset-creation/">previous post</a>:
running queries directly on a CSV file is slow. In that post, we also
found that storing the data in the Parquet format sped things
up. Let’s now convert this dataset to Parquet using the Dataset API.</p>
<p>Instead of using a single Parquet file as we did above when we looked
at the Single file API, we will also partition the Parquet dataset to
see how it could help with query performance. The particular dataset I
have on hand does not have any obvious variable we can use to
partition the data. If you are dealing with a dataset that has
timestamps for data collected at regular intervals, partitioning on a
temporal dimension could make sense (that’s what the NYC taxi dataset
does by partitioning by year and month). Instead, here, we can use the
<code class="language-plaintext highlighter-rouge">max_rows_per_file</code> argument of the <code class="language-plaintext highlighter-rouge">write_dataset()</code> function to
limit how large each Parquet file is. At least for this dataset, I
found that limiting the number of rows to 10 million per file seemed
like a good compromise. Each file is about 720 MB which is close to
the file sizes in the NYC taxi dataset. The <a href="https://arrow.apache.org/docs/python/dataset.html#partitioning-performance-considerations">PyArrow
documentation</a>
has a good overview of strategies for partitioning a dataset. The
general recommendation is to avoid individual Parquet files smaller
than 20 MB and larger than 2 GB while avoiding a partition layout that
would create more than 10,000 partitions.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">write_dataset</span><span class="p">(</span><span class="w">
</span><span class="n">data</span><span class="p">,</span><span class="w">
</span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/my-data/"</span><span class="p">,</span><span class="w">
</span><span class="n">max_rows_per_file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e7</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Writing these files on my system takes about 50 seconds. We end up
with 14 Parquet files totaling 9.9 GB.</p>
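<p>The file count and the average file size can be checked with a little
arithmetic. The sketch below is only an illustration: the function names are
hypothetical, the row count is the approximate figure quoted in this post, and
sizes are treated as decimal bytes.</p>

```python
import math


def plan_files(total_rows: int, max_rows_per_file: int, total_bytes: float) -> tuple[int, float]:
    """Return the number of files and the average size of one file in bytes."""
    n_files = math.ceil(total_rows / max_rows_per_file)
    return n_files, total_bytes / n_files


def within_recommended_size(file_bytes: float, low: float = 20e6, high: float = 2e9) -> bool:
    """Check a file size against the 20 MB to 2 GB range recommended above."""
    return low <= file_bytes <= high


# About 140 million rows, 10 million rows per file, 9.9 GB written in total.
n_files, avg_bytes = plan_files(140_000_000, 10_000_000, 9.9e9)
print(n_files, round(avg_bytes / 1e6), within_recommended_size(avg_bytes))
```

<p>Running the same calculation with a different <code class="language-plaintext highlighter-rouge">max_rows_per_file</code> value is a quick way
to see whether the resulting files would stay within the recommended range.</p>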
<p>Next time we want to work with this data, we can load these files
with:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="w">
</span><span class="s2">"~/datasets/my-data"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>It takes about the same amount of time as scanning the CSV file. It
is almost instantaneous taking only 0.02 seconds. Again, this is fast
because the data is not loaded in memory. We saw that with this approach it
took almost 20 seconds to run this query on our CSV file. So what is
the performance of a query on this dataset split into multiple Parquet
files?</p>
<p>Counting the unique values in a column takes just <strong>1 second</strong>. You
read that correctly. One second to summarize 140 million rows. It is a
little slower than running the query with the entire dataset loaded in
memory, but opening the dataset is much faster than loading it. And because the dataset is
not loaded in memory, you are not limited by the amount of memory you
have available. With the Single File API, a file of 15 GB is the upper
limit of what my laptop with 32 GB of RAM can handle.</p>
<p>One of the advantages of the Arrow ecosystem is that it is
polyglot. The approach we described with R also works with Python. And
because both languages use the same C++ backend, the code looks very
similar.</p>
<h2 id="single-file-api-in-python">Single file API in Python</h2>
<p>There are two functions in the PyArrow Single file API to read CSV files:
<code class="language-plaintext highlighter-rouge">read_csv()</code> and <code class="language-plaintext highlighter-rouge">open_csv()</code>. While <code class="language-plaintext highlighter-rouge">read_csv()</code> loads all the data
in memory and does it fast by using multiple threads to read different
parts of the file, <code class="language-plaintext highlighter-rouge">open_csv()</code> reads the data in batches and uses a
single thread.</p>
<p>If the CSV file is small enough, you should use <code class="language-plaintext highlighter-rouge">read_csv()</code>. The code
to read the CSV file and write it to a Parquet file would then look
like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="n">pa</span>
<span class="kn">import</span> <span class="nn">pyarrow.csv</span>
<span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="n">pq</span>
<span class="n">in_path</span> <span class="o">=</span> <span class="s">'~/datasets/data.csv'</span>
<span class="n">out_path</span> <span class="o">=</span> <span class="s">'~/datasets/data.parquet'</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pa</span><span class="p">.</span><span class="n">csv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">in_path</span><span class="p">)</span>
<span class="n">pq</span><span class="p">.</span><span class="n">write_table</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">out_path</span><span class="p">)</span>
</code></pre></div></div>
<p>In our case, the file is too large to fit in memory<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. So instead of
using <code class="language-plaintext highlighter-rouge">read_csv()</code>, we need to use <code class="language-plaintext highlighter-rouge">open_csv()</code>. Because the CSV file
is read in chunks, the code is a little more complex. We need to loop
through each chunk, read it, and write it to the Parquet file. This
uses little memory but is not as fast as using <code class="language-plaintext highlighter-rouge">read_csv()</code>. While
<code class="language-plaintext highlighter-rouge">read_csv()</code> is multi-threaded, <code class="language-plaintext highlighter-rouge">open_csv()</code> uses a single
thread. When using <code class="language-plaintext highlighter-rouge">open_csv()</code>, the data types need to be consistent
in your columns. The function infers the data types on the first chunk
of data read, and if the type changes halfway through your dataset in
one of your columns, you will run into errors. You can avoid this by
specifying the data types manually.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># from <https://stackoverflow.com/a/68563617/1113276>
</span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="n">pa</span>
<span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="n">pq</span>
<span class="kn">import</span> <span class="nn">pyarrow.csv</span>
<span class="n">in_path</span> <span class="o">=</span> <span class="s">'~/datasets/data.csv'</span>
<span class="n">out_path</span> <span class="o">=</span> <span class="s">'~/datasets/data.parquet'</span>
<span class="n">writer</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">with</span> <span class="n">pyarrow</span><span class="p">.</span><span class="n">csv</span><span class="p">.</span><span class="n">open_csv</span><span class="p">(</span><span class="n">in_path</span><span class="p">)</span> <span class="k">as</span> <span class="n">reader</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">next_chunk</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">next_chunk</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">break</span>
        <span class="k">if</span> <span class="n">writer</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">writer</span> <span class="o">=</span> <span class="n">pq</span><span class="p">.</span><span class="n">ParquetWriter</span><span class="p">(</span><span class="n">out_path</span><span class="p">,</span> <span class="n">next_chunk</span><span class="p">.</span><span class="n">schema</span><span class="p">)</span>
        <span class="n">next_table</span> <span class="o">=</span> <span class="n">pa</span><span class="p">.</span><span class="n">Table</span><span class="p">.</span><span class="n">from_batches</span><span class="p">([</span><span class="n">next_chunk</span><span class="p">])</span>
        <span class="n">writer</span><span class="p">.</span><span class="n">write_table</span><span class="p">(</span><span class="n">next_table</span><span class="p">)</span>
<span class="n">writer</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>
<p>On my system, the conversion from CSV to Parquet takes about 190
seconds. Reading the Parquet file can be done with:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">pq</span><span class="p">.</span><span class="n">ParquetDataset</span><span class="p">(</span><span class="n">out_path</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>
</code></pre></div></div>
<p>With this approach, the dataset is in memory, just like when we were
using R. Again with 32 GB of RAM in my laptop, I need to be careful
with what is running on my system to be able to load this dataset
without running out of memory and crashing my Python session.</p>
<h2 id="the-dataset-api-in-python">The Dataset API in Python</h2>
<p>To load the CSV file with the Dataset API, we use the <code class="language-plaintext highlighter-rouge">dataset()</code>
function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyarrow.dataset</span> <span class="k">as</span> <span class="n">ds</span>
<span class="n">in_path</span> <span class="o">=</span> <span class="s">"~/datasets/data.csv"</span>
<span class="n">out_path</span> <span class="o">=</span> <span class="s">"~/datasets/my-data/"</span>
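<span class="n">data</span> <span class="o">=</span> <span class="n">ds</span><span class="p">.</span><span class="n">dataset</span><span class="p">(</span><span class="n">in_path</span><span class="p">,</span> <span class="nb">format</span> <span class="o">=</span> <span class="s">"csv"</span><span class="p">)</span>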
</code></pre></div></div>
<p>Just like with R, importing this file takes about 0.02 seconds.</p>
<p>To convert it to a collection of Parquet files, you use the
<code class="language-plaintext highlighter-rouge">write_dataset()</code> function. This function takes the same
<code class="language-plaintext highlighter-rouge">max_rows_per_file</code> argument to control the size of the Parquet file
in each partition.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ds</span><span class="p">.</span><span class="n">write_dataset</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">out_path</span><span class="p">,</span> <span class="nb">format</span> <span class="o">=</span> <span class="s">"parquet"</span><span class="p">,</span>
                 <span class="n">max_rows_per_file</span> <span class="o">=</span> <span class="mi">10_000_000</span><span class="p">)</span>
</code></pre></div></div>
<p>Reading this collection of Parquet files can also be done with the
<code class="language-plaintext highlighter-rouge">dataset()</code> function, just like when we used the function to read the
single CSV file above. The <code class="language-plaintext highlighter-rouge">dataset()</code> function is very flexible and
can be used to import data in a variety of formats and structures,
and can even combine files from local and remote locations. The <code class="language-plaintext highlighter-rouge">format</code>
argument is optional here because the function assumes Parquet files when
no format is given.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">ds</span><span class="p">.</span><span class="n">dataset</span><span class="p">(</span><span class="n">out_path</span><span class="p">,</span> <span class="nb">format</span> <span class="o">=</span> <span class="s">"parquet"</span><span class="p">)</span>
</code></pre></div></div>
<p>Given the current functionalities implemented in PyArrow, querying
datasets of this size is possible but it is neither blazing fast nor
convenient. A good alternative is to use
<a href="https://ibis-project.org">Ibis</a> with <a href="https://duckdb.org">DuckDB</a> as
a backend. Ibis provides a single interface to work with data stored
in memory or in databases. DuckDB is a self-contained database
designed for data analytics. These tools deserve a lot more than a
one-sentence summary but this is beyond the scope of this post.</p>
<p>To count the number of unique values, you could use the following
approach:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ibis</span>
<span class="n">ibis</span><span class="p">.</span><span class="n">options</span><span class="p">.</span><span class="n">interactive</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">con</span> <span class="o">=</span> <span class="n">ibis</span><span class="p">.</span><span class="n">duckdb</span><span class="p">.</span><span class="n">connect</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">con</span><span class="p">.</span><span class="n">register</span><span class="p">(</span><span class="s">"parquet:///home/user/datasets/my-data/*.parquet"</span><span class="p">,</span> <span class="n">table_name</span> <span class="o">=</span> <span class="s">"table"</span><span class="p">)</span>
<span class="n">con</span><span class="p">.</span><span class="n">table</span><span class="p">(</span><span class="s">"table"</span><span class="p">).</span><span class="n">variable</span><span class="p">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div></div>
<p>Just like with using R, this takes about a second to count the unique
values in one column of our 140 million row dataset.</p>
<h2 id="what-this-post-didnt-mention">What this post didn’t mention</h2>
<p>I focused on the reading of a CSV file and its conversion to
Parquet. I didn’t talk about all the options that both the Single file
and the Dataset APIs have to customize the format of the files that
are being imported. For instance, both APIs can be used to specify a
different column-separator, and cell content that should be treated as
missing data.</p>
<h2 id="conclusion">Conclusion</h2>
<p>For a 15 GB data file, the dataset API is better suited to read,
convert, and query the data. There is an overhead associated with not
having the data in memory but it is greatly reduced if the data is
stored as Parquet files. Another advantage is that the approach
developed here would scale to much larger datasets where the Single
file API would not be able to load the data in memory.</p>
<p>With the dataset in this example, the Single file API did not have an
opportunity to shine given the hardware constraints of a modern
laptop. However, if you are dealing with datasets that fit easily in
memory, working with data directly in memory will lead to better query
performance.</p>
<p>To summarize what we learned in this post, here is a brief decision guide
to help you choose the correct API to import your data.</p>
<figure class="">
<img src="/images/2022-09-decision-map.webp" alt="Decision tree to help you choose the most suitable API for your
data. If your dataset is large (more than a third of your available
RAM) or if it is split into multiple files use the Dataset
API. Reserve the use of the Single file API when the dataset is
small." /><figcaption>
Decision tree to help you choose the appropriate
Apache Arrow API for your dataset.
</figcaption></figure>
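<p>The same rule of thumb can be written as a small helper. This is only a
sketch of the decision tree above: the function name and its return values are
hypothetical, and the one-third threshold comes from the figure.</p>

```python
def choose_arrow_api(dataset_bytes: float, available_ram_bytes: float, n_files: int = 1) -> str:
    """Suggest which Arrow API to use, following the decision guide above."""
    # Several files, or a dataset larger than a third of the available RAM:
    # use the Dataset API.
    if n_files > 1 or dataset_bytes > available_ram_bytes / 3:
        return "dataset"
    # Otherwise the dataset is small enough for the Single file API.
    return "single-file"


GB = 1e9
# The 15 GB CSV file from this post on a laptop with 32 GB of RAM.
print(choose_arrow_api(15 * GB, 32 * GB))
```

<p>For the file used in this post, the helper points to the Dataset API, which
matches the conclusion above.</p>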
<h2 id="acknowledgments">Acknowledgments</h2>
<p>Thank you to <a href="https://twitter.com/kae_suarez/">Kae Suarez</a> and
<a href="https://djnavarro.net">Danielle Navarro</a> for reviewing this post and
providing feedback that improved its content.</p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>This is not an official name, but I found it helpful to
group these functions that work on one file type at a time. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>I am not sure why it fit in memory when I was loading it in R but
not with Python. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>François Michonneau (francois.michonneau@gmail.com)</p>
<p>A short practical guide to load a 15 GB dataset with Apache Arrow using R and Python.</p>
<p>Creating an Arrow dataset (part 2), 2022-09-06, https://francoismichonneau.net/2022/09/arrow-dataset-part-2</p>
<h2 id="background">Background</h2>
<p>In this follow-up post (see
<a href="/2022/08/arrow-dataset-creation/">part 1</a> if you missed
it), we will explore what happens to the query performance if we read
the files straight into Arrow instead of downloading them locally first.</p>
<h2 id="reading-remote-csv-files">Reading remote CSV files</h2>
<p>In part 1, we first downloaded the compressed CSV files locally
(using the <code class="language-plaintext highlighter-rouge">download.file()</code> function) and then we used the
<code class="language-plaintext highlighter-rouge">open_dataset()</code> function on this set of files to make it available to
Arrow.</p>
<p>However, it is possible to bypass the local download. We can import the
files directly over an Internet connection using the <code class="language-plaintext highlighter-rouge">read_csv_arrow()</code>
function and providing the file URL as the first argument. Once the file
is loaded in memory, we can then write it to disk in the parquet format
(given that we learned in
<a href="/2022/08/arrow-dataset-creation/">part 1</a> that this
format provided the best compromise of disk space usage and query
performance).</p>
<p>We can modify the <code class="language-plaintext highlighter-rouge">download_daily_package_logs_csv()</code>
function from part 1 as follows (changed lines are marked with a
<code class="language-plaintext highlighter-rouge"># <---</code> comment at the end of the line).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">arrow</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Download the data set for a given date from the RStudio CRAN log website.</span><span class="w">
</span><span class="c1">## `date` is a single date for which we want the data</span><span class="w">
</span><span class="c1">## `path` is where we want the data to live</span><span class="w">
</span><span class="n">download_daily_package_logs_parquet</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/cran-logs-parquet-by-day"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1">## build the URL for the download</span><span class="w">
</span><span class="n">date</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">parse_date</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w">
</span><span class="n">url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="w">
</span><span class="s1">'https://cran-logs.rstudio.com/'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="s1">'/'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s1">'.csv.gz'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">## build the path for the destination of the download</span><span class="w">
</span><span class="n">file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="w">
</span><span class="n">path</span><span class="p">,</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"year="</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">year</span><span class="p">),</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"month="</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">month</span><span class="p">),</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s2">".parquet"</span><span class="p">)</span><span class="w"> </span><span class="c1"># <--- change extension to .parquet</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">## create the folder if it doesn't exist</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">dir.exists</span><span class="p">(</span><span class="n">dirname</span><span class="p">(</span><span class="n">file</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dir.create</span><span class="p">(</span><span class="n">dirname</span><span class="p">(</span><span class="n">file</span><span class="p">),</span><span class="w"> </span><span class="n">recursive</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## download the file</span><span class="w">
</span><span class="n">message</span><span class="p">(</span><span class="s2">"Downloading data for "</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s2">" ... "</span><span class="p">,</span><span class="w"> </span><span class="n">appendLF</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">arrow</span><span class="o">::</span><span class="n">read_csv_arrow</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="c1"># <--- read directly from URL</span><span class="w">
</span><span class="n">arrow</span><span class="o">::</span><span class="n">write_parquet</span><span class="p">(</span><span class="n">sink</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">file</span><span class="p">)</span><span class="w"> </span><span class="c1"># <--- convert to parquet on disk</span><span class="w">
</span><span class="n">message</span><span class="p">(</span><span class="s2">"done."</span><span class="p">)</span><span class="w">
</span><span class="c1">## quick check to make sure that the file was created</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">file</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"Download failed for "</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="n">call.</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## return the path</span><span class="w">
</span><span class="n">file</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## This function is unchanged from part 1: it checks that `date` is a</span><span class="w">
</span><span class="c1">## single past date and extracts the year and month from it</span><span class="w">
</span><span class="n">parse_date</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stopifnot</span><span class="p">(</span><span class="w">
</span><span class="s2">"`date` must be a date"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inherits</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="s2">"Date"</span><span class="p">),</span><span class="w">
</span><span class="s2">"provide only one date"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">identical</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w">
</span><span class="s2">"date must be in the past"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">Sys.Date</span><span class="p">()</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">date_chr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w">
</span><span class="n">year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">year</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1900L</span><span class="p">,</span><span class="w">
</span><span class="n">month</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">mon</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1L</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Now that we are set up, we can create the folder structure the same way
we did in part 1.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dates_to_get</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="w">
</span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-06-01"</span><span class="p">),</span><span class="w">
</span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-08-15"</span><span class="p">),</span><span class="w">
</span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"day"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">purrr</span><span class="o">::</span><span class="n">walk</span><span class="p">(</span><span class="n">dates_to_get</span><span class="p">,</span><span class="w"> </span><span class="n">download_daily_package_logs_parquet</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The result is similar to what we achieved in part 1: we have one file
for each day, placed in a folder corresponding to its month. This time,
however, instead of compressed CSV files, we have parquet
files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/datasets/cran-logs-parquet-by-day/
└── year=2022
├── month=6
│ ├── 2022-06-01.parquet
│ ├── 2022-06-02.parquet
│ ├── 2022-06-03.parquet
│ ├── ...
│ └── 2022-06-30.parquet
├── month=7
│ ├── 2022-07-01.parquet
│ ├── 2022-07-02.parquet
│ ├── 2022-07-03.parquet
│ ├── ...
│ └── 2022-07-31.parquet
└── month=8
├── 2022-08-01.parquet
├── 2022-08-02.parquet
├── 2022-08-03.parquet
├── ...
└── 2022-08-15.parquet
</code></pre></div></div>
<p>Let’s check how large this data is compared to the datasets we created
in part 1:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_size</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">fs</span><span class="o">::</span><span class="n">dir_info</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w"> </span><span class="n">recurse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"file"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pull</span><span class="p">(</span><span class="n">size</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="nf">sum</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">tribble</span><span class="p">(</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">Format</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w">
</span><span class="s2">"Compressed CSV"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-csv/"</span><span class="p">),</span><span class="w">
</span><span class="s2">"Arrow"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-arrow/"</span><span class="p">),</span><span class="w">
</span><span class="s2">"Parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet/"</span><span class="p">),</span><span class="w">
</span><span class="s2">"Parquet by day"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet-by-day/"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 4 × 2
Format size
<chr> <fs::bytes>
1 Compressed CSV 5.01G
2 Arrow 29.67G
3 Parquet 5.06G
4 Parquet by day 4.63G
</code></pre></div></div>
<p>The dataset with one parquet file per day is slightly smaller than the
one where we let <code class="language-plaintext highlighter-rouge">write_dataset()</code> do its own partitioning, which led
to one file per month.</p>
<p>We can now compare how quickly Arrow can read these datasets.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
</span><span class="n">parquet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">),</span><span class="w">
</span><span class="n">parquet_by_day</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet-by-day"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">),</span><span class="w">
</span><span class="n">check</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 parquet 139.43ms 143.66ms 6.62 7.91MB 0
2 parquet_by_day 3.52ms 3.82ms 254. 4.28KB 6.45
</code></pre></div></div>
<p>Even though there are more files to parse (76 vs. 3), opening the
dataset with one parquet file per day is faster in this benchmark (a
median of about 4 ms vs. 144 ms).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cran_logs_parquet</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">)</span><span class="w">
</span><span class="n">cran_logs_parquet_by_day</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet-by-day"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Let’s now explore the performance of a few queries on these datasets.</p>
<p>First, how long does it take to compute the number of rows in these
datasets:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
</span><span class="n">parquet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">),</span><span class="w">
</span><span class="n">parquet_by_day</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">cran_logs_parquet_by_day</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 parquet 743µs 773µs 1267. 4.74KB 8.48
2 parquet_by_day 745µs 773µs 1273. 1.97KB 10.7
</code></pre></div></div>
<p>Not much of a difference.</p>
<p>Let’s now compare the performance of the query we ran in part 1, where
we computed the 10 most downloaded packages in the period covered by our
dataset.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_10_packages</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">package</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">n_million_downloads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">1e6</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">collect</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
</span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">),</span><span class="w">
</span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_parquet_by_day</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_al…¹
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:by>
1 top_10_packages(cran_logs_parquet) 3.58s 3.58s 0.279 7.19MB
2 top_10_packages(cran_logs_parquet_by_day) 5.76s 5.76s 0.174 165.36KB
# … with 1 more variable: `gc/sec` <dbl>, and abbreviated variable name
# ¹mem_alloc
# ℹ Use `colnames()` to see all variable names
</code></pre></div></div>
<p>This query runs about 2.2 seconds faster (3.58 s vs. 5.76 s) on the
dataset with one parquet file per month compared to the dataset with one
parquet file per day.</p>
<p>The way a dataset is partitioned has an impact on the performance of
queries. If you are filtering your dataset along a variable used in the
partitioning, some of the files can be skipped. Arrow can directly and
only read the file(s) with the relevant information for your query. For
instance, if you are performing a query that only touches the month of
July, Arrow does not need to look at the files for June or August,
leading to potential speed-ups.</p>
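<p>A minimal sketch of this idea on a toy dataset is shown below. The data
frame, its values, and the temporary folder are made up for illustration; the
folder layout mimics the <code class="language-plaintext highlighter-rouge">year=.../month=...</code> structure used above.</p>

```r
library(arrow)
library(dplyr)

# Toy download log: the column names mimic the CRAN logs, the values are made up.
logs <- data.frame(
  year = 2022L,
  month = c(6L, 6L, 7L, 7L, 7L, 8L),
  package = c("arrow", "dplyr", "arrow", "arrow", "dplyr", "arrow")
)

# Write one folder per year and month (year=2022/month=6/, and so on).
toy_path <- tempfile("toy-cran-logs")
write_dataset(
  logs,
  toy_path,
  format = "parquet",
  partitioning = c("year", "month")
)

# `month` is a partition variable, so this filter can be resolved from the
# folder names: the files for June and August do not need to be read.
july_counts <- open_dataset(toy_path, format = "parquet") %>%
  filter(month == 7) %>%
  count(package, sort = TRUE) %>%
  collect()

july_counts
```

<p>Note that this only helps when the filter uses a partition variable. In
the dataset with one file per day, the partition variables are <code class="language-plaintext highlighter-rouge">year</code> and
<code class="language-plaintext highlighter-rouge">month</code>, while <code class="language-plaintext highlighter-rouge">date</code> is a regular column stored inside the files.</p>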
<p>Would the partitioning by day help us run our query faster if we were to
compute the 10 most downloaded packages for a single day? After all, in
this case, we would only need to look at one of the files in our folder
of parquet files, and the file in question would be smaller than one
that has all the data for the month. Let’s compare the performance of
this query for August 1st, 2022:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_10_packages_by_day</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-08-01"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">package</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">collect</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
</span><span class="n">top_10_packages_by_day</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">),</span><span class="w">
</span><span class="n">top_10_packages_by_day</span><span class="p">(</span><span class="n">cran_logs_parquet_by_day</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 2 × 6
expression min median itr/s…¹ mem_a…²
<bch:expr> <bch:> <bch:> <dbl> <bch:b>
1 top_10_packages_by_day(cran_logs_parquet) 304ms 348ms 2.87 222KB
2 top_10_packages_by_day(cran_logs_parquet_by_day) 354ms 354ms 2.82 167KB
# … with 1 more variable: `gc/sec` <dbl>, and abbreviated variable names
# ¹`itr/sec`, ²mem_alloc
# ℹ Use `colnames()` to see all variable names
</code></pre></div></div>
<p>Interestingly, running the query on the dataset with one parquet file
per day is not faster. In the run shown above, the median time is about
the same for both datasets (348 ms vs. 354 ms), and the minimum time is
lower with the monthly files (304 ms vs. 354 ms). In this situation, the
benefit of reading a single small file does not compensate for the
overhead associated with having many small files. For the benefits of
partitioning to be visible, we would need to have more data in each
parquet file.</p>
<p>We don’t see a performance benefit of having many small files even when
we try to get the result on a single day. But how does this partitioning
impact the performance of a query that needs to access rows spread across
all the files? Let’s compare how a query that looks at the number of
downloads per day for a given package performs on both datasets.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">package_downloads_by_day</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"arrow"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">package</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">pkg</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">collect</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
</span><span class="n">package_downloads_by_day</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">),</span><span class="w">
</span><span class="n">package_downloads_by_day</span><span class="p">(</span><span class="n">cran_logs_parquet_by_day</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 6
expression min median `itr/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl>
1 package_downloads_by_day(cran_logs_parquet) 3.31s 3.31s 0.302
2 package_downloads_by_day(cran_logs_parquet_by_day) 4.46s 4.46s 0.224
# … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
# ℹ Use `colnames()` to see all variable names
</code></pre></div></div>
<p>In this case, it takes about 35% longer (4.46 s vs. 3.31 s) to perform
this query on the dataset with one parquet file per day. Here, the
performance is affected by having to look inside many more files.</p>
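<p>As a quick check, the relative differences can be computed directly from
the median timings reported in the three benchmark tables above. This is only
a sketch: the data frame and the <code class="language-plaintext highlighter-rouge">pct_longer</code> column name are made up for
illustration, and the timings come from single runs.</p>

```r
# Median timings copied from the three benchmark tables above, in seconds.
medians <- data.frame(
  query = c(
    "top 10 packages",
    "top 10 packages for one day",
    "downloads per day for one package"
  ),
  by_month = c(3.58, 0.348, 3.31),
  by_day = c(5.76, 0.354, 4.46)
)

# How much longer the query takes on the one-file-per-day dataset,
# as a percentage of the timing on the monthly files.
medians$pct_longer <- round(
  100 * (medians$by_day - medians$by_month) / medians$by_month,
  1
)

medians
```

<p>With these timings, the dataset with one file per day is roughly 61%, 2%,
and 35% slower for the three queries, respectively.</p>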
<h2 id="conclusion">Conclusion</h2>
<p>This small example illustrates that it might be worth exploring how best
to partition your dataset to benefit the most from the speed that Arrow
brings to your queries. In this example, the partitioning that seemed
the most “natural” based on the format in which the data is provided (one
parquet file per day) is not the best to make queries run fast.</p>
<p>The variables you include in your queries also have a role to play when
deciding how to partition your dataset. It might be best to partition
your dataset according to the variables you use most often in your queries.</p>
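<p>As a rough sketch of this idea, suppose most queries filter on a
<code class="language-plaintext highlighter-rouge">country</code> column. The data below is made up, and the choice of
<code class="language-plaintext highlighter-rouge">country</code> as the partitioning variable is an assumption for the sake of
the example:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(arrow)
library(dplyr)

## Made-up data: 1000 rows, 5 countries with 200 rows each
toy_logs <- data.frame(
  package = rep(c("dplyr", "arrow", "duckdb", "bench"), times = 250),
  country = rep(c("FR", "US", "DE", "JP", "BR"), each = 200)
)

## Partition on the variable used most often in filters
toy_dir <- file.path(tempdir(), "toy-by-country")
write_dataset(toy_logs, toy_dir, format = "parquet",
              partitioning = "country")

## A filter on the partitioning variable only needs the files
## stored under country=FR
fr_downloads <- open_dataset(toy_dir) %>%
  filter(country == "FR") %>%
  count(package) %>%
  collect()
fr_downloads
</code></pre></div></div>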
<p>The useR!2022 Arrow tutorial has a <a href="https://arrow-user2022.netlify.app/data-storage.html#multi-file-data-sets">convincing
demonstration</a>
that taking advantage of partitioning for your queries makes them run
much faster.</p>
<details>
<summary>
<p>Expand for Session Info</p>
</summary>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sessioninfo</span><span class="o">::</span><span class="n">session_info</span><span class="p">()</span><span class="w">
</span></code></pre></div> </div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os Ubuntu 22.04.1 LTS
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Paris
date 2022-09-01
pandoc NA (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
arrow * 9.0.0 2022-08-10 [1] CRAN (R 4.2.1)
assertthat 0.2.1 2019-03-21 [1] RSPM
backports 1.4.1 2021-12-13 [1] RSPM
bench 1.1.2 2021-11-30 [1] RSPM
bit 4.0.4 2020-08-04 [1] RSPM
bit64 4.0.5 2020-08-30 [1] RSPM
broom 1.0.0 2022-07-01 [1] RSPM
cellranger 1.1.0 2016-07-27 [1] RSPM
cli 3.3.0 2022-04-25 [1] RSPM (R 4.2.0)
colorspace 2.0-3 2022-02-21 [1] RSPM
crayon 1.5.1 2022-03-26 [1] RSPM
DBI 1.1.3 2022-06-18 [1] RSPM
dbplyr 2.2.1 2022-06-27 [1] RSPM
digest 0.6.29 2021-12-01 [1] RSPM
dplyr * 1.0.9 2022-04-28 [1] RSPM
ellipsis 0.3.2 2021-04-29 [1] RSPM
evaluate 0.15 2022-02-18 [1] RSPM
fansi 1.0.3 2022-03-24 [1] RSPM
fastmap 1.1.0 2021-01-25 [1] RSPM
forcats * 0.5.1 2021-01-27 [1] RSPM
fs 1.5.2 2021-12-08 [1] RSPM
gargle 1.2.0 2021-07-02 [1] RSPM
generics 0.1.3 2022-07-05 [1] RSPM
ggplot2 * 3.3.6 2022-05-03 [1] RSPM
glue 1.6.2 2022-02-24 [1] RSPM (R 4.2.0)
googledrive 2.0.0 2021-07-08 [1] RSPM
googlesheets4 1.0.0 2021-07-21 [1] RSPM
gtable 0.3.0 2019-03-25 [1] RSPM
haven 2.5.0 2022-04-15 [1] RSPM
hms 1.1.1 2021-09-26 [1] RSPM
htmltools 0.5.3 2022-07-18 [1] RSPM
httr 1.4.3 2022-05-04 [1] RSPM
jsonlite 1.8.0 2022-02-22 [1] RSPM
knitr 1.39 2022-04-26 [1] RSPM
lifecycle 1.0.1 2021-09-24 [1] RSPM
lubridate 1.8.0 2021-10-07 [1] RSPM
magrittr 2.0.3 2022-03-30 [1] RSPM
modelr 0.1.8 2020-05-19 [1] RSPM
munsell 0.5.0 2018-06-12 [1] RSPM
pillar 1.8.0 2022-07-18 [1] RSPM
pkgconfig 2.0.3 2019-09-22 [1] RSPM
profmem 0.6.0 2020-12-13 [1] RSPM
purrr * 0.3.4 2020-04-17 [1] RSPM
R6 2.5.1 2021-08-19 [1] RSPM
readr * 2.1.2 2022-01-30 [1] RSPM
readxl 1.4.0 2022-03-28 [1] RSPM
reprex 2.0.1 2021-08-05 [1] RSPM
rlang 1.0.4 2022-07-12 [1] RSPM (R 4.2.0)
rmarkdown 2.14 2022-04-25 [1] RSPM
rvest 1.0.2 2021-10-16 [1] RSPM
scales 1.2.0 2022-04-13 [1] RSPM
sessioninfo 1.2.2 2021-12-06 [1] RSPM
stringi 1.7.8 2022-07-11 [1] RSPM
stringr * 1.4.0 2019-02-10 [1] RSPM
tibble * 3.1.8 2022-07-22 [1] RSPM
tidyr * 1.2.0 2022-02-01 [1] RSPM
tidyselect 1.1.2 2022-02-21 [1] RSPM
tidyverse * 1.3.2 2022-07-18 [1] RSPM
tzdb 0.3.0 2022-03-28 [1] RSPM
utf8 1.2.2 2021-07-24 [1] RSPM
vctrs 0.4.1 2022-04-13 [1] RSPM
withr 2.5.0 2022-03-03 [1] RSPM
xfun 0.31 2022-05-10 [1] RSPM
xml2 1.3.3 2021-11-30 [1] RSPM
yaml 2.3.5 2022-02-21 [1] RSPM
[1] /home/francois/.R-library
[2] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────
</code></pre></div> </div>
</details>François Michonneaufrancois.michonneau@gmail.comHow does partitioning impact query performance?Creating an Arrow dataset2022-08-22T00:00:00+00:002022-08-22T00:00:00+00:00https://francoismichonneau.net/2022/08/arrow-dataset-creation<h2 id="background">Background</h2>
<p>While getting started with Apache Arrow, I was intrigued by the variety
of formats Arrow supports. Arrow tutorials tend to start with already
prepared datasets ready to be ingested by <code class="language-plaintext highlighter-rouge">open_dataset()</code>. I wanted to
explore what it takes to create your own dataset for analysis with
Arrow, and to understand the respective benefits of the different file
formats it supports.</p>
<p>Arrow can read in a variety of formats: <code class="language-plaintext highlighter-rouge">parquet</code>, <code class="language-plaintext highlighter-rouge">arrow</code> (also known
as <code class="language-plaintext highlighter-rouge">ipc</code> and <code class="language-plaintext highlighter-rouge">feather</code>)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, and text-based formats like <code class="language-plaintext highlighter-rouge">csv</code> (as well
as <code class="language-plaintext highlighter-rouge">tsv</code>). Additionally, Arrow provides tools to convert between these
formats.</p>
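<p>For single files, a conversion could look like the minimal sketch below.
It uses the first rows of the built-in <code class="language-plaintext highlighter-rouge">mtcars</code> data frame and temporary
files as stand-ins (both are assumptions for illustration); the multi-file
case is covered later in this post.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(arrow)

## Start from a small CSV file written to a temporary location
csv_file <- tempfile(fileext = ".csv")
write.csv(head(mtcars), csv_file, row.names = FALSE)

## Read the CSV with Arrow, then write it back in two other formats
tbl <- read_csv_arrow(csv_file)
parquet_file <- tempfile(fileext = ".parquet")
arrow_file <- tempfile(fileext = ".arrow")
write_parquet(tbl, parquet_file)
write_feather(tbl, arrow_file)

## Both files should hold the same table
from_parquet <- read_parquet(parquet_file)
from_arrow <- read_feather(arrow_file)
identical(dim(from_parquet), dim(from_arrow))
</code></pre></div></div>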
<p>Having the possibility to import datasets in a variety of formats is
helpful as you are less constrained by the type of data you can start
your analysis on. However, if you are building a dataset from scratch,
which one should you choose?</p>
<p>To try to answer this question, we will be using the <code class="language-plaintext highlighter-rouge">{arrow}</code> R package
to compare the amount of hard drive space these file formats use and the
performance of a query in a multi-file dataset using these different
formats. This is not a formal evaluation of the performance of Arrow or
of how best to optimize the partitioning of a dataset; rather, it is a brief
exploration of the tradeoffs that come with using the different file formats
supported by Arrow. I also don’t explain the differences in the data
structure of these different formats.</p>
<h2 id="the-dataset">The dataset</h2>
<p>We will be using data from <a href="https://cran-logs.rstudio.com/">https://cran-logs.rstudio.com/</a>. This site
gives you access to the log files for all hits to the CRAN<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> mirror
hosted by RStudio. For each day since October 1st, 2012, there is a
compressed CSV file (file with the extension <code class="language-plaintext highlighter-rouge">.csv.gz</code>) that records the
downloaded packages. Each row contains the date, the time, the name of
the R package downloaded, the R version used, the architecture (32-bit
or 64-bit), the operating system, the country inferred from the IP
address, and a daily unique identifier assigned to each IP address. This
website also has similar data for the daily downloads of R itself, but I
will not be using it in this post.</p>
<p>For this exploration, we are going to limit ourselves to a couple of
months of data, which is enough for our purpose. We
will download the data for the period from June 1st, 2022 to August
15th, 2022.</p>
<p>Arrow is designed to read data that is split across multiple files. So,
you can point <code class="language-plaintext highlighter-rouge">open_dataset()</code> to a directory that contains all the
files that make up your dataset. There is no need to loop over each file
to build your dataset in memory. Splitting your datasets across multiple
files can even make queries on your dataset faster, as only some of the
files might need to be accessed to get the results needed. Depending on
the type of queries you perform most often on your dataset, it can be
worth considering how best to partition your files to accelerate your
analyses (but this is beyond the scope of this post). Here, the files
are provided by date, and we will keep a time-based file organization.</p>
<p>We will use a <a href="https://hive.apache.org/">Hive-style</a> partitioning by
year and month. We will have a directory for each year (there is only
one year in our example), and within it, a directory for each month. The
directories are named according to the convention
<code class="language-plaintext highlighter-rouge"><variable_name>=<value></code>. So we will want to organize the files as
illustrated below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>└── year=2022
├── month=6
│ └── <data files>
├── month=7
│ └── <data files>
└── month=8
└── <data files>
</code></pre></div></div>
<h2 id="import-the-data-as-it-is-provided">Import the data as it is provided</h2>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">arrow</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">fs</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">bench</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">open_dataset()</code> function in the <code class="language-plaintext highlighter-rouge">{arrow}</code> package can directly read
compressed CSV files<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> (with the extension <code class="language-plaintext highlighter-rouge">.csv.gz</code>) as they are
provided on the RStudio CRAN logs website.</p>
<p>As a first step, we can download the files from the site and organize
them using the Hive-style directory structure as shown above.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Check that the date is really a date,</span><span class="w">
</span><span class="c1">## and extract the year and month from it</span><span class="w">
</span><span class="n">parse_date</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stopifnot</span><span class="p">(</span><span class="w">
</span><span class="s2">"`date` must be a date"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inherits</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="s2">"Date"</span><span class="p">),</span><span class="w">
</span><span class="s2">"provide only one date"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">identical</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w">
</span><span class="s2">"date must be in the past"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">Sys.Date</span><span class="p">()</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">date_chr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w">
</span><span class="n">year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">year</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1900L</span><span class="p">,</span><span class="w">
</span><span class="n">month</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">mon</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1L</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Download the data set for a given date from the RStudio CRAN log website.</span><span class="w">
</span><span class="c1">## `date` is a single date for which we want the data</span><span class="w">
</span><span class="c1">## `path` is where we want the data to live</span><span class="w">
</span><span class="n">download_daily_package_logs_csv</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/cran-logs-csv"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1">## build the URL for the download</span><span class="w">
</span><span class="n">date</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">parse_date</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w">
</span><span class="n">url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="w">
</span><span class="s1">'https://cran-logs.rstudio.com/'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="s1">'/'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s1">'.csv.gz'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">## build the path for the destination of the download</span><span class="w">
</span><span class="n">file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="w">
</span><span class="n">path</span><span class="p">,</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"year="</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">year</span><span class="p">),</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"month="</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">month</span><span class="p">),</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s2">".csv.gz"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">## create the folder if it doesn't exist</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">dir.exists</span><span class="p">(</span><span class="n">dirname</span><span class="p">(</span><span class="n">file</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dir.create</span><span class="p">(</span><span class="n">dirname</span><span class="p">(</span><span class="n">file</span><span class="p">),</span><span class="w"> </span><span class="n">recursive</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## download the file</span><span class="w">
</span><span class="n">message</span><span class="p">(</span><span class="s2">"Downloading data for "</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s2">" ... "</span><span class="p">,</span><span class="w"> </span><span class="n">appendLF</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="w">
</span><span class="n">url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">url</span><span class="p">,</span><span class="w">
</span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">file</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"libcurl"</span><span class="p">,</span><span class="w">
</span><span class="n">quiet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
</span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wb"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">message</span><span class="p">(</span><span class="s2">"done."</span><span class="p">)</span><span class="w">
</span><span class="c1">## quick check to make sure that the file was created</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">file</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"Download failed for "</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="n">call.</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## return the path</span><span class="w">
</span><span class="n">file</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## build sequence of dates for which we want the data</span><span class="w">
</span><span class="n">dates_to_get</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="w">
</span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-06-01"</span><span class="p">),</span><span class="w">
</span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-08-15"</span><span class="p">),</span><span class="w">
</span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"day"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">## download the data</span><span class="w">
</span><span class="n">walk</span><span class="p">(</span><span class="n">dates_to_get</span><span class="p">,</span><span class="w"> </span><span class="n">download_daily_package_logs_csv</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Let’s check the content of the folder that holds the data we downloaded:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/datasets/cran-logs-csv/
└── year=2022
├── month=6
│ ├── 2022-06-01.csv.gz
│ ├── 2022-06-02.csv.gz
│ ├── 2022-06-03.csv.gz
│ ├── ...
│ └── 2022-06-30.csv.gz
├── month=7
│ ├── 2022-07-01.csv.gz
│ ├── 2022-07-02.csv.gz
│ ├── 2022-07-03.csv.gz
│ ├── ...
│ └── 2022-07-31.csv.gz
└── month=8
├── 2022-08-01.csv.gz
├── 2022-08-02.csv.gz
├── 2022-08-03.csv.gz
├── ...
└── 2022-08-15.csv.gz
</code></pre></div></div>
<p>We have one file for each day, placed in a folder corresponding to its
month. We can now read this data using <code class="language-plaintext highlighter-rouge">{arrow}</code>’s <code class="language-plaintext highlighter-rouge">open_dataset()</code>
function:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cran_logs_csv</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="w">
</span><span class="s2">"~/datasets/cran-logs-csv/"</span><span class="p">,</span><span class="w">
</span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">,</span><span class="w">
</span><span class="n">partitioning</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"month"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">cran_logs_csv</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FileSystemDataset with 76 csv files
date: date32[day]
time: time32[s]
size: int64
r_version: string
r_arch: string
r_os: string
package: string
version: string
country: string
ip_id: int64
year: int32
month: int32
</code></pre></div></div>
<p>The partitioning has been taken into consideration as the output shows
that the dataset contains the variables <code class="language-plaintext highlighter-rouge">year</code> and <code class="language-plaintext highlighter-rouge">month</code> which are not
part of the data we downloaded. They are coming from the way we
organized the downloaded files.</p>
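<p>The same behavior can be reproduced on a tiny scale. The sketch below
builds a Hive-style directory by hand with made-up CSV files (the file name
<code class="language-plaintext highlighter-rouge">data.csv</code> and its content are assumptions for illustration) and
relies on <code class="language-plaintext highlighter-rouge">open_dataset()</code> detecting the Hive-style directory names:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(arrow)

## Build year=2022/month=6 and year=2022/month=7 by hand,
## each holding one small CSV file without year or month columns
root <- file.path(tempdir(), "toy-hive-csv")
for (m in 6:7) {
  month_dir <- file.path(root, "year=2022", paste0("month=", m))
  dir.create(month_dir, recursive = TRUE, showWarnings = FALSE)
  write.csv(
    data.frame(package = c("arrow", "dplyr"), size = c(10L, 20L)),
    file.path(month_dir, "data.csv"),
    row.names = FALSE
  )
}

## The partition variables come from the directory names
toy_csv <- open_dataset(root, format = "csv")
toy_csv$schema$names
</code></pre></div></div>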
<h2 id="convert-to-arrow-and-parquet-files">Convert to Arrow and Parquet files</h2>
<p>Now that we have the compressed CSV files on disk, and that we opened
the dataset with <code class="language-plaintext highlighter-rouge">open_dataset()</code>, we can convert it to the other file
formats supported by Arrow using <code class="language-plaintext highlighter-rouge">{arrow}</code>’s <code class="language-plaintext highlighter-rouge">write_dataset()</code> function.
We are going to convert our collection of <code class="language-plaintext highlighter-rouge">.csv.gz</code> files into the Arrow
and Parquet formats.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Convert the dataset into the Arrow format</span><span class="w">
</span><span class="n">write_dataset</span><span class="p">(</span><span class="w">
</span><span class="n">cran_logs_csv</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/cran-logs-arrow"</span><span class="p">,</span><span class="w">
</span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"arrow"</span><span class="p">,</span><span class="w">
</span><span class="n">partitioning</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"month"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">## Convert the dataset into the Parquet format</span><span class="w">
</span><span class="n">write_dataset</span><span class="p">(</span><span class="w">
</span><span class="n">cran_logs_csv</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/cran-logs-parquet"</span><span class="p">,</span><span class="w">
</span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">,</span><span class="w">
</span><span class="n">partitioning</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"month"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Let’s inspect the content of the directories that contain these
datasets.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fs</span><span class="o">::</span><span class="n">dir_tree</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-arrow/"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/datasets/cran-logs-arrow/
└── year=2022
├── month=6
│ └── part-0.arrow
├── month=7
│ └── part-0.arrow
└── month=8
└── part-0.arrow
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fs</span><span class="o">::</span><span class="n">dir_tree</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet/"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/datasets/cran-logs-parquet/
└── year=2022
├── month=6
│ └── part-0.parquet
├── month=7
│ └── part-0.parquet
└── month=8
└── part-0.parquet
</code></pre></div></div>
<p>These two directories have the same layout organized by year and month
as with our CSV files given that we kept the same partitioning. The
files within the directories have an extension that matches their file
format. One difference is that there is a single file for each month. We
used the default values for <code class="language-plaintext highlighter-rouge">write_dataset()</code> and the number of rows for
each month is smaller than the threshold this function uses to split the
dataset into multiple files.</p>
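<p>If you want more than one file per partition, recent versions of
<code class="language-plaintext highlighter-rouge">{arrow}</code> let you cap the number of rows per file. The sketch below is
an illustration on made-up data and assumes your installed version of
<code class="language-plaintext highlighter-rouge">write_dataset()</code> exposes the <code class="language-plaintext highlighter-rouge">max_rows_per_file</code> and
<code class="language-plaintext highlighter-rouge">max_rows_per_group</code> arguments; the values used here are arbitrary.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(arrow)

## Made-up data: 100 rows in two groups of 50
toy <- data.frame(id = 1:100, group = rep(c("a", "b"), each = 50))

## Cap each file at 10 rows; the row group size is set to the same
## value so that it never exceeds the per-file limit
out_dir <- file.path(tempdir(), "toy-split")
write_dataset(
  toy,
  out_dir,
  format = "parquet",
  partitioning = "group",
  max_rows_per_file = 10,
  max_rows_per_group = 10
)

## Each partition should now contain several part-*.parquet files
split_files <- list.files(out_dir, recursive = TRUE)
split_files
</code></pre></div></div>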
<h2 id="comparison-of-the-different-formats">Comparison of the different formats</h2>
<p>Let’s compare how much space these different file formats take on disk:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_size</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">fs</span><span class="o">::</span><span class="n">dir_info</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w"> </span><span class="n">recurse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"file"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pull</span><span class="p">(</span><span class="n">size</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="nf">sum</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">tribble</span><span class="p">(</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">Format</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w">
</span><span class="s2">"Compressed CSV"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-csv/"</span><span class="p">),</span><span class="w">
</span><span class="s2">"Arrow"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-arrow/"</span><span class="p">),</span><span class="w">
</span><span class="s2">"Parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet/"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 3 × 2
Format size
<chr> <fs::bytes>
1 Compressed CSV 5.01G
2 Arrow 29.67G
3 Parquet 5.06G
</code></pre></div></div>
<p>The Arrow format takes the most space, at almost 30GB, while both the
compressed CSV and the Parquet files use about 5GB of disk space.</p>
<p>We are now set up to compare the performance of running computations on
these different dataset formats.</p>
<p>Let’s open these datasets with the different formats:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cran_logs_csv</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-csv/"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">)</span><span class="w">
</span><span class="n">cran_logs_arrow</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-arrow/"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"arrow"</span><span class="p">)</span><span class="w">
</span><span class="n">cran_logs_parquet</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet/"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>We will compare how long it takes for Arrow to compute the 10 most
downloaded packages in the time period our dataset covers using each
file format.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_10_packages</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">package</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">n_million_downloads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">1e6</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">collect</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
</span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_csv</span><span class="p">),</span><span class="w">
</span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_arrow</span><span class="p">),</span><span class="w">
</span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 3 × 6
expression min median itr/se…¹ mem_al…² gc/se…³
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:by> <dbl>
1 top_10_packages(cran_logs_csv) 29.57s 29.57s 0.0338 8.19MB 0
2 top_10_packages(cran_logs_arrow) 2.1s 2.1s 0.475 165.39KB 0.475
3 top_10_packages(cran_logs_parquet) 3.32s 3.32s 0.301 137.11KB 0
# … with abbreviated variable names ¹`itr/sec`, ²mem_alloc, ³`gc/sec`
</code></pre></div></div>
<p>While it takes about 2 to 3 seconds to perform this task on the Arrow or
Parquet files, it takes almost 30 seconds to do it on the CSV files.</p>
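<p>To put these timings in perspective, the median values from the benchmark table above can be turned into relative speed-ups. This is only a small sketch: the numbers are copied by hand from the output, each expression ran a single iteration, and the variable names are mine.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## median timings (in seconds) copied from the bench::mark() output above
median_seconds <- c(csv = 29.57, arrow = 2.10, parquet = 3.32)
## how many times faster each format is compared to the compressed CSV files
speedup_vs_csv <- round(median_seconds[["csv"]] / median_seconds, 1)
speedup_vs_csv
</code></pre></div></div>
<p>With these numbers, the query ran roughly 14 times faster on the Arrow files and about 9 times faster on the Parquet files than on the compressed CSV files.</p>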
<h2 id="conclusion">Conclusion</h2>
<p>Having Arrow point directly to the folder of compressed CSV files might
be the most convenient, but it comes with a high performance cost. Arrow
and Parquet have similar performance but the Parquet files take less
space on disk and would be more suitable for long-term storage. This is
why large datasets like the NYC taxi data are distributed as a series of
Parquet files.</p>
<p>In the future, I might explore how using different variables for
partitioning or how the number of files in the partitions affects the
performance of the queries (EDIT: this <a href="/2022/09/arrow-dataset-part-2/">post is now available</a>). If you have other ideas
for topics that you would like me to explore, do not hesitate to leave a
comment below.</p>
<h2 id="going-further">Going further</h2>
<p>If you would like to learn more about the different formats, check out
the <a href="https://arrow-user2022.netlify.app/">Arrow workshop</a> (especially
<a href="https://arrow-user2022.netlify.app/data-storage.html">Part 3: Data
Storage</a>) that
Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt taught at
useR!2022.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>Thank you to <a href="https://twitter.com/kae_suarez/">Kae Suarez</a> and
<a href="https://djnavarro.net">Danielle Navarro</a> for reviewing this post.</p>
<h2 id="post-scriptum">Post Scriptum</h2>
<p>I wrote a <a href="/2022/09/arrow-dataset-part-2/">follow-up post</a> that explores the impact of partitioning the dataset on
performance.</p>
<details>
<summary>
<p>Expand for Session Info</p>
</summary>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sessioninfo</span><span class="o">::</span><span class="n">session_info</span><span class="p">()</span><span class="w">
</span></code></pre></div> </div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os Ubuntu 22.04.1 LTS
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Paris
date 2022-08-19
pandoc NA (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
arrow * 9.0.0 2022-08-10 [1] CRAN (R 4.2.1)
assertthat 0.2.1 2019-03-21 [1] RSPM
backports 1.4.1 2021-12-13 [1] RSPM
bench * 1.1.2 2021-11-30 [1] RSPM
bit 4.0.4 2020-08-04 [1] RSPM
bit64 4.0.5 2020-08-30 [1] RSPM
broom 1.0.0 2022-07-01 [1] RSPM
cellranger 1.1.0 2016-07-27 [1] RSPM
cli 3.3.0 2022-04-25 [1] RSPM (R 4.2.0)
colorspace 2.0-3 2022-02-21 [1] RSPM
crayon 1.5.1 2022-03-26 [1] RSPM
DBI 1.1.3 2022-06-18 [1] RSPM
dbplyr 2.2.1 2022-06-27 [1] RSPM
digest 0.6.29 2021-12-01 [1] RSPM
dplyr * 1.0.9 2022-04-28 [1] RSPM
ellipsis 0.3.2 2021-04-29 [1] RSPM
evaluate 0.15 2022-02-18 [1] RSPM
fansi 1.0.3 2022-03-24 [1] RSPM
fastmap 1.1.0 2021-01-25 [1] RSPM
forcats * 0.5.1 2021-01-27 [1] RSPM
fs * 1.5.2 2021-12-08 [1] RSPM
gargle 1.2.0 2021-07-02 [1] RSPM
generics 0.1.3 2022-07-05 [1] RSPM
ggplot2 * 3.3.6 2022-05-03 [1] RSPM
glue 1.6.2 2022-02-24 [1] RSPM (R 4.2.0)
googledrive 2.0.0 2021-07-08 [1] RSPM
googlesheets4 1.0.0 2021-07-21 [1] RSPM
gtable 0.3.0 2019-03-25 [1] RSPM
haven 2.5.0 2022-04-15 [1] RSPM
hms 1.1.1 2021-09-26 [1] RSPM
htmltools 0.5.3 2022-07-18 [1] RSPM
httr 1.4.3 2022-05-04 [1] RSPM
jsonlite 1.8.0 2022-02-22 [1] RSPM
knitr 1.39 2022-04-26 [1] RSPM
lifecycle 1.0.1 2021-09-24 [1] RSPM
lubridate 1.8.0 2021-10-07 [1] RSPM
magrittr 2.0.3 2022-03-30 [1] RSPM
modelr 0.1.8 2020-05-19 [1] RSPM
munsell 0.5.0 2018-06-12 [1] RSPM
pillar 1.8.0 2022-07-18 [1] RSPM
pkgconfig 2.0.3 2019-09-22 [1] RSPM
purrr * 0.3.4 2020-04-17 [1] RSPM
R6 2.5.1 2021-08-19 [1] RSPM
readr * 2.1.2 2022-01-30 [1] RSPM
readxl 1.4.0 2022-03-28 [1] RSPM
reprex 2.0.1 2021-08-05 [1] RSPM
rlang 1.0.4 2022-07-12 [1] RSPM (R 4.2.0)
rmarkdown 2.14 2022-04-25 [1] RSPM
rvest 1.0.2 2021-10-16 [1] RSPM
scales 1.2.0 2022-04-13 [1] RSPM
sessioninfo 1.2.2 2021-12-06 [1] RSPM
stringi 1.7.8 2022-07-11 [1] RSPM
stringr * 1.4.0 2019-02-10 [1] RSPM
tibble * 3.1.8 2022-07-22 [1] RSPM
tidyr * 1.2.0 2022-02-01 [1] RSPM
tidyselect 1.1.2 2022-02-21 [1] RSPM
tidyverse * 1.3.2 2022-07-18 [1] RSPM
tzdb 0.3.0 2022-03-28 [1] RSPM
utf8 1.2.2 2021-07-24 [1] RSPM
vctrs 0.4.1 2022-04-13 [1] RSPM
withr 2.5.0 2022-03-03 [1] RSPM
xfun 0.31 2022-05-10 [1] RSPM
xml2 1.3.3 2021-11-30 [1] RSPM
yaml 2.3.5 2022-02-21 [1] RSPM
[1] /home/francois/.R-library
[2] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────
</code></pre></div> </div>
</details>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Feather was the first iteration of the file format (v1); the Arrow
Interprocess Communication (IPC) file format is the newer version
(v2) and has many new features. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Comprehensive R Archive Network, the repository for the R packages <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>since Arrow 9.0.0 <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>François Michonneaufrancois.michonneau@gmail.comAn exploration of the file formats that Arrow can read and write.`foghorn` 1.3.1 released2020-09-08T00:00:00+00:002020-09-08T00:00:00+00:00https://francoismichonneau.net/2020/09/foghorn-1.3.1<p>A new version of <a href="https://cran.r-project.org/package=foghorn"><code class="language-plaintext highlighter-rouge">foghorn</code></a>
(version 1.3.1) was just accepted on CRAN.</p>
<p><code class="language-plaintext highlighter-rouge">foghorn</code> is an R package that allows you to:</p>
<ul>
<li>browse the results of the CRAN checks on your package (with <a href="https://fmichonneau.github.io/foghorn/reference/cran_results.html"><code class="language-plaintext highlighter-rouge">cran_results()</code></a>
and <a href="https://fmichonneau.github.io/foghorn/reference/cran_details.html"><code class="language-plaintext highlighter-rouge">cran_details()</code></a>);</li>
<li>check where your package stands when submitted to CRAN (with
<a href="https://fmichonneau.github.io/foghorn/reference/cran_incoming.html"><code class="language-plaintext highlighter-rouge">cran_incoming()</code></a>);</li>
<li>and starting with version 1.3.1, check whether your package is in the
Win-builder queue (with <a href="https://fmichonneau.github.io/foghorn/reference/winbuilder_queue.html"><code class="language-plaintext highlighter-rouge">winbuilder_queue()</code></a>).</li>
</ul>
<p>The idea of inspecting the Win-builder queue <a href="https://github.com/fmichonneau/foghorn/issues/40">was proposed</a> by
Kirill Müller.</p>
<p>If you would like to start using <code class="language-plaintext highlighter-rouge">foghorn</code>, check out the
<a href="https://fmichonneau.github.io/foghorn/articles/foghorn.html">vignette</a> that
comes with the package.</p>
<p><a href="https://github.com/fmichonneau/foghorn/issues/new">Feedback and suggestions</a> for <code class="language-plaintext highlighter-rouge">foghorn</code> are welcome!</p>François Michonneaufrancois.michonneau@gmail.comNew version of foghorn provides access to Win-builder queueMigrate from Gmail to HelpScout with R2020-04-17T00:00:00+00:002020-04-17T00:00:00+00:00https://francoismichonneau.net/2020/04/gmail-helpscout-migration<h2 id="preamble">Preamble</h2>
<ul>
<li>This is a long and somewhat dense post. Even if you do not have to migrate
emails from Gmail to HelpScout, I hope this post will be useful to you, as the
general approach could apply to other problems that involve working
with APIs.</li>
<li>The full code I actually used for the email migration is available at:
<a href="https://github.com/carpentries/emailmigration">https://github.com/carpentries/emailmigration</a> and I include links pointing
to functions in the GitHub repo<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> throughout the post below to illustrate
my points.</li>
</ul>
<h2 id="the-problem-and-its-solution">The problem and its solution</h2>
<p>At <a href="https://carpentries.org">The Carpentries</a>, <a href="https://carpentries.org/regionalcoordinators/">Regional
Coordinators</a> help us organize
workshops across the globe. In the past, each Regional Coordinator was
set up with a Gmail account (through The Carpentries’s GSuite plan). However, as
the number of Regional Coordinators grew, and as some geographic areas have more
than one Regional Coordinator, the Gmail account model was starting to cause
some issues.</p>
<p>The Carpentries Core Team has been using HelpScout for a while, and it is a much more suitable tool for managing emails and inboxes as a team.</p>
<p>The main challenge with transitioning the Regional Coordinators to using HelpScout was to import the old messages from Gmail to HelpScout. To tackle this problem, I used R and this blog post describes the approach I took.</p>
<h2 id="technical-overview">Technical overview</h2>
<p>Before doing anything else, we used the GSuite data migration tool to transfer
all emails for each Regional Coordinator account into a single account. Having
all the emails to import in the same place makes things easier.</p>
<p>This post goes through the steps I took to perform this migration:</p>
<ol>
<li>Figure out authentication with the Gmail API, and with the HelpScout API</li>
<li>Get familiar with the HelpScout API and write R functions to perform the
tasks needed</li>
<li>Convert Gmail threads into HelpScout conversations</li>
<li>Test migration on 100 Gmail threads</li>
<li>Perform the full migration</li>
</ol>
<p>Choice of packages and approach:</p>
<ul>
<li>Working with the Gmail API is made much easier with the wonderful
<a href="https://gmailr.r-lib.org/"><code class="language-plaintext highlighter-rouge">gmailr</code></a> package.</li>
<li>I didn’t find an already made package to work with the HelpScout web API so I
wrote a few functions to interact with the endpoints I needed using the
<a href="https://httr.r-lib.org/"><code class="language-plaintext highlighter-rouge">httr</code></a> package.</li>
<li>The mechanics of converting the data coming from the Gmail web API into the
format needed by the HelpScout API to import the conversation was done using
the <a href="https://r6.r-lib.org/"><code class="language-plaintext highlighter-rouge">R6</code></a> package. The R6 classes and methods made it
easier to separate storing each element needed by the HelpScout API as private
elements and the actual formatting that was handled with methods.</li>
<li>When working with web APIs a lot can go wrong: there is a weird data
format that your code didn’t know how to handle, your internet connection goes
down, you reach the rate limit, etc. Therefore, I used the
<a href="https://richfitz.github.io/storr/"><code class="language-plaintext highlighter-rouge">storr</code></a> package to cache (1) the R6 objects that
act as the bridge between the 2 APIs; (2) the responses from the HelpScout API
to make sure all the threads were converted correctly.</li>
<li>I organized all the code as a bare-bones package. It makes code management
easier and is a good habit to get into. Here it was a one-off task, but if it
were something that I’d use regularly, I could develop tests, write
documentation, and enable continuous testing. I could then write and update my
code, and rely on <code class="language-plaintext highlighter-rouge">devtools::load_all()</code>.</li>
</ul>
<h2 id="1-authentication">1. Authentication</h2>
<h3 id="11-gmail-api">1.1. Gmail API</h3>
<p>The instructions in the <code class="language-plaintext highlighter-rouge">gmailr</code> package’s
<a href="https://gmailr.r-lib.org/#setup">README</a> are clear. You can use the <code class="language-plaintext highlighter-rouge">gm_threads()</code> function, for instance, to check that the authentication is working as expected.</p>
<h3 id="12-the-helpscout-api">1.2. The HelpScout API</h3>
<p>The HelpScout API uses the OAuth 2.0 protocol. The <code class="language-plaintext highlighter-rouge">httr</code> package handles this well.</p>
<p>Create a new app within HelpScout, and use <code class="language-plaintext highlighter-rouge">https://localhost:1410/</code> for the redirect URL. Take note of the key and secret. Use this information to create a new app object in R with <code class="language-plaintext highlighter-rouge">httr</code>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hs_app</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">httr</span><span class="o">::</span><span class="n">oauth_app</span><span class="p">(</span><span class="w">
</span><span class="s2">"helpscout"</span><span class="p">,</span><span class="w">
</span><span class="n">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"<your app key here>"</span><span class="p">,</span><span class="w">
</span><span class="n">secret</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"<your app secret here>"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>and then use this object to do the authentication online:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hs_token</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">httr</span><span class="o">::</span><span class="n">oauth2.0_token</span><span class="p">(</span><span class="w">
</span><span class="n">httr</span><span class="o">::</span><span class="n">oauth_endpoint</span><span class="p">(</span><span class="w">
</span><span class="n">authorize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"https://secure.helpscout.net/authentication/authorizeClientApplication"</span><span class="p">,</span><span class="w">
</span><span class="n">access</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"https://api.helpscout.net/v2/oauth2/token"</span><span class="p">),</span><span class="w">
</span><span class="n">app</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hs_app</span><span class="p">)</span><span class="w">
</span><span class="n">htoken</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">httr</span><span class="o">::</span><span class="n">config</span><span class="p">(</span><span class="n">token</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hs_token</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>We can then use the <code class="language-plaintext highlighter-rouge">htoken</code> object across all our calls to the HelpScout web API.</p>
<h2 id="2-getting-started-with-the-helpscout-web-api">2. Getting started with the HelpScout web API</h2>
<p>When working with a new web API, first read the documentation to understand how things are set up. From this initial reading, it became clear that Gmail and HelpScout use different words for related concepts.</p>
<table>
<thead>
<tr>
<th>HelpScout</th>
<th>Gmail</th>
</tr>
</thead>
<tbody>
<tr>
<td>thread</td>
<td>message</td>
</tr>
<tr>
<td>conversation</td>
<td>thread</td>
</tr>
</tbody>
</table>
<p>Keeping this straight in my mind took some time… and because I’m more used to the terms used by Gmail, I used this vocabulary in my function names (for the most part).</p>
<p>Another thing that I needed was HelpScout’s internal identifier for the mailbox into which the emails were being imported. So the first function I wrote against HelpScout’s API was <code class="language-plaintext highlighter-rouge">hs_mailbox_id()</code> which returned the internal identifier for the mailbox that was of interest to me.</p>
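<p>As an illustration, a function like <code class="language-plaintext highlighter-rouge">hs_mailbox_id()</code> could be sketched as follows. This is not the code from the repository: the <code class="language-plaintext highlighter-rouge">/v2/mailboxes</code> endpoint and the <code class="language-plaintext highlighter-rouge">_embedded$mailboxes</code> structure of the response reflect my reading of the HelpScout API documentation and should be treated as assumptions, and <code class="language-plaintext highlighter-rouge">find_mailbox_id()</code> is a hypothetical helper introduced to keep the parsing separate from the request.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## find the id of a mailbox by name in a parsed API response
## (assumption: mailboxes are listed under `_embedded$mailboxes`)
find_mailbox_id <- function(parsed, mailbox_name) {
  mailboxes <- parsed[["_embedded"]][["mailboxes"]]
  for (mailbox in mailboxes) {
    if (identical(mailbox$name, mailbox_name)) {
      return(mailbox$id)
    }
  }
  stop("No mailbox named '", mailbox_name, "' found.", call. = FALSE)
}

## query the API and return the identifier of the mailbox of interest
hs_mailbox_id <- function(mailbox_name, htoken) {
  res <- httr::GET("https://api.helpscout.net", path = "/v2/mailboxes", htoken)
  httr::stop_for_status(res)
  ## parse the JSON text explicitly instead of relying on automatic parsing
  parsed <- jsonlite::fromJSON(
    httr::content(res, as = "text", encoding = "UTF-8"),
    simplifyVector = FALSE
  )
  find_mailbox_id(parsed, mailbox_name)
}
</code></pre></div></div>
<p>Keeping the parsing in its own function makes it possible to test it on a hand-written list without making any request.</p>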
<p>The second thing I needed to do was to make sure I understood how to use the API to import an actual conversation. I started with fake data I could control to ensure that I had something simple that I knew worked and I could compare against when things didn’t work with real data. Even if the documentation of an API is good, there are, more often than not, small details that are not described that you need to figure out. Having this data as a starting point is useful for these tests.</p>
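<p>For illustration, fake data for such a test could look like the list below. All the values are made up, and the field names (<code class="language-plaintext highlighter-rouge">mailboxId</code>, <code class="language-plaintext highlighter-rouge">imported</code>, <code class="language-plaintext highlighter-rouge">threads</code>, etc.) are my understanding of what the HelpScout API expects when creating a conversation, so check them against the API documentation before relying on them.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## a minimal fake conversation (all values are made up for testing)
fake_conversation <- list(
  subject = "Test import",
  type = "email",
  status = "closed",
  imported = TRUE,
  mailboxId = 12345,
  createdAt = "2020-04-01T12:00:00Z",
  customer = list(email = "customer@example.org"),
  ## a conversation holds one or more threads (the Gmail messages)
  threads = list(
    list(
      type = "customer",
      customer = list(email = "customer@example.org"),
      text = "Hello, this is a test message.",
      createdAt = "2020-04-01T12:00:00Z"
    )
  )
)
## convert to JSON, unboxing length-one vectors into scalars
fake_json <- jsonlite::toJSON(fake_conversation, auto_unbox = TRUE)
fake_json
</code></pre></div></div>
<p>Inspecting the JSON before sending it makes it easier to tell apart problems with the data from problems with the request itself.</p>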
<p>The actual code to create a new <del>thread</del> conversation in HelpScout ended up being:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hs_create_thread</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">thread</span><span class="p">,</span><span class="w"> </span><span class="n">hstoken</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">body</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">jsonlite</span><span class="o">::</span><span class="n">toJSON</span><span class="p">(</span><span class="n">thread</span><span class="p">,</span><span class="w"> </span><span class="n">auto_unbox</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">httr</span><span class="o">::</span><span class="n">POST</span><span class="p">(</span><span class="w">
</span><span class="s2">"https://api.helpscout.net"</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"/v2/conversations"</span><span class="p">,</span><span class="w">
</span><span class="n">body</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">body</span><span class="p">,</span><span class="w">
</span><span class="n">hstoken</span><span class="p">,</span><span class="w">
</span><span class="n">httr</span><span class="o">::</span><span class="n">content_type</span><span class="p">(</span><span class="s2">"application/json; charset=UTF-8"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This is not the code I would have written if it was part of a package intended for others to use. For instance, I would have wanted to check the response of the API after each request. But for my particular use case, it made it easier to return this response and inspect manually after the fact once I had confirmed that this code was working for most requests.</p>
<h2 id="2-extracting-the-content-of-the-emails-from-gmail">2. Extracting the content of the emails from Gmail</h2>
<p>This was the most time-consuming part as there were lots of unexpected details that came up to get a smooth conversion between the two APIs.</p>
<h3 id="21-things-that-were-easy">2.1. Things that were easy</h3>
<ul>
<li>The <code class="language-plaintext highlighter-rouge">gmailr::gm_subject()</code> function worked every time to get the subject of the
threads for all the messages.</li>
</ul>
<h3 id="22-things-that-were-almost-easy">2.2. Things that were almost easy</h3>
<ul>
<li><a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L128-L150">Extracting the people involved in the conversation</a>. The <code class="language-plaintext highlighter-rouge">gmailr::gm_to()</code> and
<code class="language-plaintext highlighter-rouge">gmailr::gm_from()</code> worked well to extract the email addresses. The small
catch was that some email addresses were formatted as <code class="language-plaintext highlighter-rouge">FirstName LastName
<email@address.rr></code>, others had only <code class="language-plaintext highlighter-rouge">email@address.rr</code>, and when multiple
people were involved a comma separated them. However, in some cases, people
have a comma in their names.</li>
<li><a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L121-L126">Extracting the date</a>. The <code class="language-plaintext highlighter-rouge">gmailr::date()</code> function returns the date from the email in
<a href="https://en.wikipedia.org/wiki/Unix_time">Unix time</a>. The <code class="language-plaintext highlighter-rouge">anytime</code>
<a href="https://cran.r-project.org/web/packages/anytime/index.html">package</a> is
useful for converting Unix time into other formats, including the <code class="language-plaintext highlighter-rouge">iso8601</code>
format that was expected by the HelpScout API. I still had to manually add a final
<code class="language-plaintext highlighter-rouge">Z</code> to the character string.</li>
</ul>
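<p>The two catches above can be sketched in base R. This is not the code from the repository: it matches the email addresses directly instead of splitting on commas, and it uses <code class="language-plaintext highlighter-rouge">format()</code> rather than the <code class="language-plaintext highlighter-rouge">anytime</code> package for the date. The function names and the deliberately simple address pattern are mine.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## extract every email address from a To/From header, with or without a
## display name; matching the addresses directly avoids splitting on
## commas, which can also appear inside names
extract_emails <- function(header) {
  pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  regmatches(header, gregexpr(pattern, header))[[1]]
}

## convert Unix time (seconds since 1970-01-01 UTC) into an ISO 8601
## string that ends with the final Z
unix_to_iso8601 <- function(seconds) {
  format(
    as.POSIXct(seconds, origin = "1970-01-01", tz = "UTC"),
    "%Y-%m-%dT%H:%M:%SZ"
  )
}
</code></pre></div></div>
<p>The pattern is intentionally permissive and would not validate every legal address, but it is enough to pull addresses out of headers that mix both formats.</p>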
<h3 id="23-things-that-were-not-so-easy">2.3. Things that were not so easy</h3>
<ul>
<li><a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L14-L24">Extracting the email attachments</a>. The attachments themselves are not returned
by the API. Instead, the API returns a URL that points to the address where
the attachments can be retrieved. HelpScout’s API accepts the attachments
as <a href="https://en.wikipedia.org/wiki/Base64">base64-encoded</a> strings. The
<code class="language-plaintext highlighter-rouge">gmailr</code> package helped to retrieve this data, but the data returned by the
Gmail API is base64url encoded. Thankfully, converting to regular
base64 is a short regular expression substitution away once you know the
difference between the two.</li>
<li>The thing that was the most puzzling was parsing the actual body of the
emails. The <code class="language-plaintext highlighter-rouge">gmailr::gm_body()</code> worked for only a small fraction of the emails
I had to deal with. After many trials and errors, <a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L48">I wrote a function</a> to
reliably retrieve the content of the emails<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. There were many situations to deal with as the messages can be:
<ul>
<li>“multipart”: the body of the email is provided in both plain text and
HTML formats, which allows email clients that don’t support
HTML formatting to display the plain text version of the message;</li>
<li>only plain text or only HTML;</li>
<li>provided as attachments (what some email clients do when you forward a
message).</li>
</ul>
<p>Depending on the situation, the location of the body of the email within the
deeply nested list that was returned by the Gmail API could vary. I ended up
writing a recursive algorithm that traversed the list to find and retrieve the
relevant content of the emails.</p>
<p>The last catch was that plain text messages that included a URL were
interpreted by the HelpScout API as being HTML-formatted. It meant that the
whitespace indicating the line breaks was ignored, making the body of the
messages large blocks of text that were very hard to read and follow. I
relied on <code class="language-plaintext highlighter-rouge">commonmark::markdown_html()</code> to <a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L1-L6">convert these plain text
messages</a> into HTML that then looked good once they were uploaded onto
HelpScout using the API.</p>
</li>
</ul>
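<p>To make these two ideas more concrete, here is a simplified sketch (not the code from the repository). It assumes that each part of a message is a list with a <code class="language-plaintext highlighter-rouge">mimeType</code>, an optional <code class="language-plaintext highlighter-rouge">body$data</code> element, and optional nested <code class="language-plaintext highlighter-rouge">parts</code>, which is how I understand the payload returned by the Gmail API. The function names are mine.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## recursively walk the nested parts of a message and collect the data of
## every part that has the requested MIME type
find_parts <- function(part, mime_type) {
  found <- list()
  if (identical(part$mimeType, mime_type) && !is.null(part$body$data)) {
    found <- list(part$body$data)
  }
  ## `part$parts` is NULL for leaf parts, so the loop is skipped for them
  for (child in part$parts) {
    found <- c(found, find_parts(child, mime_type))
  }
  found
}

## base64url uses "-" and "_" where regular base64 uses "+" and "/",
## and it may omit the trailing "=" padding
base64url_to_base64 <- function(x) {
  x <- gsub("-", "+", x, fixed = TRUE)
  x <- gsub("_", "/", x, fixed = TRUE)
  n_pad <- (4 - nchar(x) %% 4) %% 4
  paste0(x, strrep("=", n_pad))
}
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">find_parts()</code> with <code class="language-plaintext highlighter-rouge">"text/html"</code> first and falling back on <code class="language-plaintext highlighter-rouge">"text/plain"</code> is one way to handle multipart messages and single-format messages with the same code.</p>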
<h2 id="3-conversion-between-gmail-and-helpscout">3. Conversion between Gmail and HelpScout</h2>
<p>Now that I had access to all the relevant information from the emails, I needed to format it so it could be imported by the HelpScout API. For this, I used the R6 object-oriented programming system.</p>
<p>Each element coming from the Gmail API was individually stored as a private field, and an accessor method (<code class="language-plaintext highlighter-rouge">$get()</code>) created the list in the format needed to be ingested by HelpScout’s API.</p>
<p>I used 3 classes for this:</p>
<ul>
<li><a href="https://github.com/carpentries/emailmigration/blob/master/R/HelpScout-classes.R#L85">one for the HelpScout conversations</a> (the Gmail threads)</li>
<li><a href="https://github.com/carpentries/emailmigration/blob/master/R/HelpScout-classes.R#L30">one for the HelpScout threads</a> (the Gmail messages)</li>
<li><a href="https://github.com/carpentries/emailmigration/blob/master/R/HelpScout-classes.R#L6">one for the attachments</a></li>
</ul>
<p>This modularity helped debugging and limited the complexity of each class.</p>
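<p>A stripped-down sketch of this pattern is shown below. The class name <code class="language-plaintext highlighter-rouge">HSThread</code>, its fields, and the shape of the list returned by <code class="language-plaintext highlighter-rouge">$get()</code> are hypothetical simplifications. The classes linked above store more elements and do more formatting.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## hypothetical class representing one HelpScout thread (one Gmail message)
HSThread <- R6::R6Class(
  "HSThread",
  public = list(
    ## store each element coming from Gmail in a private field
    initialize = function(from, text, created_at) {
      private$from <- from
      private$text <- text
      private$created_at <- created_at
    },
    ## accessor: assemble the private fields into the list sent to the API
    get = function() {
      list(
        type = "customer",
        customer = list(email = private$from),
        text = private$text,
        createdAt = private$created_at
      )
    }
  ),
  private = list(from = NULL, text = NULL, created_at = NULL)
)

thread <- HSThread$new("customer@example.org", "Hello!", "2020-04-01T12:00:00Z")
str(thread$get())
</code></pre></div></div>
<p>Because the fields are private, the only way to get the data out is through <code class="language-plaintext highlighter-rouge">$get()</code>, which keeps all the formatting logic in a single place.</p>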
<p>Because all the emails are going to be in the same inbox in HelpScout, I wanted an easy way to tag the conversations based on the team of Regional Coordinators that were involved. The R6 system was useful for this because once the email information was stored within the object, I could use a private method called by the accessor to extract all the people involved, and add tags in HelpScout to help Regional Coordinators find past conversations that are relevant to them.</p>
<p>It was one of the first times I used R6<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> for a real task and I could see its potential. If the code written here were for public consumption, R6 would have provided a good framework for adding more tests on the data structure of the individual elements that were coming from the Gmail API to ensure that the output from the accessor method was always formatted correctly before trying to convert it into the format required by HelpScout’s API.</p>
<h2 id="4-caching">4. Caching</h2>
<p>My previous experience working with web APIs has taught me that things can go wrong, and it is always a good idea to keep track (on disk and not only in memory) of the requests that have been tried and the ones that have not, and the requests that succeeded and the ones that failed. Especially when your scripts make thousands of API calls, you don’t want to have to run everything again once your script fails because your internet connection goes down for a short while, or the data is not formatted properly because you are dealing with an edge case.</p>
<p>For this, I used the <a href="https://richfitz.github.io/storr/"><code class="language-plaintext highlighter-rouge">storr</code> package</a> and its ability to rely on hooks to retrieve external data. <code class="language-plaintext highlighter-rouge">storr</code> is a key-value store. It is not that different from using variable names to store objects in memory as you normally do in your R terminal:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## setting a variable</span><span class="w">
</span><span class="n">cat_name</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Felix"</span><span class="w">
</span><span class="c1">## getting the content of the variable</span><span class="w">
</span><span class="n">cat_name</span><span class="w">
</span></code></pre></div></div>
<p>When using a <code class="language-plaintext highlighter-rouge">storr</code> store:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## defining the storr</span><span class="w">
</span><span class="n">st</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">storr</span><span class="o">::</span><span class="n">storr_rds</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cache"</span><span class="p">)</span><span class="w">
</span><span class="c1">## setting a variable</span><span class="w">
</span><span class="n">st</span><span class="o">$</span><span class="n">set</span><span class="p">(</span><span class="s2">"cat_name"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Felix"</span><span class="p">)</span><span class="w">
</span><span class="c1">## getting the content of the variable</span><span class="w">
</span><span class="n">st</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"cat_name"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The difference is that <code class="language-plaintext highlighter-rouge">storr</code> provides different backends for storing your object and, if like in this example, you use <code class="language-plaintext highlighter-rouge">storr_rds</code>, your objects are stored as <code class="language-plaintext highlighter-rouge">rds</code> files on your disk and are available beyond your current R session. How does that help with the problem here?</p>
<p>A great feature of <code class="language-plaintext highlighter-rouge">storr</code> is that you can set up your store to call a function to create the object instead of providing it directly with <code class="language-plaintext highlighter-rouge">$set()</code>.</p>
<p>In other words, the first time you ask for a key, the store runs your function to create the value, saves it under that key, and returns it. Later requests for the same key read the saved value:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## the hook function</span><span class="w">
</span><span class="n">fetch_hook_random_cat_name</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">sample</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Felix"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Garfield"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Tigger"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mowgli"</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## defining the storr</span><span class="w">
</span><span class="n">st</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">storr</span><span class="o">::</span><span class="n">storr_external</span><span class="p">(</span><span class="w">
</span><span class="n">storr</span><span class="o">::</span><span class="n">driver_rds</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cache"</span><span class="p">),</span><span class="w">
</span><span class="n">fetch_hook_random_cat_name</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">## the first time you call a key, it will run the hook function</span><span class="w">
</span><span class="n">st</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"cat_name"</span><span class="p">)</span><span class="w">
</span><span class="c1">## subsequently, it will return the value stored in the store</span><span class="w">
</span><span class="n">st</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"cat_name"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The hook function always takes the two arguments <code class="language-plaintext highlighter-rouge">key</code> and <code class="language-plaintext highlighter-rouge">namespace</code> but they don’t need to be used in the body of the function as in the example above.</p>
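<p>By contrast, here is a minimal sketch of a hook that does use its <code class="language-plaintext highlighter-rouge">key</code> argument. The hook name <code class="language-plaintext highlighter-rouge">fetch_hook_shout</code> and the counter <code class="language-plaintext highlighter-rouge">n_calls</code> are made up for this illustration, and I use the in-memory <code class="language-plaintext highlighter-rouge">driver_environment()</code> instead of <code class="language-plaintext highlighter-rouge">driver_rds()</code> so that nothing is written to disk:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(storr)

## counter used only to show how many times the hook runs
n_calls <- 0

## hypothetical hook: the value is derived from the key itself
fetch_hook_shout <- function(key, namespace) {
  n_calls <<- n_calls + 1
  toupper(key)
}

## in-memory store, so the example leaves no files behind
st <- storr_external(driver_environment(), fetch_hook_shout)

## first call: the key is not in the store, so the hook runs
st$get("felix")
## second call: the value comes from the store, the hook is not run again
st$get("felix")

## number of times the hook was actually called
n_calls
</code></pre></div></div>
<p>The counter stays at 1 after the second call, which is what makes this pattern useful when the hook is slow or calls a rate-limited API.</p>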
<p>We can extend this approach to store the output of time-consuming computations or the results of API calls<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. For instance, here, I created <a href="https://github.com/carpentries/emailmigration/blob/master/R/caching.R#L7">a store</a> to keep the output of the function <code class="language-plaintext highlighter-rouge">convert_gmail_thread()</code>, and used <code class="language-plaintext highlighter-rouge">get_gmail_thread()</code> as a wrapper to access the store.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fetch_hook_gmail_threads</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">convert_gmail_thread</span><span class="p">(</span><span class="n">key</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">store_gmail_threads</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cache/threads"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">storr</span><span class="o">::</span><span class="n">storr_external</span><span class="p">(</span><span class="w">
</span><span class="n">storr</span><span class="o">::</span><span class="n">driver_rds</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w">
</span><span class="n">fetch_hook_gmail_threads</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">get_gmail_thread</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">thread_id</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">store_gmail_threads</span><span class="p">()</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="n">thread_id</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>When <code class="language-plaintext highlighter-rouge">get_gmail_thread()</code> is called with a <code class="language-plaintext highlighter-rouge">thread_id</code> that has not yet been retrieved from the Gmail API, the function <code class="language-plaintext highlighter-rouge">convert_gmail_thread()</code> is called: it gets all the information needed for this particular thread and stores it in an R6-class object. If another part of the script fails, we do not need to redo the calls to the Gmail API; the cached copy within the store is retrieved instead.</p>
<p>I used a similar approach to <a href="https://github.com/carpentries/emailmigration/blob/master/R/caching.R#L71">store the responses from the HelpScout API</a>, and wrapped at the same time the call to the <code class="language-plaintext highlighter-rouge">get_gmail_thread()</code> function above. A slightly simplified version of what I used is:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fetch_hook_hs_response</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_gmail_thread</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w">
</span><span class="n">hs_create_thread</span><span class="p">(</span><span class="n">res</span><span class="o">$</span><span class="n">get</span><span class="p">(),</span><span class="w"> </span><span class="n">htoken</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">store_hs_responses</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cache/hs_responses"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">storr</span><span class="o">::</span><span class="n">storr_external</span><span class="p">(</span><span class="w">
</span><span class="n">storr</span><span class="o">::</span><span class="n">driver_rds</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w">
</span><span class="n">fetch_hook_hs_response</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">get_hs_response</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">thread_id</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">store_hs_responses</span><span class="p">()</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="n">thread_id</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>So, what’s happening here? I use the Gmail thread ID as a single point of entry for the entire script (retrieve the thread from the Gmail API, convert it to the format expected by the HelpScout API, upload the thread to HelpScout). Depending on whether the queries have already been made and stored in the cache, the script will retrieve the data from the API or the objects stored on disk in the cache.</p>
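<p>The chaining of the two stores can be sketched without any real API. In the example below, <code class="language-plaintext highlighter-rouge">fake_download()</code> and <code class="language-plaintext highlighter-rouge">fake_upload()</code> are hypothetical stand-ins for the Gmail and HelpScout calls, the counters exist only to show what gets executed, and the stores live in memory rather than on disk:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(storr)

## hypothetical stand-ins for the two API calls; each counts its own calls
n_download <- 0
n_upload <- 0
fake_download <- function(id) {
  n_download <<- n_download + 1
  paste("thread", id)
}
fake_upload <- function(thread) {
  n_upload <<- n_upload + 1
  paste("uploaded", thread)
}

## first store: caches the result of the "download" step
st_down <- storr_external(
  driver_environment(),
  function(key, namespace) fake_download(key)
)

## second store: its hook goes through the first store, then "uploads"
st_up <- storr_external(
  driver_environment(),
  function(key, namespace) {
    fake_upload(st_down$get(key, namespace = namespace))
  }
)

## a single entry point: the thread identifier
st_up$get("abc", namespace = "attempt1")
## asking again reads from the cache; neither stand-in runs a second time
st_up$get("abc", namespace = "attempt1")

c(download = n_download, upload = n_upload)
</code></pre></div></div>
<p>Each stand-in runs once, even though the entry point is called twice.</p>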
<p>What does the <code class="language-plaintext highlighter-rouge">namespace</code> argument do? Namespaces in <code class="language-plaintext highlighter-rouge">storr</code> let you organize the objects in your store. In particular, they allow you to have objects with the same key but different values. Here, I planned to use namespaces to keep track of my different attempts. If the first attempt had failed for some threads, I could fix the problem in the code, and re-attempt the HelpScout API calls just for the ones that failed under a different namespace.</p>
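<p>As a small illustration of namespaces, the sketch below stores two different values under the same key. The namespace names <code class="language-plaintext highlighter-rouge">attempt1</code> and <code class="language-plaintext highlighter-rouge">attempt2</code> are made up for the example, and the store is an in-memory one created with <code class="language-plaintext highlighter-rouge">storr_environment()</code>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(storr)

## in-memory store, nothing is written to disk
st <- storr_environment()

## same key, two namespaces, two values
st$set("cat_name", "Felix", namespace = "attempt1")
st$set("cat_name", "Garfield", namespace = "attempt2")

## each namespace keeps its own value for the key
st$get("cat_name", namespace = "attempt1")
st$get("cat_name", namespace = "attempt2")

## keys stored in one namespace, and the namespaces in the store
st$list(namespace = "attempt1")
st$list_namespaces()
</code></pre></div></div>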
<h2 id="5-putting-it-all-together">5. Putting it all together</h2>
<p>Once I had most of the pieces together, I started by testing the code on the first 100 threads (as it’s the default number of threads that <code class="language-plaintext highlighter-rouge">gmailr</code> returns). That was a manageable number to see how the script behaved while being large enough that many different types of messages would be encountered. At that time, I didn’t use the caching system.</p>
<p>Once the first 100 threads could be imported successfully in HelpScout, I wrote a function to retrieve the identifiers for all the threads in the inbox that needed to be imported, and iterated on these identifiers to call the <code class="language-plaintext highlighter-rouge">get_hs_response</code> function:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">get_all_threads</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">first_it</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gm_threads</span><span class="p">()</span><span class="w">
</span><span class="n">next_token</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">first_it</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">nextPageToken</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">append</span><span class="p">(</span><span class="nf">list</span><span class="p">(),</span><span class="w"> </span><span class="n">first_it</span><span class="p">)</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">next_token</span><span class="p">)</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">tmp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gm_threads</span><span class="p">(</span><span class="n">page_token</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">next_token</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">append</span><span class="p">(</span><span class="w">
</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">tmp</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">next_token</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tmp</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">nextPageToken</span><span class="w">
</span><span class="n">message</span><span class="p">(</span><span class="s2">"next token: "</span><span class="p">,</span><span class="w"> </span><span class="n">next_token</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">threads</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_all_threads</span><span class="p">()</span><span class="w">
</span><span class="n">threads_ids</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map</span><span class="p">(</span><span class="w">
</span><span class="n">threads</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_chr</span><span class="p">(</span><span class="n">.</span><span class="o">$</span><span class="n">threads</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="o">$</span><span class="n">id</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unlist</span><span class="p">()</span><span class="w">
</span><span class="n">hs_res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">walk</span><span class="p">(</span><span class="w">
</span><span class="n">threads_ids</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">get_hs_response</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"v2020-04-10.1"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>As part of the hook function that takes care of uploading conversations to HelpScout, I checked whether the upload was successful and, based on that, created and assigned a Gmail label to the thread. This was an additional safeguard that I could use to flag threads that didn’t import successfully.</p>
<p>Once the upload completed, I could then inspect the content of the store:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Retrieve the threads_ids from the store</span><span class="w">
</span><span class="n">idx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">store_hs_responses</span><span class="p">()</span><span class="o">$</span><span class="nf">list</span><span class="p">(</span><span class="n">namespace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"v2020-04-10.1"</span><span class="p">)</span><span class="w">
</span><span class="c1">## Retrieve the status code for the HelpScout API responses</span><span class="w">
</span><span class="n">is_error</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_lgl</span><span class="p">(</span><span class="w">
</span><span class="n">idx</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">httr</span><span class="o">::</span><span class="n">status_code</span><span class="p">(</span><span class="w">
</span><span class="n">store_hs_responses</span><span class="p">()</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"v2020-04-10.1"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">400</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">## How many calls failed?</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">is_error</span><span class="p">)</span><span class="w">
</span><span class="c1">## Which thread_ids failed?</span><span class="w">
</span><span class="n">idx</span><span class="p">[</span><span class="n">is_error</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
<p>and double-check that these were the same threads that were labeled with <code class="language-plaintext highlighter-rouge">failure-<namespace></code> in Gmail.</p>
<h2 id="lessons-learned">Lessons learned</h2>
<p>As is often the case when using programming to solve problems, what might seem like a simple task, “transferring emails from one system to another”, is a collection of small problems. Being able to break down the big problems into small ones, and knowing how to address them, comes with experience. Experience will help you recognize problems similar to some you have already solved, and reflecting on these past experiences will help you identify the algorithms, packages, and general code organization that are most likely to help you solve your problem.</p>
<p>In The Carpentries Instructor Training, when <a href="https://carpentries.github.io/instructor-training/03-expertise/index.html">we teach about expertise</a>, we talk about how the mental model of experts is denser and more connected. These features make it more difficult for experts to teach beginners because they have forgotten what it is like to not know how to break down a large problem into multiple small ones. The problem here is not just “migrate a bunch of emails between two systems”, there is a lot more to it. I wrote this blog post with the intent to demonstrate the approach I took to break down a problem into small ones and, in the process, describe the tools and techniques I chose to address them.</p>
<p>Expertise is subjective and relative, and I certainly do not claim that the approach I chose here is the best, the most efficient or the most elegant. There is certainly room for improvement. For instance, parts of the code could be re-factored to make it more organized, parts could be rewritten to be more <a href="https://en.wikipedia.org/wiki/Defensive_programming">defensive</a>, and there is no documentation (besides this blog post) and barely any comments.</p>
<p>I am interested in hearing your perspective and thoughts on how the problem could have been approached differently and the tools you would have chosen to address it. If this post was useful to you to help you solve a different problem, I would also love to hear about it! Leave a comment below or contact me using the info provided on the left of this page.</p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>You may notice that the Git history for the repo includes the key and secret for the HelpScout OAuth authentication. By themselves, these are not enough to access any data, as you also need to authenticate with a valid HelpScout account within our organization. These credentials have also been revoked. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>I’ll be submitting a pull request to <code class="language-plaintext highlighter-rouge">gmailr</code> soon. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>If you are interested in learning more about the object-oriented programming R6 system, the <a href="https://adv-r.hadley.nz/r6.html">chapter about it</a> in the “Advanced R” book by Hadley Wickham is a great place to start. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>If you are interested in learning more about <code class="language-plaintext highlighter-rouge">storr</code>, read the <a href="https://richfitz.github.io/storr/articles/storr.html">documentation for the package</a> and the <a href="https://richfitz.github.io/storr/articles/external.html">vignette on external data</a> that initially helped me get started with this amazingly useful package. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
François Michonneau francois.michonneau@gmail.com
How R allowed The Carpentries to migrate emails from Gmail to HelpScout using their web APIs
Advent of Code 2018 2018-12-01T00:00:00+00:00 2018-12-01T00:00:00+00:00 https://francoismichonneau.net/2018/12/advent
<p>I’m going to try to complete the Advent of Code again this year. I’ll put all the exercises I complete in this post.</p>
<p>Links to the puzzles are at https://adventofcode.com/2018</p>
<h1 id="day-1">Day 1</h1>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="c1">## part 1</span><span class="w">
</span><span class="n">readr</span><span class="o">::</span><span class="n">read_lines</span><span class="p">(</span><span class="s2">"advent-data/2018-12-01-day1.txt"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="nf">as.numeric</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="nf">sum</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 408
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## part 2</span><span class="w">
</span><span class="n">input</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">read_lines</span><span class="p">(</span><span class="s2">"advent-data/2018-12-01-day1.txt"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="n">already_seen</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">v_sum</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">cumsum</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">))</span><span class="w">
</span><span class="n">has_dup</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">any</span><span class="p">(</span><span class="n">duplicated</span><span class="p">(</span><span class="n">v_sum</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">has_dup</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">v_sum</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">duplicated</span><span class="p">(</span><span class="n">v_sum</span><span class="p">))[</span><span class="m">1</span><span class="p">]])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">already_seen</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 55250
</code></pre></div></div>
<h1 id="day-2">Day 2</h1>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">read_lines</span><span class="p">(</span><span class="s2">"advent-data/2018-12-02-day2.txt"</span><span class="p">)</span><span class="w">
</span><span class="c1">## part 1</span><span class="w">
</span><span class="n">count_letters</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n_letters</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">strsplit</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">purrr</span><span class="o">::</span><span class="n">map</span><span class="p">(</span><span class="n">table</span><span class="p">)</span><span class="w">
</span><span class="n">has_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">as.integer</span><span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">has_3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">as.integer</span><span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">has_2_vec</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_int</span><span class="p">(</span><span class="n">n_letters</span><span class="p">,</span><span class="w"> </span><span class="n">has_2</span><span class="p">)</span><span class="w">
</span><span class="n">has_3_vec</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_int</span><span class="p">(</span><span class="n">n_letters</span><span class="p">,</span><span class="w"> </span><span class="n">has_3</span><span class="p">)</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">has_2_vec</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">has_3_vec</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">count_letters</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 6000
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## part 2</span><span class="w">
</span><span class="n">all_in</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">crossing</span><span class="p">(</span><span class="n">in1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">in2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="w">
</span><span class="n">split1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strsplit</span><span class="p">(</span><span class="n">in1</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">),</span><span class="w">
</span><span class="n">split2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strsplit</span><span class="p">(</span><span class="n">in2</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">),</span><span class="w">
</span><span class="n">n_diff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map2_int</span><span class="p">(</span><span class="n">split1</span><span class="p">,</span><span class="w"> </span><span class="n">split2</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">.x</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">.y</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">all_in</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">n_diff</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map2_chr</span><span class="p">(</span><span class="w">
</span><span class="n">split1</span><span class="p">,</span><span class="w">
</span><span class="n">split2</span><span class="p">,</span><span class="w">
</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">paste</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">x</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">y</span><span class="p">],</span><span class="w"> </span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="p">}))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pull</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "pbykrmjmizwhxlqnasfgtycdv"
</code></pre></div></div>
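<p>To make the part 2 logic easier to follow, here is a minimal base R sketch of the same idea on a tiny made-up list of IDs. The <code class="language-plaintext highlighter-rouge">ids</code> vector and the helpers <code class="language-plaintext highlighter-rouge">n_diff()</code> and <code class="language-plaintext highlighter-rouge">common_letters()</code> are hypothetical names introduced for this illustration, and <code class="language-plaintext highlighter-rouge">combn()</code> is swapped in for <code class="language-plaintext highlighter-rouge">crossing()</code> so that each pair is compared only once.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sample IDs (not the puzzle input)
ids <- c("abcde", "fghij", "klmno", "pqrst", "fguij", "axcye", "wvxyz")

# Number of positions at which two equal-length strings differ
n_diff <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Letters that two strings share at the same positions
common_letters <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  paste(x[x == y], collapse = "")
}

# All unordered pairs of IDs, one pair per column
pairs <- combn(ids, 2)
diffs <- apply(pairs, 2, function(p) n_diff(p[1], p[2]))

# Keep the first pair that differs by exactly one letter
hit <- pairs[, which(diffs == 1)[1]]
common_letters(hit[1], hit[2])
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "fgij"
</code></pre></div></div>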
<h2 id="day-3">Day 3</h2>
<p>That’s far from the prettiest code I’ve written! But it gets the job done.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">extract_coords</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">readr</span><span class="o">::</span><span class="n">read_delim</span><span class="p">(</span><span class="w">
</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">delim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span><span class="w"> </span><span class="n">col_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">tidyr</span><span class="o">::</span><span class="n">extract</span><span class="p">(</span><span class="n">X1</span><span class="p">,</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"id"</span><span class="p">,</span><span class="w"> </span><span class="n">regex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"([[:digit:]]+)"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">tidyr</span><span class="o">::</span><span class="n">extract</span><span class="p">(</span><span class="n">X3</span><span class="p">,</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"x_begin"</span><span class="p">,</span><span class="w"> </span><span class="s2">"y_begin"</span><span class="p">),</span><span class="w"> </span><span class="n">regex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"([[:digit:]]+),([[:digit:]]+):"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">tidyr</span><span class="o">::</span><span class="n">extract</span><span class="p">(</span><span class="n">X4</span><span class="p">,</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"width"</span><span class="p">,</span><span class="w"> </span><span class="s2">"height"</span><span class="p">),</span><span class="w"> </span><span class="n">regex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"([[:digit:]]+)x([[:digit:]]+)"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">dplyr</span><span class="o">::</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">X2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate_all</span><span class="p">(</span><span class="n">as.numeric</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">find_total_dim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">coords</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coords</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate</span><span class="p">(</span><span class="n">total_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">width</span><span class="p">,</span><span class="w">
</span><span class="n">total_height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">height</span><span class="p">)</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="n">total_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">res</span><span class="o">$</span><span class="n">total_width</span><span class="p">),</span><span class="w">
</span><span class="n">total_height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">res</span><span class="o">$</span><span class="n">total_height</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">fill_matrix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">extract_coords</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
</span><span class="n">m_dim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">find_total_dim</span><span class="p">(</span><span class="n">c</span><span class="p">)</span><span class="w">
</span><span class="n">M</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m_dim</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w">
</span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m_dim</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">c</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">i_s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">x_begin</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">x_begin</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="o">$</span><span class="n">width</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="n">j_s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">y_begin</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">y_begin</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="o">$</span><span class="n">height</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="n">M</span><span class="p">[</span><span class="n">i_s</span><span class="p">,</span><span class="w"> </span><span class="n">j_s</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">M</span><span class="p">[</span><span class="n">i_s</span><span class="p">,</span><span class="w"> </span><span class="n">j_s</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">M</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">more_two_claims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">M</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fill_matrix</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">M</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## part 1 answer</span><span class="w">
</span><span class="n">more_two_claims</span><span class="p">(</span><span class="s2">"advent-data/2018-12-03-day3.txt"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character()
## )
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 109716
</code></pre></div></div>
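<p>The counting approach is easier to see on a small grid. The sketch below applies the same idea to three made-up claims. The <code class="language-plaintext highlighter-rouge">claims</code> data frame is a hypothetical stand-in for the parsed input file, and the loop mirrors what <code class="language-plaintext highlighter-rouge">fill_matrix()</code> does rather than reusing it.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical claims: id, left offset, top offset, width, height
claims <- data.frame(
  id      = c(1, 2, 3),
  x_begin = c(1, 3, 5),
  y_begin = c(3, 1, 5),
  width   = c(4, 4, 2),
  height  = c(4, 4, 2)
)

# Grid just large enough to hold every claim
M <- matrix(0,
            nrow = max(claims$x_begin + claims$width),
            ncol = max(claims$y_begin + claims$height))

# Add 1 to every cell covered by a claim
# (offsets are 0-based, R indices are 1-based, hence the + 1)
for (k in seq_len(nrow(claims))) {
  rows <- (claims$x_begin[k] + 1):(claims$x_begin[k] + claims$width[k])
  cols <- (claims$y_begin[k] + 1):(claims$y_begin[k] + claims$height[k])
  M[rows, cols] <- M[rows, cols] + 1
}

# Number of cells claimed at least twice
sum(M >= 2)
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 4
</code></pre></div></div>
<p>Claims 1 and 2 share a 2 x 2 square, while claim 3 touches neither of them, so every cell it covers holds exactly 1.</p>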
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">overlaps</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x_begin</span><span class="p">,</span><span class="w"> </span><span class="n">y_begin</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">x_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="n">x_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">width</span><span class="p">)</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">y_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="n">y_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">height</span><span class="p">)</span><span class="w">
</span><span class="nf">all</span><span class="p">(</span><span class="n">M</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">no_overlap</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">M</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fill_matrix</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">extract_coords</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">logical</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">c</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">c</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">overlaps</span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">x_begin</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">c</span><span class="o">$</span><span class="n">y_begin</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">c</span><span class="o">$</span><span class="n">width</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">c</span><span class="o">$</span><span class="n">height</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">M</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">c</span><span class="o">$</span><span class="n">id</span><span class="p">[</span><span class="n">res</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## part 2 answer</span><span class="w">
</span><span class="n">no_overlap</span><span class="p">(</span><span class="s2">"advent-data/2018-12-03-day3.txt"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character()
## )
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character()
## )
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 124
</code></pre></div></div>François Michonneaufrancois.michonneau@gmail.comSolutions for the 2018 Advent of CodeA list of resources to learn more about R programming (and other things)2018-02-04T00:00:00+00:002018-02-04T00:00:00+00:00https://francoismichonneau.net/2018/02/resources<h2 id="data-visualization">Data Visualization</h2>
<ul>
<li><a href="https://socviz.co/">Data Visualization: A Practical Introduction</a> by Kieran Healy</li>
<li><a href="https://serialmentor.com/dataviz/">Fundamentals of Data Visualization</a> by Claus O. Wilke</li>
</ul>
<h2 id="text-analysis">Text analysis</h2>
<ul>
<li><a href="https://tidytextmining.com/">Text Mining with R</a> by Julia Silge and David Robinson</li>
</ul>
<h2 id="git">Git</h2>
<ul>
<li><a href="https://happygitwithr.com/">Happy Git and GitHub for the useR</a> by Jenny Bryan</li>
</ul>
<h2 id="unix">UNIX</h2>
<ul>
<li><a href="https://seankross.com/the-unix-workbench/">The Unix Workbench</a> by Sean Kross</li>
</ul>
<h2 id="rcppc">Rcpp/C++</h2>
<ul>
<li><a href="https://teuder.github.io/rcpp4everyone_en/">Rcpp for Everyone</a> by
Masaki E. Tsuda</li>
</ul>
<h2 id="random">Random</h2>
<ul>
<li><a href="https://youknowfordevs.com/2017/07/23/disposable-laptops-with-docker-compose-and-npm.html">Disposable Laptops With Docker Compose And NPM</a></li>
</ul>François Michonneaufrancois.michonneau@gmail.comA non-exhaustive semi-curated list of useful books/websites for R programming, data analysis, and working at the shell.How to setup magithub if you have GitHub 2-factor authentication enabled?2018-01-25T00:00:00+00:002018-01-25T00:00:00+00:00https://francoismichonneau.net/2018/01/setup-magithub-with-2FA<p>If you are trying to set up <code class="language-plaintext highlighter-rouge">magithub</code> and have GitHub 2-factor authentication enabled, here are the steps you need to take:</p>
<ul>
<li>Go to https://github.com/settings/tokens and create a personal access token.
Give it the name that the prompt suggests (for me it was “Emacs package magithub @
francois-XPS-15-9560”) and the following scopes: “notifications”,
“repo” and “user”.</li>
<li>
<p>Create a file <code class="language-plaintext highlighter-rouge">~/.authinfo</code> with the following content:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>machine api.github.com login YOUR_GITHUB_USERNAME^magithub password <your token>
</code></pre></div> </div>
</li>
<li>Encrypt the file (this assumes you have GPG set up) by running <code class="language-plaintext highlighter-rouge">M-x epa-encrypt-file</code> and giving it <code class="language-plaintext highlighter-rouge">~/.authinfo</code>.</li>
<li>Make sure that <code class="language-plaintext highlighter-rouge">~/.authinfo.gpg</code> was created and that its content is right.</li>
<li>Delete the unencrypted <code class="language-plaintext highlighter-rouge">~/.authinfo</code></li>
<li>Do <code class="language-plaintext highlighter-rouge">M-x customize-variable RET auth-sources</code> and put <code class="language-plaintext highlighter-rouge">~/.authinfo.gpg</code> first
in the list of files inspected.</li>
</ul>François Michonneaufrancois.michonneau@gmail.comThe steps involved to create the authinfo.gpg file used by magithub when you have 2FA enabled on GitHubAdvent of Code: Day 212017-12-21T00:00:00+00:002017-12-21T00:00:00+00:00https://francoismichonneau.net/2017/12/advent-day-21<ul>
<li><a href="https://adventofcode.com/2017/day/21">Problem</a></li>
</ul>
<h1 id="parts-1-and-2">Parts 1 and 2</h1>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="c1"># rotate matrix 90° clockwise</span><span class="w">
</span><span class="n">rotate</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">mat</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">mat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">apply</span><span class="p">(</span><span class="n">mat</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">rev</span><span class="p">))</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">mat</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># create a mirror image of the matrix</span><span class="w">
</span><span class="n">flip</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">mat</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">t</span><span class="p">(</span><span class="n">apply</span><span class="p">(</span><span class="n">mat</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">rev</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># convert enhancement rule string into matrix</span><span class="w">
</span><span class="n">string_to_mat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">ii</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">uu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">strsplit</span><span class="p">(</span><span class="n">ii</span><span class="p">,</span><span class="w"> </span><span class="s2">"/"</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nchar</span><span class="p">(</span><span class="n">uu</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="n">uu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">strsplit</span><span class="p">(</span><span class="n">uu</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">))</span><span class="w">
</span><span class="n">matrix</span><span class="p">(</span><span class="n">uu</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># convert matrix into enhancement rule string</span><span class="w">
</span><span class="n">mat_to_string</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">paste</span><span class="p">(</span><span class="n">apply</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">paste</span><span class="p">,</span><span class="w"> </span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">),</span><span class="w">
</span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"/"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># from an enhancement string, find all possible combinations</span><span class="w">
</span><span class="n">expand_combinations</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">ii</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">string_to_mat</span><span class="p">(</span><span class="n">ii</span><span class="p">)</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">m</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">mr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">0</span><span class="o">:</span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">mr</span><span class="p">[[(</span><span class="n">i</span><span class="o">*</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rotate</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mr</span><span class="p">[[(</span><span class="n">i</span><span class="o">*</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">2</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rotate</span><span class="p">(</span><span class="n">flip</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">map_chr</span><span class="p">(</span><span class="n">mr</span><span class="p">,</span><span class="w"> </span><span class="n">mat_to_string</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># create data frame from input, and include all unique combinations</span><span class="w">
</span><span class="n">read_rules</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">read_delim</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">delim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"="</span><span class="p">,</span><span class="w">
</span><span class="n">col_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate_all</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">">?\\s+"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">.</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">set_names</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"ii"</span><span class="p">,</span><span class="w"> </span><span class="s2">"oo"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">comb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="n">ii</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">expand_combinations</span><span class="p">(</span><span class="n">.</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unnest</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">distinct</span><span class="p">(</span><span class="n">comb</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># split a matrix into a list of 2x2 or 3x3 sub-matrices</span><span class="w">
</span><span class="n">split_mat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n_mat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">m</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">n</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="n">n_mat</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">idx_start</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">m</span><span class="p">)[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">idx_start</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">idx_start</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="n">k</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="o">:</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">j</span><span class="o">:</span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)]</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># reassemble a matrix from a list of 2x2 or 3x3 matrices</span><span class="w">
</span><span class="n">list_to_mat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">lst</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">si</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">lst</span><span class="p">[[</span><span class="m">1</span><span class="p">]])</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">si</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">lst</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">array</span><span class="p">(,</span><span class="w"> </span><span class="n">dim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">s</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="p">))</span><span class="w">
</span><span class="n">idx_start</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">si</span><span class="p">)</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">idx_start</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">idx_start</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="o">:</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">si</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">j</span><span class="o">:</span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">si</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lst</span><span class="p">[[</span><span class="n">k</span><span class="p">]]</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># use the enhancement rule book to grow the matrix</span><span class="w">
</span><span class="n">convert_rule</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span><span class="w"> </span><span class="n">rules</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rules</span><span class="o">$</span><span class="n">oo</span><span class="p">[</span><span class="n">rules</span><span class="o">$</span><span class="n">comb</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">pattern</span><span class="p">]</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="n">stop</span><span class="p">(</span><span class="s2">"no unique rule matches pattern: "</span><span class="p">,</span><span class="w"> </span><span class="n">pattern</span><span class="p">)</span><span class="w">
</span><span class="n">string_to_mat</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># apply the enhancement algorithm for a set of rules and n iterations</span><span class="w">
</span><span class="n">enhance</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">rules</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">".#./..#/###"</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">string_to_mat</span><span class="p">(</span><span class="n">start</span><span class="p">)</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">m</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">sp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">d</span><span class="p">[</span><span class="n">s</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">][</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">smat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">split_mat</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">sp</span><span class="p">)</span><span class="w">
</span><span class="n">sstr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="n">smat</span><span class="p">,</span><span class="w"> </span><span class="n">mat_to_string</span><span class="p">)</span><span class="w">
</span><span class="n">mlst</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="n">sstr</span><span class="p">,</span><span class="w"> </span><span class="n">convert_rule</span><span class="p">,</span><span class="w"> </span><span class="n">rules</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">list_to_mat</span><span class="p">(</span><span class="n">mlst</span><span class="p">)</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">m</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">rules</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_rules</span><span class="p">(</span><span class="s2">"advent-data/2017-12-21-data.txt"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character()
## )
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">part1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">enhance</span><span class="p">(</span><span class="n">rules</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">part1</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"#"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 190
</code></pre></div></div>
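<p>The grid grows quickly with the number of iterations: inside <code class="language-plaintext highlighter-rouge">enhance()</code>, the grid is cut into blocks of 2 when its side is even and into blocks of 3 otherwise, and each block comes back one pixel wider. As a rough sketch, the side length after each iteration can be computed without reading any rules. The <code class="language-plaintext highlighter-rouge">grid_sizes()</code> helper below is hypothetical (it is not part of the solution above) and assumes the 3x3 starting pattern:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical helper: side length of the grid after 0, 1, ..., n iterations
grid_sizes <- function(n, start = 3) {
  sizes <- numeric(n + 1)
  sizes[1] <- start
  for (i in seq_len(n)) {
    s <- sizes[i]
    # same preference as in enhance(): blocks of 2 if the side is even, else 3
    block <- if (s %% 2 == 0) 2 else 3
    # each block of side `block` is replaced by one of side `block + 1`
    sizes[i + 1] <- (s %/% block) * (block + 1)
  }
  sizes
}
grid_sizes(5)
</code></pre></div></div>
<p>For 5 iterations the sides are 3, 4, 6, 9, 12 and 18; after 18 iterations the grid is 2187 pixels wide.</p>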
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">part2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">enhance</span><span class="p">(</span><span class="n">rules</span><span class="p">,</span><span class="w"> </span><span class="m">18</span><span class="p">)</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">part2</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"#"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 2335049
</code></pre></div></div>François Michonneaufrancois.michonneau@gmail.comSolution for Day 21 of Advent of Code