<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://francoismichonneau.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://francoismichonneau.net/" rel="alternate" type="text/html" /><updated>2025-12-19T09:38:55+00:00</updated><id>https://francoismichonneau.net/feed.xml</id><title type="html">François Michonneau, PhD</title><subtitle>Personal website</subtitle><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><entry><title type="html">Advent of SQL 2025 with DuckDB and R</title><link href="https://francoismichonneau.net/2025/12/advent-of-sql/" rel="alternate" type="text/html" title="Advent of SQL 2025 with DuckDB and R" /><published>2025-12-11T00:00:00+00:00</published><updated>2025-12-11T00:00:00+00:00</updated><id>https://francoismichonneau.net/2025/12/advent-of-sql</id><content type="html" xml:base="https://francoismichonneau.net/2025/12/advent-of-sql/"><![CDATA[<h2 id="intro">Intro</h2>

<p>This year, the Advent of SQL is hosted by the Database School. I don’t know anything about them, except that they took over the Advent of SQL from last year. There will be only 10 challenges this year (with 25 challenges, last year felt a little long, so this is a welcome change). The spirit of the challenges seems to remain the same: using SQL to solve Christmas-themed puzzles. The delivery format is, however, different, as it uses the Database School platform: you need to create an account and log in to access the challenges and their associated data. Each challenge takes the form of a video tutorial with an associated playground.</p>

<p>I’m going to use these challenges as an opportunity to brush up on my SQL skills, using DuckDB. I’m going to work from R (just in case I need to do any additional data manipulation or visualization), but my goal this year is to do everything using DuckDB SQL (and not to use LLMs for help, just searching and reading the docs the old-fashioned way). I might use LLMs to propose more elegant or alternative solutions once I have a working solution.</p>

<p>I’ll post my solutions daily (or as often as I can manage) below. The data can be downloaded from the Database School website once you have created an account.</p>

<h2 id="day-1">Day 1</h2>

<p>It’s a single table containing messy wish list data. The goal is to find the most common wishes, ordered by count in descending order.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Replace `BIGSERIAL` with `INTEGER` in `wish_list` table definition.</span><span class="w">

</span><span class="c1"># Create DuckDB database with:</span><span class="w">
</span><span class="c1">#  duckdb ./data_duckdb/advent_day_01.duckdb &lt; ./data_sql/day1-wish-list.sql</span><span class="w">
</span><span class="c1"># Be patient, these single inserts take a while to run in DuckDB (about 90s)</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DBI</span><span class="o">::</span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_01.duckdb"</span><span class="p">)</span><span class="w">
</span><span class="n">DBI</span><span class="o">::</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  SELECT wish, count(wish) AS n
    FROM (SELECT lower(trim(raw_wish)) AS wish FROM 'wish_list')
    GROUP BY wish
    ORDER BY n DESC;
"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-2">Day 2</h2>

<p>With day 2, we have two tables: <code class="language-plaintext highlighter-rouge">snowball_inventory</code> and <code class="language-plaintext highlighter-rouge">snowball_categories</code>. The goal is to find the total quantity of items in inventory for each category, ordered by total quantity ascending. Only items with quantity &gt; 0 should be included. You need to watch the video to understand the challenge, as some information is not included in the challenge description itself.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create DuckDB database with (no need to edit the file):</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_02.duckdb &lt; ./data_sql/day2-inserts.sql</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DBI</span><span class="o">::</span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_02.duckdb"</span><span class="p">)</span><span class="w">
</span><span class="n">DBI</span><span class="o">::</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  SELECT category_name, SUM(quantity) AS total_quantity
  FROM (
       SELECT i.category_name, i.status, i.quantity, o.*
       FROM snowball_inventory i
       JOIN snowball_categories o
      ON (i.category_name = o.official_category AND quantity &gt; 0)
    )
    GROUP BY category_name
    ORDER BY total_quantity ASC;
  "</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-3">Day 3</h2>

<p>Copy and paste from the challenge:</p>

<blockquote>
  <p>Using the hotline_messages table, update any record that has “sorry” (case insensitive) in the transcript and doesn’t currently have a status assigned to have a status of “approved”.
Then delete any records where the tag is “penguin prank”, “time-loop advisory”, “possible dragon”, or “nonsense alert” or if the caller’s name is “Test Caller”.
After updating and deleting the records as described, write a final query that returns how many messages currently have a status of “approved” and how many still need to be reviewed (i.e., status is <code class="language-plaintext highlighter-rouge">NULL</code>).</p>
</blockquote>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create DuckDB database with (no need to edit the file):</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_03.duckdb &lt; ./data_sql/day3-inserts.sql</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DBI</span><span class="o">::</span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_03.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">DBI</span><span class="o">::</span><span class="n">dbExecute</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  UPDATE hotline_messages
  SET status = 'approved' 
  WHERE LOWER(transcript) LIKE '%sorry%'
    AND status IS NULL;
  "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">DBI</span><span class="o">::</span><span class="n">dbExecute</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  DELETE FROM hotline_messages
  WHERE tag IN (
    'penguin prank',
    'time-loop advisory',
    'possible dragon',
    'nonsense alert'
    )
    OR caller_name = 'Test Caller';
  "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">DBI</span><span class="o">::</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  SELECT clean_status, COUNT(clean_status)
  FROM (
    SELECT
    status, 
      CASE 
        WHEN status IS NULL THEN 'TBD'
        ELSE 'approved'
      END as clean_status
    FROM hotline_messages
    )
  GROUP BY clean_status;
"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
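
<p>As an aside, the final count could also be written with DuckDB’s <code class="language-plaintext highlighter-rouge">FILTER</code> clause on aggregates, avoiding the CASE WHEN subquery. A sketch (same table as above, untested against the challenge data):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Count approved and still-to-review messages in a single pass
SELECT
  count(*) FILTER (status = 'approved') AS approved,
  count(*) FILTER (status IS NULL) AS needs_review
FROM hotline_messages;
</code></pre></div></div>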

<h2 id="day-4">Day 4</h2>

<p>Copy and paste from the challenge:</p>

<blockquote>
  <p>Using the official_shifts and last_minute_signups tables, create a combined de-duplicated volunteer list.
Ensure the list has standardized role labels of Stage Setup, Cocoa Station, Parking Support, Choir Assistant, Snow Shoveling, Handwarmer Handout.
Make sure that the timeslot formats follow John’s official shifts format.</p>
</blockquote>

<p>I used the snake case format for the role, but it looks like the challenge actually asked for title case. I left the ‘ELSE TBD’ clauses in there, as I used them when building the queries to make sure I caught all the cases. I had also checked for unique values in the time slots and, given there were just a few, I went for a CASE WHEN approach rather than something more sophisticated.</p>
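
<p>The check for unique values mentioned above is just a couple of <code class="language-plaintext highlighter-rouge">SELECT DISTINCT</code> queries, something like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- List the raw values that the CASE WHEN clauses need to cover
SELECT DISTINCT assigned_task FROM last_minute_signups;
SELECT DISTINCT time_slot FROM last_minute_signups;
</code></pre></div></div>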

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create DuckDB database with (no need to edit the file):</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_04.duckdb &lt; ./data_sql/day4-inserts.sql</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DBI</span><span class="o">::</span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_04.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">DBI</span><span class="o">::</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"SELECT * FROM official_shifts"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">DBI</span><span class="o">::</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"SELECT
  volunteer_name,
  CASE
    WHEN assigned_task ILIKE '%choir%' THEN 'choir_assistant'
    WHEN assigned_task ILIKE '%stage%' THEN 'stage_setup'
    WHEN assigned_task ILIKE '%cocoa%' THEN 'cocoa_station'
    WHEN assigned_task ILIKE '%parking%' THEN 'parking_support'
    WHEN assigned_task ILIKE '%shovel%' THEN 'snow_shoveling'
    WHEN assigned_task ILIKE '%hand%' THEN 'handwarmer_handout'
    ELSE 'TBD'
  END as role,
  CASE
    WHEN (time_slot='10AM' OR time_slot ='10 am') THEN '10:00 AM'
    WHEN (time_slot='2 PM' OR time_slot='2 pm') THEN '2:00 PM'
    WHEN time_slot = 'noon' THEN '12:00 PM'
    ELSE 'TBD'
  END as shift_time
  FROM last_minute_signups
 
  UNION

  SELECT volunteer_name,
         role,
         shift_time
  FROM official_shifts
  ORDER BY volunteer_name;
"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-5">Day 5</h2>

<p>Copy and paste from the challenge:</p>

<blockquote>
  <p>Challenge: Write a query that returns the top 3 artists per user. Order the results by the most played.</p>
</blockquote>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Create DuckDB database with (no need to edit the file):</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_05.duckdb &lt; ./data_sql/day5-inserts.sql</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DBI</span><span class="o">::</span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_05.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  SELECT * FROM(
    SELECT 
      user_name,
      artist,
      COUNT(artist) AS n,
      row_number() OVER (PARTITION BY user_name ORDER BY n DESC) as top
    FROM listening_logs
    GROUP BY user_name, artist
    ORDER BY user_name, n DESC
  )
  WHERE top &lt;= 3;
  "</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
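
<p>As an alternative, DuckDB’s <code class="language-plaintext highlighter-rouge">QUALIFY</code> clause can filter on a window function result directly, which avoids the outer subquery. A sketch (untested against the challenge data):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- QUALIFY filters on window function results, like HAVING does for aggregates
SELECT
  user_name,
  artist,
  COUNT(artist) AS n,
  row_number() OVER (PARTITION BY user_name ORDER BY n DESC) AS top
FROM listening_logs
GROUP BY user_name, artist
QUALIFY top &lt;= 3
ORDER BY user_name, n DESC;
</code></pre></div></div>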

<h2 id="day-6">Day 6</h2>

<blockquote>
  <p>Challenge: Generate a report that returns the dates and families that have no delivery assigned after December 14th, using the families and deliveries_assigned.
Each row in the report should be a date and family name that represents the dates in which families don’t have a delivery assigned yet.
Label the columns as unassigned_date and name. Order the results by unassigned_date and name, respectively, both in ascending order.</p>
</blockquote>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Create DuckDB database with (no need to edit the file):</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_06.duckdb &lt; ./data_sql/day6-inserts.sql</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_06.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"WITH december_2025 AS
     (SELECT date::DATE date
      FROM generate_series(
        DATE '2025-12-15',
        DATE '2025-12-31',
        INTERVAL '1 day'
      ) AS t(date)
      ),

  full_info AS (
    SELECT december_2025.date,
      families.id AS family_id,
      families.family_name
    FROM families
    CROSS JOIN december_2025
  )

  SELECT
    full_info.family_id AS full_fid,
    full_info.family_name,
    full_info.date AS full_date,
    deliveries_assigned.*
  FROM full_info
  LEFT JOIN deliveries_assigned ON (
     full_info.date = deliveries_assigned.gift_date AND
     deliveries_assigned.family_id = full_info.family_id
  )
  WHERE deliveries_assigned.gift_name IS NULL
  ORDER BY date ASC, family_name ASC 
  ;
 "</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
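
<p>The <code class="language-plaintext highlighter-rouge">LEFT JOIN</code> + <code class="language-plaintext highlighter-rouge">IS NULL</code> pattern above is a classic anti-join. DuckDB also supports <code class="language-plaintext highlighter-rouge">ANTI JOIN</code> directly, so the final <code class="language-plaintext highlighter-rouge">SELECT</code> could be rewritten as something like this (keeping the same CTEs; a sketch, untested against the challenge data):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Keep only the (date, family) combinations with no matching delivery
SELECT
  full_info.date AS unassigned_date,
  full_info.family_name AS name
FROM full_info
ANTI JOIN deliveries_assigned ON (
  full_info.date = deliveries_assigned.gift_date AND
  deliveries_assigned.family_id = full_info.family_id
)
ORDER BY unassigned_date ASC, name ASC;
</code></pre></div></div>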

<h2 id="day-7">Day 7</h2>

<blockquote>
  <p>Challenge: Get the stewards a list of all the passengers and the cocoa car(s) they can be served from that has at least one of their favorite mixins.
Remember only the top three most-stocked cocoa cars remained operational, so the passengers must be served from one of those cars.</p>
</blockquote>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create DuckDB database with (no need to edit the file):</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_07.duckdb &lt; ./data_sql/day7-inserts.sql</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_07.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  WITH available_mixins AS (
    SELECT
      car_id AS mixins_car_id,
      available_mixins
    FROM cocoa_cars
    ORDER BY total_stock DESC
    LIMIT 3
  )

  SELECT 
    passenger_name,
    string_agg(mixins_car_id) AS available_cars
  FROM passengers
  JOIN available_mixins ON (list_has_any(passengers.favorite_mixins, available_mixins.available_mixins))
  GROUP BY passenger_name
  ORDER BY passenger_name
  "</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-8">Day 8</h2>

<blockquote>
  <p>Generate a report, using the products and price_changes tables for leadership that returns the product_name, current_price, previous_price, and the difference between the current and previous prices.</p>
</blockquote>

<p>I took a (maybe?) unconventional approach by using the list functions to solve this challenge, as I was focused on getting the price difference first. Using <code class="language-plaintext highlighter-rouge">lag()</code> would have reduced the redundancy of the <code class="language-plaintext highlighter-rouge">list(... ORDER BY rn)</code> calls.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create DuckDB database with (no need to edit the file):</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_08.duckdb &lt; ./data_sql/day8-inserts.sql</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_08.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  WITH sub_prices AS (SELECT 
    product_id,
    price,
    effective_timestamp,
    row_number() OVER (PARTITION BY product_id ORDER BY effective_timestamp DESC) AS rn
  FROM price_changes)

  SELECT
    product_name,
    list(price ORDER BY rn)[2] AS current_price,
    list(price ORDER by rn)[1] AS previous_price,
    list_reduce(list(price ORDER by rn), lambda x,y : x - y) AS price_change
  FROM  sub_prices
  JOIN products USING (product_id)
  WHERE rn &lt; 3
  GROUP BY product_id, product_name
  ORDER BY product_id;
  "</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
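
<p>For reference, the <code class="language-plaintext highlighter-rouge">lag()</code>-based alternative mentioned above could look something like this (a sketch, with a hypothetical <code class="language-plaintext highlighter-rouge">latest_prices</code> CTE name, untested against the challenge data):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- lag() pulls the previous price within each product's history
WITH latest_prices AS (
  SELECT
    product_id,
    price AS current_price,
    lag(price) OVER (PARTITION BY product_id ORDER BY effective_timestamp) AS previous_price,
    row_number() OVER (PARTITION BY product_id ORDER BY effective_timestamp DESC) AS rn
  FROM price_changes
)
SELECT
  product_name,
  current_price,
  previous_price,
  current_price - previous_price AS price_change
FROM latest_prices
JOIN products USING (product_id)
WHERE rn = 1
ORDER BY product_id;
</code></pre></div></div>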

<h2 id="day-9">Day 9</h2>

<blockquote>
  <p>Build a report using the orders table that shows the latest order for each customer, along with their requested shipping method, gift wrap choice (as true or false), and the risk flag in separate columns.
Order the report by the most recent order first so Evergreen Market can reach out to them ASAP.</p>
</blockquote>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Edit the orders table definition to replace `JSONB` with `JSON`.</span><span class="w">

</span><span class="c1"># Create DuckDB database with :</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_09.duckdb &lt; ./data_sql/day9-inserts.sql</span><span class="w">

</span><span class="n">library</span><span class="p">(</span><span class="n">duckdb</span><span class="p">)</span><span class="w">
</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_09.duckdb"</span><span class="p">)</span><span class="w">
</span><span class="n">dbExecute</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"INSTALL json; LOAD JSON;"</span><span class="p">)</span><span class="w">
</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  WITH customer_orders AS (
    SELECT *,
      row_number() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM orders
    ORDER BY customer_id, rn
  )

  SELECT
    customer_id,
    json_extract_string(order_data, '$.shipping.method') AS shipping_method,
    json_extract_string(order_data, '$.gift.wrapped')::BOOL AS gift_wrap,
    json_extract_string(order_data, '$.risk.flag') AS risk_flag
  FROM customer_orders 
  WHERE rn = 1
  ORDER BY created_at DESC;
  "</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-10">Day 10</h2>

<blockquote>
  <p>Challenge:
Clean-up the deliveries table to remove any records where the delivery_location is ‘Volcano Rim’, ‘Drifting Igloo’, ‘Abandoned Lighthouse’, ‘The Vibes’.
Move those records to the misdelivered_presents with all the same columns as deliveries plus a flagged_at column with the current time and a reason column with “Invalid delivery location” listed as the reason for each moved record.
Make sure your final step shows the misdelivered_presents records that you just moved (i.e. don’t include any existing records from the misdelivered_presents table).</p>
</blockquote>

<p>I first solved the challenge by using CTEs to return the appropriate sets of rows. After watching the solution, I discovered <code class="language-plaintext highlighter-rouge">RETURNING</code>. I don’t think I have ever used destructive operations in SQL before, so that was new to me. It seems (and don’t quote me on that) that it’s not possible to use <code class="language-plaintext highlighter-rouge">DELETE</code> in a CTE in DuckDB. Instead, I relied on a temporary table, first combined with an anti-join to delete the records from the <code class="language-plaintext highlighter-rouge">deliveries</code> table, and then combined with an <code class="language-plaintext highlighter-rouge">INSERT</code> to add these records to <code class="language-plaintext highlighter-rouge">misdelivered_presents</code>. I still got to use <code class="language-plaintext highlighter-rouge">RETURNING</code> to see only the inserted rows at the end of the query.</p>
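
<p>For reference, the <code class="language-plaintext highlighter-rouge">DELETE ... RETURNING</code> form itself is simple (a sketch; in Postgres the returned rows could then feed a data-modifying CTE, which DuckDB doesn’t seem to support):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Delete the invalid locations and return the deleted rows in one statement
DELETE FROM deliveries
WHERE delivery_location IN
  ('Volcano Rim', 'Drifting Igloo', 'Abandoned Lighthouse', 'The Vibes')
RETURNING *;
</code></pre></div></div>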

<p>The last few queries at the end validated that the tables were modified correctly.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create DuckDB database with (no need to edit the file):</span><span class="w">
</span><span class="c1"># duckdb ./data_duckdb/advent_day_10.duckdb &lt; ./data_sql/day10-inserts.sql</span><span class="w">

</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"data_duckdb/advent_day_10.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
    CREATE TEMPORARY TABLE deliveries_to_remove AS (
      SELECT * FROM deliveries
      WHERE delivery_location IN
        ('Volcano Rim', 'Drifting Igloo', 'Abandoned Lighthouse', 'The Vibes')
    );

    CREATE OR REPLACE TABLE deliveries AS (
      SELECT * FROM deliveries
      ANTI JOIN deliveries_to_remove USING (id)
    );

    INSERT INTO misdelivered_presents
    SELECT
      id, child_name, delivery_location, gift_name, scheduled_at, NOW(), 'Invalid delivery location'
    FROM deliveries_to_remove
    RETURNING *
  "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"SELECT * FROM deliveries;"</span><span class="p">)</span><span class="w">
</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"SELECT * FROM deliveries
      WHERE delivery_location IN
        ('Volcano Rim', 'Drifting Igloo', 'Abandoned Lighthouse', 'The Vibes')"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"SELECT * FROM misdelivered_presents;"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Hacking" /><category term="r" /><category term="duckdb" /><summary type="html"><![CDATA[An annotated list of solutions to the Advent of SQL challenges]]></summary></entry><entry><title type="html">Using Air to reformat code with Emacs ESS</title><link href="https://francoismichonneau.net/2025/02/air-with-emacs-ess/" rel="alternate" type="text/html" title="Using Air to reformat code with Emacs ESS" /><published>2025-02-24T00:00:00+00:00</published><updated>2025-02-24T00:00:00+00:00</updated><id>https://francoismichonneau.net/2025/02/air-with-emacs-ess</id><content type="html" xml:base="https://francoismichonneau.net/2025/02/air-with-emacs-ess/"><![CDATA[<p><a href="https://posit-dev.github.io/air/">Air</a> is an R formatter and language server
written in Rust.</p>

<p>It is very <a href="https://www.tidyverse.org/blog/2025/02/air/">fast and opinionated</a>. It
integrates with VSCode, Positron, and RStudio (and soon with Zed).</p>

<p>Maybe there is a better way of integrating it with Emacs and ESS, but for the
time being, I wrote this short snippet that uses its command-line interface to
reformat the current buffer on save:</p>

<div class="language-lisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; use Air to format the content of the file</span>
<span class="p">(</span><span class="nb">defun</span> <span class="nv">run-air-on-r-save</span> <span class="p">()</span>
  <span class="s">"Run Air after saving .R files and refresh buffer."</span>
  <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">stringp</span> <span class="nv">buffer-file-name</span><span class="p">)</span>
             <span class="p">(</span><span class="nv">string-match</span> <span class="s">"\\.R$"</span> <span class="nv">buffer-file-name</span><span class="p">))</span>
    <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">current-buffer</span> <span class="p">(</span><span class="nv">current-buffer</span><span class="p">)))</span>
      <span class="p">(</span><span class="nv">shell-command</span> <span class="p">(</span><span class="nv">concat</span> <span class="s">"air format "</span> <span class="nv">buffer-file-name</span><span class="p">))</span>
      <span class="c1">;; Refresh buffer from disk</span>
      <span class="p">(</span><span class="nv">with-current-buffer</span> <span class="nv">current-buffer</span>
        <span class="p">(</span><span class="nv">revert-buffer</span> <span class="no">nil</span> <span class="no">t</span> <span class="no">t</span><span class="p">)))))</span>

<span class="p">(</span><span class="nv">add-hook</span> <span class="ss">'after-save-hook</span> <span class="ss">'run-air-on-r-save</span><span class="p">)</span>
</code></pre></div></div>

<p>From my limited testing, it works well enough for now.</p>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Hacking" /><category term="r" /><category term="ess" /><summary type="html"><![CDATA[A snippet to add to Emacs configuration file to use the new R code formatter Air]]></summary></entry><entry><title type="html">Advent of SQL with DuckDB and R</title><link href="https://francoismichonneau.net/2024/12/advent-of-sql/" rel="alternate" type="text/html" title="Advent of SQL with DuckDB and R" /><published>2024-12-01T00:00:00+00:00</published><updated>2024-12-01T00:00:00+00:00</updated><id>https://francoismichonneau.net/2024/12/advent-of-sql</id><content type="html" xml:base="https://francoismichonneau.net/2024/12/advent-of-sql/"><![CDATA[<h2 id="quick-overview">Quick Overview</h2>

<p><a href="https://adventofcode.com/">Advent of Code</a> is a popular advent calendar of programming puzzles. I have attempted it in the past using R, but I always gave up after a few days because it was taking too much of my time, and I prefer programming puzzles that work with data. Last year, I had fun going through the challenges of the <a href="https://hanukkah.bluebird.sh/">Hanukkah of Data</a>. This year, <a href="https://rud.is/">Bob Rudis</a>, via his excellent <a href="https://dailydrop.hrbrmstr.dev/">Daily Drop Newsletter</a>, pointed to <a href="https://adventofsql.com/">Advent of SQL</a>. I solved the challenges using DuckDB and/or {dplyr}. I appreciated that I could solve all the challenges relatively quickly.</p>

<p>My answers to the challenges and some annotations are in this post.</p>

<h2 id="data-import">Data import</h2>

<p>The Advent of SQL provides data for each challenge as a SQL file. They use Postgres, and while the compatibility between Postgres and DuckDB is pretty good, some features are not available in DuckDB, so the SQL dump files need to be modified before the data can be imported with DuckDB.</p>

<p>To create a DuckDB database from a SQL file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>duckdb &lt;database_file_name.duckdb&gt; &lt; &lt;sql_file.sql&gt;
</code></pre></div></div>

<p>Once the database file is created, you can work with it from R.</p>

<h2 id="day-1">Day 1</h2>

<p>Create the DuckDB database:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>duckdb advent_day_01.duckdb &lt; advent_of_sql_day_1.sql
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">duckdb</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">

</span><span class="c1">## Connect to the Database</span><span class="w">
</span><span class="n">con_day01</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_01.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="c1">## Check the content</span><span class="w">
</span><span class="n">dbListTables</span><span class="p">(</span><span class="n">con_day01</span><span class="p">)</span><span class="w">
</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">con_day01</span><span class="p">,</span><span class="w"> </span><span class="s2">"SELECT wishes FROM wish_lists limit 10;"</span><span class="p">)</span><span class="w">

</span><span class="c1">## Install and load the json extension to work with the JSON data</span><span class="w">
</span><span class="n">dbExecute</span><span class="p">(</span><span class="n">con_day01</span><span class="p">,</span><span class="w"> </span><span class="s2">"INSTALL json; LOAD json;"</span><span class="p">)</span><span class="w">

</span><span class="c1">## Create tidy version of the data</span><span class="w">
</span><span class="n">dbExecute</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day01</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  CREATE OR REPLACE VIEW tidy_wishlist AS
  SELECT
    list_id,
    child_id,
    trim(wishes.first_choice::VARCHAR, '\"') AS primary_wish,
    trim(wishes.second_choice::VARCHAR, '\"') as backup_wish,
    trim(wishes.colors[0]::VARCHAR, '\"') AS favorite_color,
    json_array_length(wishes.colors) AS color_count
  FROM wish_lists;
  "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1">## Inspect newly created VIEW</span><span class="w">
</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day01</span><span class="p">,</span><span class="w"> </span><span class="s2">"tidy_wishlist"</span><span class="p">)</span><span class="w">

</span><span class="c1">## Build answer</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="w">
  </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day01</span><span class="p">,</span><span class="w"> </span><span class="s2">"children"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">select</span><span class="p">(</span><span class="n">child_id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">),</span><span class="w">
  </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day01</span><span class="p">,</span><span class="w"> </span><span class="s2">"tidy_wishlist"</span><span class="p">),</span><span class="w">
  </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"child_id"</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="w">
    </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day01</span><span class="p">,</span><span class="w"> </span><span class="s2">"toy_catalogue"</span><span class="p">),</span><span class="w">
    </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"primary_wish"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"toy_name"</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">gift_complexity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
      </span><span class="n">difficulty_to_make</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Simple Gift"</span><span class="p">,</span><span class="w">
      </span><span class="n">difficulty_to_make</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Moderate Gift"</span><span class="p">,</span><span class="w">
      </span><span class="n">difficulty_to_make</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Complex Gift"</span><span class="p">,</span><span class="w">
      </span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">NA_character_</span><span class="w">
    </span><span class="p">),</span><span class="w">
    </span><span class="n">workshop_assignment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
      </span><span class="n">category</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"outdoor"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Outside workshop"</span><span class="p">,</span><span class="w">
      </span><span class="n">category</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"educational"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Learning workshop"</span><span class="p">,</span><span class="w">
      </span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"General workshop"</span><span class="w">
    </span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">primary_wish</span><span class="p">,</span><span class="w"> </span><span class="n">backup_wish</span><span class="p">,</span><span class="w"> </span><span class="n">favorite_color</span><span class="p">,</span><span class="w"> </span><span class="n">color_count</span><span class="p">,</span><span class="w"> </span><span class="n">gift_complexity</span><span class="p">,</span><span class="w"> </span><span class="n">workshop_assignment</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">name</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">collect</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">rowwise</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">answer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="o">::</span><span class="n">glue_collapse</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">primary_wish</span><span class="p">,</span><span class="w"> </span><span class="n">backup_wish</span><span class="p">,</span><span class="w"> </span><span class="n">favorite_color</span><span class="p">,</span><span class="w"> </span><span class="n">color_count</span><span class="p">,</span><span class="w"> </span><span class="n">gift_complexity</span><span class="p">,</span><span class="w"> </span><span class="n">workshop_assignment</span><span class="p">),</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">answer</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day01</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The JSON extension in DuckDB allowed me to extract the required data
and create a tidy view to solve the problem.</p>
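<p>As a cross-check of the JSON logic, here is a minimal Python sketch using only the standard <code class="language-plaintext highlighter-rouge">json</code> module; the sample row and its values are made up, but the field names (<code class="language-plaintext highlighter-rouge">first_choice</code>, <code class="language-plaintext highlighter-rouge">second_choice</code>, <code class="language-plaintext highlighter-rouge">colors</code>) mirror the <code class="language-plaintext highlighter-rouge">wishes</code> column extracted in the view above.</p>

```python
import json

# One hypothetical row of the `wish_lists` table: `wishes` holds a JSON
# document, mirroring the fields extracted in the DuckDB view above.
row = {
    "list_id": 1,
    "child_id": 42,
    "wishes": '{"first_choice": "train set", "second_choice": "kite", "colors": ["red", "blue"]}',
}

wishes = json.loads(row["wishes"])
tidy = {
    "list_id": row["list_id"],
    "child_id": row["child_id"],
    "primary_wish": wishes["first_choice"],
    "backup_wish": wishes["second_choice"],
    "favorite_color": wishes["colors"][0],  # JSON arrays index from 0, as in DuckDB's JSON path syntax
    "color_count": len(wishes["colors"]),
}
print(tidy)
```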

<h2 id="day-2">Day 2</h2>

<p>The SQL dump for this challenge used the <code class="language-plaintext highlighter-rouge">SERIAL</code> data type, which is
<a href="https://github.com/duckdb/duckdb/issues/1768">not supported</a> by DuckDB.
<code class="language-plaintext highlighter-rouge">SERIAL</code> is a convenience for creating unique, auto-incrementing ids. The
workaround in DuckDB is to create a sequence and use it in the table
definition. I edited the SQL dump so the beginning of the file now looks
like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">letters_a</span> <span class="k">CASCADE</span><span class="p">;</span>
<span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">letters_b</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">laid</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">letters_a</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'laid'</span><span class="p">),</span>
  <span class="n">value</span> <span class="nb">INTEGER</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">lbid</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">letters_b</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'lbid'</span><span class="p">),</span>
  <span class="n">value</span> <span class="nb">INTEGER</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day02</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_02.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="c1">## Create a single table combining `letters_a` and `letters_b` with `UNION`</span><span class="w">
</span><span class="c1">## Use function `chr()` to convert ASCII codes into letters</span><span class="w">
</span><span class="n">dbExecute</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day02</span><span class="p">,</span><span class="w">
  </span><span class="s2">"CREATE OR REPLACE VIEW letters_decoded AS
   SELECT  *, chr(value) AS character FROM letters_a
   UNION
   SELECT  *, chr(value) AS character FROM letters_b
  "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1">## Define list of valid characters</span><span class="w">
</span><span class="n">valid_characters</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"[A-Za-z !\"\'(),-.:;?]"</span><span class="w">

</span><span class="c1">## Filter data to only keep valid_characters</span><span class="w">
</span><span class="c1">## and collapse results to extract message</span><span class="w">
</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day02</span><span class="p">,</span><span class="w"> </span><span class="s2">"letters_decoded"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">grepl</span><span class="p">(</span><span class="n">valid_characters</span><span class="p">,</span><span class="w"> </span><span class="n">character</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">character</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">paste</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="err">_</span><span class="p">,</span><span class="w"> </span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day02</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
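<p>The decoding step — convert ASCII codes to characters, keep only valid ones, collapse in id order — can be sketched in plain Python; the <code class="language-plaintext highlighter-rouge">(id, value)</code> rows here are made up for illustration.</p>

```python
import re

# Hypothetical (id, ascii_code) rows from `letters_a` UNION `letters_b`
rows = [(3, 33), (1, 72), (2, 105), (4, 7)]  # chr(7) is a control character, filtered out

VALID = re.compile(r"[A-Za-z !\"'(),\-.:;?]")

message = "".join(
    chr(code)                        # equivalent of DuckDB's chr()
    for _, code in sorted(rows)      # ORDER BY id
    if VALID.fullmatch(chr(code))    # keep only valid characters
)
print(message)  # → Hi!
```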

<h2 id="day-3">Day 3</h2>

<p>Again, the SQL dump used <code class="language-plaintext highlighter-rouge">SERIAL</code>. Additionally, DuckDB does not support
the <code class="language-plaintext highlighter-rouge">XML</code> data type, so I switched to <code class="language-plaintext highlighter-rouge">VARCHAR</code> and used R to work
with the XML data. I edited the beginning of the dump file to look like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">christmas_menus</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">cmid</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">christmas_menus</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'cmid'</span><span class="p">),</span>
  <span class="n">menu_data</span> <span class="nb">VARCHAR</span>
<span class="p">);</span>
</code></pre></div></div>

<p>I really didn’t use DuckDB’s engine for this challenge. I only relied on
R to work with the XML data:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day03</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_03.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">menus</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day03</span><span class="p">,</span><span class="w"> </span><span class="s2">"christmas_menus"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">collect</span><span class="p">()</span><span class="w">


</span><span class="c1">## Figure out how many XML schemas are being used in the data</span><span class="w">
</span><span class="n">get_menu_version</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">menu_data</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">xml2</span><span class="o">::</span><span class="n">read_xml</span><span class="p">(</span><span class="n">menu_data</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">xml2</span><span class="o">::</span><span class="n">xml_find_all</span><span class="p">(</span><span class="s2">".//@version"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">xml2</span><span class="o">::</span><span class="n">xml_text</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">menu_versions</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_chr</span><span class="p">(</span><span class="n">menus</span><span class="o">$</span><span class="n">menu_data</span><span class="p">,</span><span class="w"> </span><span class="n">get_menu_version</span><span class="p">)</span><span class="w">

</span><span class="c1">## There are 3 different versions</span><span class="w">
</span><span class="n">unique</span><span class="p">(</span><span class="n">menu_versions</span><span class="p">)</span><span class="w">

</span><span class="c1">## Extract the number of guests based on the XML schema</span><span class="w">
</span><span class="n">get_guest_number</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">menu_data</span><span class="p">,</span><span class="w"> </span><span class="n">xml_version</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">element</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">switch</span><span class="p">(</span><span class="w">
    </span><span class="n">xml_version</span><span class="p">,</span><span class="w">
    </span><span class="s2">"3.0"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".//headcount/total_present"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"2.0"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".//total_guests"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"1.0"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".//total_count"</span><span class="w">
  </span><span class="p">)</span><span class="w">

  </span><span class="n">xml2</span><span class="o">::</span><span class="n">read_xml</span><span class="p">(</span><span class="n">menu_data</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">xml2</span><span class="o">::</span><span class="n">xml_find_all</span><span class="p">(</span><span class="n">element</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">xml2</span><span class="o">::</span><span class="n">xml_text</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">n_guests</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map2_dbl</span><span class="p">(</span><span class="n">menus</span><span class="o">$</span><span class="n">menu_data</span><span class="p">,</span><span class="w"> </span><span class="n">menu_versions</span><span class="p">,</span><span class="w"> </span><span class="n">get_guest_number</span><span class="p">)</span><span class="w">

</span><span class="c1">## Extract the food ids (only for events with the right number of guests)</span><span class="w">
</span><span class="n">food_ids</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map2</span><span class="p">(</span><span class="n">menus</span><span class="o">$</span><span class="n">menu_data</span><span class="p">,</span><span class="w"> </span><span class="n">n_guests</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">.x</span><span class="p">,</span><span class="w"> </span><span class="n">.g</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">.g</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">78</span><span class="p">)</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="kc">NULL</span><span class="p">)</span><span class="w">
  </span><span class="n">xml2</span><span class="o">::</span><span class="n">read_xml</span><span class="p">(</span><span class="n">.x</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">xml2</span><span class="o">::</span><span class="n">xml_find_all</span><span class="p">(</span><span class="s2">".//food_item_id"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">xml2</span><span class="o">::</span><span class="n">xml_text</span><span class="p">()</span><span class="w">
</span><span class="p">})</span><span class="w">

</span><span class="c1">## And count them to find the most common one</span><span class="w">
</span><span class="n">unlist</span><span class="p">(</span><span class="n">food_ids</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day03</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
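<p>The same per-schema extraction can be sketched with Python's standard <code class="language-plaintext highlighter-rouge">xml.etree</code> module. The two sample menus below are made up, but the version-to-element mapping follows the <code class="language-plaintext highlighter-rouge">switch()</code> in the R code above.</p>

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Element holding the guest count, keyed by schema version (as in the R switch())
GUEST_ELEMENT = {
    "1.0": ".//total_count",
    "2.0": ".//total_guests",
    "3.0": ".//headcount/total_present",
}

# Two hypothetical menu documents for illustration
menus = [
    '<menu version="1.0"><total_count>80</total_count><food_item_id>7</food_item_id></menu>',
    '<menu version="2.0"><total_guests>10</total_guests><food_item_id>9</food_item_id></menu>',
]

food_ids = Counter()
for doc in menus:
    root = ET.fromstring(doc)
    version = root.get("version")
    guests = int(root.find(GUEST_ELEMENT[version]).text)
    if guests >= 78:  # only keep events with enough guests
        food_ids.update(e.text for e in root.iter("food_item_id"))

print(food_ids.most_common(1))
```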

<h2 id="day-4">Day 4</h2>

<p>Again, the original dump used <code class="language-plaintext highlighter-rouge">SERIAL</code>; for this challenge, I simply
replaced it with <code class="language-plaintext highlighter-rouge">INTEGER</code>, so the beginning of the file looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">toy_production</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">toy_production</span> <span class="p">(</span>
  <span class="n">toy_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">toy_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
  <span class="n">previous_tags</span> <span class="nb">TEXT</span><span class="p">[],</span>
  <span class="n">new_tags</span> <span class="nb">TEXT</span><span class="p">[]</span>
  <span class="p">);</span>
</code></pre></div></div>

<p>This challenge required diving into DuckDB’s functions to work with
lists. While there is a <code class="language-plaintext highlighter-rouge">list_intersect()</code> function, there does not seem
to be a <code class="language-plaintext highlighter-rouge">list_setdiff()</code> so instead I combined <code class="language-plaintext highlighter-rouge">list_where()</code> with
<code class="language-plaintext highlighter-rouge">list_transform()</code> and <code class="language-plaintext highlighter-rouge">list_contains()</code> to get there. I would be happy
to hear alternative approaches!</p>
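<p>The underlying list logic — order-preserving set difference and intersection — can be sketched in a few lines of Python; the tag values are made up, and each comprehension is the moral equivalent of the corresponding DuckDB list function.</p>

```python
# Order-preserving set operations on tag lists, mirroring the DuckDB query
previous_tags = ["wood", "paint", "wheels"]
new_tags = ["wood", "wheels", "battery", "lights"]

added_tags = [t for t in new_tags if t not in previous_tags]      # like list_where(new_tags, ...)
unchanged_tags = [t for t in previous_tags if t in new_tags]      # like list_intersect()
removed_tags = [t for t in previous_tags if t not in new_tags]    # like list_where(previous_tags, ...)

print(added_tags, unchanged_tags, removed_tags)
# → ['battery', 'lights'] ['wood', 'wheels'] ['paint']
```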

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day04</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_04.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day04</span><span class="p">,</span><span class="w">
  </span><span class="s2">"SELECT
     toy_id,
     list_where(new_tags, list_transform(new_tags, x -&gt; NOT list_contains(previous_tags, x))) AS added_tags,
     list_intersect(previous_tags, new_tags) AS unchanged_tags,
     list_where(previous_tags, list_transform(previous_tags, x -&gt; NOT list_contains(new_tags, x))) AS removed_tags,
     len(added_tags) AS added_tags_length,
     len(unchanged_tags) AS unchanged_tags_length,
     len(removed_tags) AS removed_tags_length
   FROM toy_production;"</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">slice_max</span><span class="p">(</span><span class="n">added_tags_length</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">toy_id</span><span class="p">,</span><span class="w"> </span><span class="n">ends_with</span><span class="p">(</span><span class="s2">"length"</span><span class="p">))</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day04</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-5">Day 5</h2>

<p>The data could be imported directly from the SQL dump.</p>

<p>Solving this challenge required using the <code class="language-plaintext highlighter-rouge">lead()</code> function from the
tidyverse to calculate the change in production and its percentage. I
then used <code class="language-plaintext highlighter-rouge">slice_max()</code> to extract the row with the largest percentage
change.</p>
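<p>The window logic can be sketched in Python on a small made-up series. Note the direction assumption: pairing each row with the <em>next</em> one (<code class="language-plaintext highlighter-rouge">lead()</code>) only gives the previous day's production if the rows are sorted most recent first.</p>

```python
# Hypothetical daily production figures, assumed sorted most recent first,
# so lead() (the next row) is the previous calendar day
toys_produced = [150, 100, 90]

best = max(
    (
        {
            "toys_produced": today,
            "previous_day_production": prev,
            "production_change": today - prev,
            "production_change_percentage": (today - prev) / today * 100,
        }
        for today, prev in zip(toys_produced, toys_produced[1:])  # lead(): pair each row with the next
    ),
    key=lambda row: row["production_change_percentage"],  # slice_max() equivalent
)
print(best)
```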

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day05</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_05.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day05</span><span class="p">,</span><span class="w"> </span><span class="s2">"toy_production"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">previous_day_production</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">toys_produced</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">production_change</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">toys_produced</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">previous_day_production</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">production_change_percentage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">production_change</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">toys_produced</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">slice_max</span><span class="p">(</span><span class="n">production_change_percentage</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day05</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-6">Day 6</h2>

<p>The original dump used <code class="language-plaintext highlighter-rouge">SERIAL</code>, so I updated the beginning of the file
to look like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">children</span> <span class="k">CASCADE</span><span class="p">;</span>
<span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">gifts</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">children</span> <span class="p">(</span>
    <span class="n">child_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
    <span class="n">age</span> <span class="nb">INTEGER</span><span class="p">,</span>
    <span class="n">city</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">gifts</span> <span class="p">(</span>
    <span class="n">gift_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
    <span class="n">price</span> <span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span>
    <span class="n">child_id</span> <span class="nb">INTEGER</span> <span class="k">REFERENCES</span> <span class="n">children</span><span class="p">(</span><span class="n">child_id</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day06</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_06.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="c1">## First calculate the average price</span><span class="w">
</span><span class="n">avg_price</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day06</span><span class="p">,</span><span class="w"> </span><span class="s2">"gifts"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">mean_price</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">price</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">mean_price</span><span class="p">)</span><span class="w">

</span><span class="c1">## Join the gifts and children table and filter out results based on average</span><span class="w">
</span><span class="c1">## price. Finally arrange by price.</span><span class="w">
</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day06</span><span class="p">,</span><span class="w"> </span><span class="s2">"children"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day06</span><span class="p">,</span><span class="w"> </span><span class="s2">"gifts"</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">join_by</span><span class="p">(</span><span class="n">child_id</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">price</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">avg_price</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">price</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day06</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
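<p>For reference, the same logic can be written as a single DuckDB query, computing the average in a scalar subquery instead of pulling it into R first. This is a sketch (the exact columns the challenge expects may differ):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT c.name, g.name AS gift_name, g.price
FROM children AS c
JOIN gifts AS g USING (child_id)
WHERE g.price &gt;= (SELECT avg(price) FROM gifts)
ORDER BY g.price;
</code></pre></div></div>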

<h2 id="day-7">Day 7</h2>

<p>The original dump used <code class="language-plaintext highlighter-rouge">SERIAL</code> again, but since all the <code class="language-plaintext highlighter-rouge">elf_id</code> values were provided, I updated the beginning of the file to use <code class="language-plaintext highlighter-rouge">INTEGER</code> instead. It now looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">workshop_elves</span> <span class="k">CASCADE</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">workshop_elves</span> <span class="p">(</span>
    <span class="n">elf_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">elf_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">primary_skill</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">years_experience</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day07</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_07.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day07</span><span class="p">,</span><span class="w"> </span><span class="s2">"workshop_elves"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="w">
    </span><span class="n">years_experience</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">years_experience</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">years_experience</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">years_experience</span><span class="p">),</span><span class="w">
    </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">primary_skill</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="w">
    </span><span class="n">elf_id</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">elf_id</span><span class="p">),</span><span class="w">
    </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">primary_skill</span><span class="p">,</span><span class="w"> </span><span class="n">years_experience</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="w">
    </span><span class="n">primary_skill</span><span class="p">,</span><span class="w">
    </span><span class="n">desc</span><span class="p">(</span><span class="n">years_experience</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">collect</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="w">
    </span><span class="n">result</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">elf_id</span><span class="p">,</span><span class="w"> </span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">),</span><span class="w">
    </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">primary_skill</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="w">
    </span><span class="n">result</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">result</span><span class="p">,</span><span class="w"> </span><span class="n">primary_skill</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">),</span><span class="w">
    </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">primary_skill</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day07</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
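<p>The two grouped <code class="language-plaintext highlighter-rouge">filter()</code> calls map naturally onto window functions. A DuckDB-only sketch, using <code class="language-plaintext highlighter-rouge">row_number()</code> with the same tie-breaking on the lowest <code class="language-plaintext highlighter-rouge">elf_id</code> (untested against the challenge checker):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH ranked AS (
  SELECT primary_skill, elf_id,
    row_number() OVER (PARTITION BY primary_skill
                       ORDER BY years_experience DESC, elf_id) AS rn_max,
    row_number() OVER (PARTITION BY primary_skill
                       ORDER BY years_experience ASC, elf_id) AS rn_min
  FROM workshop_elves
)
SELECT primary_skill,
       max(elf_id) FILTER (WHERE rn_max = 1) AS most_experienced,
       max(elf_id) FILTER (WHERE rn_min = 1) AS least_experienced
FROM ranked
GROUP BY primary_skill
ORDER BY primary_skill;
</code></pre></div></div>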

<h2 id="day-8">Day 8</h2>

<p>Again, the original data dump used <code class="language-plaintext highlighter-rouge">SERIAL</code>, which I replaced with <code class="language-plaintext highlighter-rouge">INTEGER</code> so the data could be imported into DuckDB. The beginning of the file looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">staff</span> <span class="k">CASCADE</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">staff</span> <span class="p">(</span>
    <span class="n">staff_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">staff_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">manager_id</span> <span class="nb">INTEGER</span>
<span class="p">);</span>
</code></pre></div></div>

<p>I was in a rush that day, and the solution I came up with is quite hacky and
slow: all the computation takes place in R using a recursive function. This
challenge is a good opportunity to learn recursive CTEs, but I’ll need to come
back to it. (See <a href="#day-18">Day 18</a> for the recursive CTE approach.)</p>
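<p>In the meantime, here is a minimal sketch of what such a recursive CTE could look like in DuckDB, assuming the usual top-down walk starting from the row whose <code class="language-plaintext highlighter-rouge">manager_id</code> is <code class="language-plaintext highlighter-rouge">NULL</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH RECURSIVE hierarchy AS (
  -- anchor: the top of the org chart
  SELECT staff_id, staff_name, 1 AS level
  FROM staff
  WHERE manager_id IS NULL
  UNION ALL
  -- recursive step: attach direct reports, one level deeper
  SELECT s.staff_id, s.staff_name, h.level + 1
  FROM staff AS s
  JOIN hierarchy AS h ON s.manager_id = h.staff_id
)
SELECT staff_id, staff_name, level
FROM hierarchy
ORDER BY level DESC;
</code></pre></div></div>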

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day08</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_08.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="c1">## make sure there is a single NA</span><span class="w">
</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day08</span><span class="p">,</span><span class="w"> </span><span class="s2">"staff"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">manager_id</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="nf">is.na</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="nf">sum</span><span class="p">()</span><span class="w">

</span><span class="n">find_boss</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">.data</span><span class="p">,</span><span class="w"> </span><span class="n">idx</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">idx</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
  </span><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">.data</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">filter</span><span class="p">(</span><span class="n">staff_id</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">idx</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
    </span><span class="n">pull</span><span class="p">(</span><span class="n">manager_id</span><span class="p">)</span><span class="w">

  </span><span class="nf">c</span><span class="p">(</span><span class="n">find_boss</span><span class="p">(</span><span class="n">.data</span><span class="p">,</span><span class="w"> </span><span class="n">res</span><span class="p">),</span><span class="w"> </span><span class="n">idx</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">staff</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day08</span><span class="p">,</span><span class="w"> </span><span class="s2">"staff"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">collect</span><span class="p">()</span><span class="w">

</span><span class="n">staff</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">rowwise</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w"> 
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">find_boss</span><span class="p">(</span><span class="n">staff</span><span class="p">,</span><span class="w"> </span><span class="n">.data</span><span class="o">$</span><span class="n">manager_id</span><span class="p">)))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">path</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">slice_max</span><span class="p">(</span><span class="n">level</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day08</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-9">Day 9</h2>

<p>To replace <code class="language-plaintext highlighter-rouge">SERIAL</code>, I used <code class="language-plaintext highlighter-rouge">SEQUENCE</code> for both the <code class="language-plaintext highlighter-rouge">reindeer_id</code> and the
<code class="language-plaintext highlighter-rouge">session_id</code> so the beginning of the dump file looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">training_sessions</span> <span class="k">CASCADE</span><span class="p">;</span>
<span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">reindeers</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">r_id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">reindeers</span> <span class="p">(</span>
    <span class="n">reindeer_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'r_id'</span><span class="p">),</span>
    <span class="n">reindeer_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">years_of_service</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">speciality</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">s_id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">training_sessions</span> <span class="p">(</span>
    <span class="n">session_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'s_id'</span><span class="p">),</span>
    <span class="n">reindeer_id</span> <span class="nb">INTEGER</span><span class="p">,</span>
    <span class="n">exercise_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">speed_record</span> <span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">session_date</span> <span class="nb">DATE</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">weather_conditions</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span>
    <span class="k">FOREIGN</span> <span class="k">KEY</span> <span class="p">(</span><span class="n">reindeer_id</span><span class="p">)</span> <span class="k">REFERENCES</span> <span class="n">reindeers</span><span class="p">(</span><span class="n">reindeer_id</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day09</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_09.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day09</span><span class="p">,</span><span class="w"> </span><span class="s2">"training_sessions"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="w">
    </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day09</span><span class="p">,</span><span class="w"> </span><span class="s2">"reindeers"</span><span class="p">),</span><span class="w">
    </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">join_by</span><span class="p">(</span><span class="n">reindeer_id</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">reindeer_name</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"Rudolf"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="w">
    </span><span class="n">avg_speed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">speed_record</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w">
    </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">reindeer_name</span><span class="p">,</span><span class="w"> </span><span class="n">exercise_name</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">slice_max</span><span class="p">(</span><span class="n">avg_speed</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reindeer_name</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">slice_max</span><span class="p">(</span><span class="n">avg_speed</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">collect</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">glue</span><span class="o">::</span><span class="n">glue_data</span><span class="p">(</span><span class="s2">"{reindeer_name},{round(avg_speed, 2)}"</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day09</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
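<p>A DuckDB-only sketch of the same query, using <code class="language-plaintext highlighter-rouge">QUALIFY</code> to keep each reindeer’s best exercise before taking the top three (untested against the challenge checker):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH best_exercise AS (
  SELECT reindeer_name, avg(speed_record) AS avg_speed
  FROM training_sessions
  JOIN reindeers USING (reindeer_id)
  WHERE reindeer_name != 'Rudolf'
  GROUP BY reindeer_name, exercise_name
  QUALIFY row_number() OVER (PARTITION BY reindeer_name
                             ORDER BY avg_speed DESC) = 1
)
SELECT reindeer_name, round(avg_speed, 2) AS top_speed
FROM best_exercise
ORDER BY avg_speed DESC
LIMIT 3;
</code></pre></div></div>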

<h2 id="day-10">Day 10</h2>

<p>I again replaced <code class="language-plaintext highlighter-rouge">SERIAL</code> with a <code class="language-plaintext highlighter-rouge">SEQUENCE</code> in the data dump. The
beginning of the file looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">Drinks</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">d_id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">Drinks</span> <span class="p">(</span>
    <span class="n">drink_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'d_id'</span><span class="p">),</span>
    <span class="n">drink_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="nb">date</span> <span class="nb">DATE</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">quantity</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span>
    <span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day10</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_10.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day10</span><span class="p">,</span><span class="w"> </span><span class="s2">"Drinks"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="w">
    </span><span class="n">quantity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">quantity</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">drink_name</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">drink_name</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">quantity</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="w">
    </span><span class="n">`Hot Cocoa`</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">38</span><span class="p">,</span><span class="w">
    </span><span class="n">`Peppermint Schnapps`</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">298</span><span class="p">,</span><span class="w">
    </span><span class="n">`Eggnog`</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">198</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day10</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The magic here is that everything runs inside DuckDB, even the call to
<code class="language-plaintext highlighter-rouge">pivot_wider()</code>: dbplyr translates it into SQL instead of pulling the data into R.</p>
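<p>DuckDB also has a native <code class="language-plaintext highlighter-rouge">PIVOT</code> statement, so the reshaping could be done in SQL alone. A sketch of what that could look like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT date
FROM (PIVOT Drinks ON drink_name USING sum(quantity) GROUP BY date)
WHERE "Hot Cocoa" = 38
  AND "Peppermint Schnapps" = 298
  AND Eggnog = 198;
</code></pre></div></div>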

<h2 id="day-11">Day 11</h2>

<p>The data could be imported directly into DuckDB.</p>

<p>I first wrote the solution to this challenge using the <code class="language-plaintext highlighter-rouge">{slider}</code> package to get
the moving average. But the data has to be pulled in R’s memory to make this
work. I then tried to solve it using just DuckDB to practice window functions,
but the query returns a second result. I have not investigated why this is the
case yet.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day11</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_11.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="c1">## R solution</span><span class="w">
</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day11</span><span class="p">,</span><span class="w"> </span><span class="s2">"TreeHarvests"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">collect</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">avg_yield</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">slider</span><span class="o">::</span><span class="n">slide_dbl</span><span class="p">(</span><span class="n">trees_harvested</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">.before</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">.complete</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w">
    </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">field_name</span><span class="p">,</span><span class="w"> </span><span class="n">harvest_year</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">slice_max</span><span class="p">(</span><span class="n">avg_yield</span><span class="p">)</span><span class="w">

</span><span class="c1">## DuckDB SQL solution</span><span class="w">
</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">con_day11</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  -- single-quoted names would be string literals, so identifiers are left unquoted
  WITH ordered AS (
  SELECT field_name, harvest_year, season,
     CASE WHEN season = 'Spring' THEN 1 WHEN season = 'Summer' THEN 2
          WHEN season = 'Fall' THEN 3 WHEN season = 'Winter' THEN 4 END AS season_order,
     trees_harvested
  FROM TreeHarvests
  ),
  results AS (
  SELECT *,
     avg(trees_harvested) OVER
       (PARTITION BY field_name, harvest_year ORDER BY season_order
        ROWS 2 PRECEDING) AS avg_yield
  FROM ordered
  )
  SELECT * FROM results WHERE avg_yield = (SELECT max(avg_yield) FROM results)
  "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day11</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
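<p>The quoting pitfall is worth isolating: in SQL, <code class="language-plaintext highlighter-rouge">PARTITION BY 'field_name'</code> partitions by a constant string, so every row lands in one big window. A minimal sketch of the difference, using Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code> module rather than DuckDB so it is self-contained (both engines treat single quotes the same way):</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (grp TEXT, val REAL);
    INSERT INTO t VALUES ('a', 1), ('a', 3), ('b', 10), ('b', 30);
""")

# PARTITION BY 'grp' (quoted) partitions by a constant string, so every
# row falls into a single window and the average covers the whole table.
literal = sorted(
    r[0] for r in con.execute(
        "SELECT DISTINCT avg(val) OVER (PARTITION BY 'grp') FROM t"
    )
)

# PARTITION BY grp (unquoted) partitions by the column, as intended.
column = sorted(
    r[0] for r in con.execute(
        "SELECT DISTINCT avg(val) OVER (PARTITION BY grp) FROM t"
    )
)

print(literal)  # [11.0]
print(column)   # [2.0, 20.0]
```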

<h2 id="day-12">Day 12</h2>

<p>The data dump again needed <code class="language-plaintext highlighter-rouge">SEQUENCE</code> in place of <code class="language-plaintext highlighter-rouge">SERIAL</code>, so I edited the beginning of the file to look like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">g_id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">r_id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">TABLE</span> <span class="n">gifts</span> <span class="p">(</span>
    <span class="n">gift_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'g_id'</span><span class="p">),</span>
    <span class="n">gift_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">price</span> <span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">TABLE</span> <span class="n">gift_requests</span> <span class="p">(</span>
    <span class="n">request_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'r_id'</span><span class="p">),</span>
    <span class="n">gift_id</span> <span class="nb">INT</span><span class="p">,</span>
    <span class="n">request_date</span> <span class="nb">DATE</span><span class="p">,</span>
    <span class="k">FOREIGN</span> <span class="k">KEY</span> <span class="p">(</span><span class="n">gift_id</span><span class="p">)</span> <span class="k">REFERENCES</span> <span class="n">gifts</span><span class="p">(</span><span class="n">gift_id</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>This challenge could be solved using only {dplyr} functions. Ten gifts
tied for first place, so I used the <code class="language-plaintext highlighter-rouge">n</code> argument of the <code class="language-plaintext highlighter-rouge">print</code>
function to display enough rows to see the second most popular item.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day12</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_12.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day12</span><span class="p">,</span><span class="w"> </span><span class="s2">"gift_requests"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">gift_id</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">overall_rank</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">percent_rank</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day12</span><span class="p">,</span><span class="w"> </span><span class="s2">"gifts"</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">join_by</span><span class="p">(</span><span class="n">gift_id</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">overall_rank</span><span class="p">),</span><span class="w"> </span><span class="n">gift_name</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day12</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
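<p>With ties at the top, a ranking function can surface the second most popular item directly: <code class="language-plaintext highlighter-rouge">dense_rank()</code> gives tied counts the same rank with no gaps. A sketch with stand-in data, illustrated with Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code> (DuckDB supports the same window function):</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE gift_counts (gift_name TEXT, n INTEGER);
    INSERT INTO gift_counts VALUES
      ('bike', 5), ('doll', 5), ('train', 4), ('ball', 2);
""")

# dense_rank() assigns tied counts the same rank with no gaps, so the
# second most popular item is simply rank 2 -- even with ties at the top.
second = con.execute("""
    WITH ranked AS (
      SELECT gift_name, n, dense_rank() OVER (ORDER BY n DESC) AS rnk
      FROM gift_counts
    )
    SELECT gift_name FROM ranked WHERE rnk = 2
""").fetchall()

print(second)  # [('train',)]
```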

<h2 id="day-13">Day 13</h2>

<p>To import the data into DuckDB, I once again edited the beginning of
the dump file, replacing <code class="language-plaintext highlighter-rouge">SERIAL</code> with a <code class="language-plaintext highlighter-rouge">SEQUENCE</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">cl_id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>

<span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">contact_list</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">contact_list</span> <span class="p">(</span>
    <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'cl_id'</span><span class="p">),</span>
    <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">email_addresses</span> <span class="nb">TEXT</span><span class="p">[]</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day13</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_13.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day13</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  WITH all_domains AS
  (SELECT
   id, name, unnest(email_addresses) AS addresses,
   regexp_extract(addresses, '@(.+)$', 1) AS domains
  FROM contact_list
  )
  SELECT domains, COUNT(domains) AS n_users FROM all_domains
  GROUP BY domains
  ORDER BY n_users DESC
  "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day13</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
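<p>The <code class="language-plaintext highlighter-rouge">'@(.+)$'</code> pattern can be checked quickly outside the database too; a small sketch using Python’s <code class="language-plaintext highlighter-rouge">re</code> module as a stand-in for <code class="language-plaintext highlighter-rouge">regexp_extract</code>, with made-up addresses:</p>

```python
import re
from collections import Counter

emails = [
    "santa@northpole.com",
    "elf1@northpole.com",
    "rudolph@sleigh.org",
]

# Same pattern as regexp_extract(addresses, '@(.+)$', 1): capture
# everything after the '@' through the end of the string.
domains = [re.search(r"@(.+)$", e).group(1) for e in emails]
counts = Counter(domains).most_common()

print(counts)  # [('northpole.com', 2), ('sleigh.org', 1)]
```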

<h2 id="day-14">Day 14</h2>

<p>I replaced <code class="language-plaintext highlighter-rouge">SERIAL</code> with <code class="language-plaintext highlighter-rouge">SEQUENCE</code> once again:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">SantaRecords</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">rid</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">SantaRecords</span> <span class="p">(</span>
    <span class="n">record_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'rid'</span><span class="p">),</span>
    <span class="n">record_date</span> <span class="nb">DATE</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">cleaning_receipts</span> <span class="n">JSON</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day14</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_14.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbExecute</span><span class="p">(</span><span class="n">con_day14</span><span class="p">,</span><span class="w"> </span><span class="s2">"LOAD json;"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day14</span><span class="p">,</span><span class="w">
  </span><span class="s2">"WITH extracted AS (
   SELECT
     record_date,
     cleaning_receipts-&gt;&gt;'$..garment' AS garment,
     cleaning_receipts-&gt;&gt;'$..color' AS color,
     cleaning_receipts-&gt;&gt;'$..drop_off' AS drop_off,
     cleaning_receipts-&gt;&gt;'$..receipt_id' AS receipt_id
   FROM SantaRecords
  ),
  tidy AS (
    SELECT
      record_date,
      unnest(garment) AS tidy_garment,
      unnest(color) AS tidy_color,
      unnest(drop_off) AS dropoff,
      unnest(receipt_id) AS tidy_receipt_id
    FROM extracted
  )
  SELECT * FROM tidy
  WHERE tidy_garment = 'suit' AND tidy_color = 'green'
  ORDER BY dropoff DESC;"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day14</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
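<p>The query pulls parallel arrays out of the JSON and then unnests them. The same flattening logic, sketched in plain Python with the <code class="language-plaintext highlighter-rouge">json</code> module and a made-up record shaped like <code class="language-plaintext highlighter-rouge">cleaning_receipts</code>:</p>

```python
import json

# A made-up record shaped like the challenge's cleaning_receipts column:
# one JSON array of receipt objects per record_date.
record = {
    "record_date": "2024-12-01",
    "cleaning_receipts": json.dumps([
        {"receipt_id": 1, "garment": "suit", "color": "green",
         "drop_off": "2024-12-02"},
        {"receipt_id": 2, "garment": "hat", "color": "red",
         "drop_off": "2024-12-03"},
    ]),
}

# Parse the JSON and keep only green suits, mirroring the WHERE clause.
receipts = json.loads(record["cleaning_receipts"])
green_suits = [r for r in receipts
               if r["garment"] == "suit" and r["color"] == "green"]

print([r["receipt_id"] for r in green_suits])  # [1]
```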

<h2 id="day-15">Day 15</h2>

<p>This was the first challenge of the series that dealt with spatial data. The
data required a little more preparation. I updated the dump file to:</p>

<ul>
  <li>use <code class="language-plaintext highlighter-rouge">SEQUENCE</code> instead of <code class="language-plaintext highlighter-rouge">SERIAL</code></li>
  <li>replace <code class="language-plaintext highlighter-rouge">GEOGRAPHY(POINT)</code> and <code class="language-plaintext highlighter-rouge">GEOGRAPHY(POLYGON)</code> with <code class="language-plaintext highlighter-rouge">GEOMETRY</code></li>
  <li>for each spatial feature, I removed <code class="language-plaintext highlighter-rouge">ST_setSRID(..., 4326)</code> given that it’s
the default in DuckDB.</li>
</ul>

<p>The beginning of the file looked like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">sid</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">sleigh_locations</span> <span class="p">(</span>
<span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'sid'</span><span class="p">),</span>
<span class="nb">timestamp</span> <span class="nb">TIMESTAMP</span> <span class="k">WITH</span> <span class="nb">TIME</span> <span class="k">ZONE</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="n">coordinate</span> <span class="n">GEOMETRY</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>


<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">aid</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">areas</span> <span class="p">(</span>
    <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'aid'</span><span class="p">),</span>
    <span class="n">place_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">255</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">polygon</span> <span class="n">GEOMETRY</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<p>and the sleigh location table data looked like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">sleigh_locations</span> <span class="p">(</span><span class="nb">timestamp</span><span class="p">,</span> <span class="n">coordinate</span><span class="p">)</span> <span class="k">VALUES</span>
<span class="p">(</span><span class="s1">'2024-12-24 22:00:00+00'</span><span class="p">,</span> <span class="n">ST_Point</span><span class="p">(</span><span class="mi">37</span><span class="p">.</span><span class="mi">717634</span><span class="p">,</span> <span class="mi">55</span><span class="p">.</span><span class="mi">805825</span><span class="p">));</span>
</code></pre></div></div>

<p>I edited the <code class="language-plaintext highlighter-rouge">areas</code> table the same way; for instance, the first area looked
like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="s1">'New_York'</span><span class="p">,</span> <span class="n">ST_GeomFromText</span><span class="p">(</span><span class="s1">'POLYGON((-74.25909 40.477399, -73.700272 40.477399, -73.700272 40.917577, -74.25909 40.917577, -74.25909 40.477399))'</span><span class="p">)),</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day15</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_15.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbExecute</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day15</span><span class="p">,</span><span class="w"> </span><span class="s2">"INSTALL spatial; LOAD spatial;"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day15</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
 SELECT areas.place_name
 FROM areas
 JOIN sleigh_locations on ST_Within(sleigh_locations.coordinate, areas.polygon)
  "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day15</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
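<p>The example polygon shown above is an axis-aligned bounding box; for rectangles like that, <code class="language-plaintext highlighter-rouge">ST_Within(point, polygon)</code> reduces to a simple range check. A hypothetical pure-Python sketch, using the <code class="language-plaintext highlighter-rouge">New_York</code> bounds and the first sleigh coordinate from the dump:</p>

```python
# The 'New_York' polygon in the dump is an axis-aligned bounding box,
# so point-in-polygon reduces to a range check on both axes.
NY_BBOX = (-74.25909, 40.477399, -73.700272, 40.917577)  # xmin, ymin, xmax, ymax

def within_bbox(x, y, bbox):
    xmin, ymin, xmax, ymax = bbox
    return xmin <= x <= xmax and ymin <= y <= ymax

# A lower-Manhattan point is inside; the first sleigh coordinate
# from the dump (near Moscow) is not.
print(within_bbox(-74.0060, 40.7128, NY_BBOX))    # True
print(within_bbox(37.717634, 55.805825, NY_BBOX)) # False
```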

<h2 id="day-16">Day 16</h2>

<p>Day 16 required the same data preparation as for day 15.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day16</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_16.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbExecute</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day16</span><span class="p">,</span><span class="w"> </span><span class="s2">"INSTALL spatial; LOAD spatial;"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day16</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  SELECT sleigh_locations.timestamp, areas.place_name
  FROM sleigh_locations
  JOIN areas on ST_Within(sleigh_locations.coordinate, areas.polygon)
  "</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">time_spent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">timestamp</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">timestamp</span><span class="p">),</span><span class="w"> </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"place_name"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">slice_max</span><span class="p">(</span><span class="n">time_spent</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day16</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
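<p>The {dplyr} step computes time spent as <code class="language-plaintext highlighter-rouge">max(timestamp) - min(timestamp)</code> per place. The same grouping logic, sketched in plain Python with made-up sightings:</p>

```python
from datetime import datetime

# Made-up sightings: (place_name, timestamp) pairs.
sightings = [
    ("Tokyo", datetime(2024, 12, 24, 12, 0)),
    ("Tokyo", datetime(2024, 12, 24, 14, 30)),
    ("Paris", datetime(2024, 12, 24, 20, 0)),
    ("Paris", datetime(2024, 12, 24, 21, 0)),
]

# Track the first and last sighting per place; time spent is the difference.
bounds = {}
for place, ts in sightings:
    first, last = bounds.get(place, (ts, ts))
    bounds[place] = (min(first, ts), max(last, ts))

durations = {place: last - first for place, (first, last) in bounds.items()}
longest = max(durations, key=durations.get)
print(longest, durations[longest])  # Tokyo 2:30:00
```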

<h2 id="day-17">Day 17</h2>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day17</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_17.duckdb"</span><span class="p">)</span><span class="w">
</span><span class="n">dbExecute</span><span class="p">(</span><span class="n">con_day17</span><span class="p">,</span><span class="w"> </span><span class="s2">"INSTALL icu; LOAD icu;"</span><span class="p">)</span><span class="w">
</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day17</span><span class="p">,</span><span class="w">
  </span><span class="s2">"SELECT
     *,
     ('2024-12-24' || ' ' || business_start_time || ' ' || timezone)::TIMESTAMPTZ AS start_time_utc,
     ('2024-12-24' || ' ' || business_end_time || ' ' || timezone)::TIMESTAMPTZ AS end_time_utc
   FROM Workshops
 "</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="w">
  </span><span class="c1">## the meeting can start only once every workshop is open, i.e. at the latest start time</span><span class="w">
  </span><span class="n">start_time_utc</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">start_time_utc</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">start_time_utc</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day17</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
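<p>The idea is to normalize each workshop’s local opening time to UTC and take the maximum of those instants. A sketch of the same logic using Python’s stdlib <code class="language-plaintext highlighter-rouge">zoneinfo</code>, with hypothetical workshop hours (not the challenge data):</p>

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical workshops: local wall-clock opening time + IANA timezone,
# mirroring the business_start_time || timezone concatenation in the query.
workshops = [
    ("2024-12-24 09:00", "America/New_York"),
    ("2024-12-24 08:00", "Asia/Tokyo"),
    ("2024-12-24 10:00", "Europe/Paris"),
]

def to_utc(local, tz):
    # Attach the timezone to the naive local time, then convert to UTC.
    naive = datetime.strptime(local, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=ZoneInfo(tz)).astimezone(timezone.utc)

starts = [to_utc(local, tz) for local, tz in workshops]

# A global meeting cannot start until the last workshop has opened,
# i.e. at the maximum of the UTC start times.
meeting_start = max(starts)
print(meeting_start.isoformat())  # 2024-12-24T14:00:00+00:00
```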

<h2 id="day-18">Day 18</h2>

<p>The data for this challenge is the same as for day 8 and requires the same
preparation: replacing <code class="language-plaintext highlighter-rouge">SERIAL</code> with <code class="language-plaintext highlighter-rouge">INTEGER</code>. Instead of reusing the same
(inefficient) recursive R function from day 8, here I learned how to write a
recursive CTE in DuckDB to compute the managerial paths. The wording of this
challenge seemed confusing, and computing the number of peers with the same
manager turned out to be unnecessary to find the answer.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day18</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="w">
  </span><span class="n">duckdb</span><span class="p">(),</span><span class="w">
  </span><span class="s2">"2024-advent-of-sql-data/advent_day_18.duckdb"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day18</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  WITH RECURSIVE path_tbl(staff_id, path) AS (
      SELECT staff_id, [manager_id] AS path
      FROM staff
      WHERE manager_id IS NULL
    UNION ALL
      SELECT staff.staff_id, list_prepend(staff.manager_id, path_tbl.path)
      FROM staff, path_tbl
      WHERE staff.manager_id = path_tbl.staff_id
  )
  SELECT path_tbl.staff_id, staff.manager_id, len(path) AS level
  FROM path_tbl
  JOIN staff ON staff.staff_id = path_tbl.staff_id
  ORDER BY path_tbl.staff_id, level DESC
  "</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">total_peers_same_level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w">
    </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">level</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">total_peers_same_level</span><span class="p">),</span><span class="w"> </span><span class="n">level</span><span class="p">,</span><span class="w"> </span><span class="n">staff_id</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day18</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
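<p>Recursive CTEs are not DuckDB-specific. A minimal sketch of the same walk-down-the-org-chart pattern, using Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code> with toy data, and a depth counter instead of DuckDB’s list type:</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staff (staff_id INTEGER, manager_id INTEGER);
    INSERT INTO staff VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
""")

# Start from the root (manager_id IS NULL) and recurse downward,
# incrementing the level at each step -- the same shape as the
# DuckDB query, minus the list of managers along the path.
levels = con.execute("""
    WITH RECURSIVE org(staff_id, level) AS (
        SELECT staff_id, 1 FROM staff WHERE manager_id IS NULL
      UNION ALL
        SELECT staff.staff_id, org.level + 1
        FROM staff JOIN org ON staff.manager_id = org.staff_id
    )
    SELECT staff_id, level FROM org ORDER BY staff_id
""").fetchall()

print(levels)  # [(1, 1), (2, 2), (3, 2), (4, 3)]
```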

<h2 id="day-19">Day 19</h2>

<p>Replace <code class="language-plaintext highlighter-rouge">SERIAL</code> with a <code class="language-plaintext highlighter-rouge">SEQUENCE</code>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">employees</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">eid</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">employees</span> <span class="p">(</span>
<span class="n">employee_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'eid'</span><span class="p">),</span>
<span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="n">salary</span> <span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="n">year_end_performance_scores</span> <span class="nb">INTEGER</span><span class="p">[]</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<p>The main challenge here is that the result is a large number, and by default
R does not print enough significant digits to read off the correct answer.
There are several ways to display more digits but, in the end, I used
<code class="language-plaintext highlighter-rouge">tibble::num()</code>, which was new to me.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day19</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_19.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbExecute</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day19</span><span class="p">,</span><span class="w">
  </span><span class="s2">"CREATE OR REPLACE VIEW average_score AS
    (SELECT
    *,
    year_end_performance_scores[len(year_end_performance_scores)] AS last_score
   FROM employees)
"</span><span class="p">)</span><span class="w">

</span><span class="n">avg_score</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day19</span><span class="p">,</span><span class="w"> </span><span class="s2">"average_score"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">avg_score</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">last_score</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">avg_score</span><span class="p">)</span><span class="w">

</span><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day19</span><span class="p">,</span><span class="w"> </span><span class="s2">"average_score"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">gets_bonus</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">last_score</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">avg_score</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">total_comp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
    </span><span class="n">gets_bonus</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">salary</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">1.15</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
    </span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">salary</span><span class="w">
  </span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w"> 
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">total_comp</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">total</span><span class="p">)</span><span class="w">

</span><span class="n">tibble</span><span class="o">::</span><span class="n">num</span><span class="p">(</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">digits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day19</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="day-20">Day 20</h2>

<p>Replace <code class="language-plaintext highlighter-rouge">SERIAL</code> with a <code class="language-plaintext highlighter-rouge">SEQUENCE</code> (and a <code class="language-plaintext highlighter-rouge">nextval()</code> default), so the beginning of the file looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">web_requests</span> <span class="k">CASCADE</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">web_requests</span> <span class="p">(</span>
  <span class="n">request_id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'id'</span><span class="p">),</span>
  <span class="n">url</span> <span class="nb">TEXT</span> <span class="k">NOT</span> <span class="k">NULL</span>
  <span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day20</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_20.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day20</span><span class="p">,</span><span class="w">
  </span><span class="s2">"SELECT
    *,
    string_split(regexp_extract(url, '\\?(.+)', 1), '&amp;') AS query
   FROM web_requests
   WHERE contains(url, 'utm_source=advent-of-sql')
   ORDER BY len(list_distinct(list_transform(query, p -&gt; p.split('=')[1]))) DESC, url
   LIMIT 1
  "</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day20</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
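<p>The list juggling in the <code class="language-plaintext highlighter-rouge">ORDER BY</code> above (split the query string on <code class="language-plaintext highlighter-rouge">&amp;</code>, keep the part of each pair before <code class="language-plaintext highlighter-rouge">=</code>, deduplicate, count) can be sketched in plain Python. The URLs here are made up purely to exercise the logic; note that DuckDB's <code class="language-plaintext highlighter-rouge">split('=')[1]</code> is 1-indexed, so its Python counterpart is index 0:</p>

```python
from urllib.parse import urlsplit


def distinct_param_count(url: str) -> int:
    """Count distinct query-parameter names, mirroring the
    string_split / list_transform / list_distinct pipeline."""
    query = urlsplit(url).query  # the part after '?'
    if not query:
        return 0
    # pair.split("=")[0] is the parameter name (DuckDB's [1] is 1-indexed)
    names = {pair.split("=")[0] for pair in query.split("&")}
    return len(names)


# Hypothetical URLs standing in for the web_requests table
urls = [
    "https://example.com/?utm_source=advent-of-sql&a=1&b=2",
    "https://example.com/?utm_source=advent-of-sql&a=1&a=2",
]

# Keep matching URLs; rank by most distinct parameter names, ties broken by URL
winner = sorted(
    (u for u in urls if "utm_source=advent-of-sql" in u),
    key=lambda u: (-distinct_param_count(u), u),
)[0]
print(winner)
```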

<h2 id="day-21">Day 21</h2>

<p>Replace <code class="language-plaintext highlighter-rouge">SERIAL</code> with a <code class="language-plaintext highlighter-rouge">SEQUENCE</code> (and a <code class="language-plaintext highlighter-rouge">nextval()</code> default), so the beginning of the file looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">sales</span> <span class="k">CASCADE</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">sales</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'id'</span><span class="p">),</span>
  <span class="n">sale_date</span> <span class="nb">DATE</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">amount</span> <span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day21</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="s2">"2024-advent-of-sql-data/advent_day_21.duckdb"</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day21</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  SELECT *
  FROM (
    SELECT
      year(sale_date) AS year,
      quarter(sale_date) AS quarter,
      sum(amount) AS total_sale,
      lag(total_sale, 1) OVER (ORDER BY year, quarter) AS prev_sale,
      (total_sale-prev_sale)/prev_sale AS growth
    FROM sales
    GROUP BY year, quarter
    ORDER BY year, quarter
  )
  ORDER BY growth DESC
"</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day21</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
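<p>The <code class="language-plaintext highlighter-rouge">growth</code> column is just the quarter-over-quarter percent change, computed with a lag over the ordered quarterly totals. The same idea in plain Python, with made-up quarterly totals:</p>

```python
# Hypothetical quarterly totals, already ordered by (year, quarter)
totals = [("2023 Q1", 100.0), ("2023 Q2", 150.0), ("2023 Q3", 120.0)]

rows = []
prev = None  # plays the role of lag(total_sale, 1) over the ordered quarters
for label, total in totals:
    growth = None if prev is None else (total - prev) / prev
    rows.append((label, total, growth))
    prev = total

# Highest growth first, like ORDER BY growth DESC
best = max((r for r in rows if r[2] is not None), key=lambda r: r[2])
print(best)  # ('2023 Q2', 150.0, 0.5)
```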

<h2 id="day-22">Day 22</h2>

<p>Once again, I used <code class="language-plaintext highlighter-rouge">SEQUENCE</code> to replace <code class="language-plaintext highlighter-rouge">SERIAL</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">elves</span> <span class="k">CASCADE</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="n">SEQUENCE</span> <span class="n">id</span> <span class="k">START</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">elves</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="k">DEFAULT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'id'</span><span class="p">),</span>
  <span class="n">elf_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">255</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">skills</span> <span class="nb">TEXT</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day22</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="w">
  </span><span class="n">duckdb</span><span class="p">(),</span><span class="w">
  </span><span class="s2">"2024-advent-of-sql-data/advent_day_22.duckdb"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbGetQuery</span><span class="p">(</span><span class="w">
  </span><span class="n">con_day22</span><span class="p">,</span><span class="w">
  </span><span class="s2">"
  SELECT
     count(id)
   FROM elves
   WHERE str_split(skills, ',').list_contains('SQL')
   "</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day22</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
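<p>The <code class="language-plaintext highlighter-rouge">str_split(...).list_contains('SQL')</code> test is an exact membership check on the split list, not a substring search, so an elf whose only skill is <code class="language-plaintext highlighter-rouge">PostgreSQL</code> would not match. A plain-Python equivalent, with made-up rows:</p>

```python
# Hypothetical rows standing in for the skills column of the elves table
skills_column = [
    "SQL,JavaScript",
    "PostgreSQL,Python",  # 'PostgreSQL' contains 'SQL' but is not the skill 'SQL'
    "Wrapping,SQL",
]

# Exact membership in the split list, like list_contains (not substring search)
n_sql_elves = sum("SQL" in s.split(",") for s in skills_column)
print(n_sql_elves)  # 2
```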

<h2 id="day-23">Day 23</h2>

<p>The data could be imported as provided, and I chose to solve the challenge
with dplyr.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day23</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="w">
  </span><span class="n">duckdb</span><span class="p">(),</span><span class="w">
  </span><span class="s2">"2024-advent-of-sql-data/advent_day_23.duckdb"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">seq_id</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day23</span><span class="p">,</span><span class="w"> </span><span class="s2">"sequence_table"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">collect</span><span class="p">()</span><span class="w">

</span><span class="n">full_seq</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">seq_id</span><span class="o">$</span><span class="n">id</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">seq_id</span><span class="o">$</span><span class="n">id</span><span class="p">)))</span><span class="w">

</span><span class="c1">## join complete and provided sequence and keep both</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">full_seq</span><span class="p">,</span><span class="w"> </span><span class="n">seq_id</span><span class="p">,</span><span class="w"> </span><span class="n">keep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="c1">## the NAs are the gaps</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">id.y</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="c1">## identify groups</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">next_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">id.x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">lag</span><span class="p">(</span><span class="n">id.x</span><span class="p">,</span><span class="w"> </span><span class="n">default</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">next_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">cumsum</span><span class="p">(</span><span class="n">next_id</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="c1">## format as expected</span><span class="w">
  </span><span class="n">nest_by</span><span class="p">(</span><span class="n">next_id</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">res</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">id.x</span><span class="p">,</span><span class="w"> </span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">))</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day23</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
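<p>The pipeline above is a &#8220;gaps and islands&#8221; problem: find the missing ids, then start a new group whenever a gap id is not one more than the previous gap id (the <code class="language-plaintext highlighter-rouge">cumsum</code> trick). The same logic in plain Python, on a made-up sequence:</p>

```python
# Hypothetical sequence with gaps at 3-4 and 7
present = [1, 2, 5, 6, 8, 9, 10]

# All ids between min and max that are missing (the left-join NAs)
missing = sorted(set(range(min(present), max(present) + 1)) - set(present))

# Group consecutive missing ids: start a new group whenever the
# difference with the previous missing id is not 1
groups: list[list[int]] = []
for gap_id in missing:
    if groups and gap_id - groups[-1][-1] == 1:
        groups[-1].append(gap_id)
    else:
        groups.append([gap_id])

# Format each group as a comma-separated string, as the challenge expects
formatted = [",".join(map(str, g)) for g in groups]
print(formatted)  # ['3,4', '7']
```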

<h2 id="day-24">Day 24</h2>

<p>The data could be imported directly into DuckDB. For this challenge,
manipulating the data with the dplyr verbs felt like the most efficient
way to get to the solution.</p>
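<p>The core of the solution is counting, per song, how many plays there were and how many counted as skips, where a play is a skip when its listened duration is shorter than the song's full duration. A plain-Python sketch of that aggregation, with hypothetical rows:</p>

```python
from collections import defaultdict

# Hypothetical joined rows: (song_title, play_duration, song_duration)
plays = [
    ("Jingle Bells", 120, 180),  # stopped early: a skip
    ("Jingle Bells", 180, 180),  # played in full
    ("Silent Night", 200, 200),
]

stats = defaultdict(lambda: {"n_plays": 0, "n_skips": 0})
for title, duration, song_duration in plays:
    stats[title]["n_plays"] += 1
    stats[title]["n_skips"] += int(duration < song_duration)

# Most-played first, then fewest skips, like arrange(desc(n_plays), n_skips)
ranking = sorted(stats.items(), key=lambda kv: (-kv[1]["n_plays"], kv[1]["n_skips"]))
print(ranking[0][0])  # 'Jingle Bells'
```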

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con_day24</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="w">
  </span><span class="n">duckdb</span><span class="p">(),</span><span class="w">
  </span><span class="s2">"2024-advent-of-sql-data/advent_day_24.duckdb"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day24</span><span class="p">,</span><span class="w"> </span><span class="s2">"user_plays"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">tbl</span><span class="p">(</span><span class="n">con_day24</span><span class="p">,</span><span class="w"> </span><span class="s2">"songs"</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">has_skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">duration</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">song_duration</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="w">
    </span><span class="n">n_plays</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w">
    </span><span class="n">n_skips</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">has_skip</span><span class="p">),</span><span class="w">
    </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">song_title</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n_plays</span><span class="p">),</span><span class="w"> </span><span class="n">n_skips</span><span class="p">)</span><span class="w">

</span><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con_day24</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Hacking" /><category term="r" /><category term="duckdb" /><summary type="html"><![CDATA[An annoted list of solutions to the Advent of SQL challenges]]></summary></entry><entry><title type="html">How to work with remote Parquet files with the duckdb R package?</title><link href="https://francoismichonneau.net/2023/06/duckdb-r-remote-data/" rel="alternate" type="text/html" title="How to work with remote Parquet files with the duckdb R package?" /><published>2023-06-19T00:00:00+00:00</published><updated>2023-06-19T00:00:00+00:00</updated><id>https://francoismichonneau.net/2023/06/duckdb-r-remote-data</id><content type="html" xml:base="https://francoismichonneau.net/2023/06/duckdb-r-remote-data/"><![CDATA[<p>For large datasets, it is sometimes convenient to explore them without
downloading them locally. With Arrow, you can work with these remote files if
they are stored in AWS S3 or Google Cloud Storage. It is however not yet
possible for files stored over HTTPS (it is on the roadmap). On the other hand,
with the “httpfs” extension, DuckDB allows you to query these Parquet files
over the wire.</p>

<p>You can even set things up so you can use dplyr verbs to work with these remote
files. I will demonstrate this using a Parquet version of the <a href="https://allisonhorst.github.io/palmerpenguins/">penguins
dataset</a> hosted on my site.</p>

<p>Let’s start by loading the required packages:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">DBI</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">duckdb</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>We are creating a <code class="language-plaintext highlighter-rouge">con</code> object to hold our DuckDB connection:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">duckdb</span><span class="o">::</span><span class="n">duckdb</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<p>Let’s install (only needed once) and load the <code class="language-plaintext highlighter-rouge">httpfs</code> extension:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbExecute</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"INSTALL httpfs;"</span><span class="p">)</span><span class="w">
</span><span class="n">dbExecute</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"LOAD httpfs;"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>At this point, we could use DuckDB’s SQL syntax to work with our remote dataset:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"SELECT species,
          AVG(bill_length_mm) AS avg_bill_length,
          AVG(bill_depth_mm) AS avg_bill_depth
   FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet')
   GROUP BY species;"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 3 × 3
  species   avg_bill_length avg_bill_depth
  &lt;chr&gt;               &lt;dbl&gt;          &lt;dbl&gt;
1 Adelie               38.8           18.3
2 Gentoo               47.5           15.0
3 Chinstrap            48.8           18.4
</code></pre></div></div>

<p>However, you can create a view using this remote file, which, in turn, will
allow you to use dplyr to query your file:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbExecute</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w">
  </span><span class="s2">"CREATE VIEW penguins AS
   SELECT * FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet');
"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>You can check it worked by running:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbListTables</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] "penguins"
</code></pre></div></div>

<p>Now you can work with this remote data with dplyr:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tbl</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"penguins"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">species</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="w">
    </span><span class="n">avg_bill_length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">bill_length_mm</span><span class="p">),</span><span class="w">
    </span><span class="n">avg_bill_depth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">bill_depth_mm</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Source:   SQL [3 x 3]
# Database: DuckDB 0.8.1 [francois@Linux 6.2.0-20-generic:R 4.3.0/:memory:]
  species   avg_bill_length avg_bill_depth
  &lt;chr&gt;               &lt;dbl&gt;          &lt;dbl&gt;
1 Adelie               38.8           18.3
2 Gentoo               47.5           15.0
3 Chinstrap            48.8           18.4
</code></pre></div></div>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Hacking" /><category term="r" /><category term="arrow" /><category term="duckdb" /><summary type="html"><![CDATA[Learn how to work with Parquet files over HTTPS using duckdb and dplyr.]]></summary></entry><entry><title type="html">How to use Arrow to work with large CSV files?</title><link href="https://francoismichonneau.net/2022/10/import-big-csv/" rel="alternate" type="text/html" title="How to use Arrow to work with large CSV files?" /><published>2022-10-13T00:00:00+00:00</published><updated>2022-10-13T00:00:00+00:00</updated><id>https://francoismichonneau.net/2022/10/import-big-csv</id><content type="html" xml:base="https://francoismichonneau.net/2022/10/import-big-csv/"><![CDATA[<h2 id="some-background">Some background</h2>

<p>Lucky you! You just got hold of a largish CSV file (let’s say 15 GB,
about 140 million rows). How do you handle this file to be able to
work with it using Apache Arrow?</p>

<p>Going through the documentation of Arrow, you might notice that
several ways are mentioned to import data. They fall into two
families:</p>
<ul>
  <li>one that I will refer to as the <strong>Single file API</strong><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>;</li>
  <li>the other is the <strong>Dataset API</strong>.</li>
</ul>

<p>The Single file API contains functions for each supported file format
(CSV, JSON, Parquet, Feather/Arrow, ORC). They work on one file at a
time, and they load the data in memory. So depending on the size of
your file and the amount of memory you have available on your system,
it might not be possible to load the dataset this way.  If you <em>can</em>
load the dataset in memory, queries will run faster because the data
will be readily accessible to the query engine.</p>

<p>The Dataset API is very flexible.  It can read multiple file formats,
you can point to a folder with multiple files and create a dataset
from them, and it can read datasets from multiple sources (even
combining remote and local sources). This API can also be used to read
single files that are too large to fit in memory. This works because
the files are not actually loaded in memory. The functions scan the
content so they know where to look for the data and what the schema is
(the data types and names of each column). When you query the data,
there is some overhead because the query engine needs to first read
the data before it can operate on it. (If you want to see some
examples of what the Dataset API can do, check out the two previous
posts on datasets with Arrow: <a href="/2022/08/arrow-dataset-creation/">Part 1</a>, and <a href="/2022/09/arrow-dataset-part-2/">Part 2</a>)</p>

<p>In this post, we will explore how to convert a large CSV file to the
Apache Parquet format using the Single file and the Dataset APIs with
code examples in R and Python. We do the conversion from CSV to
Parquet, because in a <a href="/2022/08/arrow-dataset-creation/">previous post</a> we found that the Parquet format
provided the best compromise between disk space usage and query
performance. Having the content of this file in the Apache Parquet
format will ensure that we can read and operate on this data quickly.</p>

<h2 id="the-single-file-api-in-r">The Single file API in R</h2>

<p>The functions in the Single file API in R start with <code class="language-plaintext highlighter-rouge">read_</code> or
<code class="language-plaintext highlighter-rouge">write_</code> followed by the name of the file format. For instance,
<code class="language-plaintext highlighter-rouge">read_csv_arrow()</code>, <code class="language-plaintext highlighter-rouge">read_parquet()</code>, and <code class="language-plaintext highlighter-rouge">read_feather()</code> belong to
what I refer here as the Single file API.</p>

<p>To read the data with our 15 GB CSV file, we would use:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">arrow</span><span class="p">)</span><span class="w">

</span><span class="n">data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_csv_arrow</span><span class="p">(</span><span class="w">
  </span><span class="s2">"~/dataset/path_to_file.csv"</span><span class="p">,</span><span class="w">
  </span><span class="n">as_data_frame</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Using <code class="language-plaintext highlighter-rouge">as_data_frame = FALSE</code> keeps the result as an Arrow table which
is a better representation for a file of this size. Attempting to
convert it into a data frame will take longer to load, and you will
most likely run out of memory.</p>

<p>This step takes about 15 seconds on my system. As far as I can tell,
the arrow R package is the only way to load a file of this size in
memory. Both readr/vroom and data.table ran out of memory after
several minutes and before being able to finish reading the file.</p>

<p>At this point, you have an Arrow formatted table loaded in memory that
is ready for you to work with.</p>

<p>To convert this file into the Apache Parquet format using the Single
file API, you would use:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">write_parquet</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="s2">"~/dataset/data.parquet"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Creating this file takes about 85 seconds on my system. The resulting
file is about 9.5 GB, reducing the hard drive space needed to store the
data to roughly 60% of the original CSV.</p>

<p>The <code class="language-plaintext highlighter-rouge">read_parquet()</code> function will load this dataset the next time you
need to work with it:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">"~/dataset/data.parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">as_data_frame</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Let’s count the number of unique values in one of the columns of this
dataset:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">variable</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">collect</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<p>This query takes only <strong>half a second</strong> on my laptop. Half a second to
summarize the content of 140 million rows: this is fast! Very fast!</p>

<p>Whether you use <code class="language-plaintext highlighter-rouge">read_csv_arrow()</code> or <code class="language-plaintext highlighter-rouge">read_parquet()</code>, the dataset is
loaded in memory using the same representation: an Arrow table. Query
performance is therefore the same regardless of the format used to
store the data. In this case, the decision to store the data as a CSV
or a Parquet file comes down to the amount of storage each format
requires, and how the speed of reading each format compares with the
overhead of converting from one to the other.</p>

<p>Let’s now use the Dataset API.</p>

<h2 id="the-dataset-api-in-r">The Dataset API in R</h2>

<p>We will read the large CSV file with <code class="language-plaintext highlighter-rouge">open_dataset()</code>. This function
can be pointed to a folder with several files but it can also be used
to read a single file.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/path_to_file.csv"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>With our 15 GB file, it takes 0.05 seconds to “read” the file. It is
fast because the data does not get loaded in memory: <code class="language-plaintext highlighter-rouge">open_dataset()</code>
only scans the content of the file to identify the names of the
columns and their data types.</p>

<p>Running the same query as above, which counts the number of unique
values in a column, takes 18 seconds compared to the 0.5 seconds when
the data is loaded in memory. It is slower because the query engine
needs to read the data from disk. This matches what we found in a
<a href="/2022/08/arrow-dataset-creation/">previous post</a>:
running queries directly on a CSV file is slow. In that post, we also
found that storing the data in the Parquet format sped things
up. Let’s now convert this dataset to Parquet using the Dataset API.</p>

<p>Instead of using a single Parquet file as we did above with the
Single file API, we will partition the Parquet dataset to see how
partitioning can help with query performance. The particular dataset I
have on hand does not have any obvious variable we can use to
partition the data. If you are dealing with a dataset that has
timestamps for data collected at regular intervals, partitioning on a
temporal dimension could make sense (that’s what the NYC taxi dataset
does by partitioning by year and month). Instead, here, we can use the
<code class="language-plaintext highlighter-rouge">max_rows_per_file</code> argument of the <code class="language-plaintext highlighter-rouge">write_dataset()</code> function to
limit how large each Parquet file is. At least for this dataset, I
found that limiting the number of rows to 10 million per file seemed
like a good compromise: each file is about 720 MB, which is close to
the file sizes in the NYC taxi dataset. The <a href="https://arrow.apache.org/docs/python/dataset.html#partitioning-performance-considerations">PyArrow
documentation</a>
has a good overview of strategies for partitioning a dataset. The
general recommendation is to avoid individual Parquet files smaller
than 20 MB or larger than 2 GB, while avoiding a partition layout that
would create more than 10,000 partitions.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">write_dataset</span><span class="p">(</span><span class="w">
  </span><span class="n">data</span><span class="p">,</span><span class="w">
  </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">,</span><span class="w">
  </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/my-data/"</span><span class="p">,</span><span class="w">
  </span><span class="n">max_rows_per_file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e7</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Writing these files on my system takes about 50 seconds. We end up
with 14 Parquet files totaling 9.9 GB.</p>

<p>Next time we want to work with this data, we can load these files
with:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/my-data"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Opening these Parquet files takes about the same amount of time as
scanning the CSV file: it is almost instantaneous, taking only 0.02
seconds. Again, this is fast because the data is not loaded in
memory. We saw above that counting the unique values in a column took
almost 20 seconds when run directly against the CSV file. So what is
the performance of that query on this dataset split into multiple
Parquet files?</p>

<p>Counting the unique values in a column takes just <strong>1 second</strong>. You
read that correctly. One second to summarize 140 million rows. It is a
little slower than running the query with the entire dataset loaded in
memory, but opening the dataset is much faster than reading all the
data. And because the dataset is not loaded in memory, you are not
limited by the amount of memory you have available. With the Single
file API, a file of 15 GB is the upper limit of what my laptop with
32 GB of RAM can handle.</p>

<p>One of the advantages of the Arrow ecosystem is that it is
polyglot. The approach we described with R also works with Python. And
because both languages use the same C++ backend, the code looks very
similar.</p>

<h2 id="single-file-api-in-python">Single file API in Python</h2>

<p>There are two functions in the PyArrow Single file API to read CSV
files: <code class="language-plaintext highlighter-rouge">read_csv()</code> and <code class="language-plaintext highlighter-rouge">open_csv()</code>. While <code class="language-plaintext highlighter-rouge">read_csv()</code> loads all the data
in memory, and does so quickly by using multiple threads to read
different parts of the file, <code class="language-plaintext highlighter-rouge">open_csv()</code> reads the data in batches
using a single thread.</p>

<p>If the CSV file is small enough, you should use <code class="language-plaintext highlighter-rouge">read_csv()</code>. The code
to read the CSV file and write it to a Parquet file would then look
like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="n">pa</span>
<span class="kn">import</span> <span class="nn">pyarrow.csv</span>
<span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="n">pq</span>

<span class="n">in_path</span> <span class="o">=</span> <span class="s">'~/datasets/data.csv'</span>
<span class="n">out_path</span> <span class="o">=</span> <span class="s">'~/datasets/data.parquet'</span>

<span class="n">data</span> <span class="o">=</span>  <span class="n">pa</span><span class="p">.</span><span class="n">csv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">in_path</span><span class="p">)</span>

<span class="n">pq</span><span class="p">.</span><span class="n">write_table</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">out_path</span><span class="p">)</span>
</code></pre></div></div>

<p>In our case, the file is too large to fit in memory<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. So instead of
using <code class="language-plaintext highlighter-rouge">read_csv()</code>, we need to use <code class="language-plaintext highlighter-rouge">open_csv()</code>. Because the CSV file
is read in chunks, the code is a little more complex: we need to loop
through each chunk, read it, and write it to the Parquet file. This
uses little memory but is not as fast as using <code class="language-plaintext highlighter-rouge">read_csv()</code>, since
<code class="language-plaintext highlighter-rouge">open_csv()</code> uses a single
thread. When using <code class="language-plaintext highlighter-rouge">open_csv()</code>, the data types also need to be
consistent within each column. The function infers the data types from
the first chunk of data it reads, and if the type of one of your
columns changes halfway through your dataset, you will run into
errors. You can avoid this by specifying the data types manually.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># from &lt;https://stackoverflow.com/a/68563617/1113276&gt;
</span><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="n">pa</span>
<span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="n">pq</span>
<span class="kn">import</span> <span class="nn">pyarrow.csv</span>

<span class="n">in_path</span> <span class="o">=</span> <span class="s">'~/datasets/data.csv'</span>
<span class="n">out_path</span> <span class="o">=</span> <span class="s">'~/datasets/data.parquet'</span>

<span class="n">writer</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">with</span> <span class="n">pyarrow</span><span class="p">.</span><span class="n">csv</span><span class="p">.</span><span class="n">open_csv</span><span class="p">(</span><span class="n">in_path</span><span class="p">)</span> <span class="k">as</span> <span class="n">reader</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">next_chunk</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">next_chunk</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">break</span>
        <span class="k">if</span> <span class="n">writer</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">writer</span> <span class="o">=</span> <span class="n">pq</span><span class="p">.</span><span class="n">ParquetWriter</span><span class="p">(</span><span class="n">out_path</span><span class="p">,</span> <span class="n">next_chunk</span><span class="p">.</span><span class="n">schema</span><span class="p">)</span>
        <span class="n">next_table</span> <span class="o">=</span> <span class="n">pa</span><span class="p">.</span><span class="n">Table</span><span class="p">.</span><span class="n">from_batches</span><span class="p">([</span><span class="n">next_chunk</span><span class="p">])</span>
        <span class="n">writer</span><span class="p">.</span><span class="n">write_table</span><span class="p">(</span><span class="n">next_table</span><span class="p">)</span>
<span class="n">writer</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>

<p>On my system, the conversion from CSV to Parquet takes about 190
seconds. Reading the Parquet file can be done with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">pq</span><span class="p">.</span><span class="n">ParquetDataset</span><span class="p">(</span><span class="n">out_path</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>
</code></pre></div></div>

<p>With this approach, the dataset is in memory, just like when we were
using R. Again, with 32 GB of RAM in my laptop, I need to be careful
about what else is running on my system to be able to load this dataset
without running out of memory and crashing my Python session.</p>

<h2 id="the-dataset-api-in-python">The Dataset API in Python</h2>

<p>To load the CSV file with the Dataset API, we use the <code class="language-plaintext highlighter-rouge">dataset()</code>
function:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyarrow.dataset</span> <span class="k">as</span> <span class="n">ds</span>

<span class="n">in_path</span> <span class="o">=</span> <span class="s">"~/datasets/data.csv"</span>
<span class="n">out_path</span> <span class="o">=</span> <span class="s">"~/datasets/my-data/"</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">ds</span><span class="p">.</span><span class="n">dataset</span><span class="p">(</span><span class="n">in_path</span><span class="p">)</span>
</code></pre></div></div>

<p>Just like with R, importing this file takes about 0.02 seconds.</p>

<p>To convert it to a collection of Parquet files, you use the
<code class="language-plaintext highlighter-rouge">write_dataset()</code> function. This function takes the same
<code class="language-plaintext highlighter-rouge">max_rows_per_file</code> argument to control the size of the Parquet files
in each partition.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ds</span><span class="p">.</span><span class="n">write_dataset</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">out_path</span><span class="p">,</span> <span class="nb">format</span> <span class="o">=</span> <span class="s">"parquet"</span><span class="p">,</span>
                 <span class="n">max_rows_per_file</span> <span class="o">=</span> <span class="mf">1e7</span><span class="p">)</span>
</code></pre></div></div>

<p>Reading this collection of Parquet files can also be done with the
<code class="language-plaintext highlighter-rouge">dataset()</code> function, just like when we used it to read the
single CSV file above. The <code class="language-plaintext highlighter-rouge">dataset()</code> function is very flexible: it
can be used to import data in a variety of formats and structures,
and can even combine files from local and remote locations. The <code class="language-plaintext highlighter-rouge">format</code>
argument is optional as the function automatically detects the file
type.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">ds</span><span class="p">.</span><span class="n">dataset</span><span class="p">(</span><span class="n">out_path</span><span class="p">,</span> <span class="nb">format</span> <span class="o">=</span> <span class="s">"parquet"</span><span class="p">)</span>
</code></pre></div></div>

<p>Given the functionality currently implemented in PyArrow, querying
datasets of this size is possible but it is neither blazing fast nor
convenient. A good alternative is to use
<a href="https://ibis-project.org">Ibis</a> with <a href="https://duckdb.org">DuckDB</a> as
a backend. Ibis provides a single interface to work with data stored
in memory or in databases. DuckDB is a self-contained database
designed for data analytics. These tools deserve a lot more than a
one-sentence summary, but that is beyond the scope of this post.</p>

<p>To count the number of unique values, you could use the following
approach:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ibis</span>

<span class="n">ibis</span><span class="p">.</span><span class="n">options</span><span class="p">.</span><span class="n">interactive</span> <span class="o">=</span> <span class="bp">True</span>

<span class="n">con</span> <span class="o">=</span> <span class="n">ibis</span><span class="p">.</span><span class="n">duckdb</span><span class="p">.</span><span class="n">connect</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">con</span><span class="p">.</span><span class="n">register</span><span class="p">(</span><span class="s">"parquet:///home/user/datasets/my-data/*.parquet"</span><span class="p">,</span> <span class="n">table_name</span> <span class="o">=</span> <span class="s">"table"</span><span class="p">)</span>

<span class="n">con</span><span class="p">.</span><span class="n">table</span><span class="p">(</span><span class="s">"table"</span><span class="p">).</span><span class="n">variable</span><span class="p">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div></div>

<p>Just like with R, this takes about a second to count the unique
values in one column of our 140-million-row dataset.</p>

<h2 id="what-this-post-didnt-mention">What this post didn’t mention</h2>

<p>I focused on reading a CSV file and converting it to Parquet. I
didn’t cover all the options that both the Single file and the
Dataset APIs offer to describe the format of the files being
imported. For instance, both APIs let you specify a different column
separator, or which cell values should be treated as missing data.</p>

<h2 id="conclusion">Conclusion</h2>

<p>For a 15 GB data file, the Dataset API is better suited to read,
convert, and query the data. There is an overhead associated with not
having the data in memory, but it is greatly reduced if the data is
stored as Parquet files. Another advantage is that the approach
developed here would scale to much larger datasets, where the Single
file API would not be able to hold the data in memory.</p>

<p>With the dataset in this example, the Single file API did not have an
opportunity to shine given the hardware constraints of a modern
laptop. However, if you are dealing with datasets that fit easily in
memory, working with data directly in memory will lead to better query
performance.</p>

<p>To summarize what we learned in this post, here is a brief decision
guide to help you choose the appropriate API to import your data.</p>

<figure class="">
  <img src="/images/2022-09-decision-map.webp" alt="Decision tree to help you choose the most suitable API for your
data. If your dataset is large (more than a third of your available
RAM) or if it is split into multiple files use the Dataset
API. Reserve the use of the Single file API when the dataset is
small." /><figcaption>
     Decision tree to help you choose the appropriate
Apache Arrow API for your dataset.

  </figcaption></figure>

<h2 id="acknowledgments">Acknowledgments</h2>

<p>Thank you to <a href="https://twitter.com/kae_suarez/">Kae Suarez</a> and
<a href="https://djnavarro.net">Danielle Navarro</a> for reviewing this post and
providing feedback that improved its content.</p>

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This is not an official name, but I found it helpful for grouping
  these functions that work on one file at a time. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I am not sure why it fit in memory when I was loading it in R but
  not with Python. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Arrow exploration" /><category term="r" /><category term="arrow" /><summary type="html"><![CDATA[A short practical guide to load a 15 GB dataset with Apache Arrow using R and Python.]]></summary></entry><entry><title type="html">Creating an Arrow dataset (part 2)</title><link href="https://francoismichonneau.net/2022/09/arrow-dataset-part-2/" rel="alternate" type="text/html" title="Creating an Arrow dataset (part 2)" /><published>2022-09-06T00:00:00+00:00</published><updated>2022-09-06T00:00:00+00:00</updated><id>https://francoismichonneau.net/2022/09/arrow-dataset-part-2</id><content type="html" xml:base="https://francoismichonneau.net/2022/09/arrow-dataset-part-2/"><![CDATA[<h2 id="background">Background</h2>

<p>In this follow-up post (see
<a href="/2022/08/arrow-dataset-creation/">part 1</a> if you missed
it), we will explore what happens to the query performance if we read
the files straight into Arrow instead of downloading them locally first.</p>

<h2 id="reading-remote-csv-files">Reading remote CSV files</h2>

<p>In the first part, we first downloaded the compressed CSV files locally
(using the <code class="language-plaintext highlighter-rouge">download.file()</code> function) and then used the
<code class="language-plaintext highlighter-rouge">open_dataset()</code> function on this set of files to make them available to
Arrow.</p>

<p>However, it is possible to bypass the local download. We can import the
files directly over an Internet connection using the <code class="language-plaintext highlighter-rouge">read_csv_arrow()</code>
function, providing the file URL as the first argument. Once the file
is loaded in memory, we can then write it to disk in the Parquet format
(given that we learned in
<a href="/2022/08/arrow-dataset-creation/">part 1</a> that this
format provided the best compromise between disk space usage and query
performance).</p>

<p>We can then modify the code from the <code class="language-plaintext highlighter-rouge">download_daily_package_logs_csv()</code>
function from part 1 to the following (lines changed have comments
indicated by <code class="language-plaintext highlighter-rouge"># &lt;---</code> at the end of the line).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">arrow</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Download the data set for a given date from the RStudio CRAN log website.</span><span class="w">
</span><span class="c1">## `date` is a single date for which we want the data</span><span class="w">
</span><span class="c1">## `path` is where we want the data to live</span><span class="w">
</span><span class="n">download_daily_package_logs_parquet</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w">
                                                </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/cran-logs-parquet-by-day"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="c1">## build the URL for the download</span><span class="w">
  </span><span class="n">date</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">parse_date</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w">
  </span><span class="n">url</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="w">
    </span><span class="s1">'https://cran-logs.rstudio.com/'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="s1">'/'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s1">'.csv.gz'</span><span class="w">
  </span><span class="p">)</span><span class="w">

  </span><span class="c1">## build the path for the destination of the download</span><span class="w">
  </span><span class="n">file</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="w">
    </span><span class="n">path</span><span class="p">,</span><span class="w">
    </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"year="</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">year</span><span class="p">),</span><span class="w">
    </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"month="</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">month</span><span class="p">),</span><span class="w">
    </span><span class="n">paste0</span><span class="p">(</span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s2">".parquet"</span><span class="p">)</span><span class="w">   </span><span class="c1"># &lt;--- change extension to .parquet</span><span class="w">
  </span><span class="p">)</span><span class="w">

  </span><span class="c1">## create the folder if it doesn't exist</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">dir.exists</span><span class="p">(</span><span class="n">dirname</span><span class="p">(</span><span class="n">file</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">dir.create</span><span class="p">(</span><span class="n">dirname</span><span class="p">(</span><span class="n">file</span><span class="p">),</span><span class="w"> </span><span class="n">recursive</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="c1">## download the file</span><span class="w">
  </span><span class="n">message</span><span class="p">(</span><span class="s2">"Downloading data for "</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s2">" ... "</span><span class="p">,</span><span class="w"> </span><span class="n">appendLF</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
    </span><span class="n">arrow</span><span class="o">::</span><span class="n">read_csv_arrow</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">      </span><span class="c1"># &lt;--- read directly from URL</span><span class="w">
      </span><span class="n">arrow</span><span class="o">::</span><span class="n">write_parquet</span><span class="p">(</span><span class="n">sink</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">file</span><span class="p">)</span><span class="w"> </span><span class="c1"># &lt;--- convert to parquet on disk</span><span class="w">
  </span><span class="n">message</span><span class="p">(</span><span class="s2">"done."</span><span class="p">)</span><span class="w">

  </span><span class="c1">## quick check to make sure that the file was created</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">file</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">stop</span><span class="p">(</span><span class="s2">"Download failed for "</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="n">call.</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="c1">## return the path</span><span class="w">
  </span><span class="n">file</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1">## This function is unchanged from part 1: it parses the date</span><span class="w">
</span><span class="c1">## and extracts the year and month from it</span><span class="w">
</span><span class="n">parse_date</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">stopifnot</span><span class="p">(</span><span class="w">
    </span><span class="s2">"`date` must be a date"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inherits</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="s2">"Date"</span><span class="p">),</span><span class="w">
    </span><span class="s2">"provide only one date"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">identical</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w">
    </span><span class="s2">"date must be in the past"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">Sys.Date</span><span class="p">()</span><span class="w">
  </span><span class="p">)</span><span class="w">
  </span><span class="nf">list</span><span class="p">(</span><span class="w">
    </span><span class="n">date_chr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w">
    </span><span class="n">year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">year</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1900L</span><span class="p">,</span><span class="w"> 
    </span><span class="n">month</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">mon</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1L</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Now that we are set up, we can create the directory structure the same
way we did in part 1.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dates_to_get</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="w">
  </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-06-01"</span><span class="p">),</span><span class="w">
  </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-08-15"</span><span class="p">),</span><span class="w">
  </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"day"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">purrr</span><span class="o">::</span><span class="n">walk</span><span class="p">(</span><span class="n">dates_to_get</span><span class="p">,</span><span class="w"> </span><span class="n">download_daily_package_logs_parquet</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The result is similar to what we achieved in part 1: we have one file
for each day, placed in a folder corresponding to its month. This time,
however, instead of compressed CSV files, we have parquet files:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/datasets/cran-logs-parquet-by-day/
└── year=2022
    ├── month=6
    │   ├── 2022-06-01.parquet
    │   ├── 2022-06-02.parquet
    │   ├── 2022-06-03.parquet
    │   ├── ...
    │   └── 2022-06-30.parquet
    ├── month=7
    │   ├── 2022-07-01.parquet
    │   ├── 2022-07-02.parquet
    │   ├── 2022-07-03.parquet
    │   ├── ...
    │   └── 2022-07-31.parquet
    └── month=8
        ├── 2022-08-01.parquet
        ├── 2022-08-02.parquet
        ├── 2022-08-03.parquet
        ├── ...
        └── 2022-08-15.parquet
</code></pre></div></div>
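
<p>A nice property of this Hive-style layout is that Arrow parses the <code class="language-plaintext highlighter-rouge">year=</code> and <code class="language-plaintext highlighter-rouge">month=</code> directory names into regular columns of the dataset. A quick sketch, using the path shown above:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(arrow)

## The partition variables appear in the schema alongside the columns
## stored inside the parquet files themselves
ds &lt;- open_dataset("~/datasets/cran-logs-parquet-by-day", format = "parquet")
names(ds)  ## includes "year" and "month"
</code></pre></div></div>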

<p>Let’s check how large this data is compared to the datasets we created
in part 1:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_size</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">fs</span><span class="o">::</span><span class="n">dir_info</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w"> </span><span class="n">recurse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">filter</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"file"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">pull</span><span class="p">(</span><span class="n">size</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="nf">sum</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">tribble</span><span class="p">(</span><span class="w">
  </span><span class="o">~</span><span class="w"> </span><span class="n">Format</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w">
  </span><span class="s2">"Compressed CSV"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-csv/"</span><span class="p">),</span><span class="w">
  </span><span class="s2">"Arrow"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-arrow/"</span><span class="p">),</span><span class="w">
  </span><span class="s2">"Parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet/"</span><span class="p">),</span><span class="w">
  </span><span class="s2">"Parquet by day"</span><span class="p">,</span><span class="w">  </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet-by-day/"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> 
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 4 × 2
  Format                size
  &lt;chr&gt;          &lt;fs::bytes&gt;
1 Compressed CSV       5.01G
2 Arrow               29.67G
3 Parquet              5.06G
4 Parquet by day       4.63G
</code></pre></div></div>

<p>The dataset with one parquet file per day is slightly smaller than the
one we got when we let <code class="language-plaintext highlighter-rouge">write_dataset()</code> do its own partitioning,
which led to one file per month.</p>

<p>We can now compare how quickly Arrow can read these datasets.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
  </span><span class="n">parquet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">),</span><span class="w">
  </span><span class="n">parquet_by_day</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet-by-day"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">),</span><span class="w">
  </span><span class="n">check</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  &lt;bch:expr&gt;     &lt;bch:tm&gt; &lt;bch:tm&gt;     &lt;dbl&gt; &lt;bch:byt&gt;    &lt;dbl&gt;
1 parquet        139.43ms 143.66ms      6.62    7.91MB     0   
2 parquet_by_day   3.52ms   3.82ms    254.      4.28KB     6.45
</code></pre></div></div>

<p>Even though there are more files to parse (76 vs. 3), opening the
dataset with one parquet file per day is markedly faster here (a median
of about 3.8 ms vs. 144 ms).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cran_logs_parquet</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet"</span><span class="p">,</span><span class="w">  </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">)</span><span class="w">
</span><span class="n">cran_logs_parquet_by_day</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet-by-day"</span><span class="p">,</span><span class="w">  </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Let’s now explore the performance of a few queries on these datasets.</p>

<p>First, how long does it take to compute the number of rows in these
datasets:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
  </span><span class="n">parquet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">),</span><span class="w">
  </span><span class="n">parquet_by_day</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">cran_logs_parquet_by_day</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 2 × 6
  expression          min   median `itr/sec` mem_alloc `gc/sec`
  &lt;bch:expr&gt;     &lt;bch:tm&gt; &lt;bch:tm&gt;     &lt;dbl&gt; &lt;bch:byt&gt;    &lt;dbl&gt;
1 parquet           743µs    773µs     1267.    4.74KB     8.48
2 parquet_by_day    745µs    773µs     1273.    1.97KB    10.7 
</code></pre></div></div>

<p>Not much of a difference. Parquet files record their number of rows in
the file metadata, so in both cases Arrow only needs to read the file
footers rather than scan the data itself.</p>

<p>Let’s now compare the performance of the query we ran in part 1, where
we computed the 10 most downloaded packages in the period covered by our
dataset.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_10_packages</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">data</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">count</span><span class="p">(</span><span class="n">package</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">mutate</span><span class="p">(</span><span class="n">n_million_downloads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">1e6</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">select</span><span class="p">(</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
    </span><span class="n">collect</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
  </span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">),</span><span class="w">
  </span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_parquet_by_day</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Warning: Some expressions had a GC in every iteration; so filtering is disabled.

# A tibble: 2 × 6
  expression                                     min   median `itr/sec` mem_al…¹
  &lt;bch:expr&gt;                                &lt;bch:tm&gt; &lt;bch:tm&gt;     &lt;dbl&gt; &lt;bch:by&gt;
1 top_10_packages(cran_logs_parquet)           3.58s    3.58s     0.279   7.19MB
2 top_10_packages(cran_logs_parquet_by_day)    5.76s    5.76s     0.174 165.36KB
# … with 1 more variable: `gc/sec` &lt;dbl&gt;, and abbreviated variable name
#   ¹​mem_alloc
# ℹ Use `colnames()` to see all variable names
</code></pre></div></div>

<p>This query runs about 2 seconds faster on the dataset with one parquet
file per month (3.58 s) than on the dataset with one parquet file per
day (5.76 s).</p>

<p>The way a dataset is partitioned has an impact on the performance of
queries. If you filter your dataset on a variable used in the
partitioning, some of the files can be skipped: Arrow reads only the
file(s) that contain information relevant to your query. For instance,
if a query only touches the month of July, Arrow does not need to look
at the files for June or August, leading to potential speed-ups.</p>
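
<p>With the dataset objects we opened earlier, such a pruned query could look like the following sketch; filtering on the partition column <code class="language-plaintext highlighter-rouge">month</code> lets Arrow skip the files for June and August entirely:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cran_logs_parquet_by_day %&gt;%
  filter(month == 7L) %&gt;%  ## only files under month=7/ need to be read
  count(package, sort = TRUE) %&gt;%
  head(10) %&gt;%
  collect()
</code></pre></div></div>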

<p>Would the partitioning by day help us run our query faster if we were to
compute the 10 most downloaded packages for a single day? After all, in
this case, we would only need to look at one of the files in our folder
of parquet files, and the file in question would be smaller than one
that has all the data for the month. Let’s compare the performance of
this query for August 1st, 2022:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_10_packages_by_day</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">data</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">filter</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-08-01"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">count</span><span class="p">(</span><span class="n">package</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">collect</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
  </span><span class="n">top_10_packages_by_day</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">),</span><span class="w">
  </span><span class="n">top_10_packages_by_day</span><span class="p">(</span><span class="n">cran_logs_parquet_by_day</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 2 × 6
  expression                                          min median itr/s…¹ mem_a…²
  &lt;bch:expr&gt;                                       &lt;bch:&gt; &lt;bch:&gt;   &lt;dbl&gt; &lt;bch:b&gt;
1 top_10_packages_by_day(cran_logs_parquet)         304ms  348ms    2.87   222KB
2 top_10_packages_by_day(cran_logs_parquet_by_day)  354ms  354ms    2.82   167KB
# … with 1 more variable: `gc/sec` &lt;dbl&gt;, and abbreviated variable names
#   ¹​`itr/sec`, ²​mem_alloc
# ℹ Use `colnames()` to see all variable names
</code></pre></div></div>

<p>Interestingly, running the query on the monthly parquet files is still
faster: it takes about 16% longer on the dataset with one parquet file
per day (354 ms vs. 304 ms minimum time). The benefit of only having to
read a single, smaller file does not offset the overhead of dealing with
many small files. For the benefits of partitioning to become visible, we
would need more data in each parquet file.</p>

<p>We don’t see a performance benefit from having many small files even
when we only need the result for a single day. But how does this
partitioning impact the performance of a query that needs to access rows
scattered across many files? Let’s compare the performance of a query
that counts the number of downloads per day for a given package.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">package_downloads_by_day</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"arrow"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">data</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">filter</span><span class="p">(</span><span class="n">package</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">pkg</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">count</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">arrange</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">collect</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
  </span><span class="n">package_downloads_by_day</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">),</span><span class="w">
  </span><span class="n">package_downloads_by_day</span><span class="p">(</span><span class="n">cran_logs_parquet_by_day</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Warning: Some expressions had a GC in every iteration; so filtering is disabled.

# A tibble: 2 × 6
  expression                                              min   median `itr/sec`
  &lt;bch:expr&gt;                                         &lt;bch:tm&gt; &lt;bch:tm&gt;     &lt;dbl&gt;
1 package_downloads_by_day(cran_logs_parquet)           3.31s    3.31s     0.302
2 package_downloads_by_day(cran_logs_parquet_by_day)    4.46s    4.46s     0.224
# … with 2 more variables: mem_alloc &lt;bch:byt&gt;, `gc/sec` &lt;dbl&gt;
# ℹ Use `colnames()` to see all variable names
</code></pre></div></div>

<p>In this case, the query takes about 35% longer on the dataset with one
parquet file per day (4.46 s vs. 3.31 s). Here, performance suffers from
having to look inside many more files.</p>

<h2 id="conclusion">Conclusion</h2>

<p>This small example illustrates that it can be worth exploring how best
to partition your dataset to benefit the most from the speed that Arrow
brings to your queries. Here, the partitioning that seemed the most
“natural” given the format in which the data is provided (one parquet
file per day) is not the one that makes queries run fastest.</p>

<p>The variables you include in your queries also have a role to play when
deciding how to partition your dataset. It is generally best to
partition your dataset according to the variables you filter on most
often in your queries.</p>
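
<p>For instance, if most of our queries filtered on <code class="language-plaintext highlighter-rouge">package</code>, we could ask Arrow to rewrite the dataset partitioned by that variable instead. A sketch (the output path is hypothetical, and partitioning on a high-cardinality variable such as <code class="language-plaintext highlighter-rouge">package</code> would create a very large number of directories, which comes with its own costs):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>open_dataset("~/datasets/cran-logs-parquet-by-day", format = "parquet") %&gt;%
  write_dataset(
    "~/datasets/cran-logs-parquet-by-package",  ## hypothetical path
    format = "parquet",
    partitioning = "package"
  )
</code></pre></div></div>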

<p>The useR!2022 Arrow tutorial has a <a href="https://arrow-user2022.netlify.app/data-storage.html#multi-file-data-sets">convincing
demonstration</a>
that taking advantage of partitioning for your queries makes them run
much faster.</p>

<details>
  <summary>
    <p>Expand for Session Info</p>
  </summary>

  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sessioninfo</span><span class="o">::</span><span class="n">session_info</span><span class="p">()</span><span class="w">
</span></code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 22.04.1 LTS
 system   x86_64, linux-gnu
 ui       X11
 language en_US
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Paris
 date     2022-09-01
 pandoc   NA (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 arrow         * 9.0.0   2022-08-10 [1] CRAN (R 4.2.1)
 assertthat      0.2.1   2019-03-21 [1] RSPM
 backports       1.4.1   2021-12-13 [1] RSPM
 bench           1.1.2   2021-11-30 [1] RSPM
 bit             4.0.4   2020-08-04 [1] RSPM
 bit64           4.0.5   2020-08-30 [1] RSPM
 broom           1.0.0   2022-07-01 [1] RSPM
 cellranger      1.1.0   2016-07-27 [1] RSPM
 cli             3.3.0   2022-04-25 [1] RSPM (R 4.2.0)
 colorspace      2.0-3   2022-02-21 [1] RSPM
 crayon          1.5.1   2022-03-26 [1] RSPM
 DBI             1.1.3   2022-06-18 [1] RSPM
 dbplyr          2.2.1   2022-06-27 [1] RSPM
 digest          0.6.29  2021-12-01 [1] RSPM
 dplyr         * 1.0.9   2022-04-28 [1] RSPM
 ellipsis        0.3.2   2021-04-29 [1] RSPM
 evaluate        0.15    2022-02-18 [1] RSPM
 fansi           1.0.3   2022-03-24 [1] RSPM
 fastmap         1.1.0   2021-01-25 [1] RSPM
 forcats       * 0.5.1   2021-01-27 [1] RSPM
 fs              1.5.2   2021-12-08 [1] RSPM
 gargle          1.2.0   2021-07-02 [1] RSPM
 generics        0.1.3   2022-07-05 [1] RSPM
 ggplot2       * 3.3.6   2022-05-03 [1] RSPM
 glue            1.6.2   2022-02-24 [1] RSPM (R 4.2.0)
 googledrive     2.0.0   2021-07-08 [1] RSPM
 googlesheets4   1.0.0   2021-07-21 [1] RSPM
 gtable          0.3.0   2019-03-25 [1] RSPM
 haven           2.5.0   2022-04-15 [1] RSPM
 hms             1.1.1   2021-09-26 [1] RSPM
 htmltools       0.5.3   2022-07-18 [1] RSPM
 httr            1.4.3   2022-05-04 [1] RSPM
 jsonlite        1.8.0   2022-02-22 [1] RSPM
 knitr           1.39    2022-04-26 [1] RSPM
 lifecycle       1.0.1   2021-09-24 [1] RSPM
 lubridate       1.8.0   2021-10-07 [1] RSPM
 magrittr        2.0.3   2022-03-30 [1] RSPM
 modelr          0.1.8   2020-05-19 [1] RSPM
 munsell         0.5.0   2018-06-12 [1] RSPM
 pillar          1.8.0   2022-07-18 [1] RSPM
 pkgconfig       2.0.3   2019-09-22 [1] RSPM
 profmem         0.6.0   2020-12-13 [1] RSPM
 purrr         * 0.3.4   2020-04-17 [1] RSPM
 R6              2.5.1   2021-08-19 [1] RSPM
 readr         * 2.1.2   2022-01-30 [1] RSPM
 readxl          1.4.0   2022-03-28 [1] RSPM
 reprex          2.0.1   2021-08-05 [1] RSPM
 rlang           1.0.4   2022-07-12 [1] RSPM (R 4.2.0)
 rmarkdown       2.14    2022-04-25 [1] RSPM
 rvest           1.0.2   2021-10-16 [1] RSPM
 scales          1.2.0   2022-04-13 [1] RSPM
 sessioninfo     1.2.2   2021-12-06 [1] RSPM
 stringi         1.7.8   2022-07-11 [1] RSPM
 stringr       * 1.4.0   2019-02-10 [1] RSPM
 tibble        * 3.1.8   2022-07-22 [1] RSPM
 tidyr         * 1.2.0   2022-02-01 [1] RSPM
 tidyselect      1.1.2   2022-02-21 [1] RSPM
 tidyverse     * 1.3.2   2022-07-18 [1] RSPM
 tzdb            0.3.0   2022-03-28 [1] RSPM
 utf8            1.2.2   2021-07-24 [1] RSPM
 vctrs           0.4.1   2022-04-13 [1] RSPM
 withr           2.5.0   2022-03-03 [1] RSPM
 xfun            0.31    2022-05-10 [1] RSPM
 xml2            1.3.3   2021-11-30 [1] RSPM
 yaml            2.3.5   2022-02-21 [1] RSPM

 [1] /home/francois/.R-library
 [2] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────
</code></pre></div>  </div>

</details>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Arrow exploration" /><category term="r" /><category term="arrow" /><summary type="html"><![CDATA[How does partitioning impact query performance?]]></summary></entry><entry><title type="html">Creating an Arrow dataset</title><link href="https://francoismichonneau.net/2022/08/arrow-dataset-creation/" rel="alternate" type="text/html" title="Creating an Arrow dataset" /><published>2022-08-22T00:00:00+00:00</published><updated>2022-08-22T00:00:00+00:00</updated><id>https://francoismichonneau.net/2022/08/arrow-dataset-creation</id><content type="html" xml:base="https://francoismichonneau.net/2022/08/arrow-dataset-creation/"><![CDATA[<h2 id="background">Background</h2>

<p>While getting started with Apache Arrow, I was intrigued by the variety
of formats Arrow supports. Arrow tutorials tend to start with already
prepared datasets, ready to be ingested by <code class="language-plaintext highlighter-rouge">open_dataset()</code>. I wanted to
explore what it takes to create your own dataset for analysis with
Arrow, and to understand the respective benefits of the different file
formats Arrow supports.</p>

<p>Arrow can read in a variety of formats: <code class="language-plaintext highlighter-rouge">parquet</code>, <code class="language-plaintext highlighter-rouge">arrow</code> (also known
as <code class="language-plaintext highlighter-rouge">ipc</code> and <code class="language-plaintext highlighter-rouge">feather</code>)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, and text-based formats like <code class="language-plaintext highlighter-rouge">csv</code> (as well
as <code class="language-plaintext highlighter-rouge">tsv</code>). Additionally, Arrow provides tools to convert between these
formats.</p>

<p>Being able to import datasets in a variety of formats is helpful, as
you are less constrained by the type of data you can start your analysis
from. However, if you are building a dataset from scratch, which format
should you choose?</p>

<p>To try to answer this question, we will use the <code class="language-plaintext highlighter-rouge">{arrow}</code> R package
to compare how much hard drive space these file formats use, and the
performance of queries on a multi-file dataset stored in each format.
This is not a formal evaluation of the performance of Arrow or of how
best to optimize the partitioning of a dataset; rather, it is a brief
exploration of the tradeoffs that come with the different file formats
Arrow supports. I also don’t explain the differences in the internal
data structures of these formats.</p>

<h2 id="the-dataset">The dataset</h2>

<p>We will be using data from <a href="https://cran-logs.rstudio.com/">https://cran-logs.rstudio.com/</a>. This site
gives you access to the log files for all hits to the CRAN<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> mirror
hosted by RStudio. For each day since October 1st, 2012, there is a
compressed CSV file (file with the extension <code class="language-plaintext highlighter-rouge">.csv.gz</code>) that records the
downloaded packages. Each row contains the date, the time, the name of
the R package downloaded, the R version used, the architecture (32-bit
or 64-bit), the operating system, the country inferred from the IP
address, and a daily unique identifier assigned to each IP address. This
website also has similar data for the daily downloads of R itself, but I
will not be using that data in this post.</p>

<p>For this exploration, we are going to limit ourselves to a couple of
months of data, which will provide enough for our purpose. We will
download the data for the period from June 1st, 2022 to August 15th,
2022.</p>

<p>Arrow is designed to read data that is split across multiple files. So,
you can point <code class="language-plaintext highlighter-rouge">open_dataset()</code> to a directory that contains all the
files that make up your dataset. There is no need to loop over each file
to build your dataset in memory. Splitting your datasets across multiple
files can even make queries on your dataset faster, as only some of the
files might need to be accessed to get the results needed. Depending on
the type of queries you perform most often on your dataset, it can be
worth considering how best to partition your files to accelerate your
analyses (but this is beyond the scope of this post). Here, the files
are provided by date, and we will keep a time-based file organization.</p>

<p>We will use a <a href="https://hive.apache.org/">Hive-style</a> partitioning by
year and month. We will have a directory for each year (there is only
one year in our example), and within it, a directory for each month. The
directories are named according to the convention
<code class="language-plaintext highlighter-rouge">&lt;variable_name&gt;=&lt;value&gt;</code>. So we will want to organize the files as
illustrated below:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>└── year=2022
    ├── month=6
    │   └── &lt;data files&gt;
    ├── month=7
    │   └── &lt;data files&gt;
    └── month=8
        └── &lt;data files&gt;
</code></pre></div></div>
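
<p>Building these directory names only takes a small helper; here is a sketch (the <code class="language-plaintext highlighter-rouge">hive_path()</code> function is my own, not part of <code class="language-plaintext highlighter-rouge">{arrow}</code>):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Compose a Hive-style partition directory for a given year and month
hive_path &lt;- function(root, year, month) {
  file.path(root, sprintf("year=%d", year), sprintf("month=%d", month))
}

hive_path("~/datasets/cran-logs-csv", 2022L, 6L)
## "~/datasets/cran-logs-csv/year=2022/month=6"
</code></pre></div></div>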

<h2 id="import-the-data-as-it-is-provided">Import the data as it is provided</h2>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">arrow</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">fs</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">bench</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">open_dataset()</code> function in the <code class="language-plaintext highlighter-rouge">{arrow}</code> package can directly read
compressed CSV files<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> (with the extension <code class="language-plaintext highlighter-rouge">.csv.gz</code>) as they are
provided on the RStudio CRAN logs website.</p>

<p>As a first step, we can download the files from the site and organize
them using the Hive-style directory structure as shown above.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Check that the date is really a date,</span><span class="w">
</span><span class="c1">## and extract the year and month from it</span><span class="w">
</span><span class="n">parse_date</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">stopifnot</span><span class="p">(</span><span class="w">
    </span><span class="s2">"`date` must be a date"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inherits</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="s2">"Date"</span><span class="p">),</span><span class="w">
    </span><span class="s2">"provide only one date"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">identical</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w">
    </span><span class="s2">"date must be in the past"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">Sys.Date</span><span class="p">()</span><span class="w">
  </span><span class="p">)</span><span class="w">
  </span><span class="nf">list</span><span class="p">(</span><span class="w">
    </span><span class="n">date_chr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w">
    </span><span class="n">year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">year</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1900L</span><span class="p">,</span><span class="w"> 
    </span><span class="n">month</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXlt</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="o">$</span><span class="n">mon</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1L</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1">## Download the data set for a given date from the RStudio CRAN log website.</span><span class="w">
</span><span class="c1">## `date` is a single date for which we want the data</span><span class="w">
</span><span class="c1">## `path` is where we want the data to live</span><span class="w">
</span><span class="n">download_daily_package_logs_csv</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w">
                                            </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/cran-logs-csv"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="c1">## build the URL for the download</span><span class="w">
  </span><span class="n">date</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">parse_date</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w">
  </span><span class="n">url</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="w">
    </span><span class="s1">'https://cran-logs.rstudio.com/'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="s1">'/'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s1">'.csv.gz'</span><span class="w">
  </span><span class="p">)</span><span class="w">

  </span><span class="c1">## build the path for the destination of the download</span><span class="w">
  </span><span class="n">file</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="w">
    </span><span class="n">path</span><span class="p">,</span><span class="w">
    </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"year="</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">year</span><span class="p">),</span><span class="w">
    </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"month="</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">month</span><span class="p">),</span><span class="w">
    </span><span class="n">paste0</span><span class="p">(</span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s2">".csv.gz"</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">

  </span><span class="c1">## create the folder if it doesn't exist</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">dir.exists</span><span class="p">(</span><span class="n">dirname</span><span class="p">(</span><span class="n">file</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">dir.create</span><span class="p">(</span><span class="n">dirname</span><span class="p">(</span><span class="n">file</span><span class="p">),</span><span class="w"> </span><span class="n">recursive</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="c1">## download the file</span><span class="w">
  </span><span class="n">message</span><span class="p">(</span><span class="s2">"Downloading data for "</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="s2">" ... "</span><span class="p">,</span><span class="w"> </span><span class="n">appendLF</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
  </span><span class="n">download.file</span><span class="p">(</span><span class="w">
    </span><span class="n">url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">url</span><span class="p">,</span><span class="w">
    </span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">file</span><span class="p">,</span><span class="w">
    </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"libcurl"</span><span class="p">,</span><span class="w">
    </span><span class="n">quiet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
    </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wb"</span><span class="w">
  </span><span class="p">)</span><span class="w">
  </span><span class="n">message</span><span class="p">(</span><span class="s2">"done."</span><span class="p">)</span><span class="w">

  </span><span class="c1">## quick check to make sure that the file was created</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">file</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">stop</span><span class="p">(</span><span class="s2">"Download failed for "</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="o">$</span><span class="n">date_chr</span><span class="p">,</span><span class="w"> </span><span class="n">call.</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="c1">## return the path</span><span class="w">
  </span><span class="n">file</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## build sequence of dates for which we want the data</span><span class="w">
</span><span class="n">dates_to_get</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="w">
  </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-06-01"</span><span class="p">),</span><span class="w">
  </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"2022-08-15"</span><span class="p">),</span><span class="w">
  </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"day"</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1">## download the data</span><span class="w">
</span><span class="n">walk</span><span class="p">(</span><span class="n">dates_to_get</span><span class="p">,</span><span class="w"> </span><span class="n">download_daily_package_logs_csv</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Let’s check the content of the folder that holds the data we downloaded:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/datasets/cran-logs-csv/
└── year=2022
    ├── month=6
    │   ├── 2022-06-01.csv.gz
    │   ├── 2022-06-02.csv.gz
    │   ├── 2022-06-03.csv.gz
    │   ├── ...
    │   └── 2022-06-30.csv.gz
    ├── month=7
    │   ├── 2022-07-01.csv.gz
    │   ├── 2022-07-02.csv.gz
    │   ├── 2022-07-03.csv.gz
    │   ├── ...
    │   └── 2022-07-31.csv.gz
    └── month=8
        ├── 2022-08-01.csv.gz
        ├── 2022-08-02.csv.gz
        ├── 2022-08-03.csv.gz
        ├── ...
        └── 2022-08-15.csv.gz
</code></pre></div></div>

<p>We have one file for each day, placed in a folder corresponding to their
month. We can now read this data using <code class="language-plaintext highlighter-rouge">{arrow}</code>’s <code class="language-plaintext highlighter-rouge">open_dataset()</code>
function:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cran_logs_csv</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="w">
  </span><span class="s2">"~/datasets/cran-logs-csv/"</span><span class="p">,</span><span class="w">
  </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">,</span><span class="w">
  </span><span class="n">partitioning</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"month"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">cran_logs_csv</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FileSystemDataset with 76 csv files
date: date32[day]
time: time32[s]
size: int64
r_version: string
r_arch: string
r_os: string
package: string
version: string
country: string
ip_id: int64
year: int32
month: int32
</code></pre></div></div>

<p>The partitioning has been taken into consideration as the output shows
that the dataset contains the variables <code class="language-plaintext highlighter-rouge">year</code> and <code class="language-plaintext highlighter-rouge">month</code> which are not
part of the data we downloaded. They are coming from the way we
organized the downloaded files.</p>
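
<p>One practical benefit of this layout is partition pruning: when a
query filters on the partition variables, Arrow only reads the
directories that can match. As a quick sketch (using the dataset opened
above), counting the July downloads only touches the files under
<code class="language-plaintext highlighter-rouge">month=7</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Only the files in year=2022/month=7 need to be scanned
cran_logs_csv %&gt;%
  filter(year == 2022, month == 7) %&gt;%
  count(package, sort = TRUE) %&gt;%
  collect()
</code></pre></div></div>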

<h2 id="convert-to-arrow-and-parquet-files">Convert to Arrow and Parquet files</h2>

<p>Now that we have the compressed CSV files on disk and have opened the
dataset with <code class="language-plaintext highlighter-rouge">open_dataset()</code>, we can convert it to the other file
formats supported by Arrow using <code class="language-plaintext highlighter-rouge">{arrow}</code>’s <code class="language-plaintext highlighter-rouge">write_dataset()</code> function.
We will convert our collection of <code class="language-plaintext highlighter-rouge">.csv.gz</code> files into the Arrow
and Parquet formats.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Convert the dataset into the Arrow format</span><span class="w">
</span><span class="n">write_dataset</span><span class="p">(</span><span class="w">
  </span><span class="n">cran_logs_csv</span><span class="p">,</span><span class="w">
  </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/cran-logs-arrow"</span><span class="p">,</span><span class="w">
  </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"arrow"</span><span class="p">,</span><span class="w">
  </span><span class="n">partitioning</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"month"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1">## Convert the dataset into the Parquet format</span><span class="w">
</span><span class="n">write_dataset</span><span class="p">(</span><span class="w">
  </span><span class="n">cran_logs_csv</span><span class="p">,</span><span class="w">
  </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/datasets/cran-logs-parquet"</span><span class="p">,</span><span class="w">
  </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">,</span><span class="w">
  </span><span class="n">partitioning</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"month"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Let’s inspect the content of the directories that contain these
datasets.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fs</span><span class="o">::</span><span class="n">dir_tree</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-arrow/"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/datasets/cran-logs-arrow/
└── year=2022
    ├── month=6
    │   └── part-0.arrow
    ├── month=7
    │   └── part-0.arrow
    └── month=8
        └── part-0.arrow
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fs</span><span class="o">::</span><span class="n">dir_tree</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet/"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/datasets/cran-logs-parquet/
└── year=2022
    ├── month=6
    │   └── part-0.parquet
    ├── month=7
    │   └── part-0.parquet
    └── month=8
        └── part-0.parquet
</code></pre></div></div>

<p>These two directories follow the same year/month layout as our CSV
files, since we kept the same partitioning, and the files within them
have an extension that matches their format. One difference is that each
month now contains a single file: we used the default values for
<code class="language-plaintext highlighter-rouge">write_dataset()</code>, and the number of rows per month is below the
threshold this function uses to split a partition into multiple
files.</p>
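
<p>If you do want more than one file per partition (for instance to
allow more parallelism when reading), <code class="language-plaintext highlighter-rouge">write_dataset()</code> can cap the
number of rows written to each file. As a sketch (assuming a version of
<code class="language-plaintext highlighter-rouge">{arrow}</code> that provides the <code class="language-plaintext highlighter-rouge">max_rows_per_file</code> argument; the
5-million-row cap below is an arbitrary value for illustration):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Split each month's data into files of at most 5 million rows
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-parquet-split",
  format = "parquet",
  partitioning = c("year", "month"),
  max_rows_per_file = 5e6
)
</code></pre></div></div>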

<h2 id="comparison-of-the-different-formats">Comparison of the different formats</h2>

<p>Let’s compare how much space these different file formats take on disk:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_size</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">fs</span><span class="o">::</span><span class="n">dir_info</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="w"> </span><span class="n">recurse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">filter</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"file"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">pull</span><span class="p">(</span><span class="n">size</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="nf">sum</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">tribble</span><span class="p">(</span><span class="w">
  </span><span class="o">~</span><span class="w"> </span><span class="n">Format</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w">
  </span><span class="s2">"Compressed CSV"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-csv/"</span><span class="p">),</span><span class="w">
  </span><span class="s2">"Arrow"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-arrow/"</span><span class="p">),</span><span class="w">
  </span><span class="s2">"Parquet"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset_size</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet/"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> 
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 3 × 2
  Format                size
  &lt;chr&gt;          &lt;fs::bytes&gt;
1 Compressed CSV       5.01G
2 Arrow               29.67G
3 Parquet              5.06G
</code></pre></div></div>

<p>The Arrow format takes the most space, at almost 30GB, while both the
compressed CSV and the Parquet files use about 5GB of disk space.</p>

<p>We are now set up to compare the performance of computations on these
different dataset formats.</p>

<p>Let’s open these datasets with the different formats:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cran_logs_csv</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-csv/"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">)</span><span class="w">
</span><span class="n">cran_logs_arrow</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-arrow/"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"arrow"</span><span class="p">)</span><span class="w">
</span><span class="n">cran_logs_parquet</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">open_dataset</span><span class="p">(</span><span class="s2">"~/datasets/cran-logs-parquet/"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"parquet"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>We will compare how long it takes for Arrow to compute the 10 most
downloaded packages in the time period our dataset covers using each
file format.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_10_packages</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">data</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">count</span><span class="p">(</span><span class="n">package</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">head</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">mutate</span><span class="p">(</span><span class="n">n_million_downloads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">1e6</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">select</span><span class="p">(</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
    </span><span class="n">collect</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">bench</span><span class="o">::</span><span class="n">mark</span><span class="p">(</span><span class="w">
  </span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_csv</span><span class="p">),</span><span class="w">
  </span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_arrow</span><span class="p">),</span><span class="w">
  </span><span class="n">top_10_packages</span><span class="p">(</span><span class="n">cran_logs_parquet</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Warning: Some expressions had a GC in every iteration; so filtering is disabled.

# A tibble: 3 × 6
  expression                              min   median itr/se…¹ mem_al…² gc/se…³
  &lt;bch:expr&gt;                         &lt;bch:tm&gt; &lt;bch:tm&gt;    &lt;dbl&gt; &lt;bch:by&gt;   &lt;dbl&gt;
1 top_10_packages(cran_logs_csv)       29.57s   29.57s   0.0338   8.19MB   0    
2 top_10_packages(cran_logs_arrow)       2.1s     2.1s   0.475  165.39KB   0.475
3 top_10_packages(cran_logs_parquet)    3.32s    3.32s   0.301  137.11KB   0    
# … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`
</code></pre></div></div>

<p>While this task takes only 2 to 3 seconds on the Arrow or Parquet
files, it takes close to 30 seconds on the CSV files: about a tenfold
difference.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Having Arrow point directly to a folder of compressed CSV files might
be the most convenient option, but it comes at a significant performance
cost. Arrow and Parquet have similar performance, but the Parquet files
take less space on disk and are more suitable for long-term storage.
This is why large datasets like the NYC taxi data are distributed as a
series of Parquet files.</p>

<p>In the future, I might explore how using different variables for
partitioning or how the number of files in the partitions affects the
performance of the queries (EDIT: this <a href="/2022/09/arrow-dataset-part-2/">post is now available</a>). If you have other ideas
of topics that you would like me to explore, do not hesitate to leave a
comment below.</p>

<h2 id="going-further">Going further</h2>

<p>If you would like to learn more about the different formats, check out
the <a href="https://arrow-user2022.netlify.app/">Arrow workshop</a> (especially
<a href="https://arrow-user2022.netlify.app/data-storage.html">Part 3: Data
Storage</a>) that
Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt taught at
useR!2022.</p>

<h2 id="acknowledgments">Acknowledgments</h2>

<p>Thank you to <a href="https://twitter.com/kae_suarez/">Kae Suarez</a> and
<a href="https://djnavarro.net">Danielle Navarro</a> for reviewing this post.</p>

<h2 id="post-scriptum">Post Scriptum</h2>

<p>I wrote a <a href="/2022/09/arrow-dataset-part-2/">follow-up post</a> that explores the impact of partitioning the dataset on
performance.</p>

<details>
  <summary>
    <p>Expand for Session Info</p>
  </summary>

  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sessioninfo</span><span class="o">::</span><span class="n">session_info</span><span class="p">()</span><span class="w">
</span></code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 22.04.1 LTS
 system   x86_64, linux-gnu
 ui       X11
 language en_US
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Paris
 date     2022-08-19
 pandoc   NA (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 arrow         * 9.0.0   2022-08-10 [1] CRAN (R 4.2.1)
 assertthat      0.2.1   2019-03-21 [1] RSPM
 backports       1.4.1   2021-12-13 [1] RSPM
 bench         * 1.1.2   2021-11-30 [1] RSPM
 bit             4.0.4   2020-08-04 [1] RSPM
 bit64           4.0.5   2020-08-30 [1] RSPM
 broom           1.0.0   2022-07-01 [1] RSPM
 cellranger      1.1.0   2016-07-27 [1] RSPM
 cli             3.3.0   2022-04-25 [1] RSPM (R 4.2.0)
 colorspace      2.0-3   2022-02-21 [1] RSPM
 crayon          1.5.1   2022-03-26 [1] RSPM
 DBI             1.1.3   2022-06-18 [1] RSPM
 dbplyr          2.2.1   2022-06-27 [1] RSPM
 digest          0.6.29  2021-12-01 [1] RSPM
 dplyr         * 1.0.9   2022-04-28 [1] RSPM
 ellipsis        0.3.2   2021-04-29 [1] RSPM
 evaluate        0.15    2022-02-18 [1] RSPM
 fansi           1.0.3   2022-03-24 [1] RSPM
 fastmap         1.1.0   2021-01-25 [1] RSPM
 forcats       * 0.5.1   2021-01-27 [1] RSPM
 fs            * 1.5.2   2021-12-08 [1] RSPM
 gargle          1.2.0   2021-07-02 [1] RSPM
 generics        0.1.3   2022-07-05 [1] RSPM
 ggplot2       * 3.3.6   2022-05-03 [1] RSPM
 glue            1.6.2   2022-02-24 [1] RSPM (R 4.2.0)
 googledrive     2.0.0   2021-07-08 [1] RSPM
 googlesheets4   1.0.0   2021-07-21 [1] RSPM
 gtable          0.3.0   2019-03-25 [1] RSPM
 haven           2.5.0   2022-04-15 [1] RSPM
 hms             1.1.1   2021-09-26 [1] RSPM
 htmltools       0.5.3   2022-07-18 [1] RSPM
 httr            1.4.3   2022-05-04 [1] RSPM
 jsonlite        1.8.0   2022-02-22 [1] RSPM
 knitr           1.39    2022-04-26 [1] RSPM
 lifecycle       1.0.1   2021-09-24 [1] RSPM
 lubridate       1.8.0   2021-10-07 [1] RSPM
 magrittr        2.0.3   2022-03-30 [1] RSPM
 modelr          0.1.8   2020-05-19 [1] RSPM
 munsell         0.5.0   2018-06-12 [1] RSPM
 pillar          1.8.0   2022-07-18 [1] RSPM
 pkgconfig       2.0.3   2019-09-22 [1] RSPM
 purrr         * 0.3.4   2020-04-17 [1] RSPM
 R6              2.5.1   2021-08-19 [1] RSPM
 readr         * 2.1.2   2022-01-30 [1] RSPM
 readxl          1.4.0   2022-03-28 [1] RSPM
 reprex          2.0.1   2021-08-05 [1] RSPM
 rlang           1.0.4   2022-07-12 [1] RSPM (R 4.2.0)
 rmarkdown       2.14    2022-04-25 [1] RSPM
 rvest           1.0.2   2021-10-16 [1] RSPM
 scales          1.2.0   2022-04-13 [1] RSPM
 sessioninfo     1.2.2   2021-12-06 [1] RSPM
 stringi         1.7.8   2022-07-11 [1] RSPM
 stringr       * 1.4.0   2019-02-10 [1] RSPM
 tibble        * 3.1.8   2022-07-22 [1] RSPM
 tidyr         * 1.2.0   2022-02-01 [1] RSPM
 tidyselect      1.1.2   2022-02-21 [1] RSPM
 tidyverse     * 1.3.2   2022-07-18 [1] RSPM
 tzdb            0.3.0   2022-03-28 [1] RSPM
 utf8            1.2.2   2021-07-24 [1] RSPM
 vctrs           0.4.1   2022-04-13 [1] RSPM
 withr           2.5.0   2022-03-03 [1] RSPM
 xfun            0.31    2022-05-10 [1] RSPM
 xml2            1.3.3   2021-11-30 [1] RSPM
 yaml            2.3.5   2022-02-21 [1] RSPM

 [1] /home/francois/.R-library
 [2] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────
</code></pre></div>  </div>

</details>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Feather was the first iteration of the file format (v1); the Arrow
Interprocess Communication (IPC) file format is the newer version
(v2) and has many new features. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Comprehensive R Archive Network, the repository for R packages. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>since Arrow 9.0.0 <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Arrow Exploration" /><category term="r" /><category term="arrow" /><summary type="html"><![CDATA[An exploration of the file formats that Arrow can read and write.]]></summary></entry><entry><title type="html">`foghorn` 1.3.1 released</title><link href="https://francoismichonneau.net/2020/09/foghorn-1.3.1/" rel="alternate" type="text/html" title="`foghorn` 1.3.1 released" /><published>2020-09-08T00:00:00+00:00</published><updated>2020-09-08T00:00:00+00:00</updated><id>https://francoismichonneau.net/2020/09/foghorn-1.3.1</id><content type="html" xml:base="https://francoismichonneau.net/2020/09/foghorn-1.3.1/"><![CDATA[<p>A new version of <a href="https://cran.r-project.org/package=foghorn"><code class="language-plaintext highlighter-rouge">foghorn</code></a>
(version 1.3.1) was just accepted on CRAN.</p>

<p><code class="language-plaintext highlighter-rouge">foghorn</code> is an R package that allows you to:</p>
<ul>
  <li>browse the results of the CRAN checks on your package (with <a href="https://fmichonneau.github.io/foghorn/reference/cran_results.html"><code class="language-plaintext highlighter-rouge">cran_results()</code></a>
and <a href="https://fmichonneau.github.io/foghorn/reference/cran_details.html"><code class="language-plaintext highlighter-rouge">cran_details()</code></a>);</li>
  <li>check where your package stands when submitted to CRAN (with
<a href="https://fmichonneau.github.io/foghorn/reference/cran_incoming.html"><code class="language-plaintext highlighter-rouge">cran_incoming()</code></a>);</li>
  <li>and starting with version 1.3.1, check whether your package is in the
Win-builder queue (with <a href="https://fmichonneau.github.io/foghorn/reference/winbuilder_queue.html"><code class="language-plaintext highlighter-rouge">winbuilder_queue()</code></a>).</li>
</ul>

<p>The idea of inspecting the Win-builder queue <a href="https://github.com/fmichonneau/foghorn/issues/40">was proposed</a> by
Kirill Müller.</p>

<p>If you would like to start using <code class="language-plaintext highlighter-rouge">foghorn</code>, check out the
<a href="https://fmichonneau.github.io/foghorn/articles/foghorn.html">vignette</a> that
comes with the package.</p>

<p><a href="https://github.com/fmichonneau/foghorn/issues/new">Feedback and suggestions</a> for <code class="language-plaintext highlighter-rouge">foghorn</code> are welcome!</p>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Hacking" /><category term="r" /><category term="foghorn" /><summary type="html"><![CDATA[New version of foghorn provides access to Win-builder queue]]></summary></entry><entry><title type="html">Migrate from Gmail to HelpScout with R</title><link href="https://francoismichonneau.net/2020/04/gmail-helpscout-migration/" rel="alternate" type="text/html" title="Migrate from Gmail to HelpScout with R" /><published>2020-04-17T00:00:00+00:00</published><updated>2020-04-17T00:00:00+00:00</updated><id>https://francoismichonneau.net/2020/04/gmail-helpscout-migration</id><content type="html" xml:base="https://francoismichonneau.net/2020/04/gmail-helpscout-migration/"><![CDATA[<h2 id="preamble">Preamble</h2>

<ul>
  <li>This is a long and somewhat dense post. Even if you do not have to migrate
emails from Gmail to HelpScout, I hope this post will be useful to you, as the
general approach could be interesting to other problems that involve working
with APIs.</li>
  <li>The full code I actually used for the email migration is available at:
<a href="https://github.com/carpentries/emailmigration">https://github.com/carpentries/emailmigration</a> and I include links pointing
to functions in the GitHub repo<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> throughout the post below to illustrate
my points.</li>
</ul>

<h2 id="the-problem-and-its-solution">The problem and its solution</h2>

<p>At <a href="https://carpentries.org">The Carpentries</a>, <a href="https://carpentries.org/regionalcoordinators/">Regional
Coordinators</a> help us organize
workshops across the globe. In the past, each Regional Coordinator was
set up with a Gmail account (through The Carpentries’s GSuite plan). However, as
the number of Regional Coordinators grew, and as some geographic areas have more
than one Regional Coordinator, the Gmail account model was starting to cause
some issues.</p>

<p>The Carpentries Core Team has been using HelpScout for a while, and it is a much more suitable tool for managing emails and inboxes as a team.</p>

<p>The main challenge with transitioning the Regional Coordinators to using HelpScout was to import the old messages from Gmail to HelpScout. To tackle this problem, I used R and this blog post describes the approach I took.</p>

<h2 id="technical-overview">Technical overview</h2>

<p>Before doing anything else, we used the GSuite data migration tool to transfer
all emails for each Regional Coordinator account into a single account. Having
all the emails to import in the same place makes things easier.</p>

<p>This post goes through the steps I took to perform this migration:</p>

<ol>
  <li>Figure out authentication with the Gmail API, and with the HelpScout API</li>
  <li>Get familiar with the HelpScout API and write R functions to perform the
tasks needed</li>
  <li>Convert Gmail threads into HelpScout conversations</li>
  <li>Test migration on 100 Gmail threads</li>
  <li>Perform the full migration</li>
</ol>

<p>Choice of packages and approach:</p>

<ul>
  <li>Working with the Gmail API is made much easier with the wonderful
<a href="https://gmailr.r-lib.org/"><code class="language-plaintext highlighter-rouge">gmailr</code></a> package.</li>
  <li>I didn’t find an already made package to work with the HelpScout web API so I
wrote a few functions to interact with the endpoints I needed using the
<a href="https://httr.r-lib.org/"><code class="language-plaintext highlighter-rouge">httr</code></a> package.</li>
  <li>The mechanics of converting the data coming from the Gmail web API into the
format needed by the HelpScout API to import the conversations was done using
the <a href="https://r6.r-lib.org/"><code class="language-plaintext highlighter-rouge">R6</code></a> package. The R6 classes made it
easy to separate the storage of each element needed by the HelpScout API (as
private fields) from the actual formatting (handled by methods).</li>
  <li>When working with web APIs, a lot can go wrong: a weird data format your
code doesn’t know how to handle, your internet connection going down, hitting
the rate limit, etc. Therefore, I used the
<a href="https://richfitz.github.io/storr/"><code class="language-plaintext highlighter-rouge">storr</code></a> package to cache (1) the R6 objects that
act as the bridge between the two APIs, and (2) the responses from the HelpScout
API, to make sure all the threads were converted correctly.</li>
  <li>I organized all the code as a bare-bones package. It makes code management
easier and is a good habit to adopt. Here it was a one-off task, but if it were
something I used regularly, it would mean I could develop tests, write
documentation, and enable continuous testing. I could then write and update my
code, and rely on <code class="language-plaintext highlighter-rouge">devtools::load_all()</code>.</li>
</ul>

<h2 id="1-authentication">1. Authentication</h2>

<h3 id="11-gmail-api">1.1. Gmail API</h3>

<p>The instructions in the <code class="language-plaintext highlighter-rouge">gmailr</code> package’s
<a href="https://gmailr.r-lib.org/#setup">README</a> are clear. You can use the <code class="language-plaintext highlighter-rouge">gm_threads()</code> function, for instance, to check that the authentication is working as expected.</p>

<h3 id="12-the-helpscout-api">1.2. The HelpScout API</h3>

<p>The HelpScout API uses the OAuth 2.0 protocol. The <code class="language-plaintext highlighter-rouge">httr</code> package handles this well.</p>

<p>Create a new app within HelpScout, and use <code class="language-plaintext highlighter-rouge">https://localhost:1410/</code> for the redirect URL. Take note of the key and secret. Use this information to create a new app object in R with <code class="language-plaintext highlighter-rouge">httr</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hs_app</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">httr</span><span class="o">::</span><span class="n">oauth_app</span><span class="p">(</span><span class="w">
  </span><span class="s2">"helpscout"</span><span class="p">,</span><span class="w">
  </span><span class="n">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"&lt;your app key here&gt;"</span><span class="p">,</span><span class="w">
  </span><span class="n">secret</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"&lt;your app secret here&gt;"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>and then use this object to do the authentication online:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hs_token</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">httr</span><span class="o">::</span><span class="n">oauth2.0_token</span><span class="p">(</span><span class="w">
  </span><span class="n">httr</span><span class="o">::</span><span class="n">oauth_endpoint</span><span class="p">(</span><span class="w">
    </span><span class="n">authorize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"https://secure.helpscout.net/authentication/authorizeClientApplication"</span><span class="p">,</span><span class="w">
    </span><span class="n">access</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"https://api.helpscout.net/v2/oauth2/token"</span><span class="p">),</span><span class="w">
  </span><span class="n">app</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hs_app</span><span class="p">)</span><span class="w">

</span><span class="n">htoken</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">httr</span><span class="o">::</span><span class="n">config</span><span class="p">(</span><span class="n">token</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hs_token</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>We can then use the <code class="language-plaintext highlighter-rouge">htoken</code> object across all our calls to the HelpScout web API.</p>

<h2 id="2-getting-started-with-the-helpscout-web-api">2. Getting started with the HelpScout web API</h2>

<p>When working with a new web API, first read the documentation to understand how things are set up. From this initial reading, it became clear that Gmail and HelpScout use different words for related concepts.</p>

<table>
  <thead>
    <tr>
      <th>HelpScout</th>
      <th>Gmail</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>thread</td>
      <td>message</td>
    </tr>
    <tr>
      <td>conversation</td>
      <td>thread</td>
    </tr>
  </tbody>
</table>

<p>Keeping this straight in my mind took some time… and because I’m more used to the terms used by Gmail, I used this vocabulary in my function names (for the most part).</p>

<p>Another thing that I needed was HelpScout’s internal identifier for the mailbox into which the emails were being imported. So the first function I wrote against HelpScout’s API was <code class="language-plaintext highlighter-rouge">hs_mailbox_id()</code>, which returned the internal identifier for the mailbox of interest to me.</p>

<p>The second thing I needed to do was to make sure I understood how to use the API to import an actual conversation. I started with fake data I could control, so that I had something simple that I knew worked and that I could compare against when things didn’t work with real data. Even if the documentation of an API is good, there are, more often than not, small details that are not described and that you need to figure out. Having this data as a starting point is useful for these tests.</p>

<p>The actual code to create a new <del>thread</del> conversation in HelpScout ended up being:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hs_create_thread</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">thread</span><span class="p">,</span><span class="w"> </span><span class="n">hstoken</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">body</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">jsonlite</span><span class="o">::</span><span class="n">toJSON</span><span class="p">(</span><span class="n">thread</span><span class="p">,</span><span class="w"> </span><span class="n">auto_unbox</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

  </span><span class="n">httr</span><span class="o">::</span><span class="n">POST</span><span class="p">(</span><span class="w">
    </span><span class="s2">"https://api.helpscout.net"</span><span class="p">,</span><span class="w">
    </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"/v2/conversations"</span><span class="p">,</span><span class="w">
    </span><span class="n">body</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">body</span><span class="p">,</span><span class="w">
    </span><span class="n">htoken</span><span class="p">,</span><span class="w">
    </span><span class="n">httr</span><span class="o">::</span><span class="n">content_type</span><span class="p">(</span><span class="s2">"application/json; charset=UTF-8"</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This is not the code I would have written if it were part of a package intended for others to use. For instance, I would have wanted to check the response of the API after each request. But for my particular use case, it was easier to return this response and inspect it manually after the fact, once I had confirmed that this code worked for most requests.</p>

<h2 id="2-extracting-the-content-of-the-emails-from-gmail">3. Extracting the content of the emails from Gmail</h2>

<p>This was the most time-consuming part, as lots of unexpected details came up while getting a smooth conversion between the two APIs.</p>

<h3 id="21-things-that-were-easy">2.1. Things that were easy</h3>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">gmailr::gm_subject()</code> function worked every time to get the
subject of the thread for each message.</li>
</ul>

<h3 id="22-things-that-were-almost-easy">2.2. Things that were almost easy</h3>

<ul>
  <li><a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L128-L150">Extracting the people involved in the conversation</a>. The <code class="language-plaintext highlighter-rouge">gmailr::gm_to()</code> and
<code class="language-plaintext highlighter-rouge">gmailr::gm_from()</code> functions worked well to extract the email addresses. The small
catch was that some email addresses were formatted as <code class="language-plaintext highlighter-rouge">FirstName LastName
&lt;email@address.rr&gt;</code>, others had only <code class="language-plaintext highlighter-rouge">email@address.rr</code>, and when multiple
people were involved a comma separated them. However, some people have a comma
in their names, which made splitting on commas unreliable.</li>
  <li><a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L121-L126">Extracting the date</a>. The <code class="language-plaintext highlighter-rouge">gmailr::gm_date()</code> function returns the date from the email in
<a href="https://en.wikipedia.org/wiki/Unix_time">Unix time</a>. The <code class="language-plaintext highlighter-rouge">anytime</code>
<a href="https://cran.r-project.org/web/packages/anytime/index.html">package</a> is
useful for converting Unix time into other formats, including the ISO 8601
format expected by the HelpScout API. I still had to manually add a final
<code class="language-plaintext highlighter-rouge">Z</code> to the character string.</li>
</ul>
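<p>That date conversion can be sketched in base R (a hypothetical helper that does the same job as <code class="language-plaintext highlighter-rouge">anytime</code>; <code class="language-plaintext highlighter-rouge">ts</code> stands for the Unix timestamp returned for a message):</p>

```r
# Convert a Unix timestamp (seconds since the epoch) into the
# ISO 8601 UTC string expected by the HelpScout API, appending
# the trailing "Z" manually.
unix_to_iso8601 <- function(ts) {
  t <- as.POSIXct(as.numeric(ts), origin = "1970-01-01", tz = "UTC")
  paste0(format(t, "%Y-%m-%dT%H:%M:%S"), "Z")
}

unix_to_iso8601(1587081600)
# "2020-04-17T00:00:00Z"
```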

<h3 id="23-things-that-were-not-so-easy">2.3. Things that were not so easy</h3>

<ul>
  <li><a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L14-L24">Extracting the email attachments</a>. The attachments themselves are not returned
by the API. Instead, the API returns a URL that points to the address where
the attachments can be retrieved. The HelpScout API accepts the attachments
as <a href="https://en.wikipedia.org/wiki/Base64">base64-encoded</a> strings. The
<code class="language-plaintext highlighter-rouge">gmailr</code> package helped to retrieve this data, but the data returned by the
Gmail API is base64url encoded. Thankfully, converting to regular
base64 is a short regular expression substitution away once you know the
difference between the two.</li>
  <li>The thing that was the most puzzling was parsing the actual body of the
emails. The <code class="language-plaintext highlighter-rouge">gmailr::gm_body()</code> function worked for only a small fraction of the emails
I had to deal with. After many trials and errors, <a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L48">I wrote a function</a> to
reliably retrieve the content of the emails<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. There were many situations to deal with, as the messages can be:
    <ul>
      <li>“multipart”: the body of the email is provided both in plain text and
 in HTML format, which allows email clients that don’t support
 HTML formatting to display the plain text version of the message;</li>
      <li>either plain text only or HTML only;</li>
      <li>provided as attachments (which is what some email clients do when you forward a
message).</li>
    </ul>

    <p>Depending on the situation, the location of the body of the email within the
deeply nested list that was returned by the Gmail API could vary. I ended up
writing a recursive algorithm that traversed the list to find and retrieve the
relevant content of the emails.</p>

    <p>The last catch was that plain text messages that included a URL were
interpreted by the HelpScout API as being HTML-formatted. It meant that the
whitespace indicating the line breaks was ignored, making the bodies of the
messages large blocks of text that were very hard to read and follow. I
relied on <code class="language-plaintext highlighter-rouge">commonmark::markdown_html()</code> to <a href="https://github.com/carpentries/emailmigration/blob/master/R/gmail.R#L1-L6">convert these plain text
messages</a> into HTML that then looked good once they were uploaded to
HelpScout using the API.</p>
  </li>
</ul>
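<p>Two of the trickier steps above can be sketched in a few lines of base R (hypothetical helpers, not the exact functions from the repository): converting a base64url string into regular base64, and recursively searching the nested message payload for a part with a given MIME type.</p>

```r
# base64url differs from base64 only in its alphabet ("-" and "_"
# instead of "+" and "/") and in omitting the "=" padding; restore both.
b64url_to_b64 <- function(x) {
  x <- chartr("-_", "+/", x)
  paste0(x, strrep("=", (4 - nchar(x) %% 4) %% 4))
}

# Walk the nested list returned by the Gmail API and return the body
# data of the first part matching `mime` (this assumes the list mirrors
# Gmail's "parts"/"mimeType"/"body" structure).
find_body <- function(part, mime = "text/plain") {
  if (identical(part$mimeType, mime) && !is.null(part$body$data)) {
    return(part$body$data)
  }
  for (p in part$parts) {
    hit <- find_body(p, mime)
    if (!is.null(hit)) return(hit)
  }
  NULL
}
```

<p>The real function also had to handle bodies supplied as attachments, but a recursive descent over the nested list is the core of the approach.</p>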

<h2 id="3-conversion-between-gmail-and-helpscout">4. Conversion between Gmail and HelpScout</h2>

<p>Now that I had access to all the relevant information from the emails, I needed to format it so it could be imported by the HelpScout API. For this, I used the R6 object-oriented programming system.</p>

<p>Each element coming from the Gmail API was individually stored as a private field, and an accessor method (<code class="language-plaintext highlighter-rouge">$get()</code>) created the list in the format needed to be ingested by HelpScout’s API.</p>

<p>I used 3 classes for this:</p>

<ul>
  <li><a href="https://github.com/carpentries/emailmigration/blob/master/R/HelpScout-classes.R#L85">one for the HelpScout conversations</a> (the Gmail threads)</li>
  <li><a href="https://github.com/carpentries/emailmigration/blob/master/R/HelpScout-classes.R#L30">one for the HelpScout threads</a> (the Gmail messages)</li>
  <li><a href="https://github.com/carpentries/emailmigration/blob/master/R/HelpScout-classes.R#L6">one for the attachments</a></li>
</ul>

<p>This modularity helped debugging and limited the complexity of each class.</p>
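<p>The pattern can be illustrated with a minimal sketch (hypothetical class and field names, assuming the R6 package is installed; the real classes in the repository hold many more fields):</p>

```r
library(R6)

# Each piece of data from the Gmail API is stored as a private field;
# the public get() accessor assembles the list in the shape needed by
# the HelpScout import endpoint (the field names here are illustrative).
HsThreadSketch <- R6Class("HsThreadSketch",
  public = list(
    initialize = function(subject, created_at) {
      private$subject <- subject
      private$created_at <- created_at
    },
    get = function() {
      list(
        type = "customer",
        subject = private$subject,
        createdAt = private$created_at
      )
    }
  ),
  private = list(
    subject = NULL,
    created_at = NULL
  )
)

th <- HsThreadSketch$new("Workshop request", "2020-04-17T00:00:00Z")
str(th$get())
```

<p>Keeping the raw fields private and doing the formatting inside the accessor means the HelpScout-specific shape lives in exactly one place.</p>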

<p>Because all the emails are going to be in the same inbox in HelpScout, I wanted an easy way to tag the conversations based on the team of Regional Coordinators that were involved. The R6 system was useful for this because once the email information was stored within the object, I could use a private method called by the accessor to extract all the people involved, and add tags in HelpScout to help Regional Coordinators find past conversations that are relevant to them.</p>

<p>It was one of the first times I used R6<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> for a real task, and I could see its potential. If the code written here were for public consumption, it would have provided a good framework for adding more tests on the data structures of the individual elements coming from the Gmail API, to ensure that the output from the accessor method was always formatted correctly before converting it into the format required by HelpScout’s API.</p>

<h2 id="4-caching">5. Caching</h2>

<p>My previous experience working with web APIs has taught me that things can go wrong, and it is always a good idea to keep track (on disk, and not only in memory) of the requests that have been tried and the ones that have not, and of the requests that succeeded and the ones that failed. In particular, when your script makes thousands of API calls, you don’t want to have to run everything again because your internet connection went down for a short while, or because the data was not formatted properly in some edge case.</p>

<p>For this, I use the <a href="https://richfitz.github.io/storr/"><code class="language-plaintext highlighter-rouge">storr</code> package</a> and its functionality to rely on hooks to retrieve external data. <code class="language-plaintext highlighter-rouge">storr</code> is a key-value store. It is not that different from using variable names to store objects in memory, as you normally do in your R session:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## setting a variable</span><span class="w">
</span><span class="n">cat_name</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"Felix"</span><span class="w">

</span><span class="c1">## getting the content of the variable</span><span class="w">
</span><span class="n">cat_name</span><span class="w">
</span></code></pre></div></div>

<p>When using a <code class="language-plaintext highlighter-rouge">storr</code> store:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## defining the storr</span><span class="w">
</span><span class="n">st</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">storr</span><span class="o">::</span><span class="n">storr_rds</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cache"</span><span class="p">)</span><span class="w">

</span><span class="c1">## setting a variable</span><span class="w">
</span><span class="n">st</span><span class="o">$</span><span class="n">set</span><span class="p">(</span><span class="s2">"cat_name"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Felix"</span><span class="p">)</span><span class="w">

</span><span class="c1">## getting the variable name</span><span class="w">
</span><span class="n">st</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"cat_name"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The difference is that <code class="language-plaintext highlighter-rouge">storr</code> provides different backends for storing your objects; if, as in this example, you use <code class="language-plaintext highlighter-rouge">storr_rds</code>, your objects are stored as <code class="language-plaintext highlighter-rouge">rds</code> files on your disk and remain available beyond your current R session. How does that help with the problem here?</p>

<p>A great feature of <code class="language-plaintext highlighter-rouge">storr</code> is that you can set up your store to call a function to create the object instead of providing it directly with <code class="language-plaintext highlighter-rouge">$set()</code>.</p>

<p>It means that instead of storing a value directly, you only provide a key; the first time you request that key, the hook function is called to create the value, which is then cached in the store:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## the hook function</span><span class="w">
</span><span class="n">fetch_hook_random_cat_name</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">sample</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Felix"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Garfield"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Tigger"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mowgli"</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1">## defining the storr</span><span class="w">
</span><span class="n">st</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">storr</span><span class="o">::</span><span class="n">storr_external</span><span class="p">(</span><span class="w">
  </span><span class="n">storr</span><span class="o">::</span><span class="n">driver_rds</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cache"</span><span class="p">),</span><span class="w">
  </span><span class="n">fetch_hook_random_cat_name</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1">## the first time you call a key, it will run the hook function</span><span class="w">
</span><span class="n">st</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"cat_name"</span><span class="p">)</span><span class="w">

</span><span class="c1">## subsequently, it will return the value stored in the store</span><span class="w">
</span><span class="n">st</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"cat_name"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The hook function always takes the two arguments <code class="language-plaintext highlighter-rouge">key</code> and <code class="language-plaintext highlighter-rouge">namespace</code>, but, as in the example above, they do not need to be used in the body of the function.</p>

<p>We can extend this approach to store the output of time-consuming computations or the results of API calls<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. For instance, here, I created <a href="https://github.com/carpentries/emailmigration/blob/master/R/caching.R#L7">a store</a> to keep the output of the function <code class="language-plaintext highlighter-rouge">convert_gmail_thread()</code>, and used <code class="language-plaintext highlighter-rouge">get_gmail_thread()</code> as a wrapper to access the store.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fetch_hook_gmail_threads</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">convert_gmail_thread</span><span class="p">(</span><span class="n">key</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">store_gmail_threads</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cache/threads"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">storr</span><span class="o">::</span><span class="n">storr_external</span><span class="p">(</span><span class="w">
    </span><span class="n">storr</span><span class="o">::</span><span class="n">driver_rds</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w">
    </span><span class="n">fetch_hook_gmail_threads</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">get_gmail_thread</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">thread_id</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">store_gmail_threads</span><span class="p">()</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="n">thread_id</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">get_gmail_thread()</code> is called with a <code class="language-plaintext highlighter-rouge">thread_id</code> that has not been retrieved from the Gmail API before, <code class="language-plaintext highlighter-rouge">convert_gmail_thread()</code> is invoked: it fetches all the information needed for that particular thread and stores it in an R6-class object. If another part of the script fails, we do not need to redo the calls to the Gmail API; instead, the cached copy is retrieved from the store.</p>
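<p>To make the caching behavior concrete, calling the function twice with the same identifier only hits the API once (the thread ID and namespace below are made up):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## First call: cache miss, so convert_gmail_thread() queries the Gmail API
thread &lt;- get_gmail_thread("16fa2d8e9b3c1a00", namespace = "attempt-1")

## Second call: cache hit, the object is read back from cache/threads
thread_again &lt;- get_gmail_thread("16fa2d8e9b3c1a00", namespace = "attempt-1")
</code></pre></div></div>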

<p>I used a similar approach to <a href="https://github.com/carpentries/emailmigration/blob/master/R/caching.R#L71">store the responses from the HelpScout API</a>, wrapping at the same time the call to the <code class="language-plaintext highlighter-rouge">get_gmail_thread()</code> function above. A slightly simplified version of what I used is:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fetch_hook_hs_response</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">get_gmail_thread</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w">
  </span><span class="n">hs_create_thread</span><span class="p">(</span><span class="n">res</span><span class="o">$</span><span class="n">get</span><span class="p">(),</span><span class="w"> </span><span class="n">htoken</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">store_hs_responses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cache/hs_responses"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">storr</span><span class="o">::</span><span class="n">storr_external</span><span class="p">(</span><span class="w">
    </span><span class="n">storr</span><span class="o">::</span><span class="n">driver_rds</span><span class="p">(</span><span class="n">path</span><span class="p">),</span><span class="w">
    </span><span class="n">fetch_hook_hs_response</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">get_hs_response</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">thread_id</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">store_hs_responses</span><span class="p">()</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="n">thread_id</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span></code></pre></div></div>

<p>So, what’s happening here? I use the Gmail thread ID as a single point of entry for the entire script (retrieve the thread from the Gmail API, convert it to the format expected by the HelpScout API, upload the thread to HelpScout). Depending on whether a query has already been made and its result stored in the cache, the script either calls the API or retrieves the object stored on disk.</p>

<p>What does the <code class="language-plaintext highlighter-rouge">namespace</code> argument do? Namespacing in <code class="language-plaintext highlighter-rouge">storr</code> lets you organize the objects in your store. In particular, it lets you keep objects with the same key but different values. Here, I planned to use namespaces to keep track of my different attempts: if the first attempt failed for some threads, I could fix the problem in the code and re-attempt the HelpScout API calls, just for the ones that failed, under a different namespace.</p>
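<p>A minimal sketch of this behavior with a throwaway store (the path, key, and namespace names here are made up):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>st &lt;- storr::storr_rds("cache/demo")

## The same key can hold different values in different namespaces
st$set("thread-123", "first attempt",  namespace = "attempt-1")
st$set("thread-123", "second attempt", namespace = "attempt-2")

st$get("thread-123", namespace = "attempt-1")
st$list(namespace = "attempt-2")
</code></pre></div></div>

<p>Each namespace thus acts as an independent record of one attempt, without overwriting the results of previous ones.</p>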

<h2 id="5-putting-it-all-together">5. Putting it all together</h2>

<p>Once I had most of the pieces together, I started by testing the code on the first 100 threads (100 being the default number of threads that <code class="language-plaintext highlighter-rouge">gmailr</code> returns). That was a manageable number for observing how the script behaved, while being large enough that many different types of messages would be encountered. At that stage, I wasn’t using the caching system yet.</p>

<p>Once the first 100 threads could be imported successfully into HelpScout, I wrote a function to retrieve the identifiers for all the threads in the inbox that needed to be imported, and iterated over these identifiers to call the <code class="language-plaintext highlighter-rouge">get_hs_response</code> function:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">get_all_threads</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
  
  </span><span class="n">first_it</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gm_threads</span><span class="p">()</span><span class="w">
  </span><span class="n">next_token</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">first_it</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">nextPageToken</span><span class="w">
  
  </span><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">append</span><span class="p">(</span><span class="nf">list</span><span class="p">(),</span><span class="w"> </span><span class="n">first_it</span><span class="p">)</span><span class="w">
  
  </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">next_token</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">tmp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gm_threads</span><span class="p">(</span><span class="n">page_token</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">next_token</span><span class="p">)</span><span class="w">
    </span><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">append</span><span class="p">(</span><span class="w">
      </span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">tmp</span><span class="w">
    </span><span class="p">)</span><span class="w">
    </span><span class="n">next_token</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tmp</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">nextPageToken</span><span class="w">
    </span><span class="n">message</span><span class="p">(</span><span class="s2">"next token: "</span><span class="p">,</span><span class="w"> </span><span class="n">next_token</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
  </span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">threads</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">get_all_threads</span><span class="p">()</span><span class="w">

</span><span class="n">threads_ids</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map</span><span class="p">(</span><span class="w">
  </span><span class="n">threads</span><span class="p">,</span><span class="w">
  </span><span class="o">~</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_chr</span><span class="p">(</span><span class="n">.</span><span class="o">$</span><span class="n">threads</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="o">$</span><span class="n">id</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">unlist</span><span class="p">()</span><span class="w">

</span><span class="n">hs_res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">walk</span><span class="p">(</span><span class="w">
  </span><span class="n">threads_ids</span><span class="p">,</span><span class="w">
  </span><span class="o">~</span><span class="w"> </span><span class="n">get_hs_response</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"v2020-04-10.1"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>As part of the hook function that takes care of uploading conversations to HelpScout, I checked whether the upload was successful and, based on that, created and assigned a Gmail label to the thread. This was an additional safeguard that I could use to flag threads that didn’t import successfully.</p>
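<p>The labeling step inside the hook might look something like the following sketch. The helper and label names are hypothetical, and it assumes the labels already exist in Gmail and that <code class="language-plaintext highlighter-rouge">gmailr::gm_modify_thread()</code> is used to attach them by ID:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Hypothetical helper: flag a thread based on the HelpScout response
label_upload_result &lt;- function(thread_id, response, namespace) {
  status &lt;- if (httr::status_code(response) &gt;= 400) "failure" else "success"
  label_name &lt;- paste0(status, "-", namespace)

  ## Look up the Gmail label ID from its name
  labels &lt;- gmailr::gm_labels()$labels
  label_id &lt;- purrr::keep(labels, ~ .$name == label_name)[[1]]$id

  gmailr::gm_modify_thread(thread_id, add_labels = label_id)
}
</code></pre></div></div>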

<p>Once the upload completed, I could then inspect the content of the store:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Retrieve the threads_ids from the store</span><span class="w">
</span><span class="n">idx</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">store_hs_responses</span><span class="p">()</span><span class="o">$</span><span class="nf">list</span><span class="p">(</span><span class="n">namespace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"v2020-04-10.1"</span><span class="p">)</span><span class="w">

</span><span class="c1">## Retrieve the status code for the HelpScout API responses</span><span class="w">
</span><span class="n">is_error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_lgl</span><span class="p">(</span><span class="w">
  </span><span class="n">idx</span><span class="p">,</span><span class="w">
  </span><span class="o">~</span><span class="w"> </span><span class="n">httr</span><span class="o">::</span><span class="n">status_code</span><span class="p">(</span><span class="w">
    </span><span class="n">store_hs_responses</span><span class="p">()</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">namespace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"v2020-04-10.1"</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">&gt;=</span><span class="w">  </span><span class="m">400</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1">## How many calls failed?</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">is_error</span><span class="p">)</span><span class="w">

</span><span class="c1">## Which thread_ids failed?</span><span class="w">
</span><span class="n">idx</span><span class="p">[</span><span class="n">is_error</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<p>and double-check that these were the same threads that had been labeled with <code class="language-plaintext highlighter-rouge">failure-&lt;namespace&gt;</code> in Gmail.</p>
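<p>With the failing thread IDs in hand, re-attempting just those threads is another walk over the same entry point, under a fresh namespace (the version string here is illustrative):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Re-attempt only the failed threads under a new namespace,
## leaving the results of the first attempt untouched
failed_ids &lt;- idx[is_error]

purrr::walk(
  failed_ids,
  ~ get_hs_response(., namespace = "v2020-04-10.2")
)
</code></pre></div></div>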

<h2 id="lessons-learned">Lessons learned</h2>

<p>As is often the case when using programming to solve problems, what might seem like a simple task (“Transferring emails from one system to another”) is really a collection of small problems. Being able to break down the big problem into small ones, and knowing how to address them, comes with experience. Experience helps you recognize problems similar to ones you have already solved, and reflecting on these past experiences helps you identify the algorithms, packages, and general code organization that are most likely to solve the problem at hand.</p>

<p>In The Carpentries Instructor Training, when <a href="https://carpentries.github.io/instructor-training/03-expertise/index.html">we teach about expertise</a>, we talk about how the mental model of experts is denser and more connected. These features can make it more difficult for experts to teach beginners, because they have forgotten what it is like to not know how to break down a large problem into multiple small ones. The problem here is not just “migrate a bunch of emails between two systems”; there is a lot more to it. I wrote this blog post with the intent of demonstrating the approach I took to break a big problem into small ones and, in the process, describing the tools and techniques I chose to address them.</p>

<p>Expertise is subjective and relative, and I certainly do not claim that the approach I chose here is the best, the most efficient, or the most elegant. There is certainly room for improvement. For instance, parts of the code could be refactored to be better organized, parts could be rewritten to be more <a href="https://en.wikipedia.org/wiki/Defensive_programming">defensive</a>, and there is no documentation (besides this blog post) and barely any comments.</p>

<p>I am interested in hearing your perspective and thoughts on how the problem could have been approached differently and the tools you would have chosen to address it. If this post was useful to you to help you solve a different problem, I would also love to hear about it! Leave a comment below or contact me using the info provided on the left of this page.</p>

<h3 id="footnotes">Footnotes</h3>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>You may notice that the Git history for the repo includes the key and secret for the HelpScout OAuth authentication. By themselves, these are not enough to access any data, as you also need to authenticate with a valid HelpScout account within our organization. These credentials have also been revoked. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I’ll be submitting a pull request to <code class="language-plaintext highlighter-rouge">gmailr</code> soon. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>If you are interested in learning more about the object-oriented programming R6 system, the <a href="https://adv-r.hadley.nz/r6.html">chapter about it</a> in the “Advanced R” book by Hadley Wickham is a great place to start. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>If you are interested in learning more about <code class="language-plaintext highlighter-rouge">storr</code>, read the <a href="https://richfitz.github.io/storr/articles/storr.html">documentation for the package</a> and the <a href="https://richfitz.github.io/storr/articles/external.html">vignette on external data</a> that initially helped me get started with this amazingly useful package. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="R in Production" /><category term="helpscout" /><category term="gmailr" /><category term="R" /><summary type="html"><![CDATA[How R allowed The Carpentries to migrate emails from Gmail to HelpScout using their web APIs]]></summary></entry><entry><title type="html">Advent of Code 2018</title><link href="https://francoismichonneau.net/2018/12/advent/" rel="alternate" type="text/html" title="Advent of Code 2018" /><published>2018-12-01T00:00:00+00:00</published><updated>2018-12-01T00:00:00+00:00</updated><id>https://francoismichonneau.net/2018/12/advent</id><content type="html" xml:base="https://francoismichonneau.net/2018/12/advent/"><![CDATA[<p>I’m going to try to complete the Advent of Code again this year. I’ll put all the exercises I complete in this post.</p>

<p>Links to the puzzles are at <a href="https://adventofcode.com/2018">https://adventofcode.com/2018</a>.</p>

<h1 id="day-1">Day 1</h1>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">


</span><span class="c1">## part 1</span><span class="w">
</span><span class="n">readr</span><span class="o">::</span><span class="n">read_lines</span><span class="p">(</span><span class="s2">"advent-data/2018-12-01-day1.txt"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="nf">sum</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 408
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## part 2</span><span class="w">
</span><span class="n">input</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">read_lines</span><span class="p">(</span><span class="s2">"advent-data/2018-12-01-day1.txt"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">

</span><span class="n">already_seen</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">

  </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">v_sum</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">cumsum</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">))</span><span class="w">
    </span><span class="n">has_dup</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">any</span><span class="p">(</span><span class="n">duplicated</span><span class="p">(</span><span class="n">v_sum</span><span class="p">))</span><span class="w">
    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">has_dup</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nf">return</span><span class="p">(</span><span class="n">v_sum</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">duplicated</span><span class="p">(</span><span class="n">v_sum</span><span class="p">))[</span><span class="m">1</span><span class="p">]])</span><span class="w">
    </span><span class="p">}</span><span class="w">
    </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
</span><span class="p">}</span><span class="w">

</span><span class="n">already_seen</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 55250
</code></pre></div></div>

<h1 id="day-2">Day 2</h1>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">read_lines</span><span class="p">(</span><span class="s2">"advent-data/2018-12-02-day2.txt"</span><span class="p">)</span><span class="w"> 

</span><span class="c1">## part 1</span><span class="w">
</span><span class="n">count_letters</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="n">n_letters</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">strsplit</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">purrr</span><span class="o">::</span><span class="n">map</span><span class="p">(</span><span class="n">table</span><span class="p">)</span><span class="w">
  
  </span><span class="n">has_2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nf">as.integer</span><span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="n">has_3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nf">as.integer</span><span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="n">has_2_vec</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_int</span><span class="p">(</span><span class="n">n_letters</span><span class="p">,</span><span class="w"> </span><span class="n">has_2</span><span class="p">)</span><span class="w">
  </span><span class="n">has_3_vec</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">purrr</span><span class="o">::</span><span class="n">map_int</span><span class="p">(</span><span class="n">n_letters</span><span class="p">,</span><span class="w"> </span><span class="n">has_3</span><span class="p">)</span><span class="w">

  </span><span class="nf">sum</span><span class="p">(</span><span class="n">has_2_vec</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">has_3_vec</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">count_letters</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 6000
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## part 2</span><span class="w">
</span><span class="n">all_in</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">crossing</span><span class="p">(</span><span class="n">in1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">in2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">split1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strsplit</span><span class="p">(</span><span class="n">in1</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">),</span><span class="w">
    </span><span class="n">split2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strsplit</span><span class="p">(</span><span class="n">in2</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">),</span><span class="w">     
    </span><span class="n">n_diff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map2_int</span><span class="p">(</span><span class="n">split1</span><span class="p">,</span><span class="w"> </span><span class="n">split2</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">.x</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">.y</span><span class="p">))</span><span class="w">    
  </span><span class="p">)</span><span class="w">

</span><span class="n">all_in</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">n_diff</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map2_chr</span><span class="p">(</span><span class="w">
    </span><span class="n">split1</span><span class="p">,</span><span class="w">
    </span><span class="n">split2</span><span class="p">,</span><span class="w">
    </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
      </span><span class="n">y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
      </span><span class="n">paste</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">x</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">y</span><span class="p">],</span><span class="w"> </span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
    </span><span class="p">}))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">pull</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "pbykrmjmizwhxlqnasfgtycdv"
</code></pre></div></div>

<h2 id="day-3">Day 3</h2>

<p>That’s far from the prettiest code I’ve written! But it gets the job done.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">extract_coords</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="n">readr</span><span class="o">::</span><span class="n">read_delim</span><span class="p">(</span><span class="w">
    </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">delim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span><span class="w"> </span><span class="n">col_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">tidyr</span><span class="o">::</span><span class="n">extract</span><span class="p">(</span><span class="n">X1</span><span class="p">,</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"id"</span><span class="p">,</span><span class="w"> </span><span class="n">regex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"([[:digit:]]+)"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">tidyr</span><span class="o">::</span><span class="n">extract</span><span class="p">(</span><span class="n">X3</span><span class="p">,</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"x_begin"</span><span class="p">,</span><span class="w"> </span><span class="s2">"y_begin"</span><span class="p">),</span><span class="w"> </span><span class="n">regex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"([[:digit:]]+),([[:digit:]]+):"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">tidyr</span><span class="o">::</span><span class="n">extract</span><span class="p">(</span><span class="n">X4</span><span class="p">,</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"width"</span><span class="p">,</span><span class="w"> </span><span class="s2">"height"</span><span class="p">),</span><span class="w"> </span><span class="n">regex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"([[:digit:]]+)x([[:digit:]]+)"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">dplyr</span><span class="o">::</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">X2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate_all</span><span class="p">(</span><span class="n">as.numeric</span><span class="p">)</span><span class="w">
  
</span><span class="p">}</span><span class="w">


</span><span class="c1">## overall grid dimensions needed to hold all claims</span><span class="w">
</span><span class="n">find_total_dim</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">coords</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">coords</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">dplyr</span><span class="o">::</span><span class="n">mutate</span><span class="p">(</span><span class="n">total_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">width</span><span class="p">,</span><span class="w">
                  </span><span class="n">total_height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">height</span><span class="p">)</span><span class="w">

  </span><span class="nf">c</span><span class="p">(</span><span class="n">total_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">res</span><span class="o">$</span><span class="n">total_width</span><span class="p">),</span><span class="w">
    </span><span class="n">total_height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">res</span><span class="o">$</span><span class="n">total_height</span><span class="p">))</span><span class="w">
  
</span><span class="p">}</span><span class="w">


</span><span class="c1">## build the fabric grid: each cell counts how many claims cover it</span><span class="w">
</span><span class="n">fill_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="n">c</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">extract_coords</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
  </span><span class="n">m_dim</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">find_total_dim</span><span class="p">(</span><span class="n">c</span><span class="p">)</span><span class="w">

  </span><span class="n">M</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w">
              </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m_dim</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w">
              </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m_dim</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">


  </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">c</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">i_s</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">x_begin</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">x_begin</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="o">$</span><span class="n">width</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
    </span><span class="n">j_s</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">y_begin</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">y_begin</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="o">$</span><span class="n">height</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
    </span><span class="n">M</span><span class="p">[</span><span class="n">i_s</span><span class="p">,</span><span class="w"> </span><span class="n">j_s</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">M</span><span class="p">[</span><span class="n">i_s</span><span class="p">,</span><span class="w"> </span><span class="n">j_s</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="n">M</span><span class="w">
  
</span><span class="p">}</span><span class="w">


</span><span class="n">more_two_claims</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">M</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fill_matrix</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
  </span><span class="nf">sum</span><span class="p">(</span><span class="n">M</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> 
</span><span class="p">}</span><span class="w">

</span><span class="c1">## part 1 answer</span><span class="w">
</span><span class="n">more_two_claims</span><span class="p">(</span><span class="s2">"advent-data/2018-12-03-day3.txt"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 109716
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## TRUE when every cell of this claim is covered by exactly one claim</span><span class="w">
</span><span class="n">overlaps</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x_begin</span><span class="p">,</span><span class="w"> </span><span class="n">y_begin</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">x_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="n">x_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">width</span><span class="p">)</span><span class="w">
  </span><span class="n">j</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">y_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="p">(</span><span class="n">y_begin</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">height</span><span class="p">)</span><span class="w">
  </span><span class="nf">all</span><span class="p">(</span><span class="n">M</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> 
</span><span class="p">}</span><span class="w">

</span><span class="n">no_overlap</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">M</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fill_matrix</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">

  </span><span class="n">c</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">extract_coords</span><span class="p">(</span><span class="n">input</span><span class="p">)</span><span class="w">
  </span><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">logical</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">c</span><span class="p">))</span><span class="w">
  
  </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">c</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">overlaps</span><span class="p">(</span><span class="n">c</span><span class="o">$</span><span class="n">x_begin</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">c</span><span class="o">$</span><span class="n">y_begin</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
                       </span><span class="n">c</span><span class="o">$</span><span class="n">width</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">c</span><span class="o">$</span><span class="n">height</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">M</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="n">c</span><span class="o">$</span><span class="n">id</span><span class="p">[</span><span class="n">res</span><span class="p">]</span><span class="w">
  
</span><span class="p">}</span><span class="w">

</span><span class="c1">## part 2 answer</span><span class="w">
</span><span class="n">no_overlap</span><span class="p">(</span><span class="s2">"advent-data/2018-12-03-day3.txt"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 124
</code></pre></div></div>]]></content><author><name>François Michonneau, PhD</name><email>francois.michonneau@gmail.com</email></author><category term="Hacking" /><category term="advent of code" /><summary type="html"><![CDATA[Solutions for the 2018 Advent of Code]]></summary></entry></feed>