David Waldron

R Packages: dplyr vs data.table

Two R packages--dplyr and data.table--offer similar tools for data manipulation, but appeal to different users.

David Waldron Nov 25, 2018

R currently has two main packages that are widely used for manipulating datasets. Over the years, the data.table package has built up a base of users with its concise syntax and its ability to handle larger datasets at lightning speed. On the other side, dplyr (which superseded the plyr package in 2014) has seen rapid growth in users. As part of the larger tidyverse group of R packages, it appeals to users who prefer a friendlier syntax and do not need the speed of data.table.
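To make the difference in syntax concrete, here is a minimal sketch of the same filter-and-summarise task written both ways. The `sales` data frame and its column names are invented for illustration and do not come from any real dataset.

```r
library(dplyr)
library(data.table)

# Toy data, invented for this example
sales <- data.frame(
  region = c("east", "east", "west", "west", "west"),
  amount = c(10, 20, 5, 15, 40)
)

# dplyr: a pipeline of named verbs, read top to bottom
by_region_dplyr <- sales %>%
  filter(amount > 5) %>%
  group_by(region) %>%
  summarise(total = sum(amount))

# data.table: one bracket call of the form DT[i, j, by]
sales_dt <- as.data.table(sales)
by_region_dt <- sales_dt[amount > 5, .(total = sum(amount)), by = region]

print(by_region_dplyr)
print(by_region_dt)
```

Both versions keep rows with `amount > 5` and sum `amount` within each region. The dplyr version spells out each step as a verb, while the data.table version packs the filter, the summary, and the grouping into a single expression.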

To get a picture of the activity related to these packages, I used the Stack Exchange Data Explorer to query the database of Stack Overflow questions and answers.

[Figure: Questions. Number of Stack Overflow questions by tag (quarterly). Data source: Stack Exchange Data Explorer]

As I expected, the number of questions related to dplyr surpassed the number related to data.table soon after dplyr's release. In 2018, dplyr accounts for about three times as many questions as data.table. But I suspect that dplyr, made more ubiquitous by the tidyverse, is being adopted by many newer R users, who might be more likely to ask questions.

So my second query counted the percent of R questions with at least one answer mentioning either dplyr or data.table. This shows that, in recent months, dplyr was twice as likely as data.table to be mentioned in an answer to an R question, a smaller gap than the threefold difference in questions. This could indicate that dplyr doesn't dominate among answerers--who might be more seasoned users--as much as it does among questioners.
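The queries themselves ran on the Stack Exchange Data Explorer and are not reproduced here. As a rough illustration of the calculation only, the sketch below computes the same kind of percentage in R with data.table. The `questions` and `answers` tables, their columns, and their contents are all made up, and matching the package name with `grepl` is an assumption about what counts as a "mention".

```r
library(data.table)

# Hypothetical R questions, one row each, labelled by quarter
questions <- data.table(
  question_id = 1:5,
  quarter     = c("2018Q1", "2018Q1", "2018Q2", "2018Q2", "2018Q2")
)

# Hypothetical answers; question 5 has no answers at all
answers <- data.table(
  question_id = c(1L, 1L, 2L, 3L, 4L),
  body = c(
    "Try dplyr::mutate()",
    "With data.table: DT[, new := x * 2]",
    "Base R's transform() works here",
    "library(dplyr) and then use summarise()",
    "data.table handles this with DT[, .N, by = g]"
  )
)

# For each question: does at least one answer mention the package?
mentions <- answers[, .(
  mentions_dplyr = any(grepl("dplyr", body, fixed = TRUE)),
  mentions_dt    = any(grepl("data.table", body, fixed = TRUE))
), by = question_id]

# Keep unanswered questions in the denominator, counted as "no mention"
per_question <- merge(questions, mentions, by = "question_id", all.x = TRUE)
per_question[is.na(mentions_dplyr), mentions_dplyr := FALSE]
per_question[is.na(mentions_dt), mentions_dt := FALSE]

# Percent of questions per quarter with at least one mentioning answer
pct_by_quarter <- per_question[, .(
  pct_dplyr = 100 * mean(mentions_dplyr),
  pct_dt    = 100 * mean(mentions_dt)
), by = quarter]

print(pct_by_quarter)
```

A question counts once per package no matter how many of its answers mention that package, which is why the mentions are collapsed with `any()` before the percentages are taken.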

[Figure: Answers. Percent of R Stack Overflow questions with an answer that mentions package (quarterly). Data source: Stack Exchange Data Explorer]

Although the trend is clearly towards dplyr and the tidyverse, data.table's usefulness with bigger datasets is still reason enough to become familiar with it. On the other hand, data.table users should be aware of how popular dplyr has become. There is little that dplyr can do that data.table doesn't already do, but refusing to learn dplyr will only isolate you from other R users.
