Title: | Out-of-memory data collation into Arrow datasets |
---|---|
Description: | Iterate over a function and collate its output into an Arrow dataset, without loading the whole result set into memory. |
Authors: | Petr Bouchal [aut, cre] |
Maintainer: | Petr Bouchal <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0.9000 |
Built: | 2024-11-09 02:51:30 UTC |
Source: | https://github.com/petrbouchal/purrrow |
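map + arrow: iterate over a function and collate the results into an Arrow dataset. This happens without the whole dataset being held in memory, so it is suitable for large data objects. The function must return a data.frame or tibble. The return value depends on the variant: marrow_dir returns the path to the directory containing the Arrow dataset, marrow_ds returns the Arrow Dataset itself, and marrow_files returns the paths to all files in the dataset directory.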
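The general idea of collating chunks on disk can be sketched with the arrow package alone. This is a minimal illustration of the pattern, not purrrow's actual implementation; the helper `one_month`, the directory name `collate_sketch`, and the `part-NN.parquet` file names are all hypothetical.

```r
library(arrow)

# Hypothetical stand-in for .f: returns one month of the built-in airquality data.
one_month <- function(month) {
  airquality[airquality$Month == month, ]
}

# Hypothetical output directory under the session temp dir.
out_dir <- file.path(tempdir(), "collate_sketch")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write each chunk to its own Parquet file; only one chunk is held in memory at a time.
for (m in unique(airquality$Month)) {
  chunk <- one_month(m)
  write_parquet(chunk, file.path(out_dir, sprintf("part-%02d.parquet", m)))
}

# Open the directory lazily as a single dataset; rows are not loaded until collected.
ds <- open_dataset(out_dir)
```

The dataset object `ds` only references the files on disk, which is what makes the approach workable for results too large to bind together in memory.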
marrow_dir(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")
marrow_ds(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")
marrow_files(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")
.x |
vector or list of values for .f to iterate over |
.f |
function; must return a data.frame/tibble |
... |
other arguments to .f |
.path |
path to the directory where the collated Arrow dataset will be stored; created if it does not exist |
.partitioning |
character vector of columns to use for partitioning. Columns must exist in output of .f. |
.format |
"parquet" (the default) or "arrow". |
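As a hedged illustration of the `.partitioning` and `.format` arguments, the sketch below reuses the built-in `airquality` data. It assumes purrrow and arrow are installed and calls the function with `:::` as the package's own example does; the directory names `aq_parquet_by_month` and `aq_arrow` are hypothetical.

```r
# Function passed as .f: returns the rows of airquality for one month.
part_of_aq <- function(month) {
  airquality[airquality$Month == month, ]
}
months <- unique(airquality$Month)

# Hypothetical output directories, kept separate so the two runs do not mix files.
parquet_dir <- file.path(tempdir(), "aq_parquet_by_month")
arrow_dir <- file.path(tempdir(), "aq_arrow")

# Parquet output (the default format), partitioned by a column that .f returns.
p1 <- purrrow:::marrow_dir(months, part_of_aq,
                           .path = parquet_dir, .partitioning = "Month")

# The same data written in "arrow" format, without partitioning.
p2 <- purrrow:::marrow_dir(months, part_of_aq,
                           .path = arrow_dir, .format = "arrow")
```

`Month` is a valid partitioning column here because it is present in every data frame that `part_of_aq` returns.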
marrow_dir: path to new dataset directory; character string of length one.
marrow_ds: an Arrow Dataset.
marrow_files: character vector containing paths to all files in the dataset directory.
marrow_dir
: Return path to directory containing dataset
marrow_ds
: Return Arrow Dataset
marrow_files
: Return paths to all files in dataset dir
# Values to iterate over: the months present in the built-in airquality data
months <- unique(airquality$Month)
td <- tempdir()
# Function to iterate: returns the rows for a single month as a data.frame
part_of_aq <- function(month) {
  airquality[airquality$Month == month, ]
}
# Collate the monthly pieces into an Arrow dataset stored under td
aq_arrow <- purrrow:::marrow_dir(months, part_of_aq, .path = td)
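The other two variants differ only in what they return. The following sketch, under the same assumptions as the example above (purrrow and arrow installed), also assumes dplyr is available for querying the dataset; the directory names `aq_ds` and `aq_files` are hypothetical.

```r
months <- unique(airquality$Month)
part_of_aq <- function(month) {
  airquality[airquality$Month == month, ]
}

# Hypothetical output directories, one per call.
ds_dir <- file.path(tempdir(), "aq_ds")
files_dir <- file.path(tempdir(), "aq_files")

# marrow_ds returns the Arrow Dataset itself, ready for lazy queries.
ds <- purrrow:::marrow_ds(months, part_of_aq, .path = ds_dir)

# marrow_files returns the paths of the files written to the dataset directory.
files <- purrrow:::marrow_files(months, part_of_aq, .path = files_dir)

# Filter on disk and pull only the matching rows into memory (assumes dplyr).
hot <- dplyr::collect(dplyr::filter(ds, Temp > 90))
```

Returning the Dataset directly saves a separate call to open the directory when the next step is a query.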