Package 'purrrow'

Title: Out-of-memory data collation into Arrow datasets
Description: Iterate over a function and collate its output into an Arrow dataset, without loading the whole result set into memory.
Authors: Petr Bouchal [aut, cre]
Maintainer: Petr Bouchal <[email protected]>
License: MIT + file LICENSE
Version: 0.0.0.9000
Built: 2024-11-09 02:51:30 UTC
Source: https://github.com/petrbouchal/purrrow

Help Index


Iteratively collate the output of a function into an Arrow dataset, out of memory

Description

Lifecycle: experimental

map + arrow: iterate over a function and collate the results into an Arrow dataset. The whole dataset is never held in memory at once, so this is suitable for results too large to fit in memory. The function must return a data.frame or tibble. marrow_dir returns the path to the directory containing the Arrow dataset; marrow_ds and marrow_files return the same dataset in other forms (see Functions).
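The general idea of chunk-wise collation can be sketched with the arrow package alone. This is a minimal illustration under stated assumptions, not purrrow's actual implementation: `collate_chunks` and the `part-%d.parquet` file naming are hypothetical, introduced only for this example.

```r
library(arrow)

# Hypothetical helper, not part of purrrow: writes one Parquet file per
# element of `x`, so only a single chunk is held in memory at any time.
collate_chunks <- function(x, f, path) {
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  for (i in seq_along(x)) {
    chunk <- f(x[[i]])
    stopifnot(is.data.frame(chunk))  # mirrors the requirement on .f
    write_parquet(chunk, file.path(path, sprintf("part-%d.parquet", i)))
  }
  path
}

part_of_aq <- function(month) airquality[airquality$Month == month, ]

out_dir <- collate_chunks(unique(airquality$Month), part_of_aq,
                          file.path(tempdir(), "aq_chunks"))

# Opening the directory as a dataset reads metadata only, not the data
ds <- open_dataset(out_dir)
ds$num_rows
```

Because each chunk is written to its own file before the next one is computed, peak memory use is bounded by the largest single chunk rather than by the full result.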

Usage

marrow_dir(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

marrow_ds(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

marrow_files(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

Arguments

.x

vector or list of values for .f to iterate over

.f

function; must return a data.frame/tibble

...

other arguments to .f

.path

path to directory where the collated Arrow dataset will be stored; it will be created if it does not exist

.partitioning

character vector of columns to use for partitioning. Columns must exist in output of .f.

.format

"parquet" (the default) or "arrow".
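To illustrate what partitioning and the two formats look like on disk, the sketch below calls arrow's own `write_dataset()` directly. It assumes that `.partitioning` and `.format` behave like the corresponding arrow arguments, which this page does not confirm; the directory names are arbitrary.

```r
library(arrow)

part_dir <- file.path(tempdir(), "aq_by_month")

# Partition on a column that exists in the data. arrow's default hive-style
# layout creates one subdirectory per value, e.g. "Month=5".
write_dataset(airquality, part_dir, format = "parquet", partitioning = "Month")
list.files(part_dir)

# Assumption: "arrow" refers to arrow's IPC (Feather) file format.
ipc_dir <- file.path(tempdir(), "aq_ipc")
write_dataset(airquality, ipc_dir, format = "arrow")
ipc_ds <- open_dataset(ipc_dir, format = "arrow")
```

A partitioning column that is missing from the data cannot be used to split it, which is why the columns named in .partitioning must exist in the output of .f.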

Value

marrow_dir: path to new dataset directory; character string of length one.

marrow_ds: an Arrow Dataset.

marrow_files: character vector containing paths to all files in the dataset directory.

Functions

  • marrow_dir: Return path to directory containing dataset

  • marrow_ds: Return Arrow Dataset

  • marrow_files: Return paths to all files in dataset dir
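The three return forms can be approximated with plain arrow and base R calls. This is a hedged sketch of what each variant presumably hands back, written without purrrow itself; the variable names are illustrative only.

```r
library(arrow)

ds_dir <- file.path(tempdir(), "aq_returns")
write_dataset(airquality, ds_dir, format = "parquet", partitioning = "Month")

# Like marrow_dir: the dataset directory as a length-one character string
as_dir <- ds_dir

# Like marrow_ds: an Arrow Dataset opened on that directory
as_ds <- open_dataset(ds_dir)

# Like marrow_files: paths to every file inside the dataset directory
as_files <- list.files(ds_dir, recursive = TRUE, full.names = TRUE)
```

The path form suits pipelines that track files, while the Dataset form is convenient for querying immediately.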

Examples

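# One value per chunk: the months present in the data
months <- unique(airquality$Month)
td <- tempdir()

# .f must return a data.frame: here, the rows for a single month
part_of_aq <- function(month) {
  airquality[airquality$Month == month, ]
}

# Collate the monthly chunks into a dataset in td; returns the directory path
aq_arrow <- purrrow:::marrow_dir(months, part_of_aq,
                                  .path = td)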