--- title: "Getting Started with Pointblank Operatives" format: html: df-print: tibble vignette: > %\VignetteIndexEntry{Getting Started with Operatives} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE ) ``` ```{r setup, include = FALSE} library(pointblankops) library(dplyr) library(pointblank) library(arrow) ``` ## Introduction The `pointblankops` package provides specialized data validation operations using lightweight operatives for focused intelligence gathering. Operatives are streamlined alternatives to pointblank agents, designed for efficient row-level failure detection without the overhead of full reporting capabilities. The use case that this solves is the following: - data is large so can be out of memory - we run tests on data to understand which rows fail which test, because downstream we exclude different rows in different situations depending on the purpose of the analysis - so we need per-row validation results to use in post-processing Extracting this from an interrogated agent it tedious and memory-intensive. To preserve memory and allow working on large datasets, operatives focus on extracting validation failures directly, without the full reporting overhead of pointblank agents. - Per-row validation results are returned in a tidy format, making it easy to integrate with other data processing workflows. - They can be stored directly in a database or saved to a file format like Parquet for further analysis, all done efficiently with minimal memory footprint. - Validation failure information can also be returned as a tibble for immediate use in R. ## Creating Operatives Operatives are created using the `create_operative()` function, which is a lightweight wrapper around pointblank's `create_agent()`: ```{r} #| label: create-operative # Create test data test_data <- data.frame( batch = c("A", "A", "B", "B", "C"), id = c(1, 2, 3, 4, 5), value = c(10, NA, 15, 8, 12), category = c("X", "Y", "X", "Z", "Y") ) # Create an operative operative <- create_operative(test_data, tbl_name = "test_data", label = "Test Operative") ``` ```{r, include = FALSE} export_report(operative, "operative.html") ``` ```{r, eval = FALSE} operative ``` ```{=html} ``` ## Adding Validation Steps Just like pointblank agents, operatives can have validation steps added to them: ```{r} #| label: validation-steps operative <- operative |> col_vals_not_null(columns = vars(value)) |> col_vals_between(columns = vars(value), left = 5, right = 20) |> col_vals_in_set(columns = vars(category), set = c("X", "Y", "Z")) ``` ## Debriefing Operatives The core functionality is the `debrief()` function, which extracts only the validation failures: ```{r} #| label: basic-debrief # Get failures as a tibble failures <- debrief(operative, row_id_col = c("batch", "id")) ``` ```{r} failures ``` ## Output Options The `debrief()` function supports multiple output formats: ### 1. Return as Tibble (default) ```{r} #| label: tibble-output failures <- debrief(operative, row_id_col = c("batch", "id")) ``` ```{r} failures ``` ### 2. Save to Parquet File ```{r} #| label: parquet-output debrief(operative, row_id_col = c("batch", "id"), parquet_path = "validation_failures.parquet") ``` ```{r} read_parquet("validation_failures.parquet") ``` ### 3. Save to Database ```{r} #| label: database-output con <- DBI::dbConnect(duckdb::duckdb(), ":memory:") # Copy test data to database DBI::dbWriteTable(con, "test_data", test_data) # Create operative from database table db_operative <- create_operative(test_data) |> col_vals_not_null(columns = vars(value)) |> col_vals_between(columns = vars(value), left = 5, right = 20) # Save failures to database table debrief(db_operative, row_id_col = c("batch", "id"), con = con, output_tbl = "validation_failures") ``` ```{r} tbl(con, "validation_failures") |> collect() ``` ```{r include = FALSE} DBI::dbDisconnect(con) ``` ## Memory Efficiency For large datasets, `debrief()` processes data in chunks to maintain memory efficiency: ```{r} #| label: chunked-processing # Process in smaller chunks for memory efficiency failures <- debrief(operative, row_id_col = c("batch", "id"), chunk_size = 500) # Process 500 rows at a time ``` ```{r} failures ``` ## Database Compatibility Operatives work seamlessly with database tables via dbplyr: ```{r} #| label: database-compatibility con <- DBI::dbConnect(duckdb::duckdb(), ":memory:") DBI::dbWriteTable(con, "large_table", test_data) # Create operative from database table db_operative <- create_operative(dplyr::tbl(con, "large_table")) |> col_vals_not_null(columns = vars(value)) |> col_vals_gt(value, 8) # Debrief processes the query efficiently in the database failures <- debrief(db_operative, row_id_col = c("batch", "id")) ``` ```{r} failures ``` ```{r, include = FALSE} DBI::dbDisconnect(con) ``` ## Supported Validation Types The following pointblank validation functions are supported: - `col_vals_not_null()` / `col_vals_null()` - `col_vals_between()` / `col_vals_not_between()` - `col_vals_in_set()` / `col_vals_not_in_set()` - `col_vals_gt()` / `col_vals_gte()` / `col_vals_lt()` / `col_vals_lte()` - `col_vals_equal()` / `col_vals_not_equal()` Unsupported validation types are automatically skipped with a message.