Read a data file into a tibble and log data access on GitHub

This is a wrapper around any specified function for reading in data files. Upon accessing the data file, it compares the file's MD5 hash against the history of previously accessed data files to determine whether this is first-time access to the data. If so, it logs this event on GitHub (after prompting the user). This is useful if you want to show in your log that you accessed parts of your data in a particular order (e.g., you first accessed your independent variables to establish an analysis plan and only then accessed your dependent variables).
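
The hash comparison described above can be illustrated with a minimal sketch. It is not the package's implementation: `is_first_access()` and the in-memory `history` vector are hypothetical names used only for illustration, and the sketch relies solely on `tools::md5sum()` from base R.

```r
# Hypothetical helper, not part of the documented package: returns TRUE when
# the file's MD5 hash does not appear in a character vector of known hashes.
is_first_access <- function(path, known_hashes) {
  hash <- unname(tools::md5sum(path))
  !(hash %in% known_hashes)
}

# A temporary file stands in for a data file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,20"), tmp)

history <- character(0)                    # assumed store of earlier hashes
first <- is_first_access(tmp, history)     # hash is not yet in the history
history <- c(history, unname(tools::md5sum(tmp)))
second <- is_first_access(tmp, history)    # hash is now recorded

# Changing the file contents changes the hash, so it counts as new data.
writeLines(c("id,score", "1,10", "2,99"), tmp)
changed <- is_first_access(tmp, history)
```

Because the check is based on the file's contents rather than its name, a renamed copy of an already accessed file would not count as first-time access in this sketch, while any change to the contents would.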

Usage

read_data(
  file,
  read_fun,
  col_select = NULL,
  row_filter = NULL,
  row_shuffle = NULL,
  long_format = FALSE,
  seed = 3985843,
  ...
)

Arguments

file: Either a path to a file, a connection, or literal data (either a single string or a raw vector).
read_fun: The name of a function to read data. For readr functions, you only have to specify the function name (e.g., `read_csv()`). If you use a function from another package, name the package explicitly (e.g., `haven::read_spss()`).
col_select: Columns to include in the results. You can use the same mini-language as `dplyr::select()` to refer to the columns by name. Use `c()` to use more than one selection expression. Although this usage is less common, col_select also accepts a numeric column index. See ?tidyselect::language for full details on the selection language.
row_filter: Optional condition selecting the rows to include in the results. Evaluated with `dplyr::filter()`.
row_shuffle: Optional variables to randomly shuffle.
long_format: Logical indicating whether the data are in long format (only relevant when shuffling variables using row_shuffle).
seed: Integer used to make the random shuffling of data reproducible.
...: Additional arguments for the read function.
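
To illustrate how a fixed seed makes shuffling reproducible (the role of `seed` alongside `row_shuffle`), here is a minimal base R sketch. `shuffle_columns()` is a hypothetical helper, not the package's implementation. It assumes that shuffling means permuting a variable's values across rows, and it ignores the `long_format` case.

```r
# Hypothetical helper, not the package's implementation: permutes the values
# of the named columns across rows, leaving all other columns untouched.
shuffle_columns <- function(data, cols, seed = 3985843) {
  set.seed(seed)                         # same seed gives same permutation
  for (col in cols) {
    idx <- sample.int(nrow(data))        # random permutation of row indices
    data[[col]] <- data[[col]][idx]
  }
  data
}

df <- data.frame(id = 1:6, dv = c(10, 20, 30, 40, 50, 60))
a <- shuffle_columns(df, "dv")
b <- shuffle_columns(df, "dv")           # identical to `a`: same seed
```

Shuffling a variable this way breaks its link to the other columns while preserving its distribution, and reusing the same seed lets anyone reproduce the exact shuffled data.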

Value

A `tibble()`. As a side effect, on first-time data access the updated MD5 hash overview is committed and pushed to GitHub.