
This is a wrapper around any specified function for reading in data files. When a data file is accessed, its MD5 hash is compared against the history of previously accessed data files to determine whether this is the first time the data are accessed. If so, the event is automatically logged on GitHub (after prompting the user). This is useful if you want to show in your log that you accessed parts of your data in a particular order (e.g., you first accessed your independent variables to establish an analysis plan and only then accessed your dependent variables).
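The first-time-access check described above can be illustrated with a minimal sketch. It is not the package's actual implementation: `is_first_access()` and `known_hashes` are hypothetical names introduced here, and the real function additionally prompts the user and logs the event on GitHub. The sketch only shows how an MD5 hash can be compared against a history of earlier hashes, using `tools::md5sum()` from base R.

```r
# Hypothetical helper (not part of the package): TRUE when the file's MD5
# hash does not yet appear in the history of known hashes.
is_first_access <- function(file, known_hashes) {
  hash <- unname(tools::md5sum(file))
  !(hash %in% known_hashes)
}

# Write a small example file to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,20"), tmp)

# Assumed in-memory history; the package keeps its own overview of hashes.
known_hashes <- character(0)

# First access: the hash is unknown, so the file counts as new.
first <- is_first_access(tmp, known_hashes)

# Record the hash, then check the same file again.
known_hashes <- c(known_hashes, unname(tools::md5sum(tmp)))
second <- is_first_access(tmp, known_hashes)
```

Because the check is based on file contents rather than the file name, a renamed copy of the same data would not count as first-time access under this sketch, while any change to the contents would.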

Usage

read_data(
  file,
  read_fun,
  col_select = NULL,
  row_filter = NULL,
  row_shuffle = NULL,
  long_format = FALSE,
  seed = 3985843,
  ...
)

Arguments

file

Either a path to a file, a connection, or literal data (either a single string or a raw vector).

read_fun

The name of a function to read data. For 'readr' functions, you only have to specify the function name (e.g., `read_csv()`). If you use a function from another package, name the package explicitly (e.g., `haven::read_spss()`).

col_select

Columns to include in the results. You can use the same mini-language as `dplyr::select()` to refer to the columns by name. Use `c()` to use more than one selection expression. Although this usage is less common, col_select also accepts a numeric column index. See ?tidyselect::language for full details on the selection language.

row_filter

Optional condition selecting the rows to include in the results. The condition is evaluated with `dplyr::filter()`.

row_shuffle

Optional variables to randomly shuffle.

long_format

Logical indicating whether the data are in long format (only relevant when shuffling variables using row_shuffle).

seed

Integer used as the random seed so that random shuffling of the data is reproducible.

...

Additional arguments for the read function.
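To illustrate how `row_shuffle` and `seed` could interact, here is a minimal sketch of seeded shuffling in base R. `shuffle_columns()` is a hypothetical helper, not the package's code, and it ignores the `long_format` case; the package may shuffle differently. It only shows that fixing the seed makes the permuted values reproducible while leaving the other columns untouched.

```r
# Hypothetical helper (not part of the package): permute the values of the
# selected columns across rows, using a fixed seed for reproducibility.
shuffle_columns <- function(data, cols, seed = 3985843) {
  set.seed(seed)
  for (col in cols) {
    # Draw a fresh random row order for each shuffled column.
    data[[col]] <- data[[col]][sample.int(nrow(data))]
  }
  data
}

df <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))

# Two calls with the same seed give the same shuffled result.
a <- shuffle_columns(df, "score")
b <- shuffle_columns(df, "score")
```

Shuffling a variable in this way breaks its link to the other columns, which lets you develop an analysis pipeline on realistic values without seeing the true relationships in the data.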

Value

A `tibble()`. As a side effect, in case of first-time data access the updated MD5 hash overview is committed and pushed to GitHub.