The dataset R Package
The dataset package is an extension to the R statistical environment. It aims to ensure that the most important R objects that hold a dataset, i.e. a data.frame or one of the classes that inherit from it (tibble, tsibble or data.table), carry the metadata needed to reuse and validate the dataset contents. We aim to offer a novel solution for individuals or small groups of data scientists working in business, academic or policy research functions who cannot count on the support of librarians, knowledge engineers, and extensive documentation processes.
The dataset package extends the concept of tidy data by adding standardized semantic information to the user’s dataset, which increases the (re-)use value of the data object:
- More descriptive information about the dataset as a created work, such as its authors, contributors and reuse rights, makes it easier to find and use.
- More standardized and linked metadata, such as standard variable definitions and code lists, enables the data owner to gather far more information from third parties, and enables third parties to understand and use the data correctly.
- More information about the data provenance makes quality assessment easier and reduces the need for time-consuming and unnecessary re-processing steps.
- More structural information about the data makes it easier to reuse and to join with new information, and reduces the risk of logical errors.
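
The four kinds of metadata listed above can be illustrated with a minimal base R sketch. It does not use the dataset package's own functions. The attribute names (`title`, `creator`, `rights`, `provenance`, `label`, `unit`) and the toy values are assumptions chosen for illustration only.

```r
# Toy data with made-up values, for illustration only.
toy <- data.frame(
  country = c("AA", "BB"),
  year    = c(2020L, 2021L),
  value   = c(1.5, 2.5)
)

# Descriptive metadata about the dataset as a whole
# (hypothetical attribute names, not the package's API).
attr(toy, "title")   <- "Toy example dataset"
attr(toy, "creator") <- person("Jane", "Doe", role = "aut")
attr(toy, "rights")  <- "CC-BY-4.0"

# Variable-level semantic metadata: a label and a unit of measure.
attr(toy$value, "label") <- "Illustrative measurement"
attr(toy$value, "unit")  <- "hypothetical unit"

# Provenance: a free-text note on how the data was produced.
attr(toy, "provenance") <- "Created manually for illustration"

# The metadata travels with the object and can be read back.
attr(toy, "title")
attr(toy$value, "unit")
```

Plain attributes like these are not standardized, which is the gap the package aims to close by recording such metadata in a consistent, standards-based form.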