data_VOC: VOC dataset

Description

This dataset contains data on volatile organic compounds (VOCs) in the urine of children between 3 and 10 years old. It is composed of publicly available data from the National Health and Nutrition Examination Survey (NHANES) and was analyzed in Raymaekers and Rousseeuw (2020). See below for details and references.

Usage

data("data_VOC")

Format

A matrix of dimensions \(512 \times 19\). The first 16 variables are the VOCs; the last 3 are:

SMD460: number of smokers that live in the same home as the subject
SMD470: number of people that smoke inside the home of the subject
RIDAGEYR: age of the subject

Note that the original variable names are kept.
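
The column layout described above can be illustrated with a small sketch. Because it should run without the package installed, it uses a synthetic stand-in matrix with the documented shape; the random values and the placeholder VOC column names (VOC1 to VOC16) are assumptions for illustration only, while SMD460, SMD470 and RIDAGEYR are the names given above.

```r
# Synthetic stand-in with the documented shape (512 x 19).
# The values are random numbers, NOT the NHANES measurements.
set.seed(1)
X <- matrix(rnorm(512 * 19), nrow = 512, ncol = 19)

# Placeholder names for the 16 VOC columns (hypothetical), followed by
# the three covariate names listed in the Format section.
colnames(X) <- c(paste0("VOC", 1:16), "SMD460", "SMD470", "RIDAGEYR")

# Split by position (VOCs) and by name (covariates).
voc    <- X[, 1:16]
covars <- X[, c("SMD460", "SMD470", "RIDAGEYR")]

dim(voc)
dim(covars)
```

With the real data loaded via data("data_VOC"), the same indexing should apply after replacing X by data_VOC.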

Details

All of the data were collected from the NHANES website and are part of the NHANES 2015-2016 survey. This was the most recent survey cycle with complete data at the time of extraction. Three datasets were matched in order to assemble this data:

UVOC_I: contains the information on the volatile organic compounds in urine
DEMO_I: contains the demographic information such as age
SMQFAM_I: contains the data on the smoking habits of family members

The dataset was constructed as follows:

Select the relevant VOCs from the UVOC_I data (see column names) and transform by taking the logarithm
Match the subjects in the UVOC_I data with their age in the DEMO_I data
Select all subjects with age at most 10
Match the data on smoking habits with the selected subjects.
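
The four steps above can be sketched in base R on toy data. This is a minimal illustration, not the code used to build the dataset: the three data frames are made-up stand-ins for UVOC_I, DEMO_I and SMQFAM_I, the VOC column names (VOC_A, VOC_B) are hypothetical, and SEQN is assumed to be the subject identifier used for matching (the text above does not name the key).

```r
# Toy stand-ins for the three NHANES tables; all values are made up.
uvoc <- data.frame(SEQN  = c(1, 2, 3, 4),
                   VOC_A = c(10, 20, 5, 8),
                   VOC_B = c(1.5, 3.0, 2.0, 4.0))
demo <- data.frame(SEQN     = c(1, 2, 3, 4, 5),
                   RIDAGEYR = c(4, 12, 9, 10, 7))
smqfam <- data.frame(SEQN   = c(1, 2, 3, 4, 5),
                     SMD460 = c(0, 1, 2, 0, 1),
                     SMD470 = c(0, 1, 1, 0, 0))

voc_cols <- c("VOC_A", "VOC_B")  # hypothetical VOC column names

# Step 1: select the relevant VOCs and take the logarithm.
voc <- uvoc[, c("SEQN", voc_cols)]
voc[voc_cols] <- log(voc[voc_cols])

# Step 2: match each subject with their age (SEQN is the assumed key).
voc_age <- merge(voc, demo[, c("SEQN", "RIDAGEYR")], by = "SEQN")

# Step 3: keep subjects with age at most 10.
young <- voc_age[voc_age$RIDAGEYR <= 10, ]

# Step 4: match the smoking data with the selected subjects.
combined <- merge(young, smqfam, by = "SEQN")

# Arrange as a numeric matrix: VOCs first, then the three covariates.
result <- as.matrix(combined[, c(voc_cols, "SMD460", "SMD470", "RIDAGEYR")])
result
```

merge() performs an inner join by default, so subjects missing from any of the three tables are dropped in this sketch; whether the original construction handled missing matches the same way is not stated here.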

References

J. Raymaekers and P.J. Rousseeuw (2020). Handling cellwise outliers by sparse regression and robust covariance. Journal of Data Science, Statistics, and Visualisation. doi:10.52933/jdssv.v1i3.18 (link to open access pdf)

Examples

data("data_VOC")
# For an analysis of this data, we refer to the vignette:
if (FALSE) {
vignette("DI_examples")
}
