rebmix (version 2.6.1)

adult: Adult Dataset

Description

The adult dataset, containing 48842 instances with 16 continuous, binary and discrete variables, was extracted from the Census Bureau database (http://www.census.gov/). The extraction was done by Barry Becker from the 1994 Census Bureau database.

Usage

adult

Format

adult is a data frame with 48842 cases (rows) and 16 variables (columns) named:
  1. Type binary train or test.
  2. Age continuous.
  3. Workclass one of the 8 discrete values private, self-emp-not-inc, self-emp-inc, federal-gov, local-gov, state-gov, without-pay or never-worked.
  4. Fnlwgt continuous; stands for final weight.
  5. Education one of the 16 discrete values bachelors, some-college, 11th, hs-grad, prof-school, assoc-acdm, assoc-voc, 9th, 7th-8th, 12th, masters, 1st-4th, 10th, doctorate, 5th-6th or preschool.
  6. Education.Num continuous.
  7. Marital.Status one of the 7 discrete values married-civ-spouse, divorced, never-married, separated, widowed, married-spouse-absent or married-af-spouse.
  8. Occupation one of the 14 discrete values tech-support, craft-repair, other-service, sales, exec-managerial, prof-specialty, handlers-cleaners, machine-op-inspct, adm-clerical, farming-fishing, transport-moving, priv-house-serv, protective-serv or armed-forces.
  9. Relationship one of the 6 discrete values wife, own-child, husband, not-in-family, other-relative or unmarried.
  10. Race one of the 5 discrete values white, asian-pac-islander, amer-indian-eskimo, other or black.
  11. Sex binary female or male.
  12. Capital.Gain continuous.
  13. Capital.Loss continuous.
  14. Hours.Per.Week continuous.
  15. Native.Country one of the 41 discrete values united-states, cambodia, england, puerto-rico, canada, germany, outlying-us(guam-usvi-etc), india, japan, greece, south, china, cuba, iran, honduras, philippines, italy, poland, jamaica, vietnam, mexico, portugal, ireland, france, dominican-republic, laos, ecuador, taiwan, haiti, columbia, hungary, guatemala, nicaragua, scotland, thailand, yugoslavia, el-salvador, trinadad&tobago, peru, hong or holand-netherlands.
  16. Income binary <=50k or >50k.

Source

A. Asuncion and D. J. Newman. UCI Machine Learning Repository, 2007. http://archive.ics.uci.edu/ml.

References

A. Asuncion and D. J. Newman. UCI Machine Learning Repository, 2007. http://archive.ics.uci.edu/ml.

Examples

data("adult")

## Find complete cases.

adult <- adult[complete.cases(adult), ]

## Map metric attributes.

adult[["Capital.Loss"]] <- ordered(cut(adult[["Capital.Loss"]], 2000))
adult[["Capital.Gain"]] <- ordered(cut(adult[["Capital.Gain"]], 2000))

## Show level attributes for binary and discrete variables.

levels(adult[["Type"]])
levels(adult[["Workclass"]])
levels(adult[["Education"]])
levels(adult[["Marital.Status"]])
levels(adult[["Occupation"]])
levels(adult[["Relationship"]])
levels(adult[["Race"]])
levels(adult[["Sex"]])
levels(adult[["Native.Country"]])
levels(adult[["Income"]])

## Replace levels with numbers.

adult <- as.data.frame(data.matrix(adult))

## Levels should start with 0 for discrete distributions except for the 
## Dirac distribution.

f <- c("Type", "Workclass", "Education", "Marital.Status", "Occupation", 
  "Relationship", "Race", "Sex", "Native.Country", "Income")

adult[, f] <- adult[, f] - 1

## Split adult dataset into two train subsets for the two Incomes
## and remove Type and Income columns.

trainle50k <- subset(adult, subset = (Type == 1) & (Income == 0), 
  select = c(-Type, -Income))
traingt50k <- subset(adult, subset = (Type == 1) & (Income == 1), 
  select = c(-Type, -Income))

trainall <- subset(adult, subset = Type == 1, select = c(-Type, -Income))

train <- as.factor(subset(adult, subset = Type == 1, select = c(Income))[, 1])

## Extract test dataset from adult dataset and remove Type 
## and Income columns.

testle50k <- subset(adult, subset = (Type == 0) & (Income == 0), 
  select = c(-Type, -Income))
testgt50k <- subset(adult, subset = (Type == 0) & (Income == 1), 
  select = c(-Type, -Income))

testall <- subset(adult, subset = Type == 0, select = c(-Type, -Income))

test <- as.factor(subset(adult, subset = Type == 0, select = c(Income))[, 1])

save(trainall, file = "trainall.rda")
save(testall, file = "testall.rda")
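
The recoding steps in the example can be easier to follow on a small case. The following is a minimal sketch on a toy data frame, not part of the package example. The data frame toy, its values and the explicit level order are assumptions made for illustration; the level order is chosen to agree with the example's use of Type == 1 for training rows and Income == 0 for the <=50k class.

```r
## Toy data frame (hypothetical values, not taken from adult).
## Levels are set explicitly so the integer codes do not depend on locale.
toy <- data.frame(
  Type   = factor(c("train", "test", "train", "test"),
                  levels = c("test", "train")),
  Income = factor(c("<=50k", ">50k", ">50k", "<=50k"),
                  levels = c("<=50k", ">50k")),
  Age    = c(39, 50, 38, 53)
)

## data.matrix() replaces each factor by its internal integer codes,
## which start at 1 (first level -> 1, second level -> 2, ...).
coded <- as.data.frame(data.matrix(toy))

## Shift the factor-derived columns so that codes start at 0.
## Type: test -> 0, train -> 1. Income: <=50k -> 0, >50k -> 1.
f <- c("Type", "Income")
coded[, f] <- coded[, f] - 1

## Training rows with income <=50k, without the Type and Income columns.
train_le50k <- subset(coded, subset = (Type == 1) & (Income == 0),
  select = c(-Type, -Income))

print(coded)
print(train_le50k)
```

Because the codes come from the order of the factor levels, it may be worth checking levels() on the real data before relying on the meaning of 0 and 1.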
