dataset_trec: TREC dataset

Description

The TREC dataset is a dataset for question classification consisting of open-domain, fact-based questions divided into broad semantic categories. It comes in a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples; TREC-50 uses finer-grained labels. Models are evaluated on classification accuracy.
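As a minimal sketch of the accuracy metric mentioned above: accuracy is the share of questions whose predicted class equals the true class. The two vectors below are toy values made up for illustration, not real TREC labels or model output.

```r
# Toy labels for illustration only; these are not real TREC predictions.
truth     <- c("LOC", "HUM", "NUM", "DESC", "ENTY")
predicted <- c("LOC", "HUM", "DESC", "DESC", "ENTY")

# Accuracy: proportion of positions where the prediction matches the truth.
accuracy <- mean(predicted == truth)
accuracy
```

With real data, `truth` would typically be the `class` column of the test split and `predicted` the output of a fitted model.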

Usage

dataset_trec(
  dir = NULL,
  split = c("train", "test"),
  version = c("6", "50"),
  delete = FALSE,
  return_path = FALSE,
  clean = FALSE,
  manual_download = FALSE
)

Value

A tibble with 5,452 rows for "train" or 500 rows for "test", and 2 variables:

class: Character, denoting the class
text: Character, question text
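The two-column layout described above can be explored with base R. The sketch below builds a small stand-in data frame with the same column names (`class` and `text`) so it runs without downloading anything; the name `trec_toy` and the rows are hypothetical and invented for illustration, and the real function returns a tibble rather than a plain data frame.

```r
# Hypothetical stand-in with the same columns as the returned tibble.
# The rows are invented for illustration and are not taken from TREC.
trec_toy <- data.frame(
  class = c("ABBR", "LOC", "HUM", "LOC"),
  text = c(
    "What does NASA stand for ?",
    "Where is the Eiffel Tower ?",
    "Who wrote Hamlet ?",
    "What country is Cairo in ?"
  ),
  stringsAsFactors = FALSE
)

# Count how many questions fall into each class.
class_counts <- table(trec_toy$class)
class_counts
```

The same `table()` call should work on the `class` column of the data returned by `dataset_trec()`.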

Arguments

dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
split: Character. Return training ("train") data or testing ("test") data. Defaults to "train".
version: Character. The six-class version ("6") or the fifty-class version ("50"). Defaults to "6".
delete: Logical, set TRUE to delete dataset.
return_path: Logical, set TRUE to return the path of the dataset.
clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.

Details

The classes in TREC-6 are:

ABBR - Abbreviation
DESC - Description and abstract concepts
ENTY - Entities
HUM - Human beings
LOC - Locations
NUM - Numeric values

The classes in TREC-50 can be found at https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html.
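For readable output, the TREC-6 codes can be mapped to their descriptions with a named character vector. This is a sketch only: the name `trec6_labels` is hypothetical and not part of the package, and the mapping assumes the standard `NUM` code for numeric values.

```r
# Hypothetical lookup table: TREC-6 class code -> description.
trec6_labels <- c(
  ABBR = "Abbreviation",
  DESC = "Description and abstract concepts",
  ENTY = "Entities",
  HUM  = "Human beings",
  LOC  = "Locations",
  NUM  = "Numeric values"
)

# Translate a vector of class codes into descriptions.
codes <- c("LOC", "NUM", "HUM")
descriptions <- unname(trec6_labels[codes])
descriptions
```

Indexing by name keeps the lookup vectorised, so it can be applied directly to a whole `class` column.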

Examples

if (FALSE) {
dataset_trec()

# Custom directory
dataset_trec(dir = "data/")

# Deleting dataset
dataset_trec(delete = TRUE)

# Returning filepath of data
dataset_trec(return_path = TRUE)

# Access both training and testing dataset
train_6 <- dataset_trec(split = "train")
test_6 <- dataset_trec(split = "test")

train_50 <- dataset_trec(split = "train", version = "50")
test_50 <- dataset_trec(split = "test", version = "50")
}