nbc: Train an NBC model

Description

Performs supervised Naive Bayes Classification on verbal autopsy data.

Usage

nbc(train, test, known = TRUE)

Arguments

train

Dataframe of verbal autopsy train data (See Data documentation).

Columns (in order): ID, Cause, Symptom-1 to Symptom-n..
ID (vectorof char): unique case identifiers
Cause (vectorof char): observed causes for each case
Symptom-n.. (vectorsof (1 OR 0)): 1 for presence, 0 for absence; any other value is treated as unknown
Unknown symptoms are imputed randomly from distributions of 1s and 0s per symptom column; if no 1s or 0s exist then the column is removed

Example:

ID	Cause	S1	S2	S3
"a1"	"HIV"	1	0	0
"b2"	"Stroke"	0	0	1
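
The layout and the unknown-symptom rule described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the data values and the helper impute_column are hypothetical, and it assumes that imputing "from distributions of 1s and 0s" means sampling from the 1s and 0s observed in the same column.

```r
# Hypothetical train data in the documented layout: ID, Cause, then symptoms.
train <- data.frame(
  ID = c("a1", "b2", "c3", "d4"),
  Cause = c("HIV", "Stroke", "HIV", "Stroke"),
  S1 = c(1, 0, 9, 1),    # 9 is neither 1 nor 0, so it counts as unknown
  S2 = c(0, 0, 1, NA),   # NA also counts as unknown
  S3 = c(9, 9, 9, 9),    # no 1s or 0s at all
  stringsAsFactors = FALSE
)

# Illustrative helper (an assumption, not the package's implementation):
# fill unknown values by sampling from the known 1s and 0s in the column.
impute_column <- function(x) {
  known <- x %in% c(0, 1)
  if (!any(known)) return(NULL)  # nothing to sample from: drop the column
  # Sample indices so that a single known value is handled correctly
  idx <- sample.int(sum(known), size = sum(!known), replace = TRUE)
  x[!known] <- x[known][idx]
  x
}

set.seed(1)  # make the random imputation reproducible
symptoms <- lapply(train[-(1:2)], impute_column)
symptoms <- symptoms[!vapply(symptoms, is.null, logical(1))]
imputed <- cbind(train[1:2], as.data.frame(symptoms))
```

In this sketch S3 is dropped because it has no known values, while the unknown entries in S1 and S2 are replaced by 1 or 0 in proportion to how often each appears in that column.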

test

Dataframe of verbal autopsy test data in the same format as train except if causes are not known:

The 2nd column (Cause) can be omitted if known is FALSE

known

TRUE to indicate that the test causes are available in the 2nd column and FALSE to indicate that they are not known
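
As a short illustration of the two test layouts, the sketch below builds a test data frame with a Cause column and then drops that column for the case where causes are not known. The data values and the names test_known and test_unknown are hypothetical; the nbc calls are shown only as comments and assume a train data frame in the documented format.

```r
# Hypothetical test data in the same layout as train: ID, Cause, then symptoms.
test_known <- data.frame(
  ID = c("e5", "f6"),
  Cause = c("HIV", "Stroke"),
  S1 = c(1, 0),
  S2 = c(0, 1),
  S3 = c(0, 1),
  stringsAsFactors = FALSE
)

# When the causes are not known, the 2nd column (Cause) can be omitted.
test_unknown <- test_known[, -2]

# Matching calls, assuming `train` follows the documented format:
#   nbc(train, test_known, known = TRUE)
#   nbc(train, test_unknown, known = FALSE)
```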

Value

out The result nbc list object containing:

$prob.causes (vectorof double): the probabilities for each test case prediction by case id
$pred.causes (vectorof char): the predictions for each test case by case id
Additional values:
- * indicates that the value is only available if test causes are known
- $train (dataframe): the input train data
- $train.ids (vectorof char): the ids of the train data
- $train.causes (vectorof char): the causes of the train data by case id
- $train.samples (double): the number of input train samples
- $test (dataframe): the input test data
- $test.ids (vectorof char): the ids of the test data
- $test.causes* (vectorof char): the causes of the test data by case id
- $test.samples (double): the number of input test samples
- $test.known (logical): whether the test causes are known
- $symptoms (vectorof char): all unique symptoms in order
- $causes (vectorof char): all possible unique causes of death
- $causes.train (vectorof char): all unique causes of death in the train data
- $causes.test* (vectorof char): all unique causes of death in the test data
- $causes.pred (vectorof char): all unique causes of death in the predicted cases
- $causes.obs* (vectorof char): all unique causes of death in the observed cases
- $pred (dataframe): a table of predictions for each test case, sorted by probability
  - Columns (in order): CaseID, TrueCause, Prediction-1 to Prediction-n..
  - CaseID (vectorof char): case identifiers
  - TrueCause* (vectorof char): the observed causes of death
  - Prediction-n.. (vectorsof char): the predicted causes of death, where Prediction1 is the most probable cause, and Prediction-n is the least probable cause
  Example:
  CaseID Prediction1 Prediction2
  "a1" "HIV" "Stroke"
  "b2" "Stroke" "HIV"
- $obs* (dataframe): a table of observed causes matching $pred for each test case
  - Columns (in order): CaseID, TrueCause
  - CaseID (vectorof char): case identifiers
  - TrueCause (vectorof char): the actual cause of death if applicable
  Example:
  CaseID TrueCause
  "a1" "HIV"
  "b2" "Stroke"
- $obs.causes* (vectorof char): all observed causes of death by case id
- $prob (dataframe): a table of probabilities of each cause for each test case
  - Columns (in order): CaseID, Cause-1 to Cause-n..
  - CaseID (vectorof char): case identifiers
  - Cause-n.. (vectorsof double): probabilities for each cause of death
  Example:
  CaseID HIV Stroke
  "a1" 0.5 0.5
  "b2" 0.3 0.7
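
To show how a table shaped like $prob relates to ranked predictions shaped like $pred, the sketch below orders the causes of each case from most to least probable. It is a base R illustration using the example probabilities above, not the package's own code; the variable names are hypothetical, and the tie-breaking (keeping the original column order) is an assumption.

```r
# The probability table from the example above.
prob <- data.frame(
  CaseID = c("a1", "b2"),
  HIV = c(0.5, 0.3),
  Stroke = c(0.5, 0.7),
  stringsAsFactors = FALSE
)

causes <- names(prob)[-1]

# Rank the causes of each case from most to least probable.
# order() leaves tied values in their original column order.
ranked <- t(apply(prob[-1], 1, function(p) causes[order(-p)]))
colnames(ranked) <- paste0("Prediction", seq_along(causes))

pred <- data.frame(CaseID = prob$CaseID, ranked, stringsAsFactors = FALSE)

# Probability attached to the top-ranked cause of each case.
top_prob <- apply(prob[-1], 1, max)
```

Here pred has the columns CaseID, Prediction1 and Prediction2, and top_prob holds one probability per case, which is one plausible reading of how $prob.causes relates to $prob.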

References

Miasnikof P, Giannakeas V, Gomes M, Aleksandrowicz L, Shestopaloff AY, Alam D, Tollman S, Samarikhalaj, Jha P. Naive Bayes classifiers for verbal autopsies: comparison to physician-based classification for 21,000 child and adult deaths. BMC Medicine. 2015;13:286. doi:10.1186/s12916-015-0521-2.

Examples

# NOT RUN {
library(nbc4va)
data(nbc4vaData)

# Run naive bayes classifier on random train and test data
# Set "known" to indicate whether or not "test" causes are known
train <- nbc4vaData[1:50, ]
test <- nbc4vaData[51:100, ]
results <- nbc(train, test, known=TRUE)

# Obtain the probabilities and predictions
prob <- results$prob.causes
pred <- results$pred.causes

# }
