Performs Naive Bayes classification given train and test (validation) datasets, and returns the predictions together with additional information about the train and test data.
internalNBC(train, test, known = TRUE)
train Dataframe of verbal autopsy train data (See Data documentation).
Columns (in order): ID, Cause, Symptom-1 to Symptom-n..
ID (vectorof char): unique case identifiers
Cause (vectorof char): observed causes for each case
Symptom-n.. (vectorsof (1 OR 0)): 1 for presence, 0 for absence, other values are treated as unknown
Unknown symptoms are imputed randomly from distributions of 1s and 0s per symptom column; if no 1s or 0s exist then the column is removed
Example:
ID | Cause | S1 | S2 | S3 |
"a1" | "HIV" | 1 | 0 | 0 |
"b2" | "Stroke" | 0 | 0 | 1 |
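The imputation rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's own code: impute_symptom() is a hypothetical helper, the value 9 is an arbitrary stand-in for "unknown", and "imputed randomly from distributions of 1s and 0s" is read here as sampling with the observed proportion of 1s in that column.

```r
# Hypothetical train data in the documented layout; 9 marks an unknown value
train <- data.frame(
  ID = c("a1", "b2", "c3", "d4"),
  Cause = c("HIV", "Stroke", "HIV", "Stroke"),
  S1 = c(1, 0, 9, 1),
  S2 = c(0, 0, 1, 9),
  S3 = c(9, 9, 9, 9),  # no 1s or 0s at all, so this column is dropped
  stringsAsFactors = FALSE
)

# impute_symptom() is an illustrative helper, not part of nbc4va
impute_symptom <- function(x) {
  known <- x %in% c(0, 1)           # anything else (including NA) is unknown
  if (!any(known)) return(NULL)     # nothing to sample from: drop the column
  p1 <- mean(x[known] == 1)         # observed share of 1s in this column
  x[!known] <- rbinom(sum(!known), size = 1, prob = p1)
  x
}

set.seed(1)  # make the random imputation reproducible
symptoms <- lapply(train[, -(1:2), drop = FALSE], impute_symptom)
symptoms <- symptoms[!vapply(symptoms, is.null, logical(1))]
imputed <- cbind(train[, 1:2], as.data.frame(symptoms))
```

After this runs, imputed keeps the ID and Cause columns, contains only 1s and 0s in S1 and S2, and no longer has S3.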
test Dataframe of verbal autopsy test data in the same format as train, except when causes are not known:
The 2nd column (Cause) can be omitted if known is FALSE
known TRUE to indicate that the test causes are available in the 2nd column and FALSE to indicate that they are not known
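As a small illustration of the two accepted test layouts, the sketch below builds a hypothetical test data frame with and without the Cause column. The data values and the names test_known and test_unknown are made up for this example; the internalNBC() calls are shown only as comments.

```r
# Hypothetical test data in the documented layout (causes known)
test_known <- data.frame(
  ID = c("t1", "t2"),
  Cause = c("HIV", "Stroke"),
  S1 = c(1, 0),
  S2 = c(0, 1),
  stringsAsFactors = FALSE
)

# With unknown causes the 2nd column is omitted, so symptoms start at column 2
test_unknown <- test_known[, names(test_known) != "Cause"]

# The matching calls would then be (not run here):
# internalNBC(train, test_known, known = TRUE)
# internalNBC(train, test_unknown, known = FALSE)
```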
out The result list object containing:
$prob.causes (vectorof double): the probabilities for each test case prediction by case id
$pred.causes (vectorof char): the predictions for each test case by case id
Additional values:
* indicates that the value is only available if test causes are known
$train (dataframe): the input train data
$train.ids (vectorof char): the ids of the train data
$train.causes (vectorof char): the causes of the train data by case id
$train.samples (double): the number of input train samples
$test (dataframe): the input test data
$test.ids (vectorof char): the ids of the test data
$test.causes* (vectorof char): the causes of the test data by case id
$test.samples (double): the number of input test samples
$test.known (logical): whether the test causes are known
$symptoms (vectorof char): all unique symptoms in order
$causes (vectorof char): all possible unique causes of death
$causes.train (vectorof char): all unique causes of death in the train data
$causes.test* (vectorof char): all unique causes of death in the test data
$causes.pred (vectorof char): all unique causes of death in the predicted cases
$causes.obs* (vectorof char): all unique causes of death in the observed cases
$pred (dataframe): a table of predictions for each test case, sorted by probability
Columns (in order): CaseID, TrueCause, Prediction-1 to Prediction-n..
CaseID (vectorof char): case identifiers
TrueCause* (vectorof char): the observed causes of death
Prediction-n.. (vectorsof char): the predicted causes of death, where Prediction1 is the most probable cause, and Prediction-n is the least probable cause
Example:
CaseID | Prediction1 | Prediction2 |
"a1" | "HIV" | "Stroke" |
"b2" | "Stroke" | "HIV" |
$obs* (dataframe): a table of observed causes matching $pred for each test case
Columns (in order): CaseID, TrueCause
CaseID (vectorof char): case identifiers
TrueCause (vectorof char): the actual cause of death if applicable
Example:
CaseID | TrueCause |
"a1" | "HIV" |
"b2" | "Stroke" |
$obs.causes* (vectorof char): all observed causes of death by case id
$prob (dataframe): a table of probabilities of each cause for each test case
Columns (in order): CaseID, Cause-1 to Cause-n..
CaseID (vectorof char): case identifiers
Cause-n.. (vectorsof double): probabilities for each cause of death
Example:
CaseID | HIV | Stroke |
"a1" | 0.5 | 0.5 |
"b2" | 0.3 | 0.7 |
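To show how the returned tables relate to each other, the sketch below derives a ranked prediction table and the top prediction per case from a probability table shaped like $prob. It is an assumption-laden illustration in base R, not the package's implementation: the probabilities are hypothetical (chosen without ties), and prob.causes is assumed to be the probability of the top-ranked cause.

```r
# Hypothetical probability table shaped like $prob
prob <- data.frame(
  CaseID = c("a1", "b2"),
  HIV = c(0.6, 0.3),
  Stroke = c(0.4, 0.7),
  stringsAsFactors = FALSE
)

causes <- names(prob)[-1]
p <- as.matrix(prob[, -1])

# Rank causes per case from most to least probable
ranked <- t(apply(p, 1, function(row) causes[order(row, decreasing = TRUE)]))
colnames(ranked) <- paste0("Prediction", seq_along(causes))
pred <- data.frame(CaseID = prob$CaseID, ranked, stringsAsFactors = FALSE)

# Top prediction and its probability, named by case id
pred.causes <- setNames(pred$Prediction1, prob$CaseID)
prob.causes <- setNames(apply(p, 1, max), prob$CaseID)
```

Here pred has the same column layout as the $pred example (without TrueCause), and pred.causes and prob.causes are vectors named by case id.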
This function was built on code provided by Miasnikof et al. (2015). Edits to the code included the following improvements:
Causes can be character type
Matrix operations for speed
Removal of order dependence for causes
Refactoring of variable names for clarity
Included list structure of model data and details
Argument validation
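The first three improvements (character causes, matrix operations, and no dependence on cause order) can be illustrated with a minimal Bernoulli naive Bayes sketch in base R. This is not the code used by internalNBC(); the toy data, the variable names, and the add-one smoothing are all assumptions made for the example.

```r
# Hypothetical toy data: rows are cases, columns are binary symptoms
train_x <- matrix(c(1, 0, 0,
                    1, 1, 0,
                    0, 0, 1,
                    0, 1, 1), ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("S1", "S2", "S3")))
train_y <- c("HIV", "HIV", "Stroke", "Stroke")  # causes stay as character
test_x <- matrix(c(1, 0, 0,
                   0, 0, 1), ncol = 3, byrow = TRUE,
                 dimnames = list(c("a1", "b2"), colnames(train_x)))

causes <- sort(unique(train_y))
# Indicator matrix (cases x causes), so the order of causes does not matter
ind <- outer(train_y, causes, "==") * 1
n_cause <- colSums(ind)
prior <- n_cause / sum(n_cause)

# P(symptom = 1 | cause) with add-one smoothing (an assumption of this sketch)
theta <- (t(ind) %*% train_x + 1) / (n_cause + 2)  # causes x symptoms

# Log-likelihood of every test case under every cause via matrix products
loglik <- test_x %*% t(log(theta)) + (1 - test_x) %*% t(log(1 - theta))
logpost <- sweep(loglik, 2, log(prior), "+")

# Normalise each row to probabilities (subtract the row max for stability)
post <- exp(logpost - apply(logpost, 1, max))
post <- post / rowSums(post)
colnames(post) <- causes

predicted <- causes[max.col(post)]  # most probable cause per test case
```

The two matrix products replace a loop over cases and causes, which is the kind of speed-up the "Matrix operations for speed" item refers to.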
Miasnikof P, Giannakeas V, Gomes M, Aleksandrowicz L, Shestopaloff AY, Alam D, Tollman S, Samarikhalaj, Jha P. Naive Bayes classifiers for verbal autopsies: comparison to physician-based classification for 21,000 child and adult deaths. BMC Medicine. 2015;13:286. doi:10.1186/s12916-015-0521-2.
Other internal functions:
internalGetCSMFAcc(), internalGetCSMFMaxError(), internalGetCauseMetrics(), internalGetMetrics()
# NOT RUN {
library(nbc4va)
data(nbc4vaData)
# Create naive bayes classifier on random train and test data
# Set "known" to indicate whether or not "test" causes are known
train <- nbc4vaData[1:50, ]
test <- nbc4vaData[51:100, ]
results <- nbc4va::internalNBC(train, test, known=TRUE)
# Obtain the probabilities and predictions
prob <- results$prob.causes
pred <- results$pred.causes
# }