spark.lda: Latent Dirichlet Allocation

Description

spark.lda fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call summary to get a summary of the fitted LDA model, spark.posterior to compute posterior probabilities on new data, spark.perplexity to compute log perplexity on new data and write.ml/read.ml to save/load fitted models.

Usage

spark.lda(data, ...)
spark.posterior(object, newData)
spark.perplexity(object, data)
# S4 method for SparkDataFrame
spark.lda(
  data,
  features = "features",
  k = 10,
  maxIter = 20,
  optimizer = c("online", "em"),
  subsamplingRate = 0.05,
  topicConcentration = -1,
  docConcentration = -1,
  customizedStopWords = "",
  maxVocabSize = bitwShiftL(1, 18)
)
# S4 method for LDAModel
summary(object, maxTermsPerTopic)
# S4 method for LDAModel,SparkDataFrame
spark.perplexity(object, data)
# S4 method for LDAModel,SparkDataFrame
spark.posterior(object, newData)
# S4 method for LDAModel,character
write.ml(object, path, overwrite = FALSE)

Arguments

data

A SparkDataFrame for training.

...

additional argument(s) passed to the method.

object

A Latent Dirichlet Allocation model fitted by spark.lda.

newData

A SparkDataFrame for testing.

features

Features column name. Either a libSVM-format column or a character-format column is valid.

k

Number of topics.

maxIter

Maximum iterations.

optimizer

Optimizer used to train the LDA model, either "online" or "em". Default is "online".

subsamplingRate

(For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].

topicConcentration

Concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms. The default of -1 lets Spark set the value automatically. Use summary to retrieve the effective topicConcentration. Only a numeric of length 1 is accepted.

docConcentration

Concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta). The default of -1 lets Spark set the value automatically. Use summary to retrieve the effective docConcentration. Only a numeric of length 1 or length k is accepted.

customizedStopWords

Stop words to remove from the given corpus. This parameter is ignored if a libSVM-format column is used as the features column.

maxVocabSize

Maximum vocabulary size. Default is 1 << 18 (262144).

maxTermsPerTopic

Maximum number of terms to collect for each topic. Default is 10.

path

The directory where the model is saved.

overwrite

Whether to overwrite the output path if it already exists. Default is FALSE, which means an exception is thrown if the output path exists.
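As a rough illustration of how these arguments combine, the sketch below trains on a character-format features column. The toy corpus, the stop-word list, and all hyperparameter values are made up for illustration, and the code assumes a local Spark installation that sparkR.session() can start.

```r
library(SparkR)
sparkR.session()

# Hypothetical toy corpus: one document per row, stored as plain text.
docs <- data.frame(
  features = c(
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
    "the market rallied and stocks rose"
  ),
  stringsAsFactors = FALSE
)
df <- createDataFrame(docs)

# k = 2 topics, so docConcentration may have length 1 or length 2.
# subsamplingRate = 1.0 uses the whole (tiny) corpus in each iteration.
# customizedStopWords only matters because the column is character-format.
model <- spark.lda(
  df,
  features = "features",
  k = 2,
  maxIter = 10,
  optimizer = "online",
  subsamplingRate = 1.0,
  docConcentration = c(0.5, 0.5),
  topicConcentration = 0.1,
  customizedStopWords = c("the", "on", "as", "and")
)
```

With a libSVM-format features column, the same call would work without customizedStopWords, since that argument is ignored in that case.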

Value

spark.lda returns a fitted Latent Dirichlet Allocation model.

summary returns a list of summary information about the fitted model. The list includes:

docConcentration

concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta)

topicConcentration

concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms

logLikelihood

log likelihood of the entire corpus

logPerplexity

log perplexity

isDistributed

TRUE for distributed model while FALSE for local model

vocabSize

number of terms in the corpus

topics

top terms of each topic and their weights (10 terms per topic by default, see maxTermsPerTopic)

vocabulary

all terms of the training corpus, NULL if a libSVM-format file is used as the training set

trainingLogLikelihood

Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters). It is available only for a distributed LDA model (i.e., optimizer = "em").

logPrior

Log probability of the current parameter estimate: log P(topics, topic distributions for docs | Dirichlet hyperparameters). It is available only for a distributed LDA model (i.e., optimizer = "em").

spark.perplexity returns the log perplexity of the given SparkDataFrame, or the log perplexity of the training data if the data argument is missing.

spark.posterior returns a SparkDataFrame containing posterior probability vectors in a column named "topicDistribution".
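The returned values described above can be inspected along the following lines. This is a minimal sketch rather than a reference: it reuses the sample file from the Examples section (the path assumes the working directory is a Spark installation), the choices k = 3 and maxTermsPerTopic = 5 are arbitrary, and it assumes a Spark session can be started locally.

```r
library(SparkR)
sparkR.session()

# Sample libSVM file shipped with Spark; adjust the path to your installation.
text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(data = text, k = 3, optimizer = "em")

# summary returns a list; pick out individual elements by name.
stats <- summary(model, maxTermsPerTopic = 5)
stats$isDistributed            # expected to be TRUE for optimizer = "em"
stats$vocabSize
stats$docConcentration
stats$topicConcentration
head(stats$topics)             # top terms and weights per topic
stats$trainingLogLikelihood    # only available for a distributed model

# Leaving out data gives the log perplexity of the training data.
spark.perplexity(model)

# Posterior topic mixtures land in the "topicDistribution" column.
posterior <- spark.posterior(model, text)
showDF(select(posterior, "topicDistribution"))
```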

Examples

# NOT RUN {
text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(data = text, optimizer = "em")

# get a summary of the model
summary(model)

# compute posterior probabilities
posterior <- spark.posterior(model, text)
showDF(posterior)

# compute perplexity
perplexity <- spark.perplexity(model, text)

# save and load the model
path <- "path/to/model"
write.ml(model, path)
savedModel <- read.ml(path)
summary(savedModel)
# }
