Cut a hierarchical clustering tree into clusters of documents.
This dialog groups the documents of a tm corpus according to a previously computed hierarchical clustering tree (see corpusClustDlg). It adds a new meta-data variable to the corpus, in which each number identifies a cluster; this variable is also added to the corpusMetaData data set. If clusters were already created, they are simply replaced.
Clusters are created by starting from the top of the dendrogram and splitting it at the highest merge points, one after another, until the requested number of branches is reached.
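The top-down cut described above can be sketched as follows. This is only an illustration in Python, not the code the dialog runs. It assumes the tree is stored the way R's hclust stores it (a list of merges where a negative number is a single document and a positive number refers to an earlier merge, plus one height per merge). The function name cut_tree and the numbering of clusters by their first document are assumptions made for the example.

```python
import heapq

def cut_tree(merges, heights, k):
    """Cut a dendrogram into k clusters, working down from the top.

    merges[i] = (a, b): a negative value -d is document d (1-based),
    a positive value m is the result of merge number m (1-based).
    heights[i] is the height of merge i + 1.
    Returns one cluster number per document.
    """
    n_docs = len(merges) + 1
    root = len(merges)                      # the last merge is the top of the tree
    heap = [(-heights[root - 1], root)]     # open branches, highest merge first
    leaves = []                             # branches reduced to a single document
    while heap and len(heap) + len(leaves) < k:
        _, node = heapq.heappop(heap)       # split at the highest remaining merge
        for child in merges[node - 1]:
            if child < 0:
                leaves.append(child)
            else:
                heapq.heappush(heap, (-heights[child - 1], child))

    def members(node):
        # Collect the documents found below a branch.
        if node < 0:
            return [-node]
        a, b = merges[node - 1]
        return members(a) + members(b)

    branches = [members(x) for x in leaves] + [members(m) for _, m in heap]
    branches.sort(key=min)                  # assumption: number clusters by first document
    labels = [0] * n_docs
    for cluster_id, docs in enumerate(branches, start=1):
        for d in docs:
            labels[d - 1] = cluster_id
    return labels

# Five documents: {1,2} merge at height 1, {3,4} at 2, {1,2}+5 at 3, everything at 5.
merges = [(-1, -2), (-3, -4), (1, -5), (2, 3)]
heights = [1.0, 2.0, 3.0, 5.0]
print(cut_tree(merges, heights, 2))  # [1, 1, 2, 2, 1]
print(cut_tree(merges, heights, 3))  # [1, 1, 2, 2, 3]
```

Asking for one more cluster only ever splits one existing branch, so cuts with different numbers of clusters are nested in each other.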
A window then opens to summarize the created clusters, listing the specific documents and terms of each cluster. Specific terms are those whose observed frequency in the cluster (called the “level” in the table) has the lowest probability under a hypergeometric distribution, based on their global frequencies in the corpus and on the number of occurrences of all terms in the considered cluster. All terms with a probability below the value chosen with the third slider are reported, except terms with fewer occurrences in the whole corpus than the value of the fourth slider (such terms often have a low probability but are too rare to be of interest). The last slider limits the number of terms shown for each cluster.
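As a rough illustration of how the three sliders interact, the Python sketch below filters a list of terms for one cluster. The function name, the dictionary keys and the default values are hypothetical, and keeping the terms with the lowest probabilities when the list is truncated is an assumption, since the text above does not say which terms are kept.

```python
def select_specific_terms(rows, max_prob=0.1, min_occurrences=5, max_terms=25):
    """Filter the terms of one cluster the way the three sliders are described.

    rows: list of dicts with keys "term", "prob" (probability under the
    hypergeometric distribution) and "global" (occurrences in the corpus).
    """
    kept = [r for r in rows
            if r["prob"] < max_prob               # third slider: probability threshold
            and r["global"] >= min_occurrences]   # fourth slider: drop rare terms
    kept.sort(key=lambda r: r["prob"])            # assumption: most specific terms first
    return kept[:max_terms]                       # last slider: cap the list length

rows = [
    {"term": "tax", "prob": 0.001, "global": 40},
    {"term": "zymurgy", "prob": 0.0005, "global": 2},   # improbable but too rare
    {"term": "budget", "prob": 0.03, "global": 25},
    {"term": "the", "prob": 0.6, "global": 900},        # not specific
]
print([r["term"] for r in select_specific_terms(rows)])  # ['tax', 'budget']
```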
Whether a term is positively or negatively associated with a cluster is shown by the sign of the t value, or by comparing the value of the “% Term/Level” column with that of the “Global %” column. The columns are defined as follows:
“% Term/Level”: the percent of the term's occurrences in all terms' occurrences in the level.
“% Level/Term”: the percent of the term's occurrences that appear in the level (rather than in other levels).
“Global %”: the percent of the term's occurrences in all terms' occurrences in the corpus.
“Level”: the number of occurrences of the term in the level (“internal”).
“Global”: the number of occurrences of the term in the corpus.
“t value”: the quantile of a normal distribution corresponding to the probability “Prob.”.
“Prob.”: the probability of observing such an extreme (high or low) number of occurrences of the term in the level, under a hypergeometric distribution.
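To make these definitions concrete, here is a minimal Python sketch that computes the quantities for a single term and a single level. It is an illustration under stated assumptions, not the package's own code: the function name and the result keys are hypothetical, the probability is taken as the one-sided tail on the side where the observed count lies relative to its expectation, and the t value is given a positive sign for over-represented terms.

```python
from math import comb
from statistics import NormalDist

def term_specificity(in_level, in_corpus, level_size, corpus_size):
    """in_level / in_corpus: occurrences of the term in the level / corpus.
    level_size / corpus_size: occurrences of all terms in the level / corpus."""
    def ways(i):
        # Number of ways to draw i occurrences of the term among level_size draws.
        return comb(in_corpus, i) * comb(corpus_size - in_corpus, level_size - i)

    expected = level_size * in_corpus / corpus_size
    lowest = max(0, level_size - (corpus_size - in_corpus))
    highest = min(in_corpus, level_size)
    if in_level > expected:      # over-represented: upper tail P(X >= observed)
        tail, sign = range(in_level, highest + 1), 1.0
    else:                        # under-represented: lower tail P(X <= observed)
        tail, sign = range(lowest, in_level + 1), -1.0
    prob = sum(ways(i) for i in tail) / comb(corpus_size, level_size)
    # Normal quantile matching the probability, signed by the direction (assumption).
    t_value = 0.0 if prob >= 1 else sign * abs(NormalDist().inv_cdf(prob))
    return {
        "term_pct_in_level": 100 * in_level / level_size,
        "level_pct_of_term": 100 * in_level / in_corpus,
        "global_pct": 100 * in_corpus / corpus_size,
        "prob": prob,
        "t_value": t_value,
    }

# A term seen 15 times in a 100-occurrence cluster, 50 times in a 1000-occurrence corpus.
result = term_specificity(15, 50, 100, 1000)
print(result["term_pct_in_level"], result["global_pct"])  # 15.0 5.0
```

In this example the term makes up 15% of the cluster against 5% of the corpus, so the association is positive and the t value is positive as well.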
Specific documents are selected using a different criterion than terms: the documents with the smallest Chi-squared distance to the average vocabulary of the cluster are shown. This is a Euclidean distance, but weighted by the inverse of the prevalence of each term in the whole corpus, and controlling for the documents' different lengths.
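The distance described above can be sketched in Python as follows. This is an illustration rather than the package's implementation: the function name is hypothetical, the "average vocabulary" is assumed to be the pooled term profile of all documents in the cluster, and every document is assumed to contain at least one term.

```python
from math import sqrt

def chi2_distances(dtm, cluster_docs):
    """Chi-squared distance of each cluster document to the cluster's average profile.

    dtm: document-term matrix as a list of rows of term counts.
    cluster_docs: indices of the rows that belong to the cluster.
    """
    n_terms = len(dtm[0])
    total = sum(sum(row) for row in dtm)
    # Prevalence of each term in the whole corpus (share of all occurrences).
    prevalence = [sum(row[j] for row in dtm) / total for j in range(n_terms)]
    # Average vocabulary of the cluster: pooled counts turned into proportions.
    cluster_total = sum(sum(dtm[i]) for i in cluster_docs)
    centre = [sum(dtm[i][j] for i in cluster_docs) / cluster_total
              for j in range(n_terms)]
    distances = {}
    for i in cluster_docs:
        length = sum(dtm[i])
        profile = [count / length for count in dtm[i]]   # controls for document length
        distances[i] = sqrt(sum((p - c) ** 2 / w          # inverse-prevalence weighting
                                for p, c, w in zip(profile, centre, prevalence)
                                if w > 0))
    return distances

dtm = [
    [2, 0, 2],   # document 0
    [4, 0, 4],   # document 1: same vocabulary as document 0, twice as long
    [0, 3, 1],   # document 2
    [1, 1, 0],   # document 3 (not in the cluster)
]
d = chi2_distances(dtm, [0, 1, 2])
closest = sorted(d, key=d.get)   # documents 0 and 1 come before document 2
```

Because counts are converted to proportions first, documents 0 and 1 are at the same distance from the cluster's average even though one is twice as long as the other.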
This dialog can only be used after having created a tree, which is done via the Text Mining->Hierarchical clustering->Create dendrogram... dialog.