This function implements the residual-based diagnostic method of Taddy
(2012). The basic idea is that when the model is correctly specified, the
multinomial likelihood implies a residual dispersion of \(\sigma^2=1\). A
sample dispersion greater than one suggests that the number of topics is set
too low, because the latent topics cannot account for the overdispersion. In
practice this can be a very demanding criterion, especially if the documents
are long. However, when coupled with other tools it can provide a valuable
perspective on model fit. The function is based on Taddy (2012) as well as
code found in the maptpx package.
Further details are available in the referenced paper, but broadly speaking
the dispersion is derived from the mean of the squared adjusted residuals.
We get the sample dispersion by dividing by the degrees of freedom
parameter. In estimating the degrees of freedom, we follow Taddy (2012) in
approximating the parameter \(\hat{N}\) by the number of expected counts
exceeding a tolerance parameter. The default tolerance of 1/100, taken from
Taddy (2012), can be changed by setting the tol argument.
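The calculation described above can be sketched in a few lines. The following is a minimal illustration in Python, not the package's own code. It assumes the adjusted residual for word j in document i takes the usual multinomial form \((x_{ij} - m_i q_{ij}) / \sqrt{m_i q_{ij}(1 - q_{ij})}\), where \(m_i\) is the document length and \(q_{ij}\) the fitted word probability. The function name, the argument n_params (the number of estimated parameters subtracted from \(\hat{N}\) to obtain the degrees of freedom) and the toy inputs are all hypothetical.

```python
def sample_dispersion(counts, probs, n_params, tol=0.01):
    """Sketch of a residual dispersion estimate (assumptions noted inline).

    counts   : list of per-document word-count vectors x_i
    probs    : list of per-document fitted word probabilities q_i
    n_params : hypothetical number of estimated parameters removed from N-hat
    tol      : expected counts above this value contribute to N-hat
    """
    total_sq_resid = 0.0
    n_hat = 0
    for x, q in zip(counts, probs):
        m = sum(x)  # document length m_i
        for x_j, q_j in zip(x, q):
            expected = m * q_j
            if expected > tol:
                n_hat += 1  # N-hat counts expected counts exceeding tol
            # Assumption: every cell with a non-degenerate probability
            # contributes a squared adjusted residual.
            if m > 0 and 0.0 < q_j < 1.0:
                total_sq_resid += (x_j - expected) ** 2 / (expected * (1.0 - q_j))
    df = n_hat - n_params
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return total_sq_resid / df, df


# Toy inputs, invented purely for illustration.
toy_counts = [[3, 1, 0], [0, 2, 2]]
toy_probs = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
sigma2, df = sample_dispersion(toy_counts, toy_probs, n_params=1)
print(sigma2, df)
```

Raising tol shrinks \(\hat{N}\) and therefore the degrees of freedom, which is why the choice of tolerance can move the estimated dispersion.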
The function returns the estimated sample dispersion (which should be close
to 1 when the data are generated by the fitted model) and the p-value of a
chi-squared test of the null hypothesis \(\sigma^2=1\) against the
alternative \(\sigma^2>1\). As Taddy notes and we echo, rejection of the
null 'provides a very rough measure for evidence in favor of a larger number
of topics.'
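To make the one-sided test concrete, here is a hedged Python sketch of an upper-tail chi-squared p-value. It assumes the test statistic is the sum of squared adjusted residuals (the dispersion times its degrees of freedom). It also swaps in the Wilson-Hilferty normal approximation to the chi-squared distribution so that only the standard library is needed; this is not an exact chi-squared tail probability and is not presented as the package's method. The function name is hypothetical.

```python
import math


def chi2_upper_tail_approx(stat, df):
    """Approximate P(chi2_df >= stat) using the Wilson-Hilferty
    cube-root normal approximation (a stand-in for an exact tail)."""
    if df <= 0:
        raise ValueError("df must be positive")
    if stat <= 0:
        return 1.0
    var = 2.0 / (9.0 * df)
    # (X / df)^(1/3) is approximately Normal(1 - var, var).
    z = ((stat / df) ** (1.0 / 3.0) - (1.0 - var)) / math.sqrt(var)
    # Upper-tail probability of a standard normal at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# A dispersion of 1 gives a statistic equal to df: no evidence against the null.
p_null = chi2_upper_tail_approx(100.0, 100)
# A dispersion of 2 doubles the statistic and yields a very small p-value.
p_over = chi2_upper_tail_approx(200.0, 100)
print(p_null, p_over)
```

Because the alternative is \(\sigma^2>1\), only the upper tail is used: a dispersion below one never leads to rejection.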