Perplexity in topic modeling - text-mining

I have run LDA using the topicmodels package on my training data. How can I determine the perplexity of the fitted model? I read the documentation, but I am not sure which code I should use.
Here's what I have so far:
burnin <- 500
iter <- 1000
#keep <- 30
k <- 4
results_training <- LDA(dtm_training, k,
                        method = "Gibbs",
                        control = list(burnin = burnin,
                                       iter = iter))
Terms <- terms(results_training, 10)
Topic <- topics(results_training, 4)
# Get the posterior probability for each document over each topic
posterior <- posterior(results_training)[[2]]
It works perfectly, but now my question is how I can use perplexity on the testing data (results_testing)? And how can I interpret the result of the perplexity?
Thanks
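For reference, perplexity is the exponential of the negative average log-likelihood per held-out token, so lower is better, and a model that guesses uniformly over a vocabulary of V words scores exactly V. As far as I know, topicmodels provides a perplexity() function that takes the fitted model plus a newdata document-term matrix. The sketch below only illustrates the definition in plain Python. The function name and the toy matrices are made up for illustration, and it assumes you already have document-topic proportions for the held-out documents.

```python
import math

def perplexity(counts, doc_topic, topic_word):
    """exp(-log-likelihood / number of tokens) for held-out word counts.

    counts[d][w]     : how often word w occurs in held-out document d
    doc_topic[d][k]  : topic proportions of document d (assumed already estimated)
    topic_word[k][w] : probability of word w under topic k
    """
    log_lik = 0.0
    n_tokens = 0
    n_topics = len(topic_word)
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n == 0:
                continue  # unseen words contribute nothing to the likelihood
            p = sum(doc_topic[d][k] * topic_word[k][w] for k in range(n_topics))
            log_lik += n * math.log(p)
            n_tokens += n
    return math.exp(-log_lik / n_tokens)

# Toy check: two topics that are both uniform over a 4-word vocabulary
counts = [[1, 2, 3, 4]]
doc_topic = [[0.5, 0.5]]
topic_word = [[0.25] * 4, [0.25] * 4]
uniform_score = perplexity(counts, doc_topic, topic_word)
print(uniform_score)
```

A fitted model is only useful if its held-out perplexity is well below the vocabulary size, and the usual use is comparing models (for example different values of k) on the same test set.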

After quantisation in neural network, will the output need to be scaled with the inverse of the weight scaling

I'm currently writing a script to quantise a Keras model down to 8 bits. I'm doing a fairly basic linear scaling on the weights, by assuming a normal distribution of weights and biases, and then interpolating all the values within 2 standard deviations of the mean, to the range [-128, 127].
This all works, and I run the model through inference, but my image out is crazy bad. I know there will be a small performance hit, but I'm seeing roughly 10x performance degradation.
My question is, after this scaling of the weights, do I need to do the inverse scaling operation to my output? None of the papers I've been reading seem to mention this, but I'm unsure why else my results would be so bad.
The network is for image demosaicing. It takes in a RAW image, and is meant to output an image with very low noise, and no demosaicing artefacts. My full precision model is very good, with image PSNRs of around 40-43dB, but after quantisation, I'm getting 4-8dB, and incredibly bad looking images.
Code for anyone who's bothered to read it
for i in layer_index:
    count = count + 1
    layer = model.get_layer(index=i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    std = np.std(weights_act)
    if std > max_std:
        max_std = std
    mean = np.mean(weights_act)
    mean_of_mean = mean_of_mean + mean
mean_of_mean = mean_of_mean / count
max_bound = mean_of_mean + 2*max_std
min_bound = mean_of_mean - 2*max_std
print(max_bound, min_bound)
for i in layer_index:
    layer = model.get_layer(index=i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    weights_shape = weights_act.shape
    bias_shape = bias_act.shape
    new_weights = np.empty(weights_shape, dtype=np.int8)
    print(new_weights.dtype)
    new_biass = np.empty(bias_shape, dtype=np.int8)
    for a in range(weights_shape[0]):
        for b in range(weights_shape[1]):
            for c in range(weights_shape[2]):
                for d in range(weights_shape[3]):
                    new_weight = (((weights_act[a,b,c,d] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
                    new_weights[a,b,c,d] = np.int8(new_weight)
                    #print(new_weights[a,b,c,d], weights_act[a,b,c,d])
    for e in range(bias_shape[0]):
        new_bias = (((bias_act[e] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
        new_biass[e] = np.int8(new_bias)
    new_weight_layer = (new_weights, new_biass)
    layer.set_weights(new_weight_layer)
You aren't doing what you think you are doing; I'll explain.
If you wish to take a pre-trained model and quantize it, you have to add scales after each operation that involves weights. Take the convolution operation as an example.
Convolution is a linear operation. In my explanation I will ignore the bias for the sake of simplicity (adding it is relatively easy). Let's assume X is our input, Y is our output and W is the weights; the convolution can then be written as:
Y=W*X
where '*' represents the convolution operation. What you are basically doing is taking the weights, multiplying them by some scalar (let's call it 'a') and shifting them by some other scalar (let's call it 'b'), so in your model you use W', where: W' = aW + b
So if we return to the convolution operation, in your quantized network you are basically computing: Y' = W'*X = (aW + b)*X
Because convolution is linear we get: Y' = a(W*X) + b*X (here b*X denotes X convolved with a kernel whose entries all equal b)
Don't forget that you want Y, not Y', at the output of the convolution; therefore you must undo the shift and then rescale to get the correct answer.
I hope that explanation was clear enough to show the problem in your network: you scale and shift all of the weights and never compensate for it. I think your confusion arises because you read papers that trained models in quantized mode from the beginning, rather than taking a pre-trained model and quantizing it.
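To make the shift-and-rescale argument concrete, here is a small 1-D sketch in plain Python. It is not the asker's network: the signal, the kernel and the helper correlate_valid are made up for illustration, and the scale a and shift b are chosen from the kernel's own min and max rather than from 2 standard deviations. It only shows that Y' is far from Y until the b term is subtracted and the result divided by a.

```python
import random

def correlate_valid(signal, kernel):
    """Sliding dot product (what CNN layers call convolution), no padding."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(32)]  # input X
w = [random.gauss(0.0, 0.1) for _ in range(5)]   # float weights W

# Affine map of the weights onto [-128, 127]: W' = a*W + b
w_min, w_max = min(w), max(w)
a = 255.0 / (w_max - w_min)
b = -128.0 - a * w_min
w_q = [max(-128, min(127, round(a * wi + b))) for wi in w]

y_ref = correlate_valid(x, w)    # Y  = W * X
y_raw = correlate_valid(x, w_q)  # Y' = W' * X, no compensation

# Y' = a*(W*X) + b*(1*X): subtract the shift term, then divide by the scale
ones_term = correlate_valid(x, [1.0] * len(w))
y_fix = [(yr - b * s) / a for yr, s in zip(y_raw, ones_term)]

err_raw = max(abs(p - q) for p, q in zip(y_raw, y_ref))
err_fix = max(abs(p - q) for p, q in zip(y_fix, y_ref))
print("max error without compensation:", err_raw)
print("max error with compensation:   ", err_fix)
```

The remaining error after compensation is only the rounding error of the 8-bit weights. In a real network the same correction has to be applied after every layer that uses quantized weights.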
For your problem, I think the TensorFlow Graph Transform Tool might help; take a look at:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md
If you wish to read more about quantizing pre-trained models, you can find more information here (for more academic material, search scholar.google.com):
https://www.tensorflow.org/lite/performance/post_training_quantization

Writing custom optimizers questions

Apologies beforehand if these are basic questions. I've done some digging around and can't find straightforward answers. If there are links to any resources that can help, that would be great too!
I'm currently looking at the piece of code below, from a class that implements optimizer.Optimizer. I'm having trouble understanding the following things:
def _create_slots(self, var_list):
    # Create slots for the first and second moments.
    for v in var_list:
        self._zeros_slot(v, "m", self._name)

def _apply_dense(self, grad, var):
    lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
    beta_t = math_ops.cast(self._beta_t, var.dtype.base_dtype)
    alpha_t = math_ops.cast(self._alpha_t, var.dtype.base_dtype)
    eps = 1e-7  # cap for moving average
    m = self.get_slot(var, "m")
    m_t = m.assign(tf.maximum(beta_t * m + eps, tf.abs(grad)))
    var_update = state_ops.assign_sub(var, lr_t*grad*(1.0+alpha_t*tf.sign(grad)*tf.sign(m_t)))
    # Create an op that groups multiple operations.
    # When this op finishes, all ops in input have finished.
    return control_flow_ops.group(*[var_update, m_t])
In _create_slots, is it allocating a variable "m" for each weight? And if I needed more variables, would I add another line in the for loop like self._zeros_slot(v, "another variable", self._name) and so forth?
Is the input grad and var in _apply_dense the gradient per weight and variables per weight? What if I needed other gradients, e.g. if I wanted to do a global update based on the whole gradient matrix?
How is m being updated with new weights? It seems like on every iteration it would just be multiplied by beta.
In the line with var_update, it seems like var is being updated with the weight update, but I can also get my variables from var?
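One way to reason about the m update is to trace the arithmetic of _apply_dense by hand. The sketch below mirrors the two assignments from the code above in plain Python, element by element. The function name, the default values for lr, beta and alpha, and the toy inputs are assumptions for illustration only; the real TensorFlow ops work on whole tensors, and as far as I understand the slot m has the same shape as its variable.

```python
def sign(v):
    """Return -1, 0 or 1, like tf.sign for a scalar."""
    return (v > 0) - (v < 0)

def apply_dense_step(var, grad, m, lr=0.01, beta=0.9, alpha=0.5, eps=1e-7):
    # m_t = max(beta * m + eps, |grad|): a decayed running maximum of |grad|
    m_new = [max(beta * mi + eps, abs(g)) for mi, g in zip(m, grad)]
    # var <- var - lr * grad * (1 + alpha * sign(grad) * sign(m_t))
    var_new = [v - lr * g * (1.0 + alpha * sign(g) * sign(mn))
               for v, g, mn in zip(var, grad, m_new)]
    return var_new, m_new

var, m = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
var, m = apply_dense_step(var, [1.0, -2.0, 0.0], m)  # first gradient
var2, m2 = apply_dense_step(var, [0.1, 0.1, 0.1], m)  # smaller gradient
print(m, m2)
```

In this trace m is not simply multiplied by beta each step: wherever the new |grad| is larger than the decayed value, m jumps to |grad| instead.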

Predicting from the full posterior distribution using stan_glmer

Could I ask for some help please?
I have fit a binomial model using stan_glmer and have picked the model which I think best fits the data. I have used the posterior predict command to compare my observed data to data simulated by the model and it seems very similar.
I now want to predict the probability of an event for different levels of the predictors. I would usually use the predict command in glmer but I know I should use the posterior_predict command for stan_glmer to take into account the full uncertainty in the model. If x1 and x2 are continuous predictors for a binary event and I want a random intercept on group, the model formula would be:
model <- stan_glmer(binary_event ~ x1 + x2 + (1 | group), family = "binomial")
My question is: I want to vary the predictors (x1 and x2) to see how the model predicts the observed data (and the variability in those predictions), maybe as a plot but I’m not sure how. Any help or guidance would be greatly appreciated.
In short, posterior_predict has a newdata argument that expects a data.frame with values of x1, x2, and group. This argument is similar to that in many other prediction functions, and there is an example of its use that can be run via example(posterior_predict, package = "rstanarm").
In your case, it might be something like
nd <- with(original_data,
           expand.grid(x1 = seq(from = min(x1), to = max(x1), length.out = 20),
                       x2 = seq(from = min(x2), to = max(x2), length.out = 20),
                       group = levels(group)))
PPD <- posterior_predict(model, newdata = nd)
but you could choose the values of x1 and x2 in various other ways.

Fit a bayesian linear regression and predict unobservable values

I'd like to use JAGS plus R to adjust a linear model with observable quantities, and make inference about unobservable ones. I found lots of examples on the internet about how to adjust the model, but nothing on how to extrapolate its coefficients after having fitted the model in the JAGS environment. So, I'll appreciate any help on this.
My data looks like the following:
ngroups <- 2
group <- 1:ngroups
nobs <- 100
dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
head(dta)
JAGS has powerful ways to make inference about missing data, and once you get the hang of it, it's easy! I strongly recommend that you check out Marc Kéry's excellent book which provides a wonderful introduction to BUGS language programming (JAGS is close enough to BUGS that almost everything transfers).
The easiest way to do this involves, as you say, adjusting the model. Below I provide a complete worked example of how this works. But you seem to be asking for a way to get the prediction interval without re-running the model (is your model very large and computationally expensive?). This can also be done.
How to predict--the hard way (without re-running the model)
For each iteration of the MCMC, simulate the response for the desired x-value based on that iteration's posterior draws for the model parameters. So imagine you want to predict a value for X=10. Then if iteration 1 (post burn-in) has slope=2, intercept=1, and standard deviation=0.5, draw a Y-value from
Y=rnorm(1, 1+2*10, 0.5)
And repeat for iteration 2, 3, 4, 5...
These will be your posterior draws for the response at X=10. Note: if you did not monitor the standard deviation in your JAGS model, you are out of luck and need to fit the model again.
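The per-iteration recipe above can be sketched as follows. This is a plain Python illustration rather than rjags code, and the three lists of draws are simulated stand-ins (hypothetical values) for the intercept, slope and standard deviation you would actually extract from your monitored MCMC output.

```python
import random

random.seed(1)
n_draws = 3000

# Hypothetical stand-ins for the monitored posterior draws, one per iteration
intercept = [random.gauss(2.0, 0.10) for _ in range(n_draws)]
slope = [random.gauss(1.0, 0.15) for _ in range(n_draws)]
sigma = [abs(random.gauss(0.5, 0.03)) for _ in range(n_draws)]

x_new = 10.0

# One simulated response per iteration, using that iteration's parameters
y_pred = [random.gauss(b0 + b1 * x_new, s)
          for b0, b1, s in zip(intercept, slope, sigma)]

# Empirical 95% prediction interval from the sorted draws
y_sorted = sorted(y_pred)
lower = y_sorted[int(0.025 * n_draws)]
upper = y_sorted[int(0.975 * n_draws)]
print("95% prediction interval at X = 10:", lower, upper)
```

Because each draw uses its own intercept, slope and standard deviation, the interval reflects both parameter uncertainty and residual noise.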
How to predict--the easy way--with worked example
The basic idea is to insert (into your data) the x-values whose responses you want to predict, with the associated y-values NA. For example, if you want a prediction interval for X=10, you just have to include the point (10, NA) in your data, and set a trace monitor for the y-value.
I use JAGS from R with the rjags package. Below is a complete worked example that begins by simulating the data, then adds some extra x-values to the data, specifies and runs the linear model in JAGS via rjags, and summarizes the results. Y[101:105] contains draws from the posterior prediction intervals for X[101:105]. Notice that Y[1:100] just contains the y-values for X[1:100]. These are the observed data that we fed to the model, and they never change as the model updates.
library(rjags)
# Simulate data (100 observations)
my.data <- as.data.frame(matrix(data=NA, nrow=100, ncol=2))
names(my.data) <- c("X", "Y")
# the linear model will predict Y based on the covariate X
my.data$X <- runif(100) # values for the covariate
int <- 2 # specify the true intercept
slope <- 1 # specify the true slope
sigma <- .5 # specify the true residual standard deviation
my.data$Y <- rnorm(100, slope*my.data$X+int, sigma) # Simulate the data
#### Extra data for prediction of unknown Y-values from known X-values
y.predict <- as.data.frame(matrix(data=NA, nrow=5, ncol=2))
names(y.predict) <- c("X", "Y")
y.predict$X <- c(-1, 0, 1.3, 2, 7)
mydata <- rbind(my.data, y.predict)
set.seed(333)
setwd(INSERT YOUR WORKING DIRECTORY HERE)
sink("mymodel.txt")
cat("model{
  # Priors
  int ~ dnorm(0, .001)
  slope ~ dnorm(0, .001)
  tau <- 1/(sigma * sigma)
  sigma ~ dunif(0,10)
  # Model structure
  for(i in 1:R){
    Y[i] ~ dnorm(m[i],tau)
    m[i] <- int + slope * X[i]
  }
}", fill=TRUE)
sink()
jags.data <- list(R=dim(mydata)[1], X=mydata$X, Y=mydata$Y)
inits <- function(){list(int=rnorm(1, 0, 5), slope=rnorm(1,0,5),
                         sigma=runif(1,0,10))}
params <- c("Y", "int", "slope", "sigma")
nc <- 3
n.adapt <-1000
n.burn <- 1000
n.iter <- 10000
thin <- 10
my.model <- jags.model('mymodel.txt', data = jags.data, inits=inits, n.chains=nc, n.adapt=n.adapt)
update(my.model, n.burn)
my.model_samples <- coda.samples(my.model,params,n.iter=n.iter, thin=thin)
summary(my.model_samples)

R: FAST multivariate optimization packages?

I am looking to find a local minimum of a scalar function of 4 variables, and I have range constraints on the variables ("box constraints"). There's no closed form for the function's derivative, so methods needing an analytical derivative are out of the question. I've tried several options and control parameters with the optim function, but all of them seem very slow. Specifically, they seem to spend a lot of time between calls to my (R-defined) objective function, so I know the bottleneck is not my objective function but the "thinking" between calls to it. I looked at the CRAN Task View for optimization and tried several of those options (DEoptim from RcppDE, etc.) but none of them seem any good. I would have liked to try the nloptr package (an R wrapper for the NLopt library) but it seems to be unavailable for Windows.
I'm wondering, are there any good, fast optimization packages that people use that I may be missing? Ideally these would be in the form of thin wrappers around good C++/Fortran libraries, so there's minimal pure-R code. (Though this shouldn't be relevant, my optimization problem arose while trying to fit a 4-parameter distribution to a set of values, by minimizing a certain goodness-of-fit measure).
In the past I've found R's optimization libraries to be quite slow, and ended up writing a thin R wrapper calling a C++ API of a commercial optimization library. So are the best libraries necessarily commercial ones?
UPDATE. Here is a simplified example of the code I'm looking at:
###########
## given a set of values x and a cdf, calculate a measure of "misfit":
## smaller value is better fit
## x is assumed sorted in non-decr order;
Misfit <- function(x, cdf) {
  nevals <<- nevals + 1
  thinkSecs <<- thinkSecs + ( Sys.time() - snapTime)
  cat('S')
  if(nevals %% 20 == 0) cat('\n')
  L <- length(x)
  cdf_x <- pmax(0.0001, pmin(0.9999, cdf(x)))
  measure <- -L - (1/L) * sum( (2 * (1:L)-1 )* ( log( cdf_x ) + log( 1 - rev(cdf_x))))
  snapTime <<- Sys.time()
  cat('E')
  return(measure)
}
## Given 3 parameters nu (degrees of freedom, or shape),
## sigma (dispersion), gamma (skewness),
## returns the corresponding 4-parameter student-T cdf parametrized by these params
## (we restrict the location parameter mu to be 0).
skewtGen <- function( p ) {
  require(ghyp)
  pars = student.t( nu = p[1], mu = 0, sigma = p[2], gamma = p[3] )
  function(z) pghyp(z, pars)
}
## Fit using optim() and BFGS method
fit_BFGS <- function(x, init = c()) {
  x <- sort(x)
  nevals <<- 0
  objFun <- function(par) Misfit(x, skewtGen(par))
  snapTime <<- Sys.time() ## global time snap shot
  thinkSecs <<- 0 ## secs spent "thinking" between objFun calls
  tUser <- system.time(
    res <- optim(init, objFun,
                 lower = c(2.1, 0.1, -1), upper = c(15, 2, 1),
                 method = 'L-BFGS-B',
                 control = list(trace=2, factr = 1e12, pgtol = .01 )) )[1]
  cat('Total time = ', tUser,
      ' secs, ObjFun Time Pct = ', 100*(1 - thinkSecs/tUser), '\n')
  cat('results:\n')
  print(res$par)
}
fit_DE <- function(x) {
  x <- sort(x)
  nevals <<- 0
  objFun <- function(par) Misfit(x, skewtGen(par))
  snapTime <<- Sys.time() ## global time snap shot
  thinkSecs <<- 0 ## secs spent "thinking" between objFun calls
  require(RcppDE)
  tUser <- system.time(
    res <- DEoptim(objFun,
                   lower = c(2.1, 0.1, -1),
                   upper = c(15, 2, 1) )) [1]
  cat('Total time = ', tUser,
      ' secs, ObjFun Time Pct = ', 100*(1 - thinkSecs/tUser), '\n')
  cat('results:\n')
  print(res$par)
}
Let's generate a random sample:
set.seed(1)
# generate 1000 standard-student-T points with nu = 4 (degrees of freedom)
x <- rt(1000,4)
First fit using the fit.tuv (for "T UniVariate") function in the ghyp package -- this uses the Max-likelihood Expectation-Maximization (E-M) method. This is wicked fast!
require(ghyp)
> system.time( print(unlist( pars <- coef( fit.tuv(x, silent = TRUE) ))[c(2,4,5,6)]))
         nu          mu       sigma       gamma
 3.16658356  0.11008948  1.56794166 -0.04734128
   user  system elapsed
   0.27    0.00    0.27
Now I am trying to fit the distribution a different way: by minimizing the "misfit" measure defined above, using the standard optim() function in base R. Note that the results will not in general be the same. My reason for doing this is to compare these two results for a whole class of situations. I pass in the above Max-Likelihood estimate as the starting point for this optimization.
> fit_BFGS( x, init = c(pars$nu, pars$sigma, pars$gamma) )
N = 3, M = 5 machine precision = 2.22045e-16
....................
....................
.........
iterations 5
function evaluations 7
segments explored during Cauchy searches 7
BFGS updates skipped 0
active bounds at final generalized Cauchy point 0
norm of the final projected gradient 0.0492174
final function value 0.368136
final value 0.368136
converged
Total time = 41.02 secs, ObjFun Time Pct = 99.77084
results:
[1] 3.2389296 1.5483393 0.1161706
I also tried to fit with DEoptim() but it ran for too long and I had to kill it. As you can see from the output above, 99.8% of the time is attributable to the objective function! So Dirk and Mike were right in their comments below. I should have estimated the time spent in my objective function more carefully, and printing dots was not a good idea! I also suspect the MLE (E-M) method is very fast because it uses an analytical (closed-form) expression for the log-likelihood function.
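For anyone wanting to do the same check, timing the inside of the objective function directly is less error-prone than timing the gaps between calls. Below is a minimal Python sketch of that idea. TimedObjective, toy_misfit and the crude grid search are all hypothetical stand-ins, not the R code above or any optimizer's API.

```python
import math
import time

class TimedObjective:
    """Wraps an objective function and accumulates the time spent inside it."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, par):
        start = time.perf_counter()
        try:
            return self.fn(par)
        finally:
            self.calls += 1
            self.seconds += time.perf_counter() - start

def toy_misfit(par):
    # Hypothetical stand-in for an expensive goodness-of-fit measure
    busy = sum(math.sqrt(i) for i in range(5000))  # artificial work
    return sum((p - 1.0) ** 2 for p in par) + 0.0 * busy

objective = TimedObjective(toy_misfit)
t0 = time.perf_counter()
# Crude 1-D grid search standing in for a real optimizer
best = min(objective([x / 10.0, 1.0, 1.0]) for x in range(50))
total = time.perf_counter() - t0

share = 100.0 * objective.seconds / total
print(f"{objective.calls} calls, {share:.1f}% of the time inside the objective")
```

If the share is close to 100%, a faster optimizer will not help much; the objective itself (here, building and evaluating the cdf) is what needs speeding up.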
A specialised maximum likelihood estimator, when one exists for your problem, will almost always be faster than a general-purpose global optimizer, in any language.
A global optimizer, no matter the algorithm, typically combines some random jumps with local minimization routines. Different algorithms may discuss this in terms of populations (genetic algorithms), annealing, migration, etc. but they are all conceptually similar.
In practice, this means that if you have a smooth function, some other optimization algorithm will likely be fastest. The characteristics of your problem function will dictate whether it is a quadratic, linear, conic, or some other type of optimization problem for which an exact (or near-exact) analytical solution exists, or whether you will need to apply a global optimizer that is necessarily slower.
By using ghyp, you're saying that your 4 variable function produces an output that may be fit to the generalized hyperbolic distribution, and you are using a maximum likelihood estimator to find the closest generalized hyperbolic distribution to the data you've provided. But if you are doing that, I'm afraid I don't understand how you could have a non-smooth surface requiring optimization.
In general, the optimizer you choose needs to be chosen based on your problem. There is no perfect 'optimal optimizer', in any programming language, and choice of optimization algorithm to fit your problem will likely make more of a difference than any minor inefficiencies of the implementation.