How to estimate the required sweeps to reach equilibrium in Ising model? - physics

How can I estimate the number of sweeps required to reach equilibrium in the 2D Ising model when using the Metropolis algorithm?

Related

Can I train a CNN-based deep-learning model on raw audio data at a high sampling frequency?

The shape of my raw data array is (1299981, 2000).
I am getting the error "Allocation of 415952000 exceeds 10% of free system memory."
My question is: how should I train my model on raw data while keeping the input shape feasible for the model (should I decrease the sampling frequency)?

Loss function for binary classification with problem of data imbalance

I am trying to segment multiple sclerosis lesions in MR images using deep convolutional neural networks with Keras. In this task, each voxel must be classified as either a lesion voxel or a healthy voxel.
The challenge of this task is data imbalance: the number of lesion voxels is much smaller than the number of healthy voxels, so the data is extremely imbalanced.
I have only a small amount of training data and cannot use sampling techniques. I am trying to select an appropriate loss function for classifying the voxels in these images.
I tested focal loss, but I could not tune the gamma parameter of this loss function.
Could someone help me select an appropriate loss function for this task?
Focal loss is indeed a good choice, but it is difficult to tune it to work.
I would recommend using online hard negative mining: at each iteration, after your forward pass, you have the loss computed per voxel. Before you compute gradients, sort the "healthy" voxels by their loss (high to low), and set to zero the loss for all healthy voxels apart from the worst k (where k is about 3 times the number of "lesion" voxels in the batch).
This way, gradients will only be estimated for a roughly balanced set.
This video provides a detailed explanation of how class imbalance negatively affects training, and how to use online hard negative mining to overcome it.
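As a rough illustration of the masking step described above, here is a minimal pure-Python sketch. The function name `mine_hard_negatives`, the flat-list inputs, and the `ratio` parameter are assumptions made for this example; in a real Keras pipeline the same idea would be applied to the per-voxel loss tensor before it is reduced to a scalar.

```python
def mine_hard_negatives(losses, labels, ratio=3):
    """Zero the loss of every healthy voxel except the k hardest ones.

    losses: flat list of per-voxel loss values
    labels: flat list of per-voxel labels, 1 = lesion, 0 = healthy
    ratio:  k = ratio * (number of lesion voxels); 3 follows the text above
    """
    n_lesion = sum(labels)
    k = ratio * n_lesion

    # Indices of healthy voxels, sorted by loss from high to low.
    healthy = [i for i, y in enumerate(labels) if y == 0]
    healthy.sort(key=lambda i: losses[i], reverse=True)
    keep = set(healthy[:k])

    # Lesion voxels always keep their loss; healthy ones only if in the top k.
    return [
        loss if (y == 1 or i in keep) else 0.0
        for i, (loss, y) in enumerate(zip(losses, labels))
    ]
```

With one lesion voxel and `ratio=3`, only the three highest-loss healthy voxels keep their loss. Note that in this sketch a batch without any lesion voxels gives k = 0, so every loss is zeroed; a minimum value for k may be worth adding.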

Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I got a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? For example, should the two loss terms be of the same order of magnitude?
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0; the issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss not from model.losses, but from the layers or weights of the network in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In the BBB paper, eq. 8, they refer to M as the number of mini-batches. To be consistent with non-stochastic gradient learning, the KL term should be scaled by the number of mini-batches, which is what is done by Graves. An alternative is given in eq. 9, where the term is scaled by \pi_i, with the values in the set {\pi} summing to one.
In the TFP example, it does look like num_examples is the total number of independent samples in the training set, which is much larger than the number of batches. This goes by a few names, such as Safe Bayes or Tempering. Have a look at sec. 8 of this paper for some more discussion about the use of tempering within Bayesian inference and its suitability.
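To see how the two coefficients relate, it can help to write both conventions out. The sketch below uses made-up numbers and a hypothetical helper `minibatch_losses`; it assumes the number of examples equals M times the batch size, and that the data term is either summed over the mini-batch (paired with KL/M) or averaged over it (paired with KL/N). Whether a particular example averages its likelihood term is something to check in its code.

```python
def minibatch_losses(kl, nll_per_example, n_examples):
    """Return the mini-batch loss under two scaling conventions.

    kl:              total KL(q || prior) over all weights
    nll_per_example: negative log-likelihoods of one mini-batch
    n_examples:      N, size of the full training set (assumed divisible
                     by the batch size)
    """
    batch_size = len(nll_per_example)
    n_batches = n_examples // batch_size  # M in the notation above

    # Convention 1: KL / M with the data term summed over the batch.
    summed = kl / n_batches + sum(nll_per_example)

    # Convention 2: KL / N with the data term averaged over the batch.
    averaged = kl / n_examples + sum(nll_per_example) / batch_size

    return summed, averaged


summed, averaged = minibatch_losses(
    kl=1000.0, nll_per_example=[0.5, 1.5, 1.0, 1.0], n_examples=400
)
```

Under these assumptions the two losses differ only by a constant factor equal to the batch size, so the relative weight of the KL term depends on how the likelihood term is reduced as well as on the coefficient itself.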
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO loss (the negative ELBO that you minimise) will always be larger than the cross-entropy alone (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is computed under a full mean-field approach, where each weight/parameter is assumed to be independent.
Since the assumed posterior is factorised (each parameter is assumed independent), you can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of the KL terms for each parameter. Since the KL is >= 0, each parameter you add to your model adds another non-negative term to your ELBO loss. This is likely why your loss is so much larger for your 3D model: there are more parameters.
Another reason this could occur is if you have less data (your M is smaller, so the KL term is scaled down less).
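The per-parameter sum can be made concrete with a small sketch. It assumes a Gaussian mean-field posterior and a standard normal prior on every weight, which is one common choice and not necessarily the one used in the question; the function names are made up for this example.

```python
import math


def kl_gaussian(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for one scalar weight."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
        - 0.5
    )


def mean_field_kl(mus, sigmas):
    """Total KL of a factorised posterior: the sum of per-weight KL terms."""
    return sum(kl_gaussian(m, s) for m, s in zip(mus, sigmas))


# Same per-weight posterior, different numbers of weights.
small_model = mean_field_kl([0.1] * 10, [0.5] * 10)
large_model = mean_field_kl([0.1] * 1000, [0.5] * 1000)
```

Here `large_model` is 100 times `small_model`, even though every individual weight has the same posterior: under these assumptions the total KL grows with the parameter count.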
What is the guideline for getting a proper value for this coefficient? For example, should the two loss terms be of the same order of magnitude?
I am not aware of any specific guideline. For training, you are interested primarily in the gradients, and a large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log likelihood and by the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term, but this feels a bit yucky for the Bayesian in me).
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
Yes, as stated before, in general, more parameters == greater ELBO (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0; the issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss not from model.losses, but from the layers or weights of the network in a TF construct?
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop) Batchnorm is fine. For sampling methods such as MCMC, Batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on suitability for batch norm with sampling methods for approx. Bayesian inference.

neural networks - variance of gradients in minibatch

I'm using TensorFlow to investigate the local reparameterization trick [Kingma et al, 2015] and the effect it has on the variance of the gradients. However, I'm getting strange results, and I'm concerned I may be misunderstanding what's going on.
My understanding is as follows: given a loss function or lower bound etc. it's possible to compute a matrix of derivatives for each data point which is the derivative of this loss function with respect to each weight in the weight matrix for the output layer (if these are the gradients we want to examine). For any one of these weights, the variance is calculated across the derivative of the loss function for each data point with respect to this weight. So if we start with a (1000, 1000) weight matrix, we finish with a (1000, 1000) matrix whose entries (i,j) are given by the variance across the derivatives of the loss function for each datapoint in the mini-batch with respect to weight (i,j).
At this point we can take the mean of all variances in the matrix to give us the final average variance. Is this what's being referred to when people talk about the variance of the gradients?
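The procedure described above can be sketched in a few lines. This is only an illustration of the computation as the question states it, not a claim that it is the standard definition; it assumes a linear model with squared loss and uses the population variance, and the function names are invented for the example.

```python
from statistics import mean, pvariance


def per_example_grads(weights, xs, ys):
    """Gradient of 0.5 * (w . x - y)^2 w.r.t. each weight, one row per example."""
    grads = []
    for x, y in zip(xs, ys):
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        grads.append([err * xi for xi in x])
    return grads


def mean_gradient_variance(grads):
    """Variance of each weight's derivative across the batch, then their mean."""
    # zip(*grads) transposes: each column holds one weight's derivative
    # for every data point in the mini-batch.
    per_weight = [pvariance(col) for col in zip(*grads)]
    return per_weight, mean(per_weight)


grads = per_example_grads([0.0, 0.0], [[1.0, 0.0], [1.0, 2.0]], [1.0, 3.0])
per_weight, average = mean_gradient_variance(grads)
```

For a weight matrix, the same idea applies entry by entry: `per_weight` plays the role of the matrix of variances, and `average` is the single number obtained by averaging it.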

In Stochastic Gradient Descent, as the cost function is updated based on a single training example, won't it lead to overfitting?

When we are dealing with Stochastic Gradient Descent, the cost function is updated based on a single, random training example.
But this single entry may alter the weights in its favour, and as the cost function depends only on that entry, the cost function might mislead us: it isn't actually reducing the overall cost, but is instead overfitting that particular entry. With the next entry, once again, the weights will be updated to favour this entry.
Won't this lead to overfitting? How do I go about resolving this issue?
The training data isn't random - SGD iterates over all the training points (either singly or in batches). Because the loss function is calculated for a data batch (or an individual training point), its gradient can be thought of as a random draw from a distribution of gradient vectors in weight space that will not exactly match the global gradient of the loss function calculated over the entirety of the training data. A single step is absolutely "over-fit" to the batch / training point, but we only take a single step in that direction (moderated by the learning rate, which is typically << 1). Then we move on to the next data point (or batch) and calculate a new gradient. There is a "recency" effect (data trained on more recently effectively counts more), but this is moderated by small learning rates. In aggregate over many iterations, all of the training data are equally weighted.
By doing this over all of the data in turn, each individual backprop step is taking a small random (but not uncorrelated) step in weight space. Across many training iterations, the network may be able to find its way to very good solutions (not a lot of guarantees about global optimality, but neural networks are highly expressive by their nature and can often find very good solutions). However, it may take many stepwise iterations over the same data set to converge to a local basin of attraction.
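A toy example may make this concrete. The sketch below fits a one-parameter model y = w * x by per-example SGD; the data, learning rate, and function name are all made up for illustration. Each step moves w only a small way towards the value that would fit the current point exactly, and the points pull in slightly different directions.

```python
import random


def sgd_fit(xs, ys, lr=0.01, epochs=200, seed=0):
    """Per-example SGD on the loss 0.5 * (w * x - y)^2 with a single weight w."""
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit every training point once per epoch
        for i in order:
            grad = (w * xs[i] - ys[i]) * xs[i]
            w -= lr * grad  # small step towards fitting point i only
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
w = sgd_fit(xs, ys)

# Least-squares solution over the whole data set, for comparison.
w_full = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

Individually, the four points would prefer slopes between 1.95 and 2.1, but after many small steps `w` settles close to the full-data least-squares value `w_full` (about 1.99) rather than fitting any single point.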
Over-fitting on training data is absolutely a concern for Neural Networks, but that's a function of their expressivity rather than the Stochastic Gradient Descent algorithm. Techniques like dropout and kernel regularizers on the training weights can provide regularization robustness, but the only way to