Initialization of Lagrangian Multipliers in SVM using SMO (sequential minimal optimization) - optimization

I came across a paper that implemented SVM using SMO, and I planned to implement SVR (support vector regression) on that basis, also using SMO. But I'm stuck: how are the initial values of the Lagrange multipliers generated? Are they generated by a random function? I have looked through several implementations and none of them explain how the initial values are chosen.

The initial multipliers can be chosen at random and SVR will eventually converge to the optimal ones. In SVR the second-order derivative is guaranteed to be positive, whereas in SVM it may not always be, which can hinder the optimization.
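For concreteness, here is a minimal sketch in Python/NumPy (my own illustration, with hypothetical names) of how the SVR dual variables could be set up before the SMO loop starts. Many implementations simply start from zeros, which satisfies the box and equality constraints by construction; a random start is also possible, as the answer above suggests, as long as those constraints are respected.

```python
import numpy as np

def init_multipliers(n_samples, C, random_start=False, seed=0):
    """Initialize the SVR dual variables (alpha, alpha_star) before SMO.

    Zeros satisfy 0 <= alpha_i, alpha_i* <= C and
    sum_i (alpha_i - alpha_i*) = 0 by construction.
    """
    if not random_start:
        return np.zeros(n_samples), np.zeros(n_samples)

    # Random start: draw values in [0, C], then reuse the same vector for
    # alpha_star so the equality constraint sum_i (alpha_i - alpha_i*) = 0
    # still holds before the first SMO update.
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(0.0, C, size=n_samples)
    return alpha, alpha.copy()

alpha, alpha_star = init_multipliers(n_samples=100, C=1.0, random_start=True)
```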

Related

Building a deep neural network that produces output that is distributed as multivariate Standard normal distribution

I'm looking for a way to build a deep neural network that produces output distributed as a multivariate standard normal distribution ~N(0, I).
I can use PyTorch or TensorFlow, whichever is easier for this task.
I have some input X, which for the purposes of this question can be assumed to be just a matrix of values drawn from a uniform distribution.
I feed the input into the network, whose architecture is still open to change.
I want the output to satisfy the following (in addition to other requirements I will impose on it): if we look at the values produced across all possible x's, they should look like samples from a multivariate standard normal distribution ~N(0, I).
What I think needs to be done for this to happen is to choose the right loss function.
To do this, I thought of two ways:
1. Use statistical tests as the loss.
2. Use a loss that checks a large number of properties (mean, standard deviation, ...).
Implementing option 2 sounded complicated, so I started with option 1.
I looked for statistical tests already implemented in one of the packages that could be used as a loss function, but I did not find anything like that.
I implemented statistical tests myself to obtain output that follows a univariate standard normal distribution, and that seemed to work relatively well.
With multivariate tests I got more tangled up.
Do you know of any TensorFlow/PyTorch functions that do something similar to what I'm trying to do?
Do you have another idea for how to approach this?
Do you have any comments on the methods I'm trying to use?
Thanks
Using PyTorch's built-in functions can help you a lot. Since I don't know exactly what you want to do with these results, I can refer you to the PyTorch documentation with this link here.
That page lists all the PyTorch loss functions and the calculations used in each one of them; just click on one, check how it works, and see if it's what you're looking for.
For your second point, you can look on that same page at the BCEWithLogitsLoss function, because it may be what you are looking for.
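As a rough illustration of option 2 from the question (a loss built from distributional properties), here is a hedged PyTorch sketch of a moment-matching penalty that pushes a batch of outputs toward zero mean and identity covariance. The function name and weighting are my own illustrative choices; matching the first two moments does not guarantee normality, it is only a starting point.

```python
import torch

def standard_normal_moment_loss(z):
    """Penalize deviation of a batch z (shape [batch, dim]) from
    zero mean and identity covariance."""
    mean = z.mean(dim=0)                            # per-dimension mean
    centered = z - mean
    cov = centered.T @ centered / (z.shape[0] - 1)  # sample covariance
    eye = torch.eye(z.shape[1], device=z.device)
    return (mean ** 2).sum() + ((cov - eye) ** 2).sum()

# Usage sketch: add this term to whatever other loss the network has.
z = torch.randn(256, 8, requires_grad=True)         # stand-in for network outputs
loss = standard_normal_moment_loss(z)
loss.backward()
```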

Why does normalization not need parameters, but batch normalization does?

Normalization just normalizes the input layer, while batch normalization is applied at each layer.
We do not learn any parameters in plain normalization.
So why do we need to learn parameters in batch normalization?
This has been answered in detail at https://stats.stackexchange.com/a/310761.
Deep Learning Book, Section 8.7.1:
Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. To maintain the expressive power of the network, it is common to replace the batch of hidden unit activations H with γH+β rather than simply the normalized H. The variables γ and β are learned parameters that allow the new variable to have any mean and standard deviation. At first glance, this may seem useless — why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β?
The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH+β is determined solely by β. The new parametrization is much easier to learn with gradient descent.
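To make the difference concrete, here is a small PyTorch sketch (my own example, not from the quoted text): plain standardization is a fixed computation with nothing to learn, while nn.BatchNorm1d exposes γ and β as the layer's learnable weight and bias.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 10)  # a batch of 32 examples with 10 features

# Plain normalization: a fixed computation, no learnable parameters.
plain = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-5)

# Batch normalization: the same standardization plus learned gamma (weight)
# and beta (bias), which can restore any mean and standard deviation.
bn = nn.BatchNorm1d(10)                              # affine=True by default
y = bn(x)

print([name for name, _ in bn.named_parameters()])   # ['weight', 'bias']
print(bn.weight.shape, bn.bias.shape)                 # both torch.Size([10])
```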

Deep learning basic thoughts

I am trying to understand the basics of deep learning, lately reading a bit through deeplearning4j. However, I can't really find an answer to the question: how does training performance scale with the amount of training data?
Apparently the cost function always depends on all the training data, since it just sums the squared error per input. Thus, I guess at each optimization step all data points have to be taken into account. I mean, deeplearning4j has the dataset iterator and the INDArray, where the data can live anywhere and thus (I think) does not limit the amount of training data. Still, doesn't that mean that the amount of training data is directly related to the computation time per step of gradient descent?
DL4J uses an iterator; Keras uses a generator. It's the same idea: your data comes in batches and is used for SGD. So the minibatch size determines the cost of a step, not the total amount of data you have.
Fundamentally speaking it doesn't (though your mileage may vary). You must find the right architecture for your problem. Adding new data records may introduce new features that are hard to capture with your current architecture, so I would always question my net's capacity: retrain your model and check whether the metrics drop.
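A small PyTorch sketch of the minibatch point (illustrative only; the question is about deeplearning4j, but the idea is the same): each gradient step only touches one batch, so the per-step cost is governed by the batch size, while the dataset size only determines how many steps make up an epoch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset; making it 10x larger would not change the cost of a single
# step below, only the number of steps per epoch.
X = torch.randn(10_000, 20)
y = torch.randn(10_000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:               # one iteration == one gradient step
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)   # computed on 64 examples, not 10,000
    loss.backward()
    opt.step()
```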

Difference between `tf.nn.batch_normalization` and `tf.nn.fused_batch_norm`?

Their documentation is short, and both refer to the same paper. Is there a difference in what those two functions implement? If not, is one of them about to be made obsolete by the other, and which of the two is recommended for use?
According to the performance guide:
The non-fused batch norm does computations using several individual Ops. Fused batch norm combines the individual operations into a single kernel, which runs faster.
EDIT: 1/6/2020
The original link no longer works. This is a web archive link provided by Rika. The updated text says:
Fused batch norm combines the multiple operations needed to do batch normalization into a single kernel. Batch norm is an expensive process that for some models makes up a large percentage of the operation time. Using fused batch norm can result in a 12%-30% speedup.
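For reference, here is a rough sketch (my own, not from the guide) of the non-fused path composed from individual TensorFlow ops; the fused variant performs the equivalent computation inside a single kernel.

```python
import tensorflow as tf

x = tf.random.normal([32, 64])   # a batch of activations
gamma = tf.ones([64])            # scale (learned in practice)
beta = tf.zeros([64])            # offset (learned in practice)

# Non-fused batch norm: several individual ops chained together.
mean, variance = tf.nn.moments(x, axes=[0])
y = tf.nn.batch_normalization(x, mean, variance,
                              offset=beta, scale=gamma,
                              variance_epsilon=1e-3)

# The fused variant (e.g. tf.compat.v1.nn.fused_batch_norm) does the same
# math in one kernel; note it generally expects 4-D NHWC input.
```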

Testing Unsupervised Machine Learning Algorithms

All over the Internet I can find applications of supervised and unsupervised machine learning algorithms, but hardly anyone talks about maintaining the quality of machine learning apps.
A recent look into how to test unsupervised machine learning algorithms brought up the following:
1) Cross-validation testing: the dataset is divided into equal folds (parts); all folds except one are used as the training set and the remaining fold is used as the test set, with a few more options around how the training and test sets are chosen.
Are there more effective ways of testing unsupervised ML algorithms where the output is uncertain?
Depending on the type of algorithm (and the chosen distance) you used, you can still check whether the between-group variance and the within-group variance are changing a lot.
If your algorithm is still as good as when you built it, the between-group and within-group variance should not change much. If the between-group variance shrinks (or the within-group variance grows), it means the groups are not as well separated by your algorithm as before.
The second thing you can try is to keep aside some observations that you know were well classified and check whether they still land in the same group once you retrain your algorithm. If they don't, it does not necessarily mean your algorithm is wrong, but you can raise an alert in that case and look deeper.
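As a hedged sketch of the first suggestion (my own illustration using NumPy and scikit-learn, with an arbitrary threshold): compute within-group and between-group variance for the clustering on a reference batch and on newer data, and alert if the separation drifts.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_between_variance(X, labels):
    """Return (within-group variance, between-group variance) for a clustering."""
    overall_mean = X.mean(axis=0)
    within = between = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        center = Xk.mean(axis=0)
        within += ((Xk - center) ** 2).sum()
        between += len(Xk) * ((center - overall_mean) ** 2).sum()
    n = len(X)
    return within / n, between / n

# Reference data used when the model was built, and newer data to monitor.
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 5))
X_new = rng.normal(size=(500, 5))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_ref)
w_ref, b_ref = within_between_variance(X_ref, model.predict(X_ref))
w_new, b_new = within_between_variance(X_new, model.predict(X_new))

# Alert if the separation (between / within) degrades noticeably.
if (b_new / w_new) < 0.8 * (b_ref / w_ref):   # 0.8 is an arbitrary threshold
    print("Groups look less well separated than before; investigate further.")
```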