How does random_state work in the LightGBM algorithm?

Why is it not a deterministic process?
Specifically, which part of the algorithm involves randomness?
I'm only familiar with random forest: it uses bootstrapped data, so I can understand where its randomness comes from.

It's the same in GBM: the random-number generator is used for feature and example (row) subsampling, which is done to reduce overfitting.
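For instance (a minimal sketch using the scikit-learn wrapper; the data here is synthetic and purely illustrative), the row- and column-subsampling parameters are the stochastic part of training, and random_state seeds them:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Row and feature subsampling are where the randomness enters;
# random_state seeds the generator so repeated fits give identical models.
model = lgb.LGBMClassifier(
    n_estimators=200,
    subsample=0.8,         # sample 80% of rows per iteration (bagging)
    subsample_freq=1,      # re-sample rows every iteration
    colsample_bytree=0.8,  # sample 80% of features per tree
    random_state=42,       # fixed seed -> reproducible fits
)
model.fit(X, y)
```

With subsample and colsample_bytree both set to 1.0, training is deterministic regardless of the seed.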


Can predictions be trusted if learning curve shows validation error lower than training error?

I'm working with neural networks (NN) as part of my thesis in geophysics, and I am using TensorFlow with Keras to train my network.
My current task is to use a NN to approximate a thermodynamical model, i.e. a nonlinear regression problem. It takes 13 input parameters and outputs a velocity profile (velocity vs. depth) of 450 parameters. My data consists of 100,000 synthetic examples (i.e. no noise is present), split into training (80k), validation (10k) and testing (10k) sets.
I've tested my network with a number of different architectures: wider ones (5-800 neurons) and deeper ones (up to 10 layers), different learning rates and batch sizes, and even training for many epochs (5000). Basically all the standard tricks of the trade...
But, I am puzzled by the fact that the learning curve shows validation error lower than training error (for all my tests), and I've never been able to overfit to the training data. See figure below:
The error on the test set is correspondingly low, thus the network seems to be able to make decent predictions. It seems like a single hidden layer of 50 neurons is sufficient. However, I'm not sure if I can trust these results due to the behavior of the learning curve. I've considered that this might be due to the validation set consisting of examples that are "easy" to predict, but I cannot see how I should change this. A bigger validation set perhaps?
To wrap it up: Is it necessarily a bad sign if the validation error is lower than or very close to the training error? What if the predictions made with said network are decent?
Is it possible that overfitting is simply not possible for my problem and data?
In addition to trying a higher k-fold and the additional testing holdout sample, perhaps mix up how you sample from the original data set: select a stratified sample when partitioning out the training and validation/test sets, then partition the validation and test sets without stratifying the sampling, as in the sketch below.
My opinion is that if you introduce more variation into your modeling methodology (without breaking any "statistical rules"), you can be more confident in the model you have created.
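A rough sketch of that splitting scheme with scikit-learn (the label y here is a toy classification target; for a regression target like the velocity profiles you would stratify on a binned version of it):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (13 inputs, 100k examples).
X, y = make_classification(n_samples=100_000, n_features=13, random_state=0)

# Stratified 80/20 split into training vs. (validation + test).
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Plain (non-stratified) 50/50 split of the holdout into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0)
```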
You can get more trustworthy results by repeating your experiments on different data. Use cross-validation with a high number of folds (e.g. k=10) to gain better confidence in your solution's performance. Neural networks usually overfit easily; if your solution gives similar results on the validation and test sets, that's a good sign.
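A minimal sketch of that k=10 cross-validation check, using scikit-learn's MLPRegressor as a stand-in for the Keras network and synthetic data shaped like the problem in the question:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor

# Toy stand-in data: 13 inputs, 450 regression outputs.
X, y = make_regression(n_samples=2000, n_features=13, n_targets=450, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=200, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")

# Similar errors across all 10 folds suggest the low validation error is not
# an artifact of one lucky train/validation split.
print(scores.mean(), scores.std())
```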
It is not easy to tell without knowing exactly how you set up the experiment:
what cross-validation method did you use?
how did you split the data?
etc
As you mentioned, observing a validation error lower than the training error can happen either because the training set contains many "hard" cases to learn or because the validation set contains many "easy" cases to predict.
However, since the training loss is generally expected to be lower than the validation loss, this specific model appears to me to have an unpredictable/unknown fit (performing better on unseen data than on the data it was trained on does indeed feel weird).
To overcome this, I would start experimenting by reconsidering the data-splitting strategy, adding more data if possible, or even changing your performance metric.

FTRL optimizer in TensorFlow does not seem to work well

I tried to train an LR (logistic regression) model on a large-scale dataset via TensorFlow with the FTRL optimizer for a CTR task. The TensorFlow/sklearn AUC and the training/evaluation AUC are OK, but performance in production is not good. I've tried lowering the level of distribution, but the problem can't be fully resolved. Any suggestions?
I found at least two reasons:
First, the underlying implementation is not exactly the same as the original paper. I don't know why it was done this way; an explanation would be welcome.
Second, the gradients used to update the weights are batch gradients, which means the parameter-server weights are updated once per batch (trivial in a modern distributed system, but not the scenario the original paper assumes); in summary, it does not use the training data record-wise. Personally, I think the second point is the key one.
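For reference, here is a hedged sketch (not the asker's production setup) of a logistic-regression CTR model trained with tf.keras.optimizers.Ftrl; note that, as described above, the optimizer applies one update per mini-batch rather than per record as in the original FTRL-Proximal paper:

```python
import numpy as np
import tensorflow as tf

# Toy CTR-style data, purely for illustration.
X = np.random.rand(10_000, 100).astype("float32")
y = (np.random.rand(10_000) < 0.1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # plain logistic regression
])
model.compile(
    optimizer=tf.keras.optimizers.Ftrl(
        learning_rate=0.05,
        l1_regularization_strength=0.001,  # L1 gives the sparse weights FTRL is known for
        l2_regularization_strength=0.001,
    ),
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc")],
)
# batch_size=1 would mimic the per-record updates of the original paper,
# at a large cost in throughput.
model.fit(X, y, batch_size=256, epochs=3)
```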

How many epochs should Word2Vec be trained? What is a recommended training dataset?

I am learning about Word2Vec using the TensorFlow tutorial. The code I am running for Word2Vec is also from the TensorFlow tutorial: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_optimized.py . When I ran the code for 15 epochs, the test accuracy was around 30%. When I ran for 100 epochs, test accuracy got up to around 39%. I am using the Text8 dataset for training and questions-words.txt for evaluation.
Do I need to run for more epochs? Should I be using a different dataset? How can I improve test accuracy?
Larger datasets are better; text8 is very, very small – sufficient for showing some of the analogy-solving power of word-vectors, but not good enough for other purposes.
More iterations may help squeeze slightly stronger vectors out of smaller datasets, but with diminishing returns. (No number of extra iterations over a weak dataset can extract the same rich interrelationships that a larger, more varied corpus can provide.)
There's a related text9 from the same source that, if I recall correctly, is 10x larger. You'll likely get better evaluation results from using it than from doing 10x more iterations on text8.
I believe the 3 million pretrained vectors Google once released – the GoogleNews set – were trained on a corpus of 100 billion words' worth of news articles, but with just 3 passes.
Note that there's no single standard for word-vector quality: the questions-words.txt analogy solving is just one convenient evaluation, but it's possible the word-vectors best at that won't be best at your own domain-specific analyses. Similarly, word-vectors trained on one domain of text, like the GoogleNews set from news articles, might underperform compared to text that better matches your domain (like perhaps forum posts, scientific articles, etc. – which all use different words in different ways).
So it's often best to use your own corpus, and your own goal-specific quantitative evaluation, to help adjust corpus/parameter choices.
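As an illustration of those two levers (corpus size and number of epochs), here is a hedged sketch using gensim rather than the TensorFlow tutorial script; it assumes gensim is installed and that downloading the text8 corpus via gensim's downloader is acceptable:

```python
import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.test.utils import datapath

# text8 as a restartable iterator of token lists (about 17M words).
corpus = api.load("text8")

# More epochs help a little on a small corpus; a bigger corpus helps far more.
model = Word2Vec(sentences=corpus, vector_size=200, window=5,
                 min_count=5, epochs=15, workers=4)

# Analogy accuracy on the copy of questions-words.txt bundled with gensim.
score, _sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"analogy accuracy: {score:.3f}")
```

The same evaluate_word_analogies call can be pointed at a domain-specific analogy file, which is closer to the goal-specific evaluation suggested above.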

Testing Unsupervised Machine Learning Algorithms

All over the Internet I can see applications of supervised and unsupervised machine learning algorithms, but no one is talking about maintaining the quality of machine learning apps.
A recent analysis of how to test unsupervised machine learning algorithms brought up these points:
1) Cross-validation testing: the dataset is divided into equal folds (parts); all folds except one are used as the training dataset and the remaining fold is later used as the test dataset. There are a few more options around how the test and training datasets are used.
Are there more effective ways of testing unsupervised ML algorithms where the output is uncertain?
Depending on the type of algorithm (and the chosen distance) you used, you can still check whether the between-group variance and the within-group variance are changing a lot.
If your algorithm is still as good as when you built it, the between-group and within-group variances should not change much. If the between-group variance shrinks (or the reverse happens), it means the groups are not as well separated by your algorithm as before.
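A minimal sketch of that variance check, assuming a k-means-style clustering (the same idea carries over to other algorithms and distances):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for whatever the model actually clusters.
X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Within-group variance: average squared distance of points to their centroid.
within = km.inertia_ / len(X)

# Between-group variance: squared distances of centroids to the overall mean,
# weighted by cluster size.
overall_mean = X.mean(axis=0)
sizes = np.bincount(km.labels_)
between = np.sum(sizes * np.sum((km.cluster_centers_ - overall_mean) ** 2, axis=1)) / len(X)

# Track this ratio over time: a shrinking value means the groups are becoming
# less well separated than when the model was built.
print("between/within ratio:", between / within)
```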
The second thing you can try is to keep some observations (that you know were well classified) and check whether they still land in the same group once you retrain your algorithm. If not, it doesn't mean your algorithm is wrong, but in that case you can raise an alert and look deeper.

Initialization of Lagrangian Multipliers in SVM using SMO (sequential minimal optimization)

I came across a paper that implemented SVM using SMO. I planned to implement SVR (support vector regression) on the basis of it, using SMO, but I'm stuck. I want to ask how the initial values of the Lagrangian multipliers are generated. Are they generated using a random function? I came across several implementations, and there was no mention of how the initial values are generated.
The initial parameters can be taken at random, and SVR will eventually converge to the optimal ones. The second-order derivative is guaranteed to be positive in SVR, but in SVM it may not always support the optimization.
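For illustration, here is a small sketch of two common starting points for the multipliers: Platt's original SMO starts from all zeros, which trivially satisfies the box constraint 0 <= alpha_i <= C and the equality constraint sum(alpha_i * y_i) = 0, while a random start inside the box, as suggested above, is also possible (the helper names here are hypothetical):

```python
import numpy as np

n, C = 100, 1.0          # number of training examples, box constraint
rng = np.random.default_rng(0)

# Standard choice: start all multipliers at zero (satisfies all constraints).
alphas_zero = np.zeros(n)

# Random start inside the box [0, C], as mentioned in the answer above.
# A full SMO/SVR implementation would still need to enforce the equality
# constraint sum(alpha_i * y_i) = 0 before or while optimizing.
alphas_rand = rng.uniform(0.0, C, size=n)
```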