Parameter tuning in TensorFlow - tensorflow

I use TF to train a model. Since the initial weights of the network are generally set randomly, the results of each training run differ even when the parameters are unchanged. How can I judge whether an improvement comes from my tuning of the parameters or just from a lucky initialization? Thank you.

Generally you can train the same model multiple times to see how big the difference is. However, there are some good practices to follow. Among others, you can:
Use cross-validation: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
Ensure your validation dataset is big enough to be statistically significant. If you only have 100 test images, getting one more correct gives you a 1% boost. With 10,000 images in this set you would have to get 100 more correct for the same 1% boost, which already feels better :)
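For example, a minimal sketch of the "train the same model multiple times" idea (the small network and the synthetic data below are placeholders, not anything from your setup) could look like this:

```python
import numpy as np
import tensorflow as tf

# Placeholder data just to make the sketch runnable; use your own dataset.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = (x_train.sum(axis=1) > 10).astype("float32")
x_val = np.random.rand(200, 20).astype("float32")
y_val = (x_val.sum(axis=1) > 10).astype("float32")

def build_model():
    # Hypothetical architecture; replace with the model you are tuning.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

val_scores = []
for run in range(5):                      # 5 independent trainings
    tf.keras.backend.clear_session()      # fresh random weights each run
    model = build_model()
    model.fit(x_train, y_train, epochs=10, verbose=0)
    _, acc = model.evaluate(x_val, y_val, verbose=0)
    val_scores.append(acc)

print("val accuracy: mean", np.mean(val_scores), "std", np.std(val_scores))
```

If the standard deviation across runs is small compared to the improvement you see after a parameter change, the change is probably real rather than luck.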
Good luck with your project!

Related

TF2 tensorflow hub retrained model - expanded my training database twofold, accuracy dropped 30%, after cleaning up data dropped 40%

I need assistance as an ML beginner. I have retrained a TF2 model using make_image_classifier through TensorFlow Hub (the command-line approach).
My very first training:
I have immediately retrained the model once again, as the first one's predictions did not satisfy me - this time using 4 classes instead - and achieved 85% val. accuracy in Epoch 4/5 and 81% val. accuracy in Epoch 5/5. Because the problem is complex, I decided to expand my database, increasing the amount of data fed to the model.
I have expanded the database more than twofold! The results are shocking and I have no clue what to do. I can't believe this.
If anything - the added data is of much better quality, relevance and diversity, and it is actually obtained by myself, a human "discriminator" of the 2nd model's performance (the 81% acc. one). If I knew how to do it - which is a second question - I would simply add this "feedback data" to my already working model. But I have no idea how to make that happen: ideally, I'd want to give feedback to the bot, letting it train once again on the additional data while understanding that this is feedback rather than a random additional set of data. Training a new model from scratch with the updated data is something I'd like to try anyway.
How do I interpret those numbers? What can be wrong with the data? Is it the data that's wrong at all, which is my assumption? Why is the training accuracy below 0.5? What could have caused the drop between Epoch 2/5 and Epoch 3/5 that started the downfall? What are common issues and similar situations when something like this happens?
I'd love to understand what happened behind the curtains at a lower level here, but I'm having difficulty and need guidance. This is as discouraging as the first and second trainings were encouraging.
Possible problems with the data that I can see - though I can't believe they could cause such a drop, especially because they were already present during the 1st and 2nd trainings, which went okay:
1-5% of the images can belong to multiple classes (2, or at most 3) - in my opinion this is even desirable for the problem, as the model imitates me, and I want it to struggle to find out what it is about certain elements that made me classify them into two or more classes simultaneously - to find the features in them that made me think they'd satisfy both.
1-3% can be duplicates.
There's a white border around the images (since it didn't seem to affect the first trainings, I didn't bother to remove it).
The drop is 40%.
I will now try to use TensorFlow's tutorial to do this through a script rather than the command line, but seeing such a huge drop is very discouraging. I was hoping to see an increase - after all, the common suggestion is to feed a model more data, and I made sure to feed it quality additional data. I appreciate every single suggestion for fine-tuning the model, as I am a beginner and have no experience with what might work and what most probably won't. Thank you.
EDIT: I have cleaned my data by:
removing borders (now they are tiny, white, and sometimes not present at all)
removing dust
removing artifacts, of which there were quite a few in the added data!
Convinced that the last point was the issue, I retrained. The results are even worse!
(Training results for the 2-class binary run and the 4-class run were attached as images.)

How do I handle variability of output in AnyLogic?

I have been working on a simulation model for battery swapping in AnyLogic. So far I have developed the simulation model, an optimization experiment and a parameter variation experiment.
There are no errors in the model, but the output values are unsatisfactory. Small changes, such as changing the step size of the decision variables, result in a drastic change in the best value obtained after every experiment. Although the objective does not change much, I am concerned about the other variables that change with each run. Even with multiple optimization runs it is difficult to come to a conclusion.
For reference I am posting the output of a parameter variation experiment here. I ran the experiment with an optimized value, but I was getting feasible results (percentile > 95%) far off the expected input values. Although the overall trend is correct (decreasing percentile with increasing charging time), it is difficult to understand the variability.
Can anyone help?
When building a model, this is a common problem you will have when looking at high level overall outputs. You could have a model bug, but it is just as likely (if not more likely) that there is some dynamic to your system that was not clear in simple Excel spreadsheets or mental models. The DES may be telling us something truly interesting about the system behavior, but without additional outputs, there is no way to understand what that is.
A few suggestions:
Run this as a simple single scenario, where you manually update inputs. When you run this with the low range of input values and then the high range of input values, what do you see on the animation or additional outputs that is different than you expected or could explain the overall output trend? Try running several intermediate points.
Add additional output metrics. If you look at queue sizes, resource utilizations, turn-around times, etc., do you see anything at that level that is different than expected?
Add a "replication" log. When you run a set of inputs for multiple scenarios, does any single replication stand out as an outlier? If so, re-run the scenario with that set of inputs and that random seed.
There is no substitute for understanding the underlying system behavior; without understanding those dynamics, looking at overall correlations from optimization or parameter variation experiments will often lead companies to make the wrong policy decisions.

Neural network hyperparameter tuning - is setting random seed a good idea? [closed]

I am trying to tune a basic neural network as practice. (Based on an example from a Coursera course: Neural Networks and Deep Learning - DeepLearning.AI)
I face the issue of random weight initialization. Let's say I try to tune the number of layers in the network.
I have two options:
1. Set the random seed to a fixed value.
2. Run my experiments multiple times without setting the seed.
Both versions have pros and cons.
My biggest concern is that if I use a fixed random seed (e.g. tf.random.set_seed(1)), then the tuned values can be "over-fitted" to that seed and may not work well without the seed, or if the seed is changed (e.g. tf.random.set_seed(1) -> tf.random.set_seed(2)). On the other hand, if I run my experiments multiple times without a fixed seed, then I can inspect fewer options (due to limited computing capacity) and still only cover a subset of the possible random weight initializations.
In both cases I feel that luck is a strong factor in the process.
Is there a best practice for how to handle this?
Does TensorFlow have built-in tools for this purpose? I would appreciate any descriptions or tutorials. Thanks in advance!
Tuning hyperparameters in deep learning (and in machine learning generally) is a common issue. Setting the random seed to a fixed number ensures reproducibility and a fair comparison: repeating the same experiment will lead to the same outcomes. As you probably know, the best practice to avoid over-fitting is to do a train-test split of your data and then use k-fold cross-validation to select optimal hyperparameters. If you test multiple values for a hyperparameter, you want to make sure all other circumstances that might influence the performance of your model (e.g. the train-test split or the weight initialization) are the same for each hyperparameter value, in order to have a fair comparison of performance. Therefore I would always recommend fixing the seed.
Now, the problem with this, as you already pointed out, is that the performance of each model will still depend on the random seed - on the particular data split or weight initialization in your case. To avoid this, one can do repeated k-fold cross-validation. That means you repeat the k-fold cross-validation multiple times, each time with a different seed, select the best parameters of that run, test on the test data, and average the final results to get a good estimate of performance and variance, thereby eliminating the influence the seed has on the validation process.
Alternatively, you can perform k-fold cross-validation a single time and train each split n times with a different random seed (eliminating the effect of weight initialization, but still keeping the effect of the train-test split).
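As a rough illustration of the repeated, re-seeded cross-validation idea (the tiny model and random data here are placeholders, not anything from your setup), something along these lines would work:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import RepeatedKFold

# Placeholder data; substitute your own features and labels.
X = np.random.rand(500, 10).astype("float32")
y = (X[:, 0] > 0.5).astype("float32")

def build_model(seed):
    tf.random.set_seed(seed)              # re-seed the weight initialization
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# 5 folds, repeated 3 times with different shuffles.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

scores = []
for i, (train_idx, val_idx) in enumerate(rkf.split(X)):
    model = build_model(seed=i)           # different seed per fold/repeat
    model.fit(X[train_idx], y[train_idx], epochs=5, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)

print("mean accuracy:", np.mean(scores), "+/-", np.std(scores))
```

The mean and standard deviation across folds and repeats give you an estimate of performance that is much less tied to any single seed.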
Finally, TensorFlow has no built-in tool for this purpose; you as the practitioner have to take care of it yourself.
There is no absolute right or wrong answer to your question. You have almost answered it yourself already. In what follows, however, I will try to expand a bit more, via the following points:
The purpose of random initialization is to break the symmetry that makes neural networks fail to learn:
... the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units. If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way...
Deep Learning (Adaptive Computation and Machine Learning series)
Hence, we need the neural network components (especially the weights) to be initialized with different values. There are some rules of thumb for how to choose those values, such as Xavier initialization, which samples from a normal distribution with mean 0 and a variance scaled according to the number of units in the layer. This is a very interesting article to read.
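As a small illustration (the layer size and seed are arbitrary), Keras exposes Glorot/Xavier initializers directly, so you can specify them explicitly on a layer:

```python
import tensorflow as tf

# Glorot/Xavier initialization scales the variance by the number of
# input and output units of the layer.
init = tf.keras.initializers.GlorotNormal(seed=42)   # or GlorotUniform

layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_initializer=init,      # explicit Xavier initialization
    bias_initializer="zeros",
)
```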
Having said that, the initial values are important but not extremely critical "if" proper rules are followed, as mentioned in point 2. They are important because large or improper values may lead to vanishing or exploding gradient problems. On the other hand, different "proper" weights should not hugely change the final results, unless they cause the aforementioned problems or get the neural network stuck in some local minimum. Please note, however, that the latter also depends on many other aspects, such as the learning rate, the activation functions used (some explode/vanish more than others: this is a great comparison), the architecture of the neural network (e.g. fully connected, convolutional, etc.: this is a cool paper) and the optimizer.
In addition to point 2, using a good optimizer, rather than the plain stochastic one, should in theory prevent the initial values from having a huge, noticeable influence on the quality of the final results. A good example is Adam, which provides a very adaptive learning technique.
If you still get noticeably different results with different "proper" initial weights, there are some approaches that "might help" make the neural network more stable, for example: use a train-test split, use GridSearchCV to find the best parameters, use k-fold cross-validation, etc.
In the end, obviously the best scenario is to train the same network with different random initial weights many times, then take the average results and the variance, for a more reliable judgement of the overall performance. How many times? Well, if you can do it hundreds of times, all the better, yet that is clearly almost impractical (unless you have Google-ish hardware capability and capacity). As a result, we come to the same conclusion you reached in your question: there is a tradeoff between time and space complexity and the reliability of using a seed, taking into consideration some of the rules of thumb mentioned in the previous points. Personally, I am okay with using a seed, because I believe that "It's not who has the best algorithm that wins. It's who has the most data." (Banko and Brill, 2001). Hence, using a seed with enough data samples (how much is "enough" is subjective, but the more the better) should not cause any concerns.

Deep neural network diverges after convergence

I implemented the A3C network in https://arxiv.org/abs/1602.01783 in TensorFlow.
At this point I'm 90% sure the algorithm is implemented correctly. However, the network diverges after convergence. See the attached image that I got from a toy example where the maximum episode reward is 7.
When it diverges, the policy network starts giving a single action very high probability (>0.9) for most states.
What should I check for this kind of problem? Is there any reference for it?
Note that in Figure 1 of the original paper the authors say:
For asynchronous methods we average over the best 5 models from 50 experiments.
That can mean that in a lot of cases the algorithm does not work that well. From my experience, A3C often diverges, even after convergence. Careful learning-rate scheduling can help. Or do what the authors did - train several agents with different seeds and pick the one performing best on your validation data. You could also employ early stopping when the validation error starts to increase.
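The A3C loop in the question is custom, so the following is only an illustrative sketch of what early stopping and learning-rate scheduling look like with standard Keras callbacks, not a drop-in fix for the asynchronous setup:

```python
import tensorflow as tf

callbacks = [
    # Stop training when the validation loss has not improved for 5 epochs,
    # restoring the best weights seen so far.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True),
    # Halve the learning rate when the validation loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=3),
]

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=callbacks)
```

In a custom training loop you would implement the same logic by hand: track the validation score each iteration, reduce the learning rate when it plateaus, and keep the best-scoring weights.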

Encoding invariance for deep neural network

I have a set of data: 2D matrices (like grey-scale pictures), and I use a CNN as the classifier.
I would like to know if there is any study/experience on the accuracy impact of changing the encoding from the traditional one.
I suppose there is an impact; the question is rather which transformations of the encoding leave the accuracy invariant, and which ones deteriorate it.
To clarify, this concerns mainly the quantization process of the raw data into input data.
EDIT:
Quantizing the raw data into input data is already a pre-processing step, adding or removing some features (even minor ones). The impact of this quantization process on accuracy in real DNN computation does not seem very clear to me.
Maybe some research is available.
I'm not aware of any research specifically dealing with quantization of input data, but you may want to check out some related work on quantization of CNN parameters: http://arxiv.org/pdf/1512.06473v2.pdf. Depending on what your end goal is, the "Q-CNN" approach may be useful for you.
My own experience with using various quantizations of the input data for CNNs has been that there's a heavy dependency between the degree of quantization and the model itself. For example, I've played around with using various interpolation methods to reduce image sizes and reducing the color palette size, and in the end, I discovered that each variant required a different tuning of hyper-parameters to achieve optimal results. Generally, I found that minor quantization of data had a negligible impact, but there was a knee in the curve where throwing away additional information dramatically impacted the achievable accuracy. Unfortunately, I'm not aware of any way to determine what degree of quantization will be optimal without experimentation, and even deciding what's optimal involves a trade-off between efficiency and accuracy which doesn't necessarily have a one-size-fits-all answer.
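To make concrete what I mean by quantizing the input (the number of levels here is arbitrary, and this is only a sketch with placeholder data), reducing an 8-bit image to a coarser palette can be as simple as:

```python
import numpy as np

def quantize(images, levels=16):
    """Map 8-bit pixel values onto `levels` evenly spaced intensity levels."""
    step = 256 / levels
    return (np.floor(images / step) * step + step / 2).astype(np.uint8)

# Placeholder batch of 8-bit grey-scale images.
images = np.random.randint(0, 256, size=(10, 32, 32), dtype=np.uint8)
coarse = quantize(images, levels=16)   # 16 grey levels instead of 256
```

Sweeping `levels` and retraining (with re-tuned hyper-parameters at each setting) is how I located the "knee" mentioned above.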
On a theoretical note, keep in mind that CNNs need to be able to find useful, spatially-local features, so it's probably reasonable to assume that any encoding that disrupts the basic "structure" of the input would have a significantly detrimental effect on the accuracy achievable.
In usual practice -- a discrete classification task in classic implementation -- it will have no effect. However, the critical point is in the initial computations for back-propagation. The classic definition depends only on strict equality of the predicted and "base truth" classes: a simple right/wrong evaluation. Changing the class coding has no effect on whether or not a prediction is equal to the training class.
However, this function can be altered. If you change the code to use something other than a right/wrong scoring - something that depends on the encoding choice - then encoding changes can most definitely have an effect. For instance, if you're rating movies on a 1-5 scale, you likely want a 1-vs-5 error to contribute a higher loss than a 4-vs-5 error.
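For example, here is a hedged sketch of such an encoding-sensitive loss (it assumes one-hot targets over the five ratings and a softmax output; none of these details are from the question):

```python
import tensorflow as tf

def ordinal_mse_loss(y_true, y_pred):
    """Assumes y_true is one-hot over ratings 1-5 and y_pred is a softmax.
    Predicting 1 when the truth is 5 costs more than predicting 4."""
    y_true = tf.cast(y_true, tf.float32)
    ratings = tf.constant([1., 2., 3., 4., 5.])
    predicted_rating = tf.reduce_sum(y_pred * ratings, axis=-1)
    true_rating = tf.reduce_sum(y_true * ratings, axis=-1)
    return tf.reduce_mean(tf.square(predicted_rating - true_rating))

# model.compile(optimizer="adam", loss=ordinal_mse_loss)
```

With a loss like this, relabeling or reordering the classes changes the numeric distances and therefore the training signal, which is exactly the dependence on encoding described above.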
Does this reasonably deal with your concerns?
I see now. My answer above is useful ... but not for what you're asking. I had my eye on the classification encoding; you're wondering about the input.
Please note that asking for off-site resources is a classic off-topic question category. I am unaware of any such research -- for what little that is worth.
Obviously, there should be some effect, as you're altering the input data. The effect would be dependent on the particular quantization transformation, as well as the individual application.
I do have some limited-scope observations from general big-data analytics.
In our typical environment, where the data were scattered with some inherent organization within their natural space (F dimensions, where F is the number of features), we often use two simple quantization steps: (1) Scale all feature values to a convenient integer range, such as 0-100; (2) Identify natural micro-clusters, and represent all clustered values (typically no more than 1% of the input) by the cluster's centroid.
This speeds up analytic processing somewhat. Given the fine-grained clustering, it has little effect on the classification output. In fact, it sometimes improves the accuracy minutely, as the clustering provides wider gaps among the data points.
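For what it's worth, a simplified sketch of those two steps (the data, the cluster count, and snapping every point to its centroid are illustrative simplifications of what we actually do):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(10000, 8)           # placeholder raw feature matrix

# Step 1: scale every feature to an integer 0-100 range.
mins, maxs = X.min(axis=0), X.max(axis=0)
X_scaled = np.round(100 * (X - mins) / (maxs - mins)).astype(int)

# Step 2 (simplified): snap each point to the centroid of its micro-cluster.
km = MiniBatchKMeans(n_clusters=500, random_state=0).fit(X_scaled)
X_quantized = km.cluster_centers_[km.labels_]
```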
Take with a grain of salt, as this is not the main thrust of our efforts.