How do I handle variability of output in AnyLogic? - optimization

I have been working on a simulation model for battery swapping in AnyLogic. So far I have developed the simulation model, an optimization experiment, and a parameter variation experiment.
There are no errors in the model, but the output values are unsatisfactory. Small changes, such as changing the step size of the decision variables, result in a drastic change in the best value obtained after every experiment. Though the objective does not change much, I am concerned about the other variables that change with each run. Even with multiple optimization runs it is difficult to come to a conclusion.
For reference, I am posting the output of a parameter variation experiment here. I ran the experiment with an optimized value, but I was getting feasible results (percentile > 95%) far from the expected input values. Although the overall trend is correct (decreasing percentile with increasing charging time), it is difficult to understand the variability.
Can anyone help?

When building a model, this is a common problem you will have when looking at high level overall outputs. You could have a model bug, but it is just as likely (if not more likely) that there is some dynamic to your system that was not clear in simple Excel spreadsheets or mental models. The DES may be telling us something truly interesting about the system behavior, but without additional outputs, there is no way to understand what that is.
A few suggestions:
Run this as a simple single scenario, where you manually update inputs. When you run this with the low range of input values and then the high range of input values, what do you see on the animation or additional outputs that is different than you expected or could explain the overall output trend? Try running several intermediate points.
Add additional output metrics. If you look at queue sizes, resource utilizations, turnaround times, etc., do you see anything at that level that is different than expected?
Add a "replication" log. When you run a set of inputs for multiple scenarios, does any single replication stand out as an outlier? If so, re-run the scenario with that set of inputs and that random seed.
There is no substitute for understanding underlying system behavior, and without understanding those dynamics, looking at overall correlation with optimization or parameter variation experiments will often lead companies to make the wrong policy decisions.

Related

What exactly does A/B test mean in this scenario?

When multiple developers are continually applying changes to databases
within an organization, there may be no structured change log. Then
it's difficult to find out what change caused the database performance
to slow down. In a call center, where results must be instantaneous
for customers waiting on the phone, an adverse change can be
detrimental to the business. The DEA tests a change that might have an
impact, and compares it to the unchanged database. You apply the A/B
testing method to analyze one change at a time, helping you to clearly
know its effect. The change might mean an improvement to performance,
degradation, or maybe no performance difference, which is equally
acceptable.
I am looking for an example to understand the following statement from the above quote: The DEA tests a change that might have an impact, and compares it to the unchanged database. You apply the A/B testing method to analyze one change at a time, helping you to clearly know its effect.
For example, suppose a setting currently has the value 0 and I want to set it to 1. How exactly will the above A/B testing help me?
You have two environments, and the only difference between them is the setting you specified. You then split the traffic to the DB in two (50% vs. 50%, for example): part of it goes to the first environment and part goes to the other. You monitor the metrics (performance or something else), and if the difference is statistically significant, you can have high confidence that the setting is the change that moves the metric.
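To make "statistically significant" concrete, here is a small sketch comparing query latencies from the two environments with a Welch t-test. The latency numbers are synthetic, purely for illustration:

```python
# Sketch: comparing latencies from the unchanged database (A, setting = 0)
# and the changed one (B, setting = 1). Samples are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
latency_a = rng.normal(loc=120.0, scale=15.0, size=500)  # ms
latency_b = rng.normal(loc=112.0, scale=15.0, size=500)  # ms

t_stat, p_value = stats.ttest_ind(latency_a, latency_b, equal_var=False)
print(f"mean A = {latency_a.mean():.1f} ms, mean B = {latency_b.mean():.1f} ms")
if p_value < 0.05:
    print(f"difference is statistically significant (p = {p_value:.3g})")
else:
    print(f"no significant difference detected (p = {p_value:.3g})")
```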

Neural network hyperparameter tuning - is setting random seed a good idea? [closed]

I am trying to tune a basic neural network as practice. (Based on an example from a Coursera course: Neural Networks and Deep Learning - DeepLearning.AI.)
I face the issue of random weight initialization. Let's say I try to tune the number of layers in the network.
I have two options:
1.: set the random seed to a fixed value
2.: run my experiments more times without setting the seed
Both versions have pros and cons.
My biggest concern is that if I use a random seed (e.g. tf.random.set_seed(1)), then the determined values can be "over-fitted" to the seed and may not work well without the seed or if the value is changed (e.g. tf.random.set_seed(1) -> tf.random.set_seed(2)). On the other hand, if I run my experiments more times without a random seed, then I can inspect fewer options (due to limited computing capacity) and still only inspect a subset of the possible random weight initializations.
In both cases I feel that luck is a strong factor in the process.
Is there a best practice for handling this?
Does TensorFlow have built-in tools for this purpose? I appreciate any source of descriptions or tutorials. Thanks in advance!
Tuning hyperparameters in deep learning (and in machine learning generally) is a common issue. Setting the random seed to a fixed number ensures reproducibility and fair comparison; repeating the same experiment will lead to the same outcomes. As you probably know, best practice to avoid over-fitting is to do a train-test split of your data and then use k-fold cross-validation to select optimal hyperparameters. If you test multiple values for a hyperparameter, you want to make sure the other circumstances that might influence the performance of your model (e.g. the train-test split or the weight initialization) are the same for each value in order to have a fair comparison of performance. Therefore I would always recommend fixing the seed.
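A minimal sketch of what "fixing the seed" typically involves in a TensorFlow workflow; which calls matter depends on which libraries your pipeline actually uses:

```python
# Sketch: fix the RNGs that commonly influence a TensorFlow experiment.
import random

import numpy as np
import tensorflow as tf

SEED = 1
random.seed(SEED)         # Python's built-in RNG (e.g. ad-hoc shuffles)
np.random.seed(SEED)      # NumPy (data splits, augmentation, etc.)
tf.random.set_seed(SEED)  # TensorFlow weight initialization, dropout, ...
```

Note that even with these seeds fixed, some GPU operations are not fully deterministic unless determinism is explicitly enabled in your TensorFlow version.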
Now, the problem with this is, as you already pointed out, that the performance of each model will still depend on the random seed, through the particular data split or weight initialization in your case. To avoid this, one can do repeated k-fold cross-validation: you repeat the k-fold cross-validation multiple times, each time with a different seed, select the best parameters of that run, test on the test data, and average the final results to get a good estimate of performance plus variance, thereby eliminating the influence the seed has on the validation process.
Alternatively, you can perform k-fold cross-validation a single time and train each split n times with a different random seed (eliminating the effect of weight initialization, but still keeping the effect of the train-test split).
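As a hedged sketch of the repeated cross-validation idea, here is the scikit-learn version with a simple estimator standing in for the network (a Keras model could be wrapped to expose the same fit/predict API):

```python
# Sketch: repeated k-fold cross-validation. The model and data are
# placeholders; swap in your own estimator and dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 5 folds, repeated 10 times, each repetition with a different shuffle.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} fits")
```

Because each repetition uses a different shuffle, the reported spread already reflects how sensitive the result is to the split.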
Finally, TensorFlow has no built-in tool for this purpose; you as the practitioner have to take care of it.
There is no absolute right or wrong answer to your question; you have almost answered it yourself already. In what follows, however, I will try to expand on it via the following points:
The purpose of random initialization is to break the symmetry that makes neural networks fail to learn:
... the only property known with complete certainty is that the
initial parameters need to “break symmetry” between different units.
If two hidden units with the same activation function are connected to
the same inputs, then these units must have different initial
parameters. If they have the same initial parameters, then a
deterministic learning algorithm applied to a deterministic cost and
model will constantly update both of these units in the same way...
Deep Learning (Adaptive Computation and Machine Learning series)
Hence, we need the neural network components (especially the weights) to be initialized with different values. There are some rules of thumb for how to choose those values, such as Xavier initialization, which samples from a normal distribution with mean 0 and a variance chosen from the layer's fan-in and fan-out. This is a very interesting article to read.
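For reference, a small sketch of how Xavier (Glorot) initialization looks in Keras; note that Glorot uniform is already the default kernel initializer for Dense layers, so this mainly makes the choice (and its seed) explicit:

```python
# Sketch: explicit Xavier (Glorot) initialization in Keras.
import tensorflow as tf

init = tf.keras.initializers.GlorotNormal(seed=1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu", kernel_initializer=init),
    tf.keras.layers.Dense(1, activation="sigmoid", kernel_initializer=init),
])
```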
Having said so, the initial values are important but not extremely critical if proper rules are followed, as mentioned in point 2. They are important because large or improper values may lead to vanishing or exploding gradient problems. On the other hand, different "proper" weights should not hugely change the final results, unless they cause the aforementioned problems or get the neural network stuck at some local optimum. Please note, however, that the latter also depends on many other aspects, such as the learning rate, the activation functions used (some explode/vanish more than others: this is a great comparison), the architecture of the neural network (e.g. fully connected, convolutional, etc.: this is a cool paper) and the optimizer.
In addition to point 2, using a good optimizer other than the standard stochastic one should, in theory, keep the initial values from noticeably affecting the quality of the final results. A good example is Adam, which provides a very adaptive learning technique.
If you still get noticeably different results with different "proper" initialized weights, there are some approaches that might help make the neural network more stable, for example: use a train-test split, use GridSearchCV to pick the best parameters, and use k-fold cross-validation, etc.
In the end, obviously the best scenario is to train the same network with many different random initial weights, then take the average results and the variance for a more specific judgement of the overall performance. How many times? Well, if you can do it hundreds of times, that is better, yet it is clearly almost impractical (unless you have Google-ish hardware capability and capacity). As a result, we come to the same conclusion that you had in your question: there is a trade-off between time and space complexity and the reliability of using a seed, taking into consideration some of the rules of thumb mentioned in the previous points. Personally, I am okay with using a seed because I believe that "It's not who has the best algorithm that wins. It's who has the most data." (Banko and Brill, 2001). Hence, using a seed with enough data samples (enough is subjective, but the more the better) should not cause much concern.
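A sketch of the "train several times with different seeds and average" idea, on a toy dataset; five repeats are shown purely to keep the run short, and the architecture is a stand-in for your own network:

```python
# Sketch: estimate mean and spread of test accuracy across init seeds.
import numpy as np
import tensorflow as tf
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    # Adam adapts the learning rate per weight, as discussed above.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

accuracies = []
for seed in range(5):                 # more repeats give a better estimate
    tf.random.set_seed(seed)          # only the initialization seed changes
    model = build_model()
    model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=0)
    _, acc = model.evaluate(X_test, y_test, verbose=0)
    accuracies.append(acc)

print(f"accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```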

Using multidimensional test data to optimize an output

I am looking for a machine learning strategy that will allow me to load in test data, along with a test outcome, and optimize for a particular scenario to adjust future testing parameters.
(See example in the edit)
Original example: For example, consider that I have a 3-dimensional space (an environmental chamber) that I place a physical test device into. I will then select a number of locations and physical attributes with which to test the device. First I will test the device at every location configuration, across multiple temperatures, humidities, and pressures. At each test increment, or combination of variables, I log the value of each feature, e.g. the x, y, z positional data as well as the temperature, humidity, pressure, etc. After setting these parameters I will initiate a software test on the physical device that is affected by the environmental factors in ways too complex for me to predict. This software test can produce three outcomes that vary with an unknown (until tested) probability based on the logged physical parameters. Of the three outcomes, one is a failure, one is a success, and one is that the test finishes without any meaningful output (we can ignore this case in processing).
Once the test finishes testing every physical parameter I provide, I would like to run an algorithm on this test data to select the controllable parameters (e.g. x, y, z positions, or temperature) that maximize my chance of a successful test while also minimizing my chance of a failure (in some cases we find parameters that have a high chance of failure and a high chance of success; failures are more time-expensive, thus we need to avoid them). The algorithm should use these to suggest an alternative set of test parameter ranges to initiate the next test iteration that it believes will get us closer to our goal.
Currently, I can maximize the success case by using an iterative decision tree and ignoring the results that end in failure.
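For concreteness, here is a rough sketch of the kind of scoring I am after. The file names, column names, and failure penalty are hypothetical, and the decision tree simply stands in for whatever classifier fits the data:

```python
# Sketch: fit a classifier on logged test results, then score candidate
# parameter settings by P(success) minus a penalty times P(failure).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

results = pd.read_csv("test_log.csv")   # x, y, z, temp, humidity, pressure, outcome
features = ["x", "y", "z", "temp", "humidity", "pressure"]
data = results[results["outcome"] != "no_output"]   # drop the uninformative case

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(data[features], data["outcome"])            # outcome in {"success", "failure"}

candidates = pd.read_csv("candidate_grid.csv")      # next round's parameter combinations
proba = clf.predict_proba(candidates[features])
p = dict(zip(clf.classes_, proba.T))

failure_penalty = 3.0   # failures cost more time than successes are worth
candidates["score"] = p["success"] - failure_penalty * p["failure"]
print(candidates.sort_values("score", ascending=False).head(10))
```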
Any ideas are appreciated
Edit:
Another example (this is contrived, let's not get into the details of PRNGs): let's say that I have an embedded device with a hardware pseudo-random number generator (PRNG) that is affected by environmental factors, such as heat and magnetometer data. I have a program on the device that uses this PRNG to give me a random value. Suppose that this PRNG just barely achieves sufficient randomization in the average case, in the best case gives me a good random value, and in the worst case fails to provide a random value. By changing the physical parameters of the environment around the device I can find values which, with some probability, cause this PRNG to fail to give me a random number, give me an 'ok' random number, or succeed in generating a cryptographically secure random number. Let's suppose that in cases where it fails to generate a good enough random number, the program enters a long-running loop trying to find one before ultimately failing, which we would like to avoid as it is computationally expensive. To test this, we first start off by testing every combination of the variables we can control (temperature, position, etc.), perhaps by jumping several degrees at a time, to give a rough picture of what the device's response looks like over a large range. I would then like to run an algorithm on this test data, narrow my testing windows, and iterate over the newly selected feature parameters to arrive at an optimized solution. In this case, since failures are expensive, the optimized solution would be something that minimizes failures while simultaneously maximizing successes.

Optimizing branch predictions: how to generalize code that could run with different compiler, interpreter, and hardware prediction?

I ran into some slowdowns on a tight loop today caused by an if statement, which surprised me somewhat because I expected branch prediction to successfully pipeline the particular statement and minimize the cost of the conditional.
When I sat down to think more about why it wasn't better handled, I realized I didn't know much about how branch prediction was being handled at all. I know the concept of branch prediction quite well and its benefits, but the problem is that I didn't know who was implementing it and what approach they were using to predict the outcome of a conditional.
Looking deeper, I know branch prediction can be done at a few levels:
1. The hardware itself, with instruction pipelining
2. A C++-style compiler
3. The interpreter of an interpreted language
A half-compiled language like Java may do both 2 and 3 above.
However, because optimization can be done in many areas, I'm left uncertain how to anticipate branch prediction. If I'm writing in Java, for example, is my conditional optimized when compiled, when interpreted, or by the hardware after interpretation? More interestingly, what does this mean if someone uses a different runtime environment? Could a different branch prediction algorithm used in a different interpreter result in a tight loop based around a conditional showing significantly different performance depending on which interpreter it's run with?
Thus my question: how does one generalize an optimization around branch prediction if the software could be run on very different computers, which may mean different branch prediction? If the hardware and interpreter can change their approach, then profiling and using whichever approach proved fastest isn't a guarantee. Let's ignore C++, where you have compile-level ability to force this, and look at interpreted languages where someone still needs to optimize a tight loop.
Are there certain presumptions that are generally safe to make regardless of interpreter used? Does one have to dive into the intricate specification of a language to make any meaningful presumption about branch prediction?
Short answer:
To help improve the performance of the branch predictor try to structure your program so that conditional statements don't depend on apparently random data.
Details
One of the other answers to this question claims:
There is no way to do anything at the high level language to optimize for branch prediction, caching sure, sometimes you can, but branch prediction, no not at all.
However, this is simply not true. A good illustration of this fact comes from one of the most famous questions on Stack Overflow.
All branch predictors work by identifying patterns of repeated code execution and using this information to predict the outcome and/or target of branches as necessary.
When writing code in a high-level language it's typically not necessary for an application programmer to worry about trying to optimize conditional branches. For instance, gcc has the __builtin_expect function, which allows the programmer to specify the expected outcome of a conditional branch. But even if an application programmer is certain they know the typical outcome of a specific branch, it's usually not necessary to use the annotation. In a hot loop using this directive is unlikely to help improve performance. If the branch really is strongly biased, then the predictor will be able to correctly predict the outcome most of the time even without the programmer annotation.
On most modern processors branch predictors perform incredibly well (better than 95% accurate even on complex workloads). So as a micro-optimization, trying to improve branch prediction accuracy is probably not something that an application programmer would want to focus on. Typically the compiler is going to do a better job of generating optimal code that works for the specific hardware platform it is targeting.
But branch predictors rely on identifying patterns, and if an application is written in such a way that patterns don't exist, then the branch predictor will perform poorly. If the application can be modified so that there is a pattern then the branch predictor has a chance to do better. And that is something you might be able to consider at the level of a high-level language, if you find a situation where a branch really is being poorly predicted.
Branch prediction, like caching and pipelining, is done to make code run faster in general, overcoming bottlenecks in the system (super slow cheap DRAM, which all DRAM is; all the layers of buses between X and Y; etc.).
There is no way to do anything at the high level language to optimize for branch prediction, caching sure, sometimes you can, but branch prediction, no not at all. In order to predict, the core has to have the branch in the pipe along with the instructions that precede it, and across architectures and implementations it is not possible to find one rule that works. Often it is not possible even within one architecture and implementation from the high-level language.
You could also easily end up in a situation where, in tuning for branch prediction, you de-tune for the cache, the pipeline, or other optimizations you might want to use instead. And overall performance is first and foremost application specific; after that comes something tuned to that application, not something generic.
For as much as I like to preach and do optimizations at the high-level-language level, branch prediction is one that falls into the premature-optimization category. Just enable it in the core if it is not already enabled; sometimes it saves you a couple of cycles, most of the time it doesn't, and depending on the implementation, it can cost more cycles than it saves. Like a cache, it comes down to hits vs. misses: if it guesses right, you have code in faster RAM sooner on its way to the pipe; if it guesses wrong, you have burned bus cycles that could have been used by code that was going to be run.
Caching is usually a benefit (although it is not hard to write high-level code that shows it costing performance instead of saving it), as code usually runs linearly for some number of instructions before branching, and likewise data is accessed in order often enough to overcome the penalties. Branching is not something we do on every instruction, and where we branch to does not have a common answer.
Your backend could try to tune for branch prediction by having the pre-branch decisions happen a few cycles before the branch, but all within a pipe size and tuned for fetch-line or cache-line alignments. Again, this messes with tuning for other features in the core.

Control of parallelization

I am running a custom processor on a rowset that does not seem to run in parallel. The underlying ~1GB text file is first read into a table that is partitioned via round robin. The 'Extract' runs on 200 vertices, but then (under the 'Aggregate' node) the processing [which does various complex computations] happens on only 2 vertices, even though the parallelism parameter is much higher than that. Is there a special hint that needs to be used to direct the compiler to use more vertices? Is there a function or property that needs to be overridden to set the parallelism at this phase as well?
Sorry for the late reply. But it is vacation time :).
It is good to see that the extract phase is fully scaled out.
Without seeing the script or the generated plan it is a bit difficult to say why you only see 2 vertices in some places. There are a couple of reasons why that may be the case:
you don't have enough data to scale out to more.
your aggregation needs more data and thus the plan has less parallelism.
your operation is intrinsically less parallel.
The optimizer's data cardinality estimation is off and it chooses too little parallelism. We have some ability to hint, but I would rather see the job first.
Note that custom processors often block the optimizer from pushing optimizations through in the script (using the READ ONLY option for example helps) and can throw off the cardinality estimations.
If you send me the script, the job graph and the link to the job to mrys at Microsoft, I and the team will look into it next week after the holidays are over.