Google Cloud ML: How can I enforce a pure grid-search for a hyperparameter tuning job - tensorflow

Google Cloud ML uses Bayesian optimisation to mitigate the curse of dimensionality. In certain situations, however, I want a hyperparameter-tuning job to perform an exhaustive search over a grid of hyperparameters. How can I do this?
My motivation for enforcing a pure grid-search is this: I have observed that a hyperparameter-tuning job over hyperparameters which are exclusively of DISCRETE type evaluates the same combination of hyperparameters more than once, which I do not want. I suspect this has to do with the use of Bayesian optimisation, which is why I would like to enforce a pure grid-search for those cases.

There is not currently an argument available to enforce a grid search.
The best workaround currently is probably to submit multiple jobs, with the specific hyperparameters set for each one. This can be done without changing the code, as you can specify the values as user command line arguments. You should be able to submit all the jobs in a loop, and Google Cloud ML will queue them if there are too many to run at once. The downside is that you'll have to figure out which is the best.
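One way to script that workaround (a rough sketch; the grid, job names, and gcloud flags below are illustrative and assume the usual trainer packaging, so adapt them to your own setup) is to enumerate the grid yourself and submit one job per combination:

```python
# Sketch: enumerate the full grid and submit one training job per combination,
# forwarding the hyperparameter values as user command line arguments.
# Flags and names here are illustrative, not a drop-in script.
import itertools
import subprocess

grid = {
    "--learning-rate": [0.001, 0.01, 0.1],
    "--batch-size": [32, 64, 128],
}

keys = list(grid.keys())
for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
    user_args = [str(part) for pair in zip(keys, values) for part in pair]
    subprocess.check_call(
        ["gcloud", "ml-engine", "jobs", "submit", "training",
         "grid_search_{}".format(i),
         "--module-name", "trainer.task",
         "--package-path", "trainer/",
         "--region", "us-central1",
         "--"]          # everything after "--" is forwarded to trainer.task
        + user_args)
```

Each job then trains exactly one grid point, and comparing the resulting metrics (the "figure out which is the best" step) is left to you.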

Related

OpenMDAO vs. modeFrontier: comparisons of optimization capabilities and application scaling

I realize that this might not be the best platform to ask this, but I think this would be the most unbiased one to put my question to.
How would you compare OpenMDAO vs. modeFrontier with regard to their optimization capabilities, application scaling, and overall software development? Which one would you pick, and why?
If you know of any resources or links, please do share them.
The most fundamental technical difference is OpenMDAO can pass data + derivative information between components. This means that if you want to use gradient based optimization and have access to at least some tools that provide derivative information, OpenMDAO will have far more effective overall capabilities. This is especially important when doing optimization with high-cost analysis tools (e.g. partial differential equation solvers --- CFD, FEA). In those situations making use of derivatives offers between a 100x and 10000x speedup.
One other difference is that OpenMDAO is designed to run natively on a distributed memory compute cluster. Industrial frameworks can submit jobs to remote clusters and query for the results, but OpenMDAO itself can run on the cluster and has a direct and internal MPI-based distributed memory capability. This is critical to it being able to efficiently handle derivatives of those expensive PDE solvers. To the best of my knowledge, OpenMDAO is unique in this regard. This is a low-level technical detail that most users never need to directly understand, but the consequence is that if you want to do any kind of high-fidelity coupled optimizations (aero-structural, aero-propulsive, aero-thermal) with more than one PDE solver in the loop, then OpenMDAO's architecture is going to be by far the most effective.
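As a small, self-contained illustration of what "passing data plus derivative information between components" looks like, here is a minimal OpenMDAO sketch (the component and numbers are purely illustrative and not tied to any particular application):

```python
# Minimal OpenMDAO sketch: an ExplicitComponent that supplies analytic
# partial derivatives, driven by a gradient-based optimizer.
import openmdao.api as om


class Paraboloid(om.ExplicitComponent):
    """f(x, y) = (x - 3)^2 + x*y + (y + 4)^2 - 3."""

    def setup(self):
        self.add_input("x", val=0.0)
        self.add_input("y", val=0.0)
        self.add_output("f", val=0.0)
        self.declare_partials("f", ["x", "y"])

    def compute(self, inputs, outputs):
        x, y = inputs["x"], inputs["y"]
        outputs["f"] = (x - 3.0) ** 2 + x * y + (y + 4.0) ** 2 - 3.0

    def compute_partials(self, inputs, partials):
        x, y = inputs["x"], inputs["y"]
        partials["f", "x"] = 2.0 * (x - 3.0) + y
        partials["f", "y"] = x + 2.0 * (y + 4.0)


prob = om.Problem()
ivc = om.IndepVarComp()
ivc.add_output("x", 3.0)
ivc.add_output("y", -4.0)
prob.model.add_subsystem("ivc", ivc, promotes=["*"])
prob.model.add_subsystem("parab", Paraboloid(), promotes=["*"])

prob.driver = om.ScipyOptimizeDriver()
prob.driver.options["optimizer"] = "SLSQP"
prob.model.add_design_var("x", lower=-50.0, upper=50.0)
prob.model.add_design_var("y", lower=-50.0, upper=50.0)
prob.model.add_objective("f")

prob.setup()
prob.run_driver()
```

Because the component supplies compute_partials, the gradient-based driver gets exact derivatives instead of having to finite-difference the whole model, which is what makes the speedups mentioned above possible.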
However, OpenMDAO does not offer a GUI. It does not have the same level of data tracking and visualization tools. Also, I know that modeFrontier offers the ability to split a single model up across multiple computers distributed across an organization. modeFrontier, along with other tools like ModelCenter and Isight, offers this kind of smooth user experience and code-free interaction that many find valuable.
Honestly, I'm not sure a direct comparison is really warranted. I think if you have an organization that invests in a commercial integration tool like modeFrontier, then you can still use OpenMDAO to create tightly coupled integrated optimizations, which you can then include as boxes inside your overall integration framework.
You certainly can use OpenMDAO as a complete integration framework, and it has some advantages in that area related to derivatives and execution in distributed memory environments. But you don't have to, and it certainly does not have to be an exclusive decision.

Is there a way to partition a tf.Dataset with TensorFlow’s Dataset API?

I checked the doc but I could not find a method for it. I want to do cross validation, so I kind of need it.
Note that I'm not asking how to split a tensor, as I know that TensorFlow provides an API for that and it has been answered in another question. I'm asking how to partition a tf.Dataset (which is an abstraction).
You could either:
1) Use the shard transformation to partition the dataset into multiple "shards". Note that for best performance, sharding should be applied to the data sources (e.g. filenames).
2) As of TensorFlow 1.12, you can also use the window transformation to build a dataset of datasets (see the sketch below).
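A rough sketch of both, assuming TF 2.x eager execution (the dataset contents are made up):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Option 1: shard() keeps every num_shards-th element starting at `index`,
# giving num_shards disjoint partitions of the original dataset.
fold_0 = dataset.shard(num_shards=5, index=0)   # 0, 5
fold_1 = dataset.shard(num_shards=5, index=1)   # 1, 6

# Option 2 (TF >= 1.12): window() yields a dataset of nested datasets,
# each holding `size` consecutive elements.
for window in dataset.window(size=2):
    print(list(window.as_numpy_iterator()))     # [0, 1], [2, 3], ...
```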
I am afraid you cannot. The dataset API is a way to efficiently stream inputs to your net at run time. It is not a set of tools to manipulate datasets as a whole -- in that regard it might be a bit of a misnomer.
Also, even if you could, it would probably be a bad idea. You would rather have this train/test split done once and for all, because:
1) it lets you review those sets offline;
2) if the split is done each time you run an experiment, there is a risk that samples start swapping sets if you are not extremely careful (e.g. when you add more data to your existing dataset).
See also a related question about how to split a set into training & testing in tensorflow.
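For completeness, here is a sketch of the "split once, offline" alternative the answer above recommends, assuming the examples live in TFRecord files (the GCS paths are placeholders):

```python
import tensorflow as tf

# The train/test split is decided once, offline, at the file level, and stays
# fixed across experiments; the input pipeline just reads the chosen files.
train_files = tf.io.gfile.glob("gs://my-bucket/data/train-*.tfrecord")
test_files = tf.io.gfile.glob("gs://my-bucket/data/test-*.tfrecord")

train_ds = tf.data.TFRecordDataset(train_files)
test_ds = tf.data.TFRecordDataset(test_files)
```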

How to effectively make use of a GPU for reinforcement learning?

Recently I looked into reinforcement learning and there was one question bugging me that I could not find an answer for: how is training effectively done using GPUs? To my understanding, constant interaction with an environment is required, which seems like a huge bottleneck to me, since this task is often non-mathematical / non-parallelizable. Yet, for example, AlphaGo uses multiple TPUs/GPUs. So how are they doing it?
Indeed, you will often have interactions with the environment in between learning steps, which will often be better off running on CPU than GPU. So, if your code for taking actions and your code for running an update / learning step are very fast (as in, for example, tabular RL algorithms), it won't be worth the effort of trying to get those on the GPU.
However, when you have a big neural network that you need to go through whenever you select an action or run a learning step (as is the case in most of the Deep Reinforcement Learning approaches that are popular these days), the speedup of running these on GPU instead of CPU is often enough to be worth the effort, even if it means you're quite regularly "switching" between CPU and GPU and may need to copy some things between RAM and VRAM.
When doing off-policy reinforcement learning (which means you can use transition samples generated by a "behavioral" policy, different from the one you are currently learning), an experience replay buffer is generally used. You can therefore grab a bunch of transitions from this large buffer and use a GPU to optimize the learning objective with SGD (cf. DQN, DDPG).
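A minimal sketch of that replay-based pattern (the network, buffer size, and hyperparameters are made up, and the CPU-side environment interaction that fills the buffer is assumed to happen elsewhere):

```python
import random
from collections import deque

import tensorflow as tf

# Hypothetical replay buffer: CPU-side interaction fills it with
# (state, action, reward, next_state, done) tuples; the GPU consumes
# random minibatches from it.
replay_buffer = deque(maxlen=100_000)

q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
gamma = 0.99


def train_step(batch_size=64):
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = (
        tf.convert_to_tensor(x, dtype=tf.float32) for x in zip(*batch))
    # DQN-style TD target, computed outside the tape (no gradient through it).
    targets = rewards + gamma * (1.0 - dones) * tf.reduce_max(
        q_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_values = tf.reduce_sum(
            q_net(states) * tf.one_hot(tf.cast(actions, tf.int32), 2), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_values))
    # One SGD step; runs on the GPU if one is available.
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```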
One instance of a CPU-GPU hybrid approach for RL is GA3C: https://github.com/NVlabs/GA3C.
Here, multiple CPUs are used to interact with different instances of the environment. "Trainer" and "Predictor" processes then collect the interactions using multi-process queues, and pass them to a GPU for back-propagation.
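A very rough sketch of that queue-based arrangement (this is not GA3C's actual code; the environment and the update step are stand-ins):

```python
# CPU worker processes interact with (fake) environment copies and push
# transitions onto a shared queue; a single trainer drains the queue and
# would run batched GPU updates on what it receives.
import multiprocessing as mp
import random


def worker(experience_queue, worker_id):
    random.seed(worker_id)
    state = 0.0
    while True:
        action = random.choice([0, 1])          # stand-in for the policy
        next_state = state + random.random()    # stand-in for env.step()
        reward, done = random.random(), next_state > 10.0
        experience_queue.put((state, action, reward, next_state, done))
        state = 0.0 if done else next_state


def trainer(experience_queue, batch_size=32):
    while True:
        batch = [experience_queue.get() for _ in range(batch_size)]
        # Stand-in for a batched back-propagation step on the GPU.
        print("trained on", len(batch), "transitions")


if __name__ == "__main__":
    q = mp.Queue(maxsize=1024)
    workers = [mp.Process(target=worker, args=(q, i), daemon=True)
               for i in range(4)]
    for w in workers:
        w.start()
    trainer(q)
```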

Optimal data streaming and processing solution for enormous datasets into tf.data.Dataset

Context:
My text input pipeline currently consists of two main parts:
I. A complex text preprocessing step that exports tf.SequenceExamples to tfrecords (custom tokenization, vocabulary creation, statistics calculation, normalization and much more, over the full dataset as well as per individual example). That is done once for each data configuration.
II. A tf.Dataset (TFRecords) pipeline that does quite a bit of processing during training, too (string_split into characters, table lookups, bucketing, conditional filtering, etc.).
The original dataset is present across multiple locations (BigQuery, GCS, RDS, ...).
Problem:
The problem is that as the production dataset increases rapidly (several terabytes), it is not feasible to recreate tfrecords files for each possible data configuration (part I has a lot of hyperparameters), as each would have an enormous size of hundreds of terabytes. Not to mention that tf.Dataset reading speed surprisingly slows down when tf.SequenceExamples or tfrecords grow in size.
There are quite a few possible solutions:
Apache Beam + Cloud DataFlow + feed_dict;
tf.Transform;
Apache Beam + Cloud DataFlow + tf.Dataset.from_generator;
tensorflow/ecosystem + Hadoop or Spark;
tf.contrib.cloud.BigQueryReader,
but none of them seems to fully fulfill my requirements:
Streaming and processing data on the fly from BigQuery, GCS, RDS, ... as in part I.
Sending data (protos?) directly to tf.Dataset in one way or another to be used in part II.
Fast and reliable for both training and inference.
(optional) Being able to pre-calculate some full pass statistics over the selected part of the data.
EDIT: Python 3 support would be just wonderful.
What is the most suitable choice for the tf.data.Dataset pipeline? What are the best practices in this case?
Thanks in advance!
I recommend orchestrating the whole use case using Cloud Composer (the GCP integration of Airflow).
Airflow provides operators that let you orchestrate a pipeline with a script.
In your case, you can use the dataflow_operator to spin up the Dataflow job when you have enough data to process.
To get the data from BigQuery you can use the bigquery_operator.
Furthermore, you can use the python_operator or the bash_operator to monitor and pre-calculate statistics.
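A rough sketch of what such a DAG could look like, using the Airflow 1.x contrib operators available in Cloud Composer at the time of writing (all task ids, SQL, paths, and options are placeholders; exact import paths and arguments depend on your Composer/Airflow version):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
from airflow.operators.python_operator import PythonOperator


def precalculate_statistics(**kwargs):
    # Placeholder for the full-pass statistics you want to compute.
    pass


with DAG("text_preprocessing",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily") as dag:

    export_from_bq = BigQueryOperator(
        task_id="export_from_bq",
        sql="SELECT * FROM `my_project.my_dataset.raw_text`",
        destination_dataset_table="my_project.my_dataset.staged_text",
        use_legacy_sql=False,
    )

    preprocess = DataFlowPythonOperator(
        task_id="preprocess_with_dataflow",
        py_file="gs://my-bucket/pipelines/preprocess.py",   # Beam pipeline
        options={"output": "gs://my-bucket/tfrecords/"},
    )

    stats = PythonOperator(
        task_id="precalculate_statistics",
        python_callable=precalculate_statistics,
        provide_context=True,
    )

    export_from_bq >> preprocess >> stats
```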

Optimizing branch predictions: how to generalize code that could run with different compiler, interpreter, and hardware prediction?

I ran into some slowdowns in a tight loop today caused by an if statement, which surprised me somewhat because I expected branch prediction to successfully pipeline the particular statement to minimize the cost of the conditional.
When I sat down to think more about why it wasn't handled better, I realized I didn't know much about how branch prediction was being handled at all. I know the concept of branch prediction quite well and its benefits, but the problem is that I didn't know who was implementing it and what approach they were using to predict the outcome of a conditional.
Looking deeper, I know branch prediction can be done at a few levels:
Hardware itself with instruction pipelining
C++ style compiler
The interpreter of an interpreted language.
A half-compiled language like Java may do both the second and third of these.
However, because optimization can be done in many areas, I'm left uncertain as to how to anticipate branch prediction. If I'm writing in Java, for example, is my conditional optimized when compiled, when interpreted, or by the hardware after interpretation? More interestingly, does this change if someone uses a different runtime environment? Could a different branch prediction algorithm used in a different interpreter result in a tight loop based around a conditional showing significantly different performance depending on which interpreter it's run with?
Thus my question: how does one generalize an optimization around branch prediction if the software could be run on very different computers, which may mean different branch prediction? If the hardware and interpreter can change their approach, then profiling and using whichever approach proved fastest isn't a guarantee. Let's ignore C++, where you have compile-level ability to force this, and look at interpreted languages where someone still needs to optimize a tight loop.
Are there certain presumptions that are generally safe to make regardless of the interpreter used? Does one have to dive into the intricate specification of a language to make any meaningful presumption about branch prediction?
Short answer:
To help improve the performance of the branch predictor try to structure your program so that conditional statements don't depend on apparently random data.
Details
One of the other answers to this question claims:
There is no way to do anything in the high-level language to optimize for branch prediction; caching, sure, sometimes you can, but branch prediction, no, not at all.
However, this is simply not true. A good illustration of this fact comes from one of the most famous questions on Stack Overflow: "Why is processing a sorted array faster than processing an unsorted array?"
All branch predictors work by identifying patterns of repeated code execution and using this information to predict the outcome and/or target of branches as necessary.
When writing code in a high-level language it's typically not necessary for an application programmer to worry about trying to optimize conditional branches. For instance gcc has the __builtin_expect function, which allows the programmer to specify the expected outcome of a conditional branch. But even if an application programmer is certain they know the typical outcome of a specific branch, it's usually not necessary to use the annotation. In a hot loop using this directive is unlikely to help improve performance. If the branch really is strongly biased, then the predictor will be able to correctly predict the outcome most of the time even without the programmer annotation.
On most modern processors branch predictors perform incredibly well (better than 95% accurate even on complex workloads). So as a micro-optimization, trying to improve branch prediction accuracy is probably not something that an application programmer would want to focus on. Typically the compiler is going to do a better job of generating optimal code that works for the specific hardware platform it is targeting.
But branch predictors rely on identifying patterns, and if an application is written in such a way that patterns don't exist, then the branch predictor will perform poorly. If the application can be modified so that there is a pattern then the branch predictor has a chance to do better. And that is something you might be able to consider at the level of a high-level language, if you find a situation where a branch really is being poorly predicted.
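If you want to see what that looks like in practice, here is a sketch of the sorted-vs-unsorted experiment behind that famous question, written in Python only to match the rest of this page; note that under CPython the interpreter's own dispatch overhead tends to swamp the hardware branch predictor, so the dramatic difference seen in compiled languages may be small or absent here:

```python
import random
import timeit

data = [random.randrange(256) for _ in range(100_000)]


def conditional_sum(values):
    total = 0
    for v in values:
        if v >= 128:        # data-dependent branch
            total += v
    return total


unsorted_time = timeit.timeit(lambda: conditional_sum(data), number=50)
sorted_data = sorted(data)  # same values, but now the branch follows a pattern
sorted_time = timeit.timeit(lambda: conditional_sum(sorted_data), number=50)
print("unsorted:", unsorted_time, "sorted:", sorted_time)
```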
Branch prediction, like caching and pipelining, is something done to make code run faster in general by overcoming bottlenecks in the system (super slow cheap DRAM, which all DRAM is; all the layers of busses between X and Y; etc.).
There is no way to do anything in the high-level language to optimize for branch prediction; caching, sure, sometimes you can, but branch prediction, no, not at all. In order to predict, the core has to have the branch in the pipe along with the instructions that precede it, and across architectures and implementations it is not possible to find one rule that works. Often it is not even possible within one architecture and implementation, from the high-level language.
You could also easily end up in a situation where, by tuning for branch prediction, you de-tune for the cache, the pipe, or other optimizations you might want to use instead. And overall performance is first and foremost application specific, and then something tuned to that application, not something generic.
For as much as I like to preach and do optimizations at the high-level-language level, branch prediction is one that falls into the premature optimization category. Just enable it in the core if it is not already enabled; sometimes it saves you a couple of cycles, most of the time it doesn't, and depending on the implementation, it can cost more cycles than it saves. Like a cache, it comes down to hits vs. misses: if it guesses right you have code in a faster RAM sooner on its way to the pipe, if it guesses wrong you have burned bus cycles that could have been used by code that was going to be run.
Caching is usually a benefit (although it is not hard to write high-level code that shows it costing performance instead of saving it), as code usually runs linearly for some number of instructions before branching. Likewise, data is accessed in order often enough to overcome the penalties. Branching is not something we do on every instruction, and where we branch to does not have a common answer.
Your backend could try to tune for branch prediction by having the pre-branch decisions happen a few cycles before the branch, but all within a pipe size and tuned for fetch-line or cache-line alignments. Again, this messes with tuning for other features in the core.