Is there a reason why the number of channels/filters and batch sizes in many deep learning models are in powers of 2?

In many models the number of channels is kept in powers of 2. Also the batch-sizes are described in powers of 2. Is there any reason behind this design choice?

There is nothing significant about keeping the number of channels and the batch size as powers of 2. You can use any number you want.

In many models the number of channels is kept in powers of 2. Also the batch-sizes are described in powers of 2. Is there any reason behind this design choice?
While both could probably be optimized for speed (cache alignment? optimal usage of CUDA cores?) to be powers of two, I am 95% certain that 99.9% of people do it because others used the same numbers / it worked.
For both hyperparameters you could choose any positive integer. So what would you try? Keep in mind that each complete evaluation takes at least several hours. Hence I guess that if people play with this parameter at all, they do something like a binary search: starting from one number, they keep doubling it as long as it improves the results, until an upper bound is found. At some point the differences are minor and then it is irrelevant what you choose. And people will wonder less if you write that you used a batch size of 64 than if you write that you used 50. Or 42.
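To make that concrete, here is a minimal sketch of such a doubling strategy in Python; train_and_evaluate is a hypothetical placeholder for your own training routine (here it just returns a fake score so the sketch runs):

    # Doubling search over the batch size (a sketch, not a recipe).
    def train_and_evaluate(batch_size):
        # Stand-in for a full training run returning a validation score;
        # replace with your own model training. The fake score below simply
        # saturates so the loop terminates sensibly.
        return 0.90 - 0.1 / batch_size

    best_score, best_bs = float("-inf"), None
    batch_size = 16
    while batch_size <= 512:              # upper bound, e.g. set by GPU memory
        score = train_and_evaluate(batch_size)
        if score <= best_score:           # stop doubling once it no longer helps
            break
        best_score, best_bs = score, batch_size
        batch_size *= 2
    print(f"picked batch size {best_bs} with score {best_score:.4f}")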


Neural network hyperparameter tuning - is setting random seed a good idea?

I am trying to tune a basic neural network as practice. (Based on an example from a Coursera course: Neural Networks and Deep Learning - DeepLearning.AI)
I am facing the issue of random weight initialization. Let's say I try to tune the number of layers in the network.
I have two options:
1.: set the random seed to a fixed value
2.: run my experiments more times without setting the seed
Both versions have pros and cons.
My biggest concern is that if I use a random seed (e.g. tf.random.set_seed(1)), then the values I settle on may be "over-fitted" to that seed and may not work well without the seed or if the seed is changed (e.g. tf.random.set_seed(1) -> tf.random.set_seed(2)). On the other hand, if I run my experiments multiple times without setting the seed, then I can inspect fewer options (due to limited computing capacity) and still only cover a small subset of the possible random weight initializations.
In both cases I feel that luck is a strong factor in the process.
Is there a best practice for handling this?
Does TensorFlow have built-in tools for this purpose? I appreciate any source of descriptions or tutorials. Thanks in advance!
Tuning hyperparameters in deep learning (and in machine learning generally) is a common issue. Setting the random seed to a fixed number ensures reproducibility and a fair comparison: repeating the same experiment will lead to the same outcomes. As you probably know, the best practice to avoid over-fitting is to do a train-test split of your data and then use k-fold cross-validation to select optimal hyperparameters. If you test multiple values for a hyperparameter, you want to make sure the other circumstances that might influence the performance of your model (e.g. the train-test split or the weight initialization) are the same for each candidate value, in order to have a fair comparison of the performance. Therefore I would always recommend fixing the seed.
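For reference, fixing the relevant seeds in a TensorFlow/Keras workflow usually looks something like the snippet below (tf.keras.utils.set_random_seed is available in recent TensorFlow releases and wraps all three calls):

    import random

    import numpy as np
    import tensorflow as tf

    SEED = 1

    # Fix the three sources of randomness that typically affect a Keras model:
    random.seed(SEED)         # Python's RNG (e.g. shuffling done in plain Python)
    np.random.seed(SEED)      # NumPy (e.g. train/test splits, data augmentation)
    tf.random.set_seed(SEED)  # TensorFlow ops, including weight initialization

    # In newer TensorFlow versions this one-liner does all three at once:
    # tf.keras.utils.set_random_seed(SEED)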
Now, the problem with this is, as you already pointed out, that the performance of each model will still depend on the random seed, like the particular data split or weight initialization in your case. To avoid this, one can do repeated k-fold cross-validation: you repeat the k-fold cross-validation multiple times, each time with a different seed, select the best parameters of that run, test on the test data, and average the final results to get a good estimate of performance plus variance, thereby eliminating the influence the seed has on the validation process.
Alternatively, you can perform k-fold cross-validation a single time and train each split n times with a different random seed (eliminating the effect of weight initialization, but still keeping the effect of the train-test split).
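A sketch of the repeated cross-validation idea, using scikit-learn's RepeatedKFold to generate the splits; build_and_evaluate is a hypothetical placeholder where you would build, train and score your Keras model (passing the seed to tf.random.set_seed before building it):

    import numpy as np
    from sklearn.model_selection import RepeatedKFold

    X = np.random.rand(200, 10)          # toy data; replace with your dataset
    y = np.random.randint(0, 2, 200)

    def build_and_evaluate(X_tr, y_tr, X_val, y_val, seed):
        # Placeholder: build, train and evaluate your model here.
        return np.random.default_rng(seed).random()   # fake score so this runs

    # 5-fold cross-validation repeated 3 times, each repetition reshuffled.
    rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
    scores = [build_and_evaluate(X[tr], y[tr], X[va], y[va], seed=i)
              for i, (tr, va) in enumerate(rkf.split(X))]

    print(f"mean score = {np.mean(scores):.3f}, std = {np.std(scores):.3f}")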
Finally, TensorFlow has no built-in tool for this purpose. You as a practitioner have to take care of this yourself.
There is no absolute right or wrong answer to your question. You have almost answered your own question already. In what follows, however, I will try to expand on it via the following points:
The purpose of random initialization is to break the symmetry that makes neural networks fail to learn:
... the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units. If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way...
Deep Learning (Adaptive Computation and Machine Learning series)
Hence, we need the neural network components (especially the weights) to be initialized with different values. There are some rules of thumb for how to choose those values, such as Xavier initialization, which samples from a normal distribution with mean 0 and a variance scaled by the number of units in the layer. This is a very interesting article to read.
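For reference, Glorot (Xavier) initialization uses a variance of 2 / (fan_in + fan_out); in Keras it is exposed as GlorotNormal / GlorotUniform (the latter being the default kernel initializer of Dense layers):

    import tensorflow as tf

    fan_in, fan_out = 256, 128
    stddev = (2.0 / (fan_in + fan_out)) ** 0.5   # Glorot-normal standard deviation
    print(f"Glorot stddev for a {fan_in}->{fan_out} layer: {stddev:.4f}")

    # Explicitly requesting the initializer (with a seed) on a Keras layer:
    layer = tf.keras.layers.Dense(
        fan_out,
        kernel_initializer=tf.keras.initializers.GlorotNormal(seed=1),
    )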
Having said that, the initial values are important but not extremely critical "if" proper rules are followed, as mentioned in point 2. They are important because large or improper ones may lead to vanishing or exploding gradient problems. On the other hand, different "proper" weights should not hugely change the final results, unless they cause the aforementioned problems or get the neural network stuck at some poor local minimum. Please note, however, that the latter also depends on many other aspects, such as the learning rate, the activation functions used (some explode/vanish more than others: this is a great comparison), the architecture of the neural network (e.g. fully connected, convolutional, etc.: this is a cool paper) and the optimizer.
In addition to point 2, using a good optimizer, beyond the standard stochastic one, should in theory keep the initial values from noticeably influencing the quality of the final results. A good example is Adam, which provides a very adaptive learning technique.
If you still get noticeably different results with different "properly" initialized weights, there are some approaches that "might help" make the neural network more stable, for example: use a train-test split, use GridSearchCV to pick the best parameters, and use k-fold cross-validation, etc.
In the end, obviously the best scenario is to train the same network with different random initial weights many times and then take the average of the results and the variance, for a more reliable judgement of the overall performance. How many times? Well, if you can do it hundreds of times, so much the better, yet that is clearly almost impractical (unless you have some Googlish hardware capability and capacity). As a result, we come to the same conclusion that you had in your question: there has to be a trade-off between time & space complexity and the reliability of using a seed, taking into consideration some of the rules of thumb mentioned in the previous points. Personally, I am okay with using a seed because I believe that "It’s not who has the best algorithm that wins. It’s who has the most data." (Banko and Brill, 2001). Hence, using a seed with enough (define enough: it is subjective, but the more the better) data samples should not cause any concerns.
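As a sketch of that "train several times, then average" procedure (build_and_train is a hypothetical stand-in for your own network):

    import numpy as np
    import tensorflow as tf

    def build_and_train(seed):
        # Stand-in: set the seed, build and train the network, return a score.
        tf.random.set_seed(seed)
        return float(np.random.default_rng(seed).random())  # fake score

    # A handful of runs is the usual compromise between cost and reliability.
    scores = [build_and_train(seed) for seed in range(5)]
    print(f"mean = {np.mean(scores):.3f}, variance = {np.var(scores):.4f}")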

What makes non linear functions computationally expensive in hardware (e.g. FPGA)?

I've read some articles that state non-linear functions (like exponentials) are computationally expensive.
I was wondering what makes them computationally expensive.
When referring to 'computationally expensive', does it mean in terms of time taken or hardware resources used?
I've tried searching on Google, but I couldn't find any simple explanations for this.
Not pretending to offer a definitive answer, but start with what you have in an FPGA.
Normally you're limited to adders, multipliers and some memory. What can you do with those?
Linear function - easy, taking just one multiplier and one adder.
Nonlinear functions - what are those? Either polynomials, requiring you to spend a ton of multipliers (the higher the polynomial's degree, the more of them), or even transcendental functions, requiring you to find some satisfactory approximation and compute it in many steps.
Even simple integer division can't be done in one clock; simple implementations require as many steps as there are bits in the numbers being divided.
The other possible solution is to use a lookup table, and that's great for a small range of arguments. But if you want the function values over a wide range of arguments, or with greater precision, you'll end up with a lookup table so large that it can't fit in the device you have to work with.
So those are the main costs - you either spend lots of dedicated hardware resources (multipliers, memory for lookup tables), or spend lots of time in many-step approximation algorithms, or in algorithms that refine the result one "digit" per iteration (integer division, CORDIC, etc.).
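To make the trade-off concrete, here is a rough software analogue (Python, not HDL) of the two options for exp(x): a truncated Taylor polynomial evaluated with Horner's scheme, where every step costs one multiply plus one add (mirroring the multiplier/adder cost repeated per degree), versus a precomputed lookup table whose size grows with the argument range and the precision you need:

    import math

    # Option 1: degree-7 Taylor polynomial of exp(x), Horner's scheme.
    # Each loop iteration costs one multiply + one add -- the same resources
    # a linear function needs once, repeated 7 times here.
    coeffs = [1 / math.factorial(k) for k in range(7, -1, -1)]   # x^7 ... x^0
    def exp_poly(x):
        acc = 0.0
        for c in coeffs:
            acc = acc * x + c
        return acc

    # Option 2: lookup table over [0, 4) with a step of 1/256.
    # A wider range or a finer step means a (much) larger table.
    STEP = 1 / 256
    table = [math.exp(i * STEP) for i in range(4 * 256)]
    def exp_lut(x):
        return table[int(x / STEP)]

    print(exp_poly(1.0), exp_lut(1.0), math.exp(1.0))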

How does 'negative sampling' improve word representation quality in word2vec?

Negative sampling in word2vec improves the training speed, that's obvious!
But why does it 'make the word representations significantly more accurate'?
I didn't find any relevant discussion or details. Can you help me?
It's hard to describe what the author of that claim may have meant, without the full context of where it appeared. For example, word-vectors can be optimized for different tasks, and the same options that make word-vectors better for one task might make them worse for another.
One popular way to evaluate word-vectors since Google's original paper & code release is a set of word-analogy problems. These give a nice repeatable summary 'accuracy' percentage, so the author might have meant that for a particular training corpus, on that particular problem, holding other things constant, the negative-sampling mode had a higher 'accuracy' score.
But that wouldn't mean it's always better, with any corpus, or for any other downstream evaluation of quality or accuracy-on-some-task.
Projects with larger corpuses, and especially larger vocabularies (more unique words), tend to prefer the negative-sampling mode. The hierarchical-softmax alternative mode becomes slower as the vocabulary becomes larger, while the negative-sampling mode does not.
And, having a large, diverse corpus, with many subtly-different usage examples of all interesting words, is the most important contributor to really good word-vectors.
So, simply by making larger corpuses manageable, within a limited amount of training time, negative-sampling could be seen as indirectly enabling improved word-vectors - because corpus size is such an important factor.
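If the claim came from a gensim context, the two modes are selected with the negative and hs parameters; a minimal example (parameter names from gensim 4.x, where size became vector_size):

    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["the", "lazy", "dog"]]         # toy corpus; use a real one

    # negative=5, hs=0 -> negative-sampling mode (5 noise words per positive)
    # negative=0, hs=1 -> hierarchical-softmax mode instead
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                     sg=1, negative=5, hs=0, epochs=10, seed=1)
    print(model.wv["fox"][:5])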

How to concatenate the low halves of two SSE registers?

I have two SSE registers and I want to replace the high half of one by the low half of the other. As usual, the fastest way.
I guess it is doable by shifting one of the registers by 8 bytes, then alignr to concatenate.
Is there any single-instruction solution?
You can use punpcklqdq to combine the low halves of two registers into hi:lo in a single register. This is identical to what the movlhps FP instruction does, and also unpcklpd, but operates in the integer domain on CPUs that care about FP vs. integer shuffles for bypass delays.
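Purely to illustrate the data movement (this is not how you would write SIMD code), the qword-level effect of punpcklqdq (intrinsic _mm_unpacklo_epi64) can be modelled with each register as two 64-bit lanes:

    import numpy as np

    # Model each XMM register as two 64-bit lanes: index 0 = low, index 1 = high.
    a = np.array([0xAAAA, 0xA000], dtype=np.uint64)   # [low(a), high(a)]
    b = np.array([0xBBBB, 0xB000], dtype=np.uint64)   # [low(b), high(b)]

    # punpcklqdq a, b  ->  low lane = low(a), high lane = low(b)  (i.e. hi:lo)
    result = np.array([a[0], b[0]], dtype=np.uint64)
    print([hex(int(x)) for x in result])              # ['0xaaaa', '0xbbbb']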
Bonus reading: combining different parts of two registers
palignr would only be good for combining hi:xxx with xxx:lo, to produce lo:hi (i.e. reversed). You can use an FP shuffle (the register-register form of movsd) to get hi:lo (by moving the low half of xxx:lo to replace the low garbage in hi:xxx). Without that, you'd want to use punpckhqdq to bring the high half of one register to the low half, then use punpcklqdq to combine the low halves of two registers.
On most CPUs other than Intel Nehalem, floating-point shuffles on integer data are generally fine (little or no extra latency when used between vector-integer ALU instructions). On Nehalem, you might get two cycles of extra latency into and out of a floating point shuffle (for a total of 4 cycles latency), but that's only a big problem for throughput if it's part of a loop-carried dependency chain. See Agner Fog's guides for more info.
Agner's Optimizing Assembly guide also has a whole section of tables of SSE/AVX instructions that are useful for various kinds of data movement within or between registers. See the sse tag wiki for a link, download the PDF, read section 13.7 "Permuting data" on page 130.
To use FP shuffles with intrinsics, you have to clutter up your code with _mm_castsi128_ps and _mm_castps_si128, which are reinterpret-casts that emit no instructions.

Is flop per second a measure of the speed of a processor, or a measure of the speed of an algorithm?

1) I can see very clearly that the number of floating point operations a computer can do in one second is a good way of quantifying its performance. That's correct, right?
2) My teacher keeps asking me to calculate the flop rate for algorithms I program. I do this by calculating how many flops the algorithm does and timing how long it takes to run. In this situation the flop rate always falls way short of the flop rate I expect from the computer I'm using. So for algorithms, is a flop rate more an assessment of how long the 'other stuff' takes (i.e. overheads, stuff that doesn't involve flopping)? That is, when the flop count is low, most of the program's time is spent calling functions etc. and not performing flops, correct?
I know this is a very broad question but I was hoping for some ideas from those in industry or academia about what they intuitively feel the flop rate of an algorithm actually is.
Properly, “flops” is a measure of processor or system performance. Many people misuse it as a measure of implementation or algorithm speed.
Suppose you had a computation to perform that is fixed in the number of operations it takes. For example, you want to multiply a matrix with dimensions a•b with a matrix with dimensions b•c. If you perform this multiplication in the usual way, then, in each combination of one of a rows and one of c columns, you perform b multiplications and b-1 additions. So the entire matrix multiplication takes a•c•(2b-1) floating-point operations. If it finishes in one second, some people say it is providing a•c•(2b-1) flops.
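As a concrete instance of that accounting, here is how one could measure the "flops" of a plain matrix multiplication in Python/NumPy, using the a•c•(2b-1) count from above:

    import time
    import numpy as np

    a, b, c = 512, 512, 512
    A = np.random.rand(a, b)
    B = np.random.rand(b, c)

    start = time.perf_counter()
    C = A @ B
    elapsed = time.perf_counter() - start

    flop_count = a * c * (2 * b - 1)    # b multiplies + (b-1) adds per output cell
    print(f"{flop_count} floating-point operations in {elapsed:.4f} s "
          f"-> {flop_count / elapsed / 1e9:.2f} GFLOPS")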
If you have two programs that both do the multiplication the same way, you can compare them using this figure. The one of them that has more “flops” is better. Even though they use the same algorithm, one of them might have a better implementation, perhaps because it organizes the work more efficiently for memory cache.
This breaks when somebody figures out a new algorithm that gets the same job done with fewer operations. Then some people compare programs (or routines) using the nominal number of operations of the original method, even though the program actually performs fewer operations.
To some extent, this makes sense. If you have two programs that do the same job, and one of them has a higher number of “flops” calculated this way, then it is the program that gives you the answer more quickly.
However, it does not make sense to the extent that it introduces inaccuracy. We are often not interested in a single problem size but in various sizes, and the “flops” of a program will not scale linearly with the nominal number of operations once a new algorithm is used.
By analogy, suppose it is 80 kilometers from town A to town B over the mountain road that everybody uses. If it takes your car an hour to make the trip, your car is traveling 80 kilometers an hour. While out exploring one day, you discover a pass through the mountains that reduces the trip to 70 kilometers. Now you can make the trip in 52.5 minutes. The same calculation that some people do with “flops” would say your car is going 91.4 kilometers per hour, since it makes the 80-kilometer trip in 52.5 minutes.
That is obviously wrong. However, it is useful for deciding which route to take.
FLOPS means the number of Floating-Point Operations Per Second executed by a processor. That can be a purely theoretical figure derived from some hardware/architecture specification or an empirical result from running some algorithm that is tuned to give high numbers.
The main issue in FLOPS calculation comes from systems where there are multiple, parallel execution blocks. AFAIK, only in that context does it start to get really tough to split a practical algorithm (e.g. FFT, or RGB->YUV conversion) into the most useful set of instructions that uses all the calculation units in a CPU. (E.g. without automatic vectorization, an x64 system often performs floating-point operations only in the Xmm0[0] register, wasting 50-75% of its full potential.)
This partly answers question 2. Besides the obvious stall introduced by the cache/memory-to-register bandwidth, the next crucial obstacle on the way to maximum FLOPS figures is that the data is in the wrong register. That is something often completely ignored in complexity analysis, which, just like FLOPS calculations, only counts basic arithmetic operations. In the case of parallel programming, it often happens that there are not just one but 4, 8 or 16 values in the wrong registers, without any means of easily permuting them all at once. Add to that the overhead, and the "warm up" and "cool down" stages of an algorithm that tries to keep all the calculating units occupied with meaningful data, and there you have the major reasons for getting 100 MFLOPS out of a 1 GFLOPS system.
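A crude way to see the gap between a naive element-by-element implementation and the same operations handed to a vectorized routine on the same hardware (in Python the gap is exaggerated by interpreter overhead, but the pattern is the same one you see between scalar and SIMD native code):

    import time
    import numpy as np

    n = 1_000_000
    x = np.random.rand(n)
    y = np.random.rand(n)

    # Naive loop: one multiply and one add per element.
    start = time.perf_counter()
    acc = 0.0
    for i in range(n):
        acc += x[i] * y[i]
    scalar_time = time.perf_counter() - start

    # The same 2*n operations done by a vectorized (BLAS-backed) dot product.
    start = time.perf_counter()
    acc_vec = float(np.dot(x, y))
    vector_time = time.perf_counter() - start

    flops = 2 * n
    print(f"loop:       {flops / scalar_time / 1e6:8.1f} MFLOPS")
    print(f"vectorized: {flops / vector_time / 1e6:8.1f} MFLOPS")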