Is my training data really being randomized? Error rates are wildly oscillating - cntk

So I set the randomization window to 100,000. In my log I can see that it's oscillating between 0 errors and a lot of errors, which makes me wonder if the data is truly random. The training data is made up of sequences where the input is typically about 50 tokens and the output is 6 tokens for about 99% of the sequences, and maybe about 400 tokens in the other 1% (and these sequences are the most important to learn how to output, of course). It seems like more than one of the longer sequences may be getting clumped together, and that's why the error rate might go up all of a sudden. Is that possible?

Please try specifying a larger randomization window if your samples are small, e.g. randomizationWindow=100000000. It may be that your window currently covers only a single chunk - in that case the data is randomized only inside that chunk, not between chunks.
(You can see how the data is split into chunks if you specify verbosity=4 in the reader section; the log will then show the randomized window ranges.)
The more data you can put in memory, the better. It also helps from a performance perspective, because (after the initial load) the readers can prefetch new chunks while the current data is being processed, so your GPU won't be IO bound.
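For reference, a minimal sketch of the equivalent setting in the CNTK Python API (this assumes CNTK 2.x; the CTF file name, stream names and dimensions below are made up, and randomization_window_in_samples should play the role of the config reader's randomizationWindow):

from cntk.io import MinibatchSource, CTFDeserializer, StreamDefs, StreamDef

# Sketch only: 'train.ctf' and the stream definitions are placeholders for your own data.
source = MinibatchSource(
    CTFDeserializer('train.ctf', StreamDefs(
        features=StreamDef(field='x', shape=50000, is_sparse=True),   # input token stream
        labels=StreamDef(field='y', shape=50000, is_sparse=True))),   # output token stream
    randomization_window_in_samples=100000000)  # large window => shuffling spans many chunks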

Related

Kotlin's Array vs ArrayList vs List for storing large amounts of data

I'm building a Deep Neural Network in Kotlin (I know Python would be better, but I have to do that in Kotlin).
For training the net I need a huge amount of data from the MNIST database, this means I need to read about 60,000 images from a single file in IDX format and store them for simultaneous use.
Every image consists of 784 Bytes. So the total size is:
784*60,000 = 47,040,000 = ~47 MB of training data.
Which ain't that much, since I'm running the JVM in an 8GB RAM env.
After reading an image I need to convert it to a KMatrix, a custom data structure for matrix math operations. Under the hood of a KMatrix there's an Array<Array<Double>>.
I need a structure to store all the images at once, so I'm currently using a List<KMatrix>, which basically translates to a List<Array<Array<Double>>>.
The problem is that while building the List<KMatrix> the JVM runs out of memory, throwing an OutOfMemoryError: GC overhead limit exceeded.
I wonder if the problem is which data structures I'm using (i.e. should I use an ArrayList instead of an Array?) or maybe how I'm building the entire thing up (i.e. I have some optimization work to do).
I'll put the code, if needed, as soon as I can.
Thanks for your help.
Self-answer with the summarized solution (Thanks to answers by #Tenfour04 and #gidds)
As #Tenfour04 stated, you have basically three alternatives to the Array<Array<Double>> for the KMatrix:
an Array<DoubleArray>, which maintains the same logic as the original but saves lots of memory and increases performance (see the rough estimate after this list);
a 1-dimensional DoubleArray, which saves a bit more memory and performance, but with the added complexity of index mapping (the [i;j] element of the matrix is the [i * w + j] element of the array, where w is the row width); as #gidds pointed out, this probably isn't worth it;
a 1-D DoubleBuffer created with ByteBuffer.allocateDirect(8 * size).asDoubleBuffer(), which improves performance even further, but only exposes get and put methods, so it is useless if you need simple and direct set operations.
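To get a feel for why the nested boxed arrays hurt (a rough estimate; the exact overhead depends on the JVM and object alignment): with Array<Array<Double>> every element is a boxed java.lang.Double, roughly 16-24 bytes of object plus a 4-8 byte reference in the enclosing array, whereas DoubleArray stores raw 8-byte primitives. Note also that the ~47 MB figure above counts the 1-byte pixels in the IDX file; once converted to Double, each pixel takes at least 8 bytes:
47,040,000 elements * 8 Bytes = ~376 MB (Array<DoubleArray> or a flat DoubleArray)
47,040,000 elements * ~24-32 Bytes = roughly 1.1-1.5 GB (Array<Array<Double>>, boxed)
That difference alone can be enough to push a default-sized heap into "GC overhead limit exceeded" territory, even on an 8 GB machine.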
Conclusion
I chose option 2, since in my case I'm performing very intensive operations, but in most cases option 1 is probably the best, as it balances complexity and performance.
If you need the highest-performance structure and get/put methods are enough, I'd say option 3 is what you're looking for.
Hope this helps someone

Big Oh! algorithms running in O(4^N)

For algorithms running in O(4^N): if we triple the size of the input, the time is multiplied by what number?
This is an interesting question because while equivalent questions for runtimes like Θ(n) or Θ(n^3) have clean answers, the answer here is a bit more nuanced.
Let's start with a simpler question. We have an algorithm whose runtime is Θ(n^2), and on a "sufficiently large" input the runtime is T seconds. What should we expect the runtime to be once we triple the size of the input? To answer this, let's imagine, just for simplicity's sake, that the actual runtime of this function is closely approximated by cn^2, and let k be the size of the "sufficiently large" input we plugged in. Then, plugging in 3k, we see that the runtime is
c(3k)^2 = 9ck^2 = 9(ck^2) = 9T.
That last step follows because the cost of running the algorithm on an input of size k is T, meaning that ck^2 = T.
Something important to notice here - tripling the size of the input does not change the fact that the runtime here is Θ(n^2). The runtime is still quadratic; we're just changing how big the input is.
More generally, for any algorithm whose runtime is Θ(n^m) for some fixed constant m, the runtime will grow by roughly a factor of 3^m if you triple the size of the input. That's because
c(3k)^m = 3^m * ck^m = 3^m * T.
But something interesting happens if we try performing this same analysis on a function whose runtime is Θ(4^n). Let's imagine that we ran this algorithm on some input of size k and it took T time units to finish. Then running this algorithm on an input of size 3k will take time roughly
c * 4^(3k) = c * 4^k * 4^(2k) = T * 4^(2k) = 16^k * T.
Notice how we aren't left with a constant multiple of the original cost, but rather something that's 16^k times bigger. In particular, that means that the amount by which the algorithm slows down depends on how big the input is. For example, the slowdown going from input size 10 to input size 30 is a factor of 16^10, while the slowdown going from input size 30 to input size 90 is a staggering 16^30. For what it's worth, 16^30 = 2^120, which is more than 10^36 - a trillion trillion trillion.
And, intuitively, that makes sense. Exponential functions grow at a rate proportional to how big they already are. That means that doubling or tripling the size of the input leads to a slowdown factor that itself depends on the size of that input.
And, as above, notice that the runtime is not now Θ(4^(3n)). The runtime is still Θ(4^n); we're just changing which inputs we're plugging in.
So, to summarize:
The runtime of the function slows down by a factor of 4^(2n) = 16^n if you triple the size of the input n. This means that the slowdown depends on how big the input is (see the quick numeric check below).
The runtime of the function stays at Θ(4^n) when we do this. All that's changing is where we're evaluating the 4^n.
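If you want to sanity-check that factor numerically, here is a small, purely illustrative Python snippet:

for n in (10, 30):
    slowdown = 4 ** (3 * n) // 4 ** n   # runtime ratio when the input size triples
    assert slowdown == 16 ** n          # matches the 4^(2n) = 16^n factor derived above
    print(n, slowdown)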
Hope this helps!
The time complexity of an algorithm describes how its run time grows with respect to the input size. If our input size increases by 3 times, we simply have a new value for the input size.
Hence, the time complexity of the algorithm still remains the same, i.e. O(4^N).

Is there an optimal number of elements for a tfrecords file?

This is a follow-up to these SO questions:
What is the need to do sharding of TFRecords files?
optimal size of a tfrecord file
and this passage from this tutorial
For this small dataset we will just create one TFRecords file for the
training-set and another for the test-set. But if your dataset is very
large then you can split it into several TFRecords files called
shards. This will also improve the random shuffling, because the
Dataset API only shuffles from a smaller buffer of e.g. 1024 elements
loaded into RAM. So if you have e.g. 100 TFRecords files, then the
randomization will be much better than for a single TFRecords file.
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/18_TFRecords_Dataset_API.ipynb
So there is an optimal file size, but I am wondering if there's an optimal number of elements, since it's the elements themselves that are being distributed to the GPU cores?
Are you trying to optimize:
1 initial data randomization?
2 data randomization across training batches and/or epochs?
3 training/validation throughput (i.e., gpu utilization)?
Initial data randomization should be handled when data are initially saved into sharded files. This can be challenging, assuming you can't read the data into memory. One approach is to read all the unique data ids into memory, shuffle those, do your train/validate/test split, and then write your actual data to file shards in that randomized order. Now your data are initially shuffled/split/sharded.
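As a rough sketch of that write-time shuffling (the names num_examples, train_dir, load_example and make_tf_example are placeholders for your own data and helpers, and the train/validate/test split is omitted for brevity):

import random
import tensorflow as tf

ids = list(range(num_examples))      # num_examples: however many unique items you have
random.shuffle(ids)                  # global shuffle while only the ids are in memory

shard_size = 10000                   # data items per shard (see the blocksize discussion below)
for shard, start in enumerate(range(0, len(ids), shard_size)):
    path = '{}/d{:05d}.tfr'.format(train_dir, shard)
    with tf.io.TFRecordWriter(path) as writer:
        for ex_id in ids[start:start + shard_size]:
            example = make_tf_example(load_example(ex_id))   # build a tf.train.Example
            writer.write(example.SerializeToString())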
Initial data randomization will make it easier to maintain randomization during training. However, I'd still say it is 'best practice' to re-shuffle file names and re-shuffle a data memory buffer as part of the train/validate data streams. Typically, you'll set up an input stream using multiple threads/processes. The first step is to randomize the file input streams by re-shuffling the filenames. This can be done like:
train_files = tf.data.Dataset.list_files('{}/d*.tfr'.format(train_dir),
                                         shuffle=True)
Now, if your initial data write was already randomized, you 'could' read the entire data from one file, before going to the next, but that would still impact re-randomization throughout the training process, so typically you interleave file reads, reading a certain number of records from each file. This also improves throughput, assuming you are using multiple file read processes (which you should do, to maximize gpu throughput).
blocksize = 1000 # samples read from one file before switching files
# interleaveFiles is your function mapping a filename to a dataset,
# e.g. lambda filename: tf.data.TFRecordDataset(filename)
train_data = train_files.interleave(interleaveFiles,
                                    block_length=blocksize,
                                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
Here, we're reading 1000 samples from each file, before going on to the next. Again, to re-shuffle the training data each epoch (which may or may not be critical), we re-shuffle the data in memory, setting a memory buffer based on what's available on the machine and how large our data items are (note - before formatting the data for gpu).
buffersize = 1000000 # samples read before shuffling in memory
train_data = train_data.shuffle(buffersize,
                                reshuffle_each_iteration=True)
train_data = train_data.repeat()
The repeat() call is just to allow the data set to 'wrap around' during training. This may or may not be important, depending on how you set up your training process.
To optimize throughput, you can do 2 things:
1 alter the order of operations in the data input stream. Typically, if you put your randomization operations early, they can operate on 'low weight' entities, like file names, rather than on tensors.
2 use pre-fetching to let your cpu processes stream data during gpu calculations
train_data = train_data.map(mapData,   # mapData: your parsing/feature-building function
                            num_parallel_calls=tf.data.experimental.AUTOTUNE)
train_data = train_data.padded_batch(batchsize)
train_data = train_data.prefetch(10)
So, mapping and batching happen last (this is usually preferred for maximizing gpu throughput, but it can depend on other factors, like data size pre- and post-tensorizing, and how computationally expensive your map function is).
Finally, you can tune the prefetch size to maximize gpu throughput, constrained by system memory and memory speed.
So, how does this all impact the 'optimal' number of data items in each sharded file?
Obviously, if your blocksize is larger than the amount of data in a file, blocksize becomes irrelevant and you might as well read each file completely. Typically, if you are going to use this paradigm, you want blocksize << data/file. I use 10x; so if my blocksize is 1000, I have ~10,000 data items in the file. This may not be optimal, but so far I can maintain >90% gpu usage using this approach on my specific hardware. If you want to tune for your hardware, you could start somewhere around 10x and adjust, based on whatever you are specifically trying to optimize.
If you have very large numbers of files, you may run into problems maintaining good file read streams, but on a modern system you should be able to get to 100,000 files or more and still be fine. Moving large numbers of files around can be difficult, but usually easier than having very small numbers of very big files, so there are some (broad) constraints on file sizes that can impact how many data items/file you end up with. Generally speaking, I'd say having on the order of 100s of files would be ideal for a large dataset. That way you can easily stream files across a network efficiently (again, that will depend on your network). If the data set is small, you'll have 10s to 50s of files, which is fine for streaming, depending on file size (I typically try to hit 100-300MB/file, which works well for moving things around a LAN or WAN).
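Putting those two rules of thumb together with made-up numbers, just to illustrate: if each serialized example is ~25 KB, then 10,000 items/file * 25 KB = ~250 MB/file, which lands in the 100-300 MB range above and is 10x a read blocksize of 1000.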
So, I think file-size and number-of-files places much stronger constraints on your process than number of data items/file, so long as you have an appropriate number of data items/file, given your file read blocksize. Again, you could hyper-shard your files (1 data item/file?), and read entire files into memory, without using file blocking. That might work, and it would certainly be lightweight to shuffle file names, rather than data items. But you might also end up with millions of files!
To really optimize, you'll need to set up an end-to-end training system on a particular machine, and then tweak it to see what works best for your particular data, network, and hardware. So long as your data are effectively randomized and your data files are easy to store/use/share, you just want to optimize gpu throughput. I would be surprised if reordering the data input stream and pre-fetching doesn't get you there.

What is the CNTK randomizationWindow behavior?

I have a quick question about the randomizationWindow parameter of the reader. The documentation says it controls how much of the data is in memory - but I'm a little unclear what effect it will have on the randomness of the data. If the training data file starts with one distribution of data and ends in a completely different distribution, will setting a randomization window smaller than the data size cause the data fed to the trainer not to come from a homogeneous distribution? I just wanted to double check.
To give a bit more detail on randomization/IO:
All corpus/data is always split into chunks. Chunks help make IO efficient, because all sequences of a chunk are read in one go (usually a chunk is 32/64 MB).
When it comes to randomization, there are two steps there:
all chunks are randomized
given the randomization window of N samples, the randomizer creates a rolling window of M chunks that together contain approximately N samples. All sequences inside this rolling window are randomized. When all sequences of a chunk have been processed, the randomizer can release it and start loading the next one asynchronously. (For example, if each chunk happened to hold roughly 50,000 of your samples, a randomization window of 100,000 samples would give a rolling window of about 2 chunks.)
When the randomizationWindow is set to a window smaller than the entire data size, the data is split into randomizationWindow-sized chunks, the order of those chunks is randomized, and then the samples within each chunk are randomized.

Getting each example exactly once

For monitoring my model's performance on my evaluation dataset, I'm using tf.train.string_input_producer for the filename queue on .tfr files, then I feed the parsed examples to the tf.train.batch function, which produces batches of a fixed size.
Assume my evaluation dataset contains exactly 761 examples (a prime number). To read all the examples exactly once, I would need a batch size that divides 761, but there is none, except 1, which would be too slow, and 761, which will not fit on my GPU. Is there a standard way to read each example exactly once?
Actually, my dataset size is not 761, but there is no number in the reasonable range of 50-300 that divides it exactly. Also I'm working with many different datasets, and finding a number that approximately divides the number of examples in each dataset can be a hassle.
Note that using the num_epochs parameter to tf.train.string_input_producer does not solve the issue.
Thanks!
You can use reader.read_up_to as in this example. Your last batch will be smaller, so you need to make sure your network doesn't hard-wire the batch size anywhere.
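A minimal sketch of that pattern (TF 1.x queue runners; the file name, the feature spec, and the 64-record cap are placeholders):

import tensorflow as tf

filename_queue = tf.train.string_input_producer(['eval.tfr'], num_epochs=1, shuffle=False)
reader = tf.TFRecordReader()
# read_up_to returns *up to* 64 serialized examples per call, so the final batch can be
# smaller; downstream ops should use tf.shape(serialized)[0] instead of a fixed batch size.
_, serialized = reader.read_up_to(filename_queue, num_records=64)
features = tf.parse_example(serialized, {'label': tf.FixedLenFeature([], tf.int64)})
# num_epochs=1 creates a local variable, so remember to run tf.local_variables_initializer().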