What will be the best choice for batch size for one device (using Mirrored Strategy in TF)? - tensorflow

Question:
Suppose you have 4 GPUs (having 2GB memory each) to train your deep learning model. You have 1000 data points in your dataset that takes around 10 GB of storage. What will be the best choice for batch size for one device (using Mirrored Strategy in TF)?
Can someone help me to solve this assignment problem? Thanks in advance.

Each GPU has 2 GB of memory and there are 4 GPUs, so you have a total of 8 GB to work with.
You can't fit 10 GB of data into 8 GB of memory in one go, so you split the 1000 data points in half, giving a global batch size of 500 (or rather 512, to stay close to a power of 2).
Distributing that global batch across the 4 GPUs gives 512 / 4 = 128 data points per device.
So the overall (global) batch size would be 512 data points, and the per-GPU batch size would be 128.
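For illustration, a minimal MirroredStrategy sketch (the dataset and the model-building function are placeholders, not part of the question) that sets the global batch size to 512 so each of the 4 replicas processes 128 examples:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()           # picks up all 4 GPUs
num_replicas = strategy.num_replicas_in_sync          # 4 in this scenario
per_replica_batch = 128
global_batch_size = per_replica_batch * num_replicas  # 512

# `dataset` is a placeholder tf.data.Dataset holding the 1000 examples
dataset = dataset.shuffle(1000).batch(global_batch_size)

with strategy.scope():
    model = build_model()                             # placeholder model factory
    model.compile(optimizer="adam", loss="mse")
model.fit(dataset, epochs=10)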

Related

Dask-RAPIDS data movement and out-of-memory issue

I am using dask (2021.3.0) and rapids (0.18) in my project. I perform the preprocessing on the CPU, and the preprocessed data is later transferred to the GPU for K-means clustering. But in this process I get the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(the error is raised before GPU memory is fully used, i.e. the GPU memory is not being used completely)
I have a single GPU with 40 GB of memory.
RAM size is 512 GB.
I am using the following snippet of code:
from dask.distributed import Client, LocalCluster
import cupy as cp
from cuml.dask.cluster import KMeans  # assuming the Dask-enabled cuML KMeans, given the .compute() call

cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
client = Client(cluster)  # connect a client so the cluster is actually used

# perform my preprocessing on the data and get the output in variable A (a dask array)
# convert A's blocks to cupy
x = A.map_blocks(cp.asarray)
km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()
I am also looking for a solution so that data larger than GPU memory can be preprocessed, and whenever GPU memory spills, the spilled data is moved to a temp directory or to the CPU (as we do with Dask, where we define a temp directory for spills from RAM).
Any help will be appreciated.
There are several ways to work with datasets larger than GPU memory.
Check out Nick Becker's blog, which has a few methods well documented.
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.
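If the goal is simply to let GPU memory spill instead of crashing, one option is dask-cuda's LocalCUDACluster with a device memory limit; a minimal sketch (the limits and spill directory below are illustrative, not taken from the question):
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Spill GPU blocks to host RAM once ~30 GB of the 40 GB device memory is in use,
# and spill further to disk under the given local directory if host RAM fills up.
cluster = LocalCUDACluster(
    n_workers=1,
    device_memory_limit="30GB",         # GPU spill threshold (illustrative)
    memory_limit="400GB",               # host-RAM limit per worker (illustrative)
    local_directory="/tmp/dask-spill",  # illustrative disk spill location
)
client = Client(cluster)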

Loading a large set of images kills the process

Loading 1500 images of size (1000, 1000, 3) breaks the code and throws kill 9 without any further error. Memory used before this line of code is 16% of total system memory. Total size of the images directory is 7.1 GB.
X = np.asarray(images).astype('float64')
y = np.asarray(labels).astype('float64')
system spec is:
OS: macOS Catalina
processor: 2.2 GHz 6-Core Intel Core i7
memory: 16 GB 2400 MHz DDR4
Update:
getting the below error while running the code on 32 vCPUs and 120 GB of memory:
MemoryError: Unable to allocate 14.1 GiB for an array with shape (1200, 1024, 1024, 3) and data type float32
You would have to provide some more info/details for an exact answer, but this is almost certainly a memory error. The size of the images on disk does not represent the size they occupy in memory; decoded arrays (plus the objects and pointers around them) always take considerably more space. As float64, 1500 images of shape (1000, 1000, 3) need roughly 1500 × 1000 × 1000 × 3 × 8 bytes ≈ 36 GB, so 16 GB of RAM is nowhere near enough to load 7 GB of compressed images. It's impossible to say exactly how much you would need, but from experience I would say you'd need to bump it up to 64 GB. If you are using Keras, I would suggest looking into the DirectoryIterator.
Edit:
As Cris Luengo pointed out, I missed the fact that you stated the size of the images.
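As a rough sketch of the streaming alternative mentioned above (the directory layout and batch size are assumptions, not from the question), Keras can yield batches from disk instead of building one giant array:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Holding all 1500 images as float64 would need about
# 1500 * 1000 * 1000 * 3 * 8 bytes ≈ 36 GB, far more than 16 GB of RAM.
# Stream small batches from disk instead ("images/" is a placeholder directory
# with one sub-folder per class, as flow_from_directory expects).
gen = ImageDataGenerator(rescale=1.0 / 255)
batches = gen.flow_from_directory("images/", target_size=(1000, 1000), batch_size=16)

for x_batch, y_batch in batches:
    break  # each x_batch has shape (16, 1000, 1000, 3) and fits easily in memory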

dask dataframe optimal partition size for 70GB data join operations

I have a dask dataframe of around 70 GB and 3 columns that does not fit into memory. My machine is an 8-core Xeon with 64 GB of RAM running a local Dask cluster.
I have to take each of the 3 columns and join them to another even larger dataframe.
The documentation recommends partition sizes of around 100 MB. However, given this amount of data, joining 700 partitions seems like a lot more work than, for example, joining 70 partitions of 1000 MB each.
Is there a reason to keep it at 700 x 100MB partitions?
If not which partition size should be used here?
Does this also depend on the number of workers I use?
1 x 50GB worker
2 x 25GB worker
3 x 17GB worker
Optimal partition size depends on many different things, including available RAM, the number of threads you're using, how large your dataset is, and in many cases the computation that you're doing.
For example, in the case of your join/merge it could be that your data is highly repetitive, so your 100 MB partitions may quickly expand 100x into 10 GB partitions and fill up memory. Or they might not; it depends on your data. On the other hand, join/merge operations produce on the order of n log(n) tasks, so reducing the number of tasks (and therefore increasing partition size) can be highly advantageous.
Determining optimal partition size is challenging. Generally the best we can do is to provide insight about what is going on. That is available here:
https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
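As a starting point for experimenting, a sketch of simply repartitioning to larger partitions before the join (the file paths and join key are placeholders, and partition_size needs a reasonably recent Dask):
import dask.dataframe as dd

left = dd.read_parquet("seventy_gb_table/")   # placeholder path for the 70 GB frame
right = dd.read_parquet("larger_table/")      # placeholder path for the larger frame

# Repartition to ~1 GB partitions to reduce the number of join tasks,
# then watch memory usage on the dashboard while the merge runs.
left = left.repartition(partition_size="1GB")
joined = left.merge(right, on="key", how="inner")  # "key" is a placeholder column
result = joined.persist()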

GTX 970 bandwidth calculation

I am trying to calculate the theoretical bandwidth of the GTX 970, as per the specs given at:
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
Memory clock: 7 Gbps
Memory bus width: 256 bits
Bandwidth = 7 * 256 * 2 / 8 (* 2 because it is DDR)
= 448 GB/s
However, in the specs it is given as 224 GB/s.
Why is there a factor-of-2 difference? Am I making a mistake? If so, please correct me.
Thanks
The 7 Gbps seems to be the effective clock, i.e. including the data rate. Also note that the field explanation for this Wikipedia list says that "All DDR/GDDR memories operate at half this frequency, except for GDDR5, which operates at one quarter of this frequency", which suggests that all GDDR5 chips are in fact quad data rate, despite the DDR abbreviation.
Finally, let me point out this note from Wikipedia, which disqualifies the trivial effective clock * bus width formula:
For accessing its memory, the GTX 970 stripes data across 7 of its 8 32-bit physical memory lanes, at 196 GB/s. The last 1/8 of its memory (0.5 GiB on a 4 GiB card) is accessed on a non-interleaved solitary 32-bit connection at 28 GB/s, one seventh the speed of the rest of the memory space. Because this smaller memory pool uses the same connection as the 7th lane to the larger main pool, it contends with accesses to the larger block reducing the effective memory bandwidth not adding to it as an independent connection could.
The clock rate reported is an "effective" clock rate and already takes into account the transfers on both the rising and falling edges; the trouble is applying the factor of 2 for DDR a second time.
Some discussion on devtalk here: https://devtalk.nvidia.com/default/topic/995384/theoretical-bandwidth-vs-effective-bandwidth/
In fact, your formula is correct, but the memory clock you used is wrong. The GeForce GTX 970's memory clock is 1753 MHz (see https://www.techpowerup.com/gpu-specs/geforce-gtx-970.c2620); the 7 Gbps figure is the effective data rate.
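Putting the quoted numbers together as a quick check:
base_clock_hz = 1753e6               # GTX 970 memory clock
effective_rate = base_clock_hz * 4   # GDDR5 is quad data rate -> ~7 Gbps effective
bus_width_bytes = 256 / 8            # 256-bit bus = 32 bytes per transfer
print(effective_rate * bus_width_bytes / 1e9)  # ≈ 224 GB/s, matching the spec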

Using CoreNLP ColumnDataClassifier for document classification with a large corpus

I'm trying to use the CoreNLP ColumnDataClassifier to classify a large number of documents. I have a little more than 1 million documents with about 20000 labels.
Is this even possible in terms of memory requirements? (I currently only have 16GB)
Is it somehow possible to train the classifier in an iterative way, splitting the input into many smaller files?
As an experiment I ran:
1.) 500,000 documents, each with 100 random words
2.) a label set of 10,000
This crashed with a memory error even when I gave it 40 GB of RAM.
I also ran:
1.) same 500,000 documents
2.) a label set of 6
This ran successfully to completion with 16 GB of RAM.
I'm not sure at what point growing the label set will cause a crash, but my advice would be to shrink the possible label set and experiment.