I wrote a Theano machine learning program. But I got two absolutely different results between on CPU and on GPU.
Below is the log.(only a tiny part of log)
result on GPU
result on CPU
The loss function will quickly decrease and then converg to 0.2 on CPU.
However, the loss function will increase and finally become NaN on GPU.
What mistakes may be in my program? Or what should I take into attention? Thank you!
Could it be that CPU is using float64 (double precision) and the GPU is using float32 (single precision)? Here you can look up the configuration flags: http://deeplearning.net/software/theano/library/config.html
Check out this thread...
https://github.com/fchollet/keras/issues/511
Basically adding;
optimizer_excluding=cudnn
in the .theanorc solved it for me, but solution runs a slower.
Related
I have an OpenCL code that multiplies 2 matrices (GEMM) with M=4096, N=4096 and K=16. (i.e. matrices 4096 x 16 floats)
I run it on Polaris 560, 16CU GPU.
Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl
I noticed very strange performance drops for this size, matrix multiplication with this size has ~8-10 GFlops performance while if I change N to 4095 or 4097 I'm getting around 130-150Gflops. I notices similar behaviour with other GEMM libraries like clblas or miopengemm - I'm getting significant performance drop for this particular size of 4096x16 and changing N by 1 boosts the performance several times.
The workload is split into work-groups of 256 threads. Each work-group handles 128x16 and 128x16 matrix tiles (8x8 block per threads).
I tried changing matrix tiling to 96x96 with 6x6 blocks instead of 128x128 with 8x8 - same result.
I tested same code with ROCm 3.7 OpenCL, Clover OpenCL and even with Windows OpenCL driver - same behavior.
There is no such issue with nvidia gtx 960 having same number of gpu cores (threads) and same memory type/size.
I suspect that this is somehow cache/collision related but I don't understand how it happens. Thus I don't know how to work-around it.
Finally I found that clBlas library (developed for AMD originally) handles special case of lda % 1024==0, ldb % 1024==0 probably due to cache
https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/specialCases/GemmSpecialCases.cpp#L228
I found that the better way was to rearrange blocks in z-curve order instead of queuing several kernels.
https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl#L109
To handle cases M!=N or M != 1<<n I just increased number of work groups on M/N to neares 1<<n and groups that don't have jobs exit in the begging not adding too much overhead.
z-order improved performance x4 times.
I am using dask (2021.3.0) and rapids(0.18) in my project. In this, I am performing preprocessing task on the CPU, and later the preprocessed data is transferred to GPU for K-means clustering. But in this process, I am getting the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(before using GPU memory completely it gave the error i.e. it is not using GPU memory completely)
I have a single GPU of size 40 GB.
Ram size 512 GB.
I am using following snippet of code:
cluster=LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
##perform my preprocessing on data and get output on variable A
# convert A varible to cupy
x = A.map_blocks(cp.asarray)
km =KMeans(n_clusters=4)
predict=km.fit_predict(x).compute()
I am also looking for a solution so that the data larger than GPU memory can be preprocessed, and whenever there is a spill in GPU memory the spilled data is transferred into temp directory or CPU (as we do with dask where we define temp directory when there is a spill in RAM).
Any help will be appriciated.
There are several ways to run larger than GPU datasets.
Check out Nick Becker's blog, which has a few methods well documented
Check out BlazingSQL, which is built on top of RAPIDS and can perform out of core processings. You can try it at beta.blazingsql.com.
I want to implement an algorithm which allows a parallel implementation in TensorFlow. My question is what the arguments parallel_iterations, swap_memory and maximum_iterations actually do and which are their appropriate values according the situation. Specifically, in the documentation on TensorFlow's site https://www.tensorflow.org/api_docs/python/tf/while_loop says that parallel_iterations are the number of iterations allowed to run in parallel. Is this number the number of threads? When someone should allow CPU-GPU swap memory and for what reason? What are the advantages and disadvantages from this choice? What is the purpose of maximum_iterations? Can it be combined with parallel_iterations?
swap_memory is used when you want to have extra memory on the GPU device. Usually when you are training a model some activations are saved in the GPU mem. for later use. With swap_memory, you can store those activations in the CPU memory and use the GPU mem. to fit e.g. larger batch sizes. And this is an advantage. You would choose this if you need big batch_sizes or have long sequences and want to avoid OOM exceptions. Disadvantage is computation time since you need to transfer the data from CPU mem. to GPU mem.
The maximum iterations is smth. like this:
while num_iter < 100 and <some condition>:
do something
num_iter += 1
So it is useful when you check a condition, but also want to have an upper bound (one example is to check if your model converges. If it doesn't you still want to end after k iterations.)
As for parallel_iterations I am not sure, but it sounds like multiple threads, yes. You can try and see the effect in a sample script.
Tensorflow tends to preallocate the entire available memory on it's GPUs. For debugging, is there a way of telling how much of that memory is actually in use?
(1) There is some limited support with Timeline for logging memory allocations. Here is an example for its usage:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary, _ = sess.run([merged, train_step],
feed_dict=feed_dict(True),
options=run_options,
run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
train_writer.add_summary(summary, i)
print('Adding run metadata for', i)
tl = timeline.Timeline(run_metadata.step_stats)
print(tl.generate_chrome_trace_format(show_memory=True))
trace_file = tf.gfile.Open(name='timeline', mode='w')
trace_file.write(tl.generate_chrome_trace_format(show_memory=True))
You can give this code a try with the MNIST example (mnist with summaries)
This will generate a tracing file named timeline, which you can open with chrome://tracing. Note that this only gives an approximated GPU memory usage statistics. It basically simulated a GPU execution, but doesn't have access to the full graph metadata. It also can't know how many variables have been assigned to the GPU.
(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.
nvprof can show the on-chip shared memory usage and register usage at the CUDA kernel level, but doesn't show the global/device memory usage.
Here is an example command: nvprof --print-gpu-trace matrixMul
And more details here:
http://docs.nvidia.com/cuda/profiler-users-guide/#abstract
Here's a practical solution that worked well for me:
Disable GPU memory pre-allocation using TF session configuration:
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.
Step through your code with the debugger until you see the unexpected GPU memory consumption.
There's some code in tensorflow.contrib.memory_stats that will help with this:
from tensorflow.contrib.memory_stats.python.ops.memory_stats_ops import BytesInUse
with tf.device('/device:GPU:0'): # Replace with device you are interested in
bytes_in_use = BytesInUse()
with tf.Session() as sess:
print(sess.run(bytes_in_use))
The TensorFlow profiler has improved memory timeline that is based on real gpu memory allocator information
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/profiler#visualize-time-and-memory
tf.config.experimental.get_memory_info('GPU:0')
Currently returns the following keys:
'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.
as #V.M previously mentioned, a solution that works well is using: tf.config.experimental.get_memory_info('DEVICE_NAME')
This function returns a dictionary with two keys:
'current': The current memory used by the device, in bytes
'peak': The peak memory used by the device across the run of the program, in bytes.
The value of these keys is the ACTUAL memory used not the allocated one that is returned by nvidia-smi.
In reality, for GPUs, TensorFlow will allocate all the memory by default rendering using nvidia-smi to check for the used memory in your code useless. Even if, tf.config.experimental.set_memory_growth is set to true, Tensorflow will no more allocate the whole available memory but is going to remain in allocating more memory than the one is used and in a discrete manner, i.e. allocates 4589MiB then 8717MiB then 16943MiB then 30651 MiB, etc.
A small note concerning the get_memory_info() is that it doesn't return correct values if used in a tf.function() decorated function. Thus, the peak key shall be used after executing tf.function() decorated function to determine the peak memory used.
For older versions of Tensorflow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function and only returned the used memory (no option for determining the peak memory).
Final note, you can also consider the Tensorflow Profiler available with Tensorboard as #Peter Mentioned.
Hope this helps :)
Do we have a GPU accelerated of version of numpy.max(X, axis=None) in Theano.
I looked into the documentation and found theano.tensor.max(X, axis=None), but it is 4-5 times slower than the numpy implementation.
I can assure you, it is not slow because of some bad choice of matrix size. Same matrix under theano.tensor.exp is 40 times faster than its numpy counterpart.
Any suggestions?
The previous answer is partial. The suggestion should not work, as the work around is the one used in the final compiled code. There is optimization that will do this transformation automatically.
The title of the question isn't the same as the content. They differ by the axis argument. I'll answer both questions.
If the axis is 0 or None we support this on the GPU for that operation for matrix. If the axis is None, we have a basic implementation that isn't well optimized as it is harder to parallelize. If the axis is 0, we have a basic implementation, but it is faster as it is easier to parallelize.
Also, how did you do your timing? If you just make one function with only that operation and test it via the device=gpu flags to do your comparison, this will include the transfer time between CPU and GPU. This is a memory bound operation, so if you include the transfer in your timming, personnaly I don't expect any speed op for that case. To see only the GPU operation, use Theano profiler: run with the Theano flag profile=True.
The max and exp operations are fundamentally different; exp (and other operations like addition, sin, etc.) is an elementwise operation that is embarrassingly parallelizable, while max requires a parallel-processing scan algorithm that basically builds up a tree of pairwise comparisons over an array. It's not impossible to speed up max, but it's not as easy as exp.
Anyway, the theano implementation of max basically consists of the following lines (in theano/tensor/basic.py):
try:
out = max_and_argmax(x, axis)[0]
except Exception:
out = CAReduce(scal.maximum, axis)(x)
where max_and_argmax is a bunch of custom code that, to my eye, implements a max+argmax operation using numpy, and CAReduce is a generic GPU-accelerated scan operation used as a fallback (which, according to the comments, doesn't support grad etc.). You could try using the fallback directly and see whether that is faster, maybe something like this:
from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum
def mymax(X, axis=None):
CAReduce(maximum, axis)(X)