I'm trying to accelerate some numpy code with cupy, but I'm getting some unexpected results.
I'm running this on a Mac Pro Late 2013, OSX 10.13.6, using an NVIDIA GeForce GTX 1080 Ti.
I have been able to reproduce the problem in IPython, as shown below. When computing a norm, the product of a vector with its own conjugate should be a real number. In NumPy this is as expected, but with CuPy I end up with a small imaginary part.
In [54]: import numpy as np
In [55]: import cupy as cp
In [56]: q = np.arange(4)
In [57]: q.shape=[2,2]
In [58]: q=(0.23+0.33j)*(q+0.43)
In [59]: np.dot(np.conj(q).flatten(),q.flatten())
Out[59]: (3.21975528+0j)
In [60]: q_gpu = cp.asarray(q)
In [61]: cp.dot(cp.conj(q_gpu).flatten(),q_gpu.flatten())
Out[61]: array(3.21975528-1.93612215e-17j)
In [62]: cp.sum(cp.abs(q_gpu)**2)
Out[62]: array(3.21975528)
In [63]: sys.version
Out[63]: '3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56) \n[Clang 4.0.1 (tags/RELEASE_401/final)]'
In [64]: sys.version_info
Out[64]: sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
I have noticed other inconsistencies in precision between running code in CuPy and in NumPy.
What am I doing wrong?
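One way to judge whether the imaginary part is anything more than floating-point rounding is to compare it with machine epsilon relative to the real part. The snippet below is only a sketch using the value copied from Out[61]. It does not diagnose what CuPy does internally; a different accumulation order on the GPU is a plausible cause, but that is an assumption.

```python
import sys

# Value reported by cp.dot in the session above (copied from Out[61]).
z = complex(3.21975528, -1.93612215e-17)

eps = sys.float_info.epsilon  # spacing of float64 near 1.0, about 2.22e-16

# Size of the imaginary part relative to the real part.
relative_imag = abs(z.imag) / abs(z.real)

# If the ratio is below eps, the imaginary part is smaller than the
# resolution of the real part and can be treated as rounding noise.
is_rounding_noise = relative_imag < eps
print(relative_imag, is_rounding_noise)
```

If the norm has to be exactly real, taking the real part of the result or using the cp.sum(cp.abs(q_gpu)**2) form from In [62] avoids the issue.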
I am trying to find a TensorFlow equivalent of np.quantile(). I have found tfp.stats.quantiles() (tfp stands for TensorFlow Probability). However, its interface is a bit different from that of np.quantile().
Consider the following example:
import tensorflow_probability as tfp
import tensorflow as tf
import numpy as np
inputs = tf.random.normal((1, 4096, 4))
print("NumPy")
print(np.quantile(inputs.numpy(), q=0.9, axis=1, keepdims=False))
I am not sure from the TFP docs how I could write the above using tfp.stats.quantiles(). I tried checking the source code of both functions, but it didn't help.
Let me try to be more helpful here than I was on GitHub.
There is a difference in behavior between np.quantile and tfp.stats.quantiles. The key difference here is that numpy.quantile will
Compute the q-th quantile of the data along the specified axis.
where q is the
Quantile or sequence of quantiles to compute, which must be between 0 and 1 inclusive.
and tfp.stats.quantiles
Given a vector x of samples, this function estimates the cut points by returning num_quantiles + 1 cut points
So you need to tell tfp.stats.quantiles how many quantiles you want and then select the q-th one from the result. If it isn't clear from the API how to do this, the source for tfp.stats.quantiles (v0.19.0) shows how to get a return structure similar to NumPy's.
For completeness, setting up a virtual environment with
$ cat requirements.txt
numpy==1.24.2
tensorflow==2.11.0
tensorflow-probability==0.19.0
allows us to run
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
inputs = tf.random.normal((1, 4096, 4), dtype=tf.float64)
q = 0.9
numpy_quantiles = np.quantile(inputs.numpy(), q=q, axis=1, keepdims=False)
tfp_quantiles = tfp.stats.quantiles(
inputs, num_quantiles=100, axis=1, interpolation="linear"
)[int(q * 100)]
assert np.allclose(numpy_quantiles, tfp_quantiles.numpy())
print(f"{numpy_quantiles=}")
# numpy_quantiles=array([[1.31727661, 1.2699167 , 1.28735237, 1.27137588]])
print(f"{tfp_quantiles=}")
# tfp_quantiles=<tf.Tensor: shape=(1, 4), dtype=float64, numpy=array([[1.31727661, 1.2699167 , 1.28735237, 1.27137588]])>
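A small caveat about the int(q * 100) indexing above: it works for q = 0.9, but for some values of q the floating-point product lands just below the intended integer and int() truncates it. The helper below is a hypothetical sketch (cut_point_index is not part of TFP) that rounds instead and rejects a q that does not fall on one of the num_quantiles + 1 cut points.

```python
def cut_point_index(q, num_quantiles):
    """Index of quantile q among num_quantiles + 1 evenly spaced cut points.

    Hypothetical helper for illustration; not part of TensorFlow Probability.
    """
    position = q * num_quantiles
    index = int(round(position))  # round instead of truncating
    if abs(position - index) > 1e-9:
        raise ValueError("q does not fall on one of the cut points")
    return index


# Truncation picks the wrong cut point for q = 0.29:
truncated = int(0.29 * 100)
rounded = cut_point_index(0.29, 100)
print(truncated, rounded)
```

With this helper the selection becomes tfp.stats.quantiles(...)[cut_point_index(q, 100)].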
You could also use tfp.stats.percentile(inputs, 90., axis=1, keepdims=False). The only difference from np.quantile is that q is given as 90. rather than 0.9.
I am working on Ubuntu 18.04.
When I use plain Python, I have no problem producing my graphs, tables and plots.
When I switch to IPython, instead of the expected output I get
<Figure size 432x288 with 1 Axes>
This is the script I am using, from Dr. Hilpisch's O'Reilly book Python for Finance:
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import matplotlib as mpl
In [2]: mpl.__version__
Out[2]: '3.3.2'
In [3]: import matplotlib.pyplot as plt
In [4]: plt.style.use('seaborn')
In [5]: mpl.rcParams['font.family'] = 'serif'
In [6]: %matplotlib inline
In [7]: import numpy as np
In [8]: np.random.seed(1000)
In [9]: y = np.random.standard_normal(20)
In [10]: x = np.arange(len(y))
In [11]: plt.plot(x, y);
<Figure size 432x288 with 1 Axes>
Thank you for your help.
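For what it's worth, that line looks like the plain-text representation of the figure, which is what a terminal IPython session can show when there is no front end able to render inline images. As a sketch of one workaround (assuming the goal is just to see the plot, and that writing a PNG file is acceptable), the figure can be saved explicitly. The output file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption: no GUI window is needed
import matplotlib.pyplot as plt
import numpy as np

# Same data as in the session above.
np.random.seed(1000)
y = np.random.standard_normal(20)
x = np.arange(len(y))

fig, ax = plt.subplots()
ax.plot(x, y)

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "standard_normal_plot.png")
fig.savefig(out_path)
plt.close(fig)
print(out_path)
```

In a terminal session with a GUI toolkit installed, leaving out %matplotlib inline and calling plt.show() is the other usual route.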
My code involves slicing into 432x432x400 arrays a total of ~10 million times to generate batches of data for neural network training. As these are fairly large arrays (about 75 million data points, roughly 300 MB as float32), I was hoping to speed this up using CuPy (and maybe even speed training up by generating data on the same GPU as training), but found it actually made the code about 5x slower.
Is this expected behaviour due to CuPy overheads or am I missing something?
Code to reproduce:
import cupy as cp
import numpy as np
import timeit
cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
np_arr = np.zeros((432, 432, 400), dtype=np.float32)
# numbers below are representative of my code
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120]'
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120]'
timeit.timeit(cp_code, number=8192*4, globals=globals()) # prints 0.122
timeit.timeit(np_code, number=8192*4, globals=globals()) # prints 0.027
Setup:
GPU: NVIDIA Quadro P4000
CuPy Version: 7.3.0
OS: CentOS Linux 7
CUDA Version: 10.1
cuDNN Version: 7.6.5
I can confirm that slicing is about 5x slower in CuPy, although there are more precise ways to measure the time (see e.g. https://github.com/cupy/cupy/pull/2740).
The size of the array does not matter because slice operations do not copy the data but create views. The result with the following is similar.
cp_arr = cp.zeros((4, 4, 4), dtype=cp.float32)
cp_code = 'arr2 = cp_arr[1:3, 1:3, 1:3]'
It is natural that slicing first and then sending the slice to the GPU is faster, because that reduces the number of bytes to transfer. Consider doing so if slicing is the first preprocessing step.
Slicing in NumPy and CuPy does not copy the data anywhere. It simply returns a new array that shares the same data, with its pointer offset to the first element of the slice and an adjusted shape. Note below how the original array and the slice have the same strides:
In [1]: import cupy as cp
In [2]: a = cp.zeros((432, 432, 400), dtype=cp.float32)
In [3]: b = a[100:120, 100:120, 100:120]
In [4]: a.strides
Out[4]: (691200, 1600, 4)
In [5]: b.strides
Out[5]: (691200, 1600, 4)
The same can be verified by replacing CuPy with NumPy.
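As a sketch of that NumPy check (with a much smaller array than in the question, purely to keep it light), the following confirms that a basic slice is a view sharing memory with the original, while .copy() allocates a new contiguous array:

```python
import numpy as np

a = np.zeros((8, 6, 4), dtype=np.float32)
b = a[2:5, 1:4, 1:3]  # basic slicing returns a view, not a copy

same_strides = a.strides == b.strides  # the view keeps the parent's strides
shares = np.shares_memory(a, b)        # both arrays point into the same buffer
is_view = b.base is a                  # the view keeps a reference to its parent

# Writing through the view is visible in the original array.
b[0, 0, 0] = 1.0
changed = a[2, 1, 1] == 1.0

# An explicit copy moves the data into a new, contiguous buffer.
c = b.copy()
copy_is_contiguous = c.flags["C_CONTIGUOUS"]
copy_shares = np.shares_memory(a, c)

print(same_strides, shares, is_view, changed, copy_is_contiguous, copy_shares)
```

This is why timing the slice alone mostly measures the overhead of constructing the array object, not any memory traffic.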
If you want to time the actual slicing operation, the most reliable way of doing this would be to add a .copy() to each operation, thus enforcing the memory accessing/copying:
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120].copy()' # 0.771 seconds
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120].copy()' # 0.154 seconds
Unfortunately, for the case above the memory pattern is bad for GPUs as the small chunks won't be able to saturate memory channels, thus it's still slower than NumPy. However, CuPy can be much faster if the chunks are able to get close to memory channel saturation, for example:
cp_code = 'arr2 = cp_arr[:, 100:120, 100:120].copy()' # 0.786 seconds
np_code = 'arr2 = np_arr[:, 100:120, 100:120].copy()' # 2.911 seconds
I want to use Theano with a GPU, and I use the following script to test whether the GPU is working:
import os
os.environ['THEANO_FLAGS'] = "device=gpu0"
import theano
from theano import function, config, shared, tensor
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
but I get the following result:
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
/usr/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:556: UserWarning: Theano flag device=gpu* (old gpu back-end) only support floatX=float32. You have floatX=float64. Use the new gpu back-end with device=cuda* for that value of floatX.
warnings.warn(msg)
Using gpu device 0: GeForce GT 720 (CNMeM is enabled with initial size: 50.0% of memory, cuDNN 6021)
/usr/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:631: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.1.
warnings.warn(warn)
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 3.424644 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
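My question:
What does this result mean, and how do I make the script use the GPU?
I had a similar issue with a newer version of Theano. The warning in your output says the old GPU back end only supports floatX=float32 and that you have floatX=float64, so try setting it explicitly: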
THEANO_FLAGS="floatX=float32,device=gpu,nvcc.flags=-D_FORCE_INLINES" python test_gpu.py
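Since the test script in the question already sets THEANO_FLAGS through os.environ, the same flags could be applied there instead of on the command line. This is only a sketch: the flag values are the ones from the command above, and the variable has to be set before theano is imported.

```python
import os

# Flags from the command line above, as (name, value) pairs.
flags = [
    ("floatX", "float32"),
    ("device", "gpu"),
    ("nvcc.flags", "-D_FORCE_INLINES"),
]

# THEANO_FLAGS is a comma-separated list of name=value entries.
os.environ["THEANO_FLAGS"] = ",".join("%s=%s" % (name, value) for name, value in flags)
print(os.environ["THEANO_FLAGS"])

# `import theano` would follow here, after the variable is set.
```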
import numpy as np
import math
print -1/2*np.log2(1/2)-1/2*np.log2(1/2)
This prints nan.
Can you explain why?
The Python version matters as well.
In the session below, the first result is from Python 2.7 and the second from Python 3.5:
>>> import numpy as np
>>> print(-1/2*np.log2(1/2)-1/2*np.log2(1/2))
nan
>>> print(-1/2*np.log2(1/2)-1/2*np.log2(1/2))
1.0
More information requested...
>>> import numpy as np
>>> print(-1/2*np.log2(1/2)-1/2*np.log2(1/2))
__main__:1: RuntimeWarning: divide by zero encountered in log2
__main__:1: RuntimeWarning: invalid value encountered in double_scalars
nan
This can be avoided by making your terms floats. The easiest way is to do it directly:
>>> import numpy as np
>>> print(-1/2.*np.log2(1/2.)-1/2.*np.log2(1/2.))
1.0
The NumPy version is the same; only Python has changed between 2.7 and 3.5.
In Python 2.x, division between ints is floor division, so 1/2 is 0 (and -1/2 is -1). np.log2(0) returns -inf, and 0 * -inf is nan, so the whole expression evaluates to nan.
Using python 3:
Python 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
>>> 1/2
0.5
whereas in python 2:
Python 2.7.12 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
>>> 1/2
0
>>> 1./2
0.5
>>> from __future__ import division
>>> 1/2
0.5
I have included two ways to get true division in Python 2: using a float (1. instead of 1) or importing division from __future__.
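To see where the nan comes from term by term, the Python 2 behaviour can be reproduced in Python 3 with the floor-division operator //. This is a sketch in plain Python: float("-inf") stands in for what np.log2(0) returns, so that NumPy is not needed.

```python
import math

neg_inf = float("-inf")  # stand-in for np.log2(0), which returns -inf with a warning

# Python 2 semantics: int / int floors, so -1/2 == -1 and 1/2 == 0.
first = (-1 // 2) * neg_inf   # -1 * -inf -> inf
second = (1 // 2) * neg_inf   #  0 * -inf -> nan
py2_result = first - second   # inf - nan -> nan

# Python 3 semantics: / is true division, so 1/2 == 0.5.
py3_result = -1 / 2 * math.log2(1 / 2) - 1 / 2 * math.log2(1 / 2)

print(py2_result, py3_result)
```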