Cupy slower than numpy when doing a "for loop" for columns of an array as vectors

I'm trying to parallelize the following operation with cupy:
I have an array. For each column of that array, I'm generating 2 random vectors. I take that array column, add one of the vectors, subtract the other, and make that new vector the next column of the array. I continue on until I finish with the array.
I already asked the following question - Cupy slower than numpy when iterating through array. But this is different, in that I believe I followed the advice of parallelizing the operation and having one "for loop" instead of two, and iterating only through the array columns instead of both rows and columns.
import cupy as cp
import time
#import numpy as cp

def row_size(array):
    return array.shape[1]

def number_of_rows(array):
    return array.shape[0]

x = cp.zeros((200, 200), 'f')
#x = cp.zeros((200,200))
x[:, 1] = 500000
vector_one = x * 0
vector_two = x * 0

start = time.time()
for i in range(number_of_rows(x) - 1):
    if sum(x[:, i]) != 0:
        vector_one[:, i + 1], vector_two[:, i + 1] = cp.random.poisson(.01 * x[:, i], len(x[:, i])), cp.random.poisson(.01 * x[:, i], len(x[:, i]))
        x[:, i + 1] = x[:, i] + vector_one[:, i + 1] - vector_two[:, i + 1]
time = time.time() - start

print(x)
print(time)
When I run this in cupy, the time comes out to about .62 seconds.
When I switch to NumPy, i.e. 1) uncomment #import numpy as cp and #x = cp.zeros((200,200)), and 2) comment out import cupy as cp and x = cp.zeros((200,200), 'f'), the time comes out to about .11 seconds.
I thought that if I increased the array size, for example from (200, 200) to (2000, 2000), I'd see CuPy come out faster, but it's still slower.
I know this is working properly, in a sense, because if I change the coefficient in cp.random.poisson from .01 to .5, only CuPy can handle it; that lambda is too large for NumPy.
But still, how do I make it actually faster with cupy?

In general, looping on the host (CPU) and iteratively processing small device (GPU) arrays isn't ideal, because it launches many more separate kernels than a columnar (whole-array) approach would. However, sometimes a columnar approach just isn't feasible.
You can speed up your CuPy code by using CuPy's sum instead of Python's built-in sum, which forces a device-to-host transfer each time you call it. With that said, you can also speed up your NumPy code by switching to NumPy's sum.
import cupy as cp
import time
#import numpy as cp

def row_size(array):
    return array.shape[1]

def number_of_rows(array):
    return array.shape[0]

x = cp.zeros((200, 200), 'f')
#x = cp.zeros((200,200))
x[:, 1] = 500000
vector_one = x * 0
vector_two = x * 0

start = time.time()
for i in range(number_of_rows(x) - 1):
    # if sum(x[:, i]) != 0:
    if x[:, i].sum() != 0:  # or you could do: if x[:, i].sum().get() != 0:
        vector_one[:, i + 1], vector_two[:, i + 1] = cp.random.poisson(.01 * x[:, i], len(x[:, i])), cp.random.poisson(.01 * x[:, i], len(x[:, i]))
        x[:, i + 1] = x[:, i] + vector_one[:, i + 1] - vector_two[:, i + 1]
cp.cuda.Device().synchronize()  # CuPy is asynchronous, but this doesn't really affect the timing here.
t = time.time() - start

print(x)
print(t)
[[ 0. 500000. 500101. ... 498121. 497922. 497740.]
[ 0. 500000. 499894. ... 502050. 502174. 502112.]
[ 0. 500000. 499989. ... 501703. 501836. 502081.]
...
[ 0. 500000. 499804. ... 499600. 499526. 499371.]
[ 0. 500000. 499923. ... 500371. 500184. 500247.]
[ 0. 500000. 500007. ... 501172. 501113. 501254.]]
0.06389498710632324
This small change should make your workflow much faster (0.06 vs 0.6 seconds originally on my T4 GPU). Note that the .get() method in the comment is used to explicitly transfer the result of the sum operation from the GPU to the CPU before the not equal comparison. This isn't necessary, as CuPy knows how to handle logical operations, but would give you a very tiny additional speedup.
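As a small illustration of that last point (just a sketch, assuming a working CuPy install): a reduction such as .sum() on a CuPy array returns a 0-d array that stays on the GPU, and .get() explicitly copies that single value back to the host.
import cupy as cp

col = cp.arange(200, dtype='f')
s = col.sum()           # 0-d cupy array, still on the device
print(type(s))          # a cupy array
print(type(s.get()))    # a numpy (host) array
if s != 0:              # works without .get(); CuPy converts the 0-d comparison result implicitly
    print('non-zero column')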

Related

Is nx.eigenvector_centrality_numpy() using the Arnoldi iteration instead of the basic power method?

Since nx.eigenvector_centrality_numpy() uses ARPACK, does that mean nx.eigenvector_centrality_numpy() uses the Arnoldi iteration instead of the basic power method?
When I try to compute the centrality manually using the basic power method, my result is different from the result of nx.eigenvector_centrality_numpy(). Can someone explain this to me?
To make it clearer, here is my code, the result I got from the function, and the result when I compute it manually.
import networkx as nx
G = nx.DiGraph()
G.add_edge('a', 'b', weight=4)
G.add_edge('b', 'a', weight=2)
G.add_edge('b', 'c', weight=2)
G.add_edge('b','d', weight=2)
G.add_edge('c','b', weight=2)
G.add_edge('d','b', weight=2)
centrality = nx.eigenvector_centrality_numpy(G, weight='weight')
centrality
The result:
{'a': 0.37796447300922725,
'b': 0.7559289460184545,
'c': 0.3779644730092272,
'd': 0.3779644730092272}
Below is code from Power Method Python Program, which I modified a little bit:
# Power Method to Find Largest Eigen Value and Eigen Vector
# Importing NumPy Library
import numpy as np
import sys

# Reading order of matrix
n = int(input('Enter order of matrix: '))

# Making numpy array of n x n size and initializing
# to zero for storing matrix
a = np.zeros((n, n))

# Reading matrix
print('Enter Matrix Coefficients:')
for i in range(n):
    for j in range(n):
        a[i][j] = float(input('a[' + str(i) + '][' + str(j) + ']='))

# Making numpy array n x 1 size and initializing to zero
# for storing initial guess vector
x = np.zeros((n))

# Reading initial guess vector
print('Enter initial guess vector: ')
for i in range(n):
    x[i] = float(input('x[' + str(i) + ']='))

# Reading tolerable error
tolerable_error = float(input('Enter tolerable error: '))

# Reading maximum number of steps
max_iteration = int(input('Enter maximum number of steps: '))

# Power Method Implementation
lambda_old = 1.0
condition = True
step = 1
while condition:
    # Multiplying a and x
    ax = np.matmul(a, x)
    # Finding new Eigen value and Eigen vector
    x = ax / np.linalg.norm(ax)
    lambda_new = np.vdot(ax, x)
    # Displaying Eigen value and Eigen Vector
    print('\nSTEP %d' % (step))
    print('----------')
    print('Eigen Value = %0.5f' % (lambda_new))
    print('Eigen Vector: ')
    for i in range(n):
        print('%0.5f\t' % (x[i]))
    # Checking maximum iteration
    step = step + 1
    if step > max_iteration:
        print('Not convergent in given maximum iteration!')
        break
    # Calculating error
    error = abs(lambda_new - lambda_old)
    print('error=' + str(error))
    lambda_old = lambda_new
    condition = error > tolerable_error
I used the same matrix and the result:
STEP 99
----------
Eigen Value = 3.70328
Eigen Vector:
0.51640
0.77460
0.25820
0.25820
error=0.6172133998483682
STEP 100
----------
Eigen Value = 4.32049
Eigen Vector:
0.71714
0.47809
0.35857
0.35857
Not convergent in given maximum iteration!
I've tried to compute it with my calculator too, and I know it's not convergent because |lambda1| = |lambda2| = 4. I need to understand the theory behind nx.eigenvector_centrality_numpy() properly so I can describe it correctly in my thesis. Help me, please.
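For what it's worth, a quick NumPy check (just a sketch using the same graph as above) confirms that the weighted adjacency matrix has two eigenvalues of equal magnitude, which is exactly the situation in which the basic power method fails to converge:
import networkx as nx
import numpy as np

G = nx.DiGraph()
G.add_edge('a', 'b', weight=4)
G.add_edge('b', 'a', weight=2)
G.add_edge('b', 'c', weight=2)
G.add_edge('b', 'd', weight=2)
G.add_edge('c', 'b', weight=2)
G.add_edge('d', 'b', weight=2)

A = nx.to_numpy_array(G, weight='weight')  # adjacency matrix in node order a, b, c, d
print(np.round(np.linalg.eigvals(A), 6))   # contains 4, -4, 0, 0 (order may vary)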

calculating the covariance matrix fast in python with some minor customizing

I have a pandas DataFrame and I'm trying to find the covariance of the percentage change of each column. For each pair, I want rows with missing values to be dropped, and the percentage change to be calculated afterwards. That is, I want something like this:
import pandas as pd
import numpy as np

# create dataframe example
N_ROWS, N_COLS = 249, 3535
df = pd.DataFrame(np.random.random((N_ROWS, N_COLS)))
df.iloc[np.random.choice(N_ROWS, N_COLS), np.random.choice(10, 50)] = np.nan

cov_df = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
        cov_df.loc[col_i, col_j] = cov.iloc[0, 1]
The thing is, this is super slow. The code below gives me results that are similar to (but not exactly) what I want, and it runs quite fast:
df.dropna(how='any', axis=0).pct_change().cov()
I am not sure why the second one runs so much faster. I want to speed up the first version, but I can't figure out how.
I have tried using combinations from itertools to avoid repeating the calculation for (col_i, col_j) and (col_j, col_i), and using map from multiprocessing to do the computations in parallel, but it still hasn't finished running after 90+ minutes.
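To see concretely why the fast one-liner disagrees with the pairwise loop, here is a toy sketch (hypothetical data): the global dropna removes a row for every pair as soon as any column contains a NaN, while the pairwise version only drops rows where one of the two columns involved is NaN, so pct_change() sees different inputs.
import pandas as pd
import numpy as np

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [1.0, np.nan, 2.0, 3.0],
                    'c': [5.0, 6.0, 7.0, 8.0]})

pairwise_ac = toy[['a', 'c']].dropna(how='any', axis=0).pct_change().cov().iloc[0, 1]
global_ac = toy.dropna(how='any', axis=0).pct_change().cov().loc['a', 'c']
print(pairwise_ac, global_ac)  # different values: the global dropna removed row 1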
Somehow this works fast enough, although I am not sure why:
from scipy.stats import pearsonr

x = df.values  # assuming x holds the DataFrame's values as a NumPy array

corr = np.zeros((x.shape[1], x.shape[1]))
for i in range(x.shape[1]):
    for j in range(i + 1, x.shape[1]):
        y = x[:, [i, j]]
        y = y[~np.isnan(y).any(axis=1)]
        y = np.diff(y, axis=0) / y[:-1, :]
        if len(y) < 2:
            corr[i, j] = np.nan
            continue
        y = pearsonr(y[:, 0], y[:, 1])[0]
        corr[i, j] = y
corr = corr + corr.T
np.fill_diagonal(corr, 1)
This finishes within 8 minutes, which is fast enough for my use case.
On the other hand, the following has been running for 30 minutes and still isn't done:
corr = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        corr_ij = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().corr().iloc[0, 1]
        corr.loc[col_i, col_j] = corr_ij
t1 = time.time()
I don't know why this is, but anyway the first approach is a good enough solution for me for now.

How to create index combinations (k out of n) as sparse bitmasks for numpy

For NumPy, how can I efficiently create
an array/matrix representing a list of all combinations (k out of n) as lists of k indices? The shape would be (binomial(n, k), k).
a sparse array/matrix representing these combinations as bitmasks of length n (i.e., expanding the above indices to bitmasks)? The shape would be (binomial(n, k), n).
I need to do this with large n (and maybe small k). So the algorithm should be
time efficient (e.g. maybe allocate complete result space at once before filling it?)
space efficient (e.g. sparse bitmasks)
Many Thanks for your help.
Assuming the blowup is not that bad (as mentioned in the comments above), you might try this. It's pretty vectorized and should be fast (for cases that can be handled at all).
Edit: I somewhat assumed you are interested in output based on scipy.sparse. Maybe you are not.
Code
import itertools
import numpy as np
import scipy.sparse as sp

def combs(a, r):
    """
    Return successive r-length combinations of elements in the array a.
    Should produce the same output as array(list(combinations(a, r))), but
    faster.
    """
    a = np.asarray(a)
    dt = np.dtype([('', a.dtype)] * r)
    b = np.fromiter(itertools.combinations(a, r), dt)
    b_ = b.view(a.dtype).reshape(-1, r)
    return b_

def sparse_combs(k, n):
    combs_ = combs(np.arange(n), k)
    n_bin = combs_.shape[0]
    spmat = sp.coo_matrix((np.ones(n_bin * k),
                           (np.repeat(np.arange(n_bin), k),
                            combs_.ravel())),
                          shape=(n_bin, n))
    return spmat

print('dense')
print(combs(range(4), 3))
print('sparse (dense for print)')
print(sparse_combs(3, 4).todense())
Output
dense
[[0 1 2]
[0 1 3]
[0 2 3]
[1 2 3]]
sparse (dense for print)
[[ 1. 1. 1. 0.]
[ 1. 1. 0. 1.]
[ 1. 0. 1. 1.]
[ 0. 1. 1. 1.]]
The helper function combs I took (probably) from this question (sometime in the past).
Small (unscientific) timing:
from time import perf_counter as pc

start = pc()
spmat = sparse_combs(5, 50)
time_used = pc() - start
print('secs: ', time_used)
print('nnzs: ', spmat.nnz)
#secs:  0.5770790778094155
#nnzs:  10593800

# and for sparse_combs(3, 500):
#secs:  3.4843752405405497
#nnzs:  62125500
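Since the blowup is the main limitation, it can be worth estimating the output size before calling sparse_combs at all. A small sketch (the check_size helper below is my own addition, not part of the code above):
from scipy.special import comb

def check_size(k, n, max_nnz=1e8):
    n_rows = comb(n, k, exact=True)   # binomial(n, k) rows
    nnz = n_rows * k                  # one stored entry per chosen index
    print('rows:', n_rows, ' nnz:', nnz)
    return nnz <= max_nnz

print(check_size(5, 50))    # rows: 2118760, nnz: 10593800 -> True
print(check_size(25, 50))   # rows ~ 1.26e14 -> hopelessly large, returns False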

Bernoulli random number generator

I cannot understand how the Bernoulli random number generator used in NumPy works, and would like some explanation of it. For example:
np.random.binomial(size=3, n=1, p= 0.5)
Results:
[1 0 0]
n = number of trials
p = probability of occurrence
size = number of experiments
How are the generated results of "0" or "1" determined?
=================================Update==================================
I created a Restricted Boltzmann Machine which always produces the same results across multiple executions despite being "random". The random number generator is seeded using
np.random.seed(10)
import numpy as np
np.random.seed(10)

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

def gibbs_vhv(W, hbias, vbias, x):
    f_s = sigmoid(np.dot(x, W) + hbias)
    h_sample = np.random.binomial(size=f_s.shape, n=1, p=f_s)
    f_u = sigmoid(np.dot(h_sample, W.transpose()) + vbias)
    v_sample = np.random.binomial(size=f_u.shape, n=1, p=f_u)
    return [f_s, h_sample, f_u, v_sample]

def reconstruction_error(f_u, x):
    cross_entropy = -np.mean(
        np.sum(
            x * np.log(sigmoid(f_u)) + (1 - x) * np.log(1 - sigmoid(f_u)),
            axis=1))
    return cross_entropy

X = np.array([[1, 0, 0, 0]])

# Weights to hidden
W = np.array([[-3.85, 10.14, 1.16],
              [6.69, 2.84, -7.73],
              [1.37, 10.76, -3.98],
              [-6.18, -5.89, 8.29]])
hbias = np.array([1.04, -4.48, 2.50])          # <= 3 biases for 3 neurons in the hidden layer
vbias = np.array([-6.33, -1.68, -1.25, 3.45])  # <= 4 biases for 4 neurons in the input layer

k = 2
v_sample = X
for i in range(k):
    [f_s, h_sample, f_u, v_sample] = gibbs_vhv(W, hbias, vbias, v_sample)
    start = v_sample
    if i < 2:
        print('f_s:', f_s)
        print('h_sample:', h_sample)
        print('f_u:', f_u)
        print('v_sample:', v_sample)
    print(v_sample)
    print('iter:', i, ' h:', h_sample, ' x:', v_sample, ' entropy:%.3f' % reconstruction_error(f_u, v_sample))
Results:
[[1 0 0 0]]
f_s: [[ 0.05678618 0.99652957 0.97491304]]
h_sample: [[0 1 1]]
f_u: [[ 0.99310473 0.00139984 0.99604968 0.99712837]]
v_sample: [[1 0 1 1]]
[[1 0 1 1]]
iter: 0 h: [[0 1 1]] x: [[1 0 1 1]] entropy:1.637
f_s: [[ 4.90301318e-04 9.99973278e-01 9.99654440e-01]]
h_sample: [[0 1 1]]
f_u: [[ 0.99310473 0.00139984 0.99604968 0.99712837]]
v_sample: [[1 0 1 1]]
[[1 0 1 1]]
iter: 1 h: [[0 1 1]] x: [[1 0 1 1]] entropy:1.637
I am asking how the algorithm works to produce the numbers. – WhiteSolstice 35 mins ago
Non-technical explanation
If you pass n=1 to the Binomial distribution, it is equivalent to the Bernoulli distribution. In this case the function can be thought of as simulating coin flips. size=3 tells it to flip the coin three times, and p=0.5 makes it a fair coin with equal probability of heads (1) or tails (0).
The result [1 0 0] means the coin came down once with heads and twice with tails facing up. This is random, so running it again could result in a different sequence like [1 1 0], [0 1 0], or maybe even [1 1 1]. Although you cannot get an equal number of 1s and 0s in three flips, on average the two counts would be the same.
Technical explanation
NumPy implements random number generation in C. The source code for the binomial distribution can be found here. Two different algorithms are actually implemented:
If n * p <= 30, it uses inverse transform sampling.
If n * p > 30, the BTPE algorithm of Kachitvichyanukul and Schmeiser (1988) is used. (The publication is not freely available.)
I think both methods, but certainly the inverse transform sampling, depend on a random number generator producing uniformly distributed random numbers. NumPy internally uses a Mersenne Twister pseudo-random number generator. The uniform random numbers are then transformed into the desired distribution.
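For the Bernoulli case (n=1), inverse transform sampling boils down to thresholding a uniform draw. A rough sketch of the idea (this is not NumPy's actual C implementation):
import numpy as np

rng = np.random.RandomState(10)    # Mersenne Twister underneath

def bernoulli(p, size):
    u = rng.uniform(size=size)     # uniformly distributed numbers in [0, 1)
    return (u < p).astype(int)     # 1 with probability p, 0 otherwise

print(bernoulli(0.5, size=3))      # three "coin flips", e.g. [1 0 0]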
A Binomially distributed random variable has two parameters n and p, and can be thought of as the distribution of the number of heads obtained when flipping a biased coin n times, where the probability of getting a head at each flip is p. (More formally it is a sum of independent Bernoulli random variables with parameter p).
For instance, if n=10 and p=0.5, one could simulate a draw from Bin(10, 0.5) by flipping a fair coin 10 times and summing the number of times that the coin lands heads.
In addition to the n and p parameters described above, np.random.binomial has an additional size parameter. If size=1, np.random.binomial computes a single draw from the Binomial distribution. If size=k for some integer k, k independent draws from the same Binomial distribution will be computed. size can also be a tuple of dimensions, in which case a whole np.array of that shape will be filled with independent draws from the Binomial distribution.
Note that the Binomial distribution is a generalisation of the Bernoulli distribution - in the case that n=1, Bin(n,p) has the same distribution as Ber(p).
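A quick way to see both points (a small sketch using the legacy np.random API from the question): a single Bin(10, 0.5) draw behaves like the sum of ten Bernoulli(0.5) draws, and n=1 gives plain Bernoulli samples.
import numpy as np

one_draw = np.random.binomial(n=10, p=0.5)           # one draw from Bin(10, 0.5)
ten_flips = np.random.binomial(n=1, p=0.5, size=10)  # ten Bernoulli(0.5) draws
print(one_draw, ten_flips.sum())                     # both count heads in 10 fair flips

print(np.random.binomial(n=1, p=0.3, size=5))        # Bernoulli(0.3) samples, e.g. [0 1 0 0 0]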
For more information about the binomial distribution see: https://en.wikipedia.org/wiki/Binomial_distribution

numpy n-dimensional smart indexing over large tensors - memory efficiency

I'm working with large tensors, so NumPy memory allocations for temporary tensors significantly influence execution time, and the code sometimes raises memory allocation errors during those intermediate steps. Here are two approaches I came up with for indexing one tensor with the integer values of another tensor (like result_ijk = a[i, b[i, j], k]). Even though the second one seems more memory-efficient, I feel like creating this enormous index matrix and iterating over all its values (even in parallel) is kind of weird (and hits memory limits quite often):
import numpy as np

def test():
    i, j, k, l = 10, 20, 30, 40  # in reality, they're like 1e3..1e6
    a = np.random.rand(i, j, k)
    b = np.random.randint(0, j, size=i*l).reshape((i, l))

    # c_ilk = c[i, b[i, l], k]; shape(c) = (10, 40, 30)
    tmp = a[:, b, :]  # <= i*ijk additional memory allocated (!) crazy
    c1 = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])
    print(c1.shape)

    # another approach:
    ii, ll = np.indices((i, l))  # <= 2*i*l temporary ints allocated
    tmp2 = b[ii, ll]             # i*l ints allocated, slow ops
    c2 = a[ii, tmp2]             # slow ops over the tensor
    print(c2.shape)

    print(np.allclose(c1, c2))

test()
Any suggestions on how one could optimize this type of n-dimensional smart indexing code?
If I'm going to use this piece of ~vectorized code in Theano, is it also going to allocate all those temporary buffers, or could it somehow manage to build them on the fly? Is there any package that would perform such indexing in a lazier/more efficient manner without allocating these ii-like tensors?
(Note: I need to take gradients over it in the end, so I can't use fancy JIT compilers like Numba.)
You only need to allocate an array of integers of length i to get your desired result:
i_idx = np.arange(i)
c = a[i_idx[:, None], b[i_idx, :], :]
# or you can use the terser c = a[i_idx[:, None], b[i_idx]]
Broadcasting takes care of duplicating values as needed on the fly, without having to allocate memory for them.
If you time this for large-ish arrays, you'll notice it is only marginally faster than your second approach: as noted by others, the intermediate indexing array is going to be several orders of magnitude smaller than your overall computation, so optimizing it has a small effect on the total runtime or memory footprint.
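For completeness, a quick correctness check (a sketch with the small sizes from the question) that the broadcasting-based indexing matches the diagonal-based approach:
import numpy as np

i, j, k, l = 10, 20, 30, 40
a = np.random.rand(i, j, k)
b = np.random.randint(0, j, size=(i, l))

i_idx = np.arange(i)
c_broadcast = a[i_idx[:, None], b[i_idx, :], :]

tmp = a[:, b, :]
c_diagonal = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])

print(c_broadcast.shape)                     # (10, 40, 30)
print(np.allclose(c_broadcast, c_diagonal))  # True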
Some methods:
i, j, k, l = [100] * 4
a = np.random.randint(0, 5, (i, j, k))
b = np.random.randint(0, j, (i, l))

def test1():
    # c_ilk = c[i, b[i, l], k]
    tmp = a[:, b, :]  # <= i*ijk additional memory allocated (!) crazy
    c1 = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])
    return c1

def test2():
    ii, ll = np.indices((i, l))  # <= 2*i*l temporary ints allocated
    tmp2 = b[ii, ll]             # i*l ints allocated, slow ops
    c2 = a[ii, tmp2]             # slow ops over the tensor
    #print(c2.shape)
    return c2

def test3():
    c3 = np.empty((i, l, k), dtype=a.dtype)
    for ii in range(i):
        for ll in range(l):
            c3[ii, ll] = a[ii, b[ii, ll]]
    return c3

from numba import jit
test4 = jit(test3)
And the corresponding benchmarks:
In [54]: %timeit test1()
1 loop, best of 3: 720 ms per loop
In [55]: %timeit test2()
100 loops, best of 3: 7.79 ms per loop
In [56]: %timeit test3()
10 loops, best of 3: 43.7 ms per loop
In [57]: %timeit test4()
100 loops, best of 3: 4.99 ms per loop
That seems to show (see Eelco Hoogendoorn's comment) that your second method is nearly optimal for big sizes, while the first is a bad choice.
For Numba you can just use this part of the code, and apply the gradient in a non-jitted function.