Cupy slower than numpy when iterating through array - cupy

I have code that I want to parallelize with cupy. I thought it would be straightforward: just write "import cupy as cp", replace every np. with cp., and it would work.
And it does work, the code does run, but much slower than before. I thought it would eventually be faster than numpy when iterating through larger arrays, but that never seems to happen.
The code is:
q = np.zeros((5,5))
q[:,0] = 20

def foo(array):
    result = array
    shedding_row = array*0
    for i in range(array.shape[0]):
        for j in range(array.shape[1] - 1):
            shedding_param = 2 * (result[i,j])**.5
            shedding = np.random.poisson(shedding_param, 1)[0]
            if shedding >= result[i,j]:
                shedding = result[i,j] - 1
            result[i,j+1] = result[i,j] - shedding
            if result[i,j+1] < 0:
                result[i,j+1] = 0
            shedding_row[i,j+1] = shedding
    return result, shedding_row

x,y = foo(q)
Is this supposed to get faster with cupy? Am I using it wrong?

To get fast performance out of numpy or cupy, you should replace Python for loops with vectorized array operations.
Just for example,
for i in range(array.shape[0]):
    for j in range(array.shape[1] - 1):
        shedding_param = 2 * (result[i,j])**.5
This can be calculated as
import numpy as xp  # change to "import cupy as xp" for the GPU
shedding_param = 2 * xp.sqrt(result[:, :-1])
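Note that result[:, j+1] depends on result[:, j], so the column loop cannot be removed outright, but the row loop can. Here is a rough sketch of that idea (an untested illustration, assuming cupy.random.poisson accepts an array-valued lam the way numpy's does):
import numpy as np

def foo_vectorized(array, xp=np):  # pass xp=cupy to run on the GPU
    result = array.copy()
    shedding_row = xp.zeros_like(array)
    for j in range(array.shape[1] - 1):
        shedding_param = 2 * xp.sqrt(result[:, j])
        # one Poisson draw per row, all at once
        shedding = xp.random.poisson(shedding_param)
        # clamp, matching the scalar code's "if shedding >= result" branch
        shedding = xp.where(shedding >= result[:, j], result[:, j] - 1, shedding)
        result[:, j + 1] = xp.maximum(result[:, j] - shedding, 0)
        shedding_row[:, j + 1] = shedding
    return result, shedding_row
Even then, for a 5x5 array the GPU kernel-launch overhead will dominate; cupy only starts to pay off once each column has many thousands of rows to process in parallel.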

Related

numpy - find all pixels near a set of pixels

I have a PIL.Image object input of mode '1' (a black & white bitmap) and I would like to determine, for every pixel in the image, whether it's within n pixels (Euclidean distance - n may be around 100 or so) of any of the white pixels.
The motivation is: input represents every pixel that is different between two other images, and I would like to create a highlight region around all those differences to show clearly where the differences occur.
So far I haven't been able to find a fast algorithm for this. The following code works, but the convolution is very slow because the kernel is larger than ndimage.convolve can apparently handle efficiently:
from scipy import ndimage
import numpy as np
from PIL import Image
n = 100
y, x = np.ogrid[:2*n, :2*n]
kernel = (x-n)**2 + (y-n)**2 <= n**2
img = Image.open('input.png')
result = ndimage.convolve(np.array(img), kernel) != 0
Image.fromarray(result).save('result.png')
[Example input image input.png omitted.]
[Desired output image result.png omitted; it also shows some undesired artifacts that I assume come from over/underflow.]
Even with these small images, the computation takes 30 seconds or so.
Can someone recommend a better procedure to compute this? Thanks.
ndimage.convolve uses a naive direct algorithm, almost certainly running in O(n m kn km) time, where (n, m) is the shape of the image and (kn, km) is the shape of the kernel. An FFT-based convolution does the same job much more efficiently, in O(n m log(n m)) time, and scipy provides such a function. Here is an example of usage:
import scipy.signal
import numpy as np
from PIL import Image
n = 100
y, x = np.ogrid[:2*n, :2*n]
kernel = (x-n)**2 + (y-n)**2 <= n**2
img = Image.open('input.png')
result = scipy.signal.fftconvolve(img, kernel, mode='same') >= 1.0
Image.fromarray(result).save('result.png')
This is >500 times faster on my machine, and it also fixes the artifacts. [Result image omitted.]
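As a side note beyond the original answer: for the specific task of marking every pixel within Euclidean distance n of a white pixel, a distance transform avoids the convolution entirely. A minimal sketch using scipy.ndimage.distance_transform_edt:
from scipy import ndimage
import numpy as np
from PIL import Image

n = 100
img = np.array(Image.open('input.png'), dtype=bool)
# distance_transform_edt gives each pixel's distance to the nearest zero,
# so invert the mask to measure distance to the nearest white pixel
dist = ndimage.distance_transform_edt(~img)
result = dist <= n
Image.fromarray((result * 255).astype(np.uint8)).save('result.png')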

Numpy.dot hangs my program; I assume it is a memory problem

I have two numpy arrays: A with shape (512,) and B with shape (3000, 512). One function calls
C = np.dot(B, A), and my program hangs without any error output.
My Python is 3.7.3 and numpy is 1.16.2.
But that code runs fine if I call c = np.dot(B, A) manually with suitable input, or when the length of B is around 50.
I don't know what the difference between the two ways of calling is.
I found the answer. It was a process memory limit: my program was using 20 GB of RAM while running, and when numpy needed more memory for its work the system hung without any error or warning. When I called the function manually, it ran in another process and got more RAM for its work.
When A comes first, you need to use the transpose of B. Interestingly, I did not have to change the shape of A, because np.dot treats a 1-D array as a row or column vector as needed. That may look inconsistent, but it works.
import numpy as np

A = np.array([i for i in range(512)])  # shape (512,)
B = np.random.rand(3000, 512)          # shape (3000, 512)
C1 = B.dot(A)                          # shape (3000,)
B = B.transpose()                      # shape (512, 3000)
C2 = A.dot(B)                          # shape (3000,)
C2 = C2.transpose()                    # no-op on a 1-D array
print(np.all(np.equal(C1, C2)))        # verify that the results are equal
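For reference, here is a short demonstration of that 1-D behavior (illustrative values, not from the original post); np.dot promotes the 1-D operand to a column or row vector depending on its position:
import numpy as np

A = np.ones(512)               # 1-D
B = np.ones((3000, 512))       # 2-D
print(np.dot(B, A).shape)      # (3000,) -- A acts as a column vector
print(np.dot(A, B.T).shape)    # (3000,) -- A acts as a row vector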

random.multivariate_normal on a dask array?

I've been struggling to find a way to get this calculation working in a dask workflow.
I have code that uses the np.random.multivariate_normal function, and while many functions of this type are available to us on dask arrays, it seems this one is not. So I attempted to create my own, based on an example provided in the dask documentation.
Here is my attempt, which is giving errors that I am having difficulty understanding. I also provided random input variables to make it easy to replicate:
import numpy as np
from dask.distributed import Client
import dask.array as da

def mvn(mu, sigma, n, blocksize):
    chunks = ((blocksize,) * (n // blocksize),
              (blocksize,) * (n // blocksize))
    name = 'mvn'  # unique identifier
    dsk = {(name, i, j): (np.random.multivariate_normal(mu, sigma, blocksize))
                         if i == j else
                         (np.zeros, (blocksize, blocksize))
           for i in range(n // blocksize)
           for j in range(n // blocksize)}
    dtype = np.random.multivariate_normal(0).dtype  # take dtype default from numpy
    return da.Array(dsk, name, chunks, dtype)

n = 10000
A = da.random.normal(0, 1, size=(n, n), chunks=(1000, 1000))
sigma = da.dot(A, A.transpose())
mu = 4.0 * da.ones(n, chunks=1000)
R = da.numpy.random.mvn(mu, sigma, n, chunks=(100))
Any suggestions or am I so far off the mark here that I should abandon all hope? Thanks!
If you have a cluster to run this on, you can use my answer from this post, copied here for reference:
A workaround for now is to use a Cholesky decomposition. Note that any covariance matrix C can be expressed as C = G G'. It then follows that x = G' y is correlated as specified in C if y is standard normal (see this excellent post on Mathematics StackExchange). In code:
Numpy
import numpy as np

n_dim = 4
size = 100000
A = np.random.randn(n_dim, n_dim)
covm = A.dot(A.T)
x = np.random.multivariate_normal(size=size, mean=np.zeros(len(covm)), cov=covm)
## verify numpy's covariance is correct
np.cov(x, rowvar=False)
covm
Dask
import dask.array as da

## create covariance matrix
A = da.random.standard_normal(size=(n_dim, n_dim), chunks=(2, 2))
covm = A.dot(A.T)
## get Cholesky decomposition
L = da.linalg.cholesky(covm, lower=True)
## draw standard normal samples
sn = da.random.standard_normal(size=(size, n_dim), chunks=(100, 100))
## correct for correlation
x = L.dot(sn.T)
x.shape
## verify
covm.compute()
da.cov(x, rowvar=True).compute()
This answer could be fleshed out further, but I imagine you would have an easier time using dask's delayed, da.from_delayed and da.*stack.
One immediate problem I see with what you have: with np.random.multivariate_normal(mu, sigma, blocksize) you are directly calling the function instead of building the task spec. You probably wanted (np.random.multivariate_normal, mu, sigma, blocksize). This shows that working with raw dask dictionaries can be tricky!
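To illustrate the delayed route, here is a minimal, hypothetical sketch (the helper name mvn_delayed is mine; it assumes blocksize divides the number of samples and that mu and sigma are concrete numpy arrays):
import numpy as np
import dask
import dask.array as da

def mvn_delayed(mu, sigma, n_samples, blocksize):
    blocks = []
    for _ in range(n_samples // blocksize):
        # each block of draws is built lazily and only computed on demand
        d = dask.delayed(np.random.multivariate_normal)(mu, sigma, blocksize)
        blocks.append(da.from_delayed(d, shape=(blocksize, len(mu)), dtype=float))
    return da.vstack(blocks)

x = mvn_delayed(np.zeros(4), np.eye(4), n_samples=1000, blocksize=100)
print(x.compute().shape)  # (1000, 4)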

nesting openmdao "assemblies"/drivers - working from a 0.13 analogy, is this possible to implement in 1.X?

I am using NREL's DAKOTA_driver openmdao plugin for parallelized Monte Carlo sampling of a model. In 0.X, I was able to nest assemblies, allowing an outer optimization driver to direct the DAKOTA_driver sampling evaluations. Is it possible for me to nest this setup within an outer optimizer? I would like the outer optimizer's workflow to call the DAKOTA_driver "assembly" and then the get_dakota_output component.
import pandas as pd
import subprocess
from subprocess import call
import os
import numpy as np
from dakota_driver.driver import pydakdriver
from openmdao.api import IndepVarComp, Component, Problem, Group
from mpi4py import MPI
import sys
from itertools import takewhile

sigm = .005
n_samps = 20
X_bar = [0.065, sigm]  # 2.505463e+03*.05]
dacout = 'dak.sout'

class get_dak_output(Component):
    mean_coe = 0

    def execute(self):
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        nam = 'ape.net_aep'
        csize = 10000
        with open(dacout) as f:
            for i, l in enumerate(f):
                pass
            numlines = i
        dakchunks = pd.read_csv(dacout, skiprows=0, chunksize=csize,
                                sep='there_are_no_seperators')
        linespassed = 0
        vals = []
        for dchunk in dakchunks:
            for line in dchunk.values:
                linespassed += 1
                if linespassed < 49 or linespassed > numlines - 50:
                    continue
                else:
                    split_line = ''.join(str(s) for s in line).split()
                    if len(split_line) == 2:
                        if (len(split_line) != 2 or
                                split_line[0] in ('nan', '-nan') or
                                split_line[1] != nam):
                            continue
                        else:
                            vals.append(float(split_line[0]))
        self.coe_vals = sorted(vals)
        self.mean_coe = np.mean(self.coe_vals)

class ape(Component):
    def __init__(self):
        super(ape, self).__init__()
        self.add_param('x', val=0.0)
        self.add_output('net_aep', val=0.0)

    def solve_nonlinear(self, params, unknowns, resids):
        print('hello')
        x = params['x']
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        outp = subprocess.check_output("python test/exampleCall.py %f" % (float(x)),
                                       shell=True)
        unknowns['net_aep'] = float(outp.split()[-1])

top = Problem()
root = top.root = Group()
root.add('ape', ape())
root.add('p1', IndepVarComp('x', 13.0))
root.connect('p1.x', 'ape.x')

drives = pydakdriver(name='top.driver')
drives.UQ('sampling', use_seed=False)
#drives.UQ()
top.driver = drives
#top.driver = ScipyOptimizer()
#top.driver.options['optimizer'] = 'SLSQP'
top.driver.add_special_distribution('p1.x', 'normal', mean=0.065, std_dev=0.01,
                                    lower_bounds=-50, upper_bounds=50)
top.driver.samples = n_samps
top.driver.stdout = dacout
#top.driver.add_desvar('p2.y', lower=-50, upper=50)
#top.driver.add_objective('ape.f_xy')
top.driver.add_objective('ape.net_aep')

top.setup()
top.run()

bak = get_dak_output()
bak.execute()
print('\n')
print('E(aep) is %f' % bak.mean_coe)
There are two different options for this situation. Both will work in parallel, and both are currently supported, but only one of them will work when you want to use analytic derivatives:
1) Nested Problems: You create one problem class that has a DOE driver in it. You pass the list of cases you want run into that driver, and it runs them in parallel. Then you put that problem into a parent problem as a component.
The parent problem doesn't know that it has a sub-problem. It just thinks it has a single component that uses multiple processors.
This is the most similar to how you would have done it in 0.x. However, I don't recommend going this route, because it won't work if you ever want to use analytic derivatives.
If you go this way, the dakota driver can stay pretty much as-is, but you'll have to use a special sub-problem class. This isn't an officially supported feature yet, but it's very doable.
2) Multi-point: You create a Group class that represents your model, then create one instance of that group for each Monte Carlo run you want to do. You put all of these instances into a parallel group inside your overall problem; see the sketch after this list.
This approach avoids the sub-problem messiness and is also much more efficient for actual execution. It has a somewhat greater setup cost than the first method, but in my opinion it is well worth the one-time cost to get the advantage of analytic derivatives. The only issue is that it would probably require some changes to the way the dakota_driver works: you would want to get the list of evaluations from the driver, then hand them out to the individual child groups.
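A minimal sketch of the multi-point layout under OpenMDAO 1.x conventions (the ModelGroup name and the sample values are illustrative, not from the original post; ape is the component defined in the question):
import numpy as np
from openmdao.api import Problem, Group, ParallelGroup, IndepVarComp

class ModelGroup(Group):
    def __init__(self, x_val):
        super(ModelGroup, self).__init__()
        self.add('p', IndepVarComp('x', x_val))
        self.add('ape', ape())  # the component from the question
        self.connect('p.x', 'ape.x')

top = Problem()
root = top.root = Group()
cases = root.add('cases', ParallelGroup())
# one model instance per Monte Carlo draw
for k, xk in enumerate(np.random.normal(0.065, 0.01, 20)):
    cases.add('case_%d' % k, ModelGroup(float(xk)))
top.setup()
top.run()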

Faster numpy.random.shuffle with a length limit?

I am using numpy.random.shuffle to shuffle a list of data. The list is large, so I want to randomly sample some of the data to do my work.
I implement this using the following code:
# data_list is a numpy array of shape (num_data,)
index = np.arange(data_list.size)
np.random.shuffle(index)
index = index[:len_limit]
data = data_list[index]
But since index is big, the shuffle is slow.
Any advice to improve the performance?
This is a common problem. I use the following:
Drawing with replacement
idxs = np.random.randint(0, high=len(data), size=(N,))
result = data[idxs]
Drawing without replacement
import random
idxs = random.sample(range(len(data)), N)  # use xrange on Python 2
result = data[idxs]
where data is your original dataset and N is the number of desired samples. Either should be faster than shuffling, as long as N << len(data).
Try np.random.choice, with replace=False.
Example (using the same variables as in the question):
data = np.random.choice(data_list, len_limit, replace=False)
You'll need numpy version 1.7.0 or later.
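For what it's worth, on NumPy >= 1.17 the Generator API offers the same sampling and is typically faster (a sketch reusing the question's variable names):
import numpy as np

rng = np.random.default_rng()
# draw len_limit elements from data_list without replacement
data = rng.choice(data_list, size=len_limit, replace=False)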