Using CuPy/cuDF, remove elements that are not distant enough from their previous elements in a sorted list - GPU

The purpose of the code is similar to this post.
I have code that runs on the CPU:
import pandas as pd

def remove(s: pd.Series, thres: int):
    pivot = -float("inf")
    new_s = []
    for e in s:
        if (e - pivot) > thres:
            new_s.append(e)
            pivot = e
    return pd.Series(new_s)

# s is an ascending sequence
s = pd.Series([0, 1, 2, 4, 6, 9])
remove(s, thres=3)
# Out:
# 0    0
# 1    4
# 2    9
# dtype: int64
The input is an ascending sequence with integer values.
This function removes every point s[i] whose distance to the most recently kept point is not greater than thres.
My problem is that CuPy/cuDF do not support loops, so I can't use the GPU to accelerate the code. I only have vectorized operations like cumsum, diff, and mod, which don't fit my needs.
Is there a function like scan in TensorFlow?
The remove function can be reformulated in a form that is similar to prefix sum (scan):
For a sequence [a1, a2, a3], the output should be [a1, a1⨁a2, (a1⨁a2)⨁a3], where the running value is the most recently kept point and ⨁ is
⨁ = lambda x, y: y if (y - x) > thres else x
Then set(output) is what I want (for the example above, {0, 4, 9}).
Note that (a1⨁a2)⨁a3 != a1⨁(a2⨁a3); since the operator is not associative, a parallel scan might not be feasible.
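To illustrate the reformulation, here is a minimal CPU-only sketch using itertools.accumulate (not a GPU solution); the running value in the scan is the last kept pivot:
from itertools import accumulate
import pandas as pd

thres = 3
s = pd.Series([0, 1, 2, 4, 6, 9])

# Keep the previous pivot x unless y is far enough away from it.
op = lambda x, y: y if (y - x) > thres else x

scanned = list(accumulate(s, op))  # [0, 0, 0, 4, 4, 9]
print(sorted(set(scanned)))        # [0, 4, 9] -- same as remove(s, thres=3)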
Update
I found that there is already a function called Inclusive Scan; all I need is a Python wrapper.
Or is there any other way?

Related

Adding a third dimension to my 2D array in a for loop

I have a for loop that gives me a 16 x 8 2D array per entry in the loop. I want to stack all of these 2D arrays along the z-axis into a 3D array. This way, I can determine the variance over the z-axis. I have tried multiple commands, such as np.dstack, matrix3D[p,:,:] = ... and np.newaxis, both in- and outside the loop. However, the closest I've come to my desired output is just a repetition of the last array stacked on top of itself, and the dimensions were way off. I need to keep the original 16 x 8 format. By now I'm in a bit too deep and could use a nudge in the right direction!
My code:
excludedElectrodes = [1, a.numberOfColumnsInArray,
                      a.numberOfElectrodes - a.numberOfColumnsInArray + 1,
                      a.numberOfElectrodes]
matrixEA = np.full([a.numberOfRowsInArray, a.numberOfColumnsInArray], np.nan)
for iElectrode in range(a.numberOfElectrodes):
    if a.numberOfDeflectionsPerElectrode[iElectrode] != 0:
        matrixEA[iElectrode // a.numberOfColumnsInArray][iElectrode % a.numberOfColumnsInArray] = 0
for iElectrode in range(a.numberOfElectrodes):
    if iElectrode + 1 not in excludedElectrodes:
        """Preprocessing"""
        # Loop over heartbeats
        for p in range(1, len(iLAT)):
            # Calculate parameters, store them in right row-col combo (electrode number)
            matrixEA[iElectrode // a.numberOfColumnsInArray][iElectrode % a.numberOfColumnsInArray] = (
                np.trapz(abs(correctedElectrogram[limitA[0]:limitB[0]] - totalBaseline[limitA[0]:limitB[0]])) / 1000)
            # Stack all matrixEA arrays along z axis
            matrix3D = np.dstack(matrixEA)
This example snippet does what you want, although I suspect your errors have more to do with things unrelated to the concatenation part. Here, we use None indexing (equivalent to np.newaxis) to create a new axis, along which we concatenate the 2D arrays.
import numpy as np

# Function creates a dummy (16, 8) array
def foo(a):
    return np.random.random((16, 8)) + a

arrays2D = []
# Your loop
for i in range(10):
    # Calculate your (16, 8) array
    f = foo(i)
    # And append it to the list
    arrays2D.append(f)

# Stack arrays along new dimension
array3D = np.concatenate([i[..., None] for i in arrays2D], axis=-1)
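As a side note, np.stack is an equivalent, slightly more direct way to add the new axis, and the variance over that axis is then a one-liner:
# Same result as the concatenate line above: shape (16, 8, 10)
array3D = np.stack(arrays2D, axis=-1)

# Variance over the stacking axis (the original goal)
variance = np.var(array3D, axis=-1)  # shape (16, 8)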

bisect_right for numpy array

I wish to locate the index of the closest higher value to a query over a sorted numpy array (where the query value is not in the array).
I want something similar to bisect_right from the Python standard library, without converting the numpy array to a Python list, and leveraging the fact that the array is sorted (i.e. the runtime should be O(log N), like numpy's searchsorted).
Pandas has this option using get_loc with the 'bfill' option, but it seems a bit of overkill to include it as a dependency just for this... I might have to resort to holding this array as both a Python list and a numpy array, but I wanted to hear if there's a more reasonable solution.
Edit: It seems searchsorted does exactly what I need.
We can see the code for bisect_right on github:
def bisect_right(a, x, lo=0, hi=None):
    """Return the index where to insert item x in list a, assuming a is sorted.

    The return value i is such that all e in a[:i] have e <= x, and all e in
    a[i:] have e > x. So if x already appears in the list, a.insert(x) will
    insert just after the rightmost x already there.

    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        # Use __lt__ to match the logic in list.sort() and in heapq
        if x < a[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo
This is all numpy-compatible:
import numpy as np

array = np.array([1, 2, 3, 4, 5, 6])

print(bisect_right(array, 7))
>>> 6
print(bisect_right(array, 0))
>>> 0
To find the index of the closest higher value to a given number:
def closest_higher_value(array, value):
    idx = bisect_right(array, value)
    if idx < len(array):
        return idx
    print("value too large:", value, "is bigger than all elements of:")
    print(array)

print(closest_higher_value(array, 3))
>>> 3
print(closest_higher_value(array, 7))
>>> value too large: 7 is bigger than all elements of:
>>> [1 2 3 4 5 6]
>>> None
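As the question's edit notes, numpy's built-in searchsorted does the same binary search directly on the array; a short sketch of the equivalence:
import numpy as np

array = np.array([1, 2, 3, 4, 5, 6])

# side='right' matches bisect_right; side='left' matches bisect_left
print(np.searchsorted(array, 3, side='right'))  # 3
print(np.searchsorted(array, 7, side='right'))  # 6 (past the end: no higher value)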

generate large array in dask

I would like to calculate the SVD of a large matrix with Dask. However, I naively tried to create an empty 2D array and update it in a loop, but Dask does not allow mutating the array.
So I'm looking for a workaround. I tried saving the large (around 65,000 x 65,000, or even more) array into HDF5 via h5py, but updating the array in a loop is quite inefficient. Should I be using mmap, i.e. a memory-mapped numpy array, instead?
Below I share sample code, without any Dask implementation. Should I use dask.bag or dask.delayed for this operation?
The sample code takes long strings and, in a window of size 8, generates combinations of two-letter words. In the actual data, the window size would be 20 and the words would be 8 letters long, and the input string can be 3 GB long.
import itertools
import numpy as np

np.set_printoptions(threshold=np.inf)

# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.)
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases = ['A', 'C', 'G', 'T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x, y) in zip(all_two, range(len(all_two)))}

# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16, 16))  # in actual sample we expect 65000x65000 array

# sample sequences (these will be gigabytes long in actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"

# accumulate results
all_pairs = []

def generate_pairs(sequence):
    pairs = []
    for i in range(len(sequence) - 8 + 1):
        window = sequence[i:i + 8]
        words = [window[j:j + 2] for j in range(0, len(window), 2)]
        for pair in itertools.combinations(words, 2):
            pairs.append(pair)
    return pairs

# use function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))

# convert 1D array of pairs into 2D counts of pairs
# for each pair, lookup word index and increase corresponding cell
for j in all_pairs:
    counts[two_index[j[0]], two_index[j[1]]] += 1

print(counts)
EDIT: I might have asked the question in an overly complicated way; let me try to paraphrase it. I need to construct a single large 2D array of size ~65000x65000. The array needs to be filled by counting occurrences of (word1, word2) pairs. Since Dask does not allow item assignment/mutation of a Dask array, I cannot fill the array as pairs are processed. Is there a workaround to generate/fill a large 2D array with Dask?
Here's simpler code to test:
import itertools
import numpy as np

np.set_printoptions(threshold=np.inf)

bases = ['A', 'C', 'G', 'T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x, y) in zip(all_two, range(len(all_two)))}

seq = "AAAAACCATCGACTACGACTAC"
counts = np.zeros(shape=(16, 16))

for i in range(len(seq) - 8 + 1):
    window = seq[i:i + 8]
    words = [window[j:j + 2] for j in range(0, len(window), 2)]
    for pair in itertools.combinations(words, 2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1  # problematic part!

print(counts)
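One possible direction, sketched below, is to avoid item assignment entirely: count the pairs with collections.Counter and build a sparse matrix from the accumulated (row, col, count) triples at the end. This is not a Dask solution, only an illustration of the sparse-counting idea, and it reuses the 16x16 toy vocabulary from the code above:
from collections import Counter
import itertools
from scipy.sparse import coo_matrix

bases = ['A', 'C', 'G', 'T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {w: i for i, w in enumerate(all_two)}

seq = "AAAAACCATCGACTACGACTAC"

# Count (word1, word2) pairs without ever mutating a dense array
pair_counts = Counter()
for i in range(len(seq) - 8 + 1):
    window = seq[i:i + 8]
    words = [window[j:j + 2] for j in range(0, len(window), 2)]
    pair_counts.update(itertools.combinations(words, 2))

# Build a sparse 16x16 matrix from the accumulated counts
rows = [two_index[w1] for (w1, w2) in pair_counts]
cols = [two_index[w2] for (w1, w2) in pair_counts]
data = list(pair_counts.values())
counts_sparse = coo_matrix((data, (rows, cols)), shape=(16, 16))

print(counts_sparse.toarray())  # same counts as the dense loop above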

Build a numpy array from a random distribution until the last column exceeds a threshold

I want to build a 2d numpy array from a random distribution so that each of the values in the last column of each row exceeds a threshold.
Here's the working code I have now. Is there a cleaner way to build numpy arrays with an arbitrary condition?
from typing import Callable
import numpy as np

def new_array(
        num_rows: int,
        dist: Callable[[int], np.ndarray],
        min_hours: int) -> np.ndarray:
    # Get the 40th percentile as a reasonable guess for how many samples we need.
    # Use a lower percentile to increase num_cols and avoid looping in most cases.
    p40_val = np.quantile(dist(20), 0.4)
    # Generate at least 10 columns each time.
    num_cols = max(int(min_hours / p40_val), 10)

    def create_starts() -> np.ndarray:
        return dist(num_rows * num_cols).reshape((num_rows, num_cols)).cumsum(axis=1)

    max_iters = 20
    starts = create_starts()
    for _ in range(max_iters):
        if np.min(starts[:, -1]) >= min_hours:
            # All the last columns exceed min_hours.
            break
        last_col_vals = starts[:, -1].repeat(num_cols).reshape(starts.shape)
        next_starts = create_starts() + last_col_vals
        starts = np.append(starts, next_starts, axis=1)
    else:
        # We didn't break out of the for loop, so we hit the max iterations.
        raise AssertionError('Failed to create enough samples to exceed '
                             'sim duration for all columns')

    # Only keep columns up to the column where each value > min_hours.
    mins_per_col = np.min(starts, axis=0)
    cols_exceeding_sim_duration = np.nonzero(mins_per_col > min_hours)[0]
    cols_to_keep = cols_exceeding_sim_duration[0]
    return np.delete(starts, np.s_[cols_to_keep:], axis=1)

new_array(5, lambda size: np.random.normal(3, size=size), 7)
# Example output
# array([[1.47584632, 4.04034105, 7.19592256],
#        [3.10804306, 6.46487043, 9.74177227],
#        [1.03633165, 2.62430309, 6.92413189],
#        [3.46100139, 6.53068143, 7.37990547],
#        [2.70152742, 6.09488369, 9.58376664]])
I simplified several things and replaced them with Numpy's logical (boolean) indexing. The for-loop is now a while-loop, and there is no need to handle the error case, as it simply runs until there are enough rows.
Is this still working as you expect?
def new_array(num_rows, dist, min_hours):
    # Get the 40th percentile as a reasonable guess for how many samples we need.
    # Use a lower percentile to increase num_cols and avoid looping in most cases.
    p40_val = np.quantile(dist(20), 0.4)
    # Generate at least 10 columns each time.
    num_cols = max(int(min_hours / p40_val), 10)

    # no need to reshape here, size can be a shape tuple
    def create_starts() -> np.ndarray:
        return dist((num_rows, num_cols)).cumsum(axis=1)

    # append to a list and stack it into a Numpy array once at the end;
    # faster than numpy.append due to Numpy's pre-allocation,
    # which would slow down things here.
    storage = []
    while True:
        starts = create_starts()
        # boolean / logical array
        is_larger = starts[:, -1] >= min_hours
        # Use Numpy boolean indexing to find the rows
        # fitting your condition
        good_rows = starts[is_larger, :]
        # can also be an empty array if none found, but will
        # be skipped later
        storage.append(good_rows)
        # count what is in storage so far; empty arrays will be skipped
        # due to shape (0, x)
        number_of_good_rows = sum([_a.shape[0] for _a in storage])
        print('number_of_good_rows', number_of_good_rows)
        if number_of_good_rows >= num_rows:
            starts = np.vstack(storage)
            print(starts)
            break

    # Only keep columns up to the column where each value > min_hours.
    # also use logical indexing here
    is_something = np.logical_not(np.all(starts > min_hours, axis=0))
    return starts[:, is_something]
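For reference, the revised function can be called the same way as the original, since np.random.normal accepts either an int or a shape tuple for size:
result = new_array(5, lambda size: np.random.normal(3, size=size), 7)
print(result.shape)  # at least 5 rows; the column count depends on the random draws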

ValueError: setting an array element with a sequence

This python code:
import math as m
import numpy as np
import scipy.optimize as optimization
import matplotlib.pyplot as plt
from scipy.integrate import quad

# Create toy data for curve_fit.
zo = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
mu = np.array([0.1, 0.9, 2.2, 2.8, 3.9, 5.1])
sig = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Define hubble function. (H0 and c are assumed to be defined elsewhere;
# their values are omitted in the question.)
def Hubble(x, a, b):
    return H0 * m.sqrt(a*(1+x)**2 + 1/2 * a * (1+b)**3)

# Define
def Distancez(x, a, b):
    return c * (1+x) * np.asarray(quad(lambda tmp: 1/Hubble(a, b, tmp), 0, x))

def mag(x, a, b):
    return 5*np.log10(Distancez(x, a, b)) + 25
    # return a+b*x

# Compute chi-square manifold.
Steps = 101  # grid size
Chi2Manifold = np.zeros([Steps, Steps])  # allocate grid
amin = 0.2  # minimal value of a covered by grid
amax = 0.3  # maximal value of a covered by grid
bmin = 0.3  # minimal value of b covered by grid
bmax = 0.6  # maximal value of b covered by grid
for s1 in range(Steps):
    for s2 in range(Steps):
        # Current values of (a,b) at grid position (s1,s2).
        a = amin + (amax - amin)*float(s1)/(Steps-1)
        b = bmin + (bmax - bmin)*float(s2)/(Steps-1)
        # Evaluate chi-squared. (xdata is assumed to be defined earlier in the notebook.)
        chi2 = 0.0
        for n in range(len(xdata)):
            residual = (mu[n] - mag(zo[n], a, b))/sig[n]
            chi2 = chi2 + residual*residual
        Chi2Manifold[Steps-1-s2, s1] = chi2  # write result to grid.
Throws this error message:
ValueError                                Traceback (most recent call last)
<ipython-input-136-d0ef47a881a7> in <module>()
     36             residual = (mu[n] - mag(zo[n], a, b))/sig[n]
     37             chi2 = chi2 + residual*residual
---> 38         Chi2Manifold[Steps-1-s2,s1] = chi2 # write result to grid.

ValueError: setting an array element with a sequence.
Note: If I define a simple mag function such as (a+b*x), I do not get any error message.
In fact, all three functions Hubble, Distancez and mag have to be functions of the redshift z, which is an array.
Do you think I need to redefine all these functions so that they output arrays? I mean, should I first create an array of redshifts so that the output of the functions automatically becomes an array?
Thanks for your reply.
I need the output of the Distancez() and mag() functions to be arrays. I managed to do it simply by changing the upper limit of the integral in the Distancez function from x to x.any(). Now I have an array, and this is what I want. However, I now see that the output value of, for example, Distancez(0.25, 0.5, 0.3) is different from when I just put x as the upper limit of the integral. Any help would be appreciated.
The ValueError is saying that it cannot assign an element of the array Chi2Manifold a value that is a sequence. chi2 is probably a numpy array because residual is a numpy array, because your mag() function returns a numpy array, all because your Distancez function returns a numpy array -- you are telling it to do this with that np.asarray().
If Distancez() returned a scalar floating point value you'd probably be set. Do you need to use np.asarray() in Distancez()? Is that actually a 1-element array, or do you perhaps intend to reduce it somehow to a scalar? I don't know what your Hubble() function is supposed to do and I'm not an astronomer, but in my experience distances are often scalars ;).
If chi2 is meant to be a sequence or numpy array, you probably want to set an appropriately-sized range of values in Chi2Manifold to chi2.
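To make that concrete, here is a minimal sketch of a scalar-returning Distancez(): scipy.integrate.quad returns a (value, error_estimate) tuple, so keeping only the first element avoids wrapping an array around the result. H0, c and Hubble() are assumed to be defined as in the question, and note that the integration variable is passed here as Hubble's redshift argument:
from scipy.integrate import quad

def Distancez(x, a, b):
    # quad returns (integral_value, estimated_error); keep only the value
    integral, _err = quad(lambda tmp: 1 / Hubble(tmp, a, b), 0, x)
    return c * (1 + x) * integral  # plain float, so chi2 stays a scalar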