Find pairs of arrays such that array_1 = -array_2 - numpy

I am looking for a way to find all the vectors from a np.meshgrid(krange, krange, krange) that are related by k = -k.
For the moment I do this:
@numba.njit
def find_pairs(array):
    boolean = np.ones(len(array), dtype=np.bool_)
    pairs = []
    idx = [i for i in range(len(array))]
    while len(idx) > 1:
        e1 = idx[0]
        for e2 in idx:
            if (array[e1] == -array[e2]).all():
                boolean[e2] = False
                pairs.append([e1, e2])
                idx.remove(e1)
                if e2 != e1:
                    idx.remove(e2)
                break
    return boolean, pairs

# Give array of 3D vectors
krange = np.fft.fftfreq(N)
comb_array = np.array(np.meshgrid(krange, krange, krange)).T.reshape(-1, 3)

# Take idx of the pairs of k, -k vectors and a boolean selection giving the position of the -k vectors
boolean, pairs = find_pairs(comb_array)
It works, but the execution time grows rapidly with N...
Maybe someone has already dealt with this?

The main problem is that comb_array has a shape of (R, 3) where R = N**3, and the nested loop in find_pairs runs in at least quadratic time, since idx.remove runs in linear time and is called inside the for loop. Moreover, there are cases where the for loop does not change the size of idx, and the loop appears to run forever (e.g. with N=4).
One solution that solves this problem in O(R log R) is to sort the array and then check for opposite values in linear time:
import numpy as np
import numba as nb

# Give array of 3D vectors
krange = np.fft.fftfreq(N)
comb_array = np.array(np.meshgrid(krange, krange, krange)).T.reshape(-1, 3)

# Sorting
packed = comb_array.view([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])
idx = np.argsort(packed, axis=0).ravel()
sorted_comb = comb_array[idx]

# Find pairs
@nb.njit
def findPairs(sorted_comb, idx):
    n = idx.size
    boolean = np.zeros(n, dtype=np.bool_)
    pairs = []
    cur = n - 1
    for i in range(n):
        while cur >= i:
            if np.all(sorted_comb[i] == -sorted_comb[cur]):
                boolean[idx[i]] = True
                pairs.append([idx[i], idx[cur]])
                cur -= 1
                break
            cur -= 1
    return boolean, pairs

findPairs(sorted_comb, idx)
Note that the algorithm assumes that for each row there is at most one valid matching pair. If there are several equal rows, they are paired two by two. If your goal is to extract all combinations of equal rows in that case, then please note that the output will grow exponentially (which is not reasonable IMHO).
This solution is pretty fast even for N = 100. Most of the time is spent in the sort, which is not very efficient (unfortunately, Numpy does not yet provide an efficient way to do a lexicographic argsort of the rows, though this operation is fundamentally expensive).
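For reference, the same row ordering used above can also be expressed with np.lexsort instead of the structured view; this is only a rewording of the sort step (typically no faster, matching the note above), not a change to the algorithm:
import numpy as np

# np.lexsort sorts by the last key first, so passing (z, y, x) gives x-major ordering,
# matching the ('x', 'y', 'z') structured-view argsort above
idx = np.lexsort((comb_array[:, 2], comb_array[:, 1], comb_array[:, 0]))
sorted_comb = comb_array[idx]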


About the numpy.where statement

I would like to use numpy.where to check the value of a previous row, but I don't know how to code it:
for n1 in range(len(image1)):
    print('input image ', input_folder + '\\' + image1[n1])
    print('\n')
    print('image1[n1] ', image1[n1])
    print('\n')
    im = Image.open(input_folder + '\\' + image1[n1])
    a = np.array(im, dtype='uint8')
    width, height = im.size
    print('width ', width)
    print('height ', height)
    a = np.where(a == [0, 0, 0], [255, 255, 255], a)
    # <-- Change the looping statement below to np.where -->
    for h in range(height):
        for w in range(width):
            if h <= (height - 2) and w <= (width - 2):
                if a[h, w, 0] != 255 and a[h, w, 1] != 255 and a[h, w, 2] != 255:
                    if (a[h-1, w, 0] == 255 and a[h-1, w, 1] == 255 and a[h-1, w, 2] == 255 and
                            a[h+1, w, 0] == 255 and a[h+1, w, 1] == 255 and a[h+1, w, 2] == 255) or \
                       (a[h, w-1, 0] == 255 and a[h, w-1, 1] == 255 and a[h, w-1, 2] == 255 and
                            a[h, w+1, 0] == 255 and a[h, w+1, 1] == 255 and a[h, w+1, 2] == 255):
                        # Change the above looping statement to something like
                        # np.where(a[-??] == [255, 255, 255] or a[+??] == [255, 255, 255])
                        # so it can run faster than the for loop statement.
                        a[h, w, 0] = 255
                        a[h, w, 1] = 255
                        a[h, w, 2] = 255
I'm afraid you cannot use np.where here.
The reason is that the condition passed to np.where should address each element of the source array, whereas the criterion in your code actually relates only to the first two dimensions of the source array.
So I came up with another, quite elegant and concise solution.
Part 1: How to get the first two indices of elements where all elements in the third dimension are != 255.
To do it on the whole array, you could run:
np.not_equal(a, 255).all(axis=2)
Part 2: How to limit the "range of operation" to elements having both a previous and a next row and column.
You can do it by passing a "subrange" of the original array to the above code:
np.not_equal(a[1:-1, 1:-1], 255).all(axis=2)
You should eliminate both the first and the last column and row (in your code you failed to eliminate the first row / column).
But note that this time the resulting indices are smaller by one than before, so at a later step you will have to add 1 to them.
Part 3: A function to check whether all elements along the third dimension == 255, for some row (r) and column (c):
def all_eq(arr, r, c):
    return np.equal(arr[r, c], 255).all()
(will be used soon).
Part 4: How to get the result:
res = a.copy()
for r, c in zip(*np.where(np.not_equal(a[1:-1, 1:-1], 255).all(axis=2))):
    h = r + 1
    w = c + 1
    if all_eq(a, h-1, w) and all_eq(a, h+1, w) or \
       all_eq(a, h, w-1) and all_eq(a, h, w+1):
        res[h, w] = 255
Note that this code starts by making a copy of the original array (it will hold the result).
Then for r, c in zip(…) iterates over the indices found.
The first 2 lines in the loop add 1 to the indices found in the subrange of the original array, so now h and w indicate the row / column in the whole original array.
Then if checks whether the respective adjacent pixels have 255 in all elements.
If they do, 255 is put in all elements of the "current" pixel in the result.
You can't operate on the original array, since changed values in some pixels would "falsify" the evaluation of conditions for subsequent pixels.
Edit
After some research I found that it is possible to use np.where, although the solution is a bit complicated and involves quite a number of Numpy methods:
# Mask 1: Pixels with all elements != 255
m1 = np.zeros((height, width), dtype='int8')
idx = np.where(np.not_equal(a, 255).all(axis=2))
m1[idx] = 1
# Pixels with all elements == 255
m2 = np.apply_along_axis(lambda px: np.equal(px, 255).all(), 2, a).astype('int8')
# Both adjacent pixels (left / right) == 255
m2a = np.logical_and(np.insert(m2, 0, 0, axis=1)[:, :-1],
                     np.insert(m2, width, 0, axis=1)[:, 1:])
# Both adjacent pixels (up / down) == 255
m2b = np.logical_and(np.insert(m2, 0, 0, axis=0)[:-1, :],
                     np.insert(m2, height, 0, axis=0)[1:, :])
# Mask 2: Both adjacent pixels (either vertically or horizontally) == 255
m2 = np.logical_or(m2a, m2b)
# The "final" mask
msk = np.logical_and(m1, m2)
# Generate the result
result = np.where(np.expand_dims(msk, 2), 255, a)
This solution should be substantially faster than my first concept.
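As a quick sanity check (my own addition, assuming a, res from Part 4 and result from the masked version above were all computed from the same input image), the two approaches should produce identical arrays:
import numpy as np

# Both versions write 255 into the same interior pixels, so the outputs should match
print(np.array_equal(res, result))  # expected: True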

Can you solve maximum gap for a chain of elements in SQL?

I have a difficult query I have to make in SQL (PostgreSQL). I have tried to explain the problem below.
I have a chain of elements, each having a max gap to the next. I want to calculate the "distance" matrix. So take the following 4 elements:
example_id,id,max_gap
0,0,2
0,1,5
0,2,
0,3,4
then the max_gap between each pair of elements should be the following for this example:
example_id,id_from,id_to,max_gap
0,0,0,0
0,0,1,2
0,0,2,7
0,0,3,
0,1,0,-2
0,1,1,0
0,1,2,5
0,1,3,
0,2,0,-7
0,2,1,-5
0,2,2,0
0,2,3,
0,3,0,
0,3,1,
0,3,2,
0,3,3,0
So if any of the elements between two elements has max_gap infinity (empty), then the max_gap between those two elements is infinity.
The challenge is to solve this problem in SQL (since I need to have this in a SQL trigger).
The following Python code can be used to create test cases:
from random import randint, random
from itertools import groupby

n_examples = 100

def generate_examples(n):
    out = []
    for i in range(n):
        for j in range(randint(1, 10)):
            max_dist = randint(0, 10)
            if random() > 0.75:
                max_dist = None
            out.append([i, j, max_dist])
    return out

def max_dist_between_all(example):
    example_id = example[0][0]
    n = len(example)
    return [(example_id, i, j, calc_dist(i, j, example)) for i in range(n) for j in range(n)]

def calculate_max_dist_between_all_examples(examples):
    return [result
            for _, example in groupby(examples, lambda x: x[0])
            for result in max_dist_between_all(list(example))]

def calc_dist(i, j, example):
    if j < i:
        i, j = j, i
        sign = -1
    else:
        sign = 1
    max_dist = 0
    for k in range(i, j):
        max_dist_between_step = example[k][2]
        if max_dist_between_step is None:
            return None
        max_dist += max_dist_between_step
    return sign * max_dist

examples = generate_examples(n_examples)

def print_in_csv(input_, headers):
    print(",".join(headers))
    print("\n".join([",".join(str(e) if e is not None else "" for e in l) for l in input_]))

print_in_csv(examples, ["example_id", "id", "max_gap"])
print()
print_in_csv(calculate_max_dist_between_all_examples(examples), ["example_id", "id_from", "id_to", "max_gap"])
Do you just want a self join?
select e1.example_id, e1.id, e2.id, e1.max_gap - e2.max_gap
from elements e1
join elements e2
  on e1.example_id = e2.example_id

How to optimize the linear coefficients for numpy arrays in a maximization function?

I have to optimize the coefficients for three numpy arrays so that they maximize my evaluation function.
I have a target array called train['target'] and three prediction arrays named array1, array2 and array3.
I want to find the best linear coefficients, i.e. x, y, z for these three arrays, which will maximize the function
roc_auc_score(train['target'], x*array1 + y*array2 + z*array3)
The above function is maximal when the prediction is closest to the target,
i.e. x*array1 + y*array2 + z*array3 should be as close as possible to train['target'].
The range is x, y, z >= 0 and x, y, z <= 1.
Basically I am trying to find the weights x, y, z for each of the three arrays which would make the function
x*array1 + y*array2 + z*array3 closest to train['target'].
Any help in getting this would be appreciated.
I used pulp.LpProblem('Giapetto', pulp.LpMaximize) to do the maximization. It works for normal numbers, integers etc., however it fails when trying to do this with arrays.
import numpy as np
import pulp
from sklearn.metrics import roc_auc_score

# create the LP object, set up as a maximization problem
prob = pulp.LpProblem('Giapetto', pulp.LpMaximize)
# set up decision variables
x = pulp.LpVariable('x', lowBound=0)
y = pulp.LpVariable('y', lowBound=0)
z = pulp.LpVariable('z', lowBound=0)
score = roc_auc_score(train['target'], x*array1 + y*array2 + z*array3)
prob += score
coef = x + y + z
prob += (coef == 1)
# solve the LP using the default solver
optimization_result = prob.solve()
# make sure we got an optimal solution
assert optimization_result == pulp.LpStatusOptimal
# display the results
for var in (x, y, z):
    print('Optimal weekly number of {} to produce: {:1.0f}'.format(var.name, var.value()))
I am getting an error at the line
score = roc_auc_score(train['target'], x*array1 + y*array2 + z*array3)
TypeError: unsupported operand type(s) for /: 'int' and 'LpVariable'
I can't progress beyond this line when using arrays. I'm not sure if my approach is correct. Any help in optimizing the function would be appreciated.
When you add sums of array elements to a PuLP model, you have to use built-in PuLP constructs like lpSum to do it -- you can't just add arrays together (as you discovered).
So your score definition should look something like this:
score = pulp.lpSum([train['target'][i] - (x * array1[i] + y * array2[i] + z * array3[i]) for i in arr_ind])
A few notes about this:
[+] You didn't provide the definition of roc_auc_score so I just pretended that it equals the sum of the element-wise difference between the target array and the weighted sum of the other 3 arrays.
[+] I suspect your actual calculation for roc_auc_score is nonlinear; more on this below.
[+] arr_ind is a list of the indices of the arrays, which I created like this:
# build array index
arr_ind = range(len(array1))
[+] You also didn't include the arrays, so I created them like this:
array1 = np.random.rand(10, 1)
array2 = np.random.rand(10, 1)
array3 = np.random.rand(10, 1)
train = {}
train['target'] = np.ones((10, 1))
Here is my complete code, which compiles and executes, though I'm sure it doesn't give you the result you are hoping for, since I just guessed about target and roc_auc_score:
import numpy as np
import pulp
# create the LP object, set up as a maximization problem
prob = pulp.LpProblem('Giapetto', pulp.LpMaximize)
# dummy arrays since arrays weren't in OP code
array1 = np.random.rand(10, 1)
array2 = np.random.rand(10, 1)
array3 = np.random.rand(10, 1)
# build array index
arr_ind = range(len(array1))
# set up decision variables
x = pulp.LpVariable('x', lowBound=0)
y = pulp.LpVariable('y', lowBound=0)
z = pulp.LpVariable('z', lowBound=0)
# dummy roc_auc_score since roc_auc_score wasn't in OP code
train = {}
train['target'] = np.ones((10, 1))
score = pulp.lpSum([train['target'][i] - (x * array1[i] + y * array2[i] + z * array3[i]) for i in arr_ind])
prob += score
coef = x + y + z
prob += coef == 1
# solve the LP using the default solver
optimization_result = prob.solve()
# make sure we got an optimal solution
assert optimization_result == pulp.LpStatusOptimal
# display the results
for var in (x, y, z):
    print('Optimal weekly number of {} to produce: {:1.0f}'.format(var.name, var.value()))
Output:
Optimal weekly number of x to produce: 0
Optimal weekly number of y to produce: 0
Optimal weekly number of z to produce: 1
Process finished with exit code 0
Now, if your roc_auc_score function is nonlinear, you will have additional troubles. I would encourage you to try to formulate the score in a way that is linear, possibly using additional variables (for example, if you want the score to be an absolute value).
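To make that last remark concrete, here is a minimal sketch of the standard trick for keeping an absolute-value objective linear. This is my own illustration, not code from the answer above; the names abs_fit, dev and pred_i are placeholders, and the data is the same kind of dummy data used earlier:
import numpy as np
import pulp

# Hypothetical data mirroring the dummy arrays above (flattened to 1-D for simplicity)
array1 = np.random.rand(10)
array2 = np.random.rand(10)
array3 = np.random.rand(10)
target = np.ones(10)
n = len(target)

prob = pulp.LpProblem('abs_fit', pulp.LpMinimize)
x = pulp.LpVariable('x', lowBound=0, upBound=1)
y = pulp.LpVariable('y', lowBound=0, upBound=1)
z = pulp.LpVariable('z', lowBound=0, upBound=1)

# One auxiliary variable per element bounds the absolute deviation from above
dev = [pulp.LpVariable('dev_%d' % i, lowBound=0) for i in range(n)]
for i in range(n):
    pred_i = x * array1[i] + y * array2[i] + z * array3[i]
    prob += dev[i] >= target[i] - pred_i
    prob += dev[i] >= pred_i - target[i]

prob += x + y + z == 1      # weights sum to one, as in the original model
prob += pulp.lpSum(dev)     # objective: minimize the total absolute deviation

prob.solve()
print([v.value() for v in (x, y, z)])
Minimizing the total absolute deviation this way stays within linear programming, whereas the true roc_auc_score cannot be expressed as a linear objective.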

Create line network from closest points with boundaries

I have a set of points and I want to create a line / road network from those points. Firstly, I need to determine the closest point to each of the points. For that, I used a KD-tree and developed code like this:
def closestPoint(source, X=None, Y=None):
    df = pd.DataFrame(source).copy(deep=True)  # Ensure source is a dataframe, working on a copy to keep the datasource
    if (X is None and Y is None):
        raise ValueError("Please specify coordinate")
    elif (not X in df.keys() and not Y in df.keys()):
        raise ValueError("X and/or Y is/are not in column names")
    else:
        df["coord"] = tuple(zip(df[X], df[Y]))  # create a coordinate
        if (df["coord"].duplicated):
            uniq = df.drop_duplicates("coord")["coord"]
            uniqval = list(uniq.get_values())
            dupl = df[df["coord"].duplicated()]["coord"]
            duplval = list(dupl.get_values())
            for kq, vq in uniq.items():
                clstu = spatial.KDTree(uniqval).query(vq, k=3)[1]
                df.at[kq, "coord"] = [vq, uniqval[clstu[1]]]
                if ([uniqval[clstu[1]], vq] in list(df["coord"])):
                    df.at[kq, "coord"] = [vq, uniqval[clstu[2]]]
            for kd, vd in dupl.items():
                clstd = spatial.KDTree(duplval).query(vd, k=1)[1]
                df.at[kd, "coord"] = [vd, duplval[clstd]]
        else:
            val = df["coord"].get_values()
            for k, v in df["coord"].items():
                clst = spatial.KDTree(val).query(vd, k=3)[1]
                df.at[k, "coord"] = [v, val[clst[1]]]
                if ([val[clst[1]], v] in list(df["coord"])):
                    df.at[k, "coord"] = [v, val[clst[2]]]
    return df["coord"]
The code can return the closest points. However, I need to ensure that no double lines are created (e.g. (x,y) to (x1,y1) and (x1,y1) to (x,y)), and also that each point is used only once as a starting point of a line and once as an end point of a line, even if it is the closest point to several other points.
Below is the visualization of the result:
[image: Result of the code]
What I want:
[image: What I want]
I've also tried to separate the origin and target coordinate and do it like this:
df["coord"] = tuple(zip(df[X],df[Y])) #create a coordinate
df["target"] = "" #create a column for target points
count = 2 # create a count iteration
if (df["coord"].duplicated):
uniq = df.drop_duplicates("coord")["coord"]
uniqval = list(uniq.get_values())
for kq,vq in uniq.items():
clstu = spatial.KDTree(uniqval).query(vq, k = count)[1]
while not vq in (list(df["target"]) and list(df["coord"])):
clstu = spatial.KDTree(uniqval).query(vq, k = count)[1]
df.set_value(kq, "target", uniqval[clstu[count-1]])
else:
count += 1
clstu = spatial.KDTree(uniqval).query(vq, k = count)[1]
df.set_value(kq, "target", uniqval[clstu[count-1]])
but this returns an error:
IndexError: list index out of range
Can anyone help me with this? Many thanks!
Answering now about the global strategy, here is what I would do (rough pseudo-algorithm):
current_point = one starting point in uniqval
while (uniqval not empty)
    construct KDTree from uniqval and use it for next line
    next_point = point in uniqval closest to current_point
    record next_point as target for current_point
    remove current_point from uniqval
    current_point = next_point
What you will obtain is a linear graph joining all your points, using closest neighbors "in some way". I don't know if it will fit your needs. You would also obtain a linear graph by taking next_point at random...
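As a rough Python sketch of that pseudo-algorithm (my own illustration; it assumes the points are available as a plain list of (x, y) tuples and uses scipy.spatial.KDTree, rebuilding the tree after each removal as described above):
import numpy as np
from scipy import spatial

def chain_points(points):
    # points: list of (x, y) tuples; returns a list of (source, target) segments
    remaining = list(points)
    segments = []
    current = remaining.pop(0)                       # one starting point
    while remaining:
        tree = spatial.KDTree(np.array(remaining))   # rebuild tree on remaining points
        _, nearest = tree.query(current, k=1)        # index of the closest remaining point
        next_point = remaining.pop(nearest)
        segments.append((current, next_point))       # record current -> next_point
        current = next_point
    return segments
Since every point is removed once it has been used, each point appears at most once as a source and once as a target, so no duplicate or reversed segments are produced.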
It is hard to comment on your global strategy without further detail about the kind of road network you want to obtain. So let me just comment on your specific code and explain why the "out of range" error happens. I hope this can help.
First, are you aware that (list_a and list_b) will return list_a if it is empty, and list_b otherwise? Second, isn't the condition (vq in list(df["coord"])) always True? If so, your while loop always executes only the else statement, and at the last iteration of the for loop, (count-1) becomes greater than the total number of (unique) points. Hence your KDTree query does not return enough points and clstu[count-1] is out of range.
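To illustrate the first point about Python's and with a generic example (not tied to the question's data):
>>> [] and [1, 2]    # left operand is empty, i.e. falsy, so it is returned
[]
>>> [3] and [1, 2]   # left operand is truthy, so the right operand is returned
[1, 2]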

Retrieve indices for rows of a PyTables table matching a condition using `Table.where()`

I need the indices (as a numpy array) of the rows matching a given condition in a table (with billions of rows), and this is the line I currently use in my code, which works, but is quite ugly:
indices = np.array([row.nrow for row in the_table.where("foo == 42")])
It also takes half a minute, and I'm sure that the list creation is one of the reasons why.
I could not find an elegant solution yet and I'm still struggling with the pytables docs, so does anybody know any magical way to do this more beautifully and maybe also a bit faster? Maybe there is a special query keyword I am missing, since I have the feeling that pytables should be able to return the indices of the matched rows as a numpy array.
tables.Table.get_where_list() gives indices of the rows matching a given condition
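In the question's terms, that is a single call replacing the list comprehension (same condition string, returning the row coordinates directly as an array):
indices = the_table.get_where_list("foo == 42")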
I read the source of pytables; where() is implemented in Cython, but it still seems not fast enough. Here is a more involved method that can speed it up:
Create some data first:
from tables import *
import numpy as np

class Particle(IsDescription):
    name = StringCol(16)      # 16-character string
    idnumber = Int64Col()     # signed 64-bit integer
    ADCcount = UInt16Col()    # unsigned short integer
    TDCcount = UInt8Col()     # unsigned byte
    grid_i = Int32Col()       # 32-bit integer
    grid_j = Int32Col()       # 32-bit integer
    pressure = Float32Col()   # float (single-precision)
    energy = Float64Col()     # double (double-precision)

h5file = open_file("tutorial1.h5", mode="w", title="Test file")
group = h5file.create_group("/", 'detector', 'Detector information')
table = h5file.create_table(group, 'readout', Particle, "Readout example")
particle = table.row
for i in range(1001000):
    particle['name'] = 'Particle: %6d' % (i)
    particle['TDCcount'] = i % 256
    particle['ADCcount'] = (i * 256) % (1 << 16)
    particle['grid_i'] = i
    particle['grid_j'] = 10 - i
    particle['pressure'] = float(i * i)
    particle['energy'] = float(particle['pressure'] ** 4)
    particle['idnumber'] = i * (2 ** 34)
    # Insert a new particle record
    particle.append()
table.flush()
h5file.close()
Read the column in chunks, append the indices into a list, and finally concatenate the list into an array. You can change the chunk size according to your memory size:
h5file = open_file("tutorial1.h5")
table = h5file.get_node("/detector/readout")
size = 10000
col = "energy"
buf = np.zeros(batch, dtype=table.coldtypes[col])
res = []
for start in range(0, table.nrows, size):
length = min(size, table.nrows - start)
data = table.read(start, start + batch, field=col, out=buf[:length])
tmp = np.where(data > 10000)[0]
tmp += start
res.append(tmp)
res = np.concatenate(res)