Can you solve maximum gap for a chain of elements in SQL?

I have a difficult query I have to make in SQL (PostgreSQL). I have tried to explain the problem below.
I have a chain of elements, each having a max gap to the next, and I want to calculate the "distance" matrix. Take the following 4 elements:
example_id,id,max_gap
0,0,2
0,1,5
0,2,
0,3,4
then the max_gap between each pair of elements should be the following for this example
example_id,id_from,id_to,max_gap
0,0,0,0
0,0,1,2
0,0,2,7
0,0,3,
0,1,0,-2
0,1,1,0
0,1,2,5
0,1,3,
0,2,0,-7
0,2,1,-5
0,2,2,0
0,2,3,
0,3,0,
0,3,1,
0,3,2,
0,3,3,0
So if any of the gaps between two elements is NULL (infinity), then the max_gap between those two elements is also NULL (infinity).
The challenge is to solve this problem in SQL (since I need to have this in a SQL trigger).
The following Python code can be used to create test cases:
from random import randint, random
from itertools import groupby

n_examples = 100

def generate_examples(n):
    out = []
    for i in range(n):
        for j in range(randint(1, 10)):
            max_dist = randint(0, 10)
            if random() > 0.75:
                max_dist = None
            out.append([i, j, max_dist])
    return out

def max_dist_between_all(example):
    example_id = example[0][0]
    n = len(example)
    return [(example_id, i, j, calc_dist(i, j, example))
            for i in range(n) for j in range(n)]

def calculate_max_dist_between_all_examples(examples):
    return [result
            for _, example in groupby(examples, lambda x: x[0])
            for result in max_dist_between_all(list(example))]

def calc_dist(i, j, example):
    if j < i:
        i, j = j, i
        sign = -1
    else:
        sign = 1
    max_dist = 0
    for k in range(i, j):
        max_dist_between_step = example[k][2]
        if max_dist_between_step is None:
            return None
        max_dist += max_dist_between_step
    return sign * max_dist

examples = generate_examples(n_examples)

def print_in_csv(input_, headers):
    print(",".join(headers))
    print("\n".join([",".join(str(e) if e is not None else "" for e in l) for l in input_]))

print_in_csv(examples, ["example_id", "id", "max_gap"])
print()
print_in_csv(calculate_max_dist_between_all_examples(examples),
             ["example_id", "id_from", "id_to", "max_gap"])

Do you just want a self join?
select e1.example_id, e1.id, e2.id, e1.max_gap - e2.max_gap
from elements e1
join elements e2 on e1.example_id = e2.example_id
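A plain self join only subtracts the raw gap values, though; the chain distance between two ids is a sum of consecutive gaps, so you also need running totals. A sketch of one way to do it in PostgreSQL with window functions, assuming a table elements(example_id, id, max_gap) where a NULL max_gap means infinity (untested against your trigger setup):

WITH pref AS (
    SELECT example_id, id,
           -- sum of the gaps strictly before this id (SQL's SUM skips
           -- NULLs, which is why the NULL count is tracked separately)
           COALESCE(SUM(max_gap) OVER w, 0) AS csum,
           -- number of NULL (infinite) gaps strictly before this id
           COUNT(*) FILTER (WHERE max_gap IS NULL) OVER w AS cnull
    FROM elements
    WINDOW w AS (PARTITION BY example_id ORDER BY id
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
SELECT example_id, p1.id AS id_from, p2.id AS id_to,
       -- NULL whenever an infinite gap lies between the two elements
       CASE WHEN p1.cnull = p2.cnull THEN p2.csum - p1.csum END AS max_gap
FROM pref p1
JOIN pref p2 USING (example_id)
ORDER BY example_id, id_from, id_to;

The idea is that dist(i, j) = csum(j) - csum(i), which comes out negative automatically when j < i, and the cnull comparison blanks out every pair with an infinite gap somewhere between them.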


Time complexity analysis - nested loops

So I have analyzed the following code:
def f2(L):
    n = len(L)
    res = []
    for i in range(500, n):
        for j in range(i, 3*i):
            k = 1
            while k < n:
                k *= 2
                res.append(k)
    return res
It's quite easy to see (and also prove rigorously) that the time complexity is $O(n^2 \log n)$.
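Spelled out, the count behind that claim is
$$\sum_{i=500}^{n-1} \sum_{j=i}^{3i-1} \log_2 n \;=\; \log_2 n \sum_{i=500}^{n-1} 2i \;=\; \Theta(n^2 \log n),$$
since the while loop doubles k from 1 until it reaches n, i.e. about $\log_2 n$ times, independently of i and j.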
I have tried to analyze two similar functions:
First one:
def f2(L):
    n = len(L)
    res = []
    for i in range(500, n):
        for j in range(i, 3*i):
            k = j  # changed from 1
            while k < n:
                k *= 2
                res.append(k)
    return res
Second one:
def f2(L):
    n = len(L)
    res = []
    for i in range(500, n):
        for j in range(i, 3*i):
            k = i  # changed from 1
            while k < n:
                k *= 2
                res.append(k)
    return res
I'm having trouble working out how to handle the innermost loop (the while loop) in these two cases.
For the original, I wrote the inner sum as running from $k=1$ to $k=\log n$, and then $\sum_{k=1}^{\log n} 1$ is just $\log n$.
I'm having trouble deciding whether the upper limit is still $\log n$ for the inner sum, and how to calculate it this time.
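One way to build intuition (a sanity check, not a proof): with k = i, the while loop doubles k from i up to n, so it runs roughly $\log_2(n/i)$ times rather than $\log_2 n$. The sketch below simply counts the inner iterations of all three variants so you can watch how the totals scale as n doubles (pure Python, so keep n modest):

# Count the while-loop iterations empirically for each variant of f2.
def count_iters(n, start):
    # start(i, j) returns the initial value of k for a given (i, j)
    total = 0
    for i in range(500, n):
        for j in range(i, 3 * i):
            k = start(i, j)
            while k < n:
                k *= 2
                total += 1
    return total

for n in (600, 1200, 2400):
    print(n,
          count_iters(n, lambda i, j: 1),   # original: k = 1
          count_iters(n, lambda i, j: j),   # first variant: k = j
          count_iters(n, lambda i, j: i))   # second variant: k = i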

Find pairs of arrays such that array_1 = -array_2

I'm searching for a way to find all the vectors from a np.meshgrid(krange, krange, krange) that are related by k = -k (pairs of opposite vectors).
For the moment I do this:
@numba.njit
def find_pairs(array):
    boolean = np.ones(len(array), dtype=np.bool_)
    pairs = []
    idx = [i for i in range(len(array))]
    while len(idx) > 1:
        e1 = idx[0]
        for e2 in idx:
            if (array[e1] == -array[e2]).all():
                boolean[e2] = False
                pairs.append([e1, e2])
                idx.remove(e1)
                if e2 != e1:
                    idx.remove(e2)
                break
    return boolean, pairs

# Give array of 3D vectors
krange = np.fft.fftfreq(N)
comb_array = np.array(np.meshgrid(krange, krange, krange)).T.reshape(-1, 3)

# Take idx of the pairs (k, -k) and a boolean selection giving the position of the -k vectors
boolean, pairs = find_pairs(comb_array)
It works, but the execution time grows rapidly with N... Maybe someone has already dealt with that?
The main problem is that comb_array has a shape of (R, 3) where R = N**3, and the nested loop in find_pairs runs in at least quadratic time, since idx.remove runs in linear time and is called inside the for loop. Moreover, there are cases where the for loop does not change the size of idx, so the loop appears to run forever (e.g. with N=4).
One way to solve this problem in O(R log R) time is to sort the array and then check for opposite values in linear time:
import numpy as np
import numba as nb

# Give array of 3D vectors
krange = np.fft.fftfreq(N)
comb_array = np.array(np.meshgrid(krange, krange, krange)).T.reshape(-1, 3)

# Sorting
packed = comb_array.view([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])
idx = np.argsort(packed, axis=0).ravel()
sorted_comb = comb_array[idx]

# Find pairs
@nb.njit
def findPairs(sorted_comb, idx):
    n = idx.size
    boolean = np.zeros(n, dtype=np.bool_)
    pairs = []
    cur = n - 1
    for i in range(n):
        while cur >= i:
            if np.all(sorted_comb[i] == -sorted_comb[cur]):
                boolean[idx[i]] = True
                pairs.append([idx[i], idx[cur]])
                cur -= 1
                break
            cur -= 1
    return boolean, pairs

findPairs(sorted_comb, idx)
Note that the algorithm assumes that each row has at most one valid matching pair. If there are several equal rows, they are paired two by two. If your goal is to extract all combinations of equal rows in that case, then please note that the output can grow exponentially (which is not reasonable IMHO).
This solution is pretty fast even for N = 100. Most of the time is spent in the sort, which is not very efficient (unfortunately, NumPy does not yet provide a way to do a lexicographic argsort of the rows efficiently, though this operation is fundamentally expensive).
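As an aside, the structured-array view above is one way to get a lexicographic row order; np.lexsort is the other standard spelling (a sketch; note that lexsort takes its keys last-significant-first):

# Equivalent lexicographic ordering of the rows via np.lexsort;
# the primary key is passed last, hence the reversed column order.
idx = np.lexsort((comb_array[:, 2], comb_array[:, 1], comb_array[:, 0]))
sorted_comb = comb_array[idx]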

TypeError in Python generator

Define a generator function, perms, which takes in a list of numbers and a non-negative integer n. Implement the function such that it generates all permutations of length exactly n using the elements of lst. Assume elements of lst are unique, and n <= len(lst).
def perms(lst, n):
    """
    >>> g1 = perms([1,2,3], 2)
    >>> print(list(g1))
    [[1, 2], [1, 3], [2, 1], [2, 3], [3, 1], [3, 2]]
    """
    # (the asker's body is omitted in the post; its base case used
    # `yield from lst`, which is the line discussed below)
The issue is in the line
yield from lst
Here each element of the list is yielded on its own, so every value produced for p in the loop
for p in perms(lst, n-1)
is an int rather than a list.
The correct and working solution:
def perms(lst, n):
    """
    >>> g1 = perms([1,2,3], 2)
    >>> print(list(g1))
    [[1, 2], [1, 3], [2, 1], [2, 3], [3, 1], [3, 2]]
    """
    if n == 0:
        yield []
    elif n == 1:
        for elem in lst:
            yield [elem]
    else:
        for p in perms(lst, n-1):
            for e in lst:
                if e not in p:
                    yield p + [e]
When for p in perms(lst, n-1): is called for the first time, which is the same as for p in perms([1,2,3], 2-1):, the inner call is not yielding lists but integers, because of your yield from lst: yield from lst yields each element of the iterable as-is.
So on the first pass p is a plain int such as 1, and the body then evaluates e not in p and p + [e] on an int, which of course throws a TypeError.
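As a quick cross-check (not part of the original answer), the fixed generator matches the standard library on this example, modulo tuples vs. lists:
from itertools import permutations
assert list(perms([1, 2, 3], 2)) == [list(t) for t in permutations([1, 2, 3], 2)]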

Binary-search without an explicit array

I want to perform a binary search using e.g. np.searchsorted; however, I do not want to create an explicit array containing the values. Instead, I want to define a function giving the value to be expected at the desired position of the array, e.g. p(i) = i, where i denotes the position within the array.
Generating an array of the function's values would, in my case, be neither efficient nor elegant. Is there any way to achieve this?
What about something like:
import collections.abc

class GeneratorSequence(collections.abc.Sequence):
    def __init__(self, func, size):
        self._func = func
        self._len = size

    def __len__(self):
        return self._len

    def __getitem__(self, i):
        if 0 <= i < self._len:
            return self._func(i)
        else:
            raise IndexError

    def __iter__(self):
        for i in range(self._len):
            yield self[i]
This would work with np.searchsorted(), e.g.:
import numpy as np
gen_seq = GeneratorSequence(lambda x: x ** 2, 100)
np.searchsorted(gen_seq, 9)
# 3
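As an aside, the standard library's bisect also does a true binary search over such a lazy sequence, since it only indexes it and never materializes it:
import bisect
bisect.bisect_left(gen_seq, 9)
# 3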
You could also write your own binary search function; you do not really need NumPy in this case, and avoiding it can actually be beneficial:
def bin_search(seq, item):
    first = 0
    last = len(seq) - 1
    found = False
    while first <= last and not found:
        midpoint = (first + last) // 2
        if seq[midpoint] == item:
            first = midpoint
            found = True
        else:
            if item < seq[midpoint]:
                last = midpoint - 1
            else:
                first = midpoint + 1
    return first
Which gives identical results:
all(bin_search(gen_seq, i) == np.searchsorted(gen_seq, i) for i in range(100))
# True
Incidentally, this is also WAY faster (most likely because np.searchsorted converts its input to an array first, which evaluates every element of the lazy sequence and defeats the point of the binary search):
gen_seq = GeneratorSequence(lambda x: x ** 2, 1000000)
%timeit np.searchsorted(gen_seq, 10000)
# 1 loop, best of 3: 1.23 s per loop
%timeit bin_search(gen_seq, 10000)
# 100000 loops, best of 3: 16.1 µs per loop
Inspired by @norok2's comment, I think you can use something like this:
from collections.abc import Sequence

def f(i):
    return i * 2  # Just an example

class MySeq(Sequence):
    def __init__(self, f, maxi):
        self.maxi = maxi
        self.f = f

    def __getitem__(self, x):
        if x < 0 or x > self.maxi:
            raise IndexError()
        return self.f(x)

    def __len__(self):
        return self.maxi + 1
In this case f is your function and maxi is the maximum index. This of course only works if the function f returns values in sorted order.
At this point you can use an object of type MySeq inside np.searchsorted.
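For instance, a quick usage sketch with the example f above:
import numpy as np
seq = MySeq(f, 100)       # values 0, 2, 4, ..., 200
np.searchsorted(seq, 10)
# 5, since f(5) == 10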

Odd-size numpy arrays send/receive

I would like to gather numpy array contents from all processes to one. If all arrays are of the same size, this works. However, I don't see a natural way of doing the same task for arrays of process-dependent size. Please consider the following code:
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

if rank >= size/2:
    nb_elts = 5
else:
    nb_elts = 2

# create data
lst = []
for i in xrange(nb_elts):
    lst.append(rank*3+i)
array_lst = numpy.array(lst, dtype=int)

# communicate array
result = []
if rank == 0:
    result = array_lst
    for p in xrange(1, size):
        received = numpy.empty(nb_elts, dtype=numpy.int)
        comm.Recv(received, p, tag=13)
        result = numpy.concatenate([result, received])
else:
    comm.Send(array_lst, 0, tag=13)
My problem is at the "received" allocation: how can I know what size needs to be allocated? Do I have to first send/receive each array's size?
Based on a suggestion below, I'll go with:
data_array = numpy.ones(rank + 3, dtype=int)
data_array *= rank + 5
print '[{}] data: {} ({})'.format(rank, data_array, type(data_array))

# make all processors aware of data array sizes
all_sizes = {rank: data_array.size}
gathered_all_sizes = comm_py.allgather(all_sizes)
for d in gathered_all_sizes:
    all_sizes.update(d)

# prepare Gatherv as described by @francis
nbsum = 0
sendcounts = []
displacements = []
for p in xrange(size):
    n = all_sizes[p]
    displacements.append(nbsum)
    sendcounts.append(n)
    nbsum += n

if rank == 0:
    result = numpy.empty(nbsum, dtype=numpy.int)
else:
    result = None

comm_py.Gatherv(data_array, [result, tuple(sendcounts), tuple(displacements), MPI.INT64_T], root=0)
print '[{}] gathered data: {}'.format(rank, result)
In the code you pasted, both Send() and Recv() transfer nb_elts elements. The problem is that nb_elts is not the same for every process... Hence, the number of items received does not match the number of elements that were sent, and the program complains:
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
To prevent that, the root process must compute the number of items that the other processes have sent. Hence, in the loop for p in xrange(1, size), nb_elts must be computed according to p, not rank.
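As an aside (a sketch, not in the original answer): the root can also avoid knowing the counts in advance by asking MPI for the size of each pending message via Probe() and Status.Get_count():

# Hypothetical alternative for the receive loop: probe the pending
# message to learn its size before allocating the buffer.
status = MPI.Status()
comm.Probe(source=p, tag=13, status=status)
nb_elts = status.Get_count(MPI.INT64_T)   # assumes 64-bit int elements
received = numpy.empty(nb_elts, dtype=numpy.int)
comm.Recv(received, p, tag=13)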
The following code, based on yours, has been corrected. I would add that the natural way to perform this gathering operation is to use Gatherv(). See http://materials.jeremybejarano.com/MPIwithPython/collectiveCom.html and the mpi4py documentation for instance; I added the corresponding sample code. The only tricky point is that numpy.int is 64 bits long, so the Gatherv() below uses the 8-byte MPI type MPI_DOUBLE.
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

if rank >= size/2:
    nb_elts = 5
else:
    nb_elts = 2

# create data
lst = []
for i in xrange(nb_elts):
    lst.append(rank*3+i)
array_lst = numpy.array(lst, dtype=int)

# communicate array
result = []
if rank == 0:
    result = array_lst
    for p in xrange(1, size):
        if p >= size/2:
            nb_elts = 5
        else:
            nb_elts = 2
        received = numpy.empty(nb_elts, dtype=numpy.int)
        comm.Recv(received, p, tag=13)
        result = numpy.concatenate([result, received])
else:
    comm.Send(array_lst, 0, tag=13)

if rank == 0:
    print "Send Recv, result= "+str(result)

# How to use Gatherv:
nbsum = 0
sendcounts = []
displacements = []
for p in xrange(0, size):
    displacements.append(nbsum)
    if p >= size/2:
        nbsum += 5
        sendcounts.append(5)
    else:
        nbsum += 2
        sendcounts.append(2)

if rank == 0:
    print "nbsum "+str(nbsum)
    print "sendcounts "+str(tuple(sendcounts))
    print "displacements "+str(tuple(displacements))

print "rank "+str(rank)+" array_lst "+str(array_lst)
print "numpy.int "+str(numpy.dtype(numpy.int))+" "+str(numpy.dtype(numpy.int).itemsize)+" "+str(numpy.dtype(numpy.int).name)

if rank == 0:
    result2 = numpy.empty(nbsum, dtype=numpy.int)
else:
    result2 = None

comm.Gatherv(array_lst, [result2, tuple(sendcounts), tuple(displacements), MPI.DOUBLE], root=0)

if rank == 0:
    print "Gatherv, result2= "+str(result2)
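One more aside (not in the original answer): for modest array sizes, mpi4py's lowercase, pickle-based API sidesteps the size bookkeeping entirely, at the cost of serialization overhead:

# comm.gather pickles arbitrary Python objects and handles per-rank
# sizes automatically; slower than Gatherv for large arrays.
gathered = comm.gather(array_lst, root=0)   # list of arrays on rank 0
if rank == 0:
    result = numpy.concatenate(gathered)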