Wondering if it's possible to break out of a reduce operator in Presto. Example use case:
I have a table where one column is an array of bigints, and I want to return all rows where the magnitude of the array is less than, say, 1000. So I could write
select *
from table
where reduce(
    array_col,
    cast(0 as double),
    (s, x) -> s + power(x, 2),
    s -> if(s < power(1000, 2), TRUE, FALSE)
)
but if there are a lot of rows and the arrays are big, this can take a while. I would like the reduction to break and return FALSE as soon as the running sum of squares reaches power(1000, 2). Currently I have:
select *
from table
where reduce(
    array_col,
    cast(0 as double),
    (s, x) -> if(s >= power(1000, 2), power(1000, 2), s + power(x, 2)),
    s -> if(s < power(1000, 2), TRUE, FALSE)
)
which at least saves some computation once the sum exceeds the target value, but still has to iterate through each array element.
There is no support for "break" from array reduction.
Note: technically, you may try to hack around this by generating a failure (e.g. 1/0) when you want to break, and catching it with try. I doubt it's worth it, though.
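If you do want to try that hack, a rough sketch might look like this (untested; it assumes try() catches the division-by-zero raised inside the lambda, and it uses x * x with an integer threshold so the state stays a bigint):

select *
from table
where coalesce(
    try(
        reduce(
            array_col,
            cast(0 as bigint),
            -- fail on purpose as soon as the running sum reaches 1000^2
            (s, x) -> if(s + x * x >= 1000000, s / 0, s + x * x),
            s -> true
        )
    ),
    false
)

try() turns the deliberate failure into NULL, and coalesce() maps that to FALSE.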
I have a list of nine numbers (1-9) that I need to shuffle based on a seed, with a guarantee that each seed produces a unique permutation. I'd like to do that like this:
list.shuffle(Random(seed))
There are 9! (362,880) possible permutations of this list, and I know that if I pass the same Random seed twice, the two permutations will be identical. But I need a way to guarantee that for any given seed between 0 and 362,879, the list order will be unique compared to any other seed in that range.
Is this possible in Kotlin?
This isn't really a question about Kotlin, but about algorithms in general.
There could be a much better solution, but you can represent your seed as a number with a variable base. The first digit has a base of 9, the second has a base of 8, and so on. When dealing with base-10 numbers, we repeatedly divide by 10 and note the remainder to split them into digits. In our case we need to divide by 9, then 8, then 7, and so on. This way we convert the seed into a list of 9 digits with ranges 0-8, 0-7, 0-6, ... (this is the factorial number system). What is important: each seed has a unique list of such digits.
Now, if we create another list with the numbers 1-9, we can use the list of digits from the previous paragraph to pick numbers from it, removing them at the same time. Initially we have 9 items in our list, so the valid indexes are 0-8, which is exactly the range of the first digit. Then only 8 items remain, with indexes 0-7, which is exactly the range of the second digit. And so on.
This is not that easy to explain in words, code could be better:
fun shuffled1to9(seed: Int): List<Int> {
    require(seed in 0 until 362880)
    val remaining = (1..9).toMutableList()
    val result = mutableListOf<Int>()
    var curr = seed
    (9 downTo 2).forEach {
        val (next, pick) = curr divmod it
        result += remaining.removeAt(pick)
        curr = next
    }
    result += remaining.single()
    return result
}

infix fun Int.divmod(divisor: Int): Pair<Int, Int> {
    val quotient = this / divisor
    return quotient to (this - quotient * divisor)
}
shuffled1to9(0) returns the original order 1..9, and shuffled1to9(362879) returns the inverted order 9..1. Any number in between generates a unique ordering.
Of course, it can very easily be generalized to different lists of numbers and to different sizes.
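A possible generalization along those lines - my own sketch, not part of the original answer; it assumes n! fits in a Long (lists of up to 20 elements):

// Hypothetical extension: index into the n! permutations of any list,
// using the same variable-base (factorial number system) idea.
fun <T> List<T>.permutation(seed: Long): List<T> {
    var total = 1L
    for (k in 2..size) total *= k              // n! possible permutations
    require(seed in 0L until total)
    val remaining = toMutableList()
    val result = mutableListOf<T>()
    var curr = seed
    for (base in size downTo 2) {
        result += remaining.removeAt((curr % base).toInt())  // digit with base `base`
        curr /= base
    }
    result += remaining                        // the last remaining element, if any
    return result
}

For example, (1..9).toList().permutation(0L) should reproduce shuffled1to9(0).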
I'm trying to make a program that, when given specific values (let's say 1, 4 and 10), will figure out how many of each value are needed to reach a certain amount, say 19.
It will always try to use as many high values as possible, so in this case the result should be 10*1, 4*2, 1*1.
I tried thinking about it, but couldn't come up with an algorithm that works...
Any help or hints would be welcome!
Here is a Python solution that tries all the choices until one is found. If you pass the values in descending order, the first solution found will be the one that uses as many high values as possible:
def solve(left, idx, nums, used):
    if left == 0:
        return True
    if idx == len(nums):
        return False
    # try using the current value as many times as possible first
    j = left // nums[idx]
    while j > 0:
        used.append((nums[idx], j))
        if solve(left - j * nums[idx], idx + 1, nums, used):
            return True
        used.pop()
        j -= 1
    # finally, try skipping the current value entirely
    return solve(left, idx + 1, nums, used)

solution = []
solve(19, 0, [10, 4, 1], solution)
print(solution)  # will print [(10, 1), (4, 2), (1, 1)]
If anyone needs a simple algorithm, one way I found was:
- sort the values in descending order
- keep track of how many of each value is used
- for each value, do:
    - if the sum is equal to the target, stop
    - if it isn't the first value, remove one of the previous values
    - while the total sum of values is smaller than the objective, add the current value once
Have a nice day!
(As juviant mentioned, this won't work if it skips larger numbers and only uses smaller ones! I'll try to improve it and post a new version when I get it to work.)
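For what it's worth, a literal Python sketch of the steps above (my own reading, with "remove one of the previous values" taken as popping the most recently added value, and the inner loop stopping before it overshoots the target):

def greedy(values, target):
    used = []
    for v in sorted(values, reverse=True):
        if sum(used) == target:
            break
        if used:
            used.pop()                  # remove one of the previous values
        while sum(used) + v <= target:
            used.append(v)              # add the current value once per pass
    return used if sum(used) == target else None

print(greedy([1, 4, 10], 19))  # [4, 4, 4, 1, 1, 1, 1, 1, 1, 1]

On the 19 example this version ends with 4*3 + 1*7 rather than 10*1 + 4*2 + 1*1: it drops the 10, which is exactly the flaw mentioned above.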
I have a list of elements to be searched for in datasets of variable length. I have tried binary search, and I found it is not always efficient when the objective is to search for a list of elements.
I did the following study and concluded that if the number of elements to be searched for is less than 5% of the data, binary search is efficient; otherwise linear search is better.
Below are the details:
Number of elements: 100,000
Number of elements to be searched: 5,000
Number of iterations (binary search) = log2(N) * SearchCount = log2(100,000) * 5,000 ≈ 83,048
Any further increase in the number of search elements leads to more iterations than the linear search.
Any thoughts on this?
I am calling the function below only if the number of elements to be searched is less than 5%.
private int SearchIndex(ref List<long> entitylist, ref long[] DataList, int i, int len, ref int listcount)
{
    int Start = i;
    int End = len - 1;

    while (Start <= End)
    {
        int mid = Start + (End - Start) / 2;   // avoids int overflow of (Start + End)
        long target = DataList[mid];

        if (target == entitylist[listcount])
        {
            listcount++;
            return mid;                        // found: return the index
        }
        else if (target < entitylist[listcount])
        {
            Start = mid + 1;                   // search the upper half
        }
        else
        {
            End = mid - 1;                     // search the lower half
        }
    }

    listcount++;
    return -1;                                 // the element is not in the dataset
}
In the code I return the index rather than the value because I need to work with the index in the calling function. If the result is -1, the calling function resets i to its previous value and calls the function again with a new element to search for.
In your problem you are looking for M values in an N-long array, with N > M, but M can be quite large.
Usually this is approached as M independent binary searches (perhaps with the slight optimization of using the previous result as a starting point): that costs O(M*log(N)).
However, using the fact that the M values are also sorted, you can find all of them in one pass with a linear search. In this case the problem is O(N), which is better than O(M*log(N)) for large M.
But you have a third option: since the M values are sorted, binary-split M too, and every time you find a value you can limit the subsequent searches to the ranges on the left and on the right of the found index.
The first look-up is over all N values, the next two over (on average) N/2, then 4 over N/4 data, ... I think this scales as O(log(M)*log(N)). Not sure of it, comments welcome!
Here is some test code - I slightly modified your code, but without altering its functionality.
In case you have M = 100,000 and N = 1,000,000, the "M binary searches" approach takes about 1.8M iterations; that's more than the 1M needed to scan the N values linearly. But with what I suggest it takes just 272K iterations.
Even when the M values are very "collapsed" (e.g. consecutive) and the linear search is in its best condition (100K iterations would be enough to get all of them, see the comments in the code), the algorithm performs very well.
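Since that test code is not reproduced here, below is a small Python sketch of the third approach instead (my own, with a hypothetical search_all helper; both lists must be sorted ascending): binary-search the middle target, then recurse on the left and right halves of both ranges.

from bisect import bisect_left

def search_all(data, targets, out, lo, hi, t_lo, t_hi):
    # find each targets[t_lo:t_hi] inside data[lo:hi]; out[k] = index or -1
    if t_lo >= t_hi:
        return
    if lo >= hi:
        for k in range(t_lo, t_hi):
            out[k] = -1                 # nothing left to search in
        return
    t_mid = (t_lo + t_hi) // 2
    pos = bisect_left(data, targets[t_mid], lo, hi)
    found = pos < hi and data[pos] == targets[t_mid]
    out[t_mid] = pos if found else -1
    # smaller targets can only be left of pos, larger ones right of it
    search_all(data, targets, out, lo, pos, t_lo, t_mid)
    search_all(data, targets, out, pos + (1 if found else 0), hi, t_mid + 1, t_hi)

data = [2, 3, 5, 7, 11, 13, 17, 19]
targets = [3, 4, 17]
out = [None] * len(targets)
search_all(data, targets, out, 0, len(data), 0, len(targets))
print(out)  # [1, -1, 6]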
I am job hunting now and doing many algorithm exercises. Here is my problem:
Given two arrays a and b of the same length, make |sum(a) - sum(b)| minimal by swapping elements between a and b.
Here is my thought:
Assume we swap a[i] and b[j]; set Delt = sum(a) - sum(b) and x = a[i] - b[j].
Then Delt2 = sum(a) - a[i] + b[j] - (sum(b) - b[j] + a[i]) = Delt - 2*x.
The improvement |Delt| - |Delt2| has the same sign as |Delt|^2 - |Delt2|^2 = Delt^2 - (Delt - 2*x)^2 = 4*x*(Delt - x).
Based on the thought above I got the following code:
# greedy local search: keep swapping while some swap strictly reduces |Delt|
Delt = sum(a) - sum(b)
done = False
while not done:
    done = True
    for i in range(len(a)):
        for j in range(len(b)):
            x = a[i] - b[j]
            change = x * (Delt - x)
            if change > 0:              # this swap strictly reduces |Delt|
                a[i], b[j] = b[j], a[i]
                Delt -= 2 * x
                done = False
However, does anybody have a better solution? If you do, please tell me - I would be very grateful!
This problem is basically the optimization version of the Partition Problem, with the extra constraint that the parts have equal size. I'll prove that adding this constraint doesn't make the problem easier.
NP-Hardness proof:
Assume there is an algorithm A that solves this problem in polynomial time. Then we can solve the Partition Problem in polynomial time:
Partition(S):
    for i in range(|S|):
        S += {0}
        result <- A(S\2, S\2)         // arbitrarily split S into two equal halves
        if result is a partition:     // simple to check, since Partition is in NP
            return true
    return false                      // no partition
Correctness:
If there is a partition, denote it (S1, S2) [assume S2 has more elements]. On iteration |S2|-|S1| [i.e., after adding |S2|-|S1| zeros], the input to A will contain enough zeros that the two equal-length arrays S2 and S1+{0,0,...,0} form a partition of S, and the algorithm will yield true.
If the algorithm yields true at iteration k, we had two arrays S2, S1 with the same number of elements and equal sums. By removing the k zeros from the arrays, we get a partition of the original S, so S had a partition.
Polynomial:
Assume A takes P(n) time; the algorithm we produced takes n*P(n) time, which is also polynomial.
Conclusion:
If this problem were solvable in polynomial time, then so would be the Partition Problem, and thus P=NP. Based on this, the problem is NP-Hard.
Because this problem is NP-Hard, an exact solution will probably need an exponential algorithm. One of those is simple backtracking [I leave it as an exercise to the reader to implement a backtracking solution].
EDIT: As mentioned by @jpalecek: by simply creating the reduction S -> S + (0,0,...,0) [k zeros], one can prove NP-Hardness directly by reduction. Polynomiality is trivial, and correctness is very similar to the Partition correctness proof above [if there is a partition, adding 'balancing' zeros is possible; the other direction is simply trimming those zeros].
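Filling in the backtracking exercise mentioned above, a minimal sketch (my own, exponential time): assign each element to side a or side b, never letting either side exceed n = half the elements, and keep the best |difference| found.

def min_diff_equal_split(nums):
    # nums holds the elements of both arrays together; len(nums) must be even
    n = len(nums) // 2
    best = [float('inf')]

    def go(i, count_a, sum_a, sum_b):
        if count_a > n or (i - count_a) > n:
            return                      # one side already has more than n elements
        if i == len(nums):
            best[0] = min(best[0], abs(sum_a - sum_b))
            return
        go(i + 1, count_a + 1, sum_a + nums[i], sum_b)   # nums[i] goes to side a
        go(i + 1, count_a, sum_a, sum_b + nums[i])       # nums[i] goes to side b

    go(0, 0, 0, 0)
    return best[0]

print(min_diff_equal_split([1, 2, 3, 4]))  # 0, e.g. {1, 4} vs {2, 3}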
Just a comment: through all this swapping you can arrange the contents of both arrays any way you like, so it is unimportant which array the values are in at the start.
I can't do it in my head, but I'm pretty sure there is a constructive solution. I think you sort them first and then deal them out according to some rule, something along the lines of: if value > 0 and sum(a) > sum(b), then insert into a, else into b.
I have a set of N items, which are sets of integers; let's assume it's ordered and call it I[1..N]. Given a candidate set, I need to find the subset of I whose members have a non-empty intersection with the candidate.
So, for example, if:
I = [{1,2}, {2,3}, {4,5}]
I'm looking to define valid_items(items, candidate), such that:
valid_items(I, {1}) == {1}
valid_items(I, {2}) == {1, 2}
valid_items(I, {3,4}) == {2, 3}
I'm trying to optimize for one given set I and variable candidate sets. Currently I am doing this by caching items_containing[n] = {the items whose sets contain n}. In the above example, that would be:
items_containing = [{}, {1}, {1,2}, {2}, {3}, {3}]
That is, 0 is contained in no items, 1 is contained in item 1, 2 is contained in items 1 and 2, 3 is contained in item 2, and 4 and 5 are contained in item 3.
That way, I can define valid_items(I, candidate) = union(items_containing[n] for n in candidate).
Is there any more efficient data structure (of a reasonable size) for caching the result of this union? The obvious example of space 2^N is not acceptable, but N or N*log(N) would be.
I think your current solution is optimal big-O-wise, though there are micro-optimization techniques that could improve its actual performance, such as using bitwise operations when merging the chosen set from items_containing into the valid-items set.
i.e. you store items_containing as bitmasks, where bit k set means item k+1 contains the element:
items_containing = [0b000, 0b001, 0b011, 0b010, 0b100, 0b100]
and your valid_items can use bitwise OR to merge like this:
int validItems(int[] itemsContaining, Set<Integer> candidate) {
    // if you need more than 32 items, use long (64 bits)
    // or an int[] / BitSet for the masks
    int valid = 0;
    for (int element : candidate) {
        valid |= itemsContaining[element];   // bitwise OR merges the item sets
    }
    return valid;
}
but they don't really change the Big-O performance.
One representation that might help is storing the sets in I as vectors V of size n whose entries V(i) are 0 when i is not in V and positive otherwise. Then to take the intersection of two vectors you multiply them elementwise, and to take the union you add them elementwise.
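A tiny sketch of that representation (hypothetical helper names), where sets over a universe 0..n-1 become length-n vectors:

def to_vector(s, n):
    return [1 if i in s else 0 for i in range(n)]

def intersect(u, v):
    return [a * b for a, b in zip(u, v)]   # elementwise product

def union(u, v):
    return [a + b for a, b in zip(u, v)]   # elementwise sum; entry > 0 means "present"

u, v = to_vector({1, 2}, 6), to_vector({2, 3}, 6)
print(intersect(u, v))  # [0, 0, 1, 0, 0, 0]  -> {2}
print(union(u, v))      # [0, 1, 2, 1, 0, 0]  -> {1, 2, 3}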