Implementing Convolution in SQL

I have a table d with fields x, y, f (PK is (x, y)) and would like to implement convolution: a new column, c, defined as the 2D convolution of f with an arbitrary kernel. In a procedural language this is easy to define (see below). I'm confident it can be defined in SQL using a JOIN, but I'm having trouble doing so.
In a procedural language, I would do:
def conv(x, y):
    c = 0
    # x_ and y_ are pronounced "x prime" and "y prime",
    # and take on *all* x and y values in the table;
    # that is, we iterate through *all* rows
    for all x_, y_:
        c += f(x_, y_) * kernel(x_ - x, y_ - y)
    return c
kernel can be any arbitrary function. In my case, it's 1/k^sqrt(x_dist^2 + y_dist^2), with kernel(0, 0) = 1.
For performance reasons, we don't need to look at every x_, y_: we can restrict the sum to points whose distance from (x, y) is below some threshold.
I think this can be done using a Cartesian product join, followed by aggregate SQL SUM, along with a WHERE clause.
One additional challenge of doing this in SQL is NULLs. A naive implementation would treat them as zeroes. What I'd like to do is instead treat the kernel as a weighted average, and just leave out NULLs. That is, I'd use a function wkernel as my kernel, and modify the code above to be:
def conv(x, y):
    c = 0
    w = 0
    for all x_, y_:
        c += f(x_, y_) * wkernel(x_ - x, y_ - y)
        w += wkernel(x_ - x, y_ - y)
    return c / w
That would make NULLs work great.
To clarify: You can't have a partial observation, where x=NULL and y=3. However, you can have a missing observation, e.g. there is no record where x=2 and y=3. I am referring to this as NULL, in the sense that the entire record is missing. My procedural code above will handle this fine.
I believe the above can be done in SQL (assuming wkernel is already implemented as a function), but I can't figure out how. I'm using Postgres 9.4.
Sample table:
Table d
x | y | f
0 | 0 | 1.4
1 | 0 | 2.3
0 | 1 | 1.7
1 | 1 | 1.2
Output (just showing one row):
x | y | c
0 | 0 | 1.4*1 + 2.3*1/k + 1.7*1/k + 1.2*1/k^1.414
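For instance, with k = 2 that works out to 1.4 + 1.15 + 0.85 + 0.45 ≈ 3.85.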
Convolution (https://en.wikipedia.org/wiki/Convolution) is a standard algorithm used throughout image processing and signal processing, and I believe it can be done in SQL, which is very useful given the large data sets we're now using.

I assumed a function wkernel, for example:
create or replace function wkernel(k numeric, xdist numeric, ydist numeric)
returns numeric language sql as $$
select 1. / pow(k, sqrt(xdist*xdist + ydist*ydist))
$$;
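As a quick sanity check of the kernel (with k = 2, the value used in the examples below):
select wkernel(2, 0, 0) as center,   -- 1
       wkernel(2, 1, 0) as adjacent, -- 0.5
       wkernel(2, 1, 1) as diagonal; -- 1/2^sqrt(2), roughly 0.375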
The following query gives what you want but without restricting to close values:
select d1.x, d1.y, SUM(d2.f*wkernel(2, d2.x-d1.x, d2.y-d1.y)) AS c
from d d1 cross join d d2
group by d1.x, d1.y;
x | y | c
---+---+-------------------------
0 | 0 | 3.850257072695778143380
1 | 0 | 4.237864186319019036455
0 | 1 | 3.862992722666908108145
1 | 1 | 3.725299918145074500610
(4 rows)
With some arbitrary restriction:
select d1.x, d1.y, SUM(d2.f*wkernel(2, d2.x-d1.x, d2.y-d1.y)) AS c
from d d1 cross join d d2
where abs(d2.x-d1.x)+abs(d2.y-d1.y) < 1.1
group by d1.x, d1.y;
x | y | c
---+---+-------------------------
0 | 0 | 3.400000000000000000000
1 | 0 | 3.600000000000000000000
0 | 1 | 3.000000000000000000000
1 | 1 | 3.200000000000000000000
(4 rows)
For the weighted average point:
select d1.x, d1.y, SUM(d2.f*wkernel(2, d2.x-d1.x, d2.y-d1.y)) / SUM(wkernel(2, d2.x-d1.x, d2.y-d1.y)) AS c
from d d1 cross join d d2
where abs(d2.x-d1.x)+abs(d2.y-d1.y) < 1.1
group by d1.x, d1.y;
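The restriction can just as well be the Euclidean distance threshold mentioned in the question instead of the Manhattan one above; a sketch, with an arbitrary threshold of 3:
select d1.x, d1.y, SUM(d2.f*wkernel(2, d2.x-d1.x, d2.y-d1.y)) / SUM(wkernel(2, d2.x-d1.x, d2.y-d1.y)) AS c
from d d1 cross join d d2
where sqrt((d2.x-d1.x)*(d2.x-d1.x) + (d2.y-d1.y)*(d2.y-d1.y)) < 3
group by d1.x, d1.y;
If d is large, adding range predicates such as d2.x BETWEEN d1.x - 3 AND d1.x + 3 (and likewise for y) may let the planner use the (x, y) primary key to narrow the candidate pairs before the exact distance test.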
Now onto the missing-information part. In the following code, the scope value (set to 1 below) is the maximum distance to be considered; adjust it to suit your kernel.
The idea is the following: we find the bounds of the considered image and generate all the information that could be needed. With your example and with a maximum scope of 1, we need all the pairs (x, y) such that -1 <= x <= 2 and -1 <= y <= 2.
Finding bounds and fixing scope=1 and k=2 (call this relation cfg):
SELECT MIN(x), MAX(x), MIN(y), MAX(y), 1, 2
FROM d;
min | max | min | max | ?column? | ?column?
-----+-----+-----+-----+----------+----------
0 | 1 | 0 | 1 | 1 | 2
Generating completed set of values (call this relation completed):
SELECT x.*, y.*, COALESCE(f, 0)
FROM cfg
CROSS JOIN generate_series(minx - scope, maxx + scope) x
CROSS JOIN generate_series(miny - scope, maxy + scope) y
LEFT JOIN d ON d.x = x.* AND d.y = y.*;
x | y | coalesce
----+----+----------
-1 | -1 | 0
-1 | 0 | 0
-1 | 1 | 0
-1 | 2 | 0
0 | -1 | 0
0 | 0 | 1.4
0 | 1 | 1.7
0 | 2 | 0
1 | -1 | 0
1 | 0 | 2.3
1 | 1 | 1.2
1 | 2 | 0
2 | -1 | 0
2 | 0 | 0
2 | 1 | 0
2 | 2 | 0
(16 rows)
Now we just have to compute the values with the query given before and the cfg and completed relations. Note that we do not compute convolution for the values on the borders:
SELECT d1.x, d1.y, SUM(d2.f*wkernel(k, d2.x-d1.x, d2.y-d1.y)) / SUM(wkernel(k, d2.x-d1.x, d2.y-d1.y)) AS c
FROM cfg cross join completed d1 cross join completed d2
WHERE d1.x BETWEEN minx AND maxx
AND d1.y BETWEEN miny AND maxy
AND abs(d2.x-d1.x)+abs(d2.y-d1.y) <= scope
GROUP BY d1.x, d1.y;
x | y | c
---+---+-------------------------
0 | 0 | 1.400000000000000000000
0 | 1 | 1.700000000000000000000
1 | 0 | 2.300000000000000000000
1 | 1 | 1.200000000000000000000
(4 rows)
All in one, this gives:
WITH cfg(minx, maxx, miny, maxy, scope, k) AS (
  SELECT MIN(x), MAX(x), MIN(y), MAX(y), 1, 2
  FROM d
), completed(x, y, f) AS (
  SELECT x.*, y.*, COALESCE(f, 0)
  FROM cfg
  CROSS JOIN generate_series(minx - scope, maxx + scope) x
  CROSS JOIN generate_series(miny - scope, maxy + scope) y
  LEFT JOIN d ON d.x = x.* AND d.y = y.*
)
SELECT d1.x, d1.y, SUM(d2.f*wkernel(k, d2.x-d1.x, d2.y-d1.y)) / SUM(wkernel(k, d2.x-d1.x, d2.y-d1.y)) AS c
FROM cfg CROSS JOIN completed d1 CROSS JOIN completed d2
WHERE d1.x BETWEEN minx AND maxx
  AND d1.y BETWEEN miny AND maxy
  AND abs(d2.x-d1.x)+abs(d2.y-d1.y) <= scope
GROUP BY d1.x, d1.y;
I hope this helps :-)

Related

Is there a way to reference the tuple below in a calculation?

I have this view here:
x | y | z
-----+------+-----
a | 645 |
b | 46 |
c | 356 |
d | 509 |
Is there a way to write a query where a z item references a different row?
For example, I want z to be the y value of the tuple below it, minus 1.
So:
z.a = y.b - 1 = 46 - 1 = 45
z.b = y.c - 1 = 356 - 1 = 355
z.c = y.d - 1 = 509 - 1 = 508
You are describing the window function lead(), which lets you access any column of the "next" row (given a partition and an ORDER BY criterion):
select
  x,
  y,
  lead(y) over(order by x) - 1 as z
from mytable
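Note that lead() returns NULL for the last row, since there is no following row. If you want a fallback there instead, the three-argument form takes a default; for example (an arbitrary choice that falls back to the row's own y):
select
  x,
  y,
  lead(y, 1, y) over(order by x) - 1 as z
from mytable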

Pandas - how to get the minimum value for each row from values across several rows

I have a pandas dataframe in the following structure:
|index | a | b | c | d | e |
| ---- | -- | -- | -- | -- | -- |
|0 | -1 | -2| 5 | 3 | 1 |
How can I get the minimum value for each row using only the positive values in columns a-e?
For the example row above, the minimum of (5,3,1) should be 1 and not (-2).
You can loop over all rows and apply your condition to each row.
For example:
import pandas as pd

df = pd.DataFrame([{"a": -2, "b": 2, "c": 5}, {"a": 3, "b": 0, "c": -1}])
#    a  b  c
# 0 -2  2  5
# 1  3  0 -1

def my_condition(li):
    # keep only the non-negative values, then take their minimum
    li = [i for i in li if i >= 0]
    return min(li)

min_cel = []
for k, r in df.iterrows():
    li = r.to_dict().values()
    min_cel.append(my_condition(li))
df["min"] = min_cel
#    a  b  c  min
# 0 -2  2  5    2
# 1  3  0 -1    0
You can also write the same code on one line:
df['min'] = df.apply(lambda row: min([i for i in row.to_dict().values() if i >= 0]), axis=1)

Custom Rolling Computation

Assume I have a model that has A(t) and B(t) governed by the following equations:
A(t) = {
WHEN B(t-1) < 10 : B(t-1)
WHEN B(t-1) >=10 : B(t-1) / 6
}
B(t) = A(t) * 2
The following table is provided as input.
SELECT * FROM model ORDER BY t;
| t | A | B |
|---|------|------|
| 0 | 0 | 9 |
| 1 | null | null |
| 2 | null | null |
| 3 | null | null |
| 4 | null | null |
I.e. we know the values of A(t=0) and B(t=0).
For each row, we want to calculate the value of A & B using the equations above.
The final table should be:
| t | A | B |
|---|---|----|
| 0 | 0 | 9 |
| 1 | 9 | 18 |
| 2 | 3 | 6 |
| 3 | 6 | 12 |
| 4 | 2 | 4 |
We've tried using lag, but because of the model's recursive-like nature we end up only getting A & B at t=1.
CREATE TEMPORARY FUNCTION A_fn(b_prev FLOAT64) AS (
  CASE
    WHEN b_prev < 10 THEN b_prev
    ELSE b_prev / 6.0
  END
);
SELECT
  t,
  CASE WHEN t = 0 THEN A ELSE A_fn(LAG(B) OVER (ORDER BY t)) END AS A,
  CASE WHEN t = 0 THEN B ELSE A_fn(LAG(B) OVER (ORDER BY t)) * 2 END AS B
FROM model
ORDER BY t;
Produces:
| t | A | B |
|---|------|------|
| 0 | 0 | 9 |
| 1 | 9 | 18 |
| 2 | null | null |
| 3 | null | null |
| 4 | null | null |
Each row is dependent on the row above it. It seems it should be possible to compute a single row at a time, while iterating through the rows? Or does BigQuery not support this type of windowing?
If it is not possible, what do you recommend?
Round #1 - starting point
Below is for BigQuery Standard SQL and works (for me) with up to 3M rows
#standardSQL
CREATE TEMP FUNCTION x(v FLOAT64, t INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>>
LANGUAGE js AS """
  // v carries B(t); substituting A(t) into B(t) = A(t) * 2 gives
  // B(t) = 2 * B(t-1) when B(t-1) < 10, else B(t) = B(t-1) / 3
  var i, result = [];
  for (i = 1; i <= t; i++) {
    if (v < 10) {v = 2 * v}
    else {v = v / 3};
    result.push({t:i, v});
  };
  return result
""";
SELECT 0 AS t, 0 AS A, 9 AS B UNION ALL
SELECT line.t, line.v / 2, line.v FROM UNNEST(x(9, 3000000)) line
Going above 3M rows produces Resources exceeded during query execution: UDF out of memory.
To overcome this, I think you should just implement it on the client so that no JS UDF limits apply. I think it is a reasonable workaround because it looks like you have no real data in BQ anyway, just one starting value (9 in this example). But even if you do have other valuable columns in the table, you can then JOIN the produced result back to the table ON the t value, so it should be OK!
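For example, if the generated rows were saved to a table, the join back could look like the sketch below (table names are placeholders; `project.dataset.generated` stands for wherever you stored the produced result):
SELECT m.* EXCEPT (A, B), g.A, g.B
FROM `project.dataset.model` m
JOIN `project.dataset.generated` g
USING (t)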
Round #2 - It could be billions ... - so let's take care of scale, parallelization
Below is a little trick to avoid JS UDF Resource and/or Memory errors.
So, I was able to run it for 2B rows in one shot!
#standardSQL
CREATE TEMP FUNCTION anchor(seed FLOAT64, len INT64, batch INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>> LANGUAGE js AS """
  // walk the whole series once, but keep only every batch-th value
  // as an "anchor" (t, v) to restart from later
  var i, result = [], v = seed;
  for (i = 0; i <= len; i++) {
    if (v < 10) {v = 2 * v} else {v = v / 3};
    if (i % batch == 0) {result.push({t:i + 1, v})};
  }; return result
""";
CREATE TEMP FUNCTION x(value FLOAT64, start INT64, len INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>>
LANGUAGE js AS """
  // expand one batch: starting from an anchor value, produce len
  // consecutive values of the series (start is not used inside the function)
  var i, result = []; result.push({t:0, v:value});
  for (i = 1; i < len; i++) {
    if (value < 10) {value = 2 * value} else {value = value / 3};
    result.push({t:i, v:value});
  }; return result
""";
CREATE OR REPLACE TABLE `project.dataset.result` AS
WITH settings AS (SELECT 9 init, 2000000000 len, 1000 batch),
anchors AS (SELECT line.* FROM settings, UNNEST(anchor(init, len, batch)) line)
SELECT 0 AS t, 0 AS A, init AS B FROM settings UNION ALL
SELECT a.t + line.t, line.v / 2, line.v
FROM settings, anchors a, UNNEST(x(v, t, batch)) line
In the above query, you "control" the initial values in the line below:
WITH settings AS (SELECT 9 init, 2000000000 len, 1000 batch),
In the above example, 9 is the initial value, 2,000,000,000 is the number of rows to be calculated, and 1000 is the batch size to process with (this one is important for keeping the BQ engine from throwing Resource and/or Memory errors; you cannot make it too big or too small. I feel I got some sense of what it needs to be, but not enough to formulate a rule).
Some stats (settings - execution time):
1M: SELECT 9 init, 1000000 len, 1000 batch - 0 min 9 sec
10M: SELECT 9 init, 10000000 len, 1000 batch - 0 min 50 sec
100M: SELECT 9 init, 100000000 len, 600 batch - 3 min 4 sec
100M: SELECT 9 init, 100000000 len, 40 batch - 2 min 56 sec
1B: SELECT 9 init, 1000000000 len, 10000 batch - 29 min 39 sec
1B: SELECT 9 init, 1000000000 len, 1000 batch - 27 min 50 sec
2B: SELECT 9 init, 2000000000 len, 1000 batch - 48 min 27 sec
Round #3 - some thoughts and comments
Obviously, as I mentioned in #1 above, this type of calculation is better suited to being implemented on a client of your choice, so it is hard for me to judge the practical value of the above, but I really had fun playing with it! In reality, I had a few more cool ideas in mind and implemented and played with them as well, but the above (in #2) was the most practical/scalable one.
Note: the most interesting part of the above solution is the anchors table. It is very cheap to generate and lets you set anchors at batch-size intervals, so with it you can, for example, calculate the value of row 2,000,035 or 1,123,456,789 without actually processing all the previous rows, and this takes a fraction of a second. Or you can parallelize the calculation of all rows by starting several threads/calculations from the respective anchors, etc. Quite a number of opportunities.
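As an illustration of that point, here is a sketch of pulling one specific row (2,000,035 in this example) via its anchor, reusing the anchor/x UDFs and settings from Round #2; the filter on a.t is intended to confine the expansion to the single relevant batch:
WITH settings AS (SELECT 9 init, 2000000000 len, 1000 batch),
anchors AS (SELECT line.* FROM settings, UNNEST(anchor(init, len, batch)) line)
SELECT a.t + line.t AS t, line.v / 2 AS A, line.v AS B
FROM settings, anchors a, UNNEST(x(v, t, batch)) line
WHERE a.t = DIV(2000035 - 1, batch) * batch + 1
AND a.t + line.t = 2000035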
Finally, it really depends on your specific use case which way to go further, so I am leaving that up to you.
It seems it should be possible to compute a single row at a time, while iterating through the rows
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated with semi-colons and BigQuery is able to run them now.
So, conceptually, your process could look like the script below:
DECLARE b_prev FLOAT64 DEFAULT NULL;
DECLARE t INT64 DEFAULT 0;
DECLARE arr ARRAY<STRUCT<t INT64, a FLOAT64, b FLOAT64>> DEFAULT [STRUCT(0, 0.0, 9.0)];
SET b_prev = 9.0 / 2;
LOOP
  SET (t, b_prev) = (t + 1, 2 * b_prev);
  IF t >= 100 THEN LEAVE;
  ELSE
    SET b_prev = CASE WHEN b_prev < 10 THEN b_prev ELSE b_prev / 6.0 END;
    SET arr = (SELECT ARRAY_CONCAT(arr, [(t, b_prev, 2 * b_prev)]));
  END IF;
END LOOP;
SELECT * FROM UNNEST(arr);
Even though the above script is simpler, represents the logic more directly for non-technical personnel, and is easier to manage, it does not fit scenarios where you need to loop through 100 or more iterations. For example, the above script took close to 2 minutes, while my original solution for the same 100 rows took just 2 seconds.
But still great for simple / smaller cases

Aggregate with groupby but with distinct condition on the aggregated column [duplicate]

I have encountered a problem like this. There is a table A that I want to aggregate, grouping by x and ordering by diff (which is abs(x - y)) incrementally.
x and y always increase. The x with the smaller value gets priority when two different x values could be paired with the same y.
x y diff
1 2 1
1 4 3
1 6 5
3 2 1
3 4 1
3 6 3
4.5 2 3.5
4.5 4 0.5
4.5 6 1.5
The aggregate function I want is:
take the y in each group which has the smallest difference from x (the smallest diff value).
BUT a y which has been taken cannot be reused (for example, y=2 will be taken in the x=1 group, so it cannot be reused in the x=3 group).
Expected result:
x y diff
1 2 1
3 4 1
4.5 4 0.5
This seems to be very tricky in plain SQL. I am using PostgreSQL. The real data will be much more complicated and longer than this idea-shooting example.
If I have properly understood your question:
test=# select * from A;
x | y | diff
---+---+------
1 | 2 | 1
1 | 4 | 3
1 | 6 | 5
3 | 2 | 1
3 | 4 | 1
3 | 6 | 3
5 | 2 | 3
5 | 4 | 1
5 | 6 | 1
(9 rows)
test=# SELECT MIN(x) AS x, y FROM A WHERE diff = 1 GROUP BY y ORDER BY x;
x | y
---+---
1 | 2
3 | 4
5 | 6
(3 rows)
SELECT MIN(x) AS x, y, MIN(diff) FROM A WHERE diff = 1 GROUP BY y ORDER BY x;
x | y | min
---+---+-----
1 | 2 | 1
3 | 4 | 1
5 | 6 | 1
(3 rows)
I added MIN(diff); if it is not needed, it can be removed.
Try it like this, with t1 as the table name and d as diff:
with cte as (
  select x, y, d from t1 where d = (select min(d) from t1) order by x
)
select t1.x, min(t1.y), min(t1.d)
from t1
inner join cte
  on t1.x = cte.x
 and not t1.y in (select y from cte where cte.x < t1.x)
group by t1.x
This is more of a comment.
This problem is essentially a graph problem: finding the closest set of pairs between two discrete sets (x and y in this case). Technically, this is a maximum matching of a weighted bipartite graph. I don't think this problem is NP-complete, but that still can make it hard to solve, particularly in SQL.
Regardless of whether or not it is hard in the theoretical sense (NP-complete is considered "theoretically hard"), it is hard to do in SQL. One issue is that greedy algorithms don't work: the same y value might be closest to all the x values. Which one should get it? The algorithm has to look further.
The only way that I can think of to do this accurately in SQL is an exhaustive approach: generate all possible combinations and then check for the one that meets your conditions. Finding all possible combinations requires generating N-factorial combinations of the Xs (or Ys), which in turn requires a lot of computation. My first thought would be to use recursive CTEs for this. However, that would only work on small problems.

Convert multi-dimensional array to records

Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
foo | bar
-----+------
1 | a
2 | b
3 | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] as foo, arr[subscript][2] as bar
FROM ( select generate_subscripts(arr,1) as subscript, arr
from (select '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] as arr) input
) sub;
This works:
select key as foo, value as bar
from json_each_text(
json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
foo | bar
-----+------
1 | a
2 | b
3 | c
Docs
I'm not sure what exactly you mean by saying "it'd be better to have something that scales with the size of the array". Of course you cannot have extra columns added to the result set as the inner array size grows, because PostgreSQL must know the exact columns of a query before executing it (so before it begins to read the string).
But I would like to propose converting the string into a normal relational representation of the matrix:
select i, j, arr[i][j] a_i_j from (
select i, generate_subscripts(arr,2) as j, arr from (
select generate_subscripts(arr,1) as i, arr
from (select ('{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][]) arr) input
) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be rather usable in further data processing, I think.
Of course, such a query can only handle arrays with a predefined number of dimensions, but all the array sizes in each dimension can change without rewriting the query, so this is a somewhat more flexible approach.
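For instance, for the original two-column case the (i, j, a_i_j) rows can be pivoted back into foo/bar columns (a sketch; m stands for the query above wrapped as a subquery or CTE):
select max(case when j = 1 then a_i_j end) as foo,
       max(case when j = 2 then a_i_j end) as bar
from m
group by i
order by i;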
ADDITION: Yes, using WITH RECURSIVE one can build a similar query capable of handling arrays with an arbitrary number of dimensions. Nonetheless, there is no way to overcome the limitation coming from the relational data model: the exact set of columns must be defined at query parse time, and there is no way to delay this until execution time. So we are forced to store all indices in one column, using another array.
Here is a query that extracts all elements from an arbitrary multi-dimensional array along with their zero-based indices (stored in another, one-dimensional array):
with recursive extract_index(k, idx, elem, arr, n) as (
  select (row_number() over()) - 1 k, idx, elem, arr, n from (
    select array[]::bigint[] idx, unnest(arr) elem, arr, array_ndims(arr) n
    from ( select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] arr ) input
  ) plain_indexed
  union all
  select k/array_length(arr,n)::bigint k, array_prepend(k%array_length(arr,n), idx) idx, elem, arr, n-1 n
  from extract_index
  where n != 1
)
select array_prepend(k, idx) idx, elem from extract_index where n = 1
Which gives:
idx | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what a real practical use one could make out of it :)