BigQuery/SQL: Sum over intervals indicated by a secondary table

Suppose I have two tables: intervals contains index intervals (its columns are i_min and i_max) and values contains indexed values (with columns i and x). Here's an example:
values:              intervals:
+---+---+            +-------+-------+
| i | x |            | i_min | i_max |
+---+---+            +-------+-------+
| 1 | 1 |            |     1 |     4 |
| 2 | 0 |            |     6 |     6 |
| 3 | 4 |            |     6 |     6 |
| 4 | 9 |            |     6 |     6 |
| 6 | 7 |            |     7 |     9 |
| 7 | 2 |            |    12 |    17 |
| 8 | 2 |            +-------+-------+
| 9 | 2 |
+---+---+
I want to sum the values of x for each interval:
result:
+-------+-------+-----+
| i_min | i_max | sum |
+-------+-------+-----+
|     1 |     4 |  13 |  // 1+0+4+9
|     6 |     6 |   7 |
|     6 |     6 |   7 |
|     6 |     6 |   7 |
|     7 |     9 |   6 |  // 2+2+2
|    12 |    17 |   0 |
+-------+-------+-----+
In some SQL engines, this could be done using:
SELECT
  i_min,
  i_max,
  (SELECT SUM(x)
   FROM values
   WHERE i BETWEEN intervals.i_min AND intervals.i_max) AS sum_x
FROM
  intervals
except that type of query is not allowed by BigQuery ("Subselect not allowed in SELECT clause." or "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join." depending on the syntax used).
There must be a way to do this with window functions, but I can't figure out how — all examples I've seen have the partition as part of the table. Is there an option that doesn't use CROSS JOIN? If not, what's the most efficient way to do this CROSS JOIN?
Some notes on my data:
Both tables contain many (10⁸-10⁹) rows.
There might be repeated rows in intervals, but not repeated values of i.
Two intervals in intervals are either identical or entirely disjoint (no partial overlaps).
The union of all intervals is typically close to the set of all values of i (so it forms a partition of this space).
Intervals might be large (say, i_max-i_min < 10⁶).

Try below (BigQuery Standard SQL):
#standardSQL
SELECT
  i_min, i_max, SUM(x) AS sum_x
FROM (
  -- ROW_NUMBER gives each interval row a unique id, so duplicate
  -- (i_min, i_max) pairs survive the GROUP BY as separate rows
  SELECT i_min, i_max, ROW_NUMBER() OVER() AS line
  FROM `project.dataset.intervals`
) AS intervals
JOIN (
  -- the extra (NULL, 0) row, together with the OR below, keeps intervals
  -- that match no values at all, so they come out with sum_x = 0
  SELECT i, x FROM `project.dataset.values`
  UNION ALL SELECT NULL, 0
) AS values
ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL
GROUP BY i_min, i_max, line
-- ORDER BY i_min
You can play/test with dummy data from the question as below:
#standardSQL
WITH intervals AS (
  SELECT 1 AS i_min, 4 AS i_max UNION ALL
  SELECT 6, 6 UNION ALL
  SELECT 6, 6 UNION ALL
  SELECT 6, 6 UNION ALL
  SELECT 7, 9 UNION ALL
  SELECT 12, 17
),
values AS (
  SELECT 1 AS i, 1 AS x UNION ALL
  SELECT 2, 0 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 4, 9 UNION ALL
  SELECT 6, 7 UNION ALL
  SELECT 7, 2 UNION ALL
  SELECT 8, 2 UNION ALL
  SELECT 9, 2
)
SELECT
  i_min, i_max, SUM(x) AS sum_x
FROM (SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM intervals) AS intervals
JOIN (SELECT i, x FROM values UNION ALL SELECT NULL, 0) AS values
ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL
GROUP BY i_min, i_max, line
-- ORDER BY i_min
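If the OR-based non-equi JOIN above proves too slow at 10⁸-10⁹ rows, one further idea (a sketch only, untested at that scale; the 1,000,000 bucket width and the project.dataset names are assumptions) is to exploit the stated bound i_max - i_min < 10⁶ so that the join gets an equality BigQuery can hash on:
#standardSQL
-- Each interval is expanded to the coarse buckets it overlaps (at most two,
-- given the bound on interval width); each value joins on its single bucket
-- via an equality, and the BETWEEN filter discards the false candidates.
SELECT
  i_min, i_max, SUM(IF(v.i BETWEEN b.i_min AND b.i_max, v.x, 0)) AS sum_x
FROM (
  SELECT iv.i_min, iv.i_max, iv.line, bucket
  FROM (
    SELECT i_min, i_max, ROW_NUMBER() OVER() AS line
    FROM `project.dataset.intervals`
  ) AS iv, UNNEST(GENERATE_ARRAY(DIV(iv.i_min, 1000000), DIV(iv.i_max, 1000000))) AS bucket
) AS b
LEFT JOIN `project.dataset.values` v
  ON DIV(v.i, 1000000) = b.bucket
GROUP BY i_min, i_max, line
-- ORDER BY i_min
The LEFT JOIN plus the 0-valued ELSE branch of IF() plays the same role as the (NULL, 0) union row above: intervals with no matching values still come out with sum_x = 0.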

Related

Efficient query to Group by column name in SQL or hive

Imagine I have a table with 2 columns m_1 and m_2:
m1 | m2
 3 | 17
 3 | 18
 4 | 17
 9 |  9
I would like to get a table with 3 columns:
m is the index of the column (in my example, 1 or 2)
d is the data contained in the table
count is the number of occurrences of each value, grouped by value and index.
In the example, the result is:
m   | d  | count
m_1 | 3  | 2
m_1 | 4  | 1
m_1 | 9  | 1
m_2 | 17 | 2
m_2 | 18 | 1
m_2 | 9  | 1
The first line must be read as 'data 3 occurs 2 times in column m_1'.
A naive solution is to execute a parametric query like this once per column:
for (i in 1 .. 2)
    SELECT CONCAT('m_', i), m_i, count(*) FROM table GROUP BY m_i
But this algorithm scans my table twice. That is a problem, since I have 255 columns m and billions of rows.
Would the solution become easier if I used Hive instead of a relational database?
You can write this using union all and group by:
select colname, d, count(*)
from ((select 'm_1' as colname, m1 as d from t) union all
      (select 'm_2' as colname, m2 as d from t)
     ) m12
group by colname, d;
Alternatively, in Hive you can scan the table only once by exploding the columns with posexplode(array(m1,m2)):
select concat('m_', cast(pe.pos + 1 as string)) as m
      ,pe.val as d
      ,count(*) as `count`
from mytable t
     lateral view posexplode(array(m1, m2)) pe as pos, val
group by pos
        ,val
;
+------+-----+--------+
| m    | d   | count  |
+------+-----+--------+
| m_1  | 3   | 2      |
| m_1  | 4   | 1      |
| m_1  | 9   | 1      |
| m_2  | 9   | 1      |
| m_2  | 17  | 2      |
| m_2  | 18  | 1      |
+------+-----+--------+
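A variant of the same single-scan idea (a sketch, not tested) uses a map instead of an array, so the real column names travel with the values and no concat() reconstruction is needed:
-- explode(map(...)) emits one (name, value) row per column; with 255
-- columns the map() call is long, but it is written once and the table
-- is still scanned a single time.
select m, d, count(*) as `count`
from mytable t
     lateral view explode(map('m_1', m1, 'm_2', m2)) pe as m, d
group by m, d;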

BigQuery Standard SQL Pivot Structs and Sum over non-overlapping windows

I'm trying to query a table that uses a basic repeating field to store data like this:
+---+----------+------------+
| i | data.key | data.value |
+---+----------+------------+
| 0 | a        |          1 |
|   | b        |          2 |
| 1 | a        |          3 |
|   | b        |          4 |
| 2 | a        |          5 |
|   | b        |          6 |
| 3 | a        |          7 |
|   | b        |          8 |
+---+----------+------------+
I'm trying to figure out how to run a query that gets a result like
+---+----+----+
| i |  a |  b |
+---+----+----+
| 1 |  4 |  6 |
| 3 | 12 | 14 |
+---+----+----+
where each row represents a non-overlapping sum (i.e. i=1 is the sum of rows i=0 and i=1) and the data has been pivoted such that data.key is now a column.
Problem 1:
I did my best to convert this answer to use Standard SQL and ended up with:
SELECT
  i,
  (SELECT SUM(value) FROM UNNEST(data) WHERE key = 'a') AS a,
  (SELECT SUM(value) FROM UNNEST(data) WHERE key = 'b') AS b
FROM
  `dataset.testing.dummy`
This works, but I'm wondering if there is a better way to do this, especially since it produces a particularly verbose query when trying to use analytic functions:
SELECT
  i,
  SUM(a) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS a,
  SUM(b) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS b
FROM (
  SELECT
    i,
    (SELECT SUM(value) FROM UNNEST(data) WHERE key = 'a') AS a,
    (SELECT SUM(value) FROM UNNEST(data) WHERE key = 'b') AS b
  FROM
    `dataset.testing.dummy`)
ORDER BY
  i;
Problem 2:
How do I write a ROWS or RANGE clause such that the resulting windows don't overlap? In the last query, I get a rolling sum over the data, which isn't quite what I'm looking for.
+---+----+----+
| i |  a |  b |
+---+----+----+
| 0 |  1 |  2 |
| 1 |  4 |  6 |
| 2 |  8 | 10 |
| 3 | 12 | 14 |
+---+----+----+
The rolling sum produces a result for each row, whereas I'm attempting to reduce the number of rows returned.
Using a temporary SQL function plus a named window helps with the verbosity. I had to use another subselect to apply the filter on i afterward, though. Here's a self-contained example:
#standardSQL
CREATE TEMP FUNCTION SumKey(
  data ARRAY<STRUCT<key STRING, value INT64>>,
  target_key STRING) AS (
  (SELECT SUM(value) FROM UNNEST(data) WHERE key = target_key)
);
WITH Input AS (
  SELECT
    0 AS i,
    ARRAY<STRUCT<key STRING, value INT64>>[('a', 1), ('b', 2)] AS data UNION ALL
  SELECT 1, ARRAY<STRUCT<key STRING, value INT64>>[('a', 3), ('b', 4)] UNION ALL
  SELECT 2, ARRAY<STRUCT<key STRING, value INT64>>[('a', 5), ('b', 6)] UNION ALL
  SELECT 3, ARRAY<STRUCT<key STRING, value INT64>>[('a', 7), ('b', 8)]
)
SELECT * FROM (
  SELECT
    i,
    SUM(a) OVER W AS a,
    SUM(b) OVER W AS b
  FROM (
    SELECT
      i,
      SumKey(data, 'a') AS a,
      SumKey(data, 'b') AS b
    FROM Input
  )
  WINDOW W AS (ORDER BY i ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
WHERE MOD(i, 2) = 1
ORDER BY i;
This results in:
+---+----+----+
| i |  a |  b |
+---+----+----+
| 1 |  4 |  6 |
| 3 | 12 | 14 |
+---+----+----+
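If the goal is strictly non-overlapping pairwise sums, a shorter alternative (a sketch reusing the SumKey function and the Input data from above, and assuming rows always come in complete pairs i = 0/1, 2/3, ...) skips the window function entirely and groups on integer division; this replaces the final SELECT in the script above:
SELECT
  MAX(i) AS i,                 -- the odd index labels each pair
  SUM(SumKey(data, 'a')) AS a,
  SUM(SumKey(data, 'b')) AS b
FROM Input
GROUP BY DIV(i, 2)             -- rows 0/1, 2/3, ... share a group
ORDER BY i;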

Grouping by similar values in multiple columns

I have a table of entities with an id and a category (a few distinct values, NULL allowed) for 3 different years (the category can change from one year to another), in 'wide' table format:
| ID  | CATEG_Y1 | CATEG_Y2 | CATEG_Y3 |
+-----+----------+----------+----------+
| 1   | NULL     | B        | C        |
| 2   | A        | A        | C        |
| 3   | B        | A        | NULL     |
| 4   | A        | C        | B        |
| ... | ...      | ...      | ...      |
I would like to simply count the number of entities per category, grouped by category, independently for each year:
+-------+----+----+----+
| CATEG | Y1 | Y2 | Y3 |
+-------+----+----+----+
| A     |  6 |  4 |  5 |  <- 6 entities w/ categ_y1, 4 w/ categ_y2, 5 w/ categ_y3
| B     |  3 |  1 | 10 |
| C     |  8 |  4 |  5 |
| NULL  |  3 |  3 |  3 |
+-------+----+----+----+
I guess I could do it by grouping values one column after the other and UNION ALL-ing the results, but I was wondering if there is a quicker & more convenient way, and whether it can be generalized when there are more columns/years to manage (e.g. 20-30 different values).
A bit clumsy, but probably someone has a better idea. The query first collects all different categories (the union query in the from part), and then counts the occurrences with dedicated subqueries in the select part. One could omit the union part if there were a table already defining the available categories (I suppose categ_y1 is a foreign key to such a primary category table). Hope there are not too many typos:
select categories.cat,
       (select count(categ_y1) from table ty1 where categories.cat = ty1.categ_y1) as y1,
       (select count(categ_y2) from table ty2 where categories.cat = ty2.categ_y2) as y2,
       (select count(categ_y3) from table ty3 where categories.cat = ty3.categ_y3) as y3
from (select categ_y1 as cat from table t1
      union select categ_y2 as cat from table t2
      union select categ_y3 as cat from table t3) categories
Use jsonb functions to transpose the data (from the question) to this format:
select categ, jsonb_object_agg(key, count) as jdata
from (
  select value as categ, key, count(*)
  from my_table t,
       jsonb_each_text(to_jsonb(t) - 'id')
  group by 1, 2
) s
group by 1
order by 1;
 categ | jdata
-------+-----------------------------------------------
 A     | {"categ_y1": 2, "categ_y2": 2}
 B     | {"categ_y1": 1, "categ_y2": 1, "categ_y3": 1}
 C     | {"categ_y2": 1, "categ_y3": 2}
       | {"categ_y1": 1, "categ_y3": 1}
(4 rows)
For a known (static) number of years you can easily unpack the jsonb column:
select categ, jdata->'categ_y1' as y1, jdata->'categ_y2' as y2, jdata->'categ_y3' as y3
from (
  select categ, jsonb_object_agg(key, count) as jdata
  from (
    select value as categ, key, count(*)
    from my_table t,
         jsonb_each_text(to_jsonb(t) - 'id')
    group by 1, 2
  ) s
  group by 1
) s
order by 1;
 categ | y1 | y2 | y3
-------+----+----+----
 A     |  2 |  2 |
 B     |  1 |  1 |  1
 C     |    |  1 |  2
       |  1 |    |  1
(4 rows)
To get a fully dynamic solution, you can use the function create_jsonb_flat_view() described in Flatten aggregated key/value pairs from a JSONB field.
I would do this using union all followed by aggregation:
select categ, sum(categ_y1) as y1, sum(categ_y2) as y2,
       sum(categ_y3) as y3
from ((select categ_y1 as categ, 1 as categ_y1, 0 as categ_y2, 0 as categ_y3
       from t
      ) union all
      (select categ_y2, 0 as categ_y1, 1 as categ_y2, 0 as categ_y3
       from t
      ) union all
      (select categ_y3, 0 as categ_y1, 0 as categ_y2, 1 as categ_y3
       from t
      )
     ) tt
group by categ;
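A single-scan variant of the same idea (a sketch for Postgres 9.4+, since it relies on the FILTER clause; t and the categ_y* columns are as in the question) unpivots with LATERAL VALUES and pivots back with conditional aggregates:
-- One scan over t: each row fans out to (year, category) pairs, then
-- counts are folded back per category. NULL categories group together,
-- matching the NULL row in the expected output.
select v.categ,
       count(*) filter (where v.yr = 1) as y1,
       count(*) filter (where v.yr = 2) as y2,
       count(*) filter (where v.yr = 3) as y3
from t
cross join lateral (values (1, t.categ_y1), (2, t.categ_y2), (3, t.categ_y3)) v(yr, categ)
group by v.categ
order by v.categ;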

Add cumulative total sum over many columns in Postgres

My table is like this:
+----+------+----+----+----+
| id | type | c1 | c2 | c3 |
+----+------+----+----+----+
| a  |    0 | 10 | 10 | 10 |
| a  |    0 |  0 | 10 |    |
| a  |    0 | 50 | 10 |    |
| c  |    0 |    | 10 | 20 |
| c  |    0 |    | 10 |    |
+----+------+----+----+----+
I need the output to look like this:
+------------+------+----+-----+-----+
| id         | type | c1 | c2  | c3  |
+------------+------+----+-----+-----+
| a          |    0 | 10 |  10 |  10 |
| a          |    0 |  0 |  10 |     |
| a          |    0 | 50 |  10 |     |
| c          |    0 |    |  10 |  20 |
| c          |    0 |    |  10 |     |
+------------+------+----+-----+-----+
| total      |    0 | 60 |  50 |  30 |
+------------+------+----+-----+-----+
| cumulative |    0 | 60 | 110 | 140 |
+------------+------+----+-----+-----+
My query so far:
WITH res_1 AS
  (SELECT id, c1, c2, c3 FROM cloud10k.dash_reportcard),
res_2 AS
  (SELECT 'TOTAL'::VARCHAR, SUM(c1), SUM(c2), SUM(c3) FROM cloud10k.dash_reportcard)
SELECT * FROM res_1
UNION ALL
SELECT * FROM res_2;
It produces a sum total per column.
How can I add the cumulative total sum?
Note: the demo has 3 data columns, my actual table has more than 250.
It would be very tedious and increasingly inefficient to list 250 columns over and over for the sum of columns - an O(n²) problem in disguise. Effectively, you want the equivalent of a window-function to calculate the running total over columns instead of rows.
You can:
Transform the row to a set ("unpivot").
Run the window aggregate function sum() OVER (...).
Transform the set back to a row ("pivot").
WITH total AS (
   SELECT 'total'::text AS id, 0 AS type
        , sum(c1) AS s1, sum(c2) AS s2, sum(c3) AS s3  -- more ...
   FROM   cloud10k.dash_reportcard
   )
TABLE  cloud10k.dash_reportcard
UNION ALL
TABLE  total
UNION ALL
SELECT 'cumulative', 0, a[1], a[2], a[3]  -- more ...
FROM  (
   SELECT ARRAY(
      SELECT sum(v.s) OVER (ORDER BY rn)
      FROM   total
           , LATERAL (VALUES (1, s1), (2, s2), (3, s3)) v(rn, s)  -- more ...
      )::int[] AS a
   ) sub;
See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
SELECT DISTINCT on multiple columns
The last step could also be done with crosstab() from the tablefunc module, but for this simple case it's simpler to just aggregate into an array and break out the elements into separate columns in the outer SELECT.
Alternative for Postgres 9.1
Same as above, but:
...
UNION ALL
SELECT 'cumulative'::text, 0, a[1], a[2], a[3]  -- more ...
FROM  (
   SELECT ARRAY(
      SELECT sum(v.s) OVER (ORDER BY rn)
      FROM  (
         SELECT row_number() OVER (), s
         FROM   unnest((SELECT ARRAY[s1, s2, s3] FROM total)) s  -- more ...
         ) v(rn, s)
      )::int[] AS a
   ) sub;
Consider:
PostgreSQL unnest() with element number
Just add another CTE to get the cumulative row:
WITH res_1 AS
  (SELECT id, c1, c2, c3
   FROM dash_reportcard),
res_2 AS
  (SELECT 'TOTAL'::VARCHAR, SUM(c1) AS sumC1,
          SUM(c2) AS sumC2, SUM(c3) AS sumC3
   FROM dash_reportcard),
res_3 AS
  (SELECT 'CUMULATIVE'::VARCHAR, sumC1,
          sumC2+sumC1, sumC1+sumC2+sumC3
   FROM res_2)
SELECT * FROM res_1
UNION ALL
SELECT * FROM res_2
UNION ALL
SELECT * FROM res_3;
WITH total AS (
  SELECT 'TOTAL'::VARCHAR, SUM(c1) AS sumc1, SUM(c2) AS sumc2, SUM(c3) AS sumc3
  FROM cloud10k.dash_reportcard
), cum_total AS (
  SELECT 'CUMULATIVE'::varchar, sumc1, sumc1+sumc2, sumc1+sumc2+sumc3
  FROM total
)
SELECT id, c1, c2, c3 FROM cloud10k.dash_reportcard
UNION ALL
SELECT * FROM total
UNION ALL
SELECT * FROM cum_total;

Select a row X times

I have a very specific SQL problem.
I have a table of order positions (each position belongs to one order, but that isn't relevant here):
| Article ID | Amount |
|------------|--------|
| 5          | 3      |
| 12         | 4      |
For the customer, I need an export with every physical item that is ordered, e.g.
| Article ID | Position |
|------------|----------|
| 5          | 1        |
| 5          | 2        |
| 5          | 3        |
| 12         | 1        |
| 12         | 2        |
| 12         | 3        |
| 12         | 4        |
How can I build my select statement to give me these results? I think there are two key tasks:
1) Select a row X times based on the amount
2) Set the position for each physical article
You can do it like this:
SELECT ArticleID, n.n Position
FROM table1 t JOIN
(
  SELECT a.N + b.N * 10 + 1 n
  FROM (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
        SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
      ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
        SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
) n ON n.n <= t.amount
ORDER BY ArticleID, Position
Note: subquery n generates a sequence of numbers from 1 to 100 on the fly. If you run a lot of such queries, you may consider creating a persisted tally (numbers) table and using it instead; a sketch follows below.
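For example, a persisted numbers table could be built once like this (a sketch, assuming SQL Server; size it to your largest Amount):
-- The ISNULL(...) makes the computed column NOT NULL so it can carry
-- the primary key; two system catalogs cross-joined give ample rows.
SELECT TOP (100000)
       n = ISNULL(CONVERT(INT, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))), 0)
INTO dbo.tally
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;

ALTER TABLE dbo.tally ADD CONSTRAINT PK_tally PRIMARY KEY CLUSTERED (n);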
Or, using a recursive CTE:
WITH tally AS (
  SELECT 1 n
  UNION ALL
  SELECT n + 1 FROM tally WHERE n < 100
)
SELECT ArticleID, n.n Position
FROM table1 t JOIN tally n
  ON n.n <= t.amount
ORDER BY ArticleID, Position
Output in both cases:
| ARTICLEID | POSITION |
|-----------|----------|
| 5         | 1        |
| 5         | 2        |
| 5         | 3        |
| 12        | 1        |
| 12        | 2        |
| 12        | 3        |
| 12        | 4        |
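One caveat (assuming SQL Server as the target engine): a recursive CTE stops at 100 recursion levels by default, so for amounts above 100 the limit has to be raised explicitly:
-- Same recursive tally sized for amounts up to 1000; OPTION (MAXRECURSION)
-- lifts SQL Server's default cap of 100 recursion levels.
WITH tally AS (
  SELECT 1 AS n
  UNION ALL
  SELECT n + 1 FROM tally WHERE n < 1000
)
SELECT ArticleID, n.n AS Position
FROM table1 t JOIN tally n
  ON n.n <= t.Amount
ORDER BY ArticleID, Position
OPTION (MAXRECURSION 1000);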
Query:
SELECT t1.[Article ID],
       t2.number
FROM Table1 t1,
     master..spt_values t2
WHERE t1.Amount >= t2.number
  AND t2.type = 'P'
  AND t2.number <= 255
  AND t2.number <> 0
Result:
| ARTICLE ID | NUMBER |
|------------|--------|
| 5          | 1      |
| 5          | 2      |
| 5          | 3      |
| 12         | 1      |
| 12         | 2      |
| 12         | 3      |
| 12         | 4      |