BigQuery Standard SQL Pivot Structs and Sum over non-overlapping windows - google-bigquery

I'm trying to query a table that uses a basic repeating field to store data like this:
+---+----------+------------+
| i | data.key | data.value |
+---+----------+------------+
| 0 | a        | 1          |
|   | b        | 2          |
| 1 | a        | 3          |
|   | b        | 4          |
| 2 | a        | 5          |
|   | b        | 6          |
| 3 | a        | 7          |
|   | b        | 8          |
+---+----------+------------+
I'm trying to figure out how to run a query that gets a result like
+---+----+----+
| i | a  | b  |
+---+----+----+
| 1 | 4  | 6  |
| 3 | 12 | 14 |
+---+----+----+
where each row represents a non-overlapping sum (i.e. i=1 is the sum of rows i=0 and i=1) and the data has been pivoted such that data.key is now a column.
Problem 1:
I did my best to convert this answer to use Standard SQL and ended up with:
SELECT
  i,
  (SELECT SUM(value) FROM UNNEST(data) WHERE key = 'a') AS `a`,
  (SELECT SUM(value) FROM UNNEST(data) WHERE key = 'b') AS `b`
FROM
  `dataset.testing.dummy`
This works, but I'm wondering if there is a better way to do this, especially since it produces a particularly verbose query when trying to use analytic functions:
SELECT
  i,
  SUM(a) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS `a`,
  SUM(b) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS `b`
FROM (
  SELECT
    i,
    (SELECT SUM(value) FROM UNNEST(data) WHERE key = 'a') AS `a`,
    (SELECT SUM(value) FROM UNNEST(data) WHERE key = 'b') AS `b`
  FROM
    `dataset.testing.dummy`)
ORDER BY
  i;
Problem 2:
How do I write a ROW or RANGE statement such that resulting windows don't overlap. In the last query, I get a rolling sum over the data, which isn't quite what I'm looking to do.
+---+----+----+
| i | a | b |
+---+----+----+
| 0 | 1 | 2 |
| 1 | 4 | 6 |
| 2 | 8 | 10 |
| 3 | 12 | 14 |
+---+----+----+
The rolling sum produces a result for each row, whereas I'm attempting to reduce the number of rows returned.

Using a temporary SQL function plus a named window helps with the verbosity. I had to use another subselect to apply the filter on i afterward, though. Here's a self-contained example:
#standardSQL
CREATE TEMP FUNCTION SumKey(
  data ARRAY<STRUCT<key STRING, value INT64>>,
  target_key STRING) AS (
  (SELECT SUM(value) FROM UNNEST(data) WHERE key = target_key)
);
WITH Input AS (
  SELECT
    0 AS i,
    ARRAY<STRUCT<key STRING, value INT64>>[('a', 1), ('b', 2)] AS data UNION ALL
  SELECT 1, ARRAY<STRUCT<key STRING, value INT64>>[('a', 3), ('b', 4)] UNION ALL
  SELECT 2, ARRAY<STRUCT<key STRING, value INT64>>[('a', 5), ('b', 6)] UNION ALL
  SELECT 3, ARRAY<STRUCT<key STRING, value INT64>>[('a', 7), ('b', 8)]
)
SELECT * FROM (
  SELECT
    i,
    SUM(a) OVER W AS a,
    SUM(b) OVER W AS b
  FROM (
    SELECT
      i,
      SumKey(data, 'a') AS a,
      SumKey(data, 'b') AS b
    FROM Input
  )
  WINDOW W AS (ORDER BY i ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
WHERE MOD(i, 2) = 1
ORDER BY i;
This results in:
+---+----+----+
| i | a  | b  |
+---+----+----+
| 1 | 4  | 6  |
| 3 | 12 | 14 |
+---+----+----+
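A side note on Problem 2: since each output bucket here is just a pair of consecutive i values, a plain GROUP BY can replace both the window function and the final MOD filter. A minimal sketch (untested), reusing the SumKey function and the Input data from above:
SELECT
  MAX(i) AS i,
  SUM(a) AS a,
  SUM(b) AS b
FROM (
  SELECT i, SumKey(data, 'a') AS a, SumKey(data, 'b') AS b
  FROM Input
)
GROUP BY DIV(i, 2)
ORDER BY i;
MAX(i) labels each bucket with its largest i, which matches the i = 1 and i = 3 rows in the desired output.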

Related

Spark SQL query to find many-to-many mappings between two columns of the same table, ordered by maximum overlap

I want to write a Spark SQL query or PySpark code to extract the many-to-many mappings between two columns of the same table, ordered by maximum overlap.
For example:
SysA SysB
A    Y
A    Z
B    Z
B    Y
C    W
This means there is an M:M relationship between the above two columns.
Is there a way to extract all M:M combinations, ordered by maximum overlap (i.e. values which share the most mappings should be at the top), while discarding the one-to-one mappings like C-W?
Z maps to both A and B
Y maps to both A and B
A maps to both Y and Z
B maps to both Y and Z
Therefore both A,B and Y,Z have M:M relationships, while C,W is 1:1. The order would be by the count; in the above example the only mappings of size two are between A,B and Y,Z, hence both counts are 2.
Similar question: https://social.msdn.microsoft.com/Forums/en-US/fa496933-e85a-4dfe-98df-b6c29ad812f4/sql-to-find-manytomany-combinations-of-two-columns
As you requested, here is a simplified version of your similar MSDN question, identifying just the M:M relationships, ordered.
The following approaches can be used with Spark SQL.
CREATE TABLE SampleData (
  `SysA` VARCHAR(1),
  `SysB` VARCHAR(1)
);

INSERT INTO SampleData (`SysA`, `SysB`)
VALUES
  ('A', 'Y'),
  ('A', 'Z'),
  ('B', 'Z'),
  ('B', 'G'),
  ('B', 'Y'),
  ('C', 'W');
Query #1
For demo purposes I have used * instead of SysA, SysB in the final projection below.
SELECT
  *
FROM (
  SELECT
    *,
    (SELECT COUNT(*)
     FROM SampleData s
     WHERE s.SysA = sd.SysA) AS SysA_SysB,
    (SELECT COUNT(*)
     FROM SampleData s
     WHERE s.SysB = sd.SysB) AS SysB_SysA
  FROM SampleData sd
) t
WHERE t.SysA_SysB > 1 AND t.SysB_SysA > 1
ORDER BY t.SysA_SysB DESC, t.SysB_SysA DESC;
| SysA | SysB | SysA_SysB | SysB_SysA |
| ---- | ---- | --------- | --------- |
| B    | Z    | 3         | 2         |
| B    | Y    | 3         | 2         |
| A    | Y    | 2         | 2         |
| A    | Z    | 2         | 2         |
Query #2
NB: cross joins must be enabled in Spark, i.e. set spark.sql.crossJoin.enabled to true in your Spark conf.
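If you are running plain Spark SQL (for example from the SQL shell or via spark.sql()), the flag can usually also be set per session; a sketch, assuming a Spark version where the setting is still required:
SET spark.sql.crossJoin.enabled=true;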
SELECT
  s1.SysA,
  s1.SysB
FROM
  SampleData s1
CROSS JOIN
  SampleData s2
GROUP BY
  s1.SysA, s1.SysB
HAVING
  SUM(CASE WHEN s1.SysA = s2.SysA THEN 1 ELSE 0 END) > 1 AND
  SUM(CASE WHEN s1.SysB = s2.SysB THEN 1 ELSE 0 END) > 1
ORDER BY
  SUM(CASE WHEN s1.SysA = s2.SysA THEN 1 ELSE 0 END) DESC,
  SUM(CASE WHEN s1.SysB = s2.SysB THEN 1 ELSE 0 END) DESC;
| SysA | SysB |
| ---- | ---- |
| B    | Z    |
| B    | Y    |
| A    | Z    |
| A    | Y    |
Query #3 (Recommended)
WITH SampleDataOcc AS (
  SELECT
    SysA,
    SysB,
    COUNT(SysA) OVER (PARTITION BY SysA) AS SysAOcc,
    COUNT(SysB) OVER (PARTITION BY SysB) AS SysBOcc
  FROM
    SampleData
)
SELECT
  SysA,
  SysB,
  SysAOcc,
  SysBOcc
FROM
  SampleDataOcc t
WHERE
  t.SysAOcc > 1 AND
  t.SysBOcc > 1
ORDER BY
  t.SysAOcc DESC,
  t.SysBOcc DESC;
| SysA | SysB | SysAOcc | SysBOcc |
| ---- | ---- | ------- | ------- |
| B    | Z    | 3       | 2       |
| B    | Y    | 3       | 2       |
| A    | Y    | 2       | 2       |
| A    | Z    | 2       | 2       |

SQL select distinct when one column in and another column greater than

Consider the following dataset:
+----+------+-------+
| ID | NAME | VALUE |
+----+------+-------+
| 1  | a    | 0.2   |
| 1  | b    | 8     |
| 1  | c    | 3.5   |
| 1  | d    | 2.2   |
| 2  | b    | 4     |
| 2  | c    | 0.5   |
| 2  | d    | 6     |
| 3  | a    | 2     |
| 3  | b    | 4     |
| 3  | c    | 3.6   |
| 3  | d    | 0.2   |
+----+------+-------+
I'm trying to develop a SQL SELECT statement that returns the top or distinct ID where NAME 'a' and 'b' both exist and both of the corresponding VALUEs are >= 1. Thus, the desired output would be:
+----+------+-------+
| ID | NAME | VALUE |
+----+------+-------+
| 3  | a    | 2     |
+----+------+-------+
Appreciate any assistance anyone can provide.
You can try using the MIN window function with some conditions to do it.
SELECT * FROM (
  SELECT *,
    MIN(CASE WHEN NAME = 'a' THEN [value] END) OVER (PARTITION BY ID) aVal,
    MIN(CASE WHEN NAME = 'b' THEN [value] END) OVER (PARTITION BY ID) bVal
  FROM T
) t1
WHERE aVal >= 1 AND bVal >= 1 AND aVal = [Value]
This seems like a group by and having query:
select id
from t
where name in ('a', 'b')
group by id
having count(*) = 2 and
       min(value) >= 1;
No subqueries or joins are necessary.
The where clause filters the data to only look at the "a" and "b" records. The count(*) = 2 checks that both exist. If you can have duplicates, then use count(distinct name) = 2.
Then, you want the minimum value to be at least 1, so that is the final condition.
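For example, the duplicate-safe variant simply swaps in count(distinct name):
select id
from t
where name in ('a', 'b')
group by id
having count(distinct name) = 2 and
       min(value) >= 1;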
I am not sure why your desired results have the "a" row, but if you really want it, you can change the select to:
select id, 'a' as name,
max(case when name = 'a' then value end) as value
You can use IN and a sub-query:
select top 1 * from t
where t.id in
(
  select id from t
  where name in ('a', 'b')
  group by id
  having sum(case when value >= 1 then 1 else 0 end) >= 2
)
order by id

Calculating consecutive range of dates with a value in Hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of IDs and return the calculated value(s) for each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
| 1  | 8091     | 0.9    |
| 1  | 8092     | 20     |
| 1  | 8095     | 0.22   |
| 1  | 8096     | 0.23   |
| 1  | 8098     | 0.23   |
| 2  | 8095     | 12     |
| 2  | 8096     | 18     |
| 2  | 8097     | 3      |
| 2  | 8098     | 0.25   |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
| 1  | 1                             |
| 1  | 2                             |
| 1  | 1                             |
| 2  | 1                             |
+----+-------------------------------+
In this case, the ranges are the consecutive days with credit less than 1. If there is a gap in the date_key column, the range must not carry over to the next value, as with ID 1 between date keys 8096 and 8098.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!
You can do this with a running sum that classifies rows into groups, incrementing by 1 (in date_key order) every time a row with credit >= 1 is found. Thereafter it is just a group by.
select id, count(*) as range_days_credit_lt_1
from (select t.*,
             sum(case when credit < 1 then 0 else 1 end) over (partition by id order by date_key) as grp
      from tbl t
     ) t
where credit < 1
group by id, grp
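Note that this only starts a new group when a credit >= 1 row appears. If a gap in date_key should also break a range (as in the ID 1 example between 8096 and 8098), one way to extend it, sketched here untested and assuming date_key is an integer day number, is to pull in the previous date_key with lag() first:
select id, count(*) as range_days_credit_lt_1
from (select id, date_key, credit,
             sum(case when credit < 1 and date_key = prev_key + 1 then 0 else 1 end)
               over (partition by id order by date_key) as grp
      from (select t.*,
                   lag(date_key) over (partition by id order by date_key) as prev_key
            from tbl t
           ) x
     ) y
where credit < 1
group by id, grp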
The key is to collapse each consecutive sequence and compute its length. I achieved this in a relatively clumsy way:
with t_test as
(
  select num, row_number() over (order by num) as rn
  from
  (
    select explode(array(1,3,4,5,6,9,10,15)) as num
  ) nums
)
select length(sign) + 1
from
(
  select explode(continue_sign) as sign
  from
  (
    select split(concat_ws('', collect_list(if(d > 1, 'v', d))), 'v') as continue_sign
    from
    (
      select t0.num - t1.num as d
      from t_test t0
      join t_test t1 on t0.rn = t1.rn + 1
    ) diffs
  ) signs
) lens
Get the previous number b in the sequence for each original a;
Check whether a - b == 1: if a - b > 1 there is a "gap", marked as 'v';
Merge all the a - b values into one string, then split it on 'v' and compute each segment's length.
To get the ID column out, another string encoding the id would need to be built as well.

Grouping by similar values in multiple columns

I have a table of entities with an id and a category (a few different values, with NULL allowed) for 3 different years (the category can change from one year to another), in 'wide' table format:
| ID  | CATEG_Y1 | CATEG_Y2 | CATEG_Y3 |
+-----+----------+----------+----------+
| 1   | NULL     | B        | C        |
| 2   | A        | A        | C        |
| 3   | B        | A        | NULL     |
| 4   | A        | C        | B        |
| ... | ...      | ...      | ...      |
I would like to simply count the number of entities by category, grouped by category, independently for each year:
+-------+----+----+----+
| CATEG | Y1 | Y2 | Y3 |
+-------+----+----+----+
| A     | 6  | 4  | 5  | <- 6 entities w/ categ_y1, 4 w/ categ_y2, 5 w/ categ_y3
| B     | 3  | 1  | 10 |
| C     | 8  | 4  | 5  |
| NULL  | 3  | 3  | 3  |
+-------+----+----+----+
I guess I could do it by grouping values one column after the other and UNION ALLing the results, but I was wondering if there is a quicker and more convenient way, and whether it can be generalized when I have more columns/years to manage (e.g. 20-30 different values).
A bit clumsy, but probably someone has a better idea. The query first collects all the different categories (the union query in the FROM part), and then counts the occurrences with dedicated subqueries in the SELECT part. One could omit the union part if there is already a table defining the available categories (I suppose categ_y1 is a foreign key to such a primary category table). Hope there are not too many typos:
select categories.cat,
       (select count(categ_y1) from table ty1 where categories.cat = ty1.categ_y1) as y1,
       (select count(categ_y2) from table ty2 where categories.cat = ty2.categ_y2) as y2,
       (select count(categ_y3) from table ty3 where categories.cat = ty3.categ_y3) as y3
from (select categ_y1 as cat from table t1
      union select categ_y2 as cat from table t2
      union select categ_y3 as cat from table t3) categories
Use jsonb functions to transpose the data (from the question) to this format:
select categ, jsonb_object_agg(key, count) as jdata
from (
    select value as categ, key, count(*)
    from my_table t,
         jsonb_each_text(to_jsonb(t) - 'id')
    group by 1, 2
    ) s
group by 1
order by 1;
 categ |                     jdata
-------+-----------------------------------------------
 A     | {"categ_y1": 2, "categ_y2": 2}
 B     | {"categ_y1": 1, "categ_y2": 1, "categ_y3": 1}
 C     | {"categ_y2": 1, "categ_y3": 2}
       | {"categ_y1": 1, "categ_y3": 1}
(4 rows)
For a known (static) number of years you can easily unpack the jsonb column:
select categ, jdata->'categ_y1' as y1, jdata->'categ_y2' as y2, jdata->'categ_y3' as y3
from (
    select categ, jsonb_object_agg(key, count) as jdata
    from (
        select value as categ, key, count(*)
        from my_table t,
             jsonb_each_text(to_jsonb(t) - 'id')
        group by 1, 2
        ) s
    group by 1
    ) s
order by 1;
 categ | y1 | y2 | y3
-------+----+----+----
 A     | 2  | 2  |
 B     | 1  | 1  | 1
 C     |    | 1  | 2
       | 1  |    | 1
(4 rows)
To get a fully dynamic solution you can use the function create_jsonb_flat_view() described in Flatten aggregated key/value pairs from a JSONB field.
I would do this using union all followed by aggregation:
select categ, sum(categ_y1) as y1, sum(categ_y2) as y2,
       sum(categ_y3) as y3
from ((select categ_y1 as categ, 1 as categ_y1, 0 as categ_y2, 0 as categ_y3
       from t
      ) union all
      (select categ_y2 as categ, 0 as categ_y1, 1 as categ_y2, 0 as categ_y3
       from t
      ) union all
      (select categ_y3 as categ, 0 as categ_y1, 0 as categ_y2, 1 as categ_y3
       from t
      )
     ) c
group by categ;

BigQuery/SQL: Sum over intervals indicated by a secondary table

Suppose I have two tables: intervals contains index intervals (its columns are i_min and i_max) and values contains indexed values (with columns i and x). Here's an example:
values:                intervals:
+---+---+              +-------+-------+
| i | x |              | i_min | i_max |
+---+---+              +-------+-------+
| 1 | 1 |              | 1     | 4     |
| 2 | 0 |              | 6     | 6     |
| 3 | 4 |              | 6     | 6     |
| 4 | 9 |              | 6     | 6     |
| 6 | 7 |              | 7     | 9     |
| 7 | 2 |              | 12    | 17    |
| 8 | 2 |              +-------+-------+
| 9 | 2 |
+---+---+
I want to sum the values of x for each interval:
result:
+-------+-------+-----+
| i_min | i_max | sum |
+-------+-------+-----+
| 1     | 4     | 13  |  // 1+0+4+9
| 6     | 6     | 7   |
| 6     | 6     | 7   |
| 6     | 6     | 7   |
| 7     | 9     | 6   |  // 2+2+2
| 12    | 17    | 0   |
+-------+-------+-----+
In some SQL engines, this could be done using:
SELECT
  i_min,
  i_max,
  (SELECT SUM(x)
   FROM values
   WHERE i BETWEEN intervals.i_min AND intervals.i_max) AS sum_x
FROM
  intervals
except that type of query is not allowed by BigQuery ("Subselect not allowed in SELECT clause." or "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join." depending on the syntax used).
There must be a way to do this with window functions, but I can't figure out how — all examples I've seen have the partition as part of the table. Is there an option that doesn't use CROSS JOIN? If not, what's the most efficient way to do this CROSS JOIN?
Some notes on my data:
Both tables contain many (10⁸-10⁹) rows.
There might be repetitions in intervals, not in i.
But two intervals in intervals are either identical or entirely disjoint (no overlaps).
The union of all intervals is typically close to the set of all values of i (so it forms a partition of this space).
Intervals might be large (say, i_max-i_min < 10⁶).
Try below - BigQuery Standard SQL
#standardSQL
SELECT
  i_min, i_max, SUM(x) AS sum_x
FROM (
  -- ROW_NUMBER keeps duplicate intervals (like the repeated 6-6 rows) as separate output rows
  SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM `project.dataset.intervals`
) AS intervals
-- the extra NULL row joins to every interval, so empty intervals (like 12-17) still appear with sum_x = 0
JOIN (SELECT i, x FROM `project.dataset.values` UNION ALL SELECT NULL, 0) AS values
ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL
GROUP BY i_min, i_max, line
-- ORDER BY i_min
You can play/test with dummy data as below:
#standardSQL
WITH intervals AS (
  SELECT 1 AS i_min, 4 AS i_max UNION ALL
  SELECT 6, 6 UNION ALL
  SELECT 6, 6 UNION ALL
  SELECT 6, 6 UNION ALL
  SELECT 7, 9 UNION ALL
  SELECT 12, 17
),
values AS (
  SELECT 1 AS i, 1 AS x UNION ALL
  SELECT 2, 0 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 4, 9 UNION ALL
  SELECT 6, 7 UNION ALL
  SELECT 7, 2 UNION ALL
  SELECT 8, 2 UNION ALL
  SELECT 9, 2
)
SELECT
  i_min, i_max, SUM(x) AS sum_x
FROM (SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM intervals) AS intervals
JOIN (SELECT i, x FROM values UNION ALL SELECT NULL, 0) AS values
ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL
GROUP BY i_min, i_max, line
-- ORDER BY i_min
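With this dummy data, the query should return the same rows as the result table in the question, including the empty 12-17 interval with sum_x = 0.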