Summing repeated fields in BigQuery


I will try to explain my problem as clearly as possible; please tell me if anything is unclear.
I have a table [MyTable] that looks like this:
----------------------------------------
|chn:integer | auds:integer (repeated) |
----------------------------------------
|1 |3916 |
|1 |4983 |
|1 |6233 |
|1 |1214 |
|2 |1200 |
|2 |900 |
|2 |2030 |
|2 |2345 |
----------------------------------------
Auds is always repeated 4 times.
If I query SELECT chn, auds FROM [MyTable] WHERE chn = 1, I get the following result:
-------------------
|Row | chn | auds |
-------------------
|1 |1 |3916 |
|2 |1 |4983 |
|3 |1 |6233 |
|4 |1 |1214 |
-------------------
If I query SELECT chn, auds FROM [MyTable] WHERE (chn = 1 OR chn = 2), I get the following result:
-------------------
|Row | chn | auds |
-------------------
|1 |1 |3916 |
|2 |1 |4983 |
|3 |1 |6233 |
|4 |1 |1214 |
|5 |2 |1200 |
|6 |2 |900 |
|7 |2 |2030 |
|8 |2 |2345 |
-------------------
Logically, I get twice as many results, but what I would like to get is the SUM() of the repeated field auds for chn = 1 and chn = 2, position by position. Visually, something like this:
-------------------
|Row | chn | auds |
-------------------
|1 |3 |5116 |
|2 |3 |5883 |
|3 |3 |8263 |
|4 |3 |3559 |
-------------------
I tried something like this:
SELECT a1+a2 FROM
(SELECT auds AS a1 FROM [MyTable] WHERE chn = 1),
(SELECT auds AS a2 FROM [MyTable] WHERE chn = 2)
But I get the following error:
Error: Cannot query the cross product of repeated fields a1 and a2.

It's much easier to express this sort of logic with standard SQL (uncheck "Use Legacy SQL" under "Show Options"). Here's an example that computes sums over the auds arrays:
WITH MyTable AS (
  SELECT
    1 AS chn,
    [2, 3, 4, 5, 6] AS auds
  UNION ALL SELECT
    2 AS chn,
    [7, 8, 9, 10, 11] AS auds
)
SELECT
  chn,
  (SELECT SUM(aud) FROM UNNEST(auds) AS aud) AS auds_sum
FROM MyTable;
+-----+----------+
| chn | auds_sum |
+-----+----------+
| 1 | 20 |
| 2 | 45 |
+-----+----------+
And another that computes pairwise sums for chn = 1 and chn = 2 (which I think is what you wanted based on your question):
WITH MyTable AS (
  SELECT
    1 AS chn,
    [2, 3, 4, 5, 6] AS auds
  UNION ALL SELECT
    2 AS chn,
    [7, 8, 9, 10, 11] AS auds
)
SELECT
  ARRAY(SELECT first_aud + second_auds[OFFSET(off)]
        FROM UNNEST(first_auds) AS first_aud WITH OFFSET off)
    AS summed_auds
FROM (
  SELECT
    (SELECT auds FROM MyTable WHERE chn = 1) AS first_auds,
    (SELECT auds FROM MyTable WHERE chn = 2) AS second_auds
);
+---------------------+
| summed_auds |
+---------------------+
| [9, 11, 13, 15, 17] |
+---------------------+
Edit: one more example that sums corresponding array elements across all rows. This probably won't be particularly efficient, but it should produce the intended result:
WITH MyTable AS (
  SELECT
    1 AS chn,
    [2, 3, 4, 5, 6] AS auds
  UNION ALL SELECT
    2 AS chn,
    [7, 8, 9, 10, 11] AS auds
  UNION ALL SELECT
    3 AS chn,
    [-1, -6, 2, 3, 2] AS auds
)
SELECT
  ARRAY(SELECT
          (SELECT SUM(auds[OFFSET(off)]) FROM UNNEST(all_auds))
        FROM UNNEST(all_auds[OFFSET(0)].auds) WITH OFFSET off)
    AS summed_auds
FROM (
  SELECT
    ARRAY_AGG(STRUCT(auds)) AS all_auds
  FROM MyTable
);
+--------------------+
| summed_auds |
+--------------------+
| [8, 5, 15, 18, 19] |
+--------------------+
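
For readers who want to check the expected result outside BigQuery, the element-wise sum across rows can be mirrored in a few lines of Python. This is only a sketch of the same logic, not BigQuery code: the names rows and sum_by_position are made up for illustration, and it assumes every array has the same length.

```python
# Hypothetical in-memory copy of the sample rows: chn -> auds.
rows = {
    1: [2, 3, 4, 5, 6],
    2: [7, 8, 9, 10, 11],
    3: [-1, -6, 2, 3, 2],
}


def sum_by_position(arrays):
    """Sum the elements that share an offset across all arrays.

    Assumes all arrays have the same length; zip() would otherwise
    silently truncate to the shortest one.
    """
    return [sum(column) for column in zip(*arrays)]


summed_auds = sum_by_position(rows.values())
print(summed_auds)  # [8, 5, 15, 18, 19]
```

Applied to the two arrays from the question, the same helper gives the sums the asker listed (5116, 5883, 8263, 3559).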

Elliott’s answers are always an inspiration for me! Please vote for and accept his answer if it works for you (it should :o))
In the meantime, I wanted to add an alternative option using a scalar JS UDF:
CREATE TEMPORARY FUNCTION mySUM(a ARRAY<INT64>, b ARRAY<INT64>)
RETURNS ARRAY<INT64>
LANGUAGE js AS """
  var sum = [];
  for(var i = 0; i < a.length; i++){
    sum.push(parseInt(a[i]) + parseInt(b[i]));
  }
  return sum
""";
WITH MyTable AS (
  SELECT
    1 AS chn,
    [2, 3, 4, 5, 6] AS auds
  UNION ALL SELECT
    2 AS chn,
    [7, 8, 9, 10, 11] AS auds
)
SELECT
  first_auds.chn AS first_auds_chn,
  second_auds.chn AS second_auds_chn,
  mySUM(first_auds.auds, second_auds.auds) AS summed_auds
FROM MyTable AS first_auds
JOIN MyTable AS second_auds
  ON first_auds.chn = 1 AND second_auds.chn = 2
I like this option because it is less cluttered with multiple UNNESTs, ARRAYs, etc., so it is much cleaner to read.

Just use GROUP BY in conjunction with SUM.
SELECT SUM(auds), chn FROM [MyTable] GROUP BY chn

Related

Flatten Nested Array and Aggregate in Snowflake

My table column has nested arrays in a Snowflake database. I want to perform some aggregations using SQL (Snowflake SQL).
My table name is: DATA
The PROJ column is of VARIANT data type. There will not always be three nested arrays, as the DATA table shows.
| ID | PROJ | LOCATION |
|----|-------------------------------|----------|
| 1 |[[0, 4], [1, 30], [10, 20]] | S |
| 2 |[[0, 2], [1, 20]] | S |
| 3 |[[0, 8], [1, 10], [10, 100]] | S |
Desired Output:
| Index | LOCATION | Min | Max | Mean|
|-------|----------|------|-----|-----|
| 0 | S | 2 | 8 | 4.66|
| 1 | S | 10 | 30 | 20 |
| 10 | S | 20 | 100| 60 |

First the nested array should be flattened; then Index is the first element of each subarray and Value is the second element (arrays are 0-based):
CREATE OR REPLACE TABLE DATA
AS
SELECT 1 AS ID, [[0, 4], [1, 30], [10, 20]] AS PROJ UNION
SELECT 2 AS ID, [[0, 2], [1, 20]] AS PROJ UNION
SELECT 3 AS ID, [[0, 8], [1, 10], [10, 100]] AS PROJ;
Query:
SELECT s.VALUE[0]::INT AS Index,
       MIN(s.VALUE[1]::INT) AS MinValue,
       MAX(s.VALUE[1]::INT) AS MaxValue,
       AVG(s.VALUE[1]::INT) AS MeanValue
FROM DATA,
     LATERAL FLATTEN(input => PROJ) s
GROUP BY s.VALUE[0]::INT
ORDER BY Index;
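
The flatten-then-aggregate idea can also be sketched in plain Python to make the two steps explicit. This is an illustration of the logic only, not Snowflake code; data, values_by_index and summary are hypothetical names, and the sample rows are copied from the question.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical in-memory copy of the DATA table: (ID, PROJ).
data = [
    (1, [[0, 4], [1, 30], [10, 20]]),
    (2, [[0, 2], [1, 20]]),
    (3, [[0, 8], [1, 10], [10, 100]]),
]

# Step 1: flatten. Each inner [index, value] pair is visited on its own,
# roughly what LATERAL FLATTEN does by emitting one row per pair.
values_by_index = defaultdict(list)
for _row_id, proj in data:
    for index, value in proj:
        values_by_index[index].append(value)

# Step 2: aggregate per index, like GROUP BY with MIN / MAX / AVG.
summary = {
    index: (min(vals), max(vals), mean(vals))
    for index, vals in sorted(values_by_index.items())
}

for index, (lowest, highest, average) in summary.items():
    print(index, lowest, highest, round(average, 2))
```

Rows with fewer inner pairs (ID 2 here) simply contribute fewer values, so no special handling is needed for arrays of different lengths.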

Select as array of tuples postgresql

Given a table of enums:
|id |reaction |
|-- |-------- |
|1 |laugh |
|2 |love |
|3 |love |
|4 |like |
|5 |like |
|6 |surprised|
|7 |like |
|8 |love |
|9 |like |
|10 |surprised|
How can I select from it to get the following JSON array of tuples [reaction, count()]?
[
[laugh, 1],
[love, 3],
[like, 4],
[surprised, 2]
]

You can aggregate the result of a group by query:
select jsonb_agg(jsonb_build_object(reaction, count))
from (
select reaction, count(*)
from the_table
group by reaction
) t;
This would return:
[
{"surprised": 2},
{"like": 4},
{"laugh": 1},
{"love": 3}
]
Or if you really want the inner key/value pairs as a JSON array:
select jsonb_agg(array[reaction, "count"])
from (
select reaction, count(*)::text as "count"
from the_table
group by reaction
) t;
This would return
[
["surprised","2"],
["like","4"],
["laugh","1"],
["love","3"]
]
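
As a cross-check of the target shape, here is a small Python sketch that builds the same array of [reaction, count] pairs from the sample data. It is not PostgreSQL code; reactions, pairs and result are illustrative names. Unlike the text-array variant above, the counts stay numeric here.

```python
import json
from collections import Counter

# Sample reactions from the question, in id order.
reactions = [
    "laugh", "love", "love", "like", "like",
    "surprised", "like", "love", "like", "surprised",
]

# Count occurrences per reaction (the GROUP BY step).
counts = Counter(reactions)

# Build one [reaction, count] pair per group (the aggregation step).
pairs = [[reaction, count] for reaction, count in counts.items()]

result = json.dumps(pairs)
print(result)  # [["laugh", 1], ["love", 3], ["like", 4], ["surprised", 2]]
```

Counter keeps first-seen order, which is why the output matches the order in the question; a GROUP BY in SQL gives no such guarantee without an ORDER BY.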

You can use a window function (count(*) OVER (PARTITION BY ...)) together with the jsonb_build_array function:
SELECT
  jsonb_build_array(json_reactions.reaction, count)
FROM (
  SELECT DISTINCT
    reaction,
    count(*) OVER (PARTITION BY reaction)
  FROM reactions r
) AS json_reactions;

reset index in dense_rank or row_number after variable partitioning over changes

I'm using DB2 SQL. I have the following:
select * from mytable order by Var,Varseq
ID Var Varseq
-- --- ------
1 A 1
1 A 2
1 B 1
1 A 3
2 A 1
2 C 1
but would like to get:
ID Var Varseq NewSeq
-- --- ------ ------
1 A 1 1
1 A 2 2
1 B 1 1
1 A 3 1
2 A 1 1
2 C 1 1
However dense_rank produces the same as the original result. I hope you can see the difference in the desired output - in the 4th line when ID=1 returns to Var=A, I want the index reset to 1, instead of carrying on as 3. i.e. I would like the index to be reset every time Var changes for a given ID.
For reference, here was my query:
SELECT *, DENSE_RANK() OVER (PARTITION BY ID, VAR ORDER BY VARSEQ) FROM MYTABLE

This is an example of a gaps-and-islands problem. However, SQL tables represent unordered sets. Without a column that specifies the overall ordering, your question does not make sense.
In this case, the difference of row numbers will do what you want. But you need an overall ordering column:
select t.*,
       row_number() over (partition by id, var, seqnum - seqnum2 order by <ordering col>) as newseq
from (select t.*,
             row_number() over (partition by id order by <ordering col>) as seqnum,
             row_number() over (partition by id, var order by <ordering col>) as seqnum2
      from t
     ) t
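
To see why the difference of row numbers identifies each run, here is a minimal Python sketch of the same computation. It assumes a hypothetical overall ordering column, called pos here, which the question's table does not actually have; new_sequence and the counter names are also illustrative.

```python
from collections import defaultdict

# (pos, id, var): pos is a hypothetical overall ordering column.
rows = [
    (1, 1, "A"),
    (2, 1, "A"),
    (3, 1, "B"),
    (4, 1, "A"),
    (5, 2, "A"),
    (6, 2, "C"),
]


def new_sequence(rows):
    """Return newseq for each row, in pos order."""
    per_id = defaultdict(int)      # row_number() over (partition by id)
    per_id_var = defaultdict(int)  # row_number() over (partition by id, var)
    per_island = defaultdict(int)  # row_number() over (partition by id, var, diff)
    result = []
    for _pos, id_, var in sorted(rows):
        per_id[id_] += 1
        per_id_var[(id_, var)] += 1
        # Within one unbroken run of the same var, both counters advance
        # together, so their difference stays constant and labels the run.
        diff = per_id[id_] - per_id_var[(id_, var)]
        per_island[(id_, var, diff)] += 1
        result.append(per_island[(id_, var, diff)])
    return result


newseq = new_sequence(rows)
print(newseq)  # [1, 2, 1, 1, 1, 1]
```

The fourth row (ID 1 returning to Var A) gets a different difference than the first two rows, so it starts a new run and is numbered 1 again.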

Not an answer yet, but just to have better formatting.
WITH TAB (ID, Var, Varseq) AS
(
VALUES
(1, 'A', 1)
, (1, 'A', 2)
, (1, 'A', 3)
, (1, 'B', 1)
, (2, 'A', 1)
, (2, 'C', 1)
)
SELECT *
FROM TAB
ORDER BY ID, <order keys>;
You specified Var, Varseq as <order keys> in the query above.
The result is:
|ID |VAR|VARSEQ |
|-----------|---|-----------|
|1 |A |1 |
|1 |A |2 |
|1 |A |3 |
|1 |B |1 |
|2 |A |1 |
|2 |C |1 |
But you need the following according to your question:
|ID |VAR|VARSEQ |
|-----------|---|-----------|
|1 |A |1 |
|1 |A |2 |
|1 |B |1 |
|1 |A |3 |
|2 |A |1 |
|2 |C |1 |
So please edit your question to specify an <order keys> clause that produces the result you need. And please run your query on your system first to confirm it returns that order before posting here...

Postgres WITH RECURSIVE CTE: sorting/ordering children by popularity while retaining tree structure (parents always above children)

I'm building a forum, very much like Reddit/Slashdot, i.e.
Unlimited reply nesting levels
Popular comments (ordered by likes/votes) will rise to the top (within their own nesting/depth level), but the tree structure needs to be retained (parent is always shown directly above children)
Here's a sample table & data:
DROP TABLE IF EXISTS "comments";
CREATE TABLE comments (
id BIGINT PRIMARY KEY,
parent_id BIGINT,
body TEXT NOT NULL,
like_score BIGINT,
depth BIGINT
);
INSERT INTO comments VALUES ( 0, NULL, 'Main top of thread post', 5 , 0 );
INSERT INTO comments VALUES ( 1, 0, 'comment A', 5 , 1 );
INSERT INTO comments VALUES ( 2, 1, 'comment A.A', 3, 2 );
INSERT INTO comments VALUES ( 3, 1, 'comment A.B', 1, 2 );
INSERT INTO comments VALUES ( 9, 3, 'comment A.B.A', 10, 3 );
INSERT INTO comments VALUES ( 10, 3, 'comment A.B.B', 5, 3 );
INSERT INTO comments VALUES ( 11, 3, 'comment A.B.C', 8, 3 );
INSERT INTO comments VALUES ( 4, 1, 'comment A.C', 5, 2 );
INSERT INTO comments VALUES ( 5, 0, 'comment B', 10, 1 );
INSERT INTO comments VALUES ( 6, 5, 'comment B.A', 7, 2 );
INSERT INTO comments VALUES ( 7, 5, 'comment B.B', 5, 2 );
INSERT INTO comments VALUES ( 8, 5, 'comment B.C', 2, 2 );
Here's the recursive query I've come up with so far, but I can't figure out how to order children, but retain tree structure (parent should always be above children)...
WITH RECURSIVE tree AS (
  SELECT
    ARRAY[]::BIGINT[] AS sortable,
    id,
    body,
    like_score,
    depth
  FROM "comments"
  WHERE parent_id IS NULL
  UNION ALL
  SELECT
    tree.sortable || "comments".like_score || "comments".id,
    "comments".id,
    "comments".body,
    "comments".like_score,
    "comments".depth
  FROM "comments", tree
  WHERE "comments".parent_id = tree.id
)
SELECT * FROM tree
ORDER BY sortable DESC
This outputs...
+----------------------------------------------------------+
|sortable |id|body |like_score|depth|
+----------------------------------------------------------+
|{10,5,7,6} |6 |comment B.A |7 |2 |
|{10,5,5,7} |7 |comment B.B |5 |2 |
|{10,5,2,8} |8 |comment B.C |2 |2 |
|{10,5} |5 |comment B |10 |1 |
|{5,1,5,4} |4 |comment A.C |5 |2 |
|{5,1,3,2} |2 |comment A.A |3 |2 |
|{5,1,1,3,10,9}|9 |comment A.B.A |10 |3 |
|{5,1,1,3,8,11}|11|comment A.B.C |8 |3 |
|{5,1,1,3,5,10}|10|comment A.B.B |5 |3 |
|{5,1,1,3} |3 |comment A.B |1 |2 |
|{5,1} |1 |comment A |5 |1 |
| |0 |Main top of thread post|5 |0 |
+----------------------------------------------------------+
...however notice that "comment B", "comment A" and "Main top of thread post" are below their children? How do I keep the contextual order? i.e. The output I want is:
+----------------------------------------------------------+
|sortable |id|body |like_score|depth|
+----------------------------------------------------------+
| |0 |Main top of thread post|5 |0 |
|{10,5} |5 |comment B |10 |1 |
|{10,5,7,6} |6 |comment B.A |7 |2 |
|{10,5,5,7} |7 |comment B.B |5 |2 |
|{10,5,2,8} |8 |comment B.C |2 |2 |
|{5,1} |1 |comment A |5 |1 |
|{5,1,5,4} |4 |comment A.C |5 |2 |
|{5,1,3,2} |2 |comment A.A |3 |2 |
|{5,1,1,3} |3 |comment A.B |1 |2 |
|{5,1,1,3,10,9}|9 |comment A.B.A |10 |3 |
|{5,1,1,3,8,11}|11|comment A.B.C |8 |3 |
|{5,1,1,3,5,10}|10|comment A.B.B |5 |3 |
+----------------------------------------------------------+
I actually want the users to be able to sort by a number of methods:
Most popular first
Least popular first
Newest first
Oldest first
etc
...but in all cases the parents need to be shown above their children. But I'm just using "like_score" here as the example, and I should be able to figure out the rest from there.
I spent many hours researching the web and trying things myself, and it feels like I'm getting close, but I can't figure out this last part.

1. Negate like_score when building the sort key (note the minus sign), so that higher scores sort first:
tree.sortable || -"comments".like_score || "comments".id
2. Sort ascending, so that each parent (whose key is a prefix of its children's keys) comes before its children:
ORDER BY sortable
The full query:
WITH RECURSIVE tree AS (
  SELECT
    ARRAY[]::BIGINT[] AS sortable,
    id,
    body,
    like_score,
    depth
  FROM "comments"
  WHERE parent_id IS NULL
  UNION ALL
  SELECT
    tree.sortable || -"comments".like_score || "comments".id,
    "comments".id,
    "comments".body,
    "comments".like_score,
    "comments".depth
  FROM "comments", tree
  WHERE "comments".parent_id = tree.id
)
SELECT * FROM tree
ORDER BY sortable
+-------------------+----+-------------------------+------------+-------+
| sortable | id | body | like_score | depth |
+-------------------+----+-------------------------+------------+-------+
| (null) | 0 | Main top of thread post | 5 | 0 |
+-------------------+----+-------------------------+------------+-------+
| {-10,5} | 5 | comment B | 10 | 1 |
+-------------------+----+-------------------------+------------+-------+
| {-10,5,-7,6} | 6 | comment B.A | 7 | 2 |
+-------------------+----+-------------------------+------------+-------+
| {-10,5,-5,7} | 7 | comment B.B | 5 | 2 |
+-------------------+----+-------------------------+------------+-------+
| {-10,5,-2,8} | 8 | comment B.C | 2 | 2 |
+-------------------+----+-------------------------+------------+-------+
| {-5,1} | 1 | comment A | 5 | 1 |
+-------------------+----+-------------------------+------------+-------+
| {-5,1,-5,4} | 4 | comment A.C | 5 | 2 |
+-------------------+----+-------------------------+------------+-------+
| {-5,1,-3,2} | 2 | comment A.A | 3 | 2 |
+-------------------+----+-------------------------+------------+-------+
| {-5,1,-1,3} | 3 | comment A.B | 1 | 2 |
+-------------------+----+-------------------------+------------+-------+
| {-5,1,-1,3,-10,9} | 9 | comment A.B.A | 10 | 3 |
+-------------------+----+-------------------------+------------+-------+
| {-5,1,-1,3,-8,11} | 11 | comment A.B.C | 8 | 3 |
+-------------------+----+-------------------------+------------+-------+
| {-5,1,-1,3,-5,10} | 10 | comment A.B.B | 5 | 3 |
+-------------------+----+-------------------------+------------+-------+
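
The ordering can be reproduced outside the database to see why it works. The following Python sketch builds the same kind of key (negated score, then id, appended to the parent's key) and sorts ascending. It relies on Python comparing lists lexicographically with a shorter prefix sorting first; build_sort_keys and the tuple layout are assumptions made for this illustration.

```python
# (id, parent_id, like_score), copied from the sample INSERT statements.
comments = [
    (0, None, 5), (1, 0, 5), (2, 1, 3), (3, 1, 1),
    (9, 3, 10), (10, 3, 5), (11, 3, 8), (4, 1, 5),
    (5, 0, 10), (6, 5, 7), (7, 5, 5), (8, 5, 2),
]


def build_sort_keys(comments):
    """Map each comment id to parent_key + [-like_score, id]."""
    children = {}
    for comment_id, parent_id, score in comments:
        children.setdefault(parent_id, []).append((comment_id, score))

    keys = {}
    # Root posts start with an empty key, like ARRAY[]::BIGINT[] in the query.
    stack = [(comment_id, []) for comment_id, _score in children.get(None, [])]
    while stack:
        comment_id, key = stack.pop()
        keys[comment_id] = key
        for child_id, score in children.get(comment_id, []):
            stack.append((child_id, key + [-score, child_id]))
    return keys


sort_keys = build_sort_keys(comments)
ordered_ids = sorted(sort_keys, key=sort_keys.get)
print(ordered_ids)  # [0, 5, 6, 7, 8, 1, 4, 2, 3, 9, 11, 10]
```

Because a parent's key is always a strict prefix of its children's keys, an ascending sort places the parent first, and the negated score puts the most liked sibling next. Ties on score fall back to the id.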

Check this:
WITH RECURSIVE tree AS (
  SELECT
    ARRAY[]::BIGINT[] AS sortable,
    id,
    body,
    like_score,
    depth,
    lpad(id::text, 2, '0') as path
  FROM "comments"
  WHERE parent_id IS NULL
  UNION ALL
  SELECT
    tree.sortable || "comments".like_score || "comments".id,
    "comments".id,
    "comments".body,
    "comments".like_score,
    "comments".depth,
    tree.path || '/' || lpad("comments".id::text, 2, '0') as path
  FROM "comments", tree
  WHERE "comments".parent_id = tree.id
)
SELECT * FROM tree
ORDER BY path
Please note that you can substitute the parameter 2 on lpad with whatever number of digits you want.
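
A short Python sketch shows why the padding width matters when a path is sorted as text. The paths and the path_key helper are hypothetical; zfill stands in for lpad with '0' as the fill character.

```python
# Hypothetical id paths from the root to three sibling comments.
paths = [[0, 1, 3, 9], [0, 1, 3, 10], [0, 1, 3, 11]]


def path_key(ids, width):
    """Join ids with '/', left-padding each one with zeros to `width` digits."""
    return "/".join(str(i).zfill(width) for i in ids)


unpadded = [path_key(p, width=1) for p in paths]
padded = [path_key(p, width=2) for p in paths]

# Text comparison goes character by character, so '10' sorts before '9'.
print(sorted(unpadded))  # ['0/1/3/10', '0/1/3/11', '0/1/3/9']

# With enough padding, text order matches numeric order.
print(sorted(padded))    # ['00/01/03/09', '00/01/03/10', '00/01/03/11']
```

One difference to keep in mind: zfill never shortens its input, whereas PostgreSQL's lpad truncates a string that is longer than the requested length, so the width has to cover the largest id.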

sql window functions based on start flag and end flag of a row

I have a table with data like the sample below.
I want to number rows based on the ChildLoadStart and ChildLoadEnd columns using window functions.
Data Sample
LoadNumber |DispatchNumber|ChildLoadStart|ChildLoadEnd |
---------------------------------------------------------
123 | A |1 |1 |
---------------------------------------------------------
123 |B |1 |0 |
---------------------------------------------------------
123 |C |0 |0 |
---------------------------------------------------------
123 |D |0 |1 |
---------------------------------------------------------
In the data above, load 123 has two child loads: dispatch A is one child load, and dispatches B, C, and D together form another.
So I need to number the rows within each child load.
The result should look something like the table below. Can someone help me with this?
LoadNumber |DispatchNumber|ChildLoadStart|ChildLoadEnd |Order |
-----------------------------------------------------------------------
123 | A |1 |1 |1 |
------------------------------------------------------------------------
123 |B |1 |0 |1 |
------------------------------------------------------------------------
123 |C |0 |0 |2 |
------------------------------------------------------------------------
123 |D |0 |1 |3 |
------------------------------------------------------------------------

If DispatchNumber can be used to order the data:
ROW_NUMBER()
OVER (PARTITION BY LoadNumber
ORDER BY DispatchNumber
RESET WHEN ChildLoadStart = 1)

The RESET WHEN clause proposed by @dnoeth is probably exactly what is needed here. But I'm not familiar with Teradata, so below is an Oracle alternative; maybe it will be useful for someone.
First divide your data into groups using a cumulative sum, then use that column (grp) in the partition by clause for row_number():
select loadnumber, dispatchnumber, childloadstart, childloadend,
       row_number() over (partition by loadnumber, grp order by dispatchnumber) as "ORDER"
from (
  select data.*,
         sum(childloadstart) over (partition by loadnumber order by dispatchnumber) grp
  from data )
Test data and output:
create table data (LoadNumber number(4), DispatchNumber varchar2(2),
ChildLoadStart number(1), ChildLoadEnd number(1));
insert into data values (123, 'A', 1, 1);
insert into data values (123, 'B', 1, 0);
insert into data values (123, 'C', 0, 0);
insert into data values (123, 'D', 0, 1);
LOADNUMBER DISPATCHNUMBER CHILDLOADSTART CHILDLOADEND ORDER
---------- -------------- -------------- ------------ ----------
123 A 1 1 1
123 B 1 0 1
123 C 0 0 2
123 D 0 1 3
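
The cumulative-sum grouping can be traced step by step with a small Python sketch. It is an illustration of the logic rather than Oracle code, and it assumes DispatchNumber is unique within a load so that the running total is unambiguous; number_child_loads and the dictionary names are made up for this example.

```python
# (LoadNumber, DispatchNumber, ChildLoadStart, ChildLoadEnd) from the test data.
rows = [
    (123, "A", 1, 1),
    (123, "B", 1, 0),
    (123, "C", 0, 0),
    (123, "D", 0, 1),
]


def number_child_loads(rows):
    """Return (load, dispatch, grp, order) tuples in dispatch order."""
    running_start = {}  # load -> cumulative sum of ChildLoadStart (grp)
    row_counter = {}    # (load, grp) -> row_number() within that group
    numbered = []
    for load, dispatch, start, _end in sorted(rows, key=lambda r: (r[0], r[1])):
        # Every start flag bumps the running total, opening a new group.
        running_start[load] = running_start.get(load, 0) + start
        grp = running_start[load]
        row_counter[(load, grp)] = row_counter.get((load, grp), 0) + 1
        numbered.append((load, dispatch, grp, row_counter[(load, grp)]))
    return numbered


numbered = number_child_loads(rows)
for row in numbered:
    print(row)
```

Dispatch A ends up alone in group 1, while B, C, and D share group 2 and are numbered 1, 2, 3, matching the ORDER column above.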
