Flatten Nested Array and Aggregate in Snowflake

A column in my Snowflake table holds nested arrays, and I want to run some aggregations on it using Snowflake SQL.
The table is named DATA.
The PROJ column is of VARIANT data type. The number of nested arrays is not always 3, as the sample rows of DATA below show.
| ID | PROJ | LOCATION |
|----|-------------------------------|----------|
| 1 |[[0, 4], [1, 30], [10, 20]] | S |
| 2 |[[0, 2], [1, 20]] | S |
| 3 |[[0, 8], [1, 10], [10, 100]] | S |
Desired Output:
| Index | LOCATION | Min | Max | Mean|
|-------|----------|------|-----|-----|
| 0 | S | 2 | 8 | 4.66|
| 1 | S | 10 | 30 | 20 |
| 10 | S | 20 | 100| 60 |

First flatten the nested array; Index is then the first element of each subarray and Value is the second (arrays are 0-based). LOCATION is included so the result matches the desired output:
CREATE OR REPLACE TABLE DATA
AS
SELECT 1 AS ID, [[0, 4], [1, 30], [10, 20]] AS PROJ, 'S' AS LOCATION UNION ALL
SELECT 2 AS ID, [[0, 2], [1, 20]] AS PROJ, 'S' AS LOCATION UNION ALL
SELECT 3 AS ID, [[0, 8], [1, 10], [10, 100]] AS PROJ, 'S' AS LOCATION;
Query:
SELECT s.VALUE[0]::INT AS Index,
       LOCATION,
       MIN(s.VALUE[1]::INT) AS MinValue,
       MAX(s.VALUE[1]::INT) AS MaxValue,
       AVG(s.VALUE[1]::INT) AS MeanValue
FROM DATA,
     LATERAL FLATTEN(input => PROJ) s  -- one row per [index, value] pair
GROUP BY s.VALUE[0]::INT, LOCATION
ORDER BY Index;
Output: one row per Index (0, 1 and 10), with Min, Max and Mean values matching the desired output above.
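The flatten-then-group logic of the query can be mirrored in plain Python to check the expected numbers. This is only a sketch over the sample rows from the question; the function name `aggregate_by_index` is made up for illustration and nothing here is Snowflake-specific.

```python
from collections import defaultdict
from statistics import mean

# Sample rows copied from the DATA table in the question: (ID, PROJ, LOCATION).
rows = [
    (1, [[0, 4], [1, 30], [10, 20]], "S"),
    (2, [[0, 2], [1, 20]], "S"),
    (3, [[0, 8], [1, 10], [10, 100]], "S"),
]


def aggregate_by_index(rows):
    """Flatten each PROJ array, then group the values by (index, location)."""
    groups = defaultdict(list)
    for _id, proj, location in rows:
        for index, value in proj:  # one flattened row per [index, value] pair
            groups[(index, location)].append(value)
    # One output row per group, ordered by index like ORDER BY Index.
    return [
        (index, location, min(vals), max(vals), mean(vals))
        for (index, location), vals in sorted(groups.items())
    ]


result = aggregate_by_index(rows)
for row in result:
    print(row)
```

Rows with fewer subarrays (ID 2 here) simply contribute nothing to the missing index, which is the same behaviour the SQL query relies on.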

Related

How to get sum of transposed array in PostgreSQL

I've designed a table with the structure below.
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------+-----------+----------+---------
counts | integer[] | | |
Every value in the counts field has 9 elements.
None of the elements is null.
I would like to get the sum of the transposed array, like the Python code below, using only a SQL query.
import numpy as np
counts = np.array([
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]
])
counts = counts.transpose()
sums = counts.sum(axis=1).tolist()  # sum each row of the transposed array
print(sums) # [12, 15, 18, 21]
So given example data:
counts |
-----------------------|
{0,15,8,6,10,12,4,0,5} |
{0,4,6,14,7,9,8,0,9} |
{0,6,7,4,11,6,10,0,10} |
Because there are over a thousand records, calculating the transposed sum on the frontend takes about 500 ms.
I would like a result:
counts |
---------------------------|
{0,25,21,24,28,27,22,0,24} |
or
value0 | value1 | value2 | value3 | value4 | value5 | value6 | value7 | value8 |
--------|--------|--------|--------|--------|--------|--------|--------|--------|
0 | 25 | 21 | 24 | 28 | 27 | 22 | 0 | 24 |
I think this could be solved with SQL functions, but I don't know how to use them.

Just SUM() the array elements:
SELECT SUM(i[1]) -- first
, SUM(i[2]) -- second
, SUM(i[3]) -- third
-- etc.
, SUM(i[9])
FROM (VALUES
('{0,15,8,6,10,12,4,0,5}'::int[]),
('{0,4,6,14,7,9,8,0,9}'),
('{0,6,7,4,11,6,10,0,10}')) s(i);
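For comparison, the same per-position sum can be sketched in Python, which makes it easy to check the expected result against the sample data. The helper name `column_sums` is made up for illustration, and the sketch assumes every array has the same length, as the question states.

```python
# Sample rows copied from the question; each row is one `counts` array.
rows = [
    [0, 15, 8, 6, 10, 12, 4, 0, 5],
    [0, 4, 6, 14, 7, 9, 8, 0, 9],
    [0, 6, 7, 4, 11, 6, 10, 0, 10],
]


def column_sums(rows):
    """Sum element i of every row, assuming all rows have the same length."""
    # zip(*rows) transposes the rows, yielding one tuple per array position.
    return [sum(column) for column in zip(*rows)]


totals = column_sums(rows)
print(totals)
```

Each entry of `totals` corresponds to one `SUM(i[n])` column in the query above.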

Finding the closest geographic points between two tables BigQuery

I have two tables with latitude and longitude points. I would like to create a new table which has information from both tables based on finding the closest points between tables. This is similar to a question previously asked; however one of the tables has arrays. The solution from the previously asked question did not seem to work with arrays.
Table A
|--------|-------------|-------------|-------------|
| id | latitude | longitude | address |
|--------|-------------|-------------|-------------|
| 1 | 39.79 | 86.03 | 123 Vine St |
|--------|-------------|-------------|-------------|
| 2 | 39.89 | 84.01 | 123 Oak St |
|--------|-------------|-------------|-------------|
Table B
|-------------|-------------|-------------|--------------|
| latitude | longitude | parameter1 | parameter2 |
|-------------|-------------|-------------|--------------|
| 39.74 | 86.33 | [1, 2, 3] | [.1, .2, .3] |
|-------------|-------------|-------------|--------------|
| 39.81 | 83.90 | [4, 5, 6] | [.4, .5, .6] |
|-------------|-------------|-------------|--------------|
I would like to create a new table, Table C, which has all the rows from TABLE A and adds the information from Table B. The information from Table B is added based on the closest point in Table B to the particular row in Table A.
Table C
|------|-------------|-------------|--------------|
| id_A | address | parameter1 | parameter2 |
|------|-------------|-------------|--------------|
| 1 | 123 Vine St | [1, 2, 3] | [.1, .2, .3] |
|------|-------------|-------------|--------------|
| 2 | 123 Oak St | [4, 5, 6] | [.4, .5, .6] |
|------|-------------|-------------|--------------|
Thank you in advance!

Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE
ARRAY_AGG(STRUCT(id, address, parameter1, parameter2) ORDER BY ST_DISTANCE(a.point, b.point) LIMIT 1)[OFFSET(0)]
FROM (SELECT *, ST_GEOGPOINT(longitude, latitude) point FROM `project.dataset.tableA`) a,
(SELECT *, ST_GEOGPOINT(longitude, latitude) point FROM `project.dataset.tableB`) b
GROUP BY id
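Applied to the sample data from your question:
WITH `project.dataset.tableA` AS (
  SELECT 1 id, 39.79 latitude, 86.03 longitude, '123 Vine St' address UNION ALL
  SELECT 2, 39.89, 84.01, '123 Oak St'
), `project.dataset.tableB` AS (
  SELECT 39.74 latitude, 86.33 longitude, [1, 2, 3] parameter1, [.1, .2, .3] parameter2 UNION ALL
  SELECT 39.81, 83.90, [4, 5, 6], [.4, .5, .6]
)
SELECT AS VALUE
  ARRAY_AGG(STRUCT(id, address, parameter1, parameter2) ORDER BY ST_DISTANCE(a.point, b.point) LIMIT 1)[OFFSET(0)]
FROM (SELECT *, ST_GEOGPOINT(longitude, latitude) point FROM `project.dataset.tableA`) a,
  (SELECT *, ST_GEOGPOINT(longitude, latitude) point FROM `project.dataset.tableB`) b
GROUP BY id
The output has the same rows as Table C in the question (the first column is named id rather than id_A).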
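The nearest-point logic (cross join, order by distance, keep the first match per id) can be sketched in Python as a sanity check. The haversine formula on a spherical Earth is used here as a stand-in for ST_DISTANCE, which is an assumption of this sketch, and the names `haversine_m` and `nearest_join` are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Sample data copied from Table A and Table B in the question.
table_a = [
    {"id": 1, "latitude": 39.79, "longitude": 86.03, "address": "123 Vine St"},
    {"id": 2, "latitude": 39.89, "longitude": 84.01, "address": "123 Oak St"},
]
table_b = [
    {"latitude": 39.74, "longitude": 86.33,
     "parameter1": [1, 2, 3], "parameter2": [0.1, 0.2, 0.3]},
    {"latitude": 39.81, "longitude": 83.90,
     "parameter1": [4, 5, 6], "parameter2": [0.4, 0.5, 0.6]},
]


def nearest_join(table_a, table_b):
    """For each row of A, attach the parameters of the closest row of B."""
    out = []
    for a in table_a:
        b = min(
            table_b,
            key=lambda b: haversine_m(
                a["latitude"], a["longitude"], b["latitude"], b["longitude"]
            ),
        )
        out.append({
            "id": a["id"],
            "address": a["address"],
            "parameter1": b["parameter1"],
            "parameter2": b["parameter2"],
        })
    return out


table_c = nearest_join(table_a, table_b)
for row in table_c:
    print(row)
```

Like the SQL cross join, this compares every row of A with every row of B, so it is only practical for small tables.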

In Hive, how to combine multiple tables to produce single row containing array of objects?

I have two tables as follows:
users table
==========================
| user_id name age |
|=========================
| 1 pete 20 |
| 2 sam 21 |
| 3 nash 22 |
==========================
hobbies table
======================================
| user_id hobby time_spent |
|=====================================
| 1 football 2 |
| 1 running 1 |
| 1 basketball 3 |
======================================
First question: I would like to make a single Hive query that can return rows in this format:
{ "user_id": 1, "name": "pete", "hobbies": [ {"hobby": "football", "time_spent": 2}, {"hobby": "running", "time_spent": 1}, {"hobby": "basketball", "time_spent": 3} ] }
Second question: If the hobbies table were to be as follows:
========================================
| user_id hobby scores |
|=======================================
| 1 football 2,3,1 |
| 1 running 1,1,2,5 |
| 1 basketball 3,6,7 |
========================================
Would it be possible to get the row output where scores is a list in the output as shown below:
{ "user_id": 1, "name": "pete", "hobbies": [ {"hobby": "football", "scores": [2, 3, 1]}, {"hobby": "running", "scores": [1, 1, 2, 5]}, {"hobby": "basketball", "scores": [3, 6, 7]} ] }

I was able to find the answer to my first question:
select u.user_id, u.name,
collect_list(
str_to_map(
concat_ws(",", array(
concat("hobby:", h.hobby),
concat("time_spent:", h.time_spent)
))
)
) as hobbies
from users as u
join hobbies as h on u.user_id=h.user_id
group by u.user_id, u.name;
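The query above covers only the first question. For the second one, the target shape can be sketched in Python: group hobbies per user and split the comma-separated scores into a list of integers. This is an illustration of the intended output, not a Hive solution; the function name `users_with_hobbies` is made up, and in Hive a string-splitting function would presumably take the place of `str.split`.

```python
import json

# Sample data copied from the question: (user_id, name, age).
users = [(1, "pete", 20), (2, "sam", 21), (3, "nash", 22)]
# Second variant of the hobbies table, with comma-separated scores.
hobbies = [
    (1, "football", "2,3,1"),
    (1, "running", "1,1,2,5"),
    (1, "basketball", "3,6,7"),
]


def users_with_hobbies(users, hobbies):
    """Build one record per user with a nested list of hobby objects."""
    by_user = {}
    for user_id, hobby, scores in hobbies:
        by_user.setdefault(user_id, []).append(
            {"hobby": hobby, "scores": [int(s) for s in scores.split(",")]}
        )
    # Mirrors the inner join: users without any hobby rows are dropped.
    return [
        {"user_id": uid, "name": name, "hobbies": by_user[uid]}
        for uid, name, _age in users
        if uid in by_user
    ]


result = users_with_hobbies(users, hobbies)
print(json.dumps(result[0]))
```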

How can I concatenate multiple array columns in Hive?

In a Hive table, let's assume I have data looking as follows:
+----+-----------+-----------+
| id | some_ids1 | some_ids2 |
+----+-----------+-----------+
| 1 | [2, 3] | [4, 5, 6] |
| 2 | [7] | [8, 9] |
+----+-----------+-----------+
and I would like to obtain something like:
+----+------------------+
| id | concatenated_ids |
+----+------------------+
| 1 | [2, 3, 4, 5, 6] |
| 2 | [7, 8, 9] |
+----+------------------+
In my actual example, the inner types are not primitives, so casting to string doesn't work. How are array columns concatenated in Hive?

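Write a UDF that takes the two columns, concatenates them, and returns the result as an array.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDF;
public class ArraySum extends UDF {
    public List<Double> evaluate(List<Double> list1, List<Double> list2) {
        List<Double> result = new ArrayList<>(list1); // copy so the input is not mutated
        result.addAll(list2); // addAll returns a boolean, so return the list itself
        return result;
    }
}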
Hope this helps!

Summing repeated fields in BigQuery

I will try to explain my problem as clearly as possible; please tell me if anything is unclear.
I have a table [MyTable] that looks like this:
----------------------------------------
|chn:integer | auds:integer (repeated) |
----------------------------------------
|1 |3916 |
|1 |4983 |
|1 |6233 |
|1 |1214 |
|2 |1200 |
|2 |900 |
|2 |2030 |
|2 |2345 |
----------------------------------------
Auds is always repeated 4 times.
If I query SELECT chn, auds FROM [MyTable] WHERE chn = 1, I get the following result:
-------------------
|Row | chn | auds |
-------------------
|1 |1 |3916 |
|2 |1 |4983 |
|3 |1 |6233 |
|4 |1 |1214 |
-------------------
If I query SELECT chn, auds FROM [MyTable] WHERE (chn = 1 OR chn = 2), I get the following result:
-------------------
|Row | chn | auds |
-------------------
|1 |1 |1200 |
|2 |1 |900 |
|3 |1 |2030 |
|4 |2 |2345 |
-------------------
Logically, I get twice as many results, but what I would like to get is the SUM() of the repeated field auds for chn = 1 and chn = 2, or visually, something like this:
-------------------
|Row | chn | auds |
-------------------
|1 |3 |5116 |
|2 |3 |5883 |
|3 |3 |8263 |
|4 |3 |3559 |
-------------------
I tried something like this:
SELECT a1+a2 FROM
(SELECT auds AS a1 FROM [MyTable] WHERE chn = 1),
(SELECT auds AS a2 FROM [MyTable] WHERE chn = 2)
But I get the following error:
Error: Cannot query the cross product of repeated fields a1 and a2.

It's much easier to express this sort of logic with standard SQL (uncheck "Use Legacy SQL" under "Show Options"). Here's an example that computes sums over the auds arrays:
WITH MyTable AS (
SELECT
1 AS chn,
[2, 3, 4, 5, 6] AS auds
UNION ALL SELECT
2 AS chn,
[7, 8, 9, 10, 11] AS auds
)
SELECT
chn,
(SELECT SUM(aud) FROM UNNEST(auds) AS aud) AS auds_sum
FROM MyTable;
+-----+----------+
| chn | auds_sum |
+-----+----------+
| 1 | 20 |
| 2 | 45 |
+-----+----------+
And another that computes pairwise sums for chn = 1 and chn = 2 (which I think is what you wanted based on your question):
WITH MyTable AS (
SELECT
1 AS chn,
[2, 3, 4, 5, 6] AS auds
UNION ALL SELECT
2 AS chn,
[7, 8, 9, 10, 11] AS auds
)
SELECT
ARRAY(SELECT first_aud + second_auds[OFFSET(off)]
FROM UNNEST(first_auds) AS first_aud WITH OFFSET off)
AS summed_auds
FROM (
SELECT
(SELECT auds FROM MyTable WHERE chn = 1) AS first_auds,
(SELECT auds FROM MyTable WHERE chn = 2) AS second_auds
);
+---------------------+
| summed_auds |
+---------------------+
| [9, 11, 13, 15, 17] |
+---------------------+
Edit: one more example that sums corresponding array elements across all rows. This probably won't be particularly efficient, but it should produce the intended result:
WITH MyTable AS (
SELECT
1 AS chn,
[2, 3, 4, 5, 6] AS auds
UNION ALL SELECT
2 AS chn,
[7, 8, 9, 10, 11] AS auds
UNION ALL SELECT
3 AS chn,
[-1, -6, 2, 3, 2] AS auds
)
SELECT
ARRAY(SELECT
(SELECT SUM(auds[OFFSET(off)]) FROM UNNEST(all_auds))
FROM UNNEST(all_auds[OFFSET(0)].auds) WITH OFFSET off)
AS summed_auds
FROM (
SELECT
ARRAY_AGG(STRUCT(auds)) AS all_auds
FROM MyTable
);
+--------------------+
| summed_auds |
+--------------------+
| [8, 5, 15, 18, 19] |
+--------------------+
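The three computations above (per-row sum, pairwise sum by offset, and element-wise sum across all rows) can be cross-checked with a short Python sketch over the same sample data. It assumes all auds arrays have the same length; `zip` would silently truncate to the shortest one otherwise.

```python
# Sample data copied from the third example above: chn -> auds.
my_table = {
    1: [2, 3, 4, 5, 6],
    2: [7, 8, 9, 10, 11],
    3: [-1, -6, 2, 3, 2],
}

# Per-row sum, like SUM(aud) over UNNEST(auds).
auds_sum = {chn: sum(auds) for chn, auds in my_table.items()}

# Pairwise sum of chn = 1 and chn = 2, matching elements by offset.
pairwise = [a + b for a, b in zip(my_table[1], my_table[2])]

# Element-wise sum across all rows: transpose, then sum each position.
summed_auds = [sum(column) for column in zip(*my_table.values())]

print(auds_sum)
print(pairwise)
print(summed_auds)
```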

Elliott's answers are always an inspiration for me! Please vote for and accept his answer if it works for you (it should :o))
In the meantime, I wanted to add an alternative option using a scalar JS UDF:
CREATE TEMPORARY FUNCTION mySUM(a ARRAY<INT64>, b ARRAY<INT64>)
RETURNS ARRAY<INT64>
LANGUAGE js AS """
var sum = [];
for(var i = 0; i < a.length; i++){
sum.push(parseInt(a[i]) + parseInt(b[i]));
}
return sum
""";
WITH MyTable AS (
SELECT
1 AS chn,
[2, 3, 4, 5, 6] AS auds
UNION ALL SELECT
2 AS chn,
[7, 8, 9, 10, 11] AS auds
)
SELECT
first_auds.chn AS first_auds_chn,
second_auds.chn AS second_auds_chn,
mySUM(first_auds.auds, second_auds.auds) AS summed_auds
FROM MyTable AS first_auds
JOIN MyTable AS second_auds
ON first_auds.chn = 1 AND second_auds.chn = 2
I like this option because it avoids multiple UNNESTs, ARRAYs, etc., so it is much cleaner to read.

Just use GROUP BY in conjunction with SUM.
SELECT SUM(auds), chn FROM [MyTable] GROUP BY chn
