Aggregate columns containing a dictionary in Presto SQL

Hi, I want to write a Presto SQL query for a data table (say user_data) that looks like:
user | target | result
-----------------------------
1 | b | {A: 1}
2 | a | {C: 2}
1 | c | {A: 2, B: 3}
2 | d | {A: 1}
1 | d | {C: 4}
With this data table, I would like to generate the following two outputs.
Output 1: Sum the values of the {key: value} dictionary per user, regardless of target.
user | result
-------------------
1 | {A:3, B:3, C:4}
2 | {A:1, C:2}
Output 2: For each user, collect the targets associated with each key of the dictionary.
user | result
-------------------
1 | {A:[b,c], B:[c], C:[d]}
2 | {A:[d], C:[a]}
Can anyone help me with it? I would really appreciate it.

The second one can be easily achieved with multimap_agg (add transform_values with array_distinct to remove duplicates if needed; a sketch of that variant follows the output below):
-- sample data
WITH dataset(user, target, result) AS (
values (1, 'b', map(array['A'], array[1])),
(2, 'a', map(array['C'], array[2])),
(1, 'c', map(array['A', 'B'], array[1, 2]))
)
-- query
select user, multimap_agg(k, target)
from dataset,
unnest(result) as t (k,v)
group by user;
Output:
user | _col1
-----------------------------
1 | {A=[b, c], B=[c]}
2 | {C=[a]}
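If the same target could appear more than once for a key, a deduplicated variant of the same query (a sketch, assuming transform_values and array_distinct exist in your Presto version) would be:
-- same aggregation, but each key's target list is deduplicated
select user,
       transform_values(
           multimap_agg(k, target),
           (k, v) -> array_distinct(v)
       )
from dataset,
unnest(result) as t (k, v)
group by user;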
As for the first one, you can look into using map_union_sum if it is available in your version of Presto (a sketch follows the output below). Or use some magic with unnest and transform_values:
-- query
select user,
transform_values(
multimap_agg(k, v),
(k,v) -> reduce(v, 0, (s, x) -> s + x, s -> s) -- or array_sum if available
)
from dataset,
unnest(result) as t (k, v)
group by user;
Output:
user | _col1
-----------------------------
1 | {A=2, B=2}
2 | {C=2}
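For completeness, the map_union_sum version would be a plain aggregation over the maps, with no unnest needed (a sketch; map_union_sum only exists in some Presto builds, so verify it against your version first):
-- sums the values of matching keys across all of a user's maps
select user, map_union_sum(result)
from dataset
group by user;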

Related

SQL Presto Aggregate Table column values with another column values

Hi, I want to write a Presto SQL query for a data table (say user_data) that looks like:
user | target | result
-----------------------------
1 | b | {A: 1}
2 | a | {C: 2}
1 | c | {A: 2, B: 3}
2 | d | {A: 1}
1 | d | {C: 4}
With this data table, I would like to generate the following two outputs.
Output 1: Count the number of unique targets for each result key, per user. For example, user 1 has 2 targets (b and c) that have result A, and one target each for results B (target c) and C (target d).
user | result
-------------------
1 | {A: 2, B:1, C:1}
2 | {A: 1, C: 1}
Output 2: Aggregate the last column based on the targets of the user.
user | result
-------------------
1 | {A:[b,c], B:[c], C:[d]}
2 | {A:[d], C:[a]}
** Or even better, can we make one table that has both columns?
user | result 1 | result 2
--------------------------------------------------
1 | {A:[b,c], B:[c], C:[d]} | {A: 2, B:1, C:1}
2 | {A:[d], C:[a]} | {A: 1, C: 1}
Can anyone help me with it? I would really appreciate it.
I'm pretty new to SQL, so I don't even know how to start.
This can be achieved with map aggregate functions. Assuming that result is originally a map, you can flatten it with unnest, then group by user and use the multimap_agg and histogram functions:
-- sample data
WITH dataset(user, target, result) AS (
VALUES (1, 'b', map(array['A'], array[1])),
(2, 'a', map(array['C'], array[2])),
(1, 'c', map(array['A', 'B'], array[2, 3])),
(2, 'd', map(array['A'], array[1])),
(1, 'd', map(array['C'], array[4]))
)
-- query
select user, multimap_agg(k, target), histogram(k)
from dataset,
unnest(result) as t(k, v)
group by user;
Output:
user | _col1 | _col2
-----------------------------
2 | {A=[d], C=[a]} | {A=1, C=1}
1 | {A=[b, c], B=[c], C=[d]} | {A=2, B=1, C=1}
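Note that histogram(k) counts rows, not distinct targets, so it only matches the requested output while each target contributes a key at most once. A hedged variant that counts distinct targets explicitly (assuming array_distinct and cardinality are available in your Presto version) derives the counts from the same multimap:
-- result2 counts distinct targets per key instead of raw rows
select user,
       multimap_agg(k, target) as result1,
       transform_values(
           multimap_agg(k, target),
           (k, v) -> cardinality(array_distinct(v))
       ) as result2
from dataset,
unnest(result) as t(k, v)
group by user;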

Output multiple summarized lists with KQL

I want to output multiple lists of unique column values with KQL.
For instance for the following table:
A | B | C
-------------------
1 | x | one
1 | x | two
1 | y | one
I want to output
K | V
-------------------
A | [1]
B | [x,y]
C | [one, two]
I accomplished this using summarize with make_list and two unions, and I've been wondering if it's possible to accomplish this in the same query, without the unions.
Table
| distinct A
| summarize k="A", v= make_list(A)
union
Table
| distinct B
| summarize k="B", v= make_list(B)
...
If your data set is reasonably sized, you could try using the narrow() plugin: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/narrowplugin
datatable(A:int, B:string, C:string)
[
1, 'x', 'one',
1, 'x', 'two',
1, 'y', 'one',
]
| evaluate narrow()
| summarize make_set(Value) by Column
Column | set_Value
-------------------
A | ["1"]
B | ["x","y"]
C | ["one","two"]
Alternatively, you could use a combination of pack_all() and mv-apply
datatable(A:int, B:string, C:string)
[
1, 'x', 'one',
1, 'x', 'two',
1, 'y', 'one',
]
| project p = pack_all()
| mv-apply p on (
extend key = tostring(bag_keys(p)[0])
| project key, value = p[key]
)
| summarize make_set(value) by key
key | set_value
-------------------
A | ["1"]
B | ["x","y"]
C | ["one","two"]

Statistical functions on columns of arrays in BigQuery

If I have data that looks like the following:
+------+------------+-------+
| a | val | val2 |
+------+------------+-------+
| 3.14 | [1, 2, 3] | [2, 3]|
| 1.59 | [7, 8, 9] | ... |
| -1 | [4, 5, 6] | ... |
+------+------------+-------+
and I want to get the array averages of the val column. Naively, I'd just try something like
SELECT
AVG(val)
FROM
<Table>
But that doesn't work. I get an error like No matching signature for aggregate function AVG for argument types: ARRAY<INT64>. Supported signatures: AVG(INT64); AVG(UINT64); AVG(DOUBLE); AVG(NUMERIC)
I know that if I have just one column val I can do something like
SELECT avg
FROM
(
SELECT AVG(val) as avg
FROM UNNEST(val) AS val
)
but what if I have multiple columns (val, val2, etc.) and need multiple statistics? The above method just seems really cumbersome.
To be clear the result I'd want is:
+------+------------+-------------+--------------+
| a | avg_val | std_dev_val | avg_val2 |
+------+------------+-------------+--------------+
| 3.14 | 2 | 1 | ... |
| 1.59 | 8 | .... | ... |
| -1 | 5 | .... | ... |
+------+------------+-------------+--------------+
Is there a simple way to do this? Or do I need to create some sort of temporary function to accomplish this? Or am I stuck doing something like what I see in https://stackoverflow.com/a/45560462/1902480
If you want the average as an array, you can unnest and then reaggregate:
-- element-wise average across rows: group by array position n
select array_agg(avg_val order by n)
from (select n, avg(v) as avg_val
      from t cross join
      unnest(t.val) v with offset n
      group by n
     ) a;
EDIT:
If you want the values per row, just use a subquery with unnest():
select t.*,
(select avg(el)
from unnest(t.val) el
),
(select avg(el)
from unnest(t.val2) el
)
from t;
And so on for whatever aggregation functions you want.
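For example, matching the desired output above with averages plus a standard deviation (a sketch; STDDEV is BigQuery's sampling standard deviation, an alias of STDDEV_SAMP):
-- one correlated subquery per statistic, aliased to match the desired columns
select t.a,
       (select avg(el) from unnest(t.val) el) as avg_val,
       (select stddev(el) from unnest(t.val) el) as std_dev_val,
       (select avg(el) from unnest(t.val2) el) as avg_val2
from t;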
Consider the example below:
#standardSQL
create temp function array_avg(arr any type) as ((
select avg(val) from unnest(arr) val
));
create temp function array_std_dev(arr any type) as ((
select stddev(val) from unnest(arr) val
));
select a,
val, array_avg(val) val_avg, array_std_dev(val) val_stddev,
val2, array_avg(val2) val2_avg, array_std_dev(val2) val2_stddev
from `project.dataset.table`
If applied to the sample data in your question, the output lists each array column alongside its average and standard deviation.
I think simple subqueries should be fine - AVG() only works with tables and UNNEST() turns arrays into tables - so you can just combine them:
SELECT
(SELECT AVG(val) FROM UNNEST(val1) as val) AS avg_val1,
(SELECT AVG(val) FROM UNNEST(val2) as val) AS avg_val2,
(SELECT AVG(val) FROM UNNEST(val3) as val) AS avg_val3
FROM
<table>
val1, val2 and val3 are looked up as columns in <table> while val within the subqueries will be looked up in the table coming from the respective UNNEST().

Is there a simple way to transform an array based on a dimension table?

I have two tables:
One with a column that is an array of identifiers.
Another which is a dimension table that maps these identifiers to another value.
I'm looking to transform the column of the first table using the dimension table.
Example:
Table 1:
Column A | Column B
'Bob' | ['a', 'b', 'c']
'Terry' | ['a']
Dimension Table:
Column C | Column D
'a' | 1
'b' | 2
'c' | 3
Expected Output:
Column A | Column B
'Bob' | [1,2,3]
'Terry' | [1]
Is there a way to do this (preferably in Presto) without exploding and re-aggregating the array column ?
I guess you would be able to do this without exploding and re-aggregating by using transform_keys; not sure this is easier, though.
SELECT map_keys(transform_keys(MAP(ARRAY['a','c'], ARRAY[null,null]),
(k, v) -> MAP(ARRAY['a', 'b', 'c', 'd'], ARRAY[1,2,3,4])[k]));
I guess it requires that the dimension table is not "too big".
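A related trick that may read more naturally is to collapse the dimension table into a single map with map_agg and rewrite each element with transform (a sketch; the names dim(c, d) and t1(a, b) are stand-ins for the question's Column A-D):
-- build one lookup map from the dimension table, then map each array element through it
with lookup as (
    select map_agg(c, d) as m
    from dim
)
select t.a,
       transform(t.b, x -> element_at(lookup.m, x)) as b_mapped  -- element_at yields null for unknown ids
from t1 t
cross join lookup;
This still joins once against the (aggregated) dimension table, but the array itself is never exploded and re-aggregated.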

PostgreSQL: second query based on array-type results from the first query, i.e. chaining queries with arrays

I have graph information stored in a database. Each node has an integer id, a text label, and an adjacency list, which is an integer ARRAY of ids. In the first query I'll get a list of nodes; for each node in the result I would like to get the names of all the nodes adjacent to it.
CREATE TABLE graph (id INTEGER,
name TEXT,
adj_list INTEGER[],
PRIMARY KEY (id)
);
Here's the pseudo-code of what I would like to achieve.
let node_list = (select * from graph where name like "X%");
foreach node in node_list:
foreach adj_node in node.adj_list:
print adj_node.name
Can anyone please suggest me on how to write PostgreSQL query to achieve this?
Here is some example data
id | name | adj
---+------+------------
1 | X1 | {3, 4}
2 | X2 | {5, 6}
3 | Y1 | {..}
4 | Y2 | {..}
5 | Z1 | {..}
6 | Z2 | {..}
I would like to list all the adjacent nodes of nodes whose name start with X. In this example, the results would be {Y1, Y2, Z1, Z2}.
It is probably a lot easier if you build another table like #twelfth suggests. But if you do want to rock the integer array, I believe you can do something like this:
--create the table
create table graph (
id integer,
name text,
adj_list integer[]
);
-- a sample insert
insert into graph (id, name, adj_list)
values
(1, 'X1', '{3,4}'),
(2, 'X2', '{5,6}'),
(3, 'Y1', '{}'),
(4, 'Y2', '{}'),
(5, 'Z1', '{}'),
(6, 'Z2', '{}')
;
-- use a CTE to unnest the array and give you a simple list of integers.
-- In my opinion this CTE makes the code easier to read
with adjacent_ones as (
select unnest(adj_list) from graph where name like 'X%'
)
select * from graph where id in (select * from adjacent_ones);
This will give you the following
--------------------------
|id |name |adj_list |
--------------------------
|3 |Y1 |{} |
|4 |Y2 |{} |
|5 |Z1 |{} |
|6 |Z2 |{} |
--------------------------
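For reference, the CTE can also be written as a self-join that matches a node's id against any element of the X-nodes' adjacency arrays (a sketch on the same graph table):
-- = ANY(array) matches id against each element of adj_list
select distinct g2.id, g2.name
from graph g1
join graph g2 on g2.id = any (g1.adj_list)
where g1.name like 'X%'
order by g2.id;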