SQL Presto Aggregate Table column values with another column values

Hi, I want to write a Presto SQL query for a data table (say user_data) that looks like this:
user | target | result
-----------------------------
1 | b | {A: 1}
2 | a | {C: 2}
1 | c | {A: 2, B: 3}
2 | d | {A: 1}
1 | d | {C: 4}
With this data table, I would like to generate the following two outputs.
Output 1: Count the number of unique targets for each result key, for each user. For example, user 1 has two targets (b and c) with result A, and one target each for result B (target c) and result C (target d).
user | result
-------------------
1 | {A: 2, B:1, C:1}
2 | {A: 1, C: 1}
Output 2: Aggregate the last column based on the targets of the user.
user | result
-------------------
1 | {A:[b,c], B:[c], C:[d]}
2 | {A:[d], C:[a]}
Or, even better, can we produce one table that has both columns?
user | result 1 | result 2
--------------------------------------------------
1 | {A:[b,c], B:[c], C:[d]} | {A: 2, B:1, C:1}
2 | {A:[d], C:[a]} | {A: 1, C: 1}
Can anyone help me with it? I would really appreciate it.
I'm pretty new to SQL, so I didn't even know how to start.

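This can be achieved with map aggregate functions. Assuming that result is originally a map, you can flatten it with unnest, then group by user and apply the multimap_agg and histogram functions. Note that histogram counts rows per key, so it equals the number of unique targets only when each (user, target) pair appears once: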
-- sample data
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[2, 3])),
           (2, 'd', map(array['A'], array[1])),
           (1, 'd', map(array['C'], array[4]))
)
-- query
select user, multimap_agg(k, target), histogram(k)
from dataset,
     unnest(result) as t(k, v)
group by user;
Output:
user | _col1 | _col2
--------------------------------------------------
2 | {A=[d], C=[a]} | {A=1, C=1}
1 | {A=[b, c], B=[c], C=[d]} | {A=2, B=1, C=1}
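To sanity-check the expected result outside Presto, here is a minimal Python sketch of the same idea (not Presto code). The function name aggregate is made up for illustration. It counts unique targets with a set, so it would differ from histogram if a (user, target) pair were repeated.

```python
from collections import defaultdict

# Sample rows from the question: (user, target, result map)
rows = [
    (1, "b", {"A": 1}),
    (2, "a", {"C": 2}),
    (1, "c", {"A": 2, "B": 3}),
    (2, "d", {"A": 1}),
    (1, "d", {"C": 4}),
]

def aggregate(rows):
    """Return {user: (targets_by_key, unique_target_count_by_key)}."""
    targets = defaultdict(lambda: defaultdict(list))
    for user, target, result in rows:
        for key in result:  # "unnest": one (key, target) pair per map entry
            targets[user][key].append(target)
    out = {}
    for user, by_key in targets.items():
        # Sorted only to make the output deterministic in this sketch.
        lists = {k: sorted(v) for k, v in by_key.items()}
        # Unique targets per key (a set removes repeated targets).
        counts = {k: len(set(v)) for k, v in by_key.items()}
        out[user] = (lists, counts)
    return out

summary = aggregate(rows)
for user, (lists, counts) in sorted(summary.items()):
    print(user, lists, counts)
```

The two dictionaries per user correspond to the "result 1" and "result 2" columns asked for in the question.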

Aggregate columns containing dictionary in SQL presto

Hi, I want to write a Presto SQL query for a data table (say user_data) that looks like this:
user | target | result
-----------------------------
1 | b | {A: 1}
2 | a | {C: 2}
1 | c | {A: 2, B: 3}
2 | d | {A: 1}
1 | d | {C: 4}
With this data table, I would like to generate the following two outputs.
Output 1: Sum the values of each key in the {key: value} dictionary per user, regardless of target.
user | result
-------------------
1 | {A:3, B:3, C:4}
2 | {A:1, C:2}
Output 2: Aggregate the last column based on the targets of the user.
user | result
-------------------
1 | {A:[b,c], B:[c], C:[d]}
2 | {A:[d], C:[a]}
Can anyone help me with it? I would really appreciate it.
The second one can be easily achieved with multimap_agg (add transform_values with array_distinct to remove duplicates if needed):
-- sample data
WITH dataset(user, target, result) AS (
    values (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[1, 2]))
)
-- query
select user, multimap_agg(k, target)
from dataset,
     unnest(result) as t (k, v)
group by user;
Output:
user | _col1
-------------------
1 | {A=[b, c], B=[c]}
2 | {C=[a]}
As for the first one, you can look into using map_union_sum if it is available in your version of Presto. Otherwise, unnest the maps, collect the values per key with multimap_agg, and sum each resulting array with transform_values:
-- query (uses the same dataset as above)
select user,
       transform_values(
           multimap_agg(k, v),
           (k, v) -> reduce(v, 0, (s, x) -> s + x, s -> s) -- or array_sum if available
       )
from dataset,
     unnest(result) as t (k, v)
group by user;
Output:
user | _col1
-------------------
1 | {A=2, B=2}
2 | {C=2}
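If you want to double-check the per-key sums independently of Presto, the following Python sketch reproduces the same result on the three sample rows. It is only an illustration of the semantics; sum_maps_per_user is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Same three sample rows as in the answer: (user, target, result map)
rows = [
    (1, "b", {"A": 1}),
    (2, "a", {"C": 2}),
    (1, "c", {"A": 1, "B": 2}),
]

def sum_maps_per_user(rows):
    """Sum map values key by key for each user, ignoring the target."""
    totals = defaultdict(Counter)
    for user, _target, result in rows:
        # Counter.update with a mapping adds the values key by key.
        totals[user].update(result)
    return {user: dict(counter) for user, counter in totals.items()}

totals = sum_maps_per_user(rows)
print(totals)
```

The totals match the query output above: A and B each sum to 2 for user 1, and C is 2 for user 2.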

Output multiple summarized lists with KQL

I want to output multiple lists of unique column values with KQL.
For instance, for the following table:
A | B | C
-------------------
1 | x | one
1 | x | two
1 | y | one
I want to output
K | V
-------------------
A | [1]
B | [x,y]
C | [one, two]
I accomplished this using summarize with make_list and two unions, but I've been wondering whether it's possible to do this in a single query without union.
Table
| distinct A
| summarize k="A", v= make_list(A)
union
Table
| distinct B
| summarize k="B", v= make_list(B)
...
If your data set is reasonably sized, you could try using the narrow() plugin: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/narrowplugin
datatable(A:int, B:string, C:string)
[
1, 'x', 'one',
1, 'x', 'two',
1, 'y', 'one',
]
| evaluate narrow()
| summarize make_set(Value) by Column
Column | set_Value
-------------------
A | ["1"]
B | ["x","y"]
C | ["one","two"]
Alternatively, you could use a combination of pack_all() and mv-apply:
datatable(A:int, B:string, C:string)
[
1, 'x', 'one',
1, 'x', 'two',
1, 'y', 'one',
]
| project p = pack_all()
| mv-apply p on (
extend key = tostring(bag_keys(p)[0])
| project key, value = p[key]
)
| summarize make_set(value) by key
key | set_value
-------------------
A | ["1"]
B | ["x","y"]
C | ["one","two"]
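Both queries follow the same pattern: unpivot each row into (column, value) pairs, then collect the distinct values per column. As a rough illustration of that pattern only (this is Python, not KQL, and distinct_values_by_column is a made-up name), it can be sketched like this:

```python
# The sample table from the question, one dict per row.
rows = [
    {"A": 1, "B": "x", "C": "one"},
    {"A": 1, "B": "x", "C": "two"},
    {"A": 1, "B": "y", "C": "one"},
]

def distinct_values_by_column(rows):
    """Collect the distinct values of every column, in first-seen order."""
    result = {}
    for row in rows:
        for column, value in row.items():  # unpivot: one pair per cell
            values = result.setdefault(column, [])
            text = str(value)  # stringified to mirror the outputs shown above
            if text not in values:
                values.append(text)
    return result

distinct = distinct_values_by_column(rows)
print(distinct)
```

The list-membership check is fine for small inputs; for larger data a set per column would be the usual choice.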

What is the best way to select a row based on whether or not the previous row has a matching value?

I know about the lag function, but I'm confused about how to make it work in my specific situation. I have four columns: id, query, url, and position, where query is a search query, url is the URL found, and position is the position at which the URL was found.
I'd like to query every instance of when an entry (with the same query and url) changes position from an integer to None, when ordered by id.
Example:
id | query | url | position
------------------------------
0 | 'dog' | 'dog.com' | 2
1 | 'cat' | 'cat.com' | None
2 | 'dog' | 'dog.com' | 3
4 | 'cat' | 'cat.com' | 5
5 | 'dog' | 'dog.com' | None
6 | 'cat' | 'cat.com' | 2
7 | 'bird' | 'bird.com' | 9
8 | 'bird' | 'bird.com' | None
I'd want to return:
5 | 'dog' | 'dog.com'
8 | 'bird' | 'bird.com'
(Since those are the two entries where the position changed from an integer to None)
In the following query, LAG(position, 1) means that we want to obtain the previous row's position value and OVER (PARTITION BY query, url ORDER BY id) means that the "previous row" is determined by considering rows with the same query and url values and ordering by id.
Once the position from the previous row has been determined we can select rows with a NULL position whose predecessor's position was not NULL.
-- Create an in-memory "table" using the data from the question
WITH tbl (id, query, url, position) AS (
    VALUES
        (0, 'dog', 'dog.com', 2),
        (1, 'cat', 'cat.com', NULL),
        (2, 'dog', 'dog.com', 3),
        (4, 'cat', 'cat.com', 5),
        (5, 'dog', 'dog.com', NULL),
        (6, 'cat', 'cat.com', 2),
        (7, 'bird', 'bird.com', 9),
        (8, 'bird', 'bird.com', NULL)
),
-- Compute the "previous position" values
tbl2 AS (
    SELECT id,
           query,
           url,
           position,
           LAG(position, 1) OVER (PARTITION BY query, url ORDER BY id) AS prev_position
    FROM tbl
)
-- Select the records that meet our criteria
SELECT id, query, url
FROM tbl2
WHERE position IS NULL
  AND prev_position IS NOT NULL
ORDER BY id;
Result
id │ query │ url
════╪═══════╪══════════
5 │ dog │ dog.com
8 │ bird │ bird.com
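For readers who find the window function hard to picture, here is a small Python sketch of what LAG with PARTITION BY query, url ORDER BY id amounts to: walk the rows in id order and remember the last position seen for each (query, url) pair. The function name dropped_positions is hypothetical, and None stands in for NULL.

```python
# Rows from the question: (id, query, url, position); None plays the role of NULL.
rows = [
    (0, "dog", "dog.com", 2),
    (1, "cat", "cat.com", None),
    (2, "dog", "dog.com", 3),
    (4, "cat", "cat.com", 5),
    (5, "dog", "dog.com", None),
    (6, "cat", "cat.com", 2),
    (7, "bird", "bird.com", 9),
    (8, "bird", "bird.com", None),
]

def dropped_positions(rows):
    """Return rows whose position is None while the previous row
    for the same (query, url) had a non-None position."""
    previous = {}  # (query, url) -> position of the most recent row
    hits = []
    for row_id, query, url, position in sorted(rows, key=lambda r: r[0]):
        key = (query, url)
        prev_position = previous.get(key)  # None if there is no earlier row
        if position is None and prev_position is not None:
            hits.append((row_id, query, url))
        previous[key] = position
    return hits

hits = dropped_positions(rows)
print(hits)
```

Row 1 ('cat') is not reported because it has no predecessor, which matches LAG returning NULL for the first row of a partition.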

Is there a simple way to transform an array based on a dimension table?

I have two tables:
One with a column which is an array of identifiers.
Another which is a dimension table that maps these identifiers to another value.
I'm looking to transform the column of the first table using the dimension table.
Example:
Table 1:
Column A | Column B
'Bob' | ['a', 'b', 'c']
'Terry' | ['a']
Dimension Table:
Column C | Column D
'a' | 1
'b' | 2
'c' | 3
Expected Output:
Column A | Column B
'Bob' | [1,2,3]
'Terry' | [1]
Is there a way to do this (preferably in Presto) without exploding and re-aggregating the array column ?
I guess you would be able to do this without exploding and re-aggregating by using transform_keys, though I'm not sure it is any easier.
SELECT map_keys(transform_keys(MAP(ARRAY['a','c'], ARRAY[null,null]),
(k, v) -> MAP(ARRAY['a', 'b', 'c', 'd'], ARRAY[1,2,3,4])[k]));
I guess it requires that the dimension table is not "too big", since its contents have to be available as a single map.
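The underlying idea is a per-element lookup: turn the dimension table into a map once, then translate every array element through it. A minimal Python sketch of that idea, using the example tables from the question (map_arrays is a hypothetical name, and this is not Presto code):

```python
# Table 1: (Column A, Column B) with Column B holding an array of identifiers.
table1 = [("Bob", ["a", "b", "c"]), ("Terry", ["a"])]

# Dimension table: (Column C, Column D).
dimension = [("a", 1), ("b", 2), ("c", 3)]

def map_arrays(table, dimension):
    """Replace each identifier in the array column with its dimension value."""
    lookup = dict(dimension)  # built once, so it must fit in memory
    # Identifiers missing from the dimension table become None here.
    return [(name, [lookup.get(item) for item in items]) for name, items in table]

mapped = map_arrays(table1, dimension)
print(mapped)
```

This keeps the original element order, which matters if the array positions carry meaning.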

PostgreSQL second query based on an array type results from the first query. i.e. chaining queries with arrays

I have graph information stored in a database. Each node has an integer id, a text label, and an adjacency list, which is an integer ARRAY of ids. The first query returns a list of nodes; for each node in the result, I would like to get the names of all the nodes adjacent to it.
CREATE TABLE graph (id INTEGER,
name TEXT,
adj_list INTEGER[],
PRIMARY KEY (id)
);
Here's the pseudo-code of what I would like to achieve.
let node_list = (select * from graph where name like "X%");
foreach node in node_list:
    foreach adj_node in node.adj_list:
        print adj_node.name
Can anyone please suggest how to write a PostgreSQL query to achieve this?
Here is some example data
id | name | adj
---+------+------------
1 | X1 | {3, 4}
2 | X2 | {5, 6}
3 | Y1 | {..}
4 | Y2 | {..}
5 | Z1 | {..}
6 | Z2 | {..}
I would like to list all the adjacent nodes of nodes whose names start with X. In this example, the results would be {Y1, Y2, Z1, Z2}.
It is probably a lot easier if you build another table like @twelfth suggests. But if you do want to rock the integer array, I believe you can do something like this:
--create the table
create table graph (
id integer,
name text,
adj_list integer[]
);
-- a sample insert
insert into graph (id, name, adj_list)
values
(1, 'X1', '{3,4}'),
(2, 'X2', '{5,6}'),
(3, 'Y1', '{}'),
(4, 'Y2', '{}'),
(5, 'Z1', '{}'),
(6, 'Z2', '{}')
;
-- use a CTE to unnest the array and give you a simple list of integers.
-- In my opinion this CTE makes the code easier to read
with adjacent_ones as (
select unnest(adj_list) from graph where name like 'X%'
)
select * from graph where id in (select * from adjacent_ones);
This will give you the following:
--------------------------
|id |name |adj_list |
--------------------------
|3 |Y1 |{} |
|4 |Y2 |{} |
|5 |Z1 |{} |
|6 |Z2 |{} |
--------------------------
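If it helps to see the logic spelled out step by step, here is a rough Python equivalent of the CTE approach: gather the adjacent ids of the matching nodes, then look up the names for those ids. It mirrors the question's pseudo-code; adjacent_names is a made-up helper and the dictionary stands in for the graph table.

```python
# The sample data from the insert above: id -> (name, adj_list)
graph = {
    1: ("X1", [3, 4]),
    2: ("X2", [5, 6]),
    3: ("Y1", []),
    4: ("Y2", []),
    5: ("Z1", []),
    6: ("Z2", []),
}

def adjacent_names(graph, prefix):
    """Names of all nodes adjacent to nodes whose name starts with prefix."""
    # Step 1 (the CTE): unnest the adjacency lists of the matching nodes.
    adjacent_ids = {
        adj
        for name, adj_list in graph.values()
        if name.startswith(prefix)
        for adj in adj_list
    }
    # Step 2 (the outer select): keep the nodes whose id is in that set.
    return sorted(graph[node_id][0] for node_id in adjacent_ids if node_id in graph)

names = adjacent_names(graph, "X")
print(names)
```

Using a set for the ids means a node adjacent to several X nodes is listed once, just as with the IN subquery.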