Finding most common elements of column of arrays in Presto - sql

I would like to find the most common elements within a column of arrays in Presto.
For example...
col1
[A,B,C]
[A,B]
[A,D]
with output of...
col1 - col2
A - 3
B - 2
C - 1
D - 1
I have tried using flatten and unnest. I am able to get it into a single array using
select flatten(array_agg(col1))
from tablename;
but I am then not sure how to group and count by the distinct elements. I also am struggling to get this to run on all of my data because of the large amount of memory required.
Thanks for any help!

You can use unnest() to flatten the arrays and then GROUP BY to count the unique values.
The query below generates the data set for your case. You can replace this part with your own select in the final query:
with dataset AS (
SELECT ARRAY[
ARRAY['A','B','C'],
ARRAY['A','B'],
ARRAY['A','D']
] AS data
)
select dt from dataset
CROSS JOIN UNNEST(data) AS t(dt)
O/P:
------
dt
------
[A,B,C]
------
[A,B]
------
[A,D]
In the final query we unnest a second time, so that every element of every array becomes its own row, and then group those rows to get each unique value and its count.
FINAL QUERY:
with da AS(
with dataset AS (
SELECT ARRAY[
ARRAY['A','B','C'],
ARRAY['A','B'],
ARRAY['A','D']
] AS data
)
select dt from dataset
CROSS JOIN UNNEST(data) AS t(dt)
)
select daVal,count(*) from da
CROSS JOIN UNNEST(dt) AS t(daVal)
GROUP BY daVal

You can unnest() and aggregate:
select u.col, count(*)
from t cross join
unnest(col1) u(col)
group by u.col;
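To make the semantics of the unnest-and-group approach concrete, here is a small Python model of what the query computes. It is only an illustration, not Presto code; the function name count_elements is made up for this sketch, and the sample rows are the ones from the question. Sorting by count corresponds to adding ORDER BY count(*) DESC to the SQL.

```python
from collections import Counter

# Sample rows copied from the question; each row holds one array.
rows = [["A", "B", "C"], ["A", "B"], ["A", "D"]]


def count_elements(rows):
    """Model of CROSS JOIN UNNEST + GROUP BY (hypothetical helper name)."""
    counts = Counter()
    for arr in rows:        # one iteration per table row
        counts.update(arr)  # UNNEST: every array element becomes one "row"
    # most_common() sorts by count, highest first
    return counts.most_common()


for value, n in count_elements(rows):
    print(value, n)
```

Because each row is processed on its own, nothing requires collecting all arrays into one big array first, which is where the memory pressure of flatten(array_agg(...)) comes from.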

Related

Hive - Reformat data structure

So I have a sample of Hive data:
Customer
xx_var
yy_var
branchflow
{"customer_no":"239230293892839892","acct":["2324325","23425345"]}
23
3
[{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]
And I want to transform it into something like this:
Customer_no
acct
xx_var
yy_var
branchflow
239230293892839892
2324325
23
3
[1,2,3,4,5,6,6,6,4]
239230293892839892
23425345
23
3
[1,2,3,4,5,6,6,6,99,4]
I have tried using this query, but I am getting the wrong output format.
SELECT
customer.customer_no,
acct,
xx_var,
yy_var,
bi_acctno,
values_bi
FROM
struct_test
LATERAL VIEW explode(customer.acct) acct AS acctno
LATERAL VIEW explode(brancflow.acctno) bia as bi_acctno
LATERAL VIEW explode(brancflow.value) biv as values_bi
WHERE bi_acctno = acctno
Does anyone know how to approach this problem?
Use json_tuple to extract the JSON elements. It returns an array as a string too, so remove the square brackets, split the string, and explode the result. See the comments in the demo code.
Demo:
with mytable as (--demo data, use your table instead of this CTE
select '{"customer_no":"239230293892839892","acct":["2324325","23425345"]}' as customer,
23 xx_var, 3 yy_var,
'[{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]' branchflow
)
select c.customer_no,
a.acct,
t.xx_var, t.yy_var,
get_json_object(b.acct_branchflow,'$.value') value
from mytable t
--extract customer_no and acct array
lateral view json_tuple(t.customer, 'customer_no', 'acct') c as customer_no, accts
--remove [] and " and explode array of acct
lateral view explode(split(regexp_replace(c.accts,'^\\[|"|\\]$',''),',')) a as acct
--remove [] and explode array of json
lateral view explode(split(regexp_replace(t.branchflow,'^\\[|\\]$',''),'(?<=\\}),(?=\\{)')) b as acct_branchflow
--this will remove duplicates after lateral view: need only matching acct
where get_json_object(b.acct_branchflow,'$.acctno') = a.acct
Result:
customer_no acct xx_var yy_var value
239230293892839892 2324325 23 3 [1,2,3,4,5,6,6,6,4]
239230293892839892 23425345 23 3 [1,2,3,4,5,6,6,6,99,4]
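The logic of the query (explode both arrays, then keep only the pairs whose account numbers match) can be modelled in a few lines of Python. This is just a sketch to show the intended result, not Hive code; the function name flatten_customer is hypothetical and the JSON strings are the sample values from the question.

```python
import json

# Sample values copied from the question.
customer = '{"customer_no":"239230293892839892","acct":["2324325","23425345"]}'
branchflow = (
    '[{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},'
    '{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]'
)


def flatten_customer(customer_json, xx_var, yy_var, branchflow_json):
    """Model of the two lateral views plus the WHERE filter (hypothetical name)."""
    cust = json.loads(customer_json)
    flows = json.loads(branchflow_json)
    out = []
    for acct in cust["acct"]:           # explode the acct array
        for flow in flows:              # explode the branchflow array
            if flow["acctno"] == acct:  # keep only the matching account
                out.append(
                    (cust["customer_no"], acct, xx_var, yy_var, flow["value"])
                )
    return out


for row in flatten_customer(customer, 23, 3, branchflow):
    print(row)
```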

How to unnest BigQuery nested records into multiple columns

I am trying to unnest the table below.
I am using the following unnest query to flatten the table:
SELECT
id,
name ,keyword
FROM `project_id.dataset_id.table_id`
,unnest (`groups` ) as `groups`
where id = 204358
The problem is that this duplicates the rows (except name), as is usual when flattening a table.
How can I modify the query to put the names in two different columns rather than in rows?
Expected output below:
That's because the comma is a cross join - in combination with an unnested array it is a lateral cross join. You repeat the parent row for every row in the array.
One problem with pivoting arrays is that an array can have a variable number of elements, but a table must have a fixed number of columns.
So you need a way to decide which array element becomes which column.
E.g. by position:
SELECT
id,
name,
`groups`[ordinal(1)] as firstArrayEntry,
`groups`[ordinal(2)] as secondArrayEntry,
keyword
FROM `project_id.dataset_id.table_id`,
unnest(`groups`)
where id = 204358
If your array had a key-value pair you could decide using the key. E.g.
SELECT
id,
name,
(select value from unnest(`groups`) where key='key1') as key1,
keyword
FROM `project_id.dataset_id.table_id`,
unnest(`groups`)
where id = 204358
But that doesn't seem to be the case with your table ...
A third option could be PIVOT in combination with your cross-join solution, but this one has restrictions too, and I'm not sure how computation-heavy it is.
Consider the simple solution below:
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (1, 2))
If applied to the sample data in your question, the output is
Note: when you apply this to your real case, you just need to know how many such name_NNN columns to expect and extend the list accordingly, for example for offset + 1 in (1, 2, 3, 4, 5) if you expect 5 such columns.
If for whatever reason you want to improve on this, use the query below, where everything is built dynamically so you don't need to know in advance how many columns the output will have:
execute immediate (select '''
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (''' || string_agg('' || pos, ', ') || '''))
'''
from (select pos from (
select max(array_length(`groups`)) cnt
from `project_id.dataset_id.table_id`
), unnest(generate_array(1, cnt)) pos
))
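As a rough mental model of the UNNEST ... WITH OFFSET plus PIVOT approach, the following Python sketch places each array element's name into a column chosen by its position. It is not BigQuery code. The sample row (names "alpha" and "beta") is invented for illustration because the question's data was only shown as an image, the function name pivot_names is hypothetical, and the sketch assumes id is unique so that max(name) only ever sees one value per cell.

```python
# Hypothetical sample row; the real table contents are not known.
sample = [
    {
        "id": 204358,
        "groups": [
            {"name": "alpha", "keyword": "OVG"},
            {"name": "beta", "keyword": "OVG"},
        ],
    }
]


def pivot_names(rows, n_cols):
    """Model of PIVOT(max(name) name FOR offset + 1 IN (1..n_cols))."""
    table = {}
    for row in rows:
        # enumerate() plays the role of UNNEST(...) WITH OFFSET
        for offset, group in enumerate(row["groups"]):
            # Columns that are not pivoted (id, keyword) define the output row
            key = (row["id"], group["keyword"])
            names = table.setdefault(key, [None] * n_cols)
            if offset < n_cols:  # positions outside the IN list are dropped
                names[offset] = group["name"]
    return [
        {"id": id_, "keyword": kw, **{f"name_{i + 1}": n for i, n in enumerate(names)}}
        for (id_, kw), names in table.items()
    ]


print(pivot_names(sample, 2))
```

The n_cols argument corresponds to the length of the IN list, which is exactly the value the dynamic version derives from max(array_length(`groups`)).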
Your question is a little unclear, because it does not specify what to do with other keywords or other columns. If you specifically want the first two values in the array for keyword "OVG", you can unnest the array and pull out the appropriate names:
SELECT id,
(SELECT g.name
FROM UNNEST(t.groups) g WITH OFFSET n
WHERE key = 'OVG'
ORDER BY n
LIMIT 1
) as name_1,
(SELECT g.name
FROM UNNEST(t.groups) g WITH OFFSET n
WHERE key = 'OVG'
ORDER BY n
LIMIT 1 OFFSET 1
) as name_2,
'OVG' as keyword
FROM `project_id.dataset_id.table_id` t
WHERE id = 204358;

Filter records using JSON function for JSON array

There is a table where data is stored in JSON format. I need to find how many records have Quote Required set to Yes.
JSON
[{"id":14,"desc":"Job is incomplete.","quote_required":"Yes"},
{"id":14,"desc":"appointment need to rebook","quote_required":"Yes","start-date":"2021-11-20"}]
I am trying to achieve this using JSON_CONTAINS() and JSON_EXTRACT() as below:
SELECT COUNT(*)
FROM `products`
WHERE JSON_CONTAINS( JSON_EXTRACT(submit_report, "$.quote_required"), '"Yes"' )
But I am getting 0 results here.
You can check each element of the array for quote_required equal to Yes by index, from 0 up to the length of the array minus 1, generating the index values with a recursive common table expression such as:
WITH recursive cte AS
(
SELECT 0 AS n
UNION ALL
SELECT n + 1 AS value
FROM cte
WHERE cte.n < ( SELECT JSON_LENGTH(submit_report) - 1 FROM `products` )
)
SELECT SUM(JSON_CONTAINS(JSON_EXTRACT(submit_report, CONCAT("$[",n,"].quote_required")),
'"Yes"')) AS count
FROM cte
JOIN `products`
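The index-by-index check can be pictured with a short Python sketch. This is not MySQL code, and both function names are hypothetical. It also shows a distinction worth keeping in mind: summing over array elements counts matching elements, whereas the question asks about records, so a record-level count is included for comparison. The JSON document is the sample from the question.

```python
import json

# One report per table row; this is the sample document from the question.
reports = [
    '[{"id":14,"desc":"Job is incomplete.","quote_required":"Yes"},'
    '{"id":14,"desc":"appointment need to rebook","quote_required":"Yes",'
    '"start-date":"2021-11-20"}]'
]


def count_matching_elements(reports):
    """Counts array elements whose quote_required is "Yes" (element-level sum)."""
    total = 0
    for raw in reports:
        for item in json.loads(raw):  # visit index 0 .. length - 1
            if item.get("quote_required") == "Yes":
                total += 1
    return total


def count_matching_records(reports):
    """Counts rows that contain at least one such element."""
    return sum(
        any(item.get("quote_required") == "Yes" for item in json.loads(raw))
        for raw in reports
    )


print(count_matching_elements(reports), count_matching_records(reports))
```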

PostgreSQL: How to return a subarray dynamically using array slices in postgresql

I need to sum a subarray from an array using postgresql.
I need to create a postgresql query that will dynamically do this as the upper and lower indexes will be different for each array.
These indexes will come from two other columns within the same table.
I had the below query that will get the subarray:
SELECT
SUM(t) AS summed_index_values
FROM
(SELECT UNNEST(int_array_column[34:100]) AS t
FROM array_table
WHERE id = 1) AS t;
...but I then realised I couldn't use variables or SELECT statements when using array slices to make the query dynamic:
int_array_column[SELECT array_index_lower FROM array_table WHERE id = 1; : SELECT array_index_upper FROM array_table WHERE id = 1;]
...does anyone know how I can achieve this query dynamically?
No need for sub-selects, just use the column names:
SELECT SUM(t) AS summed_index_values
FROM (
SELECT UNNEST(int_array_column[tb.array_index_lower:tb.array_index_upper]) AS t
FROM array_table tb
WHERE id = 1
) AS t;
Note that it's not recommended to use set-returning functions (unnest) in the SELECT list. It's better to put that into the FROM clause:
SELECT sum(t.val)
FROM (
SELECT t.val
FROM array_table tb
cross join UNNEST(tb.int_array_column[tb.array_index_lower:tb.array_index_upper]) AS t(val)
WHERE id = 1
) AS t;
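For readers who want to check the expected numbers by hand, here is a minimal Python model of the slice-and-sum. It is a sketch, not PostgreSQL code. It assumes the array uses PostgreSQL's default 1-based subscripts, relies on slice bounds being inclusive on both ends, and assumes lower is at least 1; the function name sum_slice is made up.

```python
def sum_slice(values, lower, upper):
    """Model of sum(unnest(arr[lower:upper])) with 1-based, inclusive bounds."""
    # Python slices are 0-based and exclude the end, hence lower - 1 and upper.
    return sum(values[lower - 1:upper])


# Elements 2 through 4 of the list: 20 + 30 + 40
print(sum_slice([10, 20, 30, 40, 50], 2, 4))
```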

Split array by portions in PostgreSQL

I need to split an array into pairs of adjacent values.
For example I have following array:
select array[1,2,3,4,5]
And I want to get 4 rows with following values:
{1,2}
{2,3}
{3,4}
{4,5}
Can I do it by SQL query?
select a
from (
select array[e, lead(e) over()] as a
from unnest(array[1,2,3,4,5]) u(e)
) a
where not exists (
select 1
from unnest(a) u (e)
where e is null
);
a
-------
{1,2}
{2,3}
{3,4}
{4,5}
One option is to do this with a recursive cte. Starting from the first position in the array and going up to the last.
with recursive cte(a,val,strt,ed,l) as
(select a,a[1:2] as val,1 strt,2 ed,cardinality(a) as l
from t
union all
select a,a[strt+1:ed+1],strt+1,ed+1,l
from cte where ed<l
)
select val from cte
a in the cte is the array.
Another option if you know the max length of the array is to use generate_series to get all numbers from 1 to max length and cross joining the array table on cardinality. Then use lead to get slices of the array and omit the last one (as lead on last row for a given partition would be null).
with nums(n) as (select * from generate_series(1,10))
select a,res
from (select a,t.a[nums.n:lead(nums.n) over(partition by t.a order by nums.n)] as res
from nums
cross join t
where cardinality(t.a)>=nums.n
) tbl
where res is not null
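All three queries produce the same sliding window of adjacent pairs. As a compact reference for the expected result, the same idea in Python looks like this (a sketch only, with a hypothetical function name): pairing the list with itself shifted by one position plays the role of lead(), and the shorter shifted list naturally drops the incomplete last pair.

```python
def adjacent_pairs(values):
    """Returns every pair of neighbouring elements, in order."""
    # zip stops at the shorter input, so the last element gets no partner
    return [[a, b] for a, b in zip(values, values[1:])]


for pair in adjacent_pairs([1, 2, 3, 4, 5]):
    print(pair)
```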