Unnest an array in AWS Redshift - sql

I have a table with a column containing lists, like this:
id
[1,2,3,10]
[1]
[2,3,4,9]
The result I would like to have is a table with the unnested values, like this:
id2
1
2
3
10
1
2
3
4
9
I have tried different solutions that I found on the web (AWS documentation, SO answers, blog posts), but without any luck, because I have a list in the column and not a JSON object.
Any help is appreciated!

Update (2022): Redshift now supports arrays (via the SUPER type) and makes it easy to "unnest" them.
The syntax is simply FROM the_table AS the_table_alias, the_table_alias.the_array AS the_element_alias.
Here's an example with the data mentioned in the question:
WITH
-- some table with test data
input_data AS (
    SELECT array(1,2,3,10) AS id
    UNION ALL
    SELECT array(1) AS id
    UNION ALL
    SELECT array(2,3,4,9) AS id
)
SELECT id2
FROM input_data AS ids,
     ids.id AS id2
Yields the expected:
id2
---
1
2
3
4
9
1
2
3
10
See here for more details and examples with deeper nesting levels: https://docs.aws.amazon.com/redshift/latest/dg/query-super.html
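Deeper nesting works the same way: each additional level is just one more alias in the FROM list. A minimal sketch (the table orders, its SUPER column items, and the nested array tags are hypothetical names):
SELECT o.id, item, tag
FROM orders AS o,
     o.items AS item,
     item.tags AS tag;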

What is the datatype of that column?
Redshift does not support arrays, so let me assume this is a JSON string.
Redshift does not provide JSON set-returning functions: we need to unnest manually. Here is one way to do it, if you have a table with a sufficient number of rows (at least as many rows as there are elements in the array) - say sometable:
select json_extract_array_element_text(t.id, n.rn) as new_id
from mytable t
inner join (
    -- 0-based positions derived from any table with enough rows
    select row_number() over () - 1 as rn
    from sometable
) n
    on n.rn < json_array_length(t.id)
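If no table with enough rows is at hand, a small CTE can stand in for sometable. A minimal sketch covering arrays of up to six elements (extend the union as needed):
with numbers(rn) as (
    select 0 union all select 1 union all select 2
    union all select 3 union all select 4 union all select 5
)
select json_extract_array_element_text(t.id, n.rn) as new_id
from mytable t
join numbers n
    on n.rn < json_array_length(t.id)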

Related

sql how to convert multi select field to rows with totals

I have a table that has a field where the contents are a concatenated list of selections from a multi-select form. I would like to convert the data in this field into another table where each row has the text of a selection and a count of the number of times that selection was made.
eg.
Original table:
id selections
1 A;B
2 B;D
3 A;B;D
4 C
I would like to get the following out:
selection count
A 2
B 3
C 1
D 2
I could easily do this with split and map in JavaScript etc., but I am not sure how to approach it in SQL. (I use PostgreSQL.) The goal is to use the second table to plot a graph in Google Data Studio.
A much simpler solution:
select regexp_split_to_table(selections, ';') as selection, count(*)
from test_table
group by 1
order by 1;
You can use a lateral join and handy set-returning function regexp_split_to_table() to unnest the strings to rows, then aggregate and count:
select x.selection, count(*) cnt
from mytable t
cross join lateral regexp_split_to_table(t.selections, ';') x(selection)
group by x.selection
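On Postgres 14 or later, string_to_table() does the same split without the regex machinery. A minimal sketch:
select x.selection, count(*) as cnt
from mytable t
cross join lateral string_to_table(t.selections, ';') as x(selection)
group by x.selection;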

Extract data dynamically in Amazon Redshift

This is the sample data in the column. I want to extract, dynamically, only the entries whose value is 5.
'{"2113":5,"2112":5,"2114":4,"2511":5}'
The final structure should be 3 rows of keys and values.
I tried the JSON extract functions but that did not help. Thanks.
The final result I want:
key  | value
2113   5
2112   5
2511   5
So, what you need to do is to unnest the JSON object (have one key-value pair per row). Unnesting in Redshift is tricky. One needs a sequence table, and then a CROSS JOIN with a proper filter condition. Usually unnesting is done on an array, and then it's easier, since indices are easy to generate. To unnest a key-value map (a JSON object) one needs to know all the keys up front (Redshift cannot discover them). Your example is lucky, since the keys are integers and their cardinality is relatively low.
This is a sketched-out solution. Please note that you will have to change the way the sequence table is created:
WITH input(json) AS (
    SELECT '{"2113":5,"2112":5,"2114":4,"2511":5}'::varchar
)
, sequence(idx) AS (
    -- instead of the below you should use a sequence table
    SELECT 2113
    UNION ALL SELECT 2112
    UNION ALL SELECT 2114
    UNION ALL SELECT 2511
    UNION ALL SELECT 2512
    UNION ALL SELECT 2513
    UNION ALL SELECT 2514
)
, unnested(key, val) AS (
    SELECT idx::varchar AS key,
           -- Redshift allows referencing the lateral column alias "key" here
           json_extract_path_text(json, key) AS val
    FROM input
    CROSS JOIN sequence
    -- missing keys come back as NULL or '' depending on the Redshift version
    WHERE val IS NOT NULL AND val <> ''
)
SELECT *
FROM unnested
WHERE val = '5'  -- val is varchar, so compare as text
key | val
2113 | 5
2112 | 5
2511 | 5
How to generate a large sequence in Redshift:
...
sequence(idx) AS (
SELECT row_number() OVER ()
FROM arbitrary_table_having_enough_rows
limit 10000
)
...
Another option is to keep a specialized sequence table; here is an idea on how to do it: http://www.silota.com/docs/recipes/redshift-sequential-generate-series-numbers-time.html
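For example, the sequence table can be materialized once and reused (a sketch; any_big_table stands in for any table with at least 10000 rows):
CREATE TABLE seq_0_to_9999 AS
SELECT row_number() OVER () - 1 AS idx
FROM any_big_table
LIMIT 10000;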
Achieved the result using multiple splits:
SELECT DISTINCT
       split_part(split_part(replace(replace(replace(json_field,'{',''),'}',''),'"',''), ',', i), ':', 1) AS key,
       split_part(split_part(replace(replace(replace(json_field,'{',''),'}',''),'"',''), ',', i), ':', 2) AS value
FROM table
JOIN schema.seq_1_to_100 AS numbers
  ON i <= regexp_count(json_field, ':')

Subset large table for use in multiple UNIONs

Suppose I have a table with the following structure:
id  measure_1_actual  measure_1_predicted  measure_2_actual  measure_2_predicted
1   1                 0                    0                 0
2   1                 1                    1                 1
3   .                 .                    0                 0
I want to create the following table, for each ID (shown is an example for id = 1):
measure actual predicted
1 1 0
2 0 0
Here's one way I could solve this problem (I haven't tested this, but you get the general idea, I hope):
SELECT 1 AS measure,
       measure_1_actual AS actual,
       measure_1_predicted AS predicted
FROM tb
WHERE id = 1
UNION
SELECT 2 AS measure,
       measure_2_actual AS actual,
       measure_2_predicted AS predicted
FROM tb
WHERE id = 1
In reality, I have five of these "measures" and tens of millions of people - subsetting such a large table five times for each member does not seem the most efficient way of doing this. This is a real-time API, receiving tens of requests a minute, so I think I'll need a better way of doing this. My other thought was to perhaps create a temp table/view for each member once the request is received, and then UNION based off of that subsetted table.
Does anyone have a more efficient way of doing this?
You can use a lateral join:
select t.id, v.*
from t cross join lateral
(values (1, measure_1_actual, measure_1_predicted),
(2, measure_2_actual, measure_2_predicted)
) v(measure, actual, predicted);
Lateral joins were introduced in Postgres 9.3. You can read about them in the documentation.
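With the five measures from the question this just means five rows in the VALUES list, plus the id filter. A sketch (the measure_3 through measure_5 column names are assumed):
select v.measure, v.actual, v.predicted
from tb cross join lateral
     (values (1, measure_1_actual, measure_1_predicted),
             (2, measure_2_actual, measure_2_predicted),
             (3, measure_3_actual, measure_3_predicted),
             (4, measure_4_actual, measure_4_predicted),
             (5, measure_5_actual, measure_5_predicted)
     ) v(measure, actual, predicted)
where tb.id = 1;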

How to get a unique list from two columns in Entity Framework Core?

I have a table in the database with 2 columns containing user IDs.
Column A
1
2
3
4
5
Column B
4
2
6
1
7
Now I want to get a list/array containing the distinct Ids.
The expected result will be
[1,2,3,4,5,6,7]
Any idea how to do it?
I am looking for an EF Core lambda/LINQ query that will run on the database side, rather than fetching the results into memory and then finding the distinct list, as that would be a costly operation.
You can try this (assuming a DbSet named MyTable for the table in question). EF Core (3.0 and later) translates Union into a SQL UNION, so the de-duplication happens on the database side:
var ids = context.MyTable.Select( i => i.ColumnA )
    .Union( context.MyTable.Select( j => j.ColumnB ) )
    .ToList();
Use union:
select col1
from t
union -- on purpose to remove duplicates
select col2
from t;
You would then read the results of the query into your application.
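If you go the raw-SQL route, here is a sketch of reading the results through EF Core, assuming a keyless entity type IdRow mapped to this result shape (all names here are hypothetical):
var ids = context.Set<IdRow>()
    .FromSqlRaw("SELECT col1 AS Id FROM t UNION SELECT col2 FROM t")
    .Select(r => r.Id)
    .ToList();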
Posting as an answer for further reference:
IList<int> ids = (from rowA in context.MyTable select rowA.ColumnA)
    .Union(from rowB in context.MyTable select rowB.ColumnB)
    .ToList();

Alternative for GROUP BY and STUFF in SQL

I am writing some SQL queries in AWS Athena. I have 3 tables: search, retrieval, and intent. The search table has 2 columns, id and term, i.e.:
id term
1 abc
1 bcd
2 def
1 ghd
What I want is to write a query to get:
id term
1 abc, bcd, ghd
2 def
I know this can be done using STUFF and FOR XML PATH, but Athena does not yet support those SQL Server features. Is there any other way to achieve this? My current query is:
select search.id, STUFF(
       (select ',' + search.term
        from search
        FOR XML PATH('')), 1, 1, '')
from search
group by search.id
Also, I have one more question. My retrieval table consists of 3 columns, i.e.:
id time term
1 0 abc
1 20 bcd
1 100 gfh
2 40 hfg
2 60 lkf
What I want is:
id time term
1 100 gfh
2 60 lkf
I want to write a query that gets the id and term on the basis of the max value of time. Here is my current query:
select retrieval.id, max(retrieval.time), retrieval.term
from retrieval
group by retrieval.id, retrieval.term
order by max(retrieval.time)
I am getting duplicate ids along with the terms. I think it is because I am grouping by both id and term. But I am not sure how to achieve this without the group by.
The XML method is a workaround for limitations in SQL Server. There is no reason to attempt it in any other database.
One method uses arrays:
select s.id, array_agg(s.term)
from search s
group by s.id;
Because the database supports arrays, you should learn to use them. You can convert the array to a string:
select s.id, array_join(array_agg(s.term), ',') as terms
from search s
group by s.id;
GROUP BY is a grouping operation: think of it as clubbing rows together and computing min, max, count, etc. over each group.
I am answering only one question; use the same idea to find the answer to question 1.
For question 2:
select retrieval_2.id, retrieval_2.time, retrieval_2.term
from (select id, max(time) as time
      from retrieval
      group by id
     ) retrieval_1, retrieval as retrieval_2
where retrieval_1.id = retrieval_2.id
  and retrieval_1.time = retrieval_2.time
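An alternative for the second question that avoids the self-join is a window function, which Athena supports. A sketch, assuming the table is named retrieval:
select id, time, term
from (select id, time, term,
             row_number() over (partition by id order by time desc) as rn
      from retrieval
     ) t
where rn = 1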