Extract data dynamically in Amazon Redshift - sql

This is sample data from the column. I want to dynamically extract only the entries associated with the value 5.
'{"2113":5,"2112":5,"2114":4,"2511":5}'
The final structure should be 3 rows of names and values.
I tried the JSON extract functions, but that did not help. Thanks.
The final result I want:
value | key
2113  | 5
2112  | 5
2511  | 5

So, what you need to do is to unnest the JSON object (have a key-value pair per row). Unnesting in Redshift is tricky. One needs a sequence table and then a CROSS JOIN with a proper filter condition. Usually unnesting is done on an array, and then it's easier, since indices are easy to generate. To unnest a key-value map (a JSON object) one needs to know all the keys up front (Redshift cannot discover them). Your example is lucky, since the keys are integers and their cardinality is relatively low.
This is a sketched-out solution. Please note that you will have to change the way the sequence table is created:
WITH input(json) AS (
    SELECT '{"2113":5,"2112":5,"2114":4,"2511":5}'::varchar
)
, sequence(idx) AS (
    -- instead of the below you should use a sequence table
    SELECT 2113
    UNION ALL SELECT 2112
    UNION ALL SELECT 2114
    UNION ALL SELECT 2511
    UNION ALL SELECT 2512
    UNION ALL SELECT 2513
    UNION ALL SELECT 2514
)
, unnested(key, val) AS (
    SELECT idx::varchar AS key,
           json_extract_path_text(json, idx::varchar) AS val
    FROM input
    CROSS JOIN sequence
)
SELECT *
FROM unnested
WHERE val = '5'   -- compare as text; keys missing from the JSON produce no numeric value here
key  | val
-----+----
2113 | 5
2112 | 5
2511 | 5
How to generate a large sequence in Redshift:
...
sequence(idx) AS (
    SELECT row_number() OVER ()
    FROM arbitrary_table_having_enough_rows
    LIMIT 10000
)
...
Another option is to have a specialized sequence table - there's an idea on how to build one here: http://www.silota.com/docs/recipes/redshift-sequential-generate-series-numbers-time.html
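For reference, here is a minimal sketch of materializing such a sequence table once, assuming you have some table with at least 10,000 rows to draw from (seq_0_to_9999 and any_big_table are placeholder names):

CREATE TABLE seq_0_to_9999 AS
SELECT row_number() OVER () - 1 AS idx   -- produces 0 .. 9999
FROM any_big_table                       -- placeholder: any table with >= 10,000 rows
LIMIT 10000;

The sequence CTE in the sketch above then collapses to SELECT idx FROM seq_0_to_9999.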

Achieved the result using multiple splits.
SELECT DISTINCT
       split_part(split_part(replace(replace(replace(json_field, '{', ''), '}', ''), '"', ''), ',', i), ':', 1) AS value,
       split_part(split_part(replace(replace(replace(json_field, '{', ''), '}', ''), '"', ''), ',', i), ':', 2) AS key
FROM table
JOIN schema.seq_1_to_100 AS numbers
  ON i <= regexp_count(json_field, ':')
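For illustration only, here is a hedged, self-contained version of that split approach against the sample string from the question (the aliases follow the question's naming, and the numbers CTE stands in for a real sequence table):

WITH input(json_field) AS (
    SELECT '{"2113":5,"2112":5,"2114":4,"2511":5}'::varchar
),
numbers(i) AS (  -- stand-in for a real sequence table such as schema.seq_1_to_100
    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
)
SELECT split_part(pair, ':', 1) AS value,   -- the id, e.g. 2113
       split_part(pair, ':', 2) AS key      -- the number it maps to, e.g. 5
FROM (
    SELECT split_part(replace(replace(replace(json_field, '{', ''), '}', ''), '"', ''), ',', i) AS pair
    FROM input
    JOIN numbers ON i <= regexp_count(json_field, ':')
) p
WHERE split_part(pair, ':', 2) = '5';

This returns 2113, 2112 and 2511, each paired with 5.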

Related

Unnest an array in AWS Redshift

I have a table with a column of lists like this:
id
[1,2,3,10]
[1]
[2,3,4,9]
The result I would like to have is a table with the unnested values, like this:
id2
1
2
3
10
1
2
3
4
9
I have tried different solutions that I found on the web (AWS documentation, SO solutions, blog posts), but without any luck, because I have a list in the column and not a JSON object.
Any help is appreciated!
Update (2022): Redshift now supports arrays and allows you to "unnest" them easily.
The syntax is simply FROM the_table AS the_table_alias, the_table_alias.the_array AS the_element_alias
Here's an example with the data mentioned in the question:
WITH
-- some table with test data
input_data as (
    SELECT array(1,2,3,10) as id
    union all
    SELECT array(1) as id
    union all
    SELECT array(2,3,4,9) as id
)
SELECT id2
FROM input_data AS ids, ids.id AS id2
Yields the expected:
id2
---
1
2
3
4
9
1
2
3
10
See here for more details and examples with deeper nesting levels: https://docs.aws.amazon.com/redshift/latest/dg/query-super.html
What is the datatype of that column?
Redshift does not support arrays, so let me assume this is a JSON string.
Redshift does not provide JSON set-returning functions: we need to unnest manually. Here is one way to do it, if you have a table with a sufficient number of rows (at least as many rows as there are elements in the array) - say sometable:
select json_extract_array_element_text(t.id, n.rn) as new_id
from mytable t
inner join (select row_number() over() - 1 as rn from sometable) n
on n.rn < json_array_length(t.id)
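As a hedged, self-contained illustration of that approach, with the sample lists from the question inlined (mytable and numbers are stand-ins; in practice numbers would come from row_number() over a sufficiently large table, as discussed above):

with mytable(id) as (
    select '[1,2,3,10]' union all
    select '[1]' union all
    select '[2,3,4,9]'
),
numbers(rn) as (  -- stand-in for row_number() over () - 1 from some large table
    select 0 union all select 1 union all select 2 union all select 3
)
select json_extract_array_element_text(t.id, n.rn) as id2
from mytable t
join numbers n on n.rn < json_array_length(t.id);

This returns the nine unnested values: 1, 2, 3, 10, 1, 2, 3, 4, 9.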

How to get index of an array value in PostgreSQL?

I have a table called pins like this:
id (int) | pin_codes (jsonb)
--------------------------------
1 | [4000, 5000, 6000]
2 | [8500, 8400, 8600]
3 | [2700, 2300, 2980]
Now, I want the row with pin_code 8600 and with its array index. The output must be like this:
pin_codes | index
------------------------------
[8500, 8400, 8600] | 2
If I want the row with pin_code 2700, the output :
pin_codes | index
------------------------------
[2700, 2300, 2980] | 0
What I've tried so far:
SELECT pin_codes FROM pins WHERE pin_codes #> '[8600]'
It only returns the row with wanted value. I don't know how to get the index on the value in the pin_codes array!
Any help would be greatly appreciated.
P.S:
I'm using PostgreSQL 10
If you were storing the array as a real array, not as JSON, you could use array_position() to find the (first) index of a given element:
select array_position(array['one', 'two', 'three'], 'two')
returns 2
With some text mangling you can cast the JSON array into a text array:
select array_position(translate(pin_codes::text,'[]','{}')::text[], '8600')
from the_table;
That also allows you to use the ANY operator:
select *
from pins
where '8600' = any(translate(pin_codes::text,'[]','{}')::text[])
The contains operator #> expects arrays on both sides. You could use it to search for two pin codes at a time:
select *
from pins
where translate(pin_codes::text,'[]','{}')::text[] #> array['8600','8400']
Or use the overlaps operator && to find rows with any of multiple elements:
select *
from pins
where translate(pin_codes::text,'[]','{}')::text[] && array['8600','2700']
would return
id | pin_codes
---+-------------------
2 | [8500, 8400, 8600]
3 | [2700, 2300, 2980]
If you do that a lot, it would be more efficient to store the pin_codes as text[] rather than JSON - then you can also index that column to make those searches more efficient.
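A hedged sketch of that migration, assuming a new column pin_codes_arr (the column and index names here are just illustrative):

ALTER TABLE pins ADD COLUMN pin_codes_arr text[];

UPDATE pins
SET    pin_codes_arr = translate(pin_codes::text, '[]', '{}')::text[];

-- a GIN index supports the contains (@>) and overlaps (&&) operators
CREATE INDEX pins_pin_codes_arr_gin ON pins USING gin (pin_codes_arr);

SELECT * FROM pins WHERE pin_codes_arr @> array['8600'];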
Use the function jsonb_array_elements_text() with WITH ORDINALITY.
with my_table(id, pin_codes) as (
    values
        (1, '[4000, 5000, 6000]'::jsonb),
        (2, '[8500, 8400, 8600]'),
        (3, '[2700, 2300, 2980]')
)
select id, pin_codes, ordinality - 1 as index
from my_table, jsonb_array_elements_text(pin_codes) with ordinality
where value::int = 8600;
id | pin_codes | index
----+--------------------+-------
2 | [8500, 8400, 8600] | 2
(1 row)
As has been pointed out previously, the array_position function is only available in Postgres 9.5 and greater.
Here is a custom function that achieves the same, derived from nathansgreen at github.
-- The array_position function was added in Postgres 9.5.
-- For older versions, you can get the same behavior with this function.
create function array_position(arr ANYARRAY, elem ANYELEMENT, pos INTEGER default 1) returns INTEGER
language sql
as $BODY$
    select row_number::INTEGER
    from (
        select unnest, row_number() over ()
        from (select unnest(arr)) t0
    ) t1
    where row_number >= greatest(1, pos)
      and (case when elem is null then unnest is null else unnest = elem end)
    limit 1;
$BODY$;
So in this specific case, after creating the function the following worked for me.
SELECT
pin_codes,
array_position(pin_codes, 8600) AS index
FROM pins
WHERE array_position(pin_codes, 8600) IS NOT NULL;
Worth bearing in mind that it will only return the index of the first occurrence of 8600; you can use the pos argument to find whichever occurrence you like.
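For instance, a quick hedged illustration of the pos argument (the built-in array_position in 9.5+ behaves the same way):

SELECT array_position(array['a','b','a'], 'a')    AS first_match,  -- returns 1
       array_position(array['a','b','a'], 'a', 2) AS next_match;   -- returns 3; the search starts at index 2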
In short, normalize your data structure, or don't do this in SQL. If you want the index of the sub-data element given your current data structure, then do this in your application code (take the result, cast it to a list/array, get the index).
Try to unnest the string and assign numbers as follows:
with dat as
(
    select 1 id, '8700, 5600, 2300' pins
    union all
    select 2 id, '2300, 1700, 1000' pins
)
select dat.*, t.rn as index
from
(
    select id, t.pins, row_number() over (partition by id) rn
    from
    (
        select id, trim(unnest(string_to_array(pins, ','))) pins from dat
    ) t
) t
join dat on dat.id = t.id and t.pins = '2300'
If you insist on storing arrays, I'd defer to klin's answer.
As an alternative answer, and an extension to my comment... don't store SQL data in arrays. 'Normalize' your data in advance and SQL will handle it significantly better. klin's answer is good, but may suffer in performance as it's outside of what SQL does best.
I'd break the array apart prior to storing it. If the number of pin codes is known, then simply having the table pin_id, pin1, pin2, pin3, pin_etc... is functional.
If the number of pins is unknown, a first table pin that stores the pin_id and any info columns related to that pin ID, and then a second table pin_id, pin_seq, pin_value is also functional (though you may need to pivot this later on to make sense of the data). In this case, select pin_seq where pin_value = 260 would work.
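A hedged sketch of that second layout (the table and column names are just illustrative):

CREATE TABLE pin (
    pin_id int PRIMARY KEY
    -- ... any info columns related to that pin ID
);

CREATE TABLE pin_code (
    pin_id    int REFERENCES pin (pin_id),
    pin_seq   int,   -- position of the code within the original list
    pin_value int,
    PRIMARY KEY (pin_id, pin_seq)
);

-- finding the position of a given code is then a plain lookup
SELECT pin_id, pin_seq
FROM pin_code
WHERE pin_value = 8600;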

PostgreSQL efficiently find last descendant in linear list

I am currently trying to efficiently retrieve the last descendant from a linked-list-like structure.
Essentially there's a table with a data series; with certain criteria I split it up to get a list like this:
current_id | next_id
for example
1 | 2
2 | 3
3 | 4
4 | NULL
42 | 43
43 | 45
45 | NULL
etc...
would result in lists like
1 -> 2 -> 3 -> 4
and
42 -> 43 -> 45
Now I want to get the first and the last id from each of those lists.
This is what I have right now:
WITH RECURSIVE contract(rstart_ts, rend_ts) AS ( -- recursive query to traverse the "linked list" of continuous timestamps
    SELECT start_ts, end_ts FROM track_caps tc
    UNION
    SELECT c.rstart_ts, tc.end_ts AS end_ts0
    FROM contract c
    INNER JOIN track_caps tc
            ON (tc.start_ts = c.rend_ts AND c.rend_ts IS NOT NULL AND tc.end_ts IS NOT NULL)
),
fcontract AS ( -- final step: after traversing the "linked list", pick the largest timestamp found as the end_ts and the smallest as the start_ts
    SELECT DISTINCT ON (start_ts, end_ts) min(rstart_ts) AS start_ts, rend_ts AS end_ts
    FROM (
        SELECT rstart_ts, max(rend_ts) AS rend_ts
        FROM contract
        GROUP BY rstart_ts
    ) sq
    GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts
In this case I just used timestamps which work fine for the given data.
Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on StackOverflow and other sites. The next query removes all the sub-steps and returns what I want, like in the first list example: 1 | 4
Just for illustration, the result set produced by the recursive query looks like this:
1 | 2
2 | 3
3 | 4
1 | 3
2 | 4
1 | 4
As nicely as it works, it's quite a memory hog, however, which is absolutely unsurprising when looking at the results of EXPLAIN ANALYZE.
For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. It was actually supposed to process around 2,000,000 rows, but with the current solution that seems very unfeasible.
Did I just use recursive queries improperly? Is there a way to reduce the amount of data it produces (like removing the sub-steps)?
Or are there better single-query solutions to this problem?
The main problem is that your recursive query doesn't properly filter the root nodes, which is caused by the model you have. So the non-recursive part already selects the entire table, and then Postgres needs to recurse for each and every row of the table.
To make that more efficient, only select the root nodes in the non-recursive part of your query. This can be done using:
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
from track_caps t2
where t2.next_id = t1.current_id)
Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least it makes sure the recursion doesn't need to process more rows than necessary.
To find the root node of each tree, just select that as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.
So you wind up with something like this:
with recursive contract as (
    select t1.current_id, t1.next_id, t1.current_id as root_id
    from track_caps t1
    where not exists (select *
                      from track_caps t2
                      where t2.next_id = t1.current_id)
    union
    select c.current_id, c.next_id, p.root_id
    from track_caps c
    join contract p on c.current_id = p.next_id
                   and c.next_id is not null
)
select *
from contract
order by current_id;
Online example: http://rextester.com/DOABC98823
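If the end goal is just the first and last id of each list, here is a hedged follow-up built on the same CTE (it assumes the same track_caps columns used above; the two WHERE conditions cover normal chains and single-node lists):

with recursive contract as (
    select t1.current_id, t1.next_id, t1.current_id as root_id
    from track_caps t1
    where not exists (select *
                      from track_caps t2
                      where t2.next_id = t1.current_id)
    union
    select c.current_id, c.next_id, p.root_id
    from track_caps c
    join contract p on c.current_id = p.next_id
                   and c.next_id is not null
)
select p.root_id                            as first_id,
       coalesce(t.current_id, p.current_id) as last_id
from contract p
left join track_caps t on t.current_id = p.next_id
where p.next_id is null     -- single-node list: the root is also the last node
   or t.next_id is null     -- normal chain: stop at the row whose successor has no successor
order by first_id;

For the sample data this returns (1, 4) and (42, 45).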

BigQuery full table to partition

I have 340 GB of data in one table (270 days' worth of data). I am now planning to move this data to a partitioned table.
That means I will have 270 partitions. What is the best way to move this data to the partitioned table?
I don't want to run 270 queries, which would be a very costly operation, so I am looking for an optimized solution.
I have multiple tables like this. I need to migrate all these tables to partitioned tables.
Thanks,
I see three options:
Direct Extraction out of original table:
Actions (how many queries to run) = Days [to extract] = 270
Full Scans (how much data scanned, measured in full scans of the original table) = Days = 270
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 270 = $459.00
Hierarchical (recursive) Extraction: (described in Mosha's answer)
Actions = 2^ceil(log2(Days)) - 2 = 2^9 - 2 = 510
Full Scans = 2 x ceil(log2(Days)) = 2 x 9 = 18
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 18 = $30.60
Clustered Extraction: (I will describe it in a sec)
Actions = Days + 1 = 271
Full Scans = [always] 2
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 2 = $3.40
Summary
Method                                 Actions   Total Full Scans   Total Cost
Direct Extraction                      270       270                $459.00
Hierarchical (recursive) Extraction    510       18                 $30.60
Clustered Extraction                   271       2                  $3.40
Definitely, for most practical purposes Mosha's solution is the way to go (I use it in most such cases).
It is relatively simple and straightforward.
Even though you need to run the query 510 times, the query is "relatively" simple and the orchestration logic is simple to implement with whatever client you usually use.
And the cost saving is quite visible!
From $460 down to $31!
Almost 15 times lower!
In case you
a) want to lower the cost even further, by yet another 9 times (so it will be about 135 times lower in total)
b) and like having fun and more challenges
- take a look at the third option
“Clustered Extraction” Explanation
Idea / Goal:
Step 1
We want to transform the original table into another [single] table with 270 columns - one column per day
Each column will hold one serialized row for the respective day from the original table
The total number of rows in this new table will be equal to the number of rows for the most "heavy" day
This will require just one query (see example below) with one full scan
Step 2
After this new table is ready - we will extract day by day, querying ONLY the respective column, and write into the final daily tables (the schema of the daily tables is the very same as the original table's schema, and all those tables can be pre-created)
This will require 270 queries to be run, with scans approximately equivalent (this really depends on how complex your schema is, so it can vary) to one full size of the original table
While querying a column - we will need to de-serialize the row's value and parse it back to the original schema
Very simplified example: (using BigQuery Standard SQL here)
The purpose of this example is just to give direction, if you find the idea interesting
Serialization / de-serialization is extremely simplified to keep the focus on the idea and less on a particular implementation, which can differ from case to case (mostly depending on schema)
So, assume the original table (theTable) looks somewhat like below
SELECT 1 AS id, "101" AS x, 1 AS ts UNION ALL
SELECT 2 AS id, "102" AS x, 1 AS ts UNION ALL
SELECT 3 AS id, "103" AS x, 1 AS ts UNION ALL
SELECT 4 AS id, "104" AS x, 1 AS ts UNION ALL
SELECT 5 AS id, "105" AS x, 1 AS ts UNION ALL
SELECT 6 AS id, "106" AS x, 2 AS ts UNION ALL
SELECT 7 AS id, "107" AS x, 2 AS ts UNION ALL
SELECT 8 AS id, "108" AS x, 2 AS ts UNION ALL
SELECT 9 AS id, "109" AS x, 2 AS ts UNION ALL
SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
SELECT 20 AS id, "120" AS x, 4 AS ts
Step 1 – transform table and write result into tempTable
SELECT
    num,
    MAX(IF(ts = 1, ser, NULL)) AS ts_1,
    MAX(IF(ts = 2, ser, NULL)) AS ts_2,
    MAX(IF(ts = 3, ser, NULL)) AS ts_3,
    MAX(IF(ts = 4, ser, NULL)) AS ts_4
FROM (
    SELECT
        ts,
        CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser,
        ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
    FROM theTable
)
GROUP BY num
tempTable will look like below:
num ts_1 ts_2 ts_3 ts_4
1 1|101|1 6|106|2 10|110|3 19|119|4
2 2|102|1 7|107|2 11|111|3 20|120|4
3 3|103|1 8|108|2 12|112|3 null
4 4|104|1 9|109|2 13|113|3 null
5 5|105|1 null 14|114|3 null
6 null null 15|115|3 null
7 null null 16|116|3 null
8 null null 17|117|3 null
9 null null 18|118|3 null
Here, I am using simple concatenation for serialization
Step 2 - extracting rows for a specific day and writing the output to the respective daily table
Please note: in the example below we are extracting rows for ts = 2; this corresponds to column ts_2
SELECT
    r[OFFSET(0)] AS id,
    r[OFFSET(1)] AS x,
    r[OFFSET(2)] AS ts
FROM (
    SELECT SPLIT(ts_2, "|") AS r
    FROM tempTable
    WHERE NOT ts_2 IS NULL
)
The result will look like below (which is expected):
id x ts
6 106 2
7 107 2
8 108 2
9 109 2
I wish I had more time to write this down, so don't judge too harshly if something is missing - this is more of a directional answer - but at the same time the example is pretty reasonable, and if you have a plain simple schema, almost no extra thinking is required. Of course, with records and nested stuff in the schema, the most challenging part is serialization / de-serialization - but that's where the fun is - along with the extra $ saving
I will add a fourth option to #Mikhail's answer
DML QUERY
Action = 1 query to run
Full scans = 1
Cost = $5 x 0.34 = $1.70 (270 times cheaper than solution #1 \o/)
With the new DML features of BigQuery you can convert a non-partitioned table to a partitioned one while doing only one full scan of the source table
To illustrate my solution I will use one of BQ's public tables, namely bigquery-public-data:hacker_news.comments. Below is the table's schema
name | type | description
_________________________________
id | INTGER | ...
_________________________________
by | STRING | ...
_________________________________
author | STRING | ...
_________________________________
... | |
_________________________________
time_ts | TIMESTAMP | human readable timestamp in UTC YYYY-MM-DD hh:mm:ss /!\ /!\ /!\
_________________________________
... | |
_________________________________
We are going to partition the comments table based on time_ts
#standardSQL
CREATE TABLE my_dataset.comments_partitioned
PARTITION BY DATE(time_ts)
AS
SELECT *
FROM `bigquery-public-data.hacker_news.comments`
I hope it helps :)
If your data was in sharded tables (i.e. with a YYYYmmdd suffix), you could've used the "bq partition" command. But with the data in a single table, you will have to scan it multiple times, applying different WHERE clauses on your partition key column.
The only optimization I can think of is to do it hierarchically, i.e. instead of 270 queries which will do 270 full table scans - first split the table in half, then each half in half, etc. This way you will need to pay for 2*log_2(270) = 2*9 = 18 full scans.
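A hedged sketch of one level of that hierarchical split (the dataset, table and column names, and the midpoint date, are placeholders; each level costs roughly two full-scan equivalents of the source):

#standardSQL
-- level 1: split ~270 days into two halves
CREATE TABLE my_dataset.first_half AS
SELECT * FROM my_dataset.source_table
WHERE partition_key_column < DATE '2017-05-15';

CREATE TABLE my_dataset.second_half AS
SELECT * FROM my_dataset.source_table
WHERE partition_key_column >= DATE '2017-05-15';

-- repeat the same split on each half until every piece covers a single day,
-- then write each single-day piece into its partition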
Once the conversion is done - all the temporary tables can be deleted to eliminate extra storage costs.

Combine elements of array into different array

I need to split text elements in an array and combine the elements (array_agg) by index into different rows
E.g., input is
'{cat$ball$x... , dog$bat$y...}'::text[]
I need to split each element by '$' and the desired output is:
{cat,dog} - row 1
{ball,bat} - row 2
{x,y} - row 3
...
Sorry for not being clear the first time; I have edited my question. I tried similar options but was unable to figure out how to get it with multiple text elements separated by the '$' symbol.
Exactly two parts per array element (original question)
Use unnest(), split_part() and array_agg():
SELECT array_agg(split_part(t, '$', 1)) AS col1
, array_agg(split_part(t, '$', 2)) AS col2
FROM unnest('{cat$ball, dog$bat}'::text[]) t;
Related:
Split comma separated column data into additional columns
General solution (updated question)
For any number of arrays with any number of elements containing any number of parts.
Demo for a table tbl:
CREATE TABLE tbl (tbl_id int PRIMARY KEY, arr text[]);
INSERT INTO tbl VALUES
(1, '{cat1$ball1, dog2$bat2}') -- 2 parts per array element, 2 elements
, (2, '{cat$ball$x, dog$bat$y}') -- 3 parts ...
, (3, '{a1$b1$c1$d1, a2$b2$c2$d2, a3$b3$c3$d3}'); -- 4 parts, 3 elements
Query:
SELECT tbl_id, idx, array_agg(elem ORDER BY ord) AS pivoted_array
FROM tbl t
, unnest(t.arr) WITH ORDINALITY a1(string, ord)
, unnest(string_to_array(a1.string, '$')) WITH ORDINALITY a2(elem, idx)
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
We are looking at two (nested) LATERAL joins here. LATERAL requires Postgres 9.3. Details:
What is the difference between LATERAL and a subquery in PostgreSQL?
WITH ORDINALITY for the first unnest() is up for debate. A simpler query normally works, too. It's just not guaranteed to work according to the SQL standard:
SELECT tbl_id, idx, array_agg(elem) AS pivoted_array
FROM tbl t
, unnest(t.arr) string
, unnest(string_to_array(string, '$')) WITH ORDINALITY a2(elem, idx)
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
Details:
PostgreSQL unnest() with element number
WITH ORDINALITY requires Postgres 9.4 or later. The same back-patched to Postgres 9.3:
SELECT tbl_id, idx, array_agg(arr2[idx]) AS pivoted_array
FROM tbl t
, LATERAL (
SELECT string_to_array(string, '$') AS arr2 -- convert string to array
FROM unnest(t.arr) string -- unnest org. array
) x
, generate_subscripts(arr2, 1) AS idx -- unnest 2nd array with ord. numbers
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
Each query returns:
tbl_id | idx | pivoted_array
--------+-----+---------------
1 | 1 | {cat,dog}
1 | 2 | {bat,ball}
1 | 3 | {y,x}
2 | 1 | {cat2,dog2}
2 | 2 | {ball2,bat2}
3 | 1 | {a3,a1,a2}
3 | 2 | {b1,b2,b3}
3 | 3 | {c2,c1,c3}
3 | 4 | {d2,d3,d1}
SQL Fiddle (still stuck on pg 9.3).
The only requirement for these queries is that the number of parts in elements of the same array is constant. We could even make it work for a varying number of parts using crosstab() with two parameters to fill in NULL values for missing parts, but that's beyond the scope of this question:
PostgreSQL Crosstab Query
A bit messy but you could unnest the array, use regex to separate the text and then aggregate back up again:
with a as (
    select unnest('{cat$ball, dog$bat}'::_text) some_text
), b as (
    select regexp_matches(a.some_text, '(^[a-z]*)\$([a-z]*$)') animal_object
    from a
)
select array_agg(animal_object[1]) animal, array_agg(animal_object[2]) a_object
from b
If you're processing multiple records at once you may want to use something like a row number before the unnest so that you have a group by to aggregate back to an array in your final select statement.
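A hedged sketch of that, assuming a source table src(id int, arr text[]) with one array per row (the table and column names are illustrative):

with a as (
    select id, unnest(arr) as some_text
    from src
), b as (
    select id,
           regexp_matches(a.some_text, '(^[a-z]*)\$([a-z]*$)') animal_object
    from a
)
select id,
       array_agg(animal_object[1]) animal,
       array_agg(animal_object[2]) a_object
from b
group by id
order by id;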