PostgreSQL: efficiently find the last descendant in a linear list

I am currently trying to efficiently retrieve the last descendant from a linked-list-like structure.
Essentially there's a table with a data series; using certain criteria I split it up to get a list like this:
current_id | next_id
for example
1 | 2
2 | 3
3 | 4
4 | NULL
42 | 43
43 | 45
45 | NULL
etc...
would result in lists like
1 -> 2 -> 3 -> 4
and
42 -> 43 -> 45
Now I want to get the first and the last id from each of those lists.
This is what I have right now:
WITH RECURSIVE contract(ruid, rdid, rstart_ts, rend_ts) AS ( -- recursive query to traverse the "linked list" of continuous timestamps
    SELECT start_ts, end_ts FROM track_caps tc
    UNION
    SELECT c.rstart_ts, tc.end_ts AS end_ts0
    FROM contract c
    INNER JOIN track_caps tc
        ON (tc.start_ts = c.rend_ts AND c.rend_ts IS NOT NULL AND tc.end_ts IS NOT NULL)
),
fcontract AS ( -- final step: after traversing the "linked list", pick the largest timestamp found as the end_ts and the smallest as the start_ts
    SELECT DISTINCT ON(start_ts, end_ts) min(rstart_ts) AS start_ts, rend_ts AS end_ts
    FROM (
        SELECT rstart_ts, max(rend_ts) AS rend_ts
        FROM contract
        GROUP BY rstart_ts
    ) sq
    GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts
In this case I just used timestamps which work fine for the given data.
Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on Stack Overflow and other sites. The final query then removes all the sub-steps and returns what I want; for the first example list: 1 | 4
Just for illustration, the produced result set by the recursive query looks like this:
1 | 2
2 | 3
3 | 4
1 | 3
2 | 4
1 | 4
As nicely as it works, however, it's quite a memory hog, which is absolutely unsurprising when looking at the results of EXPLAIN ANALYZE.
For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. It is actually supposed to process around 2,000,000 rows, so the current solution seems very unfeasible.
Did I just use recursive queries improperly? Is there a way to reduce the amount of data they produce (like removing the sub-steps)?
Or are there better single-query solutions to this problem?

The main problem is that your recursive query doesn't properly filter the root nodes, which is caused by the model you have. So the non-recursive part already selects the entire table, and then Postgres needs to recurse for each and every row of the table.
To make this more efficient, select only the root nodes in the non-recursive part of your query. This can be done using:
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
                  from track_caps t2
                  where t2.next_id = t1.current_id)
Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least it makes sure the recursion doesn't need to process more rows than necessary.
To find the root node of each tree, select it as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.
So you wind up with something like this:
with recursive contract as (
    select t1.current_id, t1.next_id, t1.current_id as root_id
    from track_caps t1
    where not exists (select *
                      from track_caps t2
                      where t2.next_id = t1.current_id)
    union
    select c.current_id, c.next_id, p.root_id
    from track_caps c
    join contract p
        on c.current_id = p.next_id
        and c.next_id is not null
)
select *
from contract
order by current_id;
Online example: http://rextester.com/DOABC98823
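Since the root-filtering fix is dialect-agnostic, here is a minimal, self-contained sketch of the same idea using Python's sqlite3 (SQLite also supports WITH RECURSIVE). Table and column names follow the example above; the terminal-row filter is dropped here so that the last node of each chain (the one with next_id NULL) shows up in the result, letting us read off the first and last id per list directly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE track_caps (current_id INTEGER, next_id INTEGER)")
con.executemany(
    "INSERT INTO track_caps VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (4, None), (42, 43), (43, 45), (45, None)],
)

# The non-recursive part selects only root nodes (no row points at them)
# and carries each root along while walking the chain, so the recursion
# visits every node exactly once instead of starting from every row.
rows = con.execute("""
    WITH RECURSIVE contract AS (
        SELECT t1.current_id, t1.next_id, t1.current_id AS root_id
        FROM track_caps t1
        WHERE NOT EXISTS (SELECT * FROM track_caps t2
                          WHERE t2.next_id = t1.current_id)
        UNION
        SELECT c.current_id, c.next_id, p.root_id
        FROM track_caps c
        JOIN contract p ON c.current_id = p.next_id
    )
    SELECT root_id AS first_id, current_id AS last_id
    FROM contract
    WHERE next_id IS NULL
    ORDER BY root_id
""").fetchall()

print(rows)  # [(1, 4), (42, 45)]
```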

Big Query String Manipulation using SubQuery

I would appreciate a push in the right direction on how this might be achieved using GCP BigQuery, please.
I have a column in my table of type string. Inside this string there is a repeating sequence of characters, and I need to extract and process each of them. To illustrate, let's say the column name is 'instruments'. A possible value for instruments could be:
'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false'
In which case I need to extract 'basoon', 'cello' and 'guitar'.
I'm more or less a SQL newbie, sorry. So far I have:
SELECT
bandId,
REGEXP_EXTRACT(instruments, r'inst=.*?\;') AS INSTS
FROM `inventory.band.mytable`;
This extracts the instruments substring ('inst=basoon,inst=cello;inst=guitar;') and gives me an output column 'INSTS', but now I think I need to split the values in that column on the comma and do some further processing. This is where I'm stuck, as I cannot see how to structure additional queries or processing blocks.
How can I reference INSTS in order to do subsequent processing? The documentation suggests I should be building subqueries using WITH, but I can't seem to get anything going. Could some kind soul give me a push in the right direction, please?
BigQuery has a function SPLIT() that does the same as SPLIT_PART() in other databases.
Assuming that you don't alternate between the comma and the semicolon for separating your «key»=«value» pairs, and only use the semicolon:
First, split your instruments string into as many parts as contain inst=. To do that, use an in-line table of consecutive integers to CROSS JOIN with, so that you can call SPLIT(instruments, ';', i) with an increasing integer value for i. You will get strings in the format inst=%, of which you want the part after the equal sign. You get that part by applying another SPLIT(), this time with the equal sign as the delimiter, taking the second split part:
WITH indata(bandid, instruments) AS (
  -- some input, don't use in real query ...
  -- I assume that you don't alternate between comma and semicolon for the delimiter, and stick to semicolon
            SELECT 1, 'band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false'
  UNION ALL SELECT 2, 'band=true;inst=drum;inst=cello;inst=bass;inst=flute;cases=false;permits=true'
  UNION ALL SELECT 3, 'band=false;inst=12string;inst=banjo;inst=triangle;inst=tuba;cases=false;permits=true'
)
-- real query starts here, replace following comma with "WITH" ...
,
-- need a series of consecutive integers ...
i(i) AS (
            SELECT 1
  UNION ALL SELECT 2
  UNION ALL SELECT 3
  UNION ALL SELECT 4
  UNION ALL SELECT 5
  UNION ALL SELECT 6
)
SELECT
  bandid
, i
, SPLIT(SPLIT(instruments, ';', i), '=', 2) AS instrument
FROM indata CROSS JOIN i
WHERE SPLIT(instruments, ';', i) LIKE 'inst=%'
ORDER BY 1
-- out bandid | i | instrument
-- out --------+---+------------
-- out 1 | 2 | basoon
-- out 1 | 3 | cello
-- out 1 | 4 | guitar
-- out 2 | 2 | drum
-- out 2 | 3 | cello
-- out 2 | 4 | bass
-- out 2 | 5 | flute
-- out 3 | 2 | 12string
-- out 3 | 3 | banjo
-- out 3 | 4 | triangle
-- out 3 | 5 | tuba
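The split-with-index mechanics above can be sketched outside SQL as well. A minimal Python analogue of the CROSS JOIN with i = 1..6 plus the LIKE 'inst=%' filter (using the semicolon-only sample value from the answer):

```python
instruments = 'band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false'

# Split on ';' keeping the 1-based index i (the CROSS JOIN counter),
# keep only parts starting with 'inst=', then take the part after '='.
out = [(i, part.split('=', 1)[1])
       for i, part in enumerate(instruments.split(';'), start=1)
       if part.startswith('inst=')]

print(out)  # [(2, 'basoon'), (3, 'cello'), (4, 'guitar')]
```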
Consider the few options below (just to demonstrate different techniques here).
Option 1
select bandId,
  ( select string_agg(split(kv, '=')[offset(1)])
    from unnest(split(instruments, ';')) kv
    where split(kv, '=')[offset(0)] = 'inst'
  ) as insts
from `inventory.band.mytable`
Option 2 (for obvious reason this one would be my choice)
select bandId,
array_to_string(regexp_extract_all(instruments, r'inst=([^;$]+)'), ',') instrs
from `inventory.band.mytable`
If applied to the sample data in your question, the output in both cases is the same.
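For comparison, the REGEXP_EXTRACT_ALL approach maps directly onto Python's re.findall. A small sketch (the pattern here also stops at commas, since the original sample value mixes ',' and ';' as separators):

```python
import re

instruments = 'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false'

# Each match captures the text after 'inst=' up to the next separator.
insts = re.findall(r'inst=([^;,]+)', instruments)
print(insts)  # ['basoon', 'cello', 'guitar']
```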

Extract data dynamically in Amazon Redshift

This is the sample data in the column. I want to extract only the values associated with 5, dynamically.
'{"2113":5,"2112":5,"2114":4,"2511":5}'
The final structure should be 3 rows of keys and values.
I tried the JSON extract functions but that did not help. Thanks.
The final result I want:
key  | value
2113 | 5
2112 | 5
2511 | 5
So, what you need to do is unnest the JSON object (have a key-value pair per row). Unnesting in Redshift is tricky: one needs a sequence table and then a CROSS JOIN with a proper filter condition. Usually unnesting is done on an array, and then it's easier, since indices are easy to generate. To unnest a key-value map (a JSON object) one needs to know all the keys in advance (Redshift cannot enumerate them). Your example is lucky, since the keys are integers and their cardinality is relatively low.
This is a sketched-out solution. Please note that you will have to change the way the sequence table is created:
WITH input(json) AS (
  SELECT '{"2113":5,"2112":5,"2114":4,"2511":5}'::varchar
)
, sequence(idx) AS (
  -- instead of the below you should use a sequence table
            SELECT 2113
  UNION ALL SELECT 2112
  UNION ALL SELECT 2114
  UNION ALL SELECT 2511
  UNION ALL SELECT 2512
  UNION ALL SELECT 2513
  UNION ALL SELECT 2514
)
, unnested(key, val) AS (
  SELECT idx::varchar,
         json_extract_path_text(json, idx::varchar)
  FROM input
  CROSS JOIN sequence
  -- json_extract_path_text returns '' for keys not present in the object
  WHERE json_extract_path_text(json, idx::varchar) <> ''
)
SELECT *
FROM unnested
WHERE val = '5'
key | val
2113 | 5
2112 | 5
2511 | 5
How to generate a large sequence in Redshift:
...
sequence(idx) AS (
  SELECT row_number() OVER ()
  FROM arbitrary_table_having_enough_rows
  LIMIT 10000
)
...
Another option is to have a specialized sequence table; here is an idea on how to do it: http://www.silota.com/docs/recipes/redshift-sequential-generate-series-numbers-time.html
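To make the target result concrete: outside SQL, the same unnesting is a one-liner. A Python sketch of what the sequence-table trick above is emulating (one key-value pair per row, keeping only values equal to 5):

```python
import json

raw = '{"2113":5,"2112":5,"2114":4,"2511":5}'

# Unnest the object into (key, value) rows, then filter on the value.
pairs = [(key, val) for key, val in json.loads(raw).items() if val == 5]
print(pairs)  # [('2113', 5), ('2112', 5), ('2511', 5)]
```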
Achieved the result using multiple splits:
SELECT DISTINCT
       split_part(split_part(replace(replace(replace(json_field, '{', ''), '}', ''), '"', ''), ',', i), ':', 1) AS key,
       split_part(split_part(replace(replace(replace(json_field, '{', ''), '}', ''), '"', ''), ',', i), ':', 2) AS value
FROM table
JOIN schema.seq_1_to_100 AS numbers
  ON i <= regexp_count(json_field, ':')

WHERE clause with = sign matches multiple records while just one record is expected

I have a simple inline view that contains 2 columns.
-----------------
rn | val
-----------------
0 | A
... | ...
25 | Z
I am trying to select a val by matching rn randomly, using the dbms_random.value() function, as in:
with d (rn, val) as
(
select level-1, chr(64+level) from dual connect by level <= 26
)
select * from d
where rn = floor(dbms_random.value()*25)
;
My expectation is that it should return exactly one row, without failing.
But every now and then I get multiple rows returned, or no rows at all.
On the other hand,
select floor(dbms_random.value()*25) from dual connect by level < 1000
returns a whole number for each row, and I fail to see any abnormality.
What am I missing here?
The problem is that the random value is recalculated for each row. So you might get two rows whose rn happens to match their freshly drawn random value -- or go through all the rows and never get a hit.
One way to get around this is:
select d.*
from (select d.*
      from d
      order by dbms_random.value()
     ) d
where rownum = 1;
There are more efficient ways to calculate a random number, but this is intended to be a simple modification to your existing query.
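Both the pitfall and the fix translate directly outside SQL. A small Python sketch, with a list standing in for the 26-row inline view:

```python
import random

vals = [chr(65 + i) for i in range(26)]  # 'A'..'Z'; rn is the index

# Failing pattern: a fresh random value is drawn per row, so the
# predicate can match 0, 1, or many rows.
matches = [v for rn, v in enumerate(vals) if rn == random.randrange(26)]

# Fix: draw the random value once, then match it against the rows.
target = random.randrange(26)
picked = [v for rn, v in enumerate(vals) if rn == target]

assert len(picked) == 1  # always exactly one row
```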
You also might want to ask another question. This one starts with a description of a table that is not used, and then asks about a query that doesn't use the table. Ask another question describing the table and the real problem you are having, along with sample data and desired results.

BigQuery full table to partition

I have 340 GB of data in one table (270 days' worth of data). I am now planning to move this data to a partitioned table,
which means I will have 270 partitions. What is the best way to move the data to the partitioned table?
I don't want to run 270 queries, which would be a very costly operation, so I'm looking for an optimized solution.
I have multiple tables like this and need to migrate all of them to partitioned tables.
Thanks,
I see three options
Direct extraction out of the original table:
Actions (how many queries to run) = Days [to extract] = 270
Full scans (how much data is scanned, measured in full scans of the original table) = Days = 270
Cost = $5 x Table Size (TB) x Full Scans = $5 x 0.34 x 270 = $459.00
Hierarchical (recursive) extraction (described in Mosha's answer):
Actions = 2^log2(Days) - 2 = 510
Full scans = 2 x log2(Days) = 18
Cost = $5 x Table Size (TB) x Full Scans = $5 x 0.34 x 18 = $30.60
Clustered extraction (I will describe it in a sec):
Actions = Days + 1 = 271
Full scans = [always] 2
Cost = $5 x Table Size (TB) x Full Scans = $5 x 0.34 x 2 = $3.40
Summary:
Method                               | Actions | Total Full Scans | Total Cost
Direct Extraction                    | 270     | 270              | $459.00
Hierarchical (recursive) Extraction  | 510     | 18               | $30.60
Clustered Extraction                 | 271     | 2                | $3.40
Definitely, for most practical purposes Mosha's solution is the way to go (I use it in most such cases).
It is relatively simple and straightforward.
Even though you need to run the query 510 times, the query is "relatively" simple and the orchestration logic is easy to implement with whatever client you usually use.
And the cost saving is quite visible:
from $460 down to $31!
Almost 15 times lower!
If you
a) want to lower the cost even further, by yet another factor of 9 (so it will be a total of 135 times lower),
b) and like having fun and taking on more challenges,
take a look at the third option.
“Clustered Extraction” Explanation
Idea / Goal:
Step 1
We want to transform the original table into another [single] table with 270 columns, one column for each day.
Each column will hold one serialized row for the respective day from the original table.
The total number of rows in this new table will be equal to the number of rows for the most "heavy" day.
This requires just one query (see the example below) with one full scan.
Step 2
After this new table is ready, we extract day by day, querying ONLY the respective column, and write into the final daily tables (the schema of the daily tables is the very same as the original table's schema, and all those tables can be pre-created).
This requires 270 queries, with scans approximately equivalent (this really depends on how complex your schema is, so it can vary) to one full scan of the original table.
While querying a column, we need to de-serialize the row's value and parse it back to the original schema.
A very simplified example (using BigQuery Standard SQL here).
The purpose of this example is just to give direction, in case you find the idea interesting.
Serialization / de-serialization is extremely simplified to keep the focus on the idea rather than on a particular implementation, which can differ from case to case (it mostly depends on the schema).
So, assume the original table (theTable) looks somewhat like below:
SELECT 1 AS id, "101" AS x, 1 AS ts UNION ALL
SELECT 2 AS id, "102" AS x, 1 AS ts UNION ALL
SELECT 3 AS id, "103" AS x, 1 AS ts UNION ALL
SELECT 4 AS id, "104" AS x, 1 AS ts UNION ALL
SELECT 5 AS id, "105" AS x, 1 AS ts UNION ALL
SELECT 6 AS id, "106" AS x, 2 AS ts UNION ALL
SELECT 7 AS id, "107" AS x, 2 AS ts UNION ALL
SELECT 8 AS id, "108" AS x, 2 AS ts UNION ALL
SELECT 9 AS id, "109" AS x, 2 AS ts UNION ALL
SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
SELECT 20 AS id, "120" AS x, 4 AS ts
Step 1 – transform the table and write the result into tempTable:
SELECT
  num,
  MAX(IF(ts=1, ser, NULL)) AS ts_1,
  MAX(IF(ts=2, ser, NULL)) AS ts_2,
  MAX(IF(ts=3, ser, NULL)) AS ts_3,
  MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
  SELECT
    ts,
    CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser,
    ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
  FROM theTable
)
GROUP BY num
tempTable will look like below:
num ts_1 ts_2 ts_3 ts_4
1 1|101|1 6|106|2 10|110|3 19|119|4
2 2|102|1 7|107|2 11|111|3 20|120|4
3 3|103|1 8|108|2 12|112|3 null
4 4|104|1 9|109|2 13|113|3 null
5 5|105|1 null 14|114|3 null
6 null null 15|115|3 null
7 null null 16|116|3 null
8 null null 17|117|3 null
9 null null 18|118|3 null
Here, I am using simple concatenation for serialization
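The serialization round-trip is easy to sanity-check on its own. A minimal Python sketch of the CONCAT / SPLIT pair used above (same '|' delimiter, same (id, x, ts) shape as the sample table):

```python
rows = [(6, "106", 2), (7, "107", 2)]  # (id, x, ts) as in the sample table

# Step 1 analogue: CONCAT(CAST(id AS STRING), '|', x, '|', CAST(ts AS STRING))
serialized = ["|".join(str(field) for field in row) for row in rows]

# Step 2 analogue: SPLIT(value, '|'), then cast each field back to its type.
restored = [(int(i), x, int(ts))
            for i, x, ts in (s.split("|") for s in serialized)]

print(restored == rows)  # True
```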
Step 2 – extract rows for a specific day and write the output to the respective daily table.
Please note: in the example below we extract rows for ts = 2; this corresponds to column ts_2.
SELECT
  r[OFFSET(0)] AS id,
  r[OFFSET(1)] AS x,
  r[OFFSET(2)] AS ts
FROM (
  SELECT SPLIT(ts_2, "|") AS r
  FROM tempTable
  WHERE NOT ts_2 IS NULL
)
The result will look like below (which is expected):
id x ts
6 106 2
7 107 2
8 108 2
9 109 2
I wish I had more time to write this down, so don't judge too harshly if something is missing. This is more of a directional answer, but at the same time the example is pretty reasonable, and if you have a plain simple schema, almost no extra thinking is required. Of course, with records and nested stuff in the schema, the most challenging part is serialization / de-serialization, but that's where the fun is, along with the extra $ savings.
I will add a fourth option to #Mikhail's answer.
DML query
Actions = 1 query to run
Full scans = 1
Cost = $5 x 0.34 = $1.70 (270 times cheaper than solution #1 \o/)
With the new DML features of BigQuery you can convert a non-partitioned table into a partitioned one while doing only one full scan of the source table.
To illustrate my solution I will use one of BQ's public tables, namely bigquery-public-data:hacker_news.comments. Below is the table's schema:
name    | type      | description
--------+-----------+------------------------------------------------------
id      | INTEGER   | ...
by      | STRING    | ...
author  | STRING    | ...
...     |           |
time_ts | TIMESTAMP | human-readable timestamp in UTC, YYYY-MM-DD hh:mm:ss
...     |           |
We are going to partition the comments table based on time_ts
#standardSQL
CREATE TABLE my_dataset.comments_partitioned
PARTITION BY DATE(time_ts)
AS
SELECT *
FROM `bigquery-public-data.hacker_news.comments`
I hope it helps :)
If your data was in sharded tables (i.e. with a YYYYmmdd suffix), you could have used the "bq partition" command. But with data in a single table, you will have to scan it multiple times, applying a different WHERE clause on your partition key column each time.
The only optimization I can think of is to do it hierarchically: instead of 270 queries doing 270 full table scans, first split the table in half, then each half in half, etc. This way you will need to pay for 2*log_2(270) = 2*9 = 18 full scans.
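The 18-full-scan figure follows from the halving: every level of the binary split rescans all the data once per half, i.e. twice per level, and it takes ceil(log2(days)) levels to get down to single days. A quick Python check of that arithmetic:

```python
import math

def full_scans(days):
    # Each halving level scans every row exactly twice (once per half);
    # ceil(log2(days)) levels are needed to reach single-day tables.
    return 2 * math.ceil(math.log2(days))

print(full_scans(270))  # 18
```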
Once the conversion is done - all the temporary tables can be deleted to eliminate extra storage costs.

SQL Query to remove cyclic redundancy

I have a table that looks like this:
Column A | Column B | Counter
---------------------------------------------
A | B | 53
B | C | 23
A | D | 11
C | B | 22
I need to remove the last row because it is cyclic with the second row. I can't seem to figure out how to do it.
EDIT
There is an indexed date field. This is for a Sankey diagram. The data in the sample table is actually the result of a query. The underlying table has:
date | source node | target node | path count
The query to build the table is:
SELECT source_node, target_node, COUNT(1)
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
GROUP BY source_node, target_node
In the sample, the last row (C to B) is going backwards and I need to ignore it or the Sankey won't display. I need to show only the forward path.
Removing all edges from your graph where the tuple (source_node, target_node) is not ordered alphabetically and the symmetric row exists should give you what you want:
DELETE
FROM sankey_table t1
WHERE source_node > target_node
AND EXISTS (
    SELECT NULL FROM sankey_table t2
    WHERE t2.source_node = t1.target_node
    AND t2.target_node = t1.source_node)
If you don't want to DELETE them, just use this WHERE clause in your query for generating the input for the diagram.
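The same rule (drop an edge only when it is not alphabetically ordered and its mirror edge exists) can be sketched in Python, with the sample table as a dict of (source, target) to count:

```python
edges = {("A", "B"): 53, ("B", "C"): 23, ("A", "D"): 11, ("C", "B"): 22}

# Keep an edge unless it is "backwards" (source > target) AND the
# symmetric edge is also present in the table.
kept = {edge: count for edge, count in edges.items()
        if not (edge[0] > edge[1] and (edge[1], edge[0]) in edges)}

print(kept)  # ('C', 'B') is dropped; ('B', 'C') survives
```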
If you can adjust how your table is populated, you can change the query you're using to retrieve only the values for the first direction (for that date) in the first place, with a little bit of analytic manipulation:
SELECT source_node, target_node, counter FROM (
SELECT source_node,
target_node,
COUNT(*) OVER (PARTITION BY source_node, target_node) AS counter,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
)
WHERE rnk = 1;
The inner query gets the same data you collect now but adds a ranking column, which will be 1 for the first row for any source/target pair in any order for a given day. The outer query then just ignores everything else.
This might be a candidate for a materialised view if you're truncating and repopulating it daily.
If you can't change your intermediate table but can still see the underlying table you could join back to it using the same kind of idea; assuming the table you're querying from is called sankey_agg_table:
SELECT sat.source_node, sat.target_node, sat.counter
FROM sankey_agg_table sat
JOIN (SELECT source_node, target_node,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table) st
ON st.source_node = sat.source_node
AND st.target_node = sat.target_node
AND st.rnk = 1;
SQL Fiddle demos.
Given that these are all your rows:
DELETE FROM yourTable
WHERE [Column A] = 'C'
EDIT
I would recommend that you clean up your source data if you can, i.e. delete the rows that you call backwards, if those rows are incorrect as you state in your comments.