Redshift regex: get multiple matches and expand rows

I'm working on the URL extraction on AWS Redshift. The URL column looks like this:
url                      item    origin
http://B123//ajdsb       apple   US
http://BYHG//B123        banana  UK
http://B325//BF89//BY85  candy   CA
I want to extract every series that starts with B, and to expand the result into one row per series when a URL contains more than one.
extracted  item    origin
B123       apple   US
BYHG       banana  UK
B123       banana  UK
B325       candy   CA
BF89       candy   CA
BY85       candy   CA
My current code is:
select REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})') as extracted, item, origin
from data
The regex part works well, but I have problems extracting multiple values and expanding them into new rows. I tried REGEXP_MATCHES(url, '(B[0-9A-Z]{3})', 'g'), but the function regexp_matches does not exist on Redshift.

The solution I use is fairly ugly but achieves the desired result. It uses REGEXP_COUNT to find how many matches each row has, then joins the resulting table of numbers to a query that calls REGEXP_SUBSTR.
-- Get a table of the distinct match counts found in the data
-- e.g. if the rows have 0, 1, 2 and 3 matches this query returns 0, 1, 2, 3
-- (a count that no row has, e.g. 4 when the only longer row has 5 matches, will be missing)
WITH n_table AS (
SELECT
DISTINCT REGEXP_COUNT(url, '(B[0-9A-Z]{3})') AS n
FROM data
)
-- Join the previous table to the data table and use n in the REGEXP_SUBSTR call to get the nth match
SELECT
REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})', 1, n) AS extracted,
item,
origin
FROM data,
n_table
-- Only keep non-null matches
WHERE n > 0
AND REGEXP_COUNT(url, '(B[0-9A-Z]{3})') >= n

IronFarm's answer inspired me, though I wanted to find a solution that didn't require a cross join. Here's what I came up with:
with
-- raw data
src as (
select
1 as id,
'abc def ghi' as stuff
union all
select
2 as id,
'qwe rty' as stuff
),
-- for each id, get a series of indexes for
-- each match in the string
match_idxs as (
select
id,
generate_series(1, regexp_count(stuff, '[a-z]{3}')) as idx
from
src
)
select
src.id,
match_idxs.idx,
regexp_substr(src.stuff, '[a-z]{3}', 1, match_idxs.idx) as stuff_match
from
src
join match_idxs using (id)
order by
id, idx
;
This yields:
id | idx | stuff_match
----+-----+-------------
1 | 1 | abc
1 | 2 | def
1 | 3 | ghi
2 | 1 | qwe
2 | 2 | rty
(5 rows)
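For readers who want to check the expected output outside the database, the same count-then-take-the-nth-match idea can be sketched in Python. This is only an illustration of the logic, not Redshift code; the function name expand_matches and the in-memory rows list are made up for the example, and the pattern and data come from the question.

```python
import re

# Pattern from the question: "B" followed by three digits or uppercase letters.
PATTERN = re.compile(r"B[0-9A-Z]{3}")


def expand_matches(rows):
    """Yield one (match, item, origin) tuple per regex match in each URL."""
    for url, item, origin in rows:
        # Counterpart of REGEXP_COUNT: all non-overlapping matches in the URL.
        matches = PATTERN.findall(url)
        # Counterpart of joining to idx = 1..count and calling
        # REGEXP_SUBSTR(url, pattern, 1, idx) for each idx.
        for idx in range(1, len(matches) + 1):
            yield matches[idx - 1], item, origin


rows = [
    ("http://B123//ajdsb", "apple", "US"),
    ("http://BYHG//B123", "banana", "UK"),
    ("http://B325//BF89//BY85", "candy", "CA"),
]
result = list(expand_matches(rows))
for line in result:
    print(line)
```

Rows without any match simply produce no output, which mirrors the inner join on the index series.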

Related

Big Query String Manipulation using SubQuery

I would appreciate a push in the right direction with how this might be achieved using GCP Big Query, please.
I have a column in my table of type string. Inside this string there is a repeating sequence of characters, and I need to extract and process each occurrence. To illustrate, let's say the column name is 'instruments'. A possible value for instruments could be:
'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false'
In which case I need to extract 'basoon', 'cello' and 'guitar'.
I'm more or less a SQL newbie, sorry. So far I have:
SELECT
bandId,
REGEXP_EXTRACT(instruments, r'inst=.*?\;') AS INSTS
FROM `inventory.band.mytable`;
This extracts the instruments substring ('inst=basoon,inst=cello;inst=guitar;') and gives me an output column 'INSTS' but now I think I need to split the values in that column on the comma and do some further processing. This is where I'm stuck as I cannot see how to structure additional queries or processing blocks.
How can I reference INSTS in order to do subsequent processing? The documentation suggests I should be building subqueries using WITH, but I can't seem to get anything going. Could some kind soul give me a push in the right direction, please?
BigQuery's SPLIT() returns an array of parts; it does not take a part index the way SPLIT_PART() does in other databases. The three-argument SPLIT(string, delimiter, i) calls below follow SPLIT_PART() semantics; in BigQuery the i-th part would be written SPLIT(string, delimiter)[SAFE_OFFSET(i - 1)].
Assuming that you don't alternate between the comma and the semicolon for separating your «key»=«value» pairs, and only use the semicolon,
first split your instruments string into its parts and keep those that contain inst=. To do that, use an in-line table of consecutive integers to CROSS JOIN with, so that you can take part i of the string, split on ';', for an increasing integer value of i. You will get strings in the format inst=%, of which you want the part after the equal sign. You get that part by applying another split, this time with the equal sign as the delimiter, and taking the second part:
WITH indata(bandid,instruments) AS (
-- some input, don't use in real query ...
-- I assume that you don't alternate between comma and semicolon for the delimiter, and stick to semicolon
SELECT
1,'band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false'
UNION ALL
SELECT
2,'band=true;inst=drum;inst=cello;inst=bass;inst=flute;cases=false;permits=true'
UNION ALL
SELECT
3,'band=false;inst=12string;inst=banjo;inst=triangle;inst=tuba;cases=false;permits=true'
)
-- real query starts here, replace following comma with "WITH" ...
,
-- need a series of consecutive integers ...
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
)
SELECT
bandid
, i
, SPLIT(SPLIT(instruments,';',i),'=',2) AS instrument
FROM indata CROSS JOIN i
WHERE SPLIT(instruments,';',i) like 'inst=%'
ORDER BY 1
-- out bandid | i | instrument
-- out --------+---+------------
-- out 1 | 2 | basoon
-- out 1 | 3 | cello
-- out 1 | 4 | guitar
-- out 2 | 2 | drum
-- out 2 | 3 | cello
-- out 2 | 4 | bass
-- out 2 | 5 | flute
-- out 3 | 2 | 12string
-- out 3 | 3 | banjo
-- out 3 | 4 | triangle
-- out 3 | 5 | tuba
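The split-twice logic above can be sketched in Python to make the two steps explicit. This is an illustration only, assuming semicolon-delimited key=value pairs as in the answer; the function name instruments is made up for the example.

```python
def instruments(s):
    """Return the values of every inst=... pair in a ';'-delimited string."""
    found = []
    # First split: break the string into key=value parts on ';'.
    for part in s.split(";"):
        # Second split: separate key and value at the first '='.
        key, sep, value = part.partition("=")
        # Keep only well-formed pairs whose key is exactly 'inst'.
        if sep and key == "inst":
            found.append(value)
    return found


sample = "band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false"
print(instruments(sample))
```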
Consider the few options below (just to demonstrate different techniques here)
Option 1
select bandId,
( select string_agg(split(kv, '=')[offset(1)])
from unnest(split(instruments, ';')) kv
where split(kv, '=')[offset(0)] = 'inst'
) as insts
from `inventory.band.mytable`
Option 2 (for obvious reason this one would be my choice)
select bandId,
array_to_string(regexp_extract_all(instruments, r'inst=([^;$]+)'), ',') instrs
from `inventory.band.mytable`
If applied to sample data in your question - output in both cases is

Group data series into variable width windows based on first event

I have a computational task which can be reduced to the following problem:
I have a large set of pairs of integers (key, val) which I want to group into windows. The first window starts with the first pair p ordered by key attribute and spans all the pairs where p[i].key belongs to [p[0].key; p[0].key + N), with some arbitrary integer N, positive and common to all windows.
The next window starts with the first pair ordered by key not included in the previous windows and again spans all the pairs from its key to key + N, and so on for the following windows.
The last step is to sum second attribute for each window and display it together with the first key of the window.
For example, given list of records with values:
key  val
1    3
2    7
5    1
6    4
7    1
10   3
13   5
and N=3, the windows would be:
{(1,3),(2,7)},
{(5,1),(6,4),(7,1)},
{(10,3)},
{(13,5)}
The final result:
key  sum_of_values
1    10
5    6
10   3
13   5
This is easy to program with a standard programming language but I have no clue how to solve this with SQL.
Note: If ClickHouse doesn't support the RECURSIVE keyword, just remove that keyword from the expression.
ClickHouse seems to use non-standard syntax for the WITH clause. The query below uses standard SQL. Adjust as needed.
Sorry, ClickHouse may not support this approach. If not, we would need to find another method of walking through the data.
Standard SQL:
There are a few ways. Here's one approach. First assign row numbers to allow recursively stepping through the rows. We could use LEAD as well.
Assign a group (key value) to each row based on the current key and the last group/key value and whether they are within some distance (N = 3, in this case).
The last step is to just SUM these values per group start_key and to use the start_key value as the starting key in each group.
WITH RECURSIVE nrows (xkey, val, n) AS (
SELECT xkey, val, ROW_NUMBER() OVER (ORDER BY xkey) FROM test
)
, cte (xkey, val, n, start_key) AS (
SELECT xkey, val, n, xkey FROM nrows WHERE n = 1
UNION ALL
SELECT t1.xkey, t1.val, t1.n
, CASE WHEN t1.xkey <= t2.start_key + (3-1) THEN t2.start_key ELSE t1.xkey END
FROM nrows AS t1
JOIN cte AS t2
ON t2.n = t1.n-1
)
SELECT start_key
, SUM(val) AS sum_values
FROM cte
GROUP BY start_key
ORDER BY start_key
;
Result:
+-----------+------------+
| start_key | sum_values |
+-----------+------------+
| 1 | 10 |
| 5 | 6 |
| 10 | 3 |
| 13 | 5 |
+-----------+------------+
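As a cross-check of the expected result, the windowing rule from the question can be written procedurally. The following Python sketch is not a ClickHouse solution, only a reference implementation of the rule; the function name window_sums is made up for the example.

```python
def window_sums(pairs, n):
    """Group (key, val) pairs into windows [start, start + n) and sum val per window."""
    result = []
    start = None
    # Walk the pairs in key order, as the recursive CTE does via ROW_NUMBER().
    for key, val in sorted(pairs, key=lambda p: p[0]):
        # A key outside the current window opens a new window starting at that key.
        if start is None or key >= start + n:
            start = key
            result.append([start, 0])
        result[-1][1] += val
    return [tuple(window) for window in result]


pairs = [(1, 3), (2, 7), (5, 1), (6, 4), (7, 1), (10, 3), (13, 5)]
print(window_sums(pairs, 3))
```

The condition key >= start + n is the same test as xkey <= start_key + (3-1) in the SQL, written from the other side.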

How can I get the dates from a text string?

I use Vertica SQL and have a field "Note" that is a free text field (no consistent way to enter data). I'd like to create another field with only dates or extract the last date in the field.
E.g
"1st order on 3/2/21, second 5/5/21" -> "3/2/21 5/5/21" or "5/5/21"
"first delivery 2/2/21 second one 8/30/21" -> "2/2/21 8/30/21" or "8/30/21"
"reported 1st: 2/2/21." -> "2/2/21"
Thanks!
You can use REGEXP_SUBSTR() to grab the patterns: one or more digits; slash; one or more digits; slash; one or more digits.
If you have more than one of those patterns, then, create one row as output for each pattern found. For that, CROSS JOIN with a consecutive series of integers, so you can output the n-th occurrence of the pattern. Then, cast the found string as DATE.
Finally, and only if you need just the last date, apply Vertica's analytic LIMIT clause to output only the highest i value for the respective id (which I had to add) of the result table.
WITH
-- need a sequence of integers ...
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
)
,
indata(id,s) AS (
SELECT 1,'1st order on 3/2/21, second 5/5/21'
UNION ALL SELECT 2,'first delivery 2/2/21 second one 8/30/21'
UNION ALL SELECT 3,'reported 1st: 2/2/21.'
)
SELECT
id
, i
, s
, REGEXP_SUBSTR(s,'\d+/\d+/\d+',1,i) AS found_token
, REGEXP_SUBSTR(s,'\d+/\d+/\d+',1,i)::DATE AS found_date
FROM indata CROSS JOIN i
WHERE REGEXP_SUBSTR(s,'(\d+/\d+/\d+)',1,i,'',1) <>''
-- remove the following line if you want all dates from all strings
-- and keep it if you only want the last date in the string
LIMIT 1 OVER(PARTITION BY id ORDER BY i DESC)
;
id | i | s | found_token | found_date
----+---+------------------------------------------+-------------+------------
1 | 2 | 1st order on 3/2/21, second 5/5/21 | 5/5/21 | 2021-05-05
2 | 2 | first delivery 2/2/21 second one 8/30/21 | 8/30/21 | 2021-08-30
3 | 1 | reported 1st: 2/2/21. | 2/2/21 | 2021-02-02
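The same pattern can be tried outside the database. The Python sketch below uses the regular expression from the query and assumes, as the sample notes suggest, that dates are written month/day/two-digit year; the function name dates_in is made up for the example.

```python
import re
from datetime import datetime

# Same pattern as in the query: digits, slash, digits, slash, digits.
DATE_RE = re.compile(r"\d+/\d+/\d+")


def dates_in(note):
    """Return every date found in the note, in order of appearance."""
    # Assumption: month/day/two-digit-year, e.g. 8/30/21.
    return [datetime.strptime(tok, "%m/%d/%y").date() for tok in DATE_RE.findall(note)]


notes = [
    "1st order on 3/2/21, second 5/5/21",
    "first delivery 2/2/21 second one 8/30/21",
    "reported 1st: 2/2/21.",
]
for note in notes:
    found = dates_in(note)
    # The last element corresponds to the row kept by LIMIT 1 OVER(... ORDER BY i DESC).
    print(found[-1] if found else None)
```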
Consistency is critical when parsing string data. If the note always ends with a date preceded by a space, pulling the last date should be fairly simple. Consider:
Trim(Mid(Note, InStrRev(Note, " ")))

Finding a value in multiple columns in Oracle table

I have a table like below
ID NUMBER 1 NUMBER 2 NUMBER 3 LOC
1-14H-4950 0616167 4233243 CA
A-522355 1234567 TN
A-522357 9876543 WY
A-522371 1112223 WA
A-522423 1234567 2345678 1234567 NJ
A-A-522427 9876543 6249853 6249853 NJ
and I have a set of values (1234567, 9876543, 0616167, 1112223, 999999, etc.) that will be used in the WHERE clause. If a value is found in one of the three number columns (NUMBER 1, NUMBER 2, or NUMBER 3), the row should be written to Output 1 (like Excel's VLOOKUP).
If the value is found in more than one of the three columns, the row goes to Output 2 with the flag Multiple Match. If the value is not found in any of the three columns, it should appear in Output 2 with the flag No Match. I tried using a self join and OR clauses, but could not get what I want.
I want to write the SQL to generate both outputs. Outputs will include all the columns from the above table. For eg:
Output 1 from above sample data will look like
ID NUMBER 1 NUMBER 2 NUMBER 3 LOC
1-14H-4950 0616167 4233243 CA
A-522371 1112223 WA
Output 2 will be like:
ID NUMBER 1 NUMBER 2 NUMBER 3 LOC Flag
A-522423 1234567 2345678 1234567 NJ Multiple Match
A-A-522427 9876543 6249853 6249853 NJ Multiple Match
1234 No Match
I want to write the SQL to generate both outputs.
One SELECT operator cannot produce two output sets.
The main question is: why split the output when the only difference is in the FLAG column? If you really need two different outputs, you can do this:
(Preferably) create a single cursor for the query, in which the FLAG column is calculated, and split the output into separate screens in the UI.
drop table test_dt;
create table test_dt as
select '1-14h-4950' id,null num1,616167 num2,4233243 num3,'ca' loc from dual
union all
select 'a-522355',null ,1234567,null,'tn' from dual union all
select 'a-522357',null ,9876543,null,'wy' from dual union all
select 'a-522371',null ,1112223,null,'wa' from dual union all
select 'a-522423',1234567,2345678,1234567,'nj' from dual union all
select 'a3-522423',null,null,null,'nj' from dual union all
select 'a-a-522427',9876543,6249853,6249853,'nj' from dual;
--
select
d.*,
case when t.cc_ndv=0 and t.cc_null=3 then 'Not matching'
when t.cc_ndv=(3-t.cc_null) then 'Once'
else 'Multiple match'
end flag
--t.cc_ndv,
--t.cc_null
from test_dt d ,lateral(
select
count(distinct case level when 1 then num1
when 2 then num2
when 3 then num3
end ) cc_ndv,
count(distinct case level when 1 then nvl2(num1,null,1)
when 2 then nvl2(num2,null,2)
when 3 then nvl2(num3,null,3)
end ) cc_null
from dual connect by level<=3 and sys_guid()is not null
) t;
Or
create a procedure (see dbms_sql.return_result) that returns several data sets.
Process the data from these cursors / data sets separately.
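The flag rule described in the question can also be sketched in Python, which may help when checking any SQL against the sample data. This sketch assumes that matches are counted per row and that values are compared as strings (to keep leading zeros such as 0616167); the function name classify and the dictionary keys are made up for the example.

```python
NUMBER_COLUMNS = ("num1", "num2", "num3")


def classify(value, rows):
    """Return (row id, flag) pairs for one lookup value."""
    hits = []
    for row in rows:
        # Count in how many of the three number columns the value appears.
        count = sum(1 for col in NUMBER_COLUMNS if row[col] == value)
        if count == 1:
            hits.append((row["id"], "Single Match"))
        elif count > 1:
            hits.append((row["id"], "Multiple Match"))
    # A value found in no row at all is reported once, without a row id.
    return hits or [(None, "No Match")]


rows = [
    {"id": "1-14H-4950", "num1": None, "num2": "0616167", "num3": "4233243", "loc": "CA"},
    {"id": "A-522355", "num1": None, "num2": "1234567", "num3": None, "loc": "TN"},
    {"id": "A-522423", "num1": "1234567", "num2": "2345678", "num3": "1234567", "loc": "NJ"},
]
for value in ("0616167", "1234567", "999999"):
    print(value, classify(value, rows))
```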

PostgreSQL efficiently find last descendant in linear list

I am currently trying to efficiently retrieve the last descendant from a linked-list-like structure.
Essentially there's a table with a data series, with certain criteria I split it up to get a list like this
current_id | next_id
for example
1 | 2
2 | 3
3 | 4
4 | NULL
42 | 43
43 | 45
45 | NULL
etc...
would result in lists like
1 -> 2 -> 3 -> 4
and
42 -> 43 -> 45
Now I want to get the first and the last id from each of those lists.
This is what I have right now:
WITH RECURSIVE contract(rstart_ts, rend_ts) AS ( -- recursive query to traverse the "linked list" of continuous timestamps
SELECT start_ts, end_ts FROM track_caps tc
UNION
SELECT c.rstart_ts, tc.end_ts AS end_ts0 FROM contract c INNER JOIN track_caps tc ON (tc.start_ts = c.rend_ts AND c.rend_ts IS NOT NULL AND tc.end_ts IS NOT NULL)
),
fcontract AS ( --final step, after traversing the "linked list", pick the largest timestamp found as the end_ts and the smallest as the start_ts
SELECT DISTINCT ON(start_ts, end_ts) min(rstart_ts) AS start_ts, rend_ts AS end_ts
FROM (
SELECT rstart_ts, max(rend_ts) AS rend_ts FROM contract
GROUP BY rstart_ts
) sq
GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts
In this case I just used timestamps which work fine for the given data.
Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on StackOverflow and other sites. The next query removes all the sub-steps and returns what I want, like in the first list example: 1 | 4
Just for illustration, the produced result set by the recursive query looks like this:
1 | 2
2 | 3
3 | 4
1 | 3
2 | 4
1 | 4
As nicely as it works, it is quite a memory hog, which is unsurprising when looking at the results of EXPLAIN ANALYZE.
For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. It was actually supposed to process around 2,000,000 rows, but with the current solution that seems infeasible.
Did I just use recursive queries improperly? Is there a way to reduce the amount of data it produces (like removing the sub-steps)?
Or are there better single-query solutions to this problem?
The main problem is that your recursive query doesn't properly filter the root nodes, which is caused by the model you have. So the non-recursive part already selects the entire table, and then Postgres needs to recurse for each and every row of the table.
To make that more efficient only select the root nodes in the non-recursive part of your query. This can be done using:
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
from track_caps t2
where t2.next_id = t1.current_id)
Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least it makes sure the recursion doesn't need to process more rows than necessary.
To find the root node of each tree, just select that as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.
So you wind up with something like this:
with recursive contract as (
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
from track_caps t2
where t2.next_id = t1.current_id)
union
select c.current_id, c.next_id, p.root_id
from track_caps c
join contract p on c.current_id = p.next_id
and c.next_id is not null
)
select *
from contract
order by current_id;
Online example: http://rextester.com/DOABC98823
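To see why starting only from the root nodes keeps the work linear, the traversal can be sketched in Python. This is an illustration of the idea, not PostgreSQL code; it assumes the chains contain no cycles, and the function name first_and_last and the dictionary representation are made up for the example.

```python
def first_and_last(links):
    """links maps current_id -> next_id (None at the end of a chain)."""
    # A root is a node that no other node points to (the NOT EXISTS filter).
    targets = {nxt for nxt in links.values() if nxt is not None}
    roots = [cur for cur in links if cur not in targets]
    result = []
    for root in sorted(roots):
        node = root
        # Follow next_id until the chain ends; each node is visited once.
        while links.get(node) is not None:
            node = links[node]
        result.append((root, node))
    return result


links = {1: 2, 2: 3, 3: 4, 4: None, 42: 43, 43: 45, 45: None}
print(first_and_last(links))
```

Starting from every row instead of only the roots is what produces the intermediate pairs such as 2 | 4 shown in the question.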