How can I get the dates from a text string? - sql

I use Vertical SQL and have a field "Note" that is a free text field (no consistent way to enter data). I'd like to create another field with only dates or extract the last date in the field.
E.g
"1st order on 3/2/21, second 5/5/21" -> "3/2/21 5/5/21" or "5/5/21"
"first delivery 2/2/21 second one 8/30/21" -> "2/2/21 8/30/21" or "8/30/21"
"reported 1st: 2/2/21." -> "2/2/21"
Thanks!

You can use REGEXP_SUBSTR() to grab the patterns: one or more digits; slash; one or more digits; slash; one or more digits.
If you have more than one of those patterns, then, create one row as output for each pattern found. For that, CROSS JOIN with a consecutive series of integers, so you can output the n-th occurrence of the pattern. Then, cast the found string as DATE.
Finally, and only if you only need the last date, apply a Vertica-peculiar analytic limit clause , to only output the highest i value for the respective id (which I had to add) of the result table.
WITH
-- need a sequence of integers ...
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
)
,
indata(id,s) AS (
SELECT 1,'1st order on 3/2/21, second 5/5/21'
UNION ALL SELECT 2,'first delivery 2/2/21 second one 8/30/21'
UNION ALL SELECT 3,'reported 1st: 2/2/21.'
)
SELECT
id
, i
, s
, REGEXP_SUBSTR(s,'\d+/\d+/\d+',1,i) AS found_token
, REGEXP_SUBSTR(s,'\d+/\d+/\d+',1,i)::DATE AS found_date
FROM indata CROSS JOIN i
WHERE REGEXP_SUBSTR(s,'(\d+/\d+/\d+)',1,i,'',1) <>''
-- remove the following line if you want all dates from all strings
-- and keep it if you only want the last date in the string
LIMIT 1 OVER(PARTITION BY id ORDER BY i DESC)
;
id | i | s | found_token | found_date
----+---+------------------------------------------+-------------+------------
1 | 2 | 1st order on 3/2/21, second 5/5/21 | 5/5/21 | 2021-05-05
2 | 2 | first delivery 2/2/21 second one 8/30/21 | 8/30/21 | 2021-08-30
3 | 1 | reported 1st: 2/2/21. | 2/2/21 | 2021-02-02

Consistently is critical when parsing string data. If it will always end with a date preceded by a space, pulling the last date should be fairly simple. Consider:
Trim(Mid(Note, InStrRev(Note, " ")))

Related

Big Query String Manipulation using SubQuery

I would appreciate a push in the right direction with how this might be achieved using GCP Big Query, please.
I have a column in my table of type string, inside this string there are a repeating sequence of characters and I need to extract and process each of them. To illustrate, lets say the column name is 'instruments'. A possible value for instruments could be:
'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false'
In which case I need to extract 'basoon', 'cello' and 'guitar'.
I'm more or less a SQL newbie, sorry. So far I have:
SELECT
bandId,
REGEXP_EXTRACT(instruments, r'inst=.*?\;') AS INSTS
FROM `inventory.band.mytable`;
This extracts the instruments substring ('inst=basoon,inst=cello;inst=guitar;') and gives me an output column 'INSTS' but now I think I need to split the values in that column on the comma and do some further processing. This is where I'm stuck as I cannot see how to structure additional queries or processing blocks.
How can I reference the INSTS in order to do subsequent processing? Documentation suggests I should be buildin subqueries using WITH but I can't seem to get anything going. Could some kind soul give me a push in the right direction, please?
BigQuery has a function SPLIT() that does the same as SPLIT_PART() in other databases.
Assuming that you don't alternate between the comma and the semicolon for separating your «key»=«value» pairs, and only use the semicolon,
first you split your instruments string into as many parts that contain inst=. To do that, you use an in-line table of consecutive integers to CROSS JOIN with, so that you can SPLIT(instruments,';',i) with an increasing integer value for i. You will get strings in the format inst=%, of which you want the part after the equal sign. You get that part by applying another SPLIT(), this time with the equal sign as the delimiter, and for the second split part:
WITH indata(bandid,instruments) AS (
-- some input, don't use in real query ...
-- I assume that you don't alternate between comma and semicolon for the delimiter, and stick to semicolon
SELECT
1,'band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false'
UNION ALL
SELECT
2,'band=true;inst=drum;inst=cello;inst=bass;inst=flute;cases=false;permits=true'
UNION ALL
SELECT
3,'band=false;inst=12string;inst=banjo;inst=triangle;inst=tuba;cases=false;permits=true'
)
-- real query starts here, replace following comma with "WITH" ...
,
-- need a series of consecutive integers ...
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
)
SELECT
bandid
, i
, SPLIT(SPLIT(instruments,';',i),'=',2) AS instrument
FROM indata CROSS JOIN i
WHERE SPLIT(instruments,';',i) like 'inst=%'
ORDER BY 1
-- out bandid | i | instrument
-- out --------+---+------------
-- out 1 | 2 | basoon
-- out 1 | 3 | cello
-- out 1 | 4 | guitar
-- out 2 | 2 | drum
-- out 2 | 3 | cello
-- out 2 | 4 | bass
-- out 2 | 5 | flute
-- out 3 | 2 | 12string
-- out 3 | 3 | banjo
-- out 3 | 4 | triangle
-- out 3 | 5 | tuba
Consider below few options (just to demonstrate different technics here)
Option 1
select bandId,
( select string_agg(split(kv, '=')[offset(1)])
from unnest(split(instruments, ';')) kv
where split(kv, '=')[offset(0)] = 'inst'
) as insts
from `inventory.band.mytable`
Option 2 (for obvious reason this one would be my choice)
select bandId,
array_to_string(regexp_extract_all(instruments, r'inst=([^;$]+)'), ',') instrs
from `inventory.band.mytable`
If applied to sample data in your question - output in both cases is

Group data series into variable width windows based on first event

I have computational task which can be reduced to the follow problem:
I have a large set of pairs of integers (key, val) which I want to group into windows. The first window starts with the first pair p ordered by key attribute and spans all the pairs where p[i].key belongs to [p[0].key; p[0].key + N), with some arbitrary integer N, positive and common to all windows.
The next window starts with the first pair ordered by key not included in the previous windows and again spans all the pairs from its key to key + N, and so on for the following windows.
The last step is to sum second attribute for each window and display it together with the first key of the window.
For example, given list of records with values:
key
val
1
3
2
7
5
1
6
4
7
1
10
3
13
5
and N=3, the windows would be:
{(1,3),(2,7)},
{(5,1),(6,4),(7,1)},
{(10,3)}
{(13,5)}
The final result:
key
sum_of_values
1
10
5
6
10
3
13
5
This is easy to program with a standard programming language but I have no clue how to solve this with SQL.
Note: If clickhouse doesn't support the RECURSIVE keyword, just remove that keyword from the expression.
Clickhouse seems to use non-standard syntax for the WITH clause. The below uses standard SQL. Adjust as needed.
Sorry. clickhouse may not support this approach. If not, we would need to find another method of walking through the data.
Standard SQL:
There are a few ways. Here's one approach. First assign row numbers to allow recursively stepping through the rows. We could use LEAD as well.
Assign a group (key value) to each row based on the current key and the last group/key value and whether they are within some distance (N = 3, in this case).
The last step is to just SUM these values per group start_key and to use the start_key value as the starting key in each group.
WITH RECURSIVE nrows (xkey, val, n) AS (
SELECT xkey, val, ROW_NUMBER() OVER (ORDER BY xkey) FROM test
)
, cte (xkey, val, n, start_key) AS (
SELECT xkey, val, n, xkey FROM nrows WHERE n = 1
UNION ALL
SELECT t1.xkey, t1.val, t1.n
, CASE WHEN t1.xkey <= t2.start_key + (3-1) THEN t2.start_key ELSE t1.xkey END
FROM nrows AS t1
JOIN cte AS t2
ON t2.n = t1.n-1
)
SELECT start_key
, SUM(val) AS sum_values
FROM cte
GROUP BY start_key
ORDER BY start_key
;
Result:
+-----------+------------+
| start_key | sum_values |
+-----------+------------+
| 1 | 10 |
| 5 | 6 |
| 10 | 3 |
| 13 | 5 |
+-----------+------------+

redshift regex get multiple matches and expand rows

I'm working on the URL extraction on AWS Redshift. The URL column looks like this:
url item origin
http://B123//ajdsb apple US
http://BYHG//B123 banana UK
http://B325//BF89//BY85 candy CA
The result I want to get is to get the series that starts with B and also expand rows if there are multiple series in a URL.
extracted item origin
B123 apple US
BYHG banana UK
B123 banana UK
B325 candy CA
BF89 candy CA
BY85 candy CA
My current code is:
select REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})') as extracted, item, origin
from data
The regex part works well but I have problems with extracting multiple values and expand them to new rows. I tried to use REGEXP_MATCHES(url, '(B[0-9A-Z]{3})', 'g') but function regexp_matches does not exist on Redshift...
The solution I use is fairly ugly but achieves the desired results. It involves using REGEXP_COUNT to determine the maximum number of matches in a row then joining the resulting table of numbers to a query using REGEXP_SUBSTR.
-- Get a table with the count of matches
-- e.g. if one row has 5 matches this query will return 0, 1, 2, 3, 4, 5
WITH n_table AS (
SELECT
DISTINCT REGEXP_COUNT(url, '(B[0-9A-Z]{3})') AS n
FROM data
)
-- Join the previous table to the data table and use n in the REGEXP_SUBSTR call to get the nth match
SELECT
REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})', 1, n) AS extracted,
item,
origin
FROM data,
n_table
-- Only keep non-null matches
WHERE n > 0
AND REGEXP_COUNT(url, '(B[0-9A-Z]{3})') >= N
IronFarm's answer inspired me, though I wanted to find a solution that didn't require a cross join. Here's what I came up with:
with
-- raw data
src as (
select
1 as id,
'abc def ghi' as stuff
union all
select
2 as id,
'qwe rty' as stuff
),
-- for each id, get a series of indexes for
-- each match in the string
match_idxs as (
select
id,
generate_series(1, regexp_count(stuff, '[a-z]{3}')) as idx
from
src
)
select
src.id,
match_idxs.idx,
regexp_substr(src.stuff, '[a-z]{3}', 1, match_idxs.idx) as stuff_match
from
src
join match_idxs using (id)
order by
id, idx
;
This yields:
id | idx | stuff_match
----+-----+-------------
1 | 1 | abc
1 | 2 | def
1 | 3 | ghi
2 | 1 | qwe
2 | 2 | rty
(5 rows)

Number of palindromes in character strings

I'm trying to gather a list of 6 letter palindromes and the number of times they occur using Postgres 9.3.5.
This is the query I've tried:
SELECT word, count(*)
FROM ( SELECT regexp_split_to_table(read_sequence, '([ATCG])([ATCG])([ATCG])(\3)(\2)(\1)') as word
FROM reads ) t
GROUP BY word;
However this brings up results that a) aren't palindromic and b) greater or less than 6 letters long.
\d reads
Table "public.reads"
Column | Type | Modifiers
--------------+---------+-----------
read_header | text | not null
read_sequence | text |
option | text |
quality_score | text |
pair_end | text | not null
species_id | integer |
Indexes:
"reads_pkey" PRIMARY KEY, btree (read_header, pair_end)
read_sequence contains DNA sequences, 'ATGCTGATGCGGCGTAGCTGGATCGA' for example.
I'd like to see the number of palindromes in each sequence so the example would contain 1 another sequence could have 4 another 3 and so on.
Count per row:
SELECT read_header, pair_end, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM reads r
, generate_series(1, length(r.read_sequence) - 5 ) i
WHERE substr(read_sequence, i, 6) ~ '([ATCG])([ATCG])([ATCG])\3\2\1'
GROUP BY 1,2,3
ORDER BY 1,2,3,4 DESC;
Count per read_header and palindrome:
SELECT read_header, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP BY 1,2
ORDER BY 1,2,3 DESC;
Count per read_header:
SELECT read_header, count(*) AS ct
FROM
...
GROUP BY 1
ORDER BY 1,2 DESC;
Count per palindrome:
SELECT substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP BY 1
ORDER BY 1,2 DESC;
SQL Fiddle.
Explain
A palindrome could start at any position 5 characters shy of the end to allow a length of 6. And palindromes can overlap. So:
Generate a list of possible starting positions with generate_series() in a LATERAL join, and based on this all possible 6-character strings.
Test for palindrome with regular expression with back references, similar to what you had, but regexp_split_to_table() is not the right function here. Use a regular expression match (~).
Aggregate, depending on what you actually want.

how to select one tuple in rows based on variable field value

I'm quite new into SQL and I'd like to make a SELECT statement to retrieve only the first row of a set base on a column value. I'll try to make it clearer with a table example.
Here is my table data :
chip_id | sample_id
-------------------
1 | 45
1 | 55
1 | 5986
2 | 453
2 | 12
3 | 4567
3 | 9
I'd like to have a SELECT statement that fetch the first line with chip_id=1,2,3
Like this :
chip_id | sample_id
-------------------
1 | 45 or 55 or whatever
2 | 12 or 453 ...
3 | 9 or ...
How can I do this?
Thanks
i'd probably:
set a variable =0
order your table by chip_id
read the table in row by row
if table[row]>variable, store the table[row] in a result array,increment variable
loop till done
return your result array
though depending on your DB,query and versions you'll probably get unpredictable/unreliable returns.
You can get one value using row_number():
select chip_id, sample_id
from (select chip_id, sample_id,
row_number() over (partition by chip_id order by rand()) as seqnum
) t
where seqnum = 1
This returns a random value. In SQL, tables are inherently unordered, so there is no concept of "first". You need an auto incrementing id or creation date or some way of defining "first" to get the "first".
If you have such a column, then replace rand() with the column.
Provided I understood your output, if you are using PostGreSQL 9, you can use this:
SELECT chip_id ,
string_agg(sample_id, ' or ')
FROM your_table
GROUP BY chip_id
You need to group your data with a GROUP BY query.
When you group, generally you want the max, the min, or some other values to represent your group. You can do sums, count, all kind of group operations.
For your example, you don't seem to want a specific group operation, so the query could be as simple as this one :
SELECT chip_id, MAX(sample_id)
FROM table
GROUP BY chip_id
This way you are retrieving the maximum sample_id for each of the chip_id.