Big Query String Manipulation using SubQuery - sql

I would appreciate a push in the right direction with how this might be achieved using GCP Big Query, please.
I have a column in my table of type string, inside this string there are a repeating sequence of characters and I need to extract and process each of them. To illustrate, lets say the column name is 'instruments'. A possible value for instruments could be:
'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false'
In which case I need to extract 'basoon', 'cello' and 'guitar'.
I'm more or less a SQL newbie, sorry. So far I have:
SELECT
bandId,
REGEXP_EXTRACT(instruments, r'inst=.*?\;') AS INSTS
FROM `inventory.band.mytable`;
This extracts the instruments substring ('inst=basoon,inst=cello;inst=guitar;') and gives me an output column 'INSTS' but now I think I need to split the values in that column on the comma and do some further processing. This is where I'm stuck as I cannot see how to structure additional queries or processing blocks.
How can I reference the INSTS in order to do subsequent processing? Documentation suggests I should be buildin subqueries using WITH but I can't seem to get anything going. Could some kind soul give me a push in the right direction, please?

BigQuery has a function SPLIT() that does the same as SPLIT_PART() in other databases.
Assuming that you don't alternate between the comma and the semicolon for separating your «key»=«value» pairs, and only use the semicolon,
first you split your instruments string into as many parts that contain inst=. To do that, you use an in-line table of consecutive integers to CROSS JOIN with, so that you can SPLIT(instruments,';',i) with an increasing integer value for i. You will get strings in the format inst=%, of which you want the part after the equal sign. You get that part by applying another SPLIT(), this time with the equal sign as the delimiter, and for the second split part:
WITH indata(bandid,instruments) AS (
-- some input, don't use in real query ...
-- I assume that you don't alternate between comma and semicolon for the delimiter, and stick to semicolon
SELECT
1,'band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false'
UNION ALL
SELECT
2,'band=true;inst=drum;inst=cello;inst=bass;inst=flute;cases=false;permits=true'
UNION ALL
SELECT
3,'band=false;inst=12string;inst=banjo;inst=triangle;inst=tuba;cases=false;permits=true'
)
-- real query starts here, replace following comma with "WITH" ...
,
-- need a series of consecutive integers ...
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
)
SELECT
bandid
, i
, SPLIT(SPLIT(instruments,';',i),'=',2) AS instrument
FROM indata CROSS JOIN i
WHERE SPLIT(instruments,';',i) like 'inst=%'
ORDER BY 1
-- out bandid | i | instrument
-- out --------+---+------------
-- out 1 | 2 | basoon
-- out 1 | 3 | cello
-- out 1 | 4 | guitar
-- out 2 | 2 | drum
-- out 2 | 3 | cello
-- out 2 | 4 | bass
-- out 2 | 5 | flute
-- out 3 | 2 | 12string
-- out 3 | 3 | banjo
-- out 3 | 4 | triangle
-- out 3 | 5 | tuba

Consider below few options (just to demonstrate different technics here)
Option 1
select bandId,
( select string_agg(split(kv, '=')[offset(1)])
from unnest(split(instruments, ';')) kv
where split(kv, '=')[offset(0)] = 'inst'
) as insts
from `inventory.band.mytable`
Option 2 (for obvious reason this one would be my choice)
select bandId,
array_to_string(regexp_extract_all(instruments, r'inst=([^;$]+)'), ',') instrs
from `inventory.band.mytable`
If applied to sample data in your question - output in both cases is

Related

How can I get the dates from a text string?

I use Vertical SQL and have a field "Note" that is a free text field (no consistent way to enter data). I'd like to create another field with only dates or extract the last date in the field.
E.g
"1st order on 3/2/21, second 5/5/21" -> "3/2/21 5/5/21" or "5/5/21"
"first delivery 2/2/21 second one 8/30/21" -> "2/2/21 8/30/21" or "8/30/21"
"reported 1st: 2/2/21." -> "2/2/21"
Thanks!
You can use REGEXP_SUBSTR() to grab the patterns: one or more digits; slash; one or more digits; slash; one or more digits.
If you have more than one of those patterns, then, create one row as output for each pattern found. For that, CROSS JOIN with a consecutive series of integers, so you can output the n-th occurrence of the pattern. Then, cast the found string as DATE.
Finally, and only if you only need the last date, apply a Vertica-peculiar analytic limit clause , to only output the highest i value for the respective id (which I had to add) of the result table.
WITH
-- need a sequence of integers ...
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
)
,
indata(id,s) AS (
SELECT 1,'1st order on 3/2/21, second 5/5/21'
UNION ALL SELECT 2,'first delivery 2/2/21 second one 8/30/21'
UNION ALL SELECT 3,'reported 1st: 2/2/21.'
)
SELECT
id
, i
, s
, REGEXP_SUBSTR(s,'\d+/\d+/\d+',1,i) AS found_token
, REGEXP_SUBSTR(s,'\d+/\d+/\d+',1,i)::DATE AS found_date
FROM indata CROSS JOIN i
WHERE REGEXP_SUBSTR(s,'(\d+/\d+/\d+)',1,i,'',1) <>''
-- remove the following line if you want all dates from all strings
-- and keep it if you only want the last date in the string
LIMIT 1 OVER(PARTITION BY id ORDER BY i DESC)
;
id | i | s | found_token | found_date
----+---+------------------------------------------+-------------+------------
1 | 2 | 1st order on 3/2/21, second 5/5/21 | 5/5/21 | 2021-05-05
2 | 2 | first delivery 2/2/21 second one 8/30/21 | 8/30/21 | 2021-08-30
3 | 1 | reported 1st: 2/2/21. | 2/2/21 | 2021-02-02
Consistently is critical when parsing string data. If it will always end with a date preceded by a space, pulling the last date should be fairly simple. Consider:
Trim(Mid(Note, InStrRev(Note, " ")))

How do I create a repeating number sequence in a column in BigQuery

I'd like to create a query that returns a column with a repeating number sequence in it.
For example:
row_num | repeat
----------------
1 | 1
2 | 2
3 | 3
4 | 1
5 | 2
6 | 3
I'm struggling to understand how I could achieve this with BigQuery Standard SQL.
So far i've generated the row number (ROW_NUMBER() OVER()) as row_num in my select, and then I was thinking I could use a modulus function to determine the repeat number, but this would split it into several separate columns, so I'd need additional steps to merge them into the one column. I wondered if there was a more elegant way of achieving this.
Many Thanks!
In fact, the modulus should work here. Assuming your table already has a row_num column, and you want to generate the repeat column, you may try:
SELECT
row_num,
MOD(row_num - 1, 3) + 1 AS repeat
FROM yourTable
ORDER BY
row_num;

redshift regex get multiple matches and expand rows

I'm working on the URL extraction on AWS Redshift. The URL column looks like this:
url item origin
http://B123//ajdsb apple US
http://BYHG//B123 banana UK
http://B325//BF89//BY85 candy CA
The result I want to get is to get the series that starts with B and also expand rows if there are multiple series in a URL.
extracted item origin
B123 apple US
BYHG banana UK
B123 banana UK
B325 candy CA
BF89 candy CA
BY85 candy CA
My current code is:
select REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})') as extracted, item, origin
from data
The regex part works well but I have problems with extracting multiple values and expand them to new rows. I tried to use REGEXP_MATCHES(url, '(B[0-9A-Z]{3})', 'g') but function regexp_matches does not exist on Redshift...
The solution I use is fairly ugly but achieves the desired results. It involves using REGEXP_COUNT to determine the maximum number of matches in a row then joining the resulting table of numbers to a query using REGEXP_SUBSTR.
-- Get a table with the count of matches
-- e.g. if one row has 5 matches this query will return 0, 1, 2, 3, 4, 5
WITH n_table AS (
SELECT
DISTINCT REGEXP_COUNT(url, '(B[0-9A-Z]{3})') AS n
FROM data
)
-- Join the previous table to the data table and use n in the REGEXP_SUBSTR call to get the nth match
SELECT
REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})', 1, n) AS extracted,
item,
origin
FROM data,
n_table
-- Only keep non-null matches
WHERE n > 0
AND REGEXP_COUNT(url, '(B[0-9A-Z]{3})') >= N
IronFarm's answer inspired me, though I wanted to find a solution that didn't require a cross join. Here's what I came up with:
with
-- raw data
src as (
select
1 as id,
'abc def ghi' as stuff
union all
select
2 as id,
'qwe rty' as stuff
),
-- for each id, get a series of indexes for
-- each match in the string
match_idxs as (
select
id,
generate_series(1, regexp_count(stuff, '[a-z]{3}')) as idx
from
src
)
select
src.id,
match_idxs.idx,
regexp_substr(src.stuff, '[a-z]{3}', 1, match_idxs.idx) as stuff_match
from
src
join match_idxs using (id)
order by
id, idx
;
This yields:
id | idx | stuff_match
----+-----+-------------
1 | 1 | abc
1 | 2 | def
1 | 3 | ghi
2 | 1 | qwe
2 | 2 | rty
(5 rows)

SQL Server - Find records with identical substrings [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I inherited a table that has a column containing hand-entered award numbers. It has been used for many years by many people. The award numbers in general look like this:
R01AR012345-01
R01AR012345-02
R01AR012345-03
Award numbers get assigned each year. Because so many different people have had their hands in this in the past, there isn't a lot of consistency in how these are entered. For instance, an award sequence may appear like this:
R01AR012345-01
1 RO1AR012345-02
12345-03
12345-05A1
1234506
The rule I've been given to find is to return any record in which 5 consecutive integers from that column match with another record.
I know how to match a given string, but am at a loss when the 5 consecutive integers are unknown.
Here's a sample table to make what I'm looking for more clear:
+----------------------+
| table: AWARD |
+-----+----------------+
| ID | AWARD_NO |
+-----+----------------+
| 12 | R01AR015123-01 |
+-----+----------------+
| 13 | R01AR015124-01 |
+-----+----------------+
| 14 | 15123-02A1 |
+-----+----------------+
| 15 | 1 Ro1XY1512303 |
+-----+----------------+
| 16 | R01XX099232-01 |
+-----+----------------+
In the above table, the following IDs would be returned: 12,13,14,15
The five consecutive integers that match are:
12,13: 01512
12,14: 15123
12,15: 15123
In our specific case, ID 13 is a false positive... but we're willing to deal with those on a case-by-case basis.
Here's the desired return set for the above table:
+-----+-----+----------------+----------------+
| ID1 | ID2 | AWARD_NO_1 | AWARD_NO_2 |
+-----+-----+----------------+----------------+
| 12 | 13 | R01AR015123-01 | R01AR015124-01 |
+-----+-----+----------------+----------------+
| 12 | 14 | R01AR015123-01 | 15123-02A1 |
+-----+-----+----------------+----------------+
| 12 | 15 | R01AR015123-01 | 1 Ro1XY1512303 |
+-----+-----+----------------+----------------+
Now... I'm OK with false positives (like 12 matching 13) and duplicates (because if 12 matches 14, then 14 also matches 12). We're looking through something like 18,000 rows. Optimization isn't really necessary in this situation, because it's only needed to be run one time.
This should handle removing duplicates and most false-positives:
DECLARE #SPONSOR TABLE (ID INT NOT NULL PRIMARY KEY, AWARD_NO VARCHAR(50))
INSERT INTO #SPONSOR VALUES (12, 'R01AR015123-01')
INSERT INTO #SPONSOR VALUES (13, 'R01AR015124-01')
INSERT INTO #SPONSOR VALUES (14, '15123-02A1')
INSERT INTO #SPONSOR VALUES (15, '1 Ro1XY1512303')
INSERT INTO #SPONSOR VALUES (16, 'R01XX099232-01')
;WITH nums AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS [Num]
FROM sys.objects
),
cte AS
(
SELECT sp.ID,
sp.AWARD_NO,
SUBSTRING(sp.AWARD_NO, nums.Num, 5) AS [TestCode],
SUBSTRING(sp.AWARD_NO, nums.Num + 5, 1) AS [FalsePositiveTest]
FROM #SPONSOR sp
CROSS JOIN nums
WHERE nums.Num < LEN(sp.AWARD_NO)
AND SUBSTRING(sp.AWARD_NO, nums.Num, 5) LIKE '%[1-9][0-9][0-9][0-9][0-9]%'
-- AND SUBSTRING(sp.AWARD_NO, nums.Num, 5) LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
)
SELECT sp1.ID AS [ID1],
sp2.ID AS [ID2],
sp1.AWARD_NO AS [AWARD_NO1],
sp2.AWARD_NO AS [AWARD_NO2],
sp1.TestCode
FROM cte sp1
CROSS JOIN #SPONSOR sp2
WHERE sp2.AWARD_NO LIKE '%' + sp1.TestCode + '%'
AND sp1.ID < sp2.ID
--AND 1 = CASE
-- WHEN (
-- sp1.FalsePositiveTest LIKE '[0-9]'
-- AND sp2.AWARD_NO NOT LIKE
-- '%' + sp1.TestCode + sp1.FalsePositiveTest + '%'
-- ) THEN 0
-- ELSE 1
-- END
Output:
ID1 ID2 AWARD_NO1 AWARD_NO2 TestCode
12 14 R01AR015123-01 15123-02A1 15123
12 15 R01AR015123-01 1 Ro1XY1512303 15123
14 15 15123-02A1 1 Ro1XY1512303 15123
If IDs 14 and 15 should not match, we might be able to correct for that as well.
EDIT:
Based on the comment from #Serpiton I commented out the creation and usage of the [FalsePositiveTest] field since changing the initial character range in the LIKE clause on the SUBSTRING to be [1-9] accomplished the same goal and slightly more efficiently. However, this change assumes that no valid Award # will start with a 0 and I am not sure that this is a valid assumption. Hence, I left the original code in place but just commented out.
You want to use the LIKE command in your where clause and use a pattern to look for the 5 numbers. See this post here:
There are probably better ways of representing this but the below example looks for 5 digits from 0-9 next to each other in the data anywhere in your column value. This could perform quite slowly however...
Select *
from blah
Where column LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
Create a sql server function to extract the 5 numbers and then use the function in your query.
Perhaps something like:
select GetAwardNumber(AwardNumberField) as AwardNumber
from Awards
group by GetAwardNumber(AwardNumberField)
I will not post the code, but an idea on how to do it.
First of all you need to make a table valued function that will return all number sequences from a string bigger then 5 characters. (there are examples on SO)
So for each entry your function will return a list of numbers.
After that the query will simplify like:
;with res as (
select
id, -- hopefully there is an id on your table
pattern -- pattern is from the list of patterns the udtf returns
from myTable
cross apply udtf_custom(myString) -- myString is the string you need to split
)
select
pattern
from res
group by pattern
having count(distinct id)>1
I have to note that this is for example purposes, there should be some coding and testing involved, but this should be the story with it.
Good luck, hope it helps.
Here's what I ended up with:
SELECT a1.ID as AWARD_ID_1, a2.ID as AWARD_ID_2, a1.AWARD_NO as Sponsor_Award_1, a2.AWARD_NO as Sponsor_Award_2
FROM AWARD a1
LEFT OUTER JOIN AWARD a2
ON SUBSTRING(a1.AWARD_NO,PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%',a1.AWARD_NO + '1'),5) = SUBSTRING(a2.AWARD_NO,PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%',a2.AWARD_NO + '1'),5)
WHERE
a1.AWARD_NO <> '' AND a2.AWARD_NO <> ''
AND a1.ID <> a2.ID
AND a1.AWARD_NO LIKE '%[0-9][0-9][0-9][0-9][0-9]%' AND a2.AWARD_NO LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
There's a possibility that the first substring of five characters might not match (when they should generate a match), but it's close enough for us. :-)

Explode range of integers out for joining in SQL

I have one table that stores a range of integers in a field, sort of like a print range, (e.g. "1-2,4-7,9-11"). This field could also contain a single number.
My goal is to join this table to a second one that has discrete values instead of ranges.
So if table one contains
1-2,5
9-15
7
And table two contains
1
2
3
4
5
6
7
8
9
10
The result of the join would be
1-2,5 1
1-2,5 2
1-2,5 5
7 7
9-15 9
9-15 10
Working in SQL Server 2008 R2.
Use a string split function of your choice to split on comma. Figure out the min/max values and join using between.
SQL Fiddle
MS SQL Server 2012 Schema Setup:
create table T1(Col1 varchar(10))
create table T2(Col2 int)
insert into T1 values
('1-2,5'),
('9-15'),
('7')
insert into T2 values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
Query 1:
select T1.Col1,
T2.Col2
from T2
inner join (
select T1.Col1,
cast(left(S.Item, charindex('-', S.Item+'-')-1) as int) MinValue,
cast(stuff(S.Item, 1, charindex('-', S.Item), '') as int) MaxValue
from T1
cross apply dbo.Split(T1.Col1, ',') as S
) as T1
on T2.Col2 between T1.MinValue and T1.MaxValue
Results:
| COL1 | COL2 |
----------------
| 1-2,5 | 1 |
| 1-2,5 | 2 |
| 1-2,5 | 5 |
| 9-15 | 9 |
| 9-15 | 10 |
| 7 | 7 |
Like everybody has said, this is a pain to do natively in SQL Server. If you must then I think this is the proper approach.
First determine your rules for parsing the string, then break down the process into well-defined and understood problems.
Based on your example, I think this is the process:
Separate comma separated values in the string into rows
If the data does not contain a dash, then it's finished (it's a standalone value)
If it does contain a dash, parse the left and right sides of the dash
Given the left and right sides (the range) determine all the values between them into rows
I would create a temp table to populate the parsing results into which needs two columns:
SourceRowID INT, ContainedValue INT
and another to use for intermediate processing:
SourceRowID INT, ContainedValues VARCHAR
Parse your comma-separated values into their own rows using a CTE like this Step 1 is now a well-defined and understood problem to solve:
Turning a Comma Separated string into individual rows
So your result from the source
'1-2,5'
will be:
'1-2'
'5'
From there, SELECT from that processing table where the field does not contain a dash. Step 2 is now a well-defined and understood problem to solve These are standalone numbers and can go straight into the results temp table. The results table should also get the ID reference to the original row.
Next would be to parse the values to the left and right of the dash using CHARINDEX to locate it, then the appropriate LEFT and RIGHT functions as needed. This will give you the starting and ending value.
Here is a relevant question for accomplishing this step 3 is now a well-defined and understood problem to solve:
T-SQL substring - separating first and last name
Now you have separated the starting and ending values. Use another function which can explode this range. Step 4 is now a well-defined and understood problem to solve:
SQL: create sequential list of numbers from various starting points
SELECT all N between #min and #max
What is the best way to create and populate a numbers table?
and, also, insert it into the temp table.
Now what you should have is a temp table with every value in the exploded range.
Simply JOIN that to the other table on the values now, then to your source table on the ID reference and you're there.
My suggestion is to add one more field and many more records to your ranges table. Specifically, the primary key would be the integer and the other field would be the range. Records would look like this:
number range
1 1-2,5
2 1-2,5
3 na
4 na
5 1-2,5
etc
Having said that, this is still rather limiting because a number can only have one range. If you want to be thorough, set up a many to many relationship between numbers and ranges.
As far as I can tell you best option is something like below:
Create a table value function that accepts your ranges an converts them to a collection of ints. So 1-3,5 would return:
1
2
3
5
Then use these results to join to other tables. I don't have an exact function to do this at hand, but this one seems like an excellent start.