Regex that matches strings with specific text not between text in BigQuery

Regex that matches strings with specific text not between text in BigQuery - google-bigquery

I have the following strings:
step_1->step_2->step_3
step_1->step_3
step_1->step_2->step_1->step_3
step_1->step_2->step_1->step_2->step_3
What I would like to do is to capture the ones that between step_1 and step 3 there's no step_2.
The results should be like this:
string result
step_1->step_2->step_3 false
step_1->step_3 true
step_1->step_2->step_1->step_3 true
step_1->step_2->step_1->step_2->step_3 false
I have tried to use the negative lookahead but I found out that BigQuery doesn't support it. Any ideas?

You are essentially looking for when the pattern does not exist. The following regex would support that embedded in a case statement. This would not support a scenario where you have both conditions in a single string, however that was not a scenario you listed in your sample data.
Try the following:
with sample_data as (
select 'step_1->step_2->step_3' as string union all
select 'step_1->step_3' union all
select 'step_1->step_2->step_1->step_3' union all
select 'step_1->step_2->step_1->step_2->step_3' union all
select 'step_1->step_2->step_1->step_2->step_2->step_3' union all
select 'step_1->step_2->step_1->step_2->step_2'
)
select
string,
-- CASE WHEN regexp_extract(string, r'step_1->(\w+)->step_3') IS NULL THEN TRUE
CASE WHEN regexp_extract(string, r'1(->step_2)+->step_3') IS NULL THEN TRUE
ELSE FALSE END as result
from sample_data
This results in:

Consider also below option
select string,
not regexp_contains(string, r'step_1->(step_2->)+step_3\b') as result
from your_table

I believe #Daniel_Zagales answer is the one you were expecting. However here is a broader solution that can maybe be interesting in your usecase:it consists in using arrays
WITH sample AS (
SELECT 'step_1->step_2->step_3' AS path
UNION ALL SELECT 'step_1->step_3'
UNION ALL SELECT 'step_1->step_2->step_1->step_3'
UNION ALL SELECT 'step_1->step_2->step_1->step_2->step_3'
),
temp AS (
SELECT
path,
SPLIT(REGEXP_REPLACE(path,'step_', ''), '->') AS sequences
FROM
sample)
SELECT
path,
position,
flattened AS current_step,
LAG(flattened) OVER (PARTITION BY path ORDER BY OFFSET ) AS previous_step,
LEAD(flattened) OVER (PARTITION BY path ORDER BY OFFSET ) AS following_step
FROM
temp,
temp.sequences AS flattened
WITH
OFFSET AS position
This query returns the following table
The concept is to get an array of the step number (splitting on '->' and erasing 'step_') and to keep the OFFSET (crucial as UNNESTing arrays does not guarantee keeping the order of an array).
The table obtained contains for each path and step of said path, the previous and following step. It is therefore easy to test for instance if successive steps have a difference of 1.
(SELECT * FROM <previous> WHERE ABS(current_step-previous_step) != 1 for example)
(CASTing to INT required)

Related

How to easily remove count=1 on aliased field in SQL?

I have the following data in a table:
GROUP1|FIELD
Z_12TXT|111
Z_2TXT|222
Z_31TBT|333
Z_4TXT|444
Z_52TNT|555
Z_6TNT|666
And I engineer in a field that removes the leading numbers after the '_'
GROUP1|GROUP_ALIAS|FIELD
Z_12TXT|Z_TXT|111
Z_2TXT|Z_TXT|222
Z_31TBT|Z_TBT|333 <- to be removed
Z_4TXT|Z_TXT|444
Z_52TNT|Z_TNT|555
Z_6TNT|Z_TNT|666
How can I easily query the original table for only GROUP's that correspond to GROUP_ALIASES with only one Distinct FIELD in it?
Desired result:
GROUP1|GROUP_ALIAS|FIELD
Z_12TXT|Z_TXT|111
Z_2TXT|Z_TXT|222
Z_4TXT|Z_TXT|444
Z_52TNT|Z_TNT|555
Z_6TNT|Z_TNT|666
This is how I get all the GROUP_ALIAS's I don't want:
SELECT GROUP_ALIAS
FROM
(SELECT
GROUP1,FIELD,
case when instr(GROUP1, '_') = 2
then
substr(GROUP1, 1, 2) ||
ltrim(substr(GROUP1, 3), '0123456789')
else
substr(GROUP1 , 1, 1) ||
ltrim(substr(GROUP1, 2), '0123456789')
end GROUP_ALIAS
FROM MY_TABLE
GROUP BY GROUP_ALIAS
HAVING COUNT(FIELD)=1
Probably I could make the engineered field a second time simply on the original table and check that it isn't in the result from the latter, but want to avoid so much nesting. I don't know how to partition or do anything more sophisticated on my case statement making this engineered field, though.
UPDATE
Thanks for all the great replies below. Something about the SQL used must differ from what I thought because I'm getting info like:
GROUP1|GROUP_ALIAS|FIELD
111,222|,111|111
111,222|,222|222
etc.
Not sure why since the solutions work on my unabstracted data in db-fiddle. If anyone can spot what db it's actually using that would help but I'll also check on my end.

Here is one way, using analytic count. If you are not familiar with the with clause, read up on it - it's a very neat way to make your code readable. The way I declare column names in the with clause works since Oracle 11.2; if your version is older than that, the code needs to be re-written just slightly.
I also computed the "engineered field" in a more compact way. Use whatever you need to.
I used sample_data for the table name; adapt as needed.
with
add_alias (group1, group_alias, field) as (
select group1,
substr(group1, 1, instr(group1, '_')) ||
ltrim(substr(group1, instr(group1, '_') + 1), '0123456789'),
field
from sample_data
)
, add_counts (group1, group_alias, field, ct) as (
select group1, group_alias, field, count(*) over (partition by group_alias)
from add_alias
)
select group1, group_alias, field
from add_counts
where ct > 1
;

With Oracle you can use REGEXP_REPLACE and analytic functions:
select Group1, group_alias, field
from (select group1, REGEXP_REPLACE(group1,'_\d+','_') group_alias, field,
count(*) over (PARTITION BY REGEXP_REPLACE(group1,'_\d+','_')) as count from test) a
where count > 1
db-fiddle

How can I count repeated values in the string in BigQuery?

Example:
I have the following string:
201904,BLANK,201902,BLANK,BLANK,201811,201810,201809
How can I count the number of repeated values "BLANK" that goes one by one?
In the described example the answer is 2, but what is the query?
Thanks for your help in advance!

Below is for BigQuery Standard SQL (with quick simplified example)
Corrected Version
#standardSQL
WITH `project.dataset.table` AS (
SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK' value UNION ALL
SELECT '201904,BLANK,201902,BLANK,BLANK,BLANK,201811' UNION ALL
SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT value,
(
SELECT MAX(ARRAY_LENGTH(SPLIT(list))) - 1
FROM UNNEST(REGEXP_EXTRACT_ALL(value || ',', r'(?:BLANK,){1,}')) list
) max_repeated_count
FROM `project.dataset.table`
The idea here is
extract all instances of consecutive BLANK
split each such instances to array of elements of BLANK
and finally get max length of those arrays as a result
Just something came as quick approach
Refactored Version
#standardSQL
WITH `project.dataset.table` AS (
SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK' value UNION ALL
SELECT '201904,BLANK,201902,BLANK,BLANK,BLANK,201811' UNION ALL
SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT value,
(
SELECT MAX(LENGTH(element) - 1)
FROM UNNEST(REGEXP_EXTRACT_ALL(REPLACE(value || ',', 'BLANK', ''), r',+')) element
) max_repeated_count
FROM `project.dataset.table`
Both with output
Row value max_repeated_count
1 201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK 3
2 201904,BLANK,201902,BLANK,BLANK,BLANK,201811 3
3 201904,BLANK,201902,BLANK,201811,201902,BLANK,201811 1
Refactored version is slightly different (but main idea the same)
it removes all BLANKS (assuming BLANK cannot be part of other element - if it can - code can easily be adjusted)
then extract all consecutive entries of commas into array
calculates max length of such sequences of commas

Maybe I misunderstood, but can't you simply split by the value you're looking for and subtract 2 (1 for the first element and 1 for counting elements after splitting):
declare t DEFAULT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809';
SELECT
t as theString,
split(t,'BLANK') as theSplittedString,
array_length(split(t,'BLANK'))-2 as theAmount
n>0 - amount of repetition,
0 - no repetition,
-1 - element not found

Concatenate & Trim String

Can anyone help me, I have a problem regarding on how can I get the below result of data. refer to below sample data. So the logic for this is first I want delete the letters before the number and if i get that same thing goes on , I will delete the numbers before the letter so I can get my desired result.
Table:
SALV3000640PIX32BLU
SALV3334470A9CARBONGRY
TP3000620PIXL128BLK
Desired Output:
PIX32BLU
A9CARBONGRY
PIXL128BLK

You need to use a combination of the SUBSTRING and PATINDEX Functions
SELECT
SUBSTRING(SUBSTRING(fielda,PATINDEX('%[^a-z]%',fielda),99),PATINDEX('%[^0-9]%',SUBSTRING(fielda,PATINDEX('%[^a-z]%',fielda),99)),99) AS youroutput
FROM yourtable
Input
yourtable
fielda
SALV3000640PIX32BLU
SALV3334470A9CARBONGRY
TP3000620PIXL128BLK
Output
youroutput
PIX32BLU
A9CARBONGRY
PIXL128BLK
SQL Fiddle:http://sqlfiddle.com/#!6/5722b6/29/0

To do this you can use
PATINDEX('%[0-9]%',FieldName)
which will give you the position of the first number, then trim off any letters before this using SUBSTRING or other string functions. (You need to trim away the first letters before continuing with the next step because unlike CHARINDEX there is no starting point parameter in the PATINDEX function).
Then on the remaining string use
PATINDEX('%[a-z]%',FieldName)
to find the position of the first letter in the remaining string. Now trim off the numbers in front using SUBSTRING etc.
You may find this other solution helpful
SQL to find first non-numeric character in a string

Try this it may helps you
;With cte (Data)
AS
(
SELECT 'SALV3000640PIX32BLU' UNION ALL
SELECT 'SALV3334470A9CARBONGRY' UNION ALL
SELECT 'SALV3334470A9CARBONGRY' UNION ALL
SELECT 'SALV3334470B9CARBONGRY' UNION ALL
SELECT 'SALV3334470D9CARBONGRY' UNION ALL
SELECT 'TP3000620PIXL128BLK'
)
SELECT * , CASE WHEN CHARINDEX('PIX',Data)>0 THEN SUBSTRING(Data,CHARINDEX('PIX',Data),LEN(Data))
WHEN CHARINDEX('A9C',Data)>0 THEN SUBSTRING(Data,CHARINDEX('A9C',Data),LEN(Data))
ELSE NULL END AS DesiredResult FROM cte
Result
Data DesiredResult
-------------------------------------
SALV3000640PIX32BLU PIX32BLU
SALV3334470A9CARBONGRY A9CARBONGRY
SALV3334470A9CARBONGRY A9CARBONGRY
SALV3334470B9CARBONGRY NULL
SALV3334470D9CARBONGRY NULL
TP3000620PIXL128BLK PIXL128BLK

How to display the whole number only if it starts with two characters otherwise leave it blank. SQL query

I need to display the whole number in a field if it starts with "AB" otherwise do not show/display the number.

Your question is missing code of how you display this (since you wrote you need to display it to the field) so i can't answer you with actual code but here is solution.
If you want to select only rows which column1 starts with AB then use LIKE function. So condition at selecting command is Select * from yourtable where column1 LIKE 'AB%'
If you already selected and displayed data, let's say in datagridview, and you want to fill textbox with string that contains AB then you would go through all rows at specific column and look for it with string.Contains("AB");
So basically you put this command in foreach loop and you have it.

I was wrong. You can use a LIKE, just not in the WHERE clause.
;WITH testdata AS (
SELECT 'aw12354' AS val UNION ALL
SELECT 'a12b344' UNION ALL
SELECT 'AB11111' UNION ALL
SELECT '11AB111' UNION ALL
SELECT '11111AB' UNION ALL
SELECT 'ab22222'
)
SELECT
CASE WHEN val LIKE 'AB%' THEN val ELSE NULL END AS valFull
, CASE WHEN val LIKE 'AB%' THEN SUBSTRING(val,3,len(val)) ELSE NULL END AS valNums
FROM testdata
;
You can also use CLR to build a regex solution, but that is a LOT more involved.

PLSQL - order by string with REGEX

I'm trying to sort the result set of a query where the row is VARCHAR2.
I've tried using just:
ORDER BY
UPPER(SERVER_NAME) ASC
But I get inconstant results, for example:
120157
777555
AKO
a20064
Elilikes
kagan
1200165_DAVID
As you can see, 1200165_DAVID appears last, in addition, I tried using a regular expression like so:
ORDER BY
(CASE WHEN REGEXP_LIKE(UPPER(SERVER_NAME), '^[0-9]+$') THEN 1 ELSE 2 END) ASC,
UPPER(SERVER_NAME) ASC
But I get the same results, I would like to get the following ordring is possible:
120157
1200165_DAVID
777555
a20064
AKO
Elilikes
kagan
Please advise.

Three things.
First: Why do you want 1200165_DAVID to appear AFTER 120157? It should appear before it, if you order alphabetically.
Second: Running your query on your test data, I get the correct result. So I am inclined to believe either your query is different from what you reported, or there is some other error somewhere.
Third: You may have who-knows-what characters in your data. Selecting str and dump(str) side by side (or whatever the name of your expression; I like to use str in my test data) to see what characters are in each string. Look especially at those that seem to be sorted "out of order".
with
inputs ( str ) as (
select '120157' from dual union all
select '777555' from dual union all
select 'AKO' from dual union all
select 'a20064' from dual union all
select 'Elilikes' from dual union all
select 'kagan' from dual union all
select '1200165_DAVID' from dual
)
select str from inputs
order by upper(str);
STR
-------------
1200165_DAVID
120157
777555
a20064
AKO
Elilikes
kagan
7 rows selected.

This is too long for a comment.
Your data would appear to not be all characters that you recognize. In particular, the first character is suspicious.
I would suggest that you run a query like this:
select ASCII(SUBSTR(server_name, 1, 1)) as first_char-ascii,
'|' || SUBSTR(server_name, 1, 1) || '|' as first_char,
COUNT(*), min(server_name), max(server_name)
from t
group by SUBSTR(server_name, 1, 1)
order by count(*) asc;
Then you will see what characters are actually at the beginning of the string. My guess is you will find at least one interesting character. You will then need to modify the data (or the query) to handle that.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Regex that matches strings with specific text not between text in BigQuery - google-bigquery

Consider also below option select string, not regexp_contains(string, r'step_1->(step_2->)+step_3\b') as result from your_table

Related

How to easily remove count=1 on aliased field in SQL?

How can I count repeated values in the string in BigQuery?

Concatenate & Trim String

How to display the whole number only if it starts with two characters otherwise leave it blank. SQL query

PLSQL - order by string with REGEX

Categories

Resources