BigQuery return all matches of regular expression - sql

In BigQuery, when I do a regular expression search, it only returns the first match/occurrence.
Is there any way to return all matches, concatenated, with something like GROUP_CONCAT?
REGEXP_EXTRACT(body, r"(\w+ )")

In BigQuery Standard SQL you can use REGEXP_EXTRACT_ALL, which returns an array of all matches, and join that array with STRING_AGG, as below:
SELECT
body,
(SELECT STRING_AGG(word) FROM words.word) AS words
FROM (
SELECT
body, REGEXP_EXTRACT_ALL(body, r'(\w+)') AS word
FROM (
SELECT 'abc xyz qwerty asd' AS body UNION ALL
SELECT 'zxc dfg 345' AS body
)
) words
Don't forget to uncheck the Use Legacy SQL checkbox under Show Options.
See the documentation on REGEXP_EXTRACT_ALL and STRING_AGG for more details.
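If you want to check the semantics outside BigQuery, here is a minimal Python sketch of what the query above computes: collect every match of the pattern, then join the matches. The function name `extract_all_joined` is made up for illustration, the comma separator mirrors STRING_AGG's default delimiter, and Python's `re` module only approximates BigQuery's regex engine.

```python
import re


def extract_all_joined(body: str, pattern: str = r"(\w+)", sep: str = ",") -> str:
    """Return every match of `pattern` in `body`, joined with `sep`."""
    # With a single capturing group, re.findall returns the list of group values,
    # which plays the role of the array produced by REGEXP_EXTRACT_ALL.
    matches = re.findall(pattern, body)
    # Joining the list plays the role of STRING_AGG over the unnested array.
    return sep.join(matches)


if __name__ == "__main__":
    for body in ["abc xyz qwerty asd", "zxc dfg 345"]:
        print(body, "->", extract_all_joined(body))
```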
If you are stuck with what BigQuery now calls legacy SQL, you can try something like this:
SELECT
body,
GROUP_CONCAT(SPLIT(body, ' ')) AS words
FROM
(SELECT 'abc xyz qwerty asd' AS body),
(SELECT 'zxc dfg 345' AS body)
I understand this is not exactly what you need, but it might help.
Another approach with BigQuery legacy SQL is better suited to cases where you have to use a regex.
For example, assume you need to extract only the numbers from body.
The idea is to replace everything except digits with a delimiter using REGEXP_REPLACE, and then apply the SPLIT() + GROUP_CONCAT() combination described above:
SELECT
body,
GROUP_CONCAT(SPLIT(REGEXP_REPLACE(body, r'(\D)+', ':'), ':')) AS words
FROM
(SELECT 'abc 123 xyz 543 qwerty asd' AS body),
(SELECT '987zxc 123 dfg 345' AS body)
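A rough Python sketch of the same replace-then-split idea, in case it helps to see the intermediate values. The helper name `digits_via_replace_and_split` is hypothetical. Note that a body starting or ending with non-digits leaves a delimiter at the edge, which yields empty pieces after splitting; the sketch drops them explicitly and makes no claim about how legacy SQL's SPLIT() treats them.

```python
import re
from typing import List


def digits_via_replace_and_split(body: str, delimiter: str = ":") -> List[str]:
    """Keep only the digit runs of `body` by replacing everything else, then splitting."""
    # Collapse every run of non-digit characters into a single delimiter,
    # e.g. 'abc 123 xyz 543 qwerty asd' becomes ':123:543:'.
    collapsed = re.sub(r"\D+", delimiter, body)
    # Leading/trailing delimiters produce empty strings; filter them out.
    return [piece for piece in collapsed.split(delimiter) if piece]


if __name__ == "__main__":
    for body in ["abc 123 xyz 543 qwerty asd", "987zxc 123 dfg 345"]:
        print(body, "->", ",".join(digits_via_replace_and_split(body)))
```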

Related

How to find strings without different words?

For example, I'm using SQL to filter out all descriptions containing the fruit 'Plum'. Unfortunately, using this code yields all sorts of irrelevant words (e.g. 'Plump', 'Plumeria') while excluding anything with a comma or full stop right after it (e.g. 'plum,' and 'plum.')
SELECT winery FROM winemag_p1
WHERE description LIKE '%plum%' OR
Is there a better way to do this? Thanks. I'm using SQL Server, but I'm curious how to make this work for MySQL and PostgreSQL too.
Try the following method: use TRANSLATE* to turn edge-case characters into spaces, concatenate a space onto each end of the source, and search for the keyword surrounded by spaces:
with plums as (
select 'this has a plum in it' p union all
select 'plum crazy' union all
select 'plume of smoke' union all
select 'a plump turkey' union all
select 'I like plums.' union all
select 'pick a plum, eat a plum'
)
select *
from plums
where Concat(' ',Translate(p,',.s','   '), ' ') like '% plum %'
* TRANSLATE requires SQL Server 2017 or later (its second and third arguments must be the same length); on older versions you will need nested replace() calls
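To make the padding trick easier to follow, here is a small Python sketch of the same logic. The function name `contains_word` and its parameters are made up for illustration. Two caveats: Python's `in` is case-sensitive, whereas LIKE follows the column collation, and mapping every 's' to a space (done here only to catch the plural) can create false positives such as a word ending in "s" directly before "plum".

```python
def contains_word(text: str, word: str = "plum", noise: str = ",.s") -> bool:
    """Return True if `word` appears as a space-delimited token after cleaning."""
    # Map each "noise" character to a space, like TRANSLATE(p, ',.s', '   ').
    table = str.maketrans(noise, " " * len(noise))
    # Pad both ends so a word at the very start or end is still surrounded by spaces.
    padded = " " + text.translate(table) + " "
    return f" {word} " in padded


if __name__ == "__main__":
    rows = [
        "this has a plum in it",
        "plum crazy",
        "plume of smoke",
        "a plump turkey",
        "I like plums.",
        "pick a plum, eat a plum",
    ]
    for row in rows:
        print(contains_word(row), row)
```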
Solution (I wasn't able to try it on SQL Server, but it should work):
SELECT winery FROM winemag_p1
WHERE description LIKE '% plum[ .,]%'
In MySQL you can use the REGEXP_LIKE function (works on 8.0.19) (docs):
SELECT winery FROM winemag_p1
-- a regex matches anywhere in the string, so the LIKE wildcard % is not used here
WHERE REGEXP_LIKE(description, ' plum[ .,]');
Tested on SQL Server 2014 with this query:
select * from winemag_p1
where description like '%plum%'
and not description like '%plum[a-zA-Z0-9]%'
and not description like '%[a-zA-Z0-9]plum%'
with table content
a plum
b plumable
c plum.
d plum,blum
e aplum
f bplummer
it outputs
a plum
c plum.
d plum,blum
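For comparison, the same "not touching a letter or digit" rule can be sketched with regex lookarounds in Python. The name `has_standalone_plum` is hypothetical. One difference worth noting: the NOT LIKE version above rejects a row as soon as any occurrence of "plum" touches a letter or digit, so a description like 'plum plumber' (a made-up example) would be dropped, while the lookaround version keeps it because one standalone occurrence is enough.

```python
import re

# "plum" that is neither preceded nor followed by an ASCII letter or digit.
STANDALONE_PLUM = re.compile(r"(?<![A-Za-z0-9])plum(?![A-Za-z0-9])", re.IGNORECASE)


def has_standalone_plum(description: str) -> bool:
    """Return True if at least one occurrence of 'plum' stands on its own."""
    return STANDALONE_PLUM.search(description) is not None


if __name__ == "__main__":
    for row in ["plum", "plumable", "plum.", "plum,blum", "aplum", "bplummer"]:
        print(has_standalone_plum(row), row)
```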

How to Pass list of words into SQL 'LIKE' operator

I am trying to pass a list of words into the SQL LIKE operator.
The query should return the column called Customer Issue (customer_comments in the query below) where it matches any entry in the list below.
my_list =['air con','no cold air','hot air','blowing hot air']
SELECT customer_comments
FROM table
where customer_comments like ('%air con%') #for single search
How do I pass my_list above?
A regular expression can help here. The other solution, using UNNEST, has already been given.
SELECT customer_comments
FROM table
where REGEXP_CONTAINS(lower(customer_comments), r'air con|no cold air|hot air|blowing hot air');
A similar question was answered in the following post, which works for SQL Server:
Combining "LIKE" and "IN" for SQL Server
Basically you'll have to chain a bunch of 'OR' conditions.
Based on the post @Jordi shared, I think the query below can be an option in BigQuery.
query:
SELECT DISTINCT customer_comments
FROM sample,
UNNEST(['air con','no cold air','hot air','blowing hot air']) keyword
WHERE INSTR(customer_comments, keyword) <> 0;
Sample table used for the query above:
CREATE TEMP TABLE sample AS
SELECT * FROM UNNEST(['air conditioner', 'cold air', 'too hot air']) customer_comments;
Consider below
with temp as (
select ['air con','no cold air','hot air','blowing hot air'] my_list
)
select customer_comments
from your_table, (
select string_agg(item, '|') list
from temp t, t.my_list item
)
where regexp_contains(customer_comments, r'' || list)
There are many ways to refactor the above for your specific use case, for example:
select customer_comments
from your_table
where regexp_contains(customer_comments, r'' ||
array_to_string(['air con','no cold air','hot air','blowing hot air'], '|')
)
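The list-to-alternation idea can be sketched in Python as follows. The function names are made up for illustration. Unlike the SQL above, which concatenates the keywords as-is, this sketch escapes each keyword with `re.escape`, since a keyword containing regex metacharacters (for example a dot or a parenthesis) would otherwise change the pattern. It also guards against an empty list, because an empty pattern matches every row.

```python
import re
from typing import Iterable, List


def build_keyword_pattern(keywords: Iterable[str]) -> "re.Pattern[str]":
    """Join the keywords into one case-insensitive alternation pattern."""
    # re.escape keeps metacharacters inside a keyword from being read as regex syntax.
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)


def filter_comments(comments: Iterable[str], keywords: List[str]) -> List[str]:
    """Return the comments that contain at least one keyword."""
    if not keywords:
        # An empty alternation would match everything, so return nothing instead.
        return []
    pattern = build_keyword_pattern(keywords)
    return [c for c in comments if pattern.search(c)]


if __name__ == "__main__":
    my_list = ["air con", "no cold air", "hot air", "blowing hot air"]
    sample = ["air conditioner", "cold air", "too hot air"]
    print(filter_comments(sample, my_list))
```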

Multiple conditions in Regex_Replace

I want to replace special characters from two different words as shown in the image below. From the first word, I want to replace a special character with "I" and from the second word, I want to replace a special character with "U".
My query is below. It works for the first word. Can you please assist?
SELECT
DISTINCT ABC, REGEXP_REPLACE(REGEXP_REPLACE(ABC, r'([^\p{ASCII}]+)', 'I'), r'\&', 'U')
FROM Table
where ABC like '%B??RD%' or ABC like '%M??D%';
Instead of using REGEX, you can use CASE for this scenario:
WITH CTE as (
SELECT 'B��RD'as ABC,
UNION ALL SELECT 'M��D'as ABC)
SELECT
ABC,
CASE ABC
WHEN 'B��RD' THEN 'BIRD'
WHEN 'M��D' THEN 'MUD'
END AS output
FROM CTE

What's the equivalent of Excel's `left(find(), -1)` in BigQuery?

I have names in my dataset and they include parentheses. But, I am trying to clean up the names to exclude those parentheses.
Example: ABC Company (Somewhere, WY)
What I want to turn it into is: ABC Company
I'm using Standard SQL with Google BigQuery.
I've done some research and I know big query has left(), but I do not know the equivalent of find(). My plan was to do something that finds the ( and then gives me everything to the left of -1 characters from the (.
"My plan was to do something that finds the ( and then gives me everything to the left of -1 characters from the (."
Good plan! In BigQuery Standard SQL, the equivalent of LEFT is SUBSTR(value, position[, length]) and the equivalent of FIND is STRPOS(value1, value2).
With this in mind, your query can look like this (exactly as you planned):
#standardSQL
WITH names AS (
SELECT 'ABC Company (Somewhere, WY)' AS name
)
SELECT SUBSTR(name, 1, STRPOS(name, '(') - 1) AS clean_name
FROM names
String functions are usually less expensive than regular expression functions, so if your data follows the pattern in your example, you should go with the version above.
In more generic cases, when the pattern to clean is more dynamic, as in Graham's answer, you should go with the solution in Graham's answer.
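Two edge cases of the SUBSTR + STRPOS version are easy to see in a small Python sketch (the function name `left_of_paren` is hypothetical). First, cutting one character before the parenthesis still keeps the space in front of it, so a right-trim is useful. Second, STRPOS returns 0 when the parenthesis is absent, which makes the computed length -1, so it is worth handling names without parentheses explicitly.

```python
def left_of_paren(name: str) -> str:
    """Return the part of `name` before the first '(', without trailing spaces."""
    # str.find is 0-based and returns -1 when absent
    # (STRPOS is 1-based and returns 0 when absent).
    idx = name.find("(")
    if idx == -1:
        # No parenthesis: keep the name unchanged.
        return name
    # Everything left of the parenthesis, minus the space that preceded it.
    return name[:idx].rstrip()


if __name__ == "__main__":
    for name in ["ABC Company (Somewhere, WY)", "ABC Company"]:
        print(repr(left_of_paren(name)))
```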
Just use REGEXP_REPLACE + TRIM. This will work with all variants (just not nested parentheses):
#standardSQL
WITH
names AS (
SELECT
'ABC Company (Somewhere, WY)' AS name
UNION ALL
SELECT
'(Somewhere, WY) ABC Company' AS name
UNION ALL
SELECT
'ABC (Somewhere, WY) Company' AS name)
SELECT
TRIM(REGEXP_REPLACE(name,r'\(.*?\)',''), ' ') AS cleaned
FROM
names
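A Python approximation of the REGEXP_REPLACE + TRIM query, with hypothetical function names. It shows one detail to be aware of: when the parenthesised part sits in the middle of the name, removing it leaves two consecutive spaces, so an extra whitespace-collapsing step may be wanted.

```python
import re


def strip_parenthesised(name: str) -> str:
    """Remove each '(...)' group (non-nested) and trim spaces at both ends."""
    # Non-greedy match so that each pair of parentheses is removed separately.
    without_groups = re.sub(r"\(.*?\)", "", name)
    return without_groups.strip(" ")


def strip_and_collapse(name: str) -> str:
    """Same as above, but also collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", strip_parenthesised(name)).strip(" ")


if __name__ == "__main__":
    for name in [
        "ABC Company (Somewhere, WY)",
        "(Somewhere, WY) ABC Company",
        "ABC (Somewhere, WY) Company",
    ]:
        print(repr(strip_parenthesised(name)), repr(strip_and_collapse(name)))
```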
Use REGEXP_EXTRACT:
SELECT
RTRIM(REGEXP_EXTRACT(names, r'([^(]*)')) AS new_name
FROM yourTable
The regex used here will greedily consume and match everything up until hitting an opening parenthesis. I used RTRIM to remove any unwanted whitespace picked up by the regex.
Note that this approach is robust with respect to the edge case of an address record not having any term with parentheses. In this case, the above query would just return the entire original value.
I can't test this solution at the moment, but you can combine SUBSTR and INSTR, like this (note the - 1, so that the opening parenthesis itself is not included):
SELECT CASE WHEN INSTR(name, '(') > 0 THEN SUBSTR(name, 1, INSTR(name, '(') - 1) ELSE name END AS name FROM table;

Extract first word from a varchar column and reverse it

I have following data in my table
id nml
-- -----------------
1 Temora sepanil
2 Human Mixtard
3 stlliot vergratob
I need to extract the first word in column nml and get its last 3 characters in reverse order.
That means the output should be like
nml reverse
----------------- -------
Temora sepanil aro
Human Mixtard nam
stlliot vergratob toi
You can use PostgreSQL's string functions to achieve the desired output.
In this case I am using the split_part, right and reverse functions:
select reverse(right(split_part('Temora sepanil',' ',1),3))
output:
aro
So you can write your query in the following format:
select nml
,reverse(right(split_part(nml,' ',1),3)) "Reverse"
from tbl
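The same three steps (first word, last three characters, reverse) can be checked quickly in Python against the sample rows. The function name is made up for illustration, and the sketch assumes words are separated by a single space, as in split_part(nml, ' ', 1).

```python
def last3_of_first_word_reversed(nml: str) -> str:
    """Take the first space-separated word, keep its last 3 characters, reverse them."""
    first_word = nml.split(" ", 1)[0]   # like split_part(nml, ' ', 1)
    last_three = first_word[-3:]        # like right(..., 3)
    return last_three[::-1]             # like reverse(...)


if __name__ == "__main__":
    for nml in ["Temora sepanil", "Human Mixtard", "stlliot vergratob"]:
        print(nml, "->", last3_of_first_word_reversed(nml))
```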
Split nml using regexp_split_to_array(string text, pattern text [, flags text ]); refer to the Postgres docs for more info.
Use reverse(str) (refer to the Postgres docs) to reverse the first word from the previous split.
Use substr(string, from [, count]) (refer to the Postgres docs) to select the first three letters of the reversed text.
Query
SELECT
nml,
-- PostgreSQL arrays are 1-based, and subscripting a function result needs parentheses
substr(reverse((regexp_split_to_array(nml, E'\\s+'))[1]), 1, 3) AS reverse
FROM
MyTable
You can use the SUBSTRING, CHARINDEX, RIGHT and REVERSE functions.
Here's the syntax:
REVERSE(RIGHT(SUBSTRING(nml , 1, CHARINDEX(' ', nml) - 1),3))
sample:
SELECT REVERSE(RIGHT(SUBSTRING(nml , 1, CHARINDEX(' ', nml) - 1),3)) AS 'Reverse'
FROM TableNameHere