How to find strings without different words? - sql

For example, I'm using SQL to filter out all descriptions containing the fruit 'Plum'. Unfortunately, using this code yields all sorts of irrelevant words (e.g. 'Plump', 'Plumeria') while excluding anything with a comma or full stop right after it (e.g. 'plum,' and 'plum.')
SELECT winery FROM winemag_p1
WHERE description LIKE '%plum%' OR
Is there a better way to do this? Thanks. I'm using SQL Server but curious how to make this work for MySQL and PostgreSQL too

Try the following method: use translate* to replace edge-case characters with spaces, pad the source string with a space on each side, and then search for the keyword surrounded by spaces:
with plums as (
select 'this has a plum in it' p union all
select 'plum crazy' union all
select 'plume of smoke' union all
select 'a plump turkey' union all
select 'I like plums.' union all
select 'pick a plum, eat a plum'
)
select *
from plums
where Concat(' ', Translate(p, ',.s', '   '), ' ') like '% plum %'
* TRANSLATE requires SQL Server 2017 or later; on older versions you will need nested replace() calls instead.

Solution (I wasn't able to try on SQL Server, but it should work):
SELECT winery FROM winemag_p1
WHERE description LIKE '% plum[ .,]%'
In MySQL you can use the REGEXP_LIKE function (works on 8.0.19). A regular expression matches anywhere in the string, so the % wildcards are not needed:
SELECT winery FROM winemag_p1
WHERE REGEXP_LIKE(description, ' plum[ .,]');

Tested on SQL Server 2014 with this query:
select * from winemag_p1
where description like '%plum%'
and not description like '%plum[a-zA-Z0-9]%'
and not description like '%[a-zA-Z0-9]plum%'
with table content
a plum
b plumable
c plum.
d plum,blum
e aplum
f bplummer
it outputs
a plum
c plum.
d plum,blum
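For the MySQL and PostgreSQL part of the question, a regex word boundary may be simpler than combining several LIKE patterns. This is a minimal sketch, assuming the winemag_p1 table and description column from the question. It matches 'plum' as a whole word only, so 'plums' is not returned:

```sql
-- PostgreSQL: \m and \M match the start and end of a word; ~* is case-insensitive.
SELECT winery FROM winemag_p1
WHERE description ~* '\mplum\M';

-- MySQL 8.0+: \b is a word boundary in its regex engine.
-- The backslash is doubled because MySQL string literals treat \ as an escape character.
SELECT winery FROM winemag_p1
WHERE REGEXP_LIKE(description, '\\bplum\\b', 'i');
```

To accept the plural as well, the pattern could be extended to 'plums?' between the two boundary markers.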

Related

BigQuery Collation

How can I set a collation order in BigQuery?
I want something like this
SELECT Place
FROM Locations
ORDER BY Place COLLATE "en_CA"
I can't find any documentation other than COLLATE is a reserved word in BigQuery.
BigQuery is sorting the following Strings in [a..zA..Z] order:
E.g.
ant
bee
cat
Apple
Banana
Cantaloupe
Is there a way to ask BigQuery to sort in [aA..zZ] order?
ant
Apple
bee
Banana
cat
Cantaloupe
Below example is for BigQuery Standard SQL
#standardSQL
create temp function collate_order(text string) as ((
select string_agg(chr(1000 * ascii(lower(c)) - ascii(c)), '' order by offset)
from unnest(split(text, '')) c with offset
));
with `project.dataset.Locations` as (
select 'ant' as Place union all
select 'Apple' union all
select 'bee' union all
select 'apple' union all
select 'cat' union all
select 'Banana' union all
select 'Cantaloupe'
)
select Place
from `project.dataset.Locations`
order by collate_order(Place)
with output
Forgot to mention: you can extend this approach to handle Unicode text by replacing the ascii function with the unicode function.
You can try the following query; it sorts the data in [aA..zZ] order:
SELECT Place
FROM Locations
ORDER BY upper(Place)
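Sorting on upper(Place) alone leaves the relative order of values such as 'apple' and 'Apple' undefined. One possible way to make it deterministic is to add the raw column as a tie-breaker. This is only a sketch for BigQuery Standard SQL, and the sample rows in the CTE are made up for illustration:

```sql
WITH Locations AS (
  SELECT 'ant' AS Place UNION ALL
  SELECT 'Apple' UNION ALL
  SELECT 'apple' UNION ALL
  SELECT 'bee' UNION ALL
  SELECT 'Banana'
)
SELECT Place
FROM Locations
ORDER BY
  LOWER(Place),  -- primary key: case-insensitive order
  Place DESC;    -- tie-breaker: uppercase sorts before lowercase by default, so DESC puts 'apple' before 'Apple'
```

Note that this compares whole words case-insensitively, so 'Banana' comes before 'bee'.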

Split string into words using Postgres

I am looking for some help separating scientific names in my data. I want to take only the genus names and group by them, but genus and species are stored together in the same column. I saw that SQL Server has a CHARINDEX function, but PostgreSQL does not. Does a function need to be created for this? If so, how would it look?
I want to change 'Mallotus philippensis' to just 'Mallotus' or to just 'philippensis'
I am currently using Postgres 11, 12.
Use SPLIT_PART:
WITH yourTable AS (
SELECT 'Mallotus philippensis'::text AS genus
)
SELECT
SPLIT_PART(genus, ' ', 1) AS genus,
SPLIT_PART(genus, ' ', 2) AS species
FROM yourTable;
Probably string_to_array will be slightly more efficient than split_part here because string splitting will be done only once for each row.
SELECT
val_arr[1] AS genus,
val_arr[2] AS species
FROM (
SELECT string_to_array(val, ' ') as val_arr
FROM (
VALUES
('aaa bbb'),
('cc dddd'),
('e fffff')
) t (val)
) tt;
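On the CHARINDEX part of the question: PostgreSQL has strpos (and the standard position(... in ...)), so no custom function should be needed. The sketch below shows that, followed by grouping on the genus with split_part. The table alias t, the column name, and the sample rows are assumptions for illustration:

```sql
-- strpos(string, substring) returns the 1-based position, or 0 when not found.
SELECT strpos('Mallotus philippensis', ' ');                                    -- 9
SELECT left('Mallotus philippensis', strpos('Mallotus philippensis', ' ') - 1); -- Mallotus

-- Grouping by genus: split_part can be used directly in GROUP BY.
SELECT
  split_part(name, ' ', 1) AS genus,
  count(*)                 AS species_count
FROM (
  VALUES
    ('Mallotus philippensis'),
    ('Mallotus paniculatus'),
    ('Quercus robur')
) AS t (name)
GROUP BY split_part(name, ' ', 1)
ORDER BY genus;
```

Be careful with the left/strpos form when a value has no space: strpos returns 0, and left(name, -1) then drops the last character instead of returning the whole value. split_part does not have that problem.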

What's the equivalent of Excel's `left(find(), -1)` in BigQuery?

I have names in my dataset and they include parentheses. But, I am trying to clean up the names to exclude those parentheses.
Example: ABC Company (Somewhere, WY)
What I want to turn it into is: ABC Company
I'm using standard SQL with google big query.
I've done some research and I know big query has left(), but I do not know the equivalent of find(). My plan was to do something that finds the ( and then gives me everything to the left of -1 characters from the (.
My plan was to do something that finds the ( and then gives me everything to the left of -1 characters from the (.
Good plan! In BigQuery Standard SQL - equivalent of LEFT is SUBSTR(value, position[, length]) and equivalent of FIND is STRPOS(value1, value2)
With this in mind your query can look like (which is exactly as you planned)
#standardSQL
WITH names AS (
SELECT 'ABC Company (Somewhere, WY)' AS name
)
SELECT SUBSTR(name, 1, STRPOS(name, '(') - 1) AS clean_name
FROM names
Usually, string functions are less expensive than regular expression functions, so if you have pattern as in your example - you should go with above version
But in more generic cases, when pattern to clean is more dynamic like in Graham's answer - you should go with solution in Graham's answer
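One caveat with the SUBSTR + STRPOS version: STRPOS returns 0 when there is no opening parenthesis, which would pass a negative length to SUBSTR, and the result keeps the space before the parenthesis. A guarded variant could look like the sketch below; the second sample row is made up to show the no-parentheses case:

```sql
#standardSQL
WITH names AS (
  SELECT 'ABC Company (Somewhere, WY)' AS name UNION ALL
  SELECT 'No Parentheses Inc'
)
SELECT
  name,
  RTRIM(                                        -- drop the space left before '('
    IF(STRPOS(name, '(') > 0,
       SUBSTR(name, 1, STRPOS(name, '(') - 1),  -- everything left of the first '('
       name)                                    -- no '(' found: keep the value unchanged
  ) AS clean_name
FROM names
```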
Just use REGEXP_REPLACE + TRIM. This will work with all variants (just not nested parentheses):
#standardSQL
WITH
names AS (
SELECT
'ABC Company (Somewhere, WY)' AS name
UNION ALL
SELECT
'(Somewhere, WY) ABC Company' AS name
UNION ALL
SELECT
'ABC (Somewhere, WY) Company' AS name)
SELECT
TRIM(REGEXP_REPLACE(name,r'\(.*?\)',''), ' ') AS cleaned
FROM
names
Use REGEXP_EXTRACT:
SELECT
RTRIM(REGEXP_EXTRACT(names, r'([^(]*)')) AS new_name
FROM yourTable
The regex used here will greedily consume and match everything up until hitting an opening parenthesis. I used RTRIM to remove any unwanted whitespace picked up by the regex.
Note that this approach is robust with respect to the edge case of an address record not having any term with parentheses. In this case, the above query would just return the entire original value.
I can't test this solution at the moment, but you can combine SUBSTR and INSTR. Like this:
SELECT CASE WHEN INSTR(name, '(') > 0 THEN SUBSTR( name, 1, INSTR(name, '(') - 1 ) ELSE name END as name FROM table;

BigQuery return all matches of regular expression

In Big Query, when I do a regular expression search, it only returns the first match/occurrence.
Is there any way to return all matches, concatenated? something like GROUP_CONCAT maybe?
REGEXP_EXTRACT(body, r"(\w+ )")
In Standard SQL, which BigQuery recently started supporting, you can try the following:
SELECT
body,
(SELECT STRING_AGG(word) FROM words.word) AS words
FROM (
SELECT
body, REGEXP_EXTRACT_ALL(body, r'(\w+)') AS word
FROM (
SELECT 'abc xyz qwerty asd' AS body UNION ALL
SELECT 'zxc dfg 345' AS body
)
) words
Don't forget to uncheck Use Legacy SQL checkbox under Show Options
See more details on REGEXP_EXTRACT_ALL and STRING_AGG
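If the goal is only to concatenate the matches, the subquery can probably be avoided by joining the array directly with ARRAY_TO_STRING. A minimal sketch using the same sample rows as above, with the comma separator being an arbitrary choice:

```sql
#standardSQL
SELECT
  body,
  -- REGEXP_EXTRACT_ALL returns an ARRAY<STRING>; ARRAY_TO_STRING joins it with the given delimiter.
  ARRAY_TO_STRING(REGEXP_EXTRACT_ALL(body, r'\w+'), ',') AS words
FROM (
  SELECT 'abc xyz qwerty asd' AS body UNION ALL
  SELECT 'zxc dfg 345' AS body
)
```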
If you are stuck with what is now in BigQuery called legacy SQL - you can try something like below
SELECT
body,
GROUP_CONCAT(SPLIT(body, ' ')) AS words
FROM
(SELECT 'abc xyz qwerty asd' AS body),
(SELECT 'zxc dfg 345' AS body)
I understand, this is not necessarily exactly what you need - but might help
Another approach with BigQuery legacy SQL that more suited to cases where you have to use regex.
For example - assume you need extract only numbers from body
Idea is to nuke anything but numbers from body using REGEXP_REPLACE and then apply above described SPLIT() + GROUP_CONCAT()
SELECT
body,
GROUP_CONCAT(SPLIT(REGEXP_REPLACE(body, r'(\D)+', ':'), ':')) AS words
FROM
(SELECT 'abc 123 xyz 543 qwerty asd' AS body),
(SELECT '987zxc 123 dfg 345' AS body)

Searching Technique in SQL (Like,Contain)

I want to compare and select a field from DB using Like keyword or any other technique.
My query is the following:
SELECT * FROM Test WHERE name LIKE '%xxxxxx_Ramakrishnan_zzzzz%';
but my fields only contain 'Ramakrishnan'
My Input string contain some extra character xxxxxx_Ramakrishnan_zzzzz
I want the SQL query for this. Can any one please help me?
You mean you want it the other way round? Like this?
Select * from Test where 'xxxxxx_Ramakrishnan_zzzzz' LIKE '%' + name + '%';
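The + operator for string concatenation is SQL Server syntax. A variant that should also run on MySQL and PostgreSQL uses CONCAT instead; this is a sketch assuming the Test table and name column from the question:

```sql
-- The input string is on the left of LIKE; the column supplies the pattern.
SELECT *
FROM Test
WHERE 'xxxxxx_Ramakrishnan_zzzzz' LIKE CONCAT('%', name, '%');
```

Keep in mind that any % or _ characters stored in name are treated as wildcards here, so values containing them may match more than expected.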
You can use the MySQL function LOCATE(), like this:
SELECT * FROM Test WHERE LOCATE(name, 'xxxxxx_Ramakrishnan_zzzzz') > 0;
Are the xxxxxx and zzzzz bits always 6 and 5 characters? If so, then this is doable with a bit of string cutting.
with Test (id,name) as (
select 1, 'Ramakrishnan'
union
select 2, 'Coxy'
union
select 3, 'xxxxxx_Ramakrishnan_zzzzz'
)
Select * from Test where name like '%'+SUBSTRING('xxxxxx_Ramakrishnan_zzzzz', 8, CHARINDEX('_',SUBSTRING('xxxxxx_Ramakrishnan_zzzzz',8,100))-1)+'%'
Results in:
id name
1 Ramakrishnan
3 xxxxxx_Ramakrishnan_zzzzz
If they are variable lengths, then it will be a horrible construction of SUBSTRING,CHARINDEX, REVERSE and LEN functions.