BigQuery Collation - google-bigquery

How can I set a collation order in BigQuery?
I want something like this
SELECT Place
FROM Locations
ORDER BY Place COLLATE "en_CA"
I can't find any documentation, other than that COLLATE is a reserved word in BigQuery.
BigQuery sorts strings in [a..zA..Z] order.
E.g.
ant
bee
cat
Apple
Banana
Cantaloupe
Is there a way to ask BigQuery to sort in [aA..zZ] order?
ant
Apple
bee
Banana
cat
Cantaloupe

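The example below is for BigQuery Standard SQL
#standardSQL
-- Map each character to a code point that groups by letter first and puts
-- lowercase before uppercase. split(text, '') yields one element per character;
-- split(text) alone would split on commas instead.
create temp function collate_order(text string) as ((
select string_agg(chr(1000 * ascii(lower(c)) - ascii(c)), '' order by offset)
from unnest(split(text, '')) c with offset
));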
with `project.dataset.Locations` as (
select 'ant' as Place union all
select 'Apple' union all
select 'bee' union all
select 'apple' union all
select 'cat' union all
select 'Banana' union all
select 'Cantaloupe'
)
select Place
from `project.dataset.Locations`
order by collate_order(Place)
The output comes back in [aA..zZ] order.
Forgot to mention: you can extend this approach to handle Unicode text by replacing the ascii function with the unicode function.
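To make the sort key easier to follow, here is a minimal Python sketch of the same per-character mapping. The name collate_key is made up for illustration, and this only mirrors the arithmetic of the SQL function; it is not BigQuery code.

```python
# Hypothetical Python analogue of the collate_order() SQL function.
def collate_key(text):
    # Each character maps to 1000 * code(lowercase) - code(original), so
    # characters group by letter first, and 'a' (96903) sorts before 'A' (96935).
    return [1000 * ord(c.lower()) - ord(c) for c in text]

places = ["ant", "Apple", "bee", "apple", "cat", "Banana", "Cantaloupe"]
ordered = sorted(places, key=collate_key)
print(ordered)
```

The comparison is character by character, so 'bee' lands before 'Banana' because 'b' sorts before 'B'.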

You can try the following query. It compares the values case-insensitively, so the data comes back in [aA..zZ] order:
SELECT Place
FROM Locations
ORDER BY upper(Place)
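For comparison, a rough Python sketch of what ordering by upper(Place) does. It is a whole-string, case-insensitive comparison, so 'Banana' sorts before 'bee', and values that differ only in case tie. The second key element below is an assumption added to make ties deterministic; in SQL that would presumably be a second ORDER BY expression.

```python
places = ["ant", "Apple", "bee", "apple", "cat", "Banana", "Cantaloupe"]

# Primary key: the upper-cased string (like ORDER BY upper(Place)).
# Secondary key: the original string, only to break ties such as Apple/apple.
by_upper = sorted(places, key=lambda s: (s.upper(), s))
print(by_upper)
```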

Related

How to find strings without different words?

For example, I'm using SQL to filter out all descriptions containing the fruit 'Plum'. Unfortunately, using this code yields all sorts of irrelevant words (e.g. 'Plump', 'Plumeria') while excluding anything with a comma or full stop right after it (e.g. 'plum,' and 'plum.')
SELECT winery FROM winemag_p1
WHERE description LIKE '%plum%' OR
Is there a better way to do this? Thanks. I'm using SQL Server but curious how to make this work for MySQL and PostgreSQL too
Try the following method: use translate* to turn the edge-case characters into spaces, concat a space onto each end of the source, and then search for the keyword surrounded by spaces:
with plums as (
select 'this has a plum in it' p union all
select 'plum crazy' union all
select 'plume of smoke' union all
select 'a plump turkey' union all
select 'I like plums.' union all
select 'pick a plum, eat a plum'
)
select *
from plums
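where Concat(' ',Translate(p,',.s','   '), ' ') like '% plum %'
* TRANSLATE needs the second and third arguments to be the same length (three characters, three spaces here) and assumes you're using a recent version of SQL Server (2017 or later); if not, you will need nested replace() calls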
Solution (I wasn't able to try on SQL Server, but it should work):
SELECT winery FROM winemag_p1
WHERE description LIKE '% plum[ .,]%'
In MySQL you can use the REGEXP_LIKE function (works on 8.0.19). Unlike LIKE, a regular expression matches anywhere in the string, so no % wildcards are needed:
SELECT winery FROM winemag_p1
WHERE REGEXP_LIKE(description, ' plum[ .,]');
Tested on SQL Server 2014 with this query:
select * from winemag_p1
where description like '%plum%'
and not description like '%plum[a-zA-Z0-9]%'
and not description like '%[a-zA-Z0-9]plum%'
with table content
a plum
b plumable
c plum.
d plum,blum
e aplum
f bplummer
it outputs
a plum
c plum.
d plum,blum
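The idea shared by these answers (match the word only when it is not glued to other letters) can be sketched with a word-boundary regex. This is Python rather than any of the SQL dialects above, and the optional trailing s is an assumption that mirrors the translate answer's handling of plurals.

```python
import re

# \b is a word boundary; s? also accepts the plural "plums".
PLUM = re.compile(r"\bplums?\b", re.IGNORECASE)

descriptions = [
    "this has a plum in it",
    "plum crazy",
    "plume of smoke",
    "a plump turkey",
    "I like plums.",
    "pick a plum, eat a plum",
]

matches = [d for d in descriptions if PLUM.search(d)]
print(matches)
```

Only 'plume of smoke' and 'a plump turkey' are rejected; punctuation right after the word does not prevent a match.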

Regex capture only symbols

Does anyone know the appropriate regex to match any symbols (such as . / _ etc.)? I'm trying to extract anything that doesn't look like 1-3 complete words. From this input:
Online Chat
http://mailserver.test.com/zjalLNG391Vkfalka0
social
test.com
poc_email_outbound~51-tester-test~2018-04-12
http://mailserver.test.com/u/130931jiojf101901
to grab only the below:
http://mailserver.test.com/zjalLNG391Vkfalka0
test.com
poc_email_outbound~51-tester-test~2018-04-12
http://mailserver.test.com/u/130931jiojf101901
You can use REGEXP_CONTAINS(line, r'[./_]')
See example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'Online Chat' line UNION ALL
SELECT 'http://mailserver.test.com/zjalLNG391Vkfalka0' UNION ALL
SELECT 'social' UNION ALL
SELECT 'test.com' UNION ALL
SELECT 'poc_email_outbound~51-tester-test~2018-04-12' UNION ALL
SELECT 'http://mailserver.test.com/u/130931jiojf101901'
)
SELECT line
FROM `project.dataset.table`
WHERE REGEXP_CONTAINS(line, r'[./_]')
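To match any non-word character instead, you can use REGEXP_CONTAINS(line, r'\W'), which is equivalent to REGEXP_CONTAINS(line, r'[^0-9A-Za-z_]'). Note that this also matches whitespace, so 'Online Chat' would be returned as well, and that _ counts as a word character.
You can extend the latter character class with any further characters that you want to exclude from the criteria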
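To see how the two patterns differ on the sample data, here is a small Python sketch. Python's re module is used as a stand-in for BigQuery's regex engine, with re.ASCII so that \W means [^0-9A-Za-z_] as described above.

```python
import re

lines = [
    "Online Chat",
    "http://mailserver.test.com/zjalLNG391Vkfalka0",
    "social",
    "test.com",
    "poc_email_outbound~51-tester-test~2018-04-12",
    "http://mailserver.test.com/u/130931jiojf101901",
]

# Explicit list of symbols: only lines containing . / or _ are kept.
with_symbols = [s for s in lines if re.search(r"[./_]", s)]

# Any non-word character: the space in "Online Chat" also counts.
with_non_word = [s for s in lines if re.search(r"\W", s, re.ASCII)]

print(with_symbols)
print(with_non_word)
```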

How to remove only letters from a string in BigQuery?

So I'm working with BigQuery SQL right now, trying to figure out how to remove the letters from a string but keep the digits. For example:
XXXX123456
AAAA123456789
XYZR12345678
ABCD1234567
1111
2222
Some values have the same number of letters in front of the digits, and others are plain numbers with no letters. I want the end result to look like:
123456
123456789
12345678
1234567
1111
2222
I tried using PATINDEX, but BigQuery doesn't support that function. I've also tried LEFT, but it strips characters regardless of their type, and I only want to remove letters, never digits. Any help would be much appreciated!
-Maykid
You can use regexp_replace():
select regexp_replace(str, '[^0-9]', '')
The example below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'XXXX123456' str UNION ALL
SELECT 'AAAA123456789' UNION ALL
SELECT 'XYZR12345678' UNION ALL
SELECT 'ABCD1234567' UNION ALL
SELECT '1111' UNION ALL
SELECT '2222'
)
SELECT str, REGEXP_REPLACE(str, r'[a-zA-Z]', '') str_adjusted
FROM `project.dataset.table`
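The two answers use different character classes: one removes everything that is not a digit, the other removes only ASCII letters. A quick Python sketch (again using re as a stand-in for REGEXP_REPLACE) shows they agree on the sample data but differ once other characters appear. The value 'AB-12' is a made-up extra case.

```python
import re

values = ["XXXX123456", "AAAA123456789", "XYZR12345678", "ABCD1234567", "1111", "2222"]

keep_digits = [re.sub(r"[^0-9]", "", v) for v in values]       # drop all non-digits
strip_letters = [re.sub(r"[a-zA-Z]", "", v) for v in values]   # drop only letters

print(keep_digits)
print(strip_letters)

# Hypothetical value with punctuation: the two patterns now disagree.
print(re.sub(r"[^0-9]", "", "AB-12"))     # digits only
print(re.sub(r"[a-zA-Z]", "", "AB-12"))   # the hyphen survives
```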

Conditional regexp_replace Oracle / PLSQL

I'm trying to do a conditional replace within one regexp_replace statement.
For example, if I have the string, 'Dog Cat Donkey', I would like to be able to replace 'Dog' with 'BigDog', 'Cat' with 'SmallCat' and 'Donkey' with 'MediumDonkey' to get the following:
'BigDog SmallCat MediumDonkey'
I can do it where all are prefixed with the word Big but can't seem to make it replace conditionally.
I currently have this
select regexp_replace('Dog Cat Donkey', '(Cat)|(Dog)|(Donkey)', ' Big\1\2\3')
from dual
but of course this only returns 'BigDog BigCat BigDonkey'.
I'm aware this isn't the best way of doing this but is it possible?
Have you considered just doing multiple replace()s?
select replace(replace(replace(str, 'Dog', 'BigDog'), 'Cat', 'SmallCat'), 'Donkey', 'MediumDonkey')
I get that regexp_replace() is really powerful. And it might be able to do this. But I'm not sure that's a better solution in terms of expressing what you are doing.
Query -
select listagg(final_str,' ') within group (order by sort_str) as output from (
SELECT
CASE LST
WHEN 'Dog' THEN 'BigDog'
WHEN 'Cat' THEN 'SmallCat'
WHEN 'Donkey' THEN 'MediumDonkey'
END AS final_str,
CASE LST
WHEN 'Dog' THEN 1
WHEN 'Cat' THEN 2
WHEN 'Donkey' THEN 3
END AS sort_str
from (
SELECT
trim(REGEXP_SUBSTR('Dog Cat Donkey', '(\S*)(\s*)', 1, LEVEL)) AS LST
FROM
DUAL
CONNECT BY
REGEXP_SUBSTR('Dog Cat Donkey', '(\S*)(\s*)', 1, LEVEL) IS NOT NULL
));
Output -
BigDog SmallCat MediumDonkey
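The per-word mapping that the CASE expressions encode can be sketched outside the database as a lookup table plus a single regex pass. This is Python, not Oracle: re.sub accepts a callback, which as far as I know regexp_replace does not, so treat it only as an illustration of the logic. PREFIXES and add_prefixes are made-up names.

```python
import re

# Hypothetical lookup: which prefix each word should receive.
PREFIXES = {"Dog": "Big", "Cat": "Small", "Donkey": "Medium"}

def add_prefixes(text):
    # Build \b(Dog|Cat|Donkey)\b from the table, then pick the prefix per match.
    pattern = r"\b(" + "|".join(map(re.escape, PREFIXES)) + r")\b"
    return re.sub(pattern, lambda m: PREFIXES[m.group(1)] + m.group(1), text)

result = add_prefixes("Dog Cat Donkey")
print(result)
```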
For a conditional replacement via REGEXP_REPLACE:
Currently you can only do this by repeating the call for each different replacement string.
But you can still use | (OR) within a single capture group to change more than one word with the same replacement string.
And as Gordon Linoff pointed out, you don't really need REGEXP_REPLACE when a normal REPLACE is sufficient to match a single word.
select regexp_replace(
regexp_replace(
regexp_replace( str,
'(Dog|Snoopy)', 'Big\1')
,'(Cat|Feline)', 'Small\1')
,'(Donkey|Ass)', 'Medium\1')
from (select 'You Ass, that is not a Dog, but a Cat on a Donkey.' as str from dual);
Returns:
You MediumAss, that is not a BigDog, but a SmallCat on a MediumDonkey.
Do note, however, that when using the pipe in a regex, the order of the alternatives matters.
So if some words start the same way, it is better to put them in order of descending length.
Example:
select
regexp_replace(str, '(foo|foobar)', '[\1]') as foo_foobar,
regexp_replace(str, '(foobar|foo)', '[\1]') as foobar_foo
from (select 'foo foobar' as str from dual);
Returns:
FOO_FOOBAR       FOOBAR_FOO
---------------  ---------------
[foo] [foo]bar   [foo] [foobar]
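The same ordering effect can be reproduced in Python, whose re module also tries alternatives left to right and takes the first one that matches. This is a sketch for illustration, not Oracle code.

```python
import re

text = "foo foobar"

# Shorter alternative first: "foo" wins even inside "foobar".
foo_first = re.sub(r"(foo|foobar)", r"[\1]", text)

# Longer alternative first: "foobar" is matched as a whole.
foobar_first = re.sub(r"(foobar|foo)", r"[\1]", text)

print(foo_first)
print(foobar_first)
```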

Find the accent data in table records

In a table, I have a column in which a few records contain accented characters. I want a query that finds those records.
If we have records like the ones below:
2ème édition
Natália
sravanth
the query should pick these records:
2ème édition
Natália
You can use the REGEXP_LIKE function along with a list of all the accented characters you're interested in:
with t1(data) as (
select '2ème édition' from dual union all
select 'Natália' from dual union all
select 'sravanth' from dual
)
select * from t1 where regexp_like(data,'[àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]');
DATA
--------------
2ème édition
Natália
The ASCIISTR function would be another way to find accented characters
ASCIISTR takes as its argument a string, or an expression that
resolves to a string, in any character set and returns an ASCII
version of the string in the database character set. Non-ASCII
characters are converted to the form \xxxx, where xxxx represents a
UTF-16 code unit.
So you can do something like
SELECT my_field FROM my_table
WHERE NOT my_field = ASCIISTR(my_field)
Or to re-use the demo from the accepted answer:
with t1(data) as (
select '2ème édition' from dual union all
select 'Natália' from dual union all
select 'sravanth' from dual
)
select * from t1 where data != asciistr(data)
which would output the 2 rows with accents.
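The comparison against ASCIISTR amounts to asking whether the value contains any non-ASCII character. A rough Python equivalent of that test, assuming Python 3.7+ for str.isascii, would be:

```python
rows = ["2ème édition", "Natália", "sravanth"]

# Keep every row that contains at least one non-ASCII character.
non_ascii_rows = [r for r in rows if not r.isascii()]
print(non_ascii_rows)
```

Like the ASCIISTR approach, this flags any non-ASCII character, not just accented letters.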
with t1(data) as (
select '2ème édition' from dual union all
select 'Natália' from dual union all
select 'sravanth' from dual
)
select * from t1 where regexp_like(asciistr(data), '\\[[:xdigit:]]{4}');
DATA
--------------
2ème édition
Natália
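This is way harder than it seems on the surface, as there is more than one way to encode an accented character (for example, a single precomposed character, or a base letter followed by a combining mark). What I do is keep a mirror column I call clean and scrub out all the accents on load.
See the question I asked some time ago about normalized strings.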
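To illustrate the "more than one way" problem and the scrub-on-load idea, here is a minimal Python sketch using the standard unicodedata module. The function name strip_accents is made up, and this is only one possible way to fill such a clean column, not the method from the linked question.

```python
import unicodedata

def strip_accents(text):
    # NFD splits accented letters into base letter + combining marks,
    # then the combining marks are dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

precomposed = "Nat\u00e1lia"   # á stored as a single code point
combining = "Nata\u0301lia"    # a followed by a combining acute accent

print(precomposed == combining)                               # the raw strings differ
print(unicodedata.normalize("NFC", combining) == precomposed)
print(strip_accents(precomposed), strip_accents(combining))
```

Letters such as ø or ß have no combining-mark decomposition, so they pass through this function unchanged.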