Regex capture only symbols - google-bigquery

Does anyone know the appropriate regex to match any symbols (such as . / _ etc.)? I'm trying to extract anything that doesn't look like 1-3 complete words. Given the following values:
Online Chat
http://mailserver.test.com/zjalLNG391Vkfalka0
social
test.com
poc_email_outbound~51-tester-test~2018-04-12
http://mailserver.test.com/u/130931jiojf101901
I want to grab only the below:
http://mailserver.test.com/zjalLNG391Vkfalka0
test.com
poc_email_outbound~51-tester-test~2018-04-12
http://mailserver.test.com/u/130931jiojf101901

You can use REGEXP_CONTAINS(line, r'[./_]').
See the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'Online Chat' line UNION ALL
SELECT 'http://mailserver.test.com/zjalLNG391Vkfalka0' UNION ALL
SELECT 'social' UNION ALL
SELECT 'test.com' UNION ALL
SELECT 'poc_email_outbound~51-tester-test~2018-04-12' UNION ALL
SELECT 'http://mailserver.test.com/u/130931jiojf101901'
)
SELECT line
FROM `project.dataset.table`
WHERE REGEXP_CONTAINS(line, r'[./_]')
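To match any non-word character instead, you can use REGEXP_CONTAINS(line, r'\W'), which is equivalent to REGEXP_CONTAINS(line, r'[^0-9A-Za-z_]'). Note that this also matches spaces (so 'Online Chat' would be returned) and does not match the underscore.
You can extend the latter with more characters that you want to exclude from the criteria, for example a space: r'[^0-9A-Za-z_ ]'.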
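To see how the two character classes from the answer above differ on the sample data, here is a minimal Python sketch. It uses Python's re module rather than BigQuery, so it only illustrates the regex logic; the variable names are made up for the illustration, and re.ASCII is passed so that \W means [^0-9A-Za-z_] as in the answer.

```python
import re

lines = [
    "Online Chat",
    "http://mailserver.test.com/zjalLNG391Vkfalka0",
    "social",
    "test.com",
    "poc_email_outbound~51-tester-test~2018-04-12",
    "http://mailserver.test.com/u/130931jiojf101901",
]

# Same character class as the answer: a dot, a slash, or an underscore.
symbols = re.compile(r"[./_]")
# With re.ASCII, \W is the same as [^0-9A-Za-z_].
non_word = re.compile(r"\W", re.ASCII)

# search() looks for a match anywhere in the string, like REGEXP_CONTAINS.
with_symbols = [s for s in lines if symbols.search(s)]
with_non_word = [s for s in lines if non_word.search(s)]

print(with_symbols)   # the four lines the question asks for
print(with_non_word)  # also includes "Online Chat" because of the space
```

The explicit class [./_] returns exactly the four requested lines, while \W additionally keeps 'Online Chat', which is why the negated class may need a space added to it.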

Related

How to remove all stop words and single characters in Bigquery String column

I have a column in BigQuery table and I want to do some natural language pre-processing on it. Hence, I want to retain only characters from a-z and ignore others. I also want to ignore the words in the string that are single characters.
How can I best do this in BigQuery?
Sample input -
with data as (
select "efficacy!! and/or lasting affects whether community protected" as ip union all
select "n/a" as ip union all
select "this questions is un-clear information" union all
select "I m 84-years old!!!" union all select "none"
)
select * from data
I have certain stop words like a, am, the, none, na, etc. which I want to remove from the text. I also don't want to retain single characters in the string.
Expected output -
efficacy and or lasting affects whether community protected
''
this questions un clear information
years old
''
The 2nd and 5th data points are blank because they contain only stop words.
Try regexp_extract_all to extract all words and then filter them:
with data as (
select "efficacy!! and/or lasting affects whether community protected" as ip union all
select "n/a" as ip union all
select "this questions is un-clear information" union all
select "I m 84-years old!!!" union all
select "none"
)
select STRING_AGG(word, " " ORDER BY offset)
from (
select ip, word, offset
from data, unnest(regexp_extract_all(ip, "[a-z]{2,}")) as word with offset
where word not in ("am", "the", "none", "na", "is")
)
group by ip
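The extract-then-filter idea above can be sketched in Python as follows. This is an illustration of the logic only, not BigQuery code; the names STOP_WORDS and clean are invented for the example. One difference worth noting: in the SQL above, a row whose words are all filtered out (such as "n/a" or "none") produces no output row at all, because the comma join with an empty UNNEST and the WHERE clause leave nothing to group, whereas this sketch returns an empty string for such inputs.

```python
import re

# Stop words taken from the answer's NOT IN list.
STOP_WORDS = {"am", "the", "none", "na", "is"}

def clean(text):
    """Keep lowercase words of two or more letters that are not stop words."""
    # Same pattern as regexp_extract_all(ip, "[a-z]{2,}").
    words = re.findall(r"[a-z]{2,}", text)
    return " ".join(w for w in words if w not in STOP_WORDS)

samples = [
    "efficacy!! and/or lasting affects whether community protected",
    "n/a",
    "this questions is un-clear information",
    "I m 84-years old!!!",
    "none",
]

for s in samples:
    print(repr(clean(s)))
```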
Consider the alternative below:
with data as (
select "efficacy!! and/or lasting affects whether community protected" as ip union all
select "n/a" as ip union all
select "this questions is un-clear information" union all
select "I m 84-years old!!!" union all
select "none"
), stop_words as (
select 'am|the|none|na|is' list
), pattern_to_remove as (
select r'a-zA-Z' remove
)
select trim(regexp_replace(regexp_replace(regexp_replace(
ip, r'[^' || remove || r']', ' '), r'(?:\b)(' || list || r'|[' || remove || r'])(?:\b)', ''), r'[ ]+', ' '
)) as ip
from data, stop_words, pattern_to_remove
This looks a little overengineered, but it has the benefit of being a generic query: the stop words and the characters to keep are controlled within their respective CTEs.
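For comparison, the same three-step replace chain can be sketched in Python. The function name clean and its parameters are hypothetical; the patterns are assembled the same way the query above concatenates them, with the non-capturing boundary groups written simply as \b.

```python
import re

STOP_WORDS = ["am", "the", "none", "na", "is"]  # same list as the stop_words CTE
KEEP = "a-zA-Z"                                  # same class as the pattern_to_remove CTE

def clean(text, stop_words=STOP_WORDS, keep=KEEP):
    # Step 1: replace every character outside the kept class with a space.
    text = re.sub(f"[^{keep}]", " ", text)
    # Step 2: remove whole-word stop words and single kept characters.
    pattern = r"\b(?:" + "|".join(stop_words) + f"|[{keep}])" + r"\b"
    text = re.sub(pattern, "", text)
    # Step 3: collapse runs of spaces and trim the ends.
    return re.sub(r" +", " ", text).strip()

print(clean("this questions is un-clear information"))
print(repr(clean("n/a")))
```

Because the stop words are wrapped in word boundaries, "is" is removed as a word but left alone inside "this".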

BigQuery Collation

How can I set a collation order in BigQuery?
I want something like this
SELECT Place
FROM Locations
ORDER BY Place COLLATE "en_CA"
I can't find any documentation other than COLLATE is a reserved word in BigQuery.
BigQuery is sorting the following Strings in [a..zA..Z] order:
E.g.
ant
bee
cat
Apple
Banana
Cantaloupe
Is there a way to ask BigQuery to sort in [aA..zZ] order?
ant
Apple
bee
Banana
cat
Cantaloupe
The example below is for BigQuery Standard SQL:
#standardSQL
create temp function collate_order(text string) as ((
select string_agg(chr(1000 * ascii(lower(c)) - ascii(c)), '' order by offset)
-- split(text, '') yields single characters; split(text) alone would split on commas
from unnest(split(text, '')) c with offset
));
with `project.dataset.Locations` as (
select 'ant' as Place union all
select 'Apple' union all
select 'bee' union all
select 'apple' union all
select 'cat' union all
select 'Banana' union all
select 'Cantaloupe'
)
select Place
from `project.dataset.Locations`
order by collate_order(Place)
with output
Forgot to mention: you can extend this approach to handle Unicode text by replacing the ascii function with the unicode function.
You can also try the following query, which sorts the data case-insensitively:
SELECT Place
FROM Locations
ORDER BY upper(Place)
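The two answers above do not order the sample the same way, which a short Python sketch can make visible. It assumes the per-character key is built from every character of the string, and it is only an illustration: collate_key is a made-up name and Python's sorted stands in for ORDER BY.

```python
places = ["ant", "Apple", "bee", "apple", "cat", "Banana", "Cantaloupe"]

def collate_key(text):
    # Per character: 1000 * code(lowercase) - code(original).
    # Letters are grouped by their lowercase form, and within a group
    # the lowercase letter gets the smaller value, so 'a' < 'A' < 'b' < 'B'.
    return "".join(chr(1000 * ord(c.lower()) - ord(c)) for c in text)

by_key = sorted(places, key=collate_key)
by_upper = sorted(places, key=str.upper)

print(by_key)
print(by_upper)
```

The per-character key yields the [aA..zZ] order from the question (bee before Banana, cat before Cantaloupe), while sorting on the uppercased value is a plain case-insensitive sort that puts Banana before bee. With the uppercased key, apple and Apple compare equal, so their relative order in SQL is not guaranteed.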

Regex - SQL - Query for finding all words that contain at-least 3 capital letters (Does not have to be in order)

I basically want to catch all of the words that contain at least 3 capital letters anywhere in the word.
Example words that I am trying to catch:
sksDDKDeS4Ataow,
dS19DsA2NTbpctK
My bad regex:
regexp_like(word, '[A-Z]{1,4}?+[a-z]{1,16}+[A-Z]{1,4}?+[a-z]{1,16}+[A-Z]{1,4}?')
Try this one. Each sample word says whether it should match:
WITH
words(word) AS (
SELECT 'noMatch'
UNION ALL SELECT 'onlYtwoNomatch'
UNION ALL SELECT 'thrEECapsmatch'
UNION ALL SELECT 'ThReeCapsmatch'
UNION ALL SELECT 'FourMatcHToo'
)
SELECT
*
FROM words
WHERE REGEXP_LIKE(word,'([A-Z]\w*){3}')
;
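The pattern can be checked outside the database with a small Python sketch. This uses Python's re module as a stand-in for REGEXP_LIKE (both look for a match anywhere in the string); the list and variable names are just for illustration.

```python
import re

# A capital letter followed by any word characters, repeated three times.
three_caps = re.compile(r"([A-Z]\w*){3}")

words = [
    "noMatch",
    "onlYtwoNomatch",
    "thrEECapsmatch",
    "ThReeCapsmatch",
    "FourMatcHToo",
    "sksDDKDeS4Ataow",
    "dS19DsA2NTbpctK",
]

matches = [w for w in words if three_caps.search(w)]
print(matches)
```

The \w* between capitals lets lowercase letters, digits, and underscores sit between them, so the capitals do not need to be adjacent.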

PLSQL - order by string with REGEX

I'm trying to sort the result set of a query where the column is VARCHAR2.
I've tried using just:
ORDER BY
UPPER(SERVER_NAME) ASC
But I get inconsistent results, for example:
120157
777555
AKO
a20064
Elilikes
kagan
1200165_DAVID
As you can see, 1200165_DAVID appears last. In addition, I tried using a regular expression like so:
ORDER BY
(CASE WHEN REGEXP_LIKE(UPPER(SERVER_NAME), '^[0-9]+$') THEN 1 ELSE 2 END) ASC,
UPPER(SERVER_NAME) ASC
But I get the same results. I would like to get the following ordering, if possible:
120157
1200165_DAVID
777555
a20064
AKO
Elilikes
kagan
Please advise.
Three things.
First: Why do you want 1200165_DAVID to appear AFTER 120157? It should appear before it, if you order alphabetically.
Second: Running your query on your test data, I get the correct result. So I am inclined to believe either your query is different from what you reported, or there is some other error somewhere.
Third: You may have who-knows-what characters in your data. Select str and dump(str) side by side (or whatever the name of your expression is; I like to use str in my test data) to see what characters are in each string. Look especially at those that seem to be sorted "out of order".
with
inputs ( str ) as (
select '120157' from dual union all
select '777555' from dual union all
select 'AKO' from dual union all
select 'a20064' from dual union all
select 'Elilikes' from dual union all
select 'kagan' from dual union all
select '1200165_DAVID' from dual
)
select str from inputs
order by upper(str);
STR
-------------
1200165_DAVID
120157
777555
a20064
AKO
Elilikes
kagan
7 rows selected.
This is too long for a comment.
Your data would appear to not be all characters that you recognize. In particular, the first character is suspicious.
I would suggest that you run a query like this:
select ASCII(SUBSTR(server_name, 1, 1)) as first_char_ascii,
'|' || SUBSTR(server_name, 1, 1) || '|' as first_char,
COUNT(*), min(server_name), max(server_name)
from t
group by SUBSTR(server_name, 1, 1)
order by count(*) asc;
Then you will see what characters are actually at the beginning of the string. My guess is you will find at least one interesting character. You will then need to modify the data (or the query) to handle that.
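As an illustration of why an invisible leading character would push a row to the end, here is a Python sketch of the same first-character inspection. The data is hypothetical: it assumes, purely for the example, that 1200165_DAVID starts with a zero-width space (U+200B); the real data may contain a different character or none at all.

```python
from collections import Counter

# Hypothetical data: the last value starts with an invisible zero-width space.
server_names = [
    "120157", "777555", "AKO", "a20064", "Elilikes", "kagan",
    "\u200b1200165_DAVID",
]

# Sorting on the uppercased value, as in ORDER BY UPPER(SERVER_NAME).
ordered = sorted(server_names, key=str.upper)

# Count values per first character and show its code point,
# similar to grouping by SUBSTR(server_name, 1, 1) in the query above.
first_chars = Counter(name[0] for name in server_names)
report = sorted((count, ord(ch), ch) for ch, count in first_chars.items())

# Values whose first character is neither a letter nor a digit.
suspicious = [name for name in server_names if not name[0].isalnum()]

for count, code, ch in report:
    print(count, code, repr(ch))
print([ascii(name) for name in ordered])
```

The value that looks like it starts with "1" sorts after "kagan" because its real first character has a much higher code point than any letter or digit.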

SQL, ORACLE - trim right string (remove all parameters from URL)

I query a column with URLs. Those URLs are from different origins and have different formats. Some of them have parameters. I wish to query this column and right trim the URLs from the first parameter symbol.
Example URLs:
URLs
http://www.domain1.com/path/page?parameters1&parameters2
https://www.domain2.com/path/page?parameters1&parameters2/somemorestufftoscrape
domain3.com/path/page?parameters1&parameters2
http://www.domain4.com/path/page&parameters1?parameters2
https://www.domain5.com/path/noparametershere.html
domain6.com/path/page=?parameters1&parameters2
I want to trim everything to the right of the first ?, & or = (a list of characters that represent parameters in my case).
Desired Output:
TrimmedURLs
http://www.domain1.com/path/page
https://www.domain2.com/path/page
domain3.com/path/page
http://www.domain4.com/path/page
https://www.domain5.com/path
domain6.com/path/page
I've tried to use RTRIM as follows:
select
URLs,
rtrim(URLs, '?=&') as TrimmedURLs
from
MyTable;
The query returns but URLs column is equal to TrimmedURLs (am I doing something wrong?).
I've tried to use regexp_substr, but in the cases where there are multiple parameter characters it trims from the last one and not the first one (see first note in page).
What is the query for the desired result?
Why does RTRIM not work for me?
Server is Oracle 11g
URLs Type is VARCHAR2(1024)
Thanks!
REGEXP_SUBSTR() sounds like the thing to use here:
with sample_data as (select 'http://www.domain1.com/path/page?parameters1&parameters2' url from dual union all
select 'https://www.domain2.com/path/page?parameters1&parameters2/somemorestufftoscrape' url from dual union all
select 'domain3.com/path/page?parameters1&parameters2' url from dual union all
select 'http://www.domain4.com/path/page&parameters1?parameters2' url from dual union all
select 'https://www.domain5.com/path/noparametershere.html' url from dual union all
select 'domain6.com/path/page=?parameters1&parameters2' url from dual)
select url,
regexp_substr(url, '[^?&=]+', 1, 1) main_url
from sample_data;
URL MAIN_URL
------------------------------------------------------------------------------- ------------------------------------------------------------
http://www.domain1.com/path/page?parameters1&parameters2 http://www.domain1.com/path/page
https://www.domain2.com/path/page?parameters1&parameters2/somemorestufftoscrape https://www.domain2.com/path/page
domain3.com/path/page?parameters1&parameters2 domain3.com/path/page
http://www.domain4.com/path/page&parameters1?parameters2 http://www.domain4.com/path/page
https://www.domain5.com/path/noparametershere.html https://www.domain5.com/path/noparametershere.html
domain6.com/path/page=?parameters1&parameters2 domain6.com/path/page
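On the RTRIM question: RTRIM only removes characters from the listed set while they sit at the very end of the string, so a URL that does not end in ?, = or & comes back unchanged. Python's str.rstrip behaves the same way, which makes for a quick side-by-side sketch with the regex approach. The helper name trim_params is made up for the illustration.

```python
import re

urls = [
    "http://www.domain1.com/path/page?parameters1&parameters2",
    "https://www.domain2.com/path/page?parameters1&parameters2/somemorestufftoscrape",
    "domain3.com/path/page?parameters1&parameters2",
    "http://www.domain4.com/path/page&parameters1?parameters2",
    "https://www.domain5.com/path/noparametershere.html",
    "domain6.com/path/page=?parameters1&parameters2",
]

def trim_params(url):
    # First run of characters that are not ?, & or =,
    # like regexp_substr(url, '[^?&=]+', 1, 1).
    m = re.search(r"[^?&=]+", url)
    return m.group(0) if m else None

# rstrip only strips trailing characters from the set, so nothing changes here.
unchanged = [u.rstrip("?=&") == u for u in urls]
trimmed = [trim_params(u) for u in urls]

print(unchanged)
for t in trimmed:
    print(t)
```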
If you don't like regexp, you can also use a combination of substr and instr functions:
with sample_data as (select 'http://www.domain1.com/path/page?parameters1&parameters2' url from dual union all
select 'https://www.domain2.com/path/page?parameters1&parameters2/somemorestufftoscrape' url from dual union all
select 'domain3.com/path/page?parameters1&parameters2' url from dual union all
select 'http://www.domain4.com/path/page&parameters1?parameters2' url from dual union all
select 'https://www.domain5.com/path/noparametershere.html' url from dual union all
select 'domain6.com/path/page=?parameters1&parameters2' url from dual)
select
url,
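-- map & and = to ? so a single instr call finds the first parameter symbol;
-- the appended '?' keeps URLs without parameters intact instead of returning NULL
substr(url, 1, instr(translate(url, '&=', '??') || '?', '?') - 1) main_url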
from
sample_data