Finding characters to right of second period - Bigquery - google-bigquery

I have a BQ table that looks like this
| website |
| -------- |
| xyz.com |
| abc.xyz.com |
| 123.abc.xyz.com |
| 098.com |
I want to clean up this table so that I only get the domain. In my ideal world I want to execute the following steps.
For each row - Count number of '.'
If there are more than 1 '.', THEN extract everything to the right of the second '.' from the right. So 'abc.xyz.com' gets extracted as 'xyz.com'
If there is just 1 '.', THEN do nothing and give me the input as output. So 'xyz.com' gets output as 'xyz.com'

Use the approach below:
select website, net.reg_domain(website) as result
from your_table
Applied to the sample data in your question, result is xyz.com for the first three rows and 098.com for the last.
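For comparison, the literal steps listed in the question (count the dots, then keep everything after the second dot from the right) can be sketched outside BigQuery. This is only an illustration in Python, not BigQuery code; the function name `last_two_labels` is hypothetical. Unlike `net.reg_domain`, it knows nothing about multi-part suffixes such as `co.uk`.

```python
def last_two_labels(website: str) -> str:
    """Keep everything after the second '.' counted from the right."""
    # Step 1: count the dots in the input.
    if website.count(".") <= 1:
        # Zero or one dot: return the input unchanged.
        return website
    # More than one dot: keep only the last two dot-separated labels.
    return ".".join(website.split(".")[-2:])


for site in ["xyz.com", "abc.xyz.com", "123.abc.xyz.com", "098.com"]:
    print(site, "->", last_two_labels(site))
```

Note that a host like `shop.example.co.uk` would come back as `co.uk` with this naive rule, which is one reason to prefer a function that is aware of public suffixes.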

A simpler approach to extract only the domain using regex:
with sample as (
select 'xyz.com' as website
union all
select 'abc.xyz.com' as website
union all
select '123.abc.xyz.com' as website
union all
select '098.com' as website
)
select
regexp_extract(website, r"[\w]+\.[\w]+$") as website
from sample;
Output:
xyz.com
xyz.com
xyz.com
098.com
Explanation:
[\w]+ will match any sequence of one or more word characters (letters, digits or underscore)
\. will match a single literal dot
[\w]+$ will match one or more word characters, anchored to the end of the string
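To see how the pattern behaves on edge cases, here is a small sketch that runs the same regular expression through Python's `re` module. It assumes Python's engine treats this pattern the same way BigQuery's does for plain ASCII input; the helper name `extract_domain` is made up for the illustration.

```python
import re

# Same pattern as in the regexp_extract call above.
DOMAIN_RE = re.compile(r"[\w]+\.[\w]+$")


def extract_domain(website):
    """Return the trailing 'label.label' part, or None when nothing matches."""
    match = DOMAIN_RE.search(website)
    return match.group(0) if match else None


for site in ["xyz.com", "abc.xyz.com", "123.abc.xyz.com", "098.com"]:
    print(site, "->", extract_domain(site))

# \w does not include '-', so a hyphenated name is cut at the hyphen.
print(extract_domain("my-site.com"))  # site.com
```

The hyphen case is worth knowing about: if the table can contain names like `my-site.com`, a character class such as `[\w-]+` would be needed instead.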

Related

extract json in column

I have a table
id | status | outgoing
-------------------------
1 | paid | {"a945248027_14454878":"processing","old.a945248027_14454878":"cancelled"}
2 | pending| {"069e5248cf_45299995":"processing"}
I am trying to extract the values after each underscore in the outgoing column e.g from a945248027_14454878 I want 14454878
Because the json data is not standardised I can't seem to figure it out.
You may extract the part of each json key after the underscore using the regexp version of substring.
select id, status, outgoing,
substring(key from '_([^_]+)$') as key
from the_table, lateral jsonb_object_keys(outgoing) as j(key);
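The same idea, one output value per json key with the text after the last underscore captured by a regex group, can be checked quickly in Python. This is a sketch under the assumption that the column holds a flat json object; `key_suffixes` is a hypothetical helper name.

```python
import json
import re


def key_suffixes(outgoing_json):
    """For every key in the object, return the text after its last underscore."""
    suffixes = []
    for key in json.loads(outgoing_json):
        # Same pattern as in the query: capture what follows the final '_'.
        match = re.search(r"_([^_]+)$", key)
        suffixes.append(match.group(1) if match else None)
    return suffixes


row1 = '{"a945248027_14454878":"processing","old.a945248027_14454878":"cancelled"}'
row2 = '{"069e5248cf_45299995":"processing"}'
print(key_suffixes(row1))
print(key_suffixes(row2))
```

Keys without an underscore yield `None` here, which mirrors how a non-matching regex substring produces a null rather than an error.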

Return only ALL CAPS strings in BigQuery

Pretty simple question, specific to BigQuery. I'm sure there's a command I'm missing. I'm used to using "collate" in another query which doesn't work here.
| email |
| -------- |
| eric#email.com |
| JOHN#EMAIL.COM |
| STACY#EMAIL.COM |
| tanya#email.com |
Desired return:
JOHN#EMAIL.COM,STACY#EMAIL.COM
Consider below
select *
from your_table
where upper(email) = email
If applied to the sample data in your question, this returns only the rows JOHN#EMAIL.COM and STACY#EMAIL.COM.
In case you want the output as a comma separated list - use below
select string_agg(email) emails
from your_table
where upper(email) = email
which returns a single row holding the two all-caps addresses separated by a comma.
You can use the CTE below (the exact sample data from your question) for testing purposes
with your_table as (
select 'eric#email.com' email union all
select 'JOHN#EMAIL.COM' union all
select 'STACY#EMAIL.COM' union all
select 'tanya#email.com'
)
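The filter works because upper-casing a string that is already all caps leaves it unchanged. A minimal Python sketch of the same comparison, using the sample rows from the question (the name `is_all_caps` is hypothetical):

```python
def is_all_caps(value):
    """True when upper-casing the value does not change it."""
    return value.upper() == value


emails = ["eric#email.com", "JOHN#EMAIL.COM", "STACY#EMAIL.COM", "tanya#email.com"]

all_caps = [email for email in emails if is_all_caps(email)]
print(",".join(all_caps))

# Caveat: a value with no letters at all also passes the check.
print(is_all_caps("12345"))  # True
```

If values without any letters should be excluded, the comparison would need an extra condition that the value differs from its lower-cased form.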

Data field - search and write value in new data field (Oracle)

Sorry, I don't know how to describe that as a title.
With a query (example: SELECT PKEY, TRUNC (CREATEDFORMAT), STATISTICS FROM BUSINESS_DATA WHERE STATISTICS LIKE '% business_%'), I can display all data that contains the value "business_xxxxxx".
For example, the data field can have the following content: c01_ad; concierge_beendet; business_start; or also skill_my; pre_initial_markt; business_request; topIntMaster; concierge_start; c01_start;
Is it possible to output only the matching value in a separate, temporary column?
So that the output looks like this, for example?
PKEY | TRUNC(CREATEDFORMAT) | NEW_STATISTICS
1 | 13.06.2020 | business_start
2 | 14.06.2020 | business_request
That means removing everything that does not start with business_xxx? Is this possible in an SQL query? RegEx would not be the right one, I think.
I think you want:
select
pkey,
trunc(createdformat) createddate,
regexp_substr(statistics, 'business_\S*') new_statistics
from business_data
where statistics like '% business_%'
You can also use the following regexp_substr:
select regexp_substr(str,'business_[^;]+') as result
from
-- sample data
(select 'skill_my; pre_initial_markt; business_request; topIntMaster; concierge_start; c01_start;' as str from dual
union all
select 'c01_ad; concierge_beendet; business_start;' from dual);
RESULT
--------------------------------------------------------------------------------
business_request
business_start
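The two patterns in these answers differ in one detail that matters for this data: `\S*` keeps going until whitespace, so it also swallows the trailing `;`, while `[^;]+` stops before it. The sketch below shows the difference using Python's `re` module; it assumes Oracle's regex engine treats these two patterns the same way, which is worth verifying on your own instance. `first_match` is a hypothetical helper.

```python
import re

rows = [
    "skill_my; pre_initial_markt; business_request; topIntMaster; concierge_start; c01_start;",
    "c01_ad; concierge_beendet; business_start;",
]


def first_match(pattern, text):
    """Return the first substring matching the pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


for row in rows:
    # Left: stops at whitespace. Right: stops at the semicolon.
    print(first_match(r"business_\S*", row), "|", first_match(r"business_[^;]+", row))
```

If the semicolon should not appear in the new column, the `[^;]+` variant is the safer choice for values delimited this way.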

Postgres matching against an array of regular expressions

My client wants the possibility to match a set of data against an array of regular expressions, meaning:
table:
name | officeId (foreignkey)
--------
bob | 1
alice | 1
alicia | 2
walter | 2
and he wants to do something along those lines:
get me all records of offices (officeId) where there is a member with
ANY name ~ ANY[.*ob, ali.*]
meaning
ANY of[alicia, walter] ~ ANY of [.*ob, ali.*] results in true
I could not figure it out by myself sadly :/.
Edit
The real problem was missing from the original description:
I cannot use select distinct officeId .. where name ~ ANY[.*ob, ali.*], because:
This application stores data in Postgres xml columns, which means I do in fact have (after evaluating (xpath('/data/clients/name/text()'))::text[]):
table:
name | officeId (foreignkey)
-----------------------------------------
[bob, alice] | 1
[anthony, walter] | 2
[alicia, walter] | 3
That is the problem. And "you don't do that, that is horrible, why would you do it like this, store it like it is meant to be stored in a relational database, use a NoSQL database for document-based storage, use json" are not options.
I am stuck with this data model.
This looks pretty horrific, but the only way I can think of doing such a thing would be a hybrid of a cross-join and a semi join. On small data sets this would probably work pretty well. On large datasets, I imagine the cross-join component could hit you pretty hard.
Check it out and let me know if it works against your real data:
with patterns as (
select unnest(array['.*ob', 'ali.*']) as pattern
)
select
o.name, o.officeid
from
office o
where exists (
select null
from patterns p
where o.name ~ p.pattern
)
The semi-join protects you from cases where a name like "alicia nob" meets multiple search patterns and would otherwise come back once for every match.
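The effect of the EXISTS semi-join can be illustrated outside the database. In the sketch below, `any()` plays the role of EXISTS: it stops at the first matching pattern, so each row is returned at most once. The data is the sample table from the question plus one hypothetical row ("alicia nob", office 3) added to show the multi-match case; `matching_rows` is a made-up name.

```python
import re

patterns = [".*ob", "ali.*"]

office = [
    ("bob", 1),
    ("alice", 1),
    ("alicia", 2),
    ("walter", 2),
    ("alicia nob", 3),  # hypothetical row that matches both patterns
]


def matching_rows(rows, patterns):
    """Keep each row once if its name matches at least one pattern."""
    # re.search is unanchored, like the ~ operator in Postgres.
    return [
        (name, office_id)
        for name, office_id in rows
        if any(re.search(pattern, name) for pattern in patterns)
    ]


for row in matching_rows(office, patterns):
    print(row)
```

A plain join against the pattern list would instead return "alicia nob" twice, once per matching pattern.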
You could cast the array to text.
SELECT * FROM workers WHERE (xpath('/data/clients/name/text()', xml_field))::text ~ ANY(ARRAY['wal','ant']);
When a string array is cast to text, elements that contain special characters or consist of keywords are wrapped in double quotes, so {jimmy,"walter, james"} holds two entries. Also note that ~ matches against any part of the string, unlike LIKE, which has to match the whole string.
Here is what I did in my test database:
test=# select id, (xpath('/data/clients/name/text()', name))::text[] as xss, officeid from workers WHERE (xpath('/data/clients/name/text()', name))::text ~ ANY(ARRAY['wal','ant']);
 id |           xss           | officeid
----+-------------------------+----------
  2 | {anthony,walter}        |        2
  3 | {alicia,walter}         |        3
  4 | {"walter, james"}       |        5
  5 | {jimmy,"walter, james"} |        4
(4 rows)
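One consequence of matching against the text form of the whole array is that anchors apply to the entire string, not to each name. The Python sketch below imitates the cast with a simplified, hypothetical helper (`array_as_text` only quotes elements containing a comma or whitespace and does not reproduce every Postgres quoting rule) and uses the three rows from the question.

```python
import re


def array_as_text(names):
    """Rough imitation of how a text[] value looks after a cast to text."""
    parts = []
    for name in names:
        # Simplification: quote only when the element has a comma or whitespace.
        needs_quotes = "," in name or any(ch.isspace() for ch in name)
        parts.append('"' + name + '"' if needs_quotes else name)
    return "{" + ",".join(parts) + "}"


def office_ids_matching(rows, patterns):
    """Office ids whose array text matches at least one pattern anywhere."""
    return [
        office_id
        for office_id, names in rows
        if any(re.search(p, array_as_text(names)) for p in patterns)
    ]


rows = [(1, ["bob", "alice"]), (2, ["anthony", "walter"]), (3, ["alicia", "walter"])]
print(office_ids_matching(rows, ["wal", "ant"]))

# '^' anchors to the start of the whole text, which is '{', not to each name.
print(re.search("^wal", array_as_text(["alicia", "walter"])))  # None
```

So patterns that rely on `^` or `$` per name would have to be rewritten, for example by matching on the surrounding `{`, `,` and `}` characters.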

BQ giving incomplete results using CONTAINS condition

I'm working with BigQuery to process some AdWords data and, more precisely, to extract all the URL parameters from our destination URLs so we can organize them better.
I wrote the following Query to give me back all the parameters available in the "DestinationURL" field in the table. As follows:
SELECT Parameter
FROM (SELECT NTH(1, SPLIT(Params,'=')) as Parameter,
FROM (SELECT
AdID,
NTH(1, SPLIT(DestinationURL,'?')) as baseurl,
split(NTH(2, SPLIT(DestinationURL,'?')),'&') as Params
FROM [adwords_accounts_ads.ads_all]
HAVING Params CONTAINS '='))
GROUP BY 1
Running this will give me 6 parameters. That is correct but incomplete, because in this testing table I know there are 2 other parameters in the URLs that were not fetched. One called 'group' and the other called 'utm_content'.
Now if I run:
SELECT Parameter
FROM (SELECT NTH(1, SPLIT(Params,'=')) as Parameter,
FROM (SELECT
AdID,
NTH(1, SPLIT(DestinationURL,'?')) as baseurl,
split(NTH(2, SPLIT(DestinationURL,'?')),'&') as Params
FROM [adwords_accounts_ads.ads_all]
HAVING Params CONTAINS 'p='))
GROUP BY 1
I get the "group" parameter showing.
The question is: shouldn't the condition CONTAINS '=' also include the rows matched by CONTAINS 'p='? The same thing happens with 't=' instead of '='.
Does anyone know how I can fix that? or even how to extract all the parameters from a string that contains a URL?
ps: using LIKE yields the exact same thing
Thanks!
Split creates a REPEATED output type, and you have to FLATTEN the table to see the results correctly.
Here I used flatten on params and the output is now good:
SELECT nth(1,SPLIT(Params,'=')) AS Param,
nth(2,SPLIT(Params,'=')) AS Value
FROM flatten(SELECT
AdID,
NTH(1, SPLIT(DestinationURL,'?')) AS baseurl,
split(NTH(2, SPLIT(DestinationURL,'?')),'&') AS Params
FROM
(SELECT 1 AS AdID,'http://www.example.com.br/?h=Passagens+Aereas&source=google&vt=0' AS DestinationURL)
HAVING Params CONTAINS '=',
params
)
Outputs:
+-----+--------+------------------+
| Row | Param  | Value            |
+-----+--------+------------------+
| 1   | h      | Passagens+Aereas |
| 2   | source | google           |
| 3   | vt     | 0                |
+-----+--------+------------------+
NOTE: The Web UI always flattens your result, but if you select a destination table and uncheck "flatten results", you will get a single row with a repeated parts column.
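The shape of the flattened result (one output row per parameter) can be reproduced with a short Python sketch that follows the same split steps as the query: split on '?', then on '&', then on '='. This only illustrates the logic and is not BigQuery code; `url_params` is a hypothetical name.

```python
def url_params(destination_url):
    """Return one (param, value) pair per query-string parameter."""
    parts = destination_url.split("?", 1)
    if len(parts) < 2:
        # No query string at all.
        return []
    pairs = []
    for param in parts[1].split("&"):
        # Mirror the CONTAINS '=' filter: skip fragments without '='.
        if "=" not in param:
            continue
        name, value = param.split("=", 1)
        pairs.append((name, value))
    return pairs


url = "http://www.example.com.br/?h=Passagens+Aereas&source=google&vt=0"
for row_number, (name, value) in enumerate(url_params(url), start=1):
    print(row_number, name, value)
```

Each element of the inner list corresponds to one row after flattening, which is why all parameters show up once the repeated field is expanded.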