Extract domain and subdomains in BigQuery

I need to extract domains, subdomains and subsubdomains from a link.
Example https://stackoverflow.com/users/17141604/badinmaths
domain : https://stackoverflow.com/
subdomain : https://stackoverflow.com/users
subsubdomain : https://stackoverflow.com/users/17141604 (even if the subsubdomain is weird)
Another example: https://stackoverflow.com/questions/ask
domain : https://stackoverflow.com/
subdomain : https://stackoverflow.com/questions
Here : no subsubdomain
I already know how to extract the domain with NET.HOST, but I need to extract the other parts.
I have a large number of URLs to which I have to apply this method.

There might be a better way, but you can consider the approach below.
WITH sample_table AS (
  SELECT 'https://stackoverflow.com/users/17141604/badinmaths' AS url
  UNION ALL
  SELECT 'https://stackoverflow.com/questions/ask'
)
SELECT domain,
       domain || paths[SAFE_OFFSET(0)] AS subdomain,
       domain || paths[SAFE_OFFSET(0)] || '/' || paths[SAFE_OFFSET(1)] AS subsubdomain
  FROM sample_table,
       UNNEST([STRUCT(SPLIT(url, NET.HOST(url)) AS split_url)]),
       UNNEST([STRUCT(split_url[SAFE_OFFSET(0)] || NET.HOST(url) || '/' AS domain)]),
       UNNEST([STRUCT(REGEXP_EXTRACT_ALL(split_url[SAFE_OFFSET(1)], r'(\w+)\/') AS paths)]);
UNNEST ([STRUCT(*expression* AS *field_name*)])
With this trick, you can treat field_name as a column name of the base table.
It is useful for reducing repetition of the same expression in the select list.
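As a tiny illustration of the trick (my own sketch, not part of the original answer), the computed field doubled below can be referenced again as if it were a column of the base table:
SELECT doubled, doubled + 1 AS plus_one   -- doubled is reused like an ordinary column
FROM (SELECT 3 AS x),
     UNNEST([STRUCT(x * 2 AS doubled)]);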
SPLIT(url, NET.HOST(url)) returns an array like ['https://', '/questions/ask'], which is used later to reconstruct the domain and subdomains.
For the regular expression, see regex101.
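If the UNNEST trick feels heavy, an alternative sketch (mine, only checked against the two sample URLs) is to pull the first one, two, or three path segments with REGEXP_EXTRACT; requiring a trailing '/' after the third segment mirrors the behaviour above, where /questions/ask yields no subsubdomain:
SELECT url,
       REGEXP_EXTRACT(url, r'^(https?://[^/]+/)') AS domain,
       REGEXP_EXTRACT(url, r'^(https?://[^/]+/[^/]+)') AS subdomain,
       REGEXP_EXTRACT(url, r'^(https?://[^/]+/[^/]+/[^/]+)/') AS subsubdomain  -- NULL when there is no further segment
FROM sample_table;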

Related

SQL to create one record for an IP

I have a dataset:
IP,web,rule
12.54.5435,web1,rule1
12.54.5435,web1,rule1
12.54.5435,web2,rule1
12.54.5435,web1,rule2
13.54.5435,web1,rule1
13.54.5435,web1,rule1
13.54.5435,web1,rule1
13.54.5435,web1,rule2
For every IP, I need to create a single record that looks like:
total_count,ip, webrulecountlist
4,12.54.5435, ['web1,rule1, 2', 'web2,rule1,1', 'web1,rule2,1']
4,13.54.5435, ['web1, rule1, 3', 'web1, rule2, 1']
My inner query looks like this:
select count(ip) as c, ip, webacl, rule from t1 group by ip, webacl, rule
The above query's output is:
2,12.54.5435,web1,rule1
1,12.54.5435,web1,rule2
1,12.54.5435,web2,rule1
3,13.54.5435,web1,rule1
1,13.54.5435,web1,rule2
But how can I now group by IP, combining the column values?
Using https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/systemfunctions/ for a list of functions:
|| concatenates.
CHR(##) gives us the text character for an ASCII value.
LISTAGG() lets us combine multiple rows into 1.
So...Maybe... [UNTESTED / no means to test]
SELECT SUM(CNT) AS c,                     -- note SUM(), not COUNT()
       IP,
       LISTAGG(CHR(39) ||                 -- add the opening apostrophe
               Web || ', ' ||             -- add the web name and a comma
               Rule || ', ' ||            -- add the rule and a comma
               CNT || CHR(39),            -- add the count and the closing apostrophe
               ',') AS webrulecountlist   -- separate the sets by ',' though technically
                                          -- we don't need the ',' as , is the default;
                                          -- the syntax might be [,] instead of ',' too
FROM (SELECT COUNT(*) AS CNT, IP, Web, Rule
      FROM t1
      GROUP BY IP, Web, Rule) Sub         -- derived table (Sub) getting counts for
                                          -- duplicates by IP, Web, Rule
GROUP BY IP
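Since the surrounding questions use BigQuery, here is a hedged sketch of the same idea in BigQuery standard SQL (an assumption on my part; the question never names the engine), where ARRAY_AGG builds the list directly:
SELECT SUM(cnt) AS total_count,
       IP,
       ARRAY_AGG(CONCAT(Web, ', ', Rule, ', ', CAST(cnt AS STRING))) AS webrulecountlist
FROM (SELECT IP, Web, Rule, COUNT(*) AS cnt
      FROM t1
      GROUP BY IP, Web, Rule)
GROUP BY IP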

Full text search failure on PostgreSQL

I have a PostgreSQL database used to index text content.
The SearchVector column is populated successfully using the following code:
UPDATE public."DocumentFiles"
SET "SearchVector" = setweight(to_tsvector('pg_catalog.italian', coalesce("DocumentFileName", '')), 'A')
|| setweight(to_tsvector('pg_catalog.italian', coalesce("DocumentFileDescription", '')), 'B')
|| setweight(to_tsvector('pg_catalog.italian', coalesce("DocumentFileContentString", '')), 'B')
WHERE "DocumentFileID" = 123;
The content looks like the following:
'011989':1A '5':7A 'cdp':2A 'contonu':10A 'elettr':6A 'grupp':8A 'impiant':5A 'manual':3A 'uso':4A
But if I try to run a query to get the plural or singular of manual (in Italian: manuale is singular, manuali is plural), it fails:
SELECT "DocumentFileID"
FROM public."DocumentFiles"
where "SearchVector"::tsvector ## 'manuali'::tsquery;
return nothing
SELECT "DocumentFileID"
FROM public."DocumentFiles"
where "SearchVector"::tsvector ## 'manuale'::tsquery;
return nothing
It only returns the record if I write exactly what is written in the searchvector field:
SELECT "DocumentFileID"
FROM public."DocumentFiles"
where "SearchVector"::tsvector ## 'manual'::tsquery;
What's wrong with it?
The problem is probably that the parameter default_text_search_config is not set to italian, so a different stemming algorithm is used.
Be explicit and use to_tsquery('italian', 'manuali') rather than 'manuali'::tsquery.
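A minimal sketch of the fixed query against the table from the question (same table and column names as above):
SELECT "DocumentFileID"
FROM public."DocumentFiles"
WHERE "SearchVector"::tsvector @@ to_tsquery('italian', 'manuali');
-- the 'italian' configuration stems both manuale and manuali to the lexeme 'manual',
-- matching what to_tsvector('pg_catalog.italian', ...) stored in the column
Alternatively, set the session default so that unqualified casts behave consistently:
SET default_text_search_config = 'pg_catalog.italian';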

Replace Asterisk(*) with "anything" in SQL

I have tons of URLs in my database and want to filter them by a user-defined string in the format something/*/something, where * stands for "anything". So when the user defines checkout/*/complete, it should match URLs like:
http://my_url.com/checkout/15/complete
http://my_url.com/checkout/85/complete
http://my_url.com/checkout/something/complete
http://my_url.com/super/checkout/something/complete
etc.
How do I do that in SQL? Or should I filter out all the results and use PHP to do the job?
My SQL query now is:
SELECT * FROM custom_logs WHERE pn='$webPage' AND id IN ( SELECT MAX(id) FROM custom_logs WHERE action_clicked_text LIKE '%{$text_value_active}%' GROUP BY token ) order by action_timestamp desc
This filters out all the log messages with the user-defined text in the column action_clicked_text, but it uses a LIKE condition, which will not work with * inside.
You want LIKE. Either:
where url like '%checkout/%/complete%'
to get the URLs that match the pattern. Or:
where url not like '%checkout/%/complete%'
to get the other URLs.
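Since the * pattern comes from the user, one hedged sketch (assuming MySQL, which the PHP-style query above suggests) is to translate the user's * into LIKE's % wildcard on the database side; the literal pattern string here only illustrates what the application would bind in:
SELECT *
FROM custom_logs
WHERE action_clicked_text LIKE CONCAT('%', REPLACE('checkout/*/complete', '*', '%'), '%')
ORDER BY action_timestamp DESC;
-- REPLACE turns 'checkout/*/complete' into 'checkout/%/complete',
-- and the surrounding '%' lets it match anywhere inside the URL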

Stripping the domain name from a url

I'm working with BigQuery's Hacker News dataset and was looking at which URLs have the most news stories. I'd also like to strip the domain names out and see which of those have the most news stories. I'm working in R, and am having a bit of trouble getting the following query to work.
sql_domain <- "SELECT url,
REPLACE(CASE WHEN REGEXP_CONTAINS(url, '//')
THEN url ELSE 'http://' + url END, '&', '?') AS domain_name,
COUNT(domain_name) as story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY domain_name
ORDER BY story_number DESC
LIMIT 10"
I've been getting the following error: "Error: No matching signature for operator + for argument types: STRING, STRING. Supported signatures: INT64 + INT64; FLOAT64 + FLOAT64; NUMERIC + NUMERIC"
Can't for the life of me figure out a replacement for the "+" operator. Your help is much appreciated!
Can't for the life of me figure out a replacement for the "+" operator
In BigQuery - instead of 'http://' + url you should use CONCAT('http://', url)
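As a sketch of the minimal change to the original query (untested; the original also selected url and counted the alias domain_name, which would likely still error under this GROUP BY, so both are adjusted to COUNT(*) here):
SELECT REPLACE(CASE WHEN REGEXP_CONTAINS(url, '//')
                    THEN url ELSE CONCAT('http://', url) END, '&', '?') AS domain_name,
       COUNT(*) AS story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY domain_name
ORDER BY story_number DESC
LIMIT 10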
For your goals (top domains submitting to Hacker News):
#standardSQL
SELECT NET.REG_DOMAIN(url) domain, COUNT(*) c
, ROUND(EXP(AVG(LOG(IF(score<=0,0.1,score)))),2) avg_score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
Note how much easier it is to call NET.REG_DOMAIN() to get the domain.

Query domain suffix from BigQuery table

I am trying to get the domain suffix from my table of websites; however, there is no reverse function in BigQuery, and my table has domains such as example.example.com. Thus, I cannot rely on the 1st/2nd/3rd appearance of '.', as there is an inconsistent number of '.' characters.
SELECT SUBSTR(Domain, INSTR(Domain, '.') + 1) AS user_tld,
       COUNT(*) AS activity_count
FROM [table]
GROUP EACH BY user_tld
HAVING user_tld IS NOT NULL
   AND user_tld NOT IN ('')
ORDER BY user_tld DESC
LIMIT 250;
This is where I am currently: I am only able to list out the whole domain name, or the domain name after the first '.'.
BQ has some great URL functions: https://developers.google.com/bigquery/query-reference?#urlfunctions
If this doesn't do the job, try regexp_extract instead of SUBSTR, where you can define the exact structure of your string and make it match from the string end if you'd like.
As user2881671 says, you can use the TLD() function:
SELECT TLD('http://' + req_host), COUNT(*) c
FROM [httparchive:runs.2014_01_01_requests]
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1000
17130999 .com
3106860 .net
894779 .ru
538917 .de
504799 .org
252716 .jp
247244 .com.br
225529 .fr
218345 .pl
206532 .co.uk
Note that TLD() is "smart enough" to recognize that the TLD is '.co.uk' instead of '.uk'.
If you want only the '.uk' part, a regex works too:
SELECT COUNT(*) c, REGEXP_EXTRACT(req_host, r'(\.[^.:]*)\.?:?[0-9]*$')
FROM [httparchive:runs.2014_01_01_requests]
GROUP BY 2
ORDER BY 1 DESC
LIMIT 1000;
17130999 .com
3106860 .net
903360 .ru
539167 .de
504799 .org
491532 .jp
276205 .br
258811 .cn
237798 .pl
230407 .fr
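For completeness, a hedged sketch of how the same aggregation might look in BigQuery standard SQL today, using the NET functions mentioned in the earlier answers (the table is the standard-SQL spelling of the legacy name above and may no longer exist; adjust to a current dataset):
#standardSQL
SELECT NET.PUBLIC_SUFFIX(CONCAT('http://', req_host)) AS tld, COUNT(*) AS c
FROM `httparchive.runs.2014_01_01_requests`
GROUP BY tld
ORDER BY c DESC
LIMIT 1000
-- unlike legacy TLD(), NET.PUBLIC_SUFFIX returns the suffix without the leading dot, e.g. 'co.uk' rather than '.co.uk'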