Query domain suffix from BigQuery table

I am trying to get the domain suffix from my table of websites. However, there is no reverse function in BigQuery, and my table includes multi-level domains such as example.example.com. Because the number of '.' characters is inconsistent, I cannot rely on the 1st, 2nd, or 3rd occurrence of '.'.
SELECT
SUBSTR(Domain,( INSTR(Domain,'.')+1)) AS user_tld,
COUNT(*) AS activity_count
FROM [table]
GROUP EACH BY
user_tld
HAVING
user_tld IS NOT NULL AND NOT user_tld
IN ('')
ORDER BY
user_tld DESC
LIMIT 250;
This is where I am currently, only able to list out the whole domain name or the domain name after the first '.'

BigQuery has some great URL functions: https://developers.google.com/bigquery/query-reference?#urlfunctions
If these don't do the job, try REGEXP_EXTRACT instead of SUBSTR: it lets you define the exact structure to match, and you can anchor the match at the end of the string if you'd like.

As user2881671 says, you can use the TLD() function:
SELECT TLD('http://' + req_host), COUNT(*) c
FROM [httparchive:runs.2014_01_01_requests]
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1000
17130999 .com
3106860 .net
894779 .ru
538917 .de
504799 .org
252716 .jp
247244 .com.br
225529 .fr
218345 .pl
206532 .co.uk
Note that TLD() is "smart enough" to recognize that the TLD is '.co.uk' instead of '.uk'.
If you want only the '.uk' part, a regular expression works too:
SELECT COUNT(*) c, REGEXP_EXTRACT(req_host, r'(\.[^.:]*)\.?:?[0-9]*$')
FROM [httparchive:runs.2014_01_01_requests]
GROUP BY 2
ORDER BY 1 DESC
LIMIT 1000;
17130999 .com
3106860 .net
903360 .ru
539167 .de
504799 .org
491532 .jp
276205 .br
258811 .cn
237798 .pl
230407 .fr
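The end-anchored pattern above can be tried outside BigQuery. The following is a minimal Python sketch that applies the same regular expression with the standard re module. The function name last_label and the sample hostnames are made up for illustration, and it assumes Python's regex engine treats this particular pattern the same way BigQuery does.

```python
import re

# Same pattern as in the REGEXP_EXTRACT call above: capture the last
# dot-separated label, allowing an optional trailing dot and port.
LAST_LABEL = re.compile(r'(\.[^.:]*)\.?:?[0-9]*$')


def last_label(host):
    """Return the final '.label' of host, or None when host has no dot."""
    match = LAST_LABEL.search(host)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical sample hosts, not taken from the httparchive table.
    for host in ["example.example.com", "shop.example.co.uk:8080", "localhost"]:
        print(host, "->", last_label(host))
```

Because the pattern is anchored with $, the number of dots earlier in the hostname does not matter, which is the problem the question ran into with INSTR.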

Related

Extract domain and subdomains in Big Query

I need to extract domains, subdomains and subsubdomains from a link.
Example https://stackoverflow.com/users/17141604/badinmaths
domain : https://stackoverflow.com/
subdomain : https://stackoverflow.com/users
subsubdomain : https://stackoverflow.com/users/17141604 (even if the subsubdomain is weird)
https://stackoverflow.com/questions/ask
domain : https://stackoverflow.com/
subdomain : https://stackoverflow.com/questions
Here : no subsubdomain
I already know how to extract domain with NET.HOST but I need to extract other parts.
I have a large number of URL where I have to apply this method.
There might be better way but you can consider below.
WITH sample_table AS (
SELECT 'https://stackoverflow.com/users/17141604/badinmaths' url
UNION ALL
SELECT 'https://stackoverflow.com/questions/ask'
)
SELECT domain,
domain || paths[SAFE_OFFSET(0)] AS subdomain,
domain || paths[SAFE_OFFSET(0)] || '/' || paths[SAFE_OFFSET(1)] AS subsubdomain
FROM sample_table,
UNNEST ([STRUCT(SPLIT(url, NET.HOST(url)) AS split_url)]),
UNNEST ([STRUCT(split_url[SAFE_OFFSET(0)] || NET.HOST(url) || '/' AS domain)]),
UNNEST ([STRUCT(REGEXP_EXTRACT_ALL(split_url[SAFE_OFFSET(1)], r'(\w+)\/') AS paths)]);
The UNNEST([STRUCT(*expression* AS *field_name*)]) trick lets you treat field_name as if it were a column of the base table, which is useful for avoiding repetition of the same expression in the select list.
SPLIT(url, NET.HOST(url)) returns an array such as ['https://', '/questions/ask'], which is used later to reconstruct the domain and subdomains.
The regular expression r'(\w+)\/' captures each path segment that is followed by a '/', so the final segment of the path is not included.
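For checking the expected values on a few URLs outside BigQuery, here is a rough Python sketch of the same idea using urllib.parse from the standard library. The function name url_levels is hypothetical, and the sketch assumes the last path segment is always a leaf to be dropped, which matches the two sample URLs but may not hold for URLs ending in '/'.

```python
from urllib.parse import urlsplit


def url_levels(url):
    """Return (domain, subdomain, subsubdomain) in the question's sense.

    Levels that do not exist are returned as None, similar to the NULLs
    produced by SAFE_OFFSET in the query above.
    """
    parts = urlsplit(url)
    domain = f"{parts.scheme}://{parts.netloc}/"
    # Non-empty path segments, e.g. ['users', '17141604', 'badinmaths'].
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: the last segment is a leaf page and is not a level.
    parents = segments[:-1]
    subdomain = domain + parents[0] if len(parents) >= 1 else None
    subsubdomain = domain + "/".join(parents[:2]) if len(parents) >= 2 else None
    return domain, subdomain, subsubdomain


if __name__ == "__main__":
    print(url_levels("https://stackoverflow.com/users/17141604/badinmaths"))
    print(url_levels("https://stackoverflow.com/questions/ask"))
```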

Replace Asterisk(*) with "anything" in SQL

I have tons of URLs in my database and want to filter them by a user-defined string in the format something/*/something, where * stands for "anything". So when the user defines checkout/*/complete, the filter should match URLs like:
http://my_url.com/checkout/15/complete
http://my_url.com/checkout/85/complete
http://my_url.com/checkout/something/complete
http://my_url.com/super/checkout/something/complete
etc.
How do I do that in SQL? Or should I filter out all the results and use PHP to do the job?
My SQL request now is
SELECT * FROM custom_logs WHERE pn='$webPage' AND id IN ( SELECT MAX(id) FROM custom_logs WHERE action_clicked_text LIKE '%{$text_value_active}%' GROUP BY token ) order by action_timestamp desc
This filters out all the log messages with user-defined text in column action_clicked_text, but uses LIKE statement, which will not work with * inside.
You want LIKE. Either:
where url like '%checkout/%/complete%'
to get the URLs that match the pattern. Or:
where url not like '%checkout/%/complete%'
to get the other URLs.
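Turning the user's * into % can be done before the query runs. The sketch below is one possible way to do it in Python, with SQLite standing in for the real database purely so the example is runnable. The names to_like_pattern and matching_urls, the logs table, and the choice of backslash as the escape character are all assumptions, not part of the original question. Literal % and _ in the user's input are escaped so they are not treated as wildcards, and the pattern is passed as a bound parameter instead of being interpolated into the SQL string.

```python
import sqlite3


def to_like_pattern(user_pattern, escape="\\"):
    """Convert a user pattern such as 'checkout/*/complete' to a LIKE pattern."""
    # Escape the escape character first, then the LIKE wildcards.
    escaped = user_pattern.replace(escape, escape + escape)
    escaped = escaped.replace("%", escape + "%").replace("_", escape + "_")
    # '*' means "anything"; the outer '%' allow a match anywhere in the URL.
    return "%" + escaped.replace("*", "%") + "%"


def matching_urls(urls, user_pattern):
    """Return the URLs matching user_pattern, using an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (url TEXT)")
    conn.executemany("INSERT INTO logs VALUES (?)", [(u,) for u in urls])
    rows = conn.execute(
        "SELECT url FROM logs WHERE url LIKE ? ESCAPE '\\' ORDER BY rowid",
        (to_like_pattern(user_pattern),),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    sample = [
        "http://my_url.com/checkout/15/complete",
        "http://my_url.com/checkout/done",
        "http://my_url.com/super/checkout/something/complete",
    ]
    print(matching_urls(sample, "checkout/*/complete"))
```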

Stripping the domain name from a url

I'm working with BigQuery's Hacker News dataset, and was looking at which URLs have the most news stories. I'd also like to strip the domain names out, and see which of those have the most news stories. I'm working in R, and am having a bit of trouble getting the following query to work.
sql_domain <- "SELECT url,
REPLACE(CASE WHEN REGEXP_CONTAINS(url, '//')
THEN url ELSE 'http://' + url END, '&', '?') AS domain_name,
COUNT(domain_name) as story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY domain_name
ORDER BY story_number DESC
LIMIT 10"
I've been getting the following error: "Error: No matching signature for operator + for argument types: STRING, STRING. Supported signatures: INT64 + INT64; FLOAT64 + FLOAT64; NUMERIC + NUMERIC"
Can't for the life of me figure out a replacement for the "+" operator. Your help is much appreciated!
In BigQuery, strings are concatenated with CONCAT rather than "+": instead of 'http://' + url, use CONCAT('http://', url).
For your goals (top domains submitting to Hacker News):
#standardSQL
SELECT NET.REG_DOMAIN(url) domain, COUNT(*) c
, ROUND(EXP(AVG(LOG(IF(score<=0,0.1,score)))),2) avg_score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
Note how much easier it is to call NET.REG_DOMAIN() to get the domain.
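The avg_score column in that query is a geometric mean rather than an arithmetic one: it averages the natural logs of the scores and exponentiates the result, after replacing non-positive scores with 0.1 so the logarithm is defined. A small Python sketch of the same arithmetic follows; the function name geometric_mean_score is made up, and it assumes a non-empty list with no missing values.

```python
import math


def geometric_mean_score(scores, floor=0.1):
    """Mirror ROUND(EXP(AVG(LOG(IF(score<=0, 0.1, score)))), 2)."""
    # Non-positive scores are replaced by the floor, as in the IF(...) above.
    clamped = [floor if s <= 0 else s for s in scores]
    # Average of natural logs, then exponentiate: the geometric mean.
    mean_log = sum(math.log(s) for s in clamped) / len(clamped)
    return round(math.exp(mean_log), 2)


if __name__ == "__main__":
    # The arithmetic mean of these two scores would be 50.5.
    print(geometric_mean_score([1, 100]))
```

A geometric mean is less sensitive to a handful of very high-scoring stories than a plain AVG(score) would be.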

How can I change this Oracle SQL for this new format

I am new to Oracle and I need to change this SQL for the new output.
table name: access_log
col name: activity
sample data from the field
Download file:/webdocs/data/3589/casemanagement/01/CR-CLOSE/01_31_30_9_1050073559.pdf
Download file:/webdocs/data/3589/casemanagement/01/CR-CLOSE/01_31_42_29_1070032338.pdf
Download file:/webdocs/data/3589/casemanagement/01/CR-CLOSE/01_31_47_16_1050909430.pdf
Download file:/webdocs/data/3423/casemanagement/01/debit_disputes/01_24_38_29_0001105562.pdf
Download file:/webdocs/data/3423/fraud/01/0130_FRAUD_CLAIM_OF_FRAUD_AND_FORGERY_RPT_3423.XLS
so here is the output I need
The SQL I have right now is the following but I need to change it for the new format
select regexp_replace(activity, '^.*/(.*)/.*$', '\1') AS FILENAME,
COUNT (regexp_replace(activity, '^.*/(.*)/.*$', '\1')) AS DOWNLOADS
FROM sa.web_access_log where application_id = 5339 and time_stamp BETWEEN TO_DATE ('2014/02/01', 'yyyy/mm/dd') AND TO_DATE ('2014/02/02', 'yyyy/mm/dd')
GROUP BY regexp_replace(activity, '^.*/(.*)/.*$', '\1')
ORDER BY DOWNLOADS DESC;
So the filename runs from the second-to-last "/" to the last "/",
the folder runs from the 4th "/" from the left to the 5th "/",
and downloads is the count of matching filenames in the folder. Can anyone help me get this working?
Try this one. The result follows the data and output you gave, and it assumes every required field occurs at the same position as in the sample data. The column and table names are my assumptions, so replace them with the original names:
Please find the sqlfiddle link for the below examples
select folder,filename,count(1) downloads
from
(
select substr(detail,instr(detail,'/',1,4)+1,instr(detail,'/',1,5)
-instr(detail,'/',1,4)-1) folder,
SUBSTR(DETAIL,INSTR(DETAIL,'/',-1,2)+1,INSTR(DETAIL,'/',-1,1)
-INSTR(DETAIL,'/',-1,2)-1) filename
from examd
)
group by folder,filename ;
Here is the solution with regexp_replace function as required by you :-
select folder,filename,count(1) downloads
from
(
select regexp_replace(detail, '^([^/]*/){4}([^/]*)/.*$', '\2') folder,
regexp_replace(detail, '.*/(.*)/.*', '\1') as filename
from examd
)
group by folder,filename
order by folder,downloads desc;
one more you can try
select folder,filename,count(1) downloads
from
(
select regexp_replace(detail, '^([^/]*/){4}([^/]*)/.*$', '\2') folder,
regexp_replace(detail, '(.*{2}?/)(.*)/.*$', '\2') filename from examd
)
group by folder,filename
order by folder,downloads desc;
If you prefer regex:
SELECT REGEXP_REPLACE(activity, '^([^/]*/){4}([^/]*)/.*$', '\2') FROM access_log;
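To sanity-check what the folder and filename definitions from the question should produce for the sample rows, the splitting can be sketched in Python. The function names below are hypothetical, and the sketch assumes every activity string has at least five "/" characters, as the sample data does.

```python
from collections import Counter


def folder_and_name(activity):
    """Return (folder, filename) as defined in the question.

    folder   = text between the 4th and 5th '/' from the left
    filename = text between the second-to-last and the last '/'
    """
    parts = activity.split("/")
    # parts[k] is the text between the k-th and (k+1)-th slash.
    return parts[4], parts[-2]


def count_downloads(activities):
    """Count rows per (folder, filename) pair, like GROUP BY with COUNT."""
    return Counter(folder_and_name(a) for a in activities)


if __name__ == "__main__":
    rows = [
        "Download file:/webdocs/data/3589/casemanagement/01/CR-CLOSE/01_31_30_9_1050073559.pdf",
        "Download file:/webdocs/data/3589/casemanagement/01/CR-CLOSE/01_31_42_29_1070032338.pdf",
        "Download file:/webdocs/data/3423/casemanagement/01/debit_disputes/01_24_38_29_0001105562.pdf",
    ]
    for (folder, name), downloads in count_downloads(rows).items():
        print(folder, name, downloads)
```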

Comparing 2 Columns until the 1st "."

I am new to SQL programming and am trying to build a report that shows mismatches between System Names and DNS Names. Both columns are in a table called nodes.
For example, the System Name would be router-1-dc and the DNS Name would be router-1-dc.domain. I am trying to find nodes whose names don't match up to the "." before the domain. An example of this would be
System Name "router-1-datacenter" and DNS Name "router-1-dc.domain"; I would want this example to show on the report page.
The tricky part is that some of the system names include ".domain" and some don't.
Here is the SQL query I built; however, it does not appear to be working as I need it to.
SELECT N.NodeID, N.Caption, N.SysName, N.DNS, N.IP_Address, N.Device_Type
FROM (
SELECT Nodes.NodeID, Nodes.Caption, Nodes.SysName, Nodes.DNS, Nodes.Device_Type, Nodes.IP_Address
FROM Nodes
WHERE CHARINDEX('.',Nodes.SysName)>0 AND CHARINDEX('.',Nodes.DNS)>0
) N
WHERE SUBSTRING(N.SysName, 1, CHARINDEX('.',N.SysName)-1) <> SUBSTRING(N.DNS, 1, CHARINDEX('.',N.DNS)-1)
AND N.Device_Type = 'UPS'
ORDER BY 5 ASC, 2 ASC
Thanks in advance for the help
Try this, or something like it (I've no data to test it against):
SELECT N.NodeID, N.Caption, N.SysName, N.DNS, N.IP_Address, N.Device_Type
from Nodes N
where left(n.sysname, charindex('.', n.sysname + '.') - 1 )
<> left(n.dns, charindex('.', n.dns + '.') - 1)
order by N.IP_Address, N.Caption
The trick is to add a "." to the end of each string for evaluation purposes. If there already is a period in the string, this has no effect; otherwise you get the whole string.
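The append-a-dot trick can be illustrated outside SQL as well. Here is a minimal Python sketch with hypothetical function names; note that it compares case-sensitively, whereas the SQL comparison depends on the column collation.

```python
def prefix_before_dot(name):
    """Return the part of name before the first '.', or all of name."""
    # Appending '.' guarantees a dot exists, so index() never fails;
    # if name already contains a dot, the earlier one is found first.
    padded = name + "."
    return padded[: padded.index(".")]


def names_mismatch(sys_name, dns_name):
    """True when the two names differ before their first '.'."""
    return prefix_before_dot(sys_name) != prefix_before_dot(dns_name)


if __name__ == "__main__":
    print(names_mismatch("router-1-dc", "router-1-dc.domain"))
    print(names_mismatch("router-1-datacenter", "router-1-dc.domain"))
```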