BigQuery Domain Function Case Sensitivity Discrepancy - google-bigquery

When querying data that contains URLs in BigQuery, we noticed that the DOMAIN function behaves differently depending on the case of the URL.
This can be demonstrated with a simple query:
SELECT
domain('WWW.FOO.COM.AU'),
domain(LOWER('http://WWW.FOO.COM.AU/')),
domain('http://WWW.FOO.COM.AU/')
The result for the fully uppercase URL does not seem right, and the documentation does not mention anything about case in URLs.

DOMAIN (and the other URL-handling functions in legacy SQL) have a number of limitations, unfortunately. While we don't have an equivalent yet in standard SQL (uncheck the "Use Legacy SQL" box under Options), you can make up your own that works in more cases using a regular expression. There are a number of StackOverflow questions about domain extraction, and we can put one of the answers to use as:
CREATE TEMPORARY FUNCTION GetDomain(url STRING) AS (
REGEXP_EXTRACT(url, r'^(?:https?:\/\/)?(?:[^#\n]+#)?(?:www\.)?([^:\/\n]+)'));
WITH T AS (
SELECT url
FROM UNNEST(['WWW.FOO.COM.AU:8080', 'google.com',
'www.abc.xyz', 'http://example.com']) AS url)
SELECT
url,
GetDomain(url) AS domain
FROM T;
+---------------------+----------------+
| url                 | domain         |
+---------------------+----------------+
| www.abc.xyz         | abc.xyz        |
| WWW.FOO.COM.AU:8080 | WWW.FOO.COM.AU |
| google.com          | google.com     |
| http://example.com  | example.com    |
+---------------------+----------------+
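To see how the pattern above treats case, here is a minimal sketch in Python rather than BigQuery. The names get_domain and normalize are hypothetical, introduced only for illustration, and the sketch assumes Python's re module behaves like BigQuery's regex engine for this particular pattern. Lowercasing the URL before extraction is one way to sidestep the case-sensitive www\. prefix:

```python
import re

# Same pattern as the GetDomain function above
# (the slashes do not need escaping in Python).
DOMAIN_RE = re.compile(r'^(?:https?://)?(?:[^#\n]+#)?(?:www\.)?([^:/\n]+)')


def get_domain(url, normalize=False):
    """Return the host part of url, or None if the pattern does not match.

    normalize is a hypothetical flag: when True the URL is lowercased first,
    so an uppercase 'WWW.' prefix is stripped like a lowercase one.
    """
    if normalize:
        url = url.lower()
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None


for url in ['WWW.FOO.COM.AU:8080', 'google.com', 'www.abc.xyz', 'http://example.com']:
    print(url, get_domain(url), get_domain(url, normalize=True))
```

In BigQuery itself the same effect would presumably come from wrapping the argument in LOWER(), as the question already does for DOMAIN.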

Related

Redshift - How to use column in one table as pattern in SIMILAR TO

I have a problem involving two tables. One table contains URLs and their information, and the other contains groups of URLs that should be matched by a pattern.
Urls table:
------------------------------------------------
| url | files |
| https://myurl1/test/one/es/main.html | 530 |
| https://myurl1/test/one/en/main.html | 530 |
| https://myurl1/test/one/ar/main.html | 530 |
------------------------------------------------
Urls patterns table:
---------------------------------------------
| group  | url_pattern                       |
| group1 | https://myurl1/test/one/(es|en)/% |
| group2 | https://myurl1/test/one/(ar)/%    |
---------------------------------------------
I have tried something like this bearing in mind that url_patterns will only have one row per group.
SELECT * FROM urls_table
WHERE url SIMILAR TO (SELECT MAX (url_pattern) FROM url_patterns WHERE group='group1')
LIMIT 10
The main problem is that SIMILAR TO does not seem to work when the pattern comes from a column.
Could anyone give me some advice?
Thanks in advance.
You are running into the requirement that regexp patterns are compiled and that SIMILAR TO is a layer on regexp. So what you are trying to do won't work. I believe there are a number of other ways to do this.
I) Change to LIKE pattern matching: LIKE patterns aren't precompiled, so they can be dynamic. The downside is that they are more limited, but I think you can still do what you want. Just change your patterns to be a set of pattern columns (if the number of patterns is limited) and test against all of them. Unneeded patterns can be set to a value that can never match. Definitely a brute force hack.
II) Change to LIKE pattern matching w/ SQL to provide OR behavior: have multiple LIKE patterns in the url_pattern column separated by '|' (for example). Then use split_part to match each sub-pattern. It is a bit complex and possibly slow, but it works. Like this:
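SELECT url
FROM urls_table
LEFT JOIN (-- one row per '|'-separated sub-pattern of the group
           SELECT split_part(url_pattern, '|', part_no::int) AS pattern
           FROM url_patterns
           -- row numbers from any sufficiently large table serve as part indexes
           CROSS JOIN (SELECT row_number() OVER () AS part_no FROM urls_table) n
           WHERE "group" = 'group1'
          ) p
ON url LIKE p.pattern
WHERE p.pattern IS NOT NULL;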
You will also need to change your pattern strings to use the simpler LIKE format and use '|' for multiple possibilities - Ex: Group1 pattern becomes 'https://myurl1/test/one/es/%|https://myurl1/test/one/en/%'
III) Use some front-end query modification to find the pattern for the group and apply it to query BEFORE it is sent to the compiler. This could be an external tool or a stored procedure on Redshift. Get the pattern in one query and use it to issue the second query.
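Option III can be sketched from the client side. The following is only an illustration under stated assumptions: it uses Python's sqlite3 as a stand-in for Redshift, the helper name urls_for_group is hypothetical, and the patterns are stored in the '|'-separated LIKE format described in option II. The first query fetches the pattern, and the second query is built from it before being sent:

```python
import sqlite3


def urls_for_group(conn, group):
    """Fetch the group's patterns, then build and run the second query."""
    row = conn.execute(
        'SELECT url_pattern FROM url_patterns WHERE "group" = ?', (group,)
    ).fetchone()
    if row is None:
        return []
    patterns = [p for p in row[0].split('|') if p]
    if not patterns:
        return []
    # One "url LIKE ?" clause per sub-pattern; values stay bound parameters.
    where = ' OR '.join('url LIKE ?' for _ in patterns)
    sql = f'SELECT url FROM urls_table WHERE {where} ORDER BY url'
    return [r[0] for r in conn.execute(sql, patterns)]


# Stand-in data mirroring the tables in the question.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE urls_table (url TEXT, files INTEGER)')
conn.execute('CREATE TABLE url_patterns ("group" TEXT, url_pattern TEXT)')
conn.executemany('INSERT INTO urls_table VALUES (?, ?)', [
    ('https://myurl1/test/one/es/main.html', 530),
    ('https://myurl1/test/one/en/main.html', 530),
    ('https://myurl1/test/one/ar/main.html', 530),
])
conn.executemany('INSERT INTO url_patterns VALUES (?, ?)', [
    ('group1', 'https://myurl1/test/one/es/%|https://myurl1/test/one/en/%'),
    ('group2', 'https://myurl1/test/one/ar/%'),
])

print(urls_for_group(conn, 'group1'))
```

Because the pattern is known before the second statement is compiled, the same two-step shape should also work with a literal SIMILAR TO pattern on Redshift, though that is not exercised here.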
Do you want exists?
SELECT u.*
FROM urls_table u
WHERE EXISTS (SELECT 1
FROM url_patterns p
WHERE u.url SIMILAR TO p.url_pattern AND
p."group" = 'group1'
)
LIMIT 10;

How to compare value with multiple modified values from another table in BigQuery?

I am using Google BigQuery and I got the following issue:
I have a table (A) like this:
| time | request |
|------------------------|-----------------|
|2019-09-24 11:10:00 UTC | fakewebsite.com |
|2019-09-24 11:10:00 UTC | realwebsite.com |
|........................|.................|
|2019-09-24 11:10:00 UTC | foobwebsite.com |
|2019-09-24 11:10:00 UTC | barrwebsite.com |
And another table (B) like this:
| blacklist |
|---------------|
| foo.com |
| ... |
| bar.com |
I want to make a query that will grab a modified version of the values inside the blacklist field of table B as follows:
SPLIT(NET.REG_DOMAIN(blacklist), CONCAT('.',NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude --this will return only "foo" from "foo.com"
and then return all values from the request field of table A where none of the to_exclude was found.
I know how to do this for one value but I don't know how to do this for multiple. I am looking for something like the following:
#standardSQL
WITH tmp_blacklist AS
(SELECT
SPLIT(NET.REG_DOMAIN(blacklist), CONCAT('.',NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude
FROM
mydataset.B)
SELECT
request
FROM
mydataset.A
WHERE
request NOT LIKE ("%value1%", "%value2%", ..., "%valuen%") -- I can't use OR along with the NOT LIKE since the values are too many and they will change.
The n values are the values of the tmp_blacklist table.
Also, if I don't define the table with the WITH clause and instead put the subquery after the NOT LIKE, I get the following error: Scalar subquery produced more than one element. That makes sense if LIKE expects only one element. But even if that were fixed, it would only be half the job, since I want "%value%" and not just the raw value from the table.
I searched online for a way to do this and found people saying that it can't be done, plus some workarounds combining LIKE and IN that people said would be very slow if one of the tables grows to hold lots of data (my case).
What is the best way to do this?
One method uses not exists:
SELECT a.request
FROM mydataset.A a
WHERE NOT EXISTS (SELECT 1
FROM tmp_blacklist bl
WHERE a.request LIKE CONCAT('%', bl.to_exclude, '%')
);
Note that this can be expensive. You might want to test constructing the exclusion string as:
'value1|value2|value3'
and then using regular expressions.
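A minimal sketch of that regular-expression idea, written in Python rather than BigQuery: the function name filter_requests is hypothetical, and the sample values come from the tables in the question. Each value is escaped before being joined with '|' so that characters such as '.' are matched literally:

```python
import re


def filter_requests(requests, to_exclude):
    """Keep only the requests that contain none of the excluded substrings."""
    terms = [re.escape(t) for t in to_exclude if t]
    if not terms:
        # An empty alternation would match everything, so skip filtering.
        return list(requests)
    blocked = re.compile('|'.join(terms))  # e.g. 'foo|bar'
    return [r for r in requests if not blocked.search(r)]


requests = ['fakewebsite.com', 'realwebsite.com', 'foobwebsite.com', 'barrwebsite.com']
print(filter_requests(requests, ['foo', 'bar']))
```

In BigQuery the same idea would presumably be expressed with REGEXP_CONTAINS applied to a STRING_AGG of the to_exclude values; that variant is not shown or tested here.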

Homoiconicity and SQL

I'm currently using emacs sql-mode as my sql shell, a (simplified) query response is below:
my_db=# select * from visit limit 4;
 num |          visit_key          |          created           |   expiry
-----+-----------------------------+----------------------------+------------
   1 | 0f6fb8603f4dfe026d88998d81a | 2008-03-02 15:17:56.899817 | 2008-03-02
   2 | 7c389163ff611155f97af692426 | 2008-02-14 12:46:11.02434  | 2008-02-14
   3 | 3ecba0cfb4e4e0fdd6a8be87b35 | 2008-02-14 16:33:34.797517 | 2008-02-14
   4 | 89285112ef2d753bd6f5e51056f | 2008-02-21 14:37:47.368657 | 2008-02-21
(4 rows)
If I want to then formulate another query based on that data, e.g.
my_db=# select visit_key, created from visit where expiry = '2008-03-02'
and num > 10;
You'll see that I have to add the comma between visit_key and created, and surround the expiry value with quotes.
Is there a SQL DB shell that shows its content more homoiconically, so that I could minimise this sort of editing? e.g.
num, visit_key, created, expiry
(1, '0f6fb8603f4dfe026d88998d81a', '2008-03-02 15:17:56.899817', '2008-03-02')
or
(num=1, visit_key='0f6fb8603f4dfe026d88998d81a',
created='2008-03-02 15:17:56.899817', expiry='2008-03-02')
I'm using postgresql btw.
Here's one idea, which is similar to what I do sometimes, though I'm not sure that it's exactly what you're asking for:
Run a Lisp compiler (like SBCL) in SLIME. Then load CLSQL. It has a "Functional Data Manipulation Language" (SELECT documentation) which might help you do something like you want, perhaps in conjunction with SLIME's autocompletion capabilities. If not, it's easy to define Lisp functions and macros (assuming you know Lisp, but you're already an Emacser!).
Out-of-the-box, it doesn't give the nicely formatted tables that most SQL interfaces have, but even that isn't too hard to add. And Lisp is certainly powerful enough to let one easily come up with ways to make your common operations easier.
I've found the following changes in psql go some way to giving me homoiconicity:
=# select remote_ip, referer, http_method, time from hit limit 1;
    remote_ip    | referer | http_method |           time
-----------------+---------+-------------+---------------------------
 213.233.132.148 |         | GET         | 2013-08-27 08:01:42.38808
(1 row)
=# \a
Output format is unaligned.
=# \f ''', '''
Field separator is "', '".
=# \t
Showing only tuples.
=# select remote_ip, referer, http_method, time from hit limit 1;
213.233.132.148', '', 'GET', '2013-08-27 08:01:42.38808
Caveats: every value is rendered as a string, and the opening and closing quotes are missing.
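If post-processing the rows on the client is acceptable, the two caveats can be worked around with a few lines of code. This is only a sketch, not a psql feature: sql_literal and format_row are hypothetical helper names, and the quoting rule assumed is the standard SQL one of doubling single quotes inside string literals:

```python
def sql_literal(value):
    """Render one Python value as a SQL literal."""
    if value is None:
        return 'NULL'
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return 'TRUE' if value else 'FALSE'
    if isinstance(value, (int, float)):
        return repr(value)
    # Everything else is treated as text; embedded quotes are doubled.
    return "'" + str(value).replace("'", "''") + "'"


def format_row(row):
    """Render a row as a parenthesised, comma-separated tuple of literals."""
    return '(' + ', '.join(sql_literal(v) for v in row) + ')'


print(format_row((1, '0f6fb8603f4dfe026d88998d81a',
                  '2008-03-02 15:17:56.899817', '2008-03-02')))
```

Numbers stay unquoted and strings keep both quotes, so the output can be pasted back into a WHERE clause or a VALUES list.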

mysql - speedup regex

I have a table:
+--------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+------------------+------+-----+---------+----------------+
| idurl | int(11) | NO | PRI | NULL | auto_increment |
| idsite | int(10) unsigned | NO | MUL | NULL | |
| url | varchar(2048) | NO | | NULL | |
+--------+------------------+------+-----+---------+----------------+
the select statement is:
SELECT idurl,
url
FROM URL
WHERE idsite = 34
AND url REGEXP '^https\\://www\\.domain\\.com/checkout/step_one\\.php.*'
The query needs 5 seconds on a table with 1000000 rows.
Can I achieve a speedup with indexes or something else?
Looks like a LIKE might suffice. LIKE uses % as a wildcard for any number of characters.
AND url LIKE 'https://www.domain.com/checkout/step_one.php%'
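LIKE always matches against the whole string, so no starting anchor like ^ is needed; put % wherever extra characters are allowed. Of the following examples, only the second would match: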
'Sherlock and Watson' LIKE 'and%'
'Sherlock and Watson' LIKE '%and%'
'Sherlock and Watson' LIKE '%and'
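The three comparisons can be checked quickly. The sketch below runs them through SQLite via Python's sqlite3 rather than MySQL, on the assumption that LIKE anchoring works the same way in both for these patterns:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
text = 'Sherlock and Watson'

# Evaluate each LIKE comparison and record whether it matched.
results = {
    pattern: bool(conn.execute('SELECT ? LIKE ?', (text, pattern)).fetchone()[0])
    for pattern in ('and%', '%and%', '%and')
}
print(results)
```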
Any index involving the URL column is likely not going to help you because the database engine still has to walk through the contents of that column to check whether the contents match the regex.
What may help you, depending on how many unique values of IDSITE you have, is to either place an index on IDSITE or do an initial select WHERE IDSITE = 34, and use that subquery as the target of your query on URL.
Something like:
select
u.idurl,
u.url
from
(select idurl, url from uwe_url where idsite = 34) as u -- MySQL requires an alias on a derived table
where
u.url REGEXP '^https\\://www\\.domain\\.com/checkout/step_one\\.php.*'
You could use the LIKE operator instead of a regular expression. But as your regular expression is simple, this may or may not improve performance.
You could split out the domain into a separate field, index it and use that in your where clause. If the URLs that you store are from many different domains then such an index could improve performance considerably.
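A rough sketch of that idea, using SQLite through Python's sqlite3 as a stand-in for MySQL. The host column, the index name idx_site_host, and the helper host_of are all hypothetical; the host is extracted with urllib.parse when the row is inserted, and the query narrows by the indexed columns before the LIKE is applied:

```python
import sqlite3
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of url, or '' if it has none."""
    return urlsplit(url).hostname or ''


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE url (idurl INTEGER PRIMARY KEY, idsite INTEGER, '
             'url TEXT, host TEXT)')
# Composite index so the idsite and host filters can be answered from the index.
conn.execute('CREATE INDEX idx_site_host ON url (idsite, host)')

rows = [
    (34, 'https://www.domain.com/checkout/step_one.php?cart=1'),
    (34, 'https://www.domain.com/index.php'),
    (34, 'https://other.example/checkout/step_one.php'),
]
conn.executemany('INSERT INTO url (idsite, url, host) VALUES (?, ?, ?)',
                 [(site, u, host_of(u)) for site, u in rows])

# Note: '_' in a LIKE pattern matches any single character, including '_' itself.
matches = [r[0] for r in conn.execute(
    'SELECT url FROM url WHERE idsite = ? AND host = ? AND url LIKE ?',
    (34, 'www.domain.com', 'https://www.domain.com/checkout/step_one.php%'))]
print(matches)
```

Whether this beats a plain prefix LIKE depends on how many distinct hosts the table holds, so it is worth measuring on real data.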
Looks like you don't really need that REGEXP.
This clause should suffice:
AND eu.url LIKE 'https://www.domain.com/checkout/step_one.php%'

Inverse of SQL LIKE '%value%'

I have a MySQL table containing domain names:
+----+---------------+
| id | domain |
+----+---------------+
| 1 | amazon.com |
| 2 | google.com |
| 3 | microsoft.com |
| | ... |
+----+---------------+
I'd like to be able to search through this table for a full hostname (i.e. 'www.google.com'). If it were the other way round where the table contained the full URL I'd use:
SELECT * FROM table WHERE domain LIKE '%google.com%'
But the inverse is not so straightforward. My current thinking is to search for the full hostname, then progressively strip off each part of the domain, and search again. (i.e. search for 'www.google.com' then 'google.com')
This is not particularly efficient or clever; there must be a better way. I am sure it is a common problem, and no doubt easy to solve!
You can use the column on the right of the like too:
SELECT domain FROM table WHERE 'www.google.com' LIKE CONCAT('%', domain);
or
SELECT domain FROM table WHERE 'www.google.com' LIKE CONCAT('%', domain, '%');
It's not particularly efficient but it works.
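A small runnable sketch of the reversed LIKE, using SQLite through Python's sqlite3 instead of MySQL (so || replaces CONCAT). The table name domains and the helper find_domains are hypothetical. The variant shown also requires a dot before the stored domain, an assumption added here so that 'notgoogle.com' does not match 'google.com':

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE domains (id INTEGER PRIMARY KEY, domain TEXT)')
conn.executemany('INSERT INTO domains (domain) VALUES (?)',
                 [('amazon.com',), ('google.com',), ('microsoft.com',)])


def find_domains(conn, hostname):
    """Return stored domains that equal hostname or are a dot-separated suffix of it."""
    sql = ("SELECT domain FROM domains "
           "WHERE ? = domain OR ? LIKE '%.' || domain "
           "ORDER BY domain")
    return [r[0] for r in conn.execute(sql, (hostname, hostname))]


print(find_domains(conn, 'www.google.com'))
```

As with the original query, every row is still compared, so this does not make the lookup index-friendly.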
In mysql you can use regular expressions (RLIKE) to perform matches. Given this ability you could do something like this:
SELECT * FROM table WHERE 'www.google.com' RLIKE domain;
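Note that RLIKE treats the value in that field as a regular expression, so each dot is a wildcard that matches any character, a literal dot included. The match above therefore succeeds, but the pattern is unanchored and can produce false positives: 'www.googleXcom' would also match 'google.com'.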
MySQL's inclusion of regular expressions gives you a very powerful ability to parse and search strings. If you would like to know more about regular expressions, just google "regex". You can also use one of these links:
http://en.wikipedia.org/wiki/Regular_expression
http://www.regular-expressions.info/
http://www.codeproject.com/KB/string/re.aspx
You could use a bit of SQL string manipulation to generate the equivalent of string.EndsWith() (SQL Server syntax shown; SUBSTRING positions are 1-based):
SELECT * FROM table WHERE
substring('www.google.com',
len('www.google.com') - len([domain]) + 1,
len([domain])) = [domain]