mysql - speedup regex - sql

I have a table:
+--------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+------------------+------+-----+---------+----------------+
| idurl | int(11) | NO | PRI | NULL | auto_increment |
| idsite | int(10) unsigned | NO | MUL | NULL | |
| url | varchar(2048) | NO | | NULL | |
+--------+------------------+------+-----+---------+----------------+
the select statement is:
SELECT idurl,
url
FROM URL
WHERE idsite = 34
AND url REGEXP '^https\\://www\\.domain\\.com/checkout/step_one\\.php.*'
The query needs 5 seconds on a table with 1000000 rows.
Can I achieve a speedup with indexes or something else?

Looks like a LIKE might suffice. LIKE uses % as a wildcard for any number of characters.
AND url LIKE 'https://www.domain.com/checkout/step_one.php%'
LIKE does not require a starting anchor like ^. Only the second example would match:
'Sherlock and Watson' LIKE 'and%'
'Sherlock and Watson' LIKE '%and%'
'Sherlock and Watson' LIKE '%and'

Any index involving the URL column is likely not going to help you because the database engine still has to walk through the contents of that column to check whether the contents match the regex.
What may help you, depending on how many unique values of IDSITE you have, is to either place an index on IDSITE or do an initial select WHERE IDSITE = 34, and use that subquery as the target of your query on URL.
Something like:
select
idurl,
url
from
(select idurl, url from uwe_url where idsite = 34)
where
url REGEXP '^https\\://www\\.domain\\.com/checkout/step_one\\.php.*'
But I'm pretty sure you can't get around the text parsing for the URL column match.

You could use the LIKE operator instead of a regular expression. But as your regular expression is simple, this may or may not improve performance.
You could split out the domain into a separate field, index it and use that in your where clause. If the URLs that you store are from many different domains then such an index could improve performance considerably.

Looks like you don't really need that REGEXP.
This clause should suffice:
AND eu.url LIKE 'https://www.domain.com/checkout/step_one.php%'

Related

Combine query to get all the matching search text in right order

I have the following table:
postgres=# \d so_rum;
Table "public.so_rum"
Column | Type | Collation | Nullable | Default
-----------+-------------------------+-----------+----------+---------
id | integer | | |
title | character varying(1000) | | |
posts | text | | |
body | tsvector | | |
parent_id | integer | | |
Indexes:
"so_rum_body_idx" rum (body)
I wanted to do phrase search query, so I came up with the below query, for example:
select id from so_rum
where body ## phraseto_tsquery('english','Is it possible to toggle the visibility');
This gives me the results, which only match's the entire text. However, there are documents, where the distance between lexmes are more and the above query doesn't gives me back those data. For example: 'it is something possible to do toggle between the. . . visibility' doesn't get returned. I know I can get it returned with <2> (for example) distance operator by giving in the to_tsquery, manually.
But I wanted to understand, how to do this in my sql statement itself, so that I get the results first with distance of 1 and then 2 and so on (may be till 6-7). Finally append results with the actual count of the search words like the following query:
select count(id) from so_rum
where body ## to_tsquery('english','string & string . . . ')
Is it possible to do in a single query with good performance?
I don't see a canned solution to this. It sounds like you need to use plainto_tsquery to get all the results with all the lexemes, and then implement your own custom ranking function to rank them by distance between the lexemes, and maybe filter out ones with the wrong order.

How to compare value with multiple modified values from another table in BigQuery?

I am using Google BigQuery and I got the following issue:
I have a table (A) like this:
| time | request |
|------------------------|-----------------|
|2019-09-24 11:10:00 UTC | fakewebsite.com |
|2019-09-24 11:10:00 UTC | realwebsite.com |
|........................|.................|
|2019-09-24 11:10:00 UTC | foobwebsite.com |
|2019-09-24 11:10:00 UTC | barrwebsite.com |
And another table (B) like this:
| blacklist |
|---------------|
| foo.com |
| ... |
| bar.com |
I want to make a query that will grab a modified version of the values inside the blacklist field of table B as follows:
SPLIT(NET.REG_DOMAIN(blacklist), CONCAT('.',NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude --this will return only "foo" from "foo.com"
and then return all values from the request field of table A where none of the to_exclude was found.
I know how to do this for one value but I don't know how to do this for multiple. I am looking for something like the following:
#standardSQL
WITH tmp_blacklist AS
(SELECT
SPLIT(NET.REG_DOMAIN(blacklist), CONCAT('.',NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude
FROM
mydataset.B)
SELECT
request
FROM
mydataset.A
WHERE
request NOT LIKE ("%value1%", "%value2%", ..., "%valuen%") -- I can't use OR along with the NOT LIKE since the values are too many and they will change.
The n values are the values of the tmp_blacklist table.
Also if I don't define the table with the WITH and I define it after the NOT LIKE I am going to get the following error: Scalar subquery produced more than one element which makes sense if LIKE expects only one element. But then again that's half of the job done if it get's fixed since I want the "%value%" and not just the value of the table.
Now I searched online for a way to do this and I found people saying that it can't be done and then some workarounds with combinations of LIKE and IN where people said it will be very slow if one of the tables grows to have tons of data(my case).
What is the best way to do this?
One method uses not exists:
SELECT a.request
FROM mydataset.A a
WHERE NOT EXISTS (SELECT 1
FROM tmp_blacklist bl
WHERE a.request LIKE CONCAT('%', bl.to_exclude, '%'
);
Note that this can be expensive. You might want to test constructing the exclusion string as:
'value1|value2|value3'
and then using regular expressions.

Postgres matching against an array of regular expressions

My client wants the possibility to match a set of data against an array of regular expressions, meaning:
table:
name | officeId (foreignkey)
--------
bob | 1
alice | 1
alicia | 2
walter | 2
and he wants to do something along those lines:
get me all records of offices (officeId) where there is a member with
ANY name ~ ANY[.*ob, ali.*]
meaning
ANY of[alicia, walter] ~ ANY of [.*ob, ali.*] results in true
I could not figure it out by myself sadly :/.
Edit
The real Problem was missing form the original description:
I cannot use select disctinct officeId .. where name ~ ANY[.*ob, ali.*], because:
This application, stored data in postgres-xml columns, which means i do in fact have (after evaluating xpath('/data/clients/name/text()'))::text[]):
table:
name | officeId (foreignkey)
-----------------------------------------
[bob, alice] | 1
[anthony, walter] | 2
[alicia, walter] | 3
There is the Problem. And "you don't do that, that is horrible, why would you do it like this, store it like it is meant to be stored in a relation database, user a no-sql database for Document-based storage, use json" are no options.
I am stuck with this datamodel.
This looks pretty horrific, but the only way I can think of doing such a thing would be a hybrid of a cross-join and a semi join. On small data sets this would probably work pretty well. On large datasets, I imagine the cross-join component could hit you pretty hard.
Check it out and let me know if it works against your real data:
with patterns as (
select unnest(array['.*ob', 'ali.*']) as pattern
)
select
o.name, o.officeid
from
office o
where exists (
select null
from patterns p
where o.name ~ p.pattern
)
The semi-join helps protect you from cases where you have a name like "alicia nob" that would meet multiple search patterns would otherwise come back for every match.
You could cast the array to text.
SELECT * FROM workers WHERE (xpath('/data/clients/name/text()', xml_field))::text ~ ANY(ARRAY['wal','ant']);
When casting a string array into text, strings containing special characters or consisting of keywords are enclosed in double quotes kind of like {jimmy,"walter, james"} being two entries. Also when matching with ~ it is matched against any part of the string, not the same as LIKE where it's matched against the whole string.
Here is what I did in my test database:
test=# select id, (xpath('/data/clients/name/text()', name))::text[] as xss, officeid from workers WHERE (xpath('/data/clients/name/text()', name))::text ~ ANY(ARRAY['wal','ant']);
id | xss | officeid
----+-------------------------+----------
2 | {anthony,walter} | 2
3 | {alicia,walter} | 3
4 | {"walter, james"} | 5
5 | {jimmy,"walter, james"} | 4
(4 rows)

Match a Query to a Regular Expression in SQL?

I'm trying to find a way to match a query to a regular expression in a database. As far as I can tell (although I'm no expert), while most DBMS like MySQL have a regex option for searching, you can only do something like:
Find all rows in Column 1 that match the regex in my query.
What I want to be able to do is the opposite, i.e.:
Find all rows in Column 1 such that the regex in Column 1 matches my query.
Simple example - say I had a database structured like so:
+----------+-----------+
| Column 1 | Column 2 |
+----------+-----------+
| [a-z]+ | whatever |
+----------+-----------+
| [\w]+ | whatever |
+----------+-----------+
| [0-9]+ | whatever |
+----------+-----------+
So if I queried "dog", I would want it to return the rows with [a-z]+ and [\w]+, and if I queried 123, it would return the row with [0-9]+.
If you know of a way to do this in SQL, a short SELECT example or a link with an example would be much appreciated.
For MySQL (and may be other databases too):
SELECT * FROM table WHERE "dog" RLIKE(`Column 1`)
In PostgreSQL it would be:
SELECT * FROM table WHERE 'dog' ~ "Column 1";

Inverse of SQL LIKE '%value%'

I have a MySQL table containing domain names:
+----+---------------+
| id | domain |
+----+---------------+
| 1 | amazon.com |
| 2 | google.com |
| 3 | microsoft.com |
| | ... |
+----+---------------+
I'd like to be able to search through this table for a full hostname (i.e. 'www.google.com'). If it were the other way round where the table contained the full URL I'd use:
SELECT * FROM table WHERE domain LIKE '%google.com%'
But the inverse is not so straightforward. My current thinking is to search for the full hostname, then progressively strip off each part of the domain, and search again. (i.e. search for 'www.google.com' then 'google.com')
This is not particular efficient or clever, there must be a better way. I am sure it is a common problem, and no doubt easy to solve!
You can use the column on the right of the like too:
SELECT domain FROM table WHERE 'www.google.com' LIKE CONCAT('%', domain);
or
SELECT domain FROM table WHERE 'www.google.com' LIKE CONCAT('%', domain, '%');
It's not particularly efficient but it works.
In mysql you can use regular expressions (RLIKE) to perform matches. Given this ability you could do something like this:
SELECT * FROM table WHERE 'www.google.com' RLIKE domain;
It appears that the way RLIKE has been implemented it is even smart enough to treat the dot in that field (normally a wildcard in regex) as a literal dot.
MySQL's inclusion of regular expressions gives you a very powerful ability to parse and search strings. If you would like to know more about regular expressions, just google "regex". You can also use one of these links:
http://en.wikipedia.org/wiki/Regular_expression
http://www.regular-expressions.info/
http://www.codeproject.com/KB/string/re.aspx
You could use a bit of SQL string manipulation to generate the equivalent of string.EndsWith():
SELECT * FROM table WHERE
substring('www.google.com',
len('www.google.com') - len([domain]) ,
len([domain])+1) = [domain]