Performance of RTRIM vs REGEXP_REPLACE in BigQuery Standard SQL

I was wondering about the relative performance of RTRIM and REGEXP_REPLACE in BigQuery Standard SQL.
Which of the following two would be more performant:
DISTINCT RTRIM("12367e","abcdefghijklmnopqrstuvwxyz")
versus
REGEXP_REPLACE("12367e", r"\D$", "")
I am not sure if there is a big performance difference between these two approaches.

It doesn't look like there is much difference unless you have a significant amount of data. I tried a few queries over the bigquery-public-data.github_repos.commits table, applying these string transformations to the commit column, which has values like 0000120032a071dcd7e4bb1c8d418ca7a0028431.
The queries that I tried were:
SELECT COUNTIF(RTRIM(commit,'abcdefghijklmnopqrstuvwxyz') = '')
FROM `bigquery-public-data`.github_repos.commits;
SELECT COUNTIF(REGEXP_REPLACE(commit, r'\D$', '') = '')
FROM `bigquery-public-data`.github_repos.commits;
SELECT COUNT(*)
FROM `bigquery-public-data`.github_repos.commits
WHERE RTRIM(commit,'abcdefghijklmnopqrstuvwxyz') = '';
SELECT COUNT(*)
FROM `bigquery-public-data`.github_repos.commits
WHERE REGEXP_REPLACE(commit, r'\D$', '') = '';
These all process 7.91 GB of data (from just the string column) and take between two and three seconds to run, with no query being noticeably faster than the rest. I intentionally wrote the queries so that the results would be empty, since I didn't want to include write time.
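If you want to experiment without scanning a public table, you can synthesise fake commit hashes instead; this is only a sketch, and the FORMAT pattern and row count are arbitrary choices for illustration:
WITH t AS (
  -- one million fake 40-character hex "commit" strings
  SELECT FORMAT('%040x', n) AS commit
  FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS n
)
SELECT
  COUNTIF(RTRIM(commit, 'abcdefghijklmnopqrstuvwxyz') = '') AS rtrim_empty,
  COUNTIF(REGEXP_REPLACE(commit, r'\D$', '') = '') AS regexp_empty
FROM t;
Note that the two expressions aren't strictly equivalent: RTRIM strips every trailing lowercase letter, while the regex removes only a single trailing non-digit character.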

Related

Count occurrences of a pattern in SQL Server column

I have a varchar column in SQL Server 2012 with 3-letter patterns that are concatenated, like this value:
DECLARE @str VARCHAR(MAX) = 'POKPOKPOKHRSPOKPOKPOKPOKPOKPOIHEFHEFPOKPOHHRTHRT'
I need a query to search and count the occurrences of the pattern POK in that string. The trick is, all POK that are together must be counted as one. So, in the string above there are 3 "chains" of POK:
POKPOKPOK, interrupted by a HRS
POKPOKPOKPOKPOK, interrupted by a POI
POK, interrupted by a POH
So, my desired result is 3. If I use the following query, I get 9, which is the total number of POKs in the string and not what I need.
SELECT (LEN(@str) - LEN(REPLACE(@str, 'POK', '')))/LEN('POK')
I think I need some sort of regexp to isolate the POKs and then count, but couldn't find a way to apply that in SQL Server. Any help much appreciated.
This is really not something that you want to do in SQL. But you can. Here is one method to reduce the adjacent 'POK's to a single POK:
select replace(replace(@str, 'POK', '<POK>'), 'POK><', '')
Well, this actually creates a '<POK>', but that is fine for our purposes.
Now, you can search in that:
select (len(replace(replace(@str, 'POK', '<POK>'), 'POK><', '')) -
len(replace(replace(replace(@str, 'POK', '<POK>'), 'POK><', ''), 'POK', ''))
) / 3
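For completeness, here is the same thing as a self-contained snippet you can run against the sample string (pok_chains is just an illustrative alias):
DECLARE @str VARCHAR(MAX) = 'POKPOKPOKHRSPOKPOKPOKPOKPOKPOIHEFHEFPOKPOHHRTHRT';

SELECT (LEN(REPLACE(REPLACE(@str, 'POK', '<POK>'), 'POK><', '')) -
        LEN(REPLACE(REPLACE(REPLACE(@str, 'POK', '<POK>'), 'POK><', ''), 'POK', ''))
       ) / 3 AS pok_chains; -- returns 3 for the sample string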

Comparing two tables and finding partial match (SQL / Oracle)

I haven't quite found an answer to this problem; it seems a bit tricky (and yes, I am a beginner). I have two tables, eb_site and eb_register, which are connected by the column id_glo. The values within these fields are not quite the same, though; the number is the connecting factor. An example:
eb_site = kplus.hs.dlsn.3074823
eb_register = kplus.hs.register.3074823-1
How could I select the rows, i.e. make a list, where the number in eb_register deviates from the number in eb_site (disregarding the dlsn/register mismatch)?
And also where eb_register has a -1 at the end, as in the example (the fixed ones don't have the -1 at the end).
Thanks for any replies.
edit: oops sorry guys, worded it badly, have edited
Rgds,
Steinar
the quality of the solution will depend on the possible id_glo values and the sql dialect you can use.
as a start, try
select s.id_glo
, r.id_glo
from eb_site s
inner join eb_register r on ( replace(replace(s.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') <> replace(replace(r.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '')
and replace(replace(r.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') not like replace(replace(s.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') || '-%'
)
;
this query assumes that:
there are no prefixes other than the ones you've given
number suffixes like -1 will only occur in records from eb_register
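if your Oracle version is 10g or later, you could also lean on regular expressions; a sketch, assuming the first digit run in id_glo is always the connecting number:
select s.id_glo
, r.id_glo
from eb_site s
inner join eb_register r on regexp_substr(s.id_glo, '[0-9]+') <> regexp_substr(r.id_glo, '[0-9]+')
;
-- and separately, the records that still carry the -1 suffix:
select r.id_glo
from eb_register r
where r.id_glo like '%-1'
;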
If the numbers match, then the reverse of the numbers match. The following extracts the number (and final decimal point) from each key, using SQL Server syntax:
select *
from eb_site s join
eb_register r
on left(REVERSE(s.id_glo), charindex('.', reverse(s.id_glo))) =
left(REVERSE(r.id_glo), charindex('.', reverse(r.id_glo)))
In other databases the charindex() might need to be replaced by another function, such as instr(), locate(), or position().
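A quick sanity check of what the REVERSE/CHARINDEX expression extracts, using the sample keys from the question (SQL Server syntax):
SELECT LEFT(REVERSE('kplus.hs.dlsn.3074823'),
            CHARINDEX('.', REVERSE('kplus.hs.dlsn.3074823'))) AS extracted;
-- returns '3284703.' (the number reversed, plus the dot)
For 'kplus.hs.register.3074823-1' it returns '1-3284703.', so keys that still carry the -1 suffix will not join to their counterpart, which matches the second requirement in the question.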

Oracle DBMS_LOB.INSTR and CONTAINS performance

Is there any performance difference between dbms_lob.instr and contains or am I doing something wrong?
Here is my code
SELECT DISTINCT ha.HRE_A_ID, ha.HRE_A_FIRSTNAME, ha.HRE_A_SURNAME, ha.HRE_A_CITY,
ha.HRE_A_EMAIL, ha.HRE_A_PHONE_MOBIL
FROM HRE_APPLICANT ha WHERE ha.HRE_A_STATUS_ID=1 AND ha.HRE_A_CURRENT_STATUS_ID <= '7'
AND ((DBMS_LOB.INSTR(hre_a_for_search,'java') > 0)
OR EXISTS
(SELECT 1 FROM gob_attachment, gob_table WHERE hre_a_id=gob_a_record_id
AND gob_a_table_id = gob_t_id AND gob_t_code = 'HRE_APPLICANT'
AND CONTAINS (gob_a_document, 'java') > 0))
ORDER BY HRE_A_SURNAME
and the last two lines changed to use instr:
AND dbms_lob.instr(gob_a_document,utl_raw.cast_to_raw('java')) <> 0))
ORDER BY HRE_A_SURNAME
My problem is that I would like to use instr instead of contains, but instr seems to be a lot slower than contains.
CONTAINS will use an Oracle Text index so you'd expect it to be much more efficient than something like INSTR that has to read the entire CLOB at runtime. If you generate the query plans for the two statements, I expect that you'll see that the difference is related to the Oracle Text index.
Why do you want to use INSTR rather than CONTAINS?
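For reference, CONTAINS only works because an Oracle Text index exists on the column; it would have been created with something like this (the index name here is made up):
CREATE INDEX gob_attachment_doc_ix ON gob_attachment (gob_a_document)
  INDEXTYPE IS CTXSYS.CONTEXT;
DBMS_LOB.INSTR has no such index to use, so every candidate row's LOB must be read and scanned at query time, which is why it is so much slower.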

Searching a column containing CSV data in a MySQL table for existence of input values

I have a table say, ITEM, in MySQL that stores data as follows:
ID FEATURES
--------------------
1 AB,CD,EF,XY
2 PQ,AC,A3,B3
3 AB,CDE
4 AB1,BC3
--------------------
As an input, I will get a CSV string, something like "AB,PQ". I want to get the records that contain AB or PQ. I realized that we have to write a MySQL function to achieve this. So, if we had this magical function MATCH_ANY defined in MySQL that does this, I would then simply execute SQL as follows:
select * from ITEM where MATCH_ANY(FEATURES, "AB,PQ") = 0
The above query would return the records 1, 2 and 3.
But I'm running into all sorts of problems while implementing this function as I realized that MySQL doesn't support arrays and there's no simple way to split strings based on a delimiter.
Remodeling the table is the last option for me as it involves lot of issues.
I might also want to execute queries containing multiple MATCH_ANY functions such as:
select * from ITEM where MATCH_ANY(FEATURES, "AB,PQ") = 0 and MATCH_ANY(FEATURES, "CDE") = 0
In the above case, we would get an intersection of records (1, 2, 3) and (3) which would be just 3.
Any help is deeply appreciated.
Thanks
First of all, the database should of course not contain comma-separated values, but you are hopefully aware of this already. If the table were normalised, you could easily get the items using a query like:
select distinct i.Itemid
from Item i
inner join ItemFeature f on f.ItemId = i.ItemId
where f.Feature in ('AB', 'PQ')
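For reference, a minimal sketch of the normalised schema that query assumes (table and column names are illustrative):
CREATE TABLE ItemFeature (
  ItemId  INT         NOT NULL,
  Feature VARCHAR(10) NOT NULL,
  PRIMARY KEY (ItemId, Feature)
);
With an index like that primary key, the IN ('AB', 'PQ') lookup is cheap, which is exactly what the comma-separated layout prevents.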
You can match the strings in the comma separated values, but it's not very efficient:
select Id
from Item
where
instr(concat(',', Features, ','), ',AB,') <> 0 or
instr(concat(',', Features, ','), ',PQ,') <> 0
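For what it's worth, MySQL also ships FIND_IN_SET(), which does this comma-separated membership test natively (though, like the instr() approach, it cannot use an index):
select Id
from Item
where FIND_IN_SET('AB', Features) > 0
   or FIND_IN_SET('PQ', Features) > 0;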
For all you REGEXP lovers out there, I thought I would add this as a solution:
SELECT * FROM ITEM WHERE FEATURES REGEXP '[[:<:]](AB|PQ)[[:>:]]';
(the parentheses matter: without them the word-boundary markers bind only to AB on the left and PQ on the right)
and for case sensitivity:
SELECT * FROM ITEM WHERE FEATURES REGEXP BINARY '[[:<:]](AB|PQ)[[:>:]]';
For the second query:
SELECT * FROM ITEM WHERE FEATURES REGEXP '[[:<:]](AB|PQ)[[:>:]]' AND FEATURES REGEXP '[[:<:]]CDE[[:>:]]';
Cheers!
select *
from ITEM
where CONCAT(',',FEATURES,',') LIKE '%,AB,%'
or CONCAT(',',FEATURES,',') LIKE '%,PQ,%'
or create a custom function to do your MATCH_ANY
Alternatively, consider using RLIKE(); note that MySQL uses CONCAT() rather than + for string concatenation:
select *
from ITEM
where CONCAT(',', FEATURES, ',') RLIKE ',AB,|,PQ,';
Just a thought:
Does it have to be done in SQL? This is the kind of thing you might normally expect to write in PHP or Python or whatever language you're using to interface with the database.
This approach means you can build your query string using whatever complex logic you need and then just submit a vanilla SQL query, rather than trying to build a procedure in SQL.
Ben

SQL produced by Entity Framework for string matching

Given this linq query against an EF data context:
var customers = data.Customers.Where(c => c.EmailDomain.StartsWith(term))
You'd expect it to produce SQL like this, right?
SELECT {cols} FROM Customers WHERE EmailDomain LIKE @term + '%'
Well, actually, it does something like this:
SELECT {cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) = 1)
Do you know why?
Also, replacing the Where selector to:
c => c.EmailDomain.Substring(0, term.Length) == term
it runs 10 times faster but still produces some pretty yucky SQL.
NOTE: Linq to SQL correctly translates StartsWith into Like {term}%, and nHibernate has a dedicated LikeExpression.
I don't know about MS SQL Server, but on SQL Server Compact, LIKE 'foo%' is thousands of times faster than CHARINDEX if you have an index on the search column. And now I'm sitting here pulling my hair out over how to force it to use LIKE.
http://social.msdn.microsoft.com/Forums/en-US/adodotnetentityframework/thread/1b835b94-7259-4284-a2a6-3d5ebda76e4b
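The practical difference shows up with an index: a prefix LIKE is sargable, while wrapping the column in CHARINDEX is not. A sketch, assuming a Customers table with an index on EmailDomain (names taken from the question, the value is illustrative):
DECLARE @term VARCHAR(100) = 'example.';

-- can use an index seek on EmailDomain
SELECT * FROM Customers WHERE EmailDomain LIKE @term + '%';

-- the column is buried inside a function, so this forces a scan
SELECT * FROM Customers WHERE CHARINDEX(@term, EmailDomain) = 1;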
The reason is that CHARINDEX is a lot faster and cleaner for SQL Server to perform than LIKE, because "LIKE" clauses can be arbitrarily complex. Example:
SELECT * FROM Customer WHERE EmailDomain LIKE 'abc%de%sss%'
But, the "CHARINDEX" function (which is basically "IndexOf") ONLY handles finding the first instance of a set of characters... no wildcards are allowed.
So, there's your answer :)
EDIT: I just wanted to add that I encourage people to use CHARINDEX in their SQL queries for things that don't need "LIKE". It is important to note though that in SQL Server 2000 a "Text" field can use the LIKE method, but not CHARINDEX.
Performance seems to be about equal between LIKE and CHARINDEX, so that should not be the reason. See here or here for some discussion. Also the CAST is very weird because CHARINDEX returns an int.
charindex returns the location of the first term within the second term.
sql starts with 1 as the first location (0 = not found)
http://msdn.microsoft.com/en-us/library/ms186323.aspx
i don't know why it uses that syntax but that's how it works
I agree that it is no faster; I was retrieving tens of thousands of rows from our database with the letter i in the name. I did find however that you need to use > rather than =, because = 1 only matches rows where the term appears at the very start, while > 0 matches it anywhere ... so use
{cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) > 0)
rather than
{cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) = 1)
Here are my tests ....
select * from members where surname like '%i%' --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) > 0) --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) = 1) --too few results