SQL - Bulk crypto hash generation

We're trying to implement change detection in our ETL process.
So we decided to get the cryptographic hash using
UPDATE a
SET a.[HASH] = (SELECT master.dbo.fn_varbintohexsubstring(0,
    HashBytes('md5', (SELECT TOP 1 * FROM customer_demographics_staging b
                      WHERE b.customer_no = a.customer_no FOR XML RAW)), 1, 0))
FROM customer_demographics_staging a
For a table with 700k records and about 140 columns (we have yet to determine which columns change), the query ran for about half an hour before we canceled it.
Is there any way, apart from reducing the number of queries, that we can improve this?

A couple of things. If the data type of the HASH column is varbinary(20), you don't need to concern yourself with converting the MD5 hash to a string; just store the hash bytes. To that end though, if you want to use a cryptographic hash for change detection, I'd use an inline table-valued function to get it. Here's an example that I cobbled together using AdventureWorks:
ALTER TABLE [HumanResources].[Employee] ADD [Hash] VARBINARY(20) NULL;
GO
CREATE FUNCTION dbo.CalculateHash(@EmployeeID AS INT)
RETURNS TABLE
AS
RETURN
    SELECT e.[BusinessEntityID], HASHBYTES('md5', (
        SELECT *
        FROM [HumanResources].[Employee] AS [e2]
        WHERE [e2].[BusinessEntityID] = e.[BusinessEntityID]
        FOR XML RAW
    )) AS [Hash]
    FROM [HumanResources].[Employee] AS [e]
    WHERE [e].[BusinessEntityID] = @EmployeeID
GO
SELECT TOP 10 [e].*, ch.[Hash]
FROM [HumanResources].[Employee] AS [e]
CROSS APPLY dbo.[CalculateHash]([e].[BusinessEntityID]) AS [ch]
GO
That said, if it were me, I wouldn't bother with MD5 at all and just use the CHECKSUM() function (perhaps as a persisted computed column in the table). It supports taking multiple columns natively (so you don't incur the overhead of serializing the row to XML).
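For example, a persisted computed checksum column might look something like this (a sketch only; the column list is a placeholder for whichever columns you decide matter for change detection):
ALTER TABLE customer_demographics_staging
    ADD [RowChecksum] AS CHECKSUM(customer_no, col1, col2, col3) PERSISTED;  -- col1..col3 are placeholders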

In line with what [Ben Thul] already said, I too tend to rely on BINARY_CHECKSUM(), simply because of its ease of use.
I'll agree that this function returns but an int, which is 4 bytes, while e.g. MD5 returns a varbinary(16), which is four times as many bytes; the result space thus grows from 2^32 to 2^128 values, meaning you end up with an incredibly much smaller chance of collisions. But paranoid me would like to add that even so, an exact match of MD5 values does not mean you also have the same (input) values!
In all honesty, I use the function only to eliminate differences. If the result of the checksum (or hash) is different, then you can be 100% certain that the values are different too. If they are identical, then you should still check the source values in their entirety to make sure there are no 'false matches'.
Your use case seems to be the other way around: you want to find the ones that are different by eliminating the ones that are identical, short-cutting the latter by looking at the hash code only. To be honest, I'm not a fan of that approach, simply because you risk running into a collision that causes a 'changed' record in your staging table to get the exact same hash value as the old one and thus be ignored when you want to copy the changes. Again, the chances are incredibly small, but like I said, I'm paranoid when it comes to this =)
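For what it's worth, a sketch of the safer pattern (assuming a target table next to the staging table; the table and column names are placeholders): a differing hash proves the row changed, while an equal hash still gets a full column comparison to guard against collisions.
SELECT s.customer_no
FROM customer_demographics_staging AS s
JOIN customer_demographics AS t
    ON t.customer_no = s.customer_no
WHERE s.[HASH] <> t.[HASH]            -- a differing hash proves the row changed
   OR EXISTS (SELECT s.col1, s.col2   -- full comparison catches hash collisions
              EXCEPT
              SELECT t.col1, t.col2);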
If you wish to continue down this track nevertheless, some remarks:
HashBytes accepts an input of at most 8,000 bytes; given the overhead added by the XML syntax, you might run into trouble with those 140 columns.
I don't see any (good) reason to convert the result of HashBytes to something else before writing it to the table.
Although FOR XML is pretty fast, wouldn't CONCAT be just as fast while producing a 'smaller' result (cf. point 1)? I'll agree that it brings its own set of issues: when field1, field2, field3 are "hello", "world", "" the result is the same as for "hello", "", "world" =/ You could get around this by CONCAT-ing the LEN() of each field too (see the sketch after this list)... not sure how much gain we'd have left though =)
I'm guessing you already have it, but is there an index, preferably unique and clustered, on the customer_no field in the staging table?
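A minimal sketch of that CONCAT idea (the field names are placeholders): prefixing each value with its length means ('hello', 'world', '') and ('hello', '', 'world') no longer hash to the same thing.
SELECT customer_no,
       HASHBYTES('md5',
           CONCAT(LEN(field1), '|', field1,
                  LEN(field2), '|', field2,
                  LEN(field3), '|', field3)) AS row_hash
FROM customer_demographics_staging;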

Related

How do you identify which WHERE clause predicate gets filtered first in Oracle, and how do you control it

I have a problem where the fix is to change which predicate gets filtered first, but I'm not sure whether that is even possible, and I'm not knowledgeable enough about how it works.
To give an example:
Here is a table:
When you filter it using the following query:
select * from pcparts where Parts = 'Monitor' and id = 255322 and Brand = 'Asus'
By that logic this should work, as the Asus component with a character in its ID will be filtered out, preventing an ORA-01722 (invalid number) error.
But in my experience this is inconsistent.
I tried the same filter in two different DB connections; the first one didn't get the error (as expected) but the other one got an ORA-01722 error.
Checking the explain plans, the difference between the two DBs is the following:
I was wondering whether it's possible to make sure that Parts gets filtered before the ID, but I was unable to find anything when searching. Is this even possible? If not, what is a fix for this issue that doesn't rely on TO_CHAR?
I assume you want to (sort of) fix a buggy program without changing the source code.
According to your image, you are using "Filter Predicates"; this normally means Oracle isn't using an index (though I don't know what tool displays execution plans this way).
If you have an index on PARTS, Oracle will probably use this index.
create index myindex on mytable (parts);
If Oracle thinks this index is inefficient, it may still use a full table scan. You may try to 'fake' Oracle into thinking this is an efficient index by lying about the number of distinct values (the more distinct values, the more efficient):
exec dbms_stats.set_index_stats(ownname => 'myname', indname => 'myindex', numdist => 100000000)
Note: this WILL impact the performance of other queries using this table.
"Fix" is rather simple: take control over what you're doing.
It is evident that ID column's datatype is VARCHAR2. Therefore, don't make Oracle guess, instruct it what to do.
No:  select * from pcparts where Parts = 'Monitor' and id = 255322 and Brand = 'Asus'
Yes: select * from pcparts where Parts = 'Monitor' and id = '255322' and Brand = 'Asus'
(note that the VARCHAR2 column's value is now enclosed in single quotes)

Oracle error code ORA-00913 - IN clause limitation with more than 65,000 values (used OR conditions for every 1,000 values)

My application team is trying to fetch 85,000 values from a table using a SELECT query that is being built on the fly by their program.
SELECT * FROM TEST_TABLE
WHERE (
    ID IN (00001, 00002, ..., 01000)
    OR ID IN (01001, 01002, ..., 02000)
    ...
    OR ID IN (84001, 84002, ..., 85000)
);
But I am getting the error "ORA-00913: too many values".
If I reduce the IN list usage to only 65,000 values, I don't get this error. Is there a limit on the number of values for the IN clause (when combined with OR conditions)?
The issue isn't about in lists; it is about a limit on the number of or-delimited compound conditions. I believe the limit applies not to or specifically, but to any compound condition using any combination of or, and, and not, with or without parentheses. And, importantly, this doesn't seem to be documented anywhere, nor acknowledged by anyone at Oracle.
As you clearly know already, there is a limit of 1000 items in an in list - and you have worked around that.
The parser expands an in condition as a compound, or-delimited condition. The limit that applies to you is the one I mentioned already.
The limit is 65,535 "atomic" conditions (put together with or, and, not). It is not difficult to write examples that confirm this.
The better question is why (and, of course, how to work around it).
My suspicion: To evaluate such compound conditions, the compiled code must use a stack, which is very likely implemented as an array. The array is indexed by unsigned 16-bit integers (why so small, only Oracle can tell). So the stack size can be no more than 2^16 = 65,536; and actually only one less, because Oracle thinks that array indexes start at 1, not at 0 - so they lose one index value (0).
Workaround: create a temporary table to store your 85,000 values. Note that the idea of using tuples (artificial as it is) allows you to overcome the 1000 values limit for a single in list, but it does not work around the limit of 65,535 "atomic" conditions in an or-delimited compound condition; this limit applies in the most general case, regardless of where the conditions come from originally (in lists or anything else).
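A sketch of that temporary-table workaround, with hypothetical object names: load the 85,000 IDs once (e.g. via array binding from the application) and join against them instead of generating a huge or-delimited condition.
CREATE GLOBAL TEMPORARY TABLE tmp_ids (id NUMBER PRIMARY KEY)
    ON COMMIT DELETE ROWS;

-- the application bulk-inserts its 85,000 values here, one bind per row
INSERT INTO tmp_ids (id) VALUES (:id);

SELECT t.*
FROM   TEST_TABLE t
       JOIN tmp_ids x ON x.id = t.id;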
More information on AskTom - you may want to start at the bottom (my comments, which are the last ones in the threads):
https://asktom.oracle.com/pls/apex/f?p=100:11:10737011707014::::P11_QUESTION_ID:9530196000346534356#9545388800346146842
https://asktom.oracle.com/pls/apex/f?p=100:11:10737011707014::::P11_QUESTION_ID:778625947169#9545394700346458835

Performance of SQL functions vs. code functions

We're currently investigating the load against our SQL server and looking at ways to alleviate it. During my post-secondary education, I was always told that, from a performance standpoint, it was cheaper to make SQL Server do the work. But is this true?
Here's an example:
SELECT ord_no FROM oelinhst_sql
This returns 783119 records in 14 seconds. The field is a char(8), but all of our order numbers are six digits long, so each has two leading blank characters. We typically trim this field, so I ran the following test:
SELECT LTRIM(ord_no) FROM oelinhst_sql
This returned the 783119 records in 13 seconds. I also tried one more test:
SELECT LTRIM(RTRIM(ord_no)) FROM oelinhst_sql
There is nothing to trim on the right; I was trying to see if there was any overhead in the mere act of calling the function, but it still returned in 13 seconds.
My manager was talking about moving things like string trimming out of the SQL and into the source code, but the test results suggest otherwise. My manager also says he heard somewhere that using SQL functions meant that indexes would not be used. Is there any truth to this either?
Only optimize code that you have proven to be the slowest part of your system. Your data so far indicates that SQL string manipulation functions are not affecting performance at all. Take this data to your manager.
If you use a function or type cast in the WHERE clause, it can often prevent SQL Server from using indexes. This does not apply to transforming returned columns with functions.
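For example (reusing the table from the question; the literal value is made up), the first query below typically cannot seek an index on ord_no because the function must be applied to every row, while the second can:
-- function on the column: an index on ord_no cannot be seeked
SELECT ord_no FROM oelinhst_sql WHERE LTRIM(ord_no) = '123456';

-- sargable version: leave the column alone and pad the literal instead
SELECT ord_no FROM oelinhst_sql WHERE ord_no = '  123456';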
It's typically user-defined functions (UDFs) that get a bad rap with regard to SQL performance, and they might be the source of the advice you're getting.
The reason is that you can build some pretty hairy functions that cause massive overhead, and that cost multiplies with every row they are applied to.
As you've found with RTRIM and LTRIM, this isn't a blanket reason to stop using all functions on the SQL side.
It somewhat depends on what all is encompassed by "things like string trimming", but for string trimming at least, I'd definitely let the database do that (there will be less network traffic as well). As for the indexes, they will still be used if your WHERE clause just uses the column itself (as opposed to a function of the column). Use of the indexes won't be affected whatsoever by using functions on the columns you're retrieving (only on how you're selecting the rows).
You may want to have a look at this for performance improvement suggestions: http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/
As I said in my comment, reduce the data read per query and you will get a speed increase.
You said: "our order numbers are six-digits long so each has two blank characters leading".
That makes me think you are storing numbers in a string; if so, why are you not using a numeric data type? The smallest numeric type that will hold 6 digits is an INT (I'm assuming SQL Server), and that already saves you 4 bytes per order number; over the number of rows you mention, that's quite a lot less data to read off disk and send over the network.
Fully optimise your database before looking to deal with the data outside of it; serving data is what a database server is designed to do.
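A hedged sketch of that change, assuming every ord_no really contains only digits and nothing (indexes, constraints, dependent code) blocks the conversion:
ALTER TABLE oelinhst_sql
    ALTER COLUMN ord_no INT NOT NULL;  -- assumes the column currently disallows NULLs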
As you found, it often pays to measure, but what I think your manager may have been referring to is something like this.
This is typically much faster:
SELECT SomeFields
FROM oelinhst_sql
WHERE datetimeField > '1/1/2011'
  AND datetimeField < '2/1/2011'
than this:
SELECT SomeFields
FROM oelinhst_sql
WHERE Month(datetimeField) = 1
  AND Year(datetimeField) = 2011
even though the rows that are returned are the same.

IP address numbers in MySQL subquery

I have a problem with a subquery involving IPV4 addresses stored in MySQL (MySQL 5.0).
The IP addresses are stored in two tables, both in network number format - e.g. the format output by MySQL's INET_ATON(). The first table ('events') contains lots of rows with IP addresses associated with them, the second table ('network_providers') contains a list of provider information for given netblocks.
events table (~4,000,000 rows):
event_id (int)
event_name (varchar)
ip_address (unsigned int)
network_providers table (~60,000 rows):
ip_start (unsigned int)
ip_end (unsigned int)
provider_name (varchar)
Simplified for the purposes of the problem I'm having, the goal is to create an export along the lines of:
event_id,event_name,ip_address,provider_name
If I do a query along the lines of either of the following, I get the result I expect:
SELECT provider_name FROM network_providers WHERE INET_ATON('192.168.0.1') >= network_providers.ip_start ORDER BY network_providers.ip_start DESC LIMIT 1
SELECT provider_name FROM network_providers WHERE 3232235521 >= network_providers.ip_start ORDER BY network_providers.ip_start DESC LIMIT 1
That is to say, it returns the correct provider_name for whatever IP I look up (of course I'm not really using 192.168.0.1 in my queries).
However, when performing this same query as a subquery, in the following manner, it doesn't yield the result I would expect:
SELECT
events.event_id,
events.event_name,
(SELECT provider_name FROM network_providers
WHERE events.ip_address >= network_providers.ip_start
ORDER BY network_providers.ip_start DESC LIMIT 1) as provider
FROM events
Instead, a different (incorrect) value for provider is returned. Over 90% (but curiously not all) of the values returned in the provider column contain the wrong provider information for that IP.
Using events.ip_address in a subquery just to echo out the value confirms that it contains the value I'd expect and that the subquery can read it. Replacing events.ip_address with an actual network number also works; it's only using it dynamically in the subquery in this manner that doesn't work for me.
I suspect the problem is there is something fundamental and important about subqueries in MySQL that I don't get. I've worked with IP addresses like this in MySQL quite a bit before, but haven't previously done lookups for them using a subquery.
The question:
I'd really appreciate an example of how I could get the output I want, and if someone here knows, some enlightenment as to why what I'm doing doesn't work so I can avoid making this mistake again.
Notes:
The actual real-world usage I'm trying to do is considerably more complicated (involving joining two or three tables). This is a simplified version, to avoid overly complicating the question.
Additionally, I know I'm not using a BETWEEN on ip_start and ip_end - that's intentional (the DBs can be out of date, and in such cases the owner in the DB is almost always in the next specified range, so a 'best guess' is fine in this context). However, I'm grateful for any suggestions for improvement that relate to the question.
Efficiency is always nice, but in this case absolutely not essential - any help appreciated.
You should take a look at this post:
http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/
It has some nice ideas for working with IPs in queries very similar to yours.
Another thing you should try is using a stored function instead of a sub-query. That would simplify your query as follows:
SELECT
    events.event_id,
    events.event_name,
    GET_PROVIDER_NAME(events.ip_address) AS provider
FROM events
There seems to be no way to achieve what I wanted with a JOIN or subquery.
To expand on Ike Walker's suggestion of using a stored function, I ended up creating the following stored function in MySQL:
DELIMITER //
DROP FUNCTION IF EXISTS get_network_provider //
CREATE FUNCTION get_network_provider(ip_address_number INT) RETURNS VARCHAR(255)
BEGIN
    DECLARE network_provider VARCHAR(255);
    SELECT provider_name INTO network_provider
    FROM network_providers
    WHERE ip_address_number >= network_providers.ip_start
      AND network_providers.provider_name != ""
    ORDER BY network_providers.ip_start DESC
    LIMIT 1;
    RETURN network_provider;
END //
Explanation:
The check to ignore blank names, and the use of >= with ORDER BY on ip_start rather than BETWEEN ip_start AND ip_end, is a specific fudge for the two combined network provider databases I'm using, both of which need to be queried in this way.
This approach works well when the query calling the function only needs to return a few hundred results (though it may take a handful of seconds). On queries that return a few thousand results, it may take 2 or 3 minutes. For queries with tens of thousands of results (or more) it's too slow for practical use.
This was not unexpected from using a stored function like this (i.e. every result returned triggering a separate query) but I did hit a drop in performance sooner than I had expected.
Recommendation:
The upshot of this was that I needed to accept that the data structure is just not suitable for my needs. This had already been pointed out to me by a friend; it just wasn't something I really wanted to hear at the time (because I really wanted to use that specific network_provider DB due to other keys in the table that were useful to me, e.g. for things like geolocation).
If you end up trying to use any of the IP provider DBs (or indeed any other database) that follow a similarly dubious data format, then I can only suggest that they are just not going to be suitable, and it's not worth trying to cobble something together that will work with them as they are.
At the very least you need to reformat the data so that it can be reliably used with a simple BETWEEN statement (no sorting, and no other comparisons), so you can use it with subqueries (or JOINs) - although data that messed up is probably not all that reliable anyway.
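Once the ranges are clean and non-overlapping, the lookup becomes a plain join; for example (a sketch - the LEFT JOIN keeps events whose IP falls in no known range):
SELECT e.event_id,
       e.event_name,
       e.ip_address,
       np.provider_name
FROM events e
LEFT JOIN network_providers np
       ON e.ip_address BETWEEN np.ip_start AND np.ip_end;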

SQL Optimization: XML or Delimited String

This is hopefully just a simple question involving performance optimizations when it comes to queries in SQL Server 2008.
I've worked for companies that use stored procs a lot for their ETL processes as well as some of their websites. I've seen scenarios where they need to retrieve specific records based on a finite set of key values. I've seen it handled in three different ways, illustrated via the pseudo-code below.
Dynamic SQL that concatenates a string and executes it:
EXEC('SELECT * FROM TableX WHERE xId IN (' + @Parameter + ')')
Using a user-defined function to split a delimited string into a table:
SELECT * FROM TableY INNER JOIN SPLIT(@Parameter) ON yID = splitId
Using XML as the parameter instead of a delimited varchar value:
SELECT * FROM TableZ JOIN @Parameter.nodes(xpath) AS x (y) ON ...
While I know creating the dynamic SQL in the first snippet is a bad idea for a large number of reasons, my curiosity comes from the last two examples. Is it more efficient to do the due diligence in my code to pass such lists via XML as in snippet 3, or is it better to just delimit the values and use a UDF to take care of it?
There is now a fourth option - table-valued parameters, whereby you can pass a table of values into a sproc as a parameter and then use it as you normally would a table variable. I'd prefer this approach over the XML (or CSV parsing) approach.
I can't quote performance figures between all the different approaches, but that's the one I'd be trying - I'd recommend doing some real performance tests on them.
Edit:
A little more on TVPs. In order to pass the values into your sproc, you just define a SqlParameter (SqlDbType.Structured) - the value of this can be set to any IEnumerable, DataTable or DbDataReader source. So presumably, you already have the list of values in a list/array of some sort - you don't need to do anything to transform it into XML or CSV.
I think this also makes the sproc clearer, simpler and more maintainable, providing a more natural way to achieve the end result. One of the main points is that SQL performs best at set-based work, not at looping or string manipulation.
That's not to say it will perform great with a large set of values passed in. But with smaller sets (up to ~1000) it should be fine.
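A minimal sketch of the SQL side of the TVP approach (the type, procedure and column names here are made up for illustration):
CREATE TYPE dbo.IdList AS TABLE (Id INT PRIMARY KEY);
GO
CREATE PROCEDURE dbo.GetRecordsByIds
    @Ids dbo.IdList READONLY
AS
BEGIN
    SELECT t.*
    FROM TableX AS t
    INNER JOIN @Ids AS i ON i.Id = t.xId;
END
GO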
UDF invocation is a little bit more costly than splitting the XML using the built-in function.
However, this only needs to be done once per query, so the performance difference will be negligible.
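For reference, a sketch of what option 3 looks like with the built-in XML methods (the element names and the zId column are assumptions):
DECLARE @Parameter XML = N'<ids><id>1</id><id>2</id><id>3</id></ids>';

SELECT z.*
FROM TableZ AS z
JOIN @Parameter.nodes('/ids/id') AS x(y)
    ON z.zId = x.y.value('.', 'int');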