SQL Server compare varbinary columns

I have a project at hand which uses SQL Server to store fingerprint bitmaps from a handheld terminal fingerprint reader.
My question: is there a way to perform the fingerprint match in the database, instead of bringing all the fingerprints back from the database for authentication?
Something like this query:
SELECT *
FROM table
WHERE fingerprintcolumn = fingerprint_template

You cannot do the comparison with a simple equality (=) operator. In reality, when you capture the same fingerprint twice, the two images will differ slightly in position, angle, and scan quality, so a byte-for-byte comparison is not possible.
You have to implement your own automated fingerprint identification system, or use a third-party fingerprint comparison service such as the Cams Fingerprint Comparison API.
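For what it's worth, an exact-match query against a varbinary column is syntactically valid; it just only returns rows whose stored template is byte-for-byte identical to the value supplied, which two separate scans of the same finger will essentially never be. A minimal sketch, with hypothetical table and column names:
-- Exact byte comparison of varbinary works, but it only finds identical blobs,
-- not "matching" fingerprints. Names below are illustrative.
DECLARE @fingerprint_template varbinary(max) = 0x0102030405;  -- placeholder template

SELECT *
FROM dbo.FingerprintTable
WHERE fingerprintcolumn = @fingerprint_template;  -- matches only an exact byte-for-byte copy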

Related

SQL: Cross-platform generation of N-digit unique identifier (SQL Server, Snowflake, etc.)

We have two databases/warehouses on two different platforms: Microsoft SQL Server and Snowflake (cloud data warehouse).
Across both, customers are identified via a unique AccountId (integer) and Uuid (32-character).
For a particular use case, we need to take one of these unique values (say, the AccountId for instance), pass it into a system function, and generate a unique 20-character identifier (it can't be longer/shorter).
This function needs to exist in both systems (e.g. SELECT sys.myfn(1234) returns the same value in each).
I am aware that Snowflake has functions like sha1(): https://docs.snowflake.com/en/sql-reference/functions/sha1.html
Which are equivalent to HASHBYTES() in SQL Server: https://learn.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver15
How do I take the output from either and truncate it down to 20 characters and maintain uniqueness?
A UUID is a 128-bit value (with a few bits reserved for version information). If you run that through a hash function, perform a base64 encoding of the hash, and then truncate to 20 characters, you still get 20 * 6 = 120 bits of range. The chance of collision is still in the life-of-the-universe ballpark.
(Note: If you choose to base64 encode the UUID directly, truncation may yield collisions for sequentially assigned UUIDs.)
The integer value can be similarly encoded with little chance of collision with the UUID based values.
If you can find equivalent usable base64 encoding implementations on both platforms, I think you will be on your way to a solution.
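As a rough sketch of the SQL Server half only (variable names are illustrative, and I haven't verified that a Snowflake SHA1/base64 pipeline over the same input string lines up character-for-character):
-- Hash the identifier, base64-encode the 20-byte SHA-1 digest (28 base64
-- characters), and keep the first 20, which preserves 120 bits of the hash.
DECLARE @AccountId int = 1234;
DECLARE @Digest varbinary(20) = HASHBYTES('SHA1', CAST(@AccountId AS varchar(10)));

SELECT LEFT(
    CAST(N'' AS xml).value('xs:base64Binary(sql:variable("@Digest"))', 'varchar(40)'),
    20) AS ShortId;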

Obfuscate Phone Numbers Consistently

We have phone number fields that we need to obfuscate in a UAT environment. The problem is that each number needs to stay unique and must match other data processes, using other databases, that are also obfuscated. I'm trying to create a function that will reliably scramble a number so that each number passed in produces the same scrambled number every time, using some kind of encryption key that we'll store safely. I haven't found a way to reliably reproduce numbers in the same 10-digit format. Any ideas?
Why not use a hash function, which will give you the same value for the same input every time?
E.g. in Python (use hashlib rather than the built-in hash(), which is not stable across processes):
hashlib.sha256('012345677899'.encode()).hexdigest()
or in T-SQL:
SELECT HASHBYTES('SHA2_256', '0103203803')
https://learn.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver15
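A hash on its own gives you a hex digest rather than a 10-digit number, so here is a hedged sketch (not a vetted process) of reducing a keyed hash to a fixed 10-digit string in T-SQL. The key and names are placeholders, and because this is not a true permutation, collisions are theoretically possible:
-- Derive a repeatable 10-digit value from a phone number using a secret key.
DECLARE @Key   varchar(64) = 'my_secret_key';   -- hypothetical secret, stored safely
DECLARE @Phone varchar(10) = '0103203803';

-- Hash key + number, take 8 bytes of the digest as a bigint, reduce mod 10^10,
-- and left-pad to a fixed 10-digit string.
SELECT RIGHT(REPLICATE('0', 10) +
             CAST(ABS(CAST(SUBSTRING(HASHBYTES('SHA2_256', @Key + @Phone), 1, 8) AS bigint)) % 10000000000 AS varchar(10)),
             10) AS ObfuscatedPhone;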
I believe Column Encryption is what you're looking for. You can encrypt the column, then pass the encrypted value.
SQLShack did a good write up as well.
Column Encryption is not what Steve is looking for. The phone number fields need to be obfuscated in the lower environment after a refresh from production, in 2 separate tables, and we need to guarantee that the same number of rows match before and after the process completes.
The process below seemed to work, but the before count did not match the after count.
SET [somePhone] = BINARY_CHECKSUM([somePhone])
Microsoft dynamic-data-masking may be a better option.
https://learn.microsoft.com/en-us/sql/relational-databases/security/dynamic-data-masking?view=sql-server-ver15
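If masking values at query time (rather than permanently rewriting them) is acceptable, it is applied per column; a minimal illustration with invented table and column names follows. Note that masking is presentation-only, so it will not by itself give the cross-database consistency asked for above.
-- Masks the column for non-privileged users at query time; the stored values
-- are unchanged. Table/column names are illustrative.
ALTER TABLE dbo.Customers
    ALTER COLUMN somePhone ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XXX-",4)');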

Database vs. Front-End for Output Formatting

I've read that (all things being equal) PHP is typically faster than MySQL at arithmetic and string manipulation operations. This being the case, where does one draw the line between what one asks the database to do versus what is done by the web server(s)? We use stored procedures exclusively as our data-access layer. My unwritten rule has always been to leave output formatting (including string manipulation and arithmetic) to the web server; a sketch of what that looks like follows the list below. So our queries return:
unformatted dates
null values
no calculated values (i.e. return values for columns "foo" and "bar" and let the web server calculate foo*bar if it needs to display value foobar)
no substring-reduced fields (except when the shortened field is so much shorter that we want to do it at the database level to reduce result set size)
two separate columns, to let the front end case the output as required
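By way of illustration (table and column names are invented), the kind of result set we return looks like this:
-- Raw values only: the web tier formats the date, computes foo * bar when it
-- needs the combined value, and cases/combines the name columns itself.
SELECT created_on,    -- unformatted date
       foo,
       bar,            -- no calculated foo * bar column
       first_name,
       last_name       -- two separate columns, not a concatenation
FROM widgets
WHERE widget_id = 123;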
What I'm interested in is feedback about whether this is generally an appropriate approach or whether others know of compelling performance/maintainability considerations that justify pushing these activities to the database.
Note: I'm intentionally tagging this question to be dbms-agnostic, as I believe this is an architectural consideration that comes into play regardless of one's specific dbms.
I would draw the line based on how likely each layer is to be swapped out for another implementation. It's very likely that you will never use a different RDBMS or have a mobile version of your site, but you never know.
The more orthogonal a data point is, the closer it should be to being released from the database in that form. If on every theoretical version of your site your values A and B are rendered A * B, that should be returned by your database as A * B and never calculated client side.
Let's say you have something that's format heavy like a date. Sometimes you have short dates, long dates, English dates... One pure form should be returned from the database and then that should be formatted in PHP.
So the orthogonality point works in reverse as well. The more dynamic a data point is in its representation/display, the more it should be handled client side. If a string A is always taken as a substring of the first six characters, then have that be returned from the database as pre-substring'ed. If the length of the substring depends on some factor, like six for mobile and ten for your web app, then return the larger string from the database and format it at run time using PHP.
Usually, data formatting is better done on the client side, especially culture-specific formatting.
Dynamic pivoting (i.e. variable columns) is also an example of what is better done on the client side.
When it comes to string manipulation and dynamic arrays, PHP is far more powerful than any RDBMS I'm aware of.
However, data formatting can require additional data which is also kept in the database. For instance, the coloring info for each row can be stored in an additional table.
You would then match the color to each row on the database side, but wrap it in the markup on the PHP side.
The rule of thumb is: retrieve everything you need for formatting in as few database round-trips as possible, then do the formatting itself on the client side.
I believe in returning the data pretty much as-is from the database and letting it be formatted on the front end instead. I don't stick to it religiously, but in general I think it's better as it provides greater flexibility - e.g. one sproc can service n different requirements for data, each of which can format the data as it individually needs. Otherwise, you end up with multiple queries returning the same data with slightly different formatting from the DB, which (from a SQL Server point of view) reduces execution plan caching benefits and therefore hurts performance.
Leave output formatting to the web server

SQL server string manipulation in a view... Or in XSLT

I have been passed a piece of work that I can either do in my application or perhaps in SQL:
I have to get a date out of a string that may look like this:
1234567-DSP-01/01-VER-01/01
or like this:
1234567-VER-01/01-DSP-01/01
but may look like this:
00 12345 DISCH 01/01-VER-01/01 XXX X XXXXX
Yay. If it is a "DSP" then I want that date; if a "DISCH", then that date.
I am pulling the data out in a SQL Server view and would be happy to have the view transform the data for me. My application could do it, but that would add processor time. I could also see if the data could be manipulated before it is entered into the DB, I suppose.
Thank you for your time.
An option would be to check for the presence of DSP or DISCH then substring out the date as necessary.
For example (I don't have SQL Server today so I can't verify the syntax, sorry), assuming the date is always the 5 characters (mm/dd) immediately after 'DSP-' or 'DISCH ':
select
    date = case
        when charindex('DSP', date_attribute) > 0
            then substring(date_attribute, charindex('DSP', date_attribute) + 4, 5)
        when charindex('DISCH', date_attribute) > 0
            then substring(date_attribute, charindex('DISCH', date_attribute) + 6, 5)
        else 'unknown'
    end
from myTable
don't store multiple items in the same column!
store the date in its own column when inserting the row!
add a new nullable column for the date
write an update that pulls the date out and sets the new column
alter the column to be not nullable
fix your save routine to pull the date out and insert it for you (the column/update/alter steps are sketched below)
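A rough sketch of those migration steps in T-SQL, assuming an invented table myTable, a varchar column date_attribute, and dates always formatted mm/dd right after the DSP/DISCH marker (per the samples in the question):
-- 1) add a new nullable column for the date
alter table myTable add dsp_date varchar(5) null;

-- 2) pull the date out of the existing strings
update myTable
set dsp_date = case
        when charindex('DSP', date_attribute) > 0
            then substring(date_attribute, charindex('DSP', date_attribute) + 4, 5)
        when charindex('DISCH', date_attribute) > 0
            then substring(date_attribute, charindex('DISCH', date_attribute) + 6, 5)
    end;

-- 3) tighten the column once every row has a value
alter table myTable alter column dsp_date varchar(5) not null;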
If you do it in the view, you're adding processing time on SQL, which in general is a more expensive resource than an app, web, or some other type of client.
I'd recommend you try to format the data when you insert it, or handle it in the application tier. Scaling an app tier horizontally is so much easier than scaling your SQL.
Edit
I mean that the database server's physical resources are usually more expensive than a properly designed application server's physical resources. This is because it is very easy to scale an application horizontally, while it is, in my opinion, an order of magnitude more expensive to scale a DB server horizontally, especially if you're dealing with a transactional database and need to manage merging.
I am not saying it is not possible, just that scaling a database server horizontally is a much more difficult task, hence it's more expensive. The only reason I pointed this out is that the OP raised a concern about using CPU cycles on the app server vs the database server. Most applications I have worked with have been data-centric applications which processed through GBs of data to get a user an answer. We initially put everything on the database server because it was easier than doing it in classic ASP and VB6 at the time. Over time the DB server became more and more loaded until scaling vertically was no longer an option.
Database servers are also designed for retrieving and joining data. You should leave the formatting of the data to the application and business rules (in general, of course).

SQL SHA1 inside WHERE

In my program, we store a user's IP address in a record. When we display a list of records to a user, we don't want to give away the other user's IP, so we SHA1 hash it. Then, when the user clicks on a record, it goes to a URL like this:
http://www.example.com/allrecordsbyipaddress.php?ipaddress=SHA1HASHOFTHEIPADDRESS
Now, I need to list all the records by the IP address specified in the SHA1 hash. I tried this:
SELECT * FROM records
WHERE SHA1(IPADDRESS)="da39a3ee5e6b4b0d3255bfef95601890afd80709"
but this does not work. How would I do this?
Thanks,
Isaac Waller
Don't know if it matters, but your SHA1 hash da39a3ee5e6b4b0d3255bfef95601890afd80709 is a well-known hash of an empty string.
Is it just an example, or did you forget to provide an actual IP address to the hash calculation function?
Update:
Does your webpage code generate SHA1 hashes in lowercase?
This check will fail in MySQL:
SELECT SHA1('') = 'DA39A3EE5E6B4B0D3255BFEF95601890AFD80709'
In this case, use this:
SELECT SHA1('') = LOWER('DA39A3EE5E6B4B0D3255BFEF95601890AFD80709')
, which will succeed.
Also, you can precalculate the SHA1 hash when you insert the records into the table:
INSERT
INTO ip_records (ip, ip_sha)
VALUES (#ip, SHA1(CONCAT('my_secret_salt', #ip)))
SELECT *
FROM ip_records
WHERE ip_sha = #my_salted_sha1_from_webpage
This will return the original IP and allow indexing of ip_sha, so that this query will work fast.
I'd store the SHA1 of the IP in the database along with the raw IP, so that the query would become
SELECT * FROM records WHERE ip_sha1 = "..."
Then I'd make sure that the SHA1 calculation happens in exactly one place in the code, so that there's no opportunity for it to be done slightly differently in multiple places. That also gives you the opportunity to mix a salt into the calculation, so that someone can't simply compute the SHA1 of an IP address they're interested in and pass that in by hand.
Storing the SHA1 hash in the database also gives you the opportunity to add a secondary index on ip_sha1 to speed up that SELECT. If you have a very large data set, doing the SHA1 in the WHERE clause forces the database to do a complete table scan, recomputing the hash for every record on every scan.
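For example (MySQL syntax, using the records table and ip_sha1 column name from the query above; the index name is arbitrary):
-- Store the precalculated hash alongside the raw IP and index it so the
-- lookup is an index seek rather than a full scan.
ALTER TABLE records ADD COLUMN ip_sha1 CHAR(40);
CREATE INDEX idx_records_ip_sha1 ON records (ip_sha1);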
Every time I've had an unexpected hashing mismatch, it was because I accidentally hashed a string that included some whitespace, such as "\n".
Just a quick thought: that's a very simple obfuscation. There are only 2^32 possible IPv4 addresses, so if somebody with technical knowledge wanted to figure it out, they could do so by calculating all 4 billion hashes, which wouldn't take very long. Depending on the sensitivity of those IP addresses, you may want to consider a private lookup table.
Did you compare the output of your hash algorithm with the output of MySQL's SHA1()? For example for IP address 1.2.3.4?
I ended up encrypting the IP addresses, and decrypting them on the other page. Then I can just use the raw IP address in the SQL query. Also, it protects against brute force attacks, like Autocracy said.