How to convert a SQL Server HASHBYTES string to its Spark equivalent

I have a process using the following SELECT statement in SQL Server:
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
This gives you: 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
I have been trying to re-create this same result in Spark SQL without luck.
I tried this sha1(encode('4100119300','utf-8')) in Spark
But the result of this is: b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4
During my tests, if I remove the CAST on the SQL side, the result is the same as Spark's. The problem I see is that in Spark you can't specify the size of the string, or maybe the encoding is being changed in the process. I already have data in SQL hashed with NVARCHAR(100), and removing that cast is not an option, so the Spark side has to match it.
Any suggestions?

Have you seen these differences?
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
-- 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
SELECT HASHBYTES('SHA1', '4100119300') AS StringConverted
-- 0xB4CF5AAE8CE3DC1673DA4949CFDF2EDFA33FDBA4
To store varbinary values as strings I use CONVERT with the style = 1 flag (see CAST and CONVERT):
SELECT CONVERT(VARCHAR(100), HASHBYTES('SHA1', '4100119300'), 1) AS StringConverted
And this is what you are looking for: it is simply Spark's b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4 in upper case and with a 0x prefix.

The first thing to highlight here is that the difference between VARCHAR and NVARCHAR is the encoding: NVARCHAR is UTF-16LE, so you just need to hash with that same encoding ('utf_16_le') to regenerate the hash key.
To regenerate:
SELECT CONVERT(VARCHAR(254), HASHBYTES('SHA2_512', CONVERT(NVARCHAR(24), '2020-05-27 00:00:00.000', 127)), 2)
you will need something like this in PySpark:
import hashlib
hashlib.sha512('2020-05-27 00:00:00.000'.encode('utf_16_le')).hexdigest().upper()
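Applying the same idea directly in Spark SQL for the value from the question (a sketch; the expected output is just the T-SQL NVARCHAR hash from the question, lower-cased and without the 0x prefix):
SELECT sha1(encode('4100119300', 'UTF-16LE')) AS StringConverted
-- expected: 66a2f63c04a3a85347ad2f5cd99f1113f1bdd9ce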
Link to related issue: How to reproduce the behavior of SQL NVARCHAR in Python when generating a SHA-512 hash?
Hope that helps. Thanks.

Related

Spark SQL - Convert SHA1 to BIGINT

I've been tasked with moving a lot of T-SQL into Spark (Databricks). The procedures I'm converting create surrogate keys in a somewhat typical manner for BI. What I'm trying to figure out is the Spark equivalent of the following T-SQL:
select convert(bigint,hashbytes('sha1', N'570'))
-- Returns: -1488953326447475322
In Spark using SQL I can get the same hashbytes by doing:
select sha1(encode('570', 'UTF-16LE'))
--2c14de511f01a8abec0a4f15eb562cd6a1f64586 in Spark
--0x2C14DE511F01A8ABEC0A4F15EB562CD6A1F64586 in T-SQL
What I'm struggling to figure out is how to convert the returned hash into a bigint. I know that SHA1 is a 20-byte result and bigint is only 8 bytes, so there is truncation going on, but when I try to force this truncation using CONV, as I've seen suggested, I don't get close to the result I'm after.
select conv(substring(sha1(encode('570', 'UTF-16LE')), 0, 16), 16, 10)
--Returns: 3176408077196961963
Has anyone accomplished this?
select hash(sha1(encode('570', 'UTF-16LE')))
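One thing worth noting: when SQL Server casts a varbinary longer than 8 bytes to bigint it keeps the rightmost 8 bytes and reads them as a signed big-endian integer, so it is the last 16 hex characters of the SHA1 that matter, not the first. A sketch of the same conversion in Spark SQL (assuming conv follows the MySQL/Hive convention where a negative target base yields a signed result):
select cast(conv(substring(sha1(encode('570', 'UTF-16LE')), 25, 16), 16, -10) as bigint)
-- expected: -1488953326447475322, the same value T-SQL's convert(bigint, hashbytes('sha1', N'570')) returns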

Is there a way to query JSON column in SQL Server ignoring capitalization of keys?

I am trying to query a JSON column that has mixed capitalization. For instance, some rows have keys that are all lower case like below:
{"name":"Screening 1","type":"template","pages":[{"pageNumber":1,...}
However, some of the rows have keys that are capitalized on its first letter like this:
{"Type":"template","Name":"Screening2","Pages":[{"PageNumber":1,...}
Unfortunately, SQL Server seems to support only a case-sensitive JSON path system. Therefore, I can't query all rows successfully. If I use a lower-case path like '$.pages' in a query like the one below:
SELECT ST.Id AS Screening_Tool_Id
, ST.Name AS Screening_Tool_Name
, ST.Description AS Screening_Tool_Description
, COUNT(JSON_VALUE (SRQuestions.value, '$.id')) AS Question_Specific_Id
FROM dbo.ScreeningTemplate AS ST
CROSS APPLY OPENJSON(ST.Workflow, '$.pages') AS SRPages
CROSS APPLY OPENJSON(SRPages.Value, '$.sections') AS SRSections
I miss any row that has capitalized keys. Is there any way to query all rows ignoring their capitalization?
According to MS, it looks like you're stuck with a case-sensitive query:
When OPENJSON parses a JSON array, the function returns the indexes of the elements in the JSON text as keys. The comparison used to match path steps with the properties of the JSON expression is case-sensitive and collation-unaware (that is, a BIN2 comparison).
https://learn.microsoft.com/en-us/sql/t-sql/functions/openjson-transact-sql
If the only variations are in the capitalization of the first character, you could try to work around this limitation by creating queries with the variants and UNION the results together.
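For example, a minimal sketch of that UNION workaround against the question's table (it assumes only the two observed spellings of the pages key, and that no document contains both):
SELECT ST.Id, SRPages.[value] AS Page
FROM dbo.ScreeningTemplate AS ST
CROSS APPLY OPENJSON(ST.Workflow, '$.pages') AS SRPages
UNION ALL
SELECT ST.Id, SRPages.[value] AS Page
FROM dbo.ScreeningTemplate AS ST
CROSS APPLY OPENJSON(ST.Workflow, '$.Pages') AS SRPages;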
Maybe you can just lower-case the JSON (note that LOWER() also lower-cases the values, not just the keys):
COUNT(JSON_VALUE (lower(SRQuestions.value), '$.id')) AS Question_Specific_Id
Old question but I came across this when googling a similar issue so I will chip in with my solution:
SELECT @pb = PB from
OPENJSON(@PropertyBagsAsJson, '$."$values"')
WITH (
PbId1 nvarchar(MAX) 'lax $.Id',
PbId2 nvarchar(MAX) 'lax $.id',
PB nvarchar(MAX) '$' AS JSON
)
WHERE COALESCE(PbId1,PbId2) = @PropertyBagId
I hope the example is clear. Basically I just add all possible casings of the property and then use COALESCE to filter the results.
You can use openjson. Instead of
JSON_VALUE (SRQuestions.value, '$.id')
you can write
(select Value
from openjson( SRQuestions.value )
where [Key] collate Latin1_General_CI_AS = 'id')
You must use a case-insensitive ("_CI") collation here. A UTF-8 "_CI" collation works too, as does "database_default" if the database uses a CI collation.
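To make the key-matching approach concrete, here is a minimal, self-contained sketch (the JSON literal and the @json variable are made up for illustration):
DECLARE @json nvarchar(max) = N'{"Id": 42, "Name": "Screening 1"}';
SELECT [value]
FROM OPENJSON(@json)
WHERE [key] COLLATE Latin1_General_CI_AS = 'id';  -- matches "id", "Id", "ID", ... and returns the value of the "Id" property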

Can I convert centigrade to Fahrenheit in a query (not a function)?

In Oracle I can convert centigrade to Fahrenheit in a SQL query, see below. It seems SQL Server does not have full regex functionality. Is it possible to do this without dropping into a function, which is what I currently do?
(UNISTR('\00B0') is the degree symbol we use.)
The requirement is for any string that contains [digits]°C to be converted to same string with [new_digits]°F.
SELECT replace(replace(v_text_f,replace(regexp_substr(v_text_f,'\-?[[:digit:]]+\.?[[:digit:]]*'||UNISTR('\00B0')||'C'),UNISTR('\00B0')||'C'),
replace(regexp_substr(v_text_f,'\-?[[:digit:]]+\.?[[:digit:]]*'||UNISTR('\00B0')||'C'),UNISTR('\00B0')||'C')*9/5+32||UNISTR('\00B0')||'F'),
UNISTR('\00B0')||'F'||UNISTR('\00B0')||'C',UNISTR('\00B0') ||'F')
FROM (SELECT '38'||UNISTR('\00B0')||'C' as v_text_f FROM DUAL)
Try this; compared to the Oracle code it is an extremely simplified version:
DECLARE @C nvarchar(10) = '38'+CHAR(0x00B0)+'C' --38°C
SELECT CONVERT(nvarchar(10),CONVERT(int ,LEFT(@C, CHARINDEX(CHAR(0x00B0), @C)-1))*9/5+32)+CHAR(0x00B0)+'F'
--100°F
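One caveat: the CONVERT(int, ...) arithmetic above is all integer, so 38°C comes out as 100°F rather than 100.4°F. A decimal variant along the same lines could look like this (the 38.5 test value and the decimal(6,1) precision are just illustrative):
DECLARE @C nvarchar(10) = '38.5' + CHAR(0x00B0) + 'C' --38.5°C
SELECT CONVERT(nvarchar(10), CONVERT(decimal(6,1), CONVERT(decimal(6,1), LEFT(@C, CHARINDEX(CHAR(0x00B0), @C) - 1)) * 9 / 5 + 32)) + CHAR(0x00B0) + 'F'
--101.3°F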

Converting a String to HEX in SQL

I'm looking for a way to transform a genuine string into its hexadecimal value in SQL. I'm looking for something that is Informix-friendly, but I would obviously prefer something database-neutral.
Here is the select I am using now:
SELECT SomeStringColumn from SomeTable
Here is the select I would like to use:
SELECT hex( SomeStringColumn ) from SomeTable
Unfortunately nothing is that simple... Informix gives me this message:
Character to numeric conversion error
Any idea?
Can you use CAST and fn_varbintohexstr?
SELECT master.dbo.fn_varbintohexstr(CAST(SomeStringColumn AS varbinary))
FROM SomeTable
I'm not sure if you have that function in your database system; it exists in MS SQL.
I just tried it in my SQL Server MMC on one of my tables:
SELECT master.dbo.fn_varbintohexstr(CAST(Addr1 AS VARBINARY)) AS Expr1
FROM Customer
This worked as expected. What I know as master.dbo.fn_varbintohexstr on MS SQL might be similar to the Informix hex() function, so possibly try:
SELECT hex(CAST(Addr1 AS VARBINARY)) AS Expr1
FROM Customer
The following works in SQL Server 2005.
select convert(varbinary, SomeStringColumn) from SomeTable
Try this:
select convert(varbinary, '0xa3c0', 1)
The hex number needs to have an even number of digits. To get around that, try:
select convert(varbinary, '0x' + RIGHT('00000000' + REPLACE('0xa3c','0x',''), 8), 1)
If it is possible for you to do this in code in the database client, it might be easier.
Otherwise the error probably means that the built-in hex function can't work with your values as you expect. I would first double-check that the input value is trimmed and in the expected format; it might be that simple. Then I would consult the database documentation for the hex function, see what input it expects, compare that to some of your values, and work out how to change your values to match.
A simple Google search for "informix hex function" brought up a first result page with the sentence: "Must be a literal integer or some other expression that returns an integer". If your data type is a string, first convert the string to an integer. At first glance it looks like you can do that with the cast function (I am not sure about this):
select hex(cast(SomeStringColumn as int)) from SomeTable
what about:
declare @hexstring varchar(max);
set @hexstring = 'E0F0C0';
select cast('' as xml).value('xs:hexBinary( substring(sql:variable("@hexstring"), sql:column("t.pos")) )', 'varbinary(max)')
from (select case substring(@hexstring, 1, 2) when '0x' then 3 else 0 end) as t(pos)
I saw this here:
http://blogs.msdn.com/b/sqltips/archive/2008/07/02/converting-from-hex-string-to-varbinary-and-vice-versa.aspx
Sorry, that works only on MS SQL 2005 and later.
Old post, but in my case I also had to remove the 0x part of the hex, so I used the code below. (I'm using MS SQL.)
convert(varchar, convert(Varbinary(MAX), YOURSTRING),2)
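As a quick sanity check of that expression with a literal string (the output shown is what I would expect from style 2, which drops the 0x prefix):
SELECT CONVERT(varchar(64), CONVERT(varbinary(MAX), 'Hello'), 2)
-- 48656C6C6F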
SUBSTRING(CONVERT(varbinary, Addr1), 1, 1) as Expr1

Convert HashBytes to VarChar

I want to get the MD5 Hash of a string value in SQL Server 2005. I do this with the following command:
SELECT HashBytes('MD5', 'HelloWorld')
However, this returns a VarBinary instead of a VarChar value. If I attempt to convert 0x68E109F0F40CA72A15E05CC22786F8E6 into a VarChar I get há ðô§*à\Â'†øæ instead of 68E109F0F40CA72A15E05CC22786F8E6.
Is there any SQL-based solution?
Yes
I have found the solution elsewhere:
SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', 'HelloWorld')), 3, 32)
SELECT CONVERT(NVARCHAR(32),HashBytes('MD5', 'Hello World'),2)
Use master.dbo.fn_varbintohexsubstring(0, HashBytes('SHA1', @input), 1, 0) instead of calling master.dbo.fn_varbintohexstr and then taking a substring of the result.
In fact, fn_varbintohexstr calls fn_varbintohexsubstring internally. The first argument of fn_varbintohexsubstring tells it whether or not to add the 0x prefix; fn_varbintohexstr calls fn_varbintohexsubstring with 1 as that first argument.
Because you don't need the 0x prefix, call fn_varbintohexsubstring directly.
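For example (these are undocumented functions, so take this as a sketch; the second call should return the same hex digits as the first, just without the prefix):
SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', 'HelloWorld'))                -- includes the 0x prefix
SELECT master.dbo.fn_varbintohexsubstring(0, HashBytes('MD5', 'HelloWorld'), 1, 0) -- same hex digits, no prefix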
Contrary to what David Knight says, these two alternatives return the same response in MS SQL 2008:
SELECT CONVERT(VARCHAR(32),HashBytes('MD5', 'Hello World'),2)
SELECT UPPER(master.dbo.fn_varbintohexsubstring(0, HashBytes('MD5', 'Hello World'), 1, 0))
So it looks like the first one is a better choice, starting from version 2008.
convert(varchar(34), HASHBYTES('MD5','Hello World'),1)
(style 1 converts the binary value to a hex string and keeps the 0x prefix)
Convert this to lower case and remove the 0x from the start of the string with SUBSTRING:
substring(lower(convert(varchar(34), HASHBYTES('MD5','Hello World'),1)),3,32)
This is exactly the same as what we get in C# after converting the bytes to a string.
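Putting styles 1 and 2 side by side for the same input (the hash values in the comments are written from memory, so double-check them):
SELECT CONVERT(varchar(34), HASHBYTES('MD5', 'Hello World'), 1)  -- 0xB10A8DB164E0754105B7A99BE72E3FE5
SELECT CONVERT(varchar(32), HASHBYTES('MD5', 'Hello World'), 2)  -- B10A8DB164E0754105B7A99BE72E3FE5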
From personal experience using the following code within a stored procedure that hashed a procedure variable, I can confirm that, although undocumented, this combination works exactly as in my example:
@var = SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('SHA2_512', @SPvar)), 3, 128)
Changing the datatype to varbinary seems to work the best for me.