Spark SQL - Convert SHA1 to BIGINT - apache-spark-sql

I've been tasked with moving a lot of T-SQL into Spark (Databricks). The procedures I'm converting create surrogate keys in a fairly typical manner for BI. What I'm trying to figure out is the Spark equivalent of the following T-SQL:
select convert(bigint,hashbytes('sha1', N'570'))
-- Returns: -1488953326447475322
In Spark using SQL I can get the same hashbytes by doing:
select sha1(encode('570', 'UTF-16LE'))
--2c14de511f01a8abec0a4f15eb562cd6a1f64586 in Spark
--0x2C14DE511F01A8ABEC0A4F15EB562CD6A1F64586 in T-SQL
What I'm struggling to figure out is how to convert the returned hash into a bigint. I know that SHA1 produces a 20-byte result and a bigint is only 8 bytes, so there is truncation going on, but when I try to force this truncation using CONV, as I've seen suggested, I don't get anywhere near the result I'm after.
select conv(substring(sha1(encode('570', 'UTF-16LE')), 0, 16), 16, 10)
--Returns: 3176408077196961963
Has anyone accomplished this?

select hash(sha1(encode('570', 'UTF-16LE')))
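Note that Spark's built-in hash() is a Murmur3-based hash, not SHA-1, so wrapping sha1() in hash() will not reproduce the T-SQL value. What T-SQL's convert(bigint, hashbytes(...)) appears to do is keep only the trailing 8 bytes of the 20-byte digest and read them as a signed big-endian integer, whereas the CONV attempt above reads the leading 8 bytes as an unsigned value. A minimal PySpark sketch of that interpretation (the UDF and column names are illustrative, and spark is the existing session):
import hashlib
from pyspark.sql import functions as F, types as T

@F.udf(returnType=T.LongType())
def sha1_to_bigint(s):
    # NVARCHAR in SQL Server is UTF-16LE, so encode the same way before hashing
    digest = hashlib.sha1(s.encode('utf-16-le')).digest()               # 20 bytes
    return int.from_bytes(digest[-8:], byteorder='big', signed=True)    # trailing 8 bytes, signed

df = spark.createDataFrame([('570',)], ['business_key'])
df.select(sha1_to_bigint('business_key')).show()
# expected: -1488953326447475322, matching the T-SQL result above
Depending on the Spark version, a pure SQL form such as cast(conv(substring(sha1(encode('570', 'UTF-16LE')), 25, 16), 16, -10) as bigint) may work as well, since a negative target base makes conv treat the 64-bit value as signed, but verify conv's overflow behaviour on your runtime before relying on it.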

Related

SQL function to transform number with a certain pattern

I need a SQL query to transform an int with a value between 1 and 300000 into a number with this pattern: always 8 digits.
For example:
1 becomes 00000001,
123 becomes 00000123,
123456 becomes 00123456.
I have no idea how to do that... How can I do it?
In Standard SQL, you can use this trick:
select substring(cast( (num + 100000000) as varchar(255)) from 2)
Few databases actually support this syntax. Any given database can do what you want, but the method depends on the database you are using.
For MS SQL Server
You could use the FORMAT function, like this:
SELECT FORMAT(123,'00000000')
https://database.guide/how-to-format-numbers-in-sql-server/#:~:text=Starting%20from%20SQL%20Server%202012,the%20output%20should%20be%20formatted.
See the Leading Zeroes section at that link.
For MySQL/Oracle
You could use LPAD, like this:
SELECT LPAD('123',8,'0')
https://database.guide/how-to-add-leading-zeros-to-a-number-in-mysql/
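Since the surrounding thread is about Spark/Databricks, the same zero-padding is available there too via lpad. A minimal PySpark sketch (the column name and sample data are illustrative, and spark is the existing session):
from pyspark.sql import functions as F

df = spark.createDataFrame([(1,), (123,), (123456,)], ['num'])
df = df.withColumn('padded', F.lpad(F.col('num').cast('string'), 8, '0'))
df.show()
# 1 -> 00000001, 123 -> 00000123, 123456 -> 00123456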

How to convert hashbytes string from sql to spark equivalent

I have a process using the following SELECT statement in SQL Server:
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
This gives you: 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
I have been trying to re-create this same result in Spark SQL without luck.
I tried sha1(encode('4100119300', 'utf-8')) in Spark.
But the result of this is: b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4
During my tests, if I remove the cast on the SQL side, the result is the same as in Spark. The problem I see is that in Spark you can't specify the size of the string, or maybe the encoding is being changed in the process. I already have data in SQL hashed as NVARCHAR(100), so that behaviour cannot be dropped from the Spark equivalent; Spark has to match it.
Any suggestions ?
Have you seen these differences?
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
-- 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
SELECT HASHBYTES('SHA1', '4100119300') AS StringConverted
-- 0xB4CF5AAE8CE3DC1673DA4949CFDF2EDFA33FDBA4
To store a varbinary value as a string I use CONVERT with the style = 1 flag (CAST & CONVERT):
SELECT CONVERT(VARCHAR(100), HASHBYTES('SHA1', '4100119300'), 1) AS StringConverted
And this is what you are looking for: it's simply Spark's b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4, which Spark shows in lowercase and without the 0x prefix.
The first thing to highlight here is that the VARCHAR vs NVARCHAR difference comes down to encoding, so you just need the same encoding to regenerate the hash key; for NVARCHAR that is UTF-16LE ('utf_16_le' in Python).
To regenerate:
SELECT CONVERT(VARCHAR(254), HASHBYTES('SHA2_512', CONVERT(NVARCHAR(24), '2020-05-27 00:00:00.000', 127)), 2)
You will need something like this in PySpark:
import hashlib
hashlib.sha512('2020-05-27 00:00:00.000'.encode('utf_16_le')).hexdigest().upper()
Link to related issue: How to reproduce the behavior of SQL NVARCHAR in Python when generating a SHA-512 hash?
Hope that helps. Thanks.
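Assuming the UTF-16LE reasoning above carries over to the SHA1/NVARCHAR(100) case in this question, the same idea can be expressed directly in Spark SQL with encode(..., 'UTF-16LE'), as in the first question in this thread. A sketch run from PySpark (spark is the existing session):
spark.sql("SELECT upper(sha1(encode('4100119300', 'UTF-16LE'))) AS StringConverted").show(truncate=False)
# expected: 66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
# i.e. the HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) value without the 0x prefix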

merge databricks SQL of different decimal datatypes

I have a Databricks SQL query that results in a data type of decimal(18,0).
I want to append the results of this query to an existing table (df.write.format("delta").mode("append").save("a_path")) but cannot, because the target column has a data type of decimal(38,18).
When I try to append, the error I get is:
AnalysisException: Failed to merge fields 'id' and 'id'. Failed to merge decimal types with incompatible precision 38 and 18 & scale 18 and 0;
Is there a way around this?
I tried to cast the result of the query to decimal(38,18), i.e. select cast(id decimal(38,18))..., but this did not work.
Any suggestions?
As a workaround, I converted the query's column to the wider decimal type in PySpark and then continued with the append:
from pyspark.sql.types import DecimalType

query = """select * from ..."""
df = spark.sql(query)
# widen the column to the target table's decimal(38,18) before appending
df = df.withColumn("id", df["id"].cast(DecimalType(38, 18)))
df.write.format("delta").mode("append").save("a_path")
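If the cast is preferred inside the query itself, the standard Spark SQL form needs the AS keyword, e.g. CAST(id AS DECIMAL(38,18)); whether that alone resolves the append error depends on the rest of the schema, but as a sketch (the table name some_table is a placeholder):
# widen the column in the query before appending; some_table is illustrative
df = spark.sql("select cast(id as decimal(38,18)) as id from some_table")
df.write.format("delta").mode("append").save("a_path")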

Can I convert centigrade to Fahrenheit in a query (not a function)

In Oracle I can convert centigrade to Fahrenheit in an SQL query, see below. It seems SQL Server does not have full regex functionality. Is it possible to do this without dropping into a function, which is what I currently do?
(UNISTR('\00B0') is the degree symbol we use.)
The requirement is for any string that contains [digits]°C to be converted to the same string with [new_digits]°F.
SELECT replace(replace(v_text_f,replace(regexp_substr(v_text_f,'\-?[[:digit:]]+\.?[[:digit:]]*'||UNISTR('\00B0')||'C'),UNISTR('\00B0')||'C'),
replace(regexp_substr(v_text_f,'\-?[[:digit:]]+\.?[[:digit:]]*'||UNISTR('\00B0')||'C'),UNISTR('\00B0')||'C')*9/5+32||UNISTR('\00B0')||'F'),
UNISTR('\00B0')||'F'||UNISTR('\00B0')||'C',UNISTR('\00B0') ||'F')
FROM (SELECT '38'||UNISTR('\00B0')||'C' as v_text_f FROM DUAL)
Try this; compared to the Oracle code, it's an extremely simplified version:
DECLARE @C nvarchar(10) = '38'+CHAR(0x00B0)+'C' --38°C
SELECT CONVERT(nvarchar(10),CONVERT(int, LEFT(@C, CHARINDEX(CHAR(0x00B0), @C)-1))*9/5+32)+CHAR(0x00B0)+'F'
--100°F
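For the more general requirement stated above (replace any embedded [digits]°C with [new_digits]°F anywhere in the string), a regex substitution is straightforward outside SQL. A minimal Python sketch that could, for example, be wrapped in a UDF on the Spark side of this thread (unlike the integer arithmetic in the T-SQL example, it keeps fractional degrees):
import re

def c_to_f(text):
    # replace every '<number>°C' in text with the converted '<number>°F'
    def repl(m):
        fah = float(m.group(1)) * 9 / 5 + 32
        return f"{fah:g}°F"   # {:g} drops a trailing .0
    return re.sub(r"(-?\d+(?:\.\d+)?)°C", repl, text)

print(c_to_f("Storage temp 38°C max"))   # Storage temp 100.4°F max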

Convert HashBytes to VarChar

I want to get the MD5 Hash of a string value in SQL Server 2005. I do this with the following command:
SELECT HashBytes('MD5', 'HelloWorld')
However, this returns a VarBinary instead of a VarChar value. If I attempt to convert 0x68E109F0F40CA72A15E05CC22786F8E6 into a VarChar I get há ðô§*à\Â'†øæ instead of 68E109F0F40CA72A15E05CC22786F8E6.
Is there any SQL-based solution?
Yes
I have found the solution elsewhere:
SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', 'HelloWorld')), 3, 32)
SELECT CONVERT(NVARCHAR(32),HashBytes('MD5', 'Hello World'),2)
Use master.dbo.fn_varbintohexsubstring(0, HashBytes('SHA1', @input), 1, 0) instead of master.dbo.fn_varbintohexstr followed by SUBSTRING on the result.
In fact, fn_varbintohexstr calls fn_varbintohexsubstring internally. The first argument of fn_varbintohexsubstring tells it whether or not to add the 0x prefix; fn_varbintohexstr calls fn_varbintohexsubstring with 1 as the first argument internally.
Because you don't need the 0x prefix, call fn_varbintohexsubstring directly.
Contrary to what David Knight says, these two alternatives return the same response in MS SQL 2008:
SELECT CONVERT(VARCHAR(32),HashBytes('MD5', 'Hello World'),2)
SELECT UPPER(master.dbo.fn_varbintohexsubstring(0, HashBytes('MD5', 'Hello World'), 1, 0))
So it looks like the first one is a better choice, starting from version 2008.
convert(varchar(34), HASHBYTES('MD5','Hello World'),1)
(style 1 converts the binary value to a hexadecimal string, keeping the 0x prefix)
Convert this to lowercase and remove the 0x from the start of the string with SUBSTRING:
substring(lower(convert(varchar(34), HASHBYTES('MD5','Hello World'),1)),3,32)
This is exactly the same as what we get in C# after converting the bytes to a string.
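For cross-checking these hex digests outside SQL Server, for example from Python as used elsewhere in this thread, hashlib produces the same string; a minimal sketch, assuming the input is a plain single-byte VARCHAR value (not NVARCHAR):
import hashlib

print(hashlib.md5('HelloWorld'.encode('utf-8')).hexdigest().upper())
# expected: 68E109F0F40CA72A15E05CC22786F8E6, the value quoted in the question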
From personal experience using the following code within a stored procedure to hash a stored procedure variable, I can confirm that, although undocumented, this combination works 100%, as per my example:
@var = SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('SHA2_512', @SPvar)), 3, 128)
Changing the datatype to varbinary seems to work the best for me.