SHA_256 Hash in Bigquery - google-bigquery

I am trying to find the SQL equivalent of hash in bigquery.
SQL :
SELECT CAST(HASHBYTES('SHA2_256', CONCAT(
COL1, COL2, COL3
)) AS BINARY(32)) AS HashValue
Big Query:
SELECT SHA2_256(CONCAT(COL1, '', COL2 )) AS HashValue.
I can't find any examples where hashing is done on multiple columns. The columns also have different data types.
Any help is really appreciated.

You can follow this change request:
These are now implemented. Thanks again for sharing feedback on needing these functions. Please see:
TO_HEX: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#to_hex
FROM_HEX: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#from_hex
2 related questions I found for you are:
Is it possible to hash using MD5 in BigQuery?
Random Sampling in Google BigQuery

Using Standard SQL (SHA256 function) you could cast all your fields to string, concatenate them and use the hash. Something like this:
SELECT SHA256(
CONCAT(
CAST(integer_field1 as STRING),
CAST(integer_field2 as STRING),
CAST(timestamp_field as STRING)
)
) as sha256_hash FROM `table`
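One caveat with plain concatenation is that different rows can collapse to the same input text, for example '1' + '23' and '12' + '3'. Below is a minimal Python sketch of the same cast, concatenate, and hash idea with a separator between fields. The function name `row_hash` and the `|` delimiter are illustrative assumptions, not anything BigQuery defines:

```python
import hashlib

def row_hash(*fields, sep="|"):
    """Hash several column values as one SHA-256 hex digest.

    Each value is converted to a string first (like CAST(... AS STRING)),
    then joined with a separator so field boundaries stay unambiguous.
    """
    joined = sep.join(str(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Without a separator, both rows below would concatenate to the text "123".
a = row_hash(1, 23)
b = row_hash(12, 3)
print(a != b)  # True
```

In the query above, the same effect could be had by placing a delimiter literal between the CONCAT arguments, provided the delimiter cannot occur in the data.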

Splitting a string and converting to integer in BigQuery

I have a simple problem, but I have only just started using Google BigQuery and found the documentation hard to follow.
I have a column that looks like this for some rows:
ANSWER(title of column)
9
10 - Certainly Satisfied.
7 -
My aim is to take the part of that column before the "-" sign and convert it to an integer. I found some functions like split() and regexp_extract(), but I wasn't sure how to apply them to my data.
Thanks for your help in advance :)
If the number is always first, you can use:
select sum(safe_cast(split(answer, '-')[ordinal(1)] as int64))
from t;
Note: It looks like you have spaces, so you might really want to split on the space:
select sum(safe_cast(split(answer, ' ')[ordinal(1)] as int64))
from t;
Consider the option below:
select answer,
safe_cast(regexp_extract(trim(answer), r'^\d+') as int64) as score
from `project.dataset.table`
Applied to the sample data in your question, this gives the scores 9, 10 and 7.
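The trim, extract, and safe-cast logic can be checked outside BigQuery with a small Python sketch. The helper name `leading_int` and the extra "n/a" sample are assumptions added for illustration:

```python
import re

def leading_int(answer):
    """Return the integer at the start of the text, or None if there is none."""
    match = re.match(r"\d+", answer.strip())
    return int(match.group()) if match else None

samples = ["9", "10 - Certainly Satisfied.", "7 -", "n/a"]
print([leading_int(s) for s in samples])  # [9, 10, 7, None]
```

Returning None for text without a leading number mirrors what safe_cast does with a NULL from regexp_extract.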

How to convert hashbytes string from sql to spark equivalent

I have a process using the following select statement in sql server
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
This gives: 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
I have been trying to re-create this same result in Spark SQL without luck.
In Spark I tried sha1(encode('4100119300','utf-8'))
but the result is: b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4
During my tests, if I remove the cast on the SQL Server side, the result is the same as in Spark. The problem, as I see it, is that in Spark you can't specify the size of the string, or perhaps the encoding changes in the process. I already have data in SQL Server hashed with the NVARCHAR(100) cast, so removing the cast is not an option.
Any suggestions?
Have you seen these differences?
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
-- 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
SELECT HASHBYTES('SHA1', '4100119300') AS StringConverted
-- 0xB4CF5AAE8CE3DC1673DA4949CFDF2EDFA33FDBA4
To store the varbinary result as a string, I use CONVERT with style 1 (see CAST and CONVERT):
SELECT CONVERT(VARCHAR(100), HASHBYTES('SHA1', '4100119300'), 1) AS StringConverted
And this is what you are looking for. Spark's b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4 is simply the same value in lowercase and without the 0x prefix.
The first thing to highlight here is that the difference between VARCHAR and NVARCHAR is one of encoding, so to regenerate the hash key you just need to use the same encoding, which for NVARCHAR is 'utf_16_le'.
To reproduce:
SELECT CONVERT(VARCHAR(254), HASHBYTES('SHA2_512', CONVERT(NVARCHAR(24), '2020-05-27 00:00:00.000', 127)), 2)
you will need something like this in PySpark:
import hashlib
hashlib.sha512('2020-05-27 00:00:00.000'.encode('utf_16_le')).hexdigest().upper()
Link to related issue: How to reproduce the behavior of SQL NVARCHAR in Python when generating a SHA-512 hash?
Hope that Helps. Thanks
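To see the encoding effect on the question's own value, here is a small Python sketch that hashes the same text as UTF-8 bytes and as UTF-16LE bytes. The helper name `sha1_hex` is an assumption for illustration. The question reports 0xB4CF5AAE... for the plain literal and 0x66A2F63C... for the NVARCHAR cast, which is what the two encodings should correspond to:

```python
import hashlib

def sha1_hex(text, encoding):
    """SHA-1 of the text encoded with the given codec, as lowercase hex."""
    return hashlib.sha1(text.encode(encoding)).hexdigest()

value = "4100119300"
as_varchar = sha1_hex(value, "utf-8")       # one byte per ASCII digit
as_nvarchar = sha1_hex(value, "utf-16-le")  # two bytes per character, like NVARCHAR

# SQL Server style formatting: 0x prefix and uppercase hex digits.
print("0x" + as_varchar.upper())
print("0x" + as_nvarchar.upper())
```

The input text is identical in both calls; only the byte representation handed to the hash differs, which is why the digests differ.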

Convert strings into table columns in BigQuery

I would like to convert this table
to something like this
The long string can be dynamic, so it's important to me that the solution is not fixed to these specific values.
Please help; I'm using BigQuery.
You could start by using SPLIT, whose signature is SPLIT(value[, delimiter]), to convert your long string into an array of separate key-value pairs.
Note that this will break if your values themselves contain commas.
SPLIT(session_experiments, ',')
Then you could either FLATTEN that array (UNNEST in Standard SQL) or access each element, and then use some regular expressions to separate the key and the value.
If you share more context on your restrictions and intended result I could try and put together a query for you that does exactly what you want.
What you want is not possible as such; however, there is a better practice in BigQuery.
You can use arrays of structs to store that information in a table.
Let's say you have a table like this:
You can use this sample query to understand how it works.
with rawdata AS
(
SELECT 1 as id, 'test1-val1,test2-val2,test3-val3' as experiments union all
SELECT 1 as id, 'test1-val1,test3-val3,test5-val5' as experiments
)
select
id,
(select array_agg(struct(split(param, '-')[offset(0)] as experiment, split(param, '-')[offset(1)] as value)) from unnest(split(experiments)) as param ) as experiments
from rawdata
The output will look like this:
With the output in that shape, it's more convenient to manipulate the data.
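The split-then-split-again logic of that query can be sketched in Python as follows. The function name `parse_experiments` and the dictionary keys are assumptions that mirror the struct fields in the query:

```python
def parse_experiments(raw, pair_sep=",", kv_sep="-"):
    """Turn 'test1-val1,test2-val2' into a list of {experiment, value} dicts."""
    result = []
    for pair in raw.split(pair_sep):
        # partition splits on the first separator only, so a value may contain '-'
        experiment, _, value = pair.partition(kv_sep)
        result.append({"experiment": experiment, "value": value})
    return result

print(parse_experiments("test1-val1,test2-val2,test3-val3"))
```

Because the pairs are discovered at run time, nothing here depends on which experiment names appear in the string.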

How to substring records with variable length

I have a table which has a column with doc locations, such as AA/BB/CC/EE
I am trying to get only one of these parts, let's say just the CC part (which has variable length). So far I've tried the following:
SELECT RIGHT(doclocation,CHARINDEX('/',REVERSE(doclocation),0)-1)
FROM Table
WHERE doclocation LIKE '%CC %'
But I'm not getting the expected result.
Use the PARSENAME function like this:
DECLARE @s VARCHAR(100) = 'AA/BB/CC/EE'
SELECT PARSENAME(REPLACE(@s, '/', '.'), 2)
This is painful to do in SQL Server. One method is a series of string operations. I find this simplest using outer apply (unless I need subqueries for a different reason):
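select t.*, t4.cc
from t outer apply
     -- remove everything up to and including the first '/'
     (select stuff(t.doclocation, 1, charindex('/', t.doclocation), '') as rest1) t2 outer apply
     -- remove everything up to and including the second '/'
     (select stuff(t2.rest1, 1, charindex('/', t2.rest1), '') as rest2) t3 outer apply
     -- keep what is left up to the next '/' (or the end of the string)
     (select left(t3.rest2, charindex('/', t3.rest2 + '/') - 1) as cc) t4;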
The PARSENAME function is meant for getting a specified part of an object name and should not be used for this purpose, as it only parses strings with at most 4 parts (see the SQL Server PARSENAME documentation at MSDN).
SQL Server 2016 has a new function, STRING_SPLIT, but if you don't use SQL Server 2016 you have to fall back on the solutions described here: How do I split a string so I can access item x?
The question is not entirely clear. Can you please specify which value you need? If you need the values after CC, you can use CHARINDEX on "CC". Also, the query does not seem correct: the string you provided, "AA/BB/CC/EE", contains no spaces, but your query searches for a space with WHERE doclocation LIKE '%CC %'.
SELECT SUBSTRING(doclocation,CHARINDEX('CC',doclocation)+2,LEN(doclocation))
FROM Table
WHERE doclocation LIKE '%CC %'
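For comparison, picking the Nth part of a delimited string is a one-step split in most general-purpose languages. A minimal Python sketch, where the helper name `path_part` and the 1-based position are assumptions for illustration:

```python
def path_part(doclocation, position, sep="/"):
    """Return the 1-based segment of a delimited path, or None if out of range."""
    parts = doclocation.split(sep)
    if 1 <= position <= len(parts):
        return parts[position - 1]
    return None

print(path_part("AA/BB/CC/EE", 3))  # CC
```

Unlike PARSENAME, this counts from the left and is not limited to four parts.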

Regular expressions inside SQL Server

I have stored values in my database that look like 5XXXXXX, where X can be any digit. In other words, I need to match incoming SQL query strings like 5349878.
Does anyone have an idea how to do it?
I have different cases like XXXX7XX for example, so it has to be generic. I don't care about representing the pattern in a different way inside the SQL Server.
I'm working with c# in .NET.
You can write queries like this in SQL Server:
--each [0-9] matches a single digit, this would match 5xx
SELECT * FROM YourTable WHERE SomeField LIKE '5[0-9][0-9]'
stored value in DB is: 5XXXXXX [where x can be any digit]
You don't mention data types - if numeric, you'll likely have to use CAST/CONVERT to change the data type to [n]varchar.
Use:
WHERE CHARINDEX('5', column) = 1
AND CHARINDEX('.', column) = 0 --to stop decimals if needed
AND ISNUMERIC(column) = 1
References:
CHARINDEX
ISNUMERIC
i have also different cases like XXXX7XX for example, so it has to be generic.
Use:
WHERE PATINDEX('%7%', column) = 5 --true only when the first 7 is at position 5
AND CHARINDEX('.', column) = 0 --to stop decimals if needed
AND ISNUMERIC(column) = 1
References:
PATINDEX
Regex Support
SQL Server 2005+ supports regex through CLR integration, but the catch is that you have to create a CLR user-defined function (UDF) before you can use it. There are numerous articles providing example code if you search for them. Once you have that in place, you can use:
5\d{6} for your first example
\d{4}7\d{2} for your second example
For more info on regular expressions, I highly recommend this website.
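Since the masks vary (5XXXXXX, XXXX7XX, and so on), the regex can be generated from the mask rather than written by hand. A minimal Python sketch, assuming X always stands for one digit and every other character is a literal; the function name `mask_to_regex` is hypothetical:

```python
import re

def mask_to_regex(mask):
    """Compile a mask like '5XXXXXX' into a regex; X means any single digit."""
    parts = [r"\d" if ch == "X" else re.escape(ch) for ch in mask]
    return re.compile("".join(parts))

pattern = mask_to_regex("XXXX7XX")
print(bool(pattern.fullmatch("1234756")))  # True
print(bool(pattern.fullmatch("1234856")))  # False
```

Using fullmatch anchors the pattern at both ends, so longer strings that merely contain a matching run are rejected.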
Try this
select * from mytable
where p1 not like '%[^0-9]%' and substring(p1,1,1)='5'
Of course, you'll need to adjust the substring value, but the rest should work...
In order to match a digit, you can use [0-9].
So you could use 5[0-9][0-9][0-9][0-9][0-9][0-9] and [0-9][0-9][0-9][0-9]7[0-9][0-9]. I do this a lot for zip codes.
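Writing those character classes out by hand is error-prone, so the LIKE pattern can be built from the mask in application code. A minimal Python sketch, assuming X means one digit and the remaining characters are plain digits with no LIKE wildcards; the function name `mask_to_like` is hypothetical:

```python
def mask_to_like(mask):
    """Translate a mask like '5XXXXXX' into a SQL Server LIKE pattern."""
    return "".join("[0-9]" if ch == "X" else ch for ch in mask)

print(mask_to_like("XXXX7XX"))  # [0-9][0-9][0-9][0-9]7[0-9][0-9]
```

The resulting string would then be passed to the query as a parameter rather than concatenated into the SQL text.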
SQL Wildcards are enough for this purpose. Follow this link: http://www.w3schools.com/SQL/sql_wildcards.asp
You need to use a query like this:
select * from mytable where msisdn like '%7%'
or
select * from mytable where msisdn like '56655%'