Calculating hash integer from a string in Athena

I'm trying to calculate a hash from a string for best-effort ordering and partitioning purposes in Athena. There is no equivalent of Java's String.hashCode() in Athena, so as a best effort I try to get the 2nd character, calculate its codepoint, and take the modulus. (As I said, best effort, maybe a nice effort.)
Consider the query:
SELECT
doc_id,
substring(doc_id, 2, 1),
typeof(substring(doc_id, 2, 1))
FROM events LIMIT 100
The 3rd column returns a varchar, but the codepoint function expects a varchar(1), and casting it as cast(substring(doc_id, 2, 1) as varchar(1)) does not work:
FUNCTION_NOT_FOUND: line 6:5: Unexpected parameters (varchar) for function codepoint. Expected: codepoint(varchar(1))
How can I accomplish this task without modifying the data source? I'm open to ideas.

You can compute a hash code with the xxhash64 function. It takes a varbinary as input, so first cast the string to that type. Since the function also returns its 64-bit result as a varbinary, you can convert it to a bigint via the from_big_endian_64 function:
WITH t(x) AS (VALUES 'hello')
SELECT from_big_endian_64(xxhash64(cast(x AS varbinary)))
FROM t
output:
_col0
---------------------
2794345569481354659
(1 row)
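Applied to the original ordering/partitioning goal, a sketch might look like this (assuming the events table and doc_id column from the question; the bucket count of 16 is arbitrary):
SELECT
  doc_id,
  mod(abs(from_big_endian_64(xxhash64(cast(doc_id AS varbinary)))), 16) AS bucket
FROM events
LIMIT 100
The abs() guards against negative hash values before taking the modulus.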

Related

Cast a hexadecimal string to an array of bigint in hive

I have a column that contains a length-16 hexadecimal string. I would like to convert it to a bigint. Is there any way to accomplish that? The usual approach returns null, since the input string could represent a number > 2^63-1.
select
cast(conv(hash_col, 16, 10) as bigint) as p0,
conv(hash_col, 16, 10) as c0
from mytable limit 10
I have also tried using unhex(..),
cast(unhex(hash_col) as bigint) as p0 from mytable limit 10
but got the following error
No matching method for class org.apache.hadoop.hive.ql.udf.UDFToLong
with (binary). Possible choices: FUNC(bigint) FUNC(boolean)
FUNC(decimal(38,18)) FUNC(double) FUNC(float) FUNC(int) FUNC(smallint) FUNC(string) FUNC(timestamp) FUNC(tinyint) FUNC(void)
If I don't do the cast(.. as bigint) part, I get some undisplayable binary value for p0. It seems unhex is not exactly the inverse of hex in Hive.
Your values are out of range for BigInt
Ref : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
The maximum value for BigInt is 9,223,372,036,854,775,807.
Use decimal(20,0) instead.
select cast(conv('85A58F8B014692CA',16,10) as decimal(20,0))
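Applied to the question's table, the same approach would be (reusing the hash_col and mytable names from the question):
select
  cast(conv(hash_col, 16, 10) as decimal(20,0)) as p0
from mytable
limit 10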

Unexpected behavior of binary conversions (COALESCE vs. ISNULL)

Can you comment on which of the approaches shown below is preferable? I hope the question will not be blocked as "opinionated"; I would like to believe there is an explanation that makes the answer clear.
Context: I have code for mirroring 3rd-party table contents to my own table (an optimization). It worked flawlessly for some time, until the size/modification of the database reached some threshold.
The optimization is based on the row-version values of several tables, remembering the maximum of the values from the source tables. This way I am able to update my local table incrementally, much faster than rebuilding it from scratch from time to time.
The problem started to appear when the row-version value exceeded what fits in 4 bytes. After some effort, I spotted that the upper 4 bytes of the binary(8) value were set to 0. Later, the suspect was found to have the form COALESCE(MAX(row_version), 1).
The COALESCE was used to cover the case when the local table is fresh, containing no data -- so that the MAX(row_version) of the source tables could be compared with something meaningful.
The examples to show the bug: to simulate the situation just mentioned, I want to convert the NULL value of the binary(8) column to 1. I am also adding the ISNULL usage that was introduced later; the original code contained only the COALESCE.
DECLARE @bin8null binary(8) = NULL
SELECT 'bin NULL' AS the_variable, @bin8null AS value
SELECT 'coalesce 1' AS op, COALESCE(@bin8null, 1) AS good_value
SELECT 'coalesce 1 + convert' AS op, CONVERT(binary(8), COALESCE(@bin8null, 1)) AS good_value
SELECT 'isnull 1' AS op, ISNULL(@bin8null, 1) AS good_value
SELECT 'isnull 0x1' AS op, ISNULL(@bin8null, 0x1) AS bad_value
(There is a bug in the image: coalesce 0x1 + convert was later fixed in the code to coalesce 1 + convert, but not fixed in the image.)
The application bug appeared when the binary value was bigger than what can be stored in 4 bytes. Here 0xAAAAAAAA was used. (Actually, 0x00000001 was the real case, and it was difficult to spot that the single 1 was changed to 0.)
DECLARE @bin8 binary(8) = 0xAAAAAAAA01BB3A35
SELECT 'bin' AS the_variable, @bin8 AS value
SELECT 'coalesce 1' AS op, COALESCE(@bin8, 1) AS bad_value
SELECT 'coalesce 1 + convert' AS op, CONVERT(binary(8), COALESCE(@bin8, 1)) AS bad_value
SELECT 'coalesce 0x1 + convert ' AS op, CONVERT(binary(8), COALESCE(@bin8, 0x1)) AS good_value
SELECT 'isnull 1' AS op, ISNULL(@bin8, 1) AS good_value
SELECT 'isnull 0x1' AS op, ISNULL(@bin8, 0x1) AS good_value
When executed in Microsoft SQL Server Management Studio against MS SQL Server 2014, the results match the good_value/bad_value aliases above (the original post showed them as an image).
Description -- my understanding: COALESCE() seems to derive the type of the result from the type of the last processed argument. This way, the non-NULL binary(8) was converted to int, and that led to the loss of the upper 4 bytes. (See the 2nd and 3rd red bad_value in the picture; the difference between the two cases is only the decimal/hexadecimal form of display.)
On the other hand, ISNULL() seems to preserve the type of the first argument, and converts the second value to that type. One should be careful to understand that binary(8) is more like a series of bytes; reading it as one large integer is only an interpretation. Hence, 0x1 as the default value does not expand to an 8-byte integer and produces a bad value.
My solution: So, I have fixed the bug using ISNULL(MAX(row_version), 1). Is that correct?
This is not a bug. They're documented to handle data type precedence differently. COALESCE determines the data type of the output based on examining all of the arguments, while ISNULL has a more simplistic approach of inspecting only the first argument. (Both still need to contain values which are all compatible, meaning they are all possible to convert to the determined output type.)
From the COALESCE topic:
Returns the data type of expression with the highest data type precedence.
The ISNULL topic does not make this distinction in the same way, but implicitly states that the first expression determines the type:
replacement_value must be of a type that is implicitly convertible to the type of check_expression.
I have a similar example (and describe several other differences between COALESCE and ISNULL) here. Basically:
DECLARE @int int, @datetime datetime;
SELECT COALESCE(@int, CURRENT_TIMESTAMP);
-- works because datetime has a higher precedence than int, so datetime becomes the output type
2020-08-20 09:39:41.763
GO
DECLARE @int int, @datetime datetime;
SELECT ISNULL(@int, CURRENT_TIMESTAMP);
-- fails because int, the first (and chosen) output type, has a lower precedence than datetime
Msg 257, Level 16, State 3
Implicit conversion from data type datetime to int is not allowed. Use the CONVERT function to run this query.
Let me start off by saying:
This is not a "bug".
ISNULL and COALESCE are not the same function, and operate quite differently.
ISNULL takes 2 parameters, and returns the second parameter if the first is NULL. If the 2 parameters have different data types, then the data type of the first parameter is returned (implicitly casting the second value).
COALESCE takes 2+ parameters, and returns the first non-NULL parameter. COALESCE is shorthand for a CASE expression, and uses Data Type Precedence to determine the returned data type.
As a result, this is why ISNULL returns what you expect: there is no implicit conversion in your query for the non-NULL variable.
For the COALESCE there is implicit conversion. binary has the lowest precedence of all the data types, with a rank of 30 (at the time of writing). The value 1 is an int, with a rank of 16; a far higher precedence than rank 30.
As a result, COALESCE(@bin8, 1) will implicitly convert the value 0xAAAAAAAA01BB3A35 to an int and then return that value. You can see this as SELECT CONVERT(int, 0xAAAAAAAA01BB3A35) returns 29047349, which is your first "bad" value; it's not "bad", it's correct for what you wrote.
Then for the latter "bad" value, we can convert that int value (29047349) back to a binary, which results in 0x0000000001BB3A35; again, the result you get.
TL;DR: checking the return types of functions is important. ISNULL returns the data type of the first parameter and will implicitly convert the second if needed. COALESCE uses Data Type Precedence, and will implicitly convert the returned value to the data type with the highest precedence among all the possible return values.
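To make the fix concrete, a minimal sketch of the two safe patterns, reusing the @bin8 variable from the question:
DECLARE @bin8 binary(8) = 0xAAAAAAAA01BB3A35;
-- Safe: ISNULL keeps the binary(8) type of the first argument,
-- so the int default 1 is converted to binary(8), not the other way around.
SELECT ISNULL(@bin8, 1) AS keeps_all_8_bytes;
-- Also safe: give COALESCE a default that is already binary(8),
-- so data type precedence never converts the column value to int.
SELECT COALESCE(@bin8, CONVERT(binary(8), 1)) AS also_keeps_all_8_bytes;
Both queries return the full 0xAAAAAAAA01BB3A35, which is why the asker's ISNULL(MAX(row_version), 1) fix works.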

How to process bitand operation in Informix with column in hex string format

In a table I have a string column which contains a hex value. For example, the value '000000000000000a' means 10. Now I need to process a bitand operation: bitand(tableName.hexColumn, ?). When I read the Informix specification of this function, it needs two integers. So my question is: what is the simplest way to process this operation?
PS: Probably there is no solution in Informix, so I will have to create my own bitandhexstring function whose inputs will be two strings in hex form, but I have no idea where to start.
There are a variety of issues to be dealt with:
Your hex string has 16 digits, so the values are presumably (in general) 64-bit quantities. That means you need to be sure that the BITAND function has a variant that handles BIGINT (or perhaps INT8 — I'm not going to mention INT8 again, but it is nominally an option when BIGINT is mentioned) data.
You need to convert your hex string to a BIGINT.
It is not clear whether you'll need to convert the result BIGINT back to a hex string.
Some testing with Informix 11.70.FC6 on Mac OS X 10.10.4 shows that BITAND is safe with 64-bit numbers. That's good news!
The HEX function, when passed a BIGINT, returns a CHAR(20) string that starts with 0x and contains a hex representation of the number, so that more or less addresses point 3. The residual issue is 'how to convert 16-byte strings of hex digits to a BIGINT value'. Nominally, a cast operation like:
CAST('0xde3962e8c68a8001' AS BIGINT)
should do the job (but see below). There may be a better way of doing it than a brute-force and ignorance stored procedure, but I'm not immediately sure what it is.
Caveat Lector.
While testing this, I tried two queries:
SELECT bi, HEX(bi) FROM Test_BigInt;
SELECT bi, HEX(bi), SUBSTR(HEX(bi), 3, 16) FROM Test_BigInt;
on a table Test_BigInt with a single column bi of type BIGINT (not null, as it happened, but that's not material).
The first query worked fine. The type of the HEX(bi) expression was CHAR(20) and the values were like
0 0x0000000000000000
6898532535585831936 0x5fbc82ca87117c00
-2300268458811555839 0xe013ce0628808001
The second query sort of worked for small values of bi (0, 1, 2), but generated an error -1215: Value exceeds limit of INTEGER precision when the values got large. The problem is not the SUBSTR function directly. This was testing with Informix 11.70.FC6 on Mac OS X 10.10.4 — tested on 2015-07-08. The following pair of queries worked as expected (which is my justification for claiming that the problem is not in the SUBSTR function per se).
SELECT bi, HEX(bi) AS hex_bi FROM Test_BigInt INTO TEMP t;
SELECT bi, hex_bi, SUBSTR(hex_bi, 3, 16) FROM t;
It seems to be an interaction problem when the result of HEX is used in a string operation context. I first got the problem when trying to concatenate an empty string to the result of HEX: HEX(bi) || ''. That turns out to be unnecessary given that the result of HEX is reported as CHAR(20), but also indicates SUBSTR is not directly at fault.
I also tried CAST to get the hex string converted to BIGINT:
SELECT CAST('0xde3962e8c68a8001' AS BIGINT) FROM dual;
BIGINT
-964001791
SELECT HEX(CAST('0xde3962e8c68a8001' AS BIGINT)) FROM dual;
CHAR(18)
0xffffffffc68a8001
Grrr! Something is mishandling the conversion. This is not new software (well over 2 years old), but the chances are that unless someone else has spotted the bug, it has not yet been fixed, even in the latest version.
I've reported this through back-channels to IBM/Informix.
Stored procedures to convert hex string to BIGINT
CREATE PROCEDURE hexval(c CHAR(1)) RETURNING INTEGER;
    -- Position of the digit in the list, minus 1: '0' -> 0 ... 'f' -> 15 (-1 if not a hex digit).
    RETURN INSTR("0123456789abcdef", LOWER(c)) - 1;
END PROCEDURE;

CREATE PROCEDURE hexstr_to_bigint(ival VARCHAR(18)) RETURNING BIGINT;
    DEFINE oval DECIMAL(20,0);
    DEFINE i, j, len INTEGER;
    LET ival = LOWER(ival);
    -- Strip an optional leading '0x'.
    IF (ival[1,2] = '0x') THEN LET ival = ival[3,18]; END IF;
    LET len = LENGTH(ival);
    LET oval = 0;
    -- Accumulate digit by digit in a DECIMAL(20,0), which holds the full unsigned 64-bit range.
    FOR i = 1 TO len
        LET j = hexval(SUBSTR(ival, i, 1));
        LET oval = oval * 16 + j;
    END FOR;
    -- Values above 2^63-1 represent negative BIGINTs: apply two's complement by subtracting 2^64.
    IF (oval > 9223372036854775807) THEN
        LET oval = oval - 18446744073709551616;
    END IF;
    RETURN oval;
END PROCEDURE;
Casual testing:
execute procedure hexstr_to_bigint('000A');
10
execute procedure hexstr_to_bigint('FFff');
65535
execute procedure hexstr_to_bigint('FFFFffffFFFFffff');
-1
execute procedure hexstr_to_bigint('0XFFFFffffFFFFffff');
-1
execute procedure hexstr_to_bigint('000000000000000A');
10
Those values are correct.
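With those procedures in place, the bitand operation from the question might look like this sketch (tableName and hexColumn are the placeholder names from the question; the mask is converted the same way):
SELECT BITAND(hexstr_to_bigint(hexColumn), hexstr_to_bigint('000000000000000a'))
  FROM tableName;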

How do I remove the first character of a string and treat the remaining values as an integer in BigQuery

I am currently working with a large data set that was pre-populated in BigQuery. I have a column of orderIDs which have the following set-up: o377412876, o380940924, etc. This is stored as a string. I need to do the following and am running into problems:
1) Strip off the first character using the BigQuery query language
2) Convert (or treat) the remaining values as an integer
I will then run a join against the values. Now, I would be abundantly happier doing this operation in Python, R, or another language. That said, the challenge I have been given, based on client needs, is to write all the scripts in BigQuery's querying language.
SELECT 10 * INTEGER(REGEXP_REPLACE(x, '^.', ''))
FROM
(SELECT 'o1234' AS x)
12340
You can use the SUBSTR function and SAFE_CAST (in case there are NULL values in your column). The INTEGER() function from the answer above works only in BigQuery legacy SQL, not in standard SQL.
SELECT SAFE_CAST(SUBSTR(x, 2) AS INT64)
FROM (SELECT 'o1234' AS x)
Output: 1234
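For the join mentioned in the question, the cast can go straight into the join condition. A sketch with hypothetical names (orders, order_details, and order_id are illustrative, not from the question):
SELECT o.orderID, d.*
FROM orders AS o
JOIN order_details AS d
  ON SAFE_CAST(SUBSTR(o.orderID, 2) AS INT64) = d.order_id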

SQL server 'like' against a float field produces inconsistent results

I am using LIKE to return matching numeric results against a float field. It seems that once there are more than 4 digits to the left of the decimal, values that match my search item on the right side of the decimal are not returned. Here's an example illustrating the situation:
CREATE TABLE number_like_test (
num [FLOAT] NULL
)
INSERT INTO number_like_test (num) VALUES (1234.56)
INSERT INTO number_like_test (num) VALUES (3457.68)
INSERT INTO number_like_test (num) VALUES (13457.68)
INSERT INTO number_like_test (num) VALUES (1234.76)
INSERT INTO number_like_test (num) VALUES (23456.78)
SELECT num FROM number_like_test
WHERE num LIKE '%68%'
That query does not return the record with the value 13457.68, but it does return the record with the value 3457.68. Also, running the query with 78 instead of 68 does not return the 23456.78 record, but using 76 returns the 1234.76 record.
So, to get to the question: why does having a larger number cause these results to change? And how can I change my query to get the expected results?
The like operator requires a string as a left-hand value. According to the documentation, a conversion from float to varchar can use several styles:
Value         Output
0 (default)   A maximum of 6 digits. Use in scientific notation, when appropriate.
1             Always 8 digits. Always use in scientific notation.
2             Always 16 digits. Always use in scientific notation.
The default style works fine for the six digits in 3457.68, but not for the seven digits in 13457.68. To use 16 digits instead of 6, you could use convert and specify style 2. Style 2 represents a number like 3.457680000000000e+003. But that wouldn't help a pattern that spans the first two digits, since scientific notation inserts a decimal point right after the first one, and you get an unexpected +003 exponent for free.
The best approach is probably a conversion from float to decimal. That conversion allows you to specify the precision and scale. Using precision 20 and scale 10, the float is represented as 3457.6800000000:
where convert(decimal(20,10), num) like '%68%'
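In the context of the question's query, that predicate slots in directly:
SELECT num FROM number_like_test
WHERE CONVERT(decimal(20,10), num) LIKE '%68%'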
When you compare a number with LIKE, it is implicitly converted to a string and then matched.
The problem here is that a float number is not precise, and when it is converted you can get
13457.679999999999999 instead of 13457.68
So to avoid this, explicitly format the number appropriately (I'm not sure of the exact syntax in SQL Server, but it will be something like):
SELECT num FROM number_like_test
WHERE FORMAT(num, '0.##') LIKE '%68%'
The conversion to string is rounding your values. Both CONVERT and CAST have the same behavior.
SELECT cast(num as nvarchar(50)) as s
FROM number_like_test
Or
SELECT convert(nvarchar(50), num) as s
FROM number_like_test
provide the results:
1234.56
3457.68
13457.7
1234.76
23456.8
You'll have to use the STR function and correct format parameters to try to get your results. For example,
SELECT STR(num, 10, 2) as s
FROM number_like_test
gives:
1234.56
3457.68
13457.68
1234.76
23456.78
Pretty well solved already, but you only need to CAST once, not twice like the other answer suggests; LIKE takes care of the string conversion:
SELECT *
FROM number_like_test
WHERE CAST(num AS DECIMAL(12,6)) LIKE '%68%'
And here's a SQL Fiddle showing the rounding behavior: SQL Fiddle
It's probably because the FLOAT data type represents a floating-point number, which is an approximation of the number and should not be relied on for exact comparisons.
If you need to do a search that includes the float value, you would need to either store it in a decimal data type (which holds the exact number) or convert it to a varchar using something like the STR() function.
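Combining that advice with the STR() example above, a direct filter might look like this sketch:
SELECT num
FROM number_like_test
WHERE STR(num, 10, 2) LIKE '%68%'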