How do I run md5() on a bigint in Presto? - sql

select md5(15)
returns
Query failed (#20160818_193909_00287_8zejd): line 1:8:
Unexpected parameters (bigint) for function md5. Expected: md5(varbinary)
How do I hash 15 and get back a string? I'd like to select 1 in 16 items at random, e.g. where md5(id) like '%3'.
FYI I might be on version 0.147, don't know how to tell.
FYI I found this PR. md5 would be cross-platform, which is nice, but I'd take a Presto-dependent hash function that spread ids relatively uniformly. I suppose I could implement my own linear formula. Seems awkward.

Best thing I could come up with was to cast the integer as a varchar, then turn it into varbinary via utf8, then apply md5 on the varbinary:
presto> select md5(to_utf8(cast(15 as varchar)));
_col0
-------------------------------------------------
9b f3 1c 7f f0 62 93 6a 96 d3 c8 bd 1f 8f 2f f3
(1 row)
If this is not the result you get, you can always turn it into a hex string manually:
presto> select to_hex(md5(to_utf8(cast(15 as varchar))));
_col0
----------------------------------
9BF31C7FF062936A96D3C8BD1F8F2FF3
(1 row)

Related

How many combinations does SHA-256 have?

By using an online tool and wikipedia I found out that every sha-256 encrypted string is 64 chars longs containing numbers and characters. Hence I assumed that there are 34^36 combinations ( 2^216 simplified by an algebra calculator ).
After doing some research I found out that most people said there are 2^256 combinations. Could someone explain ? To make the context clear, I write a paper about cryptocurrencies and try to explain how many different combinations there are to encrypt and how long this could take ( therefore how many guesses it could take) and compare this to the amount of total atoms in the universe (roughly 10^85).
SHA-256 produces 256 bits which is 32 bytes, not characters, each byte has 256 possible values.
There are 256 bits and each bit has 2 values (0 or 1), thus 2^256.
There are 32 bytes and each byte has 256 values, thus 256^32.
Note: 2^256 == 256^32 ~= 10^77.
The 32 bytes can be encoded many ways, in hexadecimal it would be 64 characters, in Base64 it would be 44 characters.
Total combinations of SHA-256 is
115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936
A sha-256 hash has 64 characters, 32 hex combinations, because a hex has 2 characters.
3a 7b d3 e2 36 0a 3d 29 ee a4 36 fc fb 7e 44 c7 35 d1 17 c4 2d 1c 18 35 42 0b 6b 99 42 dd 4f 1b
Above is a hash where the hex combinations are separated so you can count 32.
There are 16 characters available to hex 0-9&a-f and 16^2 or 256 combinations in hex.
With 32 slots for a hex in a sha-256 you use 256^32 to get:
115792089237316195423570985008687907853269984665640564039457584007913129639936
Available sha-256 hashes.

u-sql: filtering out empty// Null strings (microsoft academic graph)

I am new to u-sql of azure datalake analytics.
I want to do what I think is a very simple operations but ran into trouble.
Basically: I want to create a query which ignore empty string.
using it in select works, but not in WHERE statement.
Below the statement I am making and the cryptic error I get
JOB
#xsel_res_1 =
EXTRACT
x_paper_id long,
x_Rank uint,
x_doi string,
x_doc_type string,
x_paper_title string,
x_original_title string,
x_book_title string,
x_paper_year int,
x_paper_date DateTime?,
x_publisher string,
x_journal_id long?,
x_conference_series_id long?,
x_conference_instance_id long?,
x_volume string,
x_issue string,
x_first_page string,
x_last_page string,
x_reference_count long,
x_citation_count long?,
x_estimated_citation int?
FROM #"adl://xmag.azuredatalakestore.net/graph/2018-02-02/Papers.txt"
USING Extractors.Tsv()
;
#xsel_res_2 =
SELECT
x_paper_id AS x_paper_id,
x_doi.ToLower() AS x_doi,
x_doi.Length AS x_doi_length
FROM #xsel_res_1
WHERE NOT string.IsNullOrEmpty(x_doi)
;
#xsel_res_3 =
SELECT
*
FROM #xsel_res_2
SAMPLE ANY (5)
;
OUTPUT #xsel_res_3
TO #"/graph/2018-02-02/x_output/x_papers_x6.tsv"
USING Outputters.Tsv();
THE ERROR
Vertex failed
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][1] with error: Vertex user code error.
VertexFailedFast: Vertex failed with a fail-fast error
E_RUNTIME_USER_EXTRACT_ROW_ERROR: Error occurred while extracting row after processing 10 record(s) in the vertex' input split. Column index: 5, column name: 'x_original_title'.
E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD: Invalid character following the ending quote character in a quoted field.
Row selected
Component
RUNTIME
Message
Invalid character following the ending quote character in a quoted field.
Resolution
Column should be fully surrounded with double-quotes and double-quotes within the field escaped as two double-quotes.
Description
Invalid character is detected following the ending quote character in a quoted field. A column delimiter, row delimiter or EOF is expected. This error can occur if double-quotes within the field are not correctly escaped as two double-quotes.
Details
Row Delimiter: 0x0
Column Delimiter: 0x9
HEX: 61 76 6E 69 20 74 65 72 6D 69 6E 20 75 20 70 6F 76 61 6C 6A 73 6B 6F 6A 20 6C 69 73 74 69 6E 69 20 69 20 6E 61 74 70 69 73 75 20 67 20 31 31 38 35 09 22 50 6F 20 6B 6F 6E 63 75 22 ### 20 28 73 74 61 72 69 20 68 72
UPDATE
BY the way, the operations work on other datasets, so the problem is not the syntax as far as I can tell
//Define schema of file, must map all columns
#searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int,
Urls string,
ClickedUrls string
FROM #"/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
#searchlog_1 =
SELECT * FROM #searchlog
WHERE NOT string.IsNullOrEmpty(ClickedUrls );
OUTPUT #searchlog_1
TO #"/Samples/Output/SearchLog_output_x1.tsv"
USING Outputters.Tsv();
This is an unfortunate error display for this case.
Assuming text is utf-8, you can use a site like www.hexutf8.com to convert the hex to:
avni termin u povaljskoj listini natpisu g 1185 "Po koncu" (Stari hr
It looks like the input row contains at least one " character that is not properly escaped. It should look like this:
avni termin u povaljskoj listini natpisu g 1185 ""Po koncu"" (Stari hr
#Saveenr's answer assumes that the values in your file are all quoted. Alternatively, if they are not quoted (and do not contain your column separator as values), then setting Extractors.Tsv(quoting:false) could help as well.

wrong output when decoding base64 string

i seem to always get incorrect output when decoding this base64 string in vb.net ( i think its base64? it really looks like it )
im using the frombase64string function
and i did it like this
Dim b64str = "0DDQQL3uAikQBgAAc4cqK4WnSQBg4SAgExEAAF3BAmAILYojRgkBhUrBAgEDRw=="
Dim i As String = System.Text.Encoding.Unicode.GetString(Convert.FromBase64String(b64str))
MsgBox(i)
but i always get this output
バ䃐⤂ؐ
that doesn't seem right
0DDQQL3uAikQBgAAc4cqK4WnSQBg4SAgExEAAF3BAmAILYojRgkBhUrBAgEDRw==
It looks like Base64, the length is a correct size, the characters belong to the Base64 character set and the trailing "==" is reasonable. Of course it might not be a Base64 encoding.
Base64 decoding results in:
D0 30 D0 40 BD EE 02 29 10 06 00 00 73 87 2A 2B 85 A7 49 00 60 E1 20 20 13 11 00 00 5D C1 02 60 08 2D 8A 23 46 09 01 85 4A C1 02 01 03 47
Now the problem, this is not a character string, it is an array of 8-bit bytes. Thus it can not be displayed as characters. The 0x00 bytes will signal the end of a string to the print method and the no-representable characters may be ignored, displayed with special characters or multiple bytes may display as must-byte unicode characters. The only guaranteed and usual display is in hexadecimal as above.
That String can be virtually anything. It might be the result of an encryption algorithm, like sha*. Your mistake is that you assume that it must be base64 because it might be.
It is a valid observation that it might be base64, so it was a perfectly valid thing to run that function, but it is you who has to determine whether based on the results it is base64 or something else, based on particular logic, which was not described in the question.

Oracle SQL - Combine results from two columns

I am seeking to combine the results of two columns, and view it in a single column:
select description1, description2 from daclog where description2 is not null;
results two registry:
1st row:
DESCRIPTION1
Initialization scans sent to RTU 1, 32 bit mask: 0x00000048. Initialization mask bits are as follows: B0 - status dump, B1 - analog dump B2 - accumulator dump, B3 - Group Data Dump, B4 - accumulat
(here begin DESCRIPTION2)
,or freeze, B5 - power fail reset, B6 - time sync.
2nd row:
DESCRIPTION1
Initialization scans sent to RTU 1, 32 bit mask: 0x00000048. Initialization mask bits are as follows: B0 - status dump, B1 - analog dump B2 - accumulator dump, B3 - Group Data Dump, B4 - accumulat
(here begin DESCRIPTION2)
,or freeze, B5 - power fail reset, B6 - time sync.
Then I need the value of description1 and description2, on the same column.
It is possible?
Thank you!
You can combine two columns into one by using || operator.
select description1 || description2 as description from daclog where description2 is not null;
If you would like to use some substrings from each of the descriptions, you can use String functions and then combine the results. FNC(description1) || FNC(descriptions2) where FNC might be a function to return the desired substring of your columns.

AS/400: Most efficient DB2 SQL to unpack EBCDIC subfield values as strings?

Most files I am working with only have the following fields:
F00001 - usually 1 (f1) or 9 (f9)
K00001 - usually only 1-3 sub-fields of
zoned decimals and ebcdic
F00002 - sub-fields of ebcdic, zoned and
packed decimals
Occasionally other field names K00002, F00003 and F00004 will appear in cross reference files.
Example Data:
+---------+--------------------------------------------------+--------------------------------------------------------------------------------------------+
| F00001 | K00001 | F00002 |
+---------+--------------------------------------------------+--------------------------------------------------------------------------------------------+
| f1 | f0 f0 f0 f0 f1 f2 f3 f4 f5 f6 d7 c8 | e2 e3 c1 c3 d2 d6 e5 c5 d9 c6 d3 d6 e7 40 12 34 56 7F e2 d2 c5 c5 e3 |
+---------+--------------------------------------------------+--------------------------------------------------------------------------------------------+
Currently using:
SELECT SUBSTR(HEX(F00001), 1, 2) AS FNAME_1, SUBSTR(HEX(K00001), 1, 14) AS KNAME_1, SUBSTR(HEX(K00001), 15, 2) AS KNAME_2, SUBSTR(HEX(K00001), 17, 2) AS KNAME_2, SUBSTR(HEX(F00002), 1, 28) AS FNAME_2, SUBSTR(HEX(F00002), 29, 8) AS FNAME_3, SUBSTR(HEX(F00002), 37, 10) AS FNAME_4 FROM QS36F.FILE
Is this the best way to unpack EBCDIC values as strings?
You asked for 'the best way'. Manually fiddling the bytes is categorically NOT the best way. #JamesA has a better answer: Externally describe the table and use more traditional SQL to access it. I see in your comments that you have multiple layouts within the same table. This was typical years ago when we converted from punched cards to disk. I feel your pain, having experienced this many times.
If you are using SQL to run queries, I think you have several options, all of which revolve around having a sane DB2 table instead of a jumbled S/36 flat file. Without more details on the business problem, all we can do is offer suggestions.
1) Add a trigger to QS36F.FILE that will break out the intermingled records into separate SQL defined tables. Query those.
2) Write some UDFs that will pack and unpack numbers. If you're querying today, you'll be updating tomorrow and if you think you have some chance of maintaining the raw HEX(this) and HEX(that) for SELECTS, wait until you try to do an UPDATE that way.
3) Write stored procedures that will extract out the bits you need for a given query, put them into SQL tables - maybe even a GLOBAL TEMPORARY TABLE. Have the SP query those bits and return a result set that can be consumed by other SQL queries. IBM i supports user defined table functions as well.
4) Have the RPG team write you a conversion program that will read the old file and create a data warehouse that you can query against.
It almost looks as if old S/36 files are being accessed and the system runs under CCSID 65535. That could cause the messy "hex"representation issue as well as at least some of the column name issues. A little more info about the server environment would be useful.