TSQL - Create a GUID from random text and then get back the original text value from that GUID

I'm curious, is there an efficient way to generate a GUID from a random string of text and then take that GUID and convert it back to the original random string of text without using any additional data/mapping? I looked around for ways to do it, but couldn't find anything substantial.
Take this variable as an example for the starting point. I want to know if I can generate a GUID from it and then deconstruct the GUID to get back the original @Text
DECLARE @Text CHAR(64) = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' ;

It seems like you are leaning towards EncryptByPassPhrase() and DecryptByPassPhrase().
Example:
declare @encrypt varbinary(200)
select @encrypt = EncryptByPassPhrase('MySecretKey', 'abc' )
select @encrypt
select convert(varchar(100), DecryptByPassPhrase('MySecretKey', @encrypt ))
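Note that the encrypted blob is longer than the 16 bytes a GUID can hold (it embeds an IV and padding), so the result of EncryptByPassPhrase() cannot be stored in a UNIQUEIDENTIFIER. A quick way to check (the exact length varies by SQL Server version):
select datalength(EncryptByPassPhrase('MySecretKey', 'abc'))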

In SQL Server a UNIQUEIDENTIFIER can be any 16 bytes of data. Per the documentation:
a string constant in the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,
in which each x is a hexadecimal digit in the range 0-9 or a-f. For
example, 6F9619FF-8B86-D011-B42D-00C04FC964FF is a valid
uniqueidentifier value.
https://learn.microsoft.com/en-us/sql/t-sql/data-types/uniqueidentifier-transact-sql?view=sql-server-ver15
So any varchar of 16 characters or less can be round-tripped through a UNIQUEIDENTIFIER. E.g.
declare @d varchar(16) = 'hello world'
declare @u uniqueidentifier = cast(cast(@d as varbinary(16)) as uniqueidentifier)
select cast(cast(@u as varbinary(16)) as varchar(16))
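Note that anything longer is silently cut off by the explicit cast, so the round trip only holds for the first 16 characters. A quick sketch (not part of the original answer; the variable names are illustrative):
declare @long varchar(32) = 'hello wonderful world'
declare @u2 uniqueidentifier = cast(cast(@long as varbinary(16)) as uniqueidentifier)
select cast(cast(@u2 as varbinary(16)) as varchar(32)) -- returns only 'hello wonderful '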

To add some general thoughts:
As you were told already, a GUID has a length of 16 bytes.
Assuming you can reduce your string to plain Latin lower-case characters, you have to deal with 26 different values; adding the digits brings this to 36 (not counting dots, commas, question marks, etc.).
The count of bits determines the number of possible values. One byte of 8 bits can represent 256 (2^8) different values. For 26 letters you'd need at least 5 bits (2^5 = 32); together with the digits you'd have to go up to 6 bits (64 values). The 16-byte GUID provides 128 bits (16 x 8 = 128). You could divide this by 5 (~25) or by 6 (~21).
That means: reduced to 26 plain Latin characters you could (with quite some effort) encode up to 25 characters in the memory allocated for one GUID (by using chunks of 5 bits). Including digits, you are limited to a length of 21.
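As a quick sanity check of that arithmetic (plain T-SQL, integer division):
select 16 * 8 as guid_bits, -- 128 bits available in one GUID
128 / 5 as max_letters, -- 25 characters at 5 bits each (26-value alphabet)
128 / 6 as max_alphanum -- 21 characters at 6 bits each (36-value alphabet)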
If you want to deal with any VARCHAR value (which is extended ASCII), you can translate a string to binary and then to GUID and back easily (David Brown showed this in his answer), as long as you limit this to a length of 16 characters.
Otherwise you would need some kind of dictionary on both sides...

Related

Numeric Data Type - Storage

According to the Microsoft documentation, a value of type Numeric(10,2) - precision 10 - should take 9 bytes of storage.
But when I'm doing this:
DECLARE @var as numeric(10,0) = 2147483649
SELECT @var, DATALENGTH(@var)
DATALENGTH(@var) is returning 5 bytes instead of 9. Can someone explain why?
The documentation specifies:
Maximum storage sizes vary, based on the precision.
The storage is not constant for a given precision. The actual storage depends on the value.
As a note, this has nothing to do with the value being an integer. The following also returns 5:
declare @var numeric(11, 1) = 214483649.8
In actual fact, SQL Server seems to use the amount of storage needed for the value, not for the maximum value of the type. You can readily see this by changing the "10" to "20" and noting that the data length does not change.
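For example, the following sketch (per the claim above, both variables should report the same length):
declare @narrow numeric(10, 0) = 2147483649;
declare @wide numeric(20, 0) = 2147483649;
select datalength(@narrow) as narrow_len, datalength(@wide) as wide_len; -- both report 5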
EDIT:
You can see the dependence on the value if you run:
declare @a numeric(20, 1) = '123.1';
declare @b numeric(20, 1) = '1234567890123456789.0';
select datalength(@a), datalength(@b);
The two lengths are not the same.
The other answer, by @GordonLinoff, is wrong, or at least misleading.
Numeric is not stored with a variable number of bytes, but with a fixed size for a specific precision.
Trying this on SQL Server 2017 gave the same results you got.
The documentation you linked to originally, for numeric, is correct about how many bytes it takes to store a numeric of varying precisions.
This storage requirement is based only on the precision of the numeric column. In other words, that's how many bytes of storage are used. It is not a maximum that depends on the value in that row.
All rows use the same number of bytes for that column.
The key to this variation is that the documentation for DATALENGTH says this function
Returns the number of bytes used to represent any expression.
It appears that DATALENGTH does not mean 'represent' as in 'represent on disk', but rather 'represent in memory'.
The other documentation regarding numeric is talking about the on-disk storage of numeric.
This is probably because DATALENGTH is intended primarily for var* types or the other BLOB types.
So although a numeric(20,1) requires 13 bytes of storage, depending on the value, SQL Server can represent it in a smaller number of bytes when in memory, which is when DATALENGTH evaluates it.
As I pointed out in my other comment, although numeric has different sizes, it is a fixed-size data type, because for a specific column in a specific table, every value takes up the same amount of storage.
Roughly, a SQL Server row has 4 parts:
4 byte header
Fixed size data
Offsets into variable size data
Variable size data
Numerics & other fixed-size types are stored in part 2; var* types are stored in part 4, with their offsets in part 3.
This script displays the metadata for a table with some fixed & variable columns.
declare @a numeric(20, 1) = '123.1';
declare @b numeric(20, 1) = '1234567890123456789.0';
select datalength(@a) union select datalength(@b);
create table #numeric(num1 numeric(20,1), text1 varchar(10), char2 char(6));
insert into #numeric(num1, text1, char2) values ('123.1', 'hello', 'first'), ('1234567890123456789.0', 'there', '2nd');
select datalength(num1) from #numeric;
select
t.name as table_name,
c.name as column_name,
pc.partition_column_id,
pc.max_inrow_length,
pc.max_length,
pc.precision,
pc.scale,
pc.collation_name,
pc.leaf_offset
from tempdb.sys.tables as t
join tempdb.sys.partitions as p
on(t.object_id=p.object_id)
join tempdb.sys.system_internals_partition_columns as pc
on(pc.partition_id=p.partition_id)
join tempdb.sys.columns as c
on((c.object_id=p.object_id)and(c.column_id=pc.partition_column_id))
where (t.object_id=object_id('tempdb..#numeric'));
drop table #numeric;
Notice the leaf_offset column. This indicates the starting position of the value in the raw binary data.
The first column starts immediately after the 4 byte header.
The second fixed column starts 13 bytes later, as per the SQL documentation.
The varchar column has an offset of -1, indicating it is a variable-length column and its position in the byte array isn't fixed.
In this case it could be fixed, since there's only one var column, but an alter table statement could add another column and shift things.
If you want to research further, the best source is a book called SQL Server Internals, by Kalen Delaney. She was part of the team that wrote SQL Server.

How to create human-readable identifier with more than 64 bits of entropy on IBM Netezza?

I'm working on IBM Netezza/PureData and I want to add an identifier column to a table containing billions of new records every day, so that I can track each record as it travels through different tables and systems. I want this identifier to have more than 64 bits of entropy, as the column may contain (in time) more than 10^12 different records, and I want to avoid hash collisions. According to this table, 64 bits is not enough to avoid hash collisions with this number of records.
Not enough bits
So, on Netezza I can easily do
select hash8(123456) as id
which will return a 64-bit number (BIGINT):
-1789169473613552245
This is perfectly readable, but it has only 64 bits of entropy.
Not human readable
I can also do:
select hash(123456) as id
to create a 128-bit hash on Netezza. This has more than enough entropy, but it becomes an unreadable mess of unicode characters:
oð8^GþåíOpJ
I'm afraid this will cause trouble when I start combining this data with tables from other systems.
Enough bits and human readable?
So instead, I would like to create a human-readable identifier by, for example, converting this 128-bit unicode string to a base-62 string, containing only alphanumeric characters (0-9, a-z, A-Z). Something like:
6KMPOATg6Y5TbuEZlD59Dp
Any ideas on how to do this? Ideally with only (Netezza) SQL-code or functions...
Since you seem to have the SQL extensions available (hash() being one of the Netezza SQL extension functions), you could try doing 'rawtohex()' on the output of your hash()
e.g.
select rawtohex(hash(123456));
This gives a nice HEX string representation of the hashed data:
'E10ADC3949BA59ABBE56E057F20F883E'
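Note that this hex form is 32 characters long (128 bits at 4 bits per hex digit). For comparison, a base-62 encoding of the same 128 bits would need only ceil(128 / log2(62)) = 22 characters, which matches the 22-character example in the question.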
MS SQL has base-64 support, in XML datatype.
declare @source varbinary(max), @encoded varchar(max), @decoded varbinary(max)
set @source = convert(varbinary(max), 'Hello Base64')
set @encoded = cast('' as xml).value('xs:base64Binary(sql:variable("@source"))', 'varchar(max)')
set @decoded = cast('' as xml).value('xs:base64Binary(sql:variable("@encoded"))', 'varbinary(max)')
select
convert(varchar(max), @source) as source_varchar,
@source as source_binary,
@encoded as encoded,
@decoded as decoded_binary,
convert(varchar(max), @decoded) as decoded_varchar
From http://blog.falafel.com/t-sql-easy-base64-encoding-and-decoding/
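For reference, running the snippet should show 'Hello Base64' encoded as 'SGVsbG8gQmFzZTY0', with decoded_varchar round-tripping back to the original string.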

Update varbinary(MAX) field in SQLServer 2012 Lost Last 4 bits

Recently I wanted to do some data patching and tried to update a column of type varbinary(MAX). The update value looked like this:
0xFFD8F...6DC0676
However, after the update query ran successfully, the value became:
0x0FFD8...6DC067
It seems the last 4 bits are lost, or the whole value has been right-shifted by 4 bits...
I tried deleting the entire row and running an insert query instead; the same thing happens!
Can anyone tell me why this is happening and how I can solve it? Thanks!
I have tried several varying lengths of binary; up to a maximum of 43658 characters (each representing 4 bits, around 21 KB in total) the update query runs normally. One more character makes the above "bug" appear...
PS1: For a shorter-length varbinary as the update value, everything is okay.
PS2: I can post the whole binary string if it helps, but it is really long and I am not sure it's suitable to post here.
EDITED:
Thanks for any help!
As someone suggested, the value inserted may have an odd number of hex digits, so a 0 is appended in front of it. Here is my updated information on the value:
The value is 43677 characters long excluding "0x", which means yes, it is odd.
That does explain why a '0' is inserted at the front, but not why the last character disappears...
Then I did an experiment:
I inserted an even-length value, manually adding a '0' before the original value.
Now the value to be updated is
0x0FFD8F...6DC0676
which is 43678 characters long, excluding "0x".
No luck; the updated value is still
0x0FFD8...6DC067
It seems that the binary constant 0xFFD8F...6DC0676 that you used for the update contains an odd number of hex digits, and SQL Server added a half-byte at the beginning of the pattern so that it represents a whole number of bytes.
You can see the same effect by running the following simple query:
select 0x1, 0x104
This will return 0x01 and 0x0104.
The truncation may be due to some limitations in SSMS, which can be observed in the following experiment:
declare @b varbinary(max)
set @b = 0x123456789ABCDEF0
set @b = convert(varbinary(max), replicate(@b, 65536/datalength(@b)))
select datalength(@b) DataLength, @b Data
The results returned are 65536 and 0x123456789ABCDEF0...EF0123456789ABCD; however, if in SSMS I copy the Data column, I get a pattern 43677 characters long (without the leading 0x), which is effectively 21838.5 bytes. So it seems you should not rely (if you currently do) on long binary values obtained via copy/paste in SSMS.
A reliable alternative is to use an intermediate variable:
declare @data varbinary(max)
select @data = DataXXX from Table_XXX where ID = XXX
update Table_YYY set DataYYY = @data where ID = YYY

What does the specified number mean in a VARCHAR() clause?

Just to clarify: does specifying something like VARCHAR(45) mean it can take up to a maximum of 45 characters? I remember hearing from someone a few years ago that the number in the parentheses doesn't refer to the number of characters; the person then tried to explain something quite complicated, which I didn't understand and have since forgotten.
And what is the difference between CHAR and VARCHAR? I did search around a bit and saw that CHAR fixes the size of the column, and that it is better to use CHAR if your data has a fixed size and VARCHAR if your data size varies.
But if CHAR gives you the maximum size of the column, isn't it better to use it when your data size varies? Especially if you don't know how big your data is going to be. VARCHAR requires you to specify the size (CHAR doesn't really need it, right?); isn't that more troublesome?
You also have to specify the size with CHAR. With CHAR, column values are padded with spaces to fill the size you specified, whereas with VARCHAR, only the actual value you specified is stored.
For example:
CREATE TABLE test (
char_value CHAR(10),
varchar_value VARCHAR(10)
);
INSERT INTO test VALUES ('a', 'b');
SELECT * FROM test;
The above will select "a         " (padded with spaces to 10 characters) for char_value and "b" for varchar_value
If all your values are about the same size, CHAR is possibly a better choice, because it will often require less storage space than VARCHAR. This is because VARCHAR stores both the length of the value and the value itself, whereas CHAR can just store the (fixed-size) value.
The MySQL documentation gives a good explanation of the storage requirements of the various data types.
In particular, for a string of length L, a CHAR(M) datatype will take up (M x c) bytes (where c is the number of bytes required to store a character... this depends on the character set in use).
A VARCHAR(M) will take up (L + 1) or (L + 2) bytes depending on whether M is <= 255 or > 255.
So, it really depends on how long you expect your strings to be, what the variation in length will be.
NB: The documentation doesn't discuss the impact of character sets on the storage requirements of a VARCHAR type. I've tried to quote it accurately, but my guess is that you would need to multiply the string length by the character byte-width as well to get the storage requirement.
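To make those formulas concrete, here is a sketch assuming a 1-byte-per-character set such as latin1 (the table name is illustrative):
CREATE TABLE t (
c CHAR(45),
v VARCHAR(45)
) CHARACTER SET latin1;
INSERT INTO t VALUES ('hello', 'hello');
-- per the formulas above: c occupies 45 bytes (M x c = 45 x 1),
-- while v occupies 6 bytes (L + 1 = 5 + 1, since M <= 255)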
The complicated stuff you don't remember is that the 45 can refer to bytes, not characters. It's not the same if you are using a multibyte character encoding. In Oracle you can specify bytes or chars explicitly.
varchar2(45 BYTE)
or
varchar2(45 CHAR)
See Difference between BYTE and CHAR in column datatypes
CHAR and VARCHAR actually become irrelevant if you have even one variable-length field in your table, such as a VARCHAR or TEXT. MySQL will then automatically change all CHAR columns to VARCHAR.
A fixed-length record can give you extra performance, but then you can't use any variable-length field types. The reason is that it is quicker and easier for MySQL to find the next record.
For example, if you do a SELECT * FROM table LIMIT 10, MySQL has to scan the table file for the tenth record. This means finding the end of each record until it finds the end of the 10th record. But if your table has fixed-length records, MySQL just needs to know the record size and can skip ahead 10 x record-size bytes.
If you know a column will contain a small, fixed number of chars use a CHAR, otherwise use a varchar. A CHAR column is padded to the max length.
VARCHAR has a small overhead (4-8 bytes depending on RDBMS), but only uses the overhead + the actual number of chars stored.
For values you know will have a constant length, for example phone numbers, zip codes, etc., it is optimal to use CHAR for sure.

Which data type saves more space TINYTEXT or VARCHAR for variable data length in MySQL?

I need to store data in MySQL. Its length is not fixed; it could be 255 or 2 characters long. Should I use TINYTEXT or VARCHAR in order to save space (speed is irrelevant)?
When using VARCHAR, you need to specify the maximum number of characters that will be stored in that column. So, if you declare a column as VARCHAR(255), you can store text with up to 255 characters. The important thing here is that if you insert two characters, only those two characters will be stored, i.e. the allocated space will be 2, not 255.
TINYTEXT is one of the four TEXT types. They are very similar to VARCHAR, but there are a few differences (depending on the MySQL version you are using). For version 5.5, there are some limitations on TEXT types: you have to specify an index prefix length for indexes on TEXT, and TEXT columns can't have default values.
In general, TEXT should be used for (extremely) long values. If you will be using strings of up to 255 characters, you should use VARCHAR.
Hope this helps.
As for data storage space, VARCHAR(255) and TINYTEXT are equivalent:
VARCHAR(M): L + 1 bytes if column values require 0 – 255 bytes, L + 2 bytes if values may require more than 255 bytes.
TINYTEXT: L + 1 bytes, where L < 2^8.
Source: MySQL Reference Manual: Data Storage Requirements.
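As a worked example of those formulas: a 2-character value costs 2 + 1 = 3 bytes in either a VARCHAR(255) or a TINYTEXT column, and a 255-character value costs 255 + 1 = 256 bytes in both, so at this size limit neither type saves space over the other.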
Storage space being equal, you may want to check out the following Stack Overflow posts for further reading on when you should use one or the other:
What’s the difference between VARCHAR(255) and TINYTEXT string types in MySQL?
varchar(255) v tinyblob v tinytext