Numeric Data Type - Storage - sql

According to Microsoft's documentation, a value of type numeric(10,2) (precision 10) should take 9 bytes of storage.
But when I run this:
DECLARE @var as numeric(10,0) = 2147483649
SELECT @var, DATALENGTH(@var)
DATALENGTH(@var) returns 5 bytes instead of 9. Can someone explain why?

The documentation specifies:
Maximum storage sizes vary, based on the precision.
The storage is not constant for a given precision. The actual storage depends on the value.
As a note, this has nothing to do with whether the value is an integer. The following also returns 5:
declare @var numeric(11, 1) = 214483649.8
In actual fact, SQL Server seems to use the amount of storage needed for the value, not for the maximum value of the type. You can readily see this by changing the "10" to "20" and noting that the data length does not change.
EDIT:
You can see the dependence on the value if you run:
declare @a numeric(20, 1) = '123.1';
declare @b numeric(20, 1) = '1234567890123456789.0';
select datalength(@a), datalength(@b);
The two lengths are not the same.

The other answer, by @GordonLinoff, is wrong, or at least misleading.
Numeric is not stored with a variable number of bytes, but with a fixed size for a specific precision.
Trying this on SQL Server 2017 gave the same results you got.
The documentation you linked to originally, for numeric, is correct about how many bytes it takes to store a numeric of varying precisions.
This storage requirement is based only on the precision of the numeric column. In other words, that's how many bytes of storage are used. It is not a maximum that depends on the value in that row.
All rows use the same number of bytes for that column.
The key to this discrepancy is that the documentation for DATALENGTH says this function
Returns the number of bytes used to represent any expression.
It appears that DATALENGTH does not mean 'represent' as in 'represent on disk', but rather 'represent in memory'.
The other documentation regarding numeric is talking about the on-disk storage of numeric.
This is probably because DATALENGTH is intended primarily for var* types or the other BLOB types.
So although a numeric(20,1) requires 13 bytes of storage, depending on the value, SQL Server can represent it in a smaller number of bytes when in memory, which is when DATALENGTH evaluates it.
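As a quick reference for the fixed on-disk sizes mentioned above, here is a minimal Python sketch (not T-SQL) of the precision-to-bytes mapping. The function name numeric_storage_bytes is made up for illustration; the byte counts are the ones listed in the SQL Server decimal/numeric documentation.

```python
# Precision ranges and their fixed storage sizes, as listed in the
# SQL Server decimal/numeric documentation.
STORAGE_BY_PRECISION = [
    (9, 5),    # precision 1-9   -> 5 bytes
    (19, 9),   # precision 10-19 -> 9 bytes
    (28, 13),  # precision 20-28 -> 13 bytes
    (38, 17),  # precision 29-38 -> 17 bytes
]


def numeric_storage_bytes(precision: int) -> int:
    """Return the fixed on-disk size of a numeric(precision, s) column.

    Hypothetical helper for illustration; the scale does not affect the size.
    """
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    for max_precision, size in STORAGE_BY_PRECISION:
        if precision <= max_precision:
            return size
    raise AssertionError("unreachable")


for p in (10, 11, 20):
    print(p, numeric_storage_bytes(p))
```

This covers only the documented column storage. It says nothing about what DATALENGTH reports for a variable.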
As I pointed out in my other comment, although numeric has different sizes, it is a fixed-size data type, because for a specific column in a specific table, every value takes up the same amount of storage.
Roughly, a SQL Server row has 4 parts:
1. 4 byte header
2. Fixed size data
3. Offsets into variable size data
4. Variable size data
Numerics & other fixed size types are stored in part 2, var* types are stored in part 4, with their lengths in part 3.
This script displays the metadata for a table with some fixed & variable columns.
declare @a numeric(20, 1) = '123.1';
declare @b numeric(20, 1) = '1234567890123456789.0';
select datalength(@a) union select datalength(@b);
create table #numeric(num1 numeric(20,1), text1 varchar(10), char2 char(6));
insert into #numeric(num1, text1, char2) values ('123.1', 'hello', 'first'), ('1234567890123456789.0', 'there', '2nd');
select datalength(num1) from #numeric;
select
t.name as table_name,
c.name as column_name,
pc.partition_column_id,
pc.max_inrow_length,
pc.max_length,
pc.precision,
pc.scale,
pc.collation_name,
pc.leaf_offset
from tempdb.sys.tables as t
join tempdb.sys.partitions as p
on(t.object_id=p.object_id)
join tempdb.sys.system_internals_partition_columns as pc
on(pc.partition_id=p.partition_id)
join tempdb.sys.columns as c
on((c.object_id=p.object_id)and(c.column_id=pc.partition_column_id))
where (t.object_id=object_id('tempdb..#numeric'));
drop table #numeric;
Notice the leaf_offset column. This indicates the starting position of the value in the raw binary data.
The first column starts immediately after the 4 byte header.
The second fixed column starts 13 bytes later, as per the SQL documentation.
The varchar column has an offset of -1, indicating it is a variable length column & its position in the byte array isn't fixed.
In this case it could be fixed since there's only 1 var column, but an alter table statement could add another column & shift things.
If you want to research further, the best source is a book called SQL Server Internals, by Kalen Delaney. She was part of the team that wrote SQL Server.

Related

Unexpected behavior of binary conversions (COALESCE vs. ISNULL)

Can you comment on what approach shown below is preferable? I hope the question will not be blocked as "opinionated". I would like to believe there is an explanation that makes that clear.
Context: I have code for mirroring 3rd party table contents to my own table (optimization). It worked flawlessly for some time, until the size/modification of the database reached some threshold.
The optimization is based on row version values of several tables, and on remembering the maximum of the values from the source tables. This way I am able to update my local table incrementally, much faster than rebuilding it from time to time from scratch.
The problem started to appear when the row-version value exceeded the 4-byte range. After some effort, I spotted that the upper 4 bytes of the binary(8) value were set to 0. Later, the culprit was found to have the form COALESCE(MAX(row_version), 1).
The COALESCE was used to cover the case when the local table is fresh, containing no data -- for comparing the MAX(row_version) of source tables with something meaningful.
The examples to show the bug: To simulate the last mentioned situation, I want to convert the NULL value of the binary(8) column to 1. I am adding also the ISNULL usage that was added later. The original code contained the COALESCE only.
DECLARE @bin8null binary(8) = NULL
SELECT 'bin NULL' AS the_variable, @bin8null AS value
SELECT 'coalesce 1' AS op, COALESCE(@bin8null, 1) AS good_value
SELECT 'coalesce 1 + convert' AS op, CONVERT(binary(8), COALESCE(@bin8null, 1)) AS good_value
SELECT 'isnull 1' AS op, ISNULL(@bin8null, 1) AS good_value
SELECT 'isnull 0x1' AS op, ISNULL(@bin8null, 0x1) AS bad_value
(There is a bug in the image coalesce 0x1 + convert fixed later in the code to coalesce 1 + convert, but not fixed in the image.)
The application bug appeared when the binary value was bigger than the part that could be stored in 4 bytes. Here the 0xAAAAAAAA was used. (Actually, the 0x00000001 was the case, and it was difficult to spot that the single 1 was changed to 0.)
DECLARE @bin8 binary(8) = 0xAAAAAAAA01BB3A35
SELECT 'bin' AS the_variable, @bin8 AS value
SELECT 'coalesce 1' AS op, COALESCE(@bin8, 1) AS bad_value
SELECT 'coalesce 1 + convert' AS op, CONVERT(binary(8), COALESCE(@bin8, 1)) AS bad_value
SELECT 'coalesce 0x1 + convert ' AS op, CONVERT(binary(8), COALESCE(@bin8, 0x1)) AS good_value
SELECT 'isnull 1' AS op, ISNULL(@bin8, 1) AS good_value
SELECT 'isnull 0x1' AS op, ISNULL(@bin8, 0x1) AS good_value
When executed in Microsoft SQL Server Management Studio on MS-SQL Server 2014, the result looks like this:
Description -- my understanding: COALESCE() seems to derive the type of the result from the type of the last processed argument. This way, the non-NULL binary(8) was converted to int, and that led to the loss of the upper 4 bytes. (See the 2nd and 3rd red bad_value in the picture. The difference between the two cases is only in the decimal/hexadecimal form of display.)
On the other hand, ISNULL() seems to preserve the type of the first argument and converts the second value to that type. One should be careful to understand that binary(8) is more like a series of bytes; reading it as one large integer is only an interpretation. Hence, 0x1 as the default value does not expand to an 8-byte integer and produces a bad value.
My solution: So, I have fixed the bug using ISNULL(MAX(row_version), 1). Is that correct?
This is not a bug. They're documented to handle data type precedence differently. COALESCE determines the data type of the output based on examining all of the arguments, while ISNULL has a more simplistic approach of inspecting only the first argument. (Both still need to contain values which are all compatible, meaning they are all possible to convert to the determined output type.)
From the COALESCE topic:
Returns the data type of expression with the highest data type precedence.
The ISNULL topic does not make this distinction in the same way, but implicitly states that the first expression determines the type:
replacement_value must be of a type that is implicitly convertible to the type of check_expression.
I have a similar example (and describe several other differences between COALESCE and ISNULL) here. Basically:
DECLARE @int int, @datetime datetime;
SELECT COALESCE(@int, CURRENT_TIMESTAMP);
-- works because datetime has a higher precedence than int, so datetime is chosen as the output type
2020-08-20 09:39:41.763
GO
DECLARE @int int, @datetime datetime;
SELECT ISNULL(@int, CURRENT_TIMESTAMP);
-- fails because int, the first (and chosen) output type, has a lower precedence than datetime
Msg 257, Level 16, State 3
Implicit conversion from data type datetime to int is not allowed. Use the CONVERT function to run this query.
Let me start off by saying:
This is not a "bug".
ISNULL and COALESCE are not the same function, and operate quite differently.
ISNULL takes 2 parameters, and returns the second parameter if the first has the value NULL. If the 2 parameters have different data types, then the data type of the first parameter is returned (implicitly casting the second value).
COALESCE takes 2+ parameters, and returns the first non-NULL parameter. COALESCE is a shorthand CASE expression, and uses Data Type Precedence to determine the returned data type.
As a result, this is why ISNULL returns what you expect, there is no implicit conversion in your query for the non-NULL variable.
For the COALESCE there is an implicit conversion. binary has the lowest precedence of all the data types, with a rank of 30 (at time of writing). The value 1 is an int, which has a rank of 16, a far higher precedence than rank 30.
As a result COALESCE(@bin8, 1) will implicitly convert the value 0xAAAAAAAA01BB3A35 to an int and then return that value. You can see this because SELECT CONVERT(int,0xAAAAAAAA01BB3A35) returns 29047349, which is your first "bad" value; it's not "bad", it's correct for what you wrote.
Then for the latter "bad" value, we can convert that int value (29047349) back to a binary, which results in 0x0000000001BB3A35, which is, again the result you get.
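The byte arithmetic behind those two values can be checked outside SQL Server. The following is a minimal Python sketch rather than T-SQL, and the helper names are made up for illustration. It assumes the conversion rules described above: binary to int keeps only the rightmost 4 bytes, and a shorter binary converted to binary(8) is padded with zero bytes on the right.

```python
def binary_to_int32(data: bytes) -> int:
    """Mimic converting binary to int: only the last 4 bytes survive."""
    return int.from_bytes(data[-4:], "big", signed=True)


def int32_to_binary8(value: int) -> bytes:
    """Mimic converting an int back to binary(8): sign-extended to 8 bytes."""
    return value.to_bytes(8, "big", signed=True)


row_version = bytes.fromhex("AAAAAAAA01BB3A35")

as_int = binary_to_int32(row_version)
round_trip = int32_to_binary8(as_int)

print(as_int)                     # 29047349
print(round_trip.hex().upper())   # 0000000001BB3A35

# The ISNULL(NULL, 0x1) case from the question: 0x1 is the single byte 0x01,
# and a byte string is padded on the right, not treated as a number.
padded = b"\x01".ljust(8, b"\x00")
print(padded.hex().upper())       # 0100000000000000
```

The upper four bytes (0xAAAAAAAA) are gone after the trip through int, which matches the "bad" values in the question.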
TL;DR: checking return types of functions is important. ISNULL returns the data type of the first parameter and will implicitly convert the second if needed. COALESCE uses Data Type Precedence, and will implicitly convert the returned value to the data type with the highest precedence of all the possible return values.

How Can I Get An Exact Character Representation of a Float in SQL Server?

We are doing some validation of data which has been migrated from one SQL Server to another SQL Server. One of the things that we are validating is that some numeric data has been transferred properly. The numeric data is stored as a float datatype in the new system.
We are aware that there are a number of issues with float datatypes, that exact numeric accuracy is not guaranteed, and that one cannot use exact equality comparisons with float data. We don't have control over the database schemas nor data typing and those are separate issues.
What we are trying to do in this specific case is verify that some ratio values were transferred properly. One of the specific data validation rules is that all ratios should be transferred with no more than 4 digits to the right of the decimal point.
So, for example, valid ratios would look like:
.7542
1.5423
Invalid ratios would be:
.12399794301
12.1209377
What we would like to do is count the number of digits to the right of the decimal point and find all cases where the float values have more than four digits to the right of it. We've been using the SUBSTRING, LEN, STR, and a couple of other functions to achieve this, and I am sure it would work if we had numeric fields typed as decimal which we were casting to char.
However, what we have found when attempting to convert a float to a char value is that SQL Server seems to always convert to decimal in between. For example, the field in question shows this value when queried in SQL Server Enterprise Manager:
1.4667
Attempting to convert to a string using the recommended function for SQL Server:
LTRIM(RTRIM(STR(field_name, 22, 17)))
Returns this value:
1.4666999999999999
The value which I would expect if SQL Server were directly converting from float to char (which we could then trim trailing zeroes from):
1.4667000000000000
Is there any way in SQL Server to convert directly from a float to a char without going through what appears to be an intermediate conversion to decimal along the way? We also tried the CAST and CONVERT functions and received similar results to the STR function.
SQL Server Version involved: SQL Server 2012 SP2
Thank you.
Your validation rule seems to be misguided.
An SQL Server FLOAT, or FLOAT(53), is stored internally as a 64-bit floating-point number according to the IEEE 754 standard, with 53 bits of mantissa ("value") plus an exponent. Those 53 binary digits correspond to approximately 15 decimal digits.
Floating-point numbers have limited precision, which does not mean that they are "fuzzy" or inexact in themselves, but that not all numbers can be exactly represented, and instead have to be represented using another number.
For example, there is no exact representation for your 1.4667, and it will instead be stored as a binary floating-point number that (exactly) corresponds to the decimal number 1.466699999999999892708046900224871933460235595703125. Correctly rounded to 16 decimal places, that is 1.4666999999999999, which is precisely what you got.
Since the "exact character representation of the float value that is in SQL Server" is 1.466699999999999892708046900224871933460235595703125, the validation rule of "no more than 4 digits to the right of the decimal point" is clearly flawed, at least if you apply it to the "exact character representation".
What you might be able to do, however, is to round the stored number to fewer decimal places, so that the small error at the end of the decimals is hidden. Converting to a character representation rounded to 15 instead of 16 places (remember those "15 decimal digits" mentioned at the beginning?) will give you 1.466700000000000, and then you can check that all decimals after the first four are zeroes.
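The same effect can be reproduced outside SQL Server, since Python floats are also IEEE 754 64-bit values. This is only an illustrative sketch in Python, not a statement about how STR() works internally.

```python
from decimal import Decimal

ratio = 1.4667            # stored as the nearest 64-bit binary float
exact = Decimal(ratio)    # exact decimal expansion of the stored value

print(exact)                    # the long expansion quoted above
print(format(ratio, ".16f"))    # 16 decimals exposes the representation error
print(format(ratio, ".15f"))    # 15 decimals hides it

# Validation idea from the answer: after rounding to 15 decimals,
# everything past the fourth decimal should be zero.
decimals = format(ratio, ".15f").split(".")[1]
print(set(decimals[4:]) == {"0"})
```

Rounding to 15 places works here because a 64-bit float carries roughly 15 significant decimal digits. Values with more digits before the decimal point leave fewer reliable digits after it.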
You can try using cast to varchar.
select case when
len(
substring(cast(col as varchar(100))
,charindex('.',cast(col as varchar(100)))+1
,len(cast(col as varchar(100)))
)
) = 4
then 'true' else 'false' end
from tablename
where charindex('.',cast(col as varchar(100))) > 0
For this particular number, don't use STR(), and use a convert or cast to varchar. But, in general, you will always have precision issues when storing in float... it's the nature of the storage of that datatype. The best you can do is normalize to a NUMERIC type and compare with threshold ranges (+/- .0001, for example). See the following for a breakdown of how the different conversions work:
declare @float float = 1.4667
select @float,
convert(numeric(18,4), @float),
convert(nvarchar(20), @float),
convert(nvarchar(20), convert(numeric(18,4), @float)),
str(@float, 22, 17),
str(convert(numeric(18,4), @float)),
convert(nvarchar(20), convert(numeric(18,4), @float))
Instead of casting to a VarChar you might try this: cast to a decimal with 4 fractional digits and check if it's the same value as before.
case when field_name <> convert(numeric(38,4), field_name)
then 1
else 0
end
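The same round-and-compare idea can be sketched in Python, whose floats are also 64-bit IEEE 754 values. The function name is hypothetical, and this is an analogy to the CASE expression above rather than a replacement for it.

```python
def has_at_most_four_decimals(value: float) -> bool:
    """True when rounding to 4 decimal places gives back the same float."""
    return round(value, 4) == value


for candidate in (0.7542, 1.5423, 1.4667, 0.12399794301, 12.1209377):
    print(candidate, has_at_most_four_decimals(candidate))
```

The check passes for the valid ratios because rounding to four places yields the float nearest to that four-decimal number, which is the value that was stored in the first place.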
The issue you have here is that float is an approximate-number data type: float(53), the default, holds about 15 significant digits, and real (float(24)) about 7. That means it approximates the value rather than storing it exactly the way decimal / numeric does. That's why you don't use float for values that require exact precision.
Check this example:
DECLARE @t TABLE (
col FLOAT
)
INSERT into @t (col)
VALUES (1.4666999999999999)
,(1.4667)
,(1.12399794301)
,(12.1209377);
SELECT col
, CONVERT(NVARCHAR(MAX),col) AS chr
, CAST(col as VARBINARY) AS bin
, LTRIM(RTRIM(STR(col, 22, 17))) AS rec
FROM @t
As you can see, the float 1.4666999999999999 is binary-identical to 1.4667. For your stated needs I think this query would fit:
SELECT col
, RIGHT(CONVERT(NVARCHAR(MAX),col), LEN(CONVERT(NVARCHAR(MAX),col)) - CHARINDEX('.',CONVERT(NVARCHAR(MAX),col))) AS prec
from @t

Update varbinary(MAX) field in SQLServer 2012 Lost Last 4 bits

Recently I needed to do some data patching and tried to update a column of type varbinary(MAX). The update value looks like this:
0xFFD8F...6DC0676
However, after the update query runs successfully, the value becomes:
0x0FFD8...6DC067
It seems the last 4 bits are lost, or the whole value is shifted right by half a byte...
I tried deleting the entire row and running an INSERT query, and the same thing happens!
Can anyone tell me why is this happening & how can I solve it? Thanks!
I have tried binary values of several lengths: up to 43658 characters (each representing 4 bits, about 21 KB in total), the update query runs normally. One more character makes the above "bug" appear...
PS1: For a shorter length varbinary as update value, everything is okay
PS2: I can post whole binary string out if it helps, but it is really long and I am not sure if it's suitable to post here
EDITED:
Thanks for any help!
As someone suggested, the inserted value may have an odd number of hex digits (4 bits each), so a 0 is prepended to it. Here is an update on the value:
The value is 43677 characters long, excluding "0x", which means that yes, it is odd.
That explains why a '0' is inserted at the front, but not why the last character disappears...
Then I did an experiment:
I inserted an even-length value by manually adding a '0' in front of the original value.
Now the value to be updated is
0x0FFD8F...6DC0676
which is 43678 characters long, excluding "0x".
No luck: the updated value is still
0x0FFD8...6DC067
It seems that the binary constant 0xFFD8F...6DC0676 that you used for the update contains an odd number of hex digits, and SQL Server added a half-byte at the beginning of the pattern so that it represents a whole number of bytes.
You can see the same effect running the following simple query:
select 0x1, 0x104
This will return 0x01 and 0x0104.
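The padding rule can be sketched in a few lines of Python. The helper below is hypothetical and only mimics the observed behaviour (a leading 0 is added when the digit count is odd). It is not SQL Server's actual parser.

```python
def normalize_hex_literal(literal: str) -> str:
    """Left-pad a hex literal with one '0' so it covers whole bytes."""
    digits = literal[2:] if literal.lower().startswith("0x") else literal
    if len(digits) % 2:          # odd number of hex digits = half a byte short
        digits = "0" + digits
    return "0x" + digits.upper()


print(normalize_hex_literal("0x1"))    # 0x01
print(normalize_hex_literal("0x104"))  # 0x0104

# A literal with 43677 hex digits, as in the question, gains one leading digit
# and then occupies a whole number of bytes.
long_literal = "0x" + "F" * 43677
byte_count = (len(normalize_hex_literal(long_literal)) - 2) // 2
print(byte_count)                      # 21839
```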
The truncation may be due to some limitations in SSMS, which can be observed in the following experiment:
declare @b varbinary(max)
set @b = 0x123456789ABCDEF0
set @b = convert(varbinary(max), replicate(@b, 65536/datalength(@b)))
select datalength(@b) DataLength, @b Data
The results returned are 65536 and 0x123456789ABCDEF0...EF0123456789ABCD, however if in SSMS I copy Data column I'm getting pattern of 43677 characters length (this is without leading 0x), which is 21838.5 bytes effectively. So it seems you should not (if you do) rely on long binary data values obtained via copy/paste in SSMS.
The reliable alternative can be using intermediate variable:
declare @data varbinary(max)
select @data = DataXXX from Table_XXX where ID = XXX
update Table_YYY set DataYYY = @data where ID = YYY

Getting max length of a varchar(max) from syscolumns in sql server

select c.name, t.name, c.length
from syscolumns c
c.length gives me -1 for any column that has max e.g varchar(max)
What should I do to get length ?
The data type of length on sys.columns is a smallint, whilst the max length of the varchar(max) is 2.1 billion, so it has a problem holding the real length. The -1 is in the documentation for denoting a varchar(max), varbinary(max), nvarchar(max) and xml.
http://msdn.microsoft.com/en-us/library/ms176106(v=sql.100).aspx
If you really need the number, then you would need a CASE expression to replace -1 with 2^31-1, i.e. 2147483647.
If you want to get the length of physical data, then you need to max / min / avg the appropriate lengths on the tables with the data on it based on what you need that information for. When querying the length of the field, DATALENGTH returns the bytes used, LEN returns the characters count.
-1 means that the column is of type max. The max length is then the max type, as per documentation. MAX types have a maximum length of 2GB if the FILESTREAM attribute is not specified, or a max size limited only by the disk size available:
The sizes of the BLOBs are limited only by the volume size of the file
system. The standard varbinary(max) limitation of 2-GB file sizes does
not apply to BLOBs that are stored in the file system.
Therefore your question really doesn't have an answer. You can ask what the actual size of any actual value in the table is, using DATALENGTH.
As seen HERE:
Variable-length, non-Unicode character data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size is the actual length of data entered + 2 bytes. The data entered can be 0 characters in length. The ISO synonyms for varchar are char varying or character varying.
In other words, max = 2147483647 bytes if all the possible space is occupied.
The length of the column in each row could vary. Hence the result of -1

What does the specified number mean in a VARCHAR() clause?

Just to clarify, does specifying something like VARCHAR(45) mean the column can hold at most 45 characters? I remember hearing from someone a few years ago that the number in the parentheses doesn't refer to the number of characters. The person then tried to explain something quite complicated, which I didn't understand and have since forgotten.
And what is the difference between CHAR and VARCHAR? I did search around a bit and saw that CHAR gives you the max of the size of the column, and that it is better to use it if your data has a fixed size and to use VARCHAR if your data size varies.
But if it gives you the max of the size of the column for all the data in this column, isn't it better to use it when your data size varies? Especially if you don't know how big your data is going to be. VARCHAR needs a size to be specified (CHAR doesn't really need one, right?), so isn't it more troublesome?
You also have to specify the size with CHAR. With CHAR, column values are padded with spaces to fill the size you specified, whereas with VARCHAR, only the actual value you specified is stored.
For example:
CREATE TABLE test (
char_value CHAR(10),
varchar_value VARCHAR(10)
);
INSERT INTO test VALUES ('a', 'b');
SELECT * FROM test;
The above will select "a" padded with nine trailing spaces for char_value and "b" for varchar_value.
If all your values are about the same size, the CHAR is possibly a better choice because it will often require less storage space than VARCHAR. This is because VARCHAR stores both the length of the value and the value itself, whereas CHAR can just store the (fixed-size) value.
The MySQL documentation gives a good explanation of the storage requirements of the various data types.
In particular, for a string of length L, a CHAR(M) datatype will take up (M x c) bytes (where c is the number of bytes required to store a character... this depends on the character set in use).
A VARCHAR(M) will take up (L + 1) or (L + 2) depending on whether M is <=255 or >255.
So, it really depends on how long you expect your strings to be, what the variation in length will be.
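To make the comparison concrete, here is a small Python sketch of the two formulas exactly as quoted in this answer. The function names are made up, and the M <= 255 threshold and the single-byte default are taken from the answer's wording rather than checked against any particular MySQL version.

```python
def char_storage_bytes(m: int, bytes_per_char: int = 1) -> int:
    """CHAR(M): always M characters, whatever the stored string length."""
    return m * bytes_per_char


def varchar_storage_bytes(length: int, m: int) -> int:
    """VARCHAR(M): the data plus a 1- or 2-byte length prefix.

    Assumes a single-byte character set, as the quoted formula does.
    """
    if length > m:
        raise ValueError("string is longer than the declared maximum")
    prefix = 1 if m <= 255 else 2
    return length + prefix


word = "hello"
print(char_storage_bytes(10))                  # 10
print(varchar_storage_bytes(len(word), 10))    # 6
print(varchar_storage_bytes(len(word), 300))   # 7
```

With these formulas, CHAR only comes out smaller when nearly every value fills the declared length.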
NB: The documentation doesn't discuss the impact of character sets on the storage requirements of a VARCHAR type. I've tried to quote it accurately, but my guess is that you would need to multiply the string length by the character byte-width as well to get the storage requirement.
The complicated stuff you don't remember is that the 45 may refer to bytes, not chars, depending on the database. The two are not the same if you are using a multibyte character encoding. In Oracle you can specify bytes or chars explicitly.
varchar2(45 BYTE)
or
varchar2(45 CHAR)
See Difference between BYTE and CHAR in column datatypes
The choice between char and varchar actually becomes irrelevant if you have even 1 variable length field in your table, like a varchar or text. MySQL will automatically change all char columns to varchar.
Fixed length/size records can give you extra performance, but then you can't use any variable length field types. The reason is that it is quicker and easier for MySQL to find the next record.
For example, if you do a SELECT * FROM table LIMIT 10, MySQL has to scan the table file for the tenth record. This means finding the end of each record until you find the end of the 10th record. But if your table has fixed length/size records, MySQL just needs to know the record size and then skip 10 x #bytes.
If you know a column will contain a small, fixed number of chars use a CHAR, otherwise use a varchar. A CHAR column is padded to the max length.
VARCHAR has a small overhead (4-8 bytes depending on RDBMS), but only uses the overhead + the actual number of chars stored.
For the values you know they are going to be constant, for example for Phone Numbers, Zip Codes etc., It is optimal to use "char" for sure.