Effect of NULL values on storage in SQL Server?

Effect of NULL values on storage in SQL Server? - sql

If you have a table with 20 rows that contains 12 null columns and 8 columns with values, what is the implications for storage and memory usage?
Is null unique or is it stored in memory at the same location each time and just referenced? Do a ton of nulls take up a ton of space? Does a table full of nulls take up the same amount of space as a table the same size full of int values?
This is for Sql server.

This depends on database engine as well as column type.
At least SQLite stores each null column as a "null type" which takes up NO additional space (each record is serialized to a single blob for storage so there is no space reserved for a non-null value in this case). With optimizations like this a NULL value has very little overhead to store. (SQLite also has optimizations for the values 0 and 1 -- the designers of databases aren't playing about!) See 2.1 Record Format for the details.
Now, things can get much more complex, especially with updating and potential index fragmentation. For instance, in SQL Server space may be reserved for the column data, depending upon the type. For instance, a int null will still reserve space for the integer (as well as have an "is null" flag somewhere), however varchar(100) null doesn't seem to reserve the space (this last bit is from memory, so be warned!).
Happy coding.

Starting with SQL Server 2008, you can define a column as SPARSE when you have a "ton of nulls". This will save some space but it requires a portion of the values of a column to be null . Exactly how much depends on the type.
See the Estimated Space Savings by Data Type tables in the article Using Sparse Columns which will tell you what percentage of the values need to be null for net saving of 40%
For example according to the tables 98% of values in a bit field must be null in order to get a savings of 40% while only 43% of a uniqueidentifier column will net you the same percentage.

Related

Do columns with the value NULL impact the performance of Microsoft SQL Server?

I have a datatable with over 200 columns, however, in over half of those columns, the majority of my datarows has the value 'NULL'.
Do those NULL-values decrease the performance of my SQL Server or are fields with NULL-values irrelevant for all actions on the datatable?

Performance of the table is basically a function of I/O. The way that SQL Server lays out rows on data pages means that NULL values might or might not take up space -- depending on the underlying data type. SQL Server data pages contain a list of nullability bits for each column (even NOT NULL columns) to keep the NULL information.
Variable length strings simply use the NULL bits, so they occupy no additional space in each row. Other data types do occupy space, even for NULL values (this includes fixed-length strings, I believe).
What impact does this have on performance? If you have 200 NULL integer fields, that is 800 bytes on the data page. That limits the number of records stored on a given page to no more than 10 records. So, if you want to read 100 records, the query has to read (at least) 10 data pages. If the table did not have these columns, then it might be able to read only one data page.
Whether or not this is important for a given query or set of queries depends on the queries. But yes, columns that are NULL could have an impact on performance, particularly on the I/O side of queries.

Aside from taking up space, and what little impact that may have on performance, they don't have any other effect.

SQL Query Performance with an nvarchar(500) where the MAX(LEN(column)) < 30 [duplicate]

I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: Varchar(max) will be stored as a text datatype, so that has drawbacks. So lets say your field will reliably be under 8000 characters. Like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).

One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with after triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential
size that matters.
Similarly if using memory optimised tables since 2016 it has been possible to use LOB columns or combinations of column widths that could potentially exceed the inrow limit but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance
Another case where over declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable length (non BLOB) columns is fixed for each row in an execution tree and is per the columns' declared maximum length which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source this analysis is best done up front and enforced there.
Back in the SQL Server engine itself a similar case is that when calculating the memory grant to allocate for SORT operations SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that this can lead to the sort operations spilling to tempdb.
In your case if your varchar columns are declared as 8000 bytes but actually have contents much less than that your query will be allocated memory that it doesn't require which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
use tempdb;
CREATE TABLE T(
id INT IDENTITY(1,1) PRIMARY KEY,
number int,
name8000 VARCHAR(8000),
name500 VARCHAR(500))
INSERT INTO T
(number,name8000,name500)
SELECT number, name, name /*<--Same contents in both cols*/
FROM master..spt_values
SELECT id,name500
FROM T
ORDER BY number
SELECT id,name8000
FROM T
ORDER BY number

From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.

There are some disadvantages to large columns that are a bit less obvious and might catch you a little later:
All columns you use in an INDEX - must not exceed 900 bytes
All the columns in an ORDER BY clause may not exceed 8060 bytes. This is a bit difficult to grasp since this only applies to some columns. See SQL 2008 R2 Row size limit exceeded for details)
If the total row size exceeds 8060 bytes, you get a "page spill" for that row. This might affect performance (A page is an allocation unit in SQLServer and is fixed at 8000 bytes+some overhead. Exceeding this will not be severe, but it's noticable and you should try to avoid it if you easily can)
Many other internal datastructures, buffers and last-not-least your own varaibles and table-variables all need to mirror these sizes. With excessive sizes, excessive memory allocation can affect performance
As a general rule, try to be conservative with the column width. If it becomes a problem, you can easily expand it to fit the needs. If you notice memory issues later, shrinking a wide column later may become impossible without losing data and you won't know where to begin.
In your example of the business names, think about where you get to display them. Is there really space for 500 characters?? If not, there is little point in storing them as such. http://en.wikipedia.org/wiki/List_of_companies_of_the_United_States lists some company names and the max is about 50 characters. So I'd use 100 for the column max. Maybe more like 80.

Apart from best practices (BBlake's answer)
You get warnings about maximum row size (8060) bytes and index width (900 bytes) with DDL
DML will die if you exceed these limits
ANSI PADDING ON is the default so you could end up storing a wholeload of whitespace

Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized) and make sure the client validation catches when the data is going to be too large and send a useful error.
While the varchar isn't actually going to reserve space in the database for the unused space, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.

SQL Server varchar(50) and varchar(128) performance difference [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
is there an advantage to varchar(500) over varchar(8000)?
I am currently working on a table which has lots of columns with varchar(50). The data, we now have to insert in some columns is above 50 characters so we have to change the column size from 50 to 128 for those columns, since we have a lot of columns its a waste of time to change individual columns.
So I proposed to my team, why dont we change all the columns to varchar(128). Some of the team mates argued that this will cause a performance hit during select and join operations.
Now I am not an expert on databases but I dont think moving from varchar 50 to varchar 128 will cause any significant performance hit.
P.S - We dont have any name, surname, address kind of data in those columns.

varchar(50) and varchar(128) will behave pretty much identical from every point of view. The storage size is identical for values under 50 characters. They can be joined interchangeably (varchar(50) joined with varchar(128)) w/o type convertion issues (ie. an index on varchar(50) can seek a column varchar(128) in a join) and same applies to WHERE predicates. Prior to SQL Server 2012 ncreasing the size of a varchar column is a very fast metadata-only operation, after SQL Server 2012 this operation may be a slow size-of-data-update-each-record operation under certain conditions, similar to those descirbed in Adding a nullable column can update the entire table.
Some issues can arrise from any column length change:
application issues from handling unexpected size values. Native ones may run into buffer size issues if improperly codded (ie. larger size can cause buffer overflow). Managed apps are unlikely to have serious issues, but minor issues like values not fitting on column widths on screen or on reports may occur.
T-SQL errors from truncating values on insert or update
T-SQL silent truncation occuring and resulting in incorrect values (Eg. #variables declared as varchar(50) in stored proc)
Limits like max row size or max index size may be reached. Eg. you have today a composite index on 8 columns of type varchar(50), extending to varchar(128) will exceed the max index size of 900 and trigger warnings.
Martin's warning about memory grants incresing is a very valid concern. I would just buy more RAM if that would indeed turn out to be an issue.

Please check here
Performance Comparison-of-varchar(max) vs varchar(n)

How many are such columns? The best practice rule says you need to plan each column carefully and define size accordingly. You should identify columns suitable for varchar(128) and not just increase size of all columns blindly.

I would suggest that change only the columns which needs to be changed.if you do not need more than 50 chars in any column do not change that.
Why are you intersted in making the length same for all columns?I doubt all columns has same length requirements.

Does varchar result in performance hit due to data fragmentation?

How are varchar columns handled internally by a database engine?
For a column defined as char(100), the DBMS allocates 100 contiguous bytes on the disk. However, for a column defined as varchar(100), that presumably isn't the case, since the whole point of varchar is to not allocate any more space than required to store the actual data value stored in the column. So, when a user updates a database row containing an empty varchar(100) column to a value consisting of 80 characters for instance, where does the space for that 80 characters get allocated from?
It seems that varchar columns must result in a fair amount of fragmentation of the actual database rows, at least in scenarios where column values are initially inserted as blank or NULL, and then updated later with actual values. Does this fragmentation result in degraded performance on database queries, as opposed to using char type values, where the space for the columns stored in the rows is allocated contiguously? Obviously using varchar results in less disk space than using char, but is there a performance hit when optimizing for query performance, especially for columns whose values are frequently updated after the initial insert?

You make a lot of assumptions in your question that aren't necessarily true.
The type of the a column in any DBMS tells you nothing at all about the nature of the storage of that data unless the documentation clearly tells you how the data is stored. IF that's not stated, you don't know how it is stored and the DBMS is free to change the storage mechanism from release to release.
In fact some databases store CHAR fields internally as VARCHAR, while others make a decision about how to the store the column based on the declared size of the column. Some database store VARCHAR with the other columns, some with BLOB data, and some implement other storage, Some databases always rewrite the entire row when a column is updated, others don't. Some pad VARCHARs to allow for limited future updating without relocating the storage.
The DBMS is responsible for figuring out how to store the data and return it to you in a speedy and consistent fashion. It always amazes me how many people to try out think the database, generally in advance of detecting any performance problem.

The data structures used inside a database engine is far more complex than you are giving it credit for! Yes, there are issues of fragmentation and issues where updating a varchar with a large value can cause a performance hit, however its difficult to explain /understand what the implications of those issues are without a fuller understanding of the datastructures involved.
For MS Sql server you might want to start with understanding pages - the fundamental unit of storage (see http://msdn.microsoft.com/en-us/library/ms190969.aspx)
In terms of the performance implications of fixes vs variable storage types on performance there are a number of points to consider:
Using variable length columns can improve performance as it allows more rows to fit on a single page, meaning fewer reads
Using variable length columns requires special offset values, and the maintenance of these values requires a slight overhead, however this extra overhead is generally neglible.
Another potential cost is the cost of increasing the size of a column when the page containing that row is nearly full
As you can see, the situation is rather complex - generally speaking however you can trust the database engine to be pretty good at dealing with variable data types and they should be the data type of choice when there may be a significant variance of the length of data held in a column.
At this point I'm also going to recommend the excellent book "Microsoft Sql Server 2008 Internals" for some more insight into how complex things like this really get!

The answer will depend on the specific DBMS. For Oracle, it is certainly possible to end up with fragmentation in the form of "chained rows", and that incurs a performance penalty. However, you can mitigate against that by pre-allocating some empty space in the table blocks to allow for some expansion due to updates. However, CHAR columns will typically make the table much bigger, which has its own impact on performance. CHAR also has other issues such as blank-padded comparisons which mean that, in Oracle, use of the CHAR datatype is almost never a good idea.

Your question is too general because different database engines will have different behavior. If you really need to know this, I suggest that you set up a benchmark to write a large number of records and time it. You would want enough records to take at least an hour to write.
As you suggested, it would be interesting to see what happens if you write insert all the records with an empty string ("") and then update them to have 100 characters that are reasonably random, not just 100 Xs.
If you try this with SQLITE and see no significant difference, then I think it unlikely that the larger database servers, with all the analysis and tuning that goes on, would be worse than SQLITE.

This is going to be completely database specific.
I do know that in Oracle, the database will reserve a certain percentage of each block for future updates (The PCTFREE parameter). For example, if PCTFREE is set to 25%, then a block will only be used for new data until it is 75% full. By doing that, room is left for rows to grow. If the row grows such that the 25% reserved space is completely used up, then you do end up with chained rows and a performance penalty. If you find that a table has a large number of chained rows, you can tune the PCTFREE for that table. If you have a table which will never have any updates at all, a PCTFREE of zero would make sense

In SQL Server varchar (except varchar(MAX)) is generally stored together with the rest of the row's data (on the same page if the row's data is < 8KB and on the same extent if it is < 64KB. Only the large data types such as TEXT, NTEXT, IMAGE, VARHCAR(MAX), NVARHCAR(MAX), XML and VARBINARY(MAX) are stored seperately.

NULL in MySQL (Performance & Storage)

What exactly does null do performance and storage (space) wise in MySQL?
For example:
TINYINT: 1 Byte
TINYINT w/NULL 1 byte + somehow stores NULL?

It depends on which storage engine you use.
In MyISAM format, each row header contains a bitfield with one bit for each column to encode NULL state. A column that is NULL still takes up space, so NULL's don't reduce storage. See https://dev.mysql.com/doc/internals/en/myisam-introduction.html
In InnoDB, each column has a "field start offset" in the row header, which is one or two bytes per column. The high bit in that field start offset is on if the column is NULL. In that case, the column doesn't need to be stored at all. So if you have a lot of NULL's your storage should be significantly reduced.
See https://dev.mysql.com/doc/internals/en/innodb-field-contents.html
EDIT:
The NULL bits are part of the row headers, you don't choose to add them.
The only way I can imagine NULLs improving performance is that in InnoDB, a page of data may fit more rows if the rows contain NULLs. So your InnoDB buffers may be more effective.
But I would be very surprised if this provides a significant performance advantage in practice. Worrying about the effect NULLs have on performance is in the realm of micro-optimization. You should focus your attention elsewhere, in areas that give greater bang for the buck. For example adding well-chosen indexes or increasing database cache allocation.

Bill's answer is good, but a little bit outdated. The use of one or two bytes for storing NULL applies only to InnoDB REDUNDANT row format. Since MySQL 5.0.3 InnoDB uses COMPACT row format which uses only one bit to store a NULL (of course one byte is the minimum), therefore:
Space Required for NULLs = CEILING(N/8) bytes where N is the number of NULL columns in a row.
0 NULLS = 0 bytes
1 - 8 NULLS = 1 byte
9 - 16 NULLS = 2 bytes
17 - 24 NULLS = 3 bytes
etc...
According to the official MySQL site about COMPACT vs REDUNDANT:
The compact row format decreases row storage space by about 20% at the cost of increasing CPU use for some operations. If your workload is a typical one that is limited by cache hit rates and disk speed, compact format is likely to be faster.
Advantage of using NULLS over Empty Strings or Zeros:
1 NULL requires 1 byte
1 Empty String requires 1 byte (assuming VARCHAR)
1 Zero requires 4 bytes (assuming INT)
You start to see the savings here:
8 NULLs require 1 byte
8 Empty Strings require 8 bytes
8 Zeros require 32 bytes
On the other hand, I suggest using NULLs over empty strings or zeros, because they're more organized, portable, and require less space. To improve performance and save space, focus on using the proper data types, indexes, and queries instead of weird tricks.
More on:
https://dev.mysql.com/doc/refman/5.7/en/innodb-physical-record.html

I would agree with Bill Karwin, although I would add these MySQL tips. Number 11 addresses this specifically:
First of all, ask yourself if there is any difference between having an empty string value vs. a NULL value (for INT fields: 0 vs. NULL). If there is no reason to have both, you do not need a NULL field. (Did you know that Oracle considers NULL and empty string as being the same?)
NULL columns require additional space and they can add complexity to your comparison statements. Just avoid them when you can. However, I understand some people might have very specific reasons to have NULL values, which is not always a bad thing.
On the other hand, I still utilize null on tables that don't have tons of rows, mostly because I like the logic of saying NOT NULL.
Update
Revisiting this later, I would add that I personally don't like to use 0 instead of NULL in the database, and I don't recommend it. This can easily lead to a lot of false positives in your application if you are not careful.

dev.mysql.com/doc/refman/5.0/en/is-null-optimization.html
MySQL can perform the same optimization on col_name IS NULL that it can use for col_name = constant_value. For example, MySQL can use indexes and ranges to search for NULL with IS NULL

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas