NULL in MySQL (Performance & Storage)

NULL in MySQL (Performance & Storage) - sql

What exactly does null do performance and storage (space) wise in MySQL?
For example:
TINYINT: 1 Byte
TINYINT w/NULL 1 byte + somehow stores NULL?

It depends on which storage engine you use.
In MyISAM format, each row header contains a bitfield with one bit for each column to encode NULL state. A column that is NULL still takes up space, so NULL's don't reduce storage. See https://dev.mysql.com/doc/internals/en/myisam-introduction.html
In InnoDB, each column has a "field start offset" in the row header, which is one or two bytes per column. The high bit in that field start offset is on if the column is NULL. In that case, the column doesn't need to be stored at all. So if you have a lot of NULL's your storage should be significantly reduced.
See https://dev.mysql.com/doc/internals/en/innodb-field-contents.html
EDIT:
The NULL bits are part of the row headers, you don't choose to add them.
The only way I can imagine NULLs improving performance is that in InnoDB, a page of data may fit more rows if the rows contain NULLs. So your InnoDB buffers may be more effective.
But I would be very surprised if this provides a significant performance advantage in practice. Worrying about the effect NULLs have on performance is in the realm of micro-optimization. You should focus your attention elsewhere, in areas that give greater bang for the buck. For example adding well-chosen indexes or increasing database cache allocation.

Bill's answer is good, but a little bit outdated. The use of one or two bytes for storing NULL applies only to InnoDB REDUNDANT row format. Since MySQL 5.0.3 InnoDB uses COMPACT row format which uses only one bit to store a NULL (of course one byte is the minimum), therefore:
Space Required for NULLs = CEILING(N/8) bytes where N is the number of NULL columns in a row.
0 NULLS = 0 bytes
1 - 8 NULLS = 1 byte
9 - 16 NULLS = 2 bytes
17 - 24 NULLS = 3 bytes
etc...
According to the official MySQL site about COMPACT vs REDUNDANT:
The compact row format decreases row storage space by about 20% at the cost of increasing CPU use for some operations. If your workload is a typical one that is limited by cache hit rates and disk speed, compact format is likely to be faster.
Advantage of using NULLS over Empty Strings or Zeros:
1 NULL requires 1 byte
1 Empty String requires 1 byte (assuming VARCHAR)
1 Zero requires 4 bytes (assuming INT)
You start to see the savings here:
8 NULLs require 1 byte
8 Empty Strings require 8 bytes
8 Zeros require 32 bytes
On the other hand, I suggest using NULLs over empty strings or zeros, because they're more organized, portable, and require less space. To improve performance and save space, focus on using the proper data types, indexes, and queries instead of weird tricks.
More on:
https://dev.mysql.com/doc/refman/5.7/en/innodb-physical-record.html

I would agree with Bill Karwin, although I would add these MySQL tips. Number 11 addresses this specifically:
First of all, ask yourself if there is any difference between having an empty string value vs. a NULL value (for INT fields: 0 vs. NULL). If there is no reason to have both, you do not need a NULL field. (Did you know that Oracle considers NULL and empty string as being the same?)
NULL columns require additional space and they can add complexity to your comparison statements. Just avoid them when you can. However, I understand some people might have very specific reasons to have NULL values, which is not always a bad thing.
On the other hand, I still utilize null on tables that don't have tons of rows, mostly because I like the logic of saying NOT NULL.
Update
Revisiting this later, I would add that I personally don't like to use 0 instead of NULL in the database, and I don't recommend it. This can easily lead to a lot of false positives in your application if you are not careful.

dev.mysql.com/doc/refman/5.0/en/is-null-optimization.html
MySQL can perform the same optimization on col_name IS NULL that it can use for col_name = constant_value. For example, MySQL can use indexes and ranges to search for NULL with IS NULL

Related

Do columns with the value NULL impact the performance of Microsoft SQL Server?

I have a datatable with over 200 columns, however, in over half of those columns, the majority of my datarows has the value 'NULL'.
Do those NULL-values decrease the performance of my SQL Server or are fields with NULL-values irrelevant for all actions on the datatable?

Performance of the table is basically a function of I/O. The way that SQL Server lays out rows on data pages means that NULL values might or might not take up space -- depending on the underlying data type. SQL Server data pages contain a list of nullability bits for each column (even NOT NULL columns) to keep the NULL information.
Variable length strings simply use the NULL bits, so they occupy no additional space in each row. Other data types do occupy space, even for NULL values (this includes fixed-length strings, I believe).
What impact does this have on performance? If you have 200 NULL integer fields, that is 800 bytes on the data page. That limits the number of records stored on a given page to no more than 10 records. So, if you want to read 100 records, the query has to read (at least) 10 data pages. If the table did not have these columns, then it might be able to read only one data page.
Whether or not this is important for a given query or set of queries depends on the queries. But yes, columns that are NULL could have an impact on performance, particularly on the I/O side of queries.

Aside from taking up space, and what little impact that may have on performance, they don't have any other effect.

SQL Query Performance with an nvarchar(500) where the MAX(LEN(column)) < 30 [duplicate]

I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: Varchar(max) will be stored as a text datatype, so that has drawbacks. So lets say your field will reliably be under 8000 characters. Like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).

One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with after triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential
size that matters.
Similarly if using memory optimised tables since 2016 it has been possible to use LOB columns or combinations of column widths that could potentially exceed the inrow limit but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance
Another case where over declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable length (non BLOB) columns is fixed for each row in an execution tree and is per the columns' declared maximum length which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source this analysis is best done up front and enforced there.
Back in the SQL Server engine itself a similar case is that when calculating the memory grant to allocate for SORT operations SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that this can lead to the sort operations spilling to tempdb.
In your case if your varchar columns are declared as 8000 bytes but actually have contents much less than that your query will be allocated memory that it doesn't require which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
use tempdb;
CREATE TABLE T(
id INT IDENTITY(1,1) PRIMARY KEY,
number int,
name8000 VARCHAR(8000),
name500 VARCHAR(500))
INSERT INTO T
(number,name8000,name500)
SELECT number, name, name /*<--Same contents in both cols*/
FROM master..spt_values
SELECT id,name500
FROM T
ORDER BY number
SELECT id,name8000
FROM T
ORDER BY number

From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.

There are some disadvantages to large columns that are a bit less obvious and might catch you a little later:
All columns you use in an INDEX - must not exceed 900 bytes
All the columns in an ORDER BY clause may not exceed 8060 bytes. This is a bit difficult to grasp since this only applies to some columns. See SQL 2008 R2 Row size limit exceeded for details)
If the total row size exceeds 8060 bytes, you get a "page spill" for that row. This might affect performance (A page is an allocation unit in SQLServer and is fixed at 8000 bytes+some overhead. Exceeding this will not be severe, but it's noticable and you should try to avoid it if you easily can)
Many other internal datastructures, buffers and last-not-least your own varaibles and table-variables all need to mirror these sizes. With excessive sizes, excessive memory allocation can affect performance
As a general rule, try to be conservative with the column width. If it becomes a problem, you can easily expand it to fit the needs. If you notice memory issues later, shrinking a wide column later may become impossible without losing data and you won't know where to begin.
In your example of the business names, think about where you get to display them. Is there really space for 500 characters?? If not, there is little point in storing them as such. http://en.wikipedia.org/wiki/List_of_companies_of_the_United_States lists some company names and the max is about 50 characters. So I'd use 100 for the column max. Maybe more like 80.

Apart from best practices (BBlake's answer)
You get warnings about maximum row size (8060) bytes and index width (900 bytes) with DDL
DML will die if you exceed these limits
ANSI PADDING ON is the default so you could end up storing a wholeload of whitespace

Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized) and make sure the client validation catches when the data is going to be too large and send a useful error.
While the varchar isn't actually going to reserve space in the database for the unused space, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.

Effect of NULL values on storage in SQL Server?

If you have a table with 20 rows that contains 12 null columns and 8 columns with values, what is the implications for storage and memory usage?
Is null unique or is it stored in memory at the same location each time and just referenced? Do a ton of nulls take up a ton of space? Does a table full of nulls take up the same amount of space as a table the same size full of int values?
This is for Sql server.

This depends on database engine as well as column type.
At least SQLite stores each null column as a "null type" which takes up NO additional space (each record is serialized to a single blob for storage so there is no space reserved for a non-null value in this case). With optimizations like this a NULL value has very little overhead to store. (SQLite also has optimizations for the values 0 and 1 -- the designers of databases aren't playing about!) See 2.1 Record Format for the details.
Now, things can get much more complex, especially with updating and potential index fragmentation. For instance, in SQL Server space may be reserved for the column data, depending upon the type. For instance, a int null will still reserve space for the integer (as well as have an "is null" flag somewhere), however varchar(100) null doesn't seem to reserve the space (this last bit is from memory, so be warned!).
Happy coding.

Starting with SQL Server 2008, you can define a column as SPARSE when you have a "ton of nulls". This will save some space but it requires a portion of the values of a column to be null . Exactly how much depends on the type.
See the Estimated Space Savings by Data Type tables in the article Using Sparse Columns which will tell you what percentage of the values need to be null for net saving of 40%
For example according to the tables 98% of values in a bit field must be null in order to get a savings of 40% while only 43% of a uniqueidentifier column will net you the same percentage.

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit" etc. I can directly store the text in the database. Instead if use numbers such as 0 = Win, 1 = Lose etc would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause

At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Most database systems however have an enum type which is meant for cases like yours - in the query you can compare the field value against a fixed set of literals while it is internally stored as an integer.

There might be significant performance gains if the column is used in an index.

It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.

Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.

Clearly if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster 1, 2, 1000. It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will roughly take 4 bytes for the int and then another 3-> 24 bytes for the text in your example (depending on if the column is nullable or is unicode)
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, less writes happen when you load store data, and so on.
Also, comparing strings at the best case is as fast as comparing bytes and worst case much slower.
There is a second huge issue with storing text where you intended to have a enum. What happens when people start storing Incompete as opposed to Incomplete?

having a skinner column means that you can fit more rows per page.
it is a HUGE difference between a varchar(20) and an integer.

is there an advantage to varchar(500) over varchar(8000)?

I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: Varchar(max) will be stored as a text datatype, so that has drawbacks. So lets say your field will reliably be under 8000 characters. Like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).

One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with after triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential
size that matters.
Similarly if using memory optimised tables since 2016 it has been possible to use LOB columns or combinations of column widths that could potentially exceed the inrow limit but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance
Another case where over declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable length (non BLOB) columns is fixed for each row in an execution tree and is per the columns' declared maximum length which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source this analysis is best done up front and enforced there.
Back in the SQL Server engine itself a similar case is that when calculating the memory grant to allocate for SORT operations SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that this can lead to the sort operations spilling to tempdb.
In your case if your varchar columns are declared as 8000 bytes but actually have contents much less than that your query will be allocated memory that it doesn't require which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
use tempdb;
CREATE TABLE T(
id INT IDENTITY(1,1) PRIMARY KEY,
number int,
name8000 VARCHAR(8000),
name500 VARCHAR(500))
INSERT INTO T
(number,name8000,name500)
SELECT number, name, name /*<--Same contents in both cols*/
FROM master..spt_values
SELECT id,name500
FROM T
ORDER BY number
SELECT id,name8000
FROM T
ORDER BY number

From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.

Apart from best practices (BBlake's answer)
You get warnings about maximum row size (8060) bytes and index width (900 bytes) with DDL
DML will die if you exceed these limits
ANSI PADDING ON is the default so you could end up storing a wholeload of whitespace

Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized) and make sure the client validation catches when the data is going to be too large and send a useful error.
While the varchar isn't actually going to reserve space in the database for the unused space, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas