Nice simple question for the db guys,
It's not terribly important, but I'm curious.
Is there a numeric type that is smaller than a tinyint?
I'm storing 120 different values and nearly all of them are going to be 0-9.
A tinyint is the smallest I can find, and it holds values up to 255 (3 digits).
DB:
I'm using MS SQL Server 2008, version 10.00.1600.
Numerically tinyint is likely as small as you can go (it really depends on the database engine you are using). You could use char(1) and then rely on implicit conversion to query the values but that would be needless overhead to solve a problem that doesn't really need solving. Also, char(1) is still going to consume 1 byte, 8 bits, which ranges 0-255. I would consider this a micro-optimization and not worth the time/effort.
I know you are probably asking for academic purposes, but in terms of database storage, tinyint is small enough for almost all situations. If you are that concerned about space consumption, I would say you need to look at other options than a traditional RDBMS.
You can of course manually pack several of your elements into one field. Since almost every computing system is byte (8-bit) oriented, the smallest usefully available element is typically one byte: 8 bits, which can represent 0 to 255 (or -128 to 127), for example.
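To make the manual packing idea concrete, here is a minimal sketch in Python. It assumes each value fits in 4 bits (0-15, which covers the 0-9 case from the question) and stores two values per byte. The function names `pack_digits` and `unpack_digits` are made up for illustration.

```python
def pack_digits(digits):
    """Pack a sequence of values in the range 0-15 two per byte (4 bits each)."""
    if any(not 0 <= d <= 15 for d in digits):
        raise ValueError("each value must fit in 4 bits (0-15)")
    # Pad with a trailing zero when the count is odd so values pair up evenly.
    padded = list(digits) + [0] * (len(digits) % 2)
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_digits(packed, count):
    """Recover the first `count` values from the packed bytes."""
    values = []
    for byte in packed:
        values.append(byte >> 4)    # high nibble
        values.append(byte & 0x0F)  # low nibble
    return values[:count]
```

With this scheme the 120 values from the question would fit in 60 bytes instead of 120. The price is that the individual values are no longer separate columns, so every read and write has to go through the pack/unpack step.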
Related
I've got a table which is set to keep track of various items. Among other properties, items can be either A, B, or C, each mutually exclusive of the rest. Is it best practice to store this information as a character, or as 3 bit columns (isA, isB, isC), or some other method? I could understand using the character if I might need more data types in the future, but it also makes sense to me that using bit datatypes would consume less storage. Or am I overanalyzing this and will the difference be so minuscule as to not even matter?
Or am I overanalyzing this and will the difference be so minuscule as to not even matter?
A little bit, yes.
But you must understand that there's a crucial difference between your design proposals: a char column enforces mutual exclusion on its own. IsX fields (alone) do not. To explain: with IsA and IsB columns you can potentially have both set to true in the same record, unless you use another mechanism to prevent that (trigger, check constraint, etc.).
Additionally, having a new column every time a new value is possible is not good DB design.
Just use Char.
Space-wise, you will be using an extra 625 KB per million rows (assuming 5 bits saved per row, which is a best-case scenario savings-wise).
That isn't very much.
To put it into perspective, that's 625 MB per BILLION rows. When you get to tables of that size you don't care about any units that don't start with giga, tera, or peta.
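The numbers above are easy to check. A small Python sketch, taking the 5 bits saved per row as the stated best-case assumption and using decimal units (1 KB = 1,000 bytes):

```python
BITS_SAVED_PER_ROW = 5  # best-case assumption from the answer above


def bytes_saved(rows, bits_per_row=BITS_SAVED_PER_ROW):
    """Total bytes saved across `rows` rows at `bits_per_row` bits each."""
    return rows * bits_per_row / 8


print(bytes_saved(1_000_000))      # 625000.0 bytes, i.e. 625 KB
print(bytes_saved(1_000_000_000))  # 625000000.0 bytes, i.e. 625 MB
```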
Internally, SQL Server stores them all as a byte regardless (up to 8 bit fields).
By the time the space matters, any architecture changes (from using bit fields to something more flexible) will be extremely painful.
I'd use a single char, byte, enum, whatever. If the states are mutually exclusive, then that isn't the best use for flags.
Come to think of it, a really tight, but kind of crazy, way to pull your scenario off would be to store the value in a nullable bit.
"An integer data type that can take a value of 1, 0, or NULL."
but I don't quite see how they pull that off though since
"The SQL Server Database Engine optimizes storage of bit columns. If there are 8 or less bit columns in a table, the columns are stored as 1 byte."
Both from http://msdn.microsoft.com/en-us/library/ms177603.aspx
If you need to index on the three values I would go for a tinyint instead of three bit fields.
I'd use a tinyint, basically a one-byte number from 0 to 255. As you expand your possible values, you end up using crazy letters that don't mean anything. So, I just start out with numbers. Keeping the three bits mutually exclusive isn't worth the hassle; they'll take a byte of storage anyway.
I am writing a new program and it will require a database (SQL Server 2008). Everything I am running now for the system is 64-bit, which brings me to this question. For all of the Id columns in various tables, should I make them all INT or BIGINT? I doubt the system will ever surpass the INT range but it is a possibility within some of the larger financial tables I suppose. It seems like INT is the standard though...
OK, let's do a quick math recap:
INT is 32-bit and gives you basically 4 billion values - if you only count the values larger than zero, it's still 2 billion. Do you have this many employees? Customers? Products in stock? Orders in the lifetime of your company? REALLY?
BIGINT goes way way way beyond that. Do you REALLY need that?? REALLY?? If you're an astronomer, or into particle physics - maybe. An average Line of Business user? I strongly doubt it.
Imagine you have a table with - say - 10 million rows (orders for your company). Let's say, you have an Orders table, and that OrderID which you made a BIGINT is referenced by 5 other tables, and used in 5 non-clustered indices on your Orders table - not overdone, I think, right?
10 million rows, by 5 tables plus 5 non-clustered indices, that's 100 million instances where you are using 8 bytes each instead of 4 bytes - 400 million bytes = 400 MB. A total waste... you'll need more data and index pages, your SQL Server will have to read more pages from disk and cache more pages.... that's not beneficial for your performance - plain and simple.
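The 400 MB figure can be reproduced with a few lines of Python. The row count and the number of referencing tables and indexes are the example figures from the answer, not measurements:

```python
ROWS = 10_000_000
REFERENCING_TABLES = 5    # tables that carry OrderID as a foreign key
NONCLUSTERED_INDEXES = 5  # indexes on Orders that also carry the key
EXTRA_BYTES_PER_VALUE = 8 - 4  # BIGINT (8 bytes) minus INT (4 bytes)


def wasted_bytes(rows, copies, extra_per_value=EXTRA_BYTES_PER_VALUE):
    """Extra bytes spent when every stored copy of the key is 4 bytes wider."""
    return rows * copies * extra_per_value


total = wasted_bytes(ROWS, REFERENCING_TABLES + NONCLUSTERED_INDEXES)
print(f"{total:,} bytes")  # 400,000,000 bytes
```

This only counts the raw key bytes; it leaves out row and page overhead.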
PLUS: what most programmers don't think about: yes, disk space is dirt cheap. But that wasted space is also relevant in your SQL Server's RAM and your database cache - and that space is not dirt cheap!
So to make a very long post short: use the smallest type of INT that really suits your need; if you have 10-20 distinct values to handle - use TINYINT. If you need an order table, I believe INT should be PLENTY ENOUGH - BIGINT is only a waste of space.
Plus: should any of your tables really ever get close to reaching 2 or 4 billion rows, you'll still have plenty of time to upgrade your table to a BIGINT ID, if that's really needed.......
Here is an article with some real answers on performance... I prefer to answer questions with hard numbers if possible... If you click the following link at least up to a million records you will find a negligible difference in disk usage....
http://www.sqlservercentral.com/articles/Performance+Tuning/2753/
Personally I do feel that using the appropriate ID size is important, but also consider the fact that you may have a table that has a ton of activity over time. It is not that you're storing a massive amount of data, but that the key value has grown due to the nature of being auto-incremented (deletes and inserts occurring over time).
Consider a file repository on a community site, or the id of the user comments on a community site multi-tenant application.
I understand that most developers are building systems that will never touch millions of records, but there are cases where a bigint is required. When you are designing a schema whose potential growth you do not know, I am still not convinced that you shouldn't try to anticipate the future and use a bigint if you feel the id value could eventually exceed the max value of int.
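One rough way to judge whether an auto-incremented INT key is at risk is to estimate how long it lasts at a given insert rate. The sketch below assumes a steady rate and that every insert consumes exactly one key value that is never reused, even if the row is later deleted. The function name is made up for illustration.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit INT
SECONDS_PER_DAY = 86_400


def days_until_int_exhausted(inserts_per_second, start=1):
    """Days until an auto-incremented INT key starting at `start` reaches INT_MAX."""
    remaining = INT_MAX - start + 1
    return remaining / inserts_per_second / SECONDS_PER_DAY


for rate in (1, 100, 1000):
    print(f"{rate:>5} inserts/s -> {days_until_int_exhausted(rate):,.0f} days")
```

At one insert per second the key lasts roughly 68 years. At a sustained 100 inserts per second it is gone in well under a year, which matches the kind of high-activity table described above.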
You should use the smallest data type that makes sense for the table in question. That includes using smallint or even tinyint if there are few enough rows.
You'll save space on both data and indexes and get better index performance. Using a bigint when all you need is a smallint is similar to using a varchar(4000) when all you need is a varchar(50).
Even if the machine's native word size is 64 bits, that only means that 64-bit CPU operations won't be any slower than 32-bit operations. Most of the time, they also won't be faster, they'll be the same. But most databases are not going to be CPU bound anyway, they'll be I/O bound and to a lesser extent memory-bound, so a 50%-90% smaller data size is a Very Good Thing when you need to perform an index scan over 200 million rows.
The alignment of 32 bit numbers with x86 architecture or 64 bit with x64 architecture is called data structure alignment.
This has no meaning for data in a database, because here it's things like disk space, data cache and table/index architecture that affect performance (as mentioned in other answers).
Remember, it's not the CPU accessing the data as such. It's the DB engine code (which may be aligned, but who cares?) that runs on the CPU and manipulates your data. When/if your data goes through the CPU it certainly won't be in the same on-disk structures.
Other people already gave compelling answers for 32-bit IDs.
For some applications 64-bit IDs do make more sense.
If you want to guarantee that IDs are unique across a cluster of databases, 63 bits for IDs can be very convenient. With 32 bits it's very difficult to distribute generation of IDs across servers in a cluster or across data centers, while with 64 bits you have enough room to play with that you can conveniently generate IDs across servers without locking and still guarantee uniqueness.
For example see Twitter Snowflake, and Instagram Engineering's blog post on "Sharding & IDs at Instagram". Both provide good reasons why 63 or 64 bits make more sense for their IDs than 32-bit counters.
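As a rough illustration of why the extra bits help, here is a Snowflake-style layout sketched in Python: a 41-bit millisecond timestamp, a 10-bit worker number and a 12-bit per-millisecond sequence, 63 bits in total. The split follows the layout commonly described for Twitter Snowflake. The epoch constant and the function names are assumptions made up for this example, and a real generator would also have to handle sequence overflow and clock drift.

```python
CUSTOM_EPOCH_MS = 1_600_000_000_000  # hypothetical epoch, chosen for this sketch

TIMESTAMP_BITS = 41
WORKER_BITS = 10
SEQUENCE_BITS = 12


def make_id(timestamp_ms, worker_id, sequence):
    """Combine timestamp, worker and sequence into one 63-bit integer."""
    elapsed = timestamp_ms - CUSTOM_EPOCH_MS
    if not 0 <= elapsed < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    return (elapsed << (WORKER_BITS + SEQUENCE_BITS)) | (worker_id << SEQUENCE_BITS) | sequence


def split_id(value):
    """Recover (timestamp_ms, worker_id, sequence) from an ID built by make_id."""
    sequence = value & ((1 << SEQUENCE_BITS) - 1)
    worker_id = (value >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    elapsed = value >> (WORKER_BITS + SEQUENCE_BITS)
    return elapsed + CUSTOM_EPOCH_MS, worker_id, sequence
```

Because each worker writes its own number into the ID, two servers can never produce the same value, so no lock or central counter is needed. The result always fits in a signed 64-bit BIGINT, and there is no room for such a layout in 32 bits.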
The first answer is the naive answer for anyone not working with TB size databases or tables with constant and high volume inserts. In any decent sized database you will run into problems with INT at some stage in its lifetime. Use BIGINT if you have to as it will save a lot of hassle further down the line. I have seen companies hit the INT issue after only a year of data and where reseeding was not an option it caused massive downtime. Also in long running systems (10 years+) where the system was not expected to still be used it has been hit even with moderate sized databases that purge old data. It is much better to use GUID in most cases where large amounts of data are expected but barring that use BIGINT if required.
You should judge each table individually as to what datatype would meet the needs for each one. If an INTEGER would meet the needs of a particular table, use that. If a SMALLINT would be sufficient, use that. Use the datatype that will last, without being excessive.
Should I choose the smallest data type possible? Or, if I am storing the value 1 for example, does the column's data type not matter, with the value occupying the same amount of memory either way?
I'm also asking because I will always have to convert the value and work with it in the application.
UPDATE
I think that varchar(1) and varchar(50) take the same amount of memory if the value is "a". I thought it was the same with int and tinyint, but according to the answers I understand it's not. Is that right?
Always choose the smallest data type possible. SQL can't guess what you want the maximum value to be, but it can optimize storage and performance once you tell it the data type.
To answer your update:
varchar does take up only as much space as you use, so you're right when you say that the character "a" will take up 1 byte (in latin encoding) no matter how large a varchar field you choose. That is not the case with fixed-size types such as int or tinyint.
However, you will likely be sacrificing efficiency for space if you make everything a varchar field. If everything is a fixed-size field then SQL can do a simple constant-time multiplication to find your value (like an array). If you have varchar fields in there, then the only way to find out where your data is stored is to go through all the previous fields (like a linked list).
If you're beginning SQL then I advise just to stay away from varchar fields unless you expect to have fields that sometimes have very small amounts of text and sometimes very large amounts of text (like blog posts). It takes experience to know when to use variable length fields to the best effect and even I don't know most of the time.
It's a performance consideration particular to the design of your system. In general, the more data you can fit into a page of Sql Server data, the better the performance.
One page in Sql Server is 8k. Using tiny ints instead of ints will enable you to put more data into a single page but you have to consider whether or not it's worth it. If you're going to be serving up thousands of hits a minute, then yes. If this is a hobby project or something that just a few dozen users will ever see, then it doesn't matter.
The advantage is there but might not be significant unless you have lots of rows and perform lots of operations. There'll be a performance improvement and smaller storage.
Traditionally every bit saved on the page size would mean a little bit of speed improvement: narrower rows mean more rows per page, which means less memory consumed and fewer IO requests, resulting in better speed. However, with SQL Server 2008 page compression things start to get fuzzy. The compression algorithm may compress 4-byte ints with values under 255 into even less than a byte.
Row compression algorithms will store a 4-byte int in a single byte for values up to 127 (int is signed), 2 bytes for values under 32768, and so on.
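The thresholds mentioned above come straight from signed integer ranges. A small Python sketch of the idea follows. It is a simplified model of "use the fewest whole bytes that can hold the value", not a description of SQL Server's actual on-disk format, which has extra metadata and special cases.

```python
def min_signed_bytes(value):
    """Smallest number of whole bytes that can hold `value` as a signed integer."""
    n = 1
    # A signed n-byte integer covers -2**(8n-1) .. 2**(8n-1) - 1.
    while not -(1 << (8 * n - 1)) <= value < (1 << (8 * n - 1)):
        n += 1
    return n


for v in (9, 127, 128, 32767, 32768):
    print(v, "->", min_signed_bytes(v), "byte(s)")
```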
However, given that the nice compression features are only available on Enterprise Edition servers, it makes sense to keep the habit of using the smallest possible data type.
I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: varchar(max) will be stored as a text datatype, so that has drawbacks. So let's say your field will reliably be under 8000 characters, like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number out of my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).
One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with after triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential size that matters.
Similarly, with memory-optimised tables it has been possible since SQL Server 2016 to use LOB columns, or combinations of column widths that could potentially exceed the in-row limit, but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance
Another case where over declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable length (non BLOB) columns is fixed for each row in an execution tree and is per the columns' declared maximum length which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source this analysis is best done up front and enforced there.
Back in the SQL Server engine itself a similar case is that when calculating the memory grant to allocate for SORT operations SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that this can lead to the sort operations spilling to tempdb.
In your case, if your varchar columns are declared as 8000 bytes but actually have contents much smaller than that, your query will be allocated memory that it doesn't require, which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
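USE tempdb;

-- Same data stored in a wide and a narrow varchar column
CREATE TABLE T (
    id       INT IDENTITY(1,1) PRIMARY KEY,
    number   INT,
    name8000 VARCHAR(8000),
    name500  VARCHAR(500)
);

INSERT INTO T (number, name8000, name500)
SELECT number, name, name /* <-- same contents in both cols */
FROM master..spt_values;

-- Sort that carries the narrow column
SELECT id, name500
FROM T
ORDER BY number;

-- Sort that carries the wide column; compare the memory grants of the two plans
SELECT id, name8000
FROM T
ORDER BY number;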
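To see why the declared width matters for the two sorts above, here is the x/2 rule of thumb worked through in Python. It only models the varchar column being sorted and ignores the other columns and any per-row overhead. The row count is a made-up figure for illustration.

```python
def estimated_sort_bytes(rows, declared_length):
    """Estimated sort input size if each varchar(x) value is assumed to use x/2 bytes."""
    return rows * declared_length // 2


ROWS = 2_500  # hypothetical row count

narrow = estimated_sort_bytes(ROWS, 500)   # 250 bytes assumed per row
wide = estimated_sort_bytes(ROWS, 8000)    # 4,000 bytes assumed per row
print(narrow, wide, wide // narrow)
```

The data in both columns is identical, yet under this assumption the estimate for the varchar(8000) column is 16 times larger.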
From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.
There are some disadvantages to large columns that are a bit less obvious and might catch you a little later:
All the columns you use in an INDEX must not exceed 900 bytes in total.
All the columns in an ORDER BY clause may not exceed 8060 bytes. This is a bit difficult to grasp since it only applies to some columns (see "SQL 2008 R2 Row size limit exceeded" for details).
If the total row size exceeds 8060 bytes, you get a "page spill" for that row. This might affect performance. (A page is an allocation unit in SQL Server and is fixed at 8000 bytes plus some overhead. Exceeding this will not be severe, but it's noticeable and you should try to avoid it if you easily can.)
Many other internal data structures, buffers and, last but not least, your own variables and table variables all need to mirror these sizes. With excessive sizes, excessive memory allocation can affect performance.
As a general rule, try to be conservative with the column width. If it becomes a problem, you can easily expand it to fit the needs. If you notice memory issues later, shrinking a wide column later may become impossible without losing data and you won't know where to begin.
In your example of the business names, think about where you get to display them. Is there really space for 500 characters?? If not, there is little point in storing them as such. http://en.wikipedia.org/wiki/List_of_companies_of_the_United_States lists some company names and the max is about 50 characters. So I'd use 100 for the column max. Maybe more like 80.
Apart from best practices (BBlake's answer)
You get warnings about maximum row size (8060) bytes and index width (900 bytes) with DDL
DML will die if you exceed these limits
ANSI PADDING ON is the default, so you could end up storing a whole load of whitespace
Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized) and make sure the client validation catches when the data is going to be too large and send a useful error.
While the varchar isn't actually going to reserve space in the database for the unused space, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.
On my current project, I came across our master DB script. Taking a closer look at it, I noticed that all of our original primary keys have a data type of numeric(38,0).
We are currently running SQL Server 2005 as our primary DB platform.
For a little context, we support both Oracle and SQL Server as our back-end. In Oracle, our primary keys have a data type of number(38,0).
Does anybody know of possible side effects and the performance impact of such an implementation? I have always advocated and implemented int or bigint as primary keys and would love to know if numeric(38,0) is a better alternative.
Well, you are spending more data to store numbers that you will never really reach.
bigint goes up to 9,223,372,036,854,775,807 in 8 bytes
int goes up to 2,147,483,647 in 4 bytes
A NUMERIC(38,0) is going to take, if I am doing the math right, 17 bytes.
Not a huge difference, but: smaller datatypes = more rows in memory (or fewer pages for the same # of rows) = fewer disk I/O to do lookups (either indexed or data page seeks). Goes the same for replication, log pages, etc.
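The 17-byte figure follows from the storage sizes SQL Server documents for decimal/numeric, which depend only on the declared precision. A small lookup sketched in Python (the function name is made up for illustration):

```python
def numeric_storage_bytes(precision):
    """Storage bytes for a SQL Server decimal/numeric of the given precision."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if precision <= 9:
        return 5
    if precision <= 19:
        return 9
    if precision <= 28:
        return 13
    return 17


print(numeric_storage_bytes(38))  # 17, versus 4 for int and 8 for bigint
```

A NUMERIC(19,0), which is wide enough for every bigint value, would still take 9 bytes compared with bigint's 8.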
For SQL Server: INT is a native binary integer type and so is easier for the CPU to compare, so you get a slight performance increase by using INT vs. NUMERIC (which is a packed decimal format). (Note that in Oracle, if the current version matches the older versions I grew up on, ALL datatypes are packed, so an INT inside is pretty much the same thing as a NUMERIC(x,0) and there's no performance difference.)
So, in the grand scheme of things -- if you have lots of disk, RAM, and spare I/O, use whatever datatype you want. If you want to get a little more performance, be a little more conservative.
Otherwise at this point, I'd leave it as it is. No need to change things.
Barring the storage considerations and some initial confusion from future DBAs, I don't see any reason why NUMERIC(38,0) would be a bad idea. You're allowing for up to 10^38 - 1 records in your table, which you will certainly never reach. My quick digging into this didn't turn up any glaring reason not to use it. I suspect that your only potential issue will be the storage space consumed by that, but seeing as how storage space is so cheap, that shouldn't be an issue.
I've seen this a fair number of times in Oracle databases since it's a pretty big default value that you don't need to think about when you're creating a table, similar to using INT or BIGINT by default in SQL Server.
This is overly large because you are never going to have that many rows. The larger size will take more storage space. This is not a big deal in itself but will also mean more disk reads to retrieve data from a table or index. It will mean fewer rows will fit into memory on the database server.
I don't think it's broken enough to be bothered fixing.
You'd be better off using a GUID. Really. The normal reason not to use one is that an integer performs better. But a GUID is smaller than numeric(38), and has the added benefit of making it a little easier to do things like let disconnected users create and sync new records.