On my current project, I came across our master DB script. Taking a closer look at it, I noticed that all of our original primary keys have a data type of numeric(38,0).
We are currently running SQL Server 2005 as our primary DB platform.
For a little context, we support both Oracle and SQL Server as our back-end. In Oracle, our primary keys have a data type of number(38,0).
Does anybody know of possible side effects and performance impact of such an implementation? I have always advocated and implemented int or bigint as primary keys and would love to know if numeric(38,0) is a better alternative.
Well, you are spending more space to store numbers that you will never really reach.
bigint goes up to 9,223,372,036,854,775,807 in 8 bytes.
int goes up to 2,147,483,647 in 4 bytes.
A NUMERIC(38,0) is going to take, if I am doing the math right, 17 bytes.
Not a huge difference, but: smaller datatypes = more rows in memory (or fewer pages for the same # of rows) = less disk I/O to do lookups (either indexed or data page seeks). The same goes for replication, log pages, etc.
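To put rough numbers on the "more rows per page" point, here is a small Python sketch. The byte sizes are the documented SQL Server storage sizes for INT, BIGINT and NUMERIC by precision; the 8,096 usable bytes per 8 KB page and the `keys_per_page` helper are simplifying assumptions of mine that ignore per-row overhead, so treat the results as upper bounds only.

```python
# Fixed storage sizes on SQL Server, in bytes.
INT_BYTES = 4
BIGINT_BYTES = 8


def numeric_storage_bytes(precision: int) -> int:
    """Storage size of a SQL Server NUMERIC/DECIMAL for a given precision."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if precision <= 9:
        return 5
    if precision <= 19:
        return 9
    if precision <= 28:
        return 13
    return 17


def keys_per_page(key_bytes: int, usable_page_bytes: int = 8096) -> int:
    """Upper bound on bare key values per page (assumption: no row overhead)."""
    return usable_page_bytes // key_bytes


if __name__ == "__main__":
    for name, size in [
        ("int", INT_BYTES),
        ("bigint", BIGINT_BYTES),
        ("numeric(38,0)", numeric_storage_bytes(38)),
    ]:
        print(f"{name}: {size} bytes, at most {keys_per_page(size)} keys per page")
```

Under these assumptions a page holds roughly four times as many int keys as numeric(38,0) keys, which is the whole argument in one ratio.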
For SQL Server: INT is a native binary integer type and so is cheaper for the CPU to compare, so you get a slight performance increase by using INT vs. NUMERIC (which is stored in a decimal format that takes more work to compare). (Note that in Oracle, if the current version matches the older versions I grew up on, ALL datatypes are packed, so an INT inside is pretty much the same thing as a NUMERIC(x,0) and there's no performance difference.)
So, in the grand scheme of things -- if you have lots of disk, RAM, and spare I/O, use whatever datatype you want. If you want to get a little more performance, be a little more conservative.
Otherwise at this point, I'd leave it as it is. No need to change things.
Barring the storage considerations and some initial confusion from future DBAs, I don't see any reason why NUMERIC(38,0) would be a bad idea. You're allowing for up to 9.99 x 10^38 records in your table, which you will certainly never reach. My quick digging into this didn't turn up any glaring reason not to use it. I suspect that your only potential issue will be the storage space consumed by that, but seeing as how storage space is so cheap, that shouldn't be an issue.
I've seen this a fair number of times in Oracle databases since it's a pretty big default value that you don't need to think about when you're creating a table, similar to using INT or BIGINT by default in SQL Server.
This is overly large because you are never going to have that many rows. The larger size will take more storage space. That is not a big deal in itself, but it also means more disk reads to retrieve data from a table or index, and fewer rows will fit into memory on the database server.
I don't think it's broken enough to be bothered fixing.
You'd be better off using a GUID. Really. The normal reason not to use one is that an integer performs better. But a GUID (16 bytes) is smaller than numeric(38), and has the added benefit of making it a little easier to do things like letting disconnected users create and sync new records.
I'm currently overseeing a database that has (sequential) GUIDs all over the place. This database will grow in size significantly in the short term. It wouldn't be too much work to convert the whole shebang over to use bigints. I'm wondering, is it worth it?
The clustered indexes are going to fall apart as it grows in size, SQL page sizes will increase, and I'm expecting all sorts of hell if I continue on the path of using sequential GUIDs: fragmented pages, horrendous indexing (especially in the event of a server reboot, which resets the sequential GUID creation).
Is there a world in which I could keep the GUIDs around for some reason and use bigint for indexing? All SQL statements could easily be converted to use a bigint column for SELECT clauses.
What's my best approach? Any reason to keep the GUIDs around? Or should I just convert everything into bigint and run from there?
@simon_j_dm: Sequential GUIDs are not as ordered as BIGINT. Sequential GUIDs give you 1000-item-long sequences of ordered GUIDs, but between those sequences you still see fragmentation.
BIGINT is the most ordered key type you can have.
Does this mean you should change? Not necessarily. BIGINTs are smaller, so you have less memory pressure, and they don't cause fragmentation like GUIDs do. But depending on the load, you could see latch contention while using BIGINT that you won't see with GUIDs, as they by nature spread the load over more pages.
You can reduce the fragmentation with GUIDs by lowering the fill factor. That, however, causes the database to bloat in size, because you don't fill up the data pages right away. And you will still need to deal with fragmentation at some point.
So my point is... you need to do what is right for your situation. There is no golden way to do this. Do it the Brent Ozar way:
Figure out what you want to change.
Test your change in a controlled environment.
Measure whether the change had the wanted effect.
If you are using sequential guid then they are ordered just as a bigint is. Changing to a bigint will not affect fragmentation at all.
Moving to a bigint would, however, reduce the storage space by 50% on those columns, which in turn would also reduce memory usage and improve general query performance, as memory grants can be smaller and tempdb usage lower.
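The 50% figure follows directly from the type widths: a GUID is 128 bits and a bigint is 64 bits. A quick Python check, using the standard `uuid` and `struct` modules as stand-ins for the database types (that mapping is my assumption, not something SQL Server exposes):

```python
import struct
import uuid

guid_bytes = len(uuid.uuid4().bytes)   # a GUID/UUID is 128 bits wide
bigint_bytes = struct.calcsize("<q")   # a 64-bit signed integer
saving = 1 - bigint_bytes / guid_bytes  # fraction of key storage saved

if __name__ == "__main__":
    print(f"GUID: {guid_bytes} bytes, bigint: {bigint_bytes} bytes, saving: {saving:.0%}")
```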
If it's not too much pain I would change it as smaller data types are always preferred.
Nice simple question for the db guys.
It's not terribly important, but I'm curious.
Is there a numeric type that is smaller than a tinyint?
I'm storing 120 different values and nearly all of them are going to be 0-9.
A tinyint is the smallest I can find, and it holds a value of up to 255 (3 digits).
DB:
I'm using MS SQL Server 2008, version 10.00.1600.
Numerically tinyint is likely as small as you can go (it really depends on the database engine you are using). You could use char(1) and then rely on implicit conversion to query the values but that would be needless overhead to solve a problem that doesn't really need solving. Also, char(1) is still going to consume 1 byte, 8 bits, which ranges 0-255. I would consider this a micro-optimization and not worth the time/effort.
I know you are probably asking for academic purposes, but in terms of database storage, tinyint is small enough for almost all situations. If you are that concerned about space consumption, I would say you need to look at other options than a traditional RDBMS.
You can of course manually pack several of your elements into one field. Since almost every computing system is byte (8-bit) oriented, the smallest usefully available element is typically just one byte, 8 bits, which can represent 0-255 (or -128 to 127), for example.
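As a rough illustration of that manual packing, here is a Python sketch that stores two small values per byte (4 bits each). The function names are mine, and it only works when every value fits in 0-15, so it would not cover the occasional larger value mentioned in the question.

```python
def pack_digits(digits):
    """Pack values in the range 0..15 into bytes, two values per byte."""
    values = list(digits)
    if any(not 0 <= d <= 15 for d in values):
        raise ValueError("every value must fit in 4 bits (0..15)")
    if len(values) % 2:
        values.append(0)  # pad the last byte with a zero nibble
    return bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))


def unpack_digits(packed, count):
    """Reverse pack_digits; count drops any padding nibble."""
    out = []
    for byte in packed:
        out.append(byte >> 4)      # high nibble
        out.append(byte & 0x0F)    # low nibble
    return out[:count]


if __name__ == "__main__":
    original = [7, 3, 9, 0, 1]
    packed = pack_digits(original)
    print(len(original), "values stored in", len(packed), "bytes")
    print(unpack_digits(packed, len(original)))
```

With this scheme 120 values take 60 bytes instead of 120, at the cost of unpacking on every read, which is why the answers above call it a micro-optimization.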
I am writing a new program and it will require a database (SQL Server 2008). Everything I am running now for the system is 64-bit, which brings me to this question. For all of the Id columns in various tables, should I make them all INT or BIGINT? I doubt the system will ever surpass the INT range but it is a possibility within some of the larger financial tables I suppose. It seems like INT is the standard though...
OK, let's do a quick math recap:
INT is 32-bit and gives you basically 4 billion values - if you only count the values larger than zero, it's still 2 billion. Do you have this many employees? Customers? Products in stock? Orders in the lifetime of your company? REALLY?
BIGINT goes way way way beyond that. Do you REALLY need that?? REALLY?? If you're an astronomer, or into particle physics - maybe. An average line-of-business user? I strongly doubt it.
Imagine you have a table with - say - 10 million rows (orders for your company). Let's say, you have an Orders table, and that OrderID which you made a BIGINT is referenced by 5 other tables, and used in 5 non-clustered indices on your Orders table - not overdone, I think, right?
10 million rows, by 5 tables plus 5 non-clustered indices, that's 100 million instances where you are using 8 bytes each instead of 4 bytes - 400 million bytes = 400 MB. A total waste... you'll need more data and index pages, your SQL Server will have to read more pages from disk and cache more pages.... that's not beneficial for your performance - plain and simple.
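The arithmetic above can be written out in a few lines of Python. The variable names are mine, and the sketch assumes each referencing table and each index holds one copy of the key per order row, as the answer does.

```python
rows = 10_000_000
referencing_tables = 5
nonclustered_indexes = 5
extra_bytes_per_key = 8 - 4  # BIGINT (8 bytes) instead of INT (4 bytes)

# One copy of the key per row in each referencing table and each index.
key_copies = rows * (referencing_tables + nonclustered_indexes)
extra_bytes = key_copies * extra_bytes_per_key
extra_mb = extra_bytes / 1_000_000

if __name__ == "__main__":
    print(f"{key_copies:,} key copies, {extra_bytes:,} extra bytes (~{extra_mb:.0f} MB)")
```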
PLUS: What most programmers don't think about: yes, disk space is dirt cheap. But that wasted space is also relevant in your SQL Server RAM and your database cache - and that space is not dirt cheap!
So to make a very long post short: use the smallest type of INT that really suits your need; if you have 10-20 distinct values to handle - use TINYINT. If you need an order table, I believe INT should be PLENTY ENOUGH - BIGINT is only a waste of space.
Plus: should any of your tables really ever get close to reaching 2 or 4 billion rows, you'll still have plenty of time to upgrade your table to a BIGINT ID, if that's really needed.
Here is an article with some real answers on performance. I prefer to answer questions with hard numbers if possible. If you click the following link, you will find that at least up to a million records there is a negligible difference in disk usage.
http://www.sqlservercentral.com/articles/Performance+Tuning/2753/
Personally I do feel that using the appropriate ID size is important, but also consider the fact that you may have a table that has a ton of activity over time. It is not that you're storing a massive amount of data, but that the key value has grown because it is auto-incremented (deletes and inserts occurring over time).
Consider a file repository on a community site, or the id of the user comments on a community site multi-tenant application.
I understand that most developers are building systems that will never touch millions of records, but it is important to note that there are cases where a bigint is required. When you are designing a schema whose potential growth you do not know, I am still not convinced that you shouldn't try to anticipate the future and consider using a bigint if you feel there is a chance of exceeding the max value of int as the id value grows.
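One way to sanity-check whether an auto-incremented int key could run out is to estimate how long it lasts at a given insert rate. This is a back-of-the-envelope Python sketch; the function name and the example rates are hypothetical, and it assumes a steady insert rate with no reseeding and no gaps.

```python
INT_MAX = 2**31 - 1      # largest value of a signed 32-bit int
BIGINT_MAX = 2**63 - 1   # largest value of a signed 64-bit bigint
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_until_exhausted(max_value, inserts_per_second, start=1):
    """Years until an identity column starting at `start` passes max_value."""
    remaining = max_value - start + 1
    return remaining / (inserts_per_second * SECONDS_PER_YEAR)


if __name__ == "__main__":
    for rate in (1, 100, 10_000):
        years = years_until_exhausted(INT_MAX, rate)
        print(f"{rate:>6} inserts/s: int lasts about {years:.2f} years")
```

At one insert per second an int key lasts decades; at a sustained 100 inserts per second it is gone in well under a year, which matches the experience of tables with heavy churn.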
You should use the smallest data type that makes sense for the table in question. That includes using smallint or even tinyint if there are few enough rows.
You'll save space on both data and indexes and get better index performance. Using a bigint when all you need is a smallint is similar to using a varchar(4000) when all you need is a varchar(50).
Even if the machine's native word size is 64 bits, that only means that 64-bit CPU operations won't be any slower than 32-bit operations. Most of the time, they also won't be faster, they'll be the same. But most databases are not going to be CPU bound anyway, they'll be I/O bound and to a lesser extent memory-bound, so a 50%-90% smaller data size is a Very Good Thing when you need to perform an index scan over 200 million rows.
The alignment of 32-bit numbers with x86 architecture, or 64-bit numbers with x64 architecture, is called data structure alignment.
This has no meaning for data in a database, because here it's things like disk space, data cache and table/index architecture that affect performance (as mentioned in other answers).
Remember, it's not the CPU accessing the data as such. It's the DB engine code (which may be aligned, but who cares?) that runs on the CPU and manipulates your data. When/if your data goes through the CPU it certainly won't be in the same on-disk structures.
Other people already gave compelling answers for 32-bit IDs.
For some applications 64-bit IDs do make more sense.
If you want to guarantee that IDs are unique across a cluster of databases, 63 bits for IDs can be very convenient. With 32 bits it's very difficult to distribute generation of IDs across servers in a cluster, or across data centers. With 64 bits you have enough room to play with that you can conveniently generate IDs across servers without locking and still guarantee uniqueness.
For example see Twitter Snowflake, and Instagram Engineering's blog post on "Sharding & IDs at Instagram". Both provide good reasons why 63 or 64 bits make more sense for their IDs than 32-bit counters.
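To show why the extra bits help, here is a Python sketch of a Snowflake-style ID. The 41/10/12 bit split is the layout commonly described for Snowflake; the function names and the idea of passing in the timestamp and sequence directly are my own simplifications, not either company's actual code.

```python
TIMESTAMP_BITS = 41   # milliseconds since a custom epoch
WORKER_BITS = 10      # which server generated the ID
SEQUENCE_BITS = 12    # counter within the same millisecond


def make_id(timestamp_ms, worker_id, sequence):
    """Combine the three parts into one 63-bit non-negative integer."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp_ms out of range")
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    return (
        (timestamp_ms << (WORKER_BITS + SEQUENCE_BITS))
        | (worker_id << SEQUENCE_BITS)
        | sequence
    )


def split_id(id_value):
    """Recover (timestamp_ms, worker_id, sequence) from an ID."""
    sequence = id_value & ((1 << SEQUENCE_BITS) - 1)
    worker_id = (id_value >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = id_value >> (WORKER_BITS + SEQUENCE_BITS)
    return timestamp_ms, worker_id, sequence


if __name__ == "__main__":
    example = make_id(timestamp_ms=123_456_789, worker_id=7, sequence=42)
    print(example, split_id(example))
```

Because each server has its own worker ID, two servers can never produce the same value, so no cross-server locking is needed. The three fields add up to 63 bits, so the result always fits in a signed bigint; there is no comparable room in 32 bits.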
The first answer is the naive answer for anyone not working with TB size databases or tables with constant and high volume inserts. In any decent sized database you will run into problems with INT at some stage in its lifetime. Use BIGINT if you have to as it will save a lot of hassle further down the line. I have seen companies hit the INT issue after only a year of data and where reseeding was not an option it caused massive downtime. Also in long running systems (10 years+) where the system was not expected to still be used it has been hit even with moderate sized databases that purge old data. It is much better to use GUID in most cases where large amounts of data are expected but barring that use BIGINT if required.
You should judge each table individually as to what datatype would meet the needs for each one. If an INTEGER would meet the needs of a particular table, use that. If a SMALLINT would be sufficient, use that. Use the datatype that will last, without being excessive.
Should I choose the smallest datatype possible? Or, if I am storing the value 1 for example, does the column datatype not matter, with the value occupying the same amount of memory either way?
I'm also asking because I will always have to convert it and work with it in the application.
UPDATE
I think that varchar(1) and varchar(50) take the same amount of memory if the value is "a". I thought it was the same with int and tinyint, but according to the answers I understand it's not. Is that right?
Always choose the smallest data type possible. SQL can't guess what you want the maximum value to be, but it can optimize storage and performance once you tell it the data type.
To answer your update:
varchar does take up only as much space as you use, so you're right when you say that the character "a" will take up 1 byte (in a Latin encoding) no matter how large a varchar field you choose. That is not the case with fixed-size types such as int and tinyint, which always take their full width.
However, you will likely be sacrificing efficiency for space if you make everything a varchar field. If everything is a fixed-size field, then SQL can do a simple constant-time multiplication to find your value (like an array). If you have varchar fields in there, then the only way to find out where your data is stored is to go through all the previous fields (like a linked list).
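The array-versus-linked-list analogy can be made concrete with a toy Python model. This is only an illustration of the analogy, not how SQL Server or any other engine actually lays out rows (real engines typically keep an offset table for variable-length columns); the 1-byte length prefix and all function names are assumptions of mine.

```python
def fixed_field_offset(field_index, field_width):
    """Fixed-width fields: the offset is a single multiplication."""
    return field_index * field_width


def encode_variable_row(fields):
    """Toy row format: each field is a 1-byte length followed by its data."""
    out = bytearray()
    for text in fields:
        data = text.encode("ascii")
        if len(data) > 255:
            raise ValueError("field too long for a 1-byte length prefix")
        out.append(len(data))
        out += data
    return bytes(out)


def variable_field_offset(row, field_index):
    """Variable-width fields: walk every earlier field to find the start."""
    pos = 0
    for _ in range(field_index):
        pos += 1 + row[pos]  # skip the length byte plus that many data bytes
    return pos + 1           # step over this field's own length byte


if __name__ == "__main__":
    row = encode_variable_row(["a", "hello", "xyz"])
    start = variable_field_offset(row, 2)
    print(fixed_field_offset(2, 4), start, row[start:start + row[start - 1]])
```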
If you're beginning SQL then I advise just to stay away from varchar fields unless you expect to have fields that sometimes have very small amounts of text and sometimes very large amounts of text (like blog posts). It takes experience to know when to use variable length fields to the best effect and even I don't know most of the time.
It's a performance consideration particular to the design of your system. In general, the more data you can fit into a page of Sql Server data, the better the performance.
One page in Sql Server is 8k. Using tiny ints instead of ints will enable you to put more data into a single page but you have to consider whether or not it's worth it. If you're going to be serving up thousands of hits a minute, then yes. If this is a hobby project or something that just a few dozen users will ever see, then it doesn't matter.
The advantage is there but might not be significant unless you have lots of rows and perform lots of operations. There'll be a performance improvement and smaller storage.
Traditionally every bit saved on the page size would mean a little bit of speed improvement: narrower rows mean more rows per page, which means less memory consumed and fewer IO requests, resulting in better speed. However, with SQL Server 2008 page compression things start to get fuzzy. The compression algorithm may compress 4-byte ints with values under 255 into even less than a byte.
Row compression algorithms will store a 4-byte int in a single byte for values up to 127 (int is signed), 2 bytes for values up to 32767, and so on.
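The thresholds quoted for row compression are just the ranges of signed integers of each width. A small Python sketch of that calculation follows; it computes the minimal two's-complement width only, and the real SQL Server row format has further details that this does not model.

```python
def min_signed_bytes(value):
    """Smallest number of bytes that can hold value as a signed integer."""
    for n in range(1, 9):
        limit = 1 << (8 * n - 1)   # e.g. 128 for one byte, 32768 for two
        if -limit <= value < limit:
            return n
    raise ValueError("value does not fit in 8 bytes")


if __name__ == "__main__":
    for v in (0, 127, 128, 32767, 32768, 2**31 - 1):
        print(f"{v}: {min_signed_bytes(v)} byte(s)")
```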
However, given that the nice compression features are only available on Enterprise Edition servers, it makes sense to keep the habit of using the smallest possible data type.
How are varchar columns handled internally by a database engine?
For a column defined as char(100), the DBMS allocates 100 contiguous bytes on the disk. However, for a column defined as varchar(100), that presumably isn't the case, since the whole point of varchar is to not allocate any more space than required to store the actual data value stored in the column. So, when a user updates a database row containing an empty varchar(100) column to a value consisting of 80 characters for instance, where does the space for that 80 characters get allocated from?
It seems that varchar columns must result in a fair amount of fragmentation of the actual database rows, at least in scenarios where column values are initially inserted as blank or NULL, and then updated later with actual values. Does this fragmentation result in degraded performance on database queries, as opposed to using char type values, where the space for the columns stored in the rows is allocated contiguously? Obviously using varchar results in less disk space than using char, but is there a performance hit when optimizing for query performance, especially for columns whose values are frequently updated after the initial insert?
You make a lot of assumptions in your question that aren't necessarily true.
The type of a column in any DBMS tells you nothing at all about the nature of the storage of that data unless the documentation clearly tells you how the data is stored. If that's not stated, you don't know how it is stored, and the DBMS is free to change the storage mechanism from release to release.
In fact, some databases store CHAR fields internally as VARCHAR, while others make a decision about how to store the column based on the declared size of the column. Some databases store VARCHAR with the other columns, some with BLOB data, and some implement other storage. Some databases always rewrite the entire row when a column is updated, others don't. Some pad VARCHARs to allow for limited future updating without relocating the storage.
The DBMS is responsible for figuring out how to store the data and return it to you in a speedy and consistent fashion. It always amazes me how many people try to outthink the database, generally in advance of detecting any performance problem.
The data structures used inside a database engine are far more complex than you are giving them credit for! Yes, there are issues of fragmentation, and issues where updating a varchar with a large value can cause a performance hit. However, it's difficult to explain or understand the implications of those issues without a fuller understanding of the data structures involved.
For MS Sql server you might want to start with understanding pages - the fundamental unit of storage (see http://msdn.microsoft.com/en-us/library/ms190969.aspx)
In terms of the performance implications of fixed vs variable storage types, there are a number of points to consider:
Using variable-length columns can improve performance, as it allows more rows to fit on a single page, meaning fewer reads.
Using variable-length columns requires special offset values, and the maintenance of these values requires a slight overhead; however, this extra overhead is generally negligible.
Another potential cost is the cost of increasing the size of a column when the page containing that row is nearly full.
As you can see, the situation is rather complex - generally speaking however you can trust the database engine to be pretty good at dealing with variable data types and they should be the data type of choice when there may be a significant variance of the length of data held in a column.
At this point I'm also going to recommend the excellent book "Microsoft Sql Server 2008 Internals" for some more insight into how complex things like this really get!
The answer will depend on the specific DBMS. For Oracle, it is certainly possible to end up with fragmentation in the form of "chained rows", and that incurs a performance penalty. However, you can mitigate against that by pre-allocating some empty space in the table blocks to allow for some expansion due to updates. However, CHAR columns will typically make the table much bigger, which has its own impact on performance. CHAR also has other issues such as blank-padded comparisons which mean that, in Oracle, use of the CHAR datatype is almost never a good idea.
Your question is too general because different database engines will have different behavior. If you really need to know this, I suggest that you set up a benchmark to write a large number of records and time it. You would want enough records to take at least an hour to write.
As you suggested, it would be interesting to see what happens if you insert all the records with an empty string ("") and then update them to have 100 characters that are reasonably random, not just 100 Xs.
If you try this with SQLITE and see no significant difference, then I think it unlikely that the larger database servers, with all the analysis and tuning that goes on, would be worse than SQLITE.
This is going to be completely database specific.
I do know that in Oracle, the database will reserve a certain percentage of each block for future updates (the PCTFREE parameter). For example, if PCTFREE is set to 25%, then a block will only be used for new data until it is 75% full. By doing that, room is left for rows to grow. If the row grows such that the 25% reserved space is completely used up, then you do end up with chained rows and a performance penalty. If you find that a table has a large number of chained rows, you can tune the PCTFREE for that table. If you have a table which will never have any updates at all, a PCTFREE of zero would make sense.
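The PCTFREE example is simple arithmetic, sketched here in Python. The 8192-byte block size is an assumption for illustration, the function name is mine, and the calculation ignores block header overhead, so the real threshold will be somewhat lower.

```python
def insert_limit_bytes(block_size, pctfree):
    """Bytes of a block usable for new rows before PCTFREE space is reserved."""
    if not 0 <= pctfree < 100:
        raise ValueError("pctfree must be between 0 and 99")
    return block_size * (100 - pctfree) // 100


if __name__ == "__main__":
    block_size = 8192  # assumed block size, in bytes
    for pctfree in (0, 10, 25):
        limit = insert_limit_bytes(block_size, pctfree)
        reserved = block_size - limit
        print(f"PCTFREE {pctfree}: inserts stop at {limit} bytes, {reserved} reserved for growth")
```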
In SQL Server, varchar (except varchar(MAX)) is generally stored together with the rest of the row's data (on the same page if the row's data is < 8KB and on the same extent if it is < 64KB). Only the large data types such as TEXT, NTEXT, IMAGE, VARCHAR(MAX), NVARCHAR(MAX), XML and VARBINARY(MAX) are stored separately.