I'm currently overseeing a database that has (sequential) GUID's all over the place. This database will grow in size significantly in the short term. It wouldn't be too much work to convert the whole shebang over to use bigint's. I'm wondering, is it worth it?
The clustered indexes are going to fall apart as it grows in size, SQL page sizes will increase, I'm expecting all sorts of hell if I continue on the path using sequential GUID's. Fragmented pages, horrendous indexing.. (especially in the event of a server reboot, which resets the sequential GUID creation)
Is there a world in which I could keep the GUID's around for some reason and use bigint for indexing? All SQL statements could easily be converted to use a bigint column for SELECT clauses.
What's my best approach? Any reason to keep the GUID's around? Or should I just convert everything into bigint and run from there?
#simon_j_dm: Sequential GUIDs are not as ordered as BIGINT. Sequential GUIDs give you runs of about 1000 ordered values, but between those runs you still see fragmentation.
BIGINT is the most ordered key type you can have.
Does this mean you should change? Not necessarily. BIGINTs are smaller, so you have less memory pressure, and they don't cause fragmentation the way GUIDs do. But depending on the load, you could see latch contention while using BIGINT that you won't see with GUIDs, as GUIDs by nature spread the load over more pages.
You can reduce the fragmentation with GUIDs by lowering the fill factor. That, however, bloats the database on disk, because the data pages are not filled up right away. And you will still need to deal with fragmentation at some point.
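To put a rough number on the fill-factor bloat, here is a back-of-the-envelope sketch. The row count and the rows-per-page figure are made-up assumptions for illustration, not measurements from any real table.

```python
def pages_needed(row_count, rows_per_full_page, fill_factor_pct):
    """Pages required when each page is only filled to fill_factor_pct percent."""
    # Rows actually placed on each page at the given fill factor (at least 1).
    rows_per_page = max(1, rows_per_full_page * fill_factor_pct // 100)
    # Ceiling division: a partly used last page still counts as a page.
    return -(-row_count // rows_per_page)

# ASSUMPTION: 1,000,000 rows and 100 rows per completely full page.
full = pages_needed(1_000_000, 100, 100)
padded = pages_needed(1_000_000, 100, 70)
print(full, padded)  # 10000 14286
```

So under these assumptions a 70% fill factor costs about 43% more pages up front, in exchange for room to absorb out-of-order inserts.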
So my point is... you need to do what is right for your situation. There is no golden way to do this. Do it the Brent Ozar way:
Figure out what you want to change.
Test your change in a controlled environment
Measure whether the change had the desired effect.
If you are using sequential guid then they are ordered just as a bigint is. Changing to a bigint will not affect fragmentation at all.
Moving to a bigint would, however, reduce the storage space by 50% on those columns, which in turn would also reduce memory usage and improve general query performance, as memory grants can be smaller and tempdb usage lower.
If it's not too much pain I would change it as smaller data types are always preferred.
I have a table that has 124,387,133 rows each row has 59 columns and of those 59, 18 of the columns are TinyInt data type and all row values are either 0 or 1. Some of the TinyInt columns are used in indexes and some are not.
My question will it make a difference on query performance and table size if I change the tinyint to a bit?
In case you don't know, a bit uses less space to store information than a TinyInt (1 bit against 8 bits). So you would save space by changing to bit, and in theory the performance should be better. Generally it is hard to notice such a performance improvement, but with the amount of data you have it might actually make a difference. I would test it on a backup copy.
Actually, it's good to use the right data type. Below are the benefits I could see from using the bit data type:
1. Buffer pool savings: pages are read into memory from storage, so smaller rows mean less memory has to be allocated.
2. Index key size will be smaller, so more rows can fit onto one page, and thereby there is less traversing.
You will also see storage space savings as an immediate benefit.
In theory yes, in practise the difference is going to be subtle, the 18 bit fields get byte packed and rounded up, so it changes to 3 bytes. Depending on nullability / any nullability change the storage cost again will change. Both types are held within the fixed width portion of the row. So you will drop from 18 bytes to 3 bytes for those fields - depending on the overall size of the row versus the page size you may squeeze an extra row onto the page. (The rows/page density is where the performance gain will show up primarily, if you are to gain)
This seems like a premature micro-optimization however, if you are suffering from bad performance, investigate that and gather the evidence supporting any changes. Making type changes on existing systems should be carefully considered, if you cause the need for a code change, which prompts a full regression test etc, the cost of the change rises dramatically - for very little end result. (The production change on a large dataset will also not be rapid, so you can factor in some downtime in the cost as well to make this change)
You would be saving about 15 bytes per record, for a total of 1.8 Gbytes.
You have 41 remaining fields. If I assume that those are 4-byte integers, then your current overall size is about 22 Gbytes. The overall savings is less than 10% -- and could be much less if the other fields are larger.
This does mean that a full table scan would be about 10% faster, so that gives you a sense of the performance gain and magnitude.
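The arithmetic behind those figures can be reproduced in a few lines. The row and column counts come from the question; the 4-byte width for the other 41 columns is the same guess made above, not something known about the real table, and row overhead is ignored.

```python
ROWS = 124_387_133          # from the question
FLAG_COLUMNS = 18           # tinyint columns holding only 0 or 1
OTHER_COLUMNS = 41          # the remaining columns
ASSUMED_OTHER_WIDTH = 4     # ASSUMPTION: every other column is a 4-byte int

def packed_bit_bytes(bit_columns):
    """Bit columns are packed eight to a byte, rounded up."""
    return -(-bit_columns // 8)

tinyint_bytes = FLAG_COLUMNS * 1                 # one byte per tinyint
bit_bytes = packed_bit_bytes(FLAG_COLUMNS)       # 18 bits fit in 3 bytes
saved_per_row = tinyint_bytes - bit_bytes        # 15 bytes
total_saved = saved_per_row * ROWS

row_bytes = OTHER_COLUMNS * ASSUMED_OTHER_WIDTH + tinyint_bytes
table_bytes = row_bytes * ROWS

print(f"saved per row: {saved_per_row} bytes")
print(f"total saved:   {total_saved / 1e9:.2f} GB")
print(f"table size:    {table_bytes / 1e9:.2f} GB")
print(f"saving:        {saved_per_row / row_bytes:.1%}")
```

If the other columns are wider than 4 bytes, the percentage saving shrinks accordingly.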
I believe that bit fields require an extra operation or two to mask the bits and read -- trivial overhead that is measured in nanoseconds these days -- but something to keep in mind.
The benefits of a smaller page size are that more records fit on a single page, so the table occupies less space in memory (assuming all is read in at once) and less space on disk. Smaller data does not always mean improved query performance. Here are two caveats:
If you are reading a single record, then the entire page needs to be read into cache. It is true that you are a bit less likely to get a cache miss with a warm cache, but overall reading a single record from a cold cache would be the same.
If you are reading the entire table, SQL Server actually reads pages in blocks and implements some look-ahead (also called read-ahead or prefetching). If you are doing complex processing, you might not even notice the additional I/O time, because I/O operations can run in parallel with computing.
For other operations such as deletes and updates, locking is sometimes done at the page level. In these cases, sparser pages can be associated with better performance.
If you are talking about btrees, I wouldn't imagine that the additional overhead of a non clustered index (not counting stuff like full text search or other kind of string indexing) is even measurable, except for an extremely high volume high write scenario.
What kind of overhead are we actually talking about? Why would it be a bad idea to just index everything? Is this implementation specific? (in that case, I am mostly interested in answers around pg)
EDIT: To explain the reasoning behind this a bit more...
We are looking to specifically improve performance right now across the board, and one of the key things we are looking at is query performance. I have read the things mentioned here, that indexes will increase db size on disk and will slow down writes. The question came up today when one pair did some pre-emptive indexing on a new table, since we usually apply indexes in a more reactive way. Their argument was that they weren't indexing string fields, and they weren't doing clustered indexes, so the negative impact of possibly redundant indexes should barely be measurable.
Now, I am far from an expert in such things, and those arguments made a lot of sense to me based on what I understand.
Now, I am sure there are other reasons, or I am misunderstanding something. I know a redundant index will have a negative effect, what I want to know is how bad it will be (because it seems negligible). The whole indexing every field thing is a worst case scenario, but I figured if people could tell me what that will do to my db, it will help me understand the concerns around being conservative with indexing, or just throwing them out there when it has a possibility of helping things.
Random thoughts
Indexes benefit reads of course
You should index where you get the most bang for your buck
Most DBs are > 95% read (think about updates, FK checks, duplicate checks etc = reads)
"Everything" is pointless: most indexes need to be composite with includes
Define high volume: we have 15-20 million new rows per day with indexes
Introduction to Indices
In short, an index, whether clustered or non-, adds extra "branches" to the "tree" in which data is stored by most current DBMSes. This makes finding values with a single unique combination of the index logarithmic-time instead of linear-time. This reduction in access time speeds up many common tasks the DB does; however, when performing tasks other than that, it can slow it down because the data must be accessed through the tree. Filtering based on non-indexed columns, for instance, requires the engine to iterate through the tree, and because the ratio of branch nodes (containing only pointers to somewhere else in the tree) to leaf nodes has been reduced, this will take longer than if the index were not present.
In addition, non-clustered indices separate data based on column values, but if those column values are not very unique across all table rows (like a flag indicating "yes" or "no"), then the index adds an extra level of complexity that doesn't actually help the search; in fact, it hinders it because in navigating from root to leaves of the tree, an extra branch is encountered.
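As a toy illustration of the lookup side of this, here is a sketch that stands in for an index with a sorted list and binary search. It is not how any real engine stores a B-tree; the table contents and names are invented for the example.

```python
import bisect

# Hypothetical table: (row_id, value) pairs in insertion order, like a heap.
rows = [(row_id, (row_id * 37) % 50) for row_id in range(1000)]

# A stand-in for a non-clustered index: (value, row_id) kept in sorted order.
index = sorted((value, row_id) for row_id, value in rows)
index_keys = [value for value, _ in index]

def scan_lookup(target):
    """Linear time: look at every row."""
    return [row_id for row_id, value in rows if value == target]

def index_lookup(target):
    """Logarithmic time to find the range, then read only the matches."""
    lo = bisect.bisect_left(index_keys, target)
    hi = bisect.bisect_right(index_keys, target)
    return [row_id for _, row_id in index[lo:hi]]

print(index_lookup(7) == scan_lookup(7))  # True
```

Note that with only 50 distinct values each lookup still returns 20 rows, which is the low-selectivity situation described above: finding the range is cheap, but the index narrows the work much less than it would on a near-unique column.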
I am sure the exact overhead is probably implementation specific, but off the top of my head some points:
Increased Disk Space requirements.
All writes (inserts, updates, deletes) cost more as all indexes must be updated.
Increased transaction locking overhead (all indexes must be updated within a transaction, leading to more locks being required, etc).
Potentially increased complexity for the query optimizer (choosing which index is most likely to perform best; Also potential for one index to be chosen when another index would actually be better).
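The disk space point is easy to see for yourself. Below is a minimal sketch that uses SQLite purely as a stand-in because it ships with Python; the table, column names, and row count are invented, and the actual numbers for pg or any other engine will differ.

```python
import os
import sqlite3
import tempfile

def db_size_bytes(index_columns, rows=20_000):
    """Build a throwaway SQLite file and return its size on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        conn = sqlite3.connect(path)
        conn.execute(
            "CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER)"
        )
        # One single-column index per requested column ("index everything").
        for col in index_columns:
            conn.execute(f"CREATE INDEX idx_{col} ON t ({col})")
        conn.executemany(
            "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
            ((i * 7 % 1000, i * 13 % 1000, i) for i in range(rows)),
        )
        conn.commit()
        conn.close()
        return os.path.getsize(path)

plain = db_size_bytes([])
indexed = db_size_bytes(["a", "b", "c"])
print(f"no indexes:    {plain} bytes")
print(f"three indexes: {indexed} bytes")
```

The same harness can be extended to time the inserts, but timings are noisy on a small in-process test, so treat them as a hint rather than a benchmark.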
Does anybody have a comparison, personal experience, or guideline on when to use the text type instead of a large varchar in MySQL?
While most of the entries in my database will be less than 1000 characters, some might take up to 4000 characters or more. What is the limiting length of varchar which makes text a better variant?
I do not need to index those fields.
I don't have personal experience, but this guy does:
VARCHAR vs. TEXT - some performance numbers
Quick answer: varchar was a good bit faster.
Edit - no, it wasn't. He was indexing them differently - he had a full index on the varchar (255 chars) but a 255-char prefix index on the text. When he removed that, they performed more or less the same.
Later in the thread is this interesting tidbit:
When a tmp table is needed for a SELECT, the first choice is to use MEMORY, which will be RAM-only, hence probably noticeably faster. (Second choice is MyISAM.) However, TEXT and BLOB are not allowed in MEMORY, so it can't use it. (There are other reasons why it might skip MEMORY.)
Edit 2 - some more relevant info, this time comparing the way different indices deal with the various types.
MyISAM puts TEXT and BLOB 'inline'. If you are searching a table (range scan / table scan), you are 'stepping over those cow paddies' -- costly for disk I/O. That is, the existence of the inline blob hurts performance in this case.
InnoDB puts only 767 bytes of a TEXT or BLOB inline, the rest goes into some other block. This is a compromise that sometimes helps, sometimes hurts performance.
Something else (Maria? Falcon? InnoDB plugin?) puts TEXTs and BLOBs entirely elsewhere. This would make a noticeable difference in performance when compared to VARCHAR. Sometimes TEXT would be faster (eg, range scan that does not need the blob); sometimes the VARCHAR would be faster (eg, if you need to look at it and/or return it).
Of course the best way to know is to run some tests yourself with your real dataset, or at least a simulated equivalent. Just write some scripts to populate the data and run your selects. Test with varchar at different sizes, then text, and measure both the timing and overall system utilization (cpu/load, memory, disk i/o).
If you are going to have enough load that this will matter then you ought to have automated tests anyway.
I am creating a new database for a web site using SQL Server 2005 (possibly SQL Server 2008 in the near future). As an application developer, I've seen many databases that use an integer (or bigint, etc.) for an ID field of a table that will be used for relationships. But lately I've also seen databases that use the unique identifier (GUID) for an ID field.
My question is whether one has an advantage over the other? Will integer fields be faster for querying and joining, etc.?
UPDATE: To make it clear, this is for a primary key in the tables.
GUIDs are problematic as clustered keys because of the high randomness. This issue was addressed by Paul Randal in the last Technet Magazine Q&A column: I'd like to use a GUID as the clustered index key, but the others are arguing that it can lead to performance issues with indexes. Is this true and, if so, can you explain why?
Now bear in mind that the discussion is specifically about clustered indexes. You say you want to use the column as 'ID'; it is unclear whether you mean it as clustered key or just primary key. Typically the two overlap, so I'll assume you want to use it as clustered index. The reasons why that is a poor choice are explained in the link to the article I mentioned above.
For non-clustered indexes GUIDs still have some issues, but not nearly as big as when they are the leftmost clustered key of the table. Again, the randomness of GUIDs introduces page splits and fragmentation, albeit at the non-clustered index level only (a much smaller problem).
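To make the page split effect concrete, here is a small simulation. It is a deliberately simplified model (fixed page capacity, 50/50 splits, a fresh page when appending past the end) and not a description of SQL Server's actual allocator; all names and numbers are invented for the sketch.

```python
import bisect
import random

def simulate_inserts(keys, page_capacity=100):
    """Insert keys into ordered 'pages'; return (page_count, split_count)."""
    pages = [[]]   # each page is a sorted list of keys
    bounds = []    # bounds[i] is the first key of pages[i + 1]
    splits = 0
    for key in keys:
        idx = bisect.bisect_right(bounds, key)
        page = pages[idx]
        if len(page) >= page_capacity:
            if idx == len(pages) - 1 and key > page[-1]:
                # Ever-increasing key: start a fresh page, no split needed.
                pages.append([key])
                bounds.append(key)
                continue
            # Key lands in the middle of a full page: split it in half.
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            pages.insert(idx + 1, right)
            bounds.insert(idx, right[0])
            splits += 1
            if key >= right[0]:
                page = right
        bisect.insort(page, key)
    return len(pages), splits

n = 10_000
sequential_keys = range(n)                                # like an identity column
random_keys = random.Random(42).sample(range(10**9), n)   # like random GUIDs

seq_pages, seq_splits = simulate_inserts(sequential_keys)
rnd_pages, rnd_splits = simulate_inserts(random_keys)
print("sequential:", seq_pages, "pages,", seq_splits, "splits")
print("random:    ", rnd_pages, "pages,", rnd_splits, "splits")
```

In this model the sequential keys fill every page completely and never split, while the random keys leave pages partly empty and split repeatedly, which is the fragmentation being described.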
There are many urban legends surrounding GUID usage that condemn them based on their size (16 bytes) compared to an int (4 bytes) and promise horrible performance doom if they are used. This is slightly exaggerated. A 16-byte key can still be a very performant key on a properly designed data model. While it is true that being 4 times as big as an int results in lower-density non-leaf pages in indexes, this is not a real concern for the vast majority of tables. The b-tree structure is a naturally well-balanced tree and the depth of tree traversal is seldom an issue, so seeking a value based on a GUID key as opposed to an INT key is similar in performance. A leaf-page traversal (i.e. a table scan) does not look at the non-leaf pages, and the impact of GUID size on the page size is typically quite small, as the record itself is significantly larger than the extra 12 bytes introduced by the GUID. So I'd take the hearsay advice based on 'it's 16 bytes vs. 4' with a rather large grain of salt. Analyze case by case and decide whether the size impact makes a real difference: how many other columns are in the table (i.e. how much impact the GUID size has on the leaf pages) and how many references are using it (i.e. how many other tables will grow because they need to store a larger foreign key).
I'm calling out all these details in a sort of makeshift defense of GUIDs because they have been getting a lot of bad press lately and some of it is undeserved. They have their merits and are indispensable in any distributed system (the moment you're talking data movement, be it via replication or sync framework or whatever). I've seen bad decisions made on the strength of the GUID's bad reputation, when they were shunned without proper consideration. But it is true: if you have to use a GUID as clustered key, make sure you address the randomness issue and use sequential GUIDs when possible.
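A rough sketch of the tree depth argument follows. The 8 KB page with a 96-byte header matches SQL Server's page layout, but the 7 bytes of per-entry overhead is a simplifying assumption of mine, so read the results as order-of-magnitude only.

```python
USABLE_PAGE_BYTES = 8096    # 8192-byte page minus the 96-byte page header
ENTRY_OVERHEAD_BYTES = 7    # ASSUMPTION: child pointer plus per-entry overhead

def fanout(key_bytes):
    """Approximate number of index entries that fit on one non-leaf page."""
    return USABLE_PAGE_BYTES // (key_bytes + ENTRY_OVERHEAD_BYTES)

def levels_needed(entries, per_page):
    """Smallest number of levels whose combined fan-out covers all entries."""
    levels, capacity = 0, 1
    while capacity < entries:
        capacity *= per_page
        levels += 1
    return levels

for name, width in (("INT", 4), ("GUID", 16)):
    f = fanout(width)
    print(name, "fan-out", f, "levels for 10M entries:", levels_needed(10_000_000, f))
```

Under these assumptions the GUID key roughly halves the fan-out (352 versus 736 entries per page), yet both need three levels to cover ten million entries. The extra level only appears at much larger sizes, which is why seek cost is similar for most tables.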
And finally, to answer your question: if you don't have a specific reason to use GUIDs, use INTs.
The GUID is going to take up more space and be slower than an int - even if you use the newsequentialid() function. If you are going to do replication or use the sync framework you pretty much have to use a guid.
INTs are 4 bytes, BIGINTs are 8 bytes, and GUIDs are 16 bytes. The more space required to represent the data, the more resources required to process it -- disk space, memory, etc. So (a) they're slower, but (b) this probably only matters if volume is an issue (millions of rows, or thousands of transactions in very, very little time.)
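To get a feel for when the width starts to matter, here is a quick estimate. The row count and the number of places the key is stored (the table itself plus each index or foreign key that carries it) are hypothetical inputs, and page and row overhead are ignored.

```python
KEY_BYTES = {"int": 4, "bigint": 8, "uniqueidentifier": 16}

def total_key_bytes(key_type, rows, copies=1):
    """Raw bytes spent on the key alone, across every place it is stored."""
    return KEY_BYTES[key_type] * rows * copies

# ASSUMPTION: 10 million rows, key stored in 3 places.
for key_type in KEY_BYTES:
    mb = total_key_bytes(key_type, 10_000_000, copies=3) / 1e6
    print(f"{key_type:>16}: {mb:.0f} MB")
```

At this size the difference between int and GUID is a few hundred megabytes, noticeable but rarely decisive; it scales linearly with rows and with the number of copies.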
The advantage of GUIDs is that they are (pretty much) Globally Unique. Generate a guid using the proper algorithm (and SQL Server xxxx will use the proper algorithm), and no two guids will ever be alike--no matter how many computers you have generating them, no matter how frequently. (This does not apply after 72 years of usage--I forget the details.)
If you need unique identifiers generated across multiple servers, GUIDs may be useful. If you need mondo performance and under 2 billion values, ints are probably fine. Lastly and perhaps most importantly, if your data has natural keys, stick with them and forget the surrogate values.
If you positively, absolutely have to have a unique ID, then GUID. Meaning if you're ever gonna merge, sync, or replicate, you probably should use a GUID.
For less robust things, an int should suffice, depending upon how large the table will grow.
As in most cases, the proper answer is, it depends.
Use them for replication etc, not as primary keys.
Kimberly L Tripp article
Against: Space, not strictly monotonic, page splits, bookmark/RIDs etc
For: er...
Fully agreed with JBrooks.
I want to say that when your table is large, and you use selects with JOINs, especially with derived tables, using GUIDs can significantly decrease performance.
On my current project, I came across our master DB script. Taking a closer look at it, I noticed that all of our original primary keys have a data type of numeric(38,0)
We are currently running SQL Server 2005 as our primary DB platform.
For a little context, we support both Oracle and SQL Server as our back-end. In Oracle, our primary keys have a data type of number(38,0).
Does anybody know of possible side-effects and performance impact of such implementation? I have always advocated and implemented int or bigint as primary keys and would love to know if numeric(38,0) is a better alternative.
Well, you are spending more data to store numbers that you will never really reach.
bigint goes up to 9,223,372,036,854,775,807 in 8 Bytes
int goes up to 2,147,483,647 in 4 bytes
A NUMERIC(38,0) is going to take, if I am doing the math right, 17 bytes.
Not a huge difference, but: smaller datatypes = more rows in memory (or fewer pages for the same # of rows) = fewer disk I/O to do lookups (either indexed or data page seeks). Goes the same for replication, log pages, etc.
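For reference, the 17-byte figure follows from SQL Server's documented storage sizes for decimal/numeric by precision. A small lookup makes the comparison with int (4 bytes) and bigint (8 bytes) explicit; the function name is just for this sketch.

```python
def decimal_storage_bytes(precision):
    """Storage bytes SQL Server uses for decimal/numeric of a given precision."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if precision <= 9:
        return 5
    if precision <= 19:
        return 9
    if precision <= 28:
        return 13
    return 17

print(decimal_storage_bytes(38))  # 17
print(decimal_storage_bytes(19))  # 9
```

So even a NUMERIC(19,0), which covers the whole bigint range, costs one byte more than bigint, and NUMERIC(38,0) costs more than double.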
For SQL Server: INT is a native binary integer and so is easier for the CPU to compare, so you get a slight performance increase by using INT vs. NUMERIC (which is a packed decimal format). (Note in Oracle, if the current version matches the older versions I grew up on, ALL datatypes are packed so an INT inside is pretty much the same thing as a NUMERIC( x,0 ) so there's no performance difference)
So, in the grand scheme of things -- if you have lots of disk, RAM, and spare I/O, use whatever datatype you want. If you want to get a little more performance, be a little more conservative.
Otherwise at this point, I'd leave it as it is. No need to change things.
Barring the storage considerations and some initial confusion from future DBAs, I don't see any reason why NUMERIC(38,0) would be a bad idea. You're allowing for up to 9.99 x 10^38 records in your table, which you will certainly never reach. My quick digging into this didn't turn up any glaring reason not to use it. I suspect that your only potential issue will be the storage space consumed by that, but seeing as how storage space is so cheap, that shouldn't be an issue.
I've seen this a fair number of times in Oracle databases since it's a pretty big default value that you don't need to think about when you're creating a table, similar to using INT or BIGINT by default in SQL Server.
This is overly large because you are never going to have that many rows. The larger size will result in more storage space. This is not a big deal in itself but will also mean more disk reads to retrieve data from a table or index. It will mean fewer rows will fit into memory on the database server.
I don't think it's broken enough to be bothered fixing.
You'd be better off using a GUID. Really. The normal reason not to use one is that an integer performs better. But a GUID is smaller than numeric(38), and has the added benefit of making it a little easier to do things like let disconnected users create and sync new records.