Bits vs Char - What's the best way to store 3 mutually exclusive flags? - sql

I've got a table which is set to keep track of various items. Among other properties, items can be either A, B, or C, each mutually exclusive of the rest. Is it best practice to store this information as a character, or as 3 sets of bits (isA isB, isC, etc), or some other method? I could understand using the character if I would possibly need more data types in the future, however it also makes sense to me that using bit datatypes would consume smaller amounts of storage. Or am I overanalyzing this and will the difference be so miniscule as to not even matter?

Or am I overanalyzing this and will the difference be so miniscule as to not even matter?
A little bit, yes.
But you must understand that there's a crucial difference between your design proposals: having a char column will make exclusive exception work. Having IsX fields (alone) will not. Explained: by having IsA and IsB columns, you can, potentially, have both set to true in the same record, unless you use another mechanism to prevent that (trigger, check constraint, etc.)
Additionally, having a new column every time a new value is possible is not good DB design.

Just use Char.
Space wise, you will be using an extra 625kb per million rows (assuming 5 bits saved per row, which is a best-case scenario savings-wise).
That isn't very much.
To put it into perspective, that's 625 MB per BILLION rows. When you get to tables of that size you don't care about any units that don't start with giga, tera, or peta.
Internally, SQL Server stores them all as a byte regardless (up to 8 bit fields).
By the time the space matters, any architecture changes (from using bit fields to something more flexible) will be extremely painful.

I'd use a single char, byte, enum, whatever. If the states are mutually exclusive, then that isn't the best use for flags.

Come to think of it a really tight, but kind of crazy, way to pull your scenario off would be to stored them in a nullable bit.
"An integer data type that can take a value of 1, 0, or NULL."
but I don't quite see how they pull that off though since
"The SQL Server Database Engine optimizes storage of bit columns. If there are 8 or less bit columns in a table, the columns are stored as 1 byte."
Both from http://msdn.microsoft.com/en-us/library/ms177603.aspx
If you need to index on the three values I would go for a tinyint instead of three bit fields.

I'd use a tiny int, basically a one byte number 0 to 255. As you expand your possible values, you end up using crazy letters that don't mean anything. So, I just start out with numbers. Keeping the three bits mutually exclusive isn't worth the hassle, they'll take a byte of storage anyways.

Related

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Elections Commission public data source API which has a unique record identifier called "sub_id" that is a 19 digit integer.
I'd like to think of a memory efficient way to catalog which line items I've already archived and immediately redis bitmaps come to mind.
Reading the documentation on redis bitmaps indicates a maximum storage length of 2^32 (4294967296).
A 19 digit integer could theoretically range anywhere from 0000000000000000001 - 9999999999999999999. Now I know that the datasource in question does not actually have 99 quintillion records, so they are clearly sparsely populated and not sequential. Of the data I currently have on file the maximum ID is 4123120171499720404 and a minimum value of 1010320180036112531. (I can tell the ids a date based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't sus out the rest of the pattern.)
If I wanted to store which line items I've already downloaded would I need 2328306436 different redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54). I could probably work up a tiny algorithm determine given an 19 digit idea to divide by some constant to determine which split bitmap index to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom Filter such as RedisBloom will be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you BF.CREATE your filter, you pass to BF.ADD an 'item' to be inserted. This item can be as long as you want. The filter uses hash functions and modulus to fit it to the filter size. When you want to check if the item was already checked, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example for when a Bloom Filter is a great fit.
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires 1250 petabytes at least. This makes it impractical (atm) to store it in memory.
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the ids ids are not sequential and very spread, keep tracking of which one you processed using a bitmap is not the best option since it would waste lot of memory.
However, it is hard to point the best solution without knowing the how many distinct sub_ids your data set has. If you are talking about a few 10s of millions, a simple set in Redis may be enough.

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit" etc. I can directly store the text in the database. Instead if use numbers such as 0 = Win, 1 = Lose etc would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Most database systems however have an enum type which is meant for cases like yours - in the query you can compare the field value against a fixed set of literals while it is internally stored as an integer.
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
Clearly if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster 1, 2, 1000. It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will roughly take 4 bytes for the int and then another 3-> 24 bytes for the text in your example (depending on if the column is nullable or is unicode)
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, less writes happen when you load store data, and so on.
Also, comparing strings at the best case is as fast as comparing bytes and worst case much slower.
There is a second huge issue with storing text where you intended to have a enum. What happens when people start storing Incompete as opposed to Incomplete?
having a skinner column means that you can fit more rows per page.
it is a HUGE difference between a varchar(20) and an integer.

Is there an advantage on setting tinyint fields when I know that the value will not exceed 255?

Should I choose the smallest datatype possible, or if I am storing the value 1 for example, it doesn't matter what is the col datatype and the value will occupy the same memory size?
The question is also, cuz I will always have to convert it and play around in the application.
UPDATE
I think that varchar(1) and varchar(50) is the same memory size if value is "a", I thought it's the same with int and tinyint, according to the answers I understand it's not, is it?
Always choose the smallest data type possible. SQL can't guess what you want the maximum value to be, but it can optimize storage and performance once you tell it the data type.
To answer your update:
varchar does take up only as much space as you use and so you're right when you say that the character "a" will take up 1 byte (in latin encoding) no matter how large a varchar field you choose. That is not the case with any other type of field in SQL.
However, you will likely be sacrificing efficiency for space if you make everything a varchar field. If everything is a fixed-size field then SQL can do a simple constant-time multiplication to find your value (like an array). If you have varchar fields in there, then the only way to find out where you data is stored it to go through all the previous fields (like a linked list).
If you're beginning SQL then I advise just to stay away from varchar fields unless you expect to have fields that sometimes have very small amounts of text and sometimes very large amounts of text (like blog posts). It takes experience to know when to use variable length fields to the best effect and even I don't know most of the time.
It's a performance consideration particular to the design of your system. In general, the more data you can fit into a page of Sql Server data, the better the performance.
One page in Sql Server is 8k. Using tiny ints instead of ints will enable you to put more data into a single page but you have to consider whether or not it's worth it. If you're going to be serving up thousands of hits a minute, then yes. If this is a hobby project or something that just a few dozen users will ever see, then it doesn't matter.
The advantage is there but might not be significant unless you have lots of rows and performs los of operation. There'll be performance improvement and smaller storage.
Traditionally every bit saved on the page size would mean a little bit of speed improvement: narrower rows means more rows per page, which means less memory consumed and fewer IO requests, resulting in better speed. However, with SQL Server 2008 Page compression things start to get fuzzy. The compression algorithm may compress 4 byte ints with values under 255 on even less than a byte.
Row compression algorithms will store a 4 byte int on a single byte for values under 127 (int is signed), 2 bytes for values under 32768 and so on and so forth.
However, given that the nice compression features are only available on Enterprise Edition servers, it makes sense to keep the habit of using the smallest possible data type.

Does varchar result in performance hit due to data fragmentation?

How are varchar columns handled internally by a database engine?
For a column defined as char(100), the DBMS allocates 100 contiguous bytes on the disk. However, for a column defined as varchar(100), that presumably isn't the case, since the whole point of varchar is to not allocate any more space than required to store the actual data value stored in the column. So, when a user updates a database row containing an empty varchar(100) column to a value consisting of 80 characters for instance, where does the space for that 80 characters get allocated from?
It seems that varchar columns must result in a fair amount of fragmentation of the actual database rows, at least in scenarios where column values are initially inserted as blank or NULL, and then updated later with actual values. Does this fragmentation result in degraded performance on database queries, as opposed to using char type values, where the space for the columns stored in the rows is allocated contiguously? Obviously using varchar results in less disk space than using char, but is there a performance hit when optimizing for query performance, especially for columns whose values are frequently updated after the initial insert?
You make a lot of assumptions in your question that aren't necessarily true.
The type of the a column in any DBMS tells you nothing at all about the nature of the storage of that data unless the documentation clearly tells you how the data is stored. IF that's not stated, you don't know how it is stored and the DBMS is free to change the storage mechanism from release to release.
In fact some databases store CHAR fields internally as VARCHAR, while others make a decision about how to the store the column based on the declared size of the column. Some database store VARCHAR with the other columns, some with BLOB data, and some implement other storage, Some databases always rewrite the entire row when a column is updated, others don't. Some pad VARCHARs to allow for limited future updating without relocating the storage.
The DBMS is responsible for figuring out how to store the data and return it to you in a speedy and consistent fashion. It always amazes me how many people to try out think the database, generally in advance of detecting any performance problem.
The data structures used inside a database engine is far more complex than you are giving it credit for! Yes, there are issues of fragmentation and issues where updating a varchar with a large value can cause a performance hit, however its difficult to explain /understand what the implications of those issues are without a fuller understanding of the datastructures involved.
For MS Sql server you might want to start with understanding pages - the fundamental unit of storage (see http://msdn.microsoft.com/en-us/library/ms190969.aspx)
In terms of the performance implications of fixes vs variable storage types on performance there are a number of points to consider:
Using variable length columns can improve performance as it allows more rows to fit on a single page, meaning fewer reads
Using variable length columns requires special offset values, and the maintenance of these values requires a slight overhead, however this extra overhead is generally neglible.
Another potential cost is the cost of increasing the size of a column when the page containing that row is nearly full
As you can see, the situation is rather complex - generally speaking however you can trust the database engine to be pretty good at dealing with variable data types and they should be the data type of choice when there may be a significant variance of the length of data held in a column.
At this point I'm also going to recommend the excellent book "Microsoft Sql Server 2008 Internals" for some more insight into how complex things like this really get!
The answer will depend on the specific DBMS. For Oracle, it is certainly possible to end up with fragmentation in the form of "chained rows", and that incurs a performance penalty. However, you can mitigate against that by pre-allocating some empty space in the table blocks to allow for some expansion due to updates. However, CHAR columns will typically make the table much bigger, which has its own impact on performance. CHAR also has other issues such as blank-padded comparisons which mean that, in Oracle, use of the CHAR datatype is almost never a good idea.
Your question is too general because different database engines will have different behavior. If you really need to know this, I suggest that you set up a benchmark to write a large number of records and time it. You would want enough records to take at least an hour to write.
As you suggested, it would be interesting to see what happens if you write insert all the records with an empty string ("") and then update them to have 100 characters that are reasonably random, not just 100 Xs.
If you try this with SQLITE and see no significant difference, then I think it unlikely that the larger database servers, with all the analysis and tuning that goes on, would be worse than SQLITE.
This is going to be completely database specific.
I do know that in Oracle, the database will reserve a certain percentage of each block for future updates (The PCTFREE parameter). For example, if PCTFREE is set to 25%, then a block will only be used for new data until it is 75% full. By doing that, room is left for rows to grow. If the row grows such that the 25% reserved space is completely used up, then you do end up with chained rows and a performance penalty. If you find that a table has a large number of chained rows, you can tune the PCTFREE for that table. If you have a table which will never have any updates at all, a PCTFREE of zero would make sense
In SQL Server varchar (except varchar(MAX)) is generally stored together with the rest of the row's data (on the same page if the row's data is < 8KB and on the same extent if it is < 64KB. Only the large data types such as TEXT, NTEXT, IMAGE, VARHCAR(MAX), NVARHCAR(MAX), XML and VARBINARY(MAX) are stored seperately.

Increase increment size to match GUID advantage

I've been thinking of implementing this system, but can't help but feel there's a catch somewhere. One of the points of using GUID over incrementing int is that, in the future, if you were to merge databases together, you wouldn't have any clashes over the primary key/identifier. However, my approach is to set the increment size to X where X is the number of servers I'll most likely have in the future. Then, on each server, have the seed be an increment over the seed number on the previous server. That way, during merging, there would be no clashes with the primary key. Is this a safe, normal method or have I gone mental :)?
Thanks
In multi-master SQL replication, you typically have primary keys defined as:
GUIDs
int's with a increment size > number of installs
int's with a fixed offset
The downside of GUIDs is they can be harder to read and take up slightly more space. However, it allows you to scale to n instances.
Integers are a bit easier to deal with. They also have the advantage of being able to easily tell which server created a record. The downside is you must either predict the maximum number of databases which might be merged, or guess the maximum number of rows a single instance might insert.
An example of a fixed offset is: site A starts at 0, site B starts at 1,000,000 and site C starts at 2,000,000. This scheme works fine until one site inserts one million rows. This scheme might work well for cars at car dealerships, where it's unlikely that any one dealer would ever sell more than 1,000,000 cars, and you might have hundreds of dealerships over the life of the application.
What scares me here is your use of "most likely". You're assuming on the future here, and typically that's not a good thing to do with things like this. Why not use a GUID?
What if you add one extra server over what you thought you'd have? I could see things getting really complicated really quickly.