SQL: best way to store yes/no values? Concerned about performance in huge databases

I have some columns where I have to store basically yes/no values.
For example, user status (active or inactive), or newsletter subscription status (subscribed or unsubscribed).
Considering tables with a lot of records, I want to know if the best way is to use a tinyint column, storing 1 for yes and 0 for no.
Is this the right approach? Or is there no impact on the performance of db queries when using words like yes, no, active, inactive, subscribed, etc.?
Thanks in advance.

Semantically, I suggest you use bit if it's available to you. When looking at the column, any other developer can immediately determine that a boolean value is stored in it. If you don't have bit, try using tinyint. Ensuring that 1 is the only true value and 0 is the only false value will bring consistency. Otherwise, you could end up with a messy mixture of true/false, yes/no, valid/invalid, y/n, and/or t/f.
Comparing bit or tinyint values probably isn't slower than comparing strings, and even if it were slower than comparing strings, I can't imagine it having a significant effect on overall speed.
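For illustration, a minimal T-SQL sketch of what that consistency might look like (table and column names are invented):

CREATE TABLE dbo.Users (
    UserId INT IDENTITY(1,1) PRIMARY KEY,
    UserName VARCHAR(50) NOT NULL,
    -- bit documents the boolean intent; NOT NULL plus a default keeps the data consistent
    IsActive BIT NOT NULL DEFAULT 1,
    IsSubscribed BIT NOT NULL DEFAULT 0
);

-- Queries only ever compare against 0 or 1
SELECT UserName FROM dbo.Users WHERE IsActive = 1 AND IsSubscribed = 1;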

Is there something you don't like about the 'bit' data type?

The most widely supported option is CHAR(1) - on most databases it takes the same amount of space as BIT (1 byte, assuming BIT is available) but allows more values (26 if case insensitive, 52 if not) should you ever need them. Unlike BIT, CHAR(1) is human readable, and BIT isn't supported on every database anyway.
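If you go the CHAR(1) route, a CHECK constraint avoids the messy mixture of y/n, t/f, etc. mentioned above - a sketch with hypothetical names:

CREATE TABLE dbo.Subscribers (
    SubscriberId INT PRIMARY KEY,
    -- CHAR(1) stays human readable; the constraint limits it to two agreed values
    Subscribed CHAR(1) NOT NULL CHECK (Subscribed IN ('Y', 'N'))
);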

If your RDBMS supports bitmap indexes, go for BIT every time. If it doesn't, use whatever you want; there is really no difference between char(1) and tinyint (a single byte each).
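For reference, bitmap indexes are an Oracle feature (SQL Server has no direct equivalent); the syntax there looks like this, with hypothetical names:

-- Oracle dialect
CREATE BITMAP INDEX ix_users_active ON users (is_active);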

Are you just asking in general, what the most efficient way to store a yes/no flag is?
Or do you have a performance problem at hand?
If so, when do you have the performance problem (specific queries, inserts, maintenance etc)? What kind of performance gain are you looking for?
2%? 10%? 50%?
Changing datatypes will likely result in only a minor improvement unless we are talking about several hundred million rows. I'll give you an example. Let's say that whatever change you made shaved off 3 bytes per row, and the table contains 100,000,000 rows. That's a saving of ~286 MB. Assuming the disk subsystem can deliver 100 MB/s, you have saved a whopping 3 seconds on a full table scan. Something tells me the users would consider 2 hours and 3 seconds vs 2 hours to be pretty much the same :)
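The back-of-the-envelope math from that example, as a quick query you can adapt to your own row counts:

-- 100,000,000 rows * 3 bytes saved, read at 100 MB/s
SELECT 100000000 * 3 / (100 * 1024 * 1024.0) AS seconds_saved_on_full_scan;  -- ~2.86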

My intuition would have said performance would be better with tinyints, but this post doesn't really bear that out. This SO post also offers some other interesting opinions.
I do think that performing analysis on data stored as numbers is typically easier than on character data. What other programs are you going to have to interface with and use? For example, several of my analysis tools do not read character data at all, so we have to recode any data we receive in the format of "yes", "no", etc.
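Recoding that kind of character data is typically a one-liner, e.g. (column and table names hypothetical):

-- Map 'yes'/'no' strings to 1/0 for tools that only read numeric data
SELECT UserId,
       CASE WHEN SubscriptionStatus = 'yes' THEN 1 ELSE 0 END AS SubscribedFlag
FROM dbo.Newsletter;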

Related

For datatype BIT, is | faster than OR?

Given that X and Y are of datatype bit:
WHERE Something = 1 AND ( X = 1 OR Y = 1 )
WHERE Something = 1 AND X | Y = 1
Is the second one, which uses |, faster than the OR version above? If so, why?
My co-worker says it is. I'm kinda dumb, and also lazy, and am trying to confirm that's true before I try to understand why.
We use T-SQL, though I imagine whatever the answer is, it's probably universal to most (all?) flavors of SQL?
In most programming languages, bit-fiddling is a way to save storage and increase performance. Although storing items as bits can save storage in a relational database, it is not a route to improved performance.
Basically, bit operators -- as with most other functions and operators -- impede the optimizer and prevent the use of indexes and partition pruning. They also generally incur a bit of overhead, just because CPUs are designed to handle bunches of bits at a time (think 4-bytes or 8-bytes) rather than individual bits.
Then there is the question of what performance would be saved. If the query is doing a full table scan, the primary cost is reading the data. Not filtering the data. In the time it takes to read a data page into memory, lots and lots and lots and lots of comparisons on bits or integers can be made -- way more than 3 per row.
So, even if there were a difference in performance, it would be very hard to measure. I do encourage you to try. I would imagine that the two versions are essentially equivalent, each having three bit operations and some booleans thrown in.
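If you do want to try it, a minimal harness along these lines (hypothetical table) lets you compare the two predicates on your own data:

-- Compare elapsed time of the two WHERE clauses
SET STATISTICS TIME ON;

SELECT COUNT(*) FROM dbo.Flags WHERE Something = 1 AND (X = 1 OR Y = 1);
SELECT COUNT(*) FROM dbo.Flags WHERE Something = 1 AND X | Y = 1;

SET STATISTICS TIME OFF;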

Performance issues with varchar(max)? [duplicate]

Here is my predicament.
Basically, I need a column in a table to hold an unknown length of characters. But I was curious whether, in SQL Server, performance problems could arise from using VARCHAR(MAX) or NVARCHAR(MAX) in a column. This time I only need to store 3 characters, and most of the time I only need to store 10. But there is a small chance that the column could hold up to a couple thousand characters, or even possibly a million - it is unpredictable. I can, however, guarantee that it will not go over the 2 GB limit.
I was just curious if there are any performance issues, or possibly better ways of solving this problem where available.
Sounds to me like you plan to use the varchar(MAX) data type for its intended purpose.
When data in a MAX data type exceeds 8 KB, an overflow page is used. SQL Server 2005 automatically assigns an overflow indicator to the page and knows how to manipulate data rows the same way it manipulates other data types.
For further reading, check out Books Online: char and varchar
You cannot create indexes on varchar(max) (and nvarchar(max)) columns (although they can be included in them - but who would include a column in an index that could get to 2 GB?!), so if you want to search on this value, you will do a scan each time unless you use full-text indexes. Also, remember that any report designer or presentation designer (web or otherwise) must assume that someone might put the encyclopedia into that column and design around it. Nothing is worse than hearing "the users probably won't do X". If a user can do it, they will do it. If a user can put a tome into a column, at some point they will. If they never should, then IMO it makes more sense to cap the column size at some reasonable level; if a user then tries to stuff more into that column than is allowed, it elicits a discussion of whether they should be entering that value into that column in the first place.
I just saw this article the other day. It documents a fairly minor performance lag for varchar(max) over a varchar(n) column. Probably not enough to make a difference for you. But if it does, perhaps you can use a separate table to store those few large text blocks. Your small text could stay in the main table, but you could add a flag field to tell you to look in the new table for the big ones.
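A sketch of that separate-table idea, with made-up names:

CREATE TABLE dbo.Notes (
    NoteId INT PRIMARY KEY,
    ShortText VARCHAR(100) NULL,        -- the common, small values
    HasLargeText BIT NOT NULL DEFAULT 0 -- flag: look in NotesLarge instead
);

CREATE TABLE dbo.NotesLarge (
    NoteId INT PRIMARY KEY REFERENCES dbo.Notes (NoteId),
    LargeText VARCHAR(MAX) NOT NULL     -- the rare, huge values
);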
I've seen some problems - particularly with scalar functions (but these are generally horrible, anyway) which return varchar(MAX) and then aren't re-cast. For instance, say you have a special function CleanString(somevarcharmax) returns varchar(max) and call it on varchar(50) but don't CAST(CleanString(varchar10col) AS varchar(10)) - nasty performance issues.
But typically, when you have varchar(max) columns in a table, you shouldn't be performing those kinds of operations en masse, so I'd say if you are using it properly for your data needs in the table, then it's fine.
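To illustrate the re-casting point (CleanString here stands in for the hypothetical user-defined function from the answer above):

-- Without a cast, the result is typed varchar(max) and can hurt downstream plans
SELECT dbo.CleanString(SomeVarchar10Col) FROM dbo.T;

-- Re-cast to the width you actually need
SELECT CAST(dbo.CleanString(SomeVarchar10Col) AS VARCHAR(10)) FROM dbo.T;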
Crystal Reports 12 (and other versions, as far as I know) doesn't handle varchar(max) properly and interprets it as varchar(255), which leads to truncated data in reports.
So if you're using Crystal Reports, that's a disadvantage of varchar(max). Or a disadvantage of using Crystal, to be precise.
See:
http://www.crystalreportsbook.com/Forum/forum_posts.asp?TID=5843&PID=17503
http://michaeltbeeitprof.blogspot.com/2010/05/crystal-xi-and-varcharmax-aka-memo.html
No, varchar(max) adjusts itself based on the size of the entry, so it is the most efficient choice if you will be storing inputs of widely varying sizes.

For databases, does choosing the correct data type affect performance?

And if so, why? I mean, is a tinyint faster to search than an int?
If so, what are the practical differences in performance?
Depending on the data types, yes, it does make a difference.
int vs. tinyint wouldn't make a noticeable difference in speed, but it would make a difference in data sizes. Assuming tinyint is 1 byte versus int's 4, that's 3 bytes saved every row. It adds up after a while.
Now, if it were int against varchar, then there would be a bit of a drop, as things like sorts would be much faster on integer values than on string values.
If it's a comparable type, and you're not very pressed for space, go with the one that's easier and more robust.
Theoretically, yes, a tinyint is faster than an int. But good database design and proper indexing has a far more substantial effect on performance, so I always use int for design simplicity.
I would venture that there are no practical performance differences in that case. Storage space is the more substantial factor, but even then, it's not much difference. The difference is perhaps 2 bytes? After 500,000 rows you've almost used an extra megabyte. Hopefully you aren't pinching megabytes if you are working with that much data.
Choosing the right data type can improve performance. In a lot of cases the practical difference might not be a lot but a bad choice can definitely have an impact. Imagine using a 1000 character char field instead of a varchar field when you are only going to be storing a string of a few characters. It's a bit of an extreme example but you would definitely be a lot better using a varchar. You would probably never notice a difference in performance between an int and a tinyint. Your overall database design (normalized tables, good indices, etc.) will have a far larger impact.
Of course, choosing the right datatypes always helps with faster execution.
Take a look at this article; it will surely help you out:
http://www.peachpit.com/articles/article.aspx?p=30885&seqNum=7
The performance consideration all depends on the scale of your model and usage. While the consideration for storage space in these modern times is almost a non-issue, you might need to think about performance:
Database engines tend to store data in chunks called pages. SQL Server has 8 KB pages, Oracle 2 KB, and MySQL 16 KB by default - not that big for any of these systems. Whenever you perform an operation on a piece of data (a field in a row), its entire page is fetched from the db and put into memory. When your data is smaller (tinyint vs. int), more individual rows and data items fit into a page, so the likelihood of having to fetch additional pages goes down and overall performance goes up. So yes, using the smallest possible representation of your data will definitely have an impact on performance, because it allows the db engine to be more efficient.
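To put rough numbers on that, assuming SQL Server's roughly 8,096 usable bytes per page and ignoring per-row overhead:

-- A row of 10 int columns (40 bytes) vs. 10 tinyint columns (10 bytes)
SELECT 8096 / 40 AS rows_per_page_int,     -- ~202 rows per page
       8096 / 10 AS rows_per_page_tinyint; -- ~809 rows per page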
One way it can affect performance is by not requiring you to convert data to the correct type in order to manipulate it. This comes up when someone uses varchar instead of a datetime datatype, for instance, and the values then have to be converted to do date math. It can also affect performance by giving a smaller record (which is why you shouldn't define everything at the max size), which affects how pages are stored and retrieved in the database.
Of course, using the correct data type can also help data integrity; you can't store a date that doesn't exist in a datetime field, but you can in a varchar field. If you use float instead of int, then your values aren't restricted to integer values, etc. And speaking of float, it is generally bad to use if you intend to do math calculations, as you get rounding errors since it is not an exact type.
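A small example of the conversion cost mentioned above (hypothetical table; OrderDateText stores dates as yyyymmdd strings):

-- datetime column: date math works directly
SELECT DATEDIFF(DAY, OrderDate, GETDATE()) FROM dbo.Orders;

-- varchar column: every row must be converted first
SELECT DATEDIFF(DAY, CONVERT(DATETIME, OrderDateText, 112), GETDATE()) FROM dbo.Orders;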

Is BIT field faster than int field in SQL Server?

I have a table with some fields whose value will always be 1 or 0. This table will become extremely large over time. Is it good to use the bit datatype, or is it better to use a different type for performance? Of course, all fields should be indexed.
I can't give you any stats on performance, however, you should always use the type that is best representative of your data. If all you want is 1-0 then absolutely you should use the bit field.
The more information you can give your database, the more likely it is to get its "guesses" right.
Officially bit will be fastest, especially if you don't allow nulls. In practice it may not matter, even at large scale. But if the value will only ever be 0 or 1, why not use a bit? It sounds like the best way to ensure that the value won't get filled with invalid stuff, like 2 or -1.
As I understand it, you still need a byte to store a bit column (though up to 8 bit columns can share a single byte). So having a large number of these bit columns could save you a bit on storage. As Yishai said, it probably won't make much of a difference in performance (though a bit will translate to a boolean in application code more nicely).
If you can state with 100% confidence that the two options for this column will NEVER change then by all means use the bit. But if you can see a third value popping up in the future it could make life a little easier when that day comes to use a tinyint.
Just a thought, but I'm not sure how much good an index will do you on this column either, unless you see the vast majority of rows going to one side or the other. In a roughly 50/50 distribution, you might actually take more of a hit keeping the index up to date than the gains you'd see when querying the table.
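If the distribution is heavily skewed, one option on SQL Server 2008 and later is a filtered index that only covers the rare side (names hypothetical):

-- Index only the minority of rows, e.g. the inactive ones
CREATE NONCLUSTERED INDEX IX_Users_Inactive
    ON dbo.Users (UserId)
    WHERE IsActive = 0;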
It depends.
If you would like to maximize the speed of selects, use int (tinyint to save space), because bit in a where clause is slower than int (not drastically, but every millisecond counts). Also make the column not null, which speeds things up as well. Below is a link to an actual performance test, which I would recommend running against your own database, and also extending with not nulls, indexes, and multiple columns at once. At home I even tried comparing multiple bit columns vs. multiple tinyint columns, and the tinyint columns were faster (select count(*) where A=0 and B=0 and C=0). I thought SQL Server (2014) would optimize by doing only one comparison using a bitmask, so it should have been three times faster, but that wasn't the case. If you use indexes, you would need more than 5,000,000 rows (as used in the test) to notice any difference (which I didn't have the patience to do, since filling a table with multiple millions of rows would take ages on my machine).
https://www.mssqltips.com/sqlservertip/4137/sql-server-performance-test-for-bit-data-type-in-a-where-clause/
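A stripped-down version of that kind of test, if you want to reproduce it (the table and row count are arbitrary):

CREATE TABLE dbo.FlagTest (
    A BIT NOT NULL,
    B BIT NOT NULL,
    C BIT NOT NULL
);
-- ...fill with a few million rows, then:
SET STATISTICS TIME ON;
SELECT COUNT(*) FROM dbo.FlagTest WHERE A = 0 AND B = 0 AND C = 0;
SET STATISTICS TIME OFF;
-- Repeat with TINYINT columns and compare the elapsed times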
If you would like to save space, use bit, since 8 of them can occupy one byte, whereas 8 tinyints will occupy 8 bytes - around 7 MB saved per million rows.
The differences between those two cases are basically negligible, and since bit has the upside of signalling that the column represents merely a flag, I would recommend using bit.

Char(4) versus int as StatusID/StatusCode column in a table

I need a status column that will have about a dozen possible values.
Is there any reason why I should choose int (StatusID) over char(4) (StatusCode)?
Since SQL Server doesn't support named constants, char is far more descriptive than int when used as a constant in stored procedures and views.
To clarify, I would still use a lookup table either way, since I will need more descriptive text for the UI. So this decision is only to help me as the developer when I'm maintaining the stored procedures and views.
Right now I'm leaning toward char(4). Especially since designing views in SQL Server Management Studio prevents me from adding comments (I know it's possible to add it in the script editor, but realistically I will use the View Designer far more often, especially if the view is trivial). StateCODE = 'NEW' is much more readable than StateID = 1000.
I guess the question is: will there be cases where char(4) is problematic? Since the database is pretty small, I'm not too concerned about a slight performance hit (like using tinyint versus int); I'm more afraid of code maintenance problems.
Database purists will say a key should have no meaning in the business domain, and that you should create a status table where you look up the description and other meanings of the status.
But for operators and end users, having a descriptive status code can be a blessing. And it doesn't even have to be char(4), you can make it varchar(20). This allows them to query without joins, and inspect the database in an easier way.
In the end, I think the varchar(20) organization will run more smoothly and go home earlier on Friday. But the int organization has a better abstraction of the database, and they can enjoy metaprogramming on Friday evening (or boasting on forums).
(All of this assumes that you're writing business support software. One of the more successful business support systems, SAP, makes successful use of meaningful keys.)
There are many pros and cons to each method. I'm sure other arguments will come up in favour of using a char(4). My reasons for choosing an int over a char include:
I always use lookup tables. They allow for an audit trail of the value to be retained and easily examined. For example, if one of your status codes is 'MING' and a business decision is made to change it from 'MING' to 'MONG' from a certain date, my lookup table handles this.
Smaller index - if you need to index this column, it will be thinner.
Extendability - OK, I made that word up, but if you need to go from 4 chars to 5 chars for example, a lookup table would be a blessing.
Descriptions: We use a lot of TLAs here, which is great once you know what they are, but if I gave a business user a report that said "GDA's 2007 1001", they wouldn't necessarily twig that GDA = Good Dead on Arrival. With a lookup table, I can add this description.
Best practice: Can't find the link to hand, but it might be something I read in a K. Tripp article. Aim to make your clustered primary key an incrementing integer to optimise the index.
Of course, if you are absolutely positive that you will never need more than a handful of 4-character codes, there is no reason not to bang it in the table.
The best approach would be a lookup table with the defined values, related to the original table that uses the enumeration.
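A minimal sketch of that lookup-table design (names invented for illustration):

CREATE TABLE dbo.OrderStatus (
    StatusId TINYINT PRIMARY KEY,          -- small, fast surrogate key
    StatusCode CHAR(4) NOT NULL UNIQUE,    -- short mnemonic for developers
    Description VARCHAR(50) NOT NULL       -- friendly text for the UI
);

CREATE TABLE dbo.Orders (
    OrderId INT PRIMARY KEY,
    StatusId TINYINT NOT NULL REFERENCES dbo.OrderStatus (StatusId)
);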
Collation ambiguities are one reason to say no to char(4): does ABcD = abCD = äBCd?
If you have 12 possible values, why not tinyint/byte and a Status table?
If you have to store the status for 10 million rows, the 3 bytes of difference and the collation/string compares add up.
The place where I've run into this use case is columns that would map onto things that I would typically use an Enum for when programming. Do you store the integer value of the Enum or the name of the Enum in the database column? Honestly, I've done it both ways. Usually, I ask myself if the database will be used outside the application I'm building. If so, I will choose the human readable format to store in the database. If not, then I'll choose the integer value as it saves a little time when reconstituting (it's just a cast instead of a parse operation) the Enum in code.
You could also use a tinyint over an int
I always choose ints, simply because they are easier to map to enums in code.
If you're dealing with huge amounts of data and high throughput then a smallint or tinyint can give better performance and a smaller footprint on the hard disk. If the data in your application is often viewed directly through applications like Access or Cognos then your business people will probably appreciate the descriptive values. I know that when I'm analyzing data as part of my Database Developer role I get tired of joining a lot of lookup tables because I can't remember if 1 = Foo and 2 = Bar or 1 = Bar and 2 = Foo.
Also, although performance will be enhanced if you have to look up rows by these codes (which can have smaller indexes), it can also be hurt (in a minor way) by having to do the joins if you often look up rows regardless of the code but have to include the text value. In most applications that's not an issue, though, and it would probably only come into play in large data warehousing/reporting environments.