I have a table with some fields whose values will only ever be 1 or 0. The table will grow extremely large over time. Is it good to use the bit datatype, or is it better to use a different type for performance? Of course, all of these fields should be indexed.
I can't give you any stats on performance, however, you should always use the type that is best representative of your data. If all you want is 1-0 then absolutely you should use the bit field.
The more information you can give your database, the more likely it is to get its "guesses" right.
Officially, bit will be fastest, especially if you don't allow nulls. In practice it may not matter, even at large scale. But if the value will only ever be 0 or 1, why not use a bit? It sounds like the best way to ensure that the value won't get filled with invalid stuff, like 2 or -1.
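For instance, a minimal sketch (table and column names invented): with a bit NOT NULL column, only 0 and 1 can ever end up stored, since SQL Server coerces any nonzero value to 1.

-- hypothetical table; IsActive can only ever hold 0 or 1
CREATE TABLE dbo.UserFlags
(
    UserId   int NOT NULL PRIMARY KEY,
    IsActive bit NOT NULL DEFAULT (0)
);

INSERT INTO dbo.UserFlags (UserId, IsActive) VALUES (1, 1);
INSERT INTO dbo.UserFlags (UserId, IsActive) VALUES (2, 2);  -- stored as 1, not 2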
As I understand it, you still need a byte to store a bit column (but you can pack up to 8 bit columns into a single byte). So having a large number (how many?) of these bit columns could save you a bit on storage. As Yishai said, it probably won't make much of a difference in performance (though a bit will translate to a boolean in application code more nicely).
If you can state with 100% confidence that the two options for this column will NEVER change, then by all means use the bit. But if you can see a third value popping up in the future, using a tinyint could make life a little easier when that day comes.
Just a thought, but I'm not sure how much good an index will do you on this column either, unless you see the vast majority of rows going to one side or the other. In a roughly 50/50 distribution, you might actually take more of a hit keeping the index up to date than the gains you'd see when querying the table.
It depends.
If you want to maximize the speed of selects, use int (or tinyint to save space), because bit in a WHERE clause is slower than int (not drastically, but every millisecond counts). Also make the column NOT NULL, which speeds things up as well.

Below is a link to an actual performance test, which I would recommend running against your own database and extending with NOT NULL columns, indexes, and multiple columns at once. At home I even compared multiple bit columns against multiple tinyint columns, and the tinyint columns were faster (select count(*) where A=0 and B=0 and C=0); a sketch of that comparison follows the link. I thought SQL Server (2014) would optimize by doing a single comparison with a bitmask, so it should have been roughly three times faster, but that wasn't the case. If you use indexes, you would need more than the 5,000,000 rows used in the test to notice any difference (which I didn't have the patience to do, since filling a table with many millions of rows would take ages on my machine).
https://www.mssqltips.com/sqlservertip/4137/sql-server-performance-test-for-bit-data-type-in-a-where-clause/
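For reference, here is a rough sketch of the kind of side-by-side test described above (table names and the fill step are placeholders, not the script from the linked tip):

-- hypothetical test tables; fill both with the same few million rows of 0/1 values
CREATE TABLE dbo.FlagsBit     (A bit     NOT NULL, B bit     NOT NULL, C bit     NOT NULL);
CREATE TABLE dbo.FlagsTinyint (A tinyint NOT NULL, B tinyint NOT NULL, C tinyint NOT NULL);

-- ... load identical data into both tables here ...

SET STATISTICS TIME ON;
SELECT COUNT(*) FROM dbo.FlagsBit     WHERE A = 0 AND B = 0 AND C = 0;
SELECT COUNT(*) FROM dbo.FlagsTinyint WHERE A = 0 AND B = 0 AND C = 0;
SET STATISTICS TIME OFF;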
If you would like to save space, use bit, since 8 of them can occupy a single byte, whereas 8 tinyints will occupy 8 bytes. That works out to around 7 megabytes saved per million rows.
The differences between the two cases are basically negligible, and since bit has the upside of signalling that the column is merely a flag, I would recommend using bit.
I have always tried to make my SQL database as simple and as understandable as possible.
Until now I have always used a limited number of columns; I think I have never had more than 20. Now there is one thing that would make my life easier if I had many more columns, let's say 200 columns (not rows). What do you think about it?
I just want to know whether it is a bad idea, not why I'm doing this or whether there are other possibilities; just whether somebody has already experienced something like that and whether it is a bad idea to build such a table.
Fewer, narrower columns are better than lots of columns and/or wide columns.
Why? Because the narrower the row, the more rows fit on an 8K page. That means you do less I/O and use less memory to buffer pages. That is always a good thing.
In those (hopefully rare) cases where the domain requires many attributes on an object (assuming a 1-1 object-table mapping), you should consider splitting it into two tables in a 1-1 relationship, one containing the frequently used columns and the other the rest.
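As a sketch of what that split could look like (all names invented):

-- frequently used columns stay in the main table
CREATE TABLE dbo.Customer
(
    CustomerId int           NOT NULL PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    Email      nvarchar(200) NOT NULL
);

-- rarely used attributes move to a 1-1 companion table
CREATE TABLE dbo.CustomerDetail
(
    CustomerId int NOT NULL PRIMARY KEY
        REFERENCES dbo.Customer (CustomerId),
    Notes      nvarchar(max) NULL
    -- ... the rest of the rarely used columns ...
);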
I don't think it is black and white. Having a large row size (implied by the large number of columns) will hurt performance (i.e., more I/O) -- but there are cases where taking a small hit in performance in one place will be offset by increased performance in others.
I'd say it depends on how many rows you expect this table to have, how often it will be queried, how many of those additional columns will really be accessed, and how it compares to your alternative design in terms of efficiency and complexity.
Luke--
It really depends on the type of system you are working with. For example, in transactional systems most tables have at most 50 columns or so, with almost no redundant data attributes (if you have a process date, you would not need the process month or the process year as a separate column). This is because the records are updated/inserted frequently and you would need to update all the redundant attributes every time you update one row.
In Data Warehouse/reporting environments, it is typical for dimension tables (which hold the attributes of an entity) to have 100+ columns, as there could be various ways you want to categorize a given entity. Updates here are not so much of a problem, as data is typically loaded once during off-peak hours and then used mostly in selects.
Take a look at these links to learn more:
http://en.wikipedia.org/wiki/Database_normalization
http://en.wikipedia.org/wiki/Star_schema
So the answer is: it depends... If you want a perfectly relational system, then 200+ columns is something of a red flag indicating you should look at normalizing your data (maybe not). Updates and indexes are the two things you should be concerned with in such a system.
You are using SQL Server, which I think defaults to row-oriented storage (all fields in a row are stored together on a page), and that can be a problem with a large number of columns. However, if you use column-oriented storage, the number of columns per table does not matter as much, because each column is stored together. I don't know whether this is possible with SQL Server.
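(For what it's worth, newer versions of SQL Server do support column-oriented storage through columnstore indexes; a minimal sketch, with a made-up table name, assuming SQL Server 2014 or later:)

-- store dbo.WideTable column-wise instead of row-wise
CREATE CLUSTERED COLUMNSTORE INDEX cci_WideTable ON dbo.WideTable;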
I have some columns where I have to store basically yes/no values.
For example, user status for active or inactive. Newsletter subscription status for subscribed or unsubscribed.
I want to know (considering tables with a lot of records) whether the best way is to use a tinyint of length 1 and set 1 for yes and 0 for no.
Is this the right way of thinking? Or is there no impact on the performance of db queries when using just words like yes, no, active, inactive, subscribed, etc.?
Thanks in advance.
Semantically, I suggest you use bit if it's available to you. When looking at the column, any other developer can immediately determine that a boolean value is stored in it. If you don't have bit, try using tinyint. Ensuring that 1 is the only true value and 0 is the only false value will bring consistency. Otherwise, you could end up with a messy mixture of true/false, yes/no, valid/invalid, y/n, and/or t/f.
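If bit isn't available and you fall back to tinyint, a check constraint (sketched here with invented names) is one way to enforce that consistency:

-- hypothetical example: keep a tinyint flag limited to 0 and 1
CREATE TABLE users
(
    id        int     NOT NULL PRIMARY KEY,
    is_active tinyint NOT NULL DEFAULT 0,
    CONSTRAINT ck_users_is_active CHECK (is_active IN (0, 1))
);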
Comparing bit or tinyint values probably isn't slower than comparing strings, and even if it were, I can't imagine it having a significant effect on overall speed.
Is there something you don't like about the 'bit' data type?
The most widely supported option is CHAR(1); on most databases it takes the same amount of space as BIT (assuming BIT is even available, 1 byte) but supports more values (26 if case insensitive, 52 if not) if there is any chance you will need them. Unlike BIT, CHAR(1) is human readable. Also, BIT isn't supported on every database.
If your RDBMS supports bitmap indexes, go for BIT every time. If it doesn't, use whatever you want; there is really no difference between char(1) and tinyint (a byte).
Are you just asking in general, what the most efficient way to store a yes/no flag is?
Or do you have a performance problem at hand?
If so, when do you have the performance problem (specific queries, inserts, maintenance etc)? What kind of performance gain are you looking for?
2%? 10%? 50%?
Changing datatypes will likely result in only a minor improvement unless we are talking about several hundred million rows. I'll give you an example. Let's say that whatever change you made shaved off 3 bytes per row, and the table contains 100,000,000 rows. That would be a saving of ~285 MB. Assuming the disk subsystem can give you 100 MB/s, you have saved a whopping 3 seconds on a full table scan. Something tells me the users would think 2 hours and 3 seconds vs 2 hours is much the same :)
My intuition would have said that performance would be better with tinyints, but this post doesn't really bear that out. This SO post also offers some other interesting opinions.
I do think that performing analysis on data stored as numbers is typically easier than on character data. What other programs will you have to interface with and use? For example, several of my analysis tools do not read character data at all, so we have to recode any data we receive in the format of "yes", "no", etc.
To preface, I'm aware (as you should be!) that using SELECT * in production is bad, but I was maintaining a script written by someone else. And I'm also aware that this question is low on specifics... but it's a hypothetical scenario.
Let's say I have a script that selects everything from a table of 20 fields. Let's say typical customer information.
Then let's say being the good developer I am, I shorten the SELECT * to a SELECT of the 13 specific fields I'm actually using on the display end.
What type of performance benefit, if any, could I expect by explicitly listing the fields versus SELECT *?
I will say this: both queries take advantage of exactly the same indexes. The more specific query does not have access to a covering index that the other query could not also use, in case you were wondering.
I'm not expecting miracles, like adding an index that targets the more specific query. I'm just wondering.
It depends on three things: the underlying storage and retrieval mechanism used by your database, the nature of the 7 columns you're leaving out, and the number of rows returned in the result set.
If the 7 (or whatever number) columns you're leaving out are "cheap to retrieve" columns, and the number of rows returned is low, I would expect very little benefit. If the columns are "expensive" (for instance, they're large, or they're BLOBs requiring reference to another file that is never cached) and / or you're retrieving a lot of rows then you could expect a significant improvement. Just how much depends on how expensive it is in your particular database to retrieve that information and assemble in memory.
There are other reasons besides speed, incidentally, to use named columns when retrieving information: knowing absolutely that certain columns are contained in the result set and that they are in the order you want to use them in.
The main difference I would expect to see is reduced network traffic. If any of the columns are large, they could take time to transfer, which is of course a complete waste if you're not displaying them.
It's also fairly critical if your database library references columns by index (instead of name), because if the column order changes in the database, it'll break the code.
Coding-style wise, it allows you to see which columns the rest of the code will be using, without having to read it.
Hmm, in one simple experiment, I was surprised at how much difference it made.
I just did a simple query with three variations:
select *
select the field that is the primary key. (It might pull this directly from the index without actually reading the record.)
select a non-key field.
I used a table with a pretty large number of fields -- 72 of them -- including one CLOB. The query was just a select with one condition in the where clause.
Results:
Run    *      Key     Non-key
1      .647   .020    .028
2      .599   .041    .014
3      .321   .019    .027
avg    .522   .027    .023
Key vs non-key didn't seem to matter. (Which surprises me.) But retrieving just one field versus select * saved 95% of the runtime!
Of course this is one tiny experiment with one table. There could be many many relevant factors. I'm certainly not claiming that you will always reduce runtime by 95% by not using select *! But it's far more impressive than I expected.
When comparing 13 vs. 20 fields, if the 7 fields that are left out are not fields such as CLOBs/BLOBs, I would expect to see no noticeable performance gain.
I/O is the main DB bottleneck (most DB systems are I/O bound), so you might think that you would bring execution time down to 13/20 of the original query's (since you need that much less data). But normal fields are stored within the same physical structure (fields are usually arranged consecutively) and the file system reads whole blocks, so your disk heads will read the same amount of data (assuming all 20 fields together are less than the block size; the situation can change if the size of a record is bigger than a block of your filesystem).
The principle that SELECT * is bad has a different cause - stability of the system.
If you use SELECT * in the wrong places, then changes to the underlying table(s) might break your system unexpectedly (mostly later, and if things break it is usually better that they break sooner). This can get especially interesting if you normalize data (move columns from one table to another while keeping the same name). In such a case, if you chain SELECT * in views, and then chain your views, you might not actually get any errors, but end up with (essentially) different results.
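A tiny illustration of the view problem (names invented, SQL Server flavored): a view created with SELECT * captures its column list at creation time, so later table changes don't show up until the view is refreshed.

CREATE TABLE dbo.Orders (OrderId int, Amount money);
GO
CREATE VIEW dbo.v_Orders AS SELECT * FROM dbo.Orders;
GO
ALTER TABLE dbo.Orders ADD Currency char(3);
GO
-- the view still returns only OrderId and Amount here
SELECT * FROM dbo.v_Orders;
-- until the metadata is rebuilt
EXEC sp_refreshview 'dbo.v_Orders';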
Why don't you try it yourself and let us know?
It's all going to be dependent on how many columns and how wide they are.
Better still, do you have an actual performance problem? Tell us what your actual problem is and show us the code, and then we can suggest potential improvements. Chances are there are other improvements to be made that are much better than worrying about SELECT * vs. SELECT field list.
Select * means the database has to take time to look up the fields. If you don't need all those fields (and any time you have an inner join you don't, since the join field is repeated!), then you are wasting both server resources to get the data and network resources to transport it. You may also be wasting memory to hold the recordset while you work with it.

And while the performance improvement may be tiny for one query, how many times is that query run? People who use this abysmally poor technique tend to use it everywhere, so fixing all of them can be a major improvement for not that much effort. And how hard is it to specify the fields? I don't know about every database, but in SQL Server I can drag and drop what I want from the object browser in seconds.

So using select * is trading less than a minute of development time for worse performance every single time the query is run, and for code that is fragile and subject to very bad problems as the schema changes. I see no reason to ever use select * in production code.
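For instance (hypothetical tables), SELECT * over an inner join drags the join key back twice, where an explicit list does not:

-- CustomerId comes back once from each table with SELECT *
SELECT *
FROM dbo.Customers c
INNER JOIN dbo.Orders o ON o.CustomerId = c.CustomerId;

-- listing fields returns only what the caller actually needs
SELECT c.CustomerId, c.Name, o.OrderId, o.Amount
FROM dbo.Customers c
INNER JOIN dbo.Orders o ON o.CustomerId = c.CustomerId;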
Does choosing the right data type matter for performance? And if so, why? I mean, is a tinyint faster to search than an int?
If so, what are the practical differences in performance?
Depending on the data types, yes, it does make a difference.
int vs. tinyint wouldn't make a noticeable difference in speed, but it would make a difference in data size. Assuming tinyint is 1 byte and int is 4, that's 3 bytes saved on every row. It adds up after a while.
Now, if it were int against varchar, then there would be a bit of a drop, as things like sorts would be much faster on integer values than on string values.
If it's a comparable type, and you're not very pressed for space, go with the one that's easier and more robust.
Theoretically, yes, a tinyint is faster than an int. But good database design and proper indexing have a far more substantial effect on performance, so I always use int for design simplicity.
I would venture that there are no practical performance differences in that case. Storage space is the more substantial factor, but even then, it's not much of a difference. The difference is perhaps 2 bytes? After 500,000 rows you've used almost an extra megabyte. Hopefully you aren't pinching megabytes if you are working with that much data.
Choosing the right data type can improve performance. In a lot of cases the practical difference might not be a lot, but a bad choice can definitely have an impact. Imagine using a 1000-character char field instead of a varchar field when you are only going to be storing a string of a few characters. It's a bit of an extreme example, but you would definitely be a lot better off using a varchar. You would probably never notice a difference in performance between an int and a tinyint. Your overall database design (normalized tables, good indexes, etc.) will have a far larger impact.
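To make that concrete (the sizes here are just illustrative): the fixed-length column always consumes its full declared width, while the varchar stores only what is actually in it.

-- char(1000) pads every value out to 1000 characters
CREATE TABLE codes_fixed    (code char(1000));
-- varchar(50) stores just the characters supplied (plus a small length overhead)
CREATE TABLE codes_variable (code varchar(50));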
Of course, choosing the right data types always helps with faster execution.
Take a look at this article; it should help you out:
http://www.peachpit.com/articles/article.aspx?p=30885&seqNum=7
The performance consideration all depends on the scale of your model and its usage. While storage space is almost a non-issue these days, you might still need to think about performance:
Database engines tend to store data in chunks called pages. SQL Server has 8K pages, Oracle 2K, and MySQL a 16K page size by default. Not that big for any of these systems. Whenever you perform an operation on a bit of data (a field in a row), its entire page is fetched from the db and put into memory. When your data is smaller (tinyint vs. int), you can fit more individual rows and data items into a page, so the likelihood of having to fetch more pages goes down and overall performance goes up. So yes, using the smallest workable representation of your data will definitely have an impact on performance, because it allows the db engine to be more efficient.
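As a rough worked example (the row sizes are invented): with roughly 8,060 usable bytes on a SQL Server page, a 200-byte row fits about 40 rows per page, while trimming it to 100 bytes fits about 80, so scanning the same million rows touches roughly half as many pages.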
One way it can affect performance is by not requiring you to convert data to the correct type in order to manipulate it. This comes up when someone uses varchar instead of a datetime datatype, for instance, and the values then have to be converted to do date math. It can also affect performance by giving a smaller record (this is why you shouldn't define everything at the max size), which affects how pages are stored and retrieved in the database.
Of course, using the correct type of data also helps data integrity; you can't store a date that doesn't exist in a datetime field, but you can in a varchar field. If you use float instead of int, then your values aren't restricted to integer values, etc. And speaking of float, it is generally bad to use if you intend to do math calculations, as you get rounding errors since it is not an exact type.
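For example (column and table names invented), date math against a varchar forces a conversion on every row that a datetime column avoids entirely:

-- order_date_text is varchar ('20240115' style), order_date is datetime
SELECT DATEADD(day, 30, CONVERT(datetime, order_date_text, 112)) FROM orders_text;
SELECT DATEADD(day, 30, order_date) FROM orders;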
I want to search a table to find all rows where one particular field is one of two values. I know exactly what the values would be, but I'm wondering which is the most efficient way to search for them:
For the sake of example, the two values are "xpoints" and "ypoints". I know for certain that there will be no other values in that field ending in "points", so the two queries I'm considering are:
WHERE `myField` IN ('xpoints', 'ypoints')
--- or...
WHERE `myField` LIKE '_points'
which would give the best results in this case?
As always with SQL queries, run it through the profiler to find out. However, my gut instinct says the IN search would be quicker. Especially in the example you gave: if the field is indexed, it would only have to do 2 lookups. If you did a LIKE search, it may have to do a scan, because you are looking for records that end with a certain value. It would also be more accurate, as LIKE '_points' could also return 'gpoints' or any other similar string.
Unless all of the data items in the column in question start with 'x' or 'y', I believe IN will always give you a better query. If the column is indexed, as #Kibbee points out, you will only have to perform 2 lookups to get both. Alternatively, if it is not indexed, a table scan using IN will only have to check the first letter most of the time, whereas with LIKE it will have to check two characters every time (assuming all items are at least 2 characters), since the first character is allowed to be anything.
Try it and see. Create a large amount of test data, and try it with and without an index on myField. While you are at it, see if there's a noticeable difference between
LIKE 'points' and LIKE 'xpoint'.
It depends on what the optimizer does with each query.
For small amounts of data, the difference will be negligible. Do whichever one makes more sense. For large amounts of data the amount of disk I/O matters much more than the amount of CPU time.
I'm betting that IN will get you better results than LIKE, if there is an index on myfield. I'm also betting that 'xpoint_' runs faster than '_points'. But there's nothing like trying it yourself.
MySQL can't use an index when using string comparisons such as LIKE '%foo' or '_foo', but can use an index for comparisons like 'foo%' and 'foo_'.
So in your case, IN will be much faster assuming that the field is indexed.
If you're working with a limited set of possible values, it's worth specifying the field as an ENUM - MySQL will then store it internally as an integer, making this sort of lookup much faster and saving disk space.
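A minimal sketch of that (assuming MySQL and that the table is called mytable; adjust names to your schema):

-- restrict the column to the two known values; stored internally as an integer
ALTER TABLE mytable
    MODIFY COLUMN myField ENUM('xpoints', 'ypoints') NOT NULL;

SELECT * FROM mytable WHERE myField IN ('xpoints', 'ypoints');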
It will be faster to do the IN version than the LIKE version, especially when your wildcard isn't at the end of the comparison. Even under ideal conditions, IN would still win out until your IN list gets close to the maximum allowed query size.