Hello, I have a database record that needs to store a 1 or a 0 for each day of the week. So which would be better: bit-shifting each flag into a single integer column named days, or making them all separate boolean values, so as to have sunday, monday, tuesday... columns?
Note that these columns are only used for a computation in our software (not on the DB itself), so the only things that will be done with these values are selecting, updating, and inserting.
I'd go for separate columns, for the following reasons:
That would seem like the better-designed database model (because it's clearer, more intuitive, and easier to understand), and most people probably won't need any further explanation;
"Which bit was for Sunday again...? Did I assign the most-significant bit to it, or the least-significant one?" -- You won't run into such problems with separate, named columns... therefore less potential for bugs.
If you later want to enhance your database model so that you could store NULL for single days, you will almost definitely want a separate column per day. Otherwise, you'd need at least two bits per day (since you now have 3 possible states, and 1 bit no longer suffices for that) and an appropriate, home-baked encoding scheme.
I'm quite sure today's RDBMSs are smart enough to pack several boolean columns together.
Separate columns.
The DB engine will pack them anyway and work with the bits transparently for you. I suspect it's better at this than you or me rolling our own...
It depends on your needs. SQL Server suffers slower performance when querying with bit-shifting. If you'll be doing a lot of filtering for just one day per query, then I'd recommend separate bit fields for each day.
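For contrast, here is roughly what the bit-packed alternative looks like in application code (a Python sketch; the helper names and the bit assignment are made up for illustration). Note that the "which bit was Sunday?" convention from above has to live somewhere, and every reader of the column must know it:

```python
# Sketch of packing seven day-of-week flags into one integer, as the
# single "days" column approach would require in application code.
# The bit assignment (Sunday = bit 0) is an arbitrary convention.

DAYS = ["sunday", "monday", "tuesday", "wednesday",
        "thursday", "friday", "saturday"]

def pack_days(flags):
    """flags: dict like {'monday': True, ...} -> int for the days column."""
    value = 0
    for i, day in enumerate(DAYS):
        if flags.get(day):
            value |= 1 << i
    return value

def unpack_days(value):
    """int from the days column -> dict of one boolean per day."""
    return {day: bool(value >> i & 1) for i, day in enumerate(DAYS)}

# Sunday is bit 0, Monday bit 1, Tuesday bit 2 -> 0b111 = 7
working = pack_days({"sunday": True, "monday": True, "tuesday": True})
```

With separate named columns, none of this encoding and decoding code exists, which is the "less potential for bugs" point made above.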
Related
I have a need to store a fairly large history of data, and I have been researching the best ways to store such an archive. It seems that a data warehouse approach is what I need. It also seems highly recommended to use a date dimension table rather than a date itself. Can anyone please explain to me why a separate table would be better? I don't have a need to summarize any of the data, just to access it quickly and efficiently for any given day in the past. I'm sure I'm missing something, but I just can't see how storing the dates in a separate table is any better than just storing a date in my archive.
I have found these enlightening posts, but nothing that quite answers my question.
What should I have in mind when building OLAP solution from scratch?
Date Table/Dimension Querying and Indexes
What is the best way to store historical data in SQL Server 2005/2008?
How to create history fact table?
Well, one advantage is that as a dimension you can store many other attributes of the date in that other table - is it a holiday, is it a weekday, what fiscal quarter is it in, what is the UTC offset for a specific (or multiple) time zone(s), etc. etc. Some of those you could calculate at runtime, but in a lot of cases it's better (or only possible) to pre-calculate.
Another is that if you just store the DATE in the table, you only have one option for indicating a missing date (NULL) or you need to start making up meaningless token dates like 1900-01-01 to mean one thing (missing because you don't know) and 1899-12-31 to mean another (missing because the task is still running, the person is still alive, etc). If you use a dimension, you can have multiple rows that represent specific reasons why the DATE is unknown/missing, without any "magic" values.
Personally, I would prefer to just store a DATE, because it is smaller than an INT (!) and it keeps all kinds of date-related properties, the ability to perform date math etc. If the reason the date is missing is important, I could always add a column to the table to indicate that. But I am answering with someone else's data warehousing hat on.
Let's say you've got a thousand entries per day for the last year. If you have a date dimension, your query grabs the date in the date dimension and then uses the join to collect the one thousand entries you're interested in. If there's no date dimension, your query reads all 365 thousand rows to find the one thousand you want. Quicker, more efficient.
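As a toy sketch of the pre-calculated-attributes idea, here is a date dimension joined to a fact table, using Python's sqlite3 standard library in place of a real warehouse (the table and column names, dim_date, fact_events, is_weekend, are made up for illustration):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A tiny date dimension: one row per calendar day, with attributes
# (is_weekend here) pre-calculated once instead of at query time.
cur.execute("CREATE TABLE dim_date (date_key TEXT PRIMARY KEY, is_weekend INTEGER)")
start = date(2023, 1, 1)
for i in range(31):
    d = start + timedelta(days=i)
    cur.execute("INSERT INTO dim_date VALUES (?, ?)",
                (d.isoformat(), 1 if d.weekday() >= 5 else 0))

# A fact table that references the dimension by date_key.
cur.execute("CREATE TABLE fact_events (date_key TEXT, amount INTEGER)")
cur.executemany("INSERT INTO fact_events VALUES (?, ?)",
                [("2023-01-01", 10), ("2023-01-02", 20), ("2023-01-07", 30)])

# "Total amount on weekends" needs no date math at query time,
# just a join and a filter on the pre-computed flag:
cur.execute("""SELECT SUM(f.amount)
               FROM fact_events f
               JOIN dim_date d ON d.date_key = f.date_key
               WHERE d.is_weekend = 1""")
total = cur.fetchone()[0]
```

The same pattern extends to holidays, fiscal quarters, and the "reason the date is missing" rows described above.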
I'm working on a database, and can see that the table was set up with multiple columns (day,month,year) as opposed to one date column.
I'm thinking I should convert that to one, but wanted to check if there's much point to it.
I'm rewriting the site, so I'm updating the code that deals with it anyway, but I'm curious if there is any advantage to having it that way?
The only thing it gets used for is to compare data, where all columns get compared, and I think that an integer comparison might be faster than a date comparison.
Consolidate them to a single column - an index on a single date will be more compact (and therefore more efficient) than the compound index on 3 ints. You'll also benefit from type safety and date-related functions provided by the DBMS.
Even if you want to query on month of year or day of month (which doesn't seem to be the case, judging by your description), there is no need to keep them separate - simply create the appropriate computed columns and index them.
The date column makes sense for temporal data because it is fit for purpose.
However, if you have a specific use-case where you are more often comparing month-to-month data instead of using the full date, then there is a little bit of advantage - as you mentioned - int columns are much leaner to store into index pages and faster to match.
The downsides are that with 3 separate int columns, validation of dates is pretty much a front-end affair without resorting to additional coding on the SQL Server side.
Normally, a single date field is ideal, as it allows for more efficient comparison, validity-checks at a low level, and database-side date-math functions.
The only significant advantage of separating the components is when a day or month first search (comparison) is frequently needed. Maybe an "other events that happened on this day" sort of thing. Or a monthly budgeting application or something.
(Even then, a proper date field could probably be made to work efficiently with proper indexing.)
Yes, I would suggest you replace the 3 columns with a single column that contains the date in Julian format, which is a floating point number. The part before the dot gives the day; the part after the dot gives the time within the day. Calculations will be easy, and you can also easily convert Julian back into month/day/year etc. I believe that MS Excel stores dates internally as a floating point number, so you will be in good company.
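A minimal sketch of that round trip in Python, using the proleptic-Gregorian ordinal from the datetime module rather than a true astronomical Julian Day (an assumption on my part; the idea is the same either way: whole part = day, fractional part = time within the day):

```python
from datetime import datetime, timedelta

def to_serial(year, month, day, hour=0, minute=0):
    """(year, month, day[, time]) -> serial number: whole part is the
    day (days since 0001-01-01), fractional part is the time of day."""
    dt = datetime(year, month, day, hour, minute)
    midnight = datetime(year, month, day)
    return dt.toordinal() + (dt - midnight) / timedelta(days=1)

def from_serial(serial):
    """Serial number -> (year, month, day), discarding the time part."""
    whole = int(serial)
    dt = datetime.fromordinal(whole) + timedelta(days=serial - whole)
    return dt.year, dt.month, dt.day

s = to_serial(2012, 6, 15, 12, 0)  # noon -> fractional part is 0.5
```

Note that if the DBMS has a native DATE or DATETIME type, the answers above make a good case for preferring it over a hand-rolled serial number.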
I am building software for academic institutions, so I just wanted to know the answers to a few questions:
As you know, some new data will be generated each year (for new admissions) and some will be updated. So should I store all the data in one single table with an academic-year column (like ac_year), or should I make separate tables for each year? There are also different tables to store information like classes, marks, fees, hostel, etc. about the students. So would each kind of info, like Fee, be stored in different tables like
Fee-2010
Fee-2011
Fee-2012...
Or in one single Fee table with year as a column?
One more point: after 1-2 years the database will get heavier, so would backing up the data for a single year still be possible with a single table (like Fee with year as a column)?
Please answer keeping SQL Server 2005 in mind.
Thanks
As you phrase the question, the answer is clearly to store the data in one table, with the year (or date or other information) as a column. This is simply the right thing to do. You are dealing with the same entity over time.
The one exception would be when the fields are changing significantly from one year to the next. I doubt that is the case for your table.
If your database is really getting big, then you can look into partitioning the table by time. Each partition would be one year's worth of data. This would speed up queries that only need to access one year's worth. It also helps with backing up and restoring data. This may be the solution you are ultimately looking for.
"Really getting big" means at least millions of rows in the table. Even with a couple million rows in the table, most queries will probably run fine on decent hardware with appropriate indexes.
It's not typical to store the data in multiple tables based on time constraints. I would prefer to store it all in one table. In the future, you may look into archiving old data, but there will still be a significant amount of time before performance becomes an issue.
It is always a better option to add a new property to an entity than to create a new entity for every different property value. This way, maintenance and querying will be much easier for you.
As for query performance, you don't have to worry about the internal affairs of the data and the database. If a real performance issue arises, there are many solutions, such as creating an index on the year column in your situation.
The two tables below can both hold the same data - a full year, including some arbitrary info about each month
table1 (one row = one month)
------
id
month
year
info
table2 (one row = one year)
------
id
year
jan_info
feb_info
mar_info
apr_info
may_info
jun_info
jul_info
aug_info
sep_info
oct_info
nov_info
dec_info
Table A
Seems more intuitive because the month is numeric, but it's 10x more rows for a full year of data. On the other hand, the rows are smaller (fewer columns).
Table B
10x fewer rows for a full year of data, but single rows are much larger, and it's possibly more difficult to add more arbitrary info for a month.
In a real-world test scenario I set up, there were 12,000 rows in table1 for 10 years of data, whereas table2 had 150. I realize less is better, generally speaking, but ALWAYS? I'm afraid that I'm overlooking some caveat that I'll find later if I commit to one way. I haven't even considered disk usage or which query might be faster. What does MySQL prefer? Is there a "correct" way? Or is there a "better" way?
Thanks for your input!
Don't think about how to store it, think about how you use it. And also think about how it might change in the future. The storage structure should reflect use.
The first option is more normalized than the second, so I would tend to prefer it. It has the benefit of being easy to change, for example if every month suddenly needed a second piece of information stored about it. Usually this kind of structure is easier to populate, but not always. Think about where the data is coming from.
If you're only using this data for reports and you don't need to aggregate data across months, use the second option.
It really depends on what the data is for and where it comes from. Generally, though, the first option is better.
12,000 rows for 10 years of data? I'd say that scales pretty well, since 12,000 rows is next to nothing for a decent DBMS.
How are you using the database? Are you sure you really need to worry about optimizations?
If you need to store data that is specific to a month then you should absolutely store a row for each month. It's a lot cleaner approach compared to the one with a column for each month.
"In a real-world test scenario I set up, there were 12,000 rows in table1 for 10 years of data, whereas table2 had 150."
How? There would have to be 80 months in a year for that to be the case.
Since this is an optimising problem the optimising answer applies: It depends.
What do you want to do with your data?
Table A is the normal form in which one would store this kind of data.
For special cases Table B might come in handy, but I'd need to think hard to find a good example.
So either go with A or give us some details about what you want to do with the data.
A note on disc space: total disc space is a non-issue, except for extremely huge tables. If disc space matters at all, it's disc space per select, and that should be less for the Table A design in most cases.
A note on math: if you divide 12,000 by 12 and get 150 as a result, something is wrong.
How are you using the data? If you are often doing a report that splits the data out by month, the second is easier (and probably faster, but you need to test for yourself) to query. It is less normalized, but honestly, when was the last time we added a new month to the year?
In general I'd say one record per month as the more general solution.
One important issue is whether "info" is and must logically always be a single field. If there are really several pieces of data per month, or if it's at all likely that in the future there will be, than putting them all in one table gets to be a major pain.
Another question is what you will do with this data. You don't say what "info" is, so just for purposes of discussion let's suppose it's "sales for the month". Will you ever want to say, "In what months did we have over $1,000,000 in sales?" ? With one record per month, this is an easy query: "select year, month from sales where month_sales>1000000". Now try doing that with the year table: "select year, 'Jan' from year_sales where jan_sales>1000000 union select year, 'Feb' from year_sales where feb_sales>1000000 union select year, 'Mar' from year_sales where mar_sales>1000000 union ..." etc. Or maybe you'd prefer "select year, case when jan_sales>1000000 then 'Jan=yes' else 'Jan=no' end, case when feb_sales>1000000 then 'Feb=yes' else 'Feb=no' end, ... for the remaining months ... from year_sales where jan_sales>1000000 or feb_sales>1000000 or mar_sales>1000000 ..." Yuck.
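For what it's worth, the easy one-record-per-month version of that query can be sketched with Python's sqlite3 (the sales figures are made up, continuing the hypothetical "sales for the month" example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per (year, month), the table1 shape from the question.
cur.execute("CREATE TABLE sales (year INTEGER, month INTEGER, month_sales INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(2012, 1, 900000), (2012, 2, 1200000),
                 (2012, 3, 1500000), (2013, 1, 800000)])

# "In what months did we have over $1,000,000 in sales?" stays one line:
cur.execute("""SELECT year, month FROM sales
               WHERE month_sales > 1000000
               ORDER BY year, month""")
rows = cur.fetchall()
# With the one-row-per-year shape, the same question needs the
# 12-branch UNION or CASE expression described above.
```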
Having many small records is not that much more of a resource burden than having fewer but bigger records. Yes, the total disk space requirement will surely be more because of per-record overhead, and index searches will be somewhat slower because the index will be larger. But the difference is likely to be minor, and frankly there are so many factors in database performance that this sort of thing is hard to predict.
But I have to admit that I just faced a very similar problem and went the other way: I needed a set of flags for each day of the week, saying "are you working on this day". I wrestled with whether to create a separate table with one record per day, but I ended up putting seven fields into a single record. My thinking is that there will never be additional data for each day without some radical change in the design, and I have no reason to ever want to look at just one day. The days are used for calculating a schedule and assigning due dates, so I can't imagine, in the context of this application, ever wanting to say "give me all the people who are working on Tuesday". But I can readily imagine the same data in a different application being used with precisely that question.
I have a table with some fields whose value will be 1 or 0. This table will become extremely large over time. Is it good to use the bit datatype, or is a different type better for performance? Of course, all fields should be indexed.
I can't give you any stats on performance, however, you should always use the type that is best representative of your data. If all you want is 1-0 then absolutely you should use the bit field.
The more information you can give your database, the more likely it is to get its "guesses" right.
Officially bit will be fastest, especially if you don't allow nulls. In practice it may not matter, even at large usages. But if the value will only be 0 or 1, why not use a bit? Sounds like the best way to ensure that the value won't get filled with invalid stuff, like 2 or -1.
As I understand it, you still need a byte to store a bit column (but you can store 8 bit columns in a single byte). So having a large number (how many?) of these bit columns could save you a bit on storage. As Yishai said it probably won't make much of a difference in performance (though a bit will translate to a boolean in application code more nicely).
If you can state with 100% confidence that the two options for this column will NEVER change then by all means use the bit. But if you can see a third value popping up in the future it could make life a little easier when that day comes to use a tinyint.
Just a thought, but I'm not sure how much good an index will do you on this column either, unless you see the vast majority of rows going to one side or the other. With a roughly 50/50 distribution, you might actually take more of a hit keeping the index up to date than the gain you'd see in querying the table.
It depends.
If you would like to maximize the speed of selects, use int (tinyint to save space), because bit in a where clause is slower than int (not drastically, but every millisecond counts). Also make the column not null, which also speeds things up. Below is a link to an actual performance test, which I would recommend running against your own database, and also extending by using not nulls, indexes, and multiple columns at once. At home I even tried comparing multiple bit columns against multiple tinyint columns, and the tinyint columns were faster (select count(*) where A=0 and B=0 and C=0). I thought that SQL Server (2014) would optimize this by doing only one comparison using a bitmask, so it should be three times faster, but that wasn't the case. If you use indexes, you would need more than 5,000,000 rows (as used in the test) to notice any difference (which I didn't have the patience to do, since filling a table with multiple millions of rows would take ages on my machine).
https://www.mssqltips.com/sqlservertip/4137/sql-server-performance-test-for-bit-data-type-in-a-where-clause/
If you would like to save space, use bit, since 8 of them can occupy one byte, whereas 8 tinyints will occupy 8 bytes. That is around 7 megabytes saved per million rows.
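A quick back-of-the-envelope check of that figure (plain arithmetic; it ignores per-row overhead and page structure, which a real table adds on top):

```python
# 8 bit columns share one byte per row; 8 tinyint columns take
# 8 bytes per row. Savings over a million rows:
rows = 1_000_000
bytes_bit = 1 * rows       # 8 flags packed into a single byte per row
bytes_tinyint = 8 * rows   # one byte per flag per row
saved_mb = (bytes_tinyint - bytes_bit) / (1024 * 1024)  # roughly 6.7 MB
```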
The differences between those two cases are basically negligible, and since using bit has the upside of signalling that the column represents merely a flag, I would recommend using bit.