The two tables below can both hold the same data: a full year, including some arbitrary info about each month.
table1 (one row = one month)
------
id
month
year
info
table2 (one row = one year)
------
id
year
jan_info
feb_info
mar_info
apr_info
may_info
jun_info
jul_info
aug_info
sep_info
oct_info
nov_info
dec_info
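For concreteness, here are the two layouts as rough CREATE TABLE sketches; the column types are assumptions, since only the column names are given above:
CREATE TABLE table1 (
    id    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    month TINYINT NOT NULL,   -- 1 to 12
    year  SMALLINT NOT NULL,
    info  VARCHAR(255)
);

CREATE TABLE table2 (
    id       INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    year     SMALLINT NOT NULL,
    jan_info VARCHAR(255),
    feb_info VARCHAR(255),
    mar_info VARCHAR(255),
    apr_info VARCHAR(255),
    may_info VARCHAR(255),
    jun_info VARCHAR(255),
    jul_info VARCHAR(255),
    aug_info VARCHAR(255),
    sep_info VARCHAR(255),
    oct_info VARCHAR(255),
    nov_info VARCHAR(255),
    dec_info VARCHAR(255)
);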
Table A (table1)
Seems more intuitive because the month is numeric, but it's 10x more rows for a full year of data. Also, the rows are smaller (fewer columns).
Table B (table2)
10x fewer rows for a full year of data, but single rows are much larger, and it is possibly more difficult to add more arbitrary info for a month.
In a real-world test scenario I set up, there were 12,000 rows in table1 for 10 years of data, whereas table2 had 150. I realize fewer rows is better, generally speaking, but ALWAYS? I'm afraid that I'm overlooking some caveat that I'll only find later if I commit to one way. I haven't even considered disk usage or which query might be faster. What does MySQL prefer? Is there a "correct" way? Or is there a "better" way?
Thanks for your input!
Don't think about how to store it, think about how you use it. And also think about how it might change in the future. The storage structure should reflect use.
The first option is more normalized than the second, so I would tend to prefer it. It has the benefit of being easy to change, for example if every month suddenly needed a second piece of information stored about it. Usually this kind of structure is easier to populate, but not always. Think about where the data is coming from.
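For example, if each month suddenly needed a second piece of info, the first structure needs one new column while the second needs twelve (info2 is a made-up name for illustration):
ALTER TABLE table1 ADD COLUMN info2 VARCHAR(255);

-- versus twelve new columns in table2:
ALTER TABLE table2
    ADD COLUMN jan_info2 VARCHAR(255),
    ADD COLUMN feb_info2 VARCHAR(255);
    -- ...and so on for the remaining ten months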
If you're only using this data for reports and you don't need to aggregate data across months, use the second option.
It really depends on what the data is for and where it comes from. Generally, though, the first option is better.
12,000 rows for 10 years of data? I'd say that scales pretty well, since 12,000 rows is next to nothing for a decent DBMS.
How are you using the database? Are you sure you really need to worry about optimizations?
If you need to store data that is specific to a month, then you should absolutely store a row for each month. It's a much cleaner approach than one with a column for each month.
"In a real world test scenerio I set up, there were 12,000 rows in table1 for 10 years of data, where table2 had 150."
How? There would have to be 80 months in a year for that to be the case.
Since this is an optimisation problem, the standard optimisation answer applies: it depends.
What do you want to do with your data?
Table A is the normal form in which one would store this kind of data.
For special cases Table B might come in handy, but I'd need to think hard to find a good example.
So either go with A or give us some details about what you want to do with the data.
A note on disc space: total disc space is a non-issue, except for extremely huge tables. If anything, disc space per select matters, and that should be less for the Table A design in most cases.
A note on math: if you divide 12,000 by 12 and get 150 as a result, something is wrong.
How are you using the data? If you are often doing a report that splits the data out by month, the second is easier (and probably faster, but you need to test for yourself) to query. It is less normalized, but honestly, when was the last time we added a new month to the year?
In general I'd say one record per month as the more general solution.
One important issue is whether "info" is and must logically always be a single field. If there are really several pieces of data per month, or if it's at all likely that in the future there will be, then putting them all in one row per year gets to be a major pain.
Another question is what you will do with this data. You don't say what "info" is, so just for purposes of discussion let's suppose it's "sales for the month". Will you ever want to ask, "In what months did we have over $1,000,000 in sales?" With one record per month, this is an easy query:

select year, month from sales where month_sales > 1000000

Now try doing that with the year table:

select year, 'Jan' from year_sales where jan_sales > 1000000
union
select year, 'Feb' from year_sales where feb_sales > 1000000
union
select year, 'Mar' from year_sales where mar_sales > 1000000
union ...

and so on. Or maybe you'd prefer:

select year,
       case when jan_sales > 1000000 then 'Jan=yes' else 'Jan=no' end,
       case when feb_sales > 1000000 then 'Feb=yes' else 'Feb=no' end,
       ... and so on for the remaining months ...
from year_sales
where jan_sales > 1000000 or feb_sales > 1000000 or mar_sales > 1000000 or ...

Yuck.
Having many small records is not that much more of a resource burden than having fewer but bigger records. Yes, the total disk space requirement will surely be more because of per-record overhead, and index searches will be somewhat slower because the index will be larger. But the difference is likely to be minor, and frankly there are so many factors in database performance that this sort of thing is hard to predict.
But I have to admit that I just faced a very similar problem and went the other way: I needed a set of flags for each day of the week, saying "are you working on this day". I wrestled with whether to create a separate table with one record per day, but I ended up putting seven fields into a single record. My thinking is that there will never be additional data for each day without some radical change in the design, and I have no reason to ever want to look at just one day. The days are used for calculating a schedule and assigning due dates, so I can't imagine, in the context of this application, ever wanting to say "give me all the people who are working on Tuesday". But I can readily imagine the same data in a different application being used with precisely that question.
Related
I need to store a fairly large history of data, and I have been researching the best ways to store such an archive. It seems that a data warehouse approach is what I need, and it seems highly recommended to use a date dimension table rather than a date itself. Can anyone please explain to me why a separate table would be better? I don't have a need to summarize any of the data, just access it quickly and efficiently for any given day in the past. I'm sure I'm missing something, but I just can't see how storing the dates in a separate table is any better than just storing a date in my archive.
I have found these enlightening posts, but nothing that quite answers my question.
What should I have in mind when building OLAP solution from scratch?
Date Table/Dimension Querying and Indexes
What is the best way to store historical data in SQL Server 2005/2008?
How to create history fact table?
Well, one advantage is that as a dimension you can store many other attributes of the date in that other table - is it a holiday, is it a weekday, what fiscal quarter is it in, what is the UTC offset for a specific (or multiple) time zone(s), etc. etc. Some of those you could calculate at runtime, but in a lot of cases it's better (or only possible) to pre-calculate.
Another is that if you just store the DATE in the table, you only have one option for indicating a missing date (NULL) or you need to start making up meaningless token dates like 1900-01-01 to mean one thing (missing because you don't know) and 1899-12-31 to mean another (missing because the task is still running, the person is still alive, etc). If you use a dimension, you can have multiple rows that represent specific reasons why the DATE is unknown/missing, without any "magic" values.
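A rough sketch of such a dimension, with a couple of attribute columns plus dedicated rows for missing dates (table name, column names, and types are all assumed here):
CREATE TABLE dim_date (
    date_key       INT NOT NULL PRIMARY KEY,  -- e.g. 20120615; negative keys reserved for special rows
    full_date      DATE NULL,                 -- NULL only on the special rows
    is_weekday     BIT NOT NULL,
    is_holiday     BIT NOT NULL,
    fiscal_quarter TINYINT NULL
);

-- an ordinary calendar row
INSERT INTO dim_date VALUES (20120615, '2012-06-15', 1, 0, 2);
-- special rows instead of "magic" dates like 1900-01-01
INSERT INTO dim_date VALUES (-1, NULL, 0, 0, NULL);  -- date not yet known
INSERT INTO dim_date VALUES (-2, NULL, 0, 0, NULL);  -- not applicable (task still running, person still alive, etc.)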
Personally, I would prefer to just store a DATE, because it is smaller than an INT (!) and it keeps all kinds of date-related properties, the ability to perform date math etc. If the reason the date is missing is important, I could always add a column to the table to indicate that. But I am answering with someone else's data warehousing hat on.
Let's say you've got a thousand entries per day for the last year. If you have a date dimension, your query grabs the date in the date dimension and then uses the join to collect the one thousand entries you're interested in. If there's no date dimension, your query reads all 365 thousand rows to find the one thousand you want. Quicker, more efficient.
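In query terms, that looks something like this (table and column names assumed):
SELECT f.*
FROM fact_archive AS f
INNER JOIN dim_date AS d
    ON d.date_key = f.date_key
WHERE d.full_date = '2011-06-15';
-- with dim_date.full_date and fact_archive.date_key indexed,
-- only the matching rows are touched instead of the whole archive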
I am building software for academic institutions, so I just wanted to know the answers to a few questions:
As you know, some new data will be generated each year (for new admissions) and some will be updated. Should I store all the data in one single table with an academic year column (e.g. ac_year), or should I make separate tables for each year? There are also different tables to store information like classes, marks, fees, hostel, etc. about the students. So each kind of info, like Fee, would be stored in different tables like
Fee-2010
Fee-2011
Fee-2012...
Or in one single Fee table with year as a column.
One more point: after 1-2 years the database will get heavier, so would backing up the data for a single year still be possible with a single table (like Fee with year as a column)?
And please answer keeping in mind SQL Server 2005.
Thanks
As you phrase the question, the answer is clearly to store the data in one table, with the year (or date or other information) as a column. This is simply the right thing to do. You are dealing with the same entity over time.
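A minimal sketch of the single-table approach, with made-up column names and SQL Server 2005 types:
CREATE TABLE Fee (
    student_id INT NOT NULL,
    ac_year    SMALLINT NOT NULL,       -- academic year, e.g. 2012
    amount     DECIMAL(10, 2) NOT NULL,
    paid_on    DATETIME NULL,
    CONSTRAINT PK_Fee PRIMARY KEY (student_id, ac_year)
);

-- one year's worth of data is just a filter, not a different table
SELECT * FROM Fee WHERE ac_year = 2012;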
The one exception would be when the fields are changing significantly from one year to the next. I doubt that is the case for your table.
If your database is really getting big, then you can look into partitioning the table by time. Each partition would be one year's worth of data. This would speed up queries that only need to access one year's worth. It also helps with backing up and restoring data. This may be the solution you are ultimately looking for.
"Really getting big" means at least millions of rows in the table. Even with a couple million rows in the table, most queries will probably run fine on decent hardware with appropriate indexes.
It's not typical to store the data in multiple tables based on time constraints. I would prefer to store it all in one table. In the future, you may look to archiving old data, but it will still be significant time before performance will become an issue.
It is always a better option to add a new property to an entity than to create a new entity for every different property value. That way, maintenance and querying will be much easier for you.
As for query performance, you don't have to worry about the internal workings of the database up front. If a real performance issue appears, there are many solutions, such as creating an index on the year column in your situation.
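For instance, a single index on the year column might look like this (index name made up):
-- lets year-filtered queries seek on the index instead of scanning the whole table
CREATE INDEX IX_Fee_AcYear ON Fee (ac_year);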
As a first note, I only have read access to my server. Just FYI, as it seems to come up a lot...
Server: DB2 (6.1) for i (IBM)
I have a query I'm running on a table that has 19mil rows in it (I don't design them, I just query them). I've been limiting my return data to 10 rows (*) until I get this query sorted out so that return times are a bit more reasonable.
The basic design is that I need to get data about categories of products we sell on a week-by-week basis, using the columns WEEK_ID and CATEGORY. Here's example code (with some important bits ####ed out):
SELECT WEEK_ID, CATEGORY
FROM DWQ####.SLSCATW
INNER JOIN DW####.CATEGORY
ON DWQ####.SLSCATW.CATEGORY_NUMBER = DW####.CATEGORY.CATEGORY_NUMBER
WHERE WEEK_ID
BETWEEN 200952 AND 201230 --Format is year/week
GROUP BY WEEK_ID, CATEGORY
If I comment out that last line, I can get back 100 rows in 254 ms. If I put that line back in, my return takes longer than I've had the patience to wait for :-). (The longest I've waited is 10 minutes.)
This question has two parts. The first question is quite rudimentary: Is this normal? There are 50 categories (roughly) and 140 weeks (or so) that I'm trying to condense down to. I realize that's a lot of info to condense off of 19mil rows, but I was hoping limiting my query to 10 rows returned would minimize the amount of time?
And, if I'm not just a complete n00b, and this in fact should not take several minutes, what exactly is wrong with my SQL?
I've Googled WHERE statement optimization and can't seem to find anything. All links and explanation are more than welcome.
Apologies for such a newbie post... we all have to start somewhere, right?
(*) Using SQLExplorer, my IDE, an Eclipse implementation of Squirrel SQL.
I'm not sure how the server handles GROUP BY when there are no aggregate functions in the query. Based on your answers in the comments, I'd just try to add those:
SELECT
...,
SUM(SalesCost) as SalesCost,
SUM(SalesDollars) as SalesDollars
FROM
...
Leave the rest of the query as is.
If that doesn't solve the problem, you might have missing indexes. I would try to find out if there's an index where the WEEK_ID is the only column or where it is the first column. You could also check if you have another temporal column (i.e. TransactionDate or something similar) on the same table that already is indexed. If so, you could use that instead in the where clause.
Without correct indexes, the database server is forced to do a complete table scan, and that could explain your performance issues. 19 million rows does take some not insignificant amount of time to read from disk.
Also check that the data type of WEEK_ID is int or similar, just to avoid unnecessary casting in your query.
To avoid a table scan on the Category table, you need to make sure that Category_Number is indexed as well. (It probably already is, since I assume it is a key to that table.)
Indices on WEEK_ID, CATEGORY (and possibly CATEGORY_NUMBER) are the only way to make it really fast, so you need to convince the DBA to introduce those.
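For example, something along these lines, reusing the #### placeholders from the question (index names are made up, and the Category table's key may well be indexed already):
CREATE INDEX DWQ####.IX_SLSCATW_WEEK
    ON DWQ####.SLSCATW (WEEK_ID, CATEGORY_NUMBER);

CREATE INDEX DW####.IX_CATEGORY_NUMBER
    ON DW####.CATEGORY (CATEGORY_NUMBER);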
In the project where I work I saw this structure in the database, and I ask all of you: what the hell kind of modeling is this?
TableX
Columns: isMonday, BeginingHourMonday, EndHourMonday, isTuesday, BeginingHourTuesday, EndHourTuesday and so on...
Is this NoSQL? I did not ask the person who created it because I'm ashamed :$
Bye.
This is totally denormalized data, NoSQL kind of. I just wonder why the month is not included; it could increase the denormalization factor.
This is called a calendar table.
It is a very common and incredibly useful approach to dealing with and solving a lot of date and time related queries. It allows you to search, sort, group, or otherwise mine for data in interesting and clever ways.
@Brian Gideon is right. So is @iamgopal. And I am too, when I say "it depends on the nature of the data being modeled and stored in the database".
If it is a list of days with certain attributes/properties for each day, then yes, I would call it denormalized -- and 9 times out of 10 (or more) this will probably be the case. (I recall a database with 13 columns, one for each month in the year and one for total, and at the end of the year the user added 13 more columns for the next year. "Mr. Database", we called him.)
If this is a description of, say, work hours within a week, where each and every time the data is queried you always require the information for each day in the week, then the row would represent one "unit" of data (each column dependent upon the primary key of the table and all that), and it would be counter-productive to split the data into smaller pieces.
And, of course, it might be a combination of the two -- data that was initially normalized down to one row per day, and then intentionally denormalized for performance reasons. Perhaps 9 times out of 10 they do need a week's worth of information, and analysis showed massive performance gains by concatenating that data into one row?
As it is, without further information on use and rationale, I'm siding with @iamgopal and upvoting him.
Looks like a structure of a timesheet for a given week.
If normalized, it might look like
columns: day, startHour, endHour
When this is converted to a pivot table in Excel, you will have a timesheet kind of structure, which is good for input screens/views (as opposed to creating a view over the normalized structure).
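A sketch of that normalized form, with invented table and column names (a missing row would simply mean "not working that day"):
CREATE TABLE WorkingHours (
    person_id   INT NOT NULL,
    day_of_week TINYINT NOT NULL,   -- 1 = Monday ... 7 = Sunday
    start_hour  TIME NOT NULL,
    end_hour    TIME NOT NULL,
    PRIMARY KEY (person_id, day_of_week)
);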
Looking at that table, I don't see any good reason to do it that way, even for performance reasons.
Let's see: if I change isMonday, isTuesday, etc. to ID_Day, I still get the same speed and logic. And if I change BeginingHourMonday to StartHour and EndHourMonday to EndHour, I still get the same effect.
I still have the day of week and the start and end time, and that is basically the idea I get from the table structure. Maybe there is something I'm not seeing.
Regards
I need to write a query that will group a large number of records by periods of time from Year to Hour.
My initial approach has been to decide the periods procedurally in C#, iterate through each and run the SQL to get the data for that period, building up the dataset as I go.
SELECT Sum(someValues)
FROM table1
WHERE deliveryDate BETWEEN #fromDate AND #toDate
I've subsequently discovered I can group the records using Year(), Month(), Day(), datepart(week, date) and datepart(hh, date).
SELECT Sum(someValues)
FROM table1
GROUP BY Year(deliveryDate), Month(deliveryDate), Day(deliveryDate)
My concern is that using datepart in a group by will lead to worse performance than running the query multiple times for a set period of time due to not being able to use the index on the datetime field as efficiently; any thoughts as to whether this is true?
Thanks.
As with anything performance-related: measure.
Checking the query plan for the second approach will tell you about any obvious problems in advance (a full table scan when you know one is not needed), but there is no substitute for measuring. In SQL performance testing, that measurement should be done with appropriately sized test data.
This is a complex case: you are not simply comparing two different ways to do a single query, but comparing a single-query approach against an iterative one, so aspects of your environment may play a major role in the actual performance.
Specifically:
1. The 'distance' between your application and the database, as the latency of each call will be wasted time compared to the one big query approach.
2. Whether you are using prepared statements or not (causing additional parsing effort for the database engine on each query).
3. Whether the construction of the range queries itself is costly (heavily influenced by 2).
If you put a formula around the field part of a comparison, you get a table scan.
The index is on the field, not on datepart(field), so the datepart value must be calculated for every row; I think your hunch is right.
You could do something similar to this:
SELECT Sum(someValues)
FROM
(
SELECT *, Year(deliveryDate) as Y, Month(deliveryDate) as M, Day(deliveryDate) as D
FROM table1
WHERE deliveryDate BETWEEN #fromDate AND #toDate
) t
GROUP BY Y, M, D
If you can tolerate the performance hit of joining in yet one more table, I have a suggestion that seems odd but works really well.
Create a table that I'll call ALMANAC, with columns like weekday, month, and year. You can even add columns for company-specific features of a date, like whether the date is a company holiday or not. You might want to add starting and ending timestamps, as referenced below.
Although you might get by with one row per day, when I did this I found it convenient to go with one row per shift, where there are three shifts in a day. Even at that rate, a period of ten years was only a little over 10,000 rows.
When you write the SQL to populate this table, you can make use of all the date-oriented built-in functions to make the job easier. When you go to do queries, you can use the date column as a join condition, or you may need the two timestamps to provide a range for catching timestamps within that range. The rest of it is as easy as working with any other kind of data.
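A rough sketch of the idea, one row per shift, reusing the question's someValues and deliveryDate columns; the ALMANAC column names are assumed:
CREATE TABLE ALMANAC (
    shift_start        DATETIME NOT NULL PRIMARY KEY,
    shift_end          DATETIME NOT NULL,
    year               SMALLINT NOT NULL,
    month              TINYINT NOT NULL,
    day_of_month       TINYINT NOT NULL,
    weekday            TINYINT NOT NULL,
    is_company_holiday BIT NOT NULL
);

SELECT a.year, a.month, a.day_of_month, SUM(t.someValues) AS total
FROM table1 AS t
INNER JOIN ALMANAC AS a
    ON t.deliveryDate >= a.shift_start
   AND t.deliveryDate <  a.shift_end
GROUP BY a.year, a.month, a.day_of_month;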
I was looking for similar solution for reporting purposes, and came across this article called Group by Month (and other time periods). It shows various ways, good and bad, to group by the datetime field. Definitely worth looking at.
I think you should benchmark it to get reliable results, but IMHO my first thought would be that letting the DB take care of it (your 2nd approach) would be much faster than doing it in your client code.
With your first approach, you have multiple roundtrips to the DB, which I think will be far more expensive. :)
You may want to look at a dimensional approach (this is similar to what Walter Mitty has suggested), where each row has a foreign key to a date and/or time dimension. This allows very flexible summations through the join to this table, where these parts are precalculated. In these cases, the key is usually a natural integer key of the form YYYYMMDD and HHMMSS, which is relatively performant and also human readable.
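A sketch of grouping through such a dimension, with an integer key of that form (table and column names assumed):
SELECT d.year, d.month, SUM(f.someValues) AS total
FROM fact_table AS f
INNER JOIN date_dim AS d
    ON d.date_key = f.date_key   -- key of the form YYYYMMDD, e.g. 20120615
GROUP BY d.year, d.month;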
Another alternative might be indexed views, where there are separate expressions for each of the date parts.
Or calculated columns.
But performance has to be tested and execution plans examined...
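For the calculated-column route, a minimal SQL Server sketch using the question's table1 and deliveryDate (the added column and index names are made up):
ALTER TABLE table1 ADD deliveryYear  AS YEAR(deliveryDate)  PERSISTED;
ALTER TABLE table1 ADD deliveryMonth AS MONTH(deliveryDate) PERSISTED;

-- the persisted computed columns can then be indexed and grouped on directly
CREATE INDEX IX_table1_YearMonth ON table1 (deliveryYear, deliveryMonth);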