What is the best practice for aggregating data in SQL Server?

I have a source table that is way too huge and queries take way too long for it to be usable directly for on-demand reporting.
The charts we generate are time-based, usually the resolution is in months or days, so my first idea was to create a "Months" table and a "Days" table, and to filter / sum / count into these tables, essentially running all possible queries in advance.
The question is how. My first idea was to write a C# console app to load all data for a month, and then somehow filter it (DataSet? DataView?) and then aggregate it (load into List<>? LinqToSQL?) and then update the Month and Day tables.
Is there a better way to do this? I apologize in advance for the lack of code in this post. I am writing this for advice BEFORE I start coding.

My 2 cents would be to create a calendar table (with some surrogate key) and use SSIS to aggregate the data.
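Something like this, for illustration - the source table and column names here are assumptions, not from the question:

-- Calendar dimension with a surrogate key, one row per day
create table dbo.Calendar
(
    CalendarKey int identity(1,1) primary key,
    CalendarDate date not null unique,
    CalendarMonth int not null,  -- e.g. 201312
    CalendarYear int not null
)

-- Daily aggregate table, loaded by an SSIS package or a scheduled job
create table dbo.DailyAggregate
(
    CalendarKey int not null references dbo.Calendar (CalendarKey),
    RecordCount int not null,
    TotalAmount decimal(18,2) not null
)

-- The aggregation step itself can be a single INSERT ... SELECT
insert into dbo.DailyAggregate (CalendarKey, RecordCount, TotalAmount)
select c.CalendarKey, count(*), sum(s.Amount)
from dbo.SourceTable s
join dbo.Calendar c on c.CalendarDate = cast(s.CreatedAt as date)
group by c.CalendarKey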

Related

How to normalise a database for dates

I'm creating a database which involves many records which have many dates. Many records within these tables can have the same date. These will range from 3 years prior to about 3 years in the future. Would an efficient system use the date datatype built into SQL, or make individual tables for the Date, Month, and Year? Sorry if this seems like an amateur question; I've only learnt SQL recently for this project.
Thanks
Yes, as you already guessed, the best solution here is to use the date datatype built into SQL.
From the way you have asked the question, it sounds like you want to record aggregated data for each day/month/year. As @edward said, you will definitely want to use the built-in data type for the raw records - your "fact" table - and then you might also build up aggregated data in separate tables for the year or month.
Depending on the volume of data these might be stored physically, or just done through views on the fact table.
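For example, a monthly rollup can start out as a plain view on the fact table (all names here are illustrative) and be turned into a physical table later if it gets too slow:

create view MonthlyTotals
as
(
select year(EventDate) as EventYear,
       month(EventDate) as EventMonth,
       count(*) as RecordCount
from FactRecords
group by year(EventDate), month(EventDate)
)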
In general, you never want to remove information as you never know how it might be used in the future, which is why storing with the raw date is the correct option.

Solution: get report faster from database with big data

I use an Oracle database. I have many tables with very large amounts of data (300-500 million records).
My queries join many tables together. I have set indexes on the tables, but the reports are still very slow.
Please help me find a solution for working with this much data.
Thanks.
Do you really need to have all the data at once?
Try creating a table that stores just the information you need for the report, and run a query once a day (or every few hours) to update the table. You can also use SQL Server Integration Services (SSIS), although I have not tried SSIS with Oracle myself.
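Since you're on Oracle, a materialized view can handle the periodic refresh for you - a rough sketch, with made-up table and column names:

create materialized view report_summary
build immediate
refresh complete
start with sysdate next sysdate + 1  -- rebuild once a day
as
select trunc(order_date) as report_day,
       count(*)          as order_count,
       sum(amount)       as total_amount
from orders
group by trunc(order_date)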
I agree with the other users, you really need to give more info on the problem.

BigQuery best practice for segmenting tables by dates

I am new to columnar DB concepts and BigQuery in particular. I noticed that for the sake of performance and cost efficiency it is recommended to split data across tables not only logically - but also by time.
For example - while I need a table to store my logs (1 logical table that is called "logs"), it is actually considered a good practice to have a separate table for different periods, like "logs_2012", "logs_2013", etc... or even "logs_2013_01", "logs_2013_02", etc...
My questions:
1) Is it actually the best practice?
2) Where would be best to draw the line - an annual table? A monthly table? A daily table? You get the point...
3) In terms of retrieving the data via queries - what is the best approach? Should I construct my queries dynamically using the UNION option? If I had all my logs in one table - I would naturally use the where clause to get data for the desired time range, but having data distributed over multiple tables makes it weird. I come from the world of relational DB (if it wasn't obvious so far) and I'm trying to make the leap as smoothly as possible...
4) Using the distributed method (different tables for different periods) still raises the following question: before querying the data itself, I want to be able to determine, for a specific log type, what range is available for querying. For example, for a specific machine I would like to first present to my users the scope of their available logs, and let them choose the specific period within that scope to get insights for. The question is: how do I construct such a query when my data is distributed over a number of tables (one per period) and I don't know which tables exist? I might try to access the table "logs_2012_12" when this table doesn't actually exist, or even worse, I wouldn't know which tables are relevant and available for my query.
Hope my questions make sense...
Amit
Table naming
For daily tables, the suggested table name pattern is the specific name of your table + the date like in '20131225'. For example, "logs20131225" or "logs_20131225".
Ideal aggregation: Day, month, year?
The answer to this question will depend on your data and your queries.
Will you usually query one or two days of data? Then have daily tables, and your costs will be much lower, as you query only the data you need.
Will you usually query all your data? Then have all the data in one table. Having many tables in one query can get slower as the number of tables to query grows.
If in doubt, do both! You could have daily, monthly, yearly tables. For a small storage cost, you could save a lot when doing queries that target only the intended data.
Unions
Feel free to do unions.
Keep in mind that there is a limit of 1,000 tables per query. This means that with daily tables you won't be able to query 3 years of data in a single query (3*365 > 1000).
Remember that unions in BigQuery don't use the UNION keyword, but the "," that other databases use for joins. Joins in BigQuery can be done with the explicit SQL keyword JOIN (or JOIN EACH for very big joins).
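For example, in BigQuery's legacy SQL, querying three daily tables at once looks like this (dataset and field names are assumed):

select log_time, message
from [mydataset.logs20131225],
     [mydataset.logs20131226],
     [mydataset.logs20131227]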
Table discovery
API: tables.list will list all tables in a dataset.
SQL: To query the list of tables from within SQL... stay tuned.
New 2016 answer: Partitions
Now you can have everything in one table, and BigQuery will analyze only the data contained in the desired dates - if you set up the new partitioned tables:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
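With a partitioned table, the date range moves into the WHERE clause via the _PARTITIONTIME pseudo column, so everything stays in one table (dataset and table names assumed):

select log_time, message
from mydataset.logs
where _PARTITIONTIME >= timestamp('2016-01-01')
  and _PARTITIONTIME <  timestamp('2016-02-01')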

How do I manage large data set spanning multiple tables? UNIONs vs. Big Tables?

I have an aggregate data set that spans multiple years. The data for each respective year is stored in a separate table named Data plus the year (e.g. Data2001). The data is currently sitting in MS Access tables, and I will be migrating it to SQL Server.
I would prefer that data for each year is kept in separate tables, to be merged and queried at runtime. I do not want to do this at the expense of efficiency, however, as each year is approx. 1.5M records of 40ish fields.
I am trying to avoid having to do an excessive number of UNIONS in the query. I would also like to avoid having to edit the query as each new year is added, leading to an ever-expanding number of UNIONs.
Is there an easy way to do these UNIONs at runtime without an extensive SQL query and high system utility? Or, if all the data should be managed in one large table, is there a quick and easy way to append all the tables together in a single query?
If you really want to store them in separate tables, then I would create a view that does that unioning for you.
create view AllData
as
(
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
)
But to be honest, if you use this, why not put all the data into one table? Then, if you wanted, you could create the views the other way around:
create view Data2001
as
(
select * from AllData
where CreateDate >= '1/1/2001'
and CreateDate < '1/1/2002'
)
A single table is likely the best choice for this type of query. However, you have to balance that against the other work the database is doing.
One choice you did not mention is creating a view that contains the unions and then querying the view. That way you only have to add the union statement to the view each year, and all queries using the view will stay correct. Personally, if I did that, I would write a creation script that creates the new year's table and then adjusts the view to add the union for that table (see the sketch below). Once it was tested and I knew it would run, I would schedule it as a job to run on the last day of the year.
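A sketch of what that yearly script might do, following the Data2001/Data2002 naming used above (the year is hard-coded here just for illustration):

-- create next year's table with the same structure (indexes and
-- constraints would still need to be added separately)
select top 0 *
into Data2004
from Data2003
go

-- then rebuild the view to include the new table
alter view AllData
as
(
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
union all
select * from Data2004
)
go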
One way to do this is by using horizontal partitioning.
You basically create a partition function that tells the DBMS to split the data into a separate partition for each period, each bounded so that it only ever holds data for a specific year.
At query execution time, the optimiser can decide whether it is possible to completely ignore one or more partitions to speed up execution time.
The setup overhead of such a scheme is non-trivial, and it only really makes sense if you have a lot of data. Although 1.5 million rows per year might seem like a lot, depending on your query plans it shouldn't be a big deal for a decently specced SQL Server. Refer to the SQL Server partitioning documentation for details; a rough sketch follows.
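In SQL Server, the pieces look roughly like this (names and boundary dates are illustrative):

-- map date ranges to partitions, one per year
create partition function YearPF (date)
as range right for values ('2002-01-01', '2003-01-01', '2004-01-01')

-- place all partitions on the same filegroup for simplicity
create partition scheme YearPS
as partition YearPF all to ([PRIMARY])

-- the table is then created on the scheme, partitioned by CreateDate
create table AllData
(
    Id int not null,
    CreateDate date not null
    -- ...the other 40ish columns
) on YearPS (CreateDate)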
I can't add comments due to low rep, but I definitely agree with one table; partitioning is helpful for large data sets and is supported in SQL Server, which is where the data is being migrated.
If the data is heavily used and frequently updated then monthly partitioning might be useful, but if not, given the size, partitioning probably isn't going to be very helpful.

Single Row Table in SQL : Is this a good implementation?

I am new to SQL. I read a bit about how creating a single row table is not really a good practice, but I can't help but find it useful in my case. I am making a web app which balances the workload of employees in the organization. So apart from keeping track of how much work is assigned to every employee and how much work does each task (2 main task types) require, I also need to track the overall workload.
So I plan to make a single-row table for total workload, with three columns: one for each of the two task types, holding that type's summed workload, and a third for the sum of those two totals. I plan to use triggers to update the table whenever a new task is added or an existing task's requirements change, so the totals stay current.
Please let me know if I am heading in the right direction. Thanks!
It will work, but it is not extensible: if tomorrow you need to add a third main task type, you will need to alter the table and add another column (not ideal). So maybe you can just have a table with two columns, task type and load, with one row per task type, and always calculate the sum with a SQL query - a sketch follows.
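A minimal sketch of that alternative (table and column names are mine):

create table WorkloadByTaskType
(
    TaskType varchar(50) primary key,
    TaskLoad int not null
)

-- the overall total is computed on demand instead of stored
select sum(TaskLoad) as TotalWorkload
from WorkloadByTaskType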