How do I manage large data set spanning multiple tables? UNIONs vs. Big Tables? - sql

I have an aggregate data set that spans multiple years. The data for each respective year is stored in a separate table named Data. The data is currently sitting in MS Access tables, and I will be migrating it to SQL Server.
I would prefer that data for each year is kept in separate tables, to be merged and queried at runtime. I do not want to do this at the expense of efficiency, however, as each year is approx. 1.5M records of 40ish fields.
I am trying to avoid having to do an excessive number of UNIONs in the query. I would also like to avoid having to edit the query as each new year is added, leading to an ever-expanding number of UNIONs.
Is there an easy way to do these UNIONs at runtime without an extensive SQL query and high system utility? Or, if all the data should be managed in one large table, is there a quick and easy way to append all the tables together in a single query?

If you really want to store them in separate tables, then I would create a view that does that unioning for you.
create view AllData
as
(
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
)
But to be honest, if you use this, why not put all the data into one table? Then, if you wanted, you could create the views the other way.
create view Data2001
as
(
select * from AllData
where CreateDate >= '20010101'
and CreateDate < '20020101'
)

A single table is likely the best choice for this type of query. However, you have to balance that against the other work the db is doing.
One choice you did not mention is creating a view that contains the unions and then querying the view. That way you only have to add a union statement to the view each year, and all queries using the view will stay correct. Personally, if I did that, I would write a creation script that creates the new year's table and then adjusts the view to add the union for that table. Once it was tested and I knew it would run, I would schedule it as a job to run on the last day of the year.
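That year-end job could run a script along these lines. This is a sketch using the Data2001/AllData naming from the other answer; the new table name and the structure-copy trick are assumptions to adjust for your schema:

```sql
-- Hypothetical year-end script: create the new year's table,
-- then redefine the view to include it.
SELECT *
INTO Data2004            -- new year table
FROM Data2003
WHERE 1 = 0;             -- false predicate: copies structure only, no rows
GO

ALTER VIEW AllData
AS
SELECT * FROM Data2001
UNION ALL
SELECT * FROM Data2002
UNION ALL
SELECT * FROM Data2003
UNION ALL
SELECT * FROM Data2004;
GO
```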

One way to do this is by using horizontal partitioning.
You basically create a partitioning function that informs the DBMS to create separate tables for each period, each with a constraint informing the DBMS that there will only be data for a specific year in each.
At query execution time, the optimiser can decide whether it is possible to completely ignore one or more partitions to speed up execution time.
The setup overhead of such a scheme is non-trivial, and it only really makes sense if you have a lot of data. Although 1.5 million rows per year might seem like a lot, depending on your query plans it shouldn't be any big deal for a decently specced SQL Server. Refer to the SQL Server table partitioning documentation for details.
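A minimal sketch of such a setup, assuming a merged table called AllData with a CreateDate column; the function, scheme, and column names are illustrative, not from the question:

```sql
-- Partition function: each year's rows land in their own partition.
-- RANGE RIGHT means each boundary date belongs to the partition on its right.
CREATE PARTITION FUNCTION pf_DataYear (datetime)
AS RANGE RIGHT FOR VALUES ('2002-01-01', '2003-01-01', '2004-01-01');

-- Map all partitions to a filegroup (here simply PRIMARY).
CREATE PARTITION SCHEME ps_DataYear
AS PARTITION pf_DataYear ALL TO ([PRIMARY]);

-- The table is created on the partition scheme; the partitioning column
-- must be part of any unique/primary key.
CREATE TABLE AllData
(
    Id         int      NOT NULL,
    CreateDate datetime NOT NULL,
    -- ... the remaining ~40 columns ...
    CONSTRAINT PK_AllData PRIMARY KEY (Id, CreateDate)
) ON ps_DataYear (CreateDate);
```

Queries filtered on CreateDate can then be pruned to the relevant partitions by the optimiser, with no UNIONs in sight.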

I can't add comments due to low rep, but I definitely agree with one table; partitioning is helpful for large data sets and is supported in SQL Server, where the data will be migrated.
If the data is heavily used and frequently updated then monthly partitioning might be useful, but if not, given the size, partitioning probably isn't going to be very helpful.

Related

SQL - multiple tables vs one big table

I want to move multiple SQLite files to PostgreSQL.
Data contained in these files are monthly time-series (one month in a single *.sqlite file). Each has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
Think I would definitely go for one table - just make sure you use sensible indexes.
If you have the space and the resources, one table. As other users have appropriately pointed out, databases can handle millions of rows no problem... well, it depends on the data that is in them. The row size can make a big difference, such as storing VARCHAR(MAX) or VARBINARY(MAX) columns, and several per row...
There is no doubt that writing queries and ETL (extract, transform, load) is significantly easier against a single table, and maintenance is easier too from an archival perspective.
But if you rarely access the older data and you need the performance in the primary table, some sort of archive might make sense.
There are some BI-related reasons to maintain multiple tables, but it doesn't sound like that is your issue here.
There is no perfect answer, and it will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b) but..
with new column describing the time period (e.g. 04.2016, 05.2016, ...)
Please don't. Querying the different periods will become a pain, an unnecessary one. Just put the date in one column, put an index on the column, and you can probably execute fast queries on it.
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write or for the database to execute? An Example would be nice for us to get an image of your actual requirements.
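As a sketch of that advice, assuming the merged table is called logs; the column names here are made up, since we don't know your actual schema:

```sql
-- PostgreSQL: one merged table with a real date column,
-- instead of a textual "04.2016"-style period column.
CREATE TABLE logs (
    period_date date NOT NULL,
    reading     numeric
    -- ... the rest of the columns from the monthly SQLite files ...
);

CREATE INDEX idx_logs_period_date ON logs (period_date);

-- Range queries stay simple and can use the index:
SELECT *
FROM logs
WHERE period_date >= DATE '2016-04-01'
  AND period_date <  DATE '2016-06-01';
```

With a proper date type you also get date arithmetic and sorting for free, which a '04.2016' text column would not give you.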

Why is select count(*) faster than select * even if the table has no indexes?

Is there any difference between the time taken for select * and select count(*) for a table having no primary key or other indexes in SQL Server 2008 R2?
I have tried select count(*) from a view and it took 00:05:41 for 410,063,922 records.
Select * from the view has already taken 10 minutes for the first 600,000 records and the query is still running, so it looks like it will take more than 1 hour.
Is there any way through which I can make this view faster without any change in the structure of the underlying tables?
Can I create indexed view for tables without indexes?
Can I use caching for the view inside sql server so if it is called again, it takes less time?
It's a view which contains 20 columns from one table only. The table does not have any indexes. The user is able to query the view. I am not sure whether the user does select * or select somecolumn from the view with some where conditions. The only thing I want to do is propose some changes through which their querying of the view will return results faster. I am thinking of indexing and caching, but I am not sure whether they are possible on a view over a table with no indexes. Indexing is not possible here, as mentioned in one of the answers.
Can anyone shed some light on caching within SQL Server 2008 R2?
count(*) returns just a number, while select * returns all the data. Imagine having to move all that data, and the time it takes for your hundreds of thousands of records. Even if your table were indexed properly, running select * on hundreds of thousands of records would still take a long time, even if less than before, and should never be needed in the first place.
Can I create indexed view for tables without indexes?
No, you have to add indexes to get indexed results.
Can I use caching for the view inside sql server so if it is called again, it takes less time?
Yes you can, but it's of no use for such a requirement. Why are you selecting so many records in the first place? You should never need to return millions or thousands of complete rows in any query.
Edit
In fact, you are trying to get hundreds of millions of rows without any where clause. This is bound to fail on any server that you can get hold of, so better stop there :)
TL;DR
Indexes do not matter for a SELECT * FROM myTABLE query, because there is no condition and there are hundreds of millions of rows. Unless you change your query, no optimization can help you.
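If the users genuinely need to browse the data, a paging query is one way to avoid pulling everything at once. A sketch for SQL Server 2008 R2, where the view name and ordering column are placeholders (note that OFFSET/FETCH only arrived in SQL Server 2012, hence ROW_NUMBER here):

```sql
-- Page through the view instead of selecting everything at once.
-- MyView and Id are hypothetical names; pick a stable ordering column.
WITH numbered AS
(
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY Id) AS rn
    FROM MyView
)
SELECT *
FROM numbered
WHERE rn BETWEEN 1 AND 1000;   -- fetch one page at a time
```

Without any index on the ordering column this still scans, but it bounds how much data is materialised and shipped to the client per request.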
The execution time difference is due to the fact that SELECT * returns the entire content of your table, while SELECT COUNT(*) only counts how many rows are present without returning them.
Answer about optimisation
In my opinion you're approaching the problem from the wrong angle. First of all, it's important to define the real need of your clients; once the requirements are defined you'll certainly be able to improve your view to get better performance and avoid returning enormous amounts of data.
Optimisations can even be made on the table structure sometimes (we don't have any info about your current structure).
SQL Server automatically uses a caching system to make execution quicker, but that will not solve your problem.
SQL Server apparently does very different work when its result set field list is different. I just did a test of a query joining several tables where many millions of rows were in play. I tested different queries, which were all the same except for the list of fields in the SELECT clause. Also, the base query (for all tests) returned zero rows.
The SELECT COUNT(*) took 6 seconds and the SELECT MyPrimaryKeyField took 6 seconds. But once I added any other column (even small ones) to the SELECT list, the time jumped to 20 minutes - even though there were no records to return.
When SQL Server thinks it needs to leave its indexes (e.g., to access table columns not included in an index) then its performance is very different - we all know this (which is why SQL Server supports including base columns when creating indexes).
Getting back to the original question: the SQL Server optimizer apparently chooses to access the base table data outside of the indexes before it knows that it has no rows to return. In the poster's original scenario, though, there were no indexes or PK (I don't know why), but maybe SQL Server still accesses table data differently for COUNT(*).
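The "including base columns" feature mentioned above looks like this; the table and column names are illustrative:

```sql
-- A covering index: the INCLUDE columns are stored at the index leaf level,
-- so queries touching only these columns never have to visit the base table.
CREATE NONCLUSTERED INDEX IX_MyTable_KeyField
ON MyTable (KeyField)
INCLUDE (SmallColumn1, SmallColumn2);
```

A query such as `SELECT SmallColumn1 FROM MyTable WHERE KeyField = @x` can then be satisfied entirely from the index, which is exactly the behaviour the timing test above was poking at.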

More efficient to query a table or a view?

I am still trying to wrap my head around exactly how views work and when it is best to use a view vs querying a table directly. Here is my scenario:
All of the underlying data resides in a single table that stores three month's worth of data
The table includes four columns: 'TagName', 'Alarm', 'Timestamp', and 'Value'; 'TagName' and 'Timestamp' are indexed
There is a view of this table (no other tables involved) that shows one week's worth of data and combines 'Alarm' and 'Value' into a single column.
The database used in this scenario is SQL Server.
I only need to retrieve 5-30 minutes' worth of data. Is it more efficient to write a query against the underlying table that combines 'Alarm' and 'Value' into a single column (like the view) and sets the time range dynamically, or to query the existing view, passing in a time range? To me, the former seems like the way to go, since the latter essentially requires two queries. Furthermore, in the second scenario, the first query (i.e. the view) would load an unnecessary number of values into memory.
If the data in the single table is not large, then querying the table directly will be faster than creating a view first and querying it, as it avoids one step.
If the data is not large and the columns in the where clause are properly indexed, then queries should generally go to the tables directly (which is faster in most cases).
Views should be used when you have very large data in a single table and you need to operate on a small subset very frequently. In that case a view expresses the required subset once, which helps minimize re-writing of time-consuming search queries (on a single table or a join).
Before settling on a solution, please validate/understand your data and requirements, do a single run with both approaches, compare the times (I think the query on the table should be the winner), and then take the decision.
Hope this helps.
In general you should simplify queries as much as possible by using views. This makes management of your application simpler and also helps you avoid repetition of query logic (joins and WHERE clauses). However, views can be ill-suited for a particular query, which is bad for performance, as they might perform unnecessary operations not needed by that query. Views are abused when they introduce unnecessary complexity.
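For the scenario in the question, the "query the table directly" option might look like this. The 'TagName', 'Alarm', 'Timestamp', and 'Value' columns come from the question; the table name and the COALESCE way of combining the two columns are assumptions:

```sql
-- Filter on the indexed Timestamp first, combine Alarm/Value in the same pass.
SELECT TagName,
       [Timestamp],
       COALESCE(Alarm, Value) AS AlarmOrValue
FROM TagData
WHERE [Timestamp] >= DATEADD(MINUTE, -30, GETDATE());
```

Because the time filter is applied directly against the indexed column, only the 5-30 minutes of interest are read, rather than the week's worth of rows the view covers.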

Select Query VS View

Generally we aggregate transaction data into one table using a SQL job that runs periodically. Now the customer needs live data, so even an hourly refresh is too slow. Instead of running the job every few minutes, the team suggests creating a view that includes the logic for selecting data from the transaction table itself.
But my confusion is: is there any performance difference between a select query and a view? Putting data into an aggregate table periodically and selecting from that aggregation is easy because:
1. It does not affect the main tables
A direct select statement against the transaction tables impacts insert operations, as they happen frequently. If I create a view, I think the concept is the same as a select statement: we are still selecting from the main tables. Or is my understanding wrong?
Please suggest the best approach for this, and why.
If I create a view, I think the concept is the same as a select statement
Correct: a view in its simplest form is nothing more than a saved query definition. When the view is used in a query, the definition is expanded out into the outer query and optimised accordingly. There is no performance benefit. This is not true, of course, for indexed views, where the view essentially becomes its own table, and using the NOEXPAND query hint stops the definition being expanded so the engine simply reads from the view's index(es). Since this is an aggregation query, though, I suspect an indexed view won't even be possible, never mind a viable solution.
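For reference, the indexed-view mechanism looks roughly like this; all names here are hypothetical, and whether your actual aggregation qualifies depends on the indexed-view restrictions (SCHEMABINDING, COUNT_BIG, no nullable columns under SUM, etc.):

```sql
-- An aggregate indexed view: SCHEMABINDING and COUNT_BIG(*) are required.
-- Amount is assumed NOT NULL, otherwise SUM is not allowed here.
CREATE VIEW dbo.SalesSummary
WITH SCHEMABINDING
AS
SELECT CustomerId,
       SUM(Amount)  AS TotalAmount,
       COUNT_BIG(*) AS RowCnt
FROM dbo.Transactions
GROUP BY CustomerId;
GO
-- Materialising the view: a unique clustered index on the grouping key.
CREATE UNIQUE CLUSTERED INDEX IX_SalesSummary
ON dbo.SalesSummary (CustomerId);
GO
-- Reading straight from the view's index, skipping expansion:
SELECT CustomerId, TotalAmount
FROM dbo.SalesSummary WITH (NOEXPAND);
```

The trade-off is that every insert/update on dbo.Transactions now also maintains the view's index, which is exactly the write-impact concern raised in the question.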
The next part, about having a table to store the aggregation, is more difficult to answer. Yes, this could benefit performance, but at the cost of not having up-to-the-minute data, and also of having to maintain the table. Whether this is a suitable solution is entirely dependent on your needs: how up to date the data needs to be, how often it is required, and how long it takes to (a) populate the reporting table and (b) run the query on its own.
For example if it takes 20 seconds to run the query, but it is only needed twice a day, then there is no point running this query every hour to maintain a reporting table to assist a query that is run twice a day.
Another option could be maintaining this reporting table through triggers, i.e. when a row is inserted/updated, cascade the change to the reporting table there and then. This would keep the reporting table up to date, but again, if you are inserting millions of transactions a day and running the report only a few times, you have to weigh up whether the additional overhead caused by the trigger is worth it.
You could reduce the impact on write operations by using a transaction isolation level of READ UNCOMMITTED for the SELECT, but as with the summary table option, this comes at the cost of having up to the second accurate information, as you will be reading uncommitted transactions.
A final option could be an intermediate one: create a summary table daily, and partition/index your main table by date. Then you can get the summary of historic data from the daily table, and union it with only today's data, which should be relatively fast with the right indexes.
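That last option might look something like this; the table and column names are placeholders, since the question doesn't give a schema:

```sql
-- Historic days come pre-aggregated from the daily summary table;
-- only today's rows are aggregated live from the transaction table.
SELECT SummaryDate, Total
FROM DailySummary
WHERE SummaryDate < CAST(GETDATE() AS date)

UNION ALL

SELECT CAST(GETDATE() AS date) AS SummaryDate,
       SUM(Amount)             AS Total
FROM Transactions
WHERE TransactionDate >= CAST(GETDATE() AS date);
```

With the main table partitioned or indexed on TransactionDate, the live half of the union only ever touches one day's worth of rows.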
A good approach is to create the view and profile your DB. Yes, the best way to check whether the view is a good solution is to create it and test the results.
A similar question was asked here:
Is a view faster than a simple query?

BigQuery best practice for segmenting tables by dates

I am new to columnar DB concepts and BigQuery in particular. I noticed that for the sake of performance and cost efficiency it is recommended to split data across tables not only logically, but also by time.
For example - while I need a table to store my logs (1 logical table that is called "logs"), it is actually considered a good practice to have a separate table for different periods, like "logs_2012", "logs_2013", etc... or even "logs_2013_01", "logs_2013_02", etc...
My questions:
1) Is it actually the best practice?
2) Where would be best to draw the line - an annual table? A monthly table? A daily table? You get the point...
3) In terms of retrieving the data via queries - what is the best approach? Should I construct my queries dynamically using the UNION option? If I had all my logs in one table - I would naturally use the where clause to get data for the desired time range, but having data distributed over multiple tables makes it weird. I come from the world of relational DB (if it wasn't obvious so far) and I'm trying to make the leap as smoothly as possible...
4) Using the distributed method (different tables for different periods) still raises the following question: before querying the data itself, I want to be able to determine, for a specific log type, what range is available for querying. For example, for a specific machine I would like to first present to my users the relevant scope of their available logs, and let them choose the specific period within that scope to get insights for. The question is: how do I construct such a query when my data is distributed over a number of tables (one per period) and I don't know which tables are available? How can I construct a query when I don't know which tables exist? I might try to access the table "logs_2012_12" when this table doesn't actually exist, or even worse, I wouldn't know which tables are relevant and available for my query.
Hope my questions make sense...
Amit
Table naming
For daily tables, the suggested table name pattern is the specific name of your table + the date like in '20131225'. For example, "logs20131225" or "logs_20131225".
Ideal aggregation: Day, month, year?
The answer to this question will depend on your data and your queries.
Will you usually query one or two days of data? Then have daily tables, and your costs will be much lower, as you query only the data you need.
Will you usually query all your data? Then have all the data in one table. Having many tables in one query can get slower as the number of tables to query grows.
If in doubt, do both! You could have daily, monthly, yearly tables. For a small storage cost, you could save a lot when doing queries that target only the intended data.
Unions
Feel free to do unions.
Keep in mind that there is a limit of 1,000 tables per query. This means that with daily tables you won't be able to query 3 years of data in one go (3 × 365 > 1000).
Remember that unions in BigQuery don't use the UNION keyword, but the "," that other databases use for joins. Joins in BigQuery can be done with the explicit SQL keyword JOIN (or JOIN EACH for very big joins).
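A minimal legacy-SQL sketch of that comma union; the dataset, table, and column names are illustrative:

```sql
-- Legacy BigQuery SQL: the comma between tables means UNION ALL, not a join.
SELECT message
FROM [mydataset.logs20131224], [mydataset.logs20131225]
WHERE severity = 'ERROR'
```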
Table discovery
API: tables.list will list all tables in a dataset, through the API.
SQL: To query the list of tables within SQL... stay tuned.
New 2016 answer: Partitions
Now you can have everything in one table, and BigQuery will analyze only the data contained in the desired dates - if you set up the new partitioned tables:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
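With partitioned tables the same idea can be sketched in standard SQL DDL (added to BigQuery after this answer was written); the dataset, table, and column names here are illustrative:

```sql
-- One partitioned table instead of logs20131225-style day tables.
CREATE TABLE mydataset.logs
(
  ts      TIMESTAMP,
  message STRING
)
PARTITION BY DATE(ts);

-- Only the partitions covered by the filter are scanned (and billed):
SELECT message
FROM mydataset.logs
WHERE DATE(ts) BETWEEN '2016-04-01' AND '2016-04-30';
```

This also answers the table-discovery question above: there is only one table, and the available date range is just `SELECT MIN(DATE(ts)), MAX(DATE(ts)) FROM mydataset.logs`.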