I am still trying to wrap my head around exactly how views work and when it is best to use a view vs querying a table directly. Here is my scenario:
All of the underlying data resides in a single table that stores three months' worth of data
The table includes four columns: 'TagName', 'Alarm', 'Timestamp', and 'Value'; 'TagName' and 'Timestamp' are indexed
There is a view of this table (no other tables involved) that shows one week's worth of data and combines 'Alarm' and 'Value' into a single column.
The database used in this scenario is SQL Server.
I only need to retrieve 5-30 minutes' worth of data. Is it more efficient to write a query against the underlying table that combines 'Alarm' and 'Value' into a single column (like the view does) and sets the time range dynamically, or to query the existing view by passing in a time range? To me, the former seems like the way to go, since the latter essentially requires two queries. Furthermore, in the second scenario, the first query (i.e. the view) would load an unnecessary number of values into memory.
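For illustration, the two options might look roughly like the sketch below (the table and view names, the concatenation used to combine 'Alarm' and 'Value', and the @StartTime/@EndTime parameters are assumptions, since the question does not spell them out):

-- Option 1: query the base table directly (dbo.TagData is a made-up name)
SELECT TagName,
       [Timestamp],
       CONCAT(Alarm, ' / ', Value) AS AlarmValue   -- assumed combination logic
FROM   dbo.TagData
WHERE  [Timestamp] >= @StartTime
  AND  [Timestamp] <  @EndTime;

-- Option 2: apply the same time filter to the existing one-week view
SELECT TagName, [Timestamp], AlarmValue
FROM   dbo.TagDataLastWeek                         -- made-up view name
WHERE  [Timestamp] >= @StartTime
  AND  [Timestamp] <  @EndTime;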
If the data in your single table is not large, then querying the table directly will be faster than creating a view first and then querying it for the required data, because it avoids one step.
If the data is not large and the columns in the WHERE clause are properly indexed, then queries should generally go to the table directly (which is faster in most cases).
Views are most useful when you have very large data in a single table and you need to operate on a small subset of it very frequently. In that case the view fetches the required data once and works with that, which helps minimize re-execution of time-consuming search queries (on a single table or a join).
Before settling on a solution, validate and understand your data and requirements, run both approaches once, compare the timings (I think the query against the table should be the winner), and then take the decision.
Hope this helps.
In general you should simplify queries as much as possible by using views. This makes management of your application simpler and also helps you avoid repeating query logic (joins and WHERE clauses). However, a view can be ill-suited to a particular query, which is bad for performance because the view may perform operations your query doesn't actually need. Views are abused when they introduce unnecessary complexity.
We have a table design that consists of 10,000,000 records and 200,000 columns.
The columns are a mixture of:
Binary flags.
Integers.
The queries need to perform AND/OR operations on 1-100 columns at a time, and should complete in under 0.1 seconds, returning only a projection/subset of each matched row.
Around 10 new columns get added per day.
Around 1,000 new rows get added per day.
There are no joins.
Which DBMS is best suited for this?
Reason behind this approach:
The columns are materialized indexes from user-defined queries: that's why new columns get added each day (as more users come up with their own queries). The other option would be to not use materialized views and have the users' queries perform joins. The problem there is that the queries could take any form, and in aggregate there would be a large number of very different execution plans across everyone's queries... since the users define the queries, it's practically impossible to optimise a traditional SQL database using indexes, normalised tables, etc.
First, I'd suggest measuring ad-hoc JOINs, and only doing further optimization if you find the performance lacking. I understand it could be difficult to measure every possible query, but you may be able to cover the most common/representative cases, and if they perform well enough, just stop there. There is a lot that can be done with good indexing!
Second, and only if the measurements above warrant it, create a new separate materialized view for each ad-hoc query.
Some databases will be able to maintain such views automatically for you [1], so if the "base" data changes, relevant results will be automatically added or removed from the materialized view (just as they would from the "live" query result).
Other databases may allow periodic refresh [2].
Be warned though: maintaining materialized views is not free, and having thousands of them (especially if they are constantly kept up-to-date, as opposed to periodically refreshed) will definitely impact the insert/update/delete performance on the base data!
[1] E.g. SQL Server indexed views.
[2] E.g. Oracle materialized views, although it looks like 12c can also do something close to SQL Server's immediate refresh.
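For reference, and only as an illustration of footnote 2 with made-up object names, a periodically refreshed materialized view in Oracle can look something like this:

CREATE MATERIALIZED VIEW mv_user_query_42
  REFRESH COMPLETE ON DEMAND
AS
  SELECT row_id, flag_a, flag_b
  FROM   base_rows
  WHERE  flag_a = 1 AND flag_b = 0;

-- A scheduled job then refreshes it periodically:
-- EXEC DBMS_MVIEW.REFRESH('MV_USER_QUERY_42', 'C');   -- 'C' = complete refresh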
Setting aside the question of why you want to go with that many columns, you can look at databases which support an unlimited number of columns; see the comparison referenced below.
References: https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
I want to move multiple SQLite files to PostgreSQL.
The data contained in these files is monthly time-series data (one month per *.sqlite file). Each file has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
I think I would definitely go for one table; just make sure you use sensible indexes.
If you have the space and the resources, go with one table; as other users have appropriately pointed out, databases can handle millions of rows without a problem. Well, it depends on the data that is in them: row size can make a big difference, for example when you store VARCHAR(MAX) or VARBINARY(MAX) columns, and several of them per row.
There is no doubt that writing queries and ETL (extract, transform, load) is significantly easier against a single table, and maintenance is easier too from an archival perspective.
But if you never access the older data and you need performance in the primary table, some sort of archive might make sense.
There are some BI-related reasons to maintain multiple tables, but it doesn't sound like that is your issue here.
There is no perfect answer; it will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b), but...
with new column describing the time period (e.g. 04.2016, 05/2016, ...)
Please don't. Querying the different periods will become a pain, an unnecessary one. Just put the date in one column, put an index on the column, and you can probably execute fast queries on it.
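A minimal sketch of that layout (table and column names are made up):

CREATE TABLE readings (
    id        bigserial PRIMARY KEY,
    taken_on  date      NOT NULL,   -- a real date instead of a '04.2016' text label
    value     numeric   NOT NULL
);
CREATE INDEX readings_taken_on_idx ON readings (taken_on);

-- Any period is then just a range filter on the indexed column:
SELECT * FROM readings
WHERE taken_on >= DATE '2016-04-01' AND taken_on < DATE '2016-05-01';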
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write, or for the database to execute? An example would be nice for us to get a picture of your actual requirements.
I have been tasked with replacing a costly stored procedure which performs calculations across 10 - 15 tables, some of which contain many millions of rows. The plan is to pre-stage the many computations and store the results in separate tables to speed up reads.
Having quickly created these new tables and inserted all of the necessary pre-staged data as a test case, the execution time of getting the same results is vastly improved, as you would expect.
My question is, what is the best practice for keeping these new separate tables up to date?
A procedure which runs at a specific interval could do it, but there is a requirement for the data to be live.
A trigger on each table could do it, but that seems very costly, and could cause slow-downs everywhere else that uses these tables.
Are there other alternatives?
Have you considered Indexed Views for this? As long as you meet the criteria for creating Indexed Views (no self joins etc) it may well be a good solution.
The downsides of Indexed Views are that when the data in the underlying tables is changed (delete, update, insert), the indexed view has to be recalculated. This can slow down these types of operations in certain circumstances, so you have to be careful. I've put some links to the documentation below:
https://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://msdn.microsoft.com/en-GB/library/ms191432.aspx
https://technet.microsoft.com/en-GB/library/ms187864(v=sql.105).aspx
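As a rough sketch of the idea (the table and column names below are invented, not taken from the question), an indexed view that pre-stages an aggregate looks something like this:

CREATE VIEW dbo.OrderTotalsByCustomer
WITH SCHEMABINDING
AS
SELECT  CustomerId,
        SUM(Amount)  AS TotalAmount,   -- Amount must be NOT NULL (or wrapped in ISNULL) in an indexed view
        COUNT_BIG(*) AS RowCnt         -- COUNT_BIG(*) is required when the view has GROUP BY
FROM    dbo.Orders
GROUP BY CustomerId;
GO
-- The unique clustered index is what makes SQL Server materialize the view
-- and keep it up to date automatically as dbo.Orders changes.
CREATE UNIQUE CLUSTERED INDEX IX_OrderTotalsByCustomer
    ON dbo.OrderTotalsByCustomer (CustomerId);

(On non-Enterprise editions you typically need the NOEXPAND table hint for queries to use the view's index.)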
what is the best practice for keeping these new separate tables up to date?
The answer is: it depends. Depends on what?
1. How frequently will you use those computed values?
2. What is the acceptable data latency?
We have the same kind of reporting, where we store computed values in separate tables and use them in reports. In our case we run these stored procedures before sending the reports, via SQL Server Agent.
Consider using an A/B table solution. Place a generic view over the _A table version (CREATE VIEW MY_TABLE AS SELECT * FROM MY_TABLE_A). Then you rebuild the _B version and switch the view to the _B version (ALTER VIEW MY_TABLE AS SELECT * FROM MY_TABLE_B). It takes twice as much space for processing, but it gives you the opportunity to build your tables without down-time.
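A rough sketch of that pattern, with invented object names and a placeholder column list:

-- Two physical copies of the pre-staged results
CREATE TABLE dbo.StagedResults_A (Id int PRIMARY KEY, Total decimal(18,2) NOT NULL);
CREATE TABLE dbo.StagedResults_B (Id int PRIMARY KEY, Total decimal(18,2) NOT NULL);
GO
-- The application only ever reads through the view, which starts out on _A
CREATE VIEW dbo.StagedResults AS SELECT Id, Total FROM dbo.StagedResults_A;
GO
-- Rebuild _B in the background (the expensive computation goes here)...
TRUNCATE TABLE dbo.StagedResults_B;
-- INSERT INTO dbo.StagedResults_B (Id, Total) SELECT ... ;
GO
-- ...then switch readers over in a single statement
ALTER VIEW dbo.StagedResults AS SELECT Id, Total FROM dbo.StagedResults_B;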
I have a View whose structure is exactly the same as the query used to generate the records in the Table.
But executing the select statement below against the View takes longer than against the Table.
Does that mean the View takes longer to retrieve the data than the Table?
If I have huge data, is it more suitable to use the Table instead of the View?
select count(*) from XYZ_VIEW -- this is a view; it returns the count in 4 min 4 sec, count = 5896
select count(*) from XYZ -- this is a table; it returns the count in less than 1 second, count = 5896
You don't show the query behind the view but I assume it involves some joining to expose normalized data and/or other processing. Here's how you work with views in this regard.
Views should not (as in never) be the target of an unfiltered query. Actually, this should be discouraged for tables too, but such queries are sometimes necessary during testing. Fortunately, they are almost never justified or even necessary in production code.
And since you don't really care about performance during testing but rather about performance in production, you haven't said anything that suggests you might have a problem. That's right: four minutes versus one second for the query you show is meaningless. Many of my views (if not most) would give similar results.
Instead, use a query that would more likely be used in production. Using timing methods with millisecond precision, use a query more in the form:
select ...
from table/view
where <typical filtering criteria>;
That will give you more useful information. Depending on your particular application and requirements, an acceptable range would be 10-20%. That is, if the table returns the result in 30 ms, the view is good if it returns the same result in less than about 36 ms.
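If you are on SQL Server, for example, SET STATISTICS TIME gives millisecond timings; the WHERE clause below is only a stand-in for whatever your real production filter is:

SET STATISTICS TIME ON;

SELECT *
FROM   XYZ_VIEW
WHERE  SomeKeyColumn = 42;   -- hypothetical filter

SELECT *
FROM   XYZ
WHERE  SomeKeyColumn = 42;

SET STATISTICS TIME OFF;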
You are probably using a view because it manipulates the data to present it more favorably in some way. This processing is overhead you do not have when you query the tables directly. When you omit filtering, you perform that processing on every row of the underlying table(s), probably needlessly. When you do something particularly stupid like select count(*), you perform all that extra processing for no benefit whatsoever.
When you filter the query, the extra processing is performed only on the qualifying result set. Should that result set be only one row, the processing will be performed only on the single row, rendering the performance from the view (depending, of course, on the exact amount and type of processing) practically indistinguishable from the table.
I always recommend the use of views -- lots and lots of views -- to present your users with data in whatever form is best for their various purposes. Test those views, by all means. But use meaningful tests.
Is there any performance benefit to splitting a large table with roughly 100 columns into 2 separate tables, in terms of insert, delete and select performance? I'm using SQL Server 2008.
If one of the fields is a CLOB or BLOB, you anticipate it holding a huge amount of data, you won't need that field very often, and the result set will be transmitted over a long pipe (like server to a web-based client), then I think putting that field in a separate table would be appropriate.
But just returning 100 regular fields probably won't tax your system so much as to justify a separate table and a join.
The only benefit you might see is if a number of columns are only occasionally populated. In which case putting those into their own table and only adding a row when there is data might make sense in terms of overall row overhead and, depending on the number of rows, overall page count for the table(s). That said, this is one of the reasons they introduced sparse columns in SQL Server 2008.
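A minimal sketch of that sparse-column alternative, with entirely made-up table and column names:

CREATE TABLE dbo.Items (
    ItemId     int IDENTITY(1,1) PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    RareAttr1  varchar(200)  SPARSE NULL,   -- NULL values in SPARSE columns take no storage
    RareAttr2  varchar(200)  SPARSE NULL
);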
For the maintenance and other overhead of managing two tables instead of one (especially given that people can act on individual tables if they choose), it's unlikely it would be worth it.
Can you describe what type of entity needs to have over 100 columns? Perhaps the data model is just wrong in the first place.
I would say no, as it would take more execution time to join the 2 tables whenever you wanted to do something.
It depends on whether you use these fields at the same time in your application.
These kinds of performance improvements are really bad: they make your source code impossible to understand. If you have performance trouble with this table, add something (like a table containing the 15 fields you'll use in a given request, updated via a trigger); don't modify your clean solution.
If you don't have a performance problem, don't do anything; you'll see later!