We have a system that makes use of a database view, which takes data from a few reference tables (lookups) and then does a lot of pivoting and complex work on a hierarchy table of (pretty much fixed and static) locations, returning a view of data to the application.
This view is getting slow, as new requirements are added.
One option would be to create a normal table, select from the view into this table, and let the application use that highly indexed and fast table for its querying.
The issue, I guess, is that if the underlying tables change, the new table will show old results. But the data that drives this table changes very infrequently. And if it does, a business/technical process could be put in place so that an 'Update the Table' procedure is run to refresh the data. Or even an update/insert trigger on the primary driving table?
Is this practice advised/ill-advised? And are there ways of making it safer?
The ideal solution is to optimise the underlying queries.
In SSMS, run the slow query and include the actual execution plan (Ctrl + M); this will give you a graphical representation of how the query is being executed against your database.
Another helpful tool is to turn on IO statistics; IO is usually the main bottleneck with queries. Put this line at the top of your query window:
SET STATISTICS IO ON;
Check if SQL recommends any missing indexes (displayed in green in the execution plan). As you say the data changes infrequently, it should be safe to add additional indexes if needed.
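For example, applying one of the suggested indexes might look something like this (the table, key column and INCLUDE list below are placeholders, not taken from your schema):
-- Hypothetical example of creating a suggested missing index
CREATE NONCLUSTERED INDEX IX_LocationHierarchy_ParentId
ON dbo.LocationHierarchy (ParentId)
INCLUDE (LocationName, LocationType);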
In the execution plan you can hover your mouse over any element for more information. Check the value for estimated rows vs actual rows returned; if this varies greatly, update the statistics for the tables, which can help the query optimiser find the best execution plan.
To do this for all tables in a database:
USE [Database_Name]
GO
exec sp_updatestats
Still no luck in optimising the view / query?
Be careful with update triggers: if the schema changes on the view/table (say you add a new column to the source table), the new column will not be inserted into your 'optimised' table unless you update the trigger.
If it is not a business requirement to report on real-time data, there is not too much harm in having a separate optimised table for reporting (much like a data mart); just use a SQL Agent job to refresh it nightly during non-peak hours.
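As a rough sketch, the SQL Agent job could simply call a refresh procedure along these lines (dbo.ReportTable and dbo.SlowView are placeholder names for your reporting table and existing view):
-- Minimal sketch of a nightly refresh procedure
CREATE PROCEDURE dbo.Refresh_ReportTable
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        -- TRUNCATE is transactional in SQL Server; wrapping both steps in one
        -- transaction means readers never see a half-loaded table (they block
        -- briefly during the refresh instead)
        TRUNCATE TABLE dbo.ReportTable;
        INSERT INTO dbo.ReportTable
        SELECT * FROM dbo.SlowView;
    COMMIT;
END;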
There are a few cons to this approach though:
More storage space / duplicated data
More complex database
Additional workload during the refresh
Decreased cache hits
I execute a SELECT INTO query where my source data is in a different database than the table I insert into (but on the same server).
When I execute the query from the database where my source data is (USE DATABASE_MY_SOURCE_DATA), it completes in under a minute. When I switch to the database where my target table sits, it doesn't complete in 10 minutes (I don't know the exact time because I cancelled it).
Why is that? Why is the difference so huge? I can't get my head around it.
Querying cross-database, even using a linked server connection, is always likely (at least in 2021) to present performance concerns.
The first problem is that the optimizer doesn't have enough information to estimate the number of rows in the remote table(s). It's also going to miss indexes on those tables, resorting to table scans (which tend to be a lot slower on large tables than index seeks).
Another issue is that there is no data caching, so the optimizer makes round-trips to the remote database for every necessary operation.
More information (from a great source):
https://www.brentozar.com/archive/2021/07/why-are-linked-server-queries-so-bad/
Assuming that you want this to be more performant, and that you are doing substantial filtering on the remote data source, you may see some performance benefit from creating, on the remote database, a view that filters to just the rows you want in the target table, and querying that for your results.
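A rough sketch of that idea (all object names here are hypothetical placeholders):
-- On the remote (source) database: a view that does the filtering there
CREATE VIEW dbo.v_SourceFiltered
AS
SELECT Id, Col1, Col2
FROM dbo.BigSourceTable
WHERE IsActive = 1;

-- On the target database: pull only the filtered rows across
SELECT Id, Col1, Col2
INTO dbo.TargetTable
FROM SourceDb.dbo.v_SourceFiltered;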
Alternatively (and likely more correctly) you should wrap these operations in an ETL process (such as SSIS) that better manages these connections.
Can we create an index on a table while data is being read from it? Actually, I am using an ETL tool to load data to the target. The performance is very slow and I do not want the data load to stop.
Will the performance improve if I create the index while the data load is in progress?
ETL tool used: BODS
In an analytical DB you basically have 2 use cases for indexes:
To speed up data loads
To speed up queries
The more indexes you have on a table, the slower it will be to load. Therefore, traditionally, "query" indexes were dropped at the start of each data load and re-built once the load was completed. This is still good practice where possible, but obviously if you have massive tables (and therefore excessive index re-build times), users running queries during data loads, or continuous/streaming loading, then this is not possible.
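A minimal sketch of that traditional pattern, with a placeholder table and index name:
-- Before the load: drop the "query" index
DROP INDEX IX_Sales_CustomerId ON dbo.Sales;

-- ... run the data load ...

-- After the load: re-create it
CREATE NONCLUSTERED INDEX IX_Sales_CustomerId ON dbo.Sales (CustomerId);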
Creating indexes while loading data is likely to slow down the data load - not necessarily because of any impact on the tables being indexed but because indexing uses DB resources and therefore those resources are not available for use by the data loading activities.
Creating an index on a table while that table is being loaded will not speed up the data load and will probably slow it down (see previous paragraph). When SQL is executed on a DB, one of the first things the DB will do is generate an execution plan, i.e. determine the most efficient way to execute the statement based on table statistics, available indexes, etc. Once the SQL statement is executing (based on the plan), it will not then continuously check whether the indexes have changed since execution started, re-calculate the plan and re-execute the statement if there is now a more efficient plan available.
I have underlying tables on which the data changes constantly. Every minute or so, I run a stored procedure to summarize the data in those underlying tables into a summary table. The summarization time is very long (~30s), so it does not make sense to have a "summary view." Additionally, the summary table is constantly accessed by multiple users; it needs to be quick, responsive, and cannot be down.
To solve this, I do the following in the stored procedure:
Summarize the data into "new summary table" (this can take as long as it needs because the "current summary table" is serving the needs of the users)
Drop the "current summary table"
Rename "new summary table" to "current summary table"
My questions are:
Is this safe/proper?
What happens if a user tries to access the "current summary table" when the summarization procedure is between steps 2 and 3 above?
What is the right way to do this? At the end of the day, I just need a summary that is always quickly accessible (this is important) and up to date (within a minute or so).
By using triggers on the detail table, you can make the summary stay in sync. For things like averages, you need to track sum and count in the summary table as well, so you can recompute the average. Row-level triggers might have higher overhead than statement-level triggers (which fire once for all rows of the operation) if you have bulk churn, assuming SQL Server has both flavors of trigger like Oracle. Inserts might create a summary row or update it, deletes may update or delete the summary row, and updates might change a key and so do both. Of course, there may be multiple sorts of summary row for any detail row.
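As an illustration only (the detail and summary tables below are hypothetical, and this covers inserts only; deletes and updates would need similar handling), an insert trigger maintaining sum and count might look like this:
-- Sketch: keep dbo.summary(grp, total_amount, row_count) in sync with
-- dbo.detail(grp, amount); average can be derived as total_amount / row_count
CREATE TRIGGER trg_detail_insert
ON dbo.detail
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Fold the whole inserted set into the summary in one statement
    MERGE dbo.summary AS s
    USING (SELECT grp, SUM(amount) AS amt, COUNT(*) AS cnt
           FROM inserted
           GROUP BY grp) AS i
    ON s.grp = i.grp
    WHEN MATCHED THEN
        UPDATE SET s.total_amount = s.total_amount + i.amt,
                   s.row_count    = s.row_count + i.cnt
    WHEN NOT MATCHED THEN
        INSERT (grp, total_amount, row_count)
        VALUES (i.grp, i.amt, i.cnt);
END;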
Oracle had a materialized view, so maybe SQL Server has that, too. Oh look, it does! Google is my friend, so much to remember! It would be something like a shorthand for the above, at best.
Such triggers can add a lot of delay to detail table churn. Regenerating the summary table with a periodic query might suffice for some uses. A procedure can truncate the previous table for reuse, generate the new summary in it, and then swap the names in a transaction. If there is a timestamp in or for the table, the procedure can skip no-change updates. The lock, disk and CPU overhead for queries is often a lot less than for churn.
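A minimal sketch of the name swap in a transaction (assuming hypothetical tables summary_new and summary_current, with summary_new already fully populated):
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.summary_current', 'summary_old';
    EXEC sp_rename 'dbo.summary_new', 'summary_current';
COMMIT;
-- summary_old can now be truncated and reused for the next refresh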
Some summaries, like median, are very hard to support except by a view, but such a view might run fast if indexed appropriately (non-clustered, sorted rather than hash indexes), as queries can be fulfilled right from non-clustered indexes. Excess indexes slow transactions (churn), so many people use replicated tables for reporting, with few, narrow indexes on the parent transaction table and report-oriented indexes on the replicated table.
What is the use of sp_updatestats? Can I run that in the production environment for performance improvement?
sp_updatestats updates all statistics for every table in the database where even a single row has changed. It does so using the default sample, meaning it doesn't scan all rows in the table, so it will likely produce less accurate statistics than the alternatives.
If you have a maintenance plan with 'rebuild indexes' included, it will also refresh statistics, but with greater accuracy because it scans all rows. There is no need to update stats after rebuilding indexes.
Manually updating a particular statistics object or a table with the UPDATE STATISTICS command gives you much better control over the process. For automating it, take a look here.
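For example (dbo.BigTable and the statistics name are placeholders):
-- All statistics on one table, default sample
UPDATE STATISTICS dbo.BigTable;

-- One specific statistics object, scanning every row
UPDATE STATISTICS dbo.BigTable IX_BigTable_Date WITH FULLSCAN;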
Auto-update fires only when the optimizer decides it has to. The threshold math has changed between versions: traditionally, auto-update fired after 500 + 20% of the table's rows had changed; newer versions use SQRT(1000 * table rows) instead (available via trace flag 2371, and the default behavior from SQL Server 2016). This means it fires more frequently on large tables. Temporary tables behave differently, of course.
To conclude, sp_updatestats could actually do more damage than good, and is the least recommendable option.
I have a table with millions of records. The total size of that table alone is somewhere around 6-7 GB. This table is my application log table, and it is growing really fast, which makes sense. Now I want to move records from the log table into a backup table. Here is the scenario, and here are my questions.
Table Log_A
Insert into Log_b select * from Log_A;
Delete from Log_A;
I am using a Postgres database. The questions are:
When this query is performed, do all the records from Log_A get loaded into physical memory? NOTE: Both of the above queries run inside a stored procedure.
If not, then how does it work?
I hope this question applies to all databases.
I hope somebody can give me some idea on this.
In PostgreSQL, that's likely to execute a sequential scan, loading some records into shared_buffers, inserting them, writing the dirty buffers out, and carrying on.
All the records will pass through main memory, but they don't all have to be in memory at once. Because they all get read from disk using normal buffered reads (pread) it will affect the operating system disk cache, potentially pushing other data out of the cache.
Other databases may vary. Some could execute the whole SELECT before processing the INSERT (though I'd be surprised if any serious ones did). Some use O_DIRECT reads or raw disk I/O to avoid affecting the OS cache, so the buffer cache effects might be different. I'd be amazed if any database relied on loading the whole SELECT into memory, though.
When you want to see what PostgreSQL is doing and how, the EXPLAIN and EXPLAIN (BUFFERS, ANALYZE) commands are quite useful. See the manual.
You may find writable common table expressions interesting for this purpose; they let you do all this in one statement. In this simple case there's probably little benefit, but it can be a big win in more complex data migrations.
BTW, make sure to run that pair of queries wrapped in BEGIN and COMMIT.
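A minimal sketch of both options, using the table names from the question:
-- Two statements in one transaction
BEGIN;
INSERT INTO log_b SELECT * FROM log_a;
DELETE FROM log_a;
COMMIT;

-- Or as a single statement, using a writable CTE (PostgreSQL 9.1+)
WITH moved AS (
    DELETE FROM log_a
    RETURNING *
)
INSERT INTO log_b
SELECT * FROM moved;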
Probably not.
Each record is individually processed; this particular query doesn't need to have knowledge of any of the other records to successfully execute. So the only record that needs to be in memory at any given moment is the one currently being processed.
But it really depends on whether or not the database thinks it can do it faster by loading up the whole table. Check the execution plan of the query.
If your setup allows it, just rename the old table and create a new empty one. Much faster, obviously, as no copying is done at all.
ALTER TABLE log_a RENAME TO log_b;
CREATE TABLE log_a (LIKE log_b INCLUDING ALL);
The LIKE clause copies the structure of the (now renamed) old table. INCLUDING ALL includes defaults, constraints, indexes, ...
Foreign key constraints or views depending on the table or other less common dependencies (but not queries in plpgsql functions) might be a hurdle for this route. You would have to recreate those to have them point to the new table. But a logging table like you describe probably carries no such dependencies.
This acquires an exclusive lock on the table. I assume typical write access will be INSERT-only in your case? One way to deal with concurrent access would then be to create the new table in a different schema and alter the search_path for your application user. Then the application starts to write to the new table without concurrency issues. Of course, you wouldn't schema-qualify the table name in your INSERT statements for this to take effect.
CREATE SCHEMA log20121018;
CREATE TABLE log20121018.log_a (LIKE log20121011.log_a INCLUDING ALL);
ALTER ROLE myrole SET search_path = app, log20121018, public;
Or alter the search_path setting at whatever level is effective for you:
globally, per database, per role, per session, per function ...
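For example (mydb is a placeholder database name):
ALTER DATABASE mydb SET search_path = app, log20121018, public;  -- per database
SET search_path = app, log20121018, public;                      -- per session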