What impact do statistics have on a table - sql-server-2005

When I add an index to a table there is an obvious benefit in searching, however there is also a cost involved with insert/update/delete statements as the index needs to be updated.
If I create a new statistic on a table, does it incur similar costs to an index?

Whatever statistics the optimizer uses to find the data for a query are first checked to see whether they are up to date. If they are not (based on a random sample of the data), SQL Server will update them, and your query takes a performance hit while it waits for the stats to be refreshed.
From what I've found, statistics can also be set to Auto Update Statistics Asynchronously. The current query then uses the old statistics while SQL Server updates them in the background for next time. This could make the current query perform badly if a lot of data has changed.
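For reference, the asynchronous behaviour is a database-level setting; a minimal example (the database name is a placeholder):
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS_ASYNC ON;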
Main source: MSDN

Related

SQL How to properly create a summary table?

I have underlying tables on which the data changes constantly. Every minute or so, I run a stored procedure to summarize the data in those underlying tables into a summary table. The summarization takes very long (~30s), so it does not make sense to have a "summary view." Additionally, the summary table is constantly accessed by multiple users; it needs to be quick and responsive, and cannot be down.
To solve this, do the following in the stored procedure:
1. Summarize the data into a "new summary table" (this can take as long as it needs, because the "current summary table" is still serving the needs of the users)
2. Drop the "current summary table"
3. Rename the "new summary table" to "current summary table"
My questions are:
Is this safe/proper?
What happens if a user tries to access the "current summary table" when the summarization procedure is between steps 2 and 3 above?
What is the right way to do this? At the end of the day, I just need a summary that is always quickly accessible (this is important) and up to date (within a minute or so).
By using triggers on the detail tables, you can keep the summary in sync. For things like averages, you need to track the sum and count in the summary table as well, so you can recompute the average. Row-level triggers can carry more overhead than statement-level triggers that see all rows of an operation, which matters if you have bulk churn and your engine offers both flavours (Oracle does; SQL Server DML triggers are statement-level). Inserts may create a summary row or update it, deletes may update or delete the summary row, and updates might change a key and so do both. Of course, there may be multiple kinds of summary row for any detail row.
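As a rough illustration, here is a minimal sketch of a statement-level AFTER INSERT trigger that maintains a per-group sum and count, assuming hypothetical tables detail(GroupID, Amount) and summary(GroupID, TotalAmount, RowCnt); the average is then TotalAmount / RowCnt, and similar triggers would be needed for UPDATE and DELETE:
CREATE TRIGGER trg_detail_ai ON detail
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Adjust summary rows for groups that already exist.
    UPDATE s
    SET    s.TotalAmount = s.TotalAmount + i.Amt,
           s.RowCnt      = s.RowCnt + i.Cnt
    FROM   summary AS s
    JOIN  (SELECT GroupID, SUM(Amount) AS Amt, COUNT(*) AS Cnt
           FROM inserted
           GROUP BY GroupID) AS i
          ON i.GroupID = s.GroupID;

    -- Create summary rows for groups seen for the first time.
    INSERT INTO summary (GroupID, TotalAmount, RowCnt)
    SELECT i.GroupID, SUM(i.Amount), COUNT(*)
    FROM   inserted AS i
    WHERE  NOT EXISTS (SELECT 1 FROM summary AS s WHERE s.GroupID = i.GroupID)
    GROUP BY i.GroupID;
END;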
Oracle has materialized views, and it turns out SQL Server has an equivalent too (the indexed view). At best it would be something like a shorthand for the above.
There is the potential for a lot of delay in detail-table churn with such triggers. Regenerating the summary table with a periodic query might suffice for some uses: a procedure can truncate the previous table for reuse, regenerate the summary into it, and then swap the names inside a transaction. If there is a timestamp in or for the table, the procedure can skip no-change updates. The lock, disk and CPU overhead of a periodic query is often a lot less than that of per-churn maintenance.
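A sketch of that rebuild-and-swap, assuming hypothetical tables summary_current (the one users read) and summary_staging (the one being rebuilt); while the renames run, concurrent readers briefly wait on the schema locks rather than seeing a missing table:
-- Rebuild into the staging table (readers keep using summary_current).
TRUNCATE TABLE summary_staging;

INSERT INTO summary_staging (GroupID, TotalAmount, RowCnt)
SELECT GroupID, SUM(Amount), COUNT(*)
FROM   detail
GROUP BY GroupID;

-- Swap the names in one short transaction.
BEGIN TRANSACTION;
EXEC sp_rename 'summary_staging', 'summary_new';
EXEC sp_rename 'summary_current', 'summary_staging';
EXEC sp_rename 'summary_new', 'summary_current';
COMMIT TRANSACTION;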
Some summaries, like the median, are very hard to maintain incrementally and are easier left to a view, but such a view can still run fast if supported by the right indexes (non-clustered, sorted rather than hash), as queries can often be answered straight from non-clustered indexes. Excess indexes slow down transactions (churn), so many shops use replicated tables for reporting, with few, narrow indexes on the parent transaction table and report-oriented indexes on the replicated copy.

Postgres SQL sentence performance

I've got a Postgres instance running on a 16-core / 32 GB Windows Server workstation.
I followed performance improvement tips I found in places like this: https://www.postgresql.org/docs/9.3/static/performance-tips.html.
When I run an update like:
analyze;
update amazon_v2
set states_id = amazon.states_id,
geom = amazon.geom
from amazon
where amazon_v2.fid = amazon.fid
where fid is the primary key in both tables and both have 68M records, it takes almost a day to run.
Is there any way to improve the performance of SQL sentences like this? Should I write a stored procedure to process it record by record, for example?
You don't show the execution plan, but I'd bet it's performing a full scan on amazon_v2 and an index lookup on amazon.
I don't see how to improve performance much here, since that is close to optimal already. The only things I can think of are table partitioning and parallelizing the execution.
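If you want to confirm the plan, one option is the sketch below; note that EXPLAIN ANALYZE really executes the statement, which is why it is wrapped in a transaction and rolled back:
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE amazon_v2
SET    states_id = amazon.states_id,
       geom      = amazon.geom
FROM   amazon
WHERE  amazon_v2.fid = amazon.fid;
ROLLBACK;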
Another, totally different strategy is to update the "modified" rows only. Maybe you can track those to avoid updating all 68 million rows every time.
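A sketch of that idea, assuming plain equality is an acceptable change test for geom (with PostGIS you may prefer an exact comparison such as ST_OrderingEquals); only rows whose values actually differ get locked and rewritten:
UPDATE amazon_v2 AS v
SET    states_id = a.states_id,
       geom      = a.geom
FROM   amazon AS a
WHERE  v.fid = a.fid
  AND (v.states_id IS DISTINCT FROM a.states_id
       OR v.geom IS DISTINCT FROM a.geom);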
Your query is executed in a very long transaction. The transaction may be blocked by other writers; query pg_locks to check.
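For example, a quick look at ungranted locks and the sessions waiting on them, using only the standard pg_locks and pg_stat_activity views:
SELECT a.pid, a.query, l.locktype, l.mode, l.granted
FROM   pg_locks l
JOIN   pg_stat_activity a ON a.pid = l.pid
WHERE  NOT l.granted;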
Long transactions also have a negative impact on autovacuum. Does the execution time increase over time? If so, check for table bloat.
Performance usually improves when big transactions are divided into smaller ones. Unfortunately, the operation is then no longer atomic, and there is no golden rule for the optimal batch size.
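A sketch of batching by primary-key range, assuming fid is an integer key; each range runs as its own statement (and, outside an explicit transaction block, its own transaction), so locks and the work to roll back stay small:
UPDATE amazon_v2 AS v
SET    states_id = a.states_id,
       geom      = a.geom
FROM   amazon AS a
WHERE  v.fid = a.fid
  AND  v.fid BETWEEN 1 AND 1000000;   -- then 1000001..2000000, and so on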
You should also follow advice from https://stackoverflow.com/a/50708451/6702373
Let's sum it up:
Update modified rows only (if only a few rows are modified)
Check locks
Check table bloat
Check hardware utilization (related to other issues)
Split the operation into batches.
Replace updates with delete/truncate & insert/copy (this works if the update changes most rows).
Partition the table (if nothing else helps)

Generated de-normalised View table

We have a system that makes use of a database view, which takes data from a few reference (lookup) tables and then does a lot of pivoting and complex work on a hierarchy table of (pretty much fixed and static) locations, returning a de-normalised view of the data to the application.
This view is getting slow, as new requirements are added.
A solution that may be an option would be to create a normal table, select from the view into this table, and let the application use that highly indexed and fast table for its querying.
The issue is, I guess, that if the underlying tables change, the new table will show stale results. But the data that drives this table changes very infrequently, and if it does, a business/technical process could be put in place so that an 'Update the Table' procedure is run to refresh the data. Or even an update/insert trigger on the primary driving table?
Is this practice advised/ill-advised? And are there ways of making it safer?
The ideal solution is to optimise the underlying queries.
In SSMS, run the slow query with the actual execution plan included (Ctrl + M); this gives you a graphical representation of how the query is being executed against your database.
Another helpful tool is to turn on IO statistics, since IO is usually the main bottleneck with queries. Put this line at the top of your query window:
SET STATISTICS IO ON;
Check whether SQL recommends any missing indexes (displayed in green in the execution plan). As you say the data changes infrequently, it should be safe to add additional indexes if needed.
In the execution plan you can hover your mouse over any element for more information. Check the estimated rows against the actual rows returned; if they vary greatly, update the statistics for the tables, which can help the query optimiser find the best execution plan.
To do this for all tables in a database:
USE [Database_Name]
GO
exec sp_updatestats
Still no luck in optimising the view / query?
Be careful with update triggers: if the schema changes on the view/source table (say you add a new column), the new column will not be carried into your 'optimised' table unless you also update the trigger.
If it is not a business requirement to report on real-time data, there is not too much harm in having a separate optimized table for reporting (much like a data mart); just use a SQL Agent job to refresh it nightly during non-peak hours.
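A sketch of the refresh step such a job could run, assuming a hypothetical view dbo.vComplexPivot and a reporting table dbo.ReportCache with matching columns (the table is briefly empty during the load, which is why off-peak scheduling matters):
TRUNCATE TABLE dbo.ReportCache;

INSERT INTO dbo.ReportCache
SELECT *
FROM dbo.vComplexPivot;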
There are a few cons to this approach though:
More storage space / duplicated data
More complex database
Additional workload during the refresh
Decreased cache hits

Time complexity: UPDATE ... WHERE vs UPDATE ALL

I have a database table DuplicatesRemoved with a possibly large number of records. I execute certain operations to remove duplicate users in my application, and every time I remove the duplicates, I keep track of the duplicate UserIDs by storing them in this table DuplicatesRemoved.
This table contains a bit field HistoryRecord. I need to update this field at the end of every "RemoveDuplicates" operation.
I do NOT have any indexes on DuplicatesRemoved.
I am wondering which of these would be better?
1.
UPDATE DuplicatesRemoved SET HistoryRecord=1 WHERE HistoryRecord<>1
OR
2.
UPDATE DuplicatesRemoved SET HistoryRecord=1
Will Query #1 take less time than Query #2?
I have referred to this question but am still not sure which one would be better for me.
In the first option:
UPDATE DuplicatesRemoved SET HistoryRecord=1 WHERE HistoryRecord<>1
You have to find those records and update only those.
In the second option:
UPDATE DuplicatesRemoved SET HistoryRecord=1
You have to update the entire table.
So the first option will be better, assuming you can find those records quickly. It also minimizes the number of locks acquired during the update and the total size of the transaction that the engine writes to the log file (i.e. the records it must be able to roll back).
Showing the execution plan will help in this decision.
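If it is worth making that lookup cheap, one option is a small filtered index; a sketch, assuming HistoryRecord is a non-nullable bit column (write the UPDATE's predicate as HistoryRecord = 0 so it matches the filter):
CREATE NONCLUSTERED INDEX IX_DuplicatesRemoved_Pending
ON DuplicatesRemoved (HistoryRecord)
WHERE HistoryRecord = 0;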
In databases, you measure the number of disk accesses to evaluate the complexity of a query, since reading something from external memory is orders of magnitude slower than performing a few operations in main memory.
The two queries, if no index is present, have the same number of disk accesses, since both require the complete scan of the relation.

What does exec sp_updatestats do?

What is the use of sp_updatestats? Can I run that in the production environment for performance improvement?
sp_updatestats updates all statistics for every table in the database where even a single row has changed. It does so using the default sample, meaning it does not scan all rows in the table, so it will likely produce less accurate statistics than the alternatives.
If you have a maintenance plan that includes 'rebuild indexes', it will also refresh statistics, and these will be more accurate because an index rebuild scans all rows; there is no need to update stats again after rebuilding indexes.
Manually updating a particular statistics object or table with the UPDATE STATISTICS command gives you much better control over the process. For automating it, take a look here.
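For example (the table and statistics names are placeholders):
UPDATE STATISTICS dbo.SomeTable WITH FULLSCAN;          -- all statistics on one table, scanning every row
UPDATE STATISTICS dbo.SomeTable IX_SomeTable_SomeCol;   -- a single statistics object, default sampling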
Auto-update fires only when the optimizer decides it has to, and the math has changed over time: older versions fired auto-update after roughly 500 + 20% of the table's rows had changed, while newer versions (the default from 2016, available earlier via trace flag 2371) use SQRT(1000 * table rows), which means it fires more often on large tables. Temporary tables behave differently, of course.
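For a 1,000,000-row table, for example, the old rule needs about 500 + 200,000 = 200,500 changed rows before auto-update fires, while SQRT(1000 * 1,000,000) ≈ 31,623, so the newer rule triggers far sooner.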
To conclude, sp_updatestats could actually do more damage than good, and is the least recommendable option.