Snowflake database: question on the performance of a table stored in Snowflake - sql

We have continuous inserts, updates, and deletes on a table in a Snowflake DB. Can this slow down the table's performance over time?

Yes. For two reasons.
First, INSERT, UPDATE, and DELETE operations fragment the micro-partition data. Even if the same number of rows is present after N hours/days, the layout of the rows can drift out of alignment with the access patterns of the queries you run, so your performance profile can degrade from highly pruned partition reads to full table reads.
Second, with a large number of changes, even if the data ends up perfectly ordered, the sheer fact that many changes are being made means you end up with far too many micro-partitions, which slows down your SQL compilation.
You can also get bad performance if you run INSERT, UPDATE, and DELETE against the same table at the same time, as the second operation will be blocked by the first. This can waste wall-clock time and credit allocation (if they run on different warehouses).
Some things you can do to avoid this: run clustering, and rebuild the tables during down time. Or don't delete the data at all; instead, insert the keys into "delete tables" and then LEFT JOIN and exclude the matches (a sketch follows). We have done all of the above.
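A minimal sketch of that "delete tables" pattern; the names (events, events_deleted, event_id, event_ts) are hypothetical, not from the original answer:

-- Record keys to remove instead of deleting in place:
INSERT INTO events_deleted (event_id)
SELECT event_id FROM events
WHERE event_ts < DATEADD(day, -90, CURRENT_TIMESTAMP());

-- Readers exclude the "deleted" rows with an anti-join:
SELECT e.*
FROM events e
LEFT JOIN events_deleted d ON d.event_id = e.event_id
WHERE d.event_id IS NULL;

-- During down time, rebuild the table cleanly ordered for pruning:
CREATE OR REPLACE TABLE events AS
SELECT e.*
FROM events e
LEFT JOIN events_deleted d ON d.event_id = e.event_id
WHERE d.event_id IS NULL
ORDER BY event_ts;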

Related

SQL How to properly create a summary table?

I have underlying tables whose data changes constantly. Every minute or so, I run a stored procedure to summarize the data in those underlying tables into a summary table. The summarization takes a long time (~30s), so it does not make sense to use a "summary view." Additionally, the summary table is constantly accessed by multiple users; it needs to be quick and responsive, and it cannot be down.
To solve this, I do the following in the stored procedure:
Summarize the data into "new summary table" (this can take as long as it needs because the "current summary table" is serving the needs of the users)
Drop the "current summary table"
Rename "new summary table" to "current summary table"
My questions are:
Is this safe/proper?
What happens if a user tries to access the "current summary table" when the summarization procedure is between steps 2 and 3 above?
What is the right way to do this? At the end of the day, I just need the summary to always be quickly accessible (this is important) and up-to-date (within a minute or so).
By using triggers on the detail tables, you can keep the summary in sync. For things like averages, you need to track sum and count in the summary table as well, so the average can be recomputed. If you have bulk churn, per-row triggers may carry higher overhead than a single trigger over all rows of the operation (assuming SQL Server offers both flavors of trigger, as Oracle does). Inserts might create a summary row or update one, deletes may update or delete the summary row, and updates might change a key and so do both. Of course, there may be multiple kinds of summary row for any detail row.
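A minimal sketch of the sum-and-count idea as a SQL Server statement-level trigger; the tables and columns are hypothetical, and this only handles inserts into groups that already exist in the summary:

-- dbo.Detail(GroupId, Amount); dbo.Summary(GroupId, TotalAmount, RowCnt)
CREATE TRIGGER dbo.trg_Detail_Insert ON dbo.Detail
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Aggregate the whole inserted rowset once, then fold it in:
    UPDATE s
       SET s.TotalAmount = s.TotalAmount + i.AmountSum,
           s.RowCnt      = s.RowCnt + i.RowsAdded
    FROM dbo.Summary AS s
    JOIN (SELECT GroupId, SUM(Amount) AS AmountSum, COUNT(*) AS RowsAdded
          FROM inserted
          GROUP BY GroupId) AS i
      ON i.GroupId = s.GroupId;
    -- The average is then TotalAmount / RowCnt at query time.
END;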
Oracle had a materialized view, so maybe SQL Server has that, too. Oh look, it does! Google is my friend, so much to remember! It would be something like a shorthand for the above, at best.
Such triggers can add a lot of delay to detail-table churn. Regenerating the summary table with a periodic query might suffice for some uses. A procedure can truncate the previous table for reuse, generate the new summary into it, and then swap the names inside a transaction. If there is a timestamp in or for the table, the procedure can skip no-change updates. The lock, disk, and CPU overhead of querying is often a lot less than that of churn.
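A sketch of that name swap in T-SQL (object names are hypothetical; sp_rename takes a brief schema-modification lock, so concurrent readers queue instead of seeing a missing table):

BEGIN TRAN;
EXEC sp_rename 'dbo.SummaryCurrent', 'SummaryOld';
EXEC sp_rename 'dbo.SummaryNew', 'SummaryCurrent';
COMMIT TRAN;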
Some summaries, like median, are very hard to support except by a view, but they can run fast if indexed (with a non-clustered, sorted index rather than a hash index), as queries can be fulfilled right from non-clustered indexes. Excess indexes slow transactions (churn), so many use replicated tables for reporting, with few, narrow indexes on the parent transaction table and report-oriented indexes on the replicated table.

Postgres SQL statement performance

I've a Postgres instance running on a 16-core/32 GB Windows Server workstation.
I followed performance improvements tips I saw in places like this: https://www.postgresql.org/docs/9.3/static/performance-tips.html.
When I run an update like:
analyze;
update amazon_v2
set states_id = amazon.states_id,
geom = amazon.geom
from amazon
where amazon_v2.fid = amazon.fid
where fid is the primary key in both tables and both have 68M records, it takes almost a day to run.
Is there any way to improve the performance of SQL statements like this? Should I write a stored procedure that processes it record by record, for example?
You don't show the execution plan, but I bet it's performing a full table scan on amazon_v2 and an index seek on amazon.
I don't see much room to improve performance here, since the plan is already close to optimal. The only thing I can think of is to use table partitioning and parallelize the execution.
Another, totally different strategy is to update only the "modified" rows. Maybe you can track those, to avoid updating all 68 million rows every time.
Your query is executed in a very long transaction. The transaction may be blocked by other writers; query pg_locks to check.
Long transactions have a negative impact on autovacuum performance. Does execution time increase over time? If so, check for table bloat.
Performance usually improves when big transactions are divided into smaller ones. Unfortunately, the operation is then no longer atomic, and there is no golden rule for the optimal batch size.
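A minimal sketch of keyed batching for the UPDATE above, assuming fid is an integer key; COMMIT inside a DO block needs PostgreSQL 11+, so on older versions drive the loop from a client script instead:

DO $$
DECLARE
    lo   bigint := 0;
    hi   bigint;
    step bigint := 100000;  -- tune to your log/autovacuum tolerance
BEGIN
    SELECT max(fid) INTO hi FROM amazon_v2;
    WHILE lo <= hi LOOP
        UPDATE amazon_v2 v
           SET states_id = a.states_id,
               geom      = a.geom
          FROM amazon a
         WHERE v.fid = a.fid
           AND v.fid >= lo AND v.fid < lo + step;
        COMMIT;  -- keeps each transaction short
        lo := lo + step;
    END LOOP;
END $$;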
You should also follow advice from https://stackoverflow.com/a/50708451/6702373
Let's sum it up:
Update modified rows only (if only a few rows are modified)
Check locks
Check table bloat
Check hardware utilization (related to other issues)
Split the operation into batches.
Replace updates with delete/truncate & insert/copy (this works if the update changes most rows).
(if nothing else helps) Partition the table

Selecting 80% of rows and table lock

One of my colleagues came to me with this statement:
Having a SELECT on a table that fetches 80% of the rows, with a WHERE clause on a column with an index, will lock the table. So to avoid that, add a WITH (NOLOCK) in your FROM clause.
His only argument was: "Believe me, I've experienced it myself." I cannot find proper documentation for this.
As far as I know, WITH (NOLOCK) only affects the table by letting UPDATEs and INSERTs occur while selecting, and that can lead to dirty reads.
Is my colleague's assumption correct?
I think you're referring to lock escalation, https://technet.microsoft.com/en-us/library/ms184286(v=sql.105).aspx , combined with a table scan caused by an index with bad selectivity, and some possibilities for blocking.
If the statistics on a non-clustered index show that the number of rows returned from a table for a specific value exceeds some threshold, then the optimizer will choose a table scan to find the corresponding rows instead of an index seek with corresponding bookmark lookups, because bookmark lookups become slow in large quantities.
I typically tell people that you want that percentage to be 5% or lower, but sometimes it will still index seek up to 10% or so. At 80%, it's definitely going to table scan.
Also, since the query is doing a table scan, the query has to be able to acquire some kind of lock on every single row in the table. If there are any other queries that are running performing updates, or otherwise preventing locks from being acquired on even a single row, the query will have to wait.
With lock escalation, it's not a percentage, but instead a specific magic number of 5,000. A query generally starts reading rows using row locks. If a single query reads 5,000 or more rows, it will escalate the locks that it is using against the table from row and/or page locks to full table locks.
This is when deadlocks happen, because another query may be trying to do the same thing.
These locks don't necessarily have anything to do with inserts/updates.
This is an actual thing. No, this does not mean that you should use NOLOCK. You'd be much better off looking at READPAST, TABLOCK, or TABLOCKX, https://msdn.microsoft.com/en-us/library/ms187373.aspx , if you're having issues with deadlocks.
Do not do any of these things just out of habit and only look into them for specific instances with highly transactional tables that are experiencing actual problems.
By default, writers have priority, and readers will wait on writers to finish. WITH (NOLOCK) allows readers to read uncommitted data, avoiding waits on writers. For read-only queries against very large tables, this is OK if you are querying data such as an old partition, or pulling back data that is not going to change often and whose changes are not critical to the presentation. This is the same as using the SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED directive in stored procedures.
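To illustrate that equivalence with a hypothetical table (dbo.Sales is mine, not from the question):

-- Hint form, per table reference:
SELECT COUNT(*) FROM dbo.Sales WITH (NOLOCK)
WHERE SaleDate < '20170101';

-- Session form, applies to every query that follows:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM dbo.Sales
WHERE SaleDate < '20170101';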

T-SQL Optimize DELETE of many records

I have a table that can grow to millions of records (50 million, for example). Every 20 minutes, records that are older than 20 minutes are deleted.
The problem is that if the table has that many records, the deletion can take a lot of time, and I want to make it faster.
I cannot do a TRUNCATE TABLE, because I want to remove only the records that are older than 20 minutes. I suppose that when doing the DELETE and filtering the records to be deleted, the server creates a log file or something, and that takes a lot of time?
Am I right? Is there a flag or option I can turn off to optimize the delete, and then turn back on afterwards?
To expand on the batch delete suggestion, I'd suggest you do this far more regularly (every 20 seconds, perhaps); batch deletions are easy:
WHILE 1 = 1
BEGIN
DELETE TOP ( 4000 )
FROM YOURTABLE
WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE())
IF @@ROWCOUNT = 0
BREAK
END
Your inserts may lag slightly whilst they wait for the locks to release but they should insert rather than error.
In regards to your table, though: with this much traffic I'd expect it to sit on a very fast RAID 10 array, perhaps even partitioned. Are your disks up to it? Are your transaction logs on different disks from your data files? They should be.
EDIT 1 - Response to your comment
To put a database into SIMPLE recovery:
ALTER DATABASE [DatabaseName] SET RECOVERY SIMPLE;
This basically turns off transaction logging on the given database, meaning that in the event of data loss you would lose all data since your last full backup. If you're OK with that, this should save a lot of time when running large transactions. (Note that while a transaction is running, logging still takes place in SIMPLE, to enable rolling the transaction back.)
If there are tables within your database where you can't afford to lose data, you'll need to leave the database in FULL recovery mode (i.e. every transaction gets logged, and hopefully flushed to *.trn files by your server's maintenance plans). As I stated above, though, there is nothing stopping you from having two databases, one in FULL and one in SIMPLE: the FULL database for tables where you can't afford to lose any data (i.e. you can apply the transaction logs to restore data to a specific point in time), and the SIMPLE database for these massive, high-traffic tables where you can tolerate data loss in the event of a failure.
All of this is relevant assuming you're creating full (*.bak) backups every night and flushing your log files to *.trn files every half hour or so.
In regards to your index question: it's imperative that your date column is indexed. If you check your execution plan and see a TABLE SCAN, that's an indicator of a missing index.
Your date column, I presume, is a DATETIME with a constraint setting the DEFAULT to GETDATE()?
You may find that you get better performance by replacing that with a BIGINT in YYYYMMDDHHMMSS form and then applying a CLUSTERED index to that column. Note, however, that you can only have one clustered index per table, so if the table already has one you'll need to use a non-clustered index. (In case you didn't know, a clustered index basically tells SQL to store the information in that order, meaning that when you delete rows older than 20 minutes, SQL can literally delete them sequentially rather than hopping from page to page.)
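A minimal sketch of that layout; the table and the key values are purely illustrative:

CREATE TABLE dbo.Readings (
    StampKey bigint NOT NULL,        -- YYYYMMDDHHMMSS, e.g. 20240131235959
    Payload  varchar(100) NOT NULL
);
CREATE CLUSTERED INDEX CIX_Readings_StampKey ON dbo.Readings (StampKey);

-- Age-out deletes now remove one contiguous range of clustered pages:
DELETE FROM dbo.Readings WHERE StampKey < 20240131235939;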
The log problem is probably due to the number of records deleted in one transaction; to make things worse, the engine may be requesting a lock per record (or per page, which is not as bad).
The one big thing here is how you determine the records to be deleted. I'm assuming you use a datetime field; if so, make sure you have an index on that column, otherwise it's a sequential scan of the table that will really penalize your process.
There are two things you can do, depending on user concurrency and the timing of the delete:
If you can guarantee that no one is going to read or write while you delete, you can lock the table in exclusive mode, delete (this takes only one lock from the engine), and release the lock (see the sketch after this list).
You can use batch deletes: write a script with a cursor that provides the rows you want to delete, then begin a transaction and commit every X records (ideally 5,000), so you keep the transactions short and don't take that many locks.
Take a look at the query plan for the delete process and see what it shows; a sequential scan of a big table is never good.
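A sketch of that exclusive-lock variant (hypothetical names; only safe when nobody else needs the table during the delete):

BEGIN TRAN;
DELETE FROM dbo.BigTable WITH (TABLOCKX)
WHERE CreatedAt < DATEADD(MINUTE, -20, GETDATE());
COMMIT TRAN;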
Unfortunately for the purpose of this question, and fortunately for the sake of consistency and recoverability of databases in SQL Server, putting a database into Simple recovery mode DOES NOT disable logging.
Every transaction still gets logged before committing it to the data file(s), the only difference would be that the space in the log would get released (in most cases) right after the transaction is either rolled back or committed in the Simple recovery mode, but this is not going to affect the performance of the DELETE statement in one way or another.
I had a similar problem when I needed to delete more than 70% of the rows from a big table with 3 indexes and a lot of foreign keys.
For this scenario, I saved the rows I wanted in a temp table, truncated the original table and reinserted the rows, something like:
SELECT * INTO #tempuser FROM [User] WHERE [Status] >= 600;
TRUNCATE TABLE [User];
INSERT [User] SELECT * FROM #tempuser;
I learned this technique from a link that explains:
DELETE is a fully logged operation, and can be rolled back if something goes wrong
TRUNCATE removes all rows from a table without logging the individual row deletions
In the article you can explore other strategies for reducing the delay in deleting many records; this one worked for me.

Optimizing Delete on SQL Server

Deletes on SQL Server are sometimes slow, and I've often needed to optimize them to reduce the time taken.
I've been googling a bit for tips on how to do that, and I've found diverse suggestions.
I'd like to know your favorite and most effective techniques to tame the delete beast, and how and why they work.
What I have so far:
be sure foreign keys have indexes
be sure the where conditions are indexed
use of WITH ROWLOCK
destroy unused indexes, delete, rebuild the indexes
now, your turn.
The following article, Fast Ordered Delete Operations, may be of interest to you.
Performing fast SQL Server delete operations
The solution focuses on utilising a view in order to simplify the execution plan produced for a batched delete operation. This is achieved by referencing the given table once rather than twice, which in turn reduces the amount of I/O required.
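A sketch of that view technique; dbo.BigTable and CreatedAt are hypothetical stand-ins, with CreatedAt assumed indexed:

CREATE VIEW dbo.v_BigTable_Oldest
AS
SELECT TOP (2500) *
FROM dbo.BigTable
ORDER BY CreatedAt;
GO
-- Each execution deletes the next oldest batch; the plan references
-- the base table once instead of twice:
DELETE FROM dbo.v_BigTable_Oldest;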
I have much more experience with Oracle, but very likely the same applies to SQL Server as well:
when deleting a large number of rows, issue a table lock, so the database doesn't have to do lots of row locks
if the table you delete from is referenced by other tables, make sure those other tables have indexes on the foreign key column(s) (otherwise the database will do a full table scan for each deleted row on the other table to ensure that deleting the row doesn't violate the foreign key constraint)
I wonder if it's time for garbage-collecting databases? You mark a row for deletion and the server deletes it later during a sweep. You wouldn't want this for every delete - because sometimes a row must go now - but it would be handy on occasion.
Summary of Answers through 2014-11-05
This answer is flagged as community wiki since this is an ever-evolving topic with a lot of nuances, but very few possible answers overall.
The first issue is you must ask yourself what scenario you're optimizing for? This is generally either performance with a single user on the db, or scale with many users on the db. Sometimes the answers are the exact opposite.
For single user optimization
Hint a TABLELOCK
Remove indexes not used in the delete then rebuild them afterward
Batch using something like SET ROWCOUNT 20000 (or whatever, depending on log space) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (@@ROWCOUNT = 0)
If deleting a large % of table, just make a new one and delete the old table
Partition the rows to delete, then drop the partition.
For multi user optimization
Hint row locks
Use the clustered index
Design clustered index to minimize page re-organization if large blocks are deleted
Update "is_deleted" column, then do actual deletion later during a maintenance window
For general optimization
Be sure FKs have indexes on their source tables
Be sure WHERE clause has indexes
Identify the rows to delete in the WHERE clause with a view or derived table instead of referencing the table directly.
To be honest, deleting a million rows from a table scales just as badly as inserting or updating a million rows. It's the size of the rowset that's the problem, and there's not much you can do about that.
My suggestions:
Make sure that the table has a primary key and clustered index (this is vital for all operations).
Make sure that the clustered index is such that minimal page re-organisation would occur if a large block of rows were to be deleted.
Make sure that your selection criteria are SARGable.
Make sure that all your foreign key constraints are currently trusted.
(if the indexes are "unused", why are they there at all?)
One option I've used in the past is to do the work in batches. The crude way would be to use SET ROWCOUNT 20000 (or whatever) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (@@ROWCOUNT = 0).
This might help reduce the impact upon other systems.
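A sketch of that crude loop with hypothetical names (note that on SQL Server 2005+ the TOP clause is the preferred way to limit DML):

DECLARE @cutoff datetime = DATEADD(DAY, -30, GETDATE());
SET ROWCOUNT 20000;                -- caps rows affected per statement
WHILE 1 = 1
BEGIN
    DELETE FROM dbo.BigTable WHERE CreatedAt < @cutoff;
    IF @@ROWCOUNT = 0 BREAK;
    WAITFOR DELAY '00:00:02';      -- let other work breathe
END
SET ROWCOUNT 0;                    -- always reset afterwards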
The problem is you haven't defined your conditions enough. I.e. what exactly are you optimizing?
For example, is the system down for nightly maintenance and no users are on the system? And are you deleting a large % of the database?
If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename. If deleting a small %, you likely want to batch things in as large batches as your log space allows. It entirely depends on your database, but dropping indexes for the duration of the rebuild may hurt or help -- if even possible due to being "offline".
If you're online, what's the likelihood your deletes are conflicting with user activity (and is user activity predominantly read, update, or what)? Or, are you trying to optimize for user experience or speed of getting your query done? If you're deleting from a table that's frequently updated by other users, you need to batch but with smaller batch sizes. Even if you do something like a table lock to enforce isolation, that doesn't do much good if your delete statement takes an hour.
When you define your conditions better, you can pick one of the other answers here. I like the link in Rob Sanders' post for batching things.
If you have lots of foreign key tables, start at the bottom of the chain and work up. The final delete will go faster and block fewer things if there are no child records to cascade-delete (which I would NOT turn on if I had a large number of child tables, as it will kill performance).
Delete in batches.
If you have foreign key tables that are no longer being used (you'd be surprised how often production databases end up with old tables nobody will get rid of), get rid of them, or at least break the FK/PK connection. There's no sense checking a table for records if it isn't being used.
Don't delete; mark records as deleted and then exclude marked records from all queries. This is best set up at the time of database design. A lot of people use this because it is also the fastest way to recover records accidentally deleted. But it is a lot of work to set up in an already existing system.
I'll add another one to this:
Make sure your transaction isolation level and database options are set appropriately. If your SQL server is set not to use row versioning, or you're using an isolation level on other queries where you will wait for the rows to be deleted, you could be setting yourself up for some very poor performance while the operation is happening.
On very large tables where you have a very specific set of criteria for deletes, you could also partition the table, switch out the partition, and then process the deletions.
The SQLCAT team has been using this technique on really really large volumes of data. I found some references to it here but I'll try and find something more definitive.
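A sketch of the switch-out step; all objects here are hypothetical, and the staging table must match the source table's schema, indexes, and filegroup:

-- dbo.BigTable is partitioned by date; partition 1 holds the oldest rows.
ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Staging;
-- The switch is a metadata-only operation; emptying the rows is then instant:
TRUNCATE TABLE dbo.BigTable_Staging;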
I think the big trap with deletes that kills performance is that, after each row is deleted, SQL updates all the related indexes for every column in that row. What about dropping all the indexes before a bulk delete and rebuilding them afterwards?
There are deletes and then there are deletes. If you are aging out data as part of a trim job, you will hopefully be able to delete contiguous blocks of rows by clustered key. If you have to age out data from a high volume table that is not contiguous it is very very painful.
If it is true that UPDATEs are faster than DELETEs, you could add a status column called DELETED and filter on it in your selects. Then run a proc at night that does the actual deletes.
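A minimal sketch of that status-column approach (dbo.Orders and its columns are hypothetical):

ALTER TABLE dbo.Orders ADD Deleted bit NOT NULL
    CONSTRAINT DF_Orders_Deleted DEFAULT (0);

-- "Deleting" becomes a cheap update:
UPDATE dbo.Orders SET Deleted = 1 WHERE OrderId = 42;

-- The nightly proc does the real work:
DELETE FROM dbo.Orders WHERE Deleted = 1;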
Do you have foreign keys with referential integrity activated?
Do you have triggers active?
Simplify any use of functions in your WHERE clause! Example:
DELETE FROM Claims
WHERE dbo.YearMonthGet(DataFileYearMonth) = dbo.YearMonthGet(@DataFileYearMonth)
This form of the WHERE clause required 8 minutes to delete 125,837 records.
The YearMonthGet function composed a date with the year and month from the input date and set day = 1. This was to ensure we deleted records based on year and month but not day of month.
I rewrote the WHERE clause to:
WHERE YEAR(DataFileYearMonth) = YEAR(@DataFileYearMonth)
AND MONTH(DataFileYearMonth) = MONTH(@DataFileYearMonth)
The result: The delete required about 38-44 seconds to delete those 125,837 records!