Redshift: removal of explicit locks leads to missing rows/duplicated rows. Best resolution? Redis?

I have a use case where some of our Redshift tables are used by multiple data scientists at the same time for tuning. If they are tuning and importing data at the same time, we end up with missing rows and duplicated rows.
A while ago, they removed the explicit lock from the table to reduce the number of loads that would hang on those locks.
I'm assuming that this removal of the explicit lock is causing the duplicated and missing rows.
In terms of a path forward, I was thinking about using Kinesis, Redis, or something similar to batch these into one import instead of individual inserts (which aren't great for Redshift anyway). Or maybe the real solution is to add explicit locking back to the table and deal with the hanging loads.
Any guidance would be appreciated. Thanks
Putting the explicit locks back on works, but other procedures hang behind the table/proc locks and things slow down significantly.

Yes, removing locks is causing this, and they shouldn't do it. Tell them to stop.
They are likely running into this because they aren't COMMITting their changes. Changing their connections to AUTOCOMMIT might fix things. If one person changes a table but doesn't COMMIT the change, they have a local copy and a lock until they do. If they never disconnect, this situation can last forever. If many people are doing this, you have a mess of local copies and locks waiting to be resolved, but nobody COMMITs.
When people come from lock-on-write databases they can get confused about what is happening. Read up on MVCC database coherency.
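For what it's worth, here is a minimal sketch of a load that keeps explicit locking but only holds the lock for one short transaction. The table name, S3 path, and IAM role are placeholders, not your actual objects:

```sql
-- Sketch only: serialize concurrent loads by taking an explicit lock
-- inside a short transaction, then releasing it promptly with COMMIT.
-- tuning_data and the S3/IAM values are placeholders.
BEGIN;
LOCK tuning_data;
COPY tuning_data
FROM 's3://my-bucket/tuning/batch_001.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-load-role>'
CSV;
COMMIT;  -- a session that never COMMITs keeps the lock (and its own snapshot) open
```

The exact syntax matters less than the pattern: lock, load, COMMIT immediately, so other loads only hang for the length of one COPY rather than for the life of someone's idle session.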

Related

Slow SQL Queries against a table causing blocking

What could be possible reasons why all statements executing against a table would run extremely slowly, causing blocking? No particular query was the culprit. At some point, whatever was causing it ended, all statements started executing normally, and all the blocking cleared up.
A corrupt index could cause the issue. If there are indexes, you can recreate them. If you're using table replication and the replication is out of sync, this can cause slow queries, especially if the tables handle a high volume of transactions. If you haven't done so, you may want to log the slow queries, as even queries that take 0.5 seconds can quickly cause a bottleneck on high-traffic systems. Those are my "surface" thoughts. Other considerations such as disk space, RAM, disk integrity, etc. also come to mind. You may want to check your system logs to see if anything shows up there during the time you experienced the issue.
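The question doesn't say which database engine this was, but as one concrete example, on SQL Server checking fragmentation and rebuilding the indexes on the affected table might look roughly like this (dbo.Orders is a made-up table name):

```sql
-- Sketch only (T-SQL): check fragmentation, then rebuild all indexes on the table.
-- dbo.Orders is a hypothetical table name.
SELECT index_id, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED');

ALTER INDEX ALL ON dbo.Orders REBUILD;
```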

Postgresql Application Insertion and Trigger Performance

I'm working on designing an application with a SQL backend (PostgreSQL) and I've got some design questions. In short, the DB will serve to store network events as they occur on the fly, so insertion speed and performance are critical, because 'real-time' actions depend on these events. The data is dumped into a speedy default format across a few tables, and I am currently using PostgreSQL triggers to put this data into some other tables used for reporting.
On a typical event, data is inserted into two different tables that share the same primary key (an event ID). I then need to move and rearrange the data into some different tables that are used by a web-based reporting interface. My primary goal/concern is to keep the load off the initial insertion tables, so they can do their thing. Reporting is secondary, but it would still be nice for this to occur on the fly via triggers as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance-wise, does this make sense, or am I totally off?
Once the data is in the appropriate reporting tables, I won't need to hang on to the data in the insertion tables too long, so I'll keep those regularly pruned for insertion performance. In thinking about this scenario, which I'm sure is semi-common, I've come up with three options:
Use triggers to trigger on the initial row insert and populate the reporting tables. This was my original plan.
Use triggers to copy the insertion data to a temporary table (same format), and then trigger or cron to populate the reporting tables. This was just a thought, but I figure that a simple copy operation to a temp table will offload any of the querying done by the triggers in the solution above.
Modify my initial output program to dump all the data to a single table (vs across two) and then trigger on that insert to populate the reporting tables. So where solution 1 is a multi-table to multi-table trigger situation, this would be a single-table source to multi-table trigger.
Am I over thinking this? I want to get this right. Any input is much appreciated!
You may experience a slight performance hit, since there are more "things" to do (although they should not affect operations in any way). But using triggers/other PL is a good way to keep that to a minimum, since they are executed faster than code that gets sent from your application to the DB server.
I would go with your first idea (1), since it seems to me the cleanest and most efficient way.
(2) is the most performance-hungry solution, since cron will do more queries than the other solutions that use server-side functions. (3) would be possible but will result in an "uglier" database layout.
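For what it's worth, option 1 in plain PL/pgSQL might look roughly like the sketch below. The events source table and the event_report columns are made up for illustration:

```sql
-- Sketch only: trigger-maintained reporting table (option 1).
-- Table and column names are hypothetical.
CREATE TABLE event_report (
    event_id    bigint PRIMARY KEY,
    src_ip      inet,
    occurred_at timestamptz
);

CREATE OR REPLACE FUNCTION copy_to_report() RETURNS trigger AS $$
BEGIN
    INSERT INTO event_report (event_id, src_ip, occurred_at)
    VALUES (NEW.event_id, NEW.src_ip, NEW.occurred_at);
    RETURN NEW;  -- return value is ignored for AFTER triggers, but this is conventional
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_to_report
AFTER INSERT ON events
FOR EACH ROW EXECUTE PROCEDURE copy_to_report();
```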
This is an old one but adding my answer here.
Reporting is secondary, but it would still be nice for this to occur on the fly via triggers as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance-wise, does this make sense, or am I totally off?
That may be way off, I'm afraid, but in a few cases it may not be. It depends on the effects of caching on the reports. Keep in mind that disk I/O and memory are your commodities, and that writers and readers rarely block each other in PostgreSQL (unless they explicitly elevate locks: a SELECT ... FOR UPDATE will block writers, for example). Basically, if your tables fit comfortably in RAM, you are better off reporting from them, since you are keeping disk I/O free for the WAL segment commit of your event entry. If they don't fit in RAM, then you may have cache-miss issues induced by reporting. Here, materializing your views (i.e. making trigger-maintained tables) may cut down on these, but it has a significant complexity cost. This, btw, is your option 1. So I would chalk this one up provisionally as premature optimization. Also keep in mind you may induce cache misses and lock contention by materializing the views this way, so you might cause performance problems for inserts.
Keep in mind if you can operate from RAM with the exception of WAL commits, you will have no performance problems.
For #2. If you mean temporary tables as CREATE TEMPORARY TABLE, that's asking for a mess including performance issues and reports not showing what you want them to show. Don't do it. If you do this, you might:
Force PostgreSQL to replan your trigger on every insert (or at least once per session). Ouch.
Add overhead creating/dropping tables
Possibilities of OID wraparound
etc.....
In short I think you are overthinking it. You can get very far by bumping RAM up on your Pg box and making sure you have enough cores to handle the appropriate number of inserting sessions plus the reporting one. If you plan your hardware right, none of this should be a problem.

How do I optimize table after delete many records

I deleted many records from my table, but the DB size (Firebird) stayed the same. How do I decrease it?
I am looking for something similar to VACUUM in PostgreSQL.
This is one of the many pains of Firebird.
The best, and only really effective, way to do this is to backup and restore your database using gbak.
Firebird will occasionally run a sweep to remove the records from indexes etc., and regain the space for other use. In other words, as soon as the sweep has run, you will have the same performance as if the database file was smaller. You can enforce an immediate sweep, if that is what you are trying to do.
However, the size of the actual database file will not shrink, no matter what, except if you do a backup and restore. If size is a problem, use the -USE_ALL_SPACE parameter for gbak; it will prevent space from being reserved for future records, which will yield a smaller database.
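In practice the backup/restore cycle looks something like the sketch below. The file names are placeholders; add -user/-password as needed, and restore to a new file name and then swap the files, since -c won't overwrite an existing database:

```
gbak -b -g mydb.fdb mydb.fbk            # backup, -g skips garbage collection (see the FAQ below)
gbak -c -use_all_space mydb.fbk mydb_restored.fdb   # restore without reserving space for future records
```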
From the official FAQ:
Many users wonder why they don't get their disk space back when they delete a lot of records from the database.
The reason is that it is an expensive operation; it would require a lot of disk writes and memory, just like defragmenting a hard disk partition. The parts of the database (pages) that were used by such data are marked as empty, and Firebird will reuse them the next time it needs to write new data.
If disk space is critical for you, you can get the space back by doing a backup and then a restore. Since you're doing the backup to restore right away, it's wise to use the "inhibit garbage collection" or "don't use garbage collection" switch (-G in gbak), which will make the backup go A LOT FASTER. Garbage collection is used to clean up your database, and as it is a maintenance task, it's often done together with backup (as backup has to go through the entire database anyway). However, you're soon going to ditch that database file, and there's no need to clean it up.

How much does wrapping inserts in a transaction help performance on Sql Server?

Ok, so say I have 100 rows to insert and each row has about 150 columns (I know that sounds like a lot of columns, but I need to store this data in a single table). The inserts will occur at random (i.e. whenever a set of users decides to upload a file containing the data), about 20 times a month. However, the database will be under continuous load processing other functions of a large enterprise application. The columns are varchars, ints, and a variety of other types.
Is the performance gain of wrapping these inserts in a transaction (as opposed to running them one at a time) going to be huge, minimal, or somewhere in between?
Why?
EDIT:
This is for Sql Server 2005, but I'd be interested in 2000/2008 if there is something different to be said. Also I should mention that I understand the point about transactions being primarily for data-consistency, but I want to focus on performance effects.
It can have an impact, actually. The point of transactions is not how many you do; it's about keeping the data updates consistent. If you have rows that need to be inserted together and are dependent on each other, those are the records you wrap in a transaction.
Transactions are about keeping your data consistent. This should be the first thing you think about when using transactions. For example, if you have a debit (withdrawal) from your checking account, you want to make sure the credit (deposit) is also done. If either of those doesn't succeed, the whole "transaction" should be rolled back. Therefore, both actions MUST be wrapped in a transaction.
When doing batch inserts, break them up into 3,000 or 5,000 records and cycle through the set. 3,000-5,000 has been a sweet spot for me for inserts; don't go above that unless you've tested that the server can handle it. Also, I will put GOs in the batch at about every 3,000 or 5,000 records for inserts. For updates and deletes, I'll put a GO at about every 1,000, because they require more resources to commit.
If you're doing this from C# code, then in my opinion, you should build a batch import routine instead of doing millions of inserts one at a time through code.
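To illustrate the chunking idea above, a GO-separated script might be shaped like this sketch (dbo.ImportTarget and the values are hypothetical, and real chunks would hold a few thousand rows each):

```sql
-- Sketch only (T-SQL script for SSMS/sqlcmd): GO separates client batches.
-- dbo.ImportTarget is a hypothetical table.
BEGIN TRANSACTION;
INSERT INTO dbo.ImportTarget (Col1, Col2) VALUES ('a', 1);
INSERT INTO dbo.ImportTarget (Col1, Col2) VALUES ('b', 2);
-- ... up to roughly 3,000-5,000 rows in this chunk ...
COMMIT TRANSACTION;
GO

BEGIN TRANSACTION;
-- ... next chunk of rows ...
COMMIT TRANSACTION;
GO
```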
While transactions are a mechanism for keeping data consistent they actually have a massive impact on performance if they are used incorrectly or overused. I've just finished a blog post on the impact on performance of explicitly specifying transactions as opposed to letting them occur naturally.
If you are inserting multiple rows and each insert occurs in its own transaction there is a lot of overhead on locking and unlocking data. By encapsulating all inserts in a single transactions you can dramatically improve performance.
Conversely if you have many queries running against your database and have large transactions also occurring they can block each other and cause performance issues.
Transactions are definitively linked with performance, regardless of their underlying intent.
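As a sketch of what that looks like for the 100-row upload in the question (dbo.UploadData and the column names are made up):

```sql
-- Sketch only (T-SQL): one explicit transaction around the whole upload,
-- instead of 100 separately autocommitted INSERTs. dbo.UploadData is hypothetical.
BEGIN TRANSACTION;

INSERT INTO dbo.UploadData (Col1, Col2 /* , ... the other ~150 columns ... */)
VALUES ('value1', 42);
INSERT INTO dbo.UploadData (Col1, Col2)
VALUES ('value2', 43);
-- ... one INSERT per row in the uploaded file ...

COMMIT TRANSACTION;  -- roughly one log-flush wait here instead of one per INSERT
```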
It depends on what you call huge, but it will help (it really depends on the overall number of inserts you are doing). It will force SQL Server not to do a commit after every insert, which adds up over time. With 100 inserts, you probably won't notice too much of an increase, depending on how often it happens and what else is going on with the database.
As others have said, transactions have nothing to do with performance, but instead have to do with the integrity of your data.
That being said, worrying about the performance one way or the other when you're only talking about inserting 100 rows of data about 20 times a month (meaning 2000 records per month) is silly. Premature optimization is a waste of time; unless you have repeatedly tested the performance impact of these inserts (as small as they are, and as infrequent) and found them to be a major issue, don't worry about the performance. It's negligible compared to the other things you mentioned as being server load.
Transactions are not for performance but for data integrity. Depending on the implementation, there will be no real gain/loss of performance for only 100 rows (they will just be logged additionally, so they can all be rolled back).
Things to consider about the performance issues:
TAs will interact with other queries
writing TAs will lock tuples/pages/files
a commit might just be (depending on the lock protocol) an update of a timestamp
more logs might be written for TAs (one should be able to roll TAs back, but the DB might log extensively already, sequential logging is cheap)
the degree of isolation (I know that one can switch this level in some DBs - and that nearly nobody uses level 3)
All in all: use TAs for ensuring the integrity.
Practically: extremely. With large inserts, 100+ rows (provided that you've configured MySQL with an increased query size and transaction size to support monstrous queries/transactions; sorry, I don't remember the exact variable names), insert times can commonly be 10 times as fast, and sometimes much more.

When to commit changes?

Using Oracle 10g, accessed via Perl DBI, I have a table with a few tens of millions of rows that is being updated a few times per second while being read from much more frequently by another process.
Soon the update frequency will increase by an order of magnitude (maybe two).
Someone suggested that committing every N updates instead of after every update will help performance.
I have a few questions:
Will that be faster or slower, or does it depend? (I'm planning to benchmark both ways as soon as I can get a decent simulation of the new load.)
Why will it help/hinder performance?
If "it depends...", on what?
If it helps, what's the best value of N?
Why can't my local DBA give me a helpful straight answer when I need one? (Actually, I know the answer to that one.) :-)
EDIT:
#codeslave: Thanks. BTW, losing uncommitted changes is not a problem; I don't delete the original data used for updating till I am sure everything is fine. And yes, the cleaning lady did unplug the server, TWICE :-)
Some googling showed it might help because of issues related to rollback segments, but I still don't know a rule of thumb for N: every few tens? hundreds? thousands?
#diciu: Great info, I'll definitely look into that.
A commit results in Oracle writing stuff to disk, i.e. to the redo log file, so that whatever the transaction being committed has done can be recovered in the event of a power failure, etc.
Writing to a file is slower than writing to memory, so a commit will be slower if performed for many operations in a row rather than for a set of coalesced updates.
In Oracle 10g there's an asynchronous commit that makes it much faster but less reliable: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-6158695.html
PS I know for sure that, in a scenario I've seen in a certain application, changing the number of coalesced updates from 5K to 50K makes it faster by an order of magnitude (10 times faster).
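A rough PL/SQL sketch of the commit-every-N pattern, just to show the shape (stage_updates, target_tbl, and the batch size of 1,000 are made up):

```sql
-- Sketch only (PL/SQL): commit every N updates instead of after each one.
-- stage_updates, target_tbl, and the batch size are hypothetical.
DECLARE
    c_batch CONSTANT PLS_INTEGER := 1000;
    v_count PLS_INTEGER := 0;
BEGIN
    FOR r IN (SELECT id, new_val FROM stage_updates) LOOP
        UPDATE target_tbl SET val = r.new_val WHERE id = r.id;
        v_count := v_count + 1;
        IF MOD(v_count, c_batch) = 0 THEN
            COMMIT;  -- or COMMIT WRITE BATCH NOWAIT for the 10g asynchronous commit
        END IF;
    END LOOP;
    COMMIT;  -- commit the remainder
END;
/
```

One caveat: committing inside a loop that is still fetching from its cursor can eventually hit ORA-01555 ("snapshot too old") if undo space is tight, which is one more reason the right N is a matter of testing.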
Reducing the frequency of commits will certainly speed things up, however as you are reading and writing to this table frequently there is the potential for locks. Only you can determine the likelihood of the same data being updated at the same time. If the chance of this is low, commit every 50 rows and monitor the situation. Trial and error I'm afraid :-)
As well as reducing the commit frequency, you should also consider performing bulk updates instead of individual ones.
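A bulk-update sketch in Oracle terms, assuming the incoming data lands in a staging table first (stage_updates and target_tbl are hypothetical names):

```sql
-- Sketch only: one set-based MERGE instead of row-by-row updates.
-- stage_updates and target_tbl are hypothetical names.
MERGE INTO target_tbl t
USING stage_updates s
ON (t.id = s.id)
WHEN MATCHED THEN
    UPDATE SET t.val = s.new_val;

COMMIT;
```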
If you "don't delete the original data used for updating till [you are] sure everything is fine", then why don't you remove all those incremental commits in between, and rollback if there's a problem? It sounds like you effectively have built a transaction systems on top of transactions.
#CodeSlave, your question is answered by #stevechol: if I remove ALL the incremental commits, there will be locks. I guess if nothing better comes along I'll follow his advice: pick a number, monitor the load, and adjust accordingly, while applying #diciu's tweaks.
PS: The transaction on top of a transaction is just accidental. I get the files used for updates by FTP, and instead of deleting them immediately I set a cron job to delete them a week later (if no one using the application has complained). That means if something goes wrong I have a week to catch the errors.
Faster/Slower?
It will probably be a little faster. However, you run a greater risk of running into deadlocks, losing uncommitted changes should something catastrophic happen (cleaning lady unplugs the server), FUD, Fire, Brimstone, etc.
Why would it help?
Obviously fewer commit operations, which in turn means fewer disk writes, etc.
DBA's and straight answers?
If it were easy, you wouldn't need one.