I have a scheduled job that runs once a day, synchronizing entities between multiple APIs. I'm looking for a reliable way to pull "pages" of data from my DB, without downloading GBs worth of it in one go, using LIMIT and OFFSET.
From what I understand, starting a transaction at the beginning of the process and executing repeated SELECTs within it will ensure that no records in my result set are added or skipped due to other concurrent processes?
Hopefully, that would allow me to perform the synchronization job on the exact state of the DB records at the start of the transaction. It may also be worth knowing that the sync job itself won't alter the records in that result set.
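A minimal sketch of the paging loop described above, using Python's built-in sqlite3 module purely for illustration. The table `entities` and its columns are hypothetical. Whether repeated SELECTs inside one transaction really see the same snapshot depends on the database and isolation level: in PostgreSQL, for example, the default READ COMMITTED level takes a fresh snapshot per statement, so REPEATABLE READ or stricter would be needed. The ORDER BY matters too, since LIMIT/OFFSET without a stable order gives no guarantee about which rows land on which page.

```python
import sqlite3


def iter_pages(conn, page_size):
    """Yield pages of rows, all read inside a single transaction."""
    conn.execute("BEGIN")  # one transaction spans every page
    try:
        offset = 0
        while True:
            rows = conn.execute(
                "SELECT id, payload FROM entities ORDER BY id LIMIT ? OFFSET ?",
                (page_size, offset),
            ).fetchall()
            if not rows:
                break
            yield rows
            offset += page_size
    finally:
        conn.execute("COMMIT")  # nothing was written; this just ends the transaction


def make_demo_db(n_rows):
    """Build an in-memory database with a hypothetical 'entities' table."""
    # isolation_level=None: the module issues no implicit BEGIN/COMMIT of its own
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO entities (payload) VALUES (?)",
        [(f"row {i}",) for i in range(n_rows)],
    )
    return conn


page_sizes = [len(page) for page in iter_pages(make_demo_db(25), 10)]
print(page_sizes)
```

For large tables, paging on the last seen id (WHERE id > ? ORDER BY id LIMIT ?) usually scales better than a growing OFFSET, but that is a separate design choice from the transaction question.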
Given an SQL table with timestamped records: every once in a while an application App0 does something like foreach record in since(certainTimestamp) do process(record); commitOffset(record.timestamp). That is, it periodically consumes a batch of "fresh" data, processes it sequentially, commits success after each record, and then sleeps for a reasonable time (to accumulate another batch). That works perfectly with a single instance. But how do I load balance multiple instances?
In exactly the same environment, App0 and App1 concurrently compete for the fresh data. The idea is that the read query executed by App0 must not overlap with the same read query executed by App1, so that they never try to process the same item. In other words, I need SQL-based guarantees that concurrent read queries return different data. Is that even possible?
P.S. Postgres is the preferred option.
The problem description is rather vague on what App1 should do while App0 is processing the previously selected records.
In this answer, I make the following assumptions:
all Apps somehow know what the last certainTimestamp is and it is the same for all Apps whenever they start a DB query.
while App0 is processing, say the 10 records it found when it started working, new records come in. That means, the pile of new records with respect to certainTimestamp grows.
when App1 (or any further App) starts, it should process only those new records with respect to certainTimestamp that are not yet being handled by other Apps.
yet, if one App fails/crashes, the unfinished records should be picked up the next time another App runs.
This can be achieved by locking records in many SQL databases.
One way to go about this is to use
SELECT ... FOR UPDATE SKIP LOCKED
This statement, in combination with the range selection since(certainTimestamp), selects and locks all records that match the condition and are not currently locked.
Whenever a new App instance runs this query, it only gets "what's left" to do and can work on that.
This solves the problem of "overlap", i.e. of working on the same data.
What's left is then the definition and update of the certainTimestamp.
In order to keep this answer short, I don't go into that here and just leave the pointer to the OP that this needs to be thought through properly to avoid situations where e.g. a single record that cannot be processed for some reason keeps the certainTimestamp at a permanent minimum.
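A rough sketch of how one App instance might claim and process a batch this way. It assumes a PostgreSQL connection object following the Python DB-API with `%s` placeholders (as psycopg2 provides) and autocommit left off; the table `records` and its columns are hypothetical. The row locks last only until the transaction ends, so the processing has to happen before the commit. If the App crashes, the transaction is rolled back, the locks are released, and another App can pick those rows up. How finished rows are excluded afterwards (advancing certainTimestamp) is deliberately left out, as in the answer.

```python
CLAIM_SQL = """
    SELECT id, created_at, payload
    FROM records
    WHERE created_at > %s
    ORDER BY created_at
    LIMIT %s
    FOR UPDATE SKIP LOCKED
"""


def process_batch(conn, since, batch_size, process):
    """Lock up to batch_size unclaimed rows, process them, then commit.

    Rows already locked by another instance are skipped, so two instances
    running this at the same time receive disjoint sets of rows.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (since, batch_size))
        rows = cur.fetchall()
        for row in rows:
            process(row)   # locks are still held while this runs
        conn.commit()      # ends the transaction and releases the locks
        return rows
    except Exception:
        conn.rollback()    # releases the locks; rows become claimable again
        raise
    finally:
        cur.close()
```

A smaller batch_size keeps each transaction short, at the cost of more round trips.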
I have a table that records a lot of information at any moment, for example, 100 rows per second.
After completing each row, certain operations must be performed. That is, some of these rows should be copied to another table.
Now a few questions:
Can I use triggers to do this, given the high rate of incoming rows?
If multiple conditions have to be checked before copying to the other table, can the triggers still be responsive?
Additional explanation: the records added to this table are added by the fingerprint recorder
First of all, consider these points:
1. Check how your trigger is defined: it can fire on insert, update, etc., but it may not need to run for every operation (for example, not for every insert).
2. Over time you can lose track of business logic that lives in a trigger when you change other rules of your application.
You need to keep the trigger in mind for every change, to avoid introducing bugs.
4....
I strongly suggest you do not define a trigger unless you have no other choice.
If you have an application, you can do it there and keep the business logic in the application
(for instance, start a thread in your application that checks for new rows and does the work).
You can have a Windows service do that for you.
If you only have database access, you can define a job in the database to do it for you (not recommended).
Finally, to avoid blocking if you decide to go multi-threaded (per your question, the second thread only reads data from the original table and inserts it into another), you can turn on READ_COMMITTED_SNAPSHOT for your database (it shows up as is_read_committed_snapshot_on in sys.databases).
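As a sketch of the application-side alternative suggested above, a background thread or service could periodically call something like the function below. It is written against SQLite via Python's sqlite3 module only so that it runs on its own; the tables `scans` and `matched_scans`, the `event_type` condition, and the idea of tracking the last copied id are all assumptions made for illustration.

```python
import sqlite3


def copy_new_rows(conn, last_seen_id):
    """Copy qualifying rows newer than last_seen_id; return the new high-water mark."""
    # Fix the upper bound first so rows arriving later wait for the next call.
    newest = conn.execute(
        "SELECT COALESCE(MAX(id), ?) FROM scans", (last_seen_id,)
    ).fetchone()[0]
    with conn:  # commits on success, rolls back if the INSERT fails
        conn.execute(
            "INSERT INTO matched_scans (scan_id, person_id) "
            "SELECT id, person_id FROM scans "
            "WHERE id > ? AND id <= ? AND event_type = 'entry'",
            (last_seen_id, newest),
        )
    return newest


def make_demo_db():
    """In-memory database with hypothetical source and target tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE scans (id INTEGER PRIMARY KEY, person_id INTEGER, event_type TEXT);
        CREATE TABLE matched_scans (scan_id INTEGER, person_id INTEGER);
        INSERT INTO scans (person_id, event_type) VALUES (1, 'entry'), (2, 'exit'), (3, 'entry');
    """)
    return conn
```

The caller keeps the returned value and passes it in on the next run, so each row is considered once regardless of how many conditions the WHERE clause checks.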
I have been reading the SQLite documentation and also referencing code I have written previously but I don't seem to be able to find a definitive answer to what I imagine to be a rather simple question.
I would like to execute many (separate) compiled statements within a transaction, but child threads may also be creating transactions or just executing statements at the same time and I would not want them included in this particular transaction. Currently, I have a single database handle that I share between all threads.
So, my question is,
1) Is it generally better to have some kind of semaphore around transactions to ensure they will not clash/collide with other statements being executed against the database handle? I already marshal writes to prevent multithreading problems with SQLite (although with WAL it is now very hard to unsettle it at all).
2) Or are you expected to open multiple database connections and start/commit the transactions one per database connection if they will be concurrent?
Changes made in one database connection are invisible to all other database connections prior to commit.
So it seems a hybrid approach of having several connections open to the database provides adequate concurrency guarantees, trading off the expense of opening a new connection with the benefit of allowing multi-threaded write transactions.
A query sees all changes that are completed on the same database connection prior to the start of the query, regardless of whether or not those changes have been committed.
If changes occur on the same database connection after a query starts running but before the query completes, then it is undefined whether or not the query will see those changes.
If changes occur on the same database connection after a query starts running but before the query completes, then the query might return a changed row more than once, or it might return a row that was previously deleted.
For the purposes of the previous four items, two database connections that use the same shared cache and which enable PRAGMA read_uncommitted are considered to be the same database connection, not separate database connections.
Here is the SQLite information on isolation, which is exceptionally useful to read and understand for this problem.
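The first two quoted rules can be observed with a small experiment. This is only a sketch using Python's sqlite3 module and a throwaway database file; the table `t` and the names `writer` and `reader` are made up for the demonstration.

```python
import os
import sqlite3
import tempfile


def count_rows(conn):
    # fetchall() runs the query to completion, so no read is left open
    return conn.execute("SELECT COUNT(*) FROM t").fetchall()[0][0]


def visibility_demo():
    """Show what each connection sees before and after a commit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: transactions are controlled explicitly below
        writer = sqlite3.connect(path, isolation_level=None)
        reader = sqlite3.connect(path, isolation_level=None)
        try:
            writer.execute("CREATE TABLE t (x INTEGER)")
            writer.execute("BEGIN")
            writer.execute("INSERT INTO t VALUES (1)")
            seen = {
                # same connection: the uncommitted insert is visible
                "writer_before_commit": count_rows(writer),
                # other connection: the insert is invisible until commit
                "reader_before_commit": count_rows(reader),
            }
            writer.execute("COMMIT")
            seen["reader_after_commit"] = count_rows(reader)
            return seen
        finally:
            reader.close()
            writer.close()


result = visibility_demo()
print(result)
```

With a single shared handle, by contrast, every thread's statements fall into whatever transaction is currently open on that handle, which is why a connection per concurrent transaction is the simpler model.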
Everyone has accidentally forgotten the WHERE clause on a DELETE query and blasted some un-backed up data once or twice. I was pondering that problem, and I was wondering if the solution I came up with is practical.
What if, in place of actual DELETE queries, the application and maintenance scripts did something like:
UPDATE foo SET to_be_deleted=1 WHERE blah = 50;
And then a cron job was set to go through and actually delete everything with the flag? The downside would be that pretty much every other query would need to have WHERE to_be_deleted != 1 appended to it, but the upside would be that you'd never mistakenly lose data again. You could see "2,349,325 rows affected" and say, "Hmm, looks like I forgot the WHERE clause," and reset the flags. You could even make the to_be_deleted field a DATE column, so the cron job would check to see if a row's time had come yet.
Also, you could remove DELETE permission from the production database user, so even if someone managed to inject some SQL into your site, they wouldn't be able to remove anything.
So, my question is: Is this a good idea, or are there pitfalls I'm not seeing?
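For concreteness, the flag-and-purge scheme described in the question might look roughly like this. It is a sketch against SQLite via Python's sqlite3 module; the columns of `foo` other than `to_be_deleted` and `blah` are not specified in the question, so the schema here is an assumption.

```python
import sqlite3


def make_demo_db():
    conn = sqlite3.connect(":memory:")
    # NOT NULL DEFAULT 0 matters: "to_be_deleted != 1" is not true for NULL,
    # so rows with a NULL flag would silently vanish from every query.
    conn.execute(
        "CREATE TABLE foo (id INTEGER PRIMARY KEY, blah INTEGER, "
        "to_be_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO foo (blah) VALUES (?)", [(50,), (50,), (7,)])
    conn.commit()
    return conn


def soft_delete(conn, blah):
    """Flag rows instead of deleting them; returns how many rows were flagged."""
    with conn:
        cur = conn.execute(
            "UPDATE foo SET to_be_deleted = 1 WHERE blah = ?", (blah,)
        )
        return cur.rowcount


def reset_flags(conn):
    """Undo step for the 'I forgot the WHERE clause' moment."""
    with conn:
        return conn.execute(
            "UPDATE foo SET to_be_deleted = 0 WHERE to_be_deleted = 1"
        ).rowcount


def purge_flagged(conn):
    """What the cron job would run: physically remove the flagged rows."""
    with conn:
        return conn.execute("DELETE FROM foo WHERE to_be_deleted = 1").rowcount
```

Note that reset_flags clears every flag, including ones set on purpose earlier, which is one of the weaknesses of a plain boolean flag.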
That is fine if you want to do that, but it seems like a lot of work. How many people are manually changing the database? It should be very few, especially if your users have an app to work with.
When I work on the production db I put EVERYTHING I do in a transaction, so if I mess up I can roll back. Just having a standard practice like that has helped me.
I don't see anything really wrong with that, though, other than that every single point of data manipulation in each application will have to be aware of this functionality and not just the data it wants.
This would be fine as long as your application does not require that the data be deleted immediately, since you have to wait for the next run of the cron job.
I think a better solution and the more common practice is to use a development server and a production server. If your development database gets blown out, simply reload it. No harm done. If you're testing code on your production database, you deserve anything bad that happens.
A lot of people have a delete flag or a row status flag. But if someone is doing a change through the back end (and they will be doing it since often people need batch changes done that can't be accomplished through the front end) and they make a mistake they will still often go for delete. Ultimately this is no substitute for testing the script before applying it to a production environment.
Also, what happens if the query "UPDATE foo SET to_be_deleted=1" gets executed because they left off the WHERE clause? Unless you have auditing columns with a time stamp, how do you know which rows were deleted on purpose and which ones were flagged in error? But even if you have auditing columns with a time stamp, if the auditing is done via a stored procedure or programmer convention, then these back end queries may not supply the information that tells you they were just applied.
Too complicated. The standard approach to this is to do all your work inside a transaction, so if you screw up and forget a WHERE clause, then you simply roll back when you see the "2,349,325 rows affected" result.
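The same habit can be scripted. Below is a sketch of a guard that runs a statement inside an explicit transaction and rolls back when more rows are affected than expected. It uses Python's sqlite3 module with a connection opened with isolation_level=None so that BEGIN/COMMIT/ROLLBACK are issued by hand; the function name and the `max_rows` threshold are illustrative, not an established API.

```python
import sqlite3


def delete_with_guard(conn, sql, params, max_rows):
    """Run a data-changing statement; roll back if it touches too many rows.

    Returns (committed, affected_row_count).
    """
    conn.execute("BEGIN")
    try:
        affected = conn.execute(sql, params).rowcount
        if affected > max_rows:
            conn.execute("ROLLBACK")  # "Hmm, looks like I forgot the WHERE clause"
            return False, affected
        conn.execute("COMMIT")
        return True, affected
    except Exception:
        conn.execute("ROLLBACK")
        raise


def make_demo_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, blah INTEGER)")
    conn.executemany(
        "INSERT INTO foo (blah) VALUES (?)", [(50,), (50,), (7,), (8,), (9,)]
    )
    return conn
```

In an interactive session the check is done by eye instead: read the reported row count, then type COMMIT or ROLLBACK.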
It may be easier to create a parallel table for deleted rows. A DELETE trigger (and UPDATE too if you want to undo changes as well) on the original table could copy the affected rows to the parallel table. Adding a datetime column to the parallel table to record the date & time of the change would let you permanently remove rows past a certain age using your cron job.
That way, you'd use normal DELETE statements on the original table, so there's no chance you'll forget to run your special "DELETE" statement. You also sidestep the to_be_deleted != 1 expression, which is just a bug waiting to happen when someone inevitably forgets.
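A sketch of the parallel-table idea, using SQLite through Python's sqlite3 module so that it can be run as-is. The names `foo_deleted` and `foo_keep_deleted` and the column layout are assumptions; trigger syntax differs between database engines.

```python
import sqlite3

SCHEMA = """
CREATE TABLE foo (id INTEGER PRIMARY KEY, blah INTEGER);
CREATE TABLE foo_deleted (id INTEGER, blah INTEGER, deleted_at TEXT);

-- Copy every row into the parallel table just before it is deleted.
CREATE TRIGGER foo_keep_deleted BEFORE DELETE ON foo
BEGIN
    INSERT INTO foo_deleted (id, blah, deleted_at)
    VALUES (OLD.id, OLD.blah, datetime('now'));
END;
"""


def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO foo (blah) VALUES (?)", [(50,), (50,), (7,)])
    conn.commit()
    return conn


def purge_archive(conn, older_than_days):
    """Cron-job step: permanently drop archived rows past a certain age."""
    with conn:
        return conn.execute(
            "DELETE FROM foo_deleted WHERE deleted_at < datetime('now', ?)",
            (f"-{older_than_days} days",),
        ).rowcount


def restore_all(conn):
    """Undo step: move every archived row back into the original table."""
    with conn:
        conn.execute("INSERT INTO foo (id, blah) SELECT id, blah FROM foo_deleted")
        conn.execute("DELETE FROM foo_deleted")
```

Ordinary queries against `foo` need no extra condition, which is the point of this variant.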
It looks like you're describing three cases here.
Case 1 - maintenance scripts. Risk can be minimized by developing them and testing them in an environment other than your production box. For quick maintenance, do the maintenance in a single transaction, and check everything before committing. If you made a mistake, issue the rollback command. For more serious maintenance that you can't necessarily wait around for, or do in a single transaction, consider taking a backup directly before running the maintenance job, so that you can always restore back to the point before you ran your script if you encounter serious problems.
Case 2 - SQL Injection. This is an architecture issue. Your application shouldn't pass SQL into the database; access should be controlled through packages / stored procedures / functions, and values that come from the UI and are used in a DML statement should be applied using bind variables, rather than by creating dynamic SQL by appending strings together.
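To illustrate the difference between bind variables and string concatenation, here is a small sketch with Python's sqlite3 module. The `users` table and both function names are hypothetical, and the unsafe variant is shown only for contrast.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Anti-pattern: the input becomes part of the SQL text itself.
    return conn.execute(
        "SELECT id FROM users WHERE name = '" + name + "'"
    ).fetchall()


def find_user_bound(conn, name):
    # Bind variable: the input is passed as data and is never parsed as SQL.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

hostile = "x' OR '1'='1"
unsafe_rows = find_user_unsafe(conn, hostile)  # the injected OR matches every row
bound_rows = find_user_bound(conn, hostile)    # no user has that literal name
print(len(unsafe_rows), len(bound_rows))
```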
Case 3 - Regular batch jobs. These should have been tested before being deployed to production. If you delete too much, you have a bug, and are going to have to rely on your backup strategy.
"Everyone has accidentally forgotten the WHERE clause on a DELETE query and blasted some un-backed up data once or twice."
No. I always prototype my DELETEs as SELECTs, and only if the SELECT returns exactly the rows I want to delete do I change the part of the statement before WHERE to a DELETE. This lets me inspect, in any detail needed, the rows I am about to affect before doing anything.
You could set up a view on that table that selects WHERE to_be_deleted != 1, and all of your normal selects are done on that view - that avoids having to put the WHERE on all of your queries.
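A minimal sketch of such a view, run against SQLite through Python's sqlite3 module. The view name `live_foo` and the table layout are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (
        id INTEGER PRIMARY KEY,
        blah INTEGER,
        to_be_deleted INTEGER NOT NULL DEFAULT 0
    );

    -- Normal reads go through the view, so the flag check lives in one place.
    CREATE VIEW live_foo AS
        SELECT id, blah FROM foo WHERE to_be_deleted != 1;

    INSERT INTO foo (blah) VALUES (50), (50), (7);
    UPDATE foo SET to_be_deleted = 1 WHERE blah = 50;
""")

live_rows = conn.execute("SELECT id, blah FROM live_foo ORDER BY id").fetchall()
print(live_rows)
```

Writes still target the base table `foo`, so only the read side is simplified.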
The pitfall is that it's unnecessarily complicated and someone will inadvertently forget to check the flag in their query. There's also the issue of potentially needing to delete something immediately instead of waiting for the scheduled job to run.
To avoid the to_be_deleted WHERE clause you could create a trigger before the delete command fires off to insert the deleted rows into a separate table. This table could be cleared out when you're sure everything in it really needs to be deleted, or you could keep it around for archive purposes.
You also get a "soft delete" feature, so you can give (certain) end-users the power of "undo". There would have to be a pretty strong downside in the mix to cancel the benefits of soft deleting.
The "WHERE to_be_deleted <> 1" on every other query is a huge one. Another is: once you've run your accidentally rogue query, how will you determine which of the 2,349,325 rows were previously marked as deleted?
I think the practical solution is regular backups, and failing that, perhaps a delete trigger that captures the tuples to be axed.
The other option would be to create a delete trigger on each table. When anything is deleted, it would insert that "to be deleted" record into another table, ideally named TABLENAME_deleted.
The downside would be that the db would have twice as many tables.
I don't recommend triggers in general, but it might be what you are looking for.
This is why, whenever you are editing data by hand, you should BEGIN TRAN, edit your data, check that it looks good (for instance that you didn't delete more data than you were expecting) and then COMMIT TRAN (or ROLLBACK TRAN if it doesn't look good). If you're using Postgres then you want to create lots of savepoints as well so that a typo doesn't wipe out your intermediate work.
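The savepoint idea can be tried out with a short sketch. It uses SQLite via Python's sqlite3 module so that it runs without a server; SAVEPOINT and ROLLBACK TO SAVEPOINT are also available in PostgreSQL. The table contents and the savepoint name `before_delete` are made up for the example.

```python
import sqlite3

# isolation_level=None: transaction control statements are issued by hand
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, blah INTEGER)")
conn.executemany("INSERT INTO foo (blah) VALUES (?)", [(50,), (50,), (7,)])

conn.execute("BEGIN")
# Intermediate work that should survive a later mistake.
conn.execute("UPDATE foo SET blah = 51 WHERE blah = 50")
conn.execute("SAVEPOINT before_delete")

# The typo: the WHERE clause is missing.
deleted = conn.execute("DELETE FROM foo").rowcount
if deleted > 1:
    # Undo only the DELETE; the UPDATE before the savepoint is kept.
    conn.execute("ROLLBACK TO SAVEPOINT before_delete")

conn.execute("COMMIT")
remaining = conn.execute("SELECT blah FROM foo ORDER BY id").fetchall()
print(remaining)
```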
But that said, in many applications it does make sense to have software mark records as invalid rather than deleting them. Add a last_modified date that is automatically updated, and you are all prepared to set up incremental updates into a data warehouse. Even if you don't have a data warehouse now, it never hurts to prepare for the future when preparing is cheap. Plus in the event of manual mistakes you still have the data, and can just find all of the records that got "deleted" when you made your mistake and fix them. (You should still use transactions though.)