Load balancing SQL reads while batch-processing? - sql

Given an SQL table with timestamped records. Every once in a while an application App0 does something like: foreach record in since(certainTimestamp) do process(record); commitOffset(record.timestamp), i.e. periodically it consumes a batch of "fresh" data, processes it sequentially, commits success after each record, and then just sleeps for a reasonable time (to accumulate another batch). That works perfectly with a single instance... but how do I load balance multiple instances?
In exactly the same environment, App0 and App1 concurrently compete for the fresh data. The idea is that the read query executed by App0 must not overlap with the same read query executed by App1, so that they never try to process the same item. In other words, I need SQL-based guarantees that concurrent read queries return different data. Is that even possible?
P.S. Postgres is preferred option.

The problem description is rather vague on what App1 should do while App0 is processing the previously selected records.
In this answer, I make the following assumptions:
all Apps somehow know what the last certainTimestamp is and it is the same for all Apps whenever they start a DB query.
while App0 is processing, say, the 10 records it found when it started working, new records come in. That means the pile of new records with respect to certainTimestamp grows.
when App1 (or any further App) starts, it should process only those new records with respect to certainTimestamp that are not yet being handled by other Apps.
yet, if an App fails/crashes, the unfinished records should be picked up the next time another App runs.
This can be achieved by locking records in many SQL databases.
One way to go about this is to use
SELECT ... FOR UPDATE SKIP LOCKED
This statement, in combination with the range selection since(certainTimestamp), selects and locks all records matching the condition that are not currently locked by someone else.
Whenever a new App instance runs this query, it only gets "what's left" to do and can work on that.
This solves the problem of "overlap", i.e. working on the same data.
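A minimal Postgres (9.5+) sketch of this pattern; the events table, the created_at column, and the :certainTimestamp parameter are hypothetical stand-ins for the OP's schema and offset:

BEGIN;

SELECT *
FROM   events
WHERE  created_at > :certainTimestamp
ORDER  BY created_at
FOR UPDATE SKIP LOCKED;  -- rows already locked by another App are silently skipped

-- process the returned rows here; the row locks are held until COMMIT/ROLLBACK,
-- so no other instance can pick the same rows up in the meantime
COMMIT;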
What's left is then the definition and update of the certainTimestamp.
In order to keep this answer short, I won't go into that here; I just leave the pointer to the OP that this needs to be thought through properly, to avoid situations where e.g. a single record that cannot be processed for some reason keeps certainTimestamp at a permanent minimum.

Related

Delete a record after a period of time automatically in SQL Firebird 2.5?

We have a table with a Datetime stamp field recording when each record was created. How can we create a trigger or procedure to delete a record after 30 days?
Is there any advice on how we can run such a deletion scheduler?
Firebird doesn't have a scheduler. You will need to create an application that executes a clean up routine on a schedule yourself. You could do this as part of the normal application, or you could write a small application specifically for this purpose, and execute it with the scheduler of your OS (e.g. Windows Scheduled Tasks, or Linux Cron).
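If you go the OS-scheduler route, the cleanup itself is a single DELETE; a minimal sketch, assuming a hypothetical table DOCS with a CREATED_AT timestamp column:

DELETE FROM docs
WHERE created_at < DATEADD(-30 DAY TO CURRENT_TIMESTAMP);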
Firebird 2.1 introduced global triggers fired on database connection/disconnection and on transaction starting/ending.
https://www.firebirdsql.org/file/documentation/chunk/en/refdocs/fblangref30/fblangref30-ddl-trigger.html
While it is not exactly what you need it can be used to achieve similar results. Whether that similarity is good enough for you or not is for you to evaluate.
to delete a record after 30 days?
The question is what exactly you mean by that. Would it still be okay if the row were deleted after 31 days, or after 40 days?
In our case, for a client-server office application, there was no time pressure and additionally there was no safe deletion as long as the programs had "documents" open.
We had to delete some global data, and while there were some marks in the database showing which documents use them and which documents are currently open, they were not very reliable. That also meant the existing method of immediate deletion could occasionally lead to application crashes.
So we reformulated a problem similar to yours the following way:
We need rows not deleted immediately but pending deletion for 30 days or more. Those records would be rendered in the application in a special way, as a warning to users, and with a way for them to cancel the deletion if they changed their mind (or if other users had different ideas).
The deletion would happen, in logical terms, "when there is no connected application". In technical terms that could mean either "when the first application connects, but before it starts actual (business-related) work" or "when the last application disconnects, after it has finished actual work". We settled on the latter and used an ON DISCONNECT global database trigger.
We had not only the main business-domain application but also a number of technical helper utilities. From the Firebird point of view there is no difference between them, so we had to modify the "login sequence" in our main application: right after a successful login it registered its own CURRENT_CONNECTION in a special table. This is potentially slightly fragile.
The ON DISCONNECT trigger did three things (a sketch of it follows this list):
it checked whether CURRENT_CONNECTION is in the table, and if it was, it called a special stored procedure, SP_LOCAL_CLEANUP.
it removed the current connection from the table (it could then have been a BEFORE DELETE trigger calling the procedure, but we decided our helper utilities should have a way to hook in if they ever needed to, so the call was put in the ON DISCONNECT trigger).
it checked whether that table (the list of known connected business-domain applications) had become empty, and if it had, it called another special stored procedure, SP_GLOBAL_CLEANUP.
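A hedged Firebird PSQL sketch of such a trigger; APP_CONNECTIONS is a hypothetical name for the registration table, while SP_LOCAL_CLEANUP and SP_GLOBAL_CLEANUP are the umbrella procedures described below:

SET TERM ^ ;

CREATE TRIGGER TRG_APP_DISCONNECT
  ACTIVE ON DISCONNECT POSITION 0
AS
BEGIN
  /* 1. If this connection registered itself, run the per-connection cleanup. */
  IF (EXISTS(SELECT 1 FROM APP_CONNECTIONS WHERE CONN_ID = CURRENT_CONNECTION)) THEN
    EXECUTE PROCEDURE SP_LOCAL_CLEANUP;

  /* 2. Deregister this connection. */
  DELETE FROM APP_CONNECTIONS WHERE CONN_ID = CURRENT_CONNECTION;

  /* 3. If no business application is left connected, run the global cleanup. */
  IF (NOT EXISTS(SELECT 1 FROM APP_CONNECTIONS)) THEN
    EXECUTE PROCEDURE SP_GLOBAL_CLEANUP;
END^

SET TERM ; ^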
Those stored procedures were "umbrella" procedures, consisting solely of calls to other procedures that did the actual work of checking for inconsistencies and fixing them, such as removing "this document is opened for editing" marks when an application (or computer, or network) had crashed without removing the lock the normal way. This way we could add or remove functionality without breaking Firebird object dependency chains.
In particular, one of the global sub-procedures looked at the "deletion pending" records and deleted those kept "in the recycle bin" for more than 30 days. Actually, the records just had a column with the planned deletion date, which could be more or less than 30 days ahead, but that is a technicality.
This meant the actual deletion happened "sometime after 30 days", and only when all the main apps had been shut down. When those apps were later run again, they re-read the global dictionary tables in their updated, pruned state. The applications were never again in an inconsistent state, using records removed from the database.
Potential fragile point: if users did not shut the application down at night but simply went home, there might never be a "last application disconnected" state. Leaving applications running around the clock would, however, also be a maintenance nightmare for their network admins (Windows updates and reboots, antivirus updates and reboots), so we documented the recommendation that the admins make sure that at least once a week all users are out of the database at the same time.
Potential fragile point: if the Firebird server crashes (not the applications, but the server engine), the "known connections" table is left with stale values. We considered this not a practical problem, as CURRENT_CONNECTION would then restart at 1 and count upward, eventually cleaning the table. But we also added a function to a helper app that uses SYSDBA and the monitoring tables to purge non-existent connections from the table.
You can re-use this framework if you are not under time pressure and you are okay with the actual deletion being deferred for a few days.
You could also use an ON TRANSACTION START trigger instead, to shorten the delay to mere minutes, but I expect that would slow your application down badly, so I would advise against it.

Oracle can return time out when another connection already use the same table?

If I need to run DML (insert, update, delete) against a table, Oracle first checks whether there is an active DML operation on that table. If there is, my connection waits until that operation has finished.
Is there a way to get a "time out" in these cases? Not globally, only for specific cases.
--Edit for more specifications of the problem
I'm not sure whether any kind of lock is actually used. But in my case there is an old application written in Oracle Forms and a new application written by me.
The problem is that when a user opens a specific record to update a field in the old application, and I try to edit the same record in my app, the row is blocked.
So my app waits for the unlock. The problem is that the user thinks the application is frozen and kills it, losing the changes.
This does not happen when another Oracle Forms application attempts the edit: in that case Oracle Forms displays the message "Could not reserve record (2). Keep trying?". Maybe that is because the old app uses some kind of lock. Either way, I need to handle this in my code.
Note: the number 2 is the number of attempts to update.
If you do a LOCK TABLE ... WAIT, it will wait until any in-flight DML on the table commits and then give you the lock. This will make anyone coming after you wait until you release the lock. Look at the documentation to see how to use it.
Then there is the possibility of locking a single row (SELECT ... FOR UPDATE), which is more granular.
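A hedged sketch of how a timeout can be attached to either kind of lock in Oracle; the orders table and the id value are hypothetical:

-- Row level: fail immediately if another session already holds the row
SELECT * FROM orders WHERE order_id = 42 FOR UPDATE NOWAIT;   -- ORA-00054 if locked

-- Row level: wait up to 5 seconds, then give up
SELECT * FROM orders WHERE order_id = 42 FOR UPDATE WAIT 5;   -- ORA-30006 on timeout

-- Table level: take an exclusive lock only if it is available right now
LOCK TABLE orders IN EXCLUSIVE MODE NOWAIT;

Catching these errors in your application lets you tell the user the record is busy instead of appearing frozen.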
That being said, can you explain what exactly you are trying to do? You may not need any of this at all.

How are transactions partitioned/isolated in SQLite?

I have been reading the SQLite documentation and also referencing code I have written previously, but I can't seem to find a definitive answer to what I imagine is a rather simple question.
I would like to execute many (separate) compiled statements within a transaction, but child threads may also be creating transactions or just executing statements at the same time and I would not want them included in this particular transaction. Currently, I have a single database handle that I share between all threads.
So, my question is,
1) Is it generally better to have some kind of semaphore around transactions to ensure they will not clash/collide with other statements being executed against a database handle? I already marshal writes to prevent multithreading problems with SQLite (although with WAL it is now very hard to unsettle it at all).
2) Or are you expected to open multiple database connections and start/commit the transactions one per database connection if they will be concurrent?
Changes made in one database connection are invisible to all other database connections prior to commit.
So it seems a hybrid approach of having several connections open to the database provides adequate concurrency guarantees, trading off the expense of opening a new connection with the benefit of allowing multi-threaded write transactions.
A query sees all changes that are completed on the same database connection prior to the start of the query, regardless of whether or not those changes have been committed.
If changes occur on the same database connection after a query starts running but before the query completes, then it is undefined whether or not the query will see those changes.
If changes occur on the same database connection after a query starts running but before the query completes, then the query might return a changed row more than once, or it might return a row that was previously deleted.
For the purposes of the previous four items, two database connections that use the same shared cache and which enable PRAGMA read_uncommitted are considered to be the same database connection, not separate database connections.
Here is the SQLite documentation on isolation, which is exceptionally useful to read and understand for this problem.
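A small sketch of that isolation behaviour, assuming a hypothetical jobs table and two separate connections (A and B) to the same database file, with WAL enabled so the reader is not blocked by the open write transaction:

-- connection A
BEGIN;
INSERT INTO jobs(name) VALUES ('build');
SELECT COUNT(*) FROM jobs;   -- A sees its own uncommitted row

-- connection B, while A's transaction is still open
SELECT COUNT(*) FROM jobs;   -- B still sees the pre-transaction snapshot

-- connection A
COMMIT;

-- connection B
SELECT COUNT(*) FROM jobs;   -- B now sees the committed row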

Rest philosophy for updating and getting records

In my app I'm displaying Race objects that essentially have three states: pending, inProgress and completed. I want to display all Races that are currently pending or inProgress, but not the ones that are completed. To do this, I want to create a RESTful API for getting these resources from my server, but I'm not sure what the best (i.e. most RESTful) approach would be.
The issue is that when someone opens or refreshes the app, I need to do two things:
Perform a GET on all the Races that are currently displayed in the client to update their status.
GET all of the new pending or inProgress Races that have been created since the client last updated
I've come up with a few different solutions, though I don't know which, if any, would be best:
Simply delete the old Race records on the client and always GET all new records
Perform 2 separate GET operations, the first which updates all the old records, and the second where I GET all the new pending / inProgress Races
Perform a single GET operation where I specify the created date of the last client record, and GET all records that are newer.
To me, this seems like a pretty common scenario but I haven't been able to find a specific answer to this type of problem. I'd like to see what SO thinks :)
Thanks in advance for your help!
Simply delete the old Race records on the client and always GET all new records
This is probably the easiest solution. However, you shouldn't do that if you need very smooth updates on your client (for games, data visualization, etc.).
Perform 2 separate GET operations (...) / Perform a single GET operation where I specify the created date of the last client record, and GET all records that are newer.
I would definitely do it with a single operation. Rather than an update timestamp (timestamp operations are costly, and several operations could happen at the same instant), I would use a sequence number. This is the way CouchDB handles "changes".
Moreover, as you will see in the documentation, this solution can then be upgraded for asynchronous notifications (if you need so).
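A sketch of what the sequence-number approach could look like on the server side, assuming a hypothetical races table with a change_seq column that the server increments (via a trigger or in application code) on every insert or update; last_seen_seq is the highest value the client has already received:

-- one query returns both updated old races and newly created ones
SELECT id, status, change_seq
FROM   races
WHERE  change_seq > :last_seen_seq
ORDER  BY change_seq;

The client remembers the largest change_seq it received and sends it with the next refresh, e.g. as GET /races?since=1234 (endpoint name hypothetical).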

Strategies to issue unique records via db?

We have more than one instance of a certain exe running from different locations. An exe is supposed to fetch a set of records and do some work based on them. The set of records fetched by exe A should not be fetched by exe B, and vice versa. Exes A and B are the same exe, just running from different locations. The number of instances may increase or decrease, and all exes might run simultaneously at times.
So coming to my question...what is the best way I can tackle this problem?
I've thought about using transactions but the table that acts as the source for the exe is also used by others (scheduled jobs, websites, etc). The scheduled jobs insert data into the source table.
However, if I had to use transactions, could I start a transaction with BEGIN TRAN and then select the data from the source table using the WITH (TABLOCKX) hint? If I were to do this on views, would it affect the actual underlying table(s)?
I just want to know what are the strategies used to deal with this...
You want to avoid race conditions between processes. My answer here goes into details: SQL Server Process Queue Race Condition
Transactions are not much use on their own: it's the locking strategy you have to think about, with its knock-on effect on concurrency.
One option might be to run an UPDATE query that "marks" which items the exe is going to fetch (with a WHERE clause constraining it to marking only items that aren't already marked). Then run a second SELECT that pulls out the marked items. That way you don't have to worry about the gap between the UPDATE and the SELECT: as long as the UPDATE runs atomically (in a transaction that is closed quickly), you shouldn't have concurrency issues.
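A hedged T-SQL sketch of that mark-then-select pattern; the work_items table, the claimed_by column, and the batch size of 100 are hypothetical:

DECLARE @worker_id UNIQUEIDENTIFIER = NEWID();   -- unique id for this exe instance

-- Atomically mark a batch of unclaimed rows for this instance only.
UPDATE TOP (100) dbo.work_items
SET    claimed_by = @worker_id
WHERE  claimed_by IS NULL;

-- Read back only the rows this instance marked; another instance running the
-- same UPDATE concurrently can never have marked the same rows.
SELECT *
FROM   dbo.work_items
WHERE  claimed_by = @worker_id;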