Using Kettle-Spoon 5.1, I have a transform that:
pulls a row from a database (Oracle)
sends the row to a REST API (via REST Client)
uses a Switch/Case step that looks at the result. If the result is good, I remove the record from the database (Execute SQL); otherwise, I roll back the transaction.
Note, the rollback depends on the result, not necessarily on an error/exception. If a particular transaction gets rolled back, I still want to keep processing the other rows (I don't want the transform to stop).
Two questions:
Is there a clean way to force a rollback? (Currently I've got a JavaScript step with a "throw".)
How do I force the transaction to continue after the rollback?
Within a transformation, Pentaho processes rows within a stream concurrently, all in the same transaction. To roll back a single row, you would need a separate transaction for each row, which Pentaho doesn't support.
I recommend using another approach to achieve the same effect. Two possibilities:
Don't make the database changes until after the REST result is received. That way, there is nothing to roll back. You can, of course, stage the changes in memory (in stream fields), or in temporary database tables, so you know exactly what they will be (see the sketch after these two options).
Somehow keep track of the database changes you have made, so that you can undo them if the REST result indicates that you should do so. Depending on how you are doing things, you can keep track of those changes in stream fields, or in temporary database tables.
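A rough sketch of the first approach, using a temporary staging table instead of deleting immediately. The table and column names here are purely illustrative, and the bind-variable syntax assumes Oracle:

-- Before the REST call: record which row is about to be removed, but don't touch the source yet
INSERT INTO pending_deletes (record_id, staged_at)
VALUES (:record_id, SYSDATE);

-- REST result is good: apply the change now
DELETE FROM source_table WHERE id = :record_id;
DELETE FROM pending_deletes WHERE record_id = :record_id;

-- REST result is bad: simply discard the staged entry; there is nothing to roll back
DELETE FROM pending_deletes WHERE record_id = :record_id;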
Related
I have a scheduled job that runs once a day, synchronizing entities between multiple APIs. I'm looking for a reliable way to pull "pages" of data from my DB, without downloading GBs worth of it in one go, using LIMIT and OFFSET.
From what I understand, starting a transaction at the beginning of the process and executing repeated SELECTs within it will ensure that no records in my result set are added or skipped due to other concurrent processes?
Hopefully, that would allow me to perform the synchronization job on the exact state of DB records at the start of the transaction. Also, it may be worth knowing that the sync job itself won't alter the records from said result set.
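For concreteness, here is roughly what I have in mind, assuming PostgreSQL, a REPEATABLE READ (snapshot) transaction, and a placeholder entities table and page size:

BEGIN ISOLATION LEVEL REPEATABLE READ;  -- one snapshot for the whole sync job

SELECT * FROM entities ORDER BY id LIMIT 1000 OFFSET 0;     -- page 1
SELECT * FROM entities ORDER BY id LIMIT 1000 OFFSET 1000;  -- page 2
-- ... keep paging until a SELECT returns no rows ...

COMMIT;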
Given an SQL table with timestamped records. Every once in a while an application App0 does something like foreach record in since(certainTimestamp) do process(record); commitOffset(record.timestamp), i.e. periodically it consumes a batch of "fresh" data, processes it sequentially, commits success after each record, and then just sleeps for a reasonable time (to accumulate yet another batch). That works perfectly with a single instance... however, how do I load balance multiple instances?
In exactly the same environment, App0 and App1 concurrently compete for the fresh data. The idea is that the read query executed by App0 must not overlap with the same read query executed by App1, such that they never try to process the same item. In other words, I need SQL-based guarantees that concurrent read queries return different data. Is that even possible?
P.S. Postgres is preferred option.
The problem description is rather vague on what App1 should do while App0 is processing the previously selected records.
In this answer, I make the following assumptions:
all Apps somehow know what the last certainTimestamp is and it is the same for all Apps whenever they start a DB query.
while App0 is processing, say the 10 records it found when it started working, new records come in. That means, the pile of new records with respect to certainTimestamp grows.
when App1 (or any further App) starts, it should process only those new records with respect to certainTimestamp that are not yet being handled by other Apps.
yet, if an App fails/crashes, the unfinished records should be picked up the next time another App runs.
This can be achieved by locking records in many SQL databases.
One way to go about this is to use
SELECT ... FOR UPDATE SKIP LOCKED
This statement, in combination with the range-selection since(certainTimestamp), selects and locks all records matching the condition that are not currently locked.
Whenever a new App instance runs this query, it only gets "what's left" to do and can work on that.
This solves the problem of "overlap", i.e. of working on the same data.
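A minimal sketch of that query in PostgreSQL; the records table, column names, and batch size are placeholders rather than anything from the question:

BEGIN;

-- Claim a batch of fresh records that no other App instance currently holds.
-- Rows already locked by another App are silently skipped instead of blocking.
SELECT *
FROM records
WHERE created_at > :certainTimestamp   -- bind the current certainTimestamp here
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;

-- ... process the claimed rows, mark them as done ...

COMMIT;  -- releases the locks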
What's left is then the definition and update of the certainTimestamp.
In order to keep this answer short, I won't go into that here; I just leave the pointer to the OP that this needs to be thought through properly, to avoid situations where, e.g., a single record that cannot be processed for some reason keeps certainTimestamp at a permanent minimum.
If I open a transaction in READ UNCOMMITTED isolation level, am I guaranteed to see the latest data on every table/row? I.e. as soon as some other transaction updates a row, my transaction will see that change? (This would be analogous to a write-through to main memory.)
Could it even be that my SELECT will get a row containing part of an UPDATE, but not all of it? What would in this case be the smallest element that is atomically updated/read?
Are there differences in the various relational database systems?
No. "Dirty data" means that you are relying on the internals of the database, so there are no guarantees. Data could be written to the data page and then removed due to a transaction rollback. Data could be written to the data page -- and then a later step in the same transaction could overwrite it.
In addition, what you are asking for is not possible. Your query could be scanning an entire table. Your reads are occurring at the page level. Each page could be a different amalgamation of transactions, with no consistency.
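To make the first point concrete, here is a minimal illustration, assuming SQL Server syntax and a hypothetical accounts table:

-- Session 1: starts a transfer but has not committed yet
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- ... the matching credit has not been written yet ...

-- Session 2: reads dirty data
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT SUM(balance) FROM accounts;  -- may include the debit but not the credit

-- Session 1: aborts
ROLLBACK;  -- session 2 has already read data that never logically existed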
If I were to run a DELETE FROM some_table, and that were to timeout, what happens to the data?
The way I see it, one of two things might happen:
The data is deleted up to the point where the query times out, so if there were 1,000,000 entries in the database and the first 500,000 were deleted, they'd stay deleted. The database now contains half as many as it did before the query was run.
The data is deleted, the query times out, the data is rolled back (I would guess from the logs made by DELETE?). The database now contains the exact same data it started with.
Both seem logical. Would one happen 100% of the time? Or is this dependent on some settings I'm unaware of? Note that I'm not asking about the viability of the DELETE, I realize that TRUNCATE would likely be opportune. This is purely out of curiosity of how timeout functions with DELETE.
Oracle, SQL Server, MySQL, and PostgreSQL all follow the ACID properties. Hence, whenever a DELETE statement times out, it gets rolled back.
You can get an overview of ACID from this link.
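One way to see this for yourself, assuming PostgreSQL and its statement_timeout setting (some_table is the placeholder from the question):

SET statement_timeout = '5s';      -- abort any statement that runs longer than 5 seconds

DELETE FROM some_table;            -- if this exceeds 5s, it fails with
                                   -- "canceling statement due to statement timeout"

SELECT COUNT(*) FROM some_table;   -- same count as before: the aborted DELETE left no partial effect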
I'm currently using SSIS for some process flow, scripting, and straight data import. Most of the data cleaning and transformation is happening within stored procedures that I'm calling from SSIS Execute SQL tasks. For most of the sprocs, if it fails for any reason, I don't really care about rolling back any transactions. My SSIS error handling essentially wipes out any staging data and then logs the errors to a table. (A human needs to fix the underlying data issue at that point.)
My question revolves around begin tran, end tran. Are there any cases where a stored proc can fail, and then not let the calling SSIS process know? I'm looking for hardware failure, lock timeouts, etc.
I'd prefer to avoid using transactions as much as possible and rely on my SSIS error handling.
Thoughts?
One case I can think of (and transactions won't help either) would be if the stored proc did not update or insert any records. That would not be a failure, but it might need to be for an SSIS package. You might want to return how many rows were affected and check that afterwards.
We also do this for some imports where a number significantly off from the last import indicates a data problem. So if we usually get 100,000 records from client A in Import B and we get 5,000 instead, the SSIS package fails until a human can look at it and see if the file is bad or if they genuinely did mean to reduce their work force or customer list.
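A rough sketch of that kind of check, assuming a hypothetical ImportLog table that records the size of each load (all names here are illustrative):

DECLARE @CurrentCount INT, @LastCount INT;

SELECT @CurrentCount = COUNT(*) FROM stg.ImportB_Raw;

SELECT TOP 1 @LastCount = RowsLoaded
FROM dbo.ImportLog
WHERE ClientName = 'Client A' AND ImportName = 'Import B'
ORDER BY LoadedAt DESC;

-- Fail this Execute SQL Task (and therefore the package) if the volume looks wrong
IF @CurrentCount < @LastCount / 2
    RAISERROR('Import B row count is suspiciously low; manual review required.', 16, 1);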
Incidentally, we stage to two tables (one with the raw unchanged data and one we use for cleaning). A failure of the SSIS package should not roll those back if you want to easily see what the data issue was. You can then tell if the data was wrong from the start or if somehow it got lost or fixed incorrectly in the cleaning process. Sometimes the place where the error got logged is not the place where the error actually occurred, and it is nice to see what the data looked like unchanged and after the change process. Sometimes you have bad data, yes (OK, the majority of times), but sometimes you have a bug. Having both of those tables enables you to quickly see which of the two it is.
You could have all your procs insert to a logging table as the last step and make sure that the record is there before executing the next step if you are concerned that you are losing some executions that are not bubbling back to the package.
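A minimal sketch of that pattern, again with illustrative names; the ? is the OLE DB-style parameter placeholder in an Execute SQL Task:

-- Final step of each stored procedure:
DECLARE @Rows INT = @@ROWCOUNT;    -- capture this immediately after the statement whose count you care about
INSERT INTO dbo.ProcExecutionLog (ProcName, RowsAffected, CompletedAt)
VALUES ('usp_CleanStagingData', @Rows, SYSUTCDATETIME());

-- In the SSIS package, run before the next step:
SELECT COUNT(*) AS Completed
FROM dbo.ProcExecutionLog
WHERE ProcName = 'usp_CleanStagingData'
  AND CompletedAt >= ?;            -- package start time; treat 0 as "the proc never finished"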