In my application, which is implemented in Software AG webMethods, I'm using JMS to process more than 2 million records in parallel. Each JMS thread gets a batch of records (let's say 1,000) to process and insert into a database table (call it table A), and after inserting its batch each thread sends a result message on JMS, which I aggregate later to update the process status.
The problem I'm facing is that each thread processes its batch, performs the insert, and puts the result message on the JMS queue, but the insert transactions get queued up inside the MSSQL database. The application itself does not wait for them; it treats the insert as done and continues with the next line of logic.
Therefore the work on each thread is completed, and the main process is marked as completed, while a lot of records are still waiting to be inserted into the database.
So my question is: is there any trigger in MSSQL that can be used to detect when the queued transactions on a table have finished?
Instead of plain INSERT batches, I suggest you use batches that create a job with two steps. The first step inserts the data, and the second step inserts a row into a results table recording that the batch completed. After that you can check the results table.
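For illustration only, here is a minimal sketch of that idea in Python with pyodbc; the TableA and BatchResults table names, their columns, and the connection string are assumptions, not something from the question.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;",
    autocommit=False,
)

def insert_batch(batch_id, records):
    """Step 1: insert the batch rows; step 2: record completion. Both in one transaction."""
    cur = conn.cursor()
    try:
        cur.executemany("INSERT INTO dbo.TableA (Col1, Col2) VALUES (?, ?);", records)
        # The completion row commits together with the batch, so it cannot become
        # visible before the batch rows themselves are in the table.
        cur.execute(
            "INSERT INTO dbo.BatchResults (BatchId, CompletedAtUtc) VALUES (?, SYSUTCDATETIME());",
            batch_id,
        )
        conn.commit()
    except Exception:
        conn.rollback()
        raise

The main process can then poll SELECT COUNT(*) FROM dbo.BatchResults and mark the overall job complete only when every expected batch has reported.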
Related
I am looking at building a database-based message queue implementation. I will essentially have a database table containing an autogenerated id (bigint), a message id, and message data. I will be writing a pull-based consumer which queries for the oldest record (min(id)) in the table and hands it over for processing.
My doubt is how to handle querying for the oldest record when there are multiple consumer threads. How do I lock the first record read to the first consumer and, ideally, not even make it visible to the next one?
One idea I have is to add another column called locked_by in which I store, let's say, the thread name: select the record for update, immediately set the locked_by column, and then continue processing it, so that locked rows are excluded from the next query.
Is this a workable solution?
Edit:
Essentially, this is what I want.
Connection one queries the database table for a row. Reads first row and locks it while reading for update.
Connection two queries the database table for a row. Should not be able to read the first row, should read second row if available and lock it for update.
Similar logic for connection 3, 4, etc..
Connection one updates the record with its identifier. Processes it and subsequently deletes the record.
TL;DR, see Rusanu's using tables as queues. The example DDL below is gleaned from the article.
CREATE TABLE dbo.FifoQueueTable (
     Id bigint NOT NULL IDENTITY(1,1)
        CONSTRAINT pk_FifoQueue PRIMARY KEY CLUSTERED
    ,Payload varbinary(MAX)
);
GO

CREATE PROCEDURE dbo.usp_EnqueueFifoTableMessage
    @Payload varbinary(MAX)
AS
SET NOCOUNT ON;
INSERT INTO dbo.FifoQueueTable (Payload) VALUES (@Payload);
GO

CREATE PROCEDURE dbo.usp_DequeueFifoTableMessage
AS
SET NOCOUNT ON;
-- READPAST skips rows locked by other callers so concurrent consumers
-- each dequeue a different message; DELETE ... OUTPUT returns the payload.
WITH cte AS (
    SELECT TOP(1) Payload
    FROM dbo.FifoQueueTable WITH (ROWLOCK, READPAST)
    ORDER BY Id
)
DELETE FROM cte
OUTPUT deleted.Payload;
GO
This implementation is simple, but handling the unhappy path can be complex depending on the nature of the messages and the cause of the error.
When message loss is acceptable, one can simply use the default autocommit transaction and log errors.
In cases where messages must not be lost, the dequeue must be done in a client-initiated transaction and committed only after successful processing, or after no message was read. The transaction also ensures messages are not lost if the application or the database service crashes. A robust error-handling strategy depends on the type of error, the nature of the messages, and any message-ordering implications.
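As a concrete example, here is a minimal consumer sketch in Python with pyodbc, assuming the procedure from the DDL above; the connection string and process_message are placeholders and the error handling is deliberately simplistic.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;",
    autocommit=False,  # dequeue and processing share one client-initiated transaction
)

def process_message(payload):
    pass  # placeholder for the real message handling

def consume_one():
    """Dequeue and process a single message; returns False when the queue is empty."""
    cur = conn.cursor()
    try:
        row = cur.execute("EXEC dbo.usp_DequeueFifoTableMessage;").fetchone()
        if row is None:
            conn.commit()   # nothing was read; end the transaction
            return False
        process_message(row[0])
        conn.commit()       # the message is removed only after successful processing
        return True
    except Exception:
        conn.rollback()     # the message stays in the queue and will be retried
        raise

Because the dequeue procedure uses READPAST, several copies of this consumer can run in parallel without being handed the same message.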
For a poison message (i.e. an error in the payload that prevents the message from ever being processed successfully), one can insert the bad message into a dead letter table for subsequent manual review and commit the transaction.
A transient error, such as a failure calling an external service, can be handled with techniques like:
Roll back the transaction so the message stays at the head of the FIFO queue and is retried on the next iteration.
Requeue the failed message and commit, so the message goes to the back of the FIFO queue for a later retry.
Enqueue the failed message in a separate retry queue along with a retry count. The message can be moved to a dead letter table once a retry limit is reached (a sketch follows below).
The app code can also include retry logic during message processing, but it should avoid long-running database transactions and fall back to one of the techniques above after some retry threshold.
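As one possible shape for the retry-queue option, here is a hedged sketch that plugs into the consumer above; the RetryQueueTable and DeadLetterTable tables and the MAX_RETRIES limit are hypothetical and would need matching DDL.

MAX_RETRIES = 5  # hypothetical retry limit

def handle_failure(cursor, payload, retry_count):
    """Route a failed message to the retry queue, or to the dead letter table once exhausted."""
    if retry_count < MAX_RETRIES:
        cursor.execute(
            "INSERT INTO dbo.RetryQueueTable (Payload, RetryCount) VALUES (?, ?);",
            payload, retry_count + 1,
        )
    else:
        cursor.execute(
            "INSERT INTO dbo.DeadLetterTable (Payload, FailedAtUtc) VALUES (?, SYSUTCDATETIME());",
            payload,
        )
    # Committing here also removes the original message dequeued in the same transaction.
    cursor.connection.commit()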
These same concepts can be implemented with Service Broker to facilitate a T-SQL-only solution (internal activation), but that adds complexity when it isn't a requirement (as in your case). Note that SB queues intrinsically implement the READPAST behavior but, because all messages within the same conversation group are locked, the implication is that each message will need to be in a separate conversation.
We have a DAG whose first task aggregates a table (A) into a staging table (B).
After that there is a task that reads from the staging table (B) and writes to another table (C).
However, the second task reads from the aggregated table (B) before it has been fully updated, which causes table C to contain old data, or sometimes to be empty. Airflow still logs everything as successful.
Updating table B is done as (pseudo):
delete all rows;
insert into table b
select xxxx from table A;
Task Concurrency is set as 10
pool size: 5
max_overflow: 10
Using local executor
Redshift seems to have a commit queue. Could it be that Redshift tells Airflow it has committed when the commit is in fact still in the queue, and the next task thus reads before the real commit takes place?
We have tried wrapping the update of table B in a transaction as (pseudo):
begin
delete all rows;
insert into table b
select xxxx from table A;
commit;
But even that does not work. For some reason Airflow manages to start the second task before the first task has fully committed.
UPDATE
It turned out there was a mistake in the dependencies. Downstream tasks were waiting for the wrong task to finish.
For future reference, never be 100 % sure you have checked everything. Check and recheck the whole flow.
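For future readers, the fix amounts to wiring the dependency explicitly; here is a minimal sketch (task names, callables, and DAG settings are made up, not taken from the original DAG):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def aggregate_a_into_b():
    pass  # placeholder for the delete + insert ... select into table B

def load_b_into_c():
    pass  # placeholder for the read from B and write to C

dag = DAG('staging_example', start_date=datetime(2020, 1, 1), schedule_interval='@daily')

aggregate_b = PythonOperator(task_id='aggregate_table_b', python_callable=aggregate_a_into_b, dag=dag)
load_c = PythonOperator(task_id='load_table_c', python_callable=load_b_into_c, dag=dag)

# load_c starts only after aggregate_b has finished successfully.
aggregate_b >> load_c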
You can achieve this goal by setting wait_for_downstream to True.
From https://airflow.apache.org/docs/stable/_api/airflow/operators/index.html :
when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully before it runs.
You can set this parameter at the default_dag_args level or at the tasks (operators) level.
default_dag_args = {
    'wait_for_downstream': True,
}
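A hedged usage sketch follows (the DAG name, start date, and schedule are placeholders); the dict is passed to the DAG as default_args so every task in it inherits wait_for_downstream:

from datetime import datetime
from airflow import DAG

default_dag_args = {
    'wait_for_downstream': True,
    'start_date': datetime(2020, 1, 1),
}

# Every operator created with dag=dag picks up wait_for_downstream=True,
# unless it overrides the parameter itself.
dag = DAG('example_dag', default_args=default_dag_args, schedule_interval='@daily')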
There is a BQ table which has multiple data load/update/delete jobs scheduled against it. Since these are automated jobs, many of them are failing due to concurrent update issues.
I need to know whether BigQuery has a provision to check if the table is already locked by a DML operation, and whether we can serialize the queries so that no job fails.
You could use the job ID generated by client code to keep track of the job status and only begin the next query when that job is done. This and this describe that process.
Alternatively you could try exponential backoff to retry the query a certain number of times to prevent automatic failure of the query due to locked tables.
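As a rough sketch of both ideas with the Python client library (the SQL statements, table name, and retry settings are placeholders, and the error matching is intentionally simplified):

import time
from google.cloud import bigquery

client = bigquery.Client()

def run_serialized(statements, max_retries=5):
    """Run DML statements one at a time, waiting for each job and retrying with backoff."""
    for sql in statements:
        for attempt in range(max_retries):
            try:
                job = client.query(sql)  # job.job_id can be logged to track status
                job.result()             # block until this job finishes before starting the next
                break
            except Exception:             # ideally, match only concurrent-update errors here
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...

run_serialized([
    "UPDATE `project.dataset.table` SET col = 1 WHERE id = 10",
    "DELETE FROM `project.dataset.table` WHERE id = 20",
])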
I have a BigQuery table and I want to use a job with writeDisposition WRITE_TRUNCATE to overwrite the table with a subset of its rows. I am doing this because I'm trying to mimic a DELETE FROM … WHERE … operation.
Suppose while the job is running, I am simultaneously trying to stream rows into the table. Is it possible for rows to be inserted while the job is running and so be overwritten when the job completes? Or is there a locking mechanism that will prevent the rows from being inserted until the job finishes?
In this case you need to stop the streaming jobs while you do your operation, and resume them once you are done with it. There is no locking.
Also, you should allow some cooling-down period after you stop the streaming inserts, as they are processed in the background and you need to let the system finish.
Because of the table metadata caching layer in the streaming system, it currently needs about 10 minutes to realize that a table has been truncated. During these ~10 minutes, all streamed data will be dropped (because it is considered part of the truncated data).
As Pentium10 suggested, it's recommended to pause the streaming requests if you are doing a WRITE_TRUNCATE, and resume them ~10 minutes after the truncation is done.
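For reference, a rough sketch of the truncating job itself with the Python client (the project, dataset, table, and keep_flag filter are placeholders); pausing and resuming the streaming inserts is operational and not shown:

from google.cloud import bigquery

client = bigquery.Client()

# Overwrite the table with the subset of rows to keep, mimicking DELETE FROM ... WHERE.
job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string("my-project.my_dataset.my_table")
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

job = client.query(
    "SELECT * FROM `my-project.my_dataset.my_table` WHERE keep_flag = TRUE",
    job_config=job_config,
)
job.result()  # wait for the truncate-and-rewrite to finish before resuming streaming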
I have a question regarding how to handle sql queries to a table while performing batch inserts to the same table.
I have an ASP.NET web application that creates lots of objects (perhaps 50,000) that are inserted in a batch fashion into a table using NHibernate. Even with the NHibernate optimizations in place this takes up to two minutes. I perform this in a database transaction with the isolation level set to read committed.
During the batch insert, clients of the web application must be able to read previously created data in this table. However, they should not be able to read uncommitted data. My problem is that if I use isolation level "read committed" on the select queries, they time out because they are waiting for the batch insert job to finish.
Is there any way to query the database in such a way so that the query runs fast and returns all of the committed rows in the table without waiting on the batch insert job to finish? I do not want to return any uncommitted data.
I have tested setting the isolation level to "snapshot" and that seems to solve my problem, but is it the best approach?
Best regards Whimsical
SNAPSHOT isolation returns data that existed prior to the beginning of the transaction, and it doesn't take a lock on the table, so it doesn't block. It also isn't blocked by other transactions' locks, so in your scenario it sounds like the best fit for you. What it does mean is that, since your data is being inserted in a batch, no data from that batch will be available to the SELECT statement until the batch completes (see the sketch after the timeline), i.e.:
Time 1: Dataset A exists in Table
Time 2: Batch starts inserting dataset B into table (but doesn't commit).
Time 3: App takes snapshot, and reads in dataset A.
Time 4: App finishes returning dataset A (and only dataset A).
Time 5: Batch finishes writing dataset B; dataset A and dataset B are both available in the table.
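To make the moving parts concrete, here is a minimal sketch (in Python with pyodbc purely for illustration; the database name, table name, and connection string are placeholders, and in ASP.NET the equivalent is opening the reading transaction with snapshot isolation):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=MyDb;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# One-time setup: the database must allow SNAPSHOT isolation.
cur.execute("ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;")

# On the reading connection, switch the session to SNAPSHOT isolation.
# The SELECT then returns only rows committed before it started and is not
# blocked by the locks held by the running batch insert.
cur.execute("SET TRANSACTION ISOLATION LEVEL SNAPSHOT;")
committed_rows = cur.execute("SELECT * FROM dbo.TableA;").fetchall()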