We are planning to implement a locking mechanism for our documents using xdmp:lock-acquire API in MarkLogic with no timeout option. The document would be locked until the user edits and save the document. As part of this, we are in need to release all the locks at a specified time, say 12.00 AM everyday.
For this, we could use xdmp:lock-release API, but if there are many documents it would take some time to get complete.
Can someone suggest a better way to achieve this in MarkLogic?
If you have a potentially large set of locks that you need to process, and are concerned about timeouts or other issues from doing all of the work in a single transaction, then you can break up the work into smaller chunks or individual transactions.
There are a variety of batch processing tools and frameworks to do that. CoRB is one option that makes it easy to plug in custom selectors and processing scripts, and to execute against giant sets.
If you are looking to initiate the work from a MarkLogic scheduled task and perform all of the work within MarkLogic, then you could spawn multiple tasks to process subsets.
A simple example demonstrating how to set a "chunk size" for each transaction and to keep spawning more work:
declare function local:release-locks($locks, $chunk-size){
if (exists($locks))
then (
(: release all of these locks(you might apply some sort of filter to restrict to a subset,
and maybe a try/catch in case the lock gets released before this runs) :)
$locks[1 to $chunk-size] ! xdmp:node-uri(.) ! xdmp:lock-release(.),
(: now spawn the next set to be released in a separate transaction :)
xdmp:spawn-function(function(){
local:release-locks(subsequence($locks, $chunk-size+1), $chunk-size)
},
<options xmlns="xdmp:eval">
<update>true</update>
<commit>auto</commit>
</options>)
)
else () (: nothing left to do, stop spawning work :)
};
let $locks := xdmp:document-locks()
let $chunk-size := 1000
local:release-locks($locks, $chunk-size)
If you are looking to go down this route, there are some libraries available:
https://github.com/bradmann/marklogic-spawnlib
https://github.com/mblakele/taskbot
The risk of spawning multiple items onto the task server is that if there is a restart or interruption, some tasks may not execute and all locks may not be released. But if you are just looking to release all of the locks, you could then just re-run the script to kick off another round.
Related
I want to implement a mutual exclusion system in PostgreSQL where multiple worker processes will temporarily lock resources (rows) from a table (queue) while they work on them. If the worker processes crash, I want the lock to be cleanly released and not have to rely on another process to clean up the leaked locks.
What I have come up with so far is to use a SELECT ... FOR UPDATE SKIP LOCKED query within a transaction, which locks the row it finds and skips any other locked row.
It works well but one of the issues is that the worker might take a while to do its task and I need to keep the transaction open for the entire duration of its task.
Another problem is that the workers work incrementally and persist their state to the database so that if they're stopped or crash, they can resume quickly where they were. The row being locked makes it impossible to persist their state in the same table (though I think I can get away from that by using another table to persist the state).
I've searched on the Web on how to implement a semaphore or a resource borrowing system in SQL/PostgreSQL but I haven't found something that fits my needs. Is there a simple way of achieving this with PostgreSQL?
In Mule, I have quite many records to process, where processing includes some calculations, going back and forth to database etc.. We can process collections of records with these options
Batch processing
ForEach
Splitter-Aggregator
So what are the main differences between them? When should we prefer one to others?
Mule batch processing option does not seem to have batch job scope variable definition, for example. Or, what if I want to benefit multithreading to fasten the overall task? Or, which is better if I want to modify the payload during processing?
When you write "quite many" I assume it's too much for main memory, this rules out spliter/aggregator because it has to collect all records to return them as a list.
I assume you have your records in a stream or iterator, otherwise you probably have a memory problem...
So when to use for-each and when to use batch?
For Each
The most simple solution, but it has some drawbacks:
It is single threaded (so may be too slow for your use case)
It is "fire and forget": You can't collect anything within the loop, e.g. a record count
There is not support handling "broken" records
Within the loop, you can have several steps (message processors) to process your records (e.g. for the mentioned database lookup).
May be a drawback, may be an advantage: The loop is synchronous. (If you want to process asynchronous, wrap it in an async-scope.)
Batch
A little more stuff to do / to understand, but more features:
When called from a flow, always asynchronous (this may be a drawback).
Can be standalone (e.g. with a poll inside for starting)
When the data generated in the loading phase is too big, it is automatically offloaded to disk.
Multithreading for free (number of threads configurable)
Handling for "broken records": Batch steps may be executed for good/broken records only.
You get statitstics at the end (number of records, number of successful records etc.)
So it looks like you better use batch.
For Splitter and Aggregator , you are responsible for writing the splitting logic and then joining them back at the end of processing. It is useful when you want to process records asynchronously using different server. It is less reliable compared to other option, here parallel processing is possible.
Foreach is more reliable but it process records iteratively using single thread ( synchronous), hence parallel processing is not possible. Each records creates a single message by default.
Batch processing is designed to process millions of records in a very fast and reliable way. By default 16 threads will process your records and it is reliable as well.
Please go through the link below for more details.
https://docs.mulesoft.com/mule-user-guide/v/3.8/splitter-flow-control-reference
https://docs.mulesoft.com/mule-user-guide/v/3.8/foreach
I have been using approach to pass on records in array to stored procedure.
You can call stored procedure inside for loop and setting batch size of the for loop accordingly to avoid round trips. I have used this approach and performance is good. You may have to create another table to log results and have that logic in stored procedure as well.
Below is the link which has all the details
https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro
My organization has recently had trouble with some SQL Server blocking processes. dbWarden has successfully reported blocking to us, but we often have the blocking SQL text reported as 'FETCH API_CURSOR'.
So, we're looking to alter the blocking alerts trigger in dbWarden to use sys.dm_exec_cursors and sys.dm_exec_sql_text to retrieve the text in the case where we find 'FETCH API_CURSOR' reported.
Trouble is, I cannot seem to come up with a way to recreate/simulate a blocking situation on our development server that will report as 'FETCH API_CURSOR'. I've started from the VB script here on SQL Authority to recreate the open cursor, but I cannot for the life of me figure out how to make it blocking.
I've seen many methods for recreating blocking transactions (open a transaction in one window, but do not commit/close, then try an update on same table in another), but not that would utilize FETCH API_CURSOR in a way that would allow us to successfully test. I'm somewhat at a loss here.
Has anyone had success in simulating blocking cursors in the past and can offer suggestions?
I'd suggest you to use Profiler tool to capture actual code that creates and fetches cursor. In this case, you'd see exactly what's going on in an application. It is not so difficult to reproduce similar blocking on a development server.
Let's say, one thread fetches rows from a cursor and another thread try to UPDATE same rows. See what's going on under the hood. Reading thread creates cursor to fetch result of SELECT back to an application. This technology is ancient and extremely slow and nowaday only some old (mostly) Java application use cursors for this purpose. Rows get fetched one-by-one, client is handling this process, so it takes time. During this time, reading thread holds shared locks on data it reads. It is by design, SQL Server is locker, it does use locks to function properly. If another thread tries to update a row that has been locked with shared lock, it get blocked. Because updating thread uses shared locks when searching rows for update, and tries to upgrade it to something more serious, but can't. For example, you can't upgrade your shared lock to U lock if another thread owns S lock on the same row. So I'd try to create a cursor, fetch several rows and tried to update in another tab. If you see difficulties, try to increase reading transaction serialization level.
But, seriously, I don't think you have to reproduce this or similar scenario on a development server. Stop use cursors for recordset fetching! Reads will be much faster and blocking issues will be reduced a lot. It's been a while since 1989 when cursors seen their great times. Client DB-access libraries evolved a lot, it is worth trying to pick up fruits of progress. Even in Java, it is a configuration option, use or not use them.
I apologize if cursor does get used on purpose, in this case. It is very unprobably but possible. I haven't seen such 'proper' cursor usage for ages! I'll be delighted to run into one more proper case.
I have a task that takes quite a long time. So I would like to let several programs/threads/computers execute the same task to speed things up. Each task requires unique ids which are stored in a db – so I thought these ids could be obtained like this:
NHibernateSession.Current.BeginTransaction(IsolationLevel.Serializable);
list = NHibernateSession.Current.CreateCriteria<RelevantId>().SetFirstResult(0).SetMaxResults(500).List<RelevantId>();
foreach (RelevantId x in list)
{
RelevantIdsRepository.Delete(x);
}
NHibernateSession.Current.Transaction.Commit();
Unfortunately, this throws an exception after a while if several processes access the database (nr of deleted objects is not the same as batch size). Why is this? The isolation level of the db should be ok shouldn’t it? Thanks.
Best wishes,
Christian
I'm not sure that I understand what you are doing here. It looks like each process should take some ids and process them but no two processes should take the same.
It doesn't work like you implemented it. All processes are reading the same ids. After committing the transaction they disappear from the database. Until then, they are visible to everyone. Isolation level only make sure that other transactions can't read them after they got deleted. But until then, they all can read them.
It's not so easy to distribute load. You could
maintain ids in a table where each process is registering itself as the executer and commits it before starting (handling conflicts, eg. StaleObjectStateException). Make sure to clean it up even when a process crashes.
write a central service which distributes ids.
The problem that it runs slow, is possibly due to the fact that you perform multiple SQL statements in a loop.
You should see if it is not possible to delete all entities in one batch-statement.
Do you know of any ORM tool that offers deadlock recovery? I know deadlocks are a bad thing but sometimes any system will suffer from it given the right amount of load. In Sql Server, the deadlock message says "Rerun the transaction" so I would suspect that rerunning a deadlock statement is a desirable feature on ORM's.
I don't know of any special ORM tool support for automatically rerunning transactions that failed because of deadlocks. However I don't think that a ORM makes dealing with locking/deadlocking issues very different. Firstly, you should analyze the root cause for your deadlocks, then redesign your transactions and queries in a way that deadlocks are avoided or at least reduced. There are lots of options for improvement, like choosing the right isolation level for (parts) of your transactions, using lock hints etc. This depends much more on your database system then on your ORM. Of course it helps if your ORM allows you to use stored procedures for some fine-tuned command etc.
If this doesn't help to avoid deadlocks completely, or you don't have the time to implement and test the real fix now, of course you could simply place a try/catch around your save/commit/persist or whatever call, check catched exceptions if they indicate that the failed transaction is a "deadlock victim", and then simply recall save/commit/persist after a few seconds sleeping. Waiting a few seconds is a good idea since deadlocks are often an indication that there is a temporary peak of transactions competing for the same resources, and rerunning the same transaction quickly again and again would probably make things even worse.
For the same reason you probably would wont to make sure that you only try once to rerun the same transaction.
In a real world scenario we once implemented this kind of workaround, and about 80% of the "deadlock victims" succeeded on the second go. But I strongly recommend to digg deeper to fix the actual reason for the deadlocking, because these problems usually increase exponentially with the number of users. Hope that helps.
Deadlocks are to be expected, and SQL Server seems to be worse off in this front than other database servers. First, you should try to minimize your deadlocks. Try using the SQL Server Profiler to figure out why its happening and what you can do about it. Next, configure your ORM to not read after making an update in the same transaction, if possible. Finally, after you've done that, if you happen to use Spring and Hibernate together, you can put in an interceptor to watch for this situation. Extend MethodInterceptor and place it in your Spring bean under interceptorNames. When the interceptor is run, use invocation.proceed() to execute the transaction. Catch any exceptions, and define a number of times you want to retry.
An o/r mapper can't detect this, as the deadlock is always occuring inside the DBMS, which could be caused by locks set by other threads or other apps even.
To be sure a piece of code doesn't create a deadlock, always use these rules:
- do fetching outside the transaction. So first fetch, then perform processing then perform DML statements like insert, delete and update
- every action inside a method or series of methods which contain / work with a transaction have to use the same connection to the database. This is required because for example write locks are ignored by statements executed over the same connection (as that same connection set the locks ;)).
Often, deadlocks occur because either code fetches data inside a transaction which causes a NEW connection to be opened (which has to wait for locks) or uses different connections for the statements in a transaction.
I had a quick look (no doubt you have too) and couldn't find anything suggesting that hibernate at least offers this. This is probably because ORMs consider this outside of the scope of the problem they are trying to solve.
If you are having issues with deadlocks certainly follow some of the suggestions posted here to try and resolve them. After that you just need to make sure all your database access code gets wrapped with something which can detect a deadlock and retry the transaction.
One system I worked on was based on “commands” that were then committed to the database when the user pressed save, it worked like this:
While(true)
start a database transaction
Foreach command to process
read data the command need into objects
update the object by calling the command.run method
EndForeach
Save the objects to the database
If not deadlock
commit the database transaction
we are done
Else
abort the database transaction
log deadlock and try again
EndIf
EndWhile
You may be able to do something like with any ORM; we used an in house data access system, as ORM were too new at the time.
We run the commands outside of a transaction while the user was interacting with the system. Then rerun them as above (when you use did a "save") to cope with changes other people have made. As we already had a good ideal of the rows the command would change, we could even use locking hints or “select for update” to take out all the write locks we needed at the start of the transaction. (We shorted the set of rows to be updated to reduce the number of deadlocks even more)