I stumbled onto a bit of Groovy code that deletes rows from a database in batches of a given size:
for (int index = 0; index <= loopCnt; index++) {
    sql.execute("DELETE TOP(" + maxEntriesToDeleteAtOnce + ") FROM LoaderQueue WHERE Status = ? AND LastUpdated < ?", stateToDelete, date)
}
I have zero experience with Groovy, and I know that the delete statement can take a few seconds to complete, even if the batch size is relatively small. I'm wondering whether execution waits for each statement to finish before looping, or whether we might end up sending several statements in parallel. In short, is the sql.execute() call synchronous?
I can't really try anything yet as there is no DEV environment for the application involved.
I understand that explicit transactions should be used even for reading data, but I am unable to understand why the code below runs much slower under an NHibernate transaction (as opposed to running without one):
session.BeginTransaction();
var result = session.Query<Order>().Where(o=>o.OrderNumber > 0).Take(100).ToList();
session.Transaction.Commit();
I can post more detailed unit test code if needed, but if I am querying over 50,000 Order records, this query takes about 1 second to run under NHibernate's explicit transaction, and only about 15-20 ms without one.
Update 1/15/2019
Here is the detailed code
[Test]
public void TestQueryLargeDataUnderTransaction()
{
    int count = 50000;
    using (var session = _sessionFactory.OpenSession())
    {
        Order order;

        // write large amount of data
        session.BeginTransaction();
        for (int i = 0; i < count; i++)
        {
            order = new Order {OrderNumber = i, OrderDate = DateTime.Today};
            OrderLine ol1 = new OrderLine {Amount = 1 + i, ProductName = $"sun screen {i}", Order = order};
            OrderLine ol2 = new OrderLine {Amount = 2 + i, ProductName = $"banjo {i}", Order = order};
            order.OrderLines = new List<OrderLine> {ol1, ol2};
            session.Save(order);
            session.Save(ol1);
            session.Save(ol2);
        }
        session.Transaction.Commit();

        Stopwatch s = Stopwatch.StartNew();

        // read the same data
        session.BeginTransaction();
        var result = session.Query<Order>().Where(o => o.OrderNumber > 0).Skip(0).Take(100).ToList();
        session.Transaction.Commit();

        s.Stop();
        Console.WriteLine(s.ElapsedMilliseconds);
    }
}
Your for-loop iterates 50000 times, and each iteration creates 3 objects. So by the time you reach the first call to Commit(), the session is tracking about 150000 objects that it will flush to the database at commit time (or earlier), subject to your id generator policy and flush mode.
So far, so good. NHibernate is not necessarily optimised to handle that many objects in the session, but it can be acceptable provided one is careful.
On to the problem...
It's important to realize that committing the transaction does not remove the 150000 objects from the session.
When you later perform the query, it will notice that it is inside a transaction, in which case, by default, "auto-flushing" will be performed. This means that before sending the SQL query to the database, NHibernate will check if any of the objects known to the session has changes that might affect the outcome of the query (this is somewhat simplified). If such changes are found, they will be transmitted to the database before performing the actual SQL query. This ensures that the executed query will be able to filter based on changes made in the same session.
The extra second you notice is the time it takes NHibernate to iterate over the 150000 objects known to the session, checking each one for changes. The primary use cases for NHibernate rarely involve more than tens or a few hundred objects, in which case the time needed to check for changes is negligible.
You can use a new session for the query to avoid this effect, or you can call session.Clear() immediately after the first commit. (Note that for production code, session.Clear() can be dangerous.)
Additional: the auto-flushing happens when querying, but only inside a transaction. This behaviour can be controlled using session.FlushMode. During auto-flush, NHibernate will aim to flush only objects that may affect the outcome of the query (i.e. based on which database tables the query touches).
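As a minimal sketch of both remedies, reusing the test code above (FlushMode.Commit is one reasonable choice here, not the only one):
// Remedy 1: detach everything right after the bulk insert, so the later
// query has no tracked objects to scan during auto-flush.
session.Transaction.Commit();
session.Clear(); // also discards any pending, un-flushed changes

// Remedy 2: restrict flushing to commit time for this session, so queries
// no longer trigger the change-detection pass at all.
session.FlushMode = FlushMode.Commit;

session.BeginTransaction();
var result = session.Query<Order>().Where(o => o.OrderNumber > 0).Take(100).ToList();
session.Transaction.Commit();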
There is an additional effect to be aware of with regards to keeping sessions around. Consider this code:
using (var session = _sessionFactory.OpenSession())
{
    Order order;
    session.BeginTransaction();
    for (int i = 0; i < count; i++)
    {
        // Your code from above.
    }
    session.Transaction.Commit();

    // The order variable references the last order created. Let's modify it.
    order.OrderDate = DateTime.Today.AddDays(4);

    session.BeginTransaction();
    var result = session.Query<Order>().Skip(0).Take(100).ToList();
    session.Transaction.Commit();
}
What will happen with the change to the order date made after the first call to Commit()? That change will be persisted to the database when the query is performed in the second transaction, even though the object was modified before that transaction was started. Conversely, if you remove the second transaction, that modification will of course not be persisted.
There are multiple ways to manage sessions and transactions, suited to different purposes. However, by far the easiest is to always follow this simple unit-of-work pattern (sketched in code after the list):
Open session.
Immediately open transaction.
Perform a reasonable amount of work.
Commit or rollback transaction.
Dispose transaction.
Dispose session.
Discard all objects loaded using the session. At this point they can still be used in memory, but any changes will not be persisted. Safer to just get rid of them.
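A minimal sketch of that pattern with NHibernate, assuming the same _sessionFactory and entities as in the test above:
using (var session = _sessionFactory.OpenSession())
using (var tx = session.BeginTransaction())
{
    try
    {
        // Perform a reasonable amount of work.
        var orders = session.Query<Order>()
                            .Where(o => o.OrderNumber > 0)
                            .Take(100)
                            .ToList();

        tx.Commit();
    }
    catch
    {
        tx.Rollback();
        throw;
    }
}
// Session and transaction are disposed here; treat the loaded orders as
// detached snapshots and do not expect further changes to be persisted.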
So I have a script that is supposed to update a giant table (Postgres). Since the table has about 150m rows and I want to complete this as fast as possible, using multiple threads seemed like a perfect answer. However, I'm seeing something very weird.
When I use a single thread, the write time to an update is much much lower than when I use multiple threads.
require 'sequel'
.....
DB = Sequel.connect(DB_CREDS)
queue = Queue.new
read_query = DB["
  SELECT id, extra_fields
  FROM objects
  WHERE XYZ IS FALSE
"]

read_query.use_cursor(:rows_per_fetch => 1000).each do |row|
  queue.push(row)
end
Up until this point it shouldn't matter, IMO, because we're just reading from the DB; nothing is being written yet. From here, I've tried two approaches: single-threaded and multi-threaded.
NOTE - This is not the actual UPDATE query that I want to execute; it's just a placeholder for demonstration purposes. The actual query is a lot longer and manipulates JSON, so I can't really update the entire table using a single query.
Single-threaded
until queue.empty?
  photo = queue.shift
  id = photo[:id]
  update_query = DB["
    UPDATE objects
    SET XYZ = TRUE
    WHERE id = #{id}
  "]
  result = update_query.update
end
If I execute this, I see in my DB logs that each update query takes less than 0.01 seconds:
I, [2016-08-15T10:45:48.095324 #54495] INFO -- : (0.001441s) UPDATE objects SET XYZ = TRUE WHERE id = 84395179
I, [2016-08-15T10:45:48.103818 #54495] INFO -- : (0.008331s) UPDATE objects SET XYZ = TRUE WHERE id = 84395181
I, [2016-08-15T10:45:48.106741 #54495] INFO -- : (0.002743s) UPDATE objects SET XYZ = TRUE WHERE id = 84395182
Multi-threaded
MAX_THREADS = 5
num_threads = 0
all_threads = []

until queue.empty?
  if num_threads < MAX_THREADS
    photo = queue.shift
    num_threads += 1
    all_threads << Thread.new {
      id = photo[:id]
      update_query = DB["
        UPDATE objects
        SET XYZ = TRUE
        WHERE id = #{id}
      "]
      result = update_query.update
      num_threads -= 1
      Thread.exit
    }
  end
end

all_threads.each do |thread|
  thread.join
end
Now, in theory it should be faster, right? But each update takes about 0.5 seconds, and I'm surprised that this is the case.
I, [2016-08-15T11:02:10.992156 #54583] INFO -- : (0.414288s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498834
I, [2016-08-15T11:02:11.097004 #54583] INFO -- : (0.622775s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498641
I, [2016-08-15T11:02:11.097074 #54583] INFO -- : (0.415521s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498826
Any ideas on:
Why is this happening?
How can I increase the update speed with the multi-threaded approach?
Have you configured Sequel so that it has a connection pool of 5 connections?
Have you considered doing multiple updates per call via an IN clause?
If you haven't done 1, you have N threads fighting over fewer than N connections, which equates to resource starvation, a classic concurrency issue.
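If point 1 is the issue, sizing the pool is a one-line change to the connect call from the question (:max_connections is a standard Sequel option, and its default is small):
# Give the pool at least one connection per update thread so the threads
# are not serialized waiting for a free connection.
DB = Sequel.connect(DB_CREDS, :max_connections => MAX_THREADS)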
Your example can be reduced to: DB[:objects].where(:XYZ=>false).update(:XYZ=>true)
I'm guessing your actual need is not that simple. But the same approach may still work. Instead of issuing a query per row, use a single query to update all related rows.
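If one statement over the whole table really isn't possible, a middle ground is to update rows in chunks via an IN clause. A rough sketch using the simplified XYZ example from the question (the batch size is arbitrary):
# Fetch the candidate ids once, then update them 1000 at a time;
# each statement touches many rows instead of one.
ids = DB[:objects].where(:XYZ => false).select_map(:id)

ids.each_slice(1000) do |batch|
  DB[:objects].where(:id => batch).update(:XYZ => true)
end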
I went through something similar on a project ("import all history from a legacy database into a new one with completely different structure and organization"). Unless you managed to shoot yourself in the foot somewhere else, you have 2 basic bottlenecks to look for:
the database's disk IO
the ruby process' CPU
Some suggestions:
database IO: use DB transactions and update 1000 records per transaction (you can tweak the exact number, but 1000 is usually good). A huge DB table usually means a lot of indexes too, and every couple of update actions will trigger REINDEX and AUTOVACUUM work within the DB, which significantly drops update speed. A transaction basically lets you push 1000 updated records without REINDEX and AUTOVACUUM and then perform both actions once, which is MUCH faster (something like an order of magnitude); see the sketch after this list.
database IO: change your indexes. Drop every index you can live without during the update process; ideally you will have only one very streamlined index that allows unique row lookups for update purposes.
ruby CPU: unless you are using JRuby or Rubinius, or you are REALLY paying the price of network latency to your DB, threads will not benefit you much; use fork/processes instead (see the GIL). You did a great job choosing Sequel over AR for this.
ruby CPU: if you decide to go threads + JRuby with this, don't forget to try plugging in JProfiler; it's amazing at tracing bottlenecks in Java, and the author of Sidekiq swears it is amazing for JRuby too. Unfortunately, as far as I know, there is no equivalent of JProfiler for C Ruby (there are profiling tools, but nowhere near as useful).
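As a rough sketch of the transaction batching suggestion, keeping the per-row update from the question (the real JSON-manipulating query would slot into the same place):
# Drain the queue, then wrap every 1000 per-row updates in one transaction
# so the commit and maintenance overhead is paid once per batch.
rows = []
rows << queue.shift until queue.empty?

rows.each_slice(1000) do |batch|
  DB.transaction do
    batch.each do |row|
      DB[:objects].where(:id => row[:id]).update(:XYZ => true)
    end
  end
end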
After you implement these suggestions, you know you did all you could when:
all of the CPUs on the Ruby box are at 100% load
the hard disk IO of the DB is at 100% throughput
Find this sweet spot and don't add additional Ruby update threads/processes after that (or add more hardware); that's that.
PS: check out https://github.com/ruby-concurrency/concurrent-ruby - it's a great parallelization library.
I want to post some bulk messages. The system takes some time to process them, so I do not want to proceed to the second iteration until they are done. My setup is something like this:
While Controller -> JDBC Request -> BeanShell PostProcessor
In the While Controller, the condition is ${__javaScript("${check_1}" != "0")}
check is the variable name set as part of the database sampler, which checks whether all the messages have been processed. It holds a count; if it is 0, we have to stop looping.
As part of the BeanShell PostProcessor, I have added a condition to wait if the count is not equal to 0.
if (${check_1} != 0) {
    out("check Count not zero, waiting for 5 sec " + ${check_1});
    Thread.sleep(5000);
} else {
    out("check Count is zero " + ${check_1});
}
What's happening is something like this:
if check_1 is > 0, it waits for 5 seconds, but as soon as it reaches 0, it runs into an infinite loop, executing the sampler multiple times.
Is there something wrong with the condition? Please suggest any other solution as well.
The correct way to use the __javaScript() function to define the condition is:
${__javaScript(${check_1} != 0,)}
The correct way of accessing JMeter Variables from BeanShell is:
if(vars.get("check_1").equals("0"))
Hope this helps.
I'm investigating some performance problems in an experimental scheduling application I'm working on. I found that calls to session.SaveChanges() were pretty slow, so I wrote a simple test.
Can you explain why the first iteration of the loop takes 200 ms and subsequent loops take 1-2 ms? How can I leverage this in my application? (I don't mind the first call being this slow if all subsequent calls are quick.)
private void StoreDtos()
{
    for (int i = 0; i < 3; i++)
    {
        StoreNewSchedule();
    }
}

private void StoreNewSchedule()
{
    var sw = Stopwatch.StartNew();
    using (var session = DocumentStore.OpenSession())
    {
        session.Store(NewSchedule());
        session.SaveChanges();
    }
    Console.WriteLine("Persisting schedule took {0} ms.", sw.ElapsedMilliseconds);
}
Output is:
Persisting schedule took 189 ms. // first time
Persisting schedule took 2 ms. // second time
Persisting schedule took 1 ms. // ... etc
The above is for an in-memory database. Using an HTTP connection to a RavenDB instance (on the same machine), I get similar results. The first call takes noticeably more time:
Persisting schedule took 1116 ms.
Persisting schedule took 37 ms.
Persisting schedule took 14 ms.
On GitHub: RavenDB 2.0 test code and RavenDB 2.5 test code.
The very first time that you call RavenDB, there are several things that have to happen.
We need to prepare the serializers for your entities, which takes time.
We need to create the TCP connection to the server.
On subsequent calls, we can reuse the already-open connection and the serializers we created.
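If that one-time cost matters, one option (a sketch, not an official API; the document id is made up and NewSchedule() is assumed to return a Schedule entity) is to pay it during application startup with a throwaway request, so the first real save is already fast:
// Hypothetical warm-up at startup: a cheap load forces the client to open
// its connection and do its one-time setup before real work begins.
using (var session = DocumentStore.OpenSession())
{
    session.Load<Schedule>("schedules/warmup-check");
}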
I'm fetching around 6k articles from the Magento database. Traversing them is very fast at the beginning (0 seconds, just a few ms) and gets slower and slower. The loop takes about 8 hours to run, and by the end each iteration of the foreach takes about 16-20 seconds! It seems like MySQL is getting slower and slower towards the end, but I cannot explain why.
$product = Mage::getModel('catalog/product');
$data = $product->getCollection()->addAttributeToSelect('*')->addAttributeToFilter('type_id', 'simple');
$num_products = $product->getCollection()->count();
echo 'exporting '.$num_products."\n";
print "starting export\n";
$start_time = time();
foreach ($data as $tProduct) {
    // doing some stuff, no SQL!
}
Does anyone know why it is so slow? Would it be faster to just fetch the IDs and select each product one by one?
The script running this code has a constant memory usage of:
VIRT RES SHR S CPU% MEM%
680M 504M 8832 R 90.0 6.3
Regards, Alex
Oh well, shot-in-the-dark time. If you are running Magento 1.4.x.x prior to 1.4.2.0, you have a memory leak that displays exactly this symptom: it eats up more and more memory, eventually leading to memory exhaustion. Profile exports that took 3-8 minutes under 1.3.x.x will now take 2-5 hours, if they don't throw an error before completing. Another symptom is exports that fail without finishing and without giving any indication of why; the window freezes or shows some sort of funky completion message with no output.
The Array Of Death(tm) has been noted and here's the official repair in the new version. Maybe Data Will Flow again!
Excerpt from 1.4.2.0rc1 /lib/Varien/Db/Select.php that has been patched for the memory leak. Note the in_array() guard: in 1.4.1.1 the STRAIGHT_JOIN_ON join type was appended to the static $_joinTypes array every time a Select object was constructed, so long-running collection work grew it without bound.
public function __construct(Zend_Db_Adapter_Abstract $adapter)
{
    parent::__construct($adapter);
    if (!in_array(self::STRAIGHT_JOIN_ON, self::$_joinTypes)) {
        self::$_joinTypes[] = self::STRAIGHT_JOIN_ON;
        self::$_partsInit = array(self::STRAIGHT_JOIN => false) + self::$_partsInit;
    }
}
Excerpt from 1.4.1.1 /lib/Varien/Db/Select.php with the memory leak:
public function __construct(Zend_Db_Adapter_Abstract $adapter)
{
    parent::__construct($adapter);
    self::$_joinTypes[] = self::STRAIGHT_JOIN_ON;
    self::$_partsInit = array(self::STRAIGHT_JOIN => false) + self::$_partsInit;
}