I stumbled onto a bit of Groovy code that deletes rows from a database in batches of a given size:
for (int index = 0; index <= loopCnt; index++) {
sql.execute("DELETE TOP(" + maxEntriesToDeleteAtOnce + ") FROM LoaderQueue WHERE Status = ? AND LastUpdated < ?", stateToDelete, date)
}
I have no experience with Groovy, and I know that the DELETE statement can take a few seconds to execute fully, even when the batch size is relatively small. Will execution wait for each statement to complete before the next loop iteration, or might several statements be sent in parallel? In short, is the sql.execute() command synchronous?
I can't really try anything yet as there is no DEV environment for the application involved.
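For illustration only, here is the same batching pattern sketched in Python with the standard sqlite3 module. This is an assumption-laden stand-in: the original runs from Groovy against SQL Server, and SQLite has no DELETE TOP, so the batch is selected with a LIMIT subquery. As far as I know, groovy.sql.Sql issues statements through JDBC, which is a blocking API, so the same sequential behaviour would be expected there: each call returns only after the statement has finished.

```python
import sqlite3


def delete_in_batches(conn, status, cutoff, batch_size):
    """Delete matching rows batch by batch; return the rows deleted per batch."""
    deleted_per_batch = []
    while True:
        # execute() blocks until the statement has finished,
        # so two batches never run at the same time.
        cur = conn.execute(
            "DELETE FROM LoaderQueue WHERE rowid IN ("
            "  SELECT rowid FROM LoaderQueue"
            "  WHERE Status = ? AND LastUpdated < ? LIMIT ?)",
            (status, cutoff, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing left to delete
        deleted_per_batch.append(cur.rowcount)
    return deleted_per_batch


def demo():
    """Build a throwaway in-memory table and purge the 'DONE' rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE LoaderQueue (Status TEXT, LastUpdated TEXT)")
    conn.executemany(
        "INSERT INTO LoaderQueue VALUES (?, ?)",
        [("DONE", "2020-01-01")] * 25 + [("NEW", "2020-01-01")] * 5,
    )
    batches = delete_in_batches(conn, "DONE", "2021-01-01", 10)
    remaining = conn.execute("SELECT COUNT(*) FROM LoaderQueue").fetchone()[0]
    return batches, remaining
```

Looping until a batch deletes nothing, rather than for a precomputed loop count, avoids having to guess how many iterations are needed.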
Problem: I have a huge number of SQL queries (around 10k-20k) and I want to run them asynchronously in 50 (or more) threads.
I wrote a PowerShell script for this job, but it is very slow (it took about 20 hours to execute them all). The desired result is 3-4 hours max.
Question: How can I optimize this PowerShell script? Should I reconsider and use another technology such as Python or C#?
I think it's a PowerShell issue, because when I check with whoisactive the queries execute fast. Creating, exiting and unloading jobs takes a lot of time, because a separate PS instance is created for each thread.
My code:
$NumberOfParallerThreads = 50;
$Arr_AllQueries = @('Exec [mystoredproc] @param1=1, @param2=2',
'Exec [mystoredproc] @param1=11, @param2=22',
'Exec [mystoredproc] @param1=111, @param2=222')
#Creating the batches
$counter = [pscustomobject] @{ Value = 0 };
$Batches_AllQueries = $Arr_AllQueries | Group-Object -Property {
[math]::Floor($counter.Value++ / $NumberOfParallerThreads)
};
forEach ($item in $Batches_AllQueries) {
$tmpBatch = $item.Group;
$tmpBatch | % {
$ScriptBlock = {
# accept the loop variable across the job-context barrier
param($query)
# Execute a command
Try
{
Write-Host "[processing '$query']"
$objConnection = New-Object System.Data.SqlClient.SqlConnection;
$objConnection.ConnectionString = 'Data Source=...';
$ObjCmd = New-Object System.Data.SqlClient.SqlCommand;
$ObjCmd.CommandText = $query;
$ObjCmd.Connection = $objConnection;
$ObjCmd.CommandTimeout = 0;
$objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter;
$objAdapter.SelectCommand = $ObjCmd;
$objDataTable = New-Object System.Data.DataTable;
$objAdapter.Fill($objDataTable) | Out-Null;
$objConnection.Close();
$objConnection = $null;
}
Catch
{
$ErrorMessage = $_.Exception.Message
$FailedItem = $_.Exception.ItemName
Write-Host "[Error processing: $($query)]" -BackgroundColor Red;
Write-Host $ErrorMessage
}
}
# pass the loop variable across the job-context barrier
Start-Job $ScriptBlock -ArgumentList $_ | Out-Null
}
# Wait for all to complete
While (Get-Job -State "Running") { Start-Sleep 2 }
# Display output from all jobs
Get-Job | Receive-Job | Out-Null
# Cleanup
Remove-Job *
}
UPDATE:
Resources: The DB server is on a remote machine with:
24GB RAM,
8 cores,
500GB Storage,
SQL Server 2016
We want to use the maximum cpu power.
Framework limitation: The only limitation is not to use SQL Server to execute the queries. The requests should come from outside source like: Powershell, C#, Python, etc.
RunspacePool is the way to go here, try this:
$AllQueries = @( ... )
$MaxThreads = 5
# Each thread keeps its own connection but shares the query queue
$ScriptBlock = {
Param($WorkQueue)
$objConnection = New-Object System.Data.SqlClient.SqlConnection
$objConnection.ConnectionString = 'Data Source=...'
$objCmd = New-Object System.Data.SqlClient.SqlCommand
$objCmd.Connection = $objConnection
$objCmd.CommandTimeout = 0
$query = ""
while ($WorkQueue.TryDequeue([ref]$query)) {
$objCmd.CommandText = $query
$objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter $objCmd
$objDataTable = New-Object System.Data.DataTable
$objAdapter.Fill($objDataTable) | Out-Null
}
$objConnection.Close()
}
# create a pool
$pool = [RunspaceFactory]::CreateRunspacePool(1, $MaxThreads)
$pool.ApartmentState = 'STA'
$pool.Open()
# convert the query array into a concurrent queue
$workQueue = New-Object System.Collections.Concurrent.ConcurrentQueue[object]
$AllQueries | % { $workQueue.Enqueue($_) }
$threads = @()
# Create each powershell thread and add them to the pool
1..$MaxThreads | % {
$ps = [powershell]::Create()
$ps.RunspacePool = $pool
$ps.AddScript($ScriptBlock) | Out-Null
$ps.AddParameter('WorkQueue', $workQueue) | Out-Null
$threads += [pscustomobject]@{
Ps = $ps
Handle = $null
}
}
# Start all the threads
$threads | % { $_.Handle = $_.Ps.BeginInvoke() }
# Wait for all the threads to complete - errors will still set the IsCompleted flag
while ($threads | ? { !$_.Handle.IsCompleted }) {
Start-Sleep -Seconds 1
}
# Get any results and display any errors
$threads | % {
$_.Ps.EndInvoke($_.Handle) | Write-Output
if ($_.Ps.HadErrors) {
$_.Ps.Streams.Error.ReadAll() | Write-Error
}
}
Unlike PowerShell jobs, the runspaces in a RunspacePool can share resources. So there is one concurrent queue of all the queries, and each thread keeps its own connection to the database.
As others have said though - unless you're stress testing your database, you're probably better off reorganising the queries into bulk inserts.
You need to reorganize your script so that you keep a database connection open in each worker thread, using it for all queries performed by that thread. Right now you are opening a new database connection for each query, which adds a large amount of overhead. Eliminating that overhead should speed things up to or beyond your target.
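A minimal sketch of that structure in Python, for comparison: a shared work queue drained by a few worker threads, each of which opens its connection once and reuses it for every query it picks up. `FakeConnection` is a hypothetical stand-in so the sketch runs without a database; in real use you would pass a factory that opens an actual connection.

```python
import queue
import threading


class FakeConnection:
    """Hypothetical stand-in for a real database connection."""

    opened = 0  # how many connections have been created in total
    _lock = threading.Lock()

    def __init__(self):
        with FakeConnection._lock:
            FakeConnection.opened += 1

    def execute(self, query):
        pass  # a real connection would send the query to the server here

    def close(self):
        pass


def run_queries(queries, worker_count, connect=FakeConnection):
    """Run every query once, using one long-lived connection per worker."""
    work = queue.Queue()
    for q in queries:
        work.put(q)
    done = []
    done_lock = threading.Lock()

    def worker():
        conn = connect()  # opened once per worker, not once per query
        try:
            while True:
                try:
                    q = work.get_nowait()
                except queue.Empty:
                    break  # queue drained, this worker is finished
                conn.execute(q)
                with done_lock:
                    done.append(q)
        finally:
            conn.close()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

With 100 queries and 4 workers, this opens 4 connections instead of 100.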
Try using SqlCmd.
You can start multiple processes with Process.Start() and use sqlcmd to run the queries in parallel processes.
Of course if you're obligated to do it in threads, this answer will no longer be the solution.
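As a rough sketch of the process-based approach, here in Python rather than via Process.Start(): each query becomes one sqlcmd invocation, and a small pool caps how many processes run at once. The server and database names are placeholders, and `-E` (Windows authentication) is an assumption about your setup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_sqlcmd_args(server, database, query):
    """Command line for one query: -S server, -d database,
    -E Windows authentication, -Q run the query and exit."""
    return ["sqlcmd", "-S", server, "-d", database, "-E", "-Q", query]


def run_commands(commands, max_processes):
    """Run each command as its own process, at most max_processes at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    # Each pool thread just waits on one child process.
    with ThreadPoolExecutor(max_workers=max_processes) as pool:
        return list(pool.map(run, commands))
```

Usage would look like `run_commands([build_sqlcmd_args("myserver", "mydb", q) for q in queries], 8)`. Note that every process opens its own connection, so this carries the same per-query connection overhead as the original job-based script.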
Group your queries by the table they touch and the operation they perform on it.
That tells you how many queries you can safely run in parallel against each table.
Check the size of each table the queries run against.
If a table holds millions of rows and the query also joins another table, execution time goes up; if it is a create, update, or delete operation, it may lock the table as well.
Also choose the number of threads based on your CPU cores rather than on assumptions. A core runs one thread at a time, so roughly twice the number of cores is a reasonable thread count.
So study your dataset first, then apply the two points above to work out which queries can run in parallel efficiently.
Hope this gives you some ideas. You could also use a Python script for this, which makes it easy to start several processes and monitor their activity.
Sadly I don't have the time right this instant to answer this fully, but this should help:
First, you almost certainly aren't going to use the entire CPU for inserting that many records. But!
Since it appears you are using SQL string commands:
Split the inserts into groups of roughly 100 to 1,000 rows and manually build bulk inserts:
Something like this as a POC:
$query = "INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES "
for ($alot = 0; $alot -le 10; $alot++){
for ($i = 65; $i -le 85; $i++) {
$query += "('" + [char]$i + "', '" + [char]$i + "')";
if ($i -ne 85 -or $alot -ne 10) {$query += ",";}
}
}
Once a batch is built, then pass it to SQL for the insert, using effectively your existing code.
The bulk insert would look something like:
INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES ('A', 'A'),('B', 'B'),('C', 'C'),('D', 'D'),('E', 'E'),('F', 'F'),('G', 'G'),('H', 'H'),('I', 'I'),('J', 'J'),('K', 'K'),('L', 'L'),('M', 'M'),('N', 'N'),('O', 'O'),('P', 'P'),('Q', 'Q'),('R', 'R'),('S', 'S')
This alone should speed up your inserts by a ton!
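The same batching idea can be sketched in Python with parameter placeholders instead of string concatenation, which sidesteps quoting problems. The sketch uses sqlite3 so it runs anywhere; the table and column names follow the example above, and the batch size of 100 is an arbitrary assumption. On SQL Server, keep in mind that a single VALUES list is limited to 1,000 rows and a request to 2,100 parameters.

```python
import sqlite3


def chunked(rows, size):
    """Yield consecutive slices of rows, each at most size long."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def bulk_insert(conn, rows, batch_size=100):
    """Insert (name, pet_name) pairs with one multi-row INSERT per batch.

    Returns the number of INSERT statements that were executed.
    """
    statements = 0
    for batch in chunked(rows, batch_size):
        # One "(?, ?)" group per row, e.g. "(?, ?),(?, ?),(?, ?)"
        placeholders = ",".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(
            "INSERT INTO Attributes (Name, PetName) VALUES " + placeholders,
            params,
        )
        statements += 1
    conn.commit()
    return statements
```

For 250 rows this issues 3 statements instead of 250 round trips.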
Don't use 50 threads, as previously mentioned, unless you have 25+ logical cores. Most of your SQL insert time is spent waiting on the network and the hard drives, NOT the CPU. With that many threads enqueued, most of your CPU time goes to waiting on the slower parts of the stack.
These two things alone I'd imagine can get your inserts down to a matter of minutes (I did 80k+ once using basically this approach in about 90 seconds).
The last part could be refactoring so that each core gets its own Sql connection, and then you leave it open until you are ready to dispose of all threads.
I don't know much about powershell, but I do execute SQL in C# all the time at work.
C#'s new async/await keywords make it extremely easy to do what you are talking about.
C# will also make a thread pool for you with the optimal amount of threads for your machine.
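async Task<DataTable> ExecuteQueryAsync(string query)
{
    // Run the blocking query on a thread-pool thread.
    // ExecuteQuerySync is your existing synchronous query method.
    return await Task.Run(() => ExecuteQuerySync(query));
}
async Task ExecuteAllQueriesAsync(IEnumerable<string> queries)
{
    var queryTasks = new List<Task<DataTable>>();
    foreach (var query in queries)
    {
        queryTasks.Add(ExecuteQueryAsync(query));
    }
    // Wait until every query has finished.
    await Task.WhenAll(queryTasks);
}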
The code above adds all the queries to the thread pool's work queue, then waits for them all before completing. As a result, the queries run at the maximum level of parallelism the thread pool allows.
Hope this helps!
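For readers coming from Python, a roughly equivalent sketch uses asyncio: each blocking query is handed to a worker thread and all of them are awaited together. `execute_query_sync` is a hypothetical placeholder for the real blocking call, and the semaphore cap of 8 is an assumption added here so the database is not flooded; it is not part of the C# version above.

```python
import asyncio
import time


def execute_query_sync(query):
    """Hypothetical blocking call standing in for a real database query."""
    time.sleep(0.01)
    return "result of " + query


async def execute_all(queries, max_concurrency=8):
    """Run all queries on worker threads, at most max_concurrency at once."""
    limit = asyncio.Semaphore(max_concurrency)

    async def run_one(query):
        async with limit:
            # to_thread (Python 3.9+) runs the blocking function off the event loop
            return await asyncio.to_thread(execute_query_sync, query)

    # gather returns the results in the same order as the input queries
    return await asyncio.gather(*(run_one(q) for q in queries))
```

It would be driven with `asyncio.run(execute_all(queries))`.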
I'm investigating some performance problems in an experimental scheduling application I'm working on. I found that calls to session.SaveChanges() were pretty slow, so I wrote a simple test.
Can you explain why the first iteration of the loop takes 200 ms and subsequent iterations 1-2 ms? How can I leverage this in my application (I don't mind the first call being this slow if all subsequent calls are quick)?
private void StoreDtos()
{
for (int i = 0; i < 3; i++)
{
StoreNewSchedule();
}
}
private void StoreNewSchedule()
{
var sw = Stopwatch.StartNew();
using (var session = DocumentStore.OpenSession())
{
session.Store(NewSchedule());
session.SaveChanges();
}
Console.WriteLine("Persisting schedule took {0} ms.",
sw.ElapsedMilliseconds);
}
Output is:
Persisting schedule took 189 ms. // first time
Persisting schedule took 2 ms. // second time
Persisting schedule took 1 ms. // ... etc
The above is for an in-memory database. Using an HTTP connection to a RavenDB instance (on the same machine), I get similar results. The first call takes noticeably more time:
Persisting schedule took 1116 ms.
Persisting schedule took 37 ms.
Persisting schedule took 14 ms.
On Github: RavenDB 2.0 testcode and RavenDB 2.5 testcode.
The very first time that you call RavenDB, there are several things that have to happen.
We need to prepare the serializers for your entities, which takes time.
We need to create the TCP connection to the server.
On the next calls, we can reuse the connection that is already open and the created serializers.
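The general pattern, one-time setup that is cached and reused, can be sketched in Python as follows. This is not RavenDB code: `get_serializer` and its 50 ms delay are hypothetical stand-ins for the serializer preparation and connection setup described above.

```python
import functools
import time


@functools.lru_cache(maxsize=None)
def get_serializer(entity_type):
    """Hypothetical expensive one-time setup, cached per entity type."""
    time.sleep(0.05)  # stands in for building a serializer or opening a connection
    return repr


def save(entity):
    """Serialize an entity, paying the setup cost only on first use of its type."""
    return get_serializer(type(entity))(entity)


def timed_save(entity):
    """Return how long a single save took, in seconds."""
    start = time.perf_counter()
    save(entity)
    return time.perf_counter() - start
```

To leverage this in an application, one option is to issue a throwaway call during startup so that the slow first call does not land on a user-facing request.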
Server information:
$ httpd -v
Server version: Apache/2.2.24 (Unix)
Server built: May 8 2013 15:17:37
I created a self-signed SSL certificate with openssl.
Test Code(Java with selenium webdriver):
long startTime, useTime = 0, t;
int count = 10;
for (int i = 0; i < count; i++) {
ChromeDriver driver = new ChromeDriver(capabilities);
startTime = System.nanoTime();
driver.get("https://*.*.*.*/pic.html");
//When testing Http,it will be:driver.get("http://*.*.*.*/pic.html");
//pic.html is a simple page with many images.
t = System.nanoTime() - startTime;
useTime += t;
driver.quit();
}
System.out.println("Average Time: " + useTime/1000000.0/count +" ms");
Result:
HTTPs:Average Time: 1718.13659 ms
HTTP:Average Time: 2484.122677 ms
Thanks in advance.
It might be that using https also enables transparent compression of the content. The time added for compression and encryption (and back of course) might be less than the time saved by transferring less content over a slow link.
You can verify this by:
Using incompressible content (e.g. a large JPEG image)
Speeding up the transfer link significantly (e.g. by using "localhost")
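To see why the first check is informative, here is a small Python sketch comparing how well repetitive markup and random bytes compress. The random bytes are an assumption standing in for already-compressed JPEG data; zlib is used only as a generic compressor, not as a claim about what the server or browser actually applies.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data)) / len(data)


# Repetitive markup, like a page that consists mostly of <img> tags
html_like = b"<img src='picture.jpg' alt='a picture'>\n" * 2000

# Random bytes behave like already-compressed image data
random_like = os.urandom(len(html_like))
```

If the HTTPS advantage came from compression, it should largely disappear when the payload looks like `random_like` rather than `html_like`.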
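It may be because Apache and Chrome (I see you're using chromedriver) both support HTTP/2, which is faster for reasons other than encryption but which browsers only use over encrypted connections. Note, however, that Apache's mod_http2 only arrived in version 2.4.17, so a stock Apache 2.2.24 would not negotiate HTTP/2 on its own.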
I'm fetching around 6k articles from the Magento database. Traversing them is very fast at the beginning (0 seconds, just a few ms) and then gets slower and slower. The loop takes about 8 hours to run, and towards the end each iteration of the foreach takes about 16-20 seconds! It seems like MySQL gets slower and slower towards the end, but I cannot explain why.
$product = Mage::getModel('catalog/product');
$data = $product->getCollection()->addAttributeToSelect('*')->addAttributeToFilter('type_id', 'simple');
$num_products = $product->getCollection()->count();
echo 'exporting '.$num_products."\n";
print "starting export\n";
$start_time = time();
foreach ($data as $tProduct) {
// doing some stuff, no sql !
}
Does anyone know why it is so slow? Would it be faster to fetch just the ids and select each product one by one?
The memory usage of the script running this code has a constant memory usage of:
VIRT RES SHR S CPU% MEM%
680M 504M 8832 R 90.0 6.3
Regards, Alex
Oh, well, shot in the dark time. If you are running Magento 1.4.x.x earlier than 1.4.2.0, there is a memory leak that displays exactly this symptom: it eats up more and more memory, eventually leading to memory exhaustion. Profile exports that took 3-8 minutes under 1.3.x.x now take 2-5 hours, if they don't throw an error before completion. Another symptom is an export that fails without finishing and without any indication of why: the window freezes or shows some sort of funky completion message with no output.
The Array Of Death(tm) has been noted and here's the official repair in the new version. Maybe Data Will Flow again!
Excerpt from 1.4.2.0rc1 /lib/Varien/Db/Select.php that has been patched for memory leak
public function __construct(Zend_Db_Adapter_Abstract $adapter)
{
parent::__construct($adapter);
if (!in_array(self::STRAIGHT_JOIN_ON, self::$_joinTypes)) {
self::$_joinTypes[] = self::STRAIGHT_JOIN_ON;
self::$_partsInit = array(self::STRAIGHT_JOIN => false) + self::$_partsInit;
}
}
Excerpt from 1.4.1.1 /lib/Varien/Db/Select.php with memory leak
public function __construct(Zend_Db_Adapter_Abstract $adapter)
{
parent::__construct($adapter);
self::$_joinTypes[] = self::STRAIGHT_JOIN_ON;
self::$_partsInit = array(self::STRAIGHT_JOIN => false) + self::$_partsInit;
}
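The difference between the two constructors can be illustrated with a small Python analogue. The class names are made up for this sketch, and the class-level list plays the role of the PHP static `$_joinTypes` array: without the guard it gains one entry per constructed object, with the guard it stays at one entry.

```python
class LeakySelect:
    join_types = []  # shared by every instance, like a PHP static property

    def __init__(self):
        # Appends on every construction, so the shared list grows without bound
        LeakySelect.join_types.append("straight_join")


class PatchedSelect:
    join_types = []

    def __init__(self):
        # Register the join type only if it is not already present
        if "straight_join" not in PatchedSelect.join_types:
            PatchedSelect.join_types.append("straight_join")


def construct_many(cls, count):
    """Create count instances and return the size of the shared list."""
    for _ in range(count):
        cls()
    return len(cls.join_types)
```

A loop that builds a select object per product grows the leaky list on every iteration, which fits the steadily worsening per-iteration time described in the question.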