How to execute a large number of SQL queries asynchronously and in threads - sql

Problem: I have a huge number of SQL queries (around 10k-20k) and I want to run them asynchronously in 50 (or more) threads.
I wrote a PowerShell script for this job, but it is very slow (it took about 20 hours to execute them all). The desired result is 3-4 hours max.
Question: How can I optimize this PowerShell script? Should I reconsider and use another technology like Python or C#?
I think it's a PowerShell issue, because when I check with whoisactive the queries themselves execute fast. Creating, exiting and unloading the jobs takes a lot of time, because a separate PS instance is created for each thread.
My code:
$NumberOfParallelThreads = 50;

$Arr_AllQueries = @('Exec [mystoredproc] @param1=1, @param2=2',
                    'Exec [mystoredproc] @param1=11, @param2=22',
                    'Exec [mystoredproc] @param1=111, @param2=222')

# Creating the batches
$counter = [pscustomobject] @{ Value = 0 };
$Batches_AllQueries = $Arr_AllQueries | Group-Object -Property {
    [math]::Floor($counter.Value++ / $NumberOfParallelThreads)
};
forEach ($item in $Batches_AllQueries) {
    $tmpBatch = $item.Group;
    $tmpBatch | % {
        $ScriptBlock = {
            # accept the loop variable across the job-context barrier
            param($query)
            # Execute a command
            Try
            {
                Write-Host "[processing '$query']"
                $objConnection = New-Object System.Data.SqlClient.SqlConnection;
                $objConnection.ConnectionString = 'Data Source=...';
                $ObjCmd = New-Object System.Data.SqlClient.SqlCommand;
                $ObjCmd.CommandText = $query;
                $ObjCmd.Connection = $objConnection;
                $ObjCmd.CommandTimeout = 0;
                $objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter;
                $objAdapter.SelectCommand = $ObjCmd;
                $objDataTable = New-Object System.Data.DataTable;
                $objAdapter.Fill($objDataTable) | Out-Null;
                $objConnection.Close();
                $objConnection = $null;
            }
            Catch
            {
                $ErrorMessage = $_.Exception.Message
                $FailedItem = $_.Exception.ItemName
                Write-Host "[Error processing: $($query)]" -BackgroundColor Red;
                Write-Host $ErrorMessage
            }
        }
        # pass the loop variable across the job-context barrier
        Start-Job $ScriptBlock -ArgumentList $_ | Out-Null
    }
    # Wait for all to complete
    While (Get-Job -State "Running") { Start-Sleep 2 }
    # Display output from all jobs
    Get-Job | Receive-Job | Out-Null
    # Cleanup
    Remove-Job *
}
UPDATE:
Resources: The DB server is on a remote machine with:
24GB RAM,
8 cores,
500GB Storage,
SQL Server 2016
We want to use the maximum CPU power.
Framework limitation: The only restriction is that SQL Server itself must not be the one driving the execution; the requests should come from an outside source like PowerShell, C#, Python, etc.

A RunspacePool is the way to go here; try this:
$AllQueries = @( ... )
$MaxThreads = 5

# Each thread keeps its own connection but shares the query queue
$ScriptBlock = {
    Param($WorkQueue)

    $objConnection = New-Object System.Data.SqlClient.SqlConnection
    $objConnection.ConnectionString = 'Data Source=...'
    $objConnection.Open()   # open once so every dequeued query reuses this connection

    $objCmd = New-Object System.Data.SqlClient.SqlCommand
    $objCmd.Connection = $objConnection
    $objCmd.CommandTimeout = 0

    $query = ""
    while ($WorkQueue.TryDequeue([ref]$query)) {
        $objCmd.CommandText = $query
        $objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter $objCmd
        $objDataTable = New-Object System.Data.DataTable
        $objAdapter.Fill($objDataTable) | Out-Null
    }

    $objConnection.Close()
}

# create a pool
$pool = [RunspaceFactory]::CreateRunspacePool(1, $MaxThreads)
$pool.ApartmentState = 'STA'
$pool.Open()

# convert the query array into a concurrent queue
$workQueue = New-Object System.Collections.Concurrent.ConcurrentQueue[object]
$AllQueries | % { $workQueue.Enqueue($_) }

$threads = @()

# Create each powershell thread and add them to the pool
1..$MaxThreads | % {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    $ps.AddScript($ScriptBlock) | Out-Null
    $ps.AddParameter('WorkQueue', $workQueue) | Out-Null
    $threads += [pscustomobject]@{
        Ps     = $ps
        Handle = $null
    }
}

# Start all the threads
$threads | % { $_.Handle = $_.Ps.BeginInvoke() }

# Wait for all the threads to complete - errors will still set the IsCompleted flag
while ($threads | ? { !$_.Handle.IsCompleted }) {
    Start-Sleep -Seconds 1
}

# Get any results and display any errors
$threads | % {
    $_.Ps.EndInvoke($_.Handle) | Write-Output
    if ($_.Ps.HadErrors) {
        $_.Ps.Streams.Error.ReadAll() | Write-Error
    }
}
Unlike PowerShell jobs, runspaces in a RunspacePool can share resources, so there is one concurrent queue of all the queries, and each thread keeps its own connection to the database.
As others have said though - unless you're stress-testing your database, you're probably better off reorganising the queries into bulk inserts.

You need to reorganize your script so that each worker thread keeps one database connection open and uses it for all the queries that thread performs. Right now you are opening a new database connection for each query, which adds a large amount of overhead; eliminating that overhead should speed things up to or beyond your target.
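For illustration, here's a minimal sketch of that pattern in PowerShell (the connection string and the $queries list are placeholders, not from the original post):
# Minimal sketch: one connection per worker, reused for every query it runs.
# 'Data Source=...' and $queries are placeholders.
$connection = New-Object System.Data.SqlClient.SqlConnection 'Data Source=...'
$connection.Open()                        # open once, up front
$command = $connection.CreateCommand()
foreach ($query in $queries) {
    $command.CommandText = $query         # reuse the same command and connection
    $command.ExecuteNonQuery() | Out-Null
}
$connection.Close()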

Try using SqlCmd.
You can run multiple processes using Process.Start() and use sqlcmd to run the queries in parallel processes.
Of course, if you're obligated to do it in threads, this answer will no longer be the solution.
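As a rough sketch of that approach driven from PowerShell (the server name, database name, throttle limit and $queries list are assumptions, not part of this answer):
# Rough sketch: one sqlcmd process per query, throttled to $maxProcs at a time.
# 'myserver' and 'mydb' are placeholders.
$maxProcs = 8
$procs = @()
foreach ($query in $queries) {
    while (($procs | Where-Object { -not $_.HasExited }).Count -ge $maxProcs) {
        Start-Sleep -Milliseconds 200   # wait for a free slot
    }
    $procs += Start-Process -FilePath 'sqlcmd' `
        -ArgumentList '-S', 'myserver', '-d', 'mydb', '-E', '-Q', $query `
        -NoNewWindow -PassThru
}
$procs | Wait-Process                   # wait for the remaining processes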

Group your queries based on the table and the operations on that table. Using this, you can identify how many async SQL queries you can run against your different tables.
Make sure you check the size of each table you are going to run against, because if a table contains millions of rows and you're doing a join with some other table, that will increase the time; and if it is a CUD operation, it might lock your table as well.
Also, choose the number of threads based on your CPU cores and not based on assumptions. A CPU core will run one process at a time, so creating around number-of-cores * 2 threads is efficient (see the sketch below).
So first study your dataset, then do the above 2 items so that you can easily identify which queries can run in parallel efficiently.
Hope this gives some ideas. You could also use a Python script for this, so that you can easily trigger more than one process and also monitor their activities.
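For instance, a small sketch of sizing the worker count from the core count and grouping queries by target table (the regex is a naive placeholder for real query parsing, and $queries is assumed):
# Sketch: derive a thread count from the CPU and bucket queries by table.
$threadCount = [Environment]::ProcessorCount * 2
$byTable = $queries | Group-Object {
    # naive extraction of the first table name mentioned in the query
    if ($_ -match '(?i)(?:FROM|INTO|UPDATE|JOIN)\s+([\w\.\[\]]+)') { $Matches[1] } else { 'unknown' }
}
$byTable | ForEach-Object { '{0}: {1} queries' -f $_.Name, $_.Count }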

Sadly I don't have the time right this instant to answer this fully, but this should help:
First, you aren't going to use the entire CPU inserting that many records, almost promised. But!
Since it appears you are using SQL string commands:
Split the inserts into groups of, say, ~100 - ~1000 and manually build bulk inserts.
Something like this as a POC:
$query = "INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES "
for ($alot = 0; $alot -le 10; $alot++){
for ($i = 65; $i -le 85; $i++) {
$query += "('" + [char]$i + "', '" + [char]$i + "')";
if ($i -ne 85 -or $alot -ne 10) {$query += ",";}
}
}
Once a batch is built, pass it to SQL for the insert, using effectively your existing code.
The bulk insert would look something like:
INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES ('A', 'A'),('B', 'B'),('C', 'C'),('D', 'D'),('E', 'E'),('F', 'F'),('G', 'G'),('H', 'H'),('I', 'I'),('J', 'J'),('K', 'K'),('L', 'L'),('M', 'M'),('N', 'N'),('O', 'O'),('P', 'P'),('Q', 'Q'),('R', 'R'),('S', 'S')
This alone should speed up your inserts by a ton!
Don't use 50 threads, as previously mentioned, unless you have 25+ logical cores. You will spend most of your SQL insert time waiting on the network and hard drives, NOT the CPU. By having that many threads queued up, you will have most of your CPU time reserved for waiting on the slower parts of the stack.
These two things alone, I'd imagine, can get your inserts down to a matter of minutes (I once did 80k+ inserts in about 90 seconds using basically this approach).
The last part could be refactoring so that each core gets its own SQL connection, and then you leave it open until you are ready to dispose of all the threads.
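As a rough sketch of the chunking idea (the $rows collection with Name/PetName properties and the chunk size are hypothetical, not from this answer):
# Sketch: split hypothetical row objects into chunks of ~500 and build one
# multi-row INSERT per chunk, then execute each on an already-open connection.
$chunkSize = 500
for ($offset = 0; $offset -lt $rows.Count; $offset += $chunkSize) {
    $last = [Math]::Min($offset + $chunkSize, $rows.Count) - 1
    $chunk = $rows[$offset..$last]
    $values = ($chunk | ForEach-Object { "('$($_.Name)', '$($_.PetName)')" }) -join ','
    $batchQuery = "INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES $values"
    # ...pass $batchQuery to the existing execution code...
}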

I don't know much about PowerShell, but I do execute SQL in C# all the time at work.
C#'s async/await keywords make it extremely easy to do what you are talking about.
C# will also make a thread pool for you with the optimal number of threads for your machine.
async Task<DataTable> ExecuteQueryAsync(query)
{
return await Task.Run(() => ExecuteQuerySync(query));
}
async Task ExecuteAllQueriesAsync()
{
IList<Task<DataTable>> queryTasks = new List<Task<DataTable>>();
foreach query
{
queryTasks.Add(ExecuteQueryAsync(query));
}
foreach task in queryTasks
{
await task;
}
}
The code above adds all the queries to the thread pool's work queue, then waits on them all before completing. The result is that the maximum level of parallelism will be reached for your SQL.
Hope this helps!

Related

Is Sql.execute() synchronous or asynchronous with Groovy?

I stumbled onto a bit of groovy code deleting rows from a database by batches of a given size:
for (int index = 0; index <= loopCnt; index++) {
    sql.execute("DELETE TOP(" + maxEntriesToDeleteAtOnce + ") FROM LoaderQueue WHERE Status = ? AND LastUpdated < ?", stateToDelete, date)
}
I have zero experience with Groovy, and I know that the delete statement can take a few seconds to execute fully, even if the batch size is relatively small. I'm wondering whether the execution waits for each statement to complete before looping, or whether a few statements might be sent in parallel. In short, I'm wondering whether the sql.execute() command is synchronous.
I can't really try anything yet as there is no DEV environment for the application involved.

How can I get the Last Processed timestamp for an SSAS tabular cube?

In SSMS I have connected to a SSAS tabular cube. When I view the properties screen I see the Last Processed timestamp of 11/24/2015 2:59:20 PM.
If I use SELECT LAST_DATA_UPDATE FROM $system.MDSchema_Cubes I see a timestamp of 11/25/2015 12:13:28 PM (if I adjust for the timezone).
If I open up the partitions screen for one of the tables in my cube, I see that the most recent Last Processed timestamp is 11/25/2015 12:13:28 PM, which matches the value from the DMV.
I want the Last Processed timestamp for my BISM, the one from the Database Properties screen, not the one from a partition that happened to be processed later.
Is there a way to get this programmatically?
You can use the Analysis Services Stored Procedure assembly, which you can download from here.
Once you get the assembly file that corresponds to your Analysis Server version, connect to your instance via SSMS.
Look for your database (Database Cube).
Go to the Assemblies folder.
Right click and choose New Assembly...
Browse and select the assembly.
Set the permissions as described in the assembly documentation.
Once you have imported the assembly, use this MDX query to get the last processed timestamp:
with member [Measures].[LastProcessed] as ASSP.GetCubeLastProcessedDate()
select [Measures].[LastProcessed] on 0
from [Armetales DWH]
Let me know if this can help you.
After looking at the code in the Analysis Services Stored Procedure assembly, I was able to put together a PowerShell script that gets the date I was looking for. Here is the code:
# we want to always stop the script if any error occurs
$ErrorActionPreference = "Stop"
$error.Clear()

[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.AnalysisServices") | Out-Null

$databases = @('BISM1', 'BISM2')
$servers = @('Server1\BISM', 'Server2\BISM')

function Get-BISMLastProcessed
{
    param(
        [string] $connStr
    )
    Begin {
        $server = New-Object Microsoft.AnalysisServices.Server
        $server.Connect($connStr)
    }
    Process {
        Try {
            $database = $server.Databases.GetByName($_)
            Write-Host "  Database [$($database.Name)] was last processed $($database.LastProcessed)"
        }
        Catch [System.Exception] {
            Write-Host $Error[0].Exception
        }
        Finally {
            if ($database -ne $null) {
                $database.Dispose()
            }
        }
    }
    End {
        $server.Dispose()
    }
}

foreach ($server in $servers) {
    $connectStr = "Integrated Security=SSPI;Persist Security Info=False;Initial Catalog=BISM1;Data Source=$server"
    Write-Host "Server [$server]"
    $databases | Get-BISMLastProcessed $connectStr
    Write-Host "----------------"
}
The results are:
Server [Server1\BISM]
Database [BISM1] was last processed 11/30/2015 12:25:48
Database [BISM2] was last processed 12/01/2015 15:53:56
----------------
Server [Server2\BISM]
Database [BISM1] was last processed 11/30/2015 12:19:32
Database [BISM2] was last processed 11/02/2015 23:46:34
----------------

SQL Server transactions in Powershell

I am new to PowerShell scripting and SQL Server. I am trying to write a test case to verify the integrity of a SQL Server database (running inside a VM) w.r.t. backups.
The goal is to check that when a backup is taken in the middle of a transaction, the database is still consistent after the restore.
My test case takes periodic backups of the SQL Server VM while another PowerShell script performs database transactions in parallel (transferring money from one account to another).
I frequently find that the database is inconsistent: the sum of the money from all accounts is less than the initial deposit.
I am wondering if the SQL Server transaction logic is buggy. So, can anybody see what is wrong with the PowerShell and SQL Server code below? Does it get the transaction semantics right?
Function TransferMoney {
    $conn = $args[0]
    $from = $args[1]
    $to = $args[2]
    $amount = $args[3]

    $conn.BeginTransaction()
    # Keep this transaction intentionally dumb, so that it takes longer to
    # execute and Uvm has more chance of getting replicated in the middle of the
    # transaction.

    # Read the current balances.
    $reader = $conn.ExecuteReader("SELECT balance FROM $tableName WHERE id=$from")
    $reader.Read()
    $from_balance = $reader.GetValue(0)
    $reader.close()

    $reader = $conn.ExecuteReader("SELECT balance FROM $tableName WHERE id=$to")
    $reader.Read()
    $to_balance = $reader.GetValue(0)
    $reader.close()

    $from_balance = $from_balance - $amount
    $to_balance = $to_balance + $amount

    $conn.ExecuteNonQuery("UPDATE $tableName SET balance=$from_balance WHERE id=$from")
    $conn.ExecuteNonQuery("UPDATE $tableName SET balance=$to_balance WHERE id=$to")
    $conn.CommitTransaction()

    Write-Output "$amount dollars are transferred from account $from to $to. Current balances are $from_balance and $to_balance dollars respectively."
}

Function WorkUnit {
    $from = Get-Random -minimum 0 -maximum $numAccounts
    $to = ($from + 1) % $numAccounts

    $conn = CreateConnection
    $conn.ExecuteNonQuery("SET XACT_ABORT ON")
    $conn.ExecuteNonQuery("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")

    # Transfer money from one account to another. Transactions may fail if
    # multiple jobs pick conflicting account numbers, in which case ignore that
    # transfer. Since we use transactions, such failures wouldn't cause any
    # loss of money, so test should still succeed.
    TransferMoney $conn $from $to $from

    # Number of dollars transferred from an account is kept unique (the account
    # number itself) so that, in the event of data inconsistency, we can deduce
    # which transfer operation has caused the data inconsistency and it can be
    # helpful in debugging.
}

Weird timeout issues with Dapper.net

I started to use Dapper.net a while ago, for performance reasons and because I really like the named parameters feature compared to just running "ExecuteQuery" in LINQ to SQL.
It works great for most queries, but I get some really weird timeouts from time to time. The strangest thing is that this timeout only happens when the SQL is executed via Dapper. If I take the executed query copied from the profiler and just run it in Management Studio, it's fast and works perfectly. And it's not just a temporary issue: the query consistently times out via Dapper and consistently works fine in Management Studio.
exec sp_executesql N'SELECT Item.Name,dbo.PlatformTextAndUrlName(Item.ItemId) As PlatformString,dbo.MetaString(Item.ItemId) As MetaTagString, Item.StartPageRank,Item.ItemRecentViewCount,
NAME_SRCH.RANK as NameRank,
DESC_SRCH.RANK As DescRank,
ALIAS_SRCH.RANK as AliasRank,
Item.itemrecentviewcount,
(COALESCE(ALIAS_SRCH.RANK, 0)) + (COALESCE(NAME_SRCH.RANK, 0)) + (COALESCE(DESC_SRCH.RANK, 0) / 20) + Item.itemrecentviewcount / 4 + ((CASE WHEN altrank > 60 THEN 60 ELSE altrank END) * 4) As SuperRank
FROM dbo.Item
INNER JOIN dbo.License on Item.LicenseId = License.LicenseId
LEFT JOIN dbo.Icon on Item.ItemId = Icon.ItemId
LEFT OUTER JOIN FREETEXTTABLE(dbo.Item, name, @SearchString) NAME_SRCH ON
Item.ItemId = NAME_SRCH.[KEY]
LEFT OUTER JOIN FREETEXTTABLE(dbo.Item, namealiases, @SearchString) ALIAS_SRCH ON
Item.ItemId = ALIAS_SRCH.[KEY]
INNER JOIN FREETEXTTABLE(dbo.Item, *, @SearchString) DESC_SRCH ON
Item.ItemId = DESC_SRCH.[KEY]
ORDER BY SuperRank DESC OFFSET @Skip ROWS FETCH NEXT @Count ROWS ONLY',N'@Count int,@SearchString nvarchar(4000),@Skip int',@Count=12,@SearchString=N'box,com',@Skip=0
That is the query that I copy-pasted from SQL Profiler. I execute it like this in my code:
using (var connection = new SqlConnection(ConfigurationManager.ConnectionStrings["Conn"].ToString()))
{
    connection.Open();
    var items = connection.Query<MainItemForList>(query, new { SearchString = searchString, PlatformId = platformId, _LicenseFilter = licenseFilter, Skip = skip, Count = count }, buffered: false);
    return items.ToList();
}
I have no idea where to start here. I suppose there must be something going on with Dapper, since it works fine when I just execute the query directly.
As you can see in this screenshot, this is the same query executed via code first and then via Management Studio.
I can also add that this only happens (I think) when I have two or more words, or when I have a "stop" char in the search string. So it may have something to do with the full text search, but I can't figure out how to debug it since it works perfectly from Management Studio.
And to make matters even worse, it works fine on my localhost with an almost identical database, both from code and from Management Studio.
Dapper is nothing more than a utility wrapper over ado.net; it does not change how ado.net operates. It sounds to me that the problem here is "works in ssms, fails in ado.net". This is not unique: it is pretty common to find this occasionally. Likely candidates:
"set" option: these have different defaults in ado.net - and can impact performance especially if you have things like calculated+persisted+indexed columns - if the "set" options aren't compatible it can decide it can't use the stored value, hence not the index - and instead table-scan and recompute. There are other similar scenarios.
system load / transaction isolation-level / blocking; running something in ssms does not reproduce the entire system load at that moment in time
cached query plans: sometimes a duff plan gets cached and used; running from ssms will usually force a new plan - which will naturally be tuned for the parameters you are using in your test. Update all your index stats etc, and consider adding the "optimise for" query hint
In ADO.NET the default value for CommandTimeout is 30 seconds; in Management Studio it is infinite. Adjust the command timeout for the call to Query<>, see below.
var param = new { SearchString = searchString, PlatformId = platformId, _LicenseFilter = licenseFilter, Skip = skip, Count = count };
var queryTimeoutInSeconds = 120;

using (var connection = new SqlConnection(ConfigurationManager.ConnectionStrings["Conn"].ToString()))
{
    connection.Open();
    var items = connection.Query<MainItemForList>(query, param, commandTimeout: queryTimeoutInSeconds, buffered: false);
    return items.ToList();
}
See also
SqlCommand.CommandTimeout Property on MSDN
For Dapper, the default timeout is 30 seconds, but we can increase it in this way. Here we are increasing the timeout to 240 seconds (4 minutes).
public DataTable GetReport(bool isDepot, string fetchById)
{
    int? queryTimeoutInSeconds = 240;
    using (IDbConnection _connection = DapperConnection)
    {
        var parameters = new DynamicParameters();
        parameters.Add("@IsDepot", isDepot);
        parameters.Add("@FetchById", fetchById);
        var res = this.ExecuteSP<dynamic>(SPNames.SSP_GetSEPReport, parameters, queryTimeoutInSeconds);
        return ToDataTable(res);
    }
}
In the repository layer, we can call our custom ExecuteSP method for stored procedures with the additional parameter "queryTimeoutInSeconds".
And below is the "ExecuteSP" method for Dapper:
public virtual IEnumerable<TEntity> ExecuteSP<TEntity>(string spName, object parameters = null, int? parameterForTimeout = null)
{
    using (IDbConnection _connection = DapperConnection)
    {
        _connection.Open();
        return _connection.Query<TEntity>(spName, parameters, commandTimeout: parameterForTimeout, commandType: CommandType.StoredProcedure);
    }
}
It could be a matter of setting the command timeout in Dapper. Here's an example of how to adjust the command timeout in Dapper:
Setting Command Timeout in Dapper

Magento - fetching products and looping through is getting slower and slower

I'm fetching around 6k articles from the Magento database. Traversing through them is very fast in the beginning (0 seconds, just some ms) and gets slower and slower. The loop takes about 8 hours to run, and in the end each iteration of the foreach takes about 16-20 seconds! It seems like MySQL is getting slower and slower towards the end, but I cannot explain why.
$product = Mage::getModel('catalog/product');
$data = $product->getCollection()->addAttributeToSelect('*')->addAttributeToFilter('type_id', 'simple');
$num_products = $product->getCollection()->count();
echo 'exporting '.$num_products."\n";
print "starting export\n";
$start_time = time();

foreach ($data as $tProduct) {
    // doing some stuff, no sql !
}
Does anyone know why it is so slow ? Would it be faster, just to fetch the ids and selecting each product one by one ?
The script running this code has a constant memory usage of:
VIRT RES SHR S CPU% MEM%
680M 504M 8832 R 90.0 6.3
Regards, Alex
Oh well, shot-in-the-dark time. If you are running Magento 1.4.x.x prior to 1.4.2.0, you have a memory leak that displays exactly this symptom as it eats up more and more memory, eventually leading to memory exhaustion. Profile exports that took 3-8 minutes under 1.3.x.x will now take 2-5 hours, if they don't throw an error before completion. Another symptom is exports that fail without finishing and without giving any indication of why: the window freezes or gives some sort of funky completion message with no output.
The Array Of Death(tm) has been noted, and here's the official repair in the new version. Maybe Data Will Flow again!
Excerpt from 1.4.2.0rc1 /lib/Varien/Db/Select.php that has been patched for memory leak
public function __construct(Zend_Db_Adapter_Abstract $adapter)
{
    parent::__construct($adapter);
    if (!in_array(self::STRAIGHT_JOIN_ON, self::$_joinTypes)) {
        self::$_joinTypes[] = self::STRAIGHT_JOIN_ON;
        self::$_partsInit = array(self::STRAIGHT_JOIN => false) + self::$_partsInit;
    }
}
Excerpt from 1.4.1.1 /lib/Varien/Db/Select.php with memory leak
public function __construct(Zend_Db_Adapter_Abstract $adapter)
{
    parent::__construct($adapter);
    self::$_joinTypes[] = self::STRAIGHT_JOIN_ON;
    self::$_partsInit = array(self::STRAIGHT_JOIN => false) + self::$_partsInit;
}