ssis simultaneous execution of query - sql

To say it up front: I'm new to SSIS.
I run a query that gives me the distinct values of columnA from TableA, which need to be processed in order (1, then 2, then 3, and so on, although the start and end values constantly change).
Each of these columnA values has a group of values in columnB. Those values have to be run through a stored procedure, and they can all run simultaneously. Currently they run one after another.
Here is a visual of what I need to do (pseudocode):
foreach
{
    foreach
    {
        processX
    }
}
What I want:
foreach
{
    processA, processB, processC simultaneously (there are no collisions to worry about)
}
I am using a control flow in SSIS. It has the Foreach Loop, which is good, but I don't know what to use to make the second part run simultaneously.

When I want parallel execution in the Control Flow, I usually put in several Foreach loops and bring back a distinct recordset for each one of them.

There currently is no way to run a Foreach Loop in "parallel mode".
The best that I can think of is to rework your architecture into a flexible 'worker' threading model, where you can parallelize independently.
What that would require is two SSIS packages: one to supply the work units, and one to work on them.
The "controller" package would perform the foreach loop on TableA, collecting whatever values it needs. It would then insert those values into a "work to do" table.
The "worker" package would consist of a For Loop, inside of which you'd have an Execute SQL Task that queries the "work to do" table for the first row that isn't being worked on and, if it finds such a row, marks it as being worked on (all inside a transaction to ensure no collisions). You'd then have your "work unit" to do work with, or not. A precedence constraint to your next task would only let it execute if it actually had some instructions. Your For Loop's EvalExpression could be crafted to stop when you don't see any new work units (although you might want a delay in there to make sure your workers aren't faster than the controller).
To run all this, you'd start the controller (in an Agent Job), then start multiple workers (same package, different jobs) - as many as you wanted.
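The claim step of the worker is the part that has to be collision-free, so here is a minimal sketch of it. It uses Python with an in-memory SQLite database purely as a stand-in for SQL Server and SSIS; the table name work_to_do, the columns column_b and claimed_by, and all function names are made up for illustration.

```python
import sqlite3

def open_queue():
    # isolation_level=None puts the connection in autocommit mode,
    # so the BEGIN/COMMIT statements below are fully explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE work_to_do ("
        " id INTEGER PRIMARY KEY, column_b TEXT, claimed_by TEXT)"
    )
    return conn

def add_work(conn, values):
    """Controller side: queue one work unit per columnB value."""
    conn.executemany(
        "INSERT INTO work_to_do (column_b) VALUES (?)", [(v,) for v in values]
    )

def claim_next(conn, worker):
    """Worker side: find the first unclaimed row and mark it, in one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, column_b FROM work_to_do"
            " WHERE claimed_by IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE work_to_do SET claimed_by = ? WHERE id = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
        return row  # None means "no work unit", so the worker loop can stop
    except Exception:
        conn.execute("ROLLBACK")
        raise

def run_worker(conn, worker):
    """Stand-in for the For Loop: keep claiming until nothing is left."""
    done = []
    while (unit := claim_next(conn, worker)) is not None:
        done.append(unit[1])  # here the real package would call the stored procedure
    return done

conn = open_queue()
add_work(conn, ["b1", "b2", "b3"])
print(run_worker(conn, "worker-1"))
```

In the real setup each worker would be its own package execution with its own connection; the point of the sketch is only that the "find a free row" and "mark it as taken" steps share a single transaction.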

Databricks - automatic parallelism and Spark SQL

I have a general question about Databricks cells and auto-parallelism with Spark SQL. I have a summary table with a number of fields, most of which have complex logic behind them.
If I put blocks (%SQL) of individual field logic in individual cells, will the scheduler automatically attempt to allocate the cells to different nodes on the cluster to improve performance (depending on how many nodes my cluster has)? Alternatively, are there PySpark functions I can use to organise the parallel running myself? I can't find much about this elsewhere...
I am using LTS 10.4 (Spark 3.2.1 Scala 2.12)
Many thanks
Richard
If you write Python ("pyspark") code over multiple cells, there is something called "lazy execution", meaning the actual work only happens at the last possible moment (for example, when data is written or displayed). So before you run, for example, display(df), no actual work is done on the cluster. So, technically, the code spread over multiple cells is still parallelized efficiently.
However, in Databricks Spark SQL a single cell is executed to completion before the next one is started. If you want to run those concurrently, you can take a look at running multiple notebooks at the same time (or multiple parameterized instances of the same notebook) with dbutils.notebook.run(). The cluster will then automatically split the resources evenly between the queries running at the same time.
You can try running the SQL statements using spark.sql() and assigning the outputs to different dataframes. In the last step, you could execute an operation (for example, a join) that brings everything into one dataframe. The lazy evaluation should then evaluate all dataframes (i.e. your SQL queries) in parallel.

How to discard / ignore one of a stored procedure's many return values

I have a stored procedure that returns 2 values.
In another procedure, I call this (edit: NOT selectable) procedure but only need one of the two returned values.
Is there a way to discard the other value? I'm wondering what is a good practice, and hoping for a small performance gain.
Here is how I call the procedure without error:
CREATE or ALTER procedure my_proc1
as
    declare variable v_out1 integer default null;
    declare variable v_out2 varchar(10) default null;
begin
    execute procedure my_proc2('my_param')
        returning_values :v_out1, :v_out2;
end;
That is the only way I found to call this procedure. Whenever I use only one variable (v_out1), I get a -607 error 'unsuccessful metadata update request depth exceeded. (Recursive definition?)'.
So my actual question is: can I avoid creating a v_out2 variable for nothing, as I will never use it (that value is only used in other procedures which also call my_proc2)?
Edit: the stored procedure my_proc2 is actually not selectable. But I made it selectable after all.
Because your stored procedure is selectable, you should call it with a SELECT statement, i.e.
select out1, out2 from my_proc2('my_param')
and in that case you can indeed omit some of the return values. However, I wouldn't expect a noticeable performance gain, as the logic inside the SP which calculates the omitted field is still executed.
If your procedure is not selectable, then creating a wrapper SP is the only way, but again, it wouldn't give any performance gain, as the code which does the hard work inside the original SP is still executed.
This is posted as an answer so that text formatting can be used while demonstrating "race conditions" in multithreaded programming (which SQL is) when [ab]using out-of-transaction objects (SQL sequences, a.k.a. Firebird generators).
So, the "use case".
Initial condition: the table is empty, generator = 0.
You start two concurrent transactions, A and B. For ease of imagining, you may think of those transactions as started from concurrent connections made by two people working with your program on two networked computers. It does not actually matter much: if you opened both transactions from one and the same connection, the scenario would not change a bit.
Tx.A issues UPDATE-OR-INSERT, which inserts a new row into the table. Doing so, it upticks the generator. The transaction is not committed yet. Database condition: the table has one invisible (non-committed) row with auto_id=1, the generator = 1.
Tx.B issues UPDATE-OR-INSERT too, which inserts yet another row into the table. Doing so, it also upticks the generator. The transaction maybe commits now, or maybe later; that is irrelevant. Database condition: the table has two rows (one or both of them invisible, i.e. non-committed) with auto_id=1 and auto_id=2, the generator = 2.
Tx.A meets some error, throws an exception, DOWNTICKS the generator and rolls back. Database condition: the table has one row with auto_id=2, the generator = 1.
If Tx.B was not committed before, it is committed now. (This "if" is just to demonstrate that it does not matter whether other transactions are committed earlier or later; it only matters that Tx.A downticks the generator after some other transaction upticked it.)
So, the final database condition: the table has one committed (visible) row with auto_id=2, and the generator = 1.
Any next attempt to add one more row would uptick the generator to 1+1=2, fail to insert the new row with a PK violation, and then downtick the generator back to 1, recreating the faulty condition outlined above.
Your database is stuck, and without direct intervention by a DB administrator no further data can be added.
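The scenario above can be replayed deterministically. The following is only a simulation sketch in Python, not Firebird code: the Generator class stands in for a generator that lives outside any transaction, the committed set stands in for the committed rows of the table, and try_insert is a hypothetical model of "uptick, insert, downtick on failure".

```python
class Generator:
    """Stand-in for a Firebird generator: shared state outside any transaction."""

    def __init__(self, value=0):
        self.value = value

    def gen_id(self, step):
        self.value += step
        return self.value

gen = Generator()
committed = set()          # auto_id values of committed rows

id_a = gen.gen_id(+1)      # Tx.A inserts a row with auto_id=1, not committed
id_b = gen.gen_id(+1)      # Tx.B inserts a row with auto_id=2
gen.gen_id(-1)             # Tx.A fails, downticks the generator and rolls back
committed.add(id_b)        # Tx.B commits

def try_insert():
    """Any later transaction following the same 'downtick on error' rule."""
    new_id = gen.gen_id(+1)
    if new_id in committed:   # PK violation
        gen.gen_id(-1)        # the hack: roll the generator back
        return False
    committed.add(new_id)
    return True

attempts = [try_insert() for _ in range(3)]
print(attempts, gen.value, sorted(committed))
```

Every attempt fails in the same way, because each failure restores exactly the state that caused it.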
The very idea of rolling back the generator defeats every intention generators were created for, and every expectation about generator behavior that the database, the connection libraries and other programmers have.
You just placed a trap on the highway. It is only a matter of time until someone gets caught in it.
Even if you keep guarding this hack with other hacks for now, wasting a lot of time and attention to do that scrupulously and pervasively, one unlucky day in the future another programmer (or even you, having forgotten these gory details) would start using the generator in the standard, intended way and would run into the trap.
Generators were not made to be backtracked during normal work.
"existence of primary key is checked in the procedure before doing anything"
Yep, that is the first reaction when a multithreading programmer meets his first race condition: let's just add more prior checks.
The first few checks can indeed decrease the probability of a clash, but they can never eliminate it completely. And the more use your program sees, the more transactions get opened by more and more concurrent and active users. It is only a matter of time until this somewhat lowered probability still turns out to be too high.
Think about it: SQL is all about transactions, yet they had to invent and explicitly introduce an out-of-transaction device, which is what a Generator/Sequence is. If there were a reliable solution without them, it would simply have been used instead of creating such a non-SQLish tool that breaks transaction boundaries.
When you say your SP "checks for PK violation" it is exactly the same as if you would drop the generator altogether and instead just issue "good old"
:new_id = ( select max(auto_id)+1 from MyTable );
By your description you actually do something like that, but in some indirect way. Something like
while exists( select * from MyTable where auto_id = gen_id(MyGen, +1))
    do ;
:new_id = gen_id(MyGen, 0);
You may feel that, because you mentioned generators, you somehow overcame the cross-transaction invisibility problem. But you did not, because the very check "was this PK already taken" is done against the in-transaction view of the table.
That changes nothing: your two transactions Tx.A and Tx.B would not see each other's records, because neither of them has committed yet. Now it only takes some unlucky Tx.C that fails and downticks the generator for them to collide on the same ID.
Or not: you do not even need Tx.C and downticking at all!
Here we bump into the multithreading idea of "atomic operations".
Let's look at it again.
while exists( select * from MyTable where auto_id = gen_id(MyGen, +1))
    do ;
:new_id = gen_id(MyGen, 0);
In a single-threaded application that code is okay: you just keep running the generator up until you reach a free slot, then you query the value without changing it. "What could possibly go wrong?" But in a multithreaded environment it is a rake waiting to be stepped on. Example:
Initial condition, table has 100 rows (auto_id goes from 1 to 100), the generator = 100.
Tx.A starts adding the row, upticks the generator in the while loop and exits the loop. It does not yet pass to the second line where local variable gets assigned. Not yet. The generator = 101, rows not added yet.
Tx.B starts adding the row, upticks the generator in the while loop and exits the loop. The generator = 102, rows not added yet.
Tx.A goes to the second line and reads gen_id(MyGen,0) into a variable for the new row. The value was 101 when Tx.A left the loop, but it is 102 now!
Tx.B goes to the second line and reads gen_id(MyGen,0) and gets 102 too.
Tx.A and Tx.B both try to insert new row with auto_id=102
RACE CONDITIONS - both Tx.A and Tx.B try to commit their work. One of them succeeds, another fails. Which one? It is not predictable. A lucky one commits, an unlucky one fails.
The failed transaction downticks the generator.
Final condition: the table has 101 rows; auto_id runs consecutively from 1 to 100 and then skips to 102. The generator = 101, which is less than MAX(auto_id).
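The interleaving in the steps above can be replayed by hand. Again this is a Python simulation sketch rather than Firebird code: loop_step and read_step are hypothetical names for the two lines of the procedure, and the rows set models the rows each transaction is able to see.

```python
class Generator:
    """Stand-in for a Firebird generator shared by all transactions."""

    def __init__(self, value):
        self.value = value

    def gen_id(self, step):
        self.value += step
        return self.value

def loop_step(gen, visible_ids):
    """First line: uptick until the generated id is not among the visible rows."""
    while gen.gen_id(+1) in visible_ids:
        pass

def read_step(gen):
    """Second line: read the current value without changing it."""
    return gen.gen_id(0)

gen = Generator(100)
rows = set(range(1, 101))   # auto_id 1..100 already committed

loop_step(gen, rows)        # Tx.A leaves the loop, generator = 101
loop_step(gen, rows)        # Tx.B leaves the loop, generator = 102
id_a = read_step(gen)       # Tx.A reads the generator only now
id_b = read_step(gen)       # Tx.B reads the very same value

rows.add(id_a)              # the lucky transaction commits its row
gen.gen_id(-1)              # the unlucky one fails on the PK and downticks

print(id_a, id_b, len(rows), max(rows), gen.value)
```

The two steps are each fine in isolation; the problem is only that something else can happen between them.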
Now you might want to add more hacks, I mean more prior checks, before actually inserting rows and committing. That will make mistakes even less probable, right? Wrong. The more checks you do, the slower the code gets. The slower the code gets, the greater the probability that, while one thread runs through all those checks, another thread interferes and alters the situation that was checked a moment ago.
The fundamental issue with multithreading is that any check is a SEPARATE action, and between those actions the situation MAY change. Your procedure may check whatever it wants BEFORE actually inserting the row. That would not guarantee much, because when you finally get to the row-inserting statement, all the checks you did are a matter of the PAST, and the situation has potentially been altered already. The guarantees your checks gave belong to that past, not to the moment at hand.
And even if you no longer look for a sure guarantee, with every new check you add you cannot even be sure whether you just decreased or increased the probability of failure. Because multithreading is a bitch: it flows chaotically, out of your control.
So, remember the KISS principle. Until proven otherwise - you most probably do not need SP2 at all, you only need one single UPDATE-OR-INSERT statement.
PS. There was a pretty fun game in my school days called Pascal Robots. I heard there are also C Robots, and probably implementations for many other languages. Pascal Robots came with a number of already coded robots, demonstrating different strategies and approaches. Some of them were thought out in very intricate detail. And there was one robot whose program was PRIMITIVE. It only had two loops: if you do not see an enemy, keep turning your radar around; if you do see an enemy, keep running towards it and shooting at it. That was all. What could this idiot do against sophisticated robots with creative attack and defense strategies, flanking maneuvers, an optimal distance maintained by back-and-forth movements, escape tricks and more? Those sophisticated robots employed very extensive checks and very well thought-through hacks to be triggered by those checks. So... that primitive idiot was the second or maybe third best robot in the shipped set. There were only one or two smarties who could outwit it. ALL the other robots were finished off by this lean-and-fast idiot before they could run through all their checks and hacks thrice. That is what multithreading does to programming. It was astonishing to watch those battles, which went so much against our single-threaded intuition.

SSIS 2012 - Conditional Split and Rejoin in Control Flow Stops

I have created an SSIS package to process some file imports, manipulation, etc but am having a problem with a conditional split I have created.
I have an Execute SQL Task which simply does a count of a table. One constraint has an expression for the case where the result is 0, and the other for the case where it's greater than 0. For the constraint where it is 0, I have three more Execute SQL Tasks for dropping and creating various tables. The other constraint jumps past these three tasks to the next Execute SQL Task; let's call it Bob for now. Once complete, the "equals 0" branch rejoins at Bob and continues with the remainder of the package.
When I run the package, the zero condition is met, the three Execute SQL Tasks are complete and then it stops, saying package execution complete. It does not appear to be rejoining the main stream.
I have tried putting the three tasks in a Sequence Container but made no difference. I have obviously done something strange or missed a configuration somewhere. If anyone could shed any light on this, it would be greatly appreciated.
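Unbelievable. I've sorted it. The last constraint on the "equals 0" branch needed to be set to the Logical OR option. With the default Logical AND, a task with several incoming constraints only runs when all of them are satisfied, so Bob never started.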

Is there a way for VBA UDF to "know" what other functions will be run?

Assume I have a UDF that will be used in a worksheet 100,000+ times. Is there a way, within the function, for it to know how many more times it is going to be called in the batch? Basically what I want to do is have every function create a to-do list of work to do. I want to do something like:
IF remaining functions to be executed after this one = 0 then ...
Is there a way to do this?
Background:
I want to make a UDF that will perform SQL queries with the user just giving parameters (date, hour, node, type). This is pretty easy to make if you're willing to actually execute the SQL query every time the function is run. I know it's easy because I did this, and it was ridiculously slow. My new idea is to have the function first check whether the data it is looking for exists in a global cache variable and, if it doesn't, add the request to a global "job-list" variable.
What I want it to do is when the last function is called to then go through the job list and perform the fewest number of SQL queries and fill the global cache variable. Once the cache variable is full it would do a table refresh to make all the other functions get called again since on the subsequent call they'll find the data they need in the cache.
Firstly:
VBA UDF performance is extremely sensitive to the way the UDF is coded:
see my series of posts about writing efficient VBA UDFs:
http://fastexcel.wordpress.com/2011/06/13/writing-efficient-vba-udfs-part-3-avoiding-the-vbe-refresh-bug/
http://fastexcel.wordpress.com/2011/05/25/writing-efficient-vba-udfs-part-1/
You should also consider using an Array UDF to return multiple results:
http://fastexcel.wordpress.com/2011/06/20/writing-efiicient-vba-udfs-part5-udf-array-formulas-go-faster/
Secondly:
The 12th post in this series outlines using the AfterCalculate event and a cache
http://fastexcel.wordpress.com/2012/12/05/writing-efficient-udfs-part-12-getting-used-range-fast-using-application-events-and-a-cache/
Basically, the approach you would need is for the UDF to check the cache and, if the data is not current or not available, add a request to the queue. Then use the after-calculation event to process the queue and, if necessary, trigger another recalc.
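The control flow of that approach, sketched in Python rather than VBA so it can be run outside Excel. The class BatchedLookup, its methods and fake_fetch are all hypothetical names: lookup plays the role of the UDF, and after_calculate plays the role of the AfterCalculate event handler.

```python
class BatchedLookup:
    """Cache plus request queue, filled in one batch after a calculation pass."""

    def __init__(self, fetch_batch):
        self.fetch_batch = fetch_batch  # callable: list of keys -> dict of results
        self.cache = {}
        self.pending = set()

    def lookup(self, key):
        """UDF body: answer from the cache, or queue the request and return nothing yet."""
        if key in self.cache:
            return self.cache[key]
        self.pending.add(key)
        return None

    def after_calculate(self):
        """Event handler: serve all queued requests at once.

        Returns True when the cache changed and another recalc is needed.
        """
        if not self.pending:
            return False
        self.cache.update(self.fetch_batch(sorted(self.pending)))
        self.pending.clear()
        return True

batches = []

def fake_fetch(keys):
    # Stand-in for a single SQL query covering every requested key.
    batches.append(keys)
    return {key: f"value for {key}" for key in keys}

sheet = BatchedLookup(fake_fetch)
cells = [("2013-01-01", 1), ("2013-01-01", 2), ("2013-01-01", 1)]

first_pass = [sheet.lookup(key) for key in cells]    # nothing cached yet
needs_recalc = sheet.after_calculate()               # one batch for two distinct keys
second_pass = [sheet.lookup(key) for key in cells]   # now served from the cache
```

The second call to after_calculate finds an empty queue and reports that no further recalc is needed, which is what stops the recalculation from looping.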
Performing 100,000 SQL queries from an Excel spreadsheet seems like a poor design. Creating a caching mechanism on top of these seems to compound the problem, making it more complicated than it probably needs to be. There are some circumstances where this might be appropriate, but I would consider other design approaches instead.
The most obvious is to take the data from the Excel spreadsheet and load it into a table in the database. Then use the database to do the processing on all the rows at once. The final step is to read the result back into Excel.
I find that the best way to get large numbers of rows from Excel into a database is to save the Excel file as CSV and bulk insert them.
This approach may not work for your problem. In general, though, set-based approaches running in the database are going to perform much better.
As for the caching mechanism, if you have to go down that route, I can imagine a function with the following pseudo-code:
Check if input values are in cache.
If so, read values from cache.
Else do complex processing.
Load values in cache.
This logic could go in the function. As @Bulat suggests, though, it is probably better to add an additional caching layer around the function.

Need for long and dynamic select query/view sqlite

I have a need to generate a long select query of potentially thousands of where conditions like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? ...) AND....
I initially started building a class to make this more bearable, but have since stopped to wonder if this will work well. This query is going to be hammering a table of potentially 10s of millions of rows joined with 2 more tables with thousands of rows.
A number of concerns are stemming from this:
1.) I wanted to use these statements to generate a temp view so I could easily carry over the existing code base. The point is that I want to filter my data down for analysis based on parameters selected in a GUI. How poorly will a view do in this scenario?
2.) Can sqlite even parse a query with thousands of binds?
3.) Isn't there a framework that can make generating this query easier other than with string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory and then just write a wrapper for my DB query object that calls next() until a row is encountered that satisfies all my conditions? My concern here is that the application generates graphs procedurally on scroll, so waiting to draw while calling query.next() x 100,000 might cause an annoying delay. Ideally I don't want to have to wait on the next row that satisfies everything for more than 30ms at a time.
edit:
New issue: it came to my attention that sqlite3 is limited to 999 bind values (host parameters) by a compile-time setting (SQLITE_MAX_VARIABLE_NUMBER).
So it seems as if the only way to accomplish what I had originally intended is to
1.) Generate the entire query via string concatenation (my biggest concern being that I don't know how slow parsing all the data inside sqlite3 will be)
or
2.) Use the blanket query method (select * from * where index > ? limit ?) and call next() until I hit valid data in my compiled code (including updating the index variable and re-querying repeatedly)
I did end up writing a wrapper around the QSqlQuery object that will walk a table using index > variable and limit to allow "walking" the table.
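A rough sketch of that walking approach, written with Python's sqlite3 module instead of QSqlQuery. The table samples, its columns id, a and b, and the function walk_filtered are invented for the example, and the joins from the question are left out. The filter values live in in-memory sets, so the query itself only ever needs two bind values.

```python
import sqlite3

def walk_filtered(conn, allowed_a, allowed_b, page_size=1000):
    """Walk the table in id order, page by page, yielding only matching rows."""
    last_id = 0  # assumes ids are positive integers
    while True:
        page = conn.execute(
            "SELECT id, a, b FROM samples WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        for row in page:
            # Set membership replaces the long chain of OR conditions.
            if row[1] in allowed_a and row[2] in allowed_b:
                yield row
        last_id = page[-1][0]  # continue after the last row seen

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO samples (a, b) VALUES (?, ?)",
    [(1, "x"), (2, "y"), (3, "x"), (1, "z"), (3, "y")],
)

matches = list(walk_filtered(conn, {1, 3}, {"x", "y"}, page_size=2))
print(matches)
```

Because the function is a generator, the caller can stop pulling rows as soon as enough are available to draw, which bounds the wait per page rather than per full scan.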
Consider dumping the joined results without filters (denormalized) into a flat file and index it with Fastbit, a bitmap index engine.