I have created an SSIS package to process some file imports, manipulation, etc., but I am having a problem with a conditional split I have created.
I have an Execute SQL Task which simply does a count of a table. One constraint has an expression for when the result is 0 and the other for when it is greater than 0. For the constraint where it is 0, I have three more Execute SQL Tasks for dropping and creating various tables. The other constraint jumps past these three tasks to the next Execute SQL Task, let's call it Bob for now. The equal-0 branch, once complete, rejoins at Bob and then continues with the remainder of the package.
When I run the package, the zero condition is met, the three Execute SQL Tasks are complete and then it stops, saying package execution complete. It does not appear to be rejoining the main stream.
I have tried putting the three tasks in a Sequence Container, but it made no difference. I have obviously done something strange or missed a configuration somewhere. If anyone could shed any light on this, it would be greatly appreciated.
Unbelievable. I've sorted it. The last constraint on the equal-0 branch needed to be set to the Logical OR option.
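To illustrate why that setting matters: by default, a task with several incoming precedence constraints runs only when all of them evaluate to true (Logical AND). The snippet below is a simplified Python model of that rule, not SSIS code; `should_run` and the variable names are hypothetical and exist only for this illustration.

```python
def should_run(incoming, mode="AND"):
    """Decide whether a task starts, given its incoming constraint results.

    incoming: one boolean per precedence constraint pointing at the task.
    mode: "AND" means every constraint must be true, "OR" means any one suffices.
    """
    if mode == "AND":
        return all(incoming)
    return any(incoming)


# The count was 0: the "= 0" branch finished, the "> 0" branch never fired.
zero_branch_done = True
nonzero_branch_fired = False
incoming_to_bob = [zero_branch_done, nonzero_branch_fired]

bob_runs_with_and = should_run(incoming_to_bob, "AND")
bob_runs_with_or = should_run(incoming_to_bob, "OR")
print(bob_runs_with_and, bob_runs_with_or)
```

With AND, Bob waits for a branch that can never complete, so the package simply ends; with OR, either branch is enough.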
I have a stored procedure that returns 2 values.
In another procedure, I call this (edit: NOT selectable) procedure but only need one of the two returned values.
Is there a way to discard the other value? I'm wondering what is a good practice, and hoping for a small performance gain.
Here is how I call the procedure without error:
CREATE or ALTER procedure my_proc1
as
declare variable v_out1 integer default null;
declare variable v_out2 varchar(10) default null;
begin
execute procedure my_proc2('my_param')
returning_values :v_out1, :v_out2;
end;
That is the only way I found to call this procedure without getting a -607 error 'unsuccessful metadata update request depth exceeded. (Recursive definition?)' whenever I use only one variable v_out1.
So my actual question is: can I avoid creating a v_out2 variable for nothing, as I will never use it (that value is only used in other procedures which also call my_proc2)?
Edit: the stored procedure my_proc2 is actually not selectable. But I made it selectable after all.
Because your stored procedure is selectable, you should call it with a SELECT statement, i.e.
select out1, out2 from my_proc2('my_param')
and in that case you can indeed omit some of the return values. However, I wouldn't expect a noticeable performance gain, as the logic inside the SP which calculates the omitted field is still executed.
If your procedure is not selectable, then creating a wrapper SP is the only way, but again, it wouldn't give any performance gain, as the code which does the hard work inside the original SP is still executed.
I am posting this as an answer so that I can use text formatting while demonstrating "race conditions" in multithreaded programming (which SQL is) when [ab]using out-of-transaction objects (SQL sequences, aka Firebird generators).
So, the "use case".
Initial condition: table is empty, generator=0.
You start two concurrent transactions, A and B. For ease of imagining, you may think of those transactions as started from concurrent connections made by two people working with your program on two networked computers. Actually it does not matter much: if you open those transactions from one and the same connection, the scenario would not change a bit. It is just for ease of imagining.
The Tx.A issues UPDATE-OR-INSERT which inserts new row into the table. Doing so it up-ticks the generator. The transaction is not committed yet. Database condition: the table has one invisible (non-committed) row with auto_id=1, the generator = 1.
The Tx.B issues UPDATE-OR-INSERT too which inserts yet another row into the table. Doing so it also up-ticks the generator. The transaction maybe commits now, or maybe later, irrelevant. Database condition: the table has two rows (one or both are invisible (non-committed)) with auto_id=1 and auto_id=2, the generator = 2.
The Tx.A meets some error, throws the exception, DOWNTICKS the generator and rolls back. Database condition: the table has one row with auto_id=2, the generator = 1.
If Tx.B was not committed before, it is committed now. (This "if" is just to demonstrate that it does not matter when other transactions are committed, earlier or later; it only matters that Tx.A downticks the generator after any other transaction upticked it.)
So, the final database condition: the table has one committed=visible row with auto_id=2 and the generator = 1.
Any next attempt to add one more row would try to uptick the generator to 1+1=2 and then fail to insert the new row with a PK violation. Then it would downtick the generator back to 1, recreating the faulty condition outlined above.
Your database is stuck, and without direct intervention by a DB administrator no further data can be added.
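The scenario above can be replayed step by step in a small Python model. This is only a sketch: the `Generator` class is a hypothetical stand-in for a Firebird generator (a counter that ignores transaction boundaries), and a plain set stands in for the committed rows of the table.

```python
class Generator:
    """Stand-in for a Firebird generator: a counter outside any transaction."""

    def __init__(self):
        self.value = 0

    def gen_id(self, step):
        self.value += step
        return self.value


gen = Generator()
committed = set()          # auto_id values of committed (visible) rows

a_id = gen.gen_id(+1)      # Tx.A inserts a row: generator = 1, not committed
b_id = gen.gen_id(+1)      # Tx.B inserts a row: generator = 2
gen.gen_id(-1)             # Tx.A fails, DOWNTICKS the generator, rolls back
committed.add(b_id)        # Tx.B commits: one visible row with auto_id = 2


def try_insert():
    """Any later insert that keeps using the 'roll back the generator' hack."""
    new_id = gen.gen_id(+1)
    if new_id in committed:    # primary key violation
        gen.gen_id(-1)         # downtick again, restoring the faulty state
        return None
    committed.add(new_id)
    return new_id


results = [try_insert() for _ in range(3)]
print(gen.value, sorted(committed), results)
```

Every attempt lands on the already used value 2 and then puts the generator back to 1, so the table never accepts another row.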
The very idea of rolling back the generator defeats all the intentions generators were created for, and all the expectations about generator behavior that the database, the connection libraries and other programmers have.
You just placed a trap on the highway. It is only a matter of time until someone gets caught in it.
Even if you keep guarding this hack with other hacks for now - spending a lot of time and attention to do that scrupulously and pervasively - one unlucky day in the future there will be another programmer, or even you will have forgotten these gory details, and someone will start using the generator in the standard, intended way and will run into the trap.
Generators were not made to be backtracked during normal work.
existence of primary key is checked in the procedure before doing anything
Yep, that is the first reaction when a programmer new to multithreading meets their first race condition. Let's just add more prior checks.
The first few checks can indeed decrease the probability of a clash, but they can never eliminate it completely. And the more use your program sees, the more transactions get opened by more and more concurrent and active users - it is only a matter of time until this somewhat lowered probability turns out to be still too high.
Think about it: SQL is all about transactions, yet they had to invent and explicitly introduce an out-of-transaction device, which is what a Generator/Sequence is. If there were a reliable solution without them, it would simply have been used instead of creating such a non-SQLish, transaction-boundary-breaking tool.
When you say your SP "checks for PK violation" it is exactly the same as if you would drop the generator altogether and instead just issue "good old"
:new_id = ( select max(auto_id)+1 from MyTable );
By your description you actually do something like that, but in some indirect way. Something like
while exists( select * from MyTable where auto_id = gen_id(MyGen, +1))
do ;
:new_id = gen_id(MyGen, 0);
You may feel that, because you involved generators, you somehow overcame the cross-transaction invisibility problem. But you did not, because the very check "if PK was already taken" is done against the in-transaction view of the table.
That changes nothing: your two transactions Tx.A and Tx.B would not see each other's records, because neither of them has committed yet. Now it only takes some unlucky Tx.C that fails and downticks the generator for them to collide on the same ID.
Or not, you do not even need Tx.C and downticking at all!
Here we bump into the multithreading idea about "atomic operations".
Let's look at it again.
while exists( select * from MyTable where auto_id = gen_id(MyGen, +1))
do ;
:new_id = gen_id(MyGen, 0);
In a single-threaded application that code is okay: you just keep running the generator up until you reach a free slot, then you just query the value without changing it. "What could possibly go wrong?" But in a multithreaded environment it is a rake lying in the grass, waiting to be stepped on. Example:
Initial condition, table has 100 rows (auto_id goes from 1 to 100), the generator = 100.
Tx.A starts adding the row, upticks the generator in the while loop and exits the loop. It does not yet pass to the second line where local variable gets assigned. Not yet. The generator = 101, rows not added yet.
Tx.B starts adding the row, upticks the generator in the while loop and exits the loop. The generator = 102, rows not added yet.
Tx.A goes to the second line and reads gen_id(MyGen,0) into a variable for new row. While it was 101 out of the loop, it is 102 now!
Tx.B goes to the second line and reads gen_id(MyGen,0) and gets 102 too.
Tx.A and Tx.B both try to insert new row with auto_id=102
RACE CONDITIONS - both Tx.A and Tx.B try to commit their work. One of them succeeds, another fails. Which one? It is not predictable. A lucky one commits, an unlucky one fails.
The failed transaction downticks the generator.
Final condition: the table has 101 rows, the auto_id goes consistently from 1 to 100 and then skips to 102. The generator = 101, which is less than MAX(auto_id).
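The interleaving above can be reproduced deterministically in a short Python model. It is a sketch only: the `Generator` class is a hypothetical stand-in for the Firebird generator, a set stands in for the committed rows, and the "unpredictable" winner is arbitrarily fixed as Tx.A so that the run is repeatable.

```python
class Generator:
    """Stand-in for a Firebird generator: a counter outside any transaction."""

    def __init__(self, start):
        self.value = start

    def gen_id(self, step):
        self.value += step
        return self.value


gen = Generator(100)
committed = set(range(1, 101))     # auto_id 1..100 already in the table


def find_free_slot():
    # Line 1: while exists(... where auto_id = gen_id(MyGen, +1)) do ;
    while gen.gen_id(+1) in committed:
        pass


def read_current():
    # Line 2: :new_id = gen_id(MyGen, 0);
    return gen.gen_id(0)


find_free_slot()                   # Tx.A leaves the loop, generator = 101
find_free_slot()                   # Tx.B leaves the loop, generator = 102
a_new_id = read_current()          # Tx.A expected 101 but reads 102
b_new_id = read_current()          # Tx.B reads 102 as well

outcomes = {}
for name, new_id in (("A", a_new_id), ("B", b_new_id)):   # A happens to win
    if new_id in committed:
        gen.gen_id(-1)             # the loser downticks the generator
        outcomes[name] = "failed"
    else:
        committed.add(new_id)
        outcomes[name] = "committed"

print(outcomes, gen.value, max(committed))
```

The two statements are each fine in isolation; the damage comes purely from another transaction running between them.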
Now you might want to add more hacks, I mean more prior checks before actually inserting rows and committing. It will make mistakes even less probable, right? Wrong. The more checks you do, the slower the code gets. The slower the code gets, the greater the probability that, while one thread runs through all those checks, another thread interferes and alters the situation that was checked a moment ago.
The fundamental issue with multithreading is that any check is a SEPARATE action, and between those actions the situation MAY change. Your procedure may check whatever it wants BEFORE actually inserting the row; that guarantees little. By the time you finally reach the row-inserting statement, all the checks you did are in the PAST, and the situation may already have changed. The guarantees your checks gave belong to that past, not to the present moment.
And even if you no longer seek a sure guarantee, with every new check you add you cannot even be sure whether you have decreased or increased the probability of failure. Because multithreading is a bitch: it flows chaotically, out of your control.
So, remember the KISS principle. Until proven otherwise - you most probably do not need SP2 at all, you only need one single UPDATE-OR-INSERT statement.
PS. There was a pretty fun game in my school days called Pascal Robots. I have heard there are also C Robots and probably implementations for many other languages. Pascal Robots, though, came with a number of already coded robots demonstrating different strategies and approaches. Some of them were thought out in very intricate detail. And there was one robot whose program was PRIMITIVE. It only had two loops: if you do not see an enemy, keep turning your radar around; if you do see an enemy, keep running towards it and shooting at it. That was all. What could this idiot do against sophisticated robots with creative attack and defense strategies, flanking maneuvers, an optimal distance maintained by back-and-forth movements, escape tricks and more? Those sophisticated robots employed very extensive checks and very well thought-out hacks triggered by those checks. So... that primitive idiot was the second or maybe third best robot in the shipped set. There were only one or two smarties who could outwit it. This lean-and-fast idiot finished ALL the other robots before they could run through all their checks and hacks thrice. That is what multithreading does to programming. It was astonishing to watch those battles, which went so much against our single-threaded intuition.
I have the following script:
SELECT
DEPT.F03 AS F03, DEPT.F238 AS F238, SDP.F04 AS F04, SDP.F1022 AS F1022,
CAT.F17 AS F17, CAT.F1023 AS F1023, CAT.F1946 AS F1946
FROM
DEPT_TAB DEPT
LEFT OUTER JOIN
SDP_TAB SDP ON SDP.F03 = DEPT.F03,
CAT_TAB CAT
ORDER BY
DEPT.F03
The tables are huge. When I execute the script in SQL Server directly it takes around 4 minutes, but when I run it in the third-party program (SMS LOC, based on Delphi) it gives me the error
<msg> out of memory</msg> <sql> the code </sql>
Is there any way I can lighten the script so that it can be executed? Or has anyone had the same problem and solved it somehow?
I remember having had to resort to the ROBUST PLAN query hint once on a query where the query-optimizer kind of lost track and tried to work it out in a way that the hardware couldn't handle.
=> http://technet.microsoft.com/en-us/library/ms181714.aspx
But I'm not sure I understand why it would work for one 'technology' and not another.
Then again, the error message might not be from SQL but rather from the 3rd-party program that gathers the output and does so in a 'less than ideal' way.
Consider adding paging to the user edit screen and the underlying data call. The point is that you don't need to see all the rows at one time, but they remain available to the user upon request.
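As a rough sketch of what a paged data call looks like, here is a self-contained Python example against an in-memory SQLite table. Everything in it is an assumption made for illustration: the `dept` table, its sample rows and the `fetch_page` helper are hypothetical, and SQLite's LIMIT/OFFSET stands in for the equivalent construct in SQL Server (OFFSET ... FETCH NEXT in an ORDER BY clause, available from SQL Server 2012).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (f03 INTEGER PRIMARY KEY, f238 TEXT)")
conn.executemany(
    "INSERT INTO dept VALUES (?, ?)",
    [(i, f"dept-{i}") for i in range(1, 26)],   # 25 made-up rows
)


def fetch_page(page, page_size=10):
    """Return one page of rows; pages are numbered from 1."""
    offset = (page - 1) * page_size
    cursor = conn.execute(
        "SELECT f03, f238 FROM dept ORDER BY f03 LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cursor.fetchall()


first_page = fetch_page(1)
last_page = fetch_page(3)
print(len(first_page), len(last_page))
```

Only one page of rows is held by the client at any time, which is the part that relieves the memory pressure.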
This will alleviate much of your performance problem.
I had a project where I had to run over 7 million individual lines of T-SQL code via batch (I couldn't figure out how to programmatically leverage the new SEQUENCE command). The problem was that there was a limited amount of memory available on my VM (I was allocated the maximum amount of memory for this VM). Because of the large number of lines of T-SQL code, I first had to test how many lines it could take before the server crashed. For whatever reason, SQL Server (2012) didn't release the memory it used for large batch jobs such as mine (we're talking around 12 GB of memory), so I had to reboot the server every million or so lines. This is what you may have to do if resources are limited for your project.
I want to repetitively execute an SQL query looking like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, just that those are intersecting, which prevents usage of GROUP BY.
That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop trafo" would perhaps be implemented implicitly, by just not re-entering the loop.
Here's the flow chart:
Flow of the transformation:
In step "INPUT" I create a result set with three identical fields holding the dates from ${date.from} until ${date.until} (Kettle variables). (For details on this technique, check out my article on it - Generating virtual tables for JOIN operations in MySQL.)
In step "SELECT" I set the data source to be used ("INPUT") and specify that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters one-to-one to anonymous question marks, I have to serve the same parameter three times, once for each usage.
The "text file output" finally outputs the result in a generic fashion. Only a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
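The same idea (one row per date, each row driving one execution of the query with the date bound three times) can be sketched outside Kettle. The Python example below uses an in-memory SQLite database purely for illustration: the `visits` table, its column `day` and the sample rows are made up, and SQLite's `date(?, '-6 days')` stands in for MySQL's `DATE_SUB(..., INTERVAL 6 DAY)`.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (xid INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [   # made-up sample data
        (1, "2012-12-28"), (2, "2013-01-01"), (1, "2013-01-02"),
        (3, "2013-01-04"), (2, "2013-01-05"),
    ],
)


def date_range(start, end):
    """Yield ISO date strings from start to end inclusive (the "INPUT" step)."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)


# Three positional parameters, all bound to the same date (the "SELECT" step).
QUERY = """
SELECT ? AS d, COUNT(DISTINCT xid) AS n
FROM visits
WHERE day BETWEEN date(?, '-6 days') AND ?
"""

rows = [
    conn.execute(QUERY, (d, d, d)).fetchone()
    for d in date_range(date(2013, 1, 1), date(2013, 1, 5))
]
for d, n in rows:
    print(f"{d};{n}")
```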
In Kettle you want to avoid loops, as they can cause real trouble in transforms. Instead you should do this by adding a step that puts a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transform is only sequential at the row level: rows may process in parallel and out of order and the only guarantee is that the row will pass through the steps in order. Because of this a loop in a transform does not function as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.
I have a column "date" in my table. I need to call my function for this table every time the current time is equal to the time in my "date" column. Is it possible to do this in MS SQL Server?
It seems like you are trying to implement some kind of scheduling.
You could try implementing one using one of the SQL Server services, called SQL Server Agent. It may not be a fit for every kind of response to time events, but it should be able to manage certain tasks.
You would need to set up a SQL Server Agent job for it.
A job would need to consist of at least one job step and have at least one schedule to be runnable. Perhaps, it would be easiest for you at this point to use the Transact-SQL type of job step.
A Transact-SQL job step is just a Transact-SQL script, a multi-statement query. In your case it would probably first check if there are rows matching the current time. Then, either for every matching row separately or for the entire set of them, it would perform whatever kind of operation Transact-SQL allows you to perform.
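The logic of such a job step can be sketched as follows. This is a Python model over an in-memory SQLite table, not Transact-SQL, and it rests on assumptions the answer does not state: the `events` table, the `done` flag and `my_function` are hypothetical, and the step matches rows that are due (time reached and not yet handled) rather than rows whose time exactly equals the current time, since a scheduled job only runs at intervals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, due TEXT, done INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO events (due) VALUES (?)",
    [("2024-01-01 10:00",), ("2024-01-01 10:05",), ("2024-01-01 11:00",)],
)

handled = []


def my_function(event_id):
    """Placeholder for whatever must happen when a row's time arrives."""
    handled.append(event_id)


def job_step(now):
    """One scheduled run: process every row that is due and not yet handled."""
    due_rows = conn.execute(
        "SELECT id FROM events WHERE done = 0 AND due <= ? ORDER BY due",
        (now,),
    ).fetchall()
    for (event_id,) in due_rows:
        my_function(event_id)
        conn.execute("UPDATE events SET done = 1 WHERE id = ?", (event_id,))
    conn.commit()
    return len(due_rows)


first_run = job_step("2024-01-01 10:06")    # two rows are due
second_run = job_step("2024-01-01 10:07")   # nothing new is due
print(first_run, second_run, handled)
```

The schedule attached to the job decides how often this step runs, and therefore how late after its "date" value a row can be picked up.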
Up front: I'm new to SSIS.
I run a query that gives me distinct values in columnA from TableA that need to be processed in order (1, then 2, then 3 and so on, but the start and end values constantly change).
These columnA values then have groups of values in columnB, and those values have to be run through a stored procedure, but they can all run simultaneously. Currently they run in a linear manner.
Here is a visual of what I need to do (pseudocode)
foreach
{
foreach
{
processX
}
}
what I want:
foreach
{
processA processB ProcessC simultaneously there are no collisions to worry about
}
I am using a control flow in SSIS and it has the foreach loop which is good however I don't know what to use to allow it to run the second part simultaneously.
When I want parallel execution in SQL Server in the Control Flow, I usually put several For-Each loops and bring back a distinct recordset for each one of them.
There currently is no way to run a Foreach Loop in "parallel mode".
The best that I can think of is to rework your architecture into a flexible 'worker' threading model, where you can parallelize independently.
What that would require is two SSIS packages. One to supply the work units, and one to work on them. So the "controller" package would perform the foreach loop on TableA, collecting whatever values it needs to. It would then insert those values into a "work to do" table. The "worker" package would consist of a For Loop, inside of which you'd have an Execute SQL Task that queried the "work to do" table for the first row that wasn't being worked on, and if it found such a row, mark it as being worked on (all inside a transaction to ensure no collisions). You'd then have your "work unit" to do work with, or not. A precedence constraint to your next task would only execute if it actually had some instructions. Your For Loop's Eval expression could be crafted to stop when you don't see any new work units (although you might want a delay in there to make sure your workers weren't faster than the controller).
To run all this, you'd start the controller (in an Agent Job), then start multiple workers (same package, different jobs) - as many as you wanted.
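The controller/worker idea can be modeled in a few lines of Python. This is a sketch of the pattern only, not of SSIS: a list of dictionaries stands in for the "work to do" table, a lock stands in for the transaction that makes claiming a row safe, and threads stand in for the separately started worker packages. All names are hypothetical.

```python
import threading

# The controller's output: one row per work unit in the "work to do" table.
work_table = [{"id": i, "status": "pending"} for i in range(1, 21)]
table_lock = threading.Lock()      # stands in for the claiming transaction
processed = []                     # (worker name, work unit id) pairs


def claim_next():
    """Atomically find the first unclaimed row and mark it as being worked on."""
    with table_lock:
        for row in work_table:
            if row["status"] == "pending":
                row["status"] = "in_progress"
                return row
    return None


def worker(name):
    """Loop until no work units are left, like the worker package's For Loop."""
    while True:
        row = claim_next()
        if row is None:
            break                  # no new work units: stop looping
        # ... the stored procedure call for this work unit would go here ...
        with table_lock:
            row["status"] = "done"
            processed.append((name, row["id"]))


threads = [
    threading.Thread(target=worker, args=(f"worker-{n}",)) for n in range(3)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(processed), "work units processed")
```

Because finding a row and marking it happen under one lock, no two workers can pick up the same unit, which is the role the transaction plays in the description above.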