I have a transformation that has a switch case which can either run a database retrieval transformation or do nothing based on the switch case value.
The problem is that the database transformation always seems to get executed, no matter what the result of the switch case is. The database name is parametrized, and the switch case is meant to ensure that non-existent database names are ignored and not queried (as querying them causes an error). But the database transformation runs every time and causes an error.
So the question is: is there a way to prevent a database transformation from executing automatically? I've tried adding a blocking step before it, but with no results.
I've tried to do this before and hit the exact same problem. It's fundamental to how PDI works: if your step won't initialize then nothing will run - there is no solution to this. In fact, I even have a JIRA ticket about this, but it doesn't seem to be going anywhere.
However, perhaps you're approaching this back to front. Why do you have conditional connections? If you could explain the use case then perhaps we can propose a better solution.
Is it possible for dynamic testing to find no failures in a program that still contains a fault? Any simple example?
Please help! thanks.
Yes. Testing can only prove the absence of bugs for what you tested. Dynamic testing cannot cover all possible inputs and outputs in all environments with all dependencies.
The first way is to simply not test the code in question. You can check for this by measuring the coverage of your tests, but even if you achieve 100% coverage there can still be flaws.
The next is not checking all possible types and ranges of inputs. For example, if you have a function that scans for a word in a string, you need to check for the following (a small SQL sketch of these cases appears after the boundary-condition list below):
The word at the start of the string.
The word at the end of the string.
The word in the middle of the string.
A string without the word.
The empty string.
These are known as boundary conditions and include things like:
0
Negative numbers
Empty strings
Null
Extremely large values
Decimals
Unicode
Empty files
Extremely large files
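For instance, here's what those word-search cases might look like as test rows in a SQL sketch (the word, the strings and the CHARINDEX-based check are all invented for illustration; the VALUES row constructor needs SQL Server 2008 or later):
DECLARE @word varchar(10);
SET @word = 'cat';

-- Note: CHARINDEX matches substrings, so 'concatenate' would also "contain"
-- the word; a real word search needs more care, which is exactly the kind of
-- case these tests should expose.
SELECT input,
       CASE WHEN CHARINDEX(@word, input) > 0 THEN 'found' ELSE 'not found' END AS result
FROM (VALUES
    ('cat sat on the mat'),    -- word at the start of the string
    ('the mat hid the cat'),   -- word at the end of the string
    ('the cat sat down'),      -- word in the middle of the string
    ('the dog sat down'),      -- string without the word
    (''),                      -- the empty string
    (NULL)                     -- NULL input, another boundary value
) AS tests(input);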
If the code in question keeps state, maybe in an object, maybe in global variables, you have to test that state does not become corrupted or interfere with subsequent runs.
If you're doing parallel processing you must test any number of possibilities for deadlocks or corruption resulting from trying to do the same thing at the same time. For example, two processes trying to write to the same file. Or two processes both waiting for a lock on the same resource. Do they lock only what they need? Do they give up their locks ASAP?
Once you test all the ways the code is supposed to work, you have to test all the ways that it can fail, whether it fails gracefully with an exception (instead of garbage), whether an error leaves it in a corrupted state, and so on. How does it handle resource failure, like failing to connect to a database? This becomes particularly important working with databases and files to ensure a failure doesn't leave things partially altered.
For example, if you're transferring money from one account to another you might write:
my $from_balance = get_balance($from);
my $to_balance = get_balance($to);
set_balance($from, $from_balance - $amount);
set_balance($to, $to_balance + $amount);
What happens if the program crashes after the first set_balance? What happens if another process changes either balance between get_balance and set_balance? These sorts of concurrency issues must be thought of and tested.
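Not part of the original example, but here is a minimal SQL sketch of one way to close both holes, assuming a hypothetical accounts table: adjust each balance in a single statement inside one transaction, so there is no separate read-then-write to race against and a crash rolls back both updates.
DECLARE @from int, @to int, @amount money;
SELECT @from = 1, @to = 2, @amount = 100.00;

BEGIN TRANSACTION;

-- Each balance is adjusted in one statement; no stale read-modify-write.
UPDATE dbo.accounts SET balance = balance - @amount WHERE account_id = @from;
UPDATE dbo.accounts SET balance = balance + @amount WHERE account_id = @to;

-- Either both updates are committed or, after a crash or rollback, neither is.
COMMIT TRANSACTION;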
There's all the different environments the code could run in. Different operating systems. Different compilers. Different dependencies. Different databases. And all with different versions. All these have to be tested.
The test can simply be wrong. It can be a mistake in the test. It can be a mistake in the spec. Generally one tests the same code in different ways to avoid this problem.
The test can be right, the spec can be right, but the feature is wrong. It could be a bad design. It could be a bad idea. You can argue this isn't a "bug", but if the users don't like it, it needs to be fixed.
If your testing makes use of a lot of mocking, your mocks may not reflect how the thing being mocked actually behaves.
And so on.
For all these flaws, dynamic testing remains the best we've got for testing more than a few dozen lines of code.
IS THIS ABSOLUTELY BONKERS.....
I have just started working on a new project and I'm shocked at what I have just seen. This project is a C# web app that sits on top of an Oracle database. Now all the stored procedures are not actually stored procedures.... They are just SQL scripts stored in text files in a directory on the server. When the application starts it looks in the directory, goes through each file, reads out the text and saves it in a dictionary. It also runs a regex over the text, removing special sequences like [PARAM] and replacing them with the correct symbol, e.g. ':' in Oracle's case or '@' for SQL Server. Then when the code wants to execute one of these statements it calls a method which finds the correct one in the dictionary and runs it.
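To make that concrete (the query, table and parameter names here are invented, not from the actual codebase), a file on disk might hold a statement with the placeholder, which the startup pass rewrites for each provider:
-- As stored in the text file:
--   SELECT name, balance FROM accounts WHERE account_id = [PARAM]accountId
-- After the regex pass, the SQL Server entry in the dictionary becomes the
-- SELECT below (the Oracle entry would use :accountId instead of @accountId).
DECLARE @accountId int;   -- declared here only so the sketch runs standalone
SET @accountId = 42;
SELECT name, balance FROM accounts WHERE account_id = @accountId;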
Now this appears to have been done in case they ever wanted to swap underlying db technologies. They say they would just swap the SQL files in the directory for files in the appropriate syntax and it would work.
Now I would normally expect the stored procedures to actually be stored procedures and live on the db, with a separate project (layer) that talks to the db. Then if the db technology changes, just add another data layer project and swap the DLLs out....
I see massive problems with the way it's been done currently:
No execution plan on the db server being created.
Massive overhead reading hundreds of text files, building up a string for each, running regex over it.
No checking of SQL syntax.
Big memory footprint from having all these stored procedures in memory.
What do you lot think?
Is this really bad or am I just moaning because I have never seen anything like this before?
What else is wrong with this approach?
Any comments will be much appreciated as I'm trying to get across to colleagues that this is crazy....
Why don't they use the scripts at build time to create stored procedure scripts in the native syntax (PL/SQL, T-SQL), which can then be deployed to the database? I can't see that it would be too much more work, and you get all the benefits of compiled stored procedure code etc.
I have personal experience of run-time compilation of stored procedures (SQL Server) being a big performance overhead, and on a production system this was a real problem.
I can sort of see the reasoning behind this design:
Stored procedure code is too database specific, so we won't use stored procedures, we will use SQL statements instead.
Even SQL statements can have database-specific syntax in them, so we'll have some hokey method for converting them on the fly at run-time.
Even if you don't use stored procedures, I still think the conversion should be done at build time (e.g., to generate C# code), not run-time.
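For example, a build step could turn one of those script files into a native T-SQL procedure along these lines (purely illustrative names, not from the project), with the Oracle build emitting the PL/SQL equivalent from the same source:
CREATE PROCEDURE dbo.GetAccountById
    @accountId int
AS
BEGIN
    SET NOCOUNT ON;
    SELECT name, balance
    FROM dbo.accounts
    WHERE account_id = @accountId;
END
That way the "swap the database later" goal is kept, but the server gets a compiled, syntax-checked object instead of a string in a dictionary.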
What do you lot think?
I think it's a classic case of over-engineering: who changes their DB provider? Is it really bringing anything to the table?
Is this really bad or am I just moaning because I have never seen anything like this before?
A bit of both in my opinion: it's quite bad but if they've been running like this for some time it's got to be working.
What else is wrong with this approach?
As it is, re-calculating the execution plan every time is probably the biggest issue: so much performance loss! I'm saying probably because performance is not always a requirement (over-optimization isn't any better ;)
What is really wrong is that they reinvented the wheel: LINQ does that natively.
And that means as LINQ gets improved over time, the company will steadily fall behind since they won't benefit from the enhancements.
If you have any leverage power, try to talk to them about LINQ.
My first post here, please be gentle. =)
I work for a company that inherited the maintenance of a bespoke system used by one of our customers. The previous developer (no longer with us) encrypted all the database objects (WITH ENCRYPTION).
The system has been plagued with various timeout issues well before we took ownership of it, and we want to get to the bottom of these.
The database is on SQL Express 2005 in production. We want to run the profiler on it but because the various objects are encrypted, most stored procedure calls etc.. show up as '-- Encrypted Text'.
Not very useful. I've written a little console app in C# to decrypt all the database objects, which works perfectly as far as I can tell.
It finds all encrypted objects in the database and for each one, decrypts it, removes the with encryption clause, drops the original and recreates it using the new 'without encryption' text.
There are some computed columns that get dropped before trying to decrypt the functions that are used in their definitions, then get recreated.
What I'm finding is that once everything is decrypted, I can't get into the system because the stored procedures etc.. take far too long to run on their first call. Execution plans are being compiled for the first time, so some delay is understandable, but we're talking 1 minute plus.. after 30 seconds the command timeout is hit, so the plans never get compiled.
I also have the same issue if I drop and recreate the database objects using their original scripts (keeping the WITH ENCRYPTION clause in).
So there's some consistency there. However, what absolutely mystifies me is that if I drop the execution plans from the original copy of the database (which was created from a backup of the production database), the same stored procedures are much faster. 10 seconds for first call. As far as I can tell, the stored procedures, functions etc.. are the same.
From my testing, I don't think it's a particular procedure or function that is causing the problem. It seems like the delay is cumulative, the more objects I drop & recreate the slower things are.
I've taken a few random stabs in the dark, rebuilding indexes and updating stats - this has had no effect at all.
We could write something to execute all 540 functions, triggers, sprocs etc.. to pre-empt the first real call from a user, however once SQL server is restarted (and our client does restart their server from time to time) the execution plans will be dropped and we'd need to run the same tool again. To me it doesn't seem a viable option (neither is increasing the CommandTimeout property), I want to know why I'm seeing this behaviour.
I've been using sys.dm_exec_query_plan and sys.dm_exec_sql_text to look at the execution plans, and using DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE as part of my testing.
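For reference, a generic way of combining those DMVs (a sketch of the usual pattern, not the exact queries I ran) is:
SELECT st.text, qp.query_plan, cp.usecounts, cp.objtype
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp;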
I'm totally stumped, please help me before I jump out the office window.
Thanks in advance,
Andy.
--EDIT--
I don't quite know how I missed it, but the Activity Monitor is showing a session being blocked by a recompile of a table valued function. It takes far too long to compile and the blocked query hits the timeout.
I don't understand why in original version of the database (restored from backup taken from the customer site), the compilation takes around 10 seconds, but after dropping and recreating these objects in the same database, the table valued function takes almost a minute to compile.
I've tried truncating the log, which didn't have any effect. I still need to have a look at the file sizes.
-- Another edit --
The TVF returns a temporary table, and has 12 outer joins in the query, all on either sys.server_principals or sys.database_role_members.
I seem to remember reading something about recompiles and temporary tables, which I'll have to check again..
You said yourself that (computed) columns were dropped. Is it possible that other stuff was manipulated in the tables? If so, you will probably want to reindex your tables (which will update the tables' statistics as well) using a command such as:
EXEC sp_MSforeachtable @command1 = 'DBCC DBREINDEX (''?'')'
...though it sounds like you've done something like this. Either way, I recommend doing it once you make such a big change to all of those objects.
Third recommendation:
While you are waiting for your procs to execute, run an sp_who2 on the database to make sure nothing is blocking your queries. It's quite possible that you might have some sort of long-lived transaction happening that you haven't accounted for.
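For example, while the slow call is running in one window, something like this in another window shows whether anything is blocked and by whom (sp_who2 is standard; the DMV query is just a generic 2005+ alternative):
EXEC sp_who2;   -- look at the BlkBy column for blocked sessions

SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;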
Fourth recommendation:
Make sure your server has enough memory. Make sure your transaction log files and datafiles aren't auto-growing after all of those big index and object updates. That can take FOREVER to happen, especially on under-spec'ed hardware like you may have running SQL Express.
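A quick way to check the current file sizes and growth settings (standard catalog view; size is reported in 8 KB pages):
SELECT name,
       size / 128 AS size_mb,   -- size is stored in 8 KB pages
       growth,
       is_percent_growth
FROM sys.database_files;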
Fifth recommendation:
Run a SQL Server Profiler trace against the database and look at what statements are starting specifically, and which are timing out. "Zoom in" on those and analyze them piece by piece and see what's up. This will likely just take a lot of good ol' hard work to fully understand.
In summary, the act of dropping and recreating procs itself shouldn't cause this slowdown if the statistics and indexes they were initially built against are sufficiently similar to what they are now. It's likely that you will find that there's Something Else happening which isn't necessarily directly related to changing the proc definitions themselves.
Another shot in the dark: Were the computed columns which you had to drop originally persisted (and not persisted after recreation) or vice versa?
If the functions called in the computation are complex or expensive, persisted columns are very advantageous and might be responsible for the behavior you are seeing.
Turns out that if I assign the parameter of the TVF to a local variable, then use the variable wherever the original parameter was used, normal service is resumed (the query takes less than a second, instead of a minute!)
Some kind of parameter sniffing shenanigans going on, I don't really understand why though - at the point I'm trying to call the function, no query plans exist, good or bad.
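For anyone who hits the same thing, here is a stripped-down sketch of the change (the function name and shape are invented; the real function is considerably bigger):
CREATE FUNCTION dbo.fn_MemberRoles (@memberName sysname)
RETURNS @roles TABLE (role_name sysname)
AS
BEGIN
    -- Copy the parameter into a local variable and use that in the query,
    -- so the optimizer can't sniff the parameter value at compile time.
    DECLARE @localName sysname;
    SET @localName = @memberName;

    INSERT INTO @roles (role_name)
    SELECT r.name
    FROM sys.database_role_members AS drm
    JOIN sys.database_principals AS r ON r.principal_id = drm.role_principal_id
    JOIN sys.database_principals AS m ON m.principal_id = drm.member_principal_id
    WHERE m.name = @localName;   -- the variable, not the original parameter

    RETURN;
END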
I'm in contact with Microsoft on this one (first time I've ever used my MSDN support entitlement) so hopefully we'll find out more and I'll post what I've discovered.
Thanks all for your help, we're getting there!
I've been reading a lot about prepared statements and in everything I've read, no one talks about the downsides of using them. Therefore, I'm wondering if there are any "there be dragons" spots that people tend to overlook?
A prepared statement is just a parsed and precompiled SQL statement which waits for the bound variables to be provided before it is executed.
Any executed statement becomes prepared sooner or later (it needs to be parsed, optimized, compiled and then executed).
A prepared statement just reuses the results of parsing, optimization and compilation.
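For example, in PostgreSQL's SQL-level syntax (the accounts table here is made up), the statement is parsed (and typically planned) once when it is prepared, then re-executed with different bound values:
PREPARE get_account (int) AS
    SELECT name, balance FROM accounts WHERE account_id = $1;

EXECUTE get_account(42);   -- reuses the prepared statement with a bound value
EXECUTE get_account(7);

DEALLOCATE get_account;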
Usually database systems use some kind of optimization to save some time on query preparation even if you don't use prepared queries yourself.
Oracle, for instance, when parsing a query first checks the library cache, and if the same statement had already been parsed, it uses the cached execution plan instead.
If you use a statement only once, or if you automatically generate dynamic SQL statements (and either properly escape everything or know for certain your parameters have only safe characters), then you should not use prepared statements.
There is one other small issue with prepared statements vs dynamic sql, and that is that it can be harder to debug them. With dynamic sql, you can always just write out a problem query to a log file and run it directly on the server exactly as your program sees it. With prepared statements it can take a little more work to test your query with a specific set of parameters determined from crash data. But not that much more, and the extra security definitely justifies the cost.
In some situations, the database engine might come up with an inferior query plan when using a prepared statement (because it can't make the right assumptions without having the actual bind values for a search).
See, for example, the "Notes" section at http://www.postgresql.org/docs/current/static/sql-prepare.html
So it might be worth testing your queries with and without prepared statements to find out which is faster. Ideally, you would then decide on a per-statement basis whether to use prepared statements or not, although not all ORMs will allow you to do that.
The only downside that I can think of is that they take up memory on the server. It's not much, but there are probably some edge cases where it would be a problem, though I'm hard pressed to think of any.
I have worked with SQL for several years now, primarily MySQL/PhpMyAdmin, but also Oracle/iSqlPlus and PL/SQL lately. I have programmed in PHP, Java, ActionScript and more. I realise SQL isn't an imperative programming language like the others - but why do the error messages seem so much less specific in SQL? In other environments I'm pointed straight to the root of the problem. More often than not, MySQL gives me errors like "error AROUND where u.id = ..." and prints the whole query. This is even more difficult with stored procedures, where debugging can be a complete nightmare.
Am I missing a magic tool/language/plugin/setting that gives better error reporting, or are we stuck with this? I want a debugger or language which gives me the same amount of control that Eclipse gives me when setting breakpoints and stepping through the code. Is this possible?
I think the answer lies in the fact that SQL is a set-based language with a few procedural things attached. Since the designers were thinking in set-based terms, they didn't think that the ordinary type of debugging that other languages have is important. However, I think some of this is changing. You can set breakpoints in SQL Server 2008. I haven't used it really as you must have SQL Server 2008 databases before it will work and most of ours are still SQL Server 2000. But it is available and it does allow you to step through things. You still are going to have problems when your select statement is 150 lines long and it knows that the syntax isn't right but it can't point out exactly where as it is all one command.
Personally, when I am writing a long procedural SP, I build in a test mode that includes showing me the results of things I do, the values of key variables at specific points I'm interested in, and print statements that let me know what steps have been completed, and then rolling the whole thing back when done. That way I can see what would have happened if it had run for real, but not have hurt any of the data in the database if I got it wrong. I find this very useful. It can vastly increase the size of your proc though. I have a template I use that has most of the structure I need set up in it, so it doesn't really take me too long to do. Especially since I never add an insert, update or delete to a proc without first testing the associated select to ensure I have the records I want.
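A bare-bones illustration of that test-mode idea (made-up names; the real template described above obviously has much more in it):
CREATE PROCEDURE dbo.usp_AdjustPrices
    @TestMode bit = 1
AS
BEGIN
    BEGIN TRANSACTION;

    UPDATE dbo.products SET price = price * 1.05 WHERE category = 'books';
    PRINT 'Rows that would be updated: ' + CAST(@@ROWCOUNT AS varchar(10));

    IF @TestMode = 1
        ROLLBACK TRANSACTION;   -- show what would have happened, change nothing
    ELSE
        COMMIT TRANSACTION;
END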
I think the explanation is that "regular" languages have much smaller individual statements than SQL, so that single-statement granularity points to a much smaller part of the code in them than in SQL. A single SQL statement can be a page or more in length; in other languages it's usually a single line.
I don't think that makes it impossible for debuggers / IDEs to more precisely identify errors, but I suspect it makes it harder.
I agree with your complaint.
Building a good logging framework and overusing it in your sprocs is what works best for me.
Before and after every transaction or important piece of logic, I write out the sproc name, step timestamp and a rowcount (if relevant) to my log table. I find that when I have done this, I can usually narrow down the problem spot within a few minutes.
Add a debug parameter to the sproc (default to "N") and pass it through to any other sprocs that it calls so that you can easily turn logging on or off.
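A minimal sketch of that logging pattern (table, column and procedure names are invented for illustration):
CREATE TABLE dbo.proc_log (
    log_id    int IDENTITY(1,1) PRIMARY KEY,
    proc_name sysname,
    step_name varchar(100),
    row_count int NULL,
    logged_at datetime DEFAULT GETDATE()
);
GO
CREATE PROCEDURE dbo.usp_LoadOrders
    @Debug char(1) = 'N'   -- pass this through to any sprocs this one calls
AS
BEGIN
    DECLARE @rows int;

    IF @Debug = 'Y'
        INSERT dbo.proc_log (proc_name, step_name) VALUES ('usp_LoadOrders', 'start');

    UPDATE dbo.orders SET status = 'archived' WHERE order_date < '20080101';
    SET @rows = @@ROWCOUNT;   -- capture straight away, before anything resets it

    IF @Debug = 'Y'
        INSERT dbo.proc_log (proc_name, step_name, row_count)
        VALUES ('usp_LoadOrders', 'archived old orders', @rows);
END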
As for breakpoints and stepping through code, you can do this with MS SQL Server (in my opinion, it's easier on 2005+ than with 2000).
For the simple cases, early development debugging, the sometimes cryptic messages are usually good enough to get the error resolved -- syntax error, can't do X with Y. If I'm in a tough sproc, I'll revert to "printf debugging" on the sproc text because it's quick and easy. After a while with your database of choice, the simple issues become old hat and you just take them in stride.
However, once the code is released, the complexity of the issues is way too high. I consider myself lucky if I can reproduce them. Also, the places where the developer in me would want a debugger the DBA in me says "no way you're putting a debugger there."
I do use the following tactics.
While writing the stored procedure, have a @procStep variable.
Each time a new logical step is executed:
SET @procStep = 'What the ... is happening here';
the rest is here
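As a rough sketch of how such a step variable can be used together with an error handler (step names invented here; TRY/CATCH needs SQL Server 2005 or later):
DECLARE @procStep varchar(100);

BEGIN TRY
    SET @procStep = 'loading the staging table';
    -- ... work for this step ...

    SET @procStep = 'updating the target table';
    -- ... work for this step ...
END TRY
BEGIN CATCH
    -- The error now says which logical step blew up.
    RAISERROR('Failed during step: %s', 16, 1, @procStep);
END CATCH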