Control data flow in Pentaho transformation with variables

I want to control data flow in Pentaho transformation files with system variables. I found a component called 'Simple evaluation' which does exactly what I want, but it can only be used in job files.
I have gone through the component tree for transformations in Spoon but cannot find anything like 'Simple evaluation'.
Can anyone give me an idea of how to do this?
Thanks

IIRC you can't use variables in the Filter rows step. That would probably be a worthy change request to raise at jira.pentaho.com.
So simply use a "Get Variables" step to get the variable into the stream, and then use the Filter rows step (or Switch/Case, depending on complexity).

Related

How can I find out the JOBCOUNT value of a periodic background job?

I have a report which creates a background job. The user can decide if the job should be periodic or not. Now I want to show information about the actual job. So how can I find out the JOBCOUNT of the following job (periodic) after the old one was executed?
I guess SAP would store that information only if it's needed for internal operations.
I think there's no need, so you won't find that information stored anywhere.
You might approximate it yourself by searching for the currently scheduled job whose creation date/time (TBTCO/TBTCS) is close to the end of a previous one (TBTCO) and which has the same characteristics (including its step(s) in table TBTCP)... You may take inspiration from a few programs prefixed BTCAUX (04, 13).
If you write this code, don't hesitate to post it as a separate answer; it could be very helpful for future visitors.
You can use the BP_JOB_SELECT function module for that; its parameters mostly resemble the SM37 selection parameters.
Set the JOBSELECT_DIALOG parameter to N to skip the GUI screen and put the job name into JOBSEL_PARAM_IN-JOBNAME; these are the only two mandatory parameters.
The JOBCOUNT value is returned in the JOBSELECT_JOBLIST table.

Order step metrics in Pentaho Data Integration

I'm working on a rather long transformation in Kettle, and I inserted some steps in the middle of the flow.
So now my step metrics are all scrambled and very hard to read.
Is there any way I could sort them back into order (following the direction of the flow)?
If you click on # in the "Step metrics" tab, it will sort the steps by their order. The visualisation in the "Metrics" tab will also be sorted.
Steps are stored in the order of insertion. The step metrics grid allows the steps to be shown in a different order by clicking on the column header, but since a transformation graph can be meshed, it's generally not possible to sort the steps in the order of the data flow. Only a single path in your graph could be sorted by analyzing the hops, anyway.
What you can do is change the name of each step and add a number in front of it. Then sort by name.
Boring, I know, but it is what we have...
It's unfortunate that assigning a step number isn't an option. And maybe it differs by version, but in 8.3 the step metrics # column assignment seems to be somewhat based on the step's order in the flow (which of course breaks down when the flow branches), not by when the step was added. It does ring a bell that it was based on when the step was added in past versions though.
It's also unfortunate that the sort by step name is case sensitive - so steps that start with "a" come after steps that start with "Z". Perhaps there's a way to work that behavior into a naming strategy that actually leverages that for some benefit, but I haven't found one.
So I'm inclined to agree with @recacon - using a number prefix for the step names and then sorting execution metrics by step name seems like the best option. I haven't done much of this yet since without having a team standard it's unlikely to be maintained.
For the few times I have done it, I've used a three digit numeric prefix where values are lowest at the start of the flow and increase farther down the path. To reduce the need for re-sequencing when steps are added later, I start out incrementing by ten from one step to the next, then use a number between when splitting hops later on.
I also increment the 100's digit for branches in the flow or if there's a significant section of logic for a particular purpose.
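The numbering scheme above can be sketched in a few lines of Python. This is only an illustration of the naming convention, not anything PDI provides: the `numbered_names` helper and the step names are hypothetical, and the plain string sort stands in for sorting the metrics grid by step name.

```python
def numbered_names(step_names, start=10, gap=10):
    """Prefix each step name with a zero-padded three-digit sequence number."""
    return [f"{start + i * gap:03d} {name}" for i, name in enumerate(step_names)]


# Hypothetical steps, listed in flow order.
flow = ["Table input", "filter rows", "Sort rows", "Text file output"]
named = numbered_names(flow)

# A step added later between "020 filter rows" and "030 Sort rows" takes a
# number from the gap, so no existing step has to be renumbered.
named.append("025 Add constants")

# A plain string sort is case sensitive, but the numeric prefix decides the
# order before any letter is compared.
in_flow_order = sorted(named)
```

Because the prefix is zero-padded to a fixed width, the case-sensitive ordering of the letters that follow no longer matters.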

Elegantly handle samples with insufficient data in workflow?

I've set up a Snakemake pipeline for doing some simple QC and analysis on shallow shotgun metagenomics samples coming through our lab.
Some of the tools in the pipeline will fail or error when samples with low amounts of data are delivered as inputs -- but this is sometimes not knowable from the raw input data, as intermediate filtering steps (such as adapter trimming and host genome removal) can remove varying numbers of reads.
Ideally, I would like to be able to handle these cases with some sort of check on certain input rules, which could evaluate the number of reads in an input file and choose whether or not to continue with that portion of the workflow graph. Has anyone implemented something like this successfully?
Many thanks,
-jon
I'm not aware of the possibility to not complete the workflow based on some computation happening inside the workflow. The rules to be executed are determined based on the final required output, and failure will happen if this final output cannot be generated.
One approach could be to catch the particular tool failure (a try ... except construct in a run section, or return code handling in a shell section), generate a dummy output file for the corresponding rule, and have the downstream rules "propagate" dummy file generation based on a test that identifies the rule's input as such a dummy file.
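A minimal sketch of that dummy-file idea, as plain Python that a rule's run section or a wrapper script could call. The `run_or_dummy` and `is_dummy` helpers and the marker content are assumptions made for illustration; they are not part of Snakemake.

```python
import subprocess
from pathlib import Path

# Hypothetical sentinel content that marks a placeholder output.
DUMMY_MARKER = b"INSUFFICIENT_DATA\n"


def is_dummy(path):
    """Return True if the file holds exactly the dummy marker."""
    with open(path, "rb") as handle:
        # Read one byte more than the marker so longer files never match.
        return handle.read(len(DUMMY_MARKER) + 1) == DUMMY_MARKER


def run_or_dummy(command, input_path, output_path):
    """Run the tool; on a dummy input or a non-zero exit, write a dummy output.

    Returns True only when the tool actually ran and succeeded.
    """
    if is_dummy(input_path):
        # Propagate the placeholder without running the tool at all.
        Path(output_path).write_bytes(DUMMY_MARKER)
        return False
    result = subprocess.run(command, capture_output=True)
    if result.returncode != 0:
        Path(output_path).write_bytes(DUMMY_MARKER)
        return False
    return True
```

Note that this treats every non-zero exit as "insufficient data", so genuine tool errors would be hidden too; checking the read count or the tool's error message before writing the dummy would be safer.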
Another approach could be to pre-process the data outside of your snakemake workflow to determine which input to skip, and then use some filtering on the wildcards combinations as described here: https://stackoverflow.com/a/41185568/1878788.
I've been trying to find a solution to this issue as well.
Thus far I think I've identified a few potential solutions but have yet to be able to correctly implement them.
I use seqkit stats to quickly generate a txt file and use the num_seqs column to filter with. You can write a quick pandas function to return a list of files which pass your threshold, and I use config.yaml to pass the minimum read threshold:
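import pandas as pd

def get_passing_fastq_files(wildcards):
    # fastq.stats.txt is the stats table written by seqkit stats
    qc = pd.read_table('fastq.stats.txt').fillna(0)
    # keep files whose read count exceeds the threshold from config.yaml
    passing = list(qc[qc['num_seqs'] > config['minReads']]['file'])
    return passing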
Trying to implement that as an input function in Snakemake has been an esoteric nightmare, to be honest. Probably due to my own lack of nuanced understanding of the Wildcards object.
I think the use of a checkpoint is also necessary in the process to force Snakemake to recompute the DAG after filtering samples out. Haven't been able to connect all the dots yet however, and I'm trying to avoid janky solutions that use token files etc.

Is there a way for VBA UDF to "know" what other functions will be run?

Assume I have a UDF that will be used in a worksheet 100,000+ times. Is there a way, within the function, for it to know how many more times it is going to be called in the batch? Basically what I want to do is have every function create a to-do list of work to do. I want to do something like:
IF remaining functions to be executed after this one = 0 then ...
Is there a way to do this?
Background:
I want to make a UDF that will perform SQL queries with the user just giving parameters (date, hour, node, type). This is pretty easy to make if you're willing to actually execute the SQL query every time the function is run. I know it's easy because I did this and it was ridiculously slow. My new idea is to have the function first see if the data it is looking for exists in a global cache variable and, if it doesn't, add the request to a global "job-list" variable.
What I want it to do is when the last function is called to then go through the job list and perform the fewest number of SQL queries and fill the global cache variable. Once the cache variable is full it would do a table refresh to make all the other functions get called again since on the subsequent call they'll find the data they need in the cache.
Firstly:
VBA UDF performance is extremely sensitive to the way the UDF is coded:
see my series of posts about writing efficient VBA UDFs:
http://fastexcel.wordpress.com/2011/06/13/writing-efficient-vba-udfs-part-3-avoiding-the-vbe-refresh-bug/
http://fastexcel.wordpress.com/2011/05/25/writing-efficient-vba-udfs-part-1/
You should also consider using an Array UDF to return multiple results:
http://fastexcel.wordpress.com/2011/06/20/writing-efiicient-vba-udfs-part5-udf-array-formulas-go-faster/
Secondly:
The 12th post in this series outlines using the AfterCalculate event and a cache
http://fastexcel.wordpress.com/2012/12/05/writing-efficient-udfs-part-12-getting-used-range-fast-using-application-events-and-a-cache/
Basically the approach you would need is for the UDF to check the cache and, if the value is not current or available, add a request to the queue. Then use the after-calculation event to process the queue and, if necessary, trigger another recalc.
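The queue-then-batch pattern can be sketched independently of Excel. The following is Python rather than VBA, purely to show the control flow; `lookup`, `after_calculate` and `fetch_many` are hypothetical names, with `lookup` playing the role of the UDF body and `after_calculate` the role of the after-calculation event handler.

```python
cache = {}       # (date, hour, node, type) -> value
pending = set()  # keys requested during this calculation pass but not cached


def lookup(key):
    """UDF body: return the cached value, or queue the request."""
    if key in cache:
        return cache[key]
    pending.add(key)
    return None  # placeholder shown until the next recalculation


def after_calculate(fetch_many):
    """Event handler: fill the cache in one batch.

    fetch_many takes a list of keys and returns a dict of key -> value.
    Returns True when a recalculation should be triggered.
    """
    if not pending:
        return False
    # One batched query instead of one query per cell.
    cache.update(fetch_many(sorted(pending)))
    pending.clear()
    return True
```

On the second calculation pass every `lookup` hits the cache, `pending` stays empty, and `after_calculate` returns False, so the recalculation loop stops.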
Performing 100,000 SQL queries from an Excel spreadsheet seems like a poor design. Creating a caching mechanism on top of these seems to compound the problem, making it more complicated than it probably needs to be. There are some circumstances where this might be appropriate, but I would consider other design approaches instead.
The most obvious is to take the data from the Excel spreadsheet and load it into a table in the database. Then use the database to do the processing on all the rows at once. The final step is to read the result back into Excel.
I find that the best way to get large numbers of rows from Excel into a database is to save the Excel file as csv and bulk insert them.
This approach may not work for your problem. In general, though, set-based approaches running in the database are going to perform much better.
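As a rough illustration of the set-based idea, the sketch below loads CSV rows into a table and answers all of them with a single join. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for the real database and its bulk-insert facility; the table names, columns and sample readings are all made up.

```python
import csv
import io
import sqlite3


def load_and_query(csv_text):
    """Load request rows from CSV text and resolve them all in one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (node TEXT, hour INTEGER, value REAL)")
    conn.execute("CREATE TABLE requests (node TEXT, hour INTEGER)")
    # Made-up data standing in for what already lives in the database.
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?)",
        [("A", 1, 10.0), ("A", 2, 12.5), ("B", 1, 7.0)],
    )
    # Stand-in for the bulk insert of the CSV exported from Excel.
    requests = [(node, int(hour)) for node, hour in csv.reader(io.StringIO(csv_text))]
    conn.executemany("INSERT INTO requests VALUES (?, ?)", requests)
    # One set-based query replaces one query per spreadsheet row.
    result = conn.execute(
        "SELECT r.node, r.hour, d.value "
        "FROM requests r "
        "JOIN readings d ON d.node = r.node AND d.hour = r.hour "
        "ORDER BY r.node, r.hour"
    ).fetchall()
    conn.close()
    return result
```

The returned rows are what would then be written back into the worksheet.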
As for the caching mechanism, if you have to go down that route, I can imagine a function that has the following pseudo-code:
Check if input values are in cache.
If so, read values from cache.
Else do complex processing.
Load values in cache.
This logic could go in the function. As @Bulat suggests, though, it is probably better to add an additional caching layer around the function.
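The pseudo-code, written as a caching layer around the function rather than inside it, might look like the Python sketch below. `with_cache` and `slow_lookup` are hypothetical names, and `slow_lookup` merely simulates the expensive processing.

```python
import functools


def with_cache(func):
    """Wrap func so repeated calls with the same arguments reuse the result."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args in cache:        # check if input values are in cache
            return cache[args]   # if so, read values from cache
        value = func(*args)      # else do the complex processing
        cache[args] = value      # load values in cache
        return value

    return wrapper


processed = []  # records each time the expensive work really runs


@with_cache
def slow_lookup(date, hour):
    processed.append((date, hour))
    return f"{date}T{hour:02d}"
```

Calling `slow_lookup("2024-01-01", 5)` twice runs the body only once; the second call is served from the dictionary.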

Call RESTful service in Pig script

I'm working on a Pig script (my first) that loads a large text file. For each record in that text file, the content of one field needs to be sent off to a RESTful service for processing. Nothing needs to be evaluated or filtered. Capture data, send it off and the script doesn't need anything back.
I'm assuming that a UDF is required for this kind of functionality, but I'm new enough to Pig that I don't have a clear picture of what type of function I should build. My best guess would be a Store Function since the data is ultimately getting stored somewhere, but I feel like the amount of guesswork involved in coming to that conclusion is higher than I'd like.
Any insight or guidance would be much appreciated.
Have you had a look at DBStorage, which does something similar?
everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
...
-- RestStorage is a hypothetical custom store function modeled on DBStorage
STORE ordered INTO 'https://...' USING RestStorage();
Having never found even a hint of an answer to this, I decided to move in a different direction. I'm using Pig to load and parse the large file, but then streaming each record that I care about to PHP for additional processing that Pig doesn't seem to have the capability to handle cleanly.
It's still not complete (read: there's a great big, very unhappy bug in the mix), but I think the concept is solid--just need to work out the implementation details.
everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
-- apply filter
-- apply filter
-- ...
-- apply last filter
ordered = ORDER filtered_categories BY category;
streamed = STREAM limited THROUGH `php -nF process_categories.php`;
DUMP streamed;
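For comparison, here is a rough Python counterpart to the streamed script, assuming Pig hands each record to the command as one tab-delimited line on stdin and reads tab-delimited lines back from stdout. It is not the PHP script from the post: the endpoint URL, the `post_category` and `process_stream` functions, and the choice to echo the HTTP status are all assumptions.

```python
import sys
import urllib.request

# Hypothetical endpoint; replace with the real service URL.
SERVICE_URL = "https://example.invalid/process"


def post_category(category, url=SERVICE_URL):
    """POST one category as plain text and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=category.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def process_stream(lines, send=post_category, out=sys.stdout):
    """Send the first field of every non-empty record; return the count sent."""
    sent = 0
    for line in lines:
        category = line.rstrip("\n").split("\t")[0]
        if not category:
            continue
        status = send(category)
        # Echo a record back so the STREAM operator has output to collect.
        out.write(f"{category}\t{status}\n")
        sent += 1
    return sent
```

Adding a final `process_stream(sys.stdin)` call turns this into a script that could be named in the STREAM ... THROUGH clause. Passing `send` as a parameter keeps the record handling testable without a live service.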