Elegantly handle samples with insufficient data in workflow? - snakemake

I've set up a Snakemake pipeline for doing some simple QC and analysis on shallow shotgun metagenomics samples coming through our lab.
Some of the tools in the pipeline will fail or error when samples with low amounts of data are delivered as inputs -- but this is sometimes not knowable from the raw input data, as intermediate filtering steps (such as adapter trimming and host genome removal) can remove varying numbers of reads.
Ideally, I would like to be able to handle these cases with some sort of check on certain input rules, which could evaluate the number of reads in an input file and choose whether or not to continue with that portion of the workflow graph. Has anyone implemented something like this successfully?
Many thanks,
-jon

I'm not aware of a way to skip parts of the workflow based on a computation that happens inside the workflow itself. The set of rules to execute is determined from the final required output, and the run will fail if that final output cannot be generated.
One approach could be to catch the particular tool failure (a try ... except construct in a run section, or return-code handling in a shell section) and generate a dummy output file for the corresponding rule, then have the downstream rules "propagate" dummy-file generation by testing whether their input is such a dummy file.
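For the first approach, a minimal sketch could look like this (tool names, file layout and the zero-byte dummy convention are just placeholders):

rule filter_host:
    input:
        "trimmed/{sample}.fastq.gz"
    output:
        "filtered/{sample}.fastq.gz"
    # if the tool fails (e.g. too few reads survive), leave a zero-byte dummy
    # file instead of failing the whole workflow
    shell:
        "filter_tool {input} > {output} || true > {output}"

rule profile:
    input:
        "filtered/{sample}.fastq.gz"
    output:
        "profile/{sample}.txt"
    run:
        import os
        if os.path.getsize(input[0]) == 0:
            # propagate the dummy so the rest of the DAG still completes
            open(output[0], "w").close()
        else:
            shell("profiler {input} > {output}")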
Another approach could be to pre-process the data outside of your snakemake workflow to determine which inputs to skip, and then filter the wildcard combinations as described here: https://stackoverflow.com/a/41185568/1878788.

I've been trying to find a solution to this issue as well.
Thus far I think I've identified a few potential solutions but have yet to implement them correctly.
I use seqkit stats to quickly generate a txt file and use the num_seqs column to filter with. You can write a quick pandas function to return a list of files that pass your threshold, and I use config.yaml to pass the minimum read threshold:
import pandas as pd

def get_passing_fastq_files(wildcards):
    # read the seqkit stats table and keep the files above the read threshold
    qc = pd.read_table('fastq.stats.txt').fillna(0)
    passing = list(qc[qc['num_seqs'] > config['minReads']]['file'])
    return passing
Trying to implement that as an input function in Snakemake has been an esoteric nightmare to be honest. Probably my own lack of nuanced understanding of the Wildcards object.
I think a checkpoint is also necessary to force Snakemake to recompute the DAG after samples are filtered out. I haven't been able to connect all the dots yet, however, and I'm trying to avoid janky solutions that use token files etc.
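For what it's worth, here is a rough sketch of how the seqkit stats output, the pandas filter and a checkpoint might fit together (rule names, file layout and the sample-name parsing are assumptions; pandas and pathlib would be imported at the top of the Snakefile):

import pandas as pd
from pathlib import Path

checkpoint read_stats:
    input:
        expand("filtered/{sample}.fastq.gz", sample=SAMPLES)
    output:
        "fastq.stats.txt"
    shell:
        "seqkit stats -T {input} > {output}"

def passing_profiles(wildcards):
    # only evaluated after the checkpoint has produced the stats table,
    # so the DAG is recomputed with just the surviving samples
    stats = checkpoints.read_stats.get().output[0]
    qc = pd.read_table(stats).fillna(0)
    passing = qc[qc["num_seqs"] > config["minReads"]]["file"]
    samples = [Path(f).name.replace(".fastq.gz", "") for f in passing]
    return expand("profile/{sample}.txt", sample=samples)

rule all:
    input:
        passing_profiles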

Related

Is it possible to access SCIP's Statistics output values directly from PyScipOpt Model Object?

I'm using SCIP to solve MILPs in Python using PyScipOpt. After solving a problem, the solver statistics can be either 1) printed as a string using printStatistics(), or 2) saved to an external file using writeStatistics(). For example:
import pyscipopt as pso

model = pso.Model()
model.addVar(name="x", obj=1)   # a trivial model, just to have something to solve
model.optimize()
model.printStatistics()                      # print the full statistics to stdout
model.writeStatistics(filename="stats.txt")  # or write them to a file
There's a lot of information in printStatistics/writeStatistics that doesn't seem to be accessible from the Python model object directly (e.g. primal-dual integral value, data for individual branching rules or primal heuristics, etc.) It would be helpful to be able to extract the data from this output via, e.g., attributes of the model object or a dictionary.
Is there any way to access this information from the model object without having to parse the raw text/file output?
PySCIPOpt does not provide access to the statistics directly. The data for the various tables (e.g. separators, presolvers, etc.) are stored separately for every single plugin in SCIP and are sometimes not straightforward to collect.
If you are only interested in certain statistics about the general solving process, then you might want to add PySCIPOpt wrappers for a few of the simple get functions defined in scip_solvingstats.c.
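For what it's worth, a handful of the general solving statistics already have wrappers on the model object in recent PySCIPOpt versions; something like this is worth trying, though the exact set of available getters varies by version:

import pyscipopt as pso

model = pso.Model()
model.addVar(name="x", obj=1)
model.optimize()

print(model.getStatus())       # solution status, e.g. "optimal"
print(model.getObjVal())       # objective value of the best solution
print(model.getSolvingTime())  # solving time in seconds
print(model.getNNodes())       # number of branch-and-bound nodes
print(model.getGap())          # final relative primal-dual gap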
Lastly, you might want to check out IPET for parsing the statistics output.

Airflow: BigQueryOperator vs BigQuery Quotas and Limits

Is there any practical way to control quotas and limits on Airflow?
I'm especially interested in controlling BigQuery concurrency.
There are different levels of quotas on BigQuery. So, depending on the operator inputs, there should be a way to check whether the conditions are met and, if not, wait until they are.
It seems like this could be a composition of Sensors and Operators, querying against a database such as Redis, for example:
QuotaSensor(Project, Dataset, Table, Query) >> QuotaAddOperator(Project, Dataset, Table, Query)
QuotaAddOperator(Project, Dataset, Table, Query) >> BigQueryOperator(Project, Dataset, Table, Query)
BigQueryOperator(Project, Dataset, Table, Query) >> QuotaSubOperator(Project, Dataset, Table, Query)
The Sensor must check conditions like:
- Global running queries <= 300
- Project running queries <= 100
- .. etc
Is there any lib that already does that for me? A plugin perhaps?
Or any other easier solution?
Otherwise, following the Sensor-Operator approach: how can I encapsulate all of it under a single operator, to avoid code repetition -- say, a single QuotaBigQueryOperator?
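To make the idea concrete, the QuotaSensor part could be sketched roughly like this (Redis key names, thresholds and the Airflow 2.x import path are assumptions):

import redis
from airflow.sensors.base import BaseSensorOperator

class QuotaSensor(BaseSensorOperator):
    """Waits until the tracked number of running BigQuery jobs is below the limits."""

    def __init__(self, project, max_project_queries=100, max_global_queries=300, **kwargs):
        super().__init__(**kwargs)
        self.project = project
        self.max_project_queries = max_project_queries
        self.max_global_queries = max_global_queries

    def poke(self, context):
        r = redis.Redis(host="redis", port=6379)
        running_global = int(r.get("bq:running:global") or 0)
        running_project = int(r.get("bq:running:%s" % self.project) or 0)
        return (running_global < self.max_global_queries
                and running_project < self.max_project_queries)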
Currently, it is only possible to get the Compute Engine quotas programmatically. However, there is an open feature request to get/set other project quotas via API. You can post there about the specific case you would like to see implemented and follow the issue for updates.
Meanwhile, as a workaround, you can try the PythonOperator. With it you can define your own custom code and implement retries for the queries that hit a quotaExceeded error (or whatever specific error you are getting). That way you wouldn't have to check the quota levels explicitly; you just run the queries and retry until they get executed. This is simplified code for the strategy I have in mind:
for query in QUERIES_TO_RUN:
    while True:
        try:
            run(query)
        except quotaExceededException:
            continue  # retry the same query until it goes through
        break         # the query succeeded; move on to the next one
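A slightly more concrete version of that strategy, wired into a DAG with the PythonOperator (the queries are placeholders, the exception classes depend on how the quota error surfaces in the client you use, and the import path shown is for Airflow 2.x):

from datetime import datetime
import time

from airflow import DAG
from airflow.operators.python import PythonOperator
from google.api_core.exceptions import Forbidden, TooManyRequests
from google.cloud import bigquery

QUERIES_TO_RUN = ["SELECT 1", "SELECT 2"]  # placeholder queries

def run_queries_with_retry(**_):
    client = bigquery.Client()
    for sql in QUERIES_TO_RUN:
        while True:
            try:
                client.query(sql).result()  # blocks until the job finishes
            except (Forbidden, TooManyRequests):  # quota / rate-limit errors
                time.sleep(30)  # back off before retrying the same query
                continue
            break

with DAG("bq_quota_retry", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="run_queries", python_callable=run_queries_with_retry)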

Control data flow in Pentaho transformation with variables

I want to control data flow in Pentaho transformation files with system variables. I found a component called 'Simple evaluation' which is exactly what I want; however, it can only be used in job files.
I have gone through the transformation's component tree in Spoon but cannot find anything like 'Simple evaluation'.
Can anyone give me an idea of how to do this?
Thanks
IIRC you can't use variables in the Filter Rows step. That would probably be a worthy change request to raise on jira.pentaho.com.
So, simply use a "Get Variables" step to get the variable into the stream, and then use the Filter Rows step (or Switch/Case, depending on complexity).

Call RESTful service in Pig script

I'm working on a Pig script (my first) that loads a large text file. For each record in that text file, the content of one field needs to be sent off to a RESTful service for processing. Nothing needs to be evaluated or filtered: capture the data, send it off, and the script doesn't need anything back.
I'm assuming that a UDF is required for this kind of functionality, but I'm new enough to Pig that I don't have a clear picture of what type of function I should build. My best guess would be a Store Function since the data is ultimately getting stored somewhere, but I feel like the amount of guesswork involved in coming to that conclusion is higher than I'd like.
Any insight or guidance would be much appreciated.
Have you had a look at DBStorage, which does something similar?
everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
...
STORE ordered INTO RestStorage('https://...');  -- RestStorage would be a custom StoreFunc modeled on DBStorage
Having never found even a hint of an answer to this, I decided to move in a different direction. I'm using Pig to load and parse the large file, but then streaming each record that I care about to PHP for additional processing that Pig doesn't seem to have the capability to handle cleanly.
It's still not complete (read: there's a great big, very unhappy bug in the mix), but I think the concept is solid--just need to work out the implementation details.
everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
-- apply filter
-- apply filter
-- ...
-- apply last filter
ordered = ORDER filtered_categories BY category;
streamed = STREAM ordered THROUGH `php -nF process_categories.php`;
DUMP streamed;
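If PHP isn't a hard requirement, the streaming script itself could be sketched in Python along these lines (the endpoint URL and the one-field-per-line layout are assumptions):

# process_categories.py -- read records from Pig on stdin, post each one to the
# service, and echo the record back so the STREAM output stays intact
import sys
import urllib.request

ENDPOINT = "https://example.com/api/categories"  # placeholder URL

for line in sys.stdin:
    category = line.rstrip("\n").split("\t")[0]
    try:
        # fire-and-forget POST; nothing from the response is needed downstream
        urllib.request.urlopen(ENDPOINT, data=category.encode("utf-8"), timeout=10).read()
    except Exception as exc:
        print("failed to send %r: %s" % (category, exc), file=sys.stderr)
    print(line, end="")

The corresponding STREAM line would then be something like STREAM ordered THROUGH `python process_categories.py`;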

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always gives the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problems Hadoop is used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
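To illustrate the map-only step in Python terms, a minimal sketch with mrjob (a Python wrapper around Hadoop Streaming; the mapper logic is a placeholder, and each batch run would be pointed at its own dated output directory as described above):

from mrjob.job import MRJob
from mrjob.step import MRStep

class MapOnlyBatch(MRJob):
    """Map-only job: with no reducer, the mapper output is written out directly."""

    def mapper(self, _, line):
        # placeholder transformation of one input record
        yield line.split("\t")[0], 1

    def steps(self):
        # a single step with only a mapper, i.e. zero reducers
        return [MRStep(mapper=self.mapper)]

if __name__ == "__main__":
    MapOnlyBatch.run()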
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in HBase sorted by insertion date, if you record the timestamps of successful summary runs somewhere and open the filter only on inputs dated later than the last successful summary -- you'll save some significant scanning time.
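As a rough illustration of that trick in Python, assuming date-prefixed row keys and the happybase client (host, table and key names are made up):

import happybase

connection = happybase.Connection("hbase-thrift-host")
documents = connection.table("documents")

last_summary = b"20091101"  # read this from wherever you record successful runs

# only scan rows whose date-prefixed keys sort after the last successful summary
new_docs = 0
for row_key, data in documents.scan(row_start=last_summary):
    new_docs += 1  # in the real pipeline, feed these rows to the map step
print("documents added since the last summary:", new_docs)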
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability