Snakemake hangs after DAG is completed for a large workflow - snakemake

I have a relatively large workflow resulting from the combination of two wildcards. The first wildcard has 293 values and the second has 62 values. If I run the workflow for one of the values of the second wildcard, it runs fine. However, if I run it with all the values of the second wildcard, the workflow computes the DAG and then just hangs. If I run with --debug-dag, I see the output stops after:
wildcards:
file None:
Producer found, hence exceptions are ignored.
And then nothing happens, it just freezes.
For now I was able to work around this by running a bash for loop over each of the 62 values of the second wildcard. I also tried the --batch flag; the DAG computation was faster, but the command line still froze after the DAG was computed.
The problem is that I now need to aggregate over the 62 values of the second wildcard to produce the final output, so I can't run the workflow this way.
Any ideas on how to get this working?
Thank you,
Ilya.

Related

How do I parse with regular expressions only on filtered lines in CloudWatch Logs Insights?

Is there a way to restructure this CloudWatch Logs Insights query so that it runs faster?
fields @timestamp, @message
| filter @message like /NewProductRequest/
| parse @message /.*"productType":\s*"(?<productType>\w+)"/
| stats count(*) by productType
I am running it over a limited period (one day's worth of logs), and it is taking a very long time to run.
When I remove the parse command and just count(*) the filtered lines, there are only 2,500 matches out of 20,000,000 lines, and the query returns in a few seconds.
With the parse command, the query takes more than 15 minutes, and I can see the throughput drop from ~1 GB/s to ~2 MB/s.
Running a parse regexp on 2,500 filtered lines should be negligible: it takes less than 2 seconds if I download the filtered results to my MacBook and run the regexp in Python.
This leads me to believe that CloudWatch is running the parse command on every line in the log, and not just the filtered lines.
Is there a way to restructure my query so that the parse command runs after my filter command (effectively parsing 2,500 lines instead of 20 million)?
Removing the .* at the beginning of the expression increases performance. If you are only searching for a string that starts after any character sequence (.*), this solution will work for you. It does not help if the beginning of your regexp is anything other than .*
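For example, the query above with only the leading .* dropped from the parse expression (otherwise unchanged):

fields @timestamp, @message
| filter @message like /NewProductRequest/
| parse @message /"productType":\s*"(?<productType>\w+)"/
| stats count(*) by productType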

Snakemake 200000 job submission

I have 200,000 FASTA sequences. I am running GATK to call variants and have created a wildcard for every sequence. Now I would like to submit 200,000 jobs with Snakemake. Will this cause a problem for the cluster? Is there a way to submit jobs in sets of 10-20?
First off, it might take some time to calculate the DAG, but I have been told that DAG calculation has recently been greatly improved. Anyway, it might be wise to split the work into batches.
Most clusters won't allow you to submit more than X jobs at the same time, usually in the range of 100-1000. I believe the documentation is not fully correct here, but when using --cluster, the --jobs argument controls the number of submitted jobs at any one time, so with snakemake --jobs 20 --cluster "myclustercommand" you should be able to control this. Note that this controls the number of submitted jobs, not the number of running jobs; it may be that all your jobs just sit in the queue. It is probably best to check with your cluster administrator what the maximum number of submitted jobs is, and stay as close to that number as you can.
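For example, an invocation along these lines should keep at most 20 jobs submitted at a time (the qsub options here are placeholders for whatever your scheduler actually requires):

snakemake --jobs 20 --cluster "qsub -l walltime=02:00:00"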

PDI: Output only if no errors

I want to transform a CSV file into an XML file. The transformation also contains a small validation of the data, for example the length of a string must be < 50. So I have a Text file input step > Modified JavaScript step with two hops: one to an Abort step (the error-handling hop) and one to an XML Output step. My goal is to create the XML file only if no error occurs. At the moment it creates an XML file with 2 "rows" and then aborts, because row 3 of the CSV contains a very long string. I think it is a very simple scenario, but I have no idea how to solve it. Can someone please give me a tip?
Thanks a lot.
Marko
EDITED:
It seems your flow is indeed halting on strings longer than 50 characters if it is aborting midway. But since Pentaho processes rows in parallel, if the first row is valid and reaches the output step, the output will already start being written. What you want is to block that step until all rows have been processed by the prior step.
Simply add a "Blocking step" (do not confuse it with "Block this step until steps finish"; you want "Blocking step") before your output step. Remember to check the 'Pass all rows?' option in this step; this will effectively hold all the rows in the transformation right before the output.

Pentaho "Return value id can't be found in the input row"

I have a Pentaho transformation which is used to read a text file and check some conditions (which can produce errors, such as a number that should be positive). From these errors I'm creating an Excel file, and in my job I need the number of lines in this error file, plus a log of which lines had problems.
The problem is that sometimes I get the error "Return value id can't be found in the input row".
This error does not occur every time. The job runs every night, and it can work without any problems for a month or so, and then one fine day I just get this error.
I don't think this comes from the file, because if I execute the job again with the same file it works. I can't understand the reason for the failure: the message refers to the value "id", but I don't have such a value/column. Why is it searching for a value that doesn't exist?
Another strange thing is that the failing step should normally not be executed at all (as far as I know), because no errors were found, so no rows reach this step.
Maybe the problem is connected with the "Prioritize Stream" step? This is where I collect all the errors (which use exactly the same columns). I tried putting a sort before the grouping steps, but it didn't help. Now I'm thinking of trying a "Blocking step".
The problem is that I don't know why this happens and how to fix it. Any suggestions?
Check whether all your aggregates in the Group by step have a name.
However, sometimes the error comes from a previous step: the group (count...) requests data from the Prioritize Stream, and if that step has an error, the error gets mistakenly reported as coming from the group rather than from the Prioritize Stream.
Also, you mention a step which should not be executed because there is no data: I do not see any Filter which would prevent rows with a missing id from flowing from the Prioritize Stream to the count.
This is a bug. It happens randomly in one of my transformations that often ends up with an empty stream (no rows). It mostly works, but once in a while it gives this error. It seems to only fail when the stream is empty, though.

Elegantly handle samples with insufficient data in workflow?

I've set up a Snakemake pipeline for doing some simple QC and analysis on shallow shotgun metagenomics samples coming through our lab.
Some of the tools in the pipeline will fail or error when samples with low amounts of data are delivered as inputs -- but this is sometimes not knowable from the raw input data, as intermediate filtering steps (such as adapter trimming and host genome removal) can remove varying numbers of reads.
Ideally, I would like to be able to handle these cases with some sort of check on certain input rules, which could evaluate the number of reads in an input file and choose whether or not to continue with that portion of the workflow graph. Has anyone implemented something like this successfully?
Many thanks,
-jon
I'm not aware of a way to leave part of the workflow incomplete based on some computation happening inside the workflow. The rules to be executed are determined from the final required output, and failure will happen if this final output cannot be generated.
One approach could be to catch the particular tool failure (a try ... except construct in a run section, or return-code handling in a shell section) and generate a dummy output file for the corresponding rule, then have the downstream rules "propagate" dummy-file generation based on a test identifying the rule's input as such a dummy file.
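A minimal sketch of that dummy-file pattern (the tool names, file paths, and rule names here are hypothetical, not taken from the original workflow):

rule filter_sample:
    input:
        "trimmed/{sample}.fastq"
    output:
        "filtered/{sample}.fastq"
    run:
        import subprocess
        try:
            # Hypothetical tool that fails on samples with too few reads.
            subprocess.run(["some_tool", input[0], "-o", output[0]], check=True)
        except subprocess.CalledProcessError:
            # Tool failed: emit an empty dummy file so the workflow can continue.
            open(output[0], "w").close()

rule downstream:
    input:
        "filtered/{sample}.fastq"
    output:
        "results/{sample}.txt"
    run:
        import os
        if os.path.getsize(input[0]) == 0:
            # Propagate the dummy file instead of running the real analysis.
            open(output[0], "w").close()
        else:
            shell("another_tool {input} > {output}")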
Another approach could be to pre-process the data outside of your Snakemake workflow to determine which inputs to skip, and then filter the wildcard combinations as described here: https://stackoverflow.com/a/41185568/1878788.
I've been trying to find a solution to this issue as well.
Thus far I think I've identified a few potential solutions, but I have yet to implement them correctly.
I use seqkit stats to quickly generate a txt file and use the num_seqs column to filter with. You can write a quick pandas function to return a list of files which pass your threshold, and I use config.yaml to pass the minimum read threshold:
import pandas as pd

def get_passing_fastq_files(wildcards):
    # Read the seqkit stats table and keep files whose read count passes the threshold.
    qc = pd.read_table('fastq.stats.txt').fillna(0)
    passing = list(qc[qc['num_seqs'] > config['minReads']]['file'])
    return passing
Trying to implement that as an input function in Snakemake has been an esoteric nightmare, to be honest; probably that's my own lack of nuanced understanding of the Wildcards object.
I think the use of a checkpoint is also necessary to force Snakemake to recompute the DAG after filtering samples out. I haven't been able to connect all the dots yet, however, and I'm trying to avoid janky solutions that use token files etc.
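A rough sketch of how a checkpoint might tie this together (the seqkit invocation, file layout, and SAMPLES list are assumptions, not tested against the real pipeline):

import os
import pandas as pd

checkpoint fastq_stats:
    input:
        expand("fastq/{sample}.fastq.gz", sample=SAMPLES)
    output:
        "fastq.stats.txt"
    shell:
        "seqkit stats -T {input} > {output}"

def passing_results(wildcards):
    # Only evaluated after the checkpoint has finished, so the stats file exists.
    stats = checkpoints.fastq_stats.get().output[0]
    qc = pd.read_table(stats).fillna(0)
    keep = qc[qc['num_seqs'] > config['minReads']]['file']
    # Hypothetical mapping from file path back to sample name.
    samples = [os.path.basename(f).replace(".fastq.gz", "") for f in keep]
    return expand("results/{sample}.txt", sample=samples)

rule aggregate:
    input:
        passing_results
    output:
        "results/summary.txt"
    shell:
        "cat {input} > {output}"

Only samples whose num_seqs passes the threshold end up as inputs to the aggregate rule, so the per-sample rules for the remaining samples are left out of the recomputed DAG.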