snakemake dry run for a single wildcard in order of execution

Is it possible to do a dry run for snakemake for a single wildcard, in the order of execution?
When I call a dry run, I get the following at the bottom:
Job counts:
count jobs
1 all
1 assembly_eval
5 cat_fastq
1 createGenLogDir
5 createLogDir
5 flye
5 medaka_first
5 medaka_second
5 minimap_first
5 quast_medaka_first
5 quast_medaka_second
5 quast_racon_first
5 racon_first
5 symLinkFQ
58
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
So I guess it would be useful to:
Get the dry-run commands for a single wildcard (except for the aggregate rules, obviously). After all, the only thing that differs among the commands of any of those rules is the wildcard in the input, output and params directives.
Get the workflow printed in the order of execution, for easier visualisation.
I did not find a suitable option using snakemake -h. I'm looking for something like what --rulegraph does compared to --dag, which is to avoid redundancy.
If there is no solution to this, or if the solution is too cumbersome, I guess I will suggest this as an enhancement on their GitHub page.

Here are some possible solutions:
You can specify a target file with the specific wildcard you want, e.g. snakemake -nq output_wc1.txt
If your wildcards are stored in a list/dataframe, limit to just the first. I frequently do this while developing, e.g. chroms = range(1,2) # was range(1, 23)
If you have a single job for each rule and dependencies are simple (A -> B -> C), the jobs should be listed in order of execution. This is not true when your workflow has concurrent or branching rules.
Have you also checked --filegraph and --summary?
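To illustrate the point about linear versus branching workflows, here is a minimal sketch using Python's standard graphlib module. The rule names are taken from the job counts above, but the dependency edges between them are assumptions made for illustration only, since the actual Snakefile is not shown. A linear chain has exactly one valid order, while a branching graph has several, so no single listing can be "the" order of execution.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency edges (rule -> rules it depends on).
# The real edges depend on the Snakefile, which the question does not show.
LINEAR = {
    "flye": set(),
    "minimap_first": {"flye"},
    "racon_first": {"minimap_first"},
}
BRANCHING = {
    "flye": set(),
    "racon_first": {"flye"},
    "quast_racon_first": {"racon_first"},  # branch 1
    "medaka_first": {"racon_first"},       # branch 2
}


def execution_order(deps):
    """Return one valid execution order (dependencies first)."""
    return list(TopologicalSorter(deps).static_order())


def is_valid_order(order, deps):
    """Check that every rule appears after all of its dependencies."""
    position = {rule: idx for idx, rule in enumerate(order)}
    return all(
        position[dep] < position[rule]
        for rule, rule_deps in deps.items()
        for dep in rule_deps
    )


if __name__ == "__main__":
    print(execution_order(LINEAR))
    print(execution_order(BRANCHING))
```

For the branching graph, both "quast_racon_first before medaka_first" and the reverse pass is_valid_order, which is why a dry run cannot promise a single order there.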

Can Snakemake parallelize the same rule both within and across nodes?

I have a somewhat basic question about Snakemake parallelization when using cluster execution: can jobs from the same rule be parallelized both within a node and across multiple nodes at the same time?
Let's say for example that I have 100 bwa mem jobs and my cluster has nodes with 40 cores each. Could I run 4 bwa mem per node, each using 10 threads, and then have Snakemake submit 25 separate jobs? Essentially, I want to parallelize both within and across nodes for the same rule.
Here is my current snakefile:
SAMPLES, = glob_wildcards("fastqs/{id}.1.fq.gz")
print(SAMPLES)

rule all:
    input:
        expand("results/{sample}.bam", sample=SAMPLES)

rule bwa:
    resources:
        time="4:00:00",
        partition="short-40core"
    input:
        ref="/path/to/reference/genome.fa",
        fwd="fastqs/{sample}.1.fq.gz",
        rev="fastqs/{sample}.2.fq.gz"
    output:
        bam="results/{sample}.bam"
    log:
        "results/logs/bwa/{sample}.log"
    params:
        threads=10
    shell:
        "bwa mem -t {params.threads} {input.ref} {input.fwd} {input.rev} 2> {log} | samtools view -bS - > {output.bam}"
I've run this with the following command:
snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25
With this setup, I get 25 jobs submitted, each to a different node. However, only one bwa mem process (using 10 threads) is run per node.
Is there some straightforward way to modify this so that I could get 4 different bwa mem jobs (each using 10 threads) to run on each node?
Thanks!
Dave
Edit 07/28/22:
In addition to Troy's suggestion below, I found a straightforward way of accomplishing what I was trying to do by simply following the job grouping documentation.
Specifically, I did the following when executing my Snakemake pipeline:
snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25 --groups bwa=group0 --group-components group0=4 --rerun-incomplete --cores 40
By specifying a group ("group0") for the bwa rule and setting "--group-components group0=4", I was able to group the jobs such that 4 bwa runs are occurring on each node.
You can try job grouping but note that resources are typically summed together when submitting group jobs like this. Usually that's not what is desired, but in your case it seems to be correct.
Instead you can make a group job with another rule that does the grouping for you in batches of 4.
rule bwa_mem:
    group: 'bwa_batch'
    output: '{sample}.bam'
    ...

def bwa_mem_batch_input(wildcards):
    # for wildcards.i, pick 4 bwa_mem outputs to put in this group
    i = int(wildcards.i)  # wildcard values are strings
    return expand('{sample}.bam', sample=SAMPLES[i*4:i*4+4])

rule bwa_mem_batch:
    input: bwa_mem_batch_input
    output: touch('flag_{i}')  # could be temp too
    group: 'bwa_batch'
The consuming rule must request flag_{i} for every batch index i from 0 to ceil(len(SAMPLES)/4) - 1. With cluster integration, each Slurm job gets 1 bwa_mem_batch job and 4 bwa_mem jobs with resources for a single bwa_mem job. This is useful for batching together multiple jobs to increase the runtime.
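The batch bookkeeping can be checked in plain Python outside Snakemake. This is only a sketch: the helper names (batch_indices, batch_members, flag_files) are made up for illustration, and the batch size of 4 comes from the example above. It rounds the number of batches up so that a final, smaller batch is not dropped when the sample count is not a multiple of 4.

```python
import math


def batch_indices(n_samples, batch_size=4):
    """Batch indices 0 .. ceil(n_samples / batch_size) - 1."""
    return range(math.ceil(n_samples / batch_size))


def batch_members(samples, i, batch_size=4):
    """Samples that belong to batch i (the last batch may be shorter)."""
    return samples[i * batch_size:(i + 1) * batch_size]


def flag_files(samples, batch_size=4):
    """Flag files a consuming rule would have to request."""
    return [f"flag_{i}" for i in batch_indices(len(samples), batch_size)]


if __name__ == "__main__":
    samples = [f"sample{n}" for n in range(10)]  # hypothetical sample names
    for i in batch_indices(len(samples)):
        print(f"flag_{i}", batch_members(samples, i))
```

In a Snakefile, the list returned by flag_files(SAMPLES) is what the consuming rule (for example rule all) would list as input.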
As a final point, this may do what you want, but I don't think it will help you get around QOS or other job quotas. You are using the same amount of CPU hours either way. You may be waiting in the queue longer because the scheduler can't find 40 threads to give you at once, where it could have given you a few 10 thread jobs. Instead, consider refining your resource values to get better efficiency, which may get your jobs run earlier.

Reading and handling many small CSV-s to concatenate one large Dataframe

I have two folders, each containing about 8,000 small CSV files: one with an aggregated size of around 2GB and the other with an aggregated size of around 200GB.
The files are stored like this so that they are easier to update on a daily basis. However, when I conduct EDA, I would like them to be assigned to a single variable. For example:
import os
import pandas as pd
path = "some random path"
df = pd.concat([pd.read_csv(os.path.join(path, name)) for name in os.listdir(path)])
Reading the 2GB dataset takes much less time on my local machine than on the supercomputer cluster. Reading the 200GB dataset on the local machine is impossible without some sort of scaled pandas solution, and the situation does not seem to improve on the cluster even with popular open-source tools like Dask and Modin.
Is there an efficient way to read these CSV files in this situation?
Q: "Is there an efficient way to read these CSV files ... ?"
A: Oh, sure, there is:
The CSV format ( standardisation was attempted in RFC 4180 ) is not unambiguous and is not obeyed under all circumstances ( commas inside fields, header present or not ), so some caution and care are needed here. Since you are your own data curator, you should be able to decide on plausible steps for handling your own data properly.
So, the as-is state is :
# in <_folder_1_>
:::::::: # 8000 CSV-files ~ 2GB in total
||||||||||||||||||||||||||||||||||||||||||| # 8000 CSV-files ~ 200GB in total
# in <_folder_2_>
Speaking of efficiency, the O/S coreutils are the most stable, proven and efficient tools ( as system tools have always been ) for the phase of merging the content of thousands and thousands of plain CSV files:
###################### if need be,
###################### use an in-place remove of all CSV-file headers first :
for F in *.csv; do sed -i '1d' "$F"; done
This helps in case we cannot avoid headers on the CSV-exporter side. It works like this:
(base):~$ cat ?.csv
HEADER
1
2
3
HEADER
4
5
6
HEADER
7
8
9
(base):~$ for i in $( ls ?.csv ); do sed -i '1d' $i; done
(base):~$ cat ?.csv
1
2
3
4
5
6
7
8
9
Now, the merging phase :
###################### join
cat *.csv > ___all_CSVs_JOINED.csv
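Where the coreutils are not available, or where one header row should survive in the merged file, the same two steps (drop headers, concatenate) can be sketched with the Python standard library. This is a rough equivalent and not a replacement for the shell commands above. It assumes that every file has the same single-line header and ends with a newline, and the function name merge_csvs is made up for illustration. Data is streamed file by file, so memory use does not grow with the total size, and the source files are left untouched.

```python
import shutil
from pathlib import Path


def merge_csvs(folder, out_path, has_header=True):
    """Concatenate folder/*.csv into out_path, keeping only the first header.

    Assumes each header is exactly one physical line and each file ends
    with a newline (otherwise adjacent rows would run together, as with cat).
    Returns the number of files merged.
    """
    out_path = Path(out_path)
    # Build the file list before opening the output, and never merge the
    # output file into itself if it already exists in the same folder.
    files = sorted(
        p for p in Path(folder).glob("*.csv")
        if p.resolve() != out_path.resolve()
    )
    with out_path.open("wb") as out:
        for idx, path in enumerate(files):
            with path.open("rb") as src:
                if has_header:
                    header = src.readline()
                    if idx == 0:
                        out.write(header)  # keep one header row
                shutil.copyfileobj(src, out)  # stream the remaining bytes
    return len(files)
```

A call such as merge_csvs("<_folder_1_>", "<_folder_1_>/___all_CSVs_JOINED.csv") would then produce one file per folder.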
Given the nature of the said file storage policy, performance can be boosted by using more processes that handle the small files and the large files independently and separately, as defined above, with the logic put inside a pair of conversion_script_?.sh shell scripts:
parallel --jobs 2 conversion_script_{1}.sh ::: $( seq -f "%1g" 1 2 )
The header removal is a "just"-[CONCURRENT] flow of processing, whereas the merge itself is pure-[SERIAL]. For a larger number of files it might become interesting to use a multi-staged tree of trees ( several stages of [SERIAL] collections of [CONCURRENT]-ly pre-processed leaves ). Yet for just 8000 files, and not knowing the actual file-system details, the latency masking gained from processing both directories concurrently and independently will be fine to start with.
Last but not least, the final pair of ___all_CSVs_JOINED.csv files can safely be opened in a way that prevents moving all disk-stored data into RAM at once ( using a chunk-size-fused file-reading iterator, and avoiding RAM spillovers by using memory-mapped mode, as a context manager ):
with pandas.read_csv( "<_folder_1_>//___all_CSVs_JOINED.csv",
                      sep        = NoDefault.no_default,
                      delimiter  = None,
                      ...
                      chunksize  = SAFE_CHUNK_SIZE,
                      ...
                      memory_map = True,
                      ...
                      ) \
     as df_reader_MMAPer_CtxMGR:
    ...
When tweaking for ultimate performance, details matter and depend on physical hardware bottlenecks ( disk-I/O-wise, filesystem-wise, RAM-I/O-wise ), so due care may yield further improvements in minimising the repeatedly performed end-to-end processing times. Sometimes it even pays to turn the data into a compressed/zipped form, in cases where CPU/RAM resources permit a sufficient performance advantage over limited disk-I/O throughput: moving fewer bytes is so much faster that the CPU/RAM decompression costs are still lower than the cost of moving 200+ [GB] of uncompressed plain-text data.
Details matter: tweak options, benchmark, tweak options, benchmark, tweak options, benchmark.
It would be nice to post your progress on testing the performance:
end-2-end duration of strategy ... [s] AS-IS now
end-2-end duration of strategy ... [s] with parallel --jobs 2 ...
end-2-end duration of strategy ... [s] with parallel --jobs 4 ...
end-2-end duration of strategy ... [s] with parallel --jobs N ... + compression ...
Keep us posted.

GitLab API - get the overall # of lines of code

I'm able to get the stats (additions, deletions, total) for each commit, however how can I get the overall #?
For example, if one MR has 30 commits, I need the net # of lines of code added\deleted which you can see in the top corner.
This # IS NOT the sum of all #'s per commit.
So, I would need an API that returns the net # of lines of code added\removed at MR level (no matter how many commits are).
For example, if I have 2 commits: 1st one adds 10 lines, and the 2nd one removes the exact same 10 lines, then the net # is 0.
Here is the scenario:
I have an MR with 30 commits.
GitLab API provides support to get the stats (lines of code added\deleted) per Commit (individually).
If I go to the MR \ Changes in the GitLab UI, I see a # of lines added\deleted that is not the SUM of all the commit stats that I'm getting through the API.
That's my issue.
A simpler example: let's say I have 2 commits. The 1st adds 10 lines of code, while the 2nd removes the exact same 10 lines of code. Using the API, I'm getting the sum, which is 20 LOCs changed (10 added and 10 deleted). However, if I go to the GitLab UI \ Changes, it shows 0 (zero), which is correct; that's the net # of changes overall. This is the inconsistency I noticed.
To do this for an MR, you would use the MR changes API and count the occurrences of lines starting with + and - in the changes[].diff fields to get the additions and deletions respectively.
Using bash with gitlab-org/gitlab-runner!3195 as an example:
GITLAB_HOST="https://gitlab.com"
PROJECT_ID="250833"
MR_ID="3195"
URL="${GITLAB_HOST}/api/v4/projects/${PROJECT_ID}/merge_requests/${MR_ID}/changes"
DIFF=$(curl -s "${URL}" | jq -r ".changes[].diff")
# grep -c counts matching lines directly and reports 0 when there are none
NUM_ADDITIONS=$(grep -cE "^\+" <<< "$DIFF")
NUM_DELETIONS=$(grep -cE "^-" <<< "$DIFF")
echo "${MR_ID} has ${NUM_ADDITIONS} additions and ${NUM_DELETIONS} deletions"
The output is
3195 has 9 additions and 2 deletions
This matches the UI, which also shows 9 additions and 2 deletions.
As you can see, this is a representative example of the scenario you describe, since the individual commits in this MR add up to 13 additions and 6 deletions.
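The counting step can also be sketched in Python, which may be convenient if the API response is already being handled in a script. This mirrors the grep logic above and makes the same assumption: the changes[].diff strings contain only hunks, without "+++" or "---" file header lines. The function names are made up for illustration, and fetching the response (with curl, requests or similar) is left out.

```python
import json


def diffs_from_response(body):
    """Extract the changes[].diff strings from the JSON body of the MR changes API."""
    return [change["diff"] for change in json.loads(body)["changes"]]


def count_diff_lines(diffs):
    """Count lines starting with '+' (additions) and '-' (deletions).

    Assumes the diffs contain hunks only, with no '+++'/'---' file headers,
    exactly like the grep commands above.
    """
    additions = deletions = 0
    for diff in diffs:
        for line in diff.splitlines():
            if line.startswith("+"):
                additions += 1
            elif line.startswith("-"):
                deletions += 1
    return additions, deletions
```

With body holding the response text, count_diff_lines(diffs_from_response(body)) returns the pair (additions, deletions) for the whole MR.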

Is there a way to get a nice error report summary when running many jobs on DRMAA cluster?

I need to run a snakemake pipeline with more than 2000 jobs in total on a DRMAA cluster. When some of the jobs have failed, I would like to receive an easily readable summary report at the end, listing only the failed jobs instead of the whole job summary given in the log.
Is there a way to achieve this without parsing the log file by myself?
These are the (incomplete) cluster options:
jobs: 200
latency-wait: 5
keep-going: True
rerun-incomplete: True
restart-times: 2
I am not sure if there is another way than parsing the log file yourself, but I've done it several times with grep and I am happy with the results:
cat .snakemake/log/[TIME].snakemake.log | grep -B 3 -A 3 error
Of course, you should replace the [TIME] placeholder with the timestamp of whichever run you want to check.
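If the grep output becomes hard to read with thousands of jobs, the same idea can be turned into a small Python summary. This is a sketch that assumes the log marks each failure with a line of the form "Error in rule <name>:" followed by indented detail lines (jobid, output, and so on). Check this against your own log first, as the exact format may differ between Snakemake versions. The function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed header line of a failure block in the Snakemake log.
ERROR_HEADER = re.compile(r"^Error in rule (?P<rule>\S+):\s*$")


def failed_jobs(log_text):
    """Collect failure blocks as dicts with the rule name and its detail lines."""
    failures = []
    current = None
    for line in log_text.splitlines():
        match = ERROR_HEADER.match(line)
        if match:
            current = {"rule": match.group("rule"), "details": []}
            failures.append(current)
        elif current is not None and line.startswith((" ", "\t")) and line.strip():
            current["details"].append(line.strip())  # indented detail line
        else:
            current = None  # blank or unindented line ends the block
    return failures


def summarize(log_text):
    """Count failure blocks per rule."""
    return Counter(failure["rule"] for failure in failed_jobs(log_text))
```

Note that with restart-times set, a job that fails on several attempts would show up once per attempt in this count.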

File I/O in gnu parallel

I have a program that takes a single argument. I am using gnu parallel to perform parameter sweeps on this argument. Each run generates a single result, and I want to append all results into a single file, say Results.txt.
What would be a correct way to do this?
I should not have each instance open the file and write to it, as this could create conflicts and also mess up the order of results. The only way I can think of doing this is to have each run write its output to a file with a unique name and then, when gnu parallel finishes running, merge the results into a single file using a script.
Is there a simpler way of achieving this?
What happens when multiple instances write to/read from the same file? Does gnu parallel create multiple copies, one for each instance, as it does for stdout and stderr?
thanks
If your command sends the result to stdout (standard output) the solution is trivial:
seq 1000 | parallel echo > Results.txt
GNU Parallel guarantees the output will not be mixed.
Normally GNU Parallel prints the output of a job as soon as it's completed. When jobs run for different amounts of time, their output can therefore appear in a different order than the input, although each job's output still stays in one piece.
To keep the output in order, simply add -k / --keep-order parameter.
Try for example:
parallel -j4 sleep {}\; echo {} ::: 2 1 4 3
parallel -j4 -k sleep {}\; echo {} ::: 2 1 4 3
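The difference between the two commands can be mimicked in Python, which may help to see what -k changes. This is only an analogy using concurrent.futures and says nothing about how GNU Parallel is implemented. The sleep times are scaled down by a factor of 10, and the function names are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def job(seconds):
    """Stand-in for 'sleep {}; echo {}' (sleep scaled down by 10)."""
    time.sleep(seconds / 10)
    return seconds


def run_keep_order(values, workers=4):
    """Like 'parallel -k': results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, values))


def run_as_completed(values, workers=4):
    """Like plain 'parallel': results come back as jobs finish."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(job, value) for value in values]
        return [future.result() for future in as_completed(futures)]


if __name__ == "__main__":
    print(run_as_completed([2, 1, 4, 3]))  # order depends on job duration
    print(run_keep_order([2, 1, 4, 3]))    # same order as the input
```

In both variants a single process collects the results, which is also why redirecting the output of parallel to Results.txt avoids conflicting writes.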