--immediate-submit {dependencies} string contains script paths, not job IDs? - snakemake

I'm trying to use the --immediate-submit on a PBSPro cluster. I tried using an in-place modification of the dependencies string to adapt it to PBSPro, similar to what is done here.
snakemake --cluster "qsub -l wd -l mem={cluster.mem}GB -l ncpus={threads} -e {cluster.stderr} -q {cluster.queue} -l walltime={cluster.walltime} -o {cluster.stdout} -S /bin/bash -W $(echo '{dependencies}' | sed 's/^/depend=afterok:/g' | sed 's/ /:/g')"
This last part gets converted into, for example:
-W depend=afterok: /g/data1a/va1/dk0741/analysis/2018-03-25_marmo_test/.snakemake/tmp.cyrhf51c/snakejob.trimmomatic_pe.7.sh
There are two problems here:
How can I get the dependencies string to output job ID instead of the script path? The qsub command normally outputs the job ID to stdout, so I'm not sure why it's not doing so here.
How do I get rid of the space after afterok:? I've tried everything!
As an aside, it would be helpful if there were some option to debug the submission or not to delete the tmp.cyrhf51c directory in .snakemake -- is there some way to do this?
Thanks,
David

I suggest to use a profile for this, instead of trying to find an ad-hoc solution. This will also help with debugging. E.g., there is already a pbs-torque profile available (https://github.com/Snakemake-Profiles/pbs-torque), probably there is not much to change towards pbspro?

Related

Is it possible to print commands instead of rules in snakemake dry run?

Dry runs are a super important functionality of workflow languages. What I am looking at is mostly what would be executed if I run the command and this is exactly what one see when running make -n.
However analogical functionality snakemake -n prints something like
Building DAG of jobs...
rule produce_output:
output: my_output
jobid: 0
wildcards: var=something
Job counts:
count jobs
1 produce_output
1
The log contains kind of everything else than commands that get executed. Is there a way how to get command from snakemake?
snakemake -p --quiet -n
-p for print shell commands
-n for dry run
--quiet for removing the rest
EDIT 2019-Jan
This solution seems broken for lasts versions of snakemake
snakemake -p -n
Avoid the --quiet reported in the #eric-c answer, at least in some situations the combination on -p -n -q does not print the command executed without -n.

Run RapSearch-Program with Torque PBS and qsub

My problem is that I have a cluster-server with Torque PBS and want to use it to run a sequence-comparison with the program rapsearch.
The normal RapSearch command is:
./rapsearch -q protein.fasta -d database -o output -e 0.001 -v 10 -x t -z 32
Now I want to run it with 2 nodes on the cluster-server.
I've tried with: echo "./rapsearch -q protein.fasta -d database -o output -e 0.001 -v 10 -x t -z 32" | qsub -l nodes=2 but nothing happened.
Do you have any suggestions? Where I'm wrong? Help please.
Standard output (and error output) files are placed in your home directory by default; take a look. You are looking for a file named STDIN.e[numbers], it will contain the error message.
However, I see that you're using ./rapsearch but are not really being explicit about what directory you're in. Your problem is therefore probably a matter of changing directory into the directory that you submitted from. When your terminal is in the directory of the rapsearch executable, try echo "cd \$PBS_O_WORKDIR && ./rapsearch [arguments]" | qsub [arguments] to submit your job to the cluster.
Other tips:
You could add rapsearch to your path if you use it often. Then you can use it like a regular command anywhere. It's a matter of adding the line export PATH=/full/path/to/rapsearch/bin:$PATH to your .bashrc file.
Create a submission script for use with qsub. Here is a good example.

No such file or directory from sh script

Looking for the origin of this error message:
Processing: +([^_]).flv
date: +([^_]).flv: No such file or directory
I started getting this at some point in the last few months (can't say when as I wasn't logging my cron output. I know, I know!).
When I originally wrote this, it worked ok for at least two months. I'm wondering if there was an sh update that broke it?
The script runs via crontab and gets all .flv files in the current directory without an underscore and processes each one. It then checks the modified date for files that have been created in the last 24 hours and runs the yamdi meta tag injector for .flv files.
It seems like it's not recognizing the pattern as a pattern and looking for it as an actual file to me. If I run this script from an ssh shell it works ok, it's only when running via cron that it gives this error.
shopt -s extglob
now=$(date +"%s")
for f in +([^_]).flv; do
echo "Processing: $f"
age=$(date -r "$f" +"%s")
calc=$(((now-age) / 60 / 60))
if(( calc < 24 )); then
echo "$f age=$calc"
yamdi -i "$f" -o "$f".seek
rm "$f"
cp "$f".seek "$f"
touch -d #$age "$f"
fi
done
This is most likely a problem of the wrong shell being used; make sure your script's first line represents the right shell:
#!/bin/bash
for bash, or whatever shell you wrote this for. You might want to check your environment variables that cron may set (that's a very common problem -- one assumes everything is set up correctly, but the environment that cron offers to scripts it executes is different).

Remote rsync in parallel

I'm trying to run rsync over ssh in parallel to transfer files between two machines for evaluation purposes. I wanna see how faster can I get compared to a single rsync process.
I tried these two solutions:
https://wiki.ncsa.illinois.edu/display/~wglick/Parallel+Rsync but with no great success.
https://gist.github.com/rcoup/5358786 (I couldn't make it work)
Based on the first link I run a command like this:
ssh HOST "mkdir -p ~/destdir/basefolder"
cd ./basefolder; ls | xargs -n1 -P 4 -I% rsync -arvuz -e ssh % HOST:~/destdir/basefolder/.
and I get the files transfered, but it doesn't seem to work well... In this case, It will run a process for every file and folder in the basefolder, but when it finds a folder, it will transfer everything inside that folder using only 1 process.
I tried to use find -type f, but I got problems because I loose the file hierarchy.
Does anyone how some methods to do what I want? (Use rsync in parallel over ssh while keeping files and folders hierarchy).
Since you tagged your question 'gnu-parallel' the obvious is to refer you to http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallelizing-rsync
cd src-dir; find . -type f -size +100000 | parallel -v ssh fooserver mkdir -p /dest-dir/{//}\;rsync -Havessh {} fooserver:/dest-dir/{}

How to detect code change frequency?

I am working on a program written by several folks with largely varying skill level. There are files in there that have never changed (and probably never will, as we're afraid to touch them) and others that are changing constantly.
I wonder, are there any tools out there that would look at the entire repo history (git) and produce analysis on how frequently a given file changes? Or package? Or project?
It would be of value to recognize that (for example) we spent 25% of our time working on a set of packages, which would be indicative or code's fragility, as compared with code that "just works".
If you're looking for an OS solution, I'd probably consider starting with gitstats and look at extending it by grabbing file logs and aggregating that data.
I'd have a look at NChurn:
NChurn is a utility that helps asses the churn level of your files in
your repository. Churn can help you detect which files are changed the
most in their life time. This helps identify potential bug hives, and
improper design.The best thing to do is to plug NChurn into your build
process and store history of each run. Then, you can plot the
evolution of your repository's churn.
I wrote something that we use to visualize this information successfully.
https://github.com/bcarlso/defect-density-heatmap
Take a look at the project and you can see what the output looks like in the readme.
You can do what you need by first getting a list of files that have changed in each commit from Git.
~ $ git log --pretty="format:" --name-only | grep -v ^$ > file-changes.txt
~ $ for i in `cat file-changes.txt | cut -d"." -f1,2 | uniq`; do num=`cat file-changes.txt | grep $i | wc -l`; if (( $num > 1 )); then echo $num,0,$i; fi; done | heatmap > results.html
This will give you a tag cloud with files that churn more will show up larger.
I suggest using a command like
git log --follow -p file
That will give you all the changes that happened to the file in the history (including renames). If you want to get the number of commits that changed the file then you can do on a UNIX-based OS :
git log --follow --format=oneline Gemfile | wc -l
You can then create a bash script to apply this to multiple files with the name aside.
Hope it helped !
Building on a previous answer I suggest the following script to parse all project files
#!/bin/sh
cd $1
find . -path ./.git -prune -o -name "*" -exec sh -c 'git log --follow --format=oneline $1 | wc -l | awk "{ print \$1,\"\\t\",\"$1\" }" ' {} {} \; | sort -nr
cd ..
If you call the script as file_churn.sh you can parse your git project directory calling
> ./file_churn.sh project_dir
Hope it helps.