How to find the directory where the current script lies - scripting

When writing a script, I'd like to know the directory where the current script lives, in order to locate other files.
With a regular Scala script I know how to do this, but with an Ammonite script I don't.

It's been a long time since the question was asked, but in current Ammonite you can get the script's directory like this:
val src: String = sourcecode.File()           // absolute path of the current script file
val rootDir: os.Path = os.Path(src) / os.up   // its parent directory
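For example, to read a file that sits next to the script (config.txt is a hypothetical sibling file here):
val config: String = os.read(rootDir / "config.txt")  // resolved against the script's directory, not the CWD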

Instead of the standard bang line I normally use:
#!/usr/bin/env amm
I changed my script to:
#!/bin/bash
exec amm "$0" "$(dirname "$0")" "$@"
!#
@main
def main(dir: String): Unit = {
  print(dir)
}
The dir argument receives the path where the script lies. It can be absolute or relative.
If we always desire an absolute path:
#!/bin/bash
exec amm "$0" "$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" "$@"
!#

Related

Nextflow: publishDir, output channels, and output subdirectories

I've been trying to learn how to use Nextflow and have come across an issue with adding output to a channel, as I need the processes to run in a set order. I want to pass output files from one of the output subdirectories created by the tool (ONT-Guppy) into a channel, but can't seem to figure out how.
Here is the nextflow process in question:
process GupcallBases {

    publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"

    executor = 'pbspro'
    clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"

    output:
    path "*.bam" into bams_ch

    script:
    """
    module load cuda/11.4.2
    singularity exec --nv $params.Gup_container \
        guppy_basecaller --config $params.P1_gupConf \
        --device "cuda:0" \
        --bam_out \
        --recursive \
        --compress \
        --align_ref $params.refGen \
        -i $params.P1_inDir \
        -s $params.P1_outDir \
        --gpu_runners_per_device $params.P1_GPU_runners \
        --num_callers $params.P1_callers
    """
}
The output of the process is something like this:
$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)
I only want to keep the BAM files in $params.P1_outDir/pass/, hence trying to use pattern: "pass/*.bam", but I've tried a few other patterns to no avail.
That output syntax was chosen because, once this process is done, the following channel works:
// Channel
// .fromPath("${params.P1_outDir}/pass/*.bam")
// .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
// .set { bams_ch }
But the problem is that if I don't pass the files into the output channel of the first process, the processes run in parallel. I could simply be missing something in the extensive documentation about how to order processes, which would be an alternative solution.
Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.
Thanks in advance.
Nextflow processes are designed to run isolated from each other, but this can be circumvented somewhat when the command-line input and/or outputs are specified using params. Using params like this can be problematic because if, for example, a params variable specifies an absolute path but your output declaration expects files in the Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.
The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
Also, it's best to avoid calling Singularity manually in your script block. Instead, just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:
params.publishDir = './results'

input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )

process GuppyBasecaller {

    publishDir(
        path: "${params.publishDir}/GuppyBasecaller",
        mode: 'copy',
        saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
    )

    beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
    container '/path/to/guppy_basecaller.img'

    input:
    path input_dir
    path guppy_config
    path ref_genome

    output:
    path "outdir/pass/*.bam" into bams_ch

    """
    mkdir outdir
    guppy_basecaller \\
        --config "${guppy_config}" \\
        --device "cuda:0" \\
        --bam_out \\
        --recursive \\
        --compress \\
        --align_ref "${ref_genome}" \\
        -i "${input_dir}" \\
        -s outdir \\
        --gpu_runners_per_device "${params.guppy_gpu_runners}" \\
        --num_callers "${params.guppy_callers}"
    """
}
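To get the ordering the question asks about, a downstream process only needs to read from bams_ch; Nextflow will then wait for GuppyBasecaller to complete before starting it. A minimal DSL1 sketch (the process name and the samtools command are placeholders, not part of the original pipeline):
process IndexBams {

    input:
    path bam from bams_ch.flatten()

    """
    samtools index "${bam}"
    """
}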

Is there a way to set parameters for the Java VM when using a Snakemake wrapper?

When using tools like Picard or fgbio through Snakemake wrappers, I keep running into out-of-memory issues. At the moment I resort to direct shell calls, which allow me to set the VM's memory. I would prefer to pass these parameters to the wrapped tools. Is there a way, maybe through the resources directive, to pass something like mem_mb=10000? I tried, but have not gotten it to work yet.
I have never used the wrapper directive, but looking for example at markduplicates/wrapper.py, the shell command is picard MarkDuplicates {snakemake.params} .... So maybe using the params slot works:
rule markdups:
    input:
        'in.bam',
    output:
        bam= 'out.bam',
        metrics= 'metrics.tmp',
    params:
        mem= "-Xmx4g",
    wrapper:
        "0.31.0/bio/picard/markduplicates"
picard should understand that -Xmx... is a java parameter.
According to the wrapper sources (https://bitbucket.org/snakemake/snakemake-wrappers/src/bd3178f4b82b1856370bb48c8bdbb1932ace6a19/bio/picard/markduplicates/wrapper.py?at=master&fileviewer=file-view-default), it uses this command line:
from snakemake.shell import shell

shell("picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
      "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
      "&> {snakemake.log}")
So you could pass arbitrary options via the params: section.
If you check the picard executable script's source:
cat `which picard`
You will find:
...
pass_args=""
for arg in "$@"; do
    case $arg in
        '-D'*)
            jvm_prop_opts="$jvm_prop_opts $arg"
            ;;
        '-XX'*)
            jvm_prop_opts="$jvm_prop_opts $arg"
            ;;
        '-Xm'*)
            jvm_mem_opts="$jvm_mem_opts $arg"
            ;;
        *)
            if [[ ${pass_args} == '' ]] # needed to avoid preceding space on first arg e.g. ' MarkDuplicates'
            then
                pass_args="$arg"
            else
                pass_args="$pass_args \"$arg\"" # quotes later arguments to avoid problems with ()s in MarkDuplicates regex arg
            fi
            ;;
    esac
done
...
So I assume this should work:
rule markdups:
    input:
        "in.bam",
    output:
        bam = "out.bam",
        metrics = "metrics.tmp",
    params:
        "-Xmx10000m"
    wrapper:
        "0.31.0/bio/picard/markduplicates"

Issue with a modification of youtube-dl in .zshrc

The code I have in my .zshrc is:
ytdcd () { # youtube-dl that automatically puts stuff in a specific folder and returns to the former working directory after.
    cd ~/youtube/new/ && {
        youtube-dl "$@"
        cd - > /dev/null
    }
}

ytd() { # so far, this function can only take one page, so I can only send one youtube video code per line; will modify it to accept multiple lines.
    for i in $*;
    do
        params=" $params https://youtu.be/$i"
    done
    ytdcd -f 18 $params
}
So, on the command line (terminal), when I enter ytd DFreHo3UCD0, I would like the video at https://youtu.be/DFreHo3UCD0 to be downloaded. The problem is that when I enter the command in succession, the system just tries to download the video from the previous command and rightly claims the download is complete.
For example, entering:
> ytd DFreHo3UCD0
> ytd L3my9luehfU
would not attempt to download the video for L3my9luehfU but only the video for DFreHo3UCD0 twice.
First -- there's no point in returning to the old directory in ytdcd: you can change to a new directory only inside a subshell, and then exec youtube-dl to replace that subshell with the application process:
This has fewer things to go wrong: Aborting the function's execution can't leave things in the wrong directory, because the parent shell (the one you're interactively using) never changed directories in the first place.
ytdcd () {
    (cd ~/youtube/new/ && exec youtube-dl "$@")
}
Second -- use an array when building argument lists, not a string.
If you use set -x to log its execution, you'll see that your original command runs something like:
ytdcd -f 18 'https://youtu.be/one https://youtu.be/two https://youtu.be/three'
See those quotes? That's because $params is a string, passed as a single argument, not an array. (In bash -- or another shell following POSIX rules -- an unquoted string expansion would be string-split and glob-expanded, but zsh doesn't follow POSIX rules).
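A quick way to see the difference in an interactive zsh (a minimal demo):
params="a b c"
printf '<%s>\n' $params      # one argument in zsh: <a b c>
urls=(a b c)
printf '<%s>\n' "${urls[@]}" # three separate arguments: <a> <b> <c>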
The following builds up an array of separate arguments and passes them individually:
ytd() {
    local -a params=( )
    local i
    for i; do
        params+=( "https://youtu.be/$i" )
    done
    ytdcd -f 18 "${params[@]}"
}
Finally, it's come up that you don't actually intend to pass all the URLs to just one youtube-dl instance. To run a separate instance per URL, use:
ytd() {
    local i retval=0
    for i; do
        ytdcd -f 18 "$i" || retval=$?
    done
    return "$retval"
}
Note here that we're capturing non-success exit status, so as not to hide an error in any ytdcd instance other than the last (which would otherwise occur).
I would declare params as local, so that you are not appending URLs after URLs...
You can try to add this awesome function to your .zshrc:
funfun() {
    local _fun1="$_fun1 fun1!"
    _fun2="$_fun2 fun2!"
    echo "1 says: $_fun1"
    echo "2 says: $_fun2"
}
To observe the thing ;)
EDIT (Explanation):
When you source a shell script, you add it to your current environment; that is why you can run the functions you define. So, when those functions use variables, by default those variables are global and accessible from anywhere in your environment! Therefore, in this case params is defined globally for the whole length of your shell session. Since you want to allow downloading several videos at once, you are appending values to this global variable, which keeps growing.
Enforcing local tells zsh to limit the scope of params to the function only.
Another solution is to reset the variable when you call the function.
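A sketch of that reset approach, keeping the rest of the original function unchanged (the array version in the earlier answer is still the more robust fix):
ytd() {
    params=""   # reset the global before reusing it
    for i in $*; do
        params=" $params https://youtu.be/$i"
    done
    ytdcd -f 18 $params
}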

Cronjob does not execute command line in perl script

I am unfamiliar with Linux and the Linux environment, so do pardon me if I make any mistakes; do comment to clarify.
I have created a simple Perl script. This script creates an SQL file and, as shown, executes the lines in the file to insert rows into the database.
#!/usr/bin/perl
use strict;
use warnings;
use POSIX 'strftime';

my $SQL_COMMAND;
my $HOST     = "i";
my $USERNAME = "need";
my $PASSWORD = "help";
my $NOW_TIMESTAMP = strftime '%Y-%m-%d_%H-%M-%S', localtime;

open my $out_fh, '>>', "$NOW_TIMESTAMP.sql" or die 'Unable to create sql file';
printf {$out_fh} "INSERT INTO BOL_LOCK.test(name) VALUES ('wow');";

sub insert()
{
    my $SQL_COMMAND = "mysql -u $USERNAME -p'$PASSWORD' ";
    while ( my $sql_file = glob '*.sql' )
    {
        my $status = system( "$SQL_COMMAND < $sql_file" );
        if ( $status == 0 )
        {
            print "pass";
        }
        else
        {
            print "fail";
        }
    }
}

insert();
This works if I execute it while I am logged in as a user (I do not have access to an admin account). However, when I set a cronjob to run this file, say at 10.08 am, using this line (in crontab -e):
08 10 * * * perl /opt/lampp/htdocs/otpms/Data_Tsunami/scripts/test.pl > /dev/null 2>&1
I know the script is being executed, as the sql file is created, but no new rows are inserted into the database after 10.08 am. I've searched for solutions and some have suggested using the DBI module, but it's not available on the server.
Edit: Didn't manage to solve it in the end. A root/admin account was used to execute the script, so that "solved" the problem.
First things first, get rid of the > /dev/null 2>&1 at the end of your crontab entry (at least temporarily) so you can actually see any errors that may be occurring.
In other words, change it temporarily to something like:
08 10 * * * perl /opt/lampp/htdocs/otpms/Data_Tsunami/scripts/test.pl >/tmp/myfile 2>&1
Then you can examine the /tmp/myfile file to see what's being output.
The most likely case is that mysql is not actually on the path in your cron job, because cron itself gives a rather minimal environment.
To fix that problem (assuming that's what it is), see this answer, which gives some guidelines on how best to expand the cron environment to give you what you need. That will probably just involve adding the MySQL executable directory to your PATH variable.
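For example, at the top of the crontab (the MySQL directory here is a guess; run `which mysql` in a login shell to find the real one):
PATH=/usr/local/mysql/bin:/usr/bin:/bin
08 10 * * * perl /opt/lampp/htdocs/otpms/Data_Tsunami/scripts/test.pl >/tmp/myfile 2>&1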
The other thing you may want to consider is closing the out_fh file before trying to pass it to mysql - if the buffers haven't been flushed, it may still be an empty file as far as other processes are concerned.
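In this script that would mean closing the handle before calling insert(), e.g.:
close $out_fh or die "Unable to close sql file: $!";
insert();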
The expression glob(".* *") matches all files in the current working directory. -- http://perldoc.perl.org/functions/glob.html
You should not rely on the working directory in a cron job. If you want to use a glob (or any file operation) with a relative path, set the working directory with chdir first.
source: http://www.perlmonks.org/bare/?node_id=395387
So if your working directory is, for example, /home/user, you should insert
chdir('/home/user/');
before the while loop, i.e.:
sub insert()
{
    my $SQL_COMMAND = "mysql -u $USERNAME -p'$PASSWORD' ";
    chdir('/home/user/');
    while ( my $sql_file = glob '*.sql' )
    {
    ...
Replace /home/user with wherever your sql files are being created.
It's better to do as much processing within Perl as possible. It avoids the overhead of spawning a separate shell process and leaves everything under the control of the program, so that you can handle any errors much more simply.
Database access from Perl is done using the DBI module. This program demonstrates how to achieve, with DBI, what you have written using the mysql utility. As you can see, it's also much more concise:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $host     = "i";
my $username = "need";
my $password = "help";

my $dbh = DBI->connect("DBI:mysql:database=test;host=$host", $username, $password);

my $insert = $dbh->prepare('INSERT INTO BOL_LOCK.test(name) VALUES (?)');
my $rv = $insert->execute('wow');

print $rv ? "pass\n" : "fail\n";
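One small hardening step worth considering (same connection details, just an extra attribute hash): enable DBI's RaiseError so a failed connect or execute dies with a message instead of quietly returning false:
my $dbh = DBI->connect("DBI:mysql:database=test;host=$host",
                       $username, $password, { RaiseError => 1 });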

Bash $PATH is caching every modification

How do I clear the cache of $PATH in Bash? Every time I modify $PATH, the former modifications are conserved too! My $PATH is already a page long :-), and it bothers me while working, because it points to some wrong places (every modification is appended to the end of the $PATH variable). Please help me solve this problem.
because every modification is being appended in the end of the $PATH variable
Take a close look at where you are setting $PATH, I bet it looks something like this:
PATH="$PATH:/some/new/dir:/another/newdir:"
Having $PATH in the new assignment gives you the appending behavior you don't want.
Instead do this:
PATH="/some/new/dir:/another/newdir:"
Update
If you want to strip $PATH of all duplicate entries but still maintain the original order then you can do this:
PATH=$(awk 'BEGIN{ORS=":";RS="[:\n]"}!a[$0]++' <<<"${PATH%:}")
PATH=$(echo $PATH | tr ':' '\n' | sort | uniq | tr '\n' ':')
Once in a while, execute the above command. It will tidy up your PATH variable by removing any duplication.
PS: Warning: this will reorder the paths in the PATH variable and can have undesired effects!
When I'm setting my PATH, I usually use this script -- which I last modified in 1999, it seems, but which I use daily on all my Unix-based computers. It allows me to add to my PATH (or LD_LIBRARY_PATH, or CDPATH, or any other path-like variable), eliminate duplicates, and trim out now-unwanted values.
Usage
export PATH=$(clnpath /important/bin:$PATH:/new/bin /old/bin:/debris/bin)
The first argument is the new path, built by any technique you like. The second argument is a list of names to remove from the path (if they appear - no error if they don't). For example, I have up to about five versions of the software I work on installed at any given time. To switch between versions, I use this script to adjust both PATH and LD_LIBRARY_PATH to pick up the correct values for the version I'm about to start using, and remove the values of the version I'm no longer using.
Code
: "#(#)$Id: clnpath.sh,v 1.6 1999/06/08 23:34:07 jleffler Exp $"
#
# Print minimal version of $PATH, possibly removing some items
case $# in
0) chop=""; path=${PATH:?};;
1) chop=""; path=$1;;
2) chop=$2; path=$1;;
*) echo "Usage: `basename $0 .sh` [$PATH [remove:list]]" >&2
exit 1;;
esac
# Beware of the quotes in the assignment to chop!
echo "$path" |
${AWK:-awk} -F: '#
BEGIN { # Sort out which path components to omit
chop="'"$chop"'";
if (chop != "") nr = split(chop, remove); else nr = 0;
for (i = 1; i <= nr; i++)
omit[remove[i]] = 1;
}
{
for (i = 1; i <= NF; i++)
{
x=$i;
if (x == "") x = ".";
if (omit[x] == 0 && path[x]++ == 0)
{
output = output pad x;
pad = ":";
}
}
print output;
}'
Commentary
The ':' is an ancient way of using /bin/sh (originally the Bourne shell - now as often Bash) to run the script. If I updated it, the first line would become a shebang. I'd also not use tabs in the code. And there are ways to get the 'chop' value set that do not involve as many quotes:
awk -F: '...script...' chop="$chop"
But it isn't broken, so I haven't fixed it.
When adding entries to PATH, you should check to see if they're already there. Here's what I use in my .bashrc:
pathadd() {
    if [ -d "$1" ] && [[ ":$PATH:" != *":$1:"* ]]; then
        PATH="$PATH:$1"
    fi
}

pathadd /usr/local/bin
pathadd /usr/local/sbin
pathadd ~/bin
This only adds directories to PATH if they exist (i.e. no bogus entries) and aren't already there. Note: the pattern matching feature I use to see if the entry is already in PATH is only available in bash, not the original Bourne shell; if you want to use this with /bin/sh, that part'd need to be rewritten.
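For reference, a POSIX-sh variant of the same idea would use case instead of the bash-only [[ ... ]] pattern match (a sketch):
pathadd() {
    if [ -d "$1" ]; then
        case ":$PATH:" in
            *":$1:"*) ;;              # already present, do nothing
            *) PATH="$PATH:$1" ;;
        esac
    fi
}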
I have a nice set of scripts that add path entries to the beginning or end of PATH depending on the ordering I want. The problem is that OS X puts /usr/local/bin after /usr/bin, which is exactly NOT what I want (being a brew user and all). So I put a new copy of /usr/local/bin in front of everything else and use the following to remove all duplicates (leaving the ordering in place):
MYPATH=$(echo $MYPATH|perl -F: -lape'$_=join":",grep!$s{$_}++,@F')
I found this on PerlMonks. Like most Perl, it looks like line noise to me, so I have no idea how it works, but work it does!
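For the curious, the same one-liner spelled out with longer names (same behavior): -F: autosplits each line on ':' into @F, grep keeps each directory only the first time it is seen, and join reassembles the path:
echo "$MYPATH" | perl -F: -lape '$_ = join ":", grep { !$seen{$_}++ } @F'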