tabular configuration snakemake - pandas

I am trying to build a Snakemake pipeline where the files to process and their parameters are defined in a separate table.
In my minimal example the folder structure would look like this.
├── data
│   ├── name1
│   ├── name2
│   ├── name3
│   └── name4
├── output
├── samples.tsv
└── Snakefile
I would like to copy a varying number of lines from the files "name1", "name2", and "name3" (not "name4") to the output folder. The amounts are defined in samples.tsv.
fileName parameter
name1 4
name2 6
name3 7
I guess the basic structure of the Snakefile should look a bit like this
import pandas as pd
samples = pd.read_table("samples.tsv").set_index("fileName", drop=False)

rule copy:
    input:
        "data/{<file name X>}"
    output:
        "output/{<file name X>}_cut"
    shell:
        "head -n {parameter of X} {input} > {output} "
But I have no idea how to parse the information from "samples" to the rule.

To expand Dimitry's answer, you probably want to add to your rule a lambda function that links each filename to its parameter(s) given in the sample sheet. E.g.:
import pandas as pd
samples = pd.read_table("samples.tsv").set_index("fileName", drop=False)

rule all:
    input:
        expand('output/{fileName}_cut', fileName= samples.fileName),

rule copy:
    input:
        "data/{fileName}"
    output:
        "output/{fileName}_cut"
    params:
        n= lambda wc: samples[samples.fileName == wc.fileName].parameter.iloc[0],
    shell:
        "head -n {params.n} {input} > {output}"

Add the rule all as the first rule in your file:
rule all:
    input: expand("output/{filenameX}_cut", filenameX=samples.fileName)
This rule would collect all required targets as the goal of your pipeline, then Snakemake would automatically assign an instance of the rule copy for each individual file.
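To see how the params lambda from the first answer resolves a value, here is a minimal sketch in plain Python with no Snakemake or pandas involved. The inline table, `load_parameters`, and `lines_to_keep` are hypothetical stand-ins; the only behaviour assumed from Snakemake is that the function receives an object whose attributes are the matched wildcards.

```python
import csv
import io
from types import SimpleNamespace

# Hypothetical stand-in for samples.tsv (tab-separated, as pd.read_table expects).
SAMPLES_TSV = "fileName\tparameter\nname1\t4\nname2\t6\nname3\t7\n"


def load_parameters(handle):
    """Map each fileName to its integer parameter."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["fileName"]: int(row["parameter"]) for row in reader}


parameters = load_parameters(io.StringIO(SAMPLES_TSV))


def lines_to_keep(wildcards):
    """Play the role of the params lambda: wildcards in, value out."""
    return parameters[wildcards.fileName]


# SimpleNamespace imitates the wildcards object that Snakemake would pass in.
for name in parameters:
    wc = SimpleNamespace(fileName=name)
    print(f"output/{name}_cut <- head -n {lines_to_keep(wc)} data/{name}")
```

Because "name4" is not in the table, no target is requested for it and the copy rule never runs on that file. With the DataFrame indexed by fileName as in the Snakefile, the same lookup could probably also be written as samples.loc[wc.fileName, "parameter"], assuming the file names are unique.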


Defining multiple cases for an Ansible variable based on multiple conditions

I have this variable here, set in a .yaml variables file
patch_plan: 'foo-{{ patch_plan_week_and_day }}-bar'
I want my patch_plan_week_and_day variable to be set dynamically, based on role and environment which are 2 other variables set elsewhere (doesn't matter now) outside this variables file.
For instance, I will explain 3 cases:
If role = 'master' and environment = 'srvb' then patch_plan_week_and_day = 'Week1_Monday' and thus the end result of patch_plan = 'foo-Week1_Monday-bar'.
If role != 'master' and environment = 'srvb' then patch_plan_week_and_day = 'Week1_Tuesday' and thus the end result of patch_plan = 'foo-Week1_Tuesday-bar'
If role = 'slave' and environment = 'pro' then patch_plan_week_and_day = 'Week3_Wednesday' and thus the end result of patch_plan = 'foo-Week3_Wednesday-bar'
This is the idea of the code:
patch_plan: 'foo-{{ patch_plan_week_and_day }}-bar'
# Patch Plans
## I want something like this:
# case 1
patch_plan_week_and_day: Week1_Monday
when: role == 'master' and environment == 'srvb'
# case 2
patch_plan_week_and_day: Week1_Tuesday
when: role != 'master' and environment == 'srvb'
# case 3
patch_plan_week_and_day: Week3_Wednesday
when: role == 'slave' and environment == 'pro'
I have 14 cases in total.
Put the logic into a dictionary. For example,
patch_plan_week_and_day_dict:
  srvb:
    master: Week1_Monday
    default: Week1_Tuesday
  pro:
    slave: Week3_Wednesday
    default: WeekX_Wednesday
Create the project for testing
shell> tree .
.
├── ansible.cfg
├── hosts
├── pb.yml
└── roles
    ├── master
    │   ├── defaults
    │   │   └── main.yml
    │   └── tasks
    │       └── main.yml
    ├── non_master
    │   ├── defaults
    │   │   └── main.yml
    │   └── tasks
    │       └── main.yml
    └── slave
        ├── defaults
        │   └── main.yml
        └── tasks
            └── main.yml

10 directories, 9 files
shell> cat ansible.cfg
[defaults]
gathering = explicit
inventory = $PWD/hosts
roles_path = $PWD/roles
retry_files_enabled = false
stdout_callback = yaml
shell> cat hosts
localhost
shell> cat pb.yml
- hosts: localhost
  vars:
    patch_plan_week_and_day_dict:
      srvb:
        master: Week1_Monday
        default: Week1_Tuesday
      pro:
        slave: Week3_Wednesday
        default: WeekX_Wednesday
  roles:
    - "{{ my_role }}"
The code of all roles is identical
shell> cat roles/master/defaults/main.yml
patch_plan_role: "{{ (my_role in patch_plan_week_and_day_dict[env].keys()|list)|
                     ternary(my_role, 'default') }}"
patch_plan_week_and_day: "{{ patch_plan_week_and_day_dict[env][patch_plan_role] }}"
shell> cat roles/master/tasks/main.yml
- debug:
    var: patch_plan_week_and_day
Example 1.
shell> ansible-playbook pb.yml -e env=srvb -e my_role=master
...
patch_plan_week_and_day: Week1_Monday
Example 2.
shell> ansible-playbook pb.yml -e env=srvb -e my_role=non_master
...
patch_plan_week_and_day: Week1_Tuesday
Example 3.
shell> ansible-playbook pb.yml -e env=pro -e my_role=slave
...
patch_plan_week_and_day: Week3_Wednesday
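The lookup that the role defaults perform (use the role's own entry if the environment defines one, otherwise fall back to 'default') can be sketched in Python. This only illustrates the logic and is not Ansible code; the function name `patch_plan` is hypothetical, and the data mirrors patch_plan_week_and_day_dict from pb.yml.

```python
# Same data as patch_plan_week_and_day_dict in pb.yml.
PATCH_PLAN_WEEK_AND_DAY = {
    "srvb": {"master": "Week1_Monday", "default": "Week1_Tuesday"},
    "pro": {"slave": "Week3_Wednesday", "default": "WeekX_Wednesday"},
}


def patch_plan(env, role):
    """Use the role's own entry if present, else the environment's default."""
    by_role = PATCH_PLAN_WEEK_AND_DAY[env]
    key = role if role in by_role else "default"
    return f"foo-{by_role[key]}-bar"


print(patch_plan("srvb", "master"))      # foo-Week1_Monday-bar
print(patch_plan("srvb", "non_master"))  # foo-Week1_Tuesday-bar
print(patch_plan("pro", "slave"))        # foo-Week3_Wednesday-bar
```

Keeping the 14 cases as data means that adding a case is a one-line change to the dictionary rather than another condition.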
A lot of considerations here ...
It seems you are trying to use Ansible as a programming language, which it isn't. You've started to implement something without describing your use case or what the actual problem is. The given example looks like an anti-pattern.
... set dynamically, based on role and environment ...
It is in fact "static" and based on the properties of the systems. You only try to generate the values at runtime. Timeslots when patches can or should be applied (Patch Window) are facts about the system and usually configured within the Configuration Management Database (CMDB). So this kind of information should be already there, either in a database or within the Ansible inventory or as a Custom fact on the system itself.
... which are 2 other variables set elsewhere (doesn't matter now) outside this variables file. ...
Probably it does matter and maybe you could configure the Patch Cycle or Patch Window there.
By pursuing your approach further you'll mix up Playbook Logic with Infrastructure Description or Configuration Properties, which quickly leads to less readable and probably unmaintainable code. You'll deny yourself the opportunity to maintain the system configuration within a Version Control System (VCS), CMDB or the inventory.
Therefore avoid CASE, SWITCH and IF THEN ELSE ELSEIF structures and describe the desired state of your systems instead.
Some Further Readings
In addition to the sources already given.
Best Practices - Content Organization
General tips
At last, this is what fixed it, thank you everyone
patch_plan: 'foo-{{ patch_plan_week_and_day[environment][role] }}-bar'
patch_plan_week_and_day:
  srvb:
    master: Week1_Monday
    slave: Week1_Tuesday
  pre:
    master: Week1_Sunday
    slave: Week1_Friday
  pro:
    master: Week1_Thursday
    slave: Week1_Wednesday

gzip command block returns "Too many levels of symbolic links"

Trying to perform a fairly simple gzip command across my fastq files, but a strange error returns.
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.gzip = "sequences/sequences_split/sequences_trimmed/trimmed*fastq"
workflow {
    gzip_ch = Channel.fromPath(params.gzip)
    GZIP(gzip_ch)
    GZIP.out.view()
}

process GZIP {
    input:
    path read

    output:
    stdout

    script:
    """
    gzip ${read}
    """
}
Error:
Command error:
gzip: trimmed_SRR19573319_R2.fastq: Too many levels of symbolic links
I tried running a loop in the script instead, and running gzip on individual files, both of which work, but I would rather use the Nextflow syntax.
By default, Nextflow will try to stage process input files using symbolic links. The problem is that gzip actually ignores symbolic links. From the GZIP(1) man page:
The gzip command will only attempt to compress regular files. In particular, it will ignore symbolic links.
If the objective is to create a reproducible workflow, it's usually best to avoid modifying the workflow inputs directly anyway. One option is to use the stageInMode directive to change how the input files are staged in. For example:
process GZIP {
    stageInMode 'copy'

    input:
    path fastq

    output:
    path "${fastq}.gz"

    """
    gzip "${fastq}"
    """
}
Or, preferably, just modify the command to redirect stdout to a file:
process GZIP {
    input:
    path fastq

    output:
    path "${fastq}.gz"

    """
    gzip -c "${fastq}" > "${fastq}.gz"
    """
}
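Why the redirect works even though the input is a symbolic link can be shown with a small Python sketch. It reads through a symlink and writes the compressed bytes to a new file, which is roughly what gzip -c with a redirect does. The file names and the helper `compress_to_new_file` are hypothetical, and the symlink only imitates how Nextflow stages inputs by default.

```python
import gzip
import os
import shutil
import tempfile


def compress_to_new_file(source, destination):
    """Read source (following symlinks) and write a gzip copy to destination."""
    with open(source, "rb") as src, gzip.open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst)


CONTENT = "@read1\nACGT\n+\nIIII\n"

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "reads.fastq")  # hypothetical input file
    staged = os.path.join(tmp, "staged.fastq")   # symlink standing in for a staged input
    with open(original, "w") as handle:
        handle.write(CONTENT)
    os.symlink(original, staged)

    compress_to_new_file(staged, staged + ".gz")

    # The link and the original are untouched; only a new .gz file was created.
    still_link = os.path.islink(staged)
    original_kept = os.path.exists(original)
    with gzip.open(staged + ".gz", "rt") as handle:
        roundtrip = handle.read()

print(still_link, original_kept, roundtrip == CONTENT)
```

Nothing here replaces or removes the input, so the workflow inputs stay unmodified, which is the property the redirect variant of the process is after.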
Michael!
I can't reproduce your issue. I created the folders in my current directory as you described and created three files in them, as you can see below:
➜ ~ tree sequences/
sequences/
└── sequences_split
    └── sequences_trimmed
        ├── trimmed_a_fastq
        ├── trimmed_b_fastq
        └── trimmed_c_fastq
Then I copy-pasted your Nextflow script file; the only change I made was to use gzip -f ${read} instead of gzip ${read}. Then everything worked fine. You need -f because Nextflow runs every task in its own subfolder within work. This means your input files are symbolic links, and gzip will complain that they're not regular files (that happened here on macOS Ventura) or something similar (it may depend on the OS; I'm not sure). The -f option solves this issue.
N E X T F L O W ~ version 22.10.1
Launching `ex2.nf` [golden_goldstine] DSL2 - revision: 70559e4bcb
executor > local (3)
[ad/447348] process > GZIP (1) [100%] 3 of 3 ✔
➜ ~ tree work
work
├── 0c
│   └── ded66d5f2e56cfa38d85d9c86e4e87
│       └── trimmed_a_fastq.gz
├── 67
│   └── 949c28cce5ed578e9baae7be2d8cb7
│       └── trimmed_c_fastq.gz
└── ad
    └── 44734845950f28f658226852ca4200
        └── trimmed_b_fastq.gz
They're gzip compressed files (even though they may look just like text files, depending on the demo content). I decided to reply with an answer because it allows me to use markdown to show you how I did it. Feel free to comment on this answer if you want to discuss this topic.

Nested folders not appearing due to ownership permissions

I have been trying to get my permissions working for my jellyfin server.
I have a folder on my second hard drive (auto-mounting on start, formatted as exFAT):
jellyfin/
├── Cache
├── Config
...
└── Media
    ├── movies
    │   └── Batman
    │       └── ...
    ├── music
    ├── photos
    └── shows
When I cd into Media > movies, I cannot see any of the folders inside the movies folder, even though I can in a file viewer. It just appears empty.
I tried fixing this by doing
chown -R 1000:1000 jellyfin
since my Jellyfin Docker container runs as 1000:1000.
But the problem persists: ls -l still returns 0 folders.
Any advice is appreciated

snakemake script access to stdin / stdout for stream processing

For a Snakemake workflow, I need to manipulate tags in many BAM files, and would like to process these by piping them through a script (using the Snakemake script: directive). The specific way I'm doing this is with pysam stream processing.
import pysam

# "-" makes pysam read from stdin and write to stdout
infile = pysam.AlignmentFile("-", "rb")
outfile = pysam.AlignmentFile("-", "wb", template=infile)
for s in infile:
    (flowcell, lane) = s.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    s.set_tag('RG',rg_id,'Z')
    outfile.write(s)
This script works well standalone, but I haven't been able to figure out how to integrate it via the snakemake script directive.
I prefer this way to minimize IO and RAM usage.
Edit: Resorted to direct loading to fix the RG tag.
import pysam

# parameters passed from snakemake
bam_file = snakemake.input[0]
fixed_bam_file = snakemake.output[0]
bamfile = pysam.AlignmentFile(bam_file, "rb")
fixed_bamfile = pysam.AlignmentFile(fixed_bam_file, "wb", template = bamfile)
for i, read in enumerate(bamfile.fetch()):
    (flowcell, lane) = read.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    read.set_tag('RG',rg_id,'Z')
    fixed_bamfile.write(read)
    if not (i % 100000):
        print("Updated the read group for {} reads to {}".format(i, rg_id))
bamfile.close()
fixed_bamfile.close()
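The read-group rule used in both scripts can be isolated and checked without pysam or a BAM file. The sketch below assumes, as the code above does, that the first two colon-separated fields of the read name are the flowcell and the lane; the function name `read_group_id` and the example read name are hypothetical.

```python
def read_group_id(query_name):
    """Join the first two colon-separated fields (flowcell, lane) with a dot."""
    flowcell, lane = query_name.split(":")[0:2]
    return ".".join([flowcell, lane])


# Hypothetical read name; the field layout is an assumption.
print(read_group_id("FCX:4:1101:1000:2000"))  # FCX.4
```

A read name with fewer than two fields raises a ValueError at the tuple unpacking, so malformed names fail loudly rather than getting a wrong tag.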
EDIT: Snakemake's run: and shell: directives set the workdir: directory, while the script: directive operates relative to the directory where the Snakefile was executed (keeping everything nice and tidy). Hence the problem of putting a stream processor under script:.
Using shell instead of script directive:
rule all:
    input:
        expand('{sample}_edited.bam', sample=['a', 'b', 'c'])

rule somename:
    input:
        '{sample}.bam'
    output:
        '{sample}_edited.bam'
    shell:
        '''
        cat {input} | python edit_bam.py > {output}
        '''
@Krischan it seems you have found a solution already; if so, it may be good to post it as an answer.
Alternatively, you can use the object {workflow} to get the directory of the Snakefile and from there construct the path to your python script. If your directory structure is:
./
├── Snakefile
├── data
│   └── sample.bam
└── scripts
    └── edit_bam.py
The Snakefile may look like:
rule all:
    input:
        'test.tmp',

rule one:
    input:
        'sample.bam',
    output:
        'test.tmp',
    shell:
        r"""
        cat {input} \
        | {workflow.basedir}/scripts/edit_bam.py > {output}
        """
Executed with snakemake -d data ...
It seems the workflow object is not documented but check this thread Any way to get the full path of the Snakefile within the Snakefile?

How to list active sub-modules in a Maven project?

I have a complex project where there are many directories that have POM files, but only some of which are sub-modules (possibly transitively) of a particular parent project.
Obviously, Maven knows the list of relevant files because it parses all the <module> tags to find them. But, I only see a list of the <name>s in the [INFO] comments, not the paths to those modules.
Is there a way to have Maven output a list of all the POM files that provided references to projects that are part of the reactor build for a given project?
This is quite simple but it only gets the artifactId, from the root (or parent) module:
mvn --also-make dependency:tree | grep maven-dependency-plugin | awk '{ print $(NF-1) }'
If you want the directories
mvn -q --also-make exec:exec -Dexec.executable="pwd"
The following command prints artifactId's of all sub-modules:
mvn -Dexec.executable='echo' -Dexec.args='${project.artifactId}' exec:exec -q
Example output:
build-tools
aws-sdk-java-pom
core
annotations
utils
http-client-spi
http-client-tests
http-clients
apache-client
test-utils
sdk-core
...
mvn help:evaluate -Dexpression=project.modules
mvn help:evaluate -Dexpression=project.modules[0]
mvn help:evaluate -Dexpression=project.modules[1]
IFS=$'\n'
modules=($(mvn help:evaluate -Dexpression=project.modules | grep -v "^\[" | grep -v "<\/*strings>" | sed 's/<\/*string>//g' | sed 's/[[:space:]]//'))
for module in "${modules[@]}"
do
    echo "$module"
done
Here's a way to do this on Linux outside of Maven, by using strace.
$ strace -o opens.txt -f -e open mvn dependency:tree > /dev/null
$ perl -lne 'print $1 if /"(.*pom\.xml)"/' opens.txt
The first line runs mvn dependency:tree under strace, asking strace to output to the file opens.txt all the calls to the open(2) system call, following any forks (because Java is threaded). This file looks something like:
9690 open("/etc/ld.so.cache", O_RDONLY) = 3
9690 open("/lib/libncurses.so.5", O_RDONLY) = 3
9690 open("/lib/libdl.so.2", O_RDONLY) = 3
The second line asks Perl to print any text inside quotes that happens to end in pom.xml. (The -l flag handles printing newlines, the -n flag wraps the code in a loop that simply reads any files given on the command line, and -e supplies the script itself, which uses a regex to find interesting calls to open.)
It'd be nice to have a maven-native way of doing this :-)
The solution I found is quite simple:
mvn -B -f "$pom_file" org.codehaus.mojo:exec-maven-plugin:1.4.0:exec \
-Dexec.executable=/usr/bin/echo \
-Dexec.args='${basedir}/pom.xml'| \
grep -v '\['
This is a little bit tricky due to the need to grep out the [INFO|WARNING|ERROR] lines and make it usable for scripting but saved me a lot of time since you can put any expression there.
This gets the exact module names, not the artifact IDs. The result is suitable for mvn -pl.
mvn help:evaluate -Dexpression=project.modules -q -DforceStdout | tail -n +2 | head -n -1 | sed 's/\s*<.*>\(.*\)<.*>/\1/'
Or, using only the main pom.xml:
cat pom.xml | grep "<module>" | sed 's/\s*<.*>\(.*\)<.*>/\1/'
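The grep above only sees the modules listed in one pom.xml. A sketch of following the module entries transitively, outside Maven, using only the Python standard library, is given below. It is an approximation under stated assumptions: it collects every <module> element, including ones inside profiles that may not be active, and it does no property interpolation. The function names `local_name` and `reactor_poms` and the demo layout are hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip an XML namespace such as {http://maven.apache.org/POM/4.0.0}."""
    return tag.rsplit("}", 1)[-1]


def reactor_poms(pom_path):
    """Return pom_path plus the POMs of all <module> entries, recursively."""
    pom_path = Path(pom_path)
    found = [pom_path]
    for element in ET.parse(pom_path).getroot().iter():
        if local_name(element.tag) == "module" and element.text:
            child = pom_path.parent / element.text.strip()
            # A <module> may name a directory or a POM file directly.
            child_pom = child / "pom.xml" if child.is_dir() else child
            if child_pom.is_file():
                found.extend(reactor_poms(child_pom))
    return found


# Demo on a throwaway project: "unrelated" has a POM but is not a module.
NS = 'xmlns="http://maven.apache.org/POM/4.0.0"'
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tools" / "verification-resources").mkdir(parents=True)
    (root / "unrelated").mkdir()
    (root / "pom.xml").write_text(
        f"<project {NS}><modules><module>tools</module></modules></project>")
    (root / "tools" / "pom.xml").write_text(
        f"<project {NS}><modules><module>verification-resources</module></modules></project>")
    (root / "tools" / "verification-resources" / "pom.xml").write_text(f"<project {NS}/>")
    (root / "unrelated" / "pom.xml").write_text(f"<project {NS}/>")
    listed = [p.relative_to(root).as_posix() for p in reactor_poms(root / "pom.xml")]

print(listed)
```

Unlike a plain find, this skips directories such as "unrelated" that contain a pom.xml but are not referenced by any module entry.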
I don't have a direct answer to the question. But using some kind of "module path" as naming convention for the <name> of my modules works for me. As you'll see, this convention is self explaining.
Given the following project structure:
.
├── pom
│   ├── pom.xml
│   └── release.properties
├── pom.xml
├── samples
│   ├── ejb-cargo-sample
│   │   ├── functests
│   │   │   ├── pom.xml
│   │   │   └── src
│   │   ├── pom.xml
│   │   └── services
│   │       ├── pom.xml
│   │       └── src
│   └── pom.xml
└── tools
    ├── pom.xml
    └── verification-resources
        ├── pom.xml
        └── src
Here is the output of a reactor build:
$ mvn compile
[INFO] Scanning for projects...
[INFO] Reactor build order:
[INFO] Personal Sandbox - Samples - Parent POM
[INFO] Personal Sandbox - Samples - EJB3 and Cargo Sample
[INFO] Personal Sandbox - Tools - Parent POM
[INFO] Personal Sandbox - Tools - Shared Verification Resources
[INFO] Personal Sandbox - Samples - EJB3 and Cargo Sample - Services
[INFO] Personal Sandbox - Samples - EJB3 and Cargo Sample - Functests
[INFO] Sandbox Externals POM
...
This gives IMHO a very decent overview of what is happening, scales correctly, and it's pretty easy to find any module in the file system in case of problems.
Not sure this does answer all your needs though.
I had the same problem but solved it without strace. The mvn exec:exec plugin is used to touch pom.xml in every project, and then find the recently modified pom.xml files:
ctimeref=`mktemp`
mvn --quiet exec:exec -Dexec.executable=/usr/bin/touch -Dexec.args=pom.xml
find . -mindepth 2 -type f -name pom.xml -cnewer "$ctimeref" > maven_projects_list.txt
rm "$ctimeref"
And you have your projects list in the maven_projects_list.txt file.
This is the command I use, from the root of the project, to list all pom.xml files inside it.
find -name pom.xml | grep -v target | sort
What the command does:
find -name pom.xml: finds the files I am searching for
grep -v target: excludes pom.xml files inside target/ directories
sort: lists the results in alphabetical order
An example to list all modules and the parent of each
export REPO_DIR=$(pwd)
export REPO_NAME=$(basename ${REPO_DIR})
echo "${REPO_DIR} ==> ${REPO_NAME}"
mvn exec:exec -q \
-Dexec.executable='echo' \
-Dexec.args='${basedir}:${project.parent.groupId}:${project.parent.artifactId}:${project.parent.version}:${project.groupId}:${project.artifactId}:${project.version}:${project.packaging}' \
| perl -pe "s/^${REPO_DIR//\//\\\/}/${REPO_NAME}/g" \
| perl -pe 's/:/\t/g;'
I prepared the script below because mvn exec:exec runs slowly on GitLab. I haven't had time to investigate it further, but I suspect it tries to get a new runner because it needs a new Runtime. So if you're working with a limited number of runners, using mvn exec:exec to determine the modules affects the overall build time in an unpredictable way.
The below snippet gives you the module name, packaging and path to the module
#!/bin/bash
set -e;
mvnOptions='--add-opens java.base/java.lang=ALL-UNNAMED';
string=$(MAVEN_OPTS="$mvnOptions" mvn help:active-profiles)
delimiter='Active Profiles for Project*';
modules=()
while read -r line; do
    if [[ $line == $delimiter ]]; then
        module=$(echo $line | sed -E "s/.*'(.*):(.*):(.*):(.*)'.*/\2/");
        packaging=$(echo $line | sed -E "s/.*'(.*):(.*):(.*):(.*)'.*/\3/");
        path=$(MAVEN_OPTS="$mvnOptions" mvn help:evaluate -Dexpression=project.basedir -pl "$module" -q -DforceStdout || true);
        if [[ $path == *" $module "* ]]; then
            path=$(pwd);
        fi
        modules+=("$module" "$packaging" "$path")
    fi;
done <<< "$string"
size="$(echo ${#modules[@]})";
moduleCount=$(( $size / 3 ));
# prints the found modules
if [ $moduleCount -gt 0 ]; then
    echo "$moduleCount module(s) found"
    for (( i=0; i<$moduleCount; ++i)); do
        line=$(($i + 1));
        moduleIndex=$(($i * 3));
        pathIndex=$(($i * 3+2));
        module=${modules[moduleIndex]};
        path=${modules[pathIndex]};
        echo " $line. '$module' at '$path'";
    done;
fi;