gzip command block returns "Too many levels of symbolic links" - gzip

I'm trying to run a fairly simple gzip command across my fastq files, but a strange error is returned.
#!/usr/bin/env nextflow

nextflow.enable.dsl=2

params.gzip = "sequences/sequences_split/sequences_trimmed/trimmed*fastq"

workflow {
    gzip_ch = Channel.fromPath(params.gzip)
    GZIP(gzip_ch)
    GZIP.out.view()
}

process GZIP {
    input:
    path read

    output:
    stdout

    script:
    """
    gzip ${read}
    """
}
Error:
Command error:
gzip: trimmed_SRR19573319_R2.fastq: Too many levels of symbolic links
I tried running a loop in the script block instead to run gzip on the individual files, which works, but I would rather use the Nextflow syntax.

By default, Nextflow will try to stage process input files using symbolic links. The problem is that gzip actually ignores symbolic links. From the GZIP(1) man page:
The gzip command will only attempt to compress regular files. In particular, it will ignore symbolic links.
If the objective is to create a reproducible workflow, it's usually best to avoid modifying the workflow inputs directly anyway. Either use the stageInMode directive to change how the input files are staged. For example:
process GZIP {

    stageInMode 'copy'

    input:
    path fastq

    output:
    path "${fastq}.gz"

    script:
    """
    gzip "${fastq}"
    """
}
Or, preferably, just modify the command to redirect stdout to a file:
process GZIP {

    input:
    path fastq

    output:
    path "${fastq}.gz"

    script:
    """
    gzip -c "${fastq}" > "${fastq}.gz"
    """
}

Michael!
I can't reproduce your issue. I created the folder structure in my current directory like you described and created three files in it, as you can see below:
➜ ~ tree sequences/
sequences/
└── sequences_split
    └── sequences_trimmed
        ├── trimmed_a_fastq
        ├── trimmed_b_fastq
        └── trimmed_c_fastq
Then I copy-pasted your Nextflow script file (the only change I made was to use gzip -f ${read} instead of the version without the -f option), and everything worked fine. The reason you need -f is that Nextflow keeps every task contained in a subfolder within work. This means your input files are staged as symbolic links, and gzip will complain that they're not regular files (that's what happened here, on macOS Ventura) or something like that (it may depend on the OS? Not sure). The -f option solves this issue.
N E X T F L O W ~ version 22.10.1
Launching `ex2.nf` [golden_goldstine] DSL2 - revision: 70559e4bcb
executor > local (3)
[ad/447348] process > GZIP (1) [100%] 3 of 3 ✔
➜ ~ tree work
work
├── 0c
│   └── ded66d5f2e56cfa38d85d9c86e4e87
│       └── trimmed_a_fastq.gz
├── 67
│   └── 949c28cce5ed578e9baae7be2d8cb7
│       └── trimmed_c_fastq.gz
└── ad
    └── 44734845950f28f658226852ca4200
        └── trimmed_b_fastq.gz
They're gzip-compressed files (even though they may look just like text files, depending on the demo content). I decided to reply with an answer because it allows me to use markdown to show you how I did it. Feel free to comment on this answer if you want to discuss this topic.
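If you want to double-check that the outputs are real gzip files rather than renamed text, a quick integrity test works (a minimal sketch; the hashed directory names under work will differ on your machine):

# gzip -t exits non-zero if a file is not valid gzip data
find work -name '*.gz' -exec gzip -t {} +

# or inspect a single file's type; it should report "gzip compressed data"
file work/*/*/trimmed_a_fastq.gz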

Related

Nested folders not appearing due to ownership permissions

I have been trying to get my permissions working for my Jellyfin server.
I have a folder on my second hard drive (auto-mounted on start, formatted as exFAT):
jellyfin/
├── Cache
├── Config
...
└── Media
    ├── movies
    │   └── Batman
    │       └── ...
    ├── music
    ├── photos
    └── shows
When I cd into Media > movies, I cannot view any of the folders inside the movies folder, even though I can in a file viewer. It just appears empty.
I tried fixing this by doing
chown -R 1000:1000 jellyfin
Since my Jellyfin Docker container runs as 1000:1000.
But it still has the same problem. ls -l returns 0 folders.
Any advice is appreciated
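In case it helps with debugging: exFAT doesn't store Unix ownership at all, so chown is typically a no-op there, and the uid/gid you see are fixed by the mount options. A quick way to check what the mount actually reports (a sketch; adjust the paths to your setup):

# show the numeric owner/group the filesystem reports for the folder
ls -ln jellyfin/Media/movies

# check the mount options in effect; look for uid=/gid= on the exFAT entry
mount | grep -i exfat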

tabular configuration snakemake

I'm trying to build a Snakemake pipeline where the processed files and their parameters are defined in a separate table.
In my minimal example, the folder structure looks like this:
├── data
│   ├── name1
│   ├── name2
│   ├── name3
│   └── name4
├── output
├── samples.tsv
└── Snakefile
I would like to copy varying amounts of the files "name1", "name2", and "name3" (but not "name4") to the output folder. The amounts are defined in samples.tsv:
fileName  parameter
name1     4
name2     6
name3     7
I guess the basic structure of the Snakefile should look a bit like this:

import pandas as pd

samples = pd.read_table("samples.tsv").set_index("fileName", drop=False)

rule copy:
    input:
        "data/{<file name X>}"
    output:
        "output/{<file name X>}_cut"
    shell:
        "head -n {parameter of X} {input} > {output}"
But I have no idea how to pass the information from "samples" to the rule.
To expand on Dimitry's answer, you probably want to add to your rule a lambda function that links each filename to its parameter(s) given in the sample sheet. E.g.:
import pandas as pd

samples = pd.read_table("samples.tsv").set_index("fileName", drop=False)

rule all:
    input:
        expand('output/{fileName}_cut', fileName=samples.fileName),

rule copy:
    input:
        "data/{fileName}"
    output:
        "output/{fileName}_cut"
    params:
        n=lambda wc: samples[samples.fileName == wc.fileName].parameter.iloc[0],
    shell:
        "head -n {params.n} {input} > {output}"
Add the rule all as the first rule in your file:
rule all:
    input: expand("output/{filenameX}_cut", filenameX=samples.fileName)
This rule would collect all required targets as the goal of your pipeline, then Snakemake would automatically assign an instance of the rule copy for each individual file.
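For a quick sanity check (a sketch, assuming Snakemake and pandas are installed and the data files exist), a dry run lists the planned copy jobs without executing them:

# -n/--dry-run prints the job plan; -p also prints the shell commands
snakemake -n -p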

snakemake script access to stdin / stdout for stream processing

For a Snakemake workflow, I need to manipulate tags in many BAM files, and would like to process these by piping them through a script (using the Snakemake script: directive). The specific way I'm doing this is with pysam stream processing.
import pysam

# read BAM records from stdin, write the modified records to stdout
infile = pysam.AlignmentFile("-", "rb")
outfile = pysam.AlignmentFile("-", "wb", template=infile)

for s in infile:
    (flowcell, lane) = s.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    s.set_tag('RG', rg_id, 'Z')
    outfile.write(s)
This script works well standalone, but I haven't been able to figure out how to integrate it via the snakemake script directive.
I prefer this way to minimize IO and RAM usage.
Edit: I resorted to direct loading (reading from the input and output paths rather than streams) to fix the RG tag.
import pysam

# parameters passed from snakemake
bam_file = snakemake.input[0]
fixed_bam_file = snakemake.output[0]

bamfile = pysam.AlignmentFile(bam_file, "rb")
fixed_bamfile = pysam.AlignmentFile(fixed_bam_file, "wb", template=bamfile)

for i, read in enumerate(bamfile.fetch()):
    (flowcell, lane) = read.query_name.split(':')[0:2]
    rg_id = ".".join([flowcell, lane])
    read.set_tag('RG', rg_id, 'Z')
    fixed_bamfile.write(read)
    if not (i % 100000):
        print("Updated the read group for {} reads to {}".format(i, rg_id))

bamfile.close()
fixed_bamfile.close()
EDIT: Snakemake's run: and shell: directives operate in the workdir: directory, while the script: directive resolves relative to the directory where the Snakefile lives (keeping everything nice and tidy). Hence the problem of putting a stream processor under script:.
Using the shell directive instead of the script directive:

rule all:
    input:
        expand('{sample}_edited.bam', sample=['a', 'b', 'c'])

rule somename:
    input:
        '{sample}.bam'
    output:
        '{sample}_edited.bam'
    shell:
        '''
        cat {input} | python edit_bam.py > {output}
        '''
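To sanity-check the stream script on its own before wiring it into Snakemake (a sketch; this assumes edit_bam.py is the stdin/stdout version from the question, and uses samtools, if available, for inspection):

# run the filter standalone on one file
python edit_bam.py < a.bam > a_edited.bam

# spot-check that the RG tags were rewritten
samtools view a_edited.bam | head -n 5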
@Krischan it seems you found a solution already; if so, it might be good to post it as an answer.
Alternatively, you can use the object {workflow} to get the directory of the Snakefile and from there construct the path to your python script. If your directory structure is:
./
├── Snakefile
├── data
│   └── sample.bam
└── scripts
    └── edit_bam.py
The Snakefile may look like:
rule all:
    input:
        'test.tmp',

rule one:
    input:
        'sample.bam',
    output:
        'test.tmp',
    shell:
        r"""
        cat {input} \
        | {workflow.basedir}/scripts/edit_bam.py > {output}
        """
Executed with snakemake -d data ...
It seems the workflow object is not documented, but check this thread: Any way to get the full path of the Snakefile within the Snakefile?

Run mocha excluding paths

I have this (in gulpfile.js):

var gulp = require("gulp");
var mocha = require("gulp-mocha");

gulp.task("test", function() {
  gulp
    .src(["./**/*_test.js", "!./node_modules/**/*.js"])
    .pipe(mocha());
});
and it works.
I want to replicate the same behavior, excluding the "node_modules" folder, from the mocha command when running npm test (in package.json):

"scripts": {
  "test": "mocha **\\*_test.js !./node_modules/**/*.js*"
}
and it doesn't work.
I'm using Windows.
Any suggestions?
I was able to solve this using globbing patterns in the argument to mocha. Like you I didn't want to put all my tests under a single tests folder. I wanted them in the same directory as the class they were testing. My file structure looked like this:
project
|- lib
|  |- class1.js
|  |- class1.test.js
|- node_modules
|  |- lots of stuff...
Running this from the project folder worked for me:
mocha './{,!(node_modules)/**}/*.test.js'
This matches any *.test.js file in the tree, as long as its path isn't rooted at ./node_modules/.
There are online tools for testing glob patterns that I found useful for working this out.
You can exclude files in mocha by passing opts
mocha -h|grep -i exclude
--exclude <file> a file or glob pattern to ignore (default: )
mocha --exclude **/*-.jest.js
Additionally, you can create a test/mocha.opts file and add the options there:
# test/mocha.opts
--exclude **/*-test.jest.js
--require ./test/setup.js
If you want to exclude a particular file type you could do something like this:

// test/setup.js
require.extensions['.graphql'] = function() {
  return null
}
This is useful when processing extensions with a module loader such as webpack that mocha does not understand.
For Windows users
This script will run perfectly
"test": "mocha \"./{,!(node_modules)/**/}*.test.js\"",
I hope this will help.
cheers!
I'm not a guru on mocha or ant-style patterns, but maybe it isn't possible to exclude a specific path on the mocha command line.
You can put all your test files under a test folder, and set your package.json like this:

"scripts": {
  "test": "mocha ./test/**/*_test.js"
}

You can also provide more than one starting folder:

"scripts": {
  "test": "mocha ./test/**/*_test.js ./another_test_folder/**/*_test.js"
}
As of 2019, the modern way of configuring Mocha under Node is via a config file in your project root (e.g. .mocharc.js).
Here is an example of a .mocharc.js that
- redefines the default test directory (the spec key), and
- excludes the examples (or any experimental tests) from the overall suite (the exclude key):
module.exports = {
  'spec': 'src/front/js/tests/**/*.spec.js',
  'exclude': 'src/front/js/tests/examples/*.spec.js',
  'reporter': 'dot'
};
As you can see, more options can be used in the config; for the most part they are just replicas of the Mocha CLI options. Just look up the ones you like and try them in .mocharc.js (use camelCase for dash-separated CLI options). Or see the config examples.
As suggested in a comment by @thebearingedge, in the end I put ALL the source files (with their relative test files) in a new "src" dir.
This way I can define the root for tests with a path that excludes the "node_modules" folder by default.
.
├── src
│   ├── fileA.js
│   ├── fileA_test.js
│   ├── fileB.js
│   └── fileB_test.js
└── node_modules
    └── ...
I had to update the path in the package.json, gulpfile.js and in some batch files that I use as utilities.
Changes in gulpfile.js:
.src(["./src/**/*_test.js"]);
and in package.json:
"test": "mocha src\\**\\*_test.js",
A simple change, and it works. I'm free to choose whatever naming conventions I like, and each test file stays close to the JS file it tests.
I had a spec directory containing all my specs. Within that directory, I had several sub-directories, one of which was the e2e specs directory. In that scenario, I used mocha $(find specs -name '*.js' -not -path "specs/e2e/*") to run all my tests while ignoring those within the e2e directory.

How to list active sub-modules in a Maven project?

I have a complex project where there are many directories that have POM files, but only some of which are sub-modules (possibly transitively) of a particular parent project.
Obviously, Maven knows the list of relevant files because it parses all the <module> tags to find them. But, I only see a list of the <name>s in the [INFO] comments, not the paths to those modules.
Is there a way to have Maven output a list of all the POM files that provided references to projects that are part of the reactor build for a given project?
This is quite simple but it only gets the artifactId, from the root (or parent) module:
mvn --also-make dependency:tree | grep maven-dependency-plugin | awk '{ print $(NF-1) }'
If you want the directories
mvn -q --also-make exec:exec -Dexec.executable="pwd"
The following command prints the artifactIds of all sub-modules:
mvn -Dexec.executable='echo' -Dexec.args='${project.artifactId}' exec:exec -q
Example output:
build-tools
aws-sdk-java-pom
core
annotations
utils
http-client-spi
http-client-tests
http-clients
apache-client
test-utils
sdk-core
...
mvn help:evaluate -Dexpression=project.modules
mvn help:evaluate -Dexpression=project.modules[0]
mvn help:evaluate -Dexpression=project.modules[1]
IFS=$'\n'
modules=($(mvn help:evaluate -Dexpression=project.modules | grep -v "^\[" | grep -v "<\/*strings>" | sed 's/<\/*string>//g' | sed 's/[[:space:]]//'))

for module in "${modules[@]}"
do
    echo "$module"
done
Here's a way to do this on Linux outside of Maven, by using strace.
$ strace -o opens.txt -f -e open mvn dependency:tree > /dev/null
$ perl -lne 'print $1 if /"(.*pom\.xml)"/' opens.txt
The first line runs mvn dependency:tree under strace, asking strace to output to the file opens.txt all the calls to the open(2) system call, following any forks (because Java is threaded). This file looks something like:
9690 open("/etc/ld.so.cache", O_RDONLY) = 3
9690 open("/lib/libncurses.so.5", O_RDONLY) = 3
9690 open("/lib/libdl.so.2", O_RDONLY) = 3
The second line asks Perl to print any text inside quotes that happens to end in pom.xml. (The -l flag handles printing newlines, the -n wraps the one-liner in a loop that simply reads the files named on the command line, and the -e introduces the script itself, which uses a regex to find the interesting calls to open.)
It'd be nice to have a maven-native way of doing this :-)
The solution I found is quite simple:
mvn -B -f "$pom_file" org.codehaus.mojo:exec-maven-plugin:1.4.0:exec \
-Dexec.executable=/usr/bin/echo \
-Dexec.args='${basedir}/pom.xml'| \
grep -v '\['
This is a little bit tricky due to the need to grep out the [INFO|WARNING|ERROR] lines to make it usable for scripting, but it saved me a lot of time, since you can put any expression there.
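For instance, swapping in a different expression prints each module's coordinates instead of its pom.xml path (a sketch; any property resolvable in the POM should work in exec.args):

mvn -B org.codehaus.mojo:exec-maven-plugin:1.4.0:exec \
  -Dexec.executable=/usr/bin/echo \
  -Dexec.args='${project.groupId}:${project.artifactId}' | \
  grep -v '\['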
This gets exactly the module name, not the artifactId; the result is appropriate for mvn -pl.
mvn help:evaluate -Dexpression=project.modules -q -DforceStdout | tail -n +2 | head -n -1 | sed 's/\s*<.*>\(.*\)<.*>/\1/'
or with the main pom.xml:
cat pom.xml | grep "<module>" | sed 's/\s*<.*>\(.*\)<.*>/\1/'
I don't have a direct answer to the question. But using some kind of "module path" as a naming convention for the <name> of my modules works for me. As you'll see, this convention is self-explanatory.
Given the following project structure:
.
├── pom
│   ├── pom.xml
│   └── release.properties
├── pom.xml
├── samples
│   ├── ejb-cargo-sample
│   │   ├── functests
│   │   │   ├── pom.xml
│   │   │   └── src
│   │   ├── pom.xml
│   │   └── services
│   │       ├── pom.xml
│   │       └── src
│   └── pom.xml
└── tools
    ├── pom.xml
    └── verification-resources
        ├── pom.xml
        └── src
Here is the output of a reactor build:
$ mvn compile
[INFO] Scanning for projects...
[INFO] Reactor build order:
[INFO] Personal Sandbox - Samples - Parent POM
[INFO] Personal Sandbox - Samples - EJB3 and Cargo Sample
[INFO] Personal Sandbox - Tools - Parent POM
[INFO] Personal Sandbox - Tools - Shared Verification Resources
[INFO] Personal Sandbox - Samples - EJB3 and Cargo Sample - Services
[INFO] Personal Sandbox - Samples - EJB3 and Cargo Sample - Functests
[INFO] Sandbox Externals POM
...
This gives IMHO a very decent overview of what is happening, scales correctly, and it's pretty easy to find any module in the file system in case of problems.
Not sure this does answer all your needs though.
I had the same problem but solved it without strace. The mvn exec:exec plugin is used to touch pom.xml in every project, and then find locates the recently modified pom.xml files:
ctimeref=`mktemp`
mvn --quiet exec:exec -Dexec.executable=/usr/bin/touch -Dexec.args=pom.xml
find . -mindepth 2 -type f -name pom.xml -cnewer "$ctimeref" > maven_projects_list.txt
rm "$ctimeref"
And you have your projects list in the maven_projects_list.txt file.
This is the command I use for listing all pom.xml files inside a project at the root of the project.
find -name pom.xml | grep -v target | sort
What the command does:
- find -name pom.xml: what I'm searching for
- grep -v target: avoids listing any pom.xml inside a target/ directory
- sort: lists the results in alphabetical order
An example to list all modules and the parent of each
export REPO_DIR=$(pwd)
export REPO_NAME=$(basename ${REPO_DIR})
echo "${REPO_DIR} ==> ${REPO_NAME}"
mvn exec:exec -q \
-Dexec.executable='echo' \
-Dexec.args='${basedir}:${project.parent.groupId}:${project.parent.artifactId}:${project.parent.version}:${project.groupId}:${project.artifactId}:${project.version}:${project.packaging}' \
| perl -pe "s/^${REPO_DIR//\//\\\/}/${REPO_NAME}/g" \
| perl -pe 's/:/\t/g;'
I prepared the script below because mvn exec:exec runs slowly on GitLab. I couldn't find the time to investigate further, but I suspect it tries to get a new runner, as it needs a new Runtime. So, if you're working with quite limited runners, using mvn exec:exec to determine the modules affects the overall build time in an unpredictable way.
The snippet below gives you the module name, packaging, and path for each module:
#!/bin/bash
set -e;

mvnOptions='--add-opens java.base/java.lang=ALL-UNNAMED';
string=$(MAVEN_OPTS="$mvnOptions" mvn help:active-profiles)
delimiter='Active Profiles for Project*';
modules=()

while read -r line; do
    if [[ $line == $delimiter ]]; then
        module=$(echo $line | sed -E "s/.*'(.*):(.*):(.*):(.*)'.*/\2/");
        packaging=$(echo $line | sed -E "s/.*'(.*):(.*):(.*):(.*)'.*/\3/");
        path=$(MAVEN_OPTS="$mvnOptions" mvn help:evaluate -Dexpression=project.basedir -pl "$module" -q -DforceStdout || true);
        if [[ $path == *" $module "* ]]; then
            path=$(pwd);
        fi
        modules+=("$module" "$packaging" "$path")
    fi;
done <<< "$string"

size="${#modules[@]}";
moduleCount=$(( $size / 3 ));

# prints the found modules
if [ $moduleCount -gt 0 ]; then
    echo "$moduleCount module(s) found"
    for (( i=0; i<$moduleCount; ++i )); do
        line=$(($i + 1));
        moduleIndex=$(($i * 3));
        pathIndex=$(($i * 3 + 2));
        module=${modules[moduleIndex]};
        path=${modules[pathIndex]};
        echo " $line. '$module' at '$path'";
    done;
fi;
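A hypothetical invocation, assuming the snippet above is saved as list-modules.sh (a name I've made up) in the repository root:

# make it executable, then run it from the directory containing the root pom.xml
chmod +x list-modules.sh
./list-modules.sh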