Is there a way to access/modify the contents of a Nextflow channel?

I have a situation where my workflow outputs a main directory, which I emit from a process using DSL2. I feed this output to a Python script, which loops over the sub-directories and their respective files, pulling out information and compiling it into a .tsv.
The two important pieces of information the Python script extracts are the name of each subdirectory and which file within that subdirectory is actually important.
I would like to take my process output ("root dir") + subdirectory (from the file) + important filename (from the file) and turn it into a new generator path to feed to another process.
Am I just using a bad method? Is there a better way to access the channel (generator)? In the documentation I saw subscribe, but I haven't had luck with it. Thank you in advance.
Example .tsv file (columns 1 and 3 are what I want to append to the generator path):
GCF_000005845.2 Escherichia coli str. K-12 substr. MG1655, complete genome GCF_000005845.2_ASM584v2_genomic.fna
GCF_000008865.2 Escherichia coli O157:H7 str. Sakai DNA, complete genome GCF_000008865.2_ASM886v2_genomic.fna
Work directory structure
├── c6
│   └── 6598d4838f61d0421f03216990465c
│       ├── ecoli
│       │   ├── README.md
│       │   └── ncbi_dataset
│       │       ├── data
│       │       │   ├── GCF_000005845.2
│       │       │   │   ├── GCF_000005845.2_ASM584v2_genomic.fna
│       │       │   │   ├── genomic.gff
│       │       │   │   ├── protein.faa
│       │       │   │   └── sequence_report.jsonl
│       │       │   ├── GCF_000008865.2
│       │       │   │   ├── GCF_000008865.2_ASM886v2_genomic.fna
│       │       │   │   ├── genomic.gff
│       │       │   │   ├── protein.faa
│       │       │   │   └── sequence_report.jsonl
│       │       │   ├── assembly_data_report.jsonl
│       │       │   └── dataset_catalog.json
│       │       └── fetch.txt
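For context, a minimal sketch of what bin/collect_org_names.py might do (hypothetical, simplified version; the real script also extracts the organism description column):

#!/usr/bin/env python3
# Hypothetical, simplified sketch of bin/collect_org_names.py:
# walk the ncbi_dataset/data directory passed as the first argument and write
# a tab-separated relations.txt with the accession (subdirectory name) and the
# genomic .fna filename. The organism description column is omitted here.
import sys
from pathlib import Path

data_dir = Path(sys.argv[1])

with open("relations.txt", "w") as out:
    for subdir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        for fna in subdir.glob("*_genomic.fna"):
            out.write(f"{subdir.name}\t{fna.name}\n")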
Here is my nextflow script (constructive criticism very welcome):
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

workflow {
    //ref_genome_ch = Channel.fromPath("$params.ref_genome")
    println([params.taxon, params.zipName, params.unzippedDir])

    DOWNLOAD_ZIP(params.taxon, params.zipName)
    UNZIP(DOWNLOAD_ZIP.out.zipFile)
    REHYDRATE(UNZIP.out.unzippedDir)
    COLLECT_NAMES(REHYDRATE.out.dataDir)

    // I want to get the dir name and file name out of relations.txt
    //thing = Channel.from( )
    //thing.view()
    //organism_genomes = REHYDRATE.out.dataDir.subscribe { println("$it/") }
}
process DOWNLOAD_ZIP {
    errorStrategy 'ignore'

    input:
    val taxonName
    val zipName

    output:
    path "${zipName}", emit: zipFile

    script:
    def reference = params.reference
    """
    datasets download genome \\
        taxon '${taxonName}' \\
        --dehydrated \\
        --filename ${zipName} \\
        ${reference} \\
        --exclude-genomic-cds
    """
}

process UNZIP {
    input:
    path zipFile

    output:
    path "${zipFile.baseName}", emit: unzippedDir

    script:
    """
    unzip $zipFile -d ${zipFile.baseName}
    """
}

process REHYDRATE {
    input:
    path unzippedDir

    output:
    path "$unzippedDir/ncbi_dataset/data", emit: dataDir

    script:
    """
    datasets rehydrate \\
        --directory $unzippedDir
    """
}

process COLLECT_NAMES {
    publishDir params.results

    input:
    path dataDir

    output:
    path "relations.txt", emit: org_names

    script:
    """
    python "$baseDir/bin/collect_org_names.py" $dataDir
    """
}
Edit: user @Steve recommended channel operators. I don't fully understand the Groovy closure syntax ({ thing -> stuff }) yet, but I tried this:
thing = REHYDRATE.out.dataDir.map{"$it/*"}
thing.view()
and I get
/mnt/c/Users/mkozubov/Desktop/nextflow_tutorial/tRNA_stuff/work/d0/long_hash/ecoli/ncbi_dataset/data/*
printed. But when I feed this into a process whose script is just println(input), I get an error saying that the command executed is null, the command output is (empty), and that target '*' is not a directory.
My question is why didn't the .map operator expand the * as entering "PATH/*" into a channel would've?
Edit 2: I feel like I almost had something. I changed the output of the COLLECT_NAMES script to contain the paths to the files. I now want to parse this file and read its contents into a channel. For that I did:
organism_genome_files = Channel.from()

COLLECT_NAMES.out.org_names.map {
    new File(it.toString()).eachLine { line ->
        organism_genome_files << line.split('\t')[3] }
}
If I replace organism_genome_files << line.split('\t')[3] with println line.split('\t')[3] I can see the content I want, but I can't find a way of pulling this info out.
I also tried it with organism_genome_files as a list, but nothing works; I just can't seem to pull info out of channels and mutate it effectively.
The .splitCsv() operator seems like it could be useful, but I still don't understand how to get one channel to work as the input to another channel :(

Is there a way to access/modify the contents of a Nextflow channel?
You can use one or more transforming operators for this. For example, to get the directory name and filename of 'relations.txt', you could use:
COLLECT_NAMES.out.org_names.map { tuple( it.parent, it.name ) }.view()
See also: Check file attributes
My question is why didn't the .map operator expand the * as entering
"PATH/*" into a channel would've?
It's only been told to return a String (actually a GString). Groovy won't automatically expand this in the same way your shell would. I think what you want is some way to list the contents of that directory. For this you can use the listFiles() method:
REHYDRATE.out.dataDir.map { tuple( it.listFiles() ) }.view()
See also: List directory content
I changed the output of the COLLECT_NAMES script to contain the path to the files. I now want to parse this file and read the contents into a channel.
Without more details about what these files are, how big they are, how they are going to be used, and what the return type needs to be, I'm really only guessing here. So I've put together some potential solutions that might help get you started:
This is implemented as a closure and returns a channel of lists:
def getOrganismGenomeFiles = { reader ->
    def values = []
    reader.splitEachLine('\t') { fields ->
        values.add( fields[3] )
    }
    return values
}
ch.map( getOrganismGenomeFiles ).view()
This slurps the lines but also returns a channel of lists:
ch.map { it.readLines().collect { it.split('\t')[3] } }.view()
This slurps the file contents, splits them into records using the splitCsv operator, and returns a channel of values:
ch.map { it.text }.splitCsv(sep: '\t').map { it[3] }.view()
Note: I shortened the input channel name for readability. Please replace ch with COLLECT_NAMES.out.org_names in the above examples.
My (maybe not so constructive) criticism actually regards the workflow design, not so much the style, layout, etc. My preference is, and always will be, to avoid using a web-get command like curl, wget, or in this case NCBI Datasets, inside of a Nextflow process. Sure, you can make things work this way, but you'll ultimately run into problems when you later decide to share your workflow with others. Even if everyone agrees that wasting additional resources on downloading files is fine (which they won't, but maybe these costs are negligible in the scheme of things...), you can't necessarily guarantee that the machine or node your process lands on will even be able to resolve the specified URL(s). There are ways to work around these issues, but my advice is to just let Nextflow localize the required files. The problem is how, and this of course depends on what you're actually trying to do...
These files are available from the NCBI FTP Site and their URLs could be added to your configuration, perhaps something like:
params {
    genomes {
        'GCF_000005845.2_ASM584v2' {
            genomic_fna = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz'
            genomic_gff = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz'
            protein_faa = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz'
        }
        'GCF_000008865.2_ASM886v2' {
            genomic_fna = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000008865.2_ASM886v2/GCF_000008865.2_ASM886v2_genomic.fna.gz'
            genomic_gff = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000008865.2_ASM886v2/GCF_000008865.2_ASM886v2_genomic.gff.gz'
            protein_faa = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000008865.2_ASM886v2/GCF_000008865.2_ASM886v2_protein.faa.gz'
        }
    }
}
Then to access the files for a given genome, use something like:
genome = 'GCF_000005845.2_ASM584v2'

genomic_fna = params.genomes[genome].genomic_fna
genomic_gff = params.genomes[genome].genomic_gff
protein_faa = params.genomes[genome].protein_faa

Related

Nextflow input how to declare tuple in tuple

I am working with a Nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), resulting in a channel that looks like this:
[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]
Note that this is the same channel structure that you get from .fromFilePairs().
I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.
The nextflow documentation is somewhat cryptic about this, and it is hard to find how to declare this type of input in a channel, so I thought I'd create a question on stack overflow and then answer it myself for anyone who will ever be looking for this answer.
How does one declare the inner tuple in the input section of a nextflow process?
In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:
input:
tuple \
    val(sample_id), \
    file(inner_tuple) \
    from Input_channel
This will ensure that the tuple contents are read as files (one by one), the same way as performing .collect() on a channel of files, in the sense that all files will then be available in the Nextflow temp directory where the process is executed.
The question is how you come up with sample_id, but in case your files just differ by their extensions you might use something like this:
all_files = Channel.fromPath("/path/to/your/files/*")

all_files.map { it -> [it.simpleName, it] }
    .groupTuple()
    .set { grouped_files }
The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:
The normal file input constructs introduced in the input of files
section are valid for collections of multiple files as well.
This means, you can use a script variable, e.g.:
input:
tuple val(sample_id), path(my_files)
In which case, the variable will hold the list of files (preserving the original filenames). You could use it directly to refer to all of the files in the list, or, you could access specific (file) elements (if you need them) using square bracket (slice) notation.
This is the syntax you will want most of the time. However, if you need predictable filenames or if you need to deal with files that have identical filenames, you may need a different approach:
Alternatively, you could specify a target filename, e.g.:
input:
tuple val(sample_id), path('my_file')
In the case where a single file is received by the process, the file would be staged with the target filename. However, when a collection of files is received by the process, the filename will be appended with a numerical suffix representing its ordinal position in the list. For example:
process test {
    tag { sample_id }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_id), path('fastq')

    """
    echo "${sample_id}:"
    ls -g --time-style=+"" fastq*
    """
}

workflow {
    readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )

    test( readgroups )
}
Results:
$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run .
N E X T F L O W ~ version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor > local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../baz_2.fastq
foo:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../foo_2.fastq
bar:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../bar_2.fastq
Note that the names of staged files can be controlled using the * and ? wildcards. See the links above for a table that shows how the wildcards are replaced depending on the cardinality of the input collection.

Python unit tests for Foundry's transforms?

I would like to set up tests for my transforms in Foundry, passing test inputs and checking that the output is the expected one. Is it possible to call a transform with dummy datasets (a .csv file in the repo), or should I create functions inside the transform to be called by the tests (data created in code)?
If you check your platform documentation under Code Repositories -> Python Transforms -> Python Unit Tests, you'll find quite a few resources there that will be helpful.
The sections on writing and running tests in particular are what you're looking for.
// START DOCUMENTATION
Writing a Test
Full documentation can be found at https://docs.pytest.org
Pytest finds tests in any Python file that begins with test_.
It is recommended to put all your tests into a test package under the src directory of your project.
Tests are simply Python functions that are also named with the test_ prefix and assertions are made using Python’s assert statement.
PyTest will also run tests written using Python’s builtin unittest module.
For example, in transforms-python/src/test/test_increment.py a simple test would look like this:
def increment(num):
    return num + 1


def test_increment():
    assert increment(3) == 5
Running this test will cause checks to fail with a message that looks like this:
============================= test session starts =============================
collected 1 item
test_increment.py F [100%]
================================== FAILURES ===================================
_______________________________ test_increment ________________________________
def test_increment():
> assert increment(3) == 5
E assert 4 == 5
E + where 4 = increment(3)
test_increment.py:5: AssertionError
========================== 1 failed in 0.08 seconds ===========================
Testing with PySpark
PyTest fixtures are a powerful feature that enables injecting values into test functions simply by adding a parameter of the same name. This feature is used to provide a spark_session fixture for use in your test functions. For example:
def test_dataframe(spark_session):
    df = spark_session.createDataFrame([['a', 1], ['b', 2]], ['letter', 'number'])
    assert df.schema.names == ['letter', 'number']
// END DOCUMENTATION
If you don't want to specify your schemas in code, you can also read in a file in your repository by following the instructions in documentation under How To -> Read file in Python repository
// START DOCUMENTATION
Read file in Python repository
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, In your python repository edit setup.py:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.yaml', '*.csv']
    }
)
This tells Python to bundle the yaml and csv files into the package. Then place a config file (for example config.yaml, but it can also be csv or txt) next to your Python transform (e.g. read_yml.py, see below):
- name: tbl1
  primaryKey:
  - col1
  - col2
  update:
  - column: col3
    with: 'XXX'
You can read it in your transform read_yml.py with the code below:
from transforms.api import transform_df, Input, Output
from pkg_resources import resource_stream
import yaml
import json


@transform_df(
    Output("/Demo/read_yml")
)
def my_compute_function(ctx):
    stream = resource_stream(__name__, "config.yaml")
    docs = yaml.load(stream)
    return ctx.spark_session.createDataFrame([{'result': json.dumps(docs)}])
So your project structure would be:
some_folder
    config.yaml
    read_yml.py
This will output in your dataset a single row with one column "result" with content:
[{"primaryKey": ["col1", "col2"], "update": [{"column": "col3", "with": "XXX"}], "name": "tbl1"}]
// END DOCUMENTATION
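Putting the two together: if the transform's core logic is factored into a plain function that takes and returns a DataFrame, a test can read a dummy .csv from the repository and assert on the result. A minimal sketch (the module, function, and file names below are hypothetical, not Foundry APIs):

# transforms-python/src/test/test_my_transform.py (hypothetical path)
import os

# Hypothetical import: assumes the transform's core logic is a plain function
# (e.g. def clean_dataframe(df): ...) that can be called without Input/Output.
from myproject.datasets.my_transform import clean_dataframe


def test_clean_dataframe_with_dummy_csv(spark_session):
    # Read a dummy dataset checked into the repo next to this test file.
    fixture = os.path.join(os.path.dirname(__file__), "dummy_input.csv")
    input_df = spark_session.read.csv(fixture, header=True, inferSchema=True)

    output_df = clean_dataframe(input_df)

    # Assert whatever you expect for the dummy data.
    assert "some_expected_column" in output_df.columns
    assert output_df.count() > 0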

Using checkpoints with snakemake gives each instance of a rule all input files

I've recently come across checkpoints in snakemake and realized they will work perfectly with what I am trying to do. I've been able to implement the workflow listed here. I also found this Stack Overflow question, but can't quite make sense of it or how I might make it work for what I am doing.
The rules I am working with are as follows:
def ReturnBarcodeFolderNames():
    path = config['results_folder'] + "Barcode/"
    return_direc = []
    for root, directory, files in os.walk(path):
        for direc in directory:
            return_direc.append(direc)
    return return_direc

rule all:
    input:
        expand(config['results_folder'] + "Barcode/{folder}.merged.fastq", folder=ReturnBarcodeFolderNames())

checkpoint barcode:
    input:
        expand(config['results_folder'] + "Basecall/{fast5_files}", fast5_files=FAST5_FILES)
    output:
        temp(directory(config['results_folder'] + "Barcode/.tempOutput/"))
    shell:
        "guppy_barcoder "
        "--input_path {input} "
        "--save_path {output} "
        "--barcode_kits EXP-PBC096 "
        "--recursive"

def aggregate_barcode_folders(wildcards):
    checkpoint_output = checkpoints.barcode.get(**wildcards).output[0]
    folder_names = []
    for root, directories, files in os.walk(checkpoint_output):
        for direc in directories:
            folder_names.append(direc)
    return expand(config['results_folder'] + "Barcode/.tempOutput/{folder}", folder=folder_names)

rule merge:
    input:
        aggregate_barcode_folders
    output:
        config['results_folder'] + "Barcode/{folder}.merged.fastq"
    shell:
        "echo {input}"
The rule barcode and def aggregate_barcode_folders work as expected, but when rule merge is reached, every input folder is being passed to each instance of the rule. This results in something like the following:
rule merge:
input: /Results/Barcode/.tempOutput/barcode81,
/Results/Barcode/.tempOutput/barcode28,
/Results/Barcode/.tempOutput/barcode17,
/Results/Barcode/.tempOutput/barcode10,
/Results/Barcode/.tempOutput/barcode26,
/Results/Barcode/.tempOutput/barcode21,
/Results/Barcode/.tempOutput/barcode42,
/Results/Barcode/.tempOutput/barcode89,
/Results/Barcode/.tempOutput/barcode45,
/Results/Barcode/.tempOutput/barcode20,
/Results/Barcode/.tempOutput/barcode18,
/Results/Barcode/.tempOutput/barcode27,
/Results/Barcode/.tempOutput/barcode11,
.
.
.
.
.
output: /Results/Barcode/barcode75.merged.fastq
jobid: 82
wildcards: folder=barcode75
The exact same input is passed to each job of rule merge, which amounts to about 80 instances. But the wildcards portion of each job is different for each folder. How can I use this as input for each instance of rule merge, instead of passing the entire list received from aggregate_barcode_folders?
I feel there may be something amiss with the input from rule all, but I'm not 100% sure what the problem may be.
As a note, I know snakemake will throw an error stating that it is waiting for output files from rule merge, as I am not doing anything with the output other than printing it to the screen.
EDIT
I've decided to go against checkpoints for now, and instead opt for the following. To make things clearer, the goal of this pipeline is to merge the fastq files from an output folder into one file, where each folder contains a variable number of input files (1 to about 3, but I won't know how many). The structure of the input is as follows:
INPUT
|-- Results
|   |-- FolderA
|   |   |-- barcode01
|   |   |   |-- file1.fastq
|   |   |-- barcode02
|   |   |   |-- file1.fastq
|   |   |   |-- file2.fastq
|   |   |-- barcode03
|   |   |   |-- file1.fastq
|   |-- FolderB
|   |   |-- barcode01
|   |   |   |-- file1.fastq
|   |   |-- barcode02
|   |   |   |-- file1.fastq
|   |   |   |-- file2.fastq
|   |   |-- barcode03
|   |   |   |-- file1.fastq
|   |-- FolderC
|   |   |-- barcode01
|   |   |   |-- file1.fastq
|   |   |   |-- file2.fastq
|   |   |-- barcode02
|   |   |   |-- file1.fastq
|   |   |-- barcode03
|   |   |   |-- file1.fastq
|   |   |   |-- file2.fastq
OUTPUT
I would like the output to resemble something like this:
|-- Results
|   |-- barcode01.merged.fastq
|   |-- barcode02.merged.fastq
|   |-- barcode03.merged.fastq
The output files would contain the data from every file#.fastq in the respective barcode folder, across folders A, B, and C.
I've been able to get (I think) further than I was before, but snakemake is throwing an error that says Missing input files for rule basecall: /Users/joshl/PycharmProjects/ARS/Results/DataFiles/fast5/FAL03879_67a0761e_1055/barcode72.fast5. My relevant code is here:
CODE
import os
from pathlib import Path

configfile: "config.yaml"

FAST5_FILES = glob_wildcards(config['results_folder'] + "DataFiles/fast5/{fast5_files}.fast5").fast5_files

def return_fast5_folder_names():
    path = config['results_folder'] + "Basecall/"
    fast5_folder_names = []
    for item in os.scandir(path):
        if Path(item).is_dir():
            fast5_folder_names.append(item.name)
    return fast5_folder_names

def return_barcode_folder_names():
    path = config['results_folder'] + ".barcodeTempOutput"
    fast5_folder_names = []
    collated_barcode_folder_names = []
    for item in os.scandir(path):
        if Path(item).is_dir():
            full_item_path = os.path.join(path, item.name)
            fast5_folder_names.append(full_item_path)
    index = 0
    for item in fast5_folder_names:
        collated_barcode_folder_names.append([])
        for folder in os.scandir(item):
            if Path(folder).is_dir():
                collated_barcode_folder_names[index].append(folder.name)
        index += 1
    return collated_barcode_folder_names

rule all:
    input:
        # basecall
        expand(config['results_folder'] + "Basecall/{fast5_file}", fast5_file=FAST5_FILES),
        # barcode
        expand(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}", fast5_folders=return_fast5_folder_names()),
        # merge files
        expand(config['results_folder'] + "Barcode/{barcode_numbers}.merged.fastq", barcode_numbers=return_barcode_folder_names())

rule basecall:
    input:
        config['results_folder'] + "DataFiles/fast5/{fast5_file}.fast5"
    output:
        directory(config['results_folder'] + "Basecall/{fast5_file}")
    shell:
        r"""
        guppy_basecaller \
            --input_path {input} \
            --save_path {output} \
            --quiet \
            --config dna_r9.4.1_450bps_fast.cfg \
            --num_callers 2 \
            --cpu_threads_per_caller 6
        """

rule barcode:
    input:
        config['results_folder'] + "Basecall/{fast5_folders}"
    output:
        directory(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}")
    threads: 12
    shell:
        r"""
        for item in {input}; do
            guppy_barcoder \
                --input_path $item \
                --save_path {output} \
                --barcode_kits EXP-PBC096 \
                --recursive
        done
        """

rule merge_files:
    input:
        expand(config['results_folder'] + ".barcodeTempOutput/" + "{fast5_folder}/{barcode_numbers}",
               fast5_folder=glob_wildcards(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}/{barcode_numbers}/{fastq_files}.fastq").fast5_folders,
               barcode_numbers=glob_wildcards(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}/{barcode_numbers}/{fastq_files}.fastq").barcode_numbers)
    output:
        config['results_folder'] + "Barcode/{barcode_numbers}.merged.fastq"
    shell:
        r"""
        echo "Hello world"
        echo {input}
        """
Under rule all, if I comment out the line that corresponds to merge files, there is no error.
I am not fully understanding what you mean, but I think the problem indeed lies in the input for rule all. I currently don't have access to a computer (I'm on my phone right now), so I can't make a real example. Probably what you want to do is change ReturnBarcodeFolderNames to use a checkpoint. I guess only after rule barcode has run do you actually know what you want as the final output.
def ReturnBarcodeFolderNames(wildcards):
    # the wildcards argument here makes sure that barcode is executed first
    checkpoint_output = checkpoints.barcode.get().output[0]
    folder_names = []
    for root, directories, files in os.walk(checkpoint_output):
        for direc in directories:
            folder_names.append(direc)
    return expand(config['results_folder'] + "Barcode/{folder}.merged.fastq", folder=folder_names)

rule all:
    input:
        ReturnBarcodeFolderNames

rule merge:
    input:
        config['results_folder'] + "Barcode/.tempOutput/{folder}"
    output:
        config['results_folder'] + "Barcode/{folder}.merged.fastq"
    shell:
        "echo {input}"
Obviously ReturnBarcodeFolderNames does not work in its current form. However, the idea is that you check what you want as final output in rule all after rule barcode has been executed. Rule merge then does not have to use the checkpoint, as its input and output can be clearly defined.
I hope this helps :), but maybe I have been addressing something other than your problem. It wasn't completely clear to me from the question, unfortunately.
Edit:
Here is a stripped-down version of the code, but it should be easy to implement the last parts now. It works for the folder structure you gave in the example:
import os
import glob

def get_merged_barcodes(wildcards):
    tmpdir = checkpoints.barcode.get(**wildcards).output[0]  # this forces the checkpoint to be executed before we continue
    barcodes = set()  # a set is like a list, but only stores unique values
    for folder in os.listdir(tmpdir):
        for barcode in os.listdir(tmpdir + "/" + folder):
            barcodes.add(barcode)
    mergedfiles = ["results/" + barcode + ".merged.fastq" for barcode in barcodes]
    return mergedfiles

rule all:
    input:
        get_merged_barcodes

checkpoint barcode:
    input:
        rules.basecall.output
    output:
        directory("results")
    shell:
        """
        stuff
        """

def get_merged_input(wildcards):
    return glob.glob(f"results/**/{wildcards.barcode}/*.fastq")

rule merge_files:
    input:
        get_merged_input
    output:
        "results/{barcode}.merged.fastq"
    shell:
        """
        echo {input}
        """
Basically what you did in the original question was almost working!

Traverse directory at URL to root in Python

How can you traverse a directory to get to the root in Python? I wrote some code using BeautifulSoup, but it says 'module not found'. So I have this:
#
# There is a directory traversal vulnerability in the
# following page http://127.0.0.1:8082/humantechconfig?file=human.conf
# Write a script which will attempt various levels of directory
# traversal to find the right amount that will give access
# to the root directory. Inside will be a human.conf with the flag.
#
# Note: The script can timeout if this occurs try narrowing
# down your search
import urllib.request
import os

req = urllib.request.urlopen("http://127.0.0.1:8082/humantechconfig?file=human.conf")

dirName = "/tmp"

def getListOfFiles(dirName):
    listOfFile = os.listdir(dirName)
    allFiles = list()
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
    return allFiles

listOfFiles = getListOfFiles(dirName)
print(listOfFiles)

for file in listOfFiles:
    if file.endswith(".conf"):
        f = open(file, "r")
        print(f.read())
This outputs:
/tmp/level-0/level-1/level-2/human.conf
User : Human 66
Flag: Not-Set (Must be Root Human)
However, if I change the URL to 'http://127.0.0.1:8082/humantechconfig?file=../../../human.conf' it gives me the output:
User : Human 66
Flag: Not-Set (Must be Root Human)
User : Root Human
Flag: Well done the flag is: {}
The level of directory traversal it reaches fluctuates wildly, from /tmp/level-2 to /tmp/level-15; if it's at the one I wrote, then it says I'm 'Root Human'. But it won't give me the flag, despite the fact that I am suddenly 'Root Human'. Is there something wrong with the way I am traversing the directory?
It doesn't seem to matter at all if I take away the req = urllib.request.urlopen("http://127.0.0.1:8082/humantechconfig?file=human.conf") line. How can I actually send the code to that URL?
Thanks!
cyber discovery moon base challenge?
For this one, you need to keep adding '../' in front of human.conf (for example 'http://127.0.0.1:8082/humantechconfig?file=../human.conf'), which becomes your URL. You then need to request this URL (using urllib.request.urlopen(URL)).
The main part of the challenge is to prepend the '../' multiple times, which is not very hard using a simple loop. You don't need to use the os module.
Make sure to break the loop once you find the flag (or it will go into an infinite loop and give you errors).
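A minimal sketch of that loop (the success check below is an assumption about what the flag line contains; adjust it to the actual response):

import urllib.request

base = "http://127.0.0.1:8082/humantechconfig?file="
prefix = ""

for depth in range(1, 50):          # cap the attempts so the script cannot loop forever
    prefix += "../"
    url = base + prefix + "human.conf"
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode(errors="replace")
    print(depth, url)
    print(body)
    if "flag is" in body:           # hypothetical marker for the flag line
        break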

snakemake wildcards or expand command

I want a rule to perform realignment between normal and tumor samples. The main problem is I don't know how to manage this. Should I use wildcards or expand to solve my problem?
This is my list of samples:
conditions:
    pair1:
        tumor: "432"
        normal: "433"
So the rule needs to be something like this:
rule gatk_RealignerTargetCreator:
    input:
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam", sample=config['conditions']['pair1']['tumor']),
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam", sample=config['conditions']['pair1']['normal']),
    output:
        "mapped_reads/merged_samples/{pair1}.realign.intervals"
How can I do this operation for all the keys in conditions? (I suppose I'll have more than one pair.)
I have tried this code:
input:
    lambda wildcards: config["conditions"][wildcards.condition],
    tumor = expand("mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam", tumor=config['conditions'][wildcards.condition]['tumor']),
    normal = expand("mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam", normal=config['conditions'][wildcards.condition]['normal']),
output:
    "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
name 'wildcards' is not defined
??
wildcards is not "directly" defined in the input of a rule. You need to use a function of wildcards instead. I'm not sure I understand exactly what you want to do, but you may try something like this:
def condition2tumorsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['tumor'])

def condition2normalsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['normal'])

rule gatk_RealignerTargetCreator:
    input:
        tumor = condition2tumorsamples,
        normal = condition2normalsamples,
    output:
        "mapped_reads/merged_samples/{condition}.realign.intervals"
    # remainder of the rule here...
DISCLAIMER: You want to read your pairings from a YAML file; however, I advise against this. I couldn't figure out how to do it elegantly using YAML formatting. I have an ad-hoc way of doing it to pair my SNP and INDEL annotations, but there is a lot of boilerplate code JUST so it can be written from the YAML. That was okay because the YAML variable is likely never edited, so maintenance of a pedantically formatted string is no longer important in that case.
I think the code you tried is just about right.
What I think is missing is the ability to "request" the correct pairings in your "rule all" input. I personally prefer to do this using pandas. It is listed on the homepage of the Python Software Foundation, so it's a robust choice.
The pandas setup is very easy to maintain: it's a single tab- or space-separated file. That is easier for the end user than formatting nested YAML files (which is what I think would be required if this were set up via YAML). This is how I do it in my system; it scales indefinitely. I'll admit accessing the pandas object is a bit tricky, but I've provided the code for you. Just know that in the first layer of objects (the [#] in the 'sample[1][tumor]' call), the [0] is, I think, just metadata on the file being read. I have yet to find a use for it and otherwise just ignore it.
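If the sample[1][tumor] indexing looks opaque, here is a small standalone illustration of what iterrows() yields (plain Python, using the sampleFILEpair.txt shown below):

from pandas import read_table

# Each item from iterrows() is a (row_index, row) pair:
#   sample[0] -> the row index (the "metadata" mentioned above)
#   sample[1] -> the row itself, a Series keyed by column name
for sample in read_table("sampleFILEpair.txt", sep=" ").iterrows():
    print(sample[0], sample[1]["tumor"], sample[1]["normal"])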
tree structure of workspace
(CentOS5-Compatible) [tboyarski@login3 Test]$ tree
.
|-- [tboyarsk 620 Aug  4 10:57]  Snakefile
|-- [tboyarsk  47 Aug  4 10:52]  config.yaml
|-- [tboyarsk 512 Aug  4 10:57]  output
|   |-- [tboyarsk   0 Aug  4 10:54]  ABC.bam
|   |-- [tboyarsk   0 Aug  4 10:53]  TimNorm.bam
|   |-- [tboyarsk   0 Aug  4 10:53]  TimTum.bam
|   `-- [tboyarsk   0 Aug  4 10:57]  XYZ.bam
`-- [tboyarsk  36 Aug  4 10:49]  sampleFILEpair.txt
sampleFILEpair.txt (Proof the sample names can be unrelated)
tumor normal
TimTum TimNorm
XYZ ABC
config.yaml
pathDIR: output
sampleFILE: sampleFILEpair.txt
Snakefile
from pandas import read_table
from subprocess import call

configfile: "config.yaml"

rule all:
    input:
        expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam", pathDIR=config["pathDIR"], sample=read_table(config["sampleFILE"], " ").iterrows())

rule gatk_RealignerTargetCreator:
    input:
        "{pathGRTC}/{normal}.bam",
        "{pathGRTC}/{tumor}.bam",
    output:
        "{pathGRTC}/{tumor}_{normal}.bam"
    # wildcard_constraints:
    #     tumor = '[^_|-|\/][0-9a-zA-Z]*',
    #     normal = '[^_|-|\/][0-9a-zA-Z]*'
    run:
        call('touch ' + str(wildcards.tumor) + '_' + str(wildcards.normal) + '.bam', shell=True)
With the merging of wildcards, in the past, I have found it to be a source of cyclical dependencies, so I also always include wildcard_constraints when merging (essentially that's what we're doing). They aren't actually necessary here. The "rule all" contains no wildcards, and it is calling "gatk", so in this exact example where is no room for ambiguity, but if this rule connects with other rules utilizing wildcards, usually it can generate some funky DAG's.