Pig: positionals counting from right? - apache-pig

I have a tuple in my pig script:
((,v1,fb,fql))
I know I can choose elements from the left as $0 (blank), $1 ("v1"), etc. But can I choose elements from the right? The tuples will be different lengths, but I would always like to get the last element.

You can't. You can however write a python UDF to extract it:
# Make sure to include the appropriate ouputSchema
def getFromBack(TUPLE, pos):
# gets elements from the back of TUPLE
# You can also do TUPLE[-pos], but this way it starts at 0 which is closer
# to Pig
return TUPLE[::-1][pos]
# This works the same way as above, but may be faster
# return TUPLE[-(pos + 1)]
And used like:
register 'myudf.py' using jython as pythonUDFs ;
B = FOREACH A GENERATE pythonUDFs.getFromBack(T, 0) ;

Related

Nextflow: add unique ID, hash, or row number to tuple

ch_files = Channel.fromPath("myfiles/*.csv")
ch_parameters = Channel.from(['A','B, 'C', 'D'])
ch_samplesize = Channel.from([4, 16, 128])
process makeGrid {
input:
path input_file from ch_files
each parameter from ch_parameters
each samplesize from ch_samplesize
output:
tuple path(input_file), parameter, samplesize, path("config_file.ini") into settings_grid
"""
echo "parameter=$parameter;sampleSize=$samplesize" > config_file.ini
"""
}
gives me a number_of_files * 4 * 3 grid of settings files, so I can run some script for each combination of parameters and input files.
How do I add some ID to each line of this grid? A row ID would be OK, but I would even prefer some unique 6-digit alphanumeric code without a "meaning" because the order in the table doesn't matter. I could extract out the last part of the working folder which is seemingly unique per process; but I don't think it is ideal to rely on sed and $PWD for this, and I didn't see it provided as a runtime metadata variable provider. (plus it's a bit long but OK). In a former setup I had a job ID from the LSF cluster system for this purpose, but I want this to be portable.
Every combination is not guaranteed to be unique (e.g. having parameter 'A' twice in the input channel should be valid).
To be clear, I would like this output
file1.csv A 4 pathto/config.ini 1ac5r
file1.csv A 16 pathto/config.ini 7zfge
file1.csv A 128 pathto/config.ini ztgg4
file2.csv A 4 pathto/config.ini 123js
etc.
Given the input declaration, which uses the each qualifier as an input repeater, it will be difficult to append some unique id to the grid without some refactoring to use either the combine or cross operators. If the inputs are just files or simple values (like in your example code), refactoring doesn't make much sense.
To get a unique code, the simple options are:
Like you mentioned, there's no way, unfortunately, to access the unique task hash without some hack to parse $PWD. Although, it might be possible to use BASH parameter substitution to avoid sed/awk/cut (assuming BASH is your shell of course...) you could try using: "${PWD##*/}"
You might instead prefer using ${task.index}, which is a unique index within the same task. Although the task index is not guaranteed to be unique across executions, it should be sufficient in most cases. It can also be formatted for example:
process example {
...
script:
def idx = String.format("%06d", task.index)
"""
echo "${idx}"
"""
}
Alternatively, create your own UUID. You might be able to take the first N characters but this will of course decrease the likelihood of the IDs being unique (not that there was any guarantee of that anyway). This might not really matter though for a small finite set of inputs:
process example {
...
script:
def uuid = UUID.randomUUID().toString()
"""
echo "${uuid}"
echo "${uuid.take(6)}"
echo "${uuid.takeBefore('-')}"
"""
}

Snakemake variable number of files

I'm in a situation, where I would like to scatter my workflow into a variable number of chunks, which I don't know beforehand. Maybe it is easiest to explain the problem by being concrete:
Someone has handed me FASTQ files demultiplexed using bcl2fastq with the no-lane-splitting option. I would like to split these files according to lane, map each lane individually, and then finally gather everything again. However, I don't know the number of lanes beforehand.
Ideally, I would like a solution like this,
rule split_fastq_file: (...) # results in N FASTQ files
rule map_fastq_file: (...) # do this N times
rule merge_bam_files: (...) # merge the N BAM files
but I am not sure this is possbile. The expand function requires me to know the number of lanes, and can't see how it would be possible to use wildcards for this, either.
I should say that I am rather new to Snakemake, and that I may have complete misunderstood how Snakemake works. It has taken me some time to get used to think about things "upside-down" by focusing on output files and then working backwards.
One option is to use checkpoint when splitting the fastqs, so that you can dynamically re-evaluate the DAG at a later point to get the resulting lanes.
Here's an MWE step by step:
Setup and make an example fastq file.
# Requires Python 3.6+ for f-strings, Snakemake 5.4+ for checkpoints
import pathlib
import random
random.seed(1)
rule make_fastq:
output:
fastq = touch("input/{sample}.fastq")
Create a random number of lanes between 1 and 9 each with random identifier from 1 to 9. Note that we declare this as a checkpoint, rather than a rule, so that we can later access the result. Also, we declare the output here as a directory specific to the sample, so that we can later glob in it to get the lanes that were created.
checkpoint split_fastq:
input:
fastq = rules.make_fastq.output.fastq
output:
lane_dir = directory("temp/split_fastq/{sample}")
run:
pathlib.Path(output.lane_dir).mkdir(exist_ok=True)
n_lanes = random.randrange(1, 10)-
lane_numbers = random.sample(range(1, 10), k = n_lanes)
for lane_number in lane_numbers:
path = pathlib.Path(output.lane_dir) / f"L00{lane_number}.fastq"
path.touch()
Do some intermediate processing.
rule map_fastq:
input:
fastq = "temp/split_fastq/{sample}/L00{lane_number}.fastq"
output:
bam = "temp/map_fastq/{sample}/L00{lane_number}.bam"
run:
bam = pathlib.Path(output.bam)
bam.parent.mkdir(exist_ok=True)
bam.touch()
To merge all the processed files, we use an input function to access the lanes that were created in split_fastq, so that we can do a dynamic expand on these. We do the expand on the last rule in the chain of intermediate processing steps, in this case map_fastq, so that we ask for the correct inputs.
def get_bams(wildcards):
lane_dir = checkpoints.split_fastq.get(**wildcards).output[0]
lane_numbers = glob_wildcards(f"{lane_dir}/L00{{lane_number}}.fastq").lane_number
bams = expand(rules.map_fastq.output.bam, **wildcards, lane_number=lane_numbers)
return bams
This input function now gives us easy access to the bam files we wish to merge, however many there are, and whatever they may be called.
rule merge_bam:
input:
get_bams
output:
bam = "temp/merge_bam/{sample}.bam"
shell:
"cat {input} > {output.bam}"
This example runs, and with random.seed(1) happens to create three lanes (l001, l002, and l005).
If you don't want to use checkpoint, I think you could achieve something similar by creating an input function for merge_bam that opens up the original input fastq, scans the read names for lane info, and predicts what the input files ought to be. This seems less robust, however.

Getting wildcard from input files when not used in output files

I have a snakemake rule aggregating several result files to a single file, per study. So to make it a bit more understandable; I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. Now what I want is a rule to aggregate the phenotype results from each study to a single summary file per study (so merging the phenotypes into a single table). In the merge_results rule I give the rule a list of files (per study and role), and aggregate these using a pandas frame, and then spit out the result as a single file.
In the process of merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not provided in output and as a consequence it is also not available in the wildcards object. Now to get a hold of the pheno I parse the filename to grab it, however this all feels very hacky and I suspect there is something here I have not understood properly. Is there a better way to grab wildcards from input files not used in output files in a better way?
runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole = ['big','small']
rule all:
input:
expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)
rule merge_results:
input:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
output:
os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
run:
import pandas as pd
import os
# Iterate over input files, read into pandas df
tmplist = []
for f in input:
data = pd.read_csv(f, sep='\t')
# getting the pheno from the input file and adding it to the data frame
pheno = os.path.split(f)[1].split('.')[0]
data['pheno'] = pheno
tmplist.append(data)
resmerged = pd.concat(tmplist)
resmerged.to_csv(output, sep='\t')
You are doing it the right way !
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
you have to understand that role and study are wildcards. pheno is not a wildcard and is set by the second argument of the expand function.
In order to get the phenotype if your for loop, you can either parse the file name like you are doing or directly reconstruct the file name since you know the different values that pheno takes and you can access the wildcards:
run:
import pandas as pd
import os
# Iterate over phenotypes, read into pandas df
tmplist = []
for pheno in runpheno:
# conflicting variable name 'output' between a global variable and the rule variable here. Renamed global var outputDir for example
file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno+'.summary')
data = pd.read_csv(file, sep='\t')
data['pheno'] = pheno
tmplist.append(data)
resmerged = pd.concat(tmplist)
resmerged.to_csv(output, sep='\t')
I don't know if this is better than parsing the file name like you were doing though. I wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.

Can gather be used to unroll Junctions?

In this program:
use v6;
my $j = +any "33", "42", "2.1";
gather for $j -> $e {
say $e;
} # prints 33␤42␤2.1␤
for $j -> $e {
say $e; # prints any(33, 42, 2.1)
}
How does gather in front of forchange the behavior of the Junction, allowing to create a loop over it? The documentation does not seem to reflect that behavior. Is that spec?
Fixed by jnthn in code and test commits.
Issue filed.
Golfed:
do put .^name for any 1 ; # Int
put .^name for any 1 ; # Mu
Any of ten of the thirteen statement prefixes listed in the doc can be used instead of do or gather with the same result. (supply unsurprisingly produces no output and hyper and race are red herrings because they try and fail to apply methods to the junction values.)
Any type of junction produces the same results.
Any number of elements of the junction produces the same result for the for loop without a statement prefix, namely a single Mu. With a statement prefix the for loop repeats the primary statement (the put ...) the appropriate number of times.
I've searched both rt and gh issues and failed to find a related bug report.

Missing wildcards in S4 snakemake Object in R

I'm running a workflow with a main Snakefile including rules from the rules folder and calling rscripts from those included rules.
Here are a few lines and their specific files:
Snakefile:
samples = pd.read_table("samples.csv", header=0, sep=',', index_col=0)
rule extract:
input:
'summary/umi_expression_matrix.tsv'
include: "rules/extract_expression_single.smk"
rules/extract_expression_single.smk:
rule merge_umi:
input:
expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
output:
'summary/umi_expression_matrix.tsv'
script:
"../scripts/merge_counts_single.R"
scripts/merge_counts_single.R:
samples = read.csv('samples.csv', header=TRUE, stringsAsFactors=FALSE)$samples
read_list = c()
for (i in 1:length(samples)){
temp_matrix = read.table(snakemake#input[[i]][1], header=T, stringsAsFactors = F)
cell_barcodes = colnames(temp_matrix)[-1]
colnames(temp_matrix) = c("GENE",paste(samples[i], cell_barcodes, sep = "_"))
read_list=c(read_list, list(temp_matrix))
}
# Little function that allows to merge unequal matrices
merge.all <- function(x, y) {
merge(x, y, all=TRUE, by="GENE")
}
read_counts <- Reduce(merge.all, read_list)
read_counts[is.na(read_counts)] = 0
rownames(read_counts) = read_counts[,1]
read_counts = read_counts[,-1]
write.table(read_counts, file=snakemake#output[[1]], sep='\t')
The "clean" way to do it would be to call snakemake#wildcard.sample to attribute sample names to the script. But for some reason snakemake#wildcards is an empty vector.
In python:
print(type(snakemake.wildcards))
print(snakemake.wildcards)
print('done')
gives:
<class 'snakemake.io.Wildcards'>
done
which means it's also empty.
So right now I have to rely on getting back to the samples.csv file and getting the sample names there. I will also have to double check matching indexes maybe using greps, don't want the samples and the files to get mixed up.
Any idea why this is happening?
Update:
I've tried adding the sample_name as params to see if this would work and it actually does.
rule merge_umi:
input:
expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
params:
sample_name = lambda wildcards: samples.index
output:
'summary/umi_expression_matrix.tsv'
script:
"../scripts/merge_counts_single.R"
I'm gonna use this for now, but my guess is there is still an issue with the scope of wildcards in included rules. Or maybe I'm doing it wrong.
The idea of using wildcards is to call a rule for each value in the wildcards. If you use the expand function in the input of a rule, then your rule will take all of the wildcard values and create a list of strings. Which means, your rule will be invoked just for once (not for each wildcard value). Per default, expand uses the python itertools function product that yields all combinations of the provided wildcard values.
By doing so, you cannot use that wildcard inside your rule any longer. Because when that rule is invoked, it gets all of the wildcard values and convert them into a list that will be given to your R script just for once (not for each wildcard value).
In your case, using wildcards is not suitable, since your merge_count rule will be run only for once (not for each wildcard value).