How to efficiently combine N files, two at a time - snakemake

Based on the discussion in another question.
Some tools only accept two input files at a time, yet producing the final merged output requires combining N files. Examples include paste and some bed or vcf tools. Assume a list of samples is present and that the binary operation is associative: (a+b)+c == a+(b+c). The required merged output must be generated by repeatedly combining input and intermediate files. How can you efficiently merge the files?

The two solutions I will present are to sequentially combine input files and to recursively build intermediate files as a binary tree. For each, consider pasting together a few hundred samples with the following start of a snakefile:
ids = list('abcdefghijklmnopqrstuvwxyz')
samples = expand('{id1}{id2}', id1=ids, id2=ids)  # 676 samples, need not be numbers
# aa, ab, ac, .., zz

rule all:
    input: 'merged.txt'

rule generate_data:
    output: 'sample_{sample}.txt'
    shell:
        'echo {wildcards.sample} > {output}'
Sequential Solution
The sequential solution is fairly easy to remember and understand: combine files 1 and 2 into a temporary file, then combine that temporary file with file 3, and so on through file N. You could do this with a run directive and shell commands, but I will present it as a plain shell directive:
rule merge:
    input:
        first_files=expand('sample_{sample}.txt', sample=samples[:2]),
        rest_files=expand('sample_{sample}.txt', sample=samples[2:])
    output: 'merged.txt'
    shell:
        'paste {input.first_files} > {output} \n'
        'for file in {input.rest_files} ; do '
        'paste {output} $file > {output}_tmp \n'
        'mv {output}_tmp {output} \n'
        'done '
Recursive Solution
The general idea behind the recursive solution is to combine files 1 and 2, 3 and 4, 5 and 6, ... in the first step, then combine those intermediate files until one merged file is left. The difficulty is that snakemake evaluates the DAG from the top down and the number of files may not be evenly divisible by 2.
rule merge:
    """Request final output from merged files 0 to N-1."""
    input:
        f'temp_0_{len(samples)-1}'
    output: 'merged.txt'
    shell:
        'cp {input} {output}'

def merge_intermediate_input(wildcards):
    """From start and end indices, request input files. Raises ValueError when indices are equal."""
    start, end = int(wildcards.start), int(wildcards.end)
    if start == end:  # perform link instead
        raise ValueError
    if start + 1 == end:  # base case
        return expand('sample_{sample}.txt',
                      sample=(samples[start], samples[end]))
    # default
    return [f'temp_{start}_{(start+end)//2}', f'temp_{(start+end)//2+1}_{end}']

rule merge_intermediate:
    """Solve subproblem, producing start to end."""
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    shell:
        'paste {input} > {output}'

def merge_base_input(wildcards):
    """Get input sample from index in list."""
    index = int(wildcards.start)
    return f'sample_{samples[index]}.txt'

rule merge_base:
    """Create temporary symbolic link for input file with start==end."""
    input: merge_base_input
    output: temp('temp_{start}_{start}')
    shell:
        'ln -sr {input} {output}'
merge_intermediate solves the subproblem of producing the merged file for samples start through end from the two merged files split halfway. When start == end, the merged file is created as a symbolic link to the corresponding input. When start + 1 == end, the base case merges the two input files at those indices directly. The recursive solution is clearly more code and more complex, but it can be more efficient for long-running or complex merge operations.
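To make the shape of the recursion concrete, here is a small standalone Python sketch (not part of the Snakefile; the helper name show_tree is made up) that prints which temp_{start}_{end} files the rules above would request for five samples:

def show_tree(start, end, depth=0):
    """Print the temp_{start}_{end} files requested for samples start..end."""
    print('  ' * depth + f'temp_{start}_{end}')
    if end - start <= 1:  # merged directly from samples, or a merge_base symlink
        return
    mid = (start + end) // 2
    show_tree(start, mid, depth + 1)
    show_tree(mid + 1, end, depth + 1)

show_tree(0, 4)
# temp_0_4
#   temp_0_2
#     temp_0_1
#     temp_2_2   <- single sample, created as a symlink by merge_base
#   temp_3_4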
Runtime Complexity, Performance
Let each of the N files have k lines and let the merge operation have runtime complexity O(f(n)). In the sequential solution, the temporary file is created N-1 times and its length grows as 2k, 3k, ..., Nk, for a total of k*N*(N+1)/2 - k ~ O(f(k N^2)).
For the recursive solution, each pair of files is joined in the first layer. Each operation requires O(f(2k)) and there are N/2 such operations. Next, each pair of the resulting files is merged at a cost of O(f(4k)) with N/4 operations. Overall, ln(N) layers of merges are required to produce the final output, again with N-1 merge operations in total. The complexity of the entire operation is O(f(k N ln(N))).
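As a rough sanity check, here is a short Python calculation (assuming the cost of a paste is proportional to the number of lines it writes, i.e. f(n) = n, and k = 1) comparing the total lines written by the two strategies for the 676 samples above:

import math

N, k = 676, 1
# Sequential: the running merged file is rewritten at sizes 2k, 3k, ..., Nk.
sequential = sum(k * i for i in range(2, N + 1))      # 228,825 lines
# Recursive: every layer writes roughly all k*N lines, over ~log2(N) layers.
recursive = k * N * math.ceil(math.log2(N))           # 6,760 lines
print(sequential, recursive, round(sequential / recursive))  # ~34x less work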
In terms of overhead, the recursive solution launches N-1 snakemake jobs with any associated calls to the scheduler, activating environments, etc. The sequential version launches a single job and runs everything in a single shell process.
The recursive solution can run with more parallelism: each 'level' of the recursion is independent, allowing up to N/2 jobs to run at once, while the sequential solution has to wait for the result of each previous step. There is an additional challenge with resource estimation for the recursive solution: the first merges process about 2k lines while the last processes about k*N. The resources could be estimated dynamically or, if the merge step doesn't increase the resulting file size (e.g. intersecting regions), set to similar values throughout.
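For example, a hedged sketch of dynamic resource estimation for the merge_intermediate rule above, scaling memory with the number of samples covered by a temp_{start}_{end} span (the 100 MB base and 10 MB per sample are made-up placeholders):

rule merge_intermediate:
    """Solve subproblem, producing start to end, with span-dependent memory."""
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    resources:
        # made-up estimate: base cost plus a per-sample increment
        mem_mb=lambda wildcards: 100 + 10 * (int(wildcards.end) - int(wildcards.start) + 1)
    shell:
        'paste {input} > {output}'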
Conclusion
While the recursive solution offers better asymptotic runtime complexity, it introduces more snakemake jobs, temporary files, and more complex logic. The sequential solution is straightforward and contained in a single job, though it can be roughly N/ln(N) times slower. Quick merge operations can be performed perfectly well with the sequential solution, and the runtime won't be much worse until N is quite large. However, if merging takes tens of minutes or longer, depends on the input file sizes, and produces outputs longer than its inputs (e.g. cat, paste, and similar), the recursive solution may offer better performance and significantly shorter wall-clock time.

Related

Way to force snakemake to re-evaluate dag before checkpoint with --list-input-changes?

I wonder if anyone might have some ideas for a small problem I am having with checkpoints. I am trying to produce a workflow that is robust to changes in the sample list, so that the necessary rules re-run if a sample is removed. However, if I have a successful workflow run but then remove some samples, re-running with -R $(snakemake --list-input-changes) does not detect the input changes in rules completed before the most recent DAG re-evaluation after a checkpoint. Does anyone know how to force snakemake to also check for input changes in rules that run before a checkpoint output is produced? A small example of a use case might be:
import pandas as pd

# index the reference to get a list of all chromosomes
checkpoint index_ref:
    input:
        "reference.fa"
    output:
        "reference.fa.fai"
    shell:
        "samtools faidx {input}"

# make a function to get the chromosome list from the checkpoint output
def get_contigs(wildcards=None):
    with checkpoints.index_ref.get().output[0].open() as index:
        return pd.read_table(index, header=None, usecols=[0]).squeeze("columns")

# calculate depth per chromosome for a set of samples
rule samtools_depth:
    input:
        expand("{sample}.bam", sample=sample_list)
    output:
        "dataset.chr{chrom}.depth"
    shell:
        "samtools depth -r {wildcards.chrom} {input} > {output}"

# combine depths into single output file
rule combine_depths:
    input:
        expand("dataset.chr{chrom}.depth", chrom=get_contigs())
    output:
        "dataset.genome.depth"
    shell:
        "cat {input} > {output}"

rule all:
    input:
        "dataset.genome.depth"
In this case, if the workflow is run successfully and a sample is then removed from sample_list, snakemake -R $(snakemake --list-input-changes) will suggest that the workflow is already complete. If dataset.genome.depth is removed, the same command will suggest that rules samtools_depth and combine_depths both need to be re-run because of the former's input file changes, which is what I would prefer to happen even when downstream outputs are already created.

Recursive input calling in Snakemake rule

I am writing a rule to process some data.
The data in the directory will be something like:
myfirst.trim_1P, myfirst.trim_2P, mysecond.trim_1P, mysecond.trim_2P, ...
rule trim_data:
    input: "{dataset}/{sample}.trim_{r}P"
    output: "{dataset}/{sample}.{r}.fastq"
    params:
        length=14
    shell:
        """
        reformat.sh forcetrimleft="{params.length}" in="{input}" out="{output}"
        """
I have this error:
WorkflowError:
RecursionError: maximum recursion depth exceeded
If building the DAG exceeds the recursion limit
myDir/myfirst.1.trimed.1.trimed.2.trimed.2.trimed.2....
Why does it run recursively when the output is different from the input, and how can I fix it?
This is a wild guess... Maybe the wildcards capture more than they should, since they are interpreted as regular expressions. If {dataset}, {sample} and {r} take a defined list of values, try constraining their scope with:
wildcard_constraints:
    dataset='|'.join([re.escape(x) for x in DATASET]),
    sample='|'.join([re.escape(x) for x in SAMPLE]),
    r='|'.join([re.escape(x) for x in R]),
Where DATASET, SAMPLE and R are lists of values (e.g. R= ['1', '2'])
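For instance, a minimal sketch of how this could look at the top of the Snakefile (the list values here are hypothetical, and re must be imported for re.escape to be available):

import re

DATASET = ['myDir']
SAMPLE = ['myfirst', 'mysecond']
R = ['1', '2']

wildcard_constraints:
    dataset='|'.join([re.escape(x) for x in DATASET]),
    sample='|'.join([re.escape(x) for x in SAMPLE]),
    r='|'.join([re.escape(x) for x in R]),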
I had a similar error (below). It turned out to be caused by the fact that the output file name for this rule was identical to the output of a different rule. Changing the name of the output file fixed the error.
Code:
rule merge_sample_chr:
    input:
        bcftools=config["bcftools"],
        chr_list=expand(config["vcf_dir_ase"]+"/Eagle/all_{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=CHRS)
    params:
        chr_list=expand("I="+config["vcf_dir_ase"]+"/Eagle/all_{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=CHRS)
    output:
        vcf=config["vcf_dir_ase"]+"/Eagle/all_{sample}.vcf.gz"
    shell:
        """
        {input.bcftools} concat {input.chr_list} -Oz -o {output}
        """
Error:
WorkflowError:
RecursionError: maximum recursion depth exceeded in __instancecheck__
If building the DAG exceeds the recursion limit, this is likely due to a cyclic dependency. E.g. you might have a sequence of rules that can generate their own input. Try to make the output files more specific. A common pattern is
Problematic file pattern: /path/samplid_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_.vcf.gz
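To illustrate the kind of collision described above, here is a hypothetical, minimal pair of rules (the rule names and file patterns are made up). Because the shared output pattern also matches merge_chrs' own per-chromosome inputs, Snakemake can keep applying the rule to its own input and recurse, producing names like sample_chr1_chr1_chr1_...; giving the two rules distinct output names breaks the cycle.

rule sort_vcf:
    input: "{sample}.vcf.gz"
    output: "{sample}_sorted.vcf.gz"
    shell: "bcftools sort {input} -Oz -o {output}"

rule merge_chrs:
    input: expand("{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=[1, 2])
    output: "{sample}_sorted.vcf.gz"   # same pattern as sort_vcf's output
    shell: "bcftools concat {input} -Oz -o {output}"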

Reading sequential data from TFRecords files within the TensorFlow graph?

I'm working with video data, but I believe this question should apply to any sequential data. I want to pass my RNN 10 sequential examples (video frames) from a TFRecords file. When I first start reading the file, I need to grab 10 examples and use them to create a sequence-example, which is then pushed onto the queue for the RNN to take when it's ready. However, now that I have the 10 frames, the next time I read from the TFRecords file I only need to take 1 example and shift the other 9 over. But when I hit the end of the first TFRecords file, I need to restart the process on the second TFRecords file. It's my understanding that the cond op will process the ops required under each condition even if that condition is not the one that is to be used. This would be a problem when using a condition to check whether to read 10 examples or only 1. Is there any way to resolve this problem and still get the desired result outlined above?
You can use the recently added Dataset.window() transformation in TensorFlow 1.12 to do this:
filenames = tf.data.Dataset.list_files(...)

# Define a function that will be applied to each filename, and return the sequences in that
# file.
def get_examples_from_file(filename):
    # Read and parse the examples from the file using the appropriate logic.
    examples = tf.data.TFRecordDataset(filename).map(...)

    # Selects a sliding window of 10 examples, shifting along 1 example at a time.
    sequences = examples.window(size=10, shift=1, drop_remainder=True)

    # Each element of `sequences` is a nested dataset containing 10 consecutive examples.
    # Use `Dataset.batch()` and get the resulting tensor to convert it to a tensor value
    # (or values, if there are multiple features in an example).
    return sequences.map(
        lambda d: tf.data.experimental.get_single_element(d.batch(10)))

# Alternatively, you can use `filenames.interleave()` to mix together sequences from
# different files.
sequences = filenames.flat_map(get_examples_from_file)
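A minimal usage sketch under the same assumptions (graph-mode TensorFlow 1.12), pulling one 10-frame window per Session.run call:

iterator = sequences.make_one_shot_iterator()
next_window = iterator.get_next()

with tf.Session() as sess:
    try:
        while True:
            window = sess.run(next_window)  # one sequence of 10 consecutive examples
            # hand `window` to the RNN input pipeline here
    except tf.errors.OutOfRangeError:
        pass  # all files exhausted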

Optimizing Python KD Tree Searches

Scipy (http://www.scipy.org/) offers two KD Tree classes: KDTree and cKDTree.
The cKDTree is much faster, but is less customizable and query-able than the KDTree (as far as I can tell from the docs).
Here is my problem:
I have a list of 3 million 2-dimensional (X, Y) points. I need to return all of the points within a distance of X units of every point.
With the KDTree, there is an option to do just this: KDTree.query_ball_tree(). It generates a list of lists of all the points within X units of every other point. HOWEVER: this list is enormous and quickly fills up my virtual memory (about 744 million items long).
Potential solution #1: Is there a way to parse this list into a text file as it is being written?
Potential solution #2: I have tried using a for loop (for every point in the list) and then finding that single point's neighbors within X units by employing KDTree.query_ball_point(). HOWEVER: this takes forever, as it needs to run the query millions of times. Is there a cKDTree equivalent of this KDTree tool?
Potential solution #3: Beats me, anyone else have any ideas?
From scipy 0.12 on, both KD Tree classes have feature parity. Quoting its announcement:
cKDTree feature-complete
Cython version of KDTree, cKDTree, is now feature-complete. Most operations (construction, query, query_ball_point, query_pairs, count_neighbors and sparse_distance_matrix) are between 200 and 1000 times faster in cKDTree than in KDTree. With very minor caveats, cKDTree has exactly the same interface as KDTree, and can be used as a drop-in replacement.
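So, assuming scipy >= 0.12, a brief sketch of the drop-in replacement (the array pts is a hypothetical stand-in for your point list):

import numpy as np
from scipy.spatial import cKDTree

pts = np.random.rand(3000000, 2)         # hypothetical 2-D point set
tree = cKDTree(pts)                      # same interface as KDTree
neighbours = tree.query_ball_point(pts[:1000], r=0.001)  # list of index lists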
Try using KDTree.query_ball_point instead. It takes a single point, or array of points, and produces the points within a given distance of the input point(s).
You can use this function to perform batch queries. Give it, say, 100000 points at a time, and then write the results out to a file. Something like this:
BATCH_SIZE = 100000
for i in xrange(0, len(pts), BATCH_SIZE):
    neighbours = tree.query_ball_point(pts[i:i+BATCH_SIZE], X)
    # write neighbours to a file...

mahout lucene document clustering howto?

I have read that I can create Mahout vectors from a Lucene index, which can then be used to apply the Mahout clustering algorithms.
http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
I would like to apply the k-means clustering algorithm to the documents in my Lucene index, but it is not clear how I can apply this algorithm (or hierarchical clustering) to extract meaningful clusters from these documents.
This page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means
says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. Are my data points the documents? How can I "declare" that these are my documents (or their vectors) and simply take them and do the clustering?
Sorry in advance for my poor grammar.
Thank you
If you have vectors, you can run KMeansDriver. Here is its help output:
Usage:
  [--input <input> --clusters <clusters> --output <output> --distance <distance>
   --convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
   --vectorClass <vectorClass> --overwrite --help]
Options
  --input (-i) input               The Path for input Vectors. Must be a
                                   SequenceFile of Writable, Vector
  --clusters (-c) clusters         The input centroids, as Vectors. Must be a
                                   SequenceFile of Writable, Cluster/Canopy.
                                   If k is also specified, then a random set
                                   of vectors will be selected and written out
                                   to this path first
  --output (-o) output             The Path to put the output in
  --distance (-m) distance         The Distance Measure to use. Default is
                                   SquaredEuclidean
  --convergence (-d) convergence   The threshold below which the clusters are
                                   considered to be converged. Default is 0.5
  --max (-x) max                   The maximum number of iterations to
                                   perform. Default is 20
  --numReduce (-r) numReduce       The number of reduce tasks
  --k (-k) k                       The k in k-Means. If specified, then a
                                   random selection of k Vectors will be
                                   chosen as the Centroid and written to the
                                   clusters output path.
  --vectorClass (-v) vectorClass   The Vector implementation class name.
                                   Default is SparseVector.class
  --overwrite (-w)                 If set, overwrite the output directory
  --help (-h)                      Print out help
Update: copy the result directory from HDFS to the local fs, then use the ClusterDumper utility to get the clusters and the list of documents in each cluster.
A pretty good howto is here:
integrating apache mahout with apache lucene
@maiky
You can read more about reading the output and using clusterdump utility in this page -> https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper