Nextflow - How to avoid random sample IDs for input files in two or more channels with "join" or a similar operator?

I have implemented some NGS data analysis workflows with Nextflow, using "paired-end" channels (the fromFilePairs method) for some of my workflow processes. After multiple workflow executions, I ran into a problem I wasn't expecting: my sample IDs would sometimes get mixed up, resulting in inaccurate outputs for the processes where it happened. I think this is related to the non-deterministic input channels issue (https://www.nextflow.io/blog/2019/troubleshooting-nextflow-resume.html).
Let's suppose I apply my workflow to these paired-end files: sample1_R{1,2}.fastq, sample2_R{1,2}.fastq
process Step1 {
    input:
    tuple pair_ID, file(A) from channelA
    tuple pair_ID, file(B) from channelB
    tuple pair_ID, file(C) from channelC
    ...
}
For this kind of process with more than one "tuple pair_ID" input, the pair_ID values (my sample names) can get mixed up: the process may end up randomly combining input files A and B from sample1 with input file C from sample2, instead of using files (A, B, C) that all share the same pair_ID key (only sample1 or only sample2).
I ran into this randomly mixed input filename issue (which affected the outputs) across several workflow executions, both when using -resume after an error and after complete, successful runs.
In order to have the same key (pair_ID) between the input files emitted by each of the 3 channels, I used the join operator:
process Step1 {
    input:
    tuple pair_ID, file(A), file(B), file(C) from channelA.join(channelB).join(channelC)
    ...
}
This operator seems to make everything work as expected: I no longer see any mixing in my sample IDs or in my final outputs. In the documentation (https://www.nextflow.io/docs/latest/operator.html?highlight=join#join), join seems to be described for use with only two channels, so I am unsure whether I am using it correctly with three.
Is my method using join legitimate? Or does it still have some flaws?
Is there a better way to correct my issue?
If I can't be sure that this method avoids any mixing of my sample IDs, I might switch to another workflow management system such as Snakemake, but I would really like to solve this issue and continue using Nextflow.
Thank you in advance, and don't hesitate to ask if something isn't clear!

As you have discovered, you should avoid using the same variable name (pair_ID) more than once in your input block. Using the same variable name does not guarantee the inputs will be joined up using this key. I imagine that whatever value you get for pair_ID from one input channel will just get clobbered by the pair_ID you get from one of your other input channels. You have also discovered that when you declare two or more input channels, the overall input ordering may not be consistent across multiple executions (such as when using -resume).
To join two or more channels with a common key, you can simply use the join operator:
join
The join operator creates a channel that joins together the items
emitted by two channels for which there exists a matching key. The key is
defined, by default, as the first element in each item emitted.
Note that the join operator creates (returns) a new channel. Therefore, this:
joined = channelA.join(channelB).join(channelC)
Is functionally the same as:
temp = channelA.join(channelB)
joined = temp.join(channelC)
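For example, here is a minimal sketch (with made-up keys and filenames) of what the chained join emits; items are matched on the first element of each tuple:
channelA = Channel.of( ['sample1', 'A1.txt'], ['sample2', 'A2.txt'] )
channelB = Channel.of( ['sample1', 'B1.txt'], ['sample2', 'B2.txt'] )
channelC = Channel.of( ['sample1', 'C1.txt'], ['sample2', 'C2.txt'] )

channelA.join(channelB).join(channelC).view()
// prints (emission order may vary):
// [sample1, A1.txt, B1.txt, C1.txt]
// [sample2, A2.txt, B2.txt, C2.txt]
Each emitted tuple now carries all three files for a single sample, which is what makes the three-channel usage safe.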

Related

Mathematical equations to create a virtual channel in LabVIEW

I need some help in creating a VI that generates virtual or calculated channels based on several channels I measure.
e.g.
I measure voltage on several AI channels, let's say ch A, B, C, D, E, where B, C and E represent current on a shunt, and I would like to calculate the power of the system:
Q[A] = B+C
R[W] = A*Q
S[W] = D*E
T[W] = R+S
I would like to load the equations externally from a configuration file that may vary from one project to another; the equations would come in the form of strings: Q=A+B, R=A*Q, .....
*(during a run, the equation and channel count don't change - only when loading the config).
The main issue I am facing is that the inputs to an equation may depend on virtual channels that do not have data yet.
I was trying to use:
formula nodes / Math scripts: https://zone.ni.com/reference/en-XX/help/371361R-01/lvconcepts/formula_nodes/
https://knowledge.ni.com/KnowledgeArticleDetails?id=kA03q000000x30HCAQ&l=en-IL
All data should be chunked into a data stream (continuous sampling) that can be presented on a chart/graph and saved to CSV/TDMS.
Do I need some additional packages?
I have tried the following, based on the example given, but I am getting strange results.
Answer
The elements you are looking for are not the Formula/Math Nodes but rather the:
Formula Parsing VIs
Using these VIs you are able to pass a calculation in the form of a string, together with an array of variable names, and then evaluate the formula. This allows for run-time formula evaluation, where most other nodes require the formula to be fixed at compile time (with the exception of the Python node).
Example
Example of using a very simple program to evaluate two different calculations using the same values and variables.
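The dependency issue (virtual channels that feed other virtual channels) can be handled by deferring each equation until all of its inputs have data. Below is a minimal sketch of that evaluation order, in Python rather than LabVIEW - all names and values are made up - where the eval() call stands in for a Formula Parsing VI evaluation:

import re

def eval_in_dependency_order(equations, measured):
    # start from the measured channels; a virtual channel is evaluated
    # as soon as every name its expression references has a value
    values = dict(measured)
    pending = {name.strip(): expr.strip()
               for name, expr in (eq.split('=', 1) for eq in equations)}
    while pending:
        progressed = False
        for name, expr in list(pending.items()):
            refs = set(re.findall(r'[A-Za-z_]\w*', expr))
            if refs <= set(values):  # all inputs available
                values[name] = eval(expr, {"__builtins__": {}}, dict(values))
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError("unresolvable dependencies: %s" % list(pending))
    return values

# measured channels A..E and the virtual-channel strings from the config
channels = {'A': 12.0, 'B': 1.5, 'C': 0.5, 'D': 5.0, 'E': 2.0}
config = ['Q = B + C', 'R = A * Q', 'S = D * E', 'T = R + S']
print(eval_in_dependency_order(config, channels))  # Q=2.0, R=24.0, S=10.0, T=34.0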

Reading TTree Friend with uproot

Is there an equivalent of TTree::AddFriend() with uproot?
I have 2 parallel trees in 2 different files which I'd need to read with uproot.iterate, using interpretations (setting the 'branches' option of uproot.iterate).
Maybe I can do that by manually obtaining several iterators from iterate() calls on the files and then calling next() on each iterator... but maybe there's a simpler way, akin to AddFriend?
Thanks for any hint!
edit: I'm not sure I've been clear, so here's a bit more detail. My question is not about usage of arrays, but about how to read them from different files. Here's a mockup of what I'm doing:
# I will fill this array and give it as input to my DNN;
# it's very big so I will fill it in place
bigarray = ndarray( (2,numentries), ... )

# get a handle on a tree, just to be able to build interpretations:
t0 = ...  # first tree in input_files
interpretations = dict(
    a=t0['a'].interpretation.toarray(bigarray[0]),
    b=t0['b'].interpretation.toarray(bigarray[1]),
)

# iterate with:
uproot.iterate(input_files, treename,
               branches=interpretations)
So what if a and b belong to 2 trees in 2 different files ?
In array-based programming, friends are implicit: you can JOIN any two columns after the fact—you don't have to declare them as friends ahead of time.
In the simplest case, if your arrays a and b have the same length and the same order, you can just use them together, like a + b. It doesn't matter whether a and b came from the same file or not. Even if one of these is jagged (like jets.phi) and the other is not (like met.phi), you're still fine, because the non-jagged array will be broadcast to match the jagged one.
Note that awkward.Table and awkward.JaggedArray.zip can combine arrays into a single Table or jagged Table for bookkeeping.
If the two arrays are not in the same order, possibly because each writer was individually parallelized, then you'll need some column to act as the key associating rows of one array with rows of the other. This is a classic database-style JOIN, and although Uproot and Awkward don't provide routines for it, Pandas does. (Look up "merging, joining, and concatenating" in the Pandas documentation - there's a lot!) You can maintain an array's jaggedness in Pandas by preparing the column with the awkward.topandas function.
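For instance, a minimal pandas sketch of such a key-based JOIN, with made-up event numbers acting as the key:

import pandas as pd

# hypothetical flat columns from two files, keyed by event number
left = pd.DataFrame({"event": [0, 1, 2], "a": [1.1, 2.2, 3.3]})
right = pd.DataFrame({"event": [2, 0, 1], "b": [30.0, 10.0, 20.0]})

# database-style JOIN on the shared key, independent of row order
joined = left.merge(right, on="event")
print(joined)  # rows of a and b aligned by event, not by position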
The following issue talks about a lot of these things, though the users there had to join sets of files, rather than just a single tree. (In principle, a process would have to look ahead at all the files to see which contain which keys: a distributed database problem.) Even if that's not your case, you might find more hints there on how to get started.
https://github.com/scikit-hep/uproot/issues/314
This is how I have "friended" (befriended?) two TTrees in different files with uproot/awkward.
import awkward
import uproot

iterate1 = uproot.iterate(["file_with_a.root"])  # has branch "a"
iterate2 = uproot.iterate(["file_with_b.root"])  # has branch "b"

for array1, array2 in zip(iterate1, iterate2):
    # join the arrays by copying each field of array2 into array1
    for field in array2.fields:
        array1 = awkward.with_field(array1, getattr(array2, field), where=field)
    # array1 now has branches "a" and "b"
    print(array1.a)
    print(array1.b)
Alternatively, if it is acceptable to "name" the trees,
import awkward
import uproot

iterate1 = uproot.iterate(["file_with_a.root"])  # has branch "a"
iterate2 = uproot.iterate(["file_with_b.root"])  # has branch "b"

for array1, array2 in zip(iterate1, iterate2):
    # join the arrays by zipping them under per-tree names
    zippedArray = awkward.zip({"tree1": array1, "tree2": array2})
    # zippedArray now has branches "tree1.a" and "tree2.b"
    print(zippedArray.tree1.a)
    print(zippedArray.tree2.b)
Of course, you can use array1 and array2 together without merging them like this. But if you have already written code that expects only one array, this can be useful.

Building relationships in Neo4j via Py2neo is very slow

We have 5 different types of nodes in the database. The largest type has ~290k nodes; the smallest has only ~3k. Each node type has an id field, and they are all indexed. I am using py2neo to build relationships, but it is very slow (~2 relationships inserted per second).
I used pandas to read from a relationship CSV and iterated over each row, creating one relationship per transaction. I tried batching 10k creation statements into one transaction, but it did not seem to improve the speed much.
Below is the code:
import pandas as pd
# graph (a py2neo Graph), datatype and fields are set up earlier (not shown)

df = pd.read_csv(r"C:\relationship.csv", dtype=datatype, skipinitialspace=True, usecols=fields)
df.fillna('', inplace=True)

def f(node_1, rel_type, node_2):
    try:
        tx = graph.begin()
        tx.evaluate('MATCH (a {node_id:$label1}),(b {node_id:$label2}) MERGE (a)-[r:' + rel_type + ']->(b)',
                    parameters={'label1': node_1, 'label2': node_2})
        tx.commit()
    except Exception as e:
        print(str(e))

for index, row in df.iterrows():
    if index % 1000000 == 0:
        print(index)
    try:
        f(row["node_1"], row["rel_type"], row["node_2"])
    except:
        print("error index: " + str(index))
Can someone tell me what I did wrong here? Thanks!
You state that there are "5 different types of nodes" (which I interpret to mean 5 node labels, in neo4j terminology). And, furthermore, you state that their id properties are already indexed.
But your f() function is not generating a Cypher query that uses the labels at all, and neither does it use the id property. In order to take advantage of your indexes, your Cypher query has to specify the node label and the id value.
Since there is currently no efficient way to parameterize the label when performing a MATCH, the following version of the f() function generates a Cypher query that has hardcoded labels (as well as a hardcoded relationship type):
def f(label_1, id_1, rel_type, label_2, id_2):
    try:
        tx = graph.begin()
        tx.evaluate(
            'MATCH' +
            '(a:' + label_1 + '{id:$id1}),' +
            '(b:' + label_2 + '{id:$id2}) ' +
            'MERGE (a)-[r:' + rel_type + ']->(b)',
            parameters={'id1': id_1, 'id2': id_2})
        tx.commit()
    except Exception as e:
        print(str(e))
The code that calls f() will also have to be changed to pass in both the label names and the id values for a and b. Hopefully, your df rows will contain that data (or enough info for you to derive that data).
If your aim is better performance, then you will need to consider a different pattern for loading these, i.e. batching. You're currently running one Cypher MERGE statement for each relationship and wrapping it in its own transaction, in a separate function call.
Batching these, with multiple statements per transaction or per function call, will reduce the number of network hops and should improve performance.
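As a rough sketch of that pattern (assuming, hypothetically, that each batch shares one pair of labels and one relationship type, which still cannot be parameterized), a single UNWIND statement can merge many relationships per network round trip:

def load_batch(graph, rows, label_1, label_2, rel_type, batch_size=10000):
    # one UNWIND query per batch instead of one MERGE per row
    query = (
        'UNWIND $rows AS row '
        'MATCH (a:' + label_1 + ' {id: row.id1}) '
        'MATCH (b:' + label_2 + ' {id: row.id2}) '
        'MERGE (a)-[:' + rel_type + ']->(b)'
    )
    for start in range(0, len(rows), batch_size):
        graph.run(query, rows=rows[start:start + batch_size])

# rows would be dicts built from the dataframe, e.g. {'id1': ..., 'id2': ...}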

Block types in GNU Radio

I am still learning GNU Radio and I have trouble understanding something about signal processing block types. I understand that if I create a block taking, let's say, 2 samples at the input and outputting 4 samples, it will be an interpolator with a factor of 2.
But now I would like to create a block which will be a framer. It will have two inputs and one output. The block will receive n samples from the first input, then take m samples from the second input, append them to the samples received from input one, and then output them. In this case, my samples are supposed to be bytes.
How should I proceed in this case? Am I on the right path? Does anyone know how to handle this type of scenario?
Your case (input 0 and input 1 having different relative rates to the output) is not covered by the sync_block/interpolator/decimator "templates" that GNU Radio has, so you have to use the general block approach.
Assuming you're familiar with gr_modtool¹, you can use it to add things like interpolators (relative rate >1), decimators (<1) and sync (=1) blocks:
-t BLOCK_TYPE, --block-type=BLOCK_TYPE
One of sink, source, sync, decimator, interpolator,
general, tagged_stream, hier, noblock.
But also note the general type. Using that, you can implement a block that doesn't have any restrictions on the relation between input and output. That means that
you will have to manually consume() items from the inputs, because the number of items you took from the input can no longer be derived from the number of output items, and
you will have to implement a forecast method to tell the GNU Radio scheduler how many input items you'll need for a given number of output items.
gr_modtool will give you a stub where you'll only have to add the right code!
¹ if you're not: it's introduced in the GNU Radio Guided Tutorials, part 3 or so, something that I think will be a quick and fun read for you.
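To make that concrete, here is a minimal, illustrative sketch of such a general block in Python (the class name and the fixed n/m frame sizes are made up; a real framer would add its own framing logic):

import numpy as np
from gnuradio import gr

class framer_bb(gr.basic_block):
    """Emit frames of n bytes from input 0 followed by m bytes from input 1."""
    def __init__(self, n=4, m=8):
        gr.basic_block.__init__(self,
                                name="framer_bb",
                                in_sig=[np.uint8, np.uint8],
                                out_sig=[np.uint8])
        self.n = n
        self.m = m

    def forecast(self, noutput_items, ninput_items_required):
        # tell the scheduler how many items each input port must provide
        # before general_work can produce noutput_items
        nframes = max(1, noutput_items // (self.n + self.m))
        ninput_items_required[0] = nframes * self.n
        ninput_items_required[1] = nframes * self.m

    def general_work(self, input_items, output_items):
        frame_len = self.n + self.m
        nframes = min(len(input_items[0]) // self.n,
                      len(input_items[1]) // self.m,
                      len(output_items[0]) // frame_len)
        if nframes == 0:
            return 0
        for i in range(nframes):
            start = i * frame_len
            output_items[0][start:start + self.n] = \
                input_items[0][i * self.n:(i + 1) * self.n]
            output_items[0][start + self.n:start + frame_len] = \
                input_items[1][i * self.m:(i + 1) * self.m]
        # consume each input independently: the rates differ per port
        self.consume(0, nframes * self.n)
        self.consume(1, nframes * self.m)
        return nframes * frame_len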
Considering that the question was asked 4 years ago and that there have been many changes in GNU Radio since then, I want to add to the answer that this is now possible to do with the Patterned Interleaver block.
[image: the Patterned Interleaver block]
This block works the following way: it receives inputs on the ports to the left and outputs a single interleaved pattern on the port to the right. So let's imagine a block with 2 inputs, V1 and V2:
V1 = [0,1,0,0,1,1]
V2 = [1,1,1,0,1,0]
Suppose we want the output to be the first 2 bits of V1, followed by the first 2 bits of V2, followed by the next 2 bits of V1, and then the next 2 bits of V2, and so on... that is, we want the output to be
Vo = [0,1,1,1,0,0,1,0,1,1,1,0].
In order to accomplish this we go to the properties of the Patterned Interleaver block which looks like this:
[image: the Patterned Interleaver properties dialog]
The Pattern field allows us to control the order in which the bits from the input ports are interleaved. By default it is [0,0,1,1], meaning that the block will take 2 bits from input port 0 followed by 2 bits from input port 1. The corresponding output will be
[0,1,1,1,0,0,1,0,1,1,1,0],
that is, the first 2 bits of V1 followed by the first 2 bits of V2 and then the next 2 bits of V1, etc.
Let's see another example. If the Pattern field is set to [0,0,1,1,1,0], the output will be 2 bits from input port 0, followed by 3 bits from input port 1, and then 1 bit from input port 0. At the output we will obtain [0,1,1,1,1,0,0,1,0,1,0,1].
Lastly, the Pattern field also determines the number of input ports. If the Pattern field is [0,0,1,2], we will see that another input port is added to the block.
[image: the Patterned Interleaver block with 3 inputs]
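The first example above can also be reproduced in code rather than GRC; here is a minimal sketch with the same vectors (assuming byte streams):

from gnuradio import gr, blocks

V1 = [0, 1, 0, 0, 1, 1]
V2 = [1, 1, 1, 0, 1, 0]

tb = gr.top_block()
src0 = blocks.vector_source_b(V1)
src1 = blocks.vector_source_b(V2)
# pattern [0,0,1,1]: take 2 items from port 0, then 2 from port 1
interleaver = blocks.patterned_interleaver_bb([0, 0, 1, 1])
sink = blocks.vector_sink_b()

tb.connect(src0, (interleaver, 0))
tb.connect(src1, (interleaver, 1))
tb.connect(interleaver, sink)
tb.run()

print(list(sink.data()))  # expected: [0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]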

Custom EQ AudioUnit on iOS

The only effect AudioUnit on iOS is the "iTunes EQ", which only lets you use EQ presets. I would like to use a customized EQ in my audio graph.
I came across this question on the subject and saw an answer suggesting using this DSP code in the render callback. This looks promising, and people seem to be using it effectively on various platforms. However, my implementation has a ton of noise even with a flat EQ.
Here's my 20-line integration into the "MixerHostAudio" class of Apple's "MixerHost" example application (all in one commit):
https://github.com/tassock/mixerhost/commit/4b8b87028bfffe352ed67609f747858059a3e89b
Any ideas on how I could get this working? Any other strategies for integrating an EQ?
Edit: here's an example of the distortion I'm experiencing (with the EQ flat):
http://www.youtube.com/watch?v=W_6JaNUvUjA
In the code in EQ3Band.c, the filter coefficients are used without being initialized. The init_3band_state method initializes just the gains and frequencies; the coefficients themselves - es->f1p0 etc. - are not initialized and therefore contain garbage values. That might be the reason for the bad output.
This code seems wrong in more than one way.
A digital filter is normally represented by the filter coefficients, which are constant; the filter's inner state history (since in most cases the output depends on history); and the filter topology, which is the arithmetic used to calculate the output given the input and the filter (coefficients + state history). In most cases, and certainly when filtering audio data, you expect to get 0's at the output if you feed 0's to the input.
The problems in the code you linked to:
The filter coefficients are changed in each call to the processing method:
es->f1p0 += (es->lf * (sample - es->f1p0)) + vsa;
The input sample is usually multiplied by the filter coefficients, not added to them. It doesn't make any physical sense - the sample and the filter coefficients don't even have the same physical units.
If you feed in 0's, you do not get 0's at the output, just some values which do not make any sense.
I suggest you look for different code; the other option is debugging this one, and that would be harder.
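For contrast, here is a minimal sketch (in Python for readability, not the linked C code) of the structure described above: a biquad in direct form II transposed, where the coefficients stay constant, only the state history changes per sample, and zeros in give zeros out. The coefficient values here are illustrative only:

class Biquad:
    def __init__(self, b0, b1, b2, a1, a2):
        # filter coefficients: fixed for the life of the filter
        self.b0, self.b1, self.b2 = b0, b1, b2
        self.a1, self.a2 = a1, a2
        # inner state history: starts at zero
        self.z1 = self.z2 = 0.0

    def process(self, x):
        # the sample is multiplied by coefficients; only the state updates
        y = self.b0 * x + self.z1
        self.z1 = self.b1 * x - self.a1 * y + self.z2
        self.z2 = self.b2 * x - self.a2 * y
        return y

# pass-through filter: zeros in give zeros out; a one passes unchanged
f = Biquad(1.0, 0.0, 0.0, 0.0, 0.0)
print([f.process(x) for x in [0.0, 0.0, 1.0, 0.0]])  # [0.0, 0.0, 1.0, 0.0]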
In addition, you'd benefit from reading about digital filters:
http://en.wikipedia.org/wiki/Digital_filter
https://ccrma.stanford.edu/~jos/filters/