Reading TTree Friend with uproot - numpy

Is there an equivalent of TTree::AddFriend() with uproot ?
I have 2 parallel trees in 2 different files which I'd need to read with uproot.iterate and using interpretations (setting the 'branches' option of uproot.iterate).
Maybe I can do that by manually obtaining several iterators from iterate() calls on the files, and then calling next() on each iterators... but maybe there's a simpler way akin to AddFriend ?
Thanks for any hint !
edit: I'm not sure I've been clear, so here's a bit more details. My question is not about usage of arrays, but about how to read them from different files. Here's a mockup of what I'm doing :
# I will fill this array and give it as input to my DNN
# it's very big so I will fill it in place
bigarray = ndarray( (2,numentries),...)
# get a handle on a tree, just to be able to build interpretations :
t0 = .. first tree in input_files
interpretations = dict(
a=t0['a'].interpretation.toarray(bigarray[0]),
b=t0['b'].interpretation.toarray(bigarray[1]),
)
# iterate with :
uproot.iterate( input_files, treename,
branches = interpretations )
So what if a and b belong to 2 trees in 2 different files ?

In array-based programming, friends are implicit: you can JOIN any two columns after the fact—you don't have to declare them as friends ahead of time.
In the simplest case, if your arrays a and b have the same length and the same order, you can just use them together, like a + b. It doesn't matter whether a and b came from the same file or not. Even if I've if these is jagged (like jets.phi) and the other is not (like met.phi), you're still fine because the non-jagged array will be broadcasted to match the jagged one.
Note that awkward.Table and awkward.JaggedArray.zip can combine arrays into a single Table or jagged Table for bookkeeping.
If the two arrays are not in the same order, possibly because each writer was individually parallelized, then you'll need some column to act as the key associating rows of one array with different rows of the other. This is a classic database-style JOIN and although Uproot and Awkward don't provide routines for it, Pandas does. (Look up "merging, joining, and concatenating" in the Pandas documenting—there's a lot!) You can maintain an array's jaggedness in Pandas by preparing the column with the awkward.topandas function.
The following issue talks about a lot of these things, though the users in the issue below had to join sets of files, rather than just a single tree. (In principle, a process would have to look ahead to all the files to see which contain which keys: a distributed database problem.) Even if that's not your case, you might find more hints there to see how to get started.
https://github.com/scikit-hep/uproot/issues/314

This is how I have "friended" (befriended?) two TTree's in different files with uproot/awkward.
import awkward
import uproot
iterate1 = uproot.iterate(["file_with_a.root"]) # has branch "a"
iterate2 = uproot.iterate(["file_with_b.root"]) # has branch "b"
for array1, array2 in zip(iterate1, iterate2):
# join arrays
for field in array2.fields:
array1 = awkward.with_field(array1, getattr(array2, field), where=field)
# array1 now has branch "a" and "b"
print(array1.a)
print(array1.b)
Alternatively, if it is acceptable to "name" the trees,
import awkward
import uproot
iterate1 = uproot.iterate(["file_with_a.root"]) # has branch "a"
iterate2 = uproot.iterate(["file_with_b.root"]) # has branch "b"
for array1, array2 in zip(iterate1, iterate2):
# join arrays
zippedArray = awkward.zip({"tree1": array1, "tree2": array2})
# zippedArray. now has branch "tree1.a" and "tree2.b"
print(zippedArray.tree1.a)
print(zippedArray.tree2.b)
Of course you can use array1 and array2 together without merging them like this. But if you have already written code that expects only 1 Array this can be useful.

Related

filtering "events" in awkward-array

I am reading data from a file of "events". For each event, there is some number of "tracks". For each track there are a series of "variables". A stripped down version of the code (using awkward0 as awkward) looks like
f = h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r")
afile = awkward.hdf5(f)
pocaz = np.asarray(afile["poca_z"].astype(dtype_X))
pocaMx = np.asarray(afile["major_axis_x"].astype(dtype_X))
pocaMy = np.asarray(afile["major_axis_y"].astype(dtype_X))
pocaMz = np.asarray(afile["major_axis_z"].astype(dtype_X))
In this snippet of code, "pocaz", "pocaMx", etc. are what I have called variables (a physics label, not a Python data type). On rare occasions, pocaz takes on an extreme value, pocaMx and/or pocaMy take on nan values, and/or pocaMz takes on the value inf. I would like to remove these tracks from the events using some syntactically simple method. I am guessing this functionality exists (perhaps in the current version of awkward but not awkward0), but cannot find it described in a transparent way. Is there a simple example anywhere?
Thanks,
Mike
It looks to me, from the fact that you're able to call np.asarray on these arrays without error, that they are one-dimensional arrays of numbers. If so, then Awkward Array isn't doing anything for you here; you should be able to find the one-dimensional NumPy arrays inside
f["poca_z"], f["major_axis_x"], f["major_axis_y"], f["major_axis_z"]
as groups (note that this is f, not afile) and leave Awkward Array entirely out of it.
The reason I say that is because you can use np.isfinite on these NumPy arrays. (There's an equivalent in Awkward v1, v2, but you're talking about Awkward v0 and I don't remember.) That will give you an array of booleans for you to slice these arrays.
I don't have the HDF5 file for testing, but I think it would go like this:
f = h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r")
pocaz = np.asarray(a["poca_z"]["0"], dtype=dtype_X)
pocaMx = np.asarray(a["major_axis_x"]["0"], dtype=dtype_X) # the only array
pocaMy = np.asarray(a["major_axis_y"]["0"], dtype=dtype_X) # in each group
pocaMz = np.asarray(a["major_axis_z"]["0"], dtype=dtype_X) # is named "0"
good = np.ones(len(pocaz), dtype=bool)
good &= np.isfinite(pocaz)
good &= np.isfinite(pocaMx)
good &= np.isfinite(pocaMy)
good &= np.isfinite(pocaMz)
pocaz[good], pocaMx[good], pocaMy[good], pocaMz[good]
If you also need to cut extreme finite values, you can include
good &= (-1000 < pocaz) & (pocaz < 1000)
etc. in the good selection criteria.
(The way you'd do this in Awkward Array is not any different, since Awkward is just generalizing what NumPy does here, but if you don't need it, you might as well leave it out.)
If you want numpy arrays, why not read the data with h5py functions? It provides a very natural way to return the datasets as arrays. Code would look like this. (FYI, I used the file context manager to open the file.)
with h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r") as h5f:
# the [()] returns the dataset as an array:
pocaz_arr = h5f["poca_z"]["0"][()]
# verify array shape and datatype:
print(f"Shape: {pocaz_arr.shape}, Dtype: {poca_z_arr.dtype})")
pocaMx_arr = h5f["major_axis_x"]["0"][()] # the only dataset
pocaMy_arr = h5f["major_axis_y"]["0"][()] # in each group
pocaMz_arr = h5f["major_axis_z"]["0"][()] # is named "0"

Nextflow - How to avoid random sample IDs for input files in two or more channels with "Join" or similar operator?

I have implemented some NGS data analysis workflows with Nextflow. I used "Paired End" channels (fromFilePairs method) for some of my workflow processes. I ran into a problem I wasn't expecting after multiple workflow executions : my samples ID would sometimes be mixed, resulting in inaccurate outputs for the processes where it happened. I think this is related to the Non-deterministic input channels issue (https://www.nextflow.io/blog/2019/troubleshooting-nextflow-resume.html).
Let's suppose I apply my worklow on these paired-end files : sample1_R{1,2}.fastq, sample2_R{1,2}.fastq
process Step1 {
input:
tuple pair_ID, file(A) from channelA
tuple pair_ID, file(B) from channelB
tuple pair_ID, file(C) from channelC
...
}
For this kind of process with more than one "tuple pair_ID" as input, the data pair_ID (= my samples names) can be mixed up and my process would end up using randomly input files A and B of the sample1, and the input file C of the sample2 instead of all files (A,B,C) of the same pair_ID (key = only sample1 or only sample2).
I had this randomly mixed input filenames issue (which impacted the outputs) after several workflow executions, after using -resume when an error occurred but also after full successful workflow runs.
In order to have the same key (pair_ID) between the input files emitted by each of the 3 channels, I used the join operator:
Process Step1 {
input:
tuple pair_ID, file(A), file(B), file(C) from channelA.join(channelB).join(channelC)
...
}
This operator seems to make everything work as expected, I don't see any mix in my sample IDs and in my final outputs. In the doc (https://www.nextflow.io/docs/latest/operator.html?highlight=join#join), join seems to be suited for a 2 channels use only, so I am unsure if I am using it right for 3 channels.
Is my method using join legit ? Or does it still have some flaws ?
Is there a better way to correct my issue ?
If I am unsure that this method is correct to avoid any mix in my samples ID, I might change to another workflow management system such as Snakemake but I would really like to solve this issue and to continue using Nextflow.
Thank you in advance, don't hesitate if something isn't clear !
As you have discovered, you should avoid using the same variable name (pair_ID) more than once in your input block. Using the same variable name does not guarantee the inputs will be joined up using this key. I imagine that whatever value you get for pair_ID from one input channel will just get clobbered by the pair_ID you get from one of your other input channels. You have also discovered that when you declare two or more input channels, the overall input ordering may not be consistent across multiple executions (like when using the -resume).
To join two or more channels with a common key, you can simply use the join operator:
join
The join operator creates a channel that joins together the items
emitted by two channels for which exits a matching key. The key is
defined, by default, as the first element in each item emitted.
Note that the join operator creates (returns) a new channel. Therefore, this:
joined = channelA.join(channelB).join(channelC)
Is functionally the same as:
temp = channelA.join(channelB)
joined = temp.join(channelC)

Best way to insert values of 3D array inside of another larger array

There must be some 'pythonic' way to do this, but I don't think np.place, np.insert, or np.put are what I'm looking for. I want to replace the values inside of a large 3D array A with those of a smaller 3D array B, starting at location [i,j,k] in the larger array. See drawing:
I want to type something like A[i+, j+, k+] = B, or np.embed(B, A, (i,j,k)) but of course those are not right.
EDIT: Oh, there is this. So I should modify the question to ask if this is the best way (where "best" means fastest for a 500x500x50 array of floats on a laptop):
s0, s1, s2 = B.shape
A[i:i+s0, j:j+s1, k:k+s2] = B
Your edited answer looks fine for the 3D case.
If you want the "embed" function you mentioned in the original post, for arrays of any number of dimensions, the following should work:
def embed( small_array, big_array, big_index):
"""Overwrites values in big_array starting at big_index with those in small_array"""
slices = [np.s_[i:i+j] for i,j in zip(big_index, small_array.shape)]
big_array[slices]=small_array
It may be worth noting that it's not obvious how one would want "embed" to perform in cases where big_array has more dimensions than small_array does. E.g., I could imagine someone wanting a 1:1 mapping from small_array members to overwritten members of big_array (equivalent to adding extra length-1 dimensions to small_array to bring its ndim up to that of big_array), or I could imagine someone wanting small_array to broadcast out to fill the remainder of big_array for the "missing" dimensions of small_array. Anyway, you might want to avoid calling the function in those cases, or to tweak the function to ensure it will do what you want in those cases.

Clearing numerical values in Mathematica

I am working on fairly large Mathematica projects and the problem arises that I have to intermittently check numerical results but want to easily revert to having all my constructs in analytical form.
The code is fairly fluid I don't want to use scoping constructs everywhere as they add work overhead. Is there an easy way for identifying and clearing all assignments that are numerical?
EDIT: I really do know that scoping is the way to do this correctly ;-). However, for my workflow I am really just looking for a dirty trick to nix all numerical assignments after the fact instead of having the foresight to put down a Block.
If your assignments are on the top level, you can use something like this:
a = 1;
b = c;
d = 3;
e = d + b;
Cases[DownValues[In],
HoldPattern[lhs_ = rhs_?NumericQ] |
HoldPattern[(lhs_ = rhs_?NumericQ;)] :> Unset[lhs],
3]
This will work if you have a sufficient history length $HistoryLength (defaults to infinity). Note however that, in the above example, e was assigned 3+c, and 3 here was not undone. So, the problem is really ambiguous in formulation, because some numbers could make it into definitions. One way to avoid this is to use SetDelayed for assignments, rather than Set.
Another alternative would be to analyze the names in say Global' context (if that is the context where your symbols live), and then say OwnValues and DownValues of the symbols, in a fashion similar to the above, and remove definitions with purely numerical r.h.s.
But IMO neither of these approaches are robust. I'd still use scoping constructs and try to isolate numerics. One possibility is to wrap you final code in Block, and assign numerical values inside this Block. This seems a much cleaner approach. The work overhead is minimal - you just have to remember which symbols you want to assign the values to. Block will automatically ensure that outside it, the symbols will have no definitions.
EDIT
Yet another possibility is to use local rules. For example, one could define rule[a] = a->1; rule[d]=d->3 instead of the assignments above. You could then apply these rules, extracting them as say
DownValues[rule][[All, 2]], whenever you want to test with some numerical arguments.
Building on Andrew Moylan's solution, one can construct a Block like function that would takes rules:
SetAttributes[BlockRules, HoldRest]
BlockRules[rules_, expr_] :=
Block ## Append[Apply[Set, Hold#rules, {2}], Unevaluated[expr]]
You can then save your numeric rules in a variable, and use BlockRules[ savedrules, code ], or even define a function that would apply a fixed set of rules, kind of like so:
In[76]:= NumericCheck =
Function[body, BlockRules[{a -> 3, b -> 2`}, body], HoldAll];
In[78]:= a + b // NumericCheck
Out[78]= 5.
EDIT In response to Timo's comment, it might be possible to use NotebookEvaluate (new in 8) to achieve the requested effect.
SetAttributes[BlockRules, HoldRest]
BlockRules[rules_, expr_] :=
Block ## Append[Apply[Set, Hold#rules, {2}], Unevaluated[expr]]
nb = CreateDocument[{ExpressionCell[
Defer[Plot[Sin[a x], {x, 0, 2 Pi}]], "Input"],
ExpressionCell[Defer[Integrate[Sin[a x^2], {x, 0, 2 Pi}]],
"Input"]}];
BlockRules[{a -> 4}, NotebookEvaluate[nb, InsertResults -> "True"];]
As the result of this evaluation you get a notebook with your commands evaluated when a was locally set to 4. In order to take it further, you would have to take the notebook
with your code, open a new notebook, evaluate Notebooks[] to identify the notebook of interest and then do :
BlockRules[variablerules,
NotebookEvaluate[NotebookPut[NotebookGet[nbobj]],
InsertResults -> "True"]]
I hope you can make this idea work.

Obtaining all possible states of an object for a NP-Complete(?) problem in Python

Not sure that the example (nor the actual usecase) qualifies as NP-Complete, but I'm wondering about the most Pythonic way to do the below assuming that this was the algorithm available.
Say you have :
class Person:
def __init__(self):
self.status='unknown'
def set(self,value):
if value:
self.status='happy'
else :
self.status='sad'
... blah . Maybe it's got their names or where they live or whatev.
and some operation that requires a group of Persons. (The key value is here whether the Person is happy or sad.)
Hence, given PersonA, PersonB, PersonC, PersonD - I'd like to end up a list of the possible 2**4 combinations of sad and happy Persons. i.e.
[
[ PersonA.set(true), PersonB.set(true), PersonC.set(true), PersonD.set(true)],
[ PersonA.set(true), PersonB.set(true), PersonC.set(true), PersonD.set(false)],
[ PersonA.set(true), PersonB.set(true), PersonC.set(false), PersonD.set(true)],
[ PersonA.set(true), PersonB.set(true), PersonC.set(false), PersonD.set(false)],
etc..
Is there a good Pythonic way of doing this? I was thinking about list comprehensions (and modifying the object so that you could call it and get returned two objects, true and false), but the comprehension formats I've seen would require me to know the number of Persons in advance. I'd like to do this independent of the number of persons.
EDIT : Assume that whatever that operation that I was going to run on this is part of a larger problem set - we need to test out all values of Person for a given set in order to solve our problem. (i.e. I know this doesn't look NP-complete right now =) )
any ideas?
Thanks!
I think this could do it:
l = list()
for i in xrange(2 ** n):
# create the list of n people
sublist = [None] * n
for j in xrange(n):
sublist[j] = Person()
sublist[j].set(i & (1 << j))
l.append(sublist)
Note that if you wrote Person so that its constructor accepted the value, or such that the set method returned the person itself (but that's a little weird in Python), you could use a list comprehension. With the constructor way:
l = [ [Person(i & (1 << j)) for j in xrange(n)] for i in xrange(2 ** n)]
The runtime of the solution is O(n 2**n) as you can tell by looking at the loops, but it's not really a "problem" (i.e. a question with a yes/no answer) so you can't really call it NP-complete. See What is an NP-complete in computer science? for more information on that front.
According to what you've stated in your problem, you're right -- you do need itertools.product, but not exactly the way you've stated.
import itertools
truth_values = itertools.product((True, False), repeat = 4)
people = (person_a, person_b, person_c, person_d)
all_people_and_states = [[person(truth) for person, truth in zip(people, combination)] for combination in truth_values]
That should be more along the lines of what you mentioned in your question.
You can use a cartesian product to get all possible combinations of people and states. Requires Python 2.6+
import itertools
people = [person_a,person_b,person_c]
states = [True,False]
all_people_and_states = itertools.product(people,states)
The variable all_people_and_states contains a list of tuples (x,y) where x is a person and y is either True or False. It will contain all possible pairings of people and states.