NumPy: How to assign a subarray directly from step-spaced values

I have 2 global arrays, "tab1" and "tab2", with dimensions of 21x21 and 17x17 respectively.
I would like to assign the block of "tab1" indexed by [15:20,0:7] from the block of "tab2" indexed by [7:17:2,0:7] (so with a step of 2 along the first array dimension). I tried this syntax:
tab1[15:20,0:7] = tab2[7:17:2,0:7]
Unfortunately, this didn't seem to work: it looked as if only the "diagonal" (I mean one-by-one) elements of 15:20 were being set, following the values of "tab2" along [7:17:2].
Is there a way to assign a subarray of "tab1" from another subarray of "tab2" whose indices are step spaced?
If someone could see what's wrong or suggest another method, that would be nice.
UPDATE 1: indeed, from my latest tests it seems to work. But is the same true for the assignment of the block [15:20,15:20]:
tab1[15:20,15:20] = tab2[7:17:2,7:17:2]
??
ANSWER: it seems OK for this block assignment too, sorry.

The assignment works as I expect.
In [1]: arr = np.ones((20,10),int)
The two blocks have the same shape:
In [2]: arr[15:20, 0:7].shape
Out[2]: (5, 7)
In [3]: arr[7:17:2, 0:7].shape
Out[3]: (5, 7)
and assigning something more interesting looks right:
In [4]: arr2 = np.arange(200).reshape(20,10)
In [5]: arr[15:20, 0:7] = arr2[7:17:2, 0:7]
In [6]: arr
Out[6]:
array([[  1,   1,   1,   1,   1,   1,   1,   1,   1,   1],
       ...
       [  1,   1,   1,   1,   1,   1,   1,   1,   1,   1],
       [ 70,  71,  72,  73,  74,  75,  76,   1,   1,   1],
       [ 90,  91,  92,  93,  94,  95,  96,   1,   1,   1],
       [110, 111, 112, 113, 114, 115, 116,   1,   1,   1],
       [130, 131, 132, 133, 134, 135, 136,   1,   1,   1],
       [150, 151, 152, 153, 154, 155, 156,   1,   1,   1]])
I see a (5,7) block of values from arr2, with every other source row skipped (the ones starting 80, 100, ...).
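The block assignment from the update checks out the same way. A minimal sketch, with tab1 and tab2 as stand-in arrays of the question's shapes:

import numpy as np

tab1 = np.ones((21, 21), int)               # stand-in for the 21x21 array
tab2 = np.arange(17 * 17).reshape(17, 17)   # stand-in for the 17x17 array
tab1[15:20, 15:20] = tab2[7:17:2, 7:17:2]   # both sides are (5, 5)
print(tab1[15:20, 15:20].shape)             # (5, 5)

As long as the two slices have the same shape, NumPy copies element for element; the step only changes which rows of tab2 are read.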

Related

Sorting Pandas dataframe by multiple conditions

I have a large dataframe (thousands of rows by hundreds of columns); a short excerpt follows:
import pandas as pd

data = {'Step':['', '', '', 'First', 'First', 'Second', 'Third', 'Second', 'First', 'Second', 'First', 'First', 'Second', 'Second'],
        'Stuff':['tot', 'white', 'random', 7583, 3563, 824, 521, 7658, 2045, 33, 9823, 5, 8090, 51],
        'Mark':['marking', '', '', 1, 5, 5, 5, 1, 27, 27, 1, 6, 1, 9],
        'A':['item_a', 100, 'st1', 142, 2, 2, 2, 100, 150, 105, 118, 118, 162, 156],
        'B':['skill', 66, 'abc', 160, 2, 130, 140, 169, 1, 2, 130, 140, 144, 127],
        'C':['item', 50, 'st1', 2000, 2, 65, 2001, 1999, 1, 2, 2000, 4, 2205, 2222],
        'D':['item_c', 100, 'st1', 433, 430, 150, 170, 130, 1, 2, 300, 4, 291, 606],
        'E':['test', 90, 'st1', 111, 130, 5, 10, 160, 1, 2, 232, 4, 144, 113],
        'F':['done', 80, 'abc', 765, 755, 5, 10, 160, 1, 2, 733, 4, 666, 500],
        'G':['nd', 90, 'mag', 500, 420, 5, 10, 160, 1, 2, 300, 4, 469, 500],
        'H':['prt', 100, 'st1', 999, 200, 5, 10, 160, 1, 2, 477, 4, 620, 7],
        'Name':['NS', '', '', "Pat", "Lucy", "Lucy", "Lucy", "Nick", "Kirk", "Kirk", "Joe", "Nico", "Nico", "Bryan"],
        'Value':[-1, 0, 0, 0, 3, 6, 5, 0, 7, 7, 0, 6, 0, 1]}
df = pd.DataFrame(data)
I need to sort this dataframe according to the following conditions, all of which have to be satisfied together:

- In the "Name" column, identical names must stay grouped (e.g. the 3 records of "Lucy" are next to each other and cannot be moved apart).
- Within each group of names, the order of appearance must remain the one given by the "Step" column (e.g. the first appearance of "Lucy" corresponds to "First" in the "Step" column, the second to "Second", and so on).
- All the remaining names that have a value of 0 in the "Value" column must be moved below the others (e.g. "Pat" can be moved down, but not "Nico", because there are two records of "Nico" and the other one has a value of 6).
- The first three rows cannot be moved.
What I have done is to concatenate different sub-dataframes:
df_groupnames=df[df.duplicated(subset=['Name'], keep=False)]
df_nogroup = df[~df.duplicated(subset=['Name'], keep=False)]
df_nogroup_high = df_nogroup[df_nogroup["Value"] > 0 ]
df_nogroup_null = df_nogroup[df_nogroup["Value"] == 0]
# Let's concatenate these dataframes to get the sorted one
df_sorted = pd.concat([df_groupnames, df_nogroup_high, df_nogroup_null])
It works, but I wonder if there is a smarter, simpler, and possibly faster way to obtain the same result.
Thank you for your attention.
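One possible simplification, sketched under the assumption that the grouped names are already contiguous in df (as they are in the excerpt): compute a sort band per row and do a single stable sort instead of three concatenations.

import numpy as np

header = df.iloc[:3]                    # rule 4: the first three rows stay put
body = df.iloc[3:]
grouped = body.duplicated(subset=['Name'], keep=False)
# band 0: grouped names, band 1: singles with Value > 0, band 2: singles with Value == 0
band = np.where(grouped, 0, np.where(body['Value'] > 0, 1, 2))
# a stable sort keeps the original (and hence "Step") order inside each band
df_sorted = pd.concat([header, body.iloc[np.argsort(band, kind='stable')]])

This produces the same ordering as the concat approach while touching the frame only once.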

Numpy array changes shape when accessing with indices

I have a small matrix A with dimensions MxNxO.
I have a large matrix B with dimensions KxMxNxP, with P > O.
I have a vector ind of indices with dimension Ox1.
I want to do:
B[1,:,:,ind] = A
But the left-hand side of my assignment,
B[1,:,:,ind].shape
has dimensions Ox1xMxN, and therefore I cannot broadcast A (MxNxO) into it.
Why does accessing B in this way change the dimensions of the left side?
How can I easily achieve my goal?
Thanks
There's a feature, if not a bug: when slices are mixed in the middle of advanced indexing, the sliced dimensions are moved to the end of the result.
Thus for example:
In [204]: B = np.zeros((2,3,4,5),int)
In [205]: ind=[0,1,2,3,4]
In [206]: B[1,:,:,ind].shape
Out[206]: (5, 3, 4)
The 3 and 4 dimensions have been placed after the ind dimension, 5.
We can get around that by indexing first with 1, and then the rest:
In [207]: B[1][:,:,ind].shape
Out[207]: (3, 4, 5)
In [208]: B[1][:,:,ind] = np.arange(3*4*5).reshape(3,4,5)
In [209]: B[1]
Out[209]:
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]],

       [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])
This only works when that first index is a scalar. If it too were a list (or array), we'd get an intermediate copy and couldn't set values like this.
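To illustrate that caveat, a small sketch of the list-index case; the chained assignment silently lands in a temporary copy and the original array is unchanged:

import numpy as np

B = np.zeros((2, 3, 4, 5), int)
ind = [0, 1, 2, 3, 4]
B[[0, 1]][:, :, ind] = 1   # B[[0, 1]] is advanced indexing, so it returns a copy
print(B.sum())             # 0 -- the assignment modified only the temporary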
https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
It has come up in other SO questions, though not recently:
weird result when using both slice indexing and boolean indexing on a 3d array

Filter sequence items in TensorFlow

I have a tensor of allowed items
index = tf.constant([61, 215, 23, 18, 241, 125])
and need to remove items from input sequence batches that are not in index.
seq = tf.constant(
    [
        [ 18, 241,   0,   0],
        [125,  61,  23, 241],
        [ 23,  92,  18,   0],
        [  5,  61, 215,  18],
    ]
)
After the calculation, in this case I need:
result_needed = tf.constant(
    [
        [ 18, 241,   0,   0],
        [125,  61,  23, 241],
        [ 23,  18,   0,   0],
        [ 61, 215,  18,   0],
    ]
)
I cannot do this in Python because this calculation happens during prediction. Also note that while the item IDs here are small, the solution needs to handle numbers from 1 to 2^40.
Answer
After some serious pondering time, I came up with the following:
idx_range = tf.reshape(tf.range(seq.shape[-2]), [-1, 1])   # one index per row
idx_tile = tf.tile(idx_range, [1, seq.shape[-2].value])    # repeat it across the row
idx_flat = tf.reshape(idx_tile, [-1])
truth_value = tf.equal(index, tf.expand_dims(seq, -1))     # compare every item to every allowed id
one_hot = tf.to_float(truth_value)
# 1.0 where an item is allowed; top_k then lists allowed positions first
ones = tf.nn.top_k(tf.reduce_sum(one_hot, -1), seq.shape[-1]).indices
ones_flat = tf.reshape(ones, [-1])
ones_idx = tf.reshape(
    tf.stack([idx_flat, ones_flat], axis=1),
    tf.concat([seq.shape, [2]], axis=0)
)
tf.gather_nd(seq, ones_idx)
This is not exactly what I said I needed, but it actually gets me close enough. Instead of replacing the disallowed items with 0, the output moves them to the end of each row. If you need them gone, I'm sure there's a method to remove them, but I'm not looking into it. Apologies.
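A hedged usage sketch of the above (assumes the TF1-style graph API with tf.Session):

result = tf.gather_nd(seq, ones_idx)
with tf.Session() as sess:
    print(sess.run(result))   # disallowed items land at the end of each row
# e.g. the last row [5, 61, 215, 18] comes back as [61, 215, 18, 5]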

The `out` argument in `numpy.einsum` does not work as expected

I have two pieces of code. The first one is:
A = np.arange(3*4*3).reshape(3, 4, 3)
P = np.arange(1, 4)
A[:, 1:, :] = np.einsum('j, ijk->ijk', P, A[:, 1:, :])
and the resulting A is:
array([[[  0,   1,   2],
        [  6,   8,  10],
        [ 18,  21,  24],
        [ 36,  40,  44]],

       [[ 12,  13,  14],
        [ 30,  32,  34],
        [ 54,  57,  60],
        [ 84,  88,  92]],

       [[ 24,  25,  26],
        [ 54,  56,  58],
        [ 90,  93,  96],
        [132, 136, 140]]])
The second one is:
A = np.arange(3*4*3).reshape(3, 4, 3)
P = np.arange(1, 4)
np.einsum('j, ijk->ijk', P, A[:, 1:, :], out=A[:,1:,:])
and the resulting A is:
array([[[ 0,  1,  2],
        [ 0,  0,  0],
        [ 0,  0,  0],
        [ 0,  0,  0]],

       [[12, 13, 14],
        [ 0,  0,  0],
        [ 0,  0,  0],
        [ 0,  0,  0]],

       [[24, 25, 26],
        [ 0,  0,  0],
        [ 0,  0,  0],
        [ 0,  0,  0]]])
So the results are different. I want to use out here to save memory. Is this a bug in numpy.einsum, or did I miss something?
By the way, my numpy version is 1.13.3.
I haven't used this new out parameter before, but I have worked with einsum in the past and have a general idea of how it works (or at least used to).
It looks to me like it initializes the out array to zero before the start of iteration. That would account for all the 0s in the A[:,1:,:] block. If instead I initialize a separate out array, the desired values are inserted:
In [471]: B = np.ones((3,4,3),int)
In [472]: np.einsum('j, ijk->ijk', P, A[:, 1:, :], out=B[:,1:,:])
Out[472]:
array([[[  3,   4,   5],
        [ 12,  14,  16],
        [ 27,  30,  33]],

       [[ 15,  16,  17],
        [ 36,  38,  40],
        [ 63,  66,  69]],

       [[ 27,  28,  29],
        [ 60,  62,  64],
        [ 99, 102, 105]]])
In [473]: B
Out[473]:
array([[[  1,   1,   1],
        [  3,   4,   5],
        [ 12,  14,  16],
        [ 27,  30,  33]],

       [[  1,   1,   1],
        [ 15,  16,  17],
        [ 36,  38,  40],
        [ 63,  66,  69]],

       [[  1,   1,   1],
        [ 27,  28,  29],
        [ 60,  62,  64],
        [ 99, 102, 105]]])
The Python portion of einsum doesn't tell me much, except how it decides to pass the out array to the C portion (as one of the list of tmp_operands):
c_einsum(einsum_str, *tmp_operands, **einsum_kwargs)
I know that it sets up a C-API equivalent of np.nditer, using the string to define the axes and iterations.
It iterates something like this section in the iteration tutorial:
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.nditer.html#reduction-iteration
Notice in particular the it.reset() step. That sets the out buffer to 0 prior to iteration. It then iterates over the elements of the input and output arrays, writing the calculated values to the output elements. Since it is doing a sum of products (e.g. out[:] += ...), it has to start with a clean slate.
I'm guessing a bit as to what is actually going on, but it seems logical to me that it should zero out the output buffer to start with. If that buffer is the same as one of the inputs, the zeroing will end up corrupting the calculation.
So I don't think this approach will work to save you memory. einsum needs a clean buffer to accumulate the results in. Once that's done, it (or you) can write the values back into A. But given the dot-like nature of the product, you can't use the same array for both input and output.
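A minimal sketch of that hazard, deliberately simplified to plain Python operations (the real C loop differs in detail):

import numpy as np

A = np.arange(3 * 4 * 3).reshape(3, 4, 3)
P = np.arange(1, 4)

out = A[:, 1:, :]                        # out aliases the input, as in the question
out[...] = 0                             # step 1: clear the accumulation buffer
out += P[None, :, None] * A[:, 1:, :]    # step 2: accumulate -- but it now reads zeros
print(A[0])                              # rows 1-3 are all zeros, as in the question

The safe route is to let einsum allocate its own output and assign afterwards: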
In [476]: A[:,1:,:] = np.einsum('j, ijk->ijk', P, A[:, 1:, :])
In [477]: A
Out[477]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [12, 14, 16],
        [27, 30, 33]],
....)
In the C source code for einsum, there is a section that takes the array specified by out and does some zero-setting.
But in the Python source code, for example, there are execution paths that call the tensordot function before ever descending into the C-level c_einsum.
This means that some operations may be pre-computed with tensordot (thus modifying your array A on some contraction passes) before any sub-array is ever set to zero by the zero-setter inside the C code for einsum.
Another way to put it: on each pass at the next contraction operation, NumPy has many choices available. Use tensordot directly, without getting into the C-level einsum code just yet? Prepare the arguments and pass them to the C level (which will involve overwriting some sub-view of the output array with all zeros)? Or re-order the operations and repeat the check?
Depending on the order it chooses for these optimizations, you can end up with unexpected all-zero sub-arrays.
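As a side note (an assumption on my part that it helps here, not something from the question), np.einsum_path in NumPy 1.12+ reports the contraction route the optimizer would take, which can hint at whether such pre-computation is in play:

import numpy as np

A = np.arange(3 * 4 * 3).reshape(3, 4, 3)
P = np.arange(1, 4)
# einsum_path returns the planned operation order plus a printable report
path, report = np.einsum_path('j,ijk->ijk', P, A[:, 1:, :], optimize='optimal')
print(report)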
Your best bet is not to try to be this clever and not to use the same array for the output. You say it is because you want to save memory. Yes, in some special cases an einsum operation might be doable in place, but it does not currently detect whether this is the case and attempt to avoid the zero-setting.
And in a huge number of cases, overwriting one of the input arrays in the middle of the overall operation would cause many problems, much like trying to append to a list you are looping over.

Numpy Array Column Slicing Produces IndexError: invalid index Exception

I am using version 1.5.1 of numpy and Python 2.6.6.
I am reading a binary file into a numpy array:
>>> dt = np.dtype('<u4,<i2,<i2,<i2,<i2,<i2,<i2,<i2,<i2,u1,u1,u1,u1')
>>> file_data = np.fromfile(os.path.join(folder,f), dtype=dt)
This works just fine. Examining the result:
>>> type(file_data)
<type 'numpy.ndarray'>
>>> file_data
array([(3571121L, -54, 103, 1, 50, 48, 469, 588, -10, 0, 102, 0, 0),
(3571122L, -78, 20, 25, 45, 44, 495, 397, -211, 0, 102, 0, 0),
(3571123L, -69, -48, 23, 60, 19, 317, -26, -151, 0, 102, 0, 0), ...,
(3691138L, -53, 52, -2, -11, 76, 988, 288, -101, 1, 102, 0, 0),
(3691139L, -11, 21, -27, 25, 47, 986, 253, 176, 1, 102, 0, 0),
(3691140L, -30, -19, -63, 59, 12, 729, 23, 302, 1, 102, 0, 0)],
dtype=[('f0', '<u4'), ('f1', '<i2'), ('f2', '<i2'), ... , ('f12', '|u1')])
>>> file_data[0]
(3571121L, -54, 103, 1, 50, 48, 469, 588, -10, 0, 102, 0, 0)
>>> file_data[0][0]
3571121
>>> len(file_data)
120020
When I try to slice the first column:
>>> file_data[:,0]
I get:
IndexError: invalid index.
I have looked at simple examples and was able to do the slicing:
>>> a = np.array([(1,2,3),(4,5,6)])
>>> a[:,0]
array([1, 4])
The only difference I can see between my case and the simple example is that I am using a dtype. What am I doing wrong?
When you set the dtype like that, you are creating a structured (record) array. NumPy treats it as a 1D array whose elements have your compound dtype. There's a fundamental difference between
file_data[0][0]
and
file_data[0,0]
In the first, you are asking for the first element of a 1D array and then retrieving the first field of that returned record. In the second, you are asking for the element in the first row of the first column of a 2D array. Since file_data is 1D, that second index is invalid, which is why you are getting the IndexError.
If you want to access individual elements using 2D notation, you can create a view and work with that. Unfortunately, AFAIK, if you want to treat your array as 2D, all the fields have to have the same dtype.
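For reference, a hedged sketch of the usual workarounds (the 'f0' field name is NumPy's default auto-naming, visible in the dtype printed above):

import numpy as np

# Field access works regardless of mixed dtypes and plays the role of a column:
first_col = file_data['f0']      # the moral equivalent of file_data[:, 0]

# A true 2D view only works when every field shares one dtype:
hom = np.zeros(4, dtype='i4,i4,i4')             # three same-typed fields
as_2d = hom.view('i4').reshape(len(hom), -1)    # shape (4, 3)
print(as_2d[:, 0])                              # column slicing now works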