Remapping numpy arrarys to dictionary - numpy

I want to remap a numpy array according to a dictionary.
Let us assume I have a numpy array with N rows and 3 columns. Now I want to remap the values according to its indices which are written in tuples in a dictionary.
This works fine:
import numpy as np
a = np.arange(6).reshape(2,3)
b = np.zeros(6).reshape(2,3)
print a
print A
dictt = { (0,0):(0,2), (0,1):(0,1), (0,2):(0,0), (1,0):(1,2), (1,1):(1,1), (1,2):(1,0) }
for key in dictt:
b[key] = a[dictt[key]]
print b
a = [[0 1 2]
[3 4 5]]
b = [[ 2. 1. 0.]
[ 5. 4. 3.]]
Let us assume I have N rows, where N is an even number. Now I want to apply the same mapping (which are valid for those 2 rows in the upper example) to all the other rows.
Hence I want to have an array from:
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
to:
b = [[ 2. 1. 0.]
[ 5. 4. 3.]]
[[ 8. 7. 6.]
[ 11. 10. 9.]]
Any ideas? I would like to do it fast since these are 192000 entries in each array which should be remapped.

For simplicity I would just use [::-1].
a = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
b = [item[::-1] for item in a]
>>> b
[[2, 1, 0], [5, 4, 3], [8, 7, 6]]

Related

replace element in numpy array pending a different element's value

I have the following numpy array (as an example):
my_array = [[3, 7, 0]
[20, 4, 0]
[7, 54, 0]]
I want to replace the 0's in the 3rd column of each row with a value of 5 only if the first index is odd.
So the expected outcome would be:
my_array = [[3, 7, 5]
[20, 4, 0]
[7, 54, 5]]
I tried numpy.where and numpy.place, but couldn't get the expected results.
Is there an elegant way to do this with numpy functions?
you can do this by indexing as:
my_array[my_array[:, 0] % 2 != 0, 2] = 5
# my_array[:, 0] % 2 != 0 --- Boolean shows modifying rows --> [ True False True]

Find duplicated sequences in numpy.array or pandas column

For example, I have got an array like this:
([ 1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5 ])
I need to find all duplicated sequences , not values, but sequences of at least two values one by one.
The result should be like this:
of length 2: [1, 5] with indexes (0, 16);
of length 3: [3, 3, 7] with indexes (6, 12); [7, 9, 4] with indexes (2, 8)
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
I can do it with pandas.apply function, but it calculates too slow, swifter did not help me.
And in real life I need to find all of them, with length from 10 up to 100 values one by one on database with 1500 columns with 700 000 values each. So i really do need a vectorized decision.
Is there a vectorized decision for finding all at once? Or at least for finding only 10-values sequences? Or only 4-values sequences? Anything, that will be fully vectorized?
One possible implementation (although not fully vectorized) that finds all sequences of size n that appear more than once is the following:
import numpy as np
def repeated_sequences(arr, n):
Na = arr.size
r_seq = np.arange(n)
n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
unique_seqs = np.unique(n_seqs, axis=0)
comp = n_seqs == unique_seqs[:, None]
M = np.all(comp, axis=-1)
if M.any():
matches = np.array(
[np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
)
repeated_inds = np.count_nonzero(matches, axis=-1) > n
repeated_matches = matches[repeated_inds]
idxs = np.argwhere(repeated_matches > 0)[::n]
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
else:
return [], []
return unique_seqs[repeated_inds], grouped_idxs
In theory, you could replace
matches = np.array(
[np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
)
with
matches = scipy.signal.convolve(
M, np.ones((1, n), dtype=int), mode="full"
).astype(int)
which would make the whole thing "fully vectorized", but my tests showed that this was 3 to 4 times slower than the for-loop. So I'd stick with that. Or simply,
matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
which does not have any significant speed-up, since it's basically a hidden loop (see this).
This is based off #Divakar's answer here that dealt with a very similar problem, in which the sequence to look for was provided. I simply made it so that it could follow this procedure for all possible sequences of size n, which are found inside the function with n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]; unique_seqs = np.unique(n_seqs, axis=0).
For example,
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> repeated_seqs, inds = repeated_sequences(a, n)
>>> for i, seq in enumerate(repeated_seqs[:10]):
...: print(f"{seq} with indexes {inds[i]}")
...:
[3 3 7] with indexes [ 6 12]
[7 9 4] with indexes [2 8]
Disclaimer
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
This is not directly taken into account and the sequence [5, 5] would appear more than once according to this algorithm. You could do something like this, based off #Paul's answer here, but it involves a loop:
import numpy as np
repeated_matches = np.array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
>>> print(grouped_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64),
array([ 7, 8, 9, 10], dtype=int64)]
# If there are consecutive numbers in grouped_idxs, that means that there is a long
# sequence that should be excluded. So, you'd have to check for consecutive numbers
filtered_idxs = []
for idx in grouped_idxs:
if not all((idx[1:] - idx[:-1]) == 1):
filtered_idxs.append(idx)
>>> print(filtered_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64)]
Some tests:
>>> n = 3
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> %timeit repeated_sequences(a, n)
414 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> n = 4
>>> a = np.random.randint(0, 10, (10000,))
>>> %timeit repeated_sequences(a, n)
3.88 s ± 54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> result, _ = repeated_sequences(a, n)
>>> result.shape
(2637, 4)
This is not the most efficient implementation by far, but it works as a 2D approach. Plus, if there aren't any repeated sequences, it returns empty lists.
EDIT: Full implementation
I vectorized the routine I added in the Disclaimer section as a possible solution to the long sequence problem and ended up with the following:
import numpy as np
# Taken from:
# https://stackoverflow.com/questions/53051560/stacking-numpy-arrays-of-different-length-using-padding
def stack_padding(it):
def resize(row, size):
new = np.array(row)
new.resize(size)
return new
row_length = max(it, key=len).__len__()
mat = np.array([resize(row, row_length) for row in it])
return mat
def repeated_sequences(arr, n):
Na = arr.size
r_seq = np.arange(n)
n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
unique_seqs = np.unique(n_seqs, axis=0)
comp = n_seqs == unique_seqs[:, None]
M = np.all(comp, axis=-1)
repeated_seqs = []
idxs_repeated_seqs = []
if M.any():
matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
repeated_inds = np.count_nonzero(matches, axis=-1) > n
if repeated_inds.any():
repeated_matches = matches[repeated_inds]
idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
# Additional routine
# Pad this uneven array with zeros so that we can use it normally
grouped_idxs = np.array(grouped_idxs, dtype=object)
padded_idxs = stack_padding(grouped_idxs)
# Find the indices where there are padded zeros
pad_positions = padded_idxs == 0
# Perform the "consecutive-numbers check" (this will take one
# item off the original array, so we have to correct for its shape).
idxs_to_remove= np.pad(
(padded_idxs[:, 1:] - padded_idxs[:, :-1]) == 1,
[(0, 0), (0, 1)],
constant_values=True,
)
pad_positions = np.argwhere(pad_positions)
i = pad_positions[:, 0]
j = pad_positions[:, 1] - 1 # Shift by one (shape correction)
idxs_to_remove[i, j] = True # Masking, since we don't want pad indices
# Obtain a final mask (boolean opposite of indices to remove)
final_mask = ~idxs_to_remove.all(axis=-1)
grouped_idxs = grouped_idxs[final_mask] # Filter the long sequences
repeated_seqs = unique_seqs[repeated_inds][final_mask]
# In order to get the correct indices, we must first limit the
# search to a shape (on axis=1) of the closest multiple of n.
# This will avoid taking more indices than we should to show where
# each repeated sequence begins
to = padded_idxs.shape[1] & (-n)
# Build the final list of indices (that goes from 0 - to with
# a step of n
idxs_repeated_seqs = [
grouped_idxs[i][:to:n] for i in range(grouped_idxs.shape[0])
]
return repeated_seqs, idxs_repeated_seqs
For example,
n = 2
examples = [
# First example is your original example array.
np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Second example has a long sequence of 5's, and since there aren't
# any [5, 5] anywhere else, it's not taken into account and therefore
# should not come out.
np.array([1, 5, 5, 5, 5, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Third example has the same long sequence but since there is a [5, 5]
# later, then it should take it into account and this sequence should
# be found.
np.array([1, 5, 5, 5, 5, 6, 5, 5, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Fourth example has a [5, 5] first and later it has a long sequence of
# 5's which are uneven and the previous implementation got confused with
# the indices to show as the starting indices. In this case, it should be
# 1, 13 and 15 for [5, 5].
np.array([1, 5, 5, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 5, 5, 5, 5, 5]),
]
for a in examples:
print(f"\nExample: {a}")
repeated_seqs, inds = repeated_sequences(a, n)
for i, seq in enumerate(repeated_seqs):
print(f"\t{seq} with indexes {inds[i]}")
Output (as expected):
Example: [1 5 7 9 4 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
[7 9] with indexes [2 8]
[9 4] with indexes [3 9]
Example: [1 5 5 5 5 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
Example: [1 5 5 5 5 6 5 5 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [ 0 16]
[5 5] with indexes [1 3 6]
Example: [1 5 5 9 4 6 3 3 7 9 4 0 3 5 5 5 5 5]
[5 5] with indexes [ 1 13 15]
[9 4] with indexes [3 9]
You can test it out yourself with more examples and more cases. Keep in mind this is what I understood from your disclaimer. If you want to count the long sequences as one, even if multiple sequences are in there (for example, [5, 5] appears twice in [5, 5, 5, 5]), this won't work for you and you'd have to come up with something else.

pandas.reorder_levels with only one index

Pandas offers a feature to reorder index with the reorder_index function :
pandas.DataFrame({"A" : [1, 2, 3], "B" : [4,5,6], "C" : [7,8,9]}).set_index(["A", "B"]).reorder_levels(["B", "A"])
However it doesn't seem to work with single-indexed DataFrames :
pandas.DataFrame({"A" : [1, 2, 3], "B" : [4,5,6]}).set_index("A").reorder_levels(["A"])
Am I doing something incorrectly ?
PS : I know it doesn't make sens to reorder the Index with only one index, however it's a border effect and I usually tend to avoid un-necessary if statements for code clarity.
Doing the following is equivalent to reordering:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}).set_index(["A", "B"])
print(df.reset_index().set_index(['B', 'A']))
Output
C
B A
4 1 7
5 2 8
6 3 9
And it works with a single index:
odf = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}).set_index("A")
print(odf)
Output
B
A
1 4
2 5
3 6

Numpy 3D Array dimensions and slicing

I have trouble understanding the Numpy representation for 3D arrays. I'm used to it being (rows,columns,depth) but with Numpy it seems to be (depth,rows,columns).
E.g.:
c = np.arange(12)
c = c.reshape(2,3,2)
d = c
d = d.reshape(2,2,3)
print(c)
print(d)
[[[ 0 1]
[ 2 3]
[ 4 5]]
[[ 6 7]
[ 8 9]
[10 11]]]
d:
[[[ 0 1 2]
[ 3 4 5]]
[[ 6 7 8]
[ 9 10 11]]]
d is the representation I wish for. Now if I want to access the second 2D array I can write:
print(d[1,:,:])
[[ 6 7 8]
[ 9 10 11]]
So why is the representation so unintuitive for Numpy? And how would I access all uneven indexed(the first,3th,5th ...) 2D arrays of a 3D array given both representations?

Calculate statistics of one numpy array based on the values in a second numpy array

Lets say I have a 2-d numpy array
a = np.array([[1, 1, 2, 2],
[1, 1, 2, 2],
[3, 3, 4, 4],
[3, 3, 4, 4]]
and a 3-d numpy array like
b = np.array([[[1, 2, 8, 8],
[3, 4, 8, 8],
[8, 7, 0, 1],
[6, 5, 3, 2]],
[[1, 1, 1, 3],
[1, 1, 4, 2],
[0, 3, 2, 1],
[3, 2, 3, 9]]])
I want to calculate the statistics (mean, median, majority, sum, count,...) of b according to the "IDs" in a.
Example: sum should result in another array (or a list if that is easier), that gives the sum of the values in b. There are 4 unique "IDs" in a: 1,2,3,4, and 2 'layers' in b. For the 1's in a that is a sum of 10 (layer 0) and 4 (layer 1). For the 2's
it's 32 (layer 0) and 10 (layer 1), and so on...
Expected result for sum:
sums = [[1, 10, 4],
[2, 32, 10],
[3, 26, 8],
[4, 6, 15]]
Expected result for mean:
avgs = [[1, 2.5, 1.0 ],
[2, 8.0, 2.5 ],
[3, 6.5, 2.0 ],
[4, 1.5, 3.75]]
My guess, is that there is a handy function in numpy that does that already, but I am not sure what to search for exactly. Any pointers of how to do it, or what to search for, are much appreciated.
Update:
I came up with this for-loop, which is fine for very small arrays. However, my arrays are much larger than 4 by 4 and a faster impementation is needed.
result = []
ids = np.unique(a)
for id in ids:
line = [id]
for band in range(0, b.shape[0]):
cell = b[band][np.where(a == id)]
line.append(cell.mean())
# line.append(cell.min())
# line.append(cell.max())
# line.append(cell.std())
line.append(cell.sum())
line.append(np.median(cell))
result.append(line)
You can try the code below
cal_sums = [[b[j, :, :][np.argwhere(a==i)[:,0],np.argwhere(a==i)[:,1]].sum()
for i in np.unique(a)] for j in range(2)]
cal_mean = [[b[j, :, :][np.argwhere(a==i)[:,0],np.argwhere(a==i)[:,1]].mean()
for i in np.unique(a)] for j in range(2)]
sums = np.zeros((np.unique(a).size, b.shape[0]+1))
means = np.zeros((np.unique(a).size, b.shape[0]+1))
sums[:, 0] , sums[:,1:] = np.unique(a), np.asarray(cal_sums).T
means[:, 0] , means[:,1:] = np.unique(a), np.asarray(cal_mean).T
print(sums)
[[ 1. 10. 4.]
[ 2. 32. 10.]
[ 3. 26. 8.]
[ 4. 6. 15.]]
print(means)
[[1. 2.5 1. ]
[2. 8. 2.5 ]
[3. 6.5 2. ]
[4. 1.5 3.75]]
I tested it in quite large array size and it is fast
n = 1000
a = np.random.randint(1, 5, size=(n, n))
b = np.random.randint(1, 10, size=(2, n, n))
speed:
377 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)