Find the maximum values of a matrix in rows axis and replace other values to zero - numpy

A = [[2,2,4,2,2,2]
[2,6,2,2,2,2]
[2,2,2,2,8,2]]
I want matrix B to be equal to:
B = [[0,0,4,0,0,0]
[0,6,0,0,0,0]
[0,0,0,0,8,0]]
So I want to find the maximum value of each row and replace other values with 0. Is there any way to do this without using for loops?
Thanks in advance for your comments.

Instead of looking at the argmax, you could take the max values for each row directly, then mask the elements which are lower and replace them with zeros:
Inplace this would look like (here True stands for keepdims=True):
>>> A[A < A.max(1, True)] = 0
>>> A
array([[0, 0, 4, 0, 0, 0],
[0, 6, 0, 0, 0, 0],
[0, 0, 0, 0, 8, 0]])
An out of place alternative is to use np.where:
>>> np.where(A == A.max(1, True), A, 0)
array([[0, 0, 4, 0, 0, 0],
[0, 6, 0, 0, 0, 0],
[0, 0, 0, 0, 8, 0]])

Related

Standard implementation of vectorize_sequences

In François Chollet's Deep Learning with Python, appears this function:
def vectorize_sequences(sequences, dimension=10000):
results = np.zeros((len(sequences), dimension))
for i, sequence in enumerate(sequences):
results[i, sequence] = 1.
return results
I understand what this function does. This function is asked about in this quesion and in this question as well, also mentioned here, here, here, here, here & here. Despite being so wide-spread, this vectorization is, according to Chollet's book is done "manually for maximum clarity." I am interested whether there is a standard, not "manual" way of doing it.
Is there a standard Keras / Tensorflow / Scikit-learn / Pandas / Numpy implementation of a function which behaves very similarly to the function above?
Solution with MultiLabelBinarizer
Assuming sequences is an array of integers with maximum possible value upto dimension-1, we can use MultiLabelBinarizer from sklearn.preprocessing to replicate the behaviour of the function vectorize_sequences
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=range(dimension))
mlb.fit_transform(sequences)
Solution with Numpy broadcasting
Assuming sequences is an array of integers with maximum possible value upto dimension-1
(np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
Worked out example
>>> sequences
[[4, 1, 0],
[4, 0, 3],
[3, 4, 2]]
>>> dimension = 10
>>> mlb = MultiLabelBinarizer(classes=range(dimension))
>>> mlb.fit_transform(sequences)
array([[1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0, 0, 0, 0, 0]])
>>> (np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
array([[0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0, 0, 0, 0, 0]])

how to get row indices where row slice contains a single value (0)

With the numpy array
arr = np.array([[1, 1, 0, 0, 0, 1], [1, 1, 0, 0, 1, 1], [1, 1, 0, 0, 0, 1]])
I would like to get the indices of all rows where the row slice 2:5 contains all zeros.
In the above example, it should return rows 0 and 2.
I tried:
zero_indices = np.where(not np.any(arr[:,2:5]))
but it doesn't seem to work.
I'm trying to do this over a large array with several million rows.
Try this
np.nonzero((~arr[:,2:5].astype(bool)).all(1))[0]
Out[133]: array([0, 2], dtype=int32)
Or
np.nonzero((arr[:,2:5] == 0).all(1))[0]
Out[139]: array([0, 2], dtype=int32)

Add a "frame" of zeros around a matrix in numpy python

I have a matrix of size (456, 456). I would like to make it of size (460, 460) but adding a frame of two zeros all around it.
Here is an example with a smaller matrix. I would like to transform matrixsmall into matrixbig. What is the best way to do it? The original code operates on lots of data to it would be great to have an efficient solution. Thank you in advance for your help!
import numpy as np
matrixsmall = np.array([[1,2],[2,1]])
matrixbig = np.array([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 1, 2, 0, 0],
[0, 0, 2, 1, 0, 0],
[0 ,0 ,0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]])
np.pad(matrixsmall, (2,2), "constant", constant_values=(0,0))
will do the trick

how to create 0-1 matrix from probability matrix

(In any language) For a research project, I am stuck on how to convert a matrix P of probability values to a matrix A such that A_ij = 1 with probability P_ij and 0 otherwise? I have looked through various random number generator documentations, but have been unable to figure out how to do this.
If I understand correctly:
In [11]: p = np.random.uniform(size=(5,5))
In [12]: p
Out[12]:
array([[ 0.45481883, 0.21242567, 0.3124863 , 0.00485797, 0.31970718],
[ 0.91995847, 0.29907277, 0.59154085, 0.85847147, 0.13227595],
[ 0.91914631, 0.5495813 , 0.58648856, 0.08037582, 0.23005148],
[ 0.12464628, 0.70657028, 0.75975869, 0.77632964, 0.24587041],
[ 0.69259133, 0.183515 , 0.65500547, 0.19526148, 0.26975325]])
In [13]: a = (p.round(1)==0.7).astype(np.int8)
In [14]: a
Out[14]:
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[1, 0, 1, 0, 0]], dtype=int8)

Compare all values in one column with all values in another column and return indexes

I am interested in comparing all values from 1 dataframe column with all values from a 2nd column and then generating a list or a subset df with values from a 3rd column that is adjacent to the 1st column hits. Hopefully this example will explain it better:
For a simplified example, say I generate the following pandas dataframe:
fake_df=pd.DataFrame({'m':[100,120,101,200,201,501,350,420,525,500],
'n':[10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0],
'mod':[101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001]})
print fake_df
What I am interested in doing is finding all values in column 'm' that are within 0.1 of any value in
column 'mod' and return the values in column 'n' that correspond to the column 'm' hits. So for the above code, the return would be:
10.2, 2.0, 1.1
(since 101,201 and 501 all have close hits in column 'mod').
I have found ways to compare across the same row, but not like above. Is there a way to do this in pandas without extensive loops?
Thanks!
I don't know such method in pandas, but when you extend your scope to include
numpy, two options come to mind.
Easy/Expensive Method
If you can live with N**2 memory overhead, you can do numpy broadcasting to
find out all "adjacent" elements in one step:
In [25]: fake_df=pd.DataFrame({'m':[100,120,101,200,201,501,350,420,525,500],
'n':[10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0],
'mod':[101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001]})
In [26]: mvals = fake_df['m'].values
In [27]: modvals = fake_df['mod'].values
In [28]: is_close = np.abs(mvals - modvals[:, np.newaxis]) <= 0.1; is_close.astype(int)
Out[28]:
array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]])
As we care only about 'mod' values that have adjacent 'm's, aggregate over axis=0:
In [29]: is_close.any(axis=0).astype(int)
Out[29]: array([0, 0, 1, 0, 1, 1, 0, 0, 0, 0])
Or otherwise
In [30]: fake_df.ix[is_close.any(axis=0), 'n']
Out[30]:
2 10.2
4 2.0
5 1.1
Name: n, dtype: float64
Efficient/Complex Method
To find adjacent elements in less than O(N**2) without any hashing/rounding
tricks, you have to do some sorting:
In [103]: modvals_sorted = np.sort(modvals)
In [104]: next_indices = np.searchsorted(modvals_sorted, mvals)
You have indices of next elements, but they may point beyond the original
array, so you need a one extra NaN at the end to avoid IndexError. Same
logic applies to previous elements which are next_indices - 1: to avoid
indexing before the first element we must prepend one NaN, too. Note the + 1 that arises because one of NaN has been added to the beginning.
In [105]: modvals_sorted_plus = np.r_[np.nan, modvals_sorted, np.nan]
In [106]: nexts = modvals_sorted_plus[next_indices + 1]
In [107]: prevs = modvals_sorted_plus[(next_indices - 1) + 1]
Now it's trivial. Note that we already have prevs <= mvals <= nexts, so we
don't need to use np.abs. Also, all missing elements are NaN and comparing with them results in False that doesn't alter the result of any operation.
In [108]: adjacent = np.c_[prevs, mvals, nexts]; adjacent
Out[108]:
array([[ nan, 100. , 101.001],
[ 102.001, 120. , 121.001],
[ nan, 101. , 101.001],
[ 121.001, 200. , 201.001],
[ 121.001, 201. , 201.001],
[ 421.001, 501. , 501.001],
[ 202.001, 350. , 351.001],
[ 351.001, 420. , 421.001],
[ 502.001, 525. , 526.001],
[ 421.001, 500. , 501.001]])
In [109]: (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
Out[109]: array([False, False, True, False, True, True, False, False, False, False], dtype=bool)
In [110]: mask = (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
In [112]: fake_df.ix[mask, 'n']
Out[112]:
2 10.2
4 2.0
5 1.1
Name: n, dtype: float64
Try the following:
# I assume all arrays involved to be or to be converted to numpy arrays
import numpy as np
m = np.array([100,120,101,200,201,501,350,420,525,500])
n = np.array([10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0])
mod = np.array([101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001])
res = []
# for each entry in mod, look in m for "close" values
for i in range(len(mod)):
# for each hit, store entry from n in result list
res.extend(n[np.fabs(mod[i]-m)<=0.1])
# cast result to numpy array
res = np.array(res)
print res
The output is
[ 10.2 2. 1.1]
I'll be making of numpy (imported as np) which pandas uses under the hood. np.isclose returns a boolean indexer: for each value of the iterable, there's a True or False value corresponding to the value m being within atol of each value of df["mod"].
>>> for i, m in df["m"].iteritems():
... indices = np.isclose(m, df["mod"], atol=0.1)
... if any(indices):
... print df["n"][i]
Using the DataFrame you gave produces the output:
10.2
2.0
1.1