Pandas: Find row with a ndarray - pandas

I am fairly new to pandas.
To find all rows with a certain value, I can run
data[data['category'] == 'name']
which returns a DataFrame of the matching rows, as expected.
One of my columns holds 1x2 NumPy arrays. However, if I do
data[data['list'] == np.array([0, 0])]
I get ValueError: Lengths must match to compare
How would I find the row with a certain numpy array in it?

You can use apply with a lambda function, e.g. df[df.list.apply(lambda x: (x == c).all())]
Ex.:
>>> df
list
0 [0, 0]
1 [1, 1]
2 [0, 0]
3 [1, 0]
>>> c
array([0, 0])
>>> df.list.apply(lambda x: x == c)
0 [True, True]
1 [False, False]
2 [True, True]
3 [False, True]
Name: list, dtype: object
>>> df.list.apply(lambda x: (x == c).all())
0 True
1 False
2 True
3 False
Name: list, dtype: bool
>>> df[df.list.apply(lambda x: (x == c).all())]
list
0 [0, 0]
2 [0, 0]
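For reference, here is a self-contained sketch of the same idea. The data is made up to mirror the example above, and np.array_equal is swapped in for (x == c).all() because it returns a single bool per cell and simply returns False when a cell has a different shape:

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the example above: each cell holds a small array.
df = pd.DataFrame({"list": [np.array([0, 0]), np.array([1, 1]),
                            np.array([0, 0]), np.array([1, 0])]})
c = np.array([0, 0])

# One bool per row: True where the cell equals c element by element.
mask = df["list"].apply(lambda x: np.array_equal(x, c))

# Boolean indexing keeps only the matching rows.
matches = df[mask]
print(matches.index.tolist())  # [0, 2]
```

The mask can be combined with other conditions using & and | like any other boolean Series.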

Related

count specific value over all dataframe in python

I have a large dataframe (235832 rows × 79 columns) that contains genotype data: rows are variants and columns are patients.
I want to search for several values across the whole dataframe (not in a specific column or row).
Specifically, I want to count how many times [-1, -1] or [0, -1] occurs in the entire dataframe. How can I do this in Python?
Example dataframe:
0 1 2 3 ... 78
0 [-1, -1] [0, 0] [0, 0] ... [0 -1]
1 [0 0] [0,0] [-1 -1] ... [0 -1]
and so on up to row 235832.
I want to count the occurrences of [-1, -1] or [0, -1] in the dataframe;
for my example it should return 4.
OK. I think you just need to count the occurrences of [-1, -1] and [0, -1]. I am not sure about the formatting of your cells, but in general:
Suppose your dataframe is named df.
df_melted = df.melt()
Running the line above puts all the values of your original dataframe into ONE column named value. Then you can run
df_melted['value'].value_counts()
This returns the count of every distinct value, including the ones you are looking for. Note that value_counts needs hashable cells, so lists or NumPy arrays would first have to be converted, for example to tuples.
Convert the values to tuples with DataFrame.applymap (deprecated since pandas 2.1 in favor of DataFrame.map), compare them with DataFrame.isin, and sum the True values to count the matches:
import numpy as np
import pandas as pd
a = np.array([0,-1])
b = np.array([-1,-1])
c = np.array([1,-1])
df = pd.DataFrame({'a':[a,b,c],'b':[a,b,a],'c':[c,a,c]})
print (df)
a b c
0 [0, -1] [0, -1] [1, -1]
1 [-1, -1] [-1, -1] [0, -1]
2 [1, -1] [0, -1] [1, -1]
out = df.applymap(tuple).isin([tuple( [0,-1] ), tuple( [-1,-1] )] ).sum().sum()
print (out)
6
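For a dataframe of this size, a pure NumPy comparison may be worth trying as well. The following is only a sketch: it assumes every cell is an array of the same length (2 here), so that all cells can be stacked into one 3-D array, and the names cells, targets and hits are illustrative:

```python
import numpy as np
import pandas as pd

a = np.array([0, -1])
b = np.array([-1, -1])
c = np.array([1, -1])
df = pd.DataFrame({"a": [a, b, c], "b": [a, b, a], "c": [c, a, c]})

# Stack all cells into one array of shape (rows, cols, 2).
# Assumption: every cell has the same length.
cells = np.array(df.to_numpy().tolist())
targets = np.array([[0, -1], [-1, -1]])

# Broadcast each cell against each target pair: shape (rows, cols, n_targets, 2).
# all(axis=-1) requires both elements to match, any(axis=-1) accepts either target.
hits = (cells[:, :, None, :] == targets).all(axis=-1).any(axis=-1)
count = int(hits.sum())
print(count)  # 6
```

The result agrees with the applymap/isin version above on the same data.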

Get indices of slices with at least one element obeying some condition

I have an ndarray A of shape (n, a, b)
I want a Boolean ndarray X of shape (a, b) where
X[i,j]=any(A[:, i, j] < 0)
How to achieve this?
I would use an intermediate matrix and the sum(axis) method:
import numpy as np
np.random.seed(24)
# example matrix filled with either 0 or -1:
A = np.random.randint(2, size=(3, 2, 2)) - 1
# element-wise condition test:
X_elementwise = A < 0
# check whether the condition is fulfilled at least once along axis 0:
X = X_elementwise.sum(axis=0) >= 1
X = X_elementwise.sum(axis=0) >= 1
Values for A and X:
A = array([[[-1, 0],
[-1, 0]],
[[ 0, 0],
[ 0, -1]],
[[ 0, 0],
[-1, 0]]])
X = array([[ True, False],
[ True, True]])
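The same reduction can also be written with any along the first axis, which avoids the intermediate sum. A minimal sketch, using the A printed above as fixed input instead of random data:

```python
import numpy as np

# Fixed copy of the example array shown above, shape (n, a, b) = (3, 2, 2).
A = np.array([[[-1, 0], [-1, 0]],
              [[0, 0], [0, -1]],
              [[0, 0], [-1, 0]]])

# X[i, j] is True if any A[:, i, j] is negative.
X = (A < 0).any(axis=0)
print(X)
```

Both versions reduce over axis 0, so X has shape (a, b).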

How to understand the np.argwhere function?

Signature: np.argwhere(a)
Docstring:
Find the indices of array elements that are non-zero, grouped by element.
Examples
>>> x = np.arange(6).reshape(2,3)
>>> x
array([[0, 1, 2],
[3, 4, 5]])
>>> np.argwhere(x>1)
array([[0, 2],
[1, 0],
[1, 1],
[1, 2]])
What is meant by 'non-zero' and 'grouped by element'? And what is "x>1"?
In each row of the result, the first entry is the row index and the second entry is the column index of an entry of x that satisfies the condition.
For example:
2 is greater than 1
so the first row of argwhere gives you [0, 2]
pointing to the position of 2 in x.
Find the indices (positions) of array elements that are non-zero (true), grouped by element (each index is its own row).
Basically, if you pass a boolean array, you will find the indices where that array is true, but transposed so that indices of the form [[x1, x2, ...], [y1, y2, ...]] take the form [[x1, y1], [x2, y2], ...].
x > 1 is a boolean array which is True wherever x > 1 and False wherever x <= 1. In your example, it looks like
[[False, False, True],
[True, True, True]]
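The "transposed" relationship can be checked directly against np.nonzero, which returns one index array per dimension. A short sketch using the x from the docstring:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)
mask = x > 1  # boolean array: True where the condition holds

# argwhere: one [row, col] pair per True entry.
idx = np.argwhere(mask)

# nonzero: a tuple (rows, cols), i.e. the same indices grouped by dimension.
rows, cols = np.nonzero(mask)

print(idx.tolist())   # [[0, 2], [1, 0], [1, 1], [1, 2]]
print(x[rows, cols])  # the selected values: [2 3 4 5]
```

The tuple from np.nonzero is the form to use for indexing, while np.argwhere is easier to read or iterate over pair by pair.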

a possible bug in numpy.isclose when comparing matrices with nans

Consider the following code:
In [90]: m1 = np.matrix([1,2,3], dtype=np.float32)
In [91]: m2 = np.matrix([1,2,3], dtype=np.float32)
In [92]: m3 = np.matrix([1,2,'nan'], dtype=np.float32)
In [93]: np.isclose(m1, m2, equal_nan=True)
Out[93]: matrix([[ True, True, True]], dtype=bool)
In [94]: np.isclose(m1, m3, equal_nan=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-94-5d2b979bc263> in <module>()
----> 1 np.isclose(m1, m3, equal_nan=True)
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in isclose(a, b, rtol, atol, equal_nan)
2571 # Ideally, we'd just do x, y = broadcast_arrays(x, y). It's in
2572 # lib.stride_tricks, though, so we can't import it here.
-> 2573 x = x * ones_like(cond)
2574 y = y * ones_like(cond)
2575 # Avoid subtraction with infinite/nan values...
/usr/local/lib/python2.7/dist-packages/numpy/matrixlib/defmatrix.pyc in __mul__(self, other)
341 if isinstance(other, (N.ndarray, list, tuple)) :
342 # This promotes 1-D vectors to row vectors
--> 343 return N.dot(self, asmatrix(other))
344 if isscalar(other) or not hasattr(other, '__rmul__') :
345 return N.dot(self, other)
ValueError: shapes (1,3) and (1,3) not aligned: 3 (dim 1) != 1 (dim 0)
When comparing arrays with NaNs, it works as expected:
In [95]: np.isclose(np.array(m1), np.array(m3), equal_nan=True)
Out[95]: array([[ True, True, False]], dtype=bool)
Why is np.isclose failing? From the documentation it seems that it should work.
Thanks.
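The ValueError in the traceback is raised inside x * ones_like(cond): for np.matrix operands, * is matrix multiplication, so two (1, 3) matrices cannot be multiplied. As for the NaN comparison itself, np.nan == np.nan is False in floating-point logic.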
In [39]: np.nan == np.nan
Out[39]: False
The `equal_nan` parameter makes two `nan` values compare as equal; it does not make `nan` equal to any other value.
In [37]: np.isclose(m3,m3)
Out[37]: array([ True, True, False], dtype=bool)
In [38]: np.isclose(m3,m3,equal_nan=True)
Out[38]: array([ True, True, True], dtype=bool)
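One workaround on a NumPy version that behaves like the one in the traceback is to convert the matrices to plain arrays before the comparison, as the question already does with np.array. The sketch below uses np.asarray for the conversion and also shows what equal_nan does and does not change:

```python
import numpy as np

m1 = np.matrix([1, 2, 3], dtype=np.float32)
m3 = np.matrix([1, 2, np.nan], dtype=np.float32)

# Plain ndarrays use element-wise *, so isclose never hits matrix multiplication.
a1 = np.asarray(m1)
a3 = np.asarray(m3)

# NaN against 3.0 is never close, with or without equal_nan.
different = np.isclose(a1, a3, equal_nan=True)

# NaN against NaN is only treated as equal when equal_nan=True.
same_default = np.isclose(a3, a3)
same_equal_nan = np.isclose(a3, a3, equal_nan=True)

print(different, same_default, same_equal_nan)
```

Because np.asarray keeps the 2-D shape of the matrix, each result has shape (1, 3).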

Creating matrix out of an array of categories in numpy

I have a length-n numpy array, y, of integers in the range [0...k-1]. From this, I would like to create an n-by-k numpy matrix M, where M[i,j] is 1 if y[i]==j, and 0 else.
What is the best way to do this in numpy?
Use broadcasting:
a = np.array([1, 2, 3, 1, 2, 2, 3, 0])
m = a[:, None] == np.arange(max(a)+1)
the result is:
array([[False, True, False, False],
[False, False, True, False],
[False, False, False, True],
[False, True, False, False],
[False, False, True, False],
[False, False, True, False],
[False, False, False, True],
[ True, False, False, False]], dtype=bool)
Or create a zero array and fill it, which I think is faster:
m2 = np.zeros((len(a), a.max()+1), bool)
m2[np.arange(len(a)), a] = True
print(m2)
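If an integer matrix is wanted, as in the question, another option is to index an identity matrix with y. This is a sketch and not one of the approaches above; k is assumed to be y.max() + 1, which only holds if the largest category actually occurs in y:

```python
import numpy as np

y = np.array([1, 2, 3, 1, 2, 2, 3, 0])
k = y.max() + 1  # assumption: the largest category is present in y

# Row j of the identity matrix is the one-hot vector for category j,
# so fancy indexing with y picks one such row per element of y.
M = np.eye(k, dtype=int)[y]
print(M)
```

This builds a k-by-k identity matrix first, so for very large k the zero-array-and-fill approach is likely the more economical choice.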
This is maybe a bit out there, but it's a pretty extensible solution and at least worth noting. If you've already got scikit-learn, the DictVectorizer class is used to transform categorical features in a dataset into column-wise binary representations, just like you described:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
# starting with your numpy array
y = np.array([1, 2, 3, 1, 2, 2, 3, 0])
# transform the array to a list of dicts, with original
# int values now as strings, and a throw-away key ''
y_dict = [{'':str(x)} for x in y.tolist()]
# create the vectorizer and transform the list of dicts
vec = DictVectorizer(sparse=False, dtype=int)
M = vec.fit_transform(y_dict)
print(M)
[[0 1 0 0]
[0 0 1 0]
[0 0 0 1]
[0 1 0 0]
[0 0 1 0]
[0 0 1 0]
[0 0 0 1]
[1 0 0 0]]
Again, probably overkill but it's kind of cute and I thought I'd throw it out there.