pandas most efficient way to compare dataframe and series - pandas

I have a dataframe of shape (n, p) and a series of length n
I can compare them with:
for i in df.keys():
df[i] > ts
Is there a way to do it in one line? something like df > ts.
if yes, is it more efficient?

I think you need DataFrame.gt:
print (df.gt(s, axis=0))
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
s = pd.Series([1,2,3])
print (s)
0 1
1 2
2 3
dtype: int64
print (df.gt(s, axis=0))
A B C D E F
0 False True True False True True
1 False True True True True True
2 False True True True True False
If need another functions for compare:
lt
gt
le
ge
ne
eq

Related

Python Pandas: How to delete row with certain value of 'Object' datatype?

I have a dataframe name train_data.
This is the datatype of each column.
The columns workclass, occupation, and native-country are of "Object" datatype and some of the rows contain values of "?".
In this example, you can see row index 5 has some values with "?".
I want to delete all rows with any cell that has any "?".
I tried the following code, but it didn't work.
train_data = train_data[~(train_data.values == '?').any(1)]
train_data
use .loc for index slicing.
import pandas as pd
df1 = pd.DataFrame({'A' : [0,1,2,3,'?'],
'B' : [2,4,5,'?',9],
'C' : [0,'?',2,3,4]})
print(df1)
A B C
0 0 2 0
1 1 4 ?
2 2 5 2
3 3 ? 3
4 ? 9 4
print(df1.loc[~df1.eq('?').any(1)])
A B C
0 0 2 0
2 2 5 2
if you only want to check object columns use
pd.select_dtypes
df1.select_dtypes('object').eq('?').any(1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
Edit.
One method for handling leading or trailing white spaces.
df1 = pd.DataFrame({'A' : [0,1,2,3,'?'],
'B' : [2,4,5,' ?',9],
'C' : [0,'? ',2,3,4]})
df1.eq('?').any(1)
0 False
1 False
2 False
3 False
4 True
dtype: bool
df1.replace('(\s+\?)|(\?\s+)',r'?',regex=True).eq('?').any(1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
str.strip() with a lambda
str_cols = df1.select_dtypes('object').columns
df1[str_cols] = df1[str_cols].apply(lambda x : x.str.strip())
df1.eq('?').any(1)
0 False
1 True
2 False
3 True
4 True
dtype: bool

Pandas easy API to find out all inf or nan cells?

I had search stackoverflow about this, all are so complex.
I want to output the row and column info about all cells that is inf or NaN.
You can replace np.inf to missing values and test them by DataFrame.isna and last test at least one True by DataFrame.any passed to DataFrame.loc for SubDataFrame:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,np.inf],
'C':[7,np.nan,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[np.inf,3,6,9,2,np.nan],
'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 inf a
1 b 5.0 NaN 3 3.0 a
2 c 4.0 9.0 5 6.0 a
3 d 5.0 4.0 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f inf 3.0 0 NaN b
m = df.replace(np.inf, np.nan).isna()
print (m)
A B C D E F
0 False False False False True False
1 False False True False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
5 False True False False True False
df = df.loc[m.any(axis=1), m.any()]
print (df)
B C E
0 4.0 7.0 inf
1 5.0 NaN 3.0
5 inf 3.0 NaN
Or if need index and columns names in DataFrame use DataFrame.stack with Index.to_frame:
s = df.replace(np.inf, np.nan).stack(dropna=False)
df1 = s[s.isna()].index.to_frame(index=False)
print (df1)
0 1
0 0 E
1 1 C
2 5 B
3 5 E

How to get count for the non duplicates in column

My code to get the duplicates, how to negate the below meaning
df.duplicated(subset='col', keep='last').sum()
len(df['col'])-len(df['col'].drop_duplicates())
I think you need DataFrame.duplicated with keep=False for all duplicates, invert mask and sum for count Trues:
df = pd.DataFrame({'col':[1,2,2,3,3,3,4,5,5]})
print (df.duplicated(subset='col', keep=False))
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
dtype: bool
print (~df.duplicated(subset='col', keep=False))
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
print ((~df.duplicated(subset='col', keep=False)).sum())
2
Another solution with Series.drop_duplicates and keep=False with length of Series:
print (df['col'].drop_duplicates(keep=False))
0 1
6 4
Name: col, dtype: int64
print (len(df['col'].drop_duplicates(keep=False)))
2

Replace a string value with NaN in pandas data frame - Python

Do I have to replace the value? with NaN so you can invoke the .isnull () method. I have found several solutions but some errors are always returned. Suppose:
data = pd.DataFrame([[1,?,5],[?,?,4],[?,32.1,1]])
and if I try:
pd.data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forget assign back:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
# a dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing non numerical elements from the dataframe:
"Method 1 - with regex"
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.NaN, regex=True)
dat2
"Method 2 - with pd.to_numeric"
dat3 = pd.DataFrame()
for col in dat.columns:
dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
? is a not null. So you will expect to get a False under the isnull test
>>> data = pandas.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
>>> data
0 1 2
0 False False False
1 False False False
2 False False False
After you replace ? with NaN the test will look much different
>>> data = data.replace('?', np.nan)
>>> data
0 1 2
0 False True False
1 True True False
2 True False False
I believe when you are doing pd.data.replace('?', np.nan) this action is not done in place, so you must try -
data = data.replace('?', np.nan)

How to build column by column dataframe pandas

I have a dataframe looking like this example
A | B | C
__|___|___
s s nan
nan x x
I would like to create a table of intersections between columns like this
| A | B | C
__|______|____|______
A | True |True| False
__|______|____|______
B | True |True|True
__|______|____|______
C | False|True|True
__|______|____|______
Is there an elegant cycle-free way to do it?
Thank you!
Setup
df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
Option 1
You can use numpy broadcasting to evaluate each column by each other column. Then determine if any of the comparisons are True
v = df.values
pd.DataFrame(
(v[:, :, None] == v[:, None]).any(0),
df.columns, df.columns
)
A B C
A True True False
B True True True
C False True True
By replacing any with sum you can get a count of how many intersections.
v = df.values
pd.DataFrame(
(v[:, :, None] == v[:, None]).sum(0),
df.columns, df.columns
)
A B C
A 1 1 0
B 1 2 1
C 0 1 1
Or use np.count_nonzero instead of sum
v = df.values
pd.DataFrame(
np.count_nonzero(v[:, :, None] == v[:, None], 0),
df.columns, df.columns
)
A B C
A 1 1 0
B 1 2 1
C 0 1 1
Option 2
Fun & Creative way
d = pd.get_dummies(df.stack()).unstack(fill_value=0)
d = d.T.dot(d)
d.groupby(level=1).sum().groupby(level=1, axis=1).sum()
A B C
A 1 1 0
B 1 2 1
C 0 1 1