pandas quantile comparison: indexes not aligned [duplicate]

How can I perform comparisons between DataFrames and Series? I'd like to mask elements in a DataFrame/Series that are greater/less than elements in another DataFrame/Series.
For instance, the following doesn't replace elements greater than the row mean
with NaNs, although I was expecting it to:
>>> import numpy as np
>>> import pandas as pd
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x[x > x.mean(axis=1)] = np.nan
>>> x
   a  b
0  1  3
1  2  4
If we look at the boolean array created by the comparison, it is really weird:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x > x.mean(axis=1)
       a      b      0      1
0  False  False  False  False
1  False  False  False  False
I don't understand the logic by which the resulting boolean array is produced. I'm able to work around the problem by using transpose:
>>> (x.T > x.mean(axis=1).T).T
       a     b
0  False  True
1  False  True
But I believe there is some "correct" way of doing this that I'm not aware of. And at least I'd like to understand what is going on.

The problem here is that the comparison aligns the Series index with the DataFrame columns, so the index values are interpreted as column labels. If you use .gt and pass axis=0, then you get the result you desire:
In [203]:
x.gt(x.mean(axis=1), axis=0)
Out[203]:
       a     b
0  False  True
1  False  True
You can see what I mean when you perform the comparison with the np array:
In [205]:
x > x.mean(axis=1).values
Out[205]:
       a      b
0  False  False
1  False   True
Here you can see that the default axis for the comparison is the columns, so the values are broadcast across each row, producing a different result.
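To tie this back to the original goal of replacing values with NaN, here is a minimal sketch that applies the axis=0 comparison with DataFrame.mask (the mask/result names are just for illustration):
import numpy as np
import pandas as pd

x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})

# axis=0 aligns the row-mean Series with the index, so each row is
# compared against its own mean rather than against the column labels.
mask = x.gt(x.mean(axis=1), axis=0)

# Replace the masked elements with NaN (x[mask] = np.nan also works).
result = x.mask(mask)
print(result)
#    a   b
# 0  1 NaN
# 1  2 NaN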

Related

ValueError: Must be all encoded bytes when reading csv with 0 and 1 in pandas

I am trying to read a csv with 1s and 0s and convert them to True and False. Because I have a lot of columns, I would like to use the true_values and false_values arguments, but I got
ValueError: Must be all encoded bytes:
from io import StringIO
import numpy as np
import pandas as pd
pd.read_csv(StringIO("""var1, var2
0, 0
0, 1
1, 1
0, 0
0, 1
1, 0"""), true_values=[1],false_values=[0])
I cannot find the problem with the code that I wrote.
You don't need the true_values and false_values parameters here; they expect strings (e.g. true_values=['1']), and passing integers is what raises the "Must be all encoded bytes" error. Use dtype instead:
>>> pd.read_csv(StringIO("""var1,var2
0,0
0,1
1,1
0,0
0,1
1,0"""), dtype={'var1': bool, 'var2': bool})
    var1   var2
0  False  False
1  False   True
2   True   True
3  False  False
4  False   True
5   True  False
If your columns share the same prefix, use filter:
df = pd.read_csv(StringIO("""..."""))
cols = df.filter(like='var').columns
df[cols] = df[cols].astype(bool)
If your columns are consecutive, use iloc:
df = pd.read_csv(StringIO("""..."""))
cols = df.iloc[:, 0:2].columns
df[cols] = df[cols].astype(bool)
Auto-detection, converting every column whose minimum is 0 and maximum is 1:
m = df.min().eq(0) & df.max().eq(1)
df.loc[:, m] = df.loc[:, m].astype(bool)
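Putting the auto-detection together, a quick end-to-end sketch (the extra size column is made up to show a non-flag column being left alone; note the min/max check is a heuristic and would also match a column containing, say, 0, 0.5 and 1; plain column assignment is used here so the dtype reliably becomes bool):
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("""var1,var2,size
0,0,5
0,1,7
1,1,9"""))

# Candidate flag columns: minimum 0 and maximum 1.
m = df.min().eq(0) & df.max().eq(1)

# Replace those columns wholesale so they become bool.
df[df.columns[m]] = df[df.columns[m]].astype(bool)
print(df.dtypes)
# var1     bool
# var2     bool
# size    int64
# dtype: object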

Boolean comparison between a series and a dataframe (element-wise)

Here are the series and dataframe to be compared element-wise (AND condition):
import pandas as pd
se = pd.Series(data=[False, True])
df = pd.DataFrame(data=[[True, False], [True, True]],
                  columns=['A', 'B'])
Desired result:
df2 = pd.DataFrame(data=[[False, False], [True, True]],
                   columns=['A', 'B'])
I could achieve that using a slow for loop, but I am sure there is a way to vectorise it.
Many thanks!
Convert the Series to a numpy array and compare with broadcasting:
print(df & se.to_numpy()[:, None])
       A      B
0  False  False
1   True   True
You can use conversion to a numpy array to benefit from broadcasting:
import numpy as np

out = np.logical_and(df, se.to_numpy()[:, None])
output:
       A      B
0  False  False
1   True   True
intermediate:
se.to_numpy()[:, None]
array([[False],
       [ True]])
Another possible solution:
(df & np.vstack(se))
Output:
       A      B
0  False  False
1   True   True
Or multiply along the index, since multiplying booleans is equivalent to a logical AND:
df.mul(se, 0)
Output:
       A      B
0  False  False
1   True   True

Drop pandas column with constant alphanumeric values

I have a dataframe df that contains around 2 million records.
Some of the columns contain only alphanumeric values (e.g. "wer345", "gfer34", "123fdst").
Is there a pythonic way to drop those columns (e.g. using isalnum())?
Apply Series.str.isalnum column-wise to mask all the alphanumeric values of the DataFrame. Then use DataFrame.all to find the columns that only contain alphanumeric values. Invert the resulting boolean Series to select only the columns that contain at least one non-alphanumeric value.
is_alnum_col = df.apply(lambda col: col.str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]
Example
import pandas as pd

df = pd.DataFrame({
    'a': ['aas', 'sd12', '1232'],
    'b': ['sdds', 'nnm!!', 'ab-2'],
    'c': ['sdsd', 'asaas12', '12.34'],
})
is_alnum_col = df.apply(lambda col: col.str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]
Output:
>>> df
      a      b        c
0   aas   sdds     sdsd
1  sd12  nnm!!  asaas12
2  1232   ab-2    12.34
>>> df.apply(lambda col: col.str.isalnum())
      a      b      c
0  True   True   True
1  True  False   True
2  True  False  False
>>> is_alnum_col
a     True
b    False
c    False
dtype: bool
>>> res
       b        c
0   sdds     sdsd
1  nnm!!  asaas12
2   ab-2    12.34
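One caveat: col.str.isalnum() assumes every column holds strings, and the .str accessor raises on numeric dtypes. A sketch that casts to str first (note this treats a purely numeric column as alphanumeric as well, which may or may not be what you want):
is_alnum_col = df.apply(lambda col: col.astype(str).str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]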

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

How is this working?
I know the intuition behind it: given the movie dataset (loaded into "md" using pandas), we are finding the rows of 'vote_count' which are not null and converting them to int.
But I am not understanding the syntax.
md[md['vote_count'].notnull()] returns a filtered view of your md dataframe containing only the rows where vote_count is not null; the ['vote_count'] that follows selects that column, .astype('int') converts it, and the result is assigned to the variable vote_counts. This is boolean indexing.
import numpy as np
import pandas as pd

# Assume this dataframe
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
df.loc[2, 'B'] = np.nan
When you do df['B'].notnull(), it returns a boolean vector that can be used to filter your data to the rows where the value is True:
df['B'].notnull()
0     True
1     True
2    False
3     True
4     True
Name: B, dtype: bool
df[df['B'].notnull()]
          A         B         C
0 -0.516625 -0.596213 -0.035508
1  0.450260  1.123950 -0.317217
3  0.405783  0.497761 -1.759510
4  0.307594 -0.357566  0.279341
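To mirror the exact chain from the question, a minimal sketch with a toy md (the data is made up for illustration):
import numpy as np
import pandas as pd

# Toy stand-in for the movie dataset.
md = pd.DataFrame({'vote_count': [10.0, np.nan, 3.0, np.nan, 25.0]})

# md['vote_count'].notnull() -> boolean mask of the non-null rows
# md[...]                    -> keeps only those rows
# ['vote_count']             -> selects that column from the filtered frame
# .astype('int')             -> casts the remaining floats to integers
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
print(vote_counts)
# 0    10
# 2     3
# 4    25
# Name: vote_count, dtype: int64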

creating a logical pandas series by comparing two series

In pandas I'm trying to combine two series into one logical one:
f = pd.Series(['a','b','c','d','e'])
x = pd.Series(['a','c'])
As a result I would like to have the series
[1, 0, 1, 0, 0]
I tried
f.map(lambda e: e in x)
Series f is large (30,000 elements), so looping over the elements (with map) is probably not very efficient. What would be a good approach?
Use isin:
In [207]:
f = pd.Series(['a','b','c','d','e'])
x = pd.Series(['a','c'])
f.isin(x)
Out[207]:
0     True
1    False
2     True
3    False
4    False
dtype: bool
You can convert the dtype using astype if you prefer:
In [208]:
f.isin(x).astype(int)
Out[208]:
0    1
1    0
2    1
3    0
4    0
dtype: int32