creating a logical pandas series by comparing two series - pandas

In pandas I'm trying to combine two Series into one logical (boolean) one
f = pd.Series(['a','b','c','d','e'])
x = pd.Series(['a','c'])
As a result I would like to have the series
[1, 0, 1, 0, 0]
I tried
f.map(lambda e: e in x)
Series f is large (30,000 elements), so looping over the elements with map is probably not very efficient. What would be a good approach?

Use isin:
In [207]:
f = pd.Series(['a','b','c','d','e'])
x = pd.Series(['a','c'])
f.isin(x)
Out[207]:
0 True
1 False
2 True
3 False
4 False
dtype: bool
You can convert the dtype using astype if you prefer:
In [208]:
f.isin(x).astype(int)
Out[208]:
0 1
1 0
2 1
3 0
4 0
dtype: int32
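For completeness, a minimal runnable sketch of the accepted approach (variable names match the question). It also shows why the original `map` attempt is not just slow but wrong: `e in x` tests membership in the Series *index*, not its values.

```python
import pandas as pd

f = pd.Series(['a', 'b', 'c', 'd', 'e'])
x = pd.Series(['a', 'c'])

# isin is vectorized, so it avoids a Python-level loop over 30,000 elements.
mask = f.isin(x)            # boolean Series
result = mask.astype(int)   # 0/1 integer Series

print(result.tolist())   # [1, 0, 1, 0, 0]

# `in` on a Series checks the index (0, 1, ...), not the values:
print('a' in x)          # False - 'a' is not an index label
print('a' in x.values)   # True  - membership in the values
```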

Related

Pandas - drop n rows by column value

I need to remove the last n rows where Status equals 1.
v = df[df['Status'] == 1].count()
f = df[df['Status'] == 0].count()
diff = v - f
diff
df2 = df[~df['Status'] == 1].tail(diff).all() #ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
df2
Check whether Status equals 1 and keep only the positions where it does (.loc[lambda s: s] does that via boolean indexing). The index of the last n such rows (from tail) is then dropped:
df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
sample run:
In [343]: df
Out[343]:
Status
0 1
1 2
2 3
3 2
4 1
5 1
6 1
7 2
In [344]: n
Out[344]: 2
In [345]: df.Status.eq(1)
Out[345]:
0 True
1 False
2 False
3 False
4 True
5 True
6 True
7 False
Name: Status, dtype: bool
In [346]: df.Status.eq(1).loc[lambda s: s]
Out[346]:
0 True
4 True
5 True
6 True
Name: Status, dtype: bool
In [347]: df.Status.eq(1).loc[lambda s: s].tail(n)
Out[347]:
5 True
6 True
Name: Status, dtype: bool
In [348]: df.Status.eq(1).loc[lambda s: s].tail(n).index
Out[348]: Int64Index([5, 6], dtype='int64')
In [349]: df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
Out[349]:
Status
0 1
1 2
2 3
3 2
4 1
7 2
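The sample run above can be condensed into a self-contained sketch (same data and n as the transcript):

```python
import pandas as pd

df = pd.DataFrame({'Status': [1, 2, 3, 2, 1, 1, 1, 2]})
n = 2

# Indices of the last n rows where Status == 1, then drop them.
idx = df.Status.eq(1).loc[lambda s: s].tail(n).index
df2 = df.drop(idx)

print(df2.index.tolist())  # [0, 1, 2, 3, 4, 7]
```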
Using groupby() and transform() to mark rows to keep:
df = pd.DataFrame({"Status": [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]})
n = 3
df["Keep"] = df.groupby("Status")["Status"].transform(
    lambda x: x.reset_index().index < len(x) - n if x.name == 1 else True
)
df.loc[df["Keep"]].drop(columns="Keep")

boolean indexing with loc returns NaN

import pandas as pd
numbers = [
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
]
df = pd.DataFrame(numbers)
condition = df.loc[:, 1:2] < 4
df[condition]
0 1 2
0 NaN 2.0 3.0
1 NaN NaN NaN
2 NaN NaN NaN
Why am I getting these wrong results, and what can I do to get the correct results?
The boolean condition has to be a Series, but here your column selection returns a DataFrame:
print (condition)
1 2
0 True True
1 False False
2 False False
To convert the boolean DataFrame to a mask, use DataFrame.all to test whether all values per row are True, or
DataFrame.any to test whether at least one value per row is True:
print (condition.any(axis=1))
print (condition.all(axis=1))
Both give the same result for this data:
0 True
1 False
2 False
dtype: bool
Or select only one column for the condition:
print (df.loc[:, 1] < 4)
0 True
1 False
2 False
Name: 1, dtype: bool
print (df[condition.any(axis=1)])
0 1 2
0 1 2 3
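Putting the pieces together, a runnable sketch of the any(axis=1) approach with the numbers from the question:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

condition = df.loc[:, 1:2] < 4      # boolean DataFrame over columns 1 and 2
rows = df[condition.any(axis=1)]    # collapse to a per-row boolean Series

print(rows.values.tolist())  # [[1, 2, 3]]
```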

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

How is this working?
I understand the intuition: given the movie dataset (loaded into "md" with pandas), we are finding the rows where 'vote_count' is not null and converting them to int.
But I don't understand the syntax.
md[md['vote_count'].notnull()] returns a filtered view of your md DataFrame containing only the rows where vote_count is not null; that result is assigned to the variable vote_counts. This is boolean indexing.
# Assume this dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[2,'B'] = np.nan
When you call df['B'].notnull(), it returns a boolean vector that can be used to keep only the rows where the value is True:
df['B'].notnull()
0 True
1 True
2 False
3 True
4 True
Name: B, dtype: bool
df[df['B'].notnull()]
A B C
0 -0.516625 -0.596213 -0.035508
1 0.450260 1.123950 -0.317217
3 0.405783 0.497761 -1.759510
4 0.307594 -0.357566 0.279341
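A small self-contained sketch of the same chain as in the question (toy data standing in for the movies dataset):

```python
import numpy as np
import pandas as pd

md = pd.DataFrame({'vote_count': [10.0, np.nan, 25.0, np.nan, 3.0]})

# 1) boolean mask of non-null rows, 2) filter, 3) cast to int.
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

print(vote_counts.tolist())  # [10, 25, 3]
```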

pandas quantile comparison: indexes not aligned [duplicate]

How can I perform comparisons between DataFrames and Series? I'd like to mask elements in a DataFrame/Series that are greater/less than elements in another DataFrame/Series.
For instance, the following doesn't replace elements greater than the mean
with nans although I was expecting it to:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x[x > x.mean(axis=1)] = np.nan
>>> x
a b
0 1 3
1 2 4
If we look at the boolean array created by the comparison, it is really weird:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x > x.mean(axis=1)
a b 0 1
0 False False False False
1 False False False False
I don't understand by what logic the resulting boolean array is like that. I'm able to work around this problem by using transpose:
>>> (x.T > x.mean(axis=1).T).T
a b
0 False True
1 False True
But I believe there is some "correct" way of doing this that I'm not aware of. And at least I'd like to understand what is going on.
The problem here is that pandas aligns the Series index with the DataFrame columns when performing the comparison. If you use .gt and pass axis=0, you get the result you want:
In [203]:
x.gt(x.mean(axis=1), axis=0)
Out[203]:
a b
0 False True
1 False True
You can see what I mean if you perform the comparison against the underlying NumPy array:
In [205]:
x > x.mean(axis=1).values
Out[205]:
a b
0 False False
1 False True
Here you can see that the default comparison axis is along the columns, which gives a different result.
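To actually apply the replacement the question asked for (NaN where a value exceeds its row mean), the axis-aware comparison can be combined with DataFrame.mask; a sketch:

```python
import pandas as pd

x = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# gt(..., axis=0) aligns the row means against the index, not the columns.
masked = x.mask(x.gt(x.mean(axis=1), axis=0))

print(masked['b'].isna().all())  # True: both 'b' values exceed their row mean
```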

Imposing a threshold on values in dataframe in Pandas

I have the following code:
t = 12
s = numpy.array(df.Array.tolist())
s[s<t] = 0
thresh = numpy.where(s>0, s-t, 0)
df['NewArray'] = list(thresh)
While it works, surely there must be a more pandas-like way of doing it.
EDIT:
df.Array.head() looks like this:
0 [0.771511552006, 0.771515476223, 0.77143569165...
1 [3.66720695274, 3.66722560562, 3.66684636758, ...
2 [2.3047433839, 2.30475510675, 2.30451676559, 2...
3 [0.999991522708, 0.999996609066, 0.99989319662...
4 [1.11132718786, 1.11133284052, 0.999679589875,...
Name: Array, dtype: object
IIUC you can simply subtract and use clip_lower:
In [29]: df["NewArray"] = (df["Array"] - 12).clip_lower(0)
In [30]: df
Out[30]:
Array NewArray
0 10 0
1 11 0
2 12 0
3 13 1
4 14 2
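Note that clip_lower has since been removed from pandas; clip(lower=...) is the modern equivalent. A self-contained version with a scalar column like the sample run above:

```python
import pandas as pd

t = 12
s = pd.Series([10, 11, 12, 13, 14], name='Array')

# clip(lower=0) floors the subtracted values at zero,
# matching the original numpy.where(s > 0, s - t, 0) logic.
new = (s - t).clip(lower=0)

print(new.tolist())  # [0, 0, 0, 1, 2]
```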