Delete rows in dataframe based on info from Series - pandas

I would like to delete all rows in the DataFrame that have a number of appearances = 10 and Status = 1.
An example of DataFrame X is:
ID Status
0 366804 0
1 371391 1
2 383537 1
3 383538 0
4 383539 0
...
First I found all rows with Status = 1 that have a count of 10:
exclude=X[X.Status == 1].groupby('ID')['Status'].value_counts().loc[lambda x: x==10].index
exclude is a MultiIndex:
MultiIndex([( 371391, 1),
( 383537, 1),
...
Is it possible to delete rows in DataFrame X based on the ID info from exclude?

If your original DataFrame looks something like this:
print(df)
ID Status
0 366804 0
1 371391 1
2 383537 1
3 383538 0
4 383539 0
5 371391 1
6 371391 1
7 371391 1
8 371391 1
9 371391 1
10 371391 1
11 371391 1
12 371391 1
13 371391 1
And you group IDs and statuses together to find the IDs you want to exclude:
df2 = df.groupby(['ID', 'Status']).size().to_frame('size').reset_index()
print(df2)
ID Status size
0 366804 0 1
1 371391 1 10
2 383537 1 1
3 383538 0 1
4 383539 0 1
excludes = df2.loc[(df2['size'] == 10) & (df2['Status'] == 1), 'ID']
print(excludes)
1 371391
Name: ID, dtype: int64
Then you could use Series.isin and invert the resulting boolean Series with ~:
df = df[~df['ID'].isin(excludes)]
print(df)
ID Status
0 366804 0
2 383537 1
3 383538 0
4 383539 0
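Alternatively, if you want to reuse the MultiIndex exclude built in the question directly, a minimal sketch (assuming its first level is named ID, as in the repr shown above):
ids_to_drop = exclude.get_level_values('ID')  # IDs from the first level of the MultiIndex
X = X[~X['ID'].isin(ids_to_drop)]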

Related

Pandas groupby of specific categorical column

With reference to Pandas groupby with categories with redundant nan
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TYPE"] = df["TYPE"].astype("category")
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()
VALUE
TEAM ID TYPE
1 1 A 1
B 1
2 A 1
B 1
4 A 0
B 0
5 A 0
B 0
8 A 0
B 0
2 1 A 0
B 0
2 A 0
B 0
4 A 1
B 0
5 A 1
B 0
8 A 1
B 0
Expected output
VALUE
TEAM ID TYPE
1 1 A 1
B 1
2 A 1
B 1
2 4 A 1
B 0
5 A 1
B 0
8 A 1
B 0
I tried to use astype("category") for TYPE. However, it seems to output the full Cartesian product of every item in every group.
What you want is a little unusual, but we can get there with a pivot table:
out = df.pivot_table(index=['TEAM', 'ID'],
                     columns=['TYPE'],
                     values=['VALUE'],
                     aggfunc='sum',
                     observed=True,  # This is the key when working with categoricals
                     # You should know to try this with your groupby from the post you linked
                     fill_value=0).stack()
print(out)
Output:
VALUE
TEAM ID TYPE
1 1 A 1
B 1
2 A 1
B 1
2 4 A 1
B 0
5 A 1
B 0
8 A 1
B 0
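For comparison, a quick sketch of what observed=True does in a plain groupby on the original (ungrouped) df from the question; it keeps only the observed combinations, so the TYPE B rows that pivot_table reintroduces with fill_value=0 are simply absent:
df.groupby(['TEAM', 'ID', 'TYPE'], observed=True)['VALUE'].sum()
# TEAM  ID  TYPE              (roughly)
# 1     1   A       1
#           B       1
#       2   A       1
#           B       1
# 2     4   A       1
#       5   A       1
#       8   A       1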
Here is one way to do it, based on the data you shared:
Reset the index, then use groupby to keep groups whose sum is greater than 0 (meaning at least one of the categories A or B is non-zero), and finally set the index back.
df.reset_index(inplace=True)
(df[df.groupby(['TEAM', 'ID'])['VALUE']
      .transform(lambda x: x.sum() > 0)]
   .set_index(['TEAM', 'ID', 'TYPE']))
VALUE
TEAM ID TYPE
1 1 A 1
B 1
2 A 1
B 1
2 4 A 1
B 0
5 A 1
B 0
8 A 1
B 0
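The same idea as a self-contained sketch, starting from the raw data in the question rather than the already-grouped frame (column and key names as in the question):
import pandas as pd

df = pd.DataFrame({"TEAM": [1, 1, 1, 1, 2, 2, 2],
                   "ID": [1, 1, 2, 2, 8, 4, 5],
                   "TYPE": ["A", "B", "A", "B", "A", "A", "A"],
                   "VALUE": [1, 1, 1, 1, 1, 1, 1]})
df["TYPE"] = df["TYPE"].astype("category")

# Cartesian groupby (observed=False keeps every category combination), then
# keep only the (TEAM, ID) groups whose total VALUE is non-zero
g = df.groupby(["TEAM", "ID", "TYPE"], observed=False)["VALUE"].sum().reset_index()
out = (g[g.groupby(["TEAM", "ID"])["VALUE"].transform("sum") > 0]
       .set_index(["TEAM", "ID", "TYPE"]))
print(out)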

Mask minimum values per group in a `pd.DataFrame`

Given a pd.DataFrame containing different time series in different groups, I want to create a mask over all rows that indicates, per group, at which timepoints the minimum of value is reached with respect to type 0:
For example, given the pd.DataFrame:
>>> df
group type time value
0 A 0 0 4
1 A 0 1 5
2 A 1 0 6
3 A 1 1 7
4 B 0 0 11
5 B 0 1 10
6 B 1 0 9
7 B 1 1 8
In group A the minimum for type 0 is reached at timepoint 0. For group B the minimum for type 0 is reached at timepoint 1. Therefore, the resulting column should look like:
is_min
0 True
1 False
2 True
3 False
4 False
5 True
6 False
7 True
I have created a version that seems very cumbersome: first finding the minimum locations and then constructing the final column:
def get_minima(df):
    type_mask = df.type == 0
    min_value = df[type_mask].value.min()
    value_mask = df.value == min_value
    return df[type_mask & value_mask].time.max()

min_ts = df.groupby('group').apply(get_minima)
df['is_min'] = df.apply(lambda row: min_ts[row.group] == row.time, axis=1)
IIUC, you can try groupby + apply with min:
df['is_min'] = (df.groupby(['group', 'type'])['value']
                  .apply(lambda x: x == x.min()))
Same idea with transform + min to get the minimum and eq to create the desired mask:
df['is_min'] = (df.groupby(['group', 'type'])['value']
                  .transform('min').eq(df['value']))
Output:
df
group type time value is_min
0 A 0 0 4 True
1 A 0 1 5 False
2 A 1 0 6 True
3 A 1 1 7 False
4 B 0 0 11 False
5 B 0 1 10 True
6 B 1 0 9 False
7 B 1 1 8 True
You can remove the rows with an excluding merge. Sort the values, subset to only type == 0 and use drop_duplicates to get the times per group you need to exclude. Then merge with an indicator to exclude them.
m = (df.sort_values('value').query('type == 0').drop_duplicates('group')
       .drop(columns=['type', 'value']))
# group time
#0 A 0
#5 B 1
df = (df.merge(m, how='outer', indicator=True).query('_merge == "left_only"')
        .drop(columns='_merge'))
group type time value
2 A 0 1 5
3 A 1 1 7
4 B 0 0 11
5 B 1 0 9
If you separately need the mask and don't want to subset the rows with query, map the indicator:
df = df.merge(m, how='outer', indicator='is_min')
df['is_min'] = df['is_min'].map({'left_only': False, 'both': True})
group type time value is_min
0 A 0 0 4 True
1 A 1 0 6 True
2 A 0 1 5 False
3 A 1 1 7 False
4 B 0 0 11 False
5 B 1 0 9 False
6 B 0 1 10 True
7 B 1 1 8 True
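If you prefer to build the mask directly from the type-0 minimum per group, as the question describes, a short sketch reusing the same helper idea (assuming the original df from the question):
# time of the type-0 minimum for each group
min_time = (df[df['type'] == 0]
            .sort_values('value')
            .drop_duplicates('group')
            .set_index('group')['time'])
# True wherever a row's time matches its group's type-0 minimum time
df['is_min'] = df['time'].eq(df['group'].map(min_time))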

Pandas: The best way to create new Frame by specific criteria

I have a DataFrame:
df = pd.DataFrame({'id':[1,1,1,1,2,2,2,3,3,3,4,4],
'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 2 0
5 2 0
6 2 0
7 3 1
8 3 1
9 3 0
10 4 1
11 4 1
I want to get a new DataFrame containing only the ids that have both sex values.
So I want to get something like this:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 3 1
5 3 1
6 3 0
Using groupby and filter with the required condition:
In [2952]: df.groupby('id').filter(lambda x: set(x.sex) == set([0,1]))
Out[2952]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Also,
In [2953]: df.groupby('id').filter(lambda x: all([any(x.sex == v) for v in [0,1]]))
Out[2953]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Use drop_duplicates on both columns and then get the counts of one column with value_counts first.
Then filter all values by boolean indexing with isin:
s = df.drop_duplicates()['id'].value_counts()
print (s)
3 2
1 2
4 1
2 1
Name: id, dtype: int64
df = df[df['id'].isin(s.index[s == 2])]
print (df)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
One more:)
df.groupby('id').filter(lambda x: x['sex'].nunique()>1)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Use isin()
Something like this:
df = pd.DataFrame({'id':[1,1,1,1,2,2,2,3,3,3,4,4],
'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
male = df[df['sex'] == 0]
male = male['id']
female = df[df['sex'] == 1]
female = female['id']
df = df[(df['id'].isin(male)) & (df['id'].isin(female))]
print(df)
Output:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Or you can try this
m=df.groupby('id')['sex'].nunique().eq(2)
df.loc[df.id.isin(m[m].index)]
Out[112]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
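A related variation that avoids groupby.filter (which can be slow with many groups) is to build the boolean mask with transform; a sketch on the same df:
df[df.groupby('id')['sex'].transform('nunique').eq(2)]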

Select rows if columns meet condition

I have a DataFrame with 75 columns.
How can I select rows based on a condition in a specific array of columns? If I want to do this on all columns I can just use
df[(df.values > 1.5).any(1)]
But let's say I just want to do this on columns 3:45.
Use ix to slice the columns using ordinal position:
In [31]:
df = pd.DataFrame(np.random.randn(5,10), columns=list('abcdefghij'))
df
Out[31]:
a b c d e f g \
0 -0.362353 0.302614 -1.007816 -0.360570 0.317197 1.131796 0.351454
1 1.008945 0.831101 -0.438534 -0.653173 0.234772 -1.179667 0.172774
2 0.900610 0.409017 -0.257744 0.167611 1.041648 -0.054558 -0.056346
3 0.335052 0.195865 0.085661 0.090096 2.098490 0.074971 0.083902
4 -0.023429 -1.046709 0.607154 2.219594 0.381031 -2.047858 -0.725303
h i j
0 0.533436 -0.374395 0.633296
1 2.018426 -0.406507 -0.834638
2 -0.079477 0.506729 1.372538
3 -0.791867 0.220786 -1.275269
4 -0.584407 0.008437 -0.046714
So to slice the 4th to 5th columns inclusive:
In [32]:
df.ix[:, 3:5]
Out[32]:
d e
0 -0.360570 0.317197
1 -0.653173 0.234772
2 0.167611 1.041648
3 0.090096 2.098490
4 2.219594 0.381031
So in your case
df[(df.ix[:, 2:45].values > 1.5).any(1)]
should work
Indexing is 0-based, and the start of the range is included but the end is not, so here the 3rd column (index 2) is included and we slice up to the 46th column (index 45), which is not included in the slice.
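Note that .ix has since been deprecated and removed from pandas, so on current versions the same positional slice would be written with iloc; a minimal equivalent of the line above:
df[(df.iloc[:, 2:45].values > 1.5).any(axis=1)]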
Another solution with iloc; values can be omitted:
# if you need the 3rd to 45th columns
print (df[((df.iloc[:, 2:45]) > 1.5).any(1)])
Sample:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5,10)), columns=list('abcdefghij'))
print (df)
a b c d e f g h i j
0 1 0 0 1 1 0 0 1 0 1
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
3 2 1 1 1 1 2 1 1 0 0
4 1 0 0 1 2 1 0 2 2 1
print (df[((df.iloc[:, 2:5]) > 1.5).any(1)])
a b c d e f g h i j
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
4 1 0 0 1 2 1 0 2 2 1
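If you know the column labels rather than their positions, the same pattern works with loc and label-based slicing (both endpoints inclusive); a sketch using the sample columns above:
print(df[(df.loc[:, 'c':'e'] > 1.5).any(axis=1)])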

create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
or,
>>> for label in df['product'].unique():
...     df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
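Another common approach, sketched here under the assumption that the product labels are valid column names, is to build indicator columns with get_dummies and scale them by amount:
dummies = pd.get_dummies(df['product'], dtype=int).mul(df['amount'], axis=0)
df = df.join(dummies.add_suffix('_amt'))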