pandas isin based on a single row

Now I have:
ss dd list
A B [B,E,F]
C E [C,H,E]
A C [A,D,E]
I want to rule out rows where both ss and dd appear in that row's list, so row 2 should be removed. isin() checks ss and dd against all rows of the list column at once rather than row by row, which does not give me the result I need.
Please do not use a loop, because my dataset is too large.
Output should be:
ss dd list
A B [B,E,F]
A C [A,D,E]

First we flatten your list column into a DataFrame and use isin (the index does matter here, which is why the original dataframe's index is used to create cdf):
cdf = pd.DataFrame(df['list'].tolist(), index=df.index)
mask = cdf.isin(df.ss).any(axis=1) & cdf.isin(df.dd).any(axis=1)
df[~mask]
Out[589]:
ss dd list
0 A B [B, E, F]
2 A C [A, D, E]
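For reference, a self-contained sketch of the same approach; the DataFrame construction below is assumed from the sample data in the question:
import pandas as pd

# rebuild the sample data from the question
df = pd.DataFrame({'ss': ['A', 'C', 'A'],
                   'dd': ['B', 'E', 'C'],
                   'list': [['B', 'E', 'F'], ['C', 'H', 'E'], ['A', 'D', 'E']]})

# flatten the list column; keeping the original index lets isin align row by row
cdf = pd.DataFrame(df['list'].tolist(), index=df.index)

# a row is dropped only when both its ss value and its dd value appear in its own list
mask = cdf.isin(df['ss']).any(axis=1) & cdf.isin(df['dd']).any(axis=1)
print(df[~mask])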

Related

How can I apply an expanding window to the names of groupby results?

I would like to use pandas to group a dataframe by one column, and then run an expanding window calculation on the groups. Imagine the following dataframe:
G Val
A 0
A 1
A 2
B 3
B 4
C 5
C 6
C 7
What I am looking for is a way to group the data by column G (resulting in groups ['A', 'B', 'C']), and then applying a function first to the items in group A, then to items in groups A and B, and finally items in groups A to C.
For example, if the function is sum, then the result would be
A 3
B 10
C 28
For my problem the function that is applied needs to be able to access all original items in the dataframe, not only the aggregates from the groupby.
For example when applying mean, the expected result would be
A 1
B 2
C 3.5
A: mean([0,1,2]), B: mean([0,1,2,3,4]), C: mean([0,1,2,3,4,5,6,7]).
cummean does not exist, so a possible solution is to aggregate counts and sums, take the cumulative sum of both, and divide to get the mean:
df = df.groupby('G')['Val'].agg(['size', 'sum']).cumsum()
s = df['sum'].div(df['size'])
print (s)
A 1.0
B 2.0
C 3.5
dtype: float64
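Putting the approach together as a runnable sketch, with the DataFrame rebuilt from the sample data in the question:
import pandas as pd

df = pd.DataFrame({'G': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'Val': [0, 1, 2, 3, 4, 5, 6, 7]})

# per-group size and sum, then cumulative sums turn them into expanding totals
agg = df.groupby('G')['Val'].agg(['size', 'sum']).cumsum()

print(agg['sum'])                   # expanding sum:  A 3, B 10, C 28
print(agg['sum'].div(agg['size']))  # expanding mean: A 1.0, B 2.0, C 3.5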
If you need a general solution, you can extract the expanding groups and then apply the function in a dict comprehension:
g = df['G'].drop_duplicates().apply(list).cumsum()
s = pd.Series({x[-1]: df.loc[df['G'].isin(x), 'Val'].mean() for x in g})
print (s)
A 1.0
B 2.0
C 3.5
dtype: float64
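One caveat, as an aside: apply(list) only works here because the group labels are single characters; list would split a longer label such as 'AB' into ['A', 'B']. If your group names are longer strings, wrapping each label in a list explicitly is a safer way to build the expanding groups:
g = df['G'].drop_duplicates().map(lambda v: [v]).cumsum()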

Pandas finding average in a comma separated column

I want to group by one column, which is comma separated, and take the mean of another column.
My file looks like this:
ColumnA ColumnB
A, B, C 2.9
A, C 9.087
D 6.78
B, D, C 5.49
My output should look like this:
A 7.4435
B 5.645
C 5.83
D 6.135
My code is this:
df = pd.DataFrame(data.ColumnA.str.split(',', expand=True).stack(), columns= ['ColumnA'])
df = df.reset_index(drop = True)
df_avg = pd.DataFrame(df.groupby(by = ['ColumnA'])['ColumnB'].mean())
df_avg = df_avg.reset_index()
It has to be along the same lines, but I can't figure it out.
In your solution, ColumnB is set as the index to avoid losing its values after stack and Series.reset_index; finally, as_index=False is added so the grouping key stays a column after aggregation:
df = (df.set_index('ColumnB')['ColumnA']
        .str.split(',', expand=True)
        .stack()
        .reset_index(name='ColumnA')
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
print (df)
ColumnA ColumnB
0 A 5.993500
1 B 4.195000
2 C 5.825667
3 D 6.135000
Or an alternative solution with DataFrame.explode:
df = (df.assign(ColumnA=df['ColumnA'].str.split(','))
        .explode('ColumnA')
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
print (df)
ColumnA ColumnB
0 A 5.993500
1 B 4.195000
2 C 5.825667
3 D 6.135000
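Here is a runnable sketch of the explode approach, with the DataFrame rebuilt from the sample in the question; the str.strip() step is an addition that is only needed if the comma-separated values contain spaces, as the sample data does:
import pandas as pd

df = pd.DataFrame({'ColumnA': ['A, B, C', 'A, C', 'D', 'B, D, C'],
                   'ColumnB': [2.9, 9.087, 6.78, 5.49]})

out = (df.assign(ColumnA=df['ColumnA'].str.split(','))
         .explode('ColumnA')
         .assign(ColumnA=lambda d: d['ColumnA'].str.strip())  # drop stray spaces
         .groupby('ColumnA', as_index=False)['ColumnB']
         .mean())
print(out)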

How to filter Pandas DataFrame rows by checking inclusion in cell's element which is a list?

Given a dataframe where one of the columns looks like this, how would I filter the dataframe down to only rows where this column's element contains a c anywhere in its list?
df['orderings']
1 (a, a, a, a)
10 (a, a, c, c)
12 (a, a, c, b)
Assuming your a, b, c are all strings, you can do:
df[df["orderings"].apply(lambda l: "a" in l and "c" in l)]

Group a subset of values into a list of single row per key, but add None if true on a condition

Suppose I have the following data that I want to conduct groupby on:
Key Prod Val
A a 1
A b 0
B a 1
B b 1
B d 1
C a 0
C b 0
I want to group the table so I have a single row for each key, A, B and C, and a list containing the Prod values corresponding to the key. But an element should only be in the list if there's an indicator of 1 for the corresponding Val. If Val is 0 for the entire subset of a key, then the key should just get a None value. Here's the result I'm looking for using the same example above:
Key List
A [a]
B [a, b, d]
C None
What's the most efficient way to perform this in pandas?
Let's try:
df.query('Val == 1').groupby('Key')['Prod'].agg(lambda x: list(x)).reindex(df.Key.unique())
Output:
Key
A [a]
B [a, b, d]
C NaN
dtype: object
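If a literal None (rather than NaN) is required for the all-zero keys, one way is to post-process the reindexed result; a sketch, with the DataFrame rebuilt from the sample data:
import pandas as pd

df = pd.DataFrame({'Key': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'Prod': ['a', 'b', 'a', 'b', 'd', 'a', 'b'],
                   'Val': [1, 0, 1, 1, 1, 0, 0]})

out = (df.query('Val == 1')
         .groupby('Key')['Prod']
         .agg(list)
         .reindex(df['Key'].unique()))

# reindex leaves NaN for keys that never have Val == 1; map those to None explicitly
out = pd.Series({k: (v if isinstance(v, list) else None) for k, v in out.items()},
                dtype=object)
print(out)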
I think just making a new dataframe would be easiest:
df2 = pd.DataFrame(columns=['list'], index=set(df1.Key))
for i, row in df2.iterrows():
    df2.at[i, 'list'] = []  # .at lets us store a list object in a single cell
for i, row in df1.iterrows():
    key = df1.loc[i, 'Key']
    if df1.loc[i, 'Val'] == 1:
        df2.loc[key, 'list'].append(df1.loc[i, 'Prod'])

Rearrange rows of pandas dataframe based on list and keeping the order

import numpy as np
import pandas as pd
df = pd.DataFrame(data={'result': [-6.77, 6.11, 5.67, -7.679, -0.0930, 4.342]},
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
new_order = np.array([1,2,2,0,1,0])
The new_order numpy array assigns each row to one of three groups [0,1 or 2]. I would like to rearrange the rows of df so that those rows in group 0 appear first, followed by 1, and finally 2. Within each of the three groups the initial ordering should remain unchanged.
At the start the df is arranged as follows:
result
A -6.770
B 6.110
C 5.670
D -7.679
E -0.093
F 4.342
Here is the desired output given the above input data.
result
D -7.679
F 4.342
A -6.770
E -0.093
B 6.110
C 5.670
You could use argsort with kind='mergesort' to get sorted row indices that keep the original order within each group, and then simply index into the dataframe with those for the desired output, like so -
df.iloc[new_order.argsort(kind='mergesort')]
Sample run -
In [2]: df
Out[2]:
result
A -6.770
B 6.110
C 5.670
D -7.679
E -0.093
F 4.342
In [3]: df.iloc[new_order.argsort(kind='mergesort')]
Out[3]:
result
D -7.679
F 4.342
A -6.770
E -0.093
B 6.110
C 5.670
pure pandas
df.set_index(new_order, append=True) \
  .sort_index(level=1) \
  .reset_index(1, drop=True)
explanation
append new_order to the index
set_index(new_order, append=True)
use that new index level and sort by it
sort_index(level=1)
drop the index level I added
reset_index(1, drop=True)
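As a small aside, recent NumPy versions also accept kind='stable' as an alias for a stable sort, so the argsort approach above can equivalently be written as:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'result': [-6.77, 6.11, 5.67, -7.679, -0.0930, 4.342]},
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
new_order = np.array([1, 2, 2, 0, 1, 0])

# a stable sort keeps the original order within each group
print(df.iloc[new_order.argsort(kind='stable')])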