Iterate over duplicate partitions/groups of a Pandas DataFrame

I have a df like this
id val1 val2 val3
0 1 1 2
1 1 NaN 2
2 1 4 2
3 1 4 2
4 2 1 1
5 3 NaN 3
6 3 7 3
7 3 7 3
then
temp_df = df.loc[df.duplicated(subset=['val1','val3'], keep=False)]
gives me this
id val1 val2 val3
0 1 1 2
1 1 NaN 2
2 1 4 2
3 1 4 2
5 3 NaN 3
6 3 7 3
7 3 7 3
How can I iterate over each partition/group containing the duplicate values?
for partition in temp_df......:
print(partition)
id val1 val2 val3
0 1 1 2
1 1 NaN 2
2 1 4 2
3 1 4 2
id val1 val2 val3
5 3 NaN 3
6 3 7 3
7 3 7 3
The goal is to impute each NaN with the mode of val2 within its partition. E.g. mode(1, 4, 4) = 4, so I want to fill the NaN of the first partition with 4. Similarly, I want to fill the NaN of the second partition with 7.
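For the literal iteration, groupby on the same key columns yields one sub-DataFrame per duplicate partition; a minimal sketch (the tuple unpacking assumes the two-column key):

for (v1, v3), partition in temp_df.groupby(['val1', 'val3']):
    print(partition)

Each partition is an ordinary DataFrame, so any per-group imputation can go inside the loop.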

Update
Use groupby.apply:
df['val2'] = df.groupby(['val1', 'val3'])['val2'] \
.apply(lambda x: x.fillna(x.mode().squeeze()))
print(df)
# Output:
id val1 val2 val3
0 0 1 1.0 2
1 1 1 4.0 2
2 2 1 4.0 2
3 3 1 4.0 2
4 4 2 1.0 1
5 5 3 7.0 3
6 6 3 7.0 3
7 7 3 7.0 3
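One hedge worth noting: Series.mode() returns all modes when there is a tie, so squeeze() may not reduce to a scalar and the fillna can misalign. A slightly more defensive sketch (using transform so the result aligns by position, and skipping all-NaN groups) picks the first mode explicitly:

df['val2'] = df.groupby(['val1', 'val3'])['val2'] \
    .transform(lambda x: x.fillna(x.mode().iat[0]) if x.notna().any() else x)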
Old answer
IIUC, use groupby after sorting the dataframe by val2, then forward fill (the output below is based on an earlier revision of the sample data):
df['val2'] = df.sort_values('val2').groupby(['val1', 'val3'])['val2'].ffill()
print(df)
# Output:
id val1 val2 val3
0 0 1 1.1 2.2
1 1 1 1.1 2.2
2 3 2 1.3 1.0
3 4 3 1.5 6.2
4 5 3 1.5 6.2

Related

How to concatenate strings within a rolling window for each group in pandas?

I have a data set like below:
cluster order label
0 1 1 a
1 1 2 b
2 1 3 c
3 1 4 c
4 1 5 b
5 2 1 b
6 2 2 b
7 2 3 c
8 2 4 a
9 2 5 a
10 2 6 b
11 2 7 c
12 2 8 c
I want to add a column that concatenates a rolling window of 3 of the previous values of the column label. It seems pandas rolling can only do numerical calculations. Is there a way to concatenate strings?
cluster order label roll3
0 1 1 a NaN
1 1 2 b NaN
2 1 3 c NaN
3 1 4 c abc
4 1 5 b bcc
5 2 1 b NaN
6 2 2 b NaN
7 2 3 c NaN
8 2 4 a bbc
9 2 5 a bca
10 2 6 b caa
11 2 7 c aab
12 2 8 c abc
Use groupby.apply to shift and concat the labels:
df['roll3'] = (df.groupby('cluster')['label']
.apply(lambda x: x.shift(3) + x.shift(2) + x.shift(1)))
# cluster order label roll3
# 0 1 1 a NaN
# 1 1 2 b NaN
# 2 1 3 c NaN
# 3 1 4 c abc
# 4 1 5 b bcc
# 5 2 1 b NaN
# 6 2 2 b NaN
# 7 2 3 c NaN
# 8 2 4 a bbc
# 9 2 5 a bca
# 10 2 6 b caa
# 11 2 7 c aab
# 12 2 8 c abc
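If the window size may vary, the same shift-and-add idea generalizes; a sketch with a hypothetical helper rolling_concat (the name is mine, not a pandas API):

from functools import reduce

def rolling_concat(s, n):
    # add the n previous values of s, oldest first (NaN propagates)
    return reduce(lambda a, b: a + b, (s.shift(i) for i in range(n, 0, -1)))

df['roll3'] = df.groupby('cluster')['label'].transform(lambda x: rolling_concat(x, 3))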

Compute lagged means per name and round in pandas

I need to compute lagged means per group in my dataframe. This is what my df looks like:
name value round
0 a 5 3
1 b 4 3
2 c 3 2
3 d 1 2
4 a 2 1
5 c 1 1
0 c 1 3
1 d 4 3
2 b 3 2
3 a 1 2
4 b 5 1
5 d 2 1
I would like to compute lagged means for column value per name and round. That is, for name a in round 3 I need value_mean = 1.5 (because (1+2)/2 = 1.5). And of course, there will be NaN values when round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
name value round value_mean
0 a 5 3 NaN
1 b 4 3 5.0
2 c 3 2 3.5
3 d 1 2 NaN
4 a 2 1 4.0
5 c 1 1 3.5
0 c 1 3 NaN
1 d 4 3 3.0
2 b 3 2 2.0
3 a 1 2 NaN
4 b 5 1 1.0
5 d 2 1 2.5
Any idea, how can I do this, please? I found this, but it seems not relevant for my problem: Calculate the mean value using two columns in pandas
You can do that as follows:
import numpy as np

# sort the values as they need to be counted
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)
# create a grouper to calculate the running count
# and running sum as the basis of the average
grouper = df.groupby('name')
ser_sum = grouper['value'].cumsum()
ser_count = grouper['value'].cumcount() + 1
ser_mean = ser_sum.div(ser_count)
ser_same_name = df['name'] == df['name'].shift(1)
# finally you just have to set the first entry
# in each name-group to NaN (this usually would
# set the entries for each name and round=1 to NaN)
df['value_mean'] = ser_mean.shift(1).where(ser_same_name, np.nan)
# if you want to see the intermediate products,
# you can uncomment the following lines
#df['sum'] = ser_sum
#df['count'] = ser_count
df
Output:
name value round value_mean
0 a 2 1 NaN
1 a 1 2 2.0
2 a 5 3 1.5
3 b 5 1 NaN
4 b 3 2 5.0
5 b 4 3 4.0
6 c 1 1 NaN
7 c 3 2 1.0
8 c 1 3 2.0
9 d 2 1 NaN
10 d 1 2 2.0
11 d 4 3 1.5
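For comparison, the same lagged mean can be written more compactly with expanding().mean() plus a shift inside each group; a sketch, assuming the frame is already sorted by name and round as above:

df['value_mean'] = (df.groupby('name')['value']
                      .transform(lambda s: s.expanding().mean().shift()))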

Backfill and Increment by one?

I have a column of a DataFrame that consists of 0's and NaN's:
Timestamp A B C
1 3 3 NaN
2 5 2 NaN
3 9 1 NaN
4 2 6 NaN
5 3 3 0
6 5 2 NaN
7 3 1 NaN
8 2 8 NaN
9 1 6 0
And I want to backfill it, incrementing by one for each row above the next 0:
Timestamp A B C
1 3 3 4
2 5 2 3
3 9 1 2
4 2 6 1
5 3 3 0
6 5 2 3
7 3 1 2
8 2 8 1
9 1 6 0
You can use iloc[::-1] to reverse the data and groupby().cumcount() to create the row counter:
s = df['C'].iloc[::-1].notnull()
df['C'] = df['C'].bfill() + s.groupby(s.cumsum()).cumcount()
Output:
Timestamp A B C
0 1 3 3 4.0
1 2 5 2 3.0
2 3 9 1 2.0
3 4 2 6 1.0
4 5 3 3 0.0
5 6 5 2 3.0
6 7 3 1 2.0
7 8 2 8 1.0
8 9 1 6 0.0
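Unpacked step by step, the same answer reads as follows (a sketch; the intermediate names are mine):

# scanning bottom-up, each non-NaN (the 0s) starts a new run
rev_notna = df['C'].iloc[::-1].notna()
run_id = rev_notna.cumsum()
# 0, 1, 2, ... within each bottom-up run; aligns back by index
offset = rev_notna.groupby(run_id).cumcount()
df['C'] = df['C'].bfill() + offset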

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values to make a df like this:
total
1 2
3 2
4 1
5 2
8 2
Is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=True)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With a pandas Series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby():
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
level_1
0
1 2
3 2
4 1
5 2
8 2
Another alternative is to use stack followed by value_counts, then convert the result to a frame and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
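For completeness, a plain-Python sketch with collections.Counter, which also copes with mixed dtypes:

from collections import Counter

pd.Series(Counter(df.to_numpy().ravel())).sort_index()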

Count how many of every 3 rows fit a condition with pandas rolling

I have a dataframe that looks like this:
import pandas as pd

raw_data = {'col0': [1, 4, 5, 1, 3, 3, 1, 5, 8, 9, 1, 2]}
df = pd.DataFrame(raw_data)
col0
0 1
1 4
2 5
3 1
4 3
5 3
6 1
7 5
8 8
9 9
10 1
11 2
What I want to do is count, within every window of 3 rows, how many fit the condition (df['col0'] > 3), and make a new column that looks like this:
col0 col_roll_count3
0 1 0
1 4 1
2 5 2 # indices 0,1,2: 4 and 5 fit the condition
3 1 2
4 3 1
5 3 0 # indices 3,4,5: none fit the condition
6 1 0
7 5 1
8 8 2
9 9 3
10 1 2
11 2 1
How can I achieve that?
I tried this but failed:
df['col_roll_count3'] = df[df['col0']>3].rolling(3).count()
print(df)
col0 col_roll_count3
0 1 NaN
1 4 1.0
2 5 2.0
3 1 NaN
4 3 NaN
5 3 NaN
6 1 NaN
7 5 3.0
8 8 3.0
9 9 3.0
10 1 NaN
11 2 NaN
Compare against 3 first, then take a rolling sum of the boolean mask (min_periods=1 fills the first, partial windows):
df['col_roll_count3'] = df['col0'].gt(3).rolling(3, min_periods=1).sum()
Let's use rolling, apply, and np.count_nonzero:
import numpy as np

df['col_roll_count3'] = df.col0.rolling(3, min_periods=1)\
    .apply(lambda x: np.count_nonzero(x > 3))
Output:
col0 col_roll_count3
0 1 0.0
1 4 1.0
2 5 2.0
3 1 2.0
4 3 1.0
5 3 0.0
6 1 0.0
7 5 1.0
8 8 2.0
9 9 3.0
10 1 2.0
11 2 1.0
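The same rolling count can also be done in plain NumPy; a sketch assuming NumPy >= 1.20 for sliding_window_view, with two leading zeros to mimic min_periods=1:

mask = (df['col0'] > 3).to_numpy().astype(int)
padded = np.concatenate([np.zeros(2, dtype=int), mask])
# one length-3 window per original row
windows = np.lib.stride_tricks.sliding_window_view(padded, 3)
df['col_roll_count3'] = windows.sum(axis=1)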