Backfill and Increment by one? - pandas

I have a column of a DataFrame that consists of 0's and NaN's:
Timestamp A B C
1 3 3 NaN
2 5 2 NaN
3 9 1 NaN
4 2 6 NaN
5 3 3 0
6 5 2 NaN
7 3 1 NaN
8 2 8 NaN
9 1 6 0
And I want to backfill it and increment the last value:
Timestamp A B C
1 3 3 4
2 5 2 3
3 9 1 2
4 2 6 1
5 3 3 0
6 5 2 3
7 3 1 2
8 2 8 1
9 1 6 0

YOu can use iloc[::-1] to reverse the data, and groupby().cumcount() to create the row counter:
s = df['C'].iloc[::-1].notnull()
df['C'] = df['C'].bfill() + s.groupby(s.cumsum()).cumcount()
Output
Timestamp A B C
0 1 3 3 4.0
1 2 5 2 3.0
2 3 9 1 2.0
3 4 2 6 1.0
4 5 3 3 0.0
5 6 5 2 3.0
6 7 3 1 2.0
7 8 2 8 1.0
8 9 1 6 0.0

Related

how to use pandas concatenate string within rolling window for each group?

I have a data set like below:
cluster order label
0 1 1 a
1 1 2 b
2 1 3 c
3 1 4 c
4 1 5 b
5 2 1 b
6 2 2 b
7 2 3 c
8 2 4 a
9 2 5 a
10 2 6 b
11 2 7 c
12 2 8 c
I want to add a column to concatenate a rolling window of 3 for the previous values of the column label. It seems pandas rolling can only do calculations for numerical. Is there a way to concatenate string?
cluster order label roll3
0 1 1 a NaN
1 1 2 b NaN
2 1 3 c NaN
3 1 4 c abc
4 1 5 b bcc
5 2 1 b NaN
6 2 2 b NaN
7 2 3 c NaN
8 2 4 a bbc
9 2 5 a bca
10 2 6 b caa
11 2 7 c aab
12 2 8 c abc
Use groupby.apply to shift and concat the labels:
df['roll3'] = (df.groupby('cluster')['label']
.apply(lambda x: x.shift(3) + x.shift(2) + x.shift(1)))
# cluster order label roll3
# 0 1 1 a NaN
# 1 1 2 b NaN
# 2 1 3 c NaN
# 3 1 4 c abc
# 4 1 5 b bcc
# 5 2 1 b NaN
# 6 2 2 b NaN
# 7 2 3 c NaN
# 8 2 4 a bbc
# 9 2 5 a bca
# 10 2 6 b caa
# 11 2 7 c aab
# 12 2 8 c abc

pandas dataframe enforce monotically per row

I have a dataframe:
df = 0 1 2 3 4
1 1 3 2 5
4 1 5 7 8
7 1 2 3 9
I want to enforce monotonically per row, to get:
df = 0 1 2 3 4
1 1 3 3 5
4 4 5 7 8
7 7 7 7 9
What is the best way to do so?
Try cummax
out = df.cummax(1)
Out[80]:
0 1 2 3 4
0 1 1 3 3 5
1 4 4 5 7 8
2 7 7 7 7 9

replace some entries in a column of dataframe by a column of another dataframe

I have a dataframe about user-product-rating as below,
df1 =
USER_ID PRODUCT_ID RATING
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
another dataframe is the true ratings of some users and some products as below,
df2 =
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
I want to use the true ratings from df2 to replace the corresponding ratings in df1. So what I want to obtain is
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
Any operation to realize this?
rng = [i for i in range(0,10)]
df1 = pd.DataFrame({"USER_ID": rng,
"PRODUCT_ID": rng,
"RATING": rng})
rng_2 = [i for i in range(0,4)]
df2 = pd.DataFrame({'USER_ID' : rng_2,'PRODUCT_ID' : rng_2,
'RATING' : [10,10,10,10]})
Try to use update:
df1 = df1.set_index(['USER_ID', 'PRODUCT_ID'])
df2 = df2.set_index(['USER_ID', 'PRODUCT_ID'])
df1.update(df2)
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
print(df2)
USER_ID PRODUCT_ID RATING
0 0 0 10.0
1 1 1 10.0
2 2 2 10.0
3 3 3 10.0
4 4 4 4.0
5 5 5 5.0
6 6 6 6.0
7 7 7 7.0
8 8 8 8.0
9 9 9 9.0
You can use combine first:
df2.astype(object).combine_first(df1)
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9

generate lines with all unique values of given column for each group

df = pd.DataFrame({'timePoint': [1,1,1,1,2,2,2,2,3,3,3,3],
'item': [1,2,3,4,3,4,5,6,1,3,7,2],
'value': [2,4,7,6,5,9,3,2,4,3,1,5]})
>>> df
item timePoint value
0 1 1 2
1 2 1 4
2 3 1 7
3 4 1 6
4 3 2 5
5 4 2 9
6 5 2 3
7 6 2 2
8 1 3 4
9 3 3 3
10 7 3 1
11 2 3 5
In this df, not every item appears at every timePoint. I want to have all unique items at every timePoint, and these newly inserted items should either have:
(i) a NaN value if they have not appeared at a previous timePoint, or
(ii) if they have, they get their most recent value.
The desired output should look like the following (lines with hashtag are those inserted).
>>> dfx
item timePoint value
0 1 1 2.0
3 1 2 2.0 #
8 1 3 4.0
1 2 1 4.0
4 2 2 4.0 #
11 2 3 5.0
2 3 1 7.0
4 3 2 5.0
9 3 3 3.0
3 4 1 6.0
5 4 2 9.0
6 4 3 9.0 #
0 5 1 NaN #
6 5 2 3.0
7 5 3 3.0 #
1 6 1 NaN #
7 6 2 2.0
8 6 3 2.0 #
2 7 1 NaN #
5 7 2 NaN #
10 7 3 1.0
For example, item 1 gets a 4.0 at timePoint 2 because that's what it had a timePoint 1 whereas item 6 gets a NaN at timePoint 1 because there is no preceding value.
Now, I know that if I manage to insert all lines of every unique item missing in each timePoint group, i.e. reach this point:
>>> dfx
item timePoint value
0 1 1 2.0
1 2 1 4.0
2 3 1 7.0
3 4 1 6.0
4 3 2 5.0
5 4 2 9.0
6 5 2 3.0
7 6 2 2.0
8 1 3 4.0
9 3 3 3.0
10 7 3 1.0
11 2 3 5.0
0 5 1 NaN
1 6 1 NaN
2 7 1 NaN
3 1 2 NaN
4 2 2 NaN
5 7 2 NaN
6 4 3 NaN
7 5 3 NaN
8 6 3 NaN
Then I can do:
dfx.sort_values(by = ['item', 'timePoint'],
inplace = True,
ascending = [True, True])
dfx['value'] = dfx.groupby('item')['value'].fillna(method='ffill')
which will return the desired output.
But how do I add as lines all df.item.unique() items that are missing to each timePoint group?
Also, if you have a more efficient solution from scratch to suggest, then by all means please be my guest.
Using pd.MultiIndex.from_product, levels, reindex
d = df.set_index(['item', 'timePoint'])
d.reindex(
pd.MultiIndex.from_product(d.index.levels, names=d.index.names)
).groupby(level='item').ffill().reset_index()
item timePoint value
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0
I think stack with unstack will achieve the format , then we using groupby ffillto fill the nan value forward
s=df.set_index(['item','timePoint']).value.unstack().stack(dropna=False)
s.groupby(level=0).ffill().reset_index()
Out[508]:
item timePoint 0
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0

To count every 3 rows to fit the condition by Pandas rolling

I have dataframe look like this:
raw_data ={'col0':[1,4,5,1,3,3,1,5,8,9,1,2]}
df = DataFrame(raw_data)
col0
0 1
1 4
2 5
3 1
4 3
5 3
6 1
7 5
8 8
9 9
10 1
11 2
What I want to do is to count every 3 rows to fit condition(df['col0']>3) and make new col looks like this:
col0 col_roll_count3
0 1 0
1 4 1
2 5 2 #[index 0,1,2/ 4,5 fit the condition]
3 1 2
4 3 1
5 3 0 #[index 3,4,5/no fit the condition]
6 1 0
7 5 1
8 8 2
9 9 3
10 1 2
11 2 1
How can I achieve that?
I tried this but failed:
df['col_roll_count3'] = df[df['col0']>3].rolling(3).count()
print(df)
col0 col1
0 1 NaN
1 4 1.0
2 5 2.0
3 1 NaN
4 3 NaN
5 3 NaN
6 1 NaN
7 5 3.0
8 8 3.0
9 9 3.0
10 1 NaN
11 2 NaN
df['col_roll_count3'] = df['col0'].gt(3).rolling(3).sum()
Let's use rolling, apply, np.count_nonzero:
df['col_roll_count3'] = df.col0.rolling(3,min_periods=1)\
.apply(lambda x: np.count_nonzero(x>3))
Output:
col0 col_roll_count3
0 1 0.0
1 4 1.0
2 5 2.0
3 1 2.0
4 3 1.0
5 3 0.0
6 1 0.0
7 5 1.0
8 8 2.0
9 9 3.0
10 1 2.0
11 2 1.0