Replace some entries in a column of a dataframe with a column of another dataframe - pandas

I have a dataframe of user-product ratings, as below:
df1 =
USER_ID PRODUCT_ID RATING
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
Another dataframe holds the true ratings for some users and some products, as below:
df2 =
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
I want to use the true ratings from df2 to replace the corresponding ratings in df1. So what I want to obtain is
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
Is there an operation that does this?

import pandas as pd
rng = list(range(10))
df1 = pd.DataFrame({"USER_ID": rng,
"PRODUCT_ID": rng,
"RATING": rng})
rng_2 = list(range(4))
df2 = pd.DataFrame({"USER_ID": rng_2, "PRODUCT_ID": rng_2,
"RATING": [10, 10, 10, 10]})
Try update. It aligns on the index, so set the key columns as the index first:
df1 = df1.set_index(['USER_ID', 'PRODUCT_ID'])
df2 = df2.set_index(['USER_ID', 'PRODUCT_ID'])
df1.update(df2)
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
print(df1)
USER_ID PRODUCT_ID RATING
0 0 0 10.0
1 1 1 10.0
2 2 2 10.0
3 3 3 10.0
4 4 4 4.0
5 5 5 5.0
6 6 6 6.0
7 7 7 7.0
8 8 8 8.0
9 9 9 9.0
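As a possible alternative to update (not the method used in the answer above), a left merge on the key columns keeps the original row order and lets you restore the integer dtype explicitly. This is only a sketch on the question's data; the `_true` suffix and the `result` name are illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({"USER_ID": range(10), "PRODUCT_ID": range(10), "RATING": range(10)})
df2 = pd.DataFrame({"USER_ID": range(4), "PRODUCT_ID": range(4), "RATING": [10] * 4})

keys = ["USER_ID", "PRODUCT_ID"]

# Left merge keeps every row of df1; rows without a match get NaN in RATING_true.
merged = df1.merge(df2, on=keys, how="left", suffixes=("", "_true"))

# Prefer the true rating where one exists, otherwise keep the original rating.
merged["RATING"] = merged["RATING_true"].fillna(merged["RATING"]).astype("int64")
result = merged.drop(columns="RATING_true")
print(result)
```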

You can use combine_first (it aligns on the row index, which happens to match the keys here):
df2.astype(object).combine_first(df1)
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
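Because combine_first aligns on the index, the one-liner above relies on the row labels of both frames lining up with the keys. A hedged variant, assuming the rows should be matched on USER_ID and PRODUCT_ID rather than on position, sets those columns as the index first. The final cast is there because the dtype after combine_first can differ between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({"USER_ID": range(10), "PRODUCT_ID": range(10), "RATING": range(10)})
df2 = pd.DataFrame({"USER_ID": range(4), "PRODUCT_ID": range(4), "RATING": [10] * 4})

keys = ["USER_ID", "PRODUCT_ID"]

# Values from df2 win; df1 fills in every key that df2 does not cover.
result = (
    df2.set_index(keys)
       .combine_first(df1.set_index(keys))
       .reset_index()
       .astype({"RATING": "int64"})
)
print(result)
```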

Related

Pandas: How to extract data that has been grouped by

Here is an example code to demonstrate my problem:
import numpy as np
import pandas as pd
np.random.seed(10)
df = pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=list('xy'))
df
x y
0 9 4
1 0 1
2 9 0
3 1 8
4 9 0
... ... ...
95 0 4
96 6 4
97 9 8
98 0 7
99 1 7
groups = df.groupby(['x'])
groups.size()
x
0 11
1 12
2 15
3 13
4 14
5 5
6 6
7 9
8 5
9 10
dtype: int64
How can I access the x-values as a column and the aggregated y-values as a second column to plot x versus y?
Two options.
Use reset_index():
groups = df.groupby(['x']).size().reset_index(name='size')
Add as_index=False to groupby:
groups = df.groupby(['x'], as_index=False).size()
Output for both:
>>> groups
x size
0 0 16
1 1 9
2 2 9
3 3 5
4 4 7
5 5 10
6 6 10
7 7 7
8 8 12
9 9 15
IIUC, use as_index=False:
groups = df.groupby(['x'], as_index=False)
out = groups.size()
out.plot(x='x', y='size')
If you only want to plot, you can also keep the x as index:
df.groupby(['x']).size().plot()
output:
x size
0 0 16
1 1 9
2 2 9
3 3 5
4 4 7
5 5 10
6 6 10
7 7 7
8 8 12
9 9 15
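To see that the two options produce the same table, here is a small sketch on hand-made data (the values are made up for illustration, not taken from the question). It assumes a pandas version in which size() on an as_index=False groupby returns a DataFrame (1.1 or later, as far as I know).

```python
import pandas as pd

# Tiny hypothetical frame: x=0 occurs twice, x=1 once, x=2 three times.
df = pd.DataFrame({"x": [0, 0, 1, 2, 2, 2], "y": [5, 3, 8, 1, 4, 7]})

# Option 1: group sizes as a Series, then move the index into a column.
via_reset = df.groupby("x").size().reset_index(name="size")

# Option 2: ask groupby to keep the key as a regular column.
via_flag = df.groupby("x", as_index=False).size()

print(via_reset)
print(via_flag)
```

Either result can then be passed to plot(x='x', y='size').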

pandas dataframe enforce monotonicity per row

I have a dataframe:
df = 0 1 2 3 4
1 1 3 2 5
4 1 5 7 8
7 1 2 3 9
I want to enforce monotonically increasing values per row, to get:
df = 0 1 2 3 4
1 1 3 3 5
4 4 5 7 8
7 7 7 7 9
What is the best way to do so?
Try cummax along the columns (axis=1):
out = df.cummax(1)
Out[80]:
0 1 2 3 4
0 1 1 3 3 5
1 4 4 5 7 8
2 7 7 7 7 9
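A self-contained version of the same idea, rebuilt from the question's data, with a check that every row ends up non-decreasing. The integer column labels 0 to 4 are assumed from the printed frame.

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 2, 5],
                   [4, 1, 5, 7, 8],
                   [7, 1, 2, 3, 9]])

# Running maximum from left to right within each row.
out = df.cummax(axis=1)

# Every row should now be non-decreasing.
rows_ok = out.apply(lambda row: row.is_monotonic_increasing, axis=1)
print(out)
print(rows_ok.all())
```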

Backfill and Increment by one?

I have a column of a DataFrame that consists of 0's and NaN's:
Timestamp A B C
1 3 3 NaN
2 5 2 NaN
3 9 1 NaN
4 2 6 NaN
5 3 3 0
6 5 2 NaN
7 3 1 NaN
8 2 8 NaN
9 1 6 0
And I want to backfill it, incrementing by one for each row further back from the next 0:
Timestamp A B C
1 3 3 4
2 5 2 3
3 9 1 2
4 2 6 1
5 3 3 0
6 5 2 3
7 3 1 2
8 2 8 1
9 1 6 0
You can use iloc[::-1] to reverse the data, and groupby().cumcount() to create the row counter:
s = df['C'].iloc[::-1].notnull()
df['C'] = df['C'].bfill() + s.groupby(s.cumsum()).cumcount()
Output
Timestamp A B C
0 1 3 3 4.0
1 2 5 2 3.0
2 3 9 1 2.0
3 4 2 6 1.0
4 5 3 3 0.0
5 6 5 2 3.0
6 7 3 1 2.0
7 8 2 8 1.0
8 9 1 6 0.0
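The two lines above can be run end to end on the question's column as follows. This is a sketch with only Timestamp and C included for brevity; the comments spell out what each intermediate holds.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Timestamp": range(1, 10),
    "C": [np.nan, np.nan, np.nan, np.nan, 0, np.nan, np.nan, np.nan, 0],
})

# Reverse the column and flag the rows that already hold a value.
s = df["C"].iloc[::-1].notnull()

# s.cumsum() starts a new block at each flagged row (walking bottom-up);
# cumcount() numbers the rows inside each block 0, 1, 2, ...
steps_back = s.groupby(s.cumsum()).cumcount()

# The addition aligns on the index, so the reversed order does not matter.
df["C"] = df["C"].bfill() + steps_back
print(df)
```

Any NaN rows that come after the last 0 would stay NaN, because bfill has nothing to fill them with.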

How to find the average of multiple columns using a common column in pandas

How can I calculate the mean of all the columns, grouped by a 'count' column? I have created a dataframe with randomly generated values in the code below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,10)*100/10, columns=list('ABCDEFGHIJ')).astype(int)
df
output:
A B C D E F G H I J
0 4 3 2 8 5 0 9 9 0 5
1 1 5 8 0 5 9 8 3 9 1
2 9 5 1 1 3 2 6 3 8 3
3 4 0 8 1 7 3 4 2 8 8
4 9 4 8 2 7 9 7 8 9 7
5 1 0 7 3 8 6 1 7 2 0
6 3 6 8 9 6 6 5 0 8 4
7 8 9 9 5 3 9 0 7 5 5
8 5 5 8 7 8 4 3 0 9 9
9 2 4 2 3 0 5 2 0 3 0
I found the mean for a single column like this. How can I find the mean for multiple columns with respect to count in pandas?
df['count'] = 1
print(df)
df.groupby('count').agg({'A':'mean'})
A B C D E F G H I J count
0 4 3 2 8 5 0 9 9 0 5 1
1 1 5 8 0 5 9 8 3 9 1 1
2 9 5 1 1 3 2 6 3 8 3 1
3 4 0 8 1 7 3 4 2 8 8 1
4 9 4 8 2 7 9 7 8 9 7 1
5 1 0 7 3 8 6 1 7 2 0 1
6 3 6 8 9 6 6 5 0 8 4 1
7 8 9 9 5 3 9 0 7 5 5 1
8 5 5 8 7 8 4 3 0 9 9 1
9 2 4 2 3 0 5 2 0 3 0 1
A
count
1 4.6
If you need the mean of all columns per group of the count column, use:
df.groupby('count').mean()
If you need the mean over all rows (the same as grouping when count holds a single value), use:
df.mean().to_frame().T
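A small sketch comparing the two calls on a hypothetical two-column frame (the values are made up so the means are easy to check by hand). Note that df.mean() would also average the count column itself, so it is dropped first here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df["count"] = 1

# Mean of every remaining column within each value of 'count'.
per_group = df.groupby("count").mean()

# Mean over all rows, returned as a one-row DataFrame.
overall = df.drop(columns="count").mean().to_frame().T

print(per_group)
print(overall)
```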

Group counts in new column

I want a new column "group_count". It should show in how many groups in total each attribute occurs.
Group Attribute group_count
0 1 10 4
1 1 10 4
2 1 10 4
3 2 10 4
4 2 20 1
5 3 30 1
6 3 10 4
7 4 10 4
I tried to group by Group and Attribute and then transform using count:
df["group_count"] = df.groupby(["Group", "Attribute"])["Attribute"].transform("count")
Group Attribute group_count
0 1 10 3
1 1 10 3
2 1 10 3
3 2 10 1
4 2 20 1
5 3 30 1
6 3 10 1
7 4 10 1
But it doesn't give the result I want.
Use df.drop_duplicates(['Group','Attribute']) to get the unique Attribute values per group, then group by Attribute to count the groups, and finally map the counts back onto the original Attribute column.
m=df.drop_duplicates(['Group','Attribute'])
df['group_count']=df['Attribute'].map(m.groupby('Attribute')['Group'].count())
print(df)
Group Attribute group_count
0 1 10 4
1 1 10 4
2 1 10 4
3 2 10 4
4 2 20 1
5 3 30 1
6 3 10 4
7 4 10 4
Use DataFrameGroupBy.nunique with transform:
df['group_count1'] = df.groupby('Attribute')['Group'].transform('nunique')
print (df)
Group Attribute group_count group_count1
0 1 10 4 4
1 1 10 4 4
2 1 10 4 4
3 2 10 4 4
4 2 20 1 1
5 3 30 1 1
6 3 10 4 4
7 4 10 4 4
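To make the difference from the attempt in the question concrete, here is a sketch on the question's data. Grouping by both columns counts rows per (Group, Attribute) pair, whereas grouping by Attribute alone and taking nunique of Group counts distinct groups per attribute. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Group":     [1, 1, 1, 2, 2, 3, 3, 4],
    "Attribute": [10, 10, 10, 10, 20, 30, 10, 10],
})

# The attempt from the question: number of rows in each (Group, Attribute) pair.
rows_per_pair = df.groupby(["Group", "Attribute"])["Attribute"].transform("count")

# What was asked for: number of distinct groups in which each attribute occurs.
groups_per_attribute = df.groupby("Attribute")["Group"].transform("nunique")

df["group_count"] = groups_per_attribute
print(df)
```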