I have a dataframe:
df =
0 1 2 3 4
1 1 3 2 5
4 1 5 7 8
7 1 2 3 9
I want to enforce monotonicity per row, to get:
df =
0 1 2 3 4
1 1 3 3 5
4 4 5 7 8
7 7 7 7 9
What is the best way to do so?
Try cummax:
out = df.cummax(axis=1)
Out[80]:
0 1 2 3 4
0 1 1 3 3 5
1 4 4 5 7 8
2 7 7 7 7 9
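For reference, a self-contained version of the above; the column labels 0-4 follow the question's layout:
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 2, 5],
                   [4, 1, 5, 7, 8],
                   [7, 1, 2, 3, 9]])
# cummax(axis=1) carries the running maximum left to right across each row
out = df.cummax(axis=1)
print(out)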
How can I calculate the mean value of all the columns with respect to a 'count' column? I have created a dataframe with randomly generated values in the code below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 10) * 10, columns=list('ABCDEFGHIJ')).astype(int)
df
output:
A B C D E F G H I J
0 4 3 2 8 5 0 9 9 0 5
1 1 5 8 0 5 9 8 3 9 1
2 9 5 1 1 3 2 6 3 8 3
3 4 0 8 1 7 3 4 2 8 8
4 9 4 8 2 7 9 7 8 9 7
5 1 0 7 3 8 6 1 7 2 0
6 3 6 8 9 6 6 5 0 8 4
7 8 9 9 5 3 9 0 7 5 5
8 5 5 8 7 8 4 3 0 9 9
9 2 4 2 3 0 5 2 0 3 0
I found the mean value for a single column like this. How do I find the mean for multiple columns with respect to the count column in pandas?
df['count'] = 1
print(df)
df.groupby('count').agg({'A':'mean'})
A B C D E F G H I J count
0 4 3 2 8 5 0 9 9 0 5 1
1 1 5 8 0 5 9 8 3 9 1 1
2 9 5 1 1 3 2 6 3 8 3 1
3 4 0 8 1 7 3 4 2 8 8 1
4 9 4 8 2 7 9 7 8 9 7 1
5 1 0 7 3 8 6 1 7 2 0 1
6 3 6 8 9 6 6 5 0 8 4 1
7 8 9 9 5 3 9 0 7 5 5 1
8 5 5 8 7 8 4 3 0 9 9 1
9 2 4 2 3 0 5 2 0 3 0 1
A
count
1 4.6
If you need the mean of all columns per group (grouping by the count column), use:
df.groupby('count').mean()
If you need the mean over all rows (equivalent to grouping when count holds the same value everywhere), use:
df.mean().to_frame().T
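For instance, to restrict the per-group mean to a subset of columns (A and B here, purely illustrative):
df.groupby('count')[['A', 'B']].mean()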
I have a dataframe of user-product ratings, as below:
df1 =
USER_ID PRODUCT_ID RATING
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
Another dataframe holds the true ratings of some users and products, as below:
df2 =
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
I want to use the true ratings from df2 to replace the corresponding ratings in df1. So what I want to obtain is
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
Is there an operation to achieve this?
rng = list(range(10))
df1 = pd.DataFrame({"USER_ID": rng,
                    "PRODUCT_ID": rng,
                    "RATING": rng})
rng_2 = list(range(4))
df2 = pd.DataFrame({"USER_ID": rng_2,
                    "PRODUCT_ID": rng_2,
                    "RATING": [10, 10, 10, 10]})
Try to use update:
df1 = df1.set_index(['USER_ID', 'PRODUCT_ID'])
df2 = df2.set_index(['USER_ID', 'PRODUCT_ID'])
df1.update(df2)
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
print(df1)
USER_ID PRODUCT_ID RATING
0 0 0 10.0
1 1 1 10.0
2 2 2 10.0
3 3 3 10.0
4 4 4 4.0
5 5 5 5.0
6 6 6 6.0
7 7 7 7.0
8 8 8 8.0
9 9 9 9.0
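Note that DataFrame.update aligns on the index (and column) labels, which is why both frames are keyed on USER_ID and PRODUCT_ID first. A compact sketch of the same idea, assuming the df1/df2 built above:
keys = ['USER_ID', 'PRODUCT_ID']
out = df1.set_index(keys)
out.update(df2.set_index(keys))  # overwrites values where the keys match
out = out.reset_index()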
You can use combine_first:
df2.astype(object).combine_first(df1)
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
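combine_first also aligns on the index, so the positional match above only works because the first four rows of the two frames happen to line up. A sketch for the general case, keying on the ID columns:
keys = ['USER_ID', 'PRODUCT_ID']
out = (df2.set_index(keys)
          .combine_first(df1.set_index(keys))
          .reset_index())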
Is there a simple way to group a pandas dataframe according to a given value? In the code below the value is 1.5. I want to find the last row with a value greater than or equal to 1.5 and add a new column 'Group' such that that row and all rows above it are in one group and the rows below it are in another group.
import pandas as pd
import numpy as np
fnd_val = 1.5
A = [1,2,3,4,5,6,7,8,9]
B = [-1.1306,-0.5694,-0.7241,1.8211,1.5555,0.0416,1.9236,0.1944,-0.0204]
df = pd.DataFrame({'Class':A, 'Value':B})
df_srtd = df.sort_values(by="Class", ascending=True)
>>>df_srtd
Class Value
0 1 -1.1306
1 2 -0.5694
2 3 -0.7241
3 4 1.8211
4 5 1.5555
5 6 0.0416
6 7 1.9236
7 8 0.19440
8 9 -0.0204
#Desired output
Class Value Group
0 1 -1.1306 2
1 2 -0.5694 2
2 3 -0.7241 2
3 4 1.8211 2
4 5 1.5555 2
5 6 0.0416 2
6 7 1.9236 2
7 8 0.19440 1
8 9 -0.0204 1
#or if sorting is reversed like below
df_srtd = df.sort_values(by = "Class", ascending=False)
>>>df_srtd
Class Value
8 9 -0.0204
7 8 0.19440
6 7 1.9236
5 6 0.0416
4 5 1.5555
3 4 1.8211
2 3 -0.7241
1 2 -0.5694
0 1 -1.1306
#Desired output
Class Value Group
8 9 -0.0204 2
7 8 0.19440 2
6 7 1.9236 2
5 6 0.0416 2
4 5 1.5555 2
3 4 1.8211 2
2 3 -0.7241 1
1 2 -0.5694 1
0 1 -1.1306 1
My approach (using the reverse-sorted data):
import pandas as pd
import numpy as np
A = [1,2,3,4,5,6,7,8,9]
B = [-1.1306,-0.5694,-0.7241,1.8211,1.5555,0.0416,1.9236,0.1944,-0.0204]
df = pd.DataFrame({'Class':A, 'Value':B})
df_srtd = df.sort_values(by = "Class", ascending=False)
df_srtd['val_nxt'] = df_srtd['Value'].shift(-1)
fnd_val = 1.5
conditions = [
    (df_srtd['Value'] >= fnd_val),
    (df_srtd['Value'] < fnd_val)
        & (df_srtd['Value'] < df_srtd['val_nxt']),
    (df_srtd['Value'] < fnd_val)
]
choices = ['2', '2', '1']
df_srtd['Group'] = np.select(conditions, choices, default='-99')
print(df_srtd)
Result obtained:
Class Value val_nxt Group
8 9 -0.0204 0.19440 2
7 8 0.19440 1.9236 2
6 7 1.9236 0.0416 2
5 6 0.0416 1.5555 2
4 5 1.5555 1.8211 2
3 4 1.8211 -0.7241 2 #All after this should be grouped 1
2 3 -0.7241 -0.5694 2 #This one should have been 1 but is grouped 2
1 2 -0.5694 -1.1306 1
0 1 -1.1306 NaN 1
As seen in the result above, the row with class 3 is put in group 2 instead of 1. I tried adding more conditions, but nothing worked.
Try this
df_srtd['Group'] = df_srtd.Value.ge(fnd_val)[::-1].cummax() + 1
Out[321]:
Class Value Group
0 1 -1.1306 2
1 2 -0.5694 2
2 3 -0.7241 2
3 4 1.8211 2
4 5 1.5555 2
5 6 0.0416 2
6 7 1.9236 2
7 8 0.1944 1
8 9 -0.0204 1
The same expression works on the reverse-sorted Class.
Sample `df_srtd_rev`:
Class Value
8 9 -0.0204
7 8 0.1944
6 7 1.9236
5 6 0.0416
4 5 1.5555
3 4 1.8211
2 3 -0.7241
1 2 -0.5694
0 1 -1.1306
df_srtd_rev['Group'] = df_srtd_rev.Value.ge(fnd_val)[::-1].cummax() + 1
Out[326]:
Class Value Group
8 9 -0.0204 2
7 8 0.1944 2
6 7 1.9236 2
5 6 0.0416 2
4 5 1.5555 2
3 4 1.8211 2
2 3 -0.7241 1
1 2 -0.5694 1
0 1 -1.1306 1
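To unpack the one-liner on a toy Series (values are illustrative): the reversed cummax flags every position at or before the last threshold hit. The trailing reversal below only restores print order; in the answer above it is unnecessary because assignment back into the DataFrame aligns on the index.
import pandas as pd

vals = pd.Series([-1.1, 1.8, 0.04, 1.9, 0.2])
mask = vals.ge(1.5)               # True where the threshold is met
flag = mask[::-1].cummax()[::-1]  # True up to and including the last True
print((flag + 1).tolist())        # [2, 2, 2, 2, 1]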
I want a new column "group_count" that shows, for each attribute value, in how many groups it occurs in total.
Group Attribute group_count
0 1 10 4
1 1 10 4
2 1 10 4
3 2 10 4
4 2 20 1
5 3 30 1
6 3 10 4
7 4 10 4
I tried to group by Group and Attribute and then transform using count:
df["group_count"] = df.groupby(["Group", "Attribute"])["Attribute"].transform("count")
Group Attribute group_count
0 1 10 3
1 1 10 3
2 1 10 3
3 2 10 1
4 2 20 1
5 3 30 1
6 3 10 1
7 4 10 1
But it doesn't work.
Use df.drop_duplicates(['Group','Attribute']) to get the unique Attribute values per group, then group by Attribute to get the count of Group, and finally map the result back onto the original Attribute column.
m=df.drop_duplicates(['Group','Attribute'])
df['group_count']=df['Attribute'].map(m.groupby('Attribute')['Group'].count())
print(df)
Group Attribute group_count
0 1 10 4
1 1 10 4
2 1 10 4
3 2 10 4
4 2 20 1
5 3 30 1
6 3 10 4
7 4 10 4
Use DataFrameGroupBy.nunique with transform:
df['group_count1'] = df.groupby('Attribute')['Group'].transform('nunique')
print (df)
Group Attribute group_count group_count1
0 1 10 4 4
1 1 10 4 4
2 1 10 4 4
3 2 10 4 4
4 2 20 1 1
5 3 30 1 1
6 3 10 4 4
7 4 10 4 4
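The difference from the count attempt above: transform('nunique') counts distinct Group values per Attribute across the whole frame, while the earlier groupby counted rows within each (Group, Attribute) pair. A toy check mirroring the example data:
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 2, 2, 3, 3, 4],
                   'Attribute': [10, 10, 10, 10, 20, 30, 10, 10]})
# Attribute 10 occurs in groups 1, 2, 3 and 4 -> 4; 20 and 30 -> 1
print(df.groupby('Attribute')['Group'].nunique())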
I need help with comparing two dataframes. For example:
The first dataframe is
df_1 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 2 2 2 2 2 2
5 5 5 5 5 5 5
6 1 1 1 1 1 1
7 6 6 6 6 6 6
The second dataframe is
df_2 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 5 5 5 5 5 5
5 6 6 6 6 6 6
May I know if there is a way (without using a for loop) to find the indices of the rows of df_1 that have the same row values as df_2? In the example above, my expected output is below:
index =
0
1
2
3
5
7
The "index" result above should have as many entries as df_2 has rows.
If a row of df_2 is repeated in df_1 more than once, I only need the index of its first appearance; that's why I don't need indices 4 and 6.
Please help. Thank you so much!
Tommy
Use DataFrame.merge with DataFrame.drop_duplicates and DataFrame.reset_index (which converts the index into a column so the index values are not lost in the merge), then select the column called index. With no on= argument, merge joins on all the columns the two frames share, and drop_duplicates keeps only the first occurrence of each repeated row, which is why indices 4 and 6 do not appear:
s = df_2.merge(df_1.drop_duplicates().reset_index())['index']
print (s)
0 0
1 1
2 2
3 3
4 5
5 7
Name: index, dtype: int64
Detail:
print (df_2.merge(df_1.drop_duplicates().reset_index()))
0 1 2 3 4 5 index
0 1 1 1 1 1 1 0
1 2 2 2 2 2 2 1
2 3 3 3 3 3 3 2
3 4 4 4 4 4 4 3
4 5 5 5 5 5 5 5
5 6 6 6 6 6 6 7
Check this solution:
df1 = pd.DataFrame({'0':[1,2,3,4,2,5,1,6],
                    '1':[1,2,3,4,2,5,1,6],
                    '2':[1,2,3,4,2,5,1,6],
                    '3':[1,2,3,4,2,5,1,6],
                    '4':[1,2,3,4,2,5,1,6],
                    '5':[1,2,3,4,2,5,1,6]})
df2 = pd.DataFrame({'0':[1,2,3,4,5,6],
                    '1':[1,2,3,4,5,66],
                    '2':[1,2,3,4,5,6],
                    '3':[1,2,3,4,5,66],
                    '4':[1,2,3,4,5,6],
                    '5':[1,2,3,4,5,6]})
df1[df1.isin(df2)].index.values.tolist()
Output:
[0, 1, 2, 3, 4, 5, 6, 7]
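Be aware that indexing with a boolean DataFrame masks individual cells rather than dropping rows, which is why the expression above returns df1's full index unchanged. A hypothetical row-wise variant would reduce the mask first, keeping a row only when every aligned cell matches:
df1[df1.isin(df2).all(axis=1)].index.tolist()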