Grouping pandas rows above and below based on a specified value

Is there a simple way to group a pandas DataFrame according to a given value? In the code below the value is 1.5: I want to find the last row with a value greater than or equal to 1.5 and add a new column 'Group' such that that row and all rows above it are in one group, and the rows below it are in another group.
import pandas as pd
import numpy as np
fnd_val = 1.5
A = [1,2,3,4,5,6,7,8,9]
B = [-1.1306,-0.5694,-0.7241,1.8211,1.5555,0.0416,1.9236,0.1944,-0.0204]
df = pd.DataFrame({'Class':A, 'Value':B})
df_srtd = df.sort_values(by="Class", ascending=True)
>>>df_srtd
Class Value
0 1 -1.1306
1 2 -0.5694
2 3 -0.7241
3 4 1.8211
4 5 1.5555
5 6 0.0416
6 7 1.9236
7 8 0.1944
8 9 -0.0204
#Desired output
Class Value Group
0 1 -1.1306 2
1 2 -0.5694 2
2 3 -0.7241 2
3 4 1.8211 2
4 5 1.5555 2
5 6 0.0416 2
6 7 1.9236 2
7 8 0.1944 1
8 9 -0.0204 1
#or if sorting is reversed like below
df_srtd = df.sort_values(by = "Class", ascending=False)
>>>df_srtd
Class Value
8 9 -0.0204
7 8 0.1944
6 7 1.9236
5 6 0.0416
4 5 1.5555
3 4 1.8211
2 3 -0.7241
1 2 -0.5694
0 1 -1.1306
#Desired output
Class Value Group
8 9 -0.0204 2
7 8 0.1944 2
6 7 1.9236 2
5 6 0.0416 2
4 5 1.5555 2
3 4 1.8211 2
2 3 -0.7241 1
1 2 -0.5694 1
0 1 -1.1306 1
My approach (using the reverse-sorted data):
import pandas as pd
import numpy as np
A = [1,2,3,4,5,6,7,8,9]
B = [-1.1306,-0.5694,-0.7241,1.8211,1.5555,0.0416,1.9236,0.1944,-0.0204]
df = pd.DataFrame({'Class':A, 'Value':B})
df_srtd = df.sort_values(by = "Class", ascending=False)
df_srtd['val_nxt'] = df_srtd['Value'].shift(-1)
fnd_val = 1.5
conditions = [
    (df_srtd['Value'] >= fnd_val),
    (df_srtd['Value'] < fnd_val)
        & (df_srtd['Value'] < df_srtd['val_nxt']),
    (df_srtd['Value'] < fnd_val)
]
choices = ['2', '2', '1']
df_srtd['Group'] = np.select(conditions, choices, default='-99')
print(df_srtd)
Result obtained:
Class Value val_nxt Group
8 9 -0.0204 0.19440 2
7 8 0.19440 1.9236 2
6 7 1.9236 0.0416 2
5 6 0.0416 1.5555 2
4 5 1.5555 1.8211 2
3 4 1.8211 -0.7241 2 #All after this should be grouped 1
2 3 -0.7241 -0.5694 2 #This one should have been 1 but is grouped 2
1 2 -0.5694 -1.1306 1
0 1 -1.1306 NaN 1
As seen in the result above, the row with class 3 is put in group 2 instead of 1. I tried adding more conditions, but nothing worked.

Try this
df_srtd['Group'] = df_srtd.Value.ge(fnd_val)[::-1].cummax() + 1
Out[321]:
Class Value Group
0 1 -1.1306 2
1 2 -0.5694 2
2 3 -0.7241 2
3 4 1.8211 2
4 5 1.5555 2
5 6 0.0416 2
6 7 1.9236 2
7 8 0.1944 1
8 9 -0.0204 1
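Why this works, as a minimal step-by-step sketch (the intermediate names mask and flagged are my own, not part of the answer):
# True where Value >= fnd_val
mask = df_srtd.Value.ge(fnd_val)
# reverse and take the running maximum: in the original order this
# marks every row up to and including the last row that met the cutoff
flagged = mask[::-1].cummax()
# booleans add as 0/1, so True -> 2 and False -> 1; the assignment
# aligns on the index, so the reversed order is harmless
df_srtd['Group'] = flagged + 1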
It works the same when Class is sorted in reverse.
Sample `df_srtd_rev`:
Class Value
8 9 -0.0204
7 8 0.1944
6 7 1.9236
5 6 0.0416
4 5 1.5555
3 4 1.8211
2 3 -0.7241
1 2 -0.5694
0 1 -1.1306
df_srtd_rev['Group'] = df_srtd_rev.Value.ge(fnd_val)[::-1].cummax() + 1
Out[326]:
Class Value Group
8 9 -0.0204 2
7 8 0.1944 2
6 7 1.9236 2
5 6 0.0416 2
4 5 1.5555 2
3 4 1.8211 2
2 3 -0.7241 1
1 2 -0.5694 1
0 1 -1.1306 1

Related

pandas dataframe enforce monotonicity per row

I have a dataframe:
df = 0 1 2 3 4
1 1 3 2 5
4 1 5 7 8
7 1 2 3 9
I want to enforce monotonicity per row, to get:
df = 0 1 2 3 4
1 1 3 3 5
4 4 5 7 8
7 7 7 7 9
What is the best way to do so?
Try cummax
out = df.cummax(1)
Out[80]:
0 1 2 3 4
0 1 1 3 3 5
1 4 4 5 7 8
2 7 7 7 7 9
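For reference, a self-contained sketch of the same call (the data is copied from the question; axis=1 applies the running maximum across each row):
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 2, 5],
                   [4, 1, 5, 7, 8],
                   [7, 1, 2, 3, 9]])
# the running maximum left-to-right within each row makes
# every row monotonically non-decreasing
out = df.cummax(axis=1)
print(out)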

How to find the average of multiple columns using a common column in pandas

How do I calculate the mean value of all the columns with the 'count' column? I have created a dataframe with randomly generated values in the code below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 10) * 100 / 10, columns=list('ABCDEFGHIJ')).astype(int)
df
output:
A B C D E F G H I J
0 4 3 2 8 5 0 9 9 0 5
1 1 5 8 0 5 9 8 3 9 1
2 9 5 1 1 3 2 6 3 8 3
3 4 0 8 1 7 3 4 2 8 8
4 9 4 8 2 7 9 7 8 9 7
5 1 0 7 3 8 6 1 7 2 0
6 3 6 8 9 6 6 5 0 8 4
7 8 9 9 5 3 9 0 7 5 5
8 5 5 8 7 8 4 3 0 9 9
9 2 4 2 3 0 5 2 0 3 0
I found the mean value for a single column like this. How do I find the mean for multiple columns with respect to 'count' in pandas?
df['count'] = 1
print(df)
df.groupby('count').agg({'A':'mean'})
A B C D E F G H I J count
0 4 3 2 8 5 0 9 9 0 5 1
1 1 5 8 0 5 9 8 3 9 1 1
2 9 5 1 1 3 2 6 3 8 3 1
3 4 0 8 1 7 3 4 2 8 8 1
4 9 4 8 2 7 9 7 8 9 7 1
5 1 0 7 3 8 6 1 7 2 0 1
6 3 6 8 9 6 6 5 0 8 4 1
7 8 9 9 5 3 9 0 7 5 5 1
8 5 5 8 7 8 4 3 0 9 9 1
9 2 4 2 3 0 5 2 0 3 0 1
A
count
1 4.6
If you need the mean of all columns per group by column 'count', use:
df.groupby('count').mean()
If you need the mean over all rows (equivalent to grouping when 'count' has the same value everywhere), use:
df.mean().to_frame().T
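A minimal sketch of both variants on a tiny frame (the values here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'A': [4, 1, 9], 'B': [3, 5, 5]})
df['count'] = 1
# one row of column means per distinct value in 'count'
print(df.groupby('count').mean())
# column-wise means over all rows, returned as a one-row frame
print(df.mean().to_frame().T)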

replace some entries in a column of dataframe by a column of another dataframe

I have a dataframe of user-product ratings as below:
df1 =
USER_ID PRODUCT_ID RATING
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
Another dataframe holds the true ratings of some users and some products:
df2 =
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
I want to use the true ratings from df2 to replace the corresponding ratings in df1. So what I want to obtain is
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
Is there an operation to achieve this?
rng = [i for i in range(0, 10)]
df1 = pd.DataFrame({'USER_ID': rng,
                    'PRODUCT_ID': rng,
                    'RATING': rng})
rng_2 = [i for i in range(0, 4)]
df2 = pd.DataFrame({'USER_ID': rng_2,
                    'PRODUCT_ID': rng_2,
                    'RATING': [10, 10, 10, 10]})
Try to use update:
df1 = df1.set_index(['USER_ID', 'PRODUCT_ID'])
df2 = df2.set_index(['USER_ID', 'PRODUCT_ID'])
df1.update(df2)
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
print(df1)
USER_ID PRODUCT_ID RATING
0 0 0 10.0
1 1 1 10.0
2 2 2 10.0
3 3 3 10.0
4 4 4 4.0
5 5 5 5.0
6 6 6 6.0
7 7 7 7.0
8 8 8 8.0
9 9 9 9.0
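Note that update casts the updated column to float (10.0 rather than 10). If integer ratings are needed, a cast back afterwards should do it, e.g.:
# restore the integer dtype after the float upcast from update
df1['RATING'] = df1['RATING'].astype(int)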
You can also use combine_first (the astype(object) keeps the integer dtype; the index alignment otherwise introduces NaN for the rows missing from df2 and upcasts the column to float):
df2.astype(object).combine_first(df1)
USER_ID PRODUCT_ID RATING
0 0 0 10
1 1 1 10
2 2 2 10
3 3 3 10
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9

Comparing two dataframes and outputting the index of duplicated rows once

I need help with comparing two dataframes. For example:
The first dataframe is
df_1 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 2 2 2 2 2 2
5 5 5 5 5 5 5
6 1 1 1 1 1 1
7 6 6 6 6 6 6
The second dataframe is
df_2 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 5 5 5 5 5 5
5 6 6 6 6 6 6
May I know if there is a way (without using a for loop) to find the indices of the rows of df_1 that have the same row values as df_2? In the example above, my expected output is below:
index =
0
1
2
3
5
7
The 'index' variable above should have the same number of rows as df_2.
If the same row of df_2 is repeated in df_1 more than once, I only need the index of its first appearance; that's why I don't need indices 4 and 6.
Please help. Thank you so much!
Tommy
Use DataFrame.merge with DataFrame.drop_duplicates and DataFrame.reset_index to convert the index to a column so the index values are not lost, then select the column called index:
s = df_2.merge(df_1.drop_duplicates().reset_index())['index']
print (s)
0 0
1 1
2 2
3 3
4 5
5 7
Name: index, dtype: int64
Detail:
print (df_2.merge(df_1.drop_duplicates().reset_index()))
0 1 2 3 4 5 index
0 1 1 1 1 1 1 0
1 2 2 2 2 2 2 1
2 3 3 3 3 3 3 2
3 4 4 4 4 4 4 3
4 5 5 5 5 5 5 5
5 6 6 6 6 6 6 7
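drop_duplicates keeps only the first occurrence of each duplicated row by default, which is why indices 4 and 6 never reach the merge. A quick way to inspect that intermediate step:
# the repeated rows (original index 4 and 6) are dropped here,
# before the merge ever sees them
print(df_1.drop_duplicates().reset_index())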
Check this solution:
df1 = pd.DataFrame({'0':[1,2,3,4,2,5,1,6],
                    '1':[1,2,3,4,2,5,1,6],
                    '2':[1,2,3,4,2,5,1,6],
                    '3':[1,2,3,4,2,5,1,6],
                    '4':[1,2,3,4,2,5,1,6],
                    '5':[1,2,3,4,2,5,1,6]})
df2 = pd.DataFrame({'0':[1,2,3,4,5,6],
                    '1':[1,2,3,4,5,6],
                    '2':[1,2,3,4,5,6],
                    '3':[1,2,3,4,5,6],
                    '4':[1,2,3,4,5,6],
                    '5':[1,2,3,4,5,6]})
# note: boolean-frame indexing keeps the frame's shape, so this
# returns every row index of df1, not only the first matches
df1[df1.isin(df2)].index.values.tolist()
### Output
[0, 1, 2, 3, 4, 5, 6, 7]

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values so as to make a df like this:
total
1 2
3 2
4 1
5 2
8 2
Is this possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=1)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With a pandas Series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
level_1
0
1 2
3 2
4 1
5 2
8 2
Another alternative is to use stack followed by value_counts, then convert the result to a frame and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
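If named columns are preferred over the default 0/1 labels, a small variation of the same idea (the names 'value' and 'total' are my own choice):
import numpy as np
import pandas as pd

# unique values and how often each occurs across the whole frame
vals, counts = np.unique(df, return_counts=True)
print(pd.DataFrame({'value': vals, 'total': counts}))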