Generating a separate column that stores weighted average per group - pandas

It must not be that hard but I can't cope with this problem.
Imagine I have a long-format dataframe with some data and want to calculate a weighted average of score per person, weighted by manager_weight, and keep it as a separate column - 'w_mean_m'.
df['w_mean_m'] = df.groupby('person')['score'].transform(lambda x: np.average(x['score'], weights=x['manager_weight']))
throws an error and I have no idea how to fix it.

Because GroupBy.transform works with each column separately, it is not possible to select multiple columns inside it. Instead, use GroupBy.apply and map the result back with Series.map to create the new column:
s = (df.groupby('contact')
       .apply(lambda x: np.average(x['score'], weights=x['manager_weight'])))
df['w_mean_m'] = df['contact'].map(s)
Another option is a hack that selects the weights by their (unique) index inside the transform:
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights=df.loc[x.index, "manager_weight"])
df['w_mean_m1'] = df.groupby('contact')['score'].transform(f)
print (df)
    manager_weight  score contact  w_mean_m1
0              1.0      1       a   1.282609
1              1.1      1       a   1.282609
2              1.2      1       a   1.282609
3              1.3      2       a   1.282609
4              1.4      2       b   2.355556
5              1.5      2       b   2.355556
6              1.6      3       b   2.355556
7              1.7      3       c   3.770270
8              1.8      4       c   3.770270
9              1.9      4       c   3.770270
10             2.0      4       c   3.770270
Setup:
df = pd.DataFrame(
    {
        "manager_weight": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0],
        "score": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        "contact": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
    })
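If you prefer to avoid the lambda entirely, the same weighted mean can also be sketched as sum(score * weight) / sum(weight), computed per group with two transform('sum') calls (the w_mean_m2 column name is only for illustration):
weighted = df['score'] * df['manager_weight']
df['w_mean_m2'] = (weighted.groupby(df['contact']).transform('sum')
                   / df.groupby('contact')['manager_weight'].transform('sum'))
This should give the same values as w_mean_m1 above (1.282609 for group a, 2.355556 for b, 3.770270 for c).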

Related

groupby to show same row value from other columns

After groupby by "Mode" column and take out the value from "indicator" of "max, min", how to let the relative value to show in the same dataframe like below:
df = pd.read_csv(r'relative.csv')
Grouped = df.groupby('Mode')['Indicator'].agg(['max', 'min'])
print(Grouped)
(From googling, maybe a col_value or row_value function could be used, but that seems more complicated. Could someone help solve it in an easier way? Thank you.)
You can do it in two steps, using groupby with idxmin() and idxmax():
# Create a df with the min values of 'Indicator', renaming the column 'Value' to 'B'
min = df.loc[df.groupby('Mode')['Indicator'].idxmin()].reset_index(drop=True).rename(columns={'Indicator': 'min', 'Value': 'B'})
print(min)
# Mode min B
# 0 A 1 6
# 1 B 1 7
# Create a df with the max values of 'Indicator', renaming the column 'Value' to 'A'
max = df.loc[df.groupby('Mode')['Indicator'].idxmax()].reset_index(drop=True).rename(columns={'Indicator': 'max', 'Value': 'A'})
print(max)
# Mode max A
# 0 A 3 2
# 1 B 4 3
# Merge the dataframes together
result = pd.merge(min, max)
# reorder the columns to match expected output
print(result[['Mode', 'max','min','A', 'B']])
# Mode max min A B
# 0 A 3 1 2 6
# 1 B 4 1 3 7
The logic is unclear; there is no real reason to call your columns A/B, since the 6/3 values in them do not come from A/B.
I assume you want to achieve:
(df.groupby('Mode')['Indicator'].agg(['idxmax', 'idxmin'])
   .rename(columns={'idxmin': 'min', 'idxmax': 'max'}).stack()
   .to_frame('x').merge(df, left_on='x', right_index=True)
   .drop(columns=['x', 'Mode']).unstack()
)
Output:
     Indicator      Value
           max  min    max  min
Mode
A            3    1      2    6
B            4    1      3    7
C           10   10     20   20
Used input:
  Mode  Indicator  Value
0    A          1      6
1    A          2      5
2    A          3      2
3    B          4      3
4    B          3      6
5    B          2      8
6    B          1      7
7    C         10     20
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
    {
        "Mode": ["A", "A", "A", "B", "B", "B", "B"],
        "Indicator": [1, 2, 3, 4, 3, 2, 1],
        "Value": [6, 5, 2, 3, 6, 8, 7],
    }
)
new_df = df.groupby("Mode")["Indicator"].agg(["max", "min"])
print(new_df)
# Output
max min
Mode
A 3 1
B 4 1
Here is one way to do it with product from Python standard library's itertools module and Pandas at property:
from itertools import product
for row, (col, func) in product(["A", "B"], [("A", "max"), ("B", "min")]):
    new_df.at[row, col] = df.loc[
        (df["Mode"] == row) & (df["Indicator"] == new_df.loc[row, func]), "Value"
    ].values[0]
new_df = new_df.astype(int)
Then:
print(new_df)
# Output
max min A B
Mode
A 3 1 2 6
B 4 1 3 7
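If you would rather avoid the Python-level loop over rows, roughly the same lookup can be sketched with idxmax/idxmin and .loc (this reuses df and new_df from above; the A/B column names are kept only to match the expected output):
idx = df.groupby("Mode")["Indicator"].agg(["idxmax", "idxmin"])
new_df["A"] = df.loc[idx["idxmax"], "Value"].values  # Value on the row with the max Indicator per Mode
new_df["B"] = df.loc[idx["idxmin"], "Value"].values  # Value on the row with the min Indicator per Mode
print(new_df)
#       max  min  A  B
# Mode
# A       3    1  2  6
# B       4    1  3  7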

Pandas how to find consecutive value increase over time on time series data

I have a dataframe with three columns (titled Tool, Value, Date) as below:
Tool  Value  Date
A     52.14  1/1
A     51.5   1/7
A     52     1/10
A     52.9   2/1
B     53.1   1/5
B     51.7   1/10
B     51.9   1/21
B     52.4   1/22
B     53.0   2/1
B     51.5   2/15
I would like to find which tools have increased measured values over three consecutive measurement days.
Tool B's value increased on 1/21, then again on 1/22, and then again on 2/1, so the outcome will be as below:
Tool  Value  Date
B     51.7   1/10
B     51.9   1/21
B     52.4   1/22
B     53.0   2/1
I am wondering how I can define a function in pandas to give the desired result.
Thanks.
Here's an answer, but I bet there is a better one. I make new Series to keep track of whether a value is less than the next one (increasing), of increasing 'runs' as groups (increase_group), and of how many consecutive increases happen in that group (consec_increases). I've kept these as stand-alone Series, but if you add them as columns to your table you can see the reasoning. Getting the 2/1 row added is a bit hacked together, because it's not part of the same increase_group and I can't figure that out in a more clever way than just adding one more index past the group max.
  Tool  Value  Date  increasing  increase_group  consec_increases
0    A  52.14   1/1       False               1                 0
1    A  51.50   1/7        True               2                 2
2    A  52.00  1/10        True               2                 2
3    A  52.90   2/1       False               3                 0
4    B  53.10   1/5       False               4                 0
5    B  51.70  1/10        True               5                 3
6    B  51.90  1/21        True               5                 3
7    B  52.40  1/22        True               5                 3
8    B  53.00   2/1       False               6                 0
9    B  51.50  2/15       False               6                 0
Here's the actual code
df = pd.DataFrame({
    'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Value': [52.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
    'Date': ['1/1', '1/7', '1/10', '2/1', '1/5', '1/10', '1/21', '1/22', '2/1', '2/15'],
})
#Assuming your table is already sorted by Tool then by Date like your example
#Bool Series that is True if the prev Tool value is lt (<) the next (increasing)
increasing = df['Value'].lt(df.groupby('Tool')['Value'].shift(-1))
#df['increasing'] = increasing #uncomment if you want to see what's happening
#Group the rows by increasing 'runs'
increase_group = increasing.groupby(df['Tool'], group_keys=False).apply(lambda v: v.ne(v.shift(1))).cumsum()
#df['increase_group'] = increase_group #uncomment if you want to see what's happening
#Count the number of consecutive increases per group
consec_increases = increasing.groupby(increase_group).transform(lambda v: v.cumsum().max())
#df['consec_increases'] = consec_increases #uncomment if you want to see what's happening
#print(df) #uncomment if you want to see what's happening
#get the row indices of each group w/ 3 or more consec increases
inds_per_group = (
    increase_group[consec_increases >= 3]  # this is where you can set the threshold
    .groupby(increase_group)
    .apply(lambda g: list(g.index) + [max(g.index) + 1])  # get all the row inds AND one more
)
#use the row indices to get the output table
out_df = pd.concat(df.loc[inds].assign(group=i) for i,inds in enumerate(inds_per_group))
With output:
Tool Value Date group
5 B 51.7 1/10 0
6 B 51.9 1/21 0
7 B 52.4 1/22 0
8 B 53.0 2/1 0
The reason I've added a group column is to help you distinguish when you have multiple consecutive runs of 3 or more. For example if your original table had 50.14 instead of 52.14 for the first value:
df = pd.DataFrame({
    'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Value': [50.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
    'Date': ['1/1', '1/7', '1/10', '2/1', '1/5', '1/10', '1/21', '1/22', '2/1', '2/15'],
})
Then the output is
Tool Value Date group
0 A 50.14 1/1 0
1 A 51.50 1/7 0
2 A 52.00 1/10 0
3 A 52.90 2/1 0
5 B 51.70 1/10 1
6 B 51.90 1/21 1
7 B 52.40 1/22 1
8 B 53.00 2/1 1
You can first sort everything by "Tool" and "Date" (with string dates like 1/10 you may want to convert to datetime first so the sort is chronological):
df = df.sort_values(by=["Tool", "Date"])
Then, you can use shift() on "Value" within each Tool group to add another column which contains the "Value" from the previous measurement day:
df["value_minus_1"] = df.groupby("Tool")["Value"].shift()
This gives you a column which contains the previous measurement. Then, you can use shift(2) to get the measurement from two days before:
df["value_minus_2"] = df.groupby("Tool")["Value"].shift(2)
Finally, you can drop the rows that have NaNs and filter the rows:
df = df.dropna()
df = df[df.Value > df.value_minus_1]
df = df[df.value_minus_1 > df.value_minus_2]
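A more compact variant of the same idea (just a sketch, assuming the frame is already sorted by Tool and Date as in the example) builds the comparison masks with grouped shifts; it only answers the "which tools" part of the question:
import pandas as pd

df = pd.DataFrame({
    'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Value': [52.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
    'Date': ['1/1', '1/7', '1/10', '2/1', '1/5', '1/10', '1/21', '1/22', '2/1', '2/15'],
})
# True where a row's Value is greater than the previous row's Value within the same Tool
up = df['Value'] > df.groupby('Tool')['Value'].shift()
# True on rows that end a run of three consecutive increases for their Tool
ends_run = up & up.groupby(df['Tool']).shift(1, fill_value=False) & up.groupby(df['Tool']).shift(2, fill_value=False)
print(df.loc[ends_run, 'Tool'].unique())  # ['B']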

pandas groupby and agg operation of selected columns and row

I have a dataframe as below:
I am not sure if it is possible to use pandas to make an output as below:
difference=Response[df.Time=="pre"]-Response.min for each group
If 'pre' is always the first row per group and the values in the output should be repeated:
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: x.iat[0] - x.min())
If you only want the value on the first row per group, it is possible to replace the repeated values with empty strings, but then the column mixes numeric and string values, so further processing may be a problem:
df['diff'] = df['diff'].mask(df['diff'].duplicated(), '')
EDIT:
df = pd.DataFrame({
    'Response': [2, 5, 0.4, 2, 1, 4],
    'Time': [7, 'pre', 9, 4, 2, 'pre'],
    'IDs': list('aaabbb')
})
#print (df)
d = df[df.Time=="pre"].set_index('IDs')['Response'].to_dict()
print (d)
{'a': 5.0, 'b': 4.0}
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: d[x.name] - x.min())
print (df)
   Response Time IDs  diff
0       2.0    7   a   4.6
1       5.0  pre   a   4.6
2       0.4    9   a   4.6
3       2.0    4   b   3.0
4       1.0    2   b   3.0
5       4.0  pre   b   3.0
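If 'pre' is not guaranteed to be the first or last row per group, the dict can also be replaced by a mapped Series (a sketch; the helper name pre is only for illustration):
pre = df.loc[df['Time'] == 'pre'].set_index('IDs')['Response']
df['diff'] = df['IDs'].map(pre) - df.groupby('IDs')['Response'].transform('min')
This produces the same diff column as above (4.6 for group a, 3.0 for group b).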

Count the number of times a value from one dataframe has repeated in another dataframe

I have 3 dataframes, say A, B and C, with a common column 'comm_col' in all three. I want to create a new column called 'comm_col_occurrences' in B, calculated as follows: for each value in 'comm_col' in dataframe B, check whether the value is present in A. If it is, return the number of times the value occurs in A. If it is not, check whether it is present in C, and if so return the number of times it occurs there. Please tell me how to write a function for this in pandas. Please find below sample code which demonstrates the problem.
import pandas as pd
#Given dataframes
df1 = pd.DataFrame({'comm_col': ['A', 'B', 'B', 'A']})
df2 = pd.DataFrame({'comm_col': ['A', 'B', 'C', 'D', 'E']})
df3 = pd.DataFrame({'comm_col':['A', 'A', 'D', 'E']})
# The value 'A' from df2 occurs in df1 twice. Hence the output is 2.
#Similarly for 'B' the output is 2. 'C' doesn't occur in any of the
#dataframes. Hence the output is 0
# 'D' and 'E' don't occur in df1 but occur in df3 once. Hence
#the output for 'D' and 'E' should be 1
#Output should be as shown below
df2['comm_col_occurrences'] = [2, 2, 0, 1, 1]
Output:
**df1**
comm_col
0 A
1 B
2 B
3 A
**df3**
comm_col
0 A
1 A
2 D
3 E
**df2**
comm_col
0 A
1 B
2 C
3 D
4 E
**Output**
comm_col comm_col_occurrences
0 A 2
1 B 2
2 C 0
3 D 1
4 E 1
Thanks in advance
You need:
import numpy as np

result = pd.DataFrame({
    'df1': df1['comm_col'].value_counts(),
    'df2': df2['comm_col'].value_counts(),
    'df3': df3['comm_col'].value_counts()
})
result['comm_col_occurrences'] = np.nan
# fill from df3 first, then df1, so counts from df1 take priority
result.loc[result['df3'].notnull(), 'comm_col_occurrences'] = result['df3']
result.loc[result['df1'].notnull(), 'comm_col_occurrences'] = result['df1']
result['comm_col_occurrences'] = result['comm_col_occurrences'].fillna(0)
result = result.drop(['df1', 'df2', 'df3'], axis=1).rename_axis('comm_col').reset_index()
Output:
comm_col comm_col_occurrences
0 A 2.0
1 B 2.0
2 C 0.0
3 D 1.0
4 E 1.0
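A shorter variant of the same lookup (a sketch, using Series.map with value_counts and two fillna steps instead of building an intermediate frame):
counts1 = df1['comm_col'].value_counts()
counts3 = df3['comm_col'].value_counts()
df2['comm_col_occurrences'] = (df2['comm_col'].map(counts1)            # counts from df1 first
                               .fillna(df2['comm_col'].map(counts3))   # fall back to df3
                               .fillna(0)                              # 0 when in neither
                               .astype(int))
print(df2)
#   comm_col  comm_col_occurrences
# 0        A                     2
# 1        B                     2
# 2        C                     0
# 3        D                     1
# 4        E                     1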

How to calculate multiple columns from multiple columns in pandas

I am trying to calculate multiple columns from multiple columns in a pandas dataframe using a function.
The function takes three arguments -a-, -b-, and -c- and returns three calculated values -sum-, -prod- and -quot-. In my pandas dataframe I have three columns -a-, -b- and -c- from which I want to calculate the columns -sum-, -prod- and -quot-.
The mapping that I do works only when I have exactly three rows. I do not know what is going wrong, although I expect that it has something to do with selecting the correct axis. Could someone explain what is happening and how I can calculate the values that I would like to have?
Below are the situations that I have tested.
INITIAL VALUES
def sum_prod_quot(a, b, c):
    sum = a + b + c
    prod = a * b * c
    quot = a / b / c
    return (sum, prod, quot)

df = pd.DataFrame({ 'a': [20, 100, 18],
                    'b': [ 5,  10,  3],
                    'c': [ 2,  10,  6],
                    'd': [ 1,   2,  3]
                  })
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
CALCULATION STEPS
Using exactly three rows
When I calculate three columns from this dataframe using the function, I get:
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
This is exactly the result that I want to have: The sum-column has the sum of the elements in the columns a,b,c; the prod-column has the product of the elements in the columns a,b,c and the quot-column has the quotients of the elements in the columns a,b,c.
Using more than three rows
When I expand the dataframe by one row, I get an error!
The data frame is defined as:
df = pd.DataFrame({ 'a': [20, 100, 18, 40],
                    'b': [ 5,  10,  3, 10],
                    'c': [ 2,  10,  6,  4],
                    'd': [ 1,   2,  3,  4]
                  })
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
3 40 10 4 4
The call is
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: too many values to unpack (expected 3)
while I would expect an extra row:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
3 40 10 4 4 54.0 1600.0 1.0
Using less than three rows
When I reduce the dataframe by one row, I also get an error.
The dataframe is defined as:
df = pd.DataFrame({ 'a': [20, 100],
                    'b': [ 5,  10],
                    'c': [ 2,  10],
                    'd': [ 1,   2]
                  })
df
a b c d
0 20 5 2 1
1 100 10 10 2
The call is
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: need more than 2 values to unpack
while I would expect a row less:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
QUESTIONS
The questions I have:
1) Why do I get these errors?
2) How do I have to modify the call such that I get the desired data frame?
NOTE
In this link a similar question is asked, but the given answer did not work for me.
The accepted-looking result doesn't seem correct for 3 rows either. Check the other values besides the first row and first column: looking at the results, the product of 20*5*2 is NOT 120, it's 200, and it ends up one row below in the sum column. You need to form the list in the correct way before assigning to the new columns. You can use the following to set the new columns:
df['sum'], df['prod'], df['quot'] = zip(*map(sum_prod_quot, df['a'], df['b'], df['c']))
For details follow the link
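For reference, here is a small sketch of that fix applied to the four-row frame from the question (the unused d column is dropped for brevity); zip(*...) transposes the list of per-row result tuples into three per-column tuples, which keeps each value aligned with its own row:
import pandas as pd

def sum_prod_quot(a, b, c):
    return a + b + c, a * b * c, a / b / c

df = pd.DataFrame({'a': [20, 100, 18, 40],
                   'b': [ 5,  10,  3, 10],
                   'c': [ 2,  10,  6,  4]})

df['sum'], df['prod'], df['quot'] = zip(*map(sum_prod_quot, df['a'], df['b'], df['c']))
print(df)
#      a   b   c  sum   prod  quot
# 0   20   5   2   27    200   2.0
# 1  100  10  10  120  10000   1.0
# 2   18   3   6   27    324   1.0
# 3   40  10   4   54   1600   1.0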