pandas groupby and agg operation of selected columns and row - pandas

I have a dataframe as below.
I am not sure if it is possible to use pandas to make an output where, for each group:
difference = Response[Time == "pre"] - Response.min()

If "pre" is always the first row per group and the values in the output should be repeated:
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: x.iat[0] - x.min())
If you want to keep only the first value per group, you can replace the repeated values with empty strings, but that produces mixed numeric and string values, so further processing may be a problem:
df['diff'] = df['diff'].mask(df['diff'].duplicated(), '')
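If keeping the column numeric matters, a sketch of an alternative: mask with NaN instead of empty strings, which preserves the float dtype.
df['diff'] = df['diff'].mask(df['diff'].duplicated())  # no replacement given, so NaN is used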
EDIT:
import pandas as pd

df = pd.DataFrame({
    'Response': [2, 5, 0.4, 2, 1, 4],
    'Time': [7, 'pre', 9, 4, 2, 'pre'],
    'IDs': list('aaabbb')
})
#print (df)
# map each ID to its "pre" Response
d = df[df.Time=="pre"].set_index('IDs')['Response'].to_dict()
print (d)
{'a': 5.0, 'b': 4.0}
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: d[x.name] - x.min())
print (df)
   Response Time IDs  diff
0       2.0    7   a   4.6
1       5.0  pre   a   4.6
2       0.4    9   a   4.6
3       2.0    4   b   3.0
4       1.0    2   b   3.0
5       4.0  pre   b   3.0
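A dictionary-free variant of the same idea is also possible. This is only a sketch (not part of the answer above), assuming there is exactly one "pre" row per ID: broadcast the "pre" Response within each group, then subtract the group minimum.
# where() leaves NaN outside the "pre" rows; transform('first') broadcasts
# the single non-NaN value back to every row of the group
pre = df['Response'].where(df['Time'] == 'pre').groupby(df['IDs']).transform('first')
df['diff'] = pre - df.groupby('IDs')['Response'].transform('min')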

Related

Generating a separate column that stores weighted average per group

It must not be that hard but I can't cope with this problem.
Imagine I have a long-format dataframe with some data and want to calculate a weighted average of score per person, weighted by manager_weight, and keep it as a separate variable - 'w_mean_m'.
df['w_mean_m'] = df.groupby('person')['score'].transform(lambda x: np.average(x['score'], weights=x['manager_weight']))
throws an error and I have no idea how to fix it.
Because GroupBy.transform works with each column separately, it is not possible to select multiple columns there, so use GroupBy.apply with Series.map for the new column:
s = (df.groupby('contact')
       .apply(lambda x: np.average(x['score'], weights=x['manager_weight'])))
df['w_mean_m'] = df['contact'].map(s)
One hack is also possible: select the weights by the (unique) index inside the transform:
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights=df.loc[x.index, "manager_weight"])
df['w_mean_m1'] = df.groupby('contact')['score'].transform(f)
print (df)
    manager_weight  score contact  w_mean_m1
0              1.0      1       a   1.282609
1              1.1      1       a   1.282609
2              1.2      1       a   1.282609
3              1.3      2       a   1.282609
4              1.4      2       b   2.355556
5              1.5      2       b   2.355556
6              1.6      3       b   2.355556
7              1.7      3       c   3.770270
8              1.8      4       c   3.770270
9              1.9      4       c   3.770270
10             2.0      4       c   3.770270
Setup:
df = pd.DataFrame({
    "manager_weight": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0],
    "score": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    "contact": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
})
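For completeness, the same weighted mean can be computed without apply at all. A minimal sketch (the column name w_mean_m2 is illustrative): precompute score times weight, then divide the two group sums.
# sum(score * weight) / sum(weight) per contact, broadcast back to each row
sw = df['score'] * df['manager_weight']
df['w_mean_m2'] = (sw.groupby(df['contact']).transform('sum')
                   / df.groupby('contact')['manager_weight'].transform('sum'))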

Pandas: How to perform the same calculation on multiple columns, which come from a passed-in list?

Given the following code, I'd like to calculate three additional columns called "Expected Sales {letter}" that multiply each of the columns A, B, and C by the Price column. To do this, I would like to pass in a list of column names so that only columns with 'Group' in the name get multiplied (meaning that Other is left untouched).
data = [(3,5,7), (2,4,6), (5,8,9.2)]
df = pd.DataFrame(data, columns = ['Group A','Group B','Group C'])
df['Fruit'] = ('apples', 'bananas', 'pears')
df['Price'] = (1, 0.5, 2)
df['Other'] = ("blah1", "blah2", "blah3")
Use filter to select your columns, then use rename to replace the "Group" prefix with "Expected Sales":
df1 = (df.filter(like='Group').mul(df['Price'], axis=0)
         .rename(columns=lambda x: x.replace('Group', 'Expected Sales')))
out = pd.concat([df, df1], axis=1)
Output:
>>> out
Group A Group B Group C Fruit Price Other Expected Sales A Expected Sales B Expected Sales C
0 3 5 7.0 apples 1.0 blah1 3.0 5.0 7.0
1 2 4 6.0 bananas 0.5 blah2 1.0 2.0 3.0
2 5 8 9.2 pears 2.0 blah3 10.0 16.0 18.4
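If you really do want to pass an explicit list of column names, as the question suggests, a plain loop is a reasonable sketch (cols is an illustrative name):
cols = [c for c in df.columns if 'Group' in c]  # or pass the list in directly
for c in cols:
    df[c.replace('Group', 'Expected Sales')] = df[c] * df['Price']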

Quickly replace values in a Pandas DataFrame

I have the following dataframe:
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
}, index=['1', '2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
# A B Sum
# 1 1 3 4
# 2 2 4 6
# Sum 3 7 10
I want to:
replace 1 by 3*4/10
replace 2 by 3*6/10
replace 3 by 4*7/10
replace 4 by 7*6/10
What is the easiest way to do this? I want the solution to extend to any number of rows and columns. Been cracking my head over this. TIA!
If I understood you correctly:
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
}, index=['1', '2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
conditions = [(df==1), (df==2), (df==3), (df==4)]
values = [(3*4)/10, (3*6)/10, (4*7)/10, (7*6)/10]
df[df.columns] = np.select(conditions, values, df)
Output:
       A    B   Sum
1    1.2  2.8   4.2
2    1.8  4.2   6.0
Sum  2.8  7.0  10.0
Let us try creating it from the original df, before you compute the sums and assign:
import numpy as np
v = np.multiply.outer(df.sum(1).values,df.sum().values)/df.sum().sum()
out = pd.DataFrame(v,index=df.index,columns=df.columns)
out
Out[20]:
     A    B
1  1.2  2.8
2  1.8  4.2
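Both answers implement the same formula: each cell becomes (row total * column total) / grand total. In case np.multiply.outer feels opaque, here is an equivalent broadcasting sketch on the original df (before the Sum row and column are added):
row = df.sum(axis=1).to_numpy()[:, None]  # row totals as a column vector
col = df.sum(axis=0).to_numpy()           # column totals as a row vector
out = pd.DataFrame(row * col / df.to_numpy().sum(),
                   index=df.index, columns=df.columns)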

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and lost in how to approach this problem: I have a dataframe where the information I need is mostly grouped in layers of 2, 3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows and filter for each ID, unstacking the rows into a new dataframe. I cannot obtain much from the unstack or groupby functions. Is there an easy function or combination of functions that can do this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe has the information in up to 4 rows per group, but not always. The final dataframe should have only one row per ID occurrence, with all the info in the columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value, GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power for handling the data. In addition, these open() calls have trouble with the non-ASCII characters, which I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With unequal number of rows per line
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io

# read df (the columns are separated by 2+ spaces, hence sep=r"\s{2,}")
df = pd.read_csv(io.StringIO("""
ID  col1  col2
A   1     2
A   3     4
B   5     nan
B   7     8
B   9     10
B   nan   12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
    col1_1  col2_1  col1_2  col2_2  col1_3  col2_3  col1_4  col2_4
ID
A      1.0     2.0     3.0     4.0     NaN     NaN     NaN     NaN
B      5.0     NaN     7.0     8.0     9.0    10.0     NaN    12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
      col1  col2
ID
A 0    1.0   2.0
  1    3.0   4.0
B 0    5.0   NaN
  1    7.0   8.0
  2    9.0  10.0
  3    NaN  12.0
where the multi-index is perfect for df.unstack().
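Applied to the semicolon-separated sample at the top of this question, the same recipe would look roughly like this (header=None and the latin-1 encoding are assumptions, the latter aimed at the non-ASCII issue mentioned):
df = pd.read_csv('input.csv', sep=';', header=None, encoding='latin-1')
g = df.groupby(df[0]).cumcount()
wide = df.set_index([0, g]).unstack().sort_index(level=1, axis=1)
wide.columns = [f'{a}_{b+1}' for a, b in wide.columns]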

Removing Rows in a Pandas Dataframe with Identical (and adjacent) entries in a specific column

I have a dataframe where I have some duplicates in the "Item" column.
I want to remove the rows with adjacent duplicates but retain the last one, i.e. get rid of the red but keep the green.
I then want to create a new column where apples is assumed to be a start, and each following row is a time delta from it, i.e.:
IIUC, try:
df_out = df.assign(Item_cnt=(df['Item'] != df['Item'].shift()).cumsum())\
           .drop_duplicates(['Item','Item_cnt'], keep='last')
df_out['delta T'] = df_out['dateTime'] - df_out.groupby((df_out['Item'] == 'apples').cumsum())['dateTime'].transform('first')
Output:
      Item  dateTime  Item_cnt  delta T
2   apples       1.2         1      0.0
3  oranges       2.3         2      1.1
4   apples       2.5         3      0.0
5  bananas       2.7         4      0.2
Details:
Create a grouping key with cumsum by checking whether each line differs from the previous one, then use drop_duplicates keeping the last record in each group.
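To see what that grouping key looks like, here is a minimal sketch using the df defined at the end of this thread:
df = pd.DataFrame({'Item': ['apples', 'apples', 'apples', 'oranges', 'apples', 'bananas'],
                   'dateTime': [1, 1.1, 1.2, 2.3, 2.5, 2.7]})
# each run of identical adjacent Items gets its own counter value
print((df['Item'] != df['Item'].shift()).cumsum().tolist())
# [1, 1, 1, 2, 3, 4]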
IIUC,
df = pd.DataFrame({'Item': ['apples', 'apples', 'apples', 'orange', 'apples', 'bananas'],
                   'dateTime': [1, 1.1, 1.2, 2.3, 2.5, 2.7]})
s = df.copy()
s['dateTime'] = s['dateTime'].round()
idx = s.drop_duplicates(subset=['Item','dateTime'],keep='last').index.tolist()
df = df.loc[idx]
df.loc[df['Item'].ne('apples'), 'delta'] = abs(df['dateTime'].shift() - df['dateTime'])
print(df.fillna(0))
      Item  dateTime  delta
2   apples       1.2    0.0
3   orange       2.3    1.1
4   apples       2.5    0.0
5  bananas       2.7    0.2
Here is the df:
df = pd.DataFrame.from_dict({'Item': ['apples', 'apples', 'apples', 'oranges', 'apples', 'bananas'],
                             'dateTime': [1, 1.1, 1.2, 2.3, 2.5, 2.7]})
You can't use duplicated because you need to keep multiple copies of the same item, so try this:
df['Item_lag'] = df['Item'].shift(-1)
df = df[df['Item'] != df['Item_lag']] # get rid of repeated Items
df['deltaT'] = df['dateTime'] - df['dateTime'].shift(1).fillna(0) # calculate time diff
df.drop(['dateTime', 'Item_lag'], axis=1, inplace=True) # drop extra columns
df # display df
out:
      Item  deltaT
2   apples     1.2
3  oranges     1.1
4   apples     0.2
5  bananas     0.2