Pandas: divide column by index mean

I have a pandas dataframe with a two-level index, and I want to divide each value by the column average over the second index level (A, B).
For example, the input df:
     col1  col2
0 A     1    20
1 A     2    10
2 A     1    10
4 A     4     5
5 B     6    15
6 B     2    50
So for col1, I will divide the A rows by 2, because the average of 1, 2, 1, 4 is 2.
     col1
0 A   0.5
1 A   1.0
2 A   0.5
4 A   2.0
5 B   1.5
6 B   0.5
Can anyone see a good way of doing this?

IIUC, try:
df.groupby(level=1)['col1'].apply(lambda x: x/x.mean())
Better, without apply:
df.col1/df.groupby(level=1)['col1'].transform('mean')
Output
0 A 0.5
1 A 1.0
2 A 0.5
4 A 2.0
5 B 1.5
6 B 0.5
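For reference, a self-contained sketch of the transform approach (the exact MultiIndex construction below is an assumption based on the sample frame):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(0, 'A'), (1, 'A'), (2, 'A'), (4, 'A'), (5, 'B'), (6, 'B')])
df = pd.DataFrame({'col1': [1, 2, 1, 4, 6, 2],
                   'col2': [20, 10, 10, 5, 15, 50]}, index=idx)

# divide each value by the mean of its level-1 (A/B) group
out = df['col1'] / df.groupby(level=1)['col1'].transform('mean')
print(out)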

Related

Using groupby() and cut() in pandas

I have a dataframe, and for each group I want to label the values: if a value is less than the group mean the label is 1, and if it is more than the group mean the label is 2.
The input dataframe is:
groups num1
0 a 2
1 a 5
2 a NaN
3 b 10
4 b 4
5 b 0
6 b 7
7 c 2
8 c 4
9 c 1
Here the mean values for groups a, b, c are 3.5, 5.25 and 2.33 respectively, and the output dataframe is:
groups out
0 a 1
1 a 2
2 a NaN
3 b 2
4 b 1
5 b 1
6 b 2
7 c 1
8 c 2
9 c 1
I want to use pandas.cut, and maybe pandas.groupby and pandas.apply as well.
Also, how can I skip null values here?
Thanks in advance
cut is not really pertinent here. Use groupby.transform('mean') and numpy.where:
df['out'] = np.where(df['num1'].lt(df.groupby('groups')['num1'].transform('mean')), 1, 2)
Output (as new column "out" for clarity):
groups num1 out
0 a 2 1
1 a 5 2
2 a 7 2
3 b 10 2
4 b 4 1
5 b 0 1
6 b 7 2
7 c 2 1
8 c 4 2
9 c 1 1
I really want cut
OK, but it's neither nice nor performant:
(df.groupby('groups')['num1']
   .transform(lambda g: pd.cut(g, [-np.inf, g.mean(), np.inf], labels=[1, 2])))
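On skipping nulls: np.where labels NaN rows as 2, because NaN < mean evaluates to False. A sketch that restores the NaNs afterwards (the frame construction assumes the values from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': list('aaabbbbccc'),
                   'num1': [2, 5, np.nan, 10, 4, 0, 7, 2, 4, 1]})

# group means skip NaN: a -> 3.5, b -> 5.25, c -> 2.33
group_mean = df.groupby('groups')['num1'].transform('mean')
out = np.where(df['num1'].lt(group_mean), 1, 2)
# NaN < mean is False, so NaN rows were labelled 2; mask them back to NaN
df['out'] = pd.Series(out, index=df.index).mask(df['num1'].isna())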

Pandas: Get rolling mean with a add operation in between

My Pandas df is like:
ID  delta  price
 1   -2.0    4.0
 2    2.0    5.0
 3   -3.0    3.0
 4    0.8    NaN
 5    0.9    NaN
 6   -2.3    NaN
 7    2.8    NaN
 8    1.0    NaN
 9    1.0    NaN
10    1.0    NaN
11    1.0    NaN
12    1.0    NaN
Pandas already has a robust built-in rolling-mean method; I need to use it slightly differently.
So, in my df, the price at row 4 would be the sum of (a) the rolling mean of the price in rows 1, 2, 3 and (b) the delta at row 4.
Once this is computed, I would move to row 5: (a) the rolling mean of the price in rows 2, 3, 4 plus (b) the delta at row 5 gives the price at row 5, and so on.
I can iterate over the rows to get this, but my actual dataframe is quite big and iterating over rows would slow things down. Any better way to achieve this?
I do not think pandas has a method that can use the previously calculated value in the next calculation, so loop over just the missing rows:
n = 3
for x in df.index[df.price.isna()]:
    # .loc slicing is label-inclusive; the price at x is still NaN, so the sum
    # covers the three preceding prices before the current delta is folded in
    df.loc[x, 'price'] = (df.loc[x - n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
df
Out[150]:
ID delta price
0 1 -2.0 4.000000
1 2 2.0 5.000000
2 3 -3.0 3.000000
3 4 0.8 3.200000
4 5 0.9 3.025000
5 6 -2.3 1.731250
6 7 2.8 2.689062
7 8 1.0 2.111328
8 9 1.0 1.882910
9 10 1.0 1.920825
10 11 1.0 1.728766
11 12 1.0 1.633125
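For completeness, the frame above can be reconstructed like this (a sketch; NaN marks the prices to be filled):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': range(1, 13),
                   'delta': [-2, 2, -3, 0.8, 0.9, -2.3, 2.8, 1, 1, 1, 1, 1],
                   'price': [4, 5, 3] + [np.nan] * 9})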

Complete an incomplete dataframe in pandas

Good morning.
I have a dataframe that can be both like this:
df1 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
and like this:
df2 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
The only difference between the two is that one, or several but not all, zones may have data for the highest time period (column date). My desired result is to complete the dataframe up to a given period (3 in the example), in the following way for each case:
df1_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
7 B 3 6809 20
8 C 3 288 5
df2_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 1280 3
7 B 3 6809 20
8 C 3 288 5
I've tried different combinations of pivot and fillna with different methods, but I can't achieve the result above.
I hope my explanation is clear.
Many thanks in advance.
You can use reindex to create entries for all dates in the range, and then forward-fill the last value into them.
import pandas as pd

df1 = pd.DataFrame([['A', 1, 154, 2],
                    ['B', 1, 2647, 7],
                    ['C', 1, 0, 0],
                    ['A', 2, 1280, 3],
                    ['B', 2, 6809, 20],
                    ['C', 2, 288, 5],
                    ['A', 3, 2000, 4]],
                   columns=['zone', 'date', 'p1', 'p2'])

result = df1.groupby("zone").apply(
    lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
print(result)
To get
zone p1 p2
zone date
A 1 A 154 2
2 A 1280 3
3 A 2000 4
B 1 B 2647 7
2 B 6809 20
3 B 6809 20
C 1 C 0 0
2 C 288 5
3 C 288 5
IIUC, you can reconstruct a pd.MultiIndex from your original df and use fillna with the max of each zone subgroup.
First, build your index:
import numpy as np

ind = df1.set_index(['zone', 'date']).index
levels = ind.levels
n = len(levels[0])
labels = [np.tile(np.arange(n), n), np.repeat(np.arange(0, n), n)]
Then, use the pd.MultiIndex constructor to reindex:
# note: in pandas >= 0.24 the MultiIndex `labels` argument is named `codes`
df1.set_index(['zone', 'date'])\
   .reindex(pd.MultiIndex(levels=levels, labels=labels))\
   .fillna(df1.groupby(['zone']).max())
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
To fill df2, just change df1 to df2 in that last chain and you get:
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 1280.0 3.0
B 3 6809.0 20.0
C 3 288.0 5.0
I suggest not copying and pasting the code to run it directly, but rather trying to understand the process and making slight changes as needed, depending on how different your original dataframe is from what you posted.
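On recent pandas, a more compact route is to build the full index with MultiIndex.from_product and forward-fill within each zone. A sketch, assuming the target range 1-3 is known up front:
import pandas as pd

df2 = pd.DataFrame({'zone': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'date': [1, 1, 1, 2, 2, 2],
                    'p1': [154, 2647, 0, 1280, 6809, 288],
                    'p2': [2, 7, 0, 3, 20, 5]})

full_idx = pd.MultiIndex.from_product([df2['zone'].unique(), range(1, 4)],
                                      names=['zone', 'date'])
result = (df2.set_index(['zone', 'date'])
             .reindex(full_idx)
             .groupby(level='zone')   # forward-fill within each zone only
             .ffill()
             .reset_index())
For df2 this yields A 3 -> 1280, 3, matching df2_result above.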

Group data using pandas: how do I keep the order of the groups and do math on rows within two of the columns?

df:
Time Name X Y
0 00 AA 0 0
1 30 BB 1 1
2 45 CC 2 2
3 60 GG:AB 3 3
4 90 GG:AC 4 4
5 120 AA 5 3
dataGroup = df.groupby([pd.Grouper(key='Time', freq='30s'), 'Name']).sort_values(by=['Timestamp'], ascending=True)
I have tried doing a diff() on the rows, but it returns NaN or something unexpected.
df.groupby('Name', sort=False)['X'].diff()
How do I keep the groupings and the time sort, and take the diff between each row and its previous row (for both the X and the Y columns)?
Expected output (for group AA):
XDiff row 1 = (X row 1 - known origin)
XDiff row 2 = (X row 2 - X row 1)
Time Name X Y XDiff YDiff
0 00 AA 0 0 0 0
5 120 AA 5 3 5 3
1 30 BB 1 1 0 0
6 55 BB 2 3 1 2
2 45 CC 2 2 0 0
3 60 GG:AB 3 3 0 0
4 90 GG:AC 4 4 0 0
It would be nice to see the total distance for each group (i.e., AA is 5, BB is 1).
In my example I only have a couple of rows per group, but what if there were 100 of them? The diff would give me the distance between any two consecutive rows, but not the total distance for the group.
Ripping off https://stackoverflow.com/a/20664760/6672746, you can use a lambda function to calculate the difference between rows for X and Y. I also included two lines to set the index (after the groupby) and sort it.
df['x_diff'] = df.groupby(['Name'])['X'].transform(lambda x: x.diff()).fillna(0)
df['y_diff'] = df.groupby(['Name'])['Y'].transform(lambda x: x.diff()).fillna(0)
df.set_index(["Name", "Time"], inplace=True)
df.sort_index(level=["Name", "Time"], inplace=True)
Output:
X Y x_diff y_diff
Name Time
AA 0 0 0 0.0 0.0
120 5 3 5.0 3.0
BB 30 1 1 0.0 0.0
CC 45 2 2 0.0 0.0
GG:AB 60 3 3 0.0 0.0
GG:AC 90 4 4 0.0 0.0
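For the total distance per group mentioned at the end, a sketch assuming "total distance" means the sum of the per-row diffs (which telescopes to last minus first); df here already has the (Name, Time) index set above, so group by the index level:
totals = df.groupby(level='Name')[['x_diff', 'y_diff']].sum()
# e.g. AA: x_diff total 5.0, matching the expected "AA is 5"
# (BB has a single row in this six-row sample, so its total is 0.0)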

pandas column operation on certain row in succession

I have a pandas dataframe like this:
second block
0 1 a
1 2 b
2 3 c
3 4 a
4 5 c
This is sequential data, and I would like to get a new column with the time difference between the current block and the next time that block repeats.
second block freq
0 1 a 3 //(4-1)
1 2 b 0 //(not repeating)
2 3 c 2 //(5-3)
3 4 a 0 //(not repeating)
4 5 c 0 //(not repeating)
I have tried getting the unique list of blocks, then a for loop that does the following:
for i in unique_block:
    df['freq'] = df['timestamp'].shift(-1) - df['timestamp']
I do not know how to get 0 for row indexes 1, 3 and 4, and since the dataframe is quite big, this is not efficient. It is not working.
Thanks.
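For reference, a minimal reproduction of the frame (the answers below assume this layout):
import pandas as pd

df = pd.DataFrame({'second': [1, 2, 3, 4, 5],
                   'block': ['a', 'b', 'c', 'a', 'c']})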
Use groupby + diff(periods=-1). Multiply by -1 to get your difference convention and fillna with 0.
df['freq'] = (df.groupby('block')['second'].diff(-1) * -1).fillna(0)
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
You can use shift and transform in your groupby:
df['freq'] = df.groupby('block').second.transform(lambda x: x.shift(-1) - x).fillna(0)
>>> df
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
Using
df.groupby('block').second.apply(lambda x : x.diff().shift(-1)).fillna(0)
Out[242]:
0    3.0
1    0.0
2    2.0
3    0.0
4    0.0
Name: second, dtype: float64