Difference between previous row and next row in pandas gives NaN for first value

I am new to Pandas and I have a dataframe df as given below:
A B
0 4 5
1 5 8
2 6 11
3 7 13
4 8 15
5 9 30
6 10 477
7 11 3643
8 12 33469
9 13 141409
10 14 335338
11 15 365115
I want to get the difference between each row and the previous row for column B.
I used df.set_index('A').diff(), but it gives NaN for the first row. How do I get 5 there?
A B
4 NaN
5 3.0
6 3.0
7 2.0
8 2.0
9 15.0
10 447.0
11 3166.0
12 29826.0
13 107940.0
14 193929.0
15 29777.0

Let us do
df.B.diff().fillna(df.B)
0 5.0
1 3.0
2 3.0
3 2.0
4 2.0
5 15.0
6 447.0
7 3166.0
8 29826.0
9 107940.0
10 193929.0
11 29777.0
Name: B, dtype: float64
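For reference, a self-contained sketch of the same idea (it just rebuilds the frame from the values in the question):
import pandas as pd

df = pd.DataFrame({
    'A': range(4, 16),
    'B': [5, 8, 11, 13, 15, 30, 477, 3643, 33469, 141409, 335338, 365115],
})

# diff() leaves NaN in the first row; fillna(df.B) replaces that NaN
# with the original first value of B (here, 5)
out = df.B.diff().fillna(df.B)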

How to groupby a dataframe with two level header and generate box plot?

Now I have a dataframe like below (original dataframe):
Equipment A B C
1 10 10 10
1 11 11 11
2 12 12 12
2 13 13 13
3 14 14 14
3 15 15 15
And I want to transform the dataframe like below (transformed dataframe):
1 - - 2 - - 3 - -
A B C A B C A B C
10 10 10 12 12 12 14 14 14
11 11 11 13 13 13 15 15 15
How can I make such a groupby transformation with a two-level header in Pandas?
Additionally, I want to use the transformed dataframe to generate a box plot, where the whole figure is divided into three parts (i.e. 1, 2, 3) and each part contains three box plots (i.e. A, B, C). Can I use the transformed dataframe shown above without any further processing, or can the box plotting be done from the original dataframe alone?
Thank you so much.
Try:
g = df.groupby('Equipment')[df.columns[1:]].apply(lambda x: x.reset_index(drop=True).T)
g:
Equipment 1 2 3
A B C A B C A B C
0 10 10 10 12 12 12 14 14 14
1 11 11 11 13 13 13 15 15 15
Explanation:
grp = df.groupby('Equipment')[df.columns[1:]]
grp.apply(print)
A B C
0 10 10 10
1 11 11 11
A B C
2 12 12 12
3 13 13 13
A B C
4 14 14 14
5 15 15 15
You can see the indices 0 1, 2 3, and 4 5 for the equipment groups (1, 2, 3). That's why I used reset_index: to make them 0 1 for each group. Why?
If you do it without reset_index:
df.groupby('Equipment')[df.columns[1:]].apply(lambda x: x.T)
0 1 2 3 4 5
Equipment
1 A 10.0 11.0 NaN NaN NaN NaN
B 10.0 11.0 NaN NaN NaN NaN
C 10.0 11.0 NaN NaN NaN NaN
2 A NaN NaN 12.0 13.0 NaN NaN
B NaN NaN 12.0 13.0 NaN NaN
C NaN NaN 12.0 13.0 NaN NaN
3 A NaN NaN NaN NaN 14.0 15.0
B NaN NaN NaN NaN 14.0 15.0
C NaN NaN NaN NaN 14.0 15.0
See the values in columns (2, 3) and (4, 5). I want to combine them into columns (0, 1) only. That's why I reset the index with drop=True:
0 1
Equipment
1 A 10 11
B 10 11
C 10 11
2 A 12 13
B 12 13
C 12 13
3 A 14 15
B 14 15
C 14 15
You can play with the code to understand more deeply what's happening inside.
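For the box-plot part of the question, a hedged sketch (not from the original answer; it assumes matplotlib is installed): pandas can draw the grouped layout straight from the original dataframe, so the two-level transformation is not strictly required for plotting:
import matplotlib.pyplot as plt

# one panel per Equipment value (1, 2, 3), each with boxes for A, B, C
df.groupby('Equipment').boxplot()
plt.show()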

Backfill and Increment by one?

I have a column of a DataFrame that consists of 0's and NaN's:
Timestamp A B C
1 3 3 NaN
2 5 2 NaN
3 9 1 NaN
4 2 6 NaN
5 3 3 0
6 5 2 NaN
7 3 1 NaN
8 2 8 NaN
9 1 6 0
And I want to backfill it so that the values count up from each 0, increasing by one per row going upward:
Timestamp A B C
1 3 3 4
2 5 2 3
3 9 1 2
4 2 6 1
5 3 3 0
6 5 2 3
7 3 1 2
8 2 8 1
9 1 6 0
You can use iloc[::-1] to reverse the data, and groupby().cumcount() to create the row counter:
# bottom-up scan: True marks the non-null anchors (the 0's)
s = df['C'].iloc[::-1].notnull()
# bfill copies each 0 upward; cumcount adds the distance back to that 0
df['C'] = df['C'].bfill() + s.groupby(s.cumsum()).cumcount()
Output
Timestamp A B C
0 1 3 3 4.0
1 2 5 2 3.0
2 3 9 1 2.0
3 4 2 6 1.0
4 5 3 3 0.0
5 6 5 2 3.0
6 7 3 1 2.0
7 8 2 8 1.0
8 9 1 6 0.0
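For experimenting, a self-contained version of the above (a sketch; it just rebuilds the sample frame from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': range(1, 10),
                   'A': [3, 5, 9, 2, 3, 5, 3, 2, 1],
                   'B': [3, 2, 1, 6, 3, 2, 1, 8, 6],
                   'C': [np.nan, np.nan, np.nan, np.nan, 0,
                         np.nan, np.nan, np.nan, 0]})

# bottom-up: bfill copies each 0 upward, cumcount adds the distance to it
s = df['C'].iloc[::-1].notnull()
df['C'] = df['C'].bfill() + s.groupby(s.cumsum()).cumcount()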

generate lines with all unique values of given column for each group

import pandas as pd

df = pd.DataFrame({'timePoint': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'item': [1, 2, 3, 4, 3, 4, 5, 6, 1, 3, 7, 2],
                   'value': [2, 4, 7, 6, 5, 9, 3, 2, 4, 3, 1, 5]})
>>> df
item timePoint value
0 1 1 2
1 2 1 4
2 3 1 7
3 4 1 6
4 3 2 5
5 4 2 9
6 5 2 3
7 6 2 2
8 1 3 4
9 3 3 3
10 7 3 1
11 2 3 5
In this df, not every item appears at every timePoint. I want to have all unique items at every timePoint, and these newly inserted items should either have:
(i) a NaN value if they have not appeared at a previous timePoint, or
(ii) if they have, they get their most recent value.
The desired output should look like the following (lines with hashtag are those inserted).
>>> dfx
item timePoint value
0 1 1 2.0
3 1 2 2.0 #
8 1 3 4.0
1 2 1 4.0
4 2 2 4.0 #
11 2 3 5.0
2 3 1 7.0
4 3 2 5.0
9 3 3 3.0
3 4 1 6.0
5 4 2 9.0
6 4 3 9.0 #
0 5 1 NaN #
6 5 2 3.0
7 5 3 3.0 #
1 6 1 NaN #
7 6 2 2.0
8 6 3 2.0 #
2 7 1 NaN #
5 7 2 NaN #
10 7 3 1.0
For example, item 1 gets a 2.0 at timePoint 2 because that's what it had at timePoint 1, whereas item 6 gets a NaN at timePoint 1 because there is no preceding value.
Now, I know that if I manage to insert all lines of every unique item missing in each timePoint group, i.e. reach this point:
>>> dfx
item timePoint value
0 1 1 2.0
1 2 1 4.0
2 3 1 7.0
3 4 1 6.0
4 3 2 5.0
5 4 2 9.0
6 5 2 3.0
7 6 2 2.0
8 1 3 4.0
9 3 3 3.0
10 7 3 1.0
11 2 3 5.0
0 5 1 NaN
1 6 1 NaN
2 7 1 NaN
3 1 2 NaN
4 2 2 NaN
5 7 2 NaN
6 4 3 NaN
7 5 3 NaN
8 6 3 NaN
Then I can do:
dfx.sort_values(by=['item', 'timePoint'],
                inplace=True,
                ascending=[True, True])
dfx['value'] = dfx.groupby('item')['value'].fillna(method='ffill')
which will return the desired output.
But how do I add, as new lines, all the df.item.unique() items that are missing from each timePoint group?
Also, if you have a more efficient solution from scratch to suggest, then by all means please be my guest.
Using pd.MultiIndex.from_product on the index levels, then reindex:
d = df.set_index(['item', 'timePoint'])
d.reindex(
    pd.MultiIndex.from_product(d.index.levels, names=d.index.names)
).groupby(level='item').ffill().reset_index()
item timePoint value
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0
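To see what the reindex target is, here is the item x timePoint grid that from_product builds from the two index levels (values from the question's data):
pd.MultiIndex.from_product(d.index.levels, names=d.index.names)
# MultiIndex([(1, 1), (1, 2), (1, 3),
#             (2, 1), (2, 2), (2, 3),
#             ...
#             (7, 1), (7, 2), (7, 3)],
#            names=['item', 'timePoint'])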
I think unstack followed by stack (with dropna=False) will achieve the format; then we use groupby with ffill to fill the NaN values forward:
s = df.set_index(['item', 'timePoint']).value.unstack().stack(dropna=False)
s.groupby(level=0).ffill().reset_index()
Out[508]:
item timePoint 0
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0
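One optional tweak (my addition, not part of the original answer): the stacked Series is unnamed, which is why reset_index() labels the value column 0. Naming the Series first restores the original header:
s.rename('value').groupby(level=0).ffill().reset_index()
# columns: item, timePoint, value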

Rolling grouped cumulative sum

I'm looking to create a rolling grouped cumulative sum. I can get the result via iteration, but wanted to see if there was a more intelligent way.
Here's what the source data looks like:
Per C V
1 c 3
1 a 4
1 c 1
2 a 6
2 b 5
3 j 7
4 x 6
4 x 5
4 a 9
5 a 2
6 c 3
6 k 6
Here is the desired result:
Per C V
1 c 4
1 a 4
2 c 4
2 a 10
2 b 5
3 c 4
3 a 10
3 b 5
3 j 7
4 c 4
4 a 19
4 b 5
4 j 7
4 x 11
5 c 4
5 a 21
5 b 5
5 j 7
5 x 11
6 c 7
6 a 21
6 b 5
6 j 7
6 x 11
6 k 6
This is a very interesting problem. The idea below: for every period i, take all rows with Per <= i, relabel them as period i, and then group-sum. Try it to see if it works for you.
(
    pd.concat([df.loc[df.Per <= i][['C', 'V']].assign(Per=i) for i in df.Per.unique()])
    .groupby(by=['Per', 'C'])
    .sum()
    .reset_index()
)
Out[197]:
Per C V
0 1 a 4
1 1 c 4
2 2 a 10
3 2 b 5
4 2 c 4
5 3 a 10
6 3 b 5
7 3 c 4
8 3 j 7
9 4 a 19
10 4 b 5
11 4 c 4
12 4 j 7
13 4 x 11
14 5 a 21
15 5 b 5
16 5 c 4
17 5 j 7
18 5 x 11
19 6 a 21
20 6 b 5
21 6 c 7
22 6 j 7
23 6 k 6
24 6 x 11
If you set the index to be 'Per' and 'C', you can first sum over those index levels. Then I reindex the resulting series by the product of the index levels in order to get all possibilities, filling the new indices with zero.
After this, I use groupby, cumsum, and remove the zeros.
s = df.set_index(['Per', 'C']).V.sum(level=[0, 1])
s.reindex(
    pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
    fill_value=0
).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()
Per C V
0 1 a 4
1 1 c 4
2 2 a 10
3 2 b 5
4 2 c 4
5 3 a 10
6 3 b 5
7 3 c 4
8 3 j 7
9 4 a 19
10 4 b 5
11 4 c 4
12 4 j 7
13 4 x 11
14 5 a 21
15 5 b 5
16 5 c 4
17 5 j 7
18 5 x 11
19 6 a 21
20 6 b 5
21 6 c 7
22 6 j 7
23 6 k 6
24 6 x 11
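A note for newer pandas (an aside; the original answer predates this): Series.sum(level=...) was deprecated and later removed, so the first line is better written as a groupby, which is exactly the using_reindex2 variant benchmarked at the end of this section:
s = df.groupby(['Per', 'C'])['V'].sum()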
You could use pivot_table/cumsum:
(df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
   .fillna(0)
   .cumsum(axis=0)
   .replace(0, np.nan)
   .stack().reset_index())
yields
Per C 0
0 1 a 4.0
1 1 c 4.0
2 2 a 10.0
3 2 b 5.0
4 2 c 4.0
5 3 a 10.0
6 3 b 5.0
7 3 c 4.0
8 3 j 7.0
9 4 a 19.0
10 4 b 5.0
11 4 c 4.0
12 4 j 7.0
13 4 x 11.0
14 5 a 21.0
15 5 b 5.0
16 5 c 4.0
17 5 j 7.0
18 5 x 11.0
19 6 a 21.0
20 6 b 5.0
21 6 c 7.0
22 6 j 7.0
23 6 k 6.0
24 6 x 11.0
On the plus side, I think the pivot_table/cumsum approach helps convey the meaning of the calculation pretty well. Given the pivot_table, the calculation is essentially a cumulative sum down each column:
In [131]: df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
Out[131]:
C a b c j k x
Per
1 4.0 NaN 4.0 NaN NaN NaN
2 6.0 5.0 NaN NaN NaN NaN
3 NaN NaN NaN 7.0 NaN NaN
4 9.0 NaN NaN NaN NaN 11.0
5 2.0 NaN NaN NaN NaN NaN
6 NaN NaN 3.0 NaN 6.0 NaN
On the negative side, the need to fuss with 0's and NaNs is not ideal: we need 0's for the cumsum, but we need NaNs to make the unwanted rows disappear when the DataFrame is stacked.
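As an aside (my sketch, not from the original answers): if V could legitimately contain zeros, replace(0, np.nan) would drop real rows. Masking on "has this item appeared yet" sidesteps that:
p = df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
(p.fillna(0).cumsum()
  .where(p.notna().cummax())  # keep a cell only from the item's first appearance on
  .stack().reset_index())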
The pivot_table/cumsum approach also offers a considerable speed advantage over using_concat, but piRSquared's solution is the fastest. On a 1000-row df:
In [169]: %timeit using_reindex2(df)
100 loops, best of 3: 6.86 ms per loop
In [152]: %timeit using_reindex(df)
100 loops, best of 3: 8.36 ms per loop
In [80]: %timeit using_pivot(df)
100 loops, best of 3: 8.58 ms per loop
In [79]: %timeit using_concat(df)
10 loops, best of 3: 84 ms per loop
Here is the setup I used for the benchmark:
import numpy as np
import pandas as pd

def using_pivot(df):
    return (df.pivot_table(index='P', columns='C', values='V', aggfunc='sum')
            .fillna(0)
            .cumsum(axis=0)
            .replace(0, np.nan)
            .stack().reset_index())

def using_reindex(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    """
    s = df.set_index(['P', 'C']).V.sum(level=[0, 1])
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_reindex2(df):
    """
    https://stackoverflow.com/a/49097572/190597 (piRSquared)
    with first line changed
    """
    s = df.groupby(['P', 'C'])['V'].sum()
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_concat(df):
    """
    https://stackoverflow.com/a/49095139/190597 (Allen)
    """
    return (pd.concat([df.loc[df.P <= i][['C', 'V']].assign(P=i)
                       for i in df.P.unique()])
            .groupby(by=['P', 'C'])
            .sum()
            .reset_index())

def make(nrows):
    df = pd.DataFrame(np.random.randint(50, size=(nrows, 3)), columns=list('PCV'))
    return df

df = make(1000)

Adding a column to a pandas dataframe that is the sum of 3 different rows in another column AND slides those rows down like in Excel

I have a data frame like:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 15
5 16 17 18
6 19 20 21
7 22 23 24
8 25 26 27
I'd like to add a column d that is the sum of column a row 0, column a row 2, and column a row 5.
I figured out how to do:
df['d'] = df.loc[0, 'a'] + df.loc[2, 'a'] + df.loc[5, 'a']
But the result is a static d tied to only those rows. I'd like a dynamic d, such that column d, row 1 is the sum of column a row 1, column a row 3, and column a row 6.
The end result should be:
a b c d
0 1 2 3 24
1 4 5 6 33
2 7 8 9 42
3 10 11 12 ---And so on
4 13 14 15 ---
5 16 17 18 ---
6 19 20 21 ---
7 22 23 24 ---
8 25 26 27 ---
Thanks for any help!
This is shift:
df.a+df.a.shift(-2)+df.a.shift(-5)
Out[412]:
0 24.0
1 33.0
2 42.0
3 51.0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
Name: a, dtype: float64
df['d']=df.a+df.a.shift(-2)+df.a.shift(-5)
df
Out[414]:
a b c d
0 1 2 3 24.0
1 4 5 6 33.0
2 7 8 9 42.0
3 10 11 12 51.0
4 13 14 15 NaN
5 16 17 18 NaN
6 19 20 21 NaN
7 22 23 24 NaN
8 25 26 27 NaN
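A small generalization (my sketch, not part of the original answer): the shifts 0, -2 and -5 encode the row offsets 0, 2 and 5 from the question, so an arbitrary pattern can be summed in one pass:
offsets = [0, -2, -5]  # d[i] = a[i] + a[i+2] + a[i+5]
df['d'] = sum(df.a.shift(k) for k in offsets)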