How to add 1 to previous data if NaN in pandas

I was wondering if it is possible to add 1 (or n) to missing values in a pandas DataFrame / Series.
For example:
1
10
nan
15
25
nan
nan
nan
30
Would return :
1
10
11
15
25
26
27
28
30
Thank you,

Use .ffill plus the result of a groupby/cumcount to determine n:
df[0].ffill() + df.groupby(df[0].notnull().cumsum()).cumcount()
0 1.0
1 10.0
2 11.0
3 15.0
4 25.0
5 26.0
6 27.0
7 28.0
8 30.0
dtype: float64
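For a self-contained version, here is a minimal sketch assuming the values live in column 0 of a one-column frame:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 10, np.nan, 15, 25, np.nan, np.nan, np.nan, 30])
# Each non-null value starts a new group; cumcount numbers the rows
# inside each group 0, 1, 2, ..., so the NaN rows get n = 1, 2, 3, ...
n = df.groupby(df[0].notnull().cumsum()).cumcount()
result = df[0].ffill() + n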

Adding columns with null values in pandas dataframe

When summing two pandas columns, I want to ignore NaN values when only one of the two values is missing. However, when NaN appears in both columns, I want to keep NaN in the output (instead of 0.0).
Initial dataframe:
Surf1 Surf2
0 0
NaN 8
8 15
NaN NaN
16 14
15 7
Desired output:
Surf1 Surf2 Sum
0 0 0
NaN 8 8
8 15 23
NaN NaN NaN
16 14 30
15 7 22
Tried code:
The code below ignores NaN values, but when summing two NaN values it gives 0.0 in the output, where I want to keep NaN in that particular case, to keep these empty values separate from sums that are actually 0.
import pandas as pd
import numpy as np
data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
From the documentation pandas.DataFrame.sum
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([]).sum()  # min_count=0 is the default
0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
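A quick check of that behaviour (a minimal sketch):
import pandas as pd

print(pd.Series([], dtype=float).sum())             # 0.0, min_count=0 is the default
print(pd.Series([], dtype=float).sum(min_count=1))  # nan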
Change your code to
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
output
Surf1 Surf2
0 10.0 22.0
1 NaN 8.0
2 8.0 15.0
3 NaN NaN
4 16.0 14.0
5 15.0 7.0
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You could mask the result by doing:
df.sum(axis=1).mask(df.isna().all(axis=1))
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
You can do:
df['Sum'] = df.dropna(how='all').sum(axis=1)
Output:
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You can use min_count: this sums each row as long as there is at least one non-null value, and returns null when all values are null.
df['SUM'] = df.sum(min_count=1, axis=1)
df.sum(min_count=1, axis=1)
Out[199]:
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
I think the solutions listed above only work for the cases where it is the first column's value that is missing. If the first column's value is present but the second column's value is missing, try:
df['sum'] = df['Surf1']
df.loc[df['Surf2'].notnull(), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']

How to replace values in a dataframe with values from another dataframe

I have 2 dataframes
df_1:
Week Day Coeff_1 ... Coeff_n
1 1 12 23
1 2 11 19
1 3 23 68
1 4 57 81
1 5 35 16
1 6 0 0
1 7 0 0
...
50 1 12 23
50 2 11 19
50 3 23 68
50 4 57 81
50 5 35 16
50 6 0 0
50 7 0 0
df_2:
Week Day Coeff_1 ... Coeff_n
1 1 0 0
1 2 0 0
1 3 0 0
1 4 0 0
1 5 0 0
1 6 56 24
1 7 20 10
...
50 1 0 0
50 2 0 0
50 3 0 0
50 4 0 0
50 5 0 0
50 6 10 84
50 7 29 10
In the first dataframe df_1 I have coefficients for Monday to Friday. In the second dataframe df_2 I have coefficients for the weekend. My goal is to merge both dataframes so that the obsolete 0 values disappear.
What is the best approach to do that?
I found that using df.replace seems to be a good approach
Assuming that your dataframes follow the same structure, you can capitalise on pandas functionality to align automatically on indexes. Thus you can replace 0's with np.nan in df1, and then use fillna:
df1.replace({0: np.nan}, inplace=True)
df1.fillna(df2)
Week Day Coeff_1 Coeff_n
0 1.0 1.0 12.0 23.0
1 1.0 2.0 11.0 19.0
2 1.0 3.0 23.0 68.0
3 1.0 4.0 57.0 81.0
4 1.0 5.0 35.0 16.0
5 1.0 6.0 56.0 24.0
6 1.0 7.0 20.0 10.0
7 50.0 1.0 12.0 23.0
8 50.0 2.0 11.0 19.0
9 50.0 3.0 23.0 68.0
10 50.0 4.0 57.0 81.0
11 50.0 5.0 35.0 16.0
12 50.0 6.0 10.0 84.0
13 50.0 7.0 29.0 10.0
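For what it's worth, the same automatic index alignment also allows a one-liner with mask (a sketch, not part of the original answer):
# Wherever df1 holds an obsolete 0, take the aligned value from df2
df_merged = df1.mask(df1.eq(0), df2)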
Can't you just append the rows of df_1 where Day is 1-5 to the rows of df_2 where Day is 6-7? (DataFrame.append has been removed in pandas 2.0, so pd.concat is used here.)
df_3 = pd.concat([df_1[df_1.Day.isin(range(1, 6))], df_2[df_2.Day.isin(range(6, 8))]])
To get a normal sorting, you can sort your values by week and day:
df_3.sort_values(['Week','Day'])

How to groupby a dataframe with two level header and generate box plot?

Now I have a dataframe like below (original dataframe):
Equipment   A   B   C
        1  10  10  10
        1  11  11  11
        2  12  12  12
        2  13  13  13
        3  14  14  14
        3  15  15  15
And I want to transform the dataframe like below (transformed dataframe):
  1           2           3
  A   B   C   A   B   C   A   B   C
 10  10  10  12  12  12  14  14  14
 11  11  11  13  13  13  15  15  15
How can I make such groupby transformation with two level header by Pandas?
Additionally, I want to use the transformed dataframe to generate a box plot, where the whole plot is divided into three parts (i.e. 1, 2, 3) and each part has three boxes (i.e. A, B, C). Can I use the transformed dataframe above without any further processing? Or can I create the box plot from the original dataframe alone?
Thank you so much.
Try:
g = df.groupby('Equipment')[df.columns[1:]].apply(lambda x: x.reset_index(drop=True).T).T
g:
Equipment 1 2 3
A B C A B C A B C
0 10 10 10 12 12 12 14 14 14
1 11 11 11 13 13 13 15 15 15
Explanation:
grp = df.groupby('Equipment')[df.columns[1:]]
grp.apply(print)
A B C
0 10 10 10
1 11 11 11
A B C
2 12 12 12
3 13 13 13
A B C
4 14 14 14
5 15 15 15
you can see the indexes 0 1, 2 3, 4 5 for each equipment group (1, 2, 3).
That's why I used reset_index to make them 0 1 for each group. Why?
If you do it without resetting the index:
df.groupby('Equipment')[df.columns[1:]].apply(lambda x: x.T)
0 1 2 3 4 5
Equipment
1 A 10.0 11.0 NaN NaN NaN NaN
B 10.0 11.0 NaN NaN NaN NaN
C 10.0 11.0 NaN NaN NaN NaN
2 A NaN NaN 12.0 13.0 NaN NaN
B NaN NaN 12.0 13.0 NaN NaN
C NaN NaN 12.0 13.0 NaN NaN
3 A NaN NaN NaN NaN 14.0 15.0
B NaN NaN NaN NaN 14.0 15.0
C NaN NaN NaN NaN 14.0 15.0
See the values in columns (2, 3) and (4, 5); I want to combine them all into columns (0, 1). That's why the index is reset with drop=True, which gives:
0 1
Equipment
1 A 10 11
B 10 11
C 10 11
2 A 12 13
B 12 13
C 12 13
3 A 14 15
B 14 15
C 14 15
You can play with the code to get a deeper feel for what is happening inside.
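As for the box plot part of the question: since the transformed frame's columns are already ordered 1-A, 1-B, 1-C, 2-A, ..., plotting it directly should give the grouping you describe. A minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

# One box per (Equipment, letter) column; the column order groups the
# boxes into the three parts 1, 2, 3 with A, B, C inside each part.
g.boxplot(rot=45)
plt.show()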

Create a new ID column based on conditions in other column using pandas

I am trying to make a new column 'ID' which should assign a unique ID to each run of non-NaN values in the 'Data' column. If non-null values come right after each other, the ID stays the same. I have shown below how my final Id column should look, for reference. Could anyone guide me on this?
Id Data
0 NaN
0 NaN
0 NaN
1 54
1 55
0 NaN
0 NaN
2 67
0 NaN
0 NaN
3 33
3 44
3 22
0 NaN
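For reference, the example column can be reconstructed like this (a minimal sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Data": [np.nan, np.nan, np.nan, 54, 55, np.nan, np.nan,
                            67, np.nan, np.nan, 33, 44, 22, np.nan]})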
Use .groupby on the cumsum to get consecutive groups, using where to mask the NaN; .ngroup then gives the consecutive IDs. It is also possible with rank.
s = df.Data.isnull().cumsum().where(df.Data.notnull())
df['ID'] = df.groupby(s).ngroup()+1
# df['ID'] = s.rank(method='dense').fillna(0).astype(int)
Output:
Data ID
0 NaN 0
1 NaN 0
2 NaN 0
3 54.0 1
4 55.0 1
5 NaN 0
6 NaN 0
7 67.0 2
8 NaN 0
9 NaN 0
10 33.0 3
11 44.0 3
12 22.0 3
13 NaN 0
Using factorize
v = pd.factorize(df.Data.isnull().cumsum()[df.Data.notnull()])[0] + 1  # label each non-null run 1, 2, 3, ...
df.loc[df.Data.notnull(), 'Newid'] = v
df.Newid.fillna(0, inplace=True)
df
Id Data Newid
0 0 NaN 0.0
1 0 NaN 0.0
2 0 NaN 0.0
3 1 54.0 1.0
4 1 55.0 1.0
5 0 NaN 0.0
6 0 NaN 0.0
7 2 67.0 2.0
8 0 NaN 0.0
9 0 NaN 0.0
10 3 33.0 3.0
11 3 44.0 3.0
12 3 22.0 3.0
13 0 NaN 0.0

Applying multiple functions to a pivot table (grouped) dataframe

I currently have a dataframe which looks like this:
df:
store item sales
0 1 1 10
1 1 2 20
2 2 1 10
3 3 2 20
4 4 3 10
5 3 4 15
...
I wanted to view the total sales of each items for each store so I used pivot table to create this:
p_table = pd.pivot_table(df, index='store', values='sales', columns='item', aggfunc=np.sum)
which gives something like:
sales
item 1 2 3 4
store
1 20 30 10 8
2 10 14 12 13
3 1 23 29 10
....
What I want to do now is apply a function so that each item's total sales becomes a percentage of the total sales for that particular store. For example, the value for item 1 at store 1 would become:
20 / (20 + 30 + 10 + 8) * 100
I am struggling to do this for a stacked dataframe. Any suggestions would be much appreciated.
Thanks
I think you need to divide by the row totals: use div with the Series created by sum:
print (p_table)
item 1 2 3 4
store
1 10.0 20.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN 20.0 NaN 15.0
4 NaN NaN 10.0 NaN
print (p_table.sum(axis=1))
store
1 30.0
2 10.0
3 35.0
4 10.0
dtype: float64
out = p_table.div(p_table.sum(axis=1), axis=0)
print (out)
item 1 2 3 4
store
1 0.333333 0.666667 NaN NaN
2 1.000000 NaN NaN NaN
3 NaN 0.571429 NaN 0.428571
4 NaN NaN 1.0 NaN
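Since the question asks for percentages rather than fractions, multiply by 100 at the end:
out = p_table.div(p_table.sum(axis=1), axis=0).mul(100)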