Compute a sequential rolling mean in pandas as array function? - pandas

I am trying to calculate a rolling mean on a dataframe with NaNs in pandas, but pandas seems to reset the window when it meets a NaN. Here's some code as an example...
import numpy as np
from pandas import *

foo = DataFrame(np.arange(0.0, 13.0))
foo['1'] = np.arange(13.0, 26.0)
foo.ix[4:6, 0] = np.nan   # rows 4-6 of column 0 (.ix label slicing is inclusive)
foo.ix[4:7, 1] = np.nan   # rows 4-7 of column '1'
bar = rolling_mean(foo, 4)
This gives a rolling mean that resets the window after each run of NaNs, rather than simply skipping over them:
bar =
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 8.5 NaN
11 9.5 22.5
12 10.5 23.5
I have found an ugly iterate-over-columns/dropna() workaround that gives the right answer:
def sparse_rolling_mean(df_data, window):
    f_data = DataFrame(np.nan, index=df_data.index, columns=df_data.columns)
    for i in f_data.columns:
        f_data.ix[:, i] = rolling_mean(df_data.ix[:, i].dropna(), window)
    return f_data
bar = sparse_rolling_mean(foo,4)
bar
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
Does anyone know if it is possible to do this as an array function?
Many thanks in advance.

You may do:
>>> def sparse_rolling_mean(ts, window):
...     return rolling_mean(ts.dropna(), window).reindex_like(ts)
...
>>> foo.apply(sparse_rolling_mean, args=(4,))
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
[13 rows x 2 columns]
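For reference, rolling_mean was removed from later pandas releases in favour of the .rolling() accessor (introduced in 0.18). A sketch of the same trick against the modern API:
import numpy as np
import pandas as pd

foo = pd.DataFrame({0: np.arange(0.0, 13.0), '1': np.arange(13.0, 26.0)})
foo.iloc[4:7, 0] = np.nan   # .ix label slices were inclusive, hence 4:7 here
foo.iloc[4:8, 1] = np.nan

def sparse_rolling_mean(ts, window):
    # Roll over the non-NaN values only, then realign to the full index.
    return ts.dropna().rolling(window).mean().reindex_like(ts)

bar = foo.apply(sparse_rolling_mean, args=(4,))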

You can control what gets NaN'd out with the min_periods argument:
In [12]: rolling_mean(foo, 4, min_periods=1)
Out[12]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 2.0 15.0
5 2.5 15.5
6 3.0 16.0
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
You can do this if you want results everywhere except where the original was NaN:
In [27]: rolling_mean(foo, 4, min_periods=1)[foo.notnull()]
Out[27]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
Your expected output is a bit odd, as the first 3 rows should have values.
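With the modern .rolling() API, the min_periods version of the two snippets above looks like this (a sketch, reusing foo from the question):
bar = foo.rolling(4, min_periods=1).mean()   # allow partial windows
bar_masked = bar[foo.notnull()]              # re-blank where the input was NaN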

Related

Adding columns with null values in pandas dataframe [duplicate]

When summing two pandas columns, I want to ignore NaN values when only one of the two values is NaN. However, when NaN appears in both columns, I want to keep NaN in the output (instead of 0.0).
Initial dataframe:
Surf1 Surf2
0 0
NaN 8
8 15
NaN NaN
16 14
15 7
Desired output:
Surf1 Surf2 Sum
0 0 0
NaN 8 8
8 15 23
NaN NaN NaN
16 14 30
15 7 22
Tried code:
The code below ignores NaN values, but when both values are NaN it gives 0.0 in the output. I want to keep NaN in that particular case, to keep these empty values separate from values that are actually 0 after summing.
import pandas as pd
import numpy as np
data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
From the documentation for pandas.DataFrame.sum:
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([]).sum()  # min_count=0 is the default
0.0
This can be controlled with the min_count parameter. For example, if you'd like the sum of an empty series to be NaN, pass min_count=1.
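A quick sanity check of that behaviour (note an empty Series now wants an explicit dtype to avoid a warning):
>>> import pandas as pd
>>> pd.Series([], dtype=float).sum()             # default min_count=0
0.0
>>> pd.Series([], dtype=float).sum(min_count=1)  # require at least one value
nan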
Change your code to:
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
output
Surf1 Surf2
0 10.0 22.0
1 NaN 8.0
2 8.0 15.0
3 NaN NaN
4 16.0 14.0
5 15.0 7.0
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You could mask the result by doing:
df.sum(axis=1).mask(df.isna().all(axis=1))
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
You can do:
df['Sum'] = df.dropna(how='all').sum(axis=1)
Output:
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You can use min_count; this will sum the row when there is at least one non-null value, and return null if all values are null:
df['SUM'] = df.sum(min_count=1, axis=1)

In [199]: df.sum(min_count=1, axis=1)
Out[199]:
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
I think all the solutions listed above work only for the case when it is the first column value that is missing. If the first column value is non-missing but the second column value is missing, try:
df['sum'] = df['Surf1']
df.loc[df['Surf2'].notnull(), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']
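For completeness, a symmetric variant that tolerates a missing value in either column. This is effectively what min_count=1 already does, written out by hand:
import numpy as np

both_nan = df[['Surf1', 'Surf2']].isna().all(axis=1)   # rows where both are NaN
df['Sum'] = df['Surf1'].fillna(0) + df['Surf2'].fillna(0)
df.loc[both_nan, 'Sum'] = np.nan                       # keep those rows empty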

Use condition in a dataframe to replace values in another dataframe with nan

I have a dataframe that contains concentration values for a set of samples as follows:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       20            20
A       30       23       20            nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       87            54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
I have a second dataframe that contains the proportion of concentration values that are not missing in the first dataframe:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       0.75     1        0.5           0.25
B       1        0.75     0.25          1
C       1        0        0             1
I would like to replace values in the first dataframe with nan when the corresponding value in the second dataframe is 0.5 or less. Hence, the resulting dataframe would look like the one below. Any help would be great!
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       nan           nan
A       30       23       nan           nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       nan           54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
Is this what you are looking for? Build a boolean mask from the per-sample proportions, then pass it to where, which aligns the one-row-per-sample mask against the repeated Sample index:
>>> mask = df2.set_index('Sample').gt(0.5)
>>> df1.set_index('Sample').where(mask).reset_index()
   Sample  Ethanol  Acetone  Formaldehyde  Methane
0       A     20.0     20.0           NaN      NaN
1       A     30.0     23.0           NaN      NaN
2       A     20.0     23.0           NaN      NaN
3       A      NaN     20.0           NaN      NaN
4       B     21.0     46.0           NaN     54.0
5       B     23.0     74.0           NaN     54.0
6       B     23.0     67.0           NaN     53.0
7       B     23.0      NaN           NaN     33.0
8       C     23.0      NaN           NaN     66.0
9       C     22.0      NaN           NaN     88.0
10      C     22.0      NaN           NaN     90.0
11      C     22.0      NaN           NaN     88.0
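If the index alignment inside where feels too implicit, you can broadcast the per-sample mask explicitly instead. A sketch, assuming df1 and df2 share the same value columns:
mask = df2.set_index('Sample').gt(0.5)           # True where proportion > 0.5
keep = mask.reindex(df1['Sample']).to_numpy()    # one boolean row per df1 row
value_cols = df1.columns.drop('Sample')
out = df1.copy()
out[value_cols] = df1[value_cols].where(keep)    # NaN wherever keep is False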

Is there a way to merge pandas dataframes on row and column index?

I want to merge two pandas dataframes that share the same index as well as some columns. pd.merge creates duplicate columns, but I would like to merge on both axes at the same time.
I tried pd.merge and pd.concat but did not get the right result.
My try: df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
df1
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7
ID
323 7 6 8 7.0 2.0 2.0 10.0
324 2 1 5 3.0 4.0 2.0 1.0
675 9 8 1 NaN NaN NaN NaN
676 3 7 2 NaN NaN NaN NaN
df2
Var#6 Var#7 Var#8 Var#9
ID
675 1 9 2 8
676 3 2 0 7
ideally I would get:
df3
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7 6 8 7.0 2.0 2.0 10.0 NaN NaN
324 2 1 5 3.0 4.0 2.0 1.0 NaN NaN
675 9 8 1 NaN NaN 1 9 2 8
676 3 7 2 NaN NaN 3 2 0 7
IIUC, use df.combine_first():
df3=df1.combine_first(df2)
print(df3)
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7 6 8 7.0 2.0 2.0 10.0 NaN NaN
324 2 1 5 3.0 4.0 2.0 1.0 NaN NaN
675 9 8 1 NaN NaN 1.0 9.0 2.0 8.0
676 3 7 2 NaN NaN 3.0 2.0 0.0 7.0
You can concat and group the data:
pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).first()
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7.0 6.0 8.0 7.0 2.0 2.0 10.0 NaN NaN
324 2.0 1.0 5.0 3.0 4.0 2.0 1.0 NaN NaN
675 9.0 8.0 1.0 NaN NaN 1.0 9.0 2.0 8.0
676 3.0 7.0 2.0 NaN NaN 3.0 2.0 0.0 7.0
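Note that groupby(..., axis=1) is deprecated in recent pandas; a transpose-based equivalent (a sketch):
# Concatenate side by side, then collapse the duplicated column labels by
# grouping on them after a transpose and keeping the first non-null value.
df3 = pd.concat([df1, df2], axis=1)
df3 = df3.T.groupby(level=0).first().T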

Boxplot with pandas and groupby

I have the following dataset sample:
0 1
0 0 0.040158
1 2 0.500642
2 0 0.005694
3 1 0.065052
4 0 0.034789
5 2 0.128495
6 1 0.088816
7 1 0.056725
8 0 -0.000193
9 2 -0.070252
10 2 0.138282
11 2 0.054638
12 2 0.039994
13 2 0.060659
14 0 0.038562
I need a box-and-whisker plot, grouped by column 0. I have the following:
import matplotlib.pyplot as plt

plt.figure()
grouped = df.groupby(0)
grouped.boxplot(column=1)
plt.savefig('plot.png')
But I end up with three subplots. How can I place all three on one plot?
Thanks.
As of pandas 0.16.0, you can simply do this:
df.boxplot(by=0)
(Note that the grouping column's label in the sample data is the integer 0, so by=0 rather than by='0'.)
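A minimal self-contained sketch of that one-liner, using toy data with the same integer column labels as the question:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Group id in column 0, value in column 1, as in the question.
df = pd.DataFrame({0: np.random.randint(0, 3, 15),
                   1: np.random.randn(15) / 10})

df.boxplot(by=0, column=1)   # one axes, one box per group
plt.savefig('plot.png')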
I don't believe you need to use groupby.
df2 = df.pivot(columns=df.columns[0], index=df.index)
df2.columns = df2.columns.droplevel()
>>> df2
0 0 1 2
0 0.040158 NaN NaN
1 NaN NaN 0.500642
2 0.005694 NaN NaN
3 NaN 0.065052 NaN
4 0.034789 NaN NaN
5 NaN NaN 0.128495
6 NaN 0.088816 NaN
7 NaN 0.056725 NaN
8 -0.000193 NaN NaN
9 NaN NaN -0.070252
10 NaN NaN 0.138282
11 NaN NaN 0.054638
12 NaN NaN 0.039994
13 NaN NaN 0.060659
14 0.038562 NaN NaN
df2.boxplot()

Pandas - Delete rows with two or more NaN values in dataframe

I want to delete values from a column once it contains too many NaN values; specifically, 2 or more.
I have a dataframe with a column which looks like this. The column below has 40 rows. I want to remove the NaN values from the 19th row onward (after the 17.9 value).
AvgWS
0.12
1
2.04
3.01
3.99
5
6
7
7.99
9
10
10.98
11.99
13
13.93
14.99
15.98
NaN
17.9
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
Thanks
You can call isnull() on the column; this returns a Series of boolean values. Cast it to int (True becomes 1, False becomes 0) and call cumsum(). Then filter the df to the rows where the cumulative sum is less than 2, i.e. everything before the second NaN:
In [110]:
df[df['AvgWS'].isnull().astype(int).cumsum() < 2]
Out[110]:
AvgWS
0 0.12
1 1.00
2 2.04
3 3.01
4 3.99
5 5.00
6 6.00
7 7.00
8 7.99
9 9.00
10 10.00
11 10.98
12 11.99
13 13.00
14 13.93
15 14.99
16 15.98
17 NaN
18 17.90
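If you need this more than once, the cumsum trick generalizes to "keep everything before the n-th NaN". A sketch; drop_after_nth_nan is a hypothetical helper name:
def drop_after_nth_nan(df, col, n=2):
    # isnull().cumsum() counts the NaNs seen so far (including the current
    # row), so rows with a count below n sit before the n-th NaN.
    return df[df[col].isnull().cumsum() < n]

trimmed = drop_after_nth_nan(df, 'AvgWS', n=2)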