How to perform a rolling window on a pandas DataFrame where rows contain NaN values that should not be replaced? - pandas

I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])
With output:
0 1 2 3 4 5 6 7
0 0 1 2 4 NaN NaN NaN 1
1 0 1 2 NaN NaN NaN NaN 1
2 0 2 2 NaN 2 NaN 1 1
with dtypes:
df.dtypes
0 int64
1 int64
2 int64
3 float64
4 float64
5 float64
6 float64
7 int64
Then the following rolling summation is applied:
df.rolling(window=7, min_periods=1, axis='columns').sum()
And the output is as follows:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 4.0 4.0 4.0 4.0 4.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 2.0 2.0 3.0 5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column differs.
However, I have a dataframe in which all columns share the same object dtype:
df = df.astype('object')
Applying the same rolling summation then gives:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 7.0 7.0 7.0 8.0
1 0.0 1.0 3.0 3.0 3.0 3.0 3.0 4.0
2 0.0 2.0 4.0 4.0 6.0 6.0 7.0 8.0
My desired output, however, stops and starts again whenever a NaN value appears. It would look like:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 NaN NaN NaN 8.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 6.0 NaN 7.0 8.0
I figure there must be a way to leave the NaN values out of the summation while also not filling them in with values obtained from the rolling window.
Any help would be appreciated!

The workaround is:
Find where the NaN values are located:
nan = df.isnull()
Apply the rolling window:
df = df.rolling(window=7, min_periods=1, axis='columns').sum()
Keep only the values whose mask entry is False:
df[~nan]
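Put together, a minimal end-to-end sketch of this workaround (assuming the pandas version the question uses; axis='columns' on rolling is deprecated in recent releases):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]]).astype('object')

nan = df.isnull()  # remember where the NaNs were
rolled = df.rolling(window=7, min_periods=1, axis='columns').sum()
result = rolled[~nan]  # restore NaN at the original positions
print(result)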

Related

How do I append an uneven column to an existing one?

I am having trouble appending the later values from column C to column A within the same df using pandas. I have tried .append and .concat with ignore_index=True, but neither works.
import pandas as pd
d = {'a':[1,2,3,None, None], 'b':[7,8,9, None, None], 'c':[None, None, None, 5, 6]}
df = pd.DataFrame(d)
df['a'] = df['a'].append(df['c'], ignore_index=True)
print(df)
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 NaN NaN 5.0
4 NaN NaN 6.0
Desired:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
Thank you for updating that, this is what I would do:
df['a'] = df['a'].fillna(df['c'])
print(df)
Output:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
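As an aside (my addition, not part of the original answer), Series.combine_first does the same per-element coalescing and reads similarly:
df['a'] = df['a'].combine_first(df['c'])  # take 'a' where present, else fall back to 'c'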

Pandas assign value in one column based on top 10 values in another column

I have a table:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
I would like to make a new column called 'flag' for the top 2 values in column D.
I've tried:
for i in df.D.nlargest(2):
    df['flag'] = 1
But that gets me:
A B C D flag
0 NaN 2.0 NaN 0 1
1 3.0 4.0 NaN 1 1
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
What I want is:
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
IIUC:
df['flag'] = 0
df.loc[df.D.nlargest(2).index, 'flag'] = 1
Or:
df['flag'] = df.index.isin(df.D.nlargest(2).index).astype(int)
Output:
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
IIUC
df['flag'] = df.D.sort_values().tail(2).eq(df.D).astype(int)
df
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
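One caveat worth noting (my addition, not from the answers above): nlargest(2) keeps only the first two rows when values tie. If tied values should all be flagged, nlargest supports keep='all':
# flag every row that ties with the top-2 values, not just the first two found
df['flag'] = df.index.isin(df.D.nlargest(2, keep='all').index).astype(int)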

How to select NaN values in pandas in specific range

I have a dataframe like this:
df = pd.DataFrame({'col1': [5, 6, np.nan, np.nan, np.nan, 4, np.nan, np.nan, np.nan, np.nan, 7, 8, 8, np.nan, 5, np.nan]})
df:
col1
0 5.0
1 6.0
2 NaN
3 NaN
4 NaN
5 4.0
6 NaN
7 NaN
8 NaN
9 NaN
10 7.0
11 8.0
12 8.0
13 NaN
14 5.0
15 NaN
These NaN values should be replaced in the following way. The first selection should look like this.
2 NaN
3 NaN
4 NaN
5 4.0
6 NaN
7 NaN
8 NaN
9 NaN
These NaN values should then be replaced with the only non-NaN value in that selection, 4.
The second selection is:
13 NaN
14 5.0
15 NaN
and these NaN values should be replaced with 5.
With isnull() you can select the NaN values in a dataframe, but how are you able to filter/select these specific ranges in pandas?
This solution applies when the missing values surround a single non-missing value: create unique groups, then replace within each group by forward and back filling:
# test for missing values
s = df['col1'].isna()
# number the alternating runs of NaN / non-NaN values
v = s.ne(s.shift()).cumsum()
# mark missing values plus any lone value surrounded by them
mask = v.map(v.value_counts()).eq(1) | s
# number the final replacement groups
g = mask.ne(mask.shift()).cumsum()
df['col2'] = df.groupby(g)['col1'].apply(lambda x: x.ffill().bfill())
print(df)
col1 col2
0 5.0 5.0
1 6.0 6.0
2 NaN 4.0
3 NaN 4.0
4 NaN 4.0
5 4.0 4.0
6 NaN 4.0
7 NaN 4.0
8 NaN 4.0
9 NaN 4.0
10 7.0 7.0
11 8.0 8.0
12 8.0 8.0
13 NaN 5.0
14 5.0 5.0
15 NaN 5.0
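To see why this works (my annotation, continuing from the snippet above), it helps to print the intermediate series: s marks the NaNs, v numbers the alternating NaN/non-NaN runs, mask is True for NaNs and for lone values sandwiched between them, and g numbers the final replacement blocks:
# inspect the helper series side by side with the data
debug = pd.DataFrame({'col1': df['col1'], 's': s, 'v': v, 'mask': mask, 'g': g})
print(debug)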

Is there a way to merge pandas dataframes on row and column index?

I want to merge two pandas data frames that share the same index as well as some columns. pd.merge creates duplicate columns, but I would like to merge on both axes at the same time.
I tried pd.merge and pd.concat but did not get the right result.
My attempt: df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
df1
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7
ID
323 7 6 8 7.0 2.0 2.0 10.0
324 2 1 5 3.0 4.0 2.0 1.0
675 9 8 1 NaN NaN NaN NaN
676 3 7 2 NaN NaN NaN NaN
df2
Var#6 Var#7 Var#8 Var#9
ID
675 1 9 2 8
676 3 2 0 7
Ideally I would get:
df3
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7 6 8 7.0 2.0 2.0 10.0 NaN NaN
324 2 1 5 3.0 4.0 2.0 1.0 NaN NaN
675 9 8 1 NaN NaN 1 9 2 8
676 3 7 2 NaN NaN 3 2 0 7
IIUC, use df.combine_first():
df3 = df1.combine_first(df2)
print(df3)
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7 6 8 7.0 2.0 2.0 10.0 NaN NaN
324 2 1 5 3.0 4.0 2.0 1.0 NaN NaN
675 9 8 1 NaN NaN 1.0 9.0 2.0 8.0
676 3 7 2 NaN NaN 3.0 2.0 0.0 7.0
You can concat and then group the data:
pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).first()
Var#1 Var#2 Var#3 Var#4 Var#5 Var#6 Var#7 Var#8 Var#9
ID
323 7.0 6.0 8.0 7.0 2.0 2.0 10.0 NaN NaN
324 2.0 1.0 5.0 3.0 4.0 2.0 1.0 NaN NaN
675 9.0 8.0 1.0 NaN NaN 1.0 9.0 2.0 8.0
676 3.0 7.0 2.0 NaN NaN 3.0 2.0 0.0 7.0
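For reference, a self-contained sketch reproducing the combine_first answer, with the frames reconstructed from the tables above:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Var#1': [7, 2, 9, 3], 'Var#2': [6, 1, 8, 7],
                    'Var#3': [8, 5, 1, 2],
                    'Var#4': [7.0, 3.0, np.nan, np.nan],
                    'Var#5': [2.0, 4.0, np.nan, np.nan],
                    'Var#6': [2.0, 2.0, np.nan, np.nan],
                    'Var#7': [10.0, 1.0, np.nan, np.nan]},
                   index=pd.Index([323, 324, 675, 676], name='ID'))
df2 = pd.DataFrame({'Var#6': [1, 3], 'Var#7': [9, 2],
                    'Var#8': [2, 0], 'Var#9': [8, 7]},
                   index=pd.Index([675, 676], name='ID'))

# take df1's values where present, fill the holes (and new columns) from df2
df3 = df1.combine_first(df2)
print(df3)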

(pandas) Why does .bfill().ffill() act differently than ffill().bfill() on groups?

I think I'm missing something basic conceptually, but I'm not able to find the answer in the docs.
>>> df=pd.DataFrame({'a':[1,1,2,2,3,3], 'b':[5,np.nan, 6, np.nan, np.nan, np.nan]})
>>> df
a b
0 1 5.0
1 1 NaN
2 2 6.0
3 2 NaN
4 3 NaN
5 3 NaN
Using ffill() and then bfill():
>>> df.groupby('a')['b'].ffill().bfill()
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Using bfill() and then ffill():
>>> df.groupby('a')['b'].bfill().ffill()
0 5.0
1 5.0
2 6.0
3 6.0
4 6.0
5 6.0
Doesn't the second way break the groupings? Will the first way always make sure that the values are filled in only with other values in that group?
I think you need:
print (df.groupby('a')['b'].apply(lambda x: x.ffill().bfill()))
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
print (df.groupby('a')['b'].apply(lambda x: x.bfill().ffill()))
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
This is because in your sample only the first ffill or bfill is the GroupBy method (DataFrameGroupBy.ffill or DataFrameGroupBy.bfill); the second call operates on the Series returned by the first. That is what breaks the groups: a plain Series carries no group information.
print (df.groupby('a')['b'].ffill())
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
print (df.groupby('a')['b'].bfill())
0 5.0
1 NaN
2 6.0
3 NaN
4 NaN
5 NaN
Name: b, dtype: float64
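If you prefer to avoid apply, an equivalent group-aware alternative (my suggestion, not from the answer above) is to re-group before the second fill, so both directions respect the group boundaries:
filled = df.groupby('a')['b'].ffill()     # forward fill within each 'a' group
filled = filled.groupby(df['a']).bfill()  # back fill, again within each group
print(filled)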