Sum of NaNs to equal NaN (not zero) - pandas

I can add a TOTAL column to this DataFrame using df['TOTAL'] = df.sum(axis=1), which sums the row elements like this:
   col1  col2  TOTAL
0   1.0   5.0    6.0
1   2.0   6.0    8.0
2   0.0   NaN    0.0
3   NaN   NaN    0.0
However, I would like the total of the bottom row to be NaN, not zero, like this:
   col1  col2  TOTAL
0   1.0   5.0    6.0
1   2.0   6.0    8.0
2   0.0   NaN    0.0
3   NaN   NaN    NaN
Is there a performant way to achieve this?

Add the parameter min_count=1 to DataFrame.sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
df['TOTAL'] = df.sum(axis=1, min_count=1)
print(df)
   col1  col2  TOTAL
0   1.0   5.0    6.0
1   2.0   6.0    8.0
2   0.0   NaN    0.0
3   NaN   NaN    NaN
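For a self-contained check, here is a minimal sketch that reproduces the example (the construction of df is an assumption, since the question does not show it):

import numpy as np
import pandas as pd

# rebuild the example frame (assumed layout)
df = pd.DataFrame({'col1': [1.0, 2.0, 0.0, np.nan],
                   'col2': [5.0, 6.0, np.nan, np.nan]})

# min_count=1 requires at least one non-NA value per row,
# so the all-NaN row sums to NaN instead of 0
df['TOTAL'] = df.sum(axis=1, min_count=1)
print(df)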

Related

How to accumulate the non-null values of each row into a separate column/series?

I have a dataframe with a subset of columns with the form:
   col_A  col_B  col_C
0    NaN    1.0    NaN
1    NaN    NaN    NaN
2    NaN    NaN    2.0
3    3.0    NaN    4.0
I want to create a series of the non-null values for each row. For rows with multiple non-null values, I want either the average or the first value, e.g.:
   new_col
0      1.0
1      NaN
2      2.0
3      3.5
I'll eventually want to add this back to the original dataframe as a separate column, so I need to persist rows with all-NaN values by forward filling them, e.g.:
   new_col
0      1.0
1      1.0
2      2.0
3      3.5
I know how to determine whether a row has a non-null value, but I don't know how to select for it:
df[['col_A', 'col_B', 'col_C']].count(axis=1) >= 1
You can use mean, which skips NaN by default (an all-NaN row yields NaN, and ffill then fills it from the row above):
df.mean(axis=1).ffill()
Or to restrict the columns:
df[['col_A', 'col_B', 'col_C']].mean(axis=1).ffill()
Output:
0    1.0
1    1.0
2    2.0
3    3.5
dtype: float64
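The question also mentions taking the first non-null value instead of the average. A minimal sketch of that variant under the same assumed frame, using bfill across the columns so the first non-null value of each row lands in the leftmost column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_A': [np.nan, np.nan, np.nan, 3.0],
                   'col_B': [1.0, np.nan, np.nan, np.nan],
                   'col_C': [np.nan, np.nan, 2.0, 4.0]})

# back-fill across columns, take the first column, then
# forward-fill rows that were entirely NaN from the row above
new_col = df.bfill(axis=1).iloc[:, 0].ffill()
print(new_col)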

How to find the column with the most NaN values in a pandas df?

I can show the per-column counts with df.isnull().sum() and get the max value with df.isnull().sum().max(), but can someone tell me how to get the column name with the most NaNs?
Thank you all!
Use Series.idxmax with DataFrame.loc to select the column with the most missing values:
df.loc[:, df.isnull().sum().idxmax()]
If you need to select multiple columns that tie for the maximum, compare the Series against its max value:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, np.nan, 5, np.nan, 4],
    'C': [7, 8, 9, np.nan, 2, np.nan],
    'D': [1, np.nan, 5, 7, 1, 0]
})
print(df)
   A    B    C    D
0  a  4.0  7.0  1.0
1  b  5.0  8.0  NaN
2  c  NaN  9.0  5.0
3  d  5.0  NaN  7.0
4  e  NaN  2.0  1.0
5  f  4.0  NaN  0.0
s = df.isnull().sum()
df = df.loc[:, s.eq(s.max())]
print(df)
     B    C
0  4.0  7.0
1  5.0  8.0
2  NaN  9.0
3  5.0  NaN
4  NaN  2.0
5  4.0  NaN
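If you only need the column name(s) rather than the column values, the same counts give them directly (a short sketch, reusing s from above):

# s = df.isnull().sum() computed on the original frame
print(s.idxmax())                        # first column with the most NaNs: 'B'
print(s[s.eq(s.max())].index.tolist())   # all columns tied for the max: ['B', 'C']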

Splitting a DataFrame column into multiple columns

How can I split a DataFrame column whose cells each contain a list of strings like
[{'1','1','1','1'},{'1','1','1','1'},{'1','1','1','1'},{'1','1','1','1'}]
into multiple DataFrame columns?
Note that the lists in the cells are not all the same length!
(The question originally included an image showing the source column on the left and the desired result on the right.)
As @Oliver Prislan comments, that is an unusual structure. Did you mean something else? If your data really is structured like that, here is a way you can get it into the new format:
# assumes that your original dataframe is called `df`
# creates a new dataframe called new_df:
# remove the unwanted {, }, [, ] and ' characters,
# then expand into columns by splitting each string on the comma
new_df = (df['Column0']
          .str.replace(r"[{}\[\]']", '', regex=True)
          .str.split(',', expand=True))

# rename the columns as you wanted them
new_df.rename(columns='col{}'.format, inplace=True)
If your values are always numeric, you may want to convert the dataframe columns to numeric dtypes:
for col in new_df.columns:
    new_df[col] = pd.to_numeric(new_df[col])
Final result:
   col0  col1  col2  col3  col4  col5  col6  col7  col8  col9  col10  col11  col12  col13  col14  col15
0     1     1     1     1   NaN   NaN   NaN   NaN   NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN
1     1     1     1     1   1.0   1.0   1.0   1.0   1.0   1.0    1.0    1.0    1.0    1.0    1.0    1.0
2     1     1     1     1   1.0   1.0   1.0   1.0   1.0   1.0    1.0    1.0    NaN    NaN    NaN    NaN
3     1     1     1     1   NaN   NaN   NaN   NaN   NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN
4     1     1     1     1   NaN   NaN   NaN   NaN   NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN
5     1     1     1     1   1.0   1.0   1.0   1.0   NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN
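An alternative sketch, assuming every element is a run of digits: extract the numbers with a regular expression and let pandas pad the ragged rows with NaN (Column0 is the assumed column name, as above):

# one list of digit strings per cell; building a DataFrame from
# ragged lists pads the shorter rows automatically
lists = df['Column0'].str.findall(r'\d+')
new_df = pd.DataFrame(lists.tolist(), index=df.index).astype(float)
new_df.columns = [f'col{i}' for i in range(new_df.shape[1])]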

How to perform a rolling window on a pandas DataFrame, whereby each row contains NaN values that should not be replaced?

I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])
With output:
   0  1  2    3    4    5    6  7
0  0  1  2  4.0  NaN  NaN  NaN  1
1  0  1  2  NaN  NaN  NaN  NaN  1
2  0  2  2  NaN  2.0  NaN  1.0  1
with dtypes:
df.dtypes
0      int64
1      int64
2      int64
3    float64
4    float64
5    float64
6    float64
7      int64
dtype: object
Then the following rolling summation is applied:
df.rolling(window=7, min_periods=1, axis='columns').sum()
And the output is as follows:
     0    1    2    3    4    5    6    7
0  0.0  1.0  3.0  4.0  4.0  4.0  4.0  4.0
1  0.0  1.0  3.0  NaN  NaN  NaN  NaN  4.0
2  0.0  2.0  4.0  NaN  2.0  2.0  3.0  5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column is different. My real dataframe, however, has all columns of the same object dtype:
df = df.astype('object')
Applying the same rolling summation to it gives:
     0    1    2    3    4    5    6    7
0  0.0  1.0  3.0  7.0  7.0  7.0  7.0  8.0
1  0.0  1.0  3.0  3.0  3.0  3.0  3.0  4.0
2  0.0  2.0  4.0  4.0  6.0  6.0  7.0  8.0
My desired output, however, stops and starts again wherever a NaN value appears. It would look like:
     0    1    2    3    4    5    6    7
0  0.0  1.0  3.0  7.0  NaN  NaN  NaN  8.0
1  0.0  1.0  3.0  NaN  NaN  NaN  NaN  4.0
2  0.0  2.0  4.0  NaN  6.0  NaN  7.0  8.0
I figure there must be a way to leave the NaN values out of the summation without filling them in with values obtained from the rolling window.
Anything would help!
A workaround:
First record where the NaN values are located:
nan = df.isnull()
Apply the rolling window:
df = df.rolling(window=7, min_periods=1, axis='columns').sum()
Then only keep the values at positions that were not NaN in the original:
df[~nan]
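Putting the workaround together as a runnable sketch (note: the axis argument of rolling was deprecated in later pandas versions, so this version rolls over the transposed frame instead; the result matches the desired output above):

import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])

nan = df.isnull()                # remember where the original NaNs are

# rolling sum across the columns via a double transpose
rolled = df.T.rolling(window=7, min_periods=1).sum().T

print(rolled.mask(nan))          # re-insert NaN at the original positions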

Prevent pandas.interpolate() from extrapolating

I'm having difficulty preventing pd.DataFrame.interpolate(method='index') from extrapolating.
Specifically:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({1: range(1, 5), 2: range(2, 6), 3: range(3, 7)}, index=[1, 2, 3, 4])
>>> df = df.reindex(range(6)).reindex(range(5), axis=1)
>>> df.iloc[3, 2] = np.nan
>>> df
    0    1    2    3   4
0 NaN  NaN  NaN  NaN NaN
1 NaN  1.0  2.0  3.0 NaN
2 NaN  2.0  3.0  4.0 NaN
3 NaN  3.0  NaN  5.0 NaN
4 NaN  4.0  5.0  6.0 NaN
5 NaN  NaN  NaN  NaN NaN
So df is just a block of data surrounded by NaN, with an interior missing point at iloc[3, 2]. Now when I apply .interpolate() (along either the horizontal or vertical axis), my goal is to have ONLY that interior point filled, leaving the surrounding NaNs untouched. But somehow I'm not able to get it to work.
I tried:
>>> df.interpolate(method='index', axis=0, limit_area='inside')
    0    1    2    3   4
0 NaN  NaN  NaN  NaN NaN
1 NaN  1.0  2.0  3.0 NaN
2 NaN  2.0  3.0  4.0 NaN
3 NaN  3.0  4.0  5.0 NaN
4 NaN  4.0  5.0  6.0 NaN
5 NaN  4.0  5.0  6.0 NaN
Note the last row got filled, which is undesirable. (By the way, I'd expect the fill value to be a linear extrapolation based on the index, but it just pads the last value, which is even less desirable.)
I also tried combinations of limit and limit_direction, to no avail.
What would be the correct argument setting to get the desired result? Hopefully without some contorted masking (but that would work too). Thanks.
OK, it turns out I was running this on pandas 0.21, where the limit_area argument was silently ignored (the argument was added in pandas 0.23). On a recent version the call above works as intended. Case closed.
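For completeness, a quick sketch verifying the behavior on a pandas version where limit_area is supported (only the interior hole at iloc[3, 2] gets filled; the NaN border stays untouched):

import numpy as np
import pandas as pd

df = pd.DataFrame({1: range(1, 5), 2: range(2, 6), 3: range(3, 7)}, index=[1, 2, 3, 4])
df = df.reindex(range(6)).reindex(range(5), axis=1)
df.iloc[3, 2] = np.nan

# limit_area='inside' only fills NaNs that are surrounded by
# valid values along the interpolation axis
print(df.interpolate(method='index', axis=0, limit_area='inside'))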