Summing pandas columns containing NaNs (missing values) [duplicate] - pandas

I'm following up on questions that were asked a few years ago: here and here.
I would like to sum two columns from a pandas DataFrame where both columns contain missing values.
I have scoured the internet, but couldn't find the precise output I'm looking for.
I have a df as follows, and I want to sum col1 and col2
col1  col2
1     NaN
NaN   1
1     1
NaN   NaN
The output I want:
col1  col2  col_sum
1     NaN   1
NaN   1     1
1     1     2
NaN   NaN   NaN
What I don't want:
Simply using df['col_sum'] = df['col1'] + df['col2'] gives me:
col1  col2  col_sum
1     NaN   NaN
NaN   1     NaN
1     1     2
NaN   NaN   NaN
Using the sum() function as suggested in the (linked) threads above gives me:
col1  col2  col_sum
1     NaN   1
NaN   1     1
1     1     2
NaN   NaN   0
Hence, I would like the sum of a number and a missing value to output that number, and the sum of two missing values to output a missing value.
Treating NaNs as 0 is a problem for me, because later on, taking the mean() of col_sum with a 0 instead of a NaN will give a totally different result (or won't it?).

Use Series.add with the fill_value parameter:
df['col_sum'] = df['col1'].add(df['col2'], fill_value=0)
Or use sum with the min_count=1 parameter:
df['col_sum'] = df.sum(min_count=1, axis=1)
print (df)
   col1  col2  col_sum
0   1.0   NaN      1.0
1   NaN   1.0      1.0
2   1.0   1.0      2.0
3   NaN   NaN      NaN
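For reference, here is a minimal self-contained sketch of both options (assuming only the two columns shown above; the construction of df is illustrative), along with the mean() comparison the question worries about:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, np.nan, 1, np.nan],
                   'col2': [np.nan, 1, 1, np.nan]})

# NaN + number -> number, NaN + NaN -> NaN
df['col_sum'] = df[['col1', 'col2']].sum(axis=1, min_count=1)
# equivalent: df['col_sum'] = df['col1'].add(df['col2'], fill_value=0)

print(df['col_sum'].mean())            # skips the NaN row: (1 + 1 + 2) / 3 = 1.33...
print(df['col_sum'].fillna(0).mean())  # treats it as 0:    (1 + 1 + 2 + 0) / 4 = 1.0

So yes, keeping the NaN rather than writing a 0 does change a later mean().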

Related

How to accumulate the non-null values of each row into a separate column/series?

I have a dataframe with a subset of columns with the form:
col_A col_B col_C
0 NaN 1.0 NaN
1 NaN NaN NaN
2 NaN NaN 2.0
3 3.0 NaN 4.0
I want to create a series of the non-null values for each row. For rows with multiple non-null values, I want either the average or the first value, e.g.
new_col
0 1.0
1 NaN
2 2.0
3 3.5
I'll eventually want to add this back to the original dataframe as a separate column so I need to persist rows with all NaN values by forward filling them e.g.
new_col
0 1.0
1 1.0
2 2.0
3 3.5
I know how to determine whether each row has a non-null value, but I don't know how to select it:
df[['col_A', 'col_B', 'col_C']].count(axis=1) >= 1
You can use:
df.mean(axis=1).ffill()
Or to restrict the columns:
df[['col_A', 'col_B', 'col_C']].mean(axis=1).ffill()
Output:
0 1.0
1 1.0
2 2.0
3 3.5
dtype: float64
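A small end-to-end sketch (assuming the sample frame above) that computes the row-wise mean, forward fills the all-NaN row, and attaches the result back as new_col:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_A': [np.nan, np.nan, np.nan, 3.0],
                   'col_B': [1.0, np.nan, np.nan, np.nan],
                   'col_C': [np.nan, np.nan, 2.0, 4.0]})

# the row-wise mean ignores NaN; an all-NaN row yields NaN, which ffill then carries forward
df['new_col'] = df[['col_A', 'col_B', 'col_C']].mean(axis=1).ffill()
print(df)
#    col_A  col_B  col_C  new_col
# 0    NaN    1.0    NaN      1.0
# 1    NaN    NaN    NaN      1.0
# 2    NaN    NaN    2.0      2.0
# 3    3.0    NaN    4.0      3.5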

Pandas Long to Wide conversion

I am new to Pandas. I have a data set in this format.
UserID ISBN BookRatings
0 276725.0 034545104U 0.0
1 276726.0 155061224 5.0
2 276727.0 446520802 0.0
3 276729.0 052165615W 3.0
4 276729.0 521795028 6.0
I would like to create this
ISBN       276725      276726     276727     276729
UserID
0      034545104U          0          0          0
1               0  155061224          0          0
2               0          0  446520802          0
3               0          0          0 052165615W
4               0          0          0  521795028
I tried pivot but was not successful. Any kind advice please?
I think that pivot() is the right approach here. The most difficult part is getting the arguments right. We need to keep the original index, the new columns should be the values from column UserID, and we want to fill the new dataframe with the values from column ISBN.
For this, I first extract the original index as a column and then apply the pivot() function:
df = df.reset_index()
result = df.pivot(index='index', columns='UserID', values='ISBN')
# Convert the float column labels to integers (only works if all user ids are numbers; drop NaN values first)
result.columns = [int(c) for c in result.columns]
Output:
276725 276726 276727 276729
index
0 034545104U NaN NaN NaN
1 NaN 155061224 NaN NaN
2 NaN NaN 446520802 NaN
3 NaN NaN NaN 052165615W
4 NaN NaN NaN 521795028
Edit: If you want the same appearance as in the original dataframe, you have to apply the following line as well:
result = result.rename_axis(None, axis=0)
Output:
276725 276726 276727 276729
0 034545104U NaN NaN NaN
1 NaN 155061224 NaN NaN
2 NaN NaN 446520802 NaN
3 NaN NaN NaN 052165615W
4 NaN NaN NaN 521795028
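Putting it together, a minimal runnable sketch assuming the five-row frame from the question; the trailing fillna(0) is optional and only mimics the zeros shown in the desired output:

import pandas as pd

df = pd.DataFrame({'UserID': [276725.0, 276726.0, 276727.0, 276729.0, 276729.0],
                   'ISBN': ['034545104U', '155061224', '446520802',
                            '052165615W', '521795028'],
                   'BookRatings': [0.0, 5.0, 0.0, 3.0, 6.0]})

# keep the original row numbers as the index and spread UserID across the columns
result = (df.reset_index()
            .pivot(index='index', columns='UserID', values='ISBN')
            .rename_axis(None, axis=0))
result.columns = [int(c) for c in result.columns]   # 276725.0 -> 276725

# optional: replace the NaN placeholders with 0 to match the layout in the question
print(result.fillna(0))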

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck on this specific problem: I have 2 DataFrames in Pandas, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 NaN 0.05
1 NaN 0.05
2 0.16 NaN
3 0.16 NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 NaN 9
1 NaN 6
2 3 NaN
3 4 NaN
I am talking about DataFrames with 10,000 rows each, so I can't do this manually. Also, the indices and columns are exactly the same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking with DataFrame.notna:
# df2 = df2.astype(float)  # This is needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
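A self-contained sketch of the masking approach, assuming the two frames from the question; df1.where(m) is an equivalent spelling:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [9, 6, 11, 8]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, 0.16, 0.16],
                    'B': [0.05, 0.05, np.nan, np.nan]})

m = df2.notna()        # True where df2 has a value, False where it is NaN
df3 = df1[m]           # equivalent: df1.where(m)
print(df3)
#      A     B
# 0  NaN   9.0
# 1  NaN   6.0
# 2  3.0   NaN
# 3  4.0   NaN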

How to select the rows having the same ID where all values in another column are missing

I have the following dataframe:
ID col_1
1 NaN
2 NaN
3 4.0
2 NaN
2 NaN
3 NaN
3 3.0
1 NaN
I need the following output:
ID col_1
1 NaN
1 NaN
2 NaN
2 NaN
2 NaN
How can I do this in pandas?
You can create a boolean mask with isna, group this mask by ID and transform with all, and then use this mask to filter the rows:
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
df[mask].sort_values('ID')
Alternatively, you can use groupby + filter to keep only the groups in which all values in col_1 are NaN, but this method should be slower than the one above:
df.groupby('ID').filter(lambda g: g['col_1'].isna().all()).sort_values('ID')
ID col_1
0 1 NaN
7 1 NaN
1 2 NaN
3 2 NaN
4 2 NaN
Let us try isin after groupby with all:
s = df['col_1'].isna().groupby(df['ID']).all()
df = df.loc[df.ID.isin(s[s].index.tolist())]
df
Out[73]:
ID col_1
0 1 NaN
1 2 NaN
3 2 NaN
4 2 NaN
7 1 NaN
import pandas as pd
import numpy as np
df = pd.read_excel(r"D:\Stack_overflow\test12.xlsx")
df1 = df[df['col_1'].isnull()].sort_values(by=['ID'])
I think we can simply take out the null values.
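For completeness, a runnable sketch of the transform-based approach from the first answer, assuming the sample data from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 2, 2, 3, 3, 1],
                   'col_1': [np.nan, np.nan, 4.0, np.nan,
                             np.nan, np.nan, 3.0, np.nan]})

# keep only the IDs for which every col_1 value is missing
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
print(df[mask].sort_values('ID'))
#    ID  col_1
# 0   1    NaN
# 7   1    NaN
# 1   2    NaN
# 3   2    NaN
# 4   2    NaN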

DataFrame Column into multiple columns

How can I split a dataframe column whose cells contain lists of strings like
[{'1','1','1','1'},{'1','1','1','1'},{'1','1','1','1'},{'1','1','1','1'}]
into multiple dataframe columns?
Note that the lists in the cells of the column are not all the same length!
In the image above, the left side shows the original column and the right side shows the result I want to produce.
As #Oliver Prislan comments, that is an unusual structure; did you mean something else? If your data is structured like that, then here is a way you can get it into the new format:
# assumes that your original dataframe is called `df` and the column is `Column0`
# creates a new dataframe called new_df
# removes the unwanted {} and [] and ''
# then expands into columns after splitting each string on the comma
new_df = (df['Column0']
          .str.replace(r"[{}\[\]']", '', regex=True)
          .str.split(',', expand=True))
# renames the columns as you wanted them
new_df.rename(columns='col{}'.format, inplace=True)
If your values are always numeric, you may want to convert the dataframe columns to numeric dtypes:
for col in new_df.columns:
    new_df[col] = pd.to_numeric(new_df[col])
Final result:
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 col15
0 1 1 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 1 1 1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2 1 1 1 1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
3 1 1 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1 1 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 1 1 1 1 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
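A self-contained usage sketch of the same clean-and-split idea, assuming a hypothetical two-row column named Column0 whose cells hold groups of different lengths:

import pandas as pd

df = pd.DataFrame({'Column0': [
    "[{'1','1','1','1'},{'1','1','1','1'}]",   # 8 values
    "[{'1','1','1'}]",                         # 3 values
]})

# strip the braces, brackets and quotes, then split on commas into separate columns
new_df = (df['Column0']
          .str.replace(r"[{}\[\]']", '', regex=True)
          .str.split(',', expand=True))
new_df = new_df.rename(columns='col{}'.format).apply(pd.to_numeric)
print(new_df)
#    col0  col1  col2  col3  col4  col5  col6  col7
# 0     1     1     1   1.0   1.0   1.0   1.0   1.0
# 1     1     1     1   NaN   NaN   NaN   NaN   NaN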