How to accumulate non-null values of each row into a separate column/series? - pandas

I have a dataframe with a subset of columns of the form:
   col_A  col_B  col_C
0    NaN    1.0    NaN
1    NaN    NaN    NaN
2    NaN    NaN    2.0
3    3.0    NaN    4.0
I want to create a series of the non-null values for each row. For rows with multiple non-null values, I want either the average or the first value, e.g.
   new_col
0      1.0
1      NaN
2      2.0
3      3.5
I'll eventually want to add this back to the original dataframe as a separate column, so I need to persist rows with all-NaN values by forward filling them, e.g.
   new_col
0      1.0
1      1.0
2      2.0
3      3.5
I know how to determine whether a row has a non-null value, but I don't know how to select with it:
df[['col_A', 'col_B', 'col_C']].count(axis=1) >= 1

You can use:
df.mean(axis=1).ffill()
Or to restrict the columns:
df[['col_A', 'col_B', 'col_C']].mean(axis=1).ffill()
Output:
0    1.0
1    1.0
2    2.0
3    3.5
dtype: float64
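If you instead want the first non-null value per row rather than the average, a minimal sketch (assuming the same three columns; back-filling across the row is one option, not the only one) is:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col_A': [np.nan, np.nan, np.nan, 3.0],
                   'col_B': [1.0, np.nan, np.nan, np.nan],
                   'col_C': [np.nan, np.nan, 2.0, 4.0]})

# Back-fill along each row so the first non-null value lands in the first
# column, take that column, then forward-fill the all-NaN rows as before.
new_col = df[['col_A', 'col_B', 'col_C']].bfill(axis=1).iloc[:, 0].ffill()
print(new_col)
# 0    1.0
# 1    1.0
# 2    2.0
# 3    3.0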

Related

Pandas Long to Wide conversion

I am new to Pandas. I have a data set in this format:
     UserID        ISBN  BookRatings
0  276725.0  034545104U          0.0
1  276726.0   155061224          5.0
2  276727.0   446520802          0.0
3  276729.0  052165615W          3.0
4  276729.0   521795028          6.0
I would like to create this:
UserID      276725     276726     276727      276729
0       034545104U          0          0           0
1                0  155061224          0           0
2                0          0  446520802           0
3                0          0          0  052165615W
4                0          0          0   521795028
I tried pivot but was not successful. Any advice, please?
I think that pivot() is the right approach here. The most difficult part is getting the arguments right. I think we need to keep the original index, and the new columns should be the values in column UserID. Also, we want to fill the new dataframe with the values from column ISBN.
For this, I first extract the original index as a column and then apply the pivot() function:
df = df.reset_index()
result = df.pivot(index='index', columns='UserID', values='ISBN')
# Convert the float column labels to integers
# (only works if all user IDs are numbers; drop NaN values first)
result.columns = [int(c) for c in result.columns]
Output:
           276725     276726     276727      276729
index
0      034545104U        NaN        NaN         NaN
1             NaN  155061224        NaN         NaN
2             NaN        NaN  446520802         NaN
3             NaN        NaN        NaN  052165615W
4             NaN        NaN        NaN   521795028
Edit: If you want the same appearance as in the original dataframe, you also have to remove the index name:
result = result.rename_axis(None, axis=0)
Output:
           276725     276726     276727      276729
0      034545104U        NaN        NaN         NaN
1             NaN  155061224        NaN         NaN
2             NaN        NaN  446520802         NaN
3             NaN        NaN        NaN  052165615W
4             NaN        NaN        NaN   521795028
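For reference, a minimal self-contained sketch of this answer (reconstructing the sample data above; the list comprehension replaces map so it also drops the leftover UserID axis name):

import pandas as pd

df = pd.DataFrame({'UserID': [276725.0, 276726.0, 276727.0, 276729.0, 276729.0],
                   'ISBN': ['034545104U', '155061224', '446520802',
                            '052165615W', '521795028'],
                   'BookRatings': [0.0, 5.0, 0.0, 3.0, 6.0]})

# Expose the original row index as a column so pivot() can keep it as the index.
result = (df.reset_index()
            .pivot(index='index', columns='UserID', values='ISBN')
            .rename_axis(None, axis=0))
# Plain integer column labels instead of floats like 276725.0
result.columns = [int(c) for c in result.columns]
print(result)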

Summing pandas columns containing Nan's (missing values) [duplicate]

I'm following up on questions that were asked a few years ago: here and here.
I would like to sum two columns from a pandas dataframe where both columns contain missing values. I have searched the internet but couldn't find the precise output I'm looking for.
I have a df as follows, and I want to sum col1 and col2:
  col1  col2
     1   NaN
   NaN     1
     1     1
   NaN   NaN
The output I want:
  col1  col2  col_sum
     1   NaN        1
   NaN     1        1
     1     1        2
   NaN   NaN      NaN
What I don't want:
Simply using df['col_sum'] = df['col1'] + df['col2'] gives me
  col1  col2  col_sum
     1   NaN      NaN
   NaN     1      NaN
     1     1        2
   NaN   NaN      NaN
Using the sum() function as suggested in the linked threads above gives me
  col1  col2  col_sum
     1   NaN        1
   NaN     1        1
     1     1        2
   NaN   NaN        0
Hence, I would like the sum of a number and a missing value to output that number, and the sum of two missing values to output a missing value.
Treating NaNs as 0 values is a problem for me, because later on, taking the mean() of col_sum with a 0 instead of a NaN will give a totally different result (or won't it?).
Use Series.add with the fill_value parameter:
df['col_sum'] = df['col1'].add(df['col2'], fill_value=0)
Or DataFrame.sum with the min_count=1 parameter:
df['col_sum'] = df.sum(min_count=1, axis=1)
print (df)
   col1  col2  col_sum
0   1.0   NaN      1.0
1   NaN   1.0      1.0
2   1.0   1.0      2.0
3   NaN   NaN      NaN
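On the asker's worry about mean(): a 0 and a NaN really do give different means, because mean() skips NaN by default. A quick sketch (reconstructing the sample frame above):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, np.nan, 1, np.nan],
                   'col2': [np.nan, 1, 1, np.nan]})
df['col_sum'] = df['col1'].add(df['col2'], fill_value=0)

# The NaN row is skipped: (1 + 1 + 2) / 3
print(df['col_sum'].mean())            # 1.333...

# A 0 would be counted: (1 + 1 + 2 + 0) / 4
print(df['col_sum'].fillna(0).mean())  # 1.0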

How to select rows having the same ID and all missing values in another column

I have the following dataframe:
   ID  col_1
0   1    NaN
1   2    NaN
2   3    4.0
3   2    NaN
4   2    NaN
5   3    NaN
6   3    3.0
7   1    NaN
I need the following output:
  ID  col_1
   1    NaN
   1    NaN
   2    NaN
   2    NaN
   2    NaN
How can I do this in pandas?
You can create a boolean mask with isna, then group this mask by ID and transform with all; then you can filter the rows with the help of this mask:
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
df[mask].sort_values('ID')
Alternatively you can use groupby + filter to keep only the groups where all values in col_1 are NaN, but this method should be slower than the mask above:
df.groupby('ID').filter(lambda g: g['col_1'].isna().all()).sort_values('ID')
   ID  col_1
0   1    NaN
7   1    NaN
1   2    NaN
3   2    NaN
4   2    NaN
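A self-contained sketch of the mask approach, reconstructing the sample data above:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 2, 3, 2, 2, 3, 3, 1],
                   'col_1': [np.nan, np.nan, 4.0, np.nan,
                             np.nan, np.nan, 3.0, np.nan]})

# True for every row whose whole ID group has only NaN in col_1;
# ID 3 has non-null values (4.0 and 3.0), so its rows are dropped.
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
print(df[mask].sort_values('ID'))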
Let us try isin after groupby with all:
s = df['col_1'].isna().groupby(df['ID']).all()
df = df.loc[df.ID.isin(s[s].index.tolist())]
df
Out[73]:
   ID  col_1
0   1    NaN
1   2    NaN
3   2    NaN
4   2    NaN
7   1    NaN
I think we can simply take out the null values:
import pandas as pd
import numpy as np

df = pd.read_excel(r"D:\Stack_overflow\test12.xlsx")
df1 = df[df['col_1'].isnull()].sort_values(by=['ID'])

Sum of NaNs to equal NaN (not zero)

I can add a TOTAL column to this DF using df['TOTAL'] = df.sum(axis=1), and it adds the row elements like this:
   col1  col2  TOTAL
0   1.0   5.0    6.0
1   2.0   6.0    8.0
2   0.0   NaN    0.0
3   NaN   NaN    0.0
However, I would like the total of the bottom row to be NaN, not zero, like this:
   col1  col2  TOTAL
0   1.0   5.0    6.0
1   2.0   6.0    8.0
2   0.0   NaN    0.0
3   NaN   NaN    NaN
Is there a way I can achieve this in a performant way?
Add parameter min_count=1 to DataFrame.sum:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
df['TOTAL'] = df.sum(axis=1, min_count=1)
print (df)
   col1  col2  TOTAL
0   1.0   5.0    6.0
1   2.0   6.0    8.0
2   0.0   NaN    0.0
3   NaN   NaN    NaN
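Note that min_count counts the non-NA values per row, which is why row 2 (0.0 + NaN) still totals 0.0: it has one valid value. A short sketch showing how a stricter threshold changes that (min_count=2 is illustrative, not part of the original answer):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1.0, 2.0, 0.0, np.nan],
                   'col2': [5.0, 6.0, np.nan, np.nan]})

# Require at least 2 non-NA values per row; row 2 now becomes NaN as well.
print(df.sum(axis=1, min_count=2))
# 0    6.0
# 1    8.0
# 2    NaN
# 3    NaN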

pandas diff within successive groups

import pandas as pd

d = pd.DataFrame({'a': [7, 6, 3, 4, 8], 'b': ['c', 'c', 'd', 'd', 'c']})
d.groupby('b')['a'].diff()
Gives me:
0    NaN
1   -1.0
2    NaN
3    1.0
4    2.0
What I'd need:
0    NaN
1   -1.0
2    NaN
3    1.0
4    NaN
That is, the difference only between successive values within a group, so when a group appears after another group, its previous values are ignored.
In my example the last c value starts a new c group.
You would need to group by consecutive segments:
In [1055]: d.groupby((d.b != d.b.shift()).cumsum())['a'].diff()
Out[1055]:
0    NaN
1   -1.0
2    NaN
3    1.0
4    NaN
Name: a, dtype: float64
Details
In [1056]: (d.b != d.b.shift()).cumsum()
Out[1056]:
0    1
1    1
2    2
3    2
4    3
Name: b, dtype: int32