dataframe groupby nth with same behaviour as first and last - pandas

In a DataFrame, when performing groupby('col').first() we get the first non-NaN value in each column (same for last()).
I am trying to get the second non-NaN value and I cannot find how. The only relevant function that I found is groupby('col').nth(1), but it just gives me the second row, NaNs included if they exist. groupby('col').nth(1, dropna='any') doesn't do the job either, since it skips rows with NaNs rather than checking each column separately.
example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 1],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [np.nan, np.nan, 3, 4, 5]
}, columns=['A', 'B', 'C'])
first() behaviour:
df.groupby('A').first().reset_index()
results with:
A B C
0 1 2.0 3.0
on the other hand:
df.groupby('A').nth(0, dropna='any').reset_index()
gives:
A B C
0 1 3.0 3.0
Is there a way to get the same behaviour of first/last in the nth function so I can apply it also for second or any nth item?

You can use the generic aggregate method to filter each series with notna and then pick the index you want, for example:
df.groupby('A').aggregate(lambda x: x.array[pd.notna(x)][0])
Produces:
B C
A
1 2.0 3.0
Changing the index to 1 to get the second notna value gives:
B C
A
1 3.0 4.0
Of course that lambda is a bit naive because it will raise an IndexError if the array isn't long enough. A function like this should work:
def nth_notna(n):
    def inner(series):
        a = series.array[pd.notna(series)]
        if len(a) - 1 < n:
            return np.nan
        return a[n]
    return inner
Then df.groupby('A').aggregate(nth_notna(3)) will produce:
B C
A
1 5.0 NaN
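As a quick usage check of that helper (a minimal sketch; n=1 asks for the second non-NaN value per column, matching the output shown earlier):
df.groupby('A').aggregate(nth_notna(1))
#      B    C
# A
# 1  3.0  4.0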

Related

How to remove all types of NaN from the dataframe?

I have a data frame, which is shown below. I want to merge the column values into one column, excluding NaN values.
Image 1:
When I am using the code
df3["Generation"] = df3[df3.columns[5:]].apply(lambda x: ','.join(x.dropna()), axis=1)
I am getting results like this.
Image 2:
I suspect that these columns are of type string; thus, they are not affected by x.dropna().
One example that I made is this, which gives similar results to yours.
df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2], 'b': [1, 1, np.nan, None]}).astype(str)
df.apply(lambda x: ','.join(x.dropna()), axis=1)
0 nan,1.0
1 nan,1.0
2 1.0,nan
3 2.0,nan
dtype: object
-----------------
# using simple string comparing solves the problem
df.apply(lambda x: ','.join(x[x!='nan']), axis=1)
0 1.0
1 1.0
2 1.0
3 2.0
dtype: object
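If the columns can be kept numeric in the first place, dropna works as intended and no string comparison is needed. A minimal sketch with the same data as above, left un-cast:
import numpy as np
import pandas as pd

# same data, but not converted to str, so NaN stays a real missing value
df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2], 'b': [1, 1, np.nan, None]})

# dropna removes the missing values before the join
df.apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
# 0    1.0
# 1    1.0
# 2    1.0
# 3    2.0
# dtype: object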

Conditional imputation with average of non-missing columns with pandas toolbox

This question focuses on pandas' own functions. There are existing solutions (pandas DataFrame: replace nan values with average of columns), but they rely on hand-written functions.
In SPSS there is a function MEAN.n which gives you the mean of a list of numbers only when at least n elements of that list are valid (not pandas.NA). With that function you are able to impute missing values only if a minimum number of items are valid.
Is there a pandas function to do this?
Example
Values [1, 2, 3, 4, NA].
Mean of the valid values is 2.5.
The resulting list should be [1, 2, 3, 4, 2.5].
Assume the rule that in a 5-item list at least 3 must have valid values for imputation; otherwise the result stays NA.
Values [1, 2, NA, NA, NA].
Mean of the valid values is 1.5 but it does not matter.
The resulting list should not be changed [1, 2, NA, NA, NA] because imputation is not allowed.
Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) to fillna with the mean only if a minimum number of items are not NA:
import pandas as pd
from pandas import NA

s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])

def fillna_mean(s, N=4):
    return s if s.notna().sum() < N else s.fillna(s.mean())
fillna_mean(s1)
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# 4 2.5
# dtype: float64
fillna_mean(s2)
# 0 1
# 1 2
# 2 <NA>
# 3 <NA>
# 4 <NA>
# dtype: object
fillna_mean(s2, N=2)
# 0 1.0
# 1 2.0
# 2 1.5
# 3 1.5
# 4 1.5
# dtype: float64
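To run the same rule over every row of a DataFrame (a hypothetical extension of the wrapper above, not part of the original answer), apply can hand each row to it as a Series:
df = pd.DataFrame([[1, 2, 3, 4, NA],
                   [1, 2, NA, NA, NA]])

# row 0 has 4 valid values -> its NA becomes 2.5; row 1 has only 2 -> left untouched
df.apply(lambda row: fillna_mean(row, N=3), axis=1)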
Let's try a list comprehension, though it will be messy.
Option 1
You can use pd.Series and numpy:
s = [x if np.isnan(lst).sum() >= 3 else pd.Series(lst).mean(skipna=True) if x is np.nan else x for x in lst]
Option 2
Use numpy all the way through:
s = [x if np.isnan(lst).sum() >= 3 else np.mean([x for x in lst if str(x) != 'nan']) if x is np.nan else x for x in lst]
Case 1
lst=[1, 2, 3, 4, np.nan]
outcome
[1, 2, 3, 4, 2.5]
Case 2
lst=[1, 2, np.nan, np.nan, np.nan]
outcome
[1, 2, nan, nan, nan]
If you wanted it as a pd.Series, simply:
pd.Series(s, name='lst')
How it works
s = [x if np.isnan(lst).sum() >= 3                                # keep x as-is if the number of NaNs in the list is >= 3
     else pd.Series(lst).mean(skipna=True) if x is np.nan else x  # otherwise replace each NaN with the mean of the non-NaN elements
     for x in lst]                                                # for every element in lst

Select rows where number can be found in list

Given the following data
I want to select the rows where num appears in list. In this case, it will select row 1 and row 2; row 3 is not selected since 3 can't be found in [4, 5].
The following is the dataframe; how should we write the filter query?
cat1 = pd.DataFrame({"num": [1, 2, 3],
                     "list": [[1, 2, 3], [3, 2], [4, 5]]})
One possible solution with list comprehension, zip and in passed to boolean indexing:
df = cat1[[a in b for a, b in zip(cat1.num, cat1.list)]]
Or a solution with DataFrame.apply with axis=1 for processing row by row:
df = cat1[cat1.apply(lambda x: x.num in x.list, axis=1)]
Or create DataFrame and test membership:
df = cat1[pd.DataFrame(cat1.list.tolist()).isin(cat1.num).any(axis=1)]
print (df)
num list
0 1 [1, 2, 3]
1 2 [3, 2]
A different solution, if you are using pandas 0.25+, is using explode() (here the list column is assumed to be named list1):
cat1[cat1['num'].isin(cat1.explode('list1').query("num==list1").loc[:,'num'])]
num list1
0 1 [1, 2, 3]
1 2 [3, 2]
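The same explode idea can also be written without query, keeping the original column name. A sketch relying on the fact that explode preserves the original index, so matching rows can be selected back by index:
import pandas as pd

cat1 = pd.DataFrame({"num": [1, 2, 3],
                     "list": [[1, 2, 3], [3, 2], [4, 5]]})

exploded = cat1.explode("list")                          # one row per list element, original index kept
matches = exploded[exploded["num"] == exploded["list"]]
print(cat1[cat1.index.isin(matches.index)])
#    num       list
# 0    1  [1, 2, 3]
# 1    2     [3, 2]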

Access Row Based on Column Value

I have the following pandas dataframe:
data = {'ID': [1, 2, 3], 'Neighbor': [3, 1, 2], 'x': [5, 6, 7]}
df = pd.DataFrame(data)
Now I want to create a new column 'y', which for each row holds the value of x from the row referenced by the Neighbor column (i.e. the row whose ID equals the value of Neighbor). E.g. for row 0 (ID 1), Neighbor is 3, so 'y' should be 7.
So the resulting dataframe should have the column y = [7, 5, 6].
Can I solve this without using df.apply? (As this is rather time-consuming for my big dataframes.)
I would like to use something like
df.loc[:, 'y'] = df.loc[df.Neighbor.eq(df.ID), 'x']
but this returns NaN.
We can build a dict from your ID and x columns, then map it into your new column.
your_dict_ = dict(zip(df['ID'],df['x']))
print(your_dict_)
{1: 5, 2: 6, 3: 7}
Then we can use .map to fill your new column, using the Neighbor column as the key.
df['Y'] = df['Neighbor'].map(your_dict_)
print(df)
ID Neighbor x Y
0 1 3 5 7
1 2 1 6 5
2 3 2 7 6
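An equivalent one-liner (a sketch, not from the answer above) skips the explicit dict, since map also accepts a Series keyed by its index:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3], 'Neighbor': [3, 1, 2], 'x': [5, 6, 7]})

# look up each Neighbor value in a Series of x indexed by ID
df['y'] = df['Neighbor'].map(df.set_index('ID')['x'])
print(df)
#    ID  Neighbor  x  y
# 0   1         3  5  7
# 1   2         1  6  5
# 2   3         2  7  6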

How to use pandas rename() on multi-index columns?

How can one simply rename a MultiIndex column of a pandas DataFrame, using the rename() function?
Let's look at an example and create such a DataFrame:
import pandas
df = pandas.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B":["min","max"],"C":"mean"})
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
I am able to select a given MultiIndex column by using a tuple for its name:
print(df[("B","min")])
A
1 0
2 3
Name: (B, min), dtype: int64
However, when using the same tuple naming with the rename() function, it does not seem to be accepted:
df.rename(columns={("B","min"):"renamed"},inplace=True)
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
Any idea how rename() should be called to deal with MultiIndex columns?
PS: I am aware of the other options to flatten the column names beforehand, but this prevents one-liners, so I am looking for a cleaner solution (see my previous question).
This doesn't answer the question as worded, but it will work for your given example (assuming you want them all renamed with no MultiIndex):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg(
renamed=('B', 'min'),
B_max=('B', 'max'),
C_mean=('C', 'mean'),
)
print(df)
renamed B_max C_mean
A
1 0 2 1.0
2 3 4 3.5
For more info, you can see the pandas docs on named aggregation and some other related questions.
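If you want to keep the MultiIndex and only relabel the inner label, rename also accepts a level argument. A sketch of that alternative (it renames the label 'min' wherever it appears in the second level, which is fine here because only B has a min column):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B": ["min", "max"], "C": "mean"})

# rename the level-1 label 'min' while keeping the MultiIndex columns
df = df.rename(columns={"min": "renamed"}, level=1)
print(df.columns.tolist())
# [('B', 'renamed'), ('B', 'max'), ('C', 'mean')]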