How to use apply for multiple Pandas dataset columns? - pandas

I am trying to fill the NaN values in some columns, selected from a previous list. The code always takes the else path and never makes the correct modifications...
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': [0.0, np.nan, np.nan, 100],
                    'C': [20, 0.0002, 10000, np.nan],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
num_cols = ['B', 'C']
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df2.apply(fill_mean, axis=1)

You can do this much more simply with
df1.fillna(df1.mean())
This will fill the NaNs in the numeric columns with the column mean:
A B C D
0 A0 0.0 20.000000 D0
1 A1 50.0 0.000200 D1
2 A2 50.0 10000.000000 D2
3 A3 100.0 3340.000067 D3
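If you prefer to keep the apply-based approach from the question, the problem is the axis: with axis=1, apply passes each row, so col.name is the row label and never matches num_cols, which is why the else path always runs. With the default axis=0 each column is passed instead. A minimal sketch, assuming the df1 and num_cols defined above:
# apply column-wise (the default, axis=0), so each `col` is a column Series
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df2 = df1.apply(fill_mean)  # apply to df1; df2 holds the filled result
print(df2)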

I am not sure if your desired output is just the mean of all columns (a single row). If that is the case, maybe the solution below could help.
df = df1.select_dtypes(include='float').mean().to_frame().T
df = pd.concat([df, df.reindex(columns = df1.select_dtypes(exclude='float').columns)], axis=1, sort=False)
print(df)
B C A D
0 50.0 3340.000067 NaN NaN
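A more compact sketch of the same idea (assuming the df1 from above), which also keeps the original column order:
# single-row frame of the numeric means; non-numeric columns come back as NaN
df = df1.mean(numeric_only=True).to_frame().T.reindex(columns=df1.columns)
print(df)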

Related

How to group dataframe rows on unique elements in a specific column?

As an example, how do I convert df to df1, by gathering rows into matrices based on shared values in a specific column tidx?
>>> df = pd.DataFrame({'col3':[[1,40],[2,50],[3,60],[4,70]], 'tidx':[21,22,23,21]})
>>> df['col3'] = df['col3'].apply(np.array)
>>> df
col3 tidx
0 [1, 40] 21
1 [2, 50] 22
2 [3, 60] 23
3 [4, 70] 21
>>> df1 = pd.DataFrame({'col3':[[[1,40],[4,70]],[[2,50]],[[3,60]]], 'tidx':[21,22,23]})
>>> df1['col3'] = df1['col3'].apply(np.array)
>>> df1
col3 tidx
0 [[1, 40], [4, 70]] 21
1 [[2, 50]] 22
2 [[3, 60]] 23
You can use .groupby and then apply the list function, as shown in the example below.
df = pd.DataFrame({'col3':[[1,40],[2,50],[3,60],[4,70]], 'tidx':[21,22,23,21]})
df1 = df.groupby('tidx')['col3'].apply(list).reset_index()
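This gives lists of lists per tidx. To match the expected df1 exactly, where col3 holds NumPy arrays, you can reuse the same conversion shown in the question:
import numpy as np
# turn each grouped list of rows into a 2D NumPy array, as in the expected df1
df1['col3'] = df1['col3'].apply(np.array)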

How to assign a new column after groupby in pandas

I want to groupby my data and create a new column assignment.
Given the following data frame
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6]})
df['col3']=df[['col1','col2']].groupby('col1').rolling(2).mean().reset_index()
Expected output = pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6], 'col3': [np.nan, 1.5, 2.5, np.nan, 4.5, 5.5]})
However, this does not work. Is there a straightforward way to do it?
A combination of groupby, apply and assign:
df.groupby('col1', as_index = False).apply(lambda g: g.assign(col3 = g['col2'].rolling(2).mean())).reset_index(drop = True)
output:
col1 col2 col3
0 x1 1 NaN
1 x1 2 1.5
2 x1 3 2.5
3 x2 4 NaN
4 x2 5 4.5
5 x2 6 5.5
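As a sketch of an alternative that avoids apply (assuming the same df), the rolling mean can be computed per group and re-aligned on the original index before assignment:
# rolling mean within each col1 group; dropping the group level makes the
# result's index line up with df again, so it can be assigned directly
df['col3'] = (df.groupby('col1')['col2']
                .rolling(2).mean()
                .reset_index(level=0, drop=True))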

Group Pandas rows by ID and forward fill them to the right retaining NaN when it appears on all the rows with the same ID

I have a Pandas DataFrame that I need to:
group by the ID column (not in index)
forward fill rows to the right with the previous value (multiple columns) only if it's not a NaN (np.nan)
For each ID categorical value and each metric column (see the aX columns in the examples below) there is only one value (when there are multiple rows, the others are NaN - np.nan).
Take this as an example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: my_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
...: {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
...: ])
In [4]: my_df.head(len(my_df))
Out[4]:
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 NaN NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
I have many more columns like a1 to a4.
I would like to:
treat np.nan as zero (0.0) when, in the same column but a different row with the same ID, there is a number, so that I can sum them together with groupby and subsequent aggregation functions
forward fill to the right on the same unique row (by ID), but only if somewhere in a previous column to the left there was a number
So basically in the example this means that:
for ID 1 "a2"=100.0
for ID 2 "a1" and "a2" are both np.nan
See here:
In [5]: wanted_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": 100.0, "a3": 80.0, "a4": 90.0},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": 30.0},
...: ])
In [6]: wanted_df.head(len(wanted_df))
Out[6]:
id a1 a2 a3 a4
0 1 100.0 100.0 80.0 90.0
1 20 NaN NaN 100.0 30.0
In [7]:
The forward filling to the right should apply to multiple columns on the same row, not only to the closest column to the right.
When I use my_df.interpolate(method='pad', axis=1,limit=None,limit_direction='forward',limit_area=None,downcast=None,) then I still get multiple rows for the same ID.
When I use my_df.groupby("id").sum() then I see 0.0 everywhere rather than retaining the NaN values in those scenarios defined above.
When I use my_df.groupby("id").apply(np.sum) the ID columns is summed as well, so this is wrong as it should be retained.
How do I do this?
One idea is to use min_count=1 with sum:
df = my_df.groupby("id").sum(min_count=1)
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
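To also forward-fill to the right within each row, as in wanted_df from the question, a possible sketch chains ffill along the columns on top of this result (a1 and a2 for id 20 stay NaN because there is nothing to their left):
df = my_df.groupby("id").sum(min_count=1).ffill(axis=1).reset_index()
print (df)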
Or, if you need the first non-missing value, it is possible to use GroupBy.first:
df = my_df.groupby("id").first()
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
It is more problematic if there are multiple non-missing values per group and you need all of them:
#added 20 to a1
my_df = pd.DataFrame([
    {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
    {"id": 1, "a1": 20, "a2": np.nan, "a3": 80.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
])
print (my_df)
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 20.0 NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
def f(x):
    # for each column, drop the NaNs and re-number the remaining values from 0
    return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))
df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0
The first and second solutions now work differently:
df2 = my_df.groupby("id").sum(min_count=1)
print (df2)
a1 a2 a3 a4
id
1 120.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
df3 = my_df.groupby("id").first()
print (df3)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
If the values all have the same type (here numbers), it is also possible to use:
# https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
f = lambda x: pd.DataFrame(justify(x.to_numpy(),
                                   invalid_val=np.nan,
                                   axis=0,
                                   side='up'),
                           columns=my_df.columns.drop('id')).dropna(how='all')
df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0

pd.df find rows pairwise using groupby and change bogus values

My pd.DataFrame looks like this example but has about 10 million rows, hence I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.001 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
here is what I am looking to achieve:
1) find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp','opt_expiry','strike'])
2) for each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign that to both rows as the new, correct sigma. Pseudo code:
for group in groups:
    # if the sum of absolute deltas != 1
    if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
        correct_sigma = group.sigma.max()
        group.sigma = correct_sigma
Expected output:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.170 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
Revised answer. I believe there could be a shorter answer out there; maybe put it up as a bounty.
Data
df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row
df['absdelta']=df['delta'].abs()
Absolute delta sum and maximum sigma for each group in a new dataframe df2
df2=df.groupby(['timestamp','opt_expiry','strike']).agg({'absdelta':'sum','sigma':'max'})#.reset_index().drop(columns=['timestamp','opt_expiry'])
df2
Merge df2 with df
df3 = df.merge(df2, left_on='strike', right_on='strike',
               suffixes=('', '_right'))
df3
Mask the groups whose absolute delta sum is not equal to 1
m=df3['absdelta_right']!=1
m
Using the mask, assign the maximum sigma to the rows in the groups masked above
df3.loc[m,'sigma']=df3.loc[m,'sigma_right']
Slice to get back to the original dataframe's columns
df3.iloc[:,:-4]
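The answer above notes that a shorter solution may exist; as a sketch under that assumption (using the df from the question, with timestamp in the index), groupby with transform can do the same correction without the merge:
import numpy as np
# group by the timestamp index level plus the opt_expiry and strike columns
grp = df.groupby(['timestamp', 'opt_expiry', 'strike'])
# rows belonging to groups whose absolute deltas do not sum to 1
bad = grp['delta'].transform(lambda d: d.abs().sum()) != 1
# overwrite sigma with the group maximum only for those rows
df['sigma'] = np.where(bad, grp['sigma'].transform('max'), df['sigma'])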

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical one, with the index set to time data.
df = pd.DataFrame({
    'date': [
        '2013-03-01 ', '2013-03-02 ',
        '2013-03-01 ', '2013-03-02',
        '2013-03-01 ', '2013-03-02 '
    ],
    'Kind': [
        'A', 'B', 'A', 'B', 'B', 'B'
    ],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
which also has the time as index. Here, I note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3
Use DataFrame.pivot_table and flatten the MultiIndex columns with a list comprehension:
df = pd.DataFrame({
    'date': [
        '2013-03-01 ', '2013-03-02 ',
        '2013-03-01 ', '2013-03-02',
        '2013-03-01 ', '2013-03-02 '
    ],
    'Kind': [
        'A', 'B', 'A', 'B', 'B', 'B'
    ],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index can be omitted here
# df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count','max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
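The expected output in the question shows 0 rather than NaN for days where a Kind is absent; if that is needed, one option is to fill the gaps afterwards, for example:
# replace the NaN count/max entries for missing (date, Kind) combinations with 0
df = df.fillna(0)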
Another solution uses Grouper to resample by day:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='d'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]