Pandas DataFrame apply to multiple columns - pandas

I am trying to use the apply function on my DataFrame.
The apply calls a custom function that returns 2 values, which need to populate 2 columns of my DataFrame row by row.
A simple example is below:
df = pd.DataFrame({'a': [10]})
I wish to create two columns: b and c.
b equals 1 if a is above 0.
c equals 1 if a is above 0.
def compute_b_c(a):
    if a > 0:
        return 1, 1
    else:
        return 0, 0
I tried this, but it raises a KeyError:
df[['b', 'c']] = df.a.apply(compute_b_c)

It is possible with the DataFrame constructor; also, 1, 1 and 0, 0 are already tuples, the same as (1, 1) and (0, 0):
df = pd.DataFrame ({'a' : [10, -1, 9]})
def compute_b_c(a):
    if a > 0:
        return (1, 1)
    else:
        return (0, 0)
df[['b', 'c']] = pd.DataFrame(df.a.apply(compute_b_c).tolist())
print (df)
a b c
0 10 1 1
1 -1 0 0
2 9 1 1
Performance:
# 30k rows (3 values repeated 10,000 times)
df = pd.DataFrame ({'a' : [10, -1, 9] * 10000})
In [79]: %timeit df[['b', 'c']] = pd.DataFrame(df.a.apply(compute_b_c).tolist())
22.6 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %timeit df[['b', 'c']] = df.apply(lambda row: compute_b_c(row['a']), result_type='expand', axis=1)
5.25 s ± 84.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Use the result_type parameter of pandas.DataFrame.apply. It is applicable only if you call apply on df (the DataFrame), not on df.a (a Series):
df[['b', 'c']] = df.apply(lambda row: compute_b_c(row['a']), result_type='expand', axis=1)
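As a side note (a sketch of my own, not from either answer above): because the condition here is a simple threshold, the two columns can also be filled without apply at all, for example with numpy.where, reusing the column names from the example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, -1, 9]})
# 1 where a > 0, else 0; assign the same flag to both columns
flag = np.where(df['a'] > 0, 1, 0)
df['b'] = flag
df['c'] = flag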

Related

Creating a new column with column names in dataframe based on maximum value of two columns

I have a dataframe as follows:
Col1 Val1 Val2
A 1 0
B 2 3
C 0 4
D 3 2
I need the following output:
Col1 Val1 Val2 Type
A 1 0 Val1
B 2 3 Val2
C 0 4 Val2
D 3 2 Val1
The column Type basically indicates which of Val1 and Val2 holds the maximum.
I am not sure how to approach this.
df['Type'] = (df['Val1'] >= df['Val2']).map({True: 'Val1', False: 'Val2'})
In [43]: df = pd.DataFrame(np.random.randint(0, 20, (10_000, 2)), columns=['val1', 'val2'])
...: %timeit (df['val1'] >= df['val2']).map({True: 'val1', False: 'val2'})
...: %timeit df.apply(lambda x: 'val1' if x.val1 >= x.val2 else 'val2', axis=1)
...: %timeit df.loc[:, ['val1', 'val2']].idxmax(axis=1)
1.27 ms ± 45.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
123 ms ± 836 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.73 ms ± 95.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can do it with:
df['Type'] = df.apply(lambda x: 'Val1' if x.Val1 > x.Val2 else 'Val2', axis=1)
Special case: if you want to return None when Val1 == Val2
def get_max_col(x):
    if x.Val1 > x.Val2:
        return 'Val1'
    elif x.Val1 == x.Val2:
        return None
    else:
        return 'Val2'
df['Type'] = df.apply(get_max_col, axis=1)
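As an aside (my own sketch, not part of the original answer), the same special case can be handled without apply using numpy.select, assuming the Val1/Val2 columns from the example:
import numpy as np
conditions = [df['Val1'] > df['Val2'], df['Val1'] < df['Val2']]
choices = ['Val1', 'Val2']
# rows where Val1 == Val2 fall through to the default, i.e. None
df['Type'] = np.select(conditions, choices, default=None)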
Run:
df['Type'] = df.iloc[:, 1:].idxmax(axis=1)
This code works regardless of the number of columns and their names.
iloc[:, 1:] is to "filter out" column 0.
If you want just these 2 columns, alternative choices are:
df['Type'] = df.iloc[:, 1:3].idxmax(axis=1)
or
df['Type'] = df[['Val1', 'Val2']].idxmax(axis=1)

How to convert two dataframes (with same cols/rows) into a dataframe of tuples?

How do I convert two dataframes, e.g.:
df1=pd.DataFrame({'A': [1, 2,3], 'B': [10, 20,30]})
df2=pd.DataFrame({'A': [11, 22,33], 'B': [110, 220, 330]})
into
A B
0 (1, 11) (10, 110)
1 (2, 22) (20, 220)
2 (3, 33) (30, 330)
I'm trying to find a pandas function instead of using a loop. This is just a dummy example; the original dataframes have many columns.
You can use DataFrame.join:
df1.join(df2, lsuffix='1', rsuffix='2').apply(tuple, axis=1).to_frame('A')
The fastest way is probably
df = pd.DataFrame({"A": zip(df1.A, df2.A)})
Much faster and simpler than the other solutions.
def repeat_df(df, n):
    return pd.concat([df]*n, ignore_index=True)
n = 1000
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [11, 22, 32]})
df1 = repeat_df(df1, n)
df2 = repeat_df(df2, n)
>>> %timeit df1.join(df2, lsuffix='1', rsuffix='2').apply(tuple, axis=1).to_frame('A')
36.6 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pd.concat([df1, df2]).groupby(level=0).agg(tuple)
39.8 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pd.DataFrame({"A": zip(df1.A, df2.A)})
1.95 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
EDIT
OP updated the example to work with multiple columns.
The above solution can be easily generalized:
df = pd.DataFrame({col: zip(df1[col], df2[col]) for col in df1.columns})
Still much faster than the other solution, assuming the same settings:
>>> %timeit pd.concat([df1, df2]).groupby(level=0).agg(tuple)
70.3 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pd.DataFrame({col: zip(df1[col], df2[col]) for col in df1.columns})
3.41 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can concat both DataFrames and then use groupby.agg on the index. This method aligns the columns and groups rows that share the same index.
print(pd.concat([df1, df2]).groupby(level=0).agg(tuple))
A
0 (1, 11)
1 (2, 22)
2 (3, 33)
That said, in this specific case, a list comprehension may be faster:
pd.DataFrame({'A':[(a1, a2) for a1, a2 in zip(df1['A'], df2['A'])]})
You can use DataFrame.itertuples to convert the dataframe into tuples after concatenating the two dataframes.
df = pd.concat([df1, df2])
tuples = df.itertuples()

Vectorizing apply(list) and explode in pandas dataframe

I have a dataframe with dates in an integer format (a timedelta in days from some arbitrary date), and, using another column weeks, I'd like to add 7 days to the start_date column for every week beyond the first and explode each increment into another row.
So records with 1 week would remain the same, 2 weeks would get one additional row, 3 weeks would get 2 additional rows, etc.; each additional row would have start_date incremented by 7.
It's fairly trivial using pd.apply with axis=1, but I can't seem to wrap my head around a vectorized method to solve this.
import pandas as pd
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
Starting df
product start_date weeks
0 a 1000 1
1 b 1000 2
2 c 1000 3
Current approach
df['dates'] = df.apply(lambda x: [x['start_date']+i*7 for i in range(x['weeks'])], axis=1)
df = df.explode('dates').drop(columns=['start_date']).rename({'dates':'start_date'})
Output
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014
Use loc + index.repeat to scale up the DataFrame, then use groupby cumcount to add the weekly offset, then drop the original column:
# Scale up DataFrame
df = df.loc[df.index.repeat(df['weeks'])]
# Create Dates Column grouping by the index (level=0)
df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
# Drop Column
df = df.drop('start_date', axis=1)
df:
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014
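One small follow-up (my addition, not in the original answer): index.repeat leaves the duplicated index 0, 1, 1, 2, 2, 2 shown above; if a fresh index is wanted, it can be reset afterwards:
df = df.reset_index(drop=True)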
Timing Information:
import pandas as pd
sample_df = pd.DataFrame({'product': ['a', 'b', 'c'],
'start_date': [1000, 1000, 1000],
'weeks': [1, 2, 3]})
OP's Original Code
def orig(df):
    df['dates'] = df.apply(
        lambda x: [x['start_date'] + i * 7 for i in range(x['weeks'])], axis=1)
    df = df.explode('dates').drop(columns=['start_date']).rename(
        {'dates': 'start_date'})
%timeit orig(sample_df)
3.53 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This answer:
def fn(df):
    df = df.loc[df.index.repeat(df['weeks'])]
    df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
    df = df.drop('start_date', axis=1)
%timeit fn(sample_df)
1.63 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
OP's Answer
def fn2(df):
    df['x'] = df['weeks'].apply(lambda x: range(x))
    df = df.explode('x')
    df['start_date'] = df['start_date'] + (df['x'] * 7)
    df.drop(columns='x', inplace=True)
%timeit fn2(sample_df)
2.71 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Figured it out; same concept, just a slightly different order.
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
df['x'] = df['weeks'].apply(lambda x:range(x))
df = df.explode('x')
df['start_date'] = df['start_date']+(df['x']*7)
df.drop(columns='x', inplace=True)
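For completeness, a NumPy-only sketch (my own variant, not from either answer) builds the per-row offsets with np.repeat and a grouped arange instead of cumcount:
import numpy as np
import pandas as pd

df = pd.DataFrame({'product': ['a', 'b', 'c'],
                   'start_date': [1000, 1000, 1000],
                   'weeks': [1, 2, 3]})
reps = df['weeks'].to_numpy()
out = df.loc[df.index.repeat(reps)].copy()
# position within each group: 0, 0, 1, 0, 1, 2, ...
within = np.arange(reps.sum()) - np.repeat(np.cumsum(reps) - reps, reps)
out['dates'] = out['start_date'].to_numpy() + within * 7
out = out.drop(columns='start_date')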

Pandas Dataframe groupby aggregate functions and difference between max and min of a column on the fly

import pandas as pd
df = {'a': ['xxx', 'xxx','xxx','yyy','yyy','yyy'], 'start': [10000, 10500, 11000, 12000, 13000, 14000] }
df = pd.DataFrame(data=df)
df_new = df.groupby("a",as_index=True).agg(
ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
StartMin=pd.NamedAgg(column='start', aggfunc="min"),
StartMax=pd.NamedAgg(column='start', aggfunc="max"),
)
gives
>>>df_new
ProcessiveGroupLength StartMin StartMax
a
xxx 3 10000 11000
yyy 3 12000 14000
How can I get the output below on the fly, i.e. computing the difference during the aggregation, since I think doing it on the fly will be faster?
>>>df_new
ProcessiveGroupLength Diff
a
xxx 3 1000
yyy 3 2000
The code below gives the following error message:
Traceback (most recent call last):
File "", line 5, in
TypeError: unsupported operand type(s) for -: 'str' and 'str'
df_new = df.groupby("a").agg(
ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
Diff=pd.NamedAgg(column='start', aggfunc="max"-"min"),)
Your solution can be changed to use a lambda function, but with many groups or/and a large DataFrame it should be slower than the first solution.
The reason is that the built-in max and min aggregations are optimized, and the subtraction of the resulting Series is vectorized. In other words, aggregation without lambda functions is faster.
df_new = df.groupby("a").agg(
ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),)
Or you can use numpy.ptp:
df_new = df.groupby("a").agg(
ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),)
print (df_new)
ProcessiveGroupLength Diff
a
xxx 3 1000
yyy 3 2000
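For reference, the faster first solution referred to above (and benchmarked below) aggregates min and max with the optimized built-ins and subtracts the resulting columns afterwards:
df_new = df.groupby("a", as_index=True).agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
).assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))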
Performance: it depends on the data; here 1k groups in 1M rows are used:
np.random.seed(20)
N = 1000000
df = pd.DataFrame({'a': np.random.randint(1000, size=N),
'start':np.random.randint(10000, size=N)})
print (df)
In [229]: %%timeit
...: df_new = df.groupby("a",as_index=True).agg(
...: ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
...: StartMin=pd.NamedAgg(column='start', aggfunc="min"),
...: StartMax=pd.NamedAgg(column='start', aggfunc="max"),
...: ).assign(Diff = lambda x: x.pop('StartMax') - x.pop('StartMin'))
...:
69 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [230]: %%timeit
...: df_new = df.groupby("a").agg(
...: ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
...: Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),)
...:
172 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [231]: %%timeit
...: df_new = df.groupby("a").agg(
...: ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
...: Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),)
...:
171 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Faster way to change each value in dataframe ACCORDING to original value

I have a dataframe with 30000 columns and 4000 rows. Each cell entry contains an integer. For EVERY entry, I want to multiply the original contents by log(k/m),
where k is the total number of rows, i.e. 4000,
and m is the number of non-zero rows for THAT PARTICULAR COLUMN.
My current code uses apply:
for column in df.columns:
    m = len(df[column].to_numpy().nonzero())
    df[column] = df[column].apply(lambda x: x * np.log10(4000/m))
This takes me hours. I hope there is some faster way to do it; does anyone have any ideas?
Thanks
First generate sample data:
np.random.seed(123)
df = pd.DataFrame(np.random.rand(4, 5)*500, columns=['A', 'B', 'C', 'D', 'E']).astype(int).replace(range(100, 200), 0)
Result:
A B C D E
0 348 0 0 275 359
1 211 490 342 240 0
2 0 364 219 29 0
3 368 91 87 265 265
Next I define a vector containing the non-zero counts per column:
non_zeros = df.ne(0).sum().values
# Giving me: array([3, 3, 3, 4, 2], dtype=int64)
From there I find the log-factor for each column:
faktor = np.mat(np.log10(len(df)/ non_zeros))
# giving me: matrix([[0.12493874, 0.12493874, 0.12493874, 0. , 0.30103 ]])
Then multiply each column by its factor and convert back to a DataFrame:
res = np.multiply(np.mat(df), faktor)
df = pd.DataFrame(res)
With this solution you avoid slow Python-level loops.
Hope it helps.
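As a side note (a sketch of my own, not part of the original answer): np.matrix/np.mat is discouraged in recent NumPy, and the same column-wise scaling can be written with plain pandas broadcasting:
non_zeros = df.ne(0).sum()              # non-zero count per column (a Series)
factor = np.log10(len(df) / non_zeros)  # per-column log factor
res = df.mul(factor, axis=1)            # multiply each column by its factor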
@Dennis Hansen's answer is good, but if you still need to iterate over columns, I would recommend not using apply in your solution.
a = pd.DataFrame(np.random.rand(10000)) # define an arbitrary dataframe
a.iloc[5:500] = 0 # set some values to zero
Solution with apply performance:
>> %%timeit
>> b = a.apply(lambda x: x * np.log10(10000/len(a.to_numpy().nonzero())))
1.53 ms ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Solution without apply performance:
>> %%timeit
>> b = a*np.log10(10000/len(a.to_numpy().nonzero()))
849 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)