Join, merge, or reshape a DataFrame based on two conditions - pandas

I have two dataframes df and df1 which I want to merge or join.
import pandas as pd
df = pd.DataFrame(columns=['lt1', 'lt2','lt3','lt4','lt5','lt6'])
df['date'] = pd.date_range('2016-1-1', periods=5, freq='D')
df
lt1 lt2 lt3 lt4 lt5 lt6 date
0 NaN NaN NaN NaN NaN NaN 2016-01-01
1 NaN NaN NaN NaN NaN NaN 2016-01-02
2 NaN NaN NaN NaN NaN NaN 2016-01-03
3 NaN NaN NaN NaN NaN NaN 2016-01-04
4 NaN NaN NaN NaN NaN NaN 2016-01-05
df1 = pd.DataFrame({'location': ['lt1', 'lt3', 'lt6', 'lt1', 'lt2', 'lt3'],
                    'date': ['2016-01-01', '2016-01-02', '2016-01-01',
                             '2016-01-03', '2016-01-05', '2016-01-04'],
                    'counts': ['2', '1', '1', '1', '3', '1']})
df1.date = pd.to_datetime(df1.date)
df1
counts date location
0 2 2016-01-01 lt1
1 1 2016-01-02 lt3
2 1 2016-01-01 lt6
3 1 2016-01-03 lt1
4 3 2016-01-05 lt2
5 1 2016-01-04 lt3
I want to put the counts values from df1 into df according to location. The merge is based on the date column, but the values to be added come from df1.counts, and each value should be placed in the df column whose name matches df1.location. The columns of df cover all the names that appear in df1.location.
Merging by date alone is easy, but this is not really a straightforward merge; it is more like a reshape or join. Any suggestion on how to get the following df as output?
df
date lt1 lt2 lt3 lt4 lt5 lt6
0 2016-01-01 2 0 0 0 0 1
1 2016-01-02 0 0 1 0 0 0
2 2016-01-03 1 0 0 0 0 0
3 2016-01-04 0 0 1 0 0 0
4 2016-01-05 0 3 0 0 0 0

Here is one way using pivot_table and combine_first:
# counts is stored as strings above, so convert it before aggregating
df1['counts'] = pd.to_numeric(df1['counts'])
m = df1.pivot_table(index='date', columns='location', values='counts', aggfunc='sum')
final = df.set_index('date').combine_first(m).fillna(0).reset_index()
Or just:
(df.set_index('date')
   .combine_first(df1.pivot(index='date', columns='location', values='counts'))
   .fillna(0).reset_index())
date lt1 lt2 lt3 lt4 lt5 lt6
0 2016-01-01 2 0 0 0 0 1
1 2016-01-02 0 0 1 0 0 0
2 2016-01-03 1 0 0 0 0 0
3 2016-01-04 0 0 1 0 0 0
4 2016-01-05 0 3 0 0 0 0
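If you prefer to build the table directly from df1, a crosstab plus a reindex against df's dates and columns gives the same result. This is only a sketch on the same data (out is just a throwaway name), and it assumes counts has already been converted to numeric as above:
out = (pd.crosstab(index=df1['date'], columns=df1['location'],
                   values=df1['counts'], aggfunc='sum')
         .reindex(index=df['date'], columns=df.columns.drop('date'))
         .fillna(0)
         .reset_index())
Here reindex adds the lt4 and lt5 columns that never appear in df1, and fillna(0) replaces the remaining gaps.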

Related

Calculate the cumulative count of all NaN values in a specific column

I have a dataframe:
import numpy as np
import pandas as pd

# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7])
df['ID'] = [1,1,1,1,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=7, freq="M")
df['stock_price'] = [1,np.nan,np.nan,4,5,np.nan,7]

# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date stock_price
0 2 2010-07-31 7.0
1 2 2010-06-30 NaN
2 2 2010-05-31 5.0
3 1 2010-04-30 4.0
4 1 2010-03-31 NaN
5 1 2010-02-28 NaN
6 1 2010-01-31 1.0
I would like to calculate the cumulative count of np.nan values in column stock_price for every ID.
The expected result is:
df
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
Any ideas how to solve it?
The idea is to reverse the order with indexing, then in a custom function test for missing values, shift, and take the cumulative sum:
f = lambda x: x.isna().shift(fill_value=0).cumsum()
df['cum_count_nans'] = df.iloc[::-1].groupby('ID')['stock_price'].transform(f)
print (df)
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
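If the lambda feels dense, the same logic can be unrolled step by step; the sketch below only restates what the one-liner does, with throwaway names (rev, is_nan, shifted) used purely for illustration:
rev = df.iloc[::-1]                              # walk each ID from the oldest date upward
is_nan = rev['stock_price'].isna().astype(int)   # 1 where the price is missing
shifted = is_nan.groupby(rev['ID']).shift(fill_value=0)     # exclude the current row per ID
df['cum_count_nans'] = shifted.groupby(rev['ID']).cumsum()  # running count of earlier NaNs
The shift before the cumulative sum is what keeps a row's own NaN out of its count.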

How to map data within Pandas DataFrame w.r.t index and column from another DataFrame

Let's say I have two DataFrames as below:
DF1:
from datetime import date, timedelta
import pandas as pd
import numpy as np
sdate = date(2019,1,1) # start date
edate = date(2019,1,7) # end date
required_dates = pd.date_range(sdate,edate-timedelta(days=1),freq='d')
# initialize list of lists
data = [['2019-01-01', 1001], ['2019-01-03', 1121] ,['2019-01-02', 1500],
['2019-01-02', 1400],['2019-01-04', 1501],['2019-01-01', 1200],
['2019-01-04', 1201],['2019-01-04', 1551],['2019-01-05', 1400]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['OnlyDate', 'TBID'])
df1.sort_values(by='OnlyDate',inplace=True)
df1
OnlyDate TBID
0 2019-01-01 1001
5 2019-01-01 1200
2 2019-01-02 1500
3 2019-01-02 1400
1 2019-01-03 1121
4 2019-01-04 1501
6 2019-01-04 1201
7 2019-01-04 1551
8 2019-01-05 1400
DF2 :
df2 = pd.DataFrame(columns=sorted(df1['TBID'].unique()), index=required_dates)
df2
1001 1121 1200 1201 1400 1500 1501 1551
2019-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
2019-01-02 NaN NaN NaN NaN NaN NaN NaN NaN
2019-01-03 NaN NaN NaN NaN NaN NaN NaN NaN
2019-01-04 NaN NaN NaN NaN NaN NaN NaN NaN
2019-01-05 NaN NaN NaN NaN NaN NaN NaN NaN
2019-01-06 NaN NaN NaN NaN NaN NaN NaN NaN
What I am trying to do is set 1 (or True) in this df3 DataFrame wherever a (date, TBID) pair appears in df1, like the output below:
df3 = df2.copy()
for index, row in df1.iterrows():
    df3.loc[row['OnlyDate'], row['TBID']] = 1
df3.fillna(0, inplace=True)
df3
1001 1121 1200 1201 1400 1500 1501 1551
2019-01-01 1 0 1 0 0 0 0 0
2019-01-02 0 0 0 0 1 1 0 0
2019-01-03 0 1 0 0 0 0 0 0
2019-01-04 0 0 0 1 0 0 1 1
2019-01-05 0 0 0 0 1 0 0 0
2019-01-06 0 0 0 0 0 0 0 0
Is there a better way of doing this?
Use get_dummies with max for indicators (always 0 or 1), or sum if you want counts:
df = pd.get_dummies(df1.set_index('OnlyDate')['TBID']).max(level=0)
print (df)
1001 1121 1200 1201 1400 1500 1501 1551
OnlyDate
2019-01-01 1 0 1 0 0 0 0 0
2019-01-02 0 0 0 0 1 1 0 0
2019-01-03 0 1 0 0 0 0 0 0
2019-01-04 0 0 0 1 0 0 1 1
2019-01-05 0 0 0 0 1 0 0 0
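Note that DataFrame.max(level=0) was deprecated and later removed in newer pandas, so on recent versions the groupby form is needed instead. A sketch of the equivalent (out is just a throwaway name), with an optional reindex if you also want the empty 2019-01-06 row from required_dates:
out = pd.get_dummies(df1.set_index('OnlyDate')['TBID'], dtype=int).groupby(level=0).max()
# optional: include dates with no activity as all-zero rows
out.index = pd.to_datetime(out.index)
out = out.reindex(required_dates, fill_value=0)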

Insert multiple dates at the start of every group in pandas

I have a dataframe with millions of groups. For each group, I am trying to add 3 months of dates (month-end dates) at the top. So if the first observation of a group is December 2019, I want to fill 3 rows prior to that observation with dates from September 2019 to November 2019. I also want to fill the group column with the relevant group ID; the other columns can remain as null values.
I would like to avoid looping if possible, as this is a very large dataset.
This is my before DataFrame:
import pandas as pd
before = pd.DataFrame({'Group':[1,1,1,1,1,2,2,2,2,2],
'Date':['31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[1.1,1.7,1.9,2.3,1.5,2.8,2,2,2,2]})
This is my after DataFrame
import numpy as np
import pandas as pd
after = pd.DataFrame({'Group':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'Date':['31/07/2018','31/08/2018','30/09/2018','31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','31/12/2000','31/01/2001','28/02/2001','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[np.nan,np.nan,np.nan,1.1,1.7,1.9,2.3,1.5,np.nan,np.nan,np.nan,2.8,2,2,2,2]})
Because processing each group separately cannot be very fast when there are many groups, the idea is to get the first row of each Group with DataFrame.drop_duplicates, shift the months with offsets.MonthOffset, concatenate, and add all the missing dates in between:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
# first and last shifted months - by 3 and by 1 month
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(1))
df = (pd.concat([df11, df12], sort=False, ignore_index=True)
.set_index('Date')
.groupby('Group')
.resample('m')
.size()
.reset_index(name='value')
.assign(value = np.nan))
print (df)
Group Date value
0 1 2018-07-31 NaN
1 1 2018-08-31 NaN
2 1 2018-09-30 NaN
3 2 2000-12-31 NaN
4 2 2001-01-31 NaN
5 2 2001-02-28 NaN
Last, add back to the original and sort:
df = pd.concat([before, df], ignore_index=True).sort_values(['Group','Date'])
print (df)
Group Date value
10 1 2018-07-31 NaN
11 1 2018-08-31 NaN
12 1 2018-09-30 NaN
0 1 2018-10-31 1.1
1 1 2018-11-30 1.7
2 1 2018-12-31 1.9
3 1 2019-01-31 2.3
4 1 2019-02-28 1.5
13 2 2000-12-31 NaN
14 2 2001-01-31 NaN
15 2 2001-02-28 NaN
5 2 2001-03-30 2.8
6 2 2001-04-30 2.0
7 2 2001-05-31 2.0
8 2 2001-06-30 2.0
9 2 2001-07-31 2.0
If there are only a few new months, you can omit the groupby part:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(2))
df13 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(1))
df = (pd.concat([df11, df12, df13, before], ignore_index=True, sort=False)
.sort_values(['Group','Date']))
print (df)
Group Date value
0 1 2018-07-31 NaN
2 1 2018-08-31 NaN
4 1 2018-09-30 NaN
6 1 2018-10-31 1.1
7 1 2018-11-30 1.7
8 1 2018-12-31 1.9
9 1 2019-01-31 2.3
10 1 2019-02-28 1.5
1 2 2000-12-30 NaN
3 2 2001-01-30 NaN
5 2 2001-02-28 NaN
11 2 2001-03-30 2.8
12 2 2001-04-30 2.0
13 2 2001-05-31 2.0
14 2 2001-06-30 2.0
15 2 2001-07-31 2.0
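One portability note: pd.offsets.MonthOffset is not exposed in some newer pandas releases. If it is unavailable in your version, pd.DateOffset(months=n) produces the same shift here; a sketch of the substitution for the two helper frames:
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=1))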

Sort by datetime, group sequences by 'ID' and Date, and then take the mean per sequence

I am new to pandas functionality.
I have a DF as shown below, which is repair data for mobile phones.
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
I tried to find, for each ID, the duration since the previous status, and then the mean duration for each status sequence for that ID.
My expected output is shown below.
ID S1 S1_Dur S2 S2_dur S3 S3_dur S4 S4_dur Avg_MF Avg_FM
0 1 F-M 30 M-F 335.00 F-M 61.00 M-F 750.00 542.50 45.50
1 2 M-F 92 F-F 122.00 NaN nan NaN nan 92.00 nan
2 3 M-F 457 F-M 122.00 NaN nan NaN nan 457.00 122.00
3 4 M-F 304 NaN nan NaN nan NaN nan 304.00 nan
S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FM = Average F-M Duration
I tried the following code:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)
Then I got a DF as shown below:
ID Status index Date Cost Difference
0 1 F 0 2017-06-22 500 NaT
1 1 M 1 2017-07-22 100 30 days
2 1 F 7 2018-06-22 600 335 days
3 1 M 9 2018-08-22 150 61 days
4 1 F 10 2019-03-22 750 212 days
5 2 M 2 2017-06-29 200 NaT
6 2 F 5 2017-09-29 600 92 days
7 2 F 6 2018-01-29 500 122 days
8 3 M 3 2017-03-20 300 NaT
9 3 F 8 2018-06-20 700 457 days
10 3 M 11 2018-10-20 250 122 days
11 4 M 4 2017-08-10 800 NaT
12 4 F 12 2018-06-10 100 304 days
After that I am stuck.
The idea is to create a new column for the difference with DataFrameGroupBy.diff and join the shifted values of Status with DataFrameGroupBy.shift. Remove rows with missing values in the S column. Then reshape with DataFrame.unstack using GroupBy.cumcount as a counter column, create means per pair of S values with DataFrame.pivot_table, and finally use DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)
df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')
df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)
df3 = df2.join(df1).reset_index()
print (df3)
ID D_1 S_1 D_2 S_2 D_3 S_3 D_4 S_4 Avg_F-F Avg_F-M \
0 1 30.0 F-M 335.0 M-F 61.0 F-M 212.0 M-F NaN 45.5
1 2 92.0 M-F 122.0 F-F NaN NaN NaN NaN 122.0 NaN
2 3 457.0 M-F 122.0 F-M NaN NaN NaN NaN NaN 122.0
3 4 304.0 M-F NaN NaN NaN NaN NaN NaN NaN NaN
Avg_M-F
0 273.5
1 92.0
2 457.0
3 304.0
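If you want the exact column layout from the question (S1/S1_Dur, Avg_MF/Avg_FM), a final rename is enough; the mapping below is only an illustrative sketch:
df3 = df3.rename(columns={'S_1': 'S1', 'D_1': 'S1_Dur',
                          'S_2': 'S2', 'D_2': 'S2_dur',
                          'S_3': 'S3', 'D_3': 'S3_dur',
                          'S_4': 'S4', 'D_4': 'S4_dur',
                          'Avg_M-F': 'Avg_MF', 'Avg_F-M': 'Avg_FM'})
The Avg_F-F column has no counterpart in the requested layout and can be dropped if it is not needed.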

Using scalar values in a series as variables in a user-defined function

I want to define a function that is applied element-wise to each row of a DataFrame, comparing each element to a scalar value in a separate series. I started with the function below.
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it applies the mask along axis 0, and I need it applied along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
since 1, 2, and 3 are all >= 1 and none of the remaining values are greater than or equal to 1000.
Your best bet may be to do some transposes (no copies are made, if that's a concern):
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
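The transposes can also be avoided entirely by telling pandas which axis to align s with, or by broadcasting in NumPy; both sketches below give the same counts:
df.ge(s, axis=0).sum(axis=1)            # align s with the row labels instead of the columns
(df.values >= s[:, None]).sum(axis=1)   # plain NumPy broadcasting of the comparison
With axis=0 the flex comparison matches each element of s to a row rather than a column, which is exactly the pairing the question asks for.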