Pandas: conditional information retrieval within a date range - pandas

I'm still fairly new to pandas, and the script I wrote to accomplish a seemingly easy task seems needlessly complicated. If you know of an easier way to accomplish this, I would be extremely grateful.
Task:
I have two spreadsheets (df1 and df2), each with an identifier (mrn) and a date. My task is to retrieve a value from df2 for each row in df1 if the following conditions are met:
The identifier for a given row in df1 exists in df2.
If the above is true, retrieve the value in df2 if the associated date is within +/-5 days of the date in df1.
I have written the following code, which accomplishes this:
#%%housekeeping
import pandas as pd
from datetime import timedelta
from io import StringIO
#%%dataframe import
df1=',mrn,date,foo\n0,1,2015-03-06,n/a\n1,11,2009-08-14,n/a\n2,14,2009-05-18,n/a\n3,20,2010-06-19,n/a\n'
df2=',mrn,collection Date,Report\n0,1,2015-03-06,report to import1\n1,11,2009-08-12,report to import11\n2,14,2009-05-21,report to import14\n3,20,2010-06-25,report to import20\n'
df1 = pd.read_csv(StringIO(df1))
df2 = pd.read_csv(StringIO(df2))
#converting to date-time format
df1['date']=pd.to_datetime(df1['date'])
df2['collection Date']=pd.to_datetime(df2['collection Date'])
#%%mask()
def mask(dates, rangeTime):
    # True where a date falls within +/-5 days of rangeTime (both bounds inclusive)
    return (dates >= rangeTime - timedelta(days=5)) & (dates <= rangeTime + timedelta(days=5))
#%% detailLoop()
for i, element in enumerate(df1["mrn"]):
    df1DateIter = df1.loc[i, 'date']  # .loc replaces .ix, which was removed in pandas 1.0
    df2MRNmatch = df2.loc[df2['mrn'] == element, ['collection Date', 'Report']]
    if df2MRNmatch.empty:
        continue  # identifier not present in df2
    df2Date = df2MRNmatch['collection Date']
    df2Report = df2MRNmatch['Report']
    maskOut = mask(df2Date, df1DateIter)
    # only the first matching df2 row is checked
    if maskOut.iloc[0]:
        df1.loc[i, 'foo'] = df2Report.iloc[0]
Once the script has been run, df1 looks like:
Out[824]:
   mrn       date                 foo
0    1 2015-03-06   report to import1
1   11 2009-08-14  report to import11
2   14 2009-05-18  report to import14
3   20 2010-06-19                 NaN
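One loop-free alternative is to merge on mrn and then keep the report only where the two dates are within 5 days of each other. This is a minimal sketch, assuming each mrn appears at most once in df2 (as in the sample data); with several df2 rows per mrn the merge would produce one output row per pair. The names merged, in_range, and result are illustrative.

```python
from io import StringIO

import pandas as pd

df1 = pd.read_csv(
    StringIO("mrn,date\n1,2015-03-06\n11,2009-08-14\n14,2009-05-18\n20,2010-06-19\n"),
    parse_dates=["date"],
)
df2 = pd.read_csv(
    StringIO(
        "mrn,collection Date,Report\n"
        "1,2015-03-06,report to import1\n"
        "11,2009-08-12,report to import11\n"
        "14,2009-05-21,report to import14\n"
        "20,2010-06-25,report to import20\n"
    ),
    parse_dates=["collection Date"],
)

# A left merge keeps every df1 row, even when the mrn is missing from df2
merged = df1.merge(df2, on="mrn", how="left")

# Absolute gap between the two dates; a missing date (NaT) compares as False
in_range = (merged["collection Date"] - merged["date"]).abs() <= pd.Timedelta(days=5)

# Keep the report where the gap is at most 5 days, otherwise leave NaN
merged["foo"] = merged["Report"].where(in_range)
result = merged[["mrn", "date", "foo"]]
print(result)
```

If df2 can hold several rows per mrn, pd.merge_asof with by="mrn", direction="nearest" and a tolerance may be a better fit, at the cost of sorting both frames by date first.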

Related

How to subtract sales for month 1 and month 2 for every customer in my dataframe using pandas?

This is my data frame
c = pd.DataFrame({"Product":["p1","p1","p2","p2","p3","p3","p4","p4"],
                  "sales":[10000,20000,30000,40000,10000,24000,13000,20000],
                  "Month":["M1","M2","M1","M2","M1","M2","M1","M2"]})
The answer should be another dataframe.
I tried using boolean masking, but I am not sure how to work with both columns.
Is this what you are looking for?
import pandas as pd
import numpy as np
c = pd.DataFrame({"Product":["p1","p1","p2","p2","p3","p3","p4","p4"],
                  "sales":[10000,20000,30000,40000,10000,24000,13000,20000],
                  "Month":["M1","M2","M1","M2","M1","M2","M1","M2"]})
# Negate the M2 sales so that the per-product sum equals M1 - M2
c['sales'] = np.where(c['Month'] == "M2", c['sales'] * -1, c['sales'])
# Sum only the numeric column; the Month strings should not be aggregated
c.groupby('Product')['sales'].sum()
This works only when 'M1' and 'M2' are the only months in the data.
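A variant that leaves the original sales column untouched is to pivot the months into columns and subtract them. This is only a sketch: it assumes each product has exactly one row per month, and the names wide and diff are illustrative.

```python
import pandas as pd

c = pd.DataFrame({"Product": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
                  "sales": [10000, 20000, 30000, 40000, 10000, 24000, 13000, 20000],
                  "Month": ["M1", "M2", "M1", "M2", "M1", "M2", "M1", "M2"]})

# One row per product, one column per month
wide = c.pivot(index="Product", columns="Month", values="sales")

# Same sign convention as the answer above: M1 minus M2
wide["diff"] = wide["M1"] - wide["M2"]
print(wide)
```

With more than two months, the same wide frame lets you subtract any pair of month columns explicitly.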

Daily to Weekly Pandas conversion

I am trying to convert my 15 years' worth of daily data into weekly data by taking the mean, diff and count of certain features. I tried using .resample, but I am not sure whether that is the most efficient way.
My sample data:
Date,Product,New Quantity,Price,Refund Flag
8/16/1994,abc,10,0.5,
8/17/1994,abc,11,0.9,1
8/18/1994,abc,15,0.6,
8/19/1994,abc,19,0.4,
8/22/1994,abc,22,0.2,1
8/23/1994,abc,19,0.1,
8/16/1994,xyz,16,0.5,1
8/17/1994,xyz,10,0.9,1
8/18/1994,xyz,12,0.6,1
8/19/1994,xyz,19,0.4,
8/22/1994,xyz,26,0.2,1
8/23/1994,xyz,30,0.1,
8/16/1994,pqr,0,0,
8/17/1994,pqr,0,0,
8/18/1994,pqr,1,1,
8/19/1994,pqr,2,0.6,
8/22/1994,pqr,9,0.1,
8/23/1994,pqr,12,0.2,
This is the output I am looking for:
Date,Product,Net_Quantity_diff,Price_avg,Refund
8/16/1994,abc,9,0.6,1
8/22/1994,abc,-3,0.15,0
8/16/1994,xyz,3,0.6,3
8/22/1994,xyz,4,0.15,1
8/16/1994,pqr,2,0.4,0
8/22/1994,pqr,3,0.15,0
I think the pandas resample method is indeed ideal for this. You can pass a dictionary to the agg method, defining which aggregation function to use for each column. Note that this example aggregates across all products. For example:
import numpy as np
import pandas as pd
df = pd.read_csv('sales.txt') # your sample data
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
# blank refund flags become False, so the weekly sum counts the refunds
df['Refund Flag'] = df['Refund Flag'].fillna(0).astype(bool)
def span(s):
    # range of the values seen within the week
    return np.max(s) - np.min(s)
df_weekly = df.resample('W').agg({'New Quantity': span,
                                  'Price': 'mean',
                                  'Refund Flag': 'sum'})
df_weekly
            New Quantity     Price  Refund Flag
Date
1994-08-21            19  0.533333            4
1994-08-28            21  0.150000            2
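The requested output is broken down per product, which the aggregation above does not do. One way to get there, sketched here on a subset of the sample data, is to group by Product together with a weekly pd.Grouper. It assumes the rows are already in date order within each product and reads Net_Quantity_diff as last minus first value of the week; the weeks are labelled by their Sunday end date rather than by the first date observed.

```python
from io import StringIO

import pandas as pd

csv_text = """Date,Product,New Quantity,Price,Refund Flag
8/16/1994,abc,10,0.5,
8/17/1994,abc,11,0.9,1
8/18/1994,abc,15,0.6,
8/19/1994,abc,19,0.4,
8/22/1994,abc,22,0.2,1
8/23/1994,abc,19,0.1,
8/16/1994,xyz,16,0.5,1
8/17/1994,xyz,10,0.9,1
8/18/1994,xyz,12,0.6,1
8/19/1994,xyz,19,0.4,
8/22/1994,xyz,26,0.2,1
8/23/1994,xyz,30,0.1,
"""

df = pd.read_csv(StringIO(csv_text))
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["Refund Flag"] = df["Refund Flag"].fillna(0).astype(int)

weekly = (
    df.groupby(["Product", pd.Grouper(key="Date", freq="W")])
      .agg(
          # last value of the week minus the first one
          Net_Quantity_diff=("New Quantity", lambda s: s.iloc[-1] - s.iloc[0]),
          Price_avg=("Price", "mean"),
          Refund=("Refund Flag", "sum"),
      )
      .reset_index()
)
print(weekly)
```

The refund counts follow the sample data as given, so they may differ from the hand-written expected table in the question.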

Add business days to pandas dataframe with dates and skip over holidays python

I have a dataframe with dates, as shown in the tables below. The first block is what the result should look like, and the second block is what I get when I simply add the business days. I want to take the first column and add 5 business days to each date, but if those 5 business days span a holiday (like 15 Feb 2021), I need to add one additional day. It is fairly simple to add the 5 business days using BDay from pandas.tseries.offsets, but I cannot skip the holidays when applying it to the dataframe.
I have tried USFederalHolidayCalendar from pandas.tseries.holiday, as well as the workdays and workalendar modules, but cannot figure it out. Does anyone have an idea what I can do?
Correct Example
DATE        EXIT DATE +5
2021/02/09  2021/02/17
2021/02/10  2021/02/18
Wrong Example
DATE        EXIT DATE +5
2021/02/09  2021/02/16
2021/02/10  2021/02/17
Here are some examples of code I tried:
import pandas as pd
from workdays import workday
...
df['DATE'] = workday(df['EXIT DATE +5'], days=5, holidays=holidays)
Next Example:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt = df['DATE']
df['EXIT DATE +5'] = dt + bday_us
=========================================
Final code:
Below is the code I finally settled on. I had to define the holidays manually to match the days the NYSE actually trades, for instance the day President Bush was laid to rest.
import pandas as pd
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday, nearest_workday, \
    USMartinLutherKingJr, USPresidentsDay, GoodFriday, USMemorialDay, \
    USLaborDay, USThanksgivingDay
class USTradingCalendar(AbstractHolidayCalendar):
    rules = [
        Holiday('NewYearsDay', month=1, day=1, observance=nearest_workday),
        USMartinLutherKingJr,
        USPresidentsDay,
        GoodFriday,
        USMemorialDay,
        Holiday('USIndependenceDay', month=7, day=4, observance=nearest_workday),
        Holiday('BushDay', year=2018, month=12, day=5),
        USLaborDay,
        USThanksgivingDay,
        Holiday('Christmas', month=12, day=25, observance=nearest_workday)
    ]
offset = 5
df = pd.DataFrame(['2019-10-11', '2019-10-14', '2017-04-13', '2018-11-28', '2021-07-02'], columns=['DATE'])
df['DATE'] = pd.to_datetime(df['DATE'])
def offset_date(start, offset):
    # add `offset` trading days, skipping weekends and the holidays defined above
    return start + pd.offsets.CustomBusinessDay(n=offset, calendar=USTradingCalendar())
df['END'] = df.apply(lambda x: offset_date(x['DATE'], offset), axis=1)
print(df)
Input data
import pandas as pd
df = pd.DataFrame(['2021-02-09', '2021-02-10', '2021-06-28', '2021-06-29', '2021-07-02'], columns=['DATE'])
df['DATE'] = pd.to_datetime(df['DATE'])
Suggested solution using apply
from pandas.tseries.holiday import USFederalHolidayCalendar
def offset_date(start, offset):
    # add `offset` business days, skipping weekends and US federal holidays
    return start + pd.offsets.CustomBusinessDay(n=offset, calendar=USFederalHolidayCalendar())
offset = 5
df['END'] = df.apply(lambda x: offset_date(x['DATE'], offset), axis=1)
      DATE        END
2021-02-09 2021-02-17
2021-02-10 2021-02-18
2021-06-28 2021-07-06
2021-06-29 2021-07-07
2021-07-02 2021-07-12
PS: If you want to use a particular calendar such as the NYSE one instead of the default USFederalHolidayCalendar, I recommend following the instructions in this answer about creating a custom calendar.
Alternative solution which I do not recommend
Currently, to the best of my knowledge, pandas does not support a vectorized approach to your problem. But if you want to follow an approach similar to the one you mentioned, here is what you should do.
First, you will have to define an arbitrary far away end date that includes all the periods you might need and use it to create a list of holidays.
holidays = USFederalHolidayCalendar().holidays(start='2021-02-09', end='2030-02-09')
Then, you pass the holidays list to CustomBusinessDay through the holidays parameter instead of the calendar to generate the desired offset.
offset = 5
bday_us = pd.offsets.CustomBusinessDay(n=offset, holidays=holidays)
df['END'] = df['DATE'] + bday_us
However, this type of approach is not a true vectorized solution, even though it might seem like one. See the following SO answer for further clarification. Under the hood, this approach is probably doing a conversion that is not efficient. This is why it yields the following warning.
PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex
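If the warning matters, one array-level option outside pandas' offset machinery is NumPy's np.busday_offset, which accepts a holiday list and operates on whole arrays. This is a sketch of a swapped-in technique, not something the answers above propose. It assumes the holiday window (here all of 2021) covers every date involved. The roll="backward" setting only affects start dates that are not business days, which none of the sample dates are.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

df = pd.DataFrame({"DATE": pd.to_datetime(["2021-02-09", "2021-02-10", "2021-07-02"])})

# Holiday window chosen for illustration; widen it to cover your own data
holidays = USFederalHolidayCalendar().holidays(start="2021-01-01", end="2021-12-31")

end = np.busday_offset(
    df["DATE"].values.astype("datetime64[D]"),  # NumPy expects day precision
    5,                                           # business days to add
    roll="backward",                             # only matters for non-business start dates
    holidays=holidays.values.astype("datetime64[D]"),
)
df["END"] = pd.to_datetime(end)
print(df)
```

The default weekmask treats Monday to Friday as business days, matching CustomBusinessDay's default.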
Here's one way to do it:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from datetime import timedelta as td
def get_exit_date(date):
    holiday_list = cals.holidays(start=date, end=date + td(weeks=2)).tolist()
    # 6 periods since start date is included in set
    n_bdays = pd.bdate_range(start=date, periods=6, freq='C', holidays=holiday_list)
    return n_bdays[-1]
df = pd.read_clipboard()
cals = USFederalHolidayCalendar()
# I would convert this to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['EXIT DATE +5'] = df['DATE'].apply(get_exit_date)
This uses bdate_range, which returns a DatetimeIndex.
Results:
        DATE EXIT DATE +5
0 2021-02-09   2021-02-17
1 2021-02-10   2021-02-18
Another option, instead of creating the holiday list dynamically, is to choose a start date and build the list once outside the function, like so:
def get_exit_date(date):
    # 6 periods since start date is included in set
    n_bdays = pd.bdate_range(start=date, periods=6, freq='C', holidays=holiday_list)
    return n_bdays[-1]
df = pd.read_clipboard()
cals = USFederalHolidayCalendar()
holiday_list = cals.holidays(start='2021-01-01').tolist()
# I would convert this to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['EXIT DATE +5'] = df['DATE'].apply(get_exit_date)

Dask .loc only the first result (iloc[0])

Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
                  index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get back only the first result (based on the index), like this in pandas:
df.loc[df.col_1 >3].iloc[0]
   col_1 col_2
2      4     d
I know there is no positional row indexing with iloc in dask, but I wonder whether it is possible to limit the query to one result, as in SQL.
Got it, but I am not sure about the efficiency here:
tmp = df.loc[df.col_1 >3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()

Multiprocessing the Fuzzy match in pandas

I have two data frames:
Df_Address, which has 347k distinct addresses, and Df_Project, which has 24k records containing Project_Id, Project_Start_Date and Project_Address.
I want to check whether there is a fuzzy match for my Project_Address in Df_Address. If there is a match, I want to extract the corresponding Project_Id and Project_Start_Date. Below is the code I am trying:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
#address = list(Df_Project["Project_Address"])
def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(
        x, choices=choices, score_cutoff=cutoff
    )
Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(
        Df_Project["Project_Address"],
        80
    )
)
This code does provide an output in the form of a tuple:
('matched_string', score)
But it also returns similar strings, and I still need to extract Project_Id and Project_Start_Date. Can someone help me achieve this using parallel processing, as the data is huge?
You can convert the tuples into a dataframe and then join it back to your base data frame.
import pandas as pd
Df_Address = pd.DataFrame({'address': ['abc','cdf'],'random_stuff':[100,200]})
Matched = (('abc',10),('cdf',20))
# build the frame from the matched tuples (the original passed an undefined name x)
dist = pd.DataFrame(Matched)
dist.columns = ['address','distance']
final = Df_Address.merge(dist,how='left',on='address')
print(final)
Output:
  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20
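To carry Project_Id and Project_Start_Date along, the matched string can be used as a join key back into the project table. The sketch below uses the standard library's difflib.get_close_matches as a stand-in for fuzzywuzzy so that it runs without extra packages; the sample rows, the 0.8 cutoff, and the helper name best_match are all made up for illustration, and difflib's ratio is not the same score that fuzzywuzzy reports.

```python
import difflib

import pandas as pd

# Tiny made-up frames using the column names from the question
Df_Address = pd.DataFrame({"Address": ["12 Main Street", "99 Unknown Road"]})
Df_Project = pd.DataFrame({
    "Project_Id": [1, 2],
    "Project_Start_Date": ["2020-01-01", "2020-02-01"],
    "Project_Address": ["12 Main St", "45 Oak Avenue"],
})

choices = Df_Project["Project_Address"].tolist()

def best_match(address, cutoff=0.8):
    # Return the closest project address, or None when nothing clears the cutoff
    hits = difflib.get_close_matches(address, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Store the matched project address, then join to pull in the project columns
Df_Address["Project_Address"] = Df_Address["Address"].apply(best_match)
result = Df_Address.merge(Df_Project, how="left", on="Project_Address")
print(result)
```

Because best_match is a plain top-level function, the addresses could in principle be split into chunks and mapped across worker processes (for example with concurrent.futures.ProcessPoolExecutor), though whether that pays off depends on the data.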