Comparing one date with starting date and ending date with duplicates - pandas

I have the sample data below, and I need to create a function that takes a sales date, compares it against the discount date ranges below, and returns the discount name and percentage. The discount date ranges are not unique and sometimes overlap, so if the sales date falls within more than one discount, the function has to return the discount with the highest percentage.
Discount Name Start Date End Date Percentage
0 First 2020-07-24 2020-11-25 0.10
1 First 2020-09-13 2020-10-29 0.10
2 First 2020-12-07 2020-12-10 0.10
3 First 2020-12-28 2021-01-19 0.10
4 First 2020-06-14 2020-06-14 0.10
5 Second 2020-06-16 2020-06-18 0.15
6 Second 2020-06-21 2020-06-22 0.15
7 Second 2020-06-22 2020-06-23 0.15
8 Second 2020-07-07 2020-07-08 0.15
9 Third 2020-06-02 2020-06-12 0.20
10 Third 2020-05-19 2020-06-01 0.20
11 Third 2020-05-06 2020-05-17 0.20
12 Third 2020-04-30 2020-05-03 0.20
I truly hope that someone can help me with this. Thanks.

This function should do the trick:
def discount_rate(df, sales_date):
    # Keep only the rows whose [Start Date, End Date] range contains the sales date,
    # then take the highest percentage among the overlapping discounts
    return df[(df['Start Date'] <= sales_date) & (df['End Date'] >= sales_date)]['Percentage'].max()
sales_date should be of type datetime.datetime, and so should the Start Date and End Date columns.
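Since the question also asks for the discount name, a small variation along the same lines (a sketch; discount_name_and_rate is a made-up name) returns both the name and the percentage of the winning discount:
def discount_name_and_rate(df, sales_date):
    # Rows whose date range contains the sales date
    matches = df[(df['Start Date'] <= sales_date) & (df['End Date'] >= sales_date)]
    if matches.empty:
        return None
    # The overlapping discount with the highest percentage wins
    return matches.loc[matches['Percentage'].idxmax(), ['Discount Name', 'Percentage']]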

Related

Merging two series with alternating dates into one grouped Pandas dataframe

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of a start AND a stop date, that would be much more favorable.
Thank you!
import pandas as pd

p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val': [310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val': [312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
# DataFrame.append is deprecated (removed in pandas 2.0), so concatenate instead
df = pd.concat([p1, p2]).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df = df.drop('val', axis=1)
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
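If grouping by the actual start/stop range is the goal, one option (just a sketch) is to turn each Start/Stop pair into an IntervalIndex and use it as the index; the first row is dropped here because it has no Start:
# Sketch: index the frame by the full date range of each period
valid = df.dropna(subset=['Start']).copy()
valid.index = pd.IntervalIndex.from_arrays(
    pd.to_datetime(valid['Start']), pd.to_datetime(valid['Stop']), closed='both')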
Alternatively, starting from the two original series as df1 and df2:
# If needed:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
# Put the two series side by side; each row has exactly one non-NaN value
df = pd.concat([df1, df2], axis=1)
df.columns = ['start', 'stop']
# bfill(axis=1) pulls each row's single value into 'start', so diff() runs across both series
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1

How to clean a SQL table based on startdate, enddate and effective date

I have a really messy table that mixes the start date with the effective date of a value change.
The table looks like this:
id  value  startdate   enddate     effective date
1   0.3    2020-10-07  2021-02-28  2020-07-01
1   1      2020-10-07  2021-02-28  2020-10-07
2   0.46   2021-01-01              2021-01-01
2   1      2021-01-01  2020-10-07  2021-05-01
3   1      2021-08-01              2021-08-01
4   1      2019-03-01              2019-03-01
4   0.5    2019-03-01              2020-08-01
4   0.7    2019-03-01              2021-05-01
When the enddate is empty it means that no change is planned, and when the start date is later than the effective date it means that an older record was deleted and a new one was created with other values.
My goal is to clean the table and get it sorted as something like this:
id  value  startdate_valid  enddate_valid
1   0.3    2020-07-01       2020-10-07
1   1      2020-10-07       2021-02-28
2   0.46   2021-01-01       2021-05-01
2   1      2021-05-01
3   1      2021-08-01
4   1      2019-03-01       2020-08-01
4   0.5    2020-08-01       2021-05-01
4   0.7    2021-05-01
Any idea how I can achieve this?
EDIT:
I think I was able to get the startdate_valid value by using
MAX([effective date]) OVER(PARTITION BY id, YEAR([effective date]), MONTH([effective date]) ORDER BY [effective date])
This makes sense, as the startdate is included in the effective date, but I am still stuck on getting the enddate_valid.
I have found a solution to my problem. I needed to do it in two steps, so if someone has a better solution, please share it and I will mark it as correct:
SELECT
    *,
    COALESCE(
        LEAD(sub.StartDate_value) OVER(PARTITION BY sub.id ORDER BY sub.StartDate_value),
        sub.[startdate]) AS [EndDate_value]
FROM (
    SELECT
        id, value,
        COALESCE(
            MAX([effective date]) OVER(PARTITION BY id, YEAR([effective date]), MONTH([effective date]) ORDER BY [effective date]),
            startdate) AS StartDate_value
    FROM table) sub
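For anyone doing the same cleanup in pandas instead of SQL, a rough equivalent of the two steps might look like this (a sketch only; effective_date is a stand-in column name for [effective date]):
import pandas as pd

# Sketch: startdate_valid prefers the effective date, falling back to startdate
df['startdate_valid'] = df['effective_date'].fillna(df['startdate'])
df = df.sort_values(['id', 'startdate_valid'])
# enddate_valid is the next record's start within the same id (like SQL's LEAD);
# NaT means the record is still open-ended
df['enddate_valid'] = df.groupby('id')['startdate_valid'].shift(-1)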

Transpose a table with multiple ID rows and different assessment dates

I would like to transpose my table to see trends in the data. The data is formatted as such:
UserId can occur multiple times because of different assessment periods. Let's say a user with ID 1 incurred some charges in January, February, and March; there are currently three rows that contain the data from these periods, respectively.
I would like to see everything as one row per user ID, independently of the number of periods (up to 12 months).
This would enable me to see and compare changes between assessment periods and attributes.
Current format:
UserId AssessmentDate Attribute1 Attribute2 Attribute3
1 2020-01-01 00:00:00.000 -01:00 20.13 123.11 405.00
1 2021-02-01 00:00:00.000 -01:00 1.03 78.93 11.34
1 2021-03-01 00:00:00.000 -01:00 15.03 310.10 23.15
2 2021-02-01 00:00:00.000 -01:00 14.31 41.30 63.20
2 2021-03-01 00:03:45.000 -01:00 0.05 3.50 1.30
Desired format:
UserId LastAssessmentDate Attribute1_M-2 Attribute2_M-1 ... Attribute3_M0
1 2021-03-01 00:00:00.000 -01:00 20.13 123.11 23.15
2 2021-03-01 00:03:45.000 -01:00 NULL 41.30 1.30
Either SQL or Pandas - both work for me. Thanks for the help!
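One way to get there in pandas (a minimal sketch; the M0/M-1/... suffix scheme and the helper column name offset are assumptions based on the desired format):
import pandas as pd

# Sketch: number each user's assessments backwards from the latest one
df['AssessmentDate'] = pd.to_datetime(df['AssessmentDate'])
df = df.sort_values(['UserId', 'AssessmentDate'])
df['offset'] = df.groupby('UserId').cumcount(ascending=False)  # 0 = M0 (latest), 1 = M-1, ...
# Pivot each attribute/offset pair into its own column
wide = df.pivot(index='UserId', columns='offset',
                values=['Attribute1', 'Attribute2', 'Attribute3'])
wide.columns = [f'{attr}_M-{off}' if off else f'{attr}_M0' for attr, off in wide.columns]
last = df.groupby('UserId')['AssessmentDate'].max().rename('LastAssessmentDate')
result = last.to_frame().join(wide)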

pandas groupby several criteria

I have a dataframe, containing every minute of a year, with the columns date, Reserved and Used.
I need to simplify it to an hourly basis: keep only the hours of the year, with the maximum of the Reserved and Used columns for each respective hour.
I made this, which works, but not entirely for my purposes:
df = df.assign(date=df.date.dt.round('H'))
df1 = df.groupby('date').agg({'Reserved': ['max'], 'Used': ['max'] }).droplevel(1, axis=1).reset_index()
which just groups the minutes into hours.
date Reserved Used
0 2020-01-01 00:00:00 2176 0.0
1 2020-01-01 01:00:00 2176 0.0
2 2020-01-01 02:00:00 2176 0.0
3 2020-01-01 03:00:00 2176 0.0
4 2020-01-01 04:00:00 2176 0.0
... ... ... ...
8780 2020-12-31 20:00:00 3450 50.0
8781 2020-12-31 21:00:00 3450 0.0
8782 2020-12-31 22:00:00 3450 0.0
8783 2020-12-31 23:00:00 3450 0.0
8784 2021-01-01 00:00:00 3450 0.0
Now I need to group it further, to plot several curves containing only 24 points (one for every hour), based on several criteria:
- average Used and Reserved for the whole year (so group together every 00 hour, every 01 hour, etc.)
- average Used and Reserved for every month (so group every 00 hour, 01 hour, etc. for each month individually)
- average Used and Reserved for weekdays and for weekends
I know this is just a groupby similar to the one before, but I somehow miss the logic of doing it.
Could anybody help?
Thanks.
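A sketch of those three groupings, assuming df1 is the hourly frame built above with a datetime column date:
# Average per hour of day across the whole year (24 points)
year_avg = df1.groupby(df1['date'].dt.hour)[['Reserved', 'Used']].mean()
# Average per hour of day for each month individually (12 x 24 points)
month_avg = df1.groupby([df1['date'].dt.month, df1['date'].dt.hour])[['Reserved', 'Used']].mean()
# Average per hour of day, split into weekdays vs. weekends
is_weekend = df1['date'].dt.dayofweek >= 5
weekend_avg = df1.groupby([is_weekend, df1['date'].dt.hour])[['Reserved', 'Used']].mean()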

How to fill pandas dataframe with max() values

I have a dataframe where each day starts at 7:00 and ends at 22:10, in 5-minute intervals.
The df contains around 200 days (weekend days and some specific days are excluded).
Date Time Volume
0 2019-09-03 07:00:00 70000 778
1 2019-09-03 07:05:00 70500 1267
2 2019-09-03 07:10:00 71000 1208
3 2019-09-03 07:15:00 71500 715
4 2019-09-03 07:20:00 72000 372
I need another column, let's call it 'lastdayVolume', with the maximum Volume of the prior day.
For example, if on 2019-09-03 (between 7:00 and 22:10) the maximum volume value in a single row is 50000, then I need the value 50000 in column 'lastdayVolume' in every row of 2019-09-04.
How would you do this without decreasing the length of the dataframe?
Have you tried
df.resample('1D', on='Date').max()
This should give you one row per day with the maximal value for that day.
EDIT: To combine that with the old data, you can use a left join. It's a bit messy, but
pd.merge(df, df.resample('1D', on='Date')['Volume'].max().rename('lastdayVolume'), left_on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date), right_index=True, how='left')
Output:
Date Time Volume lastdayVolume
0 2019-09-03 07:00:00 70000 778 800.0
1 2019-09-03 07:05:00 70500 1267 800.0
2 2019-09-03 07:10:00 71000 1208 800.0
3 2019-09-03 07:15:00 71500 715 800.0
4 2019-09-03 07:20:00 72000 372 800.0
0 2019-09-02 08:00:00 70000 800 NaN
seems to work out.
Equivalently you can use the slightly shorter
df.join(df.resample('1D', on='Date')['Volume'].max().rename('lastdayVolume'), on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date))
here.
The first DataFrame is your old one, the second is the one calculated above (with appropriate renaming). For the join keys, on the left you take your 'Date' column (which contains timestamps), offset it by one day, and convert it to an actual date; on the right you simply use the index.
The left join ensures you don't accidentally drop rows if you have no transactions the day before.
EDIT 2: To find the maximum within a certain time range, you can use
df.set_index('Date').between_time('15:30:00', '22:10:00')
to filter the DataFrame. Afterwards resample as before
df.join(df.set_index('Date').between_time('15:30:00', '22:10:00').resample('1D')...
where the on parameter in the resample is no longer necessary as the Date went into the index.
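Putting the pieces together, the elided line could end up looking something like this (a sketch, reusing the rename('lastdayVolume') and the shifted join key from above):
df.join(
    df.set_index('Date')
      .between_time('15:30:00', '22:10:00')
      .resample('1D')['Volume'].max()
      .rename('lastdayVolume'),
    on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date)
)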