I have some transaction data that looks like this.
import pandas as pd
from io import StringIO
from datetime import datetime
from datetime import timedelta
data = """\
cust_id,datetime,txn_type,txn_amt
100,2019-03-05 6:30,Credit,25000
100,2019-03-06 7:42,Debit,4000
100,2019-03-07 8:54,Debit,1000
101,2019-03-05 5:32,Credit,25000
101,2019-03-06 7:13,Debit,5000
101,2019-03-06 8:54,Debit,2000
"""
df = pd.read_csv(StringIO(data))
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')
# use datetime as the dataframe index
df = df.set_index('datetime')
print(df)
cust_id txn_type txn_amt
datetime
2019-03-05 06:30:00 100 Credit 25000
2019-03-06 07:42:00 100 Debit 4000
2019-03-07 08:54:00 100 Debit 1000
2019-03-05 05:32:00 101 Credit 25000
2019-03-06 07:13:00 101 Debit 5000
2019-03-06 08:54:00 101 Debit 2000
I would like to resample the data to the daily level, summing txn_amt for each combination of cust_id and txn_type. At the same time, I want to standardize the index to 5 days (the data currently covers only 3 days). In essence, this is what I would like to produce:
cust_id txn_type txn_amt
datetime
2019-03-03 100 Credit 0
2019-03-03 100 Debit 0
2019-03-03 101 Credit 0
2019-03-03 101 Debit 0
2019-03-04 100 Credit 0
2019-03-04 100 Debit 0
2019-03-04 101 Credit 0
2019-03-04 101 Debit 0
2019-03-05 100 Credit 25000
2019-03-05 100 Debit 0
2019-03-05 101 Credit 25000
2019-03-05 101 Debit 0
2019-03-06 100 Credit 0
2019-03-06 100 Debit 4000
2019-03-06 101 Credit 0
2019-03-06 101 Debit 7000 => (note: aggregated value)
2019-03-07 100 Credit 0
2019-03-07 100 Debit 1000
2019-03-07 101 Credit 0
2019-03-07 101 Debit 0
So far, I've tried creating a new datetime index, with the idea of resampling and then reindexing against it:
# create a 5 day datetime index
end_dt = max(df.index).to_pydatetime().strftime('%Y-%m-%d')
start_dt = max(df.index) - timedelta(days=4)
start_dt = start_dt.to_pydatetime().strftime('%Y-%m-%d')
dt_index = pd.date_range(start=start_dt, end=end_dt, freq='1D', name='datetime')
However, I am not sure how to go about the grouping part. Resampling with no grouping outputs wrong results:
# resample timeseries so that one row is 1 day's worth of txns
df2 = df.resample(rule='D').sum(numeric_only=True).reindex(dt_index).fillna(0)
print(df2)
cust_id txn_amt
datetime
2019-03-03 0.0 0.0
2019-03-04 0.0 0.0
2019-03-05 201.0 50000.0
2019-03-06 302.0 11000.0
2019-03-07 100.0 1000.0
So, how can I incorporate a grouping of cust_id and txn_type when resampling? I have seen this similar question, but the OP's data structure is different.
I am using reindex here; the key is setting up the MultiIndex:
df.index = pd.to_datetime(df.index).date
df = df.groupby([df.index, df['txn_type'], df['cust_id']]).agg({'txn_amt': 'sum'}).reset_index(level=[1, 2])
# build the full 5-day x cust_id x txn_type index to reindex against
drange = pd.date_range(end=df.index.max(), periods=5)
idx = pd.MultiIndex.from_product([drange, df.cust_id.unique(), df.txn_type.unique()],
                                 names=['datetime', 'cust_id', 'txn_type'])
Newdf = df.set_index(['cust_id', 'txn_type'], append=True).reindex(idx, fill_value=0).reset_index(level=[1, 2])
Newdf
Out[749]:
            cust_id txn_type  txn_amt
datetime
2019-03-03 100 Credit 0
2019-03-03 100 Debit 0
2019-03-03 101 Credit 0
2019-03-03 101 Debit 0
2019-03-04 100 Credit 0
2019-03-04 100 Debit 0
2019-03-04 101 Credit 0
2019-03-04 101 Debit 0
2019-03-05 100 Credit 25000
2019-03-05 100 Debit 0
2019-03-05 101 Credit 25000
2019-03-05 101 Debit 0
2019-03-06 100 Credit 0
2019-03-06 100 Debit 4000
2019-03-06 101 Credit 0
2019-03-06 101 Debit 7000
2019-03-07 100 Credit 0
2019-03-07 100 Debit 1000
2019-03-07 101 Credit 0
2019-03-07 101 Debit 0
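For reference, a minimal alternative sketch that folds the daily bucketing into the groupby itself via pd.Grouper (assuming df and dt_index exactly as built in the question):
# group by day (taken from the datetime index), customer and transaction type
daily = df.groupby([pd.Grouper(freq='D'), 'cust_id', 'txn_type'])['txn_amt'].sum()
# reindex against the full 5-day x cust_id x txn_type grid, filling gaps with 0
full_idx = pd.MultiIndex.from_product(
    [dt_index, df['cust_id'].unique(), df['txn_type'].unique()],
    names=['datetime', 'cust_id', 'txn_type'])
result = daily.reindex(full_idx, fill_value=0).reset_index(level=[1, 2])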
Given two series like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are interleaved: each date marks either the beginning or the end of a period. The first series marks the end of a period-1 stretch, the second marks the end of a period-2 stretch; the end of a period-2 stretch is at the same time the start of a period-1 stretch, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much preferable.
Thank you!
p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
df = pd.concat([p1, p2]).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df = df.drop('val', axis=1)
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
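If you really want to address rows by the full start-and-stop range, one hedged option is an IntervalIndex (a sketch, assuming the Start/Stop frame built just above; the first row is dropped because it has no Start):
tmp = df.dropna(subset=['Start']).copy()
# index each row by its start-stop date range
tmp.index = pd.IntervalIndex.from_arrays(pd.to_datetime(tmp['Start']),
                                         pd.to_datetime(tmp['Stop']),
                                         closed='left')
# rows can now be selected (or grouped) by any date the range contains
print(tmp.loc[pd.Timestamp('2020-07-15')])
Another option builds the result directly from the two original series (df1 = period1 and df2 = period2, both indexed by DATE):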
# If needed:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CHG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CHG', 'PERIOD']]
print(df)
Output:
CHG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1
I used Pandas' resample function to calculate the sales of a list of products every 6 months.
I used the resample function with '6M' and apply({"column-name": "sum"}).
Now I'd like to create a table with the sum of the sales for the first six months.
How can I extract the sum of the first 6 months, given that all products have records for more than 3 years and none of them share the same start date?
Thanks in advance for any suggestions.
Here is an example of the data:
Product Date sales
Product 1 6/30/2017 20
12/31/2017 60
6/30/2018 50
12/31/2018 100
Product 2 1/31/2017 30
7/31/2017 150
1/31/2018 200
7/31/2018 300
1/31/2019 100
While waiting for your data, I worked on this. See if this is something that will be helpful for you.
import pandas as pd
df = pd.DataFrame({'Date':['2018-01-10','2018-02-15','2018-03-18',
'2018-07-10','2018-09-12','2018-10-14',
'2018-11-16','2018-12-20','2019-01-10',
'2019-04-15','2019-06-12','2019-10-18',
'2019-12-02','2020-01-05','2020-02-25',
'2020-03-15','2020-04-11','2020-07-22'],
'Sales':[200,300,100,250,150,350,150,200,250,
200,300,100,250,150,350,150,200,250]})
# first break down the data into yearly quarters
df['YQtr'] = pd.PeriodIndex(pd.to_datetime(df.Date), freq='Q')
# next create a column identifying the half year - H1 for Jan-Jun & H2 for Jul-Dec
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q1','Q2']), 'HYear'] = df['YQtr'].astype(str).str[:-2] + 'H1'
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q3','Q4']), 'HYear'] = df['YQtr'].astype(str).str[:-2] + 'H2'
# do a cumulative sum within each half year to get running sales for H1 & H2
df['HYear_cumsum'] = df.groupby('HYear')['Sales'].cumsum()
# now keep only the rows holding the max value - that's the H1 & H2 sales figure
df1 = df[df.groupby('HYear')['HYear_cumsum'].transform('max') == df['HYear_cumsum']]
print(df)
print(df1)
The output of this will be:
Source Data + Half Year cumulative sum:
Date Sales YQtr HYear HYear_cumsum
0 2018-01-10 200 2018Q1 2018H1 200
1 2018-02-15 300 2018Q1 2018H1 500
2 2018-03-18 100 2018Q1 2018H1 600
3 2018-07-10 250 2018Q3 2018H2 250
4 2018-09-12 150 2018Q3 2018H2 400
5 2018-10-14 350 2018Q4 2018H2 750
6 2018-11-16 150 2018Q4 2018H2 900
7 2018-12-20 200 2018Q4 2018H2 1100
8 2019-01-10 250 2019Q1 2019H1 250
9 2019-04-15 200 2019Q2 2019H1 450
10 2019-06-12 300 2019Q2 2019H1 750
11 2019-10-18 100 2019Q4 2019H2 100
12 2019-12-02 250 2019Q4 2019H2 350
13 2020-01-05 150 2020Q1 2020H1 150
14 2020-02-25 350 2020Q1 2020H1 500
15 2020-03-15 150 2020Q1 2020H1 650
16 2020-04-11 200 2020Q2 2020H1 850
17 2020-07-22 250 2020Q3 2020H2 250
And the filtered rows - the last row of each half year, whose HYear_cumsum is that half year's total:
Date Sales YQtr HYear HYear_cumsum
2 2018-03-18 100 2018Q1 2018H1 600
7 2018-12-20 200 2018Q4 2018H2 1100
10 2019-06-12 300 2019Q2 2019H1 750
12 2019-12-02 250 2019Q4 2019H2 350
16 2020-04-11 200 2020Q2 2020H1 850
17 2020-07-22 250 2020Q3 2020H2 250
I will look at your sample data and work on it later tonight.
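In the meantime, here is a minimal sketch for the "first six months per product" part, assuming a flat frame with the question's columns Product, Date and sales, where Product may be blank on continuation rows:
df['Product'] = df['Product'].ffill()  # fill down the blank continuation rows
df['Date'] = pd.to_datetime(df['Date'])
# each product's own start date, broadcast back onto its rows
start = df.groupby('Product')['Date'].transform('min')
# keep rows within six months of that start, then sum per product
first6 = df[df['Date'] < start + pd.DateOffset(months=6)].groupby('Product')['sales'].sum()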
I have a df as shown below.
df:
ID open_date limit
1 2020-06-03 100
1 2020-06-23 500
1 2019-06-29 300
1 2018-06-29 400
From the above I would like to calculate a column named age_in_days.
age_in_days is the number of days from open_date to 2020-06-30.
Expected output
ID open_date limit age_in_days
1 2020-06-03 100 27
1 2020-06-23 500 7
1 2019-06-29 300 367
1 2018-06-29 400 732
Make sure open_date is datetime dtype, then subtract it from 2020-06-30:
df['open_date'] = pd.to_datetime(df.open_date)
df['age_in_days'] = (pd.Timestamp('2020-06-30') - df.open_date).dt.days
Out[209]:
ID open_date limit age_in_days
0 1 2020-06-03 100 27
1 1 2020-06-23 500 7
2 1 2019-06-29 300 367
3 1 2018-06-29 400 732
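If the reference date should be today rather than a fixed 2020-06-30, the same pattern works with a normalized timestamp (a small variant, not part of the original answer):
df['age_in_days'] = (pd.Timestamp('today').normalize() - df['open_date']).dt.days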
I have data in a DataFrame (df1) that starts and ends as shown below. I'm trying to shift the "0" and "1" columns back one hour, so that the date and time start at hour == 0 rather than hour == 1.
data starts (df1) -
0 1 2 3 4 5 6 7
0 20160101 100 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 200 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 300 9.515370 165931.0 20160101 300 8.019863 8100.0
data ends (df1) -
0 1 2 3 4 5 6 7
8780 20161231 2100 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2200 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2300 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20170101 0 3.284947 1815.0 20170101 0 0.899262 NaN
I need the date and time to start shifted back one hour so start time is hour begin not hour end -
0 1 2 3 4 5 6 7
0 20160101 000 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 100 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 200 9.515370 165931.0 20160101 300 8.019863 8100.0
and ends like this with the date and time below -
0 1 2 3 4 5 6 7
8780 20161231 2000 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2100 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2200 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20161231 2300 3.284947 1815.0 20170101 0 0.899262 NaN
I have no real idea how to accomplish this or how to research it. Thank you.
It would be better to build a proper datetime object and subtract the hour from that, which handles any rollover across day boundaries. We can then use dt.strftime to re-create your original (string) columns.
s = pd.to_datetime(df[0].astype(str) + df[1].astype(str).str.zfill(4), format="%Y%m%d%H%M")
0 2016-01-01 01:00:00
1 2016-01-01 02:00:00
2 2016-01-01 03:00:00
8780 2016-12-31 21:00:00
8781 2016-12-31 22:00:00
8782 2016-12-31 23:00:00
8783 2017-01-01 00:00:00
dtype: datetime64[ns]
shifted = s - pd.DateOffset(hours=1)
df[1] = shifted.dt.strftime("%H%M").str.lstrip("0").str.zfill(3)
df[0] = shifted.dt.strftime("%Y%m%d")
print(df)
0 1 2 3 4 5 6 7
0 20160101 000 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 100 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 200 9.515370 165931.0 20160101 300 8.019863 8100.0
8780 20161231 2000 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2100 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2200 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20161231 2300 3.284947 1815.0 20170101 0 0.899262 NaN
Use DataFrame.shift to shift columns 0 and 1 down one row, then use Series.bfill on column 0 of df2 to fill the missing first value, use .fillna on column 1 to fill its NaN, and finally use DataFrame.join to join df2 with the remaining columns of df1:
df2 = df1[['0', '1']].shift()
df2['0'] = df2['0'].bfill()
df2['1'] = df2['1'].fillna('000')
df2 = df2.join(df1.loc[:, '2':])
# print(df2)
0 1 2 3 4 5 6 7
0 20160101 000 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 100 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 200 9.515370 165931.0 20160101 300 8.019863 8100.0
...
8780 20161231 2000 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2100 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2200 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20161231 2300 3.284947 1815.0 20170101 0 0.899262 NaN
You can do plain subtraction in pandas (provided the values in your dataframe are numeric, not strings).
Here is an example of how it can be done:
import pandas as pd
df = pd.DataFrame()
df['time'] = [0, 100, 500, 2100, 2300, 0]  # creating the dataframe
df['time'] = df['time'] - 100  # this is what you want to do
Now your data will be subtracted by 100.
There is one edge case: subtracting 100 from 0 gives -100 instead of 2300. You can map those values back with a vectorized assignment:
df.loc[df['time'] == -100, 'time'] = 2300
I have the test data below. There are 3 tables: a sales table, a sales delivery table, and a sales delivery months table.
I need to join all the tables together so that the blue-marked rows (in my screenshot) are connected to the blue-marked rows and the red-marked rows to the red-marked rows.
The join should use the From and To columns that exist in every table, I guess.
Update:
I have tried the following:
SELECT *
FROM Sales co
LEFT JOIN SalesDelivery cd
    ON co.SalesID = cd.SalesID
    AND cd.[From] BETWEEN co.[From] AND co.[To]
    AND cd.[To] BETWEEN co.[From] AND co.[To]
LEFT JOIN SalesDeliveryMonth cdp
    ON cd.SalesDeliveryID = cdp.SalesDeliveryID
    AND cdp.[From] BETWEEN cd.[From] AND cd.[To]
    AND cdp.[To] BETWEEN cd.[From] AND cd.[To]
Sales table:
SalesID Name Revenue From To Current row
100 New CRM 250000.00 1800-01-01 2018-10-03 0
100 New CRM 500000.00 2018-10-03 9999-12-31 1
SalesDelivery table:
SalesID SalesDeliveryID SalesDeliveryName Revenue SalesStart From To Current row
100 AB100 New CRM 250000.00 2018-07-01 1800-01-01 2018-10-03 0
100 AB100 New CRM 500000.00 2018-07-01 2018-10-03 9999-12-31 1
100 ABM100 New CRM - maintenance 0.00 2018-07-01 2018-10-03 9999-12-31 1
SalesDeliveryMonths table:
RevenueMonth Month SalesDeliveryID SalesID From To Current row
833333.3333 2018-07-01 AB100 100 1800-01-01 2018-10-04 0
166666.6667 2018-07-01 AB100 100 2018-10-04 9999-12-31 1
833333.3333 2018-08-01 AB100 100 1800-01-01 2018-10-04 0
166666.6667 2018-08-01 AB100 100 2018-10-04 9999-12-31 1
833333.3333 2018-09-01 AB100 100 1800-01-01 2018-10-04 0
166666.6667 2018-09-01 AB100 100 2018-10-04 9999-12-31 1
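One hedged sketch of a fix (SQL Server quoting assumed, since the engine isn't stated): the historical SalesDeliveryMonths rows run to 2018-10-04 while the historical SalesDelivery rows end on 2018-10-03, so the BETWEEN containment test can never match them. Every table carries a Current row flag, and if the blue/red marking corresponds to current vs. historical versions, joining on that flag connects the matching rows directly:
-- join matching versions via the [Current row] flag instead of range
-- containment; [From], [To] and [Current row] are quoted because they
-- are reserved words or contain a space
SELECT *
FROM Sales co
LEFT JOIN SalesDelivery cd
    ON co.SalesID = cd.SalesID
    AND co.[Current row] = cd.[Current row]
LEFT JOIN SalesDeliveryMonth cdp
    ON cd.SalesDeliveryID = cdp.SalesDeliveryID
    AND cd.[Current row] = cdp.[Current row];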