How to calculate the growth rate of a time series variable in Python pandas

I have data in time series format like:
date value
1-1-2013 100
1-2-2013 200
1-3-2013 300
1-4-2013 400
1-5-2013 500
1-6-2013 600
1-7-2013 700
1-8-2013 650
1-9-2013 450
1-10-2013 350
1-11-2013 250
1-12-2013 150

Use Series.pct_change:
In [458]: df['growth rate'] = df.value.pct_change()
In [459]: df
Out[459]:
date value growth rate
0 1-1-2013 100 NaN
1 1-2-2013 200 1.000000
2 1-3-2013 300 0.500000
3 1-4-2013 400 0.333333
4 1-5-2013 500 0.250000
5 1-6-2013 600 0.200000
6 1-7-2013 700 0.166667
7 1-8-2013 650 -0.071429
8 1-9-2013 450 -0.307692
9 1-10-2013 350 -0.222222
10 1-11-2013 250 -0.285714
11 1-12-2013 150 -0.400000
Or, if you want to show it as a percentage, multiply by 100:
In [480]: df['growth rate'] = df.value.pct_change().mul(100)
In [481]: df
Out[481]:
date value growth rate
0 1-1-2013 100 NaN
1 1-2-2013 200 100.000000
2 1-3-2013 300 50.000000
3 1-4-2013 400 33.333333
4 1-5-2013 500 25.000000
5 1-6-2013 600 20.000000
6 1-7-2013 700 16.666667
7 1-8-2013 650 -7.142857
8 1-9-2013 450 -30.769231
9 1-10-2013 350 -22.222222
10 1-11-2013 250 -28.571429
11 1-12-2013 150 -40.000000

Growth rate computed within each year (generic column names):
df['col'] = df.groupby(['Year'])['col2'].pct_change(periods=1) * 100
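A concrete, runnable version of the per-year variant (a minimal sketch; the 'Year' and 'value' columns and the numbers are hypothetical stand-ins for your own data):

import pandas as pd

# Hypothetical monthly values spanning two years.
df = pd.DataFrame({
    'Year':  [2013, 2013, 2013, 2014, 2014, 2014],
    'value': [100, 200, 300, 150, 300, 450],
})

# pct_change restarts inside each group, so the first row of every year
# has no predecessor and comes out as NaN.
df['growth rate'] = df.groupby('Year')['value'].pct_change(periods=1) * 100
print(df)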

Related

How to verify data migration from a sequence of rows? Case statement attempted/failed

I am trying to verify the success of a data migration to two new columns. Our developers created a script to populate the tables and I need to verify with a query.
The two tables are:
dbo.req (the ncv columns are new): req_id, ncv1, ncv2
dbo.req_map: req_id, nach_id, seq
My instructions are as follows:
for each req_id in dbo.req, find all matching rows in dbo.req_map
traverse all matching rows, starting with the lowest seq value, ascending
insert the first "nach_id" value into dbo.req.ncv1 column
traverse the remaining rows, if another unique "nach_id" value is found, insert it into ncv2 column
then stop processing this req_id
Example data below. I can manage the first half of the query, but can't figure out how to traverse for the ncv2 column. My query thus far:
SELECT
    cr.req_id,
    cr.ncv1,
    cr.ncv2,
    ccm.nach_id,
    ccm.seq,
    CASE WHEN ccm.nach_id = cr.ncv1
           OR (ccm.nach_id = 0 AND cr.ncv2 IS NULL) -- "= null" never matches; use IS NULL
         THEN 'OK' ELSE ':('
    END AS "Match?"
FROM dbo.req cr
LEFT JOIN dbo.req_map ccm
    ON cr.req_id = ccm.req_id
Example data below:
dbo.req
req_id  ncv1  ncv2
23227   100   1541
25000   100   1541

dbo.req_map
req_id  nach_id  seq
23227   100      0
23227   100      1
23227   1541     2
23227   100      3
23227   100      4
23227   100      5
23227   100      6
23227   100      7
23227   0        8
23227   0        9
23227   100      10
23227   100      11
23227   100      12
23227   0        13
23227   0        14
23227   0        15
23227   100      16
23227   100      17
23227   100      18
23227   0        19
23227   100      20
23227   0        21
23227   1541     22
23227   1541     23
25000   100      0
25000   100      1
25000   1541     2
25000   100      3
25000   100      4
25000   100      5
25000   100      6
25000   100      7
25000   100      8
25000   100      9
25000   100      10
25000   100      11
25000   100      12
25000   100      13
25000   0        14
25000   0        15
25000   100      16
25000   100      17
25000   100      18
25000   0        19
25000   100      20
25000   0        21
25000   1541     22
25000   1541     23
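Since no SQL answer is included here, a pandas cross-check of the same traversal logic may help as a sanity test (a minimal sketch, not the SQL solution; the two DataFrames hold an abbreviated version of the sample data, and the sketch assumes nach_id = 0 counts as an ordinary value):

import pandas as pd

# Abbreviated versions of dbo.req and dbo.req_map from the example above.
req = pd.DataFrame({'req_id': [23227, 25000],
                    'ncv1': [100, 100],
                    'ncv2': [1541, 1541]})
req_map = pd.DataFrame({
    'req_id':  [23227]*5 + [25000]*5,
    'nach_id': [100, 100, 1541, 100, 0, 100, 100, 1541, 100, 0],
    'seq':     [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
})

def expected_ncvs(group):
    # Walk rows in ascending seq order: the first nach_id is the expected
    # ncv1; the next distinct nach_id, if any, is the expected ncv2.
    group = group.sort_values('seq')
    first = group['nach_id'].iloc[0]
    rest = group.loc[group['nach_id'] != first, 'nach_id']
    second = rest.iloc[0] if not rest.empty else None
    return pd.Series({'exp_ncv1': first, 'exp_ncv2': second})

expected = req_map.groupby('req_id').apply(expected_ncvs)
check = req.set_index('req_id').join(expected)
check['ok'] = ((check['ncv1'] == check['exp_ncv1'])
               & (check['ncv2'] == check['exp_ncv2']))
print(check)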

Year wise aggregation on the given condition in pandas

I have a data frame as shown below: sales data for two health care products from December 2016 to November 2018.
product price sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above, I would like to prepare the data frame below.
Expected Output:
product year number_of_months total_price total_discount number_of_sales
A 2016 1 50 5 1
B 2016 1 200 10 1
A 2017 12 250 25 5
B 2017 12 1000 110 5
A 2018 11 100 18 2
B 2018 11 900 130 3
Note: the data runs from Dec 2016 to Nov 2018, so the number of months is 1 in 2016, 12 in 2017 (full year), and 11 in 2018.
First aggregate the sums by product and year, then create a new column with the month counts using DataFrame.insert and Series.map:
df1 = (df.groupby(['product', df['sale_date'].dt.year], sort=False).sum().add_prefix('total_')
         .reset_index())
df1.insert(2,'number_of_months', df1['sale_date'].map({2016:1, 2017:12, 2018:11}))
print (df1)
product sale_date number_of_months total_price total_discount
0 A 2016 1 50 5
1 A 2017 12 250 25
2 B 2016 1 200 10
3 B 2017 12 1000 110
4 A 2018 11 100 18
5 B 2018 11 900 130
If you want a dynamic dictionary built from the minimal and maximal datetimes, use:
s = pd.date_range(df['sale_date'].min(), df['sale_date'].max(), freq='MS')
d = s.year.value_counts().to_dict()
print (d)
{2017: 12, 2018: 11, 2016: 1}
df1 = (df.groupby(['product', df['sale_date'].dt.year], sort=False).sum().add_prefix('total_')
         .reset_index())
df1.insert(2,'number_of_months', df1['sale_date'].map(d))
print (df1)
product sale_date number_of_months total_price total_discount
0 A 2016 1 50 5
1 A 2017 12 250 25
2 B 2016 1 200 10
3 B 2017 12 1000 110
4 A 2018 11 100 18
5 B 2018 11 900 130
For plotting, use DataFrame.set_index with DataFrame.unstack:
df2 = (df1.set_index(['sale_date','product'])[['total_price','total_discount']]
.unstack(fill_value=0))
df2.columns = df2.columns.map('_'.join)
print (df2)
total_price_A total_price_B total_discount_A total_discount_B
sale_date
2016 50 200 5 10
2017 250 1000 25 110
2018 100 900 18 130
df2.plot()
EDIT: to also get number_of_sales, use named aggregation:
df1 = (df.groupby(['product', df['sale_date'].dt.year], sort=False)
         .agg(total_price=('price', 'sum'),
              total_discount=('discount', 'sum'),
              number_of_sales=('discount', 'size'))
         .reset_index())
df1.insert(2,'number_of_months', df1['sale_date'].map({2016:1, 2017:12, 2018:11}))
print (df1)
product sale_date number_of_months total_price total_discount \
0       A       2016                 1          50              5
1       A       2017                12         250             25
2       B       2016                 1         200             10
3       B       2017                12        1000            110
4       A       2018                11         100             18
5       B       2018                11         900            130
number_of_sales
0 1
1 5
2 1
3 5
4 2
5 3
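Putting the pieces together, an end-to-end runnable sketch (the frame is rebuilt from the sample data above, with sale_date already parsed as datetime; the month counts are derived from the data instead of hard-coded):

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'product': ['A','A','B','A','B','A','B','A','B','A','B','B','A','B','B','A','B'],
    'price': [50,50,200,50,200,50,200,50,200,50,200,200,50,300,300,50,300],
    'sale_date': pd.to_datetime([
        '2016-12-01','2017-01-03','2016-12-24','2017-01-18','2017-01-28',
        '2017-01-18','2017-01-28','2017-04-18','2017-12-08','2017-11-18',
        '2017-08-21','2017-12-28','2018-03-18','2018-06-08','2018-09-20',
        '2018-11-18','2018-11-28']),
    'discount': [5,4,10,3,15,6,20,6,25,6,20,30,10,45,50,8,35],
})

# Months covered per year, derived from the data itself.
months = pd.date_range(df['sale_date'].min(), df['sale_date'].max(), freq='MS')
month_counts = months.year.value_counts().to_dict()   # {2017: 12, 2018: 11, 2016: 1}

df1 = (df.groupby(['product', df['sale_date'].dt.year.rename('year')], sort=False)
         .agg(total_price=('price', 'sum'),
              total_discount=('discount', 'sum'),
              number_of_sales=('discount', 'size'))
         .reset_index())
df1.insert(2, 'number_of_months', df1['year'].map(month_counts))
print(df1)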

groupby sum month wise on date time data

I have transaction data as shown below, covering three months.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-18
6 1 PLATINUM GROCERY 400 11-Feb-18
7 1 PLATINUM RESTRAUNT 600 21-Feb-18
8 1 PLATINUM GROCERY 800 17-Mar-18
9 1 PLATINUM GROCERY 200 21-Mar-18
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-18
16 2 GOLD GROCERY 400 11-Feb-18
17 2 GOLD RESTRAUNT 600 21-Mar-18
18 2 GOLD GROCERY 200 21-Mar-18
19 2 GOLD HOTEL 700 25-Mar-18
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-18
26 3 SILVER GROCERY 400 11-Feb-18
27 3 SILVER HOTEL 600 25-Mar-18
28 3 SILVER GROCERY 200 29-Mar-18
29 3 SILVER RESTRAUNT 700 30-Mar-18
I am struggling to get the data frame below.
Card_No Card_Type D Jan_Sp Jan_N Feb_Sp Feb_N Mar_Sp GR_T RES_T
1 PLATINUM 70 3300 5 1500 3 1000 2300 100
2 GOLD 72 5200 5 1900 2 1500 2300 1100
3 SILVER 76 2900 5 900 2 1500 1800 1700
D = Duration in days from first to last transaction.
Jan_Sp = Total spending in January.
Feb_Sp = Total spending in February.
Mar_Sp = Total spending in March.
Jan_N = Number of transactions in January.
Feb_N = Number of transactions in February.
GR_T = Total spending on GROCERY.
RES_T = Total spending on RESTRAUNT.
I tried the following code. I am very new to pandas.
q9['Date'] = pd.to_datetime(q9['Date'])
q9 = q9.sort_values(['Card_Number', 'Date'])
q9['D'] = q9.groupby('Card_Number')['Date'].diff().dt.days
My approach has three steps:
get the date range
get the monthly spending
get the category spending
Step 1: Date
date_df = df.groupby('Card_type').Date.apply(lambda x: (x.max()-x.min()).days)
Step 2: Month
month_df = (df.groupby(['Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack('Date')
           )
# rename
month_df.columns = [b+a for a,b in month_df.columns]
Step 3: Category
cat_df = df.pivot_table(index='Card_type',
                        columns='Category',
                        values='Amount',
                        aggfunc='sum')
# rename
cat_df.columns = [a[:2]+"_T" for a in cat_df.columns]
And finally concat:
pd.concat( (date_df, month_df, cat_df), axis=1)
gives:
Date Feb_Sp Jan_Sp Mar_Sp Feb_N Jan_N Mar_N GR_T HO_T RE_T
Card_type
GOLD 72 1900 5200 1500 2 5 3 2300 5200 1100
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500 1000
SILVER 76 900 3200 1500 2 5 3 1800 2100 1700
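Because concat emits the columns in alphabetical order, you can optionally reorder them to the calendar layout asked for in the question (a small sketch using the column names produced above):

result = pd.concat((date_df, month_df, cat_df), axis=1)
# Reorder to match the requested layout.
order = ['Date', 'Jan_Sp', 'Jan_N', 'Feb_Sp', 'Feb_N', 'Mar_Sp', 'Mar_N',
         'GR_T', 'HO_T', 'RE_T']
result = result[order]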
If your data have several years, and you want to separate them by year, then you can add df.Date.dt.year in each groupby above:
date_df = df.groupby([df.Date.dt.year,'Card_type']).Date.apply(lambda x: (x.max()-x.min()).days)
month_df = (df.groupby([df.Date.dt.year, 'Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack(level=-1)
           )
# rename
month_df.columns = [b+a for a,b in month_df.columns]
cat_df = (df.groupby([df.Date.dt.year, 'Card_type', 'Category'])
            .Amount
            .sum()
            .unstack(level=-1)
         )
# rename
cat_df.columns = [a[:2]+"_T" for a in cat_df.columns]
pd.concat((date_df, month_df, cat_df), axis=1)
gives:
Date Feb_Sp Jan_Sp Mar_Sp Feb_N Jan_N Mar_N GR_T HO_T
Date Card_type
2017 GOLD 72 1900 5200 1500 2 5 3 2300 5200
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500
SILVER 76 900 3200 1500 2 5 3 1800 2100
2018 GOLD 72 1900 5200 1500 2 5 3 2300 5200
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500
SILVER 76 900 3200 1500 2 5 3 1800 2100
I would recommend keeping the dataframe this way, so you can access the annual data, e.g. result_df.loc[2017] gives you the 2017 data. If you really want the years as columns, you can do result_df.unstack(level=0).
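For example (a small usage sketch; result_df is a hypothetical name for the concatenated frame above):

result_df = pd.concat((date_df, month_df, cat_df), axis=1)
all_2017 = result_df.loc[2017]             # every card type for 2017
gold_2017 = result_df.loc[(2017, 'GOLD')]  # a single row: GOLD in 2017
wide = result_df.unstack(level=0)          # years moved into the columns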

pandas groupby sum, count by date time, where consider only year

I have transaction data as shown below, covering three months spread across the years 2017-2019.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-19
6 1 PLATINUM GROCERY 400 11-Feb-19
7 1 PLATINUM RESTRAUNT 600 21-Feb-19
8 1 PLATINUM GROCERY 800 17-Mar-17
9 1 PLATINUM GROCERY 200 21-Mar-17
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-19
16 2 GOLD GROCERY 400 11-Feb-19
17 2 GOLD RESTRAUNT 600 21-Mar-17
18 2 GOLD GROCERY 200 21-Mar-17
19 2 GOLD HOTEL 700 25-Mar-17
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-19
26 3 SILVER GROCERY 400 11-Feb-19
27 3 SILVER HOTEL 600 25-Mar-17
28 3 SILVER GROCERY 200 29-Mar-17
29 3 SILVER RESTRAUNT 700 30-Mar-17
I am struggling to get the data frame below.
Card_No Card_Type D 2018_Sp 2018_N 2019_Sp 2019_N 2017_Sp
1 PLATINUM 70 3300 5 1500 3 1000
2 GOLD 72 5200 5 1900 2 1500
3 SILVER 76 2900 5 900 2 1500
D = Duration in days from first to last transaction.
2018_Sp = Total spending in 2018.
2019_Sp = Total spending in 2019.
2017_Sp = Total spending in 2017.
2018_N = Number of transactions in 2018.
2019_N = Number of transactions in 2019.
Use:
#convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#sorting if necessary
df = df.sort_values(['Card_Number','Card_type', 'Date'])
#aggregate count and sum
df1 = (df.groupby(['Card_Number','Card_type', df['Date'].dt.year])['Amount']
         .agg([('Sp', 'sum'), ('N', 'size')])
         .unstack()
         .sort_index(axis=1, level=1))
#MultiIndex to columns
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
#duration in days (differs from the question's expected D because the dates span several years)
s = df.groupby('Card_type').Date.apply(lambda x: (x.max()-x.min()).days).rename('D')
#join together
df1 = df1.join(s).reset_index()
print (df1)
Card_Number Card_type 2017_N 2017_Sp 2018_N 2018_Sp 2019_N 2019_Sp \
0 1 PLATINUM 2 1000 5 3300 3 1500
1 2 GOLD 3 1500 5 5200 2 1900
2 3 SILVER 3 1500 5 3200 2 900
D
0 706
1 692
2 688
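For reference, the same Sp/N aggregation can be written with named aggregation (pandas 0.25+); a sketch equivalent to the list-of-tuples version above, assuming df already has Date parsed as datetime:

df1 = (df.groupby(['Card_Number', 'Card_type', df['Date'].dt.year.rename('year')])['Amount']
         .agg(Sp='sum', N='size')   # Sp = spending, N = number of transactions
         .unstack('year'))
df1.columns = [f'{year}_{name}' for name, year in df1.columns]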

Splitting Column Headers and Duplicating Row Values in Pandas Dataframe

In the example df below, I'm trying to find a way to split the column headers ('1;2','4','5;6') on the ';' and duplicate the row values into the resulting split columns. (My actual df comes from an imported csv file, so I generally have around 50-80 column headers that need splitting.)
Below is my code with its output:
import pandas as pd
import numpy as np
data = np.array([['Market','Product Code','1;2','4','5;6'],
['Total Customers',123,1,500,400],
['Total Customers',123,2,400,320],
['Major Customer 1',123,1,100,220],
['Major Customer 1',123,2,230,230],
['Major Customer 2',123,1,130,30],
['Major Customer 2',123,2,20,10],
['Total Customers',456,1,500,400],
['Total Customers',456,2,400,320],
['Major Customer 1',456,1,100,220],
['Major Customer 1',456,2,230,230],
['Major Customer 2',456,1,130,30],
['Major Customer 2',456,2,20,10]])
df = pd.DataFrame(data)
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
print (df)
0 Market Product Code 1;2 4 5;6
1 Total Customers 123 1 500 400
2 Total Customers 123 2 400 320
3 Major Customer 1 123 1 100 220
4 Major Customer 1 123 2 230 230
5 Major Customer 2 123 1 130 30
6 Major Customer 2 123 2 20 10
7 Total Customers 456 1 500 400
8 Total Customers 456 2 400 320
9 Major Customer 1 456 1 100 220
10 Major Customer 1 456 2 230 230
11 Major Customer 2 456 1 130 30
12 Major Customer 2 456 2 20 10
Below is my desired output
0 Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
Ideally I would like to perform such a task at the 'read_csv' level. Any thoughts?
Try reindex with repeat:
s = df.columns.str.split(';')
df = df.reindex(columns=df.columns.repeat(s.str.len()))
df.columns = sum(s.tolist(), [])
df
Out[247]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
You can split the columns with ';' and then reconstruct a df:
pd.DataFrame({c:df[t] for t in df.columns for c in t.split(';')})
Out[157]:
1 2 4 5 6 Market Product Code
1 1 1 500 400 400 Total Customers 123
2 2 2 400 320 320 Total Customers 123
3 1 1 100 220 220 Major Customer 1 123
4 2 2 230 230 230 Major Customer 1 123
5 1 1 130 30 30 Major Customer 2 123
6 2 2 20 10 10 Major Customer 2 123
7 1 1 500 400 400 Total Customers 456
8 2 2 400 320 320 Total Customers 456
9 1 1 100 220 220 Major Customer 1 456
10 2 2 230 230 230 Major Customer 1 456
11 1 1 130 30 30 Major Customer 2 456
12 2 2 20 10 10 Major Customer 2 456
Or if you would like to preserve column order:
pd.concat([df[t].to_frame(c) for t in df.columns for c in t.split(';')], axis=1)
Out[167]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10