data frame:
ID spend month_diff
12 10 -5
12 10 -4
12 10 -3
12 10 1
12 10 -2
12 20 0
12 30 2
12 10 -1
I want to get the spend_total based on the month difference for a particular ID. A negative month_diff means spend done by the customer last year and a positive one means this year. So I want to compare customers' spend for the past year and this year. The conditions are as follows:
Conditions:
if -2 <= month_diff < 0 then cumulative spend for the negative months -> flag=pre
if 0 < month_diff <= 2 then cumulative spend for the positive months -> flag=post
Note: the number of positive and negative month_diff rows is not the same. It might be the case that a customer had 4 transactions with a negative month_diff and only 2 with a positive one, so I want to take only the 2-month cumulative sum on each side, and I don't want to consider the spend where month_diff is 0.
Desired data frame:
ID spend month_diff spend_tot flag
12 10 -2 20 pre
12 30 2 40 post
40 is the cumulative sum of spend for month_diff +1 and +2 (i.e. 10 + 30), and likewise for month_diff -1 and -2 the cumulative spend is 20 (i.e. 10 + 10).
Use:
import numpy as np

#filter rows whose month_diff falls in the 2-month window on either side
df = df[df['month_diff'].isin([1, 2, -1, -2])]
#keep only months that appear on both sides, by matching absolute month_diff per ID
df = df[df.assign(a=df['month_diff'].abs()).duplicated(['ID', 'a'], keep=False)]
#sign column: -1 for last year's months, 1 for this year's
a = np.sign(df['month_diff'])
#aggregate sum of spend and last month_diff per sign group
df1 = (df.groupby(['ID', a])
         .agg({'month_diff':'last', 'spend':'sum'})
         .reset_index(level=1, drop=True)
         .reset_index())
df1['flag'] = np.select([df1['month_diff'].ge(-2) & df1['month_diff'].lt(0),
                         df1['month_diff'].gt(0) & df1['month_diff'].le(2)],
                        ['pre','post'], default='another val')
print (df1)
ID month_diff spend flag
0 12 -1 20 pre
1 12 2 40 post
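For a quick end-to-end check, the sample frame can be built like this (a minimal sketch I'm adding for convenience; feeding it into the pipeline above reproduces the printed result):

import pandas as pd

df = pd.DataFrame({
    'ID': [12, 12, 12, 12, 12, 12, 12, 12],
    'spend': [10, 10, 10, 10, 10, 20, 30, 10],
    'month_diff': [-5, -4, -3, 1, -2, 0, 2, -1],
})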
I have the input table below, and I need to create a query that generates the output table shown below.
Time should be accumulated, and the summing should stop when a record with both time and qty defined appears, restarting from the record after it. Spent_Qty is the sum of all qty values, whether from time-and-qty records or qty-only records, up to the next record with non-zero time and qty.
Example:
The first 3 rows have no meaning. The 4th row has a Qty defined, but the next row has a time defined, so that qty belongs to the previous time.
The 5th row has 3.75 (decimal time) and no Qty, so it needs to be summed with the next record that has a qty defined. The 6th row has both defined, so the time total is now 7.25 (time / 60). The 6th row has 2 qty defined, the 7th row has 0 time and 0 qty, and the 8th row has no time but shows 0.5 qty. These should be summed with the 6th row, giving 2.5. The 9th row has hours defined, so the qty accumulation stops and restarts from there.
The result:
7.25 hrs took 2.5 spent qty
Example:
INPUT:
Time    Qty
0       0
0       0
0       0
0       1
3.75    0
3.5     2
0       0
0       0.5
2.5     0
2.5     0.5
0       0.5
0       0
3       0
3.5     0.4
0       0.5
1       0
3       2
0       0
0       2
0       1
4       1
1.75    0
1.75    0
0       1
0.75    1
Output:

TOT_TIME    Spent QTY
7.25        2.5
5           1
6.5         0.9
4           5
4           1
3.5         1
0.75        1
I have used LEAD, LAG and other analytic functions, and I need to write a SELECT statement that produces this result along with a few other columns, but it's not working out.
You can use:
SELECT *
FROM   table_name
MATCH_RECOGNIZE(
  ORDER BY rn
  MEASURES
    SUM(time) AS total_time,
    SUM(qty)  AS total_qty
  -- either the leading run of zero-time rows, or a reluctant run of any rows
  -- up to a row with both time and qty, plus the zero-time rows trailing it
  PATTERN ( ^ no_time* | any_row*? time_and_qty no_time* )
  DEFINE
    time_and_qty AS time > 0 AND qty > 0,
    no_time      AS time = 0
)
Which, for the sample data, outputs:
TOTAL_TIME    TOTAL_QTY
0             1
7.25          2.5
5             1
6.5           .9
4             5
4             1
4.25          2
Note: the final 4 rows are aggregated together because of the rule "The time should be accumulated and the summing up should stop when a record with both time and qty is defined and should restart from there." It is not until the final row that both time and qty are present.
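If you want to sanity-check the grouping outside the database, here is a minimal pandas sketch of the same segmentation (my own illustration, not part of the original answer): the leading run of zero-time rows forms its own group, and afterwards a new group starts at the first non-zero-time row following a row that has both time and qty.

import pandas as pd

# sample data from the question, as (Time, Qty) columns
df = pd.DataFrame({
    'Time': [0, 0, 0, 0, 3.75, 3.5, 0, 0, 2.5, 2.5, 0, 0, 3,
             3.5, 0, 1, 3, 0, 0, 0, 4, 1.75, 1.75, 0, 0.75],
    'Qty':  [0, 0, 0, 1, 0, 2, 0, 0.5, 0, 0.5, 0.5, 0, 0,
             0.4, 0.5, 0, 2, 0, 2, 1, 1, 0, 0, 1, 1],
})

groups = []
g = 0
trailing = True  # the leading zero-time run behaves like a trailing run
for t, q in zip(df['Time'], df['Qty']):
    if trailing and t > 0:  # trailing zero-time run has ended: new group
        g += 1
        trailing = False
    groups.append(g)
    if t > 0 and q > 0:     # both defined: following zero-time rows trail it
        trailing = True

print(df.groupby(groups).agg(TOTAL_TIME=('Time', 'sum'),
                             TOTAL_QTY=('Qty', 'sum')))

This prints the same seven (total_time, total_qty) pairs as the MATCH_RECOGNIZE query above.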
I have the following dataframe, organized as panel data. It contains daily returns of many companies on different days following the IPO date. day_diff is the number of days that have passed since the IPO, and return_1 is the individual daily return for that day and company, to which I have already added +1. Each company has its own company_tic and I have about 300 companies. My goal is to calculate the first component of the right-hand side of the equation below, i.e. results for each day_diff and company_tic, always starting at day 0, up to the last day of data: from day 0 to day 1, then from day 0 to day 2, from day 0 to day 3, and so on until my last day, which is day 730. I have tried df.groupby(['company_tic', 'day_diff'])['return_1'].expanding().prod(), but it doesn't work. Any alternatives?
Index day_diff company_tic return_1
0 0 xyz 1.8914
1 1 xyz 1.0542
2 2 xyz 1.0016
3 0 abc 1.4398
4 1 abc 1.1023
5 2 abc 1.0233
... ... ... ...
[159236 rows x 3 columns]
I'm not sure I fully get what you want, but you might want to use cumprod instead of expanding().prod().
Here's what I tried:
# cumulative product of return_1 within each company (rows must already be ordered by day_diff)
df['return_1_prod'] = df.groupby('company_tic')['return_1'].cumprod()
Output :
day_diff company_tic return_1 return_1_prod
0 0 xyz 1.8914 1.891400
1 1 xyz 1.0542 1.993914
2 2 xyz 1.0016 1.997104
3 0 abc 1.4398 1.439800
4 1 abc 1.1023 1.587092
5 2 abc 1.0233 1.624071
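For completeness, a minimal self-contained version (a sketch; the sort_values call is my own addition to guarantee day_diff order within each company):

import pandas as pd

df = pd.DataFrame({
    'day_diff': [0, 1, 2, 0, 1, 2],
    'company_tic': ['xyz', 'xyz', 'xyz', 'abc', 'abc', 'abc'],
    'return_1': [1.8914, 1.0542, 1.0016, 1.4398, 1.1023, 1.0233],
})

# make sure each company's rows run from day 0 upward before taking cumprod
df = df.sort_values(['company_tic', 'day_diff'], kind='stable')
df['return_1_prod'] = df.groupby('company_tic')['return_1'].cumprod()
print(df.sort_index())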
I have a dataframe that has columns like these:
Date        earnings  workingday  length_week  first_wday_week  last_wdayweek
01.01.2000  10000     1           1
02.01.2000  0         0           1
03.01.2000  0         0           2
04.01.2000  0         0           2
05.01.2000  0         0           2
06.01.2000  23000     1           2
07.01.2000  1000      1           2
08.01.2000  0         0           2
09.01.2000  0         0           2
...
30.01.2000  0         0           0
31.01.2000  0         1           3
01.02.2000  0         1           3
02.02.2000  2500      1           3
workingday indicates whether earnings are present on that particular day. I am trying to generate the last three columns from the date:
length_week: the number of working days in that week
first_working_day_of_week: 1 if it is the first working day of a week
last_working_day_of_week: 1 if it is the last working day of a week
Can anyone help me with this?
I first changed the format of your Date column, as pd.to_datetime couldn't infer the right date format:

import numpy as np
import pandas as pd

# replace the dots literally (regex=False), then parse as day-month-year
df.Date = df.Date.str.replace('.', '-', regex=False)
df.Date = pd.to_datetime(df.Date, format='%d-%m-%Y')
Then use isocalendar so that we can work with weeks and days more easily:
df[['year', 'week', 'weekday']] = df.Date.dt.isocalendar()
Now length_week is just the sum of workingday within each separate week:
df['length_week'] = df.groupby(['year', 'week']).workingday.transform('sum')
and we can get frst_worday_week with idxmax:
min_indexes = df.groupby(['year', 'week']).workingday.transform('idxmax')
df['frst_worday_week'] = np.where(df.index == min_indexes, 1, 0)
Lastly, last_workdayweek is similar but a bit tricky: we need the last occurrence of the maximum, so we reverse each week inside the groupby:
max_indexes = df.groupby(['year', 'week']).workingday.transform(lambda x: x[::-1].idxmax())
df['last_workdayweek'] = np.where(df.index == max_indexes, 1, 0)
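Putting it together on a small slice of the sample data (the frame construction is my own; the steps are exactly the ones above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['01.01.2000', '02.01.2000', '03.01.2000', '04.01.2000',
             '05.01.2000', '06.01.2000', '07.01.2000', '08.01.2000',
             '09.01.2000'],
    'earnings': [10000, 0, 0, 0, 0, 23000, 1000, 0, 0],
    'workingday': [1, 0, 0, 0, 0, 1, 1, 0, 0],
})

df.Date = pd.to_datetime(df.Date.str.replace('.', '-', regex=False),
                         format='%d-%m-%Y')
df[['year', 'week', 'weekday']] = df.Date.dt.isocalendar()
df['length_week'] = df.groupby(['year', 'week']).workingday.transform('sum')
min_indexes = df.groupby(['year', 'week']).workingday.transform('idxmax')
df['frst_worday_week'] = np.where(df.index == min_indexes, 1, 0)
max_indexes = df.groupby(['year', 'week']).workingday.transform(
    lambda x: x[::-1].idxmax())
df['last_workdayweek'] = np.where(df.index == max_indexes, 1, 0)
print(df)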
data frame:
ID spend month_diff
12 10 -1
12 10 -2
12 20 1
12 30 2
13 15 -1
13 20 -2
13 25 1
13 30 2
I want to get the spend_total based on the month difference for a particular ID. A negative month_diff means spend done by the customer last year and a positive one means this year, so I want to compare customers' spend for the past year and this year. The conditions are as follows:
Conditions:
if -2 <= month_diff < 0 then cumulative spend for the negative months -> flag=pre
if 0 < month_diff <= 2 then cumulative spend for the positive months -> flag=post
Desired data frame:
ID spend month_diff tot_spend flag
12 10 -2 20 pre
12 30 2 50 post
13 20 -2 35 pre
13 30 2 55 post
Use numpy.sign with Series.shift, Series.ne and Series.cumsum to build consecutive groups, and pass them to DataFrame.groupby, aggregating with GroupBy.last and sum.
Last, use numpy.select:
import numpy as np

#sign of month_diff: -1 for last year's months, 1 for this year's
a = np.sign(df['month_diff'])
#consecutive-group id: increments each time the sign changes
g = a.ne(a.shift()).cumsum()
df1 = (df.groupby(['ID', g])
         .agg({'month_diff':'last', 'spend':'sum'})
         .reset_index(level=1, drop=True)
         .reset_index())
df1['flag'] = np.select([df1['month_diff'].ge(-2) & df1['month_diff'].lt(0),
                         df1['month_diff'].gt(0) & df1['month_diff'].le(2)],
                        ['pre','post'], default='another val')
print (df1)
print (df1)
ID month_diff spend flag
0 12 -2 20 pre
1 12 2 50 post
2 13 -2 35 pre
3 13 2 55 post
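To see what the consecutive-group id does, here is the intermediate g on the month_diff values from the question (a small illustration of my own):

import numpy as np
import pandas as pd

s = pd.Series([-1, -2, 1, 2, -1, -2, 1, 2])  # month_diff for IDs 12 and 13
a = np.sign(s)
g = a.ne(a.shift()).cumsum()
print(g.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4] -- new id each time the sign flips

Grouping by ['ID', g] then collapses each run of same-signed months into one row per ID.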
If I have data from week 1 to week 52 and I want a 4-week moving average lagged by 1 week, how can I write a SQL query for this? For example, for week 5 I want the average of weeks 1-4, for week 6 the average of weeks 2-5, and so on.
I have the columns week and target_value in table A.
Sample data is like this:
Week target_value
1 20
2 10
3 10
4 20
5 60
6 20
So the output I want starts from week 5, since that is the first week with a full 4 preceding weeks available.
Output data will look like:
Week  Output
5     15    (20+10+10+20)/4 = 15, moving average of weeks 1-4
6     25    (10+10+20+60)/4 = 25, moving average of weeks 2-5
The data is in Hive, but I can move it to Oracle if it is simpler to do this there.
SELECT
    A.Week,
    (SELECT ISNULL(AVG(B.target_value), A.target_value)
     FROM tblA B
     WHERE B.Week < A.Week        -- strictly earlier weeks only
       AND B.Week >= A.Week - 4   -- at most the 4 preceding weeks
    ) AS Moving_Average
FROM tblA A
The ISNULL keeps you from getting a null for your first week, since there is no week 0. If you want it to be null, just leave the ISNULL function out. (ISNULL is SQL Server syntax; the equivalent in Oracle is NVL and in Hive COALESCE.)
If you want it to start at week 5 only, then add the following line to the end of the SQL that I wrote:
WHERE A.Week > 4
Results:
Week Moving_Average
1 20
2 20
3 15
4 13
5 15
6 25
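Since the data lives in Hive/Oracle, a quick pandas cross-check of the same window may help (my own sketch, assuming one row per week with no gaps):

import pandas as pd

df = pd.DataFrame({'Week': [1, 2, 3, 4, 5, 6],
                   'target_value': [20, 10, 10, 20, 60, 20]})

# mean of the 4 preceding weeks: trailing rolling(4) shifted down one row
df['Moving_Average'] = df['target_value'].rolling(4).mean().shift(1)
print(df[df['Week'] > 4])  # weeks 5 and 6 -> 15.0 and 25.0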