Pandas Group By With Running Total

My granny has some strange ideas. Every birthday she takes me shopping.
She has some strict rules. If a present costs less than $20 she won't contribute anything. If I spend over $20 she will contribute the part above $20, up to $30 in total per birthday.
So if a present costs $27 she would contribute $7.
That leaves $23 of her contribution for extra presents that birthday; the same rules as above apply to any additional presents.
Once the $30 is used up there are no more contributions from granny and I must pay the rest myself.
Here is an example table of my 11th, 12th and 13th birthday.
                        DollarsSpent  granny_pays
BirthDayAge PresentNum
11          1                  25.00         5.00   -- I used up $5
            2                 100.00        25.00   -- I used up the last $25
            3                  10.00         0.00
            4                  50.00         0.00
12          1                  39.00        19.00   -- I used up $19, only $11 left
            2                   7.00         0.00
            3                  32.00        11.00   -- I used up the last $11, even though $32 is $12 above the $20 starting point
            4                  19.00         0.00
13          1                  21.00         1.00   -- used up $1
            2                  27.00         7.00   -- used up $7; $8 used in total and the last $22 is never spent
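To restate the rule in plain Python (a throwaway loop with a made-up helper name, just to check the arithmetic against the table above; the whole point of my question is to avoid this kind of loop):

def granny_contribution(present_costs, threshold=20., limit=30.):
    # granny pays the part of each present above `threshold`,
    # but never more than what is left of her `limit` for that birthday
    remaining = limit
    payments = []
    for cost in present_costs:
        pays = min(max(cost - threshold, 0.), remaining)
        remaining -= pays
        payments.append(pays)
    return payments

granny_contribution([25., 100., 10., 50.])  # [5.0, 25.0, 0.0, 0.0]
granny_contribution([39., 7., 32., 19.])    # [19.0, 0.0, 11.0, 0.0]
granny_contribution([21., 27.])             # [1.0, 7.0]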
So in pandas I have gotten this far.
import pandas as pd
import numpy as np

granny_wont_pay_first = 20.
granny_limit = 30.

df = pd.DataFrame({'BirthDayAge'  : ['11','11','11','11','12','12','12','12','13','13']
                  ,'PresentNum'   : [1,2,3,4,1,2,3,4,1,2]
                  ,'DollarsSpent' : [25.,100.,10.,50.,39.,7.,32.,19.,21.,27.]
                  })
df.set_index(['BirthDayAge','PresentNum'], inplace=True)

df['granny_pays'] = df['DollarsSpent'] - granny_wont_pay_first
df['granny_limit'] = granny_limit
df['zero'] = 0.0
# clamp between 0 and the limit by taking the median of the three columns
df['granny_pays'] = df[['granny_pays','zero','granny_limit']].apply(np.median, axis=1)
df.drop(['granny_limit','zero'], axis=1, inplace=True)

print(df.head(len(df)))
And this is the output. Using the median of the three numbers is a nice way to work out what granny will contribute.
The problem is that each present is treated in isolation, so I don't correctly erode the $30 from present to present within each BirthDayAge.
                        DollarsSpent  granny_pays
BirthDayAge PresentNum
11          1                  25.00         5.00
            2                 100.00        30.00   -- should be 25.0
            3                  10.00         0.00
            4                  50.00        30.00   -- should be 0.0
12          1                  39.00        19.00
            2                   7.00         0.00
            3                  32.00        12.00   -- should be 11.0
            4                  19.00         0.00
13          1                  21.00         1.00
            2                  27.00         7.00
I'm trying to think of a nice pandas way to do this erosion.
Hopefully no loops, please.

I don't know if there is a more concise way, but this should work and does avoid loops as requested.
# assumes BirthDayAge is a regular column (reset_index() first if it is still in the index)
# per-gift subsidy, ignoring the $30 cap: anything over $20, floored at 0
df['per_gift'] = df.DollarsSpent - 20
df['per_gift'] = np.where(df.per_gift > 0, df.per_gift, 0)
# cumulative subsidy per birthday, capped at $30
df['per_bday'] = df.groupby('BirthDayAge').per_gift.cumsum()
df['per_bday'] = np.where(df.per_bday > 30, 30, df.per_bday)
# granny pays the increase of that capped running total at each present
df['granny_pays'] = df.groupby('BirthDayAge').per_bday.diff()
df['granny_pays'] = df.granny_pays.fillna(df.per_bday)   # first present of each birthday
Note that 'per_gift' ignores the maximum subsidy of $30 and 'per_bday' is the cumulative subsidy (capped at $30) per 'BirthDayAge'.
BirthDayAge DollarsSpent PresentNum per_gift per_bday granny_pays
0 11 25 1 5 5 5
1 11 100 2 80 30 25
2 11 10 3 0 30 0
3 11 50 4 30 30 0
4 12 39 1 19 19 19
5 12 7 2 0 19 0
6 12 32 3 12 30 11
7 12 19 4 0 30 0
8 13 21 1 1 1 1
9 13 27 2 7 8 7
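The same capped-cumulative-sum idea can also be written with pandas' clip() and the thresholds kept as variables. This is just a sketch of the approach above, assuming the flat-index frame it was run on (call df.reset_index() first if BirthDayAge is still in the index):

granny_wont_pay_first = 20.
granny_limit = 30.

# part of each present above the $20 threshold, floored at 0
over = (df['DollarsSpent'] - granny_wont_pay_first).clip(lower=0)
# running total per birthday, capped at the $30 limit
capped = over.groupby(df['BirthDayAge']).cumsum().clip(upper=granny_limit)
# granny pays the increase of the capped running total at each present
df['granny_pays'] = capped.groupby(df['BirthDayAge']).diff().fillna(capped)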

Related

PostgreSQL query not fetching correct result for the conditional comparison of aggregate function

I have a products table with the following values, with id as INT and profit as NUMERIC data type:
id profit
1 6.00
2 3.00
3 2.00
4 3.00
5 2.00
6 8.00
7 4.00
8 3.00
9 1.00
10 4.00
11 10.00
12 3.00
13 6.00
14 5.00
15 2.00
16 7.00
17 6.00
18 5.00
19 2.00
20 16.00
21 3.00
22 6.00
23 5.00
24 5.00
25 1.00
26 4.00
27 1.00
28 7.00
29 11.00
30 2.00
31 1.00
32 3.00
33 2.00
34 5.00
35 4.00
I want to fetch the ids which have a profit greater than the average profit.
My QUERY:
SELECT id, profit
FROM products
GROUP BY id, profit
HAVING profit > AVG(profit)::INT
But the above query returns an empty result.
When you execute the GROUP BY query, the records are grouped by the grouping columns and only then is the HAVING clause applied.
So the first group is id 1, further grouped by profit 6.00, making its average 6.00; with the condition profit > AVG(profit), no record in that group matches the criteria.
The same holds for every other group, which is why you get an empty result set: no number can be greater than itself.
Based on your description, what you want can be achieved with a subquery:
select * from products where profit > (select avg(profit) from products)
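For readers coming from the pandas questions in this thread, the same filter is a single boolean mask against the overall average; a small sketch using the first few rows of the table above (column names as shown):

import pandas as pd

products = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                         'profit': [6.0, 3.0, 2.0, 3.0, 2.0, 8.0]})
# keep rows whose profit exceeds the overall average, not a per-group average
above_average = products[products['profit'] > products['profit'].mean()]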

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas DataFrame of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions, like counting records per hour per day and getting an average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
Alternatively, you can group by the 'Id' column and then use the resample function followed by .sum().
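To make the two-step aggregation concrete on the toy data above, here is a short sketch (column names as in the toy table); it reproduces the desired means of 4, 1 and 2.5:

import pandas as pd

toy = pd.DataFrame({
    'Date': ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
            + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id': [1234] * 14,
    'Dow': [0] * 9 + [1] * 5,
    'Hour': [9] * 8 + [10] + [11] * 5,
    'Count': [1] * 14,
})

# step 1: collapse to one row per Date / Id / Dow / Hour (the daily totals)
daily = toy.groupby(['Date', 'Id', 'Dow', 'Hour'], as_index=False)['Count'].sum()
# step 2: average those daily totals over the dates
mean_per_hour = (daily.groupby(['Id', 'Dow', 'Hour'])['Count']
                      .mean()
                      .reset_index(name='Mean'))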

Year wise aggregation on the given condition in pandas

I have a data frame as shown below, which is sales data for two health care products from December 2016 to November 2018.
product price sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above I would like to prepare the data frame below.
Expected Output:
product year number_of_months total_price total_discount number_of_sales
A 2016 1 50 5 1
B 2016 1 200 10 1
A 2017 12 250 25 5
B 2017 12 1000 110 5
A 2018 11 100 18 2
B 2018 11 900 130 3
Note: Please note that the data starts from Dec 2016 to Nov 2018.
So number of months in 2016 is 1, in 2017 we have full data so 12 months and 2018 we have 11 months.
First aggregate the sums by year and product, then create a new column with the month counts using DataFrame.insert and Series.map:
df1 = (df.groupby(['product', df['sale_date'].dt.year], sort=False)
         .sum()
         .add_prefix('total_')
         .reset_index())
df1.insert(2, 'number_of_months', df1['sale_date'].map({2016:1, 2017:12, 2018:11}))
print (df1)
product sale_date number_of_months total_price total_discount
0 A 2016 1 50 5
1 A 2017 12 250 25
2 B 2016 1 200 10
3 B 2017 12 1000 110
4 A 2018 11 100 18
5 B 2018 11 900 130
If you want a dynamic dictionary based on the minimal and maximal datetimes, use:
s = pd.date_range(df['sale_date'].min(), df['sale_date'].max(), freq='MS')
d = s.year.value_counts().to_dict()
print (d)
{2017: 12, 2018: 11, 2016: 1}
df1 = (df.groupby(['product', df['sale_date'].dt.year], sort=False)
         .sum()
         .add_prefix('total_')
         .reset_index())
df1.insert(2, 'number_of_months', df1['sale_date'].map(d))
print (df1)
product sale_date number_of_months total_price total_discount
0 A 2016 1 50 5
1 A 2017 12 250 25
2 B 2016 1 200 10
3 B 2017 12 1000 110
4 A 2018 11 100 18
5 B 2018 11 900 130
For plotting, DataFrame.set_index with DataFrame.unstack is used:
df2 = (df1.set_index(['sale_date','product'])[['total_price','total_discount']]
          .unstack(fill_value=0))
df2.columns = df2.columns.map('_'.join)
print (df2)
total_price_A total_price_B total_discount_A total_discount_B
sale_date
2016 50 200 5 10
2017 250 1000 25 110
2018 100 900 18 130
df2.plot()
EDIT:
df1 = (df.groupby(['product', df['sale_date'].dt.year], sort=False)
         .agg(total_price=('price', 'sum'),
              total_discount=('discount', 'sum'),
              number_of_sales=('discount', 'size'))
         .reset_index())
df1.insert(2, 'number_of_months', df1['sale_date'].map({2016:1, 2017:12, 2018:11}))
print (df1)
  product  sale_date  number_of_months  total_price  total_discount  number_of_sales
0       A       2016                 1           50               5                1
1       A       2017                12          250              25                5
2       B       2016                 1          200              10                1
3       B       2017                12         1000             110                5
4       A       2018                11          100              18                2
5       B       2018                11          900             130                3
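If it helps, the dynamic month dictionary and the named aggregation from the EDIT can be combined into one snippet; a sketch that assumes sale_date is already a datetime column and renames the year level to 'year' to match the expected output:

s = pd.date_range(df['sale_date'].min(), df['sale_date'].max(), freq='MS')
months_per_year = s.year.value_counts().to_dict()   # {2016: 1, 2017: 12, 2018: 11}

df1 = (df.groupby(['product', df['sale_date'].dt.year], sort=False)
         .agg(total_price=('price', 'sum'),
              total_discount=('discount', 'sum'),
              number_of_sales=('discount', 'size'))
         .reset_index()
         .rename(columns={'sale_date': 'year'}))
df1.insert(2, 'number_of_months', df1['year'].map(months_per_year))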

Count times date occurs between two dates in a second table

I have two tables, [Charges] and [Defects], and want to produce [Desired Query Output], where the output counts the occurrences of a defect when [Charges].ChargeDate is between (and including) [Defects].OpenDate and [Defects].CloseDate. In the [Defects] table, a CloseDate of NULL means the defect has not closed yet. It seems simple enough, but I haven't found a good example of how to do this. Can you help?
I'm using SQL Server version 12.
[Charges]
Order Charge ChargeDate
1 1.2 07/10/2020
1 0.6 07/15/2020
6 0.002 07/20/2020
8 0.13 07/01/2020
8 1.1 06/18/2020
8 0.3 06/19/2020
10 2.3 06/24/2020
[Defects]
Order DefectID OpenDate CloseDate
1 25 06/01/2020 NULL
1 27 07/09/2020 07/12/2020
1 30 05/01/2020 07/20/2020
8 45 06/19/2020 06/19/2020
8 47 06/12/2020 07/05/2020
8 48 06/19/2020 NULL
10 49 06/24/2020 NULL
[Desired Query Output]
Order Charge ChargeDate DefectCnt
1 1.2 07/10/2020 3
1 0.6 07/15/2020 2
6 0.002 07/20/2020 0
8 0.13 07/01/2020 2
8 1.1 06/18/2020 1
8 0.3 06/19/2020 3
10 2.3 06/24/2020 1
You can use a correlated subquery or a lateral join:
select
    c.*,
    (
        select count(*)
        from defects d
        where
            d.[Order] = c.[Order]
            and c.ChargeDate >= d.OpenDate
            and (d.CloseDate is null or c.ChargeDate <= d.CloseDate)
    ) as DefectCnt
from charges c
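For what it's worth, the same count can be reproduced in pandas (a rough sketch, not part of the SQL answer above), building the two sample tables from the question and counting same-Order defects whose open/close window contains the charge date:

import pandas as pd

charges = pd.DataFrame({
    'Order': [1, 1, 6, 8, 8, 8, 10],
    'Charge': [1.2, 0.6, 0.002, 0.13, 1.1, 0.3, 2.3],
    'ChargeDate': pd.to_datetime(['07/10/2020', '07/15/2020', '07/20/2020',
                                  '07/01/2020', '06/18/2020', '06/19/2020',
                                  '06/24/2020']),
})
defects = pd.DataFrame({
    'Order': [1, 1, 1, 8, 8, 8, 10],
    'DefectID': [25, 27, 30, 45, 47, 48, 49],
    'OpenDate': pd.to_datetime(['06/01/2020', '07/09/2020', '05/01/2020',
                                '06/19/2020', '06/12/2020', '06/19/2020',
                                '06/24/2020']),
    'CloseDate': pd.to_datetime([None, '07/12/2020', '07/20/2020', '06/19/2020',
                                 '07/05/2020', None, None]),
})

# pair every charge with every defect of the same Order (left join keeps Order 6)
merged = charges.reset_index().merge(defects, on='Order', how='left')
# a defect counts if it opened on or before the charge date and is either still
# open (CloseDate is NaT) or closed on or after the charge date
in_window = ((merged['ChargeDate'] >= merged['OpenDate'])
             & (merged['CloseDate'].isna() | (merged['ChargeDate'] <= merged['CloseDate'])))
# sum the matches per original charge row -> 3, 2, 0, 2, 1, 3, 1
charges['DefectCnt'] = in_window.groupby(merged['index']).sum().astype(int)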

Sum/aggregate data based on dates

I have the following table about the items sold in every shop.
table columns
shop_id: the specific id of the shop
sold: the purchase amount in dollars ($)
time: the date and time of the purchase
data
shop_id sold time
1 12.44 23/10/2014 20:20
1 12.77 24/10/2014 20:18
1 10.72 24/10/2014 20:18
1 14.51 24/10/2014 20:18
2 5.94 22/10/2014 20:11
2 15.69 23/10/2014 20:23
2 8.55 24/10/2014 20:12
2 6.96 24/10/2014 20:18
3 8.84 22/10/2014 20:21
3 7.82 22/10/2014 20:21
3 22.19 23/10/2014 20:23
3 13.21 23/10/2014 20:23
4 14.60 23/10/2014 20:20
4 12.19 23/10/2014 20:23
4 5.41 24/10/2014 20:18
4 10.93 24/10/2014 20:19
5 18.54 22/10/2014 20:21
5 7.48 22/10/2014 20:21
5 10.67 24/10/2014 20:18
5 15.96 24/10/2014 20:18
I have 3 classifiers per purchase:
purchase classifiers
low: 0-8 $
medium: 8-12 $
high: 12 and higher $
What I would like to do is write a PostgreSQL query that produces the total purchases per day for each purchase classifier.
desired output
date low medium high
22/10/2014 29.10 14.51 12.77
23/10/2014 0 0 70.06
24/10/2014 16.34 51.24 41.39
Thank you very much in advance for your kind help. I am not experienced with PostgreSQL; I would do this easily in R, but since moving a huge database table into R is very cumbersome, I need to learn some PostgreSQL.
I believe you can do this using conditional aggregation like this:
select
    cast(time as date),
    sum(case when sold > 0 and sold <= 8 then sold else 0 end) as low,
    sum(case when sold > 8 and sold <= 12 then sold else 0 end) as medium,
    sum(case when sold > 12 then sold else 0 end) as high
from your_table
group by cast(time as date);
You might have to tweak the ranges a bit: should 8 fall into both low and medium? Also, I can't remember if cast("time" as date) is the correct syntax for PostgreSQL; I suspect it might not be, so just replace that part with the correct function to get the date from the datetime/timestamp column. It should probably be "time"::date.
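And, for comparison with the pandas questions earlier in this thread, the same conditional aggregation can be sketched with pd.cut and a pivot table (shown on a few of the rows above; the bin edges follow the query: low <= 8 < medium <= 12 < high):

import pandas as pd

sales = pd.DataFrame({
    'shop_id': [1, 2, 3, 3, 5],
    'sold': [12.44, 5.94, 8.84, 7.82, 18.54],
    'time': pd.to_datetime(['23/10/2014 20:20', '22/10/2014 20:11', '22/10/2014 20:21',
                            '22/10/2014 20:21', '22/10/2014 20:21'], dayfirst=True),
})

# classify each purchase, then sum per day and classifier
sales['classifier'] = pd.cut(sales['sold'], bins=[0, 8, 12, float('inf')],
                             labels=['low', 'medium', 'high'])
daily = sales.pivot_table(index=sales['time'].dt.date, columns='classifier',
                          values='sold', aggfunc='sum', fill_value=0, observed=False)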