pandas complex groupby, count and apply a cap

pandas complex groupby, count and apply a cap - pandas

Sample dataframe
> 0 location_day payType Name ratePay elapsedSeconds
> 1 2019-12-10 PRE Amy 12.25 199
> 2 2019-12-12 PRE Amy 12.25 7
> 3 2019-12-17 PRE Amy 12.25 68
> 4 2019-12-17 RP Amy 8.75 62
For each day, sum elapsedSeconds and calculate new column with total toPay (elapsedSeconds * ratePay) but apply a "cap" elapsedSeconds of 120. For any single day that only has 1 payType, apply cap so that only 120 is used to calculate "toPay" col.
But...
Also, groupby payType so that if there are 2 unique "payTypes" on a single day, sum the elapsedSeconds to determine if it's over the cap (120) and if so, subtract the elapsedSeconds from the last payType to make the sum equal to 120.
So I desire this output:
> 0 location_day payType Name ratePay elapsedSeconds
> 1 2019-12-10 PRE Amy 12.25 120
> 2 2019-12-12 PRE Amy 12.25 7
> 3 2019-12-17 PRE Amy 12.25 68
> 4 2019-12-17 RP Amy 8.75 52
I'm not quite sure how to approach this one and really only have performed some very basic grouping and testing of calculating new columns with conditional statements such as
finDfcalc1 = finDf.sort('location_day').groupby(flds)['elapsedSeconds'].sum().reset_index()
finDfcalc1.loc[finDfcalc1['elapsedSeconds'] < 120, 'elapsedSecondsOverage'] = finDfcalc1['elapsedSeconds'] * 1
finDfcalc1.loc[finDfcalc1['elapsedSeconds'] > 120, 'elapsedSecondsOverage'] = finDfcalc1['elapsedSeconds'] - 120
finDfcalc1['toPay'] = finDfcalc1['ratePay'] * finDfcalc1['elapsedSecondsOverage']
None of this has to be a one-liner and would be perfectly happy just working out the logic. All suggestions and ideas are greatly appreciated.

We need to group on the day, calculate the cumsum of 'elapsedSeconds' and then apply some logic to clip the total in a day at 120 seconds and then back calculate the correct # of seconds for all rows.
Here's a longer sample dataset to show how it behaves for an additional day with many rows that need to get changed.
location_day payType Name ratePay elapsedSeconds
2019-12-10 PRE Amy 12.25 199
2019-12-12 PRE Amy 12.25 7
2019-12-17 PRE Amy 12.25 68
2019-12-17 RP Amy 8.75 62
2019-12-18 PRE Amy 12.25 50
2019-12-18 RP Amy 8.75 60
2019-12-18 RA Amy 8.75 20
2019-12-18 RE Amy 8.75 10
2019-12-18 XX Amy 8.75 123
Code:
# Will become the seconds you want in the end
df['real_sec'] = df.groupby('location_day').elapsedSeconds.cumsum()
# Calculate a difference
m = df['real_sec'] - df['elapsedSeconds']
#MagicNum
df['real_sec'] = (df['real_sec'].clip(upper=120) # 120 at most
- m.where(m.gt(0)).fillna(0) # only change rows where diff is positive
).clip(lower=0) # Negative results -> 0
location_day payType Name ratePay elapsedSeconds real_sec
0 2019-12-10 PRE Amy 12.25 199 120.0
1 2019-12-12 PRE Amy 12.25 7 7.0
2 2019-12-17 PRE Amy 12.25 68 68.0
3 2019-12-17 RP Amy 8.75 62 52.0
4 2019-12-18 PRE Amy 12.25 50 50.0
5 2019-12-18 RP Amy 8.75 60 60.0
6 2019-12-18 RA Amy 8.75 20 10.0
7 2019-12-18 RE Amy 8.75 10 0.0
8 2019-12-18 XX Amy 8.75 123 0.0

Related

aggregate data between two dates with two dataframes

Given I have the following DF,
Assume this table has all the sales rep and all the Q end dates in the last 20 years.
Q End date
Rep
Var1
03/31/2010
Bob
11
03/31/2010
Alice
12
03/31/2010
Jack
13
06/30/2010
Bob
14
06/30/2010
Alice
15
06/30/2010
Jack
16
I also have a table of transactions events
Sell Date
Rep
04/01/2009
Bob
03/01/2010
Bob
02/01/2010
Jack
02/01/2010
Jack
I am trying to modify the first DF so to have a column that aggregates the number of transactions that happened 12 month prior to the q end date per Qend per Rep
The result should look like this
Q End end
Rep
Var1
Trailing 12M transactions
03/31/2010
Bob
11
2
03/31/2010
Alice
12
0
03/31/2010
Jack
13
2
06/30/2010
Bob
14
1
06/30/2010
Alice
15
0
06/30/2010
Jack
16
2
My table has 2000-3000 sales rep per Q for ~20 years and number of transactions per trailing 12m can range between 0-7k ish.
Any help here would be appreciated. Thanks!

Try:
df1["Q End date"] = pd.to_datetime(df1["Q End date"])
df2["Sell Date"] = pd.to_datetime(df2["Sell Date"])
df2 = df2.sort_values(by="Sell Date").set_index("Sell Date")
df1["Trailing 12M transactions"] = df1.apply(
lambda x: df2.loc[
x["Q End date"] - pd.DateOffset(years=1) : x["Q End date"]
]
.eq(x["Rep"])
.sum(),
axis=1,
)
print(df1)
Prints:
Q End date Rep Var1 Trailing 12M transactions
0 2010-03-31 Bob 11 2
1 2010-03-31 Alice 12 0
2 2010-03-31 Jack 13 2
3 2010-06-30 Bob 14 1
4 2010-06-30 Alice 15 0
5 2010-06-30 Jack 16 2

Conditional mean while using iloc pandas

Assume I have a dataframe with columns stated below (consist more column in actual data).
Customer Group1 jan_revenue feb_revenue mar_revenue
Sam Bank A 40 50 0
Wilson Bank A 60 70 30
Jay Bank B 10 40 40
Jim Bank A 0 40 70
Yan Bank C 0 40 90
Tim Bank C 10 0 50
I want to calculate the mean for each customer but only those are non-zero.
For example, customer Sam has mean (40+50)/2 = 45 and Wilson (60+70+30)/3 = 53.3333
Since I have a large number of columns, so i choose to use iloc but my approach included all the 0.
df['avg_revenue21'] = df.iloc[:,27:39].mean(axis=1)
May I know is there a way for conditional mean while using iloc?
Thank you

You can use select_dtypes to get numeric columns, replace the zeros with NA, then get the mean as usual:
df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)
output:
Sam 45.000000
Wilson 53.333333
Jay 30.000000
Jim 55.000000
Yan 65.000000
Tim 30.000000
dtype: float64
As new column:
df['avg_revenue21'] = df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)
Customer Group1 jan_revenue feb_revenue mar_revenue avg_revenue21
Sam Bank A 40 50 0 45.000000
Wilson Bank A 60 70 30 53.333333
Jay Bank B 10 40 40 30.000000
Jim Bank A 0 40 70 55.000000
Yan Bank C 0 40 90 65.000000
Tim Bank C 10 0 50 30.000000
variants:
If the input are strings:
df['avg_revenue21'] = df.apply(pd.to_numeric, errors='coerce').replace(0, pd.NA).mean(axis=1)
If you only want to consider a subset:
df['avg_revenue21'] = df.filter(regex='(feb|mar)_').replace(0, pd.NA).mean(axis=1)
or:
df['avg_revenue21'] = df[['feb_revenue', 'mar_revenue']].replace(0, pd.NA).mean(axis=1)

Use DataFrame.replace with mean:
df['new'] = df.replace(0, np.nan).mean(axis=1)
print (df)
Customer Group1 jan_revenue feb_revenue mar_revenue new
0 Sam Bank A 40 50 0 45.000000
1 Wilson Bank A 60 70 30 53.333333
2 Jay Bank B 10 40 40 30.000000
3 Jim Bank A 0 40 70 55.000000
4 Yan Bank C 0 40 90 65.000000
5 Tim Bank C 10 0 50 30.000000
Or:
df['new'] = df.replace(0, np.nan).mean(numeric_only=True, axis=1)
print (df)
Customer Group1 jan_revenue feb_revenue mar_revenue new
0 Sam Bank A 40 50 0 45.000000
1 Wilson Bank A 60 70 30 53.333333
2 Jay Bank B 10 40 40 30.000000
3 Jim Bank A 0 40 70 55.000000
4 Yan Bank C 0 40 90 65.000000
5 Tim Bank C 10 0 50 30.000000
EDIT: If possible columns are not numeric, use to_numeric with errors='coerce' for missing values if no numbers:
df['new'] = df.apply(pd.to_numeric, errors='coerce').replace(0, np.nan).mean(axis=1)

In Azure Data bricks I want to get start dates of every week with week numbers from datetime column

This is a sample Data Frame
Date Items_Sold
12/29/2019 10
12/30/2019 20
12/31/2019 30
1/1/2020 40
1/2/2020 50
1/3/2020 60
1/4/2020 35
1/5/2020 56
1/6/2020 34
1/7/2020 564
1/8/2020 6
1/9/2020 45
1/10/2020 56
1/11/2020 45
1/12/2020 37
1/13/2020 36
1/14/2020 479
1/15/2020 47
1/16/2020 47
1/17/2020 578
1/18/2020 478
1/19/2020 3578
1/20/2020 67
1/21/2020 578
1/22/2020 478
1/23/2020 4567
1/24/2020 7889
1/25/2020 8999
1/26/2020 99
1/27/2020 66
1/28/2020 678
1/29/2020 889
1/30/2020 990
1/31/2020 58585
2/1/2020 585
2/2/2020 555
2/3/2020 56
2/4/2020 66
2/5/2020 66
2/6/2020 6634
2/7/2020 588
2/8/2020 2588
2/9/2020 255
I am running this query
%sql
use my_items_table;
select weekofyear(Date), count(items_sold) as Sum
from my_items_table
where year(Date)=2020
group by weekofyear(Date)
order by weekofyear(Date)
I am getting this output. (IMP: I have added random values in Sum)
Week Sum
1 | 300091
2 | 312756
3 | 309363
4 | 307312
5 | 310985
6 | 296889
7 | 315611
But I want in which with week number one column should hold a start date of each week. Like this
Start_Date Week Sum
12/29/2019 1 300091
1/5/2020 2 312756
1/12/2020 3 309363
1/19/2020 4 307312
1/26/2020 5 310985
2/2/2020 6 296889
2/9/2020 7 315611
I am running the query on Azure Data Bricks.

If you have data for all days, then just use min():
select min(date), weekofyear(Date), count(items_sold) as Sum
from my_items_table
where year(Date) = 2020
group by weekofyear(Date)
order by weekofyear(Date);
Note: The year() is the calendar year starting on Jan 1. You are not going to get dates from other years using this query. If that is an issue, I would suggest that you ask a new question asking how to get the first day for the first week of the year.

How to aggregate across columns in pandas?

There are 5 members contributing the value of something for every [E,M,S] as below:
E,M,S,Mem1,Mem2,Mem3,Mem4,Mem5
1,365,-10,15,21,18,16,,
1,365,10,23,34,,45,65
365,365,-20,34,45,43,32,23
365,365,20,56,45,,32,38
730,365,-5,82,64,13,63,27
730,365,15,24,68,,79,78
Notice that there are missing contributions ,,. I want to know the number of contributions for each [E,M,S]. For this e.g. the output is:
1,365,-10,4
1,365,10,4
365,365,-20,5
365,365,20,4
730,365,-5,5
730,365,15,4
groupingBy['E','M','S'] and then aggregating(counting) or applying(function) but across axis=1 would do. How is that done? Or any another idiomatic way to do such ?

The answer posted by #Wen is brilliant and definitely seems like the easiest way to do this.
If you wanted another way to do this, then you could use .melt to view the groups in the DF. Then, use groupby with a .sum() aggregation within each group in the melted DF. You just need to ignore the NaNs when you aggregate, and one way to do this is by following the approach in this SO post - .notnull() applied to groups.
Input DF
print df
E M S Mem1 Mem2 Mem3 Mem4 Mem5
0 1 365 -10 15 21 18.0 16 NaN
1 1 365 10 23 34 NaN 45 65.0
2 365 365 -20 34 45 43.0 32 23.0
3 365 365 20 56 45 NaN 32 38.0
4 730 365 -5 82 64 13.0 63 27.0
5 730 365 15 24 68 NaN 79 78.0
Here is the approach
# Apply melt to view groups
dfm = pd.melt(df, id_vars=['E','M','S'])
print(dfm.head(10))
E M S variable value
0 1 365 -10 Mem1 15.0
1 1 365 10 Mem1 23.0
2 365 365 -20 Mem1 34.0
3 365 365 20 Mem1 56.0
4 730 365 -5 Mem1 82.0
5 730 365 15 Mem1 24.0
6 1 365 -10 Mem2 21.0
7 1 365 10 Mem2 34.0
8 365 365 -20 Mem2 45.0
9 365 365 20 Mem2 45.0
# GROUP BY
grouped = dfm.groupby(['E','M','S'])
# Aggregate within each group, while ignoring NaNs
gtotals = grouped['value'].apply(lambda x: x.notnull().sum())
# (Optional) Reset grouped DF index
gtotals = gtotals.reset_index(drop=False)
print(gtotals)
E M S value
0 1 365 -10 4
1 1 365 10 4
2 365 365 -20 5
3 365 365 20 4
4 730 365 -5 5
5 730 365 15 4

SQL query self join

I am working on a query for a report in Oracle 10g.
I need to generate a short list of each course along with the number of times they were offered in the past year (including ones that weren't actually offered).
I created one query
SELECT coursenumber, count(datestart) AS Offered
FROM class
WHERE datestart BETWEEN (sysdate-365) AND sysdate
GROUP BY coursenumber;
Which produces
COURSENUMBER OFFERED
---- ----------
ST03 2
PD01 1
AY03 2
TB01 4
This query is all correct. However ideally I want it to list those along with COURSENUMBER HY and CS in the left column as well with 0 or null as the OFFERED value. I have a feeling this involves a join of sorts, but so far what I have tried doesn't produce the classes with nothing offered.
The table normally looks like
REFERENCE_NO DATESTART TIME TIME EID ROOMID COURSENUMBER
------------ --------- ---- ---- ---------- ---------- ----
256 03-MAR-11 0930 1100 2 2 PD01
257 03-MAY-11 0930 1100 12 7 PD01
258 18-MAY-11 1230 0100 12 7 PD01
259 24-OCT-11 1930 2015 6 2 CS01
260 17-JUN-11 1130 1300 6 4 CS01
261 25-MAY-11 1900 2000 13 6 HY01
262 25-MAY-11 1900 2000 13 6 HY01
263 04-APR-11 0930 1100 13 5 ST03
264 13-SEP-11 1930 2100 6 4 ST03
265 05-NOV-11 1930 2100 6 5 ST03
266 04-FEB-11 1430 1600 6 5 ST03
267 02-JAN-11 0630 0700 13 1 TB01
268 01-FEB-11 0630 0700 13 1 TB01
269 01-MAR-11 0630 0700 13 1 TB01
270 01-APR-11 0630 0700 13 1 TB01
271 01-MAY-11 0630 0700 13 1 TB01
272 14-MAR-11 0830 0915 4 3 AY03
273 19-APR-11 0930 1015 4 3 AY03
274 17-JUN-11 0830 0915 14 3 AY03
275 14-AUG-09 0930 1015 14 3 AY03
276 03-MAY-09 0830 0915 14 3 AY03

SELECT
coursenumber,
COUNT(CASE WHEN datestart BETWEEN (sysdate-365) AND sysdate THEN 1 END) AS Offered
FROM class
GROUP BY coursenumber;
So, as you can see, this particular problem doesn't need a join.

I think something like this should work for you, by just doing it as a subquery.
SELECT distinct c.coursenumber,
(SELECT COUNT(*)
FROM class
WHERE class.coursenumber = c.coursenumber
AND datestart BETWEEN (sysdate-365) AND sysdate
) AS Offered
FROM class c

I like jschoen's answer better for this particular case (when you want one and only one row and column out of the subquery for each row of the main query), but just to demonstrate another way to do it:
select t1.coursenumber, nvl(t2.cnt,0)
from class t1 left outer join (
select coursenumber, count(*) cnt
from class
where datestart between (sysdate-365) AND sysdate
group by coursenumber
) t2 on t1.coursenumber = t2.coursenumber

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

pandas complex groupby, count and apply a cap - pandas

Related

aggregate data between two dates with two dataframes

Conditional mean while using iloc pandas

In Azure Data bricks I want to get start dates of every week with week numbers from datetime column

How to aggregate across columns in pandas?

SQL query self join

Categories

Resources