Pandas: to get mean for each data category daily [duplicate] - pandas

I am a somewhat beginner programmer and learning python (+pandas) and hope I can explain this well enough. I have a large time series pd dataframe of over 3 million rows and initially 12 columns spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers(350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I cant seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I looking for the mean per hour per day per Id as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be grateful and if possible an explanation of what I am doing wrong either code wise or my approach.
Thanks in advance.
I have edited this to try make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that i start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows, contains 350 Id numbers.

Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()

You can use the groupby function using the 'Id' column and then use the resample function with how='sum'.

Related

count number of records by month over the last five years where record date > select month

I need to show the number of valid inspectors we have by month over the last five years. Inspectors are considered valid when the expiration date on their certification has not yet passed, recorded as the month end date. The below SQL code is text of the query to count valid inspectors for January 2017:
SELECT Count(*) AS RecordCount
FROM dbo_Insp_Type
WHERE (dbo_Insp_Type.CERT_EXP_DTE)>=#2/1/2017#);
Rather than designing 60 queries, one for each month, and compiling the results in a final table (or, err, query) are there other methods I can use that call for less manual input?
From this sample:
Id
CERT_EXP_DTE
1
2022-01-15
2
2022-01-23
3
2022-02-01
4
2022-02-03
5
2022-05-01
6
2022-06-06
7
2022-06-07
8
2022-07-21
9
2022-02-20
10
2021-11-05
11
2021-12-01
12
2021-12-24
this single query:
SELECT
Format([CERT_EXP_DTE],"yyyy/mm") AS YearMonth,
Count(*) AS AllInspectors,
Sum(Abs([CERT_EXP_DTE] >= DateSerial(Year([CERT_EXP_DTE]), Month([CERT_EXP_DTE]), 2))) AS ValidInspectors
FROM
dbo_Insp_Type
GROUP BY
Format([CERT_EXP_DTE],"yyyy/mm");
will return:
YearMonth
AllInspectors
ValidInspectors
2021-11
1
1
2021-12
2
1
2022-01
2
2
2022-02
3
2
2022-05
1
0
2022-06
2
2
2022-07
1
1
ID
Cert_Iss_Dte
Cert_Exp_Dte
1
1/15/2020
1/15/2022
2
1/23/2020
1/23/2022
3
2/1/2020
2/1/2022
4
2/3/2020
2/3/2022
5
5/1/2020
5/1/2022
6
6/6/2020
6/6/2022
7
6/7/2020
6/7/2022
8
7/21/2020
7/21/2022
9
2/20/2020
2/20/2022
10
11/5/2021
11/5/2023
11
12/1/2021
12/1/2023
12
12/24/2021
12/24/2023
A UNION query could calculate a record for each of 50 months but since you want 60, UNION is out.
Or a query with 60 calculated fields using IIf() and Count() referencing a textbox on form for start date:
SELECT Count(IIf(CERT_EXP_DTE>=Forms!formname!tbxDate,1,Null)) AS Dt1,
Count(IIf(CERT_EXP_DTE>=DateAdd("m",1,Forms!formname!tbxDate),1,Null) AS Dt2,
...
FROM dbo_Insp_Type
Using the above data, following is output for Feb and Mar 2022. I did a test with Cert_Iss_Dte included in criteria and it did not make a difference for this sample data.
Dt1
Dt2
10
8
Or a report with 60 textboxes and each calls a DCount() expression with criteria same as used in query.
Or a VBA procedure that writes data to a 'temp' table.

How to write the query to make report by month in sql

I have the receiving and sending data for whole year. so i want to built the monthly report base on that data with the rule is Fisrt in first out. It means is the first receiving will be sent out first ...
DECLARE #ReceivingTbl AS TABLE(Id INT,ProId int, RecQty INT,ReceivingDate DateTime)
INSERT INTO #ReceivingTbl
VALUES (1,1001,210,'2019-03-12'),
(2,1001,315,'2019-06-15'),
(3,2001,500,'2019-04-01'),
(4,2001,10,'2019-06-15'),
(5,1001,105,'2019-07-10')
DECLARE #SendTbl AS TABLE(Id INT,ProId int, SentQty INT,SendMonth int)
INSERT INTO #SendTbl
VALUES (1,1001,50,3),
(2,1001,100,4),
(3,1001,80,5),
(4,1001,80,6),
(5,2001,200,6)
SELECT * FROM #ReceivingTbl ORDER BY ProId,ReceivingDate
SELECT * FROM #SendTbl ORDER BY ProId,SendMonth
Id ProId RecQty ReceivingDate
1 1001 210 2019-03-12
2 1001 315 2019-06-15
5 1001 105 2019-07-10
3 2001 500 2019-04-01
4 2001 10 2019-06-15
Id ProId SentQty SendMonth
1 1001 50 3
2 1001 100 4
3 1001 80 5
4 1001 80 6
5 2001 200 6
--- And the below is what i want:
Id ProId RecQty ReceivingDate ... Mar Apr May Jun
1 1001 210 2019-03-12 ... 50 100 60 0
2 1001 315 2019-06-15 ... 0 0 20 80
5 1001 105 2019-07-10 ... 0 0 0 0
3 2001 500 2019-04-01 ... 0 0 0 200
4 2001 10 2019-06-15 ... 0 0 0 0
Thanks!
Your question is not clear to me.
If you want to purely use the FIFO approach, therefore ignore any data the table contains, you necessarely need to order by ID, which in your example you are providing, and looks like it is in order of insert.
The first line inserted should be also the first line appearing in the select (FIFO), in order to do so you have to use:
ORDER BY Id ASC
Which will place the lower value of the ID first (1, 2, 3, ...)
To me though, this doesn't make much sense, so pay attention to the meaning o the data you actually have and leverage dates like ReceivingDate, and order by that, maybe even filtering by month of the date, below an example for January data:
WHERE MONTH(ReceivingDate) = 1

Max date among records and across tables - SQL Server

I tried max to provide in table format but it seem not good in StackOver, so attaching snapshot of the 2 tables. Apologize about the formatting.
SQL Server 2012
**MS Table**
**mId tdId name dueDate**
1 1 **forecastedDate** 1/1/2015
2 1 **hypercareDate** 11/30/2016
3 1 LOE 1 7/4/2016
4 1 LOE 2 7/4/2016
5 1 demo for yy test 10/15/2016
6 1 Implementation – testing 7/4/2016
7 1 Phased Rollout – final 7/4/2016
8 2 forecastedDate 1/7/2016
9 2 hypercareDate 11/12/2016
10 2 domain - Forte NULL
11 2 Fortis completion 1/1/2016
12 2 Certification NULL
13 2 Implementation 7/4/2016
-----------------------------------------------
**MSRevised**
**mId revisedDate**
1 1/5/2015
1 1/8/2015
3 3/25/2017
2 2/1/2016
2 12/30/2016
3 4/28/2016
4 4/28/2016
5 10/1/2016
6 7/28/2016
7 7/28/2016
8 4/28/2016
9 8/4/2016
9 5/28/2016
11 10/4/2016
11 10/5/2016
13 11/1/2016
----------------------------------------
The required output is
1. Will be passing the 'tId' number, for instance 1, lets call it tid (1)
2. Want to compare tId (1)'s all milestones (except hypercareDate) with tid(1)'s forecastedDate milestone
3. return if any of the milestone date (other than hypercareDate) is greater than the forecastedDate
The above 3 steps are simple, but I have to first compare the milestones date with its corresponding revised dates, if any, from the revised table, and pick the max date among all that needs to be compared with the forecastedDate
I managed to solve this. Posting the answer, hope it helps aomebody.
//Insert the result into temp table
INSERT INTO #mstab
SELECT [mId]
, [tId]
, [msDate]
FROM [dbo].[MS]
WHERE ([msName] NOT LIKE 'forecastedDate' AND [msName] NOT LIKE 'hypercareDate'))
// this scalar function will get max date between forecasted duedate and forecasted revised date
SELECT #maxForecastedDate = [dbo].[fnGetMaxDate] ( 'forecastedDate');
// this will get the max date from temp table and compare it with forecasatedDate/
SET #maxmilestoneDate = (SELECT MAX(maxDate)
FROM ( SELECT ms.msDueDate AS dueDate
, mr.msRevisedDate AS revDate
FROM #mstab as ms
LEFT JOIN [MSRev] as mr on ms.msId = mr.msId
) maxDate
UNPIVOT (maxDate FOR DateCols IN (dueDate, revDate))up );

SQL terminology to combine a NOT EXIST query with latest value

I am a beginner with basic knowledge.
I have a single table that I am trying to pull all UID's that have not had a particular code in the table within the past year.
My table looks like this: (but much larger of course)
FACID DPID EID DID UID DT Code Units Charge ET Ord
1 1 6 2 1002 15-Mar-07 99204 1 180 09:36.7 1
1 1 7 5 10004 15-Mar-07 99213 1 68 02:36.9 1
1 1 24 55 25887 15-Mar-07 99213 1 68 43:55.3 1
1 1 25 2 355688 15-Mar-07 99213 1 68 53:20.2 1
1 1 26 5 555654 15-Mar-07 99213 1 68 42:22.6 1
1 1 27 44 135514 15-Mar-07 99213 1 68 00:36.8 1
1 1 28 2 3244522 15-Mar-07 99214 1 98 34:59.4 1
1 1 29 5 235445 15-Mar-07 99213 1 68 56:42.1 1
1 1 30 3 3214444 15-Mar-07 99213 1 68 54:56.5 1
1 1 33 1 221444 15-Mar-07 99204 1 180 37:44.5 1
I am attempting to use the following, but this is not working for my time frame limits.
select distinct UID from PtProcTbl
where DT<'20120101'
and NOT EXISTS (Select Distinct UID
where Code in ('99203','99204','99205','99213',
'99214','99215','99244','99245'))
I need to know how to make sure the UID's that I am pulling are the ones don't have a DT after the 1/1/2012 cut off date that contains one of the NOT Exists codes.
The above query returned UID's that actually dates after 1/1/2012 that does contain one of the above codes...
Not sure what I am doing wrong or if I am totally off base on this..
Thanks in advance.
Are you sure you need the NOT EXISTS? How about instead:
AND Code NOT IN ('99203','99204','99205','99213','99214','99215','99244','99245')

Using Sum() with multiple where clauses

I'm pretty new to this, so forgive if this has been posted (I had no idea what to even search on).
I have 2 tables, Accounts and Usage
AccountID AccountStartDate AccountEndDate
-------------------------------------------
1 12/1/2012 12/1/2013
2 1/1/2013 1/1/2014
UsageId AccountID EstimatedUsage StartDate EndDate
------------------------------------------------------
1 1 10 1/1 1/31
2 1 11 2/1 2/29
3 1 23 3/1 3/31
4 1 23 4/1 4/30
5 1 15 5/1 5/31
6 1 20 6/1 6/30
7 1 15 7/1 7/31
8 1 12 8/1 8/31
9 1 14 9/1 9/30
10 1 21 10/1 10/31
11 1 27 11/1 11/30
12 1 34 12/1 12/31
13 2 13 1/1 1/31
14 2 13 2/1 2/29
15 2 28 3/1 3/31
16 2 29 4/1 4/30
17 2 31 5/1 5/31
18 2 26 6/1 6/30
19 2 43 7/1 7/31
20 2 32 8/1 8/31
21 2 18 9/1 9/30
22 2 20 10/1 10/31
23 2 47 11/1 11/30
24 2 33 12/1 12/31
I'd like to write one query that gives me estimated usage for each month (starting now until the last month that we serve an account) for all accounts being served during that month.
The results would be as follows:
Month-Year Total Est Usage
------------------------------
Oct-12 0 (none being served)
Nov-12 0 (none being served)
Dec-12 34 (only accountid 1 being served)
Jan-13 23 (accountid 1 & 2 being served)
Feb-13 24 (accountid 1 & 2 being served)
Mar-13 51 (accountid 1 & 2 being served)
...
Dec-13 33 (only accountid 2 being served)
Jan-14 0 (none being served)
Feb-14 0 (none being served)
I'm assuming I need to sum and then do a Group By...but not really sure logically how I'd lay this out.
Revised Answer:
I've created a Months table with columns MonthID, Month with values like (201212, 12), (201301, 1), ...
I've also reorganised the usage table to have a month column rather than the start date and end date, as it makes the idea clearer.
See http://sqlfiddle.com/#!3/f57d84/6 for details
The query is now:
Select
m.MonthID,
Sum(u.EstimatedUsage) TotalEstimatedUsage
From
Accounts a
Inner Join
Usage u
On a.AccountID = u.AccountID
Inner Join
Months m
On m.MonthID Between
Year(a.AccountStartDate) * 100 + Month(a.AccountStartDate) And
Year(a.AccountEndDate) * 100 + Month(a.AccountEndDate) And
m.Month = u.Month
Group By
m.MonthID
Order By
1
Previous answer, for reference which assumed usages ranges were full dates rather than just months.
Select
Year(u.StartDate),
Month(u.StartDate),
Sum(Case When a.AccountStartDate <= u.StartDate And a.AccountEndDate >= u.EndDate Then u.EstimatedUsage Else 0 End) TotalEstimatedUsage
From
Accounts a
Inner Join
Usage u
On a.AccountID = u.AccountID
Group By
Year(u.StartDate),
Month(u.StartDate)
Order By
1, 2