Query repeating event in a given time range - sql

I have a set of events (Time is in MM/DD/YYYY format) that can be repeated by different users:
EventId  Event  Time                User
1        Start  06/01/2012 10:05AM  1
1        End    06/05/2012 10:45AM  1
2        Start  07/07/2012 09:55AM  2
2        End    09/07/2012 11:05AM  2
3        Start  09/01/2012 11:05AM  2
3        End    09/03/2012 11:05AM  2
I want to get, using SQL, the events a user has done in a specified time range. For instance, given 06/06/2012 and 09/02/2012 I am expecting to get:
EventId  Event  Time                User
2        Start  07/07/2012 09:55AM  2
2        End    09/07/2012 11:05AM  2
3        Start  09/01/2012 11:05AM  2
Any idea on how to deal with this?

A basic range query should work here:
SELECT *
FROM yourTable
WHERE Time >= '2012-06-06'::date AND Time < '2012-09-03'::date;
This assumes you want records whose Time falls on or after June 6, 2012 and before September 3 of the same year, i.e. through the end of September 2.
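Note that the expected output also includes event 2's End row (09/07/2012), which falls after the range, so a filter on each row's own Time will not return it. If the intent is to return every row of any event that started inside the range, a self-join sketch (same placeholder table name as above) might look like:
SELECT e.*
FROM yourTable e
JOIN yourTable s
  ON s.EventId = e.EventId
 AND s.Event = 'Start'
WHERE s.Time >= '2012-06-06'::date
  AND s.Time < '2012-09-03'::date;
Under that reading the query would also return event 3's End row (09/03/2012), which the expected output omits, so it depends on which interpretation is actually wanted.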

Related

count number of records by month over the last five years where record date > select month

I need to show the number of valid inspectors we have by month over the last five years. Inspectors are considered valid when the expiration date on their certification has not yet passed, recorded as the month end date. The SQL below counts valid inspectors for January 2017:
SELECT Count(*) AS RecordCount
FROM dbo_Insp_Type
WHERE dbo_Insp_Type.CERT_EXP_DTE >= #2/1/2017#;
Rather than designing 60 queries, one for each month, and compiling the results in a final table (or, err, query), are there other methods I can use that call for less manual input?
From this sample:
Id  CERT_EXP_DTE
1   2022-01-15
2   2022-01-23
3   2022-02-01
4   2022-02-03
5   2022-05-01
6   2022-06-06
7   2022-06-07
8   2022-07-21
9   2022-02-20
10  2021-11-05
11  2021-12-01
12  2021-12-24
this single query:
SELECT
    Format([CERT_EXP_DTE], "yyyy-mm") AS YearMonth,
    Count(*) AS AllInspectors,
    Sum(Abs([CERT_EXP_DTE] >= DateSerial(Year([CERT_EXP_DTE]), Month([CERT_EXP_DTE]), 2))) AS ValidInspectors
FROM
    dbo_Insp_Type
GROUP BY
    Format([CERT_EXP_DTE], "yyyy-mm");
will return:
YearMonth  AllInspectors  ValidInspectors
2021-11    1              1
2021-12    2              1
2022-01    2              2
2022-02    3              2
2022-05    1              0
2022-06    2              2
2022-07    1              1
Starting from this sample:
ID  Cert_Iss_Dte  Cert_Exp_Dte
1   1/15/2020     1/15/2022
2   1/23/2020     1/23/2022
3   2/1/2020      2/1/2022
4   2/3/2020      2/3/2022
5   5/1/2020      5/1/2022
6   6/6/2020      6/6/2022
7   6/7/2020      6/7/2022
8   7/21/2020     7/21/2022
9   2/20/2020     2/20/2022
10  11/5/2021     11/5/2023
11  12/1/2021     12/1/2023
12  12/24/2021    12/24/2023
A UNION query could calculate a record for each month, but Access caps a UNION at 50 SELECT statements, and you want 60 months, so UNION is out.
Or a query with 60 calculated fields using IIf() and Count() referencing a textbox on form for start date:
SELECT Count(IIf(CERT_EXP_DTE >= Forms!formname!tbxDate, 1, Null)) AS Dt1,
       Count(IIf(CERT_EXP_DTE >= DateAdd("m", 1, Forms!formname!tbxDate), 1, Null)) AS Dt2,
...
FROM dbo_Insp_Type
Using the above data, the following is the output for Feb and Mar 2022. I did a test with Cert_Iss_Dte included in the criteria and it did not make a difference for this sample data.
Dt1  Dt2
10   8
Or a report with 60 textboxes, each calling a DCount() expression with the same criteria used in the query.
Or a VBA procedure that writes data to a 'temp' table.
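For that last option, a minimal hypothetical sketch using DAO, assuming a pre-created scratch table tblMonthlyCounts(MonthStart, ValidCount) — the table, field names, and start date here are illustrative, not from the post:
Public Sub FillMonthlyCounts()
    ' Write one row per month, counting certs that have not yet
    ' expired at the start of that month.
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim dtMonth As Date
    Dim i As Integer

    Set db = CurrentDb
    db.Execute "DELETE FROM tblMonthlyCounts", dbFailOnError
    Set rs = db.OpenRecordset("tblMonthlyCounts")

    For i = 0 To 59                          ' 60 months
        dtMonth = DateAdd("m", i, DateSerial(2017, 1, 1))
        rs.AddNew
        rs!MonthStart = dtMonth
        ' ISO-style literal avoids locale-dependent date separators
        rs!ValidCount = DCount("*", "dbo_Insp_Type", _
            "CERT_EXP_DTE >= #" & Format(dtMonth, "yyyy-mm-dd") & "#")
        rs.Update
    Next i

    rs.Close
End Sub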

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer learning Python (+ pandas) and hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many similar questions, like counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), and per station.
I have the following, setting datetime to index:
Id    Start_date           Count  Day_name_no
149   2011-12-31 21:30:00  1      5
150   2011-12-31 20:51:00  1      0
259   2011-12-31 20:48:00  1      1
3015  2011-12-31 19:38:00  1      4
28    2011-12-31 19:37:00  1      4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id    Count  Day_name_no  Trip_hour
149   1      2            5
150   1      4            10
153   1      2            15
1867  1      4            11
2387  1      2            7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id  Day_name_no  Trip_hour  Count
1   0            7          24
1   0            8          48
1   0            9          31
1   0            10         28
1   0            11         26
1   0            12         25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result, as the mean values are incorrect.
I hope I have explained this issue in a clear way. I'm looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question on a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date        Id    Dow  Hour  Count
12/12/2014  1234  0    9     1
12/12/2014  1234  0    9     1
12/12/2014  1234  0    9     1
12/12/2014  1234  0    9     1
12/12/2014  1234  0    9     1
19/12/2014  1234  0    9     1
19/12/2014  1234  0    9     1
19/12/2014  1234  0    9     1
26/12/2014  1234  0    10    1
27/12/2014  1234  1    11    1
27/12/2014  1234  1    11    1
27/12/2014  1234  1    11    1
27/12/2014  1234  1    11    1
04/01/2015  1234  1    11    1
I now realise I would have to aggregate by date first and get something like:
Date        Id    Dow  Hour  Count
12/12/2014  1234  0    9     5
19/12/2014  1234  0    9     3
26/12/2014  1234  0    10    1
27/12/2014  1234  1    11    4
04/01/2015  1234  1    11    1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id    Dow  Hour  Mean
1234  0    9     4
1234  0    10    1
1234  1    11    2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
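Applied to the toy data above, the first groupby produces the date-level sums (5 and 3 for Id 1234, dow 0, hour 9), and the mean step then yields the desired 4.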
You can also group by the 'Id' column and then use resample to sum the counts per hour (the old how='sum' argument has been removed from current pandas).
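A minimal sketch of that route, assuming df has a DatetimeIndex built from Start_date. Note that resample fills hours with no tickets as 0, which lowers the means compared to the groupby-only approach above:
hourly = (df.groupby('Id')['Count']
            .resample('h').sum()      # tickets per Id per hour, gaps become 0
            .reset_index())
hourly['dow'] = hourly['Start_date'].dt.dayofweek
hourly['hour'] = hourly['Start_date'].dt.hour
mean_counts = hourly.groupby(['Id', 'dow', 'hour'])['Count'].mean()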

Calculate overlap time in seconds for groups in SQL

I have a bunch of timestamps grouped by ID and type in the sample data shown below.
I would like to find the overlap between the start_time and end_time columns, in seconds, for each group of ID and for each lead and follower combination. I would like to show the overlap only on the first record of each group, which will always be the "lead" type.
For example, for ID 1, the follower's start and end times in row 3 overlap with the lead's in row 1 for 193 seconds (from 09:00:00 to 09:03:13). The follower's times in row 3 also overlap with the lead's in row 2 for 133 seconds (from 09:01:00 to 09:03:13). That's a total of 326 seconds (193 + 133).
As a start, I used a partition clause to rank rows by ID and type, ordered by start_time.
How do I get the overlap column?
row#  ID  type      start_time           end_time             rank  overlap
1     1   lead      2020-05-07 09:00:00  2020-05-07 09:03:34  1     326
2     1   lead      2020-05-07 09:01:00  2020-05-07 09:03:13  2
3     1   follower  2020-05-07 08:59:00  2020-05-07 09:03:13  1
4     2   lead      2020-05-07 11:23:00  2020-05-07 11:33:00  1     540
5     2   follower  2020-05-07 11:27:00  2020-05-07 11:32:00  1
6     3   lead      2020-05-07 14:45:00  2020-05-07 15:00:00  1     305
7     3   follower  2020-05-07 14:44:00  2020-05-07 14:44:45  1
8     3   follower  2020-05-07 14:50:00  2020-05-07 14:55:05  2
In your example, the times completely cover the total duration. If this is always true, you can use the following logic:
select id,
       sum(datediff(second, start_time, end_time)) -
       datediff(second, min(start_time), max(end_time)) as overlap
from t
group by id;
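As a check against the sample, ID 1's three rows individually cover 214 + 133 + 253 = 600 seconds, the overall span from 08:59:00 to 09:03:34 is 274 seconds, and 600 - 274 = 326, matching the expected overlap.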
To add this as an additional column, either use window functions or join in the result from the above query.
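A sketch of the window-function route (T-SQL, matching the DATEDIFF usage elsewhere in this thread; the CASE uses the question's rank column so the value lands only on each ID's first lead row):
select t.*,
       case when type = 'lead' and [rank] = 1
            then sum(datediff(second, start_time, end_time)) over (partition by id) -
                 datediff(second,
                          min(start_time) over (partition by id),
                          max(end_time) over (partition by id))
       end as overlap
from t;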
If the overall time has gaps, then the problem is quite a bit more complicated. I would suggest that you ask a new question and set up a db fiddle for the problem.
Tried this a couple of ways and got it to work.
I first joined two tables holding the individual records for each type, 'lead' and 'follower', and wrote CASE statements to take the later of the two start times and the earlier of the two end times for each lead and follower combination. Stored this in a temp table.
CASE
    WHEN lead_table.start_time > follower_table.start_time THEN lead_table.start_time
    ELSE follower_table.start_time
END AS overlap_start_time,
CASE
    WHEN follower_table.end_time < lead_table.end_time THEN follower_table.end_time
    ELSE lead_table.end_time
END AS overlap_end_time
Then created an outer query over the temp table to find the difference between the overlap start and end times, in seconds, for each lead and follower combination:
SELECT temp_table.id,
       temp_table.overlap_start_time,
       temp_table.overlap_end_time,
       DATEDIFF_BIG(second,
                    temp_table.overlap_start_time,
                    temp_table.overlap_end_time) AS overlap_time
FROM temp_table

Add a new dataframe column which counts the values in certain column less than the date prior to the time

EDITED
I want to add a new column called prev_message_left_count which counts the number of messages left per ID before the given date and time. Basically, I want a column that says how many times we had left a message on a call to that customer prior to the current date and time. This is how my dataframe looks:
date      ID  call_time  message_left
20191101  1   8:00       0
20191102  2   9:00       1
20191030  1   16:00      1
20191103  2   10:30      1
20191105  2   14:00      0
20191030  1   15:30      0
I want to add an additional column called prev_message_left_count
date      ID  call_time  message_left  prev_message_left_count
20191101  1   8:00       0             1
20191102  2   9:00       1             0
20191030  1   16:00      1             0
20191103  2   10:30      1             1
20191105  2   14:00      0             2
20191030  1   15:30      0             0
My dataframe has 15 columns and 90k rows.
There are other columns, such as 'No Message Left' and 'Responded', for which I will have to compute similar columns ('prev_no_message_left', 'prev_responded') in the same way as 'prev_message_left_count'.
Use DataFrame.sort_values to get the cumulative sum in the correct order by groups. You can create groups using DataFrame.groupby:
df['prev_message_left_count'] = (df.sort_values(['date', 'call_time'])
                                   .groupby('ID')['message_left']
                                   .apply(lambda x: x.shift(fill_value=0).cumsum()))
print(df)
       date  ID call_time  message_left  prev_message_left_count
0  20191101   1      8:00             0                        1
1  20191102   2      9:00             1                        0
2  20191030   1     16:00             1                        0
3  20191103   2     10:30             1                        1
4  20191105   2     14:00             0                        2
5  20191030   1     15:30             0                        0
Sometimes GroupBy.apply is slow, so it may be advisable to avoid it:
df['prev_message_left_count'] = (df.sort_values(['date', 'call_time'])
                                   .groupby('ID')
                                   .shift(fill_value=0)
                                   .groupby(df['ID'])['message_left']
                                   .cumsum())
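Both versions implement the same idea: order each ID's calls chronologically, shift message_left down one row so the current call is excluded, then take the running sum of the earlier messages.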

Filter results based on another filter's date range

Background:
I have the following user registration funnel, where the user creates an account and then goes through prompts with the goal of registering:
id  date        create_account_date  user_creates_account  registration_date  user_registers
1   12/30/2017  12/30/2017           1                     12/30/2017         1
2   12/30/2017  12/30/2017           1                     1/2/2018           0
2   1/2/2018    12/30/2017           0                     1/2/2018           1
3   12/31/2017  12/31/2017           1                     12/31/2017         1
4   1/1/2018    1/1/2018             1                     1/3/2018           0
4   1/3/2018    1/1/2018             0                     1/3/2018           1
5   1/1/2018    1/1/2018             1                     1/1/2018           1
6   1/2/2018    1/2/2018             1                     1/3/2018           0
6   1/3/2018    1/2/2018             0                     1/3/2018           1
7   1/3/2018    1/3/2018             1                     1/3/2018           1
8   1/4/2018    1/4/2018             1                     1/4/2018           1
In aggregate:
                      12/30  12/31  1/1  1/2  1/3  1/4  Total  Total 1/2-1/4
User Creates Account  2      1      2    1    1    1    8      3
User Registers        1      1      1    1    3    1    8      5
Issue:
I am trying to add a date filter, where I can pick the date range of the data I want to see.
I added create_account_date as a filter and picked Jan 2 to Jan 4. However, that only forces min(registration_date) = '1/2/18', while max(registration_date) can still fall after Jan 4.
I also tried forcing create_account_date = registration_date, but that undercounts those who registered on a day different from their create_account_date yet still within the filtered date range.
Ask:
I would like to be able to filter the output by the date range filter/parameter.
So, per user, create_account_date and registration_date should both be >= min(date) and <= max(date). Here registration_date >= create_account_date.
So with the filter implemented I would have:
                      1/2/2018  1/3/2018  1/4/2018  Total
User Creates Account  1         1         1         3
User Registers        0         1         1         2
Thank you in advance.
You are approaching this the wrong way: don't try to manipulate create_account_date and registration_date. Instead, add a filter on the plain date field, place the fields on the sheet, and check the result.
If you really do want such a filter, you can't express the condition with a normal filter; you need to create two parameters:
One parameter for the start date
One parameter for the end date
For both parameters, use the date field to populate the list of values.
Now create two calculated fields:
IF [create_account_date] >= [start date parameter]
AND [create_account_date] < [end date parameter]
THEN [your field]
END
Similarly:
IF [registration_date] >= [start date parameter]
AND [registration_date] < [end date parameter]
THEN [your field]
END
Use both calculated fields in Rows and place the date in Columns of the sheet.
I'm not 100% clear on the goal, but it sounds like you want to find users where create_account_date is between a start and end date AND registration_date is between the same start and end dates. If that's so, you could filter with WHERE (create_account_date BETWEEN @start_date AND @end_date) AND (registration_date BETWEEN @start_date AND @end_date). My syntax assumes SQL Server, but you can make it work for other DBMSs too.
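Laid out as a runnable sketch (the table name registrations and the parameter names are placeholders):
SELECT *
FROM registrations
WHERE create_account_date BETWEEN @start_date AND @end_date
  AND registration_date BETWEEN @start_date AND @end_date;
Against the sample above, with @start_date = '2018-01-02' and @end_date = '2018-01-04', this keeps only users 6, 7, and 8.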