Grouping Timestamps based on the interval between them - sql

I have a table in Hive (SQL) with a bunch of timestamps that need to be grouped in order to create separate sessions based on the time difference between the timestamps.
Example:
Consider the following timestamps (given in HH:MM for simplicity):
9.00
9.10
9.20
9.40
9.43
10.30
10.45
11.25
12.30
12.33
and so on..
So now, all timestamps that fall within 30 mins of the next timestamp come under the same session,
i.e. 9.00,9.10,9.20,9.40,9.43 form 1 session.
But since the difference between 9.43 and 10.30 is more than 30 mins, the time stamp 10.30 falls under a different session. Again, 10.30 and 10.45 fall under one session.
After we have created these sessions, we have to obtain the minimum timestamp for that session and the max timestamp.
I tried subtracting each timestamp from its LEAD and setting a flag when the difference is greater than 30 mins, but I'm having difficulty with this.
Any suggestion from you guys would be greatly appreciated. Please let me know if the question isn't clear enough.
Expected Output for this sample data:
Session_start Session_end
9.00 9.43
10.30 10.45
11.25 11.25 (same because the next time is not within 30 mins)
12.30 12.33
Hope this helps.

So it's not MySQL but Hive. I don't know Hive, but if it supports LAG, as you say, try this PostgreSQL query. You will probably have to change the time difference calculation; that's usually different from one DBMS to another.
select min(thetime) as start_time, max(thetime) as end_time
from
(
select thetime, count(gap) over (order by thetime rows between unbounded preceding and current row) as groupid
from
(
select thetime, case when thetime - lag(thetime) over (order by thetime) > interval '30 minutes' then 1 end as gap
from mytable
) times
) groups
group by groupid
order by min(thetime);
The query finds gaps, then uses a running total of gap counts to build group IDs, and the rest is aggregation.
SQL fiddle: http://www.sqlfiddle.com/#!17/8bc4a/6.
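For the Hive side specifically, here is a hedged sketch of the same gaps-and-islands idea using Hive's lag(), unix_timestamp() and a running SUM(); the table and column names (mytable, thetime) are placeholders and may need adjusting to your schema and Hive version:
SELECT MIN(thetime) AS session_start,
       MAX(thetime) AS session_end
FROM (
    SELECT thetime,
           -- running total of the "new session" flags builds a session id
           SUM(is_new_session) OVER (ORDER BY thetime
                                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
    FROM (
        SELECT thetime,
               -- flag a new session when more than 30 minutes passed since the previous timestamp
               CASE WHEN unix_timestamp(thetime)
                         - unix_timestamp(lag(thetime) OVER (ORDER BY thetime)) > 30 * 60
                    THEN 1 ELSE 0 END AS is_new_session
        FROM mytable
    ) flagged
) grouped
GROUP BY session_id
ORDER BY session_id;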

With older MySQL versions (before 8.0) lacking the LAG and LEAD functions, getting the previous or next record is already some work. Here is how:
select
thetime,
(select max(thetime) from mytable afore where afore.thetime < mytable.thetime) as afore_time,
(select min(thetime) from mytable after where after.thetime > mytable.thetime) as after_time
from mytable;
Based on this we can build the whole query where we are looking for gaps (i.e. the time difference to the previous or next record is more than 30 minutes = 1800 seconds).
select
startrec.thetime as start_time,
(
select min(endrec.thetime)
from
(
select
thetime,
coalesce(time_to_sec(timediff((select min(thetime) from mytable after where after.thetime > mytable.thetime), thetime)), 1801) > 1800 as gap
from mytable
) endrec
where gap
and endrec.thetime >= startrec.thetime
) as end_time
from
(
select
thetime,
coalesce(time_to_sec(timediff(thetime, (select max(thetime) from mytable afore where afore.thetime < mytable.thetime))), 1801) > 1800 as gap
from mytable
) startrec
where gap;
SQL fiddle: http://www.sqlfiddle.com/#!2/d307b/20.
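If you are on MySQL 8.0 or later, window functions are available, so a hedged sketch of the simpler LAG-based approach (mirroring the PostgreSQL query above; mytable and thetime are placeholder names for your table and column) could look like this:
SELECT MIN(thetime) AS start_time, MAX(thetime) AS end_time
FROM (
    SELECT thetime,
           -- running total of the gap flags forms the group id
           SUM(gap) OVER (ORDER BY thetime) AS groupid
    FROM (
        SELECT thetime,
               -- 1 when more than 30 minutes passed since the previous row, else 0
               CASE WHEN TIMESTAMPDIFF(MINUTE, LAG(thetime) OVER (ORDER BY thetime), thetime) > 30
                    THEN 1 ELSE 0 END AS gap
        FROM mytable
    ) flagged
) grouped
GROUP BY groupid
ORDER BY start_time;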

Try this:
SELECT MIN(session_time_tmp) session_start, MAX(session_time_tmp) session_end FROM
(
SELECT IF((TIME_TO_SEC(TIMEDIFF(your_time_field, COALESCE(@previousValue, your_time_field))) / 60) > 30 ,
@sessionCount := @sessionCount + 1, @sessionCount ) sessCount,
( @previousValue := your_time_field ) session_time_tmp FROM
(
SELECT your_time_field, @previousValue:= NULL, @sessionCount := 1 FROM yourtable ORDER BY your_time_field
) a
) b
GROUP BY sessCount
Just replace yourtable and your_time_field

Try this:
SELECT DATE_FORMAT(MIN(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_start,
DATE_FORMAT(MAX(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_end
FROM tableA A
LEFT JOIN ( SELECT A.column1, diff, IF(@diff:=diff < 30, @id, @id:=@id+1) AS rnk
FROM (SELECT B.column1, TIME_TO_SEC(TIMEDIFF(STR_TO_DATE(B.column1, '%H.%i'), STR_TO_DATE(A.column1, '%H.%i'))) / 60 AS diff
FROM tableA A
INNER JOIN tableA B ON STR_TO_DATE(A.column1, '%H.%i') < STR_TO_DATE(B.column1, '%H.%i')
GROUP BY STR_TO_DATE(A.column1, '%H.%i')
) AS A, (SELECT @diff:=0, @id:= 1) AS B
) AS B ON A.column1 = B.column1
GROUP BY IFNULL(B.rnk, 1);
Check the SQL FIDDLE DEMO
OUTPUT
| SESSION_START | SESSION_END |
|---------------|-------------|
| 9.00 | 9.43 |
| 10.30 | 10.45 |
| 11.25 | 11.25 |
| 12.30 | 12.33 |


SQL How to subtract 2 row values of a same column based on same key

How to extract the difference of a specific column of multiple rows with same id?
Example table:

id | prev_val | new_val | date
---|----------|---------|-----------------
 1 | 0        | 1       | 2020-01-01 10:00
 1 | 1        | 2       | 2020-01-01 11:00
 2 | 0        | 1       | 2020-01-01 10:00
 2 | 1        | 2       | 2020-01-02 10:00
expected result:

id | duration_in_hours
---|------------------
 1 | 1
 2 | 24
summary:
with id=1, (2020-01-01 10:00 - 2020-01-01 11:00) is 1 hour;
with id=2, (2020-01-01 10:00 - 2020-01-02 10:00) is 24 hours
Can we achieve this with SQL?
This solution should be an effective way:
with pd as (
select
id,
max(date) filter (where prev_val = 0) as "prev",
max(date) filter (where prev_val = 1) as "new"
from
table
group by
id )
select
id ,
new - prev as diff
from
pd;
If you need the difference between successive readings, something like this should work:
select a.id, a.new_val, a.date - b.date
from my_table a join my_table b
on a.id = b.id and a.prev_val = b.new_val
You could use min/max subqueries. For example:
SELECT mn.id, (mx.maxdate - mn.mindate) as "duration"
FROM (SELECT id, min(date) as mindate FROM table GROUP BY id) mn
JOIN (SELECT id, max(date) as maxdate FROM table GROUP BY id) mx ON
mx.id=mn.id
Let me know if you need help in converting duration to hours.
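As a hedged sketch of that conversion (PostgreSQL syntax, with mytable as a placeholder for your table name), EXTRACT(EPOCH FROM ...) turns the interval into seconds, which can then be divided by 3600:
SELECT mn.id,
       -- epoch gives the interval length in seconds; divide by 3600 for hours
       EXTRACT(EPOCH FROM (mx.maxdate - mn.mindate)) / 3600 AS duration_in_hours
FROM (SELECT id, min(date) as mindate FROM mytable GROUP BY id) mn
JOIN (SELECT id, max(date) as maxdate FROM mytable GROUP BY id) mx ON mx.id = mn.id;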
You can use the lead()/lag() window functions to access data from the next/previous row. You can then subtract the timestamps to give an interval and extract the parts needed.
select id, floor( extract('day' from diff)*24 + extract('hour' from diff) ) "Time Difference: Hours"
from (select id, date_ts - lag(date_ts) over (partition by id order by date_ts) diff
from example
) hd
where diff is not null
order by id;
NOTE: Your expected results, as presented, are incorrect. The results would be -1 and -24 respectively.
DATE is a very poor choice for a column name. It is both a Postgres data type (at best leads to confusion) and a SQL Standard reserved word.

How do I give the condition to group by time period?

I need to get the count of records in PostgreSQL from 7:00:00 am until 6:59:59 am the next day, with the count resetting again at 7:00 am.
The backend is Java (Spring Boot).
The columns in my table are
id (primary_id)
createdon (timestamp)
name
department
createdby
How do I write the condition for this shift-wise grouping?
You'd need to pick a slice based on the current time-of-day (I am assuming this to be some kind of counter which will be auto-refreshed in some application).
One way to do that is using time ranges:
SELECT COUNT(*)
FROM mytable
WHERE createdon <@ (
SELECT CASE
WHEN current_time < '07:00'::time THEN
tsrange(CURRENT_DATE - '1d'::interval + '07:00'::time, CURRENT_DATE + '07:00'::time, '[)')
ELSE
tsrange(CURRENT_DATE + '07:00'::time, CURRENT_DATE + '1d'::interval + '07:00'::time, '[)')
END
)
;
Example with data: https://rextester.com/LGIJ9639
As I understand the question, you need to have a separate group for values in each 24-hour period that starts at 07:00:00.
SELECT
(
date_trunc('day', (createdon - '7h'::interval))
+ '7h'::interval
) AS date_bucket,
count(id) AS count
FROM lorem
GROUP BY date_bucket
ORDER BY date_bucket
This uses the date and time functions and the GROUP BY clause:
Shift the timestamp value back 7 hours ((createdon - '7h'::interval)), so the distinction can be made by a change of date (at 00:00:00). Then,
Truncate the value to the date (date_trunc('day', …)), so that all values in a bucket are flattened to a single value (the date at midnight). Then,
Add 7 hours again to the value (… + '7h'::interval), so that it represents the starting time of the bucket. Then,
Group by that value (GROUP BY date_bucket).
A more complete example, with schema and data:
DROP TABLE IF EXISTS lorem;
CREATE TABLE lorem (
id serial PRIMARY KEY,
createdon timestamp not null
);
INSERT INTO lorem (createdon) (
SELECT
generate_series(
CURRENT_TIMESTAMP - '36h'::interval,
CURRENT_TIMESTAMP + '36h'::interval,
'45m'::interval)
);
Now the query:
SELECT
(
date_trunc('day', (createdon - '7h'::interval))
+ '7h'::interval
) AS date_bucket,
count(id) AS count
FROM lorem
GROUP BY date_bucket
ORDER BY date_bucket
;
produces this result:
date_bucket | count
---------------------+-------
2019-03-06 07:00:00 | 17
2019-03-07 07:00:00 | 32
2019-03-08 07:00:00 | 32
2019-03-09 07:00:00 | 16
(4 rows)
You can use aggregation -- by subtracting 7 hours:
select (createdon - interval '7 hour')::date as dy, count(*)
from t
group by dy
order by dy;
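If the 07:00 boundary itself should appear in the output rather than just the date, a hedged variant of the same idea (PostgreSQL syntax) is to shift the truncated date forward again:
select (createdon - interval '7 hour')::date + interval '7 hour' as shift_start,
       count(*)
from t
group by shift_start
order by shift_start;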

Google Big Query SQL - Get most recent unique value by date

EDIT: Following the comments, I have rephrased my question.
I have a BigQuery table that I want to use to get some KPIs for my application.
In this table, I save each create or update as a new row in order to keep a better history.
So the same data appears several times, each time with a different state.
Example of the table:
uuid |status |date
––––––|–––––––––––|––––––––––
3 |'inactive' |2018-05-12
1 |'active' |2018-05-10
1 |'inactive' |2018-05-08
2 |'active' |2018-05-08
3 |'active' |2018-05-04
2 |'inactive' |2018-04-22
3 |'inactive' |2018-04-18
We can see that there are multiple rows for each uuid, each with a different status.
What I would like to get:
I would like to have the number of currently 'active' entries (so there must be no later 'inactive' entry with the same uuid). And to complicate everything, I need this total per day.
So for each day, the number of 'active' entries, including those from previous days.
So with this example I should have this result :
date | actives
____________|_________
2018-05-02 | 0
2018-05-03 | 0
2018-05-04 | 1
2018-05-05 | 1
2018-05-06 | 1
2018-05-07 | 1
2018-05-08 | 2
2018-05-09 | 2
2018-05-10 | 3
2018-05-11 | 3
2018-05-12 | 2
I've actually managed to get the correct number of actives for a single day. But my problem is getting the results for each day.
What I've tried:
I'm stuck with two solutions that each return a different error.
First solution:
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT COUNT(uuid)
FROM (
SELECT
uuid, status, date,
RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
FROM users
WHERE
PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
)
WHERE
status = 'active'
and rank = 1
## rank is the condition which causes the error
) users
FROM
dates, UNNEST(arr_dates) i_date
ORDER BY i_date;
The SELECT with the RANK() OVER correctly returns the users with a rank column that allows me to know which entry is the latest for each uuid.
But when I try this, because of the rank = 1 condition, I get:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Second solution:
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT
COUNT(t1.uuid)
FROM
users t1
WHERE
t1.date = (
SELECT MAX(t2.date)
FROM users t2
WHERE
t2.uuid = t1.uuid
## Here the i_date condition is what causes the problem
AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date
)
AND status='active' ) users
FROM
dates,
UNNEST(arr_dates) i_date
ORDER BY i_date;
Here, the second select also works and correctly returns the number of active users for a given day.
But the problem appears when I try to use i_date to retrieve data across the multiple days.
And here I get a LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join error...
Which solution is more likely to succeed? What should I change?
And, if my way of storing the data isn't good, how should I proceed in order to keep a precise history?
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date
-- ORDER BY date
Update to address your "rephrased" question :o)
Below example is using dummy data from your question
#standardSQL
WITH `project.dataset.users` AS (
SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
SELECT 1, 'active', '2018-05-10' UNION ALL
SELECT 1, 'inactive', '2018-05-08' UNION ALL
SELECT 2, 'active', '2018-05-08' UNION ALL
SELECT 3, 'active', '2018-05-04' UNION ALL
SELECT 2, 'inactive', '2018-04-22' UNION ALL
SELECT 3, 'inactive', '2018-04-18'
), dates AS (
SELECT day FROM UNNEST((
SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
FROM `project.dataset.users`
)) day
), active_users AS (
SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
FROM `project.dataset.users` u
)
WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day
-- ORDER BY day
with result
Row day actives
1 2018-05-04 1
2 2018-05-05 1
3 2018-05-06 1
4 2018-05-07 1
5 2018-05-08 2
6 2018-05-09 2
7 2018-05-10 3
8 2018-05-11 3
9 2018-05-12 2
I think this -- or something similar -- will do what you want:
SELECT day,
coalesce(running_actives, 0) - coalesce(running_inactives, 0)
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2015-05-11'), DATE('2018-06-29'), INTERVAL 1 DAY)
) AS day left join
(select date, sum(countif(status = 'active')) over (order by date) as running_actives,
sum(countif(status = 'inactive')) over (order by date) as running_inactives
from t
group by date
) a
on a.date = day
order by day;
The exact solution depends on whether the "inactive" is inclusive of the day (as above) or takes effect the next day. Either is handled the same way, by using cumulative sums of actives and inactives and then taking the difference.
In order to get data on all days, this generates the days using arrays and unnest(). If you have data on all days, that step may be unnecessary.

Exclude overlapping periods in time aggregate function

I have a table where each row contains a start and an end date:
DROP TABLE temp_period;
CREATE TABLE public.temp_period
(
id integer NOT NULL,
"startDate" date,
"endDate" date
);
INSERT INTO temp_period(id,"startDate","endDate") VALUES(1,'2010-01-01','2010-03-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(2,'2013-05-17','2013-07-18');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(3,'2010-02-15','2010-05-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(7,'2014-01-01','2014-12-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(56,'2014-03-31','2014-06-30');
Now I want to know the total duration of all periods stored there. I need just the time as an interval. That's pretty easy:
SELECT sum(age("endDate","startDate")) FROM temp_period;
However, the problem is: Those periods do overlap. And I want to eliminate all overlapping periods, so that I get the total amount of time which is covered by at least one record in the table.
You see, there are quite a few gaps between the periods, so passing the smallest start date and the most recent end date to the age function won't do the trick. I thought about doing that and then subtracting the total amount of gap time, but no elegant way to do that came to mind.
I use PostgreSQL 9.6.
What about this:
WITH
/* get all time points where something changes */
points AS (
SELECT "startDate" AS p
FROM temp_period
UNION SELECT "endDate"
FROM temp_period
),
/*
* Get all date ranges between these time points.
* The first time range will start with NULL,
* but that will be excluded in the next CTE anyway.
*/
inter AS (
SELECT daterange(
lag(p) OVER (ORDER BY p),
p
) i
FROM points
),
/*
* Get all date ranges that are contained
* in at least one of the intervals.
*/
overlap AS (
SELECT DISTINCT i
FROM inter
CROSS JOIN temp_period
WHERE i <@ daterange("startDate", "endDate")
)
/* sum the lengths of the date ranges */
SELECT sum(age(upper(i), lower(i)))
FROM overlap;
For your data it will return:
┌──────────┐
│ interval │
├──────────┤
│ 576 days │
└──────────┘
(1 row)
You could try using a recursive CTE to calculate the period. For each record, we check whether it overlaps with previous records. If it does, we only count the part of the period that is not overlapping.
WITH RECURSIVE days_count AS
(
SELECT "startDate",
"endDate",
AGE("endDate", "startDate") AS total_days,
rowSeq
FROM ordered_data
WHERE rowSeq = 1
UNION ALL
SELECT GREATEST(curr."startDate", prev."endDate") AS "startDate",
GREATEST(curr."endDate", prev."endDate") AS "endDate",
AGE(GREATEST(curr."endDate", prev."endDate"), GREATEST(curr."startDate", prev."endDate")) AS total_days,
curr.rowSeq
FROM ordered_data curr
INNER JOIN days_count prev
ON curr.rowSeq > 1
AND curr.rowSeq = prev.rowSeq + 1),
ordered_data AS
(
SELECT *,
ROW_NUMBER() OVER (ORDER BY "startDate") AS rowSeq
FROM temp_period)
SELECT SUM(total_days) AS total_days
FROM days_count;
I've created a demo here
Actually, there is a case that is not covered by the previous examples.
What if we have a period like this?
INSERT INTO temp_period(id,"startDate","endDate") VALUES(100,'2010-01-03','2010-02-10');
We have the following intervals:
Interval No. | start_date | end_date
--------------+------------+------------
 1            | 2010-01-01 | 2010-03-31
 2            | 2010-01-03 | 2010-02-10
 3            | 2010-02-15 | 2010-05-31
 4            | 2013-05-17 | 2013-07-18
 5            | 2014-01-01 | 2014-12-31
 6            | 2014-03-31 | 2014-06-30
Even though interval 3 overlaps interval 1, it is treated as the start of a new segment, because each start is only compared with the immediately preceding end (that of interval 2), hence the (wrong) result:
sum
-----
620
(1 row)
The solution is to tweak the core of the query
CASE WHEN start_date < lag(end_date) OVER (ORDER BY start_date, end_date) then NULL ELSE start_date END
needs to be replaced by
CASE WHEN start_date < max(end_date) OVER (ORDER BY start_date, end_date rows between unbounded preceding and 1 preceding) then NULL ELSE start_date END
then it works as expected
sum
-----
576
(1 row)
Summary:
SELECT sum(e - s)
FROM (
SELECT left_edge as s, max(end_date) as e
FROM (
SELECT start_date, end_date, max(new_start) over (ORDER BY start_date, end_date) as left_edge
FROM (
SELECT start_date, end_date, CASE WHEN start_date < max(end_date) OVER (ORDER BY start_date, end_date rows between unbounded preceding and 1 preceding) then NULL ELSE start_date END AS new_start
FROM temp_period
) s1
) s2
GROUP BY left_edge
) s3;
This one required two outer joins on a complex query: one join to identify all overlaps with a start date larger than THIS and to expand the timespan to match the larger of the two, and a second join to match records with no overlaps. Take the min of the min and the max of the max, including non-matched rows. I was using MSSQL, so the syntax may be a bit different.
DECLARE @temp_period TABLE
(
id int NOT NULL,
startDate datetime,
endDate datetime
)
INSERT INTO @temp_period(id,startDate,endDate) VALUES(1,'2010-01-01','2010-03-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(2,'2013-05-17','2013-07-18')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-05-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-07-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(7,'2014-01-01','2014-12-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(56,'2014-03-31','2014-06-30')
;WITH OverLaps AS
(
SELECT
Main.id,
OverlappedID=Overlaps.id,
OverlapMinDate,
OverlapMaxDate
FROM
@temp_period Main
LEFT OUTER JOIN
(
SELECT
This.id,
OverlapMinDate=CASE WHEN This.StartDate<Prior.StartDate THEN This.StartDate ELSE Prior.StartDate END,
OverlapMaxDate=CASE WHEN This.EndDate>Prior.EndDate THEN This.EndDate ELSE Prior.EndDate END,
PriorID=Prior.id
FROM
@temp_period This
LEFT OUTER JOIN @temp_period Prior ON Prior.endDate > This.startDate AND Prior.startdate < this.endDate AND This.Id<>Prior.ID
) Overlaps ON Main.Id=Overlaps.PriorId
)
SELECT
T.Id,
--If has overlapped then sum all overlapped records prior to this one, else not and overlap get the start and end
MinDate=MIN(COALESCE(HasOverlapped.OverlapMinDate,startDate)),
MaxDate=MAX(COALESCE(HasOverlapped.OverlapMaxDate,endDate))
FROM
@temp_period T
LEFT OUTER JOIN OverLaps IsAOverlap ON IsAOverlap.OverlappedID=T.id
LEFT OUTER JOIN OverLaps HasOverlapped ON HasOverlapped.Id=T.id
WHERE
IsAOverlap.OverlappedID IS NULL -- Exclude older records that have overlaps
GROUP BY
T.Id
Beware: the answer by Laurenz Albe has a huge scalability issue.
I was more than happy when I found it. I customized it for our needs. We deployed to staging and very soon, the server took several minutes to return the results.
Then I found this answer on postgresql.org. Much more efficient.
https://wiki.postgresql.org/wiki/Range_aggregation
SELECT sum(e - s)
FROM (
SELECT left_edge as s, max(end_date) as e
FROM (
SELECT start_date, end_date, max(new_start) over (ORDER BY start_date, end_date) as left_edge
FROM (
SELECT start_date, end_date, CASE WHEN start_date < lag(end_date) OVER (ORDER BY start_date, end_date) then NULL ELSE start_date END AS new_start
FROM temp_period
) s1
) s2
GROUP BY left_edge
) s3;
Result:
sum
-----
576
(1 row)

Select repeat occurrences within time period <x days

If I had a large table (100,000+ entries) of service records or perhaps admission records, how would I find all the instances of re-occurrence within a set number of days?
The table setup could be something like this, likely with more columns:
Record ID Customer ID Start Date Time Finish Date Time
1 123456 24/04/2010 16:49 25/04/2010 13:37
3 654321 02/05/2010 12:45 03/05/2010 18:48
4 764352 24/03/2010 21:36 29/03/2010 14:24
9 123456 28/04/2010 13:49 31/04/2010 09:45
10 836472 19/03/2010 19:05 20/03/2010 14:48
11 123456 05/05/2010 11:26 06/05/2010 16:23
What I am trying to do is work out a way to select the records where there is a re-occurrence of the field [Customer ID] within a certain time period (< X days), where the time period is the Start Date Time of the second occurrence minus the Finish Date Time of the first occurrence.
This is what I would like it to look like once it was run for, say, x=7:
Record ID Customer ID Start Date Time Finish Date Time Re-occurence
9 123456 28/04/2010 13:49 31/04/2010 09:45 1
11 123456 05/05/2010 11:26 06/05/2010 16:23 2
I can solve this problem with a smaller set of records in Excel but have struggled to come up with a SQL solution in MS Access. I do have some SQL queries that I have tried but I am not sure I am on the right track.
Any advice would be appreciated.
I think this is a clear expression of what you want. It's not extremely high performance, but I'm not sure that you can avoid either a correlated sub-query or a Cartesian JOIN of the table to itself to solve this problem. It is standard SQL and should work in most any engine, although the details of the date math may differ:
SELECT * FROM YourTable YT1 WHERE EXISTS
(SELECT * FROM YourTable YT2 WHERE
YT2.CustomerID = YT1.CustomerID AND YT1.StartTime > YT2.FinishTime AND YT1.StartTime <= YT2.FinishTime + 7)
In order to accomplish this you would need to make a self join as you are comparing the entire table to itself. Assuming similar names it would look something like this:
select r1.customer_id, min(r1.start_time), max(r1.finish_time), count(1) as reoccurences
from records r1,
records r2
where r1.record_id > r2.record_id -- this ensures you don't double count the records
and r1.customer_id = r2.customer_id
and r1.start_time - r2.finish_time <= 7
group by r1.customer_id
You wouldn't be able to easily get both the record_id and the number of occurrences, but you could go back and find it by correlating the start time to the record number with that customer_id and start_time.
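Here is a hedged sketch of that follow-up step, joining the per-customer aggregate back to the base table on customer_id and the aggregated start time to recover the record_id (the names mirror the query above and are placeholders):
select r.record_id, agg.customer_id, agg.first_start, agg.reoccurences
from (
    select r1.customer_id,
           min(r1.start_time) as first_start,
           count(1) as reoccurences
    from records r1, records r2
    where r1.record_id > r2.record_id
      and r1.customer_id = r2.customer_id
      and r1.start_time - r2.finish_time <= 7
    group by r1.customer_id
) agg
join records r
  on r.customer_id = agg.customer_id
 and r.start_time = agg.first_start;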
This will do it:
declare @t table(Record_ID int, Customer_ID int, StartDateTime datetime, FinishDateTime datetime)
insert @t values(1 ,123456,'2010-04-24 16:49','2010-04-25 13:37')
insert @t values(3 ,654321,'2010-05-02 12:45','2010-05-03 18:48')
insert @t values(4 ,764352,'2010-03-24 21:36','2010-03-29 14:24')
insert @t values(9 ,123456,'2010-04-28 13:49','2010-04-30 09:45')
insert @t values(10,836472,'2010-03-19 19:05','2010-03-20 14:48')
insert @t values(11,123456,'2010-05-05 11:26','2010-05-06 16:23')
declare @days int
set @days = 7
;with a as (
select record_id, customer_id, startdatetime, finishdatetime,
rn = row_number() over (partition by customer_id order by startdatetime asc)
from @t),
b as (
select record_id, customer_id, startdatetime, finishdatetime, rn, 0 recurrence
from a
where rn = 1
union all
select a.record_id, a.customer_id, a.startdatetime, a.finishdatetime,
a.rn, case when a.startdatetime - @days < b.finishdatetime then recurrence + 1 else 0 end
from b join a
on b.rn = a.rn - 1 and b.customer_id = a.customer_id
)
select record_id, customer_id, startdatetime, recurrence from b
where recurrence > 0
Result:
https://data.stackexchange.com/stackoverflow/q/112808/
I just realized it should be done in Access. I'm sorry, this was written for SQL Server 2005; I don't know how to rewrite it for Access.