Most efficient way to count the number of matching rows with multiple criteria at once - SQL

I have a very large table (called device_operation, with 50 million rows) which holds all the operations of a product in its lifecycle (such as "start", "stop", "refill", ...), the status of each operation (column status: Success, Failed), the ID of the associated device (column device_id) and a timestamp for each operation (column create_date).
Something like this:
/------+-----------+------------------+---------\
|  ID  | Device ID |   Create_Date    | Status  |
+------+-----------+------------------+---------+
|    1 |         1 | 2012-03-04 01:43 | Success |
|    2 |         4 | 2012-04-04 02:34 | Failed  |
|    3 |         9 | 2013-01-01 01:23 | Failed  |
|    4 |         4 | 2013-12-12 12:34 | Success |
|    5 |        23 | 2014-02-01 03:45 | Success |
|    6 |         1 | 2014-05-03 08:34 | Failed  |
\------+-----------+------------------+---------/
I also have another table (called subscription) that tells me when the warranty started (column create_date) for each product (column device_id). The warranty lasts one year.
/-----------+------------------\
| Device ID |   Create_Date    |
+-----------+------------------+
|         2 | 2011-04-03 05:00 |
|         4 | 2012-03-05 03:45 |
|         5 | 2012-03-05 06:07 |
|       ... | ...              |
\-----------+------------------/
I am using PostgreSQL.
I want to do the following:
List all device IDs which had at least one successful operation before a given date (2014-07-06).
For each of those devices, count:
The number of failed operations after that date + 2 days (2014-07-08), where the device was still under warranty when the operation was attempted
The number of failed operations after that date + 2 days (2014-07-08), where the device was no longer under warranty when the operation was attempted
The number of successful operations after that date (whether the device was under warranty or not)
I had some limited success with the following (the query has been simplified a bit for readability - there are other joins involved to get to the subscription table, and other criteria to include the devices in the list):
SELECT distinct device_operation.device_id as did, subscription.create_date,
    (
        SELECT COUNT(*)
        FROM device_operation dop
        WHERE dop.device_id = device_operation.device_id and
              dop.create_date > '2014-07-08' and
              dop.status = 'Success'
    ) as success,
    (
        SELECT COUNT(*)
        FROM device_operation dop2
        WHERE dop2.device_id = subscription.device_id and
              dop2.create_date > '2014-07-08' and
              dop2.status = 'Failed' and
              dop2.create_date <= subscription.create_date + interval '1 year'
    ) as failed_during_warranty,
    (
        SELECT COUNT(*)
        FROM device_operation dop2
        WHERE dop2.device_id = subscription.device_id and
              dop2.create_date > '2014-07-08' and
              dop2.status = 'Failed' and
              dop2.create_date > subscription.create_date + interval '1 year'
    ) as failed_after_warranty
FROM device_operation, subscription
WHERE
    device_operation.status = 'Success' and               -- list operations which are successful
    device_operation.create_date <= '2014-07-06' and      -- list operations before that date
    device_operation.device_id = subscription.device_id   -- get warranty start for each operation
ORDER BY success DESC, failed_during_warranty DESC, failed_after_warranty DESC
As you can guess, it's so slow I cannot run the query, but it gives you an idea of the structure.
I have tried to use NULLIF to combine the counts into a single subquery, in the hope that PostgreSQL would only run the subquery once instead of three times, but it returns "subquery must return only one column":
SELECT distinct device_operation.device_id as did, subscription.create_date,
    (
        SELECT COUNT(NULLIF(dop2.status != 'Success', true)) as completed,
               COUNT(NULLIF(dop2.status != 'Failed' or not (dop2.create_date <= subscription.create_date + interval '1 year'), true)) as failed_in_warranty,
               COUNT(NULLIF(dop2.status != 'Failed' or (dop2.create_date <= subscription.create_date + interval '1 year'), true)) as failed_after_warranty
        FROM device_operation dop2
        WHERE dop2.device_id = device_operation.device_id and
              dop2.device_id = subscription.device_id and
              dop2.create_date > '2014-07-08'
    ) as subq
FROM device_operation, subscription
WHERE
    device_operation.status = 'Success' and               -- list operations which are successful
    device_operation.create_date <= '2014-07-06' and      -- list operations before that date
    device_operation.device_id = subscription.device_id   -- get warranty start for each operation
ORDER BY success DESC, failed_in_warranty DESC, failed_outside_warranty DESC
I also tried to move the subquery to the FROM clause, but that doesn't work as I need to run the subquery for each row of the main query (or do I? Maybe there's a better way).
What I expect is something like this:
/-----------+---------+------------------------+-----------------------\
| Device ID | Success | Failed during warranty | Failed after warranty |
+-----------+---------+------------------------+-----------------------+
|    194853 |      10 |                      0 |                     0 |
|      7853 |       5 |                      5 |                     0 |
|      5848 |       3 |                      0 |                    56 |
|   8546455 |       0 |                     45 |                     0 |
|       102 |       0 |                      4 |                     1 |
|  69329548 |       0 |                      0 |                     9 |
|        17 |       0 |                      0 |                     0 |
\-----------+---------+------------------------+-----------------------/
Can someone help me find the most efficient way to do this?
EDIT: Corner cases: you can assume that all devices have an entry in subscription.
Thank you very much!

I think you just require conditional aggregation. I find the data structure and logic a bit hard to follow, but I think the following is basically what you need:
SELECT d.device_id,
       SUM(CASE WHEN d.status = 'Failed' AND d.create_date > '2014-07-06' + interval '2 day'
                THEN 1 ELSE 0
           END) as NumFails,
       SUM(CASE WHEN d.status = 'Failed' AND d.create_date > '2014-07-06' + interval '2 day' AND
                     d.create_date > s.create_date + interval '1 year'
                THEN 1 ELSE 0
           END) as NumFailsNoWarranty,
       SUM(CASE WHEN d.status = 'Success' AND d.create_date > '2014-07-06'
                THEN 1 ELSE 0
           END) as NumSuccesses
FROM device_operation d JOIN
     subscription s
     ON d.device_id = s.device_id
GROUP BY d.device_id
HAVING SUM(CASE WHEN d.status = 'Success' AND d.create_date <= '2014-07-06' THEN 1 ELSE 0 END) > 0;
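On PostgreSQL 9.4 or later, the same conditional aggregation can also be written with the FILTER clause, which some find easier to read. A sketch under the same assumptions (table and column names as above, warranty running one year from subscription.create_date):
-- Sketch only: FILTER requires PostgreSQL 9.4+; dates are the ones from the question
SELECT d.device_id,
       COUNT(*) FILTER (WHERE d.status = 'Success'
                          AND d.create_date > '2014-07-06')                        AS num_successes,
       COUNT(*) FILTER (WHERE d.status = 'Failed'
                          AND d.create_date > '2014-07-08'
                          AND d.create_date <= s.create_date + interval '1 year')  AS failed_during_warranty,
       COUNT(*) FILTER (WHERE d.status = 'Failed'
                          AND d.create_date > '2014-07-08'
                          AND d.create_date >  s.create_date + interval '1 year')  AS failed_after_warranty
FROM device_operation d
JOIN subscription s ON s.device_id = d.device_id
GROUP BY d.device_id
HAVING COUNT(*) FILTER (WHERE d.status = 'Success' AND d.create_date <= '2014-07-06') > 0;
Either way, an index on device_operation (device_id, status, create_date) would likely help.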

Related

How to limit a SUM using postgres

I want to cap the values going into a SUM at 2 hours in PostgreSQL. I don't want to limit the result of the sum, but each value that is going to be summed.
For example, with the following table:
| CODE | HOUR  |
| ---- | ----- |
| 1    | 02:20 |
| 2    | 01:50 |
| 3    | 00:30 |
| 4    | 03:00 |
if I SUM('02:20', '01:50', '00:30', '03:00') the result would be 07:40.
But what I want is to limit the column HOUR to 02:00. So if a value is > 02:00, it will be replaced with 02:00, but only inside the SUM.
The SUM would then be over ('02:00', '01:50', '00:30', '02:00'), and the result would be 06:20.
Use a CASE expression as the argument of the aggregate function:
select sum(case when hour < '2:00' then hour else '2:00' end)
from my_table
Test it in Db<>fiddle.
It's still not perfect, but the idea is this:
select sum(x.anything::time)
from (select id,
             time,
             case when time <= '02:00' then time::text
                  else '02:' || (EXTRACT(MINUTES FROM time::time)::text)
             end as anything
      from time_table) x
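If the HOUR column is stored as an interval (an assumption; the question doesn't show the column type), the cap can also be written with least(), which avoids the CASE entirely:
-- Sketch only: assumes a table my_table with an interval column "hour", as in the first query above
select sum(least(hour, interval '2 hours')) as capped_total
from my_table;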

SQL for Microsoft Access 2013

I am trying to get a count of customers who return for additional services, taking into account a comparison of assessment scores prior to the service, grouped by the type of service. (Ultimately, I also want to be able to ignore returns that are within a month of the original service, but I'm pretty sure I can sort that wrinkle out myself.)
When counting results for a particular service, it should look at returns to any service type, not just the original service type. (Edit: it should also look at all future returns, not just the next or the most recent.)
It does not need to be run often, but there are 15000+ lines of data and computational resources are limited by an underpowered machine (this is for a nonprofit organization), so efficiency would be nice but not absolutely needed.
Sample Data
ServiceTable
CustomerID  Service   Date       ScoreBefore
A           Service1  1/1/2017   1
A           Service2  1/3/2017   1
A           Service1  1/1/2018   4
B           Service3  3/1/2018   3
B           Service1  6/1/2018   1
B           Service1  6/2/2018   1
C           Service2  1/1/2019   4
C           Service2  6/1/2019   1
Results should be (not taking into account the date padding option):
Service1
ReturnedWorse 0
ReturnedSame 2
ReturnedBetter 1
Service2
ReturnedWorse 1
ReturnedSame 0
ReturnedBetter 1
Service3
ReturnedWorse 2
So far, I have tried creating make-table queries that could then be queried to get the aggregate info, but I am a bit stuck and suspect there may be a better route.
What I have tried:
SELECT CustomerID, Service, Date, ScoreBefore INTO ReturnedWorse
FROM ServiceTable AS FirstStay
WHERE ((((SELECT COUNT(*)
FROM ServiceTable AS SecondStay
WHERE FirstStay.CustomerID=SecondStay.CustomerID
AND
FirstStay.ScoreBefore> SecondStay.ScoreBefore
AND
SecondStay.Date > FirstStay.Date))));
Any help would be greatly appreciated.
This would have been easier to do with window functions, but they are not available in MS Access.
Here is a query that matches my understanding of your question:
t0: pick a record in the table (a customer buying a service)
t1: pull out the record corresponding to the next time the same customer contracted any service, using an INNER JOIN and a correlated subquery (if there is no such record, the initial record is not taken into account)
compare the score of the previous record to the current one
group the results by service
You can see it in action in this db fiddle. The results are slightly different from your expectation (see my comments), but they are consistent with the above explanation; you might want to adapt some of the rules to match your exact expected result, using the same principles.
SELECT
    t0.service,
    SUM(CASE WHEN t1.scorebefore < t0.scorebefore THEN 1 ELSE 0 END) AS ReturnedWorse,
    SUM(CASE WHEN t1.scorebefore = t0.scorebefore THEN 1 ELSE 0 END) AS ReturnedSame,
    SUM(CASE WHEN t1.scorebefore > t0.scorebefore THEN 1 ELSE 0 END) AS ReturnedBetter
FROM
    mytable t0
    INNER JOIN mytable t1
        ON  t0.customerid = t1.customerid
        AND t0.date < t1.date
        AND NOT EXISTS (
            SELECT 1
            FROM mytable
            WHERE customerid = t1.customerid
              AND date < t1.date
              AND date > t0.date
        )
GROUP BY t0.service
| service | ReturnedWorse | ReturnedSame | ReturnedBetter |
| -------- | ------------- | ------------ | -------------- |
| Service1 | 0 | 2 | 0 |
| Service2 | 1 | 0 | 1 |
| Service3 | 1 | 0 | 0 |
From your comments, I understand that you want to take into account all future returns and not only the next one. This eliminates the need for a correlated subquery, and actually yields your expected output. See this db fiddle:
SELECT
    t0.service,
    SUM(CASE WHEN t1.scorebefore < t0.scorebefore THEN 1 ELSE 0 END) AS ReturnedWorse,
    SUM(CASE WHEN t1.scorebefore = t0.scorebefore THEN 1 ELSE 0 END) AS ReturnedSame,
    SUM(CASE WHEN t1.scorebefore > t0.scorebefore THEN 1 ELSE 0 END) AS ReturnedBetter
FROM
    mytable t0
    INNER JOIN mytable t1
        ON  t0.customerid = t1.customerid
        -- AND t0.service = t1.service
        AND t0.date < t1.date
GROUP BY t0.service
| service | ReturnedWorse | ReturnedSame | ReturnedBetter |
| -------- | ------------- | ------------ | -------------- |
| Service1 | 0 | 2 | 1 |
| Service2 | 1 | 0 | 1 |
| Service3 | 2 | 0 | 0 |
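For reference, on an engine that does support window functions (so not MS Access), the "compare to the next visit" logic of the first query could be written with LEAD(). A sketch, assuming the same mytable columns as in the fiddle:
-- Sketch only: LEAD() works in SQL Server, PostgreSQL, etc., but not in MS Access.
-- Reproduces the "next return only" variant of the first query above.
SELECT service,
       SUM(CASE WHEN next_score < scorebefore THEN 1 ELSE 0 END) AS ReturnedWorse,
       SUM(CASE WHEN next_score = scorebefore THEN 1 ELSE 0 END) AS ReturnedSame,
       SUM(CASE WHEN next_score > scorebefore THEN 1 ELSE 0 END) AS ReturnedBetter
FROM (SELECT service, scorebefore,
             LEAD(scorebefore) OVER (PARTITION BY customerid ORDER BY date) AS next_score
      FROM mytable) t
GROUP BY service;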

How to write a SQL query to count instances where a row containing a distinct id occurs 7 days after the first occurrence of that id?

I am looking to return, for each date, the count of distinct_ids whose first occurrence was on that date, the number of those distinct_ids that occurred again 7 days after their first occurrence, and the percentage (occurrences after 7 days / number of first occurrences).
example data_import table
+---------------------+------------------+
| time | distinct_id |
+---------------------+------------------+
| 2018/10/01 | 1 | first instance of `1`
+---------------------+------------------+
| 2018/10/01 | 2 | also first instance, but does not occur 7 days later
+---------------------+------------------+
| 2018/10/02 | 1 | should be disregarded (not first instance of 1)
+---------------------+------------------+
| 2018/10/02 | 3 | first instance of `3`
+---------------------+------------------+
| 2018/10/08 | 1 | First instance 7 days after first instance of `1`
+---------------------+------------------+
| 2018/10/08 | 1 | Don't count as this is the 2nd instance of `1` on this day
+---------------------+------------------+
| 2018/10/09 | 3 | 7 days after first instance of `3`
+---------------------+------------------+
| 2018/10/09 | 1 | 7 days after non-first instance of `1`
+---------------------+------------------+
And the expected return.
+---------------------+----------------------+------------------------+---------------------------+
| time | num_of_1st_instance | num_occur_7_days_after | percent_used_7_days_after |
+---------------------+----------------------+------------------------+---------------------------+
| 2018/10/01 | 2 | 1 | .50 |
+---------------------+----------------------+------------------------+---------------------------+
| 2018/10/02 | 1 | 1 | 1.0 |
+---------------------+----------------------+------------------------+---------------------------+
| 2018/10/03 | 0 | 0 | 0 |
+---------------------+----------------------+------------------------+---------------------------+
The query I have written is close, but it counts occurrences other than the first for a distinct_id.
In my example, this query would include the occurrence of distinct_id 1 on 2018/10/02 and its occurrence seven days later on 2018/10/09. That is not wanted, as the 2018/10/02 occurrence of distinct_id 1 is not its first.
SELECT
    data_import.time AS date,
    count(distinct data_import.distinct_id) AS num_installs_on_install_date,
    count(distinct future_activity.distinct_id) AS num_occur_7_days_after,
    count(distinct future_activity.distinct_id) / count(distinct data_import.distinct_id)::float AS percent_used_7_days_after
FROM data_import
LEFT JOIN data_import AS future_activity ON
    data_import.distinct_id = future_activity.distinct_id
    AND DATE(data_import.time) = DATE(future_activity.time) - INTERVAL '7 days'
    AND data_import.time = (SELECT time
                            FROM data_import
                            WHERE distinct_id = future_activity.distinct_id
                            ORDER BY time
                            LIMIT 1)
GROUP BY DATE(data_import.time)
I hope that I explained this clearly. Please let me know how I can change my current query, or suggest a different approach to the solution.
Hmmm. Does this do what you want?
select di.time,
       sum( (seqnum = 1)::int ) as first_instance,
       sum( flag_7day ) as num_after_7_day,
       sum( flag_7day ) * 1.0 / nullif(sum( (seqnum = 1)::int ), 0) as ratio
from (select di.*,
             row_number() over (partition by distinct_id order by time) as seqnum,
             (case when exists (select 1
                                from data_import di2
                                where di2.distinct_id = di.distinct_id
                                  and di2.time > di.time + interval '7 day')
                   then 1 else 0
              end) as flag_7day
      from data_import di
     ) di
group by di.time;
This doesn't return days with no first instances. Those days seem a bit awkward with respect to the ratio, so I'm not 100% sure that you really need them. If you do, it is easy enough to include a generate_series() to generate all dates in the range that you want.
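A sketch of that generate_series() idea, assuming time is a date column (as in the sample data) and using placeholder date bounds: build a calendar of days and left join the aggregated results to it, so days with no first instances still show up with zeros.
-- Sketch only: calendar of days (placeholder bounds) left-joined to the aggregation above,
-- so days with no activity appear with zero counts. Assumes "time" is a date column.
select cal.d::date as time,
       coalesce(r.first_instance, 0)  as first_instance,
       coalesce(r.num_after_7_day, 0) as num_after_7_day
from generate_series(date '2018-10-01', date '2018-10-31', interval '1 day') as cal(d)
left join (select di.time,
                  sum( (seqnum = 1)::int ) as first_instance,
                  sum( flag_7day )         as num_after_7_day
           from (select di.*,
                        row_number() over (partition by distinct_id order by time) as seqnum,
                        (case when exists (select 1
                                           from data_import di2
                                           where di2.distinct_id = di.distinct_id
                                             and di2.time > di.time + interval '7 day')
                              then 1 else 0
                         end) as flag_7day
                 from data_import di
                ) di
           group by di.time
          ) r on r.time = cal.d::date
order by 1;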

Correlate Sequences of Independent Events - Calculate Time Intersection

We are building a Power BI reporting solution. I (well, Stack) solved one problem, and the business came up with a new reporting idea. I'm not sure of the best way to approach it, as I know very little about Power BI and the business seems to want quite complex reports.
We have two sequences of events from separate data sources. Both contain independent events occurring to vehicles. One describes which location a vehicle is in; the other describes incident events, which carry a reason code for the incident. The business wants to report on time spent in each location for each reason. Vehicles can change location totally independently of the incident events, and events are datetimes that occur at random points throughout the day. Each type of event has a start time, an end time and a VehicleID.
Vehicle Location Events
+------------------+-----------+------------+-----------------+----------------+
| LocationDetailID | VehicleID | LocationID | StartDateTime | EndDateTime |
+------------------+-----------+------------+-----------------+----------------+
| 1 | 1 | 1 | 2012-1-1 | 2016-1-1 |
| 2 | 1 | 2 | 2016-1-1 | 2016-4-1 |
| 3 | 1 | 1 | 2016-4-1 | 2016-11-1 |
| 4 | 2 | 1 | 2011-1-1 | 2016-11-1 |
+------------------+-----------+------------+-----------------+----------------+
Vehicle Status Events
+---------+---------------+-------------+-----------+--------------+
| EventID | StartDateTime | EndDateTime | VehicleID | ReasonCodeID |
+---------+---------------+-------------+-----------+--------------+
| 1 | 2012-1-1 | 2013-1-1 | 1 | 1 |
| 2 | 2013-1-1 | 2015-1-1 | 1 | 3 |
| 3 | 2015-1-1 | 2016-5-1 | 1 | 4 |
| 4 | 2016-5-1 | 2016-11-1 | 1 | 2 |
| 5 | 2015-9-1 | 2016-2-1 | 2 | 1 |
+---------+---------------+-------------+-----------+--------------+
Is there any way I can correlate the two streams and calculate the total time per vehicle, per ReasonCode, per location? This seems to require being able to relate the two kinds of events, since a change of location may occur part way through a given ReasonCode.
Calculation example for ReasonCodeID 4:
VehicleID 1 is in location ID 1 from 2012-1-1 to 2016-1-1 and from 2016-4-1 to 2016-11-1.
VehicleID 1 is in location ID 2 from 2016-1-1 to 2016-4-1.
VehicleID 1 has ReasonCodeID 4 from 2015-1-1 to 2016-5-1.
Therefore the first period in location 1 intersects with 365 days of ReasonCodeID 4 (2015-1-1 to 2016-1-1), and the second period in location 1 intersects with 30 days (2016-4-1 to 2016-5-1).
The period in location 2 intersects with 91 days of ReasonCodeID 4 (2016-1-1 to 2016-4-1).
Desired output would be the below.
+-----------+--------------+------------+------------+
| VehicleID | ReasonCodeID | LocationID | Total Days |
+-----------+--------------+------------+------------+
| 1 | 1 | 1 | 366 |
| 1 | 3 | 1 | 730 |
| 1 | 4 | 1 | 395 |
| 1 | 4 | 2 | 91 |
| 1 | 2 | 1 | 184 |
| 2 | 1 | 1 | 154 |
+-----------+--------------+------------+------------+
I have created a SQL fiddle that shows the structure here
Vehicles have related tables, and I'm sure the business will want them grouped by vehicle class etc., but if I can understand how to calculate the intersection points in this case, that would give me the basis for the rest of the reporting.
I think this solution requires a CROSS JOIN implementation. The relationship between the two tables is many-to-many, which would imply creating a third table to bridge LocationEvents and VehicleStatusEvents, so I think specifying the relationship in the expression itself is easier.
I use a CROSS JOIN between both tables, then filter the results to keep only the rows whose VehicleID columns are the same in both tables. I also filter down to the rows where the VehicleStatusEvents date range intersects the LocationEvents date range.
Once the filtering is done, I add a column that calculates the count of days in each intersection. Finally, the measure sums up the days for each VehicleID, ReasonCodeID and LocationID.
In order to implement the CROSS JOIN you will have to rename the VehicleID, StartDateTime and EndDateTime columns on one of the two tables; this is necessary to avoid ambiguous column name errors.
I rename the columns as follows:
VehicleID : LocationVehicleID and StatusVehicleID
StartDateTime : LocationStartDateTime and StatusStartDateTime
EndDateTime : LocationEndDateTime and StatusEndDateTime
After this you can use CROSSJOIN in the Total Days measure:
Total Days =
SUMX (
FILTER (
ADDCOLUMNS (
FILTER (
CROSSJOIN ( LocationEvents, VehicleStatusEvents ),
LocationEvents[LocationVehicleID] = VehicleStatusEvents[StatusVehicleID]
&& LocationEvents[LocationStartDateTime] <= VehicleStatusEvents[StatusEndDateTime]
&& LocationEvents[LocationEndDateTime] >= VehicleStatusEvents[StatusStartDateTime]
),
"CountOfDays", IF (
[LocationStartDateTime] <= [StatusStartDateTime]
&& [LocationEndDateTime] >= [StatusEndDateTime],
DATEDIFF ( [StatusStartDateTime], [StatusEndDateTime], DAY ),
IF (
[LocationStartDateTime] > [StatusStartDateTime]
&& [LocationEndDateTime] >= [StatusEndDateTime],
DATEDIFF ( [LocationStartDateTime], [StatusEndDateTime], DAY ),
IF (
[LocationStartDateTime] <= [StatusStartDateTime]
&& [LocationEndDateTime] <= [StatusEndDateTime],
DATEDIFF ( [StatusStartDateTime], [LocationEndDateTime], DAY ),
IF (
[LocationStartDateTime] >= [StatusStartDateTime]
&& [LocationEndDateTime] <= [StatusEndDateTime],
DATEDIFF ( [LocationStartDateTime], [LocationEndDateTime], DAY ),
BLANK ()
)
)
)
)
),
LocationEvents[LocationID] = [LocationID]
&& VehicleStatusEvents[ReasonCodeID] = [ReasonCodeID]
),
[CountOfDays]
)
Then in Power BI you can build a matrix (or any other visualization) using this measure.
If you don't completely understand the measure expression, here is the T-SQL translation:
SELECT
dt.VehicleID,
dt.ReasonCodeID,
dt.LocationID,
SUM(dt.Diff) [Total Days]
FROM
(
SELECT
CASE
WHEN a.StartDateTime <= b.StartDateTime AND a.EndDateTime >= b.EndDateTime -- Inside range
THEN DATEDIFF(DAY, b.StartDateTime, b.EndDateTime)
WHEN a.StartDateTime > b.StartDateTime AND a.EndDateTime >= b.EndDateTime -- |-----|*****|....|
THEN DATEDIFF(DAY, a.StartDateTime, b.EndDateTime)
WHEN a.StartDateTime <= b.StartDateTime AND a.EndDateTime <= b.EndDateTime -- |...|****|-----|
THEN DATEDIFF(DAY, b.StartDateTime, a.EndDateTime)
WHEN a.StartDateTime >= b.StartDateTime AND a.EndDateTime <= b.EndDateTime -- |---|****|-----
THEN DATEDIFF(DAY, a.StartDateTime, a.EndDateTime)
END Diff,
a.VehicleID,
b.ReasonCodeID,
a.LocationID --a.StartDateTime, a.EndDateTime, b.StartDateTime, b.EndDateTime
FROM LocationEvents a
CROSS JOIN VehicleStatusEvents b
WHERE a.VehicleID = b.VehicleID
AND
(
(a.StartDateTime <= b.EndDateTime)
AND (a.EndDateTime >= b.StartDateTime)
)
) dt
GROUP BY dt.VehicleID,
dt.ReasonCodeID,
dt.LocationID
Note in T-SQL you could use an INNER JOIN operator too.
Let me know if this helps.
select coalesce(l.VehicleID, s.VehicleID) as VehicleID
      ,s.ReasonCodeID
      ,l.LocationID
      ,sum
       (
           datediff
           (
               day
              ,case when s.StartDateTime > l.StartDateTime then s.StartDateTime else l.StartDateTime end
              ,case when s.EndDateTime < l.EndDateTime then s.EndDateTime else l.EndDateTime end
           )
       ) as TotalDays
from VehicleLocationEvents as l
full join VehicleStatusEvents as s
    on s.VehicleID = l.VehicleID
    and case when s.StartDateTime > l.StartDateTime then s.StartDateTime else l.StartDateTime end <=
        case when s.EndDateTime < l.EndDateTime then s.EndDateTime else l.EndDateTime end
group by coalesce(l.VehicleID, s.VehicleID)
        ,s.ReasonCodeID
        ,l.LocationID
or
select VehicleID
      ,ReasonCodeID
      ,LocationID
      ,sum(datediff(day, max_StartDateTime, min_EndDateTime)) as TotalDays
from (select coalesce(l.VehicleID, s.VehicleID) as VehicleID
            ,s.ReasonCodeID
            ,l.LocationID
            ,case when s.StartDateTime > l.StartDateTime then s.StartDateTime else l.StartDateTime end as max_StartDateTime
            ,case when s.EndDateTime < l.EndDateTime then s.EndDateTime else l.EndDateTime end as min_EndDateTime
      from VehicleLocationEvents as l
      full join VehicleStatusEvents as s
          on s.VehicleID = l.VehicleID
     ) ls
where max_StartDateTime <= min_EndDateTime
group by VehicleID
        ,ReasonCodeID
        ,LocationID
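The CASE expressions in both queries above are emulating GREATEST/LEAST of the interval bounds. On an engine that has those functions (PostgreSQL, or SQL Server 2022+), a sketch of the same overlap calculation, keeping the T-SQL DATEDIFF syntax used above:
-- Sketch only: overlap in days between each location interval and status interval,
-- written with GREATEST/LEAST instead of CASE (requires SQL Server 2022+, or adapt for PostgreSQL)
select l.VehicleID,
       s.ReasonCodeID,
       l.LocationID,
       sum(datediff(day,
                    greatest(s.StartDateTime, l.StartDateTime),
                    least(s.EndDateTime, l.EndDateTime))) as TotalDays
from VehicleLocationEvents as l
join VehicleStatusEvents as s
  on  s.VehicleID = l.VehicleID
  and greatest(s.StartDateTime, l.StartDateTime) <= least(s.EndDateTime, l.EndDateTime)
group by l.VehicleID, s.ReasonCodeID, l.LocationID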

SQL Query Pivot Data 2 rows into 1

I am trying to write a query which pivots activity data into summary rows. For example, the input data is an activity date, a description, and a status of either Start, Stop, or null if it is an informative row.
This is an example of the data format:
ID   | ActivityID | ActivityType | ActivityDate   | Status | Activity
----------------------------------------------------------------------
701  | 26         | Start        | 02/07/13 15:16 | 10     | Run Job
728  | 26         | No Change    | 05/07/13 09:30 | 20     | Running
859  | 26         | Stop         | 22/07/13 12:45 | 30     | Error
1064 | 26         | Start        | 10/08/13 13:26 | 11     | Restarted
1524 | 26         | Stop         | 28/08/13 10:19 | 31     | Error
1785 | 26         | Stop         | 07/09/13 11:48 | 31     | Error
2205 | 26         | Start        | 17/09/13 09:05 | 10     | Restarted
2528 | 26         | Start        | 14/10/13 17:56 | 11     | Restarted
2528 | 26         | Stop         | 25/10/13 20:47 | 32     | Completed
And this is the expected result:
ActivityID | Start_Type | Start_Date | Start_Status | Start_Activity | Stop_Type | Stop_Date | Stop_Status | Stop_Activity
---------------------------------------------------------------------------------
26 | Start | 02/07/13 15:16 | 10 | Run Job | Stop | 22/07/13 12:45 | 30 | Error
26 | Start | 10/08/13 13:26 | 11 | Restarted | Stop | 28/08/13 10:19 | 31 | Error
26 | Start | 17/09/13 09:05 | 10 | Restarted | Stop | 25/10/13 20:47 | 32 | Done
I want to get all the starts and put the corresponding stop in the same row, as activity_start_date, activity_stop_date and so on.
My problem is that I want the first stop after each start (which I have done), but I also want to ignore a start that has another start before it. For instance Start, Stop, Stop, Start, Start, Stop, Start would return: Start Stop, Start Stop, Start null.
I have tried a left join to get the first stop after a start, but this also matches the same stop twice when there is a second start.
I thought I needed a temp table, but I think this is unnecessary. Would a date variable work, where the first start sets the variable, then the stop is the first stop after the variable (which sets the variable again), and the next start is the first start after the newly set variable?
Your help is appreciated!
This is what I have so far:
SELECT
activity.ID,
activity.ActivityType AS StartActivityType,
activity.ActivityDate AS StartActivityDate,
activity.Status AS StartStatus,
activity.Activity AS StartActivity,
activity2.ActivityType AS StopActivityType,
activity2.ActivityDate AS StopActivityDate,
activity2.Status AS StopStatus,
activity2.Activity AS StopActivity
FROM tempdb..#TempTable activity
FULL OUTER JOIN #TempTable activity2
ON activity.ID = activity2.ID
AND activity2.ActivityType = 'Stop'
AND (activity2.ActivityDate > activity.ActivityDate)
AND (activity2.ActivityDate = ( SELECT MIN(activity3.ActivityDate)
FROM tempdb..#TempTable activity3
WHERE activity.ID = activity2.ID
AND activity3.ActivityType = 'Stop'
AND activity3.ActivityDate > activity.ActivityDate))
WHERE activity.ActivityType = 'Start'
--does not have a start before
AND activity.ActivityDate > ( SELECT MAX(activity3.ActivityDate)
FROM tempdb..#TempTable activity2
WHERE activity2.PathwayID = activity.ID
AND activity2.ActivityType IN ('Stop','Start')
AND activity2.ActivityDate > activity.ActivityDate))
--AND activity2.ActivityDate != LAG(activity2.ActivityDate) OVER (ORDER BY activity.ActivityDate),
--AND activity.ActivityDate IS NULL
ORDER BY activity.ID ASC
select *
from #TempTable a1
cross apply
(
    select top 1 *
    from #TempTable a2
    where a2.ActivityType = 'Stop'
      and a2.ActivityID = a1.ActivityID
      and a2.ActivityDate > a1.ActivityDate
    order by a2.ActivityDate
) a2
where a1.ActivityType = 'Start'
order by a1.ActivityDate;
You have some oddities in your data that I do not understand, but this should get you 90% there and it's much cleaner than trying to join back on ActivityDate.
The query from shriop is almost correct, but it fails where there are consecutive rows with ActivityType = 'Start', like 2205 and 2528. For those, the duplicated rows (the rows with the same ActivityType as the preceding row) need to be dropped.
If the OP uses SQL Server 2012 or later, this can be done using LAG:
WITH DATA AS (
SELECT ActivityID
, ActivityType
, ActivityDate
, Status
, Activity
, LastActivity = LAG(ActivityType, 1, 'Stop') OVER (ORDER BY ActivityDate)
FROM Table1
WHERE ActivityType IN ('Start', 'Stop')
)
SELECT a1.ActivityID
, Start_Type = a1.ActivityType
, Start_Date = a1.ActivityDate
, Start_Status = a1.Status
, Start_Activity = a1.Activity
, Stop_Type = a2.ActivityType
, Stop_Date = a2.ActivityDate
, Stop_Status = a2.Status
, Stop_Activity = a2.Activity
FROM DATA a1
CROSS apply (SELECT top 1 *
FROM Table1 a2
WHERE a2.ActivityType = 'Stop'
AND a2.ActivityID = a1.ActivityID
AND a2.ActivityDate > a1.ActivityDate
ORDER BY a2.ActivityDate) a2
WHERE a1.ActivityType = 'Start'
AND a1.LastActivity <> a1.ActivityType
ORDER BY a1.ActivityDate;
Otherwise, LAG can be simulated with row numbering and a self join.
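A sketch of that simulation for pre-2012 SQL Server, assuming the same Table1 columns: number the Start/Stop rows per ActivityID by date, then self-join each row to the previous one to recover the preceding ActivityType:
-- Sketch only: emulates LAG(ActivityType) with ROW_NUMBER and a self join (works on SQL Server 2005+)
WITH Numbered AS (
    SELECT ActivityID, ActivityType, ActivityDate, Status, Activity,
           ROW_NUMBER() OVER (PARTITION BY ActivityID ORDER BY ActivityDate) AS rn
    FROM Table1
    WHERE ActivityType IN ('Start', 'Stop')
)
SELECT cur.ActivityID, cur.ActivityType, cur.ActivityDate, cur.Status, cur.Activity,
       COALESCE(prev.ActivityType, 'Stop') AS LastActivity
FROM Numbered cur
LEFT JOIN Numbered prev
       ON prev.ActivityID = cur.ActivityID
      AND prev.rn = cur.rn - 1
The derived LastActivity column can then be used exactly like the LAG(...) expression in the CTE above.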