SQL Simple Group By causing duplicate sections - sql

I have a SQL query that is causing me trouble. The SQL that I have so far is the following:
SELECT Dated, [Level 1 reason], SUM(Duration) AS Mins
FROM Downtime
GROUP BY Dated, [Level 1 reason]
Now the problem I am having is that the results include multiple rows for the same reason, rather than being grouped together as I require. An example of the problem results is the following:
1/2/2013 10:02:00 AM Westport 2
1/2/2013 10:17:00 AM Westport 9
1/2/2013 10:48:00 AM Engineering 5
1/2/2013 11:01:00 AM Facilities 6
The intended result is that there be a single Westport group for a date. The query also needs to handle multiple dates, but those weren't included in the snippet for readability.
Thanks for any help. I know it's some simple reason, but I can't figure it out.
**EDIT: Sorry, I am performing this in Access.
Removing Dated from the GROUP BY results in an error in Access, and I am unsure what to make of it:
"You tried to execute a query that does not include the specified
expression 'Dated' as part of an aggregate function."**
D Stanley solved my question with the following query:
SELECT DateValue(Dated) AS Dated, [Level 1 reason], SUM(Duration) AS Mins
FROM Downtime
GROUP BY DateValue(Dated), [Level 1 reason]

In Access you can use the DateValue function to remove the time from a date column:
SELECT DateValue(Dated) AS Dated, [Level 1 reason], SUM(Duration) AS Mins
FROM Downtime
GROUP BY DateValue(Dated), [Level 1 reason]

It seems like you want to remove the time component. How to do that varies between database systems. In SQL Server it would be:
SELECT DATEADD(day,DATEDIFF(day,0,Dated),0), [Level 1 reason],
SUM(Duration) AS Mins
FROM Downtime
GROUP BY DATEADD(day,DATEDIFF(day,0,Dated),0), [Level 1 reason]
This works because 0 can be implicitly converted to a date (01/01/1900 at midnight), and DATEADD and DATEDIFF work in whole units of the given datepart (here, day). So this asks "how many complete days have occurred since 01/01/1900 at midnight?" and then "add that same number of days back onto 01/01/1900 at midnight", which gives us the same date as the value we started with, but always at midnight.
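To make that concrete, a quick check (assuming SQL Server; 41274 is the day count I'd expect for this particular date):
SELECT DATEDIFF(day, 0, '2013-01-02 10:17:00') -- 41274 complete days since 1900-01-01
SELECT DATEADD(day, 41274, 0)                  -- 2013-01-02 00:00:00.000, the same date at midnight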
For Access, I think you have to quote the datepart (day becomes "d"). I'm not sure if the implicit conversion of 0 still works, but you can just substitute any constant date for all four places I've used a 0 above, something like:
SELECT DATEADD("d",DATEDIFF("d","01/01/1900",Dated),"01/01/1900"),
[Level 1 reason],
SUM(Duration) AS Mins
FROM Downtime
GROUP BY DATEADD("d",DATEDIFF("d","01/01/1900",Dated),"01/01/1900"),
[Level 1 reason]

I guess in Access it would be:
SELECT CDate(Int(Dated)), [Level 1 reason],
SUM(Duration) AS Mins
FROM Downtime
GROUP BY CDate(Int(Dated)), [Level 1 reason];
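This works because Access stores Date/Time values as day serial numbers, so Int() drops the fractional part, which is the time. A quick look at the raw numbers shows what is being stripped (the column aliases here are just for illustration):
SELECT Dated, CDbl(Dated) AS DaySerial, CDate(Int(Dated)) AS AtMidnight
FROM Downtime;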

Related

Get time difference between Log records

I have a log table that tracks a bug's status. I would like to extract the amount of time spent when the log changes from OPEN (OldStatus) to FIXED or REQUEST CLOSE (NewStatus). Right now, my query looks at the max and min of the log, which does not produce the result I want. For example, bug #1 was fixed in 2 hours on 2020-01-01, then reopened (OldStatus) and got a REQUEST CLOSE (NewStatus) in 3 hours on 2020-12-12. I want the query result to return two rows, each with the date and the number of hours spent to fix the bug since its most recently opened time.
Here's the data:
CREATE TABLE Log (
BugID int,
CurrentTime timestamp,
Person varchar(20),
OldStatus varchar(20),
NewStatus varchar(20)
);
INSERT INTO Log (BugID, CurrentTime, Person, OldStatus, NewStatus)
VALUES (1, '2020-01-01 00:00:00', 'A', 'OPEN', 'In Progress'),
(1, '2020-01-01 00:00:01', 'A', 'In Progress', 'REVIEW In Progress'),
(1, '2020-01-01 02:00:00', 'A', 'In Progress', 'FIXED'),
(1, '2020-01-01 06:00:00', 'B', 'OPEN', 'In Progress'),
(1, '2020-01-01 00:00:00', 'B', 'In Progress', 'REQUEST CLOSE')
SELECT DATEDIFF(HOUR, start_time, finish_time) AS Time_Spent_Min
FROM (
SELECT BugID,
MAX(CurrentTime) as finish_time,
MIN(CurrentTime) as start_time
FROM Log
WHERE (OldStatus = 'OPEN' AND NewStatus = 'In Progress') OR NewStatus = 'FIXED'
GROUP BY BugID
) AS TEMP
This is a type of gaps-and-islands problem.
There are a number of solutions, here is one:
We need to assign a grouping ID to each island of OPEN -> In Progress. We can use a windowed conditional COUNT to get a grouping number for each start point.
To get a grouping for the end point, we assign the previous row's NewStatus using LAG, then do another conditional COUNT on that.
We then simply group by BugID and our calculated grouping, and return the start and end times.
WITH IslandStart AS (
SELECT *,
COUNT(CASE WHEN OldStatus = 'OPEN' AND NewStatus = 'In Progress' THEN 1 END)
OVER (PARTITION BY BugID ORDER BY CurrentTime ROWS UNBOUNDED PRECEDING) AS GroupStart,
LAG(NewStatus) OVER (PARTITION BY BugID ORDER BY CurrentTime) AS Prev_NewStatus
FROM Log l
),
IslandEnd AS (
SELECT *,
COUNT(CASE WHEN Prev_NewStatus IN ('CLAIM FIXED', 'REQUEST CLOSE') THEN 1 END)
OVER (PARTITION BY BugID ORDER BY CurrentTime ROWS UNBOUNDED PRECEDING) AS GroupEnd
FROM IslandStart l
)
SELECT
BugId,
MAX(CurrentTime) as finish_time,
MIN(CurrentTime) as start_time,
DATEDIFF(minute, MIN(CurrentTime), MAX(CurrentTime)) AS Time_Spent_Min
FROM IslandEnd l
WHERE GroupStart = GroupEnd + 1
GROUP BY
BugId,
GroupStart;
Notes:
- timestamp is not meant for actual dates and times; use datetime or datetime2 instead
- You may need to adjust the COUNT condition if OPEN -> In Progress is not always the first row of an island
You have a few competing factors here:
You should use a SmallDateTime, DateTime2 or DateTimeOffset typed column to store the actual time in the log. These types allow you to calculate the difference between values using DateDiff() and DateAdd() and other date/time based comparison logic, whereas Timestamp is designed to be used as a concurrency token: you can use it to determine whether one record is more recent than another, but you shouldn't try to use it to determine the actual time of the event.
What is difference between datetime and timestamp
DATETIMEOFFSET, DATE, TIME, SMALLDATETIME, DATETIME, SYSUTCDATETIME and GETUTCDATE
You have not explained the expected workflow; we can only assume that the flow is [OPEN] => [In Progress] => [CLAIM FIXED]. Your description also makes no mention of 'In Progress', which we assume is an interim state. What actually happens here is that this structure can really only tell you the time spent in the 'In Progress' state, which is probably OK for your needs as this is the time spent actually working, but it is important to recognise that we do not know when the bug was changed to 'OPEN' in the first place, unless that is also logged, and we would need to see the data to confirm that.
Your example dataset does not cover enough combinations for you to notice that the existing logic will fail as soon as you add more than one bug. What is more, you have asked to calculate the number of hours, but your example data only shows variation in minutes and has no example where the bug is completed at all.
Without a realistic set of data to test with, you will find it hard to debug your logic and hard to accept that it actually works before you execute it against a larger dataset. It can help to have a scripted scenario, much like your post here, but you should create the data to reflect that script.
You use 'FIXED' in your example, but 'CLAIM FIXED' in the query, so which one is it?
Step 1: Structure
Change the datatype of CurrentTime to a DateTime based column. Your application logic may drive requirements here. If your system is cloud based or international, then you may see benefits from using DateTimeOffset instead of having to convert into UTC; otherwise, if you do not need high precision timing in your logs, it is very common to use SmallDateTime for logging.
Many ORM and application frameworks will allow you to configure a DateTime based column as the concurrency token, if you need one at all. If you are not happy using a lower precision value for concurrency, then you could have the two columns side by side; to compare the time difference between two records, we need to use a DateTime based type.
In the case of logs, we rarely allow or expect them to be edited. If your logs are read-only, then having a concurrency token at all may not be necessary, especially if you only use the concurrency token to detect conflicting edits of individual records.
NOTE: You should consider using an enum or FK for the Status concept. Already in your example dataset there was a typo ('In Progerss'); using a numeric comparison for the status may provide some performance benefits, and it will help to prevent spelling mistakes, especially when FK or lookup lists are used from any application logic.
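A sketch of that structural change, assuming SQL Server and that the existing column really is a rowversion/timestamp (which cannot be converted in place):
-- rowversion cannot be ALTERed to a date type, so add a replacement column first
ALTER TABLE Log ADD CurrentTime2 datetime2(0) NOT NULL
    CONSTRAINT DF_Log_CurrentTime2 DEFAULT SYSUTCDATETIME();
-- backfill or migrate historical values as appropriate, then swap the columns:
-- ALTER TABLE Log DROP COLUMN CurrentTime;
-- EXEC sp_rename 'Log.CurrentTime2', 'CurrentTime', 'COLUMN';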
Step 2: Example Data
If the requirement is to calculate the number of hours spent between records, then we need to create some simple examples that show a difference of a few hours, and then add some examples where the same bug is opened, fixed and then re-opened.
bug #1 was fixed in 2 hours on 2020-01-01, then reopened and got fixed in 3 hours on 2020-12-12
The following table shows the known data states and the expected hours. We also throw in a few more data stories to validate that the end query handles obvious boundary conditions like multiple bugs and overlapping dates:
| BUG # | Time                | Previous State | New State   | Hrs In Progress |
|-------|---------------------|----------------|-------------|-----------------|
| 1     | 2020-01-01 08:00:00 | OPEN           | In Progress |                 |
| 1     | 2020-01-01 10:00:00 | In Progress    | FIXED       | (2 hrs)         |
| 1     | 2020-12-10 09:00:00 | FIXED          | OPEN        |                 |
| 1     | 2020-12-12 09:30:00 | OPEN           | In Progress |                 |
| 1     | 2020-12-12 12:30:00 | In Progress    | FIXED       | (3 hrs)         |
| 2     | 2020-03-17 11:15:00 | OPEN           | In Progress |                 |
| 2     | 2020-03-17 14:30:00 | In Progress    | FIXED       | (3.25 hrs)      |
| 3     | 2020-08-22 10:00:00 | OPEN           | In Progress |                 |
| 3     | 2020-08-22 16:30:00 | In Progress    | FIXED       | (6.5 hrs)       |
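A minimal script to load that scenario for testing, assuming CurrentTime has been converted to a datetime2 column as per Step 1 (the Person values are hypothetical, since the table above does not include them):
INSERT INTO Log (BugID, CurrentTime, Person, OldStatus, NewStatus)
VALUES (1, '2020-01-01 08:00:00', 'A', 'OPEN', 'In Progress'),
(1, '2020-01-01 10:00:00', 'A', 'In Progress', 'FIXED'),
(1, '2020-12-10 09:00:00', 'B', 'FIXED', 'OPEN'),
(1, '2020-12-12 09:30:00', 'B', 'OPEN', 'In Progress'),
(1, '2020-12-12 12:30:00', 'B', 'In Progress', 'FIXED'),
(2, '2020-03-17 11:15:00', 'A', 'OPEN', 'In Progress'),
(2, '2020-03-17 14:30:00', 'A', 'In Progress', 'FIXED'),
(3, '2020-08-22 10:00:00', 'C', 'OPEN', 'In Progress'),
(3, '2020-08-22 16:30:00', 'C', 'In Progress', 'FIXED');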
Step 3: Query
What is interesting to notice here is that 'In Progress' is actually the significant state to query against. What we actually want is to see all rows where the OldStatus is 'In Progress', and to link each of those rows to the most recent record before it with the same BugID and a NewStatus equal to 'In Progress'.
What is interesting in the above table is that not all the expected hours are whole numbers (integers), which makes using DateDiff a little bit tricky, because it only counts boundary changes, not the total elapsed time. To highlight this, look at the next two queries; the first one represents 59 minutes, the other only 2 minutes:
SELECT DateDiff(HOUR, '2020-01-01 08:00:00', '2020-01-01 08:59:00') -- 0 (59m)
SELECT DateDiff(HOUR, '2020-01-01 08:59:00', '2020-01-01 09:01:00') -- 1 (2m)
However, the SQL results show the first query returning 0 hours and the second returning 1 hour. That is because DateDiff only counts the hour boundaries crossed; it is not actually subtracting the time values at all.
To work around this, we can use MINUTE or MI as the date part argument and divide the result by 60.
SELECT CAST(ROUND(DateDiff(MI, '2020-01-01 08:00:00', '2020-01-01 08:59:00')/60.0,2) as Numeric(10,2)) -- 0.98
SELECT CAST(ROUND(DateDiff(MI, '2020-01-01 08:59:00', '2020-01-01 09:01:00')/60.0,2) as Numeric(10,2)) -- 0.03
You can choose to format this in other ways, for instance by calculating the modulo to get the minutes as whole numbers instead of a fraction, but that is out of scope for this post. Understanding the limitations of DateDiff is what is important to take this further.
There are a number of ways to correlate a previous record within the same table. If you need other values from that record, you might use a join with a sub-query to return the TOP 1 of all the records before the current one, or you could use window functions or a CROSS APPLY to perform a nested lookup. The following uses CROSS APPLY, which is NOT standard across all RDBMS, but I feel it keeps MS SQL queries really clean:
SELECT [Fixed].BugID, [start_time], [Fixed].[CurrentTime] as [finish_time]
, DATEDIFF(MI, [start_time], [Fixed].[CurrentTime]) / 60 AS Time_Spent_Hr
, DATEDIFF(MI, [start_time], [Fixed].[CurrentTime]) % 60 AS Time_Spent_Min
FROM Log as Fixed
CROSS APPLY (SELECT MAX(CurrentTime) AS start_time
FROM Log as Started
WHERE Fixed.BugID = Started.BugID
AND Started.NewStatus = 'In Progress'
AND CurrentTime < Fixed.CurrentTime) as Started
WHERE Fixed.OldStatus = 'In Progress'
You can play with this fiddle: http://sqlfiddle.com/#!18/c408d4/3
However, the results show this:

| BugID | start_time           | finish_time          | Time_Spent_Hr | Time_Spent_Min |
|-------|----------------------|----------------------|---------------|----------------|
| 1     | 2020-01-01T08:00:00Z | 2020-01-01T10:00:00Z | 2             | 0              |
| 1     | 2020-12-12T09:30:00Z | 2020-12-12T12:30:00Z | 3             | 0              |
| 2     | 2020-03-17T11:15:00Z | 2020-03-17T14:30:00Z | 3             | 15             |
| 3     | 2020-08-22T10:00:00Z | 2020-08-22T16:30:00Z | 6             | 30             |
If I assume that every "open" is followed by one "fixed" before the next open, then you can basically use lead() to solve this problem.
This version unpivots the data, so you could have "open" and "fixed" in the same row:
select l.*, datediff(hour, currenttime, fixed_time)
from (select v.*,
lead(v.currenttime) over (partition by v.bugid order by v.currenttime) as fixed_time
from log l cross apply
(values (bugid, currentTime, oldStatus),
(bugid, currentTime, newStatus)
) v(bugid, currentTime, status)
where v.status in ('OPEN', 'FIXED')
) l
where status = 'OPEN';
Here is a db<>fiddle, which uses data compatible with your explanation. (Your sample data is not correct.)

Data aggregation by sliding time periods

[Query and question edited and fixed thanks to comments from @Gordon Linoff and @shawnt00]
I recently inherited a SQL query that calculates the number of some events in time windows of 30 days from a log database. It uses a CTE (Common Table Expression) to generate the 30-day ranges from '2019-01-01' to now, and then it counts the cases in those 30/60/90 day intervals. I am not sure this is the best method. All I know is that it takes a long time to run and I do not understand 100% how it works. So I am trying to rebuild it in an efficient way (maybe the way it is now is the most efficient, I do not know).
I have several questions:
One of the things I notice is that instead of using DATEDIFF the query simply subtracts a number of days from the date. Is that a good practice at all?
Is there a better way of doing the time comparisons?
Is there a better way to do the whole thing? The bottom line is: I need to aggregate data by number of occurrences in time periods of 30, 60 and 90 days.
Note: LogDate original format is like 2019-04-01 18:30:12.000.
DECLARE @dt1 Datetime = '2019-01-01'
DECLARE @dt2 Datetime = getDate();
WITH ctedaterange
AS (SELECT [Dates] = @dt1
UNION ALL
SELECT [dates] + 30
FROM ctedaterange
WHERE [dates] + 30 <= @dt2)
SELECT
[dates],
lt.Activity, COUNT(*) as Total,
SUM(CASE WHEN lt.LogDate <= dates and lt.LogDate > dates - 90 THEN 1 ELSE 0 END) AS Activity90days,
SUM(CASE WHEN lt.LogDate <= dates and lt.LogDate > dates - 60 THEN 1 ELSE 0 END) AS Activity60days,
SUM(CASE WHEN lt.LogDate <= dates and lt.LogDate > dates - 30 THEN 1 ELSE 0 END) AS Activity30days
FROM ctedaterange AS cte
JOIN (SELECT Activity, CONVERT(DATE, LogDate) as LogDate FROM LogTable) AS lt
ON cte.[dates] = lt.LogDate
group by [dates], lt.Activity
OPTION (maxrecursion 0)
Sample dataset (LogTable):
LogDate, Activity
2020-02-25 01:10:10.000, Activity01
2020-04-14 01:12:10.000, Activity02
2020-08-18 02:03:53.000, Activity02
2019-10-29 12:25:55.000, Activity01
2019-12-24 18:11:11.000, Activity03
2019-04-02 03:33:09.000, Activity01
Expected output (the output does not reflect the data shown above, since I would need too many lines in the sample set to show it in this post):
As I said above, the bottom line is: I need to aggregate data by number of occurrences in time periods of 30, 60 and 90 days.
Activity, Activity90days, Activity60days, Activity30days
Activity01, 3, 0, 1
Activity02, 1, 10, 2
Activity03, 5, 1, 3
Thank you for any suggestion.
SQL Server doesn't yet have the option to use a range of values as the window frame of an analytic function. Since you've generated all possible dates, though, and you've already got the counts by date, it's very easy to look back a specific number of (aggregated) rows to get the right totals. Here is my suggested expression for 90 days:
sum(count(LogDate)) over (
partition by Activity order by [dates]
rows between 89 preceding and current row
)
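For illustration, here is one way that expression might slot into the original query. This sketch is untested; it changes the CTE to step one day at a time so that rows and days line up, and cross joins the distinct activities so every (day, activity) pair is present for the window to count over:
DECLARE @dt1 Datetime = '2019-01-01'
DECLARE @dt2 Datetime = getDate();
WITH ctedaterange
AS (SELECT [dates] = @dt1
UNION ALL
SELECT DATEADD(DAY, 1, [dates])
FROM ctedaterange
WHERE DATEADD(DAY, 1, [dates]) <= @dt2)
SELECT d.[dates], a.Activity,
SUM(COUNT(lt.LogDate)) OVER (PARTITION BY a.Activity ORDER BY d.[dates]
ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) AS Activity90days,
SUM(COUNT(lt.LogDate)) OVER (PARTITION BY a.Activity ORDER BY d.[dates]
ROWS BETWEEN 59 PRECEDING AND CURRENT ROW) AS Activity60days,
SUM(COUNT(lt.LogDate)) OVER (PARTITION BY a.Activity ORDER BY d.[dates]
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS Activity30days
FROM ctedaterange AS d
CROSS JOIN (SELECT DISTINCT Activity FROM LogTable) AS a
LEFT JOIN LogTable AS lt
ON CONVERT(DATE, lt.LogDate) = d.[dates] AND lt.Activity = a.Activity
GROUP BY d.[dates], a.Activity
OPTION (maxrecursion 0)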

How can I express a SUM condition using SWITCH or nested IIF?

I have a query with SQL code which calculates the working days of each of these activities: Interstaff, Mission, Congés, using a defined VBA function.
When there are two activities on the same day, it counts 1 day for each activity. I want the Sum function to ignore this day and go to the next one.
I would like to modify my SQL code to add this condition:
"When calculating mission, if there is a Congés within the same period, ignore this day (and give priority to only count it in congés)"
I read that there is a SWITCH or a nested IIF condition, but I couldn't translate that into my actual code...
Please do consider that I am still a beginner, on his way to learning!
SELECT
Z.Planning_Consultants.ID_Consultant,
Sum(IIf([Activité]="(2) Interstaff",WorkingDaysInDateRange([maxBegin],[minEnd])*Planning_Consultants.Time_Allocated,0)) AS NonBillable,
Sum(IIf([Activité]="(1) Mission",WorkingDaysInDateRange([maxBegin],[minEnd])*Planning_Consultants.Time_Allocated,0)) AS Billable,
Sum(IIf([Activité]="(3) Congés/Arrêt",WorkingDaysInDateRange([maxBegin],[minEnd])*Planning_Consultants.Time_Allocated,0)) AS Absent,
For example, Mr A got a mission from 03/06/2019 to 07/06/2019, with a day off (congé) on 06/06/2019. I expect the output to be Mission 4 days and congé 1 day; in my case I get Mission 5 days and congé 1 day.
Here's an example of a Dataset :
Activity BegDate EndDate Time_Allocated
(1)Mission 01/01/2019 31/12/2019 100%
(3)congé 02/04/2019 05/04/2019 100%
For April for example, I would like to have 18 working days and 4 congé, instead of 22 working days and 4 congé
This is the sort of query that should work for you. I've built this in SQL Server rather than Access, so you might have to make some mods, but it will give you the idea. It will give you the actual days for the activity, plus the total leave days that overlap the activity. The rest I will leave up to you.
select t1.Activity, t1.Time_Allocated,
WorkingDaysInDateRange(t1.BegDate, t1.EndDate) as days,
sum(WorkingDaysInDateRange(
case when t2.BegDate<T1.BegDate then T1.BegDate else t2.BegDate end,
case when t2.EndDate>T1.EndDate then T1.EndDate else t2.EndDate end))
as LeaveDays
from times t1
left join times t2 on t2.BegDate<=t1.Enddate and t2.EndDate>=t1.BegDate
and t2.Activity='(3)congé'
and t1.Activity!='(3)congé'
group by t1.Activity, t1.Time_Allocated, t1.BegDate, t1.EndDate

Using iif to mimic CASE for days of week

I've hit a little snag with one of my queries. I'm throwing together a simple chart to plot a number of reports being submitted by day of week.
My query to start was:
SELECT Weekday(incidentdate) AS dayOfWeek
, Count(*) AS NumberOfIncidents
FROM Incident
GROUP BY Weekday(incidentdate);
This works fine and returns what I want, something like
1 200
2 323
3 32
4 322
5 272
6 282
7 190
The problem is, I want the number returned by the Weekday function to read as the corresponding day of the week, like case when 1 then 'sunday' and so forth. Since Access doesn't have the SQL Server equivalent that returns the weekday as a word, I have to work around it.
Problem is, it's not coming out the way I want. So I wrote it using IIf, since I can't use CASE. The problem is, since each IIf statement is treated as its own column (the way I'm writing it), my data comes out unusable, like this:
SELECT
iif(weekday(incidentdate) =1,'Sunday'),
iif(weekday(incidentdate) =2,'Monday')
'so forth
, Count(*) AS NumberOfIncidents
FROM tblIncident
GROUP BY Weekday(incidentdate);
Expr1000 Expr1001 count
Sunday 20
Monday 106
120
186
182
164
24
Of course, I want my weekdays to be in the same column as the original query. Halp pls
Use the WeekdayName() function.
SELECT
WeekdayName(Weekday(incidentdate)) AS dayOfWeek,
Count(*) AS NumberOfIncidents
FROM Incident
GROUP BY WeekdayName(Weekday(incidentdate));
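As an aside, WeekdayName() also accepts an abbreviate argument if the short 'Sun', 'Mon', etc. form is preferred:
SELECT
WeekdayName(Weekday(incidentdate), True) AS dayOfWeek,
Count(*) AS NumberOfIncidents
FROM Incident
GROUP BY WeekdayName(Weekday(incidentdate), True);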
As BWS suggested, Switch was what I wanted. Here's what I ended up writing:
SELECT
switch(
Weekday(incidentdate) = 1, 'Sunday',
Weekday(incidentdate) = 2,'Monday',
Weekday(incidentdate) = 3,'Tuesday',
Weekday(incidentdate) = 4,'Wednesday',
Weekday(incidentdate) = 5,'Thursday',
Weekday(incidentdate) = 6,'Friday',
Weekday(incidentdate) = 7,'Saturday'
) as DayOfWeek
, Count(*) AS NumberOfIncidents
FROM tblIncident
GROUP BY Weekday(incidentdate);
Posting this here so there's actual code for future readers.
Edit: WeekdayName(Weekday(yourdate)), as HansUp said, is probably a little easier :)
Check this previous post:
What is the equivalent of Select Case in Access SQL?
Why not just create a 7 row table with day number & day name then just join to it?
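A minimal sketch of that approach (tblWeekdays is a hypothetical lookup table holding DayNum 1 through 7 and the matching DayName, Sunday through Saturday):
SELECT w.DayName, Count(*) AS NumberOfIncidents
FROM tblIncident AS i, tblWeekdays AS w
WHERE Weekday(i.incidentdate) = w.DayNum
GROUP BY w.DayNum, w.DayName
ORDER BY w.DayNum;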

Looking for SQL count performance improvements.

I'm refactoring some older SQL, which is struggling after 4 years and 1.7m rows of data. Is there a way to improve the following MS SQL query:
SELECT ServiceGetDayRange_1.[Display Start Date],
SUM (CASE WHEN Calls.line_date BETWEEN [Start Date] AND [End Date] THEN 1 ELSE 0 END) AS PerDayCount
FROM dbo.ServiceGetDayRange(GETUTCDATE(), 30, @standardBias, @daylightBias, @DST_startMonth, @DST_endMonth, @DST_startWeek, @DST_endWeek, @DST_startHour, @DST_endHour, @DST_startDayNumber, @DST_endDayNumber) AS ServiceGetDayRange_1 CROSS JOIN
(select [line_date] from dbo.l_log where dbo.l_log.line_date > dateadd(day,-31,GETUTCDATE())) as Calls
GROUP BY ServiceGetDayRange_1.[Display Start Date], ServiceGetDayRange_1.[Display End Date]
ORDER BY [Display Start Date]
It counts log entries over the previous 30 days (the ServiceGetDayRange function returns a table detailing the ranges, TZ aligned) for plotting on a chart... useless information, but I'm not the client.
The execution plan states 99% of the exec time is used in counting the entries, as you would expect. There is very little overhead in working out the TZ offsets (remember, max 30 rows).
Stupidly, I thought 'ah, indexed view' but then realised I can't bind to a function.
Current exec time is 6.25 seconds. Any improvement on that earns +rep.
Thanks in advance.
Is it faster if you turn the CASE into a WHERE?
SELECT ServiceGetDayRange_1.[Display Start Date], COUNT(*) AS PerDayCount
FROM dbo.ServiceGetDayRange(GETUTCDATE(), 30, @standardBias, @daylightBias, @DST_startMonth, @DST_endMonth, @DST_startWeek, @DST_endWeek, @DST_startHour, @DST_endHour, @DST_startDayNumber, @DST_endDayNumber) AS ServiceGetDayRange_1 CROSS JOIN
(select [line_date] from dbo.l_log where dbo.l_log.line_date > dateadd(day,-31,GETUTCDATE())) as Calls
WHERE Calls.line_date BETWEEN [Start Date] AND [End Date]
GROUP BY ServiceGetDayRange_1.[Display Start Date], ServiceGetDayRange_1.[Display End Date]
ORDER BY [Display Start Date]
6.25 seconds for nearly 2m rows is pretty good... maybe try a count of valid rows (your 1/0 conditional should allow that) as opposed to a sum of values. I think that's more efficient in Oracle environments.
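For what it's worth, a sketch of that count-based variant against the same query shape; COUNT ignores NULLs, so dropping the ELSE branch means non-matching rows are simply not counted:
SELECT ServiceGetDayRange_1.[Display Start Date],
COUNT(CASE WHEN Calls.line_date BETWEEN [Start Date] AND [End Date] THEN 1 END) AS PerDayCount
FROM dbo.ServiceGetDayRange(GETUTCDATE(), 30, @standardBias, @daylightBias, @DST_startMonth, @DST_endMonth, @DST_startWeek, @DST_endWeek, @DST_startHour, @DST_endHour, @DST_startDayNumber, @DST_endDayNumber) AS ServiceGetDayRange_1 CROSS JOIN
(select [line_date] from dbo.l_log where dbo.l_log.line_date > dateadd(day,-31,GETUTCDATE())) as Calls
GROUP BY ServiceGetDayRange_1.[Display Start Date], ServiceGetDayRange_1.[Display End Date]
ORDER BY [Display Start Date]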