SQL Query SUM DATEDIFF MAX - sql

I have a problem with a SQL query. I want to compute the runtime of an application in use. But the date value is inserted into the database more than once per session. I only need the highest value of the pk_date column and no duplicate entries from the starttime column.
Here is the SQL-Query:
SELECT DISTINCT Standortname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
Lizenzname,
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
endtime,
pk_date
FROM BenutzerLizenz,Benutzer,Abteilung,Lizenz,Standort
WHERE
BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user
AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic
AND PK_ID_standort=FK_ID_standort
AND DATEPART(month,PK_Date) = '04'
AND DATEPART(YEAR,PK_Date) = '2013'
AND Lizenzname = 'iman_1st'
AND Standortname = 'Unterlüß'
GROUP BY
Standortname,
DATEPART(YEAR,PK_Date),
DATEPART(month,PK_Date),
Lizenzname,
starttime,
endtime,
pk_date
Here is the result:
... RuntimeMinute starttime pk_date
339 2013-04-11 11:05:00.0000000 2013-04-11 16:44:37.9650000
346 2013-04-11 11:05:00.0000000 2013-04-11 16:51:25.4800000
356 2013-04-11 11:05:00.0000000 2013-04-11 17:01:19.9670000
475 2013-04-11 10:06:00.0000000 2013-04-11 18:01:15.6620000
The first three runtimes above are from the same user and session; the last one is from another user and session. For each starttime I only want to count the row with the maximum inserted date (pk_date) and sum those runtimes -> 356 + 475 is the value that I would like to have.
In another similar query all values are accumulated (the columns starttime, endtime, pk_date are not included in it, so the query builds the sum of all runtime values for all users). I tried to use DISTINCT and MAX(pk_date) but it didn't work as expected. Do I have to use Sub-Queries?

I would use the RANK() function for this.
SELECT * FROM
(
SELECT DISTINCT Standortname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
Lizenzname,
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
endtime,
pk_date,
RANK() Over (PARTITION BY username ORDER BY pk_date DESC) As Rank
FROM BenutzerLizenz,Benutzer,Abteilung,Lizenz,Standort
WHERE
BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user
AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic
AND PK_ID_standort=FK_ID_standort
AND DATEPART(month,PK_Date) = '04'
AND DATEPART(YEAR,PK_Date) = '2013'
AND Lizenzname = 'iman_1st'
AND Standortname = 'Unterlüß'
GROUP BY
Standortname,
DATEPART(YEAR,PK_Date),
DATEPART(month,PK_Date),
Lizenzname,
starttime,
endtime,
pk_date,
username
) tmp where Rank=1
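The RANK() function ranks each row of a result set in the order defined by ORDER BY. Used with PARTITION BY, the ranking restarts within each partition.
Since you already have the data that you need, you partition the result by username and rank by pk_date in descending order to get the highest one. Note that partitioning by username alone keeps only one row per user; if a user can have several sessions in the period, partition by starttime as well.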

It sounds like you want a query that keeps only the max(pk_date) for each start time and user/session combination. Add that query to your FROM clause as a derived table (let's say aliased as adhoc). Then in the WHERE clause put pk_date = adhoc.pk_date AND username = adhoc.username, etc.
Simplified example:
(SELECT username, startdate, max(pk_date) as pk_date
FROM <whatever>
GROUP BY username, startdate) (= <new>)
now, in your main query...
SELECT ... FROM ...,<new> adhoc
WHERE adhoc.username = username
AND adhoc.startdate = startdate
AND pk_date = adhoc.pk_date ...
Does this help?
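A minimal sketch of the derived-table idea above, run against SQLite from Python rather than SQL Server. The four rows are the ones from the question's result (fractional seconds dropped); the table name snapshots and the usernames user_a and user_b are made up for illustration, and the runtime is stored directly instead of being computed with DATEDIFF.

```python
import sqlite3

# (username, starttime, pk_date, runtime in minutes); usernames are hypothetical
rows = [
    ("user_a", "2013-04-11 11:05:00", "2013-04-11 16:44:37", 339),
    ("user_a", "2013-04-11 11:05:00", "2013-04-11 16:51:25", 346),
    ("user_a", "2013-04-11 11:05:00", "2013-04-11 17:01:19", 356),
    ("user_b", "2013-04-11 10:06:00", "2013-04-11 18:01:15", 475),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snapshots ("
    "username TEXT, starttime TEXT, pk_date TEXT, runtime_minute INTEGER)"
)
conn.executemany("INSERT INTO snapshots VALUES (?, ?, ?, ?)", rows)

# The derived table keeps the latest pk_date per (username, starttime);
# joining back on all three columns drops the earlier snapshots.
query = """
SELECT SUM(s.runtime_minute)
FROM snapshots AS s
JOIN (
    SELECT username, starttime, MAX(pk_date) AS pk_date
    FROM snapshots
    GROUP BY username, starttime
) AS adhoc
  ON adhoc.username = s.username
 AND adhoc.starttime = s.starttime
 AND adhoc.pk_date = s.pk_date
"""
total_runtime = conn.execute(query).fetchone()[0]
print(total_runtime)  # 831, i.e. 356 + 475
```

The ISO-formatted timestamps compare correctly as text, which is why MAX(pk_date) works here without a date type.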

Related

Merge two consecutive rows into a column

Assume I have a table like this:
EmployeeCode EntryType TIME
ABC200413 IN 8:48AM
ABC200413 OUT 4:09PM
ABC200413 IN 4:45PM
ABC200413 OUT 6:09PM
ABC200413 IN 7:45PM
ABC200413 OUT 10:09PM
Now I want to convert my data into something like this:
EmployeeCode IN_TIME OUT_TIME
ABC200413 8:48AM 4:09PM
ABC200413 4:45PM 6:09PM
ABC200413 7:45PM 10:09PM
Is there any way I can achieve this with a SQL Server query?
Thanks in advance.
Provided mytable contains only valid pairs of IN/OUT events:
select EmployeeCode,
max(case EntryType when 'IN' then TIME end ) IN_TIME,
max(case EntryType when 'OUT' then TIME end ) OUT_TIME
from (
select EmployeeCode, EntryType, TIME,
row_number() over(partition by EmployeeCode, EntryType order by TIME) rn
from mytable
)t
group by EmployeeCode, rn
order by EmployeeCode, rn
Otherwise a kind of clean-up is required first.
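The pairing logic of the row_number() query above can be sketched in plain Python, under the same assumption that every IN has a matching OUT. The function name pair_events and the helper parse_time are hypothetical; the events are the rows from the question. Numbering the IN rows and the OUT rows separately by time and matching equal numbers is the same as sorting both lists and zipping them.

```python
from collections import defaultdict
from datetime import datetime

events = [
    ("ABC200413", "IN", "8:48AM"),
    ("ABC200413", "OUT", "4:09PM"),
    ("ABC200413", "IN", "4:45PM"),
    ("ABC200413", "OUT", "6:09PM"),
    ("ABC200413", "IN", "7:45PM"),
    ("ABC200413", "OUT", "10:09PM"),
]


def parse_time(text):
    # Times such as "8:48AM" do not sort correctly as strings.
    return datetime.strptime(text, "%I:%M%p")


def pair_events(events):
    by_key = defaultdict(list)
    for code, entry_type, time_text in events:
        by_key[(code, entry_type)].append(time_text)
    pairs = []
    for code in sorted({code for code, _, _ in events}):
        ins = sorted(by_key[(code, "IN")], key=parse_time)
        outs = sorted(by_key[(code, "OUT")], key=parse_time)
        # zip plays the role of "group by EmployeeCode, rn"
        pairs.extend((code, t_in, t_out) for t_in, t_out in zip(ins, outs))
    return pairs


for row in pair_events(events):
    print(row)
```

As in the SQL version, an unmatched IN or OUT would silently shift or drop pairs, so invalid data has to be cleaned up first.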
One solution that may work uses a correlated subquery, since you may have multiple IN and OUT records.
Select A.EmployeeCode,
Min(TIME) IN_TIME,
(Select Max(A2.TIME) From Attendance A2 Where A2.EmployeeCode = A.EmployeeCode And A2.EntryType = 'OUT') OUT_TIME
From Attendance A
Where A.EntryType = 'IN'
Group By A.EmployeeCode
So the main query gets the minimum IN time for each employee and the correlated subquery gets the maximum OUT time. This returns a single row per employee and assumes that each employee has at least one IN record.
Assuming there are always pairs of IN/OUT, you can use LEAD to get the next value
select
EmployeeCode,
TIME as IN_TIME,
nextTime AS OUT_TIME
from (
select *,
lead(case EntryType when 'OUT' then TIME end) over (partition by EmployeeCode order by TIME) nextTime
from mytable
) t
where EntryType = 'IN';

How to run SUM() OVER PARTITION BY for COUNT DISTINCT

I'm trying to get the number of distinct users for each event at a daily level while maintaining a running sum for every hour.
I'm using Athena/Presto as the query engine.
I tried the following query:
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
count,
SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
COUNT(DISTINCT moengageuserid) as count
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
GROUP BY 1,2
ORDER BY 1,2
);
But on seeing the results I realized that taking SUM of COUNT DISTINCT is not correct as it's not additive.
So, I tried the below query
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
moengageuserid
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
);
But this query fails with the following error:
SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause
Count the first time a user appears for the running distinct count:
SELECT eventname, date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
COUNT(DISTINCT moengageuserid) as hour_count,
-- inner SUM: users first seen in this hour; outer SUM ... OVER: running total
SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
moengageuserid,
-- seqnum = 1 marks the first event of each user per eventname
ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) as seqnum
FROM clickstream.moengage
WHERE date = '2020-08-20' AND
eventname IN ('e1', 'e2', 'e3', 'e4')
) m
GROUP BY eventname, time_bucket
ORDER BY eventname, time_bucket;
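To see why counting first appearances gives a correct running distinct count, here is a small Python sketch of the same idea. The (hour, user) events and the function name running_distinct are made up for illustration; they stand for one eventname on one day.

```python
from itertools import accumulate

# Hypothetical (hour, user_id) events for a single eventname and date
events = [
    (0, "u1"), (0, "u2"),
    (1, "u1"), (1, "u3"),
    (2, "u2"), (2, "u3"),
    (3, "u4"),
]


def running_distinct(events):
    # Remember the first hour in which each user shows up (the seqnum = 1 row).
    first_seen = {}
    for hour, user in sorted(events):
        first_seen.setdefault(user, hour)
    hours = sorted({hour for hour, _ in events})
    # New users per hour, then a running total over the hours.
    new_users = [sum(1 for h in first_seen.values() if h == hour) for hour in hours]
    return dict(zip(hours, accumulate(new_users)))


print(running_distinct(events))  # {0: 2, 1: 3, 2: 3, 3: 4}
```

Summing the per-hour distinct counts instead (2, 2, 2, 1) would give 2, 4, 6, 7, because returning users are counted again in every hour they appear.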
To calculate a running distinct count you can also collect the user IDs into a set (distinct array) and take its size:
cardinality(set_agg(moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
This is an analytic function, so it returns one value per input row rather than one per group; to get one row per hour, aggregate the records in an outer query using max(), etc.

Lag functions and SUM

I need to get the list of users that have been offline for at least 20 min every day. Here's my data
I have this starting query but am stuck on how to sum offline_mins, i.e. I need to add something like "and sum(offline_mins)>=20" to the where clause:
SELECT
userid,
connected,
LAG(recordeddt) OVER(PARTITION BY userid
ORDER BY userid,
recordeddt) AS offline_period,
DATEDIFF(minute, LAG(recordeddt) OVER(PARTITION BY userid
ORDER BY userid,
recordeddt),recordeddt) offline_mins
FROM device_data where connected=0;
My expected results :
Thanks in advance.
This reads like a gaps-and-islands problem, where you want to group together adjacent rows having the same userid and status.
As a starter, here is a query that computes the islands:
select userid, connected, min(recordeddt) startdt, max(lead_recordeddt) enddt,
datediff(minute, min(recordeddt), max(lead_recordeddt)) duration
from (
select dd.*,
row_number() over(partition by userid order by recordeddt) rn1,
row_number() over(partition by userid, connected order by recordeddt) rn2,
lead(recordeddt) over(partition by userid order by recordeddt) lead_recordeddt
from device_data dd
) dd
group by userid, connected, rn1 - rn2
Now, say you want users that were offline for at least 20 minutes every day. You can break down the islands per day, and use a having clause for filtering:
select userid
from (
select recordedday, userid, connected,
datediff(minute, min(recordeddt), max(lead_recordeddt)) duration
from (
select dd.*, v.*,
row_number() over(partition by v.recordedday, userid order by recordeddt) rn1,
row_number() over(partition by v.recordedday, userid, connected order by recordeddt) rn2,
lead(recordeddt) over(partition by v.recordedday, userid order by recordeddt) lead_recordeddt
from device_data dd
cross apply (values (convert(date, recordeddt))) v(recordedday)
) dd
group by convert(date, recordeddt), userid, connected, rn1 - rn2
) dd
group by userid
having count(distinct case when connected = 0 and duration >= 20 then recordedday end) = count(distinct recordedday)
As noted, this is a gaps-and-islands problem. This is my take on it, using a simple lag function to create groups, filtering out the connected rows, and then working on the date ranges.
CREATE TABLE #tmp(ID int, UserID int, dt datetime, connected int)
INSERT INTO #tmp VALUES
(1,1,'11/2/20 10:00:00',1),
(2,1,'11/2/20 10:05:00',0),
(3,1,'11/2/20 10:10:00',0),
(4,1,'11/2/20 10:15:00',0),
(5,1,'11/2/20 10:20:00',0),
(6,2,'11/2/20 10:00:00',1),
(7,2,'11/2/20 10:05:00',1),
(8,2,'11/2/20 10:10:00',0),
(9,2,'11/2/20 10:15:00',0),
(10,2,'11/2/20 10:20:00',0),
(11,2,'11/2/20 10:25:00',0),
(12,2,'11/2/20 10:30:00',0)
SELECT UserID, connected,DATEDIFF(minute,MIN(DT), MAX(DT)) OFFLINE_MINUTES
FROM
(
SELECT *, SUM(CASE WHEN connected <> LG THEN 1 ELSE 0 END) OVER (ORDER BY UserID,dt) grp
FROM
(
select *, LAG(connected,1,connected) OVER(PARTITION BY UserID ORDER BY UserID,dt) LG
from #tmp
) x
) y
WHERE connected <> 1
GROUP BY UserID,grp,connected
HAVING DATEDIFF(minute,MIN(DT), MAX(DT)) >= 20
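The grouping step of the query above can be sketched in Python with itertools.groupby, which starts a new group whenever the (UserID, connected) pair changes between consecutive rows. The rows are the sample data from #tmp (without the ID column); the function name offline_islands and its min_minutes parameter are hypothetical.

```python
from datetime import datetime
from itertools import groupby

# (UserID, dt, connected), copied from the sample INSERT above
rows = [
    (1, "11/2/20 10:00:00", 1), (1, "11/2/20 10:05:00", 0),
    (1, "11/2/20 10:10:00", 0), (1, "11/2/20 10:15:00", 0),
    (1, "11/2/20 10:20:00", 0),
    (2, "11/2/20 10:00:00", 1), (2, "11/2/20 10:05:00", 1),
    (2, "11/2/20 10:10:00", 0), (2, "11/2/20 10:15:00", 0),
    (2, "11/2/20 10:20:00", 0), (2, "11/2/20 10:25:00", 0),
    (2, "11/2/20 10:30:00", 0),
]


def offline_islands(rows, min_minutes=20):
    parsed = sorted(
        (user, datetime.strptime(dt, "%m/%d/%y %H:%M:%S"), connected)
        for user, dt, connected in rows
    )
    result = []
    # Each run of consecutive rows with the same user and status is one island.
    for (user, connected), island in groupby(parsed, key=lambda r: (r[0], r[2])):
        if connected == 1:
            continue  # only offline islands are of interest
        times = [r[1] for r in island]
        minutes = int((times[-1] - times[0]).total_seconds() // 60)
        if minutes >= min_minutes:
            result.append((user, minutes))
    return result


print(offline_islands(rows))  # [(2, 20)]
```

Like the SQL version, this measures an island from its first to its last offline record, so user 1 (offline from 10:05 to 10:20) reaches only 15 minutes and is filtered out.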

How to generate session_id by sql?

My tracking system does not generate session IDs.
I have user_id & event_date_time.
I need a new session_id for each user session that starts 30 minutes or more after that user's last event_date_time.
My final goal is to calculate median session time.
I tried to generate session_id=1 and session_id=2 once next_event_time - event_date_time > 30 and guid=guid, but I'm stuck from here:
select a.*,
case when (a.next_event_date-a.event_date)*24*60<30 and userID=next_userID
then 1
when (a.next_event_date-a.event_date)*24*60>=30 and userID=next_userID then
2
end session_id
from
(select f.userID,
lead(f.userID) over (partition by f.guid order by f.event_date)
next_guid,
f.event_date,
lead(f.event_date) over (partition by f.guid order by f.event_date)
next_event_date
from event_table f
)a
where next_event_date is not null
If I understood correctly, you could generate IDs this way:
select id, guid, event_date,
sum(chg) over (partition by guid order by event_date) session_id
from (
select id, guid, event_date,
case when lag(guid) over (partition by guid order by event_date) = guid
and 24 * 60 * (event_date -lag(event_date)
over (partition by guid order by event_date) ) < 30
then 0 else 1
end chg
from event_table ) a
dbfiddle demo
Compare neighbouring rows: if the guids differ (or there is no previous row) or the time difference is 30 minutes or more, assign 1, otherwise 0. Then sum these values analytically; the running sum is the session number within each guid.
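The flag-and-running-sum idea described above can be sketched in Python. This is an illustration under assumptions, not the answer's code: the function name assign_sessions, the gap parameter, and the sample events are made up, and session numbers restart at 1 for every user, as sum(chg) over (partition by guid ...) does.

```python
from datetime import datetime, timedelta


def assign_sessions(events, gap=timedelta(minutes=30)):
    """events: iterable of (user_id, timestamp); returns (user_id, timestamp, session_id)."""
    out = []
    prev_user, prev_ts, session_id = None, None, 0
    for user, ts in sorted(events):
        if user != prev_user:
            session_id = 1       # first event of a user opens session 1
        elif ts - prev_ts >= gap:
            session_id += 1      # a gap of 30 minutes or more opens a new session
        out.append((user, ts, session_id))
        prev_user, prev_ts = user, ts
    return out


# Hypothetical sample events
events = [
    ("u1", datetime(2020, 1, 1, 9, 0)),
    ("u1", datetime(2020, 1, 1, 9, 10)),
    ("u1", datetime(2020, 1, 1, 9, 45)),
    ("u1", datetime(2020, 1, 1, 10, 0)),
    ("u2", datetime(2020, 1, 1, 9, 0)),
    ("u2", datetime(2020, 1, 1, 9, 30)),
]
session_ids = [sid for _, _, sid in assign_sessions(events)]
print(session_ids)  # [1, 1, 2, 2, 1, 2]
```

The pair (user_id, session_id) identifies a session, so session durations for the median can be taken as max minus min timestamp per pair.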
I think you're on the right track using lead or lag. My recommendation would be to break this into steps and create a temp table to work against:
With the first query, assign every record its own unique ID, either a sequence number or GUID. You could also capture some of the lagged data in this step.
With a second query, find the overlaps (< 30 minutes) and make the overlapping records all the same -- either the same as the earliest or latest in that grouping, doesn't matter as long as it's consistent.
Something like this:
create table events_temp as (
select f.*,
row_number() over (partition by f.userID order by f.event_date) as user_row,
lag(f.userID) over (partition by f.userID order by f.event_date) as prev_userID,
lag(f.event_date) over (partition by f.userID order by f.event_date) as prev_event_date
from event_table f
order by f.userId, f.event_date
)
select a.*,
case when prev_userID = userID
and 24 * 60 * (event_date - prev_event_date) < 30
then lag(user_row) over (partition by userID order by user_row)
else user_row
end as session_id
from events_temp a

Tree / Group UNION SQL-Query for a Report

I have a query for a dynamicreport as a datasource.
The result till now is:
There are 3 queries connected with UNION. Line 1 shows all data accumulated for the company, line 2 all data for the location, and line 3 the detail data.
It is like a tree. But my problem is that the accumulation (AnzahlMinuten) is not correct. Is there another way to display this data in a dynamicreport? These 3 queries can be very time-consuming. I also use the RANK() function because I get multiple entries for the time a license is in use.
If there is no easier solution, where is the mistake in the UNION-connected queries that makes the accumulation incorrect?
SELECT Gesellschaftsname,Standortname,Lizenzname,Abteilungsname,Kostenstelle,
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT RuntimeMinute) AS AnzahlMinuten,
1 FROM (SELECT * FROM(SELECT DISTINCT Standortname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
Lizenzname,COUNT(DISTINCT username) AS AnzUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
starttime,
username,
pk_date,
Abteilungsname,
Gesellschaftsname,
Kostenstelle,
RANK() Over (PARTITION BY starttime ORDER BY pk_date DESC) As Rank
FROM BenutzerLizenz,Benutzer,Abteilung,Lizenz,Standort,Gesellschaft,Kostenstelle
WHERE BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic AND PK_ID_standort=FK_ID_standort AND PK_ID_Abteilung = FK_ID_Abteilung AND PK_ID_Gesellschaft = FK_ID_Gesellschaft AND PK_ID_Kostenstelle = FK_ID_Kostenstelle AND
DATEPART(month,PK_Date) IN ('06','07') AND
DATEPART(YEAR,PK_Date) = '2013' AND
Lizenzname IN ('DESIGNER','iman_nth') AND
Standortname IN ('Unterlüß','Neuenburg')
GROUP BY Standortname, Lizenzname, starttime, pk_date, username ,Abteilungsname, Kostenstelle, Gesellschaftsname) tmp
WHERE Rank = 1)tmp2 GROUP BY Standortname,Lizenzname,Abteilungsname, Kostenstelle, Gesellschaftsname
UNION
SELECT Gesellschaftsname,'','','','',
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT RuntimeMinute) AS AnzahlMinuten,2
FROM (SELECT * FROM(SELECT DISTINCT Gesellschaftsname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
COUNT(DISTINCT username) AS AnzUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
starttime,
username,
pk_date,
RANK() Over (PARTITION BY starttime ORDER BY pk_date DESC) As Rank
FROM BenutzerLizenz,Benutzer,Lizenz,Standort,Gesellschaft
WHERE BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic AND PK_ID_Gesellschaft = FK_ID_Gesellschaft AND
DATEPART(month,PK_Date) IN ('06','07') AND
DATEPART(YEAR,PK_Date) = '2013' AND
Lizenzname IN ('DESIGNER','iman_nth') AND
Standortname IN ('Unterlüß','Neuenburg')
GROUP BY Gesellschaftsname,starttime, pk_date, username) tmp
WHERE Rank = 1)tmp2 GROUP BY Gesellschaftsname
UNION
SELECT '',Standortname,'','','',
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT RuntimeMinute) AS AnzahlMinuten,3
FROM (SELECT * FROM(SELECT DISTINCT Standortname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
COUNT(DISTINCT username) AS AnzUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
starttime,
username,
pk_date,
RANK() Over (PARTITION BY starttime ORDER BY pk_date DESC) As Rank
FROM BenutzerLizenz,Benutzer,Abteilung,Lizenz,Standort
WHERE BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic AND PK_ID_standort=FK_ID_standort AND PK_ID_Abteilung = FK_ID_Abteilung AND
DATEPART(month,PK_Date) IN ('06','07') AND
DATEPART(YEAR,PK_Date) = '2013' AND
Lizenzname IN ('DESIGNER','iman_nth') AND
Standortname IN ('Unterlüß','Neuenburg')
GROUP BY Standortname, starttime, pk_date, username) tmp
WHERE Rank = 1)tmp2 GROUP BY Standortname
ORDER BY 2
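I think the main issue is the use of "distinct". This is not a coding problem. When summing distinct values on multiple grouping levels, the totals of the sub-groups may add up to more than the total of the top group, because a value that occurs in several sub-groups is counted once in each of them but only once at the top level. For example: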
GroupId Value
1 1
1 2
1 3
2 2
2 4
2 5
sum(distinct value) on group 1 = 6
sum(distinct value) on group 2 = 11
sum(distinct value) on both groups = 15 (not 6 + 11 = 17, because the value 2 appears in both groups)
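The numbers above can be reproduced with a short Python script that runs the same aggregates in SQLite; the table name t is an assumption, the data is the example table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (GroupId INTEGER, Value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 2), (2, 4), (2, 5)],
)

# SUM(DISTINCT ...) per group versus over the whole table
per_group = dict(
    conn.execute("SELECT GroupId, SUM(DISTINCT Value) FROM t GROUP BY GroupId")
)
overall = conn.execute("SELECT SUM(DISTINCT Value) FROM t").fetchone()[0]

print(per_group)                # {1: 6, 2: 11}
print(sum(per_group.values()))  # 17
print(overall)                  # 15
```

In the report queries this means that two sessions with the same RuntimeMinute are counted once by SUM(DISTINCT RuntimeMinute), so the totals at the different levels cannot be expected to add up.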
Also, in general, it sounds like you are asking for a neater way to solve this problem of multiple grouping levels in a single recordset. I did something like this at a previous job:
sql fiddle
The idea is that you build the list of possible groups first in a CTE as
Level1 Level2 Level3
A NULL NULL
A AA NULL
A AB NULL
A AA AAA
A AA AAB
A AB ABA
A AB ABB
then join that to your data on the three levels and group by Level1, Level2, Level3. It's a lot cleaner.
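A minimal sketch of that idea, run in SQLite from Python. Everything here is an assumption for illustration: the table usage_data, its columns, and the minute values are made up, and a NULL in the levels list is read as "all values at this level". Plain SUM is used because each data row joins to each group at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_data (level1 TEXT, level2 TEXT, level3 TEXT, minutes INTEGER)"
)
conn.executemany(
    "INSERT INTO usage_data VALUES (?, ?, ?, ?)",
    [
        ("A", "AA", "AAA", 10),
        ("A", "AA", "AAB", 20),
        ("A", "AB", "ABA", 30),
        ("A", "AB", "ABB", 40),
    ],
)

query = """
WITH levels AS (
    -- one row per group at each depth of the tree
    SELECT level1, NULL AS level2, NULL AS level3 FROM usage_data
    UNION
    SELECT level1, level2, NULL FROM usage_data
    UNION
    SELECT level1, level2, level3 FROM usage_data
)
SELECT g.level1, g.level2, g.level3, SUM(d.minutes)
FROM levels AS g
JOIN usage_data AS d
  ON d.level1 = g.level1
 AND (g.level2 IS NULL OR d.level2 = g.level2)
 AND (g.level3 IS NULL OR d.level3 = g.level3)
GROUP BY g.level1, g.level2, g.level3
"""
totals = {(l1, l2, l3): minutes for l1, l2, l3, minutes in conn.execute(query)}

print(totals[("A", None, None)])   # 100
print(totals[("A", "AA", None)])   # 30
print(totals[("A", "AB", "ABB")])  # 40
```

One pass over the data replaces the three UNION-ed queries, and the sub-group totals add up to the parent totals by construction.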