Tree / Group UNION SQL-Query for a Report - sql

I have a query that serves as the data source for a dynamic report.
The result so far is:
There are three queries connected with UNION. Line 1 shows all data accumulated for the company, line 2 all data for the location, and line 3 the detail data.
It is like a tree. My problem is that the accumulation (AnzahlMinuten) is not correct. Is there another way to display this data in a dynamic report? These three queries can be very time-consuming. I also use the RANK() function because I get multiple entries for the time a license is used.
If there is no easier solution, where is the mistake in my UNION queries that makes the accumulation incorrect?
SELECT Gesellschaftsname,Standortname,Lizenzname,Abteilungsname,Kostenstelle,
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT RuntimeMinute) AS AnzahlMinuten,
1 FROM (SELECT * FROM(SELECT DISTINCT Standortname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
Lizenzname,COUNT(DISTINCT username) AS AnzUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
starttime,
username,
pk_date,
Abteilungsname,
Gesellschaftsname,
Kostenstelle,
RANK() Over (PARTITION BY starttime ORDER BY pk_date DESC) As Rank
FROM BenutzerLizenz,Benutzer,Abteilung,Lizenz,Standort,Gesellschaft,Kostenstelle
WHERE BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic AND PK_ID_standort=FK_ID_standort AND PK_ID_Abteilung = FK_ID_Abteilung AND PK_ID_Gesellschaft = FK_ID_Gesellschaft AND PK_ID_Kostenstelle = FK_ID_Kostenstelle AND
DATEPART(month,PK_Date) IN ('06','07') AND
DATEPART(YEAR,PK_Date) = '2013' AND
Lizenzname IN ('DESIGNER','iman_nth') AND
Standortname IN ('Unterlüß','Neuenburg')
GROUP BY Standortname, Lizenzname, starttime, pk_date, username ,Abteilungsname, Kostenstelle, Gesellschaftsname) tmp
WHERE Rank = 1)tmp2 GROUP BY Standortname,Lizenzname,Abteilungsname, Kostenstelle, Gesellschaftsname
UNION
SELECT Gesellschaftsname,'','','','',
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT RuntimeMinute) AS AnzahlMinuten,2
FROM (SELECT * FROM(SELECT DISTINCT Gesellschaftsname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
COUNT(DISTINCT username) AS AnzUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
starttime,
username,
pk_date,
RANK() Over (PARTITION BY starttime ORDER BY pk_date DESC) As Rank
FROM BenutzerLizenz,Benutzer,Lizenz,Standort,Gesellschaft
WHERE BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic AND PK_ID_Gesellschaft = FK_ID_Gesellschaft AND
DATEPART(month,PK_Date) IN ('06','07') AND
DATEPART(YEAR,PK_Date) = '2013' AND
Lizenzname IN ('DESIGNER','iman_nth') AND
Standortname IN ('Unterlüß','Neuenburg')
GROUP BY Gesellschaftsname,starttime, pk_date, username) tmp
WHERE Rank = 1)tmp2 GROUP BY Gesellschaftsname
UNION
SELECT '',Standortname,'','','',
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT RuntimeMinute) AS AnzahlMinuten,3
FROM (SELECT * FROM(SELECT DISTINCT Standortname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
COUNT(DISTINCT username) AS AnzUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
starttime,
username,
pk_date,
RANK() Over (PARTITION BY starttime ORDER BY pk_date DESC) As Rank
FROM BenutzerLizenz,Benutzer,Abteilung,Lizenz,Standort
WHERE BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic AND PK_ID_standort=FK_ID_standort AND PK_ID_Abteilung = FK_ID_Abteilung AND
DATEPART(month,PK_Date) IN ('06','07') AND
DATEPART(YEAR,PK_Date) = '2013' AND
Lizenzname IN ('DESIGNER','iman_nth') AND
Standortname IN ('Unterlüß','Neuenburg')
GROUP BY Standortname, starttime, pk_date, username) tmp
WHERE Rank = 1)tmp2 GROUP BY Standortname
ORDER BY 2

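I think the main issue is the use of DISTINCT. This is a logic problem rather than a coding problem. When you sum distinct values at several grouping levels, the subgroup totals can add up to more than the total of the top-level group, because a value that occurs in two subgroups is counted once in each subgroup but only once overall. For example: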
GroupId Value
1 1
1 2
1 3
2 2
2 4
2 5
Sum(distinct value) on group 1 = 6
sum(distinct value) on group 2 = 11
sum(distinct value) on both groups = 15
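The arithmetic above can be checked with a small Python sketch. It only mimics what SUM(DISTINCT ...) does (the helper name sum_distinct is made up for illustration) and uses the six example rows:

```python
# Rows from the example above: (GroupId, Value)
rows = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 4), (2, 5)]

def sum_distinct(values):
    """Mimic SQL SUM(DISTINCT ...): every distinct value is added once."""
    return sum(set(values))

group1 = sum_distinct(v for g, v in rows if g == 1)
group2 = sum_distinct(v for g, v in rows if g == 2)
both = sum_distinct(v for g, v in rows)
plain_total = sum(v for g, v in rows)  # plain SUM over all rows

print(group1, group2, both, plain_total)
```

The value 2 occurs in both groups, so the two subgroup sums (6 and 11) add up to 17, while the distinct sum over both groups is only 15. A plain SUM gives 17 at both levels.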
Also, in general, it sounds like you are asking for a neater way to solve this problem of multiple grouping levels in a single recordset. I did something like this at a previous job:
sql fiddle
The idea is that you build the list of possible groups first in a CTE as
Level1 Level2 Level3
A NULL NULL
A AA NULL
A AB NULL
A AA AAA
A AA AAB
A AB ABA
A AB ABB
then join that to your data on the three levels and group by Level1, Level2, Level3. It's a lot cleaner.
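A minimal sketch of that idea, run against an in-memory SQLite database from Python. The table usage_data, its columns and the minute values are made up for illustration. A NULL in the group list is treated as "all values at this level":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage_data (level1 TEXT, level2 TEXT, level3 TEXT, minutes INTEGER);
INSERT INTO usage_data VALUES
  ('A', 'AA', 'AAA', 10),
  ('A', 'AA', 'AAB', 20),
  ('A', 'AB', 'ABA', 30),
  ('A', 'AB', 'ABB', 40);
""")

query = """
WITH grp AS (
    -- one row per possible group at each of the three levels
    SELECT DISTINCT level1, NULL AS level2, NULL AS level3 FROM usage_data
    UNION ALL
    SELECT DISTINCT level1, level2, NULL FROM usage_data
    UNION ALL
    SELECT DISTINCT level1, level2, level3 FROM usage_data
)
SELECT g.level1, g.level2, g.level3, SUM(d.minutes) AS minutes
FROM grp g
JOIN usage_data d
  ON d.level1 = g.level1
 AND (g.level2 IS NULL OR d.level2 = g.level2)
 AND (g.level3 IS NULL OR d.level3 = g.level3)
GROUP BY g.level1, g.level2, g.level3
ORDER BY g.level1, g.level2, g.level3
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each detail row is read once per level and summed with a plain SUM, so the totals of the sub-groups always add up to the total of their parent. SQLite sorts NULL first, which puts every parent row directly above its children.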

Related

Lag functions and SUM

I need to get the list of users that have been offline for at least 20 minutes every day. Here's my data
I have this starting query but am stuck on how to sum the offline_mins differences, i.e. I need to add something like "and sum(offline_mins) >= 20" to the WHERE clause.
SELECT
userid,
connected,
LAG(recordeddt) OVER(PARTITION BY userid
ORDER BY userid,
recordeddt) AS offline_period,
DATEDIFF(minute, LAG(recordeddt) OVER(PARTITION BY userid
ORDER BY userid,
recordeddt),recordeddt) offline_mins
FROM device_data where connected=0;
My expected results :
Thanks in advance.
This reads like a gaps-and-islands problem, where you want to group together adjacent rows having the same userid and status.
As a starter, here is a query that computes the islands:
select userid, connected, min(recordeddt) startdt, max(lead_recordeddt) enddt,
datediff(minute, min(recordeddt), max(lead_recordeddt)) duration
from (
select dd.*,
row_number() over(partition by userid order by recordeddt) rn1,
row_number() over(partition by userid, connected order by recordeddt) rn2,
lead(recordeddt) over(partition by userid order by recordeddt) lead_recordeddt
from device_data dd
) dd
group by userid, connected, rn1 - rn2
Now, say you want users that were offline for at least 20 minutes every day. You can break down the islands per day, and use a HAVING clause for filtering:
select userid
from (
select recordedday, userid, connected,
datediff(minute, min(recordeddt), max(lead_recordeddt)) duration
from (
select dd.*, v.*,
row_number() over(partition by v.recordedday, userid order by recordeddt) rn1,
row_number() over(partition by v.recordedday, userid, connected order by recordeddt) rn2,
lead(recordeddt) over(partition by v.recordedday, userid order by recordeddt) lead_recordeddt
from device_data dd
cross apply (values (convert(date, recordeddt))) v(recordedday)
) dd
group by recordedday, userid, connected, rn1 - rn2
) dd
group by userid
having count(distinct case when connected = 0 and duration >= 20 then recordedday end) = count(distinct recordedday)
As noted, this is a gaps-and-islands problem. This is my take on it, using a simple LAG function to create groups, filtering out the connected rows and then working on the date ranges.
CREATE TABLE #tmp(ID int, UserID int, dt datetime, connected int)
INSERT INTO #tmp VALUES
(1,1,'11/2/20 10:00:00',1),
(2,1,'11/2/20 10:05:00',0),
(3,1,'11/2/20 10:10:00',0),
(4,1,'11/2/20 10:15:00',0),
(5,1,'11/2/20 10:20:00',0),
(6,2,'11/2/20 10:00:00',1),
(7,2,'11/2/20 10:05:00',1),
(8,2,'11/2/20 10:10:00',0),
(9,2,'11/2/20 10:15:00',0),
(10,2,'11/2/20 10:20:00',0),
(11,2,'11/2/20 10:25:00',0),
(12,2,'11/2/20 10:30:00',0)
SELECT UserID, connected,DATEDIFF(minute,MIN(DT), MAX(DT)) OFFLINE_MINUTES
FROM
(
SELECT *, SUM(CASE WHEN connected <> LG THEN 1 ELSE 0 END) OVER (ORDER BY UserID,dt) grp
FROM
(
select *, LAG(connected,1,connected) OVER(PARTITION BY UserID ORDER BY UserID,dt) LG
from #tmp
) x
) y
WHERE connected <> 1
GROUP BY UserID,grp,connected
HAVING DATEDIFF(minute,MIN(DT), MAX(DT)) >= 20
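To see what result to expect, here is the same grouping as a rough Python sketch over the sample rows from the temp table above. The date is assumed to be 2 November 2020 (only the times matter here), and the function name offline_islands is made up:

```python
from datetime import datetime
from itertools import groupby

# Sample rows from the temp table above: (UserID, dt, connected)
rows = [
    (1, "2020-11-02 10:00", 1), (1, "2020-11-02 10:05", 0),
    (1, "2020-11-02 10:10", 0), (1, "2020-11-02 10:15", 0),
    (1, "2020-11-02 10:20", 0),
    (2, "2020-11-02 10:00", 1), (2, "2020-11-02 10:05", 1),
    (2, "2020-11-02 10:10", 0), (2, "2020-11-02 10:15", 0),
    (2, "2020-11-02 10:20", 0), (2, "2020-11-02 10:25", 0),
    (2, "2020-11-02 10:30", 0),
]

def offline_islands(rows, min_minutes=20):
    """Return (user, minutes) for every offline island of at least min_minutes."""
    parsed = sorted((u, datetime.strptime(t, "%Y-%m-%d %H:%M"), c) for u, t, c in rows)
    result = []
    # Adjacent rows with the same (user, connected) pair form one island.
    for (user, connected), island in groupby(parsed, key=lambda r: (r[0], r[2])):
        island = list(island)
        minutes = int((island[-1][1] - island[0][1]).total_seconds() // 60)
        if connected == 0 and minutes >= min_minutes:
            result.append((user, minutes))
    return result

print(offline_islands(rows))
```

User 1 is offline from 10:05 to 10:20, which is only 15 minutes, so only user 2 (10:10 to 10:30, 20 minutes) is returned.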

How to return all the rows in the yellow census blocks?

The schema is like this: the whole dataset should be ordered by machine_id first, then by ss2k. After that, for each machine, we should find all rows that belong to a run of at least 5 consecutive rows with flag = 'census'. In this dataset, the result should be all the yellow rows.
With the following query I cannot return the last 4 rows of each yellow block:
drop table if exists qz_panel_census_228_rank;
create table qz_panel_census_228_rank as
select t.*
from (select t.*,
count(*) filter (where flag = 'census') over (partition by machine_id, date order by ss2k rows between current row and 4 following) as census_cnt5,
count(*) filter (where flag = 'census') over (partition by machine_id, date) as count_census,
row_number() over (partition by machine_id, date order by ss2k) as seqnum,
count(*) over (partition by machine_id, date) as cnt
from qz_panel_census_228 t
) t
where census_cnt5 = 5
group by 1,2,3,4,5,6,7,8,9,10,11
DISTRIBUTED BY (machine_id);
You were close, but you need to search in both directions:
select t.*
from (select t.*,
case when count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between 4 preceding and current row) = 5
or count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between current row and 4 following) = 5
then 1
else 0
end as flag
from qz_panel_census_228 t
) t
where flag = 1
Edit:
This approach will not work unless you add an extra count for each possible 5 row window, e.g. 3 preceding and 1 following, 2 preceding and 2 following, etc. This results in ugly code and is not very flexible.
The common way to solve this gaps & islands problem is to assign consecutive rows to a common group first:
select *
from
(
select t2.*,
count(*) over (partition by machine_id, date, grp) as cnt
from
(
select t1.*
from (select t.*,
-- keep the same number for 'census' rows
sum(case when flag = 'census' then 0 else 1 end)
over (partition by machine_id, date
order by ss2k
rows unbounded preceding) as grp
from qz_panel_census_228 t
) t1
where flag = 'census' -- only census rows
) as t2
) t3
where cnt >= 5 -- only groups of at least 5 census rows
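The grouping step can also be sketched outside the database. The following Python snippet is only an illustration with made-up rows of the form (machine_id, ss2k, flag); it ignores the date column and keeps every run of at least 5 consecutive 'census' rows per machine:

```python
from itertools import groupby

def census_blocks(rows, min_run=5):
    """Keep the rows that belong to a run of at least min_run consecutive census rows."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # machine_id, then ss2k
    kept = []
    # Adjacent rows with the same machine and flag form one run.
    for (machine_id, flag), run in groupby(ordered, key=lambda r: (r[0], r[2])):
        run = list(run)
        if flag == "census" and len(run) >= min_run:
            kept.extend(run)
    return kept

rows = (
    [("m1", i, "census") for i in range(1, 7)]      # 6 in a row: kept
    + [("m1", 7, "other")]
    + [("m1", i, "census") for i in range(8, 11)]   # only 3 in a row: dropped
    + [("m2", i, "census") for i in range(1, 6)]    # 5 in a row on another machine: kept
)

result = census_blocks(rows)
print(len(result))
```

Unlike the fixed 5-row windows, this keeps the whole run, however long it is, so the last rows of a block are not lost.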
Wow, there has to be a better way of doing this, but the only way I could figure out was to create blocks of consecutive 'census' values. This looks awful but might be a catalyst to a better idea.
with q1 as (
select
machine_id, recorded, ss2k, flag, date,
case
when flag = 'census' and
lag (flag) over (order by machine_id, ss2k) != 'census'
then 1
else 0
end as block
from foo
),
q2 as (
select
machine_id, recorded, ss2k, flag, date,
sum (block) over (order by machine_id, ss2k) as group_id,
case when flag = 'census' then 1 else 0 end as census
from q1
),
q3 as (
select
machine_id, recorded, ss2k, flag, date, group_id,
sum (census) over (partition by group_id order by ss2k) as max_count
from q2
),
groups as (
select group_id
from q3
group by group_id
having max (max_count) >= 5
)
select
q2.machine_id, q2.recorded, q2.ss2k, q2.flag, q2.date
from
q2
join groups g on q2.group_id = g.group_id
where
q2.flag = 'census'
If you run each query within the with clauses in isolation, I think you will see how this evolves.

Group by in columns and rows, counts and percentages per day

I have a table that has data like following.
attr |time
----------------|--------------------------
abc |2018-08-06 10:17:25.282546
def |2018-08-06 10:17:25.325676
pqr |2018-08-05 10:17:25.366823
abc |2018-08-06 10:17:25.407941
def |2018-08-05 10:17:25.449249
I want to group them and count by attr column row wise and also create additional columns in to show their counts per day and percentages as shown below.
attr |day1_count| day1_%| day2_count| day2_%
----------------|----------|-------|-----------|-------
abc |2 |66.6% | 0 | 0.0%
def |1 |33.3% | 1 | 50.0%
pqr |0 |0.0% | 1 | 50.0%
I'm able to display one count by using GROUP BY, but I can't work out how to separate the counts into multiple columns. I tried to generate the day1 percentage with
SELECT attr, count(attr), count(attr) / sum(sub.day1_count) * 100 as percentage from (
SELECT attr, count(*) as day1_count FROM my_table WHERE DATEPART(week, time) = DATEPART(day, GETDate()) GROUP BY attr) as sub
GROUP BY attr;
But this does not give me the correct answer either: I get all zeroes for the percentage and 1 for the count. Any help is appreciated. I'm trying to do this in Redshift, which follows PostgreSQL syntax.
Let's nail the logic before presenting:
with CTE1 as
(
select attr, DATEPART(day, time) as theday, count(*) as thecount
from MyTable
group by attr, DATEPART(day, time)
)
, CTE2 as
(
select theday, sum(thecount) as daytotal
from CTE1
group by theday
)
select t1.attr, t1.theday, t1.thecount, t1.thecount/t2.daytotal as percentofday
from CTE1 t1
inner join CTE2 t2
on t1.theday = t2.theday
From here you can pivot to get one column per day if you need it.
This builds on @johnHC's query. If you need seven days, you have to add those days as further CASE WHEN expressions.
with CTE1 as
(
select attr, time::date as theday, count(*) as thecount
from t group by attr,time::date
)
, CTE2 as
(
select theday, sum(thecount) as daytotal
from CTE1
group by theday
)
,
CTE3 as
(
select t1.attr, EXTRACT(DOW FROM t1.theday) as day_nmbr,t1.theday, t1.thecount, t1.thecount/t2.daytotal as percentofday
from CTE1 t1
inner join CTE2 t2
on t1.theday = t2.theday
)
select CTE3.attr,
max(case when day_nmbr=0 then CTE3.thecount end) as day1Cnt,
max(case when day_nmbr=0 then percentofday end) as day1,
max(case when day_nmbr=1 then CTE3.thecount end) as day2Cnt,
max( case when day_nmbr=1 then percentofday end) day2
from CTE3 group by CTE3.attr
http://sqlfiddle.com/#!17/54ace/20
In case that you have only 2 days:
http://sqlfiddle.com/#!17/3bdad/3 (days descending as in your example from left to right)
http://sqlfiddle.com/#!17/3bdad/5 (days ascending)
The main idea is already mentioned in the other answers. Instead of joining the CTEs for calculating the values I am using window functions which is a bit shorter and more readable I think. The pivot is done the same way.
SELECT
attr,
COALESCE(max(count) FILTER (WHERE day_number = 0), 0) as day1_count, -- D
COALESCE(max(percent) FILTER (WHERE day_number = 0), 0) as day1_percent,
COALESCE(max(count) FILTER (WHERE day_number = 1), 0) as day2_count,
COALESCE(max(percent) FILTER (WHERE day_number = 1), 0) as day2_percent
/*
Add more days here
*/
FROM(
SELECT *, (count::float/count_per_day)::decimal(5, 2) as percent -- C
FROM (
SELECT DISTINCT
attr,
MAX(time::date) OVER () - time::date as day_number, -- B
count(*) OVER (partition by time::date, attr) as count, -- A
count(*) OVER (partition by time::date) as count_per_day
FROM test_table
)s
)s
GROUP BY attr
ORDER BY attr
A counting the rows per day and counting the rows per day AND attr
B for readability I convert the date into a number. I take the difference between the maximum date in the table and the date of the current row, so I get a counter from 0 (the most recent day) up to n - 1 (the oldest day)
C calculating the percentage and rounding
D pivot by filtering on the day numbers. The COALESCE turns NULL values into 0. To add more days, duplicate these columns with the next day number.
Edit: Made the day counter more flexible for more days; new SQL Fiddle
Basically, I see this as conditional aggregation. But you need to get an enumerator for the date for the pivoting. So:
SELECT attr,
COUNT(*) FILTER (WHERE day_number = 1) as day1_count,
COUNT(*) FILTER (WHERE day_number = 1) / cnt as day1_percent,
COUNT(*) FILTER (WHERE day_number = 2) as day2_count,
COUNT(*) FILTER (WHERE day_number = 2) / cnt as day2_percent
FROM (SELECT attr,
DENSE_RANK() OVER (ORDER BY time::date DESC) as day_number,
1.0 * COUNT(*) OVER (PARTITION BY attr) as cnt
FROM test_table
) s
GROUP BY attr, cnt
ORDER BY attr;
Here is a SQL Fiddle.
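To check any of these queries against the expected output in the question, the numbers can be recomputed by hand. This is a small Python sketch over the five sample rows, with day 1 taken as the most recent day and each percentage taken relative to the total row count of that day:

```python
from collections import Counter

# Sample rows from the question, reduced to (attr, day)
rows = [
    ("abc", "2018-08-06"), ("def", "2018-08-06"), ("pqr", "2018-08-05"),
    ("abc", "2018-08-06"), ("def", "2018-08-05"),
]

per_day_attr = Counter(rows)                  # rows per (attr, day)
per_day = Counter(day for _, day in rows)     # rows per day
days = sorted(per_day, reverse=True)          # most recent day first
attrs = sorted({attr for attr, _ in rows})

# attr -> [(day1_count, day1_percent), (day2_count, day2_percent)]
table = {
    attr: [
        (per_day_attr[(attr, day)],
         round(100.0 * per_day_attr[(attr, day)] / per_day[day], 1))
        for day in days
    ]
    for attr in attrs
}

for attr, cells in table.items():
    print(attr, cells)
```

A Counter returns 0 for a missing (attr, day) pair, which plays the role of the COALESCE in the SQL versions.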

How do I get the value associated with a MIN or MAX

I'm in the middle of creating a query and have most of the values I need. I am pulling a MIN and MAX date for each individual patient_id, and I'm wondering how I would also pull a value associated with that MIN or MAX date. I'm looking for the value of the provider_id column, which shows which doctor they saw on that MIN or MAX date. Here is what I have so far:
WITH test AS (
SELECT patient_id,
clinic,
SUM(amount) AS production,
MIN(tran_date) AS first_visit,
MAX(tran_date) AS last_visit
FROM transactions
WHERE impacts='P'
GROUP BY patient_id, clinic)
SELECT w.patient_id,
w.clinic,
p.city,
p.state,
p.zipcode,
p.sex,
w.production,
w.first_visit,
w.last_visit
FROM test w
LEFT JOIN patient p
ON (w.patient_id=p.patient_id AND w.clinic=p.clinic)
I believe that this will get what you're looking for:
;WITH CTE_Transactions AS (
SELECT DISTINCT
patient_id,
clinic,
SUM(amount) OVER (PARTITION BY patient_id, clinic) AS production,
FIRST_VALUE(tran_date) OVER (PARTITION BY patient_id, clinic ORDER BY tran_date) AS first_visit,
FIRST_VALUE(provider_id) OVER (PARTITION BY patient_id, clinic ORDER BY tran_date) AS first_provider_id,
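-- LAST_VALUE needs an explicit frame; the default frame ends at the current row
LAST_VALUE(tran_date) OVER (PARTITION BY patient_id, clinic ORDER BY tran_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_visit,
LAST_VALUE(provider_id) OVER (PARTITION BY patient_id, clinic ORDER BY tran_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_provider_id,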
ROW_NUMBER() OVER (PARTITION BY patient_id, clinic ORDER BY tran_date) AS row_num
FROM Transactions
WHERE impacts='P'
)
SELECT
w.patient_id,
w.clinic,
p.city,
p.state,
p.zipcode,
p.sex,
w.production,
w.first_visit,
w.first_provider_id,
w.last_visit,
w.last_provider_id
FROM
CTE_Transactions W
LEFT JOIN Patient P ON
W.patient_id = P.patient_id AND
W.clinic = P.clinic
INNER JOIN Provider FIRST_PROV ON
FIRST_PROV.provider_id = W.first_provider_id
INNER JOIN Provider LAST_PROV ON
LAST_PROV.provider_id = W.last_provider_id
WHERE
W.row_num = 1
I assume you are referring to the CTE. You can use conditional aggregation along with window functions. For instance, to get the amount for the first visit:
WITH test AS (
SELECT patient_id, clinic,
SUM(amount) AS production,
MIN(tran_date) AS first_visit,
MAX(tran_date) AS last_visit,
SUM(CASE WHEN tran_date = min_tran_date THEN amount END) as first_amount
FROM (SELECT t.*,
MIN(tran_date) OVER (PARTITION BY patient_id, clinic) as min_tran_date
FROM transactions t
WHERE impacts = 'P'
) t
GROUP BY patient_id, clinic
)
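The underlying idea in both answers is "take the whole row that has the smallest or largest date, not just the date". As a rough illustration outside SQL, here is a Python sketch with made-up transaction rows (the clinic names, providers and amounts are invented):

```python
from collections import defaultdict

transactions = [
    # (patient_id, clinic, tran_date, provider_id, amount), made-up rows
    (1, "north", "2020-01-05", "dr_a", 100),
    (1, "north", "2020-03-10", "dr_b", 50),
    (1, "north", "2020-02-01", "dr_c", 75),
    (2, "south", "2020-04-01", "dr_a", 20),
]

by_patient = defaultdict(list)
for row in transactions:
    by_patient[(row[0], row[1])].append(row)

summary = {}
for key, rows in by_patient.items():
    # ISO date strings sort chronologically, so min/max on the string works.
    first = min(rows, key=lambda r: r[2])
    last = max(rows, key=lambda r: r[2])
    summary[key] = {
        "production": sum(r[4] for r in rows),
        "first_visit": first[2], "first_provider_id": first[3],
        "last_visit": last[2], "last_provider_id": last[3],
    }

print(summary[(1, "north")])
```

If two transactions share the same first or last date, this picks one of them arbitrarily, just as FIRST_VALUE and LAST_VALUE do when the ORDER BY has ties.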

SQL Query SUM DATEDIFF MAX

I have a problem with a SQL query. I want to sum the runtime of a used application, but the date value is inserted into the database more than once. I only need the highest value of the pk_date column and no duplicate entries from the starttime column.
Here is the SQL-Query:
SELECT DISTINCT Standortname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
Lizenzname,
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
endtime,
pk_date
FROM BenutzerLizenz,Benutzer,Abteilung,Lizenz,Standort
WHERE
BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user
AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic
AND PK_ID_standort=FK_ID_standort
AND DATEPART(month,PK_Date) = '04'
AND DATEPART(YEAR,PK_Date) = '2013'
AND Lizenzname = 'iman_1st'
AND Standortname = 'Unterlüß'
GROUP BY
Standortname,
DATEPART(YEAR,PK_Date),
DATEPART(month,PK_Date),
Lizenzname,
starttime,
endtime,
pk_date
Here is the result:
... RuntimeMinute starttime pk_date
339 2013-04-11 11:05:00.0000000 2013-04-11 16:44:37.9650000
346 2013-04-11 11:05:00.0000000 2013-04-11 16:51:25.4800000
356 2013-04-11 11:05:00.0000000 2013-04-11 17:01:19.9670000
475 2013-04-11 10:06:00.0000000 2013-04-11 18:01:15.6620000
The first three above runtimes are from the same user and session, the last one is from another user and session. I only want to count and sum the last runtimes from the same starttime and the maximum date inserted (pk_date) -> 356 + 475 is the value that I would like to have.
In another similar query all values are accumulated (the columns starttime, endtime, pk_date are not included in it, so the query builds the sum of all runtime values for all users). I tried to use DISTINCT and MAX(pk_date) but it didn't work as expected. Do I have to use Sub-Queries?
I would use RANK() function for this.
SELECT * FROM
(
SELECT DISTINCT Standortname,
DATEPART(YEAR,PK_Date) AS Jahr,
DATEPART(month,PK_Date) AS Monat,
Lizenzname,
COUNT(DISTINCT username) AS AnzahlUser,
SUM(DISTINCT DATEDIFF(minute,starttime ,pk_date)) AS RuntimeMinute,
endtime,
pk_date,
RANK() Over (PARTITION BY username ORDER BY pk_date DESC) As Rank
FROM BenutzerLizenz,Benutzer,Abteilung,Lizenz,Standort
WHERE
BenutzerLizenz.PK_ID_user=Benutzer.PK_ID_user
AND BenutzerLizenz.PK_ID_lic=Lizenz.PK_ID_lic
AND PK_ID_standort=FK_ID_standort
AND DATEPART(month,PK_Date) = '04'
AND DATEPART(YEAR,PK_Date) = '2013'
AND Lizenzname = 'iman_1st'
AND Standortname = 'Unterlüß'
GROUP BY
Standortname,
DATEPART(YEAR,PK_Date),
DATEPART(month,PK_Date),
Lizenzname,
starttime,
endtime,
pk_date,
username
) tmp where Rank=1
The RANK() function ranks each row of a result set in the order defined by ORDER BY. Used with PARTITION BY, it ranks the rows within each partition separately.
Since you already have the data that you need, you will partition the result by username and rank the pk_date in order to get the highest one.
It sounds like you want a query that keeps only the max(pk_date) for each start time and user/session combination. Add that query to your FROM clause (let's say as adhoc), then put pk_date = adhoc.pk_date AND username = adhoc.username etc. in the WHERE clause.
Simplified example:
(SELECT username, startdate, max(pk_date) as pk_date
FROM <whatever>
GROUP BY username, startdate) (= <new>)
now, in your main query...
SELECT ... FROM ...,<new> adhoc
WHERE adhoc.username = username
AND adhoc.startdate = startdate
AND pk_date = adhoc.pk_date ...
Does this help?
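The intended result for the four sample rows in the question can be worked out with a short Python sketch: keep only the row with the latest pk_date per user and starttime, then sum the runtimes. The usernames are made up, since the question does not show them, and the timestamps are cut to whole seconds:

```python
rows = [
    # (username, starttime, pk_date, runtime_minutes); usernames are made up
    ("user_a", "2013-04-11 11:05:00", "2013-04-11 16:44:37", 339),
    ("user_a", "2013-04-11 11:05:00", "2013-04-11 16:51:25", 346),
    ("user_a", "2013-04-11 11:05:00", "2013-04-11 17:01:19", 356),
    ("user_b", "2013-04-11 10:06:00", "2013-04-11 18:01:15", 475),
]

latest = {}
for username, starttime, pk_date, minutes in rows:
    key = (username, starttime)
    # Keep the entry with the highest pk_date for each session.
    if key not in latest or pk_date > latest[key][0]:
        latest[key] = (pk_date, minutes)

total = sum(minutes for _, minutes in latest.values())
print(total)
```

This gives 356 + 475 = 831, the value the question asks for. Both the RANK() approach and the max(pk_date) subquery aim at the same filter.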