Group By Dynamic Ranges in SQL (cockroachdb/postgres) - sql

I have a query that looks like
select s.session_id, array_agg(sp.value::int8 order by sp.value::int8) as timestamps
from sessions s join session_properties sp on sp.session_id = s.session_id
where s.user_id = '6f129b1c-43a6-4871-86f6-1749bfe1a5af' and sp.key in ('SleepTime', 'WakeupTime') and value != 'None' and value::int8 > 0
group by s.session_id
The result would look like
f321c813-7927-47aa-88c3-b3250af34afa | {1588499070,1588504354}
f38a8841-c402-433d-939d-194eca993bb6 | {1588187599,1588212803}
2befefaf-3b31-46c9-8416-263fa7b9309d | {1589912247,1589935771}
3da64787-65cd-4305-b1ac-1393e2fb11a9 | {1589741569,1589768453}
537e69aa-c39d-484d-9108-2f2cd956d4ee | {1588100398,1588129026}
5a9470ff-f930-491f-a57d-8c089e535d53 | {1589140368,1589165092}
The first column is a unique id and the second column contains the from and to timestamps.
Now I have a third table which has some timeseries data
records
------------------------
timestamp | name | value
Is it possible to find avg(value) from records, grouped by session_id, over the from and to timestamps?
I could run a for loop in the application and do a union to get the desired result, but I was wondering if that is possible in Postgres or CockroachDB.

I wouldn't aggregate the two values but use two joins to find them. That way you can be sure which value belongs to which property.
Once you have that, you can join that result to your records table.
with ranges as (
select s.session_id, st.value::int8 as from_value, wt.value::int8 as to_value
from sessions s
join session_properties st on st.session_id = s.session_id and st.key = 'SleepTime'
join session_properties wt on wt.session_id = s.session_id and wt.key = 'WakeupTime'
where s.user_id = '6f129b1c-43a6-4871-86f6-1749bfe1a5af'
and st.value != 'None' and st.value::int8 > 0
and wt.value != 'None' and wt.value::int8 > 0
)
select ra.session_id, avg(rc.value)
from records rc
join ranges ra
on rc.timestamp >= ra.from_value -- assumes records.timestamp holds the same epoch-second values
and rc.timestamp < ra.to_value
group by ra.session_id;
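If you would rather keep the single aggregated pass from the question, the same idea works with min/max instead of two joins. A minimal sketch, assuming records.timestamp holds the same epoch-second int8 values and that SleepTime is always the smaller of the two properties:
with ranges as (
  select s.session_id,
         min(sp.value::int8) as from_ts, -- SleepTime (assumed to be the earlier value)
         max(sp.value::int8) as to_ts    -- WakeupTime
  from sessions s
  join session_properties sp on sp.session_id = s.session_id
  where s.user_id = '6f129b1c-43a6-4871-86f6-1749bfe1a5af'
    and sp.key in ('SleepTime', 'WakeupTime')
    and sp.value != 'None' and sp.value::int8 > 0
  group by s.session_id
)
select ra.session_id, avg(rc.value) as avg_value
from ranges ra
join records rc
  on rc.timestamp >= ra.from_ts
 and rc.timestamp < ra.to_ts
group by ra.session_id;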

Related

use distinct within case statement

I have a query that uses multiple left joins, and I am trying to get a SUM of values from one of the joined columns.
SELECT
SUM( case when session.usersessionrun =1 then 1 else 0 end) new_unique_session_user_count
FROM session
LEFT JOIN appuser ON appuser.appid = '6279df3bd2d3352aed591583'
AND appuser.userid = session.userid
LEFT JOIN userdevice ON userdevice.appid = '6279df3bd2d3352aed591583'
AND userdevice.userid = appuser.userid
WHERE session.appid = '6279df3bd2d3352aed591583'
AND (session.uploadedon BETWEEN '2022-04-18 08:31:26' AND '2022-05-18 08:31:26')
But this obviously gives redundant session.usersessionrun = 1 counts, since it's a joined result set.
Here the logic was to mark the user as new if the sessionrun for that record is 1.
I grouped by userid and usersessionrun and it shows that the records are repeated.
userid   sessionrun   count
628212   1            2
627a01   1            4
So what I was trying to do was something like
SUM(CASE distinct(session.userid) AND WHEN session.usersessionrun = 1 THEN 1 ELSE 0 END) new_unique_session_user_count
i.e. for every unique user count, session.usersessionrun = 1 should only be done once.
As you have discovered, JOIN operations can generate combinatorial explosions of data.
You need a subquery to count your sessions by userid. Then you can treat the subquery as a virtual table and JOIN it to the other tables to get the information you need in your result set.
The subquery (nothing in my answer is debugged):
SELECT COUNT(*) new_unique_session_user_count,
session.userid
FROM session
WHERE session.appid = '6279df3bd2d3352aed591583'
AND session.uploadedon BETWEEN '2022-04-18 08:31:26'
AND '2022-05-18 08:31:26'
AND session.usersessionrun = 1
GROUP BY userid
This subquery summarizes your session table and has one row per userid. The trick to avoiding JOIN-created combinatorial explosions is using subqueries that generate results with only one row per data item mentioned in a JOIN's ON-clause.
Then, you join it with the other tables like this
SELECT summary.new_unique_session_user_count
FROM (
SELECT COUNT(*) new_unique_session_user_count,
session.userid
FROM session
WHERE session.appid = '6279df3bd2d3352aed591583'
AND session.uploadedon BETWEEN '2022-04-18 08:31:26'
AND '2022-05-18 08:31:26'
AND session.usersessionrun = 1
GROUP BY userid
) summary
JOIN appuser ON appuser.appid = '6279df3bd2d3352aed591583'
AND appuser.userid = summary.userid
JOIN userdevice ON userdevice.appid = '6279df3bd2d3352aed591583'
AND userdevice.userid = appuser.userid
There may be better ways to structure this query, but it's hard to guess at them without more information about your table definitions and business rules.
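One such alternative, as a sketch only: keep the original joins and filters from the question, but have the CASE yield the userid and wrap it in COUNT(DISTINCT ...), so each qualifying user is counted at most once (the NULLs produced for non-matching rows are ignored by COUNT).
SELECT COUNT(DISTINCT CASE WHEN session.usersessionrun = 1
                           THEN session.userid
                      END) AS new_unique_session_user_count
FROM session
LEFT JOIN appuser ON appuser.appid = '6279df3bd2d3352aed591583'
    AND appuser.userid = session.userid
LEFT JOIN userdevice ON userdevice.appid = '6279df3bd2d3352aed591583'
    AND userdevice.userid = appuser.userid
WHERE session.appid = '6279df3bd2d3352aed591583'
    AND (session.uploadedon BETWEEN '2022-04-18 08:31:26' AND '2022-05-18 08:31:26')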

SQL statement merge two rows into one

In the results of my SQL statement (SQL Server 2016) I would like to combine two rows that have the same values in two columns ("study_id" and "study_start") into one row, keeping the row with the highest value in a third column ("Id"). If any column (e.g. "App_id" or "Date_arrival") in the row with the highest Id is NULL, it should take the value from the row with the lowest "Id".
I get the result below:
Id study_id study_start Code Expl Desc Startmonth App_id Date_arrival Efter_op Date_begin
167262 878899 954 4.1 udd.ord Afbrudt feb 86666 21-06-2012 N 17-08-2012
180537 878899 954 1 Afsluttet Afsluttet feb NULL NULL NULL NULL
And I would like to get this result:
Id study_id study_start Code Expl Desc Startmonth App_id Date_arrival Efter_op Date_begin
180537 878899 954 1 Afsluttet Afsluttet feb 86666 21-06-2012 N 17-08-2012
My statement looks like this:
SELECT dbo.PopulationStam_V.ELEV_ID AS id,
dbo.PopulationStam_V.PERS_ID AS study_id,
dbo.STUDIESTARTER.STUDST_ID AS study_start,
dbo.Optagelse_Studiestatus.AFGANGSARSAG AS Code,
dbo.Optagelse_Studiestatus.KORT_BETEGNELSE AS Expl,
ISNULL((CAST(dbo.Optagelse_Studiestatus.Studiestatus AS varchar(20))), 'Indskrevet') AS 'Desc',
dbo.STUDIESTARTER.OPTAG_START_MANED AS Startmonth,
dbo.ANSOGNINGER.ANSOG_ID as App_id,
dbo.ANSOGNINGER.ANKOMSTDATO AS Date_arrival,
dbo.ANSOGNINGER.EFTEROPTAG AS Efter_op,
dbo.ANSOGNINGER.STATUSDATO AS Date_begin
FROM dbo.INSTITUTIONER
INNER JOIN dbo.PopulationStam_V
ON dbo.INSTITUTIONER.INST_ID = dbo.PopulationStam_V.SEMI_ID
LEFT JOIN dbo.ANSOGNINGER
ON dbo.PopulationStam_V.ELEV_ID = dbo.ANSOGNINGER.ELEV_ID
INNER JOIN dbo.STUDIESTARTER
ON dbo.PopulationStam_V.STUDST_ID_OPRINDELIG = dbo.STUDIESTARTER.STUDST_ID
INNER JOIN dbo.UDD_NAVNE_T
ON dbo.PopulationStam_V.UDDA_ID = dbo.UDD_NAVNE_T.UDD_ID
INNER JOIN dbo.UDDANNELSER
ON dbo.UDD_NAVNE_T.UDD_ID = dbo.UDDANNELSER.UDDA_ID
LEFT OUTER JOIN dbo.PERSONER
ON dbo.PopulationStam_V.PERS_ID = dbo.PERSONER.PERS_ID
LEFT OUTER JOIN dbo.POSTNR
ON dbo.PERSONER.PONR_ID = dbo.POSTNR.PONR_ID
LEFT OUTER JOIN dbo.KønAlleElevID_V
ON dbo.PopulationStam_V.ELEV_ID = dbo.KønAlleElevID_V.ELEV_ID
LEFT OUTER JOIN dbo.Optagelse_Studiestatus
ON dbo.PopulationStam_V.AFAR_ID = dbo.Optagelse_Studiestatus.AFAR_ID
LEFT OUTER JOIN dbo.frafaldsmodel_adgangsgrundlag
ON dbo.frafaldsmodel_adgangsgrundlag.ELEV_ID = dbo.PopulationStam_V.ELEV_ID
LEFT OUTER JOIN dbo.Optagelse_prioriteterUFM
ON dbo.Optagelse_prioriteterUFM.cpr = dbo.PopulationStam_V.CPR_NR
AND dbo.Optagelse_prioriteterUFM.Aar = dbo.frafaldsmodel_adgangsgrundlag.optagelsesaar
LEFT OUTER JOIN dbo.frafaldsmodel_stoettetabel_uddannelser AS fsu
ON fsu.id_uddannelse = dbo.UDDANNELSER.UDDA_ID
AND fsu.id_inst = dbo.INSTITUTIONER.INST_ID
AND fsu.uddannelse_aar = dbo.frafaldsmodel_adgangsgrundlag.optagelsesaar
WHERE dbo.STUDIESTARTER.STUDIESTARTSDATO > '2012-03-01 00:00:00.000'
AND (dbo.Optagelse_Studiestatus.AFGANGSARSAG IS NULL
OR dbo.Optagelse_Studiestatus.AFGANGSARSAG NOT LIKE '2.7.4')
AND (dbo.PopulationStam_V.INDSKRIVNINGSFORM = '1100'
OR dbo.PopulationStam_V.INDSKRIVNINGSFORM = '1700')
GROUP BY dbo.PopulationStam_V.ELEV_ID,
dbo.PopulationStam_V.PERS_ID,
dbo.STUDIESTARTER.STUDST_ID,
dbo.Optagelse_Studiestatus.AFGANGSARSAG,
dbo.Optagelse_Studiestatus.KORT_BETEGNELSE,
dbo.STUDIESTARTER.OPTAG_START_MANED,
Studiestatus,
dbo.ANSOGNINGER.ANSOG_ID,
dbo.ANSOGNINGER.ANKOMSTDATO,
dbo.ANSOGNINGER.EFTEROPTAG,
dbo.ANSOGNINGER.STATUSDATO
I really hope somebody out there can help.
There are many ways; this will work:
WITH subSource AS (
/* Your query here */
)
SELECT
s1.id,
/* all other columns work like this:
COALESCE(S1.column,s2.column)
for example: */
coalesce(s1.appid,s2.appid) as appid
FROM subSource s1
INNER JOIN subSource s2
ON s1.study_id =s2.study_id
and s1.study_start = s2.study_start
AND s1.id > s2.id
/* I imagine some other clauses might be needed but maybe not */
The rest is copy and paste.
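Spelled out for the sample columns shown in the question (still only a sketch; subSource stands in for your full original query, and the alias names follow the question's result headers):
WITH subSource AS (
    /* the full original query from the question goes here */
)
SELECT
    s1.id,
    s1.study_id,
    s1.study_start,
    COALESCE(s1.Code, s2.Code)                 AS Code,
    COALESCE(s1.Expl, s2.Expl)                 AS Expl,
    COALESCE(s1.[Desc], s2.[Desc])             AS [Desc],
    COALESCE(s1.Startmonth, s2.Startmonth)     AS Startmonth,
    COALESCE(s1.App_id, s2.App_id)             AS App_id,
    COALESCE(s1.Date_arrival, s2.Date_arrival) AS Date_arrival,
    COALESCE(s1.Efter_op, s2.Efter_op)         AS Efter_op,
    COALESCE(s1.Date_begin, s2.Date_begin)     AS Date_begin
FROM subSource s1
INNER JOIN subSource s2
    ON  s1.study_id    = s2.study_id
    AND s1.study_start = s2.study_start
    AND s1.id > s2.id;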

Hive table with multiple partitions

I have a table (data_table) with multiple partition columns year/month/monthkey.
Directories look something like year=2017/month=08/monthkey=2017-08/files.parquet
Which of the below queries would be faster?
select count(*) from data_table where monthkey='2017-08'
or
select count(*) from data_table where monthkey='2017-08' and year = '2017' and month = '08'
I think the initial time Hadoop takes to find the required directories would be greater in the first case, but I want to confirm.
Finding the relevant partitions is a metastore operation, not a file system operation.
It is done by querying the metastore, not by scanning the directories.
The metastore query for the first use case will most likely be faster than the one for the second use case, but either way we are talking about fractions of a second.
Demo
create external table t100k(i int)
partitioned by (x int,y int,xy string)
;
explain dependency select count(*) from t100k where xy='100-1000';
The query that was issued against the metastore:
select "PARTITIONS"."PART_ID"
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where (("FILTER2"."PART_KEY_VAL" = '100-1000'))
explain dependency select count(*) from t100k where x=100 and y=1000 and xy='100-1000';
The query that was issued against the metastore:
select "PARTITIONS"."PART_ID"
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0
inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where ( ( (((case when "FILTER0"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER0"."PART_KEY_VAL" as decimal(21,0)) else null end) = 100)
and ((case when "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) else null end) = 1000))
and ("FILTER2"."PART_KEY_VAL" = '100-1000')) )
Since a comment would mess up the formatting, I am posting this here. Kindly accept Dudu's reply. Please execute the below on the metastore DB (MySQL in my case):
mysql> select part_id, location, tbl_id, part_name from PARTITIONS as P inner join SDS as S on P.SD_ID = S.SD_ID where P.TBL_ID = 472;
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
| part_id | location | tbl_id | part_name |
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
| 7 | hdfs://hostname:8020/tmp/multi_part/2011/01/2011-01 | 472 | year=2011/month=1/year_month=2011-01 |
| 9 | hdfs://hostname:8020/tmp/multi_part/2012/01/2012-01 | 472 | year=2012/month=1/year_month=2012-01 |
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
2 rows in set (0.00 sec)
The locations returned by both queries point to the same HDFS directory, so both will read the same data.
The only difference in speed comes from the metastore DB query, which is already explained in Dudu's answer.
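For completeness, the TBL_ID used above (472 in this example) can be looked up first; a sketch against the same MySQL-backed metastore, using the TBLS table already shown in the queries above:
mysql> select TBL_ID, DB_ID, TBL_NAME from TBLS where TBL_NAME = 'data_table';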

SELF-JOIN discarding true CROSS JOIN rows

I have the following query;
What I get is ticket information. I use a self-join to obtain the requester and the assignee in the same row:
SELECT z.id AS TICKET, z.name AS Subject, reqs.name AS Requester, techs.name AS Assignee,
e.name AS Entity,DATE_FORMAT(tt.date,'%y%-%m%-%d') AS DATE,
DATE_FORMAT(tt.date,'%T') AS HOUR,
CASE WHEN z.priority = 6 THEN 'Mayor' WHEN z.priority = 5 THEN 'Muy urgente' WHEN z.priority = 4 THEN 'Urgente' WHEN z.priority = 3 THEN 'Mediana' WHEN z.priority = 2 THEN 'Baja' WHEN z.priority = 1 THEN 'Muy baja' END AS Priority,
c.name AS Category, i.name AS Department
FROM glpi_tickets_users tureq
JOIN glpi_tickets_users tutech ON tureq.tickets_id = tutech.tickets_id
JOIN glpi_users AS reqs ON tureq.users_id = reqs.id
JOIN glpi_users AS techs ON tutech.users_id = techs.id
JOIN glpi_tickets z ON z.id = tureq.tickets_id
LEFT OUTER JOIN glpi_tickettasks tt ON z.id = tt.tickets_id
LEFT JOIN glpi_itilcategories i ON z.itilcategories_id = i.id
LEFT JOIN glpi_usercategories c ON c.id = reqs.usercategories_id
INNER JOIN glpi_entities e ON z.entities_id = e.id
WHERE (tureq.id < tutech.id AND tureq.type < tutech.type) OR
(tureq.id < tutech.id AND tureq.users_id = tutech.users_id) OR
(tureq.id = tutech.id AND tureq.users_id = tutech.users_id)
The problem is that I get something like that:
1 Report jdoe jdoe Development 16-06-07 11:56:17 Mediana Software Mkt
1 Report jdoe fwilson Development 16-06-07 11:56:17 Mediana Software MKt
1 Report fwilson fwilson Development 16-06-07 11:56:17 Mediana Software Mkt
2 Task11 gwilliams gwilliams Ops 16-06-08 12:00:00 ALTA Hardware Def
3 Task12 gwilliams gwilliams Ops 16-06-08 12:01:00 ALTA Hardware Def
I don't want the first and third rows because they are cross-join results. The second row is OK, because jdoe is a requester and fwilson is an assignee.
The problem is that sometimes the requester and the assignee are the same person, e.g. someone creates a ticket for a task they will do themselves. For example, the 4th and 5th rows are OK.
So, how should I differentiate those cases? I.e. I need to include:
tureq.id = tutech.id AND tureq.users_id = tutech.users_id
BUT NOT IF THERE ALREADY EXISTS
tureq.id = tutech.id AND tureq.users_id <> tutech.users_id
Update
The main problem is that a user can assign a ticket to himself:
SELECT * from glpi_tickets_users WHERE type = 2 GROUP BY tickets_id HAVING COUNT(users_id)<2 limit 3;
+----+------------+----------+------+------------------+-------------------+
| id | tickets_id | users_id | type | use_notification | alternative_email |
+----+------------+----------+------+------------------+-------------------+
| 1 | 2 | 12 | 2 | 1 | NULL |
| 3 | 6 | 13 | 2 | 1 | NULL |
| 7 | 8 | 14 | 2 | 1 | NULL |
+----+------------+----------+------+------------------+-------------------+
Update 2:
It was a human mistake. The problem was not really about self-assigned tickets. Rather, what I found is that some tickets had no requester, or had a requester but no assignee yet.
As every ticket always has the two types you are interested in, you can simply select the corresponding records to get the requester and the assignee per ticket.
select
t.id as ticket,
t.name as subject,
requester.name as requester,
assignee.name as assignee,
e.name as entity,
date_format(tt.date,'%y%-%m%-%d') as date,
date_format(tt.date,'%T') as hour,
case t.priority
when 6 then 'Mayor'
when 5 then 'Muy urgente'
when 4 then 'Urgente'
when 3 then 'Mediana'
when 2 then 'Baja'
when 1 then 'Muy baja'
end as priority,
uc.name as category,
ic.name as department
from glpi_tickets t
join glpi_entities e on e.id = t.entities_id
join
(
select tu.tickets_id, u.name, u.usercategories_id
from glpi_tickets_users tu
join glpi_users u on u.id = tu.users_id
where tu.type = 1
) requester on requester.tickets_id = t.id
join
(
select tu.tickets_id, u.name
from glpi_tickets_users tu
join glpi_users u on u.id = tu.users_id
where tu.type = 2
) assignee on assignee.tickets_id = t.id
left join glpi_itilcategories ic on ic.id = t.itilcategories_id
left join glpi_usercategories uc on uc.id = requester.usercategories_id
left outer join glpi_tickettasks tt on tt.tickets_id = t.id;
The only thing I wonder about: there can be several ticket tasks per ticket, so what do you want to do then? Have one line per ticket task in your results? That is what the query does. It just looks odd that your result rows don't contain any information on the tasks except for the dates, so you may get many lines with the same data that differ only in the date. So maybe you'd rather want the first or last date per ticket. To get the last date per ticket, you'd replace the last line of the query with:
left outer join
(
select tickets_id, max(date) as date
from glpi_tickettasks
group by tickets_id
) tt on tt.tickets_id = t.id
And you probably want to add an ORDER BY clause.
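For example (just a sketch appended to the query above):
order by t.id, tt.date;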
You need to add more qualifiers to your joins, for example:
JOIN glpi_tickets_users tutech ON tureq.tickets_id = tutech.tickets_id and tutech.type = 2
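For the same reason, the requester side can also be restricted by type, which removes the unwanted cross-join rows at the source. A sketch, assuming type 1 = requester and type 2 = assignee as in the answer above:
SELECT tureq.tickets_id,
       tureq.users_id AS requester_id,
       tutech.users_id AS assignee_id
FROM glpi_tickets_users tureq
JOIN glpi_tickets_users tutech
    ON tureq.tickets_id = tutech.tickets_id
   AND tutech.type = 2  -- assignee rows only
WHERE tureq.type = 1    -- requester rows only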

Convert MySQL query to MS SQL Server ... failing on aggregate requirements

GOAL:
I need to retrieve the most recent message date (max), the number of rows in its attachment, and the vendor's name.
Also, we need to limit the results to messages sent this year (after 2014-01-01 00:00:00.000) which have an attachment with 50k rows or more.
TRIED:
See this sqlFiddle.
SELECT
v.name
,a.attachmentRows
,MAX(e.createdDate) recentDate
FROM emailMessage e
INNER JOIN vendor v
ON (e.vendorID = v.vendorID)
INNER JOIN emailAttachment a
ON (e.emailMessageID = a.emailMessageID)
WHERE e.createdDate > '2014-01-01 00:00:00.000'
AND a.attachmentRows >= 50000
GROUP BY e.vendorID
EXPECTATIONS:
| NAME | ATTACHMENTROWS | RECENTDATE |
|-------------|----------------|---------------------------------|
| "Company C" | 123880 | February, 22 2014 10:00:00+0000 |
PROBLEM:
While my SQL skills are rather primitive, I'm fairly comfortable with the MySQL flavor so I started my fiddling there. That query worked as expected.
When switching over to SQL Server, though, I run into this error for each of the selected fields:
Column 'blahBlah' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I understand what the error is telling me, but with three tables involved, I'm at a loss as to how to remedy it. (And of course, simply grouping by all the selected fields would not yield the desired results.)
PLEA:
Please help!
Please try this Fiddle:
SELECT
v.name
,a.attachmentRows
,e.createdDate recentDate
FROM emailMessage e
INNER JOIN vendor v
ON (e.vendorID = v.vendorID)
INNER JOIN emailAttachment a
ON (e.emailMessageID = a.emailMessageID)
INNER JOIN (SELECT MAX(emailMessageID) emailMessageID, vendorID from emailMessage group by vendorID) as maxi
on maxi.emailMessageID = e.emailMessageID
WHERE e.createdDate > '2014-01-01 00:00:00.000'
AND a.attachmentRows >= 50000
This assumes the emailMessageID increments with the createdDate. Using the date is problematic if two emails arrive at the exact same time stamp.
SELECT
v.name
,a.attachmentRows
,MAX(e.createdDate) recentDate
FROM emailMessage e
INNER JOIN vendor v
ON (e.vendorID = v.vendorID)
INNER JOIN emailAttachment a
ON (e.emailMessageID = a.emailMessageID)
WHERE e.createdDate > '2014-01-01 00:00:00.000'
AND a.attachmentRows >= 50000
GROUP BY v.name ,a.attachmentRows
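Regarding the timestamp-tie caveat above: a window function can break ties deterministically without relying on the ID order. This is only a sketch (not taken from either answer), assuming SQL Server 2008 or later and the same table and column names as in the fiddle:
SELECT name, attachmentRows, recentDate
FROM (
    SELECT
        v.name,
        a.attachmentRows,
        e.createdDate AS recentDate,
        ROW_NUMBER() OVER (PARTITION BY e.vendorID
                           ORDER BY e.createdDate DESC, e.emailMessageID DESC) AS rn
    FROM emailMessage e
    INNER JOIN vendor v ON e.vendorID = v.vendorID
    INNER JOIN emailAttachment a ON e.emailMessageID = a.emailMessageID
    WHERE e.createdDate > '2014-01-01 00:00:00.000'
      AND a.attachmentRows >= 50000
) ranked
WHERE rn = 1;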