I have a table of actions within a session and duration (milliseconds) between each step:
+--------+-----------+-----------------+-------------+--------------+
| userid | sessionid | action sequence | action      | milliseconds |
+--------+-----------+-----------------+-------------+--------------+
| 1      | 1         | 1               | event start | 0            |
| 1      | 1         | 2               | other       | 188114       |
| 1      | 1         | 3               | event end   | 248641       |
| 1      | 1         | 4               | other       | 398215       |
| 1      | 1         | 5               | event start | 488284       |
| 1      | 1         | 6               | other       | 528445       |
| 1      | 1         | 7               | other       | 572711       |
| 1      | 1         | 8               | event end   | 598123       |
| 1      | 2         | 1               | event start | 0            |
| 1      | 2         | 2               | event end   | 54363        |
| 2      | 1         | 1               | other       | 0            |
| 2      | 1         | 2               | other       | 2345         |
| 2      | 1         | 1               | other       | 75647        |
| 3      | 1         | 2               | other       | 0            |
| 3      | 1         | 3               | event start | 34678        |
| 3      | 1         | 4               | other       | 46784        |
| 3      | 1         | 5               | other       | 78905        |
| 4      | 1         | 1               | event start | 0            |
| 4      | 1         | 2               | other       | 7454         |
| 4      | 1         | 3               | other       | 11245        |
| 4      | 1         | 4               | event end   | 24567        |
| 4      | 1         | 5               | other       | 29562        |
| 4      | 1         | 6               | other       | 43015        |
+--------+-----------+-----------------+-------------+--------------+
I would like to capture complete events -- sessions containing both an event start and end (some may have a start but no end, an end but no start, or neither -- I don't want those), and their start and end times. Ultimately I want to track duration by transposing the sequential rows of times into columns so I can calculate a difference. The above data table would ideally be transposed into:
+--------+-----------+---------------+--------+--------+
| userid | sessionid | full event id | start | end |
+--------+-----------+---------------+--------+--------+
| 1 | 1 | 1 | 0 | 248641 |
| 1 | 1 | 2 | 488284 | 598123 |
| 1 | 2 | 1 | 0 | 54363 |
| 4 | 1 | 1 | 0 | 24567 |
+--------+-----------+---------------+--------+--------+
I attempted something like:
select a.userid, a.sessionid, a.milliseconds as start, b.milliseconds as end
from #table a
inner join #table b
on a.userid=b.userid
and a.sessionid=b.sessionid
and a.action='event start'
and b.action='event end'
However, that doesn't work, since some users may have multiple event starts and ends in one session (like userid 1), and the join pairs every start with every end. I am stuck on how best to transpose the time data for each event. Thanks for your help!
So, given your above data:
CREATE TABLE test_table (
`userid` int,
`sessionid` int,
`actionSequence` int,
`action` varchar(11),
`milliseconds` int
);
INSERT INTO test_table
(`userid`, `sessionid`, `actionSequence`, `action`, `milliseconds`)
VALUES
(1, 1, 1, 'event start', 0),
(1, 1, 2, 'other', 188114),
(1, 1, 3, 'event end', 248641),
(1, 1, 4, 'other', 398215),
(1, 1, 5, 'event start', 488284),
(1, 1, 6, 'other', 528445),
(1, 1, 7, 'other', 572711),
(1, 1, 8, 'event end', 598123),
(1, 2, 1, 'event start', 0),
(1, 2, 2, 'event end', 54363),
(2, 1, 1, 'other', 0),
(2, 1, 2, 'other', 2345),
(2, 1, 1, 'other', 75647),
(3, 1, 2, 'other', 0),
(3, 1, 3, 'event start', 34678),
(3, 1, 4, 'other', 46784),
(3, 1, 5, 'other', 78905),
(4, 1, 1, 'event start', 0),
(4, 1, 2, 'other', 7454),
(4, 1, 3, 'other', 11245),
(4, 1, 4, 'event end', 24567),
(4, 1, 5, 'other', 29562),
(4, 1, 6, 'other', 43015);
The following query should get you where you want to be (you were on the right track):
SELECT
tt1.userid,
tt1.sessionid,
tt1.actionSequence,
tt1.milliseconds AS startMS,
MIN(tt2.milliseconds) AS endMS,
MIN(tt2.milliseconds) - tt1.milliseconds AS totalMS
FROM test_table tt1
INNER JOIN test_table tt2
ON tt2.userid = tt1.userid
AND tt2.sessionid = tt1.sessionid
AND tt2.actionSequence > tt1.actionSequence
AND tt2.action = 'event end'
WHERE tt1.action = 'event start'
GROUP BY tt1.userid, tt1.sessionid, tt1.actionSequence, tt1.milliseconds;
Giving you this result set:
userid sessionid actionSequence startMS endMS totalMS
1 1 1 0 248641 248641
1 1 5 488284 598123 109839
1 2 1 0 54363 54363
4 1 1 0 24567 24567
The GROUP BY is important, because there are two rows with action = 'event end' and sequence > 1 for sessionid = 1 and userid = 1, so (I assume) we want the one closest to the current sequence, i.e. the MIN(milliseconds). As you can see, it also allows you to go ahead and take the difference of the two columns in this result set, saving you the extra step you were planning :]
Here is a SQLFiddle of this query in action on MySQL 5.6. You did not specify an RDBMS, but I believe the query uses only basic SQL and should work in most SQL engines.
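The desired output in the question also has a "full event id" column, which the query above does not produce. A minimal sketch of one way to add it, assuming an engine with window functions: wrap the pairing query in a subquery and number the completed events per session with ROW_NUMBER(). It is run here through Python's built-in sqlite3 module (SQLite 3.25 or later is assumed for window functions) on a subset of the sample rows; the alias full_event_id and the subquery name paired are illustrative, not from the original.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table (userid INT, sessionid INT, "
    "actionSequence INT, action TEXT, milliseconds INT)"
)
# Subset of the sample data: users 1, 3 and 4.
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "event start", 0), (1, 1, 2, "other", 188114),
        (1, 1, 3, "event end", 248641), (1, 1, 4, "other", 398215),
        (1, 1, 5, "event start", 488284), (1, 1, 6, "other", 528445),
        (1, 1, 7, "other", 572711), (1, 1, 8, "event end", 598123),
        (1, 2, 1, "event start", 0), (1, 2, 2, "event end", 54363),
        (3, 1, 2, "other", 0), (3, 1, 3, "event start", 34678),
        (3, 1, 4, "other", 46784), (3, 1, 5, "other", 78905),
        (4, 1, 1, "event start", 0), (4, 1, 2, "other", 7454),
        (4, 1, 4, "event end", 24567), (4, 1, 5, "other", 29562),
    ],
)

# Inner query: pair each start with the nearest later end (as in the answer).
# Outer query: number the completed events within each user/session.
PAIRED_EVENTS = """
SELECT userid, sessionid,
       ROW_NUMBER() OVER (
           PARTITION BY userid, sessionid ORDER BY actionSequence
       ) AS full_event_id,
       startMS, endMS
FROM (
    SELECT tt1.userid AS userid, tt1.sessionid AS sessionid,
           tt1.actionSequence AS actionSequence,
           tt1.milliseconds AS startMS,
           MIN(tt2.milliseconds) AS endMS
    FROM test_table tt1
    JOIN test_table tt2
      ON tt2.userid = tt1.userid
     AND tt2.sessionid = tt1.sessionid
     AND tt2.actionSequence > tt1.actionSequence
     AND tt2.action = 'event end'
    WHERE tt1.action = 'event start'
    GROUP BY tt1.userid, tt1.sessionid, tt1.actionSequence, tt1.milliseconds
) AS paired
ORDER BY userid, sessionid, full_event_id
"""

events = conn.execute(PAIRED_EVENTS).fetchall()
for row in events:
    print(row)
```

User 3 has a start but no end, so the inner join drops that session entirely, which matches the "complete events only" requirement.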
I'm trying to search for specific data in a database table (Oracle 12c). I want to search for specific texts within groups of rows. Each group has a specific ID, and I would like to get the ID of a group if all of the search terms can be found somewhere in it.
I prepared sample table but with some simplifications:
- In the real table there are more than 20 columns and millions of rows.
- I converted the real values to shorter ones like a or b; in the real table they are VARCHAR(500) columns.
- There can be thousands of rows in the same group (same ID).
- The search has to be fast, so manipulating too much of this data or using many nested queries might not be an option.
Sample Table:
+----+----+---+---+----+
| ID | A | B | C | D |
+----+----+---+---+----+
| 1 | aq | a | a | a |
| 1 | a | a | c | ad |
| 1 | a | a | a | a |
| 2 | a | a | a | a |
| 2 | a | a | a | a |
| 2 | a | a | a | a |
| 3 | a | a | a | a |
| 3 | a | a | a | a |
| 3 | a | d | a | a |
+----+----+---+---+----+
Sample Cases:
+------+-------------+-----------+
| Case | Searching | Expected |
+------+-------------+-----------+
| 1 | `q` and `c` | [1] |
| 2 | `a` and `d` | [1, 3] |
| 3 | `a` and `q` | [1] |
| 4 | `a` | [1, 2, 3] |
+------+-------------+-----------+
Case 1:
ID = 1 - matching q and c in two rows
Result = Row [1]
+----+----+---+---+----+
| ID | A | B | C | D |
+----+----+---+---+----+
| 1 | aq | a | a | a | <-- q
| 1 | a | a | c | ad | <-- c
| 1 | a | a | a | a |
| 2 | a | a | a | a |
| 2 | a | a | a | a |
| 2 | a | a | a | a |
| 3 | a | a | a | a |
| 3 | a | a | a | a |
| 3 | a | d | a | a |
+----+----+---+---+----+
Case 2:
ID = 2 - doesn't have d anywhere
Result: Rows [1, 3]
+----+----+---+---+----+
| ID | A | B | C | D |
+----+----+---+---+----+
| 1 | aq | a | a | a | <-- a
| 1 | a | a | c | ad | <-- a, d
| 1 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 3 | a | a | a | a | <-- a
| 3 | a | a | a | a | <-- a
| 3 | a | d | a | a | <-- a, d
+----+----+---+---+----+
Case 3:
ID = 1, matching a and q in a single row
Result: Row [1]
+----+----+---+---+----+
| ID | A | B | C | D |
+----+----+---+---+----+
| 1 | aq | a | a | a | <-- a, q
| 1 | a | a | c | ad | <-- a
| 1 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 3 | a | a | a | a | <-- a
| 3 | a | a | a | a | <-- a
| 3 | a | d | a | a | <-- a
+----+----+---+---+----+
Case 4:
We have a everywhere
Result: Rows [1, 2, 3]
+----+----+---+---+----+
| ID | A | B | C | D |
+----+----+---+---+----+
| 1 | aq | a | a | a | <-- a
| 1 | a | a | c | ad | <-- a
| 1 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 2 | a | a | a | a | <-- a
| 3 | a | a | a | a | <-- a
| 3 | a | a | a | a | <-- a
| 3 | a | d | a | a | <-- a
+----+----+---+---+----+
Any help appreciated :), thanks
You could use listagg to:
- Concatenate all the columns into one
- Group the rows for each id into one string
Which gives:
create table t (
id int, a varchar2(2), b varchar2(1), c varchar2(1), d varchar2(2)
);
insert into t values (1, 'aq', 'a', 'a', 'a');
insert into t values (1, 'a', 'a', 'c', 'ad');
insert into t values (1, 'a', 'a', 'a', 'a');
insert into t values (2, 'a', 'a', 'a', 'a');
insert into t values (2, 'a', 'a', 'a', 'a');
insert into t values (2, 'a', 'a', 'a', 'a');
insert into t values (3, 'a', 'a', 'a', 'a');
insert into t values (3, 'a', 'a', 'a', 'a');
insert into t values (3, 'a', 'd', 'a', 'a');
commit;
with vals as (
select t.id,
listagg ( a || b || c || d )
within group ( order by a ) str
from t
group by t.id
)
select * from vals
where str like '%q%'
and str like '%c%';
ID STR
1 aaaaaacadaqaaa
with vals as (
select t.id,
listagg ( a || b || c || d )
within group ( order by a ) str
from t
group by t.id
)
select * from vals
where str like '%a%'
and str like '%d%';
ID STR
1 aaaaaacadaqaaa
3 aaaaaaaaadaa
Fair warning: This is likely to be slow!
You may be able to mitigate this by placing the listagg query in a materialized view.
Also, with 20+ columns, some up to 500 characters long, you are likely to exceed the character limit for listagg (4,000 bytes by default), unless you've enabled extended data types to allow varchar2 values up to 32,767 bytes in SQL.
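One way to sidestep the string-length limit is to skip the string aggregation altogether and use conditional aggregation: for each search term, check whether any row in the group matches, and keep the group only if every term matched somewhere. This is a different technique from the listagg approach above, offered as a sketch only. It is run through Python's sqlite3 module for convenience; the helper find_ids and the '|' separator between columns are illustrative choices, and the same GROUP BY ... HAVING MAX(CASE ...) pattern should carry over to Oracle. Performance on millions of rows is not addressed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INT, a TEXT, b TEXT, c TEXT, d TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, "aq", "a", "a", "a"), (1, "a", "a", "c", "ad"), (1, "a", "a", "a", "a"),
        (2, "a", "a", "a", "a"), (2, "a", "a", "a", "a"), (2, "a", "a", "a", "a"),
        (3, "a", "a", "a", "a"), (3, "a", "a", "a", "a"), (3, "a", "d", "a", "a"),
    ],
)


def find_ids(terms):
    """Return the ids of groups in which every term occurs in at least one row."""
    # One HAVING condition per term: 1 if any row of the group contains it.
    # The '|' separator keeps a term from matching across two adjacent columns.
    having = " AND ".join(
        "MAX(CASE WHEN (a || '|' || b || '|' || c || '|' || d) LIKE ? "
        "THEN 1 ELSE 0 END) = 1"
        for _ in terms
    )
    sql = f"SELECT id FROM t GROUP BY id HAVING {having} ORDER BY id"
    params = [f"%{term}%" for term in terms]
    return [row[0] for row in conn.execute(sql, params)]


for case in (["q", "c"], ["a", "d"], ["a", "q"], ["a"]):
    print(case, find_ids(case))
```

The four calls correspond to the four sample cases in the question. Note that the sketch assumes no column is NULL, since concatenating a NULL would make the whole row's string NULL in SQLite.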
You can try the following code:
SELECT
ID
FROM
(
SELECT
ID,
RTRIM(XMLAGG(XMLELEMENT(E, A || B || C || D, ',').EXTRACT('//text()')).GETCLOBVAL(), ',')
AS CONSOLIDATED_VALUE
FROM
T
GROUP BY
ID
)
WHERE
CONSOLIDATED_VALUE LIKE '%q%'
AND CONSOLIDATED_VALUE LIKE '%c%'
Demo
Cheers!!
I'm fairly new to SQL and not sure how to pivot a table so that a categorical column is turned into binary columns.
Here is my current table:
+---------+------------------+--------------------+----------------+
| User ID | Cell Phone Brand | Purchased Platform | Recorded Usage |
+---------+------------------+--------------------+----------------+
| 1001 | Apple | Retail | 4 |
| 1001 | Samsung | Online | 4 |
| 1002 | Samsung | Retail | 5 |
| 1003 | Google | Online | 3 |
| 1003 | LG | Online | 3 |
| 1004 | LG | Online | 6 |
| 1005 | Apple | Online | 3 |
| 1006 | Google | Retail | 5 |
| 1007 | Google | Online | 3 |
| 1008 | Samsung | Retail | 4 |
| 1009 | LG | Retail | 4 |
| 1009 | Apple | Retail | 3 |
| 1010 | Apple | Retail | 6 |
+---------+------------------+--------------------+----------------+
I'd like to have the following result with aggregated Recorded Usage and binary data for devices:
+---------+--------------------+----------------+-------+---------+--------+----+
| User ID | Purchased Platform | Recorded Usage | Apple | Samsung | Google | LG |
+---------+--------------------+----------------+-------+---------+--------+----+
| 1001 | Retail | 4 | 1 | 0 | 0 | 0 |
| 1001 | Online | 4 | 0 | 1 | 0 | 0 |
| 1002 | Retail | 5 | 0 | 1 | 0 | 0 |
| 1003 | Online | 3 | 0 | 0 | 1 | 0 |
| 1003 | Online | 3 | 0 | 0 | 0 | 1 |
| 1004 | Online | 6 | 0 | 0 | 0 | 1 |
| 1005 | Online | 3 | 1 | 0 | 0 | 0 |
| 1006 | Retail | 5 | 0 | 0 | 1 | 0 |
| 1007 | Online | 3 | 0 | 0 | 1 | 0 |
| 1008 | Retail | 4 | 0 | 1 | 0 | 0 |
| 1009 | Retail | 4 | 0 | 0 | 0 | 1 |
| 1009 | Retail | 3 | 1 | 0 | 0 | 0 |
| 1010 | Retail | 6 | 1 | 0 | 0 | 0 |
+---------+--------------------+----------------+-------+---------+--------+----+
You can use case when statements:
-- A table variable is declared with @, not #
declare @tmp table (UserID int, CellPhoneBrand varchar(10), PurchasedPlatform varchar(10), RecordedUsage int)
insert into @tmp
values
(1001,'Apple' ,'Retail', 4)
,(1001,'Samsung','Online', 4)
,(1002,'Samsung','Retail', 5)
,(1003,'Google' ,'Online', 3)
,(1003,'LG' ,'Online', 3)
,(1004,'LG' ,'Online', 6)
,(1005,'Apple' ,'Online', 3)
,(1006,'Google' ,'Retail', 5)
,(1007,'Google' ,'Online', 3)
,(1008,'Samsung','Retail', 4)
,(1009,'LG' ,'Retail', 4)
,(1009,'Apple' ,'Retail', 3)
,(1010,'Apple' ,'Retail', 6)
-- One 0/1 indicator column per brand
select UserID, PurchasedPlatform, RecordedUsage
,case when CellPhoneBrand ='Apple' then 1 else 0 end as Apple
,case when CellPhoneBrand ='Samsung' then 1 else 0 end as Samsung
,case when CellPhoneBrand ='Google' then 1 else 0 end as Google
,case when CellPhoneBrand ='LG' then 1 else 0 end as LG
from @tmp
This gets you the result you're after in your expected results. Like I mentioned in my comment, I would more likely expect an aggregated pivot here:
WITH VTE AS(
SELECT *
FROM (VALUES(1001,'Apple ','Retail',4),
(1001,'Samsung','Online',4),
(1002,'Samsung','Retail',5),
(1003,'Google ','Online',3),
(1003,'LG ','Online',3),
(1004,'LG ','Online',6),
(1005,'Apple ','Online',3),
(1006,'Google ','Retail',5),
(1007,'Google ','Online',3),
(1008,'Samsung','Retail',4),
(1009,'LG ','Retail',4),
(1009,'Apple ','Retail',3),
(1010,'Apple ','Retail',6)) V(ID, Brand, Platform, Usage))
SELECT ID,
Platform,
Usage,
CASE WHEN Brand = 'Apple' THEN 1 ELSE 0 END AS Apple,
CASE WHEN Brand = 'Samsung' THEN 1 ELSE 0 END AS Samsung,
CASE WHEN Brand = 'Google' THEN 1 ELSE 0 END AS Google,
CASE WHEN Brand = 'LG' THEN 1 ELSE 0 END AS LG
FROM VTE;
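For comparison, an aggregated variant might look like the sketch below. It assumes that "aggregated" means one row per user, with usage summed and each brand flag set if the user has at least one row for that brand; that reading is an assumption, not something the question states. The table name purchases and the column TotalUsage are made up for the example, only a few of the sample rows are loaded, and the SQL is run through Python's sqlite3 module so it can be tried without a SQL Server instance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (UserID INT, CellPhoneBrand TEXT, "
    "PurchasedPlatform TEXT, RecordedUsage INT)"
)
# A few rows from the question's table.
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1001, "Apple", "Retail", 4), (1001, "Samsung", "Online", 4),
        (1002, "Samsung", "Retail", 5),
        (1003, "Google", "Online", 3), (1003, "LG", "Online", 3),
        (1009, "LG", "Retail", 4), (1009, "Apple", "Retail", 3),
    ],
)

# MAX over a 0/1 CASE expression yields 1 if any row in the group matches.
AGGREGATED = """
SELECT UserID,
       SUM(RecordedUsage) AS TotalUsage,
       MAX(CASE WHEN CellPhoneBrand = 'Apple'   THEN 1 ELSE 0 END) AS Apple,
       MAX(CASE WHEN CellPhoneBrand = 'Samsung' THEN 1 ELSE 0 END) AS Samsung,
       MAX(CASE WHEN CellPhoneBrand = 'Google'  THEN 1 ELSE 0 END) AS Google,
       MAX(CASE WHEN CellPhoneBrand = 'LG'      THEN 1 ELSE 0 END) AS LG
FROM purchases
GROUP BY UserID
ORDER BY UserID
"""

per_user = conn.execute(AGGREGATED).fetchall()
for row in per_user:
    print(row)
```

Swapping MAX for SUM would turn the flags into per-brand row counts instead of binary indicators.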
Since you used the word pivot in your description, here is a solution that shows how to pivot data in SQL Server using the PIVOT operator:
-- A table variable is declared with @, not #
declare @temp TABLE
(
[UserID] varchar(50),
[CellPhoneBrand] varchar(50),
[PurchasedPlatform] varchar(50),
[RecordedUsage] int
);
INSERT INTO @temp
(
[UserID],
[CellPhoneBrand],
[PurchasedPlatform],
[RecordedUsage]
)
VALUES
(1001,'Apple', 'Retail', 4),
(1001,'Samsung', 'Online', 4),
(1002,'Samsung', 'Retail', 5),
(1003,'Google', 'Online', 3),
(1003,'LG', 'Online', 3),
(1004,'LG', 'Online', 6),
(1005,'Apple', 'Online', 3),
(1006,'Google', 'Retail', 5),
(1007,'Google', 'Online', 3),
(1008,'Samsung', 'Retail', 4),
(1009,'LG', 'Retail', 4),
(1009,'Apple', 'Retail', 3),
(1010,'Apple', 'Retail', 6)
-- Rows are grouped by the columns not named in the PIVOT clause
-- (UserID, PurchasedPlatform, RecordedUsage); each brand becomes a count column
select *
from
(
select [UserID], [PurchasedPlatform], [RecordedUsage],[CellPhoneBrand]
from @temp
) src
pivot
(
count(CellPhoneBrand)
for [CellPhoneBrand] in ([Apple], [Samsung],[Google],[LG])
) piv;
I'm struggling with a hierarchical SQL query. I want to add two more columns holding the disp_order values of each row's children and siblings.
Children - Should hold the disp_order of all of the row's children, grandchildren, and so on.
Sibling - Should hold the disp_order of the rows having the same parent.
+------------+-----+-------------+--------+
| disp_order | lvl | description | parent |
+------------+-----+-------------+--------+
| 0 | 1 | A | |
| 1 | 2 | B | 0 |
| 2 | 3 | C | 1 |
| 3 | 4 | D | 2 |
| 4 | 5 | E | 3 |
| 5 | 2 | F | 0 |
| 6 | 3 | G | 5 |
| 7 | 3 | H | 5 |
| 8 | 3 | I | 5 |
| 9 | 4 | J | 8 |
| 10 | 5 | K | 9 |
+------------+-----+-------------+--------+
What the result should be:
+------------+-----+-------------+--------+------------------------+---------+
| disp_order | lvl | description | parent | children | sibling |
+------------+-----+-------------+--------+------------------------+---------+
| 0 | 1 | A | | 1,2,3,4,5,6,7,8,9,10 | |
| 1 | 2 | B | 0 | 2,3,4 | 5 |
| 2 | 3 | C | 1 | 3,4 | |
| 3 | 4 | D | 2 | 4 | |
| 4 | 5 | E | 3 | | |
| 5 | 2 | F | 0 | 6,7,8,9,10 | 1 |
| 6 | 3 | G | 5 | | 7,8 |
| 7 | 3 | H | 5 | | 6,8 |
| 8 | 3 | I | 5 | 9,10 | 6,7 |
| 9 | 4 | J | 8 | 10 | |
| 10 | 5 | K | 9 | | |
+------------+-----+-------------+--------+------------------------+---------+
Here is my current query:
SELECT t.*,
( SELECT MAX( disp_order )
FROM tbl_pattern p
WHERE p.lvl = t.lvl - 1
AND p.disp_order < t.disp_order ) AS parent
FROM tbl_pattern t
Continuing from your previous question:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE tbl_pattern ( order_no, code, disp_order, lvl, description ) AS
SELECT 'RM001-01', 1, 0, 1, 'HK140904-1A' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 1, 2, 'HK140904-1B' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 2, 3, 'HK140904-1B' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 3, 4, 'HK140904-1C' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 4, 5, 'HK140904-1D' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 5, 2, 'HK140904-1E' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 6, 3, 'HK140904-1E' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 7, 3, 'HK140904-1X' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 8, 4, 'HK140904-1E' FROM DUAL UNION ALL
SELECT 'RM001-01', 1, 9, 5, 'HK140904-1E' FROM DUAL;
Query 1:
WITH data ( order_no, code, disp_order, lvl, description, parent ) AS (
SELECT t.*,
( SELECT MAX( disp_order )
FROM tbl_pattern p
WHERE p.order_no = t.order_no
AND p.code = t.code
AND p.lvl = t.lvl - 1
AND p.disp_order < t.disp_order ) AS parent
FROM tbl_pattern t
)
SELECT d.*,
( SELECT LISTAGG( c.disp_order, ',' ) WITHIN GROUP ( ORDER BY c.disp_order )
FROM data c
START WITH c.parent = d.disp_order
AND c.order_no = d.order_no
AND c.code = d.code
CONNECT BY PRIOR c.disp_order = c.parent
AND PRIOR c.order_no = c.order_no
AND PRIOR c.code = c.code
) AS children,
( SELECT LISTAGG( c.disp_order, ',' ) WITHIN GROUP ( ORDER BY c.disp_order )
FROM data c
WHERE c.parent = d.parent
AND c.disp_order <> d.disp_order
AND c.order_no = d.order_no
AND c.code = d.code
) AS siblings
FROM data d
Results:
| ORDER_NO | CODE | DISP_ORDER | LVL | DESCRIPTION | PARENT | CHILDREN | SIBLINGS |
|----------|------|------------|-----|-------------|--------|-------------------|----------|
| RM001-01 | 1 | 0 | 1 | HK140904-1A | (null) | 1,2,3,4,5,6,7,8,9 | (null) |
| RM001-01 | 1 | 1 | 2 | HK140904-1B | 0 | 2,3,4 | 5 |
| RM001-01 | 1 | 2 | 3 | HK140904-1B | 1 | 3,4 | (null) |
| RM001-01 | 1 | 3 | 4 | HK140904-1C | 2 | 4 | (null) |
| RM001-01 | 1 | 4 | 5 | HK140904-1D | 3 | (null) | (null) |
| RM001-01 | 1 | 5 | 2 | HK140904-1E | 0 | 6,7,8,9 | 1 |
| RM001-01 | 1 | 6 | 3 | HK140904-1E | 5 | (null) | 7 |
| RM001-01 | 1 | 7 | 3 | HK140904-1X | 5 | 8,9 | 6 |
| RM001-01 | 1 | 8 | 4 | HK140904-1E | 7 | 9 | (null) |
| RM001-01 | 1 | 9 | 5 | HK140904-1E | 8 | (null) | (null) |
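CONNECT BY is Oracle-specific. On engines without it, a recursive common table expression can collect the same descendants. The following is a rough sketch under that assumption, using the simpler table from the question (descriptions A to K) and Python's sqlite3 module; the view name pattern_with_parent and the CTE name descendants are invented for the example, and the comma-separated lists are assembled in Python rather than with LISTAGG.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_pattern (disp_order INT, lvl INT, description TEXT)")
conn.executemany(
    "INSERT INTO tbl_pattern VALUES (?, ?, ?)",
    [(0, 1, "A"), (1, 2, "B"), (2, 3, "C"), (3, 4, "D"), (4, 5, "E"),
     (5, 2, "F"), (6, 3, "G"), (7, 3, "H"), (8, 3, "I"), (9, 4, "J"),
     (10, 5, "K")],
)

# Same parent rule as the question: the closest earlier row one level up.
conn.execute("""
CREATE VIEW pattern_with_parent AS
SELECT t.disp_order AS disp_order,
       (SELECT MAX(p.disp_order) FROM tbl_pattern p
         WHERE p.lvl = t.lvl - 1 AND p.disp_order < t.disp_order) AS parent
FROM tbl_pattern t
""")

# Start from direct children, then repeatedly attach each child's children
# to the original ancestor.
DESCENDANTS = """
WITH RECURSIVE descendants(ancestor, disp_order) AS (
    SELECT parent, disp_order FROM pattern_with_parent WHERE parent IS NOT NULL
    UNION ALL
    SELECT a.ancestor, c.disp_order
    FROM descendants a
    JOIN pattern_with_parent c ON c.parent = a.disp_order
)
SELECT ancestor, disp_order FROM descendants ORDER BY ancestor, disp_order
"""

# Rows sharing a parent, excluding the row itself.
SIBLINGS = """
SELECT a.disp_order, b.disp_order
FROM pattern_with_parent a
JOIN pattern_with_parent b
  ON b.parent = a.parent AND b.disp_order <> a.disp_order
ORDER BY a.disp_order, b.disp_order
"""

children = defaultdict(list)
for ancestor, disp_order in conn.execute(DESCENDANTS):
    children[ancestor].append(disp_order)

siblings = defaultdict(list)
for disp_order, other in conn.execute(SIBLINGS):
    siblings[disp_order].append(other)

for key in range(11):
    print(key, children.get(key, []), siblings.get(key, []))
```

The root row has a NULL parent, so the equality join in the sibling query never matches it and it gets no siblings, as in the expected output.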
I am trying to find a way to count based on groups and I was not able to figure out a way without having to use a Cursor. Since using a Cursor will be relatively slow I was hoping there might be a better way.
Simplified, the data is structured as follows:
+----+--------+-------+--------+
| ID | NEXTID | RowNo | Status |
+----+--------+-------+--------+
| 1 | 2 | 1 | 1 |
| 2 | 3 | 1 | 1 |
| 3 | 4 | 1 | 0 |
| 4 | | 1 | 1 |
| 1 | 2 | 2 | 0 |
| 2 | 3 | 2 | 1 |
| 3 | 4 | 2 | 1 |
| 4 | | 2 | 1 |
| 1 | 2 | 3 | 1 |
| 2 | 3 | 3 | 1 |
| 3 | 4 | 3 | 1 |
| 4 | | 3 | 1 |
+----+--------+-------+--------+
I now want to COUNT the Status column in groups resulting in:
+-----+-------------+
| Row | StatusCount |
+-----+-------------+
| 1 | 2 |
| 1 | 1 |
| 2 | 3 |
| 3 | 4 |
+-----+-------------+
For testing purposes I created the following code:
SELECT
ID,
NEXTID,
RowNo,
Status,
LEAD(ID,1,0)
OVER (ORDER BY RowNo,ID) AS LEADER
INTO #TestTable
FROM
(
VALUES
(1, 2, 1, 1),
(2, 3, 1, 1),
(3, 4, 1, 0),
(4, '', 1, 1),
(1, 2, 2, 0),
(2, 3, 2, 1),
(3, 4, 2, 1),
(4, '', 2, 1),
(1, 2, 3, 1),
(2, 3, 3, 1),
(3, 4, 3, 1),
(4, '', 3, 1)
)
AS TestTable(
ID,
NEXTID,
RowNo,
Status);
GO
SELECT
RowNo,
Count(Status) AS StatusCount
FROM #TestTable
WHERE
Status = 1
GROUP BY
RowNo
This results in
+-----+-------------+
| Row | StatusCount |
+-----+-------------+
| 1 | 3 |
| 2 | 3 |
| 3 | 4 |
+-----+-------------+
This does not split the first row into its two groups. I do realise that I need another GROUP BY condition, but I cannot figure out the appropriate one.
Thank you very much for your help. If this has already been answered I was unable to find the topic and hints will also be appreciated.
With kind regards
freubau
You can identify the groups by doing a cumulative sum of the zeros up to each number. Then, the rest is just aggregation:
select rowno, count(*)
from (select t.*,
sum(case when status = 0 then 1 else 0 end) over (partition by rowno order by id) as grp
from #TestTable t
) t
where status = 1
group by rowno, grp
order by rowno, grp;
Here is a rex tester for it.
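To see why the running sum of zeros works as a group key, it can help to look at the intermediate grp column before aggregating. A small sketch using Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions); the table name test_rows stands in for #TestTable, and the empty NEXTID values are stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_rows (ID INT, NEXTID INT, RowNo INT, Status INT)")
conn.executemany(
    "INSERT INTO test_rows VALUES (?, ?, ?, ?)",
    [
        (1, 2, 1, 1), (2, 3, 1, 1), (3, 4, 1, 0), (4, None, 1, 1),
        (1, 2, 2, 0), (2, 3, 2, 1), (3, 4, 2, 1), (4, None, 2, 1),
        (1, 2, 3, 1), (2, 3, 3, 1), (3, 4, 3, 1), (4, None, 3, 1),
    ],
)

# Running count of zeros within each RowNo: it increases at every Status = 0,
# so consecutive 1s between two zeros share the same grp value.
WITH_GRP = """
SELECT RowNo, ID, Status,
       SUM(CASE WHEN Status = 0 THEN 1 ELSE 0 END)
           OVER (PARTITION BY RowNo ORDER BY ID) AS grp
FROM test_rows
"""

# Intermediate view for RowNo 1: grp goes 0, 0, 1, 1.
first_row = conn.execute(
    f"SELECT ID, Status, grp FROM ({WITH_GRP}) AS g WHERE RowNo = 1 ORDER BY ID"
).fetchall()
print(first_row)

# Final aggregation, as in the answer.
counts = conn.execute(
    f"""
    SELECT RowNo, COUNT(*) AS StatusCount
    FROM ({WITH_GRP}) AS g
    WHERE Status = 1
    GROUP BY RowNo, grp
    ORDER BY RowNo, grp
    """
).fetchall()
print(counts)
```

In RowNo 1 the zero at ID 3 bumps grp from 0 to 1, which is what splits that row's three 1s into a group of two and a group of one.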
I've got a table of temperature samples over time from several sources and I want to find the minimum, maximum, and average temperatures across all sources at set time intervals. At first glance this is easily done like so:
SELECT MIN(temp), MAX(temp), AVG(temp) FROM samples GROUP BY time;
However, things become much more complicated (to the point where I'm stumped!) if sources drop in and out and, rather than ignoring the missing sources during the intervals in question, I want to use the sources' last known temperatures for the missing samples. Using datetimes and constructing intervals (say every minute) across samples unevenly distributed over time further complicates things.
I think it should be possible to create the results I want by doing a self-join on the samples table where the time from the first table is greater than or equal to the time of the second table and then calculating aggregate values for rows grouped by source. However, I'm stumped about how to actually do this.
Here's my test table:
+------+------+------+
| time | source | temp |
+------+------+------+
| 1 | a | 20 |
| 1 | b | 18 |
| 1 | c | 23 |
| 2 | b | 21 |
| 2 | c | 20 |
| 2 | a | 18 |
| 3 | a | 16 |
| 3 | c | 13 |
| 4 | c | 15 |
| 4 | a | 4 |
| 4 | b | 31 |
| 5 | b | 10 |
| 5 | c | 16 |
| 5 | a | 22 |
| 6 | a | 18 |
| 6 | b | 17 |
| 7 | a | 20 |
| 7 | b | 19 |
+------+------+------+
INSERT INTO samples (time, source, temp) VALUES (1, 'a', 20), (1, 'b', 18), (1, 'c', 23), (2, 'b', 21), (2, 'c', 20), (2, 'a', 18), (3, 'a', 16), (3, 'c', 13), (4, 'c', 15), (4, 'a', 4), (4, 'b', 31), (5, 'b', 10), (5, 'c', 16), (5, 'a', 22), (6, 'a', 18), (6, 'b', 17), (7, 'a', 20), (7, 'b', 19);
To do my min, max and avg calculations, I want an intermediate table that looks like this:
+------+------+------+
| time | source | temp |
+------+------+------+
| 1 | a | 20 |
| 1 | b | 18 |
| 1 | c | 23 |
| 2 | b | 21 |
| 2 | c | 20 |
| 2 | a | 18 |
| 3 | a | 16 |
| 3 | b | 21 |
| 3 | c | 13 |
| 4 | c | 15 |
| 4 | a | 4 |
| 4 | b | 31 |
| 5 | b | 10 |
| 5 | c | 16 |
| 5 | a | 22 |
| 6 | a | 18 |
| 6 | b | 17 |
| 6 | c | 16 |
| 7 | a | 20 |
| 7 | b | 19 |
| 7 | c | 16 |
+------+------+------+
The following query is getting me close to what I want but it takes the temperature value of the source's first result, rather than the most recent one at the given time interval:
SELECT s.dt as sdt, s.mac, ss.temp, MAX(ss.dt) as maxdt FROM (SELECT DISTINCT dt FROM samples) AS s CROSS JOIN samples AS ss WHERE s.dt >= ss.dt GROUP BY sdt, mac HAVING maxdt <= s.dt ORDER BY sdt ASC, maxdt ASC;
+------+------+------+-------+
| sdt | mac | temp | maxdt |
+------+------+------+-------+
| 1 | a | 20 | 1 |
| 1 | c | 23 | 1 |
| 1 | b | 18 | 1 |
| 2 | a | 20 | 2 |
| 2 | c | 23 | 2 |
| 2 | b | 18 | 2 |
| 3 | b | 18 | 2 |
| 3 | a | 20 | 3 |
| 3 | c | 23 | 3 |
| 4 | a | 20 | 4 |
| 4 | c | 23 | 4 |
| 4 | b | 18 | 4 |
| 5 | a | 20 | 5 |
| 5 | c | 23 | 5 |
| 5 | b | 18 | 5 |
| 6 | c | 23 | 5 |
| 6 | a | 20 | 6 |
| 6 | b | 18 | 6 |
| 7 | c | 23 | 5 |
| 7 | b | 18 | 7 |
| 7 | a | 20 | 7 |
+------+------+------+-------+
Update: chadhoc (great name, by the way!) gives a nice solution that unfortunately does not work in MySQL, since it does not support the FULL JOIN he uses. Luckily, I believe a simple UNION is an effective replacement:
-- Unify the original samples with the missing values that we've calculated
(
SELECT time, source, temp
FROM samples
)
UNION
( -- Pull all the time/source combinations that we are missing from the sample set, along with the temp
-- from the last sampled interval for the same time/source combination if we do not have one
SELECT a.time, a.source, (SELECT t2.temp FROM samples AS t2 WHERE t2.time < a.time AND t2.source = a.source ORDER BY t2.time DESC LIMIT 1) AS temp
FROM
( -- All values we want to get should be a cross of time/source
SELECT t1.time, s1.source
FROM
(SELECT DISTINCT time FROM samples) AS t1
CROSS JOIN
(SELECT DISTINCT source FROM samples) AS s1
) AS a
LEFT JOIN samples s
ON a.time = s.time
AND a.source = s.source
WHERE s.source IS NULL
)
ORDER BY time, source;
Update 2: MySQL gives the following EXPLAIN output for chadhoc's code:
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+
| 1 | PRIMARY | temp | ALL | NULL | NULL | NULL | NULL | 18 | |
| 2 | UNION | <derived4> | ALL | NULL | NULL | NULL | NULL | 21 | |
| 2 | UNION | s | ALL | NULL | NULL | NULL | NULL | 18 | Using where |
| 4 | DERIVED | <derived6> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 4 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 7 | |
| 6 | DERIVED | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using temporary |
| 5 | DERIVED | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using temporary |
| 3 | DEPENDENT SUBQUERY | t2 | ALL | NULL | NULL | NULL | NULL | 18 | Using where; Using filesort |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using filesort |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+
I was able to get Charles' code working like so:
SELECT T.time, S.source,
COALESCE(
D.temp,
(
SELECT temp FROM samples
WHERE source = S.source AND time = (
SELECT MAX(time)
FROM samples
WHERE
source = S.source
AND time < T.time
)
)
) AS temp
FROM (SELECT DISTINCT time FROM samples) AS T
CROSS JOIN (SELECT DISTINCT source FROM samples) AS S
LEFT JOIN samples AS D
ON D.source = S.source AND D.time = T.time
Its explanation is:
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+
| 1 | PRIMARY | <derived5> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 7 | |
| 1 | PRIMARY | D | ALL | NULL | NULL | NULL | NULL | 18 | |
| 5 | DERIVED | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using temporary |
| 4 | DERIVED | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using temporary |
| 2 | DEPENDENT SUBQUERY | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using where |
| 3 | DEPENDENT SUBQUERY | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using where |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+
I think you'll get better performance making use of the ranking/windowing functions in MySQL, but unfortunately I do not know those as well as the T-SQL implementation. Here is a mostly ANSI-compliant solution that will work though (the [time] bracket quoting and top 1 are T-SQL specifics):
-- Full join across the sample set and anything missing from the sample set, pulling the missing temp first if we do not have one
select coalesce(c1.[time], c2.[time]) as dt, coalesce(c1.source, c2.source) as source, coalesce(c2.temp, c1.temp) as temp
from samples c1
full join ( -- Pull all the time/source combinations that we are missing from the sample set, along with the temp
-- from the last sampled interval for the same time/source combination if we do not have one
select a.time, a.source,
(select top 1 t2.temp from samples t2 where t2.time < a.time and t2.source = a.source order by t2.time desc) as temp
from
( -- All values we want to get should be a cross of time/samples
select t1.[time], s1.source
from
(select distinct [time] from samples) as t1
cross join
(select distinct source from samples) as s1
) a
left join samples s
on a.[time] = s.time
and a.source = s.source
where s.source is null
) c2
on c1.time = c2.time
and c1.source = c2.source
order by dt, source
I know this looks complicated, but it's formatted to explain itself...
It should work... Hope you only have three sources... If you have an arbitrary number of sources then this won't work... In that case see the second query...
EDIT: Removed first attempt
EDIT: If you don't know the sources ahead of time, you'll have to do something where you create an intermediate result set that "Fills in" the missing values..
something like this:
2nd EDIT: Removed need for Coalesce by moving logic to retrieve most recent temp reading for each source from Select clause into the Join condition.
-- The outer query can only see the derived table Z, so it must use Z.Time
Select Z.Time, Max(Z.Temp) MaxTemp,
    Min(Z.Temp) MinTemp, Avg(Z.Temp) AvgTemp
From
   (Select T.Time, S.Source, D.Temp
    From (Select Distinct Time From Samples) T
    Cross Join
        (Select Distinct Source From Samples) S
    Left Join Samples D
        On D.Source = S.Source
        -- Most recent reading for this source at or before the interval
        And D.Time =
           (Select Max(Time)
            From Samples
            Where Source = S.Source
            And Time <= T.Time)) Z
Group By Z.Time
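As a quick check of the carry-forward logic, the same query shape can be run against the question's sample data. This is only a sketch using Python's sqlite3 module rather than MySQL or SQL Server; the aliases t, s, d, x and z are arbitrary, and performance on a large table is not considered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (time INT, source TEXT, temp INT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        (1, "a", 20), (1, "b", 18), (1, "c", 23),
        (2, "b", 21), (2, "c", 20), (2, "a", 18),
        (3, "a", 16), (3, "c", 13),
        (4, "c", 15), (4, "a", 4), (4, "b", 31),
        (5, "b", 10), (5, "c", 16), (5, "a", 22),
        (6, "a", 18), (6, "b", 17),
        (7, "a", 20), (7, "b", 19),
    ],
)

# Every time/source combination, joined to that source's most recent
# sample at or before the given time, then aggregated per time.
SUMMARY = """
SELECT z.time, MIN(z.temp), MAX(z.temp), AVG(z.temp)
FROM (
    SELECT t.time AS time, s.source AS source, d.temp AS temp
    FROM (SELECT DISTINCT time FROM samples) AS t
    CROSS JOIN (SELECT DISTINCT source FROM samples) AS s
    LEFT JOIN samples AS d
      ON d.source = s.source
     AND d.time = (SELECT MAX(x.time) FROM samples AS x
                   WHERE x.source = s.source AND x.time <= t.time)
) AS z
GROUP BY z.time
ORDER BY z.time
"""

summary = conn.execute(SUMMARY).fetchall()
for time, min_temp, max_temp, avg_temp in summary:
    print(time, min_temp, max_temp, round(avg_temp, 2))
```

At time 3, source b has no sample, so its reading of 21 from time 2 is carried forward; at times 6 and 7, source c's last reading of 16 from time 5 is used.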