Aggregate adjacent only records with T-SQL - sql

I have (simplified for the example) a table with the following data
Row Start Finish ID Amount
--- --------- ---------- -- ------
1 2008-10-01 2008-10-02 01 10
2 2008-10-02 2008-10-03 02 20
3 2008-10-03 2008-10-04 01 38
4 2008-10-04 2008-10-05 01 23
5 2008-10-05 2008-10-06 03 14
6 2008-10-06 2008-10-07 02 3
7 2008-10-07 2008-10-08 02 8
8 2008-10-08 2008-11-08 03 19
The dates represent a period in time, the ID is the state a system was in during that period and the amount is a value related to that state.
What I want to do is to aggregate the Amounts for adjacent rows with the same ID number, but keep the same overall sequence so that contiguous runs can be combined. Thus I want to end up with data like:
Row Start Finish ID Amount
--- --------- ---------- -- ------
1 2008-10-01 2008-10-02 01 10
2 2008-10-02 2008-10-03 02 20
3 2008-10-03 2008-10-05 01 61
4 2008-10-05 2008-10-06 03 14
5 2008-10-06 2008-10-08 02 11
6 2008-10-08 2008-11-08 03 19
I am after a T-SQL solution that can be put into a stored procedure, but I can't see how to do that with simple queries. I suspect that it may require iteration of some sort but I don't want to go down that path.
The reason I want to do this aggregation is that the next step in the process is to do a SUM() and COUNT() grouped by the unique IDs that occur within the sequence, so that my final data will look something like:
ID Counts Total
-- ------ -----
01 2 71
02 2 31
03 2 33
However if I do a simple
SELECT ID, COUNT(ID) AS Counts, SUM(Amount) AS Total FROM data GROUP BY ID
On the original table I get something like
ID Counts Total
-- ------ -----
01 3 71
02 3 31
03 2 33
Which is not what I want.

If you read the book "Developing Time-Oriented Database Applications in SQL" by R T Snodgrass (the pdf of which is available from his web site under publications), and get as far as Figure 6.25 on p165-166, you will find the non-trivial SQL which can be used in the current example to group the various rows with the same ID value and continuous time intervals.
The query development below is close to correct, but there is a problem spotted right at the end, that has its source in the first SELECT statement. I've not yet tracked down why the incorrect answer is being given. [If someone can test the SQL on their DBMS and tell me whether the first query works correctly there, it would be a great help!]
It looks something like:
-- Derived from Figure 6.25 from Snodgrass "Developing Time-Oriented
-- Database Applications in SQL"
CREATE TABLE Data
(
Start DATE,
Finish DATE,
ID CHAR(2),
Amount INT
);
INSERT INTO Data VALUES('2008-10-01', '2008-10-02', '01', 10);
INSERT INTO Data VALUES('2008-10-02', '2008-10-03', '02', 20);
INSERT INTO Data VALUES('2008-10-03', '2008-10-04', '01', 38);
INSERT INTO Data VALUES('2008-10-04', '2008-10-05', '01', 23);
INSERT INTO Data VALUES('2008-10-05', '2008-10-06', '03', 14);
INSERT INTO Data VALUES('2008-10-06', '2008-10-07', '02', 3);
INSERT INTO Data VALUES('2008-10-07', '2008-10-08', '02', 8);
INSERT INTO Data VALUES('2008-10-08', '2008-11-08', '03', 19);
SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS (SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS (SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS (SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish)
OR (T2.Start <= L.Finish AND L.Finish < T2.Finish)));
The output from that query is:
01 2008-10-01 2008-10-02
01 2008-10-03 2008-10-05
02 2008-10-02 2008-10-03
02 2008-10-06 2008-10-08
03 2008-10-05 2008-10-06
03 2008-10-05 2008-11-08
03 2008-10-08 2008-11-08
Edited: There's a problem with the penultimate row - it should not be there. And I'm not clear (yet) where it is coming from.
Now we need to treat that complex expression as a query expression in the FROM clause of another SELECT statement, which will sum the amount values for a given ID over the entries that overlap with the maximal ranges shown above.
SELECT M.ID, M.Start, M.Finish, SUM(D.Amount)
FROM Data AS D,
(SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS (SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS (SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS (SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish)
OR (T2.Start <= L.Finish AND L.Finish < T2.Finish)))) AS M
WHERE D.ID = M.ID
AND M.Start <= D.Start
AND M.Finish >= D.Finish
GROUP BY M.ID, M.Start, M.Finish
ORDER BY M.ID, M.Start;
This gives:
ID Start Finish Amount
01 2008-10-01 2008-10-02 10
01 2008-10-03 2008-10-05 61
02 2008-10-02 2008-10-03 20
02 2008-10-06 2008-10-08 11
03 2008-10-05 2008-10-06 14
03 2008-10-05 2008-11-08 33 -- Here be trouble!
03 2008-10-08 2008-11-08 19
Edited: This is almost the correct data set on which to do the COUNT and SUM aggregation requested by the original question, so the final answer is:
SELECT I.ID, COUNT(*) AS Number, SUM(I.Amount) AS Amount
FROM (SELECT M.ID, M.Start, M.Finish, SUM(D.Amount) AS Amount
FROM Data AS D,
(SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS
(SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS
(SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS
(SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish) OR
(T2.Start <= L.Finish AND L.Finish < T2.Finish)))
) AS M
WHERE D.ID = M.ID
AND M.Start <= D.Start
AND M.Finish >= D.Finish
GROUP BY M.ID, M.Start, M.Finish
) AS I
GROUP BY I.ID
ORDER BY I.ID;
id number amount
01 2 71
02 2 31
03 3 66
Review:
Oh! Drat...the entry for 03 has twice the 'amount' that it should have, and a count of 3 instead of 2. Previous 'edited' parts indicate where things started to go wrong. It looks as though either the first query is subtly wrong (maybe it is intended for a different question), or the optimizer I'm working with is misbehaving. Nevertheless, there should be an answer closely related to this that will give the correct values.
For the record: tested on IBM Informix Dynamic Server 11.50 on Solaris 10. However, it should work fine on any other moderately standard-conformant SQL DBMS.
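One possible explanation for the stray row, checked by hand against the eight sample rows only: the "no gaps" test lets M range strictly between F.Finish and L.Start, so the start of L itself is never tested for coverage. With F = (03, 2008-10-05 to 2008-10-06) and L = (03, 2008-10-08 to 2008-11-08) there is no such M, and the pair slips through. The sketch below changes M.Start < L.Start to M.Start <= L.Start so that L is covered by the same test. This is not a verified reading of Figure 6.25. SampleData is a hypothetical stand-in for the Data table, and the VALUES constructor assumes SQL Server 2008 or later.

```sql
-- SampleData stands in for the Data table so the query can run on its own.
WITH SampleData (Start, Finish, ID, Amount) AS (
    SELECT Start, Finish, ID, Amount
    FROM (VALUES
        (CAST('2008-10-01' AS DATE), CAST('2008-10-02' AS DATE), '01', 10),
        ('2008-10-02', '2008-10-03', '02', 20),
        ('2008-10-03', '2008-10-04', '01', 38),
        ('2008-10-04', '2008-10-05', '01', 23),
        ('2008-10-05', '2008-10-06', '03', 14),
        ('2008-10-06', '2008-10-07', '02', 3),
        ('2008-10-07', '2008-10-08', '02', 8),
        ('2008-10-08', '2008-11-08', '03', 19)
    ) AS v (Start, Finish, ID, Amount)
)
SELECT DISTINCT F.ID, F.Start, L.Finish
FROM SampleData AS F, SampleData AS L
WHERE F.Start < L.Finish
  AND F.ID = L.ID
  -- No uncovered start point after F.Finish, up to and including L.Start
  AND NOT EXISTS (SELECT *
                  FROM SampleData AS M
                  WHERE M.ID = F.ID
                    AND F.Finish < M.Start
                    AND M.Start <= L.Start      -- was: M.Start < L.Start
                    AND NOT EXISTS (SELECT *
                                    FROM SampleData AS T1
                                    WHERE T1.ID = F.ID
                                      AND T1.Start < M.Start
                                      AND M.Start <= T1.Finish))
  -- Cannot be extended further in either direction
  AND NOT EXISTS (SELECT *
                  FROM SampleData AS T2
                  WHERE T2.ID = F.ID
                    AND ((T2.Start < F.Start AND F.Start <= T2.Finish)
                      OR (T2.Start <= L.Finish AND L.Finish < T2.Finish)))
ORDER BY F.ID, F.Start;
```

Worked through by hand on the sample rows, this leaves six intervals, and the 03 interval running from 2008-10-05 to 2008-11-08 no longer appears. It has not been run against Informix or checked on other data.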

You probably need to create a cursor and loop through the results, keeping track of which ID you are working with and accumulating the data along the way. When the ID changes, you can insert the accumulated data into a temporary table and return the table at the end of the procedure (select all from it). A table-valued function might be better, as you can then just insert into the return table as you go along.

I suspect that it may require iteration of some sort but I don't want to go down that path.
I think that's the route you'll have to take: use a cursor to populate a table variable. If you have a large number of records, you could use a permanent table to store the results; then, when you need to retrieve the data, you could process only the new data.
I would add a bit field with a default of 0 to the source table to keep track of which records have been processed. Assuming no one is using select * on the table, adding a column with a default value won't affect the rest of your application.
Add a comment to this post if you want help coding the solution.

Well, I decided to go down the iteration route using a mixture of joins and cursors. By JOINing the data table against itself I can create a linked list of only those records that are consecutive.
INSERT INTO #CONSEC
SELECT a.ID, a.Start, b.Finish, b.Amount
FROM Data a JOIN Data b
ON (a.Finish = b.Start) AND (a.ID = b.ID)
Then I can unwind the list by iterating over it with a cursor, updating the Data table as I go and deleting the now-extraneous records from it:
-- Variable types are assumed to match the Data table
DECLARE @ID CHAR(2), @Start DATETIME, @Finish DATETIME, @Amount INT
DECLARE @ID_Last CHAR(2), @Start_Last DATETIME, @Finish_Last DATETIME, @Total INT
DECLARE CCursor CURSOR FOR
    SELECT ID, Start, Finish, Amount FROM #CONSEC ORDER BY Start DESC
SET @Total = 0
OPEN CCursor
FETCH NEXT FROM CCursor INTO @ID, @Start, @Finish, @Amount
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @Total = @Total + @Amount
    SET @Start_Last = @Start
    SET @Finish_Last = @Finish
    SET @ID_Last = @ID
    DELETE FROM Data WHERE Start = @Finish
    FETCH NEXT FROM CCursor INTO @ID, @Start, @Finish, @Amount
    -- Flush the running total when the run of consecutive records ends
    IF (@ID_Last <> @ID) OR (@Finish <> @Start_Last)
    BEGIN
        UPDATE Data
        SET Amount = Amount + @Total
        WHERE Start = @Start_Last
        SET @Total = 0
    END
END
CLOSE CCursor
DEALLOCATE CCursor
This all works and has acceptable performance for typical data that I am using.
I did find one small issue with the above code. Originally I was updating the Data table on each loop through the cursor, but this didn't work. It seems that you can only do one update on a record, and that multiple updates (in order to keep adding data) revert to reading the original contents of the record.
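A set-based alternative to the cursor, as a sketch rather than a tested replacement: number the rows once over the whole sequence and once within each ID, and note that the difference between the two numbers stays constant inside a run of adjacent rows with the same ID. It assumes that "adjacent" means consecutive when ordered by Start and that Start values are unique. SampleData is a hypothetical stand-in for the Data table; ROW_NUMBER() needs SQL Server 2005 or later, and the VALUES constructor needs 2008 or later.

```sql
WITH SampleData (Start, Finish, ID, Amount) AS (
    SELECT Start, Finish, ID, Amount
    FROM (VALUES
        (CAST('2008-10-01' AS DATE), CAST('2008-10-02' AS DATE), '01', 10),
        ('2008-10-02', '2008-10-03', '02', 20),
        ('2008-10-03', '2008-10-04', '01', 38),
        ('2008-10-04', '2008-10-05', '01', 23),
        ('2008-10-05', '2008-10-06', '03', 14),
        ('2008-10-06', '2008-10-07', '02', 3),
        ('2008-10-07', '2008-10-08', '02', 8),
        ('2008-10-08', '2008-11-08', '03', 19)
    ) AS v (Start, Finish, ID, Amount)
),
Numbered AS (
    -- Grp is constant within each run of adjacent rows sharing an ID
    SELECT Start, Finish, ID, Amount,
           ROW_NUMBER() OVER (ORDER BY Start)
         - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Start) AS Grp
    FROM SampleData
),
Runs AS (
    -- One row per run: the intermediate result asked for in the question
    SELECT ID, MIN(Start) AS Start, MAX(Finish) AS Finish, SUM(Amount) AS Amount
    FROM Numbered
    GROUP BY ID, Grp
)
SELECT ID, COUNT(*) AS Counts, SUM(Amount) AS Total
FROM Runs
GROUP BY ID
ORDER BY ID;
```

Worked through by hand on the sample rows, Runs holds the six combined rows from the question, and the final SELECT gives 01 / 2 / 71, 02 / 2 / 31 and 03 / 2 / 33. Selecting from Runs with ORDER BY Start instead returns the combined sequence itself.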

Related

SQL Server 2008 - need help on a antithetical query

I want to find the meter reading for a given transaction day. In some cases there won't be a meter reading for that day, and then I would like to see the most recent earlier reading.
A sample data set follows. I am using SQL Server 2008.
declare @meter table (UnitID int, reading_Date date, reading int)
declare @Transactions table (Transactions_ID int, UnitID int, Transactions_date date)
insert into @meter (UnitID, reading_Date, reading) values
(1,'1/1/2014',1000),
(1,'2/1/2014',1010),
(1,'3/1/2014',1020),
(2,'1/1/2014',1001),
(3,'1/1/2014',1002);
insert into @Transactions (Transactions_ID, UnitID, Transactions_date) values
(1,1,'1/1/2014'),
(2,1,'2/1/2014'),
(3,1,'3/1/2014'),
(4,1,'4/1/2014'),
(5,2,'1/1/2014'),
(6,2,'3/1/2014'),
(7,3,'4/1/2014');
select * from @meter;
select * from @Transactions;
I expect to get following output
Transactions
Transactions_ID UnitID Transactions_date reading
1 1 1/1/2014 1000
2 1 2/1/2014 1010
3 1 3/1/2014 1020
4 1 4/1/2014 1020
5 2 1/1/2014 1001
6 2 3/1/2014 1001
7 3 4/1/2014 1002
A query that gives your desired output is as follows:
SELECT Transactions_ID, T.UnitID, Transactions_date
, (CASE WHEN M.reading IS NULL THEN
(
SELECT MAX(Reading) FROM @meter AS A
JOIN @Transactions AS B ON A.UnitID=B.UnitID AND A.UnitID=T.UnitID
)
ELSE M.reading END) AS Reading
FROM @meter AS M
RIGHT OUTER JOIN @Transactions AS T ON T.UnitID=M.UnitID
AND T.Transactions_date=M.reading_Date
I can think of two ways to approach this, neither of which is ideal.
The first (and slightly better) way would be to create a SQL Function that took the Transactions_date as a parameter and returned the reading for Max(Reading_date) where reading_date <= transactions_date. You could then use this function in a select statement against the Transactions table.
The other approach would be to use a cursor to iterate through the transactions table and use the same logic as above where you return the reading for Max(Reading_date) where reading_date <= transactions_date.
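The lookup described above (the reading with the latest reading_Date on or before the transaction date) can also be sketched without a function or a cursor, using OUTER APPLY with TOP (1). This is an illustration under the question's own table variables and rows, not a tuned solution; the date literals are copied from the question and are interpreted according to the session's date format.

```sql
-- Table variables and rows as in the question.
DECLARE @meter TABLE (UnitID int, reading_Date date, reading int);
DECLARE @Transactions TABLE (Transactions_ID int, UnitID int, Transactions_date date);

INSERT INTO @meter (UnitID, reading_Date, reading) VALUES
(1,'1/1/2014',1000), (1,'2/1/2014',1010), (1,'3/1/2014',1020),
(2,'1/1/2014',1001), (3,'1/1/2014',1002);

INSERT INTO @Transactions (Transactions_ID, UnitID, Transactions_date) VALUES
(1,1,'1/1/2014'), (2,1,'2/1/2014'), (3,1,'3/1/2014'), (4,1,'4/1/2014'),
(5,2,'1/1/2014'), (6,2,'3/1/2014'), (7,3,'4/1/2014');

-- For each transaction, pick the newest reading for the same unit
-- dated on or before the transaction date (NULL if there is none).
SELECT t.Transactions_ID, t.UnitID, t.Transactions_date, m.reading
FROM @Transactions AS t
OUTER APPLY (
    SELECT TOP (1) r.reading
    FROM @meter AS r
    WHERE r.UnitID = t.UnitID
      AND r.reading_Date <= t.Transactions_date
    ORDER BY r.reading_Date DESC
) AS m
ORDER BY t.Transactions_ID;
```

Unlike a MAX(reading) per unit, this picks the reading by date, so it does not depend on readings always increasing or on the missing day being after the last reading.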
Try the query below; the result can be seen on SQLFiddle.
select a.Transactions_ID, a.UnitID, a.Transactions_date,
case when b.reading IS NULL then c.rd else b.reading end as reading
from
Transactions a
left outer join
meter b
on a.UnitID = b.UnitID
and a.Transactions_date = b.reading_Date
inner join
(
select UnitID,max(reading) as rd
from meter
group by UnitID
) as C
on a.UnitID = c.UnitID

SQL - Comparing against multiple values in the same column?

Ok, so I'm trying to write some SQL and I'm not sure how to tackle this situation. I have a table similar to the one below. The basic idea is that I need to get the records that are in an 'H' status (easy enough), but I need to exclude records that were in an 'H' status and moved on to an 'A' status at a later date.
So ideally, the results should only return the last two records, IDs 03 and 04. How would you guys do this?
ID STATUS STAT_DATE
01 A 05/01/2013
01 H 05/01/2012
02 A 12/01/2013
02 H 12/01/2012
03 H 03/01/2009
04 H 02/01/2008
You could do it this way:
select *
from t t1
where status='H' and not exists(
select *
from t t2
where t1.id=t2.id and t2.status='A' and t2.stat_date > t1.stat_date)
That will give you all entries of table t with status='H' where there is no entry in t with the same id, a later date, and status='A'.

SQL - display selective output that is not sequenced last in a multiple occurrence

I have a table like below
OP OP_var SPS SPS_sq
1010 01 KEB_x 01
1010 01 KEK_x 02
1010 02 KEH_c 01
1010 02 KEK_y 02
1010 02 KEB_d 03
1020 01 KEK_f 01
1020 01 KEE_g 02
The OP column has variance (OP_var) and within it is a group of SPS. SPS_sq is the sequencing of these SPS lines within the OP+OP_var.
I would like to display the KEK% rows whose SPS_sq is not last (meaning the KEK% row is either first or anywhere in the middle of the sequence for its OP and OP_var, as long as it is not last).
The output should look like this :
OP OP_var SPS SPS_sq
1010 02 KEK_y 02
1020 01 KEK_f 01
Ignore every KEK% row whose SPS_sq is the last one within its OP+OP_var.
I assume you're looking for a random row per (op, op_var) combination. The random row has to have an SPS like 'KEK%', and it cannot have the same SPS as the last row. (That implies it cannot be the last row itself.)
This example uses window functions, which are available in SQL Server, Oracle, and PostgreSQL. It uses a SQL Server-specific function (newid()) to create a random order.
select *
from (
select row_number() over (
partition by yt1.OP, yt1.OP_var
order by newid()) as rn2 -- Random order
, yt1.*
from dbo.YourTable yt1
join (
select row_number() over (
partition by OP, OP_var
order by SPS_sq desc) as rn
, *
from YourTable
) as last_row
on yt1.OP = last_row.OP
and yt1.OP_var = last_row.OP_var
and last_row.rn = 1 -- Highest SPS_sq
where yt1.SPS <> last_row.SPS
and yt1.SPS like 'KEK%'
) SubQueryALias
where rn2 = 1 -- Random KEK row that doesn't share SPS with last row
Example at SQL Fiddle.
If you want all KEK% rows that are not at the max sps_sq for their (op, op_var):
select * from Table1 t
where t.sps like 'KEK%'
and not exists
(select null from Table1 t1
inner join (select MAX(t2.sps_sq) as maxsps_sq, t2.op, t2.op_var
from Table1 t2
GROUP BY t2.op, t2.op_var) as getmax
on t1.op = getmax.op and t1.op_var = getmax.op_var
and t1.sps_sq = getmax.maxsps_sq
where t1.op = t.op and t1.op_var = t.op_var and t1.sps = t.sps and t.sps_sq = t1.sps_sq
);
SqlFiddle
Caution:
As Andomar noticed, this returns all the KEK% rows of an (op, op_var) group that do not have the last sps_sq number.
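The same "KEK% but not last" filter can be sketched with a windowed MAX instead of a correlated subquery: attach the highest SPS_sq of each (OP, OP_var) group to every row, then keep the KEK% rows that sit below it. SampleRows and Marked are hypothetical names; the rows are the ones from the question, and the VALUES constructor assumes SQL Server 2008 or later (PostgreSQL accepts the same form).

```sql
WITH SampleRows (OP, OP_var, SPS, SPS_sq) AS (
    SELECT OP, OP_var, SPS, SPS_sq
    FROM (VALUES
        ('1010', '01', 'KEB_x', '01'),
        ('1010', '01', 'KEK_x', '02'),
        ('1010', '02', 'KEH_c', '01'),
        ('1010', '02', 'KEK_y', '02'),
        ('1010', '02', 'KEB_d', '03'),
        ('1020', '01', 'KEK_f', '01'),
        ('1020', '01', 'KEE_g', '02')
    ) AS v (OP, OP_var, SPS, SPS_sq)
),
Marked AS (
    -- last_sq is the highest sequence number within each OP + OP_var
    SELECT OP, OP_var, SPS, SPS_sq,
           MAX(SPS_sq) OVER (PARTITION BY OP, OP_var) AS last_sq
    FROM SampleRows
)
SELECT OP, OP_var, SPS, SPS_sq
FROM Marked
WHERE SPS LIKE 'KEK%'
  AND SPS_sq < last_sq
ORDER BY OP, OP_var, SPS_sq;
```

On the sample rows this keeps KEK_y (1010, 02) and KEK_f (1020, 01) and drops KEK_x, which is last in its group. The comparison relies on SPS_sq being zero-padded text of equal width, as in the sample; with a numeric column it works the same way.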

Show data from table even if there is no data!! Oracle

I have a query which shows count of messages received based on dates.
For example:
1 | 1-May-2012
3 | 3-May-2012
4 | 6-May-2012
7 | 7-May-2012
9 | 9-May-2012
5 | 10-May-2012
1 | 12-May-2012
As you can see on some dates there are no messages received. What I want is it should show all the dates and if there are no messages received it should show 0 like this
1 | 1-May-2012
0 | 2-May-2012
3 | 3-May-2012
0 | 4-May-2012
0 | 5-May-2012
4 | 6-May-2012
7 | 7-May-2012
0 | 8-May-2012
9 | 9-May-2012
5 | 10-May-2012
0 | 11-May-2012
1 | 12-May-2012
How can I achieve this when there are no rows in the table?
First, it sounds like your application would benefit from a calendar table. A calendar table is a list of dates and information about the dates.
Second, you can do this without using temporary tables. Here is the approach:
with constants as (select min(<thedate>) as firstdate from <table>),
dates as (select firstdate + n - 1 as thedate
from (select rownum as n, firstdate
from <table> cross join constants
where rownum < sysdate - firstdate + 1
) seq
)
select dates.thedate, count(t.<thedate>)
from dates left outer join
<table> t
on t.<thedate> = dates.thedate
group by dates.thedate
Here is the idea. The alias constants records the earliest date in your table. The alias dates then creates a sequence of dates. The inner subquery calculates a sequence of integers, using rownum, and these are then added to the first date. Note that this needs at least as many rows in the table as there are days in the range, that is, on average at least one transaction per date. If that is not the case, you can use a bigger table.
The final part is the join that is used to bring back information about the dates. Note the use of count(t.<thedate>) instead of count(*). This counts only the matching records in your table, so it is 0 for dates with no data.
You don't need a separate table for this, you can create what you need in the query. This works for May:
WITH month_may AS (
select to_date('2012-05-01', 'yyyy-mm-dd') + level - 1 AS the_date
from dual
connect by level <= 31
)
SELECT *
FROM month_may mm
LEFT JOIN mytable t ON t.some_date = mm.the_date
The date range will depend on how exactly you want to do this and what your range is.
You could achieve this with a left outer join IF you had another table to join to that contains all possible dates.
One option might be to generate the dates in a temp table and join that to your query.
Something like this might do the trick.
CREATE TABLE #TempA (Col1 DateTime)
DECLARE @start DATETIME = convert(datetime, convert(nvarchar(10), getdate(), 121))
SELECT @start
DECLARE @counter INT = 0
WHILE @counter < 50
BEGIN
    INSERT INTO #TempA (Col1) VALUES (@start)
    SET @start = DATEADD(DAY, 1, @start)
    SET @counter = @counter + 1
END
That will create a temp table to hold the dates. I've just generated 50 of them, starting from today.
SELECT
a.Col1,
COUNT(b.MessageID)
FROM
#TempA a
LEFT OUTER JOIN YOUR_MESSAGE_TABLE b
ON a.Col1 = b.DateColumn
GROUP BY
a.Col1
Then you can left join your message counts to that.
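On SQL Server, the WHILE loop and temp table could be replaced by a recursive common table expression that generates the dates inline. The sketch below assumes SQL Server 2008 or later; the date range, the Messages rows and all column names are hypothetical stand-ins for the real message table.

```sql
DECLARE @first DATE = '2012-05-01', @last DATE = '2012-05-12';  -- hypothetical range

WITH Dates AS (
    -- Anchor row, then one extra day per recursion step up to @last
    SELECT @first AS TheDate
    UNION ALL
    SELECT DATEADD(DAY, 1, TheDate)
    FROM Dates
    WHERE TheDate < @last
),
Messages AS (
    -- Hypothetical stand-in for the real message table: one row per message
    SELECT CAST(v.d AS DATE) AS MessageDate
    FROM (VALUES ('2012-05-01'), ('2012-05-03'), ('2012-05-03'), ('2012-05-03')) AS v (d)
)
SELECT d.TheDate, COUNT(m.MessageDate) AS MessageCount
FROM Dates AS d
LEFT JOIN Messages AS m
       ON m.MessageDate = d.TheDate
GROUP BY d.TheDate
ORDER BY d.TheDate
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels
```

COUNT(m.MessageDate) counts only matched rows, so dates without messages come back with 0. For ranges used often, a permanent calendar table is probably the simpler choice.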

SQL Server 2005: Insert missing records in table that is in another reference table

I need help with the following. I have 2 tables. The first holds data captured by the client, for example:
[Data] Table
PersonId Visit Tested Done
01 Day 1 Eyes Yes
01 Day 1 Ears Yes
01 Day 2 Eyes Yes
01 Day 3 Eyes Yes
02 Day 1 Eyes Yes
02 Day 2 Ears Yes
02 Day 2 Smell Yes
03 Day 2 Eyes Yes
03 Day 2 Smell Yes
03 Day 3 Ears Yes
and the second table holds info of what needs to be tested.
[Ref] Table
Visit Test
Day 1 Eyes
Day 1 Ears
Day 1 Smell
Day 2 Eyes
Day 2 Ears
Day 2 Smell
Day 3 Eyes
Day 3 Ears
Day 3 Smell
Now I'm trying to write an insert query on [Data] to insert the tests that needed to be performed but do not exist yet. An example of the result I'm looking for:
[Data] table after:
PersonId Visit Tested Done
01 Day 1 Eyes Yes
01 Day 1 Ears Yes
01 Day 1 Smell No
01 Day 2 Eyes Yes
01 Day 2 Ears No
01 Day 2 Smell No
01 Day 3 Eyes Yes
01 Day 3 Ears No
01 Day 3 Smell No
02 Day 1 Eyes Yes
02 Day 1 Ears No
02 Day 1 Smell No
02 Day 2 Eyes No
02 Day 2 Ears Yes
02 Day 2 Smell Yes
02 Day 3 Eyes No
02 Day 3 Ears No
02 Day 3 Smell No
03 Day 1 Eyes No
03 Day 1 Ears No
03 Day 1 Smell No
03 Day 2 Eyes Yes
03 Day 2 Ears No
03 Day 2 Smell Yes
03 Day 3 Eyes No
03 Day 3 Ears Yes
03 Day 3 Smell No
If needed it will be OK to create a third [results] table.
All help will be much appreciated.
Kind Regards
Jacques
I'm suspicious of the database design if it requires this (along with some other red flags), but the following query should give you what you are asking for:
INSERT INTO Results
(
person_id,
visit,
tested,
done
)
SELECT
P.person_id,
T.visit,
T.test,
'No'
FROM
(SELECT DISTINCT person_id FROM Results) P -- Replace with Persons table if you have one
CROSS JOIN Templates T
LEFT OUTER JOIN Results R ON
R.person_id = P.person_id AND
R.visit = T.visit AND
R.tested = T.test
WHERE
R.person_id IS NULL
Or alternatively:
INSERT INTO Results
(
person_id,
visit,
tested,
done
)
SELECT
P.person_id,
T.visit,
T.test,
'No'
FROM
(SELECT DISTINCT person_id FROM Results) P -- Replace with Persons table if you have one
INNER JOIN Templates T ON
NOT EXISTS
(
SELECT *
FROM
Results R
WHERE
R.person_id = P.person_id AND
R.visit = T.visit AND
R.tested = T.test
)
I think you'll need a person table with just the personIDs; then you can do a cross join (a Cartesian product) with your test ref table to come up with a schedule of personIDs and expected tests.
Then, with that schedule set, do an outer join with the set of tests performed on personIDs and expect nulls instead of no's.
Then, if you want, you can convert your nulls to 'No'.
This probably isn't the best way, but... what if you were to create a primary key on the [Data] table,
PK: (PersonID, Visit, Tested)
Then you could create a stored procedure that inserts the tests for each PersonID and day, ignoring the rows that already exist:
CREATE PROCEDURE InsertTests
    @PersonID int
  , @Day nvarchar(10)
AS
BEGIN
    BEGIN TRY
        INSERT INTO [Data]
        (PersonID, Visit, Tested, Done)
        VALUES
        (@PersonID, @Day, 'Eyes', 'No')
    END TRY
    BEGIN CATCH
        -- Primary key violation: the row already exists, so ignore it
    END CATCH
    BEGIN TRY
        INSERT INTO [Data]
        (PersonID, Visit, Tested, Done)
        VALUES
        (@PersonID, @Day, 'Ears', 'No')
    END TRY
    BEGIN CATCH
    END CATCH
    BEGIN TRY
        INSERT INTO [Data]
        (PersonID, Visit, Tested, Done)
        VALUES
        (@PersonID, @Day, 'Smell', 'No')
    END TRY
    BEGIN CATCH
    END CATCH
END
INSERT Data (PersonID, Visit, Tested, Done)
SELECT P.PersonID, R.Visit, R.Test, 'No'
FROM
Person P -- or (SELECT DISTINCT PersonID FROM Data) P
CROSS JOIN Ref R
WHERE
NOT EXISTS (
SELECT 1
FROM Data D
WHERE
P.PersonID = D.PersonID
AND R.Visit = D.Visit
AND R.Test = D.Tested
)
And I can't resist posting a short version of @djacobson's answer:
ALTER TABLE Data ADD CONSTRAINT DF_Data_Done DEFAULT ('No') FOR Done
INSERT Data (PersonID, Visit, Tested)
SELECT P.PersonID, R.Visit, R.Test
FROM Person P CROSS JOIN Ref R
EXCEPT SELECT PersonID, Visit, Tested FROM Data
Here's a perhaps-simpler solution using Common Table Expressions:
WITH allTestsForEveryone AS
(
SELECT *
FROM (SELECT DISTINCT PersonID FROM DATA) a
CROSS JOIN REF
),
allMissingTests AS
(
SELECT PersonID,Visit,Test FROM allTestsForEveryone
EXCEPT
SELECT PersonID,Visit,Tested FROM DATA
)
INSERT INTO [DATA] (PersonID, Visit, Tested, Done)
SELECT PersonID, Visit, Test, 0 AS Done FROM allMissingTests;
The first CTE (allTestsForEveryone) gives you a set of all tests needed on all days for all persons. In the second CTE (allMissingTests), we subtract the tests that have been taken using the EXCEPT operator, and add a '0' to represent their not-done status when we insert them (you can replace that with 'No' - when I ran this test I used a bit column). We then insert the results of the second CTE into Data.