Continuous sequences in SQL - sql

Having a table with the following fields:
Order,Group,Sequence
it is required that all orders in a given group form a continuous sequence. For example: 1,2,3,4 or 4,5,6,7. How can I check using a single SQL query what orders do not comply with this rule? Thank you.
Example data:
Order Group Sequence
1 1 3
2 1 4
3 1 5
4 1 6
5 2 3
6 2 4
7 2 6
Expected result:
Order
5
6
7
Also accepted if the query returns only the group which has the wrong sequence, 2 for the example data.

Assuming that the sequences are generated and therefore cannot be duplicated:
SELECT group
FROM theTable
GROUP BY group
HAVING MAX(Sequence) - MIN(Sequence) &lt> (COUNT(*) - 1);

How about this?
select Group from Table
group by Group
having count(Sequence) <= max(Sequence)-min(Sequence)
[Edit] This assumes that Sequence does not allow duplicates within a particular group. It might be better to use:count != max - min + 1
[Edit again] D'oh, still not perfect. Another query to flush out duplicates would take care of that though.
[Edit the last] The original query worked fine in sqlite, which is what I had available for a quick test. It is much more forgiving than SQL server. Thanks to Bell for the pointer.

Personaly I think I would consider rethinking the requirement. It is the nature of relational databases that gaps in sequences can easily occur due to records that are rolled back. For instance, supppose an order starts to create four items in it, but one fails for some rason and is rolled back. If you precomputed the sequences manually, you would then have a gap is the one rolled back is not the last one. In other scenarios, you might get a gap due to multiple users looking for sequence values at approximately the same time or if at the last minute a customer deleted one record from the order. What are you honestly looking to gain from having contiguous sequences that you don't get from a parent child relationship?

This SQL selects the orders 3 and 4 wich have none continuous sequences.
DECLARE #Orders TABLE ([Order] INTEGER, [Group] INTEGER, Sequence INTEGER)
INSERT INTO #Orders VALUES (1, 1, 0)
INSERT INTO #Orders VALUES (1, 2, 0)
INSERT INTO #Orders VALUES (1, 3, 0)
INSERT INTO #Orders VALUES (2, 4, 0)
INSERT INTO #Orders VALUES (2, 5, 0)
INSERT INTO #Orders VALUES (2, 6, 0)
INSERT INTO #Orders VALUES (3, 4, 0)
INSERT INTO #Orders VALUES (3, 6, 0)
INSERT INTO #Orders VALUES (4, 1, 0)
INSERT INTO #Orders VALUES (4, 2, 0)
INSERT INTO #Orders VALUES (4, 8, 0)
SELECT o1.[Order]
FROM #Orders o1
LEFT OUTER JOIN #Orders o2 ON o2.[Order] = o1.[Order] AND o2.[Group] = o1.[Group] + 1
WHERE o2.[Order] IS NULL
GROUP BY o1.[Order]
HAVING COUNT(*) > 1

So your table is in the form of
Order Group Sequence
1 1 4
1 1 5
1 1 7
..and you want to find out that 1,1,6 is missing?
With
select
min(Sequence) MinSequence,
max(Seqence) MaxSequence
from
Orders
group by
[Order],
[Group]
you can find out the bounds for a given Order and Group.
Now you can simulate the correct data by using a special numbers table, which just contains every single number you could ever use for a sequence. Here is a good example of such a numbers table. It's not important how you create it, you could also create an excel file with all the numbers from x to y and import that excel sheet.
In my example I assume such a numbers table called "Numbers" with only one column "n":
select
[Order],
[Group],
n Sequence
from
(select min(Sequence) MinSequence, max(Seqence) MaxSequence from [Table] group by [Order], [Group]) MinMaxSequence
left join Numbers on n >= MinSequence and n <= MaxSequence
Put that SQL into a new view. In my example I will call the view "vwCorrectOrders".
This gives you the data where the sequences are continuous. Now you can join that data with the original data to find out which sequences are missing:
select
correctOrders.*
from
vwCorrectOrders co
left join Orders o on
co.[Order] = o.[Order]
and co.[Group] = o.[Group]
and co.Sequence = o.Sequence
where
o.Sequence is null
Should give you
Order Group Sequence
1 1 6

After a while I came up with the following solution. It seems to work but it is highly inefficient. Please add any improvement suggestions.
SELECT OrdMain.Order
FROM ((Orders AS OrdMain
LEFT OUTER JOIN Orders AS OrdPrev ON (OrdPrev.Group = OrdMain.Group) AND (OrdPrev.Sequence = OrdMain.Sequence - 1))
LEFT OUTER JOIN Orders AS OrdNext ON (OrdNext.Group = OrdMain.Group) AND (OrdNext.Sequence = OrdMain.Sequence + 1))
WHERE ((OrdMain.Sequence < (SELECT MAX(Sequence) FROM Orders OrdMax WHERE (OrdMax.Group = OrdMain.Group))) AND (OrdNext.Order IS NULL)) OR
((OrdMain.Sequence > (SELECT MIN(Sequence) FROM Orders OrdMin WHERE (OrdMin.Group = OrdMain.Group))) AND (OrdPrev.Order IS NULL))

Related

avoiding group by for column used in datediff?

As the database is currently constructed, I can only use a Date Field of a certain table in a datediff-function that is also part of a count aggregation (not the date field, but that entity where that date field is not null. The group by in the end messes up the counting, since the one entry is counted on it's own / as it's own group.
In some detail:
Our lead recruiter want's a report that shows the sum of applications, and conducted interviews per opening. So far no problem. Additionally he likes to see the total duration per opening from making it public to signing a new employee per opening and of cause only if the opening could already be filled.
I have 4 tables to join:
table 1 holds the data of the opening
table 2 has the single applications
table 3 has the interview data of the applications
table 4 has the data regarding the publication of the openings (with the date when a certain opening was made public)
The problem is the duration requirement. table 4 holds the starting point and in table 2 one (or none) applicant per opening has a date field filled with the time he returned a signed contract and therefor the opening counts as filled. When I use that field in a datediff I'm forced to also put that column in the group by clause and that results in 2 row per opening. 1 row has all the numbers as wanted and in the second row there is always that one person who has a entry in that date field...
So far I haven't come far in thinking of a way of avoiding that problem except for explanining to the colleague that he get's his time-to-fill number in another report.
SELECT
table1.col1 as NameOfProject,
table1.col2 as Company,
table1.col3 as OpeningType,
table1.col4 as ReasonForOpening,
count (table2.col2) as NumberOfApplications,
sum (case when table2.colSTATUS = 'withdrawn' then 1 else 0 end) as mberOfApplicantsWhoWithdraw,
sum (case when table3.colTypeInterview = 'PhoneInterview' then 1 else 0 end) as NumberOfPhoneInterview,
...more sum columns...,
table1.finished, // shows „1“ if opening is occupied
DATEDIFF(day, table4.colValidFrom, **table2.colContractReceived**) as DaysToCompletion
FROM
table2 left join table3 on table2.REF_NR = table3.REF_NR
join table1 on table2.PROJEKT = table1.KBEZ
left join table4 on table1.REFNR = table4.PRJ_REFNR
GROUP BY
**table2.colContractReceived**
and all other columns except the ones in aggregate (sum and count) functions go in the GROUP BY section
ORDER BY table1.NameOfProject
Here is a short rebuild of what it looks like. First a row where the opening is not filled and all aggregations come out in one row as wanted. The next project/opening shows up double, because the field used in the datediff is grouped independently...
project company; no_of_applications; no_of_phoneinterview; no_of_personalinterview; ... ; time_to_fill_in_days; filled?
2018_312 comp a 27 4 2 null 0
2018_313 comp b 54 7 4 null 0
2018_313 comp b 1 1 1 42 1
I'd be glad to get any idea how to solve this. Thanks for considering my request!
(During the 'translation' of all the specific column and table names I might have build in a syntax error here and there but the query worked well ecxept for that unwanted extra aggregation per filled opening)
If I've understood your requirement properly, I believe the issue you are having is that you need to show the date between the starting point and the time at which an applicant responded to an opening, however this must only show a single row based on whether or not the position was filled (if the position was filled, then show that row, if not then show that row).
I've achieved this result by assuming that you count a position as filled using the "ContractsRecevied" column. This may be wrong however the principle should still provide what you are looking for.
I've essentially wrapped your query in to a subquery, performed a rank ordering by the contractsfilled column descending and partitioned by the project. Then in the outer query I filter for the first instance of this ranking.
Even if my assumption about the column structure and data types is wrong, this should provide you with a model to work with.
The only issue you might have with this ranking solution is if you want to aggregate over both rows within one (so include all of the summed columns for both the position filled and position not filled row per project). If this is the case let me know and we can work around that.
Please let me know if you have any questions.
declare #table1 table (
REFNR int,
NameOfProject nvarchar(20),
Company nvarchar(20),
OpeningType nvarchar(20),
ReasonForOpening nvarchar(20),
KBEZ int
);
declare #table2 table (
NumberOfApplications int,
Status nvarchar(15),
REF_NR int,
ReturnedApplicationDate datetime,
ContractsReceived bit,
PROJEKT int
);
declare #table3 table (
TypeInterview nvarchar(25),
REF_NR int
);
declare #table4 table (
PRJ_REFNR int,
StartingPoint datetime
);
insert into #table1 (REFNR, NameOfProject, Company, OpeningType, ReasonForOpening, KBEZ)
values (1, '2018_312', 'comp a' ,'Permanent', 'Business growth', 1),
(2, '2018_313', 'comp a', 'Permanent', 'Business growth', 2),
(3, '2018_313', 'comp a', 'Permanent', 'Business growth', 3);
insert into #table2 (NumberOfApplications, Status, REF_NR, ReturnedApplicationDate, ContractsReceived, PROJEKT)
values (27, 'Processed', 4, '2018-04-01 08:00', 0, 1),
(54, 'Withdrawn', 5, '2018-04-02 10:12', 0, 2),
(1, 'Processed', 6, '2018-04-15 15:00', 1, 3);
insert into #table3 (TypeInterview, REF_NR)
values ('Phone', 4),
('Phone', 5),
('Personal', 6);
insert into #table4 (PRJ_REFNR, StartingPoint)
values (1, '2018-02-25 08:00'),
(2, '2018-03-04 15:00'),
(3, '2018-03-04 15:00');
select * from
(
SELECT
RANK()OVER(Partition by NameOfProject, Company order by ContractsReceived desc) as rowno,
table1. NameOfProject,
table1.Company,
table1.OpeningType,
table1.ReasonForOpening,
case when ContractsReceived >0 then datediff(DAY, StartingPoint, ReturnedApplicationDate) else null end as TimeToFillInDays,
ContractsReceived Filled
FROM
#table2 table2 left join #table3 table3 on table2.REF_NR = table3.REF_NR
join #table1 table1 on table2.PROJEKT = table1.KBEZ
left join #table4 table4 on table1.REFNR = table4.PRJ_REFNR
group by NameOfProject, Company, OpeningType, ReasonForOpening, ContractsReceived,
StartingPoint, ReturnedApplicationDate
) x where rowno=1

How can I get a random number generated in a CTE not to change in JOIN?

The problem
I'm generating a random number for each row in a table #Table_1 in a CTE, using this technique. I'm then joining the results of the CTE on another table, #Table_2. Instead of getting a random number for each row in #Table_1, I'm getting a new random number for every resulting row in the join!
CREATE TABLE #Table_1 (Id INT)
CREATE TABLE #Table_2 (MyId INT, ParentId INT)
INSERT INTO #Table_1
VALUES (1), (2), (3)
INSERT INTO #Table_2
VALUES (1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (2, 2), (3, 2), (1, 3)
;WITH RandomCTE AS
(
SELECT Id, (ABS(CHECKSUM(NewId())) % 5)RandomNumber
FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
ON r.Id = t.ParentId
The results
Id MyId RandomNumber
----------- ----------- ------------
1 1 1
1 2 2
1 3 0
1 4 3
2 1 4
2 2 0
2 3 0
3 1 3
The desired results
Id MyId RandomNumber
----------- ----------- ------------
1 1 1
1 2 1
1 3 1
1 4 1
2 1 4
2 2 4
2 3 4
3 1 3
What I tried
I tried to obscure the logic of the random number generation from the optimizer by casting the random number to VARCHAR, but that did not work.
What I don't want to do
I'd like to avoid using a temporary table to store the results of the CTE.
How can I generate a random number for a table and preserve that random number in a join without using temporary storage?
This seems to do the trick:
WITH CTE AS(
SELECT Id, (ABS(CHECKSUM(NewId())) % 5)RandomNumber
FROM #Table_1),
RandomCTE AS(
SELECT Id,
RandomNumber
FROM CTE
GROUP BY ID, RandomNumber)
SELECT *
FROM RandomCTE r
INNER JOIN #Table_2 t
ON r.Id = t.ParentId;
It looks like SQL Server is aware that, at the point of being outside the CTE, that RandomNumber is effectively just NEWID() with some additional functions wrapped around it (DB<>Fiddle), and hence it still generates a unique ID for each row. The GROUP BY clause in the second CTE therefore forces the data engine to define RandomNumber a value so it can perform the GROUP BY.
Per the quote in this answer
The optimizer does not guarantee timing or number of executions of
scalar functions. This is a long-estabilished tenet. It's the
fundamental 'leeway' tha allows the optimizer enough freedom to gain
significant improvements in query-plan execution.
If it is important for your application that the random number be evaluated once and only once you should calculate it up front and store it into a temp table.
Anything else is not guaranteed and so is irresponsible to add into your application's code base - as even if it works now it may break as a result of a schema change/execution plan change/version upgrade/CU install.
For example Lamu's answer breaks if a unique index is added to #Table_1 (Id)
How about not using a real random number at all? Use rand() with a seed:
WITH RandomCTE AS (
SELECT Id,
CONVERT(INT, RAND(ROW_NUMBER() OVER (ORDER BY NEWID()) * 999999) * 5) as RandomNumber
FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE rINNER JOIN
#Table_2 t
ON r.Id = t.ParentId;
The seed argument to rand() is pretty awful. Values of the seed near each other produce similar initial values, which is the reason for the multiplication.
Here is the db<>fiddle.

sql join using recursive cte

Edit: Added another case scenario in the notes and updated the sample attachment.
I am trying to write a sql to get an output attached with this question along with sample data.
There are two table, one with distinct ID's (pk) with their current flag.
another with Active ID (fk to the pk from the first table) and Inactive ID (fk to the pk from the first table)
Final output should return two columns, first column consist of all distinct ID's from the first table and second column should contain Active ID from the 2nd table.
Below is the sql:
IF OBJECT_ID('tempdb..#main') IS NOT NULL DROP TABLE #main;
IF OBJECT_ID('tempdb..#merges') IS NOT NULL DROP TABLE #merges
IF OBJECT_ID('tempdb..#final') IS NOT NULL DROP TABLE #final
SELECT DISTINCT id,
current
INTO #main
FROM tb_ID t1
--get list of all active_id and inactive_id
SELECT DISTINCT active_id,
inactive_id,
Update_dt
INTO #merges
FROM tb_merges
-- Combine where the id from the main table matched to the inactive_id (should return all the rows from #main)
SELECT id,
active_id AS merged_to_id
INTO #final
FROM (SELECT t1.*,
t2.active_id,
Update_dt ,
Row_number()
OVER (
partition BY id, active_id
ORDER BY Update_dt DESC) AS rn
FROM #main t1
LEFT JOIN #merges t2
ON t1.id = t2.inactive_id) t3
WHERE rn = 1
SELECT *
FROM #final
This sql partially works. It doesn't work, where the id was once active then gets inactive.
Please note:
the active ID should return the last most active ID
the ID which doesn't have any active ID should either be null or the ID itself
ID where the current = 0, in those cases active ID should be the ID current in tb_ID
ID's may get interchanged. For example there are two ID's 6 and 7, when 6 is active 7 is inactive and vice versa. the only way to know the most current active state is by the update date
Attached sample might be easy to understand
Looks like I might have to use recursive cte for achieiving the results. Can someone please help?
thank you for your time!
I think you're correct that a recursive CTE looks like a good solution for this. I'm not entirely certain that I've understood exactly what you're asking for, particularly with regard to the update_dt column, just because the data is a little abstract as-is, but I've taken a stab at it, and it does seem to work with your sample data. The comments explain what's going on.
declare #tb_id table (id bigint, [current] bit);
declare #tb_merges table (active_id bigint, inactive_id bigint, update_dt datetime2);
insert #tb_id values
-- Sample data from the question.
(1, 1),
(2, 1),
(3, 1),
(4, 1),
(5, 0),
-- A few additional data to illustrate a deeper search.
(6, 1),
(7, 1),
(8, 1),
(9, 1),
(10, 1);
insert #tb_merges values
-- Sample data from the question.
(3, 1, '2017-01-11T13:09:00'),
(1, 2, '2017-01-11T13:07:00'),
(5, 4, '2013-12-31T14:37:00'),
(4, 5, '2013-01-18T15:43:00'),
-- A few additional data to illustrate a deeper search.
(6, 7, getdate()),
(7, 8, getdate()),
(8, 9, getdate()),
(9, 10, getdate());
if object_id('tempdb..#ValidMerge') is not null
drop table #ValidMerge;
-- Get the subset of merge records whose active_id identifies a "current" id and
-- rank by date so we can consider only the latest merge record for each active_id.
with ValidMergeCTE as
(
select
M.active_id,
M.inactive_id,
[Priority] = row_number() over (partition by M.active_id order by M.update_dt desc)
from
#tb_merges M
inner join #tb_id I on M.active_id = I.id
where
I.[current] = 1
)
select
active_id,
inactive_id
into
#ValidMerge
from
ValidMergeCTE
where
[Priority] = 1;
-- Here's the recursive CTE, which draws on the subset of merges identified above.
with SearchCTE as
(
-- Base case: any record whose active_id is not used as an inactive_id is an endpoint.
select
M.active_id,
M.inactive_id,
Depth = 0
from
#ValidMerge M
where
not exists (select 1 from #ValidMerge M2 where M.active_id = M2.inactive_id)
-- Recursive case: look for records whose active_id matches the inactive_id of a previously
-- identified record.
union all
select
S.active_id,
M.inactive_id,
Depth = S.Depth + 1
from
#ValidMerge M
inner join SearchCTE S on M.active_id = S.inactive_id
)
select
I.id,
S.active_id
from
#tb_id I
left join SearchCTE S on I.id = S.inactive_id;
Results:
id active_id
------------------
1 3
2 3
3 NULL
4 NULL
5 4
6 NULL
7 6
8 6
9 6
10 6

SQL Server Sum a specific number of rows based on another column

Here are the important columns in my table
ItemId RowID CalculatedNum
1 1 3
1 2 0
1 3 5
1 4 25
1 5 0
1 6 8
1 7 14
1 8 2
.....
The rowID increments to 141 before the ItemID increments to 2. This cycle repeats for about 122 million rows.
I need to SUM the CalculatedNum field in groups of 6. So sum 1-6, then 7-12, etc. I know I end up with an odd number at the end. I can discard the last three rows (numbers 139, 140 and 141). I need it to start the SUM cycle again when I get to the next ItemID.
I know I need to group by the ItemID but I am having trouble trying to figure out how to get SQL to SUM just 6 CalculatedNum's at a time. Everything else I have come across SUMs based on a column where the values are the same.
I did find something on Microsoft's site that used the ROW_NUMBER function but I couldn't quite make sense of it. Please let me know if this question is not clear.
Thank you
You need to group by (RowId - 1) / 6 and ItemId. Like this:
drop table if exists dbo.Items;
create table dbo.Items (
ItemId int
, RowId int
, CalculatedNum int
);
insert into dbo.Items (ItemId, RowId, CalculatedNum)
values (1, 1, 3), (1, 2, 0), (1, 3, 5), (1, 4, 25)
, (1, 5, 0), (1, 6, 8), (1, 7, 14), (1, 8, 2);
select
tt.ItemId
, sum(tt.CalculatedNum) as CalcSum
from (
select
*
, (t.RowId - 1) / 6 as Grp
from dbo.Items t
) tt
group by tt.ItemId, tt.Grp
You could use integer division and group by.
SELECT ItemId, (RowId-1)/6 as Batch, sum(CalculatedNum)
FROM your_table GROUP BY ItemId, Batch
To discard incomplete batches:
SELECT ItemId, (RowId-1)/6 as Batch, sum(CalculatedNum), count(*) as Cnt
FROM your_table GROUP BY ItemId, Batch HAVING Cnt = 6
EDIT: Fix an off by one error.
To ensure you're querying 6 rows at a time you can try to use the modulo function : https://technet.microsoft.com/fr-fr/library/ms173482(v=sql.110).aspx
Hope this can help.
Thanks everyone. This was really helpful.
Here is what we ended up with.
SELECT ItemID, MIN(RowID) AS StartingRow, SUM(CalculatedNum)
FROM dbo.table
GROUP BY ItemID, (RowID - 1) / 6
ORDER BY ItemID, StartingRow
I am not sure why it did not like the integer division in the select statement but I checked the results against a sample of the data and the math is correct.

MSSQL ORDER BY Passed List

I am using Lucene to perform queries on a subset of SQL data which returns me a scored list of RecordIDs, e.g. 11,4,5,25,30 .
I want to use this list to retrieve a set of results from the full SQL Table by RecordIDs.
So SELECT * FROM MyFullRecord
where RecordID in (11,5,3,25,30)
I would like the retrieved list to maintain the scored order.
I can do it by using an Order by like so;
ORDER BY (CASE WHEN RecordID = 11 THEN 0
WHEN RecordID = 5 THEN 1
WHEN RecordID = 3 THEN 2
WHEN RecordID = 25 THEN 3
WHEN RecordID = 30 THEN 4
END)
I am concerned with the loading of the server loading especially if I am passing long lists of RecordIDs. Does anyone have experience of this or how can I determine an optimum list length.
Are there any other ways to achieve this functionality in MSSQL?
Roger
You can record your list into a table or table variable with sorting priorities.
And then join your table with this sorting one.
DECLARE TABLE #tSortOrder (RecordID INT, SortOrder INT)
INSERT INTO #tSortOrder (RecordID, SortOrder)
SELECT 11, 1 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 3, 3 UNION ALL
SELECT 25, 4 UNION ALL
SELECT 30, 5
SELECT *
FROM yourTable T
LEFT JOIN #tSortOrder S ON T.RecordID = S.RecordID
ORDER BY S.SortOrder
Instead of creating a searched order by statement, you could create an in memory table to join. It's easier on the eyes and definitely scales better.
SQL Statement
SELECT mfr.*
FROM MyFullRecord mfr
INNER JOIN (
SELECT *
FROM (VALUES (1, 11),
(2, 5),
(3, 3),
(4, 25),
(5, 30)
) q(ID, RecordID)
) q ON q.RecordID = mfr.RecordID
ORDER BY
q.ID
Look here for a fiddle
Something like:
SELECT * FROM MyFullRecord where RecordID in (11,5,3,25,30)
ORDER BY
CHARINDEX(','+CAST(RecordID AS varchar)+',',
','+'11,5,3,25,30'+',')
SQLFiddle demo