Can this ORDER BY on a CASE clause be made faster?

Can this ORDER BY on a CASE clause be made faster? - sql

I'm selecting results from a table of ~350 million records, and it's running extremely slowly - around 10 minutes. The culprit seems to be the ORDER BY, as if I remove it the query only takes a moment. Here's the gist:
SELECT TOP 100
(columns snipped)
FROM (
SELECT
CASE WHEN (e2.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
ORDER BY p1.RecordExists
Basically, I'm ordering the results by whether Files have a corresponding Record, as those without need to be handled first. I could run two queries with WHERE clauses, but I'd rather do it in a single query if possible.
Is there any way to speed this up?

The ultimate issue is that the use of CASE in the sub-query introduces an ORDER BY over something that is not being used in a sargable manner. Thus the entire intermediate result-set must first be ordered to find the TOP 100 - this is all 350+ million records!2
In this particular case, moving the CASE to the outside SELECT and use a DESC ordering (to put NULL values, which means "0" in the current RecordExists, first) should do the trick1. It's not a generic approach, though .. but the ordering should be much, much faster iff Files.ID is indexed. (If the query is still slow, consult the query plan to find out why ORDER BY is not using an index.)
Another alternative might be to include a persisted computed column for RecordExists (that is also indexed) that can be used as an index in the ORDER BY.
Once again, the idea is that the ORDER BY works over something sargable, which only requires reading sequentially inside the index (up to the desired number of records to match the outside limit) and not ordering 350+ million records on-the-fly :)
SQL Server is then able to push this ordering (and limit) down into the sub-query, instead of waiting for the intermediate result-set of the sub-query to come up. Look at the query plan differences based on what is being ordered.
1 Example:
SELECT TOP 100
-- If needed
CASE WHEN (p1.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM (
SELECT
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
-- Hopefully ID is indexed, DESC makes NULLs (!RecordExists) go first
ORDER BY p1.ID DESC
2 Actually, it seems like it could hypothetically just stop after the first 100 0's without a full-sort .. at least under some extreme query planner optimization under a closed function range, but that depends on when the 0's are encountered in the intermediate result set (in the first few thousand or not until the hundreds of millions or never?). I highly doubt SQL Server accounts for this extreme case anyway; that is, don't count on this still non-sargable behavior.

Give this form a try
SELECT TOP(100) *
FROM (
SELECT TOP(100)
0 AS RecordExists
--,(columns snipped)
FROM dbo.Files AS e1
WHERE NOT EXISTS (SELECT * FROM dbo.Records e2 WHERE e1.FID = e2.FID)
ORDER BY SecondaryOrderColumn
) X
UNION ALL
SELECT * FROM (
SELECT TOP(100)
1 AS RecordExists
--,(columns snipped)
FROM dbo.Files AS e1
INNER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
ORDER BY SecondaryOrderColumn
) X
ORDER BY SecondaryOrderColumn
Key indexes:
Records (FID)
Files (FID, SecondaryOrdercolumn)

Well the reason it is much slower is because it is really a very different query without the order by clause.
With the order by clause:
Find all matching records out of the entire 350 million rows. Then sort them.
Without the order by clause:
Find the first 100 matching records. Stop.

Q: If you say the only difference is "with/outout" the "order by", then could you somehow move the "top 100" into the inner select?
EXAMPLE:
SELECT
(columns snipped)
FROM (
SELECT TOP 100
CASE WHEN (e2.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
ORDER BY p1.RecordExists

In SQL Server, null values collate lower than any value in the domain. Given these two tables:
create table dbo.foo
(
id int not null identity(1,1) primary key clustered ,
name varchar(32) not null unique nonclustered ,
)
insert dbo.foo ( name ) values ( 'alpha' )
insert dbo.foo ( name ) values ( 'bravo' )
insert dbo.foo ( name ) values ( 'charlie' )
insert dbo.foo ( name ) values ( 'delta' )
insert dbo.foo ( name ) values ( 'echo' )
insert dbo.foo ( name ) values ( 'foxtrot' )
go
create table dbo.bar
(
id int not null identity(1,1) primary key clustered ,
foo_id int null foreign key references dbo.foo(id) ,
name varchar(32) not null unique nonclustered ,
)
go
insert dbo.bar( foo_id , name ) values( 1 , 'golf' )
insert dbo.bar( foo_id , name ) values( 5 , 'hotel' )
insert dbo.bar( foo_id , name ) values( 3 , 'india' )
insert dbo.bar( foo_id , name ) values( 5 , 'juliet' )
insert dbo.bar( foo_id , name ) values( 6 , 'kilo' )
go
The query
select *
from dbo.foo foo
left join dbo.bar bar on bar.foo_id = foo.id
order by bar.foo_id, foo.id
yields the following result set:
id name id foo_id name
-- ------- ---- ------ -------
2 bravo NULL NULL NULL
4 delta NULL NULL NULL
1 alpha 1 1 golf
3 charlie 3 3 india
5 echo 2 5 hotel
5 echo 4 5 juliet
6 foxtrot 5 6 kilo
(7 row(s) affected)
This should allow the query optimizer to use a suitable index (if such exists); however, it does not guarantee than any such index would be used.

Can you try this?
SELECT TOP 100
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
ORDER BY e2.ID ASC
This should give you where e2.ID is null first. Also, make sure Records.ID is indexed. This should give you the ordering you were wanting.

Related

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c', doesn't matter which since both contain Values 1,2,3. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ Million records...you can imagine it would be very slow (I was later informed of the number of records, the sample data I was given was much smaller)...:
DECLARE #values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO #values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM #values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
SELECT t1.GroupID, t1.SubSet
, NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM #values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
FROM #values t1
GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.

I'd go with something like this:
;with cte as
(
select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
from #values v
group by v.GroupID, v.SubSet
)
delete v
from #values v
join
(
select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
from cte c1
join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from #values

From Checksum_Agg:
The CHECKSUM_AGG result does not depend on the order of the rows in
the table.
This is because it is a sum of the values: 1 + 2 + 3 = 3 + 2 + 1 = 3 + 3 = 6.
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE #values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO #values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM #values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
select distinct GroupId, Subset
from #Values ),
GroupConcatenatedValues as (
select GroupId, Subset, Convert( VarBinary(256), (
select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
from #Values as V
where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
order by Value
for XML Path('') ), 2 ) as GroupedBinary
from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select :
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;

You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
select groupId,
subset,
cs = checksum_agg(value)
from #values v
group by groupId,
subset
),
maybeBadSubsets as (
select groupId,
subset,
cs,
deleteSubset =
case
when row_number() over (
partition by groupId, cs
order by subset
) > 1
then 1
end
from getChecksums
)
delete v
from #values v
where exists (
select 0
from maybeBadSubsets mbs
where v.groupId = mbs.groupId
and v.SubSet = mbs.subset
and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTE's can have a quirk performance-wise. If you find that the query engine is running 'maybeBadSubsets' for each row of #values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far at that goes.
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.

Needing system defined function to select updated or unmatched new records from two tables

I am having a live data table in which the old values are placed,in a new table i am moving data from that live table to this one how to find updated or new records that are inserted or updated in new table with out using except,checksum(binary_checksum) and join ,i am looking for a solution using System Defined Function.

The requirement is interesting as the best solutions are to use EXCEPT or a FULL JOIN. What you are trying to do is what is referred to as an left anti semi join. Here's a good article about the topic.
Note this sample data and the solutions (note that my solution that does not use EXCEPT or a join is the last solution):
-- sample data
if object_id('tempdb.dbo.orig') is not null drop table dbo.orig;
if object_id('tempdb.dbo.new') is not null drop table dbo.new;
create table dbo.orig (someid int, col1 int, constraint uq_cl_orig unique (someid, col1));
create table dbo.new (someid int, col1 int, constraint uq_cl_new unique (someid, col1));
insert dbo.orig values (1,100),(2,110),(3,120),(4,2000)
insert dbo.new values (1,100),(2,110),(3,122),(5,999);
Here's the EXCEPT version
select someid
from
(
select * from dbo.new except
select * from dbo.orig
) n
union -- union "distict"
select someid
from
(
select * from dbo.orig except
select * from dbo.new
) o;
Here's a FULL JOIN Solution which will also tell you if the record was removed, changed or added:
select
someid = isnull(n.someid, o.someid),
[status] =
case
when count(isnull(n.someid, o.someid)) > 1 then 'changed'
when max(n.col1) is null then 'removed' else 'added'
end
from dbo.new n
full join dbo.orig o
on n.col1=o.col1 and n.someid = o.someid
where n.col1 is null or o.col1 is null
group by isnull(n.someid, o.someid);
But, because those efficient solutions are not an option - you will need to go with a NOT IN or NOT EXISTS subquery.... And because it has to be a function, I am encapsulating the logic into a function.
create function dbo.newOrChangedOrRemoved()
returns table as return
-- get the new records
select someid, [status] = 'new'
from dbo.new n
where n.someid not in (select someid from dbo.orig)
union all
-- get the removed records
select someid, 'removed'
from dbo.orig o
where o.someid not in (select someid from dbo.new)
union all
-- get the changed records
select someid, 'changed'
from dbo.orig o
where exists
(
select *
from dbo.new n
where o.someid = n.someid and o.col1 <> n.col1
);
Results:
someid status
----------- -------
5 new
4 removed
3 changed

More efficient way of doing multiple joins to the same table and a "case when" in the select

At my organization clients can be enrolled in multiple programs at one time. I have a table with a list of all of the programs a client has been enrolled as unique rows in and the dates they were enrolled in that program.
Using an External join I can take any client name and a date from a table (say a table of tests that the clients have completed) and have it return all of the programs that client was in on that particular date. If a client was in multiple programs on that date it duplicates the data from that table for each program they were in on that date.
The problem I have is that I am looking for it to only return one program as their "Primary Program" for each client and date even if they were in multiple programs on that date. I have created a hierarchy for which program should be selected as their primary program and returned.
For Example:
1.)Inpatient
2.)Outpatient Clinical
3.)Outpatient Vocational
4.)Outpatient Recreational
So if a client was enrolled in Outpatient Clinical, Outpatient Vocational, Outpatient Recreational at the same time on that date it would only return "Outpatient Clinical" as the program.
My way of thinking for doing this would be to join to the table with the previous programs multiple times like this:
FROM dbo.TestTable as TestTable
LEFT OUTER JOIN dbo.PreviousPrograms as PreviousPrograms1
ON TestTable.date = PreviousPrograms1.date AND PreviousPrograms1.type = 'Inpatient'
LEFT OUTER JOIN dbo.PreviousPrograms as PreviousPrograms2
ON TestTable.date = PreviousPrograms2.date AND PreviousPrograms2.type = 'Outpatient Clinical'
LEFT OUTER JOIN dbo.PreviousPrograms as PreviousPrograms3
ON TestTable.date = PreviousPrograms3.date AND PreviousPrograms3.type = 'Outpatient Vocational'
LEFT OUTER JOIN dbo.PreviousPrograms as PreviousPrograms4
ON TestTable.date = PreviousPrograms4.date AND PreviousPrograms4.type = 'Outpatient Recreational'
and then do a condition CASE WHEN in the SELECT statement as such:
SELECT
CASE
WHEN PreviousPrograms1.name IS NOT NULL
THEN PreviousPrograms1.name
WHEN PreviousPrograms1.name IS NULL AND PreviousPrograms2.name IS NOT NULL
THEN PreviousPrograms2.name
WHEN PreviousPrograms1.name IS NULL AND PreviousPrograms2.name IS NULL AND PreviousPrograms3.name IS NOT NULL
THEN PreviousPrograms3.name
WHEN PreviousPrograms1.name IS NULL AND PreviousPrograms2.name IS NULL AND PreviousPrograms3.name IS NOT NULL AND PreviousPrograms4.name IS NOT NULL
THEN PreviousPrograms4.name
ELSE NULL
END as PrimaryProgram
The bigger problem is that in my actual table there are a lot more than just four possible programs it could be and the CASE WHEN select statement and the JOINs are already cumbersome enough.
Is there a more efficient way to do either the SELECTs part or the JOIN part? Or possibly a better way to do it all together?
I'm using SQL Server 2008.

You can simplify (replace) your CASE by using COALESCE() instead:
SELECT
COALESCE(PreviousPrograms1.name, PreviousPrograms2.name,
PreviousPrograms3.name, PreviousPrograms4.name) AS PreviousProgram
COALESCE() returns the first non-null value.
Due to your design, you still need the JOINs, but it would be much easier to read if you used very short aliases, for example PP1 instead of PreviousPrograms1 - it's just a lot less code noise.

You can simplify the Join by using a bridge table containing all the program types and their priority (my sql server syntax is a bit rusty):
create table BridgeTable (
programType varchar(30),
programPriority smallint
);
This table will hold all the program types and the program priority will reflect the priority you've specified in your question.
As for the part of the case, that will depend on the number of records involved. One of the tricks that I usually do is this (assuming programPriority is a number between 10 and 99 and no type can have more than 30 bytes, because I'm being lazy):
Select patient, date,
substr( min(cast(BridgeTable.programPriority as varchar) || PreviousPrograms.type), 3, 30)
From dbo.TestTable as TestTable
Inner Join dbo.BridgeTable as BridgeTable
Left Outer Join dbo.PreviousPrograms as PreviousPrograms
on PreviousPrograms.type = BridgeTable.programType
and TestTable.date = PreviousPrograms.date
Group by patient, date

You can achieve this using sub-queries, or you could refactor it to use CTEs, take a look at the following and see if it makes sense:
DECLARE #testTable TABLE
(
[id] INT IDENTITY(1, 1),
[date] datetime
)
DECLARE #previousPrograms TABLE
(
[id] INT IDENTITY(1,1),
[date] datetime,
[type] varchar(50)
)
INSERT INTO #testTable ([date])
SELECT '2013-08-08'
UNION ALL SELECT '2013-08-07'
UNION ALL SELECT '2013-08-06'
INSERT INTO #previousPrograms ([date], [type])
-- a sample user as an inpatient
SELECT '2013-08-08', 'Inpatient'
-- your use case of someone being enrolled in all 3 outpation programs
UNION ALL SELECT '2013-08-07', 'Outpatient Recreational'
UNION ALL SELECT '2013-08-07', 'Outpatient Clinical'
UNION ALL SELECT '2013-08-07', 'Outpatient Vocational'
-- showing our workings, this is what we'll join to
SELECT
PPP.[date],
PPP.[type],
ROW_NUMBER() OVER (PARTITION BY PPP.[date] ORDER BY PPP.[Priority]) AS [RowNumber]
FROM (
SELECT
[type],
[date],
CASE
WHEN [type] = 'Inpatient' THEN 1
WHEN [type] = 'Outpatient Clinical' THEN 2
WHEN [type] = 'Outpatient Vocational' THEN 3
WHEN [type] = 'Outpatient Recreational' THEN 4
ELSE 999
END AS [Priority]
FROM #previousPrograms
) PPP -- Previous Programs w/ Priority
SELECT
T.[date],
PPPO.[type]
FROM #testTable T
LEFT JOIN (
SELECT
PPP.[date],
PPP.[type],
ROW_NUMBER() OVER (PARTITION BY PPP.[date] ORDER BY PPP.[Priority]) AS [RowNumber]
FROM (
SELECT
[type],
[date],
CASE
WHEN [type] = 'Inpatient' THEN 1
WHEN [type] = 'Outpatient Clinical' THEN 2
WHEN [type] = 'Outpatient Vocational' THEN 3
WHEN [type] = 'Outpatient Recreational' THEN 4
ELSE 999
END AS [Priority]
FROM #previousPrograms
) PPP -- Previous Programs w/ Priority
) PPPO -- Previous Programs w/ Priority + Order
ON T.[date] = PPPO.[date] AND PPPO.[RowNumber] = 1
Basically we have our deepest sub-select giving all PreviousPrograms a priority based on type, then our wrapping sub-select gives them row numbers per date so we can select only the ones with a row number of 1.
I am guessing you would need to include a UR Number or some other patient identifier, simply add that as an output to both sub-selects and change the join.

Tricky MS Access SQL query to remove surplus duplicate records

I have an Access table of the form (I'm simplifying it a bit)
ID AutoNumber Primary Key
SchemeName Text (50)
SchemeNumber Text (15)
This contains some data eg...
ID SchemeName SchemeNumber
--------------------------------------------------------------------
714 Malcolm ABC123
80 Malcolm ABC123
96 Malcolms Scheme ABC123
101 Malcolms Scheme ABC123
98 Malcolms Scheme DEF888
654 Another Scheme BAR876
543 Whatever Scheme KJL111
etc...
Now. I want to remove duplicate names under the same SchemeNumber. But I want to leave the record which has the longest SchemeName for that scheme number. If there are duplicate records with the same longest length then I just want to leave only one, say, the lowest ID (but any one will do really). From the above example I would want to delete IDs 714, 80 and 101 (to leave only 96).
I thought this would be relatively easy to achieve but it's turning into a bit of a nightmare! Thanks for any suggestions. I know I could loop it programatically but I'd rather have a single DELETE query.

See if this query returns the rows you want to keep:
SELECT r.SchemeNumber, r.SchemeName, Min(r.ID) AS MinOfID
FROM
(SELECT
SchemeNumber,
SchemeName,
Len(SchemeName) AS name_length,
ID
FROM tblSchemes
) AS r
INNER JOIN
(SELECT
SchemeNumber,
Max(Len(SchemeName)) AS name_length
FROM tblSchemes
GROUP BY SchemeNumber
) AS w
ON
(r.SchemeNumber = w.SchemeNumber)
AND (r.name_length = w.name_length)
GROUP BY r.SchemeNumber, r.SchemeName
ORDER BY r.SchemeName;
If so, save it as qrySchemes2Keep. Then create a DELETE query to discard rows from tblSchemes whose ID value is not found in qrySchemes2Keep.
DELETE
FROM tblSchemes AS s
WHERE Not Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID);
Just beware, if you later use Access' query designer to make changes to that DELETE query, it may "helpfully" convert the SQL to something like this:
DELETE s.*, Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID)
FROM tblSchemes AS s
WHERE (((Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID))=False));

DELETE FROM Table t1
WHERE EXISTS (SELECT 1 from Table t2
WHERE t1.SchemeNumber = t2.SchemeNumber
AND Length(t2.SchemeName) > Length(t1.SchemeName)
)
Depend on your RDBMS you may use function different from Length (Oracle - length, mysql - length, sql server - LEN)

delete ShortScheme
from Scheme ShortScheme
join Scheme LongScheme
on ShortScheme.SchemeNumber = LongScheme.SchemeNumber
and (len(ShortScheme.SchemeName) < len(LongScheme.SchemeName) or (len(ShortScheme.SchemeName) = len(LongScheme.SchemeName) and ShortScheme.ID > LongScheme.ID))
(SQL Server flavored)
Now updated to include the specified tie resolution. Although, you may get better performance doing it in two queries: first deleting the schemes with shorter names as in my original query and then going back and deleting the higher ID where there was a tie in name length.

I'd do this in multiple steps. Large delete operations done in a single step make me too nervous -- what if you make a mistake? There's no sql 'undo' statement.
-- Setup the data
DROP Table foo;
DROP Table bar;
DROP Table bat;
DROP Table baz;
CREATE TABLE foo (
id int(11) NOT NULL,
SchemeName varchar(50),
SchemeNumber varchar(15),
PRIMARY KEY (id)
);
insert into foo values (714, 'Malcolm', 'ABC123' );
insert into foo values (80, 'Malcolm', 'ABC123' );
insert into foo values (96, 'Malcolms Scheme', 'ABC123' );
insert into foo values (101, 'Malcolms Scheme', 'ABC123' );
insert into foo values (98, 'Malcolms Scheme', 'DEF888' );
insert into foo values (654, 'Another Scheme ', 'BAR876' );
insert into foo values (543, 'Whatever Scheme ', 'KJL111' );
-- Find all the records that have dups, find the longest one
create table bar as
select max(length(SchemeName)) as max_length, SchemeNumber
from foo
group by SchemeNumber
having count(*) > 1;
-- Find the one we want to keep
create table bat as
select min(a.id) as id, a.SchemeNumber
from foo a join bar b on a.SchemeNumber = b.SchemeNumber
and length(a.SchemeName) = b.max_length
group by SchemeNumber;
-- Select into this table all the rows to delete
create table baz as
select a.id from foo a join bat b where a.SchemeNumber = b.SchemeNumber
and a.id != b.id;
This will give you a new table with only records for rows that you want to remove.
Now check these out and make sure that they contain only the rows you want deleted. This way you can make sure that when you do the delete, you know exactly what to expect. It should also be pretty fast.
Then when you're ready, use this command to delete the rows using this command.
delete from foo where id in (select id from baz);
This seems like more work because of the different tables, but it's safer probably just as fast as the other ways. Plus you can stop at any step and make sure the data is what you want before you do any actual deletes.

If your platform supports ranking functions and common table expressions:
with cte as (
select row_number()
over (partition by SchemeNumber order by len(SchemeName) desc) as rn
from Table)
delete from cte where rn > 1;

try this:
Select * From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or this:,...
Select * From Table t
Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))
if either of these selects the records that should be deleted, just change it to a delete
Delete
From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or using the second construction:
Delete From Table t Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))

SQL Server: row present in one query, missing in another

Ok so I think I must be misunderstanding something about SQL queries. This is a pretty wordy question, so thanks for taking the time to read it (my problem is right at the end, everything else is just context).
I am writing an accounting system that works on the double-entry principal -- money always moves between accounts, a transaction is 2 or more TransactionParts rows decrementing one account and incrementing another.
Some TransactionParts rows may be flagged as tax related so that the system can produce a report of total VAT sales/purchases etc, so it is possible that a single Transaction may have two TransactionParts referencing the same Account -- one VAT related, and the other not. To simplify presentation to the user, I have a view to combine multiple rows for the same account and transaction:
create view Accounting.CondensedEntryView as
select p.[Transaction], p.Account, sum(p.Amount) as Amount
from Accounting.TransactionParts p
group by p.[Transaction], p.Account
I then have a view to calculate the running balance column, as follows:
create view Accounting.TransactionBalanceView as
with cte as
(
select ROW_NUMBER() over (order by t.[Date]) AS RowNumber,
t.ID as [Transaction], p.Amount, p.Account
from Accounting.Transactions t
inner join Accounting.CondensedEntryView p on p.[Transaction]=t.ID
)
select b.RowNumber, b.[Transaction], a.Account,
coalesce(sum(a.Amount), 0) as Balance
from cte a, cte b
where a.RowNumber <= b.RowNumber AND a.Account=b.Account
group by b.RowNumber, b.[Transaction], a.Account
For reasons I haven't yet worked out, a certain transaction (ID=30) doesn't appear on an account statement for the user. I confirmed this by running
select * from Accounting.TransactionBalanceView where [Transaction]=30
This gave me the following result:
RowNumber Transaction Account Balance
-------------------- ----------- ------- ---------------------
72 30 23 143.80
As I said before, there should be at least two TransactionParts for each Transaction, so one of them isn't being presented in my view. I assumed there must be an issue with the way I've written my view, and run a query to see if there's anything else missing:
select [Transaction], count(*)
from Accounting.TransactionBalanceView
group by [Transaction]
having count(*) < 2
This query returns no results -- not even for Transaction 30! Thinking I must be an idiot I run the following query:
select [Transaction]
from Accounting.TransactionBalanceView
where [Transaction]=30
It returns two rows! So select * returns only one row and select [Transaction] returns both. After much head-scratching and re-running the last two queries, I concluded I don't have the faintest idea what's happening. Any ideas?
Thanks a lot if you've stuck with me this far!
Edit:
Here are the execution plans:
select *
select [Transaction]
1000 lines each, hence finding somewhere else to host.
Edit 2:
For completeness, here are the tables I used:
create table Accounting.Accounts
(
ID smallint identity primary key,
[Name] varchar(50) not null
constraint UQ_AccountName unique,
[Type] tinyint not null
constraint FK_AccountType foreign key references Accounting.AccountTypes
);
create table Accounting.Transactions
(
ID int identity primary key,
[Date] date not null default getdate(),
[Description] varchar(50) not null,
Reference varchar(20) not null default '',
Memo varchar(1000) not null
);
create table Accounting.TransactionParts
(
ID int identity primary key,
[Transaction] int not null
constraint FK_TransactionPart foreign key references Accounting.Transactions,
Account smallint not null
constraint FK_TransactionAccount foreign key references Accounting.Accounts,
Amount money not null,
VatRelated bit not null default 0
);

Demonstration of possible explanation.
Create table Script
SELECT *
INTO #T
FROM master.dbo.spt_values
CREATE NONCLUSTERED INDEX [IX_T] ON #T ([name] DESC,[number] DESC);
Query one (Returns 35 results)
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY NAME) AS rn
FROM #T
)
SELECT c1.number,c1.[type]
FROM cte c1
JOIN cte c2 ON c1.rn=c2.rn AND c1.number <> c2.number
Query Two (Same as before but adding c2.[type] to the select list makes it return 0 results)
;
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY NAME) AS rn
FROM #T
)
SELECT c1.number,c1.[type] ,c2.[type]
FROM cte c1
JOIN cte c2 ON c1.rn=c2.rn AND c1.number <> c2.number
Why?
row_number() for duplicate NAMEs isn't specified so it just chooses whichever one fits in with the best execution plan for the required output columns. In the second query this is the same for both cte invocations, in the first one it chooses a different access path with resultant different row_numbering.
Suggested Solution
You are self joining the CTE on ROW_NUMBER() over (order by t.[Date])
Contrary to what may have been expected the CTE will likely not be materialised which would have ensured consistency for the self join and thus you assume a correlation between ROW_NUMBER() on both sides that may well not exist for records where a duplicate [Date] exists in the data.
What if you try ROW_NUMBER() over (order by t.[Date], t.[id]) to ensure that in the event of tied dates the row_numbering is in a guaranteed consistent order. (Or some other column/combination of columns that can differentiate records if id won't do it)

If the purpose of this part of the view is just to make sure that the same row isn't joined to itself
where a.RowNumber <= b.RowNumber
then how does changing this part to
where a.RowNumber <> b.RowNumber
affect the results?

It seems you read dirty entries. (Someone else deletes/insertes new data)
try SET TRANSACTION ISOLATION LEVEL READ COMMITTED.
i've tried this code (seems equal to yours)
IF object_id('tempdb..#t') IS NOT NULL DROP TABLE #t
CREATE TABLE #t(i INT, val INT, acc int)
INSERT #t
SELECT 1, 2, 70
UNION ALL SELECT 2, 3, 70
;with cte as
(
select ROW_NUMBER() over (order by t.i) AS RowNumber,
t.val as [Transaction], t.acc Account
from #t t
)
select b.RowNumber, b.[Transaction], a.Account
from cte a, cte b
where a.RowNumber <= b.RowNumber AND a.Account=b.Account
group by b.RowNumber, b.[Transaction], a.Account
and got two rows
RowNumber Transaction Account
1 2 70
2 3 70

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Can this ORDER BY on a CASE clause be made faster? - sql

Well the reason it is much slower is because it is really a very different query without the order by clause. With the order by clause: Find all matching records out of the entire 350 million rows. Then sort them. Without the order by clause: Find the first 100 matching records. Stop.

Can you try this? SELECT TOP 100 (columns snipped) FROM dbo.Files AS e1 LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID ORDER BY e2.ID ASC This should give you where e2.ID is null first. Also, make sure Records.ID is indexed. This should give you the ordering you were wanting.

Related

Remove duplicated subsets from very large table

Needing system defined function to select updated or unmatched new records from two tables

More efficient way of doing multiple joins to the same table and a "case when" in the select

Tricky MS Access SQL query to remove surplus duplicate records

SQL Server: row present in one query, missing in another

Categories

Resources