Query Optimization using 'except' - sql

I've been working on some performance optimization lately and have been stuck on the query below. Breaking it down, the individual steps don't seem to take very long, but when I run the query as a whole, it takes about 30 minutes to complete.
The TABLE has around 100k rows and the VIEW has around 400k rows, so neither is terribly large. Am I misunderstanding how EXCEPT works, and is that the likely culprit? Is there perhaps an alternative to EXCEPT?
EDIT - The view itself has about 4 joins and a UNION, so it does have some logic to it.
CREATE TABLE [SCHEMA].[TABLE](
ColumnA [int] IDENTITY(1,1) NOT NULL,
ColumnB [tinyint] NOT NULL,
ColumnC [tinyint] NOT NULL,
ColumnD [int] NULL,
ColumnE [nvarchar](50) NOT NULL,
ColumnF [int] NOT NULL,
ColumnG [nvarchar](250) NULL,
ColumnH [nvarchar](250) NULL,
ColumnI [nvarchar](250) NULL,
ColumnJ [nvarchar](50) NULL,
ColumnK [nvarchar](400) NULL,
ColumnL [nvarchar](2) NULL,
ColumnM [nvarchar](250) NULL,
ColumnN [nvarchar](3) NULL,
----
DELETE FROM [DB].[SCHEMA].[TABLE]
WHERE ColumnB NOT IN (4,6)
AND ColumnG NOT IN
    (SELECT ColumnG
     FROM
     (
         SELECT ColumnG, ColumnH, ColumnI FROM [DB].[SCHEMA].[TABLE]
         EXCEPT
         SELECT ColumnG, ColumnH, ColumnI FROM [DB].[SCHEMA].[VIEW]
         WHERE [VIEW].ColumnB = 'Active' AND YEAR(LastChgDateTime) = 9999
     ) AAA)
Thanks for any help!

Without knowing your schema and indexing, it's hard to say. We also haven't seen a query plan. And you haven't provided the view definition, so we don't know what's involved there.
But for a start, you can simplify the query in the following way. Note the use of a sargable predicate on LastChgDateTime:
DELETE FROM [DB].[SCHEMA].[TABLE]
WHERE ColumnB NOT IN (4,6)
AND NOT EXISTS (
    SELECT ColumnG, ColumnH, ColumnI
    FROM [DB].[SCHEMA].[TABLE] AAA
    WHERE AAA.ColumnG = [TABLE].ColumnG
    EXCEPT
    SELECT ColumnG, ColumnH, ColumnI
    FROM [DB].[SCHEMA].[VIEW]
    WHERE [VIEW].ColumnB = 'Active' AND LastChgDateTime >= '9999-01-01'
);
For the above, the following indexes would make sense (the view itself can't be indexed directly unless it's an indexed view, so the equivalent indexing would go on its base tables):
[TABLE] (ColumnG, ColumnH, ColumnI) INCLUDE (ColumnB)
[VIEW] (ColumnB, ColumnG, ColumnH, ColumnI, LastChgDateTime)
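As a sketch only (the index name is hypothetical), the first of those might be declared like this; the second would have to be mapped onto the view's base tables, or the view converted to an indexed view with SCHEMABINDING:
-- Hypothetical name; the keys cover the EXCEPT columns, INCLUDE covers the ColumnB filter
CREATE NONCLUSTERED INDEX IX_TABLE_G_H_I
ON [SCHEMA].[TABLE] (ColumnG, ColumnH, ColumnI)
INCLUDE (ColumnB);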
We can optimize this further by using an updatable CTE with a window function. I'm not entirely sure of the logic you're trying to achieve, but it appears to be something like this:
WITH cte AS (
    SELECT
        t.*,
        -- per ColumnG, count the rows whose (G, H, I) tuple is missing from the view
        IsGNotInView = COUNT(v.IsNotInView) OVER (PARTITION BY t.ColumnG)
    FROM [DB].[SCHEMA].[TABLE] t
    CROSS APPLY (
        SELECT
            CASE WHEN NOT EXISTS (SELECT 1
                                  FROM [DB].[SCHEMA].[VIEW] v
                                  WHERE v.ColumnB = 'Active'
                                    AND v.LastChgDateTime >= '9999-01-01'
                                    AND v.ColumnG = t.ColumnG
                                    AND v.ColumnH = t.ColumnH
                                    AND v.ColumnI = t.ColumnI
                                 )
                 THEN 1 END
    ) v(IsNotInView)
)
DELETE FROM cte
WHERE ColumnB NOT IN (4,6)
AND IsGNotInView = 0;

Related

Sometime SQL Server Select Query is too slow

I have a table like this which has more than 7 million records:
CREATE TABLE [dbo].[Test]
(
[Id] [bigint] IDENTITY(1,1) NOT NULL,
[UUID] [nvarchar](100) NOT NULL,
[FirstName] [nvarchar](50) NULL,
[LastName] [nvarchar](50) NULL,
[AddrLine1] [nvarchar](100) NULL,
[AddrLine2] [nvarchar](100) NULL,
[City] [nvarchar](50) NULL,
[Prov] [nvarchar](10) NULL,
[Postal] [nvarchar](10) NULL,
[DateAdded] [datetime] NULL,
CONSTRAINT [PK_Test]
PRIMARY KEY CLUSTERED ([Id] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
Now, the system runs the following select query every day in the afternoon. The funny thing is that sometimes the query is very slow, taking about 4 minutes; other times the same query is pretty fast, taking less than a second.
The query:
WITH testquery AS
(
SELECT TOP 1
'Matched' as location,Firstname, LastName,
AddrLine1, AddrLine2, City, Prov, Postal
FROM
[Test]
WHERE
UUID = 'BLABLABLABLABLABLABLABLABLA'
ORDER BY
DateAdded DESC
),
defaults AS
(
SELECT
'Rejected' AS location, NULL AS Firstname, NULL AS LastName,
NULL AS AddrLine1, NULL AS AddrLine2, NULL AS City, NULL AS Prov,
NULL AS Postal
)
SELECT *
FROM testquery
UNION ALL
SELECT *
FROM defaults
WHERE NOT EXISTS (SELECT * FROM testquery);
Can somebody help please?
Notes:
I have a service which adds around 1,000 new records to the table every day in the morning.
[avg_fragmentation_in_percent] is 0.01
UUID can be duplicated if I have the same person with different addresses.
The table is not used somewhere else at the same time.
Database is not busy with other queries at the same time. I checked using "sys.dm_exec_requests"
You need a good index to service this query efficiently.
You say that you can't create one because of duplicate key errors: there is no need for an index to be unique.
So the one you're looking for will depend on what other queries you are running, but the following will suffice for this query:
CREATE NONCLUSTERED INDEX IX_Test_UuidDate ON
Test (UUID ASC, DateAdded DESC)
INCLUDE (Firstname, LastName, AddrLine1, AddrLine2, City, Prov, Postal)
GO
Furthermore, there is no need to query the table twice.
Start with a dummy VALUES table constructor so you always have a row, then LEFT JOIN the table and use CASE to deal with not having a row.
WITH testquery AS
(
SELECT TOP 1
*
FROM
[Test]
WHERE
UUID = 'BLABLABLABLABLABLABLABLABLA'
ORDER BY
DateAdded DESC
)
SELECT
CASE WHEN t.UUID IS NULL THEN 'Rejected' ELSE 'Matched' END AS location,
t.Firstname,
t.LastName,
t.AddrLine1,
t.AddrLine2,
t.City,
t.Prov,
t.Postal
FROM (VALUES(0)) AS v(dummy)
LEFT JOIN testquery AS t ON 1=1;
The usual explanation for this is a cold cache. In your case, I think the issue would be the ORDER BY in the first CTE.
To fix this problem, you want an index on test(UUID, DateAdded desc).
I'm not sure why this would speed up after the first execution. Perhaps the server's caches are working particularly well.
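For reference, a minimal version of that index might look like the following (the name is illustrative; the covering index suggested above avoids the key lookups this narrower one still needs):
CREATE NONCLUSTERED INDEX IX_Test_Uuid_DateAdded
ON dbo.Test (UUID ASC, DateAdded DESC);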

SQL loop executes but new values are overwritten

As my question title says, my program loops but all of the values I updated are being overwritten. Here's the code, posted below. Say minRownum is 1 and max is 12: I see the loop execute 12 times correctly, and min gets incremented by 1 each time. But in the end result, only the final row, whose RowNum is 12, has any values.
I'm not exactly sure why the overwriting is occurring, since I'm saying "update it where the row number equals minrownum" and then incrementing minrownum.
Can anyone point to what I am doing wrong? Thanks
WHILE (@MinRownum <= @MaxRownum)
BEGIN
    print ' here'
    UPDATE #usp_sec
    SET amount = (
        SELECT SUM(amount) AS amount
        FROM dbo.coverage
        INNER JOIN dbo.owner
            ON coverage.api = owner.api
        WHERE RowNum = @MinRownum
    );
    SET @MinRownum = @MinRownum + 1
END
PS: I edited the statement as shown below, and now every value has the same wrong number (it's not distinct but duplicated across all rows):
SET amount = (SELECT SUM(amount) AS amount
              FROM dbo.coverage
              INNER JOIN dbo.owner ON coverage.api = owner.api
              WHERE RowNum = @MinRownum
             ) WHERE RowNum = @MinRownum;
Tables:
CREATE TABLE #usp_sec
(
RowNum int,
amount numeric(20,2),
discount numeric(3,2)
)
CREATE TABLE [dbo].[handler](
[recordid] [int] IDENTITY(1,1) NOT NULL,
[covid] [varchar](25) NULL,
[ownerid] [char](10) NULL
)
CREATE TABLE [dbo].[coverage](
[covid] [varchar](25) NULL,
[api] [char](12) NULL,
[owncovid] [numeric](12, 0) NULL,
[amount] [numeric](14, 2) NULL,
[knote] [char](10) NULL
)
CREATE TABLE [dbo].[owner](
[api] [char](12) NOT NULL,
[owncovid] [numeric](12, 0) NULL,
[ownerid] [char](10) NOT NULL,
[officer] [char](20) NOT NULL,
[appldate] [date] NOT NULL
)
Your UPDATE statement needs its own WHERE clause. Otherwise, each UPDATE will update every row in the table.
And the way you have this written, your subquery still needs a WHERE clause that does real work. In fact, you need to definitively correlate the subquery to the rows of your table (#usp_sec). We cannot tell you exactly how that should be done without more information, such as which table RowNum actually belongs to (it doesn't appear in the posted definitions of coverage or owner).
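To illustrate the shape only: assuming (and this is purely an assumption, since the posted schemas don't show where RowNum lives) that dbo.coverage carries a RowNum matching #usp_sec.RowNum, the whole loop could collapse into one set-based, correlated UPDATE:
-- Hypothetical sketch: assumes dbo.coverage.RowNum exists and matches #usp_sec.RowNum
UPDATE u
SET u.amount = src.amount
FROM #usp_sec AS u
CROSS APPLY (
    SELECT SUM(c.amount) AS amount
    FROM dbo.coverage AS c
    INNER JOIN dbo.owner AS o ON c.api = o.api
    WHERE c.RowNum = u.RowNum   -- the correlation the loop was missing
) AS src;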

Query is very very slow for processing 200000 plus records

I have 200,000 rows each in the Patient and Person tables, and the query shown takes 30 seconds to execute.
I have defined the primary key (and clustered index) in the Person table on PersonId and on PatientId in the Patient table. What else can I do here to improve performance of my procedure?
I'm new to the database development side and know only basic SQL. I'm also not sure whether SQL Server can handle 200,000 rows quickly.
You can see the whole dynamic procedure at https://github.com/Padayappa/SQLProblem/blob/master/Performance
Has anyone faced handling this many rows? How do I improve the performance here?
DECLARE @return_value int,
        @unitRows bigint,
        @unitPages int,
        @TenantId int,
        @unitItems int,
        @page int,
        @CategoryId int   -- used in the WHERE clause below; set in the full procedure
SET @TenantId = 1
SET @unitItems = 20
SET @page = 1
DECLARE @PatientSearch TABLE(
[PatientId] [bigint] NOT NULL,
[PatientIdentifier] [nvarchar](50) NULL,
[PersonNumber] [nvarchar](20) NULL,
[FirstName] [nvarchar](100) NOT NULL,
[LastName] [nvarchar](100) NOT NULL,
[ResFirstName] [nvarchar](100) NOT NULL,
[ResLastName] [nvarchar](100) NOT NULL,
[AddFirstName] [nvarchar](100) NOT NULL,
[AddLastName] [nvarchar](100) NOT NULL,
[Address] [nvarchar](255) NULL,
[City] [nvarchar](50) NULL,
[State] [nvarchar](50) NULL,
[ZipCode] [nvarchar](20) NULL,
[Country] [nvarchar](50) NULL,
[RowNumber] [bigint] NULL
)
INSERT INTO @PatientSearch SELECT PAT.PatientId
,PAT.PatientIdentifier
,PER.PersonNumber
,PER.FirstName
,PER.LastName
,RES_PER.FirstName AS ResFirstName
,RES_PER.LastName AS ResLastName
,ADD_PER.FirstName AS AddFirstName
,ADD_PER.LastName AS AddLastName
,PER.Address
,PER.City
,PER.State
,PER.ZipCode
,PER.Country
,ROW_NUMBER() OVER (ORDER BY PAT.PatientId DESC) AS RowNumber
FROM dbo.Patient AS PAT
INNER JOIN dbo.Person AS PER
ON PAT.PersonId = PER.PersonId
INNER JOIN dbo.Person AS RES_PER
ON PAT.ResponsiblePersonId = RES_PER.PersonId
INNER JOIN dbo.Person AS ADD_PER
ON PAT.AddedBy = ADD_PER.PersonId
INNER JOIN dbo.Booking AS B
ON PAT.PatientId = B.PatientId
WHERE PAT.TenantId = @TenantId AND B.CategoryId = @CategoryId
GROUP BY PAT.PatientId
,PAT.PatientIdentifier
,PER.PersonNumber
,PER.FirstName
,PER.LastName
,RES_PER.FirstName
,RES_PER.LastName
,ADD_PER.FirstName
,ADD_PER.LastName
,PER.Address
,PER.City
,PER.State
,PER.ZipCode
,PER.Country
;
SELECT @unitRows = @@ROWCOUNT
      ,@unitPages = (@unitRows / @unitItems) + 1;
SELECT *
FROM @PatientSearch AS IT
WHERE RowNumber BETWEEN (@page - 1) * @unitItems + 1 AND @unitItems * @page
Well, unless I am missing something (like duplicate rows?), you should be able to remove the GROUP BY
GROUP BY PAT.PatientId
,PAT.PatientIdentifier
,PER.PersonNumber
,PER.FirstName
,PER.LastName
,RES_PER.FirstName
,RES_PER.LastName
,ADD_PER.FirstName
,ADD_PER.LastName
,PER.Address
,PER.City
,PER.State
,PER.ZipCode
,PER.Country
as you are grouping by every field in the select list, and the ROW_NUMBER() is simply ordered by PAT.PatientId.
Further to that, you should create indexes on the tables, with each index containing the columns that you join or filter on.
So, for instance, I would create an index on table Patient with key columns (TenantId, PersonId, ResponsiblePersonId, AddedBy) and included columns (PatientId, PatientIdentifier).
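Written out, that suggestion might look like the following (the index name is illustrative):
CREATE NONCLUSTERED INDEX IX_Patient_Tenant
ON dbo.Patient (TenantId, PersonId, ResponsiblePersonId, AddedBy)
INCLUDE (PatientId, PatientIdentifier);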
Frankly speaking, 200,000 rows is nothing to SQL Server. First remove the logical redundancy: you have a primary key, so why group by so many columns? And why do you need to join the same table (Person) three times? After removing the redundancy, create some composite and/or INCLUDE indexes at least. Get the execution plan (Ctrl+M in SSMS for the actual plan, or Ctrl+L for the estimated plan) to see what indexes you are missing. If you need further help, please paste your table schema with a few rows of sample data.

SQL Server View with TOP

I have a view that is driving me absolutely crazy..
Table AlarmMsg looks like this:
[TypeID] [smallint] NULL,
[SEFNum] [int] NULL,
[ServerName] [nvarchar](20) NOT NULL,
[DBName] [varchar](20) NOT NULL,
[PointName] [varchar](50) NOT NULL,
[AppName] [varchar](50) NOT NULL,
[Description] [varchar](100) NOT NULL,
[Priority] [tinyint] NOT NULL,
[Value] [float] NOT NULL,
[Limit] [float] NOT NULL,
[Msg] [nvarchar](255) NULL,
[DateStamp] [datetime2](7) NULL,
[UID] [uniqueidentifier] NOT NULL
On top of that AlarmMsg table is a view applied looking like this:
CREATE VIEW AlarmMsgView
AS
SELECT TOP (2000) WITH TIES
SEFNum, ServerName, DBName,
PointName, AppName, Description,
Priority, Value, Limit, Msg,
DateStamp, UID
FROM dbo.AlarmMsg WITH (NOLOCK)
ORDER BY DateStamp DESC
This query straight against the table returns the expected ten (10) rows:
SELECT TOP(10) [SEFNum]
FROM [RTIME_Logs].[dbo].[AlarmMsg]
where [Priority] = 1
The same query against the view returns....nothing (!):
SELECT TOP(10) [SEFNum]
FROM [RTIME_Logs].[dbo].[AlarmMsgView]
where [Priority] = 1
The table AlarmMsg contains some 11M+ rows and has a full-text index declared on column Msg.
Can someone please tell me what's going on here, I think I'm losing my wits.
NOLOCK causes this issue.
Basically, NOLOCK came from the SQL Server 2000 era. It needs to be forgotten. You have upgraded your SQL Server (I hope), so you need to upgrade your queries too. Consider switching to READ_COMMITTED_SNAPSHOT to read data in an "unblocked" manner, and look into which isolation level is best for your situation.
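If you do switch, READ_COMMITTED_SNAPSHOT is a database-level setting. A minimal sketch, using the database name from your query (run it in a quiet window, since the ALTER needs exclusive access to the database):
ALTER DATABASE [RTIME_Logs] SET READ_COMMITTED_SNAPSHOT ON;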
EDIT:
After reading the comments from the author, I think this is the reason:
SQL Server is not doing anything wrong. Treat your view as a subquery of the main query, something like this:
SELECT * FROM (SELECT TOP(2000) col1, col2, priority FROM aTable ORDER BY Date1 DESC) AS t WHERE t.priority = 1;
In this case the query in the brackets is executed first, and the WHERE clause is then applied to the resulting set.
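That is why the view returns nothing here: the 2,000 most recent rows simply contain no Priority = 1 rows. To get the expected result, apply the filter before limiting, against the base table (all names taken from the posted schema):
-- Apply the Priority filter first, then take the newest matching rows
SELECT TOP(10) SEFNum
FROM [RTIME_Logs].[dbo].[AlarmMsg]
WHERE Priority = 1
ORDER BY DateStamp DESC;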

what's the right way of joining two tables, grouping by a column, and selecting only one row for each record?

I have a crews table
CREATE TABLE crew(crew_id INT, crew_name nvarchar(20))
And a time log table, which is just a very long list of actions performed by the crew
CREATE TABLE [dbo].[TimeLog](
[time_log_id] [int] IDENTITY(1,1) NOT NULL,
[experiment_id] [int] NOT NULL,
[crew_id] [int] NOT NULL,
[starting] [bit] NULL,
[ending] [bit] NULL,
[exception] [nchar](10) NULL,
[sim_time] [time](7) NULL,
[duration] [int] NULL,
[real_time] [datetime] NOT NULL )
I want to have a view that shows only one row for each crew, with the latest sim_time + duration.
Is a view the way to go? If yes, how do I write it? If not, what's the best way of doing this?
Thanks
Here is a query to select what you want:
select * from (
    select
        c.*,
        l.sim_time,
        l.duration,
        row_number() over (partition by c.crew_id order by l.sim_time desc) as rNum
    from crew as c
    inner join TimeLog as l on c.crew_id = l.crew_id
) as t
where rNum = 1
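Since the question asks whether a view is the way to go: it can be. A sketch wrapping the query above in a view (the view name is illustrative):
CREATE VIEW dbo.LatestCrewTimeLog
AS
SELECT t.crew_id, t.crew_name, t.sim_time, t.duration
FROM (
    SELECT
        c.crew_id, c.crew_name, l.sim_time, l.duration,
        ROW_NUMBER() OVER (PARTITION BY c.crew_id ORDER BY l.sim_time DESC) AS rNum
    FROM dbo.crew AS c
    INNER JOIN dbo.TimeLog AS l ON c.crew_id = l.crew_id
) AS t
WHERE t.rNum = 1;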
It depends on what you need that data for.
Anyway, a simple query to find the latest sim time would be something like:
select C.*, TL.sim_time
from crew C /*left? right? inner?*/ join TimeLog TL on TL.crew_id = C.crew_id
where TL.sim_time in (select max(timelog_subquery.sim_time)
                      from TimeLog timelog_subquery
                      where timelog_subquery.crew_id = C.crew_id)