I have tables "Products" 300,000 rows, and "Imported_Products" 4,000 rows. Also I have view "View_Imported_Products" which is based on "Imported_Products" to make it well-formed.
When I run UPDATE:
UPDATE Products SET DateDeleted = GETDATE()
WHERE Supplier = 'Supplier1' AND SKU NOT IN (SELECT SKU FROM View_Imported_Products)
It takes a long time (about 1 minute), even when I run it a second time and no rows are updated.
I have added non-clustered indexes on Products.SKU and View_Imported_Products.SKU, and I also changed the NOT IN to NOT EXISTS:
UPDATE P SET DateDeleted = GETDATE() FROM Products P
WHERE Supplier = 'Supplier1' AND NOT EXISTS (SELECT SKU FROM View_Imported_Products I WHERE P.SKU=I.SKU)
But it still takes about 16 seconds to run.
What am I doing wrong, and how can I improve this update so that it runs fast?
Appreciate any help.
Thank you
UPDATED
SELECT SKU FROM View_Imported_Products runs very fast; it takes 00:00:00.
Changed the query to use a LEFT JOIN instead of NOT EXISTS; it doesn't help much.
SELECT * FROM Products AS P
WHERE P.Supplier = 'Supplier1' AND DateDeleted IS NULL
AND
NOT EXISTS
(
SELECT
SKU
FROM View_Imported_Products AS I
WHERE P.SKU = I.SKU
)
It also takes a long time to execute.
Resolved it by adding a non-clustered index on the "Imported_Products".SKU column. My mistake was that I had added the non-clustered index on "View_Imported_Products".SKU instead of on the original table. Thank you all for the help and replies!
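For reference, the fix boils down to something like this sketch (the index name is just an example):
CREATE NONCLUSTERED INDEX IX_Imported_Products_SKU
    ON Imported_Products (SKU);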
If a lot of rows (tens of thousands) are being updated, you are creating a big hit on the transaction log. If that is the case, update 1,000 or 10,000 rows at a time and commit after each batch. Each transaction will then have a much smaller impact on the transaction log and will execute a lot faster.
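A rough sketch of that batching pattern for the update above, assuming DateDeleted IS NULL marks rows not yet flagged and querying the base Imported_Products table directly (as in the accepted fix):
DECLARE @Batch int;
SET @Batch = 10000;

WHILE 1 = 1
BEGIN
    UPDATE TOP (@Batch) P
    SET DateDeleted = GETDATE()
    FROM Products P
    WHERE P.Supplier = 'Supplier1'
      AND P.DateDeleted IS NULL        -- skip rows flagged in earlier batches
      AND NOT EXISTS (SELECT 1 FROM Imported_Products I WHERE I.SKU = P.SKU);

    IF @@ROWCOUNT = 0 BREAK;           -- nothing left; each batch auto-commits
END;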
Why not use joins?
UPDATE P SET DateDeleted = GETDATE()
FROM Products P
LEFT JOIN View_Imported_Products I ON P.SKU = I.SKU
WHERE P.Supplier = 'Supplier1' AND I.SKU IS NULL
And you need to create non-clustered indexes on Products.SKU and on the underlying Imported_Products.SKU.
Related
I have two tables. The first contains all the trips that the buses make:
dbo.Courses_Bus
|ID|ID_Bus|ID_Line|DateHour_Start_Course|DateHour_End_Course|
The second contains all the payments made on these buses:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|
The goal is to add the notion of a Line to the payments table, to end up with something like this:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|Line|
So I tried this:
/** I first added a Line column to the dbo.Payments table**/
UPDATE
Table_A
SET
Table_A.Line = Table_B.ID_Line
FROM
[dbo].[Payments] AS Table_A
INNER JOIN [dbo].[Courses_Bus] AS Table_B
ON Table_A.ID_Bus = Table_B.ID_Bus
AND Table_A.DateHour_Payment BETWEEN Table_B.DateHour_Start_Course AND Table_B.DateHour_End_Course
And this:
UPDATE
Table_A
SET
Table_A.Line = Table_B.ID_Line
FROM
[dbo].[Payments] AS Table_A
INNER JOIN (
SELECT
P.*,
CP.ID_Line AS ID_Line
FROM
[dbo].[Payments] AS P
INNER JOIN [dbo].[Courses_Bus] CP ON CP.ID_Bus = P.ID_Bus
AND CP.DateHour_Start_Course <= P.DateHour_Payment
AND CP.DateHour_End_Course >= P.DateHour_Payment
) AS Table_B ON Table_A.ID_Bus = Table_B.ID_Bus
The main problem, apart from the fact that these queries do not seem to work properly, is that each table has several million rows and grows every day, and because of the date/time filter (mandatory, since a single bus can run on several lines every day) SQL Server has to compare each row of one table against all the rows of the other.
So it takes an extremely long time, and it will only get worse every day.
How can I make it work and optimise it?
Assuming that this is the logic you want:
UPDATE p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course;
To optimize this query, you want an index on Courses_Bus(ID_Bus, DateHour_Start_Course, DateHour_End_Course).
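As DDL, that index might look like this sketch (the name is arbitrary):
CREATE NONCLUSTERED INDEX IX_Courses_Bus_Bus_Start_End
    ON [dbo].[Courses_Bus] (ID_Bus, DateHour_Start_Course, DateHour_End_Course);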
There might be slightly more efficient ways to optimize the query, but your question doesn't have enough information -- is there always exactly one match, for instance?
Another big issue is that updating all the rows is quite expensive. You might find that it is better to do this in loops, one chunk at a time:
UPDATE TOP (10000) p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
WHERE p.Line IS NULL;
Once again, though, this structure depends on all the initial values being NULL and an exact match for all rows.
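Wrapped in a loop, that chunked update might look like this sketch (the batch size and the stop condition are assumptions):
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) p
    SET p.Line = cb.ID_Line
    FROM [dbo].[Payments] p JOIN
         [dbo].[Courses_Bus] cb
         ON p.ID_Bus = cb.ID_Bus AND
            p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    WHERE p.Line IS NULL;

    IF @@ROWCOUNT = 0 BREAK;   -- stop once no more NULL rows can be matched
END;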
Thank you Gordon for your answer.
I have investigated and came up with this query:
MERGE [dbo].[Payments] AS p
USING [dbo].[Courses_Bus] AS cb
ON p.ID_Bus= cb.ID_Bus AND
p.DateHour_Payment>= cb.DateHour_Start_Course AND
p.DateHour_Payment<= cb.DateHour_End_Course
WHEN MATCHED THEN
UPDATE SET p.Line = cb.ID_Line;
It seems to be the most suitable approach in an MS SQL environment.
It came back with this error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
I understood this to mean that it finds several rows with identical
[p.ID_Bus= cb.ID_Bus AND
p.DateHour_Payment >= cb.DateHour_Start_Course AND
p.DateHour_Payment <= cb.DateHour_End_Course]
Yes, this is a possible case; however, the ID is different each time.
For example, two bank cards may be tapped at the same time, or there may be a loss of network after which the equipment catches up and stamps several payments with the same time. These are different rows that must be treated separately, and you can end up with, for example:
|ID|ID_Bus|DateHour_Payment|Line|
|56|204|2021-01-01 10:00:00|15|
|82|204|2021-01-01 10:00:00|15|
How can I improve this query so that it takes into account different payment IDs?
I can't figure out how to do this with the help I find online. Maybe this method is not the right one in this context.
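One direction I am considering, following the error message's suggestion to group the source rows, is a sketch like this (it keeps at most one course per payment; breaking ties on the latest course start is an assumption):
MERGE [dbo].[Payments] AS p
USING (
    SELECT ID, ID_Line
    FROM (
        SELECT pay.ID,
               cb.ID_Line,
               ROW_NUMBER() OVER (PARTITION BY pay.ID
                                  ORDER BY cb.DateHour_Start_Course DESC) AS rn
        FROM [dbo].[Payments] pay
        JOIN [dbo].[Courses_Bus] cb
          ON pay.ID_Bus = cb.ID_Bus
         AND pay.DateHour_Payment BETWEEN cb.DateHour_Start_Course
                                      AND cb.DateHour_End_Course
    ) x
    WHERE rn = 1            -- one course per payment, so each target row matches once
) AS src
ON p.ID = src.ID
WHEN MATCHED THEN
    UPDATE SET p.Line = src.ID_Line;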
When I execute the query below, it takes nearly 4 minutes.
This is the query:
SELECT
transactionsEntry.StoreID StoreID,
items.ItemLookupCode ItemLookupCode,
SUM(transactionsEntry.Quantity)
FROM
[HQMatajer].[dbo].[TransactionEntry] transactionsEntry
RIGHT JOIN
[HQMatajer].[dbo].[Transaction] transactions ON transactionsEntry.TransactionNumber = transactions.TransactionNumber
INNER JOIN
[HQMatajer].[dbo].[Item] items ON transactionsEntry.ItemID = items.ID
WHERE
YEAR(transactions.Time) = 2015
AND MONTH(transactions.Time) = 1
GROUP BY
transactionsEntry.StoreID, items.ItemLookupCode
ORDER BY
items.ItemLookupCode
The TransactionEntry table may have around 90 billion records, the Transaction table has about 30 billion records, and the Item table has about 40k records.
The estimated cost in the execution plan shows 84%, and it is a clustered index scan.
Avoid function calls on the column; they prevent the use of indexes. Try:
Where transactions.Time >= '2015-01-01'
and transactions.Time < '2015-02-01'
If you don't have an index on the transactions.Time column, add one.
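A possible shape for that index (the name is made up):
CREATE NONCLUSTERED INDEX IX_Transaction_Time
    ON [dbo].[Transaction] ([Time]);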
Create indexes on the columns you are using in your query (join keys and filters).
That is a better way to produce the result quickly.
For example (sketched as DDL below):
Create an index on Time and TransactionNumber in the Transaction table.
Create an index on TransactionNumber, ItemID and StoreID in the TransactionEntry table.
Create an index on ID in the Item table.
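As hedged DDL (index names invented; adjust if these columns are already covered by existing keys):
CREATE NONCLUSTERED INDEX IX_Transaction_Time_Number
    ON [dbo].[Transaction] ([Time], TransactionNumber);
CREATE NONCLUSTERED INDEX IX_TransactionEntry_Number_Item_Store
    ON [dbo].[TransactionEntry] (TransactionNumber, ItemID, StoreID);
CREATE NONCLUSTERED INDEX IX_Item_ID
    ON [dbo].[Item] (ID);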
Please visit this site; you can learn the basics of query tuning and SQL optimization there.
You need to prevent the Clustered Index Scan.
I recommend creating a covering index on [TransactionEntry]:
Key columns: [TransactionNumber], [ItemID]
Include: [StoreID]
Also try this index on [Transaction]:
Key columns: [Time], [TransactionNumber]
(Sorry I can't provide greater depth, but I don't know your current indexing structure.)
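Spelled out as DDL, that suggestion might look like this (the index names are invented):
CREATE NONCLUSTERED INDEX IX_TransactionEntry_Number_Item
    ON [dbo].[TransactionEntry] (TransactionNumber, ItemID) INCLUDE (StoreID);
CREATE NONCLUSTERED INDEX IX_Transaction_Time_Covering
    ON [dbo].[Transaction] ([Time], TransactionNumber);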
Try this code; it may help you:
select transactionsEntry.StoreID StoreID,items.ItemLookupCode ItemLookupCode,SUM(transactionsEntry.Quantity)
FROM [HQMatajer].[dbo].[TransactionEntry] transactionsEntry
INNER JOIN [HQMatajer].[dbo].[Item] items ON transactionsEntry.ItemID = items.ID
RIGHT JOIN [HQMatajer].[dbo].[Transaction] transactions ON transactionsEntry.TransactionNumber = transactions.TransactionNumber
Where transactions.[Time] >= '2015-01-01' and transactions.[Time] < '2015-02-01'
GROUP BY transactionsEntry.StoreID,items.ItemLookupCode
ORDER BY items.ItemLookupCode
I have two tables in one database, each with about 50,000 to 70,000 rows. Both are MyISAM. The first, yahooprices, contains SKU codes (column code) for items and pricing (column price). The second table, combined_stock, contains partnumber (the same information as code, but sorted differently), price, quantity, and description. price is currently defined as FLOAT(10,2) and set to 0.00. I am attempting to pull the pricing over from yahooprices (also FLOAT(10,2)) into combined_stock using this statement:
UPDATE combined_stock dest LEFT JOIN (
SELECT price, code FROM yahooprices
) src ON dest.partnumber = src.code
SET dest.price = src.price
I know this statement works because I tried it on a smaller test set. Both tables have non-unique indexes on partnumber and code respectively. I also tried indexing price on both tables to see if that would speed up the query. It should finish within seconds, but the last time I ran it, it sat there overnight, and even then I'm pretty certain it didn't finish. Does anyone have any troubleshooting recommendations?
I would suggest some relatively small changes. First, get rid of the subquery. Second, switch to an inner join:
UPDATE combined_stock dest JOIN
yahooprices src
ON dest.partnumber = src.code
SET dest.price = src.price;
Finally create an index on yahooprices(code, price).
You can leave the left outer join if you really want the price to be set to NULL when there is no match.
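In MySQL that index might be created like this sketch (the index name is arbitrary):
CREATE INDEX idx_yahooprices_code_price ON yahooprices (code, price);
With code as the leading column the join lookup becomes an index seek, and because price is also in the index the source side of the update can be read from the index alone.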
I have a SQL stored procedure in which one statement is taking 95% of the total time (10 minutes) to complete. #Records has approximately 133,000 rows and Records has approximately 12,000 rows.
-- Check Category 1 first
UPDATE #Records
SET Id = (SELECT TOP 1 Id FROM Records WHERE Cat1=#Records.Cat1)
WHERE Cat1 IS NOT NULL
I have tried adding an index on Cat1 in #Records, but the statement time did not improve.
CREATE CLUSTERED INDEX IDX_C_Records_Cat1 ON #Records(Cat1)
A similar statement that follows takes only a fraction of the time:
-- Check Category 2
UPDATE #Records
SET Id = (SELECT TOP 1 Id FROM Records WHERE Cat2=#Records.Cat2)
WHERE ID IS NULL
Any ideas on why this is happening or what I can do to make this statement more time effective?
Thanks in advance.
I am running this on Microsoft SQL Server 2005.
Maybe update with a join:
update t
set t.ID = r.ID
FROM (Select Min(ID) as ID,Cat1 From Records group by cat1) r
INNER JOIN #Records t ON r.Cat1 = t.cat1
Where t.cat1 is not null
I would say your problem is probably that you are using a correlated subquery instead of a join. Joins work in sets; correlated subqueries run row-by-agonizing-row and are essentially cursors.
In my experience, when you are trying to update a large number of records, it is sometimes faster to use a cursor and iterate through the records rather than use a single update query.
Maybe that helps in your case.
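A rough sketch of that cursor approach for the Cat1 step (the variable types are assumptions, and MIN(Id) stands in for the original TOP 1 without ORDER BY; whether this actually beats the set-based update depends on the data):
DECLARE @Cat1 varchar(100);   -- type assumed
DECLARE @Id int;              -- type assumed

DECLARE cat_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT Cat1, MIN(Id)      -- one Id per category
    FROM Records
    WHERE Cat1 IS NOT NULL
    GROUP BY Cat1;

OPEN cat_cursor;
FETCH NEXT FROM cat_cursor INTO @Cat1, @Id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE #Records SET Id = @Id WHERE Cat1 = @Cat1;
    FETCH NEXT FROM cat_cursor INTO @Cat1, @Id;
END;
CLOSE cat_cursor;
DEALLOCATE cat_cursor;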
I'm working on some upgrades to an internal web analytics system we provide for our clients (in the absence of a preferred vendor or Google Analytics), and I'm working on the following query:
select
path as EntryPage,
count(Path) as [Count]
from
(
/* Sub-query 1 */
select
pv2.path
from
pageviews pv2
inner join
(
/* Sub-query 2 */
select
pv1.sessionid,
min(pv1.created) as created
from
pageviews pv1
inner join Sessions s1 on pv1.SessionID = s1.SessionID
inner join Visitors v1 on s1.VisitorID = v1.VisitorID
where
pv1.Domain = isnull(@Domain, pv1.Domain) and
v1.Campaign = @Campaign
group by
pv1.sessionid
) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
) t2
group by
Path;
I've tested this query with 2 million rows in the PageViews table and it takes about 20 seconds to run. I'm noticing a clustered index scan twice in the execution plan, both times it hits the PageViews table. There is a clustered index on the Created column in that table.
The problem is that in both cases it appears to iterate over all 2 million rows, which I believe is the performance bottleneck. Is there anything I can do to prevent this, or am I pretty much maxed out as far as optimization goes?
For reference, the purpose of the query is to find the first page view for each session.
EDIT: After much frustration, despite the help received here, I could not make this query work. Therefore, I decided to simply store a reference to the entry page (and now exit page) in the sessions table, which allows me to do the following:
select
pv.Path,
count(*)
from
PageViews pv
inner join Sessions s on pv.SessionID = s.SessionID
and pv.PageViewID = s.ExitPage
inner join Visitors v on s.VisitorID = v.VisitorID
where
(
@Domain is null or
pv.Domain = @Domain
) and
v.Campaign = @Campaign
group by pv.Path;
This query runs in 3 seconds or less. Now I either have to update the entry/exit page in real time as the page views are recorded (the optimal solution) or run a batch update at some interval. Either way, it solves the problem, but not like I'd intended.
Edit Edit: Adding a missing index (after cleaning up from last night) reduced the query to mere milliseconds. Woo hoo!
For starters,
where pv1.Domain = isnull(@Domain, pv1.Domain)
won't SARG. You can't optimize a match on a function, as I remember.
I'm back. To answer your first question, you could probably just do a union on the two conditions, since they are obviously disjoint.
Actually, you're trying to cover both the case where you provide a domain, and where you don't. You want two queries. They may optimize entirely differently.
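A sketch of the two-query idea applied to sub-query 2 (everything except the WHERE clause is unchanged from the original):
IF @Domain IS NULL
    SELECT pv1.sessionid, MIN(pv1.created) AS created
    FROM pageviews pv1
    INNER JOIN Sessions s1 ON pv1.SessionID = s1.SessionID
    INNER JOIN Visitors v1 ON s1.VisitorID = v1.VisitorID
    WHERE v1.Campaign = @Campaign
    GROUP BY pv1.sessionid;
ELSE
    SELECT pv1.sessionid, MIN(pv1.created) AS created
    FROM pageviews pv1
    INNER JOIN Sessions s1 ON pv1.SessionID = s1.SessionID
    INNER JOIN Visitors v1 ON s1.VisitorID = v1.VisitorID
    WHERE pv1.Domain = @Domain
      AND v1.Campaign = @Campaign
    GROUP BY pv1.sessionid;
Each branch gets its own plan, so the @Domain filter only appears in the plan that actually needs it.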
What's the nature of the data in these tables? Do you find most of the data is inserted/deleted regularly?
Is that the full schema for the tables? The query plan shows different indexing.
Edit: Sorry, I just read the last line of text. I'd suggest that if the tables are routinely cleared and re-inserted, you could think about ditching the clustered index and using the tables as heaps. Just a thought.
You should definitely put non-clustered index(es) on Campaign and Domain, as John suggested.
Your inner query (pv1) will require a nonclustered index on (Domain).
The second query (pv2) can already find the rows it needs due to the clustered index on Created, but pv1 might be returning so many rows that SQL Server decides that a table scan is quicker than all the locks it would need to take. As pv1 groups on SessionID (and hence has to order by SessionID), a nonclustered index on SessionID, Created, and including path should permit a MERGE join to occur. If not, you can force a merge join with "SELECT .. FROM pageviews pv2 INNER MERGE JOIN ..."
The two indexes listed above will be:
CREATE NONCLUSTERED INDEX ncixcampaigndomain ON PageViews (Domain)
CREATE NONCLUSTERED INDEX ncixsessionidcreated ON PageViews(SessionID, Created) INCLUDE (path)
SELECT
sessionid,
MIN(created) AS created
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = @Campaign
GROUP BY
sessionid
so that gives you the sessions for a campaign. Now let's see what you're doing with that.
OK, this gets rid of your grouping:
SELECT
campaignid,
sessionid,
pv.path
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = @Campaign
AND NOT EXISTS (
SELECT 1 FROM pageviews
WHERE sessionid = pv.sessionid
AND created < pv.created
)
To continue from doofledorf.
Try this:
where
(@Domain is null or pv1.Domain = @Domain) and
v1.Campaign = @Campaign
OK, I have a couple of suggestions:
Create this covering index:
create index idx2 on [PageViews]([SessionID], Domain, Created, Path)
If you can amend the Sessions table so that it stores the entry page, e.g. an EntryPageViewID column, you will be able to heavily optimise this.
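A sketch of that change, assuming PageViewID is the key of PageViews and that the column is backfilled once and then maintained as page views are recorded:
ALTER TABLE Sessions ADD EntryPageViewID int NULL;
GO

-- Backfill: the entry page is the earliest page view of each session.
UPDATE s
SET s.EntryPageViewID = pv.PageViewID
FROM Sessions s
JOIN PageViews pv ON pv.SessionID = s.SessionID
WHERE pv.Created = (SELECT MIN(pv2.Created)
                    FROM PageViews pv2
                    WHERE pv2.SessionID = s.SessionID);
The entry-page report then becomes a simple join on Sessions.EntryPageViewID, much like the ExitPage query shown in the edit above.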