I am trying to run a query that should produce only 2 million rows and 12 columns. However, my query has been running for 6 hours. I would like to ask if there is anything I can do to speed it up, and if there are any general tips.
I am still a beginner in SQL, and your help is highly appreciated.
INSERT INTO #ORSOID values (321) --UK
INSERT INTO #ORSOID values (368) --DE
SET @startorderdate = '4/1/2019' --'1/1/2017' --EDIT THESE
SET @endorderdate = '6/30/2019' --EDIT THESE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---step 1 for the list of opids and check the table to see if more columns that are needed are present to include them
--Create a list of relevant OpIDs for the selected time period
select
op1.oporid,
op1.opcurrentponum,
o.orcompletedate,
o.orsoid,
op1.opid,
op1.opreplacesopid,
op1.opreplacedbyopid,
op1.OpSplitFromOpID,
op1.opsuid,
op1.opprsku,
--op1.orosid,
op1.opdatenew,
OPCOMPONENTMANUFACTURERPARTID
into csn_junk.dbo.SCOpid
from csn_order..vworder o with (nolock)
inner join csn_order..vworderproduct op1 with (nolock) on oporid = orid
LEFT JOIN CSN_ORDER..TBLORDERPRODUCT v WITH (NOLOCK) on op1.opid = v.OpID
where op1.OpPrGiftCertificate = 0
and orcompletedate between @startorderdate and @endorderdate
and orsoid in (select soid from #orsoid)
Select * From csn_junk.dbo.SCOpid
First, there is no way to know why a query has been running for many hours on a server we don't have access to, without any metrics (i.e., an execution plan or CPU/memory/IO metrics). Also, without any DDL it's impossible to understand what's going on with your query.
General Guidelines for troubleshooting slow data modification:
Getting the right metrics
The first thing I'd do is run task manager on that server and see if you have a server issue or a query issue. Is the CPU pegged to 100%? If so, is sqlservr.exe the cause? How often do you run this query? How fast is it normally?
There are a number of native and third-party tools for collecting good metrics. The native ones include execution plans, DMFs and DMVs, Extended Events, SQL Traces and Query Store. There are also great third-party tools such as Brent Ozar's suite of tools and Adam Machanic's sp_whoisactive.
There's a saying in the BI World: If you can't measure it, you can't manage it. If you can't measure what's causing your queries to be slow, you won't know where to start.
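As a starting point for the DMV route mentioned above, here is a minimal sketch that lists the requests currently executing, with their waits and any blocking session. It uses only the built-in sys.dm_exec_requests and sys.dm_exec_sql_text objects; the column selection is just my suggestion, and you need VIEW SERVER STATE permission to see other sessions.

```sql
-- Currently executing requests, longest-running first.
SELECT
    r.session_id,
    r.status,
    r.command,
    r.wait_type,            -- what the request is waiting on right now (NULL if running)
    r.wait_time,            -- current wait, in milliseconds
    r.blocking_session_id,  -- non-zero when another session is blocking this one
    r.cpu_time,
    r.logical_reads,
    r.total_elapsed_time,   -- milliseconds since the request started
    t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID  -- exclude this monitoring query itself
ORDER BY r.total_elapsed_time DESC;
```

If the 6-hour insert shows a non-zero blocking_session_id or a lock-related wait_type, the problem is more likely blocking than the query itself.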
Big updates like this can cause locking, blocking, lock-escalation and even deadlocks.
Understand execution plans, specifically actual execution plans.
I write my code in SSMS with "Show execution plan" turned on. I always want to know what my query is doing. You can also view the execution plans after the fact by capturing them using SQL Traces (via the SQL Profiler) or Extended Events.
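If you prefer to capture this from a script rather than the SSMS toolbar, a rough sketch looks like the following. The SET STATISTICS options are standard T-SQL; the query in the middle is only a stand-in that reads the table from the question, so substitute whatever statement you are tuning.

```sql
SET STATISTICS IO ON;    -- logical/physical reads per table, in the Messages tab
SET STATISTICS TIME ON;  -- compile and execution CPU/elapsed time
SET STATISTICS XML ON;   -- returns the actual execution plan as XML with the results

-- Stand-in for the statement being tuned.
SELECT COUNT(*) FROM csn_junk.dbo.SCOpid;

SET STATISTICS XML OFF;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```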
This is a huge topic so I'll just mention some things off the top of my head that I look for in my plans when troubleshooting slow queries: Sorts, Key Lookups, RID lookups, Scans against large tables (e.g. you scan an entire 10,000,000 row table to retrieve 12,000 rows - for this you want a seek.) Sometimes there will be warnings in the execution plan such as a "tempdb spill" - these are bad. Sometimes the plan will call out "missing indexes" - a topic unto itself. Which brings me to...
INDEXES
This is where execution plans, DMVs and other SQL monitoring tools really come in handy. The rule of thumb is, when you are doing SELECT queries it's nice to have plenty of good indexes available for the optimizer to choose from; in a normalized data mart, for example, more are better. For INSERT/UPDATE/DELETE operations you want as few indexes as possible, because each one associated with the query/data in the query is modified. For a big insert like the one you are doing, fewer indexes would be better on csn_junk.dbo.SCOpid and, as mentioned in the comments below your post, you want the correct indexes to support the tables used for the update.
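For the "missing indexes" side of this, a sketch of the usual DMV query is below. It relies only on the built-in sys.dm_db_missing_index_* views; the ordering expression is a rough heuristic of mine, not an official score, and the suggestions should be treated as hints to investigate rather than indexes to create blindly.

```sql
-- Index suggestions recorded by the optimizer since the last restart.
SELECT TOP (20)
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,        -- how many seeks could have used the index
    migs.avg_user_impact    -- estimated percent improvement
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details AS mid
    ON mid.index_handle = mig.index_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;
```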
CONSTRAINTS
Constraints slow data modification. The referential integrity constraints present (Primary/Foreign keys) and UNIQUE constraints will impact performance. CHECK constraints can as well; CHECK constraints that use a T-SQL scalar function will absolutely destroy data modification performance more than almost anything else I can think of, except for scalar UDFs as CHECK constraints that also access other tables. This can slow an insert that should take a minute to several hours.
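To see which constraints an insert has to satisfy, something like this sketch can help. dbo.TargetTable is a hypothetical name; replace it with the table you are inserting into. The catalog views themselves (sys.foreign_keys, sys.check_constraints) are standard.

```sql
-- Foreign keys defined on the (hypothetical) target table.
SELECT name, is_disabled, is_not_trusted
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID(N'dbo.TargetTable');

-- CHECK constraints on the same table; look for scalar UDF calls in the definition.
SELECT name, is_disabled, definition
FROM sys.check_constraints
WHERE parent_object_id = OBJECT_ID(N'dbo.TargetTable');
```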
MDF & LDF file growth
A 2,000,000 row+/12 column insert is going to cause the associated MDF and LDF files to grow substantially. If your data files (.MDF or .NDF) or Log File (.LDF) fill up they will auto-grow to create space. This slows queries that run in seconds to minutes, especially when your auto-growth settings are bad. See: SQL Server Database Growth and Autogrowth Settings
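A quick way to look at the current size and autogrowth settings is a query along these lines, run in the database that receives the insert. It reads the standard sys.database_files view; the MB conversion assumes the usual 8 KB page size.

```sql
-- File sizes and autogrowth settings for the current database.
SELECT
    name,
    type_desc,                                   -- ROWS (data) or LOG
    CAST(size AS bigint) * 8 / 1024 AS size_mb,  -- size is stored in 8 KB pages
    CASE
        WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(20)) + ' %'
        ELSE CAST(CAST(growth AS bigint) * 8 / 1024 AS varchar(20)) + ' MB'
    END AS growth_setting,
    max_size                                     -- -1 means unlimited
FROM sys.database_files;
```

Tiny fixed increments or percentage growth on a large file are the settings I would look at first.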
Whenever I have a query that always runs for 10 seconds and now, out of nowhere, is running for minutes, and assuming it's not a deadlock or server issue, I will check for MDF or LDF autogrowth, as this is often the culprit. Often you have a log file that needs to be shrunk (via log backup or manually, depending on the recovery model). This brings me to batching:
Batching
Huge inserts chew up log space and take forever to roll back if the query fails. Making things worse - cancelling a huge insert (or trying to Kill the Spid) will sometimes cause more problems. Doing data modifications in batches can circumvent this problem. See this article for more details.
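A minimal sketch of a batched insert is below. dbo.SourceTable, dbo.TargetTable and the Id/Payload columns are hypothetical, and the loop assumes Id is a positive integer key; the batch size is an arbitrary starting value to tune.

```sql
DECLARE @BatchSize int = 50000;  -- rows of key range per batch (assumption, tune it)
DECLARE @LastId int = 0;         -- assumes Id values start above 0
DECLARE @MaxId int;

SELECT @MaxId = MAX(Id) FROM dbo.SourceTable;

-- Each iteration is its own implicit transaction, so the log can be
-- reused between batches and a failure only rolls back one batch.
WHILE @LastId < @MaxId
BEGIN
    INSERT INTO dbo.TargetTable (Id, Payload)
    SELECT s.Id, s.Payload
    FROM dbo.SourceTable AS s
    WHERE s.Id > @LastId
      AND s.Id <= @LastId + @BatchSize;

    SET @LastId = @LastId + @BatchSize;
END;
```

Walking a key range like this avoids rescanning the target table on every pass, which a "TOP (n) ... WHERE NOT EXISTS" loop would do.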
Hopefully this helps get you started. I've given you plenty to google. Please forgive any typos - I spun this up fast. Feel free to ask followup questions.
One of my projects has a very large database on which I can't edit indexes etc.; I have to work with it as it is.
What I saw when testing some queries that I will be running on their database, via a service that I am writing in .NET, is that they are quite slow when run the first time.
What they have been doing so far: they have 2 main (large) tables that are used the most. They showed me that they open SQL Server Management Studio and run
SELECT *
FROM table1
JOIN table2
a query that takes around 5 minutes to run the first time, but then takes about 30 seconds if you run it again without closing SQL Server Management Studio. So they keep SQL Server Management Studio open 24/7, so that when one of their programs executes queries related to these 2 tables (which seems to be almost all queries run by their program), they get the 30-second run time instead of the 5 minutes.
I assume this happens because the 2 tables get cached and then there are no (or close to no) disk reads.
Is it a good idea to have a service which runs a query to cache these 2 tables every now and then? Or is there a better solution to this, given that I can't edit indexes or split the tables, etc.?
Edit:
Sorry, I was possibly unclear: the DB hopefully has indexes already, I am just not allowed to edit them or anything.
Edit 2:
Query plan
This could be a candidate for an indexed view (if you can persuade your DBA to create it!), something like:
-- SCHEMABINDING requires two-part (schema.object) names; the dbo schema is assumed here
CREATE VIEW dbo.transhead_transdata
WITH SCHEMABINDING
AS
SELECT
<columns of interest>
FROM
dbo.transhead th
JOIN dbo.transdata td
ON th.GID = td.HeadGID;
GO
CREATE UNIQUE CLUSTERED INDEX transjoined_uci ON dbo.transhead_transdata (<something unique>);
This will "precompute" the JOIN (and keep it in sync as transhead and transdata change).
You can't create indexes? This is your biggest problem regarding performance. A better solution would be to create the proper indexes and address any performance issues by checking wait stats, resource contention, etc. I'd start with Brent Ozar's blog and open source tools, and move forward from there.
Keeping SSMS open doesn't prevent the plan cache from being cleared. I would start with a few links.
Understanding the query plan cache
Check your current plan cache
Understanding why the cache would clear (memory constraint, too many plans (can't hold them all), Index Rebuild operation, etc.). Brent talks about this in this answer
How to clear it manually
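For the "check your current plan cache" item, a minimal sketch using the built-in sys.dm_exec_cached_plans and sys.dm_exec_sql_text objects looks like this. The TOP (20) and the ordering are just my choices.

```sql
-- Most frequently reused plans currently in the plan cache.
SELECT TOP (20)
    cp.usecounts,      -- how many times the plan has been reused
    cp.objtype,        -- Adhoc, Prepared, Proc, ...
    cp.size_in_bytes,
    st.text AS sql_text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;
```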
Aside from that... that query is suspect. I wouldn't expect your application to use those results. That is, I wouldn't expect you to load every row and column from two tables into your application every time it was called. Understand that a different query on those same tables, like selecting fewer columns, adding a predicate, etc., could and likely would cause SQL Server to generate a new query plan that was more optimized. The current query, without predicates and selecting every column... and no indexes as you stated, would simply do two table scans. Any increase in performance going forward wouldn't be because the plan was cached, but because the data was stored in memory and subsequent reads wouldn't experience physical reads, i.e. it is reading from memory versus disk.
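To check the "data is in memory" explanation, you can look at how much of each database currently sits in the buffer pool. This sketch uses the standard sys.dm_os_buffer_descriptors view; note that it can be slow on servers with a lot of RAM.

```sql
-- Buffer pool usage per database (each buffer is one 8 KB page).
SELECT
    DB_NAME(database_id) AS database_name,
    COUNT_BIG(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_mb DESC;
```

Running it before and after the first slow execution should show the cached size jump if the speed-up really comes from the pages being held in memory.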
There's a lot more that could be said, but I'll stop here.
You might also consider putting this query into a stored procedure which can then be scheduled to run at a regular interval through SQL Agent that will keep the required pages cached.
Thanks to both @scsimon and @Branko Dimitrijevic for their answers. I think they were really useful, and they guided me in the right direction.
In the end it turns out that the 2 biggest issues were hardware resources (RAM, no SSD) and the Auto Close feature, which was set to True.
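For anyone wanting to check the same setting, a small sketch follows. The is_auto_close_on column and the ALTER DATABASE option are standard; [YourDatabase] is a placeholder name.

```sql
-- Databases that close (and drop their cached data) when the last connection ends.
SELECT name, is_auto_close_on
FROM sys.databases
WHERE is_auto_close_on = 1;

-- Turn the option off; [YourDatabase] is a placeholder.
ALTER DATABASE [YourDatabase] SET AUTO_CLOSE OFF;
```

With Auto Close on, the database shuts down when the last connection disconnects, which would explain why keeping SSMS connected kept things fast.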
Other fixes that I have made (writing it here for anyone else that tries to improve):
A helper service tool will rearrange (defragment) indexes once every week and will rebuild them once a month.
Create a view which has all the columns from the 2 tables in question - to eliminate JOIN cost.
Advised that a DBA can probably help with better tables/indexes
Advised to improve server hardware...
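The weekly/monthly index maintenance in the list above could be sketched roughly as follows. dbo.table1 stands for one of the two large tables (the dbo schema is an assumption), and the 1000-page filter is only a common rule of thumb.

```sql
-- Fragmentation of the larger indexes in the current database.
SELECT
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id
WHERE ips.page_count > 1000   -- small indexes are rarely worth defragmenting
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Weekly: lightweight, online defragmentation.
ALTER INDEX ALL ON dbo.table1 REORGANIZE;

-- Monthly: full rebuild (also refreshes the index statistics).
ALTER INDEX ALL ON dbo.table1 REBUILD;
```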
Will accept @Branko Dimitrijevic's answer as I can't accept both.
We are currently having difficulties with a SQL Server procedure timing out on queries. 9 times out of 10 the query will run within 5 seconds max; however, on occasion, the proc can run in excess of 2 minutes, causing timeouts on the front end (.NET MVC application).
They have been investigating this for over a week now, checking jobs and server performance, and all seems to be OK.
The DBAs have narrowed it down to a particular table which is being bombarded from different applications with inserts / updates. This, in combination with the complex SELECT query that joins on that table, is (I'm being told) causing the timeouts.
Are there any suggestions at all to how to get around these time outs?
i.e.
replicate the table and query the new table?
Any additional debugging that can prove that this is actually the issue?
Perhaps cache the data on the front end, if a time out, call data from cache?
A table being bombarded with updates is a table being bombarded with locks. And yes, this can affect performance.
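One way to confirm that locks are really the issue is to look at who is waiting on whom while a slow call is in progress. This sketch only uses the built-in sys.dm_os_waiting_tasks view; run it from a separate session during a timeout.

```sql
-- Tasks currently blocked by another session.
SELECT
    wt.session_id,
    wt.blocking_session_id,
    wt.wait_type,            -- LCK_M_* wait types indicate lock waits
    wt.wait_duration_ms,
    wt.resource_description  -- identifies the locked resource
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.blocking_session_id IS NOT NULL
  AND wt.blocking_session_id <> wt.session_id;
```

If the procedure's session shows up here with LCK_M_* waits during the slow runs, the DBAs' theory holds; if it never does, look at the other possibilities.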
First, copy the table and run the query multiple times. There are other possibilities for the performance issue.
One cause of unstable stored procedure performance in SQL Server is compilation. The code in the stored procedure is compiled the first time it is executed -- the resulting execution plan might work for some inputs and not others. This is readily fixed by using the option to recompile the queries each time (although this adds overhead).
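As a sketch of the recompile option, the statement-level hint looks like this. The procedure, table and column names are hypothetical, not from your system.

```sql
-- Hypothetical procedure; only the OPTION (RECOMPILE) hint matters here.
CREATE PROCEDURE dbo.SearchOrders
    @CustomerId int
AS
BEGIN
    SET NOCOUNT ON;

    SELECT o.OrderId, o.OrderDate
    FROM dbo.Orders AS o
    WHERE o.CustomerId = @CustomerId
    OPTION (RECOMPILE);  -- build a fresh plan for each call's parameter values
END;
```

Putting the hint on the one problematic statement keeps the compile overhead smaller than declaring the whole procedure WITH RECOMPILE.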
Then, think about the query. Does it need the most up-to-date data? If not, perhaps you can just copy the table once per hour or once per day.
If the most recent data is needed, you might need to re-think the architecture. A table that does insert-only using a clustered identity column always inserts at the end of the table. This is less likely to interfere with queries on the table.
Replication may or may not help the problem. After all, full replication will be doing the updates on the replicated copy. You don't solve the "bombardment" problem by bombarding two tables.
If your queries involve a lot of historical data, then partitioning might help. Only the most recent partition would be "bombarded", leaving the others more responsive to queries.
The DBAs have narrowed it down to a particular table which is being bombarded from different applications with inserts / updates. This, in combination with the complex SELECT query that joins on that table, is (I'm being told) causing the timeouts.
We used to face many timeouts and used to get a lot of escalations. This is the approach we followed for reducing them.
Some of it may be applicable in your case, some may not, but the following will not cause any harm.
Change the SQL Server settings below:
1. Remote login timeout: 60
2. Remote query timeout: 0
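If you would rather script the two settings than use the server properties dialog, a sketch with sp_configure is below. As far as I know these are the full option names as they appear in sys.configurations; the values are the ones listed above.

```sql
-- Values are in seconds; 0 for the query timeout means "wait indefinitely".
EXEC sp_configure 'remote login timeout (s)', 60;
EXEC sp_configure 'remote query timeout (s)', 0;
RECONFIGURE;  -- apply the new values

-- Verify what is currently in effect.
SELECT name, value_in_use
FROM sys.configurations
WHERE name LIKE 'remote%timeout%';
```

Keep in mind these options govern remote operations such as linked server queries; a timeout raised by the .NET client itself is controlled by the command timeout in the application.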
Also, if your Windows server is set to use dynamic RAM, try changing it to static RAM.
You may also have to tune some Windows Server settings:
TCP Offloading/Chimney & RSS…What is it and should I disable it?
Following the above steps reduced our timeouts by 99%.
For the remaining 1%, we dealt with each case separately:
1. Update statistics for the tables involved in the query
2. Try fine-tuning the query further
This helped us eliminate the timeouts completely.
I have the same VIEW in two databases. In one database it takes less than 1 second to run, but in the other it takes 1 minute or more. I checked the indexes and everything is the same. The difference in the number of rows between the two databases is less than 10 million.
I checked the execution plan, and what I found is that in the database that takes more time I have 3 Hash Match operators (1 aggregate and 2 right outer joins) that are responsible for 100% of the query batch. In the other database I don't have these in the execution plan.
Can anyone tell me where I can begin to look for the problem?
Thank you, and sorry for the bad English.
You can check this link here for a quick explanation on different types of joins.
Basically, with the information you've given us, here are some of the alternatives for what might be wrong:
One DB has indexes the other doesn't.
The size difference between some of the joined tables in one DB versus the other is dramatic enough to change the type of join used.
While your indexes might be the same on both DB table groups, as you said, it's possible the other DB has outdated / bad statistics or too much index fragmentation, resulting in sub-optimal plans.
EDIT:
Regarding your comment below, it's true that rebuilding indexes is similar to dropping & recreating indexes. And since creating indexes also creates the statistics for those indexes, rebuilding will take care of them as well. Sometimes that's not enough however.
While officially default statistics should be built with about a 20% sampling rate of the actual data, in reality the sampling rate can be as low as just a few percent, depending on how massive the table is. It's rarely anywhere near 20%. Because of that, many DBAs build statistics manually with FULLSCAN to obtain a 100% sampling rate.
The statistics take equally much storage space either way, so there are really no downsides to this aside from the extra time required in maintenance plans. In my current project, we have several situations where the default sampling rate for the statistics is not enough, and would still produce bad plans. So we routinely update all statistics with FULLSCAN every few weeks to make sure the performance stays top notch.
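A sketch of what that looks like in practice is below. dbo.MyTable is a hypothetical table name; STATS_DATE and UPDATE STATISTICS ... WITH FULLSCAN are standard T-SQL.

```sql
-- When were the statistics on the (hypothetical) table last updated?
SELECT
    s.name AS stats_name,
    STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.MyTable');

-- Rebuild all statistics on the table by reading every row.
UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;
```

Comparing last_updated between the fast and the slow database is a cheap first check before running the full scan.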
I am trying to optimize the search query which is the most used in our system. So far I have added some missing indexes and that has helped slightly. But I want to further reduce the load on the db server. One option that I will use is caching the result set as a LIST in the asp.net Cache so that I don't have to hit the db often.
However, I was wondering if there is a way to cache some portions of the select query at the db as well. E.g. for the search results we consider only users who have been active in the last 180 days and who have share-info set to true. So this is like a superset which the db processes every time before applying the other conditions that are passed in, such as the specified category, city etc. Is it possible to somehow cache the superset so that I can run queries against it rather than against the whole table? Will creating a view help with this? I am a bit hesitant to create a view, as I read that managing views can be an overhead and takes away some flexibility to modify the tables.
I am using Sql-Server 2005 so cannot create a filtered index on the table, which I think would have been helpful.
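I agree with @Neville K. SQL Server is pretty smart at caching data in memory. You might see limited / no performance gains for your effort.
You could consider indexed views http://technet.microsoft.com/en-us/library/cc917715.aspx for your sub-query. They can be created in any edition, but only Enterprise Edition uses them automatically; in other editions the query has to reference the view with the NOEXPAND hint.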
It is, of course, possible to do this - but I'm not sure if it will help.
You can create a scheduled job - once a night, perhaps - which populates a table called "active_users_with_share_info" by truncating it, and then repopulating it based on a select query filtering out users active in the last 180 days with "share_info = true".
Then you can join your search query to this table.
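The nightly repopulation step could be sketched like this. The dbo.users table and its column names are my guesses at your schema, so adjust them; the syntax is kept compatible with SQL Server 2005.

```sql
SET XACT_ABORT ON;  -- roll back the whole transaction if any statement fails

BEGIN TRANSACTION;

-- TRUNCATE is transactional in SQL Server, so readers never see a
-- committed empty table; they wait until the reload commits.
TRUNCATE TABLE dbo.active_users_with_share_info;

INSERT INTO dbo.active_users_with_share_info (user_id, category_id, city)
SELECT u.user_id, u.category_id, u.city
FROM dbo.users AS u              -- hypothetical source table and columns
WHERE u.share_info = 1
  AND u.last_active_date >= DATEADD(day, -180, GETDATE());

COMMIT TRANSACTION;
```

Scheduled as a SQL Agent job step, this keeps the superset at most one day stale.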
However, I doubt this would do much good - SQL Server is pretty smart at caching. Unless you're dealing with huge volumes of data (hundreds of millions of records), or very limited hardware, I doubt you'd get any measurable performance improvements - but by all means try it!
Of course, the price for this would be more moving parts in your application, more interesting failure modes (what happens if the overnight batch fails silently?), and more training for any new developers you bring into the team.
I have been working with SQL Server for a while and have used a lot of performance techniques to fine-tune many queries. Most of these queries had to execute within a few seconds or maybe minutes.
I am working with a job which loads around 100K of data and runs for around 10 hrs.
What are the things I need to consider while writing or tuning such query? (e.g. memory, log size, other things)
Make sure you have good indexes defined on the columns you are querying on.
Ultimately, the best thing to do is to actually measure and find the source of your bottlenecks. Figure out which queries in a stored procedure or what operations in your code take the longest, and focus on slimming those down, first.
I am actually working on a similar problem right now, on a job that performs complex business logic in Java for a large number of database records. I've found that the key is to process records in batches, and make as much of the logic as possible operate on a batch instead of operating on a single record. This minimizes roundtrips to the database, and causes certain queries to be much more efficient than when I run them for one record at a time. Limiting the batch size prevents the server from running out of memory when working on the Java side. Since I am using Hibernate, I also call session.clear() after every batch, to prevent the session from keeping copies of objects I no longer need from previous batches.
Also, an RDBMS is optimized for working with large sets of data; use normal SQL operations whenever possible. Avoid things like cursors and a lot of procedural programming; as other people have said, make sure you have your indexes set up correctly.
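To make the "set-based instead of cursor" point concrete, here is a small sketch with hypothetical tables and columns (dbo.Orders, dbo.Customers). A cursor would fetch each closed customer and update that customer's orders one at a time; the single statement below does the same work in one pass.

```sql
-- One set-based statement instead of a row-by-row cursor loop.
UPDATE o
SET o.Status = 'Archived'
FROM dbo.Orders AS o
JOIN dbo.Customers AS c
    ON c.CustomerId = o.CustomerId
WHERE c.IsClosed = 1;
```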
It's impossible to say without looking at the query. Just because you have indexes doesn't mean they are being used. You'll have to look at the execution plan and see if they are being used. They might show that they aren't useful to the execution plan.
You can start with looking at the estimated execution plan. If the job actually completes, you can wait for the actual execution plan. Look at parameter sniffing. Also, I had an extremely odd case on SQL Server 2005 where
SELECT * FROM l LEFT JOIN r ON r.ID = l.ID WHERE r.ID IS NULL
would not complete, yet
SELECT * FROM l WHERE l.ID NOT IN (SELECT r.ID FROM r)
worked fine - but only for particular tables. Problem was never resolved.
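For completeness, a third way to write the same anti-join against l and r is NOT EXISTS. This is only a sketch of an alternative form; I can't say whether it would have behaved differently in that particular case.

```sql
-- Rows in l that have no matching row in r.
SELECT *
FROM l
WHERE NOT EXISTS (SELECT 1 FROM r WHERE r.ID = l.ID);
```

Unlike NOT IN, this form still returns rows when r.ID contains NULLs, so the variants are not always interchangeable.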
Make sure your statistics are up to date.
If possible, post your query here so there is something to look at. I recall a query someone built with joins to 12 different tables, dealing with around 4 million or so records, that took around a day to run. I was able to tune that to run within 30 minutes by eliminating the unnecessary joins. Where possible, try to reduce the datasets you are joining before returning your results. Use plenty of temp tables, views etc. if you need to.
In cases of large datasets with conditions, try to pre-apply your conditions through a view before your joins, to reduce the number of records.
100k joining 100k is a lot bigger than 2k joining 3k
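A sketch of reducing the dataset before joining, using a temp table, is below. All table and column names are hypothetical, and the date is an arbitrary example.

```sql
-- Step 1: apply the filter first and keep only the columns needed later.
SELECT o.OrderId, o.CustomerId
INTO #RecentOrders
FROM dbo.Orders AS o
WHERE o.OrderDate >= '20190401';

-- Step 2: join the much smaller intermediate set.
SELECT c.CustomerName, COUNT(*) AS order_count
FROM #RecentOrders AS r
JOIN dbo.Customers AS c
    ON c.CustomerId = r.CustomerId
GROUP BY c.CustomerName;

DROP TABLE #RecentOrders;
```

Whether this beats a single query depends on the plan; the optimizer often pushes filters down by itself, so measure both versions.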