Long running Jobs Performance Tips - sql-server-2005

I have been working with SQL Server for a while and have used a lot of performance techniques to fine-tune many queries. Most of these queries needed to execute within a few seconds or maybe minutes.
I am now working with a job which loads around 100K of data and runs for around 10 hours.
What do I need to consider while writing or tuning such a query (e.g. memory, log size, other things)?

Make sure you have good indexes defined on the columns you are querying on.

Ultimately, the best thing to do is to actually measure and find the source of your bottlenecks. Figure out which queries in a stored procedure or what operations in your code take the longest, and focus on slimming those down, first.
I am actually working on a similar problem right now, on a job that performs complex business logic in Java for a large number of database records. I've found that the key is to process records in batches, and make as much of the logic as possible operate on a batch instead of operating on a single record. This minimizes roundtrips to the database, and causes certain queries to be much more efficient than when I run them for one record at a time. Limiting the batch size prevents the server from running out of memory when working on the Java side. Since I am using Hibernate, I also call session.clear() after every batch, to prevent the session from keeping copies of objects I no longer need from previous batches.
Also, an RDBMS is optimized for working with large sets of data; use normal SQL operations whenever possible. Avoid things like cursors and a lot of procedural programming. As other people have said, make sure you have your indexes set up correctly.
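To make the batching idea above concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for the real database. The `records` table, its columns, and the helper names are hypothetical and only for illustration; the point is that one query is issued per batch of ids rather than one per record.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def total_by_batch(conn, record_ids, batch_size=500):
    """Sum `amount` for the given ids using one query per batch, not per id."""
    total = 0
    round_trips = 0
    for chunk in batched(record_ids, batch_size):
        # Build "?,?,?" with one placeholder per id in this batch.
        placeholders = ",".join("?" * len(chunk))
        row = conn.execute(
            f"SELECT COALESCE(SUM(amount), 0) FROM records WHERE id IN ({placeholders})",
            chunk,
        ).fetchone()
        total += row[0]
        round_trips += 1
    return total, round_trips


# Hypothetical demo data: 2000 records, each with amount = 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)", [(i, 2) for i in range(1, 2001)])
total, round_trips = total_by_batch(conn, range(1, 2001), batch_size=500)
```

With 2000 ids and a batch size of 500 this makes 4 round trips instead of 2000. The batch size is a tuning knob: larger batches mean fewer round trips but more memory held on the client side at once.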

It's impossible to say without looking at the query. Just because you have indexes doesn't mean they are being used. You'll have to look at the execution plan to see whether they are. The plan might show that the optimizer does not consider them useful for this query.
You can start with looking at the estimated execution plan. If the job actually completes, you can wait for the actual execution plan. Look at parameter sniffing. Also, I had an extremely odd case on SQL Server 2005 where
SELECT * FROM l LEFT JOIN r ON r.ID = l.ID WHERE r.ID IS NULL
would not complete, yet
SELECT * FROM l WHERE l.ID NOT IN (SELECT r.ID FROM r)
worked fine - but only for particular tables. Problem was never resolved.
Make sure your statistics are up to date.
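One thing worth knowing about the two query shapes quoted above: they are only interchangeable when r.ID can never be NULL. The sketch below uses Python's standard-library sqlite3 module as a stand-in engine, with tiny made-up tables named l and r after the example, to show that NOT IN returns nothing once the subquery yields a NULL, while the LEFT JOIN and NOT EXISTS forms still return the unmatched rows. It says nothing about why the original query hung on SQL Server 2005.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE l (ID INTEGER);
    CREATE TABLE r (ID INTEGER);
    INSERT INTO l VALUES (1), (2), (3);
    INSERT INTO r VALUES (1), (NULL);  -- note the NULL in r.ID
""")

# Anti-join written as LEFT JOIN ... IS NULL.
left_join = conn.execute(
    "SELECT l.ID FROM l LEFT JOIN r ON r.ID = l.ID WHERE r.ID IS NULL ORDER BY l.ID"
).fetchall()

# Anti-join written as NOT IN: a NULL in the subquery makes the predicate
# evaluate to NULL (unknown) for every non-matching row, so nothing is returned.
not_in = conn.execute(
    "SELECT l.ID FROM l WHERE l.ID NOT IN (SELECT r.ID FROM r) ORDER BY l.ID"
).fetchall()

# Anti-join written as NOT EXISTS: unaffected by NULLs in r.ID.
not_exists = conn.execute(
    "SELECT l.ID FROM l WHERE NOT EXISTS (SELECT 1 FROM r WHERE r.ID = l.ID) ORDER BY l.ID"
).fetchall()
```

So before swapping one form for the other as a performance workaround, check whether the joined column is nullable; NOT EXISTS is usually the safer rewrite.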

If possible post your query here so there is something to look at. I recall a query someone built with joins to 12 different tables dealing with around 4 or so million records that took around a day to run. I was able to tune that to run within 30 mins by eliminating the unnecessary joins. Where possible try to reduce the datasets you are joining before returning your results. Use plenty of temp tables, views etc if you need.
In cases of large datasets with conditions try to preapply your conditions through a view before your joins to reduce the number of records.
100k joining 100k is a lot bigger than 2k joining 3k
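A rough back-of-the-envelope for the comparison above, assuming the worst case of a naive nested-loop join with no usable index (real optimizers usually do far better, so treat this as an upper bound and not a prediction):

```python
def worst_case_pairs(left_rows, right_rows):
    """Row pairs a naive nested-loop join would compare with no usable index."""
    return left_rows * right_rows


unfiltered = worst_case_pairs(100_000, 100_000)  # join first, filter later
prefiltered = worst_case_pairs(2_000, 3_000)     # filter first, then join
ratio = unfiltered / prefiltered
```

Under that assumption the unfiltered join considers 10 billion row pairs against 6 million for the pre-filtered one, a difference of more than three orders of magnitude.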

Related

Is there a possibility to make my query in SQL run faster?

I am trying to run a query that would produce only 2 million lines and 12 columns. However my query has been running for 6 hours... I would like to ask if there is anything I can do to speed it up and if there are general tips.
I am still a beginner in SQL and your help is highly appreciated
INSERT INTO #ORSOID values (321) --UK
INSERT INTO #ORSOID values (368) --DE
SET @startorderdate = '4/1/2019' --'1/1/2017' --EDIT THESE
SET @endorderdate = '6/30/2019' --EDIT THESE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---step 1 for the list of opids and check the table to see if more columns that are needed are present to include them
--Create a list of relevant OpIDs for the selected time period
select
op1.oporid,
op1.opcurrentponum,
o.orcompletedate,
o.orsoid,
op1.opid,
op1.opreplacesopid,
op1.opreplacedbyopid,
op1.OpSplitFromOpID,
op1.opsuid,
op1.opprsku,
--op1.orosid,
op1.opdatenew,
OPCOMPONENTMANUFACTURERPARTID
into csn_junk.dbo.SCOpid
from csn_order..vworder o with (nolock)
inner join csn_order..vworderproduct op1 with (nolock) on oporid = orid
LEFT JOIN CSN_ORDER..TBLORDERPRODUCT v WITH (NOLOCK) on op1.opid = v.OpID
where op1.OpPrGiftCertificate = 0
and orcompletedate between @startorderdate and @endorderdate
and orsoid in (select soid from #orsoid)
Select * From csn_junk.dbo.SCOpid
First, there is no way to know why a query is running for many hours on a server we don't have access to, or without any metrics (e.g. an execution plan or CPU/Memory/IO metrics). Also, without any DDL it's impossible to understand what's going on with your query.
General Guidelines for troubleshooting slow data modification:
Getting the right metrics
The first thing I'd do is run task manager on that server and see if you have a server issue or a query issue. Is the CPU pegged to 100%? If so, is sqlservr.exe the cause? How often do you run this query? How fast is it normally?
There are a number of native and third-party tools for collecting good metrics: execution plans, DMFs and DMVs, Extended Events, SQL Traces, and Query Store. There are also great third-party tools such as Brent Ozar's suite of tools and Adam Machanic's sp_whoisactive.
There's a saying in the BI World: If you can't measure it, you can't manage it. If you can't measure what's causing your queries to be slow, you won't know where to start.
Big updates like this can cause locking, blocking, lock-escalation and even deadlocks.
Understand execution plans, specifically actual execution plans.
I write my code in SSMS with "Show execution plan" turned on. I always want to know what my query is doing. You can also view the execution plans after the fact by capturing them using SQL Traces (via the SQL Profiler) or Extended Events.
This is a huge topic so I'll just mention some things off the top of my head that I look for in my plans when troubleshooting slow queries: Sorts, Key Lookups, RID lookups, Scans against large tables (e.g. you scan an entire 10,000,000 row table to retrieve 12,000 rows - for this you want a seek.) Sometimes there will be warnings in the execution plan such as a "tempdb spill" - these are bad. Sometimes the plan will call out "missing indexes" - a topic unto itself. Which brings me to...
INDEXES
This is where execution plans, DMVs and other SQL monitoring tools really come in handy. The rule of thumb is: when you are doing SELECT queries it's nice to have plenty of good indexes available for the optimizer to choose from; in a normalized data mart, for example, more are better. For INSERT/UPDATE/DELETE operations you want as few indexes as possible, because every index that covers the modified data has to be modified too. For a big insert like the one you are doing, fewer indexes would be better on csn_junk.dbo.SCOpid and, as mentioned in the comments below your post, you want the correct indexes to support the tables used for the update.
CONSTRAINTS
Constraints slow data modification. Any referential integrity constraints (Primary/Foreign keys) and UNIQUE constraints that are present will impact performance. CHECK constraints can as well; CHECK constraints that use a T-SQL scalar function will absolutely destroy data modification performance, more than almost anything else I can think of. The only thing worse is a scalar UDF used as a CHECK constraint that also accesses other tables; this can slow an insert that should take a minute to several hours.
MDF & LDF file growth
A 2,000,000+ row/12 column insert is going to cause the associated MDF and LDF files to grow substantially. If your data files (.MDF or .NDF) or log file (.LDF) fill up, they will auto-grow to create space. This can turn queries that normally run in seconds into queries that run for minutes, especially when your auto-growth settings are bad. See: SQL Server Database Growth and Autogrowth Settings
Whenever I have a query that always runs for 10 seconds and now, out of nowhere, is running for minutes, and assuming it's not a deadlock or server issue, I will check for MDF or LDF autogrowth, as this is often the culprit. Often you have a log file that needs to be shrunk (via log backup or manually, depending on the recovery model). This brings me to batching:
Batching
Huge inserts chew up log space and take forever to roll back if the query fails. Making things worse - cancelling a huge insert (or trying to Kill the Spid) will sometimes cause more problems. Doing data modifications in batches can circumvent this problem. See this article for more details.
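As a minimal sketch of the batching idea above, here is one way to copy rows in fixed-size batches with a commit after each one, so that a failure only rolls back the batch in flight. It uses Python's standard-library sqlite3 module as a stand-in for SQL Server; the `source` and `target` tables and the `copy_in_batches` helper are hypothetical, and it assumes an increasing positive integer key and an initially empty target.

```python
import sqlite3


def copy_in_batches(conn, batch_size=1000):
    """Copy rows from source to target in key order, committing after each batch."""
    last_id = 0
    batches = 0
    while True:
        cur = conn.execute(
            "INSERT INTO target (id, payload) "
            "SELECT id, payload FROM source WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )
        inserted = cur.rowcount
        conn.commit()  # each batch is its own short transaction
        if inserted == 0:
            break
        # Remember where this batch ended so the next one resumes after it.
        last_id = conn.execute("SELECT MAX(id) FROM target").fetchone()[0]
        batches += 1
    return batches


# Hypothetical demo data: 2500 source rows copied 1000 at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 2501)]
)
conn.commit()
batches = copy_in_batches(conn, batch_size=1000)
copied = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

Resuming from the last copied key instead of using an offset keeps every batch cheap, and the job can be restarted after a failure without redoing committed batches.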
Hopefully this helps get you started. I've given you plenty to google. Please forgive any typos - I spun this up fast. Feel free to ask followup questions.

What's the curve for a simple select query?

This is a conceptual question.
Hypothetically, when I do select * from table_name where the table has 1 million records, it takes about 3 secs.
Similarly, when I select 10 million records the time taken is about 30 secs. But I am told the selection of records is not linearly proportional to time. After a certain number, the time required to select records increases exponentially?
Please help me understand how this works?
There are things that can make one query take longer than another, even for simple selects with no where clauses or joins.
First, the time to return the query depends on how busy the network is at the time the query is run. It could also depend on whether there are any locks on the data or how much memory is available.
It also depends on how wide the tables are and, in general, how many bytes an individual record has. For instance, I would expect a 10 million record table that has only two columns, both ints, to return much faster than a million record table that has 50 columns including some large ones, especially if they are things like documents stored as database objects or large fields that have too much text to fit into an ordinary varchar or nvarchar field (in SQL Server these would be nvarchar(max) or text, for instance). I would expect this because there is simply less total data to return, even though there are more records.
As you start adding where clauses and joins, of course, there are many more things that affect the performance of an individual query. If you query databases, you should read a good book on performance tuning for your particular database. There are many things you can do without realizing it that can cause queries to run more slowly than need be. You should learn the techniques that create the queries most likely to be performant.
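To put rough numbers on the width argument above, here is a small Python estimate. The 4-byte int is standard, but the 40-byte average per column for the wide table is purely an assumption for illustration, and per-row storage and protocol overhead are ignored.

```python
def result_bytes(rows, columns, avg_bytes_per_column):
    """Very rough payload size of a result set, ignoring per-row overhead."""
    return rows * columns * avg_bytes_per_column


narrow = result_bytes(10_000_000, 2, 4)  # 10M rows, two 4-byte ints
wide = result_bytes(1_000_000, 50, 40)   # 1M rows, 50 columns, assumed 40 bytes each
ratio = wide / narrow
```

Under those assumptions the table with ten times fewer rows still returns about 25 times more data (2 GB against 80 MB), which is why row count alone is a poor predictor of how long a select takes.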
I think this is different for each database-server. Try to monitor the performance while you fire your queries (what happens to the memory, and CPU?)
Eventually all hardware components have a bottleneck. If you come close to that point the server might 'suffocate'.

libpq very slow for large (20 million record) database

I am new to SQL/RDBMS.
I have an application which adds rows with 10 columns in PostgreSQL server using the libpq library. Right now, my server is running on same machine as my visual c++ application.
I have added around 15-20 million records. The simple query of getting total count is taking 4-5 minutes using select count(*) from <tableName>;.
I have indexed my table with the time I am entering the data (timecode). Most of the time I need count with different WHERE / AND clauses added.
Is there any way to make things fast? I need to make it as fast as possible because once the server moves to network, things will become much slower.
Thanks
I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*) which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
Here are two suggestions from that page. One is to count an indexed column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
If you can live with an approximate value, another is to use a Postgres specific function for an approximate value like:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run and many other factors; see the comments.
Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
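The division described above, as a tiny Python helper. The 2 GiB table size is an assumed figure for illustration (the real one comes from pg_relation_size), and 270 seconds stands for the 4-5 minutes reported in the question.

```python
def scan_throughput_mib_per_s(relation_size_bytes, elapsed_seconds):
    """Effective full-scan throughput in MiB/s."""
    return relation_size_bytes / elapsed_seconds / (1024 * 1024)


# Assumed: pg_relation_size returned 2 GiB and count(*) took 4.5 minutes.
throughput = scan_throughput_mib_per_s(2 * 1024**3, 270)
```

With those numbers the scan runs at under 8 MiB/s, far below what even a single spinning disk can usually deliver sequentially, which would point at I/O or caching problems more than at the query itself.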
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
The default postgres configuration is meant to not consume too much resources to play nice with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
Also note that the speed of select count(*)-style queries has nothing to do with libpq or the network, since only one resulting row is retrieved. It happens entirely server-side.
You don't state what your data is, but normally the way to handle tables with a very large amount of data is to partition the table. http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table this can be helpful.

How long should a query that returns 5 million records take?

I realise the answer should probably be 'as little time as possible' but I'm trying to learn how to optimise databases and I have no idea what an acceptable time is for my hardware.
For a start I'm using my local machine with a copy of sql server 2008 express. I have a dual-core processor, 2GB ram and a 64bit OS (if that makes a difference). I'm only using a simple table with about 6 varchar fields.
At first I queried the data without any indexing. This took a ridiculously long amount of time so I cancelled and added a clustered index (using the PK) to the table. This cut the time down to 1 minute 14 sec. I have no idea if this is the best I can get or whether I'm still able to cut this down even further?
Am I limited by my hardware or is there anything else I can do to my table/database/queries to get results faster?
FYI I'm only using a standard SELECT * FROM <Table> to retrieve my results.
EDIT: Just to clarify, I'm only doing this for testing purposes. I don't NEED to pull out all the data, I'm just using that as a consistent test to see if I can cut down the query times.
I suppose what I'm asking is: Is there anything I can do to speed up the performance of my queries other than a) upgrading hardware and b) adding indexes (assuming the schema is already good)?
I think you are asking the wrong question.
First of all - why do you need so many records at one time on the local machine? What do you want to do with them? I'm asking because I think you want to transfer this data somewhere, so you should be measuring how long it takes to transfer the data.
Some advice:
Your applications should not select 5 million records at a time. Try to split your query and get the data in smaller sets.
UPDATE:
Because you are doing this for testing, I suggest the following to improve performance:
Remove * from your query - it takes SQL Server some time to resolve this.
Put your data in temporary storage; try using a VIEW or a temporary table for this.
Use plan caching on your server.
But even if you're just testing, I still don't understand why you would need such tests if your application would never use such a query. Testing just for the sake of testing is a bad use of time.
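The advice above to fetch the data in smaller sets can be sketched as keyset paging: each query asks for the next page of rows after the last key already seen. This is an illustration in Python using the standard-library sqlite3 module as a stand-in; the `items` table and the `iter_pages` helper are hypothetical, and it assumes a positive integer primary key.

```python
import sqlite3


def iter_pages(conn, page_size=1000):
    """Yield lists of (id, name) rows, one page at a time, in id order."""
    last_id = 0
    while True:
        page = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield page
        last_id = page[-1][0]  # resume after the last key of this page


# Hypothetical demo data: 2500 rows read 1000 at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(i, f"item {i}") for i in range(1, 2501)]
)
page_sizes = [len(page) for page in iter_pages(conn, page_size=1000)]
```

Because each page is located through the primary key, the cost per page stays roughly constant, and the client never holds more than one page in memory.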
Look at the query execution plan. If your query is doing a table scan, it will obviously take a long time. The query execution plan can help you decide what kind of indexing you would need on the table. Also, creating table partitions can help sometimes in cases where the data is partitioned by a condition (usually date and time).
I did 5.5 million in 20 seconds. That's taking over 100k schedules with different frequencies and forecasting them for the next 25 years. Just max scenario testing, but proves the speed you can achieve in a scheduling system as an example.
The best approach depends on the indexing strategy you choose. Like many of the answers above, I too would say that partitioning the table can sometimes help. It is also not best practice to query all the records at once; you will get much better results if you query the data in parts, iteratively. You may check this link for the minimum requirements for SQL Server 2008: Minimum H/W and S/W Requirements for Sql server 2008
When fetching 5 million rows you are almost certainly going to spool to tempdb. You should try to optimize tempdb by adding additional files. If you have multiple drives on separate disks, you should split the table data into different .ndf files located on separate disks. Partitioning won't help when querying all the data on the disk.
You can also use the MAXDOP query hint to influence parallelism (it sets the maximum degree of parallelism the query may use); a parallel plan will increase CPU utilization. Ensure that the columns contain as few NULLs as possible, and rebuild your indexes and statistics.

Performance for RBAR vs. set-based processing with varying transactional sizes

It is conventional wisdom that set based processing of tables should always be preferred over RBAR - especially when the tables grow larger and/or you need to update many rows.
But does that always hold? I have experienced quite a few situations - on different hardware - where set-based processing shows exponential growth in time consumption, while splitting the same workload into smaller chunks gives linear growth.
I think it would be interesting either to be proven totally wrong - if I'm missing something obvious - or, if not, it would be very good to know when splitting the workload is worth the effort. And subsequently identifying what indicators help make the decision of which approach to use. I'm personally expecting the following components to be interesting:
Size of workload
Size and growth of logfile
Amount of RAM
Speed of disksystem
Any other? Number of CPUs/CPU cores?
Example 1: I have a 12 million row table and I have to update one or two fields in each row with data from another table. If I do this in one simple UPDATE, it takes ~30 minutes on my test box. But I'll be done in ~24 minutes if I split this into twelve chunks, i.e.:
WHERE <key> BETWEEN 0 AND 1000000
WHERE <key> BETWEEN 1000000 AND 2000000
...
Example 2: A 200+ million row table that also needs several calculations done on practically all rows. If I do the full set all in one go, my box will run for three days and still not be done. If I write a simple C# program to execute the exact same SQL, but with WHERE clauses appended to limit the transaction size to 100k rows at a time, it is done in ~14 hours.
For the record: My results are from the same databases, resting on the same physical hardware, with statistics updated, no changes in indexes, Simple recovery model, etc.
And no, I haven't tried 'true' RBAR, although I probably should - even though it would only be to see how long that would really take.
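For reference, here is a minimal sketch of the chunked approach described in the examples above, written in Python with the standard-library sqlite3 module as a stand-in for SQL Server. The `big` and `lookup` tables and the helper names are hypothetical, and the table is assumed to be non-empty. One detail worth noting: BETWEEN is inclusive on both ends, so ranges like 0-1000000 and 1000000-2000000 touch the boundary key twice; the sketch uses half-open ranges (>= low AND < high) to avoid that.

```python
import sqlite3


def key_ranges(min_key, max_key, chunk_size):
    """Yield half-open (low, high) ranges covering min_key..max_key inclusive."""
    low = min_key
    while low <= max_key:
        yield low, low + chunk_size
        low += chunk_size


def update_in_chunks(conn, chunk_size):
    """Run the same set-based UPDATE once per key range, committing each chunk."""
    min_key, max_key = conn.execute("SELECT MIN(id), MAX(id) FROM big").fetchone()
    touched = 0
    for low, high in key_ranges(min_key, max_key, chunk_size):
        cur = conn.execute(
            "UPDATE big SET val = (SELECT new_val FROM lookup WHERE lookup.id = big.id) "
            "WHERE id >= ? AND id < ?",
            (low, high),
        )
        touched += cur.rowcount
        conn.commit()  # keeps each transaction (and its log usage) small
    return touched


# Hypothetical demo data: 1200 rows updated in 12 chunks of 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, val INTEGER)")
conn.execute("CREATE TABLE lookup (id INTEGER PRIMARY KEY, new_val INTEGER)")
conn.executemany("INSERT INTO big VALUES (?, 0)", [(i,) for i in range(1, 1201)])
conn.executemany("INSERT INTO lookup VALUES (?, ?)", [(i, i * 10) for i in range(1, 1201)])
conn.commit()
touched = update_in_chunks(conn, chunk_size=100)
total = conn.execute("SELECT SUM(val) FROM big").fetchone()[0]
```

Each chunk is still a set-based statement; only the transaction size changes, which is the variable the question is really about.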
No, there is no rule that set-based is always faster. We have cursors for a reason (and don't be fooled into believing that a while loop or some other type of looping is really all that different from a cursor). Itzik Ben-Gan has demonstrated a few cases where cursors are much better, particularly for running totals problems. There are also cases you describe where you're trying to update 12 million rows and due to memory constraints, log usage or other reasons it's just too much for SQL to handle as a single operation without having to spill to tempdb, or settle on a sub-optimal plan from early termination due to not getting a more optimal plan quick enough.
One of the reasons cursors get a bad rap is that people are lazy and just say:
DECLARE c CURSOR FOR SELECT ...
When they almost always should be saying:
DECLARE c CURSOR
LOCAL FORWARD_ONLY STATIC READ_ONLY
FOR SELECT ...
This is because those extra keywords make the cursor more efficient for various reasons. Based on the documentation you would expect some of those options to be redundant, but in my testing this is not the case. See this blog post of mine and this blog post from fellow SQL Server MVP Hugo Kornelis for more details.
That all said, in most cases your best bet is going to be set-based (or at least chunky set-based as you described above). But for one-off admin tasks (which I hope your 12-million row update is), it is sometimes easier / more efficient to just write a cursor than to spend a lot of effort constructing an optimal query that produces an adequate plan. For queries that will be run a lot as normal operations within the scope of the application, those are worth more effort to try to optimize as set-based (keeping in mind that you may still end up with a cursor).