Speeding up deletes that have joins

Speeding up deletes that have joins - sql

i am running a stored procedure to delete data from two tables:
delete from TESTING_testresults
from TESTING_testresults
inner join TESTING_QuickLabDump
on TESTING_QuickLabDump.quicklabdumpid = TESTING_TestResults.quicklabdumpid
where TESTING_quicklabdump.[Specimen ID]=#specimen
delete from TESTING_QuickLabDump
from TESTING_Quicklabdump
where [specimen id]=#specimen
one table is 60m rows and the other is about 2m rows
the procedure takes about 3 seconds to run.
is there any way i can speed this up? perhaps using EXISTS?
meaning IF EXISTS...THEN DELETE - because the delete should not be occurring every single time
something like this
if #specimen exists in TESTING_QuickLabDump then do the procedure with the two deletes
thank you !!!

Rewriting the query probably wont help speeding this up. Use the profiler to find out which parts of the query are slow. For this, make it profiler output the execution plan. Then, try adding appropriate indexes. Perhaps one or both tables could use an index over [specimen id].

For a table with 60 mil rows I would definitely look into partitioning the data horizontally and/or vertically. If it's time-sensitive data then you ought to be able to move old data into a history table. That's usually the first and most obvious thing people do so I would imagine if that were a possibility you would have already done it.
If there are many columns then it would definitely benefit you to denormalize the data into multiple tables. If you did this, I would suggest renaming the tables and creating a view of all the partitioned tables named after the original table. Doing that should ensure existing code isn't broken.
If you 'really' want to fine tune the speed then you should look into getting a faster hard drive and learn a little about hard drives work. Whether the data is stored towards the inner or outer section of the hd will affect speed of access slightly for example. And solid state hard drives have come a long way so you might look into getting one of those.

Beside indexing "obvious" fields, also look in your database schema and check if you have any FOREIGN KEYs whose ON DELETE CASCADE or SET NULL might be triggered by your delete (unlike Oracle, MS SQL Server will tend to show these in the execution plan). Fortunately, this is usually fairly easy to fix by indexing the child endpoint of the FOREIGN KEY.
Also check if you have any expensive triggers.

Related

SQL Server Performance Tuning of large table

I have a table of 755 columns and around holding 2 million records as of now and it will grow.There are many procedures accessing it with other tables join, are running slow. Now it's hard to split/normalize them as everything is already built and customer is not ready to spend much on it. Is there any way to make the query access to that table faster? Please advise.
Will column store index help?

How little are they prepared to spend?
It may be possible to split this table into multiple 1 to 1 joined tables (vertical partitioning), then use a view to present it as one single blob to existing code.
With some luck you may get join elimination happening frequently enough to make it worthwhile.
View will probably require INSTEAD OF triggers to fully replicate existing logic. INSTEAD OF triggers have a number of restrictions e.g. no support for OUTPUT clause, which can prove to be to hard to overcome depending on your specific setup.
You can name your view the same as existing table, which will eliminate the need of fixing code everywhere.
IMO this is the simplest you can do short of a full DB re-factoring exercise.
See: http://aboutsqlserver.com/2010/09/15/vertical-partitioning-as-the-way-to-reduce-io/ and https://logicalread.com/sql-server-optimizer-may-eliminate-foreign-key-joins-mc11/#.WXgEzlERW6I

755 Columns thats a lot. You should try to index the columns that are mostly used in where clause. this might speed up the process

It is fine, dont worry about it, actually how many columns you have it is not important in sql server (But be careful I said 'have'). The main problem is data count and how many column you select in queries. There is a few point firstly you can check.
Do not use * selector and change it if used in everywhere
In the joins, do not use it directly, you can firstly filter it as inner select. (Just try it, I have no idea about your table so I m telling the general rules.)
Try the diminish data count for ex: use history table for old records. This technicque depends on needs of your organization.
Try to use column index and something like that features.
And of course remove dynamic selects in your queries.
I wish one of them will work.

I want to create a copy of a table, fewer columns than original... but still updated from it

I have a 30gb table, which has 30-40 columns. I create reports using this table and it causes performance problems. I just use 4-5 columns of this table for the reports. So that, I want to create a second table for the reports. But the second table must be updated when the original table is changed without using triggers.
No matter what my query is, When the query is executed, sql tries to cache all 30gb. When the cache is fully loaded, sql starts to use disk. Actually I want to aviod this
How can I do this?
Is there a way of doing this using ssis
thanks in advance

CREATE VIEW myView
AS
SELECT
column1,
column3,
column4 * column7
FROM
yourTable
A view is effectively just a stored query, like a macro. You can then select from that view as if it were a normal table.
Unless you go for matierialised views, it's not really a table, it's just a query. So it won't speed anything up, but it does encapsulate code and assist in controlling what data different users/logins can read.

If you are using SQL Server, what you want is an indexed view. Create a view using the column you want and then place an index on them.
An indexed view stores the data in the view. It should keep the view up-to-date with the underlying table, and it should reduce the I/O for reading the table. Note: this assumes that your 4-5 columns are much narrower than the overall table.

Dems answer with the view seems ideal, but if you are truly looking for a new table, create it and have it automatically updated with triggers.
Triggers placed on the primary table can be added for all Insert, Update and Delete actions upon it. When the action happens, the trigger fires and can be used to do additional function... such as update your new secondary table. You will pull from the Inserted and Deleted tables (MSDN)
There are many great existing articles here on triggers:
Article 1, Article 2, Google Search

You can create that second table just like you're thinking, and use triggers to update table 2 whenever table 1 is updated.
However, triggers present performance problems of their own; the speed of your inserts and updates will suffer. I would recommend looking for more conventional alternatives to improve query performance, which sounds like SQL Server since you mentioned SSIS.
Since it's only 4-5 out of 30 columns, have you tried adding an index which covers the query? I'm not sure if there are even more columns in your WHERE clause, but you should try that first. A covering index would actually do exactly what you're describing, since the table would never need to be touched by the query. Of course, this does cost a little in terms of space and insert/update performance. There's always a tradeoff.
On top of that, I can't believe that you would need to pull a large percentage of rows for any given report out of a 30 gb table. It's simply too much data for a report to have. A filtered index can improve query performance even more by only indexing the rows that are most likely to be asked for. If you have a report which lists the results for the past calendar month, you could add a condition to only index the rows WHERE report_date > '5/1/2012' for example.

sql: DELETE + INSERT vs UPDATE + INSERT

A similar question has been asked, but since it always depends, I'm asking for my specific situation separately.
I have a web-site page that shows some data that comes from a database, and to generate the data from that database I have to do some fairly complex multiple joins queries.
The data is being updated once a day (nightly).
I would like to pre-generate the data for the said view to speed up the page access.
For that I am creating a table that contains exact data I need.
Question: for my situation, is it reasonable to do complete table wipe followed by insert? or should I do update,insert?
SQL wise seems like DELETE + INSERT will be easier (INSERT part is a single SQL expression).
EDIT: RDBMS: MS SQL Server 2008 Ent

TRUNCATE will be faster than delete, so if you need to empty a table do that instead
You didn't specify your RDBMS vendor but some of them also have MERGE/UPSERT commands This enables you do update the table if the data exists and insert if it doesn't

It partly depends on how the data is accessed. If you have a period of time with no (or very few) users accessing it, then there won't be much impact on the data disappearing (between the DELETE and the completion of the INSERT) for a short while.

Have you considered using a materialized view (MSSQL calls them indexed views) instead of doing it manually? This could also have other performance benefits as an indexed view gives the query optimizer more choices when its constructing execution plans for other queries that reference the table(s) in the view.

It depends on the size of the table and the recovery model on the database. If you are deleting many hundreds of thousands of records and reinstating them vs updating a small batch of a few hundred and inserting tens of rows, it will add an unnecessary size to your transaction logs. However you could use TRUNCATE to get around this as it won't affect the transaction log.
Do you have the option of a MERGE/UPSERT? If you're using MS-SQL you can use CROSS APPLY to do something similar if you don't.

One approach to handling this type of problem is to insert into new table, then do a table Rename. This will insure that all new data is present at the same time.

What if some data that was present yesterdays is not anymore? Delete may be safer or you could end up deleting some records anyway.
And in the end it doesnt really matter which way you go.
Unless on the case #kevinw mentioned

Although I fully agree with SQLMenace's answer I do would like to point out that MERGE does NOT remove unneeded records ! If you're sure that your new data will be a super-set of the existing data, then MERGE is great, otherwise you'll either need to make sure that you delete any superfluous records later on, or use the TRUNCATE + INSERT method ...
(Personally I'm still a fan of the latter as it usually is quite fast, just make sure to drop all indexes/unique constraints upfront and rebuild them one by one. This has the benefit of the INSERT transaction being smaller and the index-adding being done in (smaller) transactions again later on). (**)
(**: yes, this might be tricky on live system, but then again he already mentioned this was done during some kind of overnight anyway, I'm extrapolating there is no user-access at that time)

Oracle LEADING hint -- why is this required?

Suddenly (but unfortunately I don't know when "suddenly" was; I know it ran fine at some point in the past) one of my queries started taking 7+ seconds instead of milliseconds to execute. I have 1 local table and 3 tables being accessed via a DB link. The 3 remote tables are joined together, and one of them is joined with my local table.
The local table's where clause only takes a few millis to execute on its own, and only returns a few (10's or 100's at the most) records. The 3 remote tables have many hundreds of thousands, possibly millions, of records between them, and if I join them appropriately I get tens or hundreds of thousands of records.
I am only joining with the remote tables so that I can pull out a few pieces of data related to each record in my local table.
What appears to be happening, however, is that Oracle joins the remote tables together first and then my local table to that mess at the end. This is always going to be a bad idea, especially given the data set that exists right now, so I added a /*+ LEADING(local_tab remote_tab_1) */ hint to my query and it now returns in milliseconds.
I compared the explain plans and they are almost identical, save for a single BUFFER SORT on one of the remote tables.
I'm wondering what might cause Oracle to approach this the wrong way? Is it an index issue? What should I be looking for?

When choosing an execution plan, oracle estimates costs for the different plans. One crucial information for that estimate is the amount of rows will get returned from a step of the execution plan. Oracle tries to estimate those using 'statistics', i.e. information about how many rows a table contains, how many different values a column contains; How evenly these values are distributed.
These statistics are just that statistics, and they might be wrong, which is one of the most important reasons for misjudgments of the oracle optimizer.
So gathering new statistics as described in a comment might help. Have a look at the documentation on that dbms_stats package. There are many different ways to call that package.

A common problem I've come across is a query that joins many tables, where the joins form a chain from one end to another, e.g.:
SELECT *
FROM tableA, tableB, tableC, tableD, tableE
WHERE tableA.ID0 = :bind1
AND tableA.ID1 = tableB.ID1
AND tableB.ID2 = tableC.ID2
AND tableC.ID3 = tableD.ID3
AND tableD.ID4 = tableE.ID4
AND tableE.ID5 = :bind2;
Notice how the optimiser might choose to drive the query from tableA (e.g. if the index on ID0 is nicely selective) or from tableE (if the index on tableE.ID5 is more selective).
The statistics on the tables might cause the choice between these two plans to balance on a knife-edge; one day it's working fine (driving from tableA), next day new stats are gathered and all of a sudden the alternative plan driving from tableE has a lower cost and is chosen.
In this circumstance, adding a LEADING hint is one way to nudge it back to the original plan (i.e. drive from tableA) without dictating too much to the optimiser (i.e. it doesn't force the optimiser to choose any particular join methods).

You're doing distributed query optimization, and that's a tricky beast. It could be that the your table's statistics are current, but now the tables at the remote system are out-of-whack or have changed. Or the remote system added/removed/modified indexes, and that broke your plan. (This is an excellent reason to consider replication -- so you can control indexes and statistics against it.)
That said, Oracle's estimate of cardinality is a primary driver in execution plan. A 10053 trace analysis (Jonathan Lewis' Cost-Based Oracle Fundamentals book has wonderful examples from 8i to 10.1) can help shed light on why your statement's now broken and how the LEADING hint fixes it.
The DRIVING_SITE hint might be a better choice if you know you always want the local tables to be joined first before going after the remote site; it clarifies your intention without driving the plan the way a LEADING hint would.

Might not be relevant but I had a similar situation once where the remote table had been replaced by a single-table view. When it was a table the distributed query optimizer 'saw' that it had an index. When it became a view it couldn't see the index anymore and couldn't cost a plan that used an index on the remote object.
That was a few years ago. I documented my analysis at the time here.

RI,
It's hard to be sure about the cause of the performance problems without seeing the SQL.
When an Oracle query was performing well before, and suddenly starts performing badly, it is usually related to one of two issues:
A) Statistics are out of date. This is the easiest and quickest thing to check, even if you have a housekeeping batch process that's supposed to take care of it ... always double-check.
B) Data volume / data pattern change.
In your case, running a distributed query across multiple databases makes it 10x harder for Oracle to manage performance between them. Is it possible to put these tables in one database, perhaps separate schema owners in one database?
Hints are notoriously fragile, as Oracle is under no obligations to follow the hint. When the data volume or pattern changes some more, Oracle may just ignore the hint and do what it thinks is best (ie. worst ;-).
If you cannot put these tables all in one database, then I recommend you look to break your query up into two statements:
INSERT on sub-SELECT to copy external data to a global temporary table in your current database.
SELECT from the global temporary table to join with your other table.
You will have complete control over performance of step 1 above without resorting to hints. This approach typically scales well, providing you take time to do the performance tuning. I've seen this approach solve many complex performance problems.
The overhead for Oracle to create a whole new table, or insert a heap of records, is much smaller than most people expect. Defining a global temporary table further reduces that overhead.
Matthew

How do you optimize tables for specific queries?

What are the patterns you use to determine the frequent queries?
How do you select the optimization factors?
What are the types of changes one can make?

This is a nice question, if rather broad (and none the worse for that).
If I understand you, then you're asking how to attack the problem of optimisation starting from scratch.
The first question to ask is: "is there a performance problem?"
If there is no problem, then you're done. This is often the case. Nice.
On the other hand...
Determine Frequent Queries
Logging will get you your frequent queries.
If you're using some kind of data access layer, then it might be simple to add code to log all queries.
It is also a good idea to log when the query was executed and how long each query takes. This can give you an idea of where the problems are.
Also, ask the users which bits annoy them. If a slow response doesn't annoy the user, then it doesn't matter.
Select the optimization factors?
(I may be misunderstanding this part of the question)
You're looking for any patterns in the queries / response times.
These will typically be queries over large tables or queries which join many tables in a single query. ... but if you log response times, you can be guided by those.
Types of changes one can make?
You're specifically asking about optimising tables.
Here are some of the things you can look for:
Denormalisation. This brings several tables together into one wider table, so in stead of your query joining several tables together, you can just read one table. This is a very common and powerful technique. NB. I advise keeping the original normalised tables and building the denormalised table in addition - this way, you're not throwing anything away. How you keep it up to date is another question. You might use triggers on the underlying tables, or run a refresh process periodically.
Normalisation. This is not often considered to be an optimisation process, but it is in 2 cases:
updates. Normalisation makes updates much faster because each update is the smallest it can be (you are updating the smallest - in terms of columns and rows - possible table. This is almost the very definition of normalisation.
Querying a denormalised table to get information which exists on a much smaller (fewer rows) table may be causing a problem. In this case, store the normalised table as well as the denormalised one (see above).
Horizontal partitionning. This means making tables smaller by putting some rows in another, identical table. A common use case is to have all of this month's rows in table ThisMonthSales, and all older rows in table OldSales, where both tables have an identical schema. If most queries are for recent data, this strategy can mean that 99% of all queries are only looking at 1% of the data - a huge performance win.
Vertical partitionning. This is Chopping fields off a table and putting them in a new table which is joinned back to the main table by the primary key. This can be useful for very wide tables (e.g. with dozens of fields), and may possibly help if tables are sparsely populated.
Indeces. I'm not sure if your quesion covers these, but there are plenty of other answers on SO concerning the use of indeces. A good way to find a case for an index is: find a slow query. look at the query plan and find a table scan. Index fields on that table so as to remove the table scan. I can write more on this if required - leave a comment.
You might also like my post on this.

That's difficult to answer without knowing which system you're talking about.
In Oracle, for example, the Enterprise Manager lets you see which queries took up the most time, lets you compare different execution profiles, and lets you analyze queries over a block of time so that you don't add an index that's going to help one query at the expense of every other one you run.

Your question is a bit vague. Which DB platform?
If we are talking about SQL Server:
Use the Dynamic Management Views. Use SQL Profiler. Install the SP2 and the performance dashboard reports.
After determining the most costly queries (i.e. number of times run x cost one one query), examine their execution plans, and look at the sizes of the tables involved, and whether they are predominately Read or Write, or a mixture of both.
If the system is under your full control (apps. and DB) you can often re-write queries that are badly formed (quite a common occurrance), such as deep correlated sub-queries which can often be re-written as derived table joins with a little thought. Otherwise, you options are to create covering non-clustered indexes and ensure that statistics are kept up to date.

For MySQL there is a feature called log slow queries
The rest is based on what kind of data you have and how it is setup.

In SQL server you can use trace to find out how your query is performing. Use ctrl + k or l
For example if u see full table scan happening in a table with large number of records then it probably is not a good query.
A more specific question will definitely fetch you better answers.

If your table is predominantly read, place a clustered index on the table.

My experience is with mainly DB2 and a smattering of Oracle in the early days.
If your DBMS is any good, it will have the ability to collect stats on specific queries and explain the plan it used for extracting the data.
For example, if you have a table (x) with two columns (date and diskusage) and only have an index on date, the query:
select diskusage from x where date = '2008-01-01'
will be very efficient since it can use the index. On the other hand, the query
select date from x where diskusage > 90
would not be so efficient. In the former case, the "explain plan" would tell you that it could use the index. In the latter, it would have said that it had to do a table scan to get the rows (that's basically looking at every row to see if it matches).
Really intelligent DBMS' may also explain what you should do to improve the performance (add an index on diskusage in this case).
As to how to see what queries are being run, you can either collect that from the DBMS (if it allows it) or force everyone to do their queries through stored procedures so that the DBA control what the queries are - that's their job, keeping the DB running efficiently.

indices on PKs and FKs and one thing that always helps PARTITIONING...

1. What are the patterns you use to determine the frequent queries?
Depends on what level you are dealing with the database. If you're a DBA or a have access to the tools, db's like Oracle allow you to run jobs and generate stats/reports over a specified period of time. If you're a developer writing an application against a db, you can just do performance profiling within your app.
2. How do you select the optimization factors?
I try and get a general feel for how the table is being used and the data it contains. I go about with the following questions.
Is it going to be updated a ton and on what fields do updates occur?
Does it have columns with low cardinality?
Is it worth indexing? (tables that are very small can be slowed down if accessed by an index)
How much maintenance/headache is it worth to have it run faster?
Ratio of updates/inserts vs queries?
etc.
3. What are the types of changes one can make?
-- If using Oracle, keep statistics up to date! =)
-- Normalization/De-Normalization either one can improve performance depending on the usage of the table. I almost always normalize and then only if I can in no other practical way make the query faster will de-normalize. A nice way to denormalize for queries and when your situation allows it is to keep the real tables normalized and create a denormalized "table" with a materialized view.
-- Index judiciously. Too many can be bad on many levels. BitMap indexes are great in Oracle as long as you're not updating the column frequently and that column has a low cardinality.
-- Using Index organized tables.
-- Partitioned and sub-partitioned tables and indexes
-- Use stored procedures to reduce round trips by applications, increase security, and enable query optimization without affecting users.
-- Pin tables in memory if appropriate (accessed a lot and fairly small)
-- Device partitioning between index and table database files.
..... the list goes on. =)
Hope this is helpful for you.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas