Composite index is ignored when selecting unindexed columns (Oracle)

We have a simple join statement in which some of the where clauses may turn into IS NULL checks. The statement is generated by an application.
A problem with the query plan arises when we have this IS NULL constraint. We followed the approach described in an article on StackExchange and created a composite index on two columns: the nullable one and the one we join on. It helps only if we select indexed columns alone; if we select unindexed columns, the index is ignored, even though the query result is the same (no rows selected).
The only option we see is to change the application logic, but maybe there is still a way to solve this at the database level?
--Illustrative sample. Prepare tables and indexes:
create table tableA
(
Acol1 varchar2(32) NOT NULL,
Acol2 varchar2(32),
Acol3 varchar2(32)
);
insert into tableA (Acol1, Acol2, Acol3)
values ('abcd1','abcd2A','abcd3A');
create table tableB
(
Bcol1 varchar2(32) NOT NULL,
Bcol2 varchar2(32),
Bcol3 varchar2(32)
);
insert into tableB (Bcol1, Bcol2, Bcol3)
values ('abcd1','abcd2B','abcd3B');
create index tableA_col12 on tableA (acol1, acol2);
create index tableB_col1 on tableB (Bcol1);
commit;
Then we check the plans:
1.
select a.Acol1 from tableA a join tableB b on a.Acol1 = b.Bcol1 where Acol2 is null;
--no rows selected
Plan1 - Range scan
2.
select * from tableA a join tableB b on a.Acol1 = b.Bcol1 where Acol2 is null;
--no rows selected
Plan2 (same link above) - Full table scan
What would be the best way to improve performance: change the queries, use smarter indexes, or apply a fixed plan?
*Update* While I was preparing this question, the plan for my sample changed by itself; now we get Plan2* instead of Plan2, with no full table scan. However, if I recreate the sample (drop the tables and prepare them again), the plan is Plan2 again (full table scan). This does not happen in the actual DB.

My comment was a bit flippant, so I will try to give you some more detail. Here are some general tips for getting better at optimizing SQL systems and specific queries.
First of all, @GordonLinoff is right (as always): you won't get anything meaningful out of a tiny table. The optimizer knows this and will work differently.
Second, after you have a decent-sized table (at least 50k rows, depending on your memory), you need to make sure you run statistics on your tables, or the optimizer (and the indexes) just won't work.
Third, you need to use the tools: learn how to understand an execution plan -- you can't get better at these techniques without a deep understanding of what the system is telling you. Modern SQL databases have tools that will look at a query and suggest indexes -- use them; just like the execution plan, you can learn a lot from them. Remember, these tools are not foolproof: you need to try out the suggestions and see if they work.
Finally, read a lot. One source that I think is particularly interesting is Stack Overflow user Quassnoi, who has a blog at explainextended. While not as active recently, this blog (and many of his answers) is quite illuminating; I expect you will enjoy it. There are many blogs and books on the subject, and every bit helps.
In this case, for your bigger table, I think (with the caveat that there are many things about your DB and data model I don't know) just adding more columns to the index will work -- but use the Oracle tool and see what it suggests. Try that first.
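To make the statistics and covering-index tips concrete against your sample, here is a sketch (it assumes the sample tables above; adjust the names and column list for your real schema):
--Sketch only: gather optimizer statistics on the sample tables
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLEA');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLEB');
END;
/
--A "covering" index: with Acol3 included, query 2 can be answered from
--the index alone, so the optimizer no longer needs the full table scan
create index tableA_col123 on tableA (Acol1, Acol2, Acol3);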

Related

Will a SQL DELETE with a sub query execute inefficiently if there are many rows in the source table?

I am looking at an application and I found this SQL:
DELETE FROM Phrase
WHERE PhraseId NOT IN(SELECT Id FROM PhraseSource)
The intention of the SQL is to delete rows from Phrase that are not in the PhraseSource table.
The two tables are identical and have the following structure
Id - GUID primary key
...
...
...
Modified int
the ... columns are about ten columns containing text and numeric data. The PhraseSource table may or may not contain more recent rows with a higher number in the Modified column and different text and numeric data.
Can someone tell me: will this query execute the SELECT Id FROM PhraseSource for every row in the Phrase table? If so, is there a more efficient way this could be coded?
1. Will this query execute the SELECT Id from PhraseSource for every row?
No.
In SQL you express what you want to do, not how you want it to be done [1]. The engine will create an execution plan to do what you want in the most performant way it can.
For your query, executing the query for each row is not necessary. Instead the engine will create an execution plan that executes the subquery once, then does a left anti-semi join to determine what IDs are not present in the PhraseSource table.
You can verify this when you include the Actual Execution Plan in SQL Server Management Studio.
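For example, a quick way to see the estimated plan from a query window (a sketch, assuming SQL Server Management Studio or sqlcmd):
SET SHOWPLAN_ALL ON;
GO
-- The statement below is not executed while SHOWPLAN_ALL is ON;
-- SQL Server returns the plan rows instead.
DELETE FROM Phrase WHERE PhraseId NOT IN (SELECT Id FROM PhraseSource);
GO
SET SHOWPLAN_ALL OFF;
GO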
2. Is there a more efficient way that this could be coded?
A little bit more efficient, as follows:
DELETE p
FROM Phrase AS p
WHERE NOT EXISTS (
    SELECT 1
    FROM PhraseSource AS ps
    WHERE ps.Id = p.PhraseId
);
This has been shown in tests done by user Aaron Bertrand on sqlperformance.com: Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?:
Conclusion
[...] for the pattern of finding all rows in table A where some condition does not exist in table B, NOT EXISTS is typically going to be your best choice.
Another benefit of using NOT EXISTS with a correlated subquery is that it does not have problems when PhraseSource.Id can be NULL. I suggest you read up on IN/NOT IN vs NULL values in the subquery. E.g. you can read more about that on sqlbadpractices.com: Using NOT IN operator with null values.
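A quick illustration of that pitfall, with hypothetical literal values:
-- If the subquery's result contains a NULL, NOT IN never evaluates to TRUE:
-- 3 NOT IN (1, 2, NULL) is 3<>1 AND 3<>2 AND 3<>NULL, which is UNKNOWN.
SELECT CASE WHEN 3 NOT IN (1, 2, NULL) THEN 'would delete' ELSE 'kept' END;
-- returns 'kept'; a NOT EXISTS rewrite is unaffected by the NULL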
The PhraseSource.Id column is probably not nullable in your schema, but I prefer using a method that is resilient in all possible schemas.
[1] Exceptions exist when forcing the engine to use a specific path, e.g. with Table Hints or Query Hints. The engine doesn't always get things right.
In this case the subquery could be evaluated for each row if the database system is not smart enough (but in the case of MS SQL Server, I suppose it should be able to recognize that the subquery does not need to be evaluated more than once).
Still there is a better solution:
DELETE p
FROM Phrase p
LEFT JOIN PhraseSource ps ON ps.Id = p.PhraseId
WHERE ps.Id IS NULL
This uses the LEFT JOIN which matches the rows of both tables, but in case there is no match it leaves the ps entry NULL. Now you just check for NULLs on the left side to see which Phrases do not have a match and will delete those.
All types of JOIN statements are very nicely described in this answer.
Here you can see three different approaches for a similar issue compared on MySQL. As @Drammy mentions, to actually see the performance of a given approach, you could look at the execution plan on your target database and do performance testing on the different approaches to the same problem.
That query should optimise into a join. Have you looked at the execution plan?
If you're experiencing poor performance it is likely because of the guid primary keys.
A primary key is clustered by default. If the guid primary key is clustered on your table that means the data in the tables is ordered by the primary key. The problem with guids as clustered keys is that when you delete one record the table has to be reordered and shuffled around on disk.
This article is a good read on the topic:
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
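If the GUIDs must stay, one option is to keep the primary key but make it nonclustered, and/or generate sequential GUIDs. A sketch with assumed table and column names (only PhraseId and Modified are taken from the question):
CREATE TABLE dbo.PhraseExample (
    PhraseId uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
    Modified int NOT NULL,
    -- NONCLUSTERED keeps deletes from reshuffling the data pages;
    -- NEWSEQUENTIALID() avoids page splits on insert
    CONSTRAINT PK_PhraseExample PRIMARY KEY NONCLUSTERED (PhraseId)
);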

Join to SELECT vs. Join to Tableset

For the DB gurus out there, I was wondering if there is any functional/performance difference between joining to the results of a SELECT statement and joining to a previously filled table variable. I'm working in SQL Server 2008 R2.
Example (TSQL):
-- Create a test table
IF OBJECT_ID('dbo.TestTable', 'U') IS NOT NULL DROP TABLE [dbo].[TestTable]
CREATE TABLE [dbo].[TestTable](
[id] [int] NOT NULL,
[value] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the test table with a few rows
INSERT INTO [dbo].[TestTable]
SELECT 1123, 'test1'
INSERT INTO [dbo].[TestTable]
SELECT 2234, 'test2'
INSERT INTO [dbo].[TestTable]
SELECT 3345, 'test3'
-- Create a reference table
IF OBJECT_ID('dbo.TestRefTable', 'U') IS NOT NULL DROP TABLE [dbo].[TestRefTable]
CREATE TABLE [dbo].[TestRefTable](
[id] [int] NOT NULL,
[refvalue] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the reference table with a few rows
INSERT INTO [dbo].[TestRefTable]
SELECT 1123, 'ref1'
INSERT INTO [dbo].[TestRefTable]
SELECT 2234, 'ref2'
-- Scenario 1: Insert matching results into its own table variable, then Join
-- Create a table variable
DECLARE @subset TABLE ([id] INT NOT NULL, [refvalue] VARCHAR(MAX))
INSERT INTO @subset
SELECT * FROM [dbo].[TestRefTable]
WHERE [dbo].[TestRefTable].[id] = 1123
SELECT t.*, s.*
FROM [dbo].[TestTable] t
JOIN @subset s
ON t.id = s.id
-- Scenario 2: Join directly to SELECT results
SELECT t.*, s.*
FROM [dbo].TestTable t
JOIN (SELECT * FROM [dbo].[TestRefTable] WHERE id = 1123) s
ON t.id = s.id
In the "real" world, the tables and table variable are pre-defined. What I'm looking at is being able to have the matched reference rows available for further operations, but I'm concerned that the extra steps will slow the query down. Are there technical reasons as to why one would be faster than the other? What sort of performance difference may be seen between the two approaches? I realize it is difficult (if not impossible) to give a definitive answer, just looking for some advice for this scenario.
The database engine has an optimizer to figure out the best way to execute a query. There is more under the hood than you probably imagine. For instance, when SQL Server is doing a join, it has a choice of at least four join algorithms:
Nested Loop
Index Lookup
Merge Join
Hash Join
(not to mention the multi-threaded versions of these.)
It is not important that you understand how each of these works. You just need to understand two things: different algorithms are best under different circumstances and SQL Server does its best to choose the best algorithm.
The choice of join algorithm is only one thing the optimizer does. It also has to figure out the ordering of the joins, the best way to aggregate results, whether a sort is needed for an order by, how to access the data (via indexes or directly), and much more.
When you break the query apart, you are making an assumption about optimization. In your case, you are making the assumption that the first best thing is to do a select on a particular table. You might be right. If so, your result with multiple queries should be about as fast as using a single query. Well, maybe not. When in a single query, SQL Server does not have to buffer all the results at once; it can stream results from one place to another. It may also be able to take advantage of parallelism in a way that splitting the query prevents.
In general, the SQL Server optimizer is pretty good, so you are best letting the optimizer do the query all in one go. There are definitely exceptions, where the optimizer may not choose the best execution path. Sometimes fixing this is as easy as being sure that statistics are up-to-date on tables. Other times, you can add optimizer hints. And other times you can restructure the query, as you have done.
For instance, one place where loading data into a local table is useful is when the table comes from a different server. The optimizer may not have full information about the size of the table to make the best decisions.
In other words, keep the query as one statement. If you need to improve it, then focus on optimization after it works. You generally won't have to spend much time on optimization, because the engine is pretty good at it.
Would this give the same result?
SELECT t.*, s.*
FROM dbo.TestTable AS t
JOIN dbo.TestRefTable AS s ON t.id = s.id AND s.id = 1123
Basically, this joins every record from TestTable against the TestRefTable records with id = 1123.
Joining to table variables will also result in bad cardinality estimates by the optimizer. The optimizer always assumes a table variable contains only a single row, and the more rows it actually has, the worse that estimate becomes. This not only gives the wrong row count for the table variable itself; for operators elsewhere in the plan that join to that result, it also leads to wrong estimates of how many times those operations execute.
Personally, I think table parameters should be used for getting data into and out of the server conveniently from client apps (C# .NET apps make good use of them), or for passing data between stored procs, but should not be used too much within the proc itself. The importance of getting rid of them within the proc code increases with the expected number of rows carried by the parameter.
Sub-selects will perform better, or immediately copying into a temp table will work well. There is overhead in copying into the temp table, but the more rows you have, the more worthwhile that overhead becomes, because the optimizer's estimates get worse and worse.
In general, a derived table in the query is probably going to be faster than joining to a table variable, because it can make use of indexes, which are not available on table variables. However, temp tables can have indexes created on them, and that might remove the potential performance difference.
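For instance, the table-variable step from the question could be rewritten with a temp table, so the optimizer gets real statistics and an index; a sketch using the sample tables above:
CREATE TABLE #subset (
    [id] INT NOT NULL PRIMARY KEY,  -- the PK gives us a usable index
    [refvalue] VARCHAR(MAX) NULL
);
INSERT INTO #subset ([id], [refvalue])
SELECT [id], [refvalue] FROM [dbo].[TestRefTable] WHERE [id] = 1123;
SELECT t.*, s.*
FROM [dbo].[TestTable] t
JOIN #subset s ON t.id = s.id;
DROP TABLE #subset;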
Also, if the number of table variable records is expected to be small, then indexes won't make a great deal of difference anyway, so there would be little or no difference.
As always, you need to test on your own system, as the number of records, table design, and index design have a great deal to do with what works best.
I'd expect the direct Table join to be faster than the Table to TableVariable, and use less resources.

TSQL join performance

My problem is that this query takes forever to run:
Select
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableB on tableA.CUST_PO_NUMBER like tableB.CustomerMask
Here is the structure of the tables:
CREATE TABLE [dbo].[TableA](
[CUSTOMER_NAME] [varchar](100) NULL,
[CUSTOMER_NUMBER] [varchar](50) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL,
[ORDER_NUMBER] [varchar](30) NOT NULL,
[ORDER_TYPE] [varchar](30) NULL)
CREATE TABLE [dbo].[TableB](
[RuleID] [varchar](50) NULL,
[CustomerMask] [varchar](500) NULL)
TableA has 14 million rows and TableB has 1000 rows. Data in the CustomerMask column can be anything like '%', 'ttt%', '%ttt%', etc.
How can I tune it to make it faster?
Thanks!
The short answer is don't use the LIKE operator to join two tables containing millions of rows. It's not going to be fast, no matter how you tune it. You might be able to improve it incrementally, but it will just be putting lipstick on a pig.
You need to have a distinct value on which to join the tables. Right now it has to do a complete scan of tableA and perform an item-by-item wildcard comparison between CUST_PO_NUMBER and CustomerMask. You're looking at 14 billion comparisons, all using the slow LIKE operator.
The only suggestion I can give is to re-think the architecture of associating rules with Customers.
While you can't change what's already there, you can create a new table like this:
CREATE TABLE [dbo].[TableC](
[CustomerMask] [varchar](500) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL)
Then have a trigger on both TableA and TableB that inserts/updates/deletes records in TableC whenever the condition CUST_PO_NUMBER LIKE CustomerMask changes for them (for the trigger on TableB you only need to update TableC if the CustomerMask field has changed).
Then your query will just become:
SELECT
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableC on tableA.CUST_PO_NUMBER = tableC.CUST_PO_NUMBER
INNER JOIN tableB on tableC.CustomerMask = tableB.CustomerMask
This will greatly improve your query performance and shouldn't greatly affect your write performance. You will basically only be performing the LIKE comparison once for each record (unless it changes).
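As a rough illustration of the TableB trigger (a sketch only; the names follow this answer, and a real version would also need the TableA counterpart plus the has-the-mask-changed check):
CREATE TRIGGER trg_TableB_SyncTableC ON dbo.TableB
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Drop the pre-computed pairs for masks that were changed or removed
    DELETE c
    FROM dbo.TableC c
    JOIN deleted d ON c.CustomerMask = d.CustomerMask;
    -- Evaluate the expensive LIKE once per new/updated mask
    INSERT INTO dbo.TableC (CustomerMask, CUST_PO_NUMBER)
    SELECT i.CustomerMask, a.CUST_PO_NUMBER
    FROM inserted i
    JOIN dbo.TableA a ON a.CUST_PO_NUMBER LIKE i.CustomerMask;
END;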
Just change the join order and it will be faster; use this query:
Select tableA.CUSTOMER_NAME, tableB.CUSTOMER_NUMBER, TableB.RuleID
FROM tableB
INNER JOIN tableA
on tableB.CustomerMask like tableA.CUST_PO_NUMBER
Am I missing something? What about the following:
Select
tableA.CUSTOMER_NAME,
tableA.CUSTOMER_NUMBER,
tableB.RuleID
FROM tableA, tableB
WHERE tableA.CUST_PO_NUMBER = tableB.CustomerMask
EDIT2: Thinking about it, how many of those masks start and end with wildcards? You might gain some performance by first:
Indexing CUST_PO_NUMBER
Creating a persisted computed column CUST_PO_NUMBER_REV that's the reverse of CUST_PO_NUMBER
Indexing the persisted column
Putting statistics on these columns
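A sketch of those four steps (the column name CUST_PO_NUMBER_REV comes from this answer; the index names are made up):
ALTER TABLE dbo.TableA
    ADD CUST_PO_NUMBER_REV AS REVERSE(CUST_PO_NUMBER) PERSISTED;
CREATE INDEX IX_TableA_PO ON dbo.TableA (CUST_PO_NUMBER);
CREATE INDEX IX_TableA_PO_REV ON dbo.TableA (CUST_PO_NUMBER_REV);
UPDATE STATISTICS dbo.TableA;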
Then you might build three queries, and UNION ALL the results together:
SELECT ...
FROM ...
ON CUST_PO_NUMBER LIKE CustomerMask
WHERE /* First character of CustomerMask is not a wildcard but last one is */
UNION ALL
SELECT ...
FROM ...
ON CUST_PO_NUMBER_REV LIKE REVERSE(CustomerMask)
WHERE /* Last character of CustomerMask is not a wildcard but first one is */
UNION ALL
SELECT ...
FROM ...
ON CUST_PO_NUMBER LIKE CustomerMask
WHERE /* Everything else */
That's just a quick example, you'll need to take some care that the WHERE clauses give you mutually exclusive results (or use UNION, but aim for mutually exclusive results first).
If you can do that, you should have two queries using index seeks and one query using index scans.
EDIT: You can implement a sharding system to spread out the customers and customer masks tables across multiple servers and then have each server evaluate 1/n% of the results. You don't need to partition the data -- simple replication of the entire contents of each table will do. Link the servers to your main server and you can do something to the effect of:
SELECT ... FROM OPENQUERY(LinkedServer1, 'SELECT ... LIKE ... WHERE ID BETWEEN 0 AND 99')
UNION ALL
SELECT ... FROM OPENQUERY(LinkedServer2, 'SELECT ... LIKE ... WHERE ID BETWEEN 100 AND 199')
Note: the OPENQUERY may be extraneous; SQL Server might be smart enough to evaluate queries on remote servers and stream the results back. I know it doesn't do that for JET linked servers, but it might handle its own kind better.
That, or throw more hardware at the problem.
You can create an Indexed View of your query to improve performance.
From Designing Indexed Views:
For a standard view, the overhead of dynamically building the result set for each query that references a view can be significant for views that involve complex processing of large numbers of rows, such as aggregating lots of data, or joining many rows. If such views are frequently referenced in queries, you can improve performance by creating a unique clustered index on the view. When a unique clustered index is created on a view, the result set is stored in the database just like a table with a clustered index is stored.
Another benefit of creating an index on a view is that the optimizer starts using the view index in queries that do not directly name the view in the FROM clause. Existing queries can benefit from the improved efficiency of retrieving data from the indexed view without having to be recoded.
This should improve the performance of this particular query, but note that inserts, updates and deletes on the tables it uses may be slowed.
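For the query in question the general shape would be something like the sketch below. Whether SQL Server accepts a LIKE join in an indexed view depends on the indexed-view restrictions, and the unique index assumes each PO/rule pair appears once, so treat this as illustrative only:
CREATE VIEW dbo.vCustomerRules
WITH SCHEMABINDING
AS
SELECT a.CUSTOMER_NAME, a.CUSTOMER_NUMBER, b.RuleID, a.CUST_PO_NUMBER
FROM dbo.TableA a
JOIN dbo.TableB b ON a.CUST_PO_NUMBER LIKE b.CustomerMask;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vCustomerRules
    ON dbo.vCustomerRules (CUST_PO_NUMBER, RuleID);
GO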
You can't use LIKE if you care about performance.
If you are trying to do approximate string matching (e.g. Test and est and best, etc.) and you don't want to use SQL full-text search, take a look at this article.
At least you can shortlist approximate matches then run your wildcard test on them.
--EDIT 2--
Your problem is interesting in the context of your limitation. Thinking about it again, I am pretty sure that using 3-grams would boost the performance (going back to my initial suggestion).
Let's say you set up your 3-gram data; you'll have the following tables (with estimated row counts):
Customer : 14M
Customer3Grams : Maximum 700M //Considering the field is varchar(50)
3Grams : 78
Pattern : 1000
Pattern3Grams : 50K
To join pattern to customer, you then need the following join path:
Pattern x Pattern3Grams x Customer3Grams x Customer
With appropriate indexing (which is easy) each look-up can happen in O(LOG(50K)+LOG(700M)+LOG(14M)), which is about 47.6.
Considering appropriate indexes are present, the whole join can be calculated with fewer than 50,000 look-ups, plus of course the cost of scanning after the look-ups. I expect it to be very efficient (a matter of seconds).
The cost of creating 3-grams for each new customer is also minimal, because at most 50x75 possible 3-grams would be appended to the Customer3Grams table.
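A rough sketch of the trigram tables those counts assume (all names hypothetical):
CREATE TABLE dbo.Customer3Grams (
    CustomerId INT NOT NULL,
    Gram CHAR(3) NOT NULL
);
CREATE INDEX IX_Customer3Grams ON dbo.Customer3Grams (Gram, CustomerId);
CREATE TABLE dbo.Pattern3Grams (
    PatternId INT NOT NULL,
    Gram CHAR(3) NOT NULL
);
CREATE INDEX IX_Pattern3Grams ON dbo.Pattern3Grams (Gram, PatternId);
-- Shortlist: customers sharing at least one trigram with a pattern;
-- the expensive LIKE then runs only against this candidate set.
SELECT DISTINCT c.CustomerId, p.PatternId
FROM dbo.Pattern3Grams p
JOIN dbo.Customer3Grams c ON c.Gram = p.Gram;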
--EDIT--
Depending on your data, I can also suggest hash-based clustering. I assume customer numbers are numbers with some character patterns in them (e.g. 123231ttt3x4). If this is the case, you can create a simple hash function that calculates the result of a bit-wise OR over every letter (not the digits) and add it as an indexed column to your table. You can then filter on the result of the hash before applying LIKE.
Depending on your data, this may cluster it effectively and improve your search by a factor of the number of clusters (number of hashes). You can test this by applying the hash and counting the number of distinct hashes generated.

Many to one joins

Let's say I have the following tables:
create table table_a
(
id_a,
name_a,
primary_key (id_a)
);
create table table_b
(
id_b,
id_a not null, -- (Edit)
name_b,
primary_key (id_b),
foreign_key (id_a) references table_a (id_a)
);
We can create a join view on these tables in a number of ways:
create view join_1 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_a a, table_b b
where a.id_a = b.id_a
);
create view join_2 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_b b left outer join table_a a
on a.id_a = b.id_a
);
create view join_3 as
(
select
b.id_b,
b.id_a,
b.name_b,
(select a.name_a from table_a a where b.id_a = a.id_a) as name_a
from table_b b
);
Here we know:
(1) There must be at least one entry from table_a with id_a (due to the foreign key in table B) AND
(2) At most one entry from table_a with id_a (due to the primary key on table A)
then we know that there is exactly one entry in table_a that links in with the join.
Now consider the following SQL:
select id_b, name_b from join_X;
Note that this selects no columns from table_a, and because we know that in this join each row of table_b joins to exactly one row of table_a, we really shouldn't have to look at table_a when performing the above select.
So what is the best way to write the above join view?
Should I just use the standard join_1, and hope the optimizer figures out that there is no need to access table_a based on the primary and foreign keys?
Or is it better to write it like join_2 or even join_3, which makes it more explicit that there is exactly one row from the join for each row from table_b?
Edit + additional question
Is there any time I should prefer a sub-select (as in join_3) over a normal join (as in join_1)?
Why are you using views at all for this purpose? If you want to get data from the table, get it from the table.
Or, if you need a view to translate some columns in the tables (such as coalescing NULLs into zeros), create a view over just the B table. This will also apply if some DBA wanna-be has implemented a policy that all selects must be via views rather than tables :-)
In both those cases, you don't have to worry about multi-table access.
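For example, a minimal sketch of such a B-only view:
create view b_only as
(
select
b.id_b,
b.id_a,
b.name_b
from table_b b
);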
Intuitively, I'd think join_1 will perform slightly slower, because your assumption that the optimiser can transform the join away is wrong: you didn't declare the table_b.id_a column to be NOT NULL. In fact, this means that (1) is wrong; table_b.id_a can be NULL. Even if you know it can't be, the optimiser doesn't know that.
As far as join_2 and join_3 are concerned, depending on your database, optimisation might be possible. The best way to find out is to run (Oracle syntax):
EXPLAIN PLAN FOR select id_b, name_b from join_X;
And study the execution plan. It will tell you whether table_a was joined or not. On the other hand, if your view should be reusable, then I'd go for the plain join and forget about premature optimisations. You can achieve better results with proper statistics and indexes, as a join operation is not always so expensive. But that depends on your statistics, of course.
1 + 2 are effectively identical under SQL Server.
I have never used [3], but it looks quite odd. I would strongly suspect the optimizer will make it equivalent to the other 2.
It's a good exercise to run all 3 statements and compare the execution plans produced.
So given identical performance, the clearest to read gets my vote - [2] is the standard where it is supported otherwise [1].
In your case, if you don't want any columns from A, why include Table_A in the statement anyway?
If it's simply a filter - i.e. only include rows from Table B where a row exists in Table A even though I don't want any cols from Table A - then all 3 syntaxes are fine, although you may find that use of EXISTS is more performant in some DBs:
SELECT * from Table_B b WHERE EXISTS (SELECT 1 FROM Table_A a WHERE b.id_b = a.id_a)
although in my experience this is usually equivalent in performance to any of the others.
You also ask whether you would choose a subquery over the other expressions. This boils down to whether it's a CORRELATED subquery or not.
Basically, a correlated subquery has to be run once for every row in the outer statement. This is true of the above: for every row in Table B you must run the subquery against Table A.
If the subquery can be run just once:
SELECT * from Table_B b WHERE b.id_a IN (SELECT a.id_a FROM Table_A a WHERE a.id_a > 10)
Then the subquery is generally more performant than a join - although I suspect that some optimizers will still be able to spot this and reduce both to the same execution plan.
Again, the best thing to do is to run both statements, and compare execution plans.
Finally and most simply - given the FK you could just write:
SELECT * From Table_B b WHERE b.id_a IS NOT NULL
It will depend on the platform.
SQL Server both analyses the logical implications of constraints (foreign keys, primary keys, etc) and expands VIEWs inline. This means that the 'irrelevant' portion of the VIEW's code is obsoleted by the optimiser. SQL Server would give the exact same execution plan for all three cases. (Note; there is a limit to the complexity that the optimiser can handle, but it can certainly handle this.)
Not all platforms are created equal, however.
- Some may not analyse constraints in the same way, assuming you coded the join for a reason
- Some may pre-compile the VIEW's execution/explain plan
As such, to determine behaviour you must be aware of the specific platform's capabilities. In the vast majority of situations the optimiser is a complex beast, so the best test is simply to try it and see.
EDIT
In response to your extra question: are correlated sub-queries ever preferred? There is no simple answer, as it depends on the data and the logic you are trying to implement.
There are certainly cases in the past where I have used them, both to simplify a query's structure (for maintainability) and to enable specific logic.
If the field table_b.id_a referenced many entries in table_a, you might only want the name from the latest one, and you could implement that using (SELECT TOP 1 name_a FROM table_a WHERE id_a = table_b.id_a ORDER BY id_a DESC).
In short, it depends.
- On the query's logic
- On the data's structure
- On the code's final layout
More often than not I find it's not needed, but measurably often I find that it is a positive choice.
Note:
Depending on the correlated sub-query, it doesn't actually always get executed 'once for every record'. SQL Server, for example, expands the required logic to be executed in-line with the rest of the query. It's important to note that SQL code is processed/compiled/whatever before being executed. SQL is simply a method for articulating set based logic, which is then transformed into traditional loops, etc, using the most optimal algorithms available to the optimiser.
Other RDBMSs may perform differently, due to the capabilities or limitations of the optimiser. Some RDBMSs perform well when using IN (SELECT blah FROM blah) or when using EXISTS (SELECT * FROM blah), but some perform terribly. The same applies to correlated sub-queries: some handle them exceptionally well, some don't perform so well, but most handle them very well in my experience.

Performance: Subquery or Joining

I have a little question about the performance of a subquery versus joining another table.
INSERT
INTO Original.Person
(
PID, Name, Surname, SID
)
(
SELECT ma.PID_new , TBL.Name , ma.Surname, TBL.SID
FROM Copy.Person TBL , original.MATabelle MA
WHERE TBL.PID = p_PID_old
AND TBL.PID = MA.PID_old
);
This is my SQL, now this thing runs around 1 million times or more.
My question is what would be faster?
If I change TBL.SID to (Select new from helptable where old = tbl.sid)
OR
If I add the 'HelpTable' to the from and do the joining in the where?
edit1
Well, this script runs only as many times as there are persons.
My program has 2 modules: one that populates MaTabelle and one that transfers data. The program merges 2 databases together, and because of this, sometimes the same key is used.
Now I'm working on a solution so that no duplicate keys exist.
My solution is to make a 'HelpTable'. The owner of the key (SID) generates a new key and writes it into the 'HelpTable'. All other tables that use this key can read it from the 'HelpTable'.
edit2
Just got something in my mind:
if a table has a key that can be NULL (a foreign key that is not linked), then this won't work with the join in the FROM, right?
Modern RDBMSs, including Oracle, optimize most joins and subqueries down to the same execution plan.
Therefore, I would go ahead and write your query in the way that is simplest for you and focus on ensuring that you've fully optimized your indexes.
If you provide your final query and your database schema, we might be able to offer detailed suggestions, including information regarding potential locking issues.
Edit
Here are some general tips that apply to your query:
For joins, ensure that you have an index on the columns that you are joining on. Be sure to apply an index to the joined columns in both tables. You might think you only need the index in one direction, but you should index both, since sometimes the database determines that it's better to join in the opposite direction.
For WHERE clauses, ensure that you have indexes on the columns mentioned in the WHERE.
For inserting many rows, it's best if you can insert them all in a single query.
For inserting on a table with a clustered index, it's best if you insert with incremental values for the clustered index so that the new rows are appended to the end of the data. This avoids rebuilding the index and often avoids locks on the existing records, which would slow down SELECT queries against existing rows. Basically, inserts become less painful to other users of the system.
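Applied to your insert's join columns, the indexing advice might look like this sketch (the index names are made up, and Copy.Person.PID may already be covered by a primary key):
CREATE INDEX ma_pid_old_idx ON original.MATabelle (PID_old);
CREATE INDEX person_pid_idx ON Copy.Person (PID);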
Joining would be much faster than a subquery.
The main difference between a subquery and a join is:
a subquery is faster when we have to retrieve data from a large number of tables, because it becomes tedious to join more tables;
a join is faster for retrieving data when we have a smaller number of tables.
Also, this "joins vs subquery" question can give you some more info.
Instead of focusing on whether to use a join or a subquery, I would focus on the necessity of doing 1,000,000 executions of that particular insert statement, especially as Oracle's optimizer (as Marcus Adams already pointed out) will optimize and rewrite your statements under the covers into their most optimal form.
Are you populating MaTabelle 1,000,000 times with only a few rows and issuing that statement each time? If yes, then the answer is to do it in one shot. Can you provide some more information on the process that executes this statement so many times?
EDIT: You indicate that this insert statement is executed for every person. In that case the advice is to populate MATabelle first and then execute once:
INSERT
INTO Original.Person
(
PID, Name, Surname, SID
)
(
SELECT ma.PID_new , TBL.Name , ma.Surname, TBL.SID
FROM Copy.Person TBL , original.MATabelle MA
WHERE TBL.PID = MA.PID_old
);
Regards,
Rob.