SQL IN clause slower than individual queries - sql

I'm using Hibernate's JPA implementation with MySQL 5.0.67. MySQL is configured to use InnoDB.
In performing a JPA query (which is translated to SQL), I've discovered that using the IN clause is slower than performing individual queries. Example:
SELECT p FROM Person p WHERE p.name IN ('Joe', 'Jane', 'Bob', 'Alice')
is slower than four separate queries:
SELECT p FROM Person p WHERE p.name = 'Joe'
SELECT p FROM Person p WHERE p.name = 'Jane'
SELECT p FROM Person p WHERE p.name = 'Bob'
SELECT p FROM Person p WHERE p.name = 'Alice'
Why is this? Is this a MySQL performance limitation?

This is a known deficiency in MySQL.
It is often true that using UNION performs better than a range query like the one you show. MySQL doesn't employ indexes very intelligently for expressions using IN (...). A similar hole exists in the optimizer for boolean expressions with OR.
See http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/ for some explanation and detailed benchmarks.
The optimizer is being improved all the time. A deficiency in one version of MySQL may be improved in a subsequent version. So it's worth testing your queries on different versions.
It is also advantageous to use UNION ALL instead of simply UNION. Both queries use a temporary table to store results, but the difference is that UNION applies DISTINCT to the result set, which incurs an additional un-indexed sort.
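As a rough sketch of the rewrite described above, the IN query can be written as one equality branch per value, glued together with UNION ALL. The example uses Python's built-in sqlite3 module purely as a stand-in so it runs anywhere; the person table and the index name are assumptions. It only shows that the two forms return the same rows, and says nothing about how a given MySQL version would time them.

```python
import sqlite3

# In-memory SQLite database as a stand-in; table and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_person_name ON person (name)")
conn.executemany(
    "INSERT INTO person (name) VALUES (?)",
    [("Joe",), ("Jane",), ("Bob",), ("Alice",), ("Carol",), ("Joe",)],
)

names = ["Joe", "Jane", "Bob", "Alice"]

# Form 1: a single IN list with one placeholder per value.
in_sql = "SELECT id, name FROM person WHERE name IN ({})".format(
    ", ".join("?" for _ in names)
)
in_rows = sorted(conn.execute(in_sql, names).fetchall())

# Form 2: one equality branch per value, combined with UNION ALL.
# UNION ALL skips the duplicate-elimination step that plain UNION performs.
# That is safe here because the names are distinct, so no row matches two branches.
union_sql = " UNION ALL ".join(
    "SELECT id, name FROM person WHERE name = ?" for _ in names
)
union_rows = sorted(conn.execute(union_sql, names).fetchall())

print(in_rows == union_rows)
```

If the value list could contain duplicates, de-duplicate it before building the UNION ALL, otherwise matching rows would come back more than once.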

If you're using the IN operator, it's not much different than saying:
(p.name = 'Joe' OR p.name = 'Jane' OR p.name = 'Bob' OR p.name = 'Alice')
Those are four conditions which must be checked for every row that the query must consider. Of course, each other query you cite has only one condition. I don't believe in most real-world scenarios doing four such queries would be faster, since you have to consider the time it takes for your client to read the result sets and do something with them. In that case, IN looks pretty nice; even better if it can use an index.

A query as simple as the IN demonstrated shouldn't have an issue with the optimizer choosing to use the index. The UNION work mentioned by Bill is only required occasionally when you have more complex queries. It could be an issue with index statistics.
Have you done an ANALYZE on the table in question?
How many rows are in the table and how many match the IN clause?
What does EXPLAIN say for the queries in question?
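To make the last question concrete, here is a minimal sketch of comparing the plans for the two query shapes. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in; the table and index names are assumptions. In MySQL you would prefix the query with EXPLAIN instead and look at the key column, and the output format is different.

```python
import sqlite3

# Stand-in schema; idx_person_name is a hypothetical index on the name column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_person_name ON person (name)")


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


in_plan = plan("SELECT * FROM person WHERE name IN ('Joe', 'Jane', 'Bob', 'Alice')")
eq_plan = plan("SELECT * FROM person WHERE name = 'Joe'")

# Print both plans side by side; each should mention the index if it is used.
for label, details in (("IN", in_plan), ("=", eq_plan)):
    print(label, details)
```

If the plan for the IN query shows a full table scan while the equality query shows an index lookup, that points at the optimizer or its statistics rather than at IN itself.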

Are you measuring wall-clock time or query execution time? My guess is that the actual execution time for each of the four individual queries may add up to less than the time to execute the IN query, but the overall wall-clock time will be much longer for the four queries.
It will help to have an index on the name column.

For me, the IN clause is an invaluable tool even if there is a slight lag over individual queries, because it frees the database and tables up for other connections and because using it has benefits for application structure.
The following technique is utilized in almost every PHP/MySQL application I construct.
I use the IN clause quite a bit with numerical keys. For example, grabbing four master items and all of their subitems could be:
// Sketch: assumes each result has been fetched into an array of rows, with $master_arr keyed by master_id
$master_arr = mysql_query(
"select * from master_table where master_id in (1,7,9,10)"
);
then:
$subitem_arr = mysql_query(
"select * from subitems_table where par_master_id in (1,7,9,10)"
);
then add the subarray to the master items:
foreach($subitem_arr AS $sv){
$m_key = $sv['par_master_id'];
$s_key = $sv['subitem_id'];
$master_arr[$m_key]['subitem'][$s_key] = $sv;
}
This does two things:
1.) the tables are not all held at once with a join
2.) only two mysql queries produce a tree of data
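For readers who want to run the idea, here is a small sketch of the same two-query tree assembly using Python's sqlite3 module instead of PHP/MySQL. The table layout, column names and sample rows are assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema mirroring the master/subitems tables described above.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE master (master_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE subitems (subitem_id INTEGER PRIMARY KEY, par_master_id INTEGER, label TEXT);
    INSERT INTO master VALUES (1, 'one'), (7, 'seven'), (9, 'nine'), (10, 'ten'), (11, 'other');
    INSERT INTO subitems VALUES (100, 1, 'a'), (101, 1, 'b'), (102, 9, 'c'), (103, 11, 'd');
""")

ids = [1, 7, 9, 10]
marks = ", ".join("?" for _ in ids)  # one placeholder per key

# Query 1: the master rows, keyed by master_id.
tree = {}
for row in conn.execute(f"SELECT * FROM master WHERE master_id IN ({marks})", ids):
    tree[row["master_id"]] = {"title": row["title"], "subitem": {}}

# Query 2: all subitems of those masters, attached under their parent.
for row in conn.execute(f"SELECT * FROM subitems WHERE par_master_id IN ({marks})", ids):
    tree[row["par_master_id"]]["subitem"][row["subitem_id"]] = row["label"]

print(tree)
```

Keying the first result by master_id is what lets the second loop attach each subitem in constant time without a join.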

You can make the IN clause faster if you fetch the values first and then embed them in the IN clause, instead of embedding a subquery in the SQL statement.
here is an example of using in clause
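The example originally linked here is not available, so the following is only an illustrative sketch of the "fetch the values first" approach, written against Python's sqlite3 module. The team and player tables are invented for the example.

```python
import sqlite3

# Hypothetical tables, used only to illustrate the two-step approach.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (team_id INTEGER PRIMARY KEY, active INTEGER);
    CREATE TABLE player (player_id INTEGER PRIMARY KEY, team_id INTEGER, name TEXT);
    INSERT INTO team VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO player VALUES (10, 1, 'Ann'), (11, 2, 'Ben'), (12, 3, 'Cy');
""")

# Step 1: fetch the values on their own.
team_ids = [r[0] for r in conn.execute("SELECT team_id FROM team WHERE active = 1")]

# Step 2: embed them in the IN list via placeholders. Guard against an empty
# list, because "IN ()" is not valid in every SQL dialect.
players = []
if team_ids:
    marks = ", ".join("?" for _ in team_ids)
    players = conn.execute(
        f"SELECT name FROM player WHERE team_id IN ({marks}) ORDER BY name",
        team_ids,
    ).fetchall()

# The single-statement form with a subquery, shown only to compare results.
nested = conn.execute(
    "SELECT name FROM player WHERE team_id IN "
    "(SELECT team_id FROM team WHERE active = 1) ORDER BY name"
).fetchall()

print(players == nested)
```

Whether the two-step form is actually faster depends on the engine and version, so it is worth measuring both.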

Related

Best practices of Oracle LEFT OUTER JOIN

I am new to SQL; I use SQL Developer (Oracle DB).
When I need to select some data with null values I write one of these selects:
1)
SELECT i.number
,i.amount
,(SELECT value FROM ATTRIBUTES a
WHERE a.parameter = 'aaa' AND a.item_nr = i.number) AS atr_value
FROM ITEMS i
2)
SELECT i.number
,i.amount
,a.value as atr_value
FROM ITEMS i
left outer join ATTRIBUTES a
on a.parameter = 'aaa'
and a.item_nr = i.number
Questions:
What is the difference?
What is the first approach called (how can I google it)? Where can I read about it?
Which one should I use going forward (what is the best practice)? Is there maybe a better way to select the same data?
Your two queries are not exactly the same. If you have multiple matches in the second table, then the first generates an error and the second generates multiple rows.
Which is better? As a general rule, the LEFT JOIN method (the second method) is considered the better practice than the correlated subquery (the first method). Oracle has a pretty good optimizer and it offers many ways of optimizing joins. I also think Oracle can use JOIN algorithms for the correlated subqueries (not all databases are so smart). And, with the right indexes, the two forms probably have very similar performance.
There are situations where correlated subqueries have better performance than the equivalent JOIN construct. For this example, though, I would expect the performance to be similar.
In case there is never more than one matching row in table attributes, the queries do the same thing. They are just two ways to query the same data. Both queries are fine and straightforward.
In case there can be more than one match, query one (which is using a correlated subquery) would fail. It would be inappropriate for the given task then.
The query with the outer join is easier to extend, when you want a second column from the attributes table.
The first query makes it crystal-clear that you expect zero or one matches in table attributes for each item. In case of data inconsistency, or if you have an error in your query such as a forgotten criterion, it will fail, which is good.
The second query would simply retrieve more rows in case of such error, which may not be desired.
So which query to choose is a matter of personal preference and of how you want the query to deal with inconsistencies.
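The difference in row counts is easy to see with a small experiment. The sketch below uses Python's sqlite3 module with invented column names (nr, val) so that it is runnable anywhere. One important caveat: SQLite silently takes the first row when a scalar subquery finds several matches, whereas Oracle raises ORA-01427 (single-row subquery returns more than one row), so only the LEFT JOIN behaviour carries over directly.

```python
import sqlite3

# Hypothetical data: item 1 has one attribute row, item 2 has two
# (inconsistent data), item 3 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (nr INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE attributes (item_nr INTEGER, parameter TEXT, val TEXT);
    INSERT INTO items VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO attributes VALUES (1, 'aaa', 'x'), (2, 'aaa', 'y'), (2, 'aaa', 'z');
""")

# Approach 1: correlated scalar subquery, always one output row per item.
subquery_rows = conn.execute("""
    SELECT i.nr, i.amount,
           (SELECT a.val FROM attributes a
             WHERE a.parameter = 'aaa' AND a.item_nr = i.nr) AS atr_value
      FROM items i
""").fetchall()

# Approach 2: LEFT OUTER JOIN, one output row per match (or one NULL row).
join_rows = conn.execute("""
    SELECT i.nr, i.amount, a.val AS atr_value
      FROM items i
      LEFT OUTER JOIN attributes a
        ON a.parameter = 'aaa' AND a.item_nr = i.nr
""").fetchall()

print(len(subquery_rows), len(join_rows))
```

Item 2 appears twice in the join result, which is exactly the case where Oracle would reject the first query instead.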

How to improve the performance of multiple joins

I have a query with multiple joins in it. When I execute the query it takes too long. Can you please suggest how to improve this query?
ALTER View [dbo].[customReport]
As
SELECT DISTINCT ViewUserInvoicerReport.Owner,
ViewUserAll.ParentID As Account , ViewContact.Company,
Payment.PostingDate, ViewInvoice.Charge, ViewInvoice.Tax,
PaymentProcessLog.InvoiceNumber
FROM
ViewContact
Inner Join ViewUserInvoicerReport on ViewContact.UserID = ViewUserInvoicerReport.UserID
Inner Join ViewUserAll on ViewUserInvoicerReport.UserID = ViewUserAll.UserID
Inner Join Payment on Payment.UserID = ViewUserAll.UserID
Inner Join ViewInvoice on Payment.UserID = ViewInvoice.UserID
Inner Join PaymentProcessLog on ViewInvoice.UserID = PaymentProcessLog.UserID
GO
Work on removing the DISTINCT.
That is not a join issue. The problem is that ALL rows have to go into a temp table to find out which are duplicates. If you analyze the query plan (programming 101: learn to use it, fast) you will see that the join is likely not the big problem, but the DISTINCT is.
And IIRC that DISTINCT is USELESS because all rows are unique anyway... not 100% sure, but the field list seems to indicate it.
Use DISTINCT VERY rarely please ;)
You should see the Query Execution Plan and optimize the query section by section.
The overall optimization process consists of two main steps:
Isolate long-running queries.
Identify the cause of long-running queries.
See - How To: Optimize SQL Queries for step by step instructions.
It's difficult to say how to improve the performance of a query without knowing things like how many rows of data are in each table, which columns are indexed, what performance you're looking for and which database you're using.
Most important:
1. Make sure that all columns used in joins are indexed
2. Make sure that the query execution plan indicates that you are using the indexes you expect
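A rough way to check both points is to compare the plan before and after adding the indexes. The sketch below does this with SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 module. The simplified payment and invoice tables and the index names are assumptions, and on SQL Server you would read the graphical or XML execution plan instead.

```python
import sqlite3

# Simplified, hypothetical versions of two of the joined tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payment (payment_id INTEGER PRIMARY KEY, user_id INTEGER, posting_date TEXT);
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, user_id INTEGER, charge REAL);
""")

query = """
    SELECT p.posting_date, i.charge
      FROM payment p
      JOIN invoice i ON i.user_id = p.user_id
"""


def plan_text(sql):
    """Join the detail column of SQLite's plan rows into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


def uses_named_index(text):
    """True if the plan mentions one of the indexes created below."""
    return "idx_payment_user" in text or "idx_invoice_user" in text


before = plan_text(query)

# Index the join column on both sides, so the planner can pick either direction.
conn.execute("CREATE INDEX idx_payment_user ON payment (user_id)")
conn.execute("CREATE INDEX idx_invoice_user ON invoice (user_id)")
after = plan_text(query)

print("before:", before)
print("after: ", after)
```

The point is the habit rather than the tool: create the index, then confirm in the plan that the join really uses it.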

SQL - IN clause vs equals operator for small list

Which should be the preferred and efficient way?
where #TeamId in (Team1Id, Team2Id)
or
where #TeamId=Team1Id or #TeamId=Team2Id
I am using sql server 2008.
Edit
When I checked execution plans, both the queries showed that they are using indexes and same execution plan.
Both are the same.
SQL Server converts this
where #TeamId in (Team1Id, Team2Id)
Into
where #TeamId=Team1Id or #TeamId=Team2Id
It's better to write IN rather than OR, because it is more readable and easier to maintain.
For the specific example you provide, testing a variable, IN is simply syntactic sugar for multiple ORs.
However, in the related case of selecting rows of a relation, a join to another relation is superior, particularly if the data field being compared is indexed or the list of comparison values grows. Such a comparison relation is easily created using a static sub-query like this:
select *
from data
join (
select Team1Id as TeamId union all
select Team2Id
) comparison on comparison.TeamId = data.TeamId
This technique of a static sub-query is widely applicable to many circumstances.
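A runnable sketch of the static sub-query join, checked against the plain IN form, might look like the following. It uses Python's sqlite3 module as a stand-in for SQL Server, and the data table, its score column and the sample values are assumptions.

```python
import sqlite3

# Hypothetical data table with an index on the compared column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (id INTEGER PRIMARY KEY, TeamId INTEGER, score INTEGER);
    CREATE INDEX idx_data_team ON data (TeamId);
    INSERT INTO data VALUES (1, 10, 5), (2, 20, 7), (3, 30, 9), (4, 10, 1);
""")

team1_id, team2_id = 10, 20

# Plain IN list.
in_rows = conn.execute(
    "SELECT id FROM data WHERE TeamId IN (?, ?) ORDER BY id",
    (team1_id, team2_id),
).fetchall()

# Join against a static sub-query that builds the comparison relation inline.
join_rows = conn.execute(
    """
    SELECT data.id
      FROM data
      JOIN (SELECT ? AS TeamId UNION ALL SELECT ?) comparison
        ON comparison.TeamId = data.TeamId
     ORDER BY data.id
    """,
    (team1_id, team2_id),
).fetchall()

print(in_rows == join_rows)
```

Note that the join form returns a row once per matching comparison value, so the two forms only agree when the comparison values are distinct.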

Performance: Subquery or Joining

I got a little question about performance of a subquery / joining another table
INSERT
INTO Original.Person
(
PID, Name, Surname, SID
)
(
SELECT ma.PID_new , TBL.Name , ma.Surname, TBL.SID
FROM Copy.Person TBL , original.MATabelle MA
WHERE TBL.PID = p_PID_old
AND TBL.PID = MA.PID_old
);
This is my SQL, now this thing runs around 1 million times or more.
My question is what would be faster?
If I change TBL.SID to (Select new from helptable where old = tbl.sid)
OR
If I add the 'HelpTable' to the from and do the joining in the where?
edit1
Well, this script runs only as many times as there are persons.
My program has 2 modules: one that populates MaTabelle and one that transfers data. The program merges 2 databases, and because of this the same key is sometimes used in both.
Now I'm working on a solution so that no duplicate keys exist.
My solution is to make a 'HelpTable'. The owner of the key(SID) generates a new key and writes it into a 'HelpTable'. All other tables that use this key can read it from the 'HelpTable'.
edit2
Just got something in my mind:
If a table has a key that can be NULL (a foreign key that is not linked),
then this won't work with the HelpTable in the FROM clause, or will it?
Modern RDBMSs, including Oracle, optimize most joins and subqueries down to the same execution plan.
Therefore, I would go ahead and write your query in the way that is simplest for you and focus on ensuring that you've fully optimized your indexes.
If you provide your final query and your database schema, we might be able to offer detailed suggestions, including information regarding potential locking issues.
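On the nullable key from edit2: the two options do differ in results when SID can be NULL, whatever the plans look like. The sketch below illustrates this with Python's sqlite3 module. The simplified person and helptable layouts and the old_sid/new_sid column names are assumptions standing in for the real schema.

```python
import sqlite3

# Hypothetical, simplified tables; Ben has no SID (NULL foreign key).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (pid INTEGER PRIMARY KEY, name TEXT, sid INTEGER);
    CREATE TABLE helptable (old_sid INTEGER PRIMARY KEY, new_sid INTEGER);
    INSERT INTO person VALUES (1, 'Ann', 100), (2, 'Ben', NULL), (3, 'Cy', 200);
    INSERT INTO helptable VALUES (100, 900), (200, 901);
""")

# HelpTable in the FROM clause as an inner join: rows without a match disappear.
inner_rows = conn.execute("""
    SELECT p.pid, h.new_sid
      FROM person p
      JOIN helptable h ON h.old_sid = p.sid
     ORDER BY p.pid
""").fetchall()

# Scalar subquery in the select list: every person is kept, NULL when unmatched.
subquery_rows = conn.execute("""
    SELECT p.pid,
           (SELECT h.new_sid FROM helptable h WHERE h.old_sid = p.sid)
      FROM person p
     ORDER BY p.pid
""").fetchall()

# An outer join keeps the unmatched rows as well.
left_rows = conn.execute("""
    SELECT p.pid, h.new_sid
      FROM person p
      LEFT JOIN helptable h ON h.old_sid = p.sid
     ORDER BY p.pid
""").fetchall()

print(inner_rows, subquery_rows, left_rows, sep="\n")
```

So if the key can be NULL, the join variant needs to be an outer join to match what the subquery variant returns.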
Edit
Here are some general tips that apply to your query:
For joins, ensure that you have an index on the columns that you are joining on. Be sure to apply an index to the joined columns in both tables. You might think you only need the index in one direction, but you should index both, since sometimes the database determines that it's better to join in the opposite direction.
For WHERE clauses, ensure that you have indexes on the columns mentioned in the WHERE.
For inserting many rows, it's best if you can insert them all in a single query.
For inserting on a table with a clustered index, it's best if you insert with incremental values for the clustered index so that the new rows are appended to the end of the data. This avoids rebuilding the index and often avoids locks on the existing records, which would slow down SELECT queries against existing rows. Basically, inserts become less painful to other users of the system.
Joining would be much faster than a subquery.
The main difference between a subquery and a join is:
A subquery is faster when we have to retrieve data from a large number of tables, because it becomes tedious to join more tables.
A join is faster for retrieving data when we have a smaller number of tables.
Also, this joins vs subquery can give you some more info
Instead of focussing on whether to use join or subquery, I would focus on the necessity of doing 1,000,000 executions of that particular insert statement. Especially as Oracle's optimizer -as Marcus Adams already pointed out- will optimize and rewrite your statements under the covers to its most optimal form.
Are you populating MaTabelle 1,000,000 times with only a few rows and issue that statement? If yes, then the answer is to do it in one shot. Can you provide some more information on your process that is executing this statement so many times?
EDIT: You indicate that this insert statement is executed for every person. In that case the advice is to populate MATabelle first and then execute once:
INSERT
INTO Original.Person
(
PID, Name, Surname, SID
)
(
SELECT ma.PID_new , TBL.Name , ma.Surname, TBL.SID
FROM Copy.Person TBL , original.MATabelle MA
WHERE TBL.PID = MA.PID_old
);
Regards,
Rob.
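To illustrate the "populate first, then insert once" advice, here is a small sketch comparing the row-by-row loop with the single set-based statement. It runs on Python's sqlite3 module rather than Oracle, the schema-qualified tables are flattened into hypothetical names (copy_person, original_person, matabelle), and it checks only that both approaches produce the same rows.

```python
import sqlite3


def setup():
    """Create a fresh in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE copy_person (pid INTEGER PRIMARY KEY, name TEXT, sid INTEGER);
        CREATE TABLE matabelle (pid_old INTEGER PRIMARY KEY, pid_new INTEGER, surname TEXT);
        CREATE TABLE original_person (pid INTEGER PRIMARY KEY, name TEXT, surname TEXT, sid INTEGER);
        INSERT INTO copy_person VALUES (1, 'Ann', 10), (2, 'Ben', 20), (3, 'Cy', 30);
        INSERT INTO matabelle VALUES (1, 101, 'A.'), (2, 102, 'B.'), (3, 103, 'C.');
    """)
    return conn


INSERT_ALL = """
    INSERT INTO original_person (pid, name, surname, sid)
    SELECT ma.pid_new, tbl.name, ma.surname, tbl.sid
      FROM copy_person tbl
      JOIN matabelle ma ON tbl.pid = ma.pid_old
"""
# Same statement restricted to one person, as in the original per-person call.
INSERT_ONE = INSERT_ALL + " WHERE tbl.pid = ?"


def contents(conn):
    return conn.execute("SELECT * FROM original_person ORDER BY pid").fetchall()


# Row by row: one statement execution per person.
loop_conn = setup()
pids = [r[0] for r in loop_conn.execute("SELECT pid FROM copy_person").fetchall()]
for pid in pids:
    loop_conn.execute(INSERT_ONE, (pid,))

# Set-based: a single statement for all persons.
batch_conn = setup()
batch_conn.execute(INSERT_ALL)

print(contents(loop_conn) == contents(batch_conn))
```

With a million persons the loop issues a million statements while the set-based form issues one, which is where the time goes.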

simple sql query

which one is faster
select * from parents p
inner join children c on p.id = c.pid
where p.x = 2
OR
select * from
(select * from parents where x = 2)
p
inner join children c on p.id = c.pid
where p.x = 2
In MySQL, the first one is faster:
SELECT *
FROM parents p
INNER JOIN
children c
ON c.pid = p.id
WHERE p.x = 2
, since using an inline view implies generating and passing the records twice.
In other engines, they are usually optimized to use one execution plan.
MySQL is not very good at parallelizing and pipelining the result streams.
Like this query:
SELECT *
FROM mytable
LIMIT 1
is instant, while this one (which is semantically identical):
SELECT *
FROM (
SELECT *
FROM mytable
) q
LIMIT 1
will first select all values from mytable, buffer them somewhere and then fetch the first record.
For Oracle, SQL Server and PostgreSQL, the queries above (and both of your queries) will most probably yield the same execution plans.
I know this is a simple case, but your first option is much more readable than the second one. As long as the two query plans are comparable I'd always opt for the more maintainable SQL code which your first example is for me.
It depends on how good the database is at optimising the query.
If the database manages to optimise the second one into the first one, they are equally fast, otherwise the first one is faster.
The first one gives more freedom for the database to optimise the query. The second one suggests a specific order of doing things. Either the database is able to see past this and optimise it into a single query, or it will run the query as two separate queries with the subquery as an intermediate result.
A database like SQL Server keeps statistics on what the database tables contain, which it uses to determine how to execute the query in the most efficient way. For example, depending on what will eliminate the most records, it can either start with joining the tables or with filtering the parents table on the condition. If you write a query that forces a specific order, that might not be the most efficient order.
I'd think the first. I'm not sure if the optimizer would use any indexes on the derived table in the second query, or if it would copy out all the rows that match into memory before joining back to the children.
This is why you have DBAs. It depends entirely on the DBMS, and how your tables and indexes are configured, as to which one runs the fastest.
Database tuning is not a set-and-forget operation, it should be done regularly, as the data changes, to ensure your database runs at peak performance. The question is not really meaningful without specifying:
which DBMS you are asking about.
what indexes you have on the tables.
a host of other possible configuration items (which may also depend on the DBMS, such as clustering).
You should run both those queries through the query optimizer to see which one is faster, then start using that one. That's assuming the difference is noticeable in the first place. If the difference is minimal, go for the one that is easiest to read and maintain.
For me, in the second query you are saying, I don't trust the optimizer to optimize this query so I'll provide some 'hints'.
I'd say, trust the optimizer until it lets you down and only then consider trying to do the optimizer's job for it.
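To try this yourself, a minimal sketch is to run both forms against the same data and print each plan. The example below does that with Python's sqlite3 module. The sample rows are made up, the inner filter of the derived table is written without the p alias (which is not visible inside the sub-select), and SQLite's plans are not representative of MySQL, Oracle, SQL Server or PostgreSQL.

```python
import sqlite3

# Hypothetical sample data for the parents/children tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parents (id INTEGER PRIMARY KEY, x INTEGER);
    CREATE TABLE children (id INTEGER PRIMARY KEY, pid INTEGER);
    INSERT INTO parents VALUES (1, 2), (2, 3), (3, 2);
    INSERT INTO children VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")

direct = """
    SELECT p.id, c.id
      FROM parents p
      INNER JOIN children c ON p.id = c.pid
     WHERE p.x = 2
"""
derived = """
    SELECT p.id, c.id
      FROM (SELECT * FROM parents WHERE x = 2) p
      INNER JOIN children c ON p.id = c.pid
"""

direct_rows = sorted(conn.execute(direct).fetchall())
derived_rows = sorted(conn.execute(derived).fetchall())
print(direct_rows == derived_rows)

# Print each plan so the two can be compared by eye.
for label, sql in (("direct", direct), ("derived", derived)):
    details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    print(label, details)
```

If the two plans come out the same on your engine, pick the more readable query; if they differ, measure before deciding.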