How much load does the database take when I SELECT ... IN `big_array`?

Let's say I have accumulated some array of IDs (for example [1, 2, 3, ..., 1000]). Is it wise to SELECT with such a big array from the database? It's no big deal to pull an array of 10-20 things out of the DB, but what if it were 1000-10000?
EDIT
Somehow it seems that SELECT ... IN (SELECT ... id FROM ... BETWEEN 0 AND 100) is much slower (about 1200 ms!) than just forming an array and doing SELECT ... IN [array].

In general, when you need to select many (1000+) records based on an array of IDs, a better approach than using the IN operator is to load your array of IDs into a temporary table and then perform a join:
So instead of this:
SELECT * FROM MyTable WHERE Id IN (...)
Do this:
CREATE TABLE #TempIDs ( Id INT );
-- Bulk load the #TempIDs table with the ID's (DON'T issue one INSERT statement per ID!)
SELECT * FROM MyTable INNER JOIN #TempIDs ON MyTable.Id = #TempIDs.Id
Note the comment. For best performance, you need a mechanism for bulk loading the temporary table with ID's - this depends on your RDBMS and your application.
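For illustration, a minimal SQL Server style sketch of the bulk load step (the ID values are hypothetical; for very large lists a table-valued parameter or BULK INSERT scales better):
-- One multi-row INSERT per batch of IDs, not one statement per ID
INSERT INTO #TempIDs (Id)
VALUES (1), (2), (3), (999), (1000);
-- SQL Server allows at most 1000 rows per INSERT ... VALUES list,
-- so bigger ID sets are loaded in batches of up to 1000 rows per statement.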

The problem
Pressure on parser and optimizer
A query of the kind
SELECT * FROM x WHERE x.a IN (1,2,3,...,1000)
will (at least in Oracle) be transformed to
SELECT * FROM x WHERE x.a=1 OR x.a=2 OR x.a=3 OR ... OR x.a=1000
You will get a very big parse tree and (at least in Oracle) you will hit the hard limit of 1000 values in such a list. So you put pressure on the parser and the optimizer, and this will cost you some performance. Additionally, the database will not be able to use some optimizations.
But there is another problem:
Fixed number of bind variables
Because your query is transformed into an equivalent query using OR expressions, you cannot use a single bind variable for the whole IN clause (SELECT * FROM x WHERE x.a IN (:values) will not work). You can only use one bind variable per value in the IN clause. So when you alter the number of values, you get a structurally different query. This puts pressure on the query cache and (at least in Oracle) on the cursor cache.
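To make this concrete: every distinct number of values produces a different SQL text, and every different text is parsed and cached as its own statement (Oracle-style positional binds, for illustration):
SELECT * FROM x WHERE x.a IN (:1, :2);          -- one cursor
SELECT * FROM x WHERE x.a IN (:1, :2, :3);      -- a second, unrelated cursor
SELECT * FROM x WHERE x.a IN (:1, :2, :3, :4);  -- a third one, and so on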
Solutions
Solution: Use ranges
If you can describe your query without enumerating each value, it will usually become much faster. E.g. instead of WHERE a.x IN (1,...,1000) write WHERE a.x>=1 AND a.x<=1000.
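Spelled out as full statements (BETWEEN is an equivalent spelling of the same range):
-- Enumerating every value: big parse tree, 1000-element limit in Oracle
SELECT * FROM a WHERE a.x IN (1, 2, 3, /* ... */ 1000);
-- Describing the set as a range: two simple comparisons
SELECT * FROM a WHERE a.x >= 1 AND a.x <= 1000;
-- or equivalently
SELECT * FROM a WHERE a.x BETWEEN 1 AND 1000;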
Solution: Use a temporary table
This solution is already described in the answer from Dan: pump your values into an (indexed!) temporary table and use either a nested query (WHERE a.x IN (SELECT temp.x FROM temp)), a join (FROM a JOIN temp USING (x)), or a semi-join (WHERE EXISTS (SELECT * FROM temp WHERE temp.x=a.x)).
Style guide: My rule of thumb is to use a nested query when you expect few results in the temp table (not much more than 1000) and joins when you expect many results (much more than 1000). With modern optimizers there should be no difference, but I think of it as a hint to the human reader of the query about the expected number of values. I use semi-joins (WHERE EXISTS) when I don't care about the values in the temporary table later in the query. Again, this is more for the human reader than for the SQL optimizer.
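For concreteness, the three shapes side by side (a, temp and x as above):
-- Nested query: reads naturally when temp is expected to hold few rows
SELECT * FROM a WHERE a.x IN (SELECT temp.x FROM temp);
-- Join: signals to the reader that temp may hold many rows
SELECT a.* FROM a JOIN temp USING (x);
-- Semi-join: the values in temp are not needed elsewhere in the query
SELECT * FROM a WHERE EXISTS (SELECT * FROM temp WHERE temp.x = a.x);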
Solution: Use your database's native collection type
When your database has a native collection type, you can also use this in your query (e.g. TYPE nested_type IS TABLE OF VARCHAR2(20)).
This will make your code non portable (usually not a big problem, because people switch their database engine very rarely in an established project).
This might make your code hard to read (at least for developers not that experienced with your brand of SQL database).
An example for Oracle:
DECLARE
  -- local collection type (usable in SQL's TABLE() from Oracle 12c on)
  TYPE ILIST IS TABLE OF INTEGER;
  temp   ILIST := ILIST();
  result VARCHAR2(20);
BEGIN
  -- fill the collection with the lookup values
  temp.extend(3);
  temp(1) := 1;
  temp(2) := 2;
  temp(3) := 3;
  SELECT a.y INTO result
  FROM a
  WHERE a.x IN (SELECT * FROM TABLE(temp));
END;
/

Related

Optimizing an Oracle SQL query which uses IN clause extensively

I maintain an application where I am trying to optimize an Oracle SQL query in which multiple IN clauses are used. This query is now a blocker, as it hogs nearly 3 minutes of execution time and affects application performance severely. The query is called from Java code (JDBC) and looks like this:
Select distinct col1, col2, col3, ... colN from Table1
where 1=1 and not (col1 in (idsetone1, idsetone2, ... idsetoneN)) or
(col1 in (idsettwo1, idsettwo2, ... idsettwoN)) ....
(col1 in (idsetN1, idsetN2, ... idsetNN))
The ID sets are retrieved from a different schema and therefore a JOIN between column1 of table 1 and ID sets is not possible. ID sets have grown over time with use of the application and currently they number more than 10,000 records.
How can I start optimizing this query?
I really doubt the claim that "The ID sets are retrieved from a different schema and therefore a JOIN between column1 of table 1 and ID sets is not possible." Of course you can join the tables, provided you have SELECT privileges on them.
Anyway, let's assume it is not possible for whatever reason. One solution could be to insert all entries into a nested table first and then use that:
CREATE OR REPLACE TYPE NUMBER_TABLE_TYPE AS TABLE OF NUMBER;
Select distinct col1, col2, col3, ... colN from Table1
where 1=1
and (col1 NOT MEMBER OF NUMBER_TABLE_TYPE(idsetone1, idsetone2, ... idsetoneN)
OR
col1 MEMBER OF NUMBER_TABLE_TYPE(idsettwo1, idsettwo2, ... idsettwoN))
Regarding the maximum number of elements, the Oracle documentation says: "Because a nested table does not have a declared size, you can put as many elements in the constructor as necessary."
I don't know how seriously you can take this statement.
You should put all the items into one temporary table and do an explicit join:
Select your cols
from Table1
left join table_with_items
on table_with_items.id = Table1.col1
where table_with_items.id is null;
Also, that distinct suggests a problem in your business logic or in the architecture of the application. Why do you have duplicate IDs? You should get rid of that distinct.

How to turn multiple values into rows of a temporary table or inline view in Oracle

Since I am Japanese, my English is poor; please bear with me.
There are the following indispensable requirements, and they cannot be changed:
All I know are the IDs of the rows I need.
Their number is over 500000.
The rows must be fetched quickly and cheaply with a single SQL statement.
An index is created on id and it is optimized.
Given these constraints, the method I came up with is to pass the IDs as a search condition, with SQL queries like the following:
select *
from emp
where id in (1, 5, 7, 8, ...)
or id in (5000, 5002, ...)
That is, after the WHERE, the list is divided into batches of 1000 values per "in" clause.
However, this method takes the most processing time. As a result of investigating many things, it turned out that specifying the conditions with "exists" is processed faster than specifying them with "in".
In order to use "exists", you have to use a subquery, but I cannot imagine how to build one from a plain list of IDs.
Could someone suggest a good method?
Thank you for your consideration.
If my understanding is correct, you are trying to do this:
select * from emp where id in (list of several thousand values)
Because Oracle only supports lists of 1000 values in that construct, your code chains several IN lists together.
Suggested solution:
Create a global temporary table with an id field of the same type as emp.id.
Insert the IDs you want to find into this table.
Join against this table in your select.
create global temporary table temp_id (id number) on commit delete rows;
Your select code can be replaced by:
<code to insert the emp.id values you want to search for>
select * from emp inner join temp_id tmp on emp.id = tmp.id;

SQL WHERE ID IN (id1, id2, ..., idn)

I need to write a query to retrieve a big list of ids.
We support many backends (MySQL, Firebird, SQL Server, Oracle, PostgreSQL ...), so I need to write standard SQL.
The size of the ID set could be big, and the query would be generated programmatically. So, what is the best approach?
1) Writing a query using IN
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
My question here is: what happens if n is very big? Also, what about performance?
2) Writing a query using OR
SELECT * FROM TABLE WHERE ID = id1 OR ID = id2 OR ... OR ID = idn
I think that this approach does not have a limit on n, but what about performance if n is very big?
3) Writing a programmatic solution:
foreach (var id in myIdList)
{
    var item = GetItemByQuery("SELECT * FROM TABLE WHERE ID = " + id);
    myObjectList.Add(item);
}
We experienced some problems with this approach when the database server is queried over the network. Normally it is better to do one query that retrieves all results than to make a lot of small queries. Maybe I'm wrong.
What would be a correct solution for this problem?
Option 1 is the only good solution.
Why?
Option 2 does the same, but you repeat the column name lots of times; additionally, the SQL engine doesn't immediately know that you want to check if the value is one of the values in a fixed list. However, a good SQL engine could optimize it to have equal performance to IN. There's still the readability issue, though...
Option 3 is simply horrible performance-wise. It sends a query on every loop iteration and hammers the database with small queries. It also prevents the engine from using any optimizations for "value is one of those in a given list".
An alternative approach might be to use another table to contain id values. This other table can then be inner joined on your TABLE to constrain returned rows. This will have the major advantage that you won't need dynamic SQL (problematic at the best of times), and you won't have an infinitely long IN clause.
You would truncate this other table, insert your large number of rows, then perhaps create an index to aid the join performance. It would also let you detach the accumulation of these rows from the retrieval of data, perhaps giving you more options to tune performance.
Update: Although you could use a temporary table, I did not mean to imply that you must or even should. A permanent table used for temporary data is a common solution with merits beyond that described here.
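A minimal sketch of that pattern (SQL Server style; the table and column names are illustrative):
TRUNCATE TABLE IdStaging;      -- reuse the permanent staging table
INSERT INTO IdStaging (Id)     -- bulk-load the IDs,
VALUES (1), (2), (3);          -- batched in real code
-- an index on IdStaging.Id (created once) helps the join
SELECT t.*
FROM MyTable t
INNER JOIN IdStaging s ON s.Id = t.ID;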
What Ed Guiness suggested is really a performance booster. I had a query like this:
select * from table where id in (id1,id2.........long list)
What I did:
DECLARE @temp TABLE (
    ID int
)
insert into @temp
select * from dbo.fnSplitter('#idlist#')
Then I inner joined the temp table with the main table:
select * from MyTable t inner join @temp tmp on tmp.ID = t.ID  -- MyTable stands in for your main table
And performance improved drastically.
The first option is definitely the best.
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
However, considering that the list of IDs can be very large, say millions, you should consider chunking it:
Divide your list of IDs into chunks of a fixed size, say 100.
The chunk size should be decided based upon the memory size of your server.
Suppose you have 10000 IDs; you will then have 10000/100 = 100 chunks.
Process one chunk at a time, resulting in 100 database calls for the select (see the sketch below).
Why should you divide into chunks?
You will never get a memory overflow exception, which is very common in scenarios like yours.
You will have an optimized number of database calls, resulting in better performance.
It has always worked like a charm for me. Hope it works for my fellow developers as well :)
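For illustration, the chunked calls would look like this (id1 ... id200 stand for your actual values):
-- chunk 1: the first 100 IDs
SELECT * FROM MyTable WHERE ID IN (id1, id2, /* ... */ id100);
-- chunk 2: the next 100 IDs
SELECT * FROM MyTable WHERE ID IN (id101, id102, /* ... */ id200);
-- ...and so on, until all chunks are processed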
Doing a SELECT * FROM MyTable WHERE id IN (...) command on an Azure SQL table with 500 million records resulted in a wait time of more than 7 minutes!
Doing this instead returned results immediately:
select b.id, a.* from MyTable a
join (values (250000), (2500001), (2600000)) as b(id)
ON a.id = b.id
Use a join.
In most database systems, IN (val1, val2, …) and a series of OR are optimized to the same plan.
The third way would be to import the list of values into a temporary table and join it, which is more efficient in most systems if there are lots of values.
You may want to read this article:
Passing parameters in MySQL: IN list vs. temporary table
I think you mean SQL Server, but on Oracle there is a hard limit on how many IN elements you can specify: 1000.
Sample 3 would be the worst performer of them all, because you are hitting the database countless times for no apparent reason.
Loading the data into a temp table and then joining on that would be by far the fastest. After that, the IN should work slightly faster than the group of ORs.
For the 1st option: add the IDs into a temp table and inner join it with the main table.
CREATE TABLE #temp (Id int)  -- "column" is a reserved word, so use a plain name like Id
INSERT INTO #temp (Id)
SELECT t.Id FROM (VALUES (1),(2),(3),...(10000)) AS t(Id)
-- then join the temp table with your main table (MyTable as a placeholder):
SELECT * FROM MyTable m INNER JOIN #temp ON #temp.Id = m.Id
Try this
SELECT Position_ID, Position_Name
FROM position
WHERE Position_ID IN (6, 7, 8)
ORDER BY Position_Name

how to prove 2 sql statements are equivalent

I set out to rewrite a complex SQL statement with joins and sub-statements and obtained a simpler-looking statement. I tested it by running both on the same data set and getting the same result set. In general, how can I (conceptually) prove that the two statements are the same on any given data set?
I would suggest studying relational algebra (as pointed out by Mchl). It is the most essential concept you need if you want to get serious about optimizing queries and designing databases properly.
However, I will suggest an ugly brute-force approach that helps you ensure correct results if you have sufficient data to test with: create views of both versions (to make the comparisons easier to manage) and compare the results. I mean something like:
create view original as select xxx yyy zzz;
create view new as select xxx yyy zzz;
-- If the counts differ, something is quite obviously very wrong
select count(*) from original;
select count(*) from new;
-- What is missing from the new one?
select *
from original o
where not exists (
    select *
    from new n
    where o.col1=n.col1 and o.col2=n.col2 --and so on
);
-- Did something extra appear?
select *
from new n
where not exists (
    select *
    from original o
    where o.col1=n.col1 and o.col2=n.col2 --and so on
);
Also, as pointed out by others in the comments, you might feed both queries to the optimizer of the product you are working with. Most of the time you get something that can be parsed by humans, complete with drawings of the execution paths, the subqueries' impact on performance, etc. It is most often done with something like:
explain plan for
select *
from ...
where ...
etc

SQL argument limit in Oracle

It appears that there is a limit of 1000 arguments in an Oracle SQL statement. I ran into this when generating queries such as...
select * from orders where user_id IN(large list of ids over 1000)
My workaround is to create a temporary table, insert the user ids into that first instead of issuing a query via JDBC that has a giant list of parameters in the IN.
Does anybody know of an easier workaround? Since we are using Hibernate I wonder if it automatically is able to do a similar workaround transparently.
An alternative approach would be to pass an array to the database and use a TABLE() function in the IN clause. This will probably perform better than a temporary table. It will certainly be more efficient than running multiple queries. But you will need to monitor PGA memory usage if you have a large number of sessions doing this stuff. Also, I'm not sure how easy it will be to wire this into Hibernate.
Note: TABLE() functions operate in the SQL engine, so they need us to declare a SQL type.
create or replace type tags_nt as table of varchar2(10);
/
The following sample populates an array with a couple of thousand random tags. It then uses the array in the IN clause of a query.
declare
    search_tags tags_nt;
    n pls_integer;
begin
    select name
    bulk collect into search_tags
    from ( select name
           from temp_tags
           order by dbms_random.value )
    where rownum <= 2000;
    select count(*)
    into n
    from big_table
    where name in ( select * from table (search_tags) );
    dbms_output.put_line('tags match '||n||' rows!');
end;
/
As long as the temporary table is a global temporary table (i.e. only visible to the session), this is the recommended way of doing things (and I'd go that route for anything more than a dozen arguments, let alone a thousand).
I'd wonder where/how you are building that list of 1000 arguments. If it is a semi-permanent grouping (e.g. all employees based in a particular location), then that grouping should be in the database and the join done there. Databases are designed and built to do joins really quickly, much quicker than pulling a bunch of IDs back to the mid tier and then sending them back to the database.
select * from orders
where user_id in
(select user_id from users where location = :loc)
You can add additional predicates to split the list into chunks of 1000:
select * from orders where user_id IN (<first batch of 1000>)
OR user_id IN (<second batch of 1000>)
OR user_id IN ...
The comments regarding "if these IDs are in your database, use joins/correlation instead" hold true. However, if your list of IDs comes from elsewhere, like a SOLR result, you can get around the temp table requirement by issuing multiple queries, each with no more than 1000 IDs present, and then merging the results of the queries in memory. If you place the initial list of IDs in a unique collection like a hashset, you can pop off 1000 IDs at a time.