How can I make large IN clauses more efficient in SQL Server?

My current query runs very slowly against a database with fairly large tables:
FROM table1
WHERE timestamp BETWEEN 635433140000000000 AND 635433150000000000
AND ID IN ('element1', 'element2', 'element3', ... , 'element 3002');
As you can see the IN clause has several thousand values. This query is executed roughly every second.
Is there another way to write it to improve performance?

Add the elements of the IN to an indexed temporary (if the elements change) or permanent table (if the elements are static) and inner join on them.
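A minimal T-SQL sketch of the temporary-table approach. The table name #wanted_ids, the VARCHAR(50) type, and the SELECT list are assumptions, since the question shows neither the column list nor the type of ID.

```sql
-- Assumed: ID is a string column; adjust the type to match table1.ID.
CREATE TABLE #wanted_ids (
    ID VARCHAR(50) NOT NULL PRIMARY KEY   -- the primary key provides the index
);

-- Load the values in batches: a single INSERT ... VALUES accepts at most 1000 rows.
INSERT INTO #wanted_ids (ID)
VALUES ('element1'), ('element2'), ('element3');

SELECT t.*                                 -- assumed select list
FROM table1 AS t
INNER JOIN #wanted_ids AS w
    ON w.ID = t.ID
WHERE t.timestamp BETWEEN 635433140000000000 AND 635433150000000000;

DROP TABLE #wanted_ids;
```

If the list of elements rarely changes, the same join works against a permanent table, which avoids reloading the values on every execution.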

This is your query:
FROM table1
WHERE timestamp BETWEEN 635433140000000000 AND 635433150000000000 AND
ID IN ('element1', 'element2', 'element3', ... , 'element 3002');
The query is fine. Add an index on table1(id, timestamp).
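As a sketch, the suggested index could be created like this in T-SQL. The index name is hypothetical; the column order is the one given in the answer.

```sql
-- Hypothetical index name; column order as suggested above.
CREATE NONCLUSTERED INDEX IX_table1_ID_timestamp
    ON table1 (ID, timestamp);
```

With ID as the leading column, the optimizer can seek to each listed ID and then read only the rows in the requested timestamp range. Whether it actually chooses that plan should be confirmed in the execution plan.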

The best answer depends on how those element ID listings are selected, but it all comes down to one thing: getting them into a table somewhere that you can join against. That will help performance tremendously. But again, the real question here is how best to get those items into a table, and that will depend on information not yet included in the question.

You should check your execution plan. I suspect you may have a parameter sniffing problem caused by the BETWEEN. Check whether the actual row counts are far off the estimated values. You can also rewrite the IN as an EXISTS, which is processed internally much like an INNER JOIN.


Exclude Certain Records with Certain Values SQL Select

I've used the solution (by AdaTheDev) from the thread below as it relates to the same question:
How to exclude records with certain values in sql select
But when I apply the same query to over 40,000 records, it takes too long to process (more than 30 minutes). Is there a more efficient way to query for records with certain values (the same question as in the Stack Overflow thread above)? I've tried the query below, still with no luck:
FROM StoreClients
Where ClientId=5
Thanks --
You could use EXISTS:
FROM StoreClients cl
WHERE sc.StoreId = cl.StoreId
AND cl.ClientId = 5
Make sure to create an index on StoreId column. Your query could also benefit from an index on ClientId.
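The query in this answer is only partly shown. A sketch of what it may have looked like, assuming the goal is the stores that have no row for client 5 (as in the linked thread), is below. The index name is hypothetical, and the single composite index is a variation on the two separate indexes suggested above.

```sql
-- Hypothetical index name; one composite index covers both predicates of the subquery.
CREATE NONCLUSTERED INDEX IX_StoreClients_StoreId_ClientId
    ON StoreClients (StoreId, ClientId);

-- Stores that have no row for client 5 (assumed intent).
SELECT DISTINCT sc.StoreId
FROM StoreClients AS sc
WHERE NOT EXISTS (
    SELECT 1
    FROM StoreClients AS cl
    WHERE sc.StoreId = cl.StoreId
      AND cl.ClientId = 5
);
```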
40,000 rows isn't too big for SQL Server. The other thread suggests some good queries too. If I understand correctly, your problem right now is performance.
To improve performance, you may need to add a new index to your table, or just update the statistics on your table.
If you are running this query in SQL Server Management Studio for testing and performance tuning, make sure that you select "Include Actual Execution Plan". This makes the query take longer to execute, but the resulting execution plan shows where the time is spent.
It usually suggests adding some indexes to improve performance, which is easy to follow.
If adding new indexes is not possible, then you will have to spend some time learning how to read the execution plan diagram and find the slow areas.
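A short T-SQL sketch of the steps mentioned above. UPDATE STATISTICS refreshes the statistics the optimizer relies on. The SET STATISTICS options are an addition not mentioned in the answer: they print I/O and timing figures to the Messages tab and can be read alongside the graphical plan.

```sql
-- Refresh optimizer statistics for the table named in the question.
UPDATE STATISTICS StoreClients;

-- Report logical reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the query under test here.

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```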

How to avoid Rownum?

The below query picks up 1000 rows due to batch constraints and should fetch only 1000 rows.
Without ROWNUM it takes only 5 seconds to fetch more than 1000 records, but with ROWNUM it takes 20 seconds.
Please help me on tuning the query without affecting functionality.
Look at the execution plans. The optimizer probably thinks it can reach the first 1000 results more quickly by following a different path, whereas for the complete data it uses a hash join or similar, which surprisingly turns out to be quick on the first records as well.
Once you know the execution plans you can use hints to let the optimizer follow the path, which you know from your experience to be better.
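A sketch of how the plans could be compared in Oracle. The query is an assumption: the original is not shown, so this uses the left-join form described in this answer, with a hypothetical column id on both tables.

```sql
-- Assumed shape of the original query: a left-join anti-join limited by ROWNUM.
EXPLAIN PLAN FOR
SELECT t1.id
FROM tab1 t1
LEFT JOIN tab2 t2 ON t2.id = t1.id
WHERE t2.id IS NULL
  AND ROWNUM <= 1000;

-- Show the plan that was just explained.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Running the same two statements without the ROWNUM predicate shows whether the join method changes between the two versions. If it does, a hint such as USE_HASH on the limited query is one thing to try.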
Anyhow, you are asking for tab1 records that don't exist in tab2, but rather than saying so with NOT EXISTS, NOT IN or MINUS, you kind of hide this by using a left join. This can be faster sometimes, but it's a trick after all. Why not re-write the query in a more straight-forward way and see how it performs? I think such a statement might be more stable as to slight alterations like using a rownum limit. It's worth a try.
EDIT: Some clarification. You are asking for IDs that exist in tab1 but not in tab2. This would be:
You can also word the task differently, such as: I want all IDs from tab1 that don't exist in tab2:
Or: I want all IDs from tab1 that are not in tab2
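The queries that accompanied these three formulations are not shown here. Assuming both tables expose a column named id, they would look roughly like this in Oracle SQL:

```sql
-- 1. IDs that exist in tab1 but not in tab2
SELECT id FROM tab1
MINUS
SELECT id FROM tab2;

-- 2. IDs from tab1 for which no matching row exists in tab2
SELECT t1.id
FROM tab1 t1
WHERE NOT EXISTS (SELECT 1 FROM tab2 t2 WHERE t2.id = t1.id);

-- 3. IDs from tab1 that are not in tab2
SELECT id
FROM tab1
WHERE id NOT IN (SELECT id FROM tab2);
```

The three are not strictly interchangeable: MINUS removes duplicate IDs from the result, and NOT IN returns no rows at all if tab2.id contains a NULL.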
What you are saying instead is: For every ID in tab1, find all matching IDs in tab2 and combine them. For IDs in tab1 that have no match in tab2, give me a result record too. Then from that (probably huge) set of results remove all matches, so that I am left with those IDs that have no match.
Many words to describe the same task. Accordingly, the query is not easy to read for people unfamiliar with this trick. The query certainly produces a large intermediate result. So why do people use it? Database systems grew up with joins, so this is something they are really good at. For example, they use hash mechanisms to join the records rather than looping record by record. So although it suggests a rather complicated access path, the left join technique may result in good performance.
However, the queries above are more straightforward. Let's look at the first; a possible execution plan would be: order the tab1 IDs, order the tab2 IDs, then loop once to keep the tab1 IDs without a tab2 match. Very simple. Sorting takes time, but then you go sequentially through both results. If this happens to give you the first thousand matches rather quickly, it is likely to do so as well when you limit the results with ROWNUM < 1000. And the second query? Loop through tab1 and, with the given ID, look for a match in tab2; if there is none, keep that record. This may be fast with an index, and adding ROWNUM < 1000 will probably not change the speed of getting the first records, because the execution path stays the same. Third query: it can be interpreted like the second, or the tab2 IDs are put into some kind of array with fast access. Either way, ROWNUM < 1000 is not likely to change much in the access path.
With your query, however, it is difficult to say. When all records must be considered, a hash join might be fastest. But if only some records suffice, why join everything? Maybe the optimizer then decides to go through tab1 record by record and look for a match in tab2. This would alter the execution plan drastically and can be much faster for the first 1000 records. It is just not guaranteed to be, and with bad luck, as in your case, it can even get slower.
Well, after all, Oracle has a great optimizer. Queries get re-written, and your query might get turned into a NOT EXISTS query or vice versa. And even without re-writing: in spite of dealing with different queries, the optimizer can still decide for the same execution plan. So you never know. But it's always worth a try.
My advice: Write straightforward SQL. Quite often an SQL statement can mirror the way one would phrase the task in words, just as shown above. Only when you face performance problems should you think about how to rewrite the query to deal with them.

Index on VARCHAR column

I have a table of 32,589 rows, and one of the columns is called 'Location' and is a Varchar(40) column type. The column holds a location, which is actually a suburb, all uppercase text.
A function that uses this table does a:
IF EXISTS(SELECT * FROM MyTable WHERE Location = 'A Suburb')
Would it be beneficial to add an index to this column, for efficiency? This is more or less a read-only table, so there are not many edits or inserts except for maintenance.
Without an index, SQL Server will have to perform a table scan to find the first instance of the location you're looking for. You might get lucky and have the value be in one of the first few rows, but it could be at row 32,000, which would be a waste of time. Adding an index only takes a few seconds and you'll probably see a big performance gain.
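For illustration, such an index could be created as follows in T-SQL. The index name is hypothetical; the table and column names are the ones from the question.

```sql
-- Hypothetical index name; lets the EXISTS check seek instead of scanning the table.
CREATE NONCLUSTERED INDEX IX_MyTable_Location
    ON MyTable (Location);
```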
I concur with @Brian Shamblen's answer.
Also, try using TOP 1 in the inner select
IF EXISTS(SELECT TOP 1 * FROM MyTable WHERE Location = 'A Suburb')
You don't have to select all the records matching your criteria for EXISTS, one is enough.
An opportunistic approach to performance tuning is usually a bad idea.
To answer the specific question - if your function is using location in a where clause, and the table has more than a few hundred rows, and the values in the location column are not all identical, creating an index will speed up your function.
Whether you notice any difference is hard to say - there may be much bigger performance problems lurking in the database, and you might be fixing the wrong problem.

Does SQL performance degrade as the number of elements in an "IN" clause increases?

I have a query like this,
SELECT Name FROM Customers WHERE Id IN (1,4,3,6,7)
There might be millions of customers in the database. Will this query have an efficiency problem as the number of Ids inside the IN clause grows? If so, why, and is there a workaround?
I use SQL Server. Below is my table structure.
Id is the primary key, with a non-clustered index.
This query is as basic as it can get.
If you need to find the name of 5 customers, there is simply no other sane way of writing it.
It will perform well if you have an index on ID. The performance is almost instantaneous, directly related to the number of items in the IN clause.
If you don't it will scan the table, and the performance becomes directly related to the number of records in the table.
Assuming you have properly indexed the Id column, there should be no problem. That is the correct method, and if it does not work, you need a new database. (Millions shouldn't be an issue with most regular pieces of software; if you make it to multiple billions you might need to investigate clustered databases).
If you execute the following query:
select * from sys.objects where object_id in (
(I'm not going to break up all the lines).
In the resulting query, approximately 5% of the cost of the query is taken up with a constant scan (which is effectively turning all of those numbers into a temp table internally and that table is then passed to a join operator).
But, this is a remarkably simple query overall. For any more complex query, I'd expect that the cost, as a percentage, will go down (since I expect the absolute cost to remain the same)
I know this isn't the question that was asked, but, say your list of IDs came from another query:
Then this is cause to rewrite your query using EXISTS:
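The two queries this answer refers to are not shown here. A sketch of the pattern, using a hypothetical Orders table with CustomerId and Total columns, might look like this:

```sql
-- IDs supplied by a subquery (Orders and its columns are hypothetical).
SELECT c.Name
FROM Customers AS c
WHERE c.Id IN (SELECT o.CustomerId FROM Orders AS o WHERE o.Total > 100);

-- The same question phrased with EXISTS.
SELECT c.Name
FROM Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM Orders AS o
    WHERE o.CustomerId = c.Id
      AND o.Total > 100
);
```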
This is efficient because EXISTS gives more opportunity for the optimizer to determine an efficient execution path, whereas IN forces the subquery to be fully evaluated.
The query you specified doesn't have a subquery. It just has a list of constants, which leaves little room for further optimization. As it stands, you have to make do with what you have, i.e. index the ID column as recommended by @zebediah49.

Performance: Subquery or Joining

I have a small question about the performance of a subquery versus joining another table:
INSERT INTO Original.Person
(PID, Name, Surname, SID)
SELECT ma.PID_new, TBL.Name, ma.Surname, TBL.SID
FROM Copy.Person TBL, original.MATabelle MA
This is my SQL, now this thing runs around 1 million times or more.
My question is what would be faster?
If I change TBL.SID to (Select new from helptable where old = tbl.sid)
If I add the 'HelpTable' to the from and do the joining in the where?
Well, this script runs only as many times as there are persons.
My program has 2 modules: one that populates MaTabelle and one that transfers data. The program merges 2 databases, and because of this the same key is sometimes used twice.
Now I'm working on a solution so that no duplicate keys exist.
My solution is to create a 'HelpTable'. The owner of the key (SID) generates a new key and writes it into the 'HelpTable'. All other tables that use this key can read it from the 'HelpTable'.
Something just came to mind:
if a table has a key that can be null (a foreign key that is not linked),
then this won't work with the FROM approach, will it?
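A sketch of the two options and of the nullable-key concern. The join condition between Copy.Person and MATabelle is not shown in the question, so ma.PID_old = tbl.PID is a hypothetical placeholder; the HelpTable columns new and old are taken from the question.

```sql
-- Option 1: scalar subquery in the select list.
-- Returns NULL for SID when HelpTable has no matching row.
INSERT INTO Original.Person (PID, Name, Surname, SID)
SELECT ma.PID_new,
       tbl.Name,
       ma.Surname,
       (SELECT h.new FROM HelpTable h WHERE h.old = tbl.SID)
FROM Copy.Person tbl
JOIN original.MATabelle ma
  ON ma.PID_old = tbl.PID;          -- hypothetical join condition

-- Option 2: HelpTable added to the FROM clause.
-- The LEFT JOIN keeps persons whose SID is NULL or has no HelpTable entry.
INSERT INTO Original.Person (PID, Name, Surname, SID)
SELECT ma.PID_new, tbl.Name, ma.Surname, h.new
FROM Copy.Person tbl
JOIN original.MATabelle ma
  ON ma.PID_old = tbl.PID           -- hypothetical join condition
LEFT JOIN HelpTable h
  ON h.old = tbl.SID;
```

A plain inner join to HelpTable would silently drop persons with a NULL or unmapped SID, which is the situation the question worries about. The outer join behaves like the scalar subquery in that case.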
Modern RDBMSs, including Oracle, optimize most joins and subqueries down to the same execution plan.
Therefore, I would go ahead and write your query in the way that is simplest for you and focus on ensuring that you've fully optimized your indexes.
If you provide your final query and your database schema, we might be able to offer detailed suggestions, including information regarding potential locking issues.
Here are some general tips that apply to your query:
For joins, ensure that you have an index on the columns that you are joining on. Be sure to apply an index to the joined columns in both tables. You might think you only need the index in one direction, but you should index both, since sometimes the database determines that it's better to join in the opposite direction.
For WHERE clauses, ensure that you have indexes on the columns mentioned in the WHERE.
For inserting many rows, it's best if you can insert them all in a single query.
For inserting on a table with a clustered index, it's best if you insert with incremental values for the clustered index so that the new rows are appended to the end of the data. This avoids rebuilding the index and often avoids locks on the existing records, which would slow down SELECT queries against existing rows. Basically, inserts become less painful to other users of the system.
Joining would be much faster than a subquery.
The main difference between a subquery and a join is:
A subquery is faster when we have to retrieve data from a large number of tables, because joining many tables becomes tedious.
A join is faster for retrieving data when we have a small number of tables.
Also, this joins vs subquery can give you some more info
Instead of focussing on whether to use join or subquery, I would focus on the necessity of doing 1,000,000 executions of that particular insert statement. Especially as Oracle's optimizer -as Marcus Adams already pointed out- will optimize and rewrite your statements under the covers to its most optimal form.
Are you populating MaTabelle 1,000,000 times with only a few rows and issue that statement? If yes, then the answer is to do it in one shot. Can you provide some more information on your process that is executing this statement so many times?
EDIT: You indicate that this insert statement is executed for every person. In that case the advice is to populate MATabelle first and then execute once:
INSERT INTO Original.Person
(PID, Name, Surname, SID)
SELECT ma.PID_new, TBL.Name, ma.Surname, TBL.SID
FROM Copy.Person TBL, original.MATabelle MA