SELECT DISTINCT Inside WHERE IN clause performance - sql

I have a performance question about the following code...
GCL_Loans has a list of loans with basic information.
GCL_Loan_Items has information about a specific item in a loan. There can be duplicate Loan_IDs in GCL_Loan_Items.
Can anyone explain why this query would be faster or slower than the one above?

The "DISTINCT" version is probably faster, because the IN clause will have a smaller data set to search to determine if any given GCL_Loans.Loan_ID is in the set. Without the DISTINCT, the data set will be larger.
There's a reasonably good argument to be made that the query optimizer will automatically recognize the IN test is a set-wise, not a list-wise test and do the DISTINCT during auto-indexing ... but I've seen that fail before.
Note that subselects can also hurt here, because some databases (MySQL, for example) will execute the subselect once for each row of the primary select.

The plan and the performance of both queries are the same.

Because selecting DISTINCT leaves fewer values in the subquery (IN) to compare against.
My understanding is SQL will run the subquery first to generate the list of items that are to be included in the IN.
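Whatever the plan turns out to be, the two forms are logically equivalent, which is why an optimizer is free to treat them the same way. A minimal sketch of that equivalence, assuming SQLite (through Python's sqlite3 module) as a stand-in engine; the queries themselves and the Borrower and Item_ID columns are assumptions, since the original statements are not shown:

```python
import sqlite3

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GCL_Loans (Loan_ID INTEGER PRIMARY KEY, Borrower TEXT);
    CREATE TABLE GCL_Loan_Items (Item_ID INTEGER PRIMARY KEY, Loan_ID INTEGER);
    INSERT INTO GCL_Loans VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    -- Loan 1 appears twice, loan 3 has no items.
    INSERT INTO GCL_Loan_Items (Loan_ID) VALUES (1), (1), (2);
""")

# IN is a membership test, so duplicates in the subquery cannot change the result.
plain = conn.execute("""
    SELECT Loan_ID FROM GCL_Loans
    WHERE Loan_ID IN (SELECT Loan_ID FROM GCL_Loan_Items)
    ORDER BY Loan_ID
""").fetchall()

distinct = conn.execute("""
    SELECT Loan_ID FROM GCL_Loans
    WHERE Loan_ID IN (SELECT DISTINCT Loan_ID FROM GCL_Loan_Items)
    ORDER BY Loan_ID
""").fetchall()

print(plain)
print(distinct)
```

To see whether your own server treats them differently, compare the actual execution plans of both statements rather than relying on the rule of thumb.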


Adding a SUM statement increases run time way too much, is there a better method?

I have a table with invoice payments, which can be partial or full. I am comparing this calculated field to the total amount of the invoice. I have it twice in the query, once in the Select statement and again in the Where clause. Even if I remove one so it's only in either the Where or the Select, it takes more than an hour to run. If I remove the SUM entirely, it takes 10 seconds to run.
Is there a better method to get the sum? Should I use an indexed view? A temp table? Note that an invoice number is unique only to a vendor, not unique in general. The initial FROM is a view, if this makes a difference.
select distinct
(select sum(PAY1.settleamountcur) from [VIEW_INVOICE_PAYMENT] PAY1 where PAY.INVOICEID=PAY1.INVOICEID and PAY.OrderAccount=PAY1.OrderAccount) as "InvoiceSUM",
inner join INVOICE on INVOICE_DOC_NO =invoiceid
JOIN VENDOR V on PAY.OrderAccount=v.VendorAccount
where TRANSDATE is not null
and (select sum(PAY1.settleamountcur) from [VIEW_INVOICE_PAYMENT] PAY1 where PAY.INVOICEID=PAY1.INVOICEID and PAY.OrderAccount=PAY1.OrderAccount)=total_cost_on_invoice
In this answer, when I refer to 'that select', I'm referring to the sub-query in the middle: select sum(PAY1.settleamountcur) ...
Note that the aliases in 'that select' look a little strange, e.g., select sum(PAY1.settleamountcur) from [VIEW_INVOICE_PAYMENT] AX1. Where does the PAY1 alias come from? I may have missed something. If that's a typo in your code, it could be doing bad things (if it even runs). Assuming it's not, however...
For your broader problem, I believe it will be running that select statement once for every row returned by your overall query. Indeed, it may be doing it more often, depending on where the filtering happens in the execution plan.
Note I'm assuming SQL Server in this answer - but it should apply to other databases as well.
A couple of options:
1. Instead of referring to the view, bring the underlying tables into your current query and modify the query accordingly.
2. Try removing the aggregation from the subquery and doing it over the whole data set instead, e.g., GROUP BY the relevant fields and SUM across the relevant fields. This can be combined with option 1.
3. Put the sub-query in a CTE, or as a sub-query within the FROM component. This may make SQL Server treat it as a single table rather than running it many times (or it may not).
4. (Sometimes my preferred option for large tables) Get the relevant data from the view into a temporary table first, e.g.,
SELECT INVOICEId, OrderAccount, SUM(settleamountcur) AS total_settleamountcur
INTO #Temp
FROM [VIEW_INVOICE_PAYMENT]
-- Add any where/having clauses you can to filter
GROUP BY INVOICEId, OrderAccount
-- Consider creating temp table first with primary key, making joins easier for SQL Server
Then use the #Temp table instead of that select sub-query.
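The idea behind the pre-aggregation options can be sketched as follows. This is only an illustration under stated assumptions: SQLite (via Python's sqlite3) stands in for SQL Server, and the payments and invoices tables with their sample rows are hypothetical simplifications of the view and tables in the question.

```python
import sqlite3

# Hypothetical, simplified stand-ins for VIEW_INVOICE_PAYMENT and INVOICE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (INVOICEID TEXT, OrderAccount TEXT, settleamountcur INTEGER);
    CREATE TABLE invoices (INVOICEID TEXT, OrderAccount TEXT, total_cost INTEGER);
    INSERT INTO payments VALUES
        ('A1', 'V1', 40), ('A1', 'V1', 60),
        ('A1', 'V2', 25),
        ('B7', 'V1', 10);
    INSERT INTO invoices VALUES ('A1', 'V1', 100), ('A1', 'V2', 50), ('B7', 'V1', 10);
""")

# Correlated form: the SUM sub-query is logically evaluated per outer row.
correlated = conn.execute("""
    SELECT i.INVOICEID, i.OrderAccount
    FROM invoices i
    WHERE (SELECT SUM(p.settleamountcur)
           FROM payments p
           WHERE p.INVOICEID = i.INVOICEID
             AND p.OrderAccount = i.OrderAccount) = i.total_cost
    ORDER BY i.INVOICEID, i.OrderAccount
""").fetchall()

# Pre-aggregated form: compute every sum once, then join to the result.
conn.execute("""
    CREATE TEMP TABLE paid AS
    SELECT INVOICEID, OrderAccount, SUM(settleamountcur) AS total_settleamountcur
    FROM payments
    GROUP BY INVOICEID, OrderAccount
""")
preaggregated = conn.execute("""
    SELECT i.INVOICEID, i.OrderAccount
    FROM invoices i
    JOIN paid t ON t.INVOICEID = i.INVOICEID AND t.OrderAccount = i.OrderAccount
    WHERE t.total_settleamountcur = i.total_cost
    ORDER BY i.INVOICEID, i.OrderAccount
""").fetchall()

print(correlated)
print(preaggregated)
```

Both forms return the fully paid invoices; the second one aggregates the payments a single time, which is the effect the temp table is meant to achieve. Whether it is actually faster on your data is something only the execution plan can confirm.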

Postgres all subqueries in coalesce executed

COALESCE in Postgres is a function that returns its first non-null argument.
So I used coalesce in subqueries like:
SELECT COALESCE(
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...)
)
The WHERE clause differs in each query; they contain lots of parameters and CASE expressions, and also different ORDER BY clauses.
This is because I always want to return something, but with priorities.
What I noticed while issuing EXPLAIN ANALYZE is that every query is executed even though the first one actually returns a non-null value.
I would expect the engine to run only the first query, and not the following ones, if it returns a non-null value.
Otherwise I could get bad performance.
So is this bad practice, and is it better to run the queries separately for performance reasons?
Sorry, you were right: I don't select *, I select only one column. I didn't post my code because I am not interested in my particular query; it's a generic question to understand how the engine works. So I reproduced it in a very simple fiddle here!17/a8aa7/4
I may be wrong, but I think it behaves as I described: it runs all the subqueries even though the first one already returns a non-null value.
EDIT 2: OK, I only now noticed that it says "never executed". So the other two queries aren't executed. What confused me was the fact that they were included in the query plan.
Anyway, it's still relevant to my question. Is it better to run all the queries separately for performance reasons? It seems that even if the first one returns a non-null value, the other two subqueries could slow things down.
For separate SELECT queries, I suggest using UNION ALL / LIMIT 1 instead. Based on your fiddle:
(select user_id from users order by age limit 1) -- misleading example, see below
UNION ALL
(select user_id from users where user_id=1)
UNION ALL
(select user_id from users order by user_id DESC limit 1)
LIMIT 1;
For three reasons:
Works for any SELECT list: single expressions (your fiddle), multiple or whole row (your example in the question).
You can distinguish actual NULL values from "no row". Since user_id is the PK in the example (and hence, NOT NULL), the problem cannot surface in the example. But with an expression that can be NULL, COALESCE cannot distinguish between both, "no row" is coerced to NULL for the purpose of the query. See:
Return a value if no record is found
Aside, your first SELECT in the example makes this a wild-goose chase. It returns a row if there is at least one. The rest is noise in this case.
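The difference between an actual NULL and "no row" can be illustrated with a small sketch. It assumes SQLite (via Python's sqlite3) rather than Postgres, and the nickname column is hypothetical. Because SQLite does not accept parenthesized SELECTs inside a compound query, and to make the row order explicit, the sketch adds a priority column and an ORDER BY, which the Postgres query in the answer does not use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, nickname TEXT);
    -- User 1 exists, but the nickname is an actual NULL value.
    INSERT INTO users VALUES (1, NULL), (2, 'bob');
""")

# COALESCE cannot tell "row found, value is NULL" from "no row found",
# so it falls through to the second subquery.
coalesced = conn.execute("""
    SELECT COALESCE(
        (SELECT nickname FROM users WHERE user_id = 1),
        (SELECT nickname FROM users WHERE user_id = 2)
    )
""").fetchone()

# UNION ALL with LIMIT 1 returns the first row that exists, NULL value or not.
first_row = conn.execute("""
    SELECT nickname FROM (
        SELECT 1 AS priority, nickname FROM users WHERE user_id = 1
        UNION ALL
        SELECT 2 AS priority, nickname FROM users WHERE user_id = 2
    )
    ORDER BY priority
    LIMIT 1
""").fetchone()

print(coalesced)   # value taken from user 2
print(first_row)   # row of user 1, whose nickname is NULL
```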
PostgreSQL combine multiple select statements
SQL - does order of OR conditions matter?
Way to try multiple SELECTs till a result is available?

Does SQL performance degrade as the number elements in an "IN" clause increases?

I have a query like this,
SELECT Name FROM Customers WHERE Id IN (1,4,3,6,7)
There might be millions of customers in the database. Will this query have an efficiency problem as the number of Ids inside the IN list grows? If so, why, and is there any workaround?
I use SQL Server. Below is my table structure.
Id is the primary key, with a non-clustered index.
This query is as basic as it can get.
If you need to find the name of 5 customers, there is simply no other sane way of writing it.
It will perform well if you have an index on Id. The lookup is almost instantaneous, and its cost is directly related to the number of items in the IN clause.
If you don't, it will scan the table, and the performance becomes directly related to the number of records in the table.
Assuming you have properly indexed the Id column, there should be no problem. That is the correct method, and if it does not work, you need a new database. (Millions shouldn't be an issue with most regular pieces of software; if you make it to multiple billions you might need to investigate clustered databases).
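The index-versus-scan distinction can be observed directly in a query plan. A minimal sketch, assuming SQLite (via Python's sqlite3) as a stand-in for SQL Server; the two table names are hypothetical, and SQLite's automatic indexes are switched off so that the unindexed case is not masked:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: disable SQLite's automatic indexes to keep the comparison clean.
conn.execute("PRAGMA automatic_index = OFF")
conn.executescript("""
    CREATE TABLE customers_indexed (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE customers_heap (Id INTEGER, Name TEXT);  -- no index on Id
""")
rows = [(i, f"customer {i}") for i in range(1, 1001)]
conn.executemany("INSERT INTO customers_indexed VALUES (?, ?)", rows)
conn.executemany("INSERT INTO customers_heap VALUES (?, ?)", rows)


def plan(table):
    """Return the human-readable plan steps for the IN query on `table`."""
    sql = f"EXPLAIN QUERY PLAN SELECT Name FROM {table} WHERE Id IN (1, 4, 3, 6, 7)"
    # The plan description is the last column of every EXPLAIN QUERY PLAN row.
    return [row[-1] for row in conn.execute(sql)]


indexed_plan = plan("customers_indexed")
heap_plan = plan("customers_heap")
print(indexed_plan)  # a SEARCH step: one key lookup per listed Id
print(heap_plan)     # a SCAN step: every row of the table is visited
```

On SQL Server the equivalent check is to look at the actual execution plan for an index seek versus a scan.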
If you execute the following query:
select * from sys.objects where object_id in (
(I'm not going to break up all the lines).
In the resulting query plan, approximately 5% of the cost of the query is taken up by a constant scan (which is effectively turning all of those numbers into a temp table internally; that table is then passed to a join operator).
But, this is a remarkably simple query overall. For any more complex query, I'd expect that the cost, as a percentage, will go down (since I expect the absolute cost to remain the same)
I know this isn't the question that was asked, but, say your list of IDs came from another query:
Then this is cause to rewrite your query using EXISTS:
This is efficient because EXISTS gives more opportunity for the optimizer to determine an efficient execution path, whereas IN forces the subquery to be fully evaluated.
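Since the original queries are not shown, here is a hypothetical sketch of what the two forms might look like. It assumes SQLite (via Python's sqlite3) and an invented Orders table, and it only demonstrates that both forms return the same rows; it says nothing about which one a given optimizer executes faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER);  -- hypothetical
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Orders (CustomerId) VALUES (1), (1), (3);
""")

# IN with a subquery: the list of Ids comes from another query.
in_rows = conn.execute("""
    SELECT Name FROM Customers
    WHERE Id IN (SELECT CustomerId FROM Orders)
    ORDER BY Name
""").fetchall()

# EXISTS with a correlated subquery: same question, phrased per customer.
exists_rows = conn.execute("""
    SELECT c.Name FROM Customers c
    WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerId = c.Id)
    ORDER BY c.Name
""").fetchall()

print(in_rows)
print(exists_rows)
```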
The query you specified didn't have a subquery. It just has a list of constants, which leaves little opportunity for further optimization. As is, you have to make do with the best you've got, i.e. index the ID column as recommended by @zebediah49.

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
), Ordered AS (
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WITH Ordered AS (
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
Both produce the same results: a limited and ranked list of values based on a field's data.
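The two shapes being compared can be sketched end to end as follows. This is only an illustration under stated assumptions: SQLite (via Python's sqlite3, version 3.25 or later for window functions) stands in for SQL Server, the sample rows are invented, the dt filter is assumed because it is not visible in the queries above, and a regular temp table replaces #LargeData. It shows that both shapes return the same ranking, not that one is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (dt INTEGER, valuefield INTEGER)")
conn.execute("CREATE INDEX idx_mydata_dt ON mydata (dt)")
conn.executemany(
    "INSERT INTO mydata VALUES (?, ?)",
    [(20110717, 5), (20110717, 9), (20110717, 7), (20110716, 100)],
)
dt = 20110717  # plays the role of @DT

# Shape 1: everything in one statement, chained CTEs.
cte_rows = conn.execute("""
    WITH LargeData AS (
        SELECT * FROM mydata WHERE dt = ?
    ), Ordered AS (
        SELECT valuefield,
               ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
        FROM LargeData
    )
    SELECT valuefield, Rank_Number FROM Ordered ORDER BY Rank_Number
""", (dt,)).fetchall()

# Shape 2: materialize the filtered rows first, then rank from the temp table.
conn.execute("CREATE TEMP TABLE LargeData (dt INTEGER, valuefield INTEGER)")
conn.execute("INSERT INTO LargeData SELECT dt, valuefield FROM mydata WHERE dt = ?", (dt,))
temp_rows = conn.execute("""
    WITH Ordered AS (
        SELECT valuefield,
               ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
        FROM LargeData
    )
    SELECT valuefield, Rank_Number FROM Ordered ORDER BY Rank_Number
""").fetchall()

print(cte_rows)
print(temp_rows)
```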
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc.) the bottom query executes MUCH faster than the top one, sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format that would give a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "Ranking" or secondary queries is just fluff for this example, the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or similar but the idea is clear I hope). Remember this query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
), Ordered AS (
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table in the previous query. I found the execution times improved in almost exactly the same way, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
TOP 100 PERCENT -- does NOT improve speed
It does NOT perform in the same fashion (you must use the static number style, TOP 999999999).
Here is what I can tell from the actual execution plans of the query in both formats (the original one with normal CTEs and one with each sub query having TOP 999999999):
The normal query joins everything together as if all the tables were in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub queries from the main query in order to apply the TOP operator, thus forcing creation of an in-memory "Bitmap" of the sub query that is then joined to the main query. This appears to do exactly what I wanted, and it may even be more efficient, since servers with large amounts of RAM will be able to execute the query entirely in memory without any disk IO. In my case we have 280 GB of RAM, so well more than could ever really be used.
Not only can you use indexes on temp tables, but they also allow the use of statistics and the use of hints. I can find no reference in the documentation on CTEs to being able to use statistics, and it says specifically that you can't use hints.
Temp tables are often the most performant way to go when you have a large data set and the choice is between temp tables and table variables, even when you don't use indexes (possibly because statistics are used to develop the plan), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ, to determine whether it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a query plan. In the second query a good query plan can't be generated, because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using OPTION (RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try using OPTION (OPTIMIZE FOR (@DT = SomeRepresentativeValue)). The problem with this is that you might pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog

Does the order of the columns in a SELECT statement make a difference?

This question was inspired by a previous question posted on SO, "Does the order of the WHERE clause make a difference?". Would it improve a SELECT statement's performance if the columns used in the WHERE section are placed at the beginning of the SELECT statement?
FROM customer, transaction
I do know that limiting the list of columns to only the needed ones in a SELECT statement improves performance as opposed to using SELECT * because the current list is smaller.
For Oracle and Informix and any other self-respecting DBMS, the order of the columns should have no impact on performance. Similarly, it should be the case that the query engine finds the optimal order to process the Where clause so the order should not matter all things being equal (i.e., looking past constructs which might force an execution order).
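One way to check this claim on a given engine is to compare the plans for two orderings of the select list. A minimal sketch, assuming SQLite (via Python's sqlite3) as a stand-in for the DBMSs named above; the columns, the index, and the filter value are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE INDEX idx_customer_city ON customer (city);
""")


def plan(select_list):
    """Return the plan steps for the same query with a given column order."""
    sql = f"EXPLAIN QUERY PLAN SELECT {select_list} FROM customer WHERE city = 'Oslo'"
    # The plan description is the last column of every EXPLAIN QUERY PLAN row.
    return [row[-1] for row in conn.execute(sql)]


where_column_first = plan("city, name")
where_column_last = plan("name, city")
print(where_column_first)
print(where_column_last)
```

Both orderings yield the same access path here, which is consistent with the answer: the select list decides what is returned, not how the rows are found.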