PostgreSQL query is slow when using NOT IN - sql

I have a PostgreSQL function that returns a query result to pgadmin results grid REALLY FAST.
Internally, this is a simple function that uses a dblink to connect to another database and does a query return so that I can simply run
SELECT * FROM get_customer_trans();
And it runs just like a basic table query.
The issue is when I use the NOT IN clause. So I want to run the following query, but it takes forever:
SELECT * FROM get_customer_trans()
WHERE user_email NOT IN
(SELECT do_not_email_address FROM do_not_email_tbl);
How can I speed this up? Anything faster than a NOT IN clause for this scenario?

get_customer_trans() is not a table but a set-returning function, so the query is not trivial to optimize. You would need to look at what the function actually does to understand why it might run slowly.
However, regardless of what the function does, adding the following index should help a lot:
CREATE INDEX do_not_email_tbl_idx1
ON do_not_email_tbl(do_not_email_address);
This index lets the NOT IN query return an answer quickly. However, NOT IN is known to have issues in older PostgreSQL versions, so make sure that you are running PostgreSQL 9.1 or later.
UPDATE: Try changing your query to:
SELECT t.*
FROM get_customer_trans() AS t
WHERE NOT EXISTS (
SELECT 1
FROM do_not_email_tbl
WHERE do_not_email_address = t.user_email
LIMIT 1
)
This query does not use NOT IN, and should work fast.
I think that in PostgreSQL 9.2 this query should work as fast as one with NOT IN though.
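One thing worth checking before swapping NOT IN for NOT EXISTS is NULL handling, since the two are not strictly equivalent: if the subquery returns even one NULL, NOT IN matches no rows at all. This is commonly cited as the reason PostgreSQL plans NOT IN less flexibly than NOT EXISTS. Here is a minimal sketch of the difference, using made-up temporary tables (trans_demo and dne_demo are hypothetical stand-ins, not objects from the question):

```sql
-- Hypothetical demo tables; the names and data are illustrative only.
CREATE TEMP TABLE trans_demo (user_email text);
CREATE TEMP TABLE dne_demo (do_not_email_address text);

INSERT INTO trans_demo VALUES ('a@example.com'), ('b@example.com');
INSERT INTO dne_demo VALUES ('b@example.com'), (NULL);

-- NOT IN: returns no rows, because 'a@example.com' <> NULL evaluates to unknown.
SELECT user_email
FROM trans_demo
WHERE user_email NOT IN (SELECT do_not_email_address FROM dne_demo);

-- NOT EXISTS: returns a@example.com, because the NULL row simply never matches.
SELECT t.user_email
FROM trans_demo AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM dne_demo AS d
    WHERE d.do_not_email_address = t.user_email
);
```

If do_not_email_address can never be NULL, both forms return the same rows and only the plan differs.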

Just do it this way:
SELECT t1.*
FROM get_customer_trans() AS t1
LEFT JOIN do_not_email_tbl AS t2
ON t1.user_email = t2.do_not_email_address
WHERE t2.do_not_email_address IS NULL;
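To see which variant actually wins on your data, you can compare the plans rather than guess. A sketch using the function and table names from the question (nothing else is assumed):

```sql
-- EXPLAIN ANALYZE really executes the statement, so the slow variant
-- will take as long here as it does when run normally.
EXPLAIN ANALYZE
SELECT *
FROM get_customer_trans()
WHERE user_email NOT IN
      (SELECT do_not_email_address FROM do_not_email_tbl);

EXPLAIN ANALYZE
SELECT t.*
FROM get_customer_trans() AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM do_not_email_tbl AS d
    WHERE d.do_not_email_address = t.user_email
);
```

In the output, an anti join node for the NOT EXISTS form, versus a SubPlan filter that is not marked as hashed for the NOT IN form, would be a likely explanation for the difference in run time.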

Related

is IN(SELECT ...) bad for performance?

Suppose I have the following code:
SELECT *
FROM [myTable]
WHERE [myColumn] IN (SELECT [otherColumn] FROM [myOtherTable])
Will the subquery be executed again and again for every row?
If so, can I execute it and store its results and use them for every row instead? For example:
SELECT [otherColumn]
INTO #Results
FROM [myOtherTable]
SELECT *
FROM [myTable]
WHERE [myColumn] IN (SELECT [otherColumn] FROM #Results)
The SQL Server query optimizer is smart enough not to run the same subquery over and over again. If anything, the temp table is less optimal because of the additional steps needed after getting the results.
You can see this by looking at the SQL query execution plan.
Edit: After looking into this further, the subquery can also be executed more than once. Apparently the query optimizer can also do a lot of interesting things, like converting your IN to a JOIN to improve performance. There's lots of information on it here: Number of times a nested query is executed
Nonetheless, view your execution plan to see what your RDBMS's query optimizer decided to do.
Have you considered using a join instead? I think that could be best in terms of performance.
SELECT * FROM [myTable] INNER JOIN [myOtherTable]
ON ([myTable].[myColumn] = [myOtherTable].[otherColumn]);
This however will only work if you don't expect duplicates to be in myOtherTable.
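The duplicate caveat is easy to reproduce. A small sketch in T-SQL with throwaway temp tables (#myTable and #myOtherTable and their contents are invented for illustration):

```sql
-- Hypothetical demo data: the value 1 appears twice in the lookup table.
CREATE TABLE #myTable (myColumn int);
CREATE TABLE #myOtherTable (otherColumn int);

INSERT INTO #myTable VALUES (1), (2);
INSERT INTO #myOtherTable VALUES (1), (1);

-- IN: the row with myColumn = 1 is returned once.
SELECT *
FROM #myTable
WHERE myColumn IN (SELECT otherColumn FROM #myOtherTable);

-- INNER JOIN: the same row is returned twice, once per matching lookup row.
SELECT t.*
FROM #myTable AS t
INNER JOIN #myOtherTable AS o
    ON t.myColumn = o.otherColumn;

DROP TABLE #myTable;
DROP TABLE #myOtherTable;
```

So the join is only a drop-in replacement for IN when otherColumn is unique.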

Oracle SQL View with select statement in where

Hello, I have a problem with one of my views.
I use this statement a few times
where date=(select d from user_date_table)
This works fine for the result but the performance is very slow.
When I do the following:
where date=to_date(
This is a lot faster, but it will not work here since I have to pass this value to the view.
Is there anything else I can do?
Right now I've tested it with a package that has a function package_name.get_user_date that gives me the value. But this is also very slow.
Are there any other things that could make this query faster?
Thank you!
There are 2 possible ways you could try and resolve this.
Does user_date_table have duplicate dates in it?
If not, then you could join to this table in the query instead of putting it into a where clause.
If it does, then you could change the query in the view to
select ...
from yourTable t
where exists
(
select *
from user_date_table udt
where udt.d = t.date
)
Also, check and see what indexes are on the user_date_table. Maybe there is a function based index on to_date(d) which is why this works faster.

Index in query plan is skipped when using OR condition in Postgres

Say, I have a table my_table with field kind:string and an index on this field.
I've noticed that Postgres builds two different query plans for the queries:
SELECT * FROM my_table
WHERE kind = 'kind1' OR kind IS NULL;
and
SELECT * FROM my_table
WHERE kind = 'kind1';
The first one does not use index whereas the second one does. Why?
I know there are a lot of conditions why indexes may be used or not, and I've read a lot about query plans but this case still is not clear to me.
Abelisto explains that the two versions of the query are not the same. SQL engines (in general) can do a poor job of using indexes for OR conditions. It is possible that there are so many NULL values that Postgres simply does not consider the index useful for the IS NULL comparison. That depends on the data.
You can try rewriting the query as:
SELECT *
FROM my_table
WHERE kind = 'kind1'
UNION ALL
SELECT *
FROM my_table
WHERE kind IS NULL;
Postgres might choose to use indexes on each subquery, if they are appropriate for the data.
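Whether NULLs really dominate the column can be checked from the planner's own statistics. A sketch using the table and column names from the question:

```sql
-- Refresh statistics first so pg_stats reflects the current data.
ANALYZE my_table;

-- null_frac is the planner's estimate of the fraction of rows where kind IS NULL.
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'my_table'
  AND attname = 'kind';
```

If null_frac is large, a sequential scan for the OR version is probably the cheaper plan and not a planner mistake.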

PostgreSQL return select results AND add them to temporary table?

I want to select a set of rows and return them to the client, but I would also like to insert just the primary keys (integer id) from the result set into a temporary table for use in later joins in the same transaction.
This is for sync, where subsequent queries tend to involve a join on the results from earlier queries.
What's the most efficient way to do this?
I'm reluctant to execute the query twice, although it may well be fast if it was added to the query cache. An alternative is to store the entire result set in the temporary table and then select from the temporary table afterward. That also seems wasteful (I only need the integer id in the temp table). I'd be happy if there was a SELECT INTO TEMP that also returned the results.
Currently the technique used is to construct an array of the integer ids on the client side and use that in subsequent queries with IN. I'm hoping for something more efficient.
I'm guessing it could be done with stored procedures? But is there a way without that?
I think you can do this with a Postgres feature that allows data modification steps in CTEs. The more typical reason to use this feature is, say, to delete records from a table and then insert them into a log table. However, it can be adapted to this purpose. Here is one possible method (I don't have Postgres on hand to test this):
with q as (
<your query here>
),
t as (
insert into temptable(pk)
select pk
from q
)
select *
from q;
Usually, you use the returning clause with the data modification queries in order to capture the data being modified.
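For a concrete picture, here is a self-contained sketch of the same pattern. The source table items_demo and its columns are hypothetical; temptable(pk) follows the naming used above. Data-modifying CTEs need PostgreSQL 9.1 or later.

```sql
-- Hypothetical source data for the demo.
CREATE TEMP TABLE items_demo (pk integer PRIMARY KEY, payload text);
INSERT INTO items_demo VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- Temp table that collects the keys for later joins in the same transaction.
CREATE TEMP TABLE temptable (pk integer);

WITH q AS (
    SELECT pk, payload
    FROM items_demo
    WHERE pk >= 2
),
t AS (
    -- The INSERT runs to completion even though the final SELECT never reads t.
    INSERT INTO temptable (pk)
    SELECT pk
    FROM q
    RETURNING pk
)
SELECT *
FROM q;

-- A later statement can now join against the stored keys.
SELECT i.*
FROM items_demo AS i
JOIN temptable AS k ON k.pk = i.pk;
```

Note that the rows inserted by the CTE are not visible to the SELECT in the same statement, which is why the client rows come from q and not from temptable.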

SQL Where Clause Against View

I have a view (actually, it's a table valued function, but the observed behavior is the same in both) that inner joins and left outer joins several other tables. When I query this view with a where clause similar to
SELECT *
FROM [v_MyView]
WHERE [Name] like '%Doe, John%'
... the query is very slow, but if I do the following...
SELECT *
FROM [v_MyView]
WHERE [ID] in
(
SELECT [ID]
FROM [v_MyView]
WHERE [Name] like '%Doe, John%'
)
it is MUCH faster. The first query takes at least 2 minutes to return, if not longer, whereas the second query returns in less than 5 seconds.
Any suggestions on how I can improve this? If I run the whole command as one SQL statement (without the use of a view) it is very fast as well. I believe this happens because a view has to behave like a table: if the view contains OUTER JOINs, GROUP BYs or TOP ##, the results could differ depending on whether the where clause is applied before or after the view is evaluated. My question is why wouldn't SQL optimize my first query into something as efficient as my second query?
EDIT
So, I was working on coming up with an example and was going to use the generally available AdventureWorks database as a backbone. While replicating my situation (which is really debugging a slow process that someone else developed, aren't they all?) I was unable to get the same results. Looking further into the query I am debugging, I realized the issue might be related to the extensive use of user-defined scalar-valued functions. There is heavy use of a "GetDisplayName" function that, depending on the values you pass in, formats the name as lastname, firstname or firstname lastname, etc. If I simply omit that function and do the string formatting in the main query/TVF/view or whatever, performance is great. The execution plan didn't give me any clue that this was the issue, which is why I initially ignored it.
The scalar UDFs are very likely the issue. As soon as they go into your query you've got an RBAR (row-by-agonizing-row) execution plan. It's tolerable if they're in the SELECT but if they're being used in a WHERE or JOIN clause....
A pity, because they can be very useful, but they're performance killers in big SELECTs and I'd suggest trying to rewrite either the UDFs as table-valued functions or the query to avoid the UDFs, if at all possible.
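As a rough sketch of the table-valued rewrite: the real GetDisplayName is not shown in the question, so the function below, its parameters, and the dbo.People table are all hypothetical and only illustrate the shape of an inline table-valued function used through CROSS APPLY.

```sql
-- Hypothetical inline TVF standing in for the scalar GetDisplayName.
-- CREATE FUNCTION must be the only statement in its batch, hence the GO separators
-- (GO is a client-side batch separator in SSMS/sqlcmd, not T-SQL itself).
CREATE FUNCTION dbo.GetDisplayNameInline
(
    @FirstName nvarchar(100),
    @LastName  nvarchar(100),
    @LastFirst bit
)
RETURNS TABLE
AS
RETURN
(
    SELECT CASE WHEN @LastFirst = 1
                THEN @LastName + N', ' + @FirstName
                ELSE @FirstName + N' ' + @LastName
           END AS DisplayName
);
GO

-- Usage against a hypothetical dbo.People table: the optimizer can expand
-- the function body into the outer query instead of calling it once per row.
SELECT p.ID, dn.DisplayName
FROM dbo.People AS p
CROSS APPLY dbo.GetDisplayNameInline(p.FirstName, p.LastName, 1) AS dn
WHERE dn.DisplayName LIKE N'%Doe, John%';
GO
```

The single-statement RETURN is what makes the function inline; a multi-statement table-valued function would not get the same treatment.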
I'm no SQL guru, but most probably the second query is faster because the subquery selects only one column, and because the ID column appears to be a key and is therefore indexed.
First Query:
SELECT * FROM [v_MyView] WHERE [Name] like '%Doe, John%'
Second query:
SELECT * FROM [v_MyView] WHERE [ID] in
(SELECT [ID] FROM [v_MyView] WHERE [Name] like '%Doe, John%')