Is it possible to cache the results of a subquery in some variable? - sql

I have a query in PostgreSQL 9.1 like
SELECT id
FROM students
INNER JOIN exams ON /* some condition */
WHERE students.id NOT IN (SUBQUERY);
When I run only the subquery, it executes in 120 ms. When I execute the outer query without the subquery condition, it takes 12 seconds. But when I add the subquery, it runs for half an hour.
Is it possible to cache the results of the subquery in some variable (the results are always the same array of ids) and execute this in the console/pgAdmin?
I found the WITH statement, but it looks like it is not supported in Postgres.

First, the WITH statement is supported in Postgres.
Second, you need to identify where the performance problem is. Is it in the subquery? Or is it in the NOT IN?
You can put the subquery results in a table, add indexes, and make the query more efficient.
You can rewrite the subquery using a LEFT JOIN, which often allows the query to be better optimized.
You can add appropriate indexes to make the entire query more efficient.
Without knowing what the subquery actually does, any recommended approach is speculation.
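A sketch of the temp-table approach in Postgres (the subquery body and join condition are placeholders, matching the question's own placeholders):

```sql
-- Materialize the subquery's ids once, so the slow subquery runs a single time.
CREATE TEMP TABLE excluded_ids AS
SELECT id FROM ...;          -- your slow subquery goes here

CREATE INDEX ON excluded_ids (id);
ANALYZE excluded_ids;

-- NOT IN rewritten as an anti-join against the cached ids.
SELECT s.id
FROM students s
INNER JOIN exams e ON /* some condition */
LEFT JOIN excluded_ids x ON x.id = s.id
WHERE x.id IS NULL;
```

The temp table lives for the session, so in the console/pgAdmin you can create it once and reuse it across several queries.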

Related

Is an inner select from the same table more efficient than a "regular" select

I can't find any answer that addresses the efficiency side (which query runs faster, not how to write the query!). I appreciate your help in this matter.
Is this query better (and does it run faster than)
SELECT c.user_id_1
FROM (
SELECT b.user_id_1, b.mail_id, b.age
FROM table_name AS b WHERE b.name = 'test'
) AS c WHERE c.age = '18'
this query:
select user_id_1 from table_name
where name = 'test'
and age = '18'
Both queries give me the same results, but can I test which query is faster?
I can think of no database where using a subquery would be faster. In most databases, the two queries would produce exactly the same execution plan -- regardless of indexes, partitions, and other factors.
It is important to understand that SQL engines do not directly execute a SQL statement. They convert the SQL statement into a directed acyclic graph (DAG) that looks (to the uninitiated) nothing like the original statement. Part of this process is optimizing the code, which makes the execution graph even less like the original code.
Some versions of MySQL and MariaDB have a habit of materializing subqueries in the FROM clause. This can have a deleterious effect on performance! So, a subquery can sometimes make things much worse.
It is also possible that very complex subqueries might confuse the optimizer, but a simple case such as yours would not be one of those cases.
I just tested this on a table with 13,745,928 records. If both user_id and age are covered with a clustered index then both queries will produce the same plan using a Clustered Index Seek with a Cost 100%. In this case, both queries returned in under 5 seconds.
On a side note: If you have multiple subqueries that return data from the same table(s) then it may be more performant to build up an indexed #temptable to replace the subqueries or CTEs. When you use the same subquery more than once, the query analyzer will return a plan for each subquery, meaning each subquery will be executed, not just one.
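As a sketch of that side note in T-SQL (table and column names are illustrative, reusing the question's schema):

```sql
-- Materialize the shared subquery once into an indexed temp table.
SELECT user_id_1, mail_id, age
INTO #filtered
FROM table_name
WHERE name = 'test';

CREATE CLUSTERED INDEX ix_filtered ON #filtered (user_id_1);

-- Every place that previously repeated the subquery now reads #filtered,
-- so the filtering work is done once instead of once per reference.
SELECT f.user_id_1
FROM #filtered AS f
WHERE f.age = '18';
```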

Converting Postgres to BigQuery

I am having trouble converting a Postgres query to a BigQuery query.
Please, how could this query work on BigQuery? I need to subselect a table and return the latest record:
select
(select g.fase__c
from operacoes.salesforce_gest_o_do_canal_c_full g
where g.n_de_indentifica_o_do_parceiro__c = a.n_de_indentifica_o_do_parceiro__c
order by g.data_do_credenciamento__c limit 1) as fase_jornada
from operacoes.salesforce_account_full a
If I try to execute it, BigQuery returns an error saying I should apply a join; if I apply the join, the ORDER BY doesn't work.
Thanks!
A correlated subquery (which is a subquery that references a table of an outer query), if executed "as is", would need to compute the subquery for each row of the results of the outer query, which would not be very efficient.
In order to optimize this, BigQuery first transforms (decorrelates) the query into one that is functionally equivalent to the correlated one but without any correlations between queries and subqueries. Different decorrelation processes are needed for different use cases and they can be quite complex and difficult to implement and properly test.
BigQuery is implementing decorrelation strategies to make a variety of correlated subqueries possible, however decorrelation for the LIMIT clause has not been implemented.
A workaround would be to use ARRAY_AGG instead of a subquery. In your case, I believe the following query would do the work:
SELECT
ARRAY_AGG(g.fase__c
ORDER BY g.data_do_credenciamento__c
LIMIT 1) AS fase_jornada
FROM
operacoes.salesforce_account_full a
JOIN
operacoes.salesforce_gest_o_do_canal_c_full g
ON
g.n_de_indentifica_o_do_parceiro__c = a.n_de_indentifica_o_do_parceiro__c
GROUP BY
g.n_de_indentifica_o_do_parceiro__c
Take into account that I have guessed some details since the whole context for the subquery was not provided, so you may need to make some changes to it.
Please let me know if anything was not clear!
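Another workaround for this top-1-per-group pattern in BigQuery is a window function. A sketch, assuming the same table and column names as above and the same ORDER BY direction as the original query:

```sql
-- Rank the rows of the subquery's table per partner, then keep only rank 1.
SELECT
  a.*,
  g.fase__c AS fase_jornada
FROM operacoes.salesforce_account_full AS a
JOIN (
  SELECT
    fase__c,
    n_de_indentifica_o_do_parceiro__c,
    ROW_NUMBER() OVER (
      PARTITION BY n_de_indentifica_o_do_parceiro__c
      ORDER BY data_do_credenciamento__c
    ) AS rn
  FROM operacoes.salesforce_gest_o_do_canal_c_full
) AS g
  ON g.n_de_indentifica_o_do_parceiro__c = a.n_de_indentifica_o_do_parceiro__c
 AND g.rn = 1
```

Unlike ARRAY_AGG, this returns a scalar column directly, which matches the shape of the original correlated subquery.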

SQL: IN vs EXISTS

I read that normally you should use EXISTS when the results of the subquery are large, and IN when the subquery results are small.
But it would seem to me that it's also relevant if a subquery has to be re-evaluated for each row, or if it can be evaluated once for the entire query.
Consider the following example of two equivalent queries:
SELECT * FROM t1
WHERE attr IN
(SELECT attr FROM t2
WHERE attr2 = ?);
SELECT * FROM t1
WHERE EXISTS
(SELECT * FROM t2
WHERE t1.attr = t2.attr
AND attr2 = ?);
The former subquery can be evaluated once for the entire query, the latter has to be evaluated for each row.
Assume that the results of the subquery are very large. Which would be the best way to write this?
This is a good question, especially as in Oracle you can convert every EXISTS clause into an IN clause and vice versa, because Oracle's IN clause can deal with tuples (WHERE (a, b, c) IN (SELECT x, y, z FROM ...)), which most other DBMSs cannot.
And your reasoning is good. Yes, with the IN clause you suggest, the subquery's data is loaded once instead of the records being looked up in a loop. However, this is only partly true, because:
As good as it seems to get all subquery data selected just once, the outer query must loop through the resulting array for every record. This can be quite slow, because it's just an array. If Oracle looks up data in a table instead there are often indexes to help it, so the nested loop with repeated table lookups is eventually faster.
Oracle's optimizer re-writes queries. So it can come to the same execution plan for the two statements or even get to quite unexpected plans. You never know ;-)
Oracle might decide not to loop at all. It may decide for a hash join instead, which works completely different and is usually very effective.
Having said this, Oracle's optimizer should notice that the two statements are exactly the same actually and should generate the same execution plan. But experience shows that the optimizer sometimes doesn't notice, and quite often the optimizer does better with the EXISTS clause for whatever reason. (Not as much difference as in MySQL, but still, EXISTS seems preferable over IN in Oracle, too.)
So as to your question "Assume that the results of the subquery are very large. Which would be the best way to write this?", it is unlikely for the IN clause to be faster than the EXISTS clause.
I often like the IN clause better for its simplicity and mostly find it a bit more readable. But when it comes to performance, it is sometimes better to use EXISTS (or even outer joins for that matter).
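To illustrate the tuple form mentioned above (t1/t2 and the columns are illustrative), the following pair is equivalent in Oracle:

```sql
-- IN with a tuple:
SELECT * FROM t1
WHERE (t1.a, t1.b) IN (SELECT t2.x, t2.y FROM t2);

-- The equivalent EXISTS form:
SELECT * FROM t1
WHERE EXISTS (SELECT 1 FROM t2
              WHERE t2.x = t1.a AND t2.y = t1.b);
```

Most other databases would force you into the EXISTS form for the multi-column case, which is one reason the two clauses are not always interchangeable in practice.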

SQL query processing order: Group By first or Join First?

I need to execute an SQL query.
If I have a query with multiple tables in the FROM clause, a join condition in the WHERE clause,
and a GROUP BY statement,
should I perform the join operation first, followed by the GROUP BY?
Or should I perform the GROUP BY first and then the join?
Which one would be better?
Note: In my environment, whichever operator filters out more tuples should be executed first, for better performance and less memory usage during overall query execution.
Use your DB's EXPLAIN syntax; you'll see which of these two methods (in your specific environment) causes more DB operations to produce the output.
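A sketch of that check in Postgres syntax (the tables and columns are illustrative):

```sql
-- Show the planner's chosen order of joins and aggregation without running the query:
EXPLAIN
SELECT o.customer_id, COUNT(*)
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY o.customer_id;

-- EXPLAIN ANALYZE actually executes the query and reports real row counts and timings:
EXPLAIN ANALYZE
SELECT o.customer_id, COUNT(*)
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY o.customer_id;
```

The plan itself answers the question: the optimizer, not the query author, decides whether grouping happens before or after a join, and EXPLAIN shows which choice it made for your data.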

Subqueries vs joins

I refactored a slow section of an application we inherited from another company to use an inner join instead of a subquery like:
WHERE id IN (SELECT id FROM ...)
The refactored query runs about 100x faster (~50 seconds down to ~0.3). I expected an improvement, but can anyone explain why it was so drastic? The columns used in the WHERE clause were all indexed. Does SQL execute the query in the WHERE clause once per row or something?
Update - Explain results:
The difference is in the second part of the "where id in ()" query -
2 DEPENDENT SUBQUERY submission_tags ref st_tag_id st_tag_id 4 const 2966 Using where
vs 1 indexed row with the join:
SIMPLE s eq_ref PRIMARY PRIMARY 4 newsladder_production.st.submission_id 1 Using index
A "correlated subquery" (i.e., one in which the where condition depends on values obtained from the rows of the containing query) will execute once for each row. A non-correlated subquery (one in which the where condition is independent of the containing query) will execute once at the beginning. The SQL engine makes this distinction automatically.
But, yeah, explain-plan will give you the dirty details.
You are running the subquery once for every row whereas the join happens on indexes.
Here's an example of how subqueries are evaluated in MySQL 6.0.
The new optimizer will convert this kind of subquery into a join.
Before the queries are run against the dataset, they are put through a query optimizer. The optimizer attempts to organize the query so that it can remove as many tuples (rows) from the result set as quickly as it can. Often when you use subqueries (especially bad ones), the tuples can't be pruned out of the result set until the outer query starts to run.
Without seeing the query, it's hard to say what was so bad about the original, but my guess would be that it was something the optimizer just couldn't make much better. Running EXPLAIN will show you the optimizer's method for retrieving the data.
Look at the query plan for each query.
WHERE ... IN and JOIN can typically be implemented using the same execution plan, so typically there is zero speed-up from changing between them.
The optimizer didn't do a very good job. Usually the two forms can be transformed into each other without any difference, and the optimizer can do this.
This question is somewhat general, so here's a general answer:
Basically, queries take longer when MySQL has tons of rows to sort through.
Do this:
Run an EXPLAIN on each of the queries (the JOIN'ed one, then the Subqueried one), and post the results here.
I think seeing the difference in MySQL's interpretation of those queries would be a learning experience for everyone.
The where subquery has to run 1 query for each returned row. The inner join just has to run 1 query.
Usually it's the result of the optimizer not being able to figure out that the subquery can be executed as a join, in which case it executes the subquery for each record in the table rather than joining the table in the subquery against the table you are querying. Some of the more "enterprisey" databases are better at this, but they still miss it sometimes.
With a subquery, you have to re-execute the 2nd SELECT for each result, and each execution typically returns 1 row.
With a join, the 2nd SELECT returns a lot more rows, but you only have to execute it once. The advantage is that now you can join on the results, and joining relations is what a database is supposed to be good at. For example, maybe the optimizer can spot how to take better advantage of an index now.
It isn't so much the subquery as the IN clause, although joins are at the foundation of at least Oracle's SQL engine and run extremely quickly.
The subquery was probably executing a "full table scan". In other words, not using an index, and returning way too many rows that the WHERE clause of the main query then had to filter out.
Just a guess without details of course but that's the common situation.
Taken from the Reference Manual (14.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOINS.
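As a sketch of the rewrite that manual section describes (t1/t2 and the id column are illustrative):

```sql
-- Subquery form:
SELECT * FROM t1
WHERE id IN (SELECT id FROM t2);

-- Equivalent inner join (DISTINCT guards against duplicate matches in t2):
SELECT DISTINCT t1.*
FROM t1
INNER JOIN t2 ON t1.id = t2.id;

-- And the NOT IN form rewritten as a LEFT [OUTER] JOIN anti-join:
SELECT t1.*
FROM t1
LEFT JOIN t2 ON t1.id = t2.id
WHERE t2.id IS NULL;
```

The join forms give the optimizer a single flattened query to plan, which is what makes the LEFT [OUTER] JOIN version potentially faster than the subquery.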