Converting a PostgreSQL query to BigQuery - sql

I am having trouble converting a query from PostgreSQL to BigQuery.
How can I make this query work in BigQuery? I need a subselect on a table that brings back the last record:
select
(select g.fase__c
from operacoes.salesforce_gest_o_do_canal_c_full g
where g.n_de_indentifica_o_do_parceiro__c = a.n_de_indentifica_o_do_parceiro__c
order by g.data_do_credenciamento__c limit 1) as fase_jornada
from operacoes.salesforce_account_full a
When I try to execute it, BigQuery returns an error saying I should use a join. If I use the join, the ORDER BY doesn't work.
Thanks!

A correlated subquery (which is a subquery that references a table of an outer query), if executed "as is", would need to compute the subquery for each row of the results of the outer query, which would not be very efficient.
In order to optimize this, BigQuery first transforms (decorrelates) the query into one that is functionally equivalent to the correlated one but without any correlations between queries and subqueries. Different decorrelation processes are needed for different use cases and they can be quite complex and difficult to implement and properly test.
BigQuery is implementing decorrelation strategies to make a variety of correlated subqueries possible; however, decorrelation for the LIMIT clause has not been implemented.
A workaround would be to use ARRAY_AGG instead of a subquery. In your case, I believe the following query would do the work:
SELECT
  -- [OFFSET(0)] unwraps the one-element array into a scalar value
  ARRAY_AGG(g.fase__c
    ORDER BY g.data_do_credenciamento__c
    LIMIT 1)[OFFSET(0)] AS fase_jornada
FROM
  operacoes.salesforce_account_full a
JOIN
  operacoes.salesforce_gest_o_do_canal_c_full g
ON
  g.n_de_indentifica_o_do_parceiro__c = a.n_de_indentifica_o_do_parceiro__c
GROUP BY
  g.n_de_indentifica_o_do_parceiro__c
Take into account that I have guessed some details since the whole context for the subquery was not provided, so you may need to make some changes to it.
Please let me know if anything was not clear!
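To make the idea of decorrelation concrete, here is a minimal sketch that runs against SQLite from Python rather than BigQuery. The table and column names (account, channel, partner_id, fase, cred_date) are simplified, hypothetical stand-ins for the Salesforce tables. The decorrelated form joins against a per-partner MIN of the date instead of using ARRAY_AGG (which SQLite does not have), and it assumes each partner has no two rows with the same date.

```python
import sqlite3

# Hypothetical, simplified stand-ins for the two Salesforce tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (partner_id TEXT);
    CREATE TABLE channel (partner_id TEXT, fase TEXT, cred_date TEXT);
    INSERT INTO account VALUES ('p1'), ('p2');
    INSERT INTO channel VALUES
        ('p1', 'onboarding', '2020-01-01'),
        ('p1', 'active',     '2020-06-01'),
        ('p2', 'prospect',   '2021-03-15');
""")

# Correlated form: the subquery refers to the outer row (a.partner_id).
correlated = conn.execute("""
    SELECT a.partner_id,
           (SELECT g.fase
              FROM channel g
             WHERE g.partner_id = a.partner_id
             ORDER BY g.cred_date
             LIMIT 1) AS fase_jornada
      FROM account a
     ORDER BY a.partner_id
""").fetchall()

# Decorrelated form: compute the earliest date per partner once, then join.
# Assumes no ties on cred_date within a partner.
decorrelated = conn.execute("""
    SELECT a.partner_id, g.fase AS fase_jornada
      FROM account a
      LEFT JOIN (SELECT partner_id, MIN(cred_date) AS first_date
                   FROM channel
                  GROUP BY partner_id) m
             ON m.partner_id = a.partner_id
      LEFT JOIN channel g
             ON g.partner_id = m.partner_id
            AND g.cred_date = m.first_date
     ORDER BY a.partner_id
""").fetchall()

print(correlated)
print(decorrelated)
```

Both queries return one row per account with the fase of the earliest channel row. The LEFT JOINs keep accounts that have no channel rows, which is what the scalar subquery does too (it yields NULL for them).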

Related

How to avoid duplicated SELECT phrases in SQL (MariaDB)

I am working with a small MariaDB database. To extract time intervals per user, I use the following query:
SELECT
SUM(TIMESTAMPDIFF(SECOND,Activity.startTime,Activity.endTime)) AS seconds,
TIME_FORMAT(SEC_TO_TIME(SUM(TIMESTAMPDIFF(SECOND,Activity.startTime,Activity.endTime))),'%Hh %im %ss') AS formattedTime,
User.name
FROM Activity
INNER JOIN User ON User.id = Activity.userID
GROUP BY User.id
ORDER BY seconds DESC;
I have to select the time as plain seconds (... AS seconds) to be able to order the results by it, as can be seen in my query.
However, I also want MariaDB to format the time interval, for that I use the TIME_FORMAT function. The problem is, I have to duplicate the whole SUM(...) phrase inside the TIME_FORMAT call again. This doesn't seem very elegant. Will MariaDB recognize the duplication and calculate the SUM only once? Also, is there a way to get the same result without duplicating the SUM?
I figured this should be possible with a nested query construct like so:
SELECT
innerQuery.name,
innerQuery.seconds,
TIME_FORMAT(SEC_TO_TIME(innerQuery.seconds), '%Hh %im')
FROM (
-- Do the sum here, once.
) AS innerQuery
ORDER BY innerQuery.seconds DESC;
Is this the best way to do it / "ok" to do?
Note: I don't need the raw seconds in the result, only the formatted time is needed.
I'd appreciate help, thanks.
Alas, there isn't a really good solution. When you use a subquery, MariaDB materializes the subquery (as does MySQL). Your query is rather complex, so there is a lot of I/O happening anyway, and the additional materialization may not be important.
Repeating the expression is really more an issue of aesthetics than performance. The expression will be re-executed multiple times. But, the real expense of doing aggregations is the file sort for the group by (or whatever method is used). Doing the sum() twice is not a big deal (unless you are calling a really expensive function as well as the aggregation function).
Other database engines do not automatically materialize subqueries, so using a subquery in other databases is usually the recommended approach. In MariaDB/MySQL, I would guess that repeating the expression is more efficient, although you can try both on your data and report back.
In this case, you don't need the raw values. The formatted value will work correctly in the ORDER BY.
Your subquery idea is likely to be slower because of all the overhead in having two queries.
This is a Rule of Thumb: It takes far more effort for MySQL to fetch a row than to evaluate expressions in the row. With that rule, duplicate expressions are not a burden.
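For reference, the nested-query shape from the question can be sketched as follows. This runs against SQLite from Python, not MariaDB, so TIMESTAMPDIFF and TIME_FORMAT are replaced by SQLite's strftime and printf; the sample rows are made up. It only illustrates that the SUM is written once and reused for both formatting and ordering, and says nothing about which form is faster on MariaDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Activity (userID INTEGER, startTime TEXT, endTime TEXT);
    INSERT INTO User VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO Activity VALUES
        (1, '2020-01-01 10:00:00', '2020-01-01 11:30:00'),
        (1, '2020-01-02 09:00:00', '2020-01-02 09:00:30'),
        (2, '2020-01-01 08:00:00', '2020-01-01 08:45:00');
""")

# The inner query computes the per-user total once; the outer query
# formats it and orders by the raw number without returning it.
rows = conn.execute("""
    SELECT t.name,
           printf('%02dh %02dm %02ds',
                  t.seconds / 3600,
                  (t.seconds % 3600) / 60,
                  t.seconds % 60) AS formattedTime
      FROM (SELECT User.name AS name,
                   SUM(CAST(strftime('%s', Activity.endTime) AS INTEGER)
                     - CAST(strftime('%s', Activity.startTime) AS INTEGER)) AS seconds
              FROM Activity
              JOIN User ON User.id = Activity.userID
             GROUP BY User.id) AS t
     ORDER BY t.seconds DESC
""").fetchall()

print(rows)
```

Ordering by the numeric total rather than by the formatted string avoids depending on how the formatted text happens to sort.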

Is it possible to cache the results of a subquery in a variable?

I have a query in PostgreSQL 9.1 like this:
SELECT id
FROM students
INNER JOIN exams ON /* some condition */
WHERE studentsid NOT IN (SUBQUERY);
When I run only the subquery, it executes in 120 ms. When I run the query above without the subquery condition, it takes 12 seconds, but when I add the subquery condition it runs for half an hour.
Is it possible to cache the results of the subquery in some variable (the results are always the same array of ids) and run this from the console or pgAdmin?
I found the WITH statement, but it looks like it is not supported in Postgres.
First, the WITH statement is supported in Postgres (it was added in version 8.4, so 9.1 has it).
Second, you need to identify where the performance problem is. Is it in the subquery? Or is it the NOT IN?
You can put the subquery in a table, add indexes, and make the query more efficient.
You can rewrite the subquery using a left join, which often allows the query to be better optimized.
You can add appropriate indexes to make the entire query more efficient.
Without knowledge of what the subquery actually does, the right approach is speculation.
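As an illustration of the LEFT JOIN rewrite and the WITH clause mentioned above, here is a minimal sketch run against SQLite from Python instead of Postgres. The excluded table and its student_id column are hypothetical stand-ins for whatever the real subquery returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    -- Hypothetical stand-in for the ids produced by the real subquery.
    CREATE TABLE excluded (student_id INTEGER NOT NULL);
    INSERT INTO students VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO excluded VALUES (2);
""")

# Original shape: NOT IN with a subquery.
not_in = conn.execute("""
    SELECT s.id
      FROM students s
     WHERE s.id NOT IN (SELECT student_id FROM excluded)
     ORDER BY s.id
""").fetchall()

# Rewrite: name the subquery once with WITH, then anti-join against it.
left_join = conn.execute("""
    WITH ex AS (SELECT DISTINCT student_id FROM excluded)
    SELECT s.id
      FROM students s
      LEFT JOIN ex ON ex.student_id = s.id
     WHERE ex.student_id IS NULL
     ORDER BY s.id
""").fetchall()

print(not_in, left_join)
```

The two forms agree here because student_id is declared NOT NULL. If the subquery can return NULL, NOT IN matches no rows at all, while the LEFT JOIN form still returns the unmatched students, so check for that before swapping one for the other.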

How do I avoid repeating this subquery for the IN clause?

I have an SQL script (currently running against SQLite, but it should probably work against any DB engine) that uses the same subquery twice, and since it might be fetching a lot of records (the table has a couple of million rows) I'd like to only call it once.
A shortened pseudo-version of the query looks like this:
SELECT * FROM
([the subquery, returns a column of ids]) AS sq
[a couple of joins, that fetches things from other tables based on the ids]
WHERE thisorthat NOT IN ([the subquery again])
I tried just using the name (sq) in various ways (with/without parenthesis, with/without naming the column of sq etc) but to no avail.
Do I really have to repeat this subquery?
Clarification:
I am doing this in python and sqlite as a small demo of what can be done, but I would like my solution to scale as well as possible with as little modification as possible. In the real situation, the database will have a couple of million rows, but in my example there is just 10 rows with dummy data. Thus, code that would be well optimized on for example MySQL is absolutely good enough - it doesn't have to be optimized specifically for SQLite. But as I said, the less modification needed, the better.
There is a WITH clause in standard SQL; however, I don't know if it is supported by SQLite, though it is of course worth a try:
WITH mySubQuery AS
(
[the subquery code]
)
SELECT * FROM
mySubQuery AS sq
[a couple of joins, that fetches things from other tables based on the ids]
WHERE thisorthat NOT IN (SELECT * FROM mySubQuery)
That said, what you do here will likely be horribly slow for any data set that is more than a few thousand rows, so I'd try to remodel it if possible - NOT IN should be avoided in general, especially if you also have a couple of joins.
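Since the question mentions Python and SQLite: SQLite has supported WITH since version 3.8.3, so the pattern can be tried directly. The following is a small sketch with made-up tables (items, links) and a made-up subquery; only the shape, one named subquery referenced twice, is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables, for illustration only.
    CREATE TABLE items (id INTEGER PRIMARY KEY, flagged INTEGER);
    CREATE TABLE links (item_id INTEGER, other_id INTEGER);
    INSERT INTO items VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO links VALUES (1, 2), (1, 3), (2, 3);
""")

# sq is written once and referenced twice: in FROM and inside NOT IN.
rows = conn.execute("""
    WITH sq AS (SELECT id FROM items WHERE flagged = 1)
    SELECT sq.id, links.other_id
      FROM sq
      JOIN links ON links.item_id = sq.id
     WHERE links.other_id NOT IN (SELECT id FROM sq)
     ORDER BY sq.id, links.other_id
""").fetchall()

print(rows)
```

Note that naming the subquery once removes the duplication in the source text only. Whether the engine actually evaluates it once or twice is up to its optimizer.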
Do you need a subquery? You could probably rewrite using an OUTER JOIN e.g. something like:
SELECT *
FROM [the subquery's FROM clause] AS sq
RIGHT OUTER JOIN [a couple of tables based on the ids]
ON thisorthat = sq.[a column of ids]
WHERE sq.[a column of ids] IS NULL;
In general, I question the need to eliminate the duplication. The SQL compiler can see that two subqueries are identical and choose to only do them once if that seems optimal.
In addition, leaving the duplicates in the source gives the SQL compiler and optimizer the opportunity to treat them differently. For example, the subquery flattening optimization of SQLite might be applied to one of a pair of duplicates, or applied differently to each. See section 9.0, Subquery flattening, of https://www.sqlite.org/optoverview.html.
You can put the SELECT part into a view; then you can filter the view's results using the alias "sq".
I hope this helps.

SQL query processing order: Group By first or Join First?

I need to execute an SQL query.
If I have a query with multiple tables in the FROM clause, with the join condition in the WHERE clause,
and I have a GROUP BY clause,
should I perform the join first, followed by the GROUP BY?
Or should I perform the GROUP BY first and then the join?
Which one would be better?
Note: In my environment, whichever operator that filters out more tuples should be executed first for better performance and less usage of memory for overall query execution.
Use your database's EXPLAIN syntax. You'll see which of these two methods (in your specific environment) requires more database operations to produce the output.
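As a rough illustration of comparing the two orderings with EXPLAIN, here is a sketch using SQLite's EXPLAIN QUERY PLAN from Python. The orders and customers tables are hypothetical, and the plan text differs between SQLite versions and between database engines, so treat the output as something to read rather than something to rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema, for illustration only.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
""")

# Variant 1: join the tables, then group.
join_then_group = """
    SELECT c.region, SUM(o.amount)
      FROM orders o
      JOIN customers c ON c.id = o.customer_id
     GROUP BY c.region
"""

# Variant 2: pre-aggregate orders per customer, then join and group again.
group_then_join = """
    SELECT c.region, SUM(t.total)
      FROM (SELECT customer_id, SUM(amount) AS total
              FROM orders
             GROUP BY customer_id) t
      JOIN customers c ON c.id = t.customer_id
     GROUP BY c.region
"""

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable step description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

for name, sql in [("join first", join_then_group), ("group first", group_then_join)]:
    print(name)
    for step in plan(sql):
        print("   ", step)
```

On real data, run the same comparison against populated tables with their actual indexes, since the planner's choices depend on both.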

Subqueries vs joins

I refactored a slow section of an application we inherited from another company to use an inner join instead of a subquery like:
WHERE id IN (SELECT id FROM ...)
The refactored query runs about 100x faster. (~50 seconds to ~0.3) I expected an improvement, but can anyone explain why it was so drastic? The columns used in the where clause were all indexed. Does SQL execute the query in the where clause once per row or something?
Update - Explain results:
The difference is in the second part of the "where id in ()" query -
2 DEPENDENT SUBQUERY submission_tags ref st_tag_id st_tag_id 4 const 2966 Using where
vs 1 indexed row with the join:
SIMPLE s eq_ref PRIMARY PRIMARY 4 newsladder_production.st.submission_id 1 Using index
A "correlated subquery" (i.e., one in which the where condition depends on values obtained from the rows of the containing query) will execute once for each row. A non-correlated subquery (one in which the where condition is independent of the containing query) will execute once at the beginning. The SQL engine makes this distinction automatically.
But, yeah, explain-plan will give you the dirty details.
You are running the subquery once for every row whereas the join happens on indexes.
Here's an example of how subqueries are evaluated in MySQL 6.0.
The new optimizer will convert this kind of subqueries into joins.
Before the queries are run against the dataset, they are put through a query optimizer. The optimizer attempts to organize the query in such a fashion that it can remove as many tuples (rows) from the result set as quickly as it can. Often when you use subqueries (especially bad ones), the tuples can't be pruned out of the result set until the outer query starts to run.
Without seeing the query it's hard to say what was so bad about the original, but my guess would be that it was something the optimizer just couldn't make much better. Running 'explain' will show you the optimizer's method for retrieving the data.
Look at the query plan for each query.
WHERE ... IN and JOIN can typically be implemented using the same execution plan, so typically there is zero speed-up from changing between them.
The optimizer didn't do a very good job. Usually the two forms can be transformed into each other without any difference, and the optimizer can do this.
This question is somewhat general, so here's a general answer:
Basically, queries take longer when MySQL has tons of rows to sort through.
Do this:
Run an EXPLAIN on each of the queries (the JOIN'ed one, then the Subqueried one), and post the results here.
I think seeing the difference in MySQL's interpretation of those queries would be a learning experience for everyone.
The where subquery has to run 1 query for each returned row. The inner join just has to run 1 query.
Usually it's the result of the optimizer not being able to figure out that the subquery can be executed as a join, in which case it executes the subquery for each record in the table rather than joining the table in the subquery against the table you are querying. Some of the more "enterprisey" databases are better at this, but they still miss it sometimes.
With a subquery, you have to re-execute the 2nd SELECT for each result, and each execution typically returns 1 row.
With a join, the 2nd SELECT returns a lot more rows, but you only have to execute it once. The advantage is that now you can join on the results, and joining relations is what a database is supposed to be good at. For example, maybe the optimizer can spot how to take better advantage of an index now.
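The two query shapes being compared can be sketched like this, using SQLite from Python. The original query is not shown in the question, so the submissions and submission_tags schema below is a guess based on the names in the EXPLAIN output, and the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Guessed schema; the real tables are not shown in the question.
    CREATE TABLE submissions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE submission_tags (submission_id INTEGER, tag_id INTEGER);
    INSERT INTO submissions VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO submission_tags VALUES (1, 10), (1, 10), (2, 20), (3, 10);
""")

# Subquery form: filter submissions by membership in the tag lookup.
with_in = conn.execute("""
    SELECT s.id
      FROM submissions s
     WHERE s.id IN (SELECT st.submission_id
                      FROM submission_tags st
                     WHERE st.tag_id = 10)
     ORDER BY s.id
""").fetchall()

# Join form: DISTINCT is needed because a submission may match several tag rows.
with_join = conn.execute("""
    SELECT DISTINCT s.id
      FROM submissions s
      JOIN submission_tags st ON st.submission_id = s.id
     WHERE st.tag_id = 10
     ORDER BY s.id
""").fetchall()

print(with_in, with_join)
```

The results are the same; what differs between engines and versions is the plan chosen for each form. Without DISTINCT, the join form would return submission 1 twice here, which is worth checking when refactoring IN into a join.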
It isn't so much the subquery as the IN clause, although joins are at the foundation of at least Oracle's SQL engine and run extremely quickly.
The subquery was probably executing a "full table scan". In other words, it was not using the index and was returning far too many rows, which the WHERE clause of the main query then had to filter out.
Just a guess without details of course but that's the common situation.
Taken from the Reference Manual (14.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOINS.