Why are these two SQL queries so different in efficiency?

I have to use SQL for my internship, and while I know the gist of it, I don't have a programming background, nor do I know what makes code efficient.
Query #1
SELECT DISTINCT
c.[STAT], c.[EVENT], f.[STAT], f.[EVENT]
FROM
(SELECT *
FROM
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [PROCDT], [PROCTIME]) AS a
FROM
TABLE) AS b
) AS c
LEFT JOIN
(SELECT
*
FROM
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [PROCDT], [PROCTIME]) AS d
FROM
TABLE) AS e
) AS f ON c.[ID] = f.[ID] AND a = d - 1
ORDER BY
c.[STAT], c.[EVENT], f.[STAT], f.[EVENT]
Query #2
SELECT DISTINCT
b.[STAT], b.[EVENT], d.[STAT], d.[EVENT]
FROM
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [PROCDT], [PROCTIME]) AS a
FROM TABLE) AS b
LEFT JOIN
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [PROCDT], [PROCTIME]) AS c
FROM TABLE) AS d ON b.[ID] = d.[ID] AND a = c - 1
ORDER BY
b.[STAT], b.[EVENT], d.[STAT], d.[EVENT]
Queries #1 and #2 return the same result, which is expected, but query #1 has a runtime of roughly 5 seconds while query #2 has a runtime of roughly 1 minute and 35 seconds. In other words, the second query takes a good 1.5 minutes longer to run than the first and I am really curious to know why.

The correct way to write this query uses lead(). I'm pretty sure the select distinct is not needed, so this does what you want:
SELECT stat, event,
LEAD(stat) OVER (PARTITION BY ID ORDER BY PROCDT, PROCTIME) as next_stat,
LEAD(event) OVER (PARTITION BY ID ORDER BY PROCDT, PROCTIME) as next_event
FROM TABLE t
ORDER BY stat, event;
The two queries you have written should be the same in SQL Server. Apparently, the extra subqueries are confusing the optimizer. You would need to learn about execution plans to understand this better.
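If the distinct set of transitions is what is actually wanted (as the SELECT DISTINCT in the question suggests), the LEAD() version can be wrapped in a CTE and de-duplicated afterwards. The following is only a sketch: the `events` CTE and its rows are hypothetical sample data standing in for TABLE, and the syntax assumes SQL Server 2012 or later.

```sql
-- Hypothetical sample rows standing in for TABLE (not from the original post).
WITH events AS (
    SELECT *
    FROM (VALUES
        (1, '2023-01-01', '08:00', 'A', 'open'),
        (1, '2023-01-01', '09:00', 'B', 'review'),
        (1, '2023-01-02', '08:00', 'C', 'close'),
        (2, '2023-01-01', '10:00', 'A', 'open'),
        (2, '2023-01-03', '11:00', 'B', 'review')
    ) AS v ([ID], [PROCDT], [PROCTIME], [STAT], [EVENT])
),
-- One pass over the data: each row is paired with the next row of the same ID.
transitions AS (
    SELECT [STAT], [EVENT],
           LEAD([STAT])  OVER (PARTITION BY [ID] ORDER BY [PROCDT], [PROCTIME]) AS next_stat,
           LEAD([EVENT]) OVER (PARTITION BY [ID] ORDER BY [PROCDT], [PROCTIME]) AS next_event
    FROM events
)
-- The last row of each ID has no successor, so next_stat and next_event are NULL,
-- which mirrors the unmatched rows of the LEFT JOIN in the question.
SELECT DISTINCT [STAT], [EVENT], next_stat, next_event
FROM transitions
ORDER BY [STAT], [EVENT], next_stat, next_event;
```

In this sample both IDs contain the transition from A/open to B/review, so it appears once in the result. Because there is no self-join, the table is read and sorted once rather than twice.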

Related

Bigquery resources exceeded during query execution

I have a daily scheduled job on BigQuery, but it crashed after running out of memory. The job takes the most recently updated record from each of 5 tables, which means I use over( ... order by) five times, once per table, and that consumes a lot of memory. Is there an efficient way to fix the error by refactoring the query?
Here's the brief code structure:
CREATE TEMP TABLE main_info AS
WITH orders_1 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_1
)
where rnk = 1
),
orders_2 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_2
)
where rnk = 1
),
orders_3 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_3
)
where rnk = 1
)
SELECT
*
FROM orders_1 o1
LEFT JOIN orders_2 o2
ON o1.order_id = o2.order_id
LEFT JOIN orders_3 o3
ON o1.order_id = o3.order_id
I want to bring memory usage under the limit. From some research I found two suggestions: replace row_number() over( ... order by) with array_agg(), or create a temp table for each source table and then combine them. Is there any better advice?

I'm not sure whether this will solve your problem, but we could definitely use QUALIFY to simplify your CTEs. For example:
SELECT *
FROM order_1
QUALIFY ROW_NUMBER() OVER(order_window) = 1
WINDOW order_window AS (
PARTITION BY order_id
ORDER BY update_time DESC
)
(also uses WINDOW for readability)
It's possible that this will help by eliminating subqueries, but that depends on whether it's already optimised to the same thing behind the scenes.
Other ideas:
Do the left joins give you very different results from inner joins? If so, you could pre-empt this by prefiltering your second and third CTEs so they do not include order IDs that are going to be dropped anyway.
Does it have to be a temporary table that you create? Or could you create full tables for each of the CTEs instead and build this in stages?
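The ARRAY_AGG alternative mentioned in the question can be sketched roughly as follows. The inline `order_1` CTE and its `status` column are hypothetical sample data so the statement runs on its own; whether this actually lowers memory use depends on the data, but the usual argument is that ARRAY_AGG with LIMIT 1 only needs to keep the newest row per group instead of numbering every row in the partition.

```sql
-- Hypothetical sample rows standing in for order_1; the status column is invented.
WITH order_1 AS (
    SELECT 1 AS order_id, TIMESTAMP '2023-01-01 00:00:00' AS update_time, 'created' AS status
    UNION ALL
    SELECT 1, TIMESTAMP '2023-01-02 00:00:00', 'shipped'
    UNION ALL
    SELECT 2, TIMESTAMP '2023-01-01 00:00:00', 'created'
)
-- Expand the struct back into ordinary columns.
SELECT latest.*
FROM (
    -- Aggregate whole rows per order_id, keeping only the most recent one.
    SELECT ARRAY_AGG(t ORDER BY t.update_time DESC LIMIT 1)[OFFSET(0)] AS latest
    FROM order_1 AS t
    GROUP BY t.order_id
);
```

With this sample data the result has one row per order_id: the 'shipped' row for order 1 and the 'created' row for order 2. The same shape would be repeated for each source table before joining.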

Create separate temp tables and combine them. That should reduce memory utilization compared to the initial query. To release memory promptly, you can drop the temp tables as soon as they are no longer needed. See the split below:
CREATE TEMP TABLE orders_1 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_1
)
where rnk = 1 );
CREATE TEMP TABLE orders_2 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_2
)
where rnk = 1 );
CREATE TEMP TABLE orders_3 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_3
)
where rnk = 1);
CREATE TEMP TABLE main_info AS
SELECT *
FROM orders_1 o1
LEFT JOIN orders_2 o2
ON o1.order_id = o2.order_id
LEFT JOIN orders_3 o3
ON o1.order_id = o3.order_id;
DROP TABLE orders_1;
DROP TABLE orders_2;
DROP TABLE orders_3;

Should I put a row number filter in join condition or in a prior CTE?

I have a subscription table and a payments table that I need to join.
I am trying to decide between 2 options and performance is a key consideration.
Which of the two OPTIONS below will perform better?
I am using Impala, and these tables are large (multiple millions of rows). I only need one row for every id and date grouping (hence the row_number() analytic function).
I have shortened the queries to illustrate my question:
OPTION 1:
WITH cte
AS (
SELECT *
, SUM(payment_amount) OVER (PARTITION BY id, date)
AS sameday_total
, ROW_NUMBER() OVER (PARTITION BY id, date ORDER BY purchase_number DESC)
AS sameday_rownum
FROM payments
),
payment
AS (
SELECT *
FROM cte
WHERE sameday_rownum = 1
)
SELECT s.*
, p.sameday_total
FROM subscription s
INNER JOIN payment p ON s.id = p.id
OPTION 2:
WITH payment
AS (
SELECT *
, SUM(payment_amount) OVER (PARTITION BY id, date)
AS sameday_total
, ROW_NUMBER() OVER (PARTITION BY id, date ORDER BY purchase_number DESC)
AS sameday_rownum
FROM payments
)
SELECT s.*
, p.sameday_total
FROM subscription s
INNER JOIN payment p ON s.id = p.id
AND p.sameday_rownum = 1

An "Option 0" also exists. A far more traditional "derived table" which simply does not require use of any CTE.
SELECT s.*
, p.sameday_total
FROM subscription s
INNER JOIN (
SELECT *
, SUM(payment_amount) OVER (PARTITION BY id, date)
AS sameday_total
, ROW_NUMBER() OVER (PARTITION BY id, date ORDER BY purchase_number DESC)
AS sameday_rownum
FROM payments
) p ON s.id = p.id
AND p.sameday_rownum = 1
Options 0, 1 and 2 are all likely to produce identical or very similar explain plans (although I'm more confident about that statement for SQL Server than for Impala).
Adopting a CTE does not, in itself, make a query more efficient or better performing, so the syntax change between options 1 and 2 isn't major. I prefer option 0 myself, as I prefer to reserve CTEs for specific tasks (e.g. recursion).
What you should do is use explain plans to study what each option produces.
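As a rough illustration of that advice, Impala accepts an EXPLAIN prefix on a query, which prints the plan without executing it. The statement below simply reuses Option 0 with the table and column names from the question; it is a sketch and has not been run against the asker's data.

```sql
-- Prefix the statement with EXPLAIN to print the plan instead of running the query.
EXPLAIN
SELECT s.*
     , p.sameday_total
FROM subscription s
INNER JOIN (
    SELECT *
         , SUM(payment_amount) OVER (PARTITION BY id, date) AS sameday_total
         , ROW_NUMBER() OVER (PARTITION BY id, date ORDER BY purchase_number DESC) AS sameday_rownum
    FROM payments
) p ON s.id = p.id
   AND p.sameday_rownum = 1;
```

Running the same EXPLAIN on Options 1 and 2 and comparing the output side by side should show whether the sameday_rownum = 1 filter is applied at the same point in each plan, that is, after the analytic step and before the join.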

Amalgamating SQL queries stored as views together / Combining tables

I have several summary queries stored as Views...
...and would like to join them together into one combined output as follows:
..so I can use it as a pivot table in Excel.
Date is the only common denominator in this case.
I can do this in Excel using SUMIFS but would prefer to manage it in the SQL before it arrives in Excel.
Can anyone help?

Without a matching ID, the best I can think of is to order by ROW_NUMBER(), which gives a slightly verbose query:
WITH cte1 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY DATE
ORDER BY CASE WHEN Dogs IS NULL THEN 1 END) r1
FROM View1
), cte2 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY DATE
ORDER BY CASE WHEN Region IS NULL THEN 1 END) r2
FROM View2
), cte3 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY DATE
ORDER BY CASE WHEN Bed IS NULL THEN 1 END) r3
FROM View3
)
SELECT COALESCE(cte1.Date, cte2.Date, cte3.Date) Date,
Dogs, D_Qty, Region, R_Qty, Bed, B_Qty
FROM cte1
FULL OUTER JOIN cte2
ON cte1.Date = cte2.Date AND r1=r2
FULL OUTER JOIN cte3
ON cte1.Date = cte3.Date AND r1=r3
OR cte2.Date = cte3.Date AND r2=r3
ORDER BY Date, COALESCE(r1,r2,r3)
An SQLfiddle to test with.
You may consider adding an order column to your views, using ROW_NUMBER() OVER (PARTITION BY DATE ORDER BY (whatever order is in them)). That would eliminate all the CTEs and give you a stable ordering of things.

If you can add one more column to view1, view2 and view3, then you can solve your issue easily.
Check this

how to join two tables in sql server with out duplication

Hi, I have two tables, A and B.
Table A:
Order  Pick up
100    Toronto
100    Mississauga
100    Scarborough
Table B:
Order  Drop off
100    Oakvile
100    Hamilton
100    Milton
Please let me know how I can get the output below (i.e. I just want to place the fields from B to the right of A):
Order  pickup       Dropoff
100    Toronto      oakvile
100    Mississauga  Hamilton
100    Scarborough  Milton
How can I write a query for this? I tried to join on a.rownum = b.rownum but had no luck.

As the OP has not mentioned any RDBMS, I am taking the liberty of assuming SQL Server 2008. If needed, the following query can be converted to another RDBMS easily.
select A.[Order],
ROW_NUMBER() OVER(ORDER BY A.[Pick up]) rn1,
A.[Pick up]
into A1
FROM A
;
select B.[Order],
ROW_NUMBER() OVER(ORDER BY B.[Drop off]) rn2,
B.[Drop off]
into B1
FROM B
;
Select A1.[Order],
A1.[Pick up],
B1.[Drop off]
FROM A1
INNER JOIN B1 on A1.rn1=B1.rn2
SQL FIDDLE to Test

From the use of rownum, I'm presuming that you are using Oracle. You can attempt the following:
select a.Order as "order", a.Pickup, b.DropOff
from (select a.*, rownum as seqnum
from a
) a join
(select b.*, rownum as seqnum
from b
) b
on a.order = b.order and a.seqnum = b.seqnum;
(This assumes that all orders match up exactly.)
I must emphasize that although this might seem to work (and it should work on small examples), it will not work in general. And, it will not work on data that has deleted records. And, it probably won't work on parallel systems. If you have a small amount of data, I'd suggest dumping it in Excel and doing the work there -- that way, you can see if the pairs make sense.
Also, if you do have a column that specifies the ordering, then basically the same structure will work:
select coalesce(a.Order, b.Order) as "order", a.Pickup, b.DropOff
from (select a.*,
row_number() over (partition by "order" order by <ordering field>) as seqnum
from a
) a join
(select b.*,
row_number() over (partition by "order" order by <ordering field>) as seqnum
from b
) b
on a.order = b.order and a.seqnum = b.seqnum;

I'd use a CTE along with the ROW_NUMBER windowing function.
WITH keyed_A AS (
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS id
,[Order]
,[Pick Up]
FROM A
), keyed_B AS (
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS id
,[Order]
,[Drop Off]
FROM B
)
SELECT
a.[Pick Up]
,b.[Drop Off]
FROM keyed_A AS a
INNER JOIN keyed_B AS b
ON a.id = b.id
;
Each CTE can be thought of as a virtual table with a generated id that lines up across the two tables. The OVER clause with the windowing function ROW_NUMBER is used to create that id in the CTE. Since we are relying on the physical storage order of the records (not a good idea, please add keys to the tables), we can ORDER BY (SELECT NULL), which means "just use whatever order the rows are read in".
SQLFiddle to test
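For comparison, here is a sketch of the same pairing when the tables do carry an explicit ordering column, which is what the advice about adding keys points towards. The `seq` column and the inline sample rows are assumptions added for illustration; they are not part of the original tables. The syntax is SQL Server.

```sql
-- Hypothetical: both tables carry a seq column recording the intended order.
WITH A AS (
    SELECT *
    FROM (VALUES
        (100, 1, 'Toronto'),
        (100, 2, 'Mississauga'),
        (100, 3, 'Scarborough')
    ) AS v ([Order], seq, [Pick up])
),
B AS (
    SELECT *
    FROM (VALUES
        (100, 1, 'Oakvile'),
        (100, 2, 'Hamilton'),
        (100, 3, 'Milton')
    ) AS v ([Order], seq, [Drop off])
),
-- Number the rows within each order using the explicit ordering column.
keyed_A AS (
    SELECT [Order], [Pick up],
           ROW_NUMBER() OVER (PARTITION BY [Order] ORDER BY seq) AS rn
    FROM A
),
keyed_B AS (
    SELECT [Order], [Drop off],
           ROW_NUMBER() OVER (PARTITION BY [Order] ORDER BY seq) AS rn
    FROM B
)
-- Pair the n-th pick up with the n-th drop off of the same order.
SELECT a.[Order], a.[Pick up], b.[Drop off]
FROM keyed_A AS a
INNER JOIN keyed_B AS b
    ON a.[Order] = b.[Order] AND a.rn = b.rn
ORDER BY a.[Order], a.rn;
```

With these sample rows the result pairs Toronto with Oakvile, Mississauga with Hamilton and Scarborough with Milton, and the pairing no longer depends on the order in which rows happen to be read.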

How to write a derived query in Netezza SQL?

I need to query the data per inviteid. For each inviteid I need the top 5 IDs and ID descriptions.
I see that the query I wrote is taking all the time in the world to fetch. I didn't notice an error or anything wrong with it.
The code is:
SELECT count(distinct ID),
IDdesc,
inviteid,
A
FROM (
SELECT
ID,
IDdesc,
inviteid,
RANK() OVER(order by invtypeid asc ) A
FROM Fact_s
--WHERE dateid ='26012013'
GROUP BY invteid,IDdesc,ID
ORDER BY invteid,IDdesc,ID
) B
WHERE A <=5
GROUP BY A, IDDESC, inviteid
ORDER BY A

I'm not sure I understood your requirement completely, but as far as I can tell the group by in the derived table is not necessary (and neither is the order by, as Mark mentioned) because you are using a window function.
And you probably want row_number() instead of rank() in there.
Including the result of rank() in the outer query seems dubious as well.
So this leads to the following statement:
SELECT count(distinct ID),
IDdesc,
inviteid
FROM (
SELECT ID,
IDdesc,
inviteid,
row_number() OVER (order by invtypeid asc ) as rn
FROM Fact_s
) B
WHERE rn <= 5
GROUP BY IDDESC, inviteid;
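If the goal really is "the top 5 IDs for each inviteid", the window would presumably need a PARTITION BY inviteid so that the numbering restarts for every invite. The sketch below makes two assumptions that the question does not confirm: that "top" means the lowest ID values, and that duplicate (inviteid, ID, IDdesc) combinations should be counted once.

```sql
SELECT inviteid,
       ID,
       IDdesc
FROM (
    SELECT inviteid,
           ID,
           IDdesc,
           -- Numbering restarts at 1 for every inviteid.
           -- Assumption: "top" means lowest ID; swap in the real ranking column.
           ROW_NUMBER() OVER (PARTITION BY inviteid ORDER BY ID) AS rn
    FROM (
        -- Collapse duplicates so each ID is ranked once per inviteid.
        SELECT DISTINCT inviteid, ID, IDdesc
        FROM Fact_s
    ) AS d
) AS ranked
WHERE rn <= 5
ORDER BY inviteid, rn;
```

Restricting the innermost query with a WHERE on dateid, like the filter commented out in the question, would also reduce the amount of data the window function has to sort.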

We Keep Coding
