How to perform CASE on values from a subquery (SQL)

This is going to be difficult to explain, but here goes.
I am looking to perform a CASE condition in a SELECT clause that will use the results of two calculations to determine which calculation value to return for a column value.
Maybe a code sample will help:
this works:
SELECT
    A.[COLUMN1]
  , B.[COLUMN1]
  , CASE
        WHEN A.[COLUMN2] + A.[COLUMN3] >= B.[COLUMN2] + B.[COLUMN3] THEN A.[COLUMN2] + A.[COLUMN3]
        ELSE B.[COLUMN2] + B.[COLUMN3]
    END
FROM
    [TABLE_A] A
    INNER JOIN [TABLE_B] B ON A.ID = B.ID
The problem here is that the query above, in the CASE statement, is forced to perform the calculation twice: once for the WHEN clause and again for the THEN clause.
I want to do something like this, but SQL is not happy with it.
SELECT
A.[COLUMN1]
, B.[COLUMN1]
, CASE
WHEN AB.X >= AB.Y THEN AB.X
ELSE AB.Y
END
FROM ((A.[COLUMN2] + A.[COLUMN3]) X, (B.[COLUMN2] + B.[COLUMN3]) Y)
FROM
[TABLE_A] A
INNER JOIN [TABLE_B] B ON A.ID = B.ID
Is this even possible? In the second example, I am calculating the values only once and referring to them in the case statement, both for the WHEN and the THEN clauses.

I would much prefer to push the calculations down into each table. This keeps the structure of the query quite similar. So, a syntactically correct (or almost correct) version would be:
SELECT A.[COLUMN1], B.[COLUMN1],
       (CASE WHEN a.col_2_3 >= b.col_2_3 THEN a.col_2_3
             ELSE b.col_2_3
        END)
FROM (SELECT a.*, (a.[COLUMN2] + a.[COLUMN3]) AS col_2_3
      FROM [TABLE_A] a
     ) a INNER JOIN
     (SELECT b.*, (b.[COLUMN2] + b.[COLUMN3]) AS col_2_3
      FROM [TABLE_B] b
     ) b
     ON a.ID = b.ID
There are so many important factors in performance, and overhead for simple calculations is just not one of them. Reading the data and performing the join are far more expensive than simple arithmetic.
However, moving expressions into subqueries is useful for a few reasons. First, the calculations could be more expensive than these (using subqueries, say). It also helps with readability and hence maintainability.
Finally, a SQL engine could decide to evaluate those expressions just once. In practice, I'm guessing that few engines actually make that trivial optimization.
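If you happen to be on SQL Server (the bracketed identifiers suggest it), CROSS APPLY with a VALUES row constructor is another way to name each expression exactly once. A minimal sketch, assuming the tables from the question:
-- The VALUES row constructor computes each sum once per row and exposes them as v.X and v.Y
SELECT A.[COLUMN1], B.[COLUMN1],
       CASE WHEN v.X >= v.Y THEN v.X ELSE v.Y END
FROM [TABLE_A] A
INNER JOIN [TABLE_B] B ON A.ID = B.ID
CROSS APPLY (VALUES (A.[COLUMN2] + A.[COLUMN3],
                     B.[COLUMN2] + B.[COLUMN3])) AS v(X, Y);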

You could reformulate it like this:
SELECT a_column1,
b_column1,
CASE
WHEN x >= y THEN x
ELSE y
END AS foo
FROM (SELECT A.[column1] A_COLUMN1,
B.[column1] B_COLUMN1,
( A.[column2] + A.[column3] ) X,
( B.[column2] + B.[column3] ) Y
FROM [table_a] A
INNER JOIN [table_b] B
ON A.id = B.id) t
But I'm not sure it will make a difference, since the operations may be performed once per row anyway.

Related

Need help in optimizing SQL query

I am new to SQL and have written the query below to fetch the required results. However, it seems to take ages to run and is quite slow. Any help with optimization would be greatly appreciated.
Below is the SQL query I am using:
SELECT date_trunc('week', a.pair_date) AS pair_week,
       a.used_code,
       a.used_name,
       b.line,
       b.channel,
       count(case when b.sku = c.sku then used_code else null end)
FROM a
LEFT JOIN b
       ON a.ma_number = b.ma_number
      AND (a.imei = b.set_id OR a.imei = b.repair_imei)
LEFT JOIN c
       ON a.used_code = c.code
GROUP BY 1, 2, 3, 4, 5
I would rewrite the query as:
select Date_trunc('week',a.pair_date) as pair_week,
a.used_code, a.used_name, b.line, b.channel,
count(*) filter (where b.sku = c.sku)
from a left join
b
on a.ma_number = b.ma_number and
a.imei in ( b.set_id, b.repair_imei ) left join
c
on a.used_code = c.code
group by 1,2,3,4,5;
For this query, you want indexes on b(ma_number, set_id, repair_imei) and c(code, sku). However, this doesn't leave much scope for optimization.
There might be some other possibilities, depending on the tables. For instance, or/in in the on clause is usually a bad sign -- but it is unclear what your intention really is.
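For reference, those index suggestions in DDL form (a sketch assuming PostgreSQL, given the FILTER syntax above; the index names are hypothetical):
-- Covering index for the join from a to b:
CREATE INDEX idx_b_ma_number_set_repair ON b (ma_number, set_id, repair_imei);
-- Covering index for the join from a to c plus the sku comparison:
CREATE INDEX idx_c_code_sku ON c (code, sku);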

What's the difference between these two SQL statements?

Here are two SQL statements that I think are equivalent, but when I run them the second one is much slower. Can anyone tell me why?
First one:
select
a.name, if(b.score1 = 0, b.score2, b.score1)
from
a, b
where
a.id = b.id
and if(b.score1 = 0, b.score2, b.score1) > 0
Second one:
select
a.name, temp.score
from
a, b,
(select if(b.score1 = 0, b.score2, b.score1) as score from b) as temp
where
a.id = b.id
and temp.score > 0
The above is a simple example. If my query is:
select a.name,
if(b.usedname1='',if(b.usedname2='',b.usedname3,b.usedname2),b.usedname1)
from a,b
where a.id=b.id and
if(b.usedname1='',if(b.usedname2='',b.usedname3,b.usedname2),b.usedname1)<>'tom';
I have 5 more used-name columns in my table; is there any way to simplify this kind of statement?
The right way to write the query is to use proper, explicit, standard JOIN syntax.
I would write this as:
select a.name, (case when b.score1 = 0 then b.score2 else b.score1 end)
from a join
b
on a.id = b.id
where (b.score1 = 0 and b.score2 > 0) or b.score1 > 0;
I suspect that you might really want greatest() rather than a conditional expression, but that is just speculation.
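As for the follow-up about the five additional used-name columns: if the intent is "first non-empty name", a NULLIF/COALESCE chain scales better than nesting IFs. A sketch, assuming MySQL and the columns from the question:
-- NULLIF(col, '') turns empty strings into NULL so COALESCE can skip them
select a.name,
       coalesce(nullif(b.usedname1, ''),
                nullif(b.usedname2, ''),
                b.usedname3) as used_name
from a join b on a.id = b.id
where coalesce(nullif(b.usedname1, ''),
               nullif(b.usedname2, ''),
               b.usedname3) <> 'tom';
Extending this to more columns just means adding one NULLIF per column to the chain.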
The second statement has an additional join. I have no idea why you think a query with three table references and two joins would be equivalent to a query with two table references and one join.
If you'd written the same query, I'd say quite possibly nothing. Use the query plan generated by whatever DB engine you are using.
However, as you have introduced temp and still joined to b, they are not the same.
This is probably the same:
select a.name, temp.score
from a,
     (select b.id, if(b.score1 = 0, b.score2, b.score1) as score from b) as temp
where a.id = temp.id and temp.score > 0

Refactoring slow SQL query

I currently have this very very slow query:
SELECT generators.id AS generator_id, COUNT(*) AS cnt
FROM generator_rows
JOIN generators ON generators.id = generator_rows.generator_id
WHERE
generators.id IN (SELECT "generators"."id" FROM "generators" WHERE "generators"."client_id" = 5212 AND ("generators"."state" IN ('enabled'))) AND
(
generators.single_use = 'f' OR generators.single_use IS NULL OR
generator_rows.id NOT IN (SELECT run_generator_rows.generator_row_id FROM run_generator_rows)
)
GROUP BY generators.id;
And I'm trying to refactor/improve it with this query:
SELECT g.id AS generator_id, COUNT(*) AS cnt
from generator_rows gr
join generators g on g.id = gr.generator_id
join lateral(select case when exists(select * from run_generator_rows rgr where rgr.generator_row_id = gr.id) then 0 else 1 end as noRows) has on true
where g.client_id = 5212 and "g"."state" IN ('enabled') AND
(g.single_use = 'f' OR g.single_use IS NULL OR has.norows = 1)
group by g.id
For some reason it doesn't quite work as expected (it returns 0 rows). I think I'm pretty close to the end result but can't get it to work.
I'm running on PostgreSQL 9.6.1.
This appears to be the query, formatted so I can read it:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id
WHERE gr.generator_id IN (SELECT g.id
FROM generators g
WHERE g.client_id = 5212 AND
g.state = 'enabled'
) AND
(g.single_use = 'f' OR
g.single_use IS NULL OR
gr.id NOT IN (SELECT rgr.generator_row_id FROM run_generator_rows rgr)
)
GROUP BY gr.generator_id;
I would be inclined to do most of this work in the FROM clause:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id JOIN
generators gg
on g.id = gg.id AND
gg.client_id = 5212 AND gg.state = 'enabled' LEFT JOIN
run_generator_rows rgr
ON gr.id = rgr.generator_row_id
WHERE g.single_use = 'f' OR
g.single_use IS NULL OR
rgr.generator_row_id IS NULL
GROUP BY gr.generator_id;
This does make two assumptions that I think are reasonable:
generators.id is unique
run_generator_rows.generator_row_id is unique
(It is easy to avoid these assumptions, but the duplicate elimination is more work.)
Then, some indexes could help:
generators(client_id, state, id)
run_generator_rows(generator_row_id)
generator_rows(generator_id)
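For what it's worth, a NOT EXISTS variant sidesteps both uniqueness assumptions, because EXISTS semantics are insensitive to duplicates. A sketch against the same tables:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generator_rows gr JOIN
     generators g
     ON g.id = gr.generator_id
WHERE g.client_id = 5212 AND
      g.state = 'enabled' AND
      (g.single_use = 'f' OR
       g.single_use IS NULL OR
       NOT EXISTS (SELECT 1
                   FROM run_generator_rows rgr
                   WHERE rgr.generator_row_id = gr.id))
GROUP BY gr.generator_id;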
Generally, avoid inner selects such as
WHERE ... IN (SELECT ...)
as they are usually slow.
As was already shown for your problem, it's a good idea to think of SQL in terms of set theory.
You do not join tables merely on their identity:
In fact, SQL takes the set (that is, all rows) of the first table and "multiplies" it with the set of the second table, ending up with n times m combinations.
The ON clause is then used to (often drastically) reduce the result, evaluating each of those combinations to either true (keep) or false (drop). This way you can choose any arbitrary logic for selecting the combinations you favor.
Things get trickier with LEFT JOIN and RIGHT JOIN, but one can think of them as taking one side for granted; for each row of that side they either:
output the combinations of that row IF the logic yields true (at least once), exactly like JOIN does, or
output exactly ONE row, with 'the other side' (the right side on a LEFT JOIN and vice versa) consisting of NULL for every column.
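To make the model concrete, here is a sketch with two hypothetical tables t1 and t2 (engines optimize this, of course; they never materialize all n times m rows):
SELECT *
FROM t1
JOIN t2 ON t1.id = t2.id;
-- ...is logically equivalent to the filtered cross product:
SELECT *
FROM t1
CROSS JOIN t2
WHERE t1.id = t2.id;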
COUNT(*) is great too, but when things get complicated, don't stick to it: use sub-selects that carry the keys only, and once all the hard work is done, join the rest of the columns back on. Like in:
SELECT sub.id, sub.valid_count
FROM (SELECT id,
             SUM(CASE WHEN x THEN 1 ELSE 0 END) AS valid_count
      FROM ...
      GROUP BY id
     ) AS sub
JOIN ... AS details ON sub.id = details.id
The difference is: the inner query is executed only once. The outer query usually has no indexes left to work with and will be slow, but as long as the inner select doesn't make the data explode, this is usually many times faster than SELECT ... WHERE ... IN (SELECT ...) constructs.

Left Outer Join in SQL Server 2014

We are currently upgrading to SQL Server 2014. I have a join that runs fine in SQL Server 2008 R2 but returns duplicates in SQL Server 2014. The issue appears to be with the predicate AND L2.ACCOUNTING_PERIOD = RG.PERIOD_TO, for if I change it to anything but 4, I do not get the duplicates. The query returns the values in Accounting Period 4 twice. It gets account balances for all the previous Accounting Periods, so in this case it returns values for Accounting Periods 0, 1, 2 and 3 correctly but then duplicates the values from Period 4.
SELECT
A.ACCOUNT,
SUM(A.POSTED_TRAN_AMT),
SUM(A.POSTED_BASE_AMT),
SUM(A.POSTED_TOTAL_AMT)
FROM
PS_LEDGER A
LEFT JOIN PS_GL_ACCOUNT_TBL B
ON B.SETID = 'LTSHR'
LEFT OUTER JOIN PS_LEDGER L2
ON A.BUSINESS_UNIT = L2.BUSINESS_UNIT
AND A.LEDGER = L2.LEDGER
AND A.ACCOUNT = L2.ACCOUNT
AND A.ALTACCT = L2.ALTACCT
AND A.DEPTID = L2.DEPTID
AND A.PROJECT_ID = L2.PROJECT_ID
AND A.DATE_CODE = L2.DATE_CODE
AND A.BOOK_CODE = L2.BOOK_CODE
AND A.GL_ADJUST_TYPE = L2.GL_ADJUST_TYPE
AND A.CURRENCY_CD = L2.CURRENCY_CD
AND A.STATISTICS_CODE = L2.STATISTICS_CODE
AND A.FISCAL_YEAR = L2.FISCAL_YEAR
AND A.ACCOUNTING_PERIOD = L2.ACCOUNTING_PERIOD
AND L2.ACCOUNTING_PERIOD = RG.PERIOD_TO
WHERE
A.BUSINESS_UNIT = 'UK001'
AND A.LEDGER = 'LOCAL'
AND A.FISCAL_YEAR = 2015
AND ( (A.ACCOUNTING_PERIOD BETWEEN 1 and 4
AND B.ACCOUNT_TYPE IN ('E','R') )
OR
(A.ACCOUNTING_PERIOD BETWEEN 0 and 4
AND B.ACCOUNT_TYPE IN ('A','L','Q') ) )
AND A.STATISTICS_CODE = ' '
AND A.ACCOUNT = '21101'
AND A.CURRENCY_CD <> ' '
AND A.CURRENCY_CD = 'GBP'
AND B.SETID='LTSHR'
AND B.ACCOUNT=A.ACCOUNT
AND B.SETID = SETID
AND B.EFFDT=(SELECT MAX(EFFDT) FROM PS_GL_ACCOUNT_TBL WHERE SETID='LTSHR' AND ACCOUNT=B.ACCOUNT AND EFFDT<='2015-01-31 00:00:00.000')
GROUP BY A.ACCOUNT
ORDER BY A.ACCOUNT
I'm inclined to suspect that you have simplified your original query too much to reflect the real problem, but I'm going to answer the question as posed, in light of the comments on it to this point.
Since your query does not in fact select anything derived from table L2, nor do any other predicates rely on anything from that table, the only thing accomplished by (left) joining it is to duplicate rows of the pre-aggregation results where more than one satisfies the join condition for the same L2 row. That seems unlikely to be what you want, especially with that particular join being a self join, so I don't see any reason not to remove it altogether. Dollars to doughnuts, that solves the duplication problem.
I'm also going to suggest removing the correlated subquery in the WHERE clause in favor of joining an inline view, since you already join the base table for the subquery anyway. This particular inline view uses the window function version of MAX() instead of the aggregate function version. Ideally, it would directly select only the rows with the target EFFDT values, but it cannot do so without being rather more complicated, which is exactly what I am trying to avoid. The resulting query therefore filters EFFDT externally, as the original did, but without a correlated subquery.
I furthermore removed a few redundant predicates and rewrote one of the messier ones to a somewhat nicer equivalent. And I reordered the predicates in a way that seems more logical to me.
Additionally, since you are filtering on a specific value of A.ACCOUNT, it is pointless (but not wrong) to GROUP BY or ORDER BY that column. Accordingly, I have removed those clauses to make the query simpler and clearer.
Here's what I came up with:
SELECT
A.ACCOUNT,
SUM(A.POSTED_TRAN_AMT),
SUM(A.POSTED_BASE_AMT),
SUM(A.POSTED_TOTAL_AMT)
FROM
PS_LEDGER A
INNER JOIN (
SELECT
*,
MAX(EFFDT) OVER (PARTITION BY ACCOUNT) AS MAX_EFFDT
FROM PS_GL_ACCOUNT_TBL
WHERE
EFFDT <= '2015-01-31 00:00:00.000'
AND SETID = 'LTSHR'
) B
ON B.ACCOUNT=A.ACCOUNT
WHERE
A.ACCOUNT = '21101'
AND A.BUSINESS_UNIT = 'UK001'
AND A.LEDGER = 'LOCAL'
AND A.FISCAL_YEAR = 2015
AND A.CURRENCY_CD = 'GBP'
AND A.STATISTICS_CODE = ' '
AND B.EFFDT = B.MAX_EFFDT
AND CASE
        WHEN B.ACCOUNT_TYPE IN ('E','R')
             AND A.ACCOUNTING_PERIOD BETWEEN 1 AND 4 THEN 1
        WHEN B.ACCOUNT_TYPE IN ('A','L','Q')
             AND A.ACCOUNTING_PERIOD BETWEEN 0 AND 4 THEN 1
        ELSE 0
    END = 1

Why does this query run so much longer than the sum of the subqueries?

How can a query like the following take over sixteen hours to run? (We stopped execution to research optimizations, but none of us are DB experts.) It seems like it should be super-simple to perform the set-based exclusion, right?
SELECT
field
FROM
(subquery that returns 1173126 rows in 20 seconds)
WHERE
field NOT IN (subquery that returns 3927646 rows in 69 seconds)
What else should I include in this note to arm you with enough info to help?
(The actual query follows in case there's something tricksy and specific about it that's causing the problem.)
SELECT blob FROM (
SELECT a.line1 + '|' + substring(a.zip,1,5) as blob
FROM registrations r
JOIN customers c ON r.custId = c.Id
JOIN addresses a ON c.addressId = a.Id
WHERE r.purchaseDate > DATEADD(year,-1,getdate())
GROUP BY a.line1 + '|' + substring(a.zip,1,5)) sq
WHERE blob NOT IN (
SELECT a.line1 + '|' + substring(a.zip,1,5) as blob
FROM registrations r
JOIN customers c ON r.custId = c.Id
JOIN addresses a ON c.addressId = a.Id
WHERE r.purchaseDate BETWEEN DATEADD(year,-5,getdate()) AND DATEADD(year,-1,getdate())
GROUP BY a.line1 + '|' + substring(a.zip,1,5))
You seem to be searching for the addresses that have purchases within the last year but not within previous 5 years.
SELECT DISTINCT a.line1, SUBSTRING(a.zip, 1, 5)
FROM addresses a
WHERE id IN
(
SELECT c.addressId
FROM customers c
JOIN registrations r
ON r.custId = c.id
AND r.purchaseDate > DATEADD(year, -1 ,getdate())
)
AND NOT EXISTS
(
SELECT NULL
FROM customers c
JOIN registrations r
ON r.custId = c.id
JOIN addresses ai
ON ai.id = c.addressId
WHERE r.purchaseDate BETWEEN DATEADD(year,-5,getdate()) AND DATEADD(year,-1,getdate())
AND ai.line1 = a.line1
AND SUBSTRING(ai.zip, 1, 5) = SUBSTRING(a.zip, 1, 5)
)
This query takes care of duplicates of (line1, zip) on addresses with different ids. Do you have such duplicates?
You may not realize this, but a NOT IN statement gets converted to an IF statement by the query engine. So, in your example, it is building a giant IF statement with all those rows (3.9M). Then it has to evaluate each of the IF conditions to see if the value exists. It's no surprise it's taking 16+ hours to run.
You would be much better off trying to find a way to convert this to an EXISTS, or perhaps a join.
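For instance, a sketch of the NOT EXISTS form, reusing the question's own tables (the correlation on the computed blob replaces the giant NOT IN list; NOT EXISTS also handles NULLs more sanely than NOT IN):
SELECT sq.blob
FROM (SELECT a.line1 + '|' + substring(a.zip,1,5) AS blob
      FROM registrations r
      JOIN customers c ON r.custId = c.Id
      JOIN addresses a ON c.addressId = a.Id
      WHERE r.purchaseDate > DATEADD(year,-1,getdate())
      GROUP BY a.line1 + '|' + substring(a.zip,1,5)) sq
WHERE NOT EXISTS (
    SELECT 1
    FROM registrations r
    JOIN customers c ON r.custId = c.Id
    JOIN addresses a ON c.addressId = a.Id
    WHERE r.purchaseDate BETWEEN DATEADD(year,-5,getdate()) AND DATEADD(year,-1,getdate())
      AND a.line1 + '|' + substring(a.zip,1,5) = sq.blob);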
The second subquery is getting run once for each row in the first subquery.
Which means the estimated completion time would be around (1173126 * 69) = 80945694 seconds.
Which is roughly two and a half years...
Now that you've added the actual query, the best thing for you to do is to optimize the two subqueries by adding indexes to the tables. I can't tell you exactly which indexes to add, but there are plenty of good articles on choosing correct indexes for tables.