Need help in optimizing sql query - sql

I am new to SQL and have written the query below to fetch the required results. However, it takes a very long time to run. Any help with optimizing it would be appreciated.
Below is the SQL query I am using:
SELECT
Date_trunc('week',a.pair_date) as pair_week,
a.used_code,
a.used_name,
b.line,
b.channel,
count(
case when b.sku = c.sku then used_code else null end
)
from
a
left join b on a.ma_number = b.ma_number
and (a.imei = b.set_id or a.imei = b.repair_imei
)
left join c on a.used_code = c.code
group by 1,2,3,4,5

I would rewrite the query as:
select Date_trunc('week',a.pair_date) as pair_week,
a.used_code, a.used_name, b.line, b.channel,
count(*) filter (where b.sku = c.sku)
from a left join
b
on a.ma_number = b.ma_number and
a.imei in ( b.set_id, b.repair_imei ) left join
c
on a.used_code = c.code
group by 1,2,3,4,5;
For this query, you want indexes on b(ma_number, set_id, repair_imei) and c(code, sku). However, this doesn't leave much scope for optimization.
There might be some other possibilities, depending on the tables. For instance, OR/IN in the ON clause is usually a bad sign, because it often prevents efficient index use -- but it is unclear what your intention really is.
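To check that the rewrite returns the same counts as the original, both forms can be run against a small sample. The following is a minimal sketch using Python's built-in sqlite3 module, not the asker's actual database: the table contents are invented, the weekly Date_trunc grouping is left out because SQLite has no such function, and the FILTER clause needs SQLite 3.30 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; only the columns used by the joins are modelled.
conn.executescript("""
    CREATE TABLE a (ma_number INT, imei TEXT, used_code TEXT);
    CREATE TABLE b (ma_number INT, set_id TEXT, repair_imei TEXT, sku TEXT);
    CREATE TABLE c (code TEXT, sku TEXT);
    INSERT INTO a VALUES (1, 'i1', 'u1'), (1, 'i2', 'u1'), (2, 'i3', 'u2'), (3, 'i4', 'u3');
    INSERT INTO b VALUES (1, 'i1', 'x', 's1'), (1, 'y', 'i2', 's9'), (2, 'i3', 'i3', 's2');
    INSERT INTO c VALUES ('u1', 's1'), ('u2', 's2');
""")

# Original form: OR in the join condition, COUNT over a CASE expression.
original = """
    SELECT a.used_code,
           COUNT(CASE WHEN b.sku = c.sku THEN a.used_code END) AS matches
    FROM a
    LEFT JOIN b ON a.ma_number = b.ma_number
               AND (a.imei = b.set_id OR a.imei = b.repair_imei)
    LEFT JOIN c ON a.used_code = c.code
    GROUP BY a.used_code
    ORDER BY a.used_code
"""

# Rewritten form: IN instead of OR, COUNT(*) with a FILTER clause.
rewritten = """
    SELECT a.used_code,
           COUNT(*) FILTER (WHERE b.sku = c.sku) AS matches
    FROM a
    LEFT JOIN b ON a.ma_number = b.ma_number
               AND a.imei IN (b.set_id, b.repair_imei)
    LEFT JOIN c ON a.used_code = c.code
    GROUP BY a.used_code
    ORDER BY a.used_code
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(original_rows)
print(rewritten_rows)
```

On this sample both queries return the same rows, which suggests the rewrite is mainly a readability change; any speed-up would have to come from the indexes.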

Related

Too many columns in GROUP BY

I'm trying to aggregate some data, but I've a problem. There's my query (using 3 tables):
SELECT
ufc.counter_id,
gcrvf.goal_id,
gcrvf.date_of_visit,
ufc.utm_campaign,
ufc.utm_source,
ufc.utm_medium,
ufc.utm_content,
ufc.utm_term,
ufc.original_join_id,
max(gcrvf.last_update_time) AS last_update_time,
sum(gcrvf.conversions) AS conversions,
c.name AS counter_name,
c.owner_login AS owner_login,
c.status AS counter_status,
concat(g.goal_source,CAST('Goal','text')) AS metric_type,
multiIf(g.is_retargeting = 0,'non-retargeting',g.is_retargeting = 1,'retargeting',NULL) AS metric_key,
concat(g.name,' (',CAST(gcrvf.goal_id,'String'),')') AS metric_name
FROM connectors_yandex_metrika.goal_conversions_report_v_final AS gcrvf
INNER JOIN connectors_yandex_metrika.utm_for_collect AS ufc ON gcrvf.counter_id = ufc.counter_id
LEFT JOIN connectors_yandex_metrika.counter AS c ON gcrvf.counter_id = c.id
LEFT JOIN connectors_yandex_metrika.goal AS g ON gcrvf.goal_id = g.id
WHERE
((gcrvf.utm_campaign = ufc.utm_campaign) OR (ufc.utm_campaign IS NULL))
AND ((gcrvf.utm_source = ufc.utm_source) OR (ufc.utm_source IS NULL))
AND ((gcrvf.utm_medium = ufc.utm_medium) OR (ufc.utm_medium IS NULL))
AND ((gcrvf.utm_content = ufc.utm_content) OR (ufc.utm_content IS NULL))
AND ((gcrvf.utm_term = ufc.utm_term ) OR (ufc.utm_term IS NULL))
GROUP BY
ufc.counter_id,
gcrvf.date_of_visit,
gcrvf.goal_id,
ufc.utm_campaign,
ufc.utm_source,
ufc.utm_medium,
ufc.utm_content,
ufc.utm_term,
ufc.original_join_id,
c.name,
c.owner_login,
c.status,
metric_type,
metric_key,
metric_name
I have to GROUP BY almost all columns. Is that a real problem?
The columns ufc.original_join_id, c.name, c.owner_login, c.status, metric_type, metric_key and metric_name are not needed for the grouping itself. I added them to the GROUP BY only because I need them in the output. Is there a way to make this shorter and keep the unnecessary columns out of the GROUP BY, or is it fine as it is?
My second question: does ClickHouse cache the right table when we use JOINs? Should I therefore always put the huge table on the left?
Every non-aggregated column in the SELECT list is required in the GROUP BY. It is not possible to leave out columns that are selected as plain (non-aggregated) columns.
Depending on your indexed columns, you can improve the speed of the query. Try to create an index on the key columns.
The database handles the cache logic for you, depending on how often you execute the query.
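A common way to shorten such a GROUP BY, when the extra columns are fully determined by the grouping key, is to wrap them in an aggregate instead of listing them. The sketch below illustrates the idea with Python's sqlite3 module and MAX(); in ClickHouse the any() aggregate plays a similar role. The table names and data are invented for the illustration, and the approach only applies if each key really maps to a single value of the wrapped column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified stand-ins for the counter and report tables.
conn.executescript("""
    CREATE TABLE counter (id INT PRIMARY KEY, name TEXT, owner_login TEXT);
    CREATE TABLE conversions_report (counter_id INT, conversions INT);
    INSERT INTO counter VALUES (1, 'shop', 'alice'), (2, 'blog', 'bob');
    INSERT INTO conversions_report VALUES (1, 5), (1, 7), (2, 3);
""")

# Variant 1: every plain column is repeated in GROUP BY.
grouped_by_all = conn.execute("""
    SELECT r.counter_id, c.name, c.owner_login, SUM(r.conversions)
    FROM conversions_report AS r
    LEFT JOIN counter AS c ON r.counter_id = c.id
    GROUP BY r.counter_id, c.name, c.owner_login
    ORDER BY r.counter_id
""").fetchall()

# Variant 2: group by the key only; dependent columns are wrapped in an aggregate.
grouped_by_key = conn.execute("""
    SELECT r.counter_id, MAX(c.name), MAX(c.owner_login), SUM(r.conversions)
    FROM conversions_report AS r
    LEFT JOIN counter AS c ON r.counter_id = c.id
    GROUP BY r.counter_id
    ORDER BY r.counter_id
""").fetchall()

print(grouped_by_all)
print(grouped_by_key)
```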

Refactoring slow SQL query

I currently have this very very slow query:
SELECT generators.id AS generator_id, COUNT(*) AS cnt
FROM generator_rows
JOIN generators ON generators.id = generator_rows.generator_id
WHERE
generators.id IN (SELECT "generators"."id" FROM "generators" WHERE "generators"."client_id" = 5212 AND ("generators"."state" IN ('enabled'))) AND
(
generators.single_use = 'f' OR generators.single_use IS NULL OR
generator_rows.id NOT IN (SELECT run_generator_rows.generator_row_id FROM run_generator_rows)
)
GROUP BY generators.id;
And I'm trying to refactor and improve it with this query:
SELECT g.id AS generator_id, COUNT(*) AS cnt
from generator_rows gr
join generators g on g.id = gr.generator_id
join lateral(select case when exists(select * from run_generator_rows rgr where rgr.generator_row_id = gr.id) then 0 else 1 end as noRows) has on true
where g.client_id = 5212 and "g"."state" IN ('enabled') AND
(g.single_use = 'f' OR g.single_use IS NULL OR has.norows = 1)
group by g.id
For some reason it doesn't quite work as expected (it returns 0 rows). I think I'm pretty close to the end result but can't get it to work.
I'm running on PostgreSQL 9.6.1.
This appears to be the query, formatted so I can read it:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id
WHERE gr.generator_id IN (SELECT g.id
FROM generators g
WHERE g.client_id = 5212 AND
g.state = 'enabled'
) AND
(g.single_use = 'f' OR
g.single_use IS NULL OR
gr.id NOT IN (SELECT rgr.generator_row_id FROM run_generator_rows rgr)
)
GROUP BY gr.generator_id;
I would be inclined to do most of this work in the FROM clause:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id JOIN
generators gg
ON g.id = gg.id AND
gg.client_id = 5212 AND gg.state = 'enabled' LEFT JOIN
run_generator_rows rgr
ON gr.id = rgr.generator_row_id
WHERE g.single_use = 'f' OR
g.single_use IS NULL OR
rgr.generator_row_id IS NULL
GROUP BY gr.generator_id;
This does make two assumptions that I think are reasonable:
generators.id is unique
run_generator_rows.generator_row_id is unique
(It is easy to avoid these assumptions, but the duplicate elimination is more work.)
Then, some indexes could help:
generators(client_id, state, id)
run_generator_rows(generator_row_id)
generator_rows(generator_id)
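To see that moving the work into the FROM clause keeps the result unchanged under the two uniqueness assumptions, both forms can be compared on a small sample. This is a sketch using Python's sqlite3 module rather than PostgreSQL: the rows are invented, single_use is stored as 0/1 instead of a boolean, and the row-level join is written on generator_rows.id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; single_use is 0/1/NULL here instead of 'f'/'t'/NULL.
conn.executescript("""
    CREATE TABLE generators (id INT PRIMARY KEY, client_id INT, state TEXT, single_use INT);
    CREATE TABLE generator_rows (id INT PRIMARY KEY, generator_id INT);
    CREATE TABLE run_generator_rows (generator_row_id INT UNIQUE);
    INSERT INTO generators VALUES (1, 5212, 'enabled', 0), (2, 5212, 'enabled', 1),
                                  (3, 5212, 'disabled', 0), (4, 9999, 'enabled', NULL);
    INSERT INTO generator_rows VALUES (10, 1), (11, 1), (20, 2), (21, 2), (30, 3), (40, 4);
    INSERT INTO run_generator_rows VALUES (10), (20);
""")

# Subquery form: IN for the generator filter, NOT IN for unused rows.
with_subqueries = conn.execute("""
    SELECT gr.generator_id, COUNT(*) AS cnt
    FROM generators g
    JOIN generator_rows gr ON g.id = gr.generator_id
    WHERE gr.generator_id IN (SELECT id FROM generators
                              WHERE client_id = 5212 AND state = 'enabled')
      AND (g.single_use = 0 OR g.single_use IS NULL
           OR gr.id NOT IN (SELECT generator_row_id FROM run_generator_rows))
    GROUP BY gr.generator_id
    ORDER BY gr.generator_id
""").fetchall()

# Join form: the filters become joins; unused rows are those with no LEFT JOIN match.
with_joins = conn.execute("""
    SELECT gr.generator_id, COUNT(*) AS cnt
    FROM generators g
    JOIN generator_rows gr ON g.id = gr.generator_id
    JOIN generators gg ON g.id = gg.id
                      AND gg.client_id = 5212 AND gg.state = 'enabled'
    LEFT JOIN run_generator_rows rgr ON gr.id = rgr.generator_row_id
    WHERE g.single_use = 0 OR g.single_use IS NULL
       OR rgr.generator_row_id IS NULL
    GROUP BY gr.generator_id
    ORDER BY gr.generator_id
""").fetchall()

print(with_subqueries)
print(with_joins)
```

If run_generator_rows.generator_row_id were not unique, the LEFT JOIN could multiply rows and inflate the counts, which is why the assumption matters.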
Generally avoid inner selects as in
WHERE ... IN (SELECT ...)
as they are usually slow.
As has already been shown for your problem, it's a good idea to think of SQL in terms of set theory.
You do NOT join tables on their identity alone:
Conceptually, SQL takes the set (that is, all rows) of the first table and "multiplies" it with the set of the second table, ending up with n times m rows.
Then the ON clause is used to reduce the result (often strongly) by evaluating its condition for each of those combinations: true means keep, false means drop. This way you can use any arbitrary logic to pick the combinations you want.
Things get trickier with LEFT JOIN and RIGHT JOIN, but you can think of them as always keeping the rows of one side. For each row of that side:
output the combinations of that row IF the logic yields true (at least once), exactly like JOIN does
otherwise output exactly ONE row, with 'the other side' (the right side for LEFT JOIN and vice versa) consisting of NULL for every column.
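The model described above can be made concrete with a tiny example. The following sketch uses Python's sqlite3 module with two invented tables, l and r, to show the n times m cross product, the ON condition acting as a filter on it, and the NULL-padded row that LEFT JOIN adds for an unmatched left row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: 3 rows on the left, 3 rows on the right.
conn.executescript("""
    CREATE TABLE l (id INT, tag TEXT);
    CREATE TABLE r (l_id INT, val TEXT);
    INSERT INTO l VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO r VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# The full "multiplication": every left row paired with every right row.
cross_count = conn.execute("SELECT COUNT(*) FROM l CROSS JOIN r").fetchone()[0]

# An inner join keeps only the combinations for which the ON condition is true...
inner_rows = conn.execute("""
    SELECT l.id, r.val FROM l JOIN r ON l.id = r.l_id ORDER BY l.id, r.val
""").fetchall()

# ...which is the same as filtering the cross product with that condition.
filtered_cross = conn.execute("""
    SELECT l.id, r.val FROM l CROSS JOIN r WHERE l.id = r.l_id ORDER BY l.id, r.val
""").fetchall()

# A left join additionally emits one NULL-padded row for the unmatched left row (id 3).
left_rows = conn.execute("""
    SELECT l.id, r.val FROM l LEFT JOIN r ON l.id = r.l_id ORDER BY l.id, r.val
""").fetchall()

print(cross_count, inner_rows, left_rows)
```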
COUNT(*) is great too, but when things get complicated don't stick to it: use subselects for the keys only, and once all the hard work is done, join the fun stuff to it. Like in
SELECT sub.VALID_CNT, sub.ID, details.*
FROM (
    SELECT SUM(VALID) AS VALID_CNT, ID
    FROM (
        SELECT CASE WHEN X THEN 1 ELSE 0 END AS VALID, ID
        FROM ...
    ) AS flagged
    GROUP BY ID
) AS sub
JOIN ... AS details ON sub.ID = details.ID
Difference is: The inner query is executed only once. The outer query does usually have no indices left to work with and will be slow, but if the inner select here doesn't make the data explode this is usually many times faster than SELECT ... WHERE ... IN (SELECT..) constructs.
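A runnable version of this aggregate-first, join-later pattern might look as follows. It is a sketch with Python's sqlite3 module; the events and details tables, and the condition x = 1 standing in for X, are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: events carries the flag to count, details the descriptive columns.
conn.executescript("""
    CREATE TABLE events (id INT, x INT);
    CREATE TABLE details (id INT PRIMARY KEY, label TEXT);
    INSERT INTO events VALUES (1, 1), (1, 0), (1, 1), (2, 0), (2, 0);
    INSERT INTO details VALUES (1, 'first'), (2, 'second');
""")

rows = conn.execute("""
    SELECT sub.id, sub.valid_cnt, d.label
    FROM (
        -- The aggregation runs once, on the key and the flag only.
        SELECT id, SUM(CASE WHEN x = 1 THEN 1 ELSE 0 END) AS valid_cnt
        FROM events
        GROUP BY id
    ) AS sub
    -- The descriptive columns are joined on afterwards, one row per key.
    JOIN details AS d ON sub.id = d.id
    ORDER BY sub.id
""").fetchall()

print(rows)
```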

SQL select results not appearing if a value is null

I am building a complex select statement, and when one of my values (pcf_auto_key) is null, it will not display any values for that header entry.
select c.company_name, h.prj_number, h.description, s.status_code, h.header_notes, h.cm_udf_001, h.cm_udf_002, h.cm_udf_008, l.classification_code
from project_header h, companies c, project_status s, project_classification l
where exists
(select company_name from companies where h.cmp_auto_key = c.cmp_auto_key)
and exists
(select status_code from project_status s where s.pjs_auto_key = h.pjs_auto_key)
and exists
(select classification_code from project_classification where h.pcf_auto_key = l.pcf_auto_key)
and pjm_auto_key = 11
--and pjt_auto_key = 10
and c.cmp_auto_key = h.cmp_auto_key
and h.pjs_auto_key = s.pjs_auto_key
and l.pcf_auto_key = h.pcf_auto_key
and s.status_type = 'O'
How does my select statement look? Is this an appropriate way of pulling info from other tables?
This is an oracle database, and I am using SQL Developer.
Assuming you want to show all the data that you can find but display the classification as blank when there is no match in that table, you can use a left outer join; which is much clearer with explicit join syntax:
select c.company_name, h.prj_number, h.description, s.status_code, h.header_notes,
h.cm_udf_001, h.cm_udf_002, h.cm_udf_008, l.classification_code
from project_header h
join companies c on c.cmp_auto_key = h.cmp_auto_key
join project_status s on s.pjs_auto_key = h.pjs_auto_key
left join project_classification l on l.pcf_auto_key = h.pcf_auto_key
where pjm_auto_key = 11
and s.status_type = 'O'
I've taken out the exists conditions as they just seem to be replicating the join conditions.
If you might not have matching data in any of the other tables you can make the other inner joins into outer joins in the same way, but be aware that if you outer join to project_status you will need to move the status_type check into the join condition as well, or Oracle will convert that back into an inner join.
Read more about the different kinds of joins.
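The point about moving the status_type check into the join condition can be illustrated on a small sample. This is a sketch using Python's sqlite3 module rather than Oracle, with invented rows in cut-down versions of project_header and project_status; the same WHERE-versus-ON behaviour applies to outer joins in general.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: P2 has a non-'O' status and P3 has no status at all.
conn.executescript("""
    CREATE TABLE project_header (prj_number TEXT, pjs_auto_key INT);
    CREATE TABLE project_status (pjs_auto_key INT PRIMARY KEY, status_code TEXT, status_type TEXT);
    INSERT INTO project_header VALUES ('P1', 1), ('P2', 2), ('P3', NULL);
    INSERT INTO project_status VALUES (1, 'OPEN', 'O'), (2, 'CLOSED', 'C');
""")

# Filter in WHERE: the NULL-padded rows fail the test, so the outer join acts like an inner join.
filter_in_where = conn.execute("""
    SELECT h.prj_number, s.status_code
    FROM project_header h
    LEFT JOIN project_status s ON s.pjs_auto_key = h.pjs_auto_key
    WHERE s.status_type = 'O'
    ORDER BY h.prj_number
""").fetchall()

# Filter in ON: every header row survives; non-matching statuses simply come back as NULL.
filter_in_on = conn.execute("""
    SELECT h.prj_number, s.status_code
    FROM project_header h
    LEFT JOIN project_status s ON s.pjs_auto_key = h.pjs_auto_key
                              AND s.status_type = 'O'
    ORDER BY h.prj_number
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```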

Left Outer Join in SQL Server 2014

We are currently upgrading to SQL Server 2014; I have a join that runs fine in SQL Server 2008 R2 but returns duplicates in SQL Server 2014. The issue appears to be with the predicate AND L2.ACCOUNTING_PERIOD = RG.PERIOD_TO for if I change it to anything but 4, I do not get the duplicates. The query is returning those values in Accounting Period 4 twice. This query gets account balances for all the previous Accounting Periods so in this case it returns values for Accounting Periods 0, 1, 2 and 3 correctly but then duplicates the values from Period 4.
SELECT
A.ACCOUNT,
SUM(A.POSTED_TRAN_AMT),
SUM(A.POSTED_BASE_AMT),
SUM(A.POSTED_TOTAL_AMT)
FROM
PS_LEDGER A
LEFT JOIN PS_GL_ACCOUNT_TBL B
ON B.SETID = 'LTSHR'
LEFT OUTER JOIN PS_LEDGER L2
ON A.BUSINESS_UNIT = L2.BUSINESS_UNIT
AND A.LEDGER = L2.LEDGER
AND A.ACCOUNT = L2.ACCOUNT
AND A.ALTACCT = L2.ALTACCT
AND A.DEPTID = L2.DEPTID
AND A.PROJECT_ID = L2.PROJECT_ID
AND A.DATE_CODE = L2.DATE_CODE
AND A.BOOK_CODE = L2.BOOK_CODE
AND A.GL_ADJUST_TYPE = L2.GL_ADJUST_TYPE
AND A.CURRENCY_CD = L2.CURRENCY_CD
AND A.STATISTICS_CODE = L2.STATISTICS_CODE
AND A.FISCAL_YEAR = L2.FISCAL_YEAR
AND A.ACCOUNTING_PERIOD = L2.ACCOUNTING_PERIOD
AND L2.ACCOUNTING_PERIOD = RG.PERIOD_TO
WHERE
A.BUSINESS_UNIT = 'UK001'
AND A.LEDGER = 'LOCAL'
AND A.FISCAL_YEAR = 2015
AND ( (A.ACCOUNTING_PERIOD BETWEEN 1 and 4
AND B.ACCOUNT_TYPE IN ('E','R') )
OR
(A.ACCOUNTING_PERIOD BETWEEN 0 and 4
AND B.ACCOUNT_TYPE IN ('A','L','Q') ) )
AND A.STATISTICS_CODE = ' '
AND A.ACCOUNT = '21101'
AND A.CURRENCY_CD <> ' '
AND A.CURRENCY_CD = 'GBP'
AND B.SETID='LTSHR'
AND B.ACCOUNT=A.ACCOUNT
AND B.SETID = SETID
AND B.EFFDT=(SELECT MAX(EFFDT) FROM PS_GL_ACCOUNT_TBL WHERE SETID='LTSHR' AND WHERE ACCOUNT=B.ACCOUNT AND EFFDT<='2015-01-31 00:00:00.000')
GROUP BY A.ACCOUNT
ORDER BY A.ACCOUNT
I'm inclined to suspect that you have simplified your original query too much to reflect the real problem, but I'm going to answer the question as posed, in light of the comments on it to this point.
Since your query does not in fact select anything derived from table L2, nor do any other predicates rely on anything from that table, the only thing accomplished by (left) joining it is to duplicate rows of the pre-aggregation results where more than one satisfies the join condition for the same L2 row. That seems unlikely to be what you want, especially with that particular join being a self join, so I don't see any reason not to remove it altogether. Dollars to doughnuts, that solves the duplication problem.
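The duplication effect of such a self join is easy to reproduce. The sketch below uses Python's sqlite3 module and an invented three-row ledger table in which two rows share the same account and period; the join keys are reduced to account and period for brevity, and the literal 4 stands in for RG.PERIOD_TO.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical ledger: two rows fall into period 4 and match each other in the self join.
conn.executescript("""
    CREATE TABLE ledger (account TEXT, period INT, amount INT);
    INSERT INTO ledger VALUES ('21101', 1, 100), ('21101', 4, 50), ('21101', 4, 25);
""")

# Without the self join, each ledger row contributes once.
plain_sum = conn.execute("SELECT SUM(a.amount) FROM ledger a").fetchone()[0]

# With the self join, each period-4 row is paired with both period-4 rows and counted twice.
joined_sum = conn.execute("""
    SELECT SUM(a.amount)
    FROM ledger a
    LEFT JOIN ledger l2 ON a.account = l2.account
                       AND a.period = l2.period
                       AND l2.period = 4
""").fetchone()[0]

print(plain_sum, joined_sum)
```

Only the period that satisfies the extra predicate is inflated, which matches the symptom described in the question.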
I'm also going to suggest removing the correlated subquery in the WHERE clause in favor of joining an inline view, since you already join the base table for the subquery anyway. This particular inline view uses the window function version of MAX() instead of the aggregate function version. Ideally, it would directly select only the rows with the target EFFDT values, but it cannot do so without being rather more complicated, which is exactly what I am trying to avoid. The resulting query therefore filters EFFDT externally, as the original did, but without a correlated subquery.
I furthermore removed a few redundant predicates and rewrote one of the messier ones to a somewhat nicer equivalent. And I reordered the predicates in a way that seems more logical to me.
Additionally, since you are filtering on a specific value of A.ACCOUNT, it is pointless (but not wrong) to GROUP BY or ORDER BY that column. Accordingly, I have removed those clauses to make the query simpler and clearer.
Here's what I came up with:
SELECT
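MAX(A.ACCOUNT) AS ACCOUNT, -- aggregate wrapper: SQL Server rejects a bare column next to SUM() when there is no GROUP BY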
SUM(A.POSTED_TRAN_AMT),
SUM(A.POSTED_BASE_AMT),
SUM(A.POSTED_TOTAL_AMT)
FROM
PS_LEDGER A
INNER JOIN (
SELECT
*,
MAX(EFFDT) OVER (PARTITION BY ACCOUNT) AS MAX_EFFDT
FROM PS_GL_ACCOUNT_TBL
WHERE
EFFDT <= '2015-01-31 00:00:00.000'
AND SETID = 'LTSHR'
) B
ON B.ACCOUNT=A.ACCOUNT
WHERE
A.ACCOUNT = '21101'
AND A.BUSINESS_UNIT = 'UK001'
AND A.LEDGER = 'LOCAL'
AND A.FISCAL_YEAR = 2015
AND A.CURRENCY_CD = 'GBP'
AND A.STATISTICS_CODE = ' '
AND B.EFFDT = B.MAX_EFFDT
AND 1 = CASE
WHEN B.ACCOUNT_TYPE IN ('E','R')
AND A.ACCOUNTING_PERIOD BETWEEN 1 AND 4 THEN 1
WHEN B.ACCOUNT_TYPE IN ('A','L','Q')
AND A.ACCOUNTING_PERIOD BETWEEN 0 AND 4 THEN 1
ELSE 0
END
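The inline view with the window version of MAX() can be tried in isolation. This is a sketch with Python's sqlite3 module (window functions need SQLite 3.25 or newer) on an invented, cut-down account table; it compares the windowed form with the correlated-subquery form it replaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical effective-dated rows; the 2015-06-01 row lies after the cut-off date.
conn.executescript("""
    CREATE TABLE account_tbl (setid TEXT, account TEXT, effdt TEXT, account_type TEXT);
    INSERT INTO account_tbl VALUES
        ('LTSHR', '21101', '2014-01-01', 'A'),
        ('LTSHR', '21101', '2015-01-01', 'L'),
        ('LTSHR', '21101', '2015-06-01', 'E'),
        ('LTSHR', '30000', '2013-05-01', 'R');
""")

# Window form: compute the latest EFFDT per account once, then filter on it outside.
windowed = conn.execute("""
    SELECT account, effdt, account_type
    FROM (
        SELECT *, MAX(effdt) OVER (PARTITION BY account) AS max_effdt
        FROM account_tbl
        WHERE setid = 'LTSHR' AND effdt <= '2015-01-31'
    ) AS b
    WHERE effdt = max_effdt
    ORDER BY account
""").fetchall()

# Correlated form: the subquery is evaluated against each candidate row.
correlated = conn.execute("""
    SELECT b.account, b.effdt, b.account_type
    FROM account_tbl b
    WHERE b.setid = 'LTSHR'
      AND b.effdt = (SELECT MAX(effdt) FROM account_tbl
                     WHERE setid = 'LTSHR' AND account = b.account
                       AND effdt <= '2015-01-31')
    ORDER BY b.account
""").fetchall()

print(windowed)
print(correlated)
```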

Strange performance issue with SELECT (SUBQUERY)

I have a stored procedure that has been having some issues lately and I finally narrowed it down to 1 SELECT. The problem is I cannot figure out exactly what is happening to kill the performance of this one query. I re-wrote it, but I am not sure the re-write is the exact same data.
Original Query:
SELECT
@userId, p.job, p.charge_code, p.code
, (SELECT SUM(b.total) FROM dbo.[backorder w/total] b WHERE b.ponumber = p.ponumber AND b.code = p.code)
, ISNULL(jm.markup, 0)
, (SELECT SUM(b.TOTAL_TAX) FROM dbo.[backorder w/total] b WHERE b.ponumber = p.ponumber AND b.code = p.code)
, p.ponumber
, p.billable
, p.[date]
FROM dbo.PO p
INNER JOIN dbo.JobCostFilter jcf
ON p.job = jcf.jobno AND p.charge_code = jcf.chargecode AND jcf.userno = @userId
LEFT JOIN dbo.JobMarkup jm
ON jm.jobno = p.job
AND jm.code = p.code
LEFT JOIN dbo.[Working Codes] wc
ON p.code = wc.code
INNER JOIN dbo.JOBFILE j
ON j.JOB_NO = p.job
WHERE (wc.brcode <> 4 OR @BmtDb = 0)
GROUP BY p.job, p.charge_code, p.code, p.ponumber, p.billable, p.[date], jm.markup, wc.brcode
This query will practically never finish running. It actually times out for some larger jobs we have.
And if I change the 2 subqueries in the select to read like joins instead:
SELECT
@userid, p.job, p.charge_code, p.code
, (SELECT SUM(b.TOTAL))
, ISNULL(jm.markup, 0)
, (SELECT SUM(b.TOTAL_TAX))
, p.ponumber, p.billable, p.[date]
FROM dbo.PO p
INNER JOIN dbo.JobCostFilter jcf
ON p.job = jcf.jobno AND p.charge_code = jcf.chargecode AND jcf.userno = 11190030
INNER JOIN [BACKORDER W/TOTAL] b
ON P.PONUMBER = b.ponumber AND P.code = b.code
LEFT JOIN dbo.JobMarkup jm
ON jm.jobno = p.job
AND jm.code = p.code
LEFT JOIN dbo.[Working Codes] wc
ON p.code = wc.code
INNER JOIN dbo.JOBFILE j
ON j.JOB_NO = p.job
WHERE (wc.brcode <> 4 OR @BmtDb = 0)
GROUP BY p.job, p.charge_code, p.code, p.ponumber, p.billable, p.[date], jm.markup, wc.brcode
The data comes out looking very nearly identical to me (though there are thousands of lines overall so I could be wrong), and it runs very quickly.
Any ideas appreciated..
Performance
In the second query you have fewer logical reads because the table [BACKORDER W/TOTAL] is scanned only once. In the first query the two subqueries are processed independently, so the table is scanned twice even though both subqueries have the same predicates.
Correctness
If you want to check if two queries return the same resultset you can use the EXCEPT operator:
If both statements:
First SELECT Query...
EXCEPT
Second SELECT Query...
and
Second SELECT Query...
EXCEPT
First SELECT Query...
return an empty set the resultsets are identical.
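The two-way EXCEPT check can be wrapped in a small helper. The sketch below uses Python's sqlite3 module; the helper name same_result_set and the tables t1 to t4 are invented for the illustration. Note that EXCEPT compares distinct rows, so a difference that consists only of duplicated rows goes unnoticed, as the last call shows.

```python
import sqlite3

def same_result_set(conn, first_query, second_query):
    """Return True if neither query yields a row that the other one lacks."""
    only_in_first = conn.execute(
        f"SELECT * FROM ({first_query}) EXCEPT SELECT * FROM ({second_query})"
    ).fetchall()
    only_in_second = conn.execute(
        f"SELECT * FROM ({second_query}) EXCEPT SELECT * FROM ({first_query})"
    ).fetchall()
    return not only_in_first and not only_in_second

conn = sqlite3.connect(":memory:")
# Hypothetical result sets: t2 equals t1, t3 lacks a row, t4 repeats a row.
conn.executescript("""
    CREATE TABLE t1 (id INT, v TEXT);
    CREATE TABLE t2 (id INT, v TEXT);
    CREATE TABLE t3 (id INT, v TEXT);
    CREATE TABLE t4 (id INT, v TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t2 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t3 VALUES (1, 'a');
    INSERT INTO t4 VALUES (1, 'a'), (1, 'a'), (2, 'b');
""")

identical = same_result_set(conn, "SELECT * FROM t1", "SELECT * FROM t2")
missing_row = same_result_set(conn, "SELECT * FROM t1", "SELECT * FROM t3")
# EXCEPT works on distinct rows, so the duplicated row in t4 is not detected.
duplicate_row = same_result_set(conn, "SELECT * FROM t1", "SELECT * FROM t4")

print(identical, missing_row, duplicate_row)
```

Comparing row counts in addition to the EXCEPT check is one way to catch the duplicate case.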
In terms of correctness, you are inner joining [BACKORDER W/TOTAL] in the second query, so if the first query has Null values in the subqueries, these rows would be missing in the second query.
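The difference is visible on a two-row sample. This is a sketch with Python's sqlite3 module and invented, cut-down po and backorder tables: the correlated subquery keeps a purchase order without backorders and returns NULL for it, the inner join drops it, and a LEFT JOIN would keep it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: purchase order 2 has no backorder rows.
conn.executescript("""
    CREATE TABLE po (ponumber INT, code TEXT);
    CREATE TABLE backorder (ponumber INT, code TEXT, total INT);
    INSERT INTO po VALUES (1, 'A'), (2, 'B');
    INSERT INTO backorder VALUES (1, 'A', 10), (1, 'A', 5);
""")

# Correlated subquery in the SELECT list: every po row is kept.
with_subquery = conn.execute("""
    SELECT p.ponumber,
           (SELECT SUM(b.total) FROM backorder b
            WHERE b.ponumber = p.ponumber AND b.code = p.code)
    FROM po p
    ORDER BY p.ponumber
""").fetchall()

# Inner join: po rows without a backorder disappear.
with_inner_join = conn.execute("""
    SELECT p.ponumber, SUM(b.total)
    FROM po p
    INNER JOIN backorder b ON b.ponumber = p.ponumber AND b.code = p.code
    GROUP BY p.ponumber
    ORDER BY p.ponumber
""").fetchall()

# Left join: matches the subquery version on this sample.
with_left_join = conn.execute("""
    SELECT p.ponumber, SUM(b.total)
    FROM po p
    LEFT JOIN backorder b ON b.ponumber = p.ponumber AND b.code = p.code
    GROUP BY p.ponumber
    ORDER BY p.ponumber
""").fetchall()

print(with_subquery, with_inner_join, with_left_join)
```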
For performance, the optimizer relies on heuristics: it will sometimes pick spectacularly bad query plans, and even minimal changes to a query can lead to a completely different plan. Your best chance is to compare the query plans and see what causes the difference.