Too many columns in GROUP BY - sql

I'm trying to aggregate some data, but I have a problem. Here's my query (using 3 tables):
SELECT
ufc.counter_id,
gcrvf.goal_id,
gcrvf.date_of_visit,
ufc.utm_campaign,
ufc.utm_source,
ufc.utm_medium,
ufc.utm_content,
ufc.utm_term,
ufc.original_join_id,
max(gcrvf.last_update_time) AS last_update_time,
sum(gcrvf.conversions) AS conversions,
c.name AS counter_name,
c.owner_login AS owner_login,
c.status AS counter_status,
concat(g.goal_source,CAST('Goal','text')) AS metric_type,
multiIf(g.is_retargeting = 0,'non-retargeting',g.is_retargeting = 1,'retargeting',NULL) AS metric_key,
concat(g.name,' (',CAST(gcrvf.goal_id,'String'),')') AS metric_name
FROM connectors_yandex_metrika.goal_conversions_report_v_final AS gcrvf
INNER JOIN connectors_yandex_metrika.utm_for_collect AS ufc ON gcrvf.counter_id = ufc.counter_id
LEFT JOIN connectors_yandex_metrika.counter AS c ON gcrvf.counter_id = c.id
LEFT JOIN connectors_yandex_metrika.goal AS g ON gcrvf.goal_id = g.id
WHERE
((gcrvf.utm_campaign = ufc.utm_campaign) OR (ufc.utm_campaign IS NULL))
AND ((gcrvf.utm_source = ufc.utm_source) OR (ufc.utm_source IS NULL))
AND ((gcrvf.utm_medium = ufc.utm_medium) OR (ufc.utm_medium IS NULL))
AND ((gcrvf.utm_content = ufc.utm_content) OR (ufc.utm_content IS NULL))
AND ((gcrvf.utm_term = ufc.utm_term ) OR (ufc.utm_term IS NULL))
GROUP BY
ufc.counter_id,
gcrvf.date_of_visit,
gcrvf.goal_id,
ufc.utm_campaign,
ufc.utm_source,
ufc.utm_medium,
ufc.utm_content,
ufc.utm_term,
ufc.original_join_id,
c.name,
c.owner_login,
c.status,
metric_type,
metric_key,
metric_name
I have to GROUP BY almost all of the columns. Is that a real problem?
The columns ufc.original_join_id, c.name, c.owner_login, c.status, metric_type, metric_key, and metric_name are not needed for the aggregation itself; I added them to GROUP BY only because I need them in the output. So I want to ask: is there any way to make this shorter? Any way to keep the unnecessary columns out of GROUP BY? Or is it okay as it is?
And my second question: does ClickHouse cache the right-hand table when we use JOINs? So should I always put the huge table as the left table?

All non-aggregated columns that appear in the SELECT list are required in the GROUP BY. It is not possible to leave out columns that were mentioned as select columns.
Depending on your indexed columns you can improve the speed of the query; try to put an index on the key columns.
The database will handle the caching logic for you, depending on how often you execute the query.
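To illustrate the rule with a stripped-down sketch (the table and column names below are placeholders, not your real schema):
-- goal_conversions is a made-up table used only to show the rule:
-- every selected column that is not wrapped in an aggregate function
-- must be listed in GROUP BY; aggregated columns stay out of it.
SELECT
    counter_id,                                  -- plain column -> must appear in GROUP BY
    max(last_update_time) AS last_update_time,   -- aggregated   -> stays out of GROUP BY
    sum(conversions) AS conversions              -- aggregated   -> stays out of GROUP BY
FROM goal_conversions
GROUP BY counter_id;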

Related

Need help in optimizing sql query

I am new to SQL and have created the query below to fetch the required results. However, it seems to take ages to run and is quite slow. Any help with optimization would be great.
Below is the SQL query I am using:
SELECT
Date_trunc('week',a.pair_date) as pair_week,
a.used_code,
a.used_name,
b.line,
b.channel,
count(
case when b.sku = c.sku then used_code else null end
)
from
a
left join b on a.ma_number = b.ma_number
and (a.imei = b.set_id or a.imei = b.repair_imei
)
left join c on a.used_code = c.code
group by 1,2,3,4,5
I would rewrite the query as:
select Date_trunc('week',a.pair_date) as pair_week,
a.used_code, a.used_name, b.line, b.channel,
count(*) filter (where b.sku = c.sku)
from a left join
b
on a.ma_number = b.ma_number and
a.imei in ( b.set_id, b.repair_imei ) left join
c
on a.used_code = c.code
group by 1,2,3,4,5;
For this query, you want indexes on b(ma_number, set_id, repair_imei) and c(code, sku). However, this doesn't leave much scope for optimization.
There might be some other possibilities, depending on the tables. For instance, or/in in the on clause is usually a bad sign -- but it is unclear what your intention really is.
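If those indexes don't already exist, creating them could look like this (a sketch only; the index names are made up, and a, b, c stand for your real table names):
-- index to support the join on ma_number / set_id / repair_imei
CREATE INDEX idx_b_ma_set_repair ON b (ma_number, set_id, repair_imei);
-- index to support the join on code and the sku comparison
CREATE INDEX idx_c_code_sku ON c (code, sku);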

SQL not efficient enough, tuning assistance required

We have some SQL that is OK on smaller data volumes but poor once we scale up to selecting from larger volumes. Is there a faster alternative style to achieve the same output as below? The idea is to pull back a single unique row to get the latest version of the data. The SQL does reference another view, but that view runs very fast, so we expect the issue is in the query below and want to try a different approach.
SELECT *
FROM
(SELECT (select CustomerId from PremiseProviderVersionsToday
where PremiseProviderId = b.PremiseProviderId) as CustomerId,
c.D3001_MeterId, b.CoreSPID, a.EnteredBy,
ROW_NUMBER() OVER (PARTITION BY b.PremiseProviderId
ORDER BY a.effectiveDate DESC) AS rowNumber
FROM PremiseMeterProviderVersions a, PremiseProviders b,
PremiseMeterProviders c
WHERE (a.TransactionDateTimeEnd IS NULL
AND a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND b.PremiseProviderId = c.PremiseProviderId)
) data
WHERE data.rowNumber = 1
As Bilal Ayub stated above, the correlated subquery can result in performance issues. See here for more detail. Below are my suggestions:
Change all joins to explicit joins (ANSI standard)
Use aliases that are more descriptive than single characters (this is mostly to help readers understand what each table does)
Convert the data subquery to a temp table or CTE (temp tables and CTEs usually perform better than subqueries)
Note: normally, you should explicitly create and insert into your temp table but I chose not to do that here as I do not know the data types of your columns.
SELECT d.CustomerId
, c.D3001_MeterId
, b.CoreSPID
, a.EnteredBy
, rowNumber = ROW_NUMBER() OVER(PARTITION BY b.PremiseProviderId ORDER BY a.effectiveDate DESC)
INTO #tmp_RowNum
FROM PremiseMeterProviderVersions a
JOIN PremiseMeterProviders c ON c.PremiseMeterProviderId = a.PremiseMeterProviderId
JOIN PremiseProviders b ON b.PremiseProviderId = c.PremiseProviderId
JOIN PremiseProviderVersionsToday d ON d.PremiseProviderId = b.PremiseProviderId
WHERE a.TransactionDateTimeEnd IS NULL
SELECT *
FROM #tmp_RowNum
WHERE rowNumber = 1
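If you prefer the CTE variant mentioned above, the same query might be sketched like this (untested; same table names and aliases as before):
WITH RowNum AS (
    SELECT d.CustomerId
         , c.D3001_MeterId
         , b.CoreSPID
         , a.EnteredBy
         , ROW_NUMBER() OVER (PARTITION BY b.PremiseProviderId ORDER BY a.effectiveDate DESC) AS rowNumber
    FROM PremiseMeterProviderVersions a
    JOIN PremiseMeterProviders c ON c.PremiseMeterProviderId = a.PremiseMeterProviderId
    JOIN PremiseProviders b ON b.PremiseProviderId = c.PremiseProviderId
    JOIN PremiseProviderVersionsToday d ON d.PremiseProviderId = b.PremiseProviderId
    WHERE a.TransactionDateTimeEnd IS NULL
)
SELECT *
FROM RowNum
WHERE rowNumber = 1;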
You are running a correlated subquery that executes in a loop; if the table is small that is fast enough, but I would suggest changing it and joining the table in order to get the CustomerId:
(select CustomerId from PremiseProviderVersionsToday where PremiseProviderId = b.PremiseProviderId) as CustomerId
Consider derived tables: an aggregate query that calculates the maximum EffectiveDate per PremiseProviderId, and a unit-level query, each using explicit joins (the current ANSI SQL standard) rather than the implicit joins you currently use:
SELECT data.*
FROM
(SELECT t.CustomerId, c.D3001_MeterId, b.CoreSPID, a.EnteredBy,
b.PremiseProviderId, a.EffectiveDate
FROM PremiseMeterProviders c
INNER JOIN PremiseMeterProviderVersions a
ON a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND a.TransactionDateTimeEnd IS NULL
INNER JOIN PremiseProviders b
ON b.PremiseProviderId = c.PremiseProviderId
INNER JOIN PremiseProviderVersionsToday t
ON t.PremiseProviderId = b.PremiseProviderId
) data
INNER JOIN
(SELECT b.PremiseProviderId, MAX(a.EffectiveDate) As MaxEffDate
FROM PremiseMeterProviders c
INNER JOIN PremiseMeterProviderVersions a
ON a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND a.TransactionDateTimeEnd IS NULL
INNER JOIN PremiseProviders b
ON b.PremiseProviderId = c.PremiseProviderId
GROUP BY b.PremiseProviderId
) agg
ON data.PremiseProviderId = agg.PremiseProviderId
AND data.EffectiveDate = agg.MaxEffDate

Refactoring slow SQL query

I currently have this very very slow query:
SELECT generators.id AS generator_id, COUNT(*) AS cnt
FROM generator_rows
JOIN generators ON generators.id = generator_rows.generator_id
WHERE
generators.id IN (SELECT "generators"."id" FROM "generators" WHERE "generators"."client_id" = 5212 AND ("generators"."state" IN ('enabled'))) AND
(
generators.single_use = 'f' OR generators.single_use IS NULL OR
generator_rows.id NOT IN (SELECT run_generator_rows.generator_row_id FROM run_generator_rows)
)
GROUP BY generators.id;
And I'm trying to refactor/improve it with this query:
SELECT g.id AS generator_id, COUNT(*) AS cnt
from generator_rows gr
join generators g on g.id = gr.generator_id
join lateral(select case when exists(select * from run_generator_rows rgr where rgr.generator_row_id = gr.id) then 0 else 1 end as noRows) has on true
where g.client_id = 5212 and "g"."state" IN ('enabled') AND
(g.single_use = 'f' OR g.single_use IS NULL OR has.norows = 1)
group by g.id
For some reason it doesn't quite work as expected (it returns 0 rows). I think I'm pretty close to the end result but can't get it to work.
I'm running on PostgreSQL 9.6.1.
This appears to be the query, formatted so I can read it:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id
WHERE gr.generator_id IN (SELECT g.id
FROM generators g
WHERE g.client_id = 5212 AND
g.state = 'enabled'
) AND
(g.single_use = 'f' OR
g.single_use IS NULL OR
gr.id NOT IN (SELECT rgr.generator_row_id FROM run_generator_rows rgr)
)
GROUP BY gr.generator_id;
I would be inclined to do most of this work in the FROM clause:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id JOIN
generators gg
on g.id = gg.id AND
gg.client_id = 5212 AND gg.state = 'enabled' LEFT JOIN
run_generator_rows rgr
ON gr.id = rgr.generator_row_id
WHERE g.single_use = 'f' OR
g.single_use IS NULL OR
rgr.generator_row_id IS NULL
GROUP BY gr.generator_id;
This does make two assumptions that I think are reasonable:
generators.id is unique
run_generator_rows.generator_row_id is unique
(It is easy to avoid these assumptions, but the duplicate elimination is more work.)
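If you would rather not rely on those assumptions, a NOT EXISTS anti-join avoids the duplicate problem for run_generator_rows because it never multiplies rows. A sketch built from the original query (untested against your data):
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g
JOIN generator_rows gr ON gr.generator_id = g.id
WHERE g.client_id = 5212
  AND g.state = 'enabled'
  AND (g.single_use = 'f'
       OR g.single_use IS NULL
       OR NOT EXISTS (SELECT 1
                      FROM run_generator_rows rgr
                      WHERE rgr.generator_row_id = gr.id))
GROUP BY gr.generator_id;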
Then, some indexes could help:
generators(client_id, state, id)
run_generator_rows(generator_row_id)
generator_rows(generator_id)
Generally avoid inner selects as in
WHERE ... IN (SELECT ...)
as they are usually slow.
As was already shown for your problem, it's a good idea to think of SQL in terms of set theory.
You do NOT join tables by looking up rows one by one on their identity:
In fact SQL takes the set (that is, all rows) of the first table and "multiplies" it with the set of the second table, ending up with n times m combinations.
Then the ON clause is used to (often strongly) reduce that result by evaluating each of those combinations to either true (keep) or false (drop). This way you can choose any arbitrary logic for which combinations to keep.
Things get trickier with LEFT JOIN and RIGHT JOIN, but one can think of them as taking one side for granted; for every row of that side, either
output the combinations of that row IF the ON logic yields true (at least once), exactly as a plain JOIN does, or
output exactly ONE row, with 'the other side' (the right side for a LEFT JOIN and vice versa) consisting of NULL in every column.
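A tiny sketch with two hypothetical tables, customers and orders, makes the difference concrete:
-- customers and orders are made-up tables, used only to illustrate the two behaviours.
-- INNER JOIN: keeps only the customer/order combinations for which the ON clause is true.
SELECT c.name, o.id
FROM customers c
JOIN orders o ON o.customer_id = c.id;

-- LEFT JOIN: additionally outputs exactly one row for each customer with no matching
-- order, with every column from the right side (orders) set to NULL.
SELECT c.name, o.id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id;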
COUNT(*) is great too, but if things get complicated, don't stick to it: use sub-selects for the keys only, and once all the hard work is done, join the fun stuff to it. Like in
SELECT sub.ID, sub.VALID, details.*
FROM (
    SELECT ID, SUM(CASE WHEN X THEN 1 ELSE 0 END) AS VALID
    FROM ...
    GROUP BY ID
) AS sub
JOIN ... AS details ON sub.ID = details.ID
The difference is that the inner query is executed only once. The outer query usually has no indexes left to work with and will be slow, but as long as the inner select doesn't make the data explode, this is usually many times faster than SELECT ... WHERE ... IN (SELECT ...) constructs.

SQL Using join but without on Condition

I am looking for help updating this SQL statement, as I want to join the two tables but without using " ON AD.[UID] = UI.[UID] ".
SELECT AD.[AdsID] ,AD.[UID] ,AD.[Section] ,AD.[Category] ,AD.[Country] ,AD.[State] ,AD.[City] ,SUBSTRING([AdsTit],1,30)+'...' as AdsTit ,SUBSTRING([AdsDesc],1,85) as AdsDesc ,AD.[AdsPrice] ,AD.[Img1] ,AD.[Currency] ,AD.[Section] ,AD.[Currency] ,AD.[AdsDate] ,AD.[approvAds] ,UI.[approv]
FROM [ads] as AD JOIN UserInfo as UI ON AD.[UID] = UI.[UID]
where AD.[Country] = #Location AND AD.[approvAds]= 'Y' AND UI.[approv]='Y'
ORDER BY [AdsDate] DESC
You can move the condition from ON to WHERE; it works like an inner join:
SELECT AD.[AdsID] ,AD.[UID] ,AD.[Section] ,AD.[Category] ,AD.[Country] ,AD.[State] ,AD.[City] ,SUBSTRING([AdsTit],1,30)+'...' as AdsTit ,SUBSTRING([AdsDesc],1,85) as AdsDesc ,AD.[AdsPrice] ,AD.[Img1] ,AD.[Currency] ,AD.[Section] ,AD.[Currency] ,AD.[AdsDate] ,AD.[approvAds] ,UI.[approv]
FROM [ads] as AD, UserInfo as UI
where AD.[UID] = UI.[UID]
and AD.[Country] = #Location AND AD.[approvAds]= 'Y' AND UI.[approv]='Y'
ORDER BY [AdsDate] DESC
You can use the USING keyword instead of ON in your SQL query, or use a natural join, which compares all the common columns of the two tables by itself, without an ON condition.
Query for your tables like this:
select AD.column1, AD.column2, UI.column1, UI.column2........
from ads AD join userinfo UI using(UID)
where AD.country='location' and AD.ApprovAds='y' and UI.approv='y'
order by AdsDate desc
Or
select AD.column1, AD.column2, UI.column1, UI.column2........
from ads AD natural join userinfo UI
where AD.country='location' and AD.ApprovAds='y' and UI.approv='y'
order by AdsDate desc
You could use a cross join, though I don't recommend doing this, as performance is better on a join with a conditional ON statement.
That being said here is how you do it:
SELECT A.*, B.*
FROM A CROSS JOIN B
WHERE A.Id = B.AId
Just put the tables you need in place of A and B.

Remove duplicate SQL Join results

Okay, I am totally new to SQL, so bear with me. I created a statement which gives me the results I wanted, but I want to get rid of duplicate results. What's an easy solution to this? Here is my statement:
SELECT
li.location,
li.logistics_unit,
li.item,
li.company,
li.item_desc,
li.on_hand_qty,
li.in_transit_qty,
li.allocated_qty,
li.lot,
i.item_category3,
l.locating_zone,
l.location_subclass,
i.item_category4
FROM
location_inventory li
INNER JOIN item i ON li.item = i.item
INNER JOIN location l ON l.location = li.location
WHERE
i.item_category3 = 'AS' AND
li.warehouse = 'river' AND
li.location NOT LIKE 'd%' AND
li.location NOT LIKE 'stg%'
ORDER BY
li.item asc
If you are confident in your JOIN then DISTINCT should do the trick like so:
select DISTINCT
location_inventory.location ,
location_inventory.logistics_unit ,
location_inventory.item ,
location_inventory.company ,
location_inventory.item_desc ,
location_inventory.on_hand_qty ,
location_inventory.in_transit_qty ,
location_inventory.allocated_qty ,
location_inventory.lot ,
item.item_category3 ,
location.locating_zone ,
location.location_subclass ,
item.item_category4
from location_inventory
INNER JOIN item
on location_inventory.item=item.item
INNER JOIN location
on location_inventory.location=location.location
where item.item_category3 = 'AS' and
location_inventory.warehouse = 'river' and
location_inventory.location not like 'd%' and
location_inventory.location not like 'stg%'
order by location_inventory.item asc
Assuming that i.item and l.location are primary or unique keys, any duplicates you see are being caused by duplicate items in your location_inventory table. That might or might not be what you want.
SELECT DISTINCT will only eliminate true duplicates (i.e. those where all of the selected columns are duplicate). If that's what you want, use it. Otherwise, you might need to make an inner select that uses SELECT DISTINCT to identify the columns in which you don't want duplicates, and join the results of the inner select back to the tables to pull out all the other data.
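A rough sketch of that pattern, reusing the tables from the question (which columns go into the inner DISTINCT is only an example; adjust it to whatever defines a duplicate for you):
-- Inner select: only the columns that must be unique in the final result.
-- Outer query: join back to item and location to pull out the other data.
SELECT dedup.location,
       dedup.item,
       dedup.lot,
       i.item_category3,
       i.item_category4,
       l.locating_zone,
       l.location_subclass
FROM (
    SELECT DISTINCT location, item, lot
    FROM location_inventory
    WHERE warehouse = 'river'
      AND location NOT LIKE 'd%'
      AND location NOT LIKE 'stg%'
) AS dedup
INNER JOIN item i ON i.item = dedup.item
INNER JOIN location l ON l.location = dedup.location
WHERE i.item_category3 = 'AS'
ORDER BY dedup.item ASC;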