Query Efficiency in Redshift - sql

I have a question about query efficiency in Redshift. I have two sample queries, shown below.
Query A
select a.*, b.*
from a
left outer join b
on a.id=b.id
where a.market_id = 1
and a.dataset_date = to_date('{RUN_DATE_YYYY-MM-DD}', 'YYYY-MM-DD')
and b.market_id = 1
and b.user_group in ('X');
Query B
select a.*, b.*
from (
select *
from a
where market_id = 1
and dataset_date = to_date('{RUN_DATE_YYYY-MM-DD}', 'YYYY-MM-DD')
)a
left outer join
(select *
from b
where market_id = 1
and user_group in ('X')
)b
on a.id=b.id
where b.market_id = 1
and b.user_group in ('X');
I used to think that Query B would be more efficient because the datasets being joined are filtered first and are therefore smaller. But recently I found that Redshift automatically optimizes each query before running it. In that case, the efficiency of Query A and Query B should be very close, and Query A is easier to understand and maintain.
I'm not sure about Redshift's optimization, so I am posting this question here. I'd really appreciate any knowledge shared!
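Independent of how the optimizer behaves, it may help to check what the two query shapes return. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than Redshift, the sample rows and the fixed date are invented, and Query B uses market_id on both tables. It suggests that both shapes return the same rows, and that the WHERE filters on b discard the unmatched rows, so the LEFT OUTER JOIN behaves like an inner join. Moving the b filters into the ON clause is what keeps unmatched rows from a.

```python
import sqlite3

# In-memory SQLite database as a stand-in; this does NOT model Redshift's planner.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, market_id INTEGER, dataset_date TEXT);
    CREATE TABLE b (id INTEGER, market_id INTEGER, user_group TEXT);
    -- hypothetical sample rows
    INSERT INTO a VALUES (1, 1, '2024-01-01'), (2, 1, '2024-01-01'), (3, 2, '2024-01-01');
    INSERT INTO b VALUES (1, 1, 'X'), (2, 1, 'Y'), (3, 2, 'X');
""")

# Shape of Query A: filters in the outer WHERE clause.
QUERY_A = """
    SELECT a.id, b.user_group
    FROM a
    LEFT OUTER JOIN b ON a.id = b.id
    WHERE a.market_id = 1
      AND a.dataset_date = '2024-01-01'
      AND b.market_id = 1
      AND b.user_group IN ('X')
    ORDER BY a.id
"""

# Shape of Query B: pre-filtered subqueries plus the same outer WHERE on b.
QUERY_B = """
    SELECT a.id, b.user_group
    FROM (SELECT * FROM a WHERE market_id = 1 AND dataset_date = '2024-01-01') a
    LEFT OUTER JOIN (SELECT * FROM b WHERE market_id = 1 AND user_group IN ('X')) b
      ON a.id = b.id
    WHERE b.market_id = 1
      AND b.user_group IN ('X')
    ORDER BY a.id
"""

# Variant with the b filters in the ON clause: unmatched rows of a survive.
QUERY_KEEP_UNMATCHED = """
    SELECT a.id, b.user_group
    FROM a
    LEFT OUTER JOIN b
      ON a.id = b.id AND b.market_id = 1 AND b.user_group IN ('X')
    WHERE a.market_id = 1
      AND a.dataset_date = '2024-01-01'
    ORDER BY a.id
"""

rows_a = conn.execute(QUERY_A).fetchall()
rows_b = conn.execute(QUERY_B).fetchall()
rows_keep = conn.execute(QUERY_KEEP_UNMATCHED).fetchall()
print(rows_a, rows_b, rows_keep)
```

On Redshift itself, running EXPLAIN on both queries would be the way to see whether the plans actually match.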

Related

Better way to do a correlated query with a count condition in AWS Athena SQL

There are two tables, A and B. Table A has a one-to-many relationship with B.
I want to fetch records from A and a single corresponding record from B (if B has only one matching record).
If there are multiple matching records in table B, the one with status = 'ACTIVE' should be picked first.
Below is the query, which runs in Oracle, but we want the same functionality in AWS Athena; however, correlated queries are not supported in AWS Athena SQL. Athena supports ANSI SQL.
SELECT b.*
FROM A a ,B b
WHERE a.instruction_id = b.txn_report_instruction_id AND b.txn_report_instruction_id IN
(SELECT b2.txn_report_instruction_id FROM B b2
WHERE b2.txn_report_instruction_id=b.txn_report_instruction_id
GROUP BY b2.txn_report_instruction_id
HAVING COUNT(b2.txn_report_instruction_id)=1
)
UNION
SELECT * FROM
(SELECT b.*
FROM A a , B b
WHERE a.instruction_id = b.txn_report_instruction_id AND b.txn_report_instruction_id IN
(SELECT b2.txn_report_instruction_id
FROM B b2
WHERE b2.txn_report_instruction_id=b.txn_report_instruction_id
AND b2.status ='ACTIVE'
GROUP BY b2.txn_report_instruction_id
HAVING COUNT(b2.txn_report_instruction_id)> 1
)
)
With GROUP BY, every selected field must appear either in the GROUP BY clause or in an aggregate function, so GROUP BY is not preferable.
Any help would be much appreciated.
Joining the best row can be achieved with a lateral join.
select *
from a
outer apply
(
select *
from b
where b.txn_report_instruction_id = a.instruction_id
order by case when b.status = 'ACTIVE' then 1 else 2 end
fetch first row only
) bb;
Another option is a window function:
select *
from a
left join
(
select
b.*,
row_number() over (partition by txn_report_instruction_id
order by case when status = 'ACTIVE' then 1 else 2 end) as rn
from b
) bb on bb.txn_report_instruction_id = a.instruction_id and bb.rn = 1;
I don't know about Amazon Athena's SQL coverage. This is all standard SQL, however, except for OUTER APPLY, I think. If I am not mistaken, the SQL standard requires LEFT OUTER JOIN LATERAL (...) ON ... instead, for which you need a dummy ON clause, such as ON 1 = 1. So if the above queries fail, there is another option for you :-)
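To see what the window-function variant picks, here is a minimal sketch run through Python's built-in sqlite3 module. SQLite is only a stand-in for Athena here and needs version 3.25 or later for window functions; the sample rows and the note column are made up for illustration.

```python
import sqlite3

# SQLite (3.25+) stands in for Athena; the table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (instruction_id INTEGER);
    CREATE TABLE b (txn_report_instruction_id INTEGER, status TEXT, note TEXT);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES
        (1, 'CLOSED', 'only row for 1'),
        (2, 'CLOSED', 'older row for 2'),
        (2, 'ACTIVE', 'active row for 2');
""")

# Rank the rows of b per instruction, ACTIVE first, and join only rank 1.
BEST_ROW_SQL = """
    SELECT a.instruction_id, bb.status, bb.note
    FROM a
    LEFT JOIN (
        SELECT b.*,
               ROW_NUMBER() OVER (
                   PARTITION BY txn_report_instruction_id
                   ORDER BY CASE WHEN status = 'ACTIVE' THEN 1 ELSE 2 END
               ) AS rn
        FROM b
    ) bb ON bb.txn_report_instruction_id = a.instruction_id AND bb.rn = 1
    ORDER BY a.instruction_id
"""

best_rows = conn.execute(BEST_ROW_SQL).fetchall()
for row in best_rows:
    print(row)
```

Instruction 1 keeps its single row, instruction 2 gets the ACTIVE row, and instruction 3 (no rows in b) comes back with NULLs because of the LEFT JOIN.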

Select in where condition is slow

It appears to me that using a select statement (that is very fast on its own) inside an IN condition is slower than writing the values explicitly as string literals. Here is a dummy example:
The fast version:
select *
FROM db.dbo.A
left join db.dbo.B on A.id = B.id
where A.selected_variable in ('XXX','YYY')
The slow version:
select
selected_variable
into #t_temp
from db.dbo.some_table
where some_condition = 'X'
select *
FROM db.dbo.A
left join db.dbo.B on A.id = B.id
where A.selected_variable in (select selected_variable from #t_temp) -- returns ('XXX','YYY')
Does anyone know why, and what would be the best practice to avoid the performance drop?
From a performance perspective, I think EXISTS is better than the IN clause, since an IN clause is internally converted to OR conditions (resulting in more conditions).
Try this script and compare the query execution plan and execution time:
select *
FROM db.dbo.A
left join db.dbo.B on A.id = B.id
where EXISTS (select 1
from db.dbo.some_table t
WHERE A.selected_variable = t.selected_variable AND some_condition = 'X')
You can try the version below, which avoids the additional TempDB I/O (which reduces performance).
Also, check whether selected_variable and some_condition in some_table have proper indexes.
SELECT A.*
FROM dbo.A AS A
INNER JOIN DBO.some_table AS s
ON A.Selected_Variable = s.selected_variable
LEFT JOIN dbo.B AS B ON A.id = B.id
WHERE s.some_condition = 'X'
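One caveat when choosing between the two rewrites: EXISTS is a pure filter, while an INNER JOIN repeats a row of A once per matching row in some_table. The sketch below shows the difference with Python's built-in sqlite3 module standing in for SQL Server. The sample rows are invented, 'XXX' is duplicated on purpose, and the LEFT JOIN to B is left out to keep it short.

```python
import sqlite3

# SQLite stands in for SQL Server; the data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, selected_variable TEXT);
    CREATE TABLE some_table (selected_variable TEXT, some_condition TEXT);
    INSERT INTO A VALUES (1, 'XXX'), (2, 'YYY'), (3, 'ZZZ');
    -- 'XXX' deliberately appears twice with some_condition = 'X'
    INSERT INTO some_table VALUES
        ('XXX', 'X'), ('XXX', 'X'), ('YYY', 'X'), ('ZZZ', 'other');
""")

# EXISTS only tests for a match, so each row of A appears at most once.
exists_ids = [row[0] for row in conn.execute("""
    SELECT A.id
    FROM A
    WHERE EXISTS (SELECT 1
                  FROM some_table t
                  WHERE t.selected_variable = A.selected_variable
                    AND t.some_condition = 'X')
    ORDER BY A.id
""")]

# INNER JOIN returns one output row per matching row in some_table.
join_ids = [row[0] for row in conn.execute("""
    SELECT A.id
    FROM A
    INNER JOIN some_table s ON A.selected_variable = s.selected_variable
    WHERE s.some_condition = 'X'
    ORDER BY A.id
""")]

print(exists_ids, join_ids)
```

If selected_variable is not unique in some_table, the JOIN version needs a DISTINCT (or a de-duplicated subquery) to return the same rows as the IN and EXISTS versions.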

Postgresql LATERAL vs INNER JOIN

JOIN
SELECT *
FROM a
INNER JOIN (
SELECT b.id, Count(*) AS Count
FROM b
GROUP BY b.id ) AS b ON b.id = a.id;
LATERAL
SELECT *
FROM a,
LATERAL (
SELECT Count(*) AS Count
FROM b
WHERE a.id = b.id ) AS b;
As I understand it, the joined subquery is computed once and then merged with the main query, whereas the LATERAL subquery is executed for each row of the FROM clause.
It seems to me that if the join collapses many rows into one, the JOIN will be more efficient, but if the relationship is 1 to 1, LATERAL will be. Is that right?
If I understand you correctly, you are asking which of the two statements is more efficient.
You can test that yourself using EXPLAIN (ANALYZE), and I guess that the answer depends on the data:
If there are few rows in a, the LATERAL join will probably be more efficient if there is an index on b(id).
If there are many rows in a, the first query will probably be more efficient, because it can use a hash or merge join.
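Apart from speed, the two statements do not necessarily return the same rows: the INNER JOIN drops rows of a that have no match in b, while the per-row count reports them with a count of 0. The sketch below illustrates this with Python's built-in sqlite3 module. SQLite has no LATERAL, so a correlated scalar subquery is used as a stand-in for the LATERAL form (an assumption of this sketch), and the sample rows are invented.

```python
import sqlite3

# SQLite stands in for PostgreSQL; it lacks LATERAL, so the second query
# uses a correlated scalar subquery, which is likewise evaluated per row of a.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (2);
""")

# Aggregate b once, then join: ids with no rows in b disappear.
JOIN_SQL = """
    SELECT a.id, b.cnt
    FROM a
    INNER JOIN (SELECT b.id, COUNT(*) AS cnt FROM b GROUP BY b.id) AS b
      ON b.id = a.id
    ORDER BY a.id
"""

# Count per row of a: ids with no rows in b are kept with a count of 0.
CORRELATED_SQL = """
    SELECT a.id, (SELECT COUNT(*) FROM b WHERE b.id = a.id) AS cnt
    FROM a
    ORDER BY a.id
"""

join_rows = conn.execute(JOIN_SQL).fetchall()
correlated_rows = conn.execute(CORRELATED_SQL).fetchall()
print(join_rows)
print(correlated_rows)
```

So before comparing the plans with EXPLAIN (ANALYZE), it is worth deciding which of the two result sets is actually wanted.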

Efficient way to check if row exists for multiple records in postgres

I saw answers to a related question, but couldn't really apply what they are doing to my specific case.
I have a large table (300k rows) that I need to join with another even larger (1-2M rows) table efficiently. For my purposes, I only need to know whether a matching row exists in the second table. I came up with a nested query like so:
SELECT
id,
CASE cnt WHEN 0 then 'NO_MATCH' else 'YES_MATCH' end as match_exists
FROM
(
SELECT
A.id as id, count(*) as cnt
FROM
A, B
WHERE
A.id = B.foreing_id
GROUP BY A.id
) AS id_and_matches_count
Is there a better and/or more efficient way to do it?
Thanks!
You just want a left outer join:
SELECT
A.id as id, count(B.foreing_id) as cnt
FROM A
LEFT OUTER JOIN B ON
A.id = B.foreing_id
GROUP BY A.id
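To illustrate why the outer join matters here: with the implicit inner join from the question, ids without a match never reach the CASE expression, so NO_MATCH cannot appear. The sketch below runs the three shapes through Python's built-in sqlite3 module as a stand-in for PostgreSQL. The sample rows are invented, and the EXISTS variant is an alternative added for comparison, not part of the answer above.

```python
import sqlite3

# SQLite stands in for PostgreSQL; the data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER);
    CREATE TABLE B (foreing_id INTEGER);
    INSERT INTO A VALUES (1), (2), (3);
    INSERT INTO B VALUES (1), (1), (3);
""")

# Shape of the original query: inner join, so id 2 is dropped entirely.
inner_rows = conn.execute("""
    SELECT id, CASE cnt WHEN 0 THEN 'NO_MATCH' ELSE 'YES_MATCH' END
    FROM (SELECT A.id AS id, COUNT(*) AS cnt
          FROM A, B
          WHERE A.id = B.foreing_id
          GROUP BY A.id) AS id_and_matches_count
    ORDER BY id
""").fetchall()

# LEFT OUTER JOIN: COUNT(B.foreing_id) ignores NULLs, giving 0 for id 2.
left_rows = conn.execute("""
    SELECT A.id,
           CASE COUNT(B.foreing_id) WHEN 0 THEN 'NO_MATCH' ELSE 'YES_MATCH' END
    FROM A
    LEFT OUTER JOIN B ON A.id = B.foreing_id
    GROUP BY A.id
    ORDER BY A.id
""").fetchall()

# EXISTS: asks only whether a match exists, without counting or grouping.
exists_rows = conn.execute("""
    SELECT A.id,
           CASE WHEN EXISTS (SELECT 1 FROM B WHERE B.foreing_id = A.id)
                THEN 'YES_MATCH' ELSE 'NO_MATCH' END
    FROM A
    ORDER BY A.id
""").fetchall()

print(inner_rows, left_rows, exists_rows)
```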

How to select records from a Table that has a certain number of rows in a related table in SQL Server?

Not quite sure how to ask this, but I have two tables that are related in a 1-to-many relationship. I need to select all records in the "1" table that have fewer than three records in the "many" table.
select b.foreignkey,count(b.foreignkey) as bidcount
from b
where b.foreignkey in (select a.id from a) and bidcount< 3
group by b.foreignkey
I know this doesn't work at all, but I am at a loss as to how to do this.
In the end, I need to select all the records from the "a" table based on this criterion. Sorry if that is confusing!
Just using your code, not tested:
SELECT
b.foreignkey,
count(b.foreignkey) as bidcount
FROM
b
WHERE
b.foreignkey IN (SELECT a.id FROM a)
GROUP BY
b.foreignkey
HAVING
count(b.foreignkey) < 3
Try this:
SELECT t1.id,COUNT(t2.parentId)
FROM table1 as t1
INNER JOIN table2 as t2
ON t1.id = t2.parentId
GROUP BY t1.id
HAVING COUNT(t2.parentId) < 3
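One thing to watch with the INNER JOIN form: parents with no child rows at all never show up, even though zero is also fewer than three. The sketch below compares it with a LEFT JOIN variant using Python's built-in sqlite3 module as a stand-in for SQL Server. The sample rows are invented, and whether zero-child parents should be included is an assumption about the requirement.

```python
import sqlite3

# SQLite stands in for SQL Server; the data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (parentId INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3);
    -- parent 1 has three children, parent 2 has one, parent 3 has none
    INSERT INTO table2 VALUES (1), (1), (1), (2);
""")

# INNER JOIN: only parents with at least one child can pass the HAVING filter.
inner_rows = conn.execute("""
    SELECT t1.id, COUNT(t2.parentId)
    FROM table1 AS t1
    INNER JOIN table2 AS t2 ON t1.id = t2.parentId
    GROUP BY t1.id
    HAVING COUNT(t2.parentId) < 3
    ORDER BY t1.id
""").fetchall()

# LEFT JOIN: parents without children are kept and counted as 0.
left_rows = conn.execute("""
    SELECT t1.id, COUNT(t2.parentId)
    FROM table1 AS t1
    LEFT JOIN table2 AS t2 ON t1.id = t2.parentId
    GROUP BY t1.id
    HAVING COUNT(t2.parentId) < 3
    ORDER BY t1.id
""").fetchall()

print(inner_rows, left_rows)
```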
You didn't mention which version of SQL Server you're using - if you're on SQL Server 2005 or newer, you could use this CTE (Common Table Expression):
;WITH ChildRows AS
(
-- count the child rows per parent; COUNT next to A.Id requires the GROUP BY
SELECT A.Id, COUNT(B.Id) AS BCount
FROM
dbo.TableA A
INNER JOIN
dbo.TableB B ON B.TableAId = A.Id
GROUP BY A.Id
)
SELECT A.*, R.BCount
FROM dbo.TableA A
INNER JOIN ChildRows R ON A.Id = R.Id
WHERE R.BCount < 3
The inner SELECT lists the Id column from TableA and the count of the child rows associated with each Id (using the INNER JOIN to TableB, grouped by Id). The outer SELECT builds on top of that result set, shows all fields from table A (and the count from the B table), and keeps only the rows with fewer than three child rows.
If you want to return all fields of your (1) table in one query, I suggest you consider using CROSS APPLY:
SELECT t1.* FROM table_1 t1
CROSS APPLY (SELECT COUNT(*) cnt FROM Table_Many t2 WHERE t2.fk = t1.pk) a
WHERE a.cnt < 3
In some particular cases, depending on your indexes and database structure, this query may run 4 times faster than the GROUP BY method.
You posted this question under SQL Server; I have an answer for the Oracle database system (I don't know whether it will run on SQL Server as well).
It is as follows:
select [desired column list] from
(select b.*, count(*) over (partition by b.foreignkey) c_1
from b
where b.foreignkey in (select a.id from a) ) t -- SQL Server requires an alias on the derived table
where c_1 < 3 ;
I hope it works on SQL Server as well.
If not, please let me know.