How to optimize a SQL Query containing a max?

Please note that both T and T1 refer to the same table.
We want the rows that hold the maximum columnB value, where the maximum is
taken over all rows that share the same columnC value.
select *
from table T
where T.columnA in (0,1,2,3)
and T.columnB = (select max(T1.columnB)
from table T1
where T1.columnC = T.columnC)

This type of query is typically more efficient using window functions:
select *
from (
select *,
max(columnb) over (partition by columnc) as max_b
from the_table
where columna in (0,1,2,3)
) t
where columnb = max_b;
If the condition on columna is very selective, an index on that column would help. Some optimizers might generate more efficient plans if you change columna in (0,1,2,3) into columna between 0 and 3 (the two are equivalent only when columna holds integers).
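One thing worth checking before swapping the queries: the window version above filters on columna before computing the maximum, whereas the original correlated subquery takes the maximum over every row with the same columnC. The following is a minimal sketch of that difference, assuming SQLite 3.25+ (for window functions) through Python's sqlite3 module; the table name t and the sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data; the table is called t because "table" is a reserved word.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (columnA INTEGER, columnB INTEGER, columnC TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(0, 10, "x"), (5, 20, "x"), (1, 7, "y"), (2, 7, "y"), (3, 3, "y")],
)

# Original form: the maximum is taken over all rows sharing columnC.
correlated = conn.execute("""
    SELECT o.columnA, o.columnB, o.columnC
    FROM t AS o
    WHERE o.columnA IN (0, 1, 2, 3)
      AND o.columnB = (SELECT MAX(i.columnB) FROM t AS i WHERE i.columnC = o.columnC)
    ORDER BY o.columnA
""").fetchall()

# Window form, columnA filter applied AFTER the maximum is computed.
filter_outside = conn.execute("""
    SELECT columnA, columnB, columnC
    FROM (SELECT *, MAX(columnB) OVER (PARTITION BY columnC) AS max_b FROM t) AS s
    WHERE columnA IN (0, 1, 2, 3) AND columnB = max_b
    ORDER BY columnA
""").fetchall()

# Window form, columnA filter applied BEFORE the maximum is computed.
filter_inside = conn.execute("""
    SELECT columnA, columnB, columnC
    FROM (SELECT *, MAX(columnB) OVER (PARTITION BY columnC) AS max_b
          FROM t
          WHERE columnA IN (0, 1, 2, 3)) AS s
    WHERE columnB = max_b
    ORDER BY columnA
""").fetchall()

print(correlated)      # rows matching the original query
print(filter_outside)  # same rows as the original query
print(filter_inside)   # additionally returns (0, 10, 'x')
```

In this sample, the row (5, 20, 'x') holds the maximum for group x but fails the columnA filter, so the two placements of the filter disagree. Which behaviour is wanted depends on the requirement.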

a_horse_with_no_name is correct that window functions are generally a better approach. Whether you use window functions or your original query, indexes will help.
In particular, you want indexes on T(columnc, columnb) and T(columnA). That is two separate indexes. The SQL optimizer should be able to take advantage of the indexes both for your query and for the window functions approach.

I'm not sure at which level you want the columnA filter applied, but maybe something like this:
Select tt1.* from table tt1
inner join
(
select * from
table t1
inner join
( select max(T0.columnB) max_columnB
from table t0 ) t2
on t1.columnB = t2.max_columnB
) tt2
on tt1.columnC = tt2.columnC
Where tt1.columnA in (0,1,2,3)
For this to run fast, columnA, columnB, and columnC each need an index.

Related

Write a where clause that compares two columns to the same subquery?

I want to know if it's possible to make a where clause compare two columns to the same subquery. I know I could use a temp table or table variable, or write the same subquery twice, but I want to avoid all that if possible. The subquery is long and complex and will cause significant overhead if I have to write it twice.
Here is an example of what I am trying to do.
SELECT * FROM Table WHERE (Column1 OR Column2) IN (Select column from TABLE)
I'm looking for a simple answer and that might just be NO but if it's possible without anything too elaborate please clue me in.
I updated the select to use OR instead of AND as this clarified my question a little better.

The example you've given would probably perform best using exists, such as:
select *
from t1
where exists (
select 1 from t2
where t2.col = t1.col1 and t2.col = t1.col2
);
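Since the question was updated to use OR, it may help to see how the two readings differ. The sketch below is an illustration only, run on SQLite through Python's sqlite3 module with made-up rows: writing the predicate as t2.col IN (t1.col1, t1.col2) keeps a single subquery while matching either column, whereas the AND form above only matches when one t2 row equals both columns.

```python
import sqlite3

# Hypothetical tables and rows, named after the placeholders used in the answer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, col1 INTEGER, col2 INTEGER);
    CREATE TABLE t2 (col INTEGER);
    INSERT INTO t1 VALUES (1, 10, 99), (2, 98, 20), (3, 97, 96), (4, 10, 20), (5, 10, 10);
    INSERT INTO t2 VALUES (10), (20);
""")

# OR semantics: a row qualifies if either column appears in t2.
either = conn.execute("""
    SELECT id FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.col IN (t1.col1, t1.col2))
    ORDER BY id
""").fetchall()

# AND inside one EXISTS: the same t2 row must equal both columns.
same_row = conn.execute("""
    SELECT id FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.col = t1.col1 AND t2.col = t1.col2)
    ORDER BY id
""").fetchall()

print(either)    # ids 1, 2, 4 and 5
print(same_row)  # only id 5, where col1 equals col2
```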

To prevent writing the complicated subquery twice, you can use a CTE (Common Table Expression):
;WITH MyFirstCTE (x) AS
(
SELECT [column] FROM [TABLE1]
-- add all the very complicated stuff here
)
SELECT *
FROM Table2
WHERE Column1 IN (SELECT x FROM MyFirstCTE)
AND Column2 IN (SELECT x FROM MyFirstCTE)
Or using EXISTS:
;WITH MyFirstCTE (x) AS
(
SELECT [column] FROM [TABLE1]
-- add all the very complicated stuff here
)
SELECT *
FROM Table2
WHERE EXISTS (SELECT 1 FROM MyFirstCTE WHERE x = Column1)
AND EXISTS (SELECT 1 FROM MyFirstCTE WHERE x = Column2)
I used deliberately clumsy names, best to pick better ones.
I started it with a ; because if it's not the first command in a larger script then a ; is needed to separate the CTE from the commands before it.

SQL - Is there a way to check rows for duplicates in all columns of a table

I have a big table with over 1 million rows and 96 columns.
Using SQL I want to find rows where every value is the same. The table doesn't have any primary key so I'm not sure how to approach this. I'm not permitted to change the table structure.
I've seen people use count(*) and group by but I'm not sure if this is effective for a table with 96 columns.

Using COUNT() as an analytic function we can try:
WITH cte AS (
SELECT *, COUNT(*) OVER (PARTITION BY col1, col2, ..., col96) cnt
FROM yourTable
)
SELECT col1, col2, ..., col96
FROM cte
WHERE cnt > 1;
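With 96 columns, typing the column list by hand is the tedious part. As a rough sketch, the list can be generated from the catalog and fed into the GROUP BY / HAVING form mentioned in the question. The example below assumes SQLite via Python's sqlite3 module, where PRAGMA table_info plays the role of the catalog; other databases expose column names differently (for example through information_schema.columns). The three-column table and its rows are made up.

```python
import sqlite3

# Hypothetical stand-in for the real 96-column table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourTable (col1 TEXT, col2 INTEGER, col3 TEXT);
    INSERT INTO yourTable VALUES
        ('a', 1, 'x'), ('a', 1, 'x'),
        ('b', 2, 'y'),
        ('c', 3, 'z'), ('c', 3, 'z'), ('c', 3, 'z'),
        ('c', 3, 'q');
""")

# PRAGMA table_info returns one row per column; index 1 is the column name.
cols = [row[1] for row in conn.execute("PRAGMA table_info(yourTable)")]
col_list = ", ".join(f'"{c}"' for c in cols)

# Group on every column and keep only groups that occur more than once.
sql = f"""
    SELECT {col_list}, COUNT(*) AS cnt
    FROM yourTable
    GROUP BY {col_list}
    HAVING COUNT(*) > 1
    ORDER BY {col_list}
"""
duplicates = conn.execute(sql).fetchall()
print(duplicates)  # one row per duplicated combination, with its count
```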

You can use an MD5 hash of the column values as a substitute key and group by it:
select md5_col, count(1) as cnt from (
select md5(concat_ws('',col1,col2)) as md5_col,* from db_name.table_name) tt group by md5_col having count(1) > 1;

For convenience, use BINARY_CHECKSUM:
with cte as (
select *, BINARY_CHECKSUM(*) checksum
from mytable
), cte2 as (
select checksum
from cte
group by checksum
having count(*) > 1
)
select distinct t1.*
from cte t1
join cte t2 on t1.checksum = t2.checksum
and t1.col1 = t2.col1
and t1.col2 = t2.col2
-- etc
where t1.checksum in (select checksum from cte2)
cte2 will return (almost) only truly matching rows, so the join won't have many rows for which it must exhaustively compare every column.

Rather than trying to boil the ocean and solve the entire problem with a single sql query (which you certainly can do...), I recommend using any indexes or statistics on the tables to filter out as many rows as you can.
Start by finding the columns with the most / fewest unique values (assuming you have statistics, that is), and smash them up against each other to rapidly exclude as many rows as possible. Take the results, dump them to a temp table, index fields as needed, and repeat.
Or you could just do this:
Declare @sql nvarchar(max);
Select @sql='select column1 from schema.table where case ' + stuff((select 'when col1!=' + quotename(name) + ' then 0 ' from sys.columns where object_id=object_id('schema.table') for xml path(''),Type).value('.','nvarchar(max)'),1,11,'') + 'else 1 end = 1';
Exec sp_executesql @sql;
If you must run that horrorshow of a query in production, please use snapshot isolation or move it to a temp table first (unless no one ever updates the table).
(Honestly, I would probably use something like that query on the temp table containing my filtered-down dataset... but anything you can do to make sure that the comparisons aren't naïve (e.g. taking statistics into account) can improve your performance significantly. If you want to do it all at once, you could always join sys.tables to a temp table that puts your field comparisons into a thoughtful order. After all, once one when condition of the case expression is found to be true, all the others are skipped for that record.)

Using distinct on in subqueries

I noticed that in PostgreSQL the following two queries output different results:
select a.*
from (
select distinct on (t1.col1)
t1.*
from t1
order by t1.col1, t1.col2
) a
where a.col3 = value
;
create table temp as
select distinct on (t1.col1)
t1.*
from t1
order by t1.col1, t1.col2
;
select temp.*
from temp
where temp.col3 = value
;
I guess it has something to do with using distinct on in subqueries.
What is the correct way to use distinct on in subqueries? E.g. can I use it if I don't use a where clause?
Or in queries like
(
select distinct on (a.col1)
a.*
from a
)
union
(
select distinct on (b.col1)
b.*
from b
)

Under normal circumstances, both examples should return the same result.
I suspect that you are getting different results because the order by clause of your distinct on subquery is not deterministic. That is, there may be several rows in t1 sharing the same col1 and col2.
If the columns in the order by do not uniquely identify each row, then the database has to make its own decision about which row will be retained in the resultset: as a consequence, the results are not stable, meaning that consecutive executions of the same query may yield different results.
Make sure that your order by clause is deterministic (for example by adding more columns in the clause), and this problem should not arise anymore.
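To make the tie-breaking point concrete, here is a small sketch. It runs on SQLite (3.25+) through Python's sqlite3 module, and SQLite has no distinct on, so the example swaps in ROW_NUMBER() to pick one row per col1; the ordering argument is the same. The rows are made up, and using col3 as the tie-breaker is an assumption for illustration.

```python
import sqlite3

# Hypothetical rows: the two rows for col1 = 1 tie on col2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (col1 INTEGER, col2 INTEGER, col3 TEXT);
    INSERT INTO t1 VALUES (1, 5, 'a'), (1, 5, 'b'), (2, 1, 'c');
""")

def first_per_group(order_by):
    """Keep one row per col1, chosen by the given ordering (emulates distinct on)."""
    return conn.execute(f"""
        SELECT col1, col2, col3
        FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY {order_by}) AS rn
              FROM t1) AS s
        WHERE rn = 1
        ORDER BY col1
    """).fetchall()

# Ambiguous: ordering by col2 alone cannot decide between 'a' and 'b'.
ambiguous = first_per_group("col2")

# Deterministic: col3 breaks the tie, so 'a' is always the retained row.
deterministic = first_per_group("col2, col3")

print(ambiguous)      # the row kept for col1 = 1 is unspecified
print(deterministic)  # always keeps (1, 5, 'a') for col1 = 1
```

With the ambiguous ordering, a later filter such as col3 = 'b' may or may not find a row, which is one way two otherwise equivalent queries can disagree.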

Efficient way to select all values from one column not in another column

I need to return all values from colA that are not in colB from mytable. I am using:
SELECT DISTINCT(colA) FROM mytable WHERE colA NOT IN (SELECT colB FROM mytable)
It works; however, the query takes an excessively long time to complete.
Is there a more efficient way to do this?

In standard SQL there are no parentheses in DISTINCT colA. DISTINCT is not a function.
SELECT DISTINCT colA
FROM mytable
WHERE colA NOT IN (SELECT DISTINCT colB FROM mytable);
Added DISTINCT to the sub-select as well. If you have many duplicates it could speed up the query.
A CTE might be faster, depending on your DBMS. I additionally demonstrate LEFT JOIN as an alternative to exclude the values in colB, and an alternative way to get distinct values with GROUP BY:
WITH x AS (SELECT colB FROM mytable GROUP BY colB)
SELECT m.colA
FROM mytable m
LEFT JOIN x ON x.colB = m.colA
WHERE x.colB IS NULL
GROUP BY m.colA;
Or, simplified further, with a plain self-join instead of the CTE (probably fastest):
SELECT DISTINCT m.colA
FROM mytable m
LEFT JOIN mytable x ON x.colB = m.colA
WHERE x.colB IS NULL;
There are basically 4 techniques to exclude rows with keys present in another (or the same) table; see the related question "Select rows which are not present in other table".
The deciding factor for speed will be indexes. You need to have indexes on colA and colB for this query to be fast.

You can use exists:
select distinct
colA
from
mytable m1
where
not exists (select 1 from mytable m2 where m2.colB = m1.colA)
exists does a semi-join to quickly match the values. not in completes the entire result set and then does an or on it. exists is typically faster for values in tables.
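Apart from speed, the two forms can also return different rows when colB contains NULL: a NOT IN comparison against a list containing NULL never evaluates to true, while NOT EXISTS simply finds no match. A minimal sketch of that behaviour, assuming SQLite through Python's sqlite3 module and made-up rows:

```python
import sqlite3

# Hypothetical data: one colB value is NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (colA INTEGER, colB INTEGER);
    INSERT INTO mytable VALUES (1, 2), (2, NULL), (3, 1), (4, 4);
""")

# NOT IN: the NULL in the subquery result makes the predicate unknown for colA = 3.
not_in = conn.execute("""
    SELECT DISTINCT colA FROM mytable
    WHERE colA NOT IN (SELECT colB FROM mytable)
    ORDER BY colA
""").fetchall()

# NOT EXISTS: the NULL row never matches, so colA = 3 is returned.
not_exists = conn.execute("""
    SELECT DISTINCT m1.colA FROM mytable AS m1
    WHERE NOT EXISTS (SELECT 1 FROM mytable AS m2 WHERE m2.colB = m1.colA)
    ORDER BY m1.colA
""").fetchall()

# NOT IN behaves like NOT EXISTS once NULLs are filtered out of the subquery.
not_in_filtered = conn.execute("""
    SELECT DISTINCT colA FROM mytable
    WHERE colA NOT IN (SELECT colB FROM mytable WHERE colB IS NOT NULL)
    ORDER BY colA
""").fetchall()

print(not_in)           # empty
print(not_exists)       # only colA = 3
print(not_in_filtered)  # only colA = 3
```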

You can use the EXCEPT operator which effectively diffs two SELECT queries. EXCEPT DISTINCT will return only unique values. Oracle's MINUS operator is equivalent to EXCEPT DISTINCT.
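A short sketch of the EXCEPT form for this question, run on SQLite through Python's sqlite3 module with made-up rows. SQLite spells it as plain EXCEPT, which already removes duplicates; the DISTINCT keyword is not part of its syntax.

```python
import sqlite3

# Hypothetical data with duplicates in both columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (colA INTEGER, colB INTEGER);
    INSERT INTO mytable VALUES (1, 2), (2, 1), (3, 9), (3, 9), (5, 2);
""")

# Values of colA that never appear in colB; EXCEPT returns each value once.
rows = conn.execute("""
    SELECT colA FROM mytable
    EXCEPT
    SELECT colB FROM mytable
""").fetchall()

only_in_a = sorted(value for (value,) in rows)
print(only_in_a)  # [3, 5]
```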

Joining a table on itself

Is there a better way to write this SQL query?
SELECT *, (SELECT TOP 1 columnB FROM mytable WHERE mytable.columnC = T1.columnC ORDER BY columnD) as firstRecordOfColumnB
FROM
(SELECT * FROM mytable WHERE columnA = 'apple') as T1
Notice that columnC is not the primary key.

If keyColumn really is a key column (i.e. unique), then the query can definitely be written more elegantly and efficiently:
SELECT
*, columnB
FROM
mytable
WHERE
columnA = 'apple'

This might perform better:
SELECT
*,
(SELECT TOP 1 myLookupTable.columnB FROM mytable AS myLookupTable WHERE myLookupTable.keyColumn = mytable.keyColumn) as firstRecordOfColumnB
FROM
mytable
WHERE
columnA = 'apple'
But for the TOP 1 part I don't know any better solution.
Edit:
If the keyColumn is unique, the data in firstRecordOfColumnB would be the same as in mytable.columnB.
If it's not unique at least you need to sort that data to get a relevant TOP 1, example:
SELECT
*,
(SELECT TOP 1 myLookupTable.columnB FROM mytable AS myLookupTable WHERE myLookupTable.keyColumn = mytable.keyColumn
ORDER BY myLookupTable.sortColumn) as firstRecordOfColumnB
FROM
mytable
WHERE
columnA = 'apple'
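One possible alternative to the TOP 1 subquery, not taken from the answers above, is the FIRST_VALUE window function, computed over the whole table before the columnA filter is applied. The sketch below assumes SQLite 3.25+ through Python's sqlite3 module, uses the question's column names with made-up rows, and writes LIMIT 1 where SQL Server would use TOP 1.

```python
import sqlite3

# Hypothetical rows using the column names from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (columnA TEXT, columnB TEXT, columnC TEXT, columnD INTEGER);
    INSERT INTO mytable VALUES
        ('apple', 'b1', 'k1', 2),
        ('pear',  'b2', 'k1', 1),
        ('apple', 'b3', 'k2', 5),
        ('kiwi',  'b4', 'k2', 9);
""")

# Correlated subquery, as in the question (LIMIT 1 is SQLite's TOP 1).
correlated = conn.execute("""
    SELECT o.columnA, o.columnB, o.columnC,
           (SELECT i.columnB FROM mytable AS i
            WHERE i.columnC = o.columnC
            ORDER BY i.columnD LIMIT 1) AS firstRecordOfColumnB
    FROM mytable AS o
    WHERE o.columnA = 'apple'
    ORDER BY o.columnC
""").fetchall()

# Window function: first columnB per columnC group, ordered by columnD.
windowed = conn.execute("""
    SELECT columnA, columnB, columnC, firstRecordOfColumnB
    FROM (SELECT *,
                 FIRST_VALUE(columnB) OVER (PARTITION BY columnC ORDER BY columnD)
                     AS firstRecordOfColumnB
          FROM mytable) AS s
    WHERE columnA = 'apple'
    ORDER BY columnC
""").fetchall()

print(correlated)
print(windowed)  # same rows as the correlated form
```

The filter on columnA has to stay outside the derived table; otherwise rows such as the 'pear' row would no longer take part in deciding the first value for their group.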
