Which is better: Distinct or Group By

Which is more efficient?
SELECT theField
FROM theTable
GROUP BY theField
or
SELECT DISTINCT theField
FROM theTable

In your example, both queries will generate the same execution plan, so their performance will be the same.
However, each has its own purpose. To make your code easier to understand, use DISTINCT to eliminate duplicate rows and GROUP BY to apply aggregate operators (SUM, COUNT, MAX, ...).
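As a rough illustration of the point above, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for whatever engine you actually run, and the sample rows and the amount column are made up for the example. It shows that both forms return the same set of values, and that GROUP BY is the one that pairs naturally with an aggregate.

```python
import sqlite3

# Hypothetical sample data; theTable/theField mirror the names in the question,
# the "amount" column is an assumption added for the aggregate example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE theTable (theField TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO theTable VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)],
)

# Both forms return one row per distinct value. ORDER BY makes the row order
# deterministic so the two result lists can be compared directly.
distinct_rows = conn.execute(
    "SELECT DISTINCT theField FROM theTable ORDER BY theField"
).fetchall()
grouped_rows = conn.execute(
    "SELECT theField FROM theTable GROUP BY theField ORDER BY theField"
).fetchall()

# GROUP BY is the natural choice once an aggregate is involved.
totals = conn.execute(
    "SELECT theField, SUM(amount) FROM theTable GROUP BY theField ORDER BY theField"
).fetchall()

print(distinct_rows)
print(grouped_rows)
print(totals)
```

This only compares results, not plans; for plans, use the tooling of your own database.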

It doesn't matter; it results in the same execution plan (at least for these queries). These kinds of questions are easy to answer by having Query Analyzer or SSMS show the execution plan, and perhaps the server trace statistics, after running the query.

In most cases, DISTINCT and GROUP BY generate the same plans, and their performance is usually identical.

You can check the execution plan to look at the total cost of these statements. The answer may vary in different scenarios.

Hmmm...so far as I can see in the Execution Plan for running similar queries, they are identical.

In MySQL, DISTINCT seems a bit faster than GROUP BY if theField is not indexed. DISTINCT only eliminates duplicate rows, but GROUP BY seems to sort them in addition.

Related

Is there performance impact when Non-Aggregate SQL functions are used in a SELECTed Column?

We have a report that uses a long and complex query that has the SELECT statement like below:
SELECT
NVL(nazwawystawcy,'BRAK') supplier_name,
NVL(AdresDostawcy,'BRAK') supplier_address,
NVL(NrDostawcy,'BRAK') supplier_registration,
DowodZakupu document_number,
DataZakupu document_issue_date,
DataWplywu document_recording_date,
trx_id,
KodKrajuNadaniaTIN country_code,
DokumentZakupu document_type_code,
payment_split MPP,
box_number box_number,
box_amount box_amount,
box_type box_type,
display_order display_order
...
FROM table1 t1
,table2 t2
....
We recently modified this query, changing only the third selected column to add a REGEXP_LIKE check:
SELECT
NVL(nazwawystawcy,'BRAK') supplier_name,
NVL(AdresDostawcy,'BRAK') supplier_address,
--NVL(NrDostawcy,'BRAK') supplier_registration,
Case When (NrDostawcy is not null and regexp_like(substr(NrDostawcy,1,2),'^[a-zA-Z]*$')) Then substr(NrDostawcy,3) else NVL(NrDostawcy,'BRAK') End supplier_registration,
DowodZakupu document_number,
DataZakupu document_issue_date,
DataWplywu document_recording_date,
trx_id,
KodKrajuNadaniaTIN country_code,
DokumentZakupu document_type_code,
payment_split MPP,
box_number box_number,
box_amount box_amount,
box_type box_type,
display_order display_order
...
FROM table1 t1
,table2 t2
....
I checked the explain plans of both queries and they turned out to have the same plan hash value.
Does this mean there's no impact on performance if I use seeded, non-aggregate SQL functions in selected columns?
I believe there is an impact on performance if they're used in the WHERE clause, but I wasn't sure whether the same applies to the selected columns.
Apologies in advance, as I can't provide the exact query since it's proprietary and is very long and complex.
I also don't think I can create a good enough sample that would match the explain plan of the actual query, as it joins over 10 tables with thousands of rows of data.
Thank you!

Since you are running this query on Oracle, here is my advice. Run the query with the Oracle hint /*+ gather_plan_statistics */, once without the regex and once with it. Then find each query in the shared pool (v$sql). The hint gives you the exact buffer gets, physical reads, and time spent in each step of the plan. With that data you can analyze in detail how much more time the query with the regex needed to execute. I advise doing this on data that returns more than, say, 10k rows, so that the difference is visible (with 100 rows you will not see any difference).

The execution plan is the same because the query needs to read exactly the same data from the same tables. You should also see the amount of data read (logical I/O) unchanged.
What will not be the same, however, is the execution time: regexp_like consumes more CPU, even though the logical I/O is unchanged.
Note that if you changed the set of selected columns, the execution plan could change: if all selected columns were part of an index, the optimizer might skip the table access and read the data from the index only.
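To make the "same data read, more CPU" point concrete, here is a small sketch using Python's sqlite3 module as a stand-in for Oracle. The suppliers table, its rows, and the strip_prefix function are hypothetical; strip_prefix is only a simplified Python approximation of the CASE/REGEXP_LIKE expression from the question. The counter shows that a function in the select list is evaluated once for every returned row, which is where the extra CPU time comes from.

```python
import re
import sqlite3

calls = 0  # counts how many times the select-list function is evaluated


def strip_prefix(value):
    """Simplified stand-in for the CASE/REGEXP_LIKE expression in the question."""
    global calls
    calls += 1
    if value is None:
        return "BRAK"  # mirrors NVL(NrDostawcy, 'BRAK')
    if re.fullmatch(r"[a-zA-Z]*", value[:2]):
        return value[2:]  # drop a two-letter prefix such as a country code
    return value


conn = sqlite3.connect(":memory:")
conn.create_function("strip_prefix", 1, strip_prefix)
conn.execute("CREATE TABLE suppliers (NrDostawcy TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO suppliers VALUES (?)",
    [("PL1234567890",), ("9876543210",), (None,)],
)

# The table is scanned exactly as it would be without the function;
# the function adds per-row work on top of that scan.
rows = conn.execute(
    "SELECT strip_prefix(NrDostawcy) FROM suppliers ORDER BY rowid"
).fetchall()

print(rows)
print(calls)
```

With three rows the cost is invisible; the per-row work only becomes measurable on large result sets, which is why the advice above is to test with many rows.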

It depends on the query and the I/O being done to get the data. Sometimes creating an Oracle function-based index helps; you may see some improvement.
Check this link, it could help you.
https://jeffkemponoracle.com/2007/11/will-oracle-use-my-regexp-function-based-index/
thanks

is IN(SELECT ...) bad for performance?

Suppose I have the following code:
SELECT *
FROM [myTable]
WHERE [myColumn] IN (SELECT [otherColumn] FROM [myOtherTable])
Will the subquery be executed again and again for every row?
If so, can I execute it and store its results and use them for every row instead? For example:
SELECT [otherColumn]
INTO #Results
FROM [myOtherTable]
SELECT *
FROM [myTable]
WHERE [myColumn] IN (SELECT [otherColumn] FROM #Results)

The SQL Server query optimizer is smart enough not to run the same subquery over and over again. If anything, the temp table is less optimal because of the additional steps after getting the results.
You can see this by looking at the SQL query execution plan.
Edit: After looking into this further, the subquery can also be executed more than once. Apparently the query optimizer can also do a lot of interesting things, like converting your IN to a JOIN to increase performance. There's lots of information on it here: Number of times a nested query is executed
Nonetheless, view your execution plan to see what your RDBMS's query optimizer decided to do.

Have you considered using a join instead? I think that could be best in terms of performance.
SELECT * FROM [myTable] INNER JOIN [myOtherTable]
ON [myTable].[myColumn] = [myOtherTable].[otherColumn];
This, however, is only equivalent if you don't expect duplicates in myOtherTable; otherwise each row of myTable is returned once per matching row.
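The duplicate caveat is easy to see on a tiny data set. The sketch below uses Python's sqlite3 module as a stand-in engine; the table and column names follow the question, and the rows are made up so that one value appears twice in myOtherTable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myTable (myColumn INTEGER);
    CREATE TABLE myOtherTable (otherColumn INTEGER);
    INSERT INTO myTable VALUES (1), (2), (3);
    -- The value 2 appears twice on purpose.
    INSERT INTO myOtherTable VALUES (2), (2), (3);
""")

# IN is a membership test: each row of myTable appears at most once.
in_rows = conn.execute(
    "SELECT myColumn FROM myTable "
    "WHERE myColumn IN (SELECT otherColumn FROM myOtherTable) "
    "ORDER BY myColumn"
).fetchall()

# INNER JOIN produces one output row per matching pair of rows.
join_rows = conn.execute(
    "SELECT myTable.myColumn FROM myTable "
    "INNER JOIN myOtherTable ON myTable.myColumn = myOtherTable.otherColumn "
    "ORDER BY myTable.myColumn"
).fetchall()

print(in_rows)    # each matching row of myTable once
print(join_rows)  # the row for 2 is repeated, once per match
```

If duplicates are possible and you still want a join, you would need to deduplicate the right-hand side first, which is effectively what IN does for you.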

Difference between two sql count and subquery count statements

Is there a big performance difference between these two SQL count statements when counting a large number of rows (large here means 100k+ records)?
first:
SELECT count(*) FROM table1 WHERE <some very complex conditions>
second:
SELECT count(*) FROM (SELECT * FROM table1 WHERE <some very complex conditions>) subquery_alias
I know that the first approach is the right one, but I want to know whether these statements will perform similarly.

The query optimizer will most likely transform your second query into the first one. There should be no measurable performance difference between those two queries.

The answer depends on the database being used. For MS SQL Server, the query optimizer will optimize the query and both will have similar performance. For other database systems, it depends on how capable the query optimizer is.
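One way to check this on your own system is to run both forms and compare the counts and the plans. The sketch below does that with Python's sqlite3 module, purely as a stand-in engine; the table contents and the condition string are hypothetical placeholders for the "very complex conditions". What the plans look like is engine-specific, so the code prints them rather than asserting anything about them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(i, "open" if i % 3 == 0 else "closed") for i in range(1, 101)],
)

# Placeholder for the complex conditions in the question.
condition = "status = 'open' AND id > 10"

direct = f"SELECT count(*) FROM table1 WHERE {condition}"
wrapped = (
    f"SELECT count(*) FROM (SELECT * FROM table1 WHERE {condition}) subquery_alias"
)

direct_count = conn.execute(direct).fetchone()[0]
wrapped_count = conn.execute(wrapped).fetchone()[0]
print(direct_count, wrapped_count)

# Print the plan the engine chose for each form so they can be compared by eye.
for sql in (direct, wrapped):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(row)
```

On SQL Server or Oracle the equivalent step is to look at the execution plan or explain plan for both statements.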

In which sequence are queries and sub-queries executed by the SQL engine?

Hello, I took a SQL test and am curious about one question:
In which sequence are queries and sub-queries executed by the SQL engine?
The possible answers were:
1. primary query -> sub query -> sub sub query and so on
2. sub sub query -> sub query -> prime query
3. the whole query is interpreted at one time
4. There is no fixed sequence of interpretation, the query parser takes a decision on the fly
I chose the last answer (just supposing that it is the most reliable of the four).
Now the curiosity:
Where can I read about this and, briefly, what is the mechanism behind all of that?
Thank you.

I think answer 4 is correct. There are a few considerations:
Type of subquery: is it correlated or not? Consider:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
)
Here, the subquery is not correlated to the outer query. If the number of values in t2.id is small in comparison to t1.id, it is probably most efficient to first execute the subquery, and keep the result in memory, and then scan t1 or an index on t1.id, matching against the cached values.
But if the query is:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
WHERE t2.type = t1.type
)
Here the subquery is correlated: there is no way to compute the subquery unless t1.type is known. Since the value for t1.type may vary for each row of the outer query, this subquery could be executed once for each row of the outer query.
Then again, the RDBMS may be really smart and realize there are only a few possible values for t2.type. In that case, it may still use the approach used for the uncorrelated subquery if it can guess that the cost of executing the subquery once will be cheaper than doing it for each row.
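The two queries above also differ in what they return, not just in how they might be executed. Here is a minimal sketch with Python's sqlite3 module as a stand-in engine; the rows in t1 and t2 are invented for the example. It says nothing about the execution order a given optimizer picks, only about the logical difference between the uncorrelated and the correlated form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, type TEXT);
    CREATE TABLE t2 (id INTEGER, type TEXT);
    INSERT INTO t1 VALUES (1, 'x'), (2, 'y'), (3, 'x');
    INSERT INTO t2 VALUES (1, 'x'), (2, 'x'), (4, 'y');
""")

# Uncorrelated: the subquery result does not depend on the outer row,
# so it can be computed once and reused.
uncorrelated = conn.execute(
    "SELECT id FROM t1 WHERE id IN (SELECT id FROM t2) ORDER BY id"
).fetchall()

# Correlated: the subquery references t1.type, so its result can differ
# for every row of the outer query.
correlated = conn.execute(
    "SELECT id FROM t1 "
    "WHERE id IN (SELECT id FROM t2 WHERE t2.type = t1.type) "
    "ORDER BY id"
).fetchall()

print(uncorrelated)
print(correlated)
```

Row 2 of t1 matches an id in t2 but not with the same type, so it appears only in the uncorrelated result.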

Option 4 is close.
SQL is declarative: you tell the query optimiser what you want and it works out the best (subject to time/"cost" etc) way of doing it. This may vary for outwardly identical queries and tables depending on statistics, data distribution, row counts, parallelism and god knows what else.
This means there is no fixed order. But it's not quite "on the fly".
Even with identical servers, schema, queries, and data, I've seen execution plans differ.

The SQL engine tries to optimise the order in which (sub)queries are executed. The part deciding about that is called a query optimizer. The query optimizer knows how many rows are in each table, which tables have indexes and on what fields. It uses that information to decide what part to execute first.

If you want something to read up on these topics, get a copy of Inside SQL Server 2008: T-SQL Querying. It has two dedicated chapters on how queries are processed logically and physically in SQL Server.

It usually depends on your DBMS, but ... I think the second answer is more plausible.
The primary query usually can't be calculated without the subquery results.

Does the sequence in which we use joins in a query affect its execution time?

Does the sequence in which we use joins in a query affect its execution time?

No, it does not.
SQL Server's optimizer picks the best (in its opinion) way regardless of the join order.
SQL Server supports a special hint, FORCE ORDER, which makes the optimizer join the tables in the order they are listed.
These queries:
SELECT *
FROM t_a
JOIN t_b
ON a = b
OPTION (FORCE ORDER)
and
SELECT *
FROM t_b
JOIN t_a
ON a = b
OPTION (FORCE ORDER)
will produce identical plans when OPTION (FORCE ORDER) is omitted and different plans when it is included.
However, you should use this hint only when you are absolutely sure you know what you are doing.

Yes, it does. Its effect can be seen in the query execution plan.

The optimizer will usually compare all join orders and try to select the optimal one, regardless of the order in which you wrote the query. However, in some complicated queries there are simply too many options to consider: the number of possible join orders grows factorially with the number of tables joined. In that case the optimizer will not consider all options, but it will definitely consider the order in which you wrote the joins, so that order may affect the execution plan and execution time.

For outer joins, the order of tables changes the meaning of your query and is very likely to change the execution plan.
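A small sketch of that last point, using Python's sqlite3 module as a stand-in engine. The table names t_a and t_b follow the earlier answer and the rows are invented. Swapping the tables leaves an inner join's result unchanged, but gives a different result for a LEFT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_a (a INTEGER);
    CREATE TABLE t_b (b INTEGER);
    INSERT INTO t_a VALUES (1), (2);
    INSERT INTO t_b VALUES (2), (3);
""")


def run(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Inner join: the written table order does not change the result.
inner_ab = run("SELECT a, b FROM t_a JOIN t_b ON a = b")
inner_ba = run("SELECT a, b FROM t_b JOIN t_a ON a = b")

# Outer join: the table on the left is the one whose rows are always kept.
left_ab = run("SELECT a, b FROM t_a LEFT JOIN t_b ON a = b ORDER BY a")
left_ba = run("SELECT a, b FROM t_b LEFT JOIN t_a ON a = b ORDER BY b")

print(inner_ab, inner_ba)
print(left_ab)
print(left_ba)
```

Because the two LEFT JOIN queries ask for different things, the optimizer is not free to treat them as interchangeable the way it can with inner joins.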
