How MIN/MAX function works in SQL? - sql

I'm trying to understand how MIN/MAX function calculates value in backed in sql
Lets say I have below table Duplicate
ID NAME
1 A
2 A
3 A
4 A
5 A
6 B
7 B
8 B
9 B
10 B
11 C
12 C
13 C
14 C
SO when I run a below query
SELECT MAX(ID), NAME FROM Duplicate
GROUP BY NAME
Does sql engine finds first MAX value of ID in every group and then finds MAX ID out of those Grouped records ? Is it correct or something else happens ?

You'll see something like this in Oracle
SQL> set autotrace traceonly explain
SQL> select owner, max(object_id)
2 from t
3 group by owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 47235625
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 37 | 407 | 431 (2)| 00:00:01 |
| 1 | HASH GROUP BY | | 37 | 407 | 431 (2)| 00:00:01 |
| 2 | TABLE ACCESS FULL| T | 78939 | 847K| 427 (1)| 00:00:01 |
---------------------------------------------------------------------------
"group by hash". This a mechanism via which we can avoid a massive sorting cost to perform aggregation (min, max, etc etc).
Conceptually its like this:
Read first row
Hash the group by column ("owner" in my case)
Lets say the hash value is 1234.
Store value of "object_id" in bucket 1234.
then
Read next row
Hash the group by column ("owner" in my case)
Lets say the hash value is 5678.
Store value of "object_id" in bucket 5678.
then
Read next row
Hash the group by column ("owner" in my case)
Lets say the hash value is 1234 (ie, same value is row 1).
Compare object_id value with existing object_id in bucket 5678. If it's larger, then replace it, otherwise ignore and move on.
So you can see we can identify the max value without sorting - just a single scan of the all the data.

I don't know what DB you're using, but for Teradata, which distributes table rows in a parallel manner, a simple aggregation with GROUP BY typically will do:
Aggregate rows (local)
Redistribute rows
Sort rows
Aggregate rows (global)
Return final result
What DBMS are you using? Can you run an EXPLAIN on your query to see what the query plan is? That would give you some idea.

Related

listagg produces ORA-01489 if used as window function in conditional expression

My query returns many (thousands of) rows.
Column l has certain value for very small amount of rows (up to 10).
For each such row I want to output aggregated comma-separated values of very short (up to 5 chars) varchar column v over all of these rows.
For rows not having the special value of l I want to simply output the v value for that row.
Synthetized example of same problem: from first 10000 integers, I want to output 1,2,3,4,5,6,7,8,9 for each single-digit number; that number for multiple-digit number. (Yes, silly example but real case makes sense.)
with x (v,l) as (
select to_char(level), length(to_char(level)) from dual connect by level <= 10000
)
select case l
when 1 then listagg(v,',') within group (order by v) over (partition by l)
else v
end
from x
order by 1;
The problem is, listagg function fails on ORA-01489: result of string concatenation is too long error.
I am aware of 4000 char limit of listagg function as well as xmlagg-based workaround. I just don't get the limit is enough for data I want to concatenate even though not enough for all data. In example above, the partition of 9 single-digit numbers fits into 4000 chars, the partition of 9000 four-digit numbers not. I expected the case expression would prevent execution of window for unrelated rows but, for some reason, it seems the db engine evaluates window for all rows. (Also note that order by clause causes query to fail-fast - without it some rows are returned before failure.)
Can you please explain some reasoning for this behaviour? I suspect the window computation is logically before select clause but without any evidence. Reproduced on Oracle 11g, 18c and 19 (livesql).
Well you are using SQL which is not procedural, so you can't expect that some parts of the code path will not be executed, only because they are not used. (So filling a bug as other suggested will have no success).
Anyway you can do the often used trick based on the fact that listagg ignores null values.
So this formulation works fine:
with x (v,l) as (
select to_char(level), length(to_char(level)) from dual connect by level <= 10000
)
select nvl(listagg(case when l = 1 then v end,',') within group (order by v) over (partition by l),v) lst
from x
order by 1;
giving
LST
------------------
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9
..
10
100
1000
10000
The explanation of the problem can be found in the execution plan (showing only the relevant part)
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 35 | 4 (50)| 00:00:01 |
| 1 | SORT ORDER BY | | 1 | 35 | 4 (50)| 00:00:01 |
| 2 | WINDOW SORT | | 1 | 35 | 4 (50)| 00:00:01 |
| 3 | VIEW | | 1 | 35 | 2 (0)| 00:00:01 |
|* 4 | CONNECT BY WITHOUT FILTERING| | | | | |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
...
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (#keys=1) CASE "L" WHEN 1 THEN LISTAGG("V",',') WITHIN GROUP ( ORDER BY
"V") OVER ( PARTITION BY "L") ELSE "V" END [4000]
2 - (#keys=2) "L"[NUMBER,22], "V"[VARCHAR2,40], LISTAGG("V",',') WITHIN
GROUP ( ORDER BY "V") OVER ( PARTITION BY "L")[4000]
3 - "V"[VARCHAR2,40], "L"[NUMBER,22]
4 - LEVEL[4]
So in the line 2 the listagg is calculated (for all rows) only to be filtered in the line 1.
It is odd that you do get an error about the 4000 character limit even though no result is longer than 4000 characters. Maybe you could file this as a bug to Oracle Support.
Another workaround is to make use of the ON OVERFLOW logic of the LISTAGG function if you are on Oracle 12.2 or higher. Using LISTAGG (v, ',' ON OVERFLOW TRUNCATE) in the query allows the query to be run without error and does not truncate any values (at least in the example).

oracle update with correlated query

What is the correct answer? Choose two.
Examine this SQL statement:
UPDATE orders o
SET customer_name = (
SELECT cust_last_name FROM customers WHERE customer_id=o.customer_id
);
Which two are true?
A. The subquery is executed before the UPDATE statement is executed.
B. All existing rows in the ORDERS table are updated.
C. The subquery is executed for every updated row in the ORDERS
table.
D. The UPDATE statement executes successfully even if the subquery
selects multiple rows.
E. The subquery is not a correlated subquery.
I know B is correct, but all other selection I believe is incorrect.
A. Subquery executes for every row that the outer query returns, so
it should execute after the outer query.
C. NOT for every updated row, it is for every row that the outer
query returns.
D. I tried. It causes an error ORA-01427: single-row subquery returns
more than one row
E. It is a correlated subquery.
Consider option C:
C. The subquery is executed for every updated row in the ORDERS table.
You said:
NOT for every updated row, it is for every row that the outer query returns.
Yes. The subquery is indeed executed for every row in the outer query (let apart possible optimizations applied by the database). And every row in the outer query is updated - as you spotted, since you already, and correctly, selected option B: all existing rows in the ORDERS table are updated.
Note: your arguments against options A, D and 3 are valid.
The only second true answer is
F. this is a wrong desing denormalizing the CUSTOMER_NAME in the orders table and conflicting therefor with the normal form.
The answer C could be right somewhere in the times of Oracle 8 (i.e. 20 years ago) but now it is definitively wrong!.
Oracle introduces the scalar subquery caching event for the reason to limit the number of executions of the subqueries.
Here a Simple Demonstration
This setup in Oracle 19.2 has 10K orders and 1K customers.
create table customers as
select rownum customer_id, 'cust_'||rownum customer_name from dual connect by level <= 1000;
create index customers_idx1 on customers (customer_id);
create table orders as
select rownum order_id, trunc(rownum/10)+1 customer_id, cast (null as varchar2(100)) customer_name
from dual connect by level <= 10000;
The update is performed on 100K rows as expected
UPDATE /*+ gather_plan_statistics */ orders o
SET customer_name = (
SELECT customer_name FROM customers WHERE customer_id=o.customer_id
);
The hint gather_plan_statistics collects teh execution statistics which we will examine.
SQL_ID 8r610vz9fknr6, child number 0
-------------------------------------
UPDATE /*+ gather_plan_statistics */ orders o SET customer_name = (
SELECT customer_name FROM customers WHERE customer_id=o.customer_id )
Plan hash value: 3416863305
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
--------------------------------------------------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 1 | | 0 |00:00:00.18 | 60863 | 21 |
| 1 | UPDATE | ORDERS | 1 | | 0 |00:00:00.18 | 60863 | 21 |
| 2 | TABLE ACCESS FULL | ORDERS | 1 | 10000 | 10000 |00:00:00.01 | 21 | 18 |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMERS | 1001 | 1 | 1000 |00:00:00.01 | 2020 | 3 |
|* 4 | INDEX RANGE SCAN | CUSTOMERS_IDX1 | 1001 | 1 | 1000 |00:00:00.01 | 1020 | 3 |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("CUSTOMER_ID"=:B1)
The importatnt information is in the column Start, we see that the table customers were accessed only 1001 time, i.e. nearly only once per customer and not once per order.

where column is null taking longer time to execute

I am executing a select statement like the one below which is taking more than 6mins to execute.
select * from table where col1 is null;
whereas:
select * from table;
returns results in few seconds. The table contains 25million records. No indexes are used. there is a composite PK but not on the col used. Same query when executed on a different table with 50 million records, returns results in few seconds. only this table poses a problem.
Rebuilt the table to check if there was a miss, but still facing the same issue.
can some one help me here on why it is taking time?
datatype: VARCHAR2(40)
PLAN:
Plan hash value: 2838772322
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 794 | 60973 (16)| 00:00:03 |
|* 1 | TABLE ACCESS STORAGE FULL| table | 1 | 794 | 60973 (16)| 00:00:03 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - storage("column" IS NULL)
filter("column" IS NULL)
select * from table;
Oracle SQL Developer tool has a default setting to display only 50 records unless it was manually edited. So the entire 25 million records will not be fetched as you don't need all the records for display.
select * from table where col1 is null;
But when you filter for null values, the entire set of 25 million rows has to be scanned to apply the filter and get your 81 records satisfying that predicate. Hence your second query takes longer.

rownum / fetch first n rows

select * from Schem.Customer
where cust='20' and cust_id >= '890127'
and rownum between 1 and 2 order by cust, cust_id;
Execution time appr 2 min 10 sec
select * from Schem.Customer where cust='20'
and cust_id >= '890127'
order by cust, cust_id fetch first 2 rows only ;
Execution time appr 00.069 ms
The execution time is a huge difference but results are the same. My team is not adopting to later one. Don't ask why.
So what is the difference between Rownum and fetch first 2 rows and what should I do to improve or convince anyone to adopt.
DBMS : DB2 LUW
Although both SQL end up giving same resultset, it only happens for your data. There is a great chance that resultset would be different. Let me explain why.
I will make your SQL a little simpler to make it simple to understand:
SELECT * FROM customer
WHERE ROWNUM BETWEEN 1 AND 2;
In this SQL, you want only first and second rows. That's fine. DB2 will optimize your query and never look rows beyond 2nd. Because only first 2 rows qualify your query.
Then you add ORDER BY clause:
SELECT * FROM customer
WHERE ROWNUM BETWEEN 1 AND 2;
ORDER BY cust, cust_id;
In this case, DB2 first fetches 2 rows then order them by cust and cust_id. Then sends to client(you). So far so good. But what if you want to order by cust and cust_id first, then ask for first 2 rows? There is a great difference between them.
This is the simplified SQL for this case:
SELECT * FROM customer
ORDER BY cust, cust_id
FETCH FIRST 2 ROWS ONLY;
In this SQL, ALL rows qualify the query, so DB2 fetches all of the rows, then sorts them, then sends first 2 rows to client.
In your case, both queries give same results because first 2 rows are already ordered by cust and cust_id. But it won't work if first 2 rows would have different cust and cust_id values.
A hint about this is FETCH FIRST n ROWS comes after order by, that means DB2 orders the result then retrieves first n rows.
Excellent answer here:
https://blog.dbi-services.com/oracle-rownum-vs-rownumber-and-12c-fetch-first/
Now the index range scan is chosen, with the right cardinality estimation.
So which solution it the best one? I prefer row_number() for several reasons:
I like analytic functions. They have larger possibilities, such as setting the limit as a percentage of total number of rows for example.
11g documentation for rownum says:
The ROW_NUMBER built-in SQL function provides superior support for ordering the results of a query
12c allows the ANSI syntax ORDER BY…FETCH FIRST…ROWS ONLY which is translated to row_number() predicate
12c documentation for rownum adds:
The row_limiting_clause of the SELECT statement provides superior support
rownum has first_rows_n issues as well
PLAN_TABLE_OUTPUT
SQL_ID 49m5a3f33cmd0, child number 0
-------------------------------------
select /*+ FIRST_ROWS(10) */ * from test where contract_id=500
order by start_validity fetch first 10 rows only
Plan hash value: 1912639229
--------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | Buffers |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 10 | 15 |
|* 1 | VIEW | | 1 | 10 | 10 | 15 |
|* 2 | WINDOW NOSORT STOPKEY | | 1 | 10 | 10 | 15 |
| 3 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 10 | 11 | 15 |
|* 4 | INDEX RANGE SCAN | TEST_PK | 1 | | 11 | 4 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber" <=10)
2 - filter(ROW_NUMBER() OVER ( ORDER BY "TEST"."START_VALIDITY") <=10 )
4 - access("CONTRACT_ID"=500)

How to optimize query? Explain Plan

I have one table with 3 fields and I neeed get all value of fields, I have next query:
SELECT COM.FIELD1, COM.FIELD2, COM.FIELD3
FROM OWNER.TABLE_NAME COM
WHERE COM.FIELD1 <> V_FIELD
ORDER BY COM.FIELD3 ASC;
And i want optimaze, I have next values of explain plan:
Plan
SELECT STATEMENT CHOOSECost: 4 Bytes: 90 Cardinality: 6
2 SORT ORDER BY Cost: 4 Bytes: 90 Cardinality: 6
1 TABLE ACCESS FULL OWNER.TABLE_NAME Cost: 2 Bytes: 90 Cardinality: 6
Any solution for not get TAF(Table Acces Full)?
Thanks!
Since your WHERE condition is on the column FIELD1, an index on that column many help.
You may already have an index on that column. Even then, you will still see a full table access, if the expected number of rows that don't have VAL1 in that column is sufficiently large.
The only case when you will NOT see full table access is if you have an index on that column, the vast majority (at least, say, 80% to 90%) of rows in the table do have the value VAL1 in the column FIELD1, and statistics are up to date AND, perhaps, you need to use a histogram (because in this case the distribution of values in FIELD1 would be very skewed).
I suppose that your table has a very large number of rows with a given key (let call it 'B') and a very small number of rows with other keys.
Note, that the index access will work only for conditions FIELD1 <> 'B', all other predicates will return 'B' and therefore are not suitable for index access.
Note also that if you have more that one large key, the index access will not work from the same reason - you will never get only a few record where index can profit.
As a starting point you can reformulte the predicate
FIELD1 <> V_FIELD
as
DECODE(FIELD1,V_FIELD,1,0) = 0
The DECODE return 1 if FIELD1 = V_FIELD and returns 0 if FIELD1 <> V_FIELD
This transformation allows you to define a function based index with the DECODE expression.
Example
create table tt as
select
decode(mod(rownum,10000),1,'A','B') FIELD1
from dual connect by level <= 50000;
select field1, count(*) from tt group by field1;
FIELD1 COUNT(*)
------ ----------
A 5
B 49995
FBIndex
create index tti on tt(decode(field1,'B',1,0));
Use your large key for the index definition.
Access
To select FIELD1 <> 'B' use reformulated predicate decode(field1,'B',1,0) = 0
Which leads nicely to an index access:
EXPLAIN PLAN SET STATEMENT_ID = 'jara1' into plan_table FOR
SELECT * from tt where decode(field1,'B',1,0) = 0;
SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'jara1','ALL'));
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 471 | 2355 | 24 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TT | 471 | 2355 | 24 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TTI | 188 | | 49 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(DECODE("FIELD1",'B',1,0)=0)
To select FIELD1 <> 'A' use reformulated predicate decode(field1,'A',1,0) = 0
Here you don't want index access as nearly the whole table is returned- and the CBO opens FULL TABLE SCAN.
EXPLAIN PLAN SET STATEMENT_ID = 'jara1' into plan_table FOR
SELECT * from tt where decode(field1,'A',1,0) = 0;
SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'jara1','ALL'));
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47066 | 94132 | 26 (4)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TT | 47066 | 94132 | 26 (4)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(DECODE("FIELD1",'A',1,0)=0)
Bind Variables
This will work the same way even if you use bind variables FIELD1 <> V_FIELD - provided you pass always the same value.
The bind variable peeking will evaluate the correct plan in the first parse and generate the proper plan.
If you will use more that one values as bind variable (and therefore expect to get different plans for different values) - you will learn the feature of adaptive cursor sharing
The query is already optimized, don't spend any more time on it unless it's running noticeably slow. If you have a tuning checklist that says "avoid all full table scans" it might be time to change that checklist.
The cost of the full table scan is only 2. The exact meaning of the cost is tricky, and not always particularly helpful. But in this case it's probably safe to say that 2 means the full table scan will run quickly.
If the query is not running in less than a few microseconds, or is returning significantly more than the estimated 6 rows, then there may be a problem with the optimizer statistics. If that's the case, try gathering statistics like this:
begin
dbms_stats.gather_table_stats('OWNER', 'TABLE_NAME');
end;
/
As #symcbean pointed out, a full table scan is not always a bad thing. If a table is incredibly small, like this one might be, all the data may fit inside a single block. (Oracle accesses data by block(s)-at-a-time, where the block is usually 8KB of data.) When the data structures are trivially small there won't be any significant difference between using a table or an index.
Also, full table scans can use multi-block reads, whereas most index access paths use single-block reads. For reading a large percentage of data it's faster to read the whole thing with multi-block reads than reading it one-block-at-a-time with an index. Since this query only has a <> condition, it looks likely that this query will read a large percentage of data and a full table scan is optimal.