rownum / fetch first n rows - sql

select * from Schem.Customer
where cust='20' and cust_id >= '890127'
and rownum between 1 and 2 order by cust, cust_id;
Execution time: approx. 2 min 10 sec
select * from Schem.Customer where cust='20'
and cust_id >= '890127'
order by cust, cust_id fetch first 2 rows only ;
Execution time: approx. 0.069 ms
The execution time difference is huge but the results are the same. My team is not adopting the latter one. Don't ask why.
So what is the difference between ROWNUM and FETCH FIRST 2 ROWS ONLY, and what should I do to convince my team to adopt it?
DBMS : DB2 LUW

Although both SQL statements end up giving the same result set, that only happens with your data; there is a good chance the result sets would differ with other data. Let me explain why.
I will make your SQL a little simpler so it is easier to understand:
SELECT * FROM customer
WHERE ROWNUM BETWEEN 1 AND 2;
In this SQL, you want only the first and second rows. That's fine. DB2 will optimize your query and never look at rows beyond the 2nd, because only the first 2 rows qualify.
Then you add ORDER BY clause:
SELECT * FROM customer
WHERE ROWNUM BETWEEN 1 AND 2
ORDER BY cust, cust_id;
In this case, DB2 first fetches 2 rows and then orders them by cust and cust_id, then sends them to the client (you). So far so good. But what if you want to order by cust and cust_id first, and only then ask for the first 2 rows? There is a great difference between the two.
This is the simplified SQL for this case:
SELECT * FROM customer
ORDER BY cust, cust_id
FETCH FIRST 2 ROWS ONLY;
In this SQL, ALL rows qualify for the query, so DB2 fetches all of the rows, sorts them, and only then sends the first 2 rows to the client.
In your case, both queries give the same results only because the first 2 rows fetched happen to be the ones that sort first by cust and cust_id. That will not hold once the first 2 physical rows have different cust and cust_id values.
A hint about this: FETCH FIRST n ROWS comes after the ORDER BY, which means DB2 orders the result first and then retrieves the first n rows.
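If ROWNUM must be kept, the usual workaround is to push the ORDER BY into an inline view so that the numbering is applied after the sort. A minimal sketch, assuming the same customer table (this is the classic Oracle pattern, which DB2 LUW also accepts with Oracle compatibility enabled):
SELECT *
FROM (SELECT *
      FROM customer
      WHERE cust = '20' AND cust_id >= '890127'
      ORDER BY cust, cust_id)
WHERE ROWNUM <= 2;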

Excellent answer here:
https://blog.dbi-services.com/oracle-rownum-vs-rownumber-and-12c-fetch-first/
Now the index range scan is chosen, with the right cardinality estimation.
So which solution is the best one? I prefer row_number() for several reasons:
I like analytic functions. They offer more possibilities, such as setting the limit as a percentage of the total number of rows, for example.
11g documentation for rownum says:
The ROW_NUMBER built-in SQL function provides superior support for ordering the results of a query
12c allows the ANSI syntax ORDER BY…FETCH FIRST…ROWS ONLY, which is translated to a row_number() predicate
12c documentation for rownum adds:
The row_limiting_clause of the SELECT statement provides superior support
rownum has first_rows_n issues as well
PLAN_TABLE_OUTPUT
SQL_ID 49m5a3f33cmd0, child number 0
-------------------------------------
select /*+ FIRST_ROWS(10) */ * from test where contract_id=500
order by start_validity fetch first 10 rows only
Plan hash value: 1912639229
--------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | Buffers |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 10 | 15 |
|* 1 | VIEW | | 1 | 10 | 10 | 15 |
|* 2 | WINDOW NOSORT STOPKEY | | 1 | 10 | 10 | 15 |
| 3 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 10 | 11 | 15 |
|* 4 | INDEX RANGE SCAN | TEST_PK | 1 | | 11 | 4 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber" <=10)
2 - filter(ROW_NUMBER() OVER ( ORDER BY "TEST"."START_VALIDITY") <=10 )
4 - access("CONTRACT_ID"=500)
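For reference, the row_number() formulation the blog prefers would look like this sketch (assuming the same test table used in the plan above):
SELECT *
FROM (SELECT t.*,
             ROW_NUMBER() OVER (ORDER BY start_validity) AS rn
      FROM test t
      WHERE contract_id = 500)
WHERE rn <= 10;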


listagg produces ORA-01489 if used as window function in conditional expression

My query returns many (thousands of) rows.
Column l has a certain value for a very small number of rows (up to 10).
For each such row I want to output the aggregated comma-separated values of a very short (up to 5 chars) varchar column v over all of those rows.
For rows not having the special value of l I want to simply output the v value of that row.
A synthesized example of the same problem: from the first 10000 integers, I want to output 1,2,3,4,5,6,7,8,9 for each single-digit number, and the number itself for each multiple-digit number. (Yes, a silly example, but the real case makes sense.)
with x (v,l) as (
select to_char(level), length(to_char(level)) from dual connect by level <= 10000
)
select case l
when 1 then listagg(v,',') within group (order by v) over (partition by l)
else v
end
from x
order by 1;
The problem is that the LISTAGG function fails with ORA-01489: result of string concatenation is too long.
I am aware of the 4000-char limit of the LISTAGG function as well as the XMLAGG-based workaround. What I don't get is that the limit is big enough for the data I actually want to concatenate, even though it is not enough for all the data. In the example above, the partition of 9 single-digit numbers fits into 4000 chars; the partition of 9000 four-digit numbers does not. I expected the CASE expression to prevent evaluation of the window for unrelated rows but, for some reason, the DB engine seems to evaluate the window for all rows. (Also note that the ORDER BY clause causes the query to fail fast; without it, some rows are returned before the failure.)
Can you please explain the reasoning behind this behaviour? I suspect the window computation logically happens before the SELECT clause, but I have no evidence. Reproduced on Oracle 11g, 18c and 19 (livesql).
Well, you are using SQL, which is not procedural, so you can't expect that some parts of the code path will not be executed just because their results are not used. (So filing a bug, as others suggested, will have no success.)
Anyway, you can use the often-used trick based on the fact that LISTAGG ignores NULL values.
So this formulation works fine:
with x (v,l) as (
select to_char(level), length(to_char(level)) from dual connect by level <= 10000
)
select nvl(listagg(case when l = 1 then v end,',') within group (order by v) over (partition by l),v) lst
from x
order by 1;
giving
LST
------------------
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9
..
10
100
1000
10000
The explanation of the problem can be found in the execution plan (showing only the relevant part)
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 35 | 4 (50)| 00:00:01 |
| 1 | SORT ORDER BY | | 1 | 35 | 4 (50)| 00:00:01 |
| 2 | WINDOW SORT | | 1 | 35 | 4 (50)| 00:00:01 |
| 3 | VIEW | | 1 | 35 | 2 (0)| 00:00:01 |
|* 4 | CONNECT BY WITHOUT FILTERING| | | | | |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
...
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (#keys=1) CASE "L" WHEN 1 THEN LISTAGG("V",',') WITHIN GROUP ( ORDER BY
"V") OVER ( PARTITION BY "L") ELSE "V" END [4000]
2 - (#keys=2) "L"[NUMBER,22], "V"[VARCHAR2,40], LISTAGG("V",',') WITHIN
GROUP ( ORDER BY "V") OVER ( PARTITION BY "L")[4000]
3 - "V"[VARCHAR2,40], "L"[NUMBER,22]
4 - LEVEL[4]
So in plan line 2 the LISTAGG is calculated (for all rows), only to be filtered out in plan line 1.
It is odd that you get an error about the 4000-character limit even though no returned result is longer than 4000 characters. Maybe you could file this as a bug with Oracle Support.
Another workaround is to use the ON OVERFLOW logic of the LISTAGG function if you are on Oracle 12.2 or higher. Using LISTAGG(v, ',' ON OVERFLOW TRUNCATE) in the query allows it to run without error and does not truncate any returned values (at least in this example).
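A sketch of the original query with the overflow clause applied (12.2+ syntax; the oversized partitions are truncated internally, but those values are discarded by the CASE anyway):
with x (v,l) as (
select to_char(level), length(to_char(level)) from dual connect by level <= 10000
)
select case l
when 1 then listagg(v, ',' on overflow truncate) within group (order by v) over (partition by l)
else v
end lst
from x
order by 1;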

oracle update with correlated query

What is the correct answer? Choose two.
Examine this SQL statement:
UPDATE orders o
SET customer_name = (
SELECT cust_last_name FROM customers WHERE customer_id=o.customer_id
);
Which two are true?
A. The subquery is executed before the UPDATE statement is executed.
B. All existing rows in the ORDERS table are updated.
C. The subquery is executed for every updated row in the ORDERS
table.
D. The UPDATE statement executes successfully even if the subquery
selects multiple rows.
E. The subquery is not a correlated subquery.
I know B is correct, but I believe all the other options are incorrect.
A. Subquery executes for every row that the outer query returns, so
it should execute after the outer query.
C. NOT for every updated row, it is for every row that the outer
query returns.
D. I tried. It causes an error ORA-01427: single-row subquery returns
more than one row
E. It is a correlated subquery.
Consider option C:
C. The subquery is executed for every updated row in the ORDERS table.
You said:
NOT for every updated row, it is for every row that the outer query returns.
Yes. The subquery is indeed executed for every row in the outer query (leaving aside possible optimizations applied by the database). And every row in the outer query is updated - as you spotted, since you already, and correctly, selected option B: all existing rows in the ORDERS table are updated.
Note: your arguments against options A, D and E are valid.
The only other true answer is
F. This is a flawed design, denormalizing CUSTOMER_NAME into the ORDERS table and therefore conflicting with normal form.
Answer C could have been right back in the times of Oracle 8 (i.e. 20 years ago), but now it is definitively wrong!
Oracle introduced scalar subquery caching precisely to limit the number of executions of such subqueries.
Here is a simple demonstration.
This setup in Oracle 19.2 has 10K orders and 1K customers.
create table customers as
select rownum customer_id, 'cust_'||rownum customer_name from dual connect by level <= 1000;
create index customers_idx1 on customers (customer_id);
create table orders as
select rownum order_id, trunc(rownum/10)+1 customer_id, cast (null as varchar2(100)) customer_name
from dual connect by level <= 10000;
The update is performed on 10K rows as expected.
UPDATE /*+ gather_plan_statistics */ orders o
SET customer_name = (
SELECT customer_name FROM customers WHERE customer_id=o.customer_id
);
The gather_plan_statistics hint collects the execution statistics, which we will examine.
SQL_ID 8r610vz9fknr6, child number 0
-------------------------------------
UPDATE /*+ gather_plan_statistics */ orders o SET customer_name = (
SELECT customer_name FROM customers WHERE customer_id=o.customer_id )
Plan hash value: 3416863305
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
--------------------------------------------------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 1 | | 0 |00:00:00.18 | 60863 | 21 |
| 1 | UPDATE | ORDERS | 1 | | 0 |00:00:00.18 | 60863 | 21 |
| 2 | TABLE ACCESS FULL | ORDERS | 1 | 10000 | 10000 |00:00:00.01 | 21 | 18 |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMERS | 1001 | 1 | 1000 |00:00:00.01 | 2020 | 3 |
|* 4 | INDEX RANGE SCAN | CUSTOMERS_IDX1 | 1001 | 1 | 1000 |00:00:00.01 | 1020 | 3 |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("CUSTOMER_ID"=:B1)
The important information is in the Starts column: we see that the CUSTOMERS table was accessed only 1001 times, i.e. roughly once per customer and not once per order.
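As a side note, if the goal is to avoid per-row subquery execution altogether, the same update can be written as a join-style MERGE. A minimal sketch against the demo tables above (this works here because customer_id is unique in CUSTOMERS):
MERGE INTO orders o
USING customers c
ON (o.customer_id = c.customer_id)
WHEN MATCHED THEN
UPDATE SET o.customer_name = c.customer_name;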

How do MIN/MAX functions work in SQL?

I'm trying to understand how the MIN/MAX functions compute their values on the back end in SQL.
Let's say I have the table Duplicate below:
ID NAME
1 A
2 A
3 A
4 A
5 A
6 B
7 B
8 B
9 B
10 B
11 C
12 C
13 C
14 C
So when I run the query below:
SELECT MAX(ID), NAME FROM Duplicate
GROUP BY NAME
Does the SQL engine first find the MAX value of ID in every group and then find the MAX ID out of those grouped records? Is that correct, or does something else happen?
You'll see something like this in Oracle
SQL> set autotrace traceonly explain
SQL> select owner, max(object_id)
2 from t
3 group by owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 47235625
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 37 | 407 | 431 (2)| 00:00:01 |
| 1 | HASH GROUP BY | | 37 | 407 | 431 (2)| 00:00:01 |
| 2 | TABLE ACCESS FULL| T | 78939 | 847K| 427 (1)| 00:00:01 |
---------------------------------------------------------------------------
"group by hash". This a mechanism via which we can avoid a massive sorting cost to perform aggregation (min, max, etc etc).
Conceptually its like this:
Read the first row
Hash the GROUP BY column ("owner" in my case)
Let's say the hash value is 1234.
Store the value of "object_id" in bucket 1234.
then
Read the next row
Hash the GROUP BY column ("owner" in my case)
Let's say the hash value is 5678.
Store the value of "object_id" in bucket 5678.
then
Read the next row
Hash the GROUP BY column ("owner" in my case)
Let's say the hash value is 1234 (i.e., the same value as row 1).
Compare the object_id value with the existing object_id in bucket 1234. If it's larger, replace it; otherwise ignore it and move on.
So you can see we can identify the max value without sorting - just a single scan of all the data.
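Applied to the Duplicate table above, one pass is enough; each NAME bucket simply keeps its running maximum:
SELECT NAME, MAX(ID) AS MAX_ID
FROM Duplicate
GROUP BY NAME;
-- Expected result for the sample data: A -> 5, B -> 10, C -> 14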
I don't know what DB you're using, but for Teradata, which distributes table rows in a parallel manner, a simple aggregation with GROUP BY typically will do:
Aggregate rows (local)
Redistribute rows
Sort rows
Aggregate rows (global)
Return final result
What DBMS are you using? Can you run an EXPLAIN on your query to see what the query plan is? That would give you some idea.
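For example, a sketch of how to get a plan (EXPLAIN syntax varies by DBMS):
-- PostgreSQL / MySQL:
EXPLAIN SELECT MAX(ID), NAME FROM Duplicate GROUP BY NAME;
-- Oracle:
EXPLAIN PLAN FOR SELECT MAX(ID), NAME FROM Duplicate GROUP BY NAME;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);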

SQL script runs VERY slowly with small change

I am relatively new to SQL. I have a script that used to run very quickly (<0.5 seconds) but runs very slowly (>120 seconds) if I add one change - and I can't see why this change makes such a difference. Any help would be hugely appreciated!
This is the script, and it runs quickly if I do NOT include "tt2.bulk_cnt" in line 26:
with bulksum1 as
(
select t1.membercode,
t1.schemecode,
t1.transdate
from mina_raw2 t1
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t1.membercode,
t1.schemecode,
t1.transdate
),
bulksum2 as
(
select t1.schemecode,
t1.transdate,
count(*) as bulk_cnt
from bulksum1 t1
group by t1.schemecode,
t1.transdate
having count(*) >= 10
),
results as
(
select t1.*, tt2.bulk_cnt
from mina_raw2 t1
inner join bulksum2 tt2
on t1.schemecode = tt2.schemecode and t1.transdate = tt2.transdate
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
)
select * from results
EDIT: I apologise for not putting enough detail in here previously - although I can use basic SQL code, I am a complete novice when it comes to databases.
Database: Oracle (I'm not sure which version, sorry)
Execution plans:
QUICK query:
Plan hash value: 1712123489
---------------------------------------------
| Id | Operation | Name |
---------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN | |
| 2 | VIEW | |
| 3 | FILTER | |
| 4 | HASH GROUP BY | |
| 5 | VIEW | VM_NWVW_0 |
| 6 | HASH GROUP BY | |
| 7 | TABLE ACCESS FULL| MINA_RAW2 |
| 8 | TABLE ACCESS FULL | MINA_RAW2 |
---------------------------------------------
SLOW query:
Plan hash value: 1298175315
--------------------------------------------
| Id | Operation | Name |
--------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | FILTER | |
| 2 | HASH GROUP BY | |
| 3 | HASH JOIN | |
| 4 | VIEW | VM_NWVW_0 |
| 5 | HASH GROUP BY | |
| 6 | TABLE ACCESS FULL| MINA_RAW2 |
| 7 | TABLE ACCESS FULL | MINA_RAW2 |
--------------------------------------------
A few observations, and then some things to do:
1) More information is needed. In particular, how many rows are there in the MINA_RAW2 table, what indexes exist on this table, and when was the last time it was analyzed? To determine the answers to these questions, run:
SELECT COUNT(*) FROM MINA_RAW2;
SELECT TABLE_NAME, LAST_ANALYZED, NUM_ROWS
FROM USER_TABLES
WHERE TABLE_NAME = 'MINA_RAW2';
From looking at the plan output it looks like the database is doing two FULL SCANs on MINA_RAW2 - it would be nice if this could be reduced to no more than one, and hopefully none. It's always tough to tell without very detailed information about the data in the table, but at first blush it appears that an index on TRANSACTIONTYPE might be helpful. If such an index doesn't exist you might want to consider adding it.
2) Assuming that the statistics are out-of-date (as in, old, nonexistent, or a significant amount of data (> 10%) has been added, deleted, or updated since the last analysis) run the following:
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(ownname => 'YOUR-SCHEMA-NAME',
tabname => 'MINA_RAW2');
END;
substituting the correct schema name for "YOUR-SCHEMA-NAME" above. Remember to capitalize the schema name! If you don't know if you should or shouldn't gather statistics, err on the side of caution and do it. It shouldn't take much time.
3) Re-try your existing query after updating the table statistics. I think there's a fair chance that having up-to-date statistics in the database will solve your issues. If not:
4) This query is doing a GROUP BY on the results of a GROUP BY. This doesn't appear to be necessary as the initial GROUP BY doesn't do any grouping - instead, it appears this is being done to get the unique combinations of MEMBERCODE, SCHEMECODE, and TRANSDATE so that the count of the members by scheme and date can be determined. I think the whole query can be simplified to:
WITH cteWORKING_TRANS AS (SELECT *
FROM MINA_RAW2
WHERE TRANSACTIONTYPE IN ('RSP','SP','UNTV',
'ASTR','CN','TVIN',
'UCON','TRAS')),
cteBULKSUM AS (SELECT a.SCHEMECODE,
a.TRANSDATE,
COUNT(*) AS BULK_CNT
FROM (SELECT DISTINCT MEMBERCODE,
SCHEMECODE,
TRANSDATE
FROM cteWORKING_TRANS) a
GROUP BY a.SCHEMECODE,
a.TRANSDATE)
SELECT t.*, b.BULK_CNT
FROM cteWORKING_TRANS t
INNER JOIN cteBULKSUM b
ON b.SCHEMECODE = t.SCHEMECODE AND
b.TRANSDATE = t.TRANSDATE
I managed to remove an unnecessary subquery. This COUNT(DISTINCT ...) syntax may not work outside of PostgreSQL, or may not give the desired result, but I've certainly used it there.
select t1.*, tt2.bulk_cnt
from mina_raw2 t1
inner join (select t2.schemecode,
t2.transdate,
count(DISTINCT membercode) as bulk_cnt
from mina_raw2 t2
where t2.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t2.schemecode,
t2.transdate
having count(DISTINCT membercode) >= 10) tt2
on t1.schemecode = tt2.schemecode and t1.transdate = tt2.transdate
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
When you use WITH clauses instead of plain subqueries when you don't need to, you may be kneecapping the query optimizer.
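In Oracle specifically, one way to test whether CTE materialization is the problem is the INLINE hint (undocumented but widely used), which asks the optimizer to merge the CTE into the main query instead of materializing it; a sketch applied to the first CTE:
with bulksum1 as
(
select /*+ INLINE */ t1.membercode,
t1.schemecode,
t1.transdate
from mina_raw2 t1
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t1.membercode,
t1.schemecode,
t1.transdate
)
select * from bulksum1;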

PostgreSQL - repeating rows from LIMIT OFFSET

I noticed some repeating rows in a paginated recordset.
When I run this query:
SELECT "students".*
FROM "students"
ORDER BY "students"."status" asc
LIMIT 3 OFFSET 0
I get:
| id | name | status |
| 1 | foo | active |
| 12 | alice | active |
| 4 | bob | active |
Next query:
SELECT "students".*
FROM "students"
ORDER BY "students"."status" asc
LIMIT 3 OFFSET 3
I get:
| id | name | status |
| 1 | foo | active |
| 6 | cindy | active |
| 2 | dylan | active |
Why does "foo" appear in both queries?
Why does "foo" appear in both queries?
Because all rows that are returned have the same value for the status column. In that case the database is free to return the rows in any order it wants.
If you want a reproducible ordering you need to add a second column to your ORDER BY clause to make it consistent. E.g. the ID column:
SELECT students.*
FROM students
ORDER BY students.status asc,
students.id asc
If two rows have the same value for the status column, they will be sorted by the id.
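With that change the two pages no longer overlap; a sketch of the corrected pagination:
SELECT students.*
FROM students
ORDER BY students.status ASC, students.id ASC
LIMIT 3 OFFSET 0;

SELECT students.*
FROM students
ORDER BY students.status ASC, students.id ASC
LIMIT 3 OFFSET 3;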
For more details, from the PostgreSQL documentation (http://www.postgresql.org/docs/8.3/static/queries-limit.html):
When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows. You might be asking for the tenth through twentieth rows, but tenth through twentieth in what ordering? The ordering is unknown, unless you specified ORDER BY.
The query optimizer takes LIMIT into account when generating a query plan, so you are very likely to get different plans (yielding different row orders) depending on what you give for LIMIT and OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results of a query in any particular order unless ORDER BY is used to constrain the order.
select * from(
Select "students".*
from "students"
order by "students"."status" asc
limit 6
) as temp limit 3 offset 0;
select * from(
Select "students".*
from "students"
order by "students"."status" asc
limit 6
) as temp limit 3 offset 3;
where 6 is the total number of records under examination.