Limit number of rows from a join in Oracle SQL

I apologize in advance for my long-winded question and if the formatting isn't up to par (newbie), here goes.
I have a table MY_TABLE with the following schema -
MY_ID | TYPE | REC_COUNT
1     | A    | 1
1     | B    | 3
2     | A    | 0
2     | B    | 0
...
The first column is an ID, the second is a type, and the third is a count. NOTE that MY_ID is not the primary key; there can be many records with the same MY_ID.
I want to write a stored procedure that takes an array of IDs and returns the subset of them matching the following criterion -
the ID must match the MY_ID field of at least one record in the table, and at least one matching record must not have both TYPE = 'A' and REC_COUNT = 0 (that is, TYPE != 'A' OR REC_COUNT != 0 must hold for that record).
This is the procedure I came up with -
PROCEDURE get_id_subset(
  iIds IN ID_ARRAY,
  oMatchingIds OUT NOCOPY ID_ARRAY
)
IS
BEGIN
  SELECT t.column_value
  BULK COLLECT INTO oMatchingIds
  FROM TABLE(CAST(iIds AS ID_ARRAY)) t
  WHERE EXISTS (
    SELECT /*+ NL_SJ */ 1
    FROM MY_TABLE m
    WHERE m.my_id = t.column_value
      AND (m.type != 'A' OR m.rec_count != 0)
  );
END get_id_subset;
But I really care about performance, and some IDs could match thousands of records in the table. There is an index on the MY_ID and TYPE columns but no index on the REC_COUNT column. So I was thinking: if there are more than 1000 rows with a matching MY_ID field, I'll just return the ID without applying the TYPE and REC_COUNT predicates. Here's that version -
PROCEDURE get_id_subset(
  iIds IN ID_ARRAY,
  oMatchingIds OUT NOCOPY ID_ARRAY
)
IS
BEGIN
  SELECT t.column_value
  BULK COLLECT INTO oMatchingIds
  FROM TABLE(CAST(iIds AS ID_ARRAY)) t, MY_TABLE m
  WHERE (m.my_id = t.column_value)
    AND ( ((SELECT COUNT(m.my_id) FROM m WHERE 1) >= 1000)
          OR EXISTS (m.type != 'A' OR m.rec_count != 0) );
END get_id_subset;
But this doesn't compile; I get the following error on the inner select -
PL/SQL: ORA-00936: missing expression
Is there another way of writing this? The inner select needs to work on the joined table.
And to clarify, I'm OK with the result set being different for this query. My assumption is that since there is an index on the MY_ID column, doing a COUNT(*) would be much cheaper than actually applying the REC_COUNT predicate to tens of thousands of rows, since there is no index on that column. Am I wrong?

I don't see your second query as being much, if any, improvement over the first. At best, the first subquery has to hit 1000 matching records in order to determine whether the count is less than 1000, so I don't think it will save much work. It also changes the actual result, and it's not clear from your description whether you're saying that's OK as long as it's more efficient. (And if it is OK, then the business logic is very unclear: why do the other conditions matter at all, if they don't matter when there are lots of records?)
You ask, "will the group by be applied before or after the predicate". I'm not clear what part of the query you're talking about, but logically speaking the order is always
1. WHERE predicates
2. GROUP BY
3. HAVING predicates
The optimizer can change the order in which things are actually evaluated, but the result must always be logically equivalent to the above order of evaluation (barring optimizer bugs).
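For example, in this sketch against MY_TABLE from the question, the clauses are logically evaluated in exactly that order:
SELECT type, COUNT(*) AS cnt
FROM my_table
WHERE rec_count > 0   -- 1. WHERE filters individual rows first
GROUP BY type         -- 2. the surviving rows are then grouped
HAVING COUNT(*) > 10; -- 3. HAVING finally filters whole groups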
1000s of records is really not that much. Have you actually encountered a case where performance of the first query is unacceptable?
For either query, it may be better to rewrite the correlated EXISTS subquery as a non-correlated IN subquery. You need to test this.
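For illustration, here is an untested sketch of the IN rewrite of the first procedure's SELECT (same names as in the question; compare plans before adopting it):
SELECT t.column_value
BULK COLLECT INTO oMatchingIds
FROM TABLE(CAST(iIds AS ID_ARRAY)) t
WHERE t.column_value IN (
  SELECT m.my_id
  FROM MY_TABLE m
  WHERE m.type != 'A' OR m.rec_count != 0
);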
You need to show actual execution plans to get more useful feedback.
Edit
For the kind of short-circuiting you're talking about, I think you need to rewrite your subquery (from the initial version of the query) like this (sorry, my first attempt at this wouldn't work because I tried to access a column from the top-level table in a sub-sub-query):
WHERE EXISTS (
  SELECT /*+ NL_SJ */ 1
  FROM MY_TABLE m
  WHERE m.my_id = t.column_value
    AND rownum <= 1000
  HAVING MAX( CASE WHEN m.type != 'A' OR m.rec_count != 0 THEN 1 ELSE NULL END ) IS NOT NULL
      OR MAX(rownum) >= 1000
)
That should force it to hit no more than 1,000 records per ID, then return a row if either at least one row matches the conditions on type and rec_count, or the 1,000-record limit was reached. If you view the execution plan, you should expect to see a COUNT STOPKEY operation, which shows that Oracle will stop running a query block after a certain number of rows have been returned.
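One way to verify this (an untested sketch; it assumes the default PLAN_TABLE is available and uses :some_id as a stand-in for t.column_value) is to explain a standalone, simplified version of the subquery and look for the stopkey:
EXPLAIN PLAN FOR
SELECT 1
FROM my_table m
WHERE m.my_id = :some_id
AND rownum <= 1000;

-- Look for a COUNT STOPKEY line above the access on the MY_ID index.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);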

Related

Efficiently determine if any rows satisfy a predicate in Postgres

I'd like to query the database as to whether or not one or more rows exist that satisfy a given predicate. However, I am not interested in the distinction between there being one such row, two rows or a million - just if there are 'zero' or 'one or more'. And I do not want Postgres to waste time producing an exact count that I do not need.
In DB2, I would do it like this:
SELECT 1 FROM SYSIBM.SYSDUMMY1 WHERE EXISTS
(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
and then checking if zero rows or one row was returned from the query.
But Postgres has no dummy table available, so what is the best option?
If I create a one-row dummy table myself and use that in place of SYSIBM.SYSDUMMY1, will the query optimizer be smart enough to not actually read that table when running the query, and otherwise 'do the right thing'?
PostgreSQL doesn't have a dummy table because you don't need one.
SELECT 1 WHERE EXISTS
(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
Alternatively if you want a true/false answer:
SELECT EXISTS(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
How about just doing this?
SELECT (CASE WHEN EXISTS (SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE') THEN 1 ELSE 0 END)
1 means there is a value. 0 means no value.
This will always return one row.
If you are happy with "no row" if no row matches, you can even just:
SELECT 1 FROM real_table WHERE column = 'VALUE' LIMIT 1;
Performance is basically the same as with EXISTS. Key to performance for big tables is a matching index.
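For example (hypothetical names, since COLUMN in the question is only a placeholder):
-- A plain b-tree index on the filter column lets both EXISTS and LIMIT 1
-- stop at the first matching entry instead of scanning the table.
CREATE INDEX real_table_some_col_idx ON real_table (some_col);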

Find out if query exceeds arbitrary limit using ROWNUM?

I have a stored proc in Oracle, and we're limiting the number of records with ROWNUM based on a parameter. However, we also have a requirement to know whether the search result count exceeded the arbitrary limit (even though we're only passing data up to the limit; searches can return a lot of data, and if the limit is exceeded a user may want to refine their query.)
The limit's working well, but I'm attempting to pass an OUT value as a flag to signal when the maximum results were exceeded. My idea for this was to get the count of the inner table and compare it to the count of the outer select query (with ROWNUM) but I'm not sure how I can get that into a variable. Has anyone done this before? Is there any way that I can do this without selecting everything twice?
Thank you.
EDIT: For the moment, I am actually doing two identical selects - one for the count only, selected into my variable, and one for the actual records. I then pass back the comparison of the base result count to my max limit parameter. This means two selects, which isn't ideal. Still looking for an answer here.
You can add a column to the query:
select * from (
  select . . . , count(*) over () as numrows
  from . . .
  where . . .
) where rownum <= 1000;
And then report numrows as the size of the final result set.
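In PL/SQL you can capture both the rows and the flag in one pass. An untested sketch (my_table, its id column, and the limit of 1000 are stand-ins for the real query and parameter):
DECLARE
  v_limit CONSTANT PLS_INTEGER := 1000;
  v_ids SYS.ODCINUMBERLIST;
  v_counts SYS.ODCINUMBERLIST;
  v_exceeded BOOLEAN;
BEGIN
  SELECT id, numrows
  BULK COLLECT INTO v_ids, v_counts
  FROM (SELECT id, COUNT(*) OVER () AS numrows
        FROM my_table)
  WHERE rownum <= v_limit;
  -- numrows is the same on every row, so the first element is the total.
  v_exceeded := v_ids.COUNT > 0 AND v_counts(1) > v_limit;
END;
Note that the analytic count still forces Oracle to visit every matching row to compute numrows, even though only the first v_limit rows are returned.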
You could use a nested subquery:
select id, case when max_count > 3 then 'Exceeded' else 'OK' end as flag
from (
  select id, rn, max(rn) over () as max_count
  from (
    select id, rownum as rn
    from t
  )
  where rownum <= 4
)
where rownum <= 3;
The inner level is your actual query (which in reality probably has filters and an order-by clause). The middle layer restricts the result to your actual limit + 1, which still allows Oracle to optimise using a stop key, and uses an analytic count over that inner result set to see if you got a fourth record (without requiring a scan of all matching records). And the outer layer restricts the result to your original limit.
With a sample table with 10 rows, this gets:
ID FLAG
---------- --------
1 Exceeded
2 Exceeded
3 Exceeded
If the inner query had a filter that returned fewer rows, say:
select id, rownum as rn
from t
where id < 4
it would get:
ID FLAG
---------- --------
1 OK
2 OK
3 OK
Of course for this demo I haven't done any ordering so you would get indeterminate results. And from your description you would use your variable instead of 3, and (your variable + 1) instead of 4.
In my application I take a very simple approach: I do the normal SELECT, and when the number of returned rows equals the limit, the client application shows a "limit reached" message, because it is very likely that the query would return more rows if the result were not limited.
Of course, when the number of rows is exactly the limit, this indication is wrong. However, in my application the limit is set mainly for performance reasons by the end user; a typical limit is "1000 rows" or "10000 rows", for example.
In my case this solution is fully sufficient - and it is simple.
Update:
Are you aware of the row_limiting_clause? It was introduced in Oracle 12.1
For example this query
SELECT employee_id, last_name
FROM employees
ORDER BY employee_id
OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY;
will return rows 6 through 15 of the entire result set. It may support you in finding a solution.
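For the exceeded-flag requirement, one untested way to use it is to fetch one row more than the limit and treat the extra row purely as a signal:
-- With a limit of 10: if 11 rows come back, the limit was exceeded,
-- and the caller discards the 11th row.
SELECT employee_id, last_name
FROM employees
ORDER BY employee_id
FETCH FIRST 11 ROWS ONLY;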
Another idea is this one:
SELECT employee_id, last_name
FROM employees
UNION ALL
SELECT NULL, NULL FROM dual
ORDER BY employee_id NULLS LAST
When you get the row where employee_id IS NULL, you know you have reached the end of your result set and no further records will arrive.
Select the whole thing, then select the count and the data, restricting the number of rows.
with base as
(
  select c1, c2, c3
  from table
  where condition
)
select (select count(*) from base), c1, c2, c3
from base
where rownum < 100

Query using Rownum and order by clause does not use the index

I am using Oracle (Enterprise Edition 10g) and I have a query like this:
SELECT * FROM (
SELECT * FROM MyTable
ORDER BY MyColumn
) WHERE rownum <= 10;
MyColumn is indexed; however, Oracle is for some reason doing a full table scan before it cuts off the first 10 rows. So, for a table with 4 million records, the above takes around 15 seconds.
Now consider this equivalent query:
SELECT MyTable.*
FROM (
  SELECT rid
  FROM (
    SELECT rowid AS rid
    FROM MyTable
    ORDER BY MyColumn
  )
  WHERE rownum <= 10
)
INNER JOIN MyTable ON MyTable.rowid = rid
ORDER BY MyColumn;
Here Oracle scans the index and finds the top 10 rowids, and then uses nested loops to fetch the 10 records by rowid. This takes less than a second on a 4-million-row table.
My first question is: why does the optimizer make such an apparently bad decision for the first query above?
And my second and most important question is: is it possible to make the first query perform better? I have a specific need to use the first query as unmodified as possible. I am looking for something simpler than my second query above. Thank you!
Please note that for particular reasons I am unable to use the /*+ FIRST_ROWS(n) */ hint, or the ROW_NUMBER() OVER (ORDER BY column) construct.
If this is acceptable in your case, adding a WHERE ... IS NOT NULL clause will help the optimizer to use the index instead of doing a full table scan when using an ORDER BY clause:
SELECT * FROM (
SELECT * FROM MyTable
WHERE MyColumn IS NOT NULL
-- ^^^^^^^^^^^^^^^^^^^^
ORDER BY MyColumn
) WHERE rownum <= 10;
The rationale is that Oracle does not store NULL values in the index. As your query was originally written, the optimizer decided to do a full table scan, because if there were fewer than 10 non-NULL values it would have to retrieve some "NULL rows" to fill in the remaining rows. Apparently it is not smart enough to first check whether the index contains enough rows...
With the added WHERE MyColumn IS NOT NULL, you inform the optimizer that you do not want, under any circumstances, any row having NULL in MyColumn. So it can blindly use the index without worrying about hypothetical rows having NULL in MyColumn.
For the same reason, declaring the ORDER BY column as NOT NULL should prevent the optimizer from doing a full table scan. So, if you can change the schema, a cleaner option would be:
ALTER TABLE MyTable MODIFY (MyColumn NOT NULL);
See http://sqlfiddle.com/#!4/e3616/1 for various comparisons (click on view execution plan)

Can someone explain this query

here is the query
SELECT * FROM customers
WHERE NOT EXISTS (
  SELECT 1 FROM brochure_requests
  WHERE brochure_requests.first_name = customers.customer_first_name
    AND brochure_requests.last_name = customers.customer_last_name
)
This query works just fine, but I am not sure why it works. In the NOT EXISTS part there is a SELECT 1; what is the 1 for? When I ran this query
select 1 from test2
Here were the results:
1
-----
1
1
1
1
1
1
1
1
1
1
1
..
How does the not exists query work?
The compiler is smart enough to ignore the actual SELECT list in an EXISTS. Basically, all it cares about is whether rows WOULD be returned because the filters match; the SELECT portion of the EXISTS never executes. It only uses the EXISTS clause for evaluation purposes.
I had this misconception for quite some time, since you see this SELECT 1 a lot. But I have also seen 42, *, etc. It never actually cares about the result, only that there would be one :). The key thing to keep in mind is that SQL is a compiled language, so it will optimize this appropriately.
You could put 1/0 there and it would not throw a divide-by-zero exception, further proving that the select list is not evaluated. This is shown in this SQLFiddle.
Code from Fiddle:
CREATE TABLE test (i int)
CREATE TABLE test2 (i int)
INSERT INTO test VALUES (1)
INSERT INTO test2 VALUES (1)
SELECT i
FROM test
WHERE EXISTS (
  SELECT 1/0
  FROM test2
  WHERE test2.i = test.i
)
And finally, more to your point: the NOT simply negates the EXISTS, saying to IGNORE any rows that match.
The subquery is a correlated subquery joining between the customers and brochure_requests tables on the selected fields.
The EXISTS clause is simply a predicate that will only return the matching rows (and the NOT negates that).
The query:
select 1 from test2
shows the value 1 for every record in the test2 table.
Every SELECT query must return at least one column. I think that's why an unnamed column, which has the value 1, is used here.
The sub-query finds, for each customer, the related rows in the brochure_requests table.
NOT EXISTS causes the main query to return all the rows from the Customers table that have no match in brochure_requests.
The relational operator in question is known as 'antijoin' (alternatively 'not match' or 'semi difference'). In natural language: customers who do not match brochure_requests using the common attributes first_name and last_name.
A closely related operator is relational difference (alternatively 'minus' or 'except') e.g. in SQL
SELECT customer_last_name, customer_first_name
FROM customers
EXCEPT
SELECT last_name, first_name
FROM brochure_requests;
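In Oracle the same operator is spelled MINUS:
SELECT customer_last_name, customer_first_name
FROM customers
MINUS
SELECT last_name, first_name
FROM brochure_requests;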
If a customer requested a brochure, the subquery returns 1 for that customer, and because of the NOT EXISTS clause that customer is not added to the returned result set.
Note: I don't know Oracle, and am not especially expert in SQL.
However, SELECT 1 FROM simply returns a 1 for every row matching the FROM clause. So if the inner select can find a brochure_requests row whose name fields match those of the customer row currently being considered, it will produce a 1 result and fail the NOT EXISTS.
Hence the query selects all customers who do not have a brochure_request matching their name.
For each row of the table Customers, the query returns the row when the sub-query in NOT EXISTS returns no rows.
If the sub-query in NOT EXISTS returns rows, then the corresponding row of the table Customers is not returned.

Assistance with SQL statement

I'm using SQL Server 2005 and ASP.NET with C#.
I have a Users table with
userId(int),
userGender(tinyint),
userAge(tinyint),
userCity(tinyint)
(simplified version of course)
I need to always select two users matching the userID I pass to the query: users of the opposite gender, in an age range of -5 to +10 years, and from the same city.
The important fact is that it must always be two, so I created a condition: if @@ROWCOUNT < 2, re-select without the age and city filters.
Now the problem is that I sometimes get two result sets back, because I first use @@ROWCOUNT on a table when I run the query.
Will it be a problem to use the DataReader object to always read from the second result set? Is there any other way to check how many rows were selected without performing a select that returns results?
Can you simplify it by using SELECT TOP 2?
Update: I would perform both selects all the time, union the results, and then select from them based on an order (using SELECT TOP 2), as the union may have added more than two. It's important that this next select takes the rows in order of importance, i.e. it prefers rows from your first select.
Alternatively, have the reader logic read the next result-set if there is one and leave the SQL alone.
To avoid getting two separate result sets, you can do your first SELECT into a table variable and then do your @@ROWCOUNT check. If it is >= 2, just select from the table variable on its own; otherwise select the results of the table variable UNION ALLed with the results of the second query.
Edit: There is a slight overhead to using table variables, so you'd need to weigh whether this is cheaper than Adam's suggestion of just performing the UNION as a matter of routine, by looking at the execution stats for both approaches:
SET STATISTICS IO ON
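An untested sketch of that table-variable approach, using the Users columns from the question (@userId stands in for the passed userID):
DECLARE @matches TABLE (userId INT, userGender TINYINT, userAge TINYINT, userCity TINYINT);

-- Strict pass: opposite gender, age range -5..+10, same city.
INSERT INTO @matches
SELECT TOP 2 u2.userId, u2.userGender, u2.userAge, u2.userCity
FROM Users u1
JOIN Users u2
  ON u2.userGender <> u1.userGender
 AND u2.userAge BETWEEN u1.userAge - 5 AND u1.userAge + 10
 AND u2.userCity = u1.userCity
WHERE u1.userId = @userId;

IF @@ROWCOUNT >= 2
  SELECT userId, userGender, userAge, userCity FROM @matches;
ELSE
  -- Top up with relaxed matches; prio makes the strict rows win.
  -- (A strict match can appear under both priorities; dedupe further if that matters.)
  SELECT TOP 2 userId, userGender, userAge, userCity
  FROM (SELECT 1 AS prio, userId, userGender, userAge, userCity FROM @matches
        UNION
        SELECT 2 AS prio, u2.userId, u2.userGender, u2.userAge, u2.userCity
        FROM Users u1
        JOIN Users u2 ON u2.userGender <> u1.userGender
        WHERE u1.userId = @userId) AS x
  ORDER BY prio;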
Would something along the following lines be of use...
SELECT TOP 2 *
FROM (SELECT 1 AS prio, M2.userId, M2.userGender, M2.userAge, M2.userCity
      FROM my_table M1
      JOIN my_table M2
        ON M2.userGender <> M1.userGender
       AND M2.userAge BETWEEN M1.userAge - 5 AND M1.userAge + 10
       AND M2.userCity = M1.userCity
      WHERE M1.userId = @supplied_user_id
      UNION
      SELECT 2 AS prio, M2.userId, M2.userGender, M2.userAge, M2.userCity
      FROM my_table M1
      JOIN my_table M2
        ON M2.userGender <> M1.userGender
      WHERE M1.userId = @supplied_user_id) AS candidates
ORDER BY prio;
I haven't tried it as I have no SQL Server and there may be dialect issues.