Is there a way to get "ordered" resultset from oracle table without actually using an "ORDER BY" clause?
I am working on an application that reads data from oracle table (which has no unique column) and I want to introduce some sort of resume mechanism so that in case of query failure (e.g. network error during fetch) we avoid reading rows that are fetched already.
The application is developed using oracle OCI and currently simple select queries are used.
Is there any efficient mechanism to achieve this?
Under some very special conditions, results come back in a defined order even without an ORDER BY clause. However, you should not rely on that; Oracle may change this behaviour at any time.
Maybe you can count the total number of rows (read SQL%ROWCOUNT after execution of the query) and compare this number with the records received on your client.
As Wernfried pointed out, there is no reliable way to get ordered results without any ORDER BY. But the question assumes that ORDER BY is impossible because there are no unique columns. There are at least two workarounds to this.
1. ROWID. Every Oracle row has a unique pseudo-column, ROWID. The application could ORDER BY ROWID, store the latest ROWID, and then use WHERE ROWID > :rowid to pick up where it left off. Note that ROWIDs can change if the table is modified or moved.
2. ROW_NUMBER. Another option is to sort all the data and keep track of the duplicates. If two rows are exactly the same then it does not matter which of the duplicates were returned and processed. The query and application only need to track how many of them have been processed. Then it can later process the rest.
drop table test1;
create table test1(a number);
insert into test1 values(1);
insert into test1 values(1);
insert into test1 values(2);
commit;
select a ,row_number() over (order by a /*and all other columns*/) rowNumber
from test1
order by rowNumber
A ROWNUMBER
1 1 --Am I the real #1? It doesn't matter.
1 2
2 3
If there was a failure after the first row, adding the predicate where rownumber > :last_rownumber_processed will get the rest of the rows. The second query may return the "first" 1 instead of the "second" 1, but the application won't care. As with the first workaround, this will fail if the data changes between runs.
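A minimal sketch of this resume logic, using Python's built-in sqlite3 module as a stand-in for Oracle (an assumption made only so the example is runnable; window functions need SQLite 3.25 or later). The helper name fetch_after is hypothetical; the table and the :last_rownumber_processed idea come from the answer above.

```python
import sqlite3

def fetch_after(conn, last_rownumber_processed):
    """Return (a, rownumber) pairs numbered after the last processed row."""
    sql = """
        SELECT a, rownumber
        FROM (
            SELECT a, ROW_NUMBER() OVER (ORDER BY a) AS rownumber
            FROM test1
        )
        WHERE rownumber > ?
        ORDER BY rownumber
    """
    return conn.execute(sql, (last_rownumber_processed,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1(a INTEGER)")
conn.executemany("INSERT INTO test1 VALUES (?)", [(1,), (1,), (2,)])

first_attempt = fetch_after(conn, 0)  # the complete result set
resumed = fetch_after(conn, 1)        # pretend the fetch failed after row 1
```

The row number has to be computed in an inner query and filtered in an outer one, because a window function cannot be referenced directly in the WHERE clause of the same query block.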
Either way, the query must pay for sorting:
----------------------------
| Id | Operation |
----------------------------
| 0 | SELECT STATEMENT |
| 1 | WINDOW SORT |
| 2 | TABLE ACCESS FULL|
----------------------------
case 1:
In order to achieve a simple resume mechanism the only simple way is to use RowID.
select t?.rowid, t1.*, t2.*, t3.*
from table1 t1,
table2 t2,
table3 t3
where t1.? = t2.?
and t2.? = t3.?
and t?.rowid > :rowidProcessedPriorNetworkFailure
order by t?.rowid
The reason for t?.rowid is that you have to choose the leaf table.
In the following case you should choose t3 as t?:
T2 - T1 : one T2 record may have one or more T1 records
T1 - T3 : one T1 record may have one or more T3 records
But keep in mind that ROWIDs will change whenever Oracle maintains the underlying physical structure (i.e. defragmentation, reorganization, or moving the table to a new datafile).
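The single-table version of this idea can be sketched as follows. It uses SQLite's rowid through Python's sqlite3 module as a stand-in for Oracle's ROWID (an assumption; the two are not the same thing internally), and the table t, its payload column and the fail_after parameter are hypothetical names used only to simulate a network failure.

```python
import sqlite3

def read_with_resume(conn, last_rowid=0, fail_after=None):
    """Fetch rows whose rowid is greater than last_rowid, in rowid order.

    Stops early after fail_after rows to simulate a failed fetch.
    Returns the payloads read and the last rowid that was processed.
    """
    cursor = conn.execute(
        "SELECT rowid, payload FROM t WHERE rowid > ? ORDER BY rowid",
        (last_rowid,),
    )
    fetched = []
    for count, (rowid, payload) in enumerate(cursor, start=1):
        fetched.append(payload)
        last_rowid = rowid
        if fail_after is not None and count == fail_after:
            break
    return fetched, last_rowid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (payload TEXT)")  # no unique column
conn.executemany("INSERT INTO t VALUES (?)", [("x",), ("x",), ("y",), ("z",)])

before_failure, checkpoint = read_with_resume(conn, fail_after=2)
after_resume, _ = read_with_resume(conn, last_rowid=checkpoint)
```

The client only has to remember one value (the checkpoint), and the caveat above still applies: the approach is only safe while the row identifiers stay stable between the two runs.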
case 2:
if it is a join, select the leaf table.
Select the column that has the most distinct values and order by that column. Whenever you have to resume, reverse the operations done with the last value on hand.
Hope this will help you.
I have used the following query to get ordered output without using "order by" words. Here, I want to sort empno without using order by clause.
Query:
Select empno, sal, deptno, max(sal) over(partition by empno) from emp;
However you can exclude any columns except empno and max() analytical function.
Let me know if you have any questions.
I have a table with two columns namely ID and KEY (let key here be an integer) such as
ID KEY
ABC 6
DEF 1
GHI 12
TASK: Get the ID of the MAX key
Solution 1:
Select Top(1) ID
from TABLE
order by KEY desc
Solution 2:
Select ID
from TABLE
where ID = MAX(ID)
EDIT: The query was invalid. This is what I meant:
Select ID
from TABLE
where KEY = (select max(KEY) from TABLE)
Is one of these solutions categorically better than the other? What are the advantages/disadvantages of each solution?
EDIT:
Assume there is no index.
Case 1 - large table
Case 2 - small table
Background:
I am doing code review and I have found both solutions multiple times in different context - sometimes with indices, sometimes without, sometimes for large tables, sometimes for small.
The two queries are different (after your edits fixing the second one).
The first necessarily returns a single row.
The second returns all matching rows.
The first returns a row even when key is NULL.
The second does not.
You should use the logic that does what you want.
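A small sketch of both differences, run against SQLite through Python's sqlite3 module (an assumption: LIMIT 1 stands in for TOP(1), and the table names t and only_nulls are hypothetical). The extra JKL row is not in the question; it is added to create a tie on the maximum key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t (ID TEXT, "KEY" INTEGER)')
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("ABC", 6), ("DEF", 1), ("GHI", 12), ("JKL", 12)],  # JKL ties with GHI
)

# Solution 1: always exactly one row, even when the maximum is tied.
top_one = conn.execute(
    'SELECT ID FROM t ORDER BY "KEY" DESC LIMIT 1'
).fetchall()

# Solution 2: every row that carries the maximum key.
all_max = conn.execute(
    'SELECT ID FROM t WHERE "KEY" = (SELECT MAX("KEY") FROM t) ORDER BY ID'
).fetchall()

# A table whose only key is NULL.
conn.execute('CREATE TABLE only_nulls (ID TEXT, "KEY" INTEGER)')
conn.execute("INSERT INTO only_nulls VALUES ('XYZ', NULL)")

top_one_null = conn.execute(
    'SELECT ID FROM only_nulls ORDER BY "KEY" DESC LIMIT 1'
).fetchall()
all_max_null = conn.execute(
    'SELECT ID FROM only_nulls '
    'WHERE "KEY" = (SELECT MAX("KEY") FROM only_nulls)'
).fetchall()
```

With the tie, which of GHI or JKL the first query returns is not defined; the second returns both. With only NULL keys, the first query still returns XYZ while the second returns nothing, because a comparison with NULL is never true.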
An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list.
Solution 1 will be the best. A subquery in a where clause will be less optimal.
There really are lots of design techniques to look at for performance which I am not going to go into with this answer. I found this article yesterday which gave me more perspective https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/
In Solution 1, the order by clause will just sort your query result.
Query execution order:
FROM clause, ON clause, OUTER clause, WHERE clause, GROUP BY clause, HAVING clause, SELECT clause, DISTINCT clause, ORDER BY clause, TOP clause
You can use the following query:
Select ID
from (Select ID,
RANK() OVER (ORDER BY KEY DESC) AS KeyRank
from table1) ranked
where KeyRank = 1
Solution 1 will work, but Solution 2 (as originally posted) will throw an exception like the one below:
Msg 147, Level 15, State 1, Line 22 An aggregate may not appear in the
WHERE clause unless it is in a subquery contained in a HAVING clause
or a select list, and the column being aggregated is an outer
reference.
You can go with query 1.
You cannot use query 2, because an aggregate function cannot be used like that. If you want to use a WHERE clause together with an aggregate function in your query, you have to write it as below:
Select id from test where key in (select max(key) from test);
For reference, using only an aggregate function and a HAVING clause:
Select ID ,max(key)
from test
group by ID,key
having (key) >= 12
order by 1
I want to find duplicate rows from one of the Hive table for which I was given two approaches.
The first approach is to use the following two queries:
select count(*) from mytable; -- this will give total row count
The second query, below, gives the count of distinct rows:
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my table total row count derived using first query is 3500 and second query gives row count 2700. So it tells us that 3500 - 2700 = 800 rows are duplicate. But this query doesn't tell which rows are duplicated.
My second approach to find duplicate is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
The above query should list the rows that are duplicated and how many times each of them is duplicated, but it returns zero rows, which would mean there are no duplicate rows in that table.
So I would like to know:
If my first approach is correct - if yes then how do I find which rows are duplicated
Why second approach is not providing list of rows which are duplicated?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to
ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get list of duplicated rows.
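A minimal sketch of the group-by-every-column query, run on SQLite via Python's sqlite3 module rather than on Hive (an assumption made for runnability). The payload column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (primary_key1 INTEGER, primary_key2 INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 1, "a"), (1, 2, "b"), (2, 1, "c"), (2, 1, "c"), (2, 1, "c")],
)

# Group by every column; any group with more than one member is a duplicate.
duplicates = conn.execute(
    """
    SELECT primary_key1, primary_key2, payload, COUNT(*)
    FROM mytable
    GROUP BY primary_key1, primary_key2, payload
    HAVING COUNT(*) > 1
    ORDER BY primary_key1, primary_key2
    """
).fetchall()

total_rows = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM mytable)"
).fetchone()[0]

# Rows beyond the first copy in each duplicated group.
surplus_rows = sum(count - 1 for *_, count in duplicates)
```

Here the surplus (3) equals total_rows minus distinct_rows (6 - 3), which is the same arithmetic as the 3500 - 2700 = 800 figure in the question.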
The analytic window function row_number() is quite useful and can provide the duplicates based upon the elements specified in the partition by clause. A simple in-line view and exists clause will then pinpoint which sets of records from the original table contain these duplicates. (In some databases, like TD, you can forgo the inline view by using a QUALIFY clause.)
SQL1 & SQL2 can be combined. SQL2: if you want to deal with NULLs rather than simply dismiss them, then a coalesce and concatenation might be better, as in:
SELECT count(1) , count(distinct coalesce(keypart1 ,'') + coalesce(keypart2 ,'') )
FROM srcTable s
3) This finds all records in each duplicated group, not just the seq > 1 records. It provides all context data as well as the keys, so it can be useful when analyzing why you have dups and not just which keys are affected.
select * from srcTable s
where exists
( select 1 from (
SELECT
keypart1,
keypart2,
row_number() over( partition by keypart1, keypart2 ) seq
FROM srcTable t
WHERE 1 = 1
-- (whatever additional filtering you want)
) t
where seq > 1
AND t.keypart1 = s.keypart1
AND t.keypart2 = s.keypart2
)
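The exists-plus-row_number() pattern can be tried out with the sketch below, which runs on SQLite (3.25 or later, for window functions) through Python's sqlite3 module instead of Hive or Teradata; that substitution, the note column and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE srcTable (keypart1 TEXT, keypart2 TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO srcTable VALUES (?, ?, ?)",
    [("a", "x", "first"), ("a", "x", "second"), ("b", "y", "only")],
)

# Return every full row whose key occurs more than once,
# including the first copy (seq = 1) of each duplicated key.
rows_in_duplicate_groups = conn.execute(
    """
    SELECT s.keypart1, s.keypart2, s.note
    FROM srcTable s
    WHERE EXISTS (
        SELECT 1
        FROM (
            SELECT keypart1, keypart2,
                   ROW_NUMBER() OVER (PARTITION BY keypart1, keypart2) AS seq
            FROM srcTable
        ) t
        WHERE t.seq > 1
          AND t.keypart1 = s.keypart1
          AND t.keypart2 = s.keypart2
    )
    ORDER BY s.note
    """
).fetchall()
```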
Suppose you want to get duplicate rows based on a particular column, ID here. The query below will give you all the IDs which are duplicated in the Hive table.
SELECT ID
FROM TABLE
GROUP BY ID
HAVING count(ID) > 1
I have a table with n records.
How can I retrieve the nth record and the (n-1)th record from my table in SQL without using a derived table?
I have tried using ROWID as
select * from table where rowid in (select max(rowid) from table);
It gives the nth record, but I want the (n-1)th record as well.
Is there any method other than using max, a derived table, or pseudo-columns?
Thanks
You cannot depend on rowid to get you to the last row in the table. You need an auto-incrementing id or creation time to have the proper ordering.
You can use, for instance:
select *
from (select t.*, row_number() over (order by <id> desc) as seqnum
from t
) t
where seqnum <= 2
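A runnable sketch of the row_number() query, assuming an explicit id column supplies the ordering. It uses SQLite (3.25 or later) via Python's sqlite3 module as a stand-in for Oracle, and the name column and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
# Insert out of order on purpose: physical order must not matter.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(3, "third"), (1, "first"), (4, "fourth"), (2, "second")],
)

last_two = conn.execute(
    """
    SELECT id, name
    FROM (
        SELECT id, name, ROW_NUMBER() OVER (ORDER BY id DESC) AS seqnum
        FROM t
    )
    WHERE seqnum <= 2
    ORDER BY seqnum
    """
).fetchall()
```

The result is the nth row followed by the (n-1)th row, regardless of the order in which the rows were inserted.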
Although allowed in the syntax, the order by clause in a subquery is ignored (for instance http://docs.oracle.com/javadb/10.8.2.2/ref/rrefsqlj13658.html).
Just to be clear, rowids have nothing to do with the ordering of rows in a table. The Oracle documentation is quite clear that they specify a physical access path for the data (http://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i6732). It is true that in an empty database, inserting records into a new table will probably create a monotonically increasing sequence of row ids. But you cannot depend on this. The only guarantees with rowids are that they are unique within a table and are the fastest way to access a particular row.
I have to admit that I cannot find good documentation on Oracle handling or not handling order by's in subqueries in its most recent versions. ANSI SQL does not require compliant databases to support order by in subqueries. Oracle syntax allows it, and it seems to work in some cases, at least. My best guess is that it would probably work on a single processor, single threaded instance of Oracle, or if the data access is through an index. Once parallelism is introduced, the results would probably not be ordered. Since I started using Oracle (in the mid-1990s), I have been under the impression that order bys in subqueries are generally ignored. My advice would be to not depend on the functionality, until Oracle clearly states that it is supported.
select * from (select * from my_table order by rowid) where rownum <= 2
and for rows between N and M:
select * from (
select t.*, rownum rn from (
select * from my_table order by rowid
) t where rownum <= M
) where rn >= N
Try this
select top 2 * from table order by rowid desc
Assuming rowid as column in your table:
SELECT * FROM table ORDER BY rowid DESC LIMIT 2
I apologize in advance for my long-winded question and if the formatting isn't up to par (newbie), here goes.
I have a table MY_TABLE with the following schema -
MY_ID | TYPE | REC_COUNT
1 | A | 1
1 | B | 3
2 | A | 0
2 | B | 0
....
The first column corresponds to an ID, the second is some type and 3rd some count. NOTE that the MY_ID column is not the primary key, there could be many records having the same MY_ID.
I want to write a stored procedure which will take an array of IDs and return the subset of them that match the following criteria -
the ID should match the MY_ID field of at least 1 record in the table and at least 1 matching record should not have TYPE = A OR REC_COUNT = 0.
This is the procedure I came up with -
PROCEDURE get_id_subset(
iIds IN ID_ARRAY,
oMatchingIds OUT NOCOPY ID_ARRAY
)
IS
BEGIN
SELECT t.column_value
BULK COLLECT INTO oMatchingIds
FROM TABLE(CAST(iIds AS ID_ARRAY)) t
WHERE EXISTS (
SELECT /*+ NL_SJ */ 1
FROM MY_TABLE m
WHERE (m.my_id = t.column_value)
AND (m.type != 'A' OR m.rec_count != 0)
);
END get_id_subset;
But I really care about performance and some IDs could match 1000s of records in the table. There is an index on the MY_ID and TYPE column but no index on the REC_COUNT column. So I was thinking if there are more than 1000 rows that have a matching MY_ID field then I'll just return the ID without applying the TYPE and REC_COUNT predicates. Here's this version -
PROCEDURE get_id_subset(
iIds IN ID_ARRAY,
oMatchingIds OUT NOCOPY ID_ARRAY
)
IS
BEGIN
SELECT t.column_value
BULK COLLECT INTO oMatchingIds
FROM TABLE(CAST(iIds AS ID_ARRAY)) t, MY_TABLE m
WHERE (m.my_id = t.column_value)
AND ( ((SELECT COUNT(m.my_id) FROM m WHERE 1) >= 1000)
OR EXISTS (m.type != 'F' OR m.rec_count != 0)
);
END get_id_subset;
But this doesn't compile, I get the following error on the inner select -
PL/SQL: ORA-00936: missing expression
Is there another way of writing this? The inner select needs to work on the joined table.
And to clarify, I'm OK with the result set being different for this query. My assumption is that since there is an index on the my_id column, doing count(*) would be much cheaper than actually applying the rec_count predicate to 10000s of rows since there is no index on that column. Am I wrong?
I don't see your second query as being much if any improvement over the first. At best, the first subquery has to hit 1000 matching records in order to determine if the count is less than 1000, so I don't think it will save lots of work. Also it changes the actual result, and it's not clear from your description if you're saying that's OK as long as it's more efficient. (And if it is OK, then the business logic is very unclear -- why do the other conditions matter at all, if they don't matter when there's lots of records?)
You ask, "will the group by be applied before or after the predicate". I'm not clear what part of the query you're talking about, but logically speaking the order is always
Where predicates
Group By
Having predicates
The optimizer can change the order in which things are actually evaluated, but the result must always be logically equivalent to the above order of evaluation (barring optimizer bugs).
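The logical order can be observed with a small sketch on SQLite through Python's sqlite3 module (an assumption; the same logical order applies in Oracle). The sample rows extend the table from the question with one hypothetical extra row for id 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_id INTEGER, type TEXT, rec_count INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, "A", 1), (1, "B", 3), (2, "A", 0), (2, "B", 0), (3, "B", 5)],
)

# WHERE removes rows first, then GROUP BY counts what is left,
# then HAVING filters the groups.
with_where = conn.execute(
    """
    SELECT my_id, COUNT(*)
    FROM my_table
    WHERE rec_count != 0
    GROUP BY my_id
    HAVING COUNT(*) >= 2
    ORDER BY my_id
    """
).fetchall()

# Without the WHERE predicate, id 2 keeps both of its rows and passes HAVING.
without_where = conn.execute(
    """
    SELECT my_id, COUNT(*)
    FROM my_table
    GROUP BY my_id
    HAVING COUNT(*) >= 2
    ORDER BY my_id
    """
).fetchall()
```

Id 2 disappears from the first result because both of its rows were removed by the WHERE predicate before any counting took place.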
1000s of records is really not that much. Have you actually encountered a case where performance of the first query is unacceptable?
For either query, it may be better to rewrite the correlated EXISTS subquery as a non-correlated IN subquery. You need to test this.
You need to show actual execution plans to get more useful feedback.
Edit
For the kind of short-circuiting you're talking about, I think you need to rewrite your subquery (from the initial version of the query) like this (sorry, my first attempt at this wouldn't work because I tried to access a column from the top-level table in a sub-sub-query):
WHERE EXISTS (
SELECT /*+ NL_SJ */ 1
FROM MY_TABLE m
WHERE (m.my_id = t.column_value)
AND rownum <= 1000
HAVING MAX( CASE WHEN m.type != 'A' OR m.rec_count != 0 THEN 1 ELSE NULL END ) IS NOT NULL
OR MAX(rownum) >= 1000
)
That should force it to hit no more than 1,000 records per id, then return a row if either at least one row matches the conditions on type and rec_count, or the 1,000-record limit was reached. If you view the execution plan, you should expect to see a COUNT STOPKEY operation, which shows that Oracle is going to stop running a query block after a certain number of rows are returned.
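The capped-scan rule can be sketched outside Oracle as follows. This is not the PL/SQL above: it is a Python and SQLite approximation in which LIMIT plays the role of rownum <= 1000, the final check is done in Python instead of a HAVING clause, and the function name matching_ids, the cap parameter and the sample data are all hypothetical.

```python
import sqlite3

def matching_ids(conn, ids, cap):
    """Keep an id if one of its first `cap` rows satisfies the predicate,
    or if the cap itself was reached (the short-circuit case)."""
    sql = """
        SELECT COUNT(*),
               MAX(CASE WHEN type != 'A' OR rec_count != 0 THEN 1 END)
        FROM (SELECT type, rec_count FROM my_table WHERE my_id = ? LIMIT ?)
    """
    result = []
    for my_id in ids:
        seen, hit = conn.execute(sql, (my_id, cap)).fetchone()
        if hit is not None or seen >= cap:
            result.append(my_id)
    return result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_id INTEGER, type TEXT, rec_count INTEGER)")
rows = [(1, "A", 1), (1, "B", 3), (2, "A", 0), (2, "B", 0)]
rows += [(3, "A", 0)] * 2   # two rows, none qualifies
rows += [(4, "A", 0)] * 5   # none qualifies, but the cap is reached
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)

subset = matching_ids(conn, [1, 2, 3, 4, 5], cap=3)
```

Id 4 is returned only because the cap was hit, which is exactly the change in results that the question accepts in exchange for bounded work per id.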
I have a table with this data:
Id Qty
-- ---
A 1
A 2
A 3
B 112
B 125
B 109
But I'm supposed to only have the max values for each id. Max value for A is 3 and for B is 125. How can I isolate (and delete) the other values?
The final table should look like this :
Id Qty
-- ---
A 3
B 125
Running MySQL 4.1
Oh wait. Got a simpler solution :
I'll select all the max values(group by id), export the data, flush the table, reimport only the max values.
CREATE TABLE tabletemp LIKE table;
INSERT INTO tabletemp SELECT id,MAX(qty) FROM table GROUP BY id;
DROP TABLE table;
RENAME TABLE tabletemp TO table;
Thanks to all !
Try this in SQL Server:
delete o
from tbl o
left outer join
(Select max(qty) anz , id
from tbl i
group by i.id) k on o.id = k.id and k.anz = o.qty
where k.id is null
Revision 2 for MySQL... Can anyone check this one?:
delete from tbl
where concat(id,qty) not in
(select concat(id,anz) from (Select max(qty) anz , id
from tbl i
group by i.id) m)
Explanation:
Since I was supposed to not use joins (See comments about MySQL Support on joins and delete/update/insert), I moved the subquery into a IN(a,b,c) clause.
Inside an IN clause I can use a subquery, but that query is only allowed to return one field. So in order to filter out all elements that are not the maximum, I need to concat both fields into a single one, so I can return it inside the IN clause. So basically my query inside the IN returns only the biggest ID+QTY for each id. To compare it with the main table I also need a concat on the outside, so the data for both fields match.
Basically the In clause contains:
("A3","B125")
Disclaimer: The above query is "evil!" since it uses a function (concat) on fields to compare against. This will cause any index on those fields to become almost useless. You should never formulate a query that way that is run on a regular basis. I only wanted to try to bend it so it works on mysql.
Example of this "bad construct":
(Get all o from the last 2 weeks)
select ... from orders where orderday + 14 > now()
You should always do:
select ... from orders where orderday > now() - 14
The difference is subtle: version 2 only has to do the math once and is able to use the index, whereas version 1 has to do the math for every single row in the orders table, and you can forget about index usage.
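One way to see the effect is to ask the database for its plan. The sketch below does this on SQLite through Python's sqlite3 module (an assumption; MySQL's EXPLAIN output looks different). The orderday column holds plain integer day numbers and the index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderday INTEGER)")
conn.execute("CREATE INDEX idx_orders_orderday ON orders (orderday)")
conn.executemany("INSERT INTO orders VALUES (?)", [(day,) for day in range(100)])
today = 99

def plan(sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)

# Version 1: arithmetic on the column, evaluated for every row.
column_math = "SELECT orderday FROM orders WHERE orderday + 14 > ?"
# Version 2: arithmetic on the constant, evaluated once.
constant_math = "SELECT orderday FROM orders WHERE orderday > ? - 14"

plan_column = plan(column_math, (today,))
plan_constant = plan(constant_math, (today,))

same_rows = sorted(conn.execute(column_math, (today,)).fetchall()) == sorted(
    conn.execute(constant_math, (today,)).fetchall()
)
```

Both queries return the same rows, but only the second plan reports an index SEARCH on orderday; the first one has to SCAN every entry.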
I'd try this:
delete from T
where exists (
select * from T as T2
where T2.Id = T.Id
and T2.Qty > T.Qty
);
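A quick check of this EXISTS-based delete against the sample data from the question, run on SQLite via Python's sqlite3 module (an assumption; MySQL 4.1 may reject the same statement because it reads from the table being deleted from).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (Id TEXT, Qty INTEGER)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 112), ("B", 125), ("B", 109)],
)

# Delete every row for which a row with the same Id and a larger Qty exists.
conn.execute(
    """
    DELETE FROM T
    WHERE EXISTS (
        SELECT 1 FROM T AS T2
        WHERE T2.Id = T.Id
          AND T2.Qty > T.Qty
    )
    """
)

remaining = conn.execute("SELECT Id, Qty FROM T ORDER BY Id").fetchall()
```

Because the comparison is strict, rows that tie for the maximum Qty of an Id would all be kept.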
For those who might have a similar question in the future, this might be supported some day (it is now in SQL Server 2005 and later).
It won't require a join, and it has advantages over the use of a temporary table if the table has dependencies
with Tranked(Id,Qty,rk) as (
select
Id, Qty,
rank() over (
partition by Id
order by Qty desc
)
from T
)
delete from Tranked
where rk > 1;
You'll have to go via another table (among other things, what makes a single delete statement here quite impossible in MySQL is that you can't delete from a table and use the same table in a subquery).
BEGIN;
create temporary table tmp_del select id,max(qty) as qty from the_tbl group by id;
delete the_tbl from the_tbl,tmp_del where
the_tbl.id=tmp_del.id and the_tbl.qty<tmp_del.qty;
drop table tmp_del;
END;
MySQL 4.0 and later supports a simple multi-table syntax for DELETE:
DELETE t1 FROM MyTable t1 JOIN MyTable t2 ON t1.id = t2.id AND t1.qty < t2.qty;
This produces a join of each row with a given id to all other rows with the same id, and deletes only the row with the lesser qty in each pairing. After this is all done, the row with the greatest qty per group of id is left not deleted.
If you only have one row with a given id, it still works because a single row is naturally the one with the greatest value.
FWIW, I just tried my solution using MySQL 5.0.75 on a Macbook Pro 2.40GHz. I inserted 1 million rows of synthetic data, with different numbers of rows per "group":
2 rows per id completes in 26.78 sec.
5 rows per id completes in 43.18 sec.
10 rows per id completes in 1 min 3.77 sec.
100 rows per id completes in 6 min 46.60 sec.
1000 rows per id didn't complete before I terminated it.