Get min/max value by key - different approaches - sql

I have a table with two columns namely ID and KEY (let key here be an integer) such as
ID KEY
ABC 6
DEF 1
GHI 12
TASK: Get the ID of the MAX key
Solution 1:
Select Top(1) ID
from TABLE
order by KEY desc
Solution 2:
Select ID
from TABLE
where ID = MAX(ID)
EDIT: The query was invalid. This is what I meant:
Select ID
from TABLE
where KEY = (select max(KEY) from TABLE)
Is one of these solutions categorically better than the other? What are the advantages/disvantages of each solution.
EDIT:
Assume there is no index.
Case 1 - large table
Case 2 - small table
Background:
I am doing code review and I have found both solutions multiple times in different context - sometimes with indices, sometimes without, sometimes for large tables, sometimes for small.

The two queries are different (after your edits fixing the second one).
The first necessarily returns a single row.
The second returns all matching rows.
The first returns a row even when key is NULL.
The second does not.
You should use the logic that does what you want.

An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list..
Solution 1 will be the best. A subquery in a where clause will be less optimal.
There really are lots of design techniques to look at for performance which I am not going to go into with this answer. I found this article yesterday which gave me more perspective https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/

In Solution 1, the order by clause will just sort your query result.
Query execution order:
FROM clause ON clause OUTER clause WHERE clause GROUP BY clause HAVING clause SELECT clause DISTINCT clause ORDER BY clause TOP clause
You can use the following query:
Select ID,
RANK() OVER (ORDER BY KEY DESC) AS KeyRank
from table1
HAVING keyRank = 1

Solution 1 will work but Solution 2 will throw exception like bellow
Msg 147, Level 15, State 1, Line 22 An aggregate may not appear in the
WHERE clause unless it is in a subquery contained in a HAVING clause
or a select list, and the column being aggregated is an outer
reference.

You can go with query 1 ,
You cannot use query 2 because you cannot use aggregate function like that if you want to use where clause and aggregate function in your query you have to go with as below :
Select id from table where key in (select max(key) from test);
reference only using aggregate function and having clause
Select ID ,max(key)
from test
group by ID,key
having (key) >= 12
order by 1

Related

How to find duplicate rows in Hive?

I want to find duplicate rows from one of the Hive table for which I was given two approaches.
First approach is to use following two queries:
select count(*) from mytable; // this will give total row count
second query is as below which will give count of distinct rows
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my table total row count derived using first query is 3500 and second query gives row count 2700. So it tells us that 3500 - 2700 = 800 rows are duplicate. But this query doesn't tell which rows are duplicated.
My second approach to find duplicate is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
Above query should list of rows which are duplicated and how many times particular row is duplicated. but this query shows zero rows which means there are no duplicate rows in that table.
So I would like to know:
If my first approach is correct - if yes then how do I find which rows are duplicated
Why second approach is not providing list of rows which are duplicated?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to
ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get list of duplicated rows.
analytic window function row_number() is quite useful and can provide the duplicates based upon the elements specified in the partition by clause. A simply in-line view and exists clause will then pinpoint what corresponding sets of records contain these duplicates from the original table. In some databases (like TD, you can forgo the inline view using a QUALIFY pragma option)
SQL1 & SQL2 can be combined. SQL2: If you want to deal with NULLs and not simply dismiss, then a coalesce and concatenation might be better in the
SELECT count(1) , count(distinct coalesce(keypart1 ,'') + coalesce(keypart2 ,'') )
FROM srcTable s
3) Finds all records, not just the > 1 records. This provides all context data as well as the keys so it can be useful when analyzing why you have dups and not just the keys.
select * from srcTable s
where exists
( select 1 from (
SELECT
keypart1,
keypart2,
row_number() over( partition by keypart1, keypart2 ) seq
FROM srcTable t
WHERE
-- (whatever additional filtering you want)
) t
where seq > 1
AND t.keypart1 = s.keypart1
AND t.keypart2 = s.keypart2
)
Suppose your want get duplicate rows based on a particular column ID here. Below query will give you all the IDs which are duplicate in table in hive.
SELECT "ID"
FROM TABLE
GROUP BY "ID"
HAVING count(ID) > 1

Confused syntax in Where clause

what does the line (rowid,0) mean in the following query
select * from emp
WHERE (ROWID,0) in (
select rowid, mod(rownum,2) from emp
);
i dont get the line WHERE (ROWID,0).
what is it?
thanx in advance
IN clause in Oracle SQL can support column groups. You can do things like this:
select ...
from tab1
where (tab1.col1, tab1.col2) in (
select tab2.refcol1, tab2.refcol2
from tab2
)
That can be useful in many cases.
In your particular case, the subquery use for the second expression mod(rownum,2). Since there is no order by, that means that rownum will be in whichever order the database retrieves the rows - that might be a full table scan or a fast full index scan.
Then by using mod every other row in the subquery gets the value 0, every other row gets the value 1.
The IN clause then filters on second value in the subquery being equal to 0. The end result is that this query retrieves half of your employees. Which half will depend on which access path the optimizer chooses.
Not sure what dialect of sql you're using, but it appears that since the subquery in the IN clause has two columns in the select list, then the (ROWID,0) indicates which columns align with the subquery. I have never seen multiple columns in an IN statment's select list before.
This is a syntax used by some databases (but not all) that allows you to do in with multiple values.
With in, this is the same as:
where exists (select 1
from emp e2
where e2.rowid = emp.rowid and
mod(rownum, 2) = 0
)
I should note that if you are using Oracle (which allows this syntax), then you are using rownum in a subquery with no order by. The results are going to be rather arbitrary. However, the intention seems to be to return every other row, in some sense.

Count of 2 columns by GROUP BY and catx giving different outputs

I have to find distinct count of combination of 2 variables. I used the following 2 queries to find the count:
select count(*) from
( select V1, V2
from table1
group by 1,2
) a
select count(distinct catx('-', V1, V2))
from table1
Logically, both the above queries should give the same count but I am getting different counts. Note that
both V1 and V2 are integers
Both variables can have null values, though there are no null values in my table
There are no negative values
Any idea why I might be getting different outputs? And which is the best way to find the count of distinct combinations of 2 or more columns?
Thanks.
The SAS log gives the answer when you run the first sql code. Using 'group by' requires a summary function, otherwise it is ignored. The count will therefore return the overall number of rows instead of a distinct count of the 2 variables combined.
Just add count(*) to the subquery and you will get the same answer with both methods.
select count(*) from
( select V1, V2, count(*)
from table1
group by 1,2
) a
Use distinct in the subquery for the first query..
When you do a group by but don't include any aggregate function, it discards the group by.
so you will still have duplicate combinations of v1 and v2.
It seems that GROUP BY doesn't work that way in SAS. You can't use it to remove duplicates unless you have an aggregate function in your query. I found this in the log of my query output -
NOTE: A GROUP BY clause has been discarded because neither the SELECT
clause nor the optional HAVING clause of the associated
table-expression referenced a summary function.
This answers the question.
you can ignore the group by part also and just add a distinct in the sub-query. Also the second query you wrote is more efficient

GROUP BY combined with ORDER BY

The GROUP BY clause groups the rows, but it does not necessarily sort the results in any particular order. To change the order, use the ORDER BY clause, which follows the GROUP BY clause. The columns used in the ORDER BY clause must appear in the SELECT list, which is unlike the normal use of ORDER BY. [Oracle by Example, fourth Edition, page 274]
Why is that? Why does using GROUP BY influence the required columns in the SELECT clause?
Also, in the case where I do not use GROUP BY: Why would I want to ORDER BY some columns but then select only a subset of the columns?
Actually the statement is not entirely true as Dave Costa's example shows.
The Oracle documentation says that an expression can be used but the expression must be based on the columns in the selection list.
expr - expr orders rows based on their value for expr. The expression is based on
columns in the select list or columns in the tables, views, or materialized views in the
FROM clause. Source: Oracle® Database
SQL Language Reference
11g Release 2 (11.2)
E26088-01
September 2011. Page 19-33
From the the same work page 19-13 and 19-33 (Page 1355 and 1365 in the PDF)
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#SQLRF01702
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2171079
The bold text from your quote is incorrect (it's probably an oversimplification that is true in many common use cases, but it is not strictly true as a requirement). For instance, this statement executes just fine, although AVG(val) is not in the select list:
WITH DATA AS (SELECT mod(LEVEL,3) grp, LEVEL val FROM dual CONNECT BY LEVEL < 100)
SELECT grp,MIN(val),MAX(val)
FROM DATA
GROUP BY grp
ORDER BY AVG(val)
The expressions in the ORDER BY clause simply have to be possible to evaluate in the context of the GROUP BY. For instance, ORDER BY val would not work in the above example, because the expression val does not have a distinct value for each row produced by the grouping.
As to your second question, you may care about the ordering but not about the value of the ordering expression. Excluding unneeded expressions from the select lists reduces the amount of data that must actually be sent from the server to the client.
First:
The implementation of group by is one which creates a new resultset that differs in structure to the original from clause (table view or some joined tables). That resultset is defined by what is selected.
Not every SQL RDBMS has this restriction, though it is a always requirement that what is ordered by be either an aggregate function of the non-grouped columns (AVG, SUM, etc) or one of the columns grouped by, or functions upon more than one of those results (like adding two columns), because this is a logical requirement of the result of the grouping operation.
Second:
Because you only care about that column for the ordering. For example, you might have a list of the top selling singles without giving their sales (the NYT Bestsellers keeps some details of their data a secret, but do have a ranked list). Of course, you can get around this by just selecting that column and then not using it.
The data is aggregated before it is sorted for the ORDER BY.
If you try to order by any other column (that is not in the group by list or an aggregation function), what value would be used? There is no single value to use for ordering.
I believe that you can use combinations of the values for sorting. So you can say:
order by a+b
If a and b are in the group by. You just cannot introduce columns not mentioned in the SELECT. I believe you can use aggregation functions not mentioned in the SELECT, however.
Sample table
sample.grades
Name Grade Score
Adam A 95
Bob A 97
Charlie C 75
First Query using GROUP BY
Select grade, count(Grade) from sample.grades GROUP BY Grade
Output
Grade Count
A 2
C 1
Second Query using order by
select Name, score from sample grades order by score
Output
Bob A 97
Adam A 95
Charlie C 75
Third Query using GROUP BY and ordering
Select grade, count(Grade) from sample.grades GROUP BY Grade desc
Output
Grade Count
A 2
C 1
Once you start using things like Count, you must have group by. You can use them together, but they have very different uses, as I hope the examples clearly show.
To try and answer the question, why does group by effect the items in the select section, because that is what group by is meant to do. You can't do the count of a column if you do not group by that column.
Second question, why would you want to order by but not select all the columns?
If I want to order by the score, but do not care about the actual grade or even the score I might do
select name from sample.grades order by score
Output
Name
Bob
Adam
Charlie
Which results do you expect to see ordering by columns not listed in the select list and not participated in group by clause? at any case all kind of sort by non-mentioned in SELECT list columns will be omitted so Oracle guys added the restriction correctly.
with c as (
select 1 id, 2 value from dual
union all
select 1 id, 3 value from dual
union all
select 2 id, 3 value from dual
)
select id
from c
group by id
order by count(*) desc
Here my inderstanding
"The GROUP BY clause groups the rows, but it does not necessarily sort the results in any particular order."
-> you can use Group by without order by
"To change the order, use the ORDER BY clause, which follows the GROUP BY clause."
-> the rows are selected by defaut with primary key, and if you add order by you must add after group by
"The columns used in the ORDER BY clause must appear in the SELECT list, which is unlike the normal use of ORDER BY."

Aggregate functions in WHERE clause in SQLite

Simply put, I have a table with, among other things, a column for timestamps. I want to get the row with the most recent (i.e. greatest value) timestamp. Currently I'm doing this:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 1
But I'd much rather do something like this:
SELECT * FROM table WHERE timestamp=max(timestamp)
However, SQLite rejects this query:
SQL error: misuse of aggregate function max()
The documentation confirms this behavior (bottom of page):
Aggregate functions may only be used in a SELECT statement.
My question is: is it possible to write a query to get the row with the greatest timestamp without ordering the select and limiting the number of returned rows to 1? This seems like it should be possible, but I guess my SQL-fu isn't up to snuff.
SELECT * from foo where timestamp = (select max(timestamp) from foo)
or, if SQLite insists on treating subselects as sets,
SELECT * from foo where timestamp in (select max(timestamp) from foo)
There are many ways to skin a cat.
If you have an Identity Column that has an auto-increment functionality, a faster query would result if you return the last record by ID, due to the indexing of the column, unless of course you wish to put an index on the timestamp column.
SELECT * FROM TABLE ORDER BY ID DESC LIMIT 1
I think I've answered this question 5 times in the past week now, but I'm too tired to find a link to one of those right now, so here it is again...
SELECT
*
FROM
table T1
LEFT OUTER JOIN table T2 ON
T2.timestamp > T1.timestamp
WHERE
T2.timestamp IS NULL
You're basically looking for the row where no other row matches that is later than it.
NOTE: As pointed out in the comments, this method will not perform as well in this kind of situation. It will usually work better (for SQL Server at least) in situations where you want the last row for each customer (as an example).
you can simply do
SELECT *, max(timestamp) FROM table
Edit:
As aggregate function can't be used like this so it gives error. I guess what SquareCog had suggested was the best thing to do
SELECT * FROM table WHERE timestamp = (select max(timestamp) from table)