How to select the row with the lowest value- oracle - sql

I have a table where I save authors and songs, along with other columns. The same song can appear multiple times, and it always comes from the same author. I would like to select the author that has the fewest songs, counting the repeated ones, i.e. the one that is listened to the least.
The final table should show only one author name.

Clearly, one step is to find the count for every author, which can be done with an elementary aggregate query. If you then order by the count and select just the first row, that solves your problem. One approach is to use ROWNUM in an outer query. This is a very elementary approach, quite efficient, and it works in all versions of Oracle (it doesn't use any advanced features).
select author
from (
select author
from your_table
group by author
order by count(*)
)
where rownum = 1
;
Note that in the subquery we don't need to select the count (since we don't need it in the output). We can still use it in order by in the subquery, which is all we need it for.
The only tricky part here is to remember that you need to order the rows in the subquery, and then apply the ROWNUM filter in the outer query. This is because ORDER BY is the very last thing that is processed in any query - it comes after ROWNUM is assigned to rows in the output. So, moving the WHERE clause into the subquery (and doing everything in a single query, instead of a subquery and an outer query) does not work.
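The count, order, take-the-first-row logic can be tried out with a minimal sketch like the one below. It is not Oracle: it assumes Python's bundled sqlite3 module and made-up sample rows, and SQLite has no ROWNUM. LIMIT 1 is swapped in for the ROWNUM filter; unlike ROWNUM, LIMIT is applied after ORDER BY, so the single-level form is enough in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (author TEXT, song TEXT)")
# Hypothetical sample data: A is played 3 times, B once, C twice.
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [("A", "s1"), ("A", "s1"), ("A", "s2"),
     ("B", "s3"),
     ("C", "s4"), ("C", "s5")],
)

# One row per author, sorted by play count, then keep the first row.
# The count is used only for ordering, so it is not selected.
least = conn.execute(
    """
    SELECT author
    FROM your_table
    GROUP BY author
    ORDER BY COUNT(*)
    LIMIT 1
    """
).fetchone()[0]
print(least)  # B
```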

You can use analytical functions as follows:
Select * from
(Select t.*,
Row_number() over (order by cnt_author) as rn
From
(Select t.*,
Count(*) over (partition by author) as cnt_author
From your_table t) t ) t
Where rn = 1

Related

Retrieving last record in each group from database with additional max() condition in MSSQL

This is a follow-up question to Retrieving last record in each group from database - SQL Server 2005/2008
In the answers, this example was provided to retrieve last record for a group of parameters (example below retrieves last updates for each value in computername):
select t.*
from t
where t.lastupdate = (select max(t2.lastupdate)
from t t2
where t2.computername = t.computername
);
In my case, however, "lastupdate" is not unique (some updates come in batches and have same lastupdate value, and if two updates of "computername" come in the same batch, you will get non-unique output for "computername + lastupdate").
Suppose I also have field "rowId" that is just auto-incremental. The mitigation would be to include in the query another criterion for a max('rowId') field.
NB: while the example employs time-specific name "lastupdate", the actual selection criteria may not be related to the time at all.
I would therefore like to ask: what would be the most performant query that selects the last record in each group based both on the "group-defining parameter" (in the case above, "computername") and on the maximal rowId?
If you don't have uniqueness, then row_number() is simpler:
select t.*
from (select t.*,
row_number() over (partition by computername order by lastupdate desc, rowid desc) as seqnum
from t
) t
where seqnum = 1;
With the right indexes, the correlated subquery is usually faster. However, the performance difference is not that great.
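The tie-breaking behaviour can be checked with a small sketch, assuming Python's sqlite3 module (SQLite 3.25 or later for window functions) rather than SQL Server. The sample rows are made up, and the auto-increment column is named row_id here because SQLite reserves rowid for its own implicit column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (row_id INTEGER PRIMARY KEY, computername TEXT,"
    " lastupdate TEXT, payload TEXT)"
)
# Hypothetical data: pc1 has two updates from the same batch (same lastupdate).
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, "pc1", "2020-01-01", "old"),
     (2, "pc1", "2020-01-02", "batch-a"),
     (3, "pc1", "2020-01-02", "batch-b"),
     (4, "pc2", "2020-01-01", "only")],
)

# Newest lastupdate first; among equal timestamps the highest row_id wins,
# so exactly one row per computername gets seqnum = 1.
last_rows = conn.execute(
    """
    SELECT computername, row_id, payload
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY computername
                   ORDER BY lastupdate DESC, row_id DESC
               ) AS seqnum
        FROM t
    ) AS ranked
    WHERE seqnum = 1
    ORDER BY computername
    """
).fetchall()
print(last_rows)
```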

Oracle subquery in select

I have a table that keeps costs of products. I'd like to get the average cost AND last buying invoice for each product.
My solution was creating a sub-select to get last buying invoice but unfortunately I'm getting
ORA-00904: "B"."CODPROD": invalid identifier
My query is
SELECT (b.cod_aux) product,
-- here goes code to get average cost,
(SELECT round(valorultent, 2)
FROM (SELECT valorultent
FROM pchistest
WHERE codprod = b.codprod
ORDER BY dtultent DESC)
WHERE ROWNUM = 1)
FROM pchistest a, pcembalagem b
WHERE a.codprod = b.codprod
GROUP BY a.codprod, b.cod_aux
ORDER BY b.cod_aux
In short, what I'm doing in the sub-select is ordering in descending order and taking the first row for the given product b.codprod.
Your problem is that a correlated reference (here b.codprod) cannot be used more than one sub-query level deep. According to the comments, this was changed in 12c, but I haven't had a chance to try it, as the data warehouse that I use is still on 11g.
I would use something like this:
SELECT b.cod_aux AS product
,ROUND (r.valorultent, 2) AS valorultent
FROM pchistest a
JOIN pcembalagem b ON (a.codprod = b.codprod)
JOIN (SELECT valorultent
,codprod
,ROW_NUMBER() OVER (PARTITION BY codprod
ORDER BY dtultent DESC)
AS row_no
FROM pchistest) r ON (r.row_no = 1 AND r.codprod = b.codprod)
GROUP BY a.codprod, b.cod_aux, r.valorultent
ORDER BY b.cod_aux
I avoid sub-queries in SELECT statements. Most of the time, the optimizer wants to run a SELECT for each item in the cursor, OR it does some crazy nested loops. If you do it as a sub-query in the JOIN, Oracle will normally process the rows that you are joining; normally, it is more efficient. Finally, complete your per item functions (in this case, the ROUND) in the final product. This will prevent Oracle from doing it on ALL rows, not just the ones you use. It should do it correctly, but it can get confused on complex queries.
The ROW_NUMBER() OVER (PARTITION BY ..) is where the magic happens. This adds a row number to each group of CODPRODs. This allows you to pluck the top row from each CODPROD, so this allows you to get the newest/oldest/greatest/least/etc from your sub-query. It is also great for filtering duplicates.
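As a rough illustration of plucking the newest row per CODPROD and combining it with an aggregate, here is a sketch using Python's sqlite3 module (SQLite 3.25 or later) instead of Oracle. The sample rows are invented, and computing the average cost as AVG(valorultent) is an assumption, since the question leaves that part out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pchistest (codprod INTEGER, valorultent REAL, dtultent TEXT);
    CREATE TABLE pcembalagem (codprod INTEGER, cod_aux TEXT);
    -- Hypothetical data: product 1 was bought twice, product 2 once.
    INSERT INTO pchistest VALUES (1, 10.0, '2020-01-01'),
                                 (1, 20.0, '2020-02-01'),
                                 (2,  5.0, '2020-01-05');
    INSERT INTO pcembalagem VALUES (1, 'P1'), (2, 'P2');
    """
)

# r holds every history row numbered per product, newest first;
# joining on row_no = 1 keeps only the latest invoice value.
result = conn.execute(
    """
    SELECT b.cod_aux AS product,
           ROUND(AVG(a.valorultent), 2) AS avg_cost,   -- assumed definition
           ROUND(r.valorultent, 2) AS last_cost
    FROM pchistest a
    JOIN pcembalagem b ON a.codprod = b.codprod
    JOIN (SELECT codprod, valorultent,
                 ROW_NUMBER() OVER (PARTITION BY codprod
                                    ORDER BY dtultent DESC) AS row_no
          FROM pchistest) r
      ON r.row_no = 1 AND r.codprod = b.codprod
    GROUP BY a.codprod, b.cod_aux, r.valorultent
    ORDER BY b.cod_aux
    """
).fetchall()
print(result)
```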

Why do partitions require nested selects?

I have a page to show 10 messages by each user (don't ask me why)
I have the following code:
SELECT *, row_number() over(partition by user_id) as row_num
FROM "posts"
WHERE row_num <= 10
It doesn't work.
When I do this:
SELECT *
FROM (
SELECT *, row_number() over(partition by user_id) as row_num FROM "posts") as T
WHERE row_num <= 10
It does work.
Why do I need a nested query to see the row_num column? By the way, in the first query I can actually see it in the results, but I can't use it in the WHERE clause.
It seems to be the same "rule" as for any query: column aliases aren't visible to the WHERE clause.
This will also fail:
SELECT id AS newid
FROM test
WHERE newid=1; -- must use "id" in WHERE clause
A SQL query like:
SELECT *
FROM table
WHERE <condition>
is executed in the following order:
3. SELECT *
1. FROM table
2. WHERE <condition>
So, as Joachim Isaksson says, columns defined in the SELECT clause are not visible in the WHERE clause because of the processing order.
In your second query, the row_num column is produced by the subquery in the FROM clause first, so it is visible in the outer WHERE clause.
Here is a simple list of the steps in the order they execute.
There is a good reason for this rule in standard SQL.
Consider the statement:
SELECT *, row_number() over (partition by user_id) as row_num
FROM "posts" p
WHERE row_num <= 10 and p.type = 'xxx';
When does the p.type = 'xxx' get evaluated relative to the row number? In other words, would this return the first ten rows of "xxx"? Or would it return the "xxx"s in the first ten rows?
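The designers of the SQL language recognized that this is a hard problem to resolve. Allowing window functions only in the SELECT clause resolves the issue, because the order of filtering and numbering then has to be spelled out with a subquery.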
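The two readings of that ambiguous statement can be written out explicitly with subqueries. The sketch below assumes Python's sqlite3 module (SQLite 3.25 or later), a tiny made-up posts table, a limit of 2 instead of 10, and an ORDER BY id inside the window so that the numbering is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, user_id INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "xxx"), (3, 1, "xxx")],
)

# Reading 1: number all rows first, then keep the 'xxx' rows among the first two.
number_then_filter = [r[0] for r in conn.execute(
    """
    SELECT id
    FROM (SELECT id, type,
                 ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY id) AS row_num
          FROM posts) AS numbered
    WHERE row_num <= 2 AND type = 'xxx'
    ORDER BY id
    """
)]

# Reading 2: keep only 'xxx' rows first, then take the first two of those.
filter_then_number = [r[0] for r in conn.execute(
    """
    SELECT id
    FROM (SELECT id,
                 ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY id) AS row_num
          FROM posts
          WHERE type = 'xxx') AS numbered
    WHERE row_num <= 2
    ORDER BY id
    """
)]

print(number_then_filter)  # [2]
print(filter_then_number)  # [2, 3]
```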
You can check this topic and this one on dba.stackexchange.com about the order in which SQL executes the clauses of a SELECT statement. I think it applies not only to PostgreSQL, but to all RDBMSs.

SQL random aggregate

Say I have a simple table with 3 fields: 'place', 'user' and 'bytes'. Let's say, that under some filter, I want to group by 'place', and for each 'place', to sum all the bytes for that place, and randomly select a user for that place (uniformly from all the users that fit the 'where' filter and the relevant 'place'). If there was a "select randomly from" aggregate function, I would do:
SELECT place, SUM(bytes), SELECT_AT_RANDOM(user) WHERE .... GROUP BY place;
...but I couldn't find such an aggregate function. Am I missing something? What could be a good way to achieve this?
If your RDBMS supports analytic functions, you can use:
WITH T
AS (SELECT place,
Sum(bytes) OVER (PARTITION BY place) AS Sum_bytes,
user,
Row_number() OVER (PARTITION BY place ORDER BY random_function()) AS RN
FROM YourTable
WHERE .... )
SELECT place,
Sum_bytes,
user
FROM T
WHERE RN = 1;
For SQL Server, Crypt_gen_random(4) or NEWID() are examples of what could be substituted for random_function().
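A minimal runnable sketch of this idea, assuming Python's sqlite3 module (SQLite 3.25 or later), where random() stands in for random_function(). The table name traffic, the column name user_name, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (place TEXT, user_name TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO traffic VALUES (?, ?, ?)",
    [("home", "ann", 10), ("home", "bob", 20), ("work", "cat", 5)],
)

# Every row carries its place's total; rows are numbered in random order
# within each place, so rn = 1 marks one uniformly chosen user per place.
picks = conn.execute(
    """
    WITH t AS (
        SELECT place,
               SUM(bytes) OVER (PARTITION BY place) AS sum_bytes,
               user_name,
               ROW_NUMBER() OVER (PARTITION BY place ORDER BY random()) AS rn
        FROM traffic
    )
    SELECT place, sum_bytes, user_name
    FROM t
    WHERE rn = 1
    ORDER BY place
    """
).fetchall()
print(picks)  # the user shown for 'home' varies between runs
```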
I think your question is DBMS-specific. If your DBMS is MySQL, you can use a solution like this:
SELECT place_rand.place, SUM(place_rand.bytes), place_rand.user as random_user
FROM
(SELECT place, bytes, user
FROM place
WHERE ...
ORDER BY rand()) place_rand
GROUP BY
place_rand.place;
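The subquery orders the records randomly. The outer query groups by place, sums bytes, and returns one user per group, since user is neither inside an aggregate function nor in the GROUP BY clause. Note that this relies on MySQL's non-standard GROUP BY handling (the query is rejected when ONLY_FULL_GROUP_BY is enabled), and MySQL does not guarantee which row's user is returned for each group.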
With a custom aggregate function, you could write expressions as simple as:
SELECT place, SUM(bytes), SELECT_AT_RANDOM(user) WHERE .... GROUP BY place;
SELECT_AT_RANDOM would be the custom aggregate function.
Here is precisely an implementation in PostgreSQL.
I would do a bit of a variation on Martin's solution:
select place, sum(bytes), max(case when seqnum = 1 then user end) as random_user
from (select place, bytes, user,
row_number() over (partition by place order by newid()) as seqnum
from t
) t
group by place
(Where newid() is just one way to get a random number, depending on the database.)
For some reason, I prefer this approach, because it still has the aggregation function in the outer query. If you are summarizing a bunch of fields, then this seems cleaner to me.

Oracle Select Max Date on Multiple records

I've got the following SELECT statement, and based on what I've seen here: SQL Select Max Date with Multiple records I've got my example set up the same way. I'm on Oracle 11g. Instead of returning one record for each asset_tag, it's returning multiples. Not as many records as in the source table, but more than (I think) it should be. If I run the inner SELECT statement, it also returns the correct set of records (1 per asset_tag), which really has me stumped.
SELECT
outside.asset_tag,
outside.description,
outside.asset_type,
outside.asset_group,
outside.status_code,
outside.license_no,
outside.rentable_yn,
outside.manufacture_code,
outside.model,
outside.manufacture_vin,
outside.vehicle_yr,
outside.meter_id,
outside.mtr_uom,
outside.mtr_reading,
outside.last_read_date
FROM mp_vehicle_asset_profile outside
RIGHT OUTER JOIN
(
SELECT asset_tag, max(last_read_date) as last_read_date
FROM mp_vehicle_asset_profile
group by asset_tag
) inside
ON outside.last_read_date=inside.last_read_date
Any suggestions?
Try with analytical functions:
SELECT outside.asset_tag,
outside.description,
outside.asset_type,
outside.asset_group,
outside.status_code,
outside.license_no,
outside.rentable_yn,
outside.manufacture_code,
outside.model,
outside.manufacture_vin,
outside.vehicle_yr,
outside.meter_id,
outside.mtr_uom,
outside.mtr_reading,
outside.last_read_date
FROM ( SELECT t.*, ROW_NUMBER() OVER(PARTITION BY asset_tag ORDER BY last_read_date DESC) Corr
FROM mp_vehicle_asset_profile t) outside
WHERE Corr = 1
I think you need to add...
AND outside.asset_tag=inside.asset_tag
...to the criteria in your ON list.
Also, a RIGHT OUTER JOIN is not needed. An INNER JOIN will give the same results (and may be more efficient), since there cannot be combinations of asset_tag and last_read_date in the subquery that do not exist in mp_vehicle_asset_profile.
Even then, the query may return more than one row per asset tag if there are "ties", that is, multiple rows with the same last_read_date. In contrast, @Lamak's analytic-based answer will arbitrarily pick exactly one row in this situation.
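The effect of the missing asset_tag condition can be reproduced with a small sketch, assuming Python's sqlite3 module instead of Oracle, a table cut down to two columns, and invented rows. The helper latest_rows is hypothetical and only swaps the ON clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mp_vehicle_asset_profile (asset_tag TEXT, last_read_date TEXT)"
)
# Hypothetical data: B has an older reading on the same day as A's latest one.
conn.executemany(
    "INSERT INTO mp_vehicle_asset_profile VALUES (?, ?)",
    [("A", "2020-01-01"), ("A", "2020-02-01"),
     ("B", "2020-02-01"), ("B", "2020-03-01")],
)

def latest_rows(on_clause):
    """Join the table to its per-tag maximum dates using the given ON clause."""
    sql = f"""
        SELECT outside.asset_tag, outside.last_read_date
        FROM mp_vehicle_asset_profile AS outside
        INNER JOIN (
            SELECT asset_tag, MAX(last_read_date) AS last_read_date
            FROM mp_vehicle_asset_profile
            GROUP BY asset_tag
        ) AS inside
        ON {on_clause}
        ORDER BY outside.asset_tag, outside.last_read_date
    """
    return conn.execute(sql).fetchall()

# Date only: B's older row matches A's maximum date and slips through.
date_only = latest_rows("outside.last_read_date = inside.last_read_date")
# Tag and date: one row per asset_tag (as long as there are no ties).
tag_and_date = latest_rows(
    "outside.asset_tag = inside.asset_tag"
    " AND outside.last_read_date = inside.last_read_date"
)
print(date_only)
print(tag_and_date)
```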
Your comment suggests that you want to break ties by picking the row with highest mtr_reading for the last_read_date.
You could modify @Lamak's analytic-based answer to do this by changing the ORDER BY in the OVER clause to:
ORDER BY last_read_date DESC, mtr_reading DESC
If there are still ties (that is, multiple rows with the same asset_tag, last_read_date, and mtr_reading), the query will again arbitrarily pick exactly one row.
You could modify my aggregate-based answer to break ties using highest mtr_reading as follows:
SELECT
outside.asset_tag,
outside.description,
outside.asset_type,
outside.asset_group,
outside.status_code,
outside.license_no,
outside.rentable_yn,
outside.manufacture_code,
outside.model,
outside.manufacture_vin,
outside.vehicle_yr,
outside.meter_id,
outside.mtr_uom,
outside.mtr_reading,
outside.last_read_date
FROM
mp_vehicle_asset_profile outside
INNER JOIN
(
SELECT
asset_tag,
MAX(last_read_date) AS last_read_date,
MAX(mtr_reading) KEEP (DENSE_RANK FIRST ORDER BY last_read_date DESC) AS mtr_reading
FROM
mp_vehicle_asset_profile
GROUP BY
asset_tag
) inside
ON
outside.asset_tag = inside.asset_tag
AND
outside.last_read_date = inside.last_read_date
AND
outside.mtr_reading = inside.mtr_reading
If there are still ties (that is, multiple rows with the same asset_tag, last_read_date, and mtr_reading), the query may again return more than one row.
One other way that the analytic- and aggregate-based answers differ is in their treatment of nulls. If any of asset_tag, last_read_date, or mtr_reading are null, the analytic-based answer will return the related rows, but the aggregate-based one will not (because the equality conditions in the join do not evaluate to TRUE when a null is involved).
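The null difference can be seen in a small sketch, assuming Python's sqlite3 module (SQLite 3.25 or later) rather than Oracle. The table is reduced to two columns, and the hypothetical asset C has only a row with a null last_read_date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mp_vehicle_asset_profile (asset_tag TEXT, last_read_date TEXT)"
)
conn.executemany(
    "INSERT INTO mp_vehicle_asset_profile VALUES (?, ?)",
    [("A", "2020-01-01"), ("A", "2020-02-01"), ("C", None)],
)

# Aggregate-based: C's maximum date is NULL, and NULL = NULL is not TRUE,
# so the join drops asset C entirely.
aggregate_based = conn.execute(
    """
    SELECT o.asset_tag, o.last_read_date
    FROM mp_vehicle_asset_profile AS o
    INNER JOIN (
        SELECT asset_tag, MAX(last_read_date) AS last_read_date
        FROM mp_vehicle_asset_profile
        GROUP BY asset_tag
    ) AS i
    ON o.asset_tag = i.asset_tag AND o.last_read_date = i.last_read_date
    ORDER BY o.asset_tag
    """
).fetchall()

# Analytic-based: every partition has a row numbered 1, nulls included.
analytic_based = conn.execute(
    """
    SELECT asset_tag, last_read_date
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY asset_tag
                                  ORDER BY last_read_date DESC) AS rn
        FROM mp_vehicle_asset_profile AS t
    ) AS ranked
    WHERE rn = 1
    ORDER BY asset_tag
    """
).fetchall()

print(aggregate_based)
print(analytic_based)
```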