SQL group by not returning row value for an aggregate column - sql

I was using a SQL statement to get an aggregate (MAX) of one column, with the rest of the columns coming from the row that holds that maximum. I was using a GROUP BY clause, but then the other columns must also be wrapped in MAX, MIN, etc. This was a budget-constrained project, so I did not have time to do it using LINQ (where I could have used FirstOrDefault). Anyway, I believe this is a serious limitation of the SQL language.
Again, this could be done in many ways, but not with a simple SQL GROUP BY.
Any ideas?

Your question is a bit light on details, but it sounds like you want to know, for some set of items, which item has the maximum of something and what its other properties are.
You cannot group by all the non-max columns, because that breaks the groups into chunks too small for the max to be meaningful.
You cannot max all the other columns, because that mixes data from different rows.
Here is a simple example:
Name, JobRole, StartDate
John, JuniorProgrammer, 2000-01-01
John, SeniorProgrammer, 2010-01-01
John was promoted to senior programmer in 2010. We want John's most recent promotion and what he does now. If we do this:
SELECT name, jobrole, max(startdate)
FROM emp
GROUP BY name
The database will complain that jobrole is not in the group by. If we add it to the group by, John will appear twice, which is not what we want. If instead we max(jobrole), it DOES accidentally work out OK because, alphabetically, SeniorProgrammer is higher than JuniorProgrammer.
If however, John then gets a promotion again in 2019:
Name, JobRole, StartDate
John, JuniorProgrammer, 2000-01-01
John, SeniorProgrammer, 2010-01-01
John, ExecutiveDirector, 2019-01-01
This time our query is wrong:
SELECT name, max(jobrole), max(startdate)
FROM emp
GROUP BY name
The row data will be mixed up: the date will be 2019 but the job will still be SeniorProgrammer because it's alphabetically the maximum value.
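To see the mix-up concretely, here is a minimal sketch that runs the query above against an in-memory SQLite database through Python's built-in sqlite3 module. SQLite is an assumption made only so the example is self-contained; the table and data are the ones from this answer.

```python
import sqlite3

# In-memory SQLite database, used here only to make the example runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, jobrole TEXT, startdate TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [
        ("John", "JuniorProgrammer", "2000-01-01"),
        ("John", "SeniorProgrammer", "2010-01-01"),
        ("John", "ExecutiveDirector", "2019-01-01"),
    ],
)

# Each max() is computed independently, so the two values
# can come from different rows of the same group.
mixed = conn.execute(
    "SELECT name, max(jobrole), max(startdate) FROM emp GROUP BY name"
).fetchall()

print(mixed)  # [('John', 'SeniorProgrammer', '2019-01-01')]
```

The returned row pairs the 2019 date with the 2010 job title, a combination that never existed in the table.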
Instead we have to find the max for the person and then join it back to find the rest of the data:
SELECT emp.name, emp.jobrole, emp.startdate
FROM
emp
INNER JOIN
(
SELECT name, max(startdate) d
FROM emp
GROUP BY name
)findmax
ON findmax.d = emp.startdate and findmax.name = emp.name
There are other ways of achieving the same thing without a join. The join method has a drawback: if an employee was promoted twice on the same day, two records would result. In a database that supports analytic functions we can do:
SELECT name, jobrole, row_number() over (partition by name order by startdate desc)
FROM emp
This establishes an incrementing counter in order of descending start date. The counter restarts from 1 for every different employee. There is no GROUP BY, so there are no complaints that the extra columns aren't grouped or inside an aggregate function. All we need to do to choose the most recent promotion is wrap the whole thing in a select that demands the row number be 1:
SELECT * FROM
(
SELECT name, jobrole, row_number() over (partition by name order by startdate desc) r
FROM emp
) emp_with_rownum
WHERE r = 1
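As a rough check that both approaches agree, the sketch below runs the join-back query and the row_number query side by side. It assumes SQLite 3.25 or later (needed for window functions) via Python's sqlite3 module; the second employee, Mary, is made up to show the counter restarting per name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, jobrole TEXT, startdate TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [
        ("John", "JuniorProgrammer", "2000-01-01"),
        ("John", "SeniorProgrammer", "2010-01-01"),
        ("John", "ExecutiveDirector", "2019-01-01"),
        ("Mary", "Analyst", "2005-06-01"),   # hypothetical extra employee
        ("Mary", "Manager", "2015-03-01"),
    ],
)

# Approach 1: find the max date per name, then join back for the other columns.
join_back = conn.execute("""
    SELECT emp.name, emp.jobrole, emp.startdate
    FROM emp
    INNER JOIN (
        SELECT name, max(startdate) AS d
        FROM emp
        GROUP BY name
    ) findmax
      ON findmax.d = emp.startdate AND findmax.name = emp.name
    ORDER BY emp.name
""").fetchall()

# Approach 2: number the rows per name, newest first, and keep number 1.
windowed = conn.execute("""
    SELECT name, jobrole, startdate
    FROM (
        SELECT name, jobrole, startdate,
               row_number() OVER (PARTITION BY name ORDER BY startdate DESC) AS r
        FROM emp
    ) emp_with_rownum
    WHERE r = 1
    ORDER BY name
""").fetchall()

print(join_back)
print(windowed)
```

Both queries return John as ExecutiveDirector (2019-01-01) and Mary as Manager (2015-03-01). They only diverge when one person has two rows with the same latest date: the join returns both, while row_number keeps one of them arbitrarily.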

You don't want a group by. You seem to want a window function:
select t.*, max(col) over () as overall_max
from t;

Related

SQL Oracle: How to show only one row when the columns diverge

I have an employees table in which most employees appear in only one row.
However, I have to report the number of employees by area, and 3 of the 3432 employees have worked in a different area before.
Therefore, the results show duplicated rows for these 3 employees. It's something like this:
Notice that in Brian's case, he was previously admitted to a different area.
How can I show Brian only once? More specifically, how can I show only the most recent area he has worked in?
You can use ROW_NUMBER() to identify new and old rows per each employee, ordered by admission date.
Then filtering out old rows is easy. For example:
select *
from (
select t.*,
row_number() over(partition by employee order by admission desc) as rn
from t
) x
where rn = 1 -- keeps the latest row only, per employee

row_number() function in oracle

I am using the ROW_NUMBER function in Oracle and trying to understand how it behaves when several rows hold the same values in the PARTITION BY and ORDER BY columns, i.e. how the numbering works when there are duplicate records.
Below is the sample dataset:
select * from test
Result
Dept      salary  created date
HR        500     25-Jul
HR        200     25-Jul
HR        500     26-Jul
Accounts  300     25-Jan
Accounts  300     26-Jan
Accounts  300     27-Jan
I ran the row_number function on the above set:
select t.*, ROW_NUMBER() OVER(partition by Dept order by salary) as row_number
from test t
result
Dept      salary  created date  row_number
HR        500     25-Jul        1
HR        200     25-Jul        1
HR        500     26-Jul        2
Accounts  300     25-Jan        1
Accounts  300     26-Jan        2
Accounts  300     27-Jan        3
As you can see in the output above, I am using Dept in the partition by and salary in the order by for row_number, and it gave me the numbering 1, 2, 3.
What I am trying to understand is: for rows with the same values in the partition by and order by columns, does Oracle assign the row_number based on when each record entered the system? For example, for "Accounts" / "300" above it gave row_number 1 to the record that entered the system earliest, "25-Jan".
Is it clearly documented anywhere that, when the partition by and order by values are the same, the numbering is based on when those records entered the system?
What I am trying to understand is: for rows with the same values in the partition by and order by columns, does Oracle assign the row_number based on when each record entered the system?
No, it does not. SQL tables represent unordered sets. There is no ordering unless it is provided explicitly by referring to column values.
If you are sorting by values that are the same, there is no guarantee on the ordering of the rows. Note that running the same query twice can produce different results when there are ties in order by keys. It is even possible within the same query. This is true both for the order by clause and for analytic functions.
If you want a guarantee, then you need to include a unique column as the last sorting key (strictly, it need not be the last key, but any keys after a unique one have no effect, so it is effectively the last).
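A minimal sketch of the unique tie-breaker idea, using the question's sample data in SQLite (3.25 or later) through Python's sqlite3 module. The id column is hypothetical: it stands in for whatever unique key the real table has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "id" is a hypothetical unique key added for this example.
conn.execute(
    "CREATE TABLE test (id INTEGER PRIMARY KEY, dept TEXT, salary INTEGER, created TEXT)"
)
conn.executemany(
    "INSERT INTO test (id, dept, salary, created) VALUES (?, ?, ?, ?)",
    [
        (1, "HR", 500, "25-Jul"),
        (2, "HR", 200, "25-Jul"),
        (3, "HR", 500, "26-Jul"),
        (4, "Accounts", 300, "25-Jan"),
        (5, "Accounts", 300, "26-Jan"),
        (6, "Accounts", 300, "27-Jan"),
    ],
)

# salary alone leaves ties; appending the unique id makes the numbering repeatable.
rows = conn.execute("""
    SELECT dept, salary, created,
           row_number() OVER (PARTITION BY dept ORDER BY salary, id) AS rn
    FROM test
    ORDER BY dept, rn
""").fetchall()

for row in rows:
    print(row)
```

With the id as the final key, every run numbers the three Accounts rows 1, 2, 3 in id order, and the two HR rows with salary 500 always get 2 and 3 in the same order.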
You can get a repeatable result by adding the ROWID pseudocolumn as a tie-breaker. Note that ROWID identifies a row's physical location, so it is unique but not guaranteed to follow insertion order:
SELECT T.*,ROW_NUMBER() OVER(partition by Dept order by salary, ROWID) as row_number
FROM test T

How to get the most frequent value SQL

I have a table Orders(id_trip, id_order), table Trip(id_hotel, id_bus, id_type_of_trip) and table Hotel(id_hotel, name).
I would like to get name of the most frequent hotel in table Orders.
SELECT hotel.name from Orders
JOIN Trip
on Orders.id_trip = Trip.id_hotel
JOIN hotel
on trip.id_hotel = hotel.id_hotel
FROM (SELECT hotel.name, rank() over (order by cnt desc) rnk
FROM (SELECT hotel.name, count(*) cnt
FROM Orders
GROUP BY hotel.name))
WHERE rnk = 1;
The "most frequently occurring value" in a distribution is a distinct concept in statistics, with a technical name. It's called the MODE of the distribution. And Oracle has the STATS_MODE() function for it. https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions154.htm
For example, using the EMP table in the standard SCOTT schema, select stats_mode(deptno) from scott.emp will return 30 - the number of the department with the most employees. (30 is the department "name" or number, it is NOT the number of employees in that department!)
In your case:
select stats_mode(h.name) from (the rest of your query)
Note: if two or more hotels are tied for "most frequent", then STATS_MODE() will return one of them (non-deterministic). If you need all the tied values, you will need a different solution - a good example is in the documentation (linked above). This is a documented flaw in Oracle's understanding and implementation of the statistical concept.
Use FIRST for a single result:
SELECT MAX(name) KEEP (DENSE_RANK FIRST ORDER BY cnt DESC)
FROM (
SELECT hotel.name, COUNT(*) cnt
FROM orders
JOIN trip USING (id_trip)
JOIN hotel USING (id_hotel)
GROUP BY hotel.name
) t
Here is one method:
select name
from (select h.name,
row_number() over (order by count(*) desc) as seqnum -- use `rank()` if you want duplicates
from orders o join
trip t
on o.id_trip = t.id_trip join -- this seems like the right join condition
hotel h
on t.id_hotel = h.id_hotel
group by h.name
) oth
where seqnum = 1;
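Here is a small runnable sketch of the count-then-rank approach, using SQLite (3.25 or later) through Python's sqlite3 module instead of Oracle. The trip.id_trip column and all the sample rows are assumptions made for illustration; the count is computed in an inner query and ranked in an outer one to keep each step explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# trip.id_trip is assumed; the question's column list for Trip does not show it.
conn.executescript("""
    CREATE TABLE hotel (id_hotel INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE trip (id_trip INTEGER PRIMARY KEY, id_hotel INTEGER);
    CREATE TABLE orders (id_order INTEGER PRIMARY KEY, id_trip INTEGER);
""")
conn.executemany("INSERT INTO hotel VALUES (?, ?)",
                 [(1, "Astoria"), (2, "Bellevue"), (3, "Carlton")])
conn.executemany("INSERT INTO trip VALUES (?, ?)",
                 [(10, 1), (11, 2), (12, 2), (13, 3)])
# Astoria gets 3 orders, Bellevue 3, Carlton 1, so two hotels tie for first.
conn.executemany("INSERT INTO orders (id_trip) VALUES (?)",
                 [(10,), (10,), (10,), (11,), (12,), (11,), (13,)])

most_frequent = conn.execute("""
    SELECT name
    FROM (
        SELECT name, rank() OVER (ORDER BY cnt DESC) AS rnk
        FROM (
            SELECT h.name AS name, count(*) AS cnt
            FROM orders o
            JOIN trip t ON o.id_trip = t.id_trip
            JOIN hotel h ON t.id_hotel = h.id_hotel
            GROUP BY h.name
        ) counts
    ) ranked
    WHERE rnk = 1
    ORDER BY name
""").fetchall()

print(most_frequent)  # [('Astoria',), ('Bellevue',)]
```

Because rank() gives tied rows the same number, both tied hotels come back. Swapping in row_number() would return exactly one of them, chosen arbitrarily unless a tie-breaking sort key is added.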
** Getting the most recent statistical mode out of a data sample **
I know it's more than a year, but here's my answer. I came across this question hoping to find a simpler solution than what I know, but alas, nope.
I had a similar situation where I needed to get the mode from a data sample, with the requirement to get the mode of the most recently inserted value if there were multiple modes.
In such a case neither the STATS_MODE nor the LAST aggregate functions would do (as they would tend to return the first mode found, not necessarily the mode with the most recent entries.)
In my case it was easy to use the ROWNUM pseudo-column because the tables in question were performance metric tables that only experienced inserts (not updates)
In this oversimplified example, I'm using ROWNUM - it could easily be changed to a timestamp or sequence field if you have one.
SELECT VALUE
FROM
(SELECT VALUE ,
COUNT( * ) CNT,
MAX( R ) R
FROM
( SELECT VALUE, ROWNUM R FROM FOO
)
GROUP BY VALUE
ORDER BY CNT DESC,
R DESC
)
WHERE
(
ROWNUM < 2
);
That is, get the total count and max ROWNUM for each value (I'm assuming the values are discrete. If they aren't, this ain't gonna work.)
Then sort so that the ones with largest counts come first, and for those with the same count, the one with the largest ROWNUM (indicating most recent insertion in my case).
Then skim off the top row.
Your specific data model should have a way to discern the most recent (or the oldest or whatever) rows inserted in your table, and if there are collisions, then there's not much of a way other than using ROWNUM or getting a random sample of size 1.
If this doesn't work for your specific case, you'll have to create your own custom aggregator.
Now, if you don't care which mode Oracle is going to pick (your business case just requires a mode and that's it), then STATS_MODE will do fine.
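The same idea can be sketched with an explicit sequence column in place of ROWNUM, which is the variant suggested above for tables that have one. This example assumes SQLite via Python's sqlite3 module; the seq column and the sample values are hypothetical, and LIMIT 1 plays the role of the ROWNUM < 2 filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "seq" is a hypothetical insertion-order column (a sequence or timestamp in practice).
conn.execute("CREATE TABLE foo (seq INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO foo (seq, value) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "a"), (4, "b"), (5, "c")],
)

# Count each value and remember its latest seq, then sort by
# highest count first and most recent occurrence second.
row = conn.execute("""
    SELECT value
    FROM (
        SELECT value, count(*) AS cnt, max(seq) AS last_seq
        FROM foo
        GROUP BY value
    ) counts
    ORDER BY cnt DESC, last_seq DESC
    LIMIT 1
""").fetchone()

print(row[0])  # b
```

Here "a" and "b" both occur twice, and "b" wins because its latest occurrence (seq 4) is more recent than that of "a" (seq 3).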

OracleSQL: Assigning employees to groups with date values, querying current assignments by date

I have a database which consists of employees (one table) which can be assigned to groups (another table). Both are joined together via another table, employee-to-group, which lists the group id, the employee id and the start date of the assignment.
An employee always has to be assigned to a group, but the assignments can change daily. One employee could be working in group A for a day, then change to group B, and work in group C only a week later.
My task is to find out which employees are assigned to a certain group given by its name at any given date. So the input should be: group name, date and I want the output to be the data of all the employees which are part of that group at the given moment in time.
Here's an SQL fiddle with some test data:
http://sqlfiddle.com/#!9/6d0bb
I recreated the database with mysql-statements because I couldn't figure out the oracle statements, I'm sorry.
As you can see from the test data, some employees may never change groups, while others change frequently. There are also employees who are planned to change assignments in the future. The query has to account for that.
Because the application is a legacy one, the values (especially in the date field) are questionable. They are given as "days since the 1st of January, 1990", so the entry "9131" means "1st of January, 2015", 9468 would be today (2015-12-04) and 9496 would be 2016-01-01.
What I already have is code to find out the "date value" for any given date in what I call the "legacy format" of the application I'm working with (here I've just used CURRENT_DATE):
SELECT FLOOR(CURRENT_DATE - TO_DATE('1990-01-01', 'YYYY-MM-DD')) AS diffdate FROM dual
For finding out which group a certain employee is assigned to, I tried:
SELECT * FROM history h
WHERE emp_nr = 1 AND valid_from <= 9131
ORDER BY valid_from DESC
FETCH FIRST ROW ONLY;
which should return me the group which an employee is assigned to on the 1st of january 2015.
What I need help with is creating a statement that joins all the tables and does the same for a whole group instead of only one employee (there are thousands of employees in the database and I only want the data for at most 10 groups).
I'm thankful for any kind of pointers in the right direction.
Use row_number to rank your history and get the latest group, just as you did with your FETCH FIRST query:
select *
from
(
select
h.*,
row_number() over (partition by emp_nr order by valid_from desc) as rn
from history h
where valid_from <= 9131
)
where rn = 1
You can then join this result with other tables.
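A rough end-to-end sketch of that join, run in SQLite (3.25 or later) through Python's sqlite3 module since the fiddle's schema is not shown here. Only history, emp_nr and valid_from come from the question; the employee and grp tables, their columns, the helper function, and all sample rows are hypothetical.

```python
import sqlite3
from datetime import date

def to_legacy_day(d):
    """Convert a date to the legacy 'days since 1990-01-01' value."""
    return (d - date(1990, 1, 1)).days

as_of = to_legacy_day(date(2015, 1, 1))  # 9131, matching the question

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_nr INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grp (group_id INTEGER PRIMARY KEY, group_name TEXT);
    CREATE TABLE history (emp_nr INTEGER, group_id INTEGER, valid_from INTEGER);
""")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cid")])
conn.executemany("INSERT INTO grp VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.executemany("INSERT INTO history VALUES (?, ?, ?)", [
    (1, 1, 8000), (1, 2, 9300),   # Ann: A, planned move to B in the future
    (2, 1, 8000), (2, 2, 9000),   # Bob: A, then B
    (3, 2, 8500), (3, 1, 9200),   # Cid: B, planned move to A in the future
])

# Pick each employee's latest assignment on or before the date first,
# and only then filter by group name. Filtering by group inside the
# subquery would wrongly match employees who have since left the group.
members = conn.execute("""
    SELECT e.emp_nr, e.name
    FROM (
        SELECT h.*,
               row_number() OVER (PARTITION BY emp_nr ORDER BY valid_from DESC) AS rn
        FROM history h
        WHERE valid_from <= ?
    ) cur
    JOIN employee e ON e.emp_nr = cur.emp_nr
    JOIN grp g ON g.group_id = cur.group_id
    WHERE cur.rn = 1 AND g.group_name = ?
    ORDER BY e.emp_nr
""", (as_of, "B")).fetchall()

print(members)  # [(2, 'Bob'), (3, 'Cid')]
```

On day 9131 Bob and Cid are in group B, while Ann is still in group A because her move to B only takes effect on day 9300.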

Controlling result of oracle query

I have a schema like this
create table sample(id number ,name varchar2(30),mark number);
Now I have to return the names with the top three marks. How can I write a SQL query for this?
If I use max(mark) it will return only the maximum, and
select name from sample
returns all the names! I tried many ways but was unable to limit the result to 3 rows.
Please suggest a way to solve this problem.
How do you want to handle ties? If Mary gets a mark of 100, Tom gets a mark of 95, and John and Dave both get a mark of 90, what results do you want, for example? Do you want both John and Dave to be returned since they both tied for third? Or do you want to pick one of the two so that the result always has exactly three rows? What happens if Beth also tied for second with a score of 95? Do you still consider John and Dave tied for third place or do you consider them tied for fourth place?
You can use analytic functions to get the top N results, though which analytic function you pick depends on how you want to resolve ties.
SELECT id,
name,
mark
FROM (SELECT id,
name,
mark,
rank() over (order by mark desc) rnk
FROM sample)
WHERE rnk <= 3
will return the top three rows using the RANK analytic function to rank them by MARK. RANK returns the same rank for people that are tied and uses the standard sports approach to determining your rank so that if two people tie for second, the next competitor is in fourth place, not third. DENSE_RANK ensures that numeric ranks are not skipped so that if two people tie for second, the next row is third. ROW_NUMBER assigns each row a different rank by arbitrarily breaking ties.
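To compare the three functions on the tie scenario described above (Mary 100, Tom and Beth 95, John and Dave 90), here is a minimal sketch in SQLite (3.25 or later) via Python's sqlite3 module. SQLite is an assumption for runnability; the id values are made up, and name is added as a tie-breaker for row_number only so the output is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER, name TEXT, mark INTEGER)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [(1, "Mary", 100), (2, "Tom", 95), (3, "Beth", 95),
     (4, "John", 90), (5, "Dave", 90)],
)

rows = conn.execute("""
    SELECT name, mark,
           rank()       OVER (ORDER BY mark DESC)       AS rnk,
           dense_rank() OVER (ORDER BY mark DESC)       AS dense_rnk,
           row_number() OVER (ORDER BY mark DESC, name) AS rn
    FROM sample
    ORDER BY mark DESC, name
""").fetchall()

for row in rows:
    print(row)
# ('Mary', 100, 1, 1, 1)
# ('Beth', 95, 2, 2, 2)
# ('Tom', 95, 2, 2, 3)
# ('Dave', 90, 4, 3, 4)
# ('John', 90, 4, 3, 5)

# How many rows each "<= 3" filter would keep.
kept = {
    "rank": sum(1 for r in rows if r[2] <= 3),
    "dense_rank": sum(1 for r in rows if r[3] <= 3),
    "row_number": sum(1 for r in rows if r[4] <= 3),
}
print(kept)  # {'rank': 3, 'dense_rank': 5, 'row_number': 3}
```

With this data, filtering on rank keeps three rows because the two students at 90 are ranked fourth, dense_rank keeps all five because they are ranked third, and row_number always keeps exactly three.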
If you really want to use ROWNUM rather than analytic functions, you can also do
SELECT id,
name,
mark
FROM (SELECT id,
name,
mark
FROM sample
ORDER BY mark DESC)
WHERE rownum <= 3
You cannot, however, have the ROWNUM predicate at the same level as the ORDER BY clause since the predicate is applied before the ordering.
SELECT t2.name
FROM
(
SELECT t.*
FROM sample t
ORDER BY mark DESC
) t2
WHERE ROWNUM <= 3