Using Sub-queries with comparison - Oracle - sql

I've been trying to manipulate a result into these three queries and I don't know what's wrong that I'm doing
List all academic members participating in less than three groups.
List the academic ID leading the maximum number of groups
with this query ( for the first part )
SELECT a.ID , min(a.name) as Name
FROM Academic a , researchGroup r
WHERE count(r.managerID)>3
GROUP BY a.ID;
but It doesn't seem to work .
I have this relational schema
researchGroup(name (P.R Key Composite) , codeD , mainResearchArea , managerID /* forgien key with AcademicStaff(ID) */ , labID (P.R Key Composite) )
AcademicStaff(ID {PR KEY} , name)
any solutions ?

The following will give you a list of academics and the number of research groups managed:
SELECT
*
FROM
(
SELECT
ac.ID AS academic_id
,MAX(ac.name) AS academic_name
,COUNT(rg.managerID) AS num_groups_managed
,DENSE_RANK() OVER (ORDER BY COUNT(rg.managerID) DESC) AS academic_rank
FROM
Academic AS ac
INNER JOIN
researchGroup AS rg
ON (rg.managerID = ac.ID)
GROUP BY
ac.ID
) AS subquery
WHERE
--** uncomment the following line for the academics managing above 3 groups
--num_groups_managed >= 3
--** or uncomment the following line for the top-ranked academics (there could be more than 1)
--academic_rank = 1
ORDER BY
academic_rank ASC
,academic_name ASC
Uncommenting the relevant part of the WHERE clause will give you the results that you want.
Incidentally, it's a while since I've used Oracle SQL, so excuse any small syntax errors (in particular, I can't remember whether Oracle accepts the keyword AS after the table name).

Related

Easier way to limit rows in SELECT subquery?

I perform queries on an Oracle database. Let's say I have a table, PEOPLE. Each person can have multiple reference numbers. The reference numbers are stored in a different table, REFERENCENUMBERS.
REFERENCENUMBERS contains a column, PERSON_ID, which is identical to the ID column of the PEOPLE table. It is through this ID that the tables are joined.
Let's say I want to perform a query on the PEOPLE table. However I only want a single reference number returned per person record: i.e if a person has multiple reference numbers, I don't want multiple rows returned per person per reference number.
I choose a criterion for how to select only one reference number: the one which was created earliest. The date of reference number creation is stored in the REFERENCENUMBERS table as DATECREATED.
The following code does this job:
SELECT
PEOPLE.ID,
PEOPLE.NAME,
PEOPLE.AGE,
PEOPLE.ADDRESS,
-- Subquery to return the earliest-created reference number for this person
(
SELECT
REFERENCENUMBERS.NUMBER
FROM
REFERENCENUMBERS
WHERE
REFERENCENUMBERS.PERSON_ID = PEOPLE.ID -- Link back to the main people ID
AND REFERENCENUMBERS.DATECREATED =
-- Sub-sub query simply to match the earliest date
(
SELECT
MIN(R.DATECREATED) -- To ensure that only the earliest-created reference number is returned.
FROM
REFERENCENUMBERS R -- Give this sub-sub query an alias for the table
WHERE
R.PERSON_ID = PEOPLE.ID -- Link back to the main people ID
)
)
FROM
PEOPLE
WHERE
PEOPLE.AGE > 18 -- Or whatever
However, my question to you knowledgeable SQL people, is.. is there an easier way of doing this? It just appears cumbersome to have to include a sub-sub-query solely for the purpose of finding the earliest date, and limiting the WHERE clause of the sub-query.
There must be an easier, or cleaner way of doing this. Any suggestions?
(By the way - the sample code is greatly simplified from what I'm actually working on. Please don't provide answers which substantively modify my primary query with different-style JOINs etc - thanks).
The simplest would be a top-n filter:
select people.id
, people.name
, people.age
, people.address
, ( select referencenumbers.number
from referencenumbers
where referencenumbers.person_id = people.id
order by referencenumbers.datecreated
fetch first row only )
from people
where people.age > 18;
More details here (requires Oracle 12.1 or later.)
Or this (works in earlier versions):
select people.id
, people.name
, people.age
, people.address
, ( select min(rn.person_id) keep (dense_rank first order by rn.datecreated)
from referencenumbers rn
where rn.person_id = people.id )
from people
where people.age > 18;
(I gave referencenumbers a shorter alias for readability.)
Try this
SELECT
PEOPLE.ID,
PEOPLE.NAME,
PEOPLE.AGE,
PEOPLE.ADDRESS,
REFERENCENUMBERS.NUMBER
FROM PEOPLE
JOIN REFERENCENUMBERS ON REFERENCENUMBERS.PERSON_ID = PEOPLE.ID -- Link back to the main people ID
JOIN
(
SELECT
R.PERSON_ID,
MIN(R.DATECREATED) minc -- To ensure that only the earliest-created reference number is returned.
FROM
REFERENCENUMBERS R -- Give this sub-sub query an alias for the table
GROUP BY R.PERSON_ID
) t ON t.minc = REFERENCENUMBERS.DATECREATED and
t.PERSON_ID = REFERENCENUMBERS.PERSON_ID
WHERE
PEOPLE.AGE > 18 -- Or whatever

How to refactor complicated SQL query which is broken

Here is the simplified model of the domain
In a nutshell, unit grants documents to to a customer. There are two types of units: main units and their child units. Both belong to the same province, and to one province may belong multiple cities. Document has numerous events (processing history). Customer belongs to one city and province.
I have to write query, which returns random set of documents, given a target main unit code. Here is the criteria:
Return 10 documents where the newest event_code = 10
Each document must belong to a different customer living in any city of the unit's region (prefer different cities)
Return the Customers newest Document which meets the criteria
There must be both document types present in the result
Result (customers chosen) should be random with each query
But...
If there's not enough customers, try to use multiple documents of the same customer as a last resort
If there aren't enough documents either, return as much as possible
If there's not a single instance of another document type, then return all the same
There may be million of rows, and the query must be as fast as possible, it is executed frequently.
I'm not sure how to structure this kind of complex query in a sane manner. I'm using Oracle and PL/SQL. Here is something I tried, but it isn't working as expected (returns wrong data). How should I refactor this query and get the random result, and also honor all those borderline rules? I'm also worried about the performance regarding the joins and wheres.
CURSOR c_documents IS
WITH documents_cte AS
SELECT d.document_id AS document_id, d.create_dt AS create_dt,
c.customer_id
FROM documents d
JOIN customers c ON (c.customer_id = d.customer_id AND
c.province_id = (SELECT region_id FROM unit WHERE unit_code = 1234))
WHERE exists (
SELECT 1
FROM event
where document_id = d.document_id AND
event_code = 10
AND create_dt =
SELECT MAX(create_dt)
FROM event
WHERE document_id = d.document_id)
SELECT * FROM documents_cte d
WHERE create_dt = (SELECT MAX(create_dt)
from documents_cte
WHERE customer_id = d.customer_id)
How to correctly make this query with efficiency, randomness in mind? I'm not asking for exact solution, but guidelines at least.
I'd avoid hierarchic tables whenever possible. In your case you are using a hierarchic table to allow for an unlimited depth, but at last it's just two levels you store: provinces and their cities. That should better be just two tables: one for provinces and one for cities. Not a big deal, but that would make your data model simpler and easier to query.
Below I am starting with a WITH clause to get a city table, as such doesn't exist. Then I go step by step: get the customers belonging to the unit, then get their documents and rank them. At last I select the ranked documents and randomly take 10 of the best ranked ones.
with cities as
(
select
c.region_id as city_id,
o.region_id as province_id
from region c
join region p on p.region_id = c.parent_region_id
)
, unit_customers as
(
select customer_id
from customer
where city_id in
(
select city_id
from cities
where
(
select region_id
from unit
where unit_code = 1234
) in (city_id, province_id)
)
)
, ranked_documents as
(
select
document.*,
row_number(partition by customer_id order by create_dt desc) as rn
from document
where customer_id in -- customers belonging to the unit
(
select customer_id
from unit_customers
)
and document_id in -- documents with latest event code = 10
(
select document_id
from event
group by document_id
having max(event_code) keep (dense_rank last order by create_dt) = 10
)
)
select *
from ranked_documents
order by rn, dbms_random.value
fetch first 10 rows only;
This doesn't take into account to get both document types, as this contradicts the rule to get the latest documents per customer.
FETCH FIRST is availavle as of Oracle 12c. In earlier versions you would use one more subquery and another ROW_NUMBER instead.
As to speed, I'd recommend these indexes for the query:
create index idx_r1 on region(region_id); -- already exists for region_id = primary key
create index idx_r2 on region(parent_region_id, region_id);
create index idx_u1 on unit(unit_code, region_id);
create index idx_c1 on customer(city_id, customer_id);
create index idx_e1 on event(document_id, create_dt, event_code);
create index idx_d1 on document(document_id, customer_id, create_dt);
create index idx_d2 on document(customer_id, document_id, create_dt);
One of the last two will be used, the other not. Check which with EXPLAIN PLAN and drop the unused one.

Efficient approach to get two-dimensional datau using

For the sake of example, let's say I have the following models:
teams
each team has an arbitrary amount of fans
In SQL, this means you end up with the following tables:
team: identifier, name
fan: identifier, name
team_fan: team_identifier, fan_identifier
I am looking for an approach to retrieve:
all teams, and
for each team, the first 5 fans of which his/her name starts with an 'A'.
What is an efficient approach to do this?
In my current naive approach, I do <# teams> + 1 queries, which is troublesome:
First: SELECT * FROM team
Then, for each team with identifier X:
SELECT *
FROM fan
INNER JOIN team_fan
ON fan.identifier = team_fan.fan_identifier AND team_fan.team_identifier = X
WHERE fan.name LIKE 'A%'
ORDER BY fan.name LIMIT 5
There should be a better way to do this.
I could first retrieve all teams, as I do now, and then do something like:
SELECT *
FROM fan
WHERE fan.name LIKE 'A%'
AND fan.identifier IN (
SELECT fan_identifier
FROM team_fan
WHERE team_identifier IN (<all team identifiers from first query>))
ORDER BY fan.name
However, this approach ignores the requirement that I need the first 5 fans for each team with his/her name starting with an 'A'. Just adding LIMIT 5 to the query above is not correct.
Also, with this approach, if I have a large amount of teams, I am sending the corresponding team identifiers back to the database in the second query (for the IN (<all team identifiers from first query>)), which might kill performance?
I am developing against PostgreSQL, Java, Spring and plain JDBC.
You need a three table join
SELECT team.*, fan.*
FROM team
JOIN team_fan
ON team.team_identifier = team_fan.team_identifier
JOIN fan
ON fan.fan_identifier = team_fan.fan_identifier
Now to filter you need to do this.
with cte as (
SELECT team.*, fan.*,
row_number() over (partition by team.team_identifier
order by fan.name) as rn
FROM team
JOIN team_fan
ON team.team_identifier = team_fan.team_identifier
JOIN fan
ON fan.fan_identifier = team_fan.fan_identifier
WHERE fan.name LIKE 'A%'
)
SELECT *
FROM cte
WHERE rn <= 5
Usually, RDBMSes have their own hacks around standard SQL that allows you to have a number in a count over some condition of grouping/ordering.
Postgres is no exception, it got ROW_NUMBER() function.
What you need is to partition your row numbers properly, order them by alphabet and restrict the query to row numbers < 6.

How to get the most frequent value SQL

I have a table Orders(id_trip, id_order), table Trip(id_hotel, id_bus, id_type_of_trip) and table Hotel(id_hotel, name).
I would like to get name of the most frequent hotel in table Orders.
SELECT hotel.name from Orders
JOIN Trip
on Orders.id_trip = Trip.id_hotel
JOIN hotel
on trip.id_hotel = hotel.id_hotel
FROM (SELECT hotel.name, rank() over (order by cnt desc) rnk
FROM (SELECT hotel.name, count(*) cnt
FROM Orders
GROUP BY hotel.name))
WHERE rnk = 1;
The "most frequently occurring value" in a distribution is a distinct concept in statistics, with a technical name. It's called the MODE of the distribution. And Oracle has the STATS_MODE() function for it. https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions154.htm
For example, using the EMP table in the standard SCOTT schema, select stats_mode(deptno) from scott.emp will return 30 - the number of the department with the most employees. (30 is the department "name" or number, it is NOT the number of employees in that department!)
In your case:
select stats_mode(h.name) from (the rest of your query)
Note: if two or more hotels are tied for "most frequent", then STATS_MODE() will return one of them (non-deterministic). If you need all the tied values, you will need a different solution - a good example is in the documentation (linked above). This is a documented flaw in Oracle's understanding and implementation of the statistical concept.
Use FIRST for a single result:
SELECT MAX(hotel.name) KEEP (DENSE_RANK FIRST ORDER BY cnt DESC)
FROM (
SELECT hotel.name, COUNT(*) cnt
FROM orders
JOIN trip USING (id_trip)
JOIN hotel USING (id_hotel)
GROUP BY hotel.name
) t
Here is one method:
select name
from (select h.name,
row_number() over (order by count(*) desc) as seqnum -- use `rank()` if you want duplicates
from orders o join
trip t
on o.id_trip = t.id_trip join -- this seems like the right join condition
hotels h
on t.id_hotel = h.id_hotel
) oth
where seqnum = 1;
** Getting the most recent statistical mode out of a data sample **
I know it's more than a year, but here's my answer. I came across this question hoping to find a simpler solution than what I know, but alas, nope.
I had a similar situation where I needed to get the mode from a data sample, with the requirement to get the mode of the most recently inserted value if there were multiple modes.
In such a case neither the STATS_MODE nor the LAST aggregate functions would do (as they would tend to return the first mode found, not necessarily the mode with the most recent entries.)
In my case it was easy to use the ROWNUM pseudo-column because the tables in question were performance metric tables that only experienced inserts (not updates)
In this oversimplified example, I'm using ROWNUM - it could easily be changed to a timestamp or sequence field if you have one.
SELECT VALUE
FROM
(SELECT VALUE ,
COUNT( * ) CNT,
MAX( R ) R
FROM
( SELECT ID, ROWNUM R FROM FOO
)
GROUP BY ID
ORDER BY CNT DESC,
R DESC
)
WHERE
(
ROWNUM < 2
);
That is, get the total count and max ROWNUM for each value (I'm assuming the values are discrete. If they aren't, this ain't gonna work.)
Then sort so that the ones with largest counts come first, and for those with the same count, the one with the largest ROWNUM (indicating most recent insertion in my case).
Then skim off the top row.
Your specific data model should have a way to discern the most recent (or the oldest or whatever) rows inserted in your table, and if there are collisions, then there's not much of a way other than using ROWNUM or getting a random sample of size 1.
If this doesn't work for your specific case, you'll have to create your own custom aggregator.
Now, if you don't care which mode Oracle is going to pick (your bizness case just requires a mode and that's it, then STATS_MODE will do fine.

Compare 2 columns in 2 tables with DISTINCT value

I am now creating a reporting service with visual business intelligent.
i try to count how many users have been created under an org_id.
but the report consist of multiple org_id. and i have difficulties on counting how many has been created under that particular org_id.
TBL_USER
USER_ID
0001122
0001234
ABC9999
DEF4545
DEF7676
TBL_ORG
ORG_ID
000
ABC
DEF
EXPECTED OUTPUT
TBL_RESULT
USER_CREATED
000 - 2
ABC - 1
DEF - 2
in my understanding, i need nested SELECT, but so far i have come to nothing.
SELECT COUNT(TBL_USER.USER_ID) AS Expr1
FROM TBL_USER INNER JOIN TBL_ORG
WHERE TBL_USER.USER_ID LIKE 'TBL_ORG.ORG_ID%')
this is totally wrong. but i hope it might give us clue.
It looks like the USER_ID value is the concatenation of your ORG_ID and something to make it unique. I'm assuming this is from a COTS product and nothing a human would have built.
Your desire is to find out how many entries there are by department. In SQL, when you read the word by in a requirement, that implies grouping. The action you want to take is to get a count and the reserved word for that is COUNT. Unless you need something out of the TBL_ORG, I see no need to join to it
SELECT
LEFT(T.USER_ID, 3) AS USER_CREATED
, COUNT(1) AS GroupCount
FROM
TBL_USER AS T
GROUP BY
LEFT(T.USER_ID, 3)
Anything that isn't in an aggregate (COUNT, SUM, AVG, etc) must be in your GROUP BY.
SQLFiddle
I updated the fiddle to also show how you could link to TBL_ORG if you need an element from the row in that table.
-- Need to have the friendly name for an org
-- Now we need to do the join
SELECT
LEFT(T.USER_ID, 3) AS USER_CREATED
, O.SOMETHING_ELSE
, COUNT(1) AS GroupCount
FROM
TBL_USER AS T
-- inner join assumes there will always be a match
INNER JOIN
TBL_ORG AS O
-- Using a function on a column is a performance killer
ON O.ORG_ID = LEFT(T.USER_ID, 3)
GROUP BY
LEFT(T.USER_ID, 3)
, O.SOMETHING_ELSE;