How would you explain this query in layman's terms?

Here is the database I'm using: https://drive.google.com/file/d/1ArJekOQpal0JFIr1h3NXYcFVngnCNUxg/view?usp=sharing
select distinct
    AC1.givename, AC1.famname, AC2.givename, AC2.famname
from
    academic AC1, author AU1, academic AC2, author AU2
where
    AC1.acnum = AU1.acnum
    and AC2.acnum = AU2.acnum
    and AU1.panum = AU2.panum
    and AU2.acnum > AU1.acnum
    and not exists (select *
                    from Interest I1, Interest I2
                    where I1.acnum = AC1.acnum
                    and I2.acnum = AC2.acnum);
I'm having trouble explaining what the subquery and the whole query return in layman's terms (plain English).
Not sure if my explanation is right:
"The subquery finds the interested fields where two authors have no common field of interest.
The whole query finds the first and last names of the authors of papers which have at least two authors, and have no common field of interest."

As it currently stands, the subquery produces rows whenever both academics have at least one interest each. Nothing ties I1 to I2, so the interests don't have to match.
So overall, the query is "produce pairs of academics who co-authored at least one paper and where at least one of them has no interests whatsoever". It's difficult to believe that was the intent, and if it was, there are clearer ways of writing it that make that intent obvious.
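That reading can be checked on a toy database. The following is a minimal sketch using Python's standard sqlite3 module; the table layout is inferred from the query, the `field` column name is an assumption, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema, reduced to the columns the query touches.
    CREATE TABLE academic (acnum INTEGER PRIMARY KEY, givename TEXT, famname TEXT);
    CREATE TABLE author   (panum INTEGER, acnum INTEGER);
    CREATE TABLE interest (field TEXT, acnum INTEGER);

    -- Invented data: Ann and Bob share no field; Cy has no interests at all.
    INSERT INTO academic VALUES (1, 'Ann', 'A'), (2, 'Bob', 'B'), (3, 'Cy', 'C');
    INSERT INTO author   VALUES (10, 1), (10, 2), (11, 1), (11, 3);
    INSERT INTO interest VALUES ('databases', 1), ('networks', 2);
""")

# The query from the question, unchanged apart from layout.
original_query = """
    SELECT DISTINCT AC1.givename, AC1.famname, AC2.givename, AC2.famname
    FROM academic AC1, author AU1, academic AC2, author AU2
    WHERE AC1.acnum = AU1.acnum
      AND AC2.acnum = AU2.acnum
      AND AU1.panum = AU2.panum
      AND AU2.acnum > AU1.acnum
      AND NOT EXISTS (SELECT *
                      FROM interest I1, interest I2
                      WHERE I1.acnum = AC1.acnum
                        AND I2.acnum = AC2.acnum)
"""
pairs = conn.execute(original_query).fetchall()
print(pairs)
```

With this data only the Ann/Cy pair comes back. Ann and Bob have no field in common, yet they are filtered out because each of them has some interest.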
If that's the query we want, though, I'd write it as:
SELECT
    AC1.givename, AC1.famname, AC2.givename, AC2.famname
FROM
    academic AC1
        inner join
    academic AC2
        on
            AC1.acnum < AC2.acnum
WHERE EXISTS
    (select * from author au1 inner join author au2 on au1.panum = au2.panum
     where au1.acnum = ac1.acnum and au2.acnum = ac2.acnum)
AND
    (
        NOT EXISTS (select * from interest i where i.acnum = ac1.acnum)
        OR
        NOT EXISTS (select * from interest i where i.acnum = ac2.acnum)
    )
If, as is more likely, we wanted pairs of co-authors who have no interests in common, we would write something like:
SELECT
    AC1.givename, AC1.famname, AC2.givename, AC2.famname
FROM
    academic AC1
        inner join
    academic AC2
        on
            AC1.acnum < AC2.acnum
WHERE EXISTS
    (select * from author au1 inner join author au2 on au1.panum = au2.panum
     where au1.acnum = ac1.acnum and au2.acnum = ac2.acnum)
AND NOT EXISTS
    (select * from interest i1 inner join interest i2 on i1.field = i2.field
     where i1.acnum = ac1.acnum and i2.acnum = ac2.acnum)
Notice how neither of my queries uses distinct, because we've made sure that the outer query isn't joining additional rows where we only care about the existence or absence of those rows - we've moved all such checks into EXISTS subqueries.
I generally see distinct used far too often: the author is getting multiple results when they only want a single one, and is unwilling to expend the effort to discover why. In this case, the duplicates would come from the same pair of academics having co-authored more than one paper.
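The effect of moving the checks into EXISTS can be seen on a small example. This is a sketch with Python's standard sqlite3 module; the schema is reduced to what the queries need and the data is invented (two academics who co-authored two papers and share no field).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed, reduced schema; rows are invented for illustration.
    CREATE TABLE academic (acnum INTEGER PRIMARY KEY, givename TEXT, famname TEXT);
    CREATE TABLE author   (panum INTEGER, acnum INTEGER);
    CREATE TABLE interest (field TEXT, acnum INTEGER);

    INSERT INTO academic VALUES (1, 'Ann', 'A'), (2, 'Bob', 'B');
    -- Ann and Bob co-authored two papers (10 and 11).
    INSERT INTO author   VALUES (10, 1), (10, 2), (11, 1), (11, 2);
    INSERT INTO interest VALUES ('databases', 1), ('networks', 2);
""")

# Author rows joined into the outer query: one output row per shared paper.
joined = conn.execute("""
    SELECT AC1.givename, AC2.givename
    FROM academic AC1, author AU1, academic AC2, author AU2
    WHERE AC1.acnum = AU1.acnum AND AC2.acnum = AU2.acnum
      AND AU1.panum = AU2.panum AND AU2.acnum > AU1.acnum
""").fetchall()

# Existence checks only: one output row per pair of academics.
exists_based = conn.execute("""
    SELECT AC1.givename, AC2.givename
    FROM academic AC1
    INNER JOIN academic AC2 ON AC1.acnum < AC2.acnum
    WHERE EXISTS (SELECT * FROM author au1
                  INNER JOIN author au2 ON au1.panum = au2.panum
                  WHERE au1.acnum = AC1.acnum AND au2.acnum = AC2.acnum)
      AND NOT EXISTS (SELECT * FROM interest i1
                      INNER JOIN interest i2 ON i1.field = i2.field
                      WHERE i1.acnum = AC1.acnum AND i2.acnum = AC2.acnum)
""").fetchall()

print(len(joined), len(exists_based))
```

The join-based form returns the Ann/Bob pair twice, once per shared paper, which is what DISTINCT was papering over. The EXISTS-based form returns it once.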


How do I make a query shorter and neater?

I'm trying to make this query more understandable and neater, but I'm not sure how.
SELECT a.Patient_id, COUNT (p.Person_id) AS "Number of Operations", SUM (w.Daily_charge * (a.Discharge_date - a.Admission_date) + ot.Theatre_fee + b.Charges + c.Charges ) AS "Total Payment"
FROM person p, admission a, ward w, operation o, operation_type ot, staff b, staff c
WHERE w.Ward_code = a.Ward_code AND p.Person_id = a.Patient_id
AND a.Admission_id = o.Admission_id AND ot.Op_code = o.Actual_op
AND o.Surgeon = b.Person_id AND o.Anaesthetist = c.Person_id
GROUP BY a.Patient_id, p.Person_id
ORDER BY COUNT (p.Person_id) DESC FETCH FIRST 1 ROWS ONLY;
Any decent formatter would do it for you.
Other than that:
use JOIN instead of comma-separated tables in the FROM clause
remove p.person_id from the GROUP BY clause; it serves no purpose there because it is equal to a.patient_id (which is already in the clause) and is not part of the select statement's column list
So:
select a.patient_id,
       count (p.person_id) as "number of operations",
       sum (w.daily_charge * (a.discharge_date - a.admission_date) +
            ot.theatre_fee + b.charges + c.charges
           ) as "total payment"
from person p join admission a on a.patient_id = p.person_id
              join ward w on w.ward_code = a.ward_code
              join operation o on o.admission_id = a.admission_id
              join operation_type ot on ot.op_code = o.actual_op
              join staff b on o.surgeon = b.person_id
              join staff c on o.anaesthetist = c.person_id
group by a.patient_id
order by count (p.person_id) desc
fetch first 1 rows only;
Sadly there is no perfect formatter, although as Ed mentioned in the comments, they can be a start if you review the settings carefully. (It's a tradition in the industry that the default formatter settings are always horrible.)
It's also been said (I think by Steven Feuerstein) that you should only set formatting rules that are supported by your formatter, and of course he makes a good point. But taken with the limitations of all formatters, an industry tradition for horrible formatting and the impossibility of consistent rules for formatting SQL anyway, that puts us PL/SQL developers in a difficult position.
I'd say the first principle of computer code layout is to use vertically aligned blocks to indicate dependency levels (similar to the grids used in graphic design). A lot of the choices then become about how to apply that principle.
We then need to separate the code into logical sections, but at the same time not let it sprawl down the page unnecessarily. I think this is difficult for automated formatters as the rules become a bit fuzzy. For example, I keep a join with only one condition on one line, but if there is more than one condition I start splitting it onto multiple lines, one per ON or AND keyword. The same goes for your complex sum() expression: normally I would place it all on one line, but if it aids readability then I split it up.
Finally, opinions vary on where to place commas in stacked lists, of which SQL has a lot. I say they go on the left, to act like bullet points and also make it easier to add items to the ends of lists. Others will disagree.
select ad.patient_id
     , count(*) as "Number of Operations"
     , sum(
           wa.daily_charge * (ad.discharge_date - ad.admission_date)
           + ot.theatre_fee + ss.charges + sa.charges
       ) as "Total Payment"
from   person pr
       join admission ad on ad.patient_id = pr.person_id
       join ward wa on wa.ward_code = ad.ward_code
       join operation op on op.admission_id = ad.admission_id
       join operation_type ot on ot.op_code = op.actual_op
       join staff ss on ss.person_id = op.surgeon
       join staff sa on sa.person_id = op.anaesthetist
group by ad.patient_id
order by count(*) desc
fetch first row only;

Two almost identical queries returning different results

I am getting different results for the following two queries and I have no idea why. The only difference is one has an IN and one has an equals.
Before I go into the queries, you should know that I found a better way to do it by moving the subquery into a common table expression, but this is still driving me crazy! I really want to know what caused the issue in the first place; I am asking out of curiosity.
Here's the first query:
use [DB.90_39733]
Select distinct x.uniqproducer, cn.Firstname,cn.lastname,e.code,
ecn.FirstName, ecn.LastName, ecn.entid, x.uniqline
from product x
join employ e on e.EmpID=x.uniqproducer
join contactname cn on cn.uniqentity=e.uniqentity
join [ETL_GAWR92]..idlookupentity ide on ide.enttype='EM'
and ide.UniqEntity=e.UniqEntity
left join [ETL_GAWR92]..EntConName ecn on ecn.entid=ide.empid
and ecn.opt='Y'
Where x.UniqProducer =(SELECT TOP 1 idl.UniqEntity
FROM [ETL_GAWR92]..IDLookupEntity idl
LEFT JOIN [ETL_GAWR92]..Employ e2 ON e2.ProdID = ''
WHERE idl.empID = e2.EmpID AND
idl.EntType = 'EM')
And the second one:
use [DB.90_39733]
Select distinct x.uniqproducer, cn.Firstname,cn.lastname,e.code,
ecn.FirstName, ecn.LastName, ecn.entid, x.uniqline
from product x
join employ e on e.EmpID=x.uniqproducer
join contactname cn on cn.uniqentity=e.uniqentity
join [ETL_GAWR92]..idlookupentity ide on ide.enttype='EM'
and ide.UniqEntity=e.UniqEntity
left join [ETL_GAWR92]..EntConName ecn on ecn.entid=ide.empid
and ecn.opt='Y'
Where x.UniqProducer IN (SELECT TOP 1 idl.UniqEntity
FROM [ETL_GAWR92]..IDLookupEntity idl
LEFT JOIN [ETL_GAWR92]..Employ e2 ON e2.ProdID = ''
WHERE idl.empID = e2.EmpID AND
idl.EntType = 'EM')
The first query returns 0 rows while the second query returns 2 rows. The only difference is x.UniqProducer = versus x.UniqProducer IN in the last where clause.
Thanks for your time
SELECT TOP 1 doesn't guarantee that the same record will be returned each time.
Add an ORDER BY to your select to make sure the same record is returned.
(SELECT TOP 1 idl.UniqEntity
FROM [ETL_GAWR92]..IDLookupEntity idl
LEFT JOIN [ETL_GAWR92]..Employ e2 ON e2.ProdID = ''
WHERE idl.empID = e2.EmpID AND
idl.EntType = 'EM' ORDER BY idl.UniqEntity)
I would guess (with strong emphasis on the word “guess”) that the reason lies in how = and IN are processed by the query engine. For =, SQL knows it needs to compare against a specific value, whereas for IN, SQL knows it needs to build a subset and check whether the "outer" value is in that "inner" subset. Yes, the end results should be the same, as the subquery returns only one row, but as @RickS pointed out, without any ordering there’s no guarantee of which value ends up “on top”, and the (sub)query plan used to build the IN-driven subquery might differ from the one used for the = comparison.
A follow-up question: which is the correct dataset? When you analyze the actual data, should you have gotten zero, two, or a different number of rows?
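As a rough illustration of the ORDER BY advice, here is a sketch using Python's standard sqlite3 module. It is not SQL Server: SQLite's LIMIT 1 stands in for TOP 1, the two tables are hypothetical simplifications of the ones in the question, and the sketch does not reproduce the discrepancy itself. It only shows that once the subquery is pinned with ORDER BY, the = and IN forms are asking for the same thing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, heavily simplified stand-ins for the tables in the question.
    CREATE TABLE product        (uniqproducer INTEGER, uniqline INTEGER);
    CREATE TABLE idlookupentity (uniqentity INTEGER, enttype TEXT);

    INSERT INTO product        VALUES (7, 1), (7, 2), (9, 3);
    INSERT INTO idlookupentity VALUES (9, 'EM'), (7, 'EM'), (5, 'XX');
""")

# ORDER BY makes "the first row" well defined: the smallest matching uniqentity.
# LIMIT 1 is SQLite's counterpart of SQL Server's TOP 1.
pinned_subquery = """
    SELECT idl.uniqentity
    FROM idlookupentity idl
    WHERE idl.enttype = 'EM'
    ORDER BY idl.uniqentity
    LIMIT 1
"""

equals_rows = conn.execute(
    f"SELECT uniqline FROM product WHERE uniqproducer = ({pinned_subquery}) ORDER BY uniqline"
).fetchall()
in_rows = conn.execute(
    f"SELECT uniqline FROM product WHERE uniqproducer IN ({pinned_subquery}) ORDER BY uniqline"
).fetchall()

print(equals_rows, in_rows)
```

Both forms return the two product lines of producer 7 here. Without the ORDER BY, which row counts as the first one is left to the engine.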

I get the same object twice

I am trying to get all the lessons of the students that have a grade that contains a certain term.
The orange relations are the relevant relations:
The query:
SELECT
tg.nhsColor AS cellColor,
tg.nhsTgradeName AS LessonName,
lsons.nhsLessonID AS LessonID,
lsons.nhsTgradeID AS TgradeID,
lsons.nhsDay AS nhsDay,
lsons.nhsHour AS nhsHour,
tg.nhsTeacherID AS TeacherID
FROM
nhsTeacherGrades AS tg,
nhsLessons AS lsons,
nhsLearnGroups,
nhsMembers AS mem,
nhsGrades AS grd
WHERE
tg.nhsTgradeID = lsons.nhsTgradeID
AND nhsLearnGroups.nhsTgradeID = tg.nhsTgradeID
AND mem.nhsUserID = nhsLearnGroups.nhsStudentID
AND mem.nhsGradeID = grd.nhsGradeID
AND grd.nhsGradeName LIKE '%"+gradePart+"%'
The query works, yet I get the same lesson twice from this query.
You can get duplicates for at least two reasons:
the same lessons can occur in different teacher grades followed by a certain student
different students can follow the same teacher grade
The following (untested) nested SQL could solve this. It gets the teacher grade ID of each lesson and checks which of these have at least one viable student linked to it:
SELECT tg.nhsColor AS cellColor,
       tg.nhsTgradeName AS LessonName,
       lsons.nhsLessonID AS LessonID,
       lsons.nhsTgradeID AS TgradeID,
       lsons.nhsDay AS nhsDay,
       lsons.nhsHour AS nhsHour,
       tg.nhsTeacherID AS TeacherID
FROM nhsLessons AS lsons
INNER JOIN nhsTeacherGrades AS tg
        ON tg.nhsTgradeID = lsons.nhsTgradeID
WHERE tg.nhsTgradeID IN (
          SELECT grp.nhsTgradeID
          FROM (nhsLearnGroups grp
          INNER JOIN nhsMembers AS mem
                  ON mem.nhsUserID = grp.nhsStudentID)
          INNER JOIN nhsGrades AS grd
                  ON mem.nhsGradeID = grd.nhsGradeID
          WHERE grd.nhsGradeName LIKE '%"+gradePart+"%'
      )
Note that I used the JOIN syntax, which is considered better practice than placing join conditions in the WHERE clause. MS Access is quite pesky about using parentheses in the JOIN clauses, so you might need to play with those a bit to make it work.
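The second cause of duplicates (several matching students following the same teacher grade) can be reproduced on a toy database. This is a sketch using Python's standard sqlite3 module rather than MS Access; the tables are reduced to the join columns, the rows are invented, and `grade_part` is a hypothetical search term. The pattern is bound as a query parameter here instead of being concatenated into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced versions of the tables in the question; the data is invented.
    CREATE TABLE nhsTeacherGrades (nhsTgradeID INTEGER, nhsTgradeName TEXT);
    CREATE TABLE nhsLessons       (nhsLessonID INTEGER, nhsTgradeID INTEGER);
    CREATE TABLE nhsLearnGroups   (nhsTgradeID INTEGER, nhsStudentID INTEGER);
    CREATE TABLE nhsMembers       (nhsUserID INTEGER, nhsGradeID INTEGER);
    CREATE TABLE nhsGrades        (nhsGradeID INTEGER, nhsGradeName TEXT);

    INSERT INTO nhsTeacherGrades VALUES (1, 'Math');
    INSERT INTO nhsLessons       VALUES (100, 1);
    -- Two students from grade '10A' follow the same teacher grade.
    INSERT INTO nhsLearnGroups   VALUES (1, 501), (1, 502);
    INSERT INTO nhsMembers       VALUES (501, 10), (502, 10);
    INSERT INTO nhsGrades        VALUES (10, '10A');
""")

grade_part = "10"            # hypothetical search term
pattern = f"%{grade_part}%"  # passed as a bound parameter below

# All tables joined in one flat query: one row per matching student.
flat_join = conn.execute("""
    SELECT lsons.nhsLessonID
    FROM nhsTeacherGrades AS tg, nhsLessons AS lsons, nhsLearnGroups AS grp,
         nhsMembers AS mem, nhsGrades AS grd
    WHERE tg.nhsTgradeID = lsons.nhsTgradeID
      AND grp.nhsTgradeID = tg.nhsTgradeID
      AND mem.nhsUserID = grp.nhsStudentID
      AND mem.nhsGradeID = grd.nhsGradeID
      AND grd.nhsGradeName LIKE ?
""", (pattern,)).fetchall()

# Student tables moved into an IN subquery: one row per lesson.
with_in = conn.execute("""
    SELECT lsons.nhsLessonID
    FROM nhsLessons AS lsons
    INNER JOIN nhsTeacherGrades AS tg ON tg.nhsTgradeID = lsons.nhsTgradeID
    WHERE tg.nhsTgradeID IN (
        SELECT grp.nhsTgradeID
        FROM nhsLearnGroups AS grp
        INNER JOIN nhsMembers AS mem ON mem.nhsUserID = grp.nhsStudentID
        INNER JOIN nhsGrades AS grd ON mem.nhsGradeID = grd.nhsGradeID
        WHERE grd.nhsGradeName LIKE ?
    )
""", (pattern,)).fetchall()

print(flat_join, with_in)
```

The flat join lists lesson 100 twice, once per student, while the IN version lists it once.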

How to get Django QuerySet 'exclude' to work right?

I have a database that contains schemas for skus, kits, kit_contents, and checklists. Here is a query for "Give me all the SKUs defined for kitcontent records defined for kit records defined in checklist 1":
SELECT DISTINCT s.* FROM skus s
JOIN kit_contents kc ON kc.sku_id = s.id
JOIN kits k ON k.id = kc.kit_id
JOIN checklists c ON k.checklist_id = 1;
I'm using Django, and I mostly really like the ORM because I can express that query by:
skus = SKU.objects.filter(kitcontent__kit__checklist_id=1).distinct()
which is such a slick way to navigate all those foreign keys. Django's ORM produces basically the same as the SQL written above. The trouble is that it's not clear to me how to get all the SKUs not defined for checklist 1. In the SQL query above, I'd do this by replacing the "=" with "!=". But Django's models don't have a not equals operator. You're supposed to use the exclude() method, which one might guess would look like this:
skus = SKU.objects.filter().exclude(kitcontent__kit__checklist_id=1).distinct()
but Django produces this query, which isn't the same thing:
SELECT distinct s.* FROM skus s
WHERE NOT ((skus.id IN
(SELECT kc.sku_id FROM kit_contents kc
INNER JOIN kits k ON (kc.kit_id = k.id)
WHERE (k.checklist_id = 1 AND kc.sku_id IS NOT NULL))
AND skus.id IS NOT NULL))
(I've cleaned up the query for easier reading and comparison.)
I'm a beginner to the Django ORM, and I'd like to use it when possible. Is there a way to get what I want here?
EDIT:
karthikr gave an answer that doesn't work for the same reason the original ORM .exclude() solution doesn't work: a SKU can be in kit_contents in kits that exist on both checklist_id=1 and checklist_id=2. Using the by-hand query I opened my post with, using "checklist_id = 1" produces 34 results, using "checklist_id = 2" produces 53 results, and the following query produces 26 results:
SELECT DISTINCT s.* FROM skus s
JOIN kit_contents kc ON kc.sku_id = s.id
JOIN kits k ON k.id = kc.kit_id
JOIN checklists c ON k.checklist_id = 1
JOIN kit_contents kc2 ON kc2.sku_id = s.id
JOIN kits k2 ON k2.id = kc2.kit_id
JOIN checklists c2 ON k2.checklist_id = 2;
I think this is one reason why people don't seem to find the .exclude() solution a reasonable replacement for some kind of not_equals filter -- the latter allows you to say, succinctly, exactly what you mean. Presumably the former could also allow the query to be expressed, but I increasingly despair of such a solution being simple.
You could do this: get all the objects for checklist 1 (the skus queryset from your question), and exclude them from the complete list.
sku_ids = skus.values_list('pk', flat=True)
non_checklist_1 = SKU.objects.exclude(pk__in=sku_ids).distinct()
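The two readings of "not checklist 1" discussed in this thread give different answers as soon as a SKU sits on more than one checklist. The sketch below runs plain SQL through Python's standard sqlite3 module instead of the Django ORM; the tables are reduced to the join columns and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced versions of the tables in the question; the data is invented.
    CREATE TABLE skus         (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE kits         (id INTEGER PRIMARY KEY, checklist_id INTEGER);
    CREATE TABLE kit_contents (kit_id INTEGER, sku_id INTEGER);

    INSERT INTO skus VALUES (1, 'only-1'), (2, 'both'), (3, 'only-2');
    INSERT INTO kits VALUES (10, 1), (20, 2);
    -- SKU 2 appears in a kit on checklist 1 and in a kit on checklist 2.
    INSERT INTO kit_contents VALUES (10, 1), (10, 2), (20, 2), (20, 3);
""")

# Reading 1: the SKU has at least one kit on a checklist other than 1
# (what replacing "=" with "!=" in the hand-written join asks for).
some_other = conn.execute("""
    SELECT DISTINCT s.name FROM skus s
    JOIN kit_contents kc ON kc.sku_id = s.id
    JOIN kits k ON k.id = kc.kit_id
    WHERE k.checklist_id != 1
    ORDER BY s.name
""").fetchall()

# Reading 2: the SKU has no kit on checklist 1 at all
# (what the NOT IN query shown in the question asks for).
never_on_1 = conn.execute("""
    SELECT s.name FROM skus s
    WHERE s.id NOT IN (SELECT kc.sku_id FROM kit_contents kc
                       JOIN kits k ON k.id = kc.kit_id
                       WHERE k.checklist_id = 1 AND kc.sku_id IS NOT NULL)
    ORDER BY s.name
""").fetchall()

print(some_other, never_on_1)
```

The SKU named 'both' appears under the first reading but not under the second, which is the difference the question's EDIT describes.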

Find Most Recent Entry for an Intersection of Categories

I know there are a number of "How do I find the most recent record" questions out there, but none of them quite solved my particular problem: in MySQL, I'm trying to find the most recent record for an entry that's mapped to two different categories in the same table. There's essentially an ENTRIES table with a bunch of info, a CATEGORIES table (id, name) and an ENTRY_CATEGORIES table (entry_id, category_id). I need to find the most recent record that's mapped to two different categories. I've managed to do it, but only by essentially joining a derived table on itself, and it feels like there's a cleaner way to do this. How can I better express the following mess:
SELECT doc.entry_id
FROM exp_category_posts doc
INNER JOIN exp_category_posts fund ON doc.entry_id = fund.entry_id
INNER JOIN exp_weblog_titles t ON doc.entry_id = t.entry_id
WHERE doc.cat_id = 408
AND fund.cat_id = 548
AND t.entry_date = (SELECT MAX(t.entry_date)
FROM exp_category_posts doc
INNER JOIN exp_category_posts fund ON doc.entry_id = fund.entry_id
INNER JOIN exp_weblog_titles t ON doc.entry_id = t.entry_id
WHERE doc.cat_id = 408
AND fund.cat_id = 548)
It's a hard-coded example where 408 and 548 would normally be fields as well. This is an Expression Engine database, if you're curious.
You might try replacing
AND t.entry_date = (SELECT MAX(t.entry_date)
FROM exp_category_posts doc
INNER JOIN exp_category_posts fund ON doc.entry_id = fund.entry_id
INNER JOIN exp_weblog_titles t ON doc.entry_id = t.entry_id
WHERE doc.cat_id = 408
AND fund.cat_id = 548)
with:
ORDER BY t.entry_date DESC
LIMIT 1
The optimizer will probably end up with a similar plan in the end (that's not guaranteed, of course, but it's fairly likely), and the query is half as long. You'd have to run EXPLAIN and profile a few SELECT queries to see whether it performs as well or better.
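To see that the two forms pick the same entry, here is a sketch using Python's standard sqlite3 module in place of MySQL. The tables are reduced to the columns the query uses, the rows are invented, and the inner aliases in the MAX() version are renamed (d2, f2, t2) only to make the scoping easier to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced versions of the tables in the question; the data is invented.
    CREATE TABLE exp_category_posts (entry_id INTEGER, cat_id INTEGER);
    CREATE TABLE exp_weblog_titles  (entry_id INTEGER, entry_date INTEGER);

    INSERT INTO exp_category_posts VALUES
        (1, 408), (1, 548),   -- entry 1: in both categories
        (2, 408), (2, 548),   -- entry 2: in both categories, newer
        (3, 408);             -- entry 3: newest, but only in one category
    INSERT INTO exp_weblog_titles VALUES (1, 1000), (2, 2000), (3, 3000);
""")

# Shared part of both queries: entries mapped to both categories.
base = """
    SELECT doc.entry_id
    FROM exp_category_posts doc
    INNER JOIN exp_category_posts fund ON doc.entry_id = fund.entry_id
    INNER JOIN exp_weblog_titles t ON doc.entry_id = t.entry_id
    WHERE doc.cat_id = ? AND fund.cat_id = ?
"""

# Original approach: repeat the joins inside a MAX() subquery.
with_max = conn.execute(base + """
      AND t.entry_date = (SELECT MAX(t2.entry_date)
                          FROM exp_category_posts d2
                          INNER JOIN exp_category_posts f2 ON d2.entry_id = f2.entry_id
                          INNER JOIN exp_weblog_titles t2 ON d2.entry_id = t2.entry_id
                          WHERE d2.cat_id = ? AND f2.cat_id = ?)
""", (408, 548, 408, 548)).fetchall()

# Suggested approach: sort by date and keep the first row.
with_limit = conn.execute(
    base + " ORDER BY t.entry_date DESC LIMIT 1", (408, 548)
).fetchall()

print(with_max, with_limit)
```

Both return entry 2 here. One difference to keep in mind: if several qualifying entries share the latest entry_date, the MAX() form returns all of them, while LIMIT 1 returns just one.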