Why are queries these not equivalent? (correlated subquery vs. group by) - sql

Why are these two SQL queries not equivalent? One uses a correlated subquery, the other uses group by. The first produces a little over 51000 rows from my database, the second nearly 66000. In both cases, I am simply trying to return all the parts meeting the stated condition, current revision only. A comparison of the output files shows that method #1 (oracle_test1.txt) fails to return quite a few values. Based on that, I can only assume that method #2 is correct. I have some code that has used method #1 for a long time, but it appears I will have to change it. My reasoning concerning the correlated subquery was that as the inner select is comparing the columns in the self join, it will find the max vaule for the prev value for all matches; then return that max prev value for use in the outer query. I designed that query long ago before becoming familiar with the use of group by. Any insights would be appreciated.
Query #1
select pobj_name, prev
from pfmc_part
where pmodel in ('PN-DWG', 'NO-DWG') and pstatus = 'RELEASED'
and prev = (select max(prev) from pfmc_part a where a.pobj_name = pfmc_part.pobj_name)
order by pobj_name, prev"
Query #2
select pobj_name, max(prev) prev
from pfmc_part
where pmodel in ('PN-DWG', 'NO-DWG') and pstatus = 'RELEASED'
group by pobj_name
order by pobj_name, prev"
Sample output:
Query #2 Query #1
P538512 B P538512 B
P538513 A P538513 A
P538514 C P538514 C
P538520 B
P538522 B P538522 B
P538525 A P538525 A
P538531 C P538531 C
P538533 A P538533 A
P538538 B
P538541 B
P538542 B
P538553 A P538553 A
P538569 A P538569 A

Query 1 is returning each of the max ids and then those that have a pmodel of the type specified within your where clause.
Whereas query 2 is selecting all items with a pmodel of the type specified in your where clause and each of the max ids of that.
You may have data which isn't the max id which satisfies your where clause in query 2 which is why it's being omitted in query 1

There are two differences and the rest of the answers focus on one. The "easy" difference is that the max() in the group by is affected by the filter clause. The max() in the other query has no filter, and so it might return no rows (when max(prev) is on a row otherwise filtered out by the where conditions).
In addition, the where version of the query might return duplicate rows when there are multiple rows with the same value of max(prev) for a given pobj_name. The group by will never return duplicate rows.

this query
select pobj_name, prev
from pfmc_part
where pmodel in ('PN-DWG', 'NO-DWG') and pstatus = 'RELEASED'
and prev = (select max(prev) from pfmc_part a where a.pobj_name = pfmc_part.pobj_name)
order by pobj_name, prev"
has a where clause declaration causing it to return less rows -- specifically, only rows where prev = (subquery). that and prev makes it entirely different, and also assigns the value into prev in the first line
if you wanted them to be the more similar, you'd need to modify it like so
select pobj_name, prev, maxes.max
from pfmc_part
JOIN (select max(prev) as max from pfmc_part a where a.pobj_name = pfmc_part.pobj_name) maxes
where pmodel in ('PN-DWG', 'NO-DWG') and pstatus = 'RELEASED'
order by pobj_name, prev"

In query 1 you are ONLY selecting the rows whose prev field is equal to the max(prev) and in query 2 you are selecting all records ALONG WITH max(prev) that's meeting the conditions in the where and group by clause.
Basically, query 1 and query 2 have completely different where clauses. Hope this explains the missing records from query 1.

Your query #1 will certainly fail to return a row for a given pobj_name where maximum prev for that name does not correspond to a revision currently in the database. That could perhaps happen if a revision was skipped or if its row was deleted.
Your Query #2 does not suffer Query #1's limitation, and it may perform better on account of avoiding a correlated subquery. It would be inappropriate, however, if you wanted more data than just pobj_name and aggregate functions of the groups. And by the way, there's no point in including prev in the ORDER BY clause, since pobj_name will already be unique to each result row.
Overall, if the two queries happen to return similar results then that is a matter of the details of the data, not of the queries. They arrive at their results completely differently.

Related

How can I get non repeating value from select?

This is my select statement, it returns duplicate rows (see screen shot).
How can I prevent the duplicated rows?
SELECT
A.TOTAL_PRESENT,
A."LIMIT",
A.COST_CENTER,
A.ID,
A.PLANT,
A.BUDGET_YEAR,
A."VERSION",
B.BUDGET_YEAR,
B."VERSION",
B.PLANT,
B.CHARGE_CC,
B.YEAR_DATE_USD
FROM
CMS.SUM_REPANDMAINT A,
CMS.V_SUM_REPANDMAINT B
WHERE
(A.BUDGET_YEAR = B.BUDGET_YEAR(+)) AND
(A."VERSION" = B."VERSION"(+)) AND
(A.PLANT = B.PLANT(+)) AND
(A.COST_CENTER = B.CHARGE_CC(+)) AND
(B.USERNAME = '[usr_name]')
Output
Duplicate entries mean the filter criteria are not precise enough. One of your data sources produces multiple rows and the WHERE clause doesn't offer sufficient restriction.
You haven't posted any raw data so we can't tell you what additional criteria you need. However you should look at the use of outer joins. Outer joins mean you will return rows if the criteria for the right hand table don't match the criteria of the left-hand table. Why are you doing that?

CASE Clause on select clause throwing 'SQLCODE=-811, SQLSTATE=21000' Error

This query is very well working in Oracle. But it is not working in DB2. It is throwing
DB2 SQL Error: SQLCODE=-811, SQLSTATE=21000, SQLERRMC=null, DRIVER=3.61.65
error when the sub query under THEN clause is returning 2 rows.
However, my question is why would it execute in the first place as my WHEN clause turns to be false always.
SELECT
CASE
WHEN (SELECT COUNT(1)
FROM STOP ST,
FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC=1
AND ST.SHIPMENT_ID = 2779) = 1
THEN
(SELECT ST.FACILITY_ALIAS_ID
FROM STOP ST,
FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC=1
AND ST.SHIPMENT_ID = 2779
)
ELSE NULL
END STAPPFAC
FROM SHIPMENT SHIPMENT
WHERE SHIPMENT.SHIPMENT_ID IN (2779);
The SQL standard does not require short cut evaluation (ie evaluation order of the parts of the CASE statement). Oracle chooses to specify shortcut evaluation, however DB2 seems to not do that.
Rewriting your query a little for DB2 (8.1+ only for FETCH in subqueries) should allow it to run (unsure if you need the added ORDER BY and don't have DB2 to test on at the moment)
SELECT
CASE
WHEN (SELECT COUNT(1)
FROM STOP ST,
FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC=1
AND ST.SHIPMENT_ID = 2779) = 1
THEN
(SELECT ST.FACILITY_ALIAS_ID
FROM STOP ST,
FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC=1
AND ST.SHIPMENT_ID = 2779
ORDER BY ST.SHIPMENT_ID
FETCH FIRST 1 ROWS ONLY
)
ELSE NULL
END STAPPFAC
FROM SHIPMENT SHIPMENT
WHERE SHIPMENT.SHIPMENT_ID IN (2779);
Hmm... you're running the same query twice. I get the feeling you're not thinking in sets (how SQL operates), but in a more procedural form (ie, how most common programming languages work). You probably want to rewrite this to take advantage of how RDBMSs are supposed to work:
SELECT Current_Stop.facility_alias_id
FROM SYSIBM/SYSDUMMY1
LEFT JOIN (SELECT MAX(Stop.facility_alias_id) AS facility_alias_id
FROM Stop
JOIN Facility
ON Facility.facility_id = Stop.facility_id
AND Facility.is_dock_sched_fac = 1
WHERE Stop.shipment_id = 2779
HAVING COUNT(*) = 1) Current_Stop
ON 1 = 1
(no sample data, so not tested. There's a couple of other ways to write this based on other needs)
This should work on all RDBMSs.
So what's going on here, why does this work? (And why did I remove the reference to Shipment?)
First, let's look at your query again:
CASE WHEN (SELECT COUNT(1)
FROM STOP ST, FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC = 1
AND ST.SHIPMENT_ID = 2779) = 1
THEN (SELECT ST.FACILITY_ALIAS_ID
FROM STOP ST, FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC = 1
AND ST.SHIPMENT_ID = 2779)
ELSE NULL END
(First off, stop using the implicit-join syntax - that is, comma-separated FROM clauses - always explicitly qualify your joins. For one thing, it's way too easy to miss a condition you should be joining on)
...from this it's obvious that your statement is the 'same' in both queries, and shows what you're attempting - if the dataset has one row, return it, otherwise the result should be null.
Enter the HAVING clause:
HAVING COUNT(*) = 1
This is essentially a WHERE clause for aggregates (functions like MAX(...), or here, COUNT(...)). This is useful when you want to make sure some aspect of the entire set matches a given criteria. Here, we want to make sure there's just one row, so using COUNT(*) = 1 as the condition is appropriate; if there's more (or less! could be 0 rows!) the set will be discarded/ignored.
Of course, using HAVING means we're using an aggregate, the usual rules apply: all columns must either be in a GROUP BY (which is actually an option in this case), or an aggregate function. Because we only want/expect one row, we can cheat a little, and just specify a simple MAX(...) to satisfy the parser.
At this point, the new subquery returns one row (containing one column) if there was only one row in the initial data, and no rows otherwise (this part is important). However, we actually need to return a row regardless.
FROM SYSIBM/SYSDUMMY1
This is a handy dummy table on all DB2 installations. It has one row, with a single column containing '1' (character '1', not numeric 1). We're actually interested in the fact that it has only one row...
LEFT JOIN (SELECT ... )
ON 1 = 1
A LEFT JOIN takes every row in the preceding set (all joined rows from the preceding tables), and multiplies it by every row in the next table reference, multiplying by 1 in the case that the set on the right (the new reference, our subquery) has no rows. (This is different from how a regular (INNER) JOIN works, which multiplies by 0 in the case that there is no row) Of course, we only maybe have 1 row, so there's only going to be a maximum of one result row. We're required to have an ON ... clause, but there's no data to actually correlate between the references, so a simple always-true condition is used.
To get our data, we just need to get the relevant column:
SELECT Current_Stop.facility_alias_id
... if there's the one row of data, it's returned. In the case that there is some other count of rows, the HAVING clause throws out the set, and the LEFT JOIN causes the column to be filled in with a null (no data) value.
So why did I remove the reference to Shipment? First off, you weren't using any data from the table - the only column in the result set was from the subquery. I also have good reason to believe that there would only be one row returned in this case - you're specifying a single shipment_id value (which implies you know it exists). If we don't need anything from the table (including the number of rows in that table), it's usually best to remove it from the statement: doing so can simplify the work the db needs to do.

Use of the HAVING clause when using muliple sums

I was having a problem getting mulitple sums from multiple tables. Short story, my answer was solved in the "sql sum data from multiple tables" thread on this site. But where it came up short, is that now I'd like to only show sums that are greater than a certain amount. So while I have sub-selects in my select, I think I need to use a HAVING clause to filter the summed amounts that are too low.
Example, using the code specified in the link above (more specifically the answer that the owner has chosen as correct), I would only like to see a query result if SUM(AP2.Value) > 1500. Any thoughts?
If you need to filter on the results of ANY aggregate function, you MUST use a HAVING clause. WHERE is applied at the row level as the DB scans the tables for matching things. HAVING is applied basically immediately before the result set is sent out to the client. At the time WHERE operates, the aggregate function results are not (and cannot) be available, so you have to use a HAVING clause, which is applied after the main query is complete and all aggregate results are available.
So... long story short, yes, you'll need to do
SELECT ...
FROM ...
WHERE ...
HAVING (SUM_AP > 1500)
Note that you can use column aliases in the having clause. In technical terms, having on a query as above works basically exactly the same as wrapping the initial query in another query and applying another WHERE clause on the wrapper:
SELECT *
FROM (
SELECT ...
) AS child
WHERE (SUM_AP > 1500)
You could wrap that query as a subselect and then specify your criteria in the WHERE clause:
SELECT
PROJECT,
SUM_AP,
SUM_INV
FROM (
SELECT
AP1.[PROJECT],
(SELECT SUM(AP2.Value) FROM AP AS AP2 WHERE AP2.PROJECT = AP1.PROJECT) AS SUM_AP,
(SELECT SUM(INV2.Value) FROM INV AS INV2 WHERE INV2.PROJECT = AP1.PROJECT) AS SUM_INV
FROM AP AS AP1
INNER JOIN INV AS INV1 ON
AP1.[PROJECT] = INV1.[PROJECT]
WHERE
AP1.[PROJECT] = 'XXXXX'
GROUP BY
AP1.[PROJECT]
) SQ
WHERE
SQ.SUM_AP > 1500

Oracle Group by issue

I have the below query. The problem is the last column productdesc is returning two records and the query fails because of distinct. Now i need to add one more column in where clause of the select query so that it returns one record. The issue is that the column i need
to add should not be a part of group by clause.
SELECT product_billing_id,
billing_ele,
SUM(round(summary_net_amt_excl_gst/100)) gross,
(SELECT DISTINCT description
FROM RES.tariff_nt
WHERE product_billing_id = aa.product_billing_id
AND billing_ele = aa.billing_ele) productdescr
FROM bil.bill_sum aa
WHERE file_id = 38613 --1=1
AND line_type = 'D'
AND (product_billing_id, billing_ele) IN (SELECT DISTINCT
product_billing_id,
billing_ele
FROM bil.bill_l2 )
AND trans_type_desc <> 'Change'
GROUP BY product_billing_id, billing_ele
I want to modify the select statement to the below way by adding a new filter to the where clause so that it returns one record .
(SELECT DISTINCT description
FROM RRES.tariff_nt
WHERE product_billing_id = aa.product_billing_id
AND billing_ele = aa.billing_ele
AND (rate_structure_start_date <= TO_DATE(aa.p_effective_date,'yyyymmdd')
AND rate_structure_end_date > TO_DATE(aa.p_effective_date,'yyyymmdd'))
) productdescr
The aa.p_effective_date should not be a part of GROUP BY clause. How can I do it? Oracle is the Database.
So there are multiple RES.tariff records for a given product_billing_id/billing_ele, differentiated by the start/end dates
You want the description for the record that encompasses the 'p_effective_date' from bil.bill_sum. The kicker is that you can't (or don't want to) include that in the group by. That suggests you've got multiple rows in bil.bill_sum with different effective dates.
The issue is what do you want to happen if you are summarising up those multiple rows with different dates. Which of those dates do you want to use as the one to get the description.
If it doesn't matter, simply use MIN(aa.p_effective_date), or MAX.
Have you looked into the Oracle analytical functions. This is good link Analytical Functions by Example

SQL (Mysql) order problem

I have the following query:
SELECT a.field_eventid_key_value, a.field_showdate_value, b.nid , c.nid AS CNID
FROM content_type_vorfuehrung AS a
LEFT JOIN content_type_movies as b ON a.field_eventid_key_value = b.field_eventid_value
LEFT JOIN content_type_sonderveranstaltung as c ON a.field_eventid_key_value = c.field_sonderveranstaltungid_value
WHERE /* something */
GROUP BY a.field_eventid_key_value, a.field_showdate_value,
ORDER BY a.field_showdate_value ASC,a.field_showtime_value ASC
(where clause removed since it's irrelevant to the question)
This pulls data from 3 different tables and sorts it according to the "showdate" field in the first table. This is used in a PHP function that returns an array with the results.
Now there is a new requirement: The table "content_type_movies" in this query has a field that's supposed to be a boolean (actually it's an int with a value of either "0" oder "1"). This field is supposed to override the chronological ordering - that is, results where the field is "true" (or "1" respectively) should appear at the beginning of the result array (with the remaining entries ordered chronologically as before).
Is this at all possible with a single query ?
Thank you in advance for your time,
eike
You can use:
ORDER BY b.MyIntField DESC,
a.field_showdate_value,
a.field_showtime_value
where MyIntField is the field that is either 1 or 0 that you want to sort first.
ORDER BY a.content_type_movies DESC, /*then other fields*/ a.field_showdate_value ASC,a.field_showtime_value ASC
that should place all rows with content_type_movies=1 first then others.