Minus operation gives wrong answer - sql

I am trying to get the result of the minus operation of join tables which means that I am finding unmatched records.
I tried:
SELECT count(*) FROM mp_v1 mp
left join cm_v1 sop
on mp.study_name=sop.study_name and
sop.site_id=sop.site_id
--where mp.study_name='1101'
MINUS
SELECT count(*) FROM iv_mpv1 mp
inner join cm sop
on mp.study_name=sop.study_name and
sop.site_id=sop.site_id
--where mp.study_name='1101'
output: the count of this gives me 171183251
but when I run the first query individually I get 171183251 for left outer join and 171070345 for inner join so the output needs to be 112906. I am not sure where my query is wrong. Could anyone please give your opinion.

If you want unmatched records you wouldn't use MINUS on the counts. The query would look more like:
SELECT COUNT(*)
FROM ((SELECT *
FROM mp_v1 mp LEFT JOIN
cm_v1 sop
USING (study_name, site_id)
) MINUS
(SELECT *
FROM iv_mpv1 mp LEFT JOIN
iv_cmv1 sop
USING (study_name, site_id)
)
) x;
Also note that MINUS removes duplicates, so if you have duplicates within each set of tables, then they only count as one row.
The SELECT * assumes that the tables have the same columns and compatible types -- which makes sense given the gist of the question. You may need to list the particular columns you care about.

Related

Sum fields of an Inner join

How I can add two fields that belong to an inner join?
I have this code:
select
SUM(ACT.NumberOfPlants ) AS NumberOfPlants,
SUM(ACT.NumOfJornales) AS NumberOfJornals
FROM dbo.AGRMastPlanPerformance MPR (NOLOCK)
INNER JOIN GENRegion GR ON (GR.intGENRegionKey = MPR.intGENRegionLink )
INNER JOIN AGRDetPlanPerformance DPR (NOLOCK) ON
(DPR.intAGRMastPlanPerformanceLink =
MPR.intAGRMastPlanPerformanceKey)
INNER JOIN vwGENPredios P ​​(NOLOCK) ON ( DPR.intGENPredioLink =
P.intGENPredioKey )
INNER JOIN AGRSubActivity SA (NOLOCK) ON (SA.intAGRSubActivityKey =
DPR.intAGRSubActivityLink)
LEFT JOIN (SELECT RA.intGENPredioLink, AR.intAGRActividadLink,
AR.intAGRSubActividadLink, SUM(AR.decNoPlantas) AS
intPlantasTrabajads, SUM(AR.decNoPersonas) AS NumOfJornales,
SUM(AR.decNoPlants) AS NumberOfPlants
FROM AGRRecordActivity RA WITH (NOLOCK)
INNER JOIN AGRActividadRealizada AR WITH (NOLOCK) ON
(AR.intAGRRegistroActividadLink = RA.intAGRRegistroActividadKey AND
AR.bitActivo = 1)
INNER JOIN AGRSubActividad SA (NOLOCK) ON (SA.intAGRSubActividadKey
= AR.intAGRSubActividadLink AND SA.bitEnabled = 1)
WHERE RA.bitActive = 1 AND
AR.bitActive = 1 AND
RA.intAGRTractorsCrewsLink IN(2)
GROUP BY RA.intGENPredioLink,
AR.decNoPersons,
AR.decNoPlants,
AR.intAGRAActivityLink,
AR.intAGRSubActividadLink) ACT ON (ACT.intGENPredioLink IN(
DPR.intGENPredioLink) AND
ACT.intAGRAActivityLink IN( DPR.intAGRAActivityLink) AND
ACT.intAGRSubActivityLink IN( DPR.intAGRSubActivityLink))
WHERE
MPR.intAGRMastPlanPerformanceKey IN(4) AND
DPR.intAGRSubActivityLink IN( 1153)
GROUP BY
P.vchRegion,
ACT.NumberOfFloors,
ACT.NumOfJournals
ORDER BY ACT.NumberOfFloors DESC
However, it does not perform the complete sum. It only retrieves all the values ​​of the columns and adds them 1 by 1, instead of doing the complete sum of the whole column.
For example, the query returns these results:
What I expect is the final sums. In NumberOfPlants the result of the sum would be 163,237 and of NumberJornales would be 61.
How can I do this?
First of all the (nolock) hints are probably not accomplishing the benefit you hope for. It's not an automatic "go faster" option, and if such an option existed you can be sure it would be already enabled. It can help in some situations, but the way it works allows the possibility of reading stale data, and the situations where it's likely to make any improvement are the same situations where risk for stale data is the highest.
That out of the way, with that much code in the question we're better served with a general explanation and solution for you to adapt.
The issue here is GROUP BY. When you use a GROUP BY in SQL, you're telling the database you want to see separate results per group for any aggregate functions like SUM() (and COUNT(), AVG(), MAX(), etc).
So if you have this:
SELECT Sum(ColumnB) As SumB
FROM [Table]
GROUP BY ColumnA
You will get a separate row per ColumnA group, even though it's not in the SELECT list.
If you don't really care about that, you can do one of two things:
Remove the GROUP BY If there are no grouped columns in the SELECT list, the GROUP BY clause is probably not accomplishing anything important.
Nest the query
If option 1 is somehow not possible (say, the original is actually a view) you could do this:
SELECT SUM(SumB)
FROM (
SELECT Sum(ColumnB) As SumB
FROM [Table]
GROUP BY ColumnA
) t
Note in both cases any JOIN is irrelevant to the issue.

LEFT JOIN help in sql

I have to make a list of customer who do not have any invoice but have paid an invoice … maybe twice.
But with my code (stated below) it contains everything from the left join. However I only need the lines highlighted with green.
How should I make a table with only the 2 highlights?
Select paymentsfrombank.invoicenumber,paymentsfrombank.customer,paymentsfrombank.value
FROM paymentsfrombank
LEFT OUTER JOIN debtors
ON debtors.value = paymentsfrombank.value
You only want to select columns from paymentsfrombank. So why do you even join?
select invoice_number, customer, value from paymentsfrombank
except
select invoice_number, customer, value from debtors;
(This requires exact matches as in your example, i.e. same amount for the invoice/customer).
There are two issues in your SQL. First, you need to join on Invoice number, not on value, as joining on value is pointless. Second, you need to only pick those payments where there are no corresponding debts, i.e. when you left-join, the table on the right has "null" in the joining column. The SQL would be something like this:
SELECT paymentsfrombank.invoicenumber,paymentsfrombank.customer,paymentsfrombank.value
FROM paymentsfrombank
LEFT OUTER JOIN debtors
ON debtors.InvoiceNumber = paymentsfrombank.InvoiceNumber
WHERE debtors.InvoiceNumber is NULL
in mysql we usually have this way to flip the relation and extract the rows that dosen't have relation.
Select paymentsfrombank.invoicenumber,paymentsfrombank.customer,paymentsfrombank.value
FROM paymentsfrombank
LEFT OUTER JOIN debtors
ON debtors.value = paymentsfrombank.value where debtors.value is null
You can use NOT EXISTS :
SELECT p.*
FROM paymentsfrombank p
WHERE NOT EXISTS (SELECT 1 FROM debtors d WHERE d.invoice_number = p.invoice_number);
However, the LEFT OUTER JOIN would also work if you add filtered with WHERE Clause to filtered out only missing customers that haven't any invoice information :
SELECT p.invoicenumber, p.customer, p.value
FROM paymentsfrombank P LEFT OUTER JOIN
debtors d
ON d.InvoiceNumber = p.InvoiceNumber
WHERE d.InvoiceNumber IS NULL;
Note : I have used table alias (p & d) that makes query to easier read & write.

Why is LEFT JOIN deleting rows?

I have been using sql for a long time, but I am now working in Databricks and I am getting a very strange result. I have a table called block_durations with a set of ids (called block_ts), and I have another table called mergetable, which I want to left join to that table. Mergetable is indexed by acct_id and block_ts, so it has many different records for each block_ts. I want to keep the rows in block_durations that don't match, and if there are multiple matches in mergetable I want there to be multiple corresponding entries in the resulting join, as you would expect from a left join.
But this is not happening. In order to demonstrate this, I am showing the result of joining mergetable, after filtering for a single acct_id so that there is at most one match per block_ts.
select count(*) from mergetable where acct_id = '0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98'
16579
select count(*) from block_durations
82817
select count(*) from
(
SELECT
mt.*,
bd.block_duration
FROM
block_durations bd
left outer JOIN mergetable mt
ON mt.block_ts = bd.block_ts
where acct_id='0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98'
) countTable
16579
As you can see, even though there are >80000 records in block_durations, most of them are getting lost in the left join. Why is this happening? I thought the whole point of a left join is that the non-matching rows of the left table are kept. This is exactly the behavior I would expect from an inner join -- and indeed when I switch to an inner join nothing changes.
Could someone please help me figure out what's going on?
-Paul
All rows from left side of the join are preserved, but later on you run WHERE ... condition on that which removed rows not matching the condition.
Merge your WHERE condition into JOIN condition:
SELECT
mt.*,
bd.block_duration
FROM
block_durations bd
left outer JOIN mergetable mt
ON mt.block_ts = bd.block_ts AND acct_id='0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98'
You can also filter mergetable before you run JOIN on the results:
SELECT
mt.*,
bd.block_duration
FROM
block_durations bd
left outer JOIN (SELECT * FROM mergetable WHERE acct_id='0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98') mt
ON mt.block_ts = bd.block_ts

SQL - why is this 'where' needed to remove row duplicates, when I'm already grouping?

Why, in this query, is the final 'WHERE' clause needed to limit duplicates?
The first LEFT JOIN is linking programs to entities on a UID
The first INNER JOIN is linking programs to a subquery that gets statistics for those programs, by linking on a UID
The subquery (that gets the StatsForDistributorClubs subset) is doing a grouping on UID columns
So, I would've thought that this would all be joining unique records anyway so we shouldn't get row duplicates
So why the need to limit based on the final WHERE by ensuring the 'program' is linked to the 'entity'?
(irrelevant parts of query omitted for clarity)
SELECT LmiEntity.[DisplayName]
,StatsForDistributorClubs.*
FROM [Program]
LEFT JOIN
LMIEntityProgram
ON LMIEntityProgram.ProgramUid = Program.ProgramUid
INNER JOIN
(
SELECT e.LmiEntityUid,
sp.ProgramUid,
SUM(attendeecount) [Total attendance],
FROM LMIEntity e,
Timetable t,
TimetableOccurrence [to],
ScheduledProgramOccurrence spo,
ScheduledProgram sp
WHERE
t.LicenseeUid = e.lmientityUid
AND [to].TimetableOccurrenceUid = spo.TimetableOccurrenceUid
AND sp.ScheduledProgramUid = spo.ScheduledProgramUid
GROUP BY e.lmientityUid, sp.ProgramUid
) AS StatsForDistributorClubs
ON Program.ProgramUid = StatsForDistributorClubs.ProgramUid
INNER JOIN LmiEntity
ON LmiEntity.LmiEntityUid = StatsForDistributorClubs.LmiEntityUid
LEFT OUTER JOIN Region
ON Region.RegionId = LMIEntity.RegionId
WHERE (
[Program].LicenseeUid = LmiEntity.LmiEntityUid
OR
[LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid
)
If you were grouping in your outer query, the extra criteria probably wouldn't be needed, but only your inner query is grouped. Your LEFT JOIN to a grouped inner query can still result in multiple records being returned, for that matter any of your JOINs could be the culprit.
Without seeing sample of duplication it's hard to know where the duplicates originate from, but GROUPING on the outer query would definitely remove full duplicates, or revised JOIN criteria could take care of it.
You have in result set:
SELECT LmiEntity.[DisplayName]
,StatsForDistributorClubs.*
I suppose that you dublicates comes from LMIEntityProgram.
My conjecture: LMIEntityProgram - is a bridge table with both LmiEntityId an ProgramId, but you join only by ProgramId.
If you have several LmiEntityId for single ProgramId - you must have dublicates.
And this dublicates you're filtering in WHERE:
[LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid
You can do it in JOIN:
LEFT JOIN LMIEntityProgram
ON LMIEntityProgram.ProgramUid = Program.ProgramUid
AND [LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid

Mysql statement (syntax error on FULL JOIN)

What is wrong with my sql statement, it says that the problem is near the FULL JOIN, but I'm stumped:
SELECT `o`.`name` AS `offername`, `m`.`name` AS `merchantName`
FROM `offer` AS `o`
FULL JOIN `offerorder` AS `of` ON of.offerId = o.id
INNER JOIN `merchant` AS `m` ON o.merchantId = m.id
GROUP BY `of`.`merchantId`
Please be gentle, as I am not a sql fundi
MySQL doesn't offer full join, you can either use
a pair of LEFT+RIGHT and UNION; or
use a triplet of LEFT, RIGHT and INNER and UNION ALL
The query is also very wrong, because you have a GROUP BY but your SELECT columns are not aggregates.
After you convert this properly to LEFT + RIGHT + UNION, you still have the issue of getting an offername and merchantname from any random record per each distinct of.merchantid, and not even necessarily from the same record.
Because you have an INNER JOIN condition against o.merchant, the FULL JOIN is not necessary since "offerorder" records with no match in "offer" will fail the INNER JOIN. That turns it into a LEFT JOIN (optional). Because you are grouping on of.merchantid, any missing offerorder records will be grouped together under "NULL" as merchantid.
This is a query that will work, for each merchantid, it will show just one offer that the merchant made (the one with the first name when sorted in lexicographical order).
SELECT MIN(o.name) AS offername, m.name AS merchantName
FROM offer AS o
LEFT JOIN offerorder AS `of` ON `of`.offerId = o.id
INNER JOIN merchant AS m ON o.merchantId = m.id
GROUP BY `of`.merchantId, m.name
Note: The join o.merchantid = m.id is highly suspect. Did you mean of.merchantid = m.id? If that is the case, change the LEFT to RIGHT join.