Joining on multiple tables causing incorrect results


I am trying to extract some data grouped by the markets we operate in. The table structure looks like this:
bks:
opportunity_id
bks_opps:
opportunity_id | trip_start | state
bts:
boat_id | package_id
pckgs:
package_id | boat_id
addresses:
addressable_id | district_id
districts:
district_id
What I want is to count the won, lost, and total opportunities, plus the percentage won, for each district.
SELECT d.name AS "District",
SUM(CASE WHEN bo.state IN ('won') THEN 1 ELSE 0 END) AS "Won",
SUM(CASE WHEN bo.state IN ('lost') THEN 1 ELSE 0 END) AS "Lost",
Count(bo.state) AS "Total",
Round(100 * SUM(CASE WHEN bo.state IN ('won') THEN 1 ELSE 0 END) / Count(bo.state)) AS "% Won"
FROM bks b
INNER JOIN bks_opps bo ON bo.id = b.opportunity_id
INNER JOIN pckgs p ON p.id = b.package_id
INNER JOIN bts bt ON bt.id = p.boat_id
INNER JOIN addresses a ON a.addressable_type = 'Boat' AND a.addressable_id = bt.id
INNER JOIN districts d ON d.id = a.district_id
WHERE bo.trip_start BETWEEN '2016-05-12' AND '2016-06-12'
GROUP BY d.name;
This returns incorrect data (the values are far higher than expected). However, when I remove all the joins and stop grouping by district, the numbers are correct (counting the total number of opportunities). Can anybody spot what I am doing wrong?
Actual result:
District | won | lost | total
---------+-----+------+------
1        | 42  | 212  | 254
Expected result:
District | won | lost | total
---------+-----+------+------
1        | 22  | 155  | 177

I would venture a guess that one of your join conditions is at fault here, but with the provided structure it is impossible to say.
For instance, you have this join INNER JOIN pckgs p ON p.id = b.package_id, but package_id is not listed as a column in bks.
And these joins look especially suspect:
INNER JOIN pckgs p ON p.id = b.package_id
INNER JOIN bts bt ON bt.id = p.boat_id
If a boat can exist in multiple packages, it will be an issue.
To troubleshoot, start with the simplest query you can:
SELECT b.opportunity_id
FROM bks b
Then leave the select alone, and proceed to add in each join:
SELECT b.opportunity_id
FROM bks b
INNER JOIN pckgs p ON p.id = b.package_id
At some point you'll likely see a jump in the number of rows returned. Whichever JOIN you added last is your issue.
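The join-by-join check described above can be sketched with Python's built-in sqlite3 module. The table names mirror the question, but the rows are invented for illustration, and the duplicate address row is a hypothetical cause, not a diagnosis of the asker's data:

```python
import sqlite3

# Hypothetical miniature schema: one booking, one package, and a boat that
# has two address rows. The names mirror the question; the data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bks (opportunity_id INTEGER, package_id INTEGER);
    CREATE TABLE pckgs (id INTEGER, boat_id INTEGER);
    CREATE TABLE addresses (addressable_id INTEGER, district_id INTEGER);
    INSERT INTO bks VALUES (1, 10);
    INSERT INTO pckgs VALUES (10, 100);
    INSERT INTO addresses VALUES (100, 1), (100, 1);  -- duplicate address row
""")

# Each step adds exactly one join to the previous FROM clause.
steps = [
    ("bks only", "FROM bks b"),
    ("+ pckgs", "FROM bks b JOIN pckgs p ON p.id = b.package_id"),
    ("+ addresses",
     "FROM bks b JOIN pckgs p ON p.id = b.package_id "
     "JOIN addresses a ON a.addressable_id = p.boat_id"),
]

row_counts = {}
for label, from_clause in steps:
    (n,) = conn.execute(f"SELECT COUNT(*) {from_clause}").fetchone()
    row_counts[label] = n
    print(f"{label:12} -> {n} row(s)")
```

In this made-up data the count stays at 1 until the addresses join is added, which points at that join as the one multiplying rows.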

Related

Adding an unused table after FROM changes retrieved data

I am referring to the Chinook database, which I am using to learn SQLite.
This query retrieves the number of invoices for each CustomerId, as I wanted:
select i.customerid, count(i.invoiceid)
from invoices as i
group by i.customerid
returns:
+------------+--------------------+
| CustomerId | count(i.invoiceid) |
+------------+--------------------+
| 1 | 7 |
| 2 | 7 |
| 3 | 7 |
...
But as I was building a more complex query, I observed something that I cannot explain:
select i.customerid, count(i.invoiceid)
from invoices as i, customers as c
group by i.customerid
returns:
+------------+--------------------+
| CustomerId | count(i.invoiceid) |
+------------+--------------------+
| 1 | 413 |
| 2 | 413 |
| 3 | 413 |
...
It turns out that 413 = 7 * 59, and 59 is the number of distinct CustomerIds.
There must be some fundamental SQL behavior that I am misunderstanding here, because I would expect adding "customers as c" to the "from" clause to make no difference, since I am not using it yet. Can anyone enlighten me on what is happening?

Never use commas in the FROM clause. Only use proper, explicit, standard, readable JOIN syntax.
Your query is producing a Cartesian product of the rows in the two tables. Then your aggregation counts the number of rows, for each customer, in the Cartesian product.
You need something like this:
select i.customerid, count(i.invoiceid)
from invoices i join
customers c
on i.customerid = c.customerid
group by i.customerid

You are performing a cross join, which is the Cartesian product of the rows of your 2 tables. You were right about the origin of the 413 value.
With a cross join, if table A has 5 rows and table B has 7 rows, it will produce a result of 5 * 7 = 35 rows.
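The 5 * 7 arithmetic can be checked directly. This is a minimal sketch using Python's built-in sqlite3 module; the tables a and b and their contents are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (y INTEGER);
""")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(5)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(7)])

# Comma in FROM with no join condition: every row of a pairs with every row of b.
(cross_rows,) = conn.execute("SELECT COUNT(*) FROM a, b").fetchone()
print(cross_rows)  # 35
```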
When joining tables, you then need to add a join condition which will filter unrelated rows (cross joins are rarely what you want):
SELECT i.customerid, count(i.invoiceid)
FROM invoices as i, customers as c
WHERE i.customerid = c.customerid -- join condition
GROUP BY i.customerid
But the recommended syntax for join is explicit (no comma):
SELECT i.customerid, count(i.invoiceid)
FROM invoices as i
JOIN customers as c -- explicit join
ON i.customerid = c.customerid -- join condition
GROUP BY i.customerid
This performs an INNER JOIN by default, which returns only invoices that match a customer and only customers that have at least one invoice.
If you also want to display customers with 0 invoices, make customers the left table and use LEFT JOIN, which keeps rows from the left table (the first one in the FROM clause) even if they have no match in the right table:
SELECT c.customerid, count(i.invoiceid)
FROM customers as c
LEFT JOIN invoices as i -- keep customers without invoices
ON i.customerid = c.customerid -- join condition
GROUP BY c.customerid
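Here is a small sketch of the inner-versus-left distinction, with customers as the preserved (left) table. It uses Python's built-in sqlite3 module and a three-customer stand-in for Chinook; the rows are invented, and customer 3 deliberately has no invoices:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customerid INTEGER PRIMARY KEY);
    CREATE TABLE invoices (invoiceid INTEGER PRIMARY KEY, customerid INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO invoices VALUES (10, 1), (11, 1), (12, 2);  -- customer 3 has none
""")

# INNER JOIN: customers without invoices drop out of the result.
inner = conn.execute("""
    SELECT c.customerid, COUNT(i.invoiceid)
    FROM customers AS c
    JOIN invoices AS i ON i.customerid = c.customerid
    GROUP BY c.customerid
    ORDER BY c.customerid
""").fetchall()

# LEFT JOIN: every customer is kept; COUNT(i.invoiceid) skips the NULLs.
left = conn.execute("""
    SELECT c.customerid, COUNT(i.invoiceid)
    FROM customers AS c
    LEFT JOIN invoices AS i ON i.customerid = c.customerid
    GROUP BY c.customerid
    ORDER BY c.customerid
""").fetchall()

print(inner)  # [(1, 2), (2, 1)]
print(left)   # [(1, 2), (2, 1), (3, 0)]
```

Counting i.invoiceid rather than * matters here: COUNT(*) would report 1 for customer 3, because the outer join still produces one (all-NULL) row for that customer.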

Postgres: Many to many joins creates double output

I've recently added a many to many JOIN to one of my queries to add a "tag" functionality. The many to many works great, however, it's now causing a previously working part of the query to output records twice.
SELECT v.*
FROM "Server" AS s
JOIN "Vote" AS v ON (s.id = v."serverId")
JOIN "_ServerToTag" st ON (s.id = st."A")
OFFSET 0 LIMIT 25;
id | createdAt | authorId | serverId
-----+-------------------------+----------+----------
190 | 2020-12-23 15:47:25.476 | 6667 | 3
190 | 2020-12-23 15:47:25.476 | 6667 | 3
194 | 2020-12-21 15:47:25.476 | 6667 | 3
194 | 2020-12-21 15:47:25.476 | 6667 | 3
In the example above:
Server is my main table which contains a bunch of entries. Think of it as Reddit Posts, they have a title, content and use the Vote table to count "upvotes".
id | title
----+-------------------------------
3 | test server 3
Vote is a really simple table: it contains a timestamp of the "upvote", who created it, and the Server.id it is assigned to.
_ServerToTag is a table that contains two columns A and B. It connects Server to another table which contains Tags.
A | B
---+---
3 | 1
3 | 2
The above is a much-simplified query; in reality, I am summing the outcome of the query to get the total number of Votes.
The desired outcome would be that the results are not duplicated:
id | createdAt | authorId | serverId
-----+-------------------------+----------+----------
190 | 2020-12-23 15:47:25.476 | 6667 | 3
194 | 2020-12-21 15:47:25.476 | 6667 | 3
I'm really unsure why this is even happening so I have absolutely no idea how to fix it.
Any help would be greatly appreciated.
Edit:
DISTINCT works if I want to query the Vote table. But not in more complex queries. In my case it would look something more like this:
SELECT s.id, s.title, sum(case WHEN v."createdAt" >= '2020-12-01' AND v."createdAt" < '2021-01-01'
THEN 1 ELSE 0 END ) AS "voteCount"
FROM "Server" AS s
LEFT JOIN "Vote" AS v ON (s.id = v."serverId")
LEFT JOIN "_ServerToTag" st ON (s.id = st."A")
GROUP BY s.id, s.title;
id | title | voteCount
----+-------------------------------+-----------
3 | test server 3 | 4
In the above, I only need the voteCount column to be DISTINCT.
SELECT s.id, s.title, sum(DISTINCT case WHEN v."createdAt" >= '2020-12-01' AND v."createdAt" < '2021-01-01'
THEN 1 ELSE 0 END ) AS "voteCount"
FROM "Server" AS s
LEFT JOIN "Vote" AS v ON (s.id = v."serverId")
LEFT JOIN "_ServerToTag" st ON (s.id = st."A")
GROUP BY s.id, s.title;
id | title | voteCount
----+-------------------------------+-----------
3 | test server 3 | 1
The above kind of works, but it seems to only count one vote even if there are multiple.

It appears that the problem is the join to _ServerToTag that you added. Because there are multiple rows in _ServerToTag for each row in Server, the query returns multiple rows for each server, one for each matching row in _ServerToTag.
It appears that _ServerToTag was added to the query so it will only include servers which have tags. If that's your intent you can use:
SELECT v.id, v."authorId", v."serverId", COUNT(DISTINCT v."createdAt") AS TOTAL_VOTES
FROM "Server" AS s
INNER JOIN "Vote" AS v
ON s.id = v."serverId"
INNER JOIN (SELECT DISTINCT "A" FROM "_ServerToTag") st
ON s.id = st."A"
WHERE v."createdAt" >= '2020-12-01' AND
v."createdAt" < '2021-01-01'
GROUP BY v.id, v."authorId", v."serverId"
OFFSET 0 LIMIT 25
or
SELECT v.id, v."authorId", v."serverId", COUNT(DISTINCT v."createdAt") AS TOTAL_VOTES
FROM "Server" AS s
INNER JOIN "Vote" AS v
ON s.id = v."serverId"
WHERE v."createdAt" >= '2020-12-01' AND
v."createdAt" < '2021-01-01' AND
s.id IN (SELECT "A" FROM "_ServerToTag")
GROUP BY v.id, v."authorId", v."serverId"
OFFSET 0 LIMIT 25
which may communicate the intent of the query a bit better.
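The fan-out and the IN-based fix can be reproduced on the sample rows from the question. This is a sketch that runs on SQLite through Python's built-in sqlite3 module rather than on Postgres, and it keeps only the columns needed to show the effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "Server" (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE "Vote" (id INTEGER PRIMARY KEY, "serverId" INTEGER);
    CREATE TABLE "_ServerToTag" ("A" INTEGER, "B" INTEGER);
    INSERT INTO "Server" VALUES (3, 'test server 3');
    INSERT INTO "Vote" VALUES (190, 3), (194, 3);
    INSERT INTO "_ServerToTag" VALUES (3, 1), (3, 2);
""")

# Joining the tag link table repeats each vote once per tag on the server.
fanned_out = conn.execute("""
    SELECT v.id
    FROM "Server" AS s
    JOIN "Vote" AS v ON s.id = v."serverId"
    JOIN "_ServerToTag" AS st ON s.id = st."A"
    ORDER BY v.id
""").fetchall()

# Testing membership with IN only filters servers; it adds no rows.
semi_join = conn.execute("""
    SELECT v.id
    FROM "Server" AS s
    JOIN "Vote" AS v ON s.id = v."serverId"
    WHERE s.id IN (SELECT "A" FROM "_ServerToTag")
    ORDER BY v.id
""").fetchall()

print(fanned_out)  # [(190,), (190,), (194,), (194,)]
print(semi_join)   # [(190,), (194,)]
```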
EDIT
If you want to be able to count entries which have no votes you'll need to use an outer join to pull in the (potentially non-existent) votes and then use a CASE expression to only count votes if they exist:
SELECT s.id, v.id, v."authorId", v."serverId",
CASE
WHEN v.id IS NULL THEN 0
ELSE COUNT(DISTINCT v."createdAt")
END AS TOTAL_VOTES
FROM "Server" AS s
LEFT OUTER JOIN "Vote" AS v
ON s.id = v."serverId" AND
v."createdAt" >= '2020-12-01' AND -- date filter in ON keeps servers without votes
v."createdAt" < '2021-01-01'
WHERE s.id IN (SELECT "A" FROM "_ServerToTag")
GROUP BY s.id, v.id, v."authorId", v."serverId"
OFFSET 0 LIMIT 25
You may not actually need that, though. You may be able to get away with:
SELECT s.id, v.id, v."authorId", v."serverId",
COUNT(DISTINCT v."createdAt") AS TOTAL_VOTES
FROM "Server" AS s
LEFT OUTER JOIN "Vote" AS v
ON s.id = v."serverId" AND
v."createdAt" >= '2020-12-01' AND -- date filter in ON keeps servers without votes
v."createdAt" < '2021-01-01'
WHERE s.id IN (SELECT "A" FROM "_ServerToTag")
GROUP BY s.id, v.id, v."authorId", v."serverId"
OFFSET 0 LIMIT 25

Okay, so I asked a friend for help after I was unable to fix my problem with the answers I received.
I think my query was just too complex and confusing, and it was suggested that I use subqueries to make it less complicated and easier to manage.
My query now looks like this:
SELECT
s.id
, s.title
, COALESCE(v."VOTES", 0) AS "voteCount"
FROM "Server" AS s
-- Join tags
INNER JOIN
(
SELECT
st."A"
, json_agg(
json_build_object(
'id',
t.id,
'tagName',
t."tagName"
)
) as "tagsArray"
FROM
"_ServerToTag" AS st
INNER JOIN
"Tag" AS t
ON
t.id = st."B"
GROUP BY
st."A"
) AS tag
ON
tag."A" = s.id
-- Count votes
LEFT JOIN
(
SELECT
"serverId"
, COUNT(*) AS "VOTES"
FROM
"Vote" as v
WHERE
v."createdAt" >= '2020-12-01' AND
v."createdAt" < '2021-01-01'
GROUP BY "serverId"
) as v
ON
s.id = v."serverId"
OFFSET 0 LIMIT 25;
This works exactly the same way but by selecting what I need directly in the joins it's more readable and I have more control over the data I get back.
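The idea of aggregating each side in its own subquery before joining can be sketched as follows. This runs on SQLite through Python's built-in sqlite3 module, so the Postgres-only json_agg is replaced by a plain tag count, and server 4 (tagged, no votes) is an invented row added to exercise COALESCE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "Server" (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE "Vote" (id INTEGER PRIMARY KEY, "serverId" INTEGER);
    CREATE TABLE "_ServerToTag" ("A" INTEGER, "B" INTEGER);
    INSERT INTO "Server" VALUES (3, 'test server 3'), (4, 'test server 4');
    INSERT INTO "Vote" VALUES (190, 3), (194, 3);
    INSERT INTO "_ServerToTag" VALUES (3, 1), (3, 2), (4, 1);
""")

# Each derived table returns at most one row per server, so neither join
# can multiply the rows coming from the other.
result = conn.execute("""
    SELECT s.id, tag.tag_count, COALESCE(v.votes, 0) AS "voteCount"
    FROM "Server" AS s
    JOIN (
        SELECT "A", COUNT(*) AS tag_count
        FROM "_ServerToTag"
        GROUP BY "A"
    ) AS tag ON tag."A" = s.id
    LEFT JOIN (
        SELECT "serverId", COUNT(*) AS votes
        FROM "Vote"
        GROUP BY "serverId"
    ) AS v ON v."serverId" = s.id
    ORDER BY s.id
""").fetchall()

print(result)  # [(3, 2, 2), (4, 1, 0)]
```

Server 3 reports 2 votes even though it has 2 tags, because the votes were counted before the tag rows were joined in.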

SQL selecting from multiple tables to show even null values [duplicate]

This question already has answers here:
How to join 2 queries with different number of records and columns in oracle sql?
(2 answers)
Closed 5 years ago.
I have 3 tables which have foreign key ACCOUNT_NO, first table is CREDIT_LIST which holds all the credits taken from bank, second table is CUSTOMERS which holds customer info, and the last is ACCOUNTS which holds all the info about account itself. When I try to select
SELECT
B.CUSTOMER_NO AS CUSTOMER_NO,
B.CREDIT_TYPE AS CREDIT_TYPE,
B.ACCOUNT_NO AS CREDIT_ACCOUNT_NUMBER,
A.BRANCH_CODE AS BRANCH_CODE,
C.EXTERNAL_ACCOUNT_NO AS EXTERNAL_ACCOUNT_NUMBER
FROM
CREDIT_LIST B,
CUSTOMERS A,
ACCOUNTS C
WHERE
B.STATUS = 'A' -- ACTIVE
AND A.CUSTOMER_NO = B.CUSTOMER_NO
AND C.ACCOUNT_NO = B.ACCOUNT_NO
;
I get zero results because there is no EXTERNAL_ACCOUNT_NO in ACCOUNTS which has the c.account_no = b.account_no. The problem is that I want to show the info even if there is no EXTERNAL_ACCOUNT_NO and fill it with null for example:
| CUSTOMER_NO | CREDIT_TYPE | CREDIT_ACCOUNT_NUMBER | BRANCH_CODE | EXTERNAL_ACCOUNT_NUMBER
+-------------+-------------+-----------------------+-------------+------------------------
| 1 | some_type | 123456 | 01 |
| 2 | some_type | 654321 | 01 | 111111111111
I feel like this is extremely stupid but can't figure out what exactly I am doing wrong

I think this is what you need:
SELECT
B.CUSTOMER_NO AS CUSTOMER_NO,
B.CREDIT_TYPE AS CREDIT_TYPE,
B.ACCOUNT_NO AS CREDIT_ACCOUNT_NUMBER,
A.BRANCH_CODE AS BRANCH_CODE,
C.EXTERNAL_ACCOUNT_NO AS EXTERNAL_ACCOUNT_NUMBER
FROM
CREDIT_LIST B
JOIN CUSTOMERS A on A.CUSTOMER_NO = B.CUSTOMER_NO
LEFT JOIN ACCOUNTS C on C.ACCOUNT_NO = B.ACCOUNT_NO
WHERE
B.STATUS = 'A' -- ACTIVE
;
The left join makes sure you get details from A and B even when there is no matching data in C.
Also, the AS keyword is optional: you can just write B.ACCOUNT_NO CREDIT_ACCOUNT_NUMBER in line 4 of the query.
Left Outer Join - Reference
This is not recommended, but if you have a reason to use the old Oracle syntax, this is what you need:
SELECT
B.CUSTOMER_NO AS CUSTOMER_NO,
B.CREDIT_TYPE AS CREDIT_TYPE,
B.ACCOUNT_NO AS CREDIT_ACCOUNT_NUMBER,
A.BRANCH_CODE AS BRANCH_CODE,
C.EXTERNAL_ACCOUNT_NO AS EXTERNAL_ACCOUNT_NUMBER
FROM
CREDIT_LIST B,
CUSTOMERS A,
ACCOUNTS C
WHERE
B.STATUS = 'A' -- ACTIVE
AND A.CUSTOMER_NO = B.CUSTOMER_NO
AND C.ACCOUNT_NO(+) = B.ACCOUNT_NO -- (+) will do an outer join for you
;
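The effect of the LEFT JOIN version can be checked on the two example rows from the question. This sketch uses SQLite through Python's built-in sqlite3 module, so it covers only the ANSI syntax (the Oracle (+) operator is not available there); the column types are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CREDIT_LIST (CUSTOMER_NO INTEGER, CREDIT_TYPE TEXT,
                              ACCOUNT_NO INTEGER, STATUS TEXT);
    CREATE TABLE CUSTOMERS (CUSTOMER_NO INTEGER, BRANCH_CODE TEXT);
    CREATE TABLE ACCOUNTS (ACCOUNT_NO INTEGER, EXTERNAL_ACCOUNT_NO TEXT);
    INSERT INTO CREDIT_LIST VALUES (1, 'some_type', 123456, 'A'),
                                   (2, 'some_type', 654321, 'A');
    INSERT INTO CUSTOMERS VALUES (1, '01'), (2, '01');
    INSERT INTO ACCOUNTS VALUES (654321, '111111111111');  -- none for 123456
""")

rows = conn.execute("""
    SELECT B.CUSTOMER_NO, B.ACCOUNT_NO, A.BRANCH_CODE, C.EXTERNAL_ACCOUNT_NO
    FROM CREDIT_LIST B
    JOIN CUSTOMERS A ON A.CUSTOMER_NO = B.CUSTOMER_NO
    LEFT JOIN ACCOUNTS C ON C.ACCOUNT_NO = B.ACCOUNT_NO
    WHERE B.STATUS = 'A'
    ORDER BY B.CUSTOMER_NO
""").fetchall()

for row in rows:
    print(row)
# (1, 123456, '01', None)
# (2, 654321, '01', '111111111111')
```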

Selecting Same Column Twice Using Alias Table

Below are examples of the tables I am working with. These only represent the columns relevant to my query
_Requirements
RequirementID fkOwningWsID
-------------------------------------------
REQ-RPT-01 1
REQ-RPT-02 2
_Workstream
pk WsNm
-------------------------------------------
1 Workstream1
2 Workstream2
mnWorkstream_Leads
fkWsID fkEeID
-------------------------------------------
1 1
1 2
2 1
2 2
The below table is a result of a union. Employees can be from different companies, the below union lists all the employee IDs, the IDs for the employees who are from Company 1 (0 otherwise) and IDs for employees from company 2 (0 otherwise)
qryTrackerAllEeList
EeID Company1_ID Company2_ID
-------------------------------------------
1 1 0
2 0 2
I am attempting to view the following result
RequirementID WsNm Company1_Lead Company2_Lead
--------------------------------------------------------------------
REQ-RPT-01 Workstream1 1 2
REQ-RPT-02 Workstream2 1 2
I have issued the following SQL
SELECT DISTINCT Req.RequirementID, Ws.Wsnm, company1_id.ee_id, company2_id.ee_id
FROM (((([_Requirements] AS Req
INNER JOIN [_Workstream] AS Ws ON Req.fkOwningWsID = Ws.pkWsID)
INNER JOIN [mnWorkstream_Leads] AS wsLeads ON Ws.pkWsID = wsLeads.fkWsID)
LEFT OUTER JOIN qryTrackerAllEeList AS company1 ON wsLeads.fkEeID = company1.Company1_ID)
LEFT OUTER JOIN qryTrackerAllEeList AS company2 ON wsLeads.fkEeID = company2.Company2_ID)
The issue is, however, that I retrieve the following results
RequirementID WsNm Company1_Lead Company2_Lead
--------------------------------------------------------------------
REQ-RPT-01 Workstream1 2
REQ-RPT-01 Workstream1 1
REQ-RPT-02 Workstream2 2
REQ-RPT-02 Workstream2 1
Any suggestions on how to eliminate these duplicative rows and null values?

Use MAX() and GROUP BY to select only the non-null values and collapse them into one row (DISTINCT is no longer needed once you group):
SELECT Req.RequirementID, Ws.Wsnm,
MAX(company1_id.ee_id) as Company1_Lead, MAX(company2_id.ee_id) as Company2_Lead
FROM (((([_Requirements] AS Req
INNER JOIN [_Workstream] AS Ws ON Req.fkOwningWsID = Ws.pkWsID)
INNER JOIN [mnWorkstream_Leads] AS wsLeads ON Ws.pkWsID = wsLeads.fkWsID)
LEFT OUTER JOIN qryTrackerAllEeList AS company1 ON wsLeads.fkEeID = company1.Company1_ID)
LEFT OUTER JOIN qryTrackerAllEeList AS company2 ON wsLeads.fkEeID = company2.Company2_ID)
GROUP BY req.RequirementID, Ws.Wsnm
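Why MAX() collapses the rows can be seen in a reduced version of the data. This is a sketch on SQLite through Python's built-in sqlite3 module, not Access: qryTrackerAllEeList is modeled as a plain table, the requirement and workstream tables are left out, and the lead columns read EeID from each alias, which is an assumption about what the question's company1_id.ee_id refers to:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mnWorkstream_Leads (fkWsID INTEGER, fkEeID INTEGER);
    CREATE TABLE qryTrackerAllEeList (EeID INTEGER, Company1_ID INTEGER,
                                      Company2_ID INTEGER);
    INSERT INTO mnWorkstream_Leads VALUES (1, 1), (1, 2), (2, 1), (2, 2);
    INSERT INTO qryTrackerAllEeList VALUES (1, 1, 0), (2, 0, 2);
""")

joins = """
    FROM mnWorkstream_Leads AS wl
    LEFT JOIN qryTrackerAllEeList AS c1 ON wl.fkEeID = c1.Company1_ID
    LEFT JOIN qryTrackerAllEeList AS c2 ON wl.fkEeID = c2.Company2_ID
"""

# One row per lead: each lead matches only one of the two aliases.
ungrouped = conn.execute(
    "SELECT wl.fkWsID, c1.EeID, c2.EeID" + joins +
    "ORDER BY wl.fkWsID, wl.fkEeID"
).fetchall()

# MAX() ignores NULLs, so grouping folds the two half-empty rows into one.
grouped = conn.execute(
    "SELECT wl.fkWsID, MAX(c1.EeID), MAX(c2.EeID)" + joins +
    "GROUP BY wl.fkWsID ORDER BY wl.fkWsID"
).fetchall()

print(ungrouped)  # [(1, 1, None), (1, None, 2), (2, 1, None), (2, None, 2)]
print(grouped)    # [(1, 1, 2), (2, 1, 2)]
```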

SQL Query not working as it should

So I have three tables:
authors:
--------
ID Name
1 John
2 Sue
3 Mike
authors_publications:
---------------------
AuthorID PaperID
1 1
1 2
2 2
3 1
3 2
3 3
publications:
-------------
ID year
1 2004
2 2005
3 2004
I'm trying to join them so that I count the number of publications each author has had on 2004. If they didn't publish anything then it should be zero
ideally the result should look like this:
ID Name Publications_2004
1 John 1
2 Sue 0
3 Mike 2
I tried the following:
select a.ID, Name, count(*) as Publications_2004
from authors_publications as ap left join authors as a on ap.AuthorID=a.ID left join publications as p on p.ID=ap.PaperID
where year=2004
group by ap.AuthorID
I don't understand why it's not working. Its completely removing any authors that haven't published in 2004.

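Your WHERE clause takes the result set returned from the JOINs and then trims off the records where year<>2004. For authors with no 2004 paper, year is either a different value or NULL, so every one of their rows is removed.
To get around this you can do a few different things.
You can apply the filter to the publications table in the ON clause when joining. This filters the publications before joining. Note that you then need to count p.ID rather than *, so that unmatched rows are not counted: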
SELECT a.ID,
NAME,
count(p.ID) AS Publications_2004 -- count(*) would also count rows with no 2004 match
FROM authors_publications AS ap
LEFT JOIN authors AS a
ON ap.AuthorID = a.ID
LEFT JOIN publications AS p
ON p.ID = ap.PaperID AND
p.year = 2004
GROUP BY a.ID, NAME
You could use a CASE expression instead of a WHERE:
SELECT a.ID,
NAME,
SUM(CASE WHEN p.year = 2004 THEN 1 ELSE 0 END) AS Publications_2004
FROM authors_publications AS ap
LEFT JOIN authors AS a
ON ap.AuthorID = a.ID
LEFT JOIN publications AS p
ON p.ID = ap.PaperID
GROUP BY a.ID, NAME
You could use a subquery to pre-filter the publications table to only 2004 records, which is just explicitly doing what was implicit in the first option:
SELECT a.ID,
NAME,
count(p.ID) AS Publications_2004 -- count(*) would also count rows with no 2004 match
FROM authors_publications AS ap
LEFT JOIN authors AS a
ON ap.AuthorID = a.ID
LEFT JOIN (SELECT * FROM publications WHERE year = 2004) AS p
ON p.ID = ap.PaperID
GROUP BY a.ID, NAME
Also, because NAME is not wrapped in an aggregate function, you should add it to your GROUP BY; otherwise you may get unpredictable results, or an error on stricter databases.
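The difference between filtering in WHERE and filtering in ON can be verified on the sample tables from the question. This is a sketch using Python's built-in sqlite3 module; unlike the queries above it starts from authors, an assumption made so that an author with no publications at all would also be kept:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE authors_publications (AuthorID INTEGER, PaperID INTEGER);
    CREATE TABLE publications (ID INTEGER PRIMARY KEY, year INTEGER);
    INSERT INTO authors VALUES (1, 'John'), (2, 'Sue'), (3, 'Mike');
    INSERT INTO authors_publications VALUES (1, 1), (1, 2), (2, 2),
                                            (3, 1), (3, 2), (3, 3);
    INSERT INTO publications VALUES (1, 2004), (2, 2005), (3, 2004);
""")

# Filter in WHERE: rows without a 2004 paper are discarded after the join.
where_filtered = conn.execute("""
    SELECT a.ID, a.Name, COUNT(p.ID)
    FROM authors AS a
    LEFT JOIN authors_publications AS ap ON ap.AuthorID = a.ID
    LEFT JOIN publications AS p ON p.ID = ap.PaperID
    WHERE p.year = 2004
    GROUP BY a.ID, a.Name
    ORDER BY a.ID
""").fetchall()

# Filter in ON: non-2004 papers become NULL matches, which COUNT(p.ID) skips.
on_filtered = conn.execute("""
    SELECT a.ID, a.Name, COUNT(p.ID)
    FROM authors AS a
    LEFT JOIN authors_publications AS ap ON ap.AuthorID = a.ID
    LEFT JOIN publications AS p ON p.ID = ap.PaperID AND p.year = 2004
    GROUP BY a.ID, a.Name
    ORDER BY a.ID
""").fetchall()

print(where_filtered)  # [(1, 'John', 1), (3, 'Mike', 2)]
print(on_filtered)     # [(1, 'John', 1), (2, 'Sue', 0), (3, 'Mike', 2)]
```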
