Postgres: Many to many joins creates double output - sql

I've recently added a many to many JOIN to one of my queries to add a "tag" functionality. The many to many works great, however, it's now causing a previously working part of the query to output records twice.
SELECT v.*
FROM "Server" AS s
JOIN "Vote" AS v ON (s.id = v."serverId")
JOIN "_ServerToTag" st ON (s.id = st."A")
OFFSET 0 LIMIT 25;
id | createdAt | authorId | serverId
-----+-------------------------+----------+----------
190 | 2020-12-23 15:47:25.476 | 6667 | 3
190 | 2020-12-23 15:47:25.476 | 6667 | 3
194 | 2020-12-21 15:47:25.476 | 6667 | 3
194 | 2020-12-21 15:47:25.476 | 6667 | 3
In the example above:
Server is my main table which contains a bunch of entries. Think of it as Reddit Posts, they have a title, content and use the Vote table to count "upvotes".
id | title
----+-------------------------------
3 | test server 3
Vote is a really simple table, it contains a timestamp of the "upvote", who created it, and the Server.id it is assigned to.
_ServerToTag is a table that contains two columns A and B. It connects Server to another table which contains Tags.
A | B
---+---
3 | 1
3 | 2
The above is a much-simplified query. In reality, I am summing the outcome of the query to get the total number of Votes.
The desired outcome would be that the results are not duplicated:
id | createdAt | authorId | serverId
-----+-------------------------+----------+----------
190 | 2020-12-23 15:47:25.476 | 6667 | 3
194 | 2020-12-21 15:47:25.476 | 6667 | 3
I'm really unsure why this is even happening so I have absolutely no idea how to fix it.
Any help would be greatly appreciated.
Edit:
DISTINCT works if I want to query the Vote table. But not in more complex queries. In my case it would look something more like this:
SELECT s.id, s.title, sum(case WHEN v."createdAt" >= '2020-12-01' AND v."createdAt" < '2021-01-01'
THEN 1 ELSE 0 END ) AS "voteCount"
FROM "Server" AS s
LEFT JOIN "Vote" AS v ON (s.id = v."serverId")
LEFT JOIN "_ServerToTag" st ON (s.id = st."A")
GROUP BY s.id, s.title;
id | title | voteCount
----+-------------------------------+-----------
3 | test server 3 | 4
In the above, I only need the voteCount column to be DISTINCT.
SELECT s.id, s.title, sum(DISTINCT case WHEN v."createdAt" >= '2020-12-01' AND v."createdAt" < '2021-01-01'
THEN 1 ELSE 0 END ) AS "voteCount"
FROM "Server" AS s
LEFT JOIN "Vote" AS v ON (s.id = v."serverId")
LEFT JOIN "_ServerToTag" st ON (s.id = st."A")
GROUP BY s.id, s.title;
id | title | voteCount
----+-------------------------------+-----------
3 | test server 3 | 1
The above kind of works, but it seems to only count one vote even if there are multiple.

It appears that the problem is that you added the join to _ServerToTag. Because there are multiple rows in _ServerToTag for each row in Server the query returns multiple rows for each server, one for each matching row in _ServerToTag.
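The row multiplication can be reproduced on the sample rows from the question. The sketch below is only an illustration: it assumes an in-memory SQLite database (via Python's sqlite3 module) as a stand-in for Postgres, and it uses EXISTS as one possible way to test for tags without adding rows.

```python
import sqlite3

# In-memory SQLite as a stand-in for Postgres; rows mirror the question's sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Server" (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE "Vote" (id INTEGER PRIMARY KEY, "createdAt" TEXT,
                     "authorId" INTEGER, "serverId" INTEGER);
CREATE TABLE "_ServerToTag" ("A" INTEGER, "B" INTEGER);
INSERT INTO "Server" VALUES (3, 'test server 3');
INSERT INTO "Vote" VALUES (190, '2020-12-23 15:47:25.476', 6667, 3),
                          (194, '2020-12-21 15:47:25.476', 6667, 3);
INSERT INTO "_ServerToTag" VALUES (3, 1), (3, 2);
""")

# Joining the link table repeats every vote once per tag: 2 votes x 2 tags = 4 rows.
joined_ids = [row[0] for row in conn.execute("""
    SELECT v.id
    FROM "Server" AS s
    JOIN "Vote" AS v ON s.id = v."serverId"
    JOIN "_ServerToTag" AS st ON s.id = st."A"
    ORDER BY v.id
""")]

# Testing for the existence of a tag filters servers without adding rows.
filtered_ids = [row[0] for row in conn.execute("""
    SELECT v.id
    FROM "Server" AS s
    JOIN "Vote" AS v ON s.id = v."serverId"
    WHERE EXISTS (SELECT 1 FROM "_ServerToTag" AS st WHERE st."A" = s.id)
    ORDER BY v.id
""")]

print(joined_ids)    # [190, 190, 194, 194]
print(filtered_ids)  # [190, 194]
```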
It appears that _ServerToTag was added to the query so that it only includes servers which have tags. If that's your intent you can use:
SELECT v.id, v."authorId", v."serverId", COUNT(DISTINCT v."createdAt") AS TOTAL_VOTES
FROM "Server" AS s
INNER JOIN "Vote" AS v
ON s.id = v."serverId"
INNER JOIN (SELECT DISTINCT "A" FROM "_ServerToTag") st
ON s.id = st."A"
WHERE v."createdAt" >= '2020-12-01' AND
v."createdAt" < '2021-01-01'
GROUP BY v.id, v."authorId", v."serverId"
OFFSET 0 LIMIT 25
or
SELECT v.id, v."authorId", v."serverId", COUNT(DISTINCT v."createdAt") AS TOTAL_VOTES
FROM "Server" AS s
INNER JOIN "Vote" AS v
ON s.id = v."serverId"
WHERE v."createdAt" >= '2020-12-01' AND
v."createdAt" < '2021-01-01' AND
s.id IN (SELECT "A" FROM "_ServerToTag")
GROUP BY v.id, v."authorId", v."serverId"
OFFSET 0 LIMIT 25
which may communicate the intent of the query a bit better.
EDIT
If you want to be able to count entries which have no votes you'll need to use an outer join to pull in the (potentially non-existent) votes and then use a CASE expression to only count votes if they exist:
SELECT s.id, v.id, v."authorId", v."serverId",
CASE
WHEN v.id IS NULL THEN 0
ELSE COUNT(DISTINCT v."createdAt")
END AS TOTAL_VOTES
FROM "Server" AS s
LEFT OUTER JOIN "Vote" AS v
ON s.id = v."serverId" AND
-- date filter kept in the ON clause so servers without votes are not dropped
v."createdAt" >= '2020-12-01' AND
v."createdAt" < '2021-01-01'
WHERE s.id IN (SELECT "A" FROM "_ServerToTag")
GROUP BY s.id, v.id, v."authorId", v."serverId"
OFFSET 0 LIMIT 25
You may not actually need that though - you may be able to get away with
SELECT s.id, v.id, v."authorId", v."serverId",
COUNT(DISTINCT v."createdAt") AS TOTAL_VOTES
FROM "Server" AS s
LEFT OUTER JOIN "Vote" AS v
ON s.id = v."serverId" AND
-- date filter kept in the ON clause so servers without votes are not dropped
v."createdAt" >= '2020-12-01' AND
v."createdAt" < '2021-01-01'
WHERE s.id IN (SELECT "A" FROM "_ServerToTag")
GROUP BY s.id, v.id, v."authorId", v."serverId"
OFFSET 0 LIMIT 25

Okay, so I asked a friend for help after not really being able to fix my problem with the answers I received.
I think my query was just too complex and confusing, and it was suggested that I use subqueries to make it less complicated and easier to manage.
My query now looks like this:
SELECT
s.id
, s.title
, COALESCE(v."VOTES", 0) AS "voteCount"
FROM "Server" AS s
-- Join tags
INNER JOIN
(
SELECT
st."A"
, json_agg(
json_build_object(
'id',
t.id,
'tagName',
t."tagName"
)
) as "tagsArray"
FROM
"_ServerToTag" AS st
INNER JOIN
"Tag" AS t
ON
t.id = st."B"
GROUP BY
st."A"
) AS tag
ON
tag."A" = s.id
-- Count votes
LEFT JOIN
(
SELECT
"serverId"
, COUNT(*) AS "VOTES"
FROM
"Vote" as v
WHERE
v."createdAt" >= '2020-12-01' AND
v."createdAt" < '2021-01-01'
GROUP BY "serverId"
) as v
ON
s.id = v."serverId"
OFFSET 0 LIMIT 25;
This works exactly the same way but by selecting what I need directly in the joins it's more readable and I have more control over the data I get back.
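To see why counting inside the joined subquery helps, here is a minimal sketch that compares a count over the raw joins with a pre-aggregated count. It assumes SQLite (Python's sqlite3) in place of Postgres, and server 4 is a hypothetical extra row added only to show the zero-vote case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Server" (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE "Vote" (id INTEGER PRIMARY KEY, "createdAt" TEXT, "serverId" INTEGER);
CREATE TABLE "_ServerToTag" ("A" INTEGER, "B" INTEGER);
-- server 4 is hypothetical: it has a tag but no votes
INSERT INTO "Server" VALUES (3, 'test server 3'), (4, 'hypothetical server 4');
INSERT INTO "Vote" VALUES (190, '2020-12-23 15:47:25.476', 3),
                          (194, '2020-12-21 15:47:25.476', 3);
INSERT INTO "_ServerToTag" VALUES (3, 1), (3, 2), (4, 1);
""")

# Counting over the raw joins: each vote is counted once per tag.
naive = conn.execute("""
    SELECT s.id, COUNT(v.id)
    FROM "Server" AS s
    LEFT JOIN "Vote" AS v ON s.id = v."serverId"
    LEFT JOIN "_ServerToTag" AS st ON s.id = st."A"
    GROUP BY s.id
    ORDER BY s.id
""").fetchall()

# Counting inside a subquery: one row per server reaches the outer join.
pre_aggregated = conn.execute("""
    SELECT s.id, COALESCE(v.votes, 0)
    FROM "Server" AS s
    JOIN (SELECT DISTINCT "A" FROM "_ServerToTag") AS tag ON tag."A" = s.id
    LEFT JOIN (
        SELECT "serverId", COUNT(*) AS votes
        FROM "Vote"
        WHERE "createdAt" >= '2020-12-01' AND "createdAt" < '2021-01-01'
        GROUP BY "serverId"
    ) AS v ON v."serverId" = s.id
    ORDER BY s.id
""").fetchall()

print(naive)           # [(3, 4), (4, 0)]
print(pre_aggregated)  # [(3, 2), (4, 0)]
```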

Related

How to get frequency of each value in two columns in a mapping table

We have a mapping table with two columns, SITE_FROM & SITE_TO, the data is like:
We want to run a query to summarize for each site, get count in the SITE_FROM column and count in the SITE_TO column, such as:
<- desired
The explanation is, S1 appeared 2 times in SITE_FROM, 2 times in SITE_TO. S2 appeared 3 times in SITE_FROM, 4 times in SITE_TO. S3 appears 0 times in SITE_FROM, 3 times in SITE_TO...
I came up with some query:
SELECT SITE_FROM AS SITE,
count(*) CNT,
count(decode(SITE_TO,'S1',1)) S1,
count(decode(SITE_TO,'S2',1)) S2,
count(decode(SITE_TO,'S3',1)) S3,
count(decode(SITE_TO,'S4',1)) S4,
count(decode(SITE_TO,'S5',1)) S5
FROM site_changes
GROUP BY SITE_FROM
ORDER BY 1
;
But it returns detailed site to site mapping, also it has to have hard coded values in it, if later more sites are added, we need to remember to update the query as well:
<- undesired
Thank you for your time.
First get a defined list of all sites by unioning the site values from both columns (the CTE below). This allows any site to be added later, with no hard coding.
Then join the CTE to SITE_CHANGES for the from and to counts, and count with GROUP BY.
Because the left join produces NULLs where no match is found, and NULLs are not counted, we get the desired counts.
DEMO: https://dbfiddle.uk/k3ehGbIt
WITH CTE AS (SELECT SITE_FROM as SITE FROM SITE_CHANGES
UNION
SELECT SITE_TO FROM SITE_CHANGES)
SELECT A.SITE,
coalesce(SUM(B.SITE_FROM_CNT),0) as SITE_FROM,
coalesce(SUM(C.SITE_TO_CNT) ,0) as SITE_TO
FROM CTE A
LEFT JOIN (SELECT SITE_FROM, count(SITE_FROM) SITE_FROM_CNT
FROM SITE_CHANGES
GROUP BY SITE_FROM) B
on A.SITE= B.SITE_FROM
LEFT JOIN (SELECT SITE_TO, count(SITE_TO) SITE_TO_CNT
FROM SITE_CHANGES
GROUP BY SITE_TO) C
on A.SITE = C.SITE_TO
GROUP BY A.SITE
ORDER BY A.SITE
Giving us:
+------+-----------+---------+
| SITE | SITE_FROM | SITE_TO |
+------+-----------+---------+
| S1 | 2 | 2 |
| S2 | 3 | 4 |
| S3 | 0 | 3 |
| S4 | 4 | 1 |
| S5 | 1 | 0 |
+------+-----------+---------+
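The same approach can be tried on a small data set. The sketch below assumes SQLite (through Python's sqlite3) rather than Oracle, and the five mapping rows are hypothetical, since the original sample data is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site_changes (site_from TEXT, site_to TEXT);
-- hypothetical sample rows, not the data from the question
INSERT INTO site_changes VALUES
    ('S1', 'S2'), ('S1', 'S3'), ('S2', 'S1'), ('S2', 'S3'), ('S4', 'S2');
""")

rows = conn.execute("""
    WITH sites AS (              -- every site seen in either column
        SELECT site_from AS site FROM site_changes
        UNION
        SELECT site_to FROM site_changes
    )
    SELECT a.site,
           COALESCE(b.cnt, 0) AS site_from,
           COALESCE(c.cnt, 0) AS site_to
    FROM sites AS a
    LEFT JOIN (SELECT site_from, COUNT(*) AS cnt
               FROM site_changes GROUP BY site_from) AS b
           ON a.site = b.site_from
    LEFT JOIN (SELECT site_to, COUNT(*) AS cnt
               FROM site_changes GROUP BY site_to) AS c
           ON a.site = c.site_to
    ORDER BY a.site
""").fetchall()

print(rows)  # [('S1', 2, 1), ('S2', 2, 2), ('S3', 0, 2), ('S4', 1, 0)]
```

Because each count subquery returns at most one row per site, the two left joins cannot multiply rows, so no outer GROUP BY is needed in this variant.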
I UNIONed both columns to get the full list of from and to site IDs, then joined to that set twice with left outer joins. To avoid nulls I used NVL to replace each null with 0.
SELECT
SITE
, NVL(Z.SITE_FROM,0) SITE_FROM
, NVL(A.SITE_TO,0) SITE_TO
FROM
(
SELECT sitefrom SITE
FROM
site_changes
UNION
SELECT siteto SITE
from site_changes
) X
LEFT OUTER JOIN
(
SELECT COUNT(1) SITE_FROM, SITEFROM FROM site_changes GROUP BY SITEFROM
) Z ON X.SITE = Z.SITEFROM
LEFT OUTER JOIN
(
SELECT COUNT(1) SITE_TO, SITETO FROM site_changes GROUP BY SITETO
) A ON X.SITE = A.SITETO
The challenge is not having the same sites on both columns so we unpivot them to get a list of all the info and then pivot them back with count to get the info we need.
select *
from t
unpivot (site for to_from in ("SITE FROM","SITE TO"))
pivot (count(*) for to_from in('SITE FROM' as "SITE FROM", 'SITE TO' as "SITE TO")) p
order by site
SITE | SITE FROM | SITE TO
-----+-----------+--------
S1   | 2         | 2
S2   | 3         | 4
S3   | 0         | 3
S4   | 4         | 1
S5   | 1         | 0
Fiddle
Here's the same concept, but using a full outer join instead of selecting out the distinct elements.
with sf as (
select
site_from as site,
count(*) as site_from
from site_changes
group by
site_from
),
st as (
select
site_to as site,
count(*) as site_to
from site_changes
group by
site_to
)
select
site,
nvl(site_from, 0) as site_from,
nvl(site_to, 0) as site_to
from sf
full outer join st
using (site)
@vlookup indicated that using FULL OUTER JOIN might be less performant. If you have a chance to compare the queries, please share the results.

SQL query that joins two tables and sum from second

I have these tables with example data:
Calendar
ID | Date
---+-----------
1 | 2020-01-01
2 | 2020-01-02
3 | 2020-01-03
EmployeeTimeWorked
ID | Date | HoursWorked | UserID
---+------------+--------------+-------
1 | 2020-01-01 | 2 | 2
2 | 2020-01-01 | 4 | 2
I want to write an MS-SQL query that shows the days on which the user has not worked the required hours, and how many hours they have left to work (they should work 8 hours per day), all within a time period, say a week.
The result should look like this:
EmployeeHaveNotWorked
Date | HoursLeftToWork
-----------+----------------
2020-01-01 | 2
Any idea how to make such a MS-SQL Query?
First get all users with all dates. This is done with a cross join. Seeing that you are using a UserID I suppose there is a users table. Otherwise get the users from the EmployeeTimeWorked table.
Then outer join the working times per user and date. This is a simple aggregation query.
Then subtract the worked hours from the required 8 hours.
select
u.userid,
c.date,
8 - coalesce(w.hours_worked, 0) as hours_left_to_work
from users u
cross join calendar c
left outer join
(
select userid, date, sum(hoursworked) as hours_worked
from employeetimeworked
group by userid, date
) w on w.userid = u.userid and w.date = c.date
order by u.userid, c.date;
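As a quick check of the three steps, here is a sketch run against the example rows from the question. It assumes SQLite (Python's sqlite3) instead of MS-SQL, and because no users table is shown, the user list is derived from EmployeeTimeWorked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (id INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE employeetimeworked (id INTEGER PRIMARY KEY, date TEXT,
                                 hoursworked INTEGER, userid INTEGER);
INSERT INTO calendar VALUES (1, '2020-01-01'), (2, '2020-01-02'), (3, '2020-01-03');
INSERT INTO employeetimeworked VALUES (1, '2020-01-01', 2, 2), (2, '2020-01-01', 4, 2);
""")

rows = conn.execute("""
    SELECT u.userid, c.date,
           8 - COALESCE(w.hours_worked, 0) AS hours_left_to_work
    -- step 1: every user paired with every calendar date
    FROM (SELECT DISTINCT userid FROM employeetimeworked) AS u
    CROSS JOIN calendar AS c
    -- step 2: hours already summed per user and date, so the join adds no rows
    LEFT JOIN (
        SELECT userid, date, SUM(hoursworked) AS hours_worked
        FROM employeetimeworked
        GROUP BY userid, date
    ) AS w ON w.userid = u.userid AND w.date = c.date
    ORDER BY u.userid, c.date
""").fetchall()

print(rows)
# [(2, '2020-01-01', 2), (2, '2020-01-02', 8), (2, '2020-01-03', 8)]
```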
Use a cross join to generate all possible user/date rows, then keep the ones with fewer than 8 hours logged:
select u.userid, c.date,
8 - coalesce(sum(etw.HoursWorked), 0) as remaining_time
from calendar c cross join
(select distinct userid from EmployeeTimeWorked) u left join
EmployeeTimeWorked etw
on etw.userid = u.userid and etw.date = c.date
group by u.userid, c.date
having coalesce(sum(etw.HoursWorked), 0) < 8
This query seems to have done it for me:
select * from (select c.Date, 8 - coalesce(sum(t.durationHours),0) hours_left_to_work
from Calendar c
left join TimeLog t on t.Date = c.Date
where c.date >= '2020-08-01' and c.date <= '2020-08-31'
group by c.Date) as q1
where q1.hours_left_to_work IS NOT NULL
AND q1.hours_left_to_work > 0;
(TimeLog here is my EmployeeTimeWorked table.)

Joining on multiple tables causing incorrect results

I am trying to extract some data grouped by the markets we operate in. The table structure looks like this:
bks:
opportunity_id
bks_opps:
opportunity_id | trip_start | state
bts:
boat_id | package_id
pckgs:
package_id | boat_id
addresses:
addressable_id | district_id
districts:
district_id
What I wanted to do is to count the number of won, lost and total and percentage won for each district.
SELECT d.name AS "District",
SUM(CASE WHEN bo.state IN ('won') THEN 1 ELSE 0 END) AS "Won",
SUM(CASE WHEN bo.state IN ('lost') THEN 1 ELSE 0 END) AS "Lost",
Count(bo.state) AS "Total",
Round(100 * SUM(CASE WHEN bo.state IN ('won') THEN 1 ELSE 0 END) / Count(bo.state)) AS "% Won"
FROM bks b
INNER JOIN bks_opps bo ON bo.id = b.opportunity_id
INNER JOIN pckgs p ON p.id = b.package_id
INNER JOIN bts bt ON bt.id = p.boat_id
INNER JOIN addresses a ON a.addressable_type = 'Boat' AND a.addressable_id = bt.id
INNER JOIN districts d ON d.id = a.district_id
WHERE bo.trip_start BETWEEN '2016-05-12' AND '2016-06-12'
GROUP BY d.name;
This returns incorrect data (the values are way higher than expected). However, when I get rid of all the joins and stop grouping by district, the numbers are correct (counting the total number of opportunities). Can anybody spot what I am doing wrong? The most related question on here is this one.
Example data:
District | won | lost | total
---------+-----+------+------
1        | 42  | 212  | 254
Expected data:
District | won | lost | total
---------+-----+------+------
1        | 22  | 155  | 177
Formatted comment here:
I would venture a guess that one of your join conditions is at fault here, but with the provided structure it is impossible to say.
For instance, you have this join INNER JOIN pckgs p ON p.id = b.package_id, but package_id is not listed as a column in bks.
And these joins look especially suspect:
INNER JOIN pckgs p ON p.id = b.package_id
INNER JOIN bts bt ON bt.id = p.boat_id
If a boat can exist in multiple packages, it will be an issue.
To troubleshoot, start with the simplest query you can:
SELECT b.opportunity_id
FROM bks b
Then leave the select alone, and proceed to add in each join:
SELECT b.opportunity_id
FROM bks b
INNER JOIN pckgs p ON p.id = b.package_id
At some point you'll likely see a jump in the number of rows returned. Whichever JOIN you added last is your issue.
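That troubleshooting loop can be scripted. The sketch below is an assumption-laden illustration: it uses SQLite (Python's sqlite3), invents id columns and a handful of hypothetical rows, and deliberately gives one boat two address rows so that the last join is the one that inflates the count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical schema and rows, shaped after the joins in the question
CREATE TABLE bks (id INTEGER PRIMARY KEY, opportunity_id INTEGER, package_id INTEGER);
CREATE TABLE pckgs (id INTEGER PRIMARY KEY, boat_id INTEGER);
CREATE TABLE bts (id INTEGER PRIMARY KEY);
CREATE TABLE addresses (id INTEGER PRIMARY KEY, addressable_type TEXT,
                        addressable_id INTEGER, district_id INTEGER);
INSERT INTO bks VALUES (1, 10, 100), (2, 11, 100);
INSERT INTO pckgs VALUES (100, 7);
INSERT INTO bts VALUES (7);
INSERT INTO addresses VALUES (1, 'Boat', 7, 1), (2, 'Boat', 7, 1);  -- two rows for boat 7
""")

joins = [
    "JOIN pckgs p ON p.id = b.package_id",
    "JOIN bts bt ON bt.id = p.boat_id",
    "JOIN addresses a ON a.addressable_type = 'Boat' AND a.addressable_id = bt.id",
]

# Start from the base table and add one join at a time, recording the row count.
sql = "SELECT COUNT(*) FROM bks b"
counts = [("bks only", conn.execute(sql).fetchone()[0])]
for join in joins:
    sql += " " + join
    counts.append((join.split()[1], conn.execute(sql).fetchone()[0]))

for label, n in counts:
    print(label, n)
# bks only 2
# pckgs 2
# bts 2
# addresses 4
```

The first step where the count jumps (here, addresses) points at the join that needs a tighter condition or a pre-aggregated subquery.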

group twice in one query

I use the code below, but it doesn't return what I expect.
The table relationships are:
each gallery includes multiple media, and each media includes multiple media_user_action rows.
I want to count how many media_user_action rows each gallery has and order by that count, getting one row per gallery:
rows: [
{
"id": 1
},
{
"id": 2
}
]
but this query returns duplicate gallery rows, something like:
rows: [
{
"id": 1
},
{
"id": 1
},
{
"id": 2
}
...
]
I think this is because the LEFT JOIN subquery groups the media_user_action rows only by media_id.
Do I need to group by gallery_id as well?
SELECT
g.*
FROM gallery g
LEFT JOIN gallery_media gm ON gm.gallery_id = g.id
LEFT JOIN (
SELECT
media_id,
COUNT(*) as mua_count
FROM media_user_action
WHERE type = 0
GROUP BY media_id
) mua ON mua.media_id = gm.media_id
ORDER BY g.id desc NULLS LAST OFFSET $1 LIMIT $2
table
gallery
id |
1 |
2 |
gallery_media
id | gallery_id fk gallery.id | media_id fk media.id
1 | 1 | 1
2 | 1 | 2
3 | 2 | 3
....
media_user_action
id | media_id fk media.id | user_id | type
1 | 1 | 1 | 0
2 | 1 | 2 | 0
3 | 3 | 1 | 0
...
media
id |
1 |
2 |
3 |
UPDATE
There are more tables I need to select from; this is part of a function like this https://jsfiddle.net/g8wtqqqa/1/ that builds the query from the options the user enters.
So, to correct my question: if the user wants to count media_user_action and order by it, I want to know how to put that in a subquery, if possible without changing any other code.
Based on @trincot's answer below, I updated the code: I only added media_count at the top, changed it a little, and put the rest in a subquery. This is what I want.
The rows are now grouped by gallery.id, but sorting by media_count DESC and ASC gives the same result. The sort isn't working and I can't find out why.
SELECT
g.*,
row_to_json(gi.*) as gallery_information,
row_to_json(gl.*) as gallery_limit,
media_count
FROM gallery g
LEFT JOIN gallery_information gi ON gi.gallery_id = g.id
LEFT JOIN gallery_limit gl ON gl.gallery_id = g.id
LEFT JOIN "user" u ON u.id = g.create_by_user_id
LEFT JOIN category_gallery cg ON cg.gallery_id = g.id
LEFT JOIN category c ON c.id = cg.category_id
LEFT JOIN (
SELECT
gm.gallery_id,
COUNT(DISTINCT mua.media_id) media_count
FROM gallery_media gm
INNER JOIN media_user_action mua
ON mua.media_id = gm.media_id AND mua.type = 0
GROUP BY gm.gallery_id
) gm ON gm.gallery_id = g.id
ORDER BY gm.media_count asc NULLS LAST OFFSET $1 LIMIT $2
The join with gallery_media table is multiplying your results. The count and grouping should happen after you have made that join.
You could achieve that like this:
SELECT g.id,
COUNT(DISTINCT mua.media_id)
FROM gallery g
LEFT JOIN gallery_media gm
ON gm.gallery_id = g.id
LEFT JOIN media_user_action mua
ON mua.media_id = gm.media_id AND mua.type = 0
GROUP BY g.id
ORDER BY 2 DESC
If you need the other information as well, you could use the above (in simplified form) as a sub-query, which you join with anything else that you need, but which will not multiply the number of rows:
SELECT g.*,
row_to_json(gi.*) as gallery_information,
row_to_json(gl.*) as gallery_limit,
media_count
FROM gallery g
LEFT JOIN (
SELECT gm.gallery_id,
COUNT(DISTINCT mua.media_id) media_count
FROM gallery_media gm
INNER JOIN media_user_action mua
ON mua.media_id = gm.media_id AND mua.type = 0
GROUP BY gm.gallery_id
) gm
ON gm.gallery_id = g.id
LEFT JOIN gallery_information gi ON gi.gallery_id = g.id
LEFT JOIN gallery_limit gl ON gl.gallery_id = g.id
ORDER BY media_count DESC NULLS LAST
OFFSET $1
LIMIT $2
The above assumes that gallery_id is unique in the tables gallery_information and gallery_limit.
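A small sketch of the pre-aggregation idea, run on the sample rows from the question. It assumes SQLite (Python's sqlite3) instead of Postgres, and it counts media_user_action rows per gallery with COUNT(*), which is one reading of what the question asks for. The alias action_count is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gallery (id INTEGER PRIMARY KEY);
CREATE TABLE gallery_media (id INTEGER PRIMARY KEY, gallery_id INTEGER, media_id INTEGER);
CREATE TABLE media_user_action (id INTEGER PRIMARY KEY, media_id INTEGER,
                                user_id INTEGER, type INTEGER);
INSERT INTO gallery VALUES (1), (2);
INSERT INTO gallery_media VALUES (1, 1, 1), (2, 1, 2), (3, 2, 3);
INSERT INTO media_user_action VALUES (1, 1, 1, 0), (2, 1, 2, 0), (3, 3, 1, 0);
""")

# Joining gallery_media directly yields one row per media, so gallery 1 repeats.
duplicated = [row[0] for row in conn.execute("""
    SELECT g.id
    FROM gallery g
    LEFT JOIN gallery_media gm ON gm.gallery_id = g.id
    ORDER BY g.id
""")]

# Grouping by gallery_id inside the subquery leaves one row per gallery to join.
per_gallery = conn.execute("""
    SELECT g.id, COALESCE(x.action_count, 0) AS action_count
    FROM gallery g
    LEFT JOIN (
        SELECT gm.gallery_id, COUNT(*) AS action_count
        FROM gallery_media gm
        JOIN media_user_action mua
          ON mua.media_id = gm.media_id AND mua.type = 0
        GROUP BY gm.gallery_id
    ) x ON x.gallery_id = g.id
    ORDER BY action_count DESC, g.id
""").fetchall()

print(duplicated)   # [1, 1, 2]
print(per_gallery)  # [(1, 2), (2, 1)]
```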
You're grouping by media_id to get a count, but since one gallery can have many gallery_media, you still end up with multiple rows for one gallery. You can either sum the mua_count from your subselect:
SELECT g.*, sum(mua_count)
FROM gallery g
LEFT JOIN gallery_media gm ON gm.gallery_id = g.id
LEFT JOIN (
SELECT media_id,
COUNT(*) as mua_count
FROM media_user_action
WHERE type = 0
GROUP BY media_id
) mua ON mua.media_id = gm.media_id
GROUP BY g.id
ORDER BY g.id desc NULLS LAST;
id | sum
----+-----
2 | 1
1 | 2
Or you can just JOIN all the way through and group once on g.id:
SELECT g.id, count(*)
FROM gallery g
JOIN gallery_media gm ON gm.gallery_id = g.id
JOIN media_user_action mua ON mua.media_id = gm.media_id
GROUP BY g.id
ORDER BY count DESC;
id | count
----+-------
1 | 2
2 | 1
If you only want to show data from the gallery table (with select g.*), why do you join the other tables? An outer join attaches one or more records to each main record (depending on how many matches are found in the outer-joined table), so it is no surprise that you get duplicates (in your case because gallery ID 1 has two matches in gallery_media).

Joining and Grouping data from 3 tables

I have three tables
Category
CategorySerno | CategoryName
1 One
2 Two
3 Three
Status
StatusSerno | Status
1 Active
2 Pending
Data
CatId |Status | Date
1 1 2014-07-26 11:30:09.693
2 2 2014-07-25 17:30:09.693
1 1 2014-07-25 17:30:09.693
1 2 2014-07-25 17:30:09.693
When I join them, I need only the row with the latest Date for each category, like:
One Active 2014-07-26 11:30:09.693
Two Inactive 2014-07-25 17:30:09.693
Three Null Null
When I join and group them, it gives me:
One Active 2014-07-26 11:30:09.693
One Active 2014-07-26 11:30:09.693
One Active 2014-07-26 11:30:09.693
Two Inactive 2014-07-25 17:30:09.693
Three Null Null
You could use ROW_NUMBER in a CTE:
WITH CTE AS
(
SELECT c.CategoryName,
s.Status,
d.Date,
dateNum = ROW_NUMBER() OVER (PARTITION BY c.CategorySerno
ORDER BY d.Date DESC)
FROM Category c
LEFT OUTER JOIN Data d
ON c.CategorySerno = d.CatId
LEFT OUTER JOIN Status s
ON d.Status = s.StatusSerno
)
SELECT CategoryName, Status, Date
FROM CTE
WHERE dateNum = 1
Demo-Fiddle
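For illustration, the ROW_NUMBER idea can be run on the sample rows from the question. This sketch assumes SQLite 3.25 or newer (for window functions) through Python's sqlite3, and it partitions by category only, on the assumption that exactly one row per category is wanted. The names ranked and rn are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Category (CategorySerno INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE Status (StatusSerno INTEGER PRIMARY KEY, Status TEXT);
CREATE TABLE Data (CatId INTEGER, Status INTEGER, Date TEXT);
INSERT INTO Category VALUES (1, 'One'), (2, 'Two'), (3, 'Three');
INSERT INTO Status VALUES (1, 'Active'), (2, 'Pending');
INSERT INTO Data VALUES (1, 1, '2014-07-26 11:30:09.693'),
                        (2, 2, '2014-07-25 17:30:09.693'),
                        (1, 1, '2014-07-25 17:30:09.693'),
                        (1, 2, '2014-07-25 17:30:09.693');
""")

rows = conn.execute("""
    WITH ranked AS (
        SELECT c.CategorySerno, c.CategoryName, s.Status, d.Date,
               -- number each category's rows from newest to oldest
               ROW_NUMBER() OVER (PARTITION BY c.CategorySerno
                                  ORDER BY d.Date DESC) AS rn
        FROM Category c
        LEFT JOIN Data d ON c.CategorySerno = d.CatId
        LEFT JOIN Status s ON d.Status = s.StatusSerno
    )
    SELECT CategoryName, Status, Date
    FROM ranked
    WHERE rn = 1            -- keep only the newest row per category
    ORDER BY CategorySerno
""").fetchall()

for row in rows:
    print(row)
# ('One', 'Active', '2014-07-26 11:30:09.693')
# ('Two', 'Pending', '2014-07-25 17:30:09.693')
# ('Three', None, None)
```

Status 2 prints as Pending here because that is what the sample Status table contains.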
SELECT CategoryName, Status.Status, MAX(Data.Date) FROM Category
LEFT OUTER JOIN Data ON CategorySerno = CatId
LEFT OUTER JOIN Status ON Data.Status = Status.StatusSerno
GROUP BY CategoryName, Status.Status
You probably have a mismatch between the SELECT and GROUP BY columns, which causes the duplicates.
Try this:
SELECT Category.CategoryName, Status.Status, MAX(Data.Date) Data
FROM Data
LEFT JOIN Category ON Category.CategorySerno = Data.CatId
LEFT JOIN Status ON Status.StatusSerno = Data.Status
GROUP BY Category.CategoryName, Status.Status
You don't mention the RDBMS you're working in but you might try starting with something like:
SELECT
c.CategoryName
, s.Status
, d.Date
FROM
Data d
LEFT OUTER JOIN Category c ON d.CatId = c.CategorySerno
LEFT OUTER JOIN Status s ON d.Status = s.StatusSerno
WHERE
d.date=(
SELECT
max(dd.date)
FROM
Data dd
WHERE
d.CatId = dd.CatId
AND
d.Status = dd.Status
)
To make this more maintainable in the long run, consider using a convention to identify the primary keys in any table, e.g., table_name_id, and use the same convention for foreign keys. Employing this convention answers questions like: is a "CategorySerno" a "CatId"?