Remove duplicates when joining tables

I have a News table as follows:
| id | title | description |
| 1 | Breaking news | bla bla bla |
| 2 | Heavy snowfall in London | bla bla bla |
a Type table as follows:
| id | type_name | type_code |
| 1 | weather | 0567 |
| 2 | city | 0653 |
and a NewsType table as follows:
| id | news_id | type_id | created_by |
| 1 | 2 | 1 | "John" |
| 2 | 2 | 2 | "Alex" |
As you can see from the NewsType table, a single news item can fall into two or more types.
I need to display news corresponding to types. A user might say give me all the news about cities and weather. To display this I am doing something like:
select distinct n.* , nt.created_at
from news n, newstype nt, type t where
n.id = nt.news_id and
t.id = nt.type_id
order by nt.created_at
limit 25
The problem is that this query returns the same news item twice (I think it's because of the inner join I am doing). What should I change in the query so that, if a news item is classified under two types and the user has requested both of those types, I get only a single news item instead of two?

Simple solution:
select * from news where id in (
select news_id
from NewsType
where type_id in (the types you want)
)
Most people would say that you should add a DISTINCT on news_id in the inner query. You can try that, but I'm fairly sure it will decrease performance.
Overall, if you think this solution doesn't perform well, you can move the inner query into a CTE, which may behave better on some database engines:
with my_CTE as (
select news_id
from NewsType
where type_id in (the types you want)
)
select *
from news
where id in (select news_id from my_CTE)
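To see why filtering with IN avoids the duplicates that a plain join produces, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the trimmed-down column set, and the `wanted_types` selection are assumptions made for illustration; they are not part of the original schema or query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT, description TEXT);
    CREATE TABLE newstype (id INTEGER PRIMARY KEY, news_id INTEGER, type_id INTEGER);
    INSERT INTO news VALUES (1, 'Breaking news', 'bla bla bla'),
                            (2, 'Heavy snowfall in London', 'bla bla bla');
    INSERT INTO newstype VALUES (1, 2, 1), (2, 2, 2);
""")

# Hypothetical user selection: type 1 (weather) and type 2 (city).
wanted_types = (1, 2)
placeholders = ",".join("?" for _ in wanted_types)

# A plain join returns one row per matching newstype row,
# so news item 2 shows up once for each of its two types.
joined = conn.execute(
    "SELECT n.id FROM news n JOIN newstype nt ON n.id = nt.news_id "
    f"WHERE nt.type_id IN ({placeholders})",
    wanted_types,
).fetchall()

# Filtering news with IN only asks whether a match exists,
# so each news row is returned at most once.
deduped = conn.execute(
    "SELECT id, title FROM news WHERE id IN "
    f"(SELECT news_id FROM newstype WHERE type_id IN ({placeholders}))",
    wanted_types,
).fetchall()

print(len(joined), "rows from the join")
print(len(deduped), "row from the IN filter")
```

The inner query does not need DISTINCT for correctness here, because IN never multiplies the outer rows.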

A GROUP BY is another approach to this:
select n.id, n.title, n.description, max(nt.created_at) as latest_created_at
from news n, newstype nt, type t where
n.id = nt.news_id and
t.id = nt.type_id
group by n.id, n.title, n.description
order by latest_created_at
limit 25

Try
select distinct n.id, n.title, n.description
but, as @Jan Dvorak stated,
select distinct n.*
shouldn't select the same news twice

You want to select all of the stories that have an entry in the NewsType table for a particular type. Therefore you want to select the news items where a relationship to the type exists:
SELECT
News.ID,
News.Title,
News.Description
FROM
News
WHERE
EXISTS
(SELECT
NULL
FROM
NewsType
INNER JOIN Type ON NewsType.Type_ID = Type.ID
WHERE
News.ID = NewsType.News_ID
AND Type.Type_Code = @typeCode)
The last line of the WHERE clause may need to be changed to Type.Type_Name = @typeName if you are using the type name as the parameter.

You need to decide what to do with the "duplicate" types: Do you want to display just one type for a news item associated with multiple types, or do you want to list them all?
If the latter, you could investigate using the string_agg function; see http://www.postgresql.org/docs/9.2/static/functions-aggregate.html
select n.id, n.title, n.description, string_agg(t.type_name, ',')
from news n, newstype nt, type t where
n.id = nt.news_id and
t.id = nt.type_id
group by n.id, n.title, n.description
limit 25

Related

Selecting with multiple conditions from one dataset

I have a table that has 5 columns.
First, you can see that I'm German. Second, you can see that much of the data differs only in the category and value.
I now want to find all the datasets that have category 1 and value 1.
It should give me this table:
I now want to find all the entries in the initial TableA that match on Name, Date and City (only if all three match for a given dataset), where the category is now 2 instead of 1 and the Value is 0.
So for TableA the result should be:
I hope I didn't make any mistakes in the example and that it is clear what I am trying to do.
I know that for the WHERE clause there is an IN operator that basically checks whether a value is inside a list of values. But I don't know how to use it to check three values at once, because if I just do three separate list checks, it would also return every entry that is a combination of my three lists, regardless of which row each value actually comes from.
So I need to check that Name[0], City[0] and Date[0] are found together, and avoid matching a combination like Name[0], City[4] and Date[12] (the number in brackets is the row number).
The code I would have thought of:
Select*
FROM tablea
WHERE
(SELECT name, date, city
FROM tablea
WHERE tablea.Category=1 AND tablea.Value=0) as tableafiltered
WHERE tablea in tableafiltered
That's what I thought might work, but I'm pretty sure it wouldn't, because I'm trying to match three columns, and the IN in the WHERE clause is only valid for one column, right?

The first dataset that you describe can be a subquery and you can join it to the table:
select t.*
from tablea t inner join (
select distinct name, date, city
from tablea
where category = 1 and value = 1
) d on d.name = t.name and d.date = t.date and d.city = t.city
where t.category = 2 and t.value = 0
Another way of doing it is with EXISTS:
select t.*
from tablea t
where t.category = 2 and t.value = 0
and exists (
select 1
from tablea
where name = t.name and date = t.date and city = t.city and category = 1 and value = 1
)
Results:
> name | date | city | category | value
> :----- | :--------- | :------ | -------: | ----:
> Albert | 01.01.2000 | Berlin | 2 | 0
> Albert | 01.01.2000 | Hamburg | 2 | 0

One way to do this would be to create two selects, one for the category = 1, value = 1 combination and one for the category = 2, value = 0 combination. Then you can inner join the two result sets and ensure that the remaining columns match, with conditions such as table1.column1 = table2.column1 and table1.column2 = table2.column2. You can choose the columns any way you like, which is why I give this generic form.
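The two-selects-then-join idea can be sketched as follows, using Python's built-in sqlite3 module. The sample rows are invented for illustration (the original table contents are not shown in the question), and the column names follow the ones used in the other answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablea (name TEXT, date TEXT, city TEXT, category INTEGER, value INTEGER);
    -- Hypothetical sample data, not the asker's real table.
    INSERT INTO tablea VALUES
        ('Albert', '01.01.2000', 'Berlin',  1, 1),
        ('Albert', '01.01.2000', 'Berlin',  2, 0),
        ('Albert', '01.01.2000', 'Hamburg', 1, 1),
        ('Albert', '01.01.2000', 'Hamburg', 2, 0),
        ('Berta',  '02.01.2000', 'Berlin',  1, 0),
        ('Berta',  '02.01.2000', 'Berlin',  2, 0),
        ('Carl',   '03.01.2000', 'Munich',  1, 1),
        ('Carl',   '03.01.2000', 'Munich',  2, 1);
""")

rows = conn.execute("""
    SELECT b.name, b.date, b.city, b.category, b.value
    FROM (SELECT DISTINCT name, date, city
          FROM tablea
          WHERE category = 1 AND value = 1) a          -- first select
    JOIN (SELECT *
          FROM tablea
          WHERE category = 2 AND value = 0) b          -- second select
      ON a.name = b.name AND a.date = b.date AND a.city = b.city
    ORDER BY b.city
""").fetchall()

for row in rows:
    print(row)
```

Because all three columns are compared in the same join condition, a row only matches when name, date and city come from the same source row, which is the guarantee that three independent IN lists cannot give.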

Getting results from two different tables using a join

Let's say I have the following tables:
+-------------------------------------------+
| t_classroom |
+-------------------------------------------+
| PK | id |
| | admin_user_id |
| | name |
| | students |
+-------------------------------------------+
+-------------------------------------------+
| t_shared |
+-------------------------------------------+
| | admin_user_id |
| | classroom_id |
| | expiry |
+-------------------------------------------+
I want to write a query that will pull all classrooms that an admin_user_id has access to. In essence, I want a union of classroom rows when I search by admin_user_id in the t_classroom table as well as classroom rows when I search by admin_user_id in the t_shared table. I made the following attempt:
SELECT
id,
admin_user_id,
name,
students
FROM
t_classroom
WHERE
admin_user_id = 1
UNION ALL
SELECT
c.id,
c.admin_user_id,
c.name,
students
FROM
t_classroom c
INNER JOIN t_shared s
ON c.id = s.classroom_id
WHERE
admin_user_id = 1
Does the above look correct? Is there anything more efficient/cleaner?

Depending on how much data you have you could probably get away with just using an IN clause to look at the other table.
SELECT
c.id,
c.admin_user_id,
c.name,
c.students
FROM
t_classroom c
WHERE
c.admin_user_id = 1
OR c.id IN ( select s.classroom_id from t_shared s where s.admin_user_id = 1 )
Your union won't work because you're joining to the t_shared table but checking only the classroom's admin user.
If you join to the shared table you would also end up with duplicates and would need to apply DISTINCT to the result too.
Edit:
Because of the large number of rows it might be better to use an exists check on the 2nd table.
SELECT
c.id,
c.admin_user_id,
c.name,
c.students
FROM
t_classroom c
WHERE
c.admin_user_id = 1
OR EXISTS ( select 1 from t_shared s where s.classroom_id = c.id AND s.admin_user_id = 1 )

Your solution is fundamentally fine. The only two problems I can detect when eyeballing your query are:
You need to write s.admin_user_id instead of admin_user_id in the last line to avoid an error message, because there is a column of that name in both tables. Best practice is to always qualify column names with the table names.
You might want to use UNION instead of UNION ALL if you want to avoid a duplicate result row in the case that both tables have admin_user_id = 1 for the same classroom.
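Both points can be checked with a small sketch using Python's built-in sqlite3 module. The sample rows, the column types, and the `QUERY` template are assumptions made for illustration; classroom 1 is deliberately both owned by and shared with user 1 so that the duplicate shows up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_classroom (id INTEGER PRIMARY KEY, admin_user_id INTEGER,
                              name TEXT, students INTEGER);
    CREATE TABLE t_shared (admin_user_id INTEGER, classroom_id INTEGER, expiry TEXT);
    -- Hypothetical data: user 1 owns classroom 1; classrooms 1 and 2 are shared with user 1.
    INSERT INTO t_classroom VALUES (1, 1, 'Maths', 20), (2, 2, 'Physics', 15),
                                   (3, 2, 'History', 30);
    INSERT INTO t_shared VALUES (1, 1, '2030-01-01'), (1, 2, '2030-01-01');
""")

# The second branch filters on s.admin_user_id, qualified with the table alias.
QUERY = """
    SELECT id, admin_user_id, name, students
    FROM t_classroom
    WHERE admin_user_id = :uid
    {op}
    SELECT c.id, c.admin_user_id, c.name, c.students
    FROM t_classroom c
    INNER JOIN t_shared s ON c.id = s.classroom_id
    WHERE s.admin_user_id = :uid
    ORDER BY 1
"""

with_union_all = conn.execute(QUERY.format(op="UNION ALL"), {"uid": 1}).fetchall()
with_union = conn.execute(QUERY.format(op="UNION"), {"uid": 1}).fetchall()

print([row[0] for row in with_union_all])  # classroom ids, duplicates kept
print([row[0] for row in with_union])      # classroom ids, duplicates removed
```

UNION has to compare rows to remove duplicates, so UNION ALL remains the cheaper choice when the two branches are known not to overlap.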

Select distinct of rows and show count of each value

I'm trying to select a distinct selection of [AssetManager].[AssetType] with a count of how many times the Id of an Asset Type is being referenced from table [AssetManager].[Asset]. Please see below for an example:
+-----------+-------------+
| Type Name | Asset Count |
+-----------+-------------+
| Phone | 5 |
| Desktop | 12 |
| Laptop | 22 |
+-----------+-------------+
However, the query I'm trying isn't working at all, the furthest I've got is selecting Asset titles with an inner join of their Type name (I'm not great at SQL...). Please see below for my current Query:
SELECT
[Asset].[Title] AssetTitle,
[AssetType].[Title] TypeTitle
FROM
[AssetManager].[Asset]
INNER JOIN
[AssetManager].[AssetType]
ON
[Asset].[AssetType_Id] = [AssetType].[Id]

As the comments said, all you needed to do is add a GROUP BY correctly:
SELECT
[AssetType].[Title] TypeTitle
, COUNT(*) [Asset Count]
FROM [AssetManager].[Asset]
INNER JOIN [AssetManager].[AssetType]
ON [Asset].[AssetType_Id] = [AssetType].[Id]
GROUP BY [AssetType].[Title]

Use OUTER APPLY to get the reference count from the [AssetManager].[Asset] table:
SELECT
DISTINCT [AssetType].[Title] TypeTitle,
M.TypeCount
FROM
[AssetManager].[AssetType]
OUTER APPLY(
SELECT COUNT([Asset].[AssetType_Id]) AS TypeCount
FROM [AssetManager].[Asset]
WHERE [Asset].[AssetType_Id] = [AssetType].[Id]
)M
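One practical difference between the two answers is how they treat a type that no asset references: the INNER JOIN with GROUP BY leaves it out, while a query that starts from the type table reports it with a count of 0. The sketch below uses Python's built-in sqlite3 module, which has no OUTER APPLY, so a LEFT JOIN stands in for it as a portable equivalent. The schema prefix is dropped and the sample rows (including the unused 'Tablet' type) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AssetType (Id INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Asset (Id INTEGER PRIMARY KEY, Title TEXT, AssetType_Id INTEGER);
    -- Hypothetical data: no asset references the 'Tablet' type.
    INSERT INTO AssetType VALUES (1, 'Phone'), (2, 'Desktop'), (3, 'Tablet');
    INSERT INTO Asset VALUES (1, 'Phone A', 1), (2, 'Phone B', 1), (3, 'Desktop A', 2);
""")

# INNER JOIN + GROUP BY: only types that have at least one asset appear.
inner_counts = conn.execute("""
    SELECT t.Title, COUNT(*)
    FROM Asset a
    INNER JOIN AssetType t ON a.AssetType_Id = t.Id
    GROUP BY t.Title
    ORDER BY t.Title
""").fetchall()

# LEFT JOIN from the type table: COUNT(a.Id) skips the NULLs produced
# for unmatched types, so those types are reported with a count of 0.
left_counts = conn.execute("""
    SELECT t.Title, COUNT(a.Id)
    FROM AssetType t
    LEFT JOIN Asset a ON a.AssetType_Id = t.Id
    GROUP BY t.Title
    ORDER BY t.Title
""").fetchall()

print(inner_counts)
print(left_counts)
```

Note that COUNT(*) in the LEFT JOIN version would report 1 for an unused type, because the unmatched row itself is counted.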

Query to count the frequency of many-to-many associations

I have two tables with a many-to-many association in PostgreSQL. The first table contains activities, each of which may have zero or more reasons:
CREATE TABLE activity (
id integer NOT NULL,
-- other fields removed for readability
);
CREATE TABLE reason (
id varchar(1) NOT NULL,
-- other fields here
);
For performing the association, a join table exists between those two tables:
CREATE TABLE activity_reason (
activity_id integer NOT NULL, -- refers to activity.id
reason_id varchar(1) NOT NULL, -- refers to reason.id
CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);
I would like to count how often each combination of reasons occurs across activities. Suppose I have these records in the activity_reason table:
+--------------+------------+
| activity_id | reason_id |
+--------------+------------+
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 3 | A |
| 4 | C |
| 4 | D |
| 4 | E |
+--------------+------------+
I should have something like:
+-------+---+------+-------+
| count | | | |
+-------+---+------+-------+
| 2 | A | B | NULL |
| 1 | A | NULL | NULL |
| 1 | C | D | E |
+-------+---+------+-------+
Or, alternatively, something like:
+-------+-------+
| count | |
+-------+-------+
| 2 | A,B |
| 1 | A |
| 1 | C,D,E |
+-------+-------+
I can't find the SQL query to do this.

I think you can get what you want using this query:
SELECT count(*) as count, reasons
FROM (
SELECT activity_id, array_agg(reason_id) AS reasons
FROM (
SELECT A.id AS activity_id, AR.reason_id
FROM activity A
LEFT JOIN activity_reason AR ON AR.activity_id = A.id
ORDER BY A.id, AR.reason_id
) AS ordered_reasons
GROUP BY activity_id
) reason_arrays
GROUP BY reasons
First you aggregate all the reasons for an activity into an array for each activity. You have to order the associations first, otherwise ['a','b'] and ['b','a'] will be considered different sets and will have individual counts. You also need to include the join or any activity that doesn't have any reasons won't show up in the result set. I'm not sure if that is desirable or not, I can take it back out if you want activities that don't have a reason to not be included. Then you count the number of activities that have the same sets of reasons.
As mentioned by Gordon Linoff you could also use a string instead of an array. I'm not sure which would be better for performance.

We need to compare sorted lists of reasons to identify equal sets.
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id) AS reason_list
FROM (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
ORDER BY reason_id in the innermost subquery would work, too, but adding activity_id is typically faster.
And we don't strictly need the innermost subquery at all. This works as well:
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
FROM activity_reason
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
But it's typically slower for processing all or most of the table. Quoting the manual:
"Alternatively, supplying the input values from a sorted subquery will usually work."
We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with data type "char", btw). It can fail for longer strings, though. The aggregated value can be ambiguous.
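The ambiguity with longer strings is easy to reproduce outside the database. The following Python sketch mimics the two aggregations: joining with a comma plays the role of string_agg(), and a sorted tuple plays the role of a sorted array. The rows and the multi-character reason ids are hypothetical, chosen so that one id contains the separator.

```python
from collections import Counter

# Hypothetical (activity_id, reason_id) rows with longer reason ids.
rows = [(1, "a,b"), (1, "c"), (2, "a"), (2, "b,c")]

# Collect the reasons of each activity.
by_activity = {}
for activity_id, reason_id in rows:
    by_activity.setdefault(activity_id, []).append(reason_id)

# string_agg-like: both activities collapse to the same text "a,b,c".
as_strings = Counter(",".join(sorted(reasons)) for reasons in by_activity.values())

# array_agg-like: the element boundaries survive, so the sets stay distinct.
as_arrays = Counter(tuple(sorted(reasons)) for reasons in by_activity.values())

print(as_strings)
print(as_arrays)
```

With single-character ids such as those in the question, the two approaches give the same grouping; the difference only appears when an id can contain the separator.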
If reason_id were an integer (as it typically is), there is another, faster solution with sort() from the additional module intarray:
SELECT count(*) AS ct, reason_list
FROM (
SELECT sort(array_agg(reason_id)) AS reason_list
FROM activity_reason2
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
Related, with more explanation:
Compare arrays for equality, ignoring order of elements
Storing and comparing unique combinations

You can do this using string_agg():
select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
from activity_reason
group by activity_id
) a
group by reasons
order by count(*) desc;

Recursive SQL Server query

In a table reviewers with a structure like this:
reviewer | reviewee
===================
2 | 1
3 | 2
4 | 3
5 | 4
In a function call, I know both a reviewer-id and a reviewee-id (the owner of the item the reviewee is looking to retrieve).
I'm now trying to send a query that iterates all the entries in the reviewers table, starting with the reviewer, and ends at the reviewee's id (and matches that to the reviewee id I know). So I'm trying to find out if there is a connection between reviewee and reviewer at all.
Is it possible to do this in a single query?

You can do this:
WITH CTE
AS
(
SELECT reviewer, reviewee
FROM TableName
WHERE reviewee = @revieweeID
UNION ALL
SELECT p.reviewer, p.reviewee
FROM CTE c
INNER JOIN TableName p ON c.reviewee = p.reviewer
)
SELECT *
FROM CTE;
-- WHERE reviewer = @reviewerID;
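As a runnable illustration of the recursive-CTE idea, here is a minimal sketch using Python's built-in sqlite3 module. It is not the answer's exact query: it starts from the known reviewer and walks down the chain, then checks whether the known reviewee was reached. The `is_connected` helper and the `chain` CTE name are made up for this example, and the syntax is SQLite's (SQL Server omits the RECURSIVE keyword and requires UNION ALL between the anchor and the recursive member).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviewers (reviewer INTEGER, reviewee INTEGER);
    INSERT INTO reviewers VALUES (2, 1), (3, 2), (4, 3), (5, 4);
""")

def is_connected(reviewer_id, reviewee_id):
    """Return True if a chain of review links leads from reviewer_id to reviewee_id."""
    row = conn.execute("""
        WITH RECURSIVE chain(reviewer, reviewee) AS (
            -- Anchor: the rows where the known reviewer reviews someone.
            SELECT reviewer, reviewee FROM reviewers WHERE reviewer = :reviewer
            UNION  -- UNION (not UNION ALL) drops repeated rows, which stops cycles
            -- Step: treat each reviewee found so far as the next reviewer.
            SELECT r.reviewer, r.reviewee
            FROM chain c
            JOIN reviewers r ON r.reviewer = c.reviewee
        )
        SELECT EXISTS (SELECT 1 FROM chain WHERE reviewee = :reviewee)
    """, {"reviewer": reviewer_id, "reviewee": reviewee_id}).fetchone()
    return bool(row[0])

print(is_connected(5, 1))  # 5 -> 4 -> 3 -> 2 -> 1
print(is_connected(2, 5))  # 2 only reaches 1
```

Returning a single EXISTS value keeps the answer to "is there a connection at all?" in one query, without fetching the whole chain.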
