Avoid repeated information when having multiple joins? - sql

I have the following query that uses joins to join multiple tables
select DISTINCT
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblTypes.Article_Type_Name,
tblimages.image_path as "Extra images"
from tblArticles inner join tblWriters
on tblArticles.Writer_ID_Fkey = tblWriters.Writer_ID inner join
tblArticleType on tblArticles.Article_ID = tblArticleType.Article_ID_Fkey inner join
tblTypes on tblArticleType.Article_Type_ID_Fkey = tblTypes.Article_Type_ID left outer join tblExtraImages
on tblArticles.Article_ID = tblExtraImages.Article_ID_Fkey left outer join tblimages
on tblExtraImages.image_id_fkey = tblimages.image_id
order by tblArticles.Article_Sequence, tblArticles.Article_Date_Created;
And I get the following results:
If an article has more than one type_name then I will get repeated columns for the rest of the records. Is there another way of joining these tables that would prevent that from happening?

The simplest method is to just remove column Article_Type_Name from the select clause. This allows SELECT DISTINCT to identify the rows as duplicates, and eliminate them.
Another option is to use an aggregation function on the column. In recent SQL Server versions, STRING_AGG() comes handy (you can also use MIN() or MAX()):
select
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
string_agg(tblTypes.Article_Type_Name, ',')
within group(order by tblTypes.Article_Type_Name) Article_Type_Name_List,
tblimages.image_path as Extra_Images
from ..
group by
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblimages.image_path

What you're seeing here is a Cartesian product; you've joined Tables in such a way that multiple rows from one side match with rows from the other
If you don't care about the article_type, then group the other columns and take the max(article_type), or omit it in a subquery that selects distinct records, not including the article type column, from the table that contains article type). If your SQLS is recent enough and you want to know all the article types you could STRING_AGG them into a csv list
Ultimately what you choose to do depends on what you want them for; filter the rows out, or group them down

Related

Distinct Households Comparison Using Join

I am trying to compare two lists of unique Household IDs using the Distinct clause. The problem comes when I try to pull in a third column consisting of timestamps into the results.
When I include only the two Household ID columns in the Select statement, the results seem to make sense. I get back two lists of unique IDs.
Here is that query:
select distinct e.household_id, a.hhid
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
However, when I just add the "e.imp_ts" column to the Select statement, it looks like SQL completely disregards the Distinct part of the query and pulls in all the duplicate households in the files.
select distinct e.household_id, a.hhid, e.imp_ts
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
Can someone please explain why the query doesn't work when I simply add a third column to the Select statement?
Thank you!
It is not that the second query "doesn't work", but rather that it is being asked to provide different results than the first query. As others in the comments have pointed out, because the imp_ts column contains more granular data, the distinct can no longer return a unique list of household IDs. For example, household ID 12345 may contain 5 records, each with unique timestamps on them.
In order to resolve this, you have some choices:
Remove imp_ts from the query.
Return the minimum (most likely first) timestamp
Return the maximum (most likely last) timestamp
For #2 and #3 above, you can use MIN() or MAX() with a GROUP BY to achieve those results. Here is an example of using MIN():
select e.household_id, a.hhid, MIN(e.imp_ts) AS min_imp_ts
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
group by e.household_id, a.hhid
I would suggest looking up GROUP BY examples online to get a better idea of what is happening.

SQL subselect statement very slow on certain machines

I've got an sql statement where I get a list of all Ids from a table (Machines).
Then need the latest instance of another row in (Events) where the the id's match so have been doing a subselect.
I need to latest instance of quite a few fields that match the id so have these subselects after one another within this single statement so end up with results similar to this...
This works and the results are spot on, it's just becoming very slow as the Events Table has millions of records. The Machine table would have on average 100 records.
Is there a better solution that subselects? Maybe doing inner joins or a stored procedure?
Help appreciated :)
You can use apply. You don't specify how "latest instance" is defined. Let me assume it is based on the time column:
Select a.id, b.*
from TableA a outer apply
(select top(1) b.Name, b.time, b.weight
from b
where b.id = a.id
order by b.time desc
) b;
Both APPLY and the correlated subquery need an ORDER BY to do what you intend.
APPLY is a lot like a correlated query in the FROM clause -- with two convenient enhances. A lateral join -- technically what APPLY does -- can return multiple rows and multiple columns.

Semi-join vs Subqueries

What is the difference between semi-joins and a subquery? I am currently taking a course on this on DataCamp and i'm having a hard time making a distinction between the two.
Thanks in advance.
A join or a semi join is required whenever you want to combine two or more entities records based on some common conditional attributes.
Unlike, Subquery is required whenever you want to have a lookup or a reference on same table or other tables
In short, when your requirement is to get additional reference columns added to existing tables attributes then go for join else when you want to have a lookup on records from the same table or other tables but keeping the same existing columns as o/p go for subquery
Also, In case of semi join it can act/used as a subquery because most of the times we dont actually join the right table instead we mantain a check via subquery to limit records in the existing hence semijoin but just that it isnt a subquery by itself
I don't really think of a subquery and a semi-join as anything similar. A subquery is nothing more interesting than a query that is used inside another query:
select * -- this is often called the "outer" query
from (
select columnA -- this is the subquery inside the parentheses
from mytable
where columnB = 'Y'
)
A semi-join is a concept based on join. Of course, joining tables will combine both tables and return the combined rows based on the join criteria. From there you select the columns you want from either table based on further where criteria (and of course whatever else you want to do). The concept of a semi-join is when you want to return rows from the first table only, but you need the 2nd table to decide which rows to return. Example: you want to return the people in a class:
select p.FirstName, p.LastName, p.DOB
from people p
inner join classes c on c.pID = p.pID
where c.ClassName = 'SQL 101'
group by p.pID
This accomplishes the concept of a semi-join. We are only returning columns from the first table (people). The use of the group by is necessary for the concept of a semi-join because a true join can return duplicate rows from the first table (depending on the join criteria). The above example is not often referred to as a semi-join, and is not the most typical way to accomplish it. The following query is a more common method of accomplishing a semi-join:
select FirstName, LastName, DOB
from people
where pID in (select pID
from class
where ClassName = 'SQL 101'
)
There is no formal join here. But we're using the 2nd table to determine which rows from the first table to return. It's a lot like saying if we did join the 2nd table to the first table, what rows from the first table would match?
For performance, exists is typically preferred:
select FirstName, LastName, DOB
from people p
where exists (select pID
from class c
where c.pID = p.pID
and c.ClassName = 'SQL 101'
)
In my opinion, this is the most direct way to understand the semi-join. There is still no formal join, but you can see the idea of a join hinted at by the usage of directly matching the first table's pID column to the 2nd table's pID column.
Final note. The last 2 queries above each use a subquery to accomplish the concept of a semi-join.

How to select all attributes in sql Join query

The following sql query below produces the specified result.
select product.product_no,product_type,salesteam.rep_name,salesteam.SUPERVISOR_NAME
from product
inner join salesteam
on product.product_rep=salesteam.rep_id
ORDER BY product.Product_No;
However my intensions are to further produce a more detailed result which will include all the attributes in the PRODUCT table. my approach is to list all the attributes in the first line of the query.
select product.product_no,product.product_date,product.product_colour,product.product_style,
product.product_age product_type,salesteam.rep_name,salesteam.SUPERVISOR_NAME
from product
inner join salesteam
on product.product_rep=salesteam.rep_id
ORDER BY product.Product_No;
Is there another way it can be done instead of listing all the attributes of PRoduct table one by one?
You can use * to select all columns from all tables, or you can use [table/alias].* to select all columns from the specified table. In your case, you can use product.*:
select product.*,salesteam.rep_name,salesteam.SUPERVISOR_NAME
from product
inner join salesteam
on product.product_rep=salesteam.rep_id
ORDER BY product.Product_No;
It is important to note that you should only do this if you are 100% sure you need every single column, and always will. There are performance implications associated with this; if you're selecting 100 columns from a table when you really only need 4 or 5 of them, you're adding a lot of overhead to the query. The DBMS has to work harder, and you're also sending more data across the wire (if your database is not on the same machine as your executing code).
If any columns are later added to the product table, those columns will also be returned by this query in the future.
select
product.*,
salesteam.rep_name,
salesteam.SUPERVISOR_NAME
from product inner join salesteam on
product.product_rep=salesteam.rep_id
ORDER BY
product.Product_No;
This should do.
You can write like this
select P.* --- all Product columns
,S.* --- all salesteam columns
from product P
inner join salesteam S
on P.product_rep=S.rep_id
ORDER BY P.Product_No;

Order by in Inner Join

I am putting inner join in my query.I have got the result but didn't know that how the data is coming in output.Can anyone tell me that how the Inner join matching the data.Below I am showing a image.There are two table(One or Two Table).
According to me that first row it should be Mohit but output is different. Please tell me.
In SQL, the order of the output is not defined unless you specify it in the ORDER BY clause.
Try this:
SELECT *
FROM one
JOIN two
ON one.one_name = two.one_name
ORDER BY
one.id
You have to sort it if you want the data to come back a certain way. When you say you are expecting "Mohit" to be the first row, I am assuming you say that because "Mohit" is the first row in the [One] table. However, when SQL Server joins tables, it doesn't necessarily join in the order you think.
If you want the first row from [One] to be returned, then try sorting by [One].[ID]. Alternatively, you can order by any other column.
Avoid SELECT * in your main query.
Avoid duplicate columns: the JOIN condition ensures One.One_Name and two.One_Name will be equal therefore you don't need to return both in the SELECT clause.
Avoid duplicate column names: rename One.ID and Two.ID using 'aliases'.
Add an ORDER BY clause using the column names ('alises' where applicable) from the SELECT clause.
Suggested re-write:
SELECT T1.ID AS One_ID, T1.One_Name,
T2.ID AS Two_ID, T2.Two_name
FROM One AS T1
INNER JOIN two AS T2
ON T1.One_Name = T2.One_Name
ORDER
BY One_ID;
Add an ORDER BY ONE.ID ASC at the end of your first query.
By default there is no ordering.
SQL doesn't return any ordering by default because it's faster this way. It doesn't have to go through your data first and then decide what to do.
You need to add an order by clause, and probably order by which ever ID you expect. (There's a duplicate of names, thus I'd assume you want One.ID)
select * From one
inner join two
ON one.one_name = two.one_name
ORDER BY one.ID
I found this to be an issue when joining but you might find this blog useful in understanding how Joins work in the back.
How Joins Work..
[Edited]
#Shree Thank you for pointing that out.
On the paragraph of Merge Join. It mentions on how joins work...
Like
hash join, merge join consists of two steps. First, both tables of the
join are sorted on the join attribute. This can be done with just two
passes through each table via an external merge sort. Finally, the
result tuples are generated as the next ordered element is pulled from
each table and the join attributes are compared
.