Fetching Two Joined Tables in SQL With Group By and Multiple Columns

I've read through every relevant question in here but couldn't figure out or modify the accepted answers to accomplish what I want.
I have two tables:
News: Id, Title, CategoryId
NewsCategory: Id, Title
I want to list all NewsCategories and include up to 10 rows of News belonging to each category in the same SQL query.
I have this query working at the moment:
Select C.Id As CategoryId, C.Title As CategoryTitle, N.Id, N.Title
From NewsCategories C, News N
Where N.CategoryId In (C.Id)
Order By C.Id Desc
But I couldn't figure out how to limit the number of News rows returned per category without also limiting the NewsCategories.

Just use OUTER APPLY:
SELECT C.Id AS CategoryId, C.Title AS CategoryTitle, N.Id, N.Title
FROM NewsCategories C
OUTER APPLY (
    SELECT TOP 10 *
    FROM News
    WHERE CategoryId = C.Id
) AS N
ORDER BY C.Id DESC
Depending on your requirements and the News table schema, you can add an ORDER BY to the OUTER APPLY sub-query to get, e.g., the 10 latest news items for each category or the first 10 in alphabetical order. Without an ORDER BY, TOP 10 returns an arbitrary 10 rows per category.
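For engines without OUTER APPLY or TOP, the same "all categories, at most N news rows each" result can be approximated with ROW_NUMBER() and a LEFT JOIN. The following is only a sketch run against SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later); the sample rows, the limit of 2 instead of 10, and the "newest Id first" ordering are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE NewsCategories (Id INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE News (Id INTEGER PRIMARY KEY, Title TEXT, CategoryId INTEGER);
    -- made-up sample data; category 3 has no news at all
    INSERT INTO NewsCategories VALUES (1, 'Sports'), (2, 'Tech'), (3, 'Empty');
    INSERT INTO News VALUES (1, 'a', 1), (2, 'b', 1), (3, 'c', 1), (4, 'd', 2);
""")

PER_CATEGORY = 2  # the question asks for 10; 2 keeps the sample small

rows = conn.execute("""
    SELECT C.Id AS CategoryId, C.Title AS CategoryTitle, N.Id, N.Title
    FROM NewsCategories C
    LEFT JOIN (
        -- number the news rows inside each category, highest Id first
        SELECT Id, Title, CategoryId,
               ROW_NUMBER() OVER (PARTITION BY CategoryId ORDER BY Id DESC) AS rn
        FROM News
    ) AS N ON N.CategoryId = C.Id AND N.rn <= ?
    ORDER BY C.Id DESC, N.Id DESC
""", (PER_CATEGORY,)).fetchall()

for row in rows:
    print(row)
```

Because the row limit sits in the join condition rather than in a WHERE clause, a category without any news still appears once, with NULLs in the News columns.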

Related

Why does the optimizer decide to self-join a table?

I'm analyzing my query that looks like this:
WITH Project_UnfinishedCount AS (
SELECT P.Title AS title, COUNT(T.ID) AS cnt
FROM PROJECT P LEFT JOIN TASK T on P.ID = T.ID_PROJECT AND T.ACTUAL_FINISH_DATE IS NULL
GROUP BY P.ID, P.TITLE
)
SELECT Title
FROM Project_UnfinishedCount
WHERE cnt = (
SELECT MAX(cnt)
FROM Project_UnfinishedCount
);
It returns the title of the project that has the largest number of unfinished tasks.
Here is its execution plan:
I wonder why it has steps 6-8, which look like a self-join of the project table. It then stores the result of the join as a view, but according to the rows and bytes columns the view is the same as the project table. Why does the optimizer do this?
I'd also like to know what steps 2 and 1 stand for. My guess is that step 2 stores the result of my CTE for use in steps 10-14, and step 1 removes the rows from the view whose 'cnt' value differs from the one returned by the subquery. Is this guess correct?

In addition to the comments above, when you reference a CTE more than once, there is a heuristic that tells the optimizer to materialize the CTE, which is why you see the temp table transformation.
A few other comments on this query. I'm assuming that a PROJECT can have zero or more tasks, and that each TASK belongs to exactly one PROJECT. In that case, I wonder why you need an outer join. You also put the ACTUAL_FINISH_DATE condition in the join clause, so a project whose tasks are all complete still produces one row, with the TASK columns set to NULL. COUNT(T.ID) ignores that NULL and reports 0 for such a project, but COUNT(*) would report 1, which would look like one unfinished task. If projects without unfinished tasks are of no interest, an inner join states the intent more directly, so I think your CTE should look more like:
SELECT P.Title AS title, COUNT(T.ID) AS cnt
FROM PROJECT P
JOIN TASK T on P.ID = T.ID_PROJECT
WHERE T.ACTUAL_FINISH_DATE IS NULL
GROUP BY P.ID, P.TITLE
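To check how the outer-join and inner-join versions treat a project whose tasks are all finished, here is a small sketch using SQLite through Python's sqlite3 module. The table and column names follow the question; the two projects and three tasks are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PROJECT (ID INTEGER PRIMARY KEY, TITLE TEXT);
    CREATE TABLE TASK (ID INTEGER PRIMARY KEY, ID_PROJECT INTEGER,
                       ACTUAL_FINISH_DATE TEXT);
    INSERT INTO PROJECT VALUES (1, 'Open'), (2, 'Done');
    INSERT INTO TASK VALUES
        (10, 1, NULL), (11, 1, NULL),   -- two unfinished tasks
        (12, 2, '2020-01-01');          -- the only task is finished
""")

# Outer join with the date filter in the ON clause, as in the question.
left_join = conn.execute("""
    SELECT P.TITLE, COUNT(T.ID), COUNT(*)
    FROM PROJECT P
    LEFT JOIN TASK T ON P.ID = T.ID_PROJECT AND T.ACTUAL_FINISH_DATE IS NULL
    GROUP BY P.ID, P.TITLE
    ORDER BY P.ID
""").fetchall()

# Inner join with the date filter in WHERE, as suggested in the answer.
inner_join = conn.execute("""
    SELECT P.TITLE, COUNT(T.ID)
    FROM PROJECT P
    JOIN TASK T ON P.ID = T.ID_PROJECT
    WHERE T.ACTUAL_FINISH_DATE IS NULL
    GROUP BY P.ID, P.TITLE
    ORDER BY P.ID
""").fetchall()

print(left_join)   # 'Done' is kept: COUNT(T.ID) is 0, COUNT(*) is 1
print(inner_join)  # 'Done' is dropped entirely
```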
With all that said, these "find the row that matches an aggregate (count, max, etc.) within a group" queries are often more efficient when written with a window function. That way you can eliminate the self-join. This can make a big performance difference when you have millions or billions of rows. For example, your query could be rewritten as:
SELECT TITLE, CNT
FROM (
    SELECT P.Title AS title, COUNT(T.ID) AS cnt
         , RANK() OVER( ORDER BY COUNT(*) DESC ) RNK
    FROM PROJECT P
    JOIN TASK T on P.ID = T.ID_PROJECT
    WHERE T.ACTUAL_FINISH_DATE IS NULL
    GROUP BY P.ID, P.TITLE
)
WHERE RNK=1
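Here is a rough, runnable sketch of the ranking idea, using SQLite through Python's sqlite3 module rather than the original database (window functions need SQLite 3.25 or later). To keep it portable, the counts are computed in one CTE and ranked in a second step instead of ranking inside the grouped query. The sample rows are invented, and two projects are deliberately tied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PROJECT (ID INTEGER PRIMARY KEY, TITLE TEXT);
    CREATE TABLE TASK (ID INTEGER PRIMARY KEY, ID_PROJECT INTEGER,
                       ACTUAL_FINISH_DATE TEXT);
    INSERT INTO PROJECT VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO TASK VALUES
        (1, 1, NULL), (2, 1, NULL),
        (3, 2, NULL), (4, 2, NULL), (5, 2, '2021-05-01'),
        (6, 3, NULL);
""")

rows = conn.execute("""
    WITH counts AS (
        -- unfinished tasks per project
        SELECT P.TITLE AS title, COUNT(T.ID) AS cnt
        FROM PROJECT P
        JOIN TASK T ON P.ID = T.ID_PROJECT
        WHERE T.ACTUAL_FINISH_DATE IS NULL
        GROUP BY P.ID, P.TITLE
    ),
    ranked AS (
        -- RANK gives every tied project the same rank
        SELECT title, cnt, RANK() OVER (ORDER BY cnt DESC) AS rnk
        FROM counts
    )
    SELECT title, cnt FROM ranked WHERE rnk = 1 ORDER BY title
""").fetchall()

print(rows)
```

RANK() rather than ROW_NUMBER() is what keeps both tied projects in the result, matching the behaviour of the original `cnt = MAX(cnt)` filter.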

SQL for a query with several input IDs, how to get the first 5 results for each ID

I have a query that accepts several IDs as filters in a WHERE clause.
It's formatted something like this:
SELECT a.ID, a.VOLUMETRY, b.ANNOY_DISTANCE
FROM PRODUCT a
JOIN RECOMMENDATIONS b on a.ID = b.ID
WHERE a.ID in ('0001','0002', ...., '0099')
ORDER BY b.ANNOY_DISTANCE
Now this query can return several thousand results for each ID, but I only need the first 5 for each ID after ordering them by the ANNOY_DISTANCE column. The rest aren't needed and would only slow post-processing of the data.
How can I change this so that the query result only gives the first 5 rows for each ID?

Use window functions, which you can filter using a QUALIFY clause:
SELECT p.ID, p.VOLUMETRY, r.ANNOY_DISTANCE
FROM PRODUCT p
JOIN RECOMMENDATIONS r ON p.ID = r.ID
WHERE p.ID IN ('0001','0002', ...., '0099')
QUALIFY ROW_NUMBER() OVER (PARTITION BY p.ID ORDER BY r.ANNOY_DISTANCE) <= 5
ORDER BY r.ANNOY_DISTANCE;
Notice that I changed your table aliases to be meaningful abbreviations for the table names. That is a best practice.
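QUALIFY is not available in every database. Where it is missing, the usual workaround is to compute ROW_NUMBER() in a derived table and filter on it in the outer query. The sketch below shows that variant with SQLite through Python's sqlite3 module (SQLite 3.25 or later); the sample rows, the two IDs, and the limit of 2 instead of 5 are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PRODUCT (ID TEXT PRIMARY KEY, VOLUMETRY REAL);
    CREATE TABLE RECOMMENDATIONS (ID TEXT, ANNOY_DISTANCE REAL);
    INSERT INTO PRODUCT VALUES ('0001', 1.5), ('0002', 2.5);
    INSERT INTO RECOMMENDATIONS VALUES
        ('0001', 0.9), ('0001', 0.1), ('0001', 0.5),
        ('0002', 0.3), ('0002', 0.7), ('0002', 0.2);
""")

TOP_N = 2                        # the question wants 5
wanted_ids = ["0001", "0002"]    # hypothetical input IDs
placeholders = ", ".join("?" for _ in wanted_ids)

rows = conn.execute(f"""
    SELECT ID, VOLUMETRY, ANNOY_DISTANCE
    FROM (
        SELECT p.ID, p.VOLUMETRY, r.ANNOY_DISTANCE,
               ROW_NUMBER() OVER (PARTITION BY p.ID
                                  ORDER BY r.ANNOY_DISTANCE) AS rn
        FROM PRODUCT p
        JOIN RECOMMENDATIONS r ON p.ID = r.ID
        WHERE p.ID IN ({placeholders})
    ) AS ranked
    WHERE rn <= ?                -- plays the role of QUALIFY
    ORDER BY ID, ANNOY_DISTANCE
""", (*wanted_ids, TOP_N)).fetchall()

for row in rows:
    print(row)
```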

how to count two values from three dataset

I have 3 datasets: company, post, postedited.
I want to count each company's posts and post edits. Some companies have posts that were never edited.
here is my query :
SELECT company.name, company.id, count(*),
( select count(*)
from post, postedited
where post.id=postedited.post_id)
from company, post as p
where company.id=p.company_id
group by company_id
The post count is right, but the postedited column shows the same value on every row. What's wrong with my query?

Your subquery is completely unrelated to the main query. It selects post and postedited and counts. You are showing this result for every row of the main query.
You want the subquery to relate to the main query's post, so remove the post table from the subquery's FROM clause:
(select count(*) from postedited where postedited.post_id = p.id)
Now the subquery returns the edit count for each post in the main query. Finally, you must sum these counts:
select
  c.name, c.id, count(*) as posts,
  sum((select count(*) from postedited pe where pe.post_id = p.id)) as edits
from company c
join post p on p.company_id = c.id
group by c.id;
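You can achieve the same result with an outer join; count(distinct p.id) is needed because the join repeats a post once for each of its edits: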
select
c.name, c.id, count(distinct p.id) as posts, count(pe.post_id) as edits
from company c
join post p on p.company_id = c.id
left join postedited pe on pe.post_id = p.id
group by c.id;
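To see why the posts column needs DISTINCT once postedited is joined in, here is a small sketch using SQLite through Python's sqlite3 module. The table names and the company_id / post_id columns come from the question; the id column on postedited and all sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, company_id INTEGER);
    CREATE TABLE postedited (id INTEGER PRIMARY KEY, post_id INTEGER);
    INSERT INTO company VALUES (1, 'Acme'), (2, 'Beta');
    INSERT INTO post VALUES (1, 1), (2, 1), (3, 2);
    -- post 1 was edited twice, post 2 never, post 3 once
    INSERT INTO postedited VALUES (1, 1), (2, 1), (3, 3);
""")

rows = conn.execute("""
    SELECT c.name,
           COUNT(*)             AS joined_rows,  -- inflated by the join
           COUNT(DISTINCT p.id) AS posts,
           COUNT(pe.post_id)    AS edits         -- NULLs are not counted
    FROM company c
    JOIN post p ON p.company_id = c.id
    LEFT JOIN postedited pe ON pe.post_id = p.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

for row in rows:
    print(row)
```

For 'Acme' the join yields three rows (post 1 twice, post 2 once), so a plain COUNT(*) would overstate the number of posts.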

SELECT c.name AS companyName
, c.id AS companyID
, COUNT(DISTINCT p.id) AS postCount
, COUNT(DISTINCT pe.post_id) AS postEditCount
FROM company c
LEFT OUTER JOIN post p ON p.Company_ID = c.ID
LEFT OUTER JOIN postEdited pe ON pe.post_id = p.id
GROUP BY c.id, c.name
That will give you a list of all companies in your company table with a count of each of their posts and edited posts. If you need to further query against that dataset, you can. Or you can add a WHERE clause to the above query to filter it.
And I agree, please don't use comma syntax. It's very easy to produce unintended results, such as an accidental cross join when a WHERE condition is missing, and it doesn't give a good representation of what you're actually querying against. Comma joins are still valid SQL, but explicit JOIN syntax has been part of the standard since SQL-92 and will make your life much easier.

Distinct on multi-columns in sql

I have this query in SQL:
select cartlines.id, cartlines.pageId, cartlines.quantity, cartlines.price
from orders
INNER JOIN cartlines on (cartlines.orderId = orders.id)
where userId = 5
I want the rows to be distinct by pageId, so that in the end no pageId appears in more than one row (no duplicates).
Any ideas?
Thanks
Baaroz

Going by what you're expecting in the output and your comment that says "...if there rows in output that contain same pageid only one will be shown...," it sounds like you're trying to get one record for each page ID. This can be achieved with ROW_NUMBER() and PARTITION BY. The ORDER BY inside OVER decides which row is kept for each pageId; ordering by c.id keeps the row with the lowest id, whereas ordering by the partition column itself would leave the choice arbitrary:
SELECT *
FROM (
    SELECT
        ROW_NUMBER() OVER(PARTITION BY c.pageId ORDER BY c.id) rowNumber,
        c.id,
        c.pageId,
        c.quantity,
        c.price
    FROM orders o
    INNER JOIN cartlines c ON c.orderId = o.id
    WHERE userId = 5
) a
WHERE a.rowNumber = 1
You can also use ROW_NUMBER() OVER(PARTITION BY ...) together with TOP 1 WITH TIES, which is much cleaner but runs a little slower:
SELECT TOP 1 WITH TIES c.id, c.pageId, c.quantity, c.price
FROM orders o
INNER JOIN cartlines c ON c.orderId = o.id
WHERE userId = 5
ORDER BY ROW_NUMBER() OVER(PARTITION BY c.pageId ORDER BY c.id)

If you wish to remove rows in which every column is duplicated, simply add DISTINCT to your query:
select distinct cartlines.id, cartlines.pageId, cartlines.quantity, cartlines.price
from orders
INNER JOIN cartlines on (cartlines.orderId = orders.id)
where userId = 5
If, however, this makes no difference, it means the other columns have different values, so each combination of column values is already a distinct (unique) row.
As Michael Berkowski stated in comments:
DISTINCT - does operate over all columns in the SELECT list, so we need to understand your special case better.
If simply adding DISTINCT does not cover your case, you also need to remove the columns that differ from row to row, or use aggregate functions to get one aggregated value per pageId.
Example - total quantity per distinct pageId:
select cartlines.pageId, sum(cartlines.quantity)
from orders
INNER JOIN cartlines on (cartlines.orderId = orders.id)
where userId = 5
group by cartlines.pageId
If this is still not what you wish, you need to give us data and specify better what it is you want.
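As a quick check of the "one row per pageId" aggregation, here is a minimal sketch using SQLite through Python's sqlite3 module. The column names follow the question, but placing userId on the orders table is an assumption, as are the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, userId INTEGER);
    CREATE TABLE cartlines (id INTEGER PRIMARY KEY, orderId INTEGER,
                            pageId INTEGER, quantity INTEGER, price REAL);
    INSERT INTO orders VALUES (1, 5), (2, 5), (3, 7);
    INSERT INTO cartlines VALUES
        (1, 1, 100, 2, 9.5), (2, 1, 100, 3, 9.5),
        (3, 2, 200, 1, 4.0), (4, 3, 100, 10, 9.5);
""")

rows = conn.execute("""
    SELECT c.pageId, SUM(c.quantity) AS totalQuantity
    FROM orders o
    INNER JOIN cartlines c ON c.orderId = o.id
    WHERE o.userId = ?          -- assumes userId lives on orders
    GROUP BY c.pageId           -- one output row per pageId
    ORDER BY c.pageId
""", (5,)).fetchall()

print(rows)
```

Every selected column is either in the GROUP BY or wrapped in an aggregate, which is what guarantees a single row per pageId.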

Quick SQL question! Sort by most occurrences of an attribute

I have two tables as such:
Categories:
ID - Name - Desc
Items
ID - Name - CategoryID - Desc - Price
I want a query that returns a list of categories ranked by the most occurrences in the Items table.

This should do the trick:
SELECT c.ID, c.Name, count(i.ID)
FROM Categories c
LEFT JOIN Items i on (c.ID=i.CategoryID)
GROUP BY c.ID, c.Name
ORDER BY count(i.ID) DESC

SELECT
CategoryID, count(*)
FROM
items
GROUP BY
CategoryID
ORDER BY
2 DESC
You can then join to categories to get their names.
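One possible way to do that join, sketched with SQLite through Python's sqlite3 module: the per-category counts become a derived table that is left-joined to Categories, so categories without items still show up with a count of 0. Only the columns needed for the example are created, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Categories (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Items (ID INTEGER PRIMARY KEY, Name TEXT, CategoryID INTEGER);
    INSERT INTO Categories VALUES (1, 'Books'), (2, 'Games'), (3, 'Toys');
    INSERT INTO Items VALUES
        (1, 'Novel', 1),
        (2, 'Chess', 2), (3, 'Go', 2), (4, 'Poker', 2);
""")

rows = conn.execute("""
    SELECT c.ID, c.Name, COALESCE(n.cnt, 0) AS ItemCount
    FROM Categories c
    LEFT JOIN (
        -- the aggregate query from the answer, with a named count column
        SELECT CategoryID, COUNT(*) AS cnt
        FROM Items
        GROUP BY CategoryID
    ) AS n ON n.CategoryID = c.ID
    ORDER BY ItemCount DESC, c.ID
""").fetchall()

for row in rows:
    print(row)
```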
