Problem With DISTINCT! - sql

Here is my query:
SELECT
DISTINCT `c`.`user_id`,
`c`.`created_at`,
`c`.`body`,
(SELECT COUNT(*) FROM profiles_comments c2 WHERE c2.user_id = c.user_id AND c2.profile_id = 1) AS `comments_count`,
`u`.`username`,
`u`.`avatar_path`
FROM `profiles_comments` AS `c` INNER JOIN `users` AS `u` ON u.id = c.user_id
WHERE (c.profile_id = 1) ORDER BY `u`.`id` DESC;
It works. The problem though is with the DISTINCT word. As I understand it, it should select only one row per c.user_id.
But what I get is even 4-5 rows with the same c.user_id column. Where is the problem?

actually, DISTINCT does not limit itself to 1 column, basically when you say:
SELECT DISTINCT a, b
What you're saying is, "give me the distinct value of a and b combined" .. just like a multi-column UNIQUE index

distinct will ensure that ALL values in your select clause are unique, not just user_id. If you want to limit the results to individual user_ids, you should group by user_id.
Perhaps what you want is:
SELECT
`c`.`user_id`,
`u`.`username`,
`u`.`avatar_path`,
(SELECT COUNT(*) FROM profiles_comments c2 WHERE c2.user_id = c.user_id AND c2.profile_id = 1) AS `comments_count`
FROM `profiles_comments` AS `c` INNER JOIN `users` AS `u` ON u.id = c.user_id
WHERE (c.profile_id = 1)
GROUP BY `c`.`user_id`,
`u`.`username`,
`u`.`avatar_path`
ORDER BY `u`.`id` DESC;

DISTINCT works at a row level, not just a column level
If you want the DISTiNCT of only one column then you will have to aggregate the rest of the columns returned (MIN, MAX, SUM, AVG, etc)
SELECT DISTINCT (Name), Min (ID)
From MyTable

Distinct will try to return only unique rows, it will not return only 1 row per user id in your example.
http://dev.mysql.com/doc/refman/5.0/en/distinct-optimization.html

You misunderstand. The DISTINCT modifier applies to the entire row — it states that no two identical ROWS will be returned in the result set.
Looking at your SQL, what value of the several available do you expect to see returned in the created_at column (for instance)? It would be impossible to predict the results of the query as written.
Also, you're using profile_comments twice in your SELECT. It appears that you're trying to obtain a count of how many times each user has commented. If so, what you want to do is use an AGGREGATE query, grouped on user_id and including only those columns that uniquely identify a user along with a COUNT of the comments:
SELECT user_id, COUNT(*) FROM profile_comments WHERE profile_id = 1 GROUP BY user_id
You can add the join to users to get the user name if you want but, logically, your result set cannot include other columns from profile_comments and still produce only a single row per user_id unless those columns are also aggregated in some way:
SELECT user_id, MIN(created_at) AS Earliest, MAX(created_at) AS Latest, COUNT(*) FROM profile_comments WHERE profile_id = 1 GROUP BY user_id

Related

SQL - Guarantee at least n unique users with 2 appearances each in query

I'm working with AWS Personalize and one of the service Quotas is to have "At least 1000 records containing a min of 25 unique users with at least 2 records each", I know my raw data has those numbers but I'm trying to find a way to guarantee that those numbers will always be met, even if the query is run by someone else in the future.
The easy way out would be to just use the full dataset, but right now we are working towards a POC, so that is not really my first option. I have covered the "two records each" section by just counting the appearances, but I don't know how to guarantee the min of 25 users.
It is important to say that my data is not shuffled in any way at the time of saving.
My query
SELECT C.productid AS ITEM_ID,
A.userid AS USER_ID,
A.createdon AS "TIMESTAMP",
B.fromaddress_countryname AS "LOCATION"
FROM A AS orders
JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
JOIN C AS order_items ON orders.order_id = order_items.order_id
WHERE orders.userid IN (
SELECT orders.userid
FROM A AS ORDERS
GROUP BY orders.userid
HAVING count(*) > 2
)
LIMIT 10
I use the LIMIT to just query a subset since I'm in AWS Athena.
The IN query is not very efficient since it needs to compare each row with all (worst case) the elements of the subquery to find a match.
It would be easier to start by storing all users with at least 2 records in a common table expression (CTE) and do a join to select them.
To ensure at least 25 distinct users you will need a window function to count the unique users since the first row and add a condition on that count. Since you can't use a window function in the where clause, you will need a second CTE and a final query that queries it.
For example:
with users as (
select userid as good_users
from orders
group by 1
having count(*) > 1 -- this condition ensures at least 2 records
),
cte as (
SELECT C.productid AS ITEM_ID,
A.userid AS USER_ID,
A.createdon AS "TIMESTAMP",
B.fromaddress_countryname AS "LOCATION",
count(distinct A.userid) over (rows between unbounded preceding and current row) as n_distinct_users
FROM A AS orders
JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
JOIN C AS order_items ON orders.order_id = order_items.order_id
JOIN users on A.userid = users.userid --> ensure only users with 2 records
order by A.userid -- needed for the window function
)
select * from cte where n_distinct_users < 26
sorting over userid in cte will ensure that at least 2 records per userid will appear in the results.

right way to alias count * in a subquery

I have query below as
select t.comment_count, count(*) as frequency
from
(select u.id, count(c.user_id) as comment_count
from users u
left join comments c
on u.id = c.user_id
and c.created_at between '2020-01-01' and '2020-01-31'
group by 1) t
group by 1
order by 1
when I also try to alias the count(*) as count(t.*) it gives error, can I not alias that with the t from the table? Not sure what I am missing
Thank you
Count(*) stands for the count of all rows returned by a query (with respect to GROUP BY columns). So it makes no sence to specify one of the involved tables. Consider counting rows produced by a join for example. If you need a count of rows of the specific table t you can use count(distinct t.<unique column>)

Get Max from a joined table

I write this script in SQL server And I want get the food name with the Max of order count From this Joined Table . I can get Max value correct but when I add FoodName is select It give me an error.
SELECT S.FoodName, MAX(S.OrderCount) FROM
(SELECT FoodName,
SUM(Number) AS OrderCount
FROM tblFactor
INNER JOIN tblDetail
ON tblFactor.Factor_ID = tblDetail.Factor_ID
WHERE FactorDate = '2020-10-30'
GROUP BY FoodName)S
Here is The Error Message
Column 'S.FoodName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
also I know I can use order by and top to achieve the food Name and Max order number but I want use the way I use in this script . Thank you for your answers
If I follow you correctly, you can use ORDER BY and TOP (1) directly on the result of the join query:
SELECT TOP (1) f.FoodName, SUM(d.Number) AS OrderCount
FROM tblFactor f
INNER JOIN tblDetail d ON f.Factor_ID = d.Factor_ID
WHERE f.FactorDate = '2020-10-30'
GROUP BY f.FoodName
ORDER BY OrderCount DESC
Notes:
I added table aliases to the query, and prefixed each column with the table it (presumably !) comes from; you might need to review that, as I had to make assumptions
If you want to allow top ties, use TOP (1) WITH TIES instead
You have an aggregation function in the outer query MAX() and an unaggregated column. Hence, the database expects a GROUP BY.
Instead, use ORDER BY and LIMIT:
SELECT FoodName, SUM(Number) AS OrderCount
FROM tblFactor f INNER JOIN
tblDetail d
ON fd.Factor_ID = d.Factor_ID
WHERE FactorDate = '2020-10-30'
GROUP BY FoodName
ORDER BY OrderCount DESC
LIMIT 1;
Note: In a query that references multiple tables, you should qualify all column references. It is not clear where the columns come from, so I cannot do that for this query.

order results from table A by total number of related rows form table B

i have two tables, "topics" (table A) and "topic votes" (table B). each "topic" has multiple "topic votes". i need to select rows from "topics" and order them by the total number of related "topic votes".
how is it possible in sqlite? do i need to create a view for this, or is it possible to solve with JOIN?
If you don't need the votes you can just put a correlated query in the order by:
select t.*
from topics t
order by (select count(*) from topic_votes tv where t.topic = v.topic) desc;
Normally, you would want the number of votes in the result set as well. A simple method is to move the subquery to the select clause:
select t.*,
(select count(*) from topic_votes tv where t.topic = v.topic
) as votes
from topics t
order by votes desc;
Assuming you have a PK id in topics table and FK topic_id in topic votes table:
select
t.*
from topics t
left join topic_votes v on t.id = v.topic_id
group by t.id
order by count(v.topic_id) desc;

SQL Query for getting all elements matching another element where there are at least X elements

I want a query that will give me all the items with an ID matching another item, where the group of matching items is larger than X.
Say I have two tables, submissions and submission_items. Each submission has a submission_id and each submission_item has a foreign key that is the parent submission's submission_id. Submissions can have multiple submission_items.
I want to get all the submissions that have more than X submission_items in them.
I tried this:
select submissions.*, submission_item.*
from submission
join submission_item ON submissions.submission_id = submission_item.submission_id
group by submission.submission_id
having count(*) > 1
It errored out saying it wanted other fields in the GROUP BY function. How can I do this better?
When you specify a GROUP BY, fields in the SELECT statement must be either part of the group clause or part of an aggregate.
You need to find the IDs that you are interested in and then rejoin it back to the table.
SELECT * FROM
submission si INNER JOIN
(select submission_id
from submission
join submission_item ON submissions.submission_id = submission_item.submission_id
group by submission.submission_id
having count(*) > 1) c
ON si.submission_id = c.submission_id
or you can group all the fields individually
select submissions.submission_id, submissions.x, submissions.y
from submission
join submission_item ON submissions.submission_id = submission_item.submission_id
group by submission.submission_id, submissions.x, submissions.y
having count(*) > 1