Query to ORDER BY the number of rows returned from another SELECT - sql

I'm trying to wrap my head around SQL and I need some help figuring out how to do the following query in PostgreSQL 9.3.
I have a users table, and a friends table that lists user IDs and the user IDs of friends in multiple rows.
I would like to query the user table, and ORDER BY the number of mutual friends in common to a user ID.
So, the friends table would look like:
user_id | friend_user_id
1 | 4
1 | 5
2 | 10
3 | 7
And so on, so user 1 lists 4 and 5 as friends, and user 2 lists 10 as a friend, so I want to sort by the highest count of user 1 in friend_user_id for the result of user_id in the select.

The Postgres way to do this:
SELECT *
FROM users u
LEFT JOIN (
SELECT user_id, count(*) AS friends
FROM friends
GROUP BY user_id
) f USING (user_id)
ORDER BY f.friends DESC NULLS LAST, user_id -- as tiebreaker
The keyword AS is just noise for table aliases. But don't omit it from column aliases. The manual on "Omitting the AS Key Word":
In FROM items, both the standard and PostgreSQL allow AS to be omitted
before an alias that is an unreserved keyword. But this is impractical
for output column names, because of syntactic ambiguities.
Bold emphasis mine.
ISNULL() is a custom extension of MySQL or SQL Server. Postgres uses the SQL-standard function COALESCE(). But you don't need either here. Use the NULLS LAST clause instead, which is faster and cleaner. See:
PostgreSQL sort by datetime asc, null first?
Multiple users will have the same number of friends. These peers would be sorted arbitrarily. Repeated execution might yield different sort order, which is typically not desirable. Add more expressions to ORDER BY as tiebreaker. Ultimately, the primary key resolves any remaining ambiguity.
If the two tables share the same column name user_id (like they should) you can use the syntax shortcut USING in the join clause. Another standard SQL feature. Welcome side effect: user_id is only listed once in the output for SELECT *, as opposed to when joining with ON. Many clients wouldn't even accept duplicate column names in the output.
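A rough, runnable sketch of the points above. It uses Python's built-in sqlite3 module rather than PostgreSQL (an assumption made only to keep the example self-contained), and the sample rows are made up. SQLite already sorts NULLs last for DESC, so the NULLS LAST clause is only added when the bundled SQLite (3.30 or later) understands it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE friends (user_id INTEGER, friend_user_id INTEGER);
    INSERT INTO users VALUES (1), (2), (3), (4), (5);
    INSERT INTO friends VALUES (1, 4), (1, 5), (2, 10), (3, 7);
""")

# NULLS LAST needs SQLite 3.30+; older versions already put NULLs last for DESC.
nulls_last = "NULLS LAST" if sqlite3.sqlite_version_info >= (3, 30, 0) else ""

query = f"""
    SELECT user_id, f.friends
    FROM users u
    LEFT JOIN (
        SELECT user_id, count(*) AS friends
        FROM friends
        GROUP BY user_id
    ) f USING (user_id)
    ORDER BY f.friends DESC {nulls_last}, user_id  -- user_id as tiebreaker
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Users 2 and 3 tie with one friend each, so the user_id tiebreaker decides their order. Users 4 and 5 have no rows in friends, so their count is NULL and they sort to the end.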

Something like this?
SELECT * FROM [users] u
LEFT JOIN (SELECT user_id, COUNT(*) friends FROM friends GROUP BY user_id) f
ON u.user_id = f.user_id
ORDER BY ISNULL(f.friends,0) DESC

Related

Output identical fields names of two LEFT JOIN tables Sql

I have two tables, with about 20 columns each
users:
id_user user ..... status token
----------------------------------
2 A 0 XdAQ
posts:
id_user post ..... status token
-------------------------------------
3 hi 1 sDyTMl
Query:
SELECT u.*,p.*
FROM posts as p
LEFT JOIN users as u ON u.id_user = p.id_user
WHERE p.id_post = 3
LIMIT 1
So in PHP, any value can be retrieved by column name:
....
$status=$a['status'];
$token=$a['token'];
I want to return all the fields of each table to build the post content. The problem is that the identically named columns in the two tables conflict. There are more than 20 columns in each of my real tables, so I don't think writing out every column name with an alias is the way to go. Is there a way to alias only the conflicting columns?
You really should list the specific columns that you want. This is the safest way to retrieve values from the table.
If the only column that is in common is the one used for the join, you can use the USING clause:
SELECT *
FROM posts p LEFT JOIN
users as u
USING (id_user)
WHERE p.id_post = 3
LIMIT 1;
The USING clause is ANSI standard, but not all databases support it. When you use it, only one copy of id_user appears in the columns returned by SELECT *. In a LEFT JOIN, it is the version with a value.
If you have other columns with the same name, you need to use column aliases. One short-cut is to take all columns from one table and name the columns in the other:
SELECT u.*, p.col1 as p_col1, . . .
FROM posts p LEFT JOIN
users as u
USING (id_user)
WHERE p.id_post = 3
LIMIT 1;
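A minimal sketch of that shortcut, run against SQLite through Python's sqlite3 module so it is self-contained. The table layouts are assumptions reduced to the columns shown in the question (id_post is taken from the question's WHERE clause), and sqlite3.Row stands in for the PHP associative array.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # access columns by name, like a PHP assoc array
conn.executescript("""
    CREATE TABLE users (id_user INTEGER PRIMARY KEY, user TEXT,
                        status INTEGER, token TEXT);
    CREATE TABLE posts (id_post INTEGER PRIMARY KEY, id_user INTEGER, post TEXT,
                        status INTEGER, token TEXT);
    INSERT INTO users VALUES (2, 'A', 0, 'XdAQ');
    INSERT INTO posts VALUES (3, 2, 'hi', 1, 'sDyTMl');
""")

row = conn.execute("""
    SELECT u.*,                      -- users columns keep their names
           p.id_post,
           p.post,
           p.status AS p_status,     -- alias only the conflicting columns
           p.token  AS p_token
    FROM posts p
    LEFT JOIN users u USING (id_user)
    WHERE p.id_post = 3
    LIMIT 1
""").fetchone()

print(row["status"], row["token"])      # values from users
print(row["p_status"], row["p_token"])  # values from posts
```

Only the columns that actually collide need an alias. The rest of one table can still be pulled in with u.*.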

How can I order by a specific order?

It would be something like:
SELECT * FROM users ORDER BY id ORDER("abc","ghk","pqr"...);
In my order clause there might be 1000 records and all are dynamic.
A quick google search gave me below result:
SELECT * FROM users ORDER BY case id
when "abc" then 1
when "ghk" then 2
when "pqr" then 3 end;
As I said all my order clause values are dynamic. So is there any suggestion for me?
Your example isn't entirely clear, as it appears that a simple ORDER BY would suffice to order your ids alphabetically. However, it appears you are trying to create a dynamic ordering scheme that may not be alphabetical. In that case, my recommendation would be to use a lookup table for the values that you will be ordering by. This serves two purposes: first, it allows you to easily reorder the items without altering each entry in the users table, and second, it avoids (or at least reduces) problems with typos and other issues that can occur with "magic strings."
This would look something like:
Lookup Table:
CREATE TABLE LookupValues (
Id CHAR(3) PRIMARY KEY,
SortOrder INT -- ORDER is a reserved word, so the column needs another name
);
Query:
SELECT
u.*
FROM
users u
INNER JOIN
LookupValues l
ON
u.Id = l.Id
ORDER BY
l.SortOrder
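Since the ordering values are dynamic, the lookup table can be filled at run time. The following is a sketch under assumptions, not a definitive implementation: it uses SQLite via Python's sqlite3 module, and the names lookup_values, sort_order and desired_order are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id TEXT PRIMARY KEY);
    INSERT INTO users VALUES ('abc'), ('ghk'), ('pqr'), ('xyz');
""")

# Hypothetical dynamic ordering supplied by the application at run time.
desired_order = ["pqr", "abc", "ghk"]

# Temporary lookup table: one row per id with its position in the list.
conn.execute(
    "CREATE TEMP TABLE lookup_values (id TEXT PRIMARY KEY, sort_order INTEGER)"
)
conn.executemany(
    "INSERT INTO lookup_values (id, sort_order) VALUES (?, ?)",
    [(value, position) for position, value in enumerate(desired_order)],
)

ordered_ids = [
    row[0]
    for row in conn.execute("""
        SELECT u.id
        FROM users u
        INNER JOIN lookup_values l ON u.id = l.id
        ORDER BY l.sort_order
    """)
]
print(ordered_ids)
```

The INNER JOIN drops users whose id is not in the list ('xyz' here). A LEFT JOIN would keep them, with a NULL sort position.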

Select entries from a list that do not occur in the query result

I use Postgres 9.1. I am running a query which is like
select count(distinct(user_id)) from users where user_id in (11,32,763,324,45,76,37,98,587);
Here the list of user_ids contains 600 entries. The result I am getting is 597. Thus there are 3 user_ids from the list, which are not present in the users. How do I get to know these 3 user_ids?
Please note that user_id is not the Primary Key of users
DISTINCT in your count-query only makes sense if user_id is not defined UNIQUE.
We don't need it either way for the query you ask for:
SELECT t.user_id
FROM unnest('{11,32,763,324,45,76,37,98,587}'::int[]) t(user_id)
LEFT JOIN users u USING (user_id)
WHERE u.user_id IS NULL;
Beware of NOT IN if NULL values can be involved on either side of the expression! IN / NOT IN with a long value list also scales poorly.
Details:
Optimizing a Postgres query with a large IN
Select rows which are not present in other table
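The same anti-join can be tried out in SQLite through Python's sqlite3 module. SQLite has no unnest(), so this sketch swaps in a temporary table (named wanted here, a hypothetical name) to hold the candidate list. The sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER);  -- not unique, as in the question
    INSERT INTO users VALUES (11), (32), (32), (763), (324), (45), (76);
""")

wanted_ids = [11, 32, 763, 324, 45, 76, 37, 98, 587]

# Stand-in for unnest(...): load the candidate list into a temporary table.
conn.execute("CREATE TEMP TABLE wanted (user_id INTEGER)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(i,) for i in wanted_ids])

# LEFT JOIN keeps every candidate; IS NULL keeps only those without a match.
missing = [
    row[0]
    for row in conn.execute("""
        SELECT t.user_id
        FROM wanted t
        LEFT JOIN users u USING (user_id)
        WHERE u.user_id IS NULL
        ORDER BY t.user_id
    """)
]
print(missing)
```

Duplicate user_id values in users (32 here) do not matter, because matched rows are filtered out by the IS NULL test.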

SQL Combining Counts with Joins

I have three tables:
Messages
messageid | userid | text
Ex: 1 | 1303 | hey guys
Users
userid | username
Ex:
1303 | trantor
1301 | tranro1
1302 | trantor2
Favorites
messageid | userid
Ex:
1 | 1302
1 | 1301
What I want to do is display a table that has usernames and counts how many of their messages were favorited a certain number of times. In the example above, I want to query saying "how many messages does each user have that have been liked exactly twice?"
and it would show a table that has a row saying
trantor | 1
A natural extension is to replace exactly twice with "at least 2", "more than 6", etc. I'm trying to combine count with joins and find myself confused. And since the tables are large, I'm getting counts but am not confident that my query is working correctly. I have read this article but am still confused.
What I have so far:
SELECT USERS.username, COUNT(FAVORITES.id) FROM USERS INNER JOIN FAVORITES ON FAVORITES.userID=USERS.id WHERE COUNT(FAVORITES.id) > 2;
But I don't think it works.
On S.O. I've found these questions on "correlated subqueries" but am thoroughly confused.
Would it be something like this?
SELECT USERS.username,
, ( SELECT COUNT(FAVORTIES.userid)
FROM FAVORITES INNER JOIN ON MESSAGES
WHERE FAVORITES.messageid = MESSAGES.messageid
)
FROM USERS
There's a couple things you should know with aggregate functions in SQL. First off, you need to do a GROUP BY if you're selecting an aggregate function. Second, any conditions involving aggregate functions are to be used with a HAVING clause rather than a WHERE.
The GROUP BY is to be applied to the column(s) you're selecting alongside any aggregate functions.
Here's a basic structure:
SELECT attribute1, COUNT(attribute2)
FROM someTable
GROUP BY attribute1
HAVING COUNT(attribute2) > 2;
Add anything else you need, such as JOINs and ORDER BY.
Note: these clauses have to appear in a certain order. For example, ORDER BY goes after HAVING, which comes after GROUP BY, and so forth.
If I'm remembering correctly, the clauses are written in this order:
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
When you use an aggregate function such as COUNT(), you need GROUP BY, and any condition on the aggregate goes in HAVING rather than WHERE:
SELECT USERS.username, COUNT(FAVORITES.id)
FROM USERS
INNER JOIN FAVORITES
ON FAVORITES.userID=USERS.id
GROUP BY USERS.username
HAVING COUNT(FAVORITES.id) > 2;
From documentation
If you use a group function in a statement containing no GROUP BY clause, it is equivalent to grouping on all rows.
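For the question as originally asked (how many messages each user has that were favorited exactly twice), one possible approach is to apply GROUP BY and HAVING in two steps. This is a sketch under assumptions: it runs on SQLite through Python's sqlite3 module, uses the column names from the question's table listing, and adds a made-up second message so the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userid INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE messages (messageid INTEGER PRIMARY KEY, userid INTEGER, text TEXT);
    CREATE TABLE favorites (messageid INTEGER, userid INTEGER);
    INSERT INTO users VALUES (1301, 'tranro1'), (1302, 'trantor2'), (1303, 'trantor');
    INSERT INTO messages VALUES (1, 1303, 'hey guys'), (2, 1301, 'hello');
    INSERT INTO favorites VALUES (1, 1302), (1, 1301), (2, 1303);
""")

rows = conn.execute("""
    SELECT u.username, COUNT(*) AS message_count
    FROM (
        -- inner step: keep only messages favorited exactly twice
        SELECT m.messageid, m.userid
        FROM messages m
        INNER JOIN favorites f ON f.messageid = m.messageid
        GROUP BY m.messageid, m.userid
        HAVING COUNT(*) = 2
    ) twice
    -- outer step: count the surviving messages per author
    INNER JOIN users u ON u.userid = twice.userid
    GROUP BY u.username
    ORDER BY u.username
""").fetchall()
print(rows)
```

Changing the inner HAVING condition to COUNT(*) >= 2 or COUNT(*) > 6 covers the "at least 2" and "more than 6" variants.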

Does a GROUP BY on a UNIQUE key calculate all the groups before applying the LIMIT clause?

If I GROUP BY on a unique key, and apply a LIMIT clause to the query, will all the groups be calculated before the limit is applied?
If I have a hundred records in the table (each with a unique key), will I have 100 records in the temporary table created (for the GROUP BY) before the LIMIT is applied?
A case study why I need this:
Take Stack Overflow for example.
Each query you run to show a list of questions, also shows the user who asked this question, and the number of badges he has.
So, while user<->question is one-to-one, user<->badges is one-to-many.
The only way to do it in one query (and not one on questions and another one on users and then combine results), is to group the query by the primary key (question_id) and join+group_concat to the user_badges table.
The same goes for the questions TAGS.
Code example:
Table Questions:
question_id (int)(pk)| question_body(varchar)
Table tag_question:
question_id (int) | tag_id (int)
SELECT:
SELECT questions.question_id,
questions.question_body,
GROUP_CONCAT(tag_id,' ') AS 'tags-ids'
FROM
questions
JOIN
tag_question
ON
questions.question_id=tag_question.question_id
GROUP BY
questions.question_id
LIMIT 15
Yes, the logical order in which the query is evaluated is:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
LIMIT
LIMIT is the last thing calculated, so your grouping will be just fine.
Now, looking at your rephrased question, you don't have just one row per group, but many: in the case of Stack Overflow, you'll have just one user per row, but many badges, i.e.
(uid, badge_id, etc.)
(1, 2, ...)
(1, 3, ...)
(1, 12, ...)
all those would be grouped together.
To avoid a full table scan, all you need are indexes. Besides that, if you need to SUM, for example, you cannot avoid a full scan.
EDIT:
You'll need something like this (look at the WHERE clause):
SELECT
q1.question_id,
q1.question_body,
GROUP_CONCAT(tq.tag_id,' ') AS 'tags_ids'
FROM
questions q1
JOIN tag_question tq
ON q1.question_id = tq.question_id
WHERE
q1.question_id IN (
SELECT
tq2.question_id
FROM
tag_question tq2
JOIN tag t
ON tq2.tag_id = t.tag_id
WHERE
t.name = 'the-misterious-tag'
)
GROUP BY
q1.question_id
LIMIT 15
LIMIT does get applied after GROUP BY.
Will the temporary table be created or not, depends on how your indexes are built.
If you have an index on the grouping field and don't order by the aggregate results, then an INDEX SCAN FOR GROUP BY is applied, and each aggregate is counted on the fly.
That means that if you don't select an aggregate due to the LIMIT, it won't ever be calculated.
But if you order by an aggregate, then, of course, all of them need to be calculated before they can be sorted.
That's why they are calculated first and then the filesort is applied.
Update:
As for your query, see what EXPLAIN EXTENDED says for it.
Most probably, question_id is a PRIMARY KEY for your table, and most probably, it will be used in a scan.
That means no filesort will be applied and the join itself will not ever happen after the 15th row.
To make sure, rewrite your query as follows:
SELECT question_id,
question_body,
(
SELECT GROUP_CONCAT(tag_id, ' ')
FROM tag_question t
WHERE t.question_id = q.question_id
)
FROM questions q
ORDER BY
question_id
LIMIT 15
First, it is more readable,
Second, it is more efficient, and
Third, it will return even untagged questions (which your current query doesn't).
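The third point can be checked with a small experiment. This sketch runs on SQLite through Python's sqlite3 module rather than MySQL (an assumption made only to keep it self-contained) and uses made-up sample rows. In SQLite the second argument of GROUP_CONCAT is the separator, whereas MySQL spells that with the SEPARATOR keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (question_id INTEGER PRIMARY KEY, question_body TEXT);
    CREATE TABLE tag_question (question_id INTEGER, tag_id INTEGER);
    INSERT INTO questions VALUES (1, 'first'), (2, 'untagged'), (3, 'third');
    INSERT INTO tag_question VALUES (1, 10), (1, 11), (3, 12);
""")

# JOIN + GROUP BY: the inner join drops question 2, which has no tags.
joined = conn.execute("""
    SELECT q.question_id, GROUP_CONCAT(tq.tag_id, ' ') AS tag_ids
    FROM questions q
    JOIN tag_question tq ON q.question_id = tq.question_id
    GROUP BY q.question_id
    ORDER BY q.question_id
    LIMIT 15
""").fetchall()

# Correlated subquery: every question is returned; untagged ones get NULL.
correlated = conn.execute("""
    SELECT q.question_id,
           (SELECT GROUP_CONCAT(t.tag_id, ' ')
            FROM tag_question t
            WHERE t.question_id = q.question_id) AS tag_ids
    FROM questions q
    ORDER BY q.question_id
    LIMIT 15
""").fetchall()

print(joined)
print(correlated)
```

The untagged question is missing from the first result and present with a NULL tag list in the second. Whether the subquery form is also faster depends on the engine and indexes, which this sketch does not measure.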
If the field you're grouping on is indexed, it shouldn't do a full table scan.