EDIT:
As requested, our table schema is,
posts:
postid (primary key),
post_text
comments:
commentid (primary key) ,
postid (foreign key referencing posts.postid),
comment_text
replies
replyid (primary key)
commentid (foreign key referencing comments.commentid)
reply_text
I have the tables posts, comments, and replies in a SQL database. (Obviously, a post can have comments, and a comment can have replies)
I want to return a post based on its id, postid.
So I would like a database function that has the following input and output:
input:
postid
output:
post = {
postid
post_text
comments: [comment, ...]
}
Where the comment and reply are nested in the post,
comment = {
commentid
comment_text
replies: [reply, ...]
}
reply = {
replyid
reply_text
}
I have tried using joins, but the returned data is highly redundant, and it seems stupid. For instance, fetching the data from two different replies will give,
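 postid | post_text | commentid | comment_text | replyid | reply_text
--------+-----------+-----------+--------------+---------+-------------
      1 | POST_TEXT |        78 | COMMENT_TEXT |      14 | REPLY1_TEXT
      1 | POST_TEXT |        78 | COMMENT_TEXT |      15 | REPLY2_TEXT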
It seems instead I want to make 3 separate queries, in sequence (first to the table posts, then to comments, then to replies)
How do I do this?
The “highly redundant” join result is normally the best way, because it is the natural thing in a relational database. Relational databases aim at avoiding redundancy in data storage, but not in query output. Avoiding that redundancy comes at an extra cost: you have to aggregate the data on the server side, and the client probably has to unpack the nested JSON data again.
Here is some sample code that demonstrates how you could aggregate the results:
SELECT postid, post_text,
jsonb_agg(
jsonb_build_object(
'commentid', commentid,
'comment_text', comment_text,
'replies', replies
)
) AS comments
FROM (SELECT postid, post_text, commentid, comment_text,
jsonb_agg(
jsonb_build_object(
'replyid', replyid,
'reply_text', reply_text
)
) AS replies
FROM /* your join */
GROUP BY postid, post_text, commentid, comment_text) AS q
GROUP BY postid, post_text;
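If you keep the plain join instead, the nesting can also be rebuilt on the client. The following is only a rough sketch of that unpacking step in Python; the function name nest_join_rows and the tuple layout are assumptions made for illustration, and the sample rows are the two rows from the question.

```python
def nest_join_rows(rows):
    """Fold flat (postid, post_text, commentid, comment_text, replyid, reply_text)
    rows, as returned by the join, into one nested dict per post."""
    posts = {}
    for postid, post_text, commentid, comment_text, replyid, reply_text in rows:
        post = posts.setdefault(
            postid, {"postid": postid, "post_text": post_text, "comments": {}}
        )
        if commentid is None:  # post without comments (gap from a LEFT JOIN)
            continue
        comment = post["comments"].setdefault(
            commentid,
            {"commentid": commentid, "comment_text": comment_text, "replies": []},
        )
        if replyid is not None:  # comment without replies
            comment["replies"].append({"replyid": replyid, "reply_text": reply_text})
    # Turn the per-post comment index into the list the question asks for.
    for post in posts.values():
        post["comments"] = list(post["comments"].values())
    return list(posts.values())


rows = [
    (1, "POST_TEXT", 78, "COMMENT_TEXT", 14, "REPLY1_TEXT"),
    (1, "POST_TEXT", 78, "COMMENT_TEXT", 15, "REPLY2_TEXT"),
]
nested = nest_join_rows(rows)
print(nested)
```

The redundant post and comment columns are simply ignored after their first occurrence, so the only real cost of the join is the extra data on the wire.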
The redundant data stems from a cross join of a post's comments and replies. That is, for each post you join each comment with each reply. Comment 78 relates neither to reply 14 nor to reply 15, but merely to the same post.
The typical approach to select the data would hence be three queries:
select * from posts;
select * from comments;
select * from replies;
You can also reduce this to two queries and join the posts table to the comments query, the replies query, or both. This, again, will lead to selecting redundant data, but may ease data handling in your app.
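As a rough illustration of the multi-query approach for a single post, here is a sketch in Python using the standard sqlite3 module with an in-memory database. The schema follows the question; the helper name fetch_post is an assumption, and SQLite merely stands in for whatever DBMS you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts    (postid INTEGER PRIMARY KEY, post_text TEXT);
    CREATE TABLE comments (commentid INTEGER PRIMARY KEY,
                           postid INTEGER REFERENCES posts(postid),
                           comment_text TEXT);
    CREATE TABLE replies  (replyid INTEGER PRIMARY KEY,
                           commentid INTEGER REFERENCES comments(commentid),
                           reply_text TEXT);
    INSERT INTO posts    VALUES (1, 'POST_TEXT');
    INSERT INTO comments VALUES (78, 1, 'COMMENT_TEXT');
    INSERT INTO replies  VALUES (14, 78, 'REPLY1_TEXT'), (15, 78, 'REPLY2_TEXT');
""")


def fetch_post(conn, postid):
    """Three queries in sequence: the post, its comments, their replies."""
    row = conn.execute(
        "SELECT postid, post_text FROM posts WHERE postid = ?", (postid,)
    ).fetchone()
    if row is None:
        return None
    post = {"postid": row[0], "post_text": row[1], "comments": []}

    by_comment = {}
    for commentid, text in conn.execute(
        "SELECT commentid, comment_text FROM comments WHERE postid = ? ORDER BY commentid",
        (postid,),
    ):
        comment = {"commentid": commentid, "comment_text": text, "replies": []}
        by_comment[commentid] = comment
        post["comments"].append(comment)

    # One query for all replies of the post, not one query per comment.
    for replyid, commentid, text in conn.execute(
        """SELECT r.replyid, r.commentid, r.reply_text
           FROM replies r
           WHERE r.commentid IN (SELECT commentid FROM comments WHERE postid = ?)
           ORDER BY r.replyid""",
        (postid,),
    ):
        by_comment[commentid]["replies"].append({"replyid": replyid, "reply_text": text})
    return post


post = fetch_post(conn, 1)
print(post)
```

No row is transferred twice here, at the price of three round trips.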
If you want to avoid joins, but must avoid database round trips, you can glue query results together:
select *
from
(
select postid as id, postid, 'post' as type, post_text as text from posts
union all
select commentid as id, postid, 'comment' as type, comment_text as text from comments
union all
-- replies has no postid column in the question's schema, so it is looked up via comments
select r.replyid as id, c.postid, 'reply' as type, r.reply_text as text
from replies r join comments c on c.commentid = r.commentid
) glued
order by postid, type, id;
Finally, you can create JSON in your DBMS. Again, don't cross join comments with replies, but join the aggregated comments object and the aggregated replies object to the post.
select p.postid, p.post_text, c.comments, r.replies
from posts p
left join
(
select
postid,
jsonb_agg(jsonb_build_object('commentid', commentid,
'comment_text', comment_text)
) as comments
from comments
group by postid
) c on c.postid = p.postid
left join
(
-- replies has no postid column in the question's schema; look it up via comments
select
cm.postid,
jsonb_agg(jsonb_build_object('replyid', rp.replyid,
'reply_text', rp.reply_text)
) as replies
from replies rp
join comments cm on cm.commentid = rp.commentid
group by cm.postid
) r on r.postid = p.postid;
Your idea to store things in JSON is a good one if you have something to parse it down the line.
As an alternative to the previous answers that involve JSON, you can also get a normal SQL result set (table definition and sample data are below the query):
WITH MyFilter(postid) AS (
VALUES (1),(2) /* rest of your filter */
)
SELECT 'Post' AS PublicationType, postid, NULL AS CommentID, NULL As ReplyToID, post_text
FROM Posts
WHERE postID IN (SELECT postid from MyFilter)
UNION ALL
SELECT CASE WHEN ReplyToID IS NULL THEN 'Comment' ELSE 'Reply' END, postid, commentid, replyToID, comment_text
FROM Comments
WHERE postid IN (SELECT postid from MyFilter)
ORDER BY postid, CommentID NULLS FIRST, ReplyToID NULLS FIRST
Note: the PublicationType column was added for the sake of clarity. You can alternatively inspect CommentID and ReplyToId and see what is null to determine the type of publication.
This should leave you with very little, if any, redundant data to transfer back to the SQL client.
This approach with UNION ALL will work with 3 tables too (you only have to add 1 UNION ALL) but in your case, I would rather go with a 2-table schema:
CREATE TABLE posts (
postid SERIAL primary key,
post_text text NOT NULL
);
CREATE TABLE comments (
commentid SERIAL primary key,
ReplyToID INTEGER NULL REFERENCES Comments(CommentID) /* ON DELETE CASCADE? */,
postid INTEGER NOT NULL references posts(postid) /* ON DELETE CASCADE? */,
comment_text Text NOT NULL
);
INSERT INTO posts(post_text) VALUES ('Post 1'),('Post 2'),('Post 3');
INSERT INTO Comments(postid, comment_text) VALUES (1, 'Comment 1.1'), (1, 'Comment 1.2'), (2, 'Comment 2.1');
INSERT INTO Comments(replytoId, postid, comment_text) VALUES (1, 1, 'Reply to comment 1.1'), (3, 2, 'Reply to comment 2.1');
This needs one fewer table and allows level 2 replies (replies to replies), or deeper, rather than just replies to comments. A recursive query (there are plenty of samples of that on SO) can link any reply back to the original comment if you want.
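For what it's worth, here is a sketch of such a recursive query, run through Python's sqlite3 module so that it is self-contained. It uses the 2-table schema and sample rows from above (with SQLite column types), plus one hypothetical level-2 reply that is not part of the sample data and is only there to exercise the recursion. The CTE name thread and the columns root_commentid and depth are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (postid INTEGER PRIMARY KEY, post_text TEXT NOT NULL);
    CREATE TABLE comments (
        commentid    INTEGER PRIMARY KEY,
        replytoid    INTEGER REFERENCES comments(commentid),
        postid       INTEGER NOT NULL REFERENCES posts(postid),
        comment_text TEXT NOT NULL
    );
    INSERT INTO posts(post_text) VALUES ('Post 1'), ('Post 2'), ('Post 3');
    INSERT INTO comments(postid, comment_text)
        VALUES (1, 'Comment 1.1'), (1, 'Comment 1.2'), (2, 'Comment 2.1');
    INSERT INTO comments(replytoid, postid, comment_text)
        VALUES (1, 1, 'Reply to comment 1.1'), (3, 2, 'Reply to comment 2.1');
    -- hypothetical level-2 reply, added only to exercise the recursion
    INSERT INTO comments(replytoid, postid, comment_text)
        VALUES (4, 1, 'Reply to the reply');
""")

# Walk down from the top-level comments, carrying the id of the original comment.
THREADS_SQL = """
WITH RECURSIVE thread(commentid, root_commentid, depth) AS (
    SELECT commentid, commentid, 0
    FROM comments
    WHERE replytoid IS NULL
    UNION ALL
    SELECT c.commentid, t.root_commentid, t.depth + 1
    FROM comments c
    JOIN thread t ON c.replytoid = t.commentid
)
SELECT commentid, root_commentid, depth
FROM thread
ORDER BY commentid
"""

threads = conn.execute(THREADS_SQL).fetchall()
print(threads)
```

Each row pairs a comment or reply with the top-level comment it ultimately belongs to, however deep the reply chain is.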
Edit: I noticed your comment just a bit late. Of course, no matter what solution you take, there is no need to execute a request to get the replies to each and every comment.
Even with 3 tables, even without JSON, the query to get all the replies for all the comments at once is:
SELECT *
FROM replies
WHERE commentid IN (
SELECT commentid
FROM comments
WHERE postid IN (
/* List your post ids here or nest another SELECT postid FROM posts WHERE ... */
)
)
Related
I have a table that contains all of the posts. I also have a table where a row is added when a user likes a post with foreign keys user_id and post_id.
I want to retrieve a list of ALL of the posts and whether or not a specific user has liked that post. Using an outer join I end up getting some posts twice. Once for user 1 and once for user 2. If I use a WHERE to filter for likes.user_id = 1 OR likes.user_id IS NULL, I don't get the posts that are only liked by other users.
Ideally I would do this with a single query. SQL isn't my strength, so I'm not even really sure if a sub query is needed or if a join is sufficient.
Apologies for being this vague but I think this is a common enough query that it should make some sense.
EDIT: I have created a DB Fiddle with the two queries that I mentioned. https://www.db-fiddle.com/f/oFM2zWsR9WFKTPJA16U1Tz/4
UPDATE: Figured it out last night. This is what I ended up with:
SELECT
posts.id AS post_id,
posts.title AS post_title,
CASE
WHEN EXISTS (
SELECT *
FROM likes
WHERE posts.id = likes.post_id
AND likes.user_id = 1
) THEN TRUE
ELSE FALSE END
AS liked
FROM posts;
Although I was able to resolve it, thanks to @wildplasser for his answer as well.
Data (I needed to change it a bit, because one should not assign to serials):
CREATE TABLE posts (
id serial,
title varchar
);
CREATE TABLE users (
id serial,
name varchar
);
CREATE TABLE likes (
id serial,
user_id int,
post_id int
);
INSERT INTO posts (title) VALUES ('First Post');
INSERT INTO posts (title) VALUES ('Second Post');
INSERT INTO posts (title) VALUES ('Third Post');
INSERT INTO users (name) VALUES ('Obama');
INSERT INTO users (name) VALUES ('Trump');
INSERT INTO likes (user_id, post_id) VALUES (1, 1);
INSERT INTO likes (user_id, post_id) VALUES (2, 1);
INSERT INTO likes (user_id, post_id) VALUES (2, 2);
-- I want to retrieve a list of ALL of the posts and whether or not a specific user has liked that post
SELECT id, title
, EXISTS(
--EXISTS() yields a boolean value
SELECT *
FROM likes lk
JOIN users u ON u.id = lk.user_id AND lk.post_id=p.id
WHERE u.name ='Obama'
) AS liked_by_Obama
FROM posts p
;
Results:
id | title | liked_by_obama
----+-------------+----------------
1 | First Post | t
2 | Second Post | f
3 | Third Post | f
(3 rows)
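To compare the EXISTS form with the outer join the question started from, here is a small sketch using Python's sqlite3 module. It reuses the sample data above (without the users table, filtering on user_id directly); the variable names are arbitrary. The join variant assumes there is at most one likes row per user and post, otherwise posts would show up more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE likes (id INTEGER PRIMARY KEY, user_id INT, post_id INT);
    INSERT INTO posts (title) VALUES ('First Post'), ('Second Post'), ('Third Post');
    INSERT INTO likes (user_id, post_id) VALUES (1, 1), (2, 1), (2, 2);
""")

# Variant 1: EXISTS, as in the answers above.
exists_sql = """
SELECT p.id, p.title,
       EXISTS (SELECT 1 FROM likes l
               WHERE l.post_id = p.id AND l.user_id = ?) AS liked
FROM posts p
ORDER BY p.id
"""

# Variant 2: outer join with the user filter in ON rather than in WHERE,
# so posts liked only by other users are kept, with liked = 0.
join_sql = """
SELECT p.id, p.title, l.id IS NOT NULL AS liked
FROM posts p
LEFT JOIN likes l ON l.post_id = p.id AND l.user_id = ?
ORDER BY p.id
"""

with_exists = conn.execute(exists_sql, (1,)).fetchall()
with_join = conn.execute(join_sql, (1,)).fetchall()
print(with_exists)
print(with_join)
```

Putting the user filter into the ON clause is what keeps the unmatched posts; the same condition in WHERE would be applied after the join and throw them away.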
As far as I understand, you have two tables: a post table, which includes all posts from different users, and a like table with the user id and post id. If you want to retrieve only posts, then
select * from posts
If you need user information as well, which is stored in the user table, then you can do the following.
select user.user_name, post.postdata from user,post where post.userid=user.userid
In the query above, user_name is a column in the user table and postdata is a column in the post table.
This is a follow-up to this excellent Q&A: 13227142.
I almost have to do the same thing (with the constraint of PostgreSQL 9.2) but I'm using only one table. Therefore the query uses a self-join (in order to produce the correct JSON format) which results in a duplicate id field. How can I avoid this?
Example:
CREATE TABLE books
(
id serial primary key,
isbn text,
author text,
title text,
edition text,
teaser text
);
SELECT row_to_json(row)
FROM
(
SELECT id AS bookid,
author,
cover
FROM books
INNER JOIN
(
SELECT id, title, edition, teaser
FROM books
) cover(id, title, edition, teaser)
USING (id)
) row;
Result:
{
"bookid": 1,
"author": "Bjarne Stroustrup",
"cover": {
"id": 1,
"title": "Design and Evolution of C++",
"edition": "1st edition",
"teaser": "This book focuses on the principles, processes and decisions made during the development of the C++ programming language"
}
}
I want to get rid of "id" in "cover".
This turned out to be a tricky task. As far as I can see it's impossible to achieve with a simple query.
One solution is to use a predefined data type:
CREATE TYPE bookcovertype AS (title text, edition text, teaser text);
SELECT row_to_json(row)
FROM
(
SELECT books.id AS bookid, books.author,
row_to_json(row(books.title, books.edition, books.teaser)::bookcovertype) as cover
FROM books
) row;
You need id for the join, so without id you can't write such a short query. You need to build the structure yourself. Something like:
select row_to_json(row,true)
FROM
(
with a as (select id,isbn,author,row_to_json((title,edition,teaser)) r from books
)
select a.id AS bookid,a.author, concat('{"title":',r->'f1',',"edition":',r->'f2',',"teaser":',r->'f3','}')::json as cover
from a
) row;
row_to_json
--------------------------------------------------------
{"bookid":1, +
"author":"\"b\"", +
"cover":{"title":"c","edition":"d","teaser":"\"b\""}}
(1 row)
Also, without the join you use about half the resources.
For the sake of completeness, I stumbled upon another answer myself: the additional fields can be eliminated with string functions. However, I prefer AlexM's answer because it will be faster and is still compatible with PostgreSQL 9.2.
SELECT regexp_replace(
(
SELECT row_to_json(row)
FROM
(
SELECT id AS bookid,
author,
cover
FROM books
INNER JOIN
(
SELECT id, title, edition, teaser
FROM books
) cover(id, title, edition, teaser)
USING (id)
) row
)::text,
'"id":\d+,',
'')
I have two tables, A and B.
Table A contains
postid,postname,CategoryURL
and
table B contains
postid,CategoryImageURL
Multiple CategoryImageURL values are assigned to one postid. I want to display CategoryImageURL alongside table A, so that one postid shows CategoryImageURL1,CategoryImageURL2 and so on.
I want to achieve a one-to-many relationship for one postid. What logic should be written in the SQL function?
As I read it, you want to display all related CategoryImageURLs from the second table in one line, separated by commas?
Then you will need a recursive operation. Maybe a CTE (Common Table Expression) does the trick; see below. I have added another key to the second table to be able to check whether all rows of the second table have been processed for the corresponding row in the first table.
Maybe this helps:
with a_cte (post_id, url_id, name, list, rrank) as
(
select
a.post_id
, b.url_id
, a.name
, cast(b.urln + ', ' as nvarchar(100)) as list
, 0 as rrank
from
dbo.a
join dbo.b
on a.post_id = b.post_id
union all
select
c.post_id
, a1.url_id
, c.name
, cast(c.list + case when rrank = 0 then '' else ', ' end + a1.urln as nvarchar(100))
, c.rrank + 1
from a_cte c
join ( select
b.post_id
, b.url_id
, a.name
, b.urln
from dbo.a
join dbo.b
on a.post_id = b.post_id
) a1
on c.post_id = a1.post_id
and c.url_id < a1.url_id -- ==> take care, that there is no endless loop
)
select d.name, d.list
from
(
select name, list, rank() over (partition by post_id order by rrank desc)
from a_cte
) d (name, list, rank)
where rank = 1
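If building the list in SQL is not a hard requirement, the same one-line-per-post result can be assembled in application code from a plain join. A minimal sketch with Python's sqlite3 module follows; the table and column names (a, b, post_id, url_id, name, urln) follow the CTE above, while the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (post_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (url_id INTEGER PRIMARY KEY, post_id INTEGER, urln TEXT);
    -- invented sample rows
    INSERT INTO a VALUES (1, 'post one'), (2, 'post two');
    INSERT INTO b VALUES (1, 1, 'img1.png'), (2, 1, 'img2.png'), (3, 2, 'img3.png');
""")

# Collect the URLs of each post in url_id order.
url_lists = {}
for name, urln in conn.execute(
    """SELECT a.name, b.urln
       FROM a JOIN b ON a.post_id = b.post_id
       ORDER BY a.post_id, b.url_id"""
):
    url_lists.setdefault(name, []).append(urln)

# One comma-separated line per post, like the `list` column of the CTE.
lines = {name: ", ".join(urls) for name, urls in url_lists.items()}
print(lines)
```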
You are asking the wrong sort of question. This is about normalization.
As it stands, you have a redundancy, where each postname and CategoryURL is represented by an ID field.
For whatever reason, the tables separated CategoryImageUrl into its own table and linked this to each set of postname and categoryURL.
If the relation is actually one id to each postname, then you can denormalize the table by adding the column CategoryImageUrl to your first table.
Postid, postname, CategoryURL, CategoryImageUrl
Or if you wish to keep the normalization, combine like fields into their own table like so:
--TableA:
Postid, postname, <any other field dependent on postname >
--TableB:
Postid, CategoryURL, CategoryImageUrl
This groups CategoryURL together, but at the cost of redundancy: the same CategoryURL is stored multiple times. However, each Postid has only one CategoryURL.
To remove this redundancy in our table, we could use a Star Schema strategy like this:
-- Post Table
Postid, postname
-- Category table
CategoryID, CategoryURL, <any other info dependent only on CategoryURL>
-- Fact Table
Postid, CategoryID, CategoryImageURL
DISCLAIMER: Naturally I assumed aspects of your data and might be off. However, the strategy of normalization is still the same.
Also, remember that SQL is relational and deals with sets of data. Inheritance is incompatible with relational set theory. Every table can be queried forwards and backwards much the way every page and chapter in a book is treated as part of the book. At no point would we see a chapter independent of a book.
Let's say I create two tables using the following SQL,
such that post has many comment:
CREATE TABLE IF NOT EXISTS post (
id SERIAL PRIMARY KEY,
title VARCHAR NOT NULL,
text VARCHAR NOT NULL
);
CREATE TABLE IF NOT EXISTS comment (
id SERIAL PRIMARY KEY,
text VARCHAR NOT NULL,
post_id SERIAL REFERENCES post (id)
);
I would like to be able to query these tables so as to serve a response that
looks like this:
{
"post" : [
{ id: 100,
title: "foo",
text: "foo foo",
comment: [1000,1001,1002] },
{ id: 101,
title: "bar",
text: "bar bar",
comment: [1003] }
],
"comment": [
{ id: 1000,
text: "bla blah foo",
post: 100 },
{ id: 1001,
text: "bla foo foo",
post: 100 },
{ id: 1002,
text: "foo foo foo",
post: 100 },
{ id: 1003,
text: "bla blah bar",
post: 101 },
]
}
Doing this naively would involve two SELECT statements,
the first along the lines of
SELECT DISTINCT ON (post.id) post.id, post.title, post.text, comment.id
FROM post, comment
WHERE post.id = comment.post_id
... and the second something along the lines of
SELECT DISTINCT ON (comment.id) comment.id, comment.text, post.id
FROM post, comment
WHERE post.id = comment.post_id
However, I cannot help but think that there is a way to do this involving
only one SELECT statement - is this possible?
Notes:
I am using Postgres, but I do not require a Postgres-specific solution. Any standard SQL solution should do.
The queries above are illustrative only; they do not give me exactly what is necessary at the moment.
It looks like what the naive solution here does is perform the same join on the same two tables, just doing a distinct on a different table each time. This definitely leaves room for improvement.
It appears that ActiveModel Serializers in Rails already do this. If someone familiar with them would like to chime in on how they work under the hood, that would be great.
You need two queries to get the form you laid out:
SELECT p.id, p.title, p.text, array_agg(c.id) AS comments
FROM post p
JOIN comment c ON c.post_id = p.id
WHERE p.id = ???
GROUP BY p.id;
Or faster, if you really want to retrieve all or most of your posts:
SELECT p.id, p.title, p.text, c.comments
FROM post p
JOIN (
SELECT post_id, array_agg(id) AS comments
FROM comment
GROUP BY 1
) c ON c.post_id = p.id;
Plus:
SELECT id, text, post_id
FROM comment
WHERE post_id = ??;
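Combining the two result sets into the response from the question is then a small step in application code. Here is a sketch of that step using Python's sqlite3 module as a stand-in for Postgres; the sample rows are the ones from the question, and since SQLite has no array_agg, the comment id arrays are derived from the comment rows on the client in this sketch.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL, text TEXT NOT NULL);
    CREATE TABLE comment (id INTEGER PRIMARY KEY, text TEXT NOT NULL,
                          post_id INTEGER REFERENCES post (id));
    INSERT INTO post VALUES (100, 'foo', 'foo foo'), (101, 'bar', 'bar bar');
    INSERT INTO comment VALUES
        (1000, 'bla blah foo', 100), (1001, 'bla foo foo', 100),
        (1002, 'foo foo foo', 100), (1003, 'bla blah bar', 101);
""")

# Query 1: posts. Query 2: comments. The id arrays are derived from query 2.
posts = [
    {"id": pid, "title": title, "text": text, "comment": []}
    for pid, title, text in conn.execute("SELECT id, title, text FROM post ORDER BY id")
]
comments = [
    {"id": cid, "text": text, "post": post_id}
    for cid, text, post_id in conn.execute(
        "SELECT id, text, post_id FROM comment ORDER BY id"
    )
]

posts_by_id = {p["id"]: p for p in posts}
for c in comments:
    posts_by_id[c["post"]]["comment"].append(c["id"])

response = {"post": posts, "comment": comments}
print(json.dumps(response, indent=2))
```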
Single query
SQL can only send one result type per query. For a single query, you would have to combine both tables, listing columns for post redundantly. That conflicts with the desired response in your question. You have to give up one of the two conflicting requirements.
SELECT p.id, p.title, p.text AS p_text, c.id, c.text AS c_text
FROM post p
JOIN comment c ON c.post_id = p.id
WHERE p.id = ???
Aside: The column comment.post_id should be integer, not serial! Also, column names are probably just for a quick show case. You wouldn't use the non-descriptive text as column name, which also conflicts with a basic data type.
Compare this related case:
Foreign key of serial type - ensure always populated manually
However, I cannot help but think that there is a way to do this involving only one SELECT statement - is this possible?
Technically: yes. If you really want your data in json anyway, you could use PostgreSQL (9.2+) to generate it with the json functions, like:
SELECT row_to_json(sq)
FROM (
SELECT array_to_json(ARRAY(
SELECT row_to_json(p)
FROM (
SELECT *, ARRAY(SELECT id FROM comment WHERE post_id = post.id) AS comment
FROM post
) AS p
)) AS post,
array_to_json(ARRAY(
SELECT row_to_json(comment)
FROM comment
)) AS comment
) sq;
But I'm not sure it's worth it -- usually not a good idea to dump all your data without limit / pagination.
I saw this question on meta: https://meta.stackexchange.com/questions/33101/how-does-so-query-comments
I wanted to set the record straight and ask the question in a proper technical way.
Say I have 2 tables:
Posts
id
content
parent_id (null for questions, question_id for answer)
Comments
id
body
is_deleted
post_id
upvotes
date
Note: I think this is how the schema for SO is setup, answers have a parent_id which is the question, questions have null there. Questions and answers are stored in the same table.
How do I pull out comments stackoverflow style in a very efficient way with minimal round trips?
The rules:
A single query should pull out all the comments needed for a page with multiple posts to render
Needs to pull out only 5 comments per answer, with preference for upvoted ones
Needs to provide enough information to inform the user there are more comments beyond the 5 that are there. (and the actual count - eg. 2 more comments)
Sorting is really hairy for comments, as you can see on the comments in this question. The rules are, display comments by date, HOWEVER if a comment has an upvote it is to get preferential treatment and be displayed as well at the bottom of the list. (this is nasty hard to express in sql)
If any denormalizations make stuff better what are they? What indexes are critical?
I wouldn't bother to filter the comments using SQL (which may surprise you because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.
It's actually pretty infrequent that there are more than five comments for a given post, so that they need to be filtered. In StackOverflow's October data dump, 78% of posts have zero or one comment, and 97% have five or fewer comments. Only 20 posts have >= 50 comments, and only two posts have over 100 comments.
So writing complex SQL to do that kind of filtering would increase complexity when querying all posts. I'm all for using clever SQL when appropriate, but this would be penny-wise and pound-foolish.
You could do it this way:
SELECT q.PostId, a.PostId, c.CommentId
FROM Posts q
LEFT OUTER JOIN Posts a
ON (a.ParentId = q.PostId)
LEFT OUTER JOIN Comments c
ON (c.PostId IN (q.PostId, a.PostId))
WHERE q.PostId = 1234
ORDER BY q.PostId, a.PostId, c.CommentId;
But this gives you redundant copies of q and a columns, which is significant because those columns include text blobs. The extra cost of copying redundant text from the RDBMS to the app becomes significant.
So it's probably better to do this in two queries: fetch the posts on their own, then, given that the client is viewing a Question with PostId = 1234, fetch the comments with the following:
SELECT c.PostId, c.Text
FROM Comments c
JOIN (SELECT 1234 AS PostId UNION ALL
SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
ON (c.PostId = p.PostId);
And then sort through them in application code, collecting them by the referenced post and filtering out extra comments beyond the five most interesting ones per post.
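The application-side step might look roughly like the sketch below. The dict keys (post_id, comment_id, upvotes, date), the function name pick_comments, and the reading of "most interesting" as "most upvotes, then newest" are all assumptions for the sake of the example, not how StackOverflow actually does it.

```python
from collections import defaultdict


def pick_comments(comments, limit=5):
    """Group comments by post and keep `limit` per post.

    `comments` is an iterable of dicts with the keys post_id, comment_id,
    upvotes and date (hypothetical names). Returns
    {post_id: (shown_comments, hidden_count)}.
    """
    by_post = defaultdict(list)
    for c in comments:
        by_post[c["post_id"]].append(c)

    result = {}
    for post_id, items in by_post.items():
        # Assumed reading of "most interesting": most upvotes first, then newest.
        ranked = sorted(items, key=lambda c: (c["upvotes"], c["date"]), reverse=True)
        shown = sorted(ranked[:limit], key=lambda c: c["date"])  # display by date
        result[post_id] = (shown, max(0, len(items) - limit))
    return result


# Seven made-up comments on post 1234.
sample = [
    {"post_id": 1234, "comment_id": i, "upvotes": i % 3, "date": f"2009-10-{i:02d}"}
    for i in range(1, 8)
]
picked = pick_comments(sample)
print(picked[1234][1], "more comments")
```

The hidden count is what lets the page say "2 more comments" without a second query.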
I tested these two queries against a MySQL 5.1 database loaded with StackOverflow's data dump from October. The first query takes about 50 seconds. The second query is pretty much instantaneous (after I pre-cached indexes for the Posts and Comments tables).
The bottom line is that insisting on fetching all the data you need using a single SQL query is an artificial requirement (probably based on a misconception that the round-trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is a less efficient solution. Do you try to write all your application code in a single function? :-)
The real question is not the query but the schema, especially the clustered indexes. The comment ordering requirements are ambiguous as you defined them (is it only 5 per answer or not?). I interpreted the requirements as 'pull 5 comments per post (answer or question) and give preference to upvoted ones, then to newer ones'. I know this is not how SO comments are shown, but you have to define your requirements more precisely.
Here is my query:
declare @postId int;
set @postId = ?;
with cteQuestionAndReponses as (
select post_id
from Posts
where post_id = @postId
union all
select post_id
from Posts
where parent_id = @postId)
select * from
cteQuestionAndReponses p
outer apply (
select count(*) as CommentsCount
from Comments c
where is_deleted = 0
and c.post_id = p.post_id) as cc
outer apply (
select top(5) *
from Comments c
where is_deleted = 0
and p.post_id = c.post_id
order by upvotes desc, date desc
) as c
I have some 14k posts and 67k comments in my test tables, the query gets the posts in 7ms:
Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 7 ms.
Here is the schema I tested with:
create table Posts (
post_id int identity (1,1) not null
, content varchar(max) not null
, parent_id int null -- (null for questions, question_id for answer)
, constraint fkPostsParent_id
foreign key (parent_id)
references Posts(post_id)
, constraint pkPostsId primary key nonclustered (post_id)
);
create clustered index cdxPosts on
Posts(parent_id, post_id);
go
create table Comments (
comment_id int identity(1,1) not null
, body varchar(max) not null
, is_deleted bit not null default 0
, post_id int not null
, upvotes int not null default 0
, date datetime not null default getutcdate()
, constraint pkComments primary key nonclustered (comment_id)
, constraint fkCommentsPostId
foreign key (post_id)
references Posts(post_id)
);
create clustered index cdxComments on
Comments (is_deleted, post_id, upvotes, date, comment_id);
go
and here is my test data generation:
insert into Posts (content)
select 'Lorem Ipsum'
from master..spt_values;
insert into Posts (content, parent_id)
select 'Ipsum Lorem', post_id
from Posts p
cross apply (
-- abs() keeps the TOP argument non-negative; checksum() can return negative values
select top(abs(checksum(newid(), p.post_id)) % 10) Number
from master..spt_values) as r
where parent_id is NULL;
insert into Comments (body, is_deleted, post_id, upvotes, date)
select 'Sit Amet'
-- about 5% deleted comments
, case when abs(checksum(newid(), p.post_id, r.Number)) % 100 >= 95 then 1 else 0 end
, p.post_id
-- 0 to 9 upvotes
, abs(checksum(newid(), p.post_id, r.Number)) % 10
-- comments up to 1 year old
, dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate())
from Posts p
cross apply (
select top(abs(checksum(newid(), p.post_id)) % 10) Number
from master..spt_values) as r
Use:
WITH post_hierarchy AS (
SELECT p.id,
p.content,
p.parent_id,
1 AS post_level
FROM POSTS p
WHERE p.parent_id IS NULL
UNION ALL
SELECT p.id,
p.content,
p.parent_id,
ph.post_level + 1 AS post_level
FROM POSTS p
JOIN post_hierarchy ph ON ph.id = p.parent_id)
SELECT ph.id,
ph.post_level,
c.upvotes,
c.body
FROM COMMENTS c
JOIN post_hierarchy ph ON ph.id = c.post_id
ORDER BY ph.post_level, c.date
Couple of things to be aware of:
StackOverflow displays the first 5 comments, doesn't matter if they were upvoted or not. Subsequent comments that were upvoted are immediately displayed
You can't accommodate a limit of 5 comments per post without devoting a SELECT to each post. Adding TOP 5 to what I posted will only return the first five rows based on the ORDER BY statement