How to load 2 related datasets together (i.e. posts and comments)? - sql

I'm fairly new to Postgres and trying to figure out the best approach to loading a set of posts and their associated comments together.
For example:
I'm trying to fetch 10 posts and the comments associated with all of those posts, like Facebook's wall, where you see a feed of posts and comments loaded on the same page. My schema looks something like this:
Posts
--------
id - author - description - date - commentCount
Comments
-------
id - post_id - author - description - date
I tried to load both posts and comments in the same Postgres function, doing the following:
select *
from posts
LEFT join comments on posts.id = comments.post_id
Unfortunately, it duplicates each post N times, where N is the number of comments the post has. As a first solution, I could always filter the duplicates out in Node after fetching the data.
Also, when I try to use GROUP BY posts.id (to make it easier to traverse in Node), I get the following error:
column "comments.id" must appear in the GROUP BY clause or be used in an aggregate function
The second thing I can try is to send an array of the post_ids I want to load and have a Postgres function load and send them back, but I can't quite get the query right:
CREATE OR REPLACE FUNCTION "getPosts"(postIds int[])
RETURNS text AS
$BODY$
BEGIN
RETURN (
SELECT *
FROM Comments
WHERE Comments.id = postIds[0]
);
END;$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
to call it:
SELECT n FROM "public"."getPosts"(array[38]) As n;
However, even when trying to get the value at a single index, I get the following error:
ERROR: subquery must return only one column
LINE 1: SELECT (
^
QUERY: SELECT (
SELECT *
FROM Comments
WHERE Comments.id = 38
)
Finally, the last solution is to simply make N separate calls to Postgres, where N is the number of posts with comments; so if I have 5 posts with comments, I make 5 calls to Postgres, each with a post_id, selecting from the Comments table.
I'm really not sure what to do here, any help would be appreciated.
Thanks
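A note on the function attempt: Postgres arrays are 1-based, matching a column against an array is written post_id = ANY(postIds), and a function returning whole rows needs RETURNS SETOF comments or RETURNS TABLE (…) rather than RETURNS text (hence the "subquery must return only one column" error). The fetch-by-id-list shape itself can be sketched portably with a parameterized IN list; a minimal illustration using SQLite from Python, with an invented table and data:

```python
import sqlite3

# Invented demo table standing in for the question's Comments table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, description TEXT);
INSERT INTO comments VALUES (1, 38, 'first'), (2, 38, 'second'), (3, 99, 'other');
""")

# Build one placeholder per id, then bind the whole list at once.
post_ids = [38]
placeholders = ",".join("?" for _ in post_ids)
rows = conn.execute(
    f"SELECT id, post_id, description FROM comments WHERE post_id IN ({placeholders})",
    post_ids,
).fetchall()
print(rows)  # [(1, 38, 'first'), (2, 38, 'second')]
```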

To have all comments as an array of records for each post:
select
p.id, p.title, p.content, p.author,
array_agg(c) as comments
from
posts p
left join
comments c on p.id = c.post_id
group by 1, 2, 3, 4
Or one array for each comment column:
select
p.id, p.title, p.content, p.author,
array_agg(c.author) as comment_author,
array_agg(c.content) as comment_content
from
posts p
left join
comments c on p.id = c.post_id
group by 1, 2, 3, 4
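Since array_agg is Postgres-specific, here is a minimal runnable sketch of the same one-row-per-post shape using SQLite's group_concat from Python (schema and data invented for the demo). Note that with a LEFT JOIN, a post without comments aggregates to NULL here; in Postgres, array_agg would instead yield an array containing a single NULL, which you may want to filter out:

```python
import sqlite3

# Invented schema, loosely based on the question's Posts/Comments tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, description TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, author TEXT, description TEXT);
INSERT INTO posts VALUES (1, 'alice', 'first post'), (2, 'bob', 'second post');
INSERT INTO comments VALUES (1, 1, 'bob', 'nice'), (2, 1, 'carol', 'agreed');
""")

# One row per post; its comments collapsed into a single aggregated column.
rows = conn.execute("""
SELECT p.id, p.author, group_concat(c.description, '||') AS comments
FROM posts p
LEFT JOIN comments c ON p.id = c.post_id
GROUP BY p.id, p.author
ORDER BY p.id
""").fetchall()
print(rows)
```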

Related

Query with a sub query that requires multiple values

I can't really think of a title so let me explain the problem:
Problem: I want to return an array of Posts with each Post containing a Like Count. The Like Count is for a specific post but for all users who have liked it
For example:
const posts = [
{
post_id: 1,
like_count: 100
},
{
post_id: 2,
like_count: 50
}
]
Now with my current solution, I don't think it's possible but here is what I have so far.
My query currently looks like this (produced by TypeORM):
SELECT
"p"."uid" AS "p_uid",
"p"."created_at" AS "post_created_at",
"l"."uid" AS "like_uid",
"l"."post_liked" AS "post_liked",
"ph"."path" AS "path",
"ph"."title" AS "photo_title",
"u"."name" AS "post_author",
(
SELECT
COUNT(like_id) AS "like_count"
FROM
"likes" "l"
INNER JOIN
"posts" "p"
ON "p"."post_id" = "l"."post_id"
WHERE
"l"."post_liked" = true
AND l.post_id = $1
)
AS "like_count"
FROM
"posts" "p"
LEFT JOIN
"likes" "l"
ON "l"."post_id" = "p"."post_id"
INNER JOIN
"photos" "ph"
ON "ph"."photo_id" = "p"."photo_id"
INNER JOIN
"users" "u"
ON "u"."user_id" = "p"."user_id"
$1 is where each post's post_id should go (for the sake of testing I stuck the first post's id in there), assuming I have an array of post_ids ready to substitute.
My TypeORM query looks like this
async findAll(): Promise<Post[]> {
return await getRepository(Post)
.createQueryBuilder('p')
.select(['p.uid'])
.addSelect(subQuery =>
subQuery
.select('COUNT(like_id)', 'like_count')
.from(Like, 'l')
.innerJoin('l.post', 'p')
.where('l.post_liked = true AND l.post_id = :post_id', {post_id: 'a16f0c3e-5aa0-4cf8-82da-dfe27d3f991a'}), 'like_count'
)
.addSelect('p.created_at', 'post_created_at')
.addSelect('u.name', 'post_author')
.addSelect('l.uid', 'like_uid')
.addSelect('l.post_liked', 'post_liked')
.addSelect('ph.title', 'photo_title')
.addSelect('ph.path', 'path')
.leftJoin('p.likes', 'l')
.innerJoin('p.photo', 'ph')
.innerJoin('p.user', 'u')
.getRawMany()
}
Why am I doing this? What I am trying to avoid is calling count for every single post on my page to return the number of likes for each post. I thought I could somehow do this in a subquery but now I am not sure if it's possible.
Can someone suggest a more efficient way of doing something like this? Or is this approach completely wrong?
I find working with ORMs terrible and cannot help you with this. But the query itself has flaws:
You want one row per post, but you are joining likes, thus getting one row per post and like.
Your subquery is not related to your main query. It should instead relate to the main query's post.
The corrected query:
SELECT
p.uid,
p.created_at,
ph.path AS photo_path,
ph.title AS photo_title,
u.name AS post_author,
(
SELECT COUNT(*)
FROM likes l
WHERE l.post_id = p.post_id
AND l.post_liked = true
) AS like_count
FROM posts p
JOIN photos ph ON ph.photo_id = p.photo_id
JOIN users u ON u.user_id = p.user_id
ORDER BY p.uid;
I suppose it's quite easy for you to convert this to TypeORM. There is nothing wrong with counting for every single post, by the way. It is even necessary to get the result you are after.
The subquery could also be moved to the FROM clause using GROUP BY l.post_id within. As is, you are getting all posts, regardless of them having likes or not. By moving the subquery to the FROM clause, you could instead decide between INNER JOIN and LEFT OUTER JOIN.
The query would benefit from the following index:
CREATE INDEX idx ON likes (post_id, post_liked);
Add this index if the query seems too slow.
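The corrected correlated-subquery shape is easy to check on any database; a minimal sketch using SQLite from Python (schema and data invented; SQLite has no boolean type, so post_liked is stored as 0/1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE likes (like_id INTEGER PRIMARY KEY, post_id INTEGER, post_liked INTEGER);
INSERT INTO posts VALUES (1, 'first'), (2, 'second');
INSERT INTO likes VALUES (1, 1, 1), (2, 1, 1), (3, 1, 0), (4, 2, 1);
""")

# The subquery is correlated on p.post_id, so each post gets its own count
# and the outer query still returns exactly one row per post.
rows = conn.execute("""
SELECT p.post_id, p.title,
       (SELECT COUNT(*)
        FROM likes l
        WHERE l.post_id = p.post_id AND l.post_liked = 1) AS like_count
FROM posts p
ORDER BY p.post_id
""").fetchall()
print(rows)  # [(1, 'first', 2), (2, 'second', 1)]
```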

BigQuery filtering selftext that contains a word in all possible posts in a subreddit

I'm trying to get posts and their comments about Asperger in the AskDocs subreddit. This SQL works fine to get the posts:
#standardSQL
SELECT
TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH, 'America/New_York') AS date_agg, title,selftext
FROM
`fh-bigquery.reddit_posts.*`
WHERE
(_TABLE_SUFFIX BETWEEN "2016_01" AND "2019_03" OR _TABLE_SUFFIX = 'full_corpus_201512')
AND subreddit = 'AskDocs'
AND REGEXP_CONTAINS(selftext, r'Asperger')
ORDER BY
date_agg
But I'm not sure whether this gets all the posts that are available; I got 169 rows, but I'm trying to get as many as possible from AskDocs about this subject.
My second question: I'm trying to link each post with its comments, and I found this here on SO:
#standardSQL
SELECT posts.title, comments.body
FROM `fh-bigquery.reddit_comments.2016_01` AS comments
JOIN `fh-bigquery.reddit_posts.2016_01` AS posts
ON posts.id = SUBSTR(comments.link_id, 4)
WHERE posts.id = '43go1r'
But when I try to merge my code into it, I end up with a real mess.
For the first query, you are obtaining 169 rows because the regex uses a capital A, so only selftext containing the word with a capital A is matched: Asperger, Asperger's, Aspergers, etc. Posts containing the lowercase forms asperger, asperger's, aspergers are not matched. To also match the lowercase forms, use [aA] in the regex, which returns 241 rows:
AND REGEXP_CONTAINS(posts.selftext, r'[aA]sperger')
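The effect of the character class is easy to verify outside BigQuery; a small sketch with Python's re module (sample strings invented):

```python
import re

# Invented sample strings covering both capitalizations.
texts = ["Asperger's syndrome", "diagnosed with aspergers", "unrelated text"]

# [aA] matches either case of the first letter only; the rest stays literal.
pattern = re.compile(r"[aA]sperger")
matches = [t for t in texts if pattern.search(t)]
print(matches)  # ["Asperger's syndrome", 'diagnosed with aspergers']
```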
To join the tables, you can use the following query:
WITH
comments AS (
SELECT
link_id,
body
FROM
`fh-bigquery.reddit_comments.201*`
WHERE
_TABLE_SUFFIX BETWEEN "6_01"
AND "9_03"
AND subreddit = 'AskDocs' ),
posts AS (
SELECT
TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH, 'America/New_York') AS date_agg,
id,
selftext,
title
FROM
`fh-bigquery.reddit_posts.*`
WHERE
(_TABLE_SUFFIX BETWEEN "2016_01"
AND "2019_03"
OR _TABLE_SUFFIX = 'full_corpus_201512')
AND subreddit = 'AskDocs'
AND REGEXP_CONTAINS(selftext, r'[aA]sperger') )
SELECT
posts.date_agg AS Date,
posts.title AS Post,
posts.selftext AS Text,
comments.body AS Comment
FROM
comments
JOIN
posts
ON
posts.id = SUBSTR(comments.link_id, 4)
ORDER BY
Date,
Post
Note: I used different wildcards, since the tables are not named the same way in both datasets, to filter partitions and optimize query computation.

Using LATERAL joins in Ecto v2.0

I'm trying to join the latest comment on a post record, like so:
comment = from c in Comment, order_by: [desc: c.inserted_at], limit: 1
post = Repo.all(
from p in Post,
where: p.id == 123,
join: c in subquery(comment), on: c.post_id == p.id,
select: [p.title, c.body],
limit: 1
)
Which generates this SQL:
SELECT p0."title",
c1."body"
FROM "posts" AS p0
INNER JOIN (SELECT p0."id",
p0."body",
p0."inserted_at",
p0."updated_at"
FROM "comments" AS p0
ORDER BY p0."inserted_at" DESC
LIMIT 1) AS c1
ON c1."post_id" = p0."id"
WHERE ( p0."id" = 123 )
LIMIT 1
It just returns nil. If I remove the on: c.post_id == p.id it'll return data, but obviously it'll return the latest comment across all posts, not for the post in question.
What am I doing wrong? A fix could be to use a LATERAL join subquery, but I can't figure out whether it's possible to pass the p reference into a subquery.
Thanks!
The issue was caused by the limit: 1 here:
comment = from c in Comment, order_by: [desc: c.inserted_at], limit: 1
Since the resulting query was SELECT * FROM "comments" AS p0 ORDER BY p0."inserted_at" DESC LIMIT 1, it was only returning the most recent comment on ANY post, not the post I was querying against.
FYI the query was >150ms with ~200,000 comment rows, but that was brought down to ~12ms with a simple index:
create index(:comments, ["(inserted_at::date) DESC"])
It's worth noting that while this query works in returning the post in question and only the most recent comment, it'll actually return $number_of_comments rows if you remove the limit: 1. So say if you wanted to retrieve all 100 posts in your database with the most recent comment of each, and you had 200,000 comments in the database, this query would return 200,000 rows. Instead you should use a LATERAL join as discussed below.
Update
Unfortunately ecto doesn't support LATERAL joins right now.
An ecto fragment would work great here, however the join query wraps the fragment in additional parentheses (i.e. INNER JOIN (LATERAL (SELECT …))), which isn't valid SQL, so you'd have to use raw SQL for now:
sql = """
SELECT p."title",
c."body"
FROM "posts" AS p
INNER JOIN LATERAL (SELECT c."id",
c."body",
c."inserted_at"
FROM "comments" AS c
WHERE ( c."post_id" = p."id" )
ORDER BY c."inserted_at" DESC
LIMIT 1) AS c
ON true
WHERE ( p."id" = 123 )
LIMIT 1
"""
res = Ecto.Adapters.SQL.query!(Repo, sql, [])
This query returns in <1ms on the same database.
Note this doesn't return your Ecto model struct, just the raw response from Postgrex.
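If you only need "latest comment per post" and your database supports window functions, ROW_NUMBER() gives the same result as the LATERAL join; a minimal sketch using SQLite (≥ 3.25) from Python, with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT, inserted_at TEXT);
INSERT INTO posts VALUES (1, 'hello'), (2, 'world');
INSERT INTO comments VALUES
  (1, 1, 'old comment', '2020-01-01'),
  (2, 1, 'new comment', '2020-06-01'),
  (3, 2, 'only comment', '2020-03-01');
""")

# Rank each post's comments by recency, then keep only rank 1 per post.
rows = conn.execute("""
SELECT title, body FROM (
  SELECT p.title, c.body,
         ROW_NUMBER() OVER (PARTITION BY c.post_id
                            ORDER BY c.inserted_at DESC) AS rn
  FROM posts p
  JOIN comments c ON c.post_id = p.id
) WHERE rn = 1
ORDER BY title
""").fetchall()
print(rows)  # [('hello', 'new comment'), ('world', 'only comment')]
```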

How to use count correctly in sql?

I have two tables, 'matches' and 'forum'. I need to get information for matches from the matches table that have comments in the forum table, so I use the following query:
SELECT distinct forum.match_static_id, matches.*
from forum
INNER JOIN matches
ON forum.match_static_id = matches.static_id
WHERE forum.comments_yes_or_no = 1
I use distinct to avoid getting the same match twice if it has more than one comment in the forum table.
The problem is that I want to get the count of each match's comments with the same query. Is it possible? I use:
SELECT distinct forum.match_static_id, count(forum.comments), matches.*
from forum
INNER JOIN matches
ON forum.match_static_id = matches.static_id
WHERE forum.comments_yes_or_no = 1
but it gives me just one record (which is wrong). What is the problem? Do I need to use GROUP BY? And if yes, where do I put it in this crowded query?
Please try this:
SELECT forum.match_static_id, COUNT(*) AS comment_count, matches.*
FROM forum
INNER JOIN matches
ON forum.match_static_id = matches.static_id
WHERE forum.comments_yes_or_no = 1
GROUP BY forum.match_static_id
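A minimal runnable check of the grouped-count shape (SQLite from Python, with invented data reusing the question's column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matches (static_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE forum (id INTEGER PRIMARY KEY, match_static_id INTEGER,
                    comments TEXT, comments_yes_or_no INTEGER);
INSERT INTO matches VALUES (10, 'derby'), (20, 'final');
INSERT INTO forum VALUES
  (1, 10, 'wow', 1), (2, 10, 'great', 1), (3, 20, 'meh', 1), (4, 20, 'spam', 0);
""")

# GROUP BY collapses the joined rows back to one row per match,
# and COUNT(*) counts the forum rows that survived the WHERE filter.
rows = conn.execute("""
SELECT m.static_id, m.name, COUNT(*) AS comment_count
FROM forum f
JOIN matches m ON f.match_static_id = m.static_id
WHERE f.comments_yes_or_no = 1
GROUP BY m.static_id, m.name
ORDER BY m.static_id
""").fetchall()
print(rows)  # [(10, 'derby', 2), (20, 'final', 1)]
```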

Calculate number of fields

I have three tables:
Article(idArticle,NameArt)
Tag(idTag, NameTag)
ArtiTag(idArticle,idTag)
I want to have a result like this: NameTag,Count(Articles that belongs to that tag)
I tried the following:
SELECT Tag.NameTag , COUNT(DISTINCT(idArticle))
FROM ArtiTag, Tag
but it always returns only one row, even though I have many tags and many related articles.
SELECT t.NameTag, COUNT(*)
FROM ArtiTag at
INNER JOIN Tag t
ON at.idTag = t.idTag
GROUP BY t.NameTag;
SELECT t.idTag, MAX(t.NameTag) AS NameTag, COUNT(ArtiTag.idArticle)
FROM Tag t
LEFT JOIN ArtiTag ON t.idTag = ArtiTag.idTag
GROUP BY t.idTag
This query outputs all tags, including tags with 0 articles.
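A quick check that the LEFT JOIN version keeps zero-article tags, since COUNT(column) skips NULLs from the unmatched side (SQLite from Python, invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tag (idTag INTEGER PRIMARY KEY, NameTag TEXT);
CREATE TABLE ArtiTag (idArticle INTEGER, idTag INTEGER);
INSERT INTO Tag VALUES (1, 'sql'), (2, 'python'), (3, 'unused');
INSERT INTO ArtiTag VALUES (100, 1), (101, 1), (100, 2);
""")

# COUNT(a.idArticle) counts only non-NULL values, so a tag with no
# matching ArtiTag rows still appears, with a count of 0.
rows = conn.execute("""
SELECT t.NameTag, COUNT(a.idArticle) AS n
FROM Tag t
LEFT JOIN ArtiTag a ON t.idTag = a.idTag
GROUP BY t.idTag, t.NameTag
ORDER BY t.idTag
""").fetchall()
print(rows)  # [('sql', 2), ('python', 1), ('unused', 0)]
```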