How can I optimize PostgreSQL ARRAY_AGG queries for large tables?


I am using PostgreSQL for its array functionality. Here's my schema:
CREATE TABLE questions (
    id INTEGER PRIMARY KEY,
    product_id INTEGER UNIQUE NOT NULL,
    body VARCHAR(1000) NOT NULL,
    date_written DATE NOT NULL DEFAULT current_date,
    asker_name VARCHAR(60) NOT NULL,
    asker_email VARCHAR(60) NOT NULL,
    reported BOOLEAN DEFAULT FALSE,
    helpful INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE answers (
    id INTEGER PRIMARY KEY NOT NULL,
    question_id INTEGER NOT NULL,
    body VARCHAR(1000) NOT NULL,
    date_written DATE NOT NULL DEFAULT current_date,
    answerer_name VARCHAR(60) NOT NULL,
    answerer_email VARCHAR(60) NOT NULL,
    reported BOOLEAN DEFAULT FALSE,
    helpful INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE photos (
    id INTEGER UNIQUE,
    answer_id INTEGER NOT NULL,
    photo VARCHAR(200)
);
I am trying to query my answers table to get a list of all the answers for a given question id and include an array of all photos that exist for that given answer_id. The results should be sorted in descending order of helpfulness. So far, I have a massive query that displays the results I'm looking for, but the execution time is 729.595 ms. I am trying to optimize to get the query's time down to 200 ms. I have the following indexes to try and optimize my query times:
indexname | indexdef
-----------------+---------------------------------------------------------------------------
answer_id | CREATE UNIQUE INDEX answer_id ON public.answers USING btree (id)
question_id | CREATE INDEX question_id ON public.answers USING btree (question_id)
idx_reported_id | CREATE INDEX idx_reported_id ON public.answers USING btree (reported, id)
answers_pkey | CREATE UNIQUE INDEX answers_pkey ON public.answers USING btree (id)
indexname | indexdef
----------------+----------------------------------------------------------------------------
id | CREATE UNIQUE INDEX id ON public.questions USING btree (id)
idx_q_reported | CREATE INDEX idx_q_reported ON public.questions USING btree (id, reported)
questions_pkey | CREATE UNIQUE INDEX questions_pkey ON public.questions USING btree (id)
indexname | indexdef
---------------+---------------------------------------------------------------------
photos_id_key | CREATE UNIQUE INDEX photos_id_key ON public.photos USING btree (id)
p_links | CREATE INDEX p_links ON public.photos USING btree (photo)
In my analysis, I noticed that the GroupAggregate is time-consuming:
GroupAggregate (cost=126222.21..126222.71 rows=25 width=129) (actual time=729.497..729.506 rows=5 loops=1)
    Group Key: answers.id
Is there a way I can avoid the time-consuming GROUP BY? Am I missing something with the indexes? Here is the query itself:
SELECT answers.id,
       question_id,
       body,
       date_written,
       answerer_name,
       answerer_email,
       reported,
       helpful,
       ARRAY_AGG(photo) AS photos
FROM answers
LEFT JOIN photos ON answers.id = photos.answer_id
WHERE reported IS false
  AND answers.id IN (SELECT id
                     FROM answers
                     WHERE question_id = 20012)
GROUP BY answers.id
ORDER BY helpful DESC;
Thanks!

I think you can skip the subquery:
SELECT answers.id, question_id, body, date_written, answerer_name, answerer_email, reported, helpful, ARRAY_AGG(photo) as photos
FROM answers
LEFT JOIN photos ON answers.id = photos.answer_id
WHERE reported IS false AND question_id = 20012
GROUP BY answers.id, question_id, body, date_written, answerer_name, answerer_email, reported, helpful
ORDER BY helpful DESC;
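You can add a btree index on photos.answer_id because this field is used in the join clause.
You also left some fields out of the GROUP BY clause. PostgreSQL accepts that when grouping by the primary key (answers.id here), so listing them all is optional but harmless.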
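The index suggested above could look like the sketch below. The index name idx_photos_answer_id is made up for illustration; the table and column come from the question's schema.

```sql
-- Hypothetical index name; supports the join condition answers.id = photos.answer_id.
CREATE INDEX idx_photos_answer_id ON photos USING btree (answer_id);

-- Refresh planner statistics so the new index is costed with current data.
ANALYZE photos;
```

Re-running the query under EXPLAIN (ANALYZE) afterwards should show whether the planner actually picks the index for the join.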

One way that often works is to aggregate first, then join on the result (rather than aggregating the full join result). You don't really need the IN condition either:
SELECT a.id,
       a.question_id,
       a.body,
       a.date_written,
       a.answerer_name,
       a.answerer_email,
       a.reported,
       a.helpful,
       p.photos
FROM answers a
LEFT JOIN (
    SELECT answer_id, array_agg(photo) AS photos
    FROM photos
    GROUP BY answer_id
) p ON a.id = p.answer_id
WHERE a.reported IS false
  AND a.question_id = 20012
ORDER BY a.helpful DESC;
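A possible variation on the aggregate-first idea, not taken from the answers above: a LATERAL subquery aggregates photos only for the answers that survive the WHERE clause, instead of grouping the whole photos table. This is a sketch that assumes an index on photos.answer_id exists; whether it is faster depends on the data, so compare the plans.

```sql
SELECT a.id,
       a.question_id,
       a.body,
       a.date_written,
       a.answerer_name,
       a.answerer_email,
       a.reported,
       a.helpful,
       p.photos
FROM answers a
LEFT JOIN LATERAL (
    -- Runs once per qualifying answer row; always yields exactly one row.
    SELECT array_agg(ph.photo) AS photos
    FROM photos ph
    WHERE ph.answer_id = a.id
) p ON true
WHERE a.reported IS false
  AND a.question_id = 20012
ORDER BY a.helpful DESC;
```

For an answer without photos, this form returns NULL in the photos column rather than an array.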

Related

Get data from one table with nested relations

I am new to databases. I have a table topics with a foreign key master_topic_id that references the id column of the same table.
Schema:
CREATE TABLE public.topics (
    id bigserial NOT NULL,
    created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    published_at timestamp NULL,
    master_topic_id int8 NULL,
    CONSTRAINT t_pkey PRIMARY KEY (id),
    CONSTRAINT t_master_topic_id_fkey FOREIGN KEY (master_topic_id) REFERENCES topics(id)
);
I wrote a query: SELECT * FROM topics WHERE id = 10. But if this record has a master_topic_id, I need to get the data for that master_topic_id too.
I tried using a JOIN, but a join just concatenates the records, and I need the data for master_topic_id as a new row.
Any help?

I think you are describing:
select t.*
from topics t
where t.id = 10 or
      exists (select 1
              from topics t2
              where t2.master_topic_id = t.id and t2.id = 10
             );
However, you might just want:
where 10 in (id, master_topic_id)
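The two predicates above select different rows, which a small sketch with made-up data can show. The three sample rows are hypothetical: topic 10 has master topic 3, and topic 11 has master topic 10. The table is reduced to the two relevant columns.

```sql
-- Hypothetical, simplified copy of the topics table for illustration.
CREATE TABLE topics (
    id bigint PRIMARY KEY,
    master_topic_id bigint REFERENCES topics (id)
);
INSERT INTO topics (id, master_topic_id) VALUES (3, NULL), (10, 3), (11, 10);

-- EXISTS form: returns row 10 and its master, row 3.
SELECT t.*
FROM topics t
WHERE t.id = 10
   OR EXISTS (SELECT 1
              FROM topics t2
              WHERE t2.master_topic_id = t.id AND t2.id = 10);

-- IN form: returns row 10 and the rows whose master is 10, here row 11.
SELECT *
FROM topics
WHERE 10 IN (id, master_topic_id);
```

If the goal is the record plus its master as a separate row, as the question describes, the EXISTS form is the one that matches.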

Use OR in your WHERE condition:
SELECT *
FROM topics
WHERE id = 10
   OR master_topic_id = 10
You can use UNION ALL as well:
SELECT *
FROM topics
WHERE id = 10
UNION ALL
SELECT *
FROM topics
WHERE master_topic_id = 10

Eliminate subquery to improve query performance

I'd like to rewrite the following subquery as it's used over and over again in a larger query. The DBMS used is Postgres and the table has the following structure table (id uuid, seq int, value int).
Given a value for id (id_value), the query finds all records in "table" whose seq is less than the seq of the row with id = id_value.
My naive (slow) solution so far is the following:
select * from table
where seq < (select seq from table where id = id_value)
table
id, seq, value
a, 1, 12
b, 2, 22
c, 3, 32
x, 4, 43
d, 5, 54
s, 6, 32
a, 7, 54
e.g. a query
select * from table where seq < (select seq from table where id = 'x')
returns
a, 1, 12
b, 2, 22
c, 3, 32
For testing purposes, I've tried to hardcode the relevant seq field and it improves the whole query significantly, but I really don't like to query for seq as a two-stage process. Ideally this could happen as part of the query. Any ideas or inspiration would be appreciated.
CREATE TABLE foo
(
seq integer NOT NULL,
id uuid NOT NULL,
CONSTRAINT foo_pkey PRIMARY KEY (id),
CONSTRAINT foo_id_key UNIQUE (id),
CONSTRAINT foo_seq_key UNIQUE (seq)
);
CREATE UNIQUE INDEX idx_foo_id
ON public.foo USING btree
(id)
TABLESPACE pg_default;
CREATE UNIQUE INDEX idx_foo_seq
ON public.foo USING btree
(seq)
TABLESPACE pg_default;

You may have so many redundant indexes that you are confusing Postgres. Simply defining a column as primary key or unique is sufficient. You don't need multiple index declarations.
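A minimal sketch of the cleanup this implies, using the constraint and index names from the question's DDL. It assumes nothing else (such as a foreign key in another table) depends on the objects being dropped.

```sql
-- foo_pkey (PRIMARY KEY) already enforces uniqueness on id with its own index,
-- so the extra UNIQUE constraint and the hand-made index on id are redundant.
ALTER TABLE foo DROP CONSTRAINT foo_id_key;
DROP INDEX idx_foo_id;

-- foo_seq_key (UNIQUE) already provides a unique index on seq.
DROP INDEX idx_foo_seq;
```

After this, foo keeps exactly one index on id and one on seq, which is all the query below needs.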
For what you want to do, this should be optimal:
select f.*
from foo f
where f.seq < (select f2.seq from foo f2 where f2.id = :id_value)
This should use the index to fetch the seq value in the subquery. Then it should return the appropriate rows.
You could also try:
select f.*
from (select f.*, min(seq) filter (where id = :id_value) over () as min_seq
from foo f
) f
where seq < min_seq;
However, my suspicion is simply that the query is returning a large number of rows and that is affecting performance.

How to query two link tables in the same query?

I have two tables: a device can have one blacklisted_device. I would like to get the number of devices that belong to specific user_ids and, in the same request, the number of linked blacklisted_devices.
Here is the full SQL to try it:
CREATE TABLE device (
device_id serial PRIMARY KEY,
user_id integer,
updated_at timestamp default current_timestamp
);
CREATE TABLE blacklisted_device (
blacklisted_id serial PRIMARY KEY,
device_id integer,
updated_at timestamp default current_timestamp,
CONSTRAINT blacklisted_device_device_id_fkey FOREIGN KEY (device_id)
REFERENCES device (device_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
);
INSERT INTO device (user_id)
VALUES (1),(2),(2),(7),(88),(99),(102),(106);
INSERT INTO blacklisted_device (device_id)
VALUES (1),(2),(3),(4);
SELECT COUNT(*) AS total_device
FROM device
WHERE user_id IN (7,88,99);
SELECT COUNT(*) AS blacklisted
FROM blacklisted_device
WHERE device_id IN (SELECT device_id FROM device WHERE user_id IN (7,88,99));
As you can see, at the end I get the result I want, but in two requests. How can I get it in one request?
total_device: 3, blacklisted: 1
Feel free to comment on any of the SQL; I probably made a few mistakes.
Thanks

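You need a LEFT JOIN:
-- DISTINCT keeps total_device correct even if a device has several blacklist rows.
SELECT COUNT(DISTINCT d.device_id) AS total_device,
       COUNT(DISTINCT bd.device_id) AS blacklisted
FROM device d
LEFT JOIN blacklisted_device bd
       ON d.device_id = bd.device_id
WHERE d.user_id IN (7,88,99);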

returning array of rows from psql in window function

I am trying to return an array of names as a row in PostgreSQL so that I don't return duplicate entries of data. This is my current query:
SELECT DISTINCT
thread_account.*,
thread.*,
MAX(message.Created) OVER (PARTITION BY thread.id) as Last_Message_Date,
MAX(message.content) OVER (PARTITION BY thread.id) as Last_Message_Sent,
ARRAY_AGG((account.first_name, account.last_name)) OVER (PARTITION BY thread.id) as user
FROM thread_account
JOIN thread on thread.id = thread_account.thread
JOIN message on message.thread = thread_account.thread
JOIN account on account.id = message.account
WHERE thread_account.account = 299
ORDER BY MAX(message.Created) OVER (PARTITION BY thread.id) desc;
Any thoughts?
I would like to be able to do something like:
ARRAY_AGG(distinct (account.first_name, account.last_name))
    OVER (PARTITION BY thread.id) as user
but PostgreSQL doesn't allow DISTINCT inside a window function.
Here are the table definitions:
create table thread (
id bigserial primary key,
subject text not null,
created timestamp with time zone not null default current_timestamp
);
create table thread_account (
account bigint not null references account(id) on delete cascade,
thread bigint not null references thread(id) on delete cascade
);
create index thread_account_account on thread_account(account);
create index thread_account_thread on thread_account(thread);
create table message (
id bigserial primary key,
thread bigint not null references thread(id) on delete cascade,
content text not null,
account bigint not null references account(id) on delete cascade,
created timestamp with time zone not null default current_timestamp
);
create index message_account on message(account);
create index message_thread on message(thread);
create table account (
id bigint primary key,
first_name text,
last_name text,
email text
);

I don't know why you need the relation thread_account, because the involved accounts are already referenced through messages.
A possible query could be:
SELECT DISTINCT
Thread_id,
Thread_Subject,
Thread_Created,
ARRAY_AGG(Message_Account) OVER (PARTITION BY Thread_Id) AS Involed_Accounts,
Last_Message_Date,
Last_Message_Sent
FROM (
SELECT DISTINCT ON (thread.id, message.account)
thread.id AS Thread_Id,
thread.subject AS Thread_Subject,
thread.created AS Thread_Created,
message.account AS Message_Account,
MAX(message.Created)
OVER (PARTITION BY thread.id) AS Last_Message_Date,
MAX(message.content)
OVER (PARTITION BY thread.id) AS Last_Message_Sent
FROM
thread
INNER JOIN message ON (message.thread = thread.id)
INNER JOIN account ON (message.account = account.id)
) as threads
ORDER BY Last_Message_Date desc;
Result:
thread_id | thread_subject | thread_created | Involed_Accounts | last_message_date | last_message_sent
-----------+----------------+-------------------------------+---------------+-------------------------------+-------------------
1 | Thread 1 | 2016-02-17 19:42:58.630795+01 | {1,2,3,4,5,6} | 2016-02-17 19:56:35.749875+01 | R
3 | Thread 3 | 2016-02-17 19:42:58.630795+01 | {1,4,5,8} | 2016-02-17 19:47:27.952065+01 | N
2 | Thread 2 | 2016-02-17 19:42:58.630795+01 | {7,8,9,10} | 2016-02-17 19:47:27.952065+01 | J
You should check the query plan to ensure it performs well on your database.

many-to-many query

I have following database structure,
CREATE TABLE IF NOT EXISTS `analyze` (
`disease_id` int(11) NOT NULL,
`symptom_id` int(11) NOT NULL
) ;
CREATE TABLE IF NOT EXISTS `disease` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(10) NOT NULL,
PRIMARY KEY (`id`)
) ;
CREATE TABLE IF NOT EXISTS `symptom` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(4) NOT NULL,
PRIMARY KEY (`id`)
) ;
EDIT:
Sorry, I mean how do I identify the disease from inputted symptoms.
Example:
If I have symptom: fever and cough then I would have influenza.
If I have symptom: sore throat and fever then I would have throat infection.
The inputs are $symptom1, $symptom2, $symptom3, and so on.
Thank you.

SELECT disease_id
FROM `analyze`
GROUP BY disease_id
HAVING COUNT(symptom_id) > 1
Edit: to reply to the edited question
SELECT disease_id, COUNT(DISTINCT symptom_id)
FROM `analyze`
WHERE symptom_id IN ($symptom1, $symptom2, $symptom3)
GROUP BY disease_id
ORDER BY COUNT(DISTINCT symptom_id) DESC
Of course you'll have to replace each $symptomX with its respective ID. The table name is quoted with backticks, as in the question's DDL, because ANALYZE is a reserved word in MySQL.
This query lists the diseases which match at least one symptom; the diseases which match the most symptoms are on top.
If you added a unique constraint on symptom_id and disease_id in analyze, you could drop the DISTINCT:
SELECT disease_id, COUNT(symptom_id)
FROM `analyze`
WHERE symptom_id IN ($symptom1, $symptom2, $symptom3)
GROUP BY disease_id
ORDER BY COUNT(symptom_id) DESC
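The unique constraint mentioned above could be added roughly like this. The constraint name uq_analyze_disease_symptom is made up for illustration, and the statement assumes the table holds no duplicate (disease_id, symptom_id) pairs yet; otherwise it fails until they are removed.

```sql
-- Hypothetical constraint name; each symptom can then be linked to a disease only once.
ALTER TABLE `analyze`
    ADD CONSTRAINT uq_analyze_disease_symptom UNIQUE (disease_id, symptom_id);
```

With that in place, COUNT(symptom_id) and COUNT(DISTINCT symptom_id) give the same result per disease.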

select d.id
from disease d
inner join `analyze` a on d.id = a.disease_id
group by d.id
having count(a.disease_id) > 1

select disease_id, count(*)
from `analyze`
where symptom_id in ($symptom1, $symptom2, $symptom3)
group by disease_id
order by 2 desc;
This will return the matching disease ids in descending order of the number of matching symptoms.
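If only diseases that match every supplied symptom are wanted, one option is a HAVING clause that compares the match count with the number of inputs. This is a sketch, not something the answers above propose; the symptom ids 1 and 2 are hypothetical stand-ins for $symptom1 and $symptom2.

```sql
-- Hypothetical ids 1 and 2 stand in for the supplied symptoms.
SELECT d.id, d.name
FROM disease d
JOIN `analyze` a ON a.disease_id = d.id
WHERE a.symptom_id IN (1, 2)
GROUP BY d.id, d.name
HAVING COUNT(DISTINCT a.symptom_id) = 2;  -- 2 = number of symptoms supplied
```

Joining to disease also returns the name, so the caller does not need a second lookup by id.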
