Fetch multi-participant conversations with last message for each - sql

I am trying to create a simple chat application database schema, and query the conversations. My current table setup is the following:
CREATE TABLE chat_user (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
display_name VARCHAR(140),
... other user stuff ...
);
CREATE TABLE conversation (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
title VARCHAR(140),
created timestamp with time zone NOT NULL
);
CREATE TABLE conversation_message (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
conversation_id bigint NOT NULL,
sender_id bigint NOT NULL,
body TEXT NOT NULL,
created timestamp with time zone NOT NULL
);
CREATE TABLE conversation_participant (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
conversation_id bigint NOT NULL,
user_id bigint NOT NULL
);
So basically each conversation has its own title and multiple participants. What I would like is to fetch my conversations sorted by the date of the latest message in the conversation (so the conversations with the newest messages are shown first). The result set should contain the id and title of the conversation, the list of participants, and the id, sender_id and body of the latest message.
It would also be required to fetch the conversations paginated based on the creation date of the conversation (20 per page).
Is my table setup efficient enough to satisfy the above constraints? It seems to me that this could result in a rather large query with multiple subqueries.

This answers the original version of the question.
You seem to want a join and aggregation:
select cm.conversation_id, max(created)
from conversation_message cm join
conversation_participant cp
on cm.conversation_id = cp.conversation_id
where cp.user_id = ?
group by cm.conversation_id
order by max(created) desc;
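As a quick way to try the join-and-aggregate query, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows, the TEXT timestamps and the user id 10 are made up for illustration, and SQLite stands in for whatever database you actually use.

```python
import sqlite3

# In-memory database with only the two tables the query needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversation_message (
    id INTEGER PRIMARY KEY,
    conversation_id INTEGER NOT NULL,
    sender_id INTEGER NOT NULL,
    body TEXT NOT NULL,
    created TEXT NOT NULL            -- ISO-8601 text sorts chronologically
);
CREATE TABLE conversation_participant (
    id INTEGER PRIMARY KEY,
    conversation_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL
);
-- Hypothetical sample data: user 10 is in conversations 1 and 2 only.
INSERT INTO conversation_participant (conversation_id, user_id) VALUES
    (1, 10), (1, 11), (2, 10), (3, 11);
INSERT INTO conversation_message (conversation_id, sender_id, body, created) VALUES
    (1, 10, 'hi',    '2024-01-01T10:00:00'),
    (1, 11, 'hello', '2024-01-03T09:00:00'),
    (2, 10, 'ping',  '2024-01-05T08:00:00'),
    (3, 11, 'other', '2024-01-06T08:00:00');
""")

# Conversations of one user, newest activity first.
latest_for_user = conn.execute("""
    SELECT cm.conversation_id, MAX(cm.created) AS last_message_at
    FROM conversation_message cm
    JOIN conversation_participant cp
      ON cm.conversation_id = cp.conversation_id
    WHERE cp.user_id = ?
    GROUP BY cm.conversation_id
    ORDER BY last_message_at DESC
""", (10,)).fetchall()

print(latest_for_user)
```

Conversation 3 does not appear because user 10 is not a participant in it.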

You can try using a lateral join.
Your query would look something like the one below.
It fetches all of the data you need, lets you apply limits and offsets, and retrieves the last message for each conversation. Hope it helps.
select * from conversation c
left join lateral (
select * from conversation_message cm
where cm.conversation_id=c.id
order by created desc
limit 1
) cm on true
left join conversation_participant cp on cp.conversation_id = c.id and cp.user_id = cm.sender_id;
Left joins are used here so that chat rooms without any messages are still returned.

In short: I think you have a reasonable design for a normalized (3NF) OLTP database. That is what you should aim for, rather than a low number of JOINs for a specific use case. The design you have will satisfy the use case you defined and many other use cases, which I'm sure are involved in this application of yours.
In detail:
You're designing for an OLTP system, where the data is kept normalized to ensure data consistency and improve the efficiency of OLTP transactions.
This, however, means you will have to do many more JOINs than in a de-normalized database (which is more suitable for OLAP, reporting and analytics systems). This is just the nature of OLTP relational databases.
Trying to reduce the number of JOINs in a normalized database (i.e. 3NF, third normal form) means you will be combining data of different granularity into the same table and causing duplication, thus making updates harder and slower, and eventually leading to data inconsistency.
So, you really shouldn't design with the aim of reducing the number of JOINs. Instead, make sure you have a normalized design and avoid over-normalizing. In cases where you want to avoid writing long queries, you can add views and query those instead (but that can sometimes cause sub-optimal query performance by bringing in unnecessary joins).

To get the latest message for each conversation there are several approaches, such as self joins or window functions (row_number(), rank(), etc.). Using a window function, you can write your query as
with cm as (
select *,
rank() over (partition by conversation_id order by created desc) as r
from conversation_message
)
select c.id,
c.title,
cm.body,
cm.created,
cm.r,
cu.display_name
from conversation as c
left join cm on c.id = cm.conversation_id and cm.r <= 1
left join chat_user cu on cu.id = cm.sender_id
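In the above query I have used left joins to include conversations with no messages. If you need only conversations that have messages, use inner joins instead. If you need more than one latest message for each conversation, change the condition to cm.r <= N, where N is the number of messages you want. Note that rank() gives tied rows the same rank, so two messages with an identical created value would both be returned; row_number() returns exactly one.
To get the participants list for each conversation you can add a new CTE like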
with cm as (
select *,
rank() over (partition by conversation_id order by created desc) as r
from conversation_message
),
message_participants as (
select
m.conversation_id,
array_agg(u.display_name order by m.created desc) as participants
from chat_user as u
join conversation_message as m on u.id = m.sender_id
group by m.conversation_id
)
select c.id,
c.title,
cm.body,
cm.created,
cm.r,
cu.display_name,
cmp.participants
from conversation c
left join cm on c.id = cm.conversation_id and cm.r <= 1
left join chat_user cu on cu.id = cm.sender_id
left join message_participants cmp on c.id = cmp.conversation_id
Improvements
Add user_id to the conversation table to identify who created the conversation.
Table conversation_participant is redundant if you extract the list of participants from conversation_message (note that this only covers users who have sent at least one message).
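If participants who have not written anything yet should also be listed, the list has to come from conversation_participant. Below is a rough sketch of that variant, run through Python's sqlite3 module. It assumes SQLite 3.25 or newer for window functions, uses GROUP_CONCAT in place of PostgreSQL's array_agg, and the sample users and messages are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chat_user (id INTEGER PRIMARY KEY, display_name TEXT);
CREATE TABLE conversation (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE conversation_message (
    id INTEGER PRIMARY KEY, conversation_id INTEGER NOT NULL,
    sender_id INTEGER NOT NULL, body TEXT NOT NULL, created TEXT NOT NULL);
CREATE TABLE conversation_participant (
    id INTEGER PRIMARY KEY, conversation_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL);
-- Hypothetical data: Cy is a participant but never sends a message.
INSERT INTO chat_user VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO conversation VALUES (1, 'Lunch'), (2, 'Empty room');
INSERT INTO conversation_participant (conversation_id, user_id) VALUES
    (1, 1), (1, 2), (1, 3), (2, 1);
INSERT INTO conversation_message (conversation_id, sender_id, body, created) VALUES
    (1, 1, 'Pizza?', '2024-01-01T12:00:00'),
    (1, 2, 'Sure',   '2024-01-01T12:05:00');
""")

rows = conn.execute("""
    WITH last_msg AS (
        -- row_number() with id as tie-breaker yields exactly one row per conversation
        SELECT id, conversation_id, sender_id, body, created,
               ROW_NUMBER() OVER (PARTITION BY conversation_id
                                  ORDER BY created DESC, id DESC) AS rn
        FROM conversation_message
    ),
    participants AS (
        SELECT cp.conversation_id, GROUP_CONCAT(u.display_name) AS names
        FROM conversation_participant cp
        JOIN chat_user u ON u.id = cp.user_id
        GROUP BY cp.conversation_id
    )
    SELECT c.id, c.title, p.names, m.id, m.sender_id, m.body
    FROM conversation c
    LEFT JOIN last_msg m ON m.conversation_id = c.id AND m.rn = 1
    LEFT JOIN participants p ON p.conversation_id = c.id
    -- conversations with messages first, newest message on top
    ORDER BY m.created IS NULL, m.created DESC
""").fetchall()

for row in rows:
    print(row)
```

The order of names inside GROUP_CONCAT is not guaranteed, so sort the list in the application if the order matters.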

Related

What will be faster for GROUP BY statement

Imagine that I have the next two SQL Server tables:
CREATE TABLE Users (
id INT IDENTITY(1, 1) PRIMARY KEY,
name VARCHAR(100) NOT NULL
)
CREATE TABLE UserLogins (
id INT IDENTITY(1, 1) PRIMARY KEY,
user_id INT REFERENCES Users(id) NOT NULL,
login VARCHAR(100) NOT NULL
)
And I need to get a count of user logins for each user. And the query result should contain user name, for example.
Which query will work faster:
SELECT MAX(name), count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY u.id
or the next one:
SELECT name, count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY u.name
So, I'm not sure, if it will be better to group by the column with an index and then use MAX or MIN aggregate function. Or just group by Users.name, which doesn't have any indexes.
Thank you in advance!
The answer is: neither is really correct.
The second version is completely wrong as name is not unique. The first version is correct, although it may not be efficient.
Since name has a functional dependency on id, every unique value of id also defines a value of name. Grouping by name is wrong, because name is not necessarily unique. Grouping only by id means you need to aggregate name, which makes no sense if there is a functional dependency. So you actually want to group by both columns:
SELECT
u.name,
count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY
u.id,
u.name;
Note that id does not actually need to be selected.
This query is almost certainly going to be faster than grouping by name alone, because the server cannot deduce that name is unique and needs to sort and aggregate it.
It may also be faster than grouping by id, although that may depend on whether the optimizer is clever enough to deduce the functional dependency (and therefore no aggregation would be necessary). Even if it isn't clever, this probably won't be slow, as id is already unique, so a scan of an index over id would not require a sort, only aggregation.
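To see why grouping by name alone gives wrong counts, here is a small sketch with two different users who share a name. It runs on SQLite through Python's sqlite3 module rather than on SQL Server, and the sample rows are made up, so it only illustrates the correctness point, not the performance one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE UserLogins (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES Users(id),
    login TEXT NOT NULL
);
-- Hypothetical data: users 1 and 2 are different people named 'Alex'.
INSERT INTO Users (id, name) VALUES (1, 'Alex'), (2, 'Alex'), (3, 'Sam');
INSERT INTO UserLogins (user_id, login) VALUES
    (1, 'a1'), (1, 'a2'), (2, 'b1'), (3, 's1');
""")

# Grouping by name merges both Alex users into one row.
by_name = conn.execute("""
    SELECT u.name, COUNT(*)
    FROM Users u
    JOIN UserLogins ul ON ul.user_id = u.id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()

# Grouping by id and name keeps one row per actual user.
by_id_and_name = conn.execute("""
    SELECT u.name, COUNT(*)
    FROM Users u
    JOIN UserLogins ul ON ul.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY u.id
""").fetchall()

print(by_name)
print(by_id_and_name)
```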

"Data warehouse"-like SQLite store design

I am interested in designing a SQL-based (SQLite, actually) storage for an application processing a large number of similar data entries. For this example, let it be a chat messages storage.
The application has to provide the capabilities of filtering and analyzing the data by message participants, tags, etc., all of those implying N-to-N relationships.
So, the schema (kind of star) will look something like:
create table messages (
message_id INTEGER PRIMARY KEY,
time_stamp INTEGER NOT NULL
-- other fact fields
);
create table users (
user_id INTEGER PRIMARY KEY,
-- user dimension data
);
create table message_participants (
user_id INTEGER references users(user_id),
message_id INTEGER references messages(message_id)
);
create table tags (
tag_id INTEGER PRIMARY KEY,
tag_name TEXT NOT NULL,
-- tag dimension data
);
create table message_tags (
tag_id INTEGER references tags(tag_id),
message_id INTEGER references messages(message_id)
);
-- etc.
So, all good and well, until I have to perform analytic operations and filtering based on the N-to-N dimensions. Given millions of rows in the messages table and thousands in the dimensions (there are more than shown in the example), all the joins are simply too much of a performance hit.
For example, I would like to analyze the number of messages each user participated in, given the data is filtered based on selected tags, selected users and other aspects:
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id=MP.message_id
join users as U on MP.user_id=U.user_id
where
MP.user_id not in ( /* some user ID's set */ )
and M.time_stamp between #StartTime and #EndTime
and
-- more fact table fields filtering
and M.message_id in
(select message_id
from message_tags
where tag_id in ( /* some tag ID's set */ ))
and
-- more N-to-N filtering
group by U.user_id
I am constrained to SQL and, specifically, SQLite. And I do use indices on the tables.
Is there some way I don't see to improve the schema, maybe a clever way to de-normalize it?
Or maybe there is a way to somehow index the dimension keys inside the message row (I thought about using FTS capabilities but not sure if searching the textual index and joining on the results will provide any performance leverage)?
Too long to put in a comment, and might help with performance but isn't exactly a direct answer to your question (your schema seems fine): have you tried messing with your query itself?
I often see that kind of subselect filter for many-to-many relationships, and I have found that large queries like this frequently perform better with a CTE/join than with a where blah in (subselect):
;with tagMesages as (
select distinct message_id
from message_tags
where tag_id in ( /* some tag ID's set */ )
) -- more N-to-N filtering
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id=MP.message_id
join users as U on MP.user_id=U.user_id
join tagMesages on M.message_id = tagMesages.message_id
where
MP.user_id not in ( /* some user ID's set */ )
and M.time_stamp between #StartTime and #EndTime
and
-- more fact table fields filtering
group by U.user_id
We can tell they're the same, but the query planner can sometimes find this form more helpful.
Disclaimer: I don't do SQLite, I do SQL Server, so sorry if I've made some obvious (or otherwise) error.
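Before trusting the rewrite, it is worth checking on a small data set that both forms return the same rows. Here is a minimal sketch of such a check with Python's sqlite3 module. The user_name column, the tag ids 7 and 8, the time range and all sample rows are assumptions made for the demo, and the CTE is called tagged here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (message_id INTEGER PRIMARY KEY, time_stamp INTEGER NOT NULL);
CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE message_participants (user_id INTEGER, message_id INTEGER);
CREATE TABLE message_tags (tag_id INTEGER, message_id INTEGER);
-- Hypothetical data: message 1 carries two of the selected tags.
INSERT INTO messages VALUES (1, 100), (2, 200), (3, 300);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO message_participants VALUES (1, 1), (2, 1), (1, 2), (2, 3);
INSERT INTO message_tags VALUES (7, 1), (8, 1), (7, 2), (9, 3);
""")

with_subselect = conn.execute("""
    SELECT U.user_id, U.user_name, COUNT(1)
    FROM messages AS M
    JOIN message_participants AS MP ON M.message_id = MP.message_id
    JOIN users AS U ON MP.user_id = U.user_id
    WHERE M.time_stamp BETWEEN ? AND ?
      AND M.message_id IN (SELECT message_id
                           FROM message_tags
                           WHERE tag_id IN (7, 8))
    GROUP BY U.user_id, U.user_name
    ORDER BY U.user_id
""", (0, 250)).fetchall()

with_cte_join = conn.execute("""
    WITH tagged AS (
        -- DISTINCT matters: without it message 1 would be counted twice
        SELECT DISTINCT message_id
        FROM message_tags
        WHERE tag_id IN (7, 8)
    )
    SELECT U.user_id, U.user_name, COUNT(1)
    FROM messages AS M
    JOIN message_participants AS MP ON M.message_id = MP.message_id
    JOIN users AS U ON MP.user_id = U.user_id
    JOIN tagged AS T ON M.message_id = T.message_id
    WHERE M.time_stamp BETWEEN ? AND ?
    GROUP BY U.user_id, U.user_name
    ORDER BY U.user_id
""", (0, 250)).fetchall()

print(with_subselect)
print(with_cte_join)
```

Whether the second form is actually faster depends on the planner and the indexes; prefixing either statement with EXPLAIN QUERY PLAN in SQLite shows which plan was chosen.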

How to add multiple columns in a table with HSQLDB?

I have been trying to use the ALTER TABLE command with HSQLDB to add 2 columns to a table but with no luck. I know MySQL and other systems support it but why doesn't HSQLDB support it? Perhaps I'm using the wrong syntax, I don't know. I also know that I could add it one-by-one but my application requires the addition of 1000 columns and it is too slow to do it one-by-one.
The reason that I'm using HSQLDB is that I need to use it in file mode. I have also tried SmallSQL, but it is much slower than HSQLDB.
You don't need thousands of columns for this. This is a standard one-to-many relationship between three tables: questionaire, question and answer:
create table questionaire
(
id integer not null primary key,
customer_name varchar(100) not null
);
create table question
(
id integer not null primary key,
questionaire_id integer not null references questionaire,
question_text varchar(20000),
sort_order integer
);
create table answer
(
question_id integer not null references question,
answer_text varchar(20000),
user_name varchar(50) not null,
primary key (question_id, user_name)
);
In reality you wouldn't actually store the user's name in the answer table. If you have named users that log in, you probably also need a user_account table, and the answer table would reference the user_account table.
You can easily query this model without the need to resort to a key/value store or JSON.
To get all questionnaires from a customer:
select *
from questionaire qu
where customer_name = 'Some company';
Get the number of questionnaires and the total number of questions per customer:
select qu.customer_name,
count(distinct qu.id) as num_questionaires,
count(q.id) as total_questions
from questionaire qu
join question q on qu.id = q.questionaire_id
group by qu.customer_name;
Get all answers for a questionnaire from a specific user:
select q.question_text, a.answer_text
from answer a
join question q on q.id = a.question_id
join questionaire qu on qu.id = q.questionaire_id
where qu.id = 1
and a.user_name = 'Marvin'
order by q.sort_order;
A bit more complicated, but probably still fast enough even with thousands of questions and answers: find users that haven't answered all questions
select aq.user_name, aq.questionaire_id, aq.answered_questions, tq.num_questions
from (
select a.user_name, q.questionaire_id, count(*) as answered_questions
from answer a
join question q on q.id = a.question_id
group by a.user_name, q.questionaire_id
) aq join (
select questionaire_id, count(*) as num_questions
from question
group by questionaire_id
) tq on tq.questionaire_id = aq.questionaire_id
where aq.answered_questions < tq.num_questions;
SQLFiddle example: http://sqlfiddle.com/#!15/0a4e5/1
You also shouldn't try to transpose the rows for each question (or answer) into columns in SQL: you will eventually hit some limit on the number of columns the database can manage. Relational databases were designed to handle rows, lots of rows, not "thousands of columns". Transposing rows to columns is typically done in the presentation layer of your application (or e.g. using a Pivot function in a spreadsheet).
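To give an idea of what pivoting in the presentation layer can look like, here is a minimal Python sketch. The rows are hypothetical and stand in for the result of a query joining answer and question; the pivot function is just an illustrative helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical (user_name, question_text, answer_text) rows from the database.
rows = [
    ("Marvin", "Favourite colour?", "Blue"),
    ("Marvin", "Favourite animal?", "Cat"),
    ("Zaphod", "Favourite colour?", "Red"),
]


def pivot(rows):
    """Turn one row per answer into one row per user, one column per question."""
    questions = []             # column order = order of first appearance
    table = defaultdict(dict)  # user_name -> {question_text: answer_text}
    for user_name, question_text, answer_text in rows:
        if question_text not in questions:
            questions.append(question_text)
        table[user_name][question_text] = answer_text
    header = ["user_name"] + questions
    # Unanswered questions become empty cells.
    body = [[user] + [answers.get(q, "") for q in questions]
            for user, answers in table.items()]
    return header, body


header, body = pivot(rows)
print(header)
for line in body:
    print(line)
```

The database keeps its narrow, row-oriented tables, and the wide layout exists only at display time.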

Design : multiple visits per patient

Above is my schema. What you can't see in tblPatientVisits is the foreign key from tblPatient, which is patientid.
tblPatient contains a distinct copy of each patient in the dataset as well as their gender. tblPatientVisits contains their demographic information, where they lived at the time of admission and which hospital they went to. I chose to put that information into a separate table because it changes throughout the data (a person can move from one visit to the next and go to a different hospital).
I don't get any strange numbers with my queries until I add tblPatientVisits. There are just under one million claims in tblClaims, but when I add tblPatientVisits so I can check out where that person was from, it returns over a million. I think this is due to the fact that in tblPatientVisits the same patientID shows up more than once (because they had different admission/discharge dates).
For the life of me I can't see where this is incorrect design, nor do I know how to rectify it beyond doing one query with count(tblPatientVisits.patientID)=1 and then a union with count(tblPatientVisits.patientID)>1.
Any insight into this type of design, or how I might more elegantly find a way to get the claimType from tblClaims to give me the correct number of rows when I associate a claim ID with a patientID?
EDIT: The biggest problem I'm having is the fact that if I include the admissionDate, dischargeDate or the patientState in the tblPatient table, I can't use the patientID as a primary key.
It should be noted that tblClaims are NOT necessarily related to tblPatientVisits.admissionDate, tblPatientVisits.dischargeDate.
EDIT: sample queries to show that when tblPatientVisits is added, more rows are returned than claims
SELECT tblclaims.id, tblClaims.claimType
FROM tblClaims INNER JOIN
tblPatientClaims ON tblClaims.id = tblPatientClaims.id INNER JOIN
tblPatient ON tblPatientClaims.patientid = tblPatient.patientID INNER JOIN
tblPatientVisits ON tblPatient.patientID = tblPatientVisits.patientID
more than one million query rows returned
SELECT tblClaims.id, tblPatient.patientID
FROM tblClaims INNER JOIN
tblPatientClaims ON tblClaims.id = tblPatientClaims.id INNER JOIN
tblPatient ON tblPatientClaims.patientid = tblPatient.patientID
less than one million query rows returned
I think this is crying for a better design. I really think that a visit should be associated with a claim, and that a claim can only be associated with a single patient, so I think the design should be (and eliminating the needless tbl prefix, which is just clutter):
CREATE TABLE dbo.Patients
(
PatientID INT PRIMARY KEY
-- , ... other columns ...
);
CREATE TABLE dbo.Claims
(
ClaimID INT PRIMARY KEY,
PatientID INT NOT NULL FOREIGN KEY
REFERENCES dbo.Patients(PatientID)
-- , ... other columns ...
);
CREATE TABLE dbo.PatientVisits
(
PatientID INT NOT NULL FOREIGN KEY
REFERENCES dbo.Patients(PatientID),
ClaimID INT NULL FOREIGN KEY
REFERENCES dbo.Claims(ClaimID),
VisitDate DATE
-- , ... other columns ...
, PRIMARY KEY (PatientID, ClaimID, VisitDate) -- not convinced on this one; a NULLable ClaimID cannot be part of a primary key
);
There is some redundant information here, but it's not clear from your model whether a patient can have a visit that is not associated with a specific claim, or even whether you know that a visit belongs to a specific claim (this seems like crucial information given the type of query you're after).
In any case, given your current model, one query you might try is:
SELECT c.id, c.claimType
FROM dbo.tblClaims AS c
INNER JOIN dbo.tblPatientClaims AS pc
ON c.id = pc.id
INNER JOIN dbo.tblPatient AS p
ON pc.patientid = p.patientID
-- where exists tells SQL server you don't care how many
-- visits took place, as long as there was at least one:
WHERE EXISTS (SELECT 1 FROM dbo.tblPatientVisits AS pv
WHERE pv.patientID = p.patientID);
This will still return one row for every patient / claim combination, but it will no longer return one row per patient / visit combination, because EXISTS only checks that at least one visit exists. Again, it really feels like the design isn't right here. You should also get in the habit of using table aliases: they make your query much easier to read, especially if you insist on the messy tbl prefix. You should also always use the dbo (or whatever schema you use) prefix when creating and referencing objects.
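The difference between joining to the visits table and using EXISTS can be reproduced on a tiny data set. Here is a rough sketch with Python's sqlite3 module, so SQLite rather than SQL Server and without the dbo prefix. The table and column names follow the question, while the sample rows are invented: patient 1 has two visits and two claims, patient 2 has one of each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblPatient (patientID INTEGER PRIMARY KEY);
CREATE TABLE tblClaims (id INTEGER PRIMARY KEY, claimType TEXT);
CREATE TABLE tblPatientClaims (id INTEGER, patientid INTEGER);
CREATE TABLE tblPatientVisits (patientID INTEGER, admissionDate TEXT);
-- Hypothetical sample data.
INSERT INTO tblPatient VALUES (1), (2);
INSERT INTO tblClaims VALUES (101, 'A'), (102, 'B'), (103, 'A');
INSERT INTO tblPatientClaims VALUES (101, 1), (102, 1), (103, 2);
INSERT INTO tblPatientVisits VALUES
    (1, '2024-01-01'), (1, '2024-02-01'), (2, '2024-03-01');
""")

# Joining to visits repeats each claim once per visit of its patient.
joined = conn.execute("""
    SELECT c.id, c.claimType
    FROM tblClaims AS c
    JOIN tblPatientClaims AS pc ON c.id = pc.id
    JOIN tblPatient AS p ON pc.patientid = p.patientID
    JOIN tblPatientVisits AS pv ON p.patientID = pv.patientID
    ORDER BY c.id
""").fetchall()

# EXISTS only asks whether at least one visit exists, so claims are not repeated.
with_exists = conn.execute("""
    SELECT c.id, c.claimType
    FROM tblClaims AS c
    JOIN tblPatientClaims AS pc ON c.id = pc.id
    JOIN tblPatient AS p ON pc.patientid = p.patientID
    WHERE EXISTS (SELECT 1 FROM tblPatientVisits AS pv
                  WHERE pv.patientID = p.patientID)
    ORDER BY c.id
""").fetchall()

print(len(joined), len(with_exists))
```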
I'm not sure I understand the concept of a claim but I suspect you want to remove the link table between claims and patient and instead make the association between patient visit and a claim.
Would that work out better for you?

Unexpected results after joining another table

I use three tables to get to the final result. They are called project_board_members, users and project_team.
This is the query:
SELECT `project_board_members`.`member_id`,
`users`.`name`,
`users`.`surname`,
`users`.`country`,
`project_team`.`tasks_completed`
FROM `project_board_members`
JOIN `users`
ON (`users`.`id` = `project_board_members`.`member_id`)
JOIN `project_team`
ON (`project_team`.`user_id` = `project_board_members`.`member_id`)
WHERE `project_board_members`.`project_id` = '5'
You can ignore last line because it just points to the project I'm using.
Table project_board_members holds three entries and have structure like:
id,
member_id,
project_id,
created_at;
I need to get member_id from that table. Then I join to users table to get name, surname and country. No problems. All works! :)
After that, I needed to get tasks_completed for each user. That is stored in project_team table. The big unexpected thing is that I got four entries returned and the big what-the-f*ck is that in the project_board_members table are only three entries.
Why is that so? Thanks in advance!
A SQL join creates a result set that contains one row for each combination of the left and right tables that matches the join conditions. Without seeing the data or a little more information it's hard to say what exactly is wrong from what you expect, but I'm guessing it's one of the following:
1) You have two entries in project_team with the same user_id.
2) Your entries in project_team store both user_id and project_id and you need to be joining on both of them rather than just user_id.
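Both guesses can be checked with a couple of short queries. Here is a minimal sketch with Python's sqlite3 module instead of MySQL. It assumes that project_team has a project_id column, which the question does not show, and the sample rows are made up; the users table is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_board_members (
    id INTEGER PRIMARY KEY, member_id INTEGER, project_id INTEGER);
-- project_id on project_team is an assumption made for this demo.
CREATE TABLE project_team (
    id INTEGER PRIMARY KEY, user_id INTEGER, project_id INTEGER,
    tasks_completed INTEGER);
-- Three board members on project 5; user 2 is also on the team of project 6.
INSERT INTO project_board_members (member_id, project_id) VALUES
    (1, 5), (2, 5), (3, 5);
INSERT INTO project_team (user_id, project_id, tasks_completed) VALUES
    (1, 5, 4), (2, 5, 7), (2, 6, 1), (3, 5, 0);
""")

# Joining on user_id alone: user 2 matches two project_team rows.
joined_on_user = conn.execute("""
    SELECT pbm.member_id, pt.tasks_completed
    FROM project_board_members pbm
    JOIN project_team pt ON pt.user_id = pbm.member_id
    WHERE pbm.project_id = 5
    ORDER BY pbm.member_id, pt.project_id
""").fetchall()

# Joining on user_id and project_id: one row per board member again.
joined_on_both = conn.execute("""
    SELECT pbm.member_id, pt.tasks_completed
    FROM project_board_members pbm
    JOIN project_team pt ON pt.user_id = pbm.member_id
                        AND pt.project_id = pbm.project_id
    WHERE pbm.project_id = 5
    ORDER BY pbm.member_id
""").fetchall()

# Diagnostic: which user_id values occur more than once in project_team?
duplicates = conn.execute("""
    SELECT user_id, COUNT(*)
    FROM project_team
    GROUP BY user_id
    HAVING COUNT(*) > 1
""").fetchall()

print(joined_on_user, joined_on_both, duplicates)
```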
The table project_board_members represents what is called in the Entity-Relationship modelling world an "associative entity". It exists to implement a many-to-many relationship (in this case, between the project and user entities). As such it is a dependent entity, which is to say that the existence of an instance of it is predicated on the existence of an instance of each of the entities to which it refers (a user and a project).
As a result, the columns comprising the foreign keys relating to those entities (member_id and project_id) must form part or all of the primary key.
Normally, instances of an associative entity are unique with respect to the entities to which it relates. In your case the relationship definitions would be:
Each user is seated on the board of 0-to-many projects;
Each project's board is comprised of 0-to-many users
which is to say that a particular user may not be on the board of a particular project more than once. The only reason for adding other columns (such as your id column) to the primary key would be if the user:project relationship is non-unique.
To enforce this rule (a user may sit on the board of a particular project just once) the table schema should look like this:
create table project_board_member
(
member_id int not null foreign key references user ( user_id ) ,
project_Id int not null foreign key references project ( project_id ) ,
created_at ...
...
primary key ( member_id , project_id )
)
The id column is superfluous.
For debugging purposes, run:
SELECT GROUP_CONCAT(pbm.member_id) AS member_ids,
GROUP_CONCAT(u.name) as names,
GROUP_CONCAT(u.surname) as surnames,
GROUP_CONCAT(u.country) as countries,
GROUP_CONCAT(pt.tasks_completed) as tasks
FROM project_board_members pbm
JOIN users u
ON (u.id = pbm.member_id)
JOIN project_team pt
ON (pt.user_id = pbm.member_id)
WHERE pbm.project_id = '5'
GROUP BY pbm.member_id
All the fields that list multiple entries in the result are messing up the rowcount in your resultset.
To fix that you can do:
SELECT pbm.member_id,
u.name,
u.surname,
u.country,
pt.tasks_completed
FROM (SELECT
p.project_id, p.member_id
FROM project_board_members p
WHERE p.project_id = '5'
LIMIT 1
) AS pbm
JOIN users u
ON (u.id = pbm.member_id)
JOIN project_team pt
ON (pt.user_id = pbm.member_id)