"Data warehouse"-like SQLite store design - sql

I am interested in designing a SQL-based (SQLite, actually) storage for an application processing a large number of similar data entries. For this example, let it be a chat message store.
The application has to provide the capabilities of filtering and analyzing the data by message participants, tags, etc., all of which imply N-to-N relationships.
So the schema (a kind of star schema) will look something like:
create table messages (
  message_id INTEGER PRIMARY KEY,
  time_stamp INTEGER NOT NULL
  -- other fact fields
);
create table users (
  user_id INTEGER PRIMARY KEY
  -- user dimension data
);
create table message_participants (
  user_id INTEGER references users(user_id),
  message_id INTEGER references messages(message_id)
);
create table tags (
  tag_id INTEGER PRIMARY KEY,
  tag_name TEXT NOT NULL
  -- tag dimension data
);
create table message_tags (
  tag_id INTEGER references tags(tag_id),
  message_id INTEGER references messages(message_id)
);
-- etc.
So, all good and well, until I have to perform analytic operations and filtering based on the N-to-N dimensions. Given millions of rows in the messages table and thousands in the dimensions (there are more than shown in the example), all the joins are simply too much of a performance hit.
For example, I would like to analyze the number of messages each user participated in, given that the data is filtered by selected tags, selected users and other aspects:
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id = MP.message_id
join users as U on MP.user_id = U.user_id
where MP.user_id not in ( /* some user IDs */ )
  and M.time_stamp between #StartTime and #EndTime
  -- and ... more fact table field filtering
  and M.message_id in
      (select message_id
       from message_tags
       where tag_id in ( /* some tag IDs */ ))
  -- and ... more N-to-N filtering
group by U.user_id
I am constrained to SQL and, specifically, SQLite. And I do use indices on the tables.
Is there some way I don't see to improve the schema, maybe a clever way to de-normalize it?
Or maybe there is a way to somehow index the dimension keys inside the message row? (I thought about using the FTS capabilities, but I'm not sure whether searching the textual index and joining on the results would provide any performance leverage.)
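For concreteness, the indices I have in mind are composite ones on the junction tables, covering both lookup directions (a sketch; the index names are illustrative):
create index idx_mp_message_user on message_participants(message_id, user_id);
create index idx_mp_user_message on message_participants(user_id, message_id);
create index idx_mt_tag_message on message_tags(tag_id, message_id);
create index idx_mt_message_tag on message_tags(message_id, tag_id);
create index idx_messages_time on messages(time_stamp);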

Too long to put in a comment, and might help with performance but isn't exactly a direct answer to your question (your schema seems fine): have you tried messing with your query itself?
I often see that kind of subselect filter for many-to-many relationships, and I have found that on large queries like this I frequently see performance improvements from using a CTE plus a join rather than a where ... in (subselect):
with tagMessages as (
  select distinct message_id
  from message_tags
  where tag_id in ( /* some tag IDs */ )
) -- more N-to-N filtering CTEs can be added here
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id = MP.message_id
join users as U on MP.user_id = U.user_id
join tagMessages on M.message_id = tagMessages.message_id
where MP.user_id not in ( /* some user IDs */ )
  and M.time_stamp between #StartTime and #EndTime
  -- and ... more fact table field filtering
group by U.user_id
We can tell the two forms are equivalent, but the query planner can sometimes make better use of this one.
Disclaimer: I don't do SQLite, I do SQL Server, so sorry if I've made some obvious (or otherwise) error.
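If you want to check what SQLite actually does with either form, EXPLAIN QUERY PLAN should show it (a sketch, with illustrative tag IDs):
explain query plan
with tagMessages as (
  select distinct message_id
  from message_tags
  where tag_id in (1, 2, 3) -- illustrative tag IDs
)
select U.user_id, count(1)
from messages as M
join message_participants as MP on M.message_id = MP.message_id
join users as U on MP.user_id = U.user_id
join tagMessages on M.message_id = tagMessages.message_id
group by U.user_id;
The output lists the index chosen for each step and flags any USE TEMP B-TREE steps, so the two forms can be compared directly.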

Related

Preventing SQLite query from doing USE TEMP B-TREE FOR GROUP BY

I have a table
CREATE TABLE user_records (
pos smallint PRIMARY KEY,
username MEDIUMINT unsigned not null,
anime_id smallint UNSIGNED NOT NULL,
score tinyint not null,
scaled_score DECIMAL(1,5) not null
)
with indexes
(anime_id,username,scaled_score)
(username,anime_id,scaled_score)
(username)
(anime_id)
I know those last two are redundant; I was just testing.
And lastly here is my query:
select aggregate_func(score2) scores, anime_id
from (select anime_id as anime_id2, username as username2, scaled_score as score2
      from user_records
      where anime_id in (666))
inner join
     (select anime_id, username, scaled_score
      from user_records)
where username = username2
group by anime_id
order by scores desc
limit 1000;
The goal of this query is to run an aggregation function over every combination of the scores a user has given to a specified show (666 in this case) and to every other show in the table. I have tried every type of join SQLite supports (which isn't many) and reordering the select statements, but the outcome is always the same, except with a cross join with the unconstrained select first, which takes a very long time for obvious reasons. Having executed each part separately, I am confident that the part taking the most time is the USE TEMP B-TREE FOR GROUP BY step. My goal is for the query planner to somehow use an index for the GROUP BY, but no matter what I try it chooses the B-tree, and the grouping takes dramatically longer as the result set from the join grows.
For reference, this table has 70,000,000 rows of user show ratings, and the GROUP BY often has to work on millions of joined rows. Thanks in advance.
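For reference, the same query written with explicit aliases (avg() here is only a stand-in for the aggregate):
select other.anime_id, avg(other.scaled_score) as scores
from user_records as seed
join user_records as other
  on other.username = seed.username
where seed.anime_id in (666)
group by other.anime_id
order by scores desc
limit 1000;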

Postgresql "column must appear in the GROUP BY clause or be used in an aggregate function" and unique field

CREATE TABLE posts (
  id bigint NOT NULL,
  user_id bigint NOT NULL,
  content text
);
CREATE TABLE users (
  id bigint NOT NULL,
  email character varying DEFAULT ''::character varying NOT NULL
);
CREATE UNIQUE INDEX index_users_on_email ON users USING btree (email);
The following SQL request:
SELECT posts.content, users.email /*, other aggregate fields not relevant for the question */
FROM posts
INNER JOIN users ON posts.user_id = users.id
GROUP BY posts.id;
gives the error column "users.email" must appear in the GROUP BY clause or be used in an aggregate function.
But the email field is unique (if it changes anything) and a post can only have one user (so one email).
Why is this request not valid, since it's not possible to have multiple values of email per post?
You need to add the primary key of the user table to the group by clause to make the query a valid aggregation query:
SELECT p.content, u.email /*, other aggregate fields not relevant for the question */
FROM posts p
INNER JOIN users u ON p.user_id = u.id
/* Other `inner join`s but not relevant for the question */
GROUP BY p.id, u.id;
Postgres is quite smart about functional dependencies, but not that smart. It understands functionally dependent columns within a single table, but not across tables, so it cannot infer that a post uniquely refers to a user, even with a proper foreign key in place. I don't think such a thing is defined in standard ANSI SQL either.
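A minimal sketch of the distinction, assuming id is the declared primary key of each table:
-- Within one table the dependency is recognized: grouping by the
-- primary key lets you select any other column of that table.
SELECT p.id, p.content
FROM posts p
GROUP BY p.id;
-- But the recognition does not cross a join, so users.email is only
-- accepted once u.id itself is in the GROUP BY.
SELECT p.content, u.email
FROM posts p
INNER JOIN users u ON p.user_id = u.id
GROUP BY p.id, u.id;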

Fetch multi-participant conversations with last message for each

I am trying to create a simple chat application database schema, and query the conversations. My current table setup is the following:
CREATE TABLE chat_user (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  display_name VARCHAR(140),
  ... other user stuff ...
);
CREATE TABLE conversation (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  title VARCHAR(140),
  created timestamp with time zone NOT NULL
);
CREATE TABLE conversation_message (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  conversation_id bigint NOT NULL,
  sender_id bigint NOT NULL,
  body TEXT NOT NULL,
  created timestamp with time zone NOT NULL
);
CREATE TABLE conversation_participant (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  conversation_id bigint NOT NULL,
  user_id bigint NOT NULL
);
So basically each conversation has its own title and multiple participants. What I would like is to fetch my conversations sorted by the date of the latest message in each conversation (so the conversations with the newest messages are shown first). The result set should contain the id and title of the conversation and the list of participants, plus the id, sender_id and body of the latest message.
It would also be required to fetch the conversations paginated (20 per page).
Is my table setup efficient enough to satisfy the above constraints? It seems to me that this could result in a rather large query with multiple subqueries.
This answers the original version of the question.
You seem to want a join and aggregation:
select cm.conversation_id, max(created)
from conversation_message cm
join conversation_participant cp
  on cm.conversation_id = cp.conversation_id
where cp.user_id = ?
group by cm.conversation_id
order by max(created) desc;
You can try using a lateral join.
You can fetch all of your needed data, apply limits and offsets, and retrieve the last message for each conversation. Hope it helps.
Your query would look something like this:
select *
from conversation c
left join lateral (
  select *
  from conversation_message cm
  where cm.conversation_id = c.id
  order by created desc
  limit 1
) cm on true
left join conversation_participant cp
  on cp.conversation_id = c.id
 and cp.user_id = cm.sender_id;
Left joins here are for the chat rooms without any messages.
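If you also want the ordering and pagination from the question, they can be layered on top of the same lateral join (a sketch; nulls last keeps empty conversations at the end):
select c.id, c.title, cm.id as message_id, cm.sender_id, cm.body
from conversation c
left join lateral (
  select *
  from conversation_message m
  where m.conversation_id = c.id
  order by created desc
  limit 1
) cm on true
order by cm.created desc nulls last
limit 20 offset 0; -- offset 20 for the second page, and so on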
In short: I think you have a reasonable design for a normalized (3NF) OLTP database. That's what you should aim for, not a minimal number of JOINs for one specific use case. The design you have will satisfy the use case you defined and many other use cases, which I'm sure are involved in this application of yours.
In detail:
You're designing for an OLTP system, where data is kept normalized to ensure consistency and to keep OLTP transactions efficient.
This, however, means you will have to do many more JOINs than in a de-normalized database (which is more suitable for OLAP, reporting and analytics systems). This is just the nature of OLTP relational databases.
Trying to reduce the number of JOINs in a normalized (i.e. 3NF, third normal form) database means you will be combining data of different granularity in the same table, causing duplication, making updates harder and slower, and eventually leading to data inconsistency.
So you really shouldn't design with the aim of reducing the number of JOINs. Instead, make sure you have a normalized design and avoid over-normalizing. Where you want to avoid writing long queries, you can add VIEWs and query those instead (though that can sometimes cause sub-optimal query performance by bringing in unnecessary joins).
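For example, a minimal sketch of such a view (the name is illustrative):
create view conversation_message_with_sender as
select cm.*, cu.display_name as sender_name
from conversation_message cm
join chat_user cu on cu.id = cm.sender_id;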
To get the latest message for your conversations there are several ways to achieve it, such as self joins or window functions (row_number(), rank(), etc.). Using a window function you can write your query as:
with cm as (
  select *,
         rank() over (partition by conversation_id order by created desc) as r
  from conversation_message
)
select c.id,
       c.title,
       cm.body,
       cm.created,
       cm.r,
       cu.display_name
from conversation as c
left join cm on c.id = cm.conversation_id and cm.r <= 1
left join chat_user cu on cu.id = cm.sender_id
In the above query I have used left joins to include conversations with no messages; if you need only conversations which have messages, use inner joins instead. If you need more than 1 latest message for each conversation, change cm.r <= #no accordingly.
To get the participants list for each conversation you can add a new CTE, like:
with cm as (
  select *,
         rank() over (partition by conversation_id order by created desc) as r
  from conversation_message
),
message_participants as (
  select m.conversation_id,
         array_agg(u.display_name order by m.created desc) as participants
  from chat_user as u
  join conversation_message as m on u.id = m.sender_id
  group by m.conversation_id
)
select c.id,
       c.title,
       cm.body,
       cm.created,
       cm.r,
       cu.display_name,
       cmp.participants
from conversation c
left join cm on c.id = cm.conversation_id and cm.r <= 1
left join chat_user cu on cu.id = cm.sender_id
left join message_participants cmp on c.id = cmp.conversation_id
Improvements
Add user_id to the conversation table to identify who created the conversation.
The conversation_participant table is redundant, since you can extract the list of participants from conversation_message (but see the sketch below).
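If you do keep conversation_participant (for instance, so users who have never posted still count as participants), the same aggregation can be driven from it instead (a sketch):
with participants as (
  select cp.conversation_id,
         array_agg(u.display_name) as participants
  from conversation_participant cp
  join chat_user u on u.id = cp.user_id
  group by cp.conversation_id
)
select c.id, c.title, p.participants
from conversation c
left join participants p on p.conversation_id = c.id;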

How to add multiple columns in a table with HSQLDB?

I have been trying to use the ALTER TABLE command with HSQLDB to add 2 columns to a table, but with no luck. I know MySQL and other systems support it, but why doesn't HSQLDB? Perhaps I'm using the wrong syntax, I don't know. I also know that I could add the columns one by one, but my application requires the addition of 1000 columns and it is too slow to do it one-by-one.
The reason I'm using HSQLDB is that I need to use it in file mode. I have also tried SmallSQL, but it is much slower than HSQLDB.
You don't need thousands of columns for this. These are standard one-to-many relationships between three tables: questionaire, question and answer:
create table questionaire
(
  id integer not null primary key,
  customer_name varchar(100) not null
);
create table question
(
  id integer not null primary key,
  questionaire_id integer not null references questionaire,
  question_text varchar(20000),
  sort_order integer
);
create table answer
(
  question_id integer not null references question,
  answer_text varchar(20000),
  user_name varchar(50) not null,
  primary key (question_id, user_name)
);
In reality you wouldn't actually store the user's name in the answer table. If you have named users that log in, you probably also need a user_account table, and the answer table would reference the user_account table.
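A sketch of that variant (the columns are illustrative):
create table user_account
(
  user_name varchar(50) not null primary key
  -- login credentials, display name, etc.
);
-- answer.user_name would then be declared as:
--   user_name varchar(50) not null references user_account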
You can easily query this model without needing to resort to a key/value store or JSON.
To get all questionaires for a customer:
select *
from questionaire qu
where customer_name = 'Some company';
Get all questionaires and the number of questions per customer
select qu.customer_name,
count(distinct qu.id) as num_questionaires,
count(q.id) as total_questions
from questionaire qu
join question q on qu.id = q.questionaire_id
group by qu.customer_name;
Get all answers for a questionaire from a specific user
select q.question_text, a.answer_text
from answer a
join question q on q.id = a.question_id
join questionaire qu on qu.id = q.questionaire_id
where qu.id = 1
and a.user_name = 'Marvin'
order by q.sort_order;
A bit more complicated, but probably still fast enough even with thousands of questions and answers: find users that haven't answered all questions
select aq.user_name, aq.questionaire_id, aq.answered_questions, tq.num_questions
from (
select a.user_name, q.questionaire_id, count(*) as answered_questions
from answer a
join question q on q.id = a.question_id
group by a.user_name, q.questionaire_id
) aq join (
select questionaire_id, count(*) as num_questions
from question
group by questionaire_id
) tq on tq.questionaire_id = aq.questionaire_id
where aq.answered_questions < tq.num_questions;
SQLFiddle example: http://sqlfiddle.com/#!15/0a4e5/1
You also shouldn't try to transpose the rows for each question (or answer) into columns in SQL: you will eventually hit the limit on the number of columns the database can manage. Relational databases were designed to handle rows, lots of rows, not "thousands of columns". Transposing rows to columns is typically done in the presentation layer of your application (or, e.g., using a pivot function in a spreadsheet).

Unexpected results after joining another table

I use three tables to get to the final result. They are called project_board_members, users and project_team.
This is the query:
SELECT `project_board_members`.`member_id`,
`users`.`name`,
`users`.`surname`,
`users`.`country`,
`project_team`.`tasks_completed`
FROM `project_board_members`
JOIN `users`
ON (`users`.`id` = `project_board_members`.`member_id`)
JOIN `project_team`
ON (`project_team`.`user_id` = `project_board_members`.`member_id`)
WHERE `project_board_members`.`project_id` = '5'
You can ignore the last line because it just points to the project I'm using.
The table project_board_members holds three entries and has a structure like:
id,
member_id,
project_id,
created_at;
I need to get member_id from that table. Then I join the users table to get name, surname and country. No problems, all works! :)
After that, I needed to get tasks_completed for each user, which is stored in the project_team table. The big unexpected thing is that I got four entries returned, even though the project_board_members table holds only three entries.
Why is that so? Thanks in advance!
A SQL join creates a result set that contains one row for each combination of rows from the left and right tables that matches the join conditions. Without seeing the data, or a little more information, it's hard to say exactly what differs from what you expect, but I'm guessing it's one of the following:
1) You have two entries in project_team with the same user_id.
2) Your entries in project_team store both user_id and project_id, and you need to be joining on both of them rather than just user_id (see the sketch below).
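If it is the second case, the fix looks like this (a sketch, assuming project_team really does carry a project_id column):
SELECT pbm.member_id,
       u.name,
       u.surname,
       u.country,
       pt.tasks_completed
FROM project_board_members pbm
JOIN users u
  ON (u.id = pbm.member_id)
JOIN project_team pt
  ON (pt.user_id = pbm.member_id
 AND  pt.project_id = pbm.project_id) -- join on both keys
WHERE pbm.project_id = '5'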
The table project_board_members represents what is called, in the Entity-Relationship modelling world, an "associative entity". It exists to implement a many-to-many relationship (in this case, between the project and user entities). As such it is a dependent entity, which is to say that the existence of an instance of it is predicated on the existence of an instance of each of the entities to which it refers (a user and a project).
As a result, the columns comprising the foreign keys relating to those entities (member_id and project_id) must form part or all of the primary key.
Normally, instances of an associative entity are unique with respect to the entities they relate. In your case the relationship definitions would be:
Each user is seated on the board of 0-to-many projects;
Each project's board is comprised of 0-to-many users;
which is to say that a particular user may not be on the board of a particular project more than once. The only reason for adding other columns (such as your id column) to the primary key would be if the user:project relationship were non-unique.
To enforce this rule (a user may sit on the board of a particular project just once) the table schema should look like this:
create table project_board_member
(
  member_id int not null references user ( user_id ) ,
  project_id int not null references project ( project_id ) ,
  created_at ...
  ...
  primary key ( member_id , project_id )
)
The id column is superfluous.
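If an ORM or framework insists on a single-column key, an equivalent compromise is to keep id but enforce the pair's uniqueness with a separate constraint (a sketch, with the same illustrative user and project tables as above):
create table project_board_member
(
  id int not null primary key ,
  member_id int not null references user ( user_id ) ,
  project_id int not null references project ( project_id ) ,
  created_at timestamp ,
  unique ( member_id , project_id )
)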
For debugging purposes do
SELECT GROUP_CONCAT(pbm.member_id) AS member_ids,
GROUP_CONCAT(u.name) as names,
GROUP_CONCAT(u.surname) as surnames,
GROUP_CONCAT(u.country) as countries,
GROUP_CONCAT(pt.tasks_completed) as tasks
FROM project_board_members pbm
JOIN users u
ON (u.id = pbm.member_id)
JOIN project_team pt
ON (pt.user_id = pbm.member_id)
WHERE pbm.project_id = '5'
GROUP BY pbm.member_id
Any field that lists multiple entries in the result shows you exactly what is messing up the row count in your result set.
To fix that you can do:
SELECT pbm.member_id,
       u.name,
       u.surname,
       u.country,
       pt.tasks_completed
FROM (SELECT p.project_id, p.member_id
      FROM project_board_members p
      WHERE p.project_id = '5'
      LIMIT 1
     ) AS pbm
JOIN users u
  ON (u.id = pbm.member_id)
JOIN project_team pt
  ON (pt.user_id = pbm.member_id)