I am trying to build a chat application similar to Slack. I want to understand how they have designed their database so that it returns so much information at once when someone loads a chat, and which database is a good fit for this problem. I am adding a screenshot of the same for reference.
Initially, when I started thinking about this, I wanted to go with PostgreSQL and keep the tables normalized to keep things clean, but as I went further, normalization started to feel like a problem.
Users Table

| id | name | email          |
|----|------|----------------|
| 1  | John | john@gmail.com |
| 2  | Sam  | sam@gmail.com  |
Channels Table

| id | channel_name   |
|----|----------------|
| 1  | Channel name 1 |
| 2  | Channel name 2 |
Participants table

| id | user_id | channel_id |
|----|---------|------------|
| 1  | 1       | 1          |
| 2  | 1       | 2          |
| 3  | 2       | 1          |
Chat table

| id | user_id | channel_id | parent_id | message_text   | total_replies | timestamp |
|----|---------|------------|-----------|----------------|---------------|-----------|
| 1  | 1       | 1          | null      | first message  | 0             | -         |
| 2  | 1       | 2          | 1         | second message | 10            | -         |
| 3  | 1       | 3          | null      | third message  | 0             | -         |
The Chat table has a parent_id column, which tells whether a message is a parent message or a child message. I don't want to go with recursive child messages, so this is fine.
Emojis table

| id | user_id | message_id | emoji_uni-code |
|----|---------|------------|----------------|
| 1  | 1       | 12         | U123           |
| 2  | 1       | 12         | U234           |
| 3  | 2       | 14         | U456           |
| 4  | 2       | 14         | U7878          |
| 5  | 3       | 14         | U678           |
A person can react with many emojis to the same message.
When someone loads a chat, I want to fetch the last 10 messages inserted into the tables, along with all the emojis that have been used to react to each message, and the replies, like you can see in the image where it says "1 reply" with the person's profile picture (there can be more than one).
Now, to fetch this data I have to join all the tables and then fetch the rows, which could be a very heavy job on the back-end side, considering this is going to happen very frequently.
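For reference, the kind of read I would need against the normalized tables above looks roughly like this (a sketch assuming PostgreSQL and lowercase table names; the reply previews would still need a further join or a second query):

-- A sketch only: last 10 top-level messages in a channel, with reactions aggregated per message.
SELECT c.id,
       c.message_text,
       c.total_replies,
       u.name,
       json_agg(e."emoji_uni-code") AS reactions -- one array of reactions per message
FROM chat c
JOIN users u       ON u.id = c.user_id
LEFT JOIN emojis e ON e.message_id = c.id
WHERE c.channel_id = 1
  AND c.parent_id IS NULL
GROUP BY c.id, c.message_text, c.total_replies, u.name
ORDER BY c.id DESC
LIMIT 10;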
What I thought is that I would add two more columns to the Chat table, profile_replies and emoji_reactions_count, both of a JSON data type (jsonb), to store data something like the following.
This is for the emoji_reactions_count column. There are two possible shapes; the first is the count-only way:
{
"U123": "123",// count of reactions on an emoji
"U234": "12"
}
When someone reacts, I would update the count and insert or delete the row in the Emojis table. Here I have a question: could too-frequent emoji updates on a message become slow, since I need to update the count in the column above every time someone reacts with an emoji?
OR
storing the user id along with the count, like this; this looks better, as I can get rid of the Emojis table completely:
{
  "U123": {
    "count": 123, // count of reactions on this emoji
    "userIds": [1, 2, 3, 4] // list of user ids who have reacted
  },
  "U234": {
    "count": 12,
    "userIds": [1, 2, 3, 4]
  }
}
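Either shape could be maintained with a single jsonb update per reaction; a rough sketch for the count-only shape, assuming PostgreSQL:

-- Increment the counter for emoji U123 on message id 2 (sketch only).
UPDATE chat
SET emoji_reactions_count = jsonb_set(
        coalesce(emoji_reactions_count, '{}'::jsonb),
        '{U123}',
        to_jsonb(coalesce((emoji_reactions_count->>'U123')::int, 0) + 1)
    )
WHERE id = 2;

Each such update rewrites the whole chat row, which is what makes me wonder whether very frequently reacted-to messages would become slow.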
This is for the profile_replies column:
[
{
"name": 'john',
"profile_image": 'image url',
"replied_on": timestamp
},
... with similar other objects
]
Does this look like a fine solution, or is there anything I can do to improve it, or should I switch to a NoSQL database like MongoDB or Cassandra? I have considered MongoDB, but that does not look very good either, because joins are slow when data grows exponentially, whereas in SQL this does not happen to the same degree.
Even though this is honestly more like a discussion and there is no perfect answer to such a question, I will try to point out things you might want to consider if rebuilding Slack:
Emoji table:
As @Alex Blex already commented, this can be neglected at the very beginning of a chat application. Later on, reactions could either be injected by some cache in your application, somewhere in middleware or the view or wherever, or stored directly with your message. There is no need to JOIN anything on the database side.
Workspaces:
Slack is organized in workspaces, which you can participate in with the very same user. Every workspace can have multiple channels, and every channel can have multiple guests. Every user can join multiple workspaces (as admin, full member, single-channel or multi-channel guest). Try to start with that idea.
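A possible starting point for the workspace side (just a sketch; all names are illustrative):

create table workspace
(
    id   bigserial primary key,
    name text not null
);

create table workspace_member
(
    workspace_id bigint not null references workspace (id),
    user_id      bigint not null,
    role         text   not null, -- e.g. 'admin', 'member', 'single_channel_guest', 'multi_channel_guest'
    primary key (workspace_id, user_id)
);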
Channels:
I would refactor the channel wording to e.g. conversation because, basically (personal opinion here), I think there is not much of a difference between e.g. a channel with 10 members and a direct conversation involving 5 people, except for the fact that users can join (open) channels later on and see previous messages, which is not possible for closed channels and direct messages.
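Following that idea, channels and direct messages could share one structure (again just a sketch; names are illustrative):

create table conversation
(
    id           bigserial primary key,
    workspace_id bigint not null references workspace (id),
    name         text,                  -- null for direct conversations
    is_private   bool not null default false
);

create table conversation_member
(
    conversation_id bigint not null references conversation (id),
    user_id         bigint not null,
    joined_at       timestamp with time zone not null default now(),
    primary key (conversation_id, user_id)
);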
Now for your actual database layout question:
Adding columns like reply_count or profile_replies can be very handy later on when you are developing an admin dashboard with all kinds of statistics, but they are absolutely not required for the client.
Assuming your client does a small call to "get workspace members" upon joining / launching the client (and then obviously renews the cache frequently on the client's side), there is no need to store user data with the messages; even if there are 1,000 members in the same workspace, it should be only a few MiB of information.
Assuming your client does the same with a call to "get recent workspace conversations" (of course you can filter by whether they are public and joined), you are going to have a nice list of the channels you are already in and the last people you have talked to.
create table message
(
id bigserial primary key,
workspace_id bigint not null,
conversation_id bigint not null,
parent_id bigint,
created_dt timestamp with time zone not null,
modified_at timestamp with time zone,
is_deleted bool not null default false,
content jsonb
)
partition by hash (workspace_id);
create table message_p0 partition of message for values with (modulus 32, remainder 0);
create table message_p1 partition of message for values with (modulus 32, remainder 1);
create table message_p2 partition of message for values with (modulus 32, remainder 2);
...
So basically your query against the database whenever a user joins a new conversation is going to be:
SELECT * FROM message WHERE workspace_id = 1234 AND conversation_id = 1234 ORDER BY created_dt DESC LIMIT 25;
And when you start scrolling up it's going to be:
SELECT * FROM message WHERE workspace_id = 1234 AND conversation_id = 1234 AND id < 123456789 ORDER BY created_dt DESC LIMIT 25;
and so on... As you can see, you can now select messages very efficiently by workspace and conversation if you additionally add an index like the following (it might differ if you use partitioning):
create index idx_message_by_workspace_conversation_date
on message (workspace_id, conversation_id, created_dt)
where (is_deleted = false);
For the message format I would use something similar to what Twitter does; for more details, please check their official documentation:
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet
Of course your client v14 should know how to 'render' all objects from v1 to v14, but that's the great thing about message format versioning: it is backwards compatible, and you can launch a new format supporting more features whenever you feel like it. A primitive example of content could be:
{
"format": "1.0",
"message":"Hello World",
"can_reply":true,
"can_share":false,
"image": {
"highres": { "url": "https://www.google.com", "width": 1280, "height": 720 },
"thumbnail": { "url": "https://www.google.com", "width": 320, "height": 240 }
},
"video": null,
"from_user": {
"id": 12325512,
"avatar": "https://www.google.com"
}
}
The much more complicated question, imo, is efficiently determining which messages have been read by each and every user. I will not go into detail about how to send push notifications, as that should be done by your backend application and not by polling the database.
Using the previously gathered data from "get recent workspace conversations" (something like SELECT * FROM user_conversations ORDER BY last_read_dt DESC LIMIT 25 should do; in your case the Participants table, where you would have to add both last_read_message_id and last_read_dt), you can then do a query to get which messages have not been read yet, using:
- a small stored function returning messages
- a JOIN statement returning those messages (see the sketch after this list)
- an UNNEST / LATERAL statement returning those messages
- maybe something else that doesn't come to my mind at the moment. :)
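For example, the JOIN variant could look roughly like this (a sketch; it assumes the membership table has been extended with last_read_message_id as described above):

-- Unread messages for user 42, per conversation (sketch only).
SELECT m.*
FROM conversation_member cm
JOIN message m
  ON m.conversation_id = cm.conversation_id
 AND m.id > coalesce(cm.last_read_message_id, 0)
WHERE cm.user_id = 42
ORDER BY m.conversation_id, m.created_dt;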
And last but not least I would highly recommend not trying to rebuild Slack as there are so many more topics to cover, like security & encryption, API & integrations, and so on...
Related
I am currently building a messaging feature and my goal is to get the last message between two users (a message inbox). And then when a user clicks on the last message, a full conversation between the two users will show. I currently have a messages table with the following data:
create table messages(
message_id serial primary key not null,
user_id_sender integer,
user_id_receiver integer,
subject varchar(100),
message text,
seen boolean default false,
date timestamptz
)
My question is: is it better to create an inbox table to log the last conversation between two users, or is there a query that finds the last message between the two users and displays it using just the messages table? I have looked into and tried using CTEs, triggers and playing with SELECT DISTINCT that others have posted, to no avail. I'm using React for my front end and get duplicate-key issues with left joins on null values. I would like to know A. the best way to approach and normalize the data, and B. the best query that would produce the desired result.
Any help would be appreciated; I'm just trying to find the right path and the best things to look into and learn to solve this kind of problem.
I have tried using CTEs, triggers, SELECT DISTINCT
Regarding your first question, is it better to create an inbox table: the answer is yes for retrieval of your latest message, assuming you store only one record with the last message in that inbox table and that same record is always updated with the most recent message. You can update this table from a trigger on your main table (i.e. the messages table); the downside would be a small performance hit during insertion, and you can take a call on that after you check with a load test.
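A rough sketch of that trigger approach, assuming PostgreSQL 11+ and an inbox table keyed by the (unordered) user pair; all names here are illustrative:

-- One row per user pair, always holding the latest message (sketch only).
create table inbox (
    user_a     integer not null,
    user_b     integer not null,
    message_id integer not null references messages (message_id),
    date       timestamptz not null,
    primary key (user_a, user_b)
);

create or replace function refresh_inbox() returns trigger as $$
begin
    insert into inbox (user_a, user_b, message_id, date)
    values (least(new.user_id_sender, new.user_id_receiver),
            greatest(new.user_id_sender, new.user_id_receiver),
            new.message_id,
            new.date)
    on conflict (user_a, user_b)
    do update set message_id = excluded.message_id,
                  date       = excluded.date;
    return new;
end;
$$ language plpgsql;

create trigger messages_refresh_inbox
after insert on messages
for each row execute function refresh_inbox();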
Coming to your second question, is there a query to find the last message using the messages table: you can have a subquery that selects the MAX of date from the messages table and then use that date in your main query, like date = (select max(date) ...), to get the latest message. Make sure to create an index on the date column, and for better performance, add an additional where clause on the date column in the main query (for example, date should be greater than the current date minus 30 days).
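In query form, that suggestion looks roughly like this (a sketch over the messages table from the question; user ids 1 and 2 are placeholders):

-- Latest message between users 1 and 2 (sketch only).
SELECT *
FROM messages m
WHERE ((m.user_id_sender = 1 AND m.user_id_receiver = 2)
    OR (m.user_id_sender = 2 AND m.user_id_receiver = 1))
  AND m.date > now() - interval '30 days'   -- the extra clause suggested above
  AND m.date = (SELECT max(date)
                FROM messages
                WHERE (user_id_sender = 1 AND user_id_receiver = 2)
                   OR (user_id_sender = 2 AND user_id_receiver = 1));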
I am working on my cron system, which gathers information via an API call. For the most part it has been fairly straightforward, but now I am faced with multiple difficulties, as the API response depends on who is making the request. It runs through each user's API key, and certain information will be visible or hidden to them and vice versa.
There are teams, and users are part of teams. A user can stealth their move; all information will still be shown to them and their team, but it will not be visible to their opponent. Both teams share the same attack id and have access to the same information, it's just that one side can see more of it than the other.
Defendant's Point Of View
"attacks": {
"12345`": {
"timestamp": 1645345234,
"attacker_id": "",
"attacker_team_id": "",
"defender_id": 321,
"defender_team_id": 1,
"stealthed": 1
}
}
Attacker's Point Of View
"attacks": {
"12345`": {
"timestamp": 1645345234,
"attacker_id": 123,
"attacker_team_id": 2
"defender_id": 321,
"defender_team_id": 1,
"stealthed": 1,
"boosters": {
"fair_fight": 3,
"retaliation": 1,
"group_attack": 1
}
}
}
So, if the defendant's API key is used first, id 12345 will already be in the team_attacks table but will not include the attacker_id and attacker_team_id. For each insert thereafter, I need to check whether the new record's ID already exists and whether there is any additional information to add to the row.
Here is the part of my code that loops through the API response and obtains the data; it loops through all the attacks per API key:
else if ($category === "attacks") {
$database = new Database();
foreach($data as $attack_id => $info) {
$database->query('INSERT INTO team_attacks (attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id) VALUES (:attack_id, :attacker_id, :attacker_team_id, :defender_id, :defender_team_id)');
$database->bind(':attack_id', $attack_id);
$database->bind(':attacker_id', $info["attacker_id"]);
$database->bind(':attacker_team_id', $info["attacker_team_id"]);
$database->bind(':defender_id', $info["defender_id"]);
$database->bind(':defender_team_id', $info["defender_team_id"]);
$database->execute();
}
}
I have also been submitting to the news table, and typically I have simply been submitting "X new entries have been added" or whatnot; however, I haven't a clue whether there is a way, during the above, to check for new entries and updated entries so that I can produce two news items:
2 attacks have been updated.
49 new attacks have been added.
For this part, I was simply counting how many items are in the array, but this only works for the first ever upload; I know I cannot simply count the array length on future inserts, which require additional checks.
If the attack_id does NOT already exist, I also need to submit the boosters into another table. For this I was adding them to an array during the above loop and then looping through them to submit those, but this also depends on the above check, not simply attempting an upload for each one without any checks. Boosters will share the attack_id.
With over 1,000 teams that will potentially have at least one member join my site, I need this to be as efficient as possible. The API gives the last 100 attacks per call, and I want this to run within my cron, which collects any new data every 30 seconds, so I need to sort through potentially 100,000 records.
In SQL, you can check conditions when inserting new data using merge:
https://en.wikipedia.org/wiki/Merge_(SQL)
Depending on the database you are using, the name and syntax of the command might be different. Common names for the command are also upsert and replace.
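For example, assuming MySQL, a UNIQUE key on attack_id, and that unknown attacker fields are stored as NULL, the upsert could look like:

-- Sketch only: insert the attack, or fill in attacker details if the row already exists.
INSERT INTO team_attacks
    (attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id)
VALUES
    (:attack_id, :attacker_id, :attacker_team_id, :defender_id, :defender_team_id)
ON DUPLICATE KEY UPDATE
    attacker_id      = COALESCE(attacker_id, VALUES(attacker_id)),
    attacker_team_id = COALESCE(attacker_team_id, VALUES(attacker_team_id));

In MySQL, the affected-rows count for such a statement is 1 for a fresh insert and 2 when an existing row is updated, which could be used to build the "X added / Y updated" news counts.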
But: if you are seeking high performance and near-realtime behaviour, consider using a cache holding critical aggregated data instead of doing the aggregation 100,000 times per minute.
This may or may not be the "answer" you're looking for. The question(s) imply use of a single table for both teams. It's worth considering one table per team for writes to avoid write contention altogether. The two data sets could be combined at query time in order to return "team" results via the API. At scale, you could have another process calculating and storing combined team results in an API-specific cache table that serves the API request.
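If you go that route, combining the two data sets at query time could be done with a simple view (a sketch; the per-team table names are illustrative):

-- Sketch only: two identically-shaped write tables, combined for reads.
CREATE VIEW team_attacks_all AS
SELECT attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id
FROM team_attacks_team_a
UNION ALL
SELECT attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id
FROM team_attacks_team_b;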
Problem statement: one of our modules (the Ticket module stored procedure) is taking 4-5 seconds to return data from the DB. This stored procedure supports 7-8 filters, plus it joins 4-5 tables to get the text for the IDs stored in the Ticket tables, e.g. client name, ticket status, ticket type..., and this has hampered the performance of the SP.
Current tech stack: ASP.NET 4.0 Web API, MS SQL 2008
We are planning to introduce Redis as a caching server and Node js with the aim of improving the performance and scalability.
Use case: we have a service ticket module and it has the following attributes:
TicketId
ClientId
TicketDate
Ticket Status
Ticket Type
Each user of this module has access to a fixed number of clients, i.e.
User1 has access to tickets of Clients 1, 2, 3, 4...
User2 has access to tickets of Clients 1, 2, 5, 7...
So basically, when User1 accesses the ticket module, he should be able to filter service tickets on TicketId, Client, Ticket Date (from and to), Ticket Status (Open, Hold, In Process...) and Ticket Type (Request, Complaint, Service...). Since User1 has access to only Clients 1, 2, 3, 4..., the cache should only give the list of tickets for the clients he has access to.
Would appreciate it if you could share your views on how we should structure the data in Redis, i.e. what we should use for each of the above items (hash, set, sorted set...), and how we should filter the tickets depending on which clients the respective user has access to.
Redis is a key/value store. I would use a hash with a structure like:
Key: ticketId
Subkeys: clientId, ticketDate, ticketStatus, ticketType
Search, sorting, etc. should be handled programmatically from the application and/or in Lua.
I have a table that looks like the following:
game_stats table:
id | game_id | player_id | stats | (many other cols...)
----------------------
1 | 'game_abc' | 8 | 'R R A B S' | ...
2 | 'game_abc' | 9 | 'S B A S' | ...
A user uploads data for a given game in bulk, submitting both players' data at once. For example:
"game": {
id: 'game_abc',
player_stats: {
8: {
stats: 'R R A B S'
},
9: {
stats: 'S B A S'
}
}
}
Submitting this to my server should result in the first table.
Instead of updating the existing rows when the same data is submitted again (with revisions, for example) what I do in my controller is first delete all existing rows in the game_stats table that have the given game_id:
class GameStatController
  def update
    game_id = params[:game][:id]
    # Delete all existing rows for this game, then re-create them from the submitted stats
    GameStat.where("game_id = ?", game_id).destroy_all
    params[:game][:player_stats].each do |player_id, stats|
      GameStat.create!(game_id: game_id, player_id: player_id, stats: stats[:stats])
    end
  end
end
This works fine with a single threaded or single process server. The problem is that I'm running Unicorn, which is a multi-process server. If two requests come in at the same time, I get a race condition:
Request 1: GameStat.where(...).destroy_all
Request 2: GameStat.where(...).destroy_all
Request 1: Save new game_stats
Request 2: Save new game_stats
Result: Multiple game_stat rows with the same data.
I believe somehow locking the rows or table is the way to go to prevent multiple updates at the same time - but I can't figure out how to do it. Combining with a transaction seems the right thing to do, but I don't really understand why.
EDIT
To clarify why I can't figure out how to use locking: I can't lock a single row at a time, since the row is simply deleted and not modified.
ActiveRecord doesn't support table-level locking by default. You'll have to either execute db-specific SQL or use a gem like Monogamy.
Wrapping up the save statements in a transaction will speed things up if nothing else.
Another alternative is to implement the lock with Redis. Gems like redis-lock are also available. This will probably be less risky as it doesn't touch the DB, and you can set Redis keys to expire.
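If you go the db-specific SQL route, a minimal sketch assuming PostgreSQL would be:

-- Sketch only: serialize the delete-then-insert section across processes.
BEGIN;
LOCK TABLE game_stats IN SHARE ROW EXCLUSIVE MODE; -- blocks concurrent writers, allows reads
DELETE FROM game_stats WHERE game_id = 'game_abc';
INSERT INTO game_stats (game_id, player_id, stats) VALUES ('game_abc', 8, 'R R A B S');
INSERT INTO game_stats (game_id, player_id, stats) VALUES ('game_abc', 9, 'S B A S');
COMMIT;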
I'm implementing a PHP page that displays data in a paginated format. My problem is that these data change in real time, so when the user requests the next page, I submit the last id and a query is executed to retrieve 10 more rows after that last id, ordered by a column whose value changes in real time. For example, I have 20 rows:
| id  | col_real_time |
|-----|---------------|
| 1   | 5             |
| 2   | 3             |
| 3   | 11            |
| ... | ...           |
I get data sorted by col_real_time in ascending order, so result is
id 2, id 1, id 3
Now, in real time, id 2's col_real_time changes to 29 before the user requests the next page. The user then requests the next results, and because id 2 is now 29, he sees it again even though he has already seen it.
How can I handle this?
"Now in realtime id 2 change"
You basically have to take a snapshot of the data if you don't want the data to appear to change to the user. This isn't something that you can do very efficiently in SQL, so I'd recommend downloading the entire result set into a PHP session variable that can be persisted across pages. That way, you can just get rows on demand. There are Javascript widgets that will effectively do the same thing, but will send the entire result set to the client which is a bad idea if you have a lot of data.
This is not as easy to do as pure SQL pagination, as you will have to take responsibility for cleaning the stored var out when it's no longer needed. Otherwise, you'll rather quickly run out of memory.
If you have just a few pages, you could:
- Save it to the session and page over it, instead of going back to the database server.
- Save it to a JSON object list and use jQuery to read it and page over it.
- Save it to a temp table indicating generation timestamp, user_id and session_id, and page over it (see the sketch below).
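A sketch of the third option, assuming MySQL 8+ (for ROW_NUMBER) and illustrative table and column names:

-- Sketch only: freeze the ordering once, then page over the frozen snapshot.
CREATE TABLE page_snapshot (
    user_id      INT NOT NULL,
    session_id   VARCHAR(64) NOT NULL,
    generated_at TIMESTAMP NOT NULL,
    position     INT NOT NULL,   -- ordering frozen at snapshot time
    item_id      INT NOT NULL,
    PRIMARY KEY (user_id, session_id, position)
);

-- Taken when the first page is requested:
INSERT INTO page_snapshot (user_id, session_id, generated_at, position, item_id)
SELECT 42, 'abc123', NOW(),
       ROW_NUMBER() OVER (ORDER BY col_real_time ASC),
       id
FROM items;

-- Subsequent pages read the snapshot, not the live ordering:
SELECT i.*
FROM page_snapshot s
JOIN items i ON i.id = s.item_id
WHERE s.user_id = 42 AND s.session_id = 'abc123'
ORDER BY s.position
LIMIT 10 OFFSET 10;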