Model a simple social network in Azure Table service

What is the best table design for a simple social networking website using Azure Table Service?
The website could have millions of users.
Users need to be able to view a list of all other users in the system sorted by the number of mutual connections.
Users must be able to view a list of their connections.
Users must be able to view content posted by themselves and their connections.
One major design constraint is that Azure Table service queries are effectively limited to the partition key and row key once there are a large number of records; anything beyond that gets really slow. Another constraint is that query results are only sorted by the partition key and then the row key.

Try this Design:
UserTable
PK: GUID (a GUID partition key maximizes scalability: every user becomes its own single-row partition, which Azure can spread across servers)
RK: GUID
... Rest of properties
UserFriendsTable
PK: UserTable.RK (each user's friend list gets its own partition, so it can be served from a separate server)
RK: GUID
FriendWith: UserTable.PK + "-" + UserTable.RK (concatenate the PK and RK from the user table, separated with "-"; this lets you run a fast point query when you access a friend's profile)
PostsTable
PK: UserTable.RK + "-" + YYYYMM + random number (this lets Azure place each user's monthly posts in their own partitions and servers; the random number prevents Azure from automatically grouping sequential partitions together. You can query posts by filtering on a partition-key prefix, e.g. PK starts with XCtghi94ktY-201411; see the sketch after this design.)
RK: use the following code to generate the row key in descending order, so the latest post comes first.
// Inverted ticks: later posts produce smaller values, so they sort first.
long ticks = DateTimeOffset.MaxValue.UtcDateTime.Ticks - DateTimeOffset.UtcNow.Ticks;
// The random suffix keeps two posts written in the same tick from colliding.
string guid = Guid.NewGuid().ToString("N");
string rowKey = string.Format("{0:d21}-{1}", ticks, guid);
Post : String
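To make the partition-key prefix query mentioned for the PostsTable concrete, here is a minimal C# sketch, assuming the classic Microsoft.WindowsAzure.Storage SDK; the PostEntity class and the GetPostsForMonth method are illustrative, not part of the design above.
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;

public class PostEntity : TableEntity
{
    public string Post { get; set; }
}

public static class PostQueries
{
    // Fetch everything a user posted in a given month by range-scanning all partitions
    // whose key starts with "<userRowKey>-YYYYMM" (the random suffix spreads them over
    // different servers, but they still share this prefix).
    public static IEnumerable<PostEntity> GetPostsForMonth(CloudTable postsTable, string userRowKey, string yyyymm)
    {
        string prefix = userRowKey + "-" + yyyymm;
        string filter = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, prefix),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, prefix + "~"));
        // '~' sorts after the digits used for the random suffix, so this bounds the prefix range.
        var query = new TableQuery<PostEntity>().Where(filter);
        return postsTable.ExecuteQuery(query);   // within each partition, rows arrive newest first
    }
}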

Related

Exchangeable fields in SQL index

I'm designing a table for a list of chats. Every chat references two users and must be unique per user pair. A user pair is symmetric under permutation: it does not matter which user in the pair comes first and which one comes second.
Suppose the user table has an integer UserID as its PK. The chat table will then have a pair of FK fields: UserID1 and UserID2. I want every pair of users to have a unique record in the chats table, so I can create a unique INDEX(UserID1, UserID2). However, this index is not unique in terms of user pairs, because it also admits the permutation where UserID2 comes first: (UserID1, UserID2) and (UserID2, UserID1) count as two distinct pairs, while by the required logic they should be treated as one record.
Is there a way to implement this construct in pure SQL, without external coding such as DB triggers or scripting? I'm using MS SQL Server for prototyping, but I want the design to be as universal and neutral as possible, compatible with most SQL-compliant databases. It is more a question about optimal architecture than about specific SQL implementation code.
Possible ideas:
Make an index on a hash of the ordered user pair
Check the user order and put the user with the smaller ID first in all queries
But all of these push the restriction into application code outside of SQL (a sketch of the second idea follows).
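As an illustration of the second idea (which, as noted, moves the restriction into application code rather than keeping it in pure SQL), a minimal C# sketch; the UserPair name is just for illustration:
public static class UserPair
{
    // Always store the smaller UserID first, so the unique index on (UserID1, UserID2)
    // can never contain both permutations of the same pair.
    public static (int UserId1, int UserId2) Normalize(int a, int b)
        => a <= b ? (a, b) : (b, a);
}

// Usage: UserPair.Normalize(42, 7) returns (7, 42) regardless of argument order;
// apply it before every INSERT and lookup on the chat table.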

Table design in Azure Table Storage

I need to build a REST service for messaging using Azure. Right now I have a problem with the DB design. I have 3 tables: users, chats, and chat messages.
Users contains user data like login, password hash, salt.
Chats contains: PartitionKey = userLogin, RowKey = chatId, and nowInChat (whether the user is currently in the chat).
Messages of chat contains: PartitionKey, which consists of
userLogin_chatId_datetimeticks
(zevis_8a70ff8d-c363-4eb4-8a51-f853fa113fa8_634292263478068039),
RowKey = messageId, plus message and sender (the userLogin).
I see a disadvantage in this design: imagine that two users communicated actively a year ago but no longer talk, and one of them wants to look at the history. I would have to send a large number of requests to the server, each covering a time interval such as a week. Sending a single request for everything earlier than today would be ineffective, because we would get the whole history back.
How should we change the design of the table?
Because Azure Storage Tables do not support secondary indexes, and storage is very inexpensive, your best option is to store the data twice, using different partition and/or row keys. From the Azure Storage Table Design Guide:
To work around the lack of secondary indexes, you can store multiple
copies of each entity with each copy using different PartitionKey and
RowKey values
https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#table-design-patterns
Thank you for your post; you have two options here. The easiest answer, with the least amount of design change, would be to include a StartTime and EndTime in the Chat table. Although these properties would not be indexed, I'm guessing there will not be many rows to scan once you filter on the UserID.
The second option requires a bit more work but is cleaner: create an additional table with PartitionKey = UserID and RowKey = DateTimeTicks, with the entity properties containing the ChatID. This lets you quickly filter by user for a given date or date range. (This is the denormalization approach described above.)
Hopefully this helps your design progress.
I would create a separate table with these PK and RK values:
Partition Key = UserID, Row Key = DateTime.MaxValue.Ticks - DateTimeTicks
Optionally you can also append ChatId to the end of the Row Key above.
This way the most recent communication made by the user will always be on top, so you can later query the table by passing in only the UserId and a take count (e.g. take count = 1 if you want the user's latest chat entry). The query will also be very fast: because you use inverted ticks for the row keys, the Azure Table storage service sorts the entries for a given user id in increasing lexicographical order of RowKey, always keeping the latest chat at the top of the partition, since it has the smallest inverted tick value.
Even if you append the ChatId to the RowKey (i.e. InvertedTicks_ChatId), the sort order will not change and the latest conversation will stay on top regardless of chat id.
Once you read the entity back, you subtract the inverted ticks from DateTime.MaxValue.Ticks to recover the actual date.
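A minimal C# sketch of this inverted-ticks index, assuming the classic Microsoft.WindowsAzure.Storage SDK; ChatIndexEntity, the table variable and the method names are illustrative.
using System;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Table;

public class ChatIndexEntity : TableEntity
{
    public string ChatId { get; set; }
}

public static class ChatIndex
{
    // Write side: PartitionKey = user id, RowKey = inverted ticks (optionally suffixed with the chat id).
    public static ChatIndexEntity BuildEntry(string userId, string chatId, DateTimeOffset sentAt)
    {
        long invertedTicks = DateTime.MaxValue.Ticks - sentAt.UtcTicks;
        return new ChatIndexEntity
        {
            PartitionKey = userId,
            RowKey = string.Format("{0:d19}_{1}", invertedTicks, chatId),
            ChatId = chatId
        };
    }

    // Read side: the first row of the partition is always the user's most recent chat.
    public static ChatIndexEntity GetLatestChat(CloudTable chatIndexTable, string userId)
    {
        var query = new TableQuery<ChatIndexEntity>()
            .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, userId))
            .Take(1);
        return chatIndexTable.ExecuteQuery(query).FirstOrDefault();
    }

    // To recover the original timestamp: new DateTime(DateTime.MaxValue.Ticks - invertedTicks).
}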

Best way to maintain data integrity between local and remote sql databases

So I have what would seem like a common question that I can't find an answer to. I'm trying to find the "best practice" for how to architect a database that maintains data locally, then syncs that data to a remote database that is shared between, and used by, many clients.
For example, say I had a desktop application that stores to-do lists (in SQL), each with individual items. I want to be able to send that data to a web service that holds a "master" copy of all the different clients' information. I'm not worried about sync conflicts as much as I am just trying to think through the actual architecture of the client's tables and the web service's tables.
Here's an example of how I was thinking about it:
Client Database
list
--list_client_id (primary key, auto-increment)
--list_name
list_item
--list_item_client_id (primary key, auto-increment)
--list_id
--list_item_text
Web Based Master Database (Shared between many clients)
list
--list_master_id
--list_client_id (primary key, auto-increment)
--list_name
--user_id
list_item
--list_item_master_id (primary key, auto-increment)
--list_item_remote_id
--list_id
--list_item_text
--user_id
The idea is that the client can create to-do lists with items and sync them with the web service at any time (i.e. if they lose data connectivity and aren't able to send the information until later, nothing will get out of order). The web service would simply store the client's ids as extra fields on its records.
That way, the client can say "update list number 4 with a new name" and the server takes this to mean "update user 12's list number 4 with a new name".
I think the general concept you're working with is headed in the right direction, but you may need to pay careful attention to the use of auto-increment columns. For example, an auto-increment on the server is useless if the client is the owner of that ID. Instead, you probably want list.list_master_id to be the auto-increment. Everything else you've mentioned is entirely plausible, though the complexity increases if there can be multiple clients per user. Then an auto-increment alone probably isn't sufficient; you may need a GUID, or a datatype that also includes a client identifier, to prevent id collisions (see the sketch below).
Without having more details it would be difficult to speculate on what other situations you may need to consider.
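As a small illustration of the collision point above, a hedged C# sketch; ListRow, SyncId and SyncKeys are made-up names, not from the question's schema:
using System;

public class ListRow
{
    // Generated on the client at insert time; safe to merge from any device.
    public Guid SyncId { get; set; } = Guid.NewGuid();
    public string ListName { get; set; }
}

public static class SyncKeys
{
    // Alternative: keep the local auto-increment id, but qualify it with a client identifier
    // so two clients' "list 4" can never collide on the server.
    public static string Composite(int clientId, long localListId) => clientId + "-" + localListId;
}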
SERVER:
list
--id
--name
--user_id
--updated_at
--created_from_device_id
The following 2 tables link all the records; they could also be combined into one table.
list_ids
--list_id
--device_id
--device_record_id
user_ids
--user_id
--device_id
--device_record_id
CLIENT (device_id=5)
list
--id
--name
--user_id
--updated_at
That will allow you to save records like this (only showing the relevant fields):
server
list: id=1, name=shopping, user_id=1234
user: id=27, name=John Doe
list_ids: list_id=1, device_id=5, device_record_id=999
user_ids: user_id=27, device_id=5, device_record_id=567
client
id=999, name=shopping, user_id=567
This way the clients are totally unaware of any server IDs, translations can be done quite fast, and you can supply each client only with information and IDs it already knows about (a small translation sketch follows).
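A rough sketch of that translation step, using an in-memory dictionary in place of a query against the list_ids table; the class and method names are made up for illustration:
using System.Collections.Generic;

// Mirrors the list_ids table above: maps a (device_id, device_record_id) pair coming
// from a client to the server-side list id.
public class ListIdMap
{
    private readonly Dictionary<(int DeviceId, long DeviceRecordId), long> map =
        new Dictionary<(int DeviceId, long DeviceRecordId), long>();

    public void Register(long serverListId, int deviceId, long deviceRecordId)
        => map[(deviceId, deviceRecordId)] = serverListId;

    // Returns null when the record has never been synced, i.e. the server must create a new row.
    public long? Translate(int deviceId, long deviceRecordId)
        => map.TryGetValue((deviceId, deviceRecordId), out var id) ? id : (long?)null;
}

// From the sample data above: device 5's record 999 translates to server list_id 1.
//   var map = new ListIdMap();
//   map.Register(1, 5, 999);
//   long? serverId = map.Translate(5, 999);   // 1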
I have the same issue with a project I am working on. The solution in my case was to create an extra nullable field named remote_id in the local tables. When synchronizing records from the local to the remote database, if remote_id is null it means that this row has never been synchronized, and the server needs to return a unique id matching the remote row id.
Local Table                  Remote Table
_id (used locally)
remote_id ------------------- id
name ------------------------ name
In the client application I link tables by the _id field; remotely I use the remote_id field to fetch data, do joins, etc.
example locally:
Local Client Table           Local ClientType Table              Local ClientType
                             _id
                             remote_id
_id ------------------------ client_id
remote_id                    client_type_id -------------------- _id
                                                                 remote_id
name                         name                                name
example remotely:
Remote Client Table          Remote ClientType Table             Remote ClientType
id ------------------------- client_id
                             client_type_id -------------------- id
name                         name                                name
Without any logic in the code, this scenario would cause data integrity failures, as the client_type table may not match the real id in either the local or the remote tables. Therefore, whenever a remote_id is generated, the server returns a signal to the client application asking it to update the local _id field; this fires a previously created trigger in SQLite that updates the affected tables.
http://www.sqlite.org/lang_createtrigger.html
1- the remote_id is generated on the server
2- the server returns a signal to the client
3- the client updates its _id field and fires a trigger that updates the local tables that join on the local _id
Of course I also use a last_updated field to help with synchronization and to avoid duplicate syncs.
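A rough C# sketch of steps 1-3 above; PushToServer, UpdateLocalRemoteId and PushUpdateToServer are hypothetical stand-ins for the real HTTP calls and the local SQLite UPDATE (which is what fires the trigger):
using System;

public class LocalClientRow
{
    public long Id { get; set; }            // _id, used locally
    public long? RemoteId { get; set; }     // null until the row has been synchronized
    public string Name { get; set; }
    public DateTime LastUpdated { get; set; }
}

public static class Sync
{
    public static void SyncRow(LocalClientRow row)
    {
        if (row.RemoteId == null)
        {
            // 1- the remote_id is generated on the server
            long remoteId = PushToServer(row);

            // 2- the returned id is the signal back to the client
            // 3- writing it locally fires the SQLite trigger that fixes up the joined tables
            UpdateLocalRemoteId(row.Id, remoteId);
            row.RemoteId = remoteId;
        }
        else
        {
            PushUpdateToServer(row);   // already linked; LastUpdated guards against duplicate syncs
        }
    }

    // Hypothetical stubs standing in for the real HTTP call and local SQLite update.
    static long PushToServer(LocalClientRow row) => 0;               // POST to the web service
    static void UpdateLocalRemoteId(long localId, long remoteId) { } // UPDATE ... SET remote_id = ...
    static void PushUpdateToServer(LocalClientRow row) { }           // PUT changed fields
}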

How to properly index my database to increase query performance

I'm working on a simple login page using OpenID: if the user has just registered for an OpenID, I need to create a new entry in the database for the user; otherwise I just display their alias with a greeting. Every time somebody gets authenticated with their OpenID, I must find their alias by looking up which user has the given OpenID, and it seems that this might be fairly slow if the primary key is the UserID (and there are millions of users).
I'm using SQL Server 2008 and I have two tables in my database (Users and OpenIDs): I plan to check whether the OpenID exists in the OpenIDs table, then use the corresponding UserID to get the rest of the user information from the Users table.
The Users table is indexed by UserID and has the following columns:
UserID (pk)
EMail
Alias
OpenID (fk)
The OpenIDs table is indexed by OpenID and has the following columns:
OpenID (pk)
UserID (fk)
Alternatively, I could index the Users table by both UserID and OpenID (i.e. have two indexes) and completely drop the OpenIDs table.
What would be the recommended way to improve the query for a user with the matching OpenID in this case: index the Users table with two keys or use the OpenIDs table to find the matching UserID?
Maybe the answers to What are some best practises and “rules of thumb” for creating database indexes? can help you.
Without knowing what kind of queries you'll be running in detail, I would recommend indexing the two foreign key columns - Users.OpenID and OpenIDs.UserID.
Indexing the foreign keys is typically a good idea to help with JOIN conditions and other queries.
But quite honestly, if you use the OpenIDs table only to check the existence of an OpenID, you'd be much better off just indexing that column in the Users table (possibly with a unique index?) and being done with it. The OpenIDs table as you have it now serves no real purpose at all; it just takes up space for redundant information.
Other than that: you need to observe how your application behaves, sample some usage data, and then see what kinds of queries run the most often and take the longest, and then start doing performance tweaking. Don't overdo the ahead-of-time performance optimizations; too many indexes can be worse than having none at all!
Every time somebody gets authenticated with their Open ID, I must find their alias by looking up which user has the given OpenID and it seems that it might be fairly slow if the primary key is the UserID (and there are millions of users).
Actually, quite the contrary! If you have a value that's unique amongst millions of rows, finding that single value is actually quite quick, even with millions of users. It takes only a handful of index page reads (a B-tree over a million rows is just a few levels deep), and bang! you have your one user out of a million. If you have an index on that OpenID column, it should be pretty fast indeed. Such a highly selective index (one value picks out 1 in a million) works very, very efficiently.

Optimizing SQL to determine unique page views per user

I need to determine if user has already visited a page, for tracking unique page views.
I have already implemented some HTTP header cache, but now I need to optimize the SQL queries.
The visit is unique, when:
pair: page_id + user_id is found in the visit table
or pair: page_id + session_id is found
or: page_id + [ip + useragent] (whether it should be only ip or ip + useragent is a topic for another discussion)
So I have a table tracking user visits:
visit:
page_id
user_id
session_id
useragent
ip
created_at
updated_at
Now, on each user visit that does not hit the cache, I will update the row if it exists; if no rows are affected, I will insert a new visit into the table.
That is one or two queries per visit (assuming the cache works, mostly two), and it keeps the number of rows somewhat limited. Maybe it would be better to store all the visits and then clean up the database periodically, e.g. once a month?
The questions are:
how should the visit table be constructed (keys, indexes, relations to the user and page_views tables)? Some of the important fields may be null (e.g. user_id); what about indexes then? Do I need a multi-column primary key?
which would be the fastest SQL query to determine whether a visit is unique?
is this a sane approach?
I use PostgreSQL and PDO (Doctrine ORM).
All my sessions are stored in the same DB.
Personally I would not put this in the request-response path. I would log the raw data in a table (or push it onto a queue) and let a background task/thread/cron job deal with it.
The queue (or the message-passing table) should then just contain page_id, user_id, session_id, useragent, and ip.
Absolute timings are less important now, as long as the background task can keep up. Since a single thread now does the heavy lifting, it will not create conflicting locks when updating the unique page view tables.
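A rough C# sketch of such a background pass, folding the three uniqueness rules from the question into one key; RawVisit, the processing class and the storage stub are illustrative only:
using System.Collections.Generic;

// One raw, unprocessed visit as it comes off the queue / message-passing table.
public class RawVisit
{
    public long PageId { get; set; }
    public long? UserId { get; set; }
    public string SessionId { get; set; }
    public string Ip { get; set; }
    public string UserAgent { get; set; }
}

public static class VisitProcessor
{
    // The uniqueness rules in the order given in the question:
    // page + user, else page + session, else page + ip (+ useragent).
    public static string UniqueKey(RawVisit v)
    {
        if (v.UserId != null)    return v.PageId + ":u:" + v.UserId;
        if (v.SessionId != null) return v.PageId + ":s:" + v.SessionId;
        return v.PageId + ":a:" + v.Ip + ":" + v.UserAgent;
    }

    // Single-threaded background pass, so there is no lock contention on the visit table.
    public static void ProcessBatch(IEnumerable<RawVisit> batch, ISet<string> alreadyCounted)
    {
        foreach (var visit in batch)
        {
            if (alreadyCounted.Add(UniqueKey(visit)))
                RecordUniqueVisit(visit);
        }
    }

    // Hypothetical stub: INSERT into the visit table / bump the unique-view counter.
    static void RecordUniqueVisit(RawVisit visit) { }
}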
Just some random thoughts:
Can I verify that the thinking behind the unique visit types is:
pageid + userid = user has logged in
pageid + sessionid = user not identified but has cookies enabled
pageid + ip / useragent = user not identified and no cookies enabled
For raw performance, you might consider #2 to be redundant, since #3 will probably cover #2 in most conditions (or is #2 important, e.g. if the user then registers and #2 can be mapped to a #1?). Meaning that the session id might still be logged, but not used in any visit determination.
IMHO the IP will always be present (even if spoofed) and is a good candidate for an index. The user agent can be hidden and only has a limited range (not very selective).
I would use a surrogate primary key in this instance, due to the nullable fields and since none of the fields is unique by itself.
IMHO your idea about storing ALL the visits and then trimming the duplicates in a batch is a good one to weigh up (rather than checking whether a row exists to decide between update and insert).
So PK = Surrogate
Clustering = Not sure - another query / requirement might drive this better.
NonClustered Index = IP Address, Page Id (assuming more distinct IP addresses than page id's)