Table design in Azure Table Storage

I need to build a REST service for messaging on Azure, and now I have a problem with the DB design. I have 3 tables: users, chats, and messages of chats.
Users contains user data such as login, password hash, and salt.
Chats contains PartitionKey = userLogin, RowKey = chatId, and nowInChat, a flag recording whether the user is currently in the chat.
Messages of chats contains a PartitionKey which consists of
userLogin_chatId_datetimeTicks
(e.g. zevis_8a70ff8d-c363-4eb4-8a51-f853fa113fa8_634292263478068039),
RowKey = messageId, plus message and sender (the sender's userLogin).
I see a disadvantage in this design: imagine two users who communicated actively a year ago and no longer talk, and one of them wants to look at the history. I would have to send a large number of requests to the server, requesting the data in time intervals such as a week. Sending a single request with a filter of "time less than today" would be ineffective, because we would get the whole history at once.
How should I change the design of the tables?

Because Azure Storage Tables do not support secondary indexes, and storage is very inexpensive, your best option is to store the data twice, using different partition and/or row keys. From the Azure Storage Table Design Guide:
To work around the lack of secondary indexes, you can store multiple
copies of each entity with each copy using different PartitionKey and
RowKey values
https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#table-design-patterns
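As a rough illustration of this pattern (the entity shape, table names, and key layout below are assumptions, not from the question): each message is written twice with the .NET storage SDK, once keyed for reading a whole chat and once keyed for reading one user's history.

// Assumed entity; uses the Microsoft.WindowsAzure.Storage.Table namespace.
public class MessageEntity : TableEntity
{
    public string Message { get; set; }
    public string Sender { get; set; }
}

// Copy 1: partitioned by chat, for "show me this chat".
var byChat = new MessageEntity
{
    PartitionKey = chatId,
    RowKey = messageId,
    Message = text,
    Sender = userLogin
};

// Copy 2: partitioned by user, row-keyed by time, for "show me my history".
var byUser = new MessageEntity
{
    PartitionKey = userLogin,
    RowKey = DateTime.UtcNow.Ticks.ToString("d19") + "_" + messageId,
    Message = text,
    Sender = userLogin
};

// Different PartitionKeys cannot share an entity group transaction,
// so these are two separate writes and need retry handling if one fails.
messagesByChat.Execute(TableOperation.Insert(byChat));
messagesByUser.Execute(TableOperation.Insert(byUser));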

Thank you for your post; you have two options here. The easiest answer, with the least amount of design change, would be to include a StartTime and EndTime in the Chats table. Although these properties would not be indexed, I'm guessing there will not be many rows to scan once you filter on the UserID.
The second option requires a bit more work but is cleaner: create an additional table with PartitionKey = UserID and RowKey = DateTime ticks, with the entity properties containing the ChatID. This would enable you to quickly filter by user on a given date or date range. (This is the denormalization approach from the answer above; a sketch follows below.)
Hopefully this helps your design progress.
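A minimal sketch of that second option (table, entity, and variable names are assumed for the example):

// Index entity: one row per chat a user participated in, keyed by time.
public class UserChatEntity : TableEntity
{
    public string ChatId { get; set; }
}

userChats.Execute(TableOperation.Insert(new UserChatEntity
{
    PartitionKey = userLogin,
    RowKey = DateTime.UtcNow.Ticks.ToString("d19"),
    ChatId = chatId
}));

// All chats the user touched within a date range, in one indexed query:
string from = fromDate.Ticks.ToString("d19");
string to = toDate.Ticks.ToString("d19");
var query = new TableQuery<UserChatEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, userLogin),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, from),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, to))));
foreach (var chat in userChats.ExecuteQuery(query)) { /* ... */ }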

I would create a separate table with these PK and RK values:
Partition Key = UserID, Row Key = DateTime.MaxValue.Ticks - DateTimeTicks
Optionally you can also append ChatId to the end of the Row Key above.
This way the most recent communication made by the user will always be on top. You can then simply query the table passing in only the UserId and a take count (i.e. take count = 1 if you want the user's latest chat entry). The query will also be very fast: since you use inverted ticks for your row keys, the Azure Table storage service sorts the entries for the same user id in increasing lexicographical order of row keys, always keeping the latest chat at the top of the partition, as it has the minimum inverted tick value.
Even if you append the ChatId to the end of the RowKey (i.e. InvertedTicks_ChatId), the sort order will not change, and the latest conversation will be on top regardless of chat id.
Once you read an entity back, you subtract the inverted ticks from DateTime.MaxValue.Ticks to find the actual date.
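A minimal sketch of that round trip (the 19-digit zero-padding is an assumption; DateTime.MaxValue.Ticks fits in 19 digits, and fixed-width padding keeps the lexicographic sort correct):

// Write: invert the ticks so newer entries get smaller row keys.
long inverted = DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks;
string rowKey = inverted.ToString("d19") + "_" + chatId;   // optional ChatId suffix

// Read: split off the inverted ticks and subtract them back out.
long invertedBack = long.Parse(rowKey.Split('_')[0]);
var timestamp = new DateTime(DateTime.MaxValue.Ticks - invertedBack, DateTimeKind.Utc);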

Related

Designing a database schema for an AWS mobile backend

I am new to databases and SQL and would like to design a database for a fitness app that will keep track of workouts at the gym.
In my app, I have designed a custom workout object that has a name (e.g. 'Chest day'), an ID (some number), and a date (string). Each workout object contains an array of exercises; an exercise is another custom object that has a property called 'set'. The set is also a custom object with only two numeric properties: number of reps and weight (e.g. 10 reps at 50 lbs).
What I thought of is to have one table for the workouts, another for the exercises, and another for the sets. The problem is I do not know how to connect the tables (i.e. link multiple exercises to a unique workout and multiple sets to a unique exercise), and I am not sure this is even the correct approach.
Also, I planned to set up the backend for this app using the Amazon Web Services Mobile Hub, which provides a NoSQL database.
In NoSQL, you should keep all the attributes in a single table. You shouldn't normalize the data as you would in an RDBMS. Also, please try to come away from joins: the main advantage of NoSQL is that everything is kept as one item, so you don't need a join to get the result.
Advantages of this approach are:
1) Fast response, as all the data is present as one item in the table
2) Schemaless database, i.e. you can add new attributes at any time (no need to alter the table and add new columns)
DynamoDB design for the above use case (the combination of partition key and sort key must be unique):
name - String (partition key)
id - Number (sort key)
date - String
exercise - List data type (array of values)
custom_set - Map data type, e.g. {rep : 1, weight : 2}
Important note: the important thing while designing a DynamoDB data model is that all the data retrieval use cases (i.e. query access patterns) should be known up front, so that you can design an appropriate model.
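As a rough sketch of what one such item could look like with the AWS SDK for .NET document model (the table name "Workout" and the nested attribute layout are assumptions for illustration):

using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.DocumentModel;

var client = new AmazonDynamoDBClient();
var table = Table.LoadTable(client, "Workout");   // assumed table name

// One workout = one item; exercises and sets are nested inside it,
// so no join is needed to read a whole workout back.
var item = Document.FromJson(@"{
    ""name"": ""Chest day"",
    ""id"": 1,
    ""date"": ""2016-05-01"",
    ""exercise"": [
        { ""exercise_name"": ""bench press"",
          ""custom_set"": [ { ""rep"": 10, ""weight"": 50 },
                            { ""rep"": 8,  ""weight"": 60 } ] }
    ]
}");
await table.PutItemAsync(item);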

Implementing a cache layer (including some SQL DB tables) in Couchbase

Suppose I have many SQL tables with at least 10 columns each.
Let's take for example:
HR Table: ID, FirstName, LastName, PhoneNumber, Gender, City, Street, Height, Weight, IQ
I need to build a cache layer for all of my SQL tables.
What would be the best way to store the data in Couchbase ?
Should I store the whole document for each row ?
Here is a potential key. For example, a key that brings me a JSON document containing all the columns of the row where ID=4:
HR_4
Or should I implement it as a key-value store? For instance, a key that brings me one specific value (not all the columns):
HR_4_FirstName
Please bear in mind that I DO need to get an entire row per key in my application, but sometimes I need just one specific column.
The question is: should I go the second way, and when I need a few values, just send a few requests from my application and aggregate them?
On the other hand, the second way means many more keys to handle (effectively a key for each DB field).
I would look at how your application uses and accesses the data. It may be worthwhile to have several objects for the data you are trying to store, depending on access patterns and what you want to optimize for. May I recommend this article on data modeling for a user profile store in Couchbase. Let me know if this does not help.
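For illustration, a sketch of the whole-document option with the Couchbase .NET SDK (2.x style; the key and field names come from the question, the bucket setup is assumed). The sub-document API covers the "just one column" case without needing a key per field:

// Whole row cached as a single JSON document under key "HR_4".
bucket.Upsert("HR_4", new
{
    FirstName = "John",          // sample values, not real data
    LastName = "Doe",
    PhoneNumber = "555-0100",
    Gender = "M",
    City = "Haifa"
    // ... remaining columns
});

// Entire row in one request:
var row = bucket.Get<dynamic>("HR_4");

// One column only, without fetching the whole document (sub-document API):
var firstName = bucket.LookupIn<dynamic>("HR_4").Get("FirstName").Execute();

This keeps one key per row while still letting you read single fields, avoiding the key explosion of the HR_4_FirstName scheme.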

Saving statistics to an SQLite database

I have created an IRC bot for Twitch that comes with a couple of features (song requests, a queue to join games). Currently I have a table in my database looking like this:
CREATE TABLE users(id,points,timespent,follower,followed,wMessage);
a very simple table (id is the name of the user; points are a virtual currency you get for watching the stream and spend on some of the features mentioned above; timespent is time spent in the channel; follower is whether you are a follower; followed is whether you have followed once before; and wMessage is a welcome message)
I would like to see some statistics for the bot: how many people joined the channel in a given year/month/day/hour, how many used the queue feature, how many used feature y at time x. I can only come up with one way to do this, and I am not sure it is the best way:
CREATE TABLE queueStats(usedDate DATETIME,timeUsed int);
I guess you could even remove timeUsed, just insert a new row each time the feature is used, and then count the rows with a SELECT ... WHERE query. Is this a smart way to do this? The reason I ask is that I am very new to SQL databases, so I am not really sure of the standard way to do things (if there is such a thing).
I'd recommend creating a table to record events of interest. You could have a foreign key referencing the user table. Getting summary statistics could then be done using an aggregation query (example).
BTW, I'd recommend explicitly specifying your user id column as an "integer primary key". See here for why/how. Basically, if you don't, you could end up with duplicate rows for user IDs; also, if you don't explicitly specify a primary key field, SQLite creates an extra "rowid" column for you.
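A minimal sketch of that events-table idea using the Microsoft.Data.Sqlite package (table and column names are assumptions):

using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=bot.db");
conn.Open();

// One row per feature use; statistics come from aggregation queries later.
var ddl = conn.CreateCommand();
ddl.CommandText = @"
    CREATE TABLE IF NOT EXISTS events(
        id      INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        feature TEXT NOT NULL,                     -- e.g. 'queue', 'songrequest'
        used_at DATETIME DEFAULT CURRENT_TIMESTAMP)";
ddl.ExecuteNonQuery();

// Example summary: queue usage per day.
var stats = conn.CreateCommand();
stats.CommandText = @"
    SELECT date(used_at) AS day, COUNT(*) AS uses
    FROM events
    WHERE feature = 'queue'
    GROUP BY date(used_at)";
using var reader = stats.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt32(1)}");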

Model simple social network in Azure table service

What is the best table design for a simple social networking website using Azure Table Service?
The website could have millions of users.
Users need to be able to view a list of all other users in the system sorted by the number of mutual connections.
Users must be able to view a list of their connections
User must be able to view content posted by themselves and their connections.
One major design constraint is that Azure table service queries are generally limited to the partition key and row key when there are a large number of records or else they get really slow. Another constraint is that query results are only sorted by the partition key and then the row key.
Try this Design:
UserTable
PK: GUID (a GUID partition key will maximize scalability: each partition holds a single row, so partitions can be spread across servers)
RK: GUID
... Rest of properties
UserFriendsTable
PK: UserTable.RK (every user, together with his friends, lives in its own partition, which can be served from a separate server)
RK: GUID
FriendWith: UserTable.PK + "-" + UserTable.RK (concatenate the PK and RK from the user table, separated with "-"; this helps you execute a fast point query when you access a friend's profile)
PostsTable
PK: UserTable.RK + "-" + YYYYMM + random number (this allows Azure to place each user's monthly posts in a separate partition; the random number prevents Azure from automatically grouping sequential partitions together. You can still query posts by filtering on part of the PK, e.g. PK starts with XCtghi94ktY-201411; see the sketch after this answer.)
RK: use the following code to generate the row key in descending order, so that the latest post comes first:
// Inverted ticks: newer posts get smaller values, so they sort to the top.
long ticks = DateTimeOffset.MaxValue.UtcDateTime.Ticks - DateTimeOffset.Now.UtcDateTime.Ticks;
// The GUID suffix keeps row keys unique for posts created in the same tick.
string guid = Guid.NewGuid().ToString("N");
string suffix = "-";
string rowKey = string.Format("{0:d21}{1}{2}", ticks, suffix, guid);
Post : String
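The "PK starts with" query above can be expressed as a lexicographic range filter. A sketch, with the prefix value and table reference assumed:

// "PartitionKey starts with prefix" becomes: >= prefix AND < next-prefix.
string prefix = "XCtghi94ktY-201411";                 // user RK + "-" + YYYYMM
string upper = prefix.Substring(0, prefix.Length - 1)
             + (char)(prefix[prefix.Length - 1] + 1); // "XCtghi94ktY-201412"
var query = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition(
            "PartitionKey", QueryComparisons.GreaterThanOrEqual, prefix),
        TableOperators.And,
        TableQuery.GenerateFilterCondition(
            "PartitionKey", QueryComparisons.LessThan, upper)));
foreach (var post in postsTable.ExecuteQuery(query)) { /* ... */ }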

Optimizing SQL to determine unique page views per user

I need to determine if user has already visited a page, for tracking unique page views.
I have already implemented some HTTP header cache, but now I need to optimize the SQL queries.
The visit is unique when:
pair page_id + user_id is found in the visit table,
or pair page_id + session_id is found,
or page_id + [ip + useragent] is found (whether it should be only ip or ip + useragent is a topic for another discussion)
So I have a table tracking user visits:
visit:
page_id
user_id
session_id
useragent
ip
created_at
updated_at
Now on each user visit (one that does not hit the cache) I will update the row if it exists; if no rows were affected, I will insert a new visit into the table.
That is one or two queries per visit (assuming the cache works, mostly two queries), but the number of rows stays limited. Maybe it would be better to store all the visits and then clean up the database periodically, e.g. once a month?
The questions are:
how should the visit table be constructed (keys, indexes, relations to the user and page_views tables)? Some of the important fields may be NULL (e.g. user_id); what about indexes then? Do I need a multi-column primary key?
which would be the fastest SQL query to determine whether a visit is unique?
is this a sane approach?
I use PostgreSQL and PDO (Doctrine ORM).
All my sessions are stored in the same DB.
Personally I would not put this in the request-response path. I would log the raw data in a table (or push it on a queue) and let a background task/thread/cron job deal with it.
The queue (or the message-passing table) should then just contain page_id, user_id, session_id, useragent, and ip.
Absolute timings are less important now, as long as the background task can keep up. Since a single thread now does the heavy lifting, it will not create conflicting locks when updating the unique page-view tables. A sketch of this split follows below.
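A rough sketch of that split using Npgsql (the raw_visit staging table and the dedupe statement are illustrative assumptions; the dedupe relies on a suitable unique index, which the nullable columns complicate):

using Npgsql;

using var conn = new NpgsqlConnection("Host=localhost;Database=app");
conn.Open();

// Request path: one cheap append, no uniqueness checks, no lock contention.
using (var log = new NpgsqlCommand(
    "INSERT INTO raw_visit(page_id, user_id, session_id, useragent, ip) " +
    "VALUES (@p, @u, @s, @a, @i)", conn))
{
    log.Parameters.AddWithValue("p", pageId);
    log.Parameters.AddWithValue("u", (object)userId ?? DBNull.Value);  // may be NULL
    log.Parameters.AddWithValue("s", sessionId);
    log.Parameters.AddWithValue("a", userAgent);
    log.Parameters.AddWithValue("i", ip);
    log.ExecuteNonQuery();
}

// Background task (single thread, e.g. cron): fold the raw log into the
// unique-visits table; the consumed staging rows would then be pruned.
using var fold = new NpgsqlCommand(@"
    INSERT INTO visit(page_id, user_id, session_id, useragent, ip)
    SELECT DISTINCT page_id, user_id, session_id, useragent, ip
    FROM raw_visit
    ON CONFLICT DO NOTHING", conn);   // requires PostgreSQL 9.5+
fold.ExecuteNonQuery();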
Just some random thoughts:
Can I verify that the thinking behind the unique visit types is:
pageid + userid = user has logged in
pageid + sessionid = user not identified but has cookies enabled
pageid + ip / useragent = user not identified and no cookies enabled
For raw performance, you might consider #2 redundant, since #3 will probably cover #2 in most conditions (or is #2 important, e.g. so that if the user later registers, a #2 visit can be mapped to a #1? Meaning that the session id might still be logged, but not used in any visit determination).
IMHO the IP will always be present (even if spoofed) and will be a good candidate for an index. The user agent can be hidden and has only a limited range of values (not very selective).
I would use a surrogate primary key in this instance, due to the nullable fields and since none of the fields is unique by itself.
IMHO your idea of storing ALL the visits and then trimming the duplicates in a batch is a good one to weigh up (rather than checking whether a row exists to decide between update and insert).
So PK = surrogate.
Clustering = not sure; another query/requirement might drive this better.
Nonclustered index = IP address, page id (assuming more distinct IP addresses than page ids).
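In PostgreSQL terms (which the asker is using) that shape could look roughly like this; column types are assumptions, and Postgres has no clustered indexes, so only the surrogate key and the suggested lookup index are shown:

using Npgsql;

using var conn = new NpgsqlConnection("Host=localhost;Database=app");
conn.Open();
using var cmd = new NpgsqlCommand(@"
    CREATE TABLE visit(
        id         bigserial PRIMARY KEY,  -- surrogate key; identity fields are nullable
        page_id    integer NOT NULL,
        user_id    integer NULL,
        session_id text NULL,
        useragent  text NULL,
        ip         inet NOT NULL,
        created_at timestamptz DEFAULT now(),
        updated_at timestamptz DEFAULT now());
    -- Lookup index: IP first (more distinct values), then page id.
    CREATE INDEX visit_ip_page_idx ON visit(ip, page_id);", conn);
cmd.ExecuteNonQuery();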