Creating a blog service or a persistent chat with Table Storage - azure-storage

I'm trying out Azure Table Storage and can't come up with real-life scenarios where I would use it. As far as I understand, the only index Table Storage has is the Partition Key and Row Key. I can't sort or query on other columns without doing a full partition scan, right?
If I migrated my blog service from a traditional SQL Server or a richer NoSQL database like Mongo, I would probably be alright, considering users don't blog that much in a year (I would partition all blog posts per user per year, for example). Even if someone hit around a thousand blog posts a year, I would be OK loading all their metadata into memory. I could do smarter partitioning if that didn't work well.
But if I migrated my persistent chat service to Table Storage, how would I do that? Users post thousands of messages a day and query history pretty often from desktop clients, mobile devices, the web site, etc. I don't want to lose out here and only return one day of history with paging (which can be slow as well).
Any ideas or patterns or what am I missing here?
Btw, I could always use a different database, but considering Table Storage is so cheap I don't want to.

PartitionKey and RowKey values are the only two indexed properties. To work around the lack of secondary indexes, you can store multiple copies of each entity, with each copy using a different RowKey value. For instance, one entity will have PartitionKey=DepartmentName and RowKey=EmployeeID, while the other entity will have PartitionKey=DepartmentName and RowKey=EmailAddress. That will allow you to look up an employee either by EmployeeID or by EmailAddress. The Azure Storage Table Design Guide (http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/) has a more detailed example and all the information you need to design scalable and performant tables.
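For illustration, here is a minimal sketch of that pattern using the classic WindowsAzure.Storage .NET SDK (the table name, key prefixes and properties are placeholders chosen for this example, not something you have to use):

```csharp
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class EmployeeEntity : TableEntity
{
    public EmployeeEntity() { }
    public EmployeeEntity(string department, string rowKey)
    {
        PartitionKey = department;  // e.g. the department name
        RowKey = rowKey;            // "id_<EmployeeID>" or "email_<EmailAddress>"
    }

    public string EmployeeId { get; set; }
    public string Email { get; set; }
    public string Name { get; set; }
}

public static class EmployeeIndexDemo
{
    public static void Run(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var table = account.CreateCloudTableClient().GetTableReference("Employees");
        table.CreateIfNotExists();

        // Two copies of the same logical entity, one per lookup key.
        var byId = new EmployeeEntity("Sales", "id_42")
            { EmployeeId = "42", Email = "jane@contoso.com", Name = "Jane" };
        var byEmail = new EmployeeEntity("Sales", "email_jane@contoso.com")
            { EmployeeId = "42", Email = "jane@contoso.com", Name = "Jane" };

        // Both copies share the PartitionKey, so they can be written atomically in one batch.
        var batch = new TableBatchOperation();
        batch.Insert(byId);
        batch.Insert(byEmail);
        table.ExecuteBatch(batch);

        // Cheap point lookups by either "index":
        var byIdResult = table.Execute(
            TableOperation.Retrieve<EmployeeEntity>("Sales", "id_42"));
        var byEmailResult = table.Execute(
            TableOperation.Retrieve<EmployeeEntity>("Sales", "email_jane@contoso.com"));
    }
}
```

Because both copies live in the same partition, the batch insert is atomic, so the two "index" rows cannot get out of sync on write.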
We would need more information to answer your second question about how to migrate the contents of your chat service to Table Storage; specifically, we need to understand the format and structure of the data that you currently store in your chat service.

Related

Database design & 3rd party integrations

We're building an application where eCommerce owners can connect their store from different platforms (e.g. Shopify, Magento, Woocommerce). We do this in order to import data from these various platforms.
So we have a Stores table. In there we have data that are common to all platforms and some data that are specific to the platforms.
I'm not sure what to do here. Should we create separate tables that contain the platform-specific information, or should we add columns to store that information, which will simply be empty for stores from the other platforms?
What would be the pros and cons of each, knowing that with the separate-tables approach we would then need to create a new table for every new platform we integrate with?
You haven't said which specific RDBMS you're using, but with PostgreSQL you have the option of foreign data wrappers. These let you federate data from other sources and APIs into your application database and read and write foreign tables just like you do the internal tables (assuming the external APIs allow you to modify data). With this approach, you just need to make sure that your stores are properly associated with their respective entries in the foreign tables. Developing FDWs is relatively easy with Multicorn.
If that's not an option: using columns is efficient to query since the information is right there in your store record. However, it could get unwieldy depending on how much of it there is, and if you could have a tenant with multiple presences on one of those external platforms -- weirder things have happened -- you're in for some trouble. And the relational form makes adding and changing support for the external platforms easier since you don't have to lock the entire tenants table to add or remove columns.
The simpler approach may be all you need to start out with, but it'd probably be smart to plan for tables in the end.

PET (privacy enhancing technology) with Fluent NHibernate

For a web application (with some really private data) we want to use privacy enhancing technology to prevent big risks when someone gains access to our database.
The application is built in different layers, and we use (as said in the topic title) Fluent NHibernate to connect to our database, and we've created our own wrapper class to create queries.
Security is a big issue for the kind of application we're building. I'll try to explain the setting by a simple example:
Our customers have clients in their application (each installation of the application uses its own database), for whom some sensitive data is stored; there is a client table and a person table, which are linked.
The base table, which links to the other tables (there will be hundreds of them soon, probably containing sensitive data), is the client table.
At this moment, the client has a client_id and a table_id in the database; our customer only knows the client_id, and the system links the data by the table_id, which is unknown to the user.
What we want to ensure:
A possible hacker who would have gained access to our database, should not be able to see the link between the customer and the other tables by just opening the database. So actually there should be some kind of "hidden link" between the customer and other tables. The personal data and all sensitive other tables should not be obviously linked together.
Because of the data sensitivity we're looking for a more robust solution than "statically hash the table_id and use this in other tables", so that when one of the persons is linked to the corresponding client, not all other clients' data is compromised too.
Ultimately, the customer table should not be linkable to the other tables at all just by working inside the database; the application code should be needed to link the tables.
To accomplish this we've been looking into different methods, but because of the multiple tables linked to this client, and further development (thus probably even more tables), we're looking for a centralised solution. That's why we concluded this should be handled in the database connector. Searching the internet and here on Stack Overflow did not point us in the right direction; perhaps we couldn't find it because of the wrong search terms (PET or privacy enhancing technology combined with NHibernate did not give us any direction).
How can we accomplish our goals in this specific situation, or where should we search to help us fix this?
We have a similar requirement for our application, and what we ended up with is using database schemas.
We have one database and each customer has a separate schema, where all the data for that customer is stored. It is possible to link from the schema to the rest of the database, but not to the other schemas.
Security can be set for each schema separately so you can make the life of a hacker harder.
That being said, I can also imagine a solution where you let NHibernate encrypt every piece of data it sends to the database and decrypt everything it gets back. The data will be stored safely, but it will be very difficult to query over the data.
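As a rough sketch of that second idea, assuming plain .NET AES (the class name is made up, and key management plus authenticated encryption are deliberately left out), a custom NHibernate user type or your own query wrapper could funnel every sensitive value through something like this:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Minimal AES helper that a custom NHibernate user type (or your own query
// wrapper) could call for every sensitive property before it reaches the
// database and after it is read back. The key must live in the application
// (e.g. a key vault or config outside the database), otherwise the scheme adds nothing.
public static class ColumnCipher
{
    public static string Encrypt(string plainText, byte[] key)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = key;
            aes.GenerateIV();
            using (var encryptor = aes.CreateEncryptor())
            {
                byte[] plain = Encoding.UTF8.GetBytes(plainText);
                byte[] cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);

                // Prepend the IV so Decrypt can recover it later.
                byte[] payload = new byte[aes.IV.Length + cipher.Length];
                Buffer.BlockCopy(aes.IV, 0, payload, 0, aes.IV.Length);
                Buffer.BlockCopy(cipher, 0, payload, aes.IV.Length, cipher.Length);
                return Convert.ToBase64String(payload);
            }
        }
    }

    public static string Decrypt(string cipherText, byte[] key)
    {
        byte[] payload = Convert.FromBase64String(cipherText);
        using (var aes = Aes.Create())
        {
            aes.Key = key;

            // Recover the IV that Encrypt prepended.
            byte[] iv = new byte[aes.BlockSize / 8];
            Buffer.BlockCopy(payload, 0, iv, 0, iv.Length);
            aes.IV = iv;

            using (var decryptor = aes.CreateDecryptor())
            {
                byte[] plain = decryptor.TransformFinalBlock(payload, iv.Length, payload.Length - iv.Length);
                return Encoding.UTF8.GetString(plain);
            }
        }
    }
}
```

The downside stays exactly as described above: once values are ciphertext in the database, you can no longer filter, sort or join on them server-side.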
So there is probably not a single answer to this question, and you have to decide what is better: Not being able to query, or just making it more difficult for a hacker to get to the data.

Proper use of Azure storage. (When to use SQL, Tables and Blobs)

I am relatively new to Azure storage and have been implementing a solution for some time now.
And I keep hitting obstacles, making me feel that I'm not applying the right storage type for the data I'm storing.
So this is more of an overall question:
When should I use Azure SQL?
When should I use Azure Table storage?
When should I use Azure Blobs?
So far I have been using table storage a lot, and I'm now paying for it.
As requirements for the solution grow I find myself unable to access the data as needed.
For instance, I need to fetch the 50 latest entries in a table, but I cannot use OrderBy in the query.
I need to fetch the total number of entries, but cannot use Count.
I keep getting the impression that any data I plan to access regularly without knowing the exact RowKey and PartitionKey should be indexed in Azure SQL as well as being stored in a table. Is this correct?
I also find myself recreating objects as Entity objects, but with the very severe limitations on datatypes I often end up just serializing the object into a byte array. And though a table row may hold up to 1MB, a byte array property on that row may only hold 64KB, at which point I end up using Blob storage instead.
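For illustration, the fallback I end up with looks roughly like this (classic WindowsAzure.Storage SDK; the table, container and property names are just placeholders):

```csharp
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.Table;

// Table entity that only carries metadata plus a pointer to the real payload.
public class LargeItemEntity : TableEntity
{
    public string BlobName { get; set; }   // where the serialized object lives
    public int PayloadSize { get; set; }
}

public static class LargeItemStore
{
    public static void Save(string connectionString, string partitionKey,
                            string rowKey, byte[] serializedPayload)
    {
        var account = CloudStorageAccount.Parse(connectionString);

        // 1. Put the (possibly > 64 KB) payload into blob storage.
        var container = account.CreateCloudBlobClient().GetContainerReference("payloads");
        container.CreateIfNotExists();
        string blobName = partitionKey + "/" + rowKey;
        var blob = container.GetBlockBlobReference(blobName);
        blob.UploadFromByteArray(serializedPayload, 0, serializedPayload.Length);

        // 2. Keep only a small, queryable entity in the table.
        var table = account.CreateCloudTableClient().GetTableReference("LargeItems");
        table.CreateIfNotExists();
        var entity = new LargeItemEntity
        {
            PartitionKey = partitionKey,
            RowKey = rowKey,
            BlobName = blobName,
            PayloadSize = serializedPayload.Length
        };
        table.Execute(TableOperation.InsertOrReplace(entity));
    }
}
```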
So in the end I feel like I would have been better off just putting all my data in Azure SQL and indexing larger data but saving it as blobs.
Of course this does not feel quite right, since that would leave Table storage with no real purpose.
So I'm wondering if there are any guidelines for when to use which kind of storage.
In my case I have a very large amount of data in some areas, some of it consuming a fair amount of space (often above 64KB), but I also need to access the data very frequently and need to be able to filter and sort it by certain values.
Do I really need to index all data I plan to access in SQL?
And would I be better off avoiding Table for any data that could potentially exceed 64KB?
I feel like there's something I'm not doing right. Something I didn't understand. What am I missing here?
The best recommendation I can make is basically, "Try really hard not to use Azure Table Storage". As other folks have pointed out, it's not just a "No-SQL" data-store, it's a particularly stunted, handicapped, and very-low-featured instance of a No-SQL store. About the only thing good about it is that you can put lots and lots of data into it very quickly, and with minimal storage fees. However, you basically can't hope to get that data back out again unless you're lucky enough to have a use-case that magically matches its Partition-Key/Row-Key storage model. If you don't - and I suspect very few people do - you're going to be doing a lot of partition scans, and processing the data yourself.
Beyond that, Azure Table Storage seems to be at a dead-end in terms of development. If you look at the "Support Secondary Indexes" request on the Azure feedback forums (https://feedback.azure.com/forums/217298-storage/suggestions/396314-support-secondary-indexes), you can see that support for Secondary Indexes was promised as far back as 2011, but no progress has been made. Nor has any progress been made on any of the other top requests for Table Storage.
Now, I know that Scott Guthrie is a quality guy, so my hope is that all this stagnation on the Table Storage front is a preface to Azure fixing it and coming up with something really cool. That's my hope (though I have zero evidence that's the case). But for right now, unless you don't have a choice, I'd strongly recommend against Azure Table Storage. Use Azure SQL; use your own instance of MongoDB or some other No-SQL DB; or use Amazon DynamoDB. But don't use Azure Table Storage.
EDIT: 2014-10-09 - Having been forced into a scenario where I needed to use it, I've modified my opinion on Azure Table Storage slightly. It does in fact have all the regrettable limitations I ascribe to it above, but it also has its (limited) uses. I go into them somewhat on a blog post here.
EDIT: 2017-02-09 - Nah, ATS is still awful. Steer clear of it. It hasn't improved significantly in 7+ years, and MS obviously wishes it would just go away. And it probably should - they're presumably only keeping it around for folks who made the mistake of betting on it originally.
Have a look at this: Windows Azure Table Storage and Windows Azure SQL Database - Compared and Contrasted
It doesn't include blobs, but it's a good read anyway...
I keep getting the impression that any data I plan to access regularly without knowing the exact RowKey and PartitionKey should be indexed in Azure SQL aswell as being stored in a table. Is this correct?
Table storage does not support secondary indexes and so any efficient queries should contain the RowKey and the PartitionKey. There can be workarounds such as saving the same data twice in the same table with different RowKeys. However this can quickly become a pain. If eventual consistency is ok then you could do this. You need to take care of transactions and rollbacks.
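For example, a common workaround for the "latest 50 entries" case from the question is to encode the sort order into the RowKey itself, since entities within a partition are always returned ordered by RowKey. A rough sketch with the classic .NET SDK (the entity, table and message names are made up for illustration):

```csharp
using System;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Table;

public class LogEntryEntity : TableEntity
{
    public string Message { get; set; }
}

public static class LatestEntries
{
    // Newer timestamps produce lexicographically smaller RowKeys,
    // so the newest entries come back first.
    public static string ReversedTicksRowKey(DateTime utc)
    {
        return string.Format("{0:D19}", DateTime.MaxValue.Ticks - utc.Ticks);
    }

    public static void Insert(CloudTable table, string partitionKey, string message)
    {
        var entity = new LogEntryEntity
        {
            PartitionKey = partitionKey,
            RowKey = ReversedTicksRowKey(DateTime.UtcNow),
            Message = message
        };
        table.Execute(TableOperation.Insert(entity));
    }

    public static LogEntryEntity[] Latest50(CloudTable table, string partitionKey)
    {
        var query = new TableQuery<LogEntryEntity>()
            .Where(TableQuery.GenerateFilterCondition(
                "PartitionKey", QueryComparisons.Equal, partitionKey))
            .Take(50);   // cap the query size

        // The extra Take(50) on the enumeration is belt-and-braces:
        // the first 50 rows in RowKey order are the 50 newest entries.
        return table.ExecuteQuery(query).Take(50).ToArray();
    }
}
```

The Count problem has no equivalent trick, though; you either maintain a running counter entity yourself or accept scanning.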
In my case I have very large amount of data in some areas, some of it consumes a fair amount of space (often above 64KB), but I also need to access the data very frequently and will need to be able to filter and sort it by certain values.
Use table storage for basic NoSQL functionality and the ability to scale quickly. However, if you want secondary indexes and other such features, you might have to take a look at something like DynamoDB on AWS, which AFAIK has better support for secondary indexes. If you have data with complex relationships, in other words data that requires an RDBMS, go with SQL Azure.
Now, as far as your options on Azure go, I think you would need to store everything in SQL Azure, with large objects stored as blobs or in table storage.
Do I really need to index all data I plan to access in SQL?
Tough to say. If each partition is going to contain, say, just 100 rows, then you can query by partition key and any of the columns; at that point the partition scan is going to be pretty fast. However, if you have a million rows, it could be a problem.
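For illustration, a partition-scan query against such a small partition could look like this with the classic .NET SDK (the "Status" property and partition key are made up):

```csharp
using System.Linq;
using Microsoft.WindowsAzure.Storage.Table;

public static class PartitionScanDemo
{
    // Filters on PartitionKey plus a non-indexed property. The service still
    // scans every row in the partition, which is fine at ~100 rows per
    // partition but not at millions.
    public static int CountOpen(CloudTable table, string partitionKey)
    {
        string filter = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("Status", QueryComparisons.Equal, "Open"));

        return table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)).Count();
    }
}
```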
I feel like there's something I'm not doing right. Something I didn't understand. What am I missing here?
A bunch of early Azure users started using Table Storage without understanding what NoSQL (and in this case a particularly stunted version of NoSQL) entails.

How to isolate SQL Data from different customers?

I'm currently developing a service for an app with WCF. I want to host the data on Windows Azure, and it should hold data from different users. I'm searching for the right design for my database. In my opinion there are only two possibilities:
Create a new database for every customer
Store a customer-id in every table (or just in the main table when every table is connected via relations)
The first approach gives very good speed and isolation, but it's very expensive on Windows Azure (or am I misunderstanding something about the Azure pricing?). Also, I don't know how to configure a WCF service so that it always uses a different database.
The second approach is slower and the isolation is poor, but it's easy to implement and cheaper.
Now to my question:
Is there any other way to get strong isolation of data and also easy integration into a WCF service on Azure?
What design should I use and why?
You have two additional options: build multiple schema containers within a database (see my blog post about this technique), or even better use SQL Database Federations (you can use my open-source project called Enzo SQL Shard to access federations). The links I am providing give you access to other options as well.
In the end it's a rather complex decision that involves a tradeoff of performance, security and manageability. I usually recommend Federations, even though it has its own set of limitations, because it is a flexible multitenant option for the cloud with the option to filter data automatically. Check out the open source project - you will see how to implement good separation of customer data independently of the physical storage.
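Whichever option you pick, the per-tenant resolution in the service itself can stay small. As a hypothetical sketch (the connection-string naming convention and schema lookup are assumptions for illustration, not part of Federations or my project), a WCF operation would resolve the current customer once and then go through something like this:

```csharp
using System.Configuration;
using System.Data.SqlClient;

// Hypothetical helper: resolves storage for the current tenant, whether that
// is a separate database, a separate schema, or a federation member.
public static class TenantConnectionFactory
{
    // Database per customer: one connection string per tenant,
    // e.g. <add name="Tenant_42" connectionString="..."/> in web.config.
    public static SqlConnection OpenForTenant(string tenantId)
    {
        var setting = ConfigurationManager.ConnectionStrings["Tenant_" + tenantId];
        var connection = new SqlConnection(setting.ConnectionString);
        connection.Open();
        return connection;
    }

    // Shared database, schema per customer: same connection string, but every
    // command targets the tenant's schema. The schema name must come from a
    // trusted lookup (never from user input) to avoid injection.
    public static SqlCommand CommandForTenant(SqlConnection shared, string tenantSchema)
    {
        var cmd = shared.CreateCommand();
        cmd.CommandText = "SELECT * FROM [" + tenantSchema + "].[Clients]";
        return cmd;
    }
}
```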

Local SQL database interface to cloud database

Excuse me if the question is simple. We have multiple medical clinics, each running their own SQL database EHR.
Is there any way I can interface each local SQL database with a cloud system?
I essentially want to use the current patient data that one is consulting at that moment to generate a pathology request that links to a cloud (Google App Engine?) database.
As a medical student / software developer this project of yours interests me greatly!
If you don't mind me asking, where are you based? I'm from the UK and unfortunately there's just no way a system like this would get off the ground as most data is locked in proprietary databases.
What you're talking about is fairly complex anyway, whatever country you're in I assume there would have to be a lot of checks / security around any cloud system that dealt with patient data. Theoretically though, what you would want to do ideally is create an online database (cloud, hosted, intranet etc), and scrap the local databases entirely.
You then have one 'pool' of data each clinic can pull information from (i.e. ALL records for patient #3563). They could then edit that data and/or insert new records and SAVE them, exporting them back to the main database.
If there is a need to keep certain information private to one clinic only, this could still be achieved on one database in a number of ways, or you could retain parts of the local database and have them merge with the cloud data as they're requested by the clinic.
This might be a bit outdated, but you guys should check out https://www.firebase.com/. It would let you do what you want fairly easily. We just did this for a client in the exact same business you are in.
Basically, Firebase lets you work with a Central Database on the Cloud, that is automatically synchronised with all its front-ends. It even handles losing the connection to the server automagically. It's the best solution I've found so far to keep several systems running against one only cloud database.
We used to have our own backend that would try its best to sync changes, but you need to be really careful with inter-system unique IDs for your tables (i.e. creating a new user at one of the branches must not yield the same ID as one that already exists in any other branch or in the central database). It becomes cumbersome very quickly.
CakePHP can generate this kind of unique ID pretty easily and automatically, but you still have to work on syncing all the local databases with the central repository.