Proper use of Azure storage (when to use SQL, Tables, and Blobs)

I am relatively new to Azure storage and have been implementing a solution with it for some time now, but I keep hitting obstacles that make me feel I'm not applying the right storage type to the data I'm storing.
So this is more of an overall question:
When should I use Azure SQL?
When should I use Azure Table storage?
When should I use Azure Blobs?
So far I have been using table storage a lot, and I'm now paying for it.
As requirements for the solution grow I find myself unable to access the data as needed.
For instance, I need to fetch the 50 latest entries in a table, but I cannot use OrderBy in the query.
I need to fetch the total number of entries, but I cannot use Count.
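To make those limitations concrete, here is roughly what I end up doing today: pulling back a whole partition and sorting/counting client-side. A minimal sketch, assuming the Python azure-data-tables SDK; the table and property names are hypothetical:

    from azure.data.tables import TableClient

    # Hypothetical table; "UseDevelopmentStorage=true" targets the local emulator.
    table = TableClient.from_connection_string(
        "UseDevelopmentStorage=true", table_name="entries")

    # No OrderBy: fetch the whole partition, then sort in memory.
    entries = list(table.query_entities("PartitionKey eq 'user1'"))
    latest_50 = sorted(entries, key=lambda e: e["Created"], reverse=True)[:50]

    # No Count: counting likewise means downloading every entity first.
    total = len(entries)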
I keep getting the impression that any data I plan to access regularly without knowing the exact RowKey and PartitionKey should be indexed in Azure SQL as well as being stored in a table. Is this correct?
I also find myself recreating objects as Entity objects, but given the very severe limitations on data types I often end up just serializing the object into a byte array. And though a table row may hold up to 1MB, a byte-array property on that row may only hold 64KB, at which point I end up using Blob storage instead.
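Concretely, the blob fallback I've ended up with looks something like the sketch below: the large payload goes to Blob storage, and the table row keeps only the queryable fields plus a pointer. This assumes the Python azure-data-tables and azure-storage-blob SDKs; all names are hypothetical.

    from azure.data.tables import TableClient
    from azure.storage.blob import BlobClient

    CONN_STR = "UseDevelopmentStorage=true"  # hypothetical account

    def save_large_entity(key: str, small_fields: dict, big_payload: bytes) -> None:
        # The payload goes to a blob, which has no 64KB property limit.
        blob = BlobClient.from_connection_string(
            CONN_STR, container_name="payloads", blob_name=key)
        blob.upload_blob(big_payload, overwrite=True)

        # The table entity keeps only small, queryable fields plus a pointer.
        table = TableClient.from_connection_string(CONN_STR, table_name="items")
        table.create_entity({
            "PartitionKey": "items",
            "RowKey": key,
            "BlobUrl": blob.url,
            **small_fields,
        })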
So in the end I feel like I would have been better off just putting all my data in Azure SQL and indexing larger data but saving it as blobs.
Of course this does not feel quite right, since that would leave Table storage with no real purpose.
So I'm wondering if there are any guidelines for when to use which kind of storage.
In my case I have a very large amount of data in some areas; some of it consumes a fair amount of space (often above 64KB), but I also need to access the data very frequently and will need to be able to filter and sort it by certain values.
Do I really need to index all data I plan to access in SQL?
And would I be better off avoiding Table storage for any data that could potentially exceed 64KB?
I feel like there's something I'm not doing right. Something I didn't understand. What am I missing here?

The best recommendation I can make is basically, "Try really hard not to use Azure Table Storage". As other folks have pointed out, it's not just a "No-SQL" data-store, it's a particularly stunted, handicapped, and very-low-featured instance of a No-SQL store. About the only thing good about it is that you can put lots and lots of data into it very quickly, and with minimal storage fees. However, you basically can't hope to get that data back out again unless you're lucky enough to have a use-case that magically matches its Partition-Key/Row-Key storage model. If you don't - and I suspect very few people do - you're going to be doing a lot of partition scans, and processing the data yourself.
Beyond that, Azure Table Storage seems to be at a dead-end in terms of development. If you look at the "Support Secondary Indexes" request on the Azure feedback forums (https://feedback.azure.com/forums/217298-storage/suggestions/396314-support-secondary-indexes), you can see that support for Secondary Indexes was promised as far back as 2011, but no progress has been made. Nor has any progress been made on any of the other top requests for Table Storage.
Now, I know that Scott Guthrie is a quality guy, so my hope is that all this stagnation on the Table Storage front is a preface to Azure fixing it and coming up with something really cool. That's my hope (though I have zero evidence that's the case). But for right now, unless you don't have a choice, I'd strongly recommend against Azure Table Storage. Use Azure SQL; use your own instance of MongoDB or some other No-SQL DB; or use Amazon DynamoDB. But don't use Azure Table Storage.
EDIT: 2014-10-09 - Having been forced into a scenario where I needed to use it, I've modified my opinion on Azure Table Storage slightly. It does in fact have all the regrettable limitations I ascribe to it above, but it also has its (limited) uses. I go into them somewhat on a blog post here.
EDIT: 2017-02-09 - Nah, ATS is still awful. Steer clear of it. It hasn't improved significantly in 7+ years, and MS obviously wishes it would just go away. And it probably should - they're presumably only keeping it around for folks who made the mistake of betting on it originally.

Have a look at this: Windows Azure Table Storage and Windows Azure SQL Database - Compared and Contrasted
It doesn't include blobs, but it's a good read anyway...

I keep getting the impression that any data I plan to access regularly without knowing the exact RowKey and PartitionKey should be indexed in Azure SQL as well as being stored in a table. Is this correct?
Table storage does not support secondary indexes, so any efficient query should contain both the RowKey and the PartitionKey. There are workarounds, such as saving the same data twice in the same table with different RowKeys, but this can quickly become a pain. If eventual consistency is OK for you, then you could do this, but you need to take care of transactions and rollbacks yourself.
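As a hedged illustration of that duplicate-row workaround (Python azure-data-tables SDK, all names hypothetical): store the entity once keyed by ID for point lookups, and once keyed by an inverted timestamp so the newest rows sort first, which is also how you would serve a "latest 50" query. Note that entity-group transactions only span a single partition, so both copies must share a PartitionKey to be written atomically.

    from datetime import datetime, timezone
    from azure.data.tables import TableClient

    table = TableClient.from_connection_string(
        "UseDevelopmentStorage=true", table_name="posts")  # hypothetical

    def inverted_ticks(when: datetime) -> str:
        # Larger timestamps yield smaller keys, so newer rows sort first.
        return str(10**19 - int(when.timestamp() * 10**7)).zfill(20)

    post = {"Title": "Hello", "Body": "..."}

    # The same data saved twice under different RowKeys; the prefixes keep
    # the two copies in separate, contiguous key ranges of the partition.
    table.submit_transaction([
        ("create", {"PartitionKey": "user1", "RowKey": "id_123", **post}),
        ("create", {"PartitionKey": "user1",
                    "RowKey": "ts_" + inverted_ticks(datetime.now(timezone.utc)),
                    **post}),
    ])  # one atomic batch: either both rows land or neither does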
In my case I have a very large amount of data in some areas; some of it consumes a fair amount of space (often above 64KB), but I also need to access the data very frequently and will need to be able to filter and sort it by certain values.
Use Table storage for basic NoSQL functionality and the ability to scale quickly. However, if you want secondary indexes and other such features, you might have to take a look at something like DynamoDB on AWS, which as far as I know has better support for secondary indexes. If you have data with complex relationships, in other words data that requires an RDBMS, go with SQL Azure.
Now, as far as your options on Azure go, I'd think you would need to store everything in SQL Azure, with large objects kept as blobs or in Table storage.
Do I really need to index all data I plan to access in SQL?
Tough to say. If each partition is going to contain, say, just 100 rows, then you can query by partition key plus any of the columns, and the partition scan is going to be pretty fast. However, if you have a million rows, it could be a problem.
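To make the distinction concrete: a filter that pins the PartitionKey but not the RowKey is answered by scanning that one partition, which is fine for a hundred rows and painful for a million. Roughly (Python SDK, hypothetical names):

    from azure.data.tables import TableClient

    table = TableClient.from_connection_string(
        "UseDevelopmentStorage=true", table_name="spending")  # hypothetical

    # Pins one partition, but must walk every row in it to evaluate the
    # Spend predicate, since only PartitionKey and RowKey are indexed.
    rows = list(table.query_entities("PartitionKey eq 'A31' and Spend gt 100000.0"))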
I feel like there's something I'm not doing right. Something I didn't understand. What am I missing here?
A bunch of early Azure users started using Table Storage without understanding what NoSQL (and in this case a particularly stunted version of NoSQL) entails.

Related

Solution to host 200GB of data and provide JSON API with aggregates?

I am looking for a solution that will host a nearly-static 200GB, structured, clean dataset, and provide a JSON API onto the data, for querying in a web app.
Each row of my data looks like this, and I have about 700 million rows:
parent_org,org,spend,count,product_code,product_name,date
A31,A81001,1003223.2,14,QX0081,Rosiflora,2014-01-01
The data is almost completely static - it updates once a month. I would like to support straightforward aggregate queries like:
get total spending on product codes starting QX, by organisation, by month
get total spending by parent org A31, by month
And I would like these queries to be available over a RESTful JSON API, so that I can use the data in a web application.
I don't need to do joins, I only have one table.
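For reference, both example aggregations above are short GROUP BY queries over the schema I gave. A sketch using Python's built-in sqlite3, purely to show the shape of the queries; the table name is hypothetical, and Postgres or MySQL syntax would differ only slightly:

    import sqlite3

    db = sqlite3.connect("spending.db")  # hypothetical database file
    db.execute("""CREATE TABLE IF NOT EXISTS spending (
        parent_org TEXT, org TEXT, spend REAL, count INTEGER,
        product_code TEXT, product_name TEXT, date TEXT)""")

    # Total spending on product codes starting QX, by organisation, by month.
    q1 = db.execute("""
        SELECT org, substr(date, 1, 7) AS month, SUM(spend)
        FROM spending
        WHERE product_code LIKE 'QX%'
        GROUP BY org, substr(date, 1, 7)""").fetchall()

    # Total spending by parent org A31, by month.
    q2 = db.execute("""
        SELECT substr(date, 1, 7) AS month, SUM(spend)
        FROM spending
        WHERE parent_org = 'A31'
        GROUP BY substr(date, 1, 7)""").fetchall()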
Solutions I have investigated:
To date I have been using Postgres (with a web app to provide the API), but am starting to reach the limits of what I can do with indexing and materialized views without dedicated hardware and more skills than I have
Google Cloud Datastore: is suitable for structured data of about this size, and has a baked-in JSON API, but doesn't do aggregates (so I couldn't support my "total spending" queries above)
Google BigTable: can definitely do data of this size, can do aggregates, could build my own API using App Engine? Might need to convert data to HBase to import.
Google BigQuery: fast at aggregating, would need to roll my own API as with BigTable, easy to import data
I'm wondering if there's a generic solution for my needs above. If not, I'd also be grateful for any advice on the best setup for hosting this data and providing a JSON API.
Update: Seems that BigQuery and Cloud SQL support SQL-like queries, but Cloud SQL may not be big enough (see comments) and BigQuery gets expensive very quickly, because you're paying by the query, so isn't ideal for a public web app. Datastore is good value, but doesn't do aggregates, so I'd have to pre-aggregate and have multiple tables.
Cloud SQL is likely sufficient for your needs. It certainly is capable of handling 200GB, especially if you use Cloud SQL Second Generation.
The only reason why a conventional database like MySQL (the database Cloud SQL uses) might not be sufficient is if your queries are very complex and not indexed. I recommend you try Cloud SQL, and if the performance isn't sufficient, try ensuring you have sufficient indexes (hint: use the EXPLAIN statement to see how the queries are being executed).
If your queries cannot be indexed in a useful way, or your queries are so CPU-intensive that they are slow regardless of indexing, you might want to graduate up to BigQuery. BigQuery is parallelised so that it can handle pretty much as much data as you throw at it; however, it isn't optimized for real-time use and isn't as convenient as Cloud SQL's "MySQL in a box".
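To illustrate the indexing advice: a composite index that matches the filter column first usually decides whether an aggregate over 700 million rows is feasible at all. A sketch in sqlite3 for brevity (MySQL's EXPLAIN syntax and output differ, but the idea is the same; the table and index names are hypothetical):

    import sqlite3

    db = sqlite3.connect("spending.db")  # hypothetical file with a 'spending' table

    # Composite index: filter column first, then the group-by columns.
    db.execute("""CREATE INDEX IF NOT EXISTS idx_code_org_date
                  ON spending (product_code, org, date)""")

    # The query plan reveals whether the index is used: a "SEARCH ... USING
    # INDEX" line is good news; a full-table SCAN means more work is needed.
    plan = db.execute("""EXPLAIN QUERY PLAN
        SELECT org, substr(date, 1, 7), SUM(spend)
        FROM spending
        WHERE product_code LIKE 'QX%'
        GROUP BY org, substr(date, 1, 7)""").fetchall()
    print(plan)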
Take a look at ElasticSearch. It's JSON, REST, cloud, distributed, quick on aggregate queries and so on. It may or may not be what you're looking for.
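For instance, the first aggregate query from the question maps onto a terms aggregation with a nested sum in Elasticsearch's query DSL. A rough sketch (hypothetical index name, and it assumes org is mapped as a non-analyzed/keyword field):

    import json
    import urllib.request

    # Sum spend per org for product codes starting 'QX', against a
    # hypothetical 'spending' index on a local Elasticsearch node.
    query = {
        "size": 0,
        "query": {"prefix": {"product_code": "QX"}},
        "aggs": {
            "by_org": {
                "terms": {"field": "org"},
                "aggs": {"total_spend": {"sum": {"field": "spend"}}},
            }
        },
    }
    req = urllib.request.Request(
        "http://localhost:9200/spending/_search",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"})
    print(urllib.request.urlopen(req).read().decode())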

Creating a blog service or a persistent chat with Table Storage

I'm trying out Azure storage and can't come up with real-life scenarios where I would use it. As far as I understand, the only index Table Storage has is the Partition Key and Row Key. I can't sort or query on other columns without doing a full partition scan, right?
If I migrated my blog service from a traditional SQL server or a richer NoSQL database like Mongo, I would probably be alright, considering users don't blog that much in one year (I could partition all blog posts per user per year, for example). Even if someone hit around a thousand blog posts a year, I would be OK with loading all their metadata into memory. I could do smarter partitioning if this didn't work well.
But if I migrated my persistent chat service to Table Storage, how would I do that? Users post thousands of messages a day and query history pretty often from desktop clients, mobile devices, the web site, etc. I don't want to lose out on this and only return one day of history with paging (which can be slow as well).
Any ideas or patterns or what am I missing here?
By the way, I could always use a different database; however, considering Table Storage is so cheap, I don't want to.
PartitionKey and RowKey values are the only two indexed properties. To work around the lack of secondary indexes, you can store multiple copies of each entity, with each copy using a different RowKey value. For instance, one entity will have PartitionKey=DepartmentName and RowKey=EmployeeID, while the other entity will have PartitionKey=DepartmentName and RowKey=EmailAddress. That will allow you to look up an employee either by EmployeeID or by EmailAddress. The Azure Storage Table Design Guide (http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/) has a more detailed example and all the information you need to design scalable and performant tables.
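The payoff of storing both copies is that each lookup becomes an exact point read, the one access pattern Table storage is genuinely fast at. A small sketch with the Python azure-data-tables SDK (hypothetical values):

    from azure.data.tables import TableClient

    table = TableClient.from_connection_string(
        "UseDevelopmentStorage=true", table_name="employees")  # hypothetical

    # Both reads hit an exact (PartitionKey, RowKey) pair, so neither
    # requires a partition scan.
    by_id = table.get_entity(partition_key="Sales", row_key="42")
    by_email = table.get_entity(partition_key="Sales", row_key="jane@example.com")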
We will need more information to answer your second question about how you would migrate contents of your chat service to table storage. We need to understand the format and structure of the data that you currently store in your chat service.

Use SQL or NoSQL?

I'm designing a system that checks a given website for any security vulnerabilities. The system includes a client (firefox plugin) and a server. The server does all the scanning while the client just relays that info to the user. If a website is dangerous, it is blacklisted; otherwise whitelisted.
The system must hypothetically be able to handle several thousands of requests and updates to the database simultaneously.
Although the database is expected to have a very simple structure, I am still considering using NoSQL because my understanding is that it can handle a greater volume of queries. Is this true? Which DB technology is better suited to my system?
I suggest a NoSQL database.
In fact I've been working with both kinds of database over the last few weeks, and from searching on the internet I found the differences between a NoSQL and a SQL database.
Practically, you should use a NoSQL DB if you have a lot of data to query. Keep in mind that data recovery is not guaranteed in case of a DB disaster.
Instead, use a SQL database if your data MUST be permanent and you can't lose it. But query times will be longer, so it's not suggested if you have tons of data.
From what you wrote, I understand that you need lots of queries and you "can lose" the data (if you lose a website from the list, you'll just need to re-check it, right?).
So I suggest you go for a NoSQL DB (I worked with MongoDB; it is the most famous world-wide).
If you consider NoSQL databases, you have to analyze your data to pick the right one.
For your use case I think you should look at document databases (like MongoDB) or, if you want really high performance, a key-value database like Redis or Riak.
With key-value databases you can only use the key to find the data you want.
With document databases you still have some kinds of queries to find the data, as the sketch below illustrates.
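To make that difference concrete, a rough side-by-side, assuming the redis-py and pymongo clients with hypothetical key and field names:

    import redis
    import pymongo

    # Key-value: the key is the only way in; you go from key to value.
    r = redis.Redis()
    r.set("site:example.com", "blacklisted")
    status = r.get("site:example.com")

    # Document store: you can also query by fields inside the documents.
    sites = pymongo.MongoClient().scanner.sites
    sites.insert_one({"url": "example.com", "status": "blacklisted", "score": 9})
    dangerous = list(sites.find({"status": "blacklisted", "score": {"$gt": 5}}))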
For further information look at: http://nosql-database.org/

SQL-Azure Performance, Add Database or Add Server?

This is not a traditional scale-up or scale-out question.
Please bear with me; first allow me to give an example:
I created a SQL Azure server with a 1GB database inside, which costs $9.99 a month.
(It has a master database as well, also 1GB, but Microsoft doesn't charge us for that.)
OK, here is where my question comes in: I need another 1GB database for my application. Why do I need another 1GB database, you may ask, when Azure can support databases up to 50GB? My answer is distribution: I know the data will reach 50GB eventually, so I designed the data model to distribute and spread the data across different databases.
For the sake of performance, which option should I use:
Create another database in same server
Create another server and create a new database inside
Both options cost the same.
I guess option 2 would be better, wouldn't it?
I'm not sure there are strong (or any) performance implications. My understanding is that the consideration is mostly a management one, as some entities (mostly around security) are defined at the server level and some at the database level.
Behind the scenes the model is quite different anyway, and a multi-tenant one, so having a separate SQL Azure server does not actually mean you get a dedicated server per se. Theoretically, separate servers or separate databases may end up looking exactly the same.

Should I advocate migrating from Access to (My)SQL?

We have a Windows MFC app that is written against an Access database on a company server. The DB is not that big: 19MB. There are at most 2-3 users accessing it at any one time. It is used in a factory environment where access speed (or lack thereof) over the intranet becomes noticeable, as it is part of the manufacturing time for our widgets.
The scenario is this: as each widget is completed, it gets a record in the DB. By the end of the year the DB is larger, and searching for a record takes longer and longer. The solution so far has been to manually move older records to an archival table about once a year.
We are reworking other portions of this app right now, and it would be a good time to move to another db if we are going to do it.
It is my understanding that if we were using sql, the search time would not go up as the table gets bigger because the entire .mdb does not have to be sent over the network each time. Is this correct? Does anyone have any insight about whether it could be worth it to go to the trouble (time and money) of migrating to a new db, or should I just add more functionality to the application we have now, and maybe automatically purge the older records from time to time, and add additional facilities to the app to get at the older records when needed?
Thanks for any wisdom you can share..
Since your database is small and has very few users, I could not make a solid case for migration. I would definitely set up a script to archive old records on a more frequent basis (don't archive into the same DB; this would somewhat defeat the purpose).
But also make sure two things are in order:
INDEXES. If your queries start slowing down, make sure you have proper indexes:
http://support.microsoft.com/kb/304272
Network. Make sure your network connection between computers is fast. Maybe upgrade to gigabit cards and a router? Possibly put the DB on a SCSI drive (RAID 10 for speed and redundancy).
Throwing advanced technology at simple problems is an expensive way to go and not always the answer!
First of all, the information that the whole table and the whole database is transferred across the network is simply incorrect. If the queries are indexed, then the search times should not go up that much over time.
As others have mentioned, spending the time and money to set up a database server, and then having someone maintain, manage, and support it, is certainly a possibility here. However, keep in mind that simply migrating a JET-based application to SQL Server will in many cases run slower; in fact, SQL Server is slower than JET when no network is involved.
So, I would take some time to ascertain why things slow down so much, and also check into how indexing is set up.
And again, just keep in mind that it is pure folklore and myth that whole tables or the whole database get transferred over the network. This notion is only due to most people not having any computer training, and not knowing or understanding how the JET data engine works.
I would probably move to either Microsoft SQL Server 2008 R2 Express Edition (free) or MySQL (free) if there is both funding and time to put in a data access layer. Because you will be making requests of a remote server and not operating on data at the local workstation, this move is very involved from the development standpoint.
However, you should analyze whether or not it's more cost-effective to perform your archival process quarterly or monthly, and just move the archive database to SQL Server 2008 R2 Express Edition. (You can install the Microsoft SQL Server Management Studio client tools on workstations and query the archival database for faster reports on historical data without rewriting your entire production application; similar solutions exist for using MySQL or another OSS/free RDBMS.)
I have clients with 300MB databases, although they should be upsizing to SQL Server for other reasons. 19MB is relatively small. If performance is bad enough that archiving speeds things up, then check the indexes on the tables for all your sorting and selection fields. Albert gave you a good URL there to check.
Entire MDB files do not go down the wire, unless you are missing indexes.
Instead of shipping the DB over the network to the client and then performing queries, you could instead write a small wrapper on the server that handles requests, looks up the result in the Access DB (using SQL + the Access ODBC driver), and returns the result. This avoids the overhead of a large migration you might not need and still gets rid of the basic problem the users are experiencing.
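A toy version of that wrapper, assuming Python with pyodbc and the Windows Access ODBC driver; the path, table, and column names are illustrative only:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer
    import pyodbc

    # The .mdb is opened locally on the server, so only query results
    # (not the database file) ever cross the network.
    conn = pyodbc.connect(
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
        r"DBQ=C:\data\widgets.mdb")  # hypothetical path

    class WidgetHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Hypothetical query: the 50 most recently completed widgets.
            rows = conn.cursor().execute(
                "SELECT TOP 50 id, completed FROM widgets "
                "ORDER BY completed DESC").fetchall()
            body = json.dumps([[str(v) for v in row] for row in rows]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8080), WidgetHandler).serve_forever()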
Moving to a "proper" database solution is the best long term solution, but if your needs scale linearly and slowly over the next 30 years, it's hard to justify an expensive migration. That said, if you expect to really ramp up, or want to be more "future-proof", migrating now will likely save money/time.
It is my understanding that if we were using sql, the search time would not go up as the table gets bigger because the entire .mdb does not have to be sent over the network each time. Is this correct?
This general idea is true for almost all databases. The idea of a database is to separate your application from the actual data. The data resides in a database server. Your application doesn't.
Does anyone have any insight about whether it could be worth it to go to the trouble (time and money) of migrating to a new db
Yes. I have proposed this many times. It's expensive. It's complicated. And your MS Access database will never get better or faster.
Other database servers will (and can) get faster and more sophisticated. After all, you're not sending .MDB files through a network anymore. The limitations are reduced. You're working with standard SQL through ODBC. Any database will work at the end of ODBC. You can fire vendors to find better, faster, cheaper products. Once you stop using Access you have choices.
Either stop using Access now or plan to suffer with it forever. And remake this decision every year until the end of time.