Amazon SimpleDB - amazon-s3

Has anyone considered using something along the lines of the Amazon SimpleDB data store as their backend database?
SQL Server hosting (at least in the UK) is expensive, so could something like this, along with cloud file storage (S3), be used to build apps whose storage can grow with the application?
Great in theory, but would anyone consider using it? In fact, is anyone actually using it now for real production software? I would love to read your comments.

This is a good analysis of Amazon services from Dare.
S3 handled what I've typically heard described as "blob storage". A typical Web application has media files and other resources (images, CSS stylesheets, scripts, video files, etc.) that are simply accessed by name/path. However, a lot of these resources also have metadata (e.g. a video file on YouTube has metadata about its rating, who uploaded it, number of views, etc.) which needs to be stored as well. This need for queryable, schematized storage is where SimpleDB comes in. EC2 provides a virtual server that can be used for computation, complete with a local file system instance which isn't persistent if the virtual server goes down for any reason. With SimpleDB and S3 you have the building blocks to build a large class of "Web 2.0" style applications when you throw in the computational capabilities provided by EC2.
However neither S3 nor SimpleDB provides a solution for a developer who simply wants the typical LAMP or WISC developer experience of building a database driven Web application or for applications that may have custom storage needs that don't fit neatly into the buckets of blob storage or schematized storage. Without access to a persistent filesystem, developers on Amazon's cloud computing platform have had to come up with sophisticated solutions involving backing data up manually from EC2 to S3 to get the desired experience.

I just finished writing a Perl library, Net::Amazon::SimpleDB::Simple, to make porting an app to SimpleDB easy, because I found the Amazon client libraries painful. The idea was to make it trivial to stuff hashes in and out of SimpleDB. The library isn't on CPAN yet, but it is at http://rjurneyopen.s3.amazonaws.com/SimpleDB/Simple.pm
I just ported an app to use it. Overall I am impressed with SimpleDB... even inefficient queries take only 2-3 seconds to return. SimpleDB doesn't seem to care about the size of your table, owing to its Erlang/parallel nature. Tablescans are easy for it.
The pain comes from the fact that you can't count, sum or group by. If you plan on doing any of those things... then SimpleDB probably isn't for you. At the moment, in terms of functionality, it sits somewhere between memcached and MySQL. You can SELECT ORDER BY LIMIT, which is nice. It's also nice that you don't have to scale it yourself, and it's nice that it doesn't care how much you stuff into it. But more advanced operations like analytics are painful at best. You'll have to do your own calculations server side. It's also a big plus that on any computer I can use the simpledb CLI (http://code.google.com/p/amazon-simpledb-cli/) to query my data.
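To make "do your own calculations server side" concrete, here is a minimal sketch of emulating COUNT/SUM/GROUP BY in application code. The rows and the `group_sum` helper are made up for illustration; the only SimpleDB-specific assumption is that values come back as strings, so they are converted before summing.

```python
from collections import defaultdict

# Hypothetical rows, shaped like the attribute dicts a SimpleDB client might return.
# SimpleDB stores every value as a string, so numbers are converted below.
items = [
    {"user": "alice", "views": "10"},
    {"user": "bob", "views": "5"},
    {"user": "alice", "views": "7"},
]

def group_sum(rows, key, value):
    """Emulate COUNT(*) and SUM(value) grouped by key, in application code."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for row in rows:
        bucket = totals[row[key]]
        bucket["count"] += 1
        bucket["sum"] += int(row[value])
    return dict(totals)

summary = group_sum(items, "user", "views")
```

This works fine for small result sets, but every row has to be fetched first, which is exactly why heavy analytics get painful.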
There are some confusing gotchas. For instance, attributes can have more than one value, and you have to explicitly set 'replace' when storing items if you want to overwrite a value rather than add another one. Also, storing undef or an empty string results in a library error, instead of deleting that attribute name/value pair or setting it to null/empty string.
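The multi-value gotcha is easier to see in a toy model. This is an in-memory sketch of the behaviour only, not the real client API; `put_attribute` is a hypothetical helper.

```python
# Toy in-memory model of SimpleDB-style attributes; this is NOT the real client API.
# Each item maps an attribute name to a set of values.
def put_attribute(item, name, value, replace=False):
    """Store a value; without replace=True it is added alongside existing values."""
    if replace:
        item[name] = {value}
    else:
        item.setdefault(name, set()).add(value)
    return item

item = {}
put_attribute(item, "color", "red")
put_attribute(item, "color", "blue")                  # now two values for one attribute
two_values = set(item["color"])
put_attribute(item, "color", "green", replace=True)   # overwrites both earlier values
```

If you come from SQL and expect a second write to behave like an UPDATE, the first two calls are the surprise.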
Learning to think in a largely un-normalized way is a little strange too, which is why I would second the suggestion above that says it is best for new applications. Porting from a SQL app to SimpleDB would be painful because your application logic would have to change. The way you do things is a bit different. The Amazon docs are pretty good at explaining this.
All of this can be abstracted away in a library that sits atop SimpleDB, so you will want to pick a good library... you probably don't want to deal with SimpleDB directly. There is some work on the PHP side to make things easy, and there is my library. There is a Rails activesource, but it doesn't seem to do much for you.
All in all it's still early in the game, but compared to other APIs (Twitter comes to mind), I have to say that the SimpleDB REST API is pretty simple (especially considering that it is XML) and polite to work with. I would recommend it... depending on the requirements of your application and the economics of your use of it. If you're looking to rapidly scale a service that doesn't put a great load on the DB and don't want to bother with a scalable MySQL/memcache combo... then SimpleDB can offer a 'simple' solution for you.
I expect that its features will continue to grow and it will be a good choice for more and more applications that do more complex and interesting things. But right now it is targeted at and appropriate for your typical Web 2.0 service.

We are using SimpleDB almost exclusively for our new projects. The zero-maintenance, high-availability, no-install aspects are just too good. And for you Ruby developers, check out SimpleRecord, an ActiveRecord-like interface for SimpleDB which makes it super easy to use.

But do you really need SQL Server? Can't you live with PostgreSQL or MySQL? Both have proven to be ok for most tasks.
Now if you need SQL Server features then you're out of luck.
Another option is to rent a server. How expensive is expensive?
(I've used Amazon S3 to store images for an application, it's ok and works fine, at least for that)

I haven't used SimpleDB, but have been using a combination of S3, EC2, and MySQL for our application.
If you are willing to use SimpleDB, then you might as well consider using MySQL (which is very scalable, and not that expensive).
On the S3 and EC2 side, it is great in practice as well.

SimpleDB works great for many applications... but if your project will require a lot of analytic reporting, joining, etc., you may want to consider MySQL or a hybrid model.
If you go SimpleDB, we've developed Radquery.com for our internal use and opened it up to the public.

Related

Easiest API to learn/methodology to create web applications for running MapReduce on Hadoop?

I have Hadoop 1.0.4 running on Ubuntu 11.04, configured with Eclipse. I want to make a web application to run Hadoop jobs. Cassandra, HBase, or Hive might be a way, but I don't have much time to learn all of these thoroughly, and I want to do it as quickly as possible. Any advice on which one might prove the easiest to get started with?
I don't know if this question really qualifies to be here on SO in its current form. This is the reason I did not write this initially. But, a lot of SO experts are out there to decide this(they can do it much better than me) :)
Having said that, I would like to share a few things with you based on my personal experience, so that you proceed along the correct path. First of all, Hadoop jobs (MapReduce) and Hive are actually not a good fit for web services kinda use cases. They are most suitable for offline, batch processing kinda stuff. HBase/Cassandra can be used though, if you have real-time needs (like web services).
Coming back to your actual question. Before diving into Hadoop, Hive, HBase etc., I would suggest you get a handle on web services first (if you are new to web services as well). The reason is that a web service has a much wider scope of applicability than tools like Hadoop, Hive, HBase etc. These tools are specific to some particular use cases and cannot be used everywhere. But web services are used almost everywhere and with any number of different things, like RDBMSs, NoSQL datastores etc. So if you know web service concepts you definitely have that extra edge. To begin with you can visit these links:
Web Services Tutorial by W3Schools(Nice n easy. Would serve the quick start guide purpose).
For a detailed tutorial you can visit the oracle web services tutorial.
This link by IBM developerworks has references to some really good web services learning stuff.
You might find this one really helpful to start with(Shows how to create web services using Eclipse).
And you can obviously Google web service tutorials anytime.
One last thing. Although it's not mandatory to be a pro in things like Hadoop, Hive, HBase etc., having a decent understanding of the concepts would be really helpful in developing your solution in a much better manner. It'll allow you to think accurately in the correct direction.
HTH.

PyAMF backend choices!

I've been using PyAMF to write a backend for a Flex app that will request different groups of hundreds of different images depending on what the client needs. I have been using the "simple_server" WSGI server that PyAMF supplies while developing the Flex code. Now I'm ready to write a robust backend that will be able to pull images from a MySQL database and send them as fast and as efficiently as possible to many concurrent clients.
The PyAMF documentation is great because they supply many examples to follow, however I am confused about what kind of backend I am trying to create.
Do I want a SocketServer or a WSGI server or something like Twisted or web2py or Tornado? Are these even all different? :) Should I be using Apache modules instead (mod_wsgi or modjy or mod_python)?
I realize that this probably touches on many open debates, so maybe you could just point me to any good summaries of these debates?
It's great to have so many options, but how do I choose?
The short answer is, of course, that it depends on the requirements of your project.
How many concurrent connections is "a lot"?
How much programmer time can you throw at the problem?
How much hardware can you throw at the problem?
...etc...
If you plan to have lots of concurrent clients, it's hard to beat Twisted in the Python world. However, you'll have to deal with your database asynchronously to avoid blocking, and depending on how complex your database interactions are, this can be a bit of a pain. You're basically limited to either using twisted.enterprise.adbapi or coming up with your own twisted-ORM integration.
If you'd rather have "easy" database code (i.e. you want to use an ORM), you're better off going with a (TurboGears/Pylons/plain WSGI) project, probably hosted using Apache and mod_wsgi. This can be a pretty scalable solution, and you get a lot of stuff for free using these frameworks, but it may be more than you need.
I would avoid using one of the many plain Python WSGI servers out there (wsgiref, paster, etc.) in production if you really want high performance.
Good Luck!

What Backup Library/Code Do You Use For Your Dataset (Files On Disk)?

I'm implementing Backup functionality into my new (small) app, Oldaer. I've got standalone desktop files (rather than sitting in a SQL db).
Looking around, I decided on using a Clarion 3rd-party template that will package them into one file and then compress that one file (Huffman coding). Restoring is just the reverse: uncompress, unpack.
However, I'm not convinced this is ideal.
What Backup functionality do you implement for your dataset?
Of course, there's a lot more in "Backup/Restore" functionality. Location, Tracking/Archiving, Out-of-the-box Information (like better ways of letting the User know what was in the archive file). But that's another question.
SQL Replication, clustering, RAID 5
Just been playing with uploading datasets to Amazon S3 using the NetTalk 3rd party libraries in Clarion. Seems to work a treat. I am working on keeping multiple 'versions' of the datasets using the MetaTags functionality.
Happy to dig out my code and discuss further if you need.

Experiences and tips for programming with and for Amazon's cloud servers/apps/tools?

We're looking into developing a product that would use Amazon's cloud tools (EC2, SQS, etc), and I'm curious what tips/gotchas/pointers people that have used these technologies have.
One tip/whatever per post, please.
The Elasticfox plug-in for Mozilla makes doing a lot of the EC2 stuff easier. It can be found at: Elasticfox Firefox Extension for Amazon EC2. This page has links specifically to download the Elasticfox plug-in and also the associated Sourceforge project. Well worth using...
Get a developer account at RightScale. It's free and a god-send for a guy who hates remembering those dumb commands and arguments. If you only resort to Amazon-supplied tools, you're throwing away your human rights.
We're interested in EC2 where I work. We don't care about web-serving or enterprisey stuff, just massive number crunching for physics, using Python. This EC2 stuff had me befuddled, with most documentation oriented toward businessy applications and using C# or Java, but this slide show clarified much for me, especially for using Python: http://www.datawrangling.com/pycon-2008-elasticwulf-slides
As for SimpleDB, it has a very limited query language and it is very restrictive. If you are planning on having a lot of complex queries, you must first sit down and think about how to organize your data to make those queries possible. One thing missing, but that will probably be added, is the ability to count the results of a given query, much like SQL's COUNT.
Performance is ok, but I consider the latency maybe a little high.
An important concept to grasp: the file system your EC2 instance lives on while it's running is not persistent. There are tools/services available that let you mount file systems backed by S3 storage, or you can upload to S3 or other storage service from the instance, but when an instance closes the associated file system is no more.
As for tools, I've found Amazon's tools to be great, but you should probably be comfortable with the command line if you're taking this route.
For managing your EC2 instances, etc., Amazon also offers the management console (in beta for the past couple of days), which has similar functionality to the Elasticfox Firefox plugin but is a pure web console.
https://console.aws.amazon.com

Website Hardware Scaling

So I was listening to the latest Stack Overflow podcast (episode 19), and Jeff and Joel talked a bit about scaling server hardware as a website grows. From what Joel was saying, the first few steps are pretty standard:
One server running both the webserver and the database (the current Stack Overflow setup)
One webserver and one database server
Two load-balanced webservers and one database server
They didn't talk much about what comes next though. Do you add more webservers? Another database server? Replicate this three-machine cluster in a different datacenter for redundancy? Where does a web startup go from here in the hardware department?
A reasonable setup supporting an "average" web application might evolve as follows:
Single combined application/database server
Separate database on a different machine
Second application server with DNS round-robin (poor man's load balancing) or, e.g. Perlbal
Second, replicated database server (for read loads, requires some application logic changes so eligible database reads go to a slave)
At this point, evaluating the current state of affairs would help to determine a better scaling path. For example, if read load is high and content doesn't change too often, it might be better to emphasise caching and introduce dedicated front-end caches, e.g. Squid to avoid un-needed database reads, although you will need to consider how to maintain cache coherency, typically in the application.
On the other hand, if content changes reasonably often, then you will probably prefer a more spread-out solution; introduce a few more application servers and database slaves to help mitigate the effects, and use object caching, such as memcached to avoid hitting the database for the less volatile content.
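Object caching of the kind described here is commonly done with a cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache. Below is a rough sketch that uses a tiny in-process class as a stand-in for memcached; `TTLCache`, `load_profile_from_db` and the key format are all made up for illustration.

```python
import time

class TTLCache:
    """Tiny in-process stand-in for memcached: get/set with an expiry time."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]   # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)

stats = {"db_hits": 0}

def load_profile_from_db(user_id):
    """Hypothetical slow database read; counts how often it is called."""
    stats["db_hits"] += 1
    return {"id": user_id, "name": f"user{user_id}"}

def get_profile(cache, user_id, ttl=60):
    key = f"profile:{user_id}"
    profile = cache.get(key)
    if profile is None:           # cache miss: read the database, then populate
        profile = load_profile_from_db(user_id)
        cache.set(key, profile, ttl)
    return profile

cache = TTLCache()
first = get_profile(cache, 42)
second = get_profile(cache, 42)   # served from the cache, no database read
```

The TTL is the crude answer to cache coherency here: stale data lives for at most `ttl` seconds unless the application also invalidates the key on writes.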
For most sites, this is probably enough, although if you do become a global phenomenon, then you'll probably want to start considering having hardware in regional data centres, and using tricks such as geographic load balancing to direct visitors to the closest "cluster". By that point, you'll probably be in a position to hire engineers who can really fine-tune things.
Probably the most valuable scaling advice I can think of would be to avoid worrying about it all far too soon; concentrate on developing a service people are going to want to use, and making the application reasonably robust. Some easy early optimisations are to make sure your database design is fairly solid, and that indexes are set up so you're not doing anything painfully crazy; also, make sure the application emits cache-control headers that direct browsers on how to cache the data. Doing this sort of work early on in the design can yield benefits later, especially when you don't have to rework the entire thing to deal with cache coherency issues.
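As an illustration of the cache-control point, here is a minimal WSGI sketch that sets a `Cache-Control` header. The one-hour `max-age` and the stylesheet body are assumptions made for the example, not a recommendation for any particular site.

```python
def app(environ, start_response):
    """Minimal WSGI app that tells browsers and proxies how long to cache a response."""
    body = b"body { color: #333; }"
    headers = [
        ("Content-Type", "text/css"),
        ("Content-Length", str(len(body))),
        # Assumed policy: any cache may keep this response for one hour.
        ("Cache-Control", "public, max-age=3600"),
    ]
    start_response("200 OK", headers)
    return [body]

# Exercise the app without a server by capturing what it passes to start_response.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/site.css"}
response_body = b"".join(app(environ, fake_start_response))
```

For a quick local check the same `app` can be served with `wsgiref.simple_server.make_server`; production hosting would go through a real server.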
The second most valuable piece of advice I want to put across is that you shouldn't assume what works for some other web site will work for you; check your logs, run some analysis on your traffic and profile your application - see where your bottlenecks are and resolve them.
Plenty of Fish architecture
Some interesting videos:
YouTube scalability
Interview with Dan Farino, System Architect at MySpace
Joel mentioned adding a second datacenter, with the same setup, and then assigning your users randomly to each. Changes to the data are logged and sent from one location to the other, so that both locations contain all the data.
The talk Scalable Web Architectures: Common Patterns & Approaches from Cal Henderson (Yahoo) at Web 2.0 Expo was quite interesting. I thought there was a video, but I could not find it. But here are the slides:
http://www.slideshare.net/techdude/scalable-web-architectures-common-patterns-and-approaches
A certain next step would be a cluster of webservers (a web farm) and a clustered system of database servers (replication or Oracle RAC etc. etc.)
If you're interested in caching and using .NET, look into the Caching Application Block in Enterprise Library (of course use this along with the other points above).