I've been using PyAMF to write a backend for a Flex app that will request different groups of hundreds of different images depending on what the client needs. I have been using the "simple_server" WSGI server that PyAMF supplies while developing the Flex code. Now I'm ready to write a robust backend that will be able to pull images from a MySQL database and send them as fast and as efficiently as possible to many concurrent clients.
The PyAMF documentation is great because they supply many examples to follow; however, I am confused about what kind of backend I am trying to create.
Do I want a SocketServer or a WSGI server, or something like Twisted, web2py, or Tornado? Are these even all different? :) Should I be using Apache modules instead (mod_wsgi, modjy, or mod_python)?
I realize that this probably touches on many open debates, so maybe you could just point me to any good summaries of these debates?
It's great to have so many options, but how do I choose?
The short answer is, of course, that it depends on the requirements of your project.
How many concurrent connections is "a lot"?
How much programmer time can you throw at the problem?
How much hardware can you throw at the problem?
...etc...
If you plan to have lots of concurrent clients, it's hard to beat Twisted in the Python world. However, you'll have to deal with your database asynchronously to avoid blocking, and depending on how complex your database interactions are, this can be a bit of a pain. You're basically limited to either using twisted.enterprise.adbapi or coming up with your own Twisted/ORM integration.
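For a rough idea of what that looks like, here is a minimal sketch using twisted.enterprise.adbapi; the images table, credentials, and query are all made up for illustration:

    # adbapi runs each query in a thread pool so the reactor never blocks
    # on the database. Table name and credentials are hypothetical.
    from twisted.enterprise import adbapi
    from twisted.internet import reactor

    dbpool = adbapi.ConnectionPool("MySQLdb", db="images", user="app", passwd="secret")

    def fetch_image(image_id):
        # runQuery returns a Deferred that fires with the result rows.
        return dbpool.runQuery("SELECT data FROM images WHERE id = %s", (image_id,))

    def on_result(rows):
        print("got %d row(s)" % len(rows))
        reactor.stop()

    fetch_image(42).addCallback(on_result)
    reactor.run()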
If you'd rather have "easy" database code (i.e. you want to use an ORM), you're better off going with a (TurboGears/Pylons/plain wsgi) project, probably hosted using Apache and mod_wsgi. This can be a pretty scalable solution, and you get a lot of stuff for free using these frameworks, but it may be more than you need.
I would avoid using one of the many plain Python WSGI servers out there (wsgiref, paster, etc.) in production if you really want high performance.
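For context, those servers take only a few lines to stand up, which is exactly why they are handy in development; a minimal example with the stdlib's wsgiref:

    # The reference WSGI server: fine for development, single-threaded by
    # default, and not what you want under real production load.
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello"]

    make_server("", 8000, app).serve_forever()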
Good Luck!
What are the advantages and disadvantages of using nginx + Apache + mod_wsgi versus nginx + uWSGI (in a virtualenv) in production?
The advantage I see in the first variant is that mod_wsgi has been in development since 2007, so it has more stable releases and is easy to administer.
The advantage of the second variant is higher performance (see the Benchmark of Python WSGI Servers), plus the ability to run the uWSGI server inside a virtualenv, which is more secure.
The disadvantage of the second variant is that there is still no major release, and you need to create control scripts for starting a uWSGI server for each virtual host (or use supervisor).
What do you think about it?
When you load your typical large Python web application on top of the most popular WSGI servers, the performance difference isn't actually that much and usually nothing to get excited about. Hello world benchmarks like the one you quote are very misleading, as they test a very narrow use case, and the configurations used are usually never comparable. You should consider watching my PyCon talk, which covers bottlenecks in web servers and web applications.
http://pyvideo.org/video/703/web-server-bottlenecks-and-performance-tuning
Given that the WSGI server is not usually the problem, you should just choose that which you find easiest to manage and has the sorts of features you think you will require. Then use benchmarking and monitoring of that choice to work out how to set it up so as to perform best for your specific web application. Even then, any increase in performance or gains in user satisfaction are not usually going to come from such tuning.
I've heard polling the server is not the best of ideas.
Let's say I make a client-server application.
A simple game for example.
Where each client polls the server every half a minute.
How many clients is it possible to have before it overloads a WAMP server?
Basically, how robust is Apache for this kind of stuff?
Getting a request, aggregating data from a MySQL server, and then returning the data in XML format.
This is a really open-ended question. It entirely depends on your configuration: how many Apache services do you have running, how many physical servers do you have, how is your MySQL server set up (is it on its own machine)? You also want to keep in mind that by polling the server you have to initiate a connection each time and allocate resources for that communication (both in low-level networking and in your program).
If at all possible it might be better for the server to push content to the client (assuming a push happens less frequently than polls happen).
My guess is you will max out whatever is handling the requests, and the MySQL database, before the Apache server becomes a problem.
However, going out on a limb: with proper caching in place, a sufficiently smart architecture and pragmatic design, and effectively only the 30-second polling, you should be able to support a couple of thousand users.
But : mock it up, quick and dirty, do roughly what you think you will be doing, and hit it with JMeter (or similar) with 3, 10, 30, 100, 300, 1000, 3000, ... till you find the performance wall.
I am scared about the aggregating-data part... if aggregation is really needed, carefully architect it so you do not need to go to the DB, because this will kill you in production, and you won't find it in development.
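To make the "mock it up and hit it" advice concrete, here is a quick-and-dirty ramp test sketch in Python; the URL and step sizes are placeholders, and JMeter will give you far better reporting:

    # Hit a placeholder endpoint with increasing concurrency until the
    # latency (or error rate) blows up.
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost/poll"  # placeholder endpoint

    def hit(_):
        start = time.time()
        urllib.request.urlopen(URL).read()
        return time.time() - start

    for n in (3, 10, 30, 100, 300, 1000):
        with ThreadPoolExecutor(max_workers=n) as pool:
            latencies = list(pool.map(hit, range(n * 10)))
        print("%4d workers: worst %.3fs, mean %.3fs"
              % (n, max(latencies), sum(latencies) / len(latencies)))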
Can't really give you numbers. It depends on a bunch of factors (including hardware).
What's more important to you: how many concurrent users you can support, or how real-time the results are?
If you're looking for real-time results, you probably want to investigate something like Comet or long polling.
If you're looking for supporting a lot of users, the long polling approach probably isn't ideal and you'll probably want something more lightweight than Apache. Personally, I'm a fan of nginx.
EDIT: And if you're feeling really hip, your best bet for real-time results here is WebSockets, but if you're a Microsoft guy this isn't going to do you much good, as IE doesn't support them.
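To illustrate the long-polling idea mentioned above: instead of asking every 30 seconds, the client's request is held open until the state actually changes. A framework-agnostic sketch using a threading.Event as the "new data" signal; all names here are illustrative:

    import threading

    new_data = threading.Event()
    latest = {"state": None}

    def long_poll(timeout=25):
        # Block this request until publish() fires or the timeout expires.
        if new_data.wait(timeout):
            return latest["state"]
        return None  # timed out; the client simply reconnects

    def publish(state):
        latest["state"] = state
        new_data.set()    # wake every request currently waiting in long_poll()
        new_data.clear()  # re-arm for the next state change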
I want to make an application that executes a remote script. The user can create a script (probably a Lua script) and then store it on the server. Then he can use an API to execute the script. I was thinking that the API could be a web service.
So my questions are:
I need high performance to execute the scripts, so my first choice was Lua. Does anyone have another suggestion?
Because I need high performance, I am wondering whether a web service is the best solution. Maybe I could create a TCP/IP Windows service that holds the users' requests. It is important to say that I will have many users executing scripts at the same time, so I will have a concurrency problem.
My scripts will query a database. I will use Tokyo Cabinet or Tokyo Tyrant. I think Tokyo Tyrant is the only solution, because I will have many requests. For performance, do I need connection pooling? Is there any way to share variables between web service requests?
To make the web service or the Windows service, I was thinking of using C++.
Can someone help with these questions?
thanks
Lua is pretty high performance for a scripting language, especially if you use LuaJIT or something similar.
You speak of high performance. How much are we talking about? Say you have a very simple web service that executes scripts it receives via POST; then the HTTP overhead is probably comparably small next to the Lua compile, environment setup, and execution time.
About the database I cannot tell you anything. There are many possibilities for pooling, and this also depends on how you execute the Lua scripts. Are they running in a common environment? One per session? One per request?
C++ surely is a good choice to host Lua, because Lua fits in pretty well. Though there are other good language bindings as well.
But keep in mind that your job is not over just by sandboxing scripts. User-submitted scripts can do a lot of other Bad Things(TM), intentionally or by mistake, like allocating a lot of memory or hogging the CPU. In Lua (and I think this is true of many, if not all, sandboxed environments) you cannot do much about this, except killing the offending instance or, if you disallowed coroutines in your sandbox, yielding out of the offending coroutine and doing something smarter.
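One mitigation the answer hints at ("killing the offending instance") can be pushed down to the OS level instead of the Lua VM. A hedged, Python-flavoured sketch that runs each script in its own process under hard CPU and memory limits; the lua binary, limit values, and timeout are all assumptions:

    # POSIX-only: the kernel enforces the limits and kills a runaway
    # script, so it cannot wedge the host process.
    import resource
    import subprocess

    def run_limited(script_path, cpu_seconds=2, mem_bytes=64 * 1024 * 1024):
        def set_limits():
            resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
            resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        return subprocess.run(["lua", script_path], preexec_fn=set_limits,
                              capture_output=True, timeout=10)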
Why use a message queuing library such as ZeroMQ, as opposed to writing your own?
We're working on a project here that will be a self-dividing server pool, if one section grows too heavy, the manager would divide it and put it on another machine as a separate process. It would also alert all connected clients this affects to connect to the new server.
I am curious about using ZeroMQ for inter-server and inter-process communication. My partner would prefer to roll his own. I'm looking to the community to answer this question.
I'm a fairly novice programmer myself and just learned about message queues. As I've googled and read, it seems everyone is using message queues for all sorts of things, but why? What makes them better than writing your own library? Why are they so common, and why are there so many?
What makes them better than writing your own library?
When rolling out the first version of your app, probably nothing: your needs are well defined, and you will develop a messaging system that fits your needs: small feature list, small source code, etc.
Those tools are very useful after the first release, when you actually have to extend your application and add more features to it.
Let me give you a few use cases:
your app will have to talk to a big-endian machine (SPARC/PowerPC) from a little-endian machine (x86, Intel/AMD), and your messaging system had some endian-ordering assumption: go and fix it
you designed your app without a binary protocol/messaging system, and now it is very slow because you spend most of your time parsing messages (the number of messages increased and parsing became a bottleneck): adapt it so it can transport a binary/fixed encoding (see the framing sketch after this list)
at the beginning you had 3 machines inside a LAN, no noticeable delays, and everything got to every machine; your client/boss/pointy-haired-devil-boss shows up and tells you that you will install the app on a WAN you do not manage, and then you start having connection failures, bad latency, etc.; you need to store messages and retry sending them later on: go back to the code and plug this stuff in (and enjoy)
messages sent need to have replies, but not all of them: you send some parameters in and expect a spreadsheet as a result, instead of just sends and acknowledges: go back to the code and plug this stuff in (and enjoy)
some messages are critical, and their reception/sending needs proper backup/persistence. Why, you ask? Auditing purposes
And many other use cases that I forgot ...
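As promised in the binary-encoding point above, here is a tiny sketch of length-prefixed binary framing with Python's struct module; the header layout (type plus length, network byte order) is just one possible choice:

    import struct

    HEADER = struct.Struct("!HI")  # uint16 message type, uint32 payload length

    def pack(msg_type, payload):
        return HEADER.pack(msg_type, len(payload)) + payload

    def unpack(frame):
        msg_type, length = HEADER.unpack_from(frame)
        return msg_type, frame[HEADER.size:HEADER.size + length]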
You can implement it yourself, but do not spend much time doing so: you will probably replace it later on anyway.
That's very much like asking: why use a database when you can write your own?
The answer is that using a tool that has been around for a while and is well understood in lots of different use cases, pays off more and more over time and as your requirements evolve. This is especially true if more than one developer is involved in a project. Do you want to become support staff for a queueing system if you change to a new project? Using a tool prevents that from happening. It becomes someone else's problem.
Case in point: persistence. Writing a tool to store one message on disk is easy. Writing a persistor that scales and performs well and stably, in many different use cases, and is manageable, and cheap to support, is hard. If you want to see someone complaining about how hard it is then look at this: http://www.lshift.net/blog/2009/12/07/rabbitmq-at-the-skills-matter-functional-programming-exchange
Anyway, I hope this helps. By all means write your own tool. Many many people have done so. Whatever solves your problem, is good.
I'm considering using ZeroMQ myself - hence I stumbled across this question.
Let's assume for the moment that you have the ability to implement a message queuing system that meets all of your requirements. Why would you adopt ZeroMQ (or another third-party library) over the roll-your-own approach? Simple: cost.
Let's assume for a moment that ZeroMQ already meets all of your requirements. All that needs to be done is integrating it into your build, read some doco and then start using it. That's got to be far less effort than rolling your own. Plus, the maintenance burden has been shifted to another company. Since ZeroMQ is free, it's like you've just grown your development team to include (part of) the ZeroMQ team.
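To give a feel for how little integration it demands, here is a minimal request/reply pair with the pyzmq binding; both sockets live in one process purely for demonstration, and the port number is arbitrary:

    import zmq

    context = zmq.Context()

    server = context.socket(zmq.REP)
    server.bind("tcp://*:5555")

    client = context.socket(zmq.REQ)
    client.connect("tcp://localhost:5555")

    client.send(b"ping")
    print(server.recv())  # b'ping'
    server.send(b"pong")
    print(client.recv())  # b'pong'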
If you ran a Software Development business, then I think that you would balance the cost/risk of using third party libraries against rolling your own, and in this case, using ZeroMQ would win hands down.
Perhaps you (or rather, your partner) suffer, as so many developers do, from "Not Invented Here" syndrome? If so, adjust your attitude and reassess the use of ZeroMQ. Personally, I much prefer the benefits of a Proudly Found Elsewhere attitude. I'm hoping I can be proud of finding ZeroMQ... time will tell.
EDIT: I came across this video from the ZeroMQ developers that talks about why you should use ZeroMQ.
What makes them better than writing your own library?
Message queuing systems are transactional, which is conceptually easy to use as a client, but hard to get right as an implementor, especially considering persistent queues. You might think you can get away with writing a quick messaging library, but without transactions and persistence, you'd not have the full benefits of a messaging system.
Persistence in this context means that the messaging middleware keeps unhandled messages in permanent storage (on disk) in case the server goes down; after a restart, the messages can be handled and no retransmit is necessary (the sender does not even know there was a problem). Transactional means that you can read messages from different queues and write messages to different queues in a transactional manner, meaning that either all reads and writes succeed or (if one or more fail) none succeeds. This is not really much different from the transactionality known from interfacing with databases and has the same benefits (it simplifies error handling; without transactions, you would have to assure that each individual read/write succeeds, and if one or more fail, you have to roll back those changes that did succeed).
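As a concrete, hedged illustration of a transactional, persistent publish, here is roughly what it looks like with the pika client for RabbitMQ; the queue name and message body are placeholders:

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="work", durable=True)  # queue survives a broker restart

    ch.tx_select()  # start an AMQP transaction on this channel
    try:
        ch.basic_publish(
            exchange="",
            routing_key="work",
            body=b"job-1",
            properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
        )
        ch.tx_commit()    # the publish, and its persistence, take effect...
    except Exception:
        ch.tx_rollback()  # ...or none of it does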
Before writing your own library, read the 0MQ Guide here: http://zguide.zeromq.org/page:all
Chances are that you will either decide to install RabbitMQ, or else you will make your library on top of ZeroMQ since they have already done all the hard parts.
If you have a little time, give it a try and roll out your own implementation! What you learn from the exercise will convince you of the wisdom of using an already-tested library.
So I was listening to the latest Stack Overflow podcast (episode 19), and Jeff and Joel talked a bit about scaling server hardware as a website grows. From what Joel was saying, the first few steps are pretty standard:
One server running both the webserver and the database (the current Stackoverflow setup)
One webserver and one database server
Two load-balanced webservers and one database server
They didn't talk much about what comes next though. Do you add more webservers? Another database server? Replicate this three-machine cluster in a different datacenter for redundancy? Where does a web startup go from here in the hardware department?
A reasonable setup supporting an "average" web application might evolve as follows:
Single combined application/database server
Separate database on a different machine
Second application server with DNS round-robin (poor man's load balancing) or, e.g. Perlbal
Second, replicated database server (for read loads; requires some application logic changes so eligible database reads go to a slave; a minimal routing sketch follows)
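A minimal sketch of the read-routing change that step implies, assuming two already-open DB-API connections; the dispatch rule here is deliberately naive:

    # Route writes to the master and simple reads to the replica.
    # Anything needing read-your-writes consistency belongs on the master.
    class RoutingDB(object):
        def __init__(self, master, replica):
            self.master = master
            self.replica = replica

        def execute(self, sql, params=()):
            is_read = sql.lstrip().upper().startswith("SELECT")
            conn = self.replica if is_read else self.master
            cursor = conn.cursor()
            cursor.execute(sql, params)
            return cursor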
At this point, evaluating the current state of affairs would help to determine a better scaling path. For example, if read load is high and content doesn't change too often, it might be better to emphasise caching and introduce dedicated front-end caches, e.g. Squid, to avoid unneeded database reads, although you will need to consider how to maintain cache coherency, typically in the application.
On the other hand, if content changes reasonably often, then you will probably prefer a more spread-out solution; introduce a few more application servers and database slaves to help mitigate the effects, and use object caching, such as memcached, to avoid hitting the database for the less volatile content.
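The object-caching pattern mentioned here is only a few lines with the python-memcached client; the key scheme and load_article (standing in for the real database query) are made up:

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    def load_article(article_id):
        ...  # stand-in for the real database query

    def get_article(article_id):
        key = "article:%d" % article_id
        data = mc.get(key)
        if data is None:                     # cache miss: hit the database once,
            data = load_article(article_id)
            mc.set(key, data, time=300)      # then cache the result for 5 minutes
        return data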
For most sites, this is probably enough, although if you do become a global phenomenon, then you'll probably want to start considering having hardware in regional data centres, and using tricks such as geographic load balancing to direct visitors to the closest "cluster". By that point, you'll probably be in a position to hire engineers who can really fine-tune things.
Probably the most valuable scaling advice I can think of would be to avoid worrying about it all far too soon; concentrate on developing a service people are going to want to use, and making the application reasonably robust. Some easy early optimisations are to make sure your database design is fairly solid, and that indexes are set up so you're not doing anything painfully crazy; also, make sure the application emits cache-control headers that direct browsers on how to cache the data. Doing this sort of work early on in the design can yield benefits later, especially when you don't have to rework the entire thing to deal with cache coherency issues.
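Emitting those cache-control headers is a one-line change in most stacks; a WSGI-flavoured sketch, with a purely illustrative five-minute max-age:

    def app(environ, start_response):
        headers = [
            ("Content-Type", "text/html"),
            ("Cache-Control", "public, max-age=300"),  # let browsers/proxies cache
        ]
        start_response("200 OK", headers)
        return [b"<html>...</html>"]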
The second most valuable piece of advice I want to put across is that you shouldn't assume what works for some other web site will work for you; check your logs, run some analysis on your traffic and profile your application - see where your bottlenecks are and resolve them.
The Plenty of Fish architecture
Some interesting videos:
YouTube scalability
Interview with Dan Farino, System Architect at MySpace
Joel mentioned adding a second datacenter, with the same setup, and then assigning your users randomly to each. Changes to the data are logged and sent from one location to the other, so that both locations contain all the data.
The talk Scalable Web Architectures: Common Patterns & Approaches by Cal Henderson (Yahoo) at the Web 2.0 Expo was quite interesting. I thought there was a video, but I could not find it. But here are the slides:
http://www.slideshare.net/techdude/scalable-web-architectures-common-patterns-and-approaches
A certain next step would be a cluster of web servers (a web farm) and a clustered system of database servers (replication, Oracle RAC, etc.).
If you're interested in caching and using .NET, look into the Application Caching Block in Enterprise Library (of course, use this along with the other points above).