I am using Heroku to host a small app. It's running a screenscraper using Mechanize and Nokogiri for every search request, and this takes about 3 seconds to complete.
Does the screenscraper block anyone else who wants to access the app at that moment? In other words, is it in holding mode for only the current user or for everyone?
If you have only one heroku dyno, then yes it is the case that other users would have to wait in line.
Background jobs are for cases like yours where there is some heavy processing to be done. The process running rails doesn't do the hard work up-front, instead it triggers a background job to do it and responds quickly, freeing itself up to respond to other requests.
The data processed by the background job is viewed later - perhaps in a few requests time, or whenever the job is done, loaded in by javascript.
Definitely, because a dyno is single threaded if it's busy scrapping then other requests will be queued until the dyno is free or ultimately timed out if they hang for more than 30 seconds.
Anything that relies on an external service would be best run through a worker via DJ - even sending an email, so your controller puts the message into the queue and returns the user to a screen and then the mail is picked up by DJ and delivered so the user doesn't have to wait for the mailing process to complete.
Install the NewRelic gem to see what your queue length is doing
John.
If you're interested in a full service worker system we're looking for beta testers of our Heroku appstore app SimpleWorker.com. It's integrated tightly into Heroku and offers a number of advantages over DJ.
It's as simple as a few line of code to send your work off to our elastic worker cloud and you can take advantage of massive parallel processing because we do not limit the number of workers.
Shoot me a message if you're interested.
Chad
Related
we are facing a problem.
we have background requests that are downloading files constantly (up to 5MB each file). meanwhile, we have a UI that most navigations require REST calls.
we limited the number of background downloads so it won't suffocate the operationQueue that RESTkit uses.
when several files are downloaded in background, we see the network usage with 1->2 MB (which is understandable).
The problem is: the user navigates through the app, and each navigation calls a quick REST call that should return very little data. but because of the background downloads, the UI call is taking forever (~10 seconds).
Priority did not help, i saw that the UI call i make instantly is handled by the operation queue (because we limited the downloads limit and the NSOperationQueue had more space to fulfill other requests.
when we limited the concurrent REST download calls to 5 - the REST calls from the UI took 10 seconds.
when we limited the concurrent REST download calls to 2 - everything worked fine.
the issue here is that because we let only 2 downloads occur in the background - the whole background operation of downloading files will take forever.
the best scenario would be that every UI call would be considered as most important network-wise and even pause the background operations and let only the UI call to be handled - then resume the background operation - but i'm not sure it's possible.
any other idea to address this issue?
You could use 2 RKObjectManagers so that you have 2 separate queues, then use one for 'UI' and the other for 'background'. On top of that you can set the concurrent limits for each queue differently and you could suspend the background queue. Note that suspending the queue doesn't mean already running operations are paused, it just stops new operations from being started.
By doing this you can gain some control, but better options really are to limit the data flow, particularly when running on a mobile data network, and to inform the user what is happening so they can accept the situation or pause it till later.
The Heroku Dev Center on the page about using worker dynos and background jobs states that you need to use worker's + queues to handle API calls, such as fetching an RSS feed, as the operation may take some time if the server is slow and doing this on a web dyno would result in it being blocked from receiving additional requests.
However, from what I've read, it seems to me that one of the major points of Node.js is that it doesn't suffer from blocking under these conditions due to its asynchronous event-based runtime model.
I'm confused because wouldn't this imply that it would be ok to do API calls (asynchronously) in the web dynos? Perhaps the docs were written more for the Ruby/Python/etc use cases where a synchronous model was more prevalent?
NodeJS is an implementation of the reactor pattern. The default build of of NodeJS uses 5 reactors. Once these 5 reactors are being used for IO bound tasks, the main event loop will block.
A common misconception about NodeJS is that it is a system that allows you to do many things at once. This is not necessarily the case, it allows you to do other things while waiting on IO bound tasks, up to 5 at a time.
Any CPU bound tasks are always executed in the main event loop, meaning they will block.
This means if your "job" is IO bound, like putting things in databases then you can probably get away with not using dynos. This of course is dependent on how many things you plan on having go on at once. Remember, any task you put in your main app will take away resources from other incoming requests.
Generally it is not recommended for things like this, if you have a job that does some processing, it belongs in a queue that is executed in its own process or thread.
I have a service that does image processing on an image supplied by the client.
Each processing takes CPU (3min aprox runtime/image), so I will not allow more than 1 image to be processed at a time.
what I did is that when the service is called, the image is saved on the server and an entry is added into the database, with the status queued.
Now I would like to create a background task or something that takes every entry from the database that has a status Queued, processes that image,updates the entry status to Done, and than takes the new entry with the status Queued and so on.
There may be the case that no image is queued at some time.
How do you suggest me implementing this?
It sounds like what you want is a queued service.
http://msdn.microsoft.com/en-us/library/ms731089.aspx
It allows you to focus on your core algorithm and not worry much about the mechanics of queuing the messages (e.g. making custom DB tables for queues etc.). Queuing sounds easy, but to get it to work reliably is harder than it sounds - better to leave it to the experts at MS :o)
It also provides some good features like durability, poison message handling etc.
You could use Windows Server AppFabric to host a workflow-backed WCF service. Instead of service.svc, the extension is service.xamlx. AppFabric is designed to run long running processes like this and will scale to your needs.
Perhaps you can develop a windows service which polls the database every minute for any images to process.
I have a request that takes more than 30 seconds and it breaks.
What is the solution for this? I am not sure if I add more dynos this will work.
Thanks
You should probably see the Heroku devcenter article regarding this, as the information will be more helpful, here's a small summary:
To answer the timeout question:
Cedar supports long-polling and streaming responses. Your app has an initial 30 second window to respond with a single byte back to the client. After each byte sent (either recieved from the client or sent by your application) you reset a rolling 55 second window. If no data is sent during the 55 second window your connection will be terminated.
(That is, if you had Cedar instead of Aspen or Bamboo you could send a byte every thirty seconds or so just to trick the system. It might work.)
To answer your dynos question:
Additional concurrency is of no help whatsoever if you are encountering request timeouts. You can crank your dynos to the maximum and you'll still get a request timeout, since it is a single request that is failing to serve in the correct amount of time. Extra dynos increase your concurrency, not the speed of your requests.
(That is, don't bother adding more dynos.)
On request timeouts:
Check your code for infinite loops, if you're doing something big:
If so, you should move this heavy lifting into a background job which can run asynchronously from your web request. See Queueing for details.
We have a memory intensive processing for certain functionality and we would like to limit the number of parallel requests to this processing. We are able to configure by using "Work Managers" in WebLogic and putting a limit on the number of threads for that servlet.
For example, if we put maximim thread limit as 3, then if there are 10 parallel requests; 7 requests are in queue. There could be situations where these the requests waiting in queue could take up to 30-40 minutes to be processed. We did simple testing and the received page cannot be displayed due to timeout after 15 mins and received the message after 1 hour.
Does any one know if there is a setting in WebLogic to increase/decrease timeout and avoid page cannot be displayed?
Appreciate if any one has any thoughts around this.
Does any one know if there is a setting in WebLogic to increase/decrease timeout and avoid page cannot be displayed?
There might be something but I actually didn't check as it would be a bad advice anyway. By looking for this, you are trying to solve the wrong problem here. A browser is just not made for long-running process like the one you are describing (>30mn) even if you don't mind the user waiting (not mentioning that he could refresh the page and queue more and more jobs).
So, the right answer here is in my opinion: use asynchronism, this is the perfect use case. When the user clicks on the button, send a JMS message to a queue (or create a Quartz job) and send the user a page with a request ID telling him to come back later. When the processing is done, update the status somewhere and make the status/result available to the user. Really, the user experience will be better doing this and you'll face less problems than with a browser.
1) Use some other tool (not browser) like WGET where you can control timeout parameter (--timeout).
2) Why do you use HTTP? Use message driven beans and send message JMS to that and don't care about time outs.
Perhaps quartz can do what you need? Start a job and check in on it as you need to?