How to speed up Amazon EMR bootstrap? - amazon-emr

I'm using Amazon EMR for some intensive computation, but it takes around 7 minutes to start computing. Is there some clever way to have my computation start immediately? The computation is a Python streaming job started from a user-facing website, so I can't really afford a long startup.
I might have simply missed an option in the ocean that is AWS. I just want simplicity in launching jobs (that's why I used EMR), scalability, and to pay only for what I use (and startup time is not useful time).

I know this is an old question, but I have some insights to add for the next searcher who finds this thread hoping to speed up bootstrap times on Amazon EMR.
For a while I have wondered why my clusters took so long to start, usually about 15 minutes. That is a pretty big chunk of time for a job that usually completes in under 1 hour. Sometimes it pushes the job past 1 hour, though thankfully AWS does not appear to charge for the full bootstrap time.
Over the last couple of days I noticed my startup times had improved. You see, the spot market became very volatile during April and the first week of May. Normally I build my cluster entirely from spot instances, as failure is an option and the cost savings justify the technique in my case. However, after waiting 14 hours for clusters to start, I had to switch to OnDemand; I only have so much patience, and overnight usually exceeds it. The OnDemand clusters start in about 5 minutes. Now that I have switched back to spot, as the madness seems to have abated, I am back to 15 minutes for a cluster.
So if you are using spot instances for your core or master nodes, expect a longer startup time. I will be experimenting with a small set of OnDemand instances in the core group, augmented with a large number of spot task instances, to see if that helps startup and deals better with spot market volatility.
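For anyone wanting to try the same mix, here is a rough boto3 sketch of launching a cluster with OnDemand master/core groups and a spot task group. The instance types, counts, bid price, and release label are placeholders, not a recommendation:

```python
import boto3

emr = boto3.client("emr")  # region/credentials come from your environment

# Hypothetical sizes and prices; the point is the Market field per group.
response = emr.run_job_flow(
    Name="mixed-market-cluster",
    ReleaseLabel="emr-5.30.0",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": False,
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # The cheap, interruptible bulk of the capacity rides the spot market.
            {"Name": "task-spot", "InstanceRole": "TASK",
             "Market": "SPOT", "BidPrice": "0.10",
             "InstanceType": "m5.xlarge", "InstanceCount": 10},
        ],
    },
)
print(response["JobFlowId"])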

This is pretty normal and there is little you can do about it. I'm starting 100+ node clusters and I've seen them take 15+ minutes before they start processing. Given the amount of work that's going on in the background, I'm pretty happy to allow them the 15 minutes or so to get the cluster configured and read in whatever data may be required. Nature of the beast, I'm afraid.

Where's your data source hosted?
If it's on S3 (probably), and you have many tiny files, it's the latency of each connection (one per file) that is taking the time.
If that's the only reason, then your 7 minutes of startup time translates to roughly 5 minutes of reading from S3, which suggests on the order of 1 GB of input spread across many small files on S3.
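If many tiny objects are indeed the bottleneck, one option is to concatenate them into a few larger objects before the cluster reads them. A minimal boto3 sketch, assuming line-oriented text files and hypothetical bucket/prefix names (for large volumes you would want a tool like S3DistCp instead):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-input-bucket"          # hypothetical bucket
PREFIX = "raw/"                     # prefix holding the many tiny files
OUTPUT_KEY = "combined/part-00000"  # single larger object to write

# Read every small object under the prefix and concatenate the bodies.
chunks = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        chunks.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())

# One larger object means one connection instead of one per tiny file
# when the cluster starts reading its input.
s3.put_object(Bucket=BUCKET, Key=OUTPUT_KEY, Body=b"\n".join(chunks))
```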

Related

Using a cloud service to transform a picture with a neural algorithm?

Yesterday I tried to transform a picture into an artistic style using CNNs, based on A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, using a recent Torch implementation, as explained here:
https://github.com/mbartoli/neural-animation
It started the conversion correctly; the problem is that the process is very time consuming. After 1 hour of processing, a single picture was not yet fully transformed, and I have to transform 1615 pictures. What's the solution here? Can I use the Google Cloud Platform to make this operation faster? Or some other kind of cloud service? Using my home PC is not the right solution. If I can use the cloud's power, how do I configure everything? Let me know, thanks.
Using Google Cloud Platform (GCP) here would seem to be a good use case. If we boil it down to what you have ... you have a CPU-intensive application that takes a long time to run. Depending on the nature of the application, any single run may go faster with more CPUs and/or more RAM. GCP allows YOU to choose the size of the machine on which your application runs. You can choose from VERY small to VERY large; the distinction is how much you are willing to pay. Remember, you only pay for what you use. If an application takes an hour to run on a machine with price X but takes 30 minutes on a different machine with price 2X, then the cost will still only be X, but you will have a result in 30 minutes rather than an hour. You would switch off the machine after the 30 minutes to stop the charges.
Since you also said that you have MANY images to process, this is where you can take advantage of horizontal scale. Instead of having just one machine where each image takes an hour and all the results are serialized, you can create an array of machines where each machine processes one picture. So if you had 50 machines, at the end of one hour you would have 50 images processed instead of one.
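As an illustration of that horizontal split, here is a minimal Python sketch of how each machine could pick its share of the 1615 pictures. WORKER_INDEX/WORKER_COUNT and the frames directory are assumptions about how you would wire up the VMs, and stylize() stands in for the actual Torch style-transfer call:

```python
import os

def stylize(path: str) -> None:
    # Placeholder for the real style-transfer invocation
    # (e.g. the scripts from the linked neural-animation repository).
    print(f"would stylize {path}")

# Hypothetical wiring: each VM is started with its index and the group size,
# for example via instance metadata or a startup script.
index = int(os.environ.get("WORKER_INDEX", "0"))
count = int(os.environ.get("WORKER_COUNT", "1"))

images = sorted(os.listdir("frames"))   # the 1615 input pictures

# Every machine takes every count-th picture, so 50 VMs split the work ~evenly
# and finish in roughly 1/50th of the serial time.
for image in images[index::count]:
    stylize(os.path.join("frames", image))
```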
As for how to get this all going ... I'm afraid that is a much bigger story and one where a read of the GCP documentation will help tremendously. I suggest you read and have a play and THEN if you have specific questions, the community can try and provide specific answers.

AWS S3 ObjectCreated triggers lambda with delay (Lambda Cold-start)

I've configured a simple trigger for Lambda, which processes an image upon its arrival in S3.
In general, the Lambda is triggered with minimal delay, often within the same second that S3 receives the image.
But occasionally, in around 7% of cases, there is a delay between the image being received and the ObjectCreated event; this delay can be up to 19 seconds!! (9-10 seconds on average).
Any idea how to avoid this delay?
This delay makes it impossible for me to use S3->Lambda triggers for high-performance real-time apps.
After a while of trying to investigate and googling, and in parallel asking AWS Support about the case, I finally got the answer from AWS:
--
.. Lambda invoked the function pretty much immediately after we received
the event, but the specific request id you shared was for an invoke
that had to coldstart, which added nearly 10 seconds of extra latency.
The function is in the VPC, where cold starts tend to take a few
seconds longer. Coldstarts cannot be eliminated but for high volume
functions the incidence of cold start should be lower once you scale
up and more containers are available for reuse.
As you can see from the answer, if you are trying to build a high-performance / high-traffic real-time app, S3->Lambda will not fit your requirements.
My next question would be: if I trigger the Lambda directly from the script that uploads the image, will it help?
Or should I avoid using Lambda at all for this kind of application and leave it only for background data processing?
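To make the first idea concrete, here is a rough boto3 sketch of invoking the function asynchronously right after the upload, instead of waiting for the S3 notification. The bucket, key, and function names are made up; note this removes the notification delivery delay, but a cold start can still happen on the invoke itself:

```python
import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

BUCKET = "uploads-bucket"       # hypothetical bucket
KEY = "images/photo-001.jpg"    # hypothetical object key
FUNCTION = "process-image"      # hypothetical Lambda function name

# Upload the image, then fire the function ourselves instead of relying on
# the S3 ObjectCreated notification to reach Lambda.
s3.upload_file("photo-001.jpg", BUCKET, KEY)
lam.invoke(
    FunctionName=FUNCTION,
    InvocationType="Event",     # asynchronous; returns once the request is queued
    Payload=json.dumps({"bucket": BUCKET, "key": KEY}).encode(),
)
```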
Hope this answer will help someone else..
Since the 28th of November 2019, cold starts for Lambdas inside a VPC no longer cause that much delay, thanks to AWS Lambda service improvements. You can find out more about it here:
https://aws.amazon.com/blogs/compute/announcing-improved-vpc-networking-for-aws-lambda-functions/
There are a lot of other ways to reduce cold starts in Lambda, but it mostly depends on the use case. The most common are to reduce the code size of the Lambda or to use a different runtime, Node or Python for example.
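Another option worth mentioning, assuming you are on a recent enough Lambda platform, is provisioned concurrency, which keeps a number of initialized execution environments warm so that those invokes skip the cold start. A minimal boto3 sketch with hypothetical function and alias names:

```python
import boto3

lam = boto3.client("lambda")

# Keep 5 warm, pre-initialized environments for the "live" alias of the function.
# Invocations routed through that alias, up to that concurrency, avoid cold starts.
lam.put_provisioned_concurrency_config(
    FunctionName="process-image",          # hypothetical function name
    Qualifier="live",                      # an alias or published version
    ProvisionedConcurrentExecutions=5,
)
```

Provisioned concurrency carries a standing cost, so it mainly makes sense for the latency-sensitive paths of an application.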

Rapidly changing large data processing advice

My team has the following dilemma and we need some architecture/resources advice:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large data set that we process during the day
this "process" gets executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during these updates ALL other rows may change, as we aggregate the rows for the UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use a SQL DB to store all the data and we retrieve/update it as the process requires
Unsolved problems and desired goals:
since these processes are user triggered, we never know when to scale up/down, which causes high spikes; Azure doesn't make it easy to autoscale based on demand without a data warehouse, which we want to stay away from because of its lack of aggregates and various other "buggy" issues
because of constant I/O to the DB, we hit 100% of DTU as soon as 1 process begins (we are using an Azure P1 DB), which of course forces us to grow even larger if multiple processes start at the same time (which is very likely)
we understand that high-compute tasks come at a cost, yet we think there is a better way to go about this (the SQL is about 99% optimized, so there is not much left to do there)
We are looking for some tool that can:
process a large volume of transactions QUICKLY
can handle constant updates to this large amount of data
supports all major aggregations
is "reasonably" priced (I know this is an arguable keyword, just take it lightly...)
Considered:
Apache Spark
we don't have a ton of experience with HDP, so any pros/cons here will certainly be useful (does the use case fit the tool? a rough sketch of the kind of aggregation we mean is shown after this list)
ArangoDB
seems promising; it looks fast and has all the aggregations we need.
Azure Data Warehouse
we ran into too many issues; it just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
It's hard to try them all and compare which one fits best, as we have a fully functional system and will need to make adjustments for whichever way we go.
Hence, any opinions are welcome before we pull the trigger.
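For reference, here is a minimal PySpark sketch of the kind of full-data aggregation described above, assuming the rows land as JSON on blob storage with hypothetical account_id/amount columns (an illustration of the Spark option, not our actual pipeline):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aggregation-sketch").getOrCreate()

# Hypothetical layout: ~1 million semi-structured rows stored as JSON.
rows = spark.read.json("wasbs://data@myaccount.blob.core.windows.net/process-input/")

# Recompute the UI-facing aggregates over the whole data set after each batch
# of updates, instead of issuing row-by-row updates against the SQL database.
aggregates = (
    rows.groupBy("account_id")
        .agg(
            F.count("*").alias("row_count"),
            F.sum("amount").alias("total_amount"),
            F.avg("amount").alias("avg_amount"),
        )
)

aggregates.write.mode("overwrite").parquet(
    "wasbs://data@myaccount.blob.core.windows.net/process-output/"
)
```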

Do BigQuery queries run faster at night?

I just received this from Google Support, and it surprised me as I didn't know there was a congestion issue - do other people have this experience?
To fasten up your query, I would recommend that you try to run your query in other time like midnight
BigQuery is nocturnal, so it runs better in the dark. There are fewer predators around, so BigQuery can be free to express itself and cavort across the prairies near the Google Datacenters.
Other techniques to "enfasten" the queries involve running them from the ley lines of power, which are described in the Alchemical diaries of Hermes Trismegistus. Unfortunately, I am not permitted to share their location, and may be putting myself at risk of excommunication from a number of secret societies by just mentioning their existence.
Finally, if you name your tables with the suffix __Turbo, BigQuery will run them in turbo mode, which means they run on 486/66 processors instead of the default Z80 datapath.
Edited to add:
In a non-snarky answer, if you do not have reserved BigQuery capacity (i.e. fixed-price reservations), you may experience lower throughput at certain times. BigQuery has a shared pool of resources, so if lots of other customers are using it at the same time, there may not be enough resources to give everyone the resources that their queries would need to run at full speed.
That said, BigQuery uses a very large pool of resources, and we (currently) run at a utilization rate where every user gets nearly all of the resources they need nearly all of the time.
If you are seeing your queries slow down by 20% at certain times of the day, this might not be surprising. If you see queries take 2 or 3 times as long as they usually do, there is probably something else going on.

BigQuery interactive query response times degradation since 2/19/16

For the Google BigQuery infrastructure folks: we've been running a set of short-running interactive queries for many months now, averaging about 5 seconds to complete. Starting Friday 2/19, these response times have been rising steadily (the SQL has not changed, and we're dealing with a steady stream of data that we query using a sliding window).
Is this a global BigQuery issue you are aware of?
edit: more granular response times:
There is good news and bad news; the good news is that the query took only 0.5 seconds to execute. The bad news is that it took 191 seconds to find the files where the data was stored.
We have a couple of performance regressions that cause high tail latency for resolving paths. Tables (like yours) where the data is stored in many paths will see worse performance.
This performance issue is exacerbated by the fact that you're using time-range decorators, which means that our efforts to optimize the file layout don't work as well.
We are starting the roll-out of a fix to the underlying performance problem this afternoon; it will likely take at least a week for it to take effect everywhere. I'll update this answer once it is complete (if I forget, please remind me)
In the meantime, you may get faster results by removing the time-range decorators from your queries. You are already filtering by time, so the queries should still be correct. Of course, this may mean that the queries cost a bit more to run.
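As a rough sketch of that workaround, here is how the query might look with the decorator dropped and the time window expressed as an ordinary WHERE clause, using the google-cloud-bigquery Python client. The project, dataset, table, and column names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Instead of a time-range table decorator, filter on the timestamp column
# directly; the rows returned should be the same, though more data may be
# scanned (and therefore billed).
query = """
    SELECT user_id, event_type, event_timestamp
    FROM `my_project.my_dataset.events`
    WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
for row in client.query(query).result():
    print(row.user_id, row.event_type, row.event_timestamp)
```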