Pig script minimum execution time - apache-pig

I'm currently learning Pig and I'm executing my scripts inside the Hortonworks Sandbox. What has been bugging me from the very beginning is that the minimum execution time for a Pig script seems to be at least 30-40 seconds. Is that because I'm using the Hortonworks Sandbox, or is that normal for Pig scripts? Is there a way to reduce the execution time, because this is really slowing down my learning progress? If this execution time is normal, can you explain what is going on and why?
PS
I've allocated 2 GB of RAM to the Hortonworks virtual machine. And just to mention, I'm currently running only simple scripts on small data sets.

If you execute Pig in local mode (pig -x local) it will run a lot faster, but it won't do map-reduce and won't access HDFS - it's good for learning though!
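For example, assuming your script is saved as wordcount.pig (the file name here is just a placeholder), a local-mode run would look like:

    pig -x local wordcount.pig

The same script should then also run in MapReduce mode on the sandbox, input paths aside.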

Yes, 30-40 seconds is absolutely normal for Pig, because it has a big overhead for compiling the job, launching JVMs, etc.
As stated in the other answer, you can try running in local mode. A simple job with an input of just a few lines of data usually takes me about 15 seconds. My Cloudera VM is allocated 4 GB of RAM, by the way.

Related

Usages of cores in Spark SQL Execution

I am new to Spark SQL queries and trying to understand how it works under the hood.
I have come across the term "Core" in the Spark vocabulary but I'm still struggling to get a handle on it.
I know that - 1 core = 1 task.
My questions -
Can anyone please explain what exactly a core means?
Does the Spark UI show the number of cores currently allocated for my job? If yes, where can I see it?
If I find in the Spark UI that the number of running tasks is low, is there a way to increase the number of cores allocated to my job, so that Spark can submit more tasks and make my job run faster?
Please advise.
Yes, you are right in a way.
In Spark, tasks are distributed across executors; on each executor, the number of tasks running concurrently is equal to the number of cores on that executor. So a core is basically what executes your task, and a task is the most granular unit of work that needs to be carried out.
JOB=>STAGE=>TASK
Yes, the Spark UI shows you the number of tasks currently running on each of your executors. You can check them under the Executors tab. This tab gives a very detailed view of your task allocation against the number of cores available, along with a lot of other details.
Yes, you can increase the number of cores. You can do that by passing an argument to the spark-submit command:
--executor-cores n
Here n is the number of cores you want per executor. For optimum usage, it should be 5.
More cores do not necessarily mean your job will run faster.
Your tasks need to be distributed evenly across all the available cores for the job to run faster.
If you provide more cores than required, they will remain idle most of the time.
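As a rough sketch (not from the original answer; the application name and executor count below are placeholders), the same setting can also be supplied programmatically in PySpark instead of on the spark-submit command line:

    from pyspark.sql import SparkSession

    # Hypothetical configuration; 5 cores per executor follows the rule of thumb above.
    spark = (SparkSession.builder
             .appName("core-allocation-example")       # placeholder app name
             .config("spark.executor.cores", "5")      # cores per executor
             .config("spark.executor.instances", "4")  # number of executors (assumption)
             .getOrCreate())

On YARN, the command-line equivalent would be something like spark-submit --executor-cores 5 --num-executors 4 your_job.py.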

AWS Glue - Spark Job - how to increase Memory Limit or run more efficiently?

While running a Spark (Glue) job, during the write of a DataFrame to S3, I get the error:
Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead or
disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Is there an easy cure for this?
How can writing the DataFrame to S3 be optimized (to use less memory)?
How can memory be increased for the containers so that we have more room to work with?
As you may already know, AWS Glue jobs don't support increasing memory directly. But you can select G.1X as the worker type for the Glue job; AWS recommends using this for memory-intensive workloads.
https://docs.aws.amazon.com/en_us/glue/latest/dg/add-job.html
Apart from that, I don't see any configuration option to increase the memory.
Did you check the memory profile in the job's runtime metrics?
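For illustration only - the job name, IAM role, and script location below are placeholders - selecting the G.1X worker type could look roughly like this when defining the job with boto3:

    import boto3

    glue = boto3.client("glue")

    # Sketch of a job definition using the memory-optimized G.1X workers.
    glue.create_job(
        Name="my-etl-job",                 # placeholder job name
        Role="MyGlueServiceRole",          # placeholder IAM role
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://my-bucket/scripts/job.py"},  # placeholder script
        GlueVersion="2.0",                 # assumption
        WorkerType="G.1X",                 # worker type recommended for memory-intensive loads
        NumberOfWorkers=10,                # number of workers (assumption)
    )

The worker type can usually also be changed on an existing job from the Glue console.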

Apache server cannot allocate memory for new process

I have an Apache server with 32 GB of RAM. When I start the server and execute top to see the resources, it shows me that the CPU is at 95 percent. That is not normal behaviour, and after a few minutes it raises:
apache cannot allocate memory fork unable to fork new process
I don't know how to solve the problem. Any tips?
I had the same problem; to fix it there are two options:
1. Move from micro instances to small - this was the change that solved the problem (micro instances on Amazon tend to have large CPU steal time).
2. Tune the MySQL database server configuration and the Apache configuration to use a lot less memory.
A tuning guide for a low-memory situation such as this one: http://www.narga.net/optimizing-apachephpmysql-low-memory-server/ (but don't follow the suggestion to use MyISAM tables - horrible...).
These two options will make the problem occur much less often. I am still looking for a better solution to close the processes that are done and kill the ones that hang around.

Can I use MRJob to process big files in local mode?

I have a relatively big file - around 10 GB - to process. I suspect it won't fit into my laptop's RAM if MRJob decides to sort it in RAM or something similar.
At the same time, I don't want to set up Hadoop or EMR - the job is not urgent, and I can simply start the worker before going to sleep and get the results the next morning. In other words, I'm quite happy with local mode. I know the performance won't be perfect, but it's ok for now.
So can it process such 'big' files on a single weak machine? If yes, what would you recommend doing (besides setting a custom tmp dir to point to the filesystem, not to the ramdisk, which would be exhausted quickly)? Let's assume we use version 0.4.1.
I think the RAM size won't be an issue with the Python runner of mrjob. The output of each step should be written out to a temporary file on disk, so it should not fill up the RAM, I believe. Dumping output to disk is the way it is meant to work with Hadoop (and the reason why it is slow, due to IO). So I would just run the job and see how it goes.
If the RAM size is an issue, you can create enough swap space on your laptop to make it at least run, though it will be slow if the partition isn't on an SSD.
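As a minimal sketch (the job class, file names, and counts here are hypothetical, and the exact name of the temp-dir option varies between mrjob versions), a word-count style job run with the local runner might look like this:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        # Mapper: emit (word, 1) for every word in the input line.
        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        # Reducer: sum the counts for each word.
        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

You would run it with something like python wordcount.py -r local big_input.txt > output.txt, pointing mrjob's temp directory at a disk-backed filesystem as you mentioned, so that intermediate step output spills to disk rather than RAM.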

Slow shrinksafe operation during dojo build process

I use the Dojo build process on my application during the build stage.
But it is very slow: it takes several minutes to optimize one big .js file.
I am calling it from an Ant build script via the Groovy AntBuilder.
Here is the call:
ant.java(classname: "org.mozilla.javascript.tools.shell.Main", fork: "true",
        failonerror: "true",
        dir: "${properties.'app.dir'}/WebRoot/release-1.5/util/buildscripts",
        maxmemory: "256m") {
    ant.jvmarg(value: "-Dfile.encoding=UTF8")
    ant.classpath() {
        ant.pathelement(location: "${properties.'app.dir'}/WebRoot/release-1.5/util/shrinksafe/js.jar")
        ant.pathelement(location: "${properties.'app.dir'}/WebRoot/release-1.5/util/shrinksafe/shrinksafe.jar")
    }
    ant.arg(file: "${properties.'app.dir'}/WebRoot/release-1.5/util/buildscripts/build.js")
    ant.arg(line: "profileFile=${properties.'app.dir'}/dev-tools/build-scripts/standard.profile.js releaseDir='../../../' releaseName=dojo15 version=0.1.0 action=clean,release")
}
This is taking about 15 minutes to optimize and combine all the Dojo files and our own files.
Is there a way to speed it up, maybe run it in parallel somehow?
The script is running on a big 8-CPU Solaris box, so hardware is not the problem here.
Any suggestions?
We've had similar problems. I'm not sure exactly what it is about running in Ant that makes it so much slower. You might try increasing the memory; we couldn't even get Shrinksafe to process large layers without increasing our heap beyond the 2 GB limit (which needed a 64-bit JVM). You might also try using Closure Compiler with the Dojo build tool.