We currently run all of our Pig jobs sequentially on Amazon EMR: we launch a cluster and then add our Pig jobs to it as steps, one by one.
While this works, I was wondering whether there is something that would allow us to run those Pig jobs in parallel.
Ideally I would like to do the following:
Launch a cluster (let's say c3.xlarge) and then throw 15 Pig jobs at it.
Those jobs would then run in parallel as best they can (e.g. 3 at the same time), and when one finishes, another one gets executed.
Any help would be appreciated on whether something like this exists and how we could use it. I have read a bit about Oozie, but I am not sure whether it would suit our needs.
EMR steps cannot be made to run in parallel. However, as you mentioned, you can use Oozie to orchestrate your Pig script execution, using fork and join actions to run the scripts in parallel.
Generally, it's possible if you manually reconfigure your EMR cluster to use the Fair Scheduler and submit tasks via the shell. Or you could go the Oozie route. But it doesn't work that way out of the box.
Oozie can help you run the Pig scripts in parallel. To schedule parallel execution of the Pig scripts, you can use the fork and join control nodes.
The only caveat is that a fork starts all of the forked scripts in parallel at once; it does not let you cap the parallelism at a fixed number. You need to manage that yourself in the Oozie workflow application definition. For the Pig action, check the documentation (a Pig-action variant is sketched after the example below).
In the example below, the two map-reduce jobs execute in parallel. You can use a combination of action types here, such as Pig, Hive, and Map-Reduce.
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <fork name="forking">
        <path start="firstparalleljob"/>
        <path start="secondparalleljob"/>
    </fork>
    <action name="firstparalleljob">
        <map-reduce>
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <job-xml>job1.xml</job-xml>
        </map-reduce>
        <ok to="joining"/>
        <error to="kill"/>
    </action>
    <action name="secondparalleljob">
        <map-reduce>
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <job-xml>job2.xml</job-xml>
        </map-reduce>
        <ok to="joining"/>
        <error to="kill"/>
    </action>
    <join name="joining" to="nextaction"/>
    ...
</workflow-app>
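Since the question is about Pig jobs, a Pig action inside the fork might look roughly like this. This is only a sketch: the action name, script name, and parameter are placeholders, and the fork's path element would point at this action instead of a map-reduce one.
<action name="firstparallelpigjob">
    <pig>
        <job-tracker>foo:8021</job-tracker>
        <name-node>bar:8020</name-node>
        <!-- placeholder script and parameter -->
        <script>script1.pig</script>
        <param>INPUT=/path/to/input</param>
    </pig>
    <ok to="joining"/>
    <error to="kill"/>
</action>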
Related
I want to run two jobs in parallel on the same slave.
Job 1 is a functional-testing job that doesn't require a browser, and Job 2 is a Selenium job which requires a browser for testing.
As for running the jobs on the same slave, you can use the option "Restrict where this project can be run", assuming you have the Jenkins slave configured in your setup.
For running the jobs in parallel: are you trying to do this via a Jenkinsfile or via freestyle jobs? For a Jenkinsfile, you can use the parallel stages feature as described here. For freestyle jobs, I would suggest adding one more job (for example a setup job) and using it to trigger your two jobs at the same time.
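For the Jenkinsfile route, a minimal declarative pipeline with parallel stages might look like this (a sketch; the agent label, stage names, and shell commands are placeholders):
pipeline {
    agent { label 'my-slave' }                           // placeholder label for the slave both jobs share
    stages {
        stage('Run tests in parallel') {
            parallel {
                stage('Functional tests') {
                    steps {
                        sh './run-functional-tests.sh'   // placeholder command, no browser required
                    }
                }
                stage('Selenium tests') {
                    steps {
                        sh './run-selenium-tests.sh'     // placeholder command, needs a browser
                    }
                }
            }
        }
    }
}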
I am creating an AWS EMR cluster using a CloudFormation template. I need to run the steps in parallel, so I am trying to change the YARN scheduler from FIFO to the fair or capacity scheduler.
I have added:
yarn.resourcemanager.scheduler.class : 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'
Do I need to add a fair-scheduler.xml file to the conf.empty folder? If so, can you please share the XML file?
And if I want to add fair-scheduler.xml through the CloudFormation template, do I need to use a bootstrap action for it? If so, could you please provide the bootstrap file?
It looks like, even after changing the scheduler, EMR still won't allow jobs to run concurrently.
You can configure your cluster by specifying the configuration in your CloudFormation scripts.
Here is an example configuration:
- Classification: fair-scheduler
  ConfigurationProperties:
    <key1>: <value1>
    <key2>: <value2>
- Classification: yarn-site
  ConfigurationProperties:
    yarn.acl.enable: true
    yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
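For context, this Configurations block sits under the cluster resource's Properties in the template. A sketch, with placeholder names and most required properties omitted:
Resources:
  EmrCluster:                        # placeholder logical name
    Type: AWS::EMR::Cluster
    Properties:
      Name: parallel-steps-cluster   # placeholder
      ReleaseLabel: emr-5.28.0       # placeholder release label
      # Instances, JobFlowRole, ServiceRole, etc. omitted for brevity
      Configurations:
        - Classification: yarn-site
          ConfigurationProperties:
            yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler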
Please refer to these:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-configuration.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
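If you do want to provide an allocation file yourself (for example via a bootstrap action that copies it into place), a minimal fair-scheduler.xml might look like this. This is a sketch; the queue names, weights, and limits are placeholders.
<?xml version="1.0"?>
<allocations>
    <!-- default queue, fair sharing between applications -->
    <queue name="default">
        <weight>1.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
    </queue>
    <!-- placeholder queue that caps how many apps run at once -->
    <queue name="steps">
        <weight>2.0</weight>
        <maxRunningApps>3</maxRunningApps>
        <schedulingPolicy>fair</schedulingPolicy>
    </queue>
</allocations>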
EMR has recently added the ability to run multiple steps in parallel:
https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/
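With that feature, a cluster's step concurrency can be raised from the AWS CLI. A sketch; the cluster ID and concurrency level are placeholders:
# allow up to 5 steps to run at the same time on an existing cluster
aws emr modify-cluster --cluster-id j-XXXXXXXXXXXXX --step-concurrency-level 5

# or set it when the cluster is created (EMR 5.28.0 or later)
aws emr create-cluster --release-label emr-5.28.0 --step-concurrency-level 5   # ...other options omitted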
I am exploring the Big Data plugin in Pentaho 5.2. I was trying to run the Pig Script Executor, and I am unable to understand the usage of Enable Blocking. The PDI documentation says that
If checked, the Pig Script Executor job entry will prevent downstream
entries from executing until the script has finished processing.
I am aware that running a Pig script converts the execution into map-reduce jobs. I am running the job as Start job -> Pig Script. If I disable the Enable Blocking option I am unable to execute the script; I get permission-denied errors.
What does "downstream" mean here? I do not pass any hops out of the Pig Script entry. I am unable to understand the Enable Blocking option. Any hints would be helpful and appreciated.
Enable Blocking checked: the task is deployed to the Hadoop cluster; PDI follows up on its progress and only proceeds with the rest of the job entries AFTER the Hadoop job finishes.
Enable Blocking unchecked: PDI deploys the task to the Hadoop cluster and forgets about it. The rest of the job entries proceed immediately after the cluster accepts the task; PDI does not wait for it to complete.
I'd like to know whether there is a way to reload the configuration file of an Oozie job without restarting the job (coordinator).
The coordinator runs many of our tasks, and sometimes we only need to change one line of the job configuration file and apply the update without disturbing the other tasks.
Thank you very much.
The properties of an Oozie coordinator can be updated using the command below once the coordinator is running. Update the property file on the Unix file system and then submit it as below:
oozie job -oozie http://namenodeinfo/oozie -config job.properties -update coordinator_job_id
Note that all the already-created coordinator actions (including the ones in WAITING status) will still use the old configuration; the new configuration is applied to coordinator actions that materialize after the update.
The latest Oozie (4.1) also allows updating the coordinator definition itself; see https://oozie.apache.org/docs/4.1.0/DG_CommandLineTool.html#Updating_coordinator_definition_and_properties
Not really (well, you could go into the database table and make the change, but that might require a shutdown of Oozie if you're using an embedded Derby DB, and besides it probably isn't advisable).
If you need to change the configuration often, consider pushing the value down into the launched workflow.xml file - you can change that file's contents between coordinator instantiations.
You could also (if this is a one-time change) kill the running coordinator, make the change, and start the coordinator up again, amending the start time so that previous instances won't be scheduled to run again.
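With the Oozie CLI, that workaround could look like this (a sketch; the Oozie URL, coordinator job id, and properties file are placeholders):
# kill the running coordinator
oozie job -oozie http://namenodeinfo/oozie -kill <coordinator_job_id>
# edit job.properties (e.g. amend the start time), then resubmit the coordinator
oozie job -oozie http://namenodeinfo/oozie -config job.properties -run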
Not really :-)
Here is what you can do.
Create another config file, in HDFS, with the properties that you want to be able to change.
Read this file at the beginning of your workflow.
Example:
<action name="devices-location">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>hadoop</exec>
<argument>fs</argument>
<argument>-cat</argument>
<argument>/path/to/config/file.properties</argument>
<capture-output/>
</shell>
<ok to="report"/>
<error to="kill"/>
</action>
<action name="report">
<java>
...
<main-class>com.twitter.scalding.Tool</main-class>
<arg>--device-graph</arg>
<arg>${wf:actionData('devices-location')['path']}</arg>
<file>${scalding_jar}</file>
</java>
<ok to="end"/>
<error to="kill"/>
</action>
Where the config file in HDFS at /path/to/config/file.properties looks like this:
path=/some/path/to/data
Because the shell action uses <capture-output/> and the file is in Java properties format, its contents become available as action data, which is how the report action picks up the path value via wf:actionData.
Does the Hive installation have any specific modes?
For example, a Hadoop installation has 3 modes: standalone, pseudo-distributed, and fully distributed.
Does Hive similarly have specific types of installation?
Can Hive be installed in distributed mode?
Hive actually provides you with the option to run queries in 2 modes:
1- Map-Reduce mode
2- Local mode
Normally, the Hive compiler generates map-reduce jobs for most queries under the hood. These jobs are then submitted to the map-reduce cluster indicated by the variable:
mapred.job.tracker
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also provides the ability to run map-reduce jobs locally on your standalone workstation. In order to run Hive queries in local mode, you need to do this:
hive> SET mapred.job.tracker=local;
Details can be found here.
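As an illustration, a Hive session forcing local execution might look like this. This is a sketch: the table name is a placeholder, and on newer Hive releases hive.exec.mode.local.auto can let Hive choose local mode automatically for small jobs.
-- force local execution for this session
SET mapred.job.tracker=local;
-- optional on newer releases: let Hive pick local mode automatically for small inputs
SET hive.exec.mode.local.auto=true;
-- my_table is a placeholder table
SELECT COUNT(*) FROM my_table;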