I'd like to know if there is a way to reload the configuration file of an Oozie job without restarting the job (coordinator).
The coordinator runs many of our tasks, and sometimes we only need to change one line of the job configuration file and apply that update without disturbing the other tasks.
Thank you very much.
The properties of an Oozie coordinator can be updated with the command below once the coordinator is running. Update the property file on the Unix file system and then submit it as follows.
oozie job -oozie http://namenodeinfo/oozie -config job.properties -update coordinator_job_id
Note that all coordinator actions that have already been created (including the ones in WAITING status) will still use the old configuration. The new configuration is applied to coordinator actions that materialize after the update.
The latest Oozie (4.1) allows updating the coordinator definition. See https://oozie.apache.org/docs/4.1.0/DG_CommandLineTool.html#Updating_coordinator_definition_and_properties
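That page also shows that -dryrun can be combined with -update to preview the resulting changes without applying them (check that your Oozie version supports it):
oozie job -oozie http://namenodeinfo/oozie -config job.properties -update coordinator_job_id -dryrun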
Not really (well, you could go into the database table and make the change, but that might require a shutdown of Oozie if you're using an embedded Derby DB, and besides it probably isn't advisable).
If you need to change the configuration often, consider pushing the value down into the launched workflow.xml file - you can change that file's contents between coordinator instantiations.
You could also (if this is a one-time change) kill the running coordinator, make the change, and start the coordinator again, amending the start time so that previous instances won't be scheduled to run again.
Not really :-)
Here is what you can do.
Create another config file in HDFS with the properties that you want to be able to change.
Read this file at the beginning of your workflow.
Example:
<action name="devices-location">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>hadoop</exec>
        <argument>fs</argument>
        <argument>-cat</argument>
        <argument>/path/to/config/file.properties</argument>
        <capture-output/>
    </shell>
    <ok to="report"/>
    <error to="kill"/>
</action>
<action name="report">
    <java>
        ...
        <main-class>com.twitter.scalding.Tool</main-class>
        <arg>--device-graph</arg>
        <arg>${wf:actionData('devices-location')['path']}</arg>
        <file>${scalding_jar}</file>
    </java>
    <ok to="end"/>
    <error to="kill"/>
</action>
Where the config file in HDFS at /path/to/config/file.properties looks like this:
path=/some/path/to/data
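The <capture-output/> element expects the shell action's stdout in Java properties format (key=value lines), so the same file can carry several tunables, each read with its own wf:actionData lookup. For example (the report_date key is just a made-up illustration):
path=/some/path/to/data
report_date=2016-09-30
and the second value would be referenced as ${wf:actionData('devices-location')['report_date']}.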
When we run multiple Flink jobs in one YARN session, we find that the logs of all jobs are written into the same file, "taskmanager.log", which makes it difficult for us to check the logs of a specific job. Is there any way to separate them?
Besides this, if our Flink jobs run for a long period, how can we separate the log files by date?
As far as I know, there isn't any way to separate the logs per job, other than running a separate cluster per job. Moreover, a lot of what is being logged isn't really job-specific.
To set up log rotation, you could put something like this in the log4j.properties file in the flink/conf directory:
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=${log.file}
log4j.appender.file.MaxFileSize=1000MB
log4j.appender.file.MaxBackupIndex=0
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
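That appender rolls by size. If you want the log files separated by date instead, log4j's DailyRollingFileAppender rolls on a date pattern; a minimal sketch reusing the same appender name (check that it matches the appender your Flink version's log4j.properties references):
log4j.appender.file=org.apache.log4j.DailyRollingFileAppender
log4j.appender.file.File=${log.file}
log4j.appender.file.DatePattern='.'yyyy-MM-dd
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n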
We currently run all our Pig jobs sequentially on Amazon EMR: we launch a cluster and then add all our Pig jobs to it as steps, one by one.
While this works, I was wondering if there is something that would allow us to run those Pig jobs in parallel.
Ideally I would like to do the following:
Launch a cluster (let's say c3.xlarge) and then throw 15 Pig jobs at it.
Those jobs would then run in parallel as best they can (e.g. 3 at a time), and when one finishes, another one starts.
Any help would be appreciated if something like this exists and how we could use it. I read something about Oozie, but I am not sure if it would suit our needs.
EMR steps cannot be made to run in parallel. However, as you mentioned, you can use Oozie to orchestrate your Pig script execution, using fork and join actions to run the scripts in parallel.
Generally, it's possible if you manually reconfigure your EMR cluster to use the Fair Scheduler and submit the tasks via a shell, or you could go the Oozie way. But it doesn't work out of the box.
Oozie can help you run the Pig scripts in parallel. For scheduling parallel execution of the Pig scripts you can use the Fork and Join control nodes.
The only thing is, the fork starts all of the forked scripts in parallel at once; it does not give you a way to cap the parallelism at a fixed number. You need to manage that yourself in the Oozie workflow application definition. For the Pig action, check the documentation.
In the example below, the two map-reduce jobs execute in parallel. You can use a combination of action types here, such as Pig, Hive, and Map-Reduce.
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <fork name="forking">
        <path start="firstparalleljob"/>
        <path start="secondparalleljob"/>
    </fork>
    <action name="firstparalleljob">
        <map-reduce>
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <job-xml>job1.xml</job-xml>
        </map-reduce>
        <ok to="joining"/>
        <error to="kill"/>
    </action>
    <action name="secondparalleljob">
        <map-reduce>
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <job-xml>job2.xml</job-xml>
        </map-reduce>
        <ok to="joining"/>
        <error to="kill"/>
    </action>
    <join name="joining" to="nextaction"/>
    ...
</workflow-app>
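Since the question is specifically about Pig scripts, the forked actions can just as well be Pig actions. A minimal sketch using the same fork/join pattern (the action names, script paths, and ${jobTracker}/${nameNode} properties are placeholders you would define yourself):
<fork name="pig-fork">
    <path start="pig-job-1"/>
    <path start="pig-job-2"/>
</fork>
<action name="pig-job-1">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>scripts/job1.pig</script>
    </pig>
    <ok to="pig-join"/>
    <error to="kill"/>
</action>
<action name="pig-job-2">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>scripts/job2.pig</script>
    </pig>
    <ok to="pig-join"/>
    <error to="kill"/>
</action>
<join name="pig-join" to="end"/>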
I've got a problem running a Sqoop job on YARN in Oozie using Hue. I want to download a table from an Oracle database and upload it to HDFS. I've got a multi-node cluster consisting of 4 nodes.
I want to run simple Sqoop statement:
import --options-file /tmp/oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
The options file is located on the local file system of node 1; the other nodes have no options file in their /tmp/ directory. I created an Oozie workflow with the Sqoop job and tried to run it, but I got this error:
3432 [main] ERROR org.apache.sqoop.Sqoop - Error while expanding arguments
java.lang.Exception: Unable to read options file: /tmp/oracle_dos.txt
The weirdest thing is that the job sometimes succeeds and sometimes fails. The log file told me why - Oozie runs Sqoop jobs on YARN.
The Resource Manager (a component of YARN) decides which node will execute the Sqoop job. When the Resource Manager picks node 1 (which has the options file on its local file system), everything is fine. But when the RM picks one of the other 3 nodes to execute the Sqoop job, it fails.
This is a big problem for me, because I don't want to upload the options file to every node (what if I have 1000 nodes?). So my question is - is there any way to tell the Resource Manager which node it should use?
You can make a custom file available to your Oozie action on whichever node it runs on by using the <file> tag in your Sqoop action. Look at this syntax:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <command>[SQOOP-COMMAND]</command>
            <arg>[SQOOP-ARGUMENT]</arg>
            ...
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </sqoop>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
Also read this:
The file and archive elements make files and archives available to map-reduce jobs. If the specified path is relative, the file or archive is assumed to be within the application directory, in the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path.
Files specified with the file element will be symbolic links in the home directory of the task.
...
So in the simplest case, you put your oracle_dos.txt file in your workflow directory, add a <file>oracle_dos.txt</file> element to workflow.xml, and change your command to something like this:
import --options-file ./oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
In this case, even though your Sqoop action runs on some arbitrarily picked node in the cluster, Oozie will copy oracle_dos.txt to that node, and you can refer to it as a local file.
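Putting the pieces together, the action could look roughly like this (a sketch: the action name, the ${jobTracker}/${nameNode} properties, and the ok/error transitions are assumptions, and it assumes oracle_dos.txt sits next to workflow.xml in the workflow's HDFS directory):
<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>import --options-file ./oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1</command>
        <file>oracle_dos.txt</file>
    </sqoop>
    <ok to="end"/>
    <error to="kill"/>
</action>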
Perhaps this is about file permissions. Try putting this file in /home/{user}.
I'm trying to get Jenkins figured out. I have a suite of Selenium tests that I can build and run via Eclipse, or build and run via the command line with Ant, but whenever I try Jenkins, they fail.
The console output from Jenkins reports that the directory given by ws.jars, defined in my build.xml file, doesn't exist; however, that directory does exist! Again, there is no problem building from the command line.
Any suggestions would be greatly appreciated, as I've been trying to get this solved for a couple of days now.
Thanks.
My build.xml file:
<property environment="env"/>
<property name="ws.home" value="${basedir}"/>
<property name="ws.jars" value="/Users/username/Documents/All JAR Files/All in one place"/>
<property name="test.dest" value="${ws.home}/build"/>
<property name="test.src" value="${ws.home}/src"/>
<property name="ng.result" value="test-output"/>
I created a new target in my build.xml file called path. Here is the output when I run it with Jenkins.
Started by user anonymous
Building in workspace /Users/<user>/local_repo/qa-automation/selenium-java/my-projects
[my-projects] $ /Users/Shared/Jenkins/Home/tools/hudson.tasks.Ant_AntInstallation/Install_automatically/bin/ant path
Buildfile: /Users/<user>/local_repo/qa-automation/selenium-java/my-projects/build.xml
path:
[echo]
[echo] My path - /Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/bin:/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home/bin:/usr/bin:/bin:/usr/sbin:/sbin
[echo]
BUILD SUCCESSFUL
Total time: 0 seconds
Finished: SUCCESS
I suspect this is caused by the spaces within your ws.jars path. You could try escaping these spaces with backslashes; change this path:
<property name="ws.jars" value="/Users/username/Documents/All JAR Files/All in one place"/>
with this one:
<property name="ws.jars" value="/Users/username/Documents/All\ JAR\ Files/All\ in\ one\ place/"/>
Thanks everybody for your suggestions. I hadn't thought about this in a while, but last night I had some time and was able to work on this issue again. It turns out it was related to the Jenkins job being executed in a non-interactive shell, just as 'rgulia' pointed out. So finally I tried copying my ws.jars directory to /Users/Shared/Jenkins, changed its owner to 'jenkins', and boom, my build was able to proceed.
The original error message was just misleading. It wasn't that the directory didn't exist, but that 'jenkins' didn't have access and/or permissions to it.
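For anyone repeating this fix, the copy and ownership change look roughly like this (the destination path is just an example; point ws.jars in build.xml at wherever you copy the jars):
sudo cp -R "/Users/username/Documents/All JAR Files/All in one place" /Users/Shared/Jenkins/jars
sudo chown -R jenkins /Users/Shared/Jenkins/jars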
I hope this information can help others.
I am creating a scheduled task during the installation of an application. The installer itself is running with administrator permissions:
SchTasks /F /create /tn "MyApp Start" /XML "D:\MyApps\start.xml" /ru "System"
This task is intended to start during system startup, which works fine as long as the user who logs in is the one who created the task.
In my special case the task should also run if another, non-admin user logs in.
Currently the task does not run if the non-admin user logs in. Even more, the task is not visible to him at all.
The question is: how can I create a scheduled task as administrator
using DOS or PowerShell commands
that runs with System privileges
that starts even if a normal non-admin user logs into Windows 7/8?
Here is my XML description of the task.
<?xml version="1.0"?>
<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
    <RegistrationInfo>
        <Date>2015-03-02T22:54:11</Date>
        <Author>foobar</Author>
    </RegistrationInfo>
    <Triggers>
        <BootTrigger>
            <StartBoundary>2015-03-02T22:54:11</StartBoundary>
            <Enabled>true</Enabled>
        </BootTrigger>
    </Triggers>
    <Principals>
        <Principal>
            <UserId>S-1-5-18</UserId>
            <RunLevel>LeastPrivilege</RunLevel>
        </Principal>
    </Principals>
    <Settings>
        <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>
        <DisallowStartIfOnBatteries>false</DisallowStartIfOnBatteries>
        <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>
        <AllowHardTerminate>true</AllowHardTerminate>
        <StartWhenAvailable>false</StartWhenAvailable>
        <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>
        <IdleSettings>
            <Duration>PT10M</Duration>
            <WaitTimeout>PT1H</WaitTimeout>
            <StopOnIdleEnd>true</StopOnIdleEnd>
            <RestartOnIdle>false</RestartOnIdle>
        </IdleSettings>
        <AllowStartOnDemand>true</AllowStartOnDemand>
        <Enabled>true</Enabled>
        <Hidden>false</Hidden>
        <RunOnlyIfIdle>false</RunOnlyIfIdle>
        <WakeToRun>false</WakeToRun>
        <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>
        <Priority>7</Priority>
    </Settings>
    <Actions>
        <Exec>
            <Command>D:\MyApps\start.bat</Command>
        </Exec>
    </Actions>
</Task>
Do you have any suggestions?
Best Regards
Tobias
Tobias,
I actually use the built-in Windows Task Scheduler to set up these types of operations. I find it a lot easier than using CMD, and it has all the options, features, triggers, etc. that you may be looking for. I use it to draft tasks and eventually push them onto our network. Not to mention it can be accessed under normal and admin user rights by default.
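If you still need to script the rollout, one pattern is to draft the task in Task Scheduler, export its XML, and re-create it from the command line with SchTasks (the task name and export path below are just examples):
SchTasks /Query /TN "MyApp Start" /XML > "D:\MyApps\start_exported.xml"
SchTasks /F /Create /TN "MyApp Start" /XML "D:\MyApps\start_exported.xml" /RU "System"
This keeps the System principal and boot trigger you drafted, so it should behave the same regardless of which user logs in.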
Hope this points you in the right direction.
Mike.