Deploy to YARN with Spring Cloud Dataflow

When deploying a stream to a remote YARN cluster, I get the following error from the YARN UI:
Diagnostics: File file:///dataflow/apps/stream/app/application.properties does not exist
This file exists on the Dataflow server's side and contains the following data:
#Thu Dec 01 10:32:39 CET 2016
spring.yarn.applicationVersion=app
spring.cloud.deployer.yarn.version=1.0.2.RELEASE
spring.hadoop.resourceManagerHost=hmaprb.my-domain.com
From what I understand, this error comes from a deployed container that also tries to access the configuration file. What I can't understand is when this configuration file should have been copied into YARN.
That may be obvious, but not knowing it makes this very hard to debug. Also, here are the YARN logs, if that helps:
2016-12-06 12:20:44,439 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 148106
2016-12-06 12:20:44,539 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 148106 submitted by user tcozien
2016-12-06 12:20:44,539 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1478697416091_148106
2016-12-06 12:20:44,539 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1478697416091_148106 State change from NEW to NEW_SAVING
2016-12-06 12:20:44,539 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1478697416091_148106
2016-12-06 12:20:44,539 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=tcozien IP=10.191.40.250 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1478697416091_148106
2016-12-06 12:20:44,593 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Storing info for app: application_1478697416091_148106 at: /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot/application_1478697416091_148106/application_1478697416091_148106
2016-12-06 12:20:44,683 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1478697416091_148106 State change from NEW_SAVING to SUBMITTED
2016-12-06 12:20:44,716 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1478697416091_148106 from user: tcozien, in queue: default, currently num of applications: 5
2016-12-06 12:20:44,717 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1478697416091_148106 State change from SUBMITTED to ACCEPTED
2016-12-06 12:20:44,717 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1478697416091_148106_000001
2016-12-06 12:20:44,717 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478697416091_148106_000001 State change from NEW to SUBMITTED
2016-12-06 12:20:44,717 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1478697416091_148106_000001 to scheduler from user: tcozien
2016-12-06 12:20:44,717 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478697416091_148106_000001 State change from SUBMITTED to SCHEDULED
2016-12-06 12:20:45,349 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e49_1478697416091_148106_01_000001 Container Transitioned from NEW to ALLOCATED
2016-12-06 12:20:45,349 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=tcozien OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1478697416091_148106 CONTAINERID=container_e49_1478697416091_148106_01_000001
2016-12-06 12:20:45,349 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_e49_1478697416091_148106_01_000001 of capacity <memory:2048, vCores:1, disks:0.0> on host hmaprb.my-domain.com:41610, which has 25 containers, <memory:51200, vCores:25, disks:12.0> used and <memory:71680, vCores:5, disks:3.0> available after allocation
2016-12-06 12:20:45,349 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : hmaprb.my-domain.com:41610 for container : container_e49_1478697416091_148106_01_000001
2016-12-06 12:20:45,349 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e49_1478697416091_148106_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
2016-12-06 12:20:45,349 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Clear node set for appattempt_1478697416091_148106_000001
2016-12-06 12:20:45,349 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1478697416091_148106 AttemptId: appattempt_1478697416091_148106_000001 MasterContainer: Container: [ContainerId: container_e49_1478697416091_148106_01_000001, NodeId: hmaprb.my-domain.com:41610, NodeHttpAddress: hmaprb.my-domain.com:8042, Resource: <memory:2048, vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.11.129.57:41610 }, ]
2016-12-06 12:20:45,349 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478697416091_148106_000001 State change from SCHEDULED to ALLOCATED_SAVING
2016-12-06 12:20:45,350 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Storing info for attempt: appattempt_1478697416091_148106_000001 at: /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot/application_1478697416091_148106/appattempt_1478697416091_148106_000001
2016-12-06 12:20:45,464 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478697416091_148106_000001 State change from ALLOCATED_SAVING to ALLOCATED
2016-12-06 12:20:45,464 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1478697416091_148106_000001
2016-12-06 12:20:45,465 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_e49_1478697416091_148106_01_000001, NodeId: hmaprb.my-domain.com:41610, NodeHttpAddress: hmaprb.my-domain.com:8042, Resource: <memory:2048, vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.11.129.57:41610 }, ] for AM appattempt_1478697416091_148106_000001
2016-12-06 12:20:45,465 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_e49_1478697416091_148106_01_000001 : $JAVA_HOME/bin/java,,-Dspring.config.location=servers.yml,-jar,spring-cloud-deployer-yarn-appdeployerappmaster-#spring-cloud-deployer-yarn.version#.jar,--spring.cloud.deployer.yarn.appmaster.artifact=/dataflow//artifacts/cache/,1><LOG_DIR>/Appmaster.stdout,2><LOG_DIR>/Appmaster.stderr
2016-12-06 12:20:45,465 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1478697416091_148106_000001
2016-12-06 12:20:45,465 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1478697416091_148106_000001
2016-12-06 12:20:45,484 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_e49_1478697416091_148106_01_000001, NodeId: hmaprb.my-domain.com:41610, NodeHttpAddress: hmaprb.my-domain.com:8042, Resource: <memory:2048, vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.11.129.57:41610 }, ] for AM appattempt_1478697416091_148106_000001
2016-12-06 12:20:45,484 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478697416091_148106_000001 State change from ALLOCATED to LAUNCHED
2016-12-06 12:20:46,347 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e49_1478697416091_148106_01_000001 Container Transitioned from ACQUIRED to RUNNING
2016-12-06 12:20:51,547 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e49_1478697416091_148106_01_000001 Container Transitioned from RUNNING to COMPLETED
2016-12-06 12:20:51,547 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_e49_1478697416091_148106_01_000001 in state: COMPLETED event:FINISHED
2016-12-06 12:20:51,547 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=tcozien OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1478697416091_148106 CONTAINERID=container_e49_1478697416091_148106_01_000001
2016-12-06 12:20:51,547 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_e49_1478697416091_148106_01_000001 of capacity <memory:2048, vCores:1, disks:0.0> on host hmaprb.my-domain.com:41610, which currently has 29 containers, <memory:59392, vCores:29, disks:14.5> used and <memory:63488, vCores:1, disks:0.5> available, release resources=true
2016-12-06 12:20:51,547 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1478697416091_148106_000001 released container container_e49_1478697416091_148106_01_000001 on node: host: hmaprb.my-domain.com:41610 #containers=29 available=<memory:63488, vCores:1, disks:0.5> used=<memory:59392, vCores:29, disks:14.5> with event: FINISHED
2016-12-06 12:20:51,547 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1478697416091_148106_000001 with final state: FAILED, and exit status: -1000
2016-12-06 12:20:51,547 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478697416091_148106_000001 State change from LAUNCHED to FINAL_SAVING
2016-12-06 12:20:51,547 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1478697416091_148106_000001 at: /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot/application_1478697416091_148106/appattempt_1478697416091_148106_000001
2016-12-06 12:20:51,741 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1478697416091_148106_000001
2016-12-06 12:20:51,742 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1478697416091_148106_000001
2016-12-06 12:20:51,742 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478697416091_148106_000001 State change from FINAL_SAVING to FAILED
2016-12-06 12:20:51,742 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1478697416091_148106 with final state: FAILED
2016-12-06 12:20:51,742 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1478697416091_148106 State change from ACCEPTED to FINAL_SAVING
2016-12-06 12:20:51,742 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating info for app: application_1478697416091_148106
2016-12-06 12:20:51,742 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application appattempt_1478697416091_148106_000001 is done. finalState=FAILED
2016-12-06 12:20:51,742 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1478697416091_148106 requests cleared
2016-12-06 12:20:51,742 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for app: application_1478697416091_148106 at: /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot/application_1478697416091_148106/application_1478697416091_148106
2016-12-06 12:20:51,907 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1478697416091_148106 failed 1 times due to AM Container for appattempt_1478697416091_148106_000001 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hmaprb.my-domain.com:8088/cluster/app/application_1478697416091_148106Then, click on links to logs of each attempt.
2016-12-06 12:20:51,907 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1478697416091_148106 State change from FINAL_SAVING to FAILED
2016-12-06 12:20:51,907 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=tcozien OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1478697416091_148106 failed 1 times due to AM Container for appattempt_1478697416091_148106_000001 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hmaprb.my-domain.com:8088/cluster/app/application_1478697416091_148106Then, click on links to logs of each attempt.
Failing this attempt. Failing the application. APPID=application_1478697416091_148106
2016-12-06 12:20:51,907 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1478697416091_148106,name=scdstream:app:offer,user=tcozien,queue=root.tcozien,state=FAILED,trackingUrl=http://hmaprb.my-domain.com:8088/cluster/app/application_1478697416091_148106,appMasterHost=N/A,startTime=1481026844538,finishTime=1481026851742,finalStatus=FAILED,memorySeconds=12701,vcoreSeconds=6,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0\, disks:0.0>,applicationType=DATAFLOW

I'd check what hdfs fsUri is set to in servers.yml, because file:///dataflow/apps/stream/app/application.properties is wrong: the file should be looked up from HDFS. When Hadoop's fs setting defaults to the local filesystem you get exactly this behavior, so I think the error is coming from there.
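A minimal sketch of the relevant servers.yml entries (host and port are placeholders here; on a MapR cluster the fsUri scheme may be maprfs:/// instead of hdfs://):
spring:
  hadoop:
    fsUri: hdfs://hmaprb.my-domain.com:8020
    resourceManagerHost: hmaprb.my-domain.com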

Related

Filebeat does not complete on close_eof + --once

Using filebeat 7.5.2:
I'm using a filebeat configuration with close_eof enabled, and I run filebeat with the --once flag. I can see the harvester reaching EOF, but filebeat keeps running.
Filebeat conf:
filebeat.inputs:
- type: log
  close_eof: true
  enabled: true
  paths:
    - "${LOGS_PATH}"
  scan_frequency: 1s
  fields: {
    machine: "${HOST}"
  }
output.logstash:
  hosts: ["192.168.41.6:5044"]
  bulk_max_size: 1024
  timeout: 30s
  pipelining: 1
  workers: 1
And I run it using:
filebeat run --once -v -c "PATH TO CONF..."
And some logs from the filebeat instance:
...
2020-02-04T18:30:16.950Z INFO instance/beat.go:297 Setup Beat: filebeat; Version: 7.5.2
2020-02-04T18:30:17.059Z INFO [publisher] pipeline/module.go:97 Beat name: logstash
2020-02-04T18:30:17.167Z WARN beater/filebeat.go:152 Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-02-04T18:30:17.168Z INFO instance/beat.go:429 filebeat start running.
2020-02-04T18:30:17.168Z INFO [monitoring] log/log.go:118 Starting metrics logging every 30s
2020-02-04T18:30:17.168Z INFO registrar/migrate.go:104 No registry home found. Create: /tmp/tmp.BXJtfiaEzb/data/registry/filebeat
2020-02-04T18:30:17.179Z INFO registrar/migrate.go:112 Initialize registry meta file
2020-02-04T18:30:17.192Z INFO registrar/registrar.go:108 No registry file found under: /tmp/tmp.BXJtfiaEzb/data/registry/filebeat/data.json. Creating a new registry file.
2020-02-04T18:30:17.193Z INFO registrar/registrar.go:145 Loading registrar data from /tmp/tmp.BXJtfiaEzb/data/registry/filebeat/data.json
2020-02-04T18:30:17.193Z INFO registrar/registrar.go:152 States Loaded from registrar: 0
2020-02-04T18:30:17.193Z WARN beater/filebeat.go:368 Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-02-04T18:30:17.193Z INFO crawler/crawler.go:72 Loading Inputs: 1
2020-02-04T18:30:17.194Z INFO log/input.go:152 Configured paths: [/tmp/tmp.BXJtfiaEzb/*.log]
2020-02-04T18:30:17.206Z INFO input/input.go:114 Starting input of type: log; ID: 13918413832820009056
2020-02-04T18:30:17.225Z INFO input/input.go:167 Stopping Input: 13918413832820009056
2020-02-04T18:30:17.225Z INFO crawler/crawler.go:106 Loading and starting Inputs completed. Enabled inputs: 1
2020-02-04T18:30:17.225Z INFO log/harvester.go:251 Harvester started for file: /tmp/tmp.BXJtfiaEzb/dcbgw-20200124080032_darkblue.log
2020-02-04T18:30:17.231Z INFO beater/filebeat.go:384 Running filebeat once. Waiting for completion ...
2020-02-04T18:30:17.231Z INFO beater/filebeat.go:386 All data collection completed. Shutting down.
2020-02-04T18:30:17.231Z INFO crawler/crawler.go:139 Stopping Crawler
2020-02-04T18:30:17.231Z INFO crawler/crawler.go:149 Stopping 1 inputs
2020-02-04T18:30:17.258Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://192.168.41.6:5044))
2020-02-04T18:30:17.296Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://192.168.41.6:5044)) established
... Only metrics here ...
2020-02-04T18:35:55.686Z INFO log/harvester.go:274 End of file reached: /tmp/tmp.BXJtfiaEzb/dcbgw-20200124080032_darkblue.log. Closing because close_eof is enabled.
2020-02-04T18:35:55.686Z INFO crawler/crawler.go:165 Crawler stopped
... MORE METRICS ...
2020-02-04T18:36:26.609Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 192.168.41.6:49662->192.168.41.6:5044: i/o timeout
2020-02-04T18:36:26.621Z ERROR logstash/async.go:256 Failed to publish events caused by: client is not connected
2020-02-04T18:36:28.520Z ERROR pipeline/output.go:121 Failed to publish events: client is not connected
2020-02-04T18:36:28.520Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://192.168.41.6:5044))
2020-02-04T18:36:28.521Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://192.168.41.6:5044)) established
... MORE METRICS ...
From this instance I'm outputting to Logstash 7.5.2 running in the same Ubuntu 18 VM. Running Logstash with log level trace does not output any errors.

Apache Flink - Timeout after submitting job on Hadoop/YARN cluster

I am trying to upgrade our job from Flink 1.4.2 to 1.7.1, but I keep running into timeouts after submitting the job. The Flink job runs on our Hadoop cluster (version 2.7) with YARN.
I've seen the following behavior:
Using the same flink-conf.yaml as we used in 1.4.2: 1.5.6 / 1.6.3 / 1.7.1 all time out, while 1.4.2 works.
Using 1.5.6 with "mode: legacy" (to switch off FLIP-6) works; the snippet below shows the setting.
Using 1.7.1 with "mode: legacy" still gives a timeout (I assume this option was removed, but then the documentation is outdated: https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#legacy).
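For reference, the legacy switch is a single line in flink-conf.yaml (as documented for Flink 1.5/1.6):
mode: legacy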
When the timeout happens I get the following stacktrace:
INFO class java.time.Instant does not contain a getter for field seconds
INFO class com.bol.fin_hdp.cm1.domain.Cm1Transportable does not contain a getter for field globalId
INFO Submitting job 5af931bcef395a78b5af2b97e92dcffe (detached: false).
INFO ------------------------------------------------------------
INFO The program finished with the following exception:
INFO org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
INFO at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:545)
INFO at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
INFO at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
INFO at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:798)
INFO at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:289)
INFO at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
INFO at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1035)
INFO at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1111)
INFO at java.security.AccessController.doPrivileged(Native Method)
INFO at javax.security.auth.Subject.doAs(Subject.java:422)
INFO at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
INFO at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
INFO at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1111)
INFO Caused by: java.lang.RuntimeException: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:43)
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJobWithConfig(IntervalJobStarter.java:32)
INFO at com.bol.fin_hdp.Main.main(Main.java:8)
INFO at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
INFO at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO at java.lang.reflect.Method.invoke(Method.java:498)
INFO at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
INFO ... 12 more
INFO Caused by: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
INFO at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:258)
INFO at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
INFO at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
INFO at com.bol.fin_hdp.cm1.job.Job.execute(Job.java:54)
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:41)
INFO ... 19 more
INFO Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
INFO at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:371)
INFO at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
INFO at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
INFO at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
INFO at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
INFO at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:216)
INFO at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
INFO at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
INFO at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
INFO at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
INFO at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:301)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
INFO at java.lang.Thread.run(Thread.java:748)
INFO Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
INFO at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
INFO ... 17 more
INFO Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-01...
INFO at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
INFO at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
INFO at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
INFO at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
INFO ... 15 more
INFO Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-017.example.com/some.ip.address:46500
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
INFO ... 7 more
What changed in FLIP-6 that might cause this behavior, and how can I fix it?
For our jobs on YARN w/Flink 1.6, we had to bump up the web.timeout setting via -yD web.timeout=100000.
In our case, there was a firewall between the machine submitting the job and our Hadoop cluster.
In newer Flink versions (1.7 and up), Flink uses REST to submit jobs. On YARN setups the port number for this REST service is random and, before 1.8.0, could not be set.
Flink 1.8.0 introduced a config option to set it to a port or a port range:
rest.bind-port: 55520-55530
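For reference, a sketch of how the two settings combine (values are examples and my-job.jar is a placeholder; the chosen port range must also be opened on the firewall between client and cluster):
# flink-conf.yaml (Flink 1.8.0 and later): pin the REST port to an open range
rest.bind-port: 55520-55530
# submit with a raised web timeout
./bin/flink run -m yarn-cluster -yD web.timeout=100000 my-job.jar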

Flink 1.4.0 ClassDefNotFoundError ... S3ErrorResponseHandler

I'm working on setting up a local test of Flink 1.4.0 that writes to S3, and I'm getting the following error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.internal.S3ErrorResponseHandler
at org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:363)
at org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:542)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem.createAmazonS3Client(PrestoS3FileSystem.java:639)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem.initialize(PrestoS3FileSystem.java:212)
at org.apache.flink.fs.s3presto.S3FileSystemFactory.create(S3FileSystemFactory.java:132)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:397)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:320)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.<init>(FsCheckpointStreamFactory.java:99)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createStreamFactory(FsStateBackend.java:277)
at org.apache.flink.streaming.runtime.tasks.StreamTask.createCheckpointStreamFactory(StreamTask.java:787)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:247)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Following the documentation, I added flink-s3-fs-presto-1.4.0.jar from opt/ to lib/, so I'm not exactly sure why I'm getting this error. Any help would be appreciated; let me know if I can add additional information.
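For completeness, the copy step I performed amounts to this, run from the Flink home directory (a restart is needed so lib/ is re-scanned; script names are those shipped with Flink 1.4):
cp opt/flink-s3-fs-presto-1.4.0.jar lib/
./bin/stop-local.sh
./bin/start-local.sh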
Here is some more information about my system and process:
I start the local job manager:
[flink-1.4.0] ./bin/start-local.sh
Warning: this file is deprecated and will be removed in 1.5.
Starting cluster.
Starting jobmanager daemon on host MBP0535.local.
Starting taskmanager daemon on host MBP0535.local.
OS information:
[flink-1.4.0] system_profiler SPSoftwareDataType
Software:
System Software Overview:
System Version: macOS 10.13.2 (17C205)
Kernel Version: Darwin 17.3.0
Boot Volume: Macintosh HD
Then I try to run the jar:
[flink-1.4.0] ./bin/flink run streaming.jar
I'm actually having trouble reproducing the error. Here is the task manager log:
2018-01-18 10:17:07,668 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-01-18 10:17:07,668 INFO org.apache.flink.runtime.taskmanager.TaskManager - OS current user: k
2018-01-18 10:17:08,002 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-01-18 10:17:08,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - Current Hadoop/Kerberos user: k
2018-01-18 10:17:08,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.152-b16
2018-01-18 10:17:08,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum heap size: 1024 MiBytes
2018-01-18 10:17:08,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - JAVA_HOME: /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version: 2.8.1
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+UseG1GC
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms1024M
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx1024M
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=8388607T
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog.file=/Users/k/flink-1.4.0/log/flink-k-taskmanager-0-MBP0535.local.log
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog4j.configuration=file:/Users/k/flink-1.4.0/conf/log4j.properties
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlogback.configurationFile=file:/Users/k/flink-1.4.0/conf/logback.xml
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - Program Arguments:
2018-01-18 10:17:08,088 INFO org.apache.flink.runtime.taskmanager.TaskManager - --configDir
2018-01-18 10:17:08,088 INFO org.apache.flink.runtime.taskmanager.TaskManager - /Users/k/flink-1.4.0/conf
2018-01-18 10:17:08,088 INFO org.apache.flink.runtime.taskmanager.TaskManager - Classpath: /Users/k/flink-1.4.0/lib/flink-python_2.11-1.4.0.jar:/Users/k/flink-1.4.0/lib/flink-s3-fs-hadoop-1.4.0.jar:/Users/k/flink-1.4.0/lib/flink-shaded-hadoop2-uber-1.4.0.jar:/Users/k/flink-1.4.0/lib/log4j-1.2.17.jar:/Users/k/flink-1.4.0/lib/slf4j-log4j12-1.7.7.jar:/Users/k/flink-1.4.0/lib/flink-dist_2.11-1.4.0.jar:::
2018-01-18 10:17:08,089 INFO org.apache.flink.runtime.taskmanager.TaskManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-01-18 10:17:08,094 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum number of open file descriptors is 10240
2018-01-18 10:17:08,117 INFO org.apache.flink.runtime.taskmanager.TaskManager - Loading configuration from /Users/k/flink-1.4.0/conf
2018-01-18 10:17:08,119 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: classloader.resolve-order, parent-first
2018-01-18 10:17:08,119 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: classloader.parent-first-patterns, java.;org.apache.flink.;javax.annotation;org.slf4j;org.apache.log4j;org.apache.logging.log4j;ch.qos.logback;com.mapr.;org.apache.
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: s3.access-key, XXXXXXXXXXXXXXXXXXXX
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: s3.secret-key, YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-01-18 10:17:08,121 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2018-01-18 10:17:08,121 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-01-18 10:17:08,121 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.port, 8082
2018-01-18 10:17:08,199 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to k (auth:SIMPLE)
2018-01-18 10:17:08,289 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-01-18 10:17:08,289 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018-01-18 10:17:08,291 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address localhost/127.0.0.1:6123.
2018-01-18 10:17:08,472 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'MBP0535.local' (10.1.11.139) for communication.
2018-01-18 10:17:08,482 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager
2018-01-18 10:17:08,482 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at MBP0535.local:54024.
2018-01-18 10:17:08,484 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to start actor system at mbp0535.local:54024
2018-01-18 10:17:08,898 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-01-18 10:17:08,960 INFO akka.remote.Remoting - Starting remoting
2018-01-18 10:17:09,087 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@mbp0535.local:54024]
2018-01-18 10:17:09,097 INFO org.apache.flink.runtime.taskmanager.TaskManager - Actor system started at akka.tcp://flink@mbp0535.local:54024
2018-01-18 10:17:09,105 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-01-18 10:17:09,111 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
2018-01-18 10:17:09,115 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: MBP0535.local/10.1.11.139, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2018-01-18 10:17:09,118 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2018-01-18 10:17:09,122 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T': total 465 GB, usable 333 GB (71.61% usable)
2018-01-18 10:17:09,236 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 101 MB for network buffer pool (number of memory segments: 3255, bytes per segment: 32768).
2018-01-18 10:17:09,323 WARN org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. Please put the corresponding jar from the opt to the lib folder.
2018-01-18 10:17:09,324 WARN org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. Please put the corresponding jar from the opt to the lib folder.
2018-01-18 10:17:09,324 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2018-01-18 10:17:09,353 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 23 ms).
2018-01-18 10:17:09,378 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 25 ms). Listening on SocketAddress /10.1.11.139:54026.
2018-01-18 10:17:09,381 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation - No hostname could be resolved for the IP address 10.1.11.139, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2018-01-18 10:17:09,431 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2018-01-18 10:17:09,437 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/flink-io-186cf8c8-5a0d-44cc-9d78-e81c943b0b9f for spill files.
2018-01-18 10:17:09,439 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/flink-dist-cache-a9a568cd-c7cd-45c6-abbe-08912d051583
2018-01-18 10:17:09,509 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/flink-dist-cache-bd3cc98c-cebb-4569-98d3-5357393d8c5b
2018-01-18 10:17:09,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1044592356.
2018-01-18 10:17:09,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: 97b3a934f84ba25e20aae8a91a40e336 @ 10.1.11.139 (dataPort=54026)
2018-01-18 10:17:09,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 1 task slot(s).
2018-01-18 10:17:09,518 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 112/1024/1024 MB, NON HEAP: 35/36/-1 MB (used/committed/max)]
2018-01-18 10:17:09,522 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@localhost:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2018-01-18 10:17:09,692 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@localhost:6123/user/jobmanager), starting network stack and library cache.
2018-01-18 10:17:09,696 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be localhost/127.0.0.1:54025. Starting BLOB cache.
2018-01-18 10:17:09,699 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/blobStore-77287aab-5128-4363-842c-1a124114fd91
2018-01-18 10:17:09,702 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/blobStore-c9f62e97-bf53-4fc4-9e4a-1958706e78ec
2018-01-18 10:26:25,993 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task Source: Kafka -> Sink: S3 (1/1)
2018-01-18 10:26:25,993 INFO org.apache.flink.runtime.taskmanager.Task - Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) switched from CREATED to DEPLOYING.
2018-01-18 10:26:25,994 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) [DEPLOYING]
2018-01-18 10:26:25,996 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) [DEPLOYING].
2018-01-18 10:26:25,998 INFO org.apache.flink.runtime.blob.BlobClient - Downloading 34e7c81bd4a0050e7809a1343af0c7cb/p-4eaec529eb247f30ef2d3ddc2308e029e625de33-93fe90509266a50ffadce2131cedc514 from localhost/127.0.0.1:54025
2018-01-18 10:26:26,238 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) [DEPLOYING].
2018-01-18 10:26:26,240 INFO org.apache.flink.runtime.taskmanager.Task - Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) switched from DEPLOYING to RUNNING.
2018-01-18 10:26:26,249 INFO org.apache.flink.streaming.runtime.tasks.StreamTask - Using user-defined state backend: File State Backend @ s3://stream-data/checkpoints.
2018-01-18 10:26:26,522 INFO org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.util.NativeCodeLoader - Skipping native-hadoop library for flink-s3-fs-hadoop's relocated Hadoop... using builtin-java classes where applicable
2018-01-18 10:26:29,041 ERROR org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink - Error while creating FileSystem when initializing the state of the BucketingSink.
java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.createHadoopFileSystem(BucketingSink.java:1196)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initFileSystem(BucketingSink.java:411)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:355)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
2018-01-18 10:26:29,048 INFO org.apache.flink.runtime.taskmanager.Task - Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) switched from RUNNING to FAILED.
java.lang.RuntimeException: Error while creating FileSystem when initializing the state of the BucketingSink.
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:358)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.createHadoopFileSystem(BucketingSink.java:1196)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initFileSystem(BucketingSink.java:411)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:355)
... 9 more

OpsCenter not getting data after restart of server

We are using DataStax Enterprise edition, running a 2-node cluster. After restarting the OpsCenter node, we get the error below.
2017-03-20 14:49:45,819 [opscenterd] ERROR: Unhandled error in Deferred: There are no clusters with name or ID 'tracking'
File "/usr/share/opscenter/lib/py/twisted/internet/defer.py", line 1124, in _inlineCallbacks
result = g.send(result)
File "/usr/share/opscenter/jython/Lib/site-packages/opscenterd/WebServer.py", line 523, in ClusterController
File "/usr/share/opscenter/jython/Lib/site-packages/opscenterd/ClusterServices.py", line 181, in __getitem__
(MainThread)
Agents Log
WARN [async-dispatch-23] 2017-03-20 17:13:45,230 Attempted to ping opscenterd on stomp but did not receive a reply in time, will retry again later.
ERROR [StompConnection receiver] 2017-03-20 17:13:45,230 Mar 20, 2017 5:13:45 PM org.jgroups.client.StompConnection run
SEVERE: JGRP000112: Connection closed unexpectedly:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:223)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.jgroups.util.Util.readLine(Util.java:2825)
at org.jgroups.protocols.STOMP.readFrame(STOMP.java:240)
at org.jgroups.client.StompConnection.run(StompConnection.java:274)
at java.lang.Thread.run(Thread.java:745)
INFO [async-dispatch-23] 2017-03-20 17:13:45,236 Starting DynamicEnvironmentComponent
INFO [async-dispatch-23] 2017-03-20 17:13:45,512 Dynamic environment script output: paths:
cassandra-conf: /etc/dse//cassandra
cassandra-log: /var/log/cassandra
hadoop-log: /var/log/hadoop/userlogs
spark-log: /var/log/spark
dse-env: /etc/dse
dse-conf: /etc/dse/
hadoop-conf: /etc/dse/hadoop2-client
spark-conf: /etc/dse//spark
INFO [async-dispatch-23] 2017-03-20 17:13:45,522 Starting storage database connection.
ERROR [async-dispatch-23] 2017-03-20 17:13:47,737 Can't connect to Cassandra (All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1:9042] Cannot connect))), retrying soon.
INFO [async-dispatch-23] 2017-03-20 17:13:47,738 Starting monitored database connection.
ERROR [async-dispatch-23] 2017-03-20 17:13:49,965 Can't connect to Cassandra, authentication error, please carefully check your Auth settings, retrying soon.
INFO [async-dispatch-23] 2017-03-20 17:13:49,967 Starting RepairComponent
INFO [async-dispatch-23] 2017-03-20 17:13:49,970 Finished starting system.
INFO [async-dispatch-26] 2017-03-20 17:13:59,971 Starting system.
INFO [async-dispatch-26] 2017-03-20 17:13:59,973 Configuration change for component class opsagent.nodedetails.repair.RepairComponent: before: {:send-repair-fn #object[opsagent.nodedetails.repair.jmx$send_repair 0x76028b5c "opsagent.nodedetails.repair.jmx$send_repair#76028b5c"], :parse-notification-fn #object[opsagent.nodedetails.repair.jmx$parse_notification 0x5e84cf80 "opsagent.nodedetails.repair.jmx$parse_notification#5e84cf80"]}, after: {:send-repair-fn nil, :parse-notification-fn nil}
INFO [async-dispatch-26] 2017-03-20 17:13:59,974 The following components have had a config change and will be rebuilt and restarted: (:repair-component)
INFO [async-dispatch-26] 2017-03-20 17:13:59,975 The component restart for (:repair-component) when accounting for dependencies requires these components to be restarted #{:repair-component :http-server}
INFO [async-dispatch-26] 2017-03-20 17:13:59,976 Stopping RepairComponent.
INFO [async-dispatch-26] 2017-03-20 17:13:59,977 Starting StompComponent
INFO [async-dispatch-26] 2017-03-20 17:13:59,978 SSL communication is disabled
INFO [async-dispatch-26] 2017-03-20 17:13:59,978 Creating stomp connection to 192.168.136.250:61620
ERROR [async-dispatch-26] 2017-03-20 17:13:59,980 Mar 20, 2017 5:13:59 PM org.jgroups.client.StompConnection connect
INFO: Connected to 192.168.136.250:61620
I am not able to understand what's wrong with the agent and OpsCenter.

Oozie Sqoop Issue

I am trying to run an Oozie Sqoop job to import from Teradata into Hive.
Sqoop runs fine from the CLI, but I am facing issues scheduling it with Oozie.
Note: shell actions in Oozie work fine for me.
Find the error logs and workflow below.
Error logs:
Log Type: stderr
Log Upload Time: Wed Feb 01 04:19:00 -0500 2017
Log Length: 513
log4j:ERROR Could not find value for key log4j.appender.CLA
log4j:ERROR Could not instantiate appender named "CLA".
log4j:WARN No appenders could be found for logger (org.apache.hadoop.yarn.client.RMProxy).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
No such sqoop tool: sqoop. See 'sqoop help'.
Intercepting System.exit(1)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Log Type: stdout
Log Upload Time: Wed Feb 01 04:19:00 -0500 2017
Log Length: 158473
Showing 4096 bytes of 158473 total.
curity.ShellBasedUnixGroupsMapping
dfs.client.domain.socket.data.traffic=false
dfs.client.read.shortcircuit.streams.cache.size=256
fs.s3a.connection.timeout=200000
dfs.datanode.block-pinning.enabled=false
mapreduce.job.end-notification.max.retry.interval=5000
yarn.acl.enable=true
yarn.nm.liveness-monitor.expiry-interval-ms=600000
mapreduce.application.classpath=$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH
mapreduce.input.fileinputformat.list-status.num-threads=1
dfs.client.mmap.cache.size=256
mapreduce.tasktracker.map.tasks.maximum=2
yarn.scheduler.fair.user-as-default-queue=true
yarn.timeline-service.ttl-enable=true
yarn.nodemanager.linux-container-executor.resources-handler.class=org.apache.hadoop.yarn.server.nodemanager.util.DefaultLCEResourcesHandler
dfs.namenode.max.objects=0
dfs.namenode.service.handler.count=10
dfs.namenode.kerberos.principal.pattern=*
yarn.resourcemanager.state-store.max-completed-applications=${yarn.resourcemanager.max-completed-applications}
dfs.namenode.delegation.token.max-lifetime=604800000
mapreduce.job.classloader=false
yarn.timeline-service.leveldb-timeline-store.start-time-write-cache-size=10000
mapreduce.job.hdfs-servers=${fs.defaultFS}
yarn.application.classpath=$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
dfs.datanode.hdfs-blocks-metadata.enabled=true
mapreduce.tasktracker.dns.nameserver=default
dfs.datanode.readahead.bytes=4193404
mapreduce.job.ubertask.maxreduces=1
dfs.image.compress=false
mapreduce.shuffle.ssl.enabled=false
yarn.log-aggregation-enable=false
mapreduce.tasktracker.report.address=127.0.0.1:0
mapreduce.tasktracker.http.threads=40
dfs.stream-buffer-size=4096
tfile.fs.output.buffer.size=262144
fs.permissions.umask-mode=022
dfs.client.datanode-restart.timeout=30
dfs.namenode.resource.du.reserved=104857600
yarn.resourcemanager.am.max-attempts=2
yarn.nodemanager.resource.percentage-physical-cpu-limit=100
ha.failover-controller.graceful-fence.connection.retries=1
mapreduce.job.speculative.speculative-cap-running-tasks=0.1
hadoop.proxyuser.hdfs.groups=*
dfs.datanode.drop.cache.behind.writes=false
hadoop.proxyuser.HTTP.hosts=*
hadoop.common.configuration.version=0.23.0
mapreduce.job.ubertask.enable=false
yarn.app.mapreduce.am.resource.cpu-vcores=1
dfs.namenode.replication.work.multiplier.per.iteration=2
mapreduce.job.acl-modify-job=
io.seqfile.local.dir=${hadoop.tmp.dir}/io/local
yarn.resourcemanager.system-metrics-publisher.enabled=false
fs.s3.sleepTimeSeconds=10
mapreduce.client.output.filter=FAILED
------------------------
Sqoop command arguments :
sqoop
import
--connect
"jdbc:teradata://xx.xxx.xx:xxxx/DATABASE=Database_name"
--verbose
--username
xxx
-password
'xxx'
--table
BILL_DETL_EXTRC
--split-by
EXTRC_RUN_ID
--m
1
--fields-terminated-by
,
--hive-import
--hive-table
OPS_TEST.bill_detl_extr213
--target-dir
/hadoop/dev/TD_archive/bill_detl_extrc
Fetching child yarn jobs
tag id : oozie-56ea2084fcb1d55591f8919b405f0be0
Child yarn jobs are found -
=================================================================
Invoking Sqoop command line now >>>
3324 [uber-SubtaskRunner] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://namenode:8020/user/hadoopadm/oozie-oozi/0000039-170123205203054-oozie-oozi-W/sqoop-action--sqoop/action-data.seq
Oozie Launcher ends
Log Type: syslog
Log Upload Time: Wed Feb 01 04:19:00 -0500 2017
Log Length: 16065
Showing 4096 bytes of 16065 total.
adoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Job jar is not present. Not adding any jar to the list of resources.
2017-02-01 04:18:51,990 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: The job-conf file on the remote FS is /user/hadoopadm/.staging/job_1485220715968_0219/job.xml
2017-02-01 04:18:52,074 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Adding #5 tokens and #1 secret keys for NM use for launching container
2017-02-01 04:18:52,074 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Size of containertokens_dob is 6
2017-02-01 04:18:52,074 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Putting shuffle token in serviceData
2017-02-01 04:18:52,174 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://svacld001.bcbsnc.com:8020]
2017-02-01 04:18:52,240 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapred.JobConf: Task java-opts do not specify heap size. Setting task attempt jvm max heap size to -Xmx820m
2017-02-01 04:18:52,243 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1485220715968_0219_m_000000_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
2017-02-01 04:18:52,243 INFO [uber-EventHandler] org.apache.hadoop.mapred.LocalContainerLauncher: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1485220715968_0219_01_000001 taskAttempt attempt_1485220715968_0219_m_000000_0
2017-02-01 04:18:52,245 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1485220715968_0219_m_000000_0] using containerId: [container_1485220715968_0219_01_000001 on NM: [svacld005.bcbsnc.com:8041]
2017-02-01 04:18:52,246 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.LocalContainerLauncher: mapreduce.cluster.local.dir for uber task: /disk1/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk10/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk11/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk12/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk2/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk3/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk4/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk5/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk6/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk7/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk8/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219,/disk9/yarn/nm/usercache/hadoopadm/appcache/application_1485220715968_0219
2017-02-01 04:18:52,247 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1485220715968_0219_m_000000_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
2017-02-01 04:18:52,247 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1485220715968_0219_m_000000 Task Transitioned from SCHEDULED to RUNNING
2017-02-01 04:18:52,249 INFO [uber-SubtaskRunner] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
2017-02-01 04:18:52,258 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2017-02-01 04:18:52,324 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.MapTask: Processing split: org.apache.oozie.action.hadoop.OozieLauncherInputFormat$EmptySplit@9c73765
2017-02-01 04:18:52,329 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2017-02-01 04:18:52,340 INFO [uber-SubtaskRunner] org.apache.hadoop.conf.Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
WORKFLOW
<workflow-app xmlns="uri:oozie:workflow:0.5" name="oozie-wf">
    <start to="sqoop-wf"/>
    <action name="sqoop-wf">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>xx.xx.xx:8032</job-tracker>
            <name-node>hdfs://xx.xxx.xx:8020</name-node>
            <command>import --connect "jdbc:teradata://ip/DATABASE=EDW_EXTRC_TAB_HST" --connection-manager "com.cloudera.connector.teradata.TeradataManager" --verbose --username HADOOP -password 'xxxxx' --table BILL_DETL_EXTRC --split-by EXTRC_RUN_ID --m 1 --fields-terminated-by , --hive-import --hive-table OPS_TEST.bill_detl_extrc1 --target-dir /hadoop/dev/TD_archive/data/PDCRDATA_TEST/bill_detl_extrc</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Failed, Error Message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
JOB PROPERTIES
oozie.wf.application.path=hdfs:///hadoop/dev/TD_archive/workflow1.xml
oozie.use.system.libpath=true
security_enabled=True
dryrun=False
jobtracker=xxx.xxx:8032
nameNode=hdfs://xx.xx:8020
NOTE:
We are using Cloudera CDH 5.5.
All the necessary JARs (sqoop-connector-teradata-1.5c5.jar, tdgssconfig.jar, terajdbc4.jar) are placed in /var/lib/sqoop as well as in HDFS.
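One way to make those JARs visible to the Sqoop action is to put them in the Oozie Sqoop sharelib on HDFS and refresh it; a sketch, with the sharelib path and Oozie URL as placeholders for your cluster's values:
# copy the connector jars into the Sqoop sharelib (adjust lib_<timestamp>)
hdfs dfs -put sqoop-connector-teradata-1.5c5.jar tdgssconfig.jar terajdbc4.jar /user/oozie/share/lib/lib_<timestamp>/sqoop/
# ask Oozie to pick up the refreshed sharelib
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate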