Changing GridGain cluster state - ignite

I'm trying to change the state of a 3-node GridGain cluster running on Kubernetes using the control.sh script, as documented.
./control.sh --set-state INACTIVE
This normally returns success within a short time, but now it hangs indefinitely and the only way out is CTRL+C. After interrupting it, the cluster ends up in an unexpected state in which
./control.sh --set-state ACTIVE
fails. Below is the exception extracted from the GridGain log.
[1]
[SEVERE][rest-#70%dev%][GridJobWorker] Failed to execute job [jobId=a1fbeb63c71-ab1511ea-8668-442e-8aea-0d51e23026d6, ses=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=o.a.i.i.v.misc.VisorChangeGridActiveStateTask, dep=LocalDeployment [super=GridDeployment [ts=1633005984012, depMode=SHARED, clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader#2c13da15, clsLdrId=c09ddb63c71-ab1511ea-8668-442e-8aea-0d51e23026d6, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=o.a.i.i.v.misc.VisorChangeGridActiveStateTask, sesId=91fbeb63c71-ab1511ea-8668-442e-8aea-0d51e23026d6, startTime=1633008816019, endTime=9223372036854775807, taskNodeId=ab1511ea-8668-442e-8aea-0d51e23026d6, clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader#2c13da15, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, internal=true, topPred=ContainsNodeIdsPredicate [], subjId=ab1511ea-8668-442e-8aea-0d51e23026d6, mapFut=IgniteFuture [orig=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=939562377]], execName=null], jobId=a1fbeb63c71-ab1511ea-8668-442e-8aea-0d51e23026d6]]
class org.apache.ignite.IgniteException: Failed to activate cluster, because another state change operation is currently in progress: deactivate cluster
Subsequent ./control.sh calls immediately fail with the exception below.
[2]
Command [SET-STATE] finished with code: 4
Error stack trace:
class org.apache.ignite.internal.client.GridClientException: null
suppressed:
at org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection.handleClientResponse(GridClientNioTcpConnection.java:628)
at org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection.handleResponse(GridClientNioTcpConnection.java:559)
at org.apache.ignite.internal.client.impl.connection.GridClientConnectionManagerAdapter$NioListener.onMessage(GridClientConnectionManagerAdapter.java:694)
at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:278)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:108)
at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:115)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:108)
at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:3714)
at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:174)
at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1193)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2504)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2269)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1891)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
at java.base/java.lang.Thread.run(Thread.java:829)
The following was then observed in the server logs.
[3]
class org.apache.ignite.IgniteCheckedException: Failed to send response to node. Unsupported direct type [message=GridDhtAffinityAssignmentRequest [flags=1, futId=23, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=2], super=GridCacheGroupIdMessage [grpId=-149688677]]]
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processFailedMessage(GridCacheIoManager.java:1139)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:382)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1726)
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1333)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4800(GridIoManager.java:157)
at org.apache.ignite.internal.managers.communication.GridIoManager$8.execute(GridIoManager.java:1218)
at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:54)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to find message handler for message: GridDhtAffinityAssignmentRequest [flags=1, futId=23, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=2], super=GridCacheGroupIdMessage [grpId=-149688677]]
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
... 11 more
Thanks in advance for your help to resolve this issue.
GridGain version: 8.8.8

Deactivation can take a while for a number of reasons, so check the server logs to see what the cluster is waiting on.
As for the commands, you can use force mode. The syntax, followed by a concrete call:
control.(sh|bat) --set-state INACTIVE|ACTIVE|ACTIVE_READ_ONLY [--force] [--yes]
./control.sh --set-state INACTIVE --force
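
A minimal sketch of assembling that command line in a script, assuming only the syntax shown above. The function name and the default script path are hypothetical; the state names and the --force / --yes flags are taken from the syntax line.

```python
# Hypothetical helper: builds the argv list for a control.sh --set-state call.
# The allowed states and flags mirror the syntax line quoted in the answer.
VALID_STATES = ("INACTIVE", "ACTIVE", "ACTIVE_READ_ONLY")


def build_set_state_cmd(state, force=False, yes=False, script="./control.sh"):
    """Return the command as a list suitable for subprocess.run()."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown cluster state: {state!r}")
    cmd = [script, "--set-state", state]
    if force:
        cmd.append("--force")  # force mode, as suggested in the answer
    if yes:
        cmd.append("--yes")    # skip the interactive confirmation prompt
    return cmd


if __name__ == "__main__":
    print(" ".join(build_set_state_cmd("INACTIVE", force=True)))
```

Building the argument list separately from running it makes it easy to log the exact command before it is executed against the cluster.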

Related

Issues with Ignite benchmark

I am getting the error below while running a benchmark test with Yardstick using the default settings.
It is a standalone setup.
[root@db3 ~]# ./bin/benchmark-run-all.sh config/benchmark-sample.properties
<17:26:41> Failed to set up benchmark drivers (will shutdown and exit).
class org.apache.ignite.IgniteCheckedException: Failed to start manager: GridManagerAdapter [enabled=true, name=org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1922)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1235)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:1787)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1711)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1141)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:639)
at org.apache.ignite.IgniteSpring.start(IgniteSpring.java:65)
at org.apache.ignite.yardstick.IgniteNode.start(IgniteNode.java:220)
at org.apache.ignite.yardstick.IgniteAbstractBenchmark.setUp(IgniteAbstractBenchmark.java:64)
at org.apache.ignite.yardstick.cache.IgniteCacheAbstractBenchmark.setUp(IgniteCacheAbstractBenchmark.java:107)
at org.yardstickframework.BenchmarkDriverStartUp.main(BenchmarkDriverStartUp.java:130)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to start SPI: TcpDiscoverySpi [addrRslvr=null, sockTimeout=5000, ackTimeout=5000, marsh=JdkMarshaller [clsFilter=org.apache.ignite.marshaller.MarshallerUtils$1#1eff3cfb], reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=5, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, skipAddrsRandomization=false]
at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:280)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:985)
at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1917)
... 10 more
Caused by: class org.apache.ignite.spi.IgniteSpiException: Failed to join node (Incompatible data region configuration [region=DEFAULT, locNodeId=8c2b6d02-01b3-4c22-8ad2-67c0c5f9ec4e, isPersistenceEnabled=true, rmtNodeId=4011d970-ae2d-4116-bb85-4311195c88a8, isPersistenceEnabled=false])
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2047)
at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1174)
at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:445)
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2149)
at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:277
(Incompatible data region configuration [region=DEFAULT, locNodeId=8c2b6d02-01b3-4c22-8ad2-67c0c5f9ec4e, isPersistenceEnabled=true, rmtNodeId=4011d970-ae2d-4116-bb85-4311195c88a8, isPersistenceEnabled=false])
According to the error above, your nodes have different persistence configurations for the DEFAULT data region: the local node has persistence enabled and the remote node does not. Nodes with mismatched settings will not join each other.
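
As a rough illustration of how to read that message, the sketch below pulls the two node IDs and their persistence flags out of the error text. The regular expression and function name are assumptions written against the exact message format quoted above; other Ignite versions may word it differently.

```python
import re

# Pattern written against the message format quoted in the question (assumption).
JOIN_ERROR = re.compile(
    r"Incompatible data region configuration \[region=(?P<region>[^,]+), "
    r"locNodeId=(?P<loc_id>[^,]+), isPersistenceEnabled=(?P<loc_persist>true|false), "
    r"rmtNodeId=(?P<rmt_id>[^,]+), isPersistenceEnabled=(?P<rmt_persist>true|false)\]"
)


def parse_join_error(message):
    """Return the region plus (node id, persistence flag) pairs, or None if no match."""
    match = JOIN_ERROR.search(message)
    if match is None:
        return None
    return {
        "region": match["region"],
        "local": (match["loc_id"], match["loc_persist"] == "true"),
        "remote": (match["rmt_id"], match["rmt_persist"] == "true"),
    }


if __name__ == "__main__":
    msg = (
        "Failed to join node (Incompatible data region configuration "
        "[region=DEFAULT, locNodeId=8c2b6d02-01b3-4c22-8ad2-67c0c5f9ec4e, "
        "isPersistenceEnabled=true, rmtNodeId=4011d970-ae2d-4116-bb85-4311195c88a8, "
        "isPersistenceEnabled=false])"
    )
    print(parse_join_error(msg))
```

The node whose flag differs from the rest of the cluster is the one whose configuration needs to be aligned.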

WildFly - migrate WildFly from 8.2.1 to 21.0.0

I am using the WildFly server migration tool to migrate my old WildFly (8.2.1) to the latest version (21.0.0), but I run into the issue below.
It happens while migrating standalone.xml.
UPDATE
I took the error from wildfly-server-migration-master/dist/standalone/target/jboss-server-migration/logs/migration.log, which is more readable.
The error seems to be in the jgroups part. Any hint on how to fix it?
ERROR [management-operation] WFLYCTL0013: Operation ("add") failed - address: ([
("subsystem" => "jgroups"),
("stack" => "tcp"),
("transport" => "TCP")
]) - failure description: "WFLYCTL0155: 'socket-binding' may not be null"
2020-11-03 02:32:11,219 FATAL [server] WFLYSRV0056: Server boot has failed in an unrecoverable manner; exiting. See previous messages for details.
2020-11-03 02:32:11,222 INFO [as] WFLYSRV0050: WildFly Full 21.0.0.Final (WildFly Core 13.0.1.Final) stopped in 2ms
2020-11-03 02:32:11,225 ERROR [logger] Migration failed: org.jboss.migration.core.ServerMigrationFailureException: java.lang.IllegalStateException: WFLYEMB0022: Cannot invoke 'start' on embedded process
at org.jboss.migration.core.task.TaskExecutionImpl.run(TaskExecutionImpl.java:174) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskExecutionImpl.execute(TaskExecutionImpl.java:159) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskContextImpl.execute(TaskContextImpl.java:68) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskContextImpl.execute(TaskContextImpl.java:32) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.wfly10.config.task.ServerConfigurationsMigration$Task.migrateConfig(ServerConfigurationsMigration.java:151) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.wfly10.config.task.ServerConfigurationsMigration$Task.migrateAllConfigs(ServerConfigurationsMigration.java:120) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.wfly10.config.task.ServerConfigurationsMigration$Task.run(ServerConfigurationsMigration.java:105) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskExecutionImpl.run(TaskExecutionImpl.java:169) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskExecutionImpl.execute(TaskExecutionImpl.java:159) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskContextImpl.execute(TaskContextImpl.java:68) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskContextImpl.execute(TaskContextImpl.java:32) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.wfly10.config.task.StandaloneServerMigration$1.run(StandaloneServerMigration.java:61) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.console.UserConfirmationServerMigrationTask.runTask(UserConfirmationServerMigrationTask.java:58) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.console.UserConfirmationServerMigrationTask.confirmTaskRun(UserConfirmationServerMigrationTask.java:50) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.console.UserConfirmationServerMigrationTask.run(UserConfirmationServerMigrationTask.java:63) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.env.SkippableByEnvServerMigrationTask.run(SkippableByEnvServerMigrationTask.java:47) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskExecutionImpl.run(TaskExecutionImpl.java:169) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskExecutionImpl.execute(TaskExecutionImpl.java:159) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskContextImpl.execute(TaskContextImpl.java:68) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskContextImpl.execute(TaskContextImpl.java:32) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.wfly10.config.task.ServerMigration.run(ServerMigration.java:45) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.wfly10.WildFlyServer10.migrate(WildFlyServer10.java:40) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.ServerMigration$1.run(ServerMigration.java:153) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskExecutionImpl.run(TaskExecutionImpl.java:169) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.ServerMigration.run(ServerMigration.java:160) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.cli.CommandLineServerMigration.main(CommandLineServerMigration.java:131) [jboss-server-migration-cli-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
Caused by: java.lang.IllegalStateException: WFLYEMB0022: Cannot invoke 'start' on embedded process
at org.wildfly.core.embedded.EmbeddedManagedProcessImpl.invokeOnServer(EmbeddedManagedProcessImpl.java:100) [wildfly-embedded-11.1.1.Final.jar:11.1.1.Final]
at org.wildfly.core.embedded.EmbeddedManagedProcessImpl.start(EmbeddedManagedProcessImpl.java:58) [wildfly-embedded-11.1.1.Final.jar:11.1.1.Final]
at org.jboss.migration.wfly10.config.management.impl.EmbeddedStandaloneServerConfiguration.startConfiguration(EmbeddedStandaloneServerConfiguration.java:89) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.wfly10.config.management.impl.AbstractManageableServerConfiguration.start(AbstractManageableServerConfiguration.java:70) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.wfly10.config.task.ServerConfigurationMigration$1.run(ServerConfigurationMigration.java:96) [jboss-server-migration-wildfly10.0-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
at org.jboss.migration.core.task.TaskExecutionImpl.run(TaskExecutionImpl.java:169) [jboss-server-migration-core-1.10.0.Final-SNAPSHOT.jar:1.10.0.Final-SNAPSHOT]
... 25 more
Caused by: org.wildfly.core.embedded.EmbeddedProcessStartException: WFLYEMB0021: Cannot start embedded process
at org.wildfly.core.embedded.EmbeddedStandaloneServerFactory$StandaloneServerImpl.start(EmbeddedStandaloneServerFactory.java:324) [wildfly-embedded-11.1.1.Final.jar:11.1.1.Final]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_221]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_221]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_221]
at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_221]
at org.wildfly.core.embedded.EmbeddedManagedProcessImpl.invokeOnServer(EmbeddedManagedProcessImpl.java:88) [wildfly-embedded-11.1.1.Final.jar:11.1.1.Final]
... 30 more
Caused by: java.util.concurrent.ExecutionException: JBTHR00005: Operation failed
at org.jboss.threads.AsyncFutureTask.get(AsyncFutureTask.java:253) [jboss-threads-2.3.3.Final.jar:2.3.3.Final]
at org.wildfly.core.embedded.EmbeddedStandaloneServerFactory$StandaloneServerImpl.start(EmbeddedStandaloneServerFactory.java:305) [wildfly-embedded-11.1.1.Final.jar:11.1.1.Final]
... 35 more
Caused by: java.lang.Exception: WFLYSRV0056: Server boot has failed in an unrecoverable manner; exiting. See previous messages for details.
at org.jboss.as.server.BootstrapListener.bootFailure(BootstrapListener.java:87)
at org.jboss.as.server.ServerService.boot(ServerService.java:426)
at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:416) [wildfly-controller-11.1.1.Final.jar:11.1.1.Final]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_221]
Your jgroups configuration is expected to define a socket-binding on the TCP transport, which sets the address/port for connections; see https://docs.wildfly.org/21/wildscribe/subsystem/jgroups/stack/transport/TCP/index.html.
Sorry, I don't have the model for WildFly 8.2.1 to 'see' which configuration you should change.
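
One way to locate the offending entry before re-running the migration is to scan the source configuration for jgroups transport elements that lack a socket-binding attribute. This is only a sketch: the function name is hypothetical, and the sample XML (namespace, stack names, binding name) is an illustrative assumption, not the asker's actual standalone.xml.

```python
import xml.etree.ElementTree as ET


def transports_missing_socket_binding(xml_text):
    """Return the 'type' of every <transport> element without a socket-binding."""
    root = ET.fromstring(xml_text)
    missing = []
    for elem in root.iter():
        # Strip the "{namespace}" prefix so the check works for any schema version.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "transport" and not elem.get("socket-binding"):
            missing.append(elem.get("type", "<unknown>"))
    return missing


# Hypothetical fragment for illustration only.
SAMPLE = """
<subsystem xmlns="urn:jboss:domain:jgroups:2.0" default-stack="udp">
    <stack name="udp">
        <transport type="UDP" socket-binding="jgroups-udp"/>
    </stack>
    <stack name="tcp">
        <transport type="TCP"/>
    </stack>
</subsystem>
"""

if __name__ == "__main__":
    print(transports_missing_socket_binding(SAMPLE))  # ['TCP']
```

To check a real file, read it and pass its contents to the function; any transport it reports is a candidate for the WFLYCTL0155 failure shown in the log.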

Flink submit task failed

I am using Flink 1.6.1 and Hadoop 2.7.5. First I start a Flink YARN session:
bin/yarn-session.sh -n 2 -jm 1024 -tm 1024 -d
Then I submit a task:
./bin/flink run ./examples/batch/WordCount.jar -input hdfs://CS-201:9000/LICENSE -output hdfs://CS-201:9000/wordcount-result.txt
I got an error:
[root@CS-201 flink-1.6.1]# ./bin/flink run ./examples/batch/WordCount.jar -input hdfs://CS-201:9000/LICENSE -output hdfs://CS-201:9000/wordcount-result.txt
2019-05-19 15:31:11,357 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-root.
2019-05-19 15:31:11,357 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-root.
2019-05-19 15:31:11,737 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 2
2019-05-19 15:31:11,737 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 2
YARN properties set default parallelism to 2
2019-05-19 15:31:11,777 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at CS-201/192.168.1.201:8032
2019-05-19 15:31:11,887 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2019-05-19 15:31:11,887 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2019-05-19 15:31:11,891 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set.The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
2019-05-19 15:31:11,979 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Found application JobManager host name 'cs-202' and port '52389' from supplied application id 'application_1558248666499_0003'
Starting execution of program
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 471f0c2d047aba74ea621c5bfe782cbf)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:260)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:474)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
at org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:85)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:426)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
... 12 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
... 10 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:953)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 4 more
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:310)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:294)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
... 5 more
Why does this happen, and how can I fix it?

Apache flink - Timeout after submitting job on hadoop / yarn cluster

I am trying to upgrade our job from Flink 1.4.2 to 1.7.1, but I keep running into timeouts after submitting the job. The Flink job runs on our Hadoop cluster (version 2.7) with YARN.
I've seen the following behavior:
With the same flink-conf.yaml that we used in 1.4.2, versions 1.5.6, 1.6.3 and 1.7.1 all time out, while 1.4.2 works.
Using 1.5.6 with "mode: legacy" (to switch off FLIP-6) works.
Using 1.7.1 with "mode: legacy" times out (I assume this option was removed but the documentation is outdated? https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#legacy)
When the timeout happens I get the following stack trace:
INFO class java.time.Instant does not contain a getter for field seconds
INFO class com.bol.fin_hdp.cm1.domain.Cm1Transportable does not contain a getter for field globalId
INFO Submitting job 5af931bcef395a78b5af2b97e92dcffe (detached: false).
INFO ------------------------------------------------------------
INFO The program finished with the following exception:
INFO org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
INFO at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:545)
INFO at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
INFO at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
INFO at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:798)
INFO at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:289)
INFO at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
INFO at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1035)
INFO at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1111)
INFO at java.security.AccessController.doPrivileged(Native Method)
INFO at javax.security.auth.Subject.doAs(Subject.java:422)
INFO at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
INFO at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
INFO at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1111)
INFO Caused by: java.lang.RuntimeException: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:43)
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJobWithConfig(IntervalJobStarter.java:32)
INFO at com.bol.fin_hdp.Main.main(Main.java:8)
INFO at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
INFO at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO at java.lang.reflect.Method.invoke(Method.java:498)
INFO at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
INFO ... 12 more
INFO Caused by: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
INFO at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:258)
INFO at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
INFO at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
INFO at com.bol.fin_hdp.cm1.job.Job.execute(Job.java:54)
INFO at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:41)
INFO ... 19 more
INFO Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
INFO at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:371)
INFO at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
INFO at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
INFO at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
INFO at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
INFO at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:216)
INFO at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
INFO at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
INFO at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
INFO at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
INFO at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:301)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
INFO at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
INFO at java.lang.Thread.run(Thread.java:748)
INFO Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
INFO at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
INFO ... 17 more
INFO Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-01...
INFO at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
INFO at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
INFO at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
INFO at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
INFO ... 15 more
INFO Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-017.example.com/some.ip.address:46500
INFO at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
INFO ... 7 more
What changed in FLIP-6 that might cause this behavior, and how can I fix it?
For our jobs on YARN with Flink 1.6, we had to raise the web.timeout setting by passing -yD web.timeout=100000.
In our case, there was a firewall between the machine submitting the job and our Hadoop cluster.
Newer Flink versions (1.7 and up) submit jobs over REST. On YARN setups the port for this REST service is chosen at random, and at first it could not be configured.
Flink 1.8.0 introduced a config option to set this to a port or port range using:
rest.bind-port: 55520-55530
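
Once the REST port is pinned to a range, it can help to confirm from the submitting machine that the firewall actually lets those ports through. The following is a minimal sketch, not part of Flink: the class name PortProbe, its helper methods, and the 2000 ms timeout are assumptions made for illustration. It parses a range in the same `start-end` form as rest.bind-port and tries a plain TCP connect to each port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not a Flink class): probes a host over a port range.
class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, timeouts, and unresolvable hosts.
            return false;
        }
    }

    /** Parses "55520-55530" or a single port like "55520" into {start, end}. */
    static int[] parseRange(String range) {
        String[] parts = range.trim().split("-");
        int start = Integer.parseInt(parts[0].trim());
        int end = parts.length > 1 ? Integer.parseInt(parts[1].trim()) : start;
        if (start > end) {
            throw new IllegalArgumentException("Invalid port range: " + range);
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        // Arguments: <host> <port-range>, e.g. the YARN host and 55520-55530.
        if (args.length < 2) {
            System.err.println("Usage: java PortProbe <host> <port-or-range>");
            return;
        }
        int[] range = parseRange(args[1]);
        for (int port = range[0]; port <= range[1]; port++) {
            boolean open = canConnect(args[0], port, 2000);
            System.out.println(args[0] + ":" + port + (open ? " reachable" : " not reachable"));
        }
    }
}
```

A port reported as not reachable is either blocked or simply has nothing listening on it, so the probe is only meaningful while the Flink application is running on the target host.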

Apache Ignite nodes cannot communicate

I've configured Apache Ignite 1.8.0 programmatically and can start a server with a single node, but when another node joins, they cannot communicate and I receive many of the following two messages in the logs. These continue until the other node is stopped.
ERROR 12:52:39,187-0800 [*Initialization*] util.nio.GridDirectParser: Failed to read message [msg=null, buf=java.nio.DirectByteBuffer[pos=5 lim=420 cap=32768], reader=null, ses=GridSelectorNioSessionImpl [selectorIdx=0, queueSize=1, writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=5 lim=420 cap=32768], recovery=null, super=GridNioSessionImpl [locAddr=/10.97.184.106:5702, rmtAddr=/10.97.189.92:58788, createTime=1484945559174, closeTime=0, bytesSent=0, bytesRcvd=420, sndSchedTime=1484945559174, lastSndTime=1484945559174, lastRcvTime=1484945559185, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser#21e93eaf, directMode=true], GridConnectionBytesVerifyFilter], accepted=true]]]
class org.apache.ignite.IgniteException: Invalid message type: -84
at org.apache.ignite.internal.managers.communication.GridIoMessageFactory.create(GridIoMessageFactory.java:805)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$5.create(TcpCommunicationSpi.java:1631)
at org.apache.ignite.internal.util.nio.GridDirectParser.decode(GridDirectParser.java:76)
at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:104)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107)
at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onMessageReceived(GridConnectionBytesVerifyFilter.java:113)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107)
at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:2332)
at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:173)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:918)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:1583)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1516)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1289)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:745)
WARN 12:52:39,188-0800 [*Initialization*] communication.tcp.TcpCommunicationSpi: Failed to process selector key (will close): GridSelectorNioSessionImpl [selectorIdx=0, queueSize=1, writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=5 lim=420 cap=32768], recovery=null, super=GridNioSessionImpl [locAddr=/10.97.184.106:5702, rmtAddr=/10.97.189.92:58788, createTime=1484945559174, closeTime=0, bytesSent=0, bytesRcvd=420, sndSchedTime=1484945559174, lastSndTime=1484945559174, lastRcvTime=1484945559185, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser#21e93eaf, directMode=true], GridConnectionBytesVerifyFilter], accepted=true]]
ERROR 12:52:39,189-0800 [*Initialization*] communication.tcp.TcpCommunicationSpi: Closing NIO session because of unhandled exception.
class org.apache.ignite.internal.util.nio.GridNioException: Invalid message type: -84
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:1595)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1516)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1289)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:745)
Caused by: class org.apache.ignite.IgniteException: Invalid message type: -84
at org.apache.ignite.internal.managers.communication.GridIoMessageFactory.create(GridIoMessageFactory.java:805)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$5.create(TcpCommunicationSpi.java:1631)
at org.apache.ignite.internal.util.nio.GridDirectParser.decode(GridDirectParser.java:76)
at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:104)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107)
at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onMessageReceived(GridConnectionBytesVerifyFilter.java:113)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107)
at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:2332)
at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:173)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:918)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:1583)
... 4 more
Version information.
>>> +----------------------------------------------------------------------+
>>> Ignite ver. 1.8.0#20161205-sha1:9ca40dbeb7d559fcb299bdb6f5c90cdf8ce7e533
>>> +----------------------------------------------------------------------+
>>> OS name: Windows Server 2012 R2 6.3 amd64
>>> CPU(s): 2
>>> Heap: 3.6GB
>>> VM name: 13752#host
>>> Grid name: T-XXX
>>> Local node [ID=983EC5A0-2D9A-40C9-B4C3-3D59739BDDB9, order=1, clientMode=false]
>>> Local node addresses: [hostname.example.com/0:0:0:0:0:0:0:1, /10.97.184.106, /127.0.0.1]
>>> Local ports: TCP:5702 TCP:5703 TCP:5705
While researching similar issues, I found a recommendation to disable the shared memory feature (setSharedMemoryPort -1) as a first step in troubleshooting a problem like this.
The server is running on Windows, and the other server joining the cache is on OS X.
INFO 12:50:17,569-0800 [*Initialization*] ignite.internal.IgniteKernal%T-XXX: OS: Windows Server 2012 R2 6.3 amd64
How do I prevent these errors? Have I configured the cluster poorly or is there an incompatibility between the two machines I am using?
This is very likely a misconfiguration. It can happen when the discovery SPI on one node tries to connect to the communication SPI port on another node. See this post: http://apache-ignite-users.70518.x6.nabble.com/Invalid-message-type-84-error-td9869.html
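
The specific value -84 fits that explanation. A standard Java serialization stream always begins with the magic bytes 0xAC 0xED, and 0xAC read as a signed byte is -84. So, assuming discovery messages are sent as Java-serialized data (an assumption about Ignite internals, not something shown in the logs above), a discovery connection that lands on a communication port would make the communication parser see -84 where it expects a message type. The small sketch below uses only the JDK to show where the number comes from; the class name is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Illustrative only: shows the first byte of any Java serialization stream.
class SerializationHeaderDemo {

    /** Serializes a dummy object and returns the first byte of the stream. */
    static byte firstStreamByte() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject("any payload");
        }
        return buf.toByteArray()[0];
    }

    public static void main(String[] args) throws IOException {
        byte first = firstStreamByte();
        // Prints: first byte = -84
        System.out.println("first byte = " + first);
        // Prints: unsigned hex = 0xAC
        System.out.println("unsigned hex = 0x" + Integer.toHexString(first & 0xFF).toUpperCase());
    }
}
```

In the logs above, the failing session was accepted on local port 5702, so one thing worth checking is whether the discovery address list configured on the joining node points at that port rather than at the first node's discovery port.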