Unexpected "Internal error" exception when using spring-cloud-gcp-pubsub 3.2.1 - spring-webflux

I'm using PubSubReactiveFactory fromspring-cloud-gcp-pubsub onOpenJDK 11 Debian Linux and I've observed the following exception in our application:
com.google.api.gax.rpc.InternalException: io.grpc.StatusRuntimeException: INTERNAL: http2 exception
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:110)
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:67)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:572)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:542)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:535)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.grpc.StatusRuntimeException: INTERNAL: http2 exception
at io.grpc.Status.asRuntimeException(Status.java:535)
... 14 common frames omitted
Caused by: io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:172)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:357)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1007)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams$2.process(DefaultHttp2Connection.java:968)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.decrementPendingIterations(DefaultHttp2Connection.java:1029)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.forEachActiveStream(DefaultHttp2Connection.java:984)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection.forEachActiveStream(DefaultHttp2Connection.java:209)
at io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler.goingAway(NettyClientHandler.java:839)
at io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler.access$200(NettyClientHandler.java:91)
at io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler$2.onGoAwayReceived(NettyClientHandler.java:278)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2Connection.goAwayReceived(DefaultHttp2Connection.java:237)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.onGoAwayRead0(DefaultHttp2ConnectionDecoder.java:217)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onGoAwayRead(DefaultHttp2ConnectionDecoder.java:583)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onGoAwayRead(Http2InboundFrameLogger.java:119)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readGoAwayFrame(DefaultHttp2FrameReader.java:580)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:271)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378)
at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234)
at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283)
at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
The "Internal" gRPC status is propagated to application code and force us to retry pull operation polluting logs with errors/warnings along the way.
To add more context this is happening when Pub/Sub PullRequest API takes long time (I'm observing p99 20-30 seconds latency when this happens). Netty closes the connection with "Stream closed before write could take place" in DefaultHttp2RemoteFlowController.java:481 and status io.netty.handler.codec.http2.Http2Error.STREAM_CLOSED(0x5) Then this status code is translated to io.grpc.internal.Http2Error.INTERNAL and propagated up the stack.
Has anybody experience this error and come up with a way to gracefully handle it?

Related

Issues while running with BigQueryIO.Write.Method.STORAGE_WRITE_API

We are testing with STORAGE_WRITE_API to insert data into BigQuery. We've seen several errors/warnings in our Dataflow pipeline(written in Java). It might work well in the beginning, but eventually the system lag would be increasing, it would stop processing any data from PubSub and the unacked messages piled up.
One common warning is:
Operation ongoing in step insertTableRowsToBigQuery/StorageApiLoads/StorageApiWriteSharded/Write Records for at least 03h35m00s without outputting or completing in state process
at java.base#11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
at java.base#11.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
at java.base#11.0.9/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Callback.await(RetryManager.java:153)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.await(RetryManager.java:136)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager.await(RetryManager.java:256)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:248)
at app//org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.process(StorageApiWritesShardedRecords.java:453)
at app//org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
Other exceptions we've seen:
Got error io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Stream is closed
Got error io.grpc.StatusRuntimeException: ALREADY_EXIST
PodSandboxStatus of sandbox "..." for pod "df-...-pipeline-...-harness-qw4j_default(...)" error: rpc error: code = Unknown desc = Error: No such container
Code sample:
toBq.apply("insertTableRowsToBigQuery",
BigQueryIO
.writeTableRows()
.to(String.format("%s:%s.%s", PROJECT_ID, DATASET, table))
.withTriggeringFrequency(Duration.standardSeconds(options.getTriggeringFrequency()))
.withNumStorageWriteApiStreams(options.getNumStorageWriteApiStreams())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
There was a production issue related to connection being stuck after streaming 10MB which has been fixed. If you try again, it should work.

How to handle reactor-netty terminating a request in spring-webflux?

We have a spring-webflux application running on spring-boot-starter-webflux:jar:2.1.5.RELEASE and reactor-netty:jar:0.8.8.RELEASE. When a reactive client goes away (k8s pod killed or the client subscription is disposed off) before the request is completed, we see the server simply stops processing the request and no more application logs related to that request are seen. However, reactive-netty prints a trace log indicating that the channel is inactive and will be terminated.
Is there anything we can do to handle this termination gracefully? We would ideally like to respond to a cancellation signal from reactor.
2020-03-18T16:46:40.372Z TRACE --- [reactor-http-epoll-2] r.n.c.ChannelOperations : [id: 0x4cad974b, L:/<SERVER_IP_ADDRESS>:8080 ! R:/<SOME_OTHER_IP_ADDRESS>:55436] Disposing ChannelOperation from a channel
java.lang.Exception: ChannelOperation terminal stack
at reactor.netty.channel.ChannelOperations.terminate(ChannelOperations.java:391)
at reactor.netty.channel.ChannelOperations.onInboundClose(ChannelOperations.java:360)
at reactor.netty.channel.ChannelOperationsHandler.channelInactive(ChannelOperationsHandler.java:72)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:420)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:223)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:331)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Thread.java:834)
The doOnCancel operator allows for a callback to be provided when a subscription is cancelled. This is different from the doOnEach operator which is run only when the subscription completes or errors out.
https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Mono.html#doOnCancel-java.lang.Runnable-

repast.simphony.ui.GUIScheduleRunner error message

I'm a new user in RePast learning to run the mesoFON model. I get this error message. What is the problem?
I'm using Eclipse IDE 2018-09.
FATAL [Thread-5] 11:34:18,767 repast.simphony.ui.GUIScheduleRunner -
RunTimeException when running the schedule
Current tick (1.0)
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at repast.simphony.engine.schedule.DynamicTargetAction.execute(DynamicTargetAction.java:72)
at repast.simphony.engine.schedule.DefaultAction.execute(DefaultAction.java:38)
at repast.simphony.engine.schedule.ScheduleGroup.executeList(ScheduleGroup.java:205)
at repast.simphony.engine.schedule.ScheduleGroup.execute(ScheduleGroup.java:231)
at repast.simphony.engine.schedule.Schedule.execute(Schedule.java:352)
at repast.simphony.ui.GUIScheduleRunner$ScheduleLoopRunnable.run(GUIScheduleRunner.java:52)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.reflect.InvocationTargetException
at meso_FON.application.Environment$$FastClassByCGLIB$$fd509841.invoke(<generated>)
at net.sf.cglib.reflect.FastMethod.invoke(FastMethod.java:53)
at repast.simphony.engine.schedule.DynamicTargetAction.execute(DynamicTargetAction.java:69)
... 6 more
Caused by: java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.base/java.util.TimSort.mergeLo(TimSort.java:781)
at java.base/java.util.TimSort.mergeAt(TimSort.java:518)
at java.base/java.util.TimSort.mergeCollapse(TimSort.java:448)
at java.base/java.util.TimSort.sort(TimSort.java:245)
at java.base/java.util.Arrays.sort(Arrays.java:1515)
at java.base/java.util.ArrayList.sort(ArrayList.java:1749)
at java.base/java.util.Collections.sort(Collections.java:177)
at org.khelekore.prtree.MinMaxNodeGetter.<init>(MinMaxNodeGetter.java:29)
at org.khelekore.prtree.LeafBuilder.getMM(LeafBuilder.java:69)
at org.khelekore.prtree.LeafBuilder.buildLeafs(LeafBuilder.java:34)
at org.khelekore.prtree.PRTree.load(PRTree.java:65)
at meso_FON.application.Environment.getPRTree(Environment.java:423)
at meso_FON.application.Environment.queryPRTree(Environment.java:234)
... 9 more
It appears that there is an issue with a mesoFOM specific-method call. I'd suggest reaching out to the mesoFOM model developers directly to see if they can help.

Jetty server closes the stream generating an 500 error

When constructing a http response using jetty-9.4.6 I get the following exception. In my particular case I'm constructing the message from camel but this doesn't impact the behavior.
org.apache.cxf.interceptor.Fault: Could not write to XMLStreamWriter.
at org.apache.cxf.interceptor.StaxOutEndingInterceptor.handleMessage(StaxOutEndingInterceptor.java:75)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.interceptor.AbstractFaultChainInitiatorObserver.onMessage(AbstractFaultChainInitiatorObserver.java:112)
at org.apache.cxf.phase.PhaseInterceptorChain.wrapExceptionAsFault(PhaseInterceptorChain.java:366)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:324)
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.phase.PhaseInterceptorChain.resume(PhaseInterceptorChain.java:278)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:78)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:170)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:193)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handleAsync(Server.java:609)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:334)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:673)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:591)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.ctc.wstx.exc.WstxIOException: Closed
at com.ctc.wstx.sw.BaseStreamWriter._finishDocument(BaseStreamWriter.java:1421)
at com.ctc.wstx.sw.BaseStreamWriter.writeEndDocument(BaseStreamWriter.java:532)
at org.apache.cxf.interceptor.StaxOutEndingInterceptor.handleMessage(StaxOutEndingInterceptor.java:56)
... 32 more
Caused by: org.eclipse.jetty.io.EofException: Closed
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:476)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination$JettyOutputStream.write(JettyHTTPDestination.java:322)
at org.apache.cxf.io.AbstractWrappedOutputStream.write(AbstractWrappedOutputStream.java:51)
at com.ctc.wstx.io.UTF8Writer.flush(UTF8Writer.java:100)
at com.ctc.wstx.sw.BufferingXmlWriter.flush(BufferingXmlWriter.java:241)
at com.ctc.wstx.sw.BufferingXmlWriter.close(BufferingXmlWriter.java:214)
at com.ctc.wstx.sw.BaseStreamWriter._finishDocument(BaseStreamWriter.java:1419)
... 34 more
If the headers are beyond a certain size the exception is thrown. See Maximum on http header values? .
The exception doesn't suggest the reason for closing.

JobTracker - High memory and native thread usage

We are running hadoop on GCE with HDFS default file system, and data input/output from/to GCS.
Hadoop version: 1.2.1
Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1
Observed behavior: JT will accumulate threads in waiting state, leading to OOM:
2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444)
at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860)
at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706)
at org.apache.hadoop.mapred.JobTracker.initJob(Jobenter code hereTracker.java:3890)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
After looking through the JT logs I found these warnings:
2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010
java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
at org.apache.hadoop.ipc.Client.call(Client.java:1118)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392)
at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)
Caused by: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249)
at org.apache.hadoop.ipc.Client.call(Client.java:1093)
... 9 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635)
... 12 more
This appears to be similar to hadoop bug reporter here: https://issues.apache.org/jira/browse/MAPREDUCE-5606
I tried proposed solution by disabling saving job logs into the output path and it solved the problem at the expense of missing logs :)
I also ran jstack on JT and it showed hundreds of WAITING or TIMED_WAITING threads as such:
pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:327)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:378)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181)
at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentReque
st(MediaHttpUploader.java:629)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.
java:409)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(Abstr
actGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(Abstr
actGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogl
eClientRequest.java:460)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGo
ogleAsyncWriteChannel.java:354)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker)
It appears JT is having hard time keeping up communicating with GCS via GCS Connector.
Please advise,
Thank you
At the moment, every open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it's closed, because a separate thread needs to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases, (such as in individual Hadoop tasks), there's only ever one long-lived output stream, and possibly a few shorter-lived ones for writing small metadata/marker files, etc.
In general, there are two possible causes for the OOM you're running into:
You have lots of queued up jobs; every submitted job holds an unclosed OutputStream, and thus consumes a "waiting" thread. However, since you mention you only need to queue up ~10 jobs, this shouldn't be the root cause.
Something is causing a "leak" of the PrintWriter objects, originally created in logSubmitted and added to fileManager. Typically, terminal events (like logFinished will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory they may be bugs here or there which can cause one of the OutputStreams to leak without being close()'d. For example, while I haven't had a chance to verify this assertion, it seems that IOException trying to do something like logMetaInfo will "removeWriter" without closing it.
I've verified that at least under normal circumstances, the OutputStream seem to get closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs.
TL;DR: There are some working theories as to why some resource may leak and ultimately prevent necessary threads from being created. You should consider changing hadoop.job.history.user.location to some HDFS location in the meantime, as a way to preserve the job logs in the absence of placing them on GCS.