Infinispan Cache Indexing - org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper\write.lock - indexing

I am using Infinispan 8.2.2 in client-server mode (2 nodes, 1 cluster) with a replicated cache and the indexing configuration below. Data is getting indexed as well. I use a .proto file to map my object and index some of its fields. Whenever I write to the cache, the exception below occurs for each row written. Could you please help?
Configuration
<replicated-cache name="tickets-cache" mode="ASYNC" start="EAGER">
    <indexing index="ALL">
        <property name="default.indexmanager">near-real-time</property>
        <property name="lucene_version">LUCENE_CURRENT</property>
        <property name="default.chunk_size">128000</property>
    </indexing>
    <locking acquire-timeout="60000"/>
    <expiration lifespan="43200000"/>
</replicated-cache>
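For reference, the same cache definition can also be expressed programmatically; below is a minimal sketch assuming Infinispan 8.x's embedded ConfigurationBuilder API (the class name is illustrative, and the property values simply mirror the XML above):

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.cache.Index;

public class TicketsCacheConfig {
    public static Configuration build() {
        // Mirrors the <replicated-cache> definition above (embedded API, Infinispan 8.x).
        return new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.REPL_ASYNC)
            .indexing().index(Index.ALL)
                .addProperty("default.indexmanager", "near-real-time")
                .addProperty("lucene_version", "LUCENE_CURRENT")
                .addProperty("default.chunk_size", "128000")
            .locking().lockAcquisitionTimeout(60_000)
            .expiration().lifespan(43_200_000)
            .build();
    }
}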
Exception
2016-08-04 20:59:08,318 ERROR [org.hibernate.search.exception.impl.LogErrorHandler] (Hibernate Search sync consumer thread for index org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper) HSEARCH000058: Exception occurred org.apache.lucene.store.LockObtainFailedException: Lock held by another program: D:\WorkSpace\Softwares\infinispan-server-8.2.2.Final\org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper\write.lock
Primary Failure:
Entity org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper Id A:SgRCQUcx Work Type org.hibernate.search.backend.UpdateLuceneWork
: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: D:\WorkSpace\Softwares\infinispan-server-8.2.2.Final\org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper\write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:118)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:776)
at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.createNewIndexWriter(IndexWriterHolder.java:123)
at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.getIndexWriter(IndexWriterHolder.java:89)
at org.hibernate.search.backend.impl.lucene.AbstractWorkspaceImpl.getIndexWriter(AbstractWorkspaceImpl.java:117)
at org.hibernate.search.backend.impl.lucene.AbstractWorkspaceImpl.getIndexWriterDelegate(AbstractWorkspaceImpl.java:203)
at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.applyUpdates(LuceneBackendQueueTask.java:80)
at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.run(LuceneBackendQueueTask.java:46)
at org.hibernate.search.backend.impl.lucene.SyncWorkProcessor$Consumer.applyChangesets(SyncWorkProcessor.java:162)
at org.hibernate.search.backend.impl.lucene.SyncWorkProcessor$Consumer.run(SyncWorkProcessor.java:148)
at java.lang.Thread.run(Thread.java:745)
2016-08-04 20:59:08,320 ERROR [org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask] (Hibernate Search sync consumer thread for index org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper) HSEARCH000072: Couldn't open the IndexWriter because of previous error: operation skipped, index ouf of sync!
2016-08-04 20:59:55,255 INFO [org.jboss.as.clustering.infinispan] (remote-thread--p2-t13) DGISPN0001: Started ___event_log_cache cache from clustered container
2016-08-04 20:59:55,403 INFO [org.jboss.as.clustering.infinispan] (remote-thread--p2-t13) DGISPN0001: Started ___query_cache cache from clustered container
2016-08-04 21:00:12,036 ERROR [org.hibernate.search.exception.impl.LogErrorHandler] (Hibernate Search sync consumer thread for index org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper) HSEARCH000058: Exception occurred org.apache.lucene.store.LockObtainFailedException: Lock held by another program: D:\WorkSpace\Softwares\infinispan-server-8.2.2.Final\org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper\write.lock
Primary Failure:
Entity org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper Id A:SgRCQUcw Work Type org.hibernate.search.backend.UpdateLuceneWork
: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: D:\WorkSpace\Softwares\infinispan-server-8.2.2.Final\org.infinispan.query.remote.impl.indexing.ProtobufValueWrapper\write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:118)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:776)
at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.createNewIndexWriter(IndexWriterHolder.java:123)
at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.getIndexWriter(IndexWriterHolder.java:89)
at org.hibernate.search.backend.impl.lucene.AbstractWorkspaceImpl.getIndexWriter(AbstractWorkspaceImpl.java:117)
at org.hibernate.search.backend.impl.lucene.AbstractWorkspaceImpl.getIndexWriterDelegate(AbstractWorkspaceImpl.java:203)
at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.applyUpdates(LuceneBackendQueueTask.java:80)
at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.run(LuceneBackendQueueTask.java:46)
at org.hibernate.search.backend.impl.lucene.SyncWorkProcessor$Consumer.applyChangesets(SyncWorkProcessor.java:162)
at org.hibernate.search.backend.impl.lucene.SyncWorkProcessor$Consumer.run(SyncWorkProcessor.java:148)
at java.lang.Thread.run(Thread.java:745)

Related

Cache partition not replicated

I have 2 nodes with persistence enabled. I create a cache like so:
// all the queues across the frontier instances
CacheConfiguration cacheCfg2 = new CacheConfiguration("queues");
cacheCfg2.setBackups(backups);
cacheCfg2.setCacheMode(CacheMode.PARTITIONED);
globalQueueCache = ignite.getOrCreateCache(cacheCfg2);
where backups is a value > 1
When one of the nodes dies, I get
Exception in thread "Thread-2" javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=queues, part=2]
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryAdapter.executeScanQuery(GridCacheQueryAdapter.java:597)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl$1.applyx(IgniteCacheProxyImpl.java:519)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl$1.applyx(IgniteCacheProxyImpl.java:517)
at org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
at org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:3482)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:516)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:843)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.query(GatewayProtectedCacheProxy.java:418)
at crawlercommons.urlfrontier.service.ignite.IgniteService$QueueCheck.run(IgniteService.java:270)
Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=queues, part=2]
... 9 more
I expected the content to have been replicated onto the other node. Why isn't that the case?
Most likely there is a misconfiguration somewhere. Check the following (a configuration sketch follows the list):
you are not working with a pre-existing cache (replace getOrCreateCache with createCache so the new settings are actually applied)
the backup factor is not larger than the number of server nodes available to hold the backups
inspect the logs for a "Detected lost partitions" message and what happened prior to it
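As a rough illustration of the first two checks, here is a sketch of creating the cache with explicit backups and a strict partition-loss policy (names and values mirror the question; resetLostPartitions is one way to recover once the topology is healthy again):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class QueueCacheSetup {
    public static IgniteCache<String, Object> create(Ignite ignite, int backups) {
        CacheConfiguration<String, Object> cfg = new CacheConfiguration<>("queues");
        cfg.setCacheMode(CacheMode.PARTITIONED);
        cfg.setBackups(backups);                                          // >= 1 so losing one node keeps all partitions
        cfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);  // fail loudly instead of serving partial data
        // createCache (not getOrCreateCache) guarantees these settings are used rather than an older definition.
        IgniteCache<String, Object> cache = ignite.createCache(cfg);
        // After the topology is repaired, lost partitions can be reset explicitly:
        // ignite.resetLostPartitions(java.util.Collections.singleton("queues"));
        return cache;
    }
}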

Striped pool starvation in WAL writing causes cluster node failure

A moderate workload on a 3-node Ignite cluster causes one node to fail with striped pool starvation while archiving the WAL.
This happens once or twice a week.
I have already checked for IO problems that could hang WAL rollover, but the issue still persists.
I am using the latest Ignite 2.7 as a library inside a Spring Boot application.
: >>> Possible starvation in striped pool.
Deadlock: false
Completed: 1397
Thread [name="sys-stripe-7-#8%server.node%", id=22, state=WAITING, blockCnt=3, waitCnt=757]
Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.awaitNext(FileWriteAheadLogManager.java:2871)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.access$2300(FileWriteAheadLogManager.java:2451)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1205)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:836)
at o.a.i.i.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4267)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6333)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6082)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5782)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3719)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5900(BPlusTree.java:3613)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1895)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1779)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1638)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
at o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2295)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3242)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:135)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:309)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:304)
at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:505)
at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
ERROR --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-1, blockedFor=10s]
WARN --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Thread [name="sys-stripe-1-#2%server.node%", id=16, state=WAITING, blockCnt=0, waitCnt=754]
Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
The failure detection feature is not well tuned in Apache Ignite 2.7 by default. You can effectively turn it off (by setting the failure handler to NoOp) or set a large failureDetectionTimeout to avoid such messages (and the shutdown of nodes).
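A minimal sketch of both options on a programmatic IgniteConfiguration (the timeout values are placeholders to tune for your environment):

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.NoOpFailureHandler;

public class NodeConfig {
    public static IgniteConfiguration build() {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Option 1: do not stop the node when a system-critical thread is reported as blocked.
        cfg.setFailureHandler(new NoOpFailureHandler());
        // Option 2: give slow WAL rollover more headroom before the node is considered failed.
        cfg.setFailureDetectionTimeout(120_000);      // ms, placeholder value
        cfg.setSystemWorkerBlockedTimeout(120_000);   // ms, available since Ignite 2.7
        return cfg;
    }
}

Either approach only suppresses the node shutdown; the underlying WAL/IO slowness still needs to be investigated separately.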

Apache Geode debug Unknown pdx type=2140705

I start a GFSH client and connect to Geode. There is a lot of data in myRegion, and to look through it I run:
query --query="select * from /myRegion"
I am getting the response:
Result : false
startCount : 0
endCount : 20
Message : Unknown pdx type=2140705
How does one troubleshoot / debug this problem?
UPDATE: The error in the Geode server log is:
[info 2018/07/04 10:53:07.275 BST IsGeode <Function Execution Processor1> tid=0x48] Exception occurred:
java.lang.IllegalStateException: Unknown pdx type=1318971
at org.apache.geode.internal.InternalDataSerializer.readPdxSerializable(InternalDataSerializer.java:3042)
at org.apache.geode.internal.InternalDataSerializer.basicReadObject(InternalDataSerializer.java:2859)
at org.apache.geode.DataSerializer.readObject(DataSerializer.java:2961)
at org.apache.geode.internal.util.BlobHelper.deserializeBlob(BlobHelper.java:90)
at org.apache.geode.internal.cache.EntryEventImpl.deserialize(EntryEventImpl.java:1911)
at org.apache.geode.internal.cache.EntryEventImpl.deserialize(EntryEventImpl.java:1904)
at org.apache.geode.internal.cache.PreferBytesCachedDeserializable.getDeserializedValue(PreferBytesCachedDeserializable.java:73)
at org.apache.geode.internal.cache.LocalRegion.getDeserialized(LocalRegion.java:1269)
at org.apache.geode.internal.cache.LocalRegion$NonTXEntry.getValue(LocalRegion.java:8771)
at org.apache.geode.internal.cache.EntriesSet$EntriesIterator.moveNext(EntriesSet.java:179)
at org.apache.geode.internal.cache.EntriesSet$EntriesIterator.next(EntriesSet.java:134)
at org.apache.geode.cache.query.internal.CompiledSelect.doNestedIterations(CompiledSelect.java:837)
at org.apache.geode.cache.query.internal.CompiledSelect.doIterationEvaluate(CompiledSelect.java:699)
at org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:423)
at org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:53)
at org.apache.geode.cache.query.internal.DefaultQuery.executeUsingContext(DefaultQuery.java:558)
at org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:385)
at org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:319)
at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:247)
at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:202)
at org.apache.geode.management.internal.cli.functions.DataCommandFunction.execute(DataCommandFunction.java:147)
at org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:185)
at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662)
at org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:1108)
at java.lang.Thread.run(Thread.java:748)
You can tell the immediate cause from the stack trace.
A PDX serialized stream contains a type id which is a reference into a repository of type metadata maintained by a GemFire cluster. In this case, the serialized data of the object contained a typeId that is not in the cluster's metadata repository.
So the question becomes, "what serialized that object and why did it use an invalid type id?"
The only way I've seen this happen before is when a cluster is fully restarted and the pdx metadata goes away, either because it was not persistent or because it was deleted (by clearing out the locator working directory for example).
GemFire clients cache the mapping between a type and its type ID. This allows them to quickly serialize objects without continually looking up the type ID from the server. Client connections can persist across cluster restarts. When a client reconnects, it does not flush the cached information and continues to write objects using its cached type ID.
So the combination of a cluster restart that loses the pdx metadata and a client that is not restarted (e.g. an app server) is the only way I have seen this happen before. Does this match your scenario?
If so, one of the best ways to avoid this is to persist your pdx metadata and never delete it.
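For example, a minimal sketch of enabling persistent PDX metadata on the servers, assuming the embedded CacheFactory API (the disk store name is illustrative; the same settings can be expressed in cache.xml):

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

public class PdxPersistentServer {
    public static Cache start() {
        // Persist PDX type metadata so it survives a full cluster restart.
        return new CacheFactory()
            .setPdxPersistent(true)
            .setPdxDiskStore("pdx-disk-store") // illustrative name; the disk store must already be defined
            .create();
    }
}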

JobTracker - High memory and native thread usage

We are running Hadoop on GCE with HDFS as the default file system, and data input/output from/to GCS.
Hadoop version: 1.2.1
Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1
Observed behavior: the JT will accumulate threads in a waiting state, leading to OOM:
2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444)
at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860)
at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3890)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
After looking through the JT logs I found these warnings:
2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010
java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
at org.apache.hadoop.ipc.Client.call(Client.java:1118)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392)
at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)
Caused by: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249)
at org.apache.hadoop.ipc.Client.call(Client.java:1093)
... 9 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635)
... 12 more
This appears to be similar to the Hadoop bug reported here: https://issues.apache.org/jira/browse/MAPREDUCE-5606
I tried the proposed solution of disabling saving job logs into the output path, and it solved the problem at the expense of missing the logs :)
I also ran jstack on the JT and it showed hundreds of WAITING or TIMED_WAITING threads like the following:
pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:327)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:378)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181)
at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentRequest(MediaHttpUploader.java:629)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:409)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGoogleAsyncWriteChannel.java:354)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker)
It appears the JT is having a hard time keeping up with communication to GCS via the GCS connector.
Please advise,
Thank you
At the moment, every open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it's closed, because a separate thread needs to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases (such as in individual Hadoop tasks), there's only ever one long-lived output stream, and possibly a few shorter-lived ones for writing small metadata/marker files, etc.
In general, there are two possible causes for the OOM you're running into:
You have lots of queued up jobs; every submitted job holds an unclosed OutputStream, and thus consumes a "waiting" thread. However, since you mention you only need to queue up ~10 jobs, this shouldn't be the root cause.
Something is causing a "leak" of the PrintWriter objects, originally created in logSubmitted and added to fileManager. Typically, terminal events (like logFinished) will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory there may be bugs here or there which could cause one of the OutputStreams to leak without being close()'d. For example, while I haven't had a chance to verify this assertion, it seems that an IOException while trying to do something like logMetaInfo will "removeWriter" without closing it.
I've verified that at least under normal circumstances, the OutputStream seem to get closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs.
TL;DR: There are some working theories as to why some resource may leak and ultimately prevent necessary threads from being created. You should consider changing hadoop.job.history.user.location to some HDFS location in the meantime, as a way to preserve the job logs in the absence of placing them on GCS.
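For example, a minimal sketch of that workaround, assuming the property is set per job on a JobConf (the HDFS path is illustrative; the property can equally be set cluster-wide in mapred-site.xml):

import org.apache.hadoop.mapred.JobConf;

public class HistoryLocation {
    public static JobConf withHdfsHistory(JobConf conf) {
        // Keep per-user job history on HDFS instead of the GCS output path,
        // so job submission does not hold an open GCS output stream per job.
        conf.set("hadoop.job.history.user.location", "hdfs:///user/hadoop/job-history"); // illustrative path
        return conf;
    }
}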

Solution for java.sql.SQLException: could not use local transaction commit in a global transaction

I'm running a J2EE MDB in a Weblogic 10.3 clustered environment.
Within one of the methods in the code I create a JDBC connection and turn off auto-commit. The connection to the database is made through an XA-enabled data source. The reason for turning auto-commit off is that I'm using a temp table.
When I'm done with the temp table and other SQL queries I commit the transaction and turn auto-commit back on. For the most part this has not been a problem, but I received the following exception when executing the method; it occurred on the line in the code where I issue the commit.
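For context, a minimal sketch of the flow described above (class, table, and SQL names are purely illustrative, not the actual application code):

import java.sql.Connection;
import java.sql.Statement;
import javax.sql.DataSource;

public class TempTableWorker {
    // Illustrative only: mirrors the described pattern of a local commit on a connection
    // obtained from an XA-enabled data source inside the MDB's container transaction.
    void runWithTempTable(DataSource xaDataSource) throws Exception {
        Connection con = xaDataSource.getConnection();
        try {
            con.setAutoCommit(false);                  // needed while the temp table is in use
            Statement st = con.createStatement();
            st.executeUpdate("INSERT INTO my_temp_table SELECT * FROM some_table"); // illustrative SQL
            // ... other SQL queries on the same connection ...
            st.close();
            con.commit();                              // the line where the SQLException is reported
            con.setAutoCommit(true);
        } finally {
            con.close();
        }
    }
}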
I'm not sure why I received it, as in the method I only make calls to a single database; I wonder if it's because the data source is XA-enabled? Any insight as to why this exception occurs, and why it only occurs sometimes, would be greatly appreciated.
java.sql.SQLException: could not use local transaction commit in a global transaction
at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:112)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:173)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:229)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:403)
at oracle.jdbc.driver.PhysicalConnection.disallowGlobalTxnMode(PhysicalConnection.java:6139)
at oracle.jdbc.driver.PhysicalConnection.commit(PhysicalConnection.java:3320)
at oracle.jdbc.driver.PhysicalConnection.commit(PhysicalConnection.java:3357)
at oracle.jdbc.OracleConnectionWrapper.commit(OracleConnectionWrapper.java:130)
at weblogic.jdbc.wrapper.XAConnection.commit(XAConnection.java:844)
at weblogic.jdbc.wrapper.JTAConnection.commit(JTAConnection.java:326)
at com.myPackage.myApp.myDBCaller.tempTableMethod(myDBCaller.java:390)
at com.myPackage.myApp.myDBCaller.start(myDBCaller.java:153)
at com.myPackage.myApp.myDBCaller.onMessage(myDBCaller.java:117)
at sun.reflect.GeneratedMethodAccessor1683.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:599)
at com.bea.core.repackaged.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:281)
at com.bea.core.repackaged.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:187)
at com.bea.core.repackaged.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:154)
at com.bea.core.repackaged.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed(DelegatingIntroductionInterceptor.java:126)
at com.bea.core.repackaged.springframework.aop.support.DelegatingIntroductionInterceptor.invoke(DelegatingIntroductionInterceptor.java:114)
at com.bea.core.repackaged.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:176)
at com.bea.core.repackaged.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:89)
at com.bea.core.repackaged.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:176)
at com.bea.core.repackaged.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed(DelegatingIntroductionInterceptor.java:126)
at com.bea.core.repackaged.springframework.aop.support.DelegatingIntroductionInterceptor.invoke(DelegatingIntroductionInterceptor.java:114)
at com.bea.core.repackaged.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:176)
at com.bea.core.repackaged.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:210)
at $Proxy228.onMessage(Unknown Source)
at weblogic.ejb.container.internal.MDListener.execute(MDListener.java:518)
at weblogic.ejb.container.internal.MDListener.transactionalOnMessage(MDListener.java:423)
at weblogic.ejb.container.internal.MDListener.onMessage(MDListener.java:325)
at weblogic.jms.client.JMSSession.onMessage(JMSSession.java:4547)
at weblogic.jms.client.JMSSession.execute(JMSSession.java:4233)
at weblogic.jms.client.JMSSession.executeMessage(JMSSession.java:3709)
at weblogic.jms.client.JMSSession.access$000(JMSSession.java:114)
at weblogic.jms.client.JMSSession$UseForRunnable.run(JMSSession.java:5058)
at weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.java:516)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:201)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:173)
Try setting the property
<property name="AutomaticEnlistingEnabled" value="false"/>
on the XA datasource.