Randomly getting java.lang.ClassCastException in snappy job - apache-spark-sql

A Snappy job written in Scala aborts with the exception:
java.lang.ClassCastException: com.....$Class1 cannot be cast to com.....$Class1.
Class1 is a custom class that is stored in the RDD. The interesting thing is that the error is thrown while casting the class to itself. So far we have found no pattern to when it happens.
In the job we fetch data from HBase, enrich it with analytical metadata using DataFrames, and push it to a table in SnappyData. We are using SnappyData 1.2.0.1.
We are not sure why this is happening.
Below is the stack trace:
Job aborted due to stage failure: Task 76 in stage 42.0 failed 4 times, most recent failure: Lost task 76.3 in stage 42.0 (TID 3550, HostName, executor XX.XX.x.xxx(10360):7872): java.lang.ClassCastException: cannot be cast to
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:86)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$2.hasNext(WholeStageCodegenExec.scala:571)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$1.hasNext(WholeStageCodegenExec.scala:514)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:132)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:233)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1006)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:997)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:41)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.sql.execution.WholeStageCodegenRDD.computeInternal(WholeStageCodegenExec.scala:557)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$1.<init>(WholeStageCodegenExec.scala:504)
at org.apache.spark.sql.execution.WholeStageCodegenRDD.compute(WholeStageCodegenExec.scala:503)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:41)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:103)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:126)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:326)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.spark.executor.SnappyExecutor$$anon$2$$anon$3.run(SnappyExecutor.scala:57)
at java.lang.Thread.run(Thread.java:748)

Classes are not unique by name. They're unique by name + classloader.
A ClassCastException of the kind you're seeing happens when you pass data between parts of the application that were loaded by different classloaders.
You might need to clean up your classpath, you might need to resolve the classes from the same classloader, or you might have to serialize the data (especially if you have features that rely on reloading code at runtime).
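To make the failure mode concrete, here is a minimal, hypothetical sketch (the jar path and class name are placeholders, not taken from the job above) showing that the same class loaded through two independent classloaders yields two distinct Class objects, so an instance created from one cannot be cast to the other:

import java.net.{URL, URLClassLoader}

object ClassLoaderCastDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical jar containing com.example.Class1; point this at a real jar to try it.
    val jar = Array(new URL("file:/tmp/app-classes.jar"))

    // parent = null disables delegation, so each loader defines the class itself.
    val loaderA = new URLClassLoader(jar, null)
    val loaderB = new URLClassLoader(jar, null)

    val classA = loaderA.loadClass("com.example.Class1")
    val classB = loaderB.loadClass("com.example.Class1")

    println(classA == classB)            // false: same name, different defining classloaders
    val instance = classA.getDeclaredConstructor().newInstance()
    println(classB.isInstance(instance)) // false: casting it to classB's Class1 would throw ClassCastException
  }
}

In a Spark or SnappyData executor the same situation can arise when the application classes end up both on the executor's system classpath and in a per-job classloader, which is why cleaning up the classpath, resolving classes from one classloader, or serializing the data across the boundary makes the error go away.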

Related

Spark throws Error: FileNotFoundException when writing data frame to S3

We have a DataFrame that we want to write to S3 in Parquet format and in overwrite mode.
Every time we write the DataFrame, it is to a new folder. The code that writes to the S3 location is as follows:
df.write
  .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
  .option("maxRecordsPerFile", maxRecordsPerFile)
  .mode("overwrite")
  .format(format)
  .save(output)
What we observe is that at times we get a FileNotFoundException (full trace below). Can somebody help me understand:
When I am writing to a new S3 location (meaning nobody is reading from that location), why does the writing program throw the exception below?
How do I fix it? I see a couple of Stack Overflow posts pointing to this exception, but they say it happens when you read while a write is in progress. My case is not like that; I don't read while the write happens.
My Spark version is 2.3.2 on EMR-5.18.1; the code is written in Scala.
I am using s3:// as the output folder path. Should I change it to s3n:// or s3a://? Will that help?
Caused by: java.io.FileNotFoundException: No such file or directory 's3://BUCKET/snapshots/FOLDER/_bid_9223370368440344985/part-00020-693dfbcb-74e9-45b0-b892-0b19fa92365c-c000.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:104)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
I was finally able to solve the problem.
The df: DataFrame was built from the same S3 folder that it was then being written back to in overwrite mode.
So during the overwrite the source folder gets cleared, which was causing the error.
Hope this helps somebody.
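A minimal sketch of the fix, assuming a SparkSession named spark and hypothetical bucket paths (not the poster's actual ones): materialize the derived DataFrame to a separate prefix before overwriting the folder it was read from.

// Hypothetical paths; the point is that the read path and the overwrite path differ.
val sourcePath  = "s3://BUCKET/snapshots/FOLDER/"
val stagingPath = "s3://BUCKET/snapshots/FOLDER_staging/"

val df = spark.read.parquet(sourcePath)            // derived from the folder we intend to overwrite

df.write.mode("overwrite").parquet(stagingPath)    // 1. persist the result to a separate prefix
spark.read.parquet(stagingPath)                    // 2. re-read the materialized copy
  .write.mode("overwrite").parquet(sourcePath)     // 3. only now overwrite the original location

Caching or checkpointing df before the overwrite can also work for small data, but a staging write is the safer pattern on S3, because Spark evaluates the plan lazily and would otherwise delete the input files while tasks are still reading them.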

Solution Partitioning failing

I have implemented a solution partitioner for my planning problem, but when I now run the solver it throws the following error:
Exception in thread "main" java.lang.IllegalStateException: The partition child thread with partIndex (1) has thrown an exception. Relayed here in the parent thread.
at org.optaplanner.core.impl.partitionedsearch.queue.PartitionQueue$PartitionQueueIterator.createUpcomingSelection(PartitionQueue.java:157)
at org.optaplanner.core.impl.partitionedsearch.queue.PartitionQueue$PartitionQueueIterator.createUpcomingSelection(PartitionQueue.java:121)
at org.optaplanner.core.impl.heuristic.selector.common.iterator.UpcomingSelectionIterator.hasNext(UpcomingSelectionIterator.java:42)
at org.optaplanner.core.impl.partitionedsearch.DefaultPartitionedSearchPhase.solve(DefaultPartitionedSearchPhase.java:131)
at org.optaplanner.core.impl.solver.AbstractSolver.runPhases(AbstractSolver.java:88)
at org.optaplanner.core.impl.solver.DefaultSolver.solve(DefaultSolver.java:191)
at com.paconsulting.Demo.main(PowerPeersDemo.java:137)
Caused by: java.lang.IllegalStateException: When lookUpEnabled (false) is disabled in the constructor, this method should not be called.
at org.optaplanner.core.impl.score.director.AbstractScoreDirector.lookUpWorkingObject(AbstractScoreDirector.java:506)
at org.optaplanner.core.impl.heuristic.selector.move.generic.ChangeMove.rebase(ChangeMove.java:83)
at org.optaplanner.core.impl.heuristic.selector.move.generic.ChangeMove.rebase(ChangeMove.java:33)
at org.optaplanner.core.impl.localsearch.decider.MultiThreadedLocalSearchDecider.forageResult(MultiThreadedLocalSearchDecider.java:196)
at org.optaplanner.core.impl.localsearch.decider.MultiThreadedLocalSearchDecider.decideNextStep(MultiThreadedLocalSearchDecider.java:157)
at org.optaplanner.core.impl.localsearch.DefaultLocalSearchPhase.solve(DefaultLocalSearchPhase.java:70)
at org.optaplanner.core.impl.solver.AbstractSolver.runPhases(AbstractSolver.java:88)
at org.optaplanner.core.impl.partitionedsearch.PartitionSolver.solve(PartitionSolver.java:121)
at org.optaplanner.core.impl.partitionedsearch.DefaultPartitionedSearchPhase.lambda$solve$1(DefaultPartitionedSearchPhase.java:119)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-03-21 21:47:41,705 [main] DEBUG PS step (1), time spent (1493), score (-249530hard/0soft), best score (-249530hard/0soft), picked move (part-0 {3886 variables changed}).
Exception in thread "main" java.lang.IllegalStateException: The partition child thread with partIndex (1) has thrown an exception. Relayed here in the parent thread.
at org.optaplanner.core.impl.partitionedsearch.queue.PartitionQueue$PartitionQueueIterator.createUpcomingSelection(PartitionQueue.java:157)
at org.optaplanner.core.impl.partitionedsearch.queue.PartitionQueue$PartitionQueueIterator.createUpcomingSelection(PartitionQueue.java:121)
at org.optaplanner.core.impl.heuristic.selector.common.iterator.UpcomingSelectionIterator.hasNext(UpcomingSelectionIterator.java:42)
at org.optaplanner.core.impl.partitionedsearch.DefaultPartitionedSearchPhase.solve(DefaultPartitionedSearchPhase.java:131)
at org.optaplanner.core.impl.solver.AbstractSolver.runPhases(AbstractSolver.java:88)
at org.optaplanner.core.impl.solver.DefaultSolver.solve(DefaultSolver.java:191)
at com.paconsulting.powerpeers.PowerPeersDemo.main(PowerPeersDemo.java:137)
Caused by: java.lang.IllegalStateException: When lookUpEnabled (false) is disabled in the constructor, this method should not be called.
at org.optaplanner.core.impl.score.director.AbstractScoreDirector.lookUpWorkingObject(AbstractScoreDirector.java:506)
at org.optaplanner.core.impl.heuristic.selector.move.generic.ChangeMove.rebase(ChangeMove.java:83)
at org.optaplanner.core.impl.heuristic.selector.move.generic.ChangeMove.rebase(ChangeMove.java:33)
at org.optaplanner.core.impl.localsearch.decider.MultiThreadedLocalSearchDecider.forageResult(MultiThreadedLocalSearchDecider.java:196)
at org.optaplanner.core.impl.localsearch.decider.MultiThreadedLocalSearchDecider.decideNextStep(MultiThreadedLocalSearchDecider.java:157)
at org.optaplanner.core.impl.localsearch.DefaultLocalSearchPhase.solve(DefaultLocalSearchPhase.java:70)
at org.optaplanner.core.impl.solver.AbstractSolver.runPhases(AbstractSolver.java:88)
at org.optaplanner.core.impl.partitionedsearch.PartitionSolver.solve(PartitionSolver.java:121)
at org.optaplanner.core.impl.partitionedsearch.DefaultPartitionedSearchPhase.lambda$solve$1(DefaultPartitionedSearchPhase.java:119)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I do have @PlanningId implemented on all the relevant objects.
Running version 7.18 of OptaPlanner.
This is a bug; thank you for reporting it.
Partitioned Search is incompatible with Multithreaded Incremental Solving in version 7.19 or lower. This was a gap in our test coverage.
Partitioned Search was implemented before Multithreaded Incremental Solving, and the latter didn't take it into account here in AbstractScoreDirector:
public InnerScoreDirector<Solution_> createChildThreadScoreDirector(ChildThreadType childThreadType) {
    if (childThreadType == ChildThreadType.PART_THREAD) {
        AbstractScoreDirector<Solution_, Factory_> childThreadScoreDirector = (AbstractScoreDirector<Solution_, Factory_>)
                scoreDirectorFactory.buildScoreDirector(false, constraintMatchEnabledPreference); // That false is lookUpEnabled
        ...
That false kills the ability to nest multithreaded solving under Partitioned Search.
I've created a Jira issue and fixed it for 7.20 in this pull request.

Convert pyspark dataframe to pandas dataframe

I have a PySpark DataFrame whose dimensions are (28002528, 21), and I tried to convert it to a pandas DataFrame using the following line:
pd_df = spark_df.toPandas()
I got this error:
First part:
Py4JJavaError: An error occurred while calling o170.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 39.0 failed 1 times, most recent failure: Lost task 3.0 in stage 39.0 (TID 89, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:552)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
...
...
Caused by: java.lang.OutOfMemoryError: Java heap space
...
...
Second part:
Exception happened during processing of request from ('127.0.0.1', 56842)
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:56657)
Traceback (most recent call last):
...
...
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
...
...
I also tried to take a sample of the original PySpark DataFrame:
sample_pd_df = spark_df.sample(0.05).toPandas()
and I got an error that looks like only the first part of the previous error.
You get a java.lang.OutOfMemoryError, which probably means that you are trying to load all the data onto a single node that doesn't have enough RAM to hold the entire DataFrame. If you are using a cloud provider such as Databricks, try increasing the size of the cluster's RAM.
What toPandas() does is collect the whole DataFrame onto a single node (as explained in @ulmefors's answer).
More specifically, it collects it to the driver. The specific option you should be fine-tuning is spark.driver.memory; increase it accordingly.
Otherwise, if you're planning to do further transformations on this (rather large) pandas DataFrame, consider doing them in PySpark first and then collecting the (smaller) result to the driver, which will hopefully fit in memory; see the sketch after this answer.
More details are available in the Spark configuration documentation, here.
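As a sketch of that second suggestion (written in Scala to match the other examples on this page; the PySpark calls are analogous, and the column names are made up): reduce the data on the cluster first, then collect only the small result.

import org.apache.spark.sql.functions.sum

// spark_df stands in for the 28-million-row DataFrame from the question.
val summary = spark_df
  .groupBy("customer_id")                  // hypothetical grouping column
  .agg(sum("amount").as("total_amount"))   // hypothetical measure

// The aggregate is tiny compared to the source, so pulling it to the driver is safe;
// in PySpark this is the point where you would call toPandas() on the summary.
val localRows = summary.collect()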

Issue in saving the content of a dataframe to table

I have a data source (Hive external tables) whose data is refreshed in an ad hoc manner. To avoid any discrepancies in the execution, I'm trying to save the data as a table in my own location.
Initially, I load the data from the data source into a DataFrame:
source = hqlContext.table("datasourcedb.table1") // this is working fine
Then I try to save it to my application's location:
source.write.mode('overwrite').saveAsTable("appdb.table1") // No read/write operations on appdb.table1 while doing this action
The above action throws the following exception:
java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: BLOCK
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask$1(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/02 04:31:32 ERROR TaskSetManager: Task 9 in stage 1.0 failed 4 times; aborting job
18/03/02 04:31:32 ERROR InsertIntoHadoopFsRelation: Aborting job.
Note: The size of the source is about 6 GB. Hence, no persist action is planned.

JobTracker - High memory and native thread usage

We are running Hadoop on GCE with HDFS as the default file system, and data input/output from/to GCS.
Hadoop version: 1.2.1
Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1
Observed behavior: the JobTracker accumulates threads in a waiting state, leading to an OOM:
2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444)
at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860)
at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3890)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
After looking through the JT logs I found these warnings:
2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010
java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
at org.apache.hadoop.ipc.Client.call(Client.java:1118)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392)
at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)
Caused by: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249)
at org.apache.hadoop.ipc.Client.call(Client.java:1093)
... 9 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635)
... 12 more
This appears to be similar to the Hadoop bug reported here: https://issues.apache.org/jira/browse/MAPREDUCE-5606
I tried the proposed solution of disabling saving job logs into the output path, and it solved the problem at the expense of missing logs :)
I also ran jstack on the JT and it showed hundreds of WAITING or TIMED_WAITING threads like this:
pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:327)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:378)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181)
at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentRequest(MediaHttpUploader.java:629)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:409)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGoogleAsyncWriteChannel.java:354)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker)
It appears the JT is having a hard time keeping up with its communication to GCS via the GCS connector.
Please advise.
Thank you.
At the moment, every open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it's closed, because a separate thread needs to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases (such as in individual Hadoop tasks), there's only ever one long-lived output stream, and possibly a few shorter-lived ones for writing small metadata/marker files, etc.
In general, there are two possible causes for the OOM you're running into:
You have lots of queued up jobs; every submitted job holds an unclosed OutputStream, and thus consumes a "waiting" thread. However, since you mention you only need to queue up ~10 jobs, this shouldn't be the root cause.
Something is causing a "leak" of the PrintWriter objects, originally created in logSubmitted and added to fileManager. Typically, terminal events (like logFinished will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory they may be bugs here or there which can cause one of the OutputStreams to leak without being close()'d. For example, while I haven't had a chance to verify this assertion, it seems that IOException trying to do something like logMetaInfo will "removeWriter" without closing it.
I've verified that at least under normal circumstances, the OutputStream seem to get closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs.
TL;DR: There are some working theories as to why some resource may leak and ultimately prevent necessary threads from being created. You should consider changing hadoop.job.history.user.location to some HDFS location in the meantime, as a way to preserve the job logs in the absence of placing them on GCS.
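As a rough sketch of that suggestion (the HDFS path is an assumption, and how the property is set may differ in your deployment): point hadoop.job.history.user.location at HDFS in the job configuration so the per-job history file is written there instead of through the GCS connector.

import org.apache.hadoop.mapred.JobConf

// Hadoop 1.x style job configuration; the history path below is a placeholder.
val conf = new JobConf()
conf.set("hadoop.job.history.user.location", "hdfs:///user/hadoop/job-history")
// Setting the property to "none" instead disables the user-location history copy entirely,
// which is the "missing logs" trade-off mentioned earlier in this thread.
// ... configure input/output paths, mapper and reducer classes, then submit the job as usual.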