I have a huge DataFrame of around 7 GB.
I am trying to get the row count of the DataFrame and write it out as a single CSV file.
Both operations fail with the error below.
Is there another way to download the DataFrame as one file rather than as multiple partition files?
print(df.count())
df.coalesce(1).write.option("header", "true").csv('/user/ABC/Output.csv')
Error:
java.io.IOException: Stream is corrupted
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/26 18:15:44 ERROR scheduler.TaskSetManager: Task 8 in stage 360.0 failed 1 times; aborting job
[Stage 360:=======> (8 + 1) / 60]
Py4JJavaError: An error occurred while calling o18867.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 360.0 failed 1 times, most recent failure: Lost task 8.0 in stage 360.0 (TID 13986, localhost, executor driver): java.io.IOException: Stream is corrupted
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
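One possible way to end up with a single CSV without `coalesce(1)` is to let Spark write its usual multi-part output and concatenate the part files afterwards. The sketch below is only an illustration under stated assumptions: the helper name `merge_csv_parts` is hypothetical, the part files are assumed to sit on a local filesystem, and every part is assumed to have been written with `header=true`. It does not address the cause of the `Stream is corrupted` error itself.

```python
import glob
import os
import shutil


def merge_csv_parts(parts_dir, merged_path):
    """Concatenate Spark part-*.csv files into one CSV with a single header.

    Assumption: each non-empty part file starts with the same header line.
    Returns the number of part files that were read.
    """
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    header_written = False
    with open(merged_path, "wb") as merged:
        for part in part_files:
            with open(part, "rb") as source:
                # The first line of every part is the header; keep only one copy.
                header = source.readline()
                if header and not header_written:
                    merged.write(header)
                    header_written = True
                # Stream the remaining rows without loading the file into memory.
                shutil.copyfileobj(source, merged)
    return len(part_files)
```

For example, after `df.write.option("header", "true").csv(parts_dir)` with a hypothetical output directory `parts_dir`, calling `merge_csv_parts(parts_dir, "Output.csv")` would produce one file. If the output lives on HDFS, the part files would first have to be copied to local disk.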
I am using the On New or Updated File SFTP connector to read 3000+ files daily. Everything runs smoothly. However, at the end of the processing phase, the SFTP connector throws the error below and does not process the last file, which remains in the SFTP folder for the next run. This error repeats on every run, so the last file of each run is never processed.
20:03:37.733 06/15/2022 Worker-0 [MuleRuntime].uber.33: [demo-data-api].prcsFiles-Error-SuccessFlow.CPU_INTENSIVE #38f99b06 ERROR
event:c3ccc560-f1be-11ec-a890-02732233ad66
********************************************************************************
Message : "org.mule.weave.v2.module.reader.ReaderParsingException: org.mule.runtime.api.exception.MuleRuntimeException - Exception was found trying to retrieve the contents of file /home/transaction/data.json
org.mule.runtime.api.exception.MuleRuntimeException: Exception was found trying to retrieve the contents of file /home/transaction/data.json
at org.mule.extension.sftp.internal.connection.SftpClient.exception(SftpClient.java:427)
at org.mule.extension.sftp.internal.connection.SftpClient.exception(SftpClient.java:423)
at org.mule.extension.sftp.internal.connection.SftpClient.getFileContent(SftpClient.java:349)
at org.mule.extension.sftp.internal.connection.SftpFileSystem.retrieveFileContent(SftpFileSystem.java:117)
at org.mule.extension.sftp.internal.SftpInputStream$SftpFileInputStreamSupplier.getContentInputStream(SftpInputStream.java:111)
at org.mule.extension.sftp.internal.SftpInputStream$SftpFileInputStreamSupplier.getContentInputStream(SftpInputStream.java:93)
at org.mule.extension.file.common.api.AbstractConnectedFileInputStreamSupplier.getContentInputStream(AbstractConnectedFileInputStreamSupplier.java:81)
at org.mule.extension.file.common.api.AbstractFileInputStreamSupplier.get(AbstractFileInputStreamSupplier.java:65)
at org.mule.extension.file.common.api.AbstractFileInputStreamSupplier.get(AbstractFileInputStreamSupplier.java:33)
at org.mule.extension.file.common.api.stream.LazyStreamSupplier.lambda$new$1(LazyStreamSupplier.java:29)
at org.mule.extension.file.common.api.stream.LazyStreamSupplier.get(LazyStreamSupplier.java:42)
at org.mule.extension.file.common.api.stream.AbstractNonFinalizableFileInputStream.lambda$createLazyStream$0(AbstractNonFinalizableFileInputStream.java:48)
at $java.io.InputStream$$EnhancerByCGLIB$$55e4687e.read(<generated>)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:102)
at org.mule.runtime.core.internal.streaming.bytes.AbstractInputStreamBuffer.consumeStream(AbstractInputStreamBuffer.java:111)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.consumeForwardData(FileStoreInputStreamBuffer.java:239)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.consumeForwardData(FileStoreInputStreamBuffer.java:202)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.doGet(FileStoreInputStreamBuffer.java:125)
at org.mule.runtime.core.internal.streaming.bytes.AbstractInputStreamBuffer.get(AbstractInputStreamBuffer.java:93)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.assureDataInLocalBuffer(BufferedCursorStream.java:126)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.doRead(BufferedCursorStream.java:101)
at org.mule.runtime.core.internal.streaming.bytes.AbstractCursorStream.read(AbstractCursorStream.java:124)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.read(BufferedCursorStream.java:26)
at java.io.InputStream.read(InputStream.java:101)
at org.mule.runtime.core.internal.streaming.bytes.ManagedCursorStreamDecorator.read(ManagedCursorStreamDecorator.java:96)
at org.mule.weave.v2.el.SeekableCursorStream.read(MuleTypedValue.scala:306)
at org.mule.weave.v2.module.reader.UTF8StreamSourceReader.handleBOM(SeekableStreamSourceReader.scala:179)
at org.mule.weave.v2.module.reader.UTF8StreamSourceReader.readAscii(SeekableStreamSourceReader.scala:163)
at org.mule.weave.v2.module.json.reader.JsonTokenizer.$init$(JsonTokenizer.scala:21)
at org.mule.weave.v2.module.json.reader.indexed.IndexedJsonTokenizer.<init>(IndexedJsonTokenizer.scala:15)
at org.mule.weave.v2.module.json.reader.indexed.IndexedJsonParser.parser(IndexedJsonParser.scala:17)
at org.mule.weave.v2.module.json.reader.JsonReader.readValue(JsonReader.scala:40)
at org.mule.weave.v2.module.json.reader.JsonReader.doRead(JsonReader.scala:30)
at org.mule.weave.v2.module.reader.Reader.read(Reader.scala:35)
at org.mule.weave.v2.module.reader.Reader.read$(Reader.scala:33)
at org.mule.weave.v2.module.json.reader.JsonReader.read(JsonReader.scala:20)
at org.mule.weave.v2.el.MuleTypedValue.value(MuleTypedValue.scala:147)
at org.mule.weave.v2.model.values.wrappers.DelegateValue.valueType(DelegateValue.scala:17)
at org.mule.weave.v2.model.values.wrappers.DelegateValue.valueType$(DelegateValue.scala:16)
at org.mule.weave.v2.el.MuleTypedValue.valueType(MuleTypedValue.scala:177)
at org.mule.weave.v2.model.types.ObjectType$.accepts(Type.scala:1068)
at org.mule.weave.v2.interpreted.node.executors.BinaryOverloadedStaticExecutor.executeBinary(BinaryOverloadedStaticExecutor.scala:45)
at org.mule.weave.v2.interpreted.node.ChainedBinaryOpNode.doExecute(ChainedBinaryOpNode.scala:37)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.ChainedBinaryOpNode.execute(ChainedBinaryOpNode.scala:7)
at org.mule.weave.v2.interpreted.node.NullSafeNode.doExecute(NullSafeNode.scala:14)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.NullSafeNode.execute(NullSafeNode.scala:8)
at org.mule.weave.v2.interpreted.node.BinaryOpNode.doExecute(BinaryOpNode.scala:15)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.BinaryOpNode.execute(BinaryOpNode.scala:9)
at org.mule.weave.v2.interpreted.node.structure.DocumentNode.doExecute(DocumentNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.structure.DocumentNode.execute(DocumentNode.scala:11)
at org.mule.weave.v2.interpreted.InterpretedMappingExecutableWeave.execute(InterpreterMappingCompilerPhase.scala:196)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:250)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.$anonfun$evaluate$2(WeaveExpressionLanguageSession.scala:101)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.doEvaluate(WeaveExpressionLanguageSession.scala:285)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:100)
at org.mule.runtime.core.internal.el.dataweave.DataWeaveExpressionLanguageAdaptor$1.evaluate(DataWeaveExpressionLanguageAdaptor.java:274)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluate(DefaultExpressionManagerSession.java:51)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluateBoolean(DefaultExpressionManagerSession.java:72)
at org.mule.runtime.core.internal.routing.ProcessorExpressionRoute.accepts(ProcessorExpressionRoute.java:34)
at org.mule.runtime.core.internal.routing.ExecutableRoute.shouldExecute(ExecutableRoute.java:41)
at org.mule.runtime.core.internal.routing.ChoiceRouter$SinkRouter.lambda$route$0(ChoiceRouter.java:161)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531)
at org.mule.runtime.core.internal.routing.ChoiceRouter$SinkRouter.route(ChoiceRouter.java:161)
at org.mule.runtime.core.api.util.func.CheckedConsumer.accept(CheckedConsumer.java:19)
at org.mule.runtime.core.api.rx.Exceptions.lambda$checkedConsumer$0(Exceptions.java:51)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:482)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onNext(FluxOnAssembly.java:351)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxMap$MapConditionalSubscriber.onNext(FluxMap.java:213)
at reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner.onNext(MonoFlatMapMany.java:242)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:496)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:180)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onNext(FluxOnAssembly.java:351)
at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:447)
at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:534)
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.mule.service.scheduler.internal.AbstractRunnableFutureDecorator.doRun(AbstractRunnableFutureDecorator.java:151)
at org.mule.service.scheduler.internal.RunnableFutureDecorator.run(RunnableFutureDecorator.java:54)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.mule.extension.sftp.api.SftpConnectionException: Error occurred while trying to connect to host
... 112 more
Caused by: org.mule.runtime.api.connection.ConnectionException:
at org.mule.extension.sftp.api.SftpConnectionException.<init>(SftpConnectionException.java:38)
... 112 more
Caused by: org.mule.runtime.api.connection.ConnectionException:
... 112 more
Caused by: 4:
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1540)
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1290)
at org.mule.extension.sftp.internal.connection.SftpClient.getFileContent(SftpClient.java:347)
... 110 more
Caused by: java.io.IOException: Pipe closed
at java.io.PipedInputStream.read(PipedInputStream.java:307)
at com.jcraft.jsch.Channel$MyPipedInputStream.updateReadSide(Channel.java:362)
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1311)
... 112 mor, while reading `comingData` as Json.
Trace:
at main (Unknown)
at org.mule.weave.v2.el.utils.ExceptionHandler$.handleLocatableException(ExceptionHandler.scala:24)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.doEvaluate(WeaveExpressionLanguageSession.scala:291)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:100)
at org.mule.runtime.core.internal.el.dataweave.DataWeaveExpressionLanguageAdaptor$1.evaluate(DataWeaveExpressionLanguageAdaptor.java:274)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluate(DefaultExpressionManagerSession.java:51)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluateBoolean(DefaultExpressionManagerSession.java:72)
at org.mule.runtime.core.internal.routing.ProcessorExpressionRoute.accepts(ProcessorExpressionRoute.java:34)
at org.mule.runtime.core.internal.routing.ExecutableRou... [truncated]
SFTP Configuration:
<sftp:config name="SFTP_Config" doc:name="SFTP Config" doc:id="81f37ff8-d629-4f64-ab2c-5632350b8fca" >
<sftp:connection host="sample.com" port="00" username="example1" password="111111" connectionTimeout="30" responseTimeout="30">
<pooling-profile maxActive="10" maxIdle="10" maxWait="10" evictionCheckIntervalMillis="60000" minEvictionMillis="120000"/>
</sftp:connection>
</sftp:config>
<flow name="prcsFiles" doc:id="55ac597a-378f-40b6-8041-df7ca8254ebe" maxConcurrency="1">
<sftp:listener doc:name="On New or Updated File" doc:id="9a5d6eae-0fc6-46a5-be10-de98fb7ee16a" config-ref="SFTP_Config" directory="home/transaction/" outputMimeType="application/json" timeBetweenSizeCheckUnit="MILLISECONDS">
<reconnect-forever/>
<scheduling-strategy >
<cron expression="0 15 10 ? * *" timeZone="UTC" />
</scheduling-strategy>
</sftp:listener>
</flow>
I am using a cron expression that runs once daily. Has anyone come across this issue before?
Any thoughts will be appreciated.
I'm reading a huge CSV file with 39,795,158 records and writing it to MSSQL Server on Azure Databricks. The Databricks notebook is running on a cluster with 56 GB memory, 16 cores, and 12 workers.
This is my code in Python and PySpark:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from time import sleep
url = "jdbc:sqlserver://{0}:{1};database={2}".format(server, port, database)
spark.conf.set("spark.databricks.io.cache.enabled", True)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Read csv file.
df_lake = spark.read \
.option('header', 'false') \
.schema(s) \
.option('delimiter', ',') \
.csv('wasbs://...')
batch_size = 60000
rows = df_lake.count()
org_pts = df_lake.rdd.getNumPartitions() # 566
new_pts = 1990
# Re-partition the DataFrame
df_repartitioned = df_lake.repartition(new_pts)
# Write the DataFrame into MSSQL server, by using JDBC driver
df_repartitioned.write \
.format("jdbc") \
.mode("overwrite") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("url", url) \
.option("dbtable", tablename) \
.option("user", username) \
.option("password", password) \
.option("batchsize", batch_size) \
.save()
sleep(10)
I then got the following logs and errors:
rows: 39795158
org_pts: 566
new_pts: 1990
Copy error: An error occurred while calling o9647.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62 in stage 462.0 failed 4 times, most recent failure: Lost task 62.3 in stage 462.0 (TID 46609) (10.139.64.12 executor 27): com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:234)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:1217)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.rollback(SQLServerConnection.java:3508)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:728)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:857)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1025)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1025)
at org.apache.spark.SparkContext.$anonfun$runJob$2(SparkContext.scala:2517)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2828)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2775)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2769)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2769)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1305)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1305)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1305)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3036)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2977)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2965)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1067)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2477)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2460)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2517)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2542)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$1(RDD.scala:1025)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:419)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:1023)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:855)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:63)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:96)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:213)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:257)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:209)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:167)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:166)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:1080)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:130)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:273)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:104)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:223)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:1080)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:469)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:439)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:312)
at sun.reflect.GeneratedMethodAccessor448.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:234)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:1217)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.rollback(SQLServerConnection.java:3508)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:728)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:857)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1025)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1025)
at org.apache.spark.SparkContext.$anonfun$runJob$2(SparkContext.scala:2517)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
For 3 to 6 million records there was no problem, but for 10 million or more records it failed.
I'm not sure why it happens at 10 million or more records.
Are there any solutions for processing a huge DataFrame on Azure Databricks?
Using too many partitions when reading from the external database risks overloading that database with too many queries. Most DBMSs limit the number of concurrent connections. As a starting point, aim for a number of partitions close to the number of cores or task slots in your Spark cluster, so that you maximize parallelism while keeping the total number of queries capped at a reasonable limit.
Workaround
If you need lots of parallelism after fetching the JDBC rows (because you’re doing something CPU-bound in Spark) but don’t want to issue too many concurrent queries to your database, consider using a lower numPartitions for the JDBC read and then doing an explicit repartition() in Spark.
Refer to this official doc.
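The sizing rule above can be sketched as a small helper. This is an illustration rather than a recommended formula: the function name `choose_jdbc_partitions` and the connection limit are hypothetical, and the real limit depends on how the database is configured.

```python
def choose_jdbc_partitions(total_cores, max_db_connections):
    """Pick a partition count for a JDBC read or write.

    Starts from the number of task slots in the cluster (one task per core)
    and caps it at the number of connections the database is assumed to accept.
    """
    if total_cores < 1 or max_db_connections < 1:
        raise ValueError("both arguments must be positive")
    return min(total_cores, max_db_connections)
```

Assuming, for illustration, 12 workers with 16 cores each and a hypothetical cap of 100 connections, `choose_jdbc_partitions(12 * 16, 100)` gives 100 rather than the 1990 partitions used in the question. The result could then be passed as the JDBC `numPartitions` option; on write, Spark coalesces the DataFrame down to that limit when it has more partitions.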
I solved it by reducing the memory and cores of my cluster. I set up the cluster again with 14 GB memory, 4 cores, and 8 workers, and it worked: it now writes without any error. I'm not sure why it failed with the larger cluster settings.
I have an EMR cluster with Hive on it.
If I open the Hive console and run select * from table, it works fine.
If I run select * from table where ID > 2 (or any other filter), it also works fine.
But when I run select count(*) from table, it shows this error:
-----
Diagnostic Messages for this Task:
Container launch failed for container_1597771662497_0002_01_000006 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateExceptionImpl(SerializedExceptionPBImpl.java:171)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:182)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:163)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
What can be the problem?
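The message names a YARN auxiliary service, `mapreduce_shuffle`, that the NodeManager does not know about. Hive can often serve a plain `select *`, with or without a simple filter, as a direct fetch, whereas `count(*)` launches a MapReduce job that needs the shuffle service. One thing that may be worth checking is whether `yarn.nodemanager.aux-services` in yarn-site.xml lists `mapreduce_shuffle`. The sketch below does that check on an XML string; the function names are hypothetical, and where yarn-site.xml lives on this cluster is not shown in the question.

```python
import xml.etree.ElementTree as ET


def yarn_aux_services(yarn_site_xml):
    """Return the aux services listed in a yarn-site.xml document (given as a string)."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name == "yarn.nodemanager.aux-services":
            value = prop.findtext("value", default="")
            # The property holds a comma-separated list of service names.
            return [item.strip() for item in value.split(",") if item.strip()]
    return []


def has_mapreduce_shuffle(yarn_site_xml):
    """True if mapreduce_shuffle is among the configured aux services."""
    return "mapreduce_shuffle" in yarn_aux_services(yarn_site_xml)
```

If the check returns False, adding `mapreduce_shuffle` to that property (together with `yarn.nodemanager.aux-services.mapreduce_shuffle.class` set to `org.apache.hadoop.mapred.ShuffleHandler`) and restarting the NodeManagers is the usual remedy for this message, though whether it applies here is an assumption.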
Command run (trying to get the maximum runs scored):
Run_M = foreach Run_Group_All generate (Match.Player, Match.Run) , MAX(Match.Run);
According to the log, the GROUP command is failing. Can anybody help me find where the problem is?
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:556)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:84)
at org.apache.pig.builtin.AlgebraicLongMathBase.exec(AlgebraicLongMathBase.java:93)
at org.apache.pig.builtin.AlgebraicLongMathBase.exec(AlgebraicLongMathBase.java:37)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:326)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextLong(POUserFunc.java:410)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:351)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:400)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:317)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:474)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:442)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:422)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:269)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:346)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Number
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:77)
... 20 more
2017-09-03 07:48:03,212 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-09-03 07:48:03,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local1294624349_0011 has failed! Stop running all dependent jobs
2017-09-03 07:48:03,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-09-03 07:48:03,213 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2017-09-03 07:48:03,214 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2017-09-03 07:48:03,214 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2017-09-03 07:48:03,215 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.8.1 0.15.0 goldi 2017-09-03 07:48:01 2017-09-03 07:48:03 GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_local1294624349_0011 Cric,Match,Run_Group_All,Run_M GROUP_BY Message: Job failed! file:/tmp/temp-1949037811/tmp1601097545,
Input(s):
Failed to read data from "/home/goldi/Batting.csv"
Output(s):
Failed to produce result in "file:/tmp/temp-1949037811/tmp1601097545"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local1294624349_0011
2017-09-03 07:48:03,217 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-09-03 07:48:03,218 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias Run_M
Details at logfile: /home/goldi/pig_1504365116860.log
Replace '(Match.Player, Match.Run)' with 'group':
Run_M = foreach Run_Group_All generate FLATTEN(group) as (player, run), MAX(Match.Run);
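The innermost cause in the trace above, "DataByteArray cannot be cast to java.lang.Number", suggests the Run values were still untyped bytes when MAX received them. As a rough analogy only (this is Python, not Pig, and the sample rows and column names are made up because the real layout of Batting.csv is not shown), the sketch below casts the field before taking the maximum:

```python
import csv
import io

# Hypothetical sample standing in for Batting.csv; the real file's layout is unknown.
SAMPLE = "Player,Run\nA,45\nB,102\nC,9\n"


def max_run(text):
    """Return the (player, run) pair with the highest run, casting Run to int first."""
    rows = csv.DictReader(io.StringIO(text))
    # Every CSV field arrives as a string, much like an untyped bytearray in Pig.
    # Without int() the comparison would be lexicographic, so "9" would beat "102".
    typed = [(row["Player"], int(row["Run"])) for row in rows]
    return max(typed, key=lambda pair: pair[1])


print(max_run(SAMPLE))
```

In Pig the comparable step would probably be giving the field a numeric type in the LOAD schema, or casting it explicitly with (long) before the GROUP. Whether that is needed here depends on how Match was defined, which the question does not show.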
When I try to store data using this Pig command:
STORE D into '/deidentifiedDir';
I get this error:
2015-05-06 20:17:14,587 [Thread-96] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0009
java.io.IOException: Mkdirs failed to create file:/pig/deidentifiedDir/_temporary/_attempt_local_0009_m_000000_0
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:366)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextOutputFormat.getRecordWriter(PigTextOutputFormat.java:98)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:83)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:488)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:610)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)