Understanding groupby, reduceByKey on transformed dataset - dataframe

I am working with a large dataset on a standalone Spark setup. I am still new to Spark, so my fundamentals might be a little weak, and I hope I am not falling into the XY problem trap.
The task is relatively simple in pandas, but I can't seem to debug my PySpark error.
I have the following dataset:
+-------------+--------+----------+----------+--------------------+
|           id|latitude| longitude| timestamp|        categoryname|
+-------------+--------+----------+----------+--------------------+
|f69bfce8-a2c5|5.866167|118.088919|1551319828|                null|
|b9d48e00-0e57|3.224278| 101.72999|1551445560|   CONVENIENCE STORE|
|a6c5d9e2-1f99|3.148319| 101.61653|1551530554|         RESTAURANTS|
|92988985-67e2| 1.54056| 110.31867|1551458606|                null|
|e1771886-cb87|2.803712|101.663718|1551352028|                null|
+-------------+--------+----------+----------+--------------------+
Using a UDF, I was able to calculate the distance of each row from a single point with the haversine library.
distance_udf = F.udf(lambda lat1, lon1, lat2, lon2: haversine((lat1, lon1), (lat2, lon2)))
This gives me:
+-------------+-----------------+--------+----------+------------------+
|           id|     categoryname|latitude| longitude|          distance|
+-------------+-----------------+--------+----------+------------------+
|f69bfce8-a2c5|             null|5.866167|118.088919|1846.2724187047768|
|b9d48e00-0e57|CONVENIENCE STORE|3.224278| 101.72999|10.727485447625341|
|a6c5d9e2-1f99|      RESTAURANTS|3.148319| 101.61653| 4.505927571918682|
|92988985-67e2|             null| 1.54056| 110.31867|  979.531392507226|
|e1771886-cb87|             null|2.803712|101.663718| 40.27783211167852|
+-------------+-----------------+--------+----------+------------------+
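For reference, the distance column is a great-circle (haversine) distance in kilometres. The following is a minimal pure-Python sketch of that formula, not the haversine package's own code; the function name haversine_km and the mean Earth radius of 6371.0088 km are assumptions made for illustration.

```python
import math

# Assumed mean Earth radius in kilometres (illustrative constant).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(point1, point2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points.
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


if __name__ == "__main__":
    # Distance between two of the sample rows shown in the first table.
    print(haversine_km((3.148319, 101.61653), (3.224278, 101.72999)))
```

Wrapping a function like this in F.udf runs it row by row in a Python worker process, which is the code path that appears in the traceback further down (PythonUDFRunner).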
After .filter() and .drop(), I am left with:
+-------------+--------------------+
|           id|        categoryname|
+-------------+--------------------+
|d05e2151-0fb9|                null|
|8900e7dd-d51e|                null|
|a1e712f9-0784|RESIDENTIAL BUILDING|
|5b2c6eb3-f13e|                null|
|c7a05929-43fb|         RESTAURANTS|
+-------------+--------------------+
I have tried df.groupby('categoryname').count() on the transformed dataframe and get an error.
I am trying to get the count of each categoryname.
I have also tried converting it to an RDD and using .reduceByKey(), to no avail.
What am I missing? Is it my setup? The dataset isn't that big, only 50 GB.
groupby() works fine when I first load the dataset but doesn't seem to work once I have done a few transformations.
Could someone please point me in the right direction?
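For reference, the per-category count being asked for can be written in plain Python as below. This is only a sketch of the intended result on the five sample rows shown above, emulating a reduceByKey over (categoryname, 1) pairs; the helper name count_by_key is hypothetical and no Spark is involved.

```python
# The five sample rows from the filtered table: (id, categoryname).
rows = [
    ("d05e2151-0fb9", None),
    ("8900e7dd-d51e", None),
    ("a1e712f9-0784", "RESIDENTIAL BUILDING"),
    ("5b2c6eb3-f13e", None),
    ("c7a05929-43fb", "RESTAURANTS"),
]


def count_by_key(pairs):
    """Sum the values of (key, value) pairs per key, like reduceByKey with addition."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


# Map every row to (categoryname, 1), then reduce by key.
category_counts = count_by_key((category, 1) for _id, category in rows)
print(category_counts)
# {None: 3, 'RESIDENTIAL BUILDING': 1, 'RESTAURANTS': 1}
```

The null category is kept as its own group here, which is also how a DataFrame groupby followed by count() treats null keys.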
EDIT:
Traceback (most recent call last):
File "C:\Users\Siddharth\Desktop\Uni\DataBooks\Movingwalls\sparkTest.py", line 52, in <module>
results = spark.sql('''SELECT count(DISTINCT idfa), categoryname FROM test2 GROUP BY categoryname''').show()
File "C:\Users\Siddharth\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\dataframe.py", line 378, in show
print(self._jdf.showString(n, 20, vertical))
File "C:\Users\Siddharth\AppData\Local\Programs\Python\Python36\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\Siddharth\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Users\Siddharth\AppData\Local\Programs\Python\Python36\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o110.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 2, localhost, executor driver): java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)

Related

On New or Update Sftp connector issue with pipe closed

I am using the On New or Updated File SFTP connector to read 3000+ files daily. Everything runs smoothly. However, at the end of the processing phase, the SFTP connector throws the error below and does not process the last file, which remains in the SFTP folder for the next run. This scenario repeats on every run, so the last file of each run is never processed.
20:03:37.733 06/15/2022 Worker-0 [MuleRuntime].uber.33: [demo-data-api].prcsFiles-Error-SuccessFlow.CPU_INTENSIVE #38f99b06 ERROR
event:c3ccc560-f1be-11ec-a890-02732233ad66
********************************************************************************
Message : "org.mule.weave.v2.module.reader.ReaderParsingException: org.mule.runtime.api.exception.MuleRuntimeException - Exception was found trying to retrieve the contents of file /home/transaction/data.json
org.mule.runtime.api.exception.MuleRuntimeException: Exception was found trying to retrieve the contents of file /home/transaction/data.json
at org.mule.extension.sftp.internal.connection.SftpClient.exception(SftpClient.java:427)
at org.mule.extension.sftp.internal.connection.SftpClient.exception(SftpClient.java:423)
at org.mule.extension.sftp.internal.connection.SftpClient.getFileContent(SftpClient.java:349)
at org.mule.extension.sftp.internal.connection.SftpFileSystem.retrieveFileContent(SftpFileSystem.java:117)
at org.mule.extension.sftp.internal.SftpInputStream$SftpFileInputStreamSupplier.getContentInputStream(SftpInputStream.java:111)
at org.mule.extension.sftp.internal.SftpInputStream$SftpFileInputStreamSupplier.getContentInputStream(SftpInputStream.java:93)
at org.mule.extension.file.common.api.AbstractConnectedFileInputStreamSupplier.getContentInputStream(AbstractConnectedFileInputStreamSupplier.java:81)
at org.mule.extension.file.common.api.AbstractFileInputStreamSupplier.get(AbstractFileInputStreamSupplier.java:65)
at org.mule.extension.file.common.api.AbstractFileInputStreamSupplier.get(AbstractFileInputStreamSupplier.java:33)
at org.mule.extension.file.common.api.stream.LazyStreamSupplier.lambda$new$1(LazyStreamSupplier.java:29)
at org.mule.extension.file.common.api.stream.LazyStreamSupplier.get(LazyStreamSupplier.java:42)
at org.mule.extension.file.common.api.stream.AbstractNonFinalizableFileInputStream.lambda$createLazyStream$0(AbstractNonFinalizableFileInputStream.java:48)
at $java.io.InputStream$$EnhancerByCGLIB$$55e4687e.read(<generated>)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:102)
at org.mule.runtime.core.internal.streaming.bytes.AbstractInputStreamBuffer.consumeStream(AbstractInputStreamBuffer.java:111)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.consumeForwardData(FileStoreInputStreamBuffer.java:239)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.consumeForwardData(FileStoreInputStreamBuffer.java:202)
at com.mulesoft.mule.runtime.core.internal.streaming.bytes.FileStoreInputStreamBuffer.doGet(FileStoreInputStreamBuffer.java:125)
at org.mule.runtime.core.internal.streaming.bytes.AbstractInputStreamBuffer.get(AbstractInputStreamBuffer.java:93)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.assureDataInLocalBuffer(BufferedCursorStream.java:126)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.doRead(BufferedCursorStream.java:101)
at org.mule.runtime.core.internal.streaming.bytes.AbstractCursorStream.read(AbstractCursorStream.java:124)
at org.mule.runtime.core.internal.streaming.bytes.BufferedCursorStream.read(BufferedCursorStream.java:26)
at java.io.InputStream.read(InputStream.java:101)
at org.mule.runtime.core.internal.streaming.bytes.ManagedCursorStreamDecorator.read(ManagedCursorStreamDecorator.java:96)
at org.mule.weave.v2.el.SeekableCursorStream.read(MuleTypedValue.scala:306)
at org.mule.weave.v2.module.reader.UTF8StreamSourceReader.handleBOM(SeekableStreamSourceReader.scala:179)
at org.mule.weave.v2.module.reader.UTF8StreamSourceReader.readAscii(SeekableStreamSourceReader.scala:163)
at org.mule.weave.v2.module.json.reader.JsonTokenizer.$init$(JsonTokenizer.scala:21)
at org.mule.weave.v2.module.json.reader.indexed.IndexedJsonTokenizer.<init>(IndexedJsonTokenizer.scala:15)
at org.mule.weave.v2.module.json.reader.indexed.IndexedJsonParser.parser(IndexedJsonParser.scala:17)
at org.mule.weave.v2.module.json.reader.JsonReader.readValue(JsonReader.scala:40)
at org.mule.weave.v2.module.json.reader.JsonReader.doRead(JsonReader.scala:30)
at org.mule.weave.v2.module.reader.Reader.read(Reader.scala:35)
at org.mule.weave.v2.module.reader.Reader.read$(Reader.scala:33)
at org.mule.weave.v2.module.json.reader.JsonReader.read(JsonReader.scala:20)
at org.mule.weave.v2.el.MuleTypedValue.value(MuleTypedValue.scala:147)
at org.mule.weave.v2.model.values.wrappers.DelegateValue.valueType(DelegateValue.scala:17)
at org.mule.weave.v2.model.values.wrappers.DelegateValue.valueType$(DelegateValue.scala:16)
at org.mule.weave.v2.el.MuleTypedValue.valueType(MuleTypedValue.scala:177)
at org.mule.weave.v2.model.types.ObjectType$.accepts(Type.scala:1068)
at org.mule.weave.v2.interpreted.node.executors.BinaryOverloadedStaticExecutor.executeBinary(BinaryOverloadedStaticExecutor.scala:45)
at org.mule.weave.v2.interpreted.node.ChainedBinaryOpNode.doExecute(ChainedBinaryOpNode.scala:37)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.ChainedBinaryOpNode.execute(ChainedBinaryOpNode.scala:7)
at org.mule.weave.v2.interpreted.node.NullSafeNode.doExecute(NullSafeNode.scala:14)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.NullSafeNode.execute(NullSafeNode.scala:8)
at org.mule.weave.v2.interpreted.node.BinaryOpNode.doExecute(BinaryOpNode.scala:15)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.BinaryOpNode.execute(BinaryOpNode.scala:9)
at org.mule.weave.v2.interpreted.node.structure.DocumentNode.doExecute(DocumentNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute(ValueNode.scala:26)
at org.mule.weave.v2.interpreted.node.ValueNode.execute$(ValueNode.scala:21)
at org.mule.weave.v2.interpreted.node.structure.DocumentNode.execute(DocumentNode.scala:11)
at org.mule.weave.v2.interpreted.InterpretedMappingExecutableWeave.execute(InterpreterMappingCompilerPhase.scala:196)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:250)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.$anonfun$evaluate$2(WeaveExpressionLanguageSession.scala:101)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.doEvaluate(WeaveExpressionLanguageSession.scala:285)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:100)
at org.mule.runtime.core.internal.el.dataweave.DataWeaveExpressionLanguageAdaptor$1.evaluate(DataWeaveExpressionLanguageAdaptor.java:274)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluate(DefaultExpressionManagerSession.java:51)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluateBoolean(DefaultExpressionManagerSession.java:72)
at org.mule.runtime.core.internal.routing.ProcessorExpressionRoute.accepts(ProcessorExpressionRoute.java:34)
at org.mule.runtime.core.internal.routing.ExecutableRoute.shouldExecute(ExecutableRoute.java:41)
at org.mule.runtime.core.internal.routing.ChoiceRouter$SinkRouter.lambda$route$0(ChoiceRouter.java:161)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531)
at org.mule.runtime.core.internal.routing.ChoiceRouter$SinkRouter.route(ChoiceRouter.java:161)
at org.mule.runtime.core.api.util.func.CheckedConsumer.accept(CheckedConsumer.java:19)
at org.mule.runtime.core.api.rx.Exceptions.lambda$checkedConsumer$0(Exceptions.java:51)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:482)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onNext(FluxOnAssembly.java:351)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxMap$MapConditionalSubscriber.onNext(FluxMap.java:213)
at reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner.onNext(MonoFlatMapMany.java:242)
at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:496)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:180)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:490)
at org.mule.runtime.core.privileged.processor.chain.AbstractMessageProcessorChain$2.onNext(AbstractMessageProcessorChain.java:485)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onNext(FluxPeekFuseable.java:204)
at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onNext(FluxOnAssembly.java:351)
at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:447)
at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:534)
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.mule.service.scheduler.internal.AbstractRunnableFutureDecorator.doRun(AbstractRunnableFutureDecorator.java:151)
at org.mule.service.scheduler.internal.RunnableFutureDecorator.run(RunnableFutureDecorator.java:54)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.mule.extension.sftp.api.SftpConnectionException: Error occurred while trying to connect to host
... 112 more
Caused by: org.mule.runtime.api.connection.ConnectionException:
at org.mule.extension.sftp.api.SftpConnectionException.<init>(SftpConnectionException.java:38)
... 112 more
Caused by: org.mule.runtime.api.connection.ConnectionException:
... 112 more
Caused by: 4:
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1540)
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1290)
at org.mule.extension.sftp.internal.connection.SftpClient.getFileContent(SftpClient.java:347)
... 110 more
Caused by: java.io.IOException: Pipe closed
at java.io.PipedInputStream.read(PipedInputStream.java:307)
at com.jcraft.jsch.Channel$MyPipedInputStream.updateReadSide(Channel.java:362)
at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:1311)
... 112 mor, while reading `comingData` as Json.
Trace:
at main (Unknown)
at org.mule.weave.v2.el.utils.ExceptionHandler$.handleLocatableException(ExceptionHandler.scala:24)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.doEvaluate(WeaveExpressionLanguageSession.scala:291)
at org.mule.weave.v2.el.WeaveExpressionLanguageSession.evaluate(WeaveExpressionLanguageSession.scala:100)
at org.mule.runtime.core.internal.el.dataweave.DataWeaveExpressionLanguageAdaptor$1.evaluate(DataWeaveExpressionLanguageAdaptor.java:274)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluate(DefaultExpressionManagerSession.java:51)
at org.mule.runtime.core.internal.el.DefaultExpressionManagerSession.evaluateBoolean(DefaultExpressionManagerSession.java:72)
at org.mule.runtime.core.internal.routing.ProcessorExpressionRoute.accepts(ProcessorExpressionRoute.java:34)
at org.mule.runtime.core.internal.routing.ExecutableRou... [truncated]
SFTP Configuration:
<sftp:config name="SFTP_Config" doc:name="SFTP Config" doc:id="81f37ff8-d629-4f64-ab2c-5632350b8fca" >
<sftp:connection host="sample.com" port="00" username="example1" password="111111" connectionTimeout="30" responseTimeout="30">
<pooling-profile maxActive="10" maxIdle="10" maxWait="10" evictionCheckIntervalMillis="60000" minEvictionMillis="120000"/>
</sftp:connection>
</sftp:config>
<flow name="prcsFiles" doc:id="55ac597a-378f-40b6-8041-df7ca8254ebe" maxConcurrency="1">
<sftp:listener doc:name="On New or Updated File" doc:id="9a5d6eae-0fc6-46a5-be10-de98fb7ee16a" config-ref="SFTP_Config" directory="home/transaction/" outputMimeType="application/json" timeBetweenSizeCheckUnit="MILLISECONDS">
<reconnect-forever/>
<scheduling-strategy >
<cron expression="0 15 10 ? * *" timeZone="UTC" />
</scheduling-strategy>
</sftp:listener>
</flow>
I am using a cron expression that runs once daily. Has anyone come across this issue before?
Any thoughts will be appreciated.
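To make the schedule in the listener explicit, here is a small sketch that labels the fields of the cron expression from the configuration above. It assumes the Quartz field order (seconds, minutes, hours, day of month, month, day of week, optional year); the helper name describe_quartz is hypothetical and this is not part of any Mule API.

```python
# Quartz-style field names, in order; the seventh field (year) is optional.
QUARTZ_FIELDS = ("seconds", "minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def describe_quartz(expression):
    """Label each whitespace-separated field of a Quartz-style cron expression."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 fields, got %d" % len(parts))
    # zip stops at the shorter sequence, so a 6-field expression has no 'year' key.
    return dict(zip(QUARTZ_FIELDS, parts))


fields = describe_quartz("0 15 10 ? * *")
print(fields)
# {'seconds': '0', 'minutes': '15', 'hours': '10', 'day_of_month': '?', 'month': '*', 'day_of_week': '*'}
```

Read this way, the expression fires once a day at 10:15:00 in the configured UTC time zone, which matches the "once daily" behaviour described.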

Pyspark -- java.lang.NullPointerException (When using jdbc reading from Teradata to a dataframe)

Below is the PySpark code to load data from the EDW (Teradata) into HDFS (Hadoop) using the JDBC driver:
from pyspark.sql import *
from pyspark.sql.types import *
q = """(select columns
from TD.Table1 as a
inner join TD.Table2 as b
on a.col=b.col
where b.col = date '2017-09-30'
) foo"""
df = spark.read.format('jdbc') \
.option('url','jdbc:teradata://teradata-dns-sysa.ab.xyz.com') \
.option('driver','com.teradata.jdbc.TeraDriver') \
.option('user','username') \
.option('password','####') \
.option('tmode','tera') \
.option('dbtable',q) \
.option('type','fastload') \
.load()
df.show(20,False)
This throws the error below:
Py4JJavaError: An error occurred while calling o64.showString.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2854)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2838)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2154)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2367)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Basically, when I try to perform any action on this df, such as saving it as a Hive table or as an ORC file on HDFS, it throws the same error.
I ran df.printSchema() and found that the cause may be that the dataset contains null values in columns marked (nullable = false).
So I also tried to create a new DataFrame from the previous one, specifying the desired schema:
df2 = spark.createDataFrame(df.rdd,schema)
I still got the same NPE.
Any idea how to solve this one? Thank you!

Something strange in Pig, Cloudera QuickStart

I do not understand why, when I run a Pig script in the editor, a workflow is created in Oozie along with three jobs (as in the image), rather than the script simply running as it does in Hive.
Image
entrada = LOAD '/user/cloudera/Divisas/Barril-WTI.csv' using PigStorage (',') AS (Fecha:chararray, Valor: float);
entrada_sin_cabecera = filter entrada by Fecha != 'Date';
orden = order entrada_sin_cabecera by Valor;
dump orden;
I also get the following error when running pig -f WTI.pig:
Error before Pig is launched
ERROR 2997: Encountered IOException. File WTI.pig does not exist
java.io.FileNotFoundException: File WTI.pig does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:424)
at org.apache.pig.impl.io.FileLocalizer.fetchFilesInternal(FileLocalizer.java:747)
at org.apache.pig.impl.io.FileLocalizer.fetchFile(FileLocalizer.java:688)
at org.apache.pig.Main.run(Main.java:424)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Spark 1.4.1 - Using pyspark

I tried the following code and got an error.
Code
instances = sqlContext.sql("SELECT instance_id, instance_usage_code FROM ib_instances WHERE (instance_usage_code) = 'OUT_OF_ENTERPRISE'")
instances.write.format("orc").save("instances2")
hivectx.sql("""CREATE TABLE IF NOT EXISTS instances2 (instance_id string, instance_usage_code STRING)""")
hivectx.sql("LOAD DATA LOCAL INPATH '/home/hduser/instances2' into table instances2")
Error
Traceback (most recent call last):
File "/home/hduser/spark_script.py", line 57, in <module>
instances.write.format("orc").save("instances2")
File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 304, in save
File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o55.save.
: java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.hive.orc.DefaultSource.createRelation(OrcRelation.scala:54)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:322)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
My guess is that you created a standard SQLContext instead of a HiveContext, which adds extra features such as the ORC data source named in the assertion. Create your sqlContext as a HiveContext instance. The Scala version is:
val sqlContext = new HiveContext(sc)
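Since the question uses pyspark, a Python equivalent might look like the sketch below. It assumes Spark 1.x (where HiveContext is importable from pyspark.sql) and an already running SparkContext passed in as sc. The helper names make_hive_context and write_instances_as_orc are hypothetical; the query and the output path are taken from the question.

```python
def make_hive_context(sc):
    """Wrap an existing SparkContext in a HiveContext (Spark 1.x API)."""
    # Imported lazily so the snippet can be loaded on a machine without Spark.
    from pyspark.sql import HiveContext
    return HiveContext(sc)


def write_instances_as_orc(sc, path="instances2"):
    """Run the question's query through a HiveContext and save it as ORC."""
    hivectx = make_hive_context(sc)
    instances = hivectx.sql(
        "SELECT instance_id, instance_usage_code "
        "FROM ib_instances "
        "WHERE instance_usage_code = 'OUT_OF_ENTERPRISE'"
    )
    # The DataFrame is now tied to a HiveContext, which the ORC source asks for.
    instances.write.format("orc").save(path)
    return instances
```

The point of the sketch is that the DataFrame being written comes from the HiveContext itself, rather than from a plain SQLContext with a HiveContext used only for the later statements.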

Parquet backed table corrupted - HIVE - expected magic number at tail [80, 65, 82, 49] but found [1, 92, 78, 10]

Distribution: CDH-4.6.0-1.cdh4.6.0.p0.26
Hive Version: 0.10.0
Parquet Version: 1.2.5
I have two big date partitioned external Hive tables full of log files that I recently converted to Parquet to take advantage of the compression and columnar storage. So far I've been very happy with the performance.
Our dev team recently added a field to the logs, so I was charged with adding a column to both log tables. It worked perfectly for one, but the other appears to have become corrupted. I reverted the change, but I still can't query the table.
I'm convinced the data is fine (because it didn't change) but something is wrong in the metastore. An msck repair table repopulates the partitions after I drop/create, but does not take care of the error below. There are two things that can fix it, but neither makes me happy:
Re-insert the data.
Copy data back into the table from the production cluster.
I'm really hoping there's a command that I don't know about that will fix the table without resorting to the two options above. Like I said, the data is fine. I've googled the heck out of the error, and I get some results, but they all pertain to Impala, which is NOT what we're using.
select * from upload_metrics_hist where dt = '2014-07-01' limit 5;
The problem is this:
Caused by: java.lang.RuntimeException: hdfs://hdfs-dev/data/prod/upload-metrics/upload_metrics_hist/dt=2014-07-01/000005_0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [1, 92, 78, 10]
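The byte values in that message can be decoded directly, which may help narrow things down. The sketch below is illustrative only: decode_magic, tail_magic and looks_like_parquet are hypothetical helper names, and the file check works on a local copy of a file (for example one pulled down from HDFS first), not on HDFS itself. The expected bytes spell PAR1. The bytes actually found are Ctrl-A, a backslash, N and a newline, which resembles the end of a row in Hive's default delimited text format with a NULL (\N) last field; that reading is an assumption, not something the error confirms.

```python
PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def decode_magic(values):
    """Turn the list of byte values from the error message into bytes."""
    return bytes(values)


def tail_magic(path, size=4):
    """Return the last `size` bytes of a local file."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)  # move to the end to learn the file length
        length = handle.tell()
        handle.seek(max(length - size, 0))
        return handle.read(size)


def looks_like_parquet(path):
    """True if the local file ends with the Parquet magic number."""
    return tail_magic(path) == PARQUET_MAGIC


expected = decode_magic([80, 65, 82, 49])
found = decode_magic([1, 92, 78, 10])
print(expected)  # b'PAR1'
print(found)     # b'\x01\\N\n'
```

Running looks_like_parquet over local copies of the files in a partition directory would show whether 000005_0 is the only file with a non-Parquet tail.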
Full error
2014-07-17 02:00:48,835 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:372)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:319)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:433)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:540)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:358)
... 10 more
Caused by: java.lang.RuntimeException: hdfs://hdfs-dev/data/prod/upload-metrics/upload_metrics_hist/dt=2014-07-01/000005_0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [1, 92, 78, 10]
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:263)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:229)
at parquet.hive.DeprecatedParquetInputFormat$RecordReaderWrapper.getSplit(DeprecatedParquetInputFormat.java:327)
at parquet.hive.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:204)
at parquet.hive.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:108)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
... 15 more