What is going wrong with my etl process?

What is going wrong with my etl process? - gooddata

I'm using GoodData's CloudConnect (based on CloverETL) to read a massive json file and write certain elements to a .csv.
Unfortunately, I'm seeing the error pasted below in the console log. Am I running out of memory due to the error, or is that not enough memory the actual error?
ERROR [WatchDog_0] - Component [JSONReader:JSONREADER1] finished with status ERROR.
Java heap space
ERROR [WatchDog_0] - Error details:
org.jetel.exception.JetelRuntimeException: Component [JSONReader:JSONREADER1] finished with status ERROR.
at org.jetel.graph.Node.createNodeException(Node.java:543)
at org.jetel.graph.Node.run(Node.java:522)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.checkThrownException(TreeReader.java:766)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.manageThread(TreeReader.java:757)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.processInput(TreeReader.java:732)
at org.jetel.component.TreeReader.execute(TreeReader.java:412)
at org.jetel.graph.Node.run(Node.java:493)
... 1 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at net.sf.saxon.tinytree.TinyTree.condense(TinyTree.java:379)
at net.sf.saxon.tinytree.TinyBuilder.close(TinyBuilder.java:177)
at net.sf.saxon.event.ReceivingContentHandler.endDocument(ReceivingContentHandler.java:219)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endDocument(AbstractSAXParser.java:745)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:515)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:404)
at net.sf.saxon.event.Sender.send(Sender.java:193)
at net.sf.saxon.event.Sender.send(Sender.java:50)
at net.sf.saxon.Configuration.buildDocument(Configuration.java:2973)
at net.sf.saxon.sxpath.XPathExpression.evaluate(XPathExpression.java:154)
at org.jetel.component.tree.reader.xml.XmlXPathEvaluator.iterate(XmlXPathEvaluator.java:79)
at org.jetel.component.tree.reader.XPathPushParser.handleContext(XPathPushParser.java:104)
at org.jetel.component.tree.reader.XPathPushParser.parse(XPathPushParser.java:84)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor$PipeParser.work(TreeReader.java:827)
at org.jetel.graph.runtime.CloverWorker.run(CloverWorker.java:87)
... 1 more

This looks like the second case: this error is caused by insufficient memory for your task.
Error occurred during evaluating (one of) your JSONReader component(s).
The JSON seems to be really huge and you should consider splitting this task into smaller ones if possible.
Did you run your transformation locally or on the gooddata server?
It is really hard to advise something specific without knowing details.

Try to use JSONExtract instead if JSONReader - it uses less memory, but also reads JSON files.

From the respective help documents:
JSONReader uses DOM, so the whole input is stored in memory and therefore the component can be memory-greedy.
JSONExtract uses SAX instead of DOM, so it uses less memory than JSONReader

Related

Issues while running with BigQueryIO.Write.Method.STORAGE_WRITE_API

We are testing with STORAGE_WRITE_API to insert data into BigQuery. We've seen several errors/warnings in our Dataflow pipeline(written in Java). It might work well in the beginning, but eventually the system lag would be increasing, it would stop processing any data from PubSub and the unacked messages piled up.
One common warning is:
Operation ongoing in step insertTableRowsToBigQuery/StorageApiLoads/StorageApiWriteSharded/Write Records for at least 03h35m00s without outputting or completing in state process
at java.base#11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
at java.base#11.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
at java.base#11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
at java.base#11.0.9/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Callback.await(RetryManager.java:153)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.await(RetryManager.java:136)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager.await(RetryManager.java:256)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:248)
at app//org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.process(StorageApiWritesShardedRecords.java:453)
at app//org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
Other exceptions we've seen:
Got error io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Stream is closed
Got error io.grpc.StatusRuntimeException: ALREADY_EXIST
PodSandboxStatus of sandbox "..." for pod "df-...-pipeline-...-harness-qw4j_default(...)" error: rpc error: code = Unknown desc = Error: No such container
Code sample:
toBq.apply("insertTableRowsToBigQuery",
BigQueryIO
.writeTableRows()
.to(String.format("%s:%s.%s", PROJECT_ID, DATASET, table))
.withTriggeringFrequency(Duration.standardSeconds(options.getTriggeringFrequency()))
.withNumStorageWriteApiStreams(options.getNumStorageWriteApiStreams())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

There was a production issue related to connection being stuck after streaming 10MB which has been fixed. If you try again, it should work.

JobTracker - High memory and native thread usage

We are running hadoop on GCE with HDFS default file system, and data input/output from/to GCS.
Hadoop version: 1.2.1
Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1
Observed behavior: JT will accumulate threads in waiting state, leading to OOM:
2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444)
at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860)
at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706)
at org.apache.hadoop.mapred.JobTracker.initJob(Jobenter code hereTracker.java:3890)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
After looking through the JT logs I found these warnings:
2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010
java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
at org.apache.hadoop.ipc.Client.call(Client.java:1118)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392)
at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)
Caused by: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249)
at org.apache.hadoop.ipc.Client.call(Client.java:1093)
... 9 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635)
... 12 more
This appears to be similar to hadoop bug reporter here: https://issues.apache.org/jira/browse/MAPREDUCE-5606
I tried proposed solution by disabling saving job logs into the output path and it solved the problem at the expense of missing logs :)
I also ran jstack on JT and it showed hundreds of WAITING or TIMED_WAITING threads as such:
pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:327)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:378)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181)
at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentReque
st(MediaHttpUploader.java:629)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.
java:409)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(Abstr
actGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(Abstr
actGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogl
eClientRequest.java:460)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGo
ogleAsyncWriteChannel.java:354)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker)
It appears JT is having hard time keeping up communicating with GCS via GCS Connector.
Please advise,
Thank you

At the moment, every open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it's closed, because a separate thread needs to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases, (such as in individual Hadoop tasks), there's only ever one long-lived output stream, and possibly a few shorter-lived ones for writing small metadata/marker files, etc.
In general, there are two possible causes for the OOM you're running into:
You have lots of queued up jobs; every submitted job holds an unclosed OutputStream, and thus consumes a "waiting" thread. However, since you mention you only need to queue up ~10 jobs, this shouldn't be the root cause.
Something is causing a "leak" of the PrintWriter objects, originally created in logSubmitted and added to fileManager. Typically, terminal events (like logFinished will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory they may be bugs here or there which can cause one of the OutputStreams to leak without being close()'d. For example, while I haven't had a chance to verify this assertion, it seems that IOException trying to do something like logMetaInfo will "removeWriter" without closing it.
I've verified that at least under normal circumstances, the OutputStream seem to get closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs.
TL;DR: There are some working theories as to why some resource may leak and ultimately prevent necessary threads from being created. You should consider changing hadoop.job.history.user.location to some HDFS location in the meantime, as a way to preserve the job logs in the absence of placing them on GCS.

Array in output schema caused exception

I am following this WordCount example using the Google BigQuery-Hadoop connector:
https://developers.google.com/hadoop/writing-with-bigquery-connector#completecode
The example works fine as it is.
To test array in the output schema, I have altered just one line in the code by adding an array object definition to the output schema:
String outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Number','type': 'INTEGER'},{'name':'Persons','mode':'REPEATED','type':'RECORD','fields':[{'name': 'name','type': 'STRING'},{'name': 'age','type': 'INTEGER'}]}]";
Now when I run the WordCount example, it gives this exception:
java.lang.IllegalStateException
at com.google.gson.JsonArray.getAsString(JsonArray.java:133)
at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.getSchemaFromString(BigQueryUtils.java:97)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getRecordWriter(BigQueryOutputFormat.java:121)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.(ReduceTask.java:568)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Does anyone know what the issue is?
Thank you

This is actually a bug in the current version of the BigQuery connector which prevents it from supporting inner records with more than 1 field.
We have a fix internally and it's slated to go out with the next release (0.4.3) which may still be a couple weeks out; if you'd like to help try out a staging build, feel free to reach out to gcp-hadoop-contact#google.com and we can provide instructions.

How to load a delimited file with empty fields in pig

I am using the below command to load the file, when I try to dump or illustrate the loaded data, it fails with the below error. I have checked the sanity of the data, each line contains correct number of delimiters, but when the field is empty the delimiter immediately follows , I tried loading the below single sample line. It does not work.
hs_2_inr = LOAD 'hs_2_inr.dat' USING PigStorage('^') as ( year:chararray, country:chararray, s_no:chararray, hs_8:chararray, hs_8_desc:chararray, prevyr_inr:chararray, curyr_inr:chararray, growth:chararray, dummy:chararray);
Here is the sample data
1997^BOTSWANA^1.^10063001^*RICE PARBOILED^^2.43^^
Below is the exception
2013-06-30 21:02:23,015 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:84)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:739)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:626)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:323)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
2013-06-30 21:02:23,016 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception
so how do I load a file with empty fields in pig?

Your code works fine. As you mentioned in your comment, ILLUSTRATE was your problem. Per the docs, ILLUSTRATE wasn't being maintained for a while. Do not rely on it. You shouldn't need it in any non-diagnostic code anyway. Use DESCRIBE instead.
In newer Pig versions, the warning on ILLUSTRATE seems to have gone away, so it may be safe again, but I'd still rely more heavily on DESCRIBE to avoid a source of potential issues. In Pig 0.10, which I'm on, ILLUSTRATE still gave me the same error you received.

Lucene Search Error Stack

I am seeing the following error when trying to search using Lucene. (version 1.4.3). Any ideas as to why I could be seeing this and how to fix it?
Caused by: java.io.IOException: read past EOF
at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:195)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:55)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:89)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:106)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:43)
In this same environment I also see the following error:
Caused by: java.io.IOException: Lock obtain timed out:
Lock#/tmp/lucene-3ec31395c8e06a56e2939f1fdda16c67-write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:58)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:223)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:213)
The same code works in a test environment, however not in production. Cannot identify any obvious differences between the two environments.

File permissions are wrong (it needs write permission) or your are not able to access a locked file that the current process needs.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas