How to load a delimited file with empty fields in Pig

I am using the command below to load the file. When I try to DUMP or ILLUSTRATE the loaded data, it fails with the error shown further down. I have checked the sanity of the data: each line contains the correct number of delimiters, but where a field is empty the next delimiter immediately follows. I even tried loading just the single sample line below, and it still does not work.
hs_2_inr = LOAD 'hs_2_inr.dat' USING PigStorage('^') as ( year:chararray, country:chararray, s_no:chararray, hs_8:chararray, hs_8_desc:chararray, prevyr_inr:chararray, curyr_inr:chararray, growth:chararray, dummy:chararray);
Here is the sample data
1997^BOTSWANA^1.^10063001^*RICE PARBOILED^^2.43^^
Below is the exception
2013-06-30 21:02:23,015 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:84)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:739)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:626)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:323)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
2013-06-30 21:02:23,016 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception
So how do I load a file with empty fields in Pig?

Your code works fine. As you mentioned in your comment, ILLUSTRATE was your problem. Per the docs, ILLUSTRATE went unmaintained for a while, so do not rely on it; you shouldn't need it in any non-diagnostic code anyway. Use DESCRIBE instead.
In newer Pig versions the warning on ILLUSTRATE seems to have gone away, so it may be safe again, but I would still lean on DESCRIBE to avoid a potential source of issues. On Pig 0.10, which I'm on, ILLUSTRATE still gave me the same error you received.
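For what it's worth, here is a minimal sketch of the DESCRIBE/DUMP workflow, reusing the load statement from the question:
hs_2_inr = LOAD 'hs_2_inr.dat' USING PigStorage('^') AS
    (year:chararray, country:chararray, s_no:chararray, hs_8:chararray,
     hs_8_desc:chararray, prevyr_inr:chararray, curyr_inr:chararray,
     growth:chararray, dummy:chararray);
-- DESCRIBE only prints the relation's schema; it never touches the data
DESCRIBE hs_2_inr;
-- DUMP actually reads the file; lines with empty fields load without error
DUMP hs_2_inr;
DESCRIBE answers the usual diagnostic question (what does my schema look like?) without the example-generation machinery that makes ILLUSTRATE fragile.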

Related

HSQLDB throws Assert failed exception and file io error on db.script.new file during Checkpoint

Our application is a Java-based desktop application which downloads binary data from a source, parses it, and adds it to an HSQLDB database. When downloading from the sources individually, the application works perfectly. But when downloading from multiple sources simultaneously, each source in its own thread, I get an error of
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 23 in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
or sometimes,
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 1016 in statement [CHECKPOINT]
followed by
java.sql.SQLException: File input/output error: C:\ProgramData\test\data\database\db.script.new in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
Java: 1.8;
HSQL version: 1.8.10
We are not in a position to migrate HSQLDB to the latest version, for various reasons.
HSQL Properties:
hsqldb.script_format=0
runtime.gc_interval=0
sql.enforce_strict_size=false
hsqldb.cache_size_scale=8
readonly=false
hsqldb.nio_data_file=true
hsqldb.cache_scale=14
version=1.8.0
hsqldb.default_table_type=memory
hsqldb.cache_file_scale=1
hsqldb.log_size=200
modified=yes
hsqldb.cache_version=1.7.0
hsqldb.original_version=1.8.0
hsqldb.compatible_version=1.8.0
Any help or hint will be appreciated.
This is a 7-year-old version, which is not ideal for multi-threaded usage.
The simple solution is to perform the database updates from a single thread. You can retrofit your multi-threaded application by wrapping the code that performs the database update in a synchronized block on a singleton object, as sketched below.
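Here is a rough sketch of that retrofit, assuming each download thread finishes by inserting its parsed payload; the table, columns, and method name are made up for illustration:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public final class DbWriter {
    // The singleton lock object every thread synchronizes on.
    private static final Object DB_LOCK = new Object();

    // Called concurrently from the per-source download threads; the
    // synchronized block serializes the HSQLDB updates so the engine
    // only ever sees one writer at a time.
    public static void insertParsedData(Connection conn, String source,
            byte[] payload) throws SQLException {
        synchronized (DB_LOCK) {
            // "parsed_data" and its columns are hypothetical
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO parsed_data (source, payload) VALUES (?, ?)");
            try {
                ps.setString(1, source);
                ps.setBytes(2, payload);
                ps.executeUpdate();
            } finally {
                ps.close();
            }
        }
    }
}
The downloading and parsing stay fully parallel; only the short database write is serialized, which is usually an acceptable price with an embedded engine of this vintage.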

Array in output schema caused exception

I am following this WordCount example using the Google BigQuery-Hadoop connector:
https://developers.google.com/hadoop/writing-with-bigquery-connector#completecode
The example works fine as it is.
To test an array in the output schema, I have altered just one line in the code, adding an array object definition to the output schema:
String outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Number','type': 'INTEGER'},{'name':'Persons','mode':'REPEATED','type':'RECORD','fields':[{'name': 'name','type': 'STRING'},{'name': 'age','type': 'INTEGER'}]}]";
Now when I run the WordCount example, it gives this exception:
java.lang.IllegalStateException
at com.google.gson.JsonArray.getAsString(JsonArray.java:133)
at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.getSchemaFromString(BigQueryUtils.java:97)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getRecordWriter(BigQueryOutputFormat.java:121)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:568)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Does anyone know what the issue is?
Thank you
This is actually a bug in the current version of the BigQuery connector, which prevents it from supporting inner records with more than one field.
We have a fix internally, and it's slated to go out with the next release (0.4.3), which may still be a couple of weeks out; if you'd like to help try out a staging build, feel free to reach out to gcp-hadoop-contact@google.com and we can provide instructions.

What is going wrong with my ETL process?

I'm using GoodData's CloudConnect (based on CloverETL) to read a massive JSON file and write certain elements to a .csv.
Unfortunately, I'm seeing the error pasted below in the console log. Is insufficient memory itself the actual error, or am I running out of memory as a consequence of some other error?
ERROR [WatchDog_0] - Component [JSONReader:JSONREADER1] finished with status ERROR.
Java heap space
ERROR [WatchDog_0] - Error details:
org.jetel.exception.JetelRuntimeException: Component [JSONReader:JSONREADER1] finished with status ERROR.
at org.jetel.graph.Node.createNodeException(Node.java:543)
at org.jetel.graph.Node.run(Node.java:522)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.checkThrownException(TreeReader.java:766)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.manageThread(TreeReader.java:757)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.processInput(TreeReader.java:732)
at org.jetel.component.TreeReader.execute(TreeReader.java:412)
at org.jetel.graph.Node.run(Node.java:493)
... 1 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at net.sf.saxon.tinytree.TinyTree.condense(TinyTree.java:379)
at net.sf.saxon.tinytree.TinyBuilder.close(TinyBuilder.java:177)
at net.sf.saxon.event.ReceivingContentHandler.endDocument(ReceivingContentHandler.java:219)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endDocument(AbstractSAXParser.java:745)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:515)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:404)
at net.sf.saxon.event.Sender.send(Sender.java:193)
at net.sf.saxon.event.Sender.send(Sender.java:50)
at net.sf.saxon.Configuration.buildDocument(Configuration.java:2973)
at net.sf.saxon.sxpath.XPathExpression.evaluate(XPathExpression.java:154)
at org.jetel.component.tree.reader.xml.XmlXPathEvaluator.iterate(XmlXPathEvaluator.java:79)
at org.jetel.component.tree.reader.XPathPushParser.handleContext(XPathPushParser.java:104)
at org.jetel.component.tree.reader.XPathPushParser.parse(XPathPushParser.java:84)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor$PipeParser.work(TreeReader.java:827)
at org.jetel.graph.runtime.CloverWorker.run(CloverWorker.java:87)
... 1 more
This looks like the second case: the error is caused by insufficient memory for your task.
The error occurred while evaluating (one of) your JSONReader component(s).
The JSON seems to be really huge, and you should consider splitting the task into smaller ones if possible.
Did you run your transformation locally or on the GoodData server?
It is really hard to advise anything specific without knowing the details.
Try using JSONExtract instead of JSONReader: it also reads JSON files, but uses less memory.
From the respective help documents:
JSONReader uses DOM, so the whole input is stored in memory and therefore the component can be memory-greedy.
JSONExtract uses SAX instead of DOM, so it uses less memory than JSONReader
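The DOM-versus-SAX difference is the whole story here. As a generic Java sketch of plain XML parsing (not CloverETL's actual internals; "huge.xml" is a stand-in input), this is why the streaming style survives inputs that the tree-building style cannot:
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DomVsSax {
    public static void main(String[] args) throws Exception {
        File input = new File("huge.xml");

        // DOM style (the JSONReader approach): the entire document is
        // materialized in memory before any node can be inspected, so
        // heap use grows with file size -- this is what dies with
        // "Java heap space" on huge inputs.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(input);

        // SAX style (the JSONExtract approach): the parser pushes events
        // as it reads, each element is handled and then discarded, so
        // heap use stays roughly flat regardless of file size.
        SAXParserFactory.newInstance().newSAXParser().parse(input,
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                            String qName, Attributes attributes) {
                        // process one element at a time here
                    }
                });
    }
}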

Error when running Hive 0.9.0: Exception in thread "main" java.lang.NoSuchFieldError: type

Really sorry for the stupid question, but I'm struggling to find an answer. I am trying to start up Hive on my 3-node Hadoop cluster. HDFS runs OK, as do Pig and HBase, but for the life of me I cannot get Hive to run properly.
This is the classpath output:
:/home/hduser/hive-0.9.0/conf:/home/hduser/hive-0.9.0/lib/antlr-runtime-3.0.1.jar:/home/hduser/hive-0.9.0/lib/commons-cli-1.2.jar:/home/hduser/hive-0.9.0/lib/commons-codec-1.3.jar:/home/hduser/hive-0.9.0/lib/commons-collections-3.2.1.jar:/home/hduser/hive-0.9.0/lib/commons-dbcp-1.4.jar:/home/hduser/hive-0.9.0/lib/commons-lang-2.4.jar:/home/hduser/hive-0.9.0/lib/commons-logging-1.0.4.jar:/home/hduser/hive-0.9.0/lib/commons-logging-api-1.0.4.jar:/home/hduser/hive-0.9.0/lib/commons-pool-1.5.4.jar:/home/hduser/hive-0.9.0/lib/datanucleus-connectionpool-2.0.3.jar:/home/hduser/hive-0.9.0/lib/datanucleus-core-2.0.3.jar:/home/hduser/hive-0.9.0/lib/datanucleus-enhancer-2.0.3.jar:/home/hduser/hive-0.9.0/lib/datanucleus-rdbms-2.0.3.jar:/home/hduser/hive-0.9.0/lib/derby-10.4.2.0.jar:/home/hduser/hive-0.9.0/lib/guava-r09.jar:/home/hduser/hive-0.9.0/lib/hadoop-0.20.2-core.jar:/home/hduser/hive-0.9.0/lib/hbase-0.92.0.jar:/home/hduser/hive-0.9.0/lib/hbase-0.92.0-tests.jar:/home/hduser/hive-0.9.0/lib/hive-builtins-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-cli-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-common-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-contrib-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive_contrib.jar:/home/hduser/hive-0.9.0/lib/hive-exec-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-hbase-handler-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-hwi-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-jdbc-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-metastore-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-pdk-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-serde-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-service-0.9.0.jar:/home/hduser/hive-0.9.0/lib/hive-shims-0.9.0.jar:/home/hduser/hive-0.9.0/lib/jackson-core-asl-1.8.8.jar:/home/hduser/hive-0.9.0/lib/jackson-jaxrs-1.8.8.jar:/home/hduser/hive-0.9.0/lib/jackson-mapper-asl-1.8.8.jar:/home/hduser/hive-0.9.0/lib/jackson-xc-1.8.8.jar:/home/hduser/hive-0.9.0/lib/JavaEWAH-0.3.2.jar:/home/hduser/hive-0.9.0/lib/jdo2-api-2.3-ec.jar:/home/hduser/hive-0.9.0/lib/jline-0.9.94.jar:/home/hduser/hive-0.9.0/lib/json-20090211.jar:/home/hduser/hive-0.9.0/lib/libfb303-0.7.0.jar:/home/hduser/hive-0.9.0/lib/libfb303.jar:/home/hduser/hive-0.9.0/lib/libthrift-0.7.0.jar:/home/hduser/hive-0.9.0/lib/libthrift.jar:/home/hduser/hive-0.9.0/lib/log4j-1.2.16.jar:/home/hduser/hive-0.9.0/lib/slf4j-api-1.6.1.jar:/home/hduser/hive-0.9.0/lib/slf4j-log4j12-1.6.1.jar:/home/hduser/hive-0.9.0/lib/stringtemplate-3.1-b1.jar:/home/hduser/hive-0.9.0/lib/zookeeper-3.4.3.jar:
Logging initialized using configuration in file:/home/hduser/hive-0.9.0/conf/hive-log4j.properties
Hive history file=/tmp/hduser/hive_job_log_hduser_201212181716_326152902.txt
and then from the Hive command line I run this:
hive> CREATE TABLE pokes (foo INT, bar STRING);
However, I get the following error:
Exception in thread "main" java.lang.NoSuchFieldError: type
at org.apache.hadoop.hive.ql.parse.HiveLexer.mKW_CREATE(HiveLexer.java:1602)
at org.apache.hadoop.hive.ql.parse.HiveLexer.mTokens(HiveLexer.java:6380)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:89)
at org.antlr.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:133)
at org.antlr.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:127)
at org.antlr.runtime.CommonTokenStream.setup(CommonTokenStream.java:132)
at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:91)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:547)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:438)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Make sure you don't have any antlr-*.jar on your classpath other than the one in the HIVE_HOME/lib folder. If it still doesn't work, download the latest version from the ANTLR website, put it into the HIVE_HOME/lib folder, and give it a try.
HTH
Just remove all the files with the jackson* prefix from Hadoop's lib directory and copy the newer jackson* files over from Hive:
rm /opt/hadoop/hadoop/lib/jackson*
cp /opt/hive/hive/lib/jackson* /opt/hadoop/hadoop/lib
I did it this way, and it worked perfectly fine!
I hope it helps.

Exporting an HSQLDB to XML using DBUnit results in null pointer errors

I'm trying to export the entire contents of my database, an HSQLDB, to XML using DBUnit, and I'm getting null pointer errors that I can't understand. I'm following the example in the FAQ:
IDatabaseConnection xmlConnection = new DatabaseConnection(conn); // wrap the JDBC connection for DBUnit
IDataSet allTables = xmlConnection.createDataSet(); // a dataset spanning every table
XmlDataSet.write(allTables, new FileOutputStream(DATABASE_PATH + ".xml"));
The null pointer error occurs on the last line. conn and DATABASE_PATH aren't null; both are checked for that and used later in the program without a problem (exporting the database to CSV using OpenCSV, which works exactly as expected).
The stacktrace is as follows:
org.dbunit.dataset.DataSetException: java.sql.SQLException: java.lang.NullPointerException java.lang.NullPointerException
at org.dbunit.database.DatabaseDataSet.initialize(DatabaseDataSet.java:243)
at org.dbunit.database.DatabaseDataSet.getTableNames(DatabaseDataSet.java:272)
at org.dbunit.database.DatabaseDataSet.createIterator(DatabaseDataSet.java:258)
at org.dbunit.dataset.AbstractDataSet.iterator(AbstractDataSet.java:189)
at org.dbunit.dataset.stream.DataSetProducerAdapter.<init>(DataSetProducerAdapter.java:63)
at org.dbunit.dataset.xml.XmlDataSetWriter.write(XmlDataSetWriter.java:128)
at org.dbunit.dataset.xml.XmlDataSet.write(XmlDataSet.java:104)
at org.dbunit.dataset.xml.XmlDataSet.write(XmlDataSet.java:91)
at pms.DatabaseExporter.exportToXML(DatabaseExporter.java:181)
at pms.DatabaseExporter.main(DatabaseExporter.java:301)
Caused by: java.sql.SQLException: java.lang.NullPointerException java.lang.NullPointerException
at org.hsqldb.jdbc.Util.sqlException(Util.java:224)
at org.hsqldb.jdbc.JDBCStatement.fetchResult(JDBCStatement.java:1830)
at org.hsqldb.jdbc.JDBCStatement.executeQuery(JDBCStatement.java:181)
at org.hsqldb.jdbc.JDBCDatabaseMetaData.execute(JDBCDatabaseMetaData.java:6150)
at org.hsqldb.jdbc.JDBCDatabaseMetaData.getTables(JDBCDatabaseMetaData.java:3170)
at org.dbunit.database.DefaultMetadataHandler.getTables(DefaultMetadataHandler.java:137)
at org.dbunit.database.DatabaseDataSet.initialize(DatabaseDataSet.java:199)
... 9 more
Caused by: org.hsqldb.HsqlException: java.lang.NullPointerException
at org.hsqldb.error.Error.error(Error.java:108)
at org.hsqldb.result.Result.newErrorResult(Result.java:1069)
at org.hsqldb.StatementDMQL.execute(StatementDMQL.java:192)
at org.hsqldb.Session.executeCompiledStatement(Session.java:1315)
at org.hsqldb.Session.executeDirectStatement(Session.java:1206)
at org.hsqldb.Session.execute(Session.java:990)
at org.hsqldb.jdbc.JDBCStatement.fetchResult(JDBCStatement.java:1822)
... 14 more
Caused by: java.lang.NullPointerException
at org.hsqldb.types.CharacterType.compare(CharacterType.java:418)
at org.hsqldb.index.IndexAVL.compareRowForInsertOrDelete(IndexAVL.java:617)
at org.hsqldb.index.IndexAVLMemory.insert(IndexAVLMemory.java:214)
at org.hsqldb.persist.RowStoreAVL.indexRow(RowStoreAVL.java:171)
at org.hsqldb.persist.RowStoreAVLHybridExtended.indexRow(RowStoreAVLHybridExtended.java:99)
at org.hsqldb.Table.insertSys(Table.java:2625)
at org.hsqldb.dbinfo.DatabaseInformationMain.SYSTEM_TABLES(DatabaseInformationMain.java:2353)
at org.hsqldb.dbinfo.DatabaseInformationMain.generateTable(DatabaseInformationMain.java:348)
at org.hsqldb.dbinfo.DatabaseInformationFull.generateTable(DatabaseInformationFull.java:379)
at org.hsqldb.dbinfo.DatabaseInformationMain.setStore(DatabaseInformationMain.java:507)
at org.hsqldb.persist.PersistentStoreCollectionSession.getStore(PersistentStoreCollectionSession.java:138)
at org.hsqldb.Table.getRowStore(Table.java:2817)
at org.hsqldb.RangeVariable$RangeIteratorMain.<init>(RangeVariable.java:939)
at org.hsqldb.RangeVariable$RangeIteratorMain.<init>(RangeVariable.java:917)
at org.hsqldb.RangeVariable.getIterator(RangeVariable.java:770)
at org.hsqldb.QuerySpecification.buildResult(QuerySpecification.java:1293)
at org.hsqldb.QuerySpecification.getSingleResult(QuerySpecification.java:1245)
at org.hsqldb.QuerySpecification.getResult(QuerySpecification.java:1235)
at org.hsqldb.StatementQuery.getResult(StatementQuery.java:66)
at org.hsqldb.StatementDMQL.execute(StatementDMQL.java:190)
... 18 more
I've googled and couldn't find anything relating to this kind of error during export. I'm not that experienced with SQL or JDBC, so I'm hoping there's enough info in the stack trace for someone more knowledgeable to tell me what's going wrong. If some other library would suit my needs better, I have no problem switching... All I need right now is XML export/import, so I'm not using DBUnit for anything else. If anybody can tell me what's going wrong, or whether I ought to be using something else, I'd really appreciate it.
This is an error in that particular version of HSQLDB's system-table creation, which was spotted and corrected recently. You can try the updated HSQLDB jar from http://hsqldb.org/support/