Spark Redshift saving into s3 as Parquet - amazon-s3

I'm having issues saving a Redshift table to S3 as a Parquet file. The error below is coming from the date field. For now I'm going to try converting the column to a long and storing it as a Unix timestamp (see the sketch at the end of the question).
Caused by: java.lang.NumberFormatException: multiple points
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1110)
at java.lang.Double.parseDouble(Double.java:540)
at java.text.DigitList.getDouble(DigitList.java:168)
at java.text.DecimalFormat.parse(DecimalFormat.java:1321)
at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1793)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1455)
at com.databricks.spark.redshift.Conversions$$anon$1.parse(Conversions.scala:54)
at java.text.DateFormat.parse(DateFormat.java:355)
at com.databricks.spark.redshift.Conversions$.com$databricks$spark$redshift$Conversions$$parseTimestamp(Conversions.scala:67)
at com.databricks.spark.redshift.Conversions$$anonfun$1.apply(Conversions.scala:122)
at com.databricks.spark.redshift.Conversions$$anonfun$1.apply(Conversions.scala:108)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at com.databricks.spark.redshift.Conversions$.com$databricks$spark$redshift$Conversions$$convertRow(Conversions.scala:108)
at com.databricks.spark.redshift.Conversions$$anonfun$createRowConverter$1.apply(Conversions.scala:135)
at com.databricks.spark.redshift.Conversions$$anonfun$createRowConverter$1.apply(Conversions.scala:135)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:241)
... 8 more
These are my gradle dependencies:
dependencies {
    compile 'com.amazonaws:aws-java-sdk:1.10.31'
    compile 'com.amazonaws:aws-java-sdk-redshift:1.10.31'
    compile 'org.apache.spark:spark-core_2.10:1.5.1'
    compile 'org.apache.spark:spark-sql_2.10:1.5.1'
    compile 'com.databricks:spark-redshift_2.10:0.5.1'
    compile 'com.fasterxml.jackson.module:jackson-module-scala_2.10:2.6.3'
}
EDIT 1: df.write.parquet("s3n://bucket/path/log.parquet") is how I'm saving the DataFrame after loading the Redshift data with spark-redshift.
EDIT 2: I'm running all of this on my MacBook Air; maybe too much data corrupts the DataFrame? Not sure. It works when I add LIMIT 1000, just not for the entire table, so the "query" option works but the "table" option doesn't in the spark-redshift parameters.
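For reference, the interim workaround mentioned at the top (storing the date as a Unix epoch so spark-redshift never has to parse a timestamp string) could look roughly like the sketch below. It is only an illustration, shown in PySpark for brevity (the same options apply from the Scala API); the column and table names (event_ts, my_table) and the JDBC URL are placeholders.
# Ask Redshift for the timestamp as epoch seconds via the "query" option,
# then write the result straight to Parquet on S3.
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>") \
    .option("tempdir", "s3n://bucket/tmp/") \
    .option("query",
            "SELECT EXTRACT(EPOCH FROM event_ts)::bigint AS event_ts_epoch, other_col "
            "FROM my_table") \
    .load()

df.write.parquet("s3n://bucket/path/log.parquet")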

spark-redshift maintainer here. I believe that the error that you're seeing is caused by a thread-safety bug in spark-redshift (Java DecimalFormat instances are not thread-safe and we were sharing a single instance across multiple threads).
This has been fixed in the 0.5.2 release, which is available on Maven Central and Spark Packages. Upgrade to 0.5.2 and this should work!
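For a Gradle setup like the one in the question, the upgrade is a one-line change; only the spark-redshift coordinate moves, everything else can stay as-is:
dependencies {
    // bump only this artifact; the other dependencies from the original block are unchanged
    compile 'com.databricks:spark-redshift_2.10:0.5.2'
}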

Related

Migration script fails with IllegalStateException due to SHADOW_TABLE_NAME_SUFFIXES

I've updated the Room version from 2.4.3 to 2.5.0-alpha03, and after the last migration the generated schema JSON will once in a while fail with:
Caused by: java.lang.IllegalStateException: Cannot parse existing schema file: C:\mypath\com.example.MyDatabase\74.json. If you've modified the file, you might've broken the JSON format, try deleting the file and re-running the compiler.
If you've not modified the file, please file a bug at
https://issuetracker.google.com/issues/new?component=413107&template=1096568
with a sample app to reproduce the issue.
at androidx.room.vo.Database.exportSchema(Database.kt:111)
at androidx.room.DatabaseProcessingStep.process(DatabaseProcessingStep.kt:123)
at androidx.room.compiler.processing.CommonProcessorDelegate.processRound(XBasicAnnotationProcessor.kt:123)
at androidx.room.compiler.processing.javac.JavacBasicAnnotationProcessor.process(JavacBasicAnnotationProcessor.kt:71)
at org.jetbrains.kotlin.kapt3.base.incremental.IncrementalProcessor.process(incrementalProcessors.kt:90)
at org.jetbrains.kotlin.kapt3.base.ProcessorWrapper.process(annotationProcessing.kt:197)
at jdk.compiler/com.sun.tools.javac.processing.JavacProcessingEnvironment.callProcessor(JavacProcessingEnvironment.java:985) ... 44 more
After checking the differences between the last schema file, 73.json, and the new one, 74.json, apart from the changes I intended to make, there's this block:
"SHADOW_TABLE_NAME_SUFFIXES": [
"_content",
"_segdir",
"_segments",
"_stat",
"_docsize"
],
"shadowTableNames$delegate": {
"initializer": {},
"_value": {}
},
inside the only ftsVersion block I have. Whatever I write in the migration script doesn't matter; I always get the same issue. What I've found is that SHADOW_TABLE_NAME_SUFFIXES is a static variable from androidx.room.migration.bundle.FtsEntityBundle, and if I delete this block from 74.json, I don't get the issue anymore.
Can anyone help me with more info on this and why it could pop up in the schema file?
I've posted a bug report as the stack trace advises, and it appears to be an issue introduced in Room 2.5.0-alpha02 and 2.5.0-alpha03 that will be fixed: https://issuetracker.google.com/issues/246751839
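Until that fix is released, a possible interim workaround (besides deleting the block from 74.json as described above) is to pin Room back to the last version that worked, e.g. 2.4.3, in build.gradle. This is only a sketch, assuming the standard room-runtime/room-compiler artifacts and kapt, as suggested by the stack trace:
dependencies {
    // roll back to the version reported as working until the alpha regression is fixed
    implementation "androidx.room:room-runtime:2.4.3"
    kapt "androidx.room:room-compiler:2.4.3"
}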

nutch 1.16 crawl example from NutchTutorial returns NoSuchMethodError on org.apache.commons.cli.OptionBuilder (Windows 10)

I have been trying to run a Nutch 1.16 crawler using the code example and instructions from https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial, but no matter what, I seem to get stuck when initiating the actual crawl.
I'm running it through Cygwin64 on a Windows 10 machine, using a binary installation (though I have tried compiling one, with the same results). Initially, Nutch would throw an UnsatisfiedLinkError (NativeIO$Windows.access0), which I fixed by adding libraries from several other answers for the same issue. Having done so, I could at least start a server, but trying to crawl through Nutch itself returns a NoSuchMethodError no matter what I do. nutch-site.xml only contains the http.agent.name and plugin.includes options, both taken from the same example.
The following is the error message (I also tried to omit seed.txt):
$ bin/nutch inject crawl/crawldb urls/seed.txt
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.cli.OptionBuilder.withArgPattern(Ljava/lang/String;I)Lorg/apache/commons/cli/OptionBuilder;
at org.apache.hadoop.util.GenericOptionsParser.buildGeneralOptions(GenericOptionsParser.java:207)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:370)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at org.apache.nutch.crawl.Injector.main(Injector.java:534)
The following is the list of libraries currently present in the lib directory:
activation-1.1.jar
amqp-client-5.2.0.jar
animal-sniffer-annotations-1.14.jar
antlr-runtime-3.5.2.jar
antlr4-4.5.1.jar
aopalliance-1.0.jar
apache-nutch-1.16.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
api-asn1-api-1.0.0-M20.jar
api-util-1.0.0-M20.jar
args4j-2.0.16.jar
ascii-utf-themes-0.0.1.jar
asciitable-0.3.2.jar
asm-3.3.1.jar
asm-7.1.jar
avro-1.7.7.jar
bootstrap-3.0.3.jar
cglib-2.2.1-v20090111.jar
cglib-2.2.2.jar
char-translation-0.0.2.jar
checker-compat-qual-2.0.0.jar
closure-compiler-v20130603.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.8.0.jar
commons-cli-1.2-sources.jar
commons-cli-1.2.jar
commons-codec-1.11.jar
commons-collections-3.2.2.jar
commons-collections4-4.2.jar
commons-compress-1.18.jar
commons-configuration-1.6.jar
commons-daemon-1.0.13.jar
commons-digester-1.8.jar
commons-el-1.0.jar
commons-httpclient-3.1.jar
commons-io-2.4.jar
commons-jexl-2.1.1.jar
commons-lang-2.6.jar
commons-lang3-3.8.1.jar
commons-logging-1.1.3.jar
commons-math3-3.1.1.jar
commons-net-3.1.jar
crawler-commons-1.0.jar
curator-client-2.7.1.jar
curator-framework-2.7.1.jar
curator-recipes-2.7.1.jar
cxf-core-3.3.3.jar
cxf-rt-bindings-soap-3.3.3.jar
cxf-rt-bindings-xml-3.3.3.jar
cxf-rt-databinding-jaxb-3.3.3.jar
cxf-rt-frontend-jaxrs-3.3.3.jar
cxf-rt-frontend-jaxws-3.3.3.jar
cxf-rt-frontend-simple-3.3.3.jar
cxf-rt-security-3.3.3.jar
cxf-rt-transports-http-3.3.3.jar
cxf-rt-transports-http-jetty-3.3.3.jar
cxf-rt-ws-addr-3.3.3.jar
cxf-rt-ws-policy-3.3.3.jar
cxf-rt-wsdl-3.3.3.jar
dom4j-1.6.1.jar
ehcache-3.3.1.jar
elasticsearch-0.90.1.jar
error_prone_annotations-2.1.3.jar
FastInfoset-1.2.16.jar
geronimo-jcache_1.0_spec-1.0-alpha-1.jar
gora-hbase-0.3.jar
gson-2.2.4.jar
guava-25.0-jre.jar
guice-3.0.jar
guice-servlet-3.0.jar
h2-1.4.197.jar
hadoop-0.20.0-ant.jar
hadoop-0.20.0-core.jar
hadoop-0.20.0-examples.jar
hadoop-0.20.0-test.jar
hadoop-0.20.0-tools.jar
hadoop-annotations-2.9.2.jar
hadoop-auth-2.9.2.jar
hadoop-common-2.9.2.jar
hadoop-core-1.2.1.jar
hadoop-core_0.20.0.xml
hadoop-core_0.21.0.xml
hadoop-core_0.22.0.xml
hadoop-hdfs-2.9.2.jar
hadoop-hdfs-client-2.9.2.jar
hadoop-mapreduce-client-common-2.2.0.jar
hadoop-mapreduce-client-common-2.9.2.jar
hadoop-mapreduce-client-core-2.2.0.jar
hadoop-mapreduce-client-core-2.9.2.jar
hadoop-mapreduce-client-jobclient-2.2.0.jar
hadoop-mapreduce-client-jobclient-2.9.2.jar
hadoop-mapreduce-client-shuffle-2.2.0.jar
hadoop-mapreduce-client-shuffle-2.9.2.jar
hadoop-yarn-api-2.9.2.jar
hadoop-yarn-client-2.9.2.jar
hadoop-yarn-common-2.9.2.jar
hadoop-yarn-registry-2.9.2.jar
hadoop-yarn-server-common-2.9.2.jar
hadoop-yarn-server-nodemanager-2.9.2.jar
hbase-0.90.0-tests.jar
hbase-0.90.0.jar
hbase-0.92.1.jar
hbase-client-0.98.0-hadoop2.jar
hbase-common-0.98.0-hadoop2.jar
hbase-protocol-0.98.0-hadoop2.jar
HikariCP-java7-2.4.12.jar
htmlparser-1.6.jar
htrace-core-2.04.jar
htrace-core4-4.1.0-incubating.jar
httpclient-4.5.6.jar
httpcore-4.4.9.jar
httpcore-nio-4.4.9.jar
icu4j-61.1.jar
istack-commons-runtime-3.0.8.jar
j2objc-annotations-1.1.jar
jackson-annotations-2.9.9.jar
jackson-core-2.9.9.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.9.9.jar
jackson-dataformat-cbor-2.9.9.jar
jackson-jaxrs-1.9.13.jar
jackson-jaxrs-base-2.9.9.jar
jackson-jaxrs-json-provider-2.9.9.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-jaxb-annotations-2.9.9.jar
jackson-xc-1.9.13.jar
jakarta.activation-api-1.2.1.jar
jakarta.ws.rs-api-2.1.5.jar
jakarta.xml.bind-api-2.3.2.jar
jasper-compiler-5.5.12.jar
jasper-runtime-5.5.12.jar
java-xmlbuilder-0.4.jar
javassist-3.12.1.GA.jar
javax.annotation-api-1.3.2.jar
javax.inject-1.jar
javax.persistence-2.2.0.jar
javax.servlet-api-3.1.0.jar
jaxb-api-2.2.2.jar
jaxb-impl-2.2.3-1.jar
jaxb-runtime-2.3.2.jar
jcip-annotations-1.0-1.jar
jersey-client-1.19.4.jar
jersey-core-1.9.jar
jersey-guice-1.9.jar
jersey-json-1.9.jar
jersey-server-1.9.jar
jets3t-0.9.0.jar
jettison-1.1.jar
jetty-6.1.26.jar
jetty-client-6.1.22.jar
jetty-continuation-9.4.19.v20190610.jar
jetty-http-9.4.19.v20190610.jar
jetty-io-9.4.19.v20190610.jar
jetty-security-9.4.19.v20190610.jar
jetty-server-9.4.19.v20190610.jar
jetty-sslengine-6.1.26.jar
jetty-util-6.1.26.jar
jetty-util-9.4.19.v20190610.jar
joda-time-2.3.jar
jquery-2.0.3-1.jar
jquery-selectors-0.0.3.jar
jquery-ui-1.10.2-1.jar
jquerypp-1.0.1.jar
jsch-0.1.54.jar
json-smart-1.3.1.jar
jsp-2.1-6.1.14.jar
jsp-api-2.1-6.1.14.jar
jsp-api-2.1.jar
jsr305-3.0.0.jar
junit-3.8.1.jar
juniversalchardet-1.0.3.jar
leveldbjni-all-1.8.jar
log4j-1.2.17.jar
lucene-analyzers-common-4.3.0.jar
lucene-codecs-4.3.0.jar
lucene-core-4.3.0.jar
lucene-grouping-4.3.0.jar
lucene-highlighter-4.3.0.jar
lucene-join-4.3.0.jar
lucene-memory-4.3.0.jar
lucene-queries-4.3.0.jar
lucene-queryparser-4.3.0.jar
lucene-sandbox-4.3.0.jar
lucene-spatial-4.3.0.jar
lucene-suggest-4.3.0.jar
maven-parent-config-0.3.4.jar
metrics-core-3.0.1.jar
modernizr-2.6.2-1.jar
mssql-jdbc-6.2.1.jre7.jar
neethi-3.1.1.jar
netty-3.6.2.Final.jar
netty-all-4.0.23.Final.jar
nimbus-jose-jwt-4.41.1.jar
okhttp-2.7.5.jar
okio-1.6.0.jar
org.apache.commons.cli-1.2.0.jar
ormlite-core-5.1.jar
ormlite-jdbc-5.1.jar
oro-2.0.8.jar
paranamer-2.3.jar
protobuf-java-2.5.0.jar
reflections-0.9.8.jar
servlet-api-2.5-20081211.jar
servlet-api-2.5.jar
skb-interfaces-0.0.1.jar
slf4j-api-1.7.26.jar
slf4j-log4j12-1.7.25.jar
snappy-java-1.0.5.jar
spatial4j-0.3.jar
spring-aop-4.0.9.RELEASE.jar
spring-beans-4.0.9.RELEASE.jar
spring-context-4.0.9.RELEASE.jar
spring-core-4.0.9.RELEASE.jar
spring-expression-4.0.9.RELEASE.jar
spring-web-4.0.9.RELEASE.jar
ST4-4.0.8.jar
stax-api-1.0-2.jar
stax-ex-1.8.1.jar
stax2-api-3.1.4.jar
t-digest-3.2.jar
tika-core-1.22.jar
txw2-2.3.2.jar
typeaheadjs-0.9.3.jar
warc-hadoop-0.1.0.jar
webarchive-commons-1.1.5.jar
wicket-bootstrap-core-0.9.2.jar
wicket-bootstrap-extensions-0.9.2.jar
wicket-core-6.17.0.jar
wicket-extensions-6.13.0.jar
wicket-ioc-6.17.0.jar
wicket-request-6.17.0.jar
wicket-spring-6.17.0.jar
wicket-util-6.17.0.jar
wicket-webjars-0.4.0.jar
woodstox-core-5.0.3.jar
wsdl4j-1.6.3.jar
xercesImpl-2.12.0.jar
xml-apis-1.4.01.jar
xml-resolver-1.2.jar
xmlenc-0.52.jar
xmlParserAPIs-2.6.2.jar
xmlschema-core-2.2.4.jar
zookeeper-3.4.6.jar
This is my java version:
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)
I'd also like to point out that, despite what another answer may have said, nutch 1.4 (or any other version of nutch for that matter) did NOT resolve the issue, at least on Windows.
EDIT: The answer immediately below worked for me, but I've kept the original (outdated) one as well, because it may still be useful to someone working with other versions of Nutch.
Again, thanks to Sebastian Nagel: to get around the NoSuchMethodError, edit ivy\ivy.xml to reference a different version of the Hadoop libraries (see the illustrative fragment below). In my case I installed Hadoop 3.1.3 and also added the corresponding 3.1.3 versions of winutils.exe and hadoop.dll to the hadoop\bin directory referenced by HADOOP_HOME. After that, bin/crawl seems to work correctly.
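The exact edit depends on your checkout, but it amounts to raising the rev attribute on the Hadoop dependencies in ivy/ivy.xml. The fragment below is illustrative only (not a verbatim copy of Nutch's ivy.xml); the artifact names correspond to the hadoop-*-2.9.2 jars listed above:
<!-- ivy/ivy.xml: bump each Hadoop artifact from 2.9.2 to the version you installed, e.g. 3.1.3 -->
<dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3" conf="*->default"/>
<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3" conf="*->default"/>
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.1.3" conf="*->default"/>
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default"/>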
Outdated answer: Okay, after working on the source code itself (courtesy of https://github.com/apache/commons-cli) at the suggestion of Sebastian Nagel, I was able to find the (very simple) implementation of the method (https://github.com/marcelmaatkamp/EntityExtractorUtils/blob/master/src/main/java/org/apache/commons/cli/OptionBuilder.java):
/**
 * The next Option created will have an argument pattern and
 * a limit on the number of pattern occurrences.
 *
 * @param argPattern string representing a pattern regex
 * @param limit      the number of pattern occurrences in the argument
 * @return the OptionBuilder instance
 */
public static OptionBuilder withArgPattern(String argPattern, int limit)
{
    OptionBuilder.argPattern = argPattern;
    OptionBuilder.limit = limit;
    return instance; // the shared static builder instance, following the usual OptionBuilder pattern
}
Using Maven, I was then able to compile the code into its own jar file, which I then added to the lib folder of Apache Nutch.
This still did not completely resolve my problem, as there seem to be deprecated functions in use throughout the Nutch framework, which will probably mean even more work under similar circumstances (for instance, right after using the new jar I got a NoSuchMethodError for org.apache.hadoop.mapreduce.Job.getInstance).
I'm leaving this answer here as a temporary solution for anyone who has gotten stuck on the same issue, but I do wish there were an easier way of finding out which methods appear in which jar file without exploring their entire file structure, although I may simply be unaware of one.

Flink s3 read error: Data read has a different length than the expected

We're using Flink 1.7.0 (also seen on Flink 1.8.0) and are getting frequent but somewhat random errors when reading gzipped objects from S3 through the Flink .readFile source:
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Data read has a different length than the expected: dataLength=9713156; expectedLength=9770429; includeSkipped=true; in.getClass()=class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client$2; markedSupported=false; marked=0; resetSinceLastMarked=false; markCount=0; resetCount=0
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.checkLength(LengthCheckInputStream.java:151)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:93)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:76)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.closeStream(S3AInputStream.java:529)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:490)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.flink.fs.s3.common.hadoop.HadoopDataInputStream.close(HadoopDataInputStream.java:89)
at java.util.zip.InflaterInputStream.close(InflaterInputStream.java:227)
at java.util.zip.GZIPInputStream.close(GZIPInputStream.java:136)
at org.apache.flink.api.common.io.InputStreamFSInputWrapper.close(InputStreamFSInputWrapper.java:46)
at org.apache.flink.api.common.io.FileInputFormat.close(FileInputFormat.java:861)
at org.apache.flink.api.common.io.DelimitedInputFormat.close(DelimitedInputFormat.java:536)
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator$SplitReader.run(ContinuousFileReaderOperator.java:336)
Within a given job, most of the files generally read successfully, but there is almost always at least one failure (say, out of 50 files).
It seems this error is actually originating from the AWS client, so perhaps flink has nothing to do with it, but I'm hopeful someone might have an insight as to how to make this work reliably.
When the error occurs, it ends up killing the source and canceling all the connected operators. I'm still new to flink, but I would think that this is something that could be recoverable from a previous snapshot? Should I expect that flink will retry reading the file when this kind of exception occurs?
Maybe you can try allowing more connections for s3a, for example:
flink:
  ...
  config: |
    fs.s3a.connection.maximum: 320
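If you're not deploying through a Helm-style values file like the one above, the same s3a option can go straight into flink-conf.yaml. This is a sketch assuming the flink-s3-fs-hadoop (s3a) filesystem, which is what the shaded classes in the stack trace suggest:
# flink-conf.yaml
fs.s3a.connection.maximum: 320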

Not able to save large spark dataframe as pickle

I have a large DataFrame (a little more than 20 GB) and am trying to save it as a pickle object to be used later in another process.
I have tried different configurations; below is the latest one.
executor_cores=4
executor_memory='20g'
driver_memory='40g'
deploy_mode='client'
max_executors_dynamic='spark.dynamicAllocation.maxExecutors=400'
num_executors_static=300
spark_driver_memoryOverhead='5g'
spark_executor_memoryOverhead='2g'
spark_driver_maxResultSize='8g'
spark_kryoserializer_buffer_max='1g'
Note: I cannot increase spark_driver_maxResultSize beyond 8 GB.
I have also tried saving the DataFrame as HDFS files and then saving those as pickle, but I get the same error message as before.
My understanding is that when we use toPandas() and pickle, all the data is brought to the driver and the pickle object is created there. Since the data size is larger than spark.driver.maxResultSize, the code fails. (The code worked earlier for 2 GB of data.)
Is there any workaround for this problem?
big_data_frame.toPandas().to_pickle('{}/result_file_01.pickle'.format(result_dir))
big_data_frame.write.save('{}/result_file_01.pickle'.format(result_dir), format='parquet', mode='append')
df_to_pickel=sqlContext.read.format('parquet').load(file_path)
df_to_pickel.toPandas().to_pickle('{}/scoring__{}.pickle'.format(afs_dir, rd.strftime('%Y%m%d')))
Error message
Py4JJavaError: An error occurred while calling o1638.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 955 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
Saving as a pickle file is an RDD method in Spark, not a DataFrame method. To save your frame using pickle, run
big_data_frame.rdd.saveAsPickleFile(filename)
If you are working with big data, it is never a good idea to run either collect or toPandas in Spark, as they pull everything into driver memory and can crash the system. I would suggest using Parquet or another file format for saving your data, since the RDD API is in maintenance mode, which means Spark is not actively introducing new features to it (a sketch follows at the end of this answer).
To read the file, try
pickle_rdd = sc.pickleFile(filename).collect()
df = spark.createDataFrame(pickle_rdd)
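For the Parquet route suggested above, a sketch using the names from the question (nothing is collected to the driver, so spark.driver.maxResultSize never comes into play):
# Write straight from the distributed DataFrame; no driver-side collect involved.
big_data_frame.write.mode('overwrite').parquet('{}/result_file_01'.format(result_dir))

# The downstream process can then read it back lazily instead of unpickling a 20 GB blob.
df_later = sqlContext.read.parquet('{}/result_file_01'.format(result_dir))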

Accumulo-Pig error - Connector info for AccumuloInputFormat can only be set once per job

Versions:
Accumulo 1.5
Pig 0.10
Attempted:
Read/write data in/into Accumulo from Pig, using accumulo-pig.
Encountered an error - any insight into getting past this error is greatly appreciated.
Switching to Accumulo 1.4 is not an option as we are using the Accumulo Thrift Proxy in our C# codebase.
Impact:
This is currently a roadblock in our project.
Source reference:
Source code - https://git-wip-us.apache.org/repos/asf/accumulo-pig.git
Error:
In attempting to read a dataset from Accumulo via Pig, I am getting the following error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118:
Connector info for AccumuloInputFormat can only be set once per job
Code snippet:
DATA = LOAD 'accumulo://departments?instance=indra&user=root&password=xxxxxxx&zookeepers=cdh-dn01:2181' using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);
dump DATA;
Try using the ACCUMULO-1783-1.5 branch from the same repository. The way that Pig sets up the InputFormat doesn't play nicely with how Accumulo sets up InputFormats (notably, Accumulo makes a funny assertion that you never call the same static method more than once for a given Configuration).
I have been using Pig 0.12; I doubt there's a difference in how 0.10 sets up the InputFormats compared to 0.12, but I'm not positive, so YMMV.
I just pushed a fix to the above branch that gets rid of the previously mentioned limitation on Hadoop version.
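If it helps, checking out and building that branch is roughly the following (this assumes a standard Maven build; adjust to the project's actual build instructions):
git clone https://git-wip-us.apache.org/repos/asf/accumulo-pig.git
cd accumulo-pig
git checkout ACCUMULO-1783-1.5
mvn clean package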