NiFi PutHive3Streaming - writing to partitioned tables - hive

I am using NiFi 1.7.1 to write to a partitioned Hive table. Although the data is streamed successfully, I see several messages like the following in the Hive metastore log:
2018-10-29T17:09:40,682 ERROR [pool-10-thread-198]: metastore.RetryingHMSHandler (RetryingHMSHandler.java:invokeInternal(201)) - AlreadyExistsException(message:Partition already exists: Partition(values:[2018, 3, 28], dbName:default, tableName:myTable, createTime:0, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:type, type:string, comment:null), FieldSchema(name:id, type:string, comment:null), FieldSchema(name:referenced_event_id, type:string, comment:null), FieldSchema(name:happened, type:string, comment:null), FieldSchema(name:processed, type:string, comment:null), FieldSchema(name:tracking_id, type:string, comment:null), FieldSchema(name:source_attributes, type:struct<id:string,origin:string,data:map<string,string>,external_data:map<string,string>>, comment:null), FieldSchema(name:event_data, type:struct<service:struct<name:string,version:string>,result:struct<mno:string,mvno:string,mcc:string,mnc:string,country:string>>, comment:null)], location:hdfs://node-master:8020/user/hive/warehouse/myTable/year=2018/month=3/day=28, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:6, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[tracking_id], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:null, catName:hive))
I have tried this with:
"hive3-stream-part-vals": "${year},${month},${day}",
"hive3-stream-autocreate-partition": "false",
and also with
"hive3-stream-autocreate-partition": "true",
Does anyone have a clear idea on why these errors are being logged?

I think you're running into https://issues.apache.org/jira/browse/HIVE-18931. What is your setting for the processor's Max Concurrent Tasks property? If it is greater than 1, can you try setting it to 1 and see if you still get the message? If it is 1, are multiple clients (NiFi, beeline, etc.) trying to write to that table at the same time?

Related

Kafka-connect without schema registry

I have a Kafka topic that I would like to feed with Avro data (it currently carries JSON). I know the "proper" way to do it is to use Schema Registry, but for testing purposes I would like to make it work without it.
So I am sending Avro data as Array[Byte] instead of regular JSON objects:
import java.io.{ByteArrayOutputStream, File}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.avro.io.EncoderFactory

// the writer takes a parsed Schema, not the .avsc file name; GenericDatumWriter matches GenericData.Record
val schema = new Schema.Parser().parse(new File("mySchema.avsc"))
val writer = new GenericDatumWriter[GenericData.Record](schema)
val out = new ByteArrayOutputStream
val encoder = EncoderFactory.get.binaryEncoder(out, null)
writer.write(myAvroData, encoder)
encoder.flush()
out.close()
out.toByteArray
The schema is embedded in each message; how can I make this work with Kafka Connect? The Kafka Connect configuration currently has the following properties (data is written to S3 as json.gz files), and I want to write Parquet files instead:
{
  "name": "someName",
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "tasks.max": "120",
  "topics": "user_sync",
  "s3.region": "someRegion",
  "s3.bucket.name": "someBucket",
  "s3.part.size": "5242880",
  "s3.compression.type": "gzip",
  "filename.offset.zero.pad.width": "20",
  "flush.size": "5000",
  "rotate.interval.ms": "600000",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "path.format": "YYYY/MM/dd/HH",
  "timezone": "UTC",
  "locale": "en",
  "partition.duration.ms": "600000",
  "timestamp.extractor": "RecordField",
  "timestamp.field": "ts",
  "schema.compatibility": "NONE"
}
I suppose I need to change "format.class" to "io.confluent.connect.hdfs.parquet.ParquetFormat"? But is that enough?
Thanks a lot!
JsonConverter will be unable to consume Avro-encoded data, since the binary format contains a schema ID from the registry that needs to be extracted before the converter can determine what the data looks like.
You'll want to use the registryless-avro-converter, which will create a Connect Struct object that can then be converted to a Parquet record.
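For reference, a rough sketch of the connector properties that would change, assuming the community registryless-avro-converter's class name and schema.path setting and the S3 connector's Parquet format class; these names and the minimum connector version that ships ParquetFormat are assumptions to verify against your setup:
# assumed class and property names; check the registryless-avro-converter docs and your kafka-connect-s3 version
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
value.converter=me.frmr.kafka.connect.RegistrylessAvroConverter
value.converter.schema.path=/path/to/mySchema.avsc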

Data Loss in Kafka S3 Connector

We are using the Kafka S3 connector for our log pipeline, as it guarantees exactly-once semantics. However, we have experienced two data-loss events on different topics. We found a suspicious error message in the Kafka Connect worker's log, shown below.
[2019-04-10 08:56:22,388] ERROR WorkerSinkTask{id=s3-sink-common-log-4} Commit of offsets threw an unexpected exception for sequence number 2721: {logging_common_log-9=OffsetAndMetadata{offset=4485661604, metadata=''}, logging_common_log-8=OffsetAndMetadata{offset=4485670359, metadata=''}} (org.apache.kafka.connect.runtime.WorkerSinkTask:260)
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:808)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doCommitOffsetsAsync(ConsumerCoordinator.java:641)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:608)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitAsync(KafkaConsumer.java:1486)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommitAsync(WorkerSinkTask.java:352)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommit(WorkerSinkTask.java:363)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:432)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:209)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The connector and worker configs are:
{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "flush.size": "999999999",
  "rotate.schedule.interval.ms": "60000",
  "retry.backoff.ms": "5000",
  "s3.part.retries": "3",
  "s3.retry.backoff.ms": "200",
  "s3.part.size": "26214400",
  "tasks.max": "3",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "schema.compatibility": "NONE",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "timestamp.extractor": "Record",
  "partition.duration.ms": "3600000",
  "path.format": "YYYY/MM/dd/HH",
  "timezone": "America/Los_Angeles",
  "locale": "US",
  "append.late.data": "false",
  ...
},
and
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
offset.storage.partitions=25
config.storage.topic=connect-configs
config.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.replication.factor=3
status.storage.partitions=5
rest.port=8083
plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
plugin.path=/usr/share/java
The questions are:
1. What's the root cause?
2. How to prevent it?
3. How to reproduce it?
Thank you very much for any hints/advice/similar experience!
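The exception text itself points at the usual mitigation: the sink task spent longer than max.poll.interval.ms between polls (for example while flushing a large batch to S3), so the group rebalanced and the offset commit was rejected. A hedged sketch of worker-level consumer overrides along those lines (the values are illustrative, not tuned recommendations; Connect forwards consumer.-prefixed worker properties to the sink tasks' consumers):
# illustrative values only: allow more time per poll for slow S3 flushes and shrink each batch
consumer.max.poll.interval.ms=900000
consumer.max.poll.records=200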

Presto-Glue-EMR integration: presto-cli giving NullPointerException

I am trying to connect my Glue catalog to Presto and Hive on EMR. When running queries in presto-cli I get a NullPointerException, whereas the same query succeeds in hive-cli.
I started the CLI like this:
presto-cli --catalog hive
Exception on executing a query:
Query 20180814_174636_00003_iika5 failed: java.lang.NullPointerException: parameters is null
The EMR configuration looks like this:
[
  {
    "classification": "presto-connector-hive",
    "properties": {
      "hive.metastore": "glue"
    },
    "configurations": []
  },
  {
    "classification": "hive-site",
    "properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    },
    "configurations": []
  }
]
EMR version: 5.16.0
Presto version: 0.203
Reference Doc: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html
Debug logs
Query 20180816_060942_00001_m9i52 failed: java.lang.NullPointerException: parameters is null
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException: parameters is null
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2052)
at com.google.common.cache.LocalCache.get(LocalCache.java:3943)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3967)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4952)
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4958)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.get(CachingHiveMetastore.java:207)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.getPartitionNamesByParts(CachingHiveMetastore.java:499)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.doGetPartitionNames(SemiTransactionalHiveMetastore.java:467)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.getPartitionNamesByParts(SemiTransactionalHiveMetastore.java:445)
at com.facebook.presto.hive.HivePartitionManager.getFilteredPartitionNames(HivePartitionManager.java:284)
at com.facebook.presto.hive.HivePartitionManager.getPartitions(HivePartitionManager.java:146)
at com.facebook.presto.hive.HiveMetadata.getTableLayouts(HiveMetadata.java:1305)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.getTableLayouts(ClassLoaderSafeConnectorMetadata.java:73)
at com.facebook.presto.metadata.MetadataManager.getLayouts(MetadataManager.java:346)
at com.facebook.presto.sql.planner.iterative.rule.PickTableLayout.planTableScan(PickTableLayout.java:203)
at com.facebook.presto.sql.planner.iterative.rule.PickTableLayout.access$200(PickTableLayout.java:61)
at com.facebook.presto.sql.planner.iterative.rule.PickTableLayout$PickTableLayoutWithoutPredicate.apply(PickTableLayout.java:186)
at com.facebook.presto.sql.planner.iterative.rule.PickTableLayout$PickTableLayoutWithoutPredicate.apply(PickTableLayout.java:153)
at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:168)
at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:141)
at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:104)
at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:193)
at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:106)
at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:193)
at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:106)
at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:95)
at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:140)
at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:129)
at com.facebook.presto.execution.SqlQueryExecution.doAnalyzeQuery(SqlQueryExecution.java:327)
at com.facebook.presto.execution.SqlQueryExecution.analyzeQuery(SqlQueryExecution.java:312)
at com.facebook.presto.execution.SqlQueryExecution.start(SqlQueryExecution.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: parameters is null
at java.util.Objects.requireNonNull(Objects.java:228)
at com.facebook.presto.hive.metastore.Partition.<init>(Partition.java:54)
at com.facebook.presto.hive.metastore.Partition$Builder.build(Partition.java:180)
at com.facebook.presto.hive.metastore.glue.converter.GlueToPrestoConverter.convertPartition(GlueToPrestoConverter.java:141)
at com.facebook.presto.hive.metastore.glue.GlueHiveMetastore.lambda$getPartitions$8(GlueHiveMetastore.java:558)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at com.facebook.presto.hive.metastore.glue.GlueHiveMetastore.getPartitions(GlueHiveMetastore.java:558)
at com.facebook.presto.hive.metastore.glue.GlueHiveMetastore.getPartitionNamesByParts(GlueHiveMetastore.java:541)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.loadPartitionNamesByParts(CachingHiveMetastore.java:504)
at com.google.common.cache.CacheLoader$FunctionToCacheLoader.load(CacheLoader.java:165)
at com.google.common.cache.CacheLoader$1.load(CacheLoader.java:188)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3524)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2273)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2156)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2046)
... 33 more
It seems Presto 0.203 has this bug; I hit it too, switched to a newer version, and it worked.
At the time of writing this answer, EMR 5.17 has been released and it ships Presto 0.206, which has this problem resolved.
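If you launch clusters from the AWS CLI, the change amounts to a newer release label; a minimal sketch, assuming the rest of your create-cluster arguments and the Glue classification shown above stay the same:
# only the release label changes; remaining create-cluster options omitted
aws emr create-cluster --release-label emr-5.17.0 --applications Name=Hive Name=Presto ...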

Druid RabbitMQ Firehose

I'm trying to set up Druid to work with the RabbitMQ firehose, but I'm getting the following error from Tranquility:
java.lang.IllegalArgumentException: Could not resolve type id 'rabbitmq' into a subtype of [simple type, class io.druid.data.input.FirehoseFactory]
I did the following
1. Installed Druid
2. Downloaded extension druid-rabbitmq
3. Copied druid-rabbitmq into druid extensions
4. Copied amqp-client jar to druid lib
5. Added druid-rabbitmq into druid.extensions.loadList in common.runtime.properties
6. Added the firehose config to the Tranquility server.json configuration:
"ioConfig" : {
"type" : "realtime",
"firehose" : {
"type" : "rabbitmq",
"connection" : {
"host": "localhost",
"port": "5672",
"username": "blackbox",
"password": "blackbox",
"virtualHost": "blackbox-vhost",
"uri": "amqp://localhost:5672/blackbox-vhost"
},
"config" : {
"exchange": "test-exchange",
"queue" : "test-q",
"routingKey": "#",
"durable": "true",
"exclusive": "false",
"autoDelete": "false",
"maxRetries": "10",
"retryIntervalSeconds": "1",
"maxDurationSeconds": "300"
}
}
}
I'm using Imply 1.3.0, but Tranquility is for stream pushing while a firehose is used for stream pulling, so I think that was the problem. I have now created a realtime node and it's running fine. I also had to copy the Lyra jar into the Druid lib directory. Now I can publish data from RabbitMQ, it gets inserted into Druid, and I can query it, but the problem is that in RabbitMQ the messages still show as unacked. Any idea?

Elasticsearch snapshot restore throwing "repository missing" exception

"error": "RemoteTransportException[[Francis Underwood][inet[/xx.xx.xx.xx:9300]][cluster/snapshot/get]]; nested: RepositoryMissingException[[xxxxxxxxx] missing]; ",
"status": 404
I am also unable to create a new snapshot repository for snapshots on S3:
PUT _snapshot/bkp_xxxxx_master
{
  "type": "s3",
  "settings": {
    "region": "us-xxxx-x",
    "bucket": "elasticsearch-backups",
    "access_key": "xxxxxxxxxxxx",
    "secret_key": "xxxxxxxxxxxxxxxxxxx"
  }
}
The response I receive for this PUT is below:
{
"error": "RemoteTransportException[[Francis Underwood][inet[/xx.xx.xx.xx:9300]][cluster/repository/put]]; nested: RepositoryException[[bkp_xxxxxxx_master] failed to create repository]; nested:'AbstractMethodError[org.elasticsearch.cloud.aws.blobstore.S3BlobStore.immutableBlobContainer(Lorg/elasticsearch/common/blobstore/BlobPath;)Lorg/elasticsearch/common/blobstore/ImmutableBlobContainer;]; ",
"status": 500
}
Thanks in advance!
I know this is an old issue, but I was able to reproduce it across multiple Elasticsearch versions, and it turned out the cause was a conflict between the JVM version and the elasticsearch-aws-cloud plugin version.
Make sure the versions are consistent across the cluster; in my case, the Joda version bundled with elasticsearch-aws-cloud was not compatible with the newer JVM version I had installed on the newer nodes.