Cloudera Manager Embedded PostgreSQL Hive Metastore Server OutOfMemoryError Issue

I'm using:
Cloudera Manager Free Edition: 4.5.1
Cloudera Hadoop Distro: CDH 4.2.0-1.cdh4.2.0.p0.10 (Parcel)
Hive Metastore with the Cloudera Manager embedded PostgreSQL database.
My Cloudera Manager instance runs on a separate machine that is not part of the cluster.
After setting up the cluster with Cloudera Manager, I started using Hive through Hue + Beeswax.
Everything ran fine for a while; then, all of a sudden, any query against a particular table with a large number of partitions (about 14,000) started to time out:
FAILED: SemanticException org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
When I noticed this, I looked at the logs and found that the connection to the Hive Metastore was timing out:
WARN metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect. org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
Having seen this, I thought there was a problem with the Hive Metastore, so I looked at its logs and discovered java.lang.OutOfMemoryErrors:
/var/log/hive/hadoop-cmf-hive1-HIVEMETASTORE-hci-cdh01.hcinsight.net.log.out:
2013-05-07 14:13:08,744 ERROR org.apache.thrift.ProcessFunction: Internal error processing get_partitions_with_auth
java.lang.OutOfMemoryError: Java heap space
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.datanucleus.util.ClassUtils.newInstance(ClassUtils.java:95)
at org.datanucleus.store.rdbms.sql.expression.SQLExpressionFactory.newLiteralParameter(SQLExpressionFactory.java:248)
at org.datanucleus.store.rdbms.scostore.RDBMSMapEntrySetStore.getSQLStatementForIterator(RDBMSMapEntrySetStore.java:323)
at org.datanucleus.store.rdbms.scostore.RDBMSMapEntrySetStore.iterator(RDBMSMapEntrySetStore.java:221)
at org.datanucleus.sco.SCOUtils.populateMapDelegateWithStoreData(SCOUtils.java:987)
at org.datanucleus.sco.backed.Map.loadFromStore(Map.java:258)
at org.datanucleus.sco.backed.Map.keySet(Map.java:509)
at org.datanucleus.store.fieldmanager.LoadFieldManager.internalFetchObjectField(LoadFieldManager.java:118)
at org.datanucleus.store.fieldmanager.AbstractFetchFieldManager.fetchObjectField(AbstractFetchFieldManager.java:114)
at org.datanucleus.state.AbstractStateManager.replacingObjectField(AbstractStateManager.java:1183)
at org.apache.hadoop.hive.metastore.model.MStorageDescriptor.jdoReplaceField(MStorageDescriptor.java)
at org.apache.hadoop.hive.metastore.model.MStorageDescriptor.jdoReplaceFields(MStorageDescriptor.java)
at org.datanucleus.jdo.state.JDOStateManagerImpl.replaceFields(JDOStateManagerImpl.java:2860)
at org.datanucleus.jdo.state.JDOStateManagerImpl.replaceFields(JDOStateManagerImpl.java:2879)
at org.datanucleus.jdo.state.JDOStateManagerImpl.loadFieldsInFetchPlan(JDOStateManagerImpl.java:1647)
at org.datanucleus.store.fieldmanager.LoadFieldManager.processPersistable(LoadFieldManager.java:63)
at org.datanucleus.store.fieldmanager.LoadFieldManager.internalFetchObjectField(LoadFieldManager.java:84)
at org.datanucleus.store.fieldmanager.AbstractFetchFieldManager.fetchObjectField(AbstractFetchFieldManager.java:104)
at org.datanucleus.state.AbstractStateManager.replacingObjectField(AbstractStateManager.java:1183)
at org.apache.hadoop.hive.metastore.model.MPartition.jdoReplaceField(MPartition.java)
at org.apache.hadoop.hive.metastore.model.MPartition.jdoReplaceFields(MPartition.java)
at org.datanucleus.jdo.state.JDOStateManagerImpl.replaceFields(JDOStateManagerImpl.java:2860)
at org.datanucleus.jdo.state.JDOStateManagerImpl.replaceFields(JDOStateManagerImpl.java:2879)
at org.datanucleus.jdo.state.JDOStateManagerImpl.loadFieldsInFetchPlan(JDOStateManagerImpl.java:1647)
at org.datanucleus.ObjectManagerImpl.performDetachAllOnTxnEndPreparation(ObjectManagerImpl.java:3552)
at org.datanucleus.ObjectManagerImpl.preCommit(ObjectManagerImpl.java:3291)
at org.datanucleus.TransactionImpl.internalPreCommit(TransactionImpl.java:369)
at org.datanucleus.TransactionImpl.commit(TransactionImpl.java:256)
At this point, the Hive Metastore gets shut down and restarted:
2013-05-07 14:39:40,576 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: Shutting down hive metastore.
2013-05-07 14:41:09,979 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: Starting hive metastore on port 9083
Now, to fix this, I've changed the max heap size of both the Hive Metastore Server and the Beeswax Server:
1. Hive / Hive Metastore Server (Base) / Resource Management / Java Heap Size of Metastore Server: 2 GiB (the first thing I tried)
2. Hue / Beeswax Server (Base) / Resource Management / Java Heap Size of Beeswax Server: 2 GiB (after reading some group posts and other material online, I tried this as well)
Neither of these two steps seems to have helped, as I continue to see OOMEs in the Hive Metastore log.
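To rule out the new setting simply not being applied, here is a quick way to check the heap flag on the running metastore JVM (the grep pattern is the metastore's main class name):

# Print the -Xmx flag the Hive Metastore JVM was actually started with:
ps -ef | grep HiveMetaStore | grep -o 'Xmx[^ ]*'   # prints e.g. Xmx2g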
Then I noticed that the actual metastore 'database' runs as part of my Cloudera Manager installation, and I'm wondering if that PostgreSQL process is running out of memory. I looked for ways to increase the Java heap size for that process and found very little documentation about it.
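(From what I can tell, the embedded database is a native PostgreSQL process rather than a JVM, so there is no Java heap to raise; its memory settings live in postgresql.conf. A sketch of where I'd look, assuming the usual Cloudera Manager embedded-DB data directory:)

# Data directory of the CM embedded PostgreSQL (an assumption; adjust to your install):
PGDATA=/var/lib/cloudera-scm-server-db/data
# Show the memory-related settings PostgreSQL is configured with:
grep -E 'shared_buffers|work_mem' "$PGDATA/postgresql.conf"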
I was wondering if one of you guys could help me solve this issue.
Should I increase the java heap size for the embedded database? If so, where would I do this?
Is there something else that I'm missing?
Thanks!

Did you try the following?
'SET hive.metastore.client.socket.timeout=300;'
This solved the issue for me. Let me know how it goes.
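If that works, you can make it stick by adding the property to hive-site.xml on the client host; a minimal sketch, assuming the usual CDH client config path:

# Insert the property just before the closing </configuration> tag of hive-site.xml:
sudo sed -i 's|</configuration>|  <property>\n    <name>hive.metastore.client.socket.timeout</name>\n    <value>300</value>\n  </property>\n</configuration>|' /etc/hive/conf/hive-site.xml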

Related

PutHiveStreaming Processor in Nifi throws NPE

I'm debugging a custom HiveProcessor that follows the official PutHiveStreaming processor but writes to Hive 2.x instead of 3.x. The flow runs in a NiFi 1.7.1 cluster. Although this exception occurs, data is still written to Hive.
The exception is:
java.lang.NullPointerException: null
at org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook.getFilteredObjects(AuthorizationMetaStoreFilterHook.java:77)
at org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook.filterDatabases(AuthorizationMetaStoreFilterHook.java:54)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabases(HiveMetaStoreClient.java:1147)
at org.apache.hive.hcatalog.common.HiveClientCache$CacheableHiveMetaStoreClient.isOpen(HiveClientCache.java:471)
at sun.reflect.GeneratedMethodAccessor1641.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:169)
at com.sun.proxy.$Proxy308.isOpen(Unknown Source)
at org.apache.hive.hcatalog.common.HiveClientCache.get(HiveClientCache.java:270)
at org.apache.hive.hcatalog.common.HCatUtil.getHiveMetastoreClient(HCatUtil.java:558)
at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.<init>(AbstractRecordWriter.java:95)
at org.apache.hive.hcatalog.streaming.StrictJsonWriter.<init>(StrictJsonWriter.java:82)
at org.apache.hive.hcatalog.streaming.StrictJsonWriter.<init>(StrictJsonWriter.java:60)
at org.apache.nifi.util.hive.HiveWriter.lambda$getRecordWriter$0(HiveWriter.java:91)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.nifi.util.hive.HiveWriter.getRecordWriter(HiveWriter.java:91)
at org.apache.nifi.util.hive.HiveWriter.<init>(HiveWriter.java:75)
at org.apache.nifi.util.hive.HiveUtils.makeHiveWriter(HiveUtils.java:46)
at org.apache.nifi.processors.hive.PutHive2Streaming.makeHiveWriter(PutHive2Streaming.java:1152)
at org.apache.nifi.processors.hive.PutHive2Streaming.getOrCreateWriter(PutHive2Streaming.java:1065)
at org.apache.nifi.processors.hive.PutHive2Streaming.access$1000(PutHive2Streaming.java:114)
at org.apache.nifi.processors.hive.PutHive2Streaming$1.lambda$process$2(PutHive2Streaming.java:858)
at org.apache.nifi.processor.util.pattern.ExceptionHandler.execute(ExceptionHandler.java:127)
at org.apache.nifi.processors.hive.PutHive2Streaming$1.process(PutHive2Streaming.java:855)
at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2211)
at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2179)
at org.apache.nifi.processors.hive.PutHive2Streaming.onTrigger(PutHive2Streaming.java:808)
at org.apache.nifi.processors.hive.PutHive2Streaming.lambda$onTrigger$4(PutHive2Streaming.java:672)
at org.apache.nifi.processor.util.pattern.PartialFunctions.onTrigger(PartialFunctions.java:114)
at org.apache.nifi.processor.util.pattern.RollbackOnFailure.onTrigger(RollbackOnFailure.java:184)
at org.apache.nifi.processors.hive.PutHive2Streaming.onTrigger(PutHive2Streaming.java:672)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1165)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:203)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I would also like to reproduce the error. Would using TestRunners.newTestRunner(processor); be able to catch it? I am referring to the test case for Hive 3.x:
https://github.com/apache/nifi/blob/ea9b0db2f620526c8dd0db595cf8b44c3ef835be/nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/test/java/org/apache/nifi/processors/hive/TestPutHiveStreaming.java
The other way is to run Hive 2.x and a NiFi container locally. But then I have to run docker cp to copy the nar package built by mvn, and attach a remote JVM from IntelliJ as this blog describes:
https://community.hortonworks.com/articles/106931/nifi-debugging-tutorial.html
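For reference, the docker-cp-plus-remote-debug route looks roughly like this; the container name, nar path, and debug port are examples, not verified against 1.7.1:

# Copy the freshly built nar into the NiFi container and restart it:
docker cp target/nifi-hive-processors-1.7.1.nar nifi:/opt/nifi/nifi-current/lib/
docker restart nifi
# Enable remote debugging by uncommenting the JDWP line in conf/bootstrap.conf:
#   java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000
# then attach IntelliJ's remote debugger to port 8000.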
Has anyone done something similar? Or is there an easier way to debug a custom processor?
This is a red herring error; there's some issue on the Hive side where it can't get its own IP address or hostname, and it issues this error periodically as a result. However, I don't think it causes any real problems since, as you said, the data gets written to Hive.
Just for completeness: in Apache NiFi, PutHiveStreaming is built to work against Hive 1.2.x, not Hive 2.x. There are currently no Hive 2.x-specific processors, and we've never determined whether the Hive 1.2.x processors work against Hive 2.x.
For debugging, if you can run Hive in a container and expose the metastore port (9083 is the default I believe), then you should be able to create an integration test using things like TestRunners and run NiFi locally from your IDE. This is how other integration tests are performed for external systems such as MongoDB or Elasticsearch for example.
There is a MiniHS2 class in the Hive test suite for integration testing, but it is not in a published artifact so unfortunately we're left with having to run tests against a real Hive instance.
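A rough sketch of that setup; the image name is a placeholder for whatever Hive 2.x image or docker-compose file you use:

# Run a Hive 2.x metastore in a container, exposing the default thrift port:
docker run -d --name hive-metastore -p 9083:9083 some/hive-2.x-image   # placeholder image
# A local TestRunners-based test can then point the processor at it, e.g.:
#   runner.setProperty(PutHiveStreaming.METASTORE_URI, "thrift://localhost:9083");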
The NPE no longer shows up after setting hcatalog.hive.client.cache.disabled to true.
Kafka Connect recommends this setting, too. From the Kafka Connect docs (https://docs.confluent.io/3.0.0/connect/connect-hdfs/docs/hdfs_connector.html):
As connector tasks are long running, the connections to Hive metastore are kept open until tasks are stopped. In the default Hive configuration, reconnecting to Hive metastore creates a new connection. When the number of tasks is large, it is possible that the retries can cause the number of open connections to exceed the max allowed connections in the operating system. Thus it is recommended to set hcatalog.hive.client.cache.disabled to true in hive.xml.
Note that when Max Concurrent Tasks of PutHiveStreaming is set to more than 1, this property is automatically set to false.
The documentation from NiFi (linked below) also resolved the issue:
The NiFi PutHiveStreaming has a pool of connections, therefore multithreaded; setting hcatalog.hive.client.cache.disabled to true would allow each connection to set its own Session without relying on the cache.
ref:
https://community.hortonworks.com/content/supportkb/196628/hive-client-puthivestreaming-fails-against-partiti.html
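For anyone applying this, the property goes into the hive-site.xml that the processor's Hive Configuration Resources property points at; a minimal sketch, with the path as an example:

# Add the property just before the closing </configuration> tag:
sudo sed -i 's|</configuration>|  <property>\n    <name>hcatalog.hive.client.cache.disabled</name>\n    <value>true</value>\n  </property>\n</configuration>|' /etc/hive/conf/hive-site.xml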

Apache Drill - Hive Integration: Drill Not listing Tables

I have been trying to integrate Apache Drill with Hive using the Hive storage plugin configuration. I configured the storage plugin with all the necessary properties. On the Drill shell, I can view the Hive databases using:
Show Databases;
But when I try to list tables using:
Show Tables;
I get no results (no list of tables).
Below are the steps I followed, from the Apache Drill documentation and other sources:
I created a Drill distributed cluster by updating drill-override.conf with the same cluster ID on all nodes, along with the ZK IP and port, and then invoking drillbit.sh on each node (a sample config is sketched just below).
I started the Drill shell using drill-conf and ensured that the Hive metastore service is active as well.
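For reference, the drill-override.conf on each node looks like this (the cluster ID and ZK address here are placeholders):

drill.exec: {
  cluster-id: "drill-cluster-1",
  zk.connect: "node01.cluster.com:2181"
}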
Below is the configuration of the Hive storage plugin for Drill (from its web UI):
{
  "type": "hive",
  "configProps": {
    "hive.metastore.uris": "thrift://node02.cluster.com:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://node02.cluster.com/hive",
    "hive.metastore.warehouse.dir": "/apps/hive/warehouse",
    "fs.default.name": "hdfs://node01.cluster.com:8020",
    "hive.metastore.sasl.enabled": "false"
  },
  "enabled": true
}
All the properties were set after referring to hive-site.xml.
As far as I can tell, that's what others have done to integrate Drill with Hive. Am I missing something here?
Regarding versions: Drill 1.14, Hive 1.2 (Hive metastore: MySQL).
We also have HiveServer2 on the same nodes; is that causing any issues?
I just want to integrate Drill with Hive 1.2; am I doing it right?
Any pointers will be helpful; I have spent nearly two days trying to get this right.
Thanks for your time.
Starting from version 1.13, Drill uses the Hive 2.3.2 client, so it is recommended to use Hive 2.3 to avoid unpredictable issues.
Regarding your setup, please remove all configProps except hive.metastore.uris. The other configs can take their default values (see HiveConf.java) or can be specified in your hive-site.xml.
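In other words, a trimmed plugin config would look like this, keeping only the metastore URI from your current config:

{
  "type": "hive",
  "configProps": {
    "hive.metastore.uris": "thrift://node02.cluster.com:9083"
  },
  "enabled": true
}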
Also, if Show Tables; returns an empty result even after executing use hive, check Drill's log files for errors. If you find one, you can create a Jira ticket asking to improve Drill's output to reflect that issue.

Apache Flink error checkpointing to S3

We have Apache Flink (1.4.2) running on an EMR cluster. We are checkpointing to an S3 bucket, and are pushing about 5,000 records per second through the flows. We recently saw the following error in our logs:
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@ip-XXX-XXX-XXX-XXX:XXXXXX/user/taskmanager#-XXXXXXX]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.TaskManagerMessages$RequestTaskManagerLog".
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:442)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
Immediately after this we got the following in our logs:
2018-07-30 15:08:32,177 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 831 @ 1532963312177
2018-07-30 15:09:46,750 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.EOFException: Read an incomplete length
at org.apache.flink.runtime.blob.BlobUtils.readLength(BlobUtils.java:366)
at org.apache.flink.runtime.blob.BlobServerConnection.readFileFully(BlobServerConnection.java:403)
at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:349)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:114)
At this point, the flow crashed and was not able to recover automatically; however, we were able to restart it manually without needing to change the location of the S3 bucket. The fact that the crash occurred while pushing to S3 makes me think that is the crux of the problem.
Any ideas?
FYI, this was caused by too much cross-talk between nodes flooding the NICs on each server. The solution was more intelligent partitioning.
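For anyone hitting the same thing, one quick way to confirm NIC saturation while the job is running is to watch per-interface throughput on each node, for example with sar from the sysstat package:

sar -n DEV 1   # compare rxkB/s and txkB/s per interface against the NIC's line rate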

HBase Nutch error [Ljava.lang.StackTraceElement

My Apache Nutch crawl is running, and the following error appears in the log file:
ERROR store.HBaseStore - Connection refused
2014-11-17 00:00:38,255 ERROR store.HBaseStore - [Ljava.lang.StackTraceElement;@6dce5061
How can I remove this error? According to my research, this error comes from HBase, not from Nutch. This question was posted before but got no answer; I will have to put a bounty on it if I don't get an answer, which is why I am posting again.
Here is some information about my small cluster (2 machines):
On machine one, Hadoop and HBase are running.
On machine two, the Apache Nutch crawler (2.2.1) is running.
When I check the HBase and Hadoop log files, there is no information about the bug. Because of this bug, crawled data is not being saved in HBase (machine one). That is a real problem for me, as my crawler is not crawling properly. There is already about 266 GB of crawled data in the table.
This "Connection refused" problem simply means that your region server is not running properly.
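A quick way to confirm that, run on machine one (jps ships with the JDK):

jps | grep -i HRegionServer      # is the region server JVM up at all?
echo "status" | hbase shell      # does the cluster report any live region servers?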

Sunspot lock issue on EngineYard

I am having a problem when creating a new record on a RoR3 server. Creating the record updates the Solr indexes, and that update is hitting a lock problem:
RSolr::Error::Http (RSolr::Error::Http - 500 Internal Server Error
Error: Lock obtain timed out: NativeFSLock@/data/dfcgit_r3/releases/20130620195714/solr/data/production/index/write.lock
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/data/dfcgit_r3/releases/20130620195714/solr/data/production/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1108)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:101)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:171)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:219)
Any help with this?
We had the same error when running Sunspot Solr on Amazon EC2.
The write.lock indicated that some process had not released its lock on the index: either the web server process was still holding it, or Solr had some other process running. I checked the running Solr processes by executing:
ps aux | grep solr
It showed there were four processes running! So I stopped Solr with rake sunspot:solr:stop, ran the grep again, killed the Solr processes still listed (kill -9), and then ran rake sunspot:solr:start.
And the sun shone again; it worked fine thereafter.
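For reference, the full recovery sequence as commands; the rake task names assume sunspot_rails, and <pid> stands for each PID printed by the grep:

rake sunspot:solr:stop
ps aux | grep [s]olr     # list any Solr processes still alive
kill -9 <pid>            # kill each leftover PID from the grep above
rake sunspot:solr:start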