Hive on AWS EMR does not work - hive

I have a brand new EMR cluster (EMR version 5.16, Hadoop 2.8.4, and Hive 2.3.3). It's connected to the GLU Data Catalog. I can list the tables, describe, and even "select " successfully. But, once I start a real query that has to run a Tez or MR jobs on YARN, it fails. Even "select count() " on a simple, uncompressed text table fails.
The Tez stderr logs suggests a bunch of S3 errors (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;), but the YARN container logs has lib errors (|lzo.GPLNativeCodeLoader|: Could not load native gpl library). Does anybody know whats going on here? This seems like basic stuff on EMR, but doesn't work.

Related

Unable to connect to AWS RDS using Liquibase on Mac

I am unable to connect to an AWS RDS instance using liquibase from the terminal using a Mac. I am able to connect to the same database and URL using MySQL workbench and pymysql in python. I have mysql-connector-java.jar in Liquibase's lib folder. I have also done brew install mysql and brew install mysql-client. I get the same error when using an invalid URL, but I have checked (many times) to make sure the URL string is correct. The exact error I get is:
Unexpected error running Liquibase: liquibase.exception.DatabaseException: liquibase.exception.DatabaseException: Connection could not be created to jdbc:mysql://-URL IS HERE-?createDatabaseIfNotExists=true with driver com.mysql.cj.jdbc.Driver. Communications link failure
From the command:
liquibase --url=jdbc:mysql://-URL IS HERE-:3306/cdp_sms?createDatabaseIfNotExists=true --username="cdpadmin" --password=$CDP_DB_PW --changeLogFile=db.changelog-master.xml update
Any help would be greatly appreciated.
I had encounter a similar issue a while ago with Liquibase and AWS RDS. I would suggest you to check your security groups (inbound/outbound rules) in the aws settings.

Alluxio + Hive on EMR

I have Alluxio 1.8 installed on an EMR 5.19.0 cluster, and can see my S3 tables using /usr/local/alluxio/bin/alluxio fs ls /.
However, when I start up hive and issue
hive> [[DDL w/ LOCATION = alluxio://master_host:19998/my_table ]]], I get the following:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
Is there a way of getting past this? I've tried starting hive with --auxpath pointing to both /usr/local/alluxio/client/alluxio-1.8.1-client.jar and a copy of the jar on hdfs without any success.
Any help?
I posted a blog talking about the reasons for the error message java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found. Here are some tips, hope they can help:
For Hive, set environment variable HIVE_AUX_JARS_PATH in conf/hive-env.sh:
export HIVE_AUX_JARS_PATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar:${HIVE_AUX_JARS_PATH}
which I guess is equivalent to what you have done to set --auxpath.
Depending on your setting of Hive (e.g., Hive on MR or Spark or Tez), you may also need to make sure the runtime is also able to access the client jar. Take Hive on MR as an example, you perhaps also need to append the path to Alluxio client jar to mapreduce.application.classpath or yarn.application.classpath to ensure each task of the MR jobs can access this jar.

External checkpoints to S3 on EMR

I am trying to deploy a production cluster for my Flink program. I am using a standard hadoop-core EMR cluster with Flink 1.3.2 installed, using YARN to run it.
I am trying to configure my RocksDB to write my checkpoints to an S3 bucket. I am trying to go through these docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/aws.html#set-s3-filesystem. The problem seems to be getting the dependencies working correctly. I receive this error when trying run the program:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:93)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:328)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:350)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.<init>(FsCheckpointStreamFactory.java:99)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createStreamFactory(FsStateBackend.java:282)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createStreamFactory(RocksDBStateBackend.java:273
I have tried both leaving and adjusting the core-site.xml and leaving it as is. I have tried setting the HADOOP_CLASSPATH to the /usr/lib/hadoop/share that contains(what I assume are) most of the JARs described in the above guide. I tried downloading the hadoop 2.7.2 binaries, and copying over them into the flink/libs directory. All resulting in the same error.
Has anyone successfully gotten Flink being able to write to S3 on EMR?
EDIT: My cluster setup
AWS Portal:
1) EMR -> Create Cluster
2) Advanced Options
3) Release = emr-5.8.0
4) Only select Hadoop 2.7.3
5) Next -> Next -> Next -> Create Cluster ( I do fill out names/keys/etc)
Once the cluster is up I ssh into the Master and do the following:
1 wget http://apache.claz.org/flink/flink-1.3.2/flink-1.3.2-bin-hadoop27-scala_2.11.tgz
2 tar -xzf flink-1.3.2-bin-hadoop27-scala_2.11.tgz
3 cd flink-1.3.2
4 ./bin/yarn-session.sh -n 2 -tm 5120 -s 4 -d
5 Change conf/flink-conf.yaml
6 ./bin/flink run -m yarn-cluster -yn 1 ~/flink-consumer.jar
My conf/flink-conf.yaml I add the following fields:
state.backend: rocksdb
state.backend.fs.checkpointdir: s3:/bucket/location
state.checkpoints.dir: s3:/bucket/location
My program's checkpointing setup:
env.enableCheckpointing(getCheckpointRate,CheckpointingMode.EXACTLY_ONCE)
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(getCheckpointMinPause)
env.getCheckpointConfig.setCheckpointTimeout(getCheckpointTimeout)
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
env.setStateBackend(new RocksDBStateBackend("s3://bucket/location", true))
If there are any steps you think I am missing, please let me know
I assume that you installed Flink 1.3.2 on your own on the EMR Yarn cluster, because Amazon does not yet offer Flink 1.3.2, right?
Given that, it seems as if you have a dependency conflict. The method org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration) was only introduced with Hadoop 2.4.0. Therefore I assume that you have deployed a Flink 1.3.2 version which was built with Hadoop 2.3.0. Please deploy a Flink version which was built with the Hadoop version running on EMR. This will most likely solve all dependency conflicts.
Putting the Hadoop dependencies into the lib folder seems to not reliably work because the flink-shaded-hadoop2-uber.jar appears to have precedence in the classpath.

Parquet Warning Filling up Logs in Hive MapReduce on Amazon EMR

I am running a custom UDAF on a table stored as parquet on Hive on Tez. Our Hive jobs are run on YARN, all set up in Amazon EMR. However, due to the fact that the parquet data we have was generated with an older version of Parquet (1.5), I am getting a warning that is filling up the YARN logs and causing the disk to run out of space before the job finishes. This is the warning:
PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring
statistics
because created_by could not be parsed (see PARQUET-251): parquet-mr version
It also prints a stack track. I have been trying to silence the warning logs to no avail. I have managed to turn off just about every type of log except this warning. I have tried modifying just about every Log4j settings file using the AWS config as outlined here.
Things I have tried so far:
I set the following settings in tez-site.xml (writing them in JSON format because that's what AWS requires for configuration) It is in proper XML format of course on the actual instance.
"tez.am.log.level": "OFF",
"tez.task.log.level": "OFF",
"tez.am.launch.cluster-default.cmd-opts": "-Dhadoop.metrics.log.level=OFF -Dtez.root.logger=OFF,CLA",
"tez.task-specific.log.level": "OFF;org.apache.parquet=OFF"
I have the following settings on mapred-site.xml. These settings effectively turned off all logging that occurs in my YARN logs except for the warning in question.
"mapreduce.map.log.level": "OFF",
"mapreduce.reduce.log.level": "OFF",
"yarn.app.mapreduce.am.log.level": "OFF"
I have these settings in just about every other log4j.properties file .I found in the list shown in previous AWS link.
"log4j.logger.org.apache.parquet.CorruptStatistics": "OFF",
"log4j.logger.org.apache.parquet": "OFF",
"log4j.rootLogger": "OFF, console"
Honestly at this point, I just want to find some way turn off logs and get the job running somehow. I've read about similar issues such as this link where they fixed it by changing log4j settings, but that's for Spark and it just doesn't seem to be working on Hive/Tez and Amazon. Any help is appreciated.
Ok, So I ended up fixing this by modifying the java logging.properties file for EVERY single data node and the master node in EMR. In my case the file was located at /etc/alternatives/jre/lib/logging.properties
I added a shell command to the bootstrap action file to automatically add the following two lines to the end of the properties file:
org.apache.parquet.level=SEVERE
org.apache.parquet.CorruptStatistics.level = SEVERE
Just wanted to update in case anyone else faced the same issue as this is really not set up properly by Amazon and required a lot of trial and error.

Cannot Load Hive Table into Pig via HCatalog

I am currently configuring a Cloudera HDP dev image using this tutorial on CentOS 6.5, installing the base and then adding the different components as I need them. Currently, I am installing / testing HCatalog using this section of the tutorial linked above.
I have successfully installed the package and am now testing HCatalog integration with Pig with the following script:
A = LOAD 'groups' USING org.apache.hcatalog.pig.HCatLoader();
DESCRIBE A;
I have previously created and populated a 'groups' table in Hive before running the command. When I run the script with the command pig -useHCatalog test.pig I get an exception rather than the expected output. Below is the initial part of the stacktrace:
Pig Stack Trace
---------------
ERROR 2245: Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1608)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1547)
at org.apache.pig.PigServer.registerQuery(PigServer.java:518)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
...
Has anyone encountered this error before? Any help would be much appreciated. I would be happy to provide more information if you need it.
The error was caused by HBase's Thrift server not being proper configured. I installed/configured Thrift and added the following to my hive-xml.site with the proper server information added:
<property>
<name>hive.metastore.uris</name>
<value>thrift://<!--URL of Your Server-->:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
I thought the snippet above was not required since I am running Cloudera HDP in pseudo-distributed mode.Turns out, it and HBase Thrift are required to use HCatalog with Pig.