Hive tez execution error - hive

I am running a hive query and I got the following error when setting the hive.execution.engine=tez, while the query is working under engine=MR.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
My query is an inner join and the data is quite big.
Another thing is that I have met this problem before. But tez works later so I thought it was about something unstable about hive.

While running your HQL via hive include following parameter. This will give you detailed logs and you can determine the root cause easily.
-hiveconf hive.root.logger=DEBUG,console
I faced similar problem and above property help me big time.
e.g.: I got following message
16/04/14 10:29:26 ERROR exec.Task: Failed to execute tez graph.
org.apache.tez.dag.api.TezException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=20480, maxMemory=11288
When I changed my setting to 11288, my query went through fine.

Once check your yarn-site.xml with following properties.
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
</configuration>

Found this post, which made it work for me. Needed to add the username
hadoop

Related

Connect timeout from Presto / Trino to Amazon S3

I currently have a Kubernetes setup outside of AWS where a data lake which resides in Amazon S3 gets queried using Presto v348. Data is stored in parquet file format. Additional component is a Hive metastore.
I encounter the following error and am at a loss on regards to troubleshooting the underlying issue:
io.prestosql.spi.PrestoException: Unable to execute HTTP request: Connect to s3-eu-central-1.amazonaws.com:80 [s3-eu-central-1.amazonaws.com] failed: connect timed out
This issue sometimes arises with bigger queries and interestingly brings the system into a state where all following queries time out. There are cases where in 1/5 of tries the query will succeed. Smaller queries in general work perfectly fine. This gets better after about 10-20min. Restarting Presto does not solve the 10-20min problem. Therefore I suspect there must be another problem.
I am aware of the fact that I might run into a performance ceiling, but the fact that instead of an error there are just timeouts and the whole system is unusable for 10-20 minutes is not acceptable.
I have already increased configs like hive.s3.max-connections in Presto and fs.s3a.connection.maximum in the metastore config but it doesn't seem to solve the problem. Besides these, I found no suggestions on how to tweak the setup to prevent the error from happening.
Presto connector config:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore.username=prestodb
hive.s3.aws-access-key="S3_ACCESS_KEY"
hive.s3.aws-secret-key="S3_SECRET_KEY"
hive.s3.endpoint=s3-eu-central-1.amazonaws.com
hive.s3.ssl.enabled=false
hive.s3.path-style-access=true
hive.parquet.use-column-names=true
hive.allow-drop-table=true
hive.s3-file-system-type=PRESTO
hive.s3.max-connections=50000
hive.s3select-pushdown.max-connections=50000
hive.s3.connect-timeout=60s
hive.allow-rename-column=true
Metatore config:
core-site.xml: |
<configuration>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>false</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>xxx</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>xxx</value>
</property>
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.connection.maximum</name>
<value>50000</value>
</property>
<property>
<name>fs.s3a.connection.establish.timeout</name>
<value>60000</value>
</property>
<property>
<name>fs.s3a.threads.max</name>
<value>64</value>
</property>
<property>
<name>fs.s3a.max.total.tasks</name>
<value>128</value>
</property>
</configuration>

Can Impala run on top of Alluxio?

I have tried to configured Impala to run on top of Alluxio, but failed.
Here is the Impala configurations:
/etc/impala/conf/core-site.xml(http://www.alluxio.org/docs/1.6/en/Running-Hadoop-MapReduce-on-Alluxio.html)
<configuration>
<property>
<name>fs.alluxio.impl</name>
<value>alluxio.hadoop.FileSystem</value>
<description>The Alluxio FileSystem (Hadoop 1.x and 2.x)</description>
</property>
<property>
<name>fs.AbstractFileSystem.alluxio.impl</name>
<value>alluxio.hadoop.AlluxioFileSystem</value>
<description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>
</configuration>
/etc/impala/conf/hive-site.xml(http://www.alluxio.org/docs/1.6/en/Running-Hive-with-Alluxio.html)
<property>
<name>fs.defaultFS</name>
<value>alluxio://master_hostname:port</value>
</property>
Then I started Impala(impala-server, impala-catalogd, impala-state-store), but in the log I found this:
...impala-server.cc:282] Currently configured default file system: FileSystem. fs.defaultFS (alluxio://192.168.1.10:19998/) is not supported.
...impala-server.cc:285] Aborting Impala Server startup due to improper configuration. Impalad exiting.
I have searched a lot on Bing but got no luck. Even there is few result on search key words 'impala on alluxio'. So can impala run on top of alluxio? Any suggestions will be appreciated.
My Impala version: 2.10.0-cdh5.13.0 RELEASE, Alluxio version: alluxio-1.8.0-hadoop-2.7
Have you tried running Hive with external tables on Alluxio? Instead of setting Alluxio as defaultFS, remove
<property>
<name>fs.defaultFS</name>
<value>alluxio://master_hostname:port</value>
</property>
and use something like the following to create a table on Alluxio:
hive> CREATE TABLE u_user (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 'alluxio://master_hostname:port/table_path';
That might help workaround Impala's filesystem implementation check. Also there is a bug in CDH 5.13 and below which prevents Impala from reading data in Alluxio. You might want to upgrade to CDH 5.14 which fixed that issue.

My Yarn Map-Reduce Job is taking a lot of time

Input File size : 75GB
Number of Mappers : 2273
Number of reducers : 1 (As shown in the web UI)
Number of splits : 2273
Number of Input files : 867
Cluster : Apache Hadoop 2.4.0
5 nodes cluster, 1TB each.
1 master and 4 Datanodes.
It's been 4 hrs. now and still only 12% of map is completed. Just wanted to know given my cluster configuration does this make sense or is there anything wrong with the configuration?
Yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux- services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource- tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8040</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
<description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>32</value>
<description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
<description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
<description>Number of CPU cores that can be allocated for containers.</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
Map-Reduce job where I am using multiple outputs. So reducer will emit multiple files. Each machine has 15GB Ram. Containers running are 8. Total memory available is 32GB in RM Web UI.
Any guidance is appreciated. Thanks in advance.
A few points to check:
The block & split size seems very small considering the data you shared. Try increasing both to an optimal level.
If not used, use a custom partitioner that would uniformly spread your data across reducers.
Consider using combiner.
Consider using appropriate compression (while storing mapper results)
Use optimum number of block replication.
Increase the number of reducers as appropriate.
These will help increase performance. Give a try and share your findings!!
Edit 1: Try to compare the log generated by a successful map task with that of the long running map task attempt. (12% means 272 map tasks completed). You will get to know where it got stuck.
Edit 2: Tweak these parameters: yarn.scheduler.minimum-allocation-mb, yarn.scheduler.maximum-allocation-mb, yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.memory.mb, mapreduce.reduce.java.opts, mapreduce.task.io.sort.mb, mapreduce.task.io.sort.factor
These will improve the situation. Take trial and error approach.
Also refer: Container is running beyond memory limits
Edit 3: Try to understand a part of the logic, convert it to pig script, execute and see how it behaves.

Error Nutch No agents listed in 'http.agent.name'

I am using nutch2.2.1. Log file is generating following error
ERROR protocol.RobotRulesParser - Agent we advertise (nutch-spider-2.2.1) not listed first in 'http.robots.agents' property!
My nutch-site.xml is (for above property)
<property>
<name>http.agent.name</name>
<value>nutch-spider-2.2.1</value>
</property>
my nutch-default.xml is
<property>
<name>http.agent.name</name>
<value></value>
</property>
Where is actual problem? Please guide it clearly(properly explaination).
This question is posted here but I have to bounty this question (if needed) that's why posting it again.
You shoule add the property of "http.robots.agents" and put the value of http.agent.name as the first agent name, and keep the default * at the end of the list.just like:
<property>
<name>http.robots.agents</name>
<value>nutch-spider-2.2.1,*</value>
</property>

Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

i just installed hive and mysql..
and copied the mysqlconnector to the hive_home/lib folder
but when i try show databases and create table commands in the hive> prompt giving me the error as below:
create database saty;
FAILED: Error in metadata: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
and my hive_site.xml is
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hadoop?CreateDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
and i dont have a directory called /user/hive/warehouse in my file system.
i created these path with mkdir command.. and tried after reboot..
bout still getting the error..
regards,
satya
Try Specifying these two properties
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>username</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
<description>Password to use against metastore database</description>
</property>
And Similarly the username of your mysql login should have the permission to access the db sepcified in the JDBC connection string. This Can be acheived using the following command
GRANT ALL ON Databasename.* TO username#'%' IDENTIFIED BY 'password';
The answer is located at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/5.0/CDH5-Installation-Guide/cdh5ig_hive_schema_tool.html
To suppress the schema check and allow the metastore to implicitly modify the schema, you need to set the hive.metastore.schema.verification configuration property to false in hive-site.xml.
reconfigure your hive using -hiveconf hive.root.logger=warn,console than find the detail reason that why you could not instantiate your hive mate store client.
The problem i met was wrong mysql configuration, the error message is "Binary logging not possible. Message: Transaction level 'READ-COMMITTED' in InnoDB is not safe for binlog mode 'STATEMENT'". When i change binlog_format from "statement" to "binlog_format=mixed", then hive meta store client instantiate successfully.
Hope it work for you.
I had similar issue and the cause in my case was selinux where it prevented Postgres from proper running.
I inserted following line to the first line of /etc/rc3.d/S64postgresql -
echo 0 > /selinux/enforce # Disable selinux temporarily
and restarted the node with the hive metastore.
So generally you can check two things:
Check whether the DB for the metastore is running properly
Whether the user/passwd from the property is correct