Accessing HDFS from Spark gives TokenCache error: Can't get Master Kerberos principal for use as renewer

I'm trying to run a test Spark script in order to connect Spark to Hadoop. The script is the following:
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
file = sc.textFile("hdfs://hadoop_node.place:9000/errs.txt")
errors = file.filter(lambda line: "ERROR" in line)
errors.count()
When I run it with pyspark I get
py4j.protocol.Py4JJavaError: An error occurred while calling
o21.collect. : java.io.IOException: Can't get Master Kerberos
principal for use as renewer
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:187)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:898)
at org.apache.spark.rdd.RDD.collect(RDD.scala:608)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:243)
at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)
This happens despite the fact that:
- I've done a kinit, and klist shows I have the correct tickets
- when I issue ./bin/hadoop fs -ls hdfs://hadoop_node.place:9000/errs.txt, it shows the file
- both the local Hadoop client and Spark have the same configuration files
The core-site.xml in the spark/conf and hadoop/conf folders is the following (I got it from one of the hadoop nodes):
<configuration>
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[1:$1](.*@place)s/@place//
RULE:[2:$1/$2@$0](.*/node1.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node2.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node3.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node4.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node5.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node6.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node7.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:nobody]
DEFAULT
</value>
</property>
<property>
<name>net.topology.node.switch.mapping.impl</name>
<value>org.apache.hadoop.net.TableMapping</value>
</property>
<property>
<name>net.topology.table.file.name</name>
<value>/etc/hadoop/conf/topology.table.file</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://server.place:9000/</value>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
</configuration>
Can someone point out what I am missing?

After creating my own Hadoop cluster in order to better understand how Hadoop works, I fixed it.
You have to provide Spark with a valid keytab file generated for an account that has at least read access to the Hadoop cluster.
Also, you have to provide Spark with the hdfs-site.xml of your HDFS cluster.
So in my case I had to create a keytab file for which, when you run
klist -k -e -t
on it, you get entries like the following:
host/fully.qualified.domain.name@REALM.COM
In my case the host was the literal word host and not a variable.
Also, in your hdfs-site.xml you have to provide the path of the keytab file and say that
host/_HOST@REALM.COM
will be your account.
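For reference, a minimal sketch of hdfs-site.xml entries matching that description. The property names are the standard secure-HDFS ones, but the principal and keytab path here are placeholders, and which exact keys your setup needs is an assumption to verify against your cluster:
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>host/_HOST@REALM.COM</value>
</property>
<property>
<name>dfs.namenode.keytab.file</name>
<value>/path/to/spark.keytab</value>
</property>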
Cloudera has a pretty detailed writeup on how to do it.
Edit:
After playing a little bit with different configurations, I think the following should be noted: you have to provide Spark with the exact hdfs-site.xml and core-site.xml of your Hadoop cluster. Otherwise it won't work.
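One way to hand Spark those files (a sketch, assuming a standard Spark layout; the path is a placeholder) is to point HADOOP_CONF_DIR at the cluster's configuration directory in conf/spark-env.sh:
# Make Spark pick up the cluster's core-site.xml and hdfs-site.xml
export HADOOP_CONF_DIR=/path/to/cluster/hadoop/conf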

Related

When initializing Hive for the first time, getting an error on hive-site.xml

I am unable to find the cause of the error below, which points to hive-site.xml;
as far as I can tell, what I have configured is correct.
FYI, I am using Hadoop 3.1.1, Hive 3.1.1, and MySQL for the Hive metastore.
adminn#master:~$ schematool -initSchema -dbType mysql
Exception in thread "main" java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x8
at [row,col,system-id]: [3210,96,"file:/home/adminn/apache-hive-3.1.1-bin/conf/hive-site.xml"]
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:3003)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2931)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2806)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:1460)
at org.apache.hadoop.hive.conf.HiveConf.getVar(HiveConf.java:4990)
at org.apache.hadoop.hive.conf.HiveConf.getVar(HiveConf.java:5063)
at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5150)
at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:5098)
at org.apache.hive.beeline.HiveSchemaTool.<init>(HiveSchemaTool.java:96)
at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1473)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x8
at [row,col,system-id]: [3210,96,"file:/home/adminn/apache-hive-3.1.1-bin/conf/hive-site.xml"]
at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:621)
at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:491)
at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2456)
at com.ctc.wstx.sr.StreamScanner.validateChar(StreamScanner.java:2403)
at com.ctc.wstx.sr.StreamScanner.resolveCharEnt(StreamScanner.java:2369)
at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1515)
at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2828)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1123)
at org.apache.hadoop.conf.Configuration$Parser.parseNext(Configuration.java:3257)
at org.apache.hadoop.conf.Configuration$Parser.parse(Configuration.java:3063)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2986)
... 15 more
Given below are the important parts of the hive-site.xml file where I made the required changes:
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Hivehadoop#123$</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.metastore.ds.connection.url.hook</name>
<value/>
<description>Name of the hook to use for retrieving the JDO connection URL. If empty, the value in javax.jdo.option.ConnectionURL is used</description>
</property>
<property>
<name>javax.jdo.option.Multithreaded</name>
<value>true</value>
<description>Set this to true if multiple threads access metastore through JDO concurrently.</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExist=true</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/${user.name}</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/${user.name}_resources</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
This is solved: I removed the special character from the specified line number, and then it worked fine.
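For anyone hitting the same parse error, a small sketch of how one might locate the offending entity. The path is taken from the stack trace above, and the check mirrors the XML 1.0 rule that control characters other than tab, LF, and CR are illegal:
import re

# Scan hive-site.xml for numeric character entities that expand to
# control characters, which the XML parser rejects (e.g. &#8;).
path = "/home/adminn/apache-hive-3.1.1-bin/conf/hive-site.xml"
with open(path, "rb") as f:
    for lineno, line in enumerate(f, start=1):
        for m in re.finditer(rb"&#(\d+);", line):
            code = int(m.group(1))
            if code < 32 and code not in (9, 10, 13):
                print("line %d: illegal character entity &#%d;" % (lineno, code))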

Connect timeout from Presto / Trino to Amazon S3

I currently have a Kubernetes setup outside of AWS where a data lake residing in Amazon S3 gets queried using Presto v348. Data is stored in Parquet file format. An additional component is a Hive metastore.
I encounter the following error and am at a loss as to how to troubleshoot the underlying issue:
io.prestosql.spi.PrestoException: Unable to execute HTTP request: Connect to s3-eu-central-1.amazonaws.com:80 [s3-eu-central-1.amazonaws.com] failed: connect timed out
This issue sometimes arises with bigger queries and, interestingly, brings the system into a state where all following queries time out; in roughly 1 out of 5 tries the query will succeed. Smaller queries in general work perfectly fine. The situation improves after about 10-20 minutes, and restarting Presto does not shorten that window, so I suspect there must be another problem.
I am aware that I might be running into a performance ceiling, but the fact that instead of an error there are just timeouts, and that the whole system is unusable for 10-20 minutes, is not acceptable.
I have already increased settings like hive.s3.max-connections in Presto and fs.s3a.connection.maximum in the metastore config, but that doesn't seem to solve the problem. Beyond these, I found no suggestions on how to tweak the setup to prevent the error from happening.
Presto connector config:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore.username=prestodb
hive.s3.aws-access-key="S3_ACCESS_KEY"
hive.s3.aws-secret-key="S3_SECRET_KEY"
hive.s3.endpoint=s3-eu-central-1.amazonaws.com
hive.s3.ssl.enabled=false
hive.s3.path-style-access=true
hive.parquet.use-column-names=true
hive.allow-drop-table=true
hive.s3-file-system-type=PRESTO
hive.s3.max-connections=50000
hive.s3select-pushdown.max-connections=50000
hive.s3.connect-timeout=60s
hive.allow-rename-column=true
Metastore config:
core-site.xml: |
<configuration>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>false</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>xxx</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>xxx</value>
</property>
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.connection.maximum</name>
<value>50000</value>
</property>
<property>
<name>fs.s3a.connection.establish.timeout</name>
<value>60000</value>
</property>
<property>
<name>fs.s3a.threads.max</name>
<value>64</value>
</property>
<property>
<name>fs.s3a.max.total.tasks</name>
<value>128</value>
</property>
</configuration>

Hue connecting to Hive gives an error

Failed to open new session: java.lang.RuntimeException:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
User: hadoop is not allowed to impersonate cheng
The user hadoop is the one my Hadoop install uses, and cheng is my Ubuntu user.
I already have the following configuration in my core-site.xml:
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
The hive user did not exist beforehand, so I changed hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups to hadoop.proxyuser.hadoop.hosts and hadoop.proxyuser.hadoop.groups, and in the Hue config hue.ini I set the Hue user.
That solved the problem.
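In other words, a sketch of the corrected core-site.xml properties, assuming hadoop is the user the service impersonates from (the proxyuser key embeds that user name):
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>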

NoClassDefFoundError HBase with YARN

I know that this is one of those topics that gets asked a lot. Still, after I dug into all of the topics I could find (most of them talking about CLASSPATH), I can't solve mine.
Examples of the topics I found and tried:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
java.lang.NoClassDefFoundError with HBase Scan
I'm using Hadoop 2.5.1 with HBase 0.98.11 on Ubuntu 14.04
I set up pseudo-distributed mode and ran Hadoop with HBase successfully. But when I try to set up fully-distributed mode, jobs fail with the NoClassDefFound error. I tried adding export HADOOP_CLASSPATH=$(/usr/local/hbase-0.98.11-hadoop2/bin/hbase classpath) to hadoop-env (and also yarn-env); it still doesn't work.
One thing I noticed is that if I comment out the
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
I can run the jobs SUCCESSFULLY. BUT then they seem to run on a single node, not on the multi-node cluster.
Here are some of the configs:
mapred-site
<property>
<name>mapred.job.tracker</name>
<value>hadoop1:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
hdfs-site
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
yarn-site
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service that needs to be set for Map Reduce to run
</description>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
yarn-env and hadoop-env are just the defaults, except for HADOOP_CLASSPATH (which doesn't change anything whether I add it or not).
Here is the error trace:
2015-04-25 23:29:25,143 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at apriori2$FrequentItemsReduce.reduce(apriori2.java:550)
at apriori2$FrequentItemsReduce.reduce(apriori2.java:532)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1651)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1611)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1990)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:774)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Really, thanks for any help.
With YARN, you need to set the yarn.application.classpath property to the classpath for your MapReduce job. export HADOOP_CLASSPATH does not work with YARN.
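A sketch of what that could look like in yarn-site.xml, combining the stock Hadoop 2 classpath entries with the HBase lib directory from the question (treat the exact list as an assumption and adapt it to your layout):
<property>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/share/hadoop/common/*,
$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,
/usr/local/hbase-0.98.11-hadoop2/lib/*
</value>
</property>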

Configuring hive in local mode

I am trying to set up hive-0.9.0 in a local-mode configuration. In /conf, I have created hive-site.xml and specified the property for the warehouse folder.
But I think Hive is not using my defined location, as it is not creating the 'warehouse' folder in that location.
Also, is it necessary to have a Hadoop cluster running for a local-mode Hive configuration? Hive throws an error when I issue any DDL command without starting the Hadoop cluster:
FAILED: Error in metadata: MetaException(message:Got exception: java.net.ConnectException Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refused)
The contents of hive-site.xml are as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/home/hadoopuser/hive/warehouse</value>
<description>
Local or HDFS directory where Hive keeps table contents.
</description>
</property>
<property>
<name>hive.metastore.local</name>
<value>true</value>
<description>
Use false if a production metastore server is used.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/hadoopuser/hive/metastore_db;create=true</value>
<description>
The JDBC connection URL.
</description>
</property>
</configuration>
Hive queries internally run as MapReduce jobs, so Hadoop has to be up while you are trying to query Hive.
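So before issuing any DDL, start the Hadoop daemons and check that they are up. A sketch for a Hadoop 1.x-era install, which the connection error's port 54310 suggests ($HADOOP_HOME is assumed to point at your Hadoop install):
$HADOOP_HOME/bin/start-all.sh   # starts the HDFS and MapReduce daemons
jps                             # should list NameNode, DataNode, JobTracker, TaskTracker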