Setting up Spark Thrift Server on AWS EMR for making a JDBC/ODBC connection - amazon-emr

How do I set up the Spark Thrift Server on EMR? I am trying to make a JDBC/ODBC connection to EMR using the Spark Thrift Server, e.g.
beeline> !connect jdbc:hive2://10.253.3.5:10015
We execute the following to restart HiveServer2 -
sudo stop hive-server2
sudo stop hive-hcatalog-server
sudo start hive-hcatalog-server
sudo start hive-server2
I am not sure which services to restart for the Spark Thrift Server on AWS EMR, or how to set up the user ID and password.

We need to start the Spark Thrift Server by executing the following on EMR -
sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client
The default port is 10001.
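If you need it to listen on a different port (for example the 10015 used in the question above), the Spark Thrift Server reuses the standard HiveServer2 property, which can be passed at startup; a sketch:
sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client --hiveconf hive.server2.thrift.port=10015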
Test the connection as below -
/usr/lib/spark/bin/beeline -u 'jdbc:hive2://x.x.x.x:10001/default' -e "show databases;"
The Spark JDBC driver can be used to connect to the Thrift Server from any application.
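On the user ID and password part of the question: by default the Thrift Server runs with no authentication (hive.server2.authentication=NONE), so beeline accepts and ignores whatever credentials you pass; real credentials require configuring HiveServer2-style authentication such as LDAP. A sketch, with hadoop being the default EMR user and the password a placeholder:
/usr/lib/spark/bin/beeline -u 'jdbc:hive2://x.x.x.x:10001/default' -n hadoop -p anypassword -e "show databases;"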

Related

Not able to launch Apache Hive through the CLI

I have a kerberized Hortonworks Hadoop cluster running. Beeline works fine,
but when I launch hive it fails with the following error:
Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: GSS initiate failed
[root@hdpm1 ~]# su - hive
[hive@hdpm1 ~]$ hive
Before running beeline you must get a TGT using kinit.
An example for the hive user using the service keytab:
kinit -kt <path_to_keytab> <principal_name>
kinit -kt /etc/security/keytabs/hive.service.keytab hive/<host>
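To confirm the TGT was actually obtained before retrying hive, klist (part of the standard Kerberos client tools) prints the current ticket cache:
klist
A ticket for hive/<host>@<REALM> should be listed if kinit succeeded.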

Apache not able to communicate with a remote DB server

I configured an EC2 instance with the following environment:
CentOS 7.4, PHP 5.4, HTTPD 2.4
When I hit an external DB connection request from this EC2 instance through a web page, it ends in a connection timeout, while if I run the same PHP script through the CLI it works fine and connects to the database. I also tried telnet to the remote DB server on port 3306 and it connected.
I have tried disabling SELinux ("setenforce 0") and allowing Apache to make remote connections:
setenforce 0
setsebool -P httpd_can_network_connect 1
setsebool -P httpd_can_network_connect_db 1
But nothing worked.
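If the booleans were set, it is worth verifying that they actually took effect and checking which mode SELinux is in (both commands are standard SELinux tooling):
getsebool httpd_can_network_connect httpd_can_network_connect_db
sestatus
If both report on and SELinux is permissive or disabled, the timeout is more likely a network-level issue (for example an outbound firewall or security-group rule) than SELinux.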

Redis Monitor using Prometheus and Grafana

I have installed Redis on a server and wish to monitor it via Prometheus and Grafana.
I installed redis_exporter on the Redis server using Docker:
$ docker pull oliver006/redis_exporter
$ docker run -d --name redis_exporter -p 9121:9121 oliver006/redis_exporter
I checked that redis_exporter is running on the server.
Then I added the IP of the Redis/redis_exporter host to the prometheus.yml file on the Grafana server:
- job_name: 'redis_exporter'
  target_groups:
    - targets: ['IP:9121']
      labels:
        alias: redis
I restarted Prometheus on the Grafana server and checked the Prometheus status page: it shows UP for the Redis server IP:9121 mentioned in prometheus.yml.
In Grafana, I imported the Prometheus Redis dashboard (https://grafana.com/dashboards/763), but no data is loading in the dashboard, and the IP is not listed in it either.
Two things to check here:
Try this URL and see if you're able to get the metrics:
curl -s "<redis_exporter>:9121/scrape?target=redis://<redis_instance>:6379"
Update the Grafana dashboard variable from label_values(redis_up, addr) to label_values(redis_up, instance).
If you have set password authentication for Redis, you need to supply the Redis password to redis_exporter:
sudo docker run -d --name redis_exporter -p 9121:9121 oliver006/redis_exporter --redis.addr=redis://10.0.0.175:6379 --redis.password=redis_password_here
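One caveat on the scrape config shown in the question: target_groups is the old Prometheus 1.x key; in Prometheus 2.x it was replaced by static_configs, so on a 2.x server the equivalent job would be:
- job_name: 'redis_exporter'
  static_configs:
    - targets: ['IP:9121']
      labels:
        alias: redis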

Cannot connect to remote jmx server using jvisualvm or jconsole (netcat working)

I have a Spark application running on a remote server and I need to get its heap dump for performance purposes. I was able to run the jstatd service on the remote machine and connect to it using VisualVM. However, jstatd does not enable heap dumps of remote machines (I am using VisualVM 1.3.8).
To resolve this I started my application with the following extra options:
--conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=54321 \
-Dcom.sun.management.jmxremote.rmi.port=54320 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Djava.rmi.server.hostname=$HOSTNAME"
After running this I used netstat to list all ports opened by the process and got the following output:
sudo netstat -lp | grep 37407
tcp 0 0 *:54321 *:* LISTEN 37407/java
tcp 0 0 *:54320 *:* LISTEN 37407/java
To check whether the remote ports were accessible from my local machine I used the netcat utility, and the connection to the remote host on both 54321 and 54320 was successful.
However, when I try to connect to the host using VisualVM or jconsole, it fails to connect. VisualVM reports the following error:
cannot connect to hostname:54321 using service:jmx:rmi:///jndi/rmi://hostname:54321/jmxrmi
What am I doing wrong here?
In order to enable the jconsole connection, try adding this flag:
-Dcom.sun.management.jmxremote.local.only=false
And in order to get a heap dump, you don't need to connect via jconsole; just use jmap:
$ jmap -dump:format=b,live,file=<filename> <process-id>
And finally, if Spark has a daemon controlling it, make sure it doesn't kill the process during heap dump creation.
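For the <process-id> argument to jmap above, jps (shipped with the JDK alongside jmap) is a quick way to find the driver JVM's pid:
$ jps -lm
This prints the pid plus the main class and arguments of every local JVM; pick the Spark driver's entry.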
The problem is that $HOSTNAME is the hostname of the server you run spark-submit from; you need to set it to the hostname of the machine the Spark driver runs on:
-Djava.rmi.server.hostname=<hostname of spark driver>
BTW, this is the reason it only worked for you when your Spark application and spark-submit were on the same server.
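Putting it together, a sketch of a corrected launch, with driver-host.example.com standing in for the actual machine the driver runs on (in yarn-client mode that is the spark-submit host itself) and your-app.jar as a placeholder application:
spark-submit \
--conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=54321 \
-Dcom.sun.management.jmxremote.rmi.port=54320 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.local.only=false \
-Djava.rmi.server.hostname=driver-host.example.com" \
your-app.jar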

Connect to Spark running via YARN through an SSH tunnel

I have a Spark installation running under YARN on a remote cluster, with a firewall between me and the head node. I can use an SSH tunnel to access the head node:
> ssh -N -f -L 10000:remotenode:10000 between_machine
and this setup works, for example, to access a HiveServer2 running on remotenode. If Spark were running in standalone mode, I would just need to do the same for port 7077 and point the pyspark client at localhost with
> ssh -N -f -L 7077:remotenode:7077 between_machine
> ./pyspark --master spark://localhost:7077
How can I do that with Spark running under the YARN scheduler?
If you are looking for a port to connect to, here is a quote from the docs:
You can access this interface by simply opening
http://<driver-node>:4040 in a web browser. If multiple SparkContexts
are running on the same host, they will bind to successive ports
beginning with 4040 (4041, 4042, etc).
If you are just looking for a more universal way to get to the host via an SSH "tunnel", you could try running ssh as a SOCKS proxy:
ssh user@host -D 20000
And then configure your browser to connect via the SOCKS proxy (host: localhost, port: 20000).
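To verify the tunnel end to end without a browser, curl can go through the same SOCKS proxy (assuming the YARN ResourceManager UI is on its default port 8088 on the remote node):
curl --socks5-hostname localhost:20000 http://remotenode:8088/cluster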