Superset with Apache Spark on Hive

I have Apache Superset installed via Docker on my local machine. I have a separate production 20-node Spark cluster with Hive as the metastore. I want Superset to be able to connect to Hive and run queries via Spark SQL.
To connect to Hive, I tried the following:
**Add Database --> SQLAlchemy URI**
hive://hive@<hostname>:10000/default
but it gives an error when I test the connection. I believe I have to do some tunneling, but I am not sure how.
I have the Hive thrift server as well.
Please let me know how to proceed.

What is the error you are receiving? Although the docs do not mention this, the best way to provide the connection URL is in the following format:
hive://<url>/default?auth=NONE (when there is no security)
hive://<url>/default?auth=KERBEROS
hive://<url>/default?auth=LDAP
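If the connection still fails from Superset, a quick way to isolate the problem is to hit the same Thrift endpoint from Python with PyHive (the driver Superset uses for hive:// URIs). This is only a sketch; the hostname, username and auth mode below are placeholders you should adapt to your Spark Thrift server:

from pyhive import hive

# Placeholders: host, username and auth mode must match your Thrift server setup
conn = hive.connect(host="<hostname>", port=10000, username="hive", auth="NONE")
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())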

First, you should connect the two containers.
Let's say you have container_superset running Superset and container_spark running Spark.
Run: docker network ls  # lists the Docker networks
Select the name of the Superset network (it should be something like superset_default).
Run: docker run --network="superset_default" --name=NameTheContainerHere --publish port1:port2 imageName
---> port1:port2 is the port mapping and imageName is the Spark image.

Related

PDI connect to MongoDB Atlas

I am using Pentaho Data Integration 9 Community Edition and trying to connect to MongoDB Atlas, but without success.
I tried the URL MongoDB provides:
mongodb+srv://<username>:<password>@something.XYZ.mongodb.net/<dbname>?retryWrites=true&w=majority
which gives me the following error:
org.pentaho.mongo.MongoDbException: Malformed host spec: mongodb+srv://<username>:<password>@something.XYZ.mongodb.net/<dbname>?retryWrites=true&w=majority
I saw a tip to change to the old-style connection string, something similar to the following:
mongodb://user:password@cluster0-shard-00-00-wuhae.mongodb.net:27017,cluster0-shard-00-01-wuhae.mongodb.net:27017,cluster0-shard-00-02-wuhae.mongodb.net:27017/shop?ssl=true&replicaSet=Cluster0-shard-0&authSource=admin&retryWrites=true
but also without success.
Any ideas?
You need to specify the replica set hosts instead, since PDI does not seem to support the mongodb+srv syntax.
So in my case I had to add the following:
test-shard-00-01.XYZ.mongodb.net,test-shard-00-00.XYZ.mongodb.net,test-shard-00-02.XYZ.mongodb.net
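Putting it together, the legacy-style connection string looks roughly like this (credentials, database name and replica set name are placeholders you need to fill in from your Atlas cluster):
mongodb://<username>:<password>@test-shard-00-00.XYZ.mongodb.net:27017,test-shard-00-01.XYZ.mongodb.net:27017,test-shard-00-02.XYZ.mongodb.net:27017/<dbname>?ssl=true&replicaSet=<replicaSetName>&authSource=admin&retryWrites=true&w=majority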

How to set a specific port for single-user Jupyterhub server REST API calls?

I have set up Spark SQL on JupyterHub using the Apache Toree SQL kernel. I wrote a Python function that updates the Spark configuration options in the kernel.json file so that my team can change the configuration based on their queries and cluster setup. But after running the Python function I have to shut down the running notebook and re-open it, or restart the kernel, to force the Toree kernel to re-read the JSON file and pick up the new configuration.
I thought of implementing this shutdown and restart of the kernel programmatically. I found the JupyterHub REST API documentation and was able to implement it by invoking the related APIs. The problem is that the single-user server API port is set randomly by JupyterHub's Spawner object and changes every time I spin up a cluster. I want it to be fixed before launching the JupyterHub service.
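Roughly, the restart call looks like this (a sketch; the port, kernel id and token are placeholders, and the port is exactly the value that keeps changing):

import requests

# Placeholders: PORT is the randomly assigned single-user server port,
# KERNEL_ID identifies the running Toree kernel, TOKEN is an API token.
PORT = 35289
KERNEL_ID = "<kernel-id>"
TOKEN = "<api-token>"

# Restart the kernel through the single-user server's REST API so it
# re-reads kernel.json and picks up the new Spark configuration.
requests.post(
    f"http://127.0.0.1:{PORT}/api/kernels/{KERNEL_ID}/restart",
    headers={"Authorization": f"token {TOKEN}"},
)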
Here is a solution I tried, based on the JupyterHub docs:
sudo echo "c.Spawner.port = 35289
c.Spawner.ip = '127.0.0.1'" >> /etc/jupyterhub/jupyterhub_config.py
But this did not work; the port was again set randomly by the Spawner. I think there must be a way to fix this. Any help would be greatly appreciated. Thanks.

Apache Drill - Hive Integration: Drill Not listing Tables

I have been trying to integrate Apache Drill with Hive using the Hive storage plugin configuration. I configured the storage plugin with all the required properties. In the Drill shell, I can view the Hive databases using:
Show Databases;
But when I try to list tables using:
Show Tables;
I get no results (no list of tables).
Below are the steps I followed from the Apache Drill documentation and other sources:
I created a Drill distributed cluster by updating drill-override.conf with the same cluster ID on all nodes, along with the ZooKeeper IP and port, and then invoking drillbit.sh on each node.
I started the Drill shell using drill-conf and ensured that the Hive metastore service is active as well.
Below is the configuration of the Hive storage plugin for Drill (from its web UI):
{
  "type": "hive",
  "configProps": {
    "hive.metastore.uris": "thrift://node02.cluster.com:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://node02.cluster.com/hive",
    "hive.metastore.warehouse.dir": "/apps/hive/warehouse",
    "fs.default.name": "hdfs://node01.cluster.com:8020",
    "hive.metastore.sasl.enabled": "false"
  },
  "enabled": true
}
All the properties were set after referring to hive-site.xml.
So, that's what all others have done to integrate Drill with Hive. Am I missing something here?
Regarding versions:
Drill: 1.14, Hive: 1.2 (Hive metastore: MySQL)
We also have HiveServer2 on the same nodes; is that causing any issues?
I just want to integrate Drill with Hive 1.2. Am I doing it right?
Any pointers will be helpful; I have spent nearly two days trying to get this right.
Thanks for your time.
Starting from version 1.13, Drill leverages the Hive 2.3.2 client.
It is recommended to use Hive 2.3 to avoid unpredictable issues.
Regarding your setup, please remove all configProps except hive.metastore.uris.
The other configs can keep their defaults (as defined in HiveConf.java) or can be specified in your hive-site.xml.
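For example, the trimmed plugin definition would look like this (keeping the thrift URI from your setup):
{
  "type": "hive",
  "configProps": {
    "hive.metastore.uris": "thrift://node02.cluster.com:9083"
  },
  "enabled": true
}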
Also, if Show Tables; returns an empty result even after executing use hive, check Drill's log files for errors. If there is an error, you can create a Jira ticket to improve Drill's output to reflect that issue.
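For reference, the sequence mentioned above, run from the Drill shell, is:
USE hive;
SHOW TABLES;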

How to use Zeppelin to access aws spark-ec2 cluster and s3 buckets

I have an AWS EC2 cluster set up by the spark-ec2 script.
I would like to configure Zeppelin so that I can write Scala code locally in Zeppelin and run it on the cluster (via the master). Furthermore, I would like to be able to access my S3 buckets.
I followed this guide and this other one, however I cannot seem to run Scala code from Zeppelin on my cluster.
I installed Zeppelin locally with
mvn install -DskipTests -Dspark.version=1.4.1 -Dhadoop.version=2.7.1
My security groups were set to both AmazonEC2FullAccess and AmazonS3FullAccess.
In the Zeppelin web app I changed the Spark interpreter's master property from local[*] to spark://.us-west-2.compute.amazonaws.com:7077.
When I test out
sc
in the interpreter, I receive this error:
java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.apache.thrift.transport.TSocket.open(TSocket.java:182) at
When I try editing conf/zeppelin-site.xml to change my port to 8082, it makes no difference.
NOTE: I eventually also want to access my S3 buckets with something like:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","xxx")
val file = "s3n://<<bucket>>/<<file>>"
val data = sc.textFile(file)
data.first
If any benevolent users have any advice (that wasn't already posted on StackOverflow), please let me know!
Most likely your IP address is blocked from connecting to your Spark cluster. You can check by launching spark-shell pointed at that endpoint (or even just telnetting to it). To fix it, you can log into your AWS account and change the firewall settings. It is also possible that it isn't pointed at the correct host (I'm assuming you removed the specific box from spark://.us-west-2.compute.amazonaws.com:7077, but if not, there should be a hostname before the .us-west-2). You can try ssh'ing into that machine and running netstat --tcp -l -n to see if it's listening (or even just ps aux | grep java to see if Spark is running).

elasticsearch-mesos not getting listed under frameworks of mesosUI

I am trying to run elasticsearch-mesos on Mesos. My machine is running Ubuntu 14.04. I have a running Mesos cluster installed with Mesosphere packages by following these instructions. When I run the test frameworks they get listed under Frameworks in the Mesos UI, but elasticsearch-mesos does not show up there. I want to run elasticsearch-mesos on top of Mesos. I followed the instructions given here. When I run ./elasticsearch-mesos I get this message in the terminal:
I0108 17:24:01.898540 23861 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I tried running ./elasticsearch-mesos on both the Mesos masters and slaves.
The last few lines of the terminal output are given below:
2015-01-08 17:24:01,881:23844(0x7f175bfff700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=localhost:2181 sessionTimeout=10000 watcher=0x7f1762a3e6a0 sessionId=0 sessionPasswd=<null> context=0x7f1710002530 flags=0
I0108 17:24:01.881392 23858 sched.cpp:137] Version: 0.21.1
2015-01-08 17:24:01,881:23844(0x7f172b7fe700):ZOO_INFO@check_events@1703: initiated connection to server [127.0.0.1:2181]
2015-01-08 17:24:01,897:23844(0x7f172b7fe700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:2181], sessionId=0x14ac7c469270006, negotiated timeout=10000
I0108 17:24:01.898455 23861 group.cpp:313] Group process (group(1)@127.0.1.1:38668) connected to ZooKeeper
I0108 17:24:01.898509 23861 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0108 17:24:01.898540 23861 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
According to the README at https://github.com/mesosphere/elasticsearch-mesos,
you may need to modify mesos.master.url to point to the same ZK URL that the Mesos master is using (maybe not localhost). If you're using a single-master Mesos cluster, you can skip the ZK URL and point this parameter directly at the Mesos master.
Please also note that the elasticsearch framework is a bit outdated, so use it with caution.
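For example (a sketch; I'm assuming the property goes into the framework's elasticsearch.yml configuration file, and the ZooKeeper host is a placeholder):
mesos.master.url: zk://<zookeeper-host>:2181/mesos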