How to use Zeppelin to access aws spark-ec2 cluster and s3 buckets - amazon-s3

I have an aws ec2 cluster setup by the spark-ec2 script.
I would like to configure Zeppelin so that I can write scala code locally on Zeppelin and run it on the cluster (via master). Furthermore I would like to be able to access my s3 buckets.
I followed this guide and this other one however I can not seem to run scala code from zeppelin to my cluster.
I installed Zeppelin locally with
mvn install -DskipTests -Dspark.version=1.4.1 -Dhadoop.version=2.7.1
My security groups were set to both AmazonEC2FullAccess and AmazonS3FullAccess.
I edited the spark interpreter properties on the Zeppelin Webapp to spark://.us-west-2.compute.amazonaws.com:7077
from local[*]
When I test out
sc
in the interpreter, I recieve this error
java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.apache.thrift.transport.TSocket.open(TSocket.java:182) at
When I try to edit "conf/zeppelin-site.xml" to change my port to 8082, no difference.
NOTE: I eventually would also want to access my s3 buckets with something like:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","xxx")
val file = "s3n://<<bucket>>/<<file>>"
val data = sc.textFile(file)
data.first
if any benevolent users have any advice (that wasn't already posted on StackOverflow) please let me know!

Most likely your IP address is blocked from connecting to your spark cluster. You can try by launching the spark-shell pointing at that end point (or even just telnetting). To fix it you can log into your AWS account and change the firewall settings. Its also possible that it isn't pointed at the correct host (I'm assuming you removed the specific box from spark://.us-west-2.compute.amazonaws.com:7077 but if not there should be a bit for the .us-west-2). You can try ssh'ing to that machine and running netstat --tcp -l -n to see if its listening (or even just ps aux |grep java to see if Spark is running).

Related

Superset with Apache Spark on Hive

I have Apache Superset installed via Docker on my local machine. I have a separate production 20 Node Spark cluster with Hive as the Meta-Store. I want my SuperSet to be able to connect to Hive and run queries via Spark-SQL.
For connecting to Hive, I tried the following
**Add Database --> SQLAlchemy URI ***
hive://hive#<hostname>:10000/default
but it is giving some error when I test connection. I believe I have to do some tunneling, but I am not sure how.
I have the Hive thrift server as well.
Please let me know how to proceed.
What is the error you are receiving? Although the docs do not mention this, the best way to provide the connection URL is in the following format :
hive://<url>/default?auth=NONE ( when there is no security )
hive://<url>/default?auth=KERBEROS
hive://<url>/default?auth=LDAP
first you should connect the 2 containers together.
lets say you have the container_superset that runs superset and container_spark running spark.
run : docker network ls # display containers and its network
select the name of the superset network (should be something like superset_default )
run : docker run --network="superset_default" --name=NameTheConatinerHere --publish port1:port2 imageName
---> port1:port2 is the port mapping and imageName is the image of spak

How to set a specific port for single-user Jupyterhub server REST API calls?

I have setup Spark SQL on Jypterhub using Apache Toree SQL kernel. I wrote a Python function to update Spark configuration options in the kernel.json file for my team to change configuration based on their queries and cluster configuration. But I have to shutdown the running notebook and re-open or restart the kernel after running Python function. In this way, I'm forcing the Toree kernel to read the JSON file to pick up the new configuration.
I thought of implementing this shutdown and restart of kernel in a programmatic way. I got to know about the Jupyterhub REST API documentation and am able implement it by invoking related API's. But the problem is, the single user server API port is set randomly by the Spawner object of Jupyterhub and it keeps changing every time I spin up a cluster. I want this to be fixed before launching the Jupyterhub service.
Here is a solution I tried based on Jupyterhub docs:
sudo echo "c.Spawner.port = 35289
c.Spawner.ip = '127.0.0.1'" >> /etc/jupyterhub/jupyterhub_config.py
But this did not work as the port was again set by the Spawner randomly. I think there is a way to fix this. Any help on this would be greatly appreciated. Thanks

How do I connect to Neptune using Java

I have the following code based on the docs...
#Controller
#RequestMapping("neptune")
public class NeptuneEndpoint {
#GetMapping("")
#ResponseBody
public String test(){
Cluster.Builder builder = Cluster.build();
builder.addContactPoint("...endpoint...");
builder.port(8182);
Cluster cluster = builder.create();
GraphTraversalSource g = EmptyGraph.instance()
.traversal()
.withRemote(
DriverRemoteConnection.using(cluster)
);
GraphTraversal t = g.V().limit(2).valueMap();
t.forEachRemaining(
e -> System.out.println(e)
);
cluster.close();
return "Neptune Up";
}
}
But when I try to run I get ...
java.util.concurrent.TimeoutException: Timed out while waiting for an available host - check the client configuration and connectivity to the server if this message persists
Also how would I add Secret key from AWS IAM account?
Neptune doesn't allow you to connect to the db instance from your local machine. You can only connect to Neptune via an EC2 inside the same VPC as Neptune (aws documentation).
Try making a runnable jar of this code and run it inside an ec2, the code should work fine. If you're trying to debug something from your local system, then use PuTTY instance tunneling to connect to ec2 which then will be forwarded to neptune cluster.
Have you created an instance with IAM auth enabled?
If yes, you will have to sign your request using SigV4. More information (and examples) on how to connect using SigV4 is available at https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth-connecting-gremlin-java.html
The examples given in the documentation above also contain information on how to use your IAM credentials to connect to a Neptune cluster.
I just had the same issue and the root cause was a dependency version conflict with Netty which is unfortunately a very pervasive dependency. Gremlin 3.3.2 uses io.netty/netty-all version 4.0.56.Final. You might find your project depends on another Netty jar such as io.netty/netty or io.netty/netty-handler both of which can cause issues so you will need to excluded them from other dependencies in your POM or use managed-dependencies to set a project level Netty version.
Another option is to use a AWS SigV4 signing proxy that acts as a bridge between Neptune and your local development environment. One of these proxies is https://github.com/monken/aws4-proxy
npm install --global aws4-proxy
# have your credentials exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
aws4-proxy --service neptune-db --endpoint cluster-die4eenu.cluster-eede5pho.eu-west-1.neptune.amazonaws.com --region eu-west-1
wscat localhost:3000/gremlin
Refer this
Note: You need to be in the same VPC to access Neptune cluster.

How to do kerberos authentication on a flink standalone installation?

I have a standalone Flink installation on top of which I want to run a streaming job that is writing data into a HDFS installation. The HDFS installation is part of a Cloudera deployment and requires Kerberos authentication in order to read and write the HDFS. Since I found no documentation on how to make Flink connect with a Kerberos-protected HDFS I had to make some educated guesses about the procedure. Here is what I did so far:
I created a keytab file for my user.
In my Flink job, I added the following code:
UserGroupInformation.loginUserFromKeytab("myusername", "/path/to/keytab");
Finally I am using a TextOutputFormatto write data to the HDFS.
When I run the job, I'm getting the following error:
org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBE
ROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1730)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1668)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1593)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:397)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:393)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:393)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:337)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:405)
For some odd reason, Flink seems to try SIMPLE authentication, even though I called loginUserFromKeytab. I found another similar issue on Stackoverflow (Error with Kerberos authentication when executing Flink example code on YARN cluster (Cloudera)) which had an answer explaining that:
Standalone Flink currently only supports accessing Kerberos secured HDFS if the user is authenticated on all worker nodes.
That may mean that I have to do some authentication at the OS level e.g. with kinit. Since my knowledge of Kerberos is very limited I have no idea how I would do it. Also I would like to understand how the program running after kinit actually knows which Kerberos ticket to pick from the local cache when there is no configuration whatsoever regarding this.
I'm not a Flink user, but based on what I've seen with Spark & friends, my guess is that "Authenticated on all worker nodes" means that each worker process has
a core-site.xml config available on local fs with
hadoop.security.authentication set to kerberos (among other
things)
the local dir containing core-site.xml added to the CLASSPATH so that it is found automatically by the Hadoop Configuration object [it will revert silently to default hard-coded values otherwise, duh]
implicit authentication via kinit and the default cache [TGT set globally for the Linux account, impacts all processes, duh] ## or ## implicit authentication via kinit and a "private" cache set thru KRB5CCNAME env variable (Hadoop supports only "FILE:" type) ## or ## explicit authentication via UserGroupInformation.loginUserFromKeytab() and a keytab available on the local fs
That UGI "login" method is incredibly verbose, so if it was indeed called before Flink tries to initiate the HDFS client from the Configuration, you will notice. On the other hand, if you don't see the verbose stuff, then your attempt to create a private Kerberos TGT is bypassed by Flink, and you have to find a way to bypass Flink :-/
You can also configure your stand alone cluster to handle authentication for you without additional code in your jobs.
Export HADOOP_CONF_DIR and point it to directory where core-site.xml and hdfs-site.xml is located
Add to flink-conf.yml
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: <path to keytab>
security.kerberos.login.principal: <principal>
env.java.opts: -Djava.security.krb5.conf=<path to krb5 conf>
Add pre-bundled Hadoop to lib directory of your cluster https://flink.apache.org/downloads.html
The only dependencies you should need in your jobs is:
compile "org.apache.flink:flink-java:$flinkVersion"
compile "org.apache.flink:flink-clients_2.11:$flinkVersion"
compile 'org.apache.hadoop:hadoop-hdfs:$hadoopVersion'
compile 'org.apache.hadoop:hadoop-client:$hadoopVersion'
In order to access a secured HDFS or HBase installation from a standalone Flink installation, you have to do the following:
Log into the server running the JobManager, authenticate against Kerberos using kinit and start the JobManager (without logging out or switching the user in between).
Log into each server running a TaskManager, authenticate again using kinit and start the TaskManager (again, with the same user).
Log into the server from where you want to start your streaming job (often, its the same machine running the JobManager), log into Kerberos (with kinit) and start your job with /bin/flink run.
In my understanding, kinit is logging in the current user and creating a file somewhere in /tmp with some login data. The mostly static class UserGroupInformation is looking up that file with the login data when its loaded the first time. If the current user is authenticated with Kerberos, the information is used to authenticate against HDFS.

elasticsearch-mesos not getting listed under frameworks of mesosUI

Iam trying to run elasticsearch-mesos on mesos.My machine is running ubuntu 14.04. I have running mesos cluster installed with mesosphere packages by following these instructions. When I run test frameworks it gets lister under frameworks of mesosUI but for elasticsearch-mesos its not getting listed under mesos webUI. I want to run elasticsearch-mesos on top of mesos. I followed instructions given here. When I run ./elasticsearch-mesos I am getting a message in terminal
I0108 17:24:01.898540 23861 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I tried running ./elasticsearch-mesos on both mesos masters and slaves.
The last few lines of terminal output is given below
2015-01-08 17:24:01,881:23844(0x7f175bfff700):ZOO_INFO#zookeeper_init#786: Initiating
client connection, host=localhost:2181 sessionTimeout=10000 watcher=0x7f1762a3e6a0
sessionId=0 sessionPasswd=<null> context=0x7f1710002530 flags=0
I0108 17:24:01.881392 23858 sched.cpp:137] Version: 0.21.1
2015-01-08 17:24:01,881:23844(0x7f172b7fe700):ZOO_INFO#check_events#1703: initiated
connection to server [127.0.0.1:2181]
2015-01-08 17:24:01,897:23844(0x7f172b7fe700):ZOO_INFO#check_events#1750: session
establishment complete on server [127.0.0.1:2181], sessionId=0x14ac7c469270006,
negotiated timeout=10000
I0108 17:24:01.898455 23861 group.cpp:313] Group process (group(1)#127.0.1.1:38668)
connected to ZooKeeper
I0108 17:24:01.898509 23861 group.cpp:790] Syncing group operations: queue size (joins,
cancels, datas) = (0, 0, 0)
I0108 17:24:01.898540 23861 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
According to the README at https://github.com/mesosphere/elasticsearch-mesos,
you may need to modify mesos.master.url to point to the same ZK url that the Mesos master is using (maybe not localhost). If you're using a single-master Mesos cluster, you can skip the ZK url and point this parameter directly to the Mesos master.
Please also note that the elasticsearch framework is a bit outdated, so use with caution