Access CFS from a Spark application - datastax

I am trying to read from and write to my local CFS installation, which I set up by installing DSE locally in standalone mode.
My guess is that, in order to connect to CFS, I somehow have to use the right host name when creating the Spark context, and by "right" I mean the one used by the Spark master when I run ./dse spark.
It should be fairly easy, but I can't figure out how. Any ideas?

You should be able to access CFS using a path without an explicit scheme (it resolves against the default filesystem), just as in a Hadoop environment:
sc.textFile("/yourpath")
EDIT
OK, so you probably also need to specify the driver host to enable communication with the master.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cfs-example")
  .set("spark.driver.host", "<driver ip>")
  .setMaster("spark://<master host>:7077")
val sc = new SparkContext(conf)
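With the context pointed at the DSE master, reading from CFS is then just a matter of the path. A minimal sketch, assuming CFS is the default filesystem for that context and is also reachable under the cfs:// scheme (host and path are placeholders):
// Default filesystem (CFS when the context is attached to the DSE master)
val lines = sc.textFile("/yourpath")
// Or with the scheme spelled out explicitly
val sameLines = sc.textFile("cfs://<master host>/yourpath")
println(lines.count())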

Related

Superset with Apache Spark on Hive

I have Apache Superset installed via Docker on my local machine, and a separate production 20-node Spark cluster with Hive as the metastore. I want Superset to be able to connect to Hive and run queries via Spark SQL.
To connect to Hive, I tried the following:
Add Database --> SQLAlchemy URI
hive://hive#<hostname>:10000/default
but it gives an error when I test the connection. I believe I have to do some tunneling, but I am not sure how.
I have the Hive Thrift server running as well.
Please let me know how to proceed.
What is the error you are receiving? Although the docs do not mention this, the best way to provide the connection URL is in the following format:
hive://<url>/default?auth=NONE (when there is no security)
hive://<url>/default?auth=KERBEROS
hive://<url>/default?auth=LDAP
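If the connection test keeps failing, one sanity check outside Superset (a sketch of my own, not part of the steps above; it assumes the hive-jdbc driver is on the classpath and that the Thrift server runs without authentication) is to hit the same endpoint over JDBC:
import java.sql.DriverManager

// Same host and port the SQLAlchemy URI points at; <hostname> is a placeholder.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://<hostname>:10000/default", "hive", "")
val rs = conn.createStatement().executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()
If this works but Superset still cannot connect, the problem is more likely on the Superset/Docker networking side than with Hive itself.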
First, you should connect the two containers together.
Let's say you have container_superset running Superset and container_spark running Spark.
Run: docker network ls  # lists the Docker networks
Select the name of the Superset network (it should be something like superset_default).
Run: docker run --network="superset_default" --name=NameTheContainerHere --publish port1:port2 imageName
---> port1:port2 is the port mapping and imageName is the Spark image.

"No filesystem found for scheme s3" when trying to read/write using Apache Beam

I am using Apache Beam in a project for the first time, and I am trying to read and write Parquet files to and from S3 from an EMR cluster on AWS.
However, each time I try to execute my code, I only get:
java.lang.IllegalArgumentException: No filesystem found for scheme s3
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:459)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:119)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:140)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:152)
at org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn.process(FileIO.java:636)
The documentation does not provide any example, so I have no clue if I have to initialize something anywhere in my code.
I tried to check the Beam source code but, from what I understand, the FileSystems class should register all filesystem modules, and my pom.xml contains the Amazon Web Services Beam module (which in turn brings in the AWS S3 module).
The only initialization I am doing right now is:
val options = PipelineOptionsFactory.create()
options.runner = SparkRunner::class.java
val pipeline = Pipeline.create(options)
...
val runner = SparkRunner.fromOptions(options)
runner.run(pipeline).waitUntilFinish()
Spark starts to run correctly, up until the exception.
Any suggestion?
I believe you need to create a custom options class for the Apache Beam job that carries the AWS credentials.
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.sdk.io.aws.options.AwsOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

BasicAWSCredentials awsCreds = new BasicAWSCredentials(accessKey, secretKey);
YourCustomOptionsClass options = PipelineOptionsFactory.create().as(YourCustomOptionsClass.class);
options.as(AwsOptions.class).setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCreds));
options.as(AwsOptions.class).setAwsRegion(region);
options.setRunner(DataflowRunner.class);
options.setProject(projectId);
// options.set... (all other options you need)
In my code, YourCustomOptionsClass implements S3Options and DataflowPipelineOptions.
To find out more about creating custom options, check out the Apache Beam documentation:
https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options
Another full example that may help:
https://github.com/asaharland/beam-pipeline-examples/tree/master/src/main/java/com/harland/example/batch
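For reference, a rough Scala sketch of the same idea with the Spark runner from the question (the keys, region, and use of static credentials are placeholder assumptions; on EMR you would normally prefer the instance role via the default credentials chain):
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import org.apache.beam.runners.spark.SparkRunner
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.aws.options.AwsOptions
import org.apache.beam.sdk.options.PipelineOptionsFactory

val options = PipelineOptionsFactory.create()
options.setRunner(classOf[SparkRunner])

// Placeholder credentials and region; the point is that the options handed to
// Pipeline.create already carry the S3 configuration.
val awsOptions = options.as(classOf[AwsOptions])
awsOptions.setAwsCredentialsProvider(
  new AWSStaticCredentialsProvider(new BasicAWSCredentials("<accessKey>", "<secretKey>")))
awsOptions.setAwsRegion("<region>")

val pipeline = Pipeline.create(options)
// ... Parquet read/write transforms go here ...
pipeline.run().waitUntilFinish()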

S3 checkpointing for Spark Streaming leads to an error

I have enabled checkpointing for my Spark Streaming application using the getOrCreate method. The checkpoint directory points to an S3 bucket.
The problem I have is a credentials issue when accessing S3:
Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
I have already set the environment variables (AWS_SECRET_KEY and AWS_ACCESS_KEY).
Also, fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey have been specified in my application.conf, so I don't know why it still fails.
The environment variables (AWS_SECRET_KEY and AWS_ACCESS_KEY) no longer work as of Spark 1.3.
Please refer to the following question for the new approach:
How to read input from S3 in a Spark Streaming EC2 cluster application
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)

How to submit code to a remote Spark cluster from IntelliJ IDEA

I have two clusters, one in a local virtual machine and another in a remote cloud. Both clusters are in standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
I want to do the following:
Write code (a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark master URL to spark://local1:7077 or spark://remote1:7077, and then run it from IntelliJ IDEA. That is, I don't want to use spark-submit to submit the job.
But I ran into a problem:
When I use the local cluster, everything goes well: both running the code from IntelliJ IDEA and using spark-submit submit the job to the cluster and finish it.
But when I use the remote cluster, I get a warning log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The warning says "sufficient resources", not "sufficient memory"!
This log keeps printing with no further progress. Both spark-submit and running the code from IntelliJ IDEA give the same result.
I want to know:
Is it possible to submit code from IntelliJ IDEA to a remote cluster?
If so, what configuration does it need?
What are the possible causes of my problem?
How can I handle this problem?
Thanks a lot!
Update
There is a similar question here, but I think my situation is different. When I run my code in IntelliJ IDEA and set the Spark master to the local virtual machine cluster, it works; against the remote cluster I get the Initial job has not accepted any resources... warning instead.
I want to know whether a security policy or firewall could cause this.
Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the least, there is a variety of environment settings and considerations - handled by the spark-submit script - that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads within the Spark developer community on the topic.
My answer here addresses a portion of your post, specifically the
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources
The reason is typically a mismatch between the memory and/or number of cores requested by your job and what is available on the cluster. Possibly, when submitting from IntelliJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
did not match the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the resources you requested are actually available on the cluster.
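If you do run the job from the IDE rather than through spark-submit, the same limits can also be set programmatically. A minimal sketch (the master URL matches the question; the memory and core numbers and the jar path are placeholders to adapt to what your remote workers actually offer):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://remote1:7077")
  // Keep these within what the remote workers can actually offer.
  .set("spark.executor.memory", "2g")
  .set("spark.executor.cores", "2")
  // Ship the compiled job so the remote executors can load your classes
  // (placeholder path for the jar produced by your build).
  .setJars(Seq("target/wordcount-assembly.jar"))

val sc = new SparkContext(conf)
Note that spark.driver.memory cannot be changed this way once the JVM is already running; set it through the IDE's VM options or in spark-defaults.conf instead.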

How to use Zeppelin to access an AWS spark-ec2 cluster and S3 buckets

I have an AWS EC2 cluster set up by the spark-ec2 script.
I would like to configure Zeppelin so that I can write Scala code locally in Zeppelin and run it on the cluster (via the master). Furthermore, I would like to be able to access my S3 buckets.
I followed this guide and this other one; however, I cannot seem to run Scala code from Zeppelin against my cluster.
I installed Zeppelin locally with
mvn install -DskipTests -Dspark.version=1.4.1 -Dhadoop.version=2.7.1
My security groups were set to both AmazonEC2FullAccess and AmazonS3FullAccess.
I edited the Spark interpreter master property in the Zeppelin web app from local[*] to
spark://.us-west-2.compute.amazonaws.com:7077
When I test out
sc
in the interpreter, I receive this error:
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
at ...
When I try to edit conf/zeppelin-site.xml to change my port to 8082, it makes no difference.
NOTE: I would eventually also want to access my S3 buckets with something like:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","xxx")
val file = "s3n://<<bucket>>/<<file>>"
val data = sc.textFile(file)
data.first
If any benevolent users have advice (that wasn't already posted on Stack Overflow), please let me know!
Most likely your IP address is blocked from connecting to your Spark cluster. You can check by launching spark-shell pointed at that endpoint (or even just telnetting to it). To fix it, you can log into your AWS account and change the firewall settings. It's also possible that it isn't pointed at the correct host (I'm assuming you removed the specific box name from spark://.us-west-2.compute.amazonaws.com:7077, but if not, there should be a host part before the .us-west-2). You can try SSHing to that machine and running netstat --tcp -l -n to see if it's listening (or even just ps aux | grep java to see if Spark is running).