I am trying to get data from Redis using AWS Glue (Python). I want to know how to connect to Redis from the Spark context. Redis is also hosted in the same AWS region.
I saw this code on the Redis website, but I am unable to find a code sample for PySpark.
import com.redislabs.provider.redis._
...
val sc = new SparkContext(new SparkConf()
    .setMaster("local")
    .setAppName("myApp")
    // initial redis host - can be any node in cluster mode
    .set("redis.host", "localhost")
    // initial redis port
    .set("redis.port", "6379")
    // optional redis AUTH password
    .set("redis.auth", "")
)
Is it possible to connect to Redis from PySpark?
Q: What data sources does AWS Glue support?
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. The metadata stored in the AWS Glue Data Catalog can be readily accessed from Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. You can also write custom Scala or Python code and import custom libraries and Jar files into your Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.
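To make that concrete, here is a minimal PySpark sketch of what the Redis read could look like once the spark-redis connector JAR (and its dependencies) has been attached to the Glue job as a custom library. The endpoint below is a placeholder, and the config prefix depends on the connector version: newer releases read spark.redis.*, while older ones use redis.* as in the Scala sample above.

from pyspark.sql import SparkSession

# Build a session with the Redis connection settings read by spark-redis.
# The host is a placeholder for your ElastiCache/Redis endpoint.
spark = (SparkSession.builder
         .appName("glue-redis-example")
         .config("spark.redis.host", "my-redis-host.example.com")
         .config("spark.redis.port", "6379")
         .config("spark.redis.auth", "")  # optional AUTH password
         .getOrCreate())

# Read Redis hashes whose keys match "person:*" into a DataFrame using the
# DataFrame source shipped with spark-redis.
df = (spark.read
      .format("org.apache.spark.sql.redis")
      .option("keys.pattern", "person:*")
      .option("infer.schema", "true")
      .load())

df.show()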
Related
I have been trying several solutions with custom JARs from Redis Labs and also with --packages in spark-submit on EMR, and still no success. Is there any simple way in EMR to connect to ElastiCache?
I've read that AWS Glue is a Hive-compatible datastore, but I haven't found how to use AWS Glue as a JDBC data source.
I'd like to use the AWS Glue Catalog as a source for my reporting, as the Hive documentation shows here:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLforRemoteorEmbeddedMode
Connection URL for Remote or Embedded Mode
The JDBC connection URL format has the prefix jdbc:hive2:// and the Driver class is org.apache.hive.jdbc.HiveDriver. Note that this is different from the old HiveServer.
For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db>;initFile=<file> (default port for HiveServer2 is 10000).
For an embedded server, the URL format is jdbc:hive2:///;initFile=<file> (no host or port).
When I edit the database in AWS Glue, it appears I can set a location for the client, but I'm not sure what to put there, and I didn't see any documentation on how this works.
Any thoughts?
AWS Glue is a Hive metastore, not a Hive server. A Hive server can, however, use Glue as its metastore.
https://aws.amazon.com/emr/features/hive/
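Put differently, the jdbc:hive2:// URL above would point at a HiveServer2 instance, e.g. on an EMR master node configured to use the Glue Data Catalog as its metastore, not at Glue itself. A rough Python-side sketch, assuming PyHive and a placeholder EMR master host name:

from pyhive import hive  # pip install 'pyhive[hive]'

# Connect to HiveServer2 on the EMR master node (placeholder host); the EMR
# cluster is the piece configured to use the Glue Data Catalog as its metastore.
conn = hive.connect(host="ec2-0-0-0-0.compute-1.amazonaws.com",
                    port=10000,          # default HiveServer2 port
                    database="default")
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())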
I am working on a project that has a requirement to store scientific data on AWS S3 as raw data for the beginning of a data lake. We are planning to use JSON for application data and S3 metadata to persist application metadata (JSON schema) and process metadata. At the moment, S3 is the only service from the AWS cloud that we have available to us on site.
The client would like a publish environment where they can get the raw data back as files. We would like to avoid building a custom catalog and security infrastructure.
I don't see anything about Apache Atlas that will connect directly to AWS S3, but we could put Apache Hive on top of AWS S3 and then put Apache Atlas and Ranger on top of that. I am not sure, though, whether that would let us publish the raw data from S3, or whether it even works, since Hive is more of a processing environment.
Is it possible to use Apache Atlas and Ranger on top of AWS S3 directly?
I'm new to Neo4j and am testing it on an EC2 server to see if we could use it for storing our ~1.5 nodes and their connections (currently in Redshift).
I want to load all the data from Redshift into the Neo4j DB. I also work a lot with EMR and usually store most of my data on S3.
Is there any way to include AWS authentication information when importing data into Neo4j from S3, if the S3 location isn't public? Is there any other way to do this?
Thanks
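One possible workaround, assuming boto3 on the machine preparing the import and Neo4j's LOAD CSV on the other end: generate a time-limited pre-signed URL for the private object, so Neo4j only needs a plain HTTPS URL and no AWS credentials. The bucket and key names below are placeholders.

import boto3

s3 = boto3.client("s3")

# Pre-signed GET URL for a private object; valid for one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "export/nodes.csv"},
    ExpiresIn=3600,
)

# Use the URL in Cypher, e.g.: LOAD CSV WITH HEADERS FROM '<url>' AS row ...
print(url)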
I am working on a Java MapReduce app that has to provide an upload service for pictures from the user's local machine to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The method copyFromLocalFile(..) needs a path on the local machine, which in this case will be the EC2 cluster...
I'm not sure if I stated the problem correctly; can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
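If you want to kick off such a copy programmatically, a rough boto3 sketch of adding an S3DistCp step to an existing EMR cluster might look like this; the region, cluster ID, and S3/HDFS paths are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

# Add an S3DistCp step to a running cluster; command-runner.jar invokes the
# s3-dist-cp tool that ships with EMR.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "s3://my-bucket/pictures/",
                     "--dest", "hdfs:///pictures/"],
        },
    }],
)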
You will need to get the files from the user's machine to at least one node before you will be able to use them in a MapReduce job.
The FileSystem and FileUtil functions refer to paths either on HDFS or on the local disk of one of the nodes in the cluster.
They cannot reference the user's local system. (Maybe with some SSH setup... maybe?)
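One option along those lines is to skip the cluster entirely for the upload: have the user's machine push the picture straight to S3 and let the job (or S3DistCp) pick it up from there. A minimal sketch in Python with boto3 (the AWS SDK for Java has an equivalent putObject call); the bucket and key names are placeholders.

import boto3

# Runs on the user's machine, not on the cluster: upload the local file
# directly to S3 so the EC2-hosted job can read it from there.
s3 = boto3.client("s3")
s3.upload_file("photo.jpg", "my-upload-bucket", "uploads/photo.jpg")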