I have been trying several solutions with custom JARs from Redis Labs and also with --packages in spark-submit on EMR, and still no success. Is there any simple way in EMR to connect to ElastiCache?
Based on the documentation, Amazon EMR Serverless seems to accept only Spark and Hive as job drivers. Is there any support for custom Hadoop JARs for MapReduce jobs on Serverless, similar to EMR?
That's correct. EMR Serverless currently only supports Spark and Hive jobs, so no MapReduce.
We have a Presto cluster alongside a Hadoop cluster, where all Presto worker servers are installed on the data-node machines.
The following is an example of a Hive connector configuration file that is configured on the Presto workers under the catalog folder:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-node:9083
We want to know what the risks are when access from each of the Presto workers to the Hive metastore machine isn't secured.
As we understand it, Presto workers connect to the Hive metastore using the Thrift protocol on port 9083,
but it isn't clear how a Presto worker performs authentication against the Hive metastore.
We would appreciate more details about how Presto workers access the Hive metastore, both with and without Hive security enabled.
reference - https://docs.starburstdata.com/302-e/connector/hive-security.html
The Hive metastore can be configured:
not to use authentication (trusting the user identity provided by the caller), or
to use Kerberos authentication.
Both these modes are supported in Presto.
The basic mode (no auth) requires no additional configuration properties.
For Kerberos authentication you need to set:
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=...
hive.metastore.client.principal=...
hive.metastore.client.keytab=...
See full example & more at https://docs.starburstdata.com/latest/connector/hive-security.html#example-configuration-with-kerberos-authentication
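For illustration, a Kerberos-enabled catalog file on each worker might combine the basic Hive connector settings shown above with those properties; the principal names and keytab path below are placeholders, not values from this thread:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-node:9083
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/metastore-node@EXAMPLE.COM
hive.metastore.client.principal=presto@EXAMPLE.COM
hive.metastore.client.keytab=/etc/presto/presto.keytab

Without the Kerberos properties, the metastore simply trusts whatever user identity the worker presents over Thrift, which is the main risk of leaving that access unsecured.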
If you need further help, you can get it on #troubleshooting channel on Trino (formerly Presto SQL) community slack.
I am trying to get data from Redis using AWS Glue (Python). I want to know how to connect to Redis from the Spark context. Redis is also hosted in the same AWS region.
I saw the code below on the Redis website, but I am unable to find a code sample for PySpark.
import com.redislabs.provider.redis._
import org.apache.spark.{SparkConf, SparkContext}
...
val sc = new SparkContext(new SparkConf()
  .setMaster("local")
  .setAppName("myApp")
  // initial redis host - can be any node in cluster mode
  .set("redis.host", "localhost")
  // initial redis port
  .set("redis.port", "6379")
  // optional redis AUTH password
  .set("redis.auth", "")
)
Is it possible to connect to Redis from PySpark?
Q: What data sources does AWS Glue support?
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. The metadata stored in the AWS Glue Data Catalog can be readily accessed from Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. You can also write custom Scala or Python code and import custom libraries and Jar files into your Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.
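Building on that, the Scala snippet above uses the spark-redis connector, which also exposes a Spark DataFrame data source usable from PySpark once the spark-redis JAR is on the job's classpath (for a Glue job, attached via the dependent JARs path). A rough sketch, where the ElastiCache endpoint and the "people" table name are placeholders, not verified values:

# Minimal PySpark sketch, assuming the spark-redis JAR is attached to the Glue job.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("glue-redis-example")
    # initial Redis host - for ElastiCache this would be the primary endpoint (placeholder)
    .config("spark.redis.host", "my-elasticache-endpoint.cache.amazonaws.com")
    .config("spark.redis.port", "6379")
    # optional Redis AUTH password
    .config("spark.redis.auth", "")
    .getOrCreate())

# Read hashes stored under the "people" table name via the spark-redis DataFrame API
df = (spark.read
    .format("org.apache.spark.sql.redis")
    .option("table", "people")
    .load())
df.show()

The same spark.redis.* settings mirror the redis.host/redis.port/redis.auth keys in the Scala example; whether this works from Glue depends on the JAR being reachable and network access to the ElastiCache endpoint from the Glue job's VPC.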
Is there a way to use an existing Apache HDFS when creating a new Cloudera cluster, and how do I connect them?