Can I use Cloud Dataproc with an external Hive Metastore?

By default, Cloud Dataproc runs a Hive Metastore local to the Dataproc cluster. This means:
The metastore is ephemeral, living and dying with the cluster
It is awkward for multiple clusters to share a single metastore
Is it possible to point multiple Dataproc clusters at a single Hive metastore? Is it also possible for the metastore to live outside the cluster, so that running a cluster just to host it is not required?

Yes, this is possible - clusters can share a common metastore backed by Cloud SQL.
Cloud Dataproc clusters can use the Cloud SQL Proxy initialization action to connect to an external metastore hosted on Cloud SQL. Before using this solution, you should review its important notes.
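For illustration, a cluster creation command using the Cloud SQL Proxy initialization action might look roughly like the sketch below. The regional bucket path, the hive-metastore-instance metadata key, and all names here are assumptions based on the initialization action's documentation, so check the current docs before copying:

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --scopes=sql-admin \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh \
    --metadata="hive-metastore-instance=my-project:us-central1:my-metastore-instance"

Every cluster created against the same Cloud SQL instance then shares one Hive metastore, and the metastore survives cluster deletion.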

Related

Can you connect Amazon ElastiCache Redis to Amazon EMR PySpark?

I have been trying several solutions with custom jars from Redis Labs, and also with --packages in spark-submit on EMR, and still no success. Is there any simple way in EMR to connect to ElastiCache?

How to run MapReduce jobs on EMR Serverless?

Based on the documentation, Amazon EMR Serverless seems to accept only Spark and Hive as job drivers. Is there any support for custom Hadoop JARs for MapReduce jobs on EMR Serverless, similar to classic EMR?
That's correct. EMR Serverless currently only supports Spark and Hive jobs, so no MapReduce.

Presto + Hive Security Configuration

We have a Presto cluster alongside a Hadoop cluster, where all the Presto worker servers are installed on the data-node machines.
The following is an example of a Hive connector configuration file that is configured on the Presto workers under the catalog folder:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-node:9083
We want to know what the risks are when the access from each of the Presto workers to the Hive metastore machine is not secured.
As we understand it, Presto workers connect to the Hive metastore using the Thrift protocol on port 9083, but it is not clear how a Presto worker performs authentication against the Hive metastore.
We would appreciate more details about how Presto workers access the Hive metastore, both without Hive security and with Hive security.
reference - https://docs.starburstdata.com/302-e/connector/hive-security.html
The Hive metastore can be configured:
not to use authentication (trusting the user identity provided by the caller)
to use Kerberos authentication
Both of these modes are supported in Presto.
The basic mode (no authentication) requires no additional configuration properties.
For Kerberos authentication you need to set:
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=...
hive.metastore.client.principal=...
hive.metastore.client.keytab=...
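Putting those properties together, a Kerberos-secured Hive catalog file on a worker might look roughly like the sketch below; the principal names and keytab path are placeholders, not values taken from your environment:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-node:9083
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/metastore-node@EXAMPLE.COM
hive.metastore.client.principal=presto@EXAMPLE.COM
hive.metastore.client.keytab=/etc/presto/hive-metastore.keytab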
See full example & more at https://docs.starburstdata.com/latest/connector/hive-security.html#example-configuration-with-kerberos-authentication
If you need further help, you can get it in the #troubleshooting channel on the Trino (formerly Presto SQL) community Slack.

Fetch data from Redis using AWS Glue (Python)

I am trying to get data from Redis using AWS Glue (Python). I want to know how to connect to Redis from the Spark context. Redis is also hosted in the same AWS region.
I saw the code below on the Redis website, but I am unable to find a code sample for PySpark.
import org.apache.spark.{SparkConf, SparkContext}
import com.redislabs.provider.redis._
...
val sc = new SparkContext(new SparkConf()
  .setMaster("local")
  .setAppName("myApp")
  // initial redis host - can be any node in cluster mode
  .set("redis.host", "localhost")
  // initial redis port
  .set("redis.port", "6379")
  // optional redis AUTH password
  .set("redis.auth", "")
)
Is it possible to connect to Redis from PySpark?
Q: What data sources does AWS Glue support?
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. The metadata stored in the AWS Glue Data Catalog can be readily accessed from Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. You can also write custom Scala or Python code and import custom libraries and Jar files into your Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.
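As a concrete illustration of that last point, the sketch below shows one way to read from Redis in PySpark through the spark-redis connector's DataFrame support, assuming the com.redislabs:spark-redis jar has been imported into the Glue job as a custom JAR. The host name and the "person" table/key pattern are placeholders:

from pyspark.sql import SparkSession

# Assumes the spark-redis connector jar is on the classpath
# (e.g. imported into the Glue job as a custom JAR).
spark = (
    SparkSession.builder
    .appName("redis-read-example")
    # Redis connection details - replace with your ElastiCache endpoint
    .config("spark.redis.host", "my-redis-host.example.com")
    .config("spark.redis.port", "6379")
    # .config("spark.redis.auth", "secret")  # optional AUTH password
    .getOrCreate()
)

# Read Redis hashes stored under keys like "person:<id>" into a DataFrame.
df = (
    spark.read
    .format("org.apache.spark.sql.redis")
    .option("table", "person")
    .option("key.column", "id")
    .load()
)

df.show()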

Can we use an existing HDFS with a Cloudera cluster when it is not part of Cloudera Distributed Hadoop?

Is there a way to use an existing Apache HDFS when creating a new Cloudera cluster, and how do we connect them?