I've read that AWS Glue is a Hive-compatible datastore, but I haven't found how to use AWS Glue as a JDBC data source.
I'd like to use the AWS Glue Data Catalog as a source for my reporting, as the Hive documentation shows here -
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLforRemoteorEmbeddedMode
Connection URL for Remote or Embedded Mode
The JDBC connection URL format has the prefix jdbc:hive2:// and the Driver class is org.apache.hive.jdbc.HiveDriver. Note that this is different from the old HiveServer.
For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db>;initFile=<file> (default port for HiveServer2 is 10000).
For an embedded server, the URL format is jdbc:hive2:///;initFile=<file> (no host or port).
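For illustration only (not part of the original question), here is a minimal Python sketch of opening such a remote-mode connection, assuming the jaydebeapi package, a local copy of the Hive JDBC standalone jar, and a reachable HiveServer2 endpoint; the host, credentials, and jar path are placeholders:

import jaydebeapi

# Driver class and URL format are the ones quoted from the Hive docs above;
# host, database, credentials, and jar path are placeholders.
conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://my-hiveserver2-host:10000/default",
    ["my_user", "my_password"],
    "/path/to/hive-jdbc-standalone.jar",
)
curs = conn.cursor()
curs.execute("SHOW TABLES")
print(curs.fetchall())
curs.close()
conn.close()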
When I edit a database in AWS Glue, it appears I can set a location for the client, but I'm not sure what to put there, and I didn't see any documentation on how this works.
Any thoughts?
AWS Glue is a Hive metadata store (metastore), not a Hive server.
A Hive server can, however, use Glue as its metastore.
https://aws.amazon.com/emr/features/hive/
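To make that concrete, here is a hedged Python sketch (my addition, not part of the original answer) of launching an EMR cluster whose Hive uses the Glue Data Catalog as its metastore via the documented hive-site classification; the region, cluster name, release label, instance settings, and IAM roles are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region
emr.run_job_flow(
    Name="hive-on-glue",                 # placeholder cluster name
    ReleaseLabel="emr-6.9.0",            # placeholder EMR release
    Applications=[{"Name": "Hive"}],
    Configurations=[
        {
            # classification that points Hive at the Glue Data Catalog
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1}
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # placeholder instance profile
    ServiceRole="EMR_DefaultRole",       # placeholder service role
)

HiveServer2 on such a cluster can then be reached over JDBC using the jdbc:hive2:// URL format quoted in the question.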
Is there a way to connect to Google Cloud Storage with the PDI Community version?
I have looked at VFS connections, but no connection types are listed in the drop-down while creating a new VFS connection.
In PDI 9.1, which we use, this list is completely empty.
Pentaho PDI 9.3 CE supports Amazon/Minio S3 but none of the other VFS options; they should be available in Enterprise. I needed Azure Blob support and switched to Apache Hop. Well, Hop supports all kinds of VFS connections including Azure, BUT only Amazon S3, no Minio or other S3 implementations. So in the end I might have to get my S3 data with Pentaho PDI and push it to Azure with Apache Hop :-)
There is no official documentation on this.
An example Pentaho PDI VFS connection summary for S3 storage provided by NetApp StorageGRID looks like this:
General Settings:
Connection Name: yourConnectionName
Connection Type: Amazon S3 / Minio
Connection Details:
Access Key: yourAccessKey
Secret Key: yourSecretKey
Region: Default
Credential File: No
Profile Name:
Endpoint: https://s3.yourDomain
PathStyleAccess: true
Signature Version: AWSS3V4SignerType
Default S3 Connection: true
Connection Type: Minio
Took me quite some time to find out the value for the Signature Version!
I am working on a project that has a requirement to store scientific data on AWS S3 as raw data for the beginning of a data lake. We are planning to use JSON for application data and S3 metadata to persist application metadata (JSON Schema) and process metadata. At the moment, S3 is the only service from the AWS cloud that we have available to us on site.
The client would like a publish environment where they can get the raw data back as files. We would like to avoid building a custom catalog and security infrastructure.
I don't see anything indicating that Apache Atlas can connect directly to AWS S3. We could put Apache Hive on top of AWS S3 and then put Apache Atlas and Ranger on top of that, but I'm not sure whether that would let us publish the raw data from S3, or whether it even works, since Hive is more of a processing environment.
Is it possible to use Apache Atlas and Ranger directly on top of AWS S3?
We have a Presto cluster alongside a Hadoop cluster, where all Presto worker servers are installed on the data-node machines.
The following is an example of the Hive connector configuration file that is configured on the Presto workers under the catalog folder:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-node:9083
We want to know what the risks are when access from each of the Presto workers to the Hive metastore machine isn't secured.
As we understand it, Presto workers connect to the Hive metastore using the Thrift protocol on port 9083,
but it isn't clear to us how a Presto worker performs authentication against the Hive metastore.
We would appreciate more details about how Presto workers access the Hive metastore, both without Hive security and with Hive security.
Reference - https://docs.starburstdata.com/302-e/connector/hive-security.html
The Hive metastore can be configured:
not to use authentication (trusting the user identity provided by the caller), or
to use Kerberos authentication.
Both these modes are supported in Presto.
The basic mode (no auth) requires no additional configuration properties.
For Kerberos authentication you need to set:
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=...
hive.metastore.client.principal=...
hive.metastore.client.keytab=...
See full example & more at https://docs.starburstdata.com/latest/connector/hive-security.html#example-configuration-with-kerberos-authentication
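Putting this together with the catalog file from the question, a Kerberos-secured catalog properties file on each worker would look roughly like the sketch below; the principals, keytab path, and metastore host are placeholders of mine, not values from the original posts:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-node:9083
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/metastore-node.example.com@EXAMPLE.COM
hive.metastore.client.principal=presto@EXAMPLE.COM
hive.metastore.client.keytab=/etc/presto/presto.keytab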
If you need further help, you can get it in the #troubleshooting channel on the Trino (formerly Presto SQL) community Slack.
I installed and ran the Metastore Server standalone, without installing Hive. However, I cannot find any documentation about the thrift network API for communicating with the server. I need to be able to connect to the Metastore server directly or through HCatalog. Please advise.
There is an HCatalog Java client in hive-webhcat-java-client, which can be used in both client mode (connecting to the HCatalog Thrift server) and embedded mode (doing everything internally, connecting to MySQL directly).
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.hcatalog.api.HCatClient;
HiveConf hiveConf = new HiveConf();
// load the local hive-site.xml by file path (addResource(String) would look it up on the classpath)
hiveConf.addResource(new Path("/Users/tzp/Documents/env/apache-hive-3.1.2-bin/conf/hive-site.xml"));
// if you set this property, the client tries to connect to an external Hive metastore
hiveConf.set("metastore.thrift.uris", "thrift://localhost:9083");
HCatClient client = HCatClient.create(new Configuration(hiveConf));
List<String> dbNames = client.listDatabaseNamesByPattern("*");
System.out.println(dbNames);
I don't think Hive provides a similar client in Python, but there is a third-party lib, hmsclient, that does the same thing.
from hmsclient import hmsclient
client = hmsclient.HMSClient(host='localhost', port=9083)
with client as c:
    c.check_for_named_partition('db', 'table', 'date=20180101')
HCatalog is functionally identical to Hive Metastore.
The JavaDoc for "Hive Metastore client" and its API (branch 1.x) is available at
https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.html
https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/metastore/api/package-summary.html
Now, good luck finding a tutorial or just code snippets...
I am trying to get data from Redis using AWS Glue (Python). I want to know how to connect to Redis from the Spark context. Redis is also hosted in the same AWS region.
I saw this code on the Redis website, but I am unable to find a code sample for PySpark.
import com.redislabs.provider.redis._
...
val sc = new SparkContext(new SparkConf()
.setMaster("local")
.setAppName("myApp")
// initial redis host - can be any node in cluster mode
.set("redis.host", "localhost")
// initial redis port
.set("redis.port", "6379")
// optional redis AUTH password
.set("redis.auth", "")
)
Is it possible to connect to Redis from PySpark?
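One way to do this (a sketch of mine, not from the original post) is through spark-redis's DataFrame support, which is usable from PySpark as long as the spark-redis jar is on the job's classpath (for AWS Glue, e.g. supplied through the job's dependent JARs / --extra-jars setting). The host, port, and table name below are placeholders:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("myApp")
    # initial Redis host - can be any node in cluster mode
    .config("spark.redis.host", "my-redis-host")
    # initial Redis port
    .config("spark.redis.port", "6379")
    # optional Redis AUTH password
    .config("spark.redis.auth", "")
    .getOrCreate()
)

# Read the Redis hashes stored under the "myTable" keyspace into a DataFrame
df = (
    spark.read
    .format("org.apache.spark.sql.redis")
    .option("table", "myTable")
    .load()
)
df.show()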
Q: What data sources does AWS Glue support?
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. The metadata stored in the AWS Glue Data Catalog can be readily accessed from Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. You can also write custom Scala or Python code and import custom libraries and Jar files into your Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.