My Requirements:
1. Users will run SQL queries through Apache NiFi against data in Amazon S3.
Is it possible to integrate NiFi with Amazon Athena?
You should be able to integrate Apache NiFi and Amazon Athena easily. NiFi's ability to plug in JDBC drivers and reuse that connection context in many places helps greatly here. See https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html for the Athena JDBC drivers, and https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-dbcp-service-nar/1.5.0/org.apache.nifi.dbcp.DBCPConnectionPool/index.html for NiFi's DBCP facilities.
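For illustration, a DBCPConnectionPool controller service configured for Athena might use settings along these lines (the region, S3 output bucket, and driver path are placeholders, and the driver class name varies by driver version, so verify both against the Athena JDBC docs above):

Database Connection URL: jdbc:awsathena://athena.us-east-1.amazonaws.com:443;S3OutputLocation=s3://my-athena-results/
Database Driver Class Name: com.simba.athena.jdbc.Driver
Database Driver Location(s): /opt/nifi/drivers/AthenaJDBC42.jar
Database User: <AWS access key ID>
Password: <AWS secret access key>

An ExecuteSQL processor pointed at that controller service can then issue the queries.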
You should also be able to do it using a combination of ExecuteStreamCommand and the AWS CLI; the CLI has the ability to issue Athena queries.
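As a rough sketch (the database, table, and bucket names are placeholders), the command that ExecuteStreamCommand invokes could look like:

aws athena start-query-execution \
  --query-string "SELECT * FROM my_db.my_table LIMIT 10" \
  --query-execution-context Database=my_db \
  --result-configuration OutputLocation=s3://my-athena-results/

The call returns a QueryExecutionId; status can then be polled with aws athena get-query-execution and results fetched with aws athena get-query-results, or picked up directly from the S3 output location.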
How would I tackle the following on AWS, or is it not possible?
Transient EMR Cluster for some bulk Spark processing
When that cluster terminates, then and only then use a Glue Job to do some limited processing
I am not convinced AWS Glue triggers will help across these two environments.
Or would one say this is simply not a good use case, and the limited processing should just stay on the EMR cluster? For context, Glue can write to SAP HANA with the appropriate connector, and loading Redshift from a Glue job via Redshift Spectrum is a common use case.
You can use "Run a job" service integration using AWS Step Functions. Step functions supports both EMR and Glue integration.
Please refer to the link for details.
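As a rough sketch of that pattern (the cluster ID, script location, and job name are placeholders), a state machine can run the EMR work synchronously and start the Glue job only after it completes:

{
  "StartAt": "RunSparkStep",
  "States": {
    "RunSparkStep": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "Step": {
          "Name": "bulk-spark-processing",
          "ActionOnFailure": "TERMINATE_CLUSTER",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/bulk_job.py"]
          }
        }
      },
      "Next": "RunGlueJob"
    },
    "RunGlueJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "limited-post-processing"
      },
      "End": true
    }
  }
}

The .sync suffix on the resource ARNs makes Step Functions wait for each task to finish, which gives the "then and only then" ordering without Glue triggers.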
Having spoken to Amazon about this, I can say they now indicate that Airflow via MWAA is the preferred option.
We have a Presto cluster alongside a Hadoop cluster, with all Presto worker servers installed on the data-node machines.
The following is an example of the Hive connector configuration file that is configured on the Presto workers under the catalog folder:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-node:9083
We want to know the risks when access from each of the Presto workers to the Hive metastore machine is not secured.
As we understand it, Presto workers connect to the Hive metastore using the Thrift protocol on port 9083, but it is not clear how a Presto worker performs authentication against the Hive metastore.
We would appreciate more details on how Presto workers access the Hive metastore, both without Hive security and with it.
reference - https://docs.starburstdata.com/302-e/connector/hive-security.html
The Hive metastore can be configured:
not to use authentication (trust user identity provided by the caller)
to use Kerberos authentication.
Both these modes are supported in Presto.
The basic mode (no auth) requires no additional configuration properties.
For Kerberos authentication you need to set:
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=...
hive.metastore.client.principal=...
hive.metastore.client.keytab=...
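For illustration, a complete catalog file with Kerberos enabled might look like this (the principals and keytab path are placeholders, not values from your environment):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-node:9083
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/metastore-node@EXAMPLE.COM
hive.metastore.client.principal=presto@EXAMPLE.COM
hive.metastore.client.keytab=/etc/presto/presto.keytab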
See full example & more at https://docs.starburstdata.com/latest/connector/hive-security.html#example-configuration-with-kerberos-authentication
If you need further help, you can get it in the #troubleshooting channel on the Trino (formerly Presto SQL) community Slack.
I am trying to get data from Redis using AWS Glue (Python), and I want to know how to connect to Redis from the Spark context. Redis is hosted in the same AWS region.
I saw the following code on the Redis website, but I am unable to find a code sample for PySpark.
import com.redislabs.provider.redis._
...
val sc = new SparkContext(new SparkConf()
  .setMaster("local")
  .setAppName("myApp")
  // initial redis host - can be any node in cluster mode
  .set("redis.host", "localhost")
  // initial redis port
  .set("redis.port", "6379")
  // optional redis AUTH password
  .set("redis.auth", "")
)
Is it possible to connect to Redis from PySpark?
Q: What data sources does AWS Glue support?
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. The metadata stored in the AWS Glue Data Catalog can be readily accessed from Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. You can also write custom Scala or Python code and import custom libraries and Jar files into your Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.
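Following that route, here is a minimal PySpark sketch, assuming the spark-redis JAR (and its dependencies) has been attached to the Glue job as a custom library; the ElastiCache hostname and table name are placeholders:

from pyspark.sql import SparkSession

# Point the session at the Redis endpoint (placeholder hostname).
spark = (SparkSession.builder
         .appName("glue-redis-example")
         .config("spark.redis.host", "my-redis.example.use1.cache.amazonaws.com")
         .config("spark.redis.port", "6379")
         # optional redis AUTH password
         .config("spark.redis.auth", "")
         .getOrCreate())

# Read a Redis "table" (hashes written by spark-redis) as a DataFrame.
df = (spark.read
      .format("org.apache.spark.sql.redis")
      .option("table", "person")
      .option("key.column", "id")
      .load())

df.show()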
I am looking for a native offering, such as any of the RDS solutions, ElastiCache, or Amazon Redshift, not something that I would have to host myself.
From the Apache Kudu website, https://kudu.apache.org/:
Kudu provides a combination of fast inserts/updates and efficient columnar
scans to enable multiple real-time analytic workloads across a single storage
layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the
flexibility to address a wider variety of use cases without exotic workarounds.
As I understand it, Kudu is a columnar distributed storage engine for tabular data that allows for fast scans and ad-hoc analytical queries but ALSO allows for random updates and inserts. Every table has a primary key that you can use to find and update single records...
Second answer, after the question was revised.
The answer is Amazon EMR running Apache Kudu.
Amazon EMR is Amazon's service for Hadoop. Apache Kudu is a package that you install on Hadoop along with many others to process "Big Data".
If you are looking for a managed service for Apache Kudu alone, there is none. Apache Kudu is an open-source tool that sits on top of Hadoop and is a companion to Apache Impala. On AWS, both require Amazon EMR running Hadoop version 2.x or greater.
We plan to use Amazon RDS and are currently using Mule Standalone. Can we use Amazon RDS with Mule and deploy on standalone? I ask because the few examples I found online pair CloudHub with Amazon RDS. Hopefully it shouldn't matter, but I'm confused.
Is there any specific configuration difference when using Amazon RDS with CloudHub versus with standalone?
What are the pros and cons of using Amazon RDS with CloudHub versus with standalone?
Is there a specific connector for Amazon RDS, as there is for SQS?
Any help clarifying these doubts would be appreciated.
If you have already used this setup, please let me know anything I need to keep in mind during development.
It would also help if you could point me to relevant URLs. Thanks in advance.
Functionally, there is no difference between using it with CloudHub or Mule Standalone. Just use the Database connector and set up the correct JDBC properties: https://docs.mulesoft.com/mule-user-guide/v/3.8/
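For example, a hypothetical Mule 3 configuration against a MySQL RDS instance (the endpoint, credentials, and query are placeholders) only needs the Database connector pointed at the RDS hostname:

<db:mysql-config name="MySQL_RDS" host="mydb.example.us-east-1.rds.amazonaws.com"
    port="3306" user="admin" password="secret" database="mydb"
    doc:name="MySQL Configuration"/>

<flow name="queryRdsFlow">
    <poll doc:name="Poll">
        <db:select config-ref="MySQL_RDS" doc:name="Select rows">
            <db:parameterized-query><![CDATA[SELECT * FROM orders]]></db:parameterized-query>
        </db:select>
    </poll>
    <logger level="INFO" message="#[payload]" doc:name="Log rows"/>
</flow>

The same configuration works whether the application is deployed to CloudHub or to a standalone runtime, provided the runtime can reach the RDS endpoint over the network.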