Can you connect Amazon ElastiCache Redis to Amazon EMR PySpark? - amazon-emr

I have been trying several solutions, using the custom jars from Redis Labs and also --packages with spark-submit on EMR, and still no success. Is there any simple way in EMR to connect to ElastiCache?
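For reference, the approach usually suggested is to ship the spark-redis package with the job and point it at the ElastiCache primary endpoint. A minimal sketch, assuming a hypothetical endpoint and package version (the EMR cluster's security group must allow port 6379 to the ElastiCache cluster, and cluster-mode-enabled or in-transit-encrypted configurations need extra care):

spark-submit --packages com.redislabs:spark-redis_2.12:3.1.0 my_job.py

from pyspark.sql import SparkSession

# Hypothetical ElastiCache primary endpoint.
spark = (SparkSession.builder
    .appName("elasticache-example")
    .config("spark.redis.host", "my-redis.xxxxxx.0001.use1.cache.amazonaws.com")
    .config("spark.redis.port", "6379")
    .getOrCreate())

# spark-redis exposes Redis hashes as a DataFrame source, so no Scala
# imports are needed from PySpark.
df = (spark.read
    .format("org.apache.spark.sql.redis")
    .option("table", "person")      # reads hashes whose keys start with "person:"
    .option("key.column", "id")
    .load())
df.show()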

Related

How to write from EMR PySpark to local HDFS and later copy to S3?

I am using a transient Spark cluster on EMR, currently on Spark 3.0.1. Which commands should I use to write data into HDFS on EMR and later copy it over to an S3 bucket?
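A common pattern, sketched below with placeholder paths (my-bucket and the HDFS path are assumptions), is to write with Spark to an hdfs:// URI and then copy with s3-dist-cp, which ships with EMR:

from pyspark.sql import SparkSession
import subprocess

spark = SparkSession.builder.appName("hdfs-then-s3").getOrCreate()
df = spark.range(1000)  # stand-in for the real dataset

# 1) Write to the cluster-local HDFS first.
df.write.mode("overwrite").parquet("hdfs:///tmp/output/")

# 2) Bulk-copy HDFS -> S3 afterwards. s3-dist-cp is preinstalled on EMR;
#    invoking it from the driver here is a sketch, and it is often added
#    as its own step in the transient cluster's step list instead.
subprocess.run(
    ["s3-dist-cp", "--src", "hdfs:///tmp/output/", "--dest", "s3://my-bucket/output/"],
    check=True,
)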

How to run MapReduce jobs on EMR Serverless?

Based on the documentation, Amazon EMR Serverless seems to accept only Spark and Hive as job drivers. Is there any support for custom Hadoop jars for MapReduce jobs on EMR Serverless, similar to EMR?
That's correct. EMR Serverless currently supports only Spark and Hive jobs, so there is no MapReduce support.

Scheduling over different AWS Components - Glue and EMR

I was wondering how I would tackle the following on AWS, or whether it is even possible:
Transient EMR Cluster for some bulk Spark processing
When that cluster terminates, then and only then use a Glue Job to do some limited processing
I am not convinced AWS Glue Triggers will help across different services like this.
Or would one say this is not a good use case, and to just stay in the EMR cluster? Glue can write to SAP HANA with the appropriate connector, and loading Redshift from a Glue job via Redshift Spectrum is a common use case.
You can use "Run a job" service integration using AWS Step Functions. Step functions supports both EMR and Glue integration.
Please refer to the link for details.
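As a sketch of what that could look like (all cluster IDs, job names, and ARNs below are placeholders): the .sync suffix on each task's resource ARN is what makes Step Functions wait for the EMR work to finish before the Glue state starts.

import json
import boto3

# Hypothetical two-state machine: run a Spark step on EMR, then start the
# Glue job only once that step has completed.
definition = {
    "StartAt": "RunEmrStep",
    "States": {
        "RunEmrStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId": "j-XXXXXXXXXXXXX",
                "Step": {
                    "Name": "bulk-spark-processing",
                    "ActionOnFailure": "TERMINATE_CLUSTER",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://my-bucket/job.py"],
                    },
                },
            },
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my-glue-job"},
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="emr-then-glue",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder
)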
Having spoken to Amazon on this aspect, I can report that they now indicate Airflow via MWAA as the preferred option.

Fetch data from Redis using AWS Glue (Python)

I am trying to get data from Redis using AWS Glue (Python). I want to know how to connect to Redis from the Spark context. Redis is also hosted in the same AWS region.
I saw the code below on the Redis website, but I am unable to find a code sample for PySpark.
import com.redislabs.provider.redis._
import org.apache.spark.{SparkConf, SparkContext}
...
val sc = new SparkContext(new SparkConf()
  .setMaster("local")
  .setAppName("myApp")
  // initial redis host - can be any node in cluster mode
  .set("redis.host", "localhost")
  // initial redis port
  .set("redis.port", "6379")
  // optional redis AUTH password
  .set("redis.auth", "")
)
Is it possible to connect to Redis from PySpark?
Q: What data sources does AWS Glue support?
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. The metadata stored in the AWS Glue Data Catalog can be readily accessed from Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. You can also write custom Scala or Python code and import custom libraries and Jar files into your Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.
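Building on that FAQ answer, it is possible from PySpark: attach the spark-redis jar to the Glue job (for example via the job's dependent-jars path) and use its DataFrame source, mirroring the Scala snippet above without any Scala imports. A minimal sketch with a placeholder host (the Glue job also needs a VPC connection that can reach Redis):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Same settings as the Scala snippet, expressed from Python; the host is
# a placeholder for an endpoint reachable from the Glue job's VPC.
conf = (SparkConf()
    .set("spark.redis.host", "redis.example.internal")
    .set("spark.redis.port", "6379")
    .set("spark.redis.auth", ""))   # optional AUTH password

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# With the spark-redis jar on the classpath, Redis hashes can be read
# directly as a DataFrame.
df = (spark.read
    .format("org.apache.spark.sql.redis")
    .option("table", "person")      # keys with the "person:" prefix
    .option("key.column", "id")
    .load())
df.show()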

How to integrate Apache Nifi with Amazon Athena?

My Requirements:
1. Users will run SQL queries through Apache NiFi against data in Amazon S3.
Is it possible to achieve NiFi integration with Amazon Athena?
You should be able to integrate Apache NiFi and Amazon Athena easily. NiFi's ability to plug in JDBC drivers and reuse that context in many areas helps greatly here. See https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html for information on the JDBC drivers for Athena, and https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-dbcp-service-nar/1.5.0/org.apache.nifi.dbcp.DBCPConnectionPool/index.html for NiFi's DBCP facilities.
You should also be able to do it by combining an ExecuteStreamCommand processor with the AWS CLI, which can issue Athena queries.
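For reference, the CLI call that ExecuteStreamCommand would wrap (aws athena start-query-execution) maps to the boto3 calls below. A minimal sketch with placeholder database, table, and results-bucket names, noting that Athena queries are asynchronous and must be polled:

import time
import boto3

athena = boto3.client("athena")

# Kick off the query; results are written to the given S3 location.
resp = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
print(query_id, state)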