Connect to Minio S3 with Pentaho Data Integration CE

Is there a way to connect to Google Cloud Storage with the PDI Community version?
I have looked at VFS connections, but no connection types are listed in the drop-down while creating a new VFS connection.
In PDI 9.1, which we use, this list is completely empty.

Pentaho PDI 9.3 CE supports Amazon/Minio S3 but none of the other VFS options; those should be available in the Enterprise edition. I needed Azure Blob support and switched to Apache Hop. Well, Hop supports all kinds of VFS connections including Azure, BUT for S3 only Amazon, not Minio or other S3 implementations. So in the end I might have to fetch my S3 data with Pentaho PDI and push it to Azure with Apache Hop :-)
There is no official documentation on this.
An example Pentaho PDI VFS connection summary for S3 storage provided by NetApp StorageGRID looks like this:
General Settings:
Connection Name: yourConnectionName
Connection Type: Amazon S3 / Minio
Connection Details:
Access Key: yourAccessKey
Secret Key: yourSecretKey
Region: Default
Credential File: No
Profile Name:
Endpoint: https://s3.yourDomain
PathStyleAccess: true
Signature Version: AWSS3V4SignerType
Default S3 Connection: true
Connection Type: Minio
It took me quite some time to find the right value for the Signature Version!
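For reference, the same settings can be sanity-checked outside PDI with a short boto3 script (a sketch; the endpoint and keys below are the same placeholders as above):

import boto3
from botocore.client import Config

# Path-style access and the v4 signer mirror the PDI settings above.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.yourDomain",   # same endpoint as in PDI
    aws_access_key_id="yourAccessKey",
    aws_secret_access_key="yourSecretKey",
    config=Config(
        signature_version="s3v4",           # the AWSS3V4SignerType equivalent
        s3={"addressing_style": "path"},    # PathStyleAccess: true
    ),
)
print(s3.list_buckets()["Buckets"])         # should print your buckets

If this lists your buckets, the endpoint, credentials, and signature settings are correct, and any remaining problem is on the PDI side.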

Related

AWS Glue as Hive Datasource

I've read that AWS Glue is a Hive-compatible datastore, but I haven't found out how to use AWS Glue as a JDBC datasource.
I'd like to use the AWS Glue Catalog as a source for my reporting, as the Hive documentation shows here:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLforRemoteorEmbeddedMode
Connection URL for Remote or Embedded Mode
The JDBC connection URL format has the prefix jdbc:hive2:// and the Driver class is org.apache.hive.jdbc.HiveDriver. Note that this is different from the old HiveServer.
For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db>;initFile=<file> (default port for HiveServer2 is 10000).
For an embedded server, the URL format is jdbc:hive2:///;initFile=<file> (no host or port).
When I edit the database in AWS Glue, it appears I can set a location for the client, but I'm not sure what to put here, and I didn't see any documentation on how this works.
Any thoughts?
AWS Glue is a Hive metastore, not a Hive server.
A Hive server can, however, use Glue as its metastore.
https://aws.amazon.com/emr/features/hive/
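So the practical route is to run HiveServer2 (for example on EMR with Glue configured as the metastore) and point the JDBC connection at that, not at Glue itself. A minimal sketch using PyHive, assuming a reachable HiveServer2 on the EMR master node (hostname and username are placeholders):

from pyhive import hive

# Glue Catalog databases and tables appear as ordinary Hive objects
# once EMR is configured to use Glue as its metastore.
conn = hive.Connection(host="emr-master.example.com", port=10000,
                       username="hadoop", database="default")
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())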

How to send data to AWS S3 from Kafka using Kafka Connect without Confluent?

I have a local instance of Apache Kafka 2.0.0 and it is running very well. In my test I produce and consume data from Twitter and put it in a specific topic, twitter_tweets, and everything is OK. But now I want to consume the topic twitter_tweets with Kafka Connect, using the Kafka Connect S3 connector, and obviously store the data in AWS S3 without using the Confluent CLI.
Can I do this without Confluent? Does anyone have an example or something to help me?
without Confluent
The S3 sink connector is open source; so is Apache Kafka Connect.
The Connect framework is not specific to Confluent.
You may use the Kafka Connect Docker image, for example, or you may use confluent-hub to install the S3 connector on your own Kafka Connect installation.
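For example, once the S3 connector jars are on the worker's plugin.path (confluent-hub can install them), you can create the connector through the plain Kafka Connect REST API. A sketch assuming a distributed worker on localhost:8083; the bucket name, region, and flush size are placeholders:

import json
import requests

connector = {
    "name": "s3-sink-twitter",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics": "twitter_tweets",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",   # records per S3 object
    },
}
resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
print(resp.status_code, resp.json())

The worker picks up AWS credentials from the usual places (environment variables, ~/.aws/credentials, or an instance profile).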

Which S3-compatible blob storage?

I want to deploy an S3-compatible blob storage service in my Kubernetes cluster. I already use GlusterFS for volumes like MongoDB's, and I tried to set up MinIO with the Helm chart https://github.com/helm/charts/tree/master/stable/minio. I just realized I can't scale MinIO up easily because of erasure coding.
So I have some questions about blob storage solutions :
Is the GlusterFS blob storage service stable and reliable (https://github.com/gluster/gluster-kubernetes/tree/master/docs/examples/gluster-s3-storage-template)?
Must I use OpenShift to deploy GlusterFS blob storage, as I read on the web? I think not, because I can see plain Kubernetes manifests in the GlusterFS repo, like this one: https://github.com/gluster/gluster-kubernetes/blob/master/deploy/kube-templates/gluster-s3-template.yaml.
Is it easy to use MinIO federation in Kubernetes? Is it easily scalable with a "helm upgrade --set replicas=X", or do I need to update the MinIO configuration manually?
As you can see, I feel lost with this S3 storage, so if you have more information or solutions, do not hesitate.
Thanks in advance!
Regarding reliability, you should read more about user experiences, such as:
An end user review of GlusterFS
Community Survey Feedback, 2019
Why OpenShift with GlusterFS:
For standalone Red Hat Gluster Storage, there is no component installation required to use it with OpenShift Container Platform. OpenShift Container Platform comes with a built-in GlusterFS volume driver, allowing it to make use of existing volumes on existing clusters. Note that Red Hat Gluster Storage is a commercial storage software product based on Gluster.
How to deploy it in AWS
For MinIO, please follow the official docs:
A ConfigMap allows injecting configuration data into containers even while a Helm release is deployed.
To update your MinIO server configuration while it is deployed in a release, you need to:
Check all the configurable values in the MinIO chart using helm inspect values stable/minio.
Override the minio_server_config settings in a YAML-formatted file, and then pass that file like this: helm upgrade -f config.yaml stable/minio.
Restart the MinIO server(s) for the changes to take effect.
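Put together, the steps might look like this (a sketch; my-minio is a placeholder release name):

# 1. See what can be configured
helm inspect values stable/minio > values.yaml
# 2. Put your minio_server_config overrides into config.yaml, then
helm upgrade my-minio stable/minio -f config.yaml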
I didn't try it, but as per the documentation:
For federation, I can see additional environment variables in the values.yaml.
In addition, you should run MinIO in federated mode; see the Federation Quickstart Guide.
Here you can find the differences between Google and Amazon S3 storage,
or Cloud Storage interoperability from the gcloud perspective.
Hope this helps.

Fetch data from redis using AWS Glue (python)

I am trying to get data from Redis using AWS Glue (Python). I want to know how to connect to Redis from a Spark context. Redis is also hosted in the same AWS region.
I saw this code on the Redis website, but was unable to find a code sample for PySpark.
import com.redislabs.provider.redis._
...
val sc = new SparkContext(new SparkConf()
.setMaster("local")
.setAppName("myApp")
// initial redis host - can be any node in cluster mode
.set("redis.host", "localhost")
// initial redis port
.set("redis.port", "6379")
// optional redis AUTH password
.set("redis.auth", "")
)
Is it possible to connect to Redis from PySpark?
Q: What data sources does AWS Glue support?
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. The metadata stored in the AWS Glue Data Catalog can be readily accessed from Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. You can also write custom Scala or Python code and import custom libraries and Jar files into your Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.
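That last option is the relevant one here: the spark-redis package exposes a DataFrame reader that also works from PySpark, so one approach is to attach the spark-redis jar (and its Jedis dependency) to the Glue job and read through it. A hedged sketch; the host, table name, and key column are placeholders, and in Glue the jars would go in the job's --extra-jars parameter:

from pyspark.sql import SparkSession

# spark-redis reads its connection settings from the Spark config.
spark = (SparkSession.builder
         .appName("glue-redis-read")
         .config("spark.redis.host", "my-redis.example.com")
         .config("spark.redis.port", "6379")
         .config("spark.redis.auth", "")            # optional AUTH password
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.redis")
      .option("table", "mytable")                   # Redis key prefix
      .option("key.column", "key")
      .load())
df.show()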

Can Spinnaker use local storage such as a MySQL database?

I want to deploy Spinnaker for my team, but I have encountered a problem. The Spinnaker documentation says:
Before you can deploy Spinnaker, you must configure it to use one of the supported storage types.
Azure Storage
Google Cloud Storage
Redis
S3
Can Spinnaker use local storage such as a MySQL database?
The Spinnaker microservice responsible for persisting your pipeline configs and application metadata, front50, has support for the storage systems you listed. One could add support for additional systems like MySQL by extending front50, but that support does not exist today.
Some folks have had success configuring front50 to use S3 and pointing it at a MinIO installation.
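A sketch of that MinIO-backed setup with Halyard; the endpoint, bucket, and access key are placeholders, and --secret-access-key prompts for the secret interactively:

hal config storage s3 edit \
    --endpoint http://minio.example.com:9000 \
    --access-key-id yourAccessKey \
    --secret-access-key \
    --bucket spinnaker \
    --path-style-access true
hal config storage edit --type s3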