HDFS over S3 / Google storage bucket translation layer - how?

I'd love to expose a Google storage bucket over HDFS to a service.
The service in question is a cluster (SOLR) that can speak only to HDFS. Given that I have no Hadoop (nor any need for it), ideally I'd like a Docker container that uses a Google storage bucket as a backend and exposes its contents via HDFS.
If possible I'd like to avoid mounts (like FUSE gcsfs). Has anyone done such a thing?
I think I could just mount gcsfs and set up a single-node cluster with HDFS, but is there a simpler / more robust way?
Any hints / directions are appreciated.

The Cloud Storage Connector for Hadoop is the tool you might need.
It is not a Docker image but rather an install. Further instructions can be found in the GitHub repository under README.md and INSTALL.md.
If the connector runs outside Google Cloud (for example on AWS), you'll need a Service Account with access to Cloud Storage, and you must set the environment variable GOOGLE_APPLICATION_CREDENTIALS to /path/to/keyfile.
To use SOLR with GCS you do indeed need a Hadoop cluster. You can get one in GCP by creating a Dataproc cluster and then using the connector mentioned above to connect your SOLR solution with GCS. For more information, check the SOLR documentation.
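As a rough illustration of what the connector install involves, the sketch below renders a core-site.xml fragment with the two Hadoop properties that register the gs:// scheme. The class names are the ones I know from the connector's documentation; the helper name render_core_site is made up for this example, and credentials are assumed to come from GOOGLE_APPLICATION_CREDENTIALS rather than from the file. Check INSTALL.md for the exact properties your connector version expects.

```python
import xml.etree.ElementTree as ET

# Properties that map the gs:// scheme to the connector classes.
# Assumption: these names match the connector version you install.
GCS_CONNECTOR_PROPERTIES = {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
}


def render_core_site(properties):
    """Render a dict of Hadoop properties as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_core_site(GCS_CONNECTOR_PROPERTIES))
```

The output would be merged into the core-site.xml of the Hadoop installation, next to the connector jar on the classpath.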

Related

How to set up AWS S3 bucket as persistent volume in on-premise k8s cluster

Since NFS has a single-point-of-failure issue, I am thinking of building a storage layer that uses S3 or Google Cloud Storage as a PersistentVolume in my local k8s cluster.
After a lot of Google searching, I still cannot find a way. I have tried using s3 fuse to mount the volume locally and then creating a PV by specifying the hostPath. However, a lot of my pods (for example airflow, jenkins) complained about having no write permission, or said "version being changed".
Could someone help me figure out the right way to mount an S3 or GCS bucket as a PersistentVolume from a local cluster, without using AWS or GCP?
S3 is not a file system and is not intended to be used in this way.
I do not recommend using S3 this way. In my experience FUSE drivers are very unstable, and under I/O load you can easily ruin your mounted disk and get stuck in a "Transport endpoint is not connected" nightmare for you and your infrastructure users. It may also lead to high CPU usage and memory leaks.
Useful crosslinks:
How to mount S3 bucket on Kubernetes container/pods?
Amazon S3 with s3fs and fuse, transport endpoint is not connected
How stable is s3fs to mount an Amazon S3 bucket as a local directory

Which s3 compatible blob storage?

I want to deploy an S3-compatible blob storage in my Kubernetes cluster. I already use GlusterFS for volumes such as mongodb, and I tried to set up minio with the helm chart https://github.com/helm/charts/tree/master/stable/minio. I just realized I can't scale up minio easily because of erasure coding.
So I have some questions about blob storage solutions:
Is the GlusterFS blob storage service stable and reliable (https://github.com/gluster/gluster-kubernetes/tree/master/docs/examples/gluster-s3-storage-template)?
Must I use OpenShift to deploy GlusterFS blob storage, as I read on the web? I think not, because I can see simple Kubernetes manifests in the GlusterFS repo like this one: https://github.com/gluster/gluster-kubernetes/blob/master/deploy/kube-templates/gluster-s3-template.yaml.
Is it easy to use Minio federation in Kubernetes? Is it easily scalable with a "helm upgrade --set replicas=X", or do I need to upgrade the minio configuration manually?
As you can see, I feel lost with this S3 storage. So if you have more information or solutions, do not hesitate.
Thanks in advance !
About reliability, you should read more about user experiences, such as:
An end user review of GlusterFS
Community Survey Feedback, 2019
Why OpenShift with GlusterFS:
For standalone Red Hat Gluster Storage, there is no component installation required to use it with OpenShift Container Platform. OpenShift Container Platform comes with a built-in GlusterFS volume driver, allowing it to make use of existing volumes on existing clusters. Note, however, that Red Hat Gluster Storage is a commercial storage software product based on Gluster.
How to deploy it in AWS
For minio please follow official docs:
ConfigMap allows injecting containers with configuration data even while a Helm release is deployed.
To update your MinIO server configuration while it is deployed in a release, you need to
Check all the configurable values in the MinIO chart using helm inspect values stable/minio.
Override the minio_server_config settings in a YAML formatted file, and then pass that file like this: helm upgrade -f config.yaml <release-name> stable/minio (helm upgrade expects the release name before the chart).
Restart the MinIO server(s) for the changes to take effect
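The steps above can be sketched as a small helper that assembles the two helm invocations. This is only an illustration: the release name my-minio and the function names are hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex

CHART = "stable/minio"  # chart name taken from the steps above


def inspect_values_command(chart=CHART):
    """Command that lists every configurable value of the chart."""
    return ["helm", "inspect", "values", chart]


def upgrade_command(release, values_file, chart=CHART):
    """Command that applies an override file to an existing release.

    helm upgrade takes the release name first, then the chart.
    """
    return ["helm", "upgrade", "-f", values_file, release, chart]


if __name__ == "__main__":
    # "my-minio" and "config.yaml" are placeholder names for this sketch.
    for cmd in (inspect_values_command(), upgrade_command("my-minio", "config.yaml")):
        print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run would execute it; restarting the MinIO server(s) afterwards remains a separate step.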
I haven't tried it, but as per the documentation:
For federation I can see additional environment variables in the values.yaml.
In addition, you should run MinIO in federated mode: see the Federation Quickstart Guide.
Here you can find the differences between Google and Amazon S3 storage,
or Cloud Storage interoperability from the gcloud perspective.
Hope this helps.

Copy objects from S3 to google cloud storage using aws-cli

Is it possible to access Google Cloud Storage using the aws CLI?
Google Cloud Platform supports copying files from S3 to Google Cloud Storage using gsutil with the following command.
gsutil -m cp -R s3://bucketname gs://bucketname
But I need to do this with aws CLI instead of gsutil.
I am not aware of any solution from the AWS side, but unless you have a special reason not to use gsutil or other Google solution, you may consider using Google Cloud Storage Transfer Service instead. This service is recommended when transferring data from Amazon S3 buckets.
Compared with simply using gsutil, or other CLI tools out there, Google Cloud Storage Transfer has several nice features like the possibility to schedule one-time or recurring transfers, where you can use advanced filters. Also, you can indicate if you want the source objects to be deleted after transferring them, and even synchronize the destination bucket with the source one, deleting existing objects if they don't have a corresponding object in the source.
You can schedule transfers from the GCP Console or using the XML and JSON API.
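For the API route, a minimal sketch of the JSON body for a transfer job from an S3 bucket to a Cloud Storage bucket might look like the following. The field names follow the Storage Transfer Service REST resource as I understand it; the function name, bucket names, project ID and credentials are placeholders, and nothing here is sent to Google.

```python
import json
from datetime import date


def build_transfer_job(project_id, s3_bucket, gcs_bucket,
                       aws_access_key_id, aws_secret_access_key,
                       start, delete_from_source=False,
                       delete_unique_in_sink=False):
    """Build the request body for creating a transfer job (not sent anywhere)."""
    # The service documents these two options as mutually exclusive.
    if delete_from_source and delete_unique_in_sink:
        raise ValueError("choose at most one of the two delete options")
    return {
        "description": f"Copy s3://{s3_bucket} to gs://{gcs_bucket}",
        "status": "ENABLED",
        "projectId": project_id,
        "schedule": {
            "scheduleStartDate": {
                "year": start.year, "month": start.month, "day": start.day,
            },
        },
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": s3_bucket,
                "awsAccessKey": {
                    "accessKeyId": aws_access_key_id,
                    "secretAccessKey": aws_secret_access_key,
                },
            },
            "gcsDataSink": {"bucketName": gcs_bucket},
            "transferOptions": {
                # Delete source objects once they have been transferred.
                "deleteObjectsFromSourceAfterTransfer": delete_from_source,
                # Delete sink objects that have no counterpart in the source.
                "deleteObjectsUniqueInSink": delete_unique_in_sink,
            },
        },
    }


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    job = build_transfer_job("example-project", "bucketname", "bucketname",
                             "AWS_KEY_ID", "AWS_SECRET", date(2030, 1, 1),
                             delete_unique_in_sink=True)
    print(json.dumps(job, indent=2))
```

The two delete flags correspond to the "delete after transfer" and "synchronize the destination" features described above.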

Can spinnaker use local storage such as mysql database?

I want to deploy Spinnaker for my team, but I have run into a problem. The Spinnaker documentation says:
Before you can deploy Spinnaker, you must configure it to use one of the supported storage types.
Azure Storage
Google Cloud Storage
Redis
S3
Can Spinnaker use local storage such as a MySQL database?
The Spinnaker microservice responsible for persisting your pipeline configs and application metadata, front50, has support for the storage systems you listed. One could add support for additional systems like MySQL by extending front50, but that support does not exist today.
Some folks have had success configuring front50 to use S3 and pointing it at a MinIO installation.

Hadoop upload files from local machine to amazon s3

I am working on a Java MapReduce app that has to be able to provide an upload service for some pictures from the local machine of the user to an S3 bucket.
The thing is the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The method copyFromLocalFile(..) needs a path from the local machine which will be the EC2 cluster...
I'm not sure if I stated the problem correctly, can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
You will need to get the files from the user's machine to at least one node before you will be able to use them through MapReduce.
The FileSystem and FileUtil functions refer to paths either on HDFS or on the local disk of one of the nodes in the cluster.
They cannot reference the user's local system. (Maybe if you did some ssh setup... maybe?)
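One way to picture the two hops is the sketch below, which only assembles the commands: first copy the files from the user's machine to a cluster node, then load them from that node's disk into the Hadoop-visible file system. The host name, directories, destination URI and function name are all hypothetical, and scp is just one assumed way of doing the first hop.

```python
import shlex


def staging_commands(local_dir, node, staging_dir, dest_uri):
    """Return (copy_to_node, copy_into_fs) as argument lists.

    copy_to_node runs on the user's machine; copy_into_fs runs on the node.
    """
    copy_to_node = ["scp", "-r", local_dir, f"{node}:{staging_dir}"]
    copy_into_fs = ["hadoop", "fs", "-copyFromLocal", staging_dir, dest_uri]
    return copy_to_node, copy_into_fs


if __name__ == "__main__":
    # Placeholder values; the URI scheme depends on your Hadoop setup.
    first, second = staging_commands("./pictures", "hadoop@node1",
                                     "/tmp/pictures",
                                     "s3a://example-bucket/pictures")
    print(shlex.join(first))
    print(shlex.join(second))
```

The second command is the shell counterpart of copyFromLocalFile(..): its source path is resolved on the node where it runs, which is why the first hop is needed.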