Can an Apache Kafka cluster work with S3 storage? - amazon-s3

We have a production Kafka cluster running Apache Kafka version 2.8.
The cluster consists of 21 physical machines.
Each machine uses internal disks with an XFS filesystem, the disks are configured in RAID 10, and the total storage per machine is 25 TB.
The OS on each Kafka machine is RHEL 7.9.
Until now everything has been working fine.
Recently a customer asked about moving to S3 storage,
and we want to understand whether any Apache Kafka version can work with S3 storage.
I should mention that the customer doesn't care about the existing topic data on the current disks,
so it would be like installing a Kafka cluster from scratch with S3 storage.
Docs / related links:
https://kafka-connect-fs.readthedocs.io/en/latest/connector.html
Can Amazon S3 act as Source to Kafka Cluster?

No. "Tiered Storage" is the feature you're asking for, and it is not available in open-source Apache Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
You can use Kafka Connect (or alternative tooling) to consume and write topics to S3, but Kafka itself still uses local disks for storage.
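As a rough sketch, a minimal standalone S3 sink connector configuration could look like the following; the connector name, topic, bucket, and region are placeholders, and this assumes the Confluent S3 sink connector plugin and AWS credentials are available on the Connect worker (a fuller real-world configuration appears in one of the related questions below):
```
# Minimal sketch; connector name, topic, bucket and region are placeholders.
name=s3-archive-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=my-topic
s3.bucket.name=my-kafka-archive
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
# Commit an S3 object after every 1000 records per topic partition.
flush.size=1000
```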

Related

EC2 VM running Standalone Confluent S3 sink connector has a difference of 36 GB between NetworkIn & NetworkOut values

I have an EC2 VM running Confluent's S3 sink connector in standalone mode to benchmark an MSK Serverless cluster.
Network in for this VM over 30 minutes = 112.4 GB
Network out for this VM over 30 minutes = 75.8 GB
S3 sink size over 30 minutes = 74 GB
I'm unable to explain the difference of 36.6 GB between what the VM is ingesting from the MSK Serverless cluster and what it is persisting to the S3 bucket.
The VM is an m5.4xlarge instance with a 56 GB heap and 40% CPU utilization over the course of the run, so it can't be a lack of compute or memory capacity. This process is also the sole tenant of the VM, and I'm using SSH from my local machine to start and stop the connector on the EC2 instance.
The data is being produced by the Confluent Datagen connector running on a separate VM in standalone mode with the same specs. The NetworkOut of the producer VM matches the NetworkIn of this S3 sink VM.
The bucket is in the same region as the EC2 instance and the MSK Serverless cluster, and I'm even using an S3 gateway endpoint.
The topic the connector reads from has 100 partitions, a replication factor of 3, and 2 in-sync replicas. My consumer lag stats are:
SumOffsetLag = 1.15M
EstimatedMaxTimeLag = 18.5 s
MaxOffsetLag = 37.7K
This is the configuration I'm using for the S3 sink connector:
```
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=50000
rotate.interval.ms=-1
rotate.schedule.interval.ms=-1
s3.credentials.provider.class=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
storage.class=io.confluent.connect.s3.storage.S3Storage
schema.compatibility=NONE
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
consumer.override.security.protocol=SASL_SSL
consumer.override.sasl.mechanism=AWS_MSK_IAM
consumer.override.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
consumer.override.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
schemas.enable=false
connector.class=io.confluent.connect.s3.S3SinkConnector
time.interval=HOURLY
```

How to set up AWS S3 bucket as persistent volume in on-premise k8s cluster

Since NFS has a single point of failure, I am thinking of building a storage layer using S3 or Google Cloud Storage as a PersistentVolume in my local k8s cluster.
After a lot of Google searching, I still cannot find a way to do it. I have tried using s3fs (FUSE) to mount the bucket locally and then creating a PV by specifying the hostPath. However, a lot of my pods (for example Airflow, Jenkins) complained about missing write permissions or reported errors like "version being changed".
Could someone help me figure out the right way to mount an S3 or GCS bucket as a PersistentVolume from an on-premise cluster, without using AWS or GCP?
S3 is not a file system and is not intended to be used in this way.
I do not recommend using S3 this way because, in my experience, FUSE drivers are very unstable; under real I/O load you can easily end up with a broken mount and get stuck in a "Transport endpoint is not connected" nightmare for you and your infrastructure users. It can also lead to high CPU usage and memory leaks.
Useful crosslinks:
How to mount S3 bucket on Kubernetes container/pods?
Amazon S3 with s3fs and fuse, transport endpoint is not connected
How stable is s3fs to mount an Amazon S3 bucket as a local directory

HSQLDB on S3 Compatible Service

We use HSQLDB as a file-based database, as our application's requirements for an RDBMS are minimal. We would now like to move this application to Pivotal Cloud Foundry, where S3-compatible storage is the only service comparable to the "filesystem" we had on physical machines.
So if we move our current HSQLDB files to S3, we would not be able to make a direct JDBC connection to the HSQLDB "file" (since accessing S3 objects requires authentication, etc.).
Has anyone faced such a situation before? Are there ways to use HSQLDB with S3 as the storage medium?
Thanks,
Midhun
Pivotal Cloud Foundry allows you to bind volume mounts to your cf push-ed applications. Thanks to the NFS volume service (see cf marketplace -s), you can bind volume mounts to your application with the usual cf create-service and cf bind-service commands. Then your HSQLDB files must be written under the filesystem directory where the NFS volume is mounted.
This could be a handy solution for running your app in Cloud Foundry with persistent filesystem storage for your HSQLDB database.
Default PCF installations provide such a mount from an NFS server. See the NFS volumes documentation and, especially for your PCF operator, the instructions for enabling this feature.
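As an illustration, here is a minimal sketch of opening a file-mode HSQLDB database under such a mount; the mount path /var/vcap/data/nfs and the database name are hypothetical and depend on the mount configuration you pass to cf bind-service:
```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbOnNfsVolume {
    public static void main(String[] args) throws Exception {
        // File-mode HSQLDB: the database files live under the (hypothetical)
        // NFS mount directory, so they survive application restarts.
        // Requires the hsqldb jar on the classpath.
        String url = "jdbc:hsqldb:file:/var/vcap/data/nfs/appdb;shutdown=true";
        try (Connection conn = DriverManager.getConnection(url, "SA", "");
             Statement st = conn.createStatement()) {
            // First run only: create a sample table to verify the mount is writable.
            st.execute("CREATE TABLE notes (id INTEGER, body VARCHAR(200))");
        }
    }
}
```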

minio: What is the cluster architecture of minio.io object storage server?

I have searched minio.io for hours but it doesn't provide any good information about clustering. Does it have rings, with instances connected to each other, or is MinIO just for a single isolated machine? And to run a cluster, do we have to run many isolated instances and have our app choose which instance to write to?
If yes:
When I write a file to a bucket, does MinIO replicate it between multiple servers?
Is it like Amazon S3 or OpenStack Swift, which support storing multiple copies of an object on different servers (and not just on multiple disks in the same machine)?
Here is the documentation for distributed MinIO: https://docs.minio.io/docs/distributed-minio-quickstart-guide
From what I can tell, MinIO does not support clustering with automatic replication across multiple servers, balancing, etc.
However, the MinIO documentation does explain how you can set up one MinIO server to mirror another one:
https://gitlab.gioxa.com/opensource/minio/blob/1983925dcfc88d4140b40fc807414fe14d5391bd/docs/setup-replication-between-two-sites-running-minio.md
MinIO has also introduced continuous availability and active-active bucket replication; check out their active-active replication guide.

Hadoop upload files from local machine to amazon s3

I am working on a Java MapReduce app that has to provide an upload service for pictures from the user's local machine to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The method copyFromLocalFile(..) needs a path on the local machine, which in this case will be the EC2 cluster...
I'm not sure if I've stated the problem correctly; does anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
You will need to get the files from the user's machine to at least one node before you will be able to use them in a MapReduce job.
The FileSystem and FileUtil functions refer to paths either on HDFS or on the local disk of one of the nodes in the cluster.
They cannot reference the user's local system. (Maybe if you did some SSH setup... maybe?)
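To make the second point concrete, here is a minimal sketch, assuming the pictures have already been transferred to one of the cluster nodes (for example via scp) and that the S3 filesystem connector and credentials are configured on the cluster; the bucket name and paths are placeholders:
```
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToS3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Open the bucket through Hadoop's S3 filesystem implementation
        // (s3a:// here; older Hadoop versions used s3n://).
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket"), conf);
        // The source path is on the local disk of the node running this code,
        // not on the end user's machine.
        s3.copyFromLocalFile(new Path("/tmp/pictures/photo.jpg"),
                             new Path("s3a://my-bucket/pictures/photo.jpg"));
        s3.close();
    }
}
```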