Scaling Kafka Connect to handle 10K S3 buckets - amazon-s3

I want to load data from various S3 buckets (more than 10,000 buckets, each file around 20-50 MB) into Apache Kafka. The list of buckets is dynamic - buckets are added and removed at runtime. Ideally, each bucket configuration should have its own polling interval (how often to scan for new files - at least 60 seconds, but possibly much more) and priority (the number of files processed concurrently).
Note that setting up notifications from each of the S3 buckets to SQS/SNS/Lambda is not an option due to various IT policies in the organizations of each of the bucket owners.
Kafka Connect seems to be the most commonly used tool for such tasks, and its pluggable architecture will make it easier to add new sources in the future, so it fits well. Configuring each S3 bucket as its own connector lets me set a different number of tasks (which maps to priority) and a different polling interval for each one. And building a custom Java Kafka Connect source task for my expected file format sounds reasonable.
However, the Kafka Connect code indicates that each running task is assigned its own thread for the lifetime of the task. So if I have 10K buckets, each configured with its own connector and with a single task, I will have 10K threads running in my Kafka Connect distributed worker pool. That's a lot of threads that are mostly just sleep()-ing.
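For context, here is a minimal sketch of the kind of per-bucket source task I have in mind (the class name and config keys are just illustrative, not from any existing connector):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical per-bucket source task; class name and config keys are illustrative.
public class S3PollingSourceTask extends SourceTask {

    private String bucket;
    private long pollIntervalMs;

    @Override
    public void start(Map<String, String> props) {
        bucket = props.get("s3.bucket.name");
        pollIntervalMs = Long.parseLong(props.getOrDefault("poll.interval.ms", "60000"));
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // Connect calls poll() on a dedicated thread per task, so with one
        // connector (and one task) per bucket, this sleep occupies one thread
        // per bucket for the lifetime of the task.
        Thread.sleep(pollIntervalMs);
        // Listing the bucket and turning new files into SourceRecords is omitted here.
        return Collections.emptyList();
    }

    @Override
    public void stop() { }

    @Override
    public String version() { return "0.1"; }
}
```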
What is the correct approach to scaling the number of tasks/connectors in Kafka Connect?

Kafka Connect is a distributed framework that can run in standalone or distributed mode. In distributed mode you build a Kafka Connect cluster from several commodity servers, each hosting a Connect worker instance that can execute connector tasks; if you need more capacity, you add more servers hosting Connect workers.
Reading the S3 Source Connector documentation, I did not find a way to "whitelist" or use a regex to have it read from multiple buckets...
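So each bucket effectively becomes its own connector. Creating (and later deleting) them through the Connect REST API is easy to script; here is a rough sketch in Java, assuming a worker on localhost:8083 and an illustrative connector class and config keys:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterBucketConnector {
    public static void main(String[] args) throws Exception {
        // Illustrative config: the connector class and keys stand in for whatever
        // custom S3 source connector you end up writing.
        String config = """
            {
              "name": "s3-source-bucket-0001",
              "config": {
                "connector.class": "com.example.S3PollingSourceConnector",
                "tasks.max": "1",
                "s3.bucket.name": "bucket-0001",
                "poll.interval.ms": "60000",
                "topic": "s3-files"
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // assumed worker address
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Removing a bucket would then be a DELETE against /connectors/s3-source-bucket-0001, driven by whatever system tracks your bucket list.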

Related

Manage In-memory cache in multiple servers in aws

Once or twice a day, files are uploaded to an S3 bucket. I want the in-memory data of each server to be refreshed on every S3 upload.
Note that there are multiple servers running, and I want to store the same data on all of them. Also, the servers scale with traffic (new server instances come up and older ones go down, so the set of instances will not always be the same).
In short, I want to keep the cached data up to date on every server.
I want to build an architecture that supports auto-scaling of the servers. I came across the fan-out architecture on AWS, using SNS and multiple SQS queues that different servers can poll.
How can we handle the auto-scaling of the queues with respect to the servers?
Or is there any other way to handle the scenario?
PS: I'm totally new to the AWS environment.
Any reference would be a great help.
To me there are a few things that you need to have to make this work. These are opinions and, as with most architectural designs, there is certainly more than one way to handle this.
I start with the assumption that you've got an application running on an EC2 of some sort (Elastic Beanstalk, Fargate, Raw EC2s with auto scaling, etc.) and that you've solved for having the application installed and configured when a scale-up event occurs.
Conceptually, the flow is: S3 bucket → SNS topic → one SQS queue per instance → application.
The setup involves having the S3 bucket publish events (most likely s3:ObjectCreated:*) to the SNS topic. These events will be published when an object in the bucket is created or updated.
Next:
During startup your application will pull the current data from S3.
As part of application startup, create a queue named after the instance ID of the EC2 (see here for some examples). The queue needs to be subscribed to the SNS topic. If the queue already exists, that's not an error.
Your application would have a background thread or process that polls the SQS queue for messages.
If you get a message on the queue then that needs to tell the application to refresh the cache from S3.
When an instance is shut down, at least Elastic Beanstalk and the load balancers provide notice that the instance is going away. Remove the SQS queue tied to that instance at that time.
The only issue might be that a hard crash of an environment would leave orphaned queues. It may be advisable to clean these up manually or have a periodic task do so.
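Here is a rough sketch of steps 2-4 using the AWS SDK for Java v2. The instance ID, topic ARN, and refreshCacheFromS3() are placeholders, and the queue access policy that lets the SNS topic deliver to the queue is omitted:

```java
import java.util.Map;

import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.SubscribeRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.*;

public class CacheRefreshListener {

    public static void main(String[] args) {
        // Placeholders: in practice the instance ID comes from instance metadata
        // and the topic ARN from configuration.
        String instanceId = "i-0123456789abcdef0";
        String topicArn = "arn:aws:sns:us-east-1:123456789012:s3-upload-events";

        SqsClient sqs = SqsClient.create();
        SnsClient sns = SnsClient.create();

        // Step 2: create (or reuse) a queue named after the instance ID.
        String queueUrl = sqs.createQueue(CreateQueueRequest.builder()
                .queueName("cache-refresh-" + instanceId)
                .build()).queueUrl();

        // Subscribe the queue to the SNS topic (queue access policy omitted).
        Map<QueueAttributeName, String> attrs = sqs.getQueueAttributes(
                GetQueueAttributesRequest.builder()
                        .queueUrl(queueUrl)
                        .attributeNames(QueueAttributeName.QUEUE_ARN)
                        .build()).attributes();
        sns.subscribe(SubscribeRequest.builder()
                .topicArn(topicArn)
                .protocol("sqs")
                .endpoint(attrs.get(QueueAttributeName.QUEUE_ARN))
                .build());

        // Steps 3 and 4: poll (this would run in a background thread) and
        // refresh the cache whenever a message arrives.
        while (true) {
            for (Message m : sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .waitTimeSeconds(20)
                    .build()).messages()) {
                refreshCacheFromS3(); // placeholder: re-read the data from S3
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .receiptHandle(m.receiptHandle())
                        .build());
            }
        }
    }

    private static void refreshCacheFromS3() {
        // Application-specific: pull the current data from S3 and rebuild the cache.
    }
}
```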

NiFi Site-to-Site Data Flow is Slow

I have multiple standalone NiFi instances (approx. 10) that I want to use to send data to a NiFi cluster (3 NiFi instances) using an RPG (Site-to-Site). But the flow from the standalone instances to the cluster seems to be slow.
Is this the right approach?
How many Site-to-Site Connections does NiFi allow?
Are there any best practices for Site-to-Site NiFi Data Flow?
You may want to first rule out your network. You could ssh to one of the standalone nodes and then try to SCP a large file from the standalone node to one of the nodes in the NiFi cluster. If that is slow then it is more of a network problem and there won't be much you can do to make it go faster in NiFi.
In NiFi, you can tune each side of the site-to-site config...
On the central cluster you can right-click on the remote Input Port and configure the concurrent tasks which defaults to 1. This is the number of threads that can concurrently process data received on the port.
On the standalone NiFi instances you can also configure the concurrent tasks used to send data to a given port. Right-click on the RPG and select "Manage remote ports", and then change the concurrent tasks for whichever port.

Transfer messages of different topics to hdfs by kafka-connect-hdfs

I want to transfer data from Kafka to HDFS with Confluent, and I have run the quickstart experiments successfully from the CLI.
Now I intend to deploy the Confluent Platform in a production environment. Is there a detailed tutorial about distributed deployment?
Also, there are many topics in Kafka, such as register_info, video_play_info, video_like_info, video_repost_info, etc.
I need to process messages with different converters and transfer them to different Hive tables.
What should I do?
"I need to process messages with different converters and transfer them to different Hive tables"
Run bin/connect-distributed etc/kafka/connect-distributed.properties
Create individual JSON files for each HDFS Connector
POST them to the REST endpoint of Kafka Connect
Distributed mode is documented here
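For step 2, this is roughly what one of those per-connector JSON files could look like (written out from Java here; the topic, Hive settings, and converter are illustrative - check the kafka-connect-hdfs docs for the exact property names your version supports):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteConnectorConfig {
    public static void main(String[] args) throws Exception {
        // Illustrative values for one topic; each topic that needs its own
        // converter or Hive table gets its own file like this.
        String json = """
            {
              "name": "hdfs-sink-video_play_info",
              "config": {
                "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
                "tasks.max": "2",
                "topics": "video_play_info",
                "hdfs.url": "hdfs://namenode:8020",
                "flush.size": "1000",
                "hive.integration": "true",
                "hive.metastore.uris": "thrift://hive-metastore:9083",
                "hive.database": "video",
                "value.converter": "io.confluent.connect.avro.AvroConverter",
                "value.converter.schema.registry.url": "http://schema-registry:8081"
              }
            }""";
        Files.writeString(Path.of("hdfs-sink-video_play_info.json"), json);
    }
}
```

Step 3 then POSTs each file to the worker's REST endpoint (e.g. POST http://localhost:8083/connectors).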

minio: What is the cluster architecture of minio.io object storage server?

I have searched minio.io for hours, but it doesn't provide any good information about clustering. Does it have rings with connected instances, or is Minio just for a single isolated machine? And to run a cluster, do we have to run many isolated instances of it and have our app choose which instance to write to?
If yes:
When I write a file to a bucket, does Minio replicate it between multiple servers?
Is it like Amazon S3 or OpenStack Swift, which support storing multiple copies of an object on different servers (and not just on multiple disks of the same machine)?
Here is the document for distributed minio: https://docs.minio.io/docs/distributed-minio-quickstart-guide
From what I can tell, minio does not support clustering with automatic replication across multiple servers, balancing, etcetera.
However, the minio documentation does say how you can set up one minio server to mirror another one:
https://gitlab.gioxa.com/opensource/minio/blob/1983925dcfc88d4140b40fc807414fe14d5391bd/docs/setup-replication-between-two-sites-running-minio.md
Minio has also introduced continuous availability and active-active bucket replication. Check out their active-active replication guide.

Transferring results from Zookeeper to webserver

In my project I am calculating about 10-100 MB of data on a Zookeeper worker. I then use HTTP PUT to transfer the data from the worker process to my webserver, which eventually gets delivered to the client. Is there any way, using Zookeeper or Curator, to transfer that data, or am I on my own to get the data out of the worker process and onto a process outside my ensemble?
I wouldn't recommend using Zookeeper to transfer data, especially data of such a relatively large size. It is not really designed for that. Zookeeper works best when it is used to synchronize distributed processes or to store relatively small configuration data that is shared among multiple hosts.
There is a hard limit of 1 MB per znode, and if you try to push it to that limit, Zookeeper clients may get timeouts and go into a disconnected state while the Zookeeper service processes the large chunk of data.
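If you still want Zookeeper/Curator in the picture, a common pattern is to put the payload somewhere else (your webserver, S3, HDFS) and publish only a small pointer to it in a znode. A minimal Curator sketch of that idea, with the connection string and paths as placeholders:

```java
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class PublishResultPointer {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and znode path.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Store only a small reference to the result (well under the ~1 MB znode
        // limit), not the 10-100 MB payload itself.
        String pointer = "https://webserver.example.com/results/job-42";
        client.create()
              .creatingParentsIfNeeded()
              .forPath("/jobs/job-42/result", pointer.getBytes(StandardCharsets.UTF_8));

        client.close();
    }
}
```

The worker still pushes the actual bytes over HTTP (or to shared storage), and the webserver watches or reads the znode to find out where to fetch them from.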