Scaling a Spring Batch application in PCF without Spring Cloud Data Flow and with other cloud services disabled - RabbitMQ

I have read a lot of articles about Spring Batch scaling on cloud platforms, and I have also watched Michael Minella's YouTube video on high-performance batch processing (https://www.youtube.com/watch?v=J6IPlfm7N6w).
My use case is processing a large file (more than 1 GB) with Spring Batch in PCF. I understand that the file can be split and the DeployerPartitionHandler class can be used to dynamically start a new PCF instance per partition/file, but the catch is that we don't have Spring Cloud Data Flow or Spring Cloud services enabled in our PCF environment.
I saw that we can combine Spring Batch with Spring Integration and RabbitMQ to do remote chunking of the large file using a master/worker configuration. But these workers need to be started manually in PCF as separate instances, and based on the load we would have to manually start more of them. The manager side of the setup I have in mind looks roughly like the sketch below.
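This is only a minimal sketch, assuming Spring Batch 4.3+ (@EnableBatchIntegration with RemoteChunkingManagerStepBuilderFactory) and spring-integration-amqp; the channel beans, the chunk size, and the chunk.requests routing key are placeholder choices, and the inbound AMQP flow that would feed the replies channel from RabbitMQ is omitted for brevity:

import org.springframework.amqp.core.AmqpTemplate;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.step.tasklet.TaskletStep;
import org.springframework.batch.integration.chunk.RemoteChunkingManagerStepBuilderFactory;
import org.springframework.batch.integration.config.annotation.EnableBatchIntegration;
import org.springframework.batch.item.ItemReader;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.amqp.dsl.Amqp;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.integration.channel.QueueChannel;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
@EnableBatchProcessing
@EnableBatchIntegration
public class ManagerConfiguration {

    @Autowired
    private RemoteChunkingManagerStepBuilderFactory managerStepBuilderFactory;

    @Bean
    public DirectChannel requests() {
        return new DirectChannel();   // chunk requests leaving the manager
    }

    @Bean
    public QueueChannel replies() {
        return new QueueChannel();    // acks coming back from the workers
    }

    // Bridge the requests channel to RabbitMQ via Spring Integration.
    @Bean
    public IntegrationFlow outboundFlow(AmqpTemplate amqpTemplate) {
        return IntegrationFlows.from(requests())
                .handle(Amqp.outboundAdapter(amqpTemplate).routingKey("chunk.requests"))
                .get();
    }

    // The manager reads the large file locally and only ships chunks to workers.
    @Bean
    public TaskletStep managerStep(ItemReader<String> fileReader) {
        return this.managerStepBuilderFactory.get("managerStep")
                .chunk(1000)
                .reader(fileReader)
                .outputChannel(requests())
                .inputChannel(replies())
                .build();
    }
}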
But is there any other way provided by Spring Batch and PCF to autoscale the worker instances based on load? Or is there a way to dynamically start a new instance in PCF when the master has a chunk ready while reading the file?
FYI: if I use the PCF Autoscaler feature based on a metric such as CPU utilization, every new instance reads the whole file again and processes it.

Related

Scaling Kafka Connect to handle 10K S3 buckets

I want to load data from various S3 buckets (more than 10,000 buckets, with each file around 20-50 MB) into Apache Kafka. The list of buckets is dynamic - buckets are added and removed at runtime. Ideally, each bucket configuration should have its own polling interval (how often to scan for new files - at least 60 seconds, but possibly much more) and priority (the number of files processed concurrently).
Note that setting up notifications from each of the S3 buckets to SQS/SNS/Lambda is not an option due to various IT policies in the organizations of each of the bucket owners.
Kafka Connect seems to be the most commonly used tool for such tasks, and its pluggable architecture will make it easier to add new sources in the future, so it fits well. Configuring each S3 bucket as its own connector lets me set a different number of tasks (which maps to priority) and a different polling interval for each one. And building a custom Java Kafka Connect source task for my expected file format sounds reasonable.
However, the Kafka Connect code indicates that each running task is assigned its own thread for the lifetime of the task. So if I have 10K buckets, each configured with its own connector and with a single task, I will have 10K threads running in my Kafka Connect distributed worker pool. That's a lot of threads that are mostly just sleep()-ing.
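To make the thread model concrete, each task would be shaped roughly like the skeleton below; the class name and the poll.interval.ms key are hypothetical, not from any existing connector:

import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical task: one instance per bucket, each occupying a worker thread.
public class S3BucketSourceTask extends SourceTask {

    private long pollIntervalMs;

    @Override
    public String version() {
        return "0.0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // "poll.interval.ms" is a made-up config key for this sketch
        pollIntervalMs = Long.parseLong(props.getOrDefault("poll.interval.ms", "60000"));
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // Connect calls poll() in a dedicated loop per task, so the task's
        // thread spends most of its life blocked right here.
        Thread.sleep(pollIntervalMs);
        // scan the bucket and convert new files to SourceRecords here;
        // returning null tells Connect there is nothing this cycle
        return null;
    }

    @Override
    public void stop() {
    }
}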
What is the correct approach to scaling the number of tasks/connectors in Kafka Connect?
Kafka Connect is a distributed framework that can run in standalone or distributed mode. In distributed mode you create a Kafka Connect cluster from several commodity servers, each hosting a Connect instance that can execute connectors' tasks; if you need more capacity, you can add more servers hosting Connect instances.
Reading the S3 Source Connector documentation, I did not find a way to "whitelist"/"regex" to get it to read from multiple buckets...

Google Cloud Managed Tomcat Service

Does Google Cloud or AWS provide a managed Apache Tomcat that just takes a WAR file and auto-scales based on load increases and decreases? Not Compute Engine - I don't want to create a VM. This should be handled by a managed service.
Google App Engine can directly take and run a WAR file - just use the appcfg deployment method.
You will have more options if you package with Docker, as this provides an image type that can be run in many places (multiple GCP, AWS and Azure options, on-prem Kubernetes, etc.). This can even be as simple as building a Dockerfile that just copies the WAR into a Jetty image:
FROM jetty:latest
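# copy the WAR into Jetty's webapp directory; Jetty deploys it automatically at startup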
COPY YOUR_WAR.war /var/lib/jetty/webapps
It might be better to explode the WAR, though - see the discussion in this question.
AWS provides AWS Elastic Beanstalk.
The AWS Elastic Beanstalk Tomcat platform is a set of environment configurations for Java web applications that can run in a Tomcat web container. Each configuration corresponds to a major version of Tomcat, like Java 8 with Tomcat 8.
Platform-specific configuration options are available in the AWS Management Console for modifying the configuration of a running environment. To avoid losing your environment's configuration when you terminate it, you can use saved configurations to save your settings and later apply them to another environment.
To save settings in your source code, you can include configuration files. Settings in configuration files are applied every time you create an environment or deploy your application. You can also use configuration files to install packages, run scripts, and perform other instance customization operations during deployments.
It also provides autoscaling:
The Auto Scaling group in your Elastic Beanstalk environment uses two Amazon CloudWatch alarms to trigger scaling operations. The default triggers scale when the average outbound network traffic from each instance is higher than 6 MB or lower than 2 MB over a period of five minutes. To use Amazon EC2 Auto Scaling effectively, configure triggers that are appropriate for your application, instance type, and service requirements. You can scale based on several statistics including latency, disk I/O, CPU utilization, and request count.
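As a concrete illustration, the trigger can be overridden with an .ebextensions configuration file shipped in your application source. This sketch uses the standard aws:autoscaling:trigger option names, but the metric choice and thresholds are purely illustrative:

# .ebextensions/scaling.config
option_settings:
  aws:autoscaling:trigger:
    MeasureName: CPUUtilization    # the default trigger uses NetworkOut
    Statistic: Average
    Unit: Percent
    Period: 5                      # minutes
    UpperThreshold: 70             # add an instance above 70% average CPU
    LowerThreshold: 30             # remove an instance below 30%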

Is Redis a good idea for Spring Cloud Stream? Should I use Kafka or RabbitMQ?

I'm deploying a small Spring Cloud Stream project, using only HTTP sources and JDBC sinks (3 instances each). The estimated load is 10 hits/second.
I was thinking of using Redis because I feel more comfortable with it, but in the latest documentation almost all the references are to Kafka and RabbitMQ, so I am wondering whether Redis is not going to be supported in the future or whether there is any issue with using it.
Regards
Redis is not recommended for production with Spring Cloud Stream - the binder is not fully functional and message loss is possible.
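Note that the programming model is binder-agnostic, so choosing RabbitMQ or Kafka is only a dependency change (spring-cloud-stream-binder-rabbit vs. spring-cloud-stream-binder-kafka); the application code stays the same. A minimal sink sketch using the annotation model of that era (the class name is hypothetical):

import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Sink;

// The binder is picked up from the classpath; this code never references it.
@EnableBinding(Sink.class)
public class JdbcForwardingSink {

    @StreamListener(Sink.INPUT)
    public void handle(String payload) {
        // hand the payload off to the JDBC layer here
        System.out.println("received: " + payload);
    }
}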

Spring Cloud Bus RabbitMQ

We're using Spring Cloud Config Server. Spring config clients get updates via the Spring Cloud Bus (RabbitMQ).
It looks like every config client instance creates a queue bound to the 'spring.cloud.bus' exchange.
Are there any scalability limits on how many app instances can connect to the 'spring.cloud.bus' exchange?
I suppose RabbitMQ could be scaled to handle this.
Looking for any guidelines on this.
Many thanks!
The Spring Cloud Config Server can have multiple instances since it is stateless. That, coupled with a RabbitMQ cluster, should scale to a very large number of instances.
A viable solution would be Spring Cloud Config behind a load balancer with a RabbitMQ cluster.
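Each client then only needs the config server's (load-balanced) URI and the RabbitMQ cluster addresses, along the lines of this properties sketch - the host names here are illustrative:

# application.properties on each config client
spring.cloud.config.uri=http://config-server-lb.internal:8888
spring.rabbitmq.addresses=rabbit-1:5672,rabbit-2:5672,rabbit-3:5672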

Footprint of integration applications in Mule ESB

I want to have a shared Mule ESB server where I'm going to deploy many integration apps. Most of them are similar but with different parameters, and they have to be independent of each other.
I want to know how many integration applications I can deploy in one Mule ESB container. I'm not so worried about CPU usage, because that's easily balanced by having many servers, but I'm more concerned about memory usage or any other Mule ESB limitation.
I understand that the memory usage of integration apps depends on the integration being deployed to Mule ESB, but it would be good to understand the overhead of deploying a new app. Does it load all Mule classes again, or does it reuse what's already loaded? Are all applications active all the time, consuming memory, or are they put to sleep while inactive?
For example, if Mule ESB without apps consumes 100 MB and deploying one app brings it to 120 MB, can I conclude that each integration application of the same type will consume 20 MB?