NiFi Site-to-Site Data Flow is Slow

I have multiple standalone NiFi instances (approx. 10) that I want to use to send data to a NiFi cluster (3 NiFi instances) using an RPG (Site-to-Site). But the flow from the standalone instances to the cluster seems to be slow.
Is this the right approach?
How many Site-to-Site Connections does NiFi allow?
Are there any best practices for Site-to-Site NiFi Data Flow?

You may want to first rule out your network. You could ssh to one of the standalone nodes and then try to scp a large file from the standalone node to one of the nodes in the NiFi cluster. If that is slow, then it is more of a network problem, and there won't be much you can do in NiFi to make it go faster.
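If scp isn't convenient, a rough raw-TCP throughput check between a standalone node and a cluster node serves the same purpose. The sketch below is only an illustration, not part of NiFi: the hostname, port, and payload size are placeholders.

```python
# Rough TCP throughput check, independent of NiFi. Run receive() on a cluster
# node and send() on a standalone node; host, port and payload size are placeholders.
import socket
import time

CHUNK = 1024 * 1024            # 1 MiB per send
TOTAL = 200 * CHUNK            # ~200 MiB test payload

def receive(bind_host="0.0.0.0", port=9999):
    """Accept a single connection and drain everything it sends."""
    with socket.create_server((bind_host, port)) as srv:
        conn, _ = srv.accept()
        with conn:
            while conn.recv(CHUNK):
                pass

def send(host="nifi-cluster-node-1.example.com", port=9999):
    """Push TOTAL bytes to the receiver and report the achieved throughput."""
    payload = b"\x00" * CHUNK
    start = time.time()
    with socket.create_connection((host, port)) as sock:
        for _ in range(TOTAL // CHUNK):
            sock.sendall(payload)
    elapsed = time.time() - start
    print(f"{TOTAL / elapsed / (1024 * 1024):.1f} MiB/s over {elapsed:.1f} s")
```

If that number is far below what you expect from your network, tuning NiFi won't help much; if it looks healthy, the Site-to-Site settings below are the next thing to check.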
In NiFi, you can tune each side of the Site-to-Site configuration:
On the central cluster you can right-click on the remote Input Port and configure the concurrent tasks which defaults to 1. This is the number of threads that can concurrently process data received on the port.
On the standalone NiFi instances you can also configure the concurrent tasks used to send data to a given port. Right-click on the RPG and select "Manage remote ports", and then change the concurrent tasks for whichever port.

Related

MaxScale cluster (master-master) setup

When deploying multiple MaxScale instances in a Master-Slave topology (failover from master to slave with Keepalived or similar) in front of a Galera Cluster in read-write-split mode, everything goes fine. But what about a Master-Master-like topology in a round-robin fashion, is this possible?
Ex.: Having one MaxScale at 10.0.0.1, a second at 10.0.0.2, and HAProxy in front of them with a roundrobin or leastconn distribution algorithm (or even without HAProxy/a load balancer, where applications just connect randomly to one or the other), is this possible/well supported by MaxScale?
Usually you can connect to any number of MaxScale instances, as long as certain features are enabled to guarantee that all MaxScale instances pick the same server to send writes to.
Galera Clusters
If you are using a Galera cluster, this can be done in a safe and conflict-free manner by enabling the root_node_as_master parameter. It uses the Galera cluster itself to select which node it uses for writes. This allows your applications to connect to either of the MaxScale instances.
Even without this parameter it would not cause any problems with the database itself, but due to the way Galera works, you increase the likelihood of running into a conflict when you COMMIT your transaction if you write to multiple nodes.
Asynchronous Replication Clusters
If you use MaxScale with a cluster that uses asynchronous replication, you can still do this as long as you configure your mariadbmon monitor to use cooperative_monitoring_locks. This causes the MaxScale instances to communicate via the database about which servers they see and which of them they choose for writes.
An added benefit of cooperative_monitoring_locks is that you can enable the automated cluster management parameters auto_failover and auto_rejoin without having to worry about two MaxScales attempting to change the replication configuration at the same time: cooperative_monitoring_locks makes sure only one MaxScale does it.

Does it require to put load balancer before Redis cluster

I am using Redis Cluster on 3 Linux servers (CentOS 7). I have a standard configuration, i.e. 6 nodes: 3 master instances and 3 slave instances (each master has one slave) distributed across these 3 Linux servers. I am using this setup for my web application for data caching and HTTP response caching. My aim is to separate reads and writes between the primary and secondary instances, i.e. read operations should not fail or be delayed.
Now I would like to ask: is it necessary to configure a load balancer in front of my 3 Linux servers so that my web application's requests to the Redis cluster instances can be distributed properly across these Redis servers? Or is the Redis cluster itself able to handle the load distribution?
If yes, then please mention a reference link for configuring this. I have checked the official Redis Cluster documentation but it does not specify anything regarding a load balancer setup.
If you're running Redis in "Cluster Mode" you don't need a load balancer. Your Redis client (assuming it's any good) should contact Redis for a list of which slots are on which nodes when your application starts up. It will hash keys locally (in your application) and send requests directly to the node which owns the slot for that key (which avoids the extra call to Redis which results in a MOVED response).
You should be able to configure your client to do reads on slave and writes on master - or to do both reads and writes on only masters. In addition to configuring your client, if you want to do reads on slaves, check out the READONLY command: https://redis.io/commands/readonly .
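As a minimal sketch of what this looks like with redis-py (any decent cluster-aware client behaves similarly; the seed host, port, and keys below are placeholders), the client discovers the slot map itself, so nothing needs to sit in front of the nodes:

```python
# Minimal sketch with redis-py (4.x); the seed host/port and keys are placeholders.
from redis.cluster import RedisCluster

# On startup the client asks the seed node for the slot-to-node map, then hashes
# keys locally and sends each command straight to the node that owns the slot.
rc = RedisCluster(host="10.0.0.11", port=6379,
                  read_from_replicas=True)      # also route reads to replicas

rc.set("response:home", "<html>...</html>")     # write goes to the owning master
cached = rc.get("response:home")                # read may be served by its replica
```

With read_from_replicas enabled, redis-py takes care of issuing READONLY on the replica connections for you.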

Redis Cluster or Replication without proxy

Is it possible to build a one-master (port 6378) + two-slave (read-only, ports 6379 and 6380) "cluster" on one machine to increase performance (especially reads) without using any proxy? Can the site or code connect to the master instance and read data from the read-only nodes? Or, if I use 3 instances of Redis, do I have to use a proxy anyway?
Edit: It seems the slave nodes don't have any data; they try to redirect to the master instance, but that is not the correct way, am I right?
Definitely. You can code the paths in your app so writes and reads go to different servers. Depending on the programming language that you're using and the Redis client, this may be easier or harder to achieve.
Edit: that said, I'm unsure how you're running a cluster with a single master - the minimum should be 3.
You need to send a READONLY command after connecting to the slave before you can execute any read commands.
READONLY only affects the current socket session, which means you need to send this command on every TCP connection.
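A minimal sketch with redis-py (the ports follow the layout in the question, the key is a placeholder, and it assumes the single master owns all hash slots):

```python
# Minimal sketch with redis-py; ports follow the question's layout, the key is a placeholder.
import redis

master  = redis.Redis(host="127.0.0.1", port=6378)   # writes must go to the master
replica = redis.Redis(host="127.0.0.1", port=6379)   # read-only node

master.set("greeting", "hello")

# Without READONLY, a cluster-mode replica answers reads with a MOVED redirect
# to the master. READONLY is per connection, so it has to be re-sent whenever
# a new TCP connection is opened (e.g. by a connection pool).
replica.execute_command("READONLY")
print(replica.get("greeting"))
```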

Should I run haproxy for db and redis sentinel on web nodes?

I am setting up a cluster of servers using Vagrant and playing with Redis Sentinel and HAProxy for the PostgreSQL db connection (with pgpool). I was curious whether it makes sense to put HAProxy and Redis Sentinel on each of my web server nodes and have them connect directly to those. The thought is that this creates distributed connections to the DB and Redis and removes the single point of failure of having one HAProxy that everything connects to before being split across the different db nodes. I can also keep the database connection (via HAProxy) and Redis (via Sentinel) encapsulated on localhost. Does this make sense?
It only makes sense if you're trying to save on resources/costs.
Please note that Redis Sentinel must have a fixed list of sentinel instances, which doesn't fit the scenario of placing one per machine, as your machine count would probably scale/change.
Otherwise, it always makes the most sense to put different infrastructure components (especially those of a clustering/HA nature, such as Redis) on different machines.
By mixing them all together, you usually end up with applications getting in each other's way and stealing CPU from each other once the load increases. You also risk designing your applications/scripts/flows to be location-aware (i.e. assuming external resources are always local), which is not a good practice either.
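For reference, this is roughly how an application on a web node talks to Sentinel with redis-py; the sentinel addresses and the "mymaster" service name are placeholders for whatever you configured. Note that the client is handed a fixed list of sentinels up front, which is the point above about a changing web-node count:

```python
# Sketch with redis-py's Sentinel support; addresses and the service name are placeholders.
from redis.sentinel import Sentinel

# The application is configured with a fixed list of sentinels and asks them
# which node is currently the master for the named service.
sentinel = Sentinel([("10.0.0.1", 26379), ("10.0.0.2", 26379), ("10.0.0.3", 26379)],
                    socket_timeout=0.5)

master  = sentinel.master_for("mymaster", socket_timeout=0.5)   # connection for writes
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)    # connection for reads

master.set("session:abc", "payload")
print(replica.get("session:abc"))
```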

Transferring results from Zookeeper to webserver

In my project I am calculating about 10-100 MB of data on a ZooKeeper worker. I then use HTTP PUT to transfer the data from the worker process to my webserver, which eventually delivers it to the client. Is there any way, using ZooKeeper or Curator, to transfer that data, or am I on my own to get the data out of the worker process and onto a process outside my ensemble?
I wouldn't recommend using ZooKeeper to transfer data, especially of such relatively large size. It is not really designed for that. ZooKeeper works best when it is used to synchronize distributed processes or to store relatively small configuration data that is shared among multiple hosts.
There is a limit of about 1 MB per znode (the default jute.maxbuffer), and if you try to push it to that limit, ZooKeeper clients may get timeouts and go into a disconnected state while the ZooKeeper service processes the large chunk of data.
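If you still want ZooKeeper in the picture, one way to stay within that design is to keep only a small pointer in a znode and move the payload itself over HTTP, as you already do. A sketch with kazoo (a Python ZooKeeper client; the idea is the same with Curator in Java) follows; the ZooKeeper hosts, znode path, and result URL are placeholders:

```python
# Sketch with kazoo; the ZooKeeper hosts, znode path and result URL are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# The worker has already PUT the 10-100 MB result to the webserver; ZooKeeper
# only holds a few bytes of metadata pointing at it, well under the ~1 MB znode limit.
result_url = "http://webserver.example.com/results/job-42"
zk.create("/jobs/job-42/result", result_url.encode("utf-8"), makepath=True)

zk.stop()
```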