API or other queryable source for getting total NiFi queued data

Is there an API endpoint or some other queryable source where I can get the total queued data?
Setting up a little dataflow in NiFi to monitor NiFi itself sounds sketchy, but if it's common practice, so be it. Either way, I cannot find the API endpoint to get that total.
Note: I have a single NiFi instance. I don't have, nor will I implement, S2S reporting, since I am on a single-instance, single-node NiFi setup.

The Site-to-Site Reporting Tasks were developed because they work for clustered setups, standalone instances, and multiple instances thereof, including a single node. You'd just need to put an Input Port on your canvas and have the reporting task send to that same instance.
An alternative as of NiFi 1.10.0 (via NIFI-6780) is to get the nifi-sql-reporting-nar and use QueryNiFiReportingTask, which lets you issue a SQL query for exactly the metrics you want. It uses a RecordSinkService controller service to determine how to send the results, and there are various implementations such as Site-to-Site, Kafka, Database, etc. The NAR is not included in the standard NiFi distribution due to size constraints, but you can get the latest version (1.11.4) here, or change the URL to match your NiFi version.

@jonayreyes You can find information about how to get queue data from the NiFi API here:
NiFi Rest API - FlowFile Count Monitoring
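For a standalone instance, a third option is simply to poll the REST API yourself. Below is a minimal Java sketch (java.net.http, Java 11+), assuming an unsecured instance on localhost:8080 and the flow status endpoint of the NiFi 1.x REST API; the endpoint path and JSON field names may differ in your version, so treat them as assumptions to verify.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Polls a standalone NiFi instance for the aggregate queued data of the
// whole flow. The endpoint and field names below are assumptions based on
// the NiFi 1.x REST API; adjust host, port, and auth for your setup.
public class QueuedDataPoller {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/nifi-api/flow/process-groups/root/status"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The response should contain processGroupStatus.aggregateSnapshot.queued,
        // e.g. "12 / 3.4 MB" (FlowFile count / total bytes). A crude string scan
        // keeps this sketch dependency-free; use a JSON parser in practice.
        String body = response.body();
        int idx = body.indexOf("\"queued\"");
        if (idx >= 0) {
            int start = body.indexOf(':', idx) + 1;
            int end = body.indexOf(',', start);
            System.out.println("Total queued: " + body.substring(start, end).trim());
        }
    }
}
```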

Related

Apache Ignite's Continuous Queries event handler group & sequencing

We are trying to use the Continuous Query feature of Ignite, but we are facing an issue handling the events. Below is our problem statement:
We have defined a Continuous Query with a remote filter for a cache and shared the filter definition with the Thick Client.
We are running multiple replicas of the "Thin Client" in a Kubernetes cluster.
Now the problem is that each instance of the "Thin Client" running in the k8s cluster has registered the remote filter, and each instance receives the event and tries to process the data in parallel. This results in duplicate processing, or even overwriting of the data in my store.
Is there any way to form a consumer group and ensure that only one instance of the "Thin Client" receives the notification and processes the data?
My Thick Client and Thin Clients are in .NET.
I couldn't find any details in the Ignite documentation:
https://ignite.apache.org/docs/latest/key-value-api/continuous-queries
Here each thin client starts its own continuous query, so by design each thin client gets its own copy of the event to consume. If you want to route an event to a specific client, then you would need to start only one continuous query and distribute that event to your app as you see fit.
Take a look at Ignite messaging to see whether it fits your use case.
Also check out the distributed Queue/Set, which have unique delivery guarantees.
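To make the "single continuous query" suggestion concrete, here is a hedged Java sketch (the question's clients are .NET, where the Ignite API is similar) that combines one continuous query with a distributed queue, so each event is consumed by exactly one worker. The cache and queue names are illustrative, not from the original question.

```java
import javax.cache.event.CacheEntryEvent;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.configuration.CollectionConfiguration;

// Sketch: exactly one node registers the continuous query and pushes each
// event into a distributed queue. Any number of workers can take() from the
// queue, and each event is delivered to exactly one of them.
public class SingleQueryFanOut {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        IgniteCache<Integer, String> cache = ignite.getOrCreateCache("myCache");
        IgniteQueue<String> queue =
                ignite.queue("cache-events", 0 /* unbounded */, new CollectionConfiguration());

        ContinuousQuery<Integer, String> qry = new ContinuousQuery<>();
        qry.setLocalListener(events -> {
            for (CacheEntryEvent<? extends Integer, ? extends String> e : events)
                queue.put(e.getKey() + "=" + e.getValue());
        });

        // Keep the returned cursor open to keep the query registered.
        cache.query(qry);

        // Worker side (can run on any node): queue.take() blocks until an
        // event is available and removes it, so only one worker processes it.
    }
}
```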

Handling of pubsub subscribers for distributed long-running tasks

I am evaluating the use of pubsub for long-running tasks such as video transcoding, where a particular transcode may take between 2-10 minutes. Is pubsub a good approach for such task distribution? For example, let's say I have five servers:
- publisher1
- publisher2
- publisher3
- publisher4
- publisher5
And a topic called "videos". Would it be possible to spread out the messages equally across those five servers? What about when servers are added or removed? What would be a good approach to doing this, or is pubsub not the right tool for something like this?
This does sound like a reasonable use case for pubsub. Specifically, if you use a pull subscriber, you can configure flow control settings to have at most one outstanding message per server, and configure the max ack extension period (in Java) to be a reasonable upper bound of your processing time. This API is described here: http://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/pubsub/v1/package-summary.html
This should effectively load balance across your servers by default if you use the same subscription (subscriber id) for all jobs. If a server is added and a backlog exists, it will receive new entries. If a server is removed, it will no longer be sent messages. If it is removed while processing, or crashes, the message it was working on will be resent to another server.
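As a concrete illustration, here is a hedged Java sketch of those settings using the google-cloud-pubsub client. The project and subscription names are placeholders, and builder method names may vary between client versions.

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import org.threeten.bp.Duration;

// Sketch: a pull subscriber tuned for long-running work. At most one
// message in flight per server, plus a generous ack deadline extension.
// Names are placeholders, not from the original question.
public class TranscodeWorker {
    public static void main(String[] args) {
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "videos-sub");

        MessageReceiver receiver = (message, consumer) -> {
            try {
                transcode(message.getData().toStringUtf8()); // 2-10 minutes of work
                consumer.ack();
            } catch (Exception e) {
                consumer.nack(); // message is redelivered to another worker
            }
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
                // At most one outstanding message on this server at a time.
                .setFlowControlSettings(FlowControlSettings.newBuilder()
                        .setMaxOutstandingElementCount(1L)
                        .build())
                // Let the client keep extending the ack deadline while we work.
                .setMaxAckExtensionPeriod(Duration.ofMinutes(15))
                .build();

        subscriber.startAsync().awaitRunning();
    }

    private static void transcode(String videoRef) { /* ... */ }
}
```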
One concern, however, is that pubsub has a limit of 10MB per message. You might consider instead putting the data itself in a Google Cloud Storage bucket; Cloud Storage can publish the file location to a pubsub topic when an upload is complete: https://cloud.google.com/storage/docs/pubsub-notifications

How to capture bulletin messages in Apache NiFi

I want to know if there is a way to capture the bulletin messages (basically errors) that appear in the NiFi UI and store them in some attribute/file so that they can be looked at later. The screen gets refreshed every 5 min, and if there is a failure in any of the processors I would want to know the reason for it.
I am not particularly talking about the logging part here.
As you know, the bulletins reflect messages that are already logged, so all of this content is already stored in {NIFI_HOME}/logs/nifi-app.log. However, if you want to consume the bulletins directly, you have a couple of different options.
You could consume the bulletins from the REST API. There are a couple of endpoints for accessing them.
http[s]://{host}:{port}/nifi-api/controller/process-groups/{process-group-id}/status?recursive=true
This request will get the status (including bulletins) of all components under the specified Process Group. You can use the alias 'root' for the root level Process Group. The recursive flag will indicate whether or not to return just the children of that Process Group or all descendant components.
http[s]://{host}:{port}/nifi-api/controller/status
This request will get the status (including bulletins) of the Controller level components. This includes any reported bulletins from Controller Services, Reporting Tasks, and the NiFi Framework itself (clustering messages, etc).
http[s]://{host}:{port}/nifi-api/controller/bulletin-board?limit=n&sourceId={id}&message={str}
This request will access all bulletins and supports filtering by component and message, as well as limiting the number of bulletins returned.
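As an illustration, here is a hedged Java sketch (java.net.http, Java 11+) that polls the bulletin board and appends each response to a file. The path matches the older endpoints shown above; newer NiFi releases expose the bulletin board at /nifi-api/flow/bulletin-board, so adjust the URL, port, and credentials for your instance.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: poll the bulletin board periodically and append the raw JSON to a
// file so bulletins survive the UI's rolling window. URL and port are
// assumptions for an unsecured local instance.
public class BulletinPoller {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/nifi-api/controller/bulletin-board?limit=100"))
                .GET()
                .build();

        while (true) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            Files.writeString(Path.of("bulletins.jsonl"),
                    response.body() + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            Thread.sleep(60_000); // poll every minute
        }
    }
}
```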
You could also create a Reporting Task implementation, which has access to the bulletin repository. Reporting Tasks are an extension point meant to report details from this NiFi instance. This would require some Java code, but it would allow you to report the bulletins however you like. Here is an example that reports metrics to Ambari [1].
[1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-ambari-bundle/nifi-ambari-reporting-task/src/main/java/org/apache/nifi/reporting/ambari/AmbariReportingTask.java
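To give a rough idea of the shape of such a task, here is a hedged sketch. The bulletin query API shown is an assumption based on the NiFi reporting interfaces and may differ between versions, and NAR packaging is omitted.

```java
import java.util.List;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.Bulletin;
import org.apache.nifi.reporting.BulletinQuery;
import org.apache.nifi.reporting.ReportingContext;

// Sketch of a custom Reporting Task that reads bulletins straight from the
// bulletin repository on each trigger. Method names are assumptions from
// the NiFi reporting interfaces; verify against your NiFi version.
public class LogBulletinsReportingTask extends AbstractReportingTask {
    private volatile long lastBulletinId = -1L;

    @Override
    public void onTrigger(ReportingContext context) {
        // Fetch only bulletins newer than the last one we saw.
        List<Bulletin> bulletins = context.getBulletinRepository()
                .findBulletins(new BulletinQuery.Builder().after(lastBulletinId).build());

        for (Bulletin b : bulletins) {
            lastBulletinId = Math.max(lastBulletinId, b.getId());
            // Ship the bulletin wherever you like: a file, a DB, an alerting system...
            getLogger().info("{} {} {}: {}",
                    b.getTimestamp(), b.getLevel(), b.getSourceName(), b.getMessage());
        }
    }
}
```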

Using ServiceStack.Redis with RedisCloud

I'm using RedisCloud as a datastore for a ServiceStack-based, AppHarbor-hosted app.
The RedisCloud .net client documentation states to not use the ServiceStack.Redis connection managers:
Note: the ServiceStack.Redis client connection managers (BasicRedisClientManager and PooledRedisClientManager) should be disabled when working with the Garantia Data Redis Cloud. Use the single DNS provided upon DB creation to access your Redis DB. The Garantia Data Redis Cloud distributes your dataset across multiple shards and efficiently balances the load between these shards.
Why would they suggest that? Because they are doing fancy load balancing stuff in their 'Garantia Data' layer and don't want to handle unnecessary connections? The RedisClient class is not thread-safe, which makes this much more difficult from the application programming perspective.
Should I just ignore their instructions and use a PooledRedisClientManager? How would I configure it with the single uri that RedisCloud provides?
Or will I need to write a basic RedisClient pool wrapper that just creates new RedisClient connections as needed to handle concurrent access (i.e. one that ignores all read/write pooling specifics, hopefully delegating all of that upstream to the RedisCloud layer)?
Why would they suggest that? Because they are doing fancy load balancing stuff in their 'Garantia Data' layer and don't want to handle unnecessary connections?
I think you could be right. To my knowledge these classes simply wrap creating/retrieving instances of RedisClient (though I think the Basic manager always creates a new RedisClient). While looking over their site, I didn't see anything about a maximum number of connections to the Redis server(s). The previous Redis vendor on AppHarbor (MyRedis) had plans that listed the maximum number of connections allowed per plan. However, I also didn't see anything on the RedisCloud site mentioning connection limits/handling.
Should I just ignore their instructions and use a PooledRedisClientManager? How would I configure it with the single uri that RedisCloud provides?
Well, if you do ignore their instructions, my guess is you could eventually run into a 'max number of connections exceeded' error, which would make it difficult to reach your Redis server(s). I think you could still use the BasicRedisClientManager, because when you call GetClient() it always news up a RedisClient in the same way shown in their example.

ActiveMQ: get count of messages consumed/produced per second

Is there any way in ActiveMQ to get the number of messages consumed/produced per second/minute at the broker end?
I have tried the JMeter configuration from http://activemq.apache.org/jmeter-performance-tests.html but there are hardly any performance metrics I can gather.
If you want to write this yourself, you should use JMX on your broker. The Broker MBean has "TotalEnqueueCount" and "TotalDequeueCount" attributes. You can poll those values at specific intervals and calculate for yourself how many messages per second/minute/hour your broker is being produced to or consumed from.
You'll need to make sure you have JMX set up on the broker side, of course. See here for more details on that: http://activemq.apache.org/jmx.html
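A hedged sketch of that polling loop using the standard javax.management APIs follows. The JMX service URL and broker ObjectName are common ActiveMQ 5.x defaults and may need adjusting for your broker name and port.

```java
import java.util.concurrent.TimeUnit;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: sample the broker's total enqueue/dequeue counters and derive a
// per-second rate. The JMX URL and ObjectName are ActiveMQ 5.x defaults.
public class BrokerRateSampler {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName broker =
                    new ObjectName("org.apache.activemq:type=Broker,brokerName=localhost");

            long prevEnq = (Long) mbs.getAttribute(broker, "TotalEnqueueCount");
            long prevDeq = (Long) mbs.getAttribute(broker, "TotalDequeueCount");

            while (true) {
                TimeUnit.SECONDS.sleep(10);
                long enq = (Long) mbs.getAttribute(broker, "TotalEnqueueCount");
                long deq = (Long) mbs.getAttribute(broker, "TotalDequeueCount");
                System.out.printf("produced %.1f msg/s, consumed %.1f msg/s%n",
                        (enq - prevEnq) / 10.0, (deq - prevDeq) / 10.0);
                prevEnq = enq;
                prevDeq = deq;
            }
        }
    }
}
```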
To simply view total enqueue/dequeue stats, use jconsole or the web console.
If you need to process it further (to calculate rates, etc.), then you should do one of the following:
- access stats programmatically using the Java JMX APIs and gather/process over time
- use a third-party tool for monitoring (Cacti and Splunk can also help with this)
- another option is to use Camel DataSet to simulate data routing and gather stats