How to write to different HBase tables in Apache Flume

I have configured Apache Flume to receive messages (JSON) through an HTTP source. My sinks are MongoDB and HBase.
How can I route each message to a different collection or table based on a specified field?
For example, let's assume we have tables T_1 and T_2, and an incoming message that should be saved in T_1. How can I handle such messages and decide where each one gets saved?

Try using the Multiplexing Channel Selector. The default one (the Replicating Channel Selector) copies the Flume event produced by the source to all of its configured channels. The multiplexing one, in contrast, can put the event into a specific channel depending on the value of a header within the Flume event.
In order to create such a header according to your application logic, you will need to create a custom handler for the HTTPSource. This can easily be done by implementing the HTTPSourceHandler interface of the API.
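For reference, plugging a custom handler into the HTTP source is just configuration; a minimal sketch, where the port and the handler class name are placeholders for your own values and implementation:
agent.sources.s1.type=http
agent.sources.s1.port=8080
agent.sources.s1.handler=com.example.MyJsonRoutingHandler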

You can use a regex interceptor to tag the message type, plus multiplexing to send it to the right destination.
For example, routing based on the string "TEST1":
Extract the string/field with a regex interceptor:
agent.sources.s1.interceptors.i1.type=regex_extractor
agent.sources.s1.interceptors.i1.regex=(TEST1)
Assign the interceptor match to serializer SE1, which writes it to a header named Test:
agent.sources.s1.interceptors.i1.serializers=SE1
agent.sources.s1.interceptors.i1.serializers.SE1.name=Test
Route on that header to the required channel; the channels (c1, c2) can be mapped to different sinks. Events whose Test header value is TEST1 go to channel c1:
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=Test
agent.sources.s1.selector.mapping.TEST1=c1
All events matching the regex go to channel c1; everything else defaults to c2:
agent.sources.s1.selector.default=c2
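To complete the picture, a rough, untested sketch of how those channels could be wired to the two sinks; the channel, sink, and table names are placeholders, and the MongoDB sink class depends on which third-party sink you use:
agent.channels=c1 c2
agent.channels.c1.type=memory
agent.channels.c2.type=memory
agent.sources.s1.channels=c1 c2
agent.sinks=hbaseSink mongoSink
agent.sinks.hbaseSink.channel=c1
agent.sinks.hbaseSink.type=hbase
agent.sinks.hbaseSink.table=T_1
agent.sinks.hbaseSink.columnFamily=cf
agent.sinks.mongoSink.channel=c2
agent.sinks.mongoSink.type=<your MongoDB sink class>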

Related

Ingest data into Warp 10 - performance tip

We're looking for the best way to ingest data into Warp 10. We have a microservices architecture that mainly uses Kafka.
Two solutions:
Use Ingress endpoint as defined here: https://www.warp10.io/content/03_Documentation/03_Interacting_with_Warp_10/03_Ingesting_data/01_Ingress (This is the solution we use for now)
Use the warp10 Kafka plugin as defined here: https://blog.senx.io/introducing-the-warp-10-kafka-plugin/
As noted above, we currently use the Ingress solution: we aggregate data for x seconds and call the Ingress API to send the data per packet (instead of calling the API each time we need to insert something).
For a few days we have been experimenting with the Kafka plugin. We successfully set up the plugin and created an .mc2 script responsible for consuming data from a given topic and inserting it into Warp 10 using UPDATE.
Questions:
Using the Kafka plugin, would it be better to apply the same buffering mechanism as the one we apply when using the Ingress endpoint? Or does the Warp 10 Kafka plugin have a specific implementation that allows consuming the topic message by message and calling UPDATE for each one?
Today, as both solutions work, we are trying to find the differences so we get the best ingestion performance, and if possible without any buffering mechanism, because we are trying to stay as close to real time as possible.
MC2 file:
{
  'topics' [ 'our_topic_name' ] // List of Kafka topics to subscribe to
  'parallelism' 1 // Number of threads to start for processing the incoming messages. Each thread will handle a certain number of partitions.
  'config' { // Map of Kafka consumer parameters
    'bootstrap.servers' 'kafka-headless:9092'
    'group.id' 'senx-consumer'
    'enable.auto.commit' 'true'
  }
  'macro' <%
    // macro executed each time a kafka record is consumed
    /*
    // received record format :
    {
      'timestamp' 123 // The record timestamp
      'timestampType' 'type' // The type of timestamp, can be one of 'NoTimestampType', 'CreateTime', 'LogAppendTime'
      'topic' 'topic_name' // Name of the topic which received the message
      'offset' 123 // Offset of the message in 'topic'
      'partition' 123 // Id of the partition which received the message
      'key' ... // Byte array of the message key
      'value' ... // Byte array of the message value
      'headers' { } // Map of message headers
    }
    */
    "recordArray" STORE
    "preprod.write" "token" STORE
    // macro can be called on timeout with an empty entry map
    $recordArray SIZE 0 !=
    <%
      $recordArray 'value' GET // kafka record value is retrieved in bytes
      'UTF-8' BYTES-> // convert bytes to string (WARP10 INGRESS format)
      JSON->
      "value" STORE
      "Records received through Kafka" LOGMSG
      $value LOGMSG
      $value
      <%
        DROP
        PARSE
        // PARSE outputs a gtsList, including only one gts
        0 GET
        // GTS rename is required to use UPDATE function
        "gts" STORE
        $gts $gts NAME RENAME
      %>
      LMAP
      // Store GTS in Warp 10
      $token
      UPDATE
    %>
    IFT
  %> // end macro
  'timeout' 10000 // Polling timeout (in ms), if no message is received within this delay, the macro will be called with an empty map as input
}
If you want to cache something in Warp 10 to avoid lots of UPDATE calls per second, you can use SHM (SHared Memory). This is a built-in extension you need to activate.
Once activated, use it with SHMSTORE and SHMLOAD to keep objects in RAM between two WarpScript executions.
In your example, you can push all the incoming GTS into a list, or a list of lists of GTS, using +! to append elements to an existing list.
A runner can then MERGE all the GTS in the cache (by name + labels) and UPDATE them into the database (don't forget to use a MUTEX).
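For illustration, a rough, untested WarpScript sketch of that pattern. The key name 'gts.buffer', the mutex name 'gts.buffer.mutex', the write token, and $gtsList (standing for the list produced by LMAP in the macro above) are placeholders; it assumes the SHM extension is activated and the key was initialised to an empty list, and the exact SHMSTORE/SHMLOAD argument order should be checked against the extension documentation:
// In the Kafka macro, buffer the GTS in shared memory instead of calling UPDATE
<%
  'gts.buffer' SHMLOAD   // load the shared buffer list
  $gtsList +!            // append this record's GTS list in place
  'gts.buffer' SHMSTORE  // store it back
%> 'gts.buffer.mutex' MUTEX

// In a runner executed every few seconds
<%
  'gts.buffer' SHMLOAD FLATTEN  // flatten the list of lists of GTS
  [] 'gts.buffer' SHMSTORE      // reset the buffer
%> 'gts.buffer.mutex' MUTEX
MERGE                           // merge data points per classname + labels
'YOUR_WRITE_TOKEN' UPDATE       // one UPDATE call for the whole batch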
Don't forget the total operation cost:
The ingress format can be optimized for ingestion speed if you do not repeat the classname and labels and if you gather lines per GTS (an example follows below).
PARSE deserializes data from the Warp 10 ingress format.
UPDATE serializes data to the Warp 10 optimized ingress format (and pushes it to the update endpoint).
The update endpoint deserializes it again.
It makes sense to do these deserialize/serialize/deserialize operations if your input data is far from the optimal ingress format. It also makes sense if you want to RANGECOMPACT your data to save disk space, or do any preprocessing.
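For illustration, here is what the optimized ingress format looks like: the classname and labels appear once, and subsequent data points of the same GTS start with '=' (the series name, labels, and timestamps below are made up):
1597161600000000// room.temperature{building=A,floor=2} 21.5
=1597161660000000// 21.7
=1597161720000000// 21.9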

How to pass an attribute over ActiveMQ in Mule 4

We are migrating from Mule 3 to Mule 4, and in one of our functionalities we need to publish messages to a topic; downstream, another Mule component consumes from a queue that is bridged to the topic.
Nothing special here.
To ensure we can trace the flow via logs, we were sending a 'TrackingId' attribute while publishing messages to the topic (Mule 3):
message.setOutboundProperty("XYZ_TrackingID", flowVars['idFromUI']);
return payload;
However, when I try the same in Mule 4, I get the following exception:
ERROR 2020-12-20 10:09:12,214 [[MuleRuntime].cpuIntensive.14: [mycomponent].my_Flow.CPU_INTENSIVE
#66024695] org.mule.runtime.core.internal.exception.OnErrorPropagateHandler:
Message : groovy.lang.MissingMethodException: No signature of method:
org.mule.runtime.api.el.BindingContextUtils$MessageWrapper.setOutboundProperty() is applicable for
argument types: (java.lang.String, org.mule.weave.v2.el.ByteArrayBasedCursorStream) values:
[XYZ_TrackingID, "1234567"].\nError type : (set debug level logging or '-
Dmule.verbose.exceptions=true' for
everything)\n********************************************************************************
I checked the Internet, and it seems that in Mule 4 setting outbound properties has been removed.
So how do I achieve the same in Mule 4?
Don't even try to do that, for several reasons. For one, the message structure is different, so outbound properties don't exist anymore and that method doesn't exist either. Also, in Mule 4, components like the Groovy component can only return a value and cannot change the event; they cannot decide what that value will be assigned to. You can set the target in the configuration (the payload or a variable), but you cannot change the attributes. Note that variables in Mule 4 are referenced with vars., not flowVars. as in Mule 3 (i.e. vars.idFromUI).
There is a simpler way to set message properties in the Mule 4 JMS connector. Use the properties element and pass it an object with the properties.
For example it could be something like this:
<jms:publish config-ref="JMS_config" destination="${bridgeDestination}" destinationType="TOPIC">
    <jms:message>
        <jms:body>#["bridged_" ++ payload]</jms:body>
        <jms:properties>#[{
            XYZ_TrackingID: vars.idFromUI
        }]</jms:properties>
    </jms:message>
</jms:publish>
It is in the documentation: https://docs.mulesoft.com/jms-connector/1.0/jms-publish#setting-user-properties. I adapted my example from there.
I am not sure if the correlation ID serves the purpose of a tracking ID for your scenario, but you can pass one as shown below. It's in the Mule documentation:
https://docs.mulesoft.com/jms-connector/1.7/jms-publish
<jms:publish config-ref="JMS_config" sendCorrelationId="ALWAYS" destination="#[attributes.headers.replyTo.destination]">
    <jms:message correlationId="#[attributes.headers.correlationId]"/>
</jms:publish>
If your priority is to customise the tracking ID you publish, then try the format below. The key names may differ depending on your use case.
<jms:publish config-ref="JMS_config" destination="${bridgeDestination}" destinationType="TOPIC">
    <jms:message>
        <jms:body>#["bridged_" ++ payload]</jms:body>
        <jms:properties>#[{
            AUTH_TYPE: 'jwt',
            AUTH_TOKEN: attributes.queryParams.token
        }]</jms:properties>
    </jms:message>
</jms:publish>
In the above, the expression attributes.queryParams.token accesses a 'token' query parameter that the API received earlier through an HTTP Listener or Requester, and passes it to JMS as a property under the AUTH_TOKEN key.
However, attributes.headers.correlationId is a header. Both queryParams and headers are part of attributes in Mule 4.

mediasoup - mismatch between payload types

I'm trying to use mediasoup to forward RTP streams with room.createRtpStreamer.
My problem is that the payload type (for OPUS) I get from producer.rtpParameters.codecs[i].payloadType is 111, while the one I see on the actual RTP packets is 100 (seen in Wireshark).
I tried to set preferredPayloadType in my server's config, but it seems to make no difference.
Note: if I hardcode 100 as the payload type for the OPUS stream, I can view/hear the stream using ffplay.
I'm using Chrome 55 (latest) and mediasoup 2.0.5 (latest).
Any help will be appreciated.
The Producer has the RTP parameters decided by the client (browser), so the PT of OPUS is 111 (the default value generated by Chrome).
But, once in the mediasoup server, the Consumers associated with that Producer use the RTP parameters given during room creation. So, if the codecs given to room = new server.Room(codecs) [1] have a preferredPayloadType field, that value will be used within the Consumers (otherwise it will be randomly chosen by the server).
So, when you call room.createRtpStreamer() you provide a Producer, and the generated RtpStreamer [2] has an associated Consumer and PlainRtpTransport. You should therefore read rtpStreamer.consumer.rtpParameters rather than the producer's.
[1] https://mediasoup.org/documentation/mediasoup/api/#server-Room
[2] https://mediasoup.org/documentation/mediasoup/api/#RtpStreamer
You should have a look at the SDP of the call setup message and check whether you get 111 or 100 for the OPUS payload.
From there you can decide which part has the bug (Chrome or mediasoup).
In the call setup message (initiating the call), check the payload type of the OPUS codec.
The called party should respond with the same payload number if it accepts OPUS and then both parties should use the same payload number in the RTP packets.
So I found that the payload type I get from producer.rtpParameters.codecs[i].payloadType was the original one, and that room.createRtpStreamer changes the payload type.
I ended up doing the following to resolve the issue:
// get the payload type from room.rtpCapabilities.codecs' preferredPayloadType for the specific codec
let payload = this.room.rtpCapabilities.codecs.find((c) => {
    return c.name === producer.rtpParameters.codecs[i].name;
}).preferredPayloadType;

PySpark: how to stream data from a given API URL

I was given an API URL and a method getUserPosts() which returns the data needed by my data processing function. I am able to get the data using Client from suds.client as follows:
from suds.client import Client
from suds.xsd.doctor import ImportDoctor, Import
url = 'url'
imp = Import('http://schemas.xmlsoap.org/soap/encoding/')
imp.filter.add('filter')
d = ImportDoctor(imp)
client = Client(url, doctor=d)
tempResult = client.service.getUserPosts(user_ids = '',date_from='2016-07-01 03:19:57', date_to='2016-08-01 03:19:57', limit=100, offset=0)
Now, each tempResult contains 100 records. I want to stream the data from the given API URL into an RDD for parallelized processing. However, after reading the pyspark.streaming documentation, I can't find a streaming method for a custom data source. Could anyone give me an idea of how to do so?
Thank you.
After digging for a while, I found out how to solve the problem: I used Kafka streaming. Basically, you create a producer that reads from the given API and publishes to a Kafka topic on a given broker, and then a consumer that listens to that topic and streams the data into Spark.
Note that the producer and consumer must run as separate threads in order to achieve (near) real-time streaming.
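A minimal sketch of that producer/consumer split, assuming the kafka-python package, a broker at localhost:9092, a topic named 'user_posts', and the pre-Spark-3 pyspark.streaming.kafka module; the pagination logic and all names are illustrative, and 'client' is the suds Client built in the question above:
import threading
from kafka import KafkaProducer
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def produce_posts():
    # Poll the SOAP API in pages of 100 and push each record to Kafka as a UTF-8 string.
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    offset = 0
    while True:
        batch = client.service.getUserPosts(user_ids='',
                                            date_from='2016-07-01 03:19:57',
                                            date_to='2016-08-01 03:19:57',
                                            limit=100, offset=offset)
        if not batch:
            break
        for record in batch:
            producer.send('user_posts', str(record).encode('utf-8'))
        offset += 100
    producer.flush()

# The producer runs in its own thread so the consumer can stream in parallel.
threading.Thread(target=produce_posts, daemon=True).start()

# Consumer side: read the topic as a DStream and process each micro-batch.
sc = SparkContext(appName='user-posts-streaming')
ssc = StreamingContext(sc, 5)  # 5-second micro-batches
stream = KafkaUtils.createDirectStream(ssc, ['user_posts'],
                                       {'metadata.broker.list': 'localhost:9092'})
stream.map(lambda kv: kv[1]).pprint()  # kv = (key, value); value is the record string
ssc.start()
ssc.awaitTermination()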

Failed to respond to incoming message using data source row correlation

I am totally new to Parasoft Virtualize. I created a virtual asset and added three fields to my data source correlation; my request XML has four fields. I am getting this error after processing the request:
<Reason>Failed to respond to incoming message using data source row correlation</Reason>
<Details>Values in incoming message did not match values in the data source "GetSubscriptionOperationsRequest"</Details>
Any suggestions on what might be the problem here?
The message you are getting explains your problem:
the virtual asset cannot correlate your request with data in your data source to build the response.
Try sending a request with one of the values from the column used for the data source correlation in your responder.
That value should be in the request field used by the correlation.
You could also add a catch-all responder to check whether the correlation between incoming requests and your responder is correct.
You can also ask Parasoft's technical support for help.