Read RabbitMQ from Beam/Dataflow

I'm trying to read from a RabbitMQ queue from Beam/Dataflow in a streaming fashion (so that it keeps running indefinitely).
The minimal example code I'm trying to run is:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.rabbitmq.RabbitMqIO;
import org.apache.beam.sdk.io.rabbitmq.RabbitMqMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class RabbitMqTest {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();

        final String serverUri = "amqp://guest:guest@localhost:5672";

        pipeline
            .apply("Read RabbitMQ message", RabbitMqIO.read().withUri(serverUri).withQueue("my_queue"))
            .apply(ParDo.of(new DoFn<RabbitMqMessage, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    String message = new String(c.element().getBody());
                    System.out.println(message);
                    c.output(message);
                }
            }));

        pipeline.run().waitUntilFinish();
    }
}
However it crashes with:
Exception in thread "main" java.lang.NullPointerException
at org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.processElement(UnboundedReadEvaluatorFactory.java:169)
at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:160)
at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:124)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This happens if I don't pass withMaxReadTime() to RabbitMqIO.
If I do pass withMaxReadTime(), it blocks for X seconds, then processes any messages that arrived during that time, and then quits.
How do I set up a streaming flow that keeps reading from RabbitMQ indefinitely?

I had a similar issue with a Dataflow pipeline. When I tried to run it in Dataflow I got:
java.lang.NullPointerException
org.apache.beam.runners.dataflow.worker.WindmillTimeUtils.harnessToWindmillTimestamp(WindmillTimeUtils.java:58)
org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:400)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1230)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:143)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:967)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
The problem was that RabbitMqIO uses the timestamp of messages coming from RabbitMQ, e.g. for the watermark. It turned out that in my case messages from RabbitMQ didn't have a timestamp set (it is not set by default in RabbitMQ), so it was null. I fixed that by preparing a patch for the relevant classes in Apache Beam. I made a change in the RabbitMqMessage constructor; it now looks like this:
public RabbitMqMessage(String routingKey, QueueingConsumer.Delivery delivery) {
    this.routingKey = routingKey;
    body = delivery.getBody();
    contentType = delivery.getProperties().getContentType();
    contentEncoding = delivery.getProperties().getContentEncoding();
    headers = delivery.getProperties().getHeaders();
    deliveryMode = delivery.getProperties().getDeliveryMode();
    priority = delivery.getProperties().getPriority();
    correlationId = delivery.getProperties().getCorrelationId();
    replyTo = delivery.getProperties().getReplyTo();
    expiration = delivery.getProperties().getExpiration();
    messageId = delivery.getProperties().getMessageId();
    /*
     *** IMPORTANT ***
     Sometimes timestamp in RabbitMq message properties is 'null'. `RabbitMqIO` uses that value as
     watermark, when it is `null` it causes exceptions, 'null' has to be replaced with some value, in this case current time
    */
    // timestamp = delivery.getProperties().getTimestamp();
    timestamp = delivery.getProperties().getTimestamp() == null ? new Date() : delivery.getProperties().getTimestamp();
    type = delivery.getProperties().getType();
    userId = delivery.getProperties().getUserId();
    appId = delivery.getProperties().getAppId();
    clusterId = delivery.getProperties().getClusterId();
}
and I had to change the advance() method in RabbitMqIO so that it does not use the timestamp property, which could be null:
@Override
public boolean advance() throws IOException {
    try {
        QueueingConsumer.Delivery delivery = consumer.nextDelivery(1000);
        if (delivery == null) {
            return false;
        }
        if (source.spec.useCorrelationId()) {
            String correlationId = delivery.getProperties().getCorrelationId();
            if (correlationId == null) {
                throw new IOException(
                    "RabbitMqIO.Read uses message correlation ID, but received "
                        + "message has a null correlation ID");
            }
            currentRecordId = correlationId.getBytes(StandardCharsets.UTF_8);
        }
        long deliveryTag = delivery.getEnvelope().getDeliveryTag();
        checkpointMark.sessionIds.add(deliveryTag);
        current = new RabbitMqMessage(source.spec.routingKey(), delivery);
        /*
         *** IMPORTANT ***
         Sometimes timestamp in RabbitMq messages is 'null', the stream in Dataflow fails because
         the watermark is based on that value, 'null' has to be replaced with some value. `RabbitMqMessage` was changed
         to use `new Date()` in this situation and now the timestamp can be taken from it
        */
        // currentTimestamp = new Instant(delivery.getProperties().getTimestamp());
        currentTimestamp = new Instant(current.getTimestamp());
        if (currentTimestamp.isBefore(checkpointMark.oldestTimestamp)) {
            checkpointMark.oldestTimestamp = currentTimestamp;
        }
    } catch (Exception e) {
        throw new IOException(e);
    }
    return true;
}
After running my pipeline again I got this exception again in another place. This time it was caused by the missing default value for the oldestTimestamp property in RabbitMQCheckpointMark. I made the next change, and now RabbitMQCheckpointMark looks like this:
private static class RabbitMQCheckpointMark
        implements UnboundedSource.CheckpointMark, Serializable {
    transient Channel channel;

    /*
     *** IMPORTANT *** it should be initialized with some value, because without it the runner (e.g. Dataflow) fails with 'NullPointerException'
     Example error:
     java.lang.NullPointerException
     org.apache.beam.runners.dataflow.worker.WindmillTimeUtils.harnessToWindmillTimestamp(WindmillTimeUtils.java:58)
     org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:400)
     org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1230)
     org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:143)
     org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:967)
     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     java.lang.Thread.run(Thread.java:745)
    */
    Instant oldestTimestamp = new Instant(Long.MIN_VALUE);
    final List<Long> sessionIds = new ArrayList<>();

    @Override
    public void finalizeCheckpoint() throws IOException {
        for (Long sessionId : sessionIds) {
            channel.basicAck(sessionId, false);
        }
        channel.txCommit();
        oldestTimestamp = Instant.now();
        sessionIds.clear();
    }
}
All those changes fixed my pipeline and now it works as expected. I hope you will find it useful.
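If patching Beam is not an option and you control the publisher, another way to avoid the null timestamp is to make sure every published message carries the AMQP timestamp property. The following is only a minimal sketch using the plain RabbitMQ Java client; the broker URI and queue name are assumptions taken from the question, and it assumes the queue already exists:

import java.nio.charset.StandardCharsets;
import java.util.Date;

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class TimestampedPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setUri("amqp://guest:guest@localhost:5672"); // assumed broker URI
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        // Set the AMQP timestamp property explicitly so consumers that derive
        // a watermark from it (like RabbitMqIO) never see null.
        AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                .timestamp(new Date())
                .build();
        channel.basicPublish("", "my_queue", props, "hello".getBytes(StandardCharsets.UTF_8)); // assumed queue name
        channel.close();
        connection.close();
    }
}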

This was a bug in the IO that has been fixed recently.

Related

RabbitMQ Camel Consumer - Consume a single message

I have a scenario where I want to "pull" messages off a RabbitMQ queue/topic and process them one at a time.
Specifically if there are already messages sitting on the queue when the consumer starts up.
I have tried the following with no success (meaning, each of these options reads the queue until it is either empty or until another thread closes the context).
1. Stopping the route immediately after the first message is processed
final CamelContext context = new DefaultCamelContext();
try {
    context.addRoutes(new RouteBuilder() {
        @Override
        public void configure() throws Exception {
            RouteDefinition route = from("rabbitmq:harley?queue=IN&declare=false&autoDelete=false&hostname=localhost&portNumber=5672");
            route.process(new Processor() {
                Thread stopThread;

                @Override
                public void process(final Exchange exchange) throws Exception {
                    String name = exchange.getIn().getHeader(Exchange.FILE_NAME_ONLY, String.class);
                    String body = exchange.getIn().getBody(String.class);
                    // Do some stuff
                    routeComplete[0] = true;
                    if (stopThread == null) {
                        stopThread = new Thread() {
                            @Override
                            public void run() {
                                try {
                                    ((DefaultCamelContext) exchange.getContext()).stopRoute("RabbitRoute");
                                } catch (Exception e) {}
                            }
                        };
                    }
                    stopThread.start();
                }
            });
        }
    });
    context.start();
    while (!routeComplete[0].booleanValue())
        Thread.sleep(100);
    context.stop();
}
2. Similar to 1, but using a latch rather than a while loop and sleep.
3. Using a PollingConsumer:
final CamelContext context = new DefaultCamelContext();
context.start();
Endpoint re = context.getEndpoint(srcRoute);
re.start();
try {
    PollingConsumer consumer = re.createPollingConsumer();
    consumer.start();
    Exchange exchange = consumer.receive();
    String bb = exchange.getIn().getBody(String.class);
    consumer.stop();
} catch (Exception e) {
    String mm = e.getMessage();
}
4. Using a ConsumerTemplate - code similar to the above.
I have also tried enabling preFetch and setting the max number of exchanges to 1.
None of these appear to work: if there are 3 messages on the queue, all are read before I am able to stop the route.
If I were to use the standard RabbitMQ Java API I would use a basicGet() call which lets me read a single message, but for other reasons I would prefer to use a Camel consumer.
Has anyone successfully been able to process a single message on a queue that holds multiple messages using a Camel RabbitMQ Consumer?
Thanks.
This is not the primary intention of the component, as it is designed for continuous receiving. But I have created a ticket to look into supporting a basicGet (single receive). There is a new Spring-based RabbitMQ component coming in 3.8 onwards, so it is going to be implemented there (first): https://issues.apache.org/jira/browse/CAMEL-16048
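Until that is available, here is a minimal sketch of the fallback the question mentions: pulling exactly one message with the plain RabbitMQ Java client's basicGet() instead of a Camel consumer. The host and queue name ("IN", as in the route above) are assumptions:

import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;

public class SingleMessagePuller {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker host
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        try {
            // basicGet pulls at most one message; autoAck=false so the message
            // is only acknowledged after processing succeeds.
            GetResponse response = channel.basicGet("IN", false);
            if (response != null) {
                String body = new String(response.getBody(), StandardCharsets.UTF_8);
                // ... process the single message here ...
                channel.basicAck(response.getEnvelope().getDeliveryTag(), false);
            }
        } finally {
            channel.close();
            connection.close();
        }
    }
}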

Processing message from rabbitmq at specified rate

We have been trying to make a listener read messages from RabbitMQ at a certain rate, 1 msg / 2 seconds. We did not find any such utility in RabbitMQ so far, so we thought of doing this with a DB, i.e. the listener reads the messages and stores them in a DB, and later a scheduler processes them from the DB at the desired rate. If there is a better way of doing this, please suggest. We are developing our application in Spring. Thanks in advance.
You can't do it with a listener, but you can do it with a RabbitTemplate ...
@SpringBootApplication
public class So40446967Application {

    public static void main(String[] args) throws Exception {
        ConfigurableApplicationContext context = SpringApplication.run(So40446967Application.class, args);
        RabbitAdmin admin = context.getBean(RabbitAdmin.class);
        AnonymousQueue queue = new AnonymousQueue();
        admin.declareQueue(queue);
        RabbitTemplate template = context.getBean(RabbitTemplate.class);
        for (int i = 0; i < 10; i++) {
            template.convertAndSend(queue.getName(), "foo" + i);
        }
        String out = (String) template.receiveAndConvert(queue.getName());
        while (out != null) {
            System.out.println(new Date() + " " + out);
            Thread.sleep(2000);
            out = (String) template.receiveAndConvert(queue.getName());
        }
        context.close();
    }
}
Of course you can use something more sophisticated like a task scheduler or a Spring @Async method rather than sleeping.
Inspired by Gary Russell's answer:
you can use something more sophisticated like a task scheduler or a Spring @Async
You can also fetch a fixed number of messages per minute and simulate the same rate limit:
private final RabbitTemplate rabbitTemplate;

@Scheduled(fixedDelay = 60000) // 1 minute
public void read() {
    List<String> messages = new ArrayList<>();
    String message = getMessageFromQueue();
    while (message != null && messages.size() < 30) { // 30 messages in 1 minute = 1 msg / 2 seconds
        messages.add(message);
        message = getMessageFromQueue();
    }
    // process the collected batch of messages here
}

public String getMessageFromQueue() {
    return (String) rabbitTemplate.receiveAndConvert(QUEUE_NAME);
}

Timeout of basicPublish when server is outofspace

In my case the RabbitMQ server ran out of disk space, as shown below:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/ramonubuntu--vg-root 6299376 5956336 0 100% /
The producer publishes a message to the server (the message needs to be persisted) and then blocks forever, waiting for the response to the publish. Sure, we should avoid letting the server run out of space, but is there any timeout mechanism to let the producer quit waiting?
I have tried heartbeat and SO_TIMEOUT; neither works, as the network itself is fine. Below is my producer.
protected void publish(byte[] message) throws Exception {
    // ConnectionFactory can be reused between threads.
    ConnectionFactory factory = new SoTimeoutConnectionFactory();
    factory.setHost(this.getHost());
    factory.setVirtualHost("te");
    factory.setPort(5672);
    factory.setUsername("amqp");
    factory.setPassword("amqp");
    factory.setConnectionTimeout(10 * 1000);
    // doesn't help if server got out of space
    factory.setRequestedHeartbeat(1);
    final Connection connection = factory.newConnection();
    Channel channel = connection.createChannel();
    // declare a 'topic' type of exchange
    channel.exchangeDeclare(this.exchangeName, "topic", true);
    channel.addReturnListener(new ReturnListener() {
        @Override
        public void handleReturn(int replyCode, String replyText, String exchange, String routingKey,
                AMQP.BasicProperties properties, byte[] body) throws IOException {
            logger.warn("[X]Returned message(replyCode:" + replyCode + ",replyText:" + replyText
                    + ",exchange:" + exchange + ",routingKey:" + routingKey + ",body:" + new String(body));
        }
    });
    channel.confirmSelect();
    channel.addConfirmListener(new ConfirmListener() {
        @Override
        public void handleAck(long deliveryTag, boolean multiple) throws IOException {
            logger.info("Ack: " + deliveryTag);
            // RabbitMessagePublishMain.this.release(connection);
        }

        @Override
        public void handleNack(long deliveryTag, boolean multiple) throws IOException {
            logger.info("Nack: " + deliveryTag);
            // RabbitMessagePublishMain.this.release(connection);
        }
    });
    channel.basicPublish(this.exchangeName, RabbitMessageConsumerMain.EXCHANGE_NAME + ".-1", true,
            MessageProperties.PERSISTENT_BASIC, message);
    channel.waitForConfirmsOrDie(10 * 1000);
    // now we can close connection
    connection.close();
}
It blocks at channel.waitForConfirmsOrDie(10 * 1000). Here is the SoTimeoutConnectionFactory:
public class SoTimeoutConnectionFactory extends ConnectionFactory {

    @Override
    protected void configureSocket(Socket socket) throws IOException {
        super.configureSocket(socket);
        socket.setSoTimeout(10 * 1000);
    }
}
I also captured the network traffic between the producer and RabbitMQ.
Please help.
You need to implement connection blocked/unblocked notifications.
This is basically a way of notifying the publisher that the server is running out of resources. The advantage is that the publisher will also be notified once it is safe to publish again.
I would recommend that you take a look at this article. A simple way of implementing this is to have a flag that indicates whether it is safe to publish; if it is not, wait until it is.
As an example, you can take a look at how I implemented this in one of my Python examples.
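For a Java producer like the one in the question, the equivalent hook is the client's BlockedListener. The following is only a sketch of the flag-based approach described above; the host and class structure are assumptions:

import java.util.concurrent.atomic.AtomicBoolean;

import com.rabbitmq.client.BlockedListener;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class BlockAwarePublisher {
    private final AtomicBoolean blocked = new AtomicBoolean(false);

    public Connection connect() throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker host
        Connection connection = factory.newConnection();
        connection.addBlockedListener(new BlockedListener() {
            @Override
            public void handleBlocked(String reason) {
                // Broker is low on resources (e.g. disk space); stop publishing.
                blocked.set(true);
            }

            @Override
            public void handleUnblocked() {
                // Broker recovered; it is safe to publish again.
                blocked.set(false);
            }
        });
        return connection;
    }

    // Check this flag (or wait on it) before each basicPublish call.
    public boolean isSafeToPublish() {
        return !blocked.get();
    }
}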

Hadoop RPC server doesn't stop

I was trying to create a simple parent-child process setup with IPC between them using Hadoop IPC. It turns out that the program executes and prints the results, but it doesn't exit. Here is the code for it.
interface Protocol extends VersionedProtocol {
    public static final long versionID = 1L;
    IntWritable getInput();
}

public final class JavaProcess implements Protocol {
    Server server;

    public JavaProcess() {
        String rpcAddr = "localhost";
        int rpcPort = 8989;
        Configuration conf = new Configuration();
        try {
            server = RPC.getServer(this, rpcAddr, rpcPort, conf);
            server.start();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public int exec(Class klass) throws IOException, InterruptedException {
        String javaHome = System.getProperty("java.home");
        String javaBin = javaHome +
                File.separator + "bin" +
                File.separator + "java";
        String classpath = System.getProperty("java.class.path");
        String className = klass.getCanonicalName();
        ProcessBuilder builder = new ProcessBuilder(
                javaBin, "-cp", classpath, className);
        Process process = builder.start();
        int exit_code = process.waitFor();
        server.stop();
        System.out.println("completed process");
        return exit_code;
    }

    public static void main(String... args) throws IOException, InterruptedException {
        int status = new JavaProcess().exec(JavaProcessChild.class);
        System.out.println(status);
    }

    @Override
    public IntWritable getInput() {
        return new IntWritable(10);
    }

    @Override
    public long getProtocolVersion(String paramString, long paramLong)
            throws IOException {
        return Protocol.versionID;
    }
}
Here is the child process class. However, I have realized that RPC.getServer() on the server side is the culprit. Is this a known Hadoop bug, or am I missing something?
public class JavaProcessChild {

    public static void main(String... args) {
        Protocol umbilical = null;
        try {
            Configuration defaultConf = new Configuration();
            InetSocketAddress addr = new InetSocketAddress("localhost", 8989);
            umbilical = (Protocol) RPC.waitForProxy(Protocol.class, Protocol.versionID,
                    addr, defaultConf);
            IntWritable input = umbilical.getInput();
            JavaProcessChild my = new JavaProcessChild();
            if (input != null && input.equals(new IntWritable(10))) {
                Thread.sleep(10000);
            } else {
                Thread.sleep(1000);
            }
        } catch (Throwable e) {
            e.printStackTrace();
        } finally {
            if (umbilical != null) {
                RPC.stopProxy(umbilical);
            }
        }
    }
}
We sorted that out via mail, but I just want to give my two cents here for the public:
The thread that is not dying there (and thus not letting the main thread finish) is org.apache.hadoop.ipc.Server$Reader.
The reason is that the call to readSelector.select() is not interruptible. If you look closely in a debugger or thread dump, it is waiting on that call forever, even though the main thread has already been cleaned up.
Two possible fixes:
1. Make the reader thread a daemon (not so cool, because the selector won't be cleaned up properly, but the process will end).
2. Explicitly close the "readSelector" from outside when interrupting the thread pool (see the sketch below).
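To illustrate the second fix in isolation (this is plain java.nio, not the actual Hadoop code): waking or closing a Selector from another thread is what releases a thread blocked in select(). A minimal sketch:

import java.nio.channels.Selector;

public class SelectorShutdownDemo {
    public static void main(String[] args) throws Exception {
        final Selector selector = Selector.open();

        Thread reader = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    // Blocks like the Hadoop reader thread until another
                    // thread wakes up or closes the selector.
                    selector.select();
                    System.out.println("select() returned, reader can exit");
                } catch (Exception e) {
                    System.out.println("select() aborted: " + e);
                }
            }
        });
        reader.start();

        Thread.sleep(1000);
        // Waking (or closing) the selector from outside unblocks select().
        selector.wakeup();
        reader.join();
        selector.close();
    }
}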
However, this is a bug in Hadoop and I have no time to look through the JIRAs. Maybe it is already fixed; in YARN the old IPC is replaced by protobuf and thrift anyway.
BTW, this is also platform dependent, since it depends on the selector implementation; I observed these zombies on Debian/Windows systems, but not on Red Hat/Solaris.
If anyone is interested in a patch for Hadoop 1.0, email me. I will sort out the JIRA bug in the near future and edit this with more information. (Maybe it has been fixed in the meantime anyway.)
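If patching Hadoop is not an option, a blunt application-side workaround (my suggestion, not something from the original exchange) is to force the JVM to exit once the child has finished, i.e. the main method from the question with one extra line, so the lingering non-daemon reader thread cannot keep the parent alive:

public static void main(String... args) throws IOException, InterruptedException {
    int status = new JavaProcess().exec(JavaProcessChild.class);
    System.out.println(status);
    // Force the JVM down even if the RPC reader thread is still blocked in select().
    System.exit(status);
}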

Maximum threads issue

To begin with, I checked the discussions regarding this issue and couldn't find an answer to my problem, which is why I'm opening this question.
I've set up a web service using Restlet 2.0.15. The implementation is server-side only. Connections to the server are made through a webpage, and therefore I didn't use ClientResource.
Most of the answers to the thread pool exhaustion problem suggested including exhaust() + release().
The web service's processing can be described as a single function: receive GET requests from the webpage, query the database, frame the results in XML and return the final representation. I used a Filter to override beforeHandle and afterHandle.
The component creation code:
Component component = new Component();
component.getServers().add(Protocol.HTTP, 8188);
component.getContext().getParameters().add("maxThreads", "512");
component.getContext().getParameters().add("minThreads", "100");
component.getContext().getParameters().add("lowThreads", "145");
component.getContext().getParameters().add("maxQueued", "100");
component.getContext().getParameters().add("maxTotalConnections", "100");
component.getContext().getParameters().add("maxIoIdleTimeMs", "100");
component.getDefaultHost().attach("/orcamento2013", new ServerApp());
component.start();
The parameters are the result of a discussion on this forum, plus modifications on my part in an attempt to maximize efficiency.
Coming to the Application, the code is as follows:
@Override
public synchronized Restlet createInboundRoot() {
    // Create a router Restlet that routes each call to a
    // new instance of HelloWorldResource.
    Router router = new Router(getContext());
    // Defines only one route
    router.attach("/{taxes}", ServerImpl.class);
    //router.attach("/acores/{taxes}", ServerImplAcores.class);
    System.out.println(router.getRoutes().size());
    OriginFilter originFilter = new OriginFilter(getContext());
    originFilter.setNext(router);
    return originFilter;
}
I used an example Filter found in a discussion here, too. The implementation is as follows:
public OriginFilter(Context context) {
    super(context);
}

@Override
protected int beforeHandle(Request request, Response response) {
    if (Method.OPTIONS.equals(request.getMethod())) {
        Form requestHeaders = (Form) request.getAttributes().get("org.restlet.http.headers");
        String origin = requestHeaders.getFirstValue("Origin", true);
        Form responseHeaders = (Form) response.getAttributes().get("org.restlet.http.headers");
        if (responseHeaders == null) {
            responseHeaders = new Form();
            response.getAttributes().put("org.restlet.http.headers", responseHeaders);
            responseHeaders.add("Access-Control-Allow-Origin", origin);
            responseHeaders.add("Access-Control-Allow-Methods", "GET,POST,DELETE");
            responseHeaders.add("Access-Control-Allow-Headers", "Content-Type");
            responseHeaders.add("Access-Control-Allow-Credentials", "true");
            response.setEntity(new EmptyRepresentation());
            return SKIP;
        }
    }
    return super.beforeHandle(request, response);
}

@Override
protected void afterHandle(Request request, Response response) {
    if (!Method.OPTIONS.equals(request.getMethod())) {
        Form requestHeaders = (Form) request.getAttributes().get("org.restlet.http.headers");
        String origin = requestHeaders.getFirstValue("Origin", true);
        Form responseHeaders = (Form) response.getAttributes().get("org.restlet.http.headers");
        if (responseHeaders == null) {
            responseHeaders = new Form();
            response.getAttributes().put("org.restlet.http.headers", responseHeaders);
            responseHeaders.add("Access-Control-Allow-Origin", origin);
            responseHeaders.add("Access-Control-Allow-Methods", "GET,POST,DELETE");
            responseHeaders.add("Access-Control-Allow-Headers", "Content-Type");
            responseHeaders.add("Access-Control-Allow-Credentials", "true");
        }
    }
    super.afterHandle(request, response);
    Representation requestRepresentation = request.getEntity();
    if (requestRepresentation != null) {
        try {
            requestRepresentation.exhaust();
        } catch (IOException e) {
            // handle exception
        }
        requestRepresentation.release();
    }
    Representation responseRepresentation = response.getEntity();
    if (responseRepresentation != null) {
        try {
            responseRepresentation.exhaust();
        } catch (IOException ex) {
            Logger.getLogger(OriginFilter.class.getName()).log(Level.SEVERE, null, ex);
        } finally {
        }
    }
}
The responseRepresentation does not have a release() call because it crashes the process with the warning: WARNING: A response with a 200 (Ok) status should have an entity (...)
The code of the ServerResource implementation is the following:
public class ServerImpl extends ServerResource {

    String itemName;

    @Override
    protected void doInit() throws ResourceException {
        this.itemName = (String) getRequest().getAttributes().get("taxes");
    }

    @Get("xml")
    public Representation makeItWork() throws SAXException, IOException {
        DomRepresentation representation = new DomRepresentation(MediaType.TEXT_XML);
        DAL dal = new DAL();
        String ip = getRequest().getCurrent().getClientInfo().getAddress();
        System.out.println(itemName);
        double tax = Double.parseDouble(itemName);
        Document myXML = Auxiliar.getMyXML(tax, dal, ip);
        myXML.normalizeDocument();
        representation.setDocument(myXML);
        return representation;
    }

    @Override
    protected void doRelease() throws ResourceException {
        super.doRelease();
    }
}
I've tried the solutions provided in other threads, but none of them seems to work. Firstly, it does not seem that the thread pool is enlarged by the parameters I set, as the warnings state that the available thread pool is 10. As mentioned before, increasing the maxThreads value only seems to postpone the problem.
Example: INFO: Worker service tasks: 0 queued, 10 active, 17 completed, 27 scheduled.
There could be some error related to the Restlet version, but I downloaded the stable version to verify this was not the issue. The web service handles around 5000 requests per day, which is not much. Note: inserting the release() call in either the ServerResource or the OriginFilter returns an error with the referred warning ("WARNING: A response with a 200 (Ok) status should have an entity (...)").
Please guide.
Thanks!
By reading this site, the server-side problem I described was resolved by upgrading the Restlet distribution to version 2.1.
You will need to alter some code; consult the respective migration guide.
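One more thing that may be worth checking (an assumption on my part, not something from the linked site): connector parameters such as maxThreads are typically read from the context of the Server returned by getServers().add(...), so setting them only on the Component's context may be why the pool stayed at 10. A sketch, reusing the ServerApp class from the question:

import org.restlet.Component;
import org.restlet.Server;
import org.restlet.data.Protocol;

public class ServerMain {
    public static void main(String[] args) throws Exception {
        Component component = new Component();
        // Keep the Server reference so its connector parameters can be set directly.
        Server server = component.getServers().add(Protocol.HTTP, 8188);
        server.getContext().getParameters().add("maxThreads", "512");
        server.getContext().getParameters().add("minThreads", "100");
        component.getDefaultHost().attach("/orcamento2013", new ServerApp());
        component.start();
    }
}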