Camel S3: Listing S3 bucket files with scheduler turned off - amazon-s3

I have the following Camel route, which tries to read the list of files from an S3 bucket:
from("direct:my-route").
.from("aws-s3://my.bucket?useIAMCredentials=true&useAwsKMS=true&awsKMSKeyId=my-key-id&deleteAfterRead=false&operation=listObjects&includeBody=false&prefix=test1/test.xml")
.log(" File detected: ${header.CamelAwsS3Key}")
.end();
However, this route is called by an external scheduler that runs every minute. It looks like the default behaviour of the Camel S3 component is to run with its own scheduler, and this is causing the same files to be processed again and again.
I have tried to turn the Camel S3 scheduler off with startScheduler=false, but then the 'aws-s3' part is not executed when the external scheduler kicks in, and I get null values for '${header.CamelAwsS3Key}'.
Is it possible to run this component without the internal scheduler?
Camel version being used - 2.22.0
Dependency used for aws:
<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-aws</artifactId>
    <version>${camel.version}</version>
</dependency>

Don't have 2 x from; that is basically two independent consumers. Instead use a content enricher (pollEnrich) to consume from S3 when the other from is called.
from
pollEnrich
log
Read the docs about the content enricher and pollEnrich / enrich (especially around timeouts with pollEnrich).
https://camel.apache.org/manual/latest/content-enricher.html
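A minimal sketch of what that could look like in the Java DSL, reusing the endpoint URI from the question (the 5000 ms timeout is just an example value so the route does not block forever if nothing comes back):

from("direct:my-route")
    // pollEnrich consumes from the S3 endpoint on demand, so no internal
    // scheduler is started; the poll only happens when the external
    // scheduler triggers direct:my-route.
    .pollEnrich("aws-s3://my.bucket?useIAMCredentials=true&useAwsKMS=true&awsKMSKeyId=my-key-id"
            + "&deleteAfterRead=false&operation=listObjects&includeBody=false&prefix=test1/test.xml",
            5000)
    .log(" File detected: ${header.CamelAwsS3Key}")
    .end();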

Related

GET Mapreduce Job Progress after Job finished

I'm developing an application that collects MapReduce job progress info for analysis. The first approach is to parse the log files, but that's ugly. Is there any method, like a hook or a plugin, that can do this?
You can probably use the YARN application API to get most of the information. See the YARN Application API documentation.
Here is an excerpt from the page:
... All query parameters for this api will filter on all applications. However the queue query parameter will only implicitly filter on unfinished applications that are currently in the given queue.
There are other YARN APIs too, that you can utilize to achieve your goal. It is certainly better than scanning log files.
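As a rough illustration (the ResourceManager host and port below are placeholders, and the states filter is optional), the applications endpoint of the ResourceManager REST API can be queried from plain Java:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class YarnAppsQuery {
    public static void main(String[] args) throws Exception {
        // Point this at your ResourceManager; /ws/v1/cluster/apps returns JSON
        // describing applications, including progress, state and finalStatus.
        String url = "http://resourcemanager-host:8088/ws/v1/cluster/apps?states=FINISHED";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}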

Spring Cloud Task not started with Spring Cloud Stream using RabbitMQ

I am experimenting with the Spring Cloud APIs as part of a microservices course.
To set up a serverless task, I am using Cloud Task, Cloud Stream (RabbitMQ), and Spring Web.
For this I have set up the following projects:
Serverless task to be executed -
https://github.com/Omkar-Shetkar/pluralsight-springcloud-m3-task
Component to receive HTTP requests from the user and submit to RabbitMQ -
https://github.com/Omkar-Shetkar/pluralsight-springcloud-m3-taskintake
Sink component to receive TaskLaunchRequest and forward to cloud task - https://github.com/Omkar-Shetkar/pluralsight-springcloud-m3-tasksink
Having set up the above components, I ensured that the task component is available in the local Maven repository.
After initiating a POST request to /tasks in pluralsight.com.TaskController.launchTask(String), I see an HTTP response.
But I couldn't see any update in the tasklogs DB associated with the serverless task.
This means the task itself is not being called.
In the RabbitMQ console I can see that connections are established from the intake and sink components, but I don't see any message exchange happening.
The queue named tasktopic has a message count of ZERO.
I'd appreciate any pointers and suggestions on how to resolve this issue.
Thanks.
There were two issues with my implementation:
In the intake and sink modules' application.properties, the binding property key was wrong.
It should be:
In intake module
spring.cloud.stream.bindings.output.destination=tasktopic
In sink module
spring.cloud.stream.bindings.input.destination=tasktopic
Also, the local cloud deployer version was incompatible in the sink module's pom.xml.
I updated it to:
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-deployer-local</artifactId>
    <version>1.3.0.RELEASE</version>
</dependency>
With these changes, I am able to get RabbitMQ messages.
The @EnableTaskLauncher annotation is missing in TaskIntakeApplication.
@SpringBootApplication
@EnableTaskLauncher
public class PluralsightSpringcloudM3TaskintakeApplication {

    public static void main(String[] args) {
        SpringApplication.run(PluralsightSpringcloudM3TaskintakeApplication.class, args);
    }
}

How to create a Datalake using Apache Kafka, Amazon Glue and Amazon S3?

I want to store all the data from a Kafka topic in Amazon S3. I have a Kafka cluster that receives 200,000 messages per second on one topic, and each message value has 50 fields (strings, timestamps, integers, and floats).
My main idea is to use Kafka Connect to store the data in an S3 bucket and then use Amazon Glue to transform the data and write it to another bucket. I have the following questions:
1) How do I do it? Will that architecture work well? I tried Amazon EMR (Spark Streaming) but I had too many concerns: How to decrease the processing time and failed tasks using Apache Spark for events streaming from Apache Kafka?
2) I tried to use Kafka Connect from Confluent, but I have a few questions:
Can I connect to my Kafka cluster from another Kafka instance and run my Kafka S3 connector in standalone mode?
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.NullPointerException
    at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:122)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:26,086] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)
[2018-10-05 15:32:27,980] WARN could not create Dir using directory from url file:/targ. skipping. (org.reflections.Reflections:104)
java.lang.NullPointerException
    at org.reflections.vfs.Vfs$DefaultUrlTypes$3.matches(Vfs.java:239)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:98)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:27,981] WARN could not create Vfs.Dir from url. ignoring the exception and continuing (org.reflections.Reflections:208)
org.reflections.ReflectionsException: could not create Vfs.Dir from url, no matching UrlType was found [file:/targ] either use fromURL(final URL url, final List<UrlType> urlTypes) or use the static setDefaultURLTypes(final List<UrlType> urlTypes) or addDefaultURLTypes(UrlType urlType) with your specialized UrlType.
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:109)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:35,441] INFO Reflections took 12393 ms to scan 429 urls, producing 13521 keys and 95814 values (org.reflections.Reflections:229)
If you can resume the steps to connect to Kafka and keep on S3 from another Kafka instance, how would you do it?
What do all these fields mean: key.converter, value.converter, key.converter.schemas.enable, value.converter.schemas.enable, internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable?
What are the possible values for key.converter and value.converter?
3) Once my raw data is in a bucket, I would like to use Amazon Glue to take that data, deserialize the Protobuf, change the format of some fields, and finally store it in another bucket in Parquet. How can I use my own Java Protobuf library in Amazon Glue?
4) If I want to query with Amazon Athena, how can I load the partitions automatically (year, month, day, hour)? With the crawlers and schedulers of Amazon Glue?
To complement @cricket_007's answer:
Can I connect to my Kafka cluster from another Kafka instance and run my Kafka S3 connector in standalone mode?
The Kafka S3 connector is part of the Confluent distribution, which also includes Kafka as well as other related services, but it is not meant to run on your brokers directly; rather:
as a standalone worker running a connector's configuration given when the service is launched
or as an additional workers' cluster running alongside your Kafka brokers' cluster. In that case, interacting with and running connectors is better done via the Kafka Connect REST API (search for "Managing Kafka Connectors" for documentation with examples).
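For illustration, a hedged sketch of registering an S3 sink through the Connect REST API from Java (the worker URL, connector name, topic, bucket, and region are placeholders; the config keys follow the Confluent S3 sink connector documentation):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateS3SinkConnector {
    public static void main(String[] args) throws Exception {
        // POST /connectors registers a new connector with the given config.
        String body = "{"
                + "\"name\": \"s3-sink\","
                + "\"config\": {"
                + "  \"connector.class\": \"io.confluent.connect.s3.S3SinkConnector\","
                + "  \"tasks.max\": \"1\","
                + "  \"topics\": \"my-topic\","
                + "  \"s3.bucket.name\": \"my-raw-bucket\","
                + "  \"s3.region\": \"us-east-1\","
                + "  \"storage.class\": \"io.confluent.connect.s3.storage.S3Storage\","
                + "  \"format.class\": \"io.confluent.connect.s3.format.json.JsonFormat\","
                + "  \"flush.size\": \"1000\""
                + "}}";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("http://connect-worker:8083/connectors"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(body))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}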
If you can resume the steps to connect to Kafka and keep on S3 from another Kafka instance, how would you do it?
Are you talking about another Kafka Connect instance?
If so, you can simply run the Kafka Connect service in distributed mode, which was meant to give the reliability you seem to be looking for...
Or do you mean another Kafka (brokers) cluster?
In that case, you could try (but that would be experimental, and I haven't tried it myself...) to run Kafka Connect in standalone mode and simply update the bootstrap.servers parameter of your connector's configuration to point to the new cluster. Why that might work: in standalone mode the offsets of your sink connector(s) are stored locally on your worker (contrary to distributed mode, where the offsets are stored on the Kafka cluster directly...). Why that might not work: it's simply not intended for this use, and I'm guessing you might need your topics and partitions to be exactly the same...?
What are the possible values for key.converter, value.converter?
Check Confluent's documentation for kafka-connect-s3 ;)
How can I use my own java protobuffer library in Amazon Glue?
Not sure of the actual method, but Glue jobs spawn off an EMR cluster behind the scenes so I don't see why it shouldn't be possible...
If I want to query with Amazon Athena, how can I load the partitions automatically (year, month, day, hour)? With the crawlers and schedulers of Amazon Glue?
Yes.
Assuming daily partitioning, you could have your schedule run the crawler first thing in the morning, as soon as you can expect new data to have created that day's folder on S3 (so at least one object for that day exists on S3)... The crawler will add that day's partition, which will then be available for querying along with any newly added objects.
We use S3 Connect for hundreds of topics and process data using Hive, Athena, Spark, Presto, etc. Seems to work fine, though I feel like an actual database might return results faster.
In any case, to answer about Connect
Can I connect to my Kafka cluster from another Kafka instance and run my Kafka S3 connector in standalone mode?
I'm not sure I understand the question, but Kafka Connect needs to connect to one cluster, you don't need two Kafka clusters to use it. You'd typically run Kafka Connect processes as part of their own cluster, not on the brokers.
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
It means you need to look at the logs to figure out what exception is being thrown and stopping the connector from reading data.
WARN could not create Dir using directory from url file:/targ ... If you're using the HDFS connector, I don't think you should be using the default file:// URI.
If you can resume the steps to connect to Kafka and keep on S3 from another Kafka instance, how would you do it?
You can't "resume from another Kafka instance". As mentioned, Connect can only consume from a single Kafka cluster, and any consumed offsets and consumer groups are stored with it.
What do all these fields mean?
These fields are removed from the latest Kafka releases; you can ignore them. You definitely should not change them:
internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable
These are your serializers and deserializers, like the regular producer and consumer APIs have:
key.converter, value.converter
I believe these are only important for JSON converters (see https://rmoff.net/2017/09/06/kafka-connect-jsondeserializer-with-schemas-enable-requires-schema-and-payload-fields):
key.converter.schemas.enable, value.converter.schemas.enable
to deserialize Protobuf, to change the format of some fields, and finally to store it in another bucket in Parquet
Kafka Connect would need to be loaded with a Protobuf converter, and I don't know if there is one (I think Blue Apron wrote something... search GitHub).
Generally speaking, Avro would be much easier to convert to Parquet because native libraries already exist to do that. S3 Connect by Confluent doesn't currently write Parquet format, but there is an open PR. The alternative is to use the Pinterest Secor library.
I don't know Glue, but if it's like Hive, you would use ADD JAR during a query to load external code plugins and functions
I have minimal experience with Athena, but Glue maintains all the partitions as a Hive metastore. The automatic part would be the crawler; you can put a filter on the query to do partition pruning.

Apache Camel: File component moveFailed redelivery strategy

When a certain endpoint is not available (a 500 response, for instance), my queued file is moved to the .error directory. I am using the moveFailed parameter for this.
<from uri="file:inbox?autoCreate=true&amp;readLockTimeout=2000&amp;charset=utf-8&amp;preMove=.processing&amp;delete=true&amp;moveFailed=.error&amp;maxMessagesPerPoll=50&amp;delay=1000"/>
According to: http://camel.apache.org/file2.html
When moving the files to the “fail” location Camel will handle the
error and will not pick up the file again.
What is the best approach to implement a redelivery policy/strategy so that the files get picked up again when failed?
Set up a retry by redelivering to that particular endpoint, not to the whole route.
You can do this with an error handler by specifying the number of retries, a delay between retries, and a backoff multiplier if you so wish.
onException(RestException.class)
    .maximumRedeliveries(3)
    .redeliveryDelay(100L)
    .backOffMultiplier(1.5);
Or setting this in your camel context:
<errorHandler id="errorhandler" redeliveryPolicyRef="redeliveryPolicy"/>
<redeliveryPolicyProfile id="redeliveryPolicy" maximumRedeliveries="3" redeliveryDelay="100" backOffMultiplier="1.5" retryAttemptedLogLevel="WARN"/>
This way, the file is only delivered to the error folder once it has run out of redelivery attempts.
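For illustration, a hedged sketch of how these pieces could fit together in the Java DSL, inside a RouteBuilder (RestException and the target HTTP endpoint are placeholders for whatever is actually failing in your route):

// Retry the failing endpoint up to 3 times with a growing delay.
onException(RestException.class)
    .maximumRedeliveries(3)
    .redeliveryDelay(100L)
    .backOffMultiplier(1.5);

from("file:inbox?preMove=.processing&delete=true&moveFailed=.error&maxMessagesPerPoll=50&delay=1000")
    // Only when the redeliveries are exhausted does the exception propagate
    // back to the file consumer, which then moves the file to .error.
    .to("http://some-backend/service");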
You could also look at using the dead letter handler, and putting the file into a queue to be processed later.

WSO2 APIM 2.0 Gateway-Worker-Node: "the requested resource XXX is not available"

I have a gateway manager (GWM) with 2 worker nodes. When I deploy an API, it's pushed to the GWM and is available there, and the API call works fine.
I decided to synchronize the APIs from the GWM to the worker nodes via rsync. The filesystems under ~wso2/repository/deployment/server on the worker nodes are synced and identical to the GWM node.
But when I call the API on a worker node I get this message:
<am:fault xmlns:am="http://wso2.org/apimanager"><am:code>404</am:code>
<am:type>Status report</am:type><am:message>Not Found</am:message>
<am:description>The requested resource (/XXX/1/foo) is not available.
</am:description>
</am:fault>
I also restarted the workers, but got the same result.
Did I miss something, or is there a trigger to load the APIs on the workers into a cache, or something like that?
I faced the same issue when the contents of the mediation files were changed.
Solution which worked for me:
Demote your API to the Created state
Ensure the gateway is checked
Redeploy it