How to configure the Kafka S3 sink connector for JSON using its fields AND time-based partitioning?

I have a json coming in like this:
{
"app" : "hw",
"content" : "hello world",
"time" : "2018-05-06 12:53:04"
}
I wish to push to S3 in the following file format:
/upper-directory/$jsonfield1/$jsonfield2/$date/$HH
I know I can achieve:
/upper-directory/$date/$HH
with TimeBasedPartitioner and Topic.dir, but how do I put in the 2 json fields as well?

You need to write your own Partitioner to achieve a combination of the TimeBased and Field partitioners.
That means creating a new Java project, using the existing partitioner source code as a reference point, building a JAR from the project, and then copying that JAR into kafka-connect-storage-common on all servers running Kafka Connect, where it will be picked up by the S3 connector. After you've copied the JAR, you will need to restart the Connect process.
Note: there's already a PR that is trying to add this - https://github.com/confluentinc/kafka-connect-storage-common/pull/73/files
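Until something like that PR lands, the core of a combined partitioner is its encodePartition() logic: emit the two field values first, then the date and hour. Here is a plain-Java sketch of just that path-building step (illustrative names, no Connect dependencies; a real Partitioner would extract these values from the SinkRecord):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;

public class FieldAndTimePath {
    // parses the record's "time" field, e.g. "2018-05-06 12:53:04"
    static final DateTimeFormatter IN = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Builds the partition path a custom encodePartition() could return,
    // e.g. "hw/hello world/2018-05-06/12" for the record in the question.
    public static String encode(Map<String, String> record, String field1, String field2) {
        LocalDateTime ts = LocalDateTime.parse(record.get("time"), IN);
        return record.get(field1) + "/" + record.get(field2) + "/"
                + ts.toLocalDate() + "/" + String.format("%02d", ts.getHour());
    }
}
```

In a real implementation you would extend one of the partitioners in kafka-connect-storage-common (for example TimeBasedPartitioner) and return a string like this from encodePartition(); the connector then appends it under topics.dir.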

Related

Why is the Lenses S3 source connector not working?

I am trying to use the Lenses S3 source connector in AWS MSK (Kafka).
I downloaded kafka-connect-aws-s3-kafka-3-1-4.0.0.zip as a plug-in, saved it in S3, and registered it.
I then specified the above plug-in and wrote the connector configuration as follows.
<Connector Configuration>
connector.class=io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector
key.converter.schemas.enable=false
connect.s3.kcql=insert into my_topic select * from my_bucket:dev/domain_name/year=2022/month=11/ STOREAS 'JSON'
tasks.max=2
connect.s3.aws.auth.mode=Default
value.converter.schemas.enable=false
connect.s3.aws.region=ap-northeast-2
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
The connector is created normally and data is read from S3 into the specified topic, but there are two problems here.
As specified in "connect.s3.kcql", data should be imported based on /year=2022/month=11/, but data from other partitioned months and dates is also imported. It seems that the "/year=" and "/month=" paths specified under /dev/domain_name (=PREFIX_NAME) are not recognized and everything is imported. I wonder if there is a way around this.
(Refer to my full S3 path: my_bucket/dev/domain_name/year=2022/month=11/hour=1/*.json )
More JSON files exist in the specified S3 path, but they are no longer imported into the topic. No errors occur, and the connector otherwise appears to be running normally.
When I look at the connector log, I keep getting the message "flushing 0 outstanding messages for offset commit".

From S3 to Kafka using Apache Camel Source

I want to read data from Amazon S3 into Kafka. I found the camel-aws-s3-kafka-connector source and tried it, and it works, but... I want to read data from S3 without deleting files, exactly once for each consumer, without duplicates. Is it possible to do this using only a configuration file? I've already created a file which looks like this:
name=CamelSourceConnector
connector.class=org.apache.camel.kafkaconnector.awss3.CamelAwss3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.camel.kafkaconnector.awss3.converters.S3ObjectConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
#prefix=WriteTopic
camel.source.endpoint.prefix=full/path/to/WriteTopic2
camel.source.path.bucketNameOrArn=BucketName
camel.source.endpoint.autocloseBody=false
camel.source.endpoint.deleteAfterRead=false
camel.sink.endpoint.region=xxxx
camel.component.aws-s3.accessKey=xxxx
camel.component.aws-s3.secretKey=xxxx
Additionally, with the configuration above I am not able to read only from "WriteTopic"; it reads from all folders in S3. Is this also possible to configure?
[Screenshot: S3 bucket folders with files]
I found a workaround for the duplicates problem. I'm not completely sure it is the best possible way, but it may help somebody. My approach is described here: https://camel.apache.org/blog/2020/12/CKC-idempotency-070/ . I used camel.idempotency.repository.type=memory, and my configuration file looks like this:
name=CamelAWS2S3SourceConnector
connector.class=org.apache.camel.kafkaconnector.aws2s3.CamelAws2s3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
# path from which data is read
camel.source.endpoint.prefix=full/path/to/topic/prefix
camel.source.path.bucketNameOrArn="Bucket name"
camel.source.endpoint.deleteAfterRead=false
camel.component.aws2-s3.access-key=****
camel.component.aws2-s3.secret-key=****
camel.component.aws2-s3.region=****
#remove duplicates from messages#
camel.idempotency.enabled=true
camel.idempotency.repository.type=memory
camel.idempotency.expression.type=body
It is also important that I changed the Camel connector library. Initially I used the camel-aws-s3-kafka-connector source; to use the Idempotent Consumer I needed to switch to the camel-aws2-s3-kafka-connector source.
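Conceptually, the memory repository just remembers what it has already seen and drops repeats; with camel.idempotency.expression.type=body, the key is the message body. A minimal plain-Java sketch of that behavior (not Camel's actual classes):

```java
import java.util.HashSet;
import java.util.Set;

// Rough sketch of what camel.idempotency.repository.type=memory does:
// remember each message key (here the body) and drop duplicates.
public class InMemoryIdempotentFilter {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a body is observed, false for duplicates.
    public boolean accept(String body) {
        return seen.add(body);
    }
}
```

Keep in mind that an in-memory repository loses its state when the connector restarts, so duplicates can reappear after a restart; a persistent repository type avoids that at the cost of extra infrastructure.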

Nexus 3 Repository Manager Create (Or Run Pre-generated) Task Without Using User Interface

This question arose when I was trying to reboot my Nexus3 container on a weekly schedule and connect to an S3 bucket I have. I have my container set up to connect to the S3 bucket just fine (it creates a new [A-Z,0-9]-metrics.properties file each time), but the previous artifacts are not found when looking through the UI.
I used the Repair - Reconcile component database from blob store task from the UI settings and it works great!
But... all the previous steps are done automatically through scripts and I would like the same for the final step of Reconciling the blob store.
Connecting to the S3 blob store is done with reference to examples from nexus-book-examples. As below:
Map<String, String> config = new HashMap<>()
config.put("bucket", "nexus-artifact-storage")
blobStore.createS3BlobStore('nexus-artifact-storage', config)
AWS credentials are provided during the docker run step so the above is all that is needed for the blob store set up. It is called by a modified version of provision.sh, which is a script from the nexus-book-examples git page.
Is there a way to either:
Create a task with a groovy script? or,
Reference one of the task types and run the task that way with a POST?
Depending on the specific version of Repository Manager that you are using, there may be REST endpoints for listing and running scheduled tasks. This was introduced in 3.6.0 according to this ticket: https://issues.sonatype.org/browse/NEXUS-11935. For more information about the REST integration in 3.x, check out the following: https://help.sonatype.com/display/NXRM3/Tasks+API
For creating a scheduled task, you will have to add some groovy code. Perhaps the following would be a good start:
import org.sonatype.nexus.scheduling.TaskConfiguration
import org.sonatype.nexus.scheduling.TaskInfo
import org.sonatype.nexus.scheduling.TaskScheduler
import groovy.json.JsonOutput
import groovy.json.JsonSlurper
class TaskXO
{
String typeId
Boolean enabled
String name
String alertEmail
Map<String, String> properties
}
// 'args' is the JSON task definition passed to the script
TaskXO task = new JsonSlurper().parseText(args)
TaskScheduler scheduler = container.lookup(TaskScheduler.class.name)
TaskConfiguration config = scheduler.createTaskConfigurationInstance(task.typeId)
config.enabled = task.enabled
config.name = task.name
config.alertEmail = task.alertEmail
task.properties?.each { key, value -> config.setString(key, value) }
// schedule the task for manual execution and return its info as JSON
TaskInfo taskInfo = scheduler.scheduleTask(config, scheduler.scheduleFactory.manual())
JsonOutput.toJson(taskInfo)
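The script above expects a JSON task definition in args. A hypothetical payload for the "Repair - Reconcile component database from blob store" task might look like the following; the typeId and property names here are assumptions, so verify them against your Nexus version (for example by inspecting an existing task through the Tasks API):

```json
{
  "typeId": "blobstore.rebuildComponentDB",
  "enabled": true,
  "name": "reconcile-s3-blobstore",
  "alertEmail": "",
  "properties": {
    "blobstoreName": "nexus-artifact-storage"
  }
}
```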

How do I generate the variation file for all assets

I'm new to Akeneo, and I discovered profile configuration for assets.
So I imported my YML in order to add asset transformations, and now I can't find a CLI command that generates the variation files for all assets. I saw the command that does it asset by asset and channel by channel, but I need to do it for all of them.
Do you know how I can manage to do that? I already tried pim:asset:generate-missing-variation-files, but that didn't change anything.
There is no built-in command to do that, however you could develop a very simple command to achieve this.
You can use the pimee_product_asset.finder.asset service and call retrieveVariationsNotGenerated() to retrieve every variation that is not yet generated, then use the pimee_product_asset.variation_file_generator service to generate each variation with generate().
Untested code, but it would look something like this:
// inside a container-aware command
$finder = $this->get('pimee_product_asset.finder.asset');
$generator = $this->get('pimee_product_asset.variation_file_generator');

// fetch every variation that has not been generated yet, then generate each one
$variations = $finder->retrieveVariationsNotGenerated();
foreach ($variations as $variation) {
    $generator->generate($variation);
}

How to give input Mapreduce jobs which use s3 data from java code

I know that we can normally pass the parameters while running the jar file on an EC2 instance.
But how do we provide those inputs through code?
I am trying this because I want to call my Java code from a JSP, so in the Java code I want to directly pick up data from S3 and proceed. I tried it like this, but in vain:
DataExtractor.getRelevantData("s3n://syamk/revanthinput/", "999999", "94645", "20120606",
"s3n://revanthufl/gen/testoutput" + "interm");
Here s3n://syamk/revanthinput/ is the input and s3n://revanthufl/gen/testoutput is the output, and these are the same strings I pass as parameters when running the jar. But calling it like this from code throws an exception:
[java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).] with root cause
Based on my usage of Flume, it would appear that you need to format your URL like s3n://AWS_ACCESS_KEY:AWS_SECRET_KEY@syamk/revanthinput/ when calling S3 from within code.
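The exception message itself offers two options: embed the credentials in the s3n URL, or set the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties on the Hadoop Configuration. If you embed them, note that secret keys often contain / characters, which must be percent-encoded or the URL will not parse. A small hypothetical helper:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class S3nUrl {
    // Embeds AWS credentials into an s3n URL, as the exception message suggests.
    // The secret key is percent-encoded because characters like '/' would
    // otherwise break URL parsing.
    public static String withCredentials(String accessKey, String secretKey, String bucketAndPath) {
        String encodedSecret = URLEncoder.encode(secretKey, StandardCharsets.UTF_8);
        return "s3n://" + accessKey + ":" + encodedSecret + "@" + bucketAndPath;
    }
}
```

Embedding credentials in URLs risks leaking them into logs, so setting the fs.s3n.* properties on the job's Configuration is generally the safer of the two options.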