I want to read data from Amazon S3 into Kafka. I found the camel-aws-s3-kafka-connector source, tried it, and it works, but... I want to read data from S3 without deleting the files, yet exactly once for each consumer, with no duplicates. Is it possible to do this using only the configuration file? I've already created a file which looks like:
name=CamelSourceConnector
connector.class=org.apache.camel.kafkaconnector.awss3.CamelAwss3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.camel.kafkaconnector.awss3.converters.S3ObjectConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
#prefix=WriteTopic
camel.source.endpoint.prefix=full/path/to/WriteTopic2
camel.source.path.bucketNameOrArn=BucketName
camel.source.endpoint.autocloseBody=false
camel.source.endpoint.deleteAfterRead=false
camel.sink.endpoint.region=xxxx
camel.component.aws-s3.accessKey=xxxx
camel.component.aws-s3.secretKey=xxxx
Additionally, with the configuration above I am not able to read only from "WriteTopic" but from all folders in S3. Is it also possible to configure this?
(Screenshot: S3 bucket folders with files)
I found a workaround for the duplicates problem. I'm not completely sure it is the best possible way, but it may help somebody. My approach is described here: https://camel.apache.org/blog/2020/12/CKC-idempotency-070/ . I used camel.idempotency.repository.type=memory, and my configuration file looks like:
name=CamelAWS2S3SourceConnector
connector.class=org.apache.camel.kafkaconnector.aws2s3.CamelAws2s3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
# path we read the data from
camel.source.endpoint.prefix=full/path/to/topic/prefix
camel.source.path.bucketNameOrArn="Bucket name"
camel.source.endpoint.deleteAfterRead=false
camel.component.aws2-s3.access-key=****
camel.component.aws2-s3.secret-key=****
camel.component.aws2-s3.region=****
#remove duplicates from messages#
camel.idempotency.enabled=true
camel.idempotency.repository.type=memory
camel.idempotency.expression.type=body
It is also important that I changed the Camel connector library. Initially I used the camel-aws-s3-kafka-connector source; to use the Idempotent Consumer I had to switch to the camel-aws2-s3-kafka-connector source.
I am trying to use the Lenses S3 source connector in AWS MSK (Kafka).
I downloaded kafka-connect-aws-s3-kafka-3-1-4.0.0.zip as a plug-in, saved it in S3, and registered it.
The above plug-in was specified and the connector configuration was written as follows.
<Connector Configuration>
connector.class=io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector
key.converter.schemas.enable=false
connect.s3.kcql=insert into my_topic select * from my_bucket:dev/domain_name/year=2022/month=11/ STOREAS 'JSON'
tasks.max=2
connect.s3.aws.auth.mode=Default
value.converter.schemas.enable=false
connect.s3.aws.region=ap-northeast-2
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
The connector is created normally and data is read from S3 into the specified topic, but there are two problems here.
As described in "connect.s3.kcql", data should be imported based on /year=2022/month=11/ only, but data from other partitioned months and dates is imported as well. It seems that the "/year=" and "/month=" paths specified under /dev/domain_name (= PREFIX_NAME) are not recognized and everything is imported. I wonder if there is a way to restrict this.
(refer to my full s3 path: my_bucket/dev/domain_name/year=2022/month=11/hour=1/*.json )
More JSON files exist in the specified S3 path, but they are no longer imported into the topic. No errors occur and everything looks normal.
When I look at the connector log, I keep getting the message "flushing 0 outstanding messages for offset commit".
I am trying to move & decompress data from Azure Data Lake Storage Gen1.
I have a couple of files with ".tsv.gz" extension, and I want to decompress and move them to a different folder, which is in the same data lake.
I've tried to use the wildcard "*.tsv.gz" inside the connection configuration, so I can run this process in one go.
Am I making a mistake somewhere?
Thanks
Just tested it, you should just use:
*.tsv.gz
Without ' or "
Hope this helped!
PS: also, remember to check the "Copy file recursively" when you select the dataset in the pipeline.
I have JSON coming in like this:
{
"app" : "hw",
"content" : "hello world",
"time" : "2018-05-06 12:53:04"
}
I wish to push to S3 in the following file format:
/upper-directory/$jsonfield1/$jsonfield2/$date/$HH
I know I can achieve:
/upper-directory/$date/$HH
with the TimeBasedPartitioner and topics.dir, but how do I put the 2 JSON fields in as well?
You need to write your own Partitioner to achieve a combination of the TimeBased and Field partitioners (a rough sketch follows the note below).
That means making a new Java project, looking at the existing partitioner source code as a reference point, and building a JAR out of the project; then copy the JAR next to kafka-connect-storage-common on all servers running Kafka Connect, where it is picked up by the S3 connector. After you've copied the JAR, you will need to restart the Connect process.
Note: there's already a PR that is trying to add this - https://github.com/confluentinc/kafka-connect-storage-common/pull/73/files
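For reference, here is a minimal sketch of what such a custom partitioner could look like, assuming schemaless JSON so that the record value arrives as a Map. The package, class name, and the hard-coded field names ("app" and "content" from the question) are only illustrative, not a drop-in implementation:

package com.example.connect; // hypothetical package

import java.util.Map;

import io.confluent.connect.storage.partitioner.TimeBasedPartitioner;
import org.apache.kafka.connect.sink.SinkRecord;

// Prepends two fields from the record value to the usual time-based path,
// producing encoded partitions like hw/hello world/<time-based part>.
public class FieldAndTimeBasedPartitioner<T> extends TimeBasedPartitioner<T> {

    @Override
    public String encodePartition(SinkRecord sinkRecord) {
        Object value = sinkRecord.value();
        StringBuilder prefix = new StringBuilder();
        // With schemaless JSON the value is a Map; with a schema it would be a
        // Struct and you would read the fields via Struct#get instead.
        if (value instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) value;
            prefix.append(map.get("app")).append('/');
            prefix.append(map.get("content")).append('/');
        }
        // Let the stock TimeBasedPartitioner generate the date/hour part.
        return prefix.toString() + super.encodePartition(sinkRecord);
    }
}

After building a JAR with this class, you would point partitioner.class in the sink connector configuration at its fully-qualified name.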
I have data in Spark which I want to save to S3. The recommended method is to save it using the saveAsTextFile method on the RDD, which succeeds. I expect the data to be saved as 'parts'.
My problem is that when I go to S3 to look at my data it has been saved in a folder name _temporary, with a subfolder 0 and then each part or task saved in its own folder.
For example,
data.saveAsTextFile("s3://kirk/data");
results in files like
s3://kirk/data/_SUCCESS
s3://kirk/data/_temporary/0/_temporary_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000/part-00000
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001/part-00001
and so on. I would expect and have seen something like
s3://kirk/data/_SUCCESS
s3://kirk/data/part-00000
s3://kirk/data/part-00001
Is this a configuration setting, or do I need to 'commit' the save to resolve the temporary files?
I had the same problem with Spark Streaming; it was because my Spark master was set up with conf.setMaster("local") instead of conf.setMaster("local[*]").
Without the [*], Spark can't execute saveAsTextFile during the stream.
Try using coalesce() to reduce the RDD to 1 partition before you export.
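As a rough illustration using the Java API (the input path and app name are made up; adjust the master and paths for your setup):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SaveToS3 {
    public static void main(String[] args) {
        // local[*] rather than local, so enough worker threads are available
        SparkConf conf = new SparkConf().setAppName("save-to-s3").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> data = sc.textFile("s3://kirk/input"); // hypothetical input
        // coalesce to a single partition so the output is a single part file
        data.coalesce(1).saveAsTextFile("s3://kirk/data");
        sc.stop();
    }
}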
Good luck!
I'm stuck trying to export a table to my Google Cloud Storage bucket.
Example job id: job_0463426872a645bea8157604780d060d
I tried the Cloud Storage target with a lot of different variations; all reveal the same error. If I try to copy the natality report, it works.
What am I doing wrong?
Thanks!
Daniel
It looks like the error says:
"Table too large to be exported to a single file. Specify a uri including a * to shard export." Try switching the destination URI to something like gs://foo/bar/baz*
Specify the file extension along with the pattern. For example:
gs://foo/bar/baz*.gz in case of GZIP (compressed)
gs://foo/bar/baz*.csv in case of csv (uncompressed)
The foo directory is the bucket name, and the bar directory can be your date in string format, which could be generated on the fly.
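If you prefer to trigger the export from code instead of the console, a small sketch with the google-cloud-bigquery Java client could look like this (the project, dataset, table, and bucket names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class ExportTable {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        Table table = bigquery.getTable(TableId.of("my_project", "my_dataset", "my_table"));
        // The * in the destination URI lets BigQuery shard the export into many files.
        Job job = table.extract("CSV", "gs://foo/bar/baz-*.csv").waitFor();
        if (job != null && job.getStatus().getError() == null) {
            System.out.println("Export finished");
        } else {
            System.out.println("Export failed: " + (job == null ? "job no longer exists" : job.getStatus().getError()));
        }
    }
}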
I was able to do it with:
bq extract --destination_format=NEWLINE_DELIMITED_JSON myproject:mydataset.mypartition gs://mybucket/mydataset/mypartition/{*}.json