Why is the Lenses S3 source connector not working?

I am trying to use the Lenses S3 source connector in AWS MSK (Kafka).
I downloaded kafka-connect-aws-s3-kafka-3-1-4.0.0.zip as a plugin, saved it in S3, and registered it.
I then specified that plugin and wrote the connector configuration as follows.
<Connector Configuration>
connector.class=io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector
key.converter.schemas.enable=false
connect.s3.kcql=insert into my_topic select * from my_bucket:dev/domain_name/year=2022/month=11/ STOREAS 'JSON'
tasks.max=2
connect.s3.aws.auth.mode=Default
value.converter.schemas.enable=false
connect.s3.aws.region=ap-northeast-2
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
The connector is created normally and data is read from S3 into the specified topic, but there are two problems.
As described in "connect.s3.kcql", data should be imported based on /year=2022/month=11/, but data from other partitioned months and dates is also imported. It seems the "/year=" and "/month=" paths specified under /dev/domain_name (the prefix) are not recognized, so everything is imported. Is there a way to restrict this?
(For reference, my full S3 path is: my_bucket/dev/domain_name/year=2022/month=11/hour=1/*.json)
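For reference, a quick boto3 listing (using the bucket and prefix from my configuration above) shows which keys actually sit under that prefix, which I compared against what the connector imports:
import boto3

s3 = boto3.client('s3')

# List only the keys under the partition prefix used in the KCQL statement
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my_bucket', Prefix='dev/domain_name/year=2022/month=11/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])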
More JSON files exist in the specified S3 path, but they are no longer imported into the topic. No errors occur and everything appears normal.
When I look at the connector log, I keep getting the message "flushing 0 outstanding messages for offset commit".

Related

Is multipart copy really needed to revert an object to a prior version?

For https://github.com/wlandau/gittargets/issues/6, I am trying to programmatically revert an object in an S3 bucket to an earlier version. From reading https://docs.aws.amazon.com/AmazonS3/latest/userguide/RestoringPreviousVersions.html, it looks like copying the object to itself (old version to current version) is recommended. However, I also read that there is a 5 GB limit for copying objects in S3. Does that limit apply to reverting an object to a previous version in the same bucket? A local download followed by a multi-part upload seems extremely inefficient for this use case.
You can create a multi-part transfer request that transfers from S3 to S3. It still takes time, but it doesn't require downloading the object's data and uploading it again, so in practice it tends to be considerably faster than other options:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('example-bucket')

# Copy the old version of the object over the current version of the same key
bucket.copy(
    {
        'Bucket': 'example-bucket',
        'Key': 'test.dat',
        'VersionId': '0011223344',  # From previous call to bucket.object_versions
    },
    Key='test.dat',
)
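If you need to find the VersionId first, here is a minimal sketch using the same resource objects as above (bucket and key names are the example ones):
for version in bucket.object_versions.filter(Prefix='test.dat'):
    # Each entry exposes the version id, timestamp and whether it is the current version
    print(version.id, version.last_modified, version.is_latest)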

From S3 to Kafka using Apache Camel Source

I want to read data from Amazon S3 into Kafka. I found the camel-aws-s3-kafka-connector source and tried it, and it works, but I want to read data from S3 without deleting the files, exactly once per consumer and without duplicates. Is it possible to do this using only the configuration file? I've already created a file which looks like this:
name=CamelSourceConnector
connector.class=org.apache.camel.kafkaconnector.awss3.CamelAwss3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.camel.kafkaconnector.awss3.converters.S3ObjectConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
#prefix=WriteTopic
camel.source.endpoint.prefix=full/path/to/WriteTopic2
camel.source.path.bucketNameOrArn=BucketName
camel.source.endpoint.autocloseBody=false
camel.source.endpoint.deleteAfterRead=false
camel.sink.endpoint.region=xxxx
camel.component.aws-s3.accessKey=xxxx
camel.component.aws-s3.secretKey=xxxx
Additionally, with the configuration above I am not able to read only from "WriteTopic" but from all folders in S3. Is that also possible to configure?
[Image: S3 bucket folders with files]
I found a workaround for the duplicates problem. I'm not completely sure it is the best possible way, but it may help somebody. My approach is described here: https://camel.apache.org/blog/2020/12/CKC-idempotency-070/ . I used camel.idempotency.repository.type=memory, and my configuration file looks like this:
name=CamelAWS2S3SourceConnector
connector.class=org.apache.camel.kafkaconnector.aws2s3.CamelAws2s3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
# path we read the data from
camel.source.endpoint.prefix=full/path/to/topic/prefix
camel.source.path.bucketNameOrArn="Bucket name"
camel.source.endpoint.deleteAfterRead=false
camel.component.aws2-s3.access-key=****
camel.component.aws2-s3.secret-key=****
camel.component.aws2-s3.region=****
#remove duplicates from messages#
camel.idempotency.enabled=true
camel.idempotency.repository.type=memory
camel.idempotency.expression.type=body
It is also important that I changed the Camel connector library. Initially I used the camel-aws-s3-kafka-connector source; to use the Idempotent Consumer I needed to switch to the camel-aws2-s3-kafka-connector source.
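Conceptually, the in-memory idempotent repository just remembers something derived from each message body and skips bodies it has already seen. Roughly this idea, sketched in Python (not Camel code, purely illustrative):
import hashlib

seen = set()  # in-memory "repository", lost on restart, like repository.type=memory

def should_forward(body: bytes) -> bool:
    # Key the repository on a digest of the message body (expression.type=body)
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen:
        return False  # already processed, drop as a duplicate
    seen.add(digest)
    return True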

How to configure kafka s3 sink connector for json using its fields AND time based partitioning?

I have a json coming in like this:
{
"app" : "hw",
"content" : "hello world",
"time" : "2018-05-06 12:53:04"
}
I wish to push to S3 in the following file format:
/upper-directory/$jsonfield1/$jsonfield2/$date/$HH
I know I can achieve:
/upper-directory/$date/$HH
with TimeBasedPartitioner and Topic.dir, but how do I put in the 2 json fields as well?
You need to write your own Partitioner to achieve a combination of the TimeBased and Field partitioners.
That means creating a new Java project, looking at the existing source code as a reference point, building a JAR out of the project, and then copying the JAR into the kafka-connect-storage-common location on all servers running Kafka Connect, where it is picked up by the S3 connector. After you've copied the JAR, you will need to restart the Connect process.
Note: there's already a PR that is trying to add this - https://github.com/confluentinc/kafka-connect-storage-common/pull/73/files
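The partitioner itself has to be written in Java against the connector's partitioner interface, but the path-building logic it needs is roughly the following; a minimal Python sketch of the intended layout, using the JSON from the question (field handling here is illustrative, not the connector API):
import json
from datetime import datetime

def encode_partition(record_value: str) -> str:
    # Combine two JSON fields with a time-based date/hour layout
    msg = json.loads(record_value)
    ts = datetime.strptime(msg["time"], "%Y-%m-%d %H:%M:%S")
    return "/".join([msg["app"], msg["content"], ts.strftime("%Y-%m-%d"), ts.strftime("%H")])

print(encode_partition('{"app": "hw", "content": "hello world", "time": "2018-05-06 12:53:04"}'))
# -> hw/hello world/2018-05-06/12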

Data Copy from s3 to Redshift: Manifest is in different bucket than files I need to download

I am trying to copy data from a large number of files in s3 over to Redshift. I have read-only access to the s3 bucket which contains these files. In order to COPY them efficiently, I created a manifest file that contains the links to each of the files I need copied over.
Bucket 1:
- file1.gz
- file2.gz
- ...
Bucket 2:
- manifest
Here is the command I've tried to copy data from bucket 1 using the manifest in bucket 2:
-- Load data from s3
copy data_feed_eval from 's3://bucket-2/data_files._manifest'
CREDENTIALS 'aws_access_key_id=bucket_1_key;aws_secret_access_key=bucket_1_secret'
manifest
csv gzip delimiter ',' dateformat 'YYYY-MM-DD' timeformat 'YYYY-MM-DD HH:MI:SS'
maxerror 1000 TRUNCATECOLUMNS;
However, when running this command, I get the following error:
09:45:32 [COPY - 0 rows, 7.576 secs] [Code: 500310, SQL State: XX000] [Amazon](500310) Invalid operation: Problem reading manifest file - S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid 901E02533CC5010D,ExtRid tEvf/TVfZzPfSNAFa8iTYjTBjvaHnMMPmuwss58SwopY/sZSkhUBe3yMGHTDyA0yDhDCD7ybX9gl45pV/eQ=,CanRetry 1
Details:
-----------------------------------------------
error: Problem reading manifest file - S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid 901E02533CC5010D,ExtRid tEvf/TVfZzPfSNAFa8iTYjTBjvaHnMMPmuwss58SwopY/sZSkhUBe3yMGHTDyA0yDhDCD7ybX9gl45pV/eQ=,CanRetry 1
code: 8001
context: s3://bucket-2/data_files._manifest
query: 2611231
location: s3_utility.cpp:284
process: padbmaster [pid=10330]
-----------------------------------------------;
I believe the issue here is I'm passing bucket_1 credentials in my COPY command. Is it possible to pass credentials for multiple buckets (bucket_1 with the actual files, and bucket_2 with the manifest) to the COPY command? How should I approach this assuming I don't have write access to bucket_1?
You have indicated that the bucket_1_key key (which belongs to an IAM user) has permissions limited to read-only access on bucket_1. If that is the case, the error occurs because that key has no permission to read from bucket_2. You mentioned this as a possible cause already, and it is exactly that.
There is no option to supply two sets of keys to the COPY command, but you should consider the following options:
Option 1
According to this "You can specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file."
If there is a common prefix for the set of files you want to load, you can use that prefix in bucket_1 in the COPY command.
See http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html
You have mentioned you have read-only access to bucket 1. Make sure this is sufficient access as defined in http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html#copy-usage_notes-iam-permissions
All the other options require changes to your key/IAM user permissions or Redshift itself.
Option 2
Extend the permissions of the bucket_1_key key so it can read from bucket_2 as well. You will have to make sure that your bucket_1_key key has LIST access to bucket_2 and GET access for the bucket_2 objects (as documented here).
This way you can continue using the bucket_1_key key in the COPY command. This method is referred to as Key-Based Access Control and uses a plain-text access key ID and secret access key. AWS recommends using Role-Based Access Control (option 3) instead.
Option 3
Use an IAM role in the COPY command instead of keys (option 2). This is referred to as Role-Based Access Control, and it is also the strongly recommended authentication option for the COPY command.
This IAM role would need LIST access on buckets 1 and 2 and GET access for the objects in those buckets.
More info about Key-Based and Role-Based Access Control is here.
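As a quick sanity check on where the 403 comes from, a small boto3 script with the same key pair can try to read both buckets (bucket names, keys and credentials below are the placeholders from the question):
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    's3',
    aws_access_key_id='bucket_1_key',
    aws_secret_access_key='bucket_1_secret',
)

# The key must be able to read the manifest in bucket 2 and the data files in bucket 1
for bucket, key in [('bucket-2', 'data_files._manifest'), ('bucket-1', 'file1.gz')]:
    try:
        s3.head_object(Bucket=bucket, Key=key)
        print('OK:', 's3://{}/{}'.format(bucket, key))
    except ClientError as err:
        print('FAILED ({}):'.format(err.response['Error']['Code']), 's3://{}/{}'.format(bucket, key))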

google big query: export table to own bucket results in unexpected error

I'm stuck trying to export a table to my Google Cloud Storage bucket.
Example job id: job_0463426872a645bea8157604780d060d
I tried the Cloud Storage target with a lot of different variations; all reveal the same error. If I try to copy the natality report, it works.
What am I doing wrong?
Thanks!
Daniel
It looks like the error says:
"Table too large to be exported to a single file. Specify a uri including a * to shard export." Try switching the destination URI to something like gs://foo/bar/baz*
Specify the file extension along with the pattern. For example:
gs://foo/bar/baz*.gz in the case of GZIP (compressed)
gs://foo/bar/baz*.csv in the case of CSV (uncompressed)
The foo directory is the bucket name and the bar directory can be your date in string format, which could be generated on the fly.
I was able to do it with:
bq extract --destination_format=NEWLINE_DELIMITED_JSON myproject:mydataset.mypartition gs://mybucket/mydataset/mypartition/{*}.json
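If you would rather run the export from code than the bq CLI, roughly the same thing with the google-cloud-bigquery Python client looks like the sketch below (project, dataset, table and bucket names are the placeholders from the command above; double-check the API against your client version):
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
)

# The * wildcard lets BigQuery shard the export across as many files as it needs
extract_job = client.extract_table(
    'myproject.mydataset.mypartition',
    'gs://mybucket/mydataset/mypartition/part-*.json',
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish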