Scrapy with S3 support

I have been struggling for the last couple of hours and seem to be missing something. I am trying to connect Scrapy to Amazon S3 but keep getting an error saying that the bucket does not exist (it does; I have checked a dozen times).
The error message:
2016-11-01 22:58:08 [scrapy] ERROR: Error storing csv feed (30 items) in: s3://onvista.s3-website.eu-central-1.amazonaws.com/feeds/vista/2016-11-01T21-57-21.csv
in combination with
botocore.exceptions.ClientError: An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist
My settings.py:
ITEM_PIPELINES = {
    'onvista.pipelines.OnvistaPipeline': 300,
    # 'scrapy.pipelines.files.S3FilesStore': 600
}
AWS_ACCESS_KEY_ID = 'key'
AWS_SECRET_ACCESS_KEY = 'secret'
FEED_URI = 's3://onvista.s3-website.eu-central-1.amazonaws.com/feeds/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'
Does anyone have a working configuration that I could look at?

Instead of referring to an Amazon S3 bucket via its Hosted Website URL, refer to it by name.
The Scrapy Feed Exports documentation gives an example of:
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
In your case, that would make it:
s3://onvista/feeds/%(name)s/%(time)s.json
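As a minimal sketch of the corrected settings.py, assuming the bucket really is named onvista (inferred from the URL in the question) and keeping the asker's CSV format and placeholder credentials:

```python
# settings.py (sketch): refer to the bucket by name, not by its website endpoint.
AWS_ACCESS_KEY_ID = 'key'          # placeholder, as in the question
AWS_SECRET_ACCESS_KEY = 'secret'   # placeholder, as in the question

# Scrapy fills in %(name)s (spider name) and %(time)s (timestamp) when exporting.
FEED_URI = 's3://onvista/feeds/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'

# Illustration only: the URI for the spider name and timestamp seen in the log above.
example = FEED_URI % {'name': 'vista', 'time': '2016-11-01T21-57-21'}
print(example)  # s3://onvista/feeds/vista/2016-11-01T21-57-21.csv
```

The part right after s3:// is treated as the bucket name, which is why the website endpoint hostname produced NoSuchBucket.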

Related

Boto3 generate presigned url does not work

Here is the code I use to create an S3 client and generate a presigned URL. It is fairly standard code and has been running on the server for quite a while. I pulled the code out and ran it locally in a Jupyter notebook:
import os

import boto3


def get_s3_client():
    return get_s3(create_session=False)


def get_s3(create_session=False):
    # Use a dedicated session if requested, otherwise the boto3 default session
    session = boto3.session.Session() if create_session else boto3
    S3_ENDPOINT = os.environ.get('AWS_S3_ENDPOINT')
    if S3_ENDPOINT:
        AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
        AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']
        AWS_DEFAULT_REGION = os.environ["AWS_DEFAULT_REGION"]
        s3 = session.client('s3',
                            endpoint_url=S3_ENDPOINT,
                            aws_access_key_id=AWS_ACCESS_KEY_ID,
                            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                            region_name=AWS_DEFAULT_REGION)
    else:
        s3 = session.client('s3', region_name='us-east-2')
    return s3


s3 = get_s3_client()
BUCKET = "[my-bucket-name]"      # placeholder
OBJECT_KEY = "[my-object-name]"  # placeholder
signed_url = s3.generate_presigned_url(
    'get_object',
    ExpiresIn=3600,
    Params={
        "Bucket": BUCKET,
        "Key": OBJECT_KEY,
    }
)
print(signed_url)
When I tried to download the file using the url in the browser, I got an error message and it says "The specified key does not exist." I noticed in the error message that my object key becomes "[my-bucket-name]/[my-object-name]" rather than just "[my-object-name]".
Then I used the same bucket/key combination to generate a presigned URL with the AWS CLI, which works as expected. Compared with the AWS CLI output, the boto3 S3 client somehow inserted [my-bucket-name] in front of [my-object-name]. Here are the results:
From s3.generate_presigned_url()
https://[my-bucket-name].s3.us-east-2.amazonaws.com/[my-bucket-name]/[my-object-name]?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAV17K253JHUDLKKHB%2F20210520%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20210520T175014Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=5cdcc38e5933e92b5xed07b58e421e5418c16942cb9ac6ac6429ac65c9f87d64
From aws cli s3 presign
https://[my-bucket-name].s3.us-east-2.amazonaws.com/[my-object-name]?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAYA7K15LJHUDAVKHB%2F20210520%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20210520T155926Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=58208f91985bf3ce72ccf884ba804af30151d158d6ba410dd8fe9d2457369894
I've been working on this and searching for solutions for a day and a half, and I couldn't find out what was wrong with my implementation. I guess I might have overlooked some basic but important setting when creating the S3 client with boto3, or something else. Thanks for the help!

OK, mystery solved: I shouldn't pass the endpoint_url=S3_ENDPOINT parameter when I create the S3 client; boto3 will figure the endpoint out. After I removed it, everything worked as expected.
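A minimal sketch of that fix, under one assumption that is not in the original post: AWS_S3_ENDPOINT is only passed through when it points at a non-AWS, S3-compatible service (for example a local test server). The helper name s3_client_kwargs is hypothetical. Credentials are not passed explicitly here because boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment on its own.

```python
import os


def s3_client_kwargs(env=None):
    """Build keyword arguments for boto3's client('s3', ...).

    Assumption for this sketch: a custom endpoint is only used when it is
    not an amazonaws.com host; for AWS itself, boto3 is left to derive the
    endpoint from the region.
    """
    env = os.environ if env is None else env
    kwargs = {"region_name": env.get("AWS_DEFAULT_REGION", "us-east-2")}
    endpoint = env.get("AWS_S3_ENDPOINT")
    if endpoint and "amazonaws.com" not in endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


def get_s3():
    import boto3  # imported lazily so the helper above works without boto3
    return boto3.client("s3", **s3_client_kwargs())
```

With no endpoint_url, the presigned URL should come out in the same form as the AWS CLI result shown in the question.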

S3 always returns error code: NoSuchKey even with incorrect Bucket Name instead of some specific error code detailing about bucket

S3 always returns the error code NoSuchKey, i.e.
"when the bucket name given in the request is incorrect"
or
"when the bucket name given in the request is correct but the object key is invalid"
Is there any way to make the S3 API return a specific error code stating that the bucket does not exist, instead of the generic NoSuchKey, when an invalid bucket name is passed in an object request?

First, check that the S3 object URL and the requested object URL are the same. Then check that the S3 upload is handled properly when it runs asynchronously.
A GetObject request may have been issued before the upload completed.
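One way to rule out the upload race is to wait for the object before reading it. The following is only a sketch: the function name is hypothetical, and it assumes `client` is a boto3 S3 client, whose "object_exists" waiter polls HeadObject until the key appears.

```python
def get_object_when_ready(client, bucket, key):
    """Wait until the object exists, then fetch it.

    `client` is expected to be a boto3 S3 client. The waiter raises an
    error if the object does not appear within its retry budget.
    """
    waiter = client.get_waiter("object_exists")
    waiter.wait(Bucket=bucket, Key=key)
    return client.get_object(Bucket=bucket, Key=key)


# Usage (requires boto3 and valid credentials):
#   import boto3
#   body = get_object_when_ready(boto3.client("s3"), "mybucket", "mykey")["Body"].read()
```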

"get_bucket_tagging" for s3 buckets give error, when there are no tags present

I am trying to get S3 bucket tags using "get_bucket_tagging".
Code:
response = client.get_bucket_tagging(Bucket='bucket_name')
print(response['TagSet'])
I get output as long as the bucket has tags, but I get the following error when there are no tags:
An error occurred (NoSuchTagSet) when calling the GetBucketTagging operation: The TagSet does not exist
Is there any other method to check that?

From this document:
NoSuchTagSetError - There is no tag set associated with the bucket.
So when there is no tag set associated with the bucket, an exception is expected, and you need to handle it.
import boto3
from botocore.exceptions import ClientError

client = boto3.client('s3')
try:
    response = client.get_bucket_tagging(Bucket='bucket_name')
    print(response['TagSet'])
except ClientError as e:
    # Handle the exception, e.g. treat NoSuchTagSet as "no tags"
    print(e)
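The handling can be wrapped in a small helper that returns an empty list for an untagged bucket. This is a sketch, not the only way to do it: the function name is hypothetical, and `client` is assumed to be a boto3 S3 client, which exposes the generic error class as client.exceptions.ClientError.

```python
def get_bucket_tags(client, bucket):
    """Return the bucket's tags as a list, or [] if it has no tag set."""
    try:
        return client.get_bucket_tagging(Bucket=bucket)["TagSet"]
    except client.exceptions.ClientError as err:
        if err.response["Error"]["Code"] == "NoSuchTagSet":
            return []
        raise  # anything else (e.g. AccessDenied) is a real problem


# Usage (requires boto3 and valid credentials):
#   import boto3
#   print(get_bucket_tags(boto3.client("s3"), "bucket_name"))
```

Checking the error code keeps unrelated failures visible instead of silently reporting "no tags".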

Amazon Redshift COPY always return S3ServiceException:Access Denied,Status 403

I'm really struggling with how to transfer data from an Amazon S3 bucket to Redshift with the COPY command.
So far, I have created an IAM user and attached the 'AmazonS3ReadOnlyAccess' policy. But when I run a COPY command like the following, an Access Denied error is always returned.
copy my_table from 's3://s3.ap-northeast-2.amazonaws.com/mybucket/myobject' credentials 'aws_access_key_id=<...>;aws_secret_access_key=<...>' REGION'ap-northeast-2' delimiter '|';
Error:
Amazon Invalid operation: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid EB18FDE35E1E0CAB,ExtRid ,CanRetry 1
Details: -----------------------------------------------
error: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid EB18FDE35E1E0CAB,ExtRid ,CanRetry 1
code: 8001
context: Listing bucket=s3.ap-northeast-2.amazonaws.com prefix=mybucket/myobject
query: 1311463
location: s3_utility.cpp:542
process: padbmaster [pid=4527]
-----------------------------------------------;
Can anyone give me some clues or advice?
Thanks a lot!

Remove the endpoint s3.ap-northeast-2.amazonaws.com from the S3 path:
COPY my_table
FROM 's3://mybucket/myobject'
CREDENTIALS ''
REGION 'ap-northeast-2'
DELIMITER '|'
;
(See the examples in the documentation.) While the Access Denied error is definitely misleading, the returned message gives some hint as to what went wrong:
bucket=s3.ap-northeast-2.amazonaws.com
prefix=mybucket/myobject
We'd expect to see bucket=mybucket and prefix=myobject, though.
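To illustrate how the two paths are read, here is a small sketch that splits an s3:// URI by the usual convention (first component is the bucket, the rest is the key prefix). It is not Redshift's actual parser, and the function name is hypothetical, but it reproduces the bucket/prefix pair from the error context.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# The path from the question: the endpoint hostname is taken as the bucket.
print(split_s3_uri("s3://s3.ap-northeast-2.amazonaws.com/mybucket/myobject"))
# ('s3.ap-northeast-2.amazonaws.com', 'mybucket/myobject')

# The corrected path.
print(split_s3_uri("s3://mybucket/myobject"))
# ('mybucket', 'myobject')
```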

Check the encryption of the bucket.
According to the documentation: https://docs.aws.amazon.com/en_us/redshift/latest/dg/c_loading-encrypted-files.html
The COPY command automatically recognizes and loads files encrypted using SSE-S3 and SSE-KMS.
Check the kms: rules on your key or role.
If the files come from EMR, check the security configurations for S3.

Your Redshift cluster role does not have the right to access the S3 bucket. Make sure the role you use for Redshift has access to the bucket and that the bucket does not have a policy that blocks the access.

Amazon S3 error- A conflicting conditional operation is currently in progress against this resource.

Why do I get this error when I try to create a bucket in Amazon S3?

This error means that the bucket was recently deleted and is queued for deletion in S3. You must wait until the name is available again.

This error means that the bucket was recently deleted and is queued for deletion in S3. You must wait until the bucket name is available again.
Note that I also received this error when my access privileges were blocked.
The error means your operation to create a new bucket in S3 was aborted.
There can be multiple reasons for this. Check the following points to fix the error:
Is the bucket name available, or is it queued for deletion?
Do you have adequate access privileges for this operation?
Your bucket name must be unique.
P.S.: I edited this answer to add more details shared by Sanity below; his answer is more accurate and has updated information.
You can view the related errors for this operation here.
I am editing my answer so that the correct answer posted below can be selected as the accepted answer to this question.

Creating an S3 bucket policy and the S3 public access block for the same bucket at the same time will cause the error.
Terraform example
resource "aws_s3_bucket_policy" "allow_alb_access_bucket_elb_log" {
  bucket = local.bucket_alb_log_id
  policy = data.aws_iam_policy_document.allow_alb_access_bucket_elb_log.json
}
resource "aws_s3_bucket_public_access_block" "lb_log" {
  bucket              = local.bucket_alb_log_id
  block_public_acls   = true
  block_public_policy = true
}
Solution
resource "aws_s3_bucket_public_access_block" "lb_log" {
  bucket              = local.bucket_alb_log_id
  block_public_acls   = true
  block_public_policy = true
  #--------------------------------------------------------------------------------
  # To avoid OperationAborted: A conflicting conditional operation is currently in progress
  #--------------------------------------------------------------------------------
  depends_on = [
    aws_s3_bucket_policy.allow_alb_access_bucket_elb_log
  ]
}

We have also observed this error several times when trying to move a bucket from one account to another. To do this, follow these steps:
Back up the content of the S3 bucket you want to move.
Delete the S3 bucket in the original account.
Wait for 1/2 hours
Create a bucket with the same name in the other account.
Restore the S3 bucket backup.

I received this error running a terraform apply with the error:
Error: error creating public access block policy for S3 bucket
(bucket-name): OperationAborted: A conflicting
conditional operation is currently in progress against this resource.
Please try again.
status code: 409, request id: 30B386F1FAA8AB9C, host id: M8flEj6+ncWr0174ftzHd74CXBjhlY8Ys70vTyORaAGWA2rkKqY6pUECtAbouqycbAZs4Imny/c=
It said to "please try again" which I did and it worked the second time. It seems there wasn't enough wait time when provisioning the initial resource with Terraform.

To fully resolve this error, I inserted a 5-second sleep between the requests. I did not have to do anything else.
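The sleep-and-retry approach can be sketched as a small wrapper. The names here (call_with_retry and its parameters) are hypothetical, and the sketch assumes the error object carries a botocore-style response["Error"]["Code"], as boto3's ClientError does.

```python
import time


def call_with_retry(operation, attempts=3, delay=5.0, sleep=time.sleep,
                    retry_code="OperationAborted"):
    """Call operation(); if it fails with retry_code, sleep and try again.

    Errors with any other code, or a failure on the last attempt, are
    re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:
            code = getattr(err, "response", {}).get("Error", {}).get("Code")
            if code != retry_code or attempt == attempts:
                raise
            sleep(delay)  # give S3 time to finish the conflicting operation


# Usage (requires boto3 and valid credentials; bucket name is a placeholder):
#   import boto3
#   s3 = boto3.client("s3")
#   call_with_retry(lambda: s3.create_bucket(Bucket="bucket-name"))
```

The 5-second default mirrors the delay mentioned above; a longer delay or more attempts may be needed if the bucket name was only recently released.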
