What I want to do: download an S3 file (a PDF) in a Lambda and extract its text, using Rust.
The Error:
ERROR PDF error: Invalid file header
I checked the PDF file in the bucket and downloaded it from the console, and everything looks correct, so something must be breaking in the way I store the file.
How I am doing it:
let config = aws_config::load_from_env().await;
let client = s3::Client::new(&config);
// Get uploaded object in raw bucket (serde derived the json)
let key = event.records.get(0).unwrap().s3.object.key.clone();
let key = key.replace('+', " ");
let key = percent_encoding::percent_decode_str(&key).decode_utf8().unwrap().to_string();
let content = client
    .get_object()
    .bucket(raw_bucket_name)
    .key(&key)
    // .response_content_type("application/pdf") // this did not make any difference
    .send()
    .await?;
let mut bytes = content.body.into_async_read();
let file = tempfile::NamedTempFile::new()?;
let path = file.into_temp_path();
let mut file = tokio::fs::File::create(&path).await?;
tokio::io::copy(&mut bytes, &mut file).await?;
let content = pdf_extract::extract_text(path)?; // this line breaks
Versions:
tokio = { version = "1", features = ["macros"] }
aws-sdk-s3 = "0.21.0"
aws-config = "0.51.0"
pdf-extract = "0.6.4"
I feel like I have misunderstood something about how to store the bytestream, but e.g. https://stackoverflow.com/a/62003659/4986655 does it in the same way, as far as I can see.
Any help or pointers on what the issue might be or how to debug this are very welcome.
boto3 documentation does not clearly specify how to update the user metadata of an already existing S3 Object.
It can be done using the copy_from() method -
import boto3
s3 = boto3.resource('s3')
s3_object = s3.Object('bucket-name', 'key')
s3_object.metadata.update({'id':'value'})
s3_object.copy_from(CopySource={'Bucket':'bucket-name', 'Key':'key'}, Metadata=s3_object.metadata, MetadataDirective='REPLACE')
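To verify that the update took effect, you can fetch the object's metadata again afterwards (a minimal sketch, reusing the placeholder bucket and key from above; a fresh Object is created so the metadata is re-read from S3):

import boto3

s3 = boto3.resource('s3')
# Re-fetch the object's head and print the user-defined metadata
updated = s3.Object('bucket-name', 'key').metadata
print(updated)  # should now include 'id': 'value'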
You can do this using copy_from() on the resource (as this answer mentions), but you can also use the client's copy_object() and specify the same source and destination. The methods are equivalent and invoke the same code underneath.
import boto3
s3 = boto3.client("s3")
src_key = "my-key"
src_bucket = "my-bucket"
s3.copy_object(Key=src_key, Bucket=src_bucket,
               CopySource={"Bucket": src_bucket, "Key": src_key},
               Metadata={"my_new_key": "my_new_val"},
               MetadataDirective="REPLACE")
The 'REPLACE' value specifies that the metadata passed in the request should overwrite the source metadata entirely. If you only mean to add new key-value pairs, or to delete just some keys, you have to read the original metadata first, edit it locally, and then write it back.
To replace only a subset of the metadata correctly (a combined sketch follows below):
1. Retrieve the original metadata with head_object(Key=src_key, Bucket=src_bucket). Also take note of the ETag in the response.
2. Make the desired changes to the metadata locally.
3. Call copy_object as above to upload the new metadata, but pass CopySourceIfMatch=original_etag in the request to ensure the remote object has the metadata you expect before overwriting it. original_etag is the value you got in step 1. If the metadata (or the data itself) has changed since head_object was called (e.g. by another program running concurrently), copy_object will fail with an HTTP 412 error.
Reference: boto3 issue 389
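Putting those steps together, a minimal sketch could look like this (reusing src_bucket and src_key from above, and assuming only the user-defined metadata is being edited):

import boto3

s3 = boto3.client("s3")

# Step 1: read the current metadata and remember the ETag
head = s3.head_object(Bucket=src_bucket, Key=src_key)
metadata = head["Metadata"]
original_etag = head["ETag"]

# Step 2: edit the metadata locally
metadata["my_new_key"] = "my_new_val"

# Step 3: write it back, guarded by the ETag from step 1; if the object changed
# in the meantime, copy_object fails with an HTTP 412 (precondition failed) error
s3.copy_object(Key=src_key, Bucket=src_bucket,
               CopySource={"Bucket": src_bucket, "Key": src_key},
               CopySourceIfMatch=original_etag,
               Metadata=metadata,
               MetadataDirective="REPLACE")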
Similar to this answer, but with the existing metadata preserved while modifying only what is needed. Of the system-defined metadata, I've only preserved ContentType and ContentDisposition in this example. Other system-defined metadata can be preserved similarly (see the sketch after the code).
import boto3
s3 = boto3.client('s3')
response = s3.head_object(Bucket=bucket_name, Key=object_name)
response['Metadata']['new_meta_key'] = "new_value"
response['Metadata']['existing_meta_key'] = "new_value"
result = s3.copy_object(Bucket=bucket_name, Key=object_name,
                        CopySource={'Bucket': bucket_name, 'Key': object_name},
                        Metadata=response['Metadata'],
                        MetadataDirective='REPLACE', TaggingDirective='COPY',
                        ContentDisposition=response['ContentDisposition'],
                        ContentType=response['ContentType'])
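As noted above, other system-defined fields can be preserved the same way. A small sketch (reusing bucket_name, object_name and response from the code above, and only forwarding a few fields that copy_object accepts, when they are present in the head_object response):

# Forward additional system-defined fields only if the object actually has them
extra_args = {}
for field in ('CacheControl', 'ContentEncoding', 'ContentLanguage'):
    if field in response:
        extra_args[field] = response[field]

result = s3.copy_object(Bucket=bucket_name, Key=object_name,
                        CopySource={'Bucket': bucket_name, 'Key': object_name},
                        Metadata=response['Metadata'],
                        MetadataDirective='REPLACE', TaggingDirective='COPY',
                        ContentDisposition=response['ContentDisposition'],
                        ContentType=response['ContentType'],
                        **extra_args)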
You can update metadata either by adding something new or by replacing a current metadata value with a new one. Here is the piece of code I am using:
import sys
import os
import boto3
import pprint
from boto3 import client
from botocore.utils import fix_s3_host
param_1 = YOUR_ACCESS_KEY
param_2 = YOUR_SECRET_KEY
param_3 = YOUR_END_POINT
param_4 = YOUR_BUCKET

# Create the S3 client
s3ressource = client(
    service_name='s3',
    endpoint_url=param_3,
    aws_access_key_id=param_1,
    aws_secret_access_key=param_2,
    use_ssl=True,
)
# Build a list of objects per bucket
def BuildObjectListPerBucket(variablebucket):
    global listofObjectstobeanalyzed
    listofObjectstobeanalyzed = []
    extensions = ['.jpg', '.png']
    for key in s3ressource.list_objects(Bucket=variablebucket)["Contents"]:
        # print(key['Key'])
        onemoreObject = key['Key']
        if onemoreObject.endswith(tuple(extensions)):
            listofObjectstobeanalyzed.append(onemoreObject)
            # print(listofObjectstobeanalyzed)
        else:
            s3ressource.delete_object(Bucket=variablebucket, Key=onemoreObject)
    return listofObjectstobeanalyzed
# for a given existing object, create metadata
def createmetdata(bucketname, objectname):
    s3ressource.upload_file(objectname, bucketname, objectname, ExtraArgs={"Metadata": {"metadata1": "ImageName", "metadata2": "ImagePROPERTIES", "metadata3": "ImageCREATIONDATE"}})

# for a given existing object, add new metadata
def ADDmetadata(bucketname, objectname):
    s3_object = s3ressource.get_object(Bucket=bucketname, Key=objectname)
    k = s3ressource.head_object(Bucket=bucketname, Key=objectname)
    m = k["Metadata"]
    m["new_metadata"] = "ImageNEWMETADATA"
    s3ressource.copy_object(Bucket=bucketname, Key=objectname, CopySource=bucketname + '/' + objectname, Metadata=m, MetadataDirective='REPLACE')

# for a given existing object, update a metadata key with a new value
def CHANGEmetadata(bucketname, objectname):
    s3_object = s3ressource.get_object(Bucket=bucketname, Key=objectname)
    k = s3ressource.head_object(Bucket=bucketname, Key=objectname)
    m = k["Metadata"]
    m.update({'watson_visual_rec_dic': 'ImageCREATIONDATEEEEEEEEEEEEEEEEEEEEEEEEEE'})
    s3ressource.copy_object(Bucket=bucketname, Key=objectname, CopySource=bucketname + '/' + objectname, Metadata=m, MetadataDirective='REPLACE')
def readmetadata(bucketname, objectname):
    ALLDATAOFOBJECT = s3ressource.get_object(Bucket=bucketname, Key=objectname)
    ALLDATAOFOBJECTMETADATA = ALLDATAOFOBJECT['Metadata']
    print(ALLDATAOFOBJECTMETADATA)

# create the list of objects on a per-bucket basis
BuildObjectListPerBucket(param_4)

# Call functions to see the results
for objectitem in listofObjectstobeanalyzed:
    # CALL the function you want
    readmetadata(param_4, objectitem)
    ADDmetadata(param_4, objectitem)
    readmetadata(param_4, objectitem)
    CHANGEmetadata(param_4, objectitem)
    readmetadata(param_4, objectitem)
I am using Sagemaker and have a bunch of model.tar.gz files that I need to unpack and load in sklearn. I've been testing using list_objects with delimiter to get to the tar.gz files:
response = s3.list_objects(
    Bucket = bucket,
    Prefix = 'aleks-weekly/models/',
    Delimiter = '.csv'
)

for i in response['Contents']:
    print(i['Key'])
And then I plan to extract with
import tarfile
tf = tarfile.open(model.read())
tf.extractall()
But how do I get to the actual tar.gz file from S3 instead of some boto3 object?
You can download objects to files using s3.download_file(). This will make your code look like:
import os
import boto3

s3 = boto3.client('s3')
bucket = 'my-bukkit'
prefix = 'aleks-weekly/models/'

# List objects matching your criteria
response = s3.list_objects(
    Bucket = bucket,
    Prefix = prefix,
    Delimiter = '.csv'
)

# Iterate over each file found and download it
for i in response['Contents']:
    key = i['Key']
    # Keys contain '/', so create the matching local directories first
    dest = os.path.join('/tmp', key)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    print("Downloading file", key, "from bucket", bucket)
    s3.download_file(
        Bucket = bucket,
        Key = key,
        Filename = dest
    )
I got this exception when calling the put_bucket_acl(**kwargs) method of boto3.client to set a bucket ACL. (PS: it's a Ceph object storage bucket, not AWS.)
My code:
import boto3
import copy
s3_client = boto3.client('s3',
                         aws_access_key_id=s3_conf['ak'],
                         aws_secret_access_key=s3_conf['sk'],
                         endpoint_url=s3_conf["host"])
bucket_acl = s3.BucketAcl(test_bucket)
bucket_acl.grants.append(new_grants)
bucket_acl.put(ACL='private', AccessControlPolicy={'Grants': bucket_acl.grants, 'Owner': bucket_acl.owner})
I also tried Session.client:
session = Session(s3_conf["ak"], s3_conf["sk"])
s3 = session.resource("s3", endpoint_url=s3_conf["host"])
s3_client = session.client("s3", endpoint_url=s3_conf["host"])
rsp = s3_client.get_bucket_acl(Bucket=test_bucket)
old_access_control_policy = { 'Grants': copy.deepcopy(rsp['Grants']), 'Owner': copy.deepcopy(rsp['Owner']) }
new_access_control_policy = copy.deepcopy(old_access_control_policy)
new_access_control_policy['Grants'].append(new_grants)
s3_client.put_bucket_acl(Bucket=test_bucket, ACL='private', AccessControlPolicy=old_access_control_policy)
If I remove the AccessControlPolicy parameter, it succeeds:
s3_client.put_bucket_acl(Bucket=test_bucket, ACL='private')
Am I calling this method in the wrong way? The guide also calls this method in the same way: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.put_bucket_acl
Thanks for any help.
It is OK to set the bucket ACL like this:
s3_client.put_bucket_acl(Bucket=test_bucket, AccessControlPolicy={...})
ACL and AccessControlPolicy are both ways of specifying the access control list, and only one of them can be used in a single request, so passing ACL='private' together with AccessControlPolicy is what fails. (For reference, the parameters of put_bucket_acl are: ACL, AccessControlPolicy, Bucket, ContentMD5, GrantFullControl, GrantRead, GrantReadACP, GrantWrite, GrantWriteACP.) Later I need to read the botocore code and find the differences between Ceph S3 and Amazon S3. Thanks for reading.
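In other words, pass either a canned ACL or an explicit policy, not both in the same call. A minimal sketch (reusing s3_client, test_bucket and new_grants from the question):

# Option 1: canned ACL only
s3_client.put_bucket_acl(Bucket=test_bucket, ACL='private')

# Option 2: explicit policy only, built from the current grants plus the new one
rsp = s3_client.get_bucket_acl(Bucket=test_bucket)
rsp['Grants'].append(new_grants)
s3_client.put_bucket_acl(Bucket=test_bucket,
                         AccessControlPolicy={'Grants': rsp['Grants'], 'Owner': rsp['Owner']})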
I am trying to upload gzip compressed files using Google's BigQuery Java client API. I am able to upload normal files without any issue. But gzip fails with the error "Invalid content type 'application/x-gzip'. Uploads must have content type 'application/octet-stream'".
Below is my code.
val pid = "****"
val dsid = "****"
val tid = "****"
val br = Source.fromFile(new File("****")).bufferedReader()
val mapper = new ObjectMapper()
val schemaFields = mapper.readValue(br, classOf[util.ArrayList[TableFieldSchema]])
val tschema = new TableSchema().setFields(schemaFields)
val tr = new TableReference().setProjectId(pid).setDatasetId(dsid).setTableId(tid)
val jc = new JobConfigurationLoad().setDestinationTable(tr)
  .setSchema(tschema)
  .setSourceFormat("NEWLINE_DELIMITED_JSON")
  .setCreateDisposition("CREATE_IF_NEEDED")
  .setWriteDisposition("WRITE_APPEND")
  .setIgnoreUnknownValues(true)
val fmr = new SimpleDateFormat("dd-MM-yyyy_HH-mm-ss-SSS")
val now = fmr.format(new Date())
val loadJob = new Job().setJobReference(new JobReference().setJobId(Joiner.on("-")
      .join("INSERT", pid, dsid, tid, now))
    .setProjectId(pid))
  .setConfiguration(new JobConfiguration().setLoad(jc))
// val data = new FileContent(MediaType.OCTET_STREAM.toString, new File("/Users/jegan/sessions/34560-6")) // This works.
val data = new FileContent(MediaType.GZIP.toString, new File("/Users/jegan/sessions/34560-6"))
val bq = BQHelper.createAuthorizedClientWithDefaultCredentials()
val job = bq.jobs().insert(pid, loadJob, data).execute()
And from this link, I see that we need to use resumable upload to achieve this.
https://cloud.google.com/bigquery/loading-data-post-request#resumable
But the issue is that I am using the Java client library from Google. How do I do a resumable upload using this library? There doesn't seem to be much information on this, or I am missing something. Has anyone ever done this? Please point me to some documentation/samples. Thanks.
If application/octet-stream works, just use that. We don't use the media type for anything important.
That said, I thought I changed it so that we'd accept any media type. Are you using the most recent version of the Java client library?