We are trying to migrate data from AWS S3 to GCP Cloud Storage. We tried a Transfer Job in GCP and it works fine, so we want to achieve the same programmatically with AWS Lambda, since we have dependencies on AWS.
When I try importing the google.cloud module I get an import error in the Lambda CloudWatch logs.
Here is my code:
import os
import logging
from io import BytesIO  # S3 objects are bytes, so use BytesIO rather than StringIO

import boto3
from google.cloud import storage  # provided by the google-cloud-storage package

# Setup logging
LOG = logging.getLogger(__name__)
LOG.setLevel(os.environ.get('LOG_LEVEL', 'INFO'))

GCS_BUCKET_NAME = os.environ['GCS_BUCKET_NAME']
S3 = boto3.client('s3')
def lambda_handler(event, context):
    try:
        l_t_bucketKey = _getKeys(event)

        # Create Google Cloud Storage client
        storage_client = storage.Client()
        gcs_bucket = storage_client.get_bucket(GCS_BUCKET_NAME)

        LOG.debug('About to copy %d files', len(l_t_bucketKey))
        for bucket, key in l_t_bucketKey:
            try:
                inFileObj = BytesIO()
                S3.download_fileobj(
                    Bucket=bucket,
                    Key=key,
                    Fileobj=inFileObj
                )
                blob = gcs_bucket.blob(key)
                blob.upload_from_file(inFileObj, rewind=True)  # seek(0) before reading the file obj
                LOG.info('Copied s3://%s/%s to gcs://%s/%s', bucket, key, GCS_BUCKET_NAME, key)
            except Exception:
                LOG.exception('Error copying file: {k}'.format(k=key))
        return 'SUCCESS'
    except Exception:
        LOG.exception('Lambda function failed:')
        return 'ERROR'
except Exception as e:
LOG.exception("Lambda function failed:")
return 'ERROR'
def _getKeys(d_event):
    """
    Extracts (bucket, key) pairs from an S3 event.
    :param d_event: Event dict
    :return: List of tuples (bucket, key)
    """
    l_t_bucketKey = []
    if d_event:
        if 'Records' in d_event and d_event['Records']:
            for d_record in d_event['Records']:
                try:
                    bucket = d_record['s3']['bucket']['name']
                    key = d_record['s3']['object']['key']
                    l_t_bucketKey.append((bucket, key))
                except (KeyError, TypeError):
                    LOG.warning('Error extracting bucket and key from event')
    return l_t_bucketKey
I downloaded the google-cloud-storage package from the PyPI website and added it to an AWS Lambda layer. Please help me by pointing to the best place to download this module from.
A Google Cloud Storage bucket can be accessed through its S3-compatible (interoperability) API, so you can use it from your Lambda functions without any extra GCP libraries.
source_client = boto3.client(
    's3',
    endpoint_url='https://storage.googleapis.com',
    aws_access_key_id=os.environ['GCP_KEY'],
    aws_secret_access_key=os.environ['GCP_SECRET']
)
To get the access key and secret, go to the GCS bucket settings -> Interoperability -> Access keys for your user account -> Create a key.
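Building on that, a minimal Lambda handler could look roughly like the sketch below (GCP_KEY, GCP_SECRET and GCS_BUCKET_NAME are assumed environment variables, and get_object/put_object buffers each object in memory, so this only suits reasonably small files):

import os
import boto3

# Regular S3 client, using the Lambda execution role's credentials
s3 = boto3.client('s3')

# Google Cloud Storage through its S3-compatible XML API, authenticated with HMAC interoperability keys
gcs = boto3.client(
    's3',
    endpoint_url='https://storage.googleapis.com',
    aws_access_key_id=os.environ['GCP_KEY'],
    aws_secret_access_key=os.environ['GCP_SECRET'],
)

def lambda_handler(event, context):
    # Copy every object referenced in the incoming S3 event notification over to GCS
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        gcs.put_object(Bucket=os.environ['GCS_BUCKET_NAME'], Key=key, Body=body)
    return 'SUCCESS'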
Is there a way I can save a PySpark or Pandas dataframe from Databricks to blob storage without mounting or installing libraries?
I was able to achieve this after mounting the storage container into Databricks and using the library com.crealytics.spark.excel, but I was wondering if I can do the same without the library and without mounting, because I will be working on clusters that don't have these two permissions.
Here is the code for saving the dataframe locally to DBFS:
# export
from os import path
folder = "export"
name = "export"
file_path_name_on_dbfs = path.join("/tmp", folder, name)
# Writing to DBFS
# .coalesce(1) is used to generate only 1 file; if the dataframe is too big this won't work, so you'll have multiple part files and need to handle them one by one (see the sketch after this block)
sampleDF \
.coalesce(1) \
.write \
.mode("overwrite") \
.option("header", "true") \
.option("delimiter", ";") \
.option("encoding", "UTF-8") \
.csv(file_path_name_on_dbfs)
# path of destination, which will be sent to az storage
dest = file_path_name_on_dbfs + ".csv"
# Renaming part-000...csv to our file name
target_file = list(filter(lambda file: file.name.startswith("part-00000"), dbutils.fs.ls(file_path_name_on_dbfs)))
if len(target_file) > 0:
    dbutils.fs.mv(target_file[0].path, dest)
    dbutils.fs.cp(dest, f"file://{dest}")  # this line is for community edition only, because /dbfs is not recognized there, so we copy the file locally
    dbutils.fs.rm(file_path_name_on_dbfs, True)
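If coalescing to a single file is not feasible, a small sketch along these lines could rename every part file instead (the _{i}.csv naming is just an example):

# Rename all CSV part files produced by Spark to predictable names
part_files = [f for f in dbutils.fs.ls(file_path_name_on_dbfs) if f.name.startswith("part-")]
for i, f in enumerate(part_files):
    dbutils.fs.mv(f.path, f"{file_path_name_on_dbfs}_{i}.csv")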
The code that will send the file to Azure Storage:
import requests
sas="YOUR_SAS_TOKEN_PREVIOUSLY_CREATED" # follow the link below to create SAS token (using sas is slightly more secure than raw key storage)
blob_account_name = "YOUR_BLOB_ACCOUNT_NAME"
container = "YOUR_CONTAINER_NAME"
destination_path_w_name = "export/export.csv"
url = f"https://{blob_account_name}.blob.core.windows.net/{container}/{destination_path_w_name}?{sas}"
# here we read the content of our previously exported df -> csv
# if you are not on community edition you might want to use /dbfs + dest
payload=open(dest).read()
headers = {
'x-ms-blob-type': 'BlockBlob',
'Content-Type': 'text/csv' # you can change the content type according to your needs
}
response = requests.request("PUT", url, headers=headers, data=payload)
# a 201 status code means your file was created successfully
print(response.status_code)
Follow this link to set up a SAS token.
Remember that anyone who obtains the SAS token can access your storage, depending on the permissions you set when creating it.
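As a small extra precaution, the token can also be read from a Databricks secret scope instead of being hard-coded in the notebook (the scope and key names below are assumptions):

# Read the SAS token from a secret scope rather than embedding it in the notebook
sas = dbutils.secrets.get(scope="my-scope", key="blob-sas-token")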
Code for the Excel export version (using com.crealytics:spark-excel_2.12:0.14.0).
Saving the dataframe:
data = [
    ('a', 25, 'ast'),
    ('b', 15, 'phone'),
    ('c', 32, 'dlp'),
    ('d', 45, 'rare'),
    ('e', 60, 'phq')
]
columns = ["column1", "column2", "column3"]
sampleDF = spark.createDataFrame(data=data, schema=columns)
sampleDF.show()
# export
from os import path
folder = "export"
name = "export"
file_path_name_on_dbfs = path.join("/tmp", folder, name)
# Writing to DBFS
sampleDF.write.format("com.crealytics.spark.excel")\
.option("header", "true")\
.mode("overwrite")\
.save(file_path_name_on_dbfs + ".xlsx")
# excel
dest = file_path_name_on_dbfs + ".xlsx"
dbutils.fs.cp(dest, f"file://{dest}") # this line is added for community edition only cause /dbfs is not recognized, so we copy the file locally
Uploading the file to Azure Storage:
import requests
sas="YOUR_SAS_TOKEN_PREVIOUSLY_CREATED" # follow the link below to create SAS token (using sas is slightly more secure than raw key storage)
blob_account_name = "YOUR_BLOB_ACCOUNT_NAME"
container = "YOUR_CONTAINER_NAME"
destination_path_w_name = "export/export.xlsx"
# destination_path_w_name = "export/export.csv"
url = f"https://{blob_account_name}.blob.core.windows.net/{container}/{destination_path_w_name}?{sas}"
# here we read the content of our previously exported Excel file (binary mode)
# if you are not on community edition you might want to use /dbfs + dest
# payload=open(dest).read()
payload=open(dest, 'rb').read()
headers = {
'x-ms-blob-type': 'BlockBlob',
# 'Content-Type': 'text/csv'
'Content-Type': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
}
response = requests.request("PUT", url, headers=headers, data=payload)
# a 201 status code means your file was created successfully
print(response.status_code)
I am trying to create an S3 bucket using the AWS SDK for Python in PyCharm and I am facing the following error:
"An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to."
Here is my code:
import os
import boto3
from botocore.exceptions import ClientError
ACCESS_KEY = 'AWS_ACCESS_KEY_ID'
SECRET_KEY = 'AWS_SECRET_ACCESS_KEY'
PRI_BUCKET_NAME = 'soundcloud2'
TRANSIENT_BUCKET_NAME = 'soundcloud3'
def main():
    """entry point"""
    access = os.getenv(ACCESS_KEY)
    secret = os.getenv(SECRET_KEY)
    s3 = boto3.resource('s3', aws_access_key_id=access, aws_secret_access_key=secret)
    create_bucket(TRANSIENT_BUCKET_NAME, s3)

def create_bucket(name, s3):
    try:
        bucket = s3.create_bucket(Bucket=name)
    except ClientError as ce:
        print('error', ce)

if __name__ == '__main__':
    main()
Try modifying your code as below:
bucket = s3.create_bucket(
    Bucket=name,
    CreateBucketConfiguration={'LocationConstraint': 'ap-south-1'}
)
or whichever region you are targeting; the CreateBucketConfiguration block is required for every region except us-east-1 (the default, where it must be omitted).
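A region-aware variant of the create_bucket helper from the question might look like this sketch (the default region here is just an example; the boto3 resource itself should also be created with a matching region_name):

from botocore.exceptions import ClientError  # already imported in the question's script

def create_bucket(name, s3, region='ap-south-1'):
    """Create a bucket, passing a LocationConstraint only when the region requires it."""
    try:
        if region == 'us-east-1':
            # us-east-1 is the default endpoint and must not be given a LocationConstraint
            return s3.create_bucket(Bucket=name)
        return s3.create_bucket(
            Bucket=name,
            CreateBucketConfiguration={'LocationConstraint': region},
        )
    except ClientError as ce:
        print('error', ce)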
I have a problem with setting tags on S3 buckets with Python boto.
I'm connecting to my own Ceph storage and trying this:
import boto
import boto.s3.connection
from boto.s3.tagging import Tags, TagSet

conn = boto.connect_s3(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    host=RGW_HOST,
    port=RGW_PORT,
    is_secure=RGW_SECURE,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

new_id = '10'
bucket = conn.get_bucket(new_id)

tag_set = TagSet()
tag_set.add_tag(key='a', value='b')
tags = Tags()
tags.add_tag_set(tag_set)
bucket.set_tags(tags)
But I get this error:
boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidArgument</Code><BucketName>ipo36</BucketName><RequestId>tx000000000000000000035-005ac4c3cf-1063bb-default</RequestId><HostId>1063bb-default-default</HostId></Error>
Does anyone know what I am doing wrong?
These days I would recommend using boto3 rather than boto 2.
Here's some code that works:
import boto3
client = boto3.client('s3', region_name='ap-southeast-2')
tag={'TagSet':[{'Key': 'Department', 'Value': 'Finance'}]}
response = client.put_bucket_tagging(Bucket='my-bucket', Tagging=tag)
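To check that it worked, you can read the tags back with the same client (bucket name as above):

response = client.get_bucket_tagging(Bucket='my-bucket')
print(response['TagSet'])  # [{'Key': 'Department', 'Value': 'Finance'}]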
I'm attempting to delete an S3 bucket using the boto3 library:
import boto3
s3 = boto3.client('s3')
bucket = s3.Bucket('my-bucket')
response = bucket.delete()
I get the following error:
"errorType": "AttributeError",
"errorMessage": "'S3' object has no attribute 'Bucket'"
I cannot see what's wrong... Thanks
try this:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
bucket.delete()
I think doing it this way is more robust, since the API doesn't allow deleting a non-empty bucket.
import boto3
BUCKET_NAMES = [
"buckets",
"to",
"remove"
]
for bucket_name in BUCKET_NAMES:
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(bucket_name)
    bucket_versioning = s3.BucketVersioning(bucket_name)
    if bucket_versioning.status == 'Enabled':
        bucket.object_versions.delete()
    else:
        bucket.objects.all().delete()
    response = bucket.delete()
This is because the client interface (boto3.client) doesn't have .Bucket(); only boto3.resource does, so this would work:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
response = bucket.delete()
Quoted from the docs:
Resources represent an object-oriented interface to Amazon Web Services (AWS). They provide a higher-level abstraction than the raw, low-level calls made by service clients.
Generally speaking, if you are using boto3, resources should probably be your preferred interface most of the time.
The 'S3' with a capital S in the error message is simply the name of the class that boto3 generates for the service client, so the error really does come from calling .Bucket() on the object returned by boto3.client('s3'); it is not a typo in your code.
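You can confirm this yourself; the region below is only there so the client can be constructed without any other configuration:

import boto3

# The dynamically generated client class for the 's3' service is named 'S3'
client = boto3.client('s3', region_name='us-east-1')
print(type(client).__name__)  # -> S3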
Personally, I'd just do it this way:
import boto3
s3 = boto3.client('s3')
bucket = 'my-bucket'
response = s3.delete_bucket(Bucket=bucket)
I am using almost identical code to upload files to Amazon S3 and to Google Cloud Storage, respectively, using boto:
import os
import boto
import boto.s3.key

filename = 'abc.png'
filenameWithPath = os.path.dirname(os.path.realpath(__file__)) + '/' + filename
cloudFilename = 'uploads/' + filename
# Upload to Amazon S3
conn = boto.connect_s3(aws_access_key_id=AWS_ACCESS_KEY, aws_secret_access_key=AWS_SECRET_KEY)
bucket = conn.get_bucket(AWS_BUCKET_NAME)
fpic = boto.s3.key.Key(bucket)
fpic.key = cloudFilename
fpic.set_contents_from_filename(filenameWithPath)
# Upload to Google Cloud Storage
conn = boto.connect_gs(gs_access_key_id=GS_ACCESS_KEY, gs_secret_access_key=GS_SECRET_KEY)
bucket = conn.get_bucket(GS_BUCKET_NAME)
fpic = boto.s3.key.Key(bucket)
fpic.key = cloudFilename
fpic.set_contents_from_filename(filenameWithPath)
The Amazon S3 part of the code runs perfectly. However, the Google Cloud Storage part raises TypeError: 'str' does not support the buffer interface at the fpic.set_contents_from_filename(...) call.
What is the problem?