Unzip a file to s3 - amazon-s3

I am looking for a simple way to extract a zip/gzip file that is present in an S3 bucket to the same bucket location, and to delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x & Boto 2.x)
I see there is an API for Node.js (unzip-to-s3) that does this job, but none for Python.
A couple of implementations I can think of:
1. A simple API to extract the zip file within the same bucket.
2. Use S3 as a filesystem and manipulate the data.
3. Use a data pipeline to achieve this.
4. Transfer the zip to EC2, extract it, and copy it back to S3.
Option 4 would be the least preferred, to minimise the architecture overhead of an EC2 add-on.
I need support in implementing this feature, with integration into Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in advance,
Sundar.

You could try https://www.cloudzipinc.com/, which unzips/expands several different archive formats from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.

I have solved this by using an EC2 instance:
copy the S3 files to a local directory on EC2, extract them there,
and copy that directory back to the S3 bucket.

Sample to unzip to a local directory on the EC2 instance (Boto 2):
import os
import tarfile
import zipfile

def s3Unzip(srcBucket, dst_dir):
    '''
    Decompress the contents of an S3 bucket to the local machine.
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location on the local/EC2 file system
    Returns:
        None
    '''
    s3 = s3Conn                       # existing boto 2 S3 connection, created elsewhere in the script
    bucket = s3.lookup(srcBucket)
    # make sure the local destination directory exists before downloading
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except OSError:
        logger_s3.warning("Exception in creating local directories to extract zip file / folder already exists")
    for key in bucket:
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)
        # pick the archive opener based on the file extension
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')
        cwd = os.getcwd()
        os.chdir(dst_dir)
        try:
            archive = opener(path, mode)
            try:
                archive.extractall()
            finally:
                archive.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        finally:
            os.chdir(cwd)
    s3.close()
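The "copy back to S3" step of the workflow above is not shown in the original post; here is a minimal sketch of it, assuming the same boto 2 connection (s3Conn) and an already-extracted local directory. The function name and the prefix argument are illustrative.
import os

def s3UploadDir(local_dir, dstBucket, prefix=''):
    '''Upload every file under local_dir back to the given S3 bucket (boto 2 sketch).'''
    s3 = s3Conn                               # assumed: existing boto 2 S3 connection
    bucket = s3.lookup(dstBucket)
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            # the key name mirrors the path relative to local_dir
            key_name = os.path.join(prefix, os.path.relpath(local_path, local_dir))
            key = bucket.new_key(key_name)
            key.set_contents_from_filename(local_path)
The parent zip can then be removed with bucket.delete_key(key_name), which covers the "delete after extraction" part of the question.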
Sample code to upload to a MySQL instance
Use the "LOAD DATA LOCAL INFILE" query to upload to MySQL directly:
def upload(file_paths, timeformat):
    '''
    Upload CSV file data to a MySQL RDS instance.
    Args:
        file_paths (list): local paths of the CSV files to load
        timeformat (string): datetime format string passed to str_to_date()
    Returns:
        None
    '''
    for file in file_paths:
        con = connect()          # assumed helper that returns a MySQL connection
        cursor = con.cursor()
        try:
            # note: MySQL user variables use @, so the datetime column is read into @datetime
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' (col1, col2, col3, @datetime, col4) SET datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file: " + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # roll back in case there is any error
            con.rollback()
        finally:
            cursor.close()
            # disconnect from server
            con.close()

Lambda function:
You can use a Lambda function that reads the zipped file into a buffer, gzips the individual files, and re-uploads them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time a new zipped file lands in S3. Here's a full tutorial for exactly this: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
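A minimal sketch of such a handler using boto3, assuming the function is triggered by an S3 ObjectCreated event for the uploaded .zip; the output key naming (member name + '.gz') is illustrative, not prescribed by the tutorial:
import gzip
import io
import zipfile

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # triggered by an S3 "ObjectCreated" event for a .zip upload
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # read the whole zip into memory (fine for small archives)
    zipped = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
        for name in archive.namelist():
            # gzip each member and upload it alongside the original zip
            body = gzip.compress(archive.read(name))
            s3.put_object(Bucket=bucket, Key=name + '.gz', Body=body)

    # delete the parent zip once everything is extracted
    s3.delete_object(Bucket=bucket, Key=key)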

Related

Transfer Files from one s3 Folder to Another s3 Folder using Boto3 Python

I am trying to copy files from one S3 bucket to another with some modifications to the destination path.
The original script is as below:
import boto3
import os

old_bucket_name = 'XT01-sample-data'
old_prefix = 'Test/'
new_bucket_name = 'XT02-sample-data2'
new_prefix = old_bucket_name + '/' + old_prefix

s3 = boto3.resource('s3')
ENCRYPTION = os.environ.get('SERVER_SIDE_ENCRYPTION', 'AES256')
STORAGE_CLASS = os.environ.get('STORAGE_CLASS', 'INTELLIGENT_TIERING')

old_bucket = s3.Bucket(old_bucket_name)
new_bucket = s3.Bucket(new_bucket_name)

extra_args = {
    'ServerSideEncryption': ENCRYPTION,
    'StorageClass': STORAGE_CLASS
}

for obj in old_bucket.objects.filter(Prefix=old_prefix):
    old_source = {'Bucket': old_bucket_name,
                  'Key': obj.key}
    # replace the prefix
    new_key = obj.key.replace(old_prefix, new_prefix, 1)
    new_obj = new_bucket.Object(new_key)
    print("Object old ", obj)
    print("new_key ", new_key)
    print("new_obj ", new_obj)
    new_obj.copy(old_source, ExtraArgs=extra_args)

print("Starting Deletion Loop")
bucket = s3.Bucket(old_bucket_name)
bucket.objects.filter(Prefix=old_prefix).delete()
The above script copies the files from bucket XT01-sample-data, folder Test/,
to the new bucket XT02-sample-data2 under the new path XT01-sample-data/Test1/.
The ask now is to modify the script to add a timestamp to the destination path, so that the files from one source folder land under one timestamp.
E.g.:
We have the below files in the source bucket in various folders
XT01-sample-data/Test1/Test1.1/File1.csv
XT01-sample-data/Test1/Test1.1/File2.csv
XT01-sample-data/Test1/Test1.1/File3.csv
XT01-sample-data/Test1/Test1.2/File1.2.1.csv
XT01-sample-data/Test1/Test1.2/File1.2.2.csv
XT01-sample-data/Test1/Test1.3/File1.3.csv
XT01-sample-data/Test1/Test2/Test2.1/File2.1.csv
Expected output: all files from one subfolder should be placed under one timestamp.
Not all files should be placed under a single timestamp; there should be a level of segregation per folder, based on a timestamp at millisecond level (Unix timestamp).
For Files under folder Test1/Test1.1
XT01-sample-data/Test1/2020/01/23/0000/File1.csv
XT01-sample-data/Test1/2020/01/23/0000/File2.csv
XT01-sample-data/Test1/2020/01/23/0000/File3.csv
For files under folder Test1/Test1.2
XT01-sample-data/Test1/2020/01/23/0003/File1.2.1.csv
XT01-sample-data/Test1/2020/01/23/0003/File1.2.2.csv
For files under folder Test1/Test1.3
XT01-sample-data/Test1/2020/01/23/0004/File1.3.csv
For files under folder Test1/Test2/Test2.1
XT01-sample-data/Test1/2020/01/23/0005/File2.1.csv
I was able to resolve this by making an entry in a DynamoDB table.
Logic:
As soon as the first file of a folder is encountered, an entry (folder → timestamp) is made in the DynamoDB table.
When the next file of that folder is encountered, a corresponding check is made in the DynamoDB table before copying the data, and the same timestamp already present in DynamoDB is used.
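A minimal sketch of that logic, assuming a hypothetical DynamoDB table named folder_timestamps with folder as the partition key (table, attribute, and function names are illustrative, not from the original post):
import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('folder_timestamps')    # hypothetical table name

def timestamp_for_folder(folder):
    """Return the timestamp recorded for this folder, creating one on first sight."""
    item = table.get_item(Key={'folder': folder}).get('Item')
    if item:
        return item['timestamp']
    ts = str(int(time.time() * 1000))           # millisecond Unix timestamp
    table.put_item(Item={'folder': folder, 'timestamp': ts})
    return ts

def new_key_for(obj_key):
    # e.g. 'Test1/Test1.1/File1.csv' -> 'Test1/<timestamp>/File1.csv'
    folder, filename = obj_key.rsplit('/', 1)
    parent = folder.rsplit('/', 1)[0]
    return parent + '/' + timestamp_for_folder(folder) + '/' + filename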

Loading a local file using Cloud Functions to BigQuery returns OSError: [Errno 30] Read-only file system

I am trying to load a Dataframe into BigQuery. I do this as follows:
# Prepare temp file to stream from local file
temp_file = table_name + '-' + str(timestamp_in_ms())
df.to_csv(temp_file, index=None, header=True)

# Define job_config
job_config = bigquery.LoadJobConfig()
job_config.schema = schema
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV

# Create job to load data into table
with open(temp_file, "r+b") as source_file:
    load_job = client.load_table_from_file(source_file, dataset_ref.table(table_name), job_config=job_config)
This works fine in local development; however, when I deploy the Cloud Function it returns the following error:
OSError: [Errno 30] Read-only file system: '{temp_file}'
This happens on the line with open(temp_file, "r+b") as source_file:
Why can it not read local files on the Cloud Function temporary storage? What went wrong?
You probably didn't specify the /tmp folder.
Local Disk
Cloud Functions provides access to a local disk mount point (/tmp)
which is known as a "tmpfs" volume in which data written to the volume
is stored in memory. There is no specific fee associated with this
however writing data to the /tmp mountpoint will consume memory
resources provisioned for the function.
As explained on: https://cloud.google.com/functions/pricing
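A minimal sketch of the fix, reusing df, table_name, timestamp_in_ms, client, dataset_ref and job_config from the question; the only change is building the temp path under /tmp and cleaning it up afterwards:
import os

# /tmp is the only writable location in the Cloud Functions runtime
temp_file = os.path.join('/tmp', table_name + '-' + str(timestamp_in_ms()))
df.to_csv(temp_file, index=None, header=True)

with open(temp_file, "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, dataset_ref.table(table_name), job_config=job_config)

# /tmp is backed by memory, so clean up once the load has finished
load_job.result()
os.remove(temp_file)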

How can I reference an external SQL file using Airflow's BigQuery operator?

I'm currently using Airflow with the BigQuery operator to trigger various SQL scripts. This works fine when the SQL is written directly in the Airflow DAG file. For example:
bigquery_transform = BigQueryOperator(
    task_id='bq-transform',
    bql='SELECT * FROM `example.table`',
    destination_dataset_table='example.destination'
)
However, I'd like to store the SQL in a separate file saved to a storage bucket. For example:
bql='gs://example_bucket/sample_script.sql'
When calling this external file I receive a "Template Not Found" error.
I've seen some examples that load the SQL file into the Airflow DAG folder; however, I'd really like to access files saved to a separate storage bucket. Is this possible?
You can reference any SQL files in your Google Cloud Storage bucket. Here's an example where I call the file Query_File.sql from the SQL directory in my Airflow DAG bucket.
CONNECTION_ID = 'project_name'

with DAG('dag', schedule_interval='0 9 * * *', template_searchpath=['/home/airflow/gcs/dags/'], max_active_runs=15, catchup=True, default_args=default_args) as dag:
    battery_data_quality = BigQueryOperator(
        task_id='task-id',
        sql='/SQL/Query_File.sql',
        destination_dataset_table='project-name.DataSetName.TableName${{ds_nodash}}',
        write_disposition='WRITE_TRUNCATE',
        bigquery_conn_id=CONNECTION_ID,
        use_legacy_sql=False,
        dag=dag
    )
You can also consider using the gcs_to_gcs operator to copy things from your desired bucket into one that is accessible by composer.
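A hedged sketch of that copy step with the contrib GoogleCloudStorageToGoogleCloudStorageOperator from Airflow 1.10.x; the bucket names, object paths and task id below are placeholders:
from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator

# copy the SQL file from an arbitrary bucket into the Composer DAGs bucket,
# where the BigQueryOperator's template_searchpath can pick it up
copy_sql_file = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy-sql-file',                           # placeholder task id
    source_bucket='example_bucket',                    # bucket holding the SQL script
    source_object='sample_script.sql',
    destination_bucket='your-composer-dags-bucket',    # placeholder Composer bucket
    destination_object='dags/SQL/sample_script.sql',
    dag=dag,
)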
Note that download works differently in GoogleCloudStorageDownloadOperator between Airflow versions 1.10.3 and 1.10.15.
def execute(self, context):
    self.object = context['dag_run'].conf['job_name'] + '.sql'
    logging.info('filename in GoogleCloudStorageDownloadOperator: %s', self.object)
    self.filename = context['dag_run'].conf['job_name'] + '.sql'
    self.log.info('Executing download: %s, %s, %s', self.bucket,
                  self.object, self.filename)
    hook = GoogleCloudStorageHook(
        google_cloud_storage_conn_id=self.google_cloud_storage_conn_id,
        delegate_to=self.delegate_to
    )
    file_bytes = hook.download(bucket=self.bucket,
                               object=self.object)
    if self.store_to_xcom_key:
        if sys.getsizeof(file_bytes) < 49344:
            context['ti'].xcom_push(key=self.store_to_xcom_key, value=file_bytes.decode('utf-8'))
        else:
            raise RuntimeError(
                'The size of the downloaded file is too large to push to XCom!'
            )

Google Cloud Data Transfer to a GCS subfolder

I am trying to transfer data from an AWS S3 bucket (e.g. s3://mySrcBkt) to a GCS location (a folder under a bucket, e.g. gs://myDestBkt/myDestination). I could not find this option in the interface, as it only has provision to specify a bucket and not a subfolder. Nor did I find a similar provision in the storagetransfer API. Here is my code snippet:
String SOURCE_BUCKET = .... ;
String ACCESS_KEY = .....;
String SECRET_ACCESS_KEY = .....;
String DESTINATION_BUCKET = .......;
String STATUS = "ENABLED";

TransferJob transferJob =
    new TransferJob()
        .setName(NAME)
        .setDescription(DESCRIPTION)
        .setProjectId(PROJECT)
        .setTransferSpec(
            new TransferSpec()
                .setObjectConditions(new ObjectConditions()
                    .setIncludePrefixes(includePrefixes))
                .setTransferOptions(new TransferOptions()
                    .setDeleteObjectsFromSourceAfterTransfer(false)
                    .setOverwriteObjectsAlreadyExistingInSink(false)
                    .setDeleteObjectsUniqueInSink(false))
                .setAwsS3DataSource(
                    new AwsS3Data()
                        .setBucketName(SOURCE_BUCKET)
                        .setAwsAccessKey(
                            new AwsAccessKey()
                                .setAccessKeyId(ACCESS_KEY)
                                .setSecretAccessKey(SECRET_ACCESS_KEY)))
                .setGcsDataSink(
                    new GcsData()
                        .setBucketName(DESTINATION_BUCKET)))
        .setSchedule(
            new Schedule()
                .setScheduleStartDate(date)
                .setScheduleEndDate(date)
                .setStartTimeOfDay(time))
        .setStatus(STATUS);
Unfortunately I could not find anywhere to specify the destination folder for this transfer. I know gsutil rsync offers something similar, but scale and data integrity are a concern. Can anyone guide me to a way or workaround to achieve this goal?
Since the bucket, and not a subdirectory, is the only available option for the data transfer destination, the workaround for this scenario would be to do the transfer to your bucket and then run an rsync operation between the bucket and the subdirectory. Just keep in mind that you should first run gsutil -m rsync -r -d -n to verify what it will do, as you could delete data accidentally.
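A hedged sketch of that workaround, assuming the transfer service dropped the data into a staging bucket (gs://myStagingBkt is a placeholder) and it should end up under gs://myDestBkt/myDestination; the dry run with -n comes first, as advised above:
import subprocess

STAGING = 'gs://myStagingBkt'                  # where the transfer landed (placeholder)
DESTINATION = 'gs://myDestBkt/myDestination'   # desired subfolder

# dry run: -n reports what would be copied or deleted without touching anything
subprocess.run(['gsutil', '-m', 'rsync', '-r', '-d', '-n', STAGING, DESTINATION], check=True)

# once the dry-run output looks correct, run the same command without -n
subprocess.run(['gsutil', '-m', 'rsync', '-r', '-d', STAGING, DESTINATION], check=True)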

What would cause a 'BotoServerError: 400 Bad Request' when calling create_application_version?

I have included some code that uploads a war file into an s3 bucket (creating the bucket first if it does not exist). It then creates an elastic beanstalk application version using the just-uploaded war file.
Assume /tmp/server.war exists and is a valid war file. The following code will fail with boto.exception.BotoServerError: BotoServerError: 400 Bad Request:
#!/usr/bin/env python
import time
import boto

BUCKET_NAME = 'foo_bar23498'

s3 = boto.connect_s3()
bucket = s3.lookup(BUCKET_NAME)
if not bucket:
    bucket = s3.create_bucket(BUCKET_NAME, location='')

version_label = 'server%s' % int(time.time())

# upload the war file
key_name = '%s.war' % version_label
s3key = bucket.new_key(key_name)
print 'uploading war file...'
s3key.set_contents_from_filename('/tmp/server.war',
                                 headers={'Content-Type': 'application/x-zip'})

# uses us-east-1 by default
eb = boto.connect_beanstalk()
eb.create_application_version(
    application_name='TheApp',
    version_label=version_label,
    s3_bucket=BUCKET_NAME,
    s3_key=key_name,
    auto_create_application=True)
What would cause this?
One possible cause of this error is the bucket name. Apparently you can have s3 bucket names that contain underscores, but you cannot create application versions using keys in those buckets.
If you change the BUCKET_NAME line above to
BUCKET_NAME = 'foo-bar23498'
it should work.
Yes, it feels weird to be answering my own question... apparently this is the recommended approach for this situation on Stack Overflow. I hope I save someone else a whole lot of debugging time.