Change UTF-16 encoding to UTF-8 encoding for files in AWS S3

My main goal is to have AWS Glue move files stored in S3 into a database in RDS. My current issue is that the files arrive with UTF-16 LE encoding, and AWS Glue will only process text files with UTF-8 encoding (see https://docs.aws.amazon.com/glue/latest/dg/glue-dg.pdf, pg. 5 footnote). On my local machine, Python can easily change the encoding like this:
from pathlib import Path
path = Path('file_path')
path.write_text(path.read_text(encoding="utf16"), encoding="utf8")
I attempted to implement this in a Glue job as such:
bucketname = "bucket_name"
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucketname)
subfolder_path = "folder1/folder2"
file_filter = "folder2/file_header_identifier"
for obj in my_bucket.objects.filter(Prefix=file_filter):
    filename = obj.key.split('/')[-1]
    file_path = Path("s3://{}/{}/{}".format(bucketname, subfolder_path, filename))
    file_path.write_text(file_path.read_text(encoding="utf16"), encoding="utf8")
I'm not getting an error in Glue, but it is not changing the text encoding of my file. When I try something similar in Lambda (which is probably the wiser service to work with), I get an error that `s3` has no attribute `Bucket`. I'd prefer to keep all this ETL work in Glue for convenience.
I'm very new to AWS so any advice is welcomed.
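For what it's worth, `pathlib.Path` cannot address `s3://` URIs (it just builds a local-looking path string), which would explain why the loop runs without errors but changes nothing. A minimal sketch of doing the re-encode through the S3 API instead, assuming boto3 credentials are configured; the bucket and key names are placeholders:

```python
def reencode_utf16_to_utf8(raw):
    """Decode UTF-16 bytes (BOM-aware) and re-encode them as UTF-8."""
    return raw.decode("utf-16").encode("utf-8")

def convert_s3_object(bucket, key):
    # boto3 is imported here so the pure helper above works without AWS deps
    import boto3
    s3 = boto3.client("s3")
    # read the whole object, re-encode it, and write it back in place
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.put_object(Bucket=bucket, Key=key, Body=reencode_utf16_to_utf8(raw))
```

The same helper works unchanged inside a Glue Python shell job or a Lambda handler, since both only need boto3.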

Related

KeyError: 'ETag' while trying to load data from S3 to Sagemaker

I unloaded a 500 MB file from Redshift into S3; instead of saving it as a single file, the UNLOAD split it into several chunks, and now I am trying to access it from S3 in AWS SageMaker. When I try to read the file using pd.read_csv and dask.dataframe.read_csv I get a KeyError: 'ETag'.
I'm a newbie to AWS, please do help me.
Are you trying to import using a bucket name with /'s in it? The top-level bucket is read in with
my_bucket = s3.Bucket("data-bucket-named")
and then the subfolders can be read in with:
subfolders = "subfolder1/subfolder2/subfolder3"
csvs = []
for object_summary in my_bucket.objects.filter(Prefix=subfolders):
    key = object_summary.key
    if key.endswith(".csv"):
        csvs.append(key)

all_data = pd.DataFrame({})
for file in csvs:
    df = pd.read_csv(f's3://{"data-bucket-named"}/{file}')
    all_data = pd.concat([all_data, df])
Hope that helps.
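Since Redshift UNLOAD writes one file per slice, the part files have to be read individually and concatenated; pointing read_csv at the common prefix (which is not itself an object) is likely what produces the KeyError: 'ETag'. A minimal sketch with in-memory stand-ins for the part files; in the real loop each part would be an `s3://` key as above:

```python
import io
import pandas as pd

def concat_csv_parts(parts):
    """Read several CSV chunks (paths or file-like objects) into one frame."""
    frames = [pd.read_csv(p) for p in parts]
    return pd.concat(frames, ignore_index=True)

# hypothetical local stand-ins for the UNLOAD part files in S3
part1 = io.StringIO("a,b\n1,2\n")
part2 = io.StringIO("a,b\n3,4\n")
df = concat_csv_parts([part1, part2])
print(len(df))  # 2
```

`ignore_index=True` keeps the row index continuous across parts, which the plain `pd.concat` in the answer above does not.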

issue with reading csv file from AWS S3 with boto3

I have a csv file with the following columns:
Name Adress/1 Adress/2 City State
When I try to read this csv file from local disk I have no issue.
But when I try to read it from S3 with the code below, I get an error when I use io.StringIO.
When I use io.BytesIO, each record displays as one column. Though the file is ','-separated, some columns do contain '\n' or '\t', and I believe these are causing the issue.
I used AWS Wrangler with no issue, but my requirement is to read this csv file with boto3:
import io
import pandas as pd
import boto3

s3 = boto3.resource('s3', aws_access_key_id=AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
my_bucket = s3.Bucket(AWS_S3_BUCKET)
csv_obj = my_bucket.Object(key=key).get().get('Body').read().decode('utf16')
data = io.BytesIO(csv_obj)  # io.StringIO(csv_obj)
sdf = pd.read_csv(data,delimiter=sep,names=cols, header=None,skiprows=1)
print(sdf)
Any suggestion please?
try get_object():
obj = boto3.client('s3').get_object(Bucket=AWS_S3_BUCKET, Key=key)
data = io.StringIO(obj['Body'].read().decode('utf-8'))
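Note that the file in the question was decoded with 'utf16', so decoding the body with 'utf-8' will fail for that file. A minimal, self-contained sketch of the UTF-16 path, using an in-memory stand-in for `obj['Body'].read()`:

```python
import io
import pandas as pd

# stand-in for obj['Body'].read(): a UTF-16 encoded CSV payload
raw = "Name,City\nAlice,Austin\n".encode("utf-16")

# decode with the encoding the file actually uses, then hand read_csv
# a text stream; StringIO needs str, which is why BytesIO on decoded
# text produced the one-column result described in the question
text = raw.decode("utf-16")
df = pd.read_csv(io.StringIO(text))
print(df.shape)  # (1, 2)
```

Embedded '\n' or '\t' inside fields is only safe if those fields are quoted in the file; read_csv handles quoted newlines by default.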

Using pandas to open Excel files stored in GCS from command line

The following code snippet is from a Google tutorial, it simply prints the names of files on GCP in a given bucket:
from google.cloud import storage
def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)

list_blobs('sn_project_data')
Now from the command line I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick to open files and needs to use StringIO in the process. But it doesn't support excel files, so it's not an effective solution.
read_excel() does not support a Google Cloud Storage file path as of now, but it can read data in bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None,
index_col=None, usecols=None, squeeze=False, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True,
verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True,
storage_options=None)
Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
What you can do is use the blob object and use download_as_bytes() to convert the object into bytes.
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x & Boto 2.x)
I see there is an API for node.js (unzip-to-s3) to do this job, but none for Python.
A couple of implementations I can think of:
1. A simple API to extract the zip file within the same bucket.
2. Use S3 as a filesystem and manipulate data.
3. Use a data pipeline to achieve this.
4. Transfer the zip to EC2, extract, and copy back to S3.
Option 4 would be the least preferred, to minimise the architecture overhead of the EC2 add-on.
I need support in getting this feature implemented, with integration with Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in Advance,
Sundar.
You could try https://www.cloudzipinc.com/ that unzips/expands several different formats of archives from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
Solved by using an EC2 instance:
copy the S3 files to a local directory on EC2, extract them, and copy that directory back to the S3 bucket.
Sample code to unzip to a local directory on the EC2 instance:
def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 file system
    Returns:
        None
    '''
    s3 = s3Conn
    bucket = s3.lookup(srcBucket)
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except Exception:
        logger_s3.warning("Exception in creating local directories to extract zip file / folder already exists")
    for key in bucket:
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')
        cwd = os.getcwd()
        os.chdir(dst_dir)
        try:
            file = opener(path, mode)
            try:
                file.extractall()
            finally:
                file.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        os.chdir(cwd)
    s3.close()
Sample code to upload to a MySQL instance.
Use the "LOAD DATA LOCAL INFILE" query to upload to MySQL directly:
def upload(file_path, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_path (string): local file path
        timeformat (string): datetime format passed to str_to_date
    Returns:
        None
    '''
    for file in file_path:
        try:
            con = connect()
            cursor = con.cursor()
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx
                     FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
                     (col1, col2, col3, @datetime, col4)
                     SET datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file: " + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # rollback in case there is any error
            con.rollback()
        cursor.close()
    # disconnect from server
    con.close()
Lambda function:
You can use a Lambda function where you read zipped files into the buffer, gzip the individual files, and reupload them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time a new zipped file lands in S3. Here's a full tutorial for exactly this: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
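The in-memory part of that Lambda approach can be sketched with only the standard library; the S3 download and re-upload calls (`s3.get_object` / `s3.put_object` on each extracted member) are covered in the tutorial and omitted here:

```python
import io
import zipfile

def unzip_bytes(zip_bytes):
    """Extract a zip archive held in memory; return {member_name: bytes}.

    In a Lambda handler, zip_bytes would come from
    s3.get_object(...)['Body'].read(), and each member would be
    re-uploaded with s3.put_object before deleting the original.
    """
    out = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            out[name] = zf.read(name)
    return out

# build a small in-memory archive to demonstrate
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
members = unzip_bytes(buf.getvalue())
print(list(members))  # ['a.txt']
```

Because everything stays in memory, this only suits archives that fit within the Lambda memory limit; larger files would need streaming or the EC2 route above.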

What would cause a 'BotoServerError: 400 Bad Request' when calling create_application_version?

I have included some code that uploads a war file into an s3 bucket (creating the bucket first if it does not exist). It then creates an elastic beanstalk application version using the just-uploaded war file.
Assume /tmp/server.war exists and is a valid war file. The following code will fail with boto.exception.BotoServerError: 400 Bad Request:
#!/usr/bin/env python
import time
import boto

BUCKET_NAME = 'foo_bar23498'

s3 = boto.connect_s3()
bucket = s3.lookup(BUCKET_NAME)
if not bucket:
    bucket = s3.create_bucket(BUCKET_NAME, location='')
version_label = 'server%s' % int(time.time())

# upload the war file
key_name = '%s.war' % version_label
s3key = bucket.new_key(key_name)
print 'uploading war file...'
s3key.set_contents_from_filename('/tmp/server.war',
                                 headers={'Content-Type': 'application/x-zip'})

# uses us-east-1 by default
eb = boto.connect_beanstalk()
eb.create_application_version(
    application_name='TheApp',
    version_label=version_label,
    s3_bucket=BUCKET_NAME,
    s3_key=key_name,
    auto_create_application=True)
What would cause this?
One possible cause of this error is the bucket name. Apparently you can have s3 bucket names that contain underscores, but you cannot create application versions using keys in those buckets.
If you change the BUCKET_NAME line above to
BUCKET_NAME = 'foo-bar23498'
it should work.
Yes, it feels weird to be answering my own question... apparently this is the recommended approach for this situation on Stack Overflow. I hope I save someone else a whole lot of debugging time.
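The distinction this answer relies on can be checked up front: legacy us-east-1 buckets may contain underscores, but only DNS-compliant names (lowercase letters, digits, dots, hyphens) work everywhere. A rough validity check; the regex is a simplification of the full S3 naming rules, not a complete implementation of them:

```python
import re

def eb_safe_bucket_name(name):
    """Return True if the bucket name sticks to the DNS-safe subset
    (lowercase letters, digits, dots, hyphens; 3-63 chars; starts and
    ends with a letter or digit) that Beanstalk expects."""
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name) is not None

print(eb_safe_bucket_name("foo_bar23498"))  # False
print(eb_safe_bucket_name("foo-bar23498"))  # True
```

Running this before `create_bucket` would have turned the opaque 400 into an obvious naming error.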