How to construct S3 URL for copying to Redshift?

I am trying to import a CSV file into a Redshift cluster. I have successfully completed the example in the Redshift documentation. Now I am trying to COPY from my own CSV file.
This is my command:
copy frontend_chemical from 's3://awssampledb/mybucket/myfile.CSV'
credentials 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx'
delimiter ',';
This is the error I see:
An error occurred when executing the SQL command:
copy frontend_chemical from 's3://awssampledb/mybucket/myfile.CSV'
credentials 'aws_access_key_id=XXXX...'
[Amazon](500310) Invalid operation: The specified S3 prefix 'mybucket/myfile.CSV' does not exist
Details:
-----------------------------------------------
error: The specified S3 prefix 'mybucket/myfile.CSV' does not exist
code: 8001
context:
query: 3573
location: s3_utility.cpp:539
process: padbmaster [pid=2432]
-----------------------------------------------;
Execution time: 0.7s
1 statement failed.
I think I'm constructing the S3 URL wrong, but how should I do it?
My Redshift cluster is in the US East (N Virginia) region.

In an S3 URL, the first element after s3:// is always the bucket name, so your command is looking in a bucket called awssampledb for a key prefix of mybucket/myfile.CSV, which is why that prefix is reported as not existing.
The Amazon Redshift COPY command loads all files that match a given prefix, in parallel. For example, if your bucket is mybucket and the files sit under the path data, refer to them as:
s3://mybucket/data
The COPY command then becomes:
COPY frontend_chemical
FROM 's3://mybucket/data'
CREDENTIALS 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx'
DELIMITER ',';
This will load all files under the data prefix. You can also refer to a specific file by including it in the path, e.g. s3://mybucket/data/file.csv.
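If you prefer to run the COPY from Python rather than a SQL client, a minimal sketch using psycopg2 might look like the following. The connection details and the bucket path are placeholders, not values from the question:

import psycopg2

# Placeholder connection details -- replace with your own cluster endpoint and database.
conn = psycopg2.connect(
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="xxxxx",
)

copy_sql = """
    COPY frontend_chemical
    FROM 's3://mybucket/data'
    CREDENTIALS 'aws_access_key_id=xxxxx;aws_secret_access_key=xxxxx'
    DELIMITER ',';
"""

# Redshift performs the COPY server-side; nothing streams through the client.
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()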

Related

Azcopy command issue with parameters

I'm using Azcopy within a shell script to copy blobs within a container from one storage account to another on Azure.
Using the following command:
azcopy copy "https://$source_storage_name.blob.core.windows.net/$container_name/?$source_sas" "https://$dest_storage_name.blob.core.windows.net/$container_name/?$dest_sas" --recursive
I'm generating the SAS token for both source and destination accounts and passing them as parameters in the command above along with the storage account and container names.
On execution, I keep getting this error:
failed to parse user input due to error: the inferred source/destination combination could not be identified, or is currently not supported
When I manually enter the storage account names, container name and SAS tokens, the command executes successfully and storage data gets transferred as expected. However, when I use parameters in the azcopy command I get the error.
Any suggestions on this would be greatly appreciated.
Thanks!
You can use the PowerShell script below:
param
(
    [string] $source_storage_name,
    [string] $source_container_name,
    [string] $dest_storage_name,
    [string] $dest_container_name,
    [string] $source_sas,
    [string] $dest_sas
)
.\azcopy.exe copy "https://$source_storage_name.blob.core.windows.net/$source_container_name/?$source_sas" "https://$dest_storage_name.blob.core.windows.net/$dest_container_name/?$dest_sas" --recursive=true
To execute the script, run the following command:
.\ScriptFileName.ps1 -source_storage_name "<XXXXX>" -source_container_name "<XXXXX>" -source_sas "<XXXXXX>" -dest_storage_name "<XXXXX>" -dest_container_name "<XXXXXX>" -dest_sas "<XXXXX>"
I generated the SAS tokens for both storage accounts from the Azure portal. Make sure to check all of the resource-type and permission boxes when generating them.
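As a side note, if you ever drive the same copy from Python instead of a shell or PowerShell script, the URLs can be assembled the same way with string interpolation. This is just a hedged sketch of that idea; all variable values below are placeholders you would supply yourself:

import subprocess

source_storage_name = "sourceaccount"   # placeholder
dest_storage_name = "destaccount"       # placeholder
container_name = "mycontainer"          # placeholder
source_sas = "sv=..."                   # placeholder SAS token
dest_sas = "sv=..."                     # placeholder SAS token

source_url = f"https://{source_storage_name}.blob.core.windows.net/{container_name}/?{source_sas}"
dest_url = f"https://{dest_storage_name}.blob.core.windows.net/{container_name}/?{dest_sas}"

# Passing the URLs as separate list elements avoids shell-quoting problems
# with the '&' characters inside the SAS tokens.
subprocess.run(["azcopy", "copy", source_url, dest_url, "--recursive"], check=True)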

Copying from an S3 bucket to Redshift, I get system messages in raw_line of stl_load_errors

I have a tab-separated text file in an s3 bucket which I would like to upload to redshift.
My redshift query looks like this:
COPY rk_test_tab
from 's3://my_bucket/my_file'
iam_role 'arn:aws:iam::XXX:role/XXX'
IGNOREHEADER 1
BLANKSASNULL
EMPTYASNULL
MAXERROR 10
DELIMITER '\t'
;
A raw line from the data file read via Python looks like this:
b'1008498338\t1.0\t1\t1\tCBDT\n'
This fails with the message Load into table 'rk_test_tab' failed. Check 'stl_load_errors' system table for details.
When I look at the stl_load_errors the raw_line field is very odd:
[GC Worker Start (ms): Min: 430.9, Avg: 430.9, Max: 431.0, Diff: 0.2]
(Double checked there are no lines like this in my file)
The err_reason is given as Delimiter not found.
If I remove the IGNOREHEADER 1 line, I get an error because my headers don't match the expected data field types, but crucially the raw_line does match what is in the file, so the file in the S3 bucket is getting through.
So my question is: what could be going wrong such that Redshift appears to be reading in what look like system log messages (garbage collection output?).
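For reference, when digging into failures like this it can help to pull the relevant columns out of stl_load_errors programmatically rather than eyeballing them. A minimal sketch using psycopg2; the connection details are placeholders:

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="dev", user="awsuser", password="xxxxx",
)

query = """
    SELECT filename, line_number, colname, err_reason, raw_line
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for filename, line_number, colname, err_reason, raw_line in cur.fetchall():
        # filename shows which S3 object the offending line actually came from.
        print(filename.strip(), line_number, colname.strip(), err_reason.strip(), raw_line[:80])

conn.close()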

(InternalError) when calling the SelectObjectContent operation in boto3

I have a series of files that are in JSON that need to be split into multiple files to reduce their size. One issue is that the files are extracted using a third party tool and arrive as a JSON object on a single line.
I can use S3 Select to process a small file (say around 300 MB uncompressed), but when I try to use a larger file, say 1 GB uncompressed (90 MB gzip compressed), I get the following error:
[ERROR] EventStreamError: An error occurred (InternalError) when calling the SelectObjectContent operation: We encountered an internal error. Please try again.
The query that I am trying to run is:
select count(*) as rowcount from s3object[*][*] s
I can't run the query from the console because the file is larger than 128 MB, but the code that performs the operation is as follows:
import logging

import boto3

LOGGER = logging.getLogger(__name__)
S3_CLIENT = boto3.client("s3")

def execute_select_query(bucket, key, query):
    """
    Runs an S3 Select query against an object in S3.
    """
    # S3 Select supports GZIP/BZIP2/NONE; infer GZIP from the key suffix.
    if key.endswith("gz"):
        compression = "GZIP"
    else:
        compression = "NONE"
    LOGGER.info("Running query |%s| against s3://%s/%s", query, bucket, key)
    return S3_CLIENT.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=query,
        InputSerialization={"JSON": {"Type": "DOCUMENT"}, "CompressionType": compression},
        OutputSerialization={"JSON": {}},
    )
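For completeness, the value returned by select_object_content is an event stream, so the caller has to iterate over it to pull out the Records payloads. A usage sketch with placeholder bucket and key names:

response = execute_select_query(
    bucket="my-bucket",               # placeholder bucket
    key="exports/big-file.json.gz",   # placeholder key
    query="select count(*) as rowcount from s3object[*][*] s",
)

# Records events carry the result bytes; Stats events carry scan statistics.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
    elif "Stats" in event:
        details = event["Stats"]["Details"]
        print("Scanned:", details["BytesScanned"], "Returned:", details["BytesReturned"])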

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip file present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x and Boto 2.x.)
I see there is an API for Node.js (unzip-to-s3) to do this job, but none for Python.
A couple of implementations I can think of:
1. A simple API to extract the zip file within the same bucket.
2. Use S3 as a filesystem and manipulate data.
3. Use a data pipeline to achieve this.
4. Transfer the zip to EC2, extract it, and copy it back to S3.
Option 4 would be the least preferred, to minimise the architectural overhead of adding EC2.
I need support in getting this feature implemented, with integration to Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in Advance,
Sundar.
You could try https://www.cloudzipinc.com/, which unzips/expands several different archive formats from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
I have solved this by using an EC2 instance: copy the S3 files to a local directory on EC2, extract them there, and copy the directory back to the S3 bucket.
Sample code to unzip to a local directory on the EC2 instance:
import os
import logging
import zipfile
import tarfile

import boto  # Boto 2.x, as used in the question

logger_s3 = logging.getLogger("s3_unzip")

def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 local file system
    Returns:
        None
    '''
    s3 = boto.connect_s3()          # uses credentials from the environment / ~/.boto
    bucket = s3.lookup(srcBucket)

    # Create the destination directory up front so downloads have somewhere to land.
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except OSError:
        logger_s3.warning("Exception in creating local directories to extract zip file / folder already exists")

    cwd = os.getcwd()
    for key in bucket:
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)

        # Pick the right archive opener based on the file extension.
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')

        os.chdir(dst_dir)
        try:
            archive = opener(path, mode)
            try:
                archive.extractall()
            finally:
                archive.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        finally:
            os.chdir(cwd)

    s3.close()
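A hypothetical invocation, with the bucket name and extraction directory as placeholders:

s3Unzip("my-source-bucket", "/tmp/s3_extract")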
Sample code to upload to a MySQL instance
Use the LOAD DATA LOCAL INFILE query to upload the CSV data to MySQL directly:
logger_rds = logging.getLogger("rds_upload")

def upload(file_paths, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_paths (list of string): local csv file paths to load
        timeformat (string): str_to_date() format of the datetime column
    Returns:
        None
    '''
    for file in file_paths:
        con = connect()  # assumed helper that returns a MySQL connection (e.g. MySQLdb/pymysql)
        cursor = con.cursor()
        try:
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx
                     FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
                     (col1, col2, col3, @datetime, col4)
                     SET datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file: " + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # Rollback in case there is any error
            con.rollback()
        finally:
            cursor.close()
            # disconnect from server
            con.close()
Lambda function:
You can use a Lambda function that reads the zipped files into a buffer, gzips the individual files, and re-uploads them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time a new zipped file lands in S3. Here's a full tutorial for exactly this: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
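A minimal sketch of what such a Lambda handler could look like, assuming an S3 ObjectCreated trigger; bucket/key handling and error handling are simplified, and all names are placeholders rather than code from the tutorial:

import io
import gzip
import zipfile

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event for a .zip upload.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read the whole zip into memory; fine for modest archive sizes.
    zipped = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
        for member in archive.namelist():
            # Gzip each member individually and upload it next to the original.
            data = archive.read(member)
            buf = io.BytesIO()
            with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
                gz.write(data)
            s3.put_object(Bucket=bucket, Key=member + ".gz", Body=buf.getvalue())

    # Remove the parent zip once everything has been re-uploaded.
    s3.delete_object(Bucket=bucket, Key=key)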

Google BigQuery - bq load failure displays a file number; how do I get the file name?

I'm running the following bq command
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=1000 --replace raw_data.order_20150131 gs://raw-data/order/order/2050131/* order.json
and I get the following message when loading the data into bq:
*************************************
Waiting on bqjob_r4ca10491_0000014ce70963aa_1 ... (412s) Current status: DONE
BigQuery error in load operation: Error processing job
'orders:bqjob_r4ca10491_0000014ce70963aa_1': Too few columns: expected
11 column(s) but got 1 column(s). For additional help: http://goo.gl/RWuPQ
Failure details:
- File: 844 / Line:1: Too few columns: expected 11 column(s) but got
1 column(s). For additional help: http://goo.gl/RWuPQ
**********************************
The message displays only the file number.
I checked the files' content and most of them are good.
gsutil ls and the Cloud Console, on the other hand, display file names.
How can I know which file it is according to the file number?
There seems to be some weird spacing introduced in the question, but if the desired path to ingest is "*/order.json", that won't work: you can only use "*" at the end of the path when ingesting data into BigQuery.
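If the same load is driven from Python instead of the bq CLI, the error entries on the job object can be inspected programmatically, which sometimes gives more context than the CLI summary. A hedged sketch using the google-cloud-bigquery client; the URI is taken from the question's command, the wildcard is kept at the end of the path, and autodetect stands in for the order.json schema file, which is not reproduced here:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    max_bad_records=1000,
    autodetect=True,  # stands in for the order.json schema file from the bq command
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # equivalent of --replace
)

# Wildcard kept at the end of the object path, as the answer describes.
uri = "gs://raw-data/order/order/2050131/*"

load_job = client.load_table_from_uri(uri, "raw_data.order_20150131", job_config=job_config)

try:
    load_job.result()  # waits for the job to finish
except Exception:
    # Each entry in load_job.errors is a dict with 'reason', 'location', and 'message' fields.
    for err in load_job.errors or []:
        print(err)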