I have a series of files that are in JSON that need to be split into multiple files to reduce their size. One issue is that the files are extracted using a third party tool and arrive as a JSON object on a single line.
I can use S3 select to process a small file (say around 300Mb uncompressed) but when I try and use a larger file - say 1Gb uncompressed (90Mb gzip compressed) I get the following error:
[ERROR] EventStreamError: An error occurred (InternalError) when calling the SelectObjectContent operation: We encountered an internal error. Please try again.
The query that I am trying to run is:
select count(*) as rowcount from s3object[*][*] s
I can't run the query from the console because the file is larger than 128Mb but the code that is performing the operation is as follows:
def execute_select_query(bucket, key, query):
"""
Runs a query against an object in S3.
"""
if key.endswith("gz"):
compression = "GZIP"
else:
compression = "NONE"
LOGGER.info("Running query |%s| against s3://%s/%s", query, bucket, key)
return S3_CLIENT.select_object_content(
Bucket=bucket,
Key=key,
ExpressionType='SQL',
Expression=query,
InputSerialization={"JSON": {"Type": "DOCUMENT"}, "CompressionType": compression},
OutputSerialization={'JSON': {}},
)
Related
On S3 there is a JSON file with the following format:
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
It is compressed, in .tar.gz format, and its unzipped size is ~30GB, therefore I would like to read it in a streaming fashion.
Using the aws cli, I managed to locally do so with the following command:
aws s3 cp s3://${BUCKET_NAME}/${FILE_NAME}.tar.gz - | gunzip -c -
However, I would like to do it natively in python 3.8.
Merging various solutions online, I tried the following strategies:
1. Uncompressing in-memory file [not working]
import boto3, gzip, json
from io import BytesIO
s3 = boto3.resource('s3')
key = 'FILE_NAME.tar.gz'
streaming_iterator = s3.Object('BUCKET_NAME', key).get()['Body'].iter_lines()
first_line = next(streaming_iterator)
gzipline = BytesIO(first_line)
gzipline = gzip.GzipFile(fileobj=gzipline)
print(gzipline.read())
Which raises
EOFError: Compressed file ended before the end-of-stream marker was reached
2. Using the external library smart_open [partially working]
import boto3
for line in open(
f's3://${BUCKET_NAME}/${FILE_NAME}.tar.gz',
mode="rb",
transport_params={"client": boto3.client('s3')},
encoding="raw_unicode_escape",
compression=".gz"
):
print(line)
This second solution works discretely well for ASCII characters, but for some reason it also turns non ASCII characters into garbage; e.g.,
input: \xe5\x9b\xbe\xe6\xa0\x87\xe3\x80\x82
output: å\x9b¾æ\xa0\x87ã\x80\x82
expected output: 图标。
This leads me to think that the encoding I put is wrong, but I literally tried every encoding present in this page and the only ones that don't lead to an Exception are raw_unicode_escape, unicode_escape and palmos (?), but they all produce garbage.
Any suggestion is welcomed, thanks in advance.
The return from a call to get_object() is a StreamingBody object, which as the name implies will allow you to read from the object in a streaming fashion. However, boto3 does not support seeking on this file object.
While you can pass this object to a tarfile.open call, you need to be careful. There are two caveats. First, you'll need to tell tarfile that you're passing it a non-seekable streaming object using the | character in the open string, and you can't do anything that would trigger a seek, such as attempt to get a list of files first, then operate on these files.
Putting it all together is fairly straight forward, you just need to open a object using boto3, then process each file in the tar file in turn:
# Use boto3 to read the object from S3
s3 = boto3.client('s3')
resp = s3.get_object(Bucket='example-bucket', Key='path/to/example.tar.gz')
obj = resp['Body']
# Open the tar file, the "|" is important, as it instructs
# tarfile that the fileobj is non-seekable
with tarfile.open(fileobj=obj, mode='r|gz') as tar:
# Enumerate the tar file objects as we extract data
for member in tar:
with tar.extractfile(member) as f:
# Read each row in turn and decode it
for row in f:
row = json.loads(row)
# Just print out the filename and results in this demo
print(member.name, row)
I have to upload the data from a Snowflake table to Azure BLOB using COPYINTO command. The copy command I have is working for SINGLE = TRUE property but I want to break the in multiple files if the size exceeds 40MB.
For example, There is a table 'TEST' in snowflake with 100MB, I want to upload this data in azure BLOB.
The copy into command should create files in below format
TEST_1.csv (40MB)
TEST_2.csv (40MB)
TEST_3.csv (20MB)
--COPY INTO Command I am using
copy into #stage/test.csv from snowflake.test file_format = (format_name = PRW_CSV_FORMAT) header=true OVERWRITE = TRUE SINGLE = TRUE max_file_size = 40000000
We cannot control the output size of file unloads, only the max file size. The number and size of the files are based on maximum performance as it parallelizes the operation. If you want to control the number/size of files, that would be a feature request. Otherwise, just work out a process outside of Snowflake to combine the files afterward. For more details about unloading, please refer to the blog
I'm currently using Airflow with the BigQuery operator to trigger various SQL scripts. This works fine when the SQL is written directly in the Airflow DAG file. For example:
bigquery_transform = BigQueryOperator(
task_id='bq-transform',
bql='SELECT * FROM `example.table`',
destination_dataset_table='example.destination'
)
However, I'd like to store the SQL in a separate file saved to a storage bucket. For example:
bql='gs://example_bucket/sample_script.sql'
When calling this external file I recieve a "Template Not Found" error.
I've seen some examples load the SQL file into the Airflow DAG folder, however, I'd really like to access files saved to a separate storage bucket. Is this possible?
You can reference any SQL files in your Google Cloud Storage Bucket. Here's a following example where I call the file Query_File.sql in the sql directory in my airflow dag bucket.
CONNECTION_ID = 'project_name'
with DAG('dag', schedule_interval='0 9 * * *', template_searchpath=['/home/airflow/gcs/dags/'], max_active_runs=15, catchup=True, default_args=default_args) as dag:
battery_data_quality = BigQueryOperator(
task_id='task-id',
sql='/SQL/Query_File.sql',
destination_dataset_table='project-name.DataSetName.TableName${{ds_nodash}}',
write_disposition='WRITE_TRUNCATE',
bigquery_conn_id=CONNECTION_ID,
use_legacy_sql=False,
dag=dag
)
You can also consider using the gcs_to_gcs operator to copy things from your desired bucket into one that is accessible by composer.
download works differently in GoogleCloudStorageDownloadOperator for Airflow version 1.10.3 and 1.10.15.
def execute(self, context):
self.object = context['dag_run'].conf['job_name'] + '.sql'
logging.info('filemname in GoogleCloudStorageDownloadOperator: %s', self.object)
self.filename = context['dag_run'].conf['job_name'] + '.sql'
self.log.info('Executing download: %s, %s, %s', self.bucket,
self.object, self.filename)
hook = GoogleCloudStorageHook(
google_cloud_storage_conn_id=self.google_cloud_storage_conn_id,
delegate_to=self.delegate_to
)
file_bytes = hook.download(bucket=self.bucket,
object=self.object)
if self.store_to_xcom_key:
if sys.getsizeof(file_bytes) < 49344:
context['ti'].xcom_push(key=self.store_to_xcom_key, value=file_bytes.decode('utf-8'))
else:
raise RuntimeError(
'The size of the downloaded file is too large to push to XCom!'
)
I am looking at a simple way to extract a zip/gzip present in s3 bucket to the same bucket location and delete the parent zip/gzip file post extraction.
I am unable to achieve this with any of the API's currently.
Have tried native boto, pyfilesystem(fs), s3fs.
The source and destination links seem to be an issue for these functions.
(Using with Python 2.x/3.x & Boto 2.x )
I see there is an API for node.js(unzip-to-s3) to do this job , but none for python.
Couple of implementations i can think of:
A simple API to extract the zip file within the same bucket.
Use s3 as a filesystem and manipulate data
Use a data pipeline to achieve this
Transfer the zip to ec2 , extract and copy back to s3.
The option 4 would be the least preferred option, to minimise the architecture overhead with ec2 addon.
Need support in getting this feature implementation , with integration to lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in Advance,
Sundar.
You could try https://www.cloudzipinc.com/ that unzips/expands several different formats of archives from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
Have solved by using ec2 instance.
Copy the s3 files to local dir in ec2
and copy that directory back to S3 bucket.
Sample to unzip to local directory in ec2 instance
def s3Unzip(srcBucket,dst_dir):
'''
function to decompress the s3 bucket contents to local machine
Args:
srcBucket (string): source bucket name
dst_dir (string): destination location in the local/ec2 local file system
Returns:
None
'''
#bucket = s3.lookup(bucket)
s3=s3Conn
path=''
bucket = s3.lookup(bucket_name)
for key in bucket:
path = os.path.join(dst_dir, key.name)
key.get_contents_to_filename(path)
if path.endswith('.zip'):
opener, mode = zipfile.ZipFile, 'r'
elif path.endswith('.tar.gz') or path.endswith('.tgz'):
opener, mode = tarfile.open, 'r:gz'
elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
opener, mode = tarfile.open, 'r:bz2'
else:
raise ValueError ('unsuppported format')
try:
os.mkdir(dst_dir)
print ("local directories created")
except Exception:
logger_s3.warning ("Exception in creating local directories to extract zip file/ folder already existing")
cwd = os.getcwd()
os.chdir(dst_dir)
try:
file = opener(path, mode)
try: file.extractall()
finally: file.close()
logger_s3.info('(%s) extracted successfully to %s'%(key ,dst_dir))
except Exception as e:
logger_s3.error('failed to extract (%s) to %s'%(key ,dst_dir))
os.chdir(cwd)
s3.close
sample code to upload to mysql instance
Use the "LOAD DATA LOCAL INFILE" query to upload to mysql directly
def upload(file_path,timeformat):
'''
function to upload a csv file data to mysql rds
Args:
file_path (string): local file path
timeformat (string): destination bucket to copy data
Returns:
None
'''
for file in file_path:
try:
con = connect()
cursor = con.cursor()
qry="""LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (col1 , col2 ,col3, #datetime , col4 ) set datetime = str_to_date(#datetime,'%s');""" %(file,timeformat)
cursor.execute(qry)
con.commit()
logger_rds.info ("Loading file:"+file)
except Exception:
logger_rds.error ("Exception in uploading "+file)
##Rollback in case there is any error
con.rollback()
cursor.close()
# disconnect from server
con.close()
Lambda function:
You can use a Lambda function where you read zipped files into the buffer, gzip the individual files, and reupload them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event based trigger that runs the lambda automatically everytime there is a new zipped file in S3. Here's a full tutorial for the exact thing here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
I am getting an "Unexpected" error. I tried a few times, and I still could not load the data. Is there any other way to load data?
gs://log_data/r_mini_raw_20120510.txt.gzto567402616005:myv.may10c
Errors:
Unexpected. Please try again.
Job ID: job_4bde60f1c13743ddabd3be2de9d6b511
Start Time: 1:48pm, 12 May 2012
End Time: 1:51pm, 12 May 2012
Destination Table: 567402616005:myvserv.may10c
Source URI: gs://log_data/r_mini_raw_20120510.txt.gz
Delimiter: ^
Max Bad Records: 30000
Schema:
zoneid: STRING
creativeid: STRING
ip: STRING
update:
I am using the file that can be found here:
http://saraswaticlasses.net/bad.csv.zip
bq load -F '^' --max_bad_record=30000 mycompany.abc bad.csv id:STRING,ceid:STRING,ip:STRING,cb:STRING,country:STRING,telco_name:STRING,date_time:STRING,secondary:STRING,mn:STRING,sf:STRING,uuid:STRING,ua:STRING,brand:STRING,model:STRING,os:STRING,osversion:STRING,sh:STRING,sw:STRING,proxy:STRING,ah:STRING,callback:STRING
I am getting an error "BigQuery error in load operation: Unexpected. Please try again."
The same file works from Ubuntu while it does not work from CentOS 5.4 (Final)
Does the OS encoding need to be checked?
The file you uploaded has an unterminated quote. Can you delete that line and try again? I've filed an internal bigquery bug to be able to handle this case more gracefully.
$grep '"' bad.csv
3000^0^1.202.218.8^2f1f1491^CN^others^2012-05-02 20:35:00^^^^^"Mozilla/5.0^generic web browser^^^^^^^^
When I run a load from my workstation (Ubuntu), I get a warning about the line in question. Note that if you were using a larger file, you would not see this warning, instead you'd just get a failure.
$bq show --format=prettyjson -j job_e1d8636e225a4d5f81becf84019e7484
...
"status": {
"errors": [
{
"location": "Line:29057 / Field:12",
"message": "Missing close double quote (\") character: field starts with: <Mozilla/>",
"reason": "invalid"
}
]
My suspicion is that you have rows or fields in your input data that exceed the 64 KB limit. Perhaps re-check the formatting of your data, check that it is gzipped properly, and if all else fails, try importing uncompressed data. (One possibility is that the entire compressed file is being interpreted as a single row/field that exceeds the aforementioned limit.)
To answer your original question, there are a few other ways to import data: you could upload directly from your local machine using the command-line tool or the web UI, or you could use the raw API. However, all of these mechanisms (including the Google Storage import that you used) funnel through the same CSV parser, so it's possible that they'll all fail in the same way.