It seems that I can load data into BigQuery from S3 with the following sample.
This time, however, I would like to load compressed files from S3, not plain CSV files.
If that is possible, how can I load the data into BigQuery from S3?
sample
bq mk \
--transfer_config \
--data_source=amazon_s3 \
--display_name=load_from_s3 \
--target_dataset=test_dataset_s3 \
--params='{
"data_path":"s3://xxx-test01",
"destination_table_name_template":"test_table",
"access_key_id":"xxxxxxxxxxxxx",
"secret_access_key":"xxxxxxxxxxxxxx",
"file_format":"CSV",
"max_bad_records":"0",
"ignore_unknown_values":"true",
"field_delimiter":",",
"skip_leading_rows":"0",
"allow_quoted_newlines":"true"
}'
The same bq CLI command can be used without many changes. Assuming you have compressed CSV files, I tested the transfer with a sample compressed file of my own and the transfer was successful. Below is the bq command I tested.
bq mk \
--transfer_config \
--data_source=amazon_s3 \
--display_name=load_from_s3 \
--target_dataset=test_dataset \
--params='{
"data_path":"s3://awsbucket-name/sample.csv.gz",
"destination_table_name_template":"table-name",
"access_key_id":"xxxxxxxxxxxxxxx",
"secret_access_key":"xxxxxxxxxxxxxxxx",
"file_format":"CSV",
"max_bad_records":"0",
"ignore_unknown_values":"true",
"field_delimiter":",",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true"
}'
Logs of the file transfer:
Note, however, that for formats such as CSV and JSON, BigQuery can load uncompressed files significantly faster than compressed files, because uncompressed files can be read in parallel. For more information, refer to this documentation.
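If you would rather create the same transfer from code instead of the bq CLI, the google-cloud-bigquery-datatransfer Python client exposes the same configuration. This is only a sketch, not a tested transfer: the project ID is a placeholder, and the bucket, credentials and table values simply mirror the bq command above.
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Same parameters as the bq mk --transfer_config call above.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="test_dataset",
    display_name="load_from_s3",
    data_source_id="amazon_s3",
    params={
        "data_path": "s3://awsbucket-name/sample.csv.gz",
        "destination_table_name_template": "table-name",
        "access_key_id": "xxxxxxxxxxxxxxx",
        "secret_access_key": "xxxxxxxxxxxxxxxx",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("your-project-id"),  # placeholder project ID
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)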
Related
I want to upload all .csv.gz files stored in the google cloud storage folder.
I tried
bq load \
--skip_leading_rows=1 \
--allow_quoted_newlines \
--source_format=CSV \
transfer_test.multiple_gzfile_test \
gs://test_compressed_file/*
but it says no matches found: gs://test_compressed_file/*.
I also tried "gs://test_compressed_file/", gs://test_compressed_file/.csv.gz, and "gs://test_compressed_file/*.csv.gz", but none of them worked.
As I understand it, you are now trying to load the .csv.gz files into a BigQuery dataset.
According to Google’s documentation, the wildcard is used to match a pattern in the object’s name, so gs://test_compressed_file/* or gs://test_compressed_file/*.csv.gz should have worked.
Based on that, it is possible that the error was produced by a typo in the URI.
See replication below:
# Listing folders in bucket
$ gsutil ls -l gs://<bucket>/
gs://<bucket>/CSVs/
gs://<bucket>/Images/
gs://<bucket>/PDFs/
# Listing files in gs://<bucket>/CSVs/
$ gsutil ls -l gs://<bucket>/CSVs/
0 2021-07-07T20:21:33Z gs://<bucket>/CSVs/
39 2021-07-07T20:22:24Z gs://<bucket>/CSVs/table_1.csv.gz
# Using command with typo in URI and getting error:
$ bq load --skip_leading_rows=1 --allow_quoted_newlines --source_format=CSV test1.table1 gs://<bucket>/CSV/*
Waiting on bqjob_<BQ_JOB1> ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job '<project>:bqjob_<BQ_JOB1>': Not found: Uris gs://<bucket>/CSV/*
# Using command with correct URI
$ bq load --skip_leading_rows=1 --allow_quoted_newlines --source_format=CSV test1.table1 gs://<bucket>/CSVs/*
Waiting on bqjob_<BQ_JOB2> ... (1s) Current status: DONE
$
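For completeness, the same wildcard load can also be done from Python with the google-cloud-bigquery client. This is only a sketch, assuming the bucket, dataset and table names from the question (you may need to prefix the table with your project ID):
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    allow_quoted_newlines=True,
    autodetect=True,  # or supply an explicit schema if the table does not exist yet
)

# The * wildcard matches every object under the prefix; gzip-compressed CSVs
# are decompressed automatically during the load.
load_job = client.load_table_from_uri(
    "gs://test_compressed_file/*.csv.gz",
    "transfer_test.multiple_gzfile_test",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish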
I have the following code, which runs a SQL query against a keyfile located in an S3 bucket. It runs perfectly, but I do not want the output written to an output file. Can I see the output on the screen instead (my preference #1)? If not, can I at least append to the output file rather than overwrite it (my preference #2)? I am using the AWS CLI binaries to run this query. If there is another way, I am happy to try it (as long as it stays within bash).
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' "OutputFile"
Of course, you can use the AWS CLI to do this, since stdout is just a special file on Linux.
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' /dev/stdout
Note the /dev/stdout at the end.
The AWS CLI does not offer such options.
However, you are welcome to instead call it via an AWS SDK of your choice.
For example, in the boto3 Python SDK, there is a select_object_content() function that returns the data as a stream. You can then read, manipulate, print or save it however you wish.
I think the /dev/stdout approach opens /dev/stdout twice, causing chaos.
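To make the SDK suggestion above concrete, here is a minimal boto3 sketch, reusing the bucket, key and query from the question, that prints the matching rows to the screen instead of writing a file:
import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="project2",
    Key="keyfile1",
    Expression="SELECT * FROM s3object s WHERE Lower(s._1) = 'email#search.com'",
    ExpressionType="SQL",
    InputSerialization={"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {"FieldDelimiter": ":"}},
)

# The payload is an event stream; only Records events carry the actual rows.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
From here it is just as easy to open a local file in append mode ("ab") and write the payload to it, which would cover preference #2 as well.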
I've created an S3 Batch Operation using an S3 inventory JSON file that's pointing to a few billion objects in my S3 bucket.
The operation has been stuck on "Preparing" status for 24 hours now.
What preparation times should I expect at these kinds of volumes?
Would the preparation time shorten if, instead of providing the JSON manifest, I joined all the inventory CSVs into one uber-CSV?
I've used awscli to create the request like so:
aws s3control create-job \
--region ... \
--account-id ... \
--operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::some-bucket","MetadataDirective":"COPY"}}' \
--manifest '{"Spec":{"Format":"S3InventoryReport_CSV_20161130"},"Location":{"ObjectArn":"arn:aws:s3:::path_to_manifest/manifest.json","ETag":"..."}}' \
--report '{"Bucket":"arn:aws:s3:::some-bucket","Prefix":"reports", "Format":"Report_CSV_20180820", "Enabled":true, "ReportScope":"AllTasks"}' \
--priority 42 \
--role-arn ... \
--client-request-token $(uuidgen) \
--description "Batch request"
After ~4 days the tasks completed the preparation phase and were ready to be run.
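If you want to keep an eye on the phase while a job is preparing, you can poll it programmatically. A minimal boto3 sketch; the region, account ID and job ID below are placeholders, and the job ID is the one returned by create-job:
import time
import boto3

s3control = boto3.client("s3control", region_name="us-east-1")  # placeholder region

# Poll until the job leaves the "Preparing" phase.
while True:
    job = s3control.describe_job(
        AccountId="111122223333",  # placeholder account ID
        JobId="your-job-id",       # placeholder: JobId returned by create-job
    )["Job"]
    print(job["Status"], job.get("ProgressSummary", {}))
    if job["Status"] != "Preparing":
        break
    time.sleep(300)  # check again every 5 minutes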
I was downloading a file using awscli:
$ aws s3 cp s3://mybucket/myfile myfile
But the download was interrupted (computer went to sleep). How can I continue the download? S3 supports the Range header, but awscli s3 cp doesn't let me specify it.
The file is not publicly accessible so I can't use curl to specify the header manually.
There is a "hidden" command in the awscli tool which allows lower-level access to S3: s3api.† It is less user-friendly (no s3:// URLs and no progress bar), but it does support the range specifier on get-object:
--range (string) Downloads the specified range bytes of an object. For
more information about the HTTP range header, go to
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
Here's how to continue the download:
$ size=$(stat -f%z myfile) # assumes OS X. Change for your OS
$ aws s3api get-object \
--bucket mybucket \
--key myfile \
--range "bytes=$size-" \
/dev/fd/3 3>>myfile
You can use pv for a rudimentary progress bar:
$ aws s3api get-object \
--bucket mybucket \
--key myfile \
--range "bytes=$size-" \
/dev/fd/3 3>&1 >&2 | pv >> myfile
(The reason for this unnamed pipe rigmarole is that s3api writes a debug message to stdout at the end of the operation, polluting your file. This solution rebinds stdout to stderr and frees up the pipe for regular file contents through an alias. The version without pv could technically write to stderr (/dev/fd/2 and 2>), but if an error occurs s3api writes to stderr, which would then get appended to your file. Thus, it is safer to use a dedicated pipe there, as well.)
† In git speak, s3 is porcelain, and s3api is plumbing.
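If you would rather skip the file-descriptor juggling, the same resume can be done with a few lines of boto3. A minimal sketch, using the bucket, key and local filename from the question:
import os
import boto3

bucket, key, filename = "mybucket", "myfile", "myfile"
s3 = boto3.client("s3")

# Ask S3 only for the bytes we don't have yet, then append them to the local file.
offset = os.path.getsize(filename)
response = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={offset}-")

with open(filename, "ab") as f:  # append, never truncate
    for chunk in response["Body"].iter_chunks(1024 * 1024):
        f.write(chunk)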
Use s3cmd; it has a --continue option built in. Example:
# Start a download
> s3cmd get s3://yourbucket/yourfile ./
download: 's3://yourbucket/yourfile' -> './yourfile' [1 of 1]
123456789 of 987654321 12.5% in 235s 0.5 MB/s
[ctrl-c] interrupt
# Pick up where you left off
> s3cmd --continue get s3://yourbucket/yourfile ./
Note that s3cmd is not multithreaded, whereas awscli is, so awscli is faster. A currently maintained fork of s3cmd, called s4cmd, appears to provide multithreaded capabilities while keeping the usability features of s3cmd:
https://github.com/bloomreach/s4cmd
I have a requirement to export report data directly to CSV, since getting the array/query response, building the CSV, and then uploading the final CSV to Amazon takes time. Is there a way to create the CSV directly with Redshift's PostgreSQL?
PgSQL - Export SELECT query data directly to Amazon S3 with headers
Here is my version of PgSQL: PgSQL 8.0.2 on Amazon Redshift.
Thanks
You can use the UNLOAD statement to save the results to an S3 bucket. Keep in mind that this will create multiple files (at least one per compute node).
You will have to download all the files, combine them locally, sort them (if needed), add the column headers, and upload the result back to S3.
Doing this from an EC2 instance shouldn't take much time; the connection between EC2 and S3 is quite good.
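For reference, a minimal sketch of what such an UNLOAD might look like when issued from Python with psycopg2. The query, S3 prefix and credentials below are placeholders, not values from the question:
import psycopg2

conn = psycopg2.connect(
    host="__your__redshift__host__",
    port=5439,
    user="__your__redshift__user__",
    password="__your__redshift__pass__",
    dbname="__your__redshift__database__name__",
)
conn.autocommit = True  # run the one-off statement outside an explicit transaction

# UNLOAD writes the query result straight to S3, one or more files per slice.
# Add PARALLEL OFF if you want a single output file instead.
unload_sql = """
    UNLOAD ('SELECT * FROM my_report_table')
    TO 's3://path_to_files_on_s3/bucket/files_prefix'
    CREDENTIALS 'aws_access_key_id=__key__;aws_secret_access_key=__secret__'
    DELIMITER AS ','
    GZIP;
"""

with conn.cursor() as cur:
    cur.execute(unload_sql)
conn.close()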
In my experience, the quickest method is to use shell commands:
# run query on the redshift
export PGPASSWORD='__your__redshift__pass__'
psql \
-h __your__redshift__host__ \
-p __your__redshift__port__ \
-U __your__redshift__user__ \
__your__redshift__database__name__ \
-c "UNLOAD __rest__of__query__"
# download all the results
s3cmd get s3://path_to_files_on_s3/bucket/files_prefix*
# merge all the files into one
cat files_prefix* > files_prefix_merged
# sort merged file by a given column (if needed)
sort -n -k2 files_prefix_merged > files_prefix_sorted
# add column names to destination file
echo -e "column 1 name\tcolumn 2 name\tcolumn 3 name" > files_prefix_finished
# add merged and sorted file into destination file
cat files_prefix_sorted >> files_prefix_finished
# upload destination file to s3
s3cmd put files_prefix_finished s3://path_to_files_on_s3/bucket/...
# cleanup
s3cmd del s3://path_to_files_on_s3/bucket/files_prefix*
rm files_prefix* files_prefix_merged files_prefix_sorted files_prefix_finished