Export table from Bigquery into GCS split sizes - google-bigquery

I am exporting a table larger than 1 GB from BigQuery into GCS, but the export splits it into very small files of 2-3 MB each. Is there a way to get bigger files, say 40-60 MB per file, rather than 2-3 MB?
I do the export via the API:
https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
https://cloud.google.com/bigquery/docs/reference/v2/jobs
The source table size is 60 GB in BigQuery. I extract the data in NEWLINE_DELIMITED_JSON format with GZIP compression:
destination_cloud_storage_uris=[
'gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz'
]
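For reference, an extract request like the one described could be expressed as the following jobs.insert body. This is only a sketch: the project, dataset and table names are placeholders, and the helper build_extract_job_body is hypothetical. Note that none of the extract fields used here sets a target file size.

```python
import json


def build_extract_job_body(project_id, dataset_id, table_id, destination_uris):
    """Build a jobs.insert request body for a BigQuery extract job.

    The field names follow the BigQuery v2 REST API (configuration.extract).
    The function itself is a hypothetical helper for illustration.
    """
    return {
        "configuration": {
            "extract": {
                "sourceTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "destinationUris": list(destination_uris),
                "destinationFormat": "NEWLINE_DELIMITED_JSON",
                "compression": "GZIP",
            }
        }
    }


# Placeholder identifiers; replace with your own project, dataset and table.
body = build_extract_job_body(
    "my-project",
    "my_dataset",
    "my_table",
    ["gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz"],
)
print(json.dumps(body, indent=2))
```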

Are you trying to export a partitioned table? If so, each partition is exported as a separate table, which might cause the small files.
I ran the export from the CLI with each of the following commands, and in both cases received files of 49 MB:
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
bq extract --compression=GZIP project:dataset.table gs://bucket_name/path5-component/file-name-*.gz

Please add more details to the question so we can provide specific advice: How are you exactly asking for this export?
Nevertheless, if you have many files in GCS and you want to merge them all into one, you can do:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
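Since the shards in the question are GZIP-compressed, it is worth checking that merging them still gives a readable file. Compose concatenates the source objects byte for byte, and concatenated gzip members form a valid gzip stream. A small local sketch of that property, using made-up JSON lines in place of real shards:

```python
import gzip

# Two independently compressed "shards" (made-up sample rows).
shard_a = gzip.compress(b'{"id": 1}\n')
shard_b = gzip.compress(b'{"id": 2}\n')

# Byte-wise concatenation, which is what composing the objects amounts to.
merged = shard_a + shard_b

# The merged bytes decompress to the rows of both shards, in order.
lines = gzip.decompress(merged).splitlines()
print(lines)
```

Whatever reads the merged file downstream has to support multi-member gzip; Python's gzip module does, but that is an assumption to verify for other tools.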

Related

BigQuery export CSV: Can I control output partitioning?

This statement exports the query results to GCS:
EXPORT DATA OPTIONS(
uri='gs://<bucket>/<file_name>.*.csv',
format='CSV',
overwrite=true,
header=true
) AS
SELECT * FROM dataset.table
It splits big amounts of data into multiple files, sometimes it also produces empty files. I can't seem to find any info in BigQuery docs on how to control this. Can I configure export into a single file? Or into N files up to 1M rows each? Or N files up to 50MB each?
I have tested different scenarios (using Public datasets) and discovered that export data gets split into multiple files when your table is partitioned and is less than 1 GB. This result happens when using wildcard operator during the export.
BigQuery supports a single wildcard operator (*) in each URI. The wildcard can appear anywhere in the URI except as part of the bucket name. Using the wildcard operator instructs BigQuery to create multiple sharded files based on the supplied pattern.
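To illustrate the pattern: in the sharded file names, the wildcard is replaced by a 12-digit, zero-padded number starting at 0. The helper below is a hypothetical sketch that checks the rules above and lists the names you would expect for a given number of shards. The shard count is an input here because BigQuery decides it, not you.

```python
def expand_wildcard_uri(uri, shard_count):
    """List the shard URIs expected for a single-wildcard export URI.

    Hypothetical helper for illustration; it only mirrors the naming rule
    (wildcard replaced by a 12-digit zero-padded counter starting at 0).
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError("URI must start with gs://")
    if uri.count("*") != 1:
        raise ValueError("URI must contain exactly one wildcard")
    bucket = uri[len(prefix):].split("/", 1)[0]
    if "*" in bucket:
        raise ValueError("wildcard cannot be part of the bucket name")
    return [uri.replace("*", f"{i:012d}") for i in range(shard_count)]


for name in expand_wildcard_uri("gs://my-bucket/file_name.*.csv", 3):
    print(name)
```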
Unfortunately, the wildcard is a requirement for the EXPORT DATA syntax; without it your query will fail with an error.
Can I configure export into a single file? Or into N files up to 1M rows each? Or N files up to 50MB each?
As mentioned above, exporting a partitioned table into a single file is not possible using the EXPORT DATA syntax. A workaround is to export using the UI or the bq command.
Using UI export:
Open table > Export > Export to GCS > Fill in GCS location and filename
Using bq tool:
bq extract --destination_format CSV \
bigquery-public-data:covid19_geotab_mobility_impact.us_border_wait_times \
gs://bucket_name/900k_rows_using_bq_extract.csv
I tested this with the public partitioned table bigquery-public-data.covid19_geotab_mobility_impact.us_border_wait_times and compared the CSV files exported to the GCS bucket by these three methods.

BigQuery exports a table to GCS with one 8 GB file, even when using a single wildcard URI to export in chunks of less than 1 GB

I tried, both manually and from the command line, to export a BigQuery table holding 140 GB of data into files of less than 1 GB each in a GCS bucket. The export created 168 files overall. Files 1 to 167 are all smaller than 1 GB, but the last file is around 8 GB in both cases, whether I export from the command line or from the BigQuery interface.
I followed "Export bigquery table to GCS" to export the table into multiple files, using a single wildcard URI to split the exported table into chunks.
I want all exported files to be around 1 GB. Can anybody help me with this? Thanks.
You have misread the documentation.
There is no 1 GB-per-file export setting in BigQuery.
The 1 GB you read about refers to the size of the data you are trying to export:
If you are exporting more than 1 GB of data, you must export your data
to multiple files. When you export your data to multiple files, the
size of the files will vary.
So this says that if your table is bigger than 1 GB, you must export to multiple files. It does NOT say that the files will be smaller than 1 GB; it says the file sizes vary.

How to import public data set into Google Cloud Bucket

I am going to work on a data set that contains information about 311 calls in the United States. This data set is available publicly in BigQuery. I would like to copy this directly to my bucket. However, I am clueless about how to do this as I am a novice.
I have already created a bucket named 311_nyc in my Google Cloud Storage. How can I transfer the data directly, without having to download the 12 GB file and upload it again through my VM instance?
If you select the 311_service_requests table from the list on the left, an "Export" button will appear:
Then you can select Export to GCS, select your bucket, type a filename, choose format (between CSV and JSON) and check if you want the export file to be compressed (GZIP).
However, there are some limitations in BigQuery Exports. Copying some from the documentation link that apply to your case:
You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.
EDIT:
A simple way to merge the output files together is to use the gsutil compose command. However, if you do this the header with the column names will appear multiple times in the resulting file because it appears in all the files that are extracted from BigQuery.
To avoid this, you should perform the BigQuery Export by setting the print_header parameter to False:
bq extract --destination_format CSV --print_header=False bigquery-public-data:new_york_311.311_service_requests gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv
and then create the composite:
gsutil compose gs://<YOUR_BUCKET_NAME>/nyc_311_* gs://<YOUR_BUCKET_NAME>/all_data.csv
Now, in the all_data.csv file there are no headers at all. If you still need the column names to appear in the first row you have to create another CSV file with the column names and create a composite of these two. This can be done either manually by pasting the following (column names of the "311_service_requests" table) into a new file:
unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,borough,x_coordinate,y_coordinate,park_facility_name,park_borough,bbl,open_data_channel_type,vehicle_type,taxi_company_borough,taxi_pickup_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
or with the following simple Python script, which queries the column names of the table and writes them into a CSV file (useful when the table has too many columns to copy by hand):
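from google.cloud import bigquery

client = bigquery.Client()

# INFORMATION_SCHEMA.COLUMNS holds one row per column of each table.
query = """
SELECT column_name
FROM `bigquery-public-data`.new_york_311.INFORMATION_SCHEMA.COLUMNS
WHERE table_name='311_service_requests'
"""
query_job = client.query(query)

# Iterating the job waits for it to finish and yields the result rows.
columns = []
for row in query_job:
    columns.append(row["column_name"])

# Write the column names as a single comma-separated header line.
with open("headers.csv", "w") as f:
    print(','.join(columns), file=f)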
Note that for the above script to run you need to have the BigQuery Python Client library installed:
pip install --upgrade google-cloud-bigquery
Upload the headers.csv file to your bucket:
gsutil cp headers.csv gs://<YOUR_BUCKET_NAME>/headers.csv
And now you are ready to create the final composite:
gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/all_data.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
In case you want the headers you can skip creating the first composite and just create the final one using all sources:
gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
You can also use the Cloud SDK command-line tools (gsutil and bq):
Create a bucket:
gsutil mb gs://my-bigquery-temp
Extract the data set:
bq extract --destination_format CSV --compression GZIP 'bigquery-public-data:new_york_311.311_service_requests' gs://my-bigquery-temp/dataset*
Please note that you have to use gs://my-bigquery-temp/dataset* because the dataset is too large and cannot be exported to a single file.
Check the bucket:
gsutil ls gs://my-bigquery-temp
gs://my-bigquery-temp/dataset000000000
......................................
gs://my-bigquery-temp/dataset000000000045
You can find more information in Exporting table data.
Edit:
To compose an object from the exported dataset files you can use gsutil tool:
gsutil compose gs://my-bigquery-temp/dataset* gs://my-bigquery-temp/composite-object
Please keep in mind that you cannot use more than 32 blobs (files) to compose the object.
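If the export produced more than 32 files, one way around the limit is to compose in rounds: merge batches of up to 32 files into temporary objects, then compose those. The sketch below only plans the rounds and does not call GCS. The function name and the ".tmp-<level>-<index>" naming are hypothetical choices for illustration.

```python
def plan_compose_steps(sources, final_name, max_components=32):
    """Plan compose calls so that no call uses more than max_components sources.

    Returns a list of (source_names, destination_name) tuples, in the order
    they should be executed. Source order is preserved throughout.
    """
    current = list(sources)
    if not current:
        raise ValueError("at least one source object is required")
    steps = []
    level = 0
    # Keep merging in batches until one compose call can finish the job.
    while len(current) > max_components:
        next_level = []
        for start in range(0, len(current), max_components):
            batch = current[start:start + max_components]
            # Hypothetical naming scheme for intermediate objects.
            tmp_name = f"{final_name}.tmp-{level}-{start // max_components}"
            steps.append((batch, tmp_name))
            next_level.append(tmp_name)
        current = next_level
        level += 1
    steps.append((current, final_name))
    return steps


# 46 shards, like dataset000000000000 .. dataset000000000045 above.
shards = [f"gs://my-bigquery-temp/dataset{i:012d}" for i in range(46)]
for batch, destination in plan_compose_steps(shards, "gs://my-bigquery-temp/composite-object"):
    print(len(batch), "->", destination)
```

Each planned step corresponds to one gsutil compose call; the temporary objects can be deleted once the final object exists.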
Related SO Question Google Cloud Storage Joining multiple csv files

loading a pg_dump off of s3 into redshift

I'm trying to load a complete database dump into Redshift. Is there a single command to restore the data from a pg_dump living on s3 into Redshift? If not, what are the best steps for tackling this?
Thanks
If you have an uncompressed pg_dump, this should be possible using a psql command (you may need to edit the dump manually to get the right syntax, depending on your versions and the options set).
However, this is a very inefficient and slow way to load Redshift and I do not recommend it. If your tables are large it could take days or weeks!
What you need to do is this:
1. Create target tables on Redshift based on the source tables, but considering sort keys and distribution.
2. Unload your Postgres source tables into CSV files using the Postgres "copy" command.
3. If the source CSV files are very big (e.g. more than, say, 100 MB), consider splitting them into separate files, as they will load faster (Redshift will parallelize).
4. gzip the CSV files (recommended but not essential).
5. Upload these CSV files to S3, with a separate folder per table.
6. Load the data into Redshift from S3 by using the Redshift copy command.
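The splitting and gzip steps can be sketched locally as follows. This is a minimal example, assuming a headerless CSV as produced by the Postgres copy command; the function name, the part naming and the rows-per-part threshold are hypothetical. It splits by row count rather than by size and uses the csv module so that quoted fields containing newlines are not cut in half.

```python
import csv
import gzip
import os
import tempfile


def split_and_gzip(csv_path, out_dir, rows_per_part):
    """Split a headerless CSV into gzip-compressed parts of at most rows_per_part rows."""
    part_paths = []
    out = None
    writer = None
    try:
        with open(csv_path, newline="") as src:
            for row_no, row in enumerate(csv.reader(src)):
                if row_no % rows_per_part == 0:
                    # Start a new part file.
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"part-{len(part_paths):05d}.csv.gz")
                    out = gzip.open(path, "wt", newline="")
                    writer = csv.writer(out)
                    part_paths.append(path)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return part_paths


# Small demonstration with made-up rows in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "table.csv")
    with open(source, "w", newline="") as f:
        csv.writer(f).writerows([[i, f"name {i}"] for i in range(5)])
    parts = split_and_gzip(source, tmp, rows_per_part=2)
    print([os.path.basename(p) for p in parts])
```

The resulting part files are what you would upload to the table's folder on S3 before running the copy command.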

Performance improvement for GZ to ORC File

Please let me know whether there is a faster way to move *.gz files into an ORC table directly.
1) Another thought: loading from a *.gz file into a non-partitioned table. Rather than creating an external table and dumping the gz file data into it, is there any other approach for quicker loading from gz to an external table? We are thinking of two other approaches, such as having ADF run a custom .exe to uncompress the *.gz file and upload it to Azure Blob.
For example: if the *.gz file is 10 GB, the uncompressed file is 120 GB, and uncompressing takes 40 minutes, how do we upload this uncompressed 120 GB data file to Azure Blob? Do we need the Azure Blob SDK for uploading, or will ADF execute the .exe where the data is located, i.e. on the cluster that holds the blob data? (If ADF executes the .exe in the Azure Blob Storage data center's cluster, there will be no network cost and no network latency, and the time to upload the uncompressed data will be very small.) Is this possible with ADF? Would it be the right approach?
If the above approach doesn't work, and we create an MR solution where the mapper uncompresses the gz file and uploads it to Azure Blob Storage, will there be any performance improvement, since I would just need to create an external table pointing to the uncompressed file? The MR job would execute at the Azure Blob Storage location.
We see ORC and ORC with partitions performing the same (sometimes we see a minimal difference between partitioned and unpartitioned ORC). Will ORC with partitions perform better than plain ORC? Will ORC with partitions and bucketing perform better than ORC with partitions only? Each partitioned ORC file is close to 50-100 MB, and each unpartitioned ORC file is 30-50 MB.
Note: 120 GB of uncompressed data compresses to 17 GB in ORC file format.
The only way I know to move from gz to ORC file format is by writing a Hive query. Using a compressed format will always be slower, since the data needs to be decompressed before conversion. You may want to play around with these parameters, as shown here, to see if they speed up moving from gz to ORC.
For question #1 above, you may want to follow up with Azure Data Factory team.
For question #3, I have not tried it but computing on uncompressed data should be faster than using compressed data.
For #4, it depends on the field you are partitioning on. Make sure your key is not under-partitioned (i.e. does not result in too few partitions). Also make sure you add SORTED BY to get a secondary partitioning key. Refer to this link for more details.
Hive has native support for compressed formats, including GZIP, BZIP2 and deflate. So you can upload the .gz files to Azure Blob and create an external table over those files directly. Then you can create an ORC table and load the data into it. Normally Hive runs faster with compressed files; please refer to Compression in Hadoop by MSIT for details.