BigQuery export CSV: Can I control output partitioning? - google-bigquery

This statement exports the query results to GCS:
EXPORT DATA OPTIONS(
uri='gs://<bucket>/<file_name>.*.csv',
format='CSV',
overwrite=true,
header=true
) AS
SELECT * FROM dataset.table
It splits big amounts of data into multiple files, sometimes it also produces empty files. I can't seem to find any info in BigQuery docs on how to control this. Can I configure export into a single file? Or into N files up to 1M rows each? Or N files up to 50MB each?

I have tested different scenarios (using Public datasets) and discovered that export data gets split into multiple files when your table is partitioned and is less than 1 GB. This result happens when using wildcard operator during the export.
BigQuery supports a single wildcard operator (*) in each URI. The wildcard can appear anywhere in the URI except as part of the bucket name. Using the wildcard operator instructs BigQuery to create multiple sharded files based on the supplied pattern.
Unfortunately, wildcard is a requirement for the EXPORT DATA syntax, otherwise your query will fail and get this error:
Can I configure export into a single file? Or into N files up to 1M rows each? Or N files up to 50MB each?
As mentioned above, exporting a partitioned table into a single file would not be possible using the EXPORT DATA syntax. A workaround for this, is to export using the UI or bq command.
Using UI export:
Open table > Export > Export to GCS > Fill in GCS location and filename
Using bq tool:
bq extract --destination_format CSV \
bigquery-public-data:covid19_geotab_mobility_impact.us_border_wait_times \
gs://bucket_name/900k_rows_using_bq_extract.csv
Using public data partitioned table, bigquery-public-data.covid19_geotab_mobility_impact.us_border_wait_times. See csv files exported to GCS bucket using these three different methods.

Related

How to import public data set into Google Cloud Bucket

I am going to work on a data set that contains information about 311 calls in the United States. This data set is available publicly in BigQuery. I would like to copy this directly to my bucket. However, I am clueless about how to do this as I am a novice.
Here is a screenshot of the public location of the dataset on Google Cloud:
I have already created a bucket named 311_nyc in my Google Cloud Storage. How can I directly transfer the data without having to download the 12 gb file and uploading it again through my VM instance?
If you select the 311_service_requests table from the list on the left, an "Export" button will appear:
Then you can select Export to GCS, select your bucket, type a filename, choose format (between CSV and JSON) and check if you want the export file to be compressed (GZIP).
However, there are some limitations in BigQuery Exports. Copying some from the documentation link that apply to your case:
You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.
EDIT:
A simple way to merge the output files together is to use the gsutil compose command. However, if you do this the header with the column names will appear multiple times in the resulting file because it appears in all the files that are extracted from BigQuery.
To avoid this, you should perform the BigQuery Export by setting the print_header parameter to False:
bq extract --destination_format CSV --print_header=False bigquery-public-data:new_york_311.311_service_requests gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv
and then create the composite:
gsutil compose gs://<YOUR_BUCKET_NAME>/nyc_311_* gs://<YOUR_BUCKET_NAME>/all_data.csv
Now, in the all_data.csv file there are no headers at all. If you still need the column names to appear in the first row you have to create another CSV file with the column names and create a composite of these two. This can be done either manually by pasting the following (column names of the "311_service_requests" table) into a new file:
unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,borough,x_coordinate,y_coordinate,park_facility_name,park_borough,bbl,open_data_channel_type,vehicle_type,taxi_company_borough,taxi_pickup_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
or with the following simple Python script (in case you want to use it with a table with a big amount of columns that is hard to be done manually) that queries the column names of the table and writes them into a CSV file:
from google.cloud import bigquery
client = bigquery.Client()
query = """
SELECT column_name
FROM `bigquery-public-data`.new_york_311.INFORMATION_SCHEMA.COLUMNS
WHERE table_name='311_service_requests'
"""
query_job = client.query(query)
columns = []
for row in query_job:
columns.append(row["column_name"])
with open("headers.csv", "w") as f:
print(','.join(columns), file=f)
Note that for the above script to run you need to have the BigQuery Python Client library installed:
pip install --upgrade google-cloud-bigquery
Upload the headers.csv file to your bucket:
gsutil cp headers.csv gs://<YOUR_BUCKET_NAME/headers.csv
And now you are ready to create the final composite:
gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/all_data.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
In case you want the headers you can skip creating the first composite and just create the final one using all sources:
gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
You can also use the gcoud commands:
Create a bucket:
gsutil mb gs://my-bigquery-temp
Extract the data set:
bq extract --destination_format CSV --compression GZIP 'bigquery-public-data:new_york_311.311_service_requests' gs://my-bigquery-temp/dataset*
Please note that you have to use gs://my-bigquery-temp/dataset* because the dataset is to large and can not be exported to a single file.
Check the bucket:
gsutil ls gs://my-bigquery-temp
gs://my-bigquery-temp/dataset000000000
......................................
gs://my-bigquery-temp/dataset000000000045
You can find more information Exporting table data
Edit:
To compose an object from the exported dataset files you can use gsutil tool:
gsutil compose gs://my-bigquery-temp/dataset* gs://my-bigquery-temp/composite-object
Please keep in mind that you can not use more that 32 blobs (files) to compose the object.
Related SO Question Google Cloud Storage Joining multiple csv files

Is it possible to export the data from select query output or table to the excel file stored in your local directory

I got this from user guide :
bq --location=US extract 'mydataset.mytable' gs://example-bucket/myfile.csv
But I want to export the data to the file located in my local path
example : /home/rahul/myfile.csv
When I am trying I got the below error:
Extract URI must start with "gs://"
Is it possible to export in local directory?
Also, Can we export the result of our select query to the excel?
Example :
bq --location=US extract 'select * from mydataset.mytable' /home/abc/myfile.csv
No, the BigQuery extract operation takes data out from BigQuery into a Google Cloud Storage (GCS) bucket.
Once data is in GCS you can copy it to your local system with gsutil or any other tool that might combine both operations.

Export table from Bigquery into GCS split sizes

I am exporting a table of size>1GB from Bigquery into GCS but it splits the files into very small files of 2-3 MB. Is there a way to get bigger files like 40-60MB per files rather than 2-3 MB.
I do the expport via the api
https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
https://cloud.google.com/bigquery/docs/reference/v2/jobs
The source table size is 60 GB on Bigquery. I extract the data with format - NewLine_Delimited_Json and GZIP compression
destination_cloud_storage_uris=[
'gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz'
]
Are you trying to export partitioned table? If yes, each partition is exported as different table and it might cause small files.
I run the export in cli with each of the following commands and received in both cases files of size 49 MB:
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
bq extract --compression=GZIP project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
Please add more details to the question so we can provide specific advice: How are you exactly asking for this export?
Nevertheless, if you have many files in GCS and you want to merge them all into one, you can do:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose

Add filename as column on import to BigQuery?

This is a question about importing data files from Google Cloud Storage to BigQuery.
I have a number of JSON files that follow a strict naming convention to include some key data not included in the JSON data itself.
For example:
xxx_US_20170101.json.gz
xxx_GB_20170101.json.gz
xxx_DE_20170101.json.gz
Which is client_country_date.json.gz At the moment, I have some convoluted processes in a Ruby app that reads the files, appends the additional data and then writes it back to a file that is then imported into a single daily table for the client in BigQuery.
I am wondering if it is possible to grab and parse the filename as part of the import to BigQuery? I could then drop the convoluted Ruby processes which occasionally fail on larger files.
You could define an external table pointing to your files:
Note that the table type is "external table", and that it points to multiple files with the * glob.
Now you can query for all data in these files, and query for the meta-column _FILE_NAME:
#standardSQL
SELECT *, _FILE_NAME filename
FROM `project.dataset.table`
You can now save these results to a new native table.

BigQuery export table to csv file

I'm trying to export a BigQuery form UI to Google Storage table but facing this error:
Errors:
Table gs://mybucket/delta.csv.gz too large to be exported to a single file. Specify a uri including a to shard export. (error code: invalid)
When trying to export after query I got:
Download Unavailable This result set contains too many rows to download. Please use "Save as Table" and then export the resulting table.
Finally found how to do. we must use "*" in the blob name.
And will create as many file as needed.
It's weird that i can import large file (~GB) but not possible to export large file :(
BigQuery can export up to 1 GB of data per file
For larger than 1GB - BigQuery supports exporting to multiple files
See Single wildcard URI and Multiple wildcard URIs in Exporting data into one or more files