Scenario:
I uploaded a CSV file into an S3 bucket and now I would like to read this table using Trino.
Is it possible to just read the table without a CREATE TABLE statement, maybe with a simple SELECT only? Or do I have to CREATE TABLE every time I want to read the CSV file?
I am going to work on a data set that contains information about 311 calls in the United States. This data set is publicly available in BigQuery. I would like to copy it directly to my bucket, but I am clueless about how to do this as I am a novice.
Here is a screenshot of the public location of the dataset on Google Cloud:
I have already created a bucket named 311_nyc in my Google Cloud Storage. How can I transfer the data directly, without having to download the 12 GB file and upload it again through my VM instance?
If you select the 311_service_requests table from the list on the left, an "Export" button will appear:
Then you can select Export to GCS, select your bucket, type a filename, choose the format (CSV or JSON) and check whether you want the export file to be compressed (GZIP).
However, there are some limitations in BigQuery exports. Copying the ones from the documentation that apply to your case:
You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.
EDIT:
A simple way to merge the output files together is to use the gsutil compose command. However, if you do this, the header row with the column names will appear multiple times in the resulting file, because it appears in every file extracted from BigQuery.
To avoid this, you should perform the BigQuery Export by setting the print_header parameter to False:
bq extract --destination_format CSV --print_header=False bigquery-public-data:new_york_311.311_service_requests gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv
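(If you prefer the Python client to the bq CLI, a roughly equivalent export could look like the sketch below; the bucket name is a placeholder, and the wildcard is needed because the table is larger than 1 GB.)
from google.cloud import bigquery

client = bigquery.Client()

# Export the public table to multiple CSV part files without header rows,
# so they can later be composed without repeated headers.
job_config = bigquery.ExtractJobConfig(
    destination_format="CSV",
    print_header=False,
)

extract_job = client.extract_table(
    "bigquery-public-data.new_york_311.311_service_requests",
    "gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv",  # wildcard: the export is larger than 1 GB
    job_config=job_config,
)
extract_job.result()  # wait for the extract job to finish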
and then create the composite:
gsutil compose gs://<YOUR_BUCKET_NAME>/nyc_311_* gs://<YOUR_BUCKET_NAME>/all_data.csv
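(The composition can also be done with the Cloud Storage Python client; a minimal sketch, assuming the part files were exported with the names above and that there are at most 32 of them, since a single compose request accepts at most 32 source objects.)
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("<YOUR_BUCKET_NAME>")  # placeholder bucket name

# Collect the exported part files and compose them into a single object.
parts = sorted(bucket.list_blobs(prefix="nyc_311_"), key=lambda b: b.name)
destination = bucket.blob("all_data.csv")
destination.compose(parts)  # at most 32 source objects per compose request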
Now the all_data.csv file has no headers at all. If you still need the column names to appear in the first row, you have to create another CSV file with the column names and create a composite of these two. This can be done either manually, by pasting the following (the column names of the 311_service_requests table) into a new file:
unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,borough,x_coordinate,y_coordinate,park_facility_name,park_borough,bbl,open_data_channel_type,vehicle_type,taxi_company_borough,taxi_pickup_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
or with the following simple Python script (useful when the table has so many columns that doing it manually is impractical), which queries the column names of the table and writes them into a CSV file:
from google.cloud import bigquery

client = bigquery.Client()

# Query the column names of the table from INFORMATION_SCHEMA,
# ordered so the header matches the column order of the exported CSV.
query = """
SELECT column_name
FROM `bigquery-public-data`.new_york_311.INFORMATION_SCHEMA.COLUMNS
WHERE table_name='311_service_requests'
ORDER BY ordinal_position
"""
query_job = client.query(query)

columns = []
for row in query_job:
    columns.append(row["column_name"])

# Write the column names as a single comma-separated header row.
with open("headers.csv", "w") as f:
    print(','.join(columns), file=f)
Note that for the above script to run you need to have the BigQuery Python Client library installed:
pip install --upgrade google-cloud-bigquery
Upload the headers.csv file to your bucket:
gsutil cp headers.csv gs://<YOUR_BUCKET_NAME>/headers.csv
And now you are ready to create the final composite:
gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/all_data.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
If you want the headers anyway, you can skip creating the first composite and just create the final one from all the sources directly:
gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
You can also use the Cloud SDK command-line tools (bq and gsutil):
Create a bucket:
gsutil mb gs://my-bigquery-temp
Extract the data set:
bq extract --destination_format CSV --compression GZIP 'bigquery-public-data:new_york_311.311_service_requests' gs://my-bigquery-temp/dataset*
Please note that you have to use gs://my-bigquery-temp/dataset* because the dataset is too large and cannot be exported to a single file.
Check the bucket:
gsutil ls gs://my-bigquery-temp
gs://my-bigquery-temp/dataset000000000000
......................................
gs://my-bigquery-temp/dataset000000000045
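(If you prefer to verify the export from code instead of gsutil, a small sketch with the Cloud Storage Python client, using the bucket and prefix from the commands above:)
from google.cloud import storage

client = storage.Client()

# List the exported objects and their sizes to verify the extract job output.
for blob in client.list_blobs("my-bigquery-temp", prefix="dataset"):
    print(blob.name, blob.size)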
You can find more information in Exporting table data.
Edit:
To compose an object from the exported dataset files you can use the gsutil tool:
gsutil compose gs://my-bigquery-temp/dataset* gs://my-bigquery-temp/composite-object
Please keep in mind that you cannot use more than 32 blobs (files) in a single compose request.
Related SO question: Google Cloud Storage Joining multiple csv files
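If the export produced more than 32 files, one way around that limit is to compose the parts in batches of 32 into intermediate objects and then compose those intermediates into the final object. A rough sketch with the Cloud Storage Python client, using the bucket and object names from the commands above (the intermediate object names are made up for the example):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bigquery-temp")

parts = sorted(bucket.list_blobs(prefix="dataset"), key=lambda b: b.name)

# Compose at most 32 sources at a time into intermediate objects.
intermediates = []
for i in range(0, len(parts), 32):
    tmp = bucket.blob(f"tmp/intermediate-{i // 32:04d}")  # hypothetical temp object names
    tmp.compose(parts[i:i + 32])
    intermediates.append(tmp)

# Compose the intermediates into the final object
# (assumes at most 32 intermediates, i.e. up to ~1000 exported part files).
final = bucket.blob("composite-object")
final.compose(intermediates)

# Optionally delete the intermediate objects afterwards.
for tmp in intermediates:
    tmp.delete()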
I am exporting a table larger than 1 GB from BigQuery into GCS, but it splits the data into very small files of 2-3 MB. Is there a way to get bigger files, like 40-60 MB per file, rather than 2-3 MB?
I do the export via the API:
https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
https://cloud.google.com/bigquery/docs/reference/v2/jobs
The source table is 60 GB in BigQuery. I extract the data in NEWLINE_DELIMITED_JSON format with GZIP compression:
destination_cloud_storage_uris=[
'gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz'
]
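For reference, an export with those settings could be requested through the BigQuery Python client roughly like this (the project, dataset, table and bucket names are placeholders taken from the snippets above); note that the extract job configuration itself does not expose a target file size:
from google.cloud import bigquery

client = bigquery.Client()

# Gzip-compressed newline-delimited JSON, matching the settings described above.
job_config = bigquery.ExtractJobConfig(
    destination_format="NEWLINE_DELIMITED_JSON",
    compression="GZIP",
)

extract_job = client.extract_table(
    "project.dataset.table",  # placeholder source table
    "gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz",
    job_config=job_config,
)
extract_job.result()  # wait for the export to complete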
Are you trying to export a partitioned table? If so, each partition is exported as a separate table, and that might cause the small files.
I ran the export in the CLI with each of the following commands and in both cases received files of about 49 MB:
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
bq extract --compression=GZIP project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
Please add more details to the question so we can provide specific advice: how exactly are you requesting this export?
Nevertheless, if you have many files in GCS and you want to merge them all into one, you can do:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
I have a program which downloads some data from the web, saves it as a CSV, and then uploads that data to a Google Cloud Storage bucket. Next, that program creates a new Google BigQuery table by loading all the files from the Cloud Storage bucket. To do the loading I run this command in the command prompt:
bq load --project_id=ib-17 da.hi gs://ib/hi/* da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT
The issue is that for some reason this command appends to the existing table, so I get a lot of duplicate data. The question is: how can I either delete the table first, or make this command overwrite the table instead of appending?
If I understood your question correctly, you can delete and recreate the table with:
bq rm -f -t da.hi
bq mk --schema da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT -t da.hi
Another possibility is to use the --replace flag, such as:
bq load --replace --project_id=ib-17 da.hi gs://ib/hi/*
I think this flag was once called WRITE_DISPOSITION, but it looks like the CLI has renamed it to --replace.
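For completeness, a rough Python-client equivalent of the --replace behaviour uses the WRITE_TRUNCATE write disposition, which replaces the table contents instead of appending. A minimal sketch, with the project, table, bucket path and schema taken from the question above:
from google.cloud import bigquery

client = bigquery.Client(project="ib-17")

# WRITE_TRUNCATE replaces the table contents instead of appending,
# which is what the --replace flag does in the bq CLI.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    schema=[
        bigquery.SchemaField("da", "TIMESTAMP"),
        bigquery.SchemaField("bol", "STRING"),
        bigquery.SchemaField("bp", "FLOAT"),
        bigquery.SchemaField("bg", "FLOAT"),
        bigquery.SchemaField("bi", "FLOAT"),
        bigquery.SchemaField("lo", "FLOAT"),
        bigquery.SchemaField("en", "FLOAT"),
        bigquery.SchemaField("kh", "FLOAT"),
        bigquery.SchemaField("ow", "FLOAT"),
        bigquery.SchemaField("ls", "FLOAT"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://ib/hi/*", "da.hi", job_config=job_config
)
load_job.result()  # wait for the load to finish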
I have been trying multiple ways to move a compressed TSV into BigQuery. I was able to get the command to run, but I don't see any table being loaded. Please help me figure out how to write a command that works.
bq ‘--project_id’ --nosync load --source_format CSV --field_delimiter ‘\t’ --autodetect --skip_leading_rows ‘0’ --quote=‘’ --encoding UTF-8 :table.destinationtable ‘gs://bucketname/filename.tsv.gz’
Successfully started load 162822:bqjob_r2d00a5817904935f_0000015c79e61b7c_1