When we create a table under a particular dataset, we have five options, such as Empty table, Google Cloud Storage, Upload, etc. My question is: if the source is Cloud Storage, where does this table get created, in BigQuery or in Cloud Storage? My intention is to dump the data into Cloud Storage and then load it into BigQuery. The same goes for Empty table: since we explicitly define the schema, I understand that table will reside in BQ.
I have loaded the data with the script below:
bq load --source_format=CSV --skip_leading_rows=1 --autodetect --ignore_unknown_values \
commerce.balltoball gs://balltoballbucket/head_usa_names.csv
I suppose balltoballbucket refers to the storage bucket, whereas commerce.balltoball is the BigQuery reference.
Apologies for the newbie question. Thanks for your help.
If your bq load works, then the UI should work for you. The documentation is here:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table (then pick Console tab)
Select file from GCS bucket: gs://balltoballbucket/head_usa_names.csv
File Format: CSV
Dataset Name: commerce
Table Name: balltoball
For other options, see the page:
(Optional) Click Advanced options.
As to where the table is stored: if you pick Native table as the Table type, the data is stored inside BigQuery storage; pick External table to let the data stay on GCS, where it is only read when a query hits the table.
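For example, here is a rough command-line equivalent of the External table option, assuming the same bucket and a hypothetical commerce.balltoball_ext table name (a sketch, not the only way to do it):
bq mkdef --autodetect --source_format=CSV \
  "gs://balltoballbucket/head_usa_names.csv" > balltoball_def.json
# registers the definition as an external table; no data is copied into BigQuery storage
bq mk --external_table_definition=balltoball_def.json commerce.balltoball_ext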
Related
I'm following this guide from the Google Cloud documentation to create an external partitioned table:
Create table from: Google Cloud Storage
Select file from GCS bucket: gs://my_bucket/data/ymd=20200703/*
File format: Avro
Source Data Partitioning, URI Prefix: gs://my_bucket/data/
Table Type: External
But when I click create table it says:
Specifying a schema is disallowed for STORAGE_FORMAT_AVRO
If I use native tables instead of external, it just works. I also tried gs://my_bucket/data/, gs://my_bucket/data/**/, gs://my_bucket/data/ymd=20200703/*, gs://my_bucket/data/ymd=20200703/file-blabla, ... for Select file from GCS bucket, but it made no difference.
Any ideas how I can create external (not native) partitioned tables in BigQuery?
This was a bug in Google BigQuery; I reported it and they are working on it now.
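While that issue is open, a possible workaround is to create the external partitioned table with the bq command-line tool and let hive-partition detection pick up the ymd key instead of supplying a schema. A sketch assuming the paths from the question and a hypothetical mydataset.my_external_table name:
bq mkdef --source_format=AVRO \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://my_bucket/data \
  "gs://my_bucket/data/*" > partitioned_def.json
bq mk --external_table_definition=partitioned_def.json mydataset.my_external_table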
We have data stored in a GCP bucket in the format below:
gs://gcptest/Year=2020/Month=06/day=18/test1.parquet, plus many more files under the day=18 folder.
I want to create a table in BigQuery with the columns present in the files, partitioned by the Year, Month and Day values present in the file path.
That way, when I load the data into the table, I can just select the path from the GCP bucket and the loaded data will be partitioned by the Year/Month/Day values in the path.
BigQuery supports loading externally partitioned data in Avro, Parquet, ORC, CSV and JSON formats that is stored on Cloud Storage using a default hive partitioning layout.
Support is currently limited to the BigQuery web UI, command-line tool, and REST API.
You can see more in the Loading externally partitioned data documentation.
Also see how to Query externally partitioned data
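For the layout above, a minimal bq load sketch might look like the following, assuming a hypothetical mydataset.test destination table; with hive-partition detection enabled, the Year, Month and day keys from the path are loaded as columns alongside the columns in the Parquet files:
bq load --source_format=PARQUET \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://gcptest \
  mydataset.test \
  "gs://gcptest/Year=2020/Month=06/day=18/*.parquet"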
We want to create a backup copy of a BigQuery dataset in case a table is accidentally dropped, as it is only recoverable within 7 days.
Is there a way to extend the duration of the recovery period? If not, how can we create a backup of a dataset with a retention period of 30 days in BigQuery?
It is currently not possible to extend the duration of the recovery period. A feature request for the ability to extend it has already been created, as commented by Katayoon.
Here is a public link to monitor the progress on that issue: https://issuetracker.google.com/120038872
To back up datasets in BigQuery you could either make copies of your dataset or, as a more workable solution, export the data to Cloud Storage so you can import it back at a later time. Cloud Storage allows you to set a retention period and a lifecycle policy, which together ensure that the data stays undisturbed for the desired amount of time and then removes itself after a given time, should you wish to save on storage costs.
For how to do an export in BigQuery:
You can export tables as AVRO, JSON or CSV files to Cloud Storage via the web UI, the command line, the API, or various client languages like C#, Go, Python and Java, as long as the dataset and the bucket are in the same location. There are other limitations to exporting a table, such as file size, integer encoding, data compression, etc.
Link to table export and limitations:
https://cloud.google.com/bigquery/docs/exporting-data
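For instance, a minimal export sketch, assuming a hypothetical mydataset.mytable and gs://my-backup-bucket; the wildcard lets BigQuery shard a large table into multiple files:
bq extract --destination_format=AVRO \
  mydataset.mytable \
  "gs://my-backup-bucket/backups/mytable-*.avro"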
You can find the instructions on the procedures here:
Retention Policies and Bucket Lock: https://cloud.google.com/storage/docs/using-bucket-lock#lock-bucket
Object Lifecycle Management:
https://cloud.google.com/storage/docs/managing-lifecycles
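A small sketch of the Cloud Storage side, assuming a hypothetical gs://my-backup-bucket: the retention policy prevents the exported files from being deleted for at least 30 days, and the lifecycle rule deletes them after 60 days.
# keep every object for at least 30 days
gsutil retention set 30d gs://my-backup-bucket
# lifecycle.json contents: {"rule": [{"action": {"type": "Delete"}, "condition": {"age": 60}}]}
gsutil lifecycle set lifecycle.json gs://my-backup-bucket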
Loading data into BigQuery can be done using various file formats, such as CSV, JSON, Avro, Parquet or ORC. At the moment you can load directly only from local storage or from Google Cloud Storage. More on loading data, file formats, data sources and limitations by following the link: https://cloud.google.com/bigquery/docs/loading-data
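To restore such a backup later, the exported files can be loaded straight back from the bucket; a sketch assuming the Avro export above and a hypothetical mydataset.mytable_restored:
bq load --source_format=AVRO \
  mydataset.mytable_restored \
  "gs://my-backup-bucket/backups/mytable-*.avro"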
More information on
Exporting tables: https://cloud.google.com/bigquery/docs/exporting-data
Export limitations: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
Loading data into BigQuery: https://cloud.google.com/bigquery/docs/loading-data
Wildcards: https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames
Merging the file: https://cloud.google.com/storage/docs/gsutil/commands/compose
You can take a snapshot of a table using either SQL or the CLI:
SQL
CREATE SNAPSHOT TABLE `myproject.library_backup.books`
CLONE `myproject.library.books`
OPTIONS(expiration_timestamp = TIMESTAMP "2022-04-27 12:30:00.00-08:00")
CLI
bq cp --snapshot --no_clobber --expiration=86400 library.books library_backup.books
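To restore from the snapshot later, one option is to clone it back into a regular table; a sketch assuming a hypothetical library.books_restored destination:
bq query --use_legacy_sql=false \
  'CREATE TABLE library.books_restored CLONE library_backup.books'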
You can back up and restore using the tools in https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/tree/master/blogs/bigquery_backup:
Backup a table to GCS
./bq_backup.py --input dataset.tablename --output gs://BUCKET/backup
This saves a schema.json, a tabledef.json, and extracted data in AVRO format to GCS.
You can also back up all the tables in a dataset:
./bq_backup.py --input dataset --output gs://BUCKET/backup
Restore tables one by one by specifying a destination dataset:
./bq_restore.py --input gs://BUCKET/backup/fromdataset/fromtable --output destdataset
For views, the backup stores the view definition and the restore creates a view.
A very basic question, but I'm not able to decipher it. Please help me out.
Q1: When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
Q2: Let's say my data directory for customer files is gs://sp2040/raw/cards/cust/. The table structure defined is:
bq mk --time_partitioning_type=DAY market.cust \
custid:string,grp:integer,odate:string
Every day I create a new directory in the bucket, such as 20170101, 20170102, etc., to load a new dataset. So after the data is loaded into this bucket, do I need to run the commands below?
D1:
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
D2:
bq load --source_format=CSV 'market.cust$20170102' \
gs://sp2040/raw/cards/cust/20170102/20170102_cust.csv
When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
Nope! BigQuery does not use Cloud Storage for storing data (unless it is a federated table linked to Cloud Storage).
Check BigQuery Under the Hood with Tino Tereshko and Jordan Tigani - you will like it
Do I need to run the commands below?
Yes, you need to load those files into BigQuery so you can query the data.
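For example, the daily load could be scripted with a date variable so each file lands in its matching partition; a sketch using the paths from the question:
DT=$(date +%Y%m%d)   # e.g. 20170102
bq load --source_format=CSV "market.cust\$${DT}" \
  "gs://sp2040/raw/cards/cust/${DT}/${DT}_cust.csv"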
Yes, you would need to load the data into BigQuery using those commands.
However, there are a couple of alternatives:
Pub/Sub and Dataflow: You could configure Pub/Sub to watch your Cloud Storage bucket and create notifications when files are added, as described here (see the sketch after this list). You could then have a Dataflow job that imports the files into BigQuery. Dataflow documentation
BigQuery external tables: BigQuery can query CSV files that are stored in Cloud Storage without importing the data, as described here. There is wildcard support for filenames, so it can be configured once. Performance might not be as good as storing the data directly in BigQuery.
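For the first alternative, a minimal sketch of the notification setup, assuming a hypothetical Pub/Sub topic named new-cust-files; a Dataflow job (or any other subscriber) would then react to these events and load the new files:
gcloud pubsub topics create new-cust-files
# emit a Pub/Sub message whenever a new object is finalized in the bucket
gsutil notification create -t new-cust-files -f json -e OBJECT_FINALIZE gs://sp2040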
I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as JSON to a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that would have lots of users hitting it and would rather not be in the business of managing all these temporary tables.
1) As you mention, the steps are good. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; check also the variants for different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you further to download the files from GCS to your local machine.
With this approach you first need to export to GCS, then transfer to the local machine. If you have a message queue system (like Beanstalkd) in place to drive all of this, it's easy to do a chain of operations: submit the job, monitor its state, initiate the export to GCS when it's done, then delete the temp table.
Please also note that you can update the table via the API and set the expirationTime property; with this approach you don't need to delete it.
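Put together, the chain from option 1 might look roughly like this, assuming hypothetical names mydataset.tmp_results and gs://my-bucket, and using a table expiration instead of an explicit delete:
# 1. run the query into a temporary destination table (legacy SQL, as in the question)
bq query --allow_large_results --destination_table=mydataset.tmp_results \
  "SELECT * FROM [publicdata:samples.shakespeare]"
# 2. export that table to GCS as newline-delimited JSON
bq extract --destination_format=NEWLINE_DELIMITED_JSON \
  mydataset.tmp_results "gs://my-bucket/results/results-*.json"
# 3. let the temporary table expire on its own after one hour
bq update --expiration 3600 mydataset.tmp_results
# 4. optionally pull the files down locally
gsutil cp "gs://my-bucket/results/results-*.json" .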
2) If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can achieve some export locally, but it has certain other limits.
This exports the first 1000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json