Exporting query results as JSON via Google BigQuery API - google-bigquery

I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as json in a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that would have lots of users hitting it and would rather not be in the business of managing all these temporary tables.

1) As you mention, the steps are good. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; also check the variants for the different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the files from GCS to your local machine.
With this approach you first export to GCS and then transfer the files to the local machine. If you have a message queue system (like Beanstalkd) in place to drive all of this, it's easy to run a chain of operations: submit the job, monitor its state, initiate the export to GCS when it's done, then delete the temp table.
Note that you can also update the table via the API and set the expirationTime property; with this approach you don't need to delete it.
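For illustration, here is a rough sketch of those steps with the google-cloud-bigquery Python client; the project, dataset, table and bucket names are placeholders, and the expiration step replaces the explicit delete:

import datetime
from google.cloud import bigquery

client = bigquery.Client()
temp_table = "myproject.mydataset.tmp_export"  # placeholder "temporary" destination table

# 1) Run the query into the destination table (allowLargeResults applies to legacy SQL).
job_config = bigquery.QueryJobConfig(
    destination=temp_table,
    allow_large_results=True,
    use_legacy_sql=True,
)
client.query("SELECT * FROM [publicdata:samples.shakespeare]", job_config=job_config).result()

# 2) Extract the destination table to a GCS bucket as newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
client.extract_table(temp_table, "gs://my-bucket/exports/result-*.json",
                     job_config=extract_config).result()

# 3) Instead of deleting the table right away, let it expire on its own.
table = client.get_table(temp_table)
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
client.update_table(table, ["expires"])

The wildcard in the destination URI shards the export into multiple files, which is required for tables larger than 1 GB.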
2) If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can export some data locally, but it has other limits.
This exports the first 1000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json

Related

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue, with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while writing the data to the staging area, the data is changed to something like 0:08:00:00.0000000, 0:17:00:00.0000000, and that causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas as to why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I get the same error; a leading zero appears ahead of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it with an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
Recommended solution:
Use the Copy data activity to land your data on BlobStorage / ADLS (this activity does that anyway), preferably in the Parquet file format and a self-designed structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and load the data into the table from the files there; you can use a regular query or write a stored procedure and call it (a rough sketch of such a load follows below).
Thanks to this, you will have more control over what is happening and you will build a DataLake solution for your organization.
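For reference, here is a rough sketch of that load, executed with the snowflake-connector-python client purely for illustration (in ADF it would be the query behind the Lookup activity or a stored procedure); the connection parameters, stage name, container URL, SAS token and table name are all placeholders:

import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()

# One-time setup: a permanent stage pointing at the BlobStorage / ADLS container.
cur.execute("""
    CREATE STAGE IF NOT EXISTS my_blob_stage
      URL = 'azure://myaccount.blob.core.windows.net/mycontainer/data/'
      CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
      FILE_FORMAT = (TYPE = PARQUET)
""")

# Per run: load the Parquet files from the stage into the target table.
cur.execute("""
    COPY INTO my_table
    FROM @my_blob_stage
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")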
My own solution is pretty close to the accepted answer, but I still believe that there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created on a direct-to-Snowflake copy, I ended up writing a plain file into blob storage and reading it again to load into Snowflake.
So instead of having it all in one step, I manually split it into two actions:
One action takes the data from Azure SQL and saves it as a plain text file on blob storage,
and the second action reads the file and loads it into Snowflake.
This works, and is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.

Automatic ETL data before loading to Bigquery

I have CSV files added to a GCS bucket daily or weekly; each file name contains (date + specific parameter).
The files contain the (id + name) columns, and we need to auto load/ingest these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't get the date and specific parameter from the file name into the Dataflow job.
And we tried a Cloud Function (we can get the date and specific parameter values from the file name) but couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article about this kind of problem using Cloud Workflows, for when you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) Then, for each file, we will use parts of the filename in the generated BigQuery table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
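Outside of Cloud Workflows, the same steps can be sketched with the client libraries. Here is a rough Python illustration; the bucket, dataset and table names, the filename pattern name_YYYY-MM-DD_param.csv, and the assumption that the final table already exists with the four columns (id, name, date, parameter) are all placeholders:

import re
from google.cloud import bigquery, storage

BUCKET = "my-ingest-bucket"              # placeholder
STAGING = "mydataset.staging_rows"       # placeholder, schema: id, name
FINAL = "mydataset.final_rows"           # placeholder, schema: id, name, date, parameter

bq = bigquery.Client()
gcs = storage.Client()

for blob in gcs.list_blobs(BUCKET, prefix="incoming/"):
    # Pull the date and the specific parameter out of the file name.
    match = re.search(r"_(\d{4}-\d{2}-\d{2})_([^/]+)\.csv$", blob.name)
    if not match:
        continue
    file_date, parameter = match.groups()

    # 1) Load the two CSV columns into a staging table, overwritten per file.
    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        schema=[bigquery.SchemaField("id", "STRING"),
                bigquery.SchemaField("name", "STRING")],
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    bq.load_table_from_uri(f"gs://{BUCKET}/{blob.name}", STAGING,
                           job_config=load_config).result()

    # 2) Append to the final table, adding the values parsed from the file name.
    bq.query(
        f"INSERT INTO `{FINAL}` (id, name, date, parameter) "
        f"SELECT id, name, DATE '{file_date}', '{parameter}' FROM `{STAGING}`"
    ).result()

A Cloud Function triggered on the bucket could run the same per-file logic instead of the loop.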

BigQuery - Transfers automation from Google Cloud Storage - Overwrite table

Here's the case:
Our client daily uploads CSVs (overwritten) to a bucket in Google Cloud Storage (each table in a different file).
We use BigQuery as a data source in Data Studio.
We want to automatically transfer the CSVs to BigQuery.
The thing is, even though we've:
Declared the tables in BigQuery with "Overwrite table" write preference option
Configured the daily Transfers via the UI (BigQuery > Transfers) to automatically load the CSVs from Google Cloud Storage one hour after the files are uploaded, as stated by the limitations.
The automated transfer/load is "WRITE_APPEND" by default, thus the tables are appended to instead of overwritten in BigQuery.
Hence the question: How/where can we change the
configuration.load.writeDisposition = WRITE_TRUNCATE
as stated here in order to overwrite the tables when the CSVs are automatically loaded?
I think that's what we're missing.
Cheers.
None of the above worked for us, so I'm posting this in case anyone has the same issue.
We scheduled a query to erase the table content just before the automatic importation process starts:
DELETE FROM project.tableName WHERE true
And then the new data will be imported into an empty table, so the default "WRITE_APPEND" doesn't affect us.
1) One way to do this is to use DDL to CREATE OR REPLACE your table before running the query that imports the data.
This is an example of how to create a table:
#standardSQL
CREATE OR REPLACE TABLE mydataset.top_words
OPTIONS(
  description="Top ten words per Shakespeare corpus"
) AS
SELECT
  corpus,
  ARRAY_AGG(STRUCT(word, word_count) ORDER BY word_count DESC LIMIT 10) AS top_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
Now that it's created, you can import your data.
2) Another way is to use BigQuery scheduled queries.
3) If you write Python, you can find an even better solution here.
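For illustration, here is a minimal Python sketch that performs the load itself with writeDisposition set to WRITE_TRUNCATE, so the table is overwritten on each run; the bucket, dataset and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrite instead of append
)
client.load_table_from_uri(
    "gs://my-bucket/daily/my_table.csv",   # placeholder CSV in GCS
    "mydataset.my_table",                  # placeholder destination table
    job_config=job_config,
).result()

This script could then be run on your own schedule (for example from cron, or a Cloud Function plus Cloud Scheduler) in place of the Transfer.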

How to take the backup of a dataset in BigQuery?

We want to create a backup copy of a BigQuery dataset in case a table is accidentally dropped, as it is only recoverable within 7 days.
Is there a way to extend the duration of the recovery period? If not, how can we create a backup of a dataset with a retention period of 30 days in BigQuery?
It is currently not possible to extend the duration of the recovery period. A feature request for the ability to extend the duration of the recovery period has already been created as commented by Katayoon.
Here is a public link to monitor the progress on that issue: https://issuetracker.google.com/120038872
To back up datasets in BigQuery you could either make copies of your dataset or, as a more workable solution, export the data to Cloud Storage so you can import it back at a later time. Cloud Storage allows you to set a retention period and a lifecycle policy, which together will make sure that the data stays undisturbed for the desired amount of time and that it removes itself after a given time, should you wish to save on storage costs.
As for how to export data from BigQuery:
You can export tables as Avro, JSON or CSV files to Cloud Storage via the web UI, the command line, or the API using various languages like C#, Go, Python and Java, as long as the dataset and the bucket are in the same location. There are other limitations to exporting a table, such as file size, integer encoding, data compression, etc.
Link to table export and limitations:
https://cloud.google.com/bigquery/docs/exporting-data
You can find the instructions on the procedures here:
Retention Policies and Bucket Lock: https://cloud.google.com/storage/docs/using-bucket-lock#lock-bucket
Object Lifecycle Management:
https://cloud.google.com/storage/docs/managing-lifecycles
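As a rough illustration of the retention and lifecycle settings mentioned above, here is a short Python sketch with the google-cloud-storage client; the bucket name and the 30-day figures are placeholders:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-backup-bucket")   # placeholder bucket

# Keep every object for at least 30 days (retention policy).
bucket.retention_period = 30 * 24 * 60 * 60      # seconds

# Delete objects automatically once they are 30 days old (lifecycle rule).
bucket.add_lifecycle_delete_rule(age=30)

bucket.patch()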
Loading data into BigQuery can be done using various file formats, such as CSV, JSON, Avro, Parquet, or ORC. At the moment you can load directly only from local storage or from Google Cloud Storage. More on loading data, file formats, data sources and limitations can be found by following the link: https://cloud.google.com/bigquery/docs/loading-data
More information on
Exporting tables: https://cloud.google.com/bigquery/docs/exporting-data
Export limitations: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
Loading data into BigQuery: https://cloud.google.com/bigquery/docs/loading-data
Wildcards: https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames
Merging the file: https://cloud.google.com/storage/docs/gsutil/commands/compose
You can take a snapshot of a table using either SQL or CLI:
SQL
CREATE SNAPSHOT TABLE
myproject.library_backup.books
CLONE myproject.library.books
OPTIONS(expiration_timestamp = TIMESTAMP "2022-04-27 12:30:00.00-08:00")
CLI
bq cp --snapshot --no_clobber --expiration=86400 library.books library_backup.books
You can back up and restore using the tools in https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/tree/master/blogs/bigquery_backup:
Back up a table to GCS:
./bq_backup.py --input dataset.tablename --output gs://BUCKET/backup
This saves a schema.json, a tabledef.json, and extracted data in AVRO format to GCS.
You can also back up all the tables in a dataset:
./bq_backup.py --input dataset --output gs://BUCKET/backup
Restore tables one by one by specifying a destination dataset:
./bq_restore.py --input gs://BUCKET/backup/fromdataset/fromtable --output destdataset
For views, the backup stores the view definition and the restore creates a view.

Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud Storage to store data, and several times throughout the day data is uploaded there as CSVs. I use BigQuery to run jobs that populate BigQuery tables with that data.
Because of reasons beyond my control, the CSVs have duplicate data. So what I want to do is basically create a daily ETL to append all new data to the existing tables, perhaps running at 1 am every day:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to convert them to a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Insert the de-duped temp table into the BigQuery table.
Delete the temp table
So basically I'm stuck at square 1 - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, and there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer so if there's a more simplistic way to accomplish this that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline that reads from GCS (source) and writes (appends) to BigQuery (sink).
Your pipeline can remove duplicates directly itself. See here and here.
Create a cloud function to monitor your GCS bucket.
When a new file arrives, your Cloud Function is triggered automatically and launches your Dataflow pipeline, which reads the new file, dedupes it, and writes the results to BigQuery.
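Here is a hedged Python sketch of that Cloud Function, assuming a templated pipeline already staged in GCS and an "inputFile" template parameter; the project, region, template path and parameter name are placeholders:

from googleapiclient.discovery import build

PROJECT = "my-project"                              # placeholder
REGION = "us-central1"                              # placeholder
TEMPLATE = "gs://my-bucket/templates/dedupe_to_bq"  # placeholder staged Dataflow template

def on_new_csv(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize."""
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            "jobName": "dedupe-load-" + context.event_id,
            "parameters": {
                # Assumed template parameter: the GCS path of the file that just arrived.
                "inputFile": "gs://{}/{}".format(event["bucket"], event["name"]),
            },
        },
    ).execute()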
So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar=< C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load ^
--skip_leading_rows=1 ^
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 ^
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz ^
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VB script called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable, which is used to search for yesterday's data files in GCS and load them into a table, from which a de-duped (via grouping by all columns) table is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are there because it's saved as a .CMD file and run through Windows Task Scheduler.