Loading data from BigQuery to a Google Cloud Storage bucket in CSV file format - google-bigquery

I produce a dataset in BigQuery on a daily basis which I need to export to my Google Cloud Storage bucket. The dataset is greater than 10 MB, which means I'm unable to use Apps Script.
Essentially, I'd like to automate a data load using my BigQuery script that exports the dataset as a CSV file to Google Cloud Storage.
Can anyone point me in the right direction in terms of which program/method to use? Please also share your experiences.
Thanks

Here you can find some details on how to export data from BigQuery to Cloud Storage, along with a sample written in Python:
https://cloud.google.com/bigquery/exporting-data-from-bigquery
You can implement a simple application running on App Engine that contains a cron job scheduled to run once a day and performs the steps described in the tutorial above:
https://cloud.google.com/appengine/docs/python/config/cron
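For reference, a minimal sketch of the export step using the google-cloud-bigquery Python client; the project, dataset, table and bucket names below are placeholders you would replace with your own:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.CSV
job_config.compression = bigquery.Compression.GZIP

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",                 # placeholder table produced by the daily script
    "gs://my-bucket/daily_export/my_table-*.csv.gz",  # wildcard splits large exports into shards
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish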

Related

Load batch CSV Files from Cloud Storage to BigQuery and append on same table

I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files are dumped into the GCS bucket every hour in CSV format.
I would like to load all the CSV files from Cloud Storage into BigQuery, with a scheduling option to load the most recent files from Cloud Storage and append the data to the same table in BigQuery.
Please help me set this up.
There are many options, but I will present only two:
You can do nothing and use an external table in BigQuery. That means you leave the data in Cloud Storage and ask BigQuery to read the data directly from Cloud Storage. You don't duplicate the data (and pay less for storage), but queries are slower (the data has to be loaded from less performant storage and the CSV parsed on the fly) and every query processes all the files. You also can't use advanced BigQuery features such as partitioning, clustering and others.
Perform a BigQuery load job to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For new files, forget the old-school scheduled ingestion process. With the cloud, you can be event-driven: catch the event that signals a new file on Cloud Storage and load it directly into BigQuery. You have to write a small Cloud Function for that, but it's the most efficient and most recommended pattern. You can find a code sample here, and a sketch is included below.
Just a warning on the latter solution: you can perform "only" 1,500 load jobs per day and per table (about one per minute).
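As a rough illustration of that event-driven pattern, here is a minimal sketch of a Python Cloud Function triggered by a Cloud Storage finalize event; the destination table ID is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_table"  # placeholder destination table

def load_csv_to_bq(event, context):
    # Background Cloud Function: 'event' describes the new object in the bucket
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # assumes the CSV files have a header row
        autodetect=True,       # or provide an explicit schema instead
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # wait so errors surface in the function logs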

Send Bigquery Data to rest endpoint

I want to send data from BigQuery (about 500K rows) to a custom endpoint via the POST method. How can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and the max execution time pops up).
I was looking at Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send data via POST.
Do you know another option?
As mentioned in the comments, 500K rows is far too much data to send through a single POST request.
Dataflow is a product oriented toward pipeline development, intended to run several data transformations within its jobs. You can use BigQueryIO (with Python sample code), but if you just need to migrate the data to a certain machine/endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the size of the data you are trying to retrieve is less than 1 GB, you can export to a GCS bucket from the command-line interface like this: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data to multiple files using a wildcard URI to define your bucket destination, as indicated ('gs://my-bucket/file-name-*.json').
And finally, using the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION], you can download the data from your bucket.
Note: there are more ways to do this, covered in the Cloud documentation links provided, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.
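If you prefer an API over gsutil for the download step, here is a minimal sketch using the google-cloud-storage Python client; the bucket name and prefix are placeholders:

from google.cloud import storage

client = storage.Client()
# Download every exported shard under the given prefix to a local directory
for blob in client.list_blobs("example-bucket", prefix="exports/"):
    blob.download_to_filename("/tmp/" + blob.name.split("/")[-1])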

Load a huge data from BigQuery to python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution.
I have a very large table in BigQuery (assume 20 million rows inserted per day). I want to have around 20 million rows of data with around 50 columns in Python/pandas/Dask to do some analysis. I have tried using bqclient, pandas-gbq and the BQ Storage API methods, but it takes 30 minutes to get 5 million rows into Python. Is there any other way to do this? Is there any Google service available to do a similar job?
Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv")
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for BigQuery to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it do you use? Is it the parsing of that data into memory?
Since you can make SQLAlchemy support BigQuery (https://github.com/mxmzdlv/pybigquery), you could try to use dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. In case BigQuery is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
Experiment!
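As a rough sketch of the read_sql_table idea above, assuming pybigquery is installed so SQLAlchemy understands bigquery:// connection strings (the project, dataset, table and index column are placeholders):

import dask.dataframe as dd

# Split the table into partitions on a sortable column so they can be read in parallel
df = dd.read_sql_table(
    "my_table",
    "bigquery://my-project/my_dataset",  # SQLAlchemy URI handled by pybigquery
    index_col="id",                      # placeholder column used to split partitions
    npartitions=50,
)
result = df.groupby("some_column").size().compute()  # placeholder computation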
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting a (much smaller) table to pandas; a sketch of this follows after these options.
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery dataset. That way, network overhead is minimized.
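A minimal sketch of the first option, pushing the aggregation into BigQuery SQL and pulling only the much smaller result into pandas; the table and column names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT user_id, COUNT(*) AS events, AVG(value) AS avg_value
    FROM `my-project.my_dataset.my_table`
    GROUP BY user_id
"""
df = client.query(sql).to_dataframe()  # only the aggregated rows cross the network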
Probably you want to export the data to Google Cloud Storage first, and then download the data to your local machine and load it.
Here are the steps you need to take (a sketch of the first two steps follows below):
Create an intermediate table which will contain the data you want to export. You can do a SELECT and store the results in the intermediate table.
Export the intermediate table to Google Cloud Storage in JSON/Avro/Parquet format.
Download your exported data and load it into your Python app.
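As a sketch of the first two steps with the google-cloud-bigquery Python client (all table, dataset and bucket names are placeholders; Parquet is used here because it loads quickly into pandas):

from google.cloud import bigquery

client = bigquery.Client()

# Step 1: materialize the rows you need into an intermediate table
query_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.intermediate_table"  # placeholder
)
client.query(
    "SELECT col_a, col_b FROM `my-project.my_dataset.big_table`",  # placeholder SELECT
    job_config=query_config,
).result()

# Step 2: export the intermediate table to Cloud Storage as Parquet
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET
)
client.extract_table(
    "my-project.my_dataset.intermediate_table",
    "gs://my-bucket/export/part-*.parquet",
    job_config=extract_config,
).result()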
Besides downloading the data to your local machine, you can leverage PySpark and Spark SQL for the processing. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster, load the data from Google Cloud Storage into Spark, and do the analysis there; a small PySpark sketch is included further below.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up Jupyter Notebook in the Dataproc cluster
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
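For the Spark part, a minimal PySpark sketch that reads the exported Parquet files straight from Cloud Storage on a Dataproc cluster; the bucket path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-export-analysis").getOrCreate()

# Dataproc clusters can read gs:// paths directly through the GCS connector
df = spark.read.parquet("gs://my-bucket/export/part-*.parquet")  # placeholder path
df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()  # placeholder analysis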
Hope this helps.
A couple years late, but we're developing a new dask_bigquery library to help easily move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!
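If you want to try it, a minimal sketch using dask_bigquery's read_gbq; the project, dataset and table IDs are placeholders, and the library's README has the current API:

import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="my-project",   # placeholder
    dataset_id="my_dataset",   # placeholder
    table_id="my_table",       # placeholder
)
print(ddf.head())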

Google BigQuery: Batch load through API

First of all, is it possible to do a batch load into Google BigQuery through its API? If yes, then how much does it cost?
I don't want to go for streaming inserts because they cost a bit more compared to batch loads.
I will be inserting a million rows each day. I will be using Python to use 'patch' or 'update' in the API.
You can load data:
From Google Cloud Storage
From other Google services, such as DoubleClick and Google AdWords
From a readable data source (such as your local machine)
By inserting individual records using streaming inserts
Using DML statements to perform bulk inserts
Using a Google Cloud Dataflow pipeline to write data to BigQuery
What you are looking for is the third option; on that page you will find a lot of examples in different programming languages. Loading data from a local file or from GCS is free (a minimal sketch follows after the format list).
Your data can be in any of the following formats:
Comma-separated values (CSV)
JSON (newline-delimited)
Avro
Parquet
ORC
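A minimal sketch of such a batch load with the google-cloud-bigquery Python client, reading a local CSV file (the file path and table ID are placeholders); this runs as a free load job rather than streaming inserts:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes a header row
    autodetect=True,       # or pass an explicit schema
)

with open("daily_rows.csv", "rb") as f:  # placeholder local file
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # one batch load job for the day's million rows
print(client.get_table(table_id).num_rows)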
I think this is what you are looking for: Batch Queries for Python.
Also, here is the GitHub repository for Python and BigQuery, where you can find the snippet that appears in the documentation under snippets.py.
You can find the BigQuery pricing and a pricing calculator in the official documentation.

Best way to migrate large amount of data from US dataset to EU dataset in BigQuery?

I have many TBs in about 1 million tables in a single BigQuery project hosted in multiple datasets that are located in the US. I need to move all of this data to datasets hosted in the EU. What is my best option for doing so?
I'd export the tables to Google Cloud Storage and reimport using load jobs, but there's a 10K limit on load jobs per project per day
I'd do it as queries w/"allow large results" and save to a destination table, but that doesn't work cross-region
The only option I see right now is to reinsert all of the data using the BQ streaming API, which would be cost prohibitive.
What's the best way to move a large volume of data in many tables cross-region in BigQuery?
You have a couple of options:
Use load jobs, and contact Google Cloud Support to ask for a quota exception. They're likely to grant 100k or so on a temporary basis (if not, contact me, tigani#google, and I can do so).
Use federated query jobs. That is, move the data into a GCS bucket in the EU, then re-import the data via BigQuery queries with GCS data sources (a sketch of this approach follows below). More info here.
I'll also look into whether we can increase this quota limit across the board.
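A rough sketch of the second option with the Python client: define the exported GCS files as an external (federated) data source, then query it into a destination table in the EU dataset. All names and URIs are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Exported CSV shards sitting in an EU bucket (placeholder URI)
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-eu-bucket/export/table1-*.csv"]
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={"staging": external_config},
    destination="my-project.eu_dataset.table1",  # placeholder table in the EU dataset
)
client.query("SELECT * FROM staging", job_config=job_config).result()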
You can copy a dataset using BigQuery Copy Dataset (in-region or cross-region). The copy dataset UI is similar to copy table: just click the "copy dataset" button on the source dataset and specify the destination dataset in the pop-up form. Check out the public documentation for more use cases.
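The same cross-region dataset copy can also be scripted through the BigQuery Data Transfer Service. A rough sketch with the Python client, assuming the google-cloud-bigquery-datatransfer package; all project and dataset IDs are placeholders:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="eu_dataset",       # placeholder EU dataset
    display_name="US to EU dataset copy",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": "my-project",     # placeholder
        "source_dataset_id": "us_dataset",     # placeholder
    },
)
transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)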
A few other options that are now available since Jordan answered a few years ago. These options might be useful for some folks:
Use Cloud Composer to orchestrate the export and load via GCS buckets. See here.
Use Cloud Dataflow to orchestrate the export and load via GCS buckets. See here.
Disclaimer: I wrote the article for the 2nd option (using Cloud Dataflow).