I read other similar threads and searched Google to find a better way but couldn't find any workable solution.
I have a very large table in BigQuery (assume around 20 million rows inserted per day). I want to pull around 20 million rows of data, with around 50 columns, into Python/pandas/Dask to do some analysis. I have tried the bqclient, pandas-gbq and BQ Storage API methods, but it takes 30 minutes to get 5 million rows into Python. Is there any other way to do this? Is there any Google service available that can do a similar job?
Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv")
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for BigQuery to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it do you use? Is it parsing that data into memory?
Since you can make SQLAlchemy support BigQuery ( https://github.com/mxmzdlv/pybigquery ), you could try using dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel, as sketched below. If BigQuery is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
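A minimal sketch of that idea, assuming the pybigquery dialect is installed (pip install pybigquery) and using placeholder names (my-project, my_dataset, big_table, and an indexed id column) that you would replace with your own:

import dask.dataframe as dd

# Read the BigQuery table through the pybigquery SQLAlchemy dialect,
# splitting it into partitions that Dask can fetch and process in parallel.
df = dd.read_sql_table(
    "big_table",                          # placeholder table name
    "bigquery://my-project/my_dataset",   # SQLAlchemy URI handled by pybigquery
    index_col="id",                       # placeholder indexed column used to partition
    npartitions=32,                       # number of partitions to read in parallel
)

# Example analysis: aggregations run per partition and are combined lazily.
counts = df.groupby("some_column").size().compute()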
Experiment!
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting (a smaller table) to Pandas, as sketched after this list.
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery dataset. That way, network overhead is minimized.
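For the first option, here is a minimal sketch using the google-cloud-bigquery client; the project, dataset, table, and column names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Push the heavy lifting (grouping, filtering) into BigQuery so that only a
# small aggregated result is pulled into pandas.
sql = """
    SELECT user_id, DATE(event_ts) AS day, COUNT(*) AS n_events, AVG(value) AS avg_value
    FROM `my-project.my_dataset.big_table`
    GROUP BY user_id, day
"""
small_df = client.query(sql).to_dataframe()  # far fewer rows than the raw table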
Probably you want to export the data to Google Cloud Storage first, and then download the data to your local machine and load it.
Here are the steps you need to take:
Create an intermediate table which will contain the data you want to export. You can do a SELECT and store the result in the intermediate table.
Export the intermediate table to Google Cloud Storage, in JSON/Avro/Parquet format.
Download your exported data and load it into your Python app.
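A hedged sketch of those three steps with the google-cloud-bigquery client; the project, dataset, table, bucket, and column names are placeholders, and the export format can equally be JSON or Parquet:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Materialize only the rows/columns you need into an intermediate table.
client.query("""
    CREATE OR REPLACE TABLE my_dataset.export_tmp AS
    SELECT col_a, col_b, col_c
    FROM my_dataset.big_table
    WHERE event_date >= '2020-01-01'
""").result()

# 2. Export the intermediate table to Google Cloud Storage as Avro.
extract_job = client.extract_table(
    "my-project.my_dataset.export_tmp",
    "gs://my-bucket/export/part-*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)
extract_job.result()

# 3. Download the exported files (e.g. gsutil -m cp gs://my-bucket/export/* .)
#    and load them into your Python app.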
Besides downloading the data to your local machine, you can leverage the processing using PySpark and SparkSQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster and load the data from Google Cloud Storage to Spark, and do analysis there.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up Jupyter Notebook in the Dataproc cluster
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
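As a rough illustration of that route (not the exact tutorial code), assuming a Dataproc cluster with the spark-bigquery connector available and placeholder project/dataset/table names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-analysis").getOrCreate()

# Read the BigQuery table straight into a Spark DataFrame via the connector.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.big_table")
    .load()
)

# Run SQL-style analysis on the cluster instead of pulling all rows locally.
df.createOrReplaceTempView("big_table")
spark.sql("SELECT col_a, COUNT(*) AS n FROM big_table GROUP BY col_a").show()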
Hope this helps.
A couple years late, but we're developing a new dask_bigquery library to help easily move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!
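For reference, a minimal sketch of what usage looks like (pip install dask-bigquery); the project, dataset, and table names are placeholders:

import dask_bigquery

# Read a BigQuery table into a Dask DataFrame, one partition per read stream.
ddf = dask_bigquery.read_gbq(
    project_id="my-project",
    dataset_id="my_dataset",
    table_id="big_table",
)
print(ddf.head())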
I am quite puzzled by the BigQuery connector in Spotfire. It is taking an extremely long time to import my dataset in-memory.
My configuration: Spotfire on an AWS Windows instance (8 vCPU, 32 GB RAM); dataset of 50 GB, >100M rows, in BigQuery.
Yes, I should use in-database mode for such a large dataset, push the queries to BigQuery and use Spotfire only for display, but that is not my question today 😋
Today I am trying to understand how the import works and why it is taking so long. This import job started 21 hours ago and it is still not finished. The resources of the server are barely used (CPU, disk, network).
Testing done:
I tried importing data from Redshift and it was much faster (14 min for 22 GB).
I checked the resources used during import: network speed (Redshift ~370 Mbps, BQ ~8 Mbps for 30 min), CPU (Redshift ~25%, BQ <5%), RAM (Redshift & BQ ~27 GB), disk write (Redshift ~30 MB/s, BQ ~5 MB/s).
I really don't understand what Spotfire is actually doing all this time while importing the dataset from BQ into memory. There seems to be no use of server resources, and there is no indication of status apart from the time it has been running.
Do any Spotfire experts have insights into what's happening? Is the BigQuery connector actually not meant to be used for in-memory analysis, and what is the actual limiting factor in the implementation?
Thanks!
😇
We had an issue which is fixed in the Spotfire versions below:
TS 10.10.3 LTS HF-014
TS 11.2.0 HF-002
Please also vote and comment on the idea of using the Storage API when extracting data from BigQuery:
https://ideas.tibco.com/ideas/TS-I-7890
Thanks,
Thomas Blomberg
Senior Product Manager TIBCO Spotfire
@Tarik, you need to install the above hotfix at your end.
You can download the latest hotfix from the link: https://community.tibco.com/wiki/list-hotfixes-tibco-spotfire-clients-analyst-web-player-consumerbusiness-author-and-automation
An update after more testing. Thanks to @Thomas and @Manoj for their very helpful support. Here are the results:
I updated the Spotfire version to 11.2.0 HF-002 and it fixed the issue with bringing data in-memory from BigQuery 👌 Using (Data > Add Data...), the data throughput was still very low, though: ~13 min/GB. The network throughput came in bursts of 8 Mbps.
As suggested in TIBCO Ideas by Thomas, I installed the Simba JDBC driver and the data throughput improved dramatically to ~50 s/GB using (Data > Information Designer). The issue, of course, is that you need access to the server to install it. The network throughput was roughly 200 Mbps. I am not sure what the limiting factor is (Spotfire config, Simba driver or BigQuery).
Using the Redshift connector to a Redshift cluster with the same data and connecting using (Data > Information Designer), I get a data import throughput of ~30 s/GB with a network throughput of 380 Mbps.
So my recommendation is to use the latest Simba driver along with the Information Designer to get the best "in-memory" data import throughput when connecting to a medium-size dataset in BigQuery (10-30 GB). This gives a data import throughput of about 1 min/GB.
It's not clear what makes the Redshift connection faster, though, or whether there is a faster method to import data from GCP/BigQuery into Spotfire 🤷♂️
Any comments or suggestions are welcome!
Tarik
I want to send data from BigQuery (about 500K rows) to a custom endpoint via the POST method. How can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and hits the max execution time).
I was looking at Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send data via POST.
Do you know another option?
As mentioned in the comments, 500K rows is far too much data for a POST method to be a viable option.
Dataflow is a product oriented toward pipeline development, intended to run several data transformations within its jobs. You can use BigQueryIO (with Python sample code), but if you just need to migrate the data to a certain machine/endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the size of the data you are trying to retrieve is less than 1 GB, you can export to a GCS bucket from the command-line interface like: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data into multiple files using a wildcard URI, defining your bucket destination as indicated ('gs://my-bucket/file-name-*.json').
Finally, using the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION] you can download the data from your bucket.
Note: there are more ways to do this, described in the Cloud documentation links provided, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.
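If you prefer to stay in Python for the download step, a hedged sketch with the google-cloud-storage client (bucket name, prefix, and local directory are placeholders):

import os
from google.cloud import storage

client = storage.Client()

# Download every exported object under the given prefix to a local directory.
os.makedirs("exported", exist_ok=True)
for blob in client.list_blobs("example-bucket", prefix="file-name-"):
    blob.download_to_filename(os.path.join("exported", os.path.basename(blob.name)))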
I am trying to migrate from Redshift to BigQuery. The table size is 2 TB+.
I am using the BigQuery Redshift data transfer service.
But the migration has been running for more than 5 hours.
I also see that the queries executed on the Redshift end unload data in 50 MB chunks, and there is no way to configure a chunk size parameter in the Redshift transfer job.
Is this much time to transfer 2 TB of data from Redshift to BigQuery expected, or can something be done to improve this job?
Some systems, like Snowflake, manage this in just 2-3 hours from Redshift to their end.
The BigQuery Redshift data transfer service is built on top of the Google Cloud Storage Transfer Service. The end-to-end data movement involves:
1. Extract data from Redshift cluster to S3
2. Move data from S3 to GCS
3. Load data from GCS into BQ
While the 2nd and 3rd steps are quick, the first step is actually limited by the Redshift cluster itself, since it's the Redshift cluster that executes the UNLOAD command.
Some options to make this faster:
1. Upgrade to a more powerful cluster.
2. Use Redshift workload management (https://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) to give the migration account (the one provided to the BigQuery Redshift data transfer service) better priority and more resources to run the UNLOAD command.
I don't have experience with the Redshift data transfer service, but I have used the Google Cloud Storage Transfer Service (available here) and in my experience it's very scalable. It should transfer 2 TB of data in under an hour. If you've got millions of smallish files to transfer, it might take a couple of hours, but it should still work.
Once you've got the data in Google Cloud Storage, you can either import it into BigQuery or create a federated table that scans over the data in Google Cloud Storage.
I have many TBs in about 1 million tables in a single BigQuery project hosted in multiple datasets that are located in the US. I need to move all of this data to datasets hosted in the EU. What is my best option for doing so?
I'd export the tables to Google Cloud Storage and reimport using load jobs, but there's a 10K limit on load jobs per project per day
I'd do it as queries w/"allow large results" and save to a destination table, but that doesn't work cross-region
The only option I see right now is to reinsert all of the data using the BQ streaming API, which would be cost prohibitive.
What's the best way to move a large volume of data in many tables cross-region in BigQuery?
You have a couple of options:
Use load jobs, and contact Google Cloud Support to ask for a quota exception. They're likely to grant 100k or so on a temporary basis (if not, contact me, tigani#google, and I can do so).
Use federated query jobs. That is, move the data into a GCS bucket in the EU, then re-import the data via BigQuery queries with GCS data sources (see the sketch after this answer). More info here.
I'll also look into whether we can increase this quota limit across the board.
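A hedged sketch of the second option with the google-cloud-bigquery Python client, assuming the exported Avro files are already in an EU bucket; the project, dataset, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Define an external ("federated") table over the files in the EU bucket.
external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://my-eu-bucket/export/table-*.avro"]

# Materialize the external data into a destination table in the EU dataset.
job_config = bigquery.QueryJobConfig(
    table_definitions={"staged": external_config},
    destination=bigquery.TableReference.from_string("my-project.my_eu_dataset.my_table"),
)
client.query("SELECT * FROM staged", job_config=job_config).result()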
You can copy a dataset using BigQuery Copy Dataset (in-region or cross-region). The copy dataset UI is similar to copy table: just click the "copy dataset" button on the source dataset and specify the destination dataset in the pop-up form. Check out the public documentation for more use cases.
A few other options that are now available since Jordan answered a few years ago. These options might be useful for some folks:
Use Cloud Composer to orchestrate the export and load via GCS buckets. See here.
Use Cloud Dataflow to orchestrate the export and load via GCS buckets. See here.
Disclaimer: I wrote the article for the 2nd option (using Cloud Dataflow).
I run a dataset in BigQuery on a daily basis which I need to export to my Google Storage bucket. The dataset is greater than 10 MB, which means I'm unable to use Apps Script.
Essentially, I'd like to automate a data load using my BigQuery script which exports the dataset as a CSV file to Google Storage.
Can anyone point me in the right direction in terms of which programme/method to use? Please also share your experiences.
Thanks
Here you can find some details on how to export data from BigQuery to Cloud Storage along with a sample written in Python.
https://cloud.google.com/bigquery/exporting-data-from-bigquery
You can implement a simple application running on App Engine that contains a cron job scheduled to run once a day and performs the steps described in the tutorial above.
https://cloud.google.com/appengine/docs/python/config/cron
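As a rough sketch, the daily export itself could look like this with the google-cloud-bigquery client (project, dataset, table, and bucket names are placeholders); on App Engine you would expose this as a handler that a cron.yaml entry such as "schedule: every 24 hours" invokes:

from google.cloud import bigquery

def export_daily_csv():
    client = bigquery.Client()
    job = client.extract_table(
        "my-project.my_dataset.my_table",
        "gs://my-bucket/daily/my_table-*.csv.gz",   # wildcard allows exports larger than 1 GB
        job_config=bigquery.ExtractJobConfig(
            destination_format="CSV",
            compression="GZIP",
        ),
    )
    job.result()  # wait for the export job to finish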