Bigquery Redshift migration of 2 TB+ size table - google-bigquery

I am trying to migrate Redshift to BigQuery . The table size is 2TB+
I am using bigquery redshift data transfer service.
But the migration is running for more than 5 hours.
Also also see that the queries that gets executed at the Redshift end unloads data into 50 MB chunks. As there there is no way to configure chunk size parameter in Redshift transfer job.
It this much time to transfer 2TB of data from redshfit to BigQuery is expected or something can be done to improve this job.
There are some system like snowflake in just 2-3 hours from Redshift to their end.

Bigquery redshift data transfer service is built on top of the Google Cloud Storage Transfer Service. The end to end data movement involves:
1. Extract data from Redshift cluster to S3
2. Move data from S3 to GCS
3. Load data from GCS into BQ
While the 2nd and 3rd steps are quick, the first step is actually limited by the Redshift cluster itself, since it's the Redshift Cluster which executes the UNLOAD command.
Some options to make this faster can be:
1. Upgrade to a powerful cluster.
2. Do Redshift workload management (https://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) to give Migration Account (The one provided to Bigquery redshift data transfer service) better priority and resources to run the UNLOAD command.

I don't have experience with the redshift data transfer service, but I have used the Google Cloud Storage Transfer Service (available here) and in my experience it's very scalable. It should transfer 2TB of data in under an hour. If you've got millions of smallish files to transfer it might take a couple hours but it should still work.
Once you've got the data in google cloud storage you can either import it into BigQuery or create a federated table that scans over the data in google cloud storage.

Related

Load batch CSV Files from Cloud Storage to BigQuery and append on same table

I am new to GCP and recently created a bucket on Google Cloud Storage. RAW files are dumping every hour on GCS bucket in every hour in CSV format.
I would like to load all the CSV files from Cloud storage to BigQuery and there will be a scheduling option to load the recent files from Cloud Storage and append the data to the same table on BigQuery.
Please help me to setup this.
There is many options. But I will present only 2:
You can do nothing and use external table in BigQuery, that means you let the data in Cloud Storage and ask BigQuery to request the data directly from Cloud Storage. You don't duplicate the data (and pay less for storage), but the query are slower (need to load the data from a less performant storage and to parse, on the fly, the CSV) and you process all the file for all queries. You can't use BigQuery advanced feature such as partitioning, clustering and others...
Perform a BigQuery load operation to load all the existing file in a BigQuery table (I recommend to partition the table if you can). For the new file, forget the old school scheduled ingestion process. With cloud, you can be event driven. Catch the event that notify a new file on Cloud Storage and load it directly in BigQuery. You have to write a small Cloud Functions for that, but it's the most efficient and the most recommended pattern. You can find code sample here
Just a warning on the latest solution, you can perform "only" 1500 load job per day and per table (about 1 per minute)

Is there (still) an advantage to staging data on Google Cloud Storage before loading into BigQuery?

I have a data set stored as a local file (~100 GB uncompressed JSON, could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest to first upload this data to Google Cloud Storage before loading it from there into BigQuery.
Is there an advantage in doing this, over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know if those are used when using bq load. The only limitation I could find that still holds true is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
Yes, having data in Cloud Storage is a big advantage during development. In my cases I often create a BigQuery table from data in the Cloud Storage multiple times till I tune up all things like schema, model, partitioning, resolving errors etc. It would be really time consuming to upload data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
possible to remove BQ table when not used and import it when needed (BQ table is much bigger than plain maybe compressed data in Cloud Storage)
you save your local storage
less likely fail during table creation (from local storage there could be networking issues, computer issues etc.)
Cons
you pay some additional cost for storage (in the case you do not plan to touch your data often e.g. once per month - you can decrease price to use the nearline storage)
So I would go for storing data to the Cloud Storage first but of course, it depends on your use case.

Cost effective BigQuery loading data from s3

I have (2 TB) in 20k files in s3 created over the course of the every day that I need to load to BigQuery to date partition table. Files are rolled over every 5 mins.
What is the most cost effective way to get data to BigQuery?
I am looking for cost optimization in both AWS s3 to GCP network egress and actual data loading.
Late 2020 update: you could consider using BigQuery Omni in order to not have to move your data from S3 and still have the BigQuery capabilities you're looking for.
(disclaimer: I'm not affiliated in any way to Google, I just find it remarkable that they've started providing multi-cloud support thanks to Anthos. I hope the other cloud providers will follow suit...)
Google cloud in beta supports a BigQuery Transfer service for S3. Details mentioned here. The other mechanism to use S3 -> GCS -> BigQuery mechanism, which i believe will incur the GCS cost too
As per Google Cloud's pricing docs, it says "no charge" from GC PoV with limits applicable.
For data transfer from S3 to Google CLoud over Internet(i am assuming its not over VPN) is mentioned here . Your data is around 2TB, so the cost as per the table will be $0.09 per GB
There is several way for optimizing the transfer and the load.
First of all, the network egress from AWS can't be avoided. If you can, gzip your file before storing them into S3. You will reduce the egress bandwidth and BigQuery can load compressed files.
If your workload that write to S3 can't gzip the file, you have to perform a comparison between the processing time for gzipping the file and the egress cost of not gzipped file.
For GCS, we often speak about cost in GB/month. It's a mistake. When you look at the billing in BigQuery the cost is calculated in Gb/seconds. By the ways, less you let your file on storage, less you pay. By the way if you load your file quickly after the transfert and the load into BigQuery, you will pay almost nothing.
BigQuery data ingestion
You have a few options to get your s3 data ingested to BigQuery, all depending on how quickly do you need your data available in BigQuery. Also, any requirements for any data transformation (enrichment, deduplication, aggregation) should be taken into consideration to the overall cost.
The fastest way to get data to BigQuery is streaming API (within the seconds delay), which comes with $0.010 per 200 MB charge. Streaming API Pricing
BigQuery Transfer service is another choice that is the easiest and free of charge. It allows you to schedule data transfer to run it no more than once a day (currently). In your case, where data is continuously produced, that would be the slowest method to get data to BigQuery.
Transfer Service Pricing
If you need complex transformation, you may also consider Cloud Dataflow, which is not free of charge. Cloud Dataflow Pricing
Lastly, you may also consider a serverless solution, which is fully event-driven, allowing you data ingestion in close to real-time. With this, you would pay for lambda and cloud function execution, which should be around a few dollars per day plus egress cost.
For data mirroring between AWS S3 and Google Cloud Storage, you could use serverless Cloud Storage Mirror, which comes with payload size optimization with either data compression or dynamic AVRO transcoding.
For getting data loaded to BigQuery, you can use serverless BqTail, which allows you to run loads in batches. To not exceed 1K loads BigQuery quota per table per day, you could comfortably use 90-sec batch window, which would get your data loaded to BigQuery within a few minute's delays in the worst-case scenario. Optionally you can also run data deduplication, data enrichment, and aggregation.
Egress cost consideration
In your scenario, when transfer size is relatively small, 2 TB per day, I would accept egress cost; however, if you expect to grow to 40TB+ per day, you may consider using direct connect to GCP. With a simple proxy, that should come with substantial cost reduction.

Load a huge data from BigQuery to python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution.
I have a large large table in BigQuery (assume inserting 20 million rows per day). I want to have around 20 million rows of data with around 50 columns in python/pandas/dask to do some analysis. I have tried using bqclient, panda-gbq and bq storage API methods but it takes 30 min to have 5 millions rows in python. Is there any other way to do so? Even any Google service available to do similar job?
Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv")
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for big-query to process your query? Is it the download of data> What is your bandwidth, what fraction do you use? Is it parsing of that data into memory?
Since you can make SQLAlchemy support big-query ( https://github.com/mxmzdlv/pybigquery ), you could try to use dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. In case big-query is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
Experiment!
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting (a smaller table) to
Pandas.
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery
dataset. That way, network overhead is minimized.
Probably you want to export the data to Google Cloud Storage first, and then download the data to your local machine and load it.
Here are the steps you need to take:
Create an intermediate table which will contain the data you want to
export. You can do select and store to the intermediate table.
Export the intermediate table to Google Cloud Storage, to JSON/Avro/Parquet format.
Download your exported data and load to your python app.
Besides downloading the data to your local machine, you can leverage the processing using PySpark and SparkSQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster and load the data from Google Cloud Storage to Spark, and do analysis there.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up Jupyter Notebook in the Dataproc cluster
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
Hope this helps.
A couple years late, but we're developing a new dask_bigquery library to help easily move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!

Best way to migrate large amount of data from US dataset to EU dataset in BigQuery?

I have many TBs in about 1 million tables in a single BigQuery project hosted in multiple datasets that are located in the US. I need to move all of this data to datasets hosted in the EU. What is my best option for doing so?
I'd export the tables to Google Cloud Storage and reimport using load jobs, but there's a 10K limit on load jobs per project per day
I'd do it as queries w/"allow large results" and save to a destination table, but that doesn't work cross-region
The only option I see right now is to reinsert all of the data using the BQ streaming API, which would be cost prohibitive.
What's the best way to move a large volume of data in many tables cross-region in BigQuery?
You have a couple of options:
Use load jobs, and contact Google Cloud Support to ask for a quota exception. They're likely to grant 100k or so on a temporary basis (if not, contact me, tigani#google, and I can do so).
Use federated query jobs. That is, move the data into a GCS bucket in the EU, then re-import the data via BigQuery queries with GCS data sources. More info here.
I'll also look into whether we can increase this quota limit across the board.
You can copy dataset using BigQuery Copy Dataset (in/cross-region). The copy dataset UI is similar to copy table. Just click "copy dataset" button from the source dataset, and specify the destination dataset in the pop-up form. See screenshot below. Check out the public documentation for more use cases.
A few other options that are now available since Jordan answered a few years ago. These options might be useful for some folks:
Use Cloud Composer to orchestrate the export and load via GCS buckets. See here.
Use Cloud Dataflow to orchestrate the export and load via GCS buckets. See here.
Disclaimer: I wrote the article for the 2nd option (using Cloud Dataflow).