Load batch CSV files from Cloud Storage to BigQuery and append to the same table - google-bigquery

I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files are dumped into the GCS bucket every hour in CSV format.
I would like to load all the CSV files from Cloud Storage into BigQuery, with a scheduling option that picks up the recent files from Cloud Storage and appends the data to the same table in BigQuery.
Please help me set this up.

There are many options, but I will present only two:
You can do nothing and use an external table in BigQuery: you leave the data in Cloud Storage and ask BigQuery to query it directly there. You don't duplicate the data (and pay less for storage), but queries are slower (the data has to be read from less performant storage and the CSV parsed on the fly) and every query processes all the files. You also can't use advanced BigQuery features such as partitioning, clustering, and others.
Perform a BigQuery load operation to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For new files, forget the old-school scheduled ingestion process. With the cloud, you can be event-driven: catch the event that notifies you of a new file on Cloud Storage and load it directly into BigQuery. You have to write a small Cloud Function for that, but it's the most efficient and most recommended pattern. You can find a code sample here, and a minimal sketch below.
Just a warning on the latter solution: you can perform "only" 1,500 load jobs per day and per table (about 1 per minute).
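For reference, here is a minimal sketch of that event-driven pattern (an illustration, not the official sample linked above). It assumes a 1st-gen Cloud Function triggered by the bucket's google.storage.object.finalize event, CSV files with a header row, and a hypothetical destination table your-project.raw_dataset.raw_table.

```python
# main.py -- minimal sketch; names and deployment flags are assumptions, e.g.
#   gcloud functions deploy load_csv_to_bq --runtime python39 \
#     --trigger-resource YOUR_BUCKET --trigger-event google.storage.object.finalize
from google.cloud import bigquery

BQ_TABLE = "your-project.raw_dataset.raw_table"  # hypothetical destination table

def load_csv_to_bq(event, context):
    """Triggered for each new object in the bucket; appends it to the BigQuery table."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,                                        # skip the CSV header
        autodetect=True,                                            # or supply an explicit schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,   # append to the same table
    )

    load_job = client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config)
    load_job.result()  # wait so that errors show up in the function logs
    print(f"Loaded {uri} into {BQ_TABLE}")
```

Deployed this way, every CSV that lands in the bucket is appended to the same table within seconds, which replaces the hourly schedule entirely.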

Related

Is there (still) an advantage to staging data on Google Cloud Storage before loading into BigQuery?

I have a data set stored as a local file (~100 GB uncompressed JSON, could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest first uploading this data to Google Cloud Storage before loading it from there into BigQuery.
Is there an advantage in doing this, over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know if those are used when using bq load. The only limitation I could find that still holds true is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
Yes, having data in Cloud Storage is a big advantage during development. In my case, I often create a BigQuery table from the data in Cloud Storage multiple times until I've tuned everything: schema, model, partitioning, error handling, etc. It would be really time-consuming to upload the data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
possible to delete the BQ table when it's not in use and re-import it when needed (a BQ table is much bigger than the plain, possibly compressed, data in Cloud Storage)
you save your local storage
table creation is less likely to fail (loading from local storage can run into networking issues, machine issues, etc.)
Cons
you pay some additional cost for storage (if you don't plan to touch your data often, e.g. once per month, you can lower the price by using Nearline storage)
So I would go with storing the data in Cloud Storage first, but of course it depends on your use case.
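As a rough illustration of that workflow, here is a minimal sketch of a load job from Cloud Storage into a partitioned table using the Python client. The bucket, table, and partition column names are hypothetical, and it assumes the newline-delimited JSON from the question has already been staged (gzipped, with each file under the 4 GB compressed limit) in GCS.

```python
# Minimal sketch, assuming hypothetical names: the staging bucket, the dataset/table,
# and the "created_at" partition column are not from the question.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or provide an explicit schema once you have settled on one
    time_partitioning=bigquery.TimePartitioning(field="created_at"),  # hypothetical column
)

# Re-running this while tuning schema/partitioning only re-reads from GCS,
# not from your local machine.
load_job = client.load_table_from_uri(
    "gs://example-staging-bucket/dump/*.json.gz",   # wildcard picks up all staged shards
    "my-project.my_dataset.events",
    job_config=job_config,
)
load_job.result()
print(f"Loaded {load_job.output_rows} rows")
```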

Cost effective BigQuery loading data from s3

I have about 2 TB in 20k files in S3, created over the course of each day, that I need to load into a date-partitioned BigQuery table. Files are rolled over every 5 minutes.
What is the most cost effective way to get data to BigQuery?
I am looking for cost optimization in both AWS s3 to GCP network egress and actual data loading.
Late 2020 update: you could consider using BigQuery Omni in order to not have to move your data from S3 and still have the BigQuery capabilities you're looking for.
(disclaimer: I'm not affiliated in any way to Google, I just find it remarkable that they've started providing multi-cloud support thanks to Anthos. I hope the other cloud providers will follow suit...)
Google Cloud supports (in beta) a BigQuery Transfer Service for S3; details are mentioned here. The other mechanism is S3 -> GCS -> BigQuery, which I believe will incur GCS cost too.
As per Google Cloud's pricing docs, it is "no charge" from the GCP point of view, with limits applicable.
The cost of data transfer from S3 to Google Cloud over the Internet (I am assuming it's not over VPN) is mentioned here. Your data is around 2 TB, so the cost as per the table will be $0.09 per GB.
There are several ways to optimize the transfer and the load.
First of all, the network egress from AWS can't be avoided. If you can, gzip your files before storing them in S3: you will reduce the egress bandwidth, and BigQuery can load compressed files.
If the workload that writes to S3 can't gzip the files, you have to weigh the processing time for gzipping them against the egress cost of the uncompressed files.
For GCS, we often speak about cost in GB/month. That's misleading: when you look at the detailed billing, the cost is actually calculated in GB-seconds, so the less time you leave your files in storage, the less you pay. If you load your files into BigQuery quickly after the transfer (and delete them afterwards), you will pay almost nothing.
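A minimal sketch of that "load quickly, then delete" idea, assuming gzipped CSVs staged under a hypothetical date prefix and a date-partitioned destination table (all names are placeholders, not values from the question):

```python
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# BigQuery decompresses gzipped CSV transparently during the load job.
bq.load_table_from_uri(
    "gs://example-transfer-bucket/2020-11-01/*.csv.gz",
    "my-project.my_dataset.events$20201101",   # load straight into the day's partition
    job_config=job_config,
).result()

# Delete the staged objects right away so GCS only bills for the seconds they existed.
for blob in gcs.list_blobs("example-transfer-bucket", prefix="2020-11-01/"):
    blob.delete()
```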
BigQuery data ingestion
You have a few options to get your S3 data ingested into BigQuery, all depending on how quickly you need the data available in BigQuery. Also, any requirements for data transformation (enrichment, deduplication, aggregation) should be factored into the overall cost.
The fastest way to get data into BigQuery is the streaming API (data is available within seconds), which comes with a $0.010 per 200 MB charge; a minimal streaming sketch appears after this list of options. Streaming API Pricing
The BigQuery Transfer Service is another choice, the easiest and free of charge, but it only lets you schedule a transfer to run at most once a day (currently). In your case, where data is continuously produced, that would be the slowest method to get data into BigQuery.
Transfer Service Pricing
If you need complex transformations, you may also consider Cloud Dataflow, which is not free of charge. Cloud Dataflow Pricing
Lastly, you may also consider a serverless solution, which is fully event-driven and allows data ingestion in close to real time. With this, you would pay for lambda and cloud function execution, which should be around a few dollars per day, plus egress cost.
For data mirroring between AWS S3 and Google Cloud Storage, you could use the serverless Cloud Storage Mirror, which comes with payload size optimization via either data compression or dynamic AVRO transcoding.
For getting the data loaded into BigQuery, you can use the serverless BqTail, which allows you to run loads in batches. To stay within the 1K loads per table per day BigQuery quota, you could comfortably use a 90-second batch window, which would get your data loaded into BigQuery within a few minutes' delay in the worst-case scenario. Optionally, you can also run data deduplication, data enrichment, and aggregation.
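As referenced above, here is a minimal sketch of the streaming-API option using the Python client's insert_rows_json; the table name and row shape are hypothetical, and the table is assumed to already exist with a matching schema.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # hypothetical, pre-created table

rows = [
    {"event_id": "a1", "ts": "2020-11-01T00:00:00Z", "value": 42},
    {"event_id": "a2", "ts": "2020-11-01T00:00:05Z", "value": 17},
]

# Rows become queryable within seconds; billed at $0.010 per 200 MB inserted.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```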
Egress cost consideration
In your scenario, where the transfer size is relatively small (2 TB per day), I would accept the egress cost; however, if you expect to grow to 40 TB+ per day, you may consider using a direct connection to GCP. With a simple proxy, that should come with a substantial cost reduction.

Send Bigquery Data to rest endpoint

I want to send data from BigQuery (about 500K rows) to a custom endpoint via the POST method. How can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and hits the max execution time).
I was looking at Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send data via POST.
Do you know another option?
As mentioned in the comments, 500K rows is far too much data for a single POST request to be considered an option.
Dataflow is a product oriented toward pipeline development, intended to run several data transformations during its jobs. You can use BigQueryIO (with Python sample code), but if you just need to migrate the data to a certain machine/endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the size of the data you are trying to retrieve is less than 1 GB, you can export to a GCS bucket from the command-line interface like: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data to multiple files using a wildcard URI, defining your bucket destination as indicated ('gs://my-bucket/file-name-*.json').
And finally, using the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION], you will download the data from your bucket (a Python sketch of the same flow is shown below).
Note: the Cloud documentation links provided cover more ways to do this, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.
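For completeness, here is a hedged Python sketch of the same export-and-download flow as the bq extract / gsutil commands above; the table, bucket, and local paths are placeholders.

```python
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

# Wildcard URI so exports larger than 1 GB are sharded across multiple files.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)
bq.extract_table(
    "my-project.mydataset.mytable",
    "gs://example-bucket/export/file-name-*.json.gz",
    job_config=extract_config,
).result()

# Download each exported shard; you can then POST them to your endpoint in
# manageable chunks instead of one 500K-row request.
for blob in gcs.list_blobs("example-bucket", prefix="export/"):
    blob.download_to_filename(f"/tmp/{blob.name.rsplit('/', 1)[-1]}")
```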

Bigquery Redshift migration of 2 TB+ size table

I am trying to migrate from Redshift to BigQuery. The table size is 2 TB+.
I am using the BigQuery Redshift data transfer service.
But the migration is running for more than 5 hours.
I also see that the queries executed on the Redshift end unload data in 50 MB chunks, and there is no way to configure a chunk size parameter in the Redshift transfer job.
Is this much time to transfer 2 TB of data from Redshift to BigQuery expected, or can something be done to improve this job?
Some systems, like Snowflake, do this in just 2-3 hours from Redshift to their end.
The BigQuery Redshift data transfer service is built on top of the Google Cloud Storage Transfer Service. The end-to-end data movement involves:
1. Extract data from Redshift cluster to S3
2. Move data from S3 to GCS
3. Load data from GCS into BQ
While the 2nd and 3rd steps are quick, the first step is actually limited by the Redshift cluster itself, since it's the Redshift cluster that executes the UNLOAD command.
Some options to make this faster:
1. Upgrade to a more powerful cluster.
2. Use Redshift workload management (https://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) to give the migration account (the one provided to the BigQuery Redshift data transfer service) better priority and resources to run the UNLOAD command.
I don't have experience with the Redshift data transfer service, but I have used the Google Cloud Storage Transfer Service (available here), and in my experience it's very scalable. It should transfer 2 TB of data in under an hour. If you've got millions of smallish files to transfer, it might take a couple of hours, but it should still work.
Once you've got the data in Google Cloud Storage, you can either import it into BigQuery or create a federated table that scans over the data in Google Cloud Storage.
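If you go the manual S3 -> GCS route yourself, here is a hedged sketch of creating a one-off transfer job with the google-cloud-storage-transfer client library; the project, bucket names, and AWS credentials are placeholders, and the managed Redshift transfer service does this step for you automatically.

```python
from datetime import date
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

today = date.today()
one_time = {"year": today.year, "month": today.month, "day": today.day}

request = storage_transfer.CreateTransferJobRequest(
    {
        "transfer_job": {
            "project_id": "my-gcp-project",
            "description": "One-off S3 -> GCS copy of Redshift UNLOAD output",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            # Same start and end date => run exactly once.
            "schedule": {"schedule_start_date": one_time, "schedule_end_date": one_time},
            "transfer_spec": {
                "aws_s3_data_source": {
                    "bucket_name": "my-redshift-unload-bucket",
                    "aws_access_key": {
                        "access_key_id": "AKIA...PLACEHOLDER",
                        "secret_access_key": "PLACEHOLDER",
                    },
                },
                "gcs_data_sink": {"bucket_name": "my-gcs-landing-bucket"},
            },
        }
    }
)
result = client.create_transfer_job(request)
print(f"Created transfer job: {result.name}")
```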

Best way to migrate large amount of data from US dataset to EU dataset in BigQuery?

I have many TBs in about 1 million tables in a single BigQuery project hosted in multiple datasets that are located in the US. I need to move all of this data to datasets hosted in the EU. What is my best option for doing so?
I'd export the tables to Google Cloud Storage and re-import them using load jobs, but there's a 10K limit on load jobs per project per day.
I'd do it as queries with "allow large results" and save to a destination table, but that doesn't work cross-region.
The only option I see right now is to reinsert all of the data using the BQ streaming API, which would be cost prohibitive.
What's the best way to move a large volume of data in many tables cross-region in BigQuery?
You have a couple of options:
Use load jobs, and contact Google Cloud Support to ask for a quota exception. They're likely to grant 100k or so on a temporary basis (if not, contact me, tigani#google, and I can do so).
Use federated query jobs. That is, move the data into a GCS bucket in the EU, then re-import the data via BigQuery queries with GCS data sources. More info here.
I'll also look into whether we can increase this quota limit across the board.
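To make the export-and-reload route concrete, here is a minimal per-table sketch using the Python client: extract from the US dataset into a GCS bucket created in the EU, then load into the EU dataset. All names are placeholders; with ~1M tables you would loop over this and still need the quota increase discussed above, and if cross-location export is not available for your dataset you would stage in a US bucket and copy it to the EU bucket first.

```python
from google.cloud import bigquery

client = bigquery.Client()

SRC = "my-project.us_dataset.my_table"                     # dataset located in the US
DST = "my-project.eu_dataset.my_table"                     # dataset located in the EU
STAGING = "gs://example-eu-bucket/my_table/part-*.avro"    # bucket created in the EU

# 1. Export: the extract job runs in the source dataset's location.
extract_cfg = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO
)
client.extract_table(SRC, STAGING, job_config=extract_cfg, location="US").result()

# 2. Load: the load job runs in the destination dataset's location.
load_cfg = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
client.load_table_from_uri(STAGING, DST, job_config=load_cfg, location="EU").result()
```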
You can copy a dataset using BigQuery Copy Dataset (in-region or cross-region). The copy dataset UI is similar to copy table: just click the "copy dataset" button on the source dataset and specify the destination dataset in the pop-up form. Check out the public documentation for more use cases.
A few other options have become available since Jordan answered a few years ago. These might be useful for some folks:
Use Cloud Composer to orchestrate the export and load via GCS buckets. See here.
Use Cloud Dataflow to orchestrate the export and load via GCS buckets. See here.
Disclaimer: I wrote the article for the 2nd option (using Cloud Dataflow).