I have ~2 TB of data in 20k files in S3, created over the course of each day, that I need to load into a date-partitioned BigQuery table. Files are rolled over every 5 minutes.
What is the most cost effective way to get data to BigQuery?
I am looking for cost optimization in both AWS s3 to GCP network egress and actual data loading.
Late 2020 update: you could consider using BigQuery Omni so you don't have to move your data out of S3 and still get the BigQuery capabilities you're looking for.
(disclaimer: I'm not affiliated in any way with Google, I just find it remarkable that they've started providing multi-cloud support thanks to Anthos. I hope the other cloud providers will follow suit...)
Google Cloud supports a BigQuery Transfer Service for S3 (in beta). Details are mentioned here. The other mechanism is the S3 -> GCS -> BigQuery route, which I believe will incur GCS storage costs too.
As per Google Cloud's pricing docs, the transfer service itself is listed as "no charge" from the Google Cloud point of view, with limits applicable.
Pricing for data transfer from S3 to Google Cloud over the internet (I am assuming it's not over VPN) is mentioned here. Your data is around 2 TB, so the cost as per the table will be $0.09 per GB.
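At that rate, the AWS egress works out to roughly 2,000 GB × $0.09/GB ≈ $180 per day (a back-of-the-envelope estimate; the exact figure depends on AWS's tiered egress pricing and on whether the 2 TB is counted as decimal or binary).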
There are several ways to optimize the transfer and the load.
First of all, the network egress from AWS can't be avoided. If you can, gzip your files before storing them in S3. You will reduce the egress bandwidth, and BigQuery can load compressed files.
If the workload that writes to S3 can't gzip the files, you have to weigh the processing cost of gzipping them against the egress cost of the uncompressed files.
For GCS, we often speak about cost in GB per month. That's misleading: when you look at the actual billing, the cost is prorated, effectively calculated in GB-seconds. The less time your files sit in storage, the less you pay. So if you load your files into BigQuery quickly after the transfer and then delete them, you will pay almost nothing for GCS.
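For reference, loading a gzip-compressed CSV file from GCS into a date-partitioned table with the Node.js client could look roughly like this (dataset, table and object names are placeholders, not from the original question):

```javascript
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

async function loadGzippedCsv() {
  const metadata = {
    sourceFormat: 'CSV',
    skipLeadingRows: 1,
    autodetect: true,
    writeDisposition: 'WRITE_APPEND',
    timePartitioning: {type: 'DAY'}, // date-partitioned destination table
  };

  // Gzip-compressed CSV files are accepted as load sources;
  // load() creates a load job and waits for it to finish.
  const [job] = await bigquery
    .dataset('my_dataset')   // placeholder
    .table('events')         // placeholder
    .load(storage.bucket('my-gcs-bucket').file('2020-10-01/events-0001.csv.gz'), metadata);

  console.log(`Job ${job.id} completed.`);
}

loadGzippedCsv().catch(console.error);
```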
BigQuery data ingestion
You have a few options to get your S3 data ingested into BigQuery, depending on how quickly you need the data available in BigQuery. Also, any requirements for data transformation (enrichment, deduplication, aggregation) should be taken into consideration in the overall cost.
The fastest way to get data into BigQuery is the streaming API (seconds of delay), which comes with a $0.010 per 200 MB charge. Streaming API Pricing
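For scale: at roughly 2 TB per day, that is about 2,000,000 MB ÷ 200 MB × $0.010 ≈ $100 per day in streaming insert charges alone (a rough estimate at the listed rate).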
BigQuery Transfer Service is another choice; it is the easiest and free of charge. It allows you to schedule a data transfer to run no more than once a day (currently). In your case, where data is continuously produced, that would be the slowest method to get data into BigQuery.
Transfer Service Pricing
If you need complex transformation, you may also consider Cloud Dataflow, which is not free of charge. Cloud Dataflow Pricing
Lastly, you may also consider a serverless solution, which is fully event-driven, allowing data ingestion in close to real time. With this, you would pay for Lambda and Cloud Function executions, which should be around a few dollars per day, plus egress cost.
For data mirroring between AWS S3 and Google Cloud Storage, you could use serverless Cloud Storage Mirror, which comes with payload size optimization with either data compression or dynamic AVRO transcoding.
For getting data loaded into BigQuery, you can use serverless BqTail, which allows you to run loads in batches. To stay under the 1K-loads-per-table-per-day BigQuery quota, you could comfortably use a 90-second batch window (roughly 86,400 s ÷ 90 s ≈ 960 loads per day), which would get your data loaded into BigQuery within a few minutes' delay in the worst case. Optionally you can also run data deduplication, data enrichment, and aggregation.
Egress cost consideration
In your scenario, where the transfer size is relatively small (2 TB per day), I would accept the egress cost; however, if you expect to grow to 40 TB+ per day, you may consider using a direct connect to GCP. With a simple proxy, that should come with a substantial cost reduction.
Related
I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files are dumped into the GCS bucket every hour in CSV format.
I would like to load all the CSV files from Cloud Storage into BigQuery, with a scheduling option to load the most recent files from Cloud Storage and append the data to the same table in BigQuery.
Please help me set this up.
There are many options, but I will present only two:
You can do nothing and use an external table in BigQuery: you leave the data in Cloud Storage and ask BigQuery to read it directly from Cloud Storage. You don't duplicate the data (and pay less for storage), but queries are slower (the data has to be read from less performant storage and the CSV parsed on the fly) and every query processes all the files. You also can't use advanced BigQuery features such as partitioning, clustering and others...
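As a rough sketch (not from the original answer), defining such an external table with the Node.js client could look like this; dataset, table and bucket names are placeholders:

```javascript
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function createExternalTable() {
  // External table definition pointing BigQuery at the CSV files in GCS.
  const options = {
    externalDataConfiguration: {
      sourceFormat: 'CSV',
      sourceUris: ['gs://my-bucket/raw/*.csv'], // placeholder bucket/prefix
      autodetect: true,                         // or provide an explicit schema
      csvOptions: {skipLeadingRows: 1},
    },
  };

  const [table] = await bigquery
    .dataset('my_dataset')                      // placeholder
    .createTable('raw_csv_external', options);  // placeholder

  console.log(`Created external table ${table.id}`);
}

createExternalTable().catch(console.error);
```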
Perform a BigQuery load operation to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For new files, forget the old-school scheduled ingestion process. With the cloud, you can be event driven: catch the event that notifies you of a new file in Cloud Storage and load it directly into BigQuery. You have to write a small Cloud Function for that, but it's the most efficient and most recommended pattern. You can find a code sample here
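A minimal sketch of such a Cloud Function (assuming a Node.js runtime and a google.storage.object.finalize trigger; dataset and table names are placeholders) could look like this:

```javascript
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

// Triggered for every new object finalized in the landing bucket.
exports.loadCsvToBigQuery = async (file, context) => {
  const metadata = {
    sourceFormat: 'CSV',
    skipLeadingRows: 1,
    autodetect: true,
    writeDisposition: 'WRITE_APPEND', // append to the existing table
  };

  // load() starts a BigQuery load job and waits for it to complete.
  const [job] = await bigquery
    .dataset('my_dataset')  // placeholder
    .table('raw_events')    // placeholder
    .load(storage.bucket(file.bucket).file(file.name), metadata);

  console.log(`Loaded gs://${file.bucket}/${file.name} (job ${job.id})`);
};
```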
Just a warning on the latter solution: you can perform "only" 1,500 load jobs per day and per table (about one per minute).
I'm expecting to have thousands of sensors sending telemetry data at 10 FPS with around 1 KB of binary data per frame, using IoT Core, meaning I'll get it via Pub/Sub. I'd like to get that data into BigQuery, and no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
Question is, what's my best alternative?
I've thought about a Cloud Run service running an Express app to accept the data from Pub/Sub, using a global variable to accumulate around 500 rows in RAM, then dumping them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain anything from accumulating rows, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method, as you have already mentioned. The insert() method streams one row at a time irrespective of how many rows you accumulate.
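As a quick sketch (dataset, table and row fields are illustrative placeholders, not from the question), a streaming insert with the Node.js client looks roughly like this:

```javascript
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

// insert() wraps the legacy tabledata.insertAll streaming API.
async function insertTelemetry(rows) {
  await bigquery
    .dataset('telemetry')   // placeholder
    .table('frames')        // placeholder
    .insert(rows);
}

insertTelemetry([
  {device_id: 'sensor-001', ts: new Date().toISOString(), payload: 'base64data=='},
]).catch(err => {
  // Per-row failures are reported in err.errors (PartialFailureError).
  console.error(JSON.stringify(err, null, 2));
});
```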
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently only supports the Java, Python and Go (in preview) client libraries.
Batch Ingestion
If your requirement is to load large, bounded data sets that don't have to be processed in real time, prefer batch loading. BigQuery batch load jobs are free: you only pay for storing and querying the data, not for loading it. Refer to the quotas and limits for batch load jobs here. Some more key points on batch load jobs, quoted from this article:
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on the performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data. Queries never scan partial data.
I have 2 GCS buckets with identical, sharded CSV files. Bucket federated-query-standard has a storage class of Standard. Bucket federated-query-archive has a storage class of Archive.
Running identical queries using a federated/external source over the two buckets produces the exact same amount of bytes billed/processed, which is 57.13 GB of data. Performance (query time) is roughly the same.
According to the official docs for BigQuery pricing:
"When querying an external data source from BigQuery, you are charged
for the number of bytes read by the query. For more information, see
Query pricing. You are also charged for storing the data on Cloud
Storage. For more information, see Cloud Storage Pricing."
So, users are charged for two things: the data processed and the storage of the data in GCS. This makes complete sense.
My question: is there a hidden cost anywhere that I'm not seeing (or am unaware of) for querying GCS (e.g. retrieval costs) or between different storage classes?
Currently, there aren't any charges for reading from Archive or Coldline storage, hidden or otherwise. That doesn't mean this won't change in the future.
Because of the way BigQuery accesses GCS, GCS charges BigQuery for the access, not you (i.e. it's an internal accounting matter).
Performance may be inconsistent if you use archival storage. For that storage class, there are fewer redundant copies so tail latency will be higher.
For coldline, however, you should see roughly equivalent performance to standard GCS storage. The reason is that under the covers, coldline is implemented exactly the same way as standard storage. The difference is that coldline charges less for storage but makes it up on reads.
Since BigQuery doesn't charge you for reads, if you're doing a lot of federated querying over data in GCS but don't read the data much otherwise, your best bet is going to be to use coldline.
Again, this is a point-in-time response and this may change in the future.
I have a data set stored as a local file (~100 GB of uncompressed JSON, which could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest to first upload this data to Google Cloud Storage before loading it from there into BigQuery.
Is there an advantage in doing this, over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know if those are used when using bq load. The only limitation I could find that still holds true is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
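For what it's worth, loading a local (possibly gzip-compressed) newline-delimited JSON file directly with the Node.js client is a single call; the sketch below assumes placeholder dataset/table/file names, and the 4 GB compressed-JSON limit mentioned above still applies:

```javascript
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function loadLocalJson() {
  const metadata = {
    sourceFormat: 'NEWLINE_DELIMITED_JSON',
    autodetect: true,
    writeDisposition: 'WRITE_TRUNCATE', // replace the table contents
  };

  // load() uploads the local file and waits for the load job to finish.
  const [job] = await bigquery
    .dataset('my_dataset')          // placeholder
    .table('my_table')              // placeholder
    .load('./data.json.gz', metadata);

  console.log(`Job ${job.id} completed.`);
}

loadLocalJson().catch(console.error);
```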
Yes, having the data in Cloud Storage is a big advantage during development. In my case I often create a BigQuery table from data in Cloud Storage multiple times until I have tuned everything: schema, model, partitioning, resolving errors, etc. It would be really time consuming to upload the data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
possible to remove the BQ table when not used and import it again when needed (a BQ table is much bigger than the plain, possibly compressed, data in Cloud Storage)
you save your local storage
less likely to fail during table creation (loading from local storage can hit networking issues, computer issues, etc.)
Cons
you pay some additional cost for storage (if you do not plan to touch your data often, e.g. once per month, you can reduce the price by using Nearline storage)
So I would go for storing the data in Cloud Storage first, but of course it depends on your use case.
When importing from S3 to DynamoDB, does this count towards provisioned write throughput?
I have a service that is only read from, except for batch updates from a multi-gigabyte file in S3. We don't want to pay for provisioned writes all month, and scaling from 0 writes to several million could take a while given the AWS policy of only allowing provisioned rates to double at a time.
Yes. The EMR integration relies on the same API as any client application. As such, it is subject to the same throughput policy.
Minor clarifications:
minimum throughput = 1 (not 0)
maximum throughput = 10,000 (not > 1,000,000)
By the way, large scaling operations can easily be automated, provided that you only double the throughput at each step; it only takes a couple of minutes to run. Maybe you could also consider storing an "incremental" diff instead of the full "multi-gigabyte file in S3". It would save a lot...
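A rough sketch of that doubling loop with the AWS SDK for JavaScript (table name, region and target value are placeholders):

```javascript
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({region: 'us-east-1'}); // placeholder region

// Doubles the provisioned write capacity step by step until the target is reached.
async function scaleWrites(tableName, targetWrites) {
  for (;;) {
    const {Table} = await dynamodb.describeTable({TableName: tableName}).promise();
    const {ReadCapacityUnits, WriteCapacityUnits} = Table.ProvisionedThroughput;
    if (WriteCapacityUnits >= targetWrites) break;

    await dynamodb.updateTable({
      TableName: tableName,
      ProvisionedThroughput: {
        ReadCapacityUnits,                                              // unchanged
        WriteCapacityUnits: Math.min(WriteCapacityUnits * 2, targetWrites),
      },
    }).promise();

    // Wait for the table to return to ACTIVE before the next increase.
    await dynamodb.waitFor('tableExists', {TableName: tableName}).promise();
  }
}

scaleWrites('my-table', 4000).catch(console.error); // placeholders
```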
The official optimization guide for DynamoDB can provide you some useful hints on how to optimize your import.