Google Dataflow instance and BigQuery cost considerations - google-bigquery

I am planning to spin up a dataflow instance on google cloud platform to run some experiments. I want to get familiar with, and experiment with using apache beam to pull data from BigQuery, run some ETL jobs (in python) and streaming jobs, and finally store the result in BigQuery.
However, I am also concerned with sending my company's GCP bill through the roof. What are the main cost considerations, or any methods to estimate what the cost will be, so I don't get an earful from my boss.
Any help would be greatly appreciated, thanks!

You can use calculator to get an estimate of price of the job.
One of the most important resource on the dataflow side is CPU per hour. To limit the cpu hours, you can set the maximum machines using option maxNumWorkers in your pipeline.
Here are more pipeline options that you can set while running your dataflow job https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
For BQ, you can do a similar estimate using the calculator.

Related

Suggestion for Non Analytical Distributed Processing Frameworks

Can someone please suggest a tool, framework or a service to perform the below task faster.
Input : The input to the service is a CSV file which consists of an identifier and several image columns with over a million rows.
Objective: To check if any of the image column of the row meets the minimum resolution and create a new boolean column for every row according to the results.
True - If any of the image in the row meets the min resolution
False - If no images in the row meets the min resolution
Current Implementation: Python script with pandas and multiprocessing running on a large VM(60 Core CPU) which takes about 4 - 5 Hours. Since this is a periodic task we schedule and manage it with Cloud Workflow and Celery Backend.
Note: We are looking to cut down on costs as uptime of server is just about 4-6Hrs a day. Hence 60 Core CPU 24*7 would be a lot of resources wasted.
Options Explored:
We have ruled out Cloud Run due to the memory, cpu and timeout limitations.
Apache Beam with Cloud Dataflow, seems like there is less support for non analytical workloads and Dataframe implementation with Apache Beam looks buggy still.
Spark and Dataproc seems to be good for analytical workloads. Although a Serverless option would be much preferred.
Which direction should i be looking into?

Can Bigquery scale to 0 workers?

I am using GCP BigQuery to run SQL queries.
As far as I know, the cost of BigQuery is charged by Slots (Workers), I'd like to know that, can Bigquery scale to 0 workers once queries finished? Is there anything I have to pay, like storage or compute engine instance?
Thank you.
NO.
BigQuery is a serverless managed database service.
It has no dependencies, it doesn't use Compute Engine Instances, or workers.
Slots concepts are for upfront payment and has nothing to do with Workers as you described.
In Bigquery you pay
your database storage
streaming insert
query costs
No.
BigQuery converts query statements into a graph of execution by dividing into multiple stages, which are composed of more granular sets of execution steps.
BigQuery model the units of work in a such a way that many potential workers may execute in parallel and each stage communicate with one another by using a distributed shuffle architecture. In BigQuery execution plan, workers are used to convey information specifically about parallelism

Cost effective BigQuery loading data from s3

I have (2 TB) in 20k files in s3 created over the course of the every day that I need to load to BigQuery to date partition table. Files are rolled over every 5 mins.
What is the most cost effective way to get data to BigQuery?
I am looking for cost optimization in both AWS s3 to GCP network egress and actual data loading.
Late 2020 update: you could consider using BigQuery Omni in order to not have to move your data from S3 and still have the BigQuery capabilities you're looking for.
(disclaimer: I'm not affiliated in any way to Google, I just find it remarkable that they've started providing multi-cloud support thanks to Anthos. I hope the other cloud providers will follow suit...)
Google cloud in beta supports a BigQuery Transfer service for S3. Details mentioned here. The other mechanism to use S3 -> GCS -> BigQuery mechanism, which i believe will incur the GCS cost too
As per Google Cloud's pricing docs, it says "no charge" from GC PoV with limits applicable.
For data transfer from S3 to Google CLoud over Internet(i am assuming its not over VPN) is mentioned here . Your data is around 2TB, so the cost as per the table will be $0.09 per GB
There is several way for optimizing the transfer and the load.
First of all, the network egress from AWS can't be avoided. If you can, gzip your file before storing them into S3. You will reduce the egress bandwidth and BigQuery can load compressed files.
If your workload that write to S3 can't gzip the file, you have to perform a comparison between the processing time for gzipping the file and the egress cost of not gzipped file.
For GCS, we often speak about cost in GB/month. It's a mistake. When you look at the billing in BigQuery the cost is calculated in Gb/seconds. By the ways, less you let your file on storage, less you pay. By the way if you load your file quickly after the transfert and the load into BigQuery, you will pay almost nothing.
BigQuery data ingestion
You have a few options to get your s3 data ingested to BigQuery, all depending on how quickly do you need your data available in BigQuery. Also, any requirements for any data transformation (enrichment, deduplication, aggregation) should be taken into consideration to the overall cost.
The fastest way to get data to BigQuery is streaming API (within the seconds delay), which comes with $0.010 per 200 MB charge. Streaming API Pricing
BigQuery Transfer service is another choice that is the easiest and free of charge. It allows you to schedule data transfer to run it no more than once a day (currently). In your case, where data is continuously produced, that would be the slowest method to get data to BigQuery.
Transfer Service Pricing
If you need complex transformation, you may also consider Cloud Dataflow, which is not free of charge. Cloud Dataflow Pricing
Lastly, you may also consider a serverless solution, which is fully event-driven, allowing you data ingestion in close to real-time. With this, you would pay for lambda and cloud function execution, which should be around a few dollars per day plus egress cost.
For data mirroring between AWS S3 and Google Cloud Storage, you could use serverless Cloud Storage Mirror, which comes with payload size optimization with either data compression or dynamic AVRO transcoding.
For getting data loaded to BigQuery, you can use serverless BqTail, which allows you to run loads in batches. To not exceed 1K loads BigQuery quota per table per day, you could comfortably use 90-sec batch window, which would get your data loaded to BigQuery within a few minute's delays in the worst-case scenario. Optionally you can also run data deduplication, data enrichment, and aggregation.
Egress cost consideration
In your scenario, when transfer size is relatively small, 2 TB per day, I would accept egress cost; however, if you expect to grow to 40TB+ per day, you may consider using direct connect to GCP. With a simple proxy, that should come with substantial cost reduction.

Dataflow to BigQuery quota

I found a couple related questions, but no definitive answer from the Google team, for this particular question:
Is a Cloud DataFlow job, writing to BigQuery, limited to the BigQuery quota of 100K rows-per-second-per-table (i.e. BQ streaming limit)?
google dataflow write to bigquery table performance
Cloud DataFlow performance - are our times to be expected?
Edit:
The main motivation is to find a way to predict runtimes for various input sizes.
I've managed to run jobs which show > 180K rows/sec processed via the Dataflow monitoring UI. But I'm unsure if this is somehow throttled on the insert into the table, since the job runtime was slower by about 2x than a naive calculation (500mm rows / 180k rows/sec = 45 mins, which actually took almost 2 hrs)
From your message, it sounds like you are executing your pipeline in batch, not streaming, mode.
In Batch mode, jobs run on the Google Cloud Dataflow service do not use BigQuery's streaming writes. Instead, we write all the rows to be imported to files on GCS, and then invoke a BigQuery load" job. Note that this reduces your costs (load jobs are cheaper than streaming writes) and is more efficient overall (BigQuery can be faster doing a bulk load than doing per-row imports). The tradeoff is that no results are available in BigQuery until the entire job finishes successfully.
Load jobs are not limited by a certain number of rows/second, rather it is limited by the daily quotas.
In Streaming mode, Dataflow does indeed use BigQuery's streaming writes. In that case, the 100,000 rows per second limit does apply. If you exceed that limit, Dataflow will get a quota_exceeded error and will then retry the failing inserts. This behavior will help smooth out short-term spikes that temporarily exceed BigQuery's quota; if your pipeline exceeds quota for a long period of time, this fail-and-retry policy will eventually act as a form of backpressure that slows your pipeline down.
--
As for why your job took 2 hours instead of 45 minutes, your job will have multiple stages that proceed serially, and so using the throughput of the fastest stage is not an accurate way to estimate end-to-end runtime. For example, the BigQuery load job is not initiated until after Dataflow finishes writing all rows to GCS. Your rates seem reasonable, but please follow up if you suspect a performance degradation.

Google BigQuery basic questions

These may be few basic questions.
When i load data into BQ tables, where exactly data stored? (If billing is already enabled). if it is data center, what would be data center capacity? Does our data co-exist with other users data?
When we fire queries, How our queries processed? What is the default compute engine used for this?
How can we increase query processing capacity?
Thanks
CP
BigQuery datacenter capacity is practically unlimited. If you plan to upload petabytes in a very short time frame you might need to contact support first just to make sure, but for normal big loads everything should be fine.
BigQuery doesn't use compute engine, but a series of very large clusters where all queries run. That's the secret to a low cost per query, without ongoing costs per hour like other alternatives.
BigQuery increases the number of CPUs involved in your query elastically as the query needs. You don't need to manage storage nor processing capacity.