How to save costs when calling BigQuery from a Google Cloud Function

I have 10,000 images added to my Google Storage bucket on a daily basis.
This bucket has a Cloud Function event trigger which kicks off BigQuery scans.
My Cloud Function checks for existing records in BigQuery 10,000 times a day, which is running up my BigQuery bill to an unsustainable amount.
I would like to know if there is a way to query the database once and store the results into a variable which can then be available for all triggered cloud functions?
In summary: query DB once, store query results, use query results for all cloud function invocations. This way I do not hit BigQuery 10,000+ times a day.
P.S. BigQuery processed 288 terabytes of data, which makes for an insane bill $$$$$$$$$$$$

That's an insane bill.
Step 1: Set up cost controls. You will never deal with an insane bill again.
How do I turn on cost controls on BigQuery?
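The cost controls themselves are a console setting, not code: BigQuery's custom query-usage quotas (per project or per user per day) are configured under IAM & Admin > Quotas. What you can add in code is a per-query guard with maximum_bytes_billed, which makes a query fail - without charge - if it would bill more than the cap. A minimal sketch with the Python client; the table name is hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# Per-query guard: the query errors out (and bills nothing) if it would
# process more than ~1 GB. Project/user daily quotas are set in the console.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)

sql = "SELECT file_name FROM `my_project.my_dataset.processed_images` LIMIT 10"  # hypothetical table
rows = list(client.query(sql, job_config=job_config).result())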
Step 2: Instead of querying again, check out the results of the previous job. If it's the same query and data, the same results are still valid. The API method is jobs.getQueryResults().
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/getQueryResults
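A minimal sketch of that idea with the Python client: run the query once, keep the job ID somewhere every invocation can read (where to keep it - an environment variable, a small file in GCS, Firestore - is left open here and is an assumption), and have later invocations pull the already-computed results through the stored job instead of issuing a new scan:

from google.cloud import bigquery

client = bigquery.Client()

def existing_records(job_id=None):
    # Later invocations: jobs.getQueryResults under the hood - no new scan, no new charge.
    # Query results stay retrievable for roughly 24 hours after the job ran.
    if job_id:
        return list(client.get_job(job_id).result())
    # First invocation: pay for the scan once and hand the job ID to the others.
    job = client.query("SELECT file_name FROM `my_project.my_dataset.processed_images`")  # hypothetical table
    rows = list(job.result())
    print("store this job ID for later invocations:", job.job_id)
    return rows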
Step 3: Cluster your tables. You'll probably see an additional 90% cost reduction.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
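For illustration, a hedged one-off DDL that rebuilds the lookup table partitioned and clustered on the column the Cloud Function filters by (every name below is made up - substitute your own):

from google.cloud import bigquery

client = bigquery.Client()

# Each lookup then scans only the partition/cluster blocks that can contain
# the requested key, instead of the whole table.
ddl = """
CREATE TABLE `my_project.my_dataset.images_clustered`
PARTITION BY DATE(created_at)
CLUSTER BY file_name AS
SELECT * FROM `my_project.my_dataset.images`
"""
client.query(ddl).result()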

Related

Does Google BigQuery charge by processing time?

So, this is somewhat of a "realtime" question. I ran a query and it has currently been going for almost 12K seconds, but it told me "This query will process 239.3 GB when run." Does BQ charge by time in addition to the data processed? Should I stop this now?
I assume you are using the on-demand pricing model - you are billed based on the amount of bytes processed, and processing time is not involved. Your 239.3 GB query costs the same whether it finishes in 12 seconds or 12K seconds.
BigQuery uses a columnar data structure. You're charged according to the total data processed in the columns you select, and the total data per column is calculated based on the types of data in the column. For more information about how your data size is calculated, see data size calculation.
You aren't charged for queries that return an error, or for queries that retrieve results from the cache.
Charges are rounded to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query.
Cancelling a running query job may incur charges up to the full cost for the query were it allowed to run to completion.
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
See more at BigQuery pricing
For on-demand queries you are charged for the amount of data processed by the BigQuery engine, and only for expensive queries are you charged extra for their complexity (which can show up as increased query time).
The amount of data processed is reflected by totalBytesProcessed, and also by totalBytesBilled, which is the same for ordinary queries. For complex/expensive queries you are charged extra for the complexity; technically this is done by totalBytesBilled becoming larger than totalBytesProcessed.
More details: see this link
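If you want to look at those two numbers for one of your own jobs, they are exposed on the job resource; a small sketch with the Python client - the job ID is hypothetical and the per-TB rate is an assumption, so plug in your region's current on-demand price:

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("my_job_id")  # hypothetical; assumes it is a query job

processed = job.total_bytes_processed or 0
billed = job.total_bytes_billed or 0

# On-demand cost depends only on bytes billed, never on how long the job ran.
assumed_price_per_tb = 5.0  # assumption - check current on-demand pricing
print("processed:", processed, "billed:", billed)
print("approx cost: $", round(billed / 2**40 * assumed_price_per_tb, 2))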

Is there an option to limit the number to columns in a sink export into BigQuery?

I created a sink export to load audit logs into BigQuery. However, there are a large number of columns that I don't need from the audit log. Is there a way to pick and choose the columns in the sink export?
We need to define our reason for wanting to reduce the number of columns. My thinking is that you are concerned about costs. If we look at active storage, we find that the current price is $0.02/GB, with the first 10 GB free each month. If the data is untouched for 90 days, that storage cost drops to $0.01/GB.
Next we have to estimate how much storage is used when recording all columns for a month vs recording only the columns you want to keep. If we can make some projections, then we can make a call on how much the cost might change if we reduced storage usage. What we will want to estimate is the number of log records exported per month and the size of the average log record written as-is today vs a log record with only the minimally needed fields.
If we do find a distinction that makes for a significant cost saving, one further thought would be to export the log entries to Pub/Sub and have them trigger a Cloud Function. However, I suspect we might end up finding that the savings on BQ storage are then lost to the cost of Pub/Sub and Cloud Functions (and possibly BQ streaming inserts).
Another thought is to realize that the BQ log records are written to tables named by day. We could have a batch job that runs after a day's worth of records has been written and copies only the columns of interest to a new table, as sketched below. Again, we will have to watch that we don't end up with higher costs elsewhere in our attempt to reduce storage costs.
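A hedged sketch of that batch job with the Python client - the exported table name, the date suffix, and the handful of columns kept are all assumptions, so adjust them to your sink's actual schema:

from google.cloud import bigquery

client = bigquery.Client()

# Copy only the columns of interest from one day's exported audit-log table
# into a slim table; the wide daily table can then be expired or dropped.
day = "20240115"  # hypothetical date suffix of the exported daily table
ddl = f"""
CREATE OR REPLACE TABLE `my_project.slim_logs.audit_{day}` AS
SELECT
  timestamp,
  protopayload_auditlog.methodName AS method_name,
  protopayload_auditlog.authenticationInfo.principalEmail AS principal_email
FROM `my_project.audit_logs.cloudaudit_googleapis_com_activity_{day}`
"""
client.query(ddl).result()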

Getting table-specific costs via SQL query

Is there a query I can run to determine how much queries against each table are costing us? For instance, the result of this query would at least include something like:
dataset.table1 236 TB processed
dataset.table2 56 GB processed
dataset.table3 24 kB processed
etc.
Also is there a way to know what specific queries are costing us the most?
Thanks!
Let's first talk about the data and the respective data points needed for such a query!
Take a look at Job Resources
Here you have a few useful properties:
configuration.query.query - BigQuery SQL query to execute.
statistics.query.referencedTables - Referenced tables for the job.
statistics.query.totalBytesBilled - Total bytes billed for the job.
statistics.query.totalBytesProcessed - Total bytes processed for the job.
statistics.query.billingTier - Billing tier for the job.
Having the above data points would allow you to write a relatively simple query to answer your cost-per-query and cost-per-table questions!
So, now - how to get this data available?
You can collect your jobs using the Jobs.list API and then loop through all available jobs, retrieving the respective stats via the Job.get API - of course dumping the retrieved data into a BigQuery table. Then you can enjoy the analysis!
Or you can use BigQuery's audit logs to track access and cost details (as described in the docs) and export them back to BigQuery for analysis.
The former option (Jobs.list and then Job.get in a loop) gives you the ability to get your jobs' info even if you don't have audit logs enabled yet, because the Job.get API returns information about a specific job that is available for a six-month period after creation - so plenty of data for analysis!
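A sketch of that former option in Python - with the current client library, list_jobs already returns job objects carrying their statistics, so the list-then-get loop collapses into a single pass (the destination table for the dump is hypothetical and must already exist):

from collections import defaultdict
from google.cloud import bigquery

client = bigquery.Client()
bytes_per_table = defaultdict(int)
rows_to_dump = []

for job in client.list_jobs(all_users=True, state_filter="done", max_results=1000):
    if job.job_type != "query":
        continue
    billed = job.total_bytes_billed or 0
    refs = job.referenced_tables or []
    rows_to_dump.append({
        "job_id": job.job_id,
        "query": job.query,
        "total_bytes_billed": billed,
        "referenced_tables": ",".join(f"{t.dataset_id}.{t.table_id}" for t in refs),
    })
    # Caveat: this attributes the whole job's billed bytes to every table it touched,
    # which is why true per-table bytes are not really available (see the other answer).
    for t in refs:
        bytes_per_table[f"{t.dataset_id}.{t.table_id}"] += billed

if rows_to_dump:
    client.insert_rows_json("my_project.admin.query_costs", rows_to_dump)  # hypothetical table

for table, total in sorted(bytes_per_table.items(), key=lambda kv: -kv[1]):
    print(table, total, "bytes billed")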
In my understanding it is currently not possible to get processed bytes per table.
It would be a great feature: it would let you identify and optimize costs, and give you a better way to understand the effectiveness of partitioning and clustering changes. Currently it is only possible to get the total processed bytes for a query and to see which tables were referenced, but there is no easy query - in fact no query at all - that makes it possible to analyze this cost at the table level, which is more granular than the query level.

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop and we want to keep a record of how many views these products had over a 1-year period, recording the views at least every 2 hours. The question is what structure to use for this task?
Right now we tried keeping stats for the last 30 days in records that have 2 columns, classified_id,stats, where stats is a stripped-down JSON with the format date:views,date:views... For example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back, since to get the sum of views of say 1000 products you need to fetch something like 30 MB from the database and calculate it yourself.
The other way we are thinking of going right now is to have a massive table with 3 columns, classified_id,date,view, and store each recording in its own row. This of course will result in a huge table with hundreds of millions of rows; for example, if we have 1.8 million classifieds and keep records 24/7 for one year every 2 hours, we need
1,800,000 * 365 * 12 = 7,884,000,000 (billions, with a B) rows, which, while well inside the theoretical limit of Postgres, makes me imagine that queries on it (say for updating the views), even with the correct indices, will take some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In my current work we store metrics data for websites, and the total number of rows we have is much higher. And in a previous job I worked with a pg database which collected metrics from a mobile network and took in ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition data - most probably by day. With this amount of data you may find indexes quite useless; it depends on the plans you see in EXPLAIN command output. For example, that telco app did not use any indexes at all because they would just slow down the whole engine.
Another question is how quickly you need queries to respond, and which steps in granularity (sums over hours/days/weeks etc.) you will allow for users. You may even need to make some aggregations for granularities like week or month or quarter.
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserts of ~23,000 records per second using bulk inserts with the COPY command, every bulk being several thousand records. Raw data were partitioned by minute. To avoid disk waits the db had 4 tablespaces on 4 different disks/arrays, and the partitions were distributed over them. PostgreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
It is also a good idea to move the pg_xlog directory to a separate disk or array - not just a different filesystem; it must be separate HW. I can recommend SSDs only in arrays with proper error checking; lately we had problems with a corrupted database on a single SSD.
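A compressed sketch of that layout with psycopg2 and PostgreSQL's declarative partitioning - the connection string, table and column names are all made up for illustration:

import io
import psycopg2

conn = psycopg2.connect("dbname=stats")  # assumption: adjust to your setup
cur = conn.cursor()

# Parent table partitioned by day; the raw partitions carry no indexes.
cur.execute("""
    CREATE TABLE IF NOT EXISTS product_views (
        classified_id bigint      NOT NULL,
        ts            timestamptz NOT NULL,
        views         integer     NOT NULL
    ) PARTITION BY RANGE (ts)
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS product_views_2024_01_15
    PARTITION OF product_views
    FOR VALUES FROM ('2024-01-15') TO ('2024-01-16')
""")

# Bulk-load a batch with COPY instead of row-by-row INSERTs; writing straight
# into the day's partition keeps tuple-routing overhead out of the hot path.
batch = io.StringIO(
    "345422,2024-01-15 10:00:00+00,23212\n"
    "345423,2024-01-15 10:00:00+00,64233\n"
)
cur.copy_expert(
    "COPY product_views_2024_01_15 (classified_id, ts, views) FROM STDIN WITH (FORMAT csv)",
    batch,
)
conn.commit()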
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in column oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as #JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables - but read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
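The last two suggestions are small enough to show as DDL; a sketch (again via psycopg2, names hypothetical), with the usual warning that an UNLOGGED table skips the WAL and is truncated after a crash:

import psycopg2

conn = psycopg2.connect("dbname=stats")  # assumption: adjust to your setup
cur = conn.cursor()

# fillfactor = 100 packs pages completely - fine for append-only data that is
# never UPDATEd in place. UNLOGGED skips the write-ahead log: much faster
# writes, but the contents are lost after a crash and are not replicated.
cur.execute("""
    CREATE UNLOGGED TABLE IF NOT EXISTS product_views_raw (
        classified_id bigint      NOT NULL,
        ts            timestamptz NOT NULL,
        views         integer     NOT NULL
    ) WITH (fillfactor = 100)
""")
conn.commit()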
Just to raise another non-RDBMS option for you (so a little off topic): you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query them directly with SQL.
Since it queries plain text files, you may be able to just send it unfiltered web logs and query them through JDBC.

Google BigQuery pricing

I'm a PhD student from Singapore Management University. Currently I'm working at Carnegie Mellon University on a research project which needs the historical events from GitHub Archive (http://www.githubarchive.org/). I noticed that Google BigQuery has GitHub Archive data, so I ran a program to crawl data using the Google BigQuery service.
I just found that the price Google BigQuery shows on the console is not updated in real time... After I had been running the program for a few hours, the fee was only 4 dollars plus, so I thought the price was reasonable and I kept running the program. After 1~2 days, I checked the price again on Sep 13, 2013, and it had become $1,388... I therefore immediately stopped using the Google BigQuery service. And just now I checked the price again; it turns out I need to pay $4,179...
It is my fault that I didn't realize I would need to pay this big amount of money for executing queries and obtaining data from Google BigQuery.
This project is only for research, not for commercial purposes. I would like to know whether it is possible to waive the fee. I really need the [Google BigQuery team]'s kind help.
Thank you very much & Best Regards,
Lisa
A year later update:
Please note some big developments since this situation:
Query prices are down 85%.
GitHub Archive is publishing daily and yearly tables now - so while developing your queries, always test them on smaller datasets.
BigQuery pricing is based on the amount of data queried. One of its highlights is how easily it scales, going from scanning a few gigabytes to terabytes in seconds.
Pricing scaling linearly is a feature: Most (or all?) other databases I know of would require exponentially more expensive resources, or are just not able to handle these amounts of data - at least not in a reasonable time frame.
That said, linear scaling means that a query over a terabyte is a 1000 times more expensive than a query over a gigabyte. BigQuery users need to be aware of this and plan accordingly. For these purposes BigQuery offers the "dry run" flag, that allows one to see exactly how much data will be queried before running the query - and adjust accordingly.
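A sketch of the dry-run flag with the Python client - nothing is scanned and nothing is billed, you just get the estimate back (the table reference mirrors the one used in this answer and should be treated as a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# dry_run asks BigQuery to plan the query and report bytes without running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT repository_url FROM `githubarchive.github.timeline`",
    job_config=job_config,
)
print("This query would process", job.total_bytes_processed, "bytes")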
In this case WeiGong was querying a 105 GB table. Ten SELECT * LIMIT 10 queries will quickly amount to a terabyte of data, and so on.
There are ways to make these same queries consume much less data:
Instead of querying SELECT * LIMIT 10, select only the columns you are looking for. BigQuery charges based on the columns you query, so unnecessary columns add unnecessary costs.
For example, SELECT * ... queries 105 GB, while SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions FROM [githubarchive:github.timeline] only goes through 8.72 GB, making this query more than 10 times less expensive.
Instead of "SELECT *" use tabledata.list when looking to download the whole table. It's free.
The GitHub Archive table contains data for all time. Partition it if you only want to see one month of data.
For example, extracting all of the January data with a query leaves a new table of only 91.7 MB. Querying this table is a thousand times less expensive than the big one!
SELECT *
FROM [githubarchive:github.timeline]
WHERE created_at BETWEEN '2014-01-01' and '2014-02-01'
-> save this into a new table 'timeline_201401'
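The "save this into a new table" step can be done by pointing the query at a destination table; a sketch with the Python client, rewritten in standard SQL syntax (the destination project and dataset are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Materialize one month of the big table once, then point every later query
# at the cheap monthly table instead of the full archive.
destination = bigquery.TableReference.from_string("my_project.my_dataset.timeline_201401")
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition="WRITE_TRUNCATE",
)
sql = """
SELECT *
FROM `githubarchive.github.timeline`
WHERE created_at BETWEEN '2014-01-01' AND '2014-02-01'
"""
client.query(sql, job_config=job_config).result()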
Combining these methods you can go from a $4000 bill, to a $4 one, for the same amount of quick and insightful results.
(I'm working with Github archive's owner to get them to store monthly data instead of one monolithic table to make this even easier)