How did my BigQuery bill get calculated - google-bigquery

I've been playing with BigQuery and building some aggregates for a reporting product. To my horror, I noticed that my bill so far this month is over $4000! There doesn't seem to be any way of breaking down my usage. Is there a report I can get of queries by data processed/cost?

BigQuery data processing is billed at $0.035 per gigabyte. (see pricing here). You can see how much data you're processing by looking at your query jobs.
If you're using the UI, it will tell you how much data your query processed next to the 'Run Query' button. If you're using the bq command-line tool, you can see the jobs you've run by running bq ls -j and then show how much data has been processed by each job by running bq show -j job_0bb47924271b433b895b690726099f69 (substitute your own job id here). If you're running queries by using the API directly, the number of bytes scanned is returned on the job in the statistics.totalBytesProcessed field.
If you'd like to reduce the amount you're spending, you can either use fewer columns in your queries, break your tables up into smaller pieces (e.g. daily tables), or use the batch query mode for non-time-sensitive queries which is only $0.020/GB processed.
If you break tables into smaller pieces, you can always do queries over multiple tables when needed using the ',' syntax. E.g. SELECT foo from table1, table2, table3.
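As a rough illustration of the arithmetic above, the sketch below turns a job's statistics.totalBytesProcessed value into an estimated charge using the two rates quoted in this answer. The function names are made up for the example, the rates may be out of date, and treating 1 GB as 10**9 bytes is an assumption to verify against the pricing page.

```python
# Rates quoted in the answer above (USD per GB processed); they may be outdated.
INTERACTIVE_USD_PER_GB = 0.035
BATCH_USD_PER_GB = 0.020

# Assumption: 1 GB = 10**9 bytes. Check the pricing page for the unit actually billed.
BYTES_PER_GB = 10**9


def estimate_cost_usd(total_bytes_processed, usd_per_gb=INTERACTIVE_USD_PER_GB):
    """Estimate the charge for one job from its totalBytesProcessed value."""
    return total_bytes_processed / BYTES_PER_GB * usd_per_gb


def gb_for_budget(budget_usd, usd_per_gb=INTERACTIVE_USD_PER_GB):
    """Return how many GB of processing a given spend corresponds to."""
    return budget_usd / usd_per_gb


if __name__ == "__main__":
    job_bytes = 250 * BYTES_PER_GB  # hypothetical job that scanned 250 GB
    print(f"interactive: ${estimate_cost_usd(job_bytes):.2f}")
    print(f"batch:       ${estimate_cost_usd(job_bytes, BATCH_USD_PER_GB):.2f}")
    print(f"$4000 is roughly {gb_for_budget(4000):,.0f} GB at the interactive rate")
```

Running the same calculation over every job returned by bq ls -j is one way to see which queries account for most of the bill.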

Related

Get bytes queried for entire BigQuery project history

In our Project History tab, we have hundreds of queries from our daily analytics pipelines. I am working on a review of our BigQuery billing and analyzing the cost of queries here seems like the place to start.
However, there is no column for bytes processed or for cost. We can click the ... to show job details with billing, but this is not efficient or useful for assessing costs of hundreds of queries.
The accepted answer to "Is it possible to retrieve full query history and correlate its cost in google bigquery?" is:
SELECT query, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
project_id = 'your_project_id' AND user_email = 'my@email.com'
This is not so helpful. I've run this query for our project and for some reason it only shows a handful of queries for the last month, despite our pipelines running hundreds of queries daily. I've even removed the project_id and email filters to ensure that I was not filtering away results, and still only a handful of queries...
How can I get the cost / bytes queried for all queries in the Project History tab? And why might the majority of our queries be missing from the JOBS_BY_PROJECT query?
TL;DR: It's not possible
According to the public documentation on INFORMATION_SCHEMA, the jobs views are limited to “currently running jobs, as well as the history of jobs completed in the past 180 days.” You can run bq ls -j with the bq command-line tool to check whether you get the same results as when querying INFORMATION_SCHEMA.
There has been a recent issue where heavy queries do not show up in the results; you can try querying again to see whether they appear now. If the issue persists, you can try reading and exporting the audit log (only if you have set this up previously). The audit log for a job also contains information about billed bytes; more detail and examples can be found in the BigQuery audit logs public documentation.
Also, if you need something permanent, I would recommend setting up a way to store historical data, from either the audit log or INFORMATION_SCHEMA, to query later. You can check these related posts on setting up the audit log and querying from it, and more examples of querying it.
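Once job history has been stored somewhere, ranking queries by bytes is a small aggregation. The sketch below assumes the rows have already been fetched as dictionaries with the query and total_bytes_processed columns used in the question; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def bytes_by_query(rows):
    """Sum total_bytes_processed per distinct query text, largest first."""
    totals = defaultdict(int)
    for row in rows:
        # Treat a missing or None byte count as zero (assumed possible in exports).
        totals[row["query"]] += row.get("total_bytes_processed") or 0
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical rows standing in for an export of stored job history.
    sample = [
        {"query": "SELECT a FROM t1", "total_bytes_processed": 10},
        {"query": "SELECT b FROM t2", "total_bytes_processed": 50},
        {"query": "SELECT a FROM t1", "total_bytes_processed": 15},
    ]
    for query, total in bytes_by_query(sample):
        print(total, query)
```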

Bigquery Tier 1 exceeded for partitioned table but not for by day tables

We have two tables in bigquery: one is large (a couple billion rows) and on a 'table-per-day' basis, the other is date partitioned, has the exact same schema but a small subset of the rows (~100 million rows).
I wanted to run a (standard-sql) query with a subselect in form of a join (same when subselect is in the where clause) on the small partitioned dataset. I was not able to run it because tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
1. Partitioned tables need more resources to query
2. BigQuery has some internal rules that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply make the small dataset also be on a 'table-per-day' basis in order to solve the issue. But before we do that though we would like to know if it is really going to solve our problem.
Details on the queries:
Big dataset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery and I just took a look at your jobs. It looks like your second query has an additional filter with a nested clause that your first query does not. That extra processing is likely what makes your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the query plans differ.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.

Getting table-specific costs via SQL query

Is there a query I can run to determine how much queries against each table are costing us? For instance, the result of this query would at least include something like:
dataset.table1 236TB processed
dataset.table2 56GB processed
dataset.table3 24kB processed
etc
Also is there a way to know what specific queries are costing us the most?
Thanks!
Let's talk first about data and respective data-points to do such a query!
Take a look at Job Resources
Here you have a few useful properties:
configuration.query.query - BigQuery SQL query to execute.
statistics.query.referencedTables - Referenced tables for the job.
statistics.query.totalBytesBilled - Total bytes billed for the job.
statistics.query.totalBytesProcessed - Total bytes processed for the job.
statistics.query.billingTier - Billing tier for the job.
Having the above data-points would allow you to write a relatively simple query to answer your cost-per-query and cost-per-table questions!
So, now - how to get this data available?
You can collect your jobs using the Jobs.list API and then loop through all available jobs and retrieve the respective stats via the Jobs.get API - of course dumping the retrieved data into a BigQuery table. Then you can enjoy the analysis!
Or you can use BigQuery's audit logs to track access and cost details (as described in the docs) and export them back to BigQuery for analysis.
The former option (Jobs.list and then Jobs.get in a loop) lets you get your jobs info even if you don't have audit logs enabled yet, because the Jobs.get API returns information about a specific job that is available for a six-month period after creation - so plenty of data for analysis!
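A minimal sketch of the per-table analysis, assuming the job resources have already been fetched and are available as plain dictionaries shaped like the fields listed above. Because a job reports only one total, the sketch charges the full totalBytesBilled to every table the job referenced, so tables that appear in multi-table queries are over-counted; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def bytes_billed_per_table(jobs):
    """Attribute each job's totalBytesBilled to every table it referenced.

    This is an upper bound per table, not an exact split: a job that reads
    two tables contributes its full total to both.
    """
    totals = defaultdict(int)
    for job in jobs:
        query_stats = job.get("statistics", {}).get("query", {})
        # The REST API returns 64-bit integers as strings, hence int().
        billed = int(query_stats.get("totalBytesBilled", 0))
        for ref in query_stats.get("referencedTables", []):
            totals[f'{ref["datasetId"]}.{ref["tableId"]}'] += billed
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical job resources, trimmed to the fields used here.
    sample_jobs = [
        {"statistics": {"query": {
            "totalBytesBilled": "1000",
            "referencedTables": [{"datasetId": "dataset", "tableId": "table1"}],
        }}},
        {"statistics": {"query": {
            "totalBytesBilled": "500",
            "referencedTables": [
                {"datasetId": "dataset", "tableId": "table1"},
                {"datasetId": "dataset", "tableId": "table2"},
            ],
        }}},
    ]
    print(bytes_billed_per_table(sample_jobs))
```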
In my understanding, it is currently not possible to get processed bytes per table.
It would be a great feature: it would let you identify and optimize costs, and better understand the effectiveness of partitioning and clustering changes. Currently it is only possible to get the total processed bytes for a query and to see which tables were referenced. There is no query, easy or otherwise, that makes it possible to analyze this cost at the table level, which is more granular than the query level.

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop. We want to keep a record of how many views these products had over a 1-year period, and we want to record the views at least every 2 hours. The question is what structure to use for this task?
Right now we tried keeping stats for 30 days back in records that have 2 columns classified_id,stats where stats is like a stripped json with format date:views,date:views... for example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back, since if you want to get the sum of views of, say, 1000 products you need to fetch like 30 MB from the database and calculate it yourself.
The other way we are thinking of going right now is just to have a massive table with 3 columns, classified_id,date,view, and store each recording on its own row. This of course will result in a huge table with billions of rows. For example, if we have 1.8 million classifieds and keep records 24/7 for one year every 2 hours, we need
1800000*365*12=7,884,000,000 (billions with a B) rows. While this is well within the theoretical limit of Postgres, I imagine the queries on it (say, for updating the views), even with the correct indices, will take some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
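The row-count estimate in the question can be reproduced with a few lines, which also makes it easy to try other sampling intervals. The function names are made up for the example.

```python
def rows_per_day(products, samples_per_day):
    """One row per product per sample."""
    return products * samples_per_day


def rows_per_year(products, samples_per_day, days=365):
    """Total rows accumulated over the retention period."""
    return rows_per_day(products, samples_per_day) * days


if __name__ == "__main__":
    products = 1_800_000
    samples = 24 // 2  # one sample every 2 hours
    print(f"{rows_per_day(products, samples):,} rows per day")
    print(f"{rows_per_year(products, samples):,} rows per year")
```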
This number is not as high as you think. In my current work we store metrics data for websites, and the total number of rows we have is much higher. In a previous job I worked with a pg database which collected metrics from a mobile network, and it collected ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition the data - most probably by day. With this amount of data you may find indexes quite useless. It depends on the plans you see in the EXPLAIN command output. For example, that telco app did not use any indexes at all because they would just slow down the whole engine.
Another question is how quickly you need queries to respond, and which steps in granularity (sums over hours/days/weeks etc.) you will allow users to query. You may even need to precompute some aggregations for granularities like week, month or quarter.
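The kind of pre-aggregation mentioned above can be sketched as a rollup from raw samples to coarser buckets. This is only an in-memory illustration; in practice it would more likely be a scheduled SQL job writing to summary tables. The (classified_id, timestamp, views) row shape and the function name are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def rollup(rows, granularity="day"):
    """Sum views per (classified_id, bucket) for day, week or month buckets."""
    totals = defaultdict(int)
    for classified_id, ts, views in rows:
        if granularity == "day":
            bucket = ts.date().isoformat()
        elif granularity == "week":
            iso = ts.isocalendar()  # (ISO year, ISO week, weekday)
            bucket = f"{iso[0]}-W{iso[1]:02d}"
        elif granularity == "month":
            bucket = ts.strftime("%Y-%m")
        else:
            raise ValueError(f"unsupported granularity: {granularity}")
        totals[(classified_id, bucket)] += views
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical 2-hourly samples for one product.
    samples = [
        (345422, datetime(2016, 5, 12, 0), 100),
        (345422, datetime(2016, 5, 12, 2), 150),
        (345422, datetime(2016, 5, 13, 0), 200),
    ]
    print(rollup(samples, "day"))
    print(rollup(samples, "month"))
```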
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserting ~23000 records per second using bulk inserts with the COPY command. Every bulk insert was several thousand records. Raw data were partitioned by minute. To avoid disk waits, the db had 4 tablespaces on 4 different disks/arrays, and partitions were distributed over them. PostgreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
It is also a good idea to move the pg_xlog directory to a separate disk or array. Not just a different filesystem: it must be separate HW. I can recommend SSDs only in arrays with proper error checking. Lately we had problems with a corrupted database on a single SSD.
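The throughput figures quoted in this answer can be sanity-checked with simple arithmetic. The sketch assumes 1 GB = 10**9 bytes and a perfectly even load over the day, neither of which the answer states.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def records_per_second(records_per_day):
    """Average insert rate if the load were spread evenly over the day."""
    return records_per_day / SECONDS_PER_DAY


def bytes_per_record(bytes_per_day, records_per_day):
    """Average on-disk footprint of one record."""
    return bytes_per_day / records_per_day


if __name__ == "__main__":
    records = 2_000_000_000        # ~2 billion records per day
    volume = 290 * 10**9           # ~290 GB per day, assuming 1 GB = 10**9 bytes
    print(f"{records_per_second(records):,.0f} records per second")
    print(f"{bytes_per_record(volume, records):.0f} bytes per record")
```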
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in column oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as @JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables. But read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
Just to raise another non-RDBMS option for you (so a little off topic), you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query it directly using SQL.
Since it will query free text files, you may be able to just send it unfiltered weblogs, and query them through JDBC.

BigQuery performance: Is this correct?

Folks, I'm using BigQuery as a superfast database for my analytics queries, but I'm very disappointed with its performance.
Let me show you the numbers:
Just one Table at "from" clause
Select about 15 fields with group by each, about 5 fields with SUM()
Total table rows: 3.7 millions
Total rows returned: 830K
When I execute this query on BigQuery's console, it takes about 1 minute to process. Is this ok for you? I was expecting that it will return in about 2 seconds... If I execute this query on a columnar database, like Sybase IQ, it takes less than 2 seconds.
BigQuery is a highly scalable database before being a "super fast" database. It's designed to process HUGE amounts of data by distributing the processing among several different machines using a technique named Dremel. Because it's designed to use several machines and parallel processing, you should expect super-scalability with good performance.
For example: analyzing all the wikipedia revisions in 5-10 seconds isn't bad, is it? But even a much smaller table would take about the same time.
Sybase IQ is often installed on a single machine and it doesn't use Dremel. That said, it's going to be faster than BigQuery in many scenarios... as designed.
Cheers!
Since you are returning 830k rows and BQ always creates a temporary result table, creating that table takes longer than it would for a small result.
Have you turned on large results?
We are working in a shared environment, and sometimes loads (table creation) take a while.
Performance certainly differs from a dedicated environment. You can get a dedicated environment for $20K a month.