BigQuery execution time and scaling - google-bigquery

I created a test dataset of roughly 450GB in BigQuery and I am getting execution speed of ~9 seconds to query the largest table (10bn rows) when running from WebUI. I just wanted to check if this is a 'normal' expected result and whether it would get worse with larger size (i.e. 100bn rows+) and if the queries become more complex. I am aware of table partitioning/etc. but I just want to get a sense of what is 'normal' expected speed without first getting into optimization, since the above seems like 'smallish' size for what BQ is meant for.
The above result is achieved on a simple query like this:
select ColumnA from DataSet.Table order by ColumnB desc limit 100
So the result returned to the client is very small. ColumnA is structured as UUIDs represented in String format and ColumnB is integer.

It's almost impossible to say whether this is "normal". BigQuery is a multi-tenant architecture, which means we all share the same resources (i.e. compute power) in the cluster when running queries. Query times are therefore never deterministic in BigQuery: they vary with the number of concurrent queries other users are running at any given time. That said, you can get reserved slots for a flat-rate price, although you'd need to be spending quite a lot of money to justify that.
You can improve execution times by removing compute-, shuffle- or memory-intensive steps such as ORDER BY. The complexity of the query will obviously also have an impact on query times.
On some of our projects we can smash through 3TB-5TB with a relatively complex query in about 15s-20s. Sometimes it's quicker, sometimes it's slower. We also run queries over much smaller datasets that can take the same amount of time. This is because of what I wrote at the beginning: BigQuery query times are not deterministic.
Finally, BigQuery will cache results, so if you issue the same query multiple times over the same dataset it will be returned from the cache i.e. much quicker!

Related

What is the time complexity of executing a simple query in terms of the number of rows in a table?

I'd like to know if the time taken to query from a table increases linearly with the number of rows in the table. In short, will the following query
SELECT * FROM my_table
take 10 times longer to run on average if the table has 10 times as many rows?
I think there are many factors that affect the speed of the query (like sharding of tables), but I'd like to know if on average we can expect it to be linear or perhaps sub-linear.
I tried running queries on different tables of different sizes and ended up with results that suggest it is sub-linear in time. But I'd like to make sure.
The time complexity depends on many factors, not on any single one. The time taken mostly varies with the partitioning of the table, data skew and so on. More rows will take more time for a SELECT *, but since so many factors are involved, expressing it in Big O notation is difficult.
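One rough way to check the "sub-linear" impression from your own timings is to fit a straight line to log(seconds) against log(rows); the slope is the empirical scaling exponent. The sketch below is only an illustration: the function name scaling_exponent and the sample timings are made up for this example, and a handful of noisy measurements on a shared service will not give a precise answer.

```python
import math


def scaling_exponent(measurements):
    """Least-squares slope of log(seconds) against log(rows).

    measurements: list of (row_count, seconds) pairs with at least two
    distinct row counts. A slope near 1.0 suggests linear scaling; a
    slope below 1.0 suggests sub-linear scaling.
    """
    xs = [math.log(rows) for rows, _ in measurements]
    ys = [math.log(seconds) for _, seconds in measurements]
    n = len(measurements)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical timings, NOT real measurements: (row count, seconds).
samples = [(1_000_000, 2.0), (10_000_000, 4.0), (100_000_000, 8.0)]
print(round(scaling_exponent(samples), 3))  # well below 1.0, i.e. sub-linear
```

Repeating each measurement several times and using the median would reduce the effect of run-to-run variation.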

Bigquery Tier 1 exceeded for partitioned table but not for by day tables

We have two tables in bigquery: one is large (a couple billion rows) and on a 'table-per-day' basis, the other is date partitioned, has the exact same schema but a small subset of the rows (~100 million rows).
I wanted to run a (standard-sql) query with a subselect in form of a join (same when subselect is in the where clause) on the small partitioned dataset. I was not able to run it because tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
1. Partitioned tables need more resources to query
2. Bigquery has some internal rules that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply make the small dataset also be on a 'table-per-day' basis in order to solve the issue. But before we do that though we would like to know if it is really going to solve our problem.
Details on the queries:
Big dataset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery and I just took a look at your jobs. It looks like your second query has an additional filter with a nested clause that your first query does not have. That extra processing is likely what makes your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the query plans differ.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.

bigquery overcharges when selecting just few rows

select DATE(request_time) from logs.nobids_05 limit 1
gave me "3.48 GB processed", which is a bit much considering that request_time is a field that appears in each row.
There are many other cases where just touching a column automatically adds its total size to the cost. For example,
select * from logs.nobids_05 limit 1
gives me "This query will process 274 GB when run".
I am sure bigquery does not need to read 274GB for outputting 1 row of data.
2019 update: If you cluster your tables, the cost of a SELECT * LIMIT 1 will be minimal.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
Running a "SELECT * FROM big_table LIMIT 1" with BigQuery would be the equivalent of doing this: https://www.youtube.com/watch?v=KZ-slvv_ZT4.
BigQuery is an analytical database. Its architecture and pricing are optimized for analysis at scale, not for single row handling.
Every operation in BigQuery involves a full table scan, but only of the columns mentioned in the query. The goal is to have predictable costs: Before running the query you are able to know how much data will be involved, therefore its cost. It might seem a big price to query just one row, but the good news is the cost remains constant, even when the queries get way more complex and CPU intensive.
Once in a while you might need to run a single row query, and the costs might seem excessive, but the assumption here is that you are using this tool to analyze data at scale, and the overall costs of having data stored in it should be more than competitive with other tools available. Since you've been working with other tools, I'd love to see a total cost comparison of analytical sessions within real case scenarios.
By the way, BigQuery has a better way for doing the equivalent of "SELECT * LIMIT x". It's free, and it relies on the REST API instead of querying:
https://developers.google.com/bigquery/docs/reference/v2/tabledata/list
This being said, thanks for the feedback, as there is a balancing job between making pricing more complex and the tool better suited for other jobs - and this balance is built on the feedback we get.
I don't think this is a bug. "When you run a query, you're charged according to the total data processed in the columns you select, even if you set an explicit LIMIT on the results." (https://developers.google.com/bigquery/pricing#samplecosts)
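The pricing rule quoted above comes down to simple arithmetic: the bytes billed are the stored sizes of the referenced columns, and LIMIT does not enter into it. The sketch below illustrates that with the two figures from the question (3.48 GB for request_time, 274 GB for the whole table). The helper name bytes_scanned is made up, GB is taken as 10^9 bytes for simplicity, and lumping every other column into one "other_columns" entry is an assumption, since the question does not give the per-column breakdown.

```python
def bytes_scanned(column_sizes, selected_columns):
    """Sum the stored size of every column the query references.

    A LIMIT clause is deliberately not a parameter: under the pricing
    rule quoted above it does not change the amount of data processed.
    """
    return sum(column_sizes[name] for name in selected_columns)


# Sizes in bytes. request_time is from the question; "other_columns" is
# a placeholder for everything else (274 GB total minus 3.48 GB).
column_sizes = {
    "request_time": 3_480_000_000,
    "other_columns": 270_520_000_000,
}

one_column = bytes_scanned(column_sizes, ["request_time"])
all_columns = bytes_scanned(column_sizes, list(column_sizes))

print(one_column / 1e9)   # GB processed by SELECT DATE(request_time) ... LIMIT 1
print(all_columns / 1e9)  # GB processed by SELECT * ... LIMIT 1
```

This is why selecting only the columns you need is the main lever on cost for an unclustered table.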

Understanding "Resources exceeded during query execution" with GROUP EACH BY in BigQuery

I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from the question "Resources Exceeded during query execution" that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:
1. Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
2. What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
3. How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?
The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):
SELECT
  alternative,
  snapshot_time,
  SUM(column_1),
  ...
  SUM(column_139)
FROM
  my_table
CROSS JOIN
  [table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time
(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4 )
I realize this query may be asking for trouble, but in this case, the table is only 22MB and the query results in under a million groups and it's still failing with "resources exceeded". Reducing the number of timestamps to process at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.
As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.
What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
To answer your specific questions:
1. Resources exceeded almost always means that a worker ran out of memory. This could be a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results. GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).
2. There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.
3. The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.
As a workaround, you could save the intermediate result of the CROSS JOIN as a table and run the GROUP EACH BY over that temporary table. That should let BigQuery use the expanded size when picking the number of shards. (If that doesn't work, please let me know; it is possible that we need to tweak our assignment thresholds.)
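To see how far the CROSS JOIN can inflate the data that the GROUP EACH BY actually runs over, a back-of-the-envelope upper bound is enough. The sketch below uses the two figures from the question (a 22MB table and 24 timestamps). The function name is made up, MB is taken as 10^6 bytes, and the bound ignores both the WHERE filter (which removes rows) and the extra snapshot_time column (which adds bytes); it says nothing about how BigQuery actually maps bytes to shards.

```python
def cross_join_upper_bound_bytes(left_table_bytes, right_table_rows):
    """Rough upper bound on the size of a CROSS JOIN result.

    Every row of the left table is repeated once per row of the right
    table, so the left table's bytes are multiplied by that row count.
    """
    return left_table_bytes * right_table_rows


input_bytes = 22 * 10**6   # the 22MB table from the question
timestamp_rows = 24        # the 24 unix timestamps from the question

expanded = cross_join_upper_bound_bytes(input_bytes, timestamp_rows)
print(expanded / 10**6)            # upper bound in MB
print(expanded // input_bytes)     # expansion factor relative to the input
```

Materializing the join first means the size heuristic sees something closer to this expanded figure rather than the original 22MB.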

libpq very slow for large (20 million record) database

I am new to SQL/RDBMS.
I have an application which adds rows with 10 columns in PostgreSQL server using the libpq library. Right now, my server is running on same machine as my visual c++ application.
I have added around 15-20 million records. The simple query of getting total count is taking 4-5 minutes using select count(*) from <tableName>;.
I have indexed my table with the time I am entering the data (timecode). Most of the time I need count with different WHERE / AND clauses added.
Is there any way to make things fast? I need to make it as fast as possible because once the server moves to network, things will become much slower.
Thanks
I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*) which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
That page has two suggestions. One is to count an indexed column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
If you can live with an approximate value, another is to use a Postgres specific function for an approximate value like:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run and many other factors; see the comments.
Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
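The division described above can be sketched as follows. The function name is made up, and the 6 GiB table size is a placeholder because the question does not state it; only the 4 to 5 minute scan time comes from the question (4.5 minutes is used here). In practice the size would come from pg_relation_size('tablename').

```python
def scan_throughput_mib_per_s(relation_size_bytes, scan_seconds):
    """Full-scan throughput: table size divided by count(*) duration."""
    return relation_size_bytes / (1024 * 1024) / scan_seconds


# Placeholder size (assumption); substitute the real pg_relation_size value.
relation_size_bytes = 6 * 1024**3
# Midpoint of the 4-5 minutes reported in the question.
scan_seconds = 4.5 * 60

throughput = scan_throughput_mib_per_s(relation_size_bytes, scan_seconds)
print(f"{throughput:.1f} MiB/s")
```

Comparing the result with what the disks can sustain for sequential reads shows whether the scan is I/O bound or whether something else is holding it back.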
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
The default postgres configuration is meant to not consume too much resources to play nice with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
Also note that the speed of select count(*)-style queries has nothing to do with libpq or the network, since only one resulting row is retrieved. It happens entirely server-side.
You don't state what your data is, but normally the way to handle tables with a very large amount of data is to partition the table. http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table this can be helpful.