Having a surprising performance hit when querying multiple tables, vs. one big one. Scenario:
We have a simple web analytics tool based on BigQuery. We track basic events for individual sites. For about a month, we pumped all data into one big table. Now, we are breaking the data into partitions by SITE and MONTH.
So the big table was simply [events.all]
Now we have, say, [events.events_2014_06_SITEID]
Querying individual tables for a certain group is much faster, and processing much less data. But querying our entire dataset is much, much slower on even simple queries. And our new dataset is only 1 day old, whereas the big table is 30 days old, so it's slower despite querying far less data.
For example:
select count(et) from [events.all] where et='re'
--> completed in 3.2s, processing 79mb of data. This table has 21,048,979 rows.
select count(et) from ( TABLE_QUERY(events, 'table_id CONTAINS "events_2014_"') ) where et='re'
--> completed in 44.2s, processing 1.8mb of data. Put together, these tables have 492,264 rows.
How come this occurs, and is there any way to resolve this big disparity?
Related
I have 25 tables with the same structure, but different data. Each table has 7 millions rows. To find a record I have to go through each table one by one i.e. search table 1, if the record is found then show it and exit otherwise search table 2 and so on until table 25.
The structure is:
Name, Cell Number, ID Card Number, Address
In performance perspective:
Is it ok or should I merge all tables to on large table.
At what extent I can combine the tables. (How many rows are good to be in one table and then another table should be created).
Note: I have only search query on Cell Number and ID card number
In general, it is better to store all rows in a single table rather than in multiple tables. To speed queries, you should use facilities such as indexes and partitions.
Normally, when this question comes up, the issue is small tables (think dozens of rows) versus "large" tables (think thousands or millions of rows). In that extreme case, the decision is more cut-and-dry:
There is overhead to executing searches on multiple tables. Preparing and running queries takes some effort.
There is overhead in data storage. Tables store rows on data pages and the pages are not shared with other tables. If these pages are half-filled, then the I/O time is wasted.
Any improvements on performance, such as indexes, are either wasted on small tables or need to be repeated ad infinitum.
In your case, with a handful of large tables, these considerations are weaker. There is overhead for searching tables. But then again, it takes some time to run a query against 7 million rows -- and if the query requires scanning the table, the compile time is much less than the execution time. Such large tables have minuscule amount of wasted overhead in terms of half-filled "last" pages.
What I would say instead is that storing entities across multiple tables just makes managing the database trickier, so why bother? If i had to guess, you have 25 months of history (24 months of history plus the current month). I would recommend that you store such data in a single table, perhaps partitioned by month.
We have two tables in bigquery: one is large (a couple billion rows) and on a 'table-per-day' basis, the other is date partitioned, has the exact same schema but a small subset of the rows (~100 million rows).
I wanted to run a (standard-sql) query with a subselect in form of a join (same when subselect is in the where clause) on the small partitioned dataset. I was not able to run it because tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
Partitioned tables need more resources to query
Bigquery has some internal rules that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply make the small dataset also be on a 'table-per-day' basis in order to solve the issue. But before we do that though we would like to know if it is really going to solve our problem.
Details on the queries:
Big datset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery and I just took a look at your jobs but it looks like your second query has an additional filter with a nested clause that your first query does not. It is likely that that extra processing is making your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the queries differ in the query plan.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.
I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment, I am rotating tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In this case, I will optimize my expenses - as I'm never querying more data than I have too. The question is, will be suffering a performance penalty in terms of getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (Union).
To summarize:
More tables = less data scanned
Less tables =(?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?
A lot depends how you write your queries and how much development costs, but that amount of data doesn't seam like a barrier, and thus you are trying to optimize too early.
When you JOIN tables larger than 8MB, you need to use the EACH modifier, and that query is internally paralleled.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in
shards; these are discrete chunks of data that can be processed in parallel. If
you have a 100 GB table, it might be stored in 5000 shards, which allows it to be
processed by up to 5000 workers in parallel. You shouldn’t make any assumptions
about the size of number of shards in a table. BigQuery will repartition
data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day, one recommendation is that write your create/patch script that creates tables for far in the future when it runs eg: I create the next 12 months of tables for every day now. This is better than having a script that creates tables each day. And make it part of your deploy/provisioning script.
To read more check out Chapter 11 ■ Managing Data Stored in BigQuery from the book.
I've got some highly partitionable data that I'd like to store in BigQuery, where each partition would get its own table. My question is if BQ will support the number of tables I'll need.
With my data set, I'd be creating approximately 2,000 new tables daily. All tables would have a 390 day (13 month) expiration, so eventually there'd be a constant count of ~ 2,000 tables * 390 days = ~780,000 tables in this particular project.
I'd test this myself, but BQ only supports a max of 10,000 load jobs per project per day.
Does anyone have experience with this sort of table count? Is there any official table limit provided by Google?
There are projects with that number of distinct tables today. There is not currently a hard cap on the number of distinct tables.
Some related considerations that come to mind when you're contemplating representations that use that many tables:
A query (including referenced views) can currently only reference 1000 tables.
Datasets with large numbers of tables may exhibit problematic behavior when using table wildcard functions.
You may be oversharding. Rather than lots of individual tables, you may simply want to use a wider schema and fewer tables.
If you're heavily dependent on time intervals as a sharding consideration, you may also want to look at table decorators as a way of limiting the scope of data scans.
You may also want to collapse data over time into fewer, larger tables as they age and are less frequently accessed. For example, copy jobs can append multiple source tables into a single destination table.
Most limits can be raised in BigQuery, as long as you are using BigQuery right - the limits are there to prevent abuse and misuse.
A critical question here - how much data will each table handle? Having 780,000 tables with 10 rows isn't a good idea.
How many tables do you want to handle per query? There's a hard limit of 1,000 tables per query.
If you have an interesting use case that requires higher limits, getting a support contract and their advice is the best way of having default limits raised.
https://cloud.google.com/support/
We are experimenting with BigQuery to analyze user data generated by our software application.
Our working table consists hundreds of millions of rows, each representing a unique user "session". Each containing a timestamp, UUID, and other fields describing the user's interaction with our product during that session. We currently generate about 2GB of data (~10M rows) per day.
Every so often we may run queries against the entire dataset (about 2 months worth right now, and growing), However typical queries will span just a single day, week, or month. We're finding out that as our table grows, our single-day query becomes more and more expensive (as we would expect given BigQuery architecture)
What isthe best way to query subsets of of our data more efficiently? One approach I can think of is to "partition" the data into separate tables by day (or week, month, etc.) then query them together in a union:
SELECT foo from
mytable_2012-09-01,
mytable_2012-09-02,
mytable_2012-09-03;
Is there a better way than this???
BigQuery now supports table partitions by date:
https://cloud.google.com/blog/big-data/2016/03/google-bigquery-cuts-historical-data-storage-cost-in-half-and-accelerates-many-queries-by-10x
Hi David: The best way to handle this is to shard your data across many tables and run queries as you suggest in your example.
To be more clear, BigQuery does not have a concept of indexes (by design), so sharding data into separate tables is a useful strategy for keeping queries as economically efficient as possible.
On the flip side, another useful feature for people worried about having too many tables is to set an expirationTime for tables, after which tables will be deleted and their storage reclaimed - otherwise they will persist indefinitely.