I've got some highly partitionable data that I'd like to store in BigQuery, where each partition would get its own table. My question is whether BQ will support the number of tables I'll need.
With my data set, I'd be creating approximately 2,000 new tables daily. All tables would have a 390 day (13 month) expiration, so eventually there'd be a constant count of ~ 2,000 tables * 390 days = ~780,000 tables in this particular project.
I'd test this myself, but BQ only supports a max of 10,000 load jobs per project per day.
Does anyone have experience with this sort of table count? Is there any official table limit provided by Google?
There are projects with that number of distinct tables today. There is not currently a hard cap on the number of distinct tables.
Some related considerations that come to mind when you're contemplating representations that use that many tables:
A query (including referenced views) can currently only reference 1000 tables.
Datasets with large numbers of tables may exhibit problematic behavior when using table wildcard functions.
You may be oversharding. Rather than lots of individual tables, you may simply want to use a wider schema and fewer tables.
If you're heavily dependent on time intervals as a sharding consideration, you may also want to look at table decorators as a way of limiting the scope of data scans.
You may also want to collapse data over time into fewer, larger tables as they age and are less frequently accessed. For example, copy jobs can append multiple source tables into a single destination table.
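As a rough sketch of that consolidation idea (the dataset and shard names here are made up, and it's shown as a standard-SQL CREATE TABLE AS SELECT over a wildcard rather than an actual copy job):

-- Roll a month of daily shards (events_20160101 ... events_20160131) into one
-- monthly table; the shards need compatible schemas for the wildcard to work.
-- A copy job with multiple comma-separated source tables achieves the same result.
CREATE TABLE `myproject.mydataset.events_month_201601` AS
SELECT *
FROM `myproject.mydataset.events_201601*`;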
Most limits can be raised in BigQuery, as long as you are using BigQuery right - the limits are there to prevent abuse and misuse.
A critical question here - how much data will each table handle? Having 780,000 tables with 10 rows each isn't a good idea.
How many tables do you want to handle per query? There's a hard limit of 1,000 tables per query.
If you have an interesting use case that requires higher limits, getting a support contract and their advice is the best way of having default limits raised.
https://cloud.google.com/support/
Related
I have a SQL database of data delivered in a normalized format with several tables that have several billion rows of data. I have decided to partition the large tables into separate tables by itemId, since when I query the data I only care about one item at a time. I would end up with 5,000+ tables after partitioning the data. The problem is that partitioning the data takes about 25 minutes to build a single table for one item.
5000 items x 25 minutes = 86.8 days
It would take over 86 days to fully partition my entire SQL database. My entire database is about 2.5TB.
Is this something I can leverage AWS for to parallelize at the item level? Can I use AWS Database Migration Service to host the database in its current form and then use an AWS process to churn through all of the 5,000 queries to partition the big tables into 5,000 smaller tables with 2M rows each?
If not, is this something I just have to throw more hardware at to make it run faster (CPU or RAM)?
Thanks in advance.
This doesn't seem like a good strategy. For one thing, simple arithmetic shows that 10,000,000,000 rows at 5,000 rows per item results in 2,000,000 partitions in the table.
The limit in Redshift (by default) is 1,000,000 partitions per table:
Amazon Redshift Spectrum has the following quotas when using the Athena or AWS Glue data catalog:
* A maximum of 10,000 databases per account.
* A maximum of 100,000 tables per database.
* A maximum of 1,000,000 partitions per table.
* A maximum of 10,000,000 partitions per account.
You should re-think your partitioning strategy. Or perhaps your problem is not suitable for Redshift. There may be other database strategies more suitable for your use-case. (This is not the forum for recommending specific software solutions, however.)
Use the itemid as SORTKEY and DISTKEY. If the table is vacuumed properly and you select one itemid, this should give good results, with access time almost as good as with a dedicated table per item. DISTKEY is used to distribute the data between shards, which means each itemid's blocks would be stored together on the same shard, making retrieval of all of them faster. Having itemid also be the SORTKEY means that, for itemids with small row counts whose rows all live on the same shard, finding those rows within the table's blocks on that shard would be as fast as possible.
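A minimal sketch of that table layout, assuming a single wide table keyed by itemid (the other column names are placeholders):

-- Co-locate each item's rows on one slice (DISTKEY) and keep them physically
-- ordered (SORTKEY) so a WHERE itemid = ... query touches as few blocks as possible.
CREATE TABLE item_data (
    itemid      BIGINT    NOT NULL,
    recorded_at TIMESTAMP,
    value       DOUBLE PRECISION
)
DISTSTYLE KEY
DISTKEY (itemid)
SORTKEY (itemid);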
Creating a separate table for each item, where every other attribute of the table remains the same, doesn't seem logical. If the data format is the same, then keep the data in the same table unless there is a particular problem to overcome.
If you set the itemId as the SORTKEY on a Redshift table, then Redshift will be able to skip over the blocks that do not contain a desired value (when using WHERE itemId = 'xxx'). This will be highly efficient.
Admittedly, trying to keep such a large table sorted would probably be too hard to VACUUM. It would still work reasonably well without the SORTKEY since blocks can still be skipped, but not as efficiently because the data for that itemId would be spread over more blocks.
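For context, the maintenance being discussed looks roughly like this (the table name is an assumption); on a very large, frequently loaded table these commands can run for a long time:

-- Re-sort the table on its SORTKEY and refresh the planner's statistics.
VACUUM SORT ONLY item_data;
ANALYZE item_data;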
We have data of different dimensions, for example:
Name by Company
Stock prices by Date, Company
Commodity prices by Date & Commodity
Production volumes by Date, Commodity & Company
We're thinking of the best way of storing these in BigQuery. One potential method is to put them all in the same table, and nest the extra dimensions.
That would mean:
Almost all the data would be nested - e.g. there would be a single 'row' for each Company, and then its prices would be nested by Date.
Data would have to share at least one dimension - I don't think there would be a way of representing Commodity prices in a table whose first column was the company's Name.
Are there disadvantages? Are there performance implications? Is it sensible to nest 5000 dates + associated values within each company's row?
It's common to have nested/repeated columns in BigQuery schemas since it makes reasoning about the data easier. Firebase produces schemas with repetition at many levels, for instance. If you flatten everything, the downside is you need some kind of unique ID for each row in order to associate events with each other, and then you'll need aggregations (using the ID as a key) rather than simple filters if you want to do any kind of counting.
As for downsides of nested/repeated schemas, one is that you may find yourself performing complicated transformations of the structure with ARRAY subqueries or STRUCT operators, for instance. These are generally fast, but they do have some overhead relative to queries without any structure imposed on the result at all.
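For instance, a query over a nested schema of the kind described in the question might look like this (the companies/prices/close field names are hypothetical):

-- One row per company, with prices stored as a nested, repeated field.
-- A correlated subquery over UNNEST pulls out just the dates of interest.
SELECT
  c.name,
  (SELECT AVG(p.close)
   FROM UNNEST(c.prices) AS p
   WHERE p.date BETWEEN DATE '2016-01-01' AND DATE '2016-03-31') AS avg_q1_close
FROM `myproject.mydataset.companies` AS c;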
My best suggestion would be to load some data and run some experiments. Storage and querying both are relatively cheap, so you can try a few different schema shapes and see which works better for your purposes.
Updating in BigQuery is pretty new, but based on the publicly available BigQuery DML documentation, it is currently limited to only 48 UPDATE statements per table per day.
Quotas
DML statements are significantly more expensive to process than SELECT statements.
* Maximum UPDATE/DELETE statements per day per table: 48
* Maximum UPDATE/DELETE statements per day per project: 500
* Maximum INSERT statements per day per table: 1,000
* Maximum INSERT statements per day per project: 10,000
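Given those quotas, per-row updates are a non-starter; if you do use DML, batch all pending changes into a single statement. A hypothetical sketch (table and column names are made up):

-- Apply every staged change in one UPDATE instead of one statement per row.
UPDATE `myproject.mydataset.users` u
SET status = s.new_status
FROM `myproject.mydataset.staged_changes` s
WHERE u.user_id = s.user_id;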
Processing nested data is also very expensive since all of the data from that column is loaded on every query. It is also slow if you are doing a lot of operations on nested data.
I am currently using BigQuery to store user information and compute aggregate results against huge log data. But modifying the data is not possible. To overcome this, I am planning to store each user record in a separate table. I understand BigQuery supports querying from multiple tables, which I can use to get all the information. My doubts here are:
As the number of users grows, will performance deteriorate compared to storing all the users in a single table? And are there any limitations on the number of tables per dataset in BigQuery?
Thanks in advance
From what I know, there is no hard limit on the number of tables in a dataset.
At the same time, the native BQ UI only shows the first 10,000 tables in a dataset.
Other limits to consider (just a few to mention):
* Daily update limit: 1,000 updates per table per day;
* A query (including referenced views) can reference at most 1,000 tables;
* Each additional table involved in a query (with hundreds and hundreds of tables) has a considerable impact on performance;
* Even if each table is small, it will still be charged at a minimum of 10 MB (even if it holds just a few KB).
Not knowing your exact scenario, I can't make a specific recommendation, but at least you've got answers to those items in your question.
Overall, the idea of having a table per user doesn't sound good to me.
I'm using BigQuery for ~5 billion rows that can be partitioned on ~1 million keys.
Since our queries are usually by partition key, is it possible to create ~1 million tables (1 table / key) to limit the total number of bytes processed?
We also need to query all of the data together at times, which is easy to do by putting it all in one table, but I'm hoping to use the same platform for partitioned analysis as bulk analytics.
That might work, but partitioning your table this finely is highly discouraged. You might be better off partitioning your data into a smaller number of tables, say 10 or 100, and querying just the one(s) you need.
What do I mean by discouraged? First, each of those million tables will get charged a minimum of 10 MB for storage. So you'll get charged for 9 TB of storage, when you likely have a lot less data than that. Second, you'll likely hit rate limits when you try to create that many tables. Third, managing a million tables is very tricky; the BigQuery UI will likely not be much help. Fourth, you'll make the engineers on the BigQuery team exceedingly grumpy, and they'll start trying to figure out whether we need to raise the minimum size for tables.
Also, if you do want to sometimes query all of your data, partitioning this finely is likely going to make things difficult for you, unless you are willing to store your data multiple times. You can only reference 1000 tables in a query, and each one you reference causes you to take a performance hit.
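If you do go the coarser route, one hypothetical way to split the data is to hash the partition key into a fixed number of buckets (all names here are assumptions, and the key is cast to STRING for hashing):

-- Route rows into one of 100 bucket tables instead of one table per key.
-- Repeat (or script) this statement for buckets 0 through 99.
CREATE TABLE `myproject.mydataset.events_bucket_042` AS
SELECT *
FROM `myproject.mydataset.raw_events`
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(partition_key AS STRING)), 100)) = 42;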
We are experimenting with BigQuery to analyze user data generated by our software application.
Our working table consists of hundreds of millions of rows, each representing a unique user "session". Each row contains a timestamp, a UUID, and other fields describing the user's interaction with our product during that session. We currently generate about 2GB of data (~10M rows) per day.
Every so often we may run queries against the entire dataset (about 2 months' worth right now, and growing); however, typical queries will span just a single day, week, or month. We're finding that as our table grows, our single-day queries become more and more expensive (as we would expect given BigQuery's architecture).
What is the best way to query subsets of our data more efficiently? One approach I can think of is to "partition" the data into separate tables by day (or week, month, etc.) and then query them together in a union:
SELECT foo FROM
  [mydataset.mytable_20120901],
  [mydataset.mytable_20120902],
  [mydataset.mytable_20120903];
Is there a better way than this?
BigQuery now supports table partitions by date:
https://cloud.google.com/blog/big-data/2016/03/google-bigquery-cuts-historical-data-storage-cost-in-half-and-accelerates-many-queries-by-10x
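With a date-partitioned table, a single-day query scans only that day's partition rather than the whole table. A sketch in standard SQL (table and column names are assumptions):

-- Create a partitioned copy of the data, then filter on the partitioning column
-- so that only the matching daily partition is scanned.
CREATE TABLE `myproject.mydataset.sessions_partitioned`
PARTITION BY DATE(session_timestamp) AS
SELECT * FROM `myproject.mydataset.sessions`;

SELECT COUNT(*)
FROM `myproject.mydataset.sessions_partitioned`
WHERE DATE(session_timestamp) = DATE '2012-09-01';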
Hi David: The best way to handle this is to shard your data across many tables and run queries as you suggest in your example.
To be more clear, BigQuery does not have a concept of indexes (by design), so sharding data into separate tables is a useful strategy for keeping queries as economically efficient as possible.
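If you shard by day as suggested, the legacy-SQL table wildcard functions save you from listing every shard by hand. A sketch, assuming daily shards named mytable_YYYYMMDD in a dataset called mydataset:

-- Legacy SQL: query every daily shard between the two dates, inclusive.
SELECT foo
FROM (TABLE_DATE_RANGE([mydataset.mytable_],
                       TIMESTAMP('2012-09-01'),
                       TIMESTAMP('2012-09-03')));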
On the flip side, another useful feature for people worried about having too many tables is to set an expirationTime for tables, after which tables will be deleted and their storage reclaimed - otherwise they will persist indefinitely.
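For reference, expiration can also be set after a table is created; in current standard SQL DDL (newer than the API-level expirationTime setting mentioned above, with an assumed table name) it looks like:

-- Expire this shard 390 days from now; BigQuery deletes it automatically.
ALTER TABLE `myproject.mydataset.mytable_20120901`
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 390 DAY)
);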