How to partition 10 billion row SQL tables quickly using AWS? - sql

I have a SQL database of data delivered in a normalized format with several tables that have several billions of rows of data. I have decided to partition the large tables into separate tables by itemId since when I query the data I only care about 1 item at a time. I would end up having 5000+ tables at the end after partitioning the data. The problem is, partitioning the data takes about 25 minutes to build a single table for 1 item.
5000 items x 25 minutes = 86.8 days
It would take over 86 days to fully partition my entire SQL database. My entire database is about 2.5TB.
Is this something I can leverage AWS for to parallelize on an item level? Can I use AWS database migration services to host the database in its current form and then use AWS process to churn through all of the 5000 queries to partition the big tables into 5000 smaller tables with 2M rows each?
If not, is this something I just have to throw more hardware at to make it run faster (CPU or RAM)?
Thanks in advance.

This doesn't seem like a good strategy. For one thing, simple arithmetic is that 10,000,000,000 rows with 5,000 rows per item results in 2,000,000 partitions in the table.
The limit in Redshift (by default) is 1,000,000 partition per table:
Amazon Redshift Spectrum has the following quotas when using the
Athena or AWS Glue data catalog:
A maximum of 10,000 databases per account.
A maximum of 100,000 tables per database.
A maximum of 1,000,000 partitions per table.
A maximum of 10,000,000 partitions per account.
You should re-think your partitioning strategy. Or perhaps your problem is not suitable for Redshift. There may be other database strategies more suitable for your use-case. (This is not the forum for recommending specific software solutions, however.)

Use the itemid as sortkey and distkey. if the table is vacummed properly and you select one itemid this should have good results, where access time is almost as good as a single table. distkey is used to distribute the data between shards, which means each itemid's blocks would be stored together on the same shard making retrieving all of them faster. Having the itemid also be sortkey means that for itemid's with small row numbers that all exist on the same shard, finding the rows within the table's blocks on a shard would be as fast as possible.

Creating a separate table for each item, where every other attribute of the table remains the same, doesn't seem logical. If the data format is the same, then keep the data in the same table unless there is a particular problem to overcome.
If you set the itemId as the SORTKEY on a Redshift table, then Redshift will be able to skip-over the blocks that do not contain a desired value (when using WHERE itemId = 'xxx'). This will be highly efficient.
Admittedly, trying to keep such a large table sorted would probably be too hard to VACUUM. It would still work reasonably well without the SORTKEY since blocks can still be skipped, but not as efficiently because the data for that itemId would be spread over more blocks.

Related

Databricks datamart updated optimization

I had a performance issue to ask your input on.
This is singly based in Databricks on Azure with storage on Azure Data Lake Storage. And tech stack is not more than 2 years old and is all up to the most recent release.
Say I have a datamart Delta table, 100 columns, 30,000,000 rows, grows 225,000 rows every calendar-quarter.
There isn't a Datawarehouse in this architecture, so the newest 225,000 rows are simply appended to the datamart; 30,000,000+ and growing every quarter.
Two columns are a dimension key Dim1_cd and a matching Dim1_desc.
There are 36 other dimensions key-value pairs in the datamart much like Dim1 is a key-value pair.
The datamart is a list of transactions, has a Period column, eg. "2021Q3", and Period is the first level and only partition of the datamart.
The partition divides the delta table into 15 partition folders currently. Each with 100Meg-ish size files numbering in the 150-parquet file range per partition folder.
A calendar-quarter later a new set of files is delivered and to be appended to the datamart, one of which is a Dim1_lookup.txt file which is first read into a Dim1_deltaTable; and has only two columns Dim1_cd and Dim1_desc. Each row is 3rd normal form distinct. On disk, the Dim1_lookup.txt file is only 55K large.
Applying this dimension’s newest version will sometimes have to take only 3-4 minutes, where there are not any Dim1_desc values needing to be updated. Other times, there are 20,000 to 100,000 updates to be written across 100s to 1000s of parquet files and can take an unpleasant long time.
Of course, writing the code of a delta table update to apply the Dim1_deltaTable is no big challenge.
But what can you suggest how to optimize the updates?
Ideally you might have a Datawarehouse backing the datamart, but not the case in this architecture.
You might want to partition Dim1_desc to take advantage of delta’s data skipping but there are 36 other _desc fields you have the same update concerns for.
What do you consider possible to optimize the update/minimize update processing time?

Hard limit on number of tables in a BQ project

I've got some highly partitionable data that I'd like to store in BigQuery, where each partition would get its own table. My question is if BQ will support the number of tables I'll need.
With my data set, I'd be creating approximately 2,000 new tables daily. All tables would have a 390 day (13 month) expiration, so eventually there'd be a constant count of ~ 2,000 tables * 390 days = ~780,000 tables in this particular project.
I'd test this myself, but BQ only supports a max of 10,000 load jobs per project per day.
Does anyone have experience with this sort of table count? Is there any official table limit provided by Google?
There are projects with that number of distinct tables today. There is not currently a hard cap on the number of distinct tables.
Some related considerations that come to mind when you're contemplating representations that use that many tables:
A query (including referenced views) can currently only reference 1000 tables.
Datasets with large numbers of tables may exhibit problematic behavior when using table wildcard functions.
You may be oversharding. Rather than lots of individual tables, you may simply want to use a wider schema and fewer tables.
If you're heavily dependent on time intervals as a sharding consideration, you may also want to look at table decorators as a way of limiting the scope of data scans.
You may also want to collapse data over time into fewer, larger tables as they age and are less frequently accessed. For example, copy jobs can append multiple source tables into a single destination table.
Most limits can be raised in BigQuery, as long as you are using BigQuery right - the limits are there to prevent abuse and misuse.
A critical question here - how much data will each table handle? Having 780,000 tables with 10 rows isn't a good idea.
How many tables do you want to handle per query? There's a hard limit of 1,000 tables per query.
If you have an interesting use case that requires higher limits, getting a support contract and their advice is the best way of having default limits raised.
https://cloud.google.com/support/

Relation with DB size and performance

Is there any relation between DB size and performance in my case:
There is a table in my Oracle DB that is used for logging. Now it has almost close to over 120 million rows and increases at a rate of 1000 rows per min. Each row has 6-7 columns with basic string data.
It is for our client. We never take any data from there but we might need that in case of any issues. However its fine if we clean up every month or so.
However the actual issue is will it affect performance of other transactional tables in the same db? Assuming the disk space as unlimited.
If 1000 rows/minute are being inserted into this table then about 40 million rows would be added per month. If this table has indexes I'd say that the biggest issue will be that eventually index maintenance will become a burden on the system, so in that case I'd expect performance to be affected.
This table seems like a good candidate for partitioning. If it's partitioned on the date/time that each row is added, with each partition containing one month's worth of data, maintenance would be much simpler. The partitioning scheme can be set up so that partitions are created automatically as needed (assuming you're on Oracle 11 or higher), and then when you need to drop a month's worth of data you can just drop the partition containing that data, which is a quick operation which doesn't burden the system with a large number of DELETE operations.
Best of luck.

Partitioning by date?

We are experimenting with BigQuery to analyze user data generated by our software application.
Our working table consists hundreds of millions of rows, each representing a unique user "session". Each containing a timestamp, UUID, and other fields describing the user's interaction with our product during that session. We currently generate about 2GB of data (~10M rows) per day.
Every so often we may run queries against the entire dataset (about 2 months worth right now, and growing), However typical queries will span just a single day, week, or month. We're finding out that as our table grows, our single-day query becomes more and more expensive (as we would expect given BigQuery architecture)
What isthe best way to query subsets of of our data more efficiently? One approach I can think of is to "partition" the data into separate tables by day (or week, month, etc.) then query them together in a union:
SELECT foo from
mytable_2012-09-01,
mytable_2012-09-02,
mytable_2012-09-03;
Is there a better way than this???
BigQuery now supports table partitions by date:
https://cloud.google.com/blog/big-data/2016/03/google-bigquery-cuts-historical-data-storage-cost-in-half-and-accelerates-many-queries-by-10x
Hi David: The best way to handle this is to shard your data across many tables and run queries as you suggest in your example.
To be more clear, BigQuery does not have a concept of indexes (by design), so sharding data into separate tables is a useful strategy for keeping queries as economically efficient as possible.
On the flip side, another useful feature for people worried about having too many tables is to set an expirationTime for tables, after which tables will be deleted and their storage reclaimed - otherwise they will persist indefinitely.

What is a good size (# of rows) to partition a table to really benefit?

I.E. if we have got a table with 4 million rows.
Which has got a STATUS field that can assume the following value: TO_WORK, BLOCKED or WORKED_CORRECTLY.
Would you partition on a field which will change just one time (most of times from to_work to worked_correctly)? How many partitions would you create?
The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.
For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.
Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.
The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.
So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows - if not all - rows will end up in the WORKED_CORRECTLY. So that partition will become enormous compared to the others, and the chances of benefits from partitioning become even more remote.
Finally, your proposed scheme is not elastic. As the current volume each partition would have 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.
So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change and one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.
This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.
The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.
What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?
Assuming you are trying to improve query performance? Do all your queries have predicates on the STATUS column? Are they doing single row lookups of rows? Or would you want your queries to scan an entire partition?