How do explicit table partitions in Databricks affect write performance?

We have the following scenario:
We have an existing table containing approx. 15 billion records. It was not explicitly partitioned on creation.
We are creating a copy of this table with partitions, hoping for faster read time on certain types of queries.
Our tables are on Databricks Cloud, and we use Databricks Delta.
We commonly filter by two columns, one of which is the ID of an entity (350k distinct values) and one of which is the date at which an event occurred (31 distinct values so far, but increasing every day!).
So, in creating our new table, we ran a query like this:
CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (entity_id, date)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
This query has run for 48 hours and counting. We know that it is making progress, because we have found around 250k prefixes corresponding to the first partition key in the relevant S3 prefix, and there are certainly some big files in the prefixes that exist.
However, we're having some difficulty monitoring exactly how much progress has been made, and how much longer we can expect this to take.
While we waited, we tried out a query like this:
CREATE TABLE a_test_table (
entity_id STRING,
another_id STRING,
timestamp TIMESTAMP,
date DATE
)
USING DELTA
PARTITIONED BY (date);
INSERT INTO a_test_table
SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01'
Notice the main difference in the new table's schema here is that we partitioned only on date, not on entity id. The date we chose contains almost exactly four percent of the old table's data, which I want to point out because it's much more than 1/31. Of course, since we are selecting by a single value that happens to be the same thing we partitioned on, we are in effect only writing one partition, vs. the probably hundred thousand or so.
The creation of this test table took 16 minutes using the same number of worker-nodes, so we would expect (based on this) that the creation of a table 25x larger would only take around 7 hours.
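For reference, the extrapolation above can be written out as a quick back-of-the-envelope calculation. This is only a sketch: it assumes write time scales linearly with data volume, which is exactly the assumption the 48-hour run calls into question. The variable names are illustrative.

```python
# Figures from the question; the linear-scaling model is an assumption.
test_minutes = 16      # time to write the single-date test table
test_fraction = 0.04   # share of the old table covered by that one date

scale = 1 / test_fraction                    # full table is ~25x larger
estimated_hours = test_minutes * scale / 60  # naive linear extrapolation

print(f"scale factor: {scale:.0f}x")
print(f"estimated full write: {estimated_hours:.1f} hours")
```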
This answer appears to partially acknowledge that using too many partitions can cause the problem, but the underlying causes appear to have greatly changed in the last couple of years, so we seek to understand what the current issues might be; the Databricks docs have not been especially illuminating.
Based on the posted request rate guidelines for S3, it seems like increasing the number of partitions (key prefixes) should improve performance. The partitions being detrimental seems counter-intuitive.
In summary: we are expecting to write many thousands of records into each of many thousands of partitions. It appears that reducing the number of partitions dramatically reduces the amount of time it takes to write the table data. Why would this be true? Are there any general guidelines on the number of partitions that should be created for data of a certain size?

You should partition your data by date because it sounds like you are continually adding data as time passes chronologically. This is the generally accepted approach to partitioning time series data. It means that you will be writing to one date partition each day, and your previous date partitions are not updated again (a good thing).
You can of course use a secondary partition key if your use case benefits from it (e.g. PARTITIONED BY (date, entity_id)).
Partitioning by date will necessitate that your reading of this data will always be made by date as well, to get the best performance. If this is not your use case, then you would have to clarify your question.
How many partitions?
No one can give you an answer on how many partitions you should use, because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker has to process huge amounts of data while other workers are idle. In your case that would happen if one entity_id accounted for 20% of your data set, for example. Partitioning by date assumes that each day has roughly the same amount of data, so each worker is kept equally busy.
I don't know specifically how Databricks writes to disk, but on Hadoop I would want to see each worker node writing its own file part, so that write performance is parallelized at this level.

I am not a Databricks expert at all, but hopefully these bullets can help.
Number of partitions
The number of partitions and files created will impact the performance of your job no matter what, especially when using S3 as data storage. However, this number of files should be handled easily by a cluster of decent size.
Dynamic partition
There is a huge difference between partitioning dynamically by two keys instead of one. Let me try to address this in more detail.
When you partition data dynamically, depending on the number of tasks and the size of the data, a large number of small files can be created per partition. This could (and probably will) hurt the performance of later jobs that read this data, especially if your data is stored in ORC, Parquet or any other columnar format. Note that writing this way requires only a map-only job.
This issue is addressed in different ways, the most common being file consolidation. For this, the data is repartitioned in order to create bigger files. As a result, shuffling of data is required.
Your queries
For your first query, the number of partitions will be 350k*31 (around 11 million!), which is really big considering the amount of shuffling and the number of tasks required to handle the job.
For your second query (which takes only 16 minutes), the number of tasks and the amount of shuffling required are much smaller.
The number of partitions (shuffling, sorting, task scheduling, etc.) and the execution time of your job do not have a linear relationship, which is why the math doesn't add up in this case.
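To put rough numbers on the two layouts, here is a small sketch using the figures from the question. It treats 350k*31 as an upper bound (it assumes every entity has events on every date, which may not hold) and it assumes records are spread evenly, which they are not.

```python
# Figures from the question; even distribution is an assumption.
entity_ids = 350_000
dates = 31
total_records = 15_000_000_000

# Partitioning by (entity_id, date): upper bound on partition directories.
partitions_both = entity_ids * dates
records_per_partition_both = total_records / partitions_both

# Partitioning by date only.
partitions_date = dates
records_per_partition_date = total_records / partitions_date

print(f"(entity_id, date): {partitions_both:,} partitions, "
      f"~{records_per_partition_both:,.0f} records each")
print(f"date only: {partitions_date} partitions, "
      f"~{records_per_partition_date:,.0f} records each")
```

With two keys, each partition holds on the order of a thousand records, so nearly every output file is tiny.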
Recommendation
I think you already got it: you should split your ETL job into 31 different queries (one per date), which will allow you to optimize the execution time.

My recommendations when using partition columns are:
Identify the cardinality of all the columns and select those whose number of distinct values stays bounded over time; this excludes identifiers and date columns.
Identify the main way the table is searched; perhaps it is by date or by some categorical field.
Generate derived columns with a bounded cardinality in order to speed up the search. For example, dates can be decomposed into year, month, day, etc., and integer identifiers can be bucketed using the remainder of an integer division (ID % N).
As I mentioned earlier, partitioning on high-cardinality columns causes poor performance, because it generates a lot of files, which is the worst case to work with.
It is advisable to work with files that do not exceed 1 GB; to achieve this when creating the Delta table, it is recommended to use "coalesce(1)".
If you need to perform updates or insertions, specify as many of the partition columns as possible to rule out unnecessary file reads, which is very effective at reducing run times.
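A minimal sketch of the derived-column idea, in plain Python rather than Spark. The function names and the bucket count of 100 are illustrative assumptions, and it assumes integer IDs (the question's entity_id is a STRING, so it would need hashing or casting first).

```python
from datetime import date


def date_parts(d: date) -> dict:
    """Split a date into low-cardinality parts usable as partition columns."""
    return {"year": d.year, "month": d.month, "day": d.day}


def id_bucket(entity_id: int, num_buckets: int = 100) -> int:
    """Map a high-cardinality integer ID onto a fixed number of buckets."""
    return entity_id % num_buckets


print(date_parts(date(2018, 12, 1)))
print(id_bucket(123456))
```

However many distinct IDs arrive later, the bucket column never has more than num_buckets values, so the partition count stays bounded.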

Related

Hive external table optimal partition size

What is the optimal size for external table partition?
I am planning to partition table by year/month/day and we are getting about 2GB of data daily.
The optimal table partitioning is the one that matches your table usage scenario.
Partitioning should be chosen based on:
how the data is being queried (if you need to work mostly with daily data then partition by date).
how the data is being loaded (parallel threads should load their own partitions, without overlapping).
2 GB is not too much even for one file, though again it depends on your usage scenario. Avoid unnecessarily complex and redundant partitions like (year, month, date); in this case date is enough for partition pruning.
Hive partition definitions are stored in the metastore, so too many partitions will take up a lot of space in the metastore.
Partitions are stored as directories in HDFS, so many partition keys will produce hierarchical directories, which makes scanning them slower.
Your query will be executed as a MapReduce job, so it is useless to make partitions that are too tiny.
It depends on the case; think about how your data will be queried. For your case I prefer one key defined as 'yyyymmdd', which gives 365 partitions per year, only one level in the table directory and 2 GB of data per partition, which is nice for a MapReduce job.
For completeness, if you use Hive < 0.12, make your partition key string typed, see here.
Useful blog here.
Hive partitioning is most effective in cases where the data is sparse. By sparse I mean that the data internally has visible partitions such as by year, month or day.
In your case, partitioning by date doesn't make much sense, as each day will have 2 GB of data, which is not too big to handle. Partitioning by week or month makes more sense, as it will optimize the query time and will not create too many small partition files.

What is a good size (# of rows) to partition a table to really benefit?

For example, suppose we have a table with 4 million rows,
which has a STATUS field that can take one of the following values: TO_WORK, BLOCKED or WORKED_CORRECTLY.
Would you partition on a field which will change just once (most of the time from TO_WORK to WORKED_CORRECTLY)? How many partitions would you create?
The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.
For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.
Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.
The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.
So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows, if not all, will end up in WORKED_CORRECTLY. So that partition will become enormous compared to the others, and the chances of benefits from partitioning become even more remote.
Finally, your proposed scheme is not elastic. At the current volume each partition would have 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.
So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change and one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.
This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.
The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.
What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?
Assuming you are trying to improve query performance: do all your queries have predicates on the STATUS column? Are they doing single-row lookups? Or would you want your queries to scan an entire partition?

How can I improve performance of average method in SQL?

I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?
The DB in question is PostgreSQL and I'm aware that a particular index type might not be available, but I'm also interested in the theoretical answer: whether this is even possible without some sort of caching solution.
To be more specific, the data in question is essentially a log with this sort of definition:
table log {
int duration
date time
string event
}
I'm doing queries like
SELECT avg(duration) FROM log WHERE event = 'finished'; -- gets average time to completion
SELECT avg(duration) FROM log WHERE event = 'finished' and date > $yesterday; -- average today
The second one is always fairly fast since it has a more restrictive WHERE clause, but the total average duration one is the type of query that is causing the problem. I understand that I could cache the values, using OLAP or something; my question is whether there is a way I can do this entirely by DB side optimisations such as indices.
The performance of calculating an average will always get slower the more records you have, as it always has to use values from every record in the result.
An index can still help, if the index contains less data than the table itself. Creating an index for the field that you want the average for generally isn't helpful as you don't want to do a lookup, you just want to get to all the data as efficiently as possible. Typically you would add the field as an output field in an index that is already used by the query.
It depends on what you are doing. If you aren't filtering the data, then beyond having the clustered index in order, how else is the database to calculate an average of the column?
There are systems which perform online analytical processing (OLAP) which will do things like keeping running sums and averages over the information you wish to examine. It all depends on what you are doing and your definition of "slow".
If you have a web based program for instance, perhaps you can generate an average once a minute and then cache it, serving the cached value out to users over and over again.
Speeding up aggregates is usually done by keeping additional tables.
Assuming a sizeable table detail(id, dimA, dimB, dimC, value), if you would like the performance of AVG (or other aggregate functions) to be nearly constant time regardless of the number of records, you could introduce a new table
dimAavg(dimA, avgValue)
The size of this table will depend only on the number of distinct values of dimA. Furthermore, this table could make sense in your design, as it can hold the domain of the values available for dimA in detail (and other attributes related to the domain values); you might already have, or should have, such a table.
This table is only helpful if you analyze by dimA only; once you need AVG(value) by dimA and dimB it becomes useless. So, you need to know by which attributes you will want to do fast analysis. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ..., which may or may not grow pretty quickly.
Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
For example let us assume that system predominantly does inserts and only occasionally updates and deletes.
Let's further assume that you want to analyze by dimA only and that ids are increasing. Then having a structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without a big impact on the system.
This is because you could have triggers that fire not on every insert but, let's say, on every 100 inserts.
This way you can still get accurate aggregates from this table and the detail table with
SELECT a.dimA, (SUM(d.value)+MAX(a.Total))/(COUNT(d.id)+MAX(a.Count)) as avgDimA
FROM detail d INNER JOIN
dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA
The above query, with proper indexes, would get one row from dimA_agg and fewer than 100 rows from detail. This would perform in near constant time (~log_fanout(n)) and would not require an update to dimA_agg for every insert (reducing update penalties).
The value of 100 was just given as an example; you should find the optimal value yourself (or even keep it variable, though triggers alone will not be enough in that case).
Deletes and updates must still fire on each operation, but you can check whether the id of the record being deleted or updated is already included in the stats, to avoid unnecessary updates (this will save some I/O).
Note: The analysis is done for a domain with discrete attributes; when dealing with time series the situation gets more complicated, as you have to decide the granularity of the domain in which you want to keep the summary.
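The arithmetic behind the dimA_agg approach can be sketched outside the database. This is an illustration only, assuming a single dimA value: the stored Total, Count and LastID are combined with whatever detail rows have arrived since. The function name and the sample data are made up.

```python
def combined_average(total, count, last_id, detail_rows):
    """Average over summarised rows plus detail rows newer than last_id.

    detail_rows is an iterable of (id, value) pairs for one dimA value.
    """
    new_values = [value for row_id, value in detail_rows if row_id > last_id]
    grand_total = total + sum(new_values)
    grand_count = count + len(new_values)
    return grand_total / grand_count if grand_count else None


# Rows 1-3 are already summarised (Total=60, Count=3, LastID=3);
# rows 4 and 5 arrived after the last trigger run.
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
print(combined_average(60, 3, 3, rows))
```

One difference from the SQL above: a group with no rows newer than LastID still gets an average here, whereas an inner join on d.id > a.LastID would return no row for it.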
EDIT
There are also materialized views.
Just a guess, but indexes won't help much, since an average must read all the records (in any order). Indexes are useful to find subsets of rows, but if you have to iterate over all rows with no special ordering, indexes do not help.
This might not be what you're looking for, but if your table has some way to order the data (e.g. by date), then you can just do incremental computations and store the results.
For example, if your data has a date column, you could compute the average for records 1 - Date1 then store the average for that batch along with Date1 and the #records you averaged. The next time you compute, you restrict your query to results Date1..Date2, and add the # of records, and update the last date queried. You have all the information you need to compute the new average.
When doing this, it would obviously be helpful to have an index on the date, or whatever column(s) you are using for the ordering.
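A minimal sketch of that incremental update, assuming you stored the previous average and record count and have just queried the values between Date1 and Date2. The function name is hypothetical.

```python
def merge_average(old_avg, old_count, batch_values):
    """Fold a new batch of values into a previously stored average.

    Returns the updated (average, count) pair to store for next time.
    """
    batch_values = list(batch_values)
    new_count = old_count + len(batch_values)
    if new_count == 0:
        return None, 0
    # Recover the old sum from the stored average, then add the new batch.
    new_total = old_avg * old_count + sum(batch_values)
    return new_total / new_count, new_count


# Stored so far: average 20.0 over 3 records; two new records arrive.
avg, n = merge_average(20.0, 3, [40, 50])
print(avg, n)
```

Storing the running sum instead of the average would avoid the multiplication and some floating-point drift, but either form carries the same information.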

Handling 100's of 1,000,000's of rows in T-SQL2005

I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding the count of rows from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes). You might see that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime and providing functions to convert from datetime to int, thus having queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size saving will probably come at a performance cost if you need to do complex datetime queries. On cubes, though, there is the standard technique of storing an int in YYYYMMDD format instead of an epoch.
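The two integer encodings can be illustrated in Python (not T-SQL). This is a sketch with hypothetical helper names, and it assumes dates are interpreted as midnight UTC.

```python
from datetime import datetime, timezone


def date_to_epoch(text: str) -> int:
    """Seconds since 1970-01-01 for a 'YYYY-MM-DD' date, taken as UTC."""
    dt = datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def date_to_yyyymmdd(text: str) -> int:
    """Human-readable integer key, e.g. '2010-01-23' becomes 20100123."""
    dt = datetime.strptime(text, "%Y-%m-%d")
    return dt.year * 10000 + dt.month * 100 + dt.day


print(date_to_epoch("2010-01-23"))
print(date_to_yyyymmdd("2010-01-23"))
```

Both forms sort in date order, so range predicates still work. The YYYYMMDD form is easier to read in query results, while the epoch form keeps time-of-day resolution.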
What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store similar set of items, you need one and one only table. Setting multiple tables to store the same type of things will cause problems, like:
Queries would be extremely difficult to write, and performance will be decreased if you have to query from multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require automating a bunch of tasks. Let's say you have a table per year. If a new record is inserted on 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert whether you must create a new table? How would it affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example, you use only the last month's data on a daily basis, and you have to keep everything older but do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server will handle recent data. The second, less available and optimized for writing, will store everything else. Then, on a schedule, a program will move old data from the first one to the second.
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run in to performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes, which means that you can store the entire table in less than 4 GB of memory. If you have a decent server (8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time while the cache is populated.
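The 21-byte figure can be reproduced from the SQL Server storage sizes for the four columns. This is a rough sketch that counts raw column data only and ignores per-row and per-page overhead, so the real footprint will be somewhat larger.

```python
# Storage sizes in bytes for the SQL Server types in question.
INT_BYTES = 4        # int
DATETIME_BYTES = 8   # datetime
DECIMAL9_BYTES = 5   # decimal with precision 1-9

row_bytes = 2 * INT_BYTES + DATETIME_BYTES + DECIMAL9_BYTES
rows = 200_000_000

total_bytes = row_bytes * rows
total_gib = total_bytes / 2**30

print(f"{row_bytes} bytes per row, {total_gib:.2f} GiB for {rows:,} rows")
```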

Efficiently storing 7.300.000.000 rows

How would you tackle the following storage and retrieval problem?
Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:
id (unique row identifier)
entity_id (takes on values between 1 and 2.000.000 inclusive)
date_id (incremented with one each day - will take on values between 1 and 3.650 (ten years: 1*365*10))
value_1 (takes on values between 1 and 1.000.000 inclusive)
value_2 (takes on values between 1 and 1.000.000 inclusive)
entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The database must be able to hold 10 years worth of daily data (7.300.000.000 rows (3.650*2.000.000)).
What is described above is the write patterns. The read pattern is simple: all queries will be made on a specific entity_id. I.e. retrieve all rows describing entity_id = 12345.
Transactional support is not needed, but the storage solution must be open-sourced. Ideally I'd like to use MySQL, but I'm open for suggestions.
Now - how would you tackle the described problem?
Update: I was asked to elaborate regarding the read and write patterns. Writes to the table will be done in one batch per day where the new 2M entries will be added in one go. Reads will be done continuously with one read every second.
"Now - how would you tackle the described problem?"
With simple flat files.
Here's why:
"all queries will be made on a specific entity_id. I.e. retrieve all rows describing entity_id = 12345."
You have 2.000.000 entities. Partition based on entity number:
level1= entity/10000
level2= (entity/100)%100
level3= entity%100
Each file of data is then level1/level2/level3/batch_of_data
You can then read all of the files in a given part of the directory to return samples for processing.
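The three-level layout above might look like this in code. It is a sketch: the function name is made up, and the directory names are left unpadded because the answer does not say whether to zero-pad them.

```python
from pathlib import PurePosixPath


def entity_dir(entity_id: int) -> PurePosixPath:
    """Directory for an entity under the level1/level2/level3 scheme."""
    level1 = entity_id // 10000         # 0..200 for ids up to 2,000,000
    level2 = (entity_id // 100) % 100   # 0..99
    level3 = entity_id % 100            # 0..99
    return PurePosixPath(str(level1), str(level2), str(level3))


print(entity_dir(12345))
```

Each directory then holds at most about a hundred entries at levels two and three, which keeps directory listings small.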
If someone wants a relational database, then load files for a given entity_id into a database for their use.
Edit On day numbers.
The date_id/entity_id uniqueness rule is not something that has to be handled. It's (a) trivially imposed on the file names and (b) irrelevant for querying.
The date_id "rollover" doesn't mean anything -- there's no query, so there's no need to rename anything. The date_id should simply grow without bound from the epoch date. If you want to purge old data, then delete the old files.
Since no query relies on date_id, nothing ever needs to be done with it. It can be the file name for all that it matters.
To include the date_id in the result set, write it in the file with the other four attributes that are in each row of the file.
Edit on open/close
For writing, you have to leave the file(s) open. You do periodic flushes (or close/reopen) to assure that stuff really is going to disk.
You have two choices for the architecture of your writer.
Have a single "writer" process that consolidates the data from the various source(s). This is helpful if queries are relatively frequent. You pay for merging the data at write time.
Have several files open concurrently for writing. When querying, merge these files into a single result. This is helpful if queries are relatively rare. You pay for merging the data at query time.
Use partitioning. With your read pattern you'd want to partition by entity_id hash.
You might want to look at these questions:
Large primary key: 1+ billion rows MySQL + InnoDB?
Large MySQL tables
Personally, I'd also think about calculating your row width to give you an idea of how big your table will be (as per the partitioning note in the first link).
HTH.,
S
Your application appears to have the same characteristics as mine. I wrote a MySQL custom storage engine to efficiently solve the problem. It is described here
Imagine your data is laid out on disk as an array of 2M fixed length entries (one per entity) each containing 3650 rows (one per day) of 20 bytes (the row for one entity per day).
Your read pattern reads one entity. It is contiguous on disk, so it takes 1 seek (about 8 millisecs) and a read of 3650x20 = 73,000 bytes (call it 80K) at maybe 100MB/sec, so it is done in a fraction of a second, easily meeting your 1-query-per-second read pattern.
The update has to write 20 bytes in 2M different places on disk. In the simplest case this would take 2M seeks, each of which takes about 8 millisecs, so it would take 2M*8ms = about 4.5 hours. If you spread the data across 4 “raid0” disks it could take 1.125 hours.
However, the places are only 80K apart, which means there are 200 such places within a 16MB block (typical disk cache size), so it could operate at anything up to 200 times faster (about 1 minute). Reality is somewhere between the two.
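The estimates above can be checked with a few lines of arithmetic. This sketch uses the answer's own figures (8 ms per seek, 20-byte rows, a 16MB cache, entities spaced about 80K apart); the variable names are illustrative and real hardware will differ.

```python
entities = 2_000_000
days = 3650
row_bytes = 20
seek_seconds = 0.008

# One entity's ten-year history, stored contiguously.
entity_block_bytes = days * row_bytes

# Worst case for the daily update: one seek per entity.
worst_case_hours = entities * seek_seconds / 3600
striped_hours = worst_case_hours / 4   # spread over 4 striped disks

# Best case: every write landing in one cached 16MB block shares a seek.
places_per_cache = 16_000_000 // 80_000
best_case_seconds = entities * seek_seconds / places_per_cache

print(entity_block_bytes, "bytes per entity")
print(f"worst case {worst_case_hours:.2f} h, striped {striped_hours:.2f} h")
print(f"best case {best_case_seconds:.0f} s")
```

The unrounded worst case comes out slightly under 4.5 hours; the answer rounds up before dividing by four.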
My storage engine operates on that kind of philosophy, although it is a little more general purpose than a fixed length array.
You could code exactly what I have described. Putting the code into a MySQL pluggable storage engine means that you can use MySQL to query the data with various report generators etc.
By the way, you could eliminate the date and entity id from the stored row (because they are the array indexes), and maybe the unique id too if you don't really need it, since (entity id, date) is unique, and store the 2 values as 3-byte ints. Then your stored row is 6 bytes, and you have 700 updates per 16M, and therefore faster inserts and a smaller file.
Edit Compare to Flat Files
I notice that comments generally favor flat files. Don't forget that directories are just indexes implemented by the file system and they are generally optimized for relatively small numbers of relatively large items. Access to files is generally optimized so that it expects a relatively small number of files to be open, and has a relatively high overhead for open and close, and for each file that is open. All of those "relatively" are relative to the typical use of a database.
Using file system names as an index for an entity id, which I take to be a non-sparse integer from 1 to 2 million, is counter-intuitive. In programming you would use an array, not a hash table, for example, and you are inevitably going to incur a great deal of overhead for an expensive access path that could simply be an array indexing operation.
Therefore if you use flat files, why not use just one flat file and index it?
Edit on performance
The performance of this application is going to be dominated by disk seek times. The calculations I did above determine the best you can do (although you can make INSERT quicker by slowing down SELECT - you can't make them both better). It doesn't matter whether you use a database, flat-files, or one flat-file, except that you can add more seeks that you don't really need and slow it down further. For example, indexing (whether it's the file system index or a database index) causes extra I/Os compared to "an array look up", and these will slow you down.
Edit on benchmark measurements
I have a table that looks very much like yours (or almost exactly like one of your partitions). It was 64K entities not 2M (1/32 of yours), and 2788 'days'. The table was created in the same INSERT order that yours will be, and has the same index (entity_id,day). A SELECT on one entity takes 20.3 seconds to inspect the 2788 days, which is about 130 seeks per second as expected (on 8 millisec average seek time disks). The SELECT time is going to be proportional to the number of days, and not much dependent on the number of entities. (It will be faster on disks with faster seek times. I'm using a pair of SATA2s in RAID0 but that isn't making much difference).
If you re-order the table into entity order
ALTER TABLE x ORDER BY (ENTITY,DAY)
Then the same SELECT takes 198 millisecs (because it is reading the ordered entity in a single disk access).
However the ALTER TABLE operation took 13.98 DAYS to complete (for 182M rows).
There are a few other things the measurements tell you:
1. Your index file is going to be as big as your data file. It is 3GB for this sample table. That means (on my system) all index access runs at disk speeds, not memory speeds.
2. Your INSERT rate will decline logarithmically. The INSERT into the data file is linear, but the insert of the key into the index is log. At 180M records I was getting 153 INSERTs per second, which is also very close to the seek rate. It shows that MySQL is updating a leaf index block for almost every INSERT (as you would expect, because it is indexed on entity but inserted in day order). So you are looking at 2M/153 secs = 3.6 hrs to do your daily insert of 2M rows (divided by whatever effect you can get by partitioning across systems or disks).
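The two rates quoted from the benchmark follow directly from the measurements. A quick sketch with the answer's own numbers (variable names are illustrative):

```python
# SELECT of one entity: 2788 day-rows read in 20.3 seconds.
rows_read = 2788
select_seconds = 20.3
seeks_per_second = rows_read / select_seconds

# Daily load: 2M rows at the measured 153 INSERTs per second.
daily_rows = 2_000_000
inserts_per_second = 153
insert_hours = daily_rows / inserts_per_second / 3600

print(f"{seeks_per_second:.0f} seeks/s, {insert_hours:.1f} h per daily load")
```

Both rates sit close to the roughly 125 seeks per second that an 8 ms average seek time allows, which is the point the answer is making: each row costs about one seek.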
I had a similar problem (although at a much bigger scale: about your yearly usage every day).
Using one big table got me screeching to a halt - you can pull a few months but I guess you'll eventually partition it.
Don't forget to index the table, or else you'll be messing with tiny trickle of data every query; oh, and if you want to do mass queries, use flat files
Your description of the read patterns is not sufficient. You'll need to describe what amounts of data will be retrieved, how often and how much deviation there will be in the queries.
This will allow you to consider doing compression on some of the columns.
Also consider archiving and partitioning.
If you want to handle huge data with millions of rows, it can be treated like a time series database, which logs the time and saves the data. Some options for storing such data are InfluxDB and MongoDB.