Optimize data extraction while reading from ADL table - azure-data-lake

We are inserting data into an ADL table using the round-robin distribution scheme. In another job, we extract data from the table for three different partitions and observe an uneven number of vertices per partition. For example, one partition gets 56 vertices for 264 GB of data, while another gets only 2 vertices for 209 GB of data. The partitions with few vertices take a very long time to complete. In the attached picture, I am not sure why SV5 and SV3 have only 2 vertices. Is there any way to optimize this and increase the number of vertices for these partitions?
Here is a script for table:
CREATE TABLE IF NOT EXISTS dbo.<tablename>
(
abc string,
def string,
<Other columns>
xyz int,
INDEX clx_abc_def CLUSTERED(abc, def ASC)
)
PARTITIONED BY (xyz)
DISTRIBUTED BY ROUND ROBIN;
Update:
Here is a script for data insertion:
INSERT INTO dbo.<tablename>
(
abc,
def,
<Other columns>
xyz
)
ON INTEGRITY VIOLATION IGNORE
SELECT *
FROM #logs;
I am doing multiple (at most 3) inserts into a partition. In another job, I am also selecting the data, doing some processing, truncating the partition, and then inserting the data back into it. I want to know why the default ROUND ROBIN distribution scheme creates only 2 distributions for SV5 and SV3; I would expect more distributions for this amount of data.

Given that you insert in different ways, it looks like some of your scripts, such as the one that INSERTs the data SV1 is reading, get a good compile-time estimate, while others cause U-SQL to estimate very badly. When you use ROUND ROBIN but do not specify a distribution count, U-SQL picks one for you based on the compile-time estimate of the data size; the same is true for HASH and DIRECT HASH. The most reliable mitigation is to specify the number of distributions explicitly with the INTO clause whenever you have a good idea of the distribution you want. Anything from 50 to 200 should keep you in the sweet spot.
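For example, a sketch of the same table DDL with an explicit distribution count (100 is just an arbitrary value inside that 50-200 range):
CREATE TABLE IF NOT EXISTS dbo.<tablename>
(
abc string,
def string,
<Other columns>
xyz int,
INDEX clx_abc_def CLUSTERED(abc, def ASC)
)
PARTITIONED BY (xyz)
DISTRIBUTED BY ROUND ROBIN INTO 100;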

I see that you use both partitions and distributions inside the partitions.
Do you insert the data all at once into the partition or do you have multiple INSERT statements per partition?
If the latter, please note that each INSERT statement adds a new file to the partition that then gets processed by its own vertex.
Also, the ROUND ROBIN distribution applies to each partition file individually.
So you may end up with a lot of distribution groups that are extracted.
If my interpretation of your scenario is correct, please use ALTER TABLE REBUILD to compact the partitions.
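If that is the case, here is a sketch of the compaction step (the partition value 42 is a placeholder for one of your xyz partition values):
ALTER TABLE dbo.<tablename> REBUILD;                 // rebuild and compact the whole table
ALTER TABLE dbo.<tablename> REBUILD PARTITION (42);  // or compact a single partition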

Related

How do explicit table partitions in Databricks affect write performance?

We have the following scenario:
We have an existing table containing approx. 15 billion records. It was not explicitly partitioned on creation.
We are creating a copy of this table with partitions, hoping for faster read time on certain types of queries.
Our tables are on Databricks Cloud, and we use Databricks Delta.
We commonly filter by two columns, one of which is the ID of an entity (350k distinct values) and one of which is the date at which an event occurred (31 distinct values so far, but increasing every day!).
So, in creating our new table, we ran a query like this:
CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (entity_id, date)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
This query has run for 48 hours and counting. We know that it is making progress, because we have found around 250k prefixes corresponding to the first partition key in the relevant S3 prefix, and there are certainly some big files in the prefixes that exist.
However, we're having some difficulty monitoring exactly how much progress has been made, and how much longer we can expect this to take.
While we waited, we tried out a query like this:
CREATE TABLE a_test_table (
entity_id STRING,
another_id STRING,
timestamp TIMESTAMP,
date DATE
)
USING DELTA
PARTITIONED BY (date);
INSERT INTO a_test_table
SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01'
Notice that the main difference in the new table's schema here is that we partitioned only on date, not on entity_id. The date we chose contains almost exactly four percent of the old table's data, which I want to point out because it's much more than 1/31. Of course, since we are selecting by a single value that happens to be the same thing we partitioned on, we are in effect writing only one partition, vs. probably a hundred thousand or so.
The creation of this test table took 16 minutes using the same number of worker-nodes, so we would expect (based on this) that the creation of a table 25x larger would only take around 7 hours.
This answer appears to partially acknowledge that using too many partitions can cause the problem, but the underlying causes appear to have greatly changed in the last couple of years, so we seek to understand what the current issues might be; the Databricks docs have not been especially illuminating.
Based on the posted request rate guidelines for S3, it seems like increasing the number of partitions (key prefixes) should improve performance. The partitions being detrimental seems counter-intuitive.
In summary: we are expecting to write many thousands of records in to each of many thousands of partitions. It appears that reducing the number of partitions dramatically reduces the amount of time it takes to write the table data. Why would this be true? Are there any general guidelines on the number of partitions that should be created for data of a certain size?
You should partition your data by date because it sounds like you are continually adding data as time passes chronologically. This is the generally accepted approach to partitioning time series data. It means that you will be writing to one date partition each day, and your previous date partitions are not updated again (a good thing).
You can of course use a secondary partition key if your use case benefits from it (e.g. PARTITIONED BY (date, entity_id)).
Partitioning by date means that, to get the best performance, your reads of this data should also filter by date. If that is not your use case, then you would have to clarify your question.
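A minimal sketch of that layout, reusing the column and table names from the question (the new table name events_by_date is made up), together with a read that prunes to a single date partition:
CREATE TABLE events_by_date
USING DELTA
PARTITIONED BY (date)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table;

SELECT * FROM events_by_date
WHERE date = '2018-12-01';   -- only this one date partition is scanned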
How many partitions?
No one can tell you how many partitions you should use, because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker has to process huge amounts of data while the other workers are idle. In your case that would happen if, for example, one entity_id accounted for 20% of your data set. Partitioning by date has to assume that each day has roughly the same amount of data, so that each worker is kept equally busy.
I don't know specifically how Databricks writes to disk, but on Hadoop I would expect each worker node to write its own file part, so your write performance is parallelized at that level.
I am not a Databricks expert at all, but hopefully these points can help.
Number of partitions
The number of partitions and files created will impact the performance of your job no matter what, especially with S3 as the data storage; however, this number of files should be handled easily by a cluster of decent size.
Dynamic partitioning
There is a huge difference between partitioning dynamically by your two keys and partitioning by only one; let me try to address this in more detail.
When you partition data dynamically, depending on the number of tasks and the size of the data, a large number of small files can be created per partition. This can (and probably will) hurt the performance of subsequent jobs that need to read this data, especially if it is stored in ORC, Parquet, or another columnar format. Note that the write itself only requires a map-only job.
The issue described above is addressed in different ways, the most common being file consolidation: the data is repartitioned so that bigger files are produced, which in turn requires shuffling the data.
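On Databricks Delta specifically, that consolidation can also be done after the fact with the OPTIMIZE command (assuming your runtime provides it), which compacts small files, optionally one partition at a time:
OPTIMIZE the_new_table WHERE date = '2018-12-01';   -- compact the small files of a single date partition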
Your queries
For your first query, the number of partitions will be 350k * 31 (around 11 million!), which is really big considering the amount of shuffling and the number of tasks required to handle the job.
For your second query (which takes only 16 minutes), the number of tasks and the amount of shuffling required are much smaller.
The number of partitions (and the associated shuffling, sorting, task scheduling, etc.) and the execution time of your job do not have a linear relationship, which is why the math doesn't add up in this case.
Recommendation
I think you already got it: you should split your ETL job into 31 different queries, which will allow you to optimize the execution time.
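A sketch of one of those 31 queries, reusing the test-table INSERT from the question (only the date literal changes between runs):
INSERT INTO a_test_table
SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-02';   -- next run: '2018-12-03', and so on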
My recommendations for choosing partition columns are:
Identify the cardinality of all the columns and select those whose number of distinct values stays bounded over time; therefore exclude identifiers and raw date columns.
Identify the main way the table is searched, perhaps by date or by some categorical field.
Generate derived columns with a finite cardinality in order to speed up the search; for example, a date can be decomposed into year, month, day, etc., and integer identifiers can be bucketed with a modulus such as id % n (see the sketch after these recommendations).
As I mentioned earlier, partitioning on high-cardinality columns will cause poor performance by generating a lot of files, which is the worst case to work with.
It is advisable to work with files that do not exceed 1 GB; to achieve this when creating the Delta table, it is recommended to use coalesce(1).
If you need to perform updates or insertions, specify as many of the partition columns as possible to rule out unnecessary file reads, which is very effective at reducing times.
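To illustrate these recommendations, here is a hypothetical sketch (the table name events_bucketed and the choice of 16 ID buckets are made up) that partitions on derived low-cardinality columns instead of the raw high-cardinality keys:
CREATE TABLE events_bucketed
USING DELTA
PARTITIONED BY (event_year, event_month, id_bucket)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
year(from_unixtime(timestamp)) AS event_year,
month(from_unixtime(timestamp)) AS event_month,
abs(hash(entity_id)) % 16 AS id_bucket   -- modulus bucket over the high-cardinality ID
FROM the_old_table;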

How to increase number of reducers during insert into partitioned clustered transactional table?

We have a clustered transactional table (10k buckets) which seems to be inefficient for the following two use cases
merges with daily deltas
queries based on date range.
What we want to do is to partition table by date and thus create partitioned clustered transactional table. Daily volume suggests number of buckets to be around 1-3, but inserting into the newly created table produces number_of_buckets reduce tasks which is too slow and causes some issues with merge on reducers due to limited hard drive.
Both issues are solvable (for instance, we could split the data into several chunks and start separate jobs to insert into the target table in parallel using n_jobs*n_buckets reduce tasks though it would result in several reads of the source table) but i believe there should be the right way to do that, so the question is: what is this right way?
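A rough sketch of that chunked workaround (table and column names here are hypothetical; each statement would be submitted as a separate job so the inserts run in parallel):
INSERT INTO TABLE target_tbl PARTITION (event_date = '2018-12-01')
SELECT id, payload
FROM source_tbl
WHERE event_date = '2018-12-01'
  AND pmod(hash(id), 4) = 0;   -- chunk 0 of 4; the other jobs filter on 1, 2 and 3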
P.S. Hive version: 1.2.1000.2.6.4.0-91

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTSTYLE is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst the nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on every node for distribution, so this shouldn't have been too bad.
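If the redistribution matters, or if a sort key would help downstream queries, the keys can be declared directly in the CTAS. A sketch, with placeholder column names:
CREATE TABLE tbl
DISTSTYLE KEY
DISTKEY (join_col)      -- a column used in the joins of query A
SORTKEY (filter_col)    -- a column used in the WHERE clauses of query A
AS
A;                      -- the same query A as above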
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of query A. It is possible that Redshift was able to optimize the count by reading very little data from disk, because it realized that very few columns (or perhaps none) needed to be read to produce the count. This really depends on the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (work that wasn't needed for the count). If the query involves many JOIN and WHERE clauses, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE AS writes all the data returned by the query to disk, while the count query does not; that explains the difference. Writing all the rows is a much more expensive operation than just computing a row count.

Query Postgres table by Block Range Index (BRIN) identifier directly

I have N client machines. I want to load each machine with a distinct partition of a BRIN index.
That requires:
creating a BRIN index with a predefined number of partitions, equal to the number of client machines
sending queries from the clients that use a WHERE clause on the BRIN partition identifier instead of a filter on the indexed column
The main goal is a performance improvement when loading a single table from Postgres into distributed client machines, keeping an equal number of rows across the clients, or close to equal if the row count does not divide evenly by the machine count.
I can achieve this currently by maintaining an extra column that chunks my table into a number of buckets equal to the number of client machines (or by computing row_number() over (order by datetime) % N on the fly). That approach is not efficient in time or memory, and the BRIN index looks like a nice feature that could speed up such use cases.
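A sketch of that on-the-fly version for 3 clients, written against the bigtable example defined below:
SELECT datetime, value
FROM (
    SELECT datetime, value,
           row_number() OVER (ORDER BY datetime) % 3 AS bucket
    FROM bigtable
) t
WHERE bucket = 0;   -- each of the 3 clients filters on its own value (0, 1 or 2)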
Minimal reproducible example for 3 client machines:
CREATE TABLE bigtable (datetime TIMESTAMPTZ, value TEXT);
INSERT INTO bigtable VALUES ('2015-12-01 00:00:00+00'::TIMESTAMPTZ, 'txt1');
INSERT INTO bigtable VALUES ('2015-12-01 05:00:00+00'::TIMESTAMPTZ, 'txt2');
INSERT INTO bigtable VALUES ('2015-12-02 02:00:00+00'::TIMESTAMPTZ, 'txt3');
INSERT INTO bigtable VALUES ('2015-12-02 03:00:00+00'::TIMESTAMPTZ, 'txt4');
INSERT INTO bigtable VALUES ('2015-12-02 05:00:00+00'::TIMESTAMPTZ, 'txt5');
INSERT INTO bigtable VALUES ('2015-12-02 16:00:00+00'::TIMESTAMPTZ, 'txt6');
INSERT INTO bigtable VALUES ('2015-12-02 23:00:00+00'::TIMESTAMPTZ, 'txt7');
Expected output:
client 1
2015-12-01 00:00:00+00, 'txt1'
2015-12-01 05:00:00+00, 'txt2'
2015-12-02 02:00:00+00, 'txt3'
client 2
2015-12-02 03:00:00+00, 'txt4'
2015-12-02 05:00:00+00, 'txt5'
client 3
2015-12-02 16:00:00+00, 'txt6'
2015-12-02 23:00:00+00, 'txt7'
The question:
How can I create a BRIN index with a predefined number of partitions and run queries that filter on partition identifiers instead of filtering on the indexed column?
Optionally, is there any other way that BRIN (or other Postgres goodies) can speed up the task of loading multiple clients in parallel from a single table?
It sounds like you want to shard a table over many machines, and have each local table (one shard of the global table) have a BRIN index with exactly one bucket. But that does not make any sense. If the single BRIN index range covers the entire (local) table, then it can never be very helpful.
It sounds like what you are looking for is partitioning with CHECK constraints that can be used for partition exclusion. PostgreSQL has supported that for a long time with table inheritance (although not with each partition on a separate machine). Using this method, the range covered by the CHECK constraint has to be set explicitly for each partition. This ability to explicitly specify the bounds sounds like exactly what you are looking for, just using a different technology.
But the partition-exclusion constraint code doesn't work well with modulus. The code is smart enough to know that WHERE id=5 only needs to check the CHECK (id BETWEEN 1 AND 10) partition, because it knows that id=5 implies that id is between 1 and 10. More accurately, it knows the contrapositive of that.
But the code was never written to know that WHERE id=5 implies that id%10 = 5%10, even though humans know that. So if you build your partitions on modulus operators, like CHECK (id%10=5), rather than on ranges, you would have to sprinkle all your queries with WHERE id = $1 AND id % 10 = $1 % 10 if you wanted them to take advantage of the constraints.
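A minimal sketch of that range-based inheritance setup against the bigtable example from the question (boundary values chosen arbitrarily to give three roughly equal chunks; in a real setup the parent stays empty and rows go into the children):
CREATE TABLE bigtable_p1 (CHECK (datetime <  '2015-12-02 03:00:00+00'::TIMESTAMPTZ)) INHERITS (bigtable);
CREATE TABLE bigtable_p2 (CHECK (datetime >= '2015-12-02 03:00:00+00'::TIMESTAMPTZ
                              AND datetime <  '2015-12-02 16:00:00+00'::TIMESTAMPTZ)) INHERITS (bigtable);
CREATE TABLE bigtable_p3 (CHECK (datetime >= '2015-12-02 16:00:00+00'::TIMESTAMPTZ)) INHERITS (bigtable);

SET constraint_exclusion = partition;   -- the default; lets the planner skip excluded children
SELECT * FROM bigtable
WHERE datetime < '2015-12-02 03:00:00+00'::TIMESTAMPTZ;   -- client 1: only bigtable_p1 (and the parent) is scanned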
Going by your description and comments, I'd say you're looking in the wrong direction. You want the table split up front so access is fast and simple, but without actually having to split it up front, because that would require knowing the number of nodes in advance, which is variable if I understand correctly. And regardless, splitting takes quite a bit of processing too.
To be honest, I'd go about your problem differently. Instead of assigning every record to a bucket, I'd suggest assigning every record a pseudo-random value in a given range. I don't know about Postgres, but in MSSQL I'd use BINARY_CHECKSUM(NEWID()) instead of RAND(), mainly because the random function is harder to use in a set-based way there. You could also use some hashing code that returns a reasonable working space. Anyway, in the MSSQL case the resulting value is a signed integer sitting somewhere in the range -2^31 to +2^31 (give or take; check the documentation for the exact boundaries). When the master machine decides to assign n client machines, each machine can be given an exact range that, given the properties of the randomizer/hashing algorithm, will cover a reasonably close approximation of the workload divided by n.
Assuming you have an index on the selection field, this should be reasonably fast, regardless of whether you decide to split the table into a thousand or a million chunks.
PS: mind that this approach will only work 'properly' if the number of rows to process (greatly) outnumbers the number of machines doing the processing. With small numbers you might see several machines getting nothing while others do all the work.
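A rough Postgres analogue of that idea for the bigtable example, sketched here as an assumption (hashtext() is an internal Postgres hashing function, used purely for illustration; the 3-way split matches the 3 clients):
SELECT datetime, value
FROM bigtable
WHERE abs(hashtext(value)) % 3 = 0;   -- client 1; clients 2 and 3 filter on 1 and 2 instead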
Basically, all you need to know is the size of the relation (in pages) after loading; then set the pages_per_range storage parameter to that page count divided by the desired number of partitions.
No need to introduce an artificial partition ID, because BRIN supports a sufficient range of types and operators. Physical table layout is important here, so if you insist on the partition ID being the key and end up introducing an out-of-order mapping between the natural load order and the artificial partition ID, make sure you cluster the table on that column's sort order before creating the BRIN index.
However, at the same time, remember that more distinct values have a better chance of hitting the index than fewer, so high cardinality is better: an artificial partition identifier will have 1/n the cardinality of a natural key, where n is the number of distinct values per partition.
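A minimal sketch on the bigtable example (pages_per_range = 32 is an arbitrary value; for a real table you would pick roughly size_in_pages / number_of_clients):
CREATE INDEX bigtable_datetime_brin ON bigtable
USING brin (datetime) WITH (pages_per_range = 32);

-- Clients then filter on the natural column, and each scan stays within the
-- block ranges that cover "its" slice of the physically ordered table:
SELECT datetime, value FROM bigtable
WHERE datetime >= '2015-12-01 00:00:00+00'::TIMESTAMPTZ
  AND datetime <  '2015-12-02 03:00:00+00'::TIMESTAMPTZ;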

How can we use same partition schema with different partition function?

I'm learning table partitioning.
When I read this page, it said that
The TransactionHistoryArchive table must have the same design schema as the TransactionHistory table. There must also be an empty partition to receive the new data. In this case, TransactionHistoryArchive is a partitioned table that consists of just two partitions.
In the following picture, we can see that TransactionHistory has 12 partitions, while TransactionHistoryArchive has just 2.
Illustration http://i.msdn.microsoft.com/dynimg/IC38652.gif
How is that possible? Please help me understand it.
As long as the two individual partitions have an identical schema and the same boundary values, you can switch them. The tables don't need to have the same partition scheme or function.
This is because SQL Server ensures that the binary data of those partitions on disk is compatible. That's the magic of partitioning and why you can move arbitrary amounts of data as a quick metadata-only operation.
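A sketch of the switch itself on the tables from that example (the partition numbers are illustrative; the receiving partition must be empty, as the quoted documentation notes):
ALTER TABLE TransactionHistory
SWITCH PARTITION 1 TO TransactionHistoryArchive PARTITION 2;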