What is the difference between DataFrame repartition() and DataFrameWriter partitionBy() methods?
I assume both are used to "partition data based on a dataframe column"? Or is there any difference?
Watch out: I believe the accepted answer is not quite right! I'm glad you asked this question, because the behavior of these similarly-named functions differs in important and unexpected ways that are not well documented in the official Spark documentation.
The first part of the accepted answer is correct: calling df.repartition(COL, numPartitions=k) will create a dataframe with k partitions using a hash-based partitioner. COL here defines the partitioning key--it can be a single column or a list of columns. The hash-based partitioner takes each input row's partition key and hashes it into a space of k partitions via something like partition = hash(partitionKey) % k. This guarantees that all rows with the same partition key end up in the same partition. However, rows with different partition keys can also end up in the same partition (when a hash collision between the partition keys occurs), and some partitions might be empty (a short sketch after the summary below makes this concrete).
In summary, the unintuitive aspects of df.repartition(COL, numPartitions=k) are that
partitions will not strictly segregate partition keys
some of your k partitions may be empty, whereas others may contain rows from multiple partition keys
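Here is a minimal sketch of that behavior (local Scala Spark; the key values and partition count are made up for illustration) showing which of the k partitions each key lands in:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, spark_partition_id}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c", "d", "e").toDF("key")

// Hash-partition into 8 partitions on 'key': several of the 8 partitions will
// be empty, and two different keys can share a partition after a hash collision.
df.repartition(8, col("key"))
  .withColumn("partition_id", spark_partition_id())
  .show()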
The behavior of df.write.partitionBy is quite different, in a way that many users won't expect. Let's say that you want your output files to be date-partitioned, and your data spans 7 days. Let's also assume that df has 10 partitions to begin with. When you run df.write.partitionBy('day'), how many output files should you expect? The answer is 'it depends'. If each of your 10 starting partitions in df contains data from every day, then the answer is 70. If each of your starting partitions contains data from exactly one day, then the answer is 10.
How can we explain this behavior? When you run df.write, each of the original partitions in df is written independently. That is, each of your original 10 partitions is sub-partitioned separately on the 'day' column, and a separate file is written for each sub-partition.
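A hedged sketch of that experiment, using df as described above (the 'day' column, the 7-day assumption, and the output paths are illustrative):
import org.apache.spark.sql.functions.col

// Case 1: force 10 starting partitions that each mix rows from several days.
df.repartition(10)
  .write
  .partitionBy("day")
  .csv("/tmp/by_day_from_10_mixed_partitions")
// Each of the 10 partitions is sub-partitioned by day on write,
// so expect up to 10 x 7 = 70 part files across the 7 day=... directories.

// Case 2: arrange the starting partitions so each one holds exactly one day.
df.repartition(col("day"))
  .write
  .partitionBy("day")
  .csv("/tmp/by_day_from_per_day_partitions")
// Each day lives entirely in one partition, so expect roughly one part file per day.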
I find this behavior rather annoying and wish there were a way to do a global repartitioning when writing dataframes.
If you run repartition(COL) you change the partitioning during calculations - you will get spark.sql.shuffle.partitions (default: 200) partitions. If you then call .write you will get one directory with many files.
If you run .write.partitionBy(COL) then as the result you will get as many directories as there are unique values in COL. This speeds up further data reading (if you filter by the partitioning column) and saves some space on storage (the partitioning column is removed from the data files).
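A small sketch of both calls (the 'country' column and output paths are illustrative):
import org.apache.spark.sql.functions.col

// repartition(COL): shuffles into spark.sql.shuffle.partitions (default 200)
// hash partitions; the write then produces a single directory with many files.
df.repartition(col("country"))
  .write
  .csv("/tmp/one_directory_many_files")

// write.partitionBy(COL): one sub-directory per distinct value of COL, and the
// partitioning column is removed from the data files themselves.
df.write
  .partitionBy("country")
  .csv("/tmp/one_directory_per_country_value")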
UPDATE: See #conradlee's answer. He explains in detail not only how the directory structure will look after applying the different methods, but also what the resulting number of files will be in both scenarios.
repartition() is used to partition data in memory and partitionBy is used to partition data on disk. They're often used in conjunction.
Both repartition() and partitionBy can be used to "partition data based on dataframe column", but repartition() partitions the data in memory and partitionBy partitions the data on disk.
repartition()
Let's play around with some code to better understand partitioning. Suppose you have the following CSV data.
first_name,last_name,country
Ernesto,Guevara,Argentina
Vladimir,Putin,Russia
Maria,Sharapova,Russia
Bruce,Lee,China
Jack,Ma,China
df.repartition(col("country")) will repartition the data by country in memory.
Let's write out the data so we can inspect the contents of each memory partition.
val outputPath = new java.io.File("./tmp/partitioned_by_country/").getCanonicalPath
df.repartition(col("country"))
.write
.csv(outputPath)
Here's how the data is written out on disk:
partitioned_by_country/
part-00002-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
part-00044-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
part-00059-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
Each file contains data for a single country - the part-00059-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv file contains this China data for example:
Bruce,Lee,China
Jack,Ma,China
partitionBy()
Let's write out data to disk with partitionBy and see how the filesystem output differs.
Here's the code to write out the data to disk partitions.
val outputPath = new java.io.File("./tmp/partitionedBy_disk/").getCanonicalPath
df
.write
.partitionBy("country")
.csv(outputPath)
Here's what the data looks like on disk:
partitionedBy_disk/
country=Argentina/
part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000.csv
country=China/
part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000
country=Russia/
part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000
Why partition data on disk?
Partitioning data on disk can make certain queries run much faster, because Spark can skip reading the directories that a filter on the partitioning column rules out.
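For example, a read that filters on the partitioning column only has to scan the matching country=... directory (a sketch reusing outputPath from the snippet above):
import org.apache.spark.sql.functions.col

// Spark discovers the country=... directories as a partition column, so this
// filter only reads the files under country=China (partition pruning).
spark.read
  .csv(outputPath)
  .where(col("country") === "China")
  .show()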
Related
I have a 500 MB CSV file which I am reading as a dataframe.
I am looking for the optimal number of partitions for this dataframe.
I need to do some wide transformations and join this dataframe with another CSV, so I now have the 3 approaches below for re-partitioning this dataframe:
df.repartition(no. of cores)
Re-partitioning the dataframe as per the calculation 500 MB / 128 MB ~ 4 partitions, so as to have at least 128 MB of data in each partition
Re-partitioning the dataframe using specific columns of the CSV so as to co-locate data in the same partitions
I want to know which of these options will be best for parallel computation and processing in Spark 2.4.
If you know the data very well, then partitioning by columns works best. However, repartitioning based on either the block size or the number of cores is subject to change whenever the cluster configuration changes, and you would need to revisit those numbers for every environment if your cluster configuration differs in higher environments. So, all in all, data-driven repartitioning is the better approach.
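For reference, a sketch of the three options from the question (the file path and the join_key column are hypothetical):
import org.apache.spark.sql.functions.col

val df = spark.read.option("header", "true").csv("/path/to/the_500mb_file.csv")

// Option 1: match the cluster's default parallelism (changes with cluster configuration).
val byCores = df.repartition(spark.sparkContext.defaultParallelism)

// Option 2: target roughly 128 MB per partition: 500 MB / 128 MB ~ 4 partitions.
val bySize = df.repartition(4)

// Option 3 (data driven): co-locate rows that share the join/grouping key.
val byColumn = df.repartition(col("join_key"))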
We have the following scenario:
We have an existing table containing approx. 15 billion records. It was not explicitly partitioned on creation.
We are creating a copy of this table with partitions, hoping for faster read time on certain types of queries.
Our tables are on Databricks Cloud, and we use Databricks Delta.
We commonly filter by two columns, one of which is the ID of an entity (350k distinct values) and one of which is the date at which an event occurred (31 distinct values so far, but increasing every day!).
So, in creating our new table, we ran a query like this:
CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (entity_id, date)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
This query has run for 48 hours and counting. We know that it is making progress, because we have found around 250k prefixes corresponding to the first partition key in the relevant S3 prefix, and there are certainly some big files in the prefixes that exist.
However, we're having some difficulty monitoring exactly how much progress has been made, and how much longer we can expect this to take.
While we waited, we tried out a query like this:
CREATE TABLE a_test_table (
entity_id STRING,
another_id STRING,
timestamp TIMESTAMP,
date DATE
)
USING DELTA
PARTITIONED BY (date);
INSERT INTO a_test_table
SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01'
Notice the main difference in the new table's schema here is that we partitioned only on date, not on entity id. The date we chose contains almost exactly four percent of the old table's data, which I want to point out because it's much more than 1/31. Of course, since we are selecting by a single value that happens to be the same thing we partitioned on, we are in effect only writing one partition, vs. the probably hundred thousand or so.
The creation of this test table took 16 minutes using the same number of worker-nodes, so we would expect (based on this) that the creation of a table 25x larger would only take around 7 hours.
This answer appears to partially acknowledge that using too many partitions can cause the problem, but the underlying causes appear to have greatly changed in the last couple of years, so we seek to understand what the current issues might be; the Databricks docs have not been especially illuminating.
Based on the posted request rate guidelines for S3, it seems like increasing the number of partitions (key prefixes) should improve performance. The partitions being detrimental seems counter-intuitive.
In summary: we are expecting to write many thousands of records in to each of many thousands of partitions. It appears that reducing the number of partitions dramatically reduces the amount of time it takes to write the table data. Why would this be true? Are there any general guidelines on the number of partitions that should be created for data of a certain size?
You should partition your data by date because it sounds like you are continually adding data as time passes chronologically. This is the generally accepted approach to partitioning time series data. It means that you will be writing to one date partition each day, and your previous date partitions are not updated again (a good thing).
You can of course use a secondary partition key if your use case benefits from it (i.e. PARTITIONED BY (date, entity_id))
Partitioning by date means that reads of this data should also filter by date to get the best performance. If this is not your use case, then you would have to clarify your question.
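Applied to the query from the question, the recommendation would look roughly like this (a sketch run through spark.sql; add entity_id to PARTITIONED BY only if your reads really filter on it):
spark.sql("""
  CREATE TABLE the_new_table
  USING DELTA
  PARTITIONED BY (date)
  AS SELECT
    entity_id,
    another_id,
    from_unixtime(timestamp) AS timestamp,
    CAST(from_unixtime(timestamp) AS DATE) AS date
  FROM the_old_table
""")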
How many partitions?
No one can give you an answer on how many partitions you should use, because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker has to process huge amounts of data while other workers are idle. In your case that would happen if one entity_id made up 20% of your data set, for example. Partitioning by date assumes that each day has roughly the same amount of data, so each worker is kept equally busy.
I don't know specifically how Databricks writes to disk, but on Hadoop I would want to see each worker node writing its own file part, so that write performance is parallelized at this level.
I am not a Databricks expert at all, but hopefully these bullets can help.
Number of partitions
The number of partitions and files created will impact the performance of your job no matter what, especially when using S3 as data storage; however, this number of files should be handled easily by a cluster of decent size.
Dynamic partition
There is a huge difference between partitioning dynamically by your 2 keys instead of one; let me try to address this in more detail.
When you partition data dynamically, depending on the number of tasks and the size of the data, a large number of small files can be created per partition. This can (and probably will) impact the performance of subsequent jobs that need to use this data, especially if your data is stored in ORC, Parquet or any other columnar format. Note that writing the dynamic partitions itself requires only a map-only job.
The issue described above is addressed in different ways, the most common being file consolidation: the data is repartitioned in order to create bigger files. As a result, a shuffle of the data is required (see the sketch below).
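A sketch of that consolidation step (the 'date' column and output path are illustrative):
import org.apache.spark.sql.functions.col

// Shuffle so each write task holds a single date, producing one larger file
// per date directory instead of many small ones.
df.repartition(col("date"))
  .write
  .partitionBy("date")
  .parquet("/tmp/consolidated_by_date")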
Your queries
For your first query, the number of partitions will be 350k*31 (around 11MM!), which is really big considering the amount of shuffling and the number of tasks required to handle the job.
For your second query (which takes only 16 minutes), the number of tasks and the amount of shuffling required are much smaller.
The number of partitions (and the associated shuffling, sorting, task scheduling, etc.) and the execution time of your job do not have a linear relationship, which is why the math doesn't add up in this case.
Recommendation
I think you already got it: you should split your ETL job into 31 different queries (one per day), which will allow you to optimize the execution time.
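One hedged way to sketch that split, reusing the question's own SQL and assuming the_new_table has already been created with the desired PARTITIONED BY clause (the list of days is illustrative):
// Load the partitioned table one day at a time, so each insert only has to
// shuffle and write a single day's worth of data.
val days = Seq("2018-12-01", "2018-12-02", "2018-12-03")  // ...one entry per day in the source

days.foreach { day =>
  spark.sql(s"""
    INSERT INTO the_new_table
    SELECT
      entity_id,
      another_id,
      from_unixtime(timestamp) AS timestamp,
      CAST(from_unixtime(timestamp) AS DATE) AS date
    FROM the_old_table
    WHERE CAST(from_unixtime(timestamp) AS DATE) = '$day'
  """)
}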
My recommendations when working with partition columns are:
Identify the cardinality of all the columns and select those whose cardinality stays bounded over time; exclude identifiers and raw date columns.
Identify the main way the table is searched; perhaps it is by date or by some categorical field.
Generate derived columns with a finite cardinality in order to speed up the search; for example, a date can be decomposed into year, month, day, etc., and an integer identifier can be reduced with a modulo, e.g. id % N (see the sketch at the end of this answer).
As I mentioned earlier, partitioning on high-cardinality columns will cause poor performance by generating a lot of small files, which is the worst case.
It is advisable to work with files that do not exceed 1 GB; for this, when creating the Delta table, it is recommended to use coalesce(1).
If you need to perform updates or inserts, specify as many partition columns as possible in the predicate to rule out unnecessary file reads, which is very effective at reducing times.
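A sketch of the derived-column idea from the recommendations above (the event_date and user_id columns, the bucket count, and the output path are made up):
import org.apache.spark.sql.functions.{col, lit, month, pmod, year}

// Derive low-cardinality partition columns from a timestamp and from a
// (hypothetical) integer id, then partition on the derived columns only.
val prepared = df
  .withColumn("year", year(col("event_date")))
  .withColumn("month", month(col("event_date")))
  .withColumn("id_bucket", pmod(col("user_id"), lit(16)))  // user_id % 16

prepared.write
  .format("delta")
  .partitionBy("year", "month", "id_bucket")
  .save("/tmp/partitioned_by_derived_columns")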
What is the optimal size for external table partition?
I am planning to partition the table by year/month/day, and we are getting about 2 GB of data daily.
Optimal table partitioning is partitioning that matches your table usage scenario.
Partitioning should be chosen based on:
how the data is being queried (if you need to work mostly with daily data then partition by date).
how the data is being loaded (parallel threads should load their own partitions, without overlapping).
2 GB is not too much even for a single file, though again it depends on your usage scenario. Avoid unnecessarily complex and redundant partitioning like (year, month, date) - in this case date alone is enough for partition pruning.
Hive partition definitions are stored in the metastore, so too many partitions will take up a lot of space in the metastore.
Partitions are stored as directories in HDFS, so multiple partition keys will produce deeply nested directories, which makes scanning them slower.
Your query will be executed as a MapReduce job, so it is pointless to create overly tiny partitions.
It depends on your case; think about how your data will be queried. For your case I would prefer a single key defined as 'yyyymmdd', giving 365 partitions per year, only one level in the table directory, and 2 GB of data per partition, which is a nice size for a MapReduce job.
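For illustration only, here is what a single-level 'yyyymmdd' key might look like if you happen to create the table from Spark with Hive support enabled (the table, columns and location are made up):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (
    event_id STRING,
    payload  STRING
  )
  PARTITIONED BY (dt STRING)  -- one 'yyyymmdd' value per daily load
  STORED AS ORC
  LOCATION '/data/events'
""")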
For the completeness of the answer: if you use Hive < 0.12, make your partition key string-typed, see here.
Useful blog here.
Hive partitioning is most effective in cases where the data is sparse. By sparse I mean that the data internally has visible partitions such as by year, month or day.
In your case, partitioning by date doesn't make much sense, as each day will have about 2 GB of data, which is not too big to handle. Partitioning by week or month makes more sense, as it will optimize query time and will not create too many small partition files.
I would like to be able to do a fast range query on a Parquet table. The amount of data to be returned is very small compared to the total size but because a full column scan has to be performed it is too slow for my use case.
Using an index would solve this problem, and I read that this was to be added in Parquet 2.0. However, I cannot find any other information on this, so I am guessing that it was not. I do not think there would be any fundamental obstacles preventing the addition of (multi-column) indexes, provided the data were sorted, which in my case it is.
My question is: when will indexes be added to Parquet, and what would be the high level design for doing so? I think I would already be happy with an index that points out the correct partition.
Kind regards,
Sjoerd.
Update Dec/2018:
Parquet Format version 2.5 added column indexes.
https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250
See https://issues.apache.org/jira/browse/PARQUET-1201 for list of sub-tasks for that new feature.
Notice that this feature has only just been merged into the Parquet format itself; it will take some time for the different backends (Spark, Hive, Impala, etc.) to start supporting it.
This new feature is called Column Indexes. Basically, Parquet has added two new structures to the Parquet layout - the Column Index and the Offset Index.
Below is a more detailed technical explanation of what it solves and how.
Problem Statement
In the current format, Statistics are stored for ColumnChunks in ColumnMetaData and for individual pages inside DataPageHeader structs. When reading pages, a reader has to process the page header in order to determine whether the page can be skipped based on the statistics. This means the reader has to access all pages in a column, thus likely reading most of the column data from disk.
Goals
Make both range scans and point lookups I/O efficient by allowing direct access to pages based on their min and max values. In particular:
A single-row lookup in a rowgroup based on the sort column of that rowgroup will only read one data page per retrieved column. Range scans on the sort column will only need to read the exact data pages that contain relevant data.
Make other selective scans I/O efficient: if we have a very selective predicate on a non-sorting column, for the other retrieved columns we should only need to access data pages that contain matching rows.
No additional decoding effort for scans without selective predicates, e.g., full-row group scans. If a reader determines that it does not need to read the index data, it does not incur any overhead.
Index pages for sorted columns use minimal storage by storing only the boundary elements between pages.
Non-Goals
Support for the equivalent of secondary indices, i.e., an index structure sorted on the key values over non-sorted data.
Technical Approach
We add two new per-column structures to the row group metadata:
ColumnIndex: this allows navigation to the pages of a column based on column values and is used to locate data pages that contain matching values for a scan predicate
OffsetIndex: this allows navigation by row index and is used to retrieve values for rows identified as matches via the ColumnIndex. Once rows of a column are skipped, the corresponding rows in the other columns have to be skipped. Hence the OffsetIndexes for each column in a RowGroup are stored together.
The new index structures are stored separately from the RowGroup, near the footer, so that a reader does not have to pay the I/O and deserialization cost of reading them if it is not doing selective scans. The index structures' location and length are stored in ColumnChunk and RowGroup.
Cloudera's Impala team has run some tests on this new feature (not yet available as part of the Apache Impala core product). In their published benchmark charts, some of the queries show a huge improvement in both CPU time and the amount of data read from disk.
Original answer back from 2016:
struct IndexPageHeader {
/** TODO: **/
}
https://github.com/apache/parquet-format/blob/6e5b78d6d23b9730e19b78dceb9aac6166d528b8/src/main/thrift/parquet.thrift#L505
Index Page Header is not implemented, as of yet.
See source code of Parquet format above.
I don't see it even in Parquet 2.0 currently.
But yes - see the excellent answer from Ryan Blue on Parquet's pseudo-indexing capabilities (min/max statistics).
If you're interested in more details, I recommend this great document on how Parquet statistics and predicate push-down work:
https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide
a more technical implementation-specific document -
https://homepages.cwi.nl/~boncz/msc/2018-BoudewijnBraams.pdf
Parquet currently keeps min/max statistics for each data page. A data page is a group of ~1MB of values (after encoding) for a single column; multiple pages are what make up Parquet's column chunks.
Those min/max values are used to filter both column chunks and the pages that make up a chunk. So you should be able to improve your query time by sorting records by the columns you want to filter on, then writing the data into Parquet. That way, you get the most out of the stats filtering.
You can also get more granular filtering with this technique by decreasing the page and row group sizes, though you're then trading encoding efficiency and I/O efficiency.
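A hedged sketch of that write pattern (the event_time column, the path, and the row-group size are illustrative, not tuned recommendations):
// Sort by the column(s) you filter on most, so each page covers a narrow
// min/max range and stats-based filtering can skip most pages on read.
spark.sparkContext.hadoopConfiguration
  .setInt("parquet.block.size", 32 * 1024 * 1024)  // smaller row groups => finer-grained skipping

df.sort("event_time")
  .write
  .parquet("/tmp/sorted_events")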
I have a Hive table where, for each user ID, I have a ts column, which is a timeseries stored as an array. I want to maintain the timeseries as a most-recent window.
(a) how do I append a new number to the end of each column from another table joined by ID?
(b) how do I drop the leading number?
Data in Hive is typically stored in HDFS. HDFS has limited append capabilities. If the constant modification of data is at the core of your analytics systems, then perhaps you should consider using alternatives like HBase or Cassandra.
However, if the data updates are a small part of your workflow, I would encourage you to continue using Hive (in order to make use of its SQL-like functionality) but reconsider your design for storing these updates.
A quick solution to your problem above would be to have more than one record per user ID in your table. Each record would have a timeseries corresponding to the user ID. When you want to do your last-N analysis on the timeseries, you should do a select from the table using Distribute By on the user ID column. Your custom reducer will simply pick out the last N (or fewer, if the size of the timeseries is less than N) timestamps and return them.
Harish Butani also did some work on Windowing functions in Hive. You can also take a look at his work and associated documentation to gain some more insight. Good luck, Alexy!
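For illustration, here is a sketch of the last-N idea using the windowing functions mentioned above instead of the Distribute By plus custom reducer approach (run here through spark.sql; the user_timeseries table, its columns, and N = 10 are made up):
spark.sql("""
  SELECT user_id, ts
  FROM (
    SELECT user_id, ts,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
    FROM user_timeseries
  ) ranked
  WHERE rn <= 10  -- keep only the N most recent points per user
""")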