Choosing Partition Column - hive

I have huge data set related to transactions. I need to choose partitioning column from transaction_date(increases everyday) or state(limited number). which is the ideal choice and why?

Disadvantage of choosing transaction_date as partition column:
(1) Too may small directories which may cause overhead in HDFS.
Advantages of using state:
(1) Number of directories will be fixed.
It all depends how the query will be formed for execution.
If your query contains filter clause for transaction_date and there is no partition then overall execution will be slow.
Also, creating partition does not guarantee faster execution.
The search results will be returned faster for partitions where data volume is less compared to the partitions where data volume is high.

The ideal choice is to have state as partitioning column as partitioning creates distinct folders based on distinct values. Hence number of folders = number of states and so the metadata information storage to Namenode would be less.
but if transaction date would be considered then each day there would be a new folder and that would reduce the performance of Namenode at some point of time.

Related

GCP BigQuery - LIMIT but full table read - How to limit queried data to a minimum

It looks like LIMIT would have no effect on the amount of processed/queried data (if you trust the UI).
SELECT
* --count(*)
FROM
`bigquery-public-data.github_repos.commits`
-- LIMIT 20
How to limit the amount of queried data to a minimum (even though one whole partition would probably always be needed)
without to use "preview" or similar
without to know the partition / clustering of the data
How to check the real approximate amount before a query execution?
In the execution details is stated that only 163514 rows has been queried as input (not 244928379 rows)
If you want to limit the amount of data BQ uses for a query you have this two options:
Table Partitioning
Big query can partition data using either a Date/Datetime/Timemestamp column you provide or by insert date (which is good if you have regular updates on a table).
In order to do this, you must specify the partition strategy in the DDL:
CREATE TABLE mydataset.mytable (foo: int64, txdate:date)
PARTITION BY txdate
Wildcard tables (like Sharding - splitting the data into multiple tables
This works when your data holds information about different domains (geographical, customer type, etc.) or sources.
Instead of having one big table, you can create 'subtables' or 'shards' like this with a similar schema (usually people use the same). For instance,dateset.tablename.eur for european data and ```dataset.tablename.jap`` for data from Japan.
You can query one of those tables directll select col1,col2... from dataset.tablename.custromer_eur; or from all tables select col1,col2 from 'dataset.tablename.*'
Wildcard tables can be also partitioned by date.
You pay for the volume of data loaded in the workers. Of course, you do nothing in your request and you ask for the 20 first result, the query stop earlier, and all the data aren't processed, but at least loaded. And you will pay for this!
Have a look to this. I have a similar request
Now, let's go to the logs
The total byte billed is ~800Mb
So you, have to think differently when you work with BigQuery, it's analytics database and not designed to perform small requests (too slow to start, the latency is at least 500ms due to worker warm up).
My table contain 3M+ of rows, and only 10% have been processed
And you pay for the reservation and the load cost (moving data have a cost and reserving slots has also a cost).
That's why, there is a lot of tip to save money on Google BigQuery. Some examples by a former BigQuery Dev Advocate
as of december 2021, I notice select * from Limit, will not scan the whole table and you pay only for a small number of rows, obviously if you add order by, it will scan everything.

How do explicit table partitions in Databricks affect write performance?

We have the following scenario:
We have an existing table containing approx. 15 billion records. It was not explicitly partitioned on creation.
We are creating a copy of this table with partitions, hoping for faster read time on certain types of queries.
Our tables are on Databricks Cloud, and we use Databricks Delta.
We commonly filter by two columns, one of which is the ID of an entity (350k distinct values) and one of which is the date at which an event occurred (31 distinct values so far, but increasing every day!).
So, in creating our new table, we ran a query like this:
CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (entity_id, date)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
This query has run for 48 hours and counting. We know that it is making progress, because we have found around 250k prefixes corresponding to the first partition key in the relevant S3 prefix, and there are certainly some big files in the prefixes that exist.
However, we're having some difficulty monitoring exactly how much progress has been made, and how much longer we can expect this to take.
While we waited, we tried out a query like this:
CREATE TABLE a_test_table (
entity_id STRING,
another_id STRING,
timestamp TIMESTAMP,
date DATE
)
USING DELTA
PARTITIONED BY (date);
INSERT INTO a_test_table
SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01'
Notice the main difference in the new table's schema here is that we partitioned only on date, not on entity id. The date we chose contains almost exactly four percent of the old table's data, which I want to point out because it's much more than 1/31. Of course, since we are selecting by a single value that happens to be the same thing we partitioned on, we are in effect only writing one partition, vs. the probably hundred thousand or so.
The creation of this test table took 16 minutes using the same number of worker-nodes, so we would expect (based on this) that the creation of a table 25x larger would only take around 7 hours.
This answer appears to partially acknowledge that using too many partitions can cause the problem, but the underlying causes appear to have greatly changed in the last couple of years, so we seek to understand what the current issues might be; the Databricks docs have not been especially illuminating.
Based on the posted request rate guidelines for S3, it seems like increasing the number of partitions (key prefixes) should improve performance. The partitions being detrimental seems counter-intuitive.
In summary: we are expecting to write many thousands of records in to each of many thousands of partitions. It appears that reducing the number of partitions dramatically reduces the amount of time it takes to write the table data. Why would this be true? Are there any general guidelines on the number of partitions that should be created for data of a certain size?
You should partition your data by date because it sounds like you are continually adding data as time passes chronologically. This is the generally accepted approach to partitioning time series data. It means that you will be writing to one date partition each day, and your previous date partitions are not updated again (a good thing).
You can of course use a secondary partition key if your use case benefits from it (i.e. PARTITIONED BY (date, entity_id))
Partitioning by date will necessitate that your reading of this data will always be made by date as well, to get the best performance. If this is not your use case, then you would have to clarify your question.
How many partitions?
No one can give you answer on how many partitions you should use because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker is having to process huge amounts of data, while other workers are idle. In your case that would happen if one clientid was 20% of your data set, for example. Partitioning by date has to assume that each day has roughly the same amount of data, so each worker is kept equally busy.
I don't know specifically about how Databricks writes to disk, but on Hadoop I would want to see each worker node writing it's own file part, and therefore your write performance is paralleled at this level.
I am not a databricks expert at all but hopefully this bullets can help
Number of partitions
The number of partitions and files created will impact the performance of your job no matter what, especially using s3 as data storage however this number of files should be handled easily by a cluster of descent size
Dynamic partition
There is a huge difference between partition dynamically by your 2 keys instead of one, let me try to address this in more details.
When you partition data dynamically, depending on the number of tasks and the size of the data, a big number of small files could be created per partition, this could (and probably will) impact the performance of next jobs that will require use this data, especially if your data is stored in ORC, parquet or any other columnar format. Note that this will require only a map only job.
The issue explained before, is addressed in different ways, being the most common the file consolidation. For this, data is repartitioned with the purpose of create bigger files. As result, shuffling of data will be required.
Your queries
For your first query, the number of partitions will be 350k*31 (around 11MM!), which is really big considering the amount of shuffling and task required to handle the job.
For your second query (which takes only 16 minutes), the number of required tasks and shuffling required is much more smaller.
The number of partitions (shuffling/sorting/tasks scheduling/etc) and the time of your job execution does not have a linear relationship, that is why the math doesn't add up in this case.
Recomendation
I think you already got it, you should split your etl job in 31 one different queries which will allow to optimize the execution time
My recommendations in case of occupying partitioned columns is
Identify the cardinality of all the columns and select those that have a finite amount in time, therefore exclude identifiers and date columns
Identify the main search to the table, perhaps it is date or by some categorical field
Generate sub columns with a finite cardinality in order to speed up the search example in the case of dates it is possible to decompose it into year, month, day, etc. , or in the case of integer identifiers, decompose them into the integer division of these IDs% [1,2,3 ...]
As I mentioned earlier, using columns with a high cardinality to partition, will cause poor performance, by generating a lot of files which is the worst working case.
It is advisable to work with files that do not exceed 1 GB for this when creating the delta table it is recommended to occupy "coalesce (1)"
If you need to perform updates or insertions, specify the largest number of partitioned columns to rule out the inceserary cases of file reading, which is very effective to reduce times.

How to partition 10 billion row SQL tables quickly using AWS?

I have a SQL database of data delivered in a normalized format with several tables that have several billions of rows of data. I have decided to partition the large tables into separate tables by itemId since when I query the data I only care about 1 item at a time. I would end up having 5000+ tables at the end after partitioning the data. The problem is, partitioning the data takes about 25 minutes to build a single table for 1 item.
5000 items x 25 minutes = 86.8 days
It would take over 86 days to fully partition my entire SQL database. My entire database is about 2.5TB.
Is this something I can leverage AWS for to parallelize on an item level? Can I use AWS database migration services to host the database in its current form and then use AWS process to churn through all of the 5000 queries to partition the big tables into 5000 smaller tables with 2M rows each?
If not, is this something I just have to throw more hardware at to make it run faster (CPU or RAM)?
Thanks in advance.
This doesn't seem like a good strategy. For one thing, simple arithmetic is that 10,000,000,000 rows with 5,000 rows per item results in 2,000,000 partitions in the table.
The limit in Redshift (by default) is 1,000,000 partition per table:
Amazon Redshift Spectrum has the following quotas when using the
Athena or AWS Glue data catalog:
A maximum of 10,000 databases per account.
A maximum of 100,000 tables per database.
A maximum of 1,000,000 partitions per table.
A maximum of 10,000,000 partitions per account.
You should re-think your partitioning strategy. Or perhaps your problem is not suitable for Redshift. There may be other database strategies more suitable for your use-case. (This is not the forum for recommending specific software solutions, however.)
Use the itemid as sortkey and distkey. if the table is vacummed properly and you select one itemid this should have good results, where access time is almost as good as a single table. distkey is used to distribute the data between shards, which means each itemid's blocks would be stored together on the same shard making retrieving all of them faster. Having the itemid also be sortkey means that for itemid's with small row numbers that all exist on the same shard, finding the rows within the table's blocks on a shard would be as fast as possible.
Creating a separate table for each item, where every other attribute of the table remains the same, doesn't seem logical. If the data format is the same, then keep the data in the same table unless there is a particular problem to overcome.
If you set the itemId as the SORTKEY on a Redshift table, then Redshift will be able to skip-over the blocks that do not contain a desired value (when using WHERE itemId = 'xxx'). This will be highly efficient.
Admittedly, trying to keep such a large table sorted would probably be too hard to VACUUM. It would still work reasonably well without the SORTKEY since blocks can still be skipped, but not as efficiently because the data for that itemId would be spread over more blocks.

Hive external table optimal partition size

What is the optimal size for external table partition?
I am planning to partition table by year/month/day and we are getting about 2GB of data daily.
Optimal table partitioning is such that matching to your table usage scenario.
Partitioning should be chosen based on:
how the data is being queried (if you need to work mostly with daily data then partition by date).
how the data is being loaded (parallel threads should load their own
partitions, not overlapped)
2Gb is not too much even for one file, though it again depends on your usage scenario. Avoid unnecessary complex and redundant partitions like (year, month, date) - in this case date is enough for partition pruning.
Hive partitions definition will be stored in the metastore, therefore too many partitions will take much space in the metastore.
Partitions will be stored as directories in the HDFS, therefore many partitions keys will produce hirarchical directories which make their scanning slower.
Your query will be executed as a MapReduce job, therefore it's useless to make too tiny partitions.
It's case depending, think how your data will be queried. For your case I prefer one key defined as 'yyyymmdd', hence we will get 365 partitions / year, only one level in the table directory and 2G data / partition which is nice for a MapReduce job.
For the completness of the answer, if you use Hive < 0.12, make your partition key string typed, see here.
Usefull blog here.
Hive partitioning is most effective in cases where the data is sparse. By sparse I mean that the data internally has visible partitions such as by year, month or day.
In your case, partitioning by date doesn't make much sense as each day will have 2 Gb of data which is not too big to handle. Partitioning by week or month makes more sense as it will optimize the query time and will not create too many small partition files.

What is a good size (# of rows) to partition a table to really benefit?

I.E. if we have got a table with 4 million rows.
Which has got a STATUS field that can assume the following value: TO_WORK, BLOCKED or WORKED_CORRECTLY.
Would you partition on a field which will change just one time (most of times from to_work to worked_correctly)? How many partitions would you create?
The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.
For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.
Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.
The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.
So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows - if not all - rows will end up in the WORKED_CORRECTLY. So that partition will become enormous compared to the others, and the chances of benefits from partitioning become even more remote.
Finally, your proposed scheme is not elastic. As the current volume each partition would have 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.
So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change and one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.
This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.
The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.
What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?
Assuming you are trying to improve query performance? Do all your queries have predicates on the STATUS column? Are they doing single row lookups of rows? Or would you want your queries to scan an entire partition?