I'm loading a table in Hive that's partitioned by date. It currently contains about 3 years' worth of records, so circa 900 partitions (i.e. roughly 365*3).
I'm loading daily deltas into this table, adding an additional partition per day. I achieve this using dynamic partitioning, as I can't guarantee my source data only contains one day's worth of data (e.g. if I'm recovering from a failure I may have multiple days of data to process).
This is all fine and dandy; however, I've noticed the final step of actually writing the partitions has become very slow. By this I mean the logs show the MapReduce stage completes quickly, but the final load step is very slow because it seems to scan and open every existing partition, regardless of whether it will be overwritten.
Should I be explicitly creating partitions to avoid this step?
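For reference, a minimal sketch of the two insert styles I'm weighing (table and column names here are illustrative, not my actual job):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Dynamic: Hive derives the target partitions from the last column of the SELECT
INSERT OVERWRITE TABLE target_table PARTITION (load_date)
SELECT col_a, col_b, load_date
FROM staging_table;

-- Static: the partition is named explicitly, so only that partition is touched
INSERT OVERWRITE TABLE target_table PARTITION (load_date = '2019-01-01')
SELECT col_a, col_b
FROM staging_table
WHERE load_date = '2019-01-01';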
Whether the partitions are dynamic or static typically should not alter the performance drastically. Can you check how many actual files are getting created in each of the partitions? I just want to make sure the actual writing is not serialized, which it could be if it's writing to only one file. Also check how many mappers and reducers were employed by the job.
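If it helps, one quick way to check the per-partition file count is the dfs command from a Hive CLI/Beeline session (the warehouse path below is just an example; adjust it to your layout):

dfs -count /apps/hive/warehouse/mydb.db/my_table/load_date=2019-01-01;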
I have been using the Tez engine to run MapReduce jobs. I have an MR job which takes ages to run: I noticed I have over 20k files with 1 stripe each, and Tez does not distribute mappers evenly based on the number of files, but rather on the number of stripes. So I can end up with one mapper handling a single file with a lot of stripes, and another mapper processing 15k files but with the same total number of stripes as the first.
As a workaround test, I used ALTER TABLE table PARTITION (...) CONCATENATE to bring the number of files down and get a more even distribution of stripes per file, and now the map job runs perfectly fine.
My concern is that I couldn't find anything in the documentation about whether running this command carries any risk of losing data, since it operates on the same files.
I'm trying to assess whether it's better to use concatenate to bring down the number of files before the MR job, versus using bucketing, which reads the files and writes the bucketed output to a separate location, so that in case of failure I don't lose the source data.
Concatenate takes about 1 minute per partition, whereas bucketing takes more time but doesn't put the source data at risk.
My question: is there any risk of data loss when running the concatenate command?
thanks!
It should be as safe as rewriting the table from a query. It uses the same mechanism: the result is prepared in a staging directory first, and only after that is the staging output moved to the table or partition location.
Concatenation runs as a separate MR job: it prepares the concatenated files in a staging directory and, only if everything finished without errors, moves them to the table location. You should see something like this in the logs:
INFO : Loading data to table dbname.tblName partition (bla bla) from /apps/hive/warehouse/dbname.db/tblName/bla bla partition path/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
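If you want extra reassurance, a simple sanity check is to compare row counts for the partition before and after the command (database, table and partition names here are illustrative):

SELECT COUNT(*) FROM dbname.tblName WHERE dt = '2018-08-16';
ALTER TABLE dbname.tblName PARTITION (dt = '2018-08-16') CONCATENATE;
SELECT COUNT(*) FROM dbname.tblName WHERE dt = '2018-08-16';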
We have a Postgres table called history which is almost 900 GB and continuously growing by about 10 GB per day.
This table is accessed by a microservice (carts). We have Postgres replication set up (one master and one slave).
Two instances of the microservice run in production: one instance uses the master Postgres connection to write and read data for some endpoints,
and the other instance uses the slave connection to read data only.
Table definition:
id - uuid
data - jsonb column
internal - jsonb column
context - jsonb column
created_date - date
modified_date - date
In the above table, the data and internal columns are loaded with a big JSON document for every row. We have concluded that nulling out the data and internal
columns will reduce the total space consumed by this table.
Question:
How do we archive this giant table (meaning cleaning up only the data and internal columns)?
How do we achieve this with zero downtime and no performance degradation?
Approaches tested so far:
Using pg_repack (the best idea so far, but the issue is that once pg_repack is done, the entire new table needs to be synced to the slave instance, which causes WAL overhead).
Just nullifying the data and internal columns alone - the problem with this approach is that it initially increases the table size, because Postgres follows the MVCC pattern and keeps the old row versions until they are vacuumed.
Using a temp table and cloning the data (sketched below, after the question):
Create an UNLOGGED table - historyv2
Copy the data from the original table into historyv2 without data and internal
Then switch the table to LOGGED (I guess this will also cause WAL overhead)
Then rename the tables.
Can you guys give me some pointers on how to achieve this?
Postgres Version: 9.5
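For reference, approach 3 above would look roughly like this (a sketch only, using the column names from the definition above):

CREATE UNLOGGED TABLE historyv2 (LIKE history INCLUDING ALL);

INSERT INTO historyv2 (id, data, internal, context, created_date, modified_date)
SELECT id, NULL::jsonb, NULL::jsonb, context, created_date, modified_date
FROM history;

-- SET LOGGED rewrites the whole table into the WAL, which is where the overhead comes from
ALTER TABLE historyv2 SET LOGGED;

BEGIN;
ALTER TABLE history RENAME TO history_old;
ALTER TABLE historyv2 RENAME TO history;
COMMIT;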
I always feel that questions like these conflate a few different ideas, which makes them seem more complicated than they need to be. What does minimal performance impact mean? Generating lots of WAL may increase file I/O, CPU, and network usage, but in most cases it does not affect the system enough to have a client-facing impact. If no downtime is the most important thing, you should focus on optimizing for that, and not worry about what the process is to get there (within reason).
That said, if I woke up in your shoes, I would first set up partitions for data going forward using table inheritance, so that the data can be more easily segmented and worked on in the future. (This isn't strictly necessary, but it probably makes your life easier later.) After that, I would write a script that slowly works through the old data, creating new partitions with the "nulled out" data and interleaving partition creation, deletion of data, and vacuums against the main table. Once that is automated, you can let it churn slowly or during off-hours until it is done. You might need to do a final repack or VACUUM FULL on the parent once all the data is moved, but it's probably OK even without it. Again, this isn't the simplest idea, and probably not the fastest way to do it (if you could have downtime), but in the end you'll have the schema you want without causing any service disruptions.
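A rough sketch of that inheritance setup (child table name, date range and column list are illustrative; on 9.5 there is no declarative partitioning, so routing of new rows has to be done by the application or a trigger):

-- A child partition for one slice of historical data; indexes must be added per child
CREATE TABLE history_p2020_01 (
    CHECK (created_date >= DATE '2020-01-01' AND created_date < DATE '2020-02-01')
) INHERITS (history);

-- Move one slice with data/internal nulled out, then clean up the parent
INSERT INTO history_p2020_01 (id, data, internal, context, created_date, modified_date)
SELECT id, NULL::jsonb, NULL::jsonb, context, created_date, modified_date
FROM ONLY history
WHERE created_date >= DATE '2020-01-01' AND created_date < DATE '2020-02-01';

DELETE FROM ONLY history
WHERE created_date >= DATE '2020-01-01' AND created_date < DATE '2020-02-01';

VACUUM history;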
We have the following scenario:
We have an existing table containing approx. 15 billion records. It was not explicitly partitioned on creation.
We are creating a copy of this table with partitions, hoping for faster read time on certain types of queries.
Our tables are on Databricks Cloud, and we use Databricks Delta.
We commonly filter by two columns, one of which is the ID of an entity (350k distinct values) and one of which is the date at which an event occurred (31 distinct values so far, but increasing every day!).
So, in creating our new table, we ran a query like this:
CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (entity_id, date)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
This query has run for 48 hours and counting. We know that it is making progress, because we have found around 250k prefixes corresponding to the first partition key in the relevant S3 prefix, and there are certainly some big files in the prefixes that exist.
However, we're having some difficulty monitoring exactly how much progress has been made, and how much longer we can expect this to take.
While we waited, we tried out a query like this:
CREATE TABLE a_test_table (
entity_id STRING,
another_id STRING,
timestamp TIMESTAMP,
date DATE
)
USING DELTA
PARTITIONED BY (date);
INSERT INTO a_test_table
SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01'
Notice the main difference in the new table's schema here is that we partitioned only on date, not on entity id. The date we chose contains almost exactly four percent of the old table's data, which I want to point out because it's much more than 1/31. Of course, since we are selecting by a single value that happens to be the same thing we partitioned on, we are in effect only writing one partition, vs. the probably hundred thousand or so.
The creation of this test table took 16 minutes using the same number of worker-nodes, so we would expect (based on this) that the creation of a table 25x larger would only take around 7 hours.
This answer appears to partially acknowledge that using too many partitions can cause the problem, but the underlying causes appear to have greatly changed in the last couple of years, so we seek to understand what the current issues might be; the Databricks docs have not been especially illuminating.
Based on the posted request rate guidelines for S3, it seems like increasing the number of partitions (key prefixes) should improve performance. The partitions being detrimental seems counter-intuitive.
In summary: we are expecting to write many thousands of records in to each of many thousands of partitions. It appears that reducing the number of partitions dramatically reduces the amount of time it takes to write the table data. Why would this be true? Are there any general guidelines on the number of partitions that should be created for data of a certain size?
You should partition your data by date because it sounds like you are continually adding data as time passes chronologically. This is the generally accepted approach to partitioning time series data. It means that you will be writing to one date partition each day, and your previous date partitions are not updated again (a good thing).
You can of course use a secondary partition key if your use case benefits from it (e.g. PARTITIONED BY (date, entity_id)).
Partitioning by date means that reads of this data should also filter by date to get the best performance. If this is not your use case, then you would have to clarify your question.
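For example, a read along these lines (query is illustrative) only has to touch one date partition instead of scanning the whole table:

SELECT entity_id, COUNT(*) AS events
FROM the_new_table
WHERE date = '2018-12-01'
GROUP BY entity_id;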
How many partitions?
No one can give you an answer on how many partitions you should use, because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker has to process huge amounts of data while other workers are idle. In your case that would happen if one entity_id made up 20% of your data set, for example. Partitioning by date has to assume that each day has roughly the same amount of data, so that each worker is kept equally busy.
I don't know specifically how Databricks writes to disk, but on Hadoop I would want to see each worker node writing its own file part, so that write performance is parallelized at this level.
I am not a Databricks expert at all, but hopefully these points can help.
Number of partitions
The number of partitions and files created will impact the performance of your job no matter what, especially when using S3 as the data storage; however, this number of files should be handled easily by a cluster of decent size.
Dynamic partitioning
There is a huge difference between partitioning dynamically by your two keys instead of one; let me try to address this in more detail.
When you partition data dynamically, depending on the number of tasks and the size of the data, a large number of small files can be created per partition. This can (and probably will) hurt the performance of subsequent jobs that need to use this data, especially if it is stored in ORC, Parquet or any other columnar format. Note that the write itself requires only a map-only job.
The issue explained above is addressed in different ways, the most common being file consolidation: the data is repartitioned in order to create bigger files, and as a result shuffling of the data is required.
Your queries
For your first query, the number of partitions will be 350k * 31 (around 11 million!), which is really big considering the amount of shuffling and the number of tasks required to handle the job.
For your second query (which takes only 16 minutes), the number of tasks and the amount of shuffling required are much smaller.
The number of partitions (and the shuffling/sorting/task scheduling it implies) does not have a linear relationship with job execution time, which is why the math doesn't add up in this case.
Recommendation
I think you already got it: you should split your ETL job into 31 different queries, one per date, which will allow you to optimize the execution time. A rough sketch follows.
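Roughly one statement per date value, something like the following (illustrative; the date values could be generated and looped over from your driver):

INSERT INTO the_new_table
SELECT entity_id,
       another_id,
       from_unixtime(timestamp) AS timestamp,
       CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01';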
My recommendations for choosing partition columns are:
Identify the cardinality of all the columns and select those whose number of distinct values stays bounded over time; this rules out raw identifiers and full date columns.
Identify the main way the table is searched; perhaps it is by date or by some categorical field.
Generate derived columns with a bounded cardinality in order to speed up the search. For example, a date can be decomposed into year, month, day, etc., and an integer identifier can be reduced to the ID modulo some small number (1, 2, 3, ...); see the sketch at the end of this answer.
As mentioned above, using high-cardinality columns for partitioning will cause poor performance by generating a lot of small files, which is the worst case to work with.
It is advisable to work with files that do not exceed about 1 GB; to that end, when creating the Delta table it is recommended to use coalesce(1).
If you need to perform updates or inserts, specify as many of the partition columns as possible to rule out unnecessary file reads; this is very effective at reducing times.
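A sketch of what derived partition columns could look like for the table in the question (event_date, entity_bucket and the bucket count of 16 are made-up names/values, purely illustrative):

CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (event_date, entity_bucket)
AS SELECT
  entity_id,
  another_id,
  from_unixtime(timestamp) AS timestamp,
  CAST(from_unixtime(timestamp) AS DATE) AS event_date,
  pmod(hash(entity_id), 16) AS entity_bucket
FROM the_old_table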
We have about 1.7 million products in our e-shop. We want to keep a record of how many views these products had over a 1-year period, and we want to record the views at least every 2 hours. The question is what structure to use for this task?
Right now we keep stats for the last 30 days in records that have 2 columns, classified_id and stats, where stats is a stripped-down JSON with the format date:views,date:views... For example, a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kind of silly if you want to go 1 year back, since to get the sum of views of, say, 1000 products you need to fetch something like 30 MB from the database and calculate it yourself.
The other way we are thinking of going right now is a massive table with 3 columns, classified_id, date, views, storing each recording in its own row. This of course will result in a huge table with billions of rows; for example, if we have 1.8 million classifieds and keep records 24/7 for one year, every 2 hours, we need
1,800,000 * 365 * 12 = 7,884,000,000 (billions with a B) rows, which, while way inside the theoretical limit of Postgres, means I imagine that queries on it (say, for updating the views), even with the correct indexes, will take some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In my current job we store metrics data for websites, and the total number of rows we have is much higher. And in a previous job I worked with a PG database which collected metrics from a mobile network, and it collected ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition the data - most probably by day. With this amount of data you may find indexes quite useless; it depends on the plans you see in EXPLAIN output. For example, that telco app did not use any indexes at all, because they would just slow down the whole engine.
Another question is how quick the query responses need to be, and which granularities (sums over hours/days/weeks etc.) you will allow users to query. You may even need to pre-compute some aggregations for granularities like week, month or quarter.
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. It meant inserting ~23,000 records per second using bulk inserts with the COPY command, where every batch was several thousand records. The raw data were partitioned by minute. To avoid disk waits, the DB had 4 tablespaces on 4 different disks/arrays, and the partitions were distributed over them. PostgreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
It is also a good idea to move the pg_xlog directory to a separate disk or array - not just a different filesystem, it really should be separate hardware. SSDs I can recommend only in arrays with proper error checking; lately we had problems with a corrupted database on a single SSD.
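A minimal sketch of day partitioning with inheritance (the classic approach before declarative partitioning; table and column names are illustrative):

CREATE TABLE product_views (
    classified_id bigint      NOT NULL,
    recorded_at   timestamptz NOT NULL,
    views         integer     NOT NULL
);

-- One child per day; the CHECK constraint lets the planner skip non-matching days
CREATE TABLE product_views_2016_05_12 (
    CHECK (recorded_at >= '2016-05-12' AND recorded_at < '2016-05-13')
) INHERITS (product_views);

-- Queries against the parent only scan the children whose constraints match
SELECT classified_id, sum(views) AS total_views
FROM product_views
WHERE recorded_at >= '2016-05-12' AND recorded_at < '2016-05-13'
GROUP BY classified_id;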
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to store the time series in a column-oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it in the database, then, as @JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables - but read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
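For instance, the storage settings mentioned above could be applied like this (table and column names are illustrative):

-- UNLOGGED skips the write-ahead log; fillfactor 100 packs pages completely
CREATE UNLOGGED TABLE view_stats (
    classified_id bigint      NOT NULL,
    recorded_at   timestamptz NOT NULL,
    views         integer     NOT NULL
) WITH (fillfactor = 100);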
Just to raise another non-RDBMS option for you (so a little off topic), you could send files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query them directly with SQL.
Since it queries plain files, you may be able to just send it your unfiltered web logs and query them through JDBC.
I inherited an unmaintained database in which the partition function was set on a date field and expired on the first of the year. The data is largely historic, and I can control the jobs that import new data into this table.
My question relates to setting up or altering partitioning to cover so much data, roughly 1.5 TB counting indexes. This is on a live system and I don't know what kind of impact it will have with so many users connected at once. I will test this on a non-prod system, but there I can't get a realistic usage load. My alternative solution was to kick off all the users hitting the DB and quickly rename the table, swapping in a table that does have a proper partitioning scheme.
I wanted to:
-Keep the same partition function but extend it to:
keep all 2011 data up to a certain date (let's say Nov 22nd 2011) in one partition; all data coming in after that needs to go into its own new partitions
-Do a quick switch of the specific partition which has the full year's worth of data
Does anyone know whether altering a partition on a live system to include a new partition for a full year's worth of data, roughly 5-6 billion records and 1.5 TB, is plausible? Any pitfalls? I will share my test results once I complete them, but I'd welcome any input. Thanks!
Partition switches are a metadata-only operation, and the size of the partition being switched in or out does not matter; it can be 1 KB or 1 TB, and it takes exactly the same amount of time (i.e. very fast).
However, what you're describing is not a partition switch operation but a partition split: you want to split the last partition of the table into two, one containing all the existing data and a new, empty one. Splitting a partition has to physically move the data, and unfortunately this is an offline, size-of-data operation.
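To make the distinction concrete, here is a sketch of the two statements involved (object names, partition number and boundary date are illustrative, assuming a RANGE RIGHT function):

-- Partition switch: metadata only, effectively instant regardless of size
-- (the target table must have an identical structure and live on the same filegroup)
ALTER TABLE dbo.History SWITCH PARTITION 3 TO dbo.History_Archive;

-- Partition split: adds a new boundary and has to move rows, so it is size-of-data work
ALTER PARTITION SCHEME psHistory NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfHistory() SPLIT RANGE ('2011-11-22');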