Hive Add partition to external table slow

So I need to create an external table for some data stored on S3 and add its partitions explicitly (unfortunately, the directory hierarchy does not fit Hive's dynamic partitioning, because the directory names do not match the partition column names).
For example: add a partition for region=euwest1, year=2018, month=01, day=18, hour=18 located at s3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=18/
I ran this on an EMR cluster with Hive 2.3.2 and instance type r4.2xlarge, which has 8 vCores and 61 GB of RAM.
It takes about 4 seconds to add one partition, which is not too bad on its own, but if we need to process multiple days of data, adding the partitions takes a long time.
Is there any way to make this process faster?
Thanks
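Not from the original post, but a common way to cut the per-partition overhead: Hive accepts many partitions in a single ALTER TABLE … ADD PARTITION statement, so you pay the query-launch cost once per batch instead of once per partition. A minimal sketch, assuming a table named mydata with string-typed partition columns; the locations follow the layout from the question:
ALTER TABLE mydata ADD IF NOT EXISTS
  PARTITION (region='euwest1', year='2018', month='01', day='18', hour='18')
    LOCATION 's3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=18/'
  PARTITION (region='euwest1', year='2018', month='01', day='18', hour='19')
    LOCATION 's3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=19/';
Generating one such statement per day (24 partitions) or per batch of days keeps the number of Hive queries, and therefore the total time, small.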

Related

PutHiveQL NiFi Processor extremely slow - misconfiguration?

I am currently setting up a simple NiFi flow that reads from an RDBMS source and writes to a Hive sink. The flow works as expected until the PutHiveQL processor, which is running extremely slowly: it inserts roughly one record per minute.
It is currently set up as a standalone instance running on one node.
The logs show the inserts arriving about once a minute:
(INSERT INTO customer (id, name, address) VALUES (x, x, x))
Any ideas about why this may be? Improvements to try?
Thanks in advance
Inserting one record at a time into Hive will be extremely slow.
Since you are doing regular inserts into the Hive table, change your flow:
QueryDatabaseTable
PutHDFS
Then create a Hive Avro table on top of the HDFS directory where you have stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC // in case you need to store the data in ORC format
PutHDFS
Then create a Hive ORC table on top of the HDFS directory where you have stored the data (example DDL below).
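For reference, a minimal sketch of the external table you would then create on top of the PutHDFS output directory; the table name and HDFS path are made up, and the columns simply mirror the INSERT example above:
CREATE EXTERNAL TABLE customer_orc (
  id      BIGINT,
  name    STRING,
  address STRING
)
STORED AS ORC
LOCATION '/data/customer_orc';
The Avro variant is the same pattern with STORED AS AVRO and the Avro files under the table location.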
Are you pushing one record at a time? If so, you can use the MergeRecord processor to create batches before pushing them to PutHiveQL.
It is recommended to batch around 100 records:
See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size | 100 | The preferred number of FlowFiles to put to the database in a single transaction
Use the MergeRecord processor and set the number of records and/or a timeout; it should speed things up considerably.

AWS Athena MSCK REPAIR TABLE tablename command

Is there any number of partitions we would expect this command
MSCK REPAIR TABLE tablename;
to fail on?
I have a system that currently has over 27k partitions. When the schema changes for the Athena table, we drop the table, recreate it with, say, the new column(s) tacked onto the end, and then run
MSCK REPAIR TABLE tablename;
We had no luck with this command doing any work whatsoever, even after letting it run for 5 hours. Not a single partition was added. I am wondering if anyone has information about a partition limit we may have hit, because I can't find one documented anywhere.
MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
What to do instead depends on a number of things that are unique to your situation.
In the general case I would recommend writing a script that performs S3 listings, constructs a list of partitions with their locations, and uses the Glue BatchCreatePartition API to add the partitions to your table.
When your S3 location contains lots of files, as it sounds like yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I only list the directory/partition structure part of the bucket and skip listing all the files. 27K partitions can be listed fairly quickly if you avoid listing everything.
Glue's BatchCreatePartition is a bit annoying to use, since you have to specify all the columns, the serde, and so on for each partition, but it's faster than running ALTER TABLE … ADD PARTITION … and waiting for the query execution to finish, and ridiculously faster than MSCK REPAIR TABLE ….
When it comes to adding new partitions to an existing table, you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table, you already know their locations, and ALTER TABLE … ADD PARTITION … or Glue's BatchCreatePartition can be used directly, with no scripting necessary.
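For example (Athena DDL; the table name follows the question, while the dt partition column and the locations are made up), adding partitions whose locations you already know is a single fast statement, and several partitions can go into one call:
ALTER TABLE tablename ADD IF NOT EXISTS
  PARTITION (dt = '2018-01-18') LOCATION 's3://mybucket/data/dt=2018-01-18/'
  PARTITION (dt = '2018-01-19') LOCATION 's3://mybucket/data/dt=2018-01-19/';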
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.

Oracle 12c - other database sessions run long while a SQL*Loader session is running

When a session is loading a different partition of the table using SQL*Loader to load one flat file, we see other sessions on the same table start running long and eventually fail with the error
"ORA-08103: object no longer exists"
We did some initial checks on this.
Our table is partitioned, and we are using the partition key to load from the flat file.
The table's object ID is not changing during that period. Only the partition's ID is changing, because we do a truncate partition before loading.
We have no global indexes on it; all indexes are local.
Some flat files are very large, and sometimes the AWR report shows the I/O system completely consumed.

How to handle hive locking across hive and presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts through Presto. Naturally, we're running into windows of time where users hit an incomplete data set, because Presto ignores Hive locks.
The options I can think of:
Fork the presto-hive connector to support Hive S and X locks appropriately. This isn't too bad, but it is time-consuming to do properly.
Swap the table location in the Hive metastore once an insert-overwrite is complete. This is OK, but a little messy, because we like to store explicit locations at the database level and let the tables inherit their location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; once the new partition is written, alter the Hive table to see it. Then we can have views on top of the data that properly reconcile the latest version of each row (a rough sketch follows this list).
Stop doing insert-overwrite on S3, which has a long copy window from the Hive staging directory to the target table. If we move to HDFS for all insert-overwrites we still have the issue, but only over the span of an hdfs mv, which is significantly faster. (Probably bad: there is still a window in which we can get incomplete data.)
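A rough sketch of the third option, with made-up table, column, and partition names: each load writes into a new version partition, and a view reconciles the latest version of each row.
CREATE TABLE orders_versions (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (version STRING)
STORED AS ORC;

CREATE OR REPLACE VIEW orders AS
SELECT order_id, amount
FROM (
  SELECT order_id, amount,
         row_number() OVER (PARTITION BY order_id ORDER BY version DESC) AS rn
  FROM orders_versions
) v
WHERE rn = 1;
Each batch writes to a fresh partition (e.g. version='2018-01-18T18'), the partition is registered with ALTER TABLE … ADD PARTITION once the write completes, and readers query the view instead of the base table (assuming your Presto deployment can resolve Hive views; otherwise an equivalent view has to be defined on the Presto side as well).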
My question is: how do people generally handle this? It seems like a common scenario that should have an explicit solution, but I seem to be missing it. The question applies in general to any third-party tool that queries the Hive metastore and interacts with HDFS/S3 directly while not respecting Hive locks.

Non-HBase solution for storing huge data and updating in real time

Hi, I have developed an application where I have to store about 5 TB of data initially, and then a monthly increment of around 20 GB of inserts/updates/deletes, delivered as XML, that is applied on top of this 5 TB of data.
Finally, on request, I have to generate a full snapshot of all the data and create 5K text files based on the business logic, so that the respective data ends up in the respective files.
I have done this project using HBase.
I have created 35 tables in HBase, each with between 10 and 500 regions.
I have my data in HDFS, and using MapReduce I bulk-load it into the respective HBase tables.
After that I have a SAX parser application written in Java that parses all incoming incremental XML files and updates the HBase tables. The XML files arrive at roughly 10 files per minute, with a total of about 2000 updates.
The incremental messages are strictly in order.
Finally, on request, I run my last MapReduce application to scan the HBase tables, create the 5K text files, and deliver them to the client.
All 3 steps work fine, but when I went to deploy my application on the production server, which is a shared cluster, the infrastructure team would not allow us to run it because I do a full table scan on HBase.
I have used a 94-node cluster, and the biggest HBase table I have holds approximately 2 billion records. All the other tables have less than a million records.
The MapReduce job that scans and creates the text files takes about 2 hours in total.
Now I am looking for some other solution to implement this.
I can't simply use Hive, because I have record-level inserts/updates/deletes, and they have to be applied in a very precise manner.
I have also integrated the HBase and Hive tables, so that the HBase table is used for the incremental data and Hive is used for the full table scan (a sketch of this integration follows).
But since Hive uses the HBase storage handler, I can't create partitions on the Hive table, and that is why the Hive full table scan becomes very, very slow, even 10 times slower than the HBase full table scan.
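For reference, the HBase-Hive integration mentioned above is typically declared with the HBase storage handler; a minimal sketch, where the Hive table name, column family, and qualifier are made up and only the HBase table name comes from the steps below:
CREATE EXTERNAL TABLE hbase_fundamental_analytic (
  rowkey     STRING,
  metric_val STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:metric_val")
TBLPROPERTIES ("hbase.table.name" = "FundamentalAnalytic");
As noted, such a table cannot be partitioned on the Hive side, so every Hive query over it scans through the storage handler.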
I can't think of any solution right now; I am kind of stuck.
Please help me with some other solution where HBase is not involved.
Can I use Avro or Parquet files for this use case? I am not sure how Avro would support record-level updates.
I will answer my own question.
My issue is that I don't want to perform a full table scan on HBase, because it will impact the performance of the region servers; especially on a shared cluster, it will hit the read/write performance of HBase.
So my solution still uses HBase, because it is very good for updates, especially delta updates, i.e. column updates.
So, in order to avoid the full table scan, take a snapshot of the HBase table, export it to HDFS, and then run the full table scan on the HBase table snapshot.
Here are the detailed steps for the process.
Create the snapshot:
snapshot 'FundamentalAnalytic','FundamentalAnalyticSnapshot'
Export the snapshot to HDFS:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot FundamentalAnalyticSnapshot -copy-to /tmp -mappers 16
Driver job configuration to run MapReduce on the HBase snapshot:
String snapshotName = "FundamentalAnalyticSnapshot";
Path restoreDir = new Path("hdfs://quickstart.cloudera:8020/tmp");
String hbaseRootDir = "hdfs://quickstart.cloudera:8020/hbase";
Scan scan = new Scan();                                   // assumed: not shown in the original snippet; restrict column families here if possible
Job job = Job.getInstance(HBaseConfiguration.create());   // assumed: not shown in the original snippet
TableMapReduceUtil.initTableSnapshotMapperJob(
    snapshotName,        // snapshot name
    scan,                // Scan instance to control CF and attribute selection
    DefaultMapper.class, // mapper class
    NullWritable.class,  // mapper output key
    Text.class,          // mapper output value
    job,
    true,                // add dependency jars
    restoreDir);         // directory where the snapshot is restored for the job
Also, running MapReduce on an HBase snapshot skips the scan on the live HBase table, so there is no impact on the region servers.
The key to using HBase efficiently is DESIGN. With a good design you will never have to do a full scan. That is not what HBase was made for. Instead you could have been doing a scan with a Filter - something HBase was built to handle efficiently.
I cannot check your design now, but I think you may have to revisit it.
The idea is not to design an HBase table the way you would an RDBMS table; the key is designing a good rowkey. If your rowkey is well built, you should never need a full scan.
You may also want to use a project like Apache Phoenix if you want to access your table using columns other than the rowkey. It also performs well; I have had a good experience with Phoenix.
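For instance, a minimal Phoenix SQL sketch with made-up table, column, and index names: a secondary index lets queries on a non-rowkey column be served without a full table scan.
CREATE TABLE IF NOT EXISTS fundamental_analytic (
  record_id  VARCHAR NOT NULL PRIMARY KEY,
  symbol     VARCHAR,
  metric_val DOUBLE
);

CREATE INDEX IF NOT EXISTS idx_symbol
  ON fundamental_analytic (symbol) INCLUDE (metric_val);

-- answered from the index rather than by scanning the data table
SELECT symbol, metric_val
FROM fundamental_analytic
WHERE symbol = 'ABC';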