Currently, I have the following Hive properties set:
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=2;
With these settings, Hive creates ACID tables by default.
I would like to create non-ACID tables by default. To do that, should I change hive.txn.manager to DummyTxnManager?
When users want a transactional table, they should have to explicitly set transactional=true when creating it. In that case, how does the transactional table get its transactional features from DbTxnManager?
I would also like to know on what basis DbTxnManager applies if these properties are not set in hive-site.xml.
Also, what is the difference between DbTxnManager and setting transactional=true on a table?
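For illustration, the table-level property can be set explicitly either way, independent of the default; this is a sketch with hypothetical table and column names:

```sql
-- Explicitly non-transactional table: not managed by the ACID machinery,
-- even when DbTxnManager is the active transaction manager.
CREATE TABLE plain_events (
  id BIGINT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='false');

-- Explicitly transactional table: requires DbTxnManager,
-- hive.support.concurrency=true, and ORC storage.
CREATE TABLE acid_events (
  id BIGINT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```

Roughly, DbTxnManager is the engine-side component that provides locking, transactions, and compaction, while transactional=true marks which tables opt in to being managed by it; both are needed for ACID behavior.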
I have a database with a single table. This table will need to be updated every few weeks. We need to ingest third-party data into it and it will contain 100-120 million rows. So the flow is basically:
Get the raw data from the source
Detect inserts, updates & deletes
Make updates and ingest into the database
What's the best way of detecting and performing updates?
Some options are:
Compare incoming data with current database one by one and make single updates. This seems very slow and not feasible.
Ingest incoming data into a new table, then switch out old table with the new table
Bulk updates in-place in the current table. Not sure how to do this.
What do you suggest is the best option, or if there's a different option out there?
Postgres has a helpful guide for improving the performance of bulk loads. From your description, you need to perform a bulk INSERT in addition to a bulk UPDATE and DELETE. Below is a rough step-by-step guide to making this efficient:
Configure Global Database Configuration Variables Before the Operation
ALTER SYSTEM SET max_wal_size = <size>;
You can additionally reduce WAL logging to the minimum (it cannot be disabled entirely):
ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET archive_mode = 'off';
ALTER SYSTEM SET max_wal_senders = 0;
Note that these changes will require a database restart to take effect.
Start a Transaction
You want all work to be done in a single transaction in case anything goes wrong. Running COPY in parallel across multiple connections does not usually increase performance as disk is usually the limiting factor.
Optimize Other Configuration Variables at the Transaction level
SET LOCAL maintenance_work_mem = <size>
...
You may need to set other configuration parameters if you are doing any additional special processing of the data inside Postgres (work_mem is usually the most important, especially if you are using the PostGIS extension). See this guide for the most important configuration variables for performance.
CREATE a TEMPORARY table with no constraints.
CREATE TEMPORARY TABLE changes(
  id bigint,
  data text
) ON COMMIT DROP; -- ensures this table will be dropped at end of transaction
Bulk Insert Into changes using COPY FROM
Use the COPY FROM Command to bulk insert the raw data into the temporary table.
COPY changes(id,data) FROM ..
DROP Relations That Can Slow Processing
On the target table, DROP all foreign key constraints, indexes and triggers (where possible). Don't drop your PRIMARY KEY, as you'll want that for the INSERT.
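As a sketch, this step might look like the following; the constraint and index names are hypothetical, so look up your actual ones in pg_constraint and pg_indexes first:

```sql
-- Hypothetical names: replace with the actual constraints/indexes on target.
ALTER TABLE target DROP CONSTRAINT target_customer_fk;
DROP INDEX IF EXISTS target_data_idx;

-- Disables user-defined triggers only; system triggers stay active.
ALTER TABLE target DISABLE TRIGGER USER;
```

Record what you drop here, because you will recreate exactly these objects after the upsert completes.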
Add a Tracking Column to target Table
Add a column to target table to determine if row was present in changes table:
ALTER TABLE target ADD COLUMN seen boolean;
UPSERT from the changes table into the target table:
UPSERTs are performed by adding an ON CONFLICT clause to a standard INSERT statement. This avoids the need to perform two separate operations (an UPDATE for existing rows and an INSERT for new ones).
INSERT INTO target(id,data,seen)
SELECT
id,
data,
true
FROM
changes
ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data, seen = true;
DELETE Rows Not In changes Table
DELETE FROM target WHERE seen IS NOT TRUE;
DROP Tracking Column and Temporary changes Table
DROP TABLE changes;
ALTER TABLE target DROP COLUMN seen;
Add Back Relations You Dropped For Performance
Add back all constraints, triggers and indexes that were dropped to improve bulk upsert performance.
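A sketch of recreating the objects dropped earlier; the definitions below are hypothetical and should mirror exactly what existed before:

```sql
-- Hypothetical definitions: recreate exactly what you dropped earlier.
CREATE INDEX target_data_idx ON target (data);

ALTER TABLE target
  ADD CONSTRAINT target_customer_fk
  FOREIGN KEY (customer_id) REFERENCES customers (id);

-- Re-enable the user triggers disabled before the bulk load.
ALTER TABLE target ENABLE TRIGGER USER;
```

Building indexes once over the final data is much cheaper than maintaining them row by row during the bulk upsert, which is the point of this drop/recreate dance.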
Commit Transaction
The bulk upsert/delete is complete and the following commands should be performed outside of a transaction.
Run VACUUM ANALYZE on the target Table.
This will allow the query planner to make appropriate inferences about the table and reclaim space taken up by dead tuples.
SET maintenance_work_mem = <size>
VACUUM ANALYZE target;
SET maintenance_work_mem = <original size>
Restore Original Values of Database Configuration Variables
ALTER SYSTEM SET max_wal_size = <size>;
...
You may need to restart your database again for these settings to take effect.
I have one environment in which queries reference more than 100 tables. Now I need to run the same queries in a read-only environment, which requires qualifying every table as <schema_name>.<table_name>. Since the environment is read-only, I cannot create synonyms for all of them.
Instead of writing the schema name as a prefix on each table, is there a shortcut? I am just guessing whether anything is possible. They all belong to the same schema.
Try this out. It will set your session environment to the specified schema and, as a consequence, you no longer need the <schema_name> prefix.
ALTER SESSION SET CURRENT_SCHEMA = <schema_name>;
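You can verify that the change took effect before running your queries; schema and table names below are placeholders:

```sql
-- app_schema is a placeholder for your actual schema name.
ALTER SESSION SET CURRENT_SCHEMA = app_schema;

-- Confirm the active schema for this session.
SELECT SYS_CONTEXT('USERENV', 'CURRENT_SCHEMA') FROM dual;

-- Unqualified names now resolve against app_schema.
SELECT COUNT(*) FROM some_table;  -- resolves to app_schema.some_table
```

Note this only changes name resolution for the session; it does not grant any privileges, so you still need SELECT rights on the underlying tables.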
I want to create a Hive transactional table with TBLPROPERTIES ("transactional"="true") in the CREATE TABLE statement. Instead of setting it for every table, can I set TBLPROPERTIES via hive-site.xml?
Unfortunately, you can't set it in hive-site.xml, since transactional is a per-table property. And you should not do it that way, because transactional tables come with prerequisites and limitations.
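So the property has to appear in each CREATE TABLE statement individually, along the lines of the following sketch (table and column names are illustrative):

```sql
CREATE TABLE orders_acid (
  order_id BIGINT,
  amount DOUBLE
)
STORED AS ORC                            -- full ACID tables must use ORC
TBLPROPERTIES ('transactional'='true');  -- per-table, not a global setting
```

This is also why making it global would be risky: every table would silently inherit the ORC requirement, compaction overhead, and the restrictions that come with ACID semantics.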
I have a scenario where I need to increase hbase.client.scanner.caching from 100 to 10000. But I don't want to make this change permanent; I only need it for the particular session in which I am querying through the Hive query engine. Is there any way to set this property for that session only?
e.g.
set hbase.client.scanner.caching = 10000;
SELECT count(*) FROM hive_external_table;
-- but setting the parameter is not taking any effect
-- where hive_external_table is an external table mapped from hbase_table
Yes, you can definitely set the property value that way; just don't put whitespace around the key=value pair.
Use following:
hive> set hbase.client.scanner.caching=10000;
hive> SELECT count(*) FROM hive_external_table;
It will override the default value for the current session.
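You can confirm the session override by reading the property back before running the query (this assumes the Hive CLI or Beeline):

```sql
-- No spaces around '=' when setting the value.
set hbase.client.scanner.caching=10000;

-- Reading a property back (no value given) prints its current setting,
-- so you can confirm the override took effect for this session.
set hbase.client.scanner.caching;

SELECT count(*) FROM hive_external_table;
```

The override lasts only for the current session and never touches hive-site.xml, which matches the requirement of not making the change permanent.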
I have set all the parameters that need to be set in Hive for using transactions.
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=0;
Created table using below command
CREATE TABLE Employee(Emp_id int, name string, company string, Desg string) CLUSTERED BY (Emp_id) INTO 5 BUCKETS STORED AS ORC TBLPROPERTIES('transactional'='true');
Inserted Data in hive table by using below command
INSERT INTO TABLE Employee VALUES (1,'Jigyasa','Infosys','Senior System Engineer'), (2,'Pooja','HCL','Consultant'), (3,'Ayush','Asia Tours and travels','Manager'), (4,'Tarun','Dell','Architect'), (5,'Namrata','Apolo','Doctor');
But while Updating the data
UPDATE Employee SET company='Ganga Ram' WHERE Emp_id=5;
I am getting below error message
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
Older versions of Hive have a bug where
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; at the CLI doesn't take effect.
You can check this by running "set hive.txn.manager;", which will print the current value.
The safest way is to set this in hive-site.xml.
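A minimal hive-site.xml fragment for this would look roughly like the following (property names are taken from the question; the worker-thread value is illustrative):

```xml
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <!-- Must be greater than 0 somewhere in the cluster, or compaction never runs. -->
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```

Note that in the question hive.compactor.worker.threads was set to 0, which would also prevent compaction from ever running, so it is worth correcting alongside the transaction manager setting.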