What is the best (less expensive) equivalent of SQL Server UPDATE SET command in Hive?
For example, consider the case in which I want to convert the following query:
UPDATE TABLE employee
SET visaEligibility = 'YES'
WHERE experienceMonths > 36
to equivalent Hive query.
I'm assuming you have a table without partitions, in which case you should be able to do the following command:
INSERT OVERWRITE TABLE employee SELECT employeeId,employeeName, experienceMonths ,salary, CASE WHEN experienceMonths >=36 THEN ‘YES’ ELSE visaEligibility END AS visaEligibility FROM employee;
There are other ways but they are much more convoluted, I think the way Bejoy described is the most efficient.
(source: Bejoy KS blog)
Note that if you have to do this on a partitioned table (which is likely if you have a lot of data), you would probably need to overwrite your partition when doing this.
You can create an external table and use the 'insert overwrite into local directory' and in case you want to change the column values, you can use 'CASE WHEN', 'IF' or other conditional operations. And copy the output file back to HDFS location.
You can upgrade your hive to 0.14.0
Starting from 0.14.0 hive supports UPDATE operation.
To do the same we need to create hive tables such that they support ACID output format and need to set additional properties in hive-site.xml.
How to do CURD operations in Hive
Related
Suppose I have a non-transactional table in Hive named 'ccm'. It has hundreds of columns and one partition field.
I know how to create a copy with "create table abc like ccm' but I would like abc to be bucketed, ORC, and have transaction support set on via TBLPROPERTIES.
I do not want to mention all the columns in ccm when I compose the HQL.
Can I do this?
This answer may have the correct way to proceed in your case, and it also explains some limitation of the method used.
Create hive table using "as select" or "like" and also specify delimiter
So, from the example, you should add the missing parts:
CLUSTER BY
TBLPROPERTIES ("transactional"="true")
I have some doubts that you can achieve exactly your expected results but i would consider it as a step forward
How to find when table rows were last updated/inserted? Presto is ANSI-SQL compliant so even if you don't know Presto, maybe there's a generic SQL way that would point me in the right direction.
I'm using Hadoop. Presto queries are quicker than Hive. "Describe" just gives column names.
https://prestosql.io/docs/current/
Presto 309 added a hidden $properties table in the Hive connector for each table that exposes the Hive table properties. You can use it to find the last update time (replace example with your table name):
SELECT transient_lastddltime FROM "example$properties"
I am creating and insert tables in HIVE,and the files are created on HDFS and some on external storage S3
Assuming if I created a 10 tables,is there any system table in Hive where I can find the table info created by the user??? (for example like in Teradata we have DBC.tablesv which hold information of all the user defined tables)
You can find where you metastore is configured to be in the hive-site.xml file.
Its usual location is under /etc/hive/{$hadoop_version}/ or /etc/hive/conf/.
grep for "hive.metastore.uris" or "javax.jdo.option.ConnectionURL" to see which db you are using for the metastore. The credentials should also be there.
If, for example, your metastore is on a MySQL server, you can run queries like
SELECT * FROM TBLS;
SELECT * FROM PARTITIONS;
etc
You can't query (as in SELECT ... FROM...) the metadata from within Hive.
You do however have comnands that display that information, e.g. show databases, show tables, desc MyTable etc.
I'm not sure I understood 100% your question, if you mean the informations about the creation of the table, like the query itself, with the location on HDFS, table properties, etc, you can try with:
SHOW CREATE TABLE <table>;
If you need to retrieve a list of the columns names and datatypes try with:
DESCRIBE <table>;
Currently we are dropping the table daily and running the script which loads the data to the tables. Script takes 3-4 hrs during which data will not be available. So now our aim is to make the old hive data available to analysts until new data load execution is complete.
I am achieving this thing in hql script by loading daily data to the hive tables partitioned on load_year, load_month and load_day and dropping the yesterdays data by dropping the partition.
But what is the option for pig script to achieve the same? Can we alter the table through pig script? I dont want to execute the other hql to drop partition after pig.
Thanks
Since HDP 2.3 you can use HCatalog commands inside Pig scripts. Therefore, you can use the HCatalog command to drop a Hive table partition. The following is an example of dropping a Hive partition:
-- Set the correct hcat path
set hcat.bin /usr/bin/hcat;
-- Drop a table partion or execute other any Hcatalog command
sql ALTER TABLE midb1.mitable1 DROP IF EXISTS PARTITION(activity_id = "VENTA_ALIMENTACION",transaction_month = 1);
Another way is to use sh command execution inside Pig Script. However I had some problems to escape special characters in ALTER commands. So, the first is the best option in my opinion.
Regards,
Roberto Tardío
I am moving data around within Impala, not my design, and I have lost some data. I need to copy the data from the parquet tables back to their original non-parquet tables. Originally, the developers had done this with a simple one liner in a script. Since I don't know anything about databases and especially about Impala I was hoping you could help me out. This is the one line that is used to translate to a parquet table that I need to be reversed.
impalaShell -i <ipaddr> use db INVALIDATE METADATA <text_table>;
CREATE TABLE <parquet_table> LIKE <text_table> STORED AS PARQUET TABLE;
INSERT OVERWRITE <parquet_table> SELECT * FROM <text_table>;
Thanks.
Have you tried simply doing
CREATE TABLE <text_table>
AS
SELECT *
FROM <parquet_table>
Per the Cloudera documentation, this should be possible.
NOTE: Ensure that your does not exist or use a table name that does not already exist so that you do not accidentally overwrite other data.