How to add columns to an existing Hive partitioned table?

alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>)
I altered my table with the above query, but when checking the data afterwards I get NULLs for both of the extra columns. I'm not getting any data.

CASCADE is the solution.
Query:
ALTER TABLE dbname.table_name ADD COLUMNS (column1 string, column2 string) CASCADE;
This changes the columns of the table's metadata and cascades the same change to all the partition metadata.
RESTRICT is the default, which limits the column change to the table metadata only.
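To confirm the change actually reached the partitions, you can describe an individual partition afterwards. A minimal sketch, assuming a table partitioned by a hypothetical column dt:
-- add the columns and cascade the change to every partition's metadata
ALTER TABLE dbname.table_name ADD COLUMNS (column1 string, column2 string) CASCADE;
-- verify that an existing partition now sees the new columns
DESCRIBE dbname.table_name PARTITION (dt='2020-01-01');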

As others have noted, CASCADE changes the metadata for all partitions. Without CASCADE, if you want old partitions to include the new columns, you'll need to DROP those partitions first and then refill them; INSERT OVERWRITE without the DROP won't work, because the partition metadata won't be updated to the new table metadata.
Let's say you have already run alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>) without CASCADE by accident, and then you INSERT OVERWRITE an old partition without DROPPING it first. The data will be stored in the underlying files, but if you query that partition from Hive, the new columns won't show, because the partition metadata was never updated. This can be fixed without rerunning the INSERT OVERWRITE, as follows:
1. Run SHOW CREATE TABLE dbname.tblname and copy all the column definitions that existed before you added the new columns.
2. Run ALTER TABLE dbname.tblname REPLACE COLUMNS ({paste in col defs besides columns to add here}) CASCADE.
3. Run ALTER TABLE dbname.tblname ADD COLUMNS (newcol1 int COMMENT "new col") CASCADE.
4. Be happy that the metadata has been changed for all partitions =)
As an example of steps 2-3:
DROP TABLE IF EXISTS junk.testcascade ;
CREATE TABLE junk.testcascade (
startcol INT
)
partitioned by (d int)
stored as parquet
;
INSERT INTO TABLE junk.testcascade PARTITION(d=1)
VALUES
(1),
(2)
;
INSERT INTO TABLE junk.testcascade PARTITION(d=2)
VALUES
(1),
(2)
;
SELECT * FROM junk.testcascade ;
+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.d  |
+-----------------------+----------------+--+
| 1                     | 1              |
| 2                     | 1              |
| 1                     | 2              |
| 2                     | 2              |
+-----------------------+----------------+--+
-- no cascade! oops
ALTER TABLE junk.testcascade ADD COLUMNS (testcol1 int, testcol2 int);
INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=3)
VALUES
(1,1,1),
(2,1,1)
;
INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=2)
VALUES
(1,1,1),
(2,1,1)
;
-- okay! because this partition was created after the metadata was altered
select * FROM junk.testcascade where d=3;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | 1                     | 1                     | 3              |
| 2                     | 1                     | 1                     | 3              |
+-----------------------+-----------------------+-----------------------+----------------+--+
-- not okay even though we inserted =( because the metadata wasn't changed
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | NULL                  | NULL                  | 2              |
| 2                     | NULL                  | NULL                  | 2              |
+-----------------------+-----------------------+-----------------------+----------------+--+
-- cut back to the original columns
ALTER TABLE junk.testcascade REPLACE COLUMNS (startcol int) CASCADE;
-- add the new columns back, this time with CASCADE
ALTER TABLE junk.testcascade ADD COLUMNS (testcol1 int, testcol2 int) CASCADE;
--it works!
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | 1                     | 1                     | 2              |
| 2                     | 1                     | 1                     | 2              |
+-----------------------+-----------------------+-----------------------+----------------+--+

To add columns to a partitioned table, you need to recreate its partitions.
Suppose the table is external and the data files already contain the new columns. Do the following:
1. Alter the table to add the columns.
2. Recreate the partitions. For each partition, do a drop then a create. A newly created partition will inherit the table schema.
Alternatively, you can drop the table, create it again, and then either create all the partitions yourself or restore them simply by running the MSCK REPAIR TABLE abc command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is ALTER TABLE table_name RECOVER PARTITIONS.
See the manual here: RECOVER PARTITIONS
Also, in Hive 1.1.0 and later you can use the CASCADE option of ALTER TABLE ADD|REPLACE COLUMNS. See the manual here: ADD COLUMN
These suggestions work for external tables.
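As a minimal sketch of the drop-and-recreate route for an external table (the column layout, location, and partition column here are assumptions, not the asker's actual schema):
-- dropping an EXTERNAL table leaves the data files in place
DROP TABLE IF EXISTS abc;
CREATE EXTERNAL TABLE abc (
  col1 string,
  stats1 map<string,string>,
  stats2 map<string,string>
)
PARTITIONED BY (d int)
STORED AS PARQUET
LOCATION '/data/abc';
-- rediscover all partitions from the directories already on disk
MSCK REPAIR TABLE abc;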

This solution only works if your data is partitioned and you know the location of the latest partition. In that case, instead of doing a partition recovery or repair, which is a costly operation, you can do something like:
1. Read the partitioned table's files and get their schema details.
2. Read the table you want to update.
3. Find which columns differ and run an ALTER TABLE for each one.
Here is some Scala code for reference:
import org.apache.spark.sql.SparkSession

def updateMetastoreColumns(spark: SparkSession, partitionedTablePath: String, toUpdateTableName: String): Unit = {
  // fetch all column names along with their corresponding datatypes from the latest partition
  val partitionedTable = spark.read.orc(partitionedTablePath)
  val partitionedTableColumns = partitionedTable.columns zip partitionedTable.schema.map(_.dataType.catalogString)

  // fetch all column names along with their corresponding datatypes from the current table
  val toUpdateTable = spark.read.table(toUpdateTableName)
  val toUpdateTableColumns = toUpdateTable.columns zip toUpdateTable.schema.map(_.dataType.catalogString)

  // check which columns are present in the newer partition but missing from the table
  val diffColumns = partitionedTableColumns.diff(toUpdateTableColumns)

  // update the metastore with the new column info
  diffColumns.foreach { case (colName, colType) =>
    spark.sql(s"ALTER TABLE ${toUpdateTableName} ADD COLUMNS (${colName} ${colType})")
  }
}
This will help you dynamically find the latest columns added to a newer partition and update your metastore on the fly.
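For instance, if the newest partition's files contain a column that the metastore doesn't yet know about, the loop above would issue a statement like the following (table and column names are hypothetical):
ALTER TABLE my_table ADD COLUMNS (new_col int)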

Related

How to ignore duplicates without unique constraint in Postgres 9.4?

I am currently facing an issue in our old database (Postgres 9.4) with a table that contains some duplicate rows. I want to ensure that no more duplicate rows can be generated.
But I also want to keep the duplicate rows that have already been generated. Because of those existing rows, I could not apply a unique constraint on the columns in question (it spans multiple columns).
I have created a trigger which checks whether the row already exists and raises an exception accordingly. But it also fails when concurrent transactions are in progress.
Example:
TAB1
col1 | col2 | col3
-----+------+------
   1 | A    | B
   2 | A    | B      -- already-present duplicates on col2 and col3 (allowed)
   3 | C    | D
INSERT INTO TAB1 VALUES (4, 'A', 'B'); -- this insert statement should not be allowed
Note: I cannot use on conflict due to older version of database.
Presumably, you don't want new rows to duplicate historical rows. If so, you can do this, but it requires modifying the table and adding a new column.
alter table t add duplicate_seq int default 1;
Then update this column to identify existing duplicates:
update t
    set duplicate_seq = seqnum
    from (select t.*, row_number() over (partition by col order by col) as seqnum
          from t
         ) tt
    where t.<primary key> = tt.<primary key>;
Now, create a unique constraint:
alter table t add constraint unq_t_col_seq unique (col, duplicate_seq);
When you insert rows, do not provide a value for duplicate_seq. Its default is 1, which will conflict with any existing value -- or with duplicates entered more recently. Historical duplicates will still be allowed.
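A minimal sketch of the same idea applied to the question's TAB1, assuming col1 is the primary key and the duplicates are on col2 and col3:
alter table tab1 add duplicate_seq int default 1;

update tab1
    set duplicate_seq = tt.seqnum
    from (select col1, row_number() over (partition by col2, col3 order by col1) as seqnum
          from tab1
         ) tt
    where tab1.col1 = tt.col1;

alter table tab1 add constraint unq_tab1_seq unique (col2, col3, duplicate_seq);

-- row 2 already holds ('A', 'B') with duplicate_seq = 2, so it stays;
-- a new (4, 'A', 'B') defaults to duplicate_seq = 1 and collides with row 1:
insert into tab1 (col1, col2, col3) values (4, 'A', 'B');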
You can try to create a partial index to enforce uniqueness only for a subset of the table's rows. For example:
create unique index on t(x) where (d > '2020-01-01');
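Applied to the question's TAB1, a hedged sketch: if every existing duplicate has col1 <= 3 (an assumption about the data), the historical rows can be exempted:
-- only rows with col1 > 3 participate in the uniqueness check
create unique index tab1_col2_col3_new on tab1 (col2, col3) where (col1 > 3);
Note the caveat: this only prevents duplicates among the new rows themselves; a new row that duplicates a historical row would still be accepted, which is why the duplicate_seq approach above is more complete.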

How can I update the table in SQL?

I've created a table called Youtuber, the code is below:
create table Channel (
    codChannel int primary key,
    name varchar(50) not null,
    age float not null,
    subscribers int not null,
    views int not null
)
In this table, there are 2 channels:
| codChannel | name            | age | subscribers | views       |
|------------|-----------------|-----|-------------|-------------|
| 1          | PewDiePie       | 28  | 58506205    | 16654168214 |
| 2          | Grandtour Games | 15  | 429         | 29463       |
So, I want to edit the age of "Grandtour Games" to "18". How can I do that with update?
Is my code right?
update age from Grandtour Games where age='18'
No. For an UPDATE, you have to follow this pattern:
update tableName set columnWanted = 'newValue' where columnName = 'elementName'
In your code, put this:
update Channel set age=18 where name='Grandtour Games'
Comments:
/* Channel is the name of the table you'll update;
   set assigns the new value to the age column, in your case;
   where name='Grandtour Games' restricts the update to the channel named 'Grandtour Games' */
ALTER TABLE changes the schema (adding, updating, or removing columns or keys, that kind of thing).
UPDATE changes the data in the table without changing the schema.
So the two are really quite different.
Here is your answer:
-> ALTER is a DDL (Data Definition Language) statement, while UPDATE is a DML (Data Manipulation Language) statement.
-> ALTER is used to change the structure of the table (add/remove fields, indexes, etc.), whereas UPDATE is used to change data.
Hope this helps!
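To see the distinction concretely, here is a small sketch against the question's Channel table (the country column is hypothetical):
-- ALTER changes the structure: every row gains a new, initially NULL column
alter table Channel add column country varchar(50);
-- UPDATE changes the data: fill that column for one existing row
update Channel set country = 'Sweden' where name = 'PewDiePie';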

SQL - drop a column in netezza

I have a table table1, shown below, from which I'm trying to drop a column.
table1
id  name  time   value
-----------------------
1   john  11:00  324
2   NULL  12:00  645
3   NULL  13:00  324
4   jane  11:00  132
5   NULL  12:00  30
A temp table is created because the original table cannot be altered due to permissions. This case may look simple enough to solve by selecting everything except id, but what I really need is a way to get rid of one column when there is a large number of columns.
create temp table table2 as(
select * from table1
) distribute on random;
alter table table2 drop column id;
this gives the error: Drop behaviour (RESTRICT | CASCADE) needs to be specified when dropping a column or constraint
What should the ALTER TABLE statement look like?
As the error message and documentation say, you need to specify either RESTRICT or CASCADE. However, note that you can't drop a column from a true TEMPORARY table, so this only applies to normal tables.
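Given that restriction, a sketch of a workaround for the question, assuming you have permission to create a regular (non-temporary) table:
-- use a normal table instead of a temp table, since temp tables can't be altered
create table table2 as (
    select * from table1
) distribute on random;
alter table table2 drop column id restrict;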
ALTER TABLE <table> <action> [ORGANIZE ON {(<columns>) | NONE}]
Where <action> can be one of:
ADD COLUMN <col> <type> [<col_constraint>][,…] |
ADD <table_constraint> |
ALTER [COLUMN] <col> { SET DEFAULT <value> | DROP DEFAULT } |
DROP [COLUMN] column_name[,column_name…] {CASCADE | RESTRICT } |
DROP CONSTRAINT <constraint_name> {CASCADE | RESTRICT} |
MODIFY COLUMN (<col> VARCHAR(<maxsize>)) |
OWNER TO <user_name> |
RENAME [COLUMN] <col> TO <new_col_name> |
RENAME TO <new_table> |
SET PRIVILEGES TO <table>
Like this:
SYSTEM.ADMIN(ADMIN)=> create table t1 (col1 bigint, col2 varchar(5));
CREATE TABLE
SYSTEM.ADMIN(ADMIN)=> insert into t1 values (1,'One');
INSERT 0 1
SYSTEM.ADMIN(ADMIN)=> insert into t1 values (2,'Two');
INSERT 0 1
SYSTEM.ADMIN(ADMIN)=> insert into t1 values (3,'Three');
INSERT 0 1
SYSTEM.ADMIN(ADMIN)=> select * from t1;
 COL1 | COL2
------+-------
    3 | Three
    1 | One
    2 | Two
(3 rows)
SYSTEM.ADMIN(ADMIN)=> alter table t1 drop column col2 restrict;
ALTER TABLE
SYSTEM.ADMIN(ADMIN)=> select * from t1;
COL1
------
1
2
3
(3 rows)
As always, if you alter a table to drop or add a column, you should follow it up with a GROOM to clean up the versioned table:
SYSTEM.ADMIN(ADMIN)=> groom table t1 versions;
NOTICE: Groom will not purge records deleted by transactions that started after 2016-11-07 17:00:11.
NOTICE: If this process is interrupted please either repeat GROOM VERSIONS or issue 'GENERATE STATISTICS ON "T1"'
NOTICE: Groom processed 2 pages; purged 0 records; scan size unchanged; table size unchanged.
GROOM VERSIONS
SYSTEM.ADMIN(ADMIN)=>
This is the syntax for dropping a column in Netezza:
ALTER TABLE tablename DROP COLUMN columnname RESTRICT
According to this: http://datawarehouse.ittoolbox.com/groups/technical-functional/netezza-l/netezza-issue-2467523
it seems that you can't DROP a column via ALTER TABLE, only a constraint.

SQLITE: Transaction to check constraints after commit

I'm currently working on an SQLite table where I have to do the following:
ID | Name | SortHint
---|------|---------
 0 | A    | 1
 1 | B    | 2
 2 | C    | 3
ID is the primary key and SortHint is a column with a UNIQUE constraint. What I have to do is modify the table, for example:
ID | Name | SortHint
---|------|---------
 0 | A    | 3
 1 | B    | 1
 2 | C    | 2
The problem: because of the UNIQUE constraint I can't simply update one row after another. I tried:
BEGIN TRANSACTION;
UPDATE MyTable SET SortHint = 3 WHERE ID = 0;
...
COMMIT;
But the first update query immediately fails with:
UNIQUE constraint failed: MyTable.SortHint Unable to fetch row
So, is there a way to "disable" the unique constraint for a transaction and only check the constraints once the transaction is committed?
Notes:
I can't modify the table
It works if I only use SortHint values that are not already in the table
I know how to work around this problem, but I would like to know whether there is a way to do it as described above
One possibility is to drop the unique constraint and then add it again. That is a little bit expensive, though.
Another would be to set the values to negative values first:
UPDATE MyTable
    SET SortHint = -SortHint;

UPDATE MyTable
    SET SortHint = 3
    WHERE ID = 0;
. . .
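Spelled out for the question's three rows, the whole sign-flip sequence would be (a sketch matching the target values above):
BEGIN TRANSACTION;
-- move every SortHint out of the way; negative values don't collide
UPDATE MyTable SET SortHint = -SortHint;
-- the positive range is now free, so assign the new values
UPDATE MyTable SET SortHint = 3 WHERE ID = 0;
UPDATE MyTable SET SortHint = 1 WHERE ID = 1;
UPDATE MyTable SET SortHint = 2 WHERE ID = 2;
COMMIT;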
If you cannot modify the table, you are not able to remove the constraint.
A workaround could be to move the SortHint values to a range that is not in use.
For example, you could add 10,000 to all of them and commit.
Then change them to the right numbers, which have now become free.
Afterwards, you could verify that no values of 10,000 or higher remain.
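A minimal sketch of that offset approach (10,000 is an arbitrary offset assumed to be above any real SortHint):
-- step 1: shift everything into an unused range and commit
UPDATE MyTable SET SortHint = SortHint + 10000;
-- step 2: the original range is now free; assign the final values
UPDATE MyTable SET SortHint = 3 WHERE ID = 0;
UPDATE MyTable SET SortHint = 1 WHERE ID = 1;
UPDATE MyTable SET SortHint = 2 WHERE ID = 2;
-- step 3: verify nothing was left behind in the shifted range
SELECT COUNT(*) FROM MyTable WHERE SortHint >= 10000;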

MySQL data version control

Is there any way to set up MySQL so that every time a row is changed, a row is created in another table/database with the original data (with a timestamp)?
If so, how would I go about doing it?
E.g.
UPDATE `live_db`.`people`
SET `live_db`.`people`.`name` = 'bob'
WHERE `id` = 1;
Causes this to happen before the update:
INSERT INTO `changes_db`.`people`
SELECT *
FROM `live_db`.`people`
WHERE `live_db`.`people`.`id` = 1;
And if you did it again, the result would be something like this:
`live_db`.`people`
+----+-------+---------------------+
| id | name  | created             |
+----+-------+---------------------+
|  1 | jones | 10:32:20 12/06/2010 |
+----+-------+---------------------+

`changes_db`.`people`
+----+-------+---------------------+
| id | name  | updated             |
+----+-------+---------------------+
|  1 | billy | 12:11:25 13/06/2010 |
|  1 | bob   | 03:01:54 14/06/2010 |
+----+-------+---------------------+
The live DB needs a created timestamp on its rows, and the changes DB needs a timestamp for when the live DB row was updated.
The changes DB will also have no primary keys or foreign key constraints.
I'm using InnoDB and MySQL 5.1.49, but can upgrade if required.
Use a trigger.
MySQL has supported triggers since version 5.0.2.
You can create a trigger like this:
DELIMITER \\
CREATE TRIGGER logtrigger BEFORE UPDATE ON live_db.people
FOR EACH ROW
BEGIN
    INSERT INTO changes_db.people (id, name, updated)
    VALUES (OLD.id, OLD.name, NOW());
END\\
DELIMITER ;
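A quick way to see the trigger working, using the values from the question:
-- this update fires the trigger...
UPDATE live_db.people SET name = 'bob' WHERE id = 1;
-- ...so the pre-update row lands in the log with a timestamp
SELECT * FROM changes_db.people WHERE id = 1;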
This is how I ended up doing it:
DELIMITER |
# Create the log table
CREATE TABLE IF NOT EXISTS `DB_LOG`.`TABLE`
    LIKE `DB`.`TABLE`|
# Remove any auto increment
ALTER TABLE `DB_LOG`.`TABLE` CHANGE `PK` `PK` INT UNSIGNED NOT NULL|
# Drop the primary key
ALTER TABLE `DB_LOG`.`TABLE` DROP PRIMARY KEY|
# Create the trigger
DROP TRIGGER IF EXISTS `DB`.`update_TABLE`|
CREATE TRIGGER `DB`.`update_TABLE` BEFORE UPDATE ON `DB`.`TABLE` FOR EACH ROW
BEGIN
    INSERT INTO `DB_LOG`.`TABLE`
        SELECT `DB`.`TABLE`.*
        FROM `DB`.`TABLE`
        WHERE `DB`.`TABLE`.`PK` = NEW.`PK`;
END|
DELIMITER ;
Sorry to comment on an old post, but I was looking to solve this exact problem! Thought I would share this information.
This outlines a solution perfectly:
http://www.hirmet.com/mysql-versioning-records-of-tables-using-triggers