I have a Hive table that is partitioned on a date field and gets loaded every day. We got a request to add a new column at the end and load the data into the same Hive table. Are there any better ways to handle such column change requests while keeping the existing data?
Do I need to delete the data in the existing table, recreate the table with the new columns, and reload the data?
In which format do you save the data?
If you are using the Avro format, just add the new field to the .avsc file and set a default value:
{
  "name": "yourData",
  "type": ["null", "string"],
  "default": null
}
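For example, if the Hive table reads its schema from that .avsc file, a table defined roughly like the sketch below (table name, schema URL, and location are illustrative) will pick up the new field as soon as the schema file is updated:
-- Illustrative Avro-backed table; the schema lives in an external .avsc file,
-- so adding a field with a default there is enough for old data to stay readable.
CREATE EXTERNAL TABLE your_avro_table
PARTITIONED BY (load_date string)
STORED AS AVRO
LOCATION '/data/your_avro_table'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/your_avro_table.avsc');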
If you store the data as CSV, it seems to be a little more complicated.
Changing the table with ALTER TABLE didn't work in my case (I have no idea why).
So I deleted the table, recreated it with the new columns, re-added the partitions, and it works.
Make sure that your table is an external table; then you don't have to delete the data.
e.g.:
Old Data:
889,5CE1,2016-07-25
New Data:
900,5DCBA,2016-07-25,2012-03-22,152047
hive:
create table somData (
anid int
,astring String
,extractDate date
)
PARTITIONED BY(cusPart STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TextFile location "/your/location";
What you have to do:
ALTER TABLE somData SET TBLPROPERTIES('EXTERNAL'='TRUE');
drop table somData;
create table somData (
anid int
,astring String
,extractDate date
,anotherDate date
,someInt int
)
PARTITIONED BY(cusPart STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TextFile location "/your/location";
ALTER TABLE somData ADD IF NOT EXISTS PARTITION (cusPart='foo') LOCATION '/your/partitioned/data';
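For reference, the ALTER TABLE form that was attempted would look roughly like the sketch below. The CASCADE clause (Hive 1.1+) also updates the metadata of existing partitions, which is often the reason a plain ADD COLUMNS appears not to work on already-loaded partitions:
-- Add the new columns to the table definition; CASCADE propagates the change
-- to the metadata of existing partitions as well.
ALTER TABLE somData ADD COLUMNS (anotherDate date, someInt int) CASCADE;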
I have a Hive external table pointing to a location on S3. My requirement is that I will be uploading a new file to this S3 location every day, and the data in my Hive table should be overwritten.
Every day my script will create a folder under 's3://employee-data/' and place a CSV file there.
e.g. s3://employee-data/20190812/employee_data.csv
Now I want my Hive table to pick up this new file under the new folder every day and overwrite the existing data. I can get the folder name - '20190812' - through my ETL.
Can someone help?
I tried ALTER TABLE ... SET LOCATION 'new location'. However, this does not overwrite the data.
create external table employee
(
name String,
hours_worked Integer
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
Set the new location and the data will be accessible:
ALTER TABLE employee SET LOCATION 's3://employee-data/20190812/';
This statement points the table to the new location; nothing is being overwritten, of course.
Or alternatively make the table partitioned:
create external table employee
(
name String,
hours_worked Integer
)
partitioned by (load_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
then run ALTER TABLE employee RECOVER PARTITIONS;
and all dates will be mounted as separate partitions, which you can query using
WHERE load_date='20190812'
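Note that recovering partitions generally assumes Hive-style key=value folder names (e.g. load_date=20190812). If the daily folders stay plain dates, as in the question, each day's folder can be registered explicitly instead, with the date value taken from the ETL:
-- Register a single day's folder as a partition; '20190812' would come from the ETL.
ALTER TABLE employee ADD IF NOT EXISTS PARTITION (load_date = '20190812')
LOCATION 's3://employee-data/20190812/';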
I have a table in an RDBMS, like so:
create table test (sno number, entry_date date default sysdate);
Now I want to create a table in Hive with a similar structure, i.e. with a default value on a column.
Hive currently doesn't support adding a default value to a column while creating a table.
As a workaround, load the data into a temporary table and use an INSERT OVERWRITE TABLE statement to add the current date and time into the main table.
Create a temporary table:
create table test (sno int);
Load data into the table:
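For example (the path is purely hypothetical; point it at wherever the exported RDBMS data actually lands on HDFS):
-- Hypothetical HDFS directory holding the exported sno values.
LOAD DATA INPATH '/tmp/test_data/' INTO TABLE test;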
Create final table:
create table final_table (sno int, createDate string);
Finally, load the data from the temp test table into the final table:
insert overwrite table final_table select sno, FROM_UNIXTIME( UNIX_TIMESTAMP(), 'dd/MM/yyyy' ) from test;
Hive doesn't support DEFAULT fields.
That doesn't mean you can't do it, though. It's just a two-step process: create a "staging" table, then insert into a second table while selecting that "default" value.
Adding a default value to a column while creating table in hive
Since you mention
I have a table in RDBMS
you could also use your existing table and import the data into Hive with Sqoop.
Create Statement:
CREATE EXTERNAL TABLE tab1(usr string)
PARTITIONED BY (year string, month string, day string, hour string, min string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/tmp/hive1';
Data:
select * from tab1;
jhon,2017,2,20,10,11
jhon,2017,2,20,10,12
jhon,2017,2,20,10,13
Now I need to alter the tab1 table to have only 3 partition columns (year string, month string, day string) without manually copying/modifying files. I have thousands of files, so I want to alter only the table definition without touching the files. Is that possible?
Please let me know how to do this.
If this is something you will do only once, I would suggest creating a new table with the expected partitions and inserting into it from the old table using dynamic partitioning; this will also avoid keeping small files in your partitions. The other option is to create a new table pointing to the old location with the expected partitions and use the following properties:
TBLPROPERTIES ("hive.input.dir.recursive" = "TRUE",
"hive.mapred.supports.subdirectories" = "TRUE",
"hive.supports.subdirectories" = "TRUE",
"mapred.input.dir.recursive" = "TRUE");
After that, you can run MSCK REPAIR TABLE to recognize the partitions.
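A sketch of the first option, with an illustrative new table name and location; dynamic partitioning fills year/month/day from the last three columns of the SELECT:
-- Illustrative target table: hour and min become regular columns.
CREATE EXTERNAL TABLE tab1_new(usr string, hour string, min string)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/tmp/hive1_new';

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition columns go last in the SELECT, in the order they are declared.
INSERT OVERWRITE TABLE tab1_new PARTITION (year, month, day)
SELECT usr, hour, min, year, month, day FROM tab1;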
I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive, but I cannot see any data.
Since it's an external table, data insertion should be done automatically, right?
Your file's columns should be in this order:
int,string
but your file contents are currently in this order:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a CSV file that matches the structure of the table, the data will be picked up automatically and you can use SELECT queries to see it.
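A quick way to confirm that the partition is registered and the data is visible:
SHOW PARTITIONS google_analytics;
SELECT * FROM google_analytics WHERE date_string = '2016-09-06' LIMIT 10;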
Solved!
I'm processing a big Hive table (more than 500 billion records).
The processing is too slow and I would like to make it faster.
I think that by adding partitions, the process could be more efficient.
Can anybody tell me how I can do that?
Note that my table already exists.
My table :
create table T(
nom string,
prenom string,
...
date string)
I would like to partition it on the date field.
Thanks!
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE partitioned_table_name PARTITION(`date`) SELECT col1, col2, ..., `date` FROM original_table_name;
Note:
In the INSERT statement for a partitioned table, make sure that the partition columns come last in the SELECT clause.
You have to restructure the table. Here are the steps:
1. Make sure no other process is writing to the table.
2. Create a new external table using partitioning.
3. Insert into the new table by selecting from the old table.
4. Drop the new table (it is external, so only the table is dropped; the data stays in place).
5. Drop the old table.
6. Create the table with the original name, pointing to the location under step 2.
7. Run the repair command (MSCK REPAIR TABLE) to fix all the metadata.
Alternative for steps 4, 5, 6 and 7:
4. Create the table with the original name by running SHOW CREATE TABLE on the new table and replacing the table name with the original one.
5. Run the LOAD DATA INPATH command to move the files under the partitions into the new partitions of that table.
6. Drop the external table created in step 2.
Both approaches achieve the restructuring with a single insert/MapReduce job.
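A rough sketch of the first set of steps, applied to the table from the question; the new table name T_part, its location, and the listed columns are illustrative (the remaining columns of T go where the comments indicate):
-- Step 2: new external, partitioned table at an illustrative location.
CREATE EXTERNAL TABLE T_part (
  nom string,
  prenom string
  -- ... remaining non-partition columns of T
)
PARTITIONED BY (`date` string)
LOCATION '/data/T_part';

-- Step 3: dynamic-partition insert; the partition column comes last in the SELECT.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE T_part PARTITION (`date`)
SELECT nom, prenom,
  -- ... remaining non-partition columns of T,
  `date`
FROM T;

-- Steps 4-7: drop both tables (dropping the external one leaves its data in place),
-- recreate the same definition under the original name at the same location,
-- then repair the metadata.
DROP TABLE T_part;
DROP TABLE T;
CREATE EXTERNAL TABLE T (
  nom string,
  prenom string
  -- ... remaining non-partition columns of T
)
PARTITIONED BY (`date` string)
LOCATION '/data/T_part';
MSCK REPAIR TABLE T;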