Add the last modified date of file to Hive external table - hive

I have a requirement where I need to add the time the file was dropped into the HDFS folder as a column in the Hive external table.
Example: I have 2 files dropped on
2017-07-13 15:22
2017-12-13 18:31
So, my last_modified column in the Hive table should reflect 2017-07-13 15:22 for all rows from file 1 and 2017-12-13 18:31 from file 2.
Is there a way to achieve this in the external table create statement?
Thanks in Advance!

I haven't come across a built-in feature that does this. However, you can try the steps below to keep the last-modified time of each file in a separate column:
Create a table partitioned on the last_modified column.
CREATE EXTERNAL TABLE test (record string) PARTITIONED BY
(last_modified string) location '<warehouse_location>/test.db/test';
For each file, either add a new partition that points at the file's directory, or load the file into a partition with an insert statement.
To add a partition:
ALTER TABLE test ADD PARTITION (last_modified='2017-07-13 15:22')
location '<data-location>/newfile1/';
Or create a separate temp table over the new file, then insert its data into the partitioned table:
CREATE EXTERNAL TABLE tmp (record string) location '<new data location>';
INSERT INTO TABLE test PARTITION (
last_modified = '2017-07-13 15:22') SELECT record FROM tmp;
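If one partition per file is not practical, another option is to tag each row with the file it came from and look the timestamp up separately. The sketch below relies on Hive's INPUT__FILE__NAME virtual column, which holds the full URI of the file a row was read from. The file_times table and its column names are assumptions made for illustration; Hive does not expose the modification time itself, so that table would have to be filled outside Hive (for example from hdfs dfs -ls output).

```sql
-- Hypothetical lookup table: one row per data file with its HDFS modification time.
-- file_path must hold the full URI exactly as INPUT__FILE__NAME reports it.
CREATE TABLE file_times (file_path STRING, last_modified STRING);

-- Tag every row of tmp with its source file, then attach that file's timestamp.
SELECT t.record, f.last_modified
FROM (
  SELECT record, INPUT__FILE__NAME AS file_path
  FROM tmp
) t
JOIN file_times f
  ON t.file_path = f.file_path;
```

Rows from files that have no entry in file_times are dropped by the inner join, so a LEFT OUTER JOIN may be the safer choice while the lookup table is still being filled.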

Related

How to create table over partitioned data

I have text files with snappy compression, partitioned by the field 'process_time' (the result of a Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script to create the table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
Queries against this table always return 0 rows (but I know that there is some data). Any help?
Thanks!
Because you created an external table on top of an existing HDFS directory, the partitions still have to be registered in the Hive table with one of the commands below.
When a partition is added to HDFS directly (instead of through insert queries), Hive doesn't know about it, so you need to run either msck repair table or alter table ... add partition to register it.
To add all partitions to hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';
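To check that the partitions were actually picked up, something along these lines can be run afterwards. This is only a sketch using the table and partition value from the question; it assumes the table lives in the current database.

```sql
-- Register every process_time=... directory found under the table location.
MSCK REPAIR TABLE mytable;

-- List the partitions the metastore now knows about.
SHOW PARTITIONS mytable;

-- Count the rows in one partition to confirm the data is readable.
SELECT COUNT(*) FROM mytable WHERE process_time = '25-04-2019';
```

If SHOW PARTITIONS lists the partition but the count is still 0, the problem is more likely in the row format or compression settings than in the partition metadata.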

How do I partition a table by all values?

I have an external table and now I want to add partitions to it. I have 224 unique city ids and I would like to just write alter table my_table add partition (cityid) location /path;, but hive complains that I don't provide a value for the city id and that it should be e.g. alter table my_table add partition (cityid=VALUE) location /path;. I don't want to run an alter table command for every value of city id. How can I do it for all ids in one go?
This is what hive command line looks like:
hive> alter table pavel.browserdata add partition (cityid) location '/user/maria_dev/data/cityidPartition';
FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {cityid=null}
At the physical level, a partition is a location (a separate directory for each value, usually named key=value) containing data files. If you already have a partition directory structure with files, all you need is to create the partitions in the Hive metastore: point your table to the root directory using ALTER TABLE SET LOCATION, then run the MSCK REPAIR TABLE command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is ALTER TABLE table_name RECOVER PARTITIONS. This adds the Hive partition metadata. See manual here: RECOVER PARTITIONS
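For the case where the key=value directories already exist, a minimal sketch could look like the following. The table name browserdata_part, the two data columns and the INT type of cityid are assumptions, since the question does not show the real schema; the sketch also assumes the directories under the location are named cityid=<value>.

```sql
-- Hypothetical partitioned table over the existing directory tree.
-- Replace the column list with the real columns of pavel.browserdata.
CREATE EXTERNAL TABLE pavel.browserdata_part (
  url    STRING,   -- assumed column
  visits INT       -- assumed column
)
PARTITIONED BY (cityid INT)
LOCATION '/user/maria_dev/data/cityidPartition';

-- Scan the location and register every cityid=<value> directory in one go.
MSCK REPAIR TABLE pavel.browserdata_part;
```

The table has to be declared with PARTITIONED BY for this to work, which is also what the error message in the question is pointing at.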
If you only have a non-partitioned table with data in its location, then adding partitions will not work because the data needs to be reloaded. You need to:
Create another partitioned table and use insert overwrite to load partition data using dynamic partition load:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table table2 partition(cityid)
select col1, ... colN,
cityid
from table1; --partitions columns should be last in the select
This is quite an efficient way to reorganize your data.
After this you can delete source table and rename your target table.
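The final swap could be sketched like this, reusing the table1 and table2 names from the insert statement above. Check the contents of table2 before dropping anything.

```sql
-- Remove the old, non-partitioned table.
-- For a managed table this also deletes its data files;
-- for an external table only the metadata is removed.
DROP TABLE table1;

-- Give the partitioned table the original name.
ALTER TABLE table2 RENAME TO table1;
```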

Hive table creation with a default value

I have a table in RDBMS like so:
create table test (sno number, entry_date date default sysdate).
Now I want to create a table in hive with the same structure, including the default value for the column.
Hive currently doesn't support adding a default value to a column when creating a table.
As a workaround, load the data into a temporary table, then use an insert overwrite table statement to add the current date and time while copying into the main table.
Create a temporary table:
create table test (sno int);
Load data into the table:
Create final table:
create table final_table (sno int, createDate string);
Finally load the data from temp test table to the final table:
insert overwrite table final_table select sno, FROM_UNIXTIME( UNIX_TIMESTAMP(), 'dd/MM/yyyy' ) from test;
Hive doesn't support DEFAULT fields
Doesn't mean you can't do it, though. Just a two step process of creating one "staging" table, then inserting into a second table and selecting that "default" value.
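A variation of that two-step process with a typed timestamp column might look like the sketch below. The table names test_staging and final_table_ts and the column entry_ts are made up for illustration, and the sketch assumes a Hive version that provides current_timestamp (1.2 and later, as far as I know).

```sql
-- Staging table that receives the raw data (hypothetical name).
CREATE TABLE test_staging (sno INT);

-- Final table; entry_ts plays the role of the column with the "default" value.
CREATE TABLE final_table_ts (sno INT, entry_ts TIMESTAMP);

-- Append the staged rows, stamping each one with the time the query runs.
INSERT INTO TABLE final_table_ts
SELECT sno, current_timestamp() FROM test_staging;
```

INSERT INTO appends to the final table, so rows loaded earlier keep their original timestamp, which is closer to how a default of sysdate behaves in the RDBMS.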
Adding a default value to a column while creating table in hive
Since you mention,
I've table in RDBMS
You could also use your existing table, and use Sqoop to import the data into Hive.

Add partitions on existing hive table

I'm processing a big Hive table (more than 500 billion records).
The processing is too slow and I would like to make it faster.
I think that by adding partitions, the process could be more efficient.
Can anybody tell me how I can do that?
Note that my table already exists.
My table :
create table T(
nom string,
prenom string,
...
date string)
Partitioning on date field.
Thx
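SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE partitioned_table PARTITION(date) SELECT nom, prenom, ..., date FROM T;
Note:
Here partitioned_table stands for a new table that has the same columns as T but declares date in PARTITIONED BY; an existing non-partitioned table cannot be partitioned in place.
In the insert statement for a partitioned table, make sure the partition columns come last in the select clause.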
You have to restructure the table. Here are the steps:
1.) Make sure no other process is writing to the table.
2.) Create a new external table using partitioning.
3.) Insert into the new table by selecting from the old table.
4.) Drop the new (external) table; only the table definition is dropped, the data stays in place.
5.) Drop the old table.
6.) Create the table with the original name, pointing to the location used in step 2.
7.) Run the repair command to fix all the metadata.
Alternative to steps 4, 5, 6 and 7:
4.) Create the table with the original name by running show create table on the new table and replacing its name with the original table name.
5.) Run the LOAD DATA INPATH command to move the files under the partitions to the matching partitions of the new table.
6.) Drop the external table created in step 2.
Both approaches achieve the restructuring with a single insert/MapReduce job.
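Steps 2 to 7 of the first approach could be sketched roughly as follows. The name T_part and the path /data/T_part are assumptions, and only the nom, prenom and date columns shown in the question are included; the elided columns would have to be added in the same order in both the table definitions and the select.

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Step 2: new external table partitioned on date (hypothetical name and path).
-- Backticks are used because DATE is a reserved word in newer Hive versions.
CREATE EXTERNAL TABLE T_part (nom STRING, prenom STRING)
PARTITIONED BY (`date` STRING)
LOCATION '/data/T_part';

-- Step 3: copy the data; the partition column must come last in the select.
INSERT OVERWRITE TABLE T_part PARTITION (`date`)
SELECT nom, prenom, `date` FROM T;

-- Step 4: drop the external table; its files stay under /data/T_part.
DROP TABLE T_part;

-- Step 5: drop the old table.
DROP TABLE T;

-- Step 6: recreate the table under the original name over the same location.
CREATE EXTERNAL TABLE T (nom STRING, prenom STRING)
PARTITIONED BY (`date` STRING)
LOCATION '/data/T_part';

-- Step 7: register the partition directories in the metastore.
MSCK REPAIR TABLE T;
```

With a table this large the insert may create more partitions than Hive allows by default, so limits such as hive.exec.max.dynamic.partitions may need to be raised before step 3.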

External table does not return the data in its folder

I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder but when I query the table, it returns nothing. The table is structured in a way that it fits the data structure.
SELECT * FROM tb LIMIT 3;
Is there a kind of permission issue with Hive tables: do specific users have permissions to query some tables?
Do you know some solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore on its own. You need to run an ALTER statement to update it.
So here are the steps for external tables with partitions:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401)
Hope it helps...!!!
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
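To see which directory a partition actually points at after such an ALTER, the partition metadata can be inspected. A small sketch, using the table and partition value from this thread:

```sql
-- Show the metadata of one partition; the Location field in the output
-- is the directory Hive reads for that partition.
DESCRIBE FORMATTED tb PARTITION (datehour=0909201401);

-- Read a few rows from that partition only.
SELECT * FROM tb WHERE datehour = 0909201401 LIMIT 3;
```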
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we gain the ability to run repair operations on that table. So when some process such as an ETL job copies files into that directory, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure of the partition that HIVE would create, we can simply place the data file in that location like '/user/cloudera/data/datehour=0909201401/data.txt' and run the statement as shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync up the partition to the hive meta store of the table "tb".