Hive external table is unable to read already partitioned hdfs directory - hive

I have a map reduce job, that already writes out record to hdfs using hive partition naming convention.
eg
/user/test/generated/code=1/channel=A
/user/test/generated/code=1/channel=B
After I create an external table, it does not see the partition.
create external table test_1 ( id string, name string ) partitioned by
(code string, channel string) STORED AS PARQUET LOCATION
'/user/test/generated'
Even with the alter command
alter table test_1 ADD PARTITION (code = '1', channel = 'A')
, it does not see the partition or record,
because
select * from test_1 limit 1 produces 0 result.
If I use empty location when I create external table, and then use
load data inpath ...
then it works. But the issue is there is too many partitions for the load data inpath to work.
Is there a way to make hive recognize the partition automatically (without doing insert query)?

Using msck, it seems to be working. But I had to exit the hive session, and connect again.
MSCK REPAIR TABLE test_1

Related

How can I load same file into hive table using beeline

I needed to create huge test data in hive table. I tried following commands but it only inserts one partition data at a time.
connect to beeline:
beeline --force=true -u 'jdbc:hive2://<host>:<port>/<hive database name>;ssl=true;user=<username>;password=<pw>'
create partitioned table :
CREATE TABLE p101(
Name string,
Age string)
PARTITIONED BY(fi string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
I have created ins.csv file with data and copy it to hdfs location, its data is as follows.
Name,Age
aaa,33
bbb,22
ccc,55
then I tried to load same file for multiple partition ids with following command
LOAD DATA INPATH 'hdfs_path/ins.csv' INTO TABLE p101 PARTITION(fi=1,fi=2,fi=3,fi=4,fi=5);
but it loads record only for partitionID=5.
You can only specify one partition for each insert into.
What you can do in order to have different partitions is add it into your csv file like this:
Name,Age,fi
aaa,33,1
bbb,22,2
ccc,55,3
Hive will automatically know that this is the partition.
LOAD DATA INPATH 'hdfs_path/ins.csv' INTO TABLE tmp.p101;

How to create table over partitioned data

I have text file with snappy compression partitioned by field 'process_time' (result of Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script for create table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
The result of queries against this table are allways 0 (but I know that there are some data). Any help?
Thanks!
As you are creating external table on top of HDFS directory then to add the partitions to the hive table we need to run either of these commands.
if any partition added to HDFS directly(instead of using insert queries) then hive doesn't know about the newly added partitions, so we need to run either msck (or) add partitions to add newly added partitions to hive table.
To add all partitions to hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';
For more details refer to this link.

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
your file should be in this sequence.
int,string
here you file contents are in below sequence
string, int
change your file to below.
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also it is not recommended to use keywords as column names (date);
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder with a csv file corresponding to the structure of the table, data will be automatically loaded and you can already use select queries to see the data.
Solved!

why hive can‘t select data from hdfs when use partition?

I use flume to write data to hdfs,path like /hive/logs/dt=20151002.Then,i use hive to select data,but the count of response is always 0.
Here is my create table sql,CREATE EXTERNAL TABLE IF NOT EXISTS test (id STRING) partitioned by (dt string) ROW FORMAT DELIMITED fields terminated by '\t' lines terminated by '\n' STORED AS TEXTFILE LOCATION '/hive/logs'
Here is my select sql,select count(*) from test
It seems that you are not registering partition in hive meta-store.
Although partition is present in hdfs path,Hive won't know it if its not registered in meta store. To register it you can do the following:
ALTER TABLE test ADD PARTITION (dt='20151002') location '/hive/logs/dt=20151002';

External table does not return the data in its folder

I have created an external table in Hive with at this location :
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder but when I query the table, it returns nothing. The table is structured in a way that it fits the data structure.
SELECT * FROM tb LIMIT 3;
Is there a kind of permission issue with Hive tables: do specific users have permissions to query some tables?
Do you know some solutions or workarounds?
You have created your table as partitioned table base on column datehour, but you are putting your data in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table hive will not update the metastore. You need to run some alter statement to update that
So here are the steps for external tables with partition:
1.) In you external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE partition(datehour=0909201401)
2.) After creating your table run a alter statement:
ALTER TABLE ADD PARTITION (datehour=0909201401)
Hope it helps...!!!
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
hive> LOCATION '/user/cloudera/data/somedatafor_datehour'
hive> ;
When we specify LOCATION '/user/cloudera/data' (though its optional) while creating an EXTERNAL TABLE we can take some advantage of doing repair operations on that table. So when we want to copy the files through some process like ETL into that directory, we can sync up the partition with the EXTERNAL TABLE instead of writing ALTER TABLE statement to create another new partition.
If we already know the directory structure of the partition that HIVE would create, we can simply place the data file in that location like '/user/cloudera/data/datehour=0909201401/data.txt' and run the statement as shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync up the partition to the hive meta store of the table "tb".