External table does not return the data in its folder - hive

I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder, but when I query the table it returns nothing. The table schema matches the structure of the data.
SELECT * FROM tb LIMIT 3;
Could this be a permission issue with Hive tables? Do only specific users have permission to query some tables?
Do you know of any solutions or workarounds?

You have created your table as a partitioned table based on the column datehour, but you are putting your data directly in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Because the files are placed in the external location directly, Hive does not update the metastore on its own. You need to run an ALTER statement to register the partition.
So here are the steps for external tables with partitions:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401 and put the data files in it
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401);
Hope it helps...!!!

When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' while creating an EXTERNAL TABLE (though it's optional), we gain the ability to run repair operations on that table. So when some process such as ETL copies files into that directory, we can sync the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure that Hive would create for the partition, we can simply place the data file in that location, like '/user/cloudera/data/datehour=0909201401/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync the partitions found under the table location into the Hive metastore for the table "tb".
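As a rough check of the approach above, one could list the partitions known to the metastore before and after the repair. This is only a sketch, and it assumes the data file already sits under /user/cloudera/data/datehour=0909201401/ as in the example:

```sql
-- Partitions currently registered in the metastore
-- (expected to be empty if none have been added yet)
SHOW PARTITIONS tb;

-- Scan the table location and register any key=value directories found there
MSCK REPAIR TABLE tb;

-- The datehour partition should now appear in the list
SHOW PARTITIONS tb;

-- The original query from the question
SELECT * FROM tb LIMIT 3;
```

If the second SHOW PARTITIONS still returns nothing, the directory name probably does not follow the datehour=value pattern.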

Related

How to create table over partitioned data

I have text files with Snappy compression, partitioned by the field 'process_time' (the result of a Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script to create the table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
Queries against this table always return 0 rows (but I know that there is some data). Any help?
Thanks!
Because you are creating an external table on top of an HDFS directory, you need to run one of the commands below to add the partitions to the Hive table.
If a partition is added to HDFS directly (instead of through INSERT queries), Hive doesn't know about it, so we need to run either MSCK REPAIR or ADD PARTITION to register the newly added partitions.
To add all partitions to hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';
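When only a few directories need registering, several partitions can be added in one statement. A minimal sketch, assuming a second directory process_time=26-04-2019 exists (that date is hypothetical and not from the question):

```sql
-- IF NOT EXISTS makes the statement safe to re-run
ALTER TABLE mytable ADD IF NOT EXISTS
  PARTITION (process_time='25-04-2019') LOCATION '/data/mytable/process_time=25-04-2019'
  PARTITION (process_time='26-04-2019') LOCATION '/data/mytable/process_time=26-04-2019';  -- hypothetical second partition

-- Confirm what the metastore now knows about
SHOW PARTITIONS mytable;

-- Count rows in one partition to verify that the files are readable
SELECT COUNT(*) FROM mytable WHERE process_time = '25-04-2019';
```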

Hive table not recognising partition

My Hive table is partitioned by the column 'job_id'. When I dump the data into the HDFS location of the table, it creates a partition directory named 'JOB_ID', and my Hive table does not recognize it.
I have tried the msck repair table command, but that didn't help either.
For external Hive tables you need to add each new partition manually, as follows:
ALTER TABLE table_name ADD PARTITION (job_id='927') location 'hdfs://some_location/job_id=927'
I found out that the partition column name in the directory should always be in lowercase letters.
Here is the link:
https://medium.com/a-muggles-pensieve/hive-partition-column-name-camelcase-bad-idea-bc203d6e65da
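If renaming the directory is not an option, an explicit LOCATION may be a workaround, since a partition's location does not have to follow the key=value naming. The sketch below assumes the upper-case directory sits at hdfs://some_location/JOB_ID=927, which is a guess based on the question:

```sql
-- Register the partition under its lower-case column name,
-- but point it at the existing upper-case directory (assumed path)
ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (job_id='927')
  LOCATION 'hdfs://some_location/JOB_ID=927';

-- Check that the partition is now registered
SHOW PARTITIONS table_name;
```

MSCK REPAIR TABLE would still not discover such directories by itself, so every partition would have to be added this way.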

How do I partition a table by all values?

I have an external table and now I want to add partitions to it. I have 224 unique city IDs and I want to just write alter table my_table add partition (cityid) location /path; but Hive complains that I don't provide a value for the city ID; it should be e.g. alter table my_table add partition (cityid=VALUE) location /path;. I don't want to run an alter table command for every value of city ID. How can I do it for all IDs in one go?
This is what the Hive command line looks like:
hive> alter table pavel.browserdata add partition (cityid) location '/user/maria_dev/data/cityidPartition';
FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {cityid=null}
At the physical level, a partition is a location (a separate location for each value, usually named key=value) containing data files. If you already have the partition directory structure with files, all you need is to create the partitions in the Hive metastore: point your table to the root directory using ALTER TABLE SET LOCATION, then run the MSCK REPAIR TABLE command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS. This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
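For the first case, a sketch could look like the following. It assumes that /user/maria_dev/data/cityidPartition already contains subdirectories named cityid=<value>. The table name browserdata_part, the columns col1 and col2, and the INT type are hypothetical placeholders, and storage format clauses are omitted:

```sql
-- A partitioned external table whose root is the directory holding
-- the cityid=<value> subdirectories (schema is a placeholder)
CREATE EXTERNAL TABLE pavel.browserdata_part (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (cityid INT)
LOCATION '/user/maria_dev/data/cityidPartition';

-- Register every cityid=<value> directory in one go
MSCK REPAIR TABLE pavel.browserdata_part;

-- All discovered city IDs should be listed here
SHOW PARTITIONS pavel.browserdata_part;
```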
If you only have a non-partitioned table with data in its location, then adding partitions will not work because the data needs to be reloaded. You need to:
Create another partitioned table and use insert overwrite to load the partition data using dynamic partition load:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table table2 partition(cityid)
select col1, ... colN,
cityid
from table1; -- partition columns should be last in the select
This is quite an efficient way to reorganize your data.
After this you can delete source table and rename your target table.
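Put together, the whole reorganization might be sketched as below. The columns col1 and col2 and the INT type are placeholders. The per-node limit is raised because its default of 100 is lower than the 224 city IDs mentioned in the question, and the value 300 is an arbitrary choice:

```sql
-- Target table with the same data columns plus the partition column (placeholder schema)
CREATE TABLE table2 (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (cityid INT);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Allow a single task to create more than the default 100 partitions
SET hive.exec.max.dynamic.partitions.pernode=300;

-- The partition column must come last in the SELECT list
INSERT OVERWRITE TABLE table2 PARTITION (cityid)
SELECT col1, col2, cityid
FROM table1;

-- Verify the result before touching the source table
SHOW PARTITIONS table2;

-- Swap the tables; if table1 is external, its files stay on disk after the drop
DROP TABLE table1;
ALTER TABLE table2 RENAME TO table1;
```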

Drop and overwrite external table in hive

I need to create an external table in HiveQL from the output of a SELECT clause. Every time the HiveQL is run, the table should be dropped and recreated. When we drop an external table, only the table structure is dropped, not the data files in the HDFS location. How can I achieve this?
Create Table As Select (CTAS) has restrictions. One of them is that the target table cannot be external.
You have these options:
1.) Create the external table once, then INSERT OVERWRITE:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
2.) Use a managed table; then you can DROP TABLE, then CREATE TABLE ... AS SELECT
See also answer about skipTrash and auto.purge property.
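The first option could be sketched as follows. The table names report_out and source_table, the columns, and the path are all hypothetical:

```sql
-- Run once: IF NOT EXISTS lets the script be re-run without dropping the table
CREATE EXTERNAL TABLE IF NOT EXISTS report_out (
  id    INT,
  total BIGINT
)
STORED AS TEXTFILE
LOCATION '/hypothetical/path/report_out';

-- Run every time: replaces the previous files in the table location
INSERT OVERWRITE TABLE report_out
SELECT id, COUNT(*) AS total
FROM source_table
GROUP BY id;
```

Because the table is never dropped, the question of leftover files after DROP TABLE does not come up.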

Delete data in external and partitioned table in hive

I'm trying to delete data from an external, partitioned table in Hive. I can delete partitions with:
ALTER TABLE myTable DROP PARTITION(field > 'xxxx')
or
TRUNCATE TABLE myTable PARTITION(field)
But the related files in Blob storage are not deleted. How do I delete those files?
On the other hand, I'd like to delete data using any field as a filter (not only the partition field). Can it be done in my case (an external, partitioned table)? I've tried to achieve this using:
INSERT OVERWRITE TABLE myTable PARTITION(field)
SELECT * FROM myTable WHERE machine = 'xxxxx'
But the data from the SELECT doesn't replace the data in myTable.
Data in an external table will remain if you drop the table or a partition. Only if the table is managed will the data be deleted automatically when the table or partition is dropped.
INSERT OVERWRITE TABLE myTable PARTITION(field) SELECT...
statement can replace data with newly loaded data for the partitions that exist in the returned dataset. If the returned dataset is empty, the data will remain untouched.
To delete data in an external table you need to delete the files on the filesystem.
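To filter on a non-partition field, the SELECT has to return the rows to keep rather than the rows to remove. A minimal sketch, assuming the table has a data column col1 next to machine (col1 is a placeholder for the real column list):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Rewrite each partition with every row except those for the unwanted machine.
-- The IS NULL branch keeps rows where machine is NULL, which <> alone would drop.
INSERT OVERWRITE TABLE myTable PARTITION (field)
SELECT col1, machine, field
FROM myTable
WHERE machine IS NULL OR machine <> 'xxxxx';
```

As noted above, a partition that contains only rows for that machine returns nothing from the SELECT and is left untouched, so its files would still have to be deleted on the filesystem.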