Hive failed to show tables in warehouse - apache

The problem I'm facing is that when I create a table in Hive, the table only shows up in
localhost:50070/explorer.html#/user/hive/warehouse
, but the directory
/user/hive/warehouse
does not exist as a real directory when I look for it from the terminal or on the hard disk.

The reason might be that you did not use LOCATION in your CREATE TABLE command. You should use it like:
CREATE TABLE mytable
....
....
LOCATION 'hdfs:///user/hive/warehouse'
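One way to double-check where Hive actually put the table is a sketch like the following (hypothetical columns id and value standing in for the real schema, and the default warehouse path as the target directory):
CREATE TABLE mytable (
  id STRING,
  value STRING
)
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/mytable';

-- Prints the directory Hive recorded for the table
DESCRIBE FORMATTED mytable;
The Location row printed by DESCRIBE FORMATTED is an HDFS directory; it can be listed with hadoop fs -ls, not with a plain ls on the local disk.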

Related

Unable to load managed table with maptype column (complex datatype) from external table in hive

I have an external table with a complex datatype (map(string, array(struct))) and I'm able to select and query this external table without any issue.
However, if I try to load this data into a managed table, it runs forever. Is there a better approach to loading this data into a managed table in Hive?
CREATE EXTERNAL TABLE DB.TBL (
  id string,
  list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
)
LOCATION <path>
BTW, you can convert the table to managed (though this may not work on the Cloudera distribution due to the warehouse directory restriction):
use DB;
alter table TBL SET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load into another managed table, you can simply copy the files into its location.
--Create managed table (or use existing one)
use db;
create table tbl_managed (
  id string,
  list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
);
--Check table location
use db;
desc formatted tbl_managed;
This will print the location along with other info; use it to copy the files.
Copy all the files from the external table location into the managed table location. This is the most efficient approach, much faster than INSERT ... SELECT:
hadoop fs -cp external/location/path/* managed/location/path
After copying the files, the table will be selectable. You may want to analyze the table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]
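For completeness, the slower INSERT ... SELECT route mentioned above would look roughly like this (a sketch, assuming the table names used in this answer):
-- Rewrites the data through a MapReduce/Tez job instead of copying files, hence slower
insert overwrite table db.tbl_managed
select id, list from DB.TBL;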

Hive - Is it mandatory to have '=' for external table to consider as partition

I am new to Hive and have a basic question:
I am trying to create an external table on an HDFS directory at location
/projects/score/output/scores_2020-06-30.gzip
but it is not being considered as a partition.
Does the developer need to change the directory name to "scores=yyyy-mm-dd" in place of "scores_yyyy-mm-dd.gzip",
like "/projects/score/output/scores=2020-06-30",
for it to be considered partitioned?
i.e. Is it mandatory to have '=' in the path for an external table to treat it as a partition?
Or can I change something in the table below while creating it? Trying as below:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ (
...
)
PARTITIONED BY (scores STRING)
LOCATION '/projects/score/output/';
Thanks in advance!
You can define a partition on top of any location, even outside the table directory, using ALTER TABLE ADD PARTITION. A partition in HDFS is a directory, usually inside the table location but not necessarily. If it is inside the table directory, then you can use MSCK REPAIR TABLE to attach existing subdirectories inside the table directory as partitions; it will scan the table location and add the partition metadata.
In your example the partition directory is missing; you have only the table directory with a file inside it. The filename does not matter in this case.
It is not absolutely mandatory to have the partition directory in the format key=value, though MSCK REPAIR TABLE may not work in your Hive distribution; you can still add partitions using the ALTER TABLE ADD PARTITION ... LOCATION command on top of any directory.
It may depend on the vendor. For example, on Qubole, ALTER TABLE RECOVER PARTITIONS (the EMR alternative to MSCK REPAIR TABLE) works fine with directories like '2020-06-30'.
By default, when inserting data using dynamic partitioning, Hive creates partition folders in the format key=value, but if you create partition directories using some other tool, 'value' alone as the partition folder name is okay. Just check whether MSCK REPAIR works in your case. If it does not, create key=value directories if you need MSCK REPAIR.
The names and number of files inside the partition folder do not matter in this context.
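A minimal sketch of the ALTER TABLE route for the table in the question (hedged: it assumes the daily file is moved into a hypothetical per-date directory such as /projects/score/output/2020-06-30/, since a partition location must be a directory):
ALTER TABLE XYZ ADD IF NOT EXISTS PARTITION (scores='2020-06-30')
LOCATION '/projects/score/output/2020-06-30';

-- If the directories are instead laid out as key=value (scores=2020-06-30), MSCK can discover them:
MSCK REPAIR TABLE XYZ;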

Database directory will not get created until explicitly specified in LOCATION field in Hive

I created a database with my preferred location (/user/hive/) using the query below.
create database test
location "/user/hive/";
After creating the database, I checked the location /user/hive/ for a test.db directory using the command hadoop dfs -ls /user/hive. It was not there.
Later, I created one more database with the default location using the query below.
create database test2;
For database test2, I can see a test2.db directory under the default warehouse directory /user/hive/warehouse/.
The /user/hive/test.db directory only got created when I explicitly specified it in the LOCATION field, as below.
create database test
location "/user/hive/test.db";
As I'm new to Hive, can anyone please explain:
Why didn't the test.db directory get created for my first query, where I specified the location field as /user/hive/?
How does Hive behave when the location field is specified?
NOTE:
I'm using cloudera quick-start VM
Hive Version: Hive 1.1.0-cdh5.13.0
This is expected behavior from Hive.
create database test location "/user/hive/";
Executing the above statement means you are creating the test database pointing at the /user/hive directory itself, which is why Hive has not created a test directory.
You need to explicitly include the directory name you want the database to point to, i.e. create database test location "/user/hive/test.db"; only then does Hive create the test db pointing at the test.db directory.
In the case of the create database test2; statement, you are creating a database without specifying the location, so the directory is created under the default Hive warehouse location with the same name as the db name.
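A quick way to confirm where each database actually points (a sketch; the exact output layout varies by Hive version):
-- Prints the database name, comment, and HDFS location
DESCRIBE DATABASE test;
DESCRIBE DATABASE test2;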

Impala External Table Location/URI

I am troubleshooting an application issue on an external (unmanaged) table that was created using the CREATE TABLE X LIKE PARQUET syntax via Cloudera Impala. I am trying to determine the location of the files comprising the partitions of the external table, but I'm having difficulty figuring out how to do this or finding documentation that describes it.
If I do a:
show create table T1;
I see the Hive-managed location, such as:
LOCATION 'hdfs://nameservice1/user/hive/warehouse/databaseName'
If I do a:
describe formatted T1;
I see that the table is in fact external, but it doesn't give any insight into the unmanaged location.
| Table Type: | EXTERNAL_TABLE
| Location: | hdfs://nameservice1/user/hive/warehouse/databaseName/T1
Question:
How do I determine the Location/URI/Parent Directory of the actual external files that comprise this External Table?
When you create an external table with Impala or Hive and you want to know the location, you should specify the HDFS location explicitly, for example:
CREATE EXTERNAL TABLE my_db.table_name (
  column string
)
LOCATION 'hdfs_path';
If you don't provide this, the probable location of these files is under the default directory for the user that executed the CREATE TABLE command.
For more detail you can see this link:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_create_table.html
I hope this helps!
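To get at the partition-level locations specifically, they can be inspected directly (a sketch; part_col is a hypothetical partition column name):
-- Impala: the SHOW PARTITIONS output includes a Location column per partition
SHOW PARTITIONS T1;
-- Hive: prints the Location of one specific partition
DESCRIBE FORMATTED T1 PARTITION (part_col='some_value');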

External Table in Hive - Location

The table below returns no data when running a SELECT statement:
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
I need my Hive table to point to a dynamic folder, so that as a MapReduce job puts a part file into a folder, Hive loads it into the table.
Is there any way the location can be made dynamic, like
/user/data/CSV/*/*/*/*/part-*
or would just /user/data/CSV/* do?
(The same code works fine when created as an internal table and loaded with the file path, so there are no issues due to formatting.)
First off, your table definition is missing columns. Second, an external table location always points to a folder, not to particular files. Hive will consider all files in the folder to be data for the table.
If you have data that is generated, e.g. on a daily basis, by some external process, you should consider partitioning your table by date. Then you need to add a new partition to the table when the data becomes available.
Hive does not iterate through multiple folders.
Hence, for the above scenario,
I ran a command line that iterates through these multiple folders, cats (prints to the console) all the part files, and then puts the result into a desired location (the one Hive points to):
hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
This line
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
does not look correct; I don't think a table can be created from multiple locations. Have you tried importing from a single location to confirm this?
It could also be that the delimiter you're using is not correct. If you are using a CSV file to import your data, try delimiting by ','.
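A hedged sketch of those two suggestions combined, with hypothetical columns (col1, col2) standing in for the real schema, a single concrete folder as the location, and a comma delimiter:
CREATE EXTERNAL TABLE foo (
  col1 STRING,
  col2 STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27';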
You can use ALTER TABLE statements to change the locations. In the example below, partitions are based on dates, where data is stored in time-dependent file locations. If I want to search many days, I have to add an ALTER TABLE statement for each location. This idea may extend to your situation quite well. You could create a script that generates the statements below using some other technology such as Python.
CREATE EXTERNAL TABLE foo (
)
PARTITIONED BY (`date` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
;
alter table foo add partition (`date`='20160201') location '/user/data/CSV/20160201/data';
alter table foo add partition (`date`='20160202') location '/user/data/CSV/20160202/data';
alter table foo add partition (`date`='20160203') location '/user/data/CSV/20160203/data';
alter table foo add partition (`date`='20160204') location '/user/data/CSV/20160204/data';
You can use as many ADD PARTITION and DROP PARTITION statements as you need to define your locations. Then your table can find data held in many locations in HDFS rather than having all your files in one place.
You may also be able to leverage a
create table like
statement to create a schema like the one you have in another table, and then alter the new table to point at the files you want.
I know this isn't exactly what you want and is more of a workaround. Good luck!
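A short sketch of that last idea (hedged: foo_new is a hypothetical table name and the path is just an example directory):
-- Copy foo's schema into a new external table that points at a different directory
CREATE EXTERNAL TABLE foo_new LIKE foo LOCATION '/user/data/CSV/2016/1/28';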