Introduce HDFS folder information into Hive external table

I have an HDFS directory structure like this:
/home/date_1/A/file.txt
/home/date_1/B/file.txt
/home/date_2/A/file.txt
/home/date_2/B/file.txt
...
I can create an external table
CREATE EXTERNAL TABLE table_name(col1 int, col2 string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home';
But I don't know how to introduce the folder information 'A' or 'B' into the table. What can I do? Thanks!

In Hive you have virtual columns which you can use to read the underlying file name. INPUT__FILE__NAME gives you the full path of the input file that each row was read from.
So you first need to create the external table (as you have done). Then, when you query the external table, you can use the virtual column and split out the part of the path you need, as below:
select
col1,
col2,
INPUT__FILE__NAME as full_filepath,
concat_ws("/",reverse(split(reverse(INPUT__FILE__NAME),"/")[1]), reverse(split(reverse(INPUT__FILE__NAME),"/")[0])) as splitted_filepath
FROM
table_name;
More on virtual column in hive.
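If only the folder names are needed, a regular expression over INPUT__FILE__NAME may be easier to read than the double reverse. The following is a minimal sketch, assuming the layout from the question (.../<date folder>/<A or B>/<file>). The aliases date_folder and group_folder are made up for illustration, and the two SET lines are only needed if your setup does not already read nested subdirectories.

```sql
-- Assumption: files live at .../<date folder>/<group folder>/<file>.
SET mapreduce.input.fileinputformat.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

SELECT
  col1,
  col2,
  -- second-to-last directory in the path, e.g. date_1
  regexp_extract(INPUT__FILE__NAME, '.*/([^/]+)/([^/]+)/[^/]+$', 1) AS date_folder,
  -- last directory in the path, e.g. A or B
  regexp_extract(INPUT__FILE__NAME, '.*/([^/]+)/([^/]+)/[^/]+$', 2) AS group_folder
FROM table_name;
```

Wrapping this query in a view would make the folder names look like ordinary columns to downstream queries.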

Are you using MapReduce as the Hive execution engine? You should be able to simply direct the framework to traverse all the sub-directories.
SET mapreduce.input.fileinputformat.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
SELECT COUNT(1) FROM table_name;

Related

Unable to load managed table with maptype column (complex datatype) from external table in hive

I have an external table with a complex data type (map<string,array<struct>>), and I'm able to select from and query this external table without any issue.
However, when I try to load this data into a managed table, the job runs forever. Is there a better approach to loading this data into a managed table in Hive?
CREATE EXTERNAL TABLE DB.TBL(
id string ,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION <path>
BTW, you can convert the table to managed (though this may not work on the Cloudera distribution due to the warehouse dir restriction):
use DB;
alter table TBL SET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load into another managed table, you can simply copy the files into its location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(id string,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>> ) ;
--Check table location
use db;
desc formatted tbl_managed;
This will print the location along with other info; use it to copy the files.
Copy all files from the external table location into the managed table location. This is the most efficient approach, much faster than insert..select:
hadoop fs -cp external/location/path/* managed/location/path
After copying the files, the table can be queried. You may want to analyze the table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]

insert into hive external table as select and ensure it generates single file in table directory

My question is somewhat similar to the post below. I want to download some data from a Hive table using a select query. Because the data is large, I want to write it as an external table in a given path so that I can create a CSV file. I use the code below:
create external table output(col1 STRING, col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{outdir}/output';
INSERT OVERWRITE TABLE output
Select col1, col2 from atable limit 1000;
This works fine and creates a file named in the 0000_ format, which can be copied as a CSV file.
But my question is: how do I ensure that the output will always be a single file? If there is no partition defined, will it always be a single file? What rule does Hive use to split files?
I saw a few similar questions like the one below, but they discuss HDFS file access.
How to point to a single file with external table
I know the alternative below, but I use a Hive connection object to execute queries from a remote node, so it does not apply.
hive -e ' selectsql; ' | sed 's/[\t]/,/g' > outpathwithfilename
You can set the property below before doing the overwrite:
set mapreduce.job.reduces=1;
Note: If Hive does not allow this parameter to be modified at runtime, whitelist it by setting the property below in hive-site.xml:
hive.security.authorization.sqlstd.confwhitelist.append=|mapreduce.job.*|mapreduce.map.*|mapreduce.reduce.*
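Keep in mind that mapreduce.job.reduces only has an effect when the query actually has a reduce stage; a map-only insert typically writes one file per mapper. The following is a rough sketch, assuming the MapReduce execution engine and the table names from the question. The ORDER BY is an addition made here for illustration: Hive implements a global ORDER BY with a single reducer, so it is one way to make sure a reduce stage exists.

```sql
-- Assumption: MapReduce execution engine; tables as in the question.
SET mapreduce.job.reduces=1;

INSERT OVERWRITE TABLE output
SELECT col1, col2
FROM atable
ORDER BY col1   -- global sort, which Hive runs through a single reducer
LIMIT 1000;
```

After the insert, listing the table directory (for example with hadoop fs -ls) is a simple way to confirm how many files were written.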

Is it possible to load only selected columns from Avro file to Hive?

I have a requirement to load an Avro file into Hive. I am using the following to create the table:
create external table tblName stored as avro location 'hdfs://host/pathToData' tblproperties ('avro.schema.url'='/hdfsPathTo/schema.avsc');
I am getting the error FOUND NULL, EXPECTED STRING when doing a select on the table. Is it possible to load only a few columns and find out which column's data is causing this error?
You first need to create a Hive external table that points to the location of your Avro files and uses the AvroSerDe format.
At this stage, nothing is loaded. The external table is just a mask over the files.
Then you can create an internal Hive table and load data (only the expected columns) from the external one.
If you already have the Avro file, load it to HDFS in a directory of your choice. Next, create an external table on top of that directory.
CREATE EXTERNAL TABLE external_table_name(col1 string, col2 string, col3 string ) STORED AS AVRO LOCATION '<HDFS location>';
Next, create an internal Hive table from the external table to load the data. Note that a CREATE TABLE ... AS SELECT statement takes its column names and types from the query, so no column list is given:
CREATE TABLE internal_table_name AS SELECT col2, col3 FROM external_table_name;
You can schedule the internal table load using a batch script in any scripting language or tools.
Hope this helps :)

External Table in Hive - Location

The below table returns no data while running a select statement
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
I need my Hive table to point to a dynamic folder, so that when a MapReduce job puts a part file in a folder, Hive loads it into the table.
Is there any way the location can be made dynamic, like
/user/data/CSV/*/*/*/*/part-*
or just /user/data/CSV/* would do fine ?
(The same code works fine when created as internal table and loaded with the file path - hence there is no issues due to formatting)
First off, your table definition is missing columns. Second, an external table location always points to a folder, not to particular files. Hive will consider all files in the folder to be data for the table.
If you have data that is generated e.g. on a daily basis by some external process you should consider partitioning your table by date. Then you need to add a new partition to the table when the data is available.
Hive does not iterate through multiple folders by default. For the above scenario, I therefore ran a command that iterates through the folders, cats (prints to the console) all the part files, and puts the result into the desired location (the one that Hive points to):
hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
This line
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
does not look correct. I don't think the table can be created from multiple locations. Have you tried importing from a single location to confirm this?
It could also be that the delimiter you're using is not correct ('\073' is the octal code for a semicolon). If you are importing a comma-separated CSV file, try delimiting by ','.
You can use alter table statements to change the locations. In the example below, partitions are based on dates, and the data is stored in date-dependent file locations. If I want to search many days, I have to add an alter table statement for each location. This idea may extend to your situation quite well. You could write a script in some other technology, such as Python, to generate the statements below.
CREATE EXTERNAL TABLE foo (
)
-- partition column required by the ADD PARTITION statements below;
-- backticks are used because date is a reserved word in recent Hive versions
PARTITIONED BY (`date` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
;
alter table foo add partition (`date`='20160201') location '/user/data/CSV/20160201/data';
alter table foo add partition (`date`='20160202') location '/user/data/CSV/20160202/data';
alter table foo add partition (`date`='20160203') location '/user/data/CSV/20160203/data';
alter table foo add partition (`date`='20160204') location '/user/data/CSV/20160204/data';
You can use as many add and drop statements as you need to define your locations. Your table can then find data held in many locations in HDFS, rather than requiring all your files to be in one location.
You may also be able to use a create table like statement to create a table with the same schema as another table, and then alter the new table to point at the files you want.
I know this isn't exactly what you want and is more of a workaround. Good luck!

HiveQL Where In Clause That Points to a Set of Files

I have a set of ~100 files each with 50k IDs in them. I want to be able to make a query against Hive that has a Where In clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing from Hive instead of referencing an external Set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files. For example, for tab-delimited files:
create external table my_ids(
id bigint,
other_col string
)
row format delimited fields terminated by "\t"
stored as textfile
location 'hdfs://mydfs/data/myids'
Now you can use Hive to access this data.
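The "Where In" filter itself can then be written as a semi join against the ID table. This is a minimal sketch: my_table and its id column are hypothetical stand-ins for whatever table you want to filter, and my_ids is the external table defined above.

```sql
-- my_table is a placeholder for the table being filtered;
-- my_ids is the external table over the ID files.
SELECT t.*
FROM my_table t
LEFT SEMI JOIN my_ids i
  ON (t.id = i.id);
```

LEFT SEMI JOIN is Hive's long-standing equivalent of an IN/EXISTS filter and returns each matching row of my_table once. Hive 0.13 and later should also accept the more familiar form WHERE t.id IN (SELECT id FROM my_ids).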