hive not adding file extensions to file names - hive

I've run the hive query
create table my_schema.my_table stored as parquet as select ...
It created the table and the files, but i do not see .parq file extension next to the files, which is a bit of a problem for me since i wanted to be able to run something like hdfs -ls -R /path/to/directory | grep .parq to list all parquet files in a directory.
Is there either a way to filter parquet files regardless of file extension or a way to make hive include the extension?
I have a similar query using impala and there i can see the .parq files without any issue.

Hive will not add extension to the file. You need to do it manually:
$ hadoop fs -put /path/to/directory/000000_0 /path/to/directory/data.parquet

Related

LOAD DATA INPATH table files start with some string in Impala

Just a simple question, I'm new in Impala.
I want to load data from the HDFS to my datalake using impala.
So I have a csv this_is_my_data.csv and what I want to do is load the file without specify all the extension, I mean something like the following:
LOAD DATA INPATH 'user/myuser/this_is.* INTO TABLE my_table
This is, a string starting with this_is and whatever follows.
If you need some additional information, please let me know. Thanks in advance.
The documentation says:
You can specify the HDFS path of a single file to be moved, or the
HDFS path of a directory to move all the files inside that directory.
You cannot specify any sort of wildcard to take only some of the files
from a directory.
The workaround is to put your files into table directory using mv or cp command. Check your table directory using DESCRIBE FORMATTED command and run mv or cp command (in a shell, not Impala of course):
hdfs dfs -mv "user/myuser/this_is.*" "/user/cloudera/mytabledir"
Or put files you need to load into some directory first then load all the directory.

How to combine multiple text files into a Hive table

I am currently trying to write a Hive script to take in a directory path and generate a Hive table combining all the different files in the path together. I have found how to load files given I know the direct path to them, but how can I do this without knowing all the file paths?
This is how I would do it if I know the paths given directory, /combine:
LOAD DATA INPATH '/combine/file1.txt' INTO TABLE tablename;
LOAD DATA INPATH '/combine/file2.txt' INTO TABLE tablename;
But how would you get the same result if you dont know the file paths, only the directory?
just just * symbol could load all the file under the e directory into the table .
LOAD DATA INPATH '/combine/*' INTO TABLE tablename;
You can use an external Hive table.
Create a folder on HDFS and load the two files there:
hadoop fs -mkdir /hive-data
hadoop fs -put file1.txt /hive-data/file1.txt
hadoop fs -put file2.txt /hive-data/file2.txt
Alternatively, specify a directory from which to load all files:
hadoop fs -put directory-with-files/* /hive-data/
Verify the files are loaded properly:
hadoop fs -ls /hive-data
Create an external table in Hive and refer to the HDFS location:
(Change the schema, field, and line delimiters to match your data files.)
CREATE EXTERNAL TABLE tablename
(
id INT,
desc STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data';
Verify data in Hive:
select * from tablename;

Hive Managed table - Filename

In hive managed tables is there anyway to input/specify the filename for the data files getting created?
For example, the below data file ends with "000000_0", is it possible to get that file generated with specific name?
hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders_partitioned/order_month=Apr/000000_0
There is no way to specify the file name when you input the data using hive cli or sqoop. But there is a way to input the specified file using copy command
hadoop fs -cp <src_file> <dest_folder>
In this case you have to be careful the data in this source file is to be matched exactly with the partition condition of destination directory.

Parquet compression type via Impala

We have quite a few impala tables defined, and assume we are using Snappy compression. (parquet files)
However nobody really knows what compression type we are actually using on existing tables.
The impala docs don't seem to specify how to get the compression type from an existing table.
Is there a way to find the used compression type via impala?
As of right now there is no command in Impala that would tell you the type of compression being used in a table stored as parquet, but there is a work around. What you can do is look at one of the parquet files within the table and then use the parquet-tools meta command in order to see the compression being used.
-- step1) run hdfs dfs -ls to determine the location and name for a parquet file
hdfs dfs -ls /yourTableLocationPath
-- step2) parquet-tools really only works locally right now so you will need to copy the file to a local directory
hdfs dfs -get /yourTableLocationPath/yourFileName /yourLocalPath
-- step3) run parquet-tools meta command
parquet-tools meta /yourLocalPath/yourFileName
The output of the parquet-tools meta command will show you the type of compression being used under the row group output.

how can i copy data from a Hive table into local system?

i have created a table in Hive "sample" and loaded a csv file "sample.txt" into it.
now i need that data from "sample" into my local /opt/zxy/sample.txt.
How can i do that?
Hortonworks' Sandbox lets you do it through its HCatalog menu. Otherwise, the syntax is
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/c' SELECT a.* FROM b
as per Hive language manual
Since your intention is just to copy the entire file from HDFS to your local FS, I would not suggest you to do it through a Hive query, because of the following reasons :
It'll start a Mapreduce job which will take more time than a normal copy.
It'll create file(s) with different names(000000_0, 000001_0 and so on), which will require you to rename the file manually afterwards.
You might face problem in opening these files as they are without any extension. Your OS would be unable to choose an application to open these files on its own. In such a case you either have to rename the file or manually select an application to open it.
To avoid these problems you could use HDFS get command :
bin/hadoop fs -get /user/hive/warehouse/sample/sample.txt /opt/zxy/sample.txt
Simple n easy. But if you need to copy some selected data, then you have to use a Hive query.
HTH
I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:
hive -e 'select * from sample' > /opt/zxy/sample.txt
Hope that helps.
Readers who are accessing Hive from Windows OS can check out this script on Github.
It's a Python+paramiko script that extracts Hive data to local Windows OS file-system.