Hive Managed table - Filename - hive

In Hive managed tables, is there any way to specify the filename for the data files that get created?
For example, the data file below ends with "000000_0"; is it possible to have that file generated with a specific name?
hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders_partitioned/order_month=Apr/000000_0

There is no way to specify the file name when you load the data using the Hive CLI or Sqoop. But you can place a file with a specific name by copying it into the table directory yourself:
hadoop fs -cp <src_file> <dest_folder>
In this case you have to be careful that the data in the source file exactly matches the partition condition of the destination directory.
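For example, a minimal sketch along those lines (the staging path and file name are placeholders, not part of the original question):
# copy a pre-named data file straight into the existing partition directory;
# its contents must match the partition condition (order_month='Apr')
hadoop fs -cp /user/cloudera/staging/orders_apr.txt \
    /user/hive/warehouse/orders_partitioned/order_month=Apr/orders_apr.txt
# verify the file kept the name you chose
hadoop fs -ls /user/hive/warehouse/orders_partitioned/order_month=Apr/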

Related

hive not adding file extensions to file names

I've run the hive query
create table my_schema.my_table stored as parquet as select ...
It created the table and the files, but I do not see a .parq extension on the files, which is a bit of a problem for me since I wanted to be able to run something like hdfs dfs -ls -R /path/to/directory | grep .parq to list all Parquet files in a directory.
Is there either a way to filter parquet files regardless of file extension or a way to make hive include the extension?
I have a similar query using Impala, and there I can see the .parq files without any issue.
Hive will not add an extension to the file. You need to do it manually, by renaming the file in HDFS:
$ hadoop fs -mv /path/to/directory/000000_0 /path/to/directory/data.parquet
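If the table directory contains several part files, a small shell loop can handle the renaming; this is a rough sketch, assuming the CTAS output lives in /path/to/directory and uses Hive's usual 000000_0-style file names:
# list the table directory, keep only Hive's numeric part files, and append .parq to each
for f in $(hadoop fs -ls /path/to/directory | awk '$NF ~ /\/[0-9]+_[0-9]+$/ {print $NF}'); do
  hadoop fs -mv "$f" "$f.parq"
done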

LOAD DATA INPATH table files start with some string in Impala

Just a simple question; I'm new to Impala.
I want to load data from HDFS into my data lake using Impala.
So I have a CSV, this_is_my_data.csv, and what I want to do is load the file without specifying the full file name, I mean something like the following:
LOAD DATA INPATH 'user/myuser/this_is.*' INTO TABLE my_table
That is, a string starting with this_is and whatever follows.
If you need some additional information, please let me know. Thanks in advance.
The documentation says:
You can specify the HDFS path of a single file to be moved, or the HDFS path of a directory to move all the files inside that directory. You cannot specify any sort of wildcard to take only some of the files from a directory.
The workaround is to put your files into the table directory using the mv or cp command. Check the table's directory using the DESCRIBE FORMATTED command, then run mv or cp (in a shell, not in Impala, of course):
hdfs dfs -mv "/user/myuser/this_is.*" "/user/cloudera/mytabledir"
Or put the files you need to load into some directory first, then load the whole directory.
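For example, a rough sketch of that workaround (the staging directory name is a placeholder):
# stage only the files whose names match, letting the HDFS shell expand the wildcard
hdfs dfs -mkdir -p /user/myuser/staging_this_is
hdfs dfs -mv '/user/myuser/this_is*' /user/myuser/staging_this_is/
Then, in impala-shell, load the whole staging directory:
LOAD DATA INPATH '/user/myuser/staging_this_is' INTO TABLE my_table;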

Writing data using PIG to HIVE external table

I wanted to create an external table and load data into it through a Pig script. I followed the approach below:
OK. Create an external Hive table with a schema layout somewhere in an HDFS directory. Let's say:
create external table emp_records(id int,
name String,
city String)
row format delimited
fields terminated by '|'
location '/user/cloudera/outputfiles/usecase1';
Just create a table like the above; there is no need to load any file into that directory.
Now write a Pig script that reads data from some input directory, and when you store the output of that Pig script, use the following:
A = LOAD 'inputfile.txt' USING PigStorage(',') AS(id:int,name:chararray,city:chararray);
B = FILTER A BY id >= 678933;
C = FOREACH B GENERATE id,name,city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');
Ensure that the destination location, the delimiter, and the schema layout of the final FOREACH statement in your Pig script match the Hive DDL schema.
My problem is that when I first create the table, it creates a directory in HDFS, and when I then try to store the output using the script, it throws an error saying "folder already exists". It looks like a Pig STORE always has to write to a brand-new directory?
Is there any way to avoid this issue?
And are there any other attributes we can use with the STORE command in Pig to write to a specific directory/file every time?
Thanks
Ram
Yes, you can use HCatalog to achieve this.
Remember you have to run your Pig script like:
pig -useHCatalog your_pig_script.pig
or, if you are using the Grunt shell, then simply use:
pig -useHCatalog
Next, to store your relation directly into a Hive table, use:
STORE C INTO 'HIVE_DATABASE.EXTERNAL_TABLE_NAME' USING org.apache.hive.hcatalog.pig.HCatStorer();
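Putting it together, a minimal sketch of the whole script (the database name my_database and the input path are assumptions; emp_records is the table from the DDL above). Save it as your_pig_script.pig and run it with pig -useHCatalog:
-- read the raw input and keep only the columns expected by the Hive table
A = LOAD 'inputfile.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
B = FILTER A BY id >= 678933;
C = FOREACH B GENERATE id, name, city;
-- write straight into the existing Hive table; Pig does not create a new output directory here
STORE C INTO 'my_database.emp_records' USING org.apache.hive.hcatalog.pig.HCatStorer();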

How to load data into Hive table

I'm using Hortonworks' Hue (more like a GUI interface that connects HDFS, Hive, and Pig together) and I want to load data that is already in HDFS into the table I just created.
Suppose the table's name is "test", and the file which contains the data is at the path:
/user/hdfs/test/test.txt
But I'm unable to load the data into the table, I tried:
load data local inpath '/user/hdfs/test/test.txt' into table test
But there's an error saying it can't find the file: there's no matching path.
I'm still so confused.
Any suggestions?
Thanks
As you said, you want to load the data that is already in HDFS into your newly created table.
But in your command you are using:
load data local inpath '/user/hdfs/test/test.txt' into table test
With the local keyword, Hive looks for the file in your local filesystem. But your file is in HDFS.
I think you need to remove the local keyword from your command.
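That is, assuming the file really is at that HDFS path:
load data inpath '/user/hdfs/test/test.txt' into table test;
(Note that without local, LOAD DATA INPATH moves the file from its current HDFS location into the table's directory.)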
Hope it helps...!!!
Since you are using Hue and the error says there is no matching path, I think you have to give the complete path.
For example:
load data local inpath '/home/cloudera/hive/Documents/info.csv' into table tablename;
In the same way, you can give the complete HDFS path where the document resides. You can use any other file format as well.
Remove the local keyword, as it refers to the local file system; your file is in HDFS.

HIVE script - Specify file name as S3 Location

I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt), but the way it currently works is that the above script creates a folder named 'MyData.txt' and then generates a file with a random name under that folder.
Is it at all possible to specify a file name in S3 using HIVE?
Thank you!
A few things:
There are two different ways Hadoop can write data to S3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and Hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in Hadoop handle directories pretty seamlessly, so you shouldn't let it bug you too much. Also, you can use things like hadoop fs -getmerge on a directory to read all of the files in a single stream (see the sketch below).
AFAIK, the LOCATION argument in the DDL for an external hive table is always treated as a directory for the reasons above.
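For example, a rough sketch of the getmerge approach mentioned above (the bucket, paths, and merged file name are placeholders, and this assumes your cluster's S3 filesystem supports these shell operations):
# pull every part file Hive wrote under the output location into a single local file
hadoop fs -getmerge s3://myBucket/DataFiles/MyData.txt ./MyData.txt
# upload the merged file back to S3 under a single, explicit name
hadoop fs -put ./MyData.txt s3://myBucket/DataFiles/MyData-merged.txt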