How can I have impala or hive return the file format of the underlying files on HDFS for a table?
I tried:
SHOW FILES database.table_name
This ilst the files, but the problem is that some people stored parquet files as .parq and others .parquet. Is there anyway to return the file format, such that one could use it in a new create statement?
Use good old show create table mytable.
You can check the output and it clearly mentions file format. It also shows folder inside which file are stored - you should not try to use file name - let impala decide the name. below is a sample result from impala.
result
CREATE TABLE edh.mytable (
column1 STRING
)
STORED AS PARQUET --file format
LOCATION 's3a://cc-mys3/edh/user/hive/warehouse/edh.db/mytable' --folder location
Related
I am new to Hive. Could you please let me know answer for below question?
Why do we need base table while loading the data in ORC?
Can't we directly create table as ORC and load data in it?
1. Why do we need base table while loading the data in ORC?
We need of the base table, because most of the time we get the data file in text file format, i.e. CSV, TXT, DAT or any other delimiter that we can open the file and see the content. But the file Format ORC maintain in a different way by using their algorithm to optimized the Row and Column.
Hence we need of a base table, so, Actually what happened in that case. We create a table with the textFile format and select the data over their and write it into ORC table.
2. Can't we directly create table as ORC and load data in it?
Yes, you can load the data into ORC file directly.
To understand more about ORC, you can refer to https://orc.apache.org/docs/
Usually if you don't define file format , for hive it is textfile by default.
Need of base table arises because when you create a hive table with orc format and then trying to load data using command:
load data in path '' ..
it simply moves data from one location to another.
hive orc table won't understand textfile. that's when serde comes into picture. you define serde while creating table.
so when a operation like :
1. select * (read)
2. insert into (write)
serde will serialize and desiarlize various format to orc and map data to hive columns.
I need get specific data from gz.
how to write the sql?
can I just sql as table database?:
Select * from gz_File_Name where key = 'keyname' limit 10.
but it always turn back with an error.
You need to create Hive external table over this file location(folder) to be able to query using Hive. Hive will recognize gzip format. Like this:
create external table hive_schema.your_table (
col_one string,
col_two string
)
stored as textfile --specify your file type, or use serde
LOCATION
's3://your_s3_path_to_the_folder_where_the_file_is_located'
;
See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise s3 under the hood does not store folders, filename containing /s in s3 represented by different tools such as Hive like a folder structure. See here: https://stackoverflow.com/a/42877381/2700344
When we create using
Create external table employee (name string,salary float) row format delimited fields terminated by ',' location /emp
In /emp directory there are 2 emp files.
so when we run select * from employee, it get the data from both the file ad display.
What will be happen when there will be others file also having different kind of record which column is not matching with the employee table , so it will try to load all the files when we run "select * from employee"?
1.Can we specify the specific file name which we want to load?
2.Can we create other table also with the same location?
Thanks
Prashant
It will load all the files in emp directory even it doesn’t match with table.
for your first question. you can use Regex serde.if your data matches to regex.then it loads to the table.
regex for access log in hive serde
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
other options:I am pointing some links.these links has some ways.
when creating an external table in hive can I point the location to specific files in a direcotry?
https://issues.apache.org/jira/browse/HIVE-951
for your second question: yes we can create other tables also with the same location.
Here are your answers
1. If the data in the file dosent match with table format, hive doesnt throw an error. It tries to read the data as best as it could. If data for some columns are missing it will put NULL for them.
No we cannot specify the file name for any table to read data. Hive will consider all the files under the table directory.
Yes, we can create other tables with the same location.
I have a set of ~100 files each with 50k IDs in them. I want to be able to make a query against Hive that has a Where In clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing from Hive instead of referencing an external Set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files.. i.e for tab delimited files
create external table my_ids(
id bigint,
other_col string
)
row format delimited fields terminated by "\t"
stored as textfile
location 'hdfs://mydfs/data/myids'
Now you can use Hive to access this data.
I am exporting data from DynamoDB to S3 using follwing script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt)
but the way it is working currently is that above script created folder with name 'MyData.txt'
and then generated a file with some random name under this folder.
Is it at all possible to specify a file name in S3 using HIVE?
Thank you!
A few things:
There are 2 different ways hadoop can write data to s3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in hadoop handle directories pretty seamlessly so you shouldn't let it bug you too much. Also, you can use things like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external hive table is always treated as a directory for the reasons above.