How can I tell if a table is saved as parquet files? - hive

I am using HiveMetaStoreClient to get some metadata of Hive tables, and some tables are saved as Parquet while others are saved as text. For the tables saved as Parquet, I want to get some more information, like the Parquet schema.
So how can I get the file format of a Hive table via HiveMetaStoreClient? Or is there any other interface to do that?
I am thinking maybe I can try to read each table with ParquetFileReader and catch exceptions, like:
try {
    // If the footer parses, the underlying files are Parquet
    ParquetMetadata metaData = ParquetFileReader.readFooter(conf, file, NO_FILTER);
    MessageType schema = metaData.getFileMetaData().getSchema();
} catch (Exception e) {
    System.out.println("Not parquet!!!");
}
But that seems like the worst option.

You have multiple options.
Use SHOW CREATE TABLE <tablename>
Use DESCRIBE FORMATTED <tablename>
You can use Hue, which offers a web GUI for Hadoop users.
If you have set up the NameNode UI too, you can access the details or even browse the files. The URL is generally http://<namenode-host>:50070. It does not show a lot of detail about the table; it is meant for Hadoop overall.
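Since you are already using HiveMetaStoreClient, you can also read the table's storage descriptor directly instead of going through the CLI. A minimal sketch, assuming a reachable metastore; "default" and "my_table" are placeholder names:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
import org.apache.hadoop.hive.metastore.api.Table;

public class TableFormatCheck {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();               // picks up hive-site.xml from the classpath
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            // database and table names are placeholders
            Table table = client.getTable("default", "my_table");
            StorageDescriptor sd = table.getSd();
            // Parquet tables report MapredParquetInputFormat / ParquetHiveSerDe here
            System.out.println("InputFormat: " + sd.getInputFormat());
            System.out.println("SerDe:       " + sd.getSerdeInfo().getSerializationLib());
            boolean isParquet = sd.getInputFormat().toLowerCase().contains("parquet");
            System.out.println("Parquet? " + isParquet);
        } finally {
            client.close();
        }
    }
}

Only when a table really is Parquet would you then read a footer with ParquetFileReader to get the detailed Parquet schema, rather than relying on a catch-all exception.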

I do it by running SHOW CREATE TABLE <tablename> in a Hive session, and in the result you will see the CREATE statement of the table with the file format details in it. It would look something like the below:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
Let me know if that's what you are looking for!

Related

impala/hive show file format

How can I have impala or hive return the file format of the underlying files on HDFS for a table?
I tried:
SHOW FILES database.table_name
This lists the files, but the problem is that some people stored Parquet files as .parq and others as .parquet. Is there any way to return the file format, such that one could use it in a new create statement?
Use good old show create table mytable.
You can check the output and it clearly mentions the file format. It also shows the folder inside which the files are stored; you should not try to use the file name, let Impala decide the name. Below is a sample result from Impala.
Result:
CREATE TABLE edh.mytable (
column1 STRING
)
STORED AS PARQUET --file format
LOCATION 's3a://cc-mys3/edh/user/hive/warehouse/edh.db/mytable' --folder location

Retrieving JSON raw file data from Hive table

I have a JSON file, and I want to move only selected fields to a Hive table. Below is the statement I used to create a new table to import the data from the JSON file into the Hive table. Creating it doesn't give any error, but when I use select * from JsonFile1 or count(*) from JsonFile1 I get the error: Failed with exception java.io.IOException: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
I have browsed the internet and have been stuck with this for a few days; I can't find a solution. I checked in HDFS and I see there is a table created and the complete file imported as-is (not just the fields I selected but all of it). I just provided sample data; the actual data contains 50+ field names, and creating all the column names is cumbersome. Is that what we need to do? Thank you in advance.
CREATE EXTERNAL TABLE JsonFile1(user STRUCT<id:BIGINT,description:STRING, followers_count:INT>)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION 'link/data';
I have data as below
{filter_level":"low",geo":null,"user":{"id":859264394,"description":"I don’t want it. Building #techteam, #LetsTalk!!! def#abc.com",
"contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name"krogmi",
"screen_name":"jkrogmi","id_str":"859264394",}}06:20:16 +0000 2012","default_profile_image":false,"followers_count":88,
"profile_sidebar_fill_color":"DDFFCC","screen_name":"abc_abc"}}
Answering my own question.
I deleted the data in HDFS that I was pointing to in the LOCATION '...', copied the data again from local to HDFS, recreated the table, and it worked.
I am assuming the data was the problem.

ORC file format

I am new to Hive. Could you please let me know the answers to the questions below?
Why do we need a base table while loading the data into ORC?
Can't we directly create the table as ORC and load data into it?
1. Why do we need a base table while loading the data into ORC?
We need the base table because most of the time we get the data file in a text format, i.e. CSV, TXT, DAT or some other delimited format that we can open and read directly. But the ORC file format stores the data differently, using its own algorithm to optimize the rows and columns.
Hence we need a base table. What actually happens in that case is that we create a table in the TEXTFILE format, select the data from it, and write it into the ORC table.
2. Can't we directly create the table as ORC and load data into it?
Yes, you can load the data into an ORC table directly.
To understand more about ORC, you can refer to https://orc.apache.org/docs/
Usually, if you don't define a file format, Hive uses TEXTFILE by default.
The need for a base table arises because when you create a Hive table in ORC format and then try to load data using the command:
LOAD DATA INPATH '' ..
it simply moves the data from one location to another.
A Hive ORC table won't understand a text file. That's where the SerDe comes into the picture; you define the SerDe while creating the table.
So for an operation like:
1. select * (read)
2. insert into (write)
the SerDe will serialize and deserialize the various formats to ORC and map the data to Hive columns.
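To make the staging pattern above concrete, here is a minimal sketch that runs it over the Hive JDBC driver; the connection URL, table names, columns and the '/data/sales.csv' path are placeholders, not anything from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OrcStagingLoad {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // 1. Base (staging) table in plain text, matching the delimited input file
            stmt.execute("CREATE TABLE IF NOT EXISTS staging_sales (id INT, amount DOUBLE) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE");
            // 2. Load the raw file into the staging table (this is just a file move in HDFS)
            stmt.execute("LOAD DATA INPATH '/data/sales.csv' INTO TABLE staging_sales");
            // 3. Final ORC table; the INSERT ... SELECT rewrites the rows in ORC format
            stmt.execute("CREATE TABLE IF NOT EXISTS sales_orc (id INT, amount DOUBLE) STORED AS ORC");
            stmt.execute("INSERT INTO TABLE sales_orc SELECT id, amount FROM staging_sales");
        }
    }
}

The same four statements can of course be pasted straight into a Hive session; the JDBC wrapper is only there to keep the example self-contained.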

How to define an HDInsight hive external table based on XMLs in a container

I tried creating a hive external table:
CREATE EXTERNAL TABLE TestXML (storexml string)
STORED AS TEXTFILE
LOCATION 'wasb:///test/';
However, when I try executing a query like the one below, it's not able to extract the fields:
SELECT
xpath_string (storexml, '/trades/trade/USI')
FROM TestXML;
I saw a post that talked about specifying the input format:
add JARS <>
set xmlinput.element=Store;
CREATE EXTERNAL TABLE EventStoreXML (storexml string)
STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'wasb:///eventstore#tradedata.blob.core.windows.net/';
I could not determine which JARs to include in the ADD JARS statement. I am using HDInsight on Linux.
Any pointers will be appreciated.
-Madhu
Realised the issue was that the XML had carriage returns; as a result, it was not able to read the XML.

Create hive table for schema less avro files

I have multiple Avro files, and each file has a STRING in it. Each Avro file is a single row. How can I write a Hive table to consume all the Avro files located in a single directory?
Each file has a big number in it, and hence I do not have any JSON kind of schema that I can relate to. I might be wrong when I say schema-less. But I cannot find a way for Hive to understand this data. This might be very simple, but I am lost, since I tried numerous different ways without success. I created tables pointing to a JSON schema as the Avro URI, but that is not the case here.
For more context, the files were written using the Crunch API:
final Path outcomesVersionPath = ...
pipeline.write(fruit.keys(), To.avroFile(outcomesVersionPath));
I tried the following query, which creates the table but does not read the data properly:
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
If your data set only has one STRING field then you should be able to read it from Hive with a single column called data (or whatever you would like) by changing your DDL to:
CREATE EXTERNAL TABLE test_table
(data STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
And then read the data with:
SELECT data FROM test_table;
Use the Avro tools JAR to see the Avro schema of any given binary Avro file.
Then just link the schema file while creating the table.
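If you would rather inspect the schema programmatically instead of via the command-line tools, a small sketch using the Avro Java library works too; the file path is a placeholder passed on the command line:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class PrintAvroSchema {
    public static void main(String[] args) throws Exception {
        // Path to one of the Avro container files (placeholder)
        File avroFile = new File(args[0]);
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            // The schema is embedded in every Avro container file's header
            Schema schema = reader.getSchema();
            System.out.println(schema.toString(true));
        }
    }
}

The printed schema (or a .avsc file saved from it) is what you would then point the table at, typically via the avro.schema.url or avro.schema.literal table property.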