I created an external hive table like this:
CREATE EXTERNAL TABLE some_hive_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/hdfs/path/some_hive_table/'
TBLPROPERTIES ('avro.schema.literal'='{json schema here}');
I'd like to run some Hive queries on it and export the results to an Avro file. I know I can export data like this:
INSERT OVERWRITE DIRECTORY '/hdfs/path/avrofileoutput/'
SELECT * FROM some_hive_table;
But I want my output file to be an Avro file, not CSV. Can this be done, and if so, how?
You can export any table, irrespective of its input storage format, to an Avro file in a local or HDFS location using the command below (available starting with Hive 0.11.0):
INSERT OVERWRITE LOCAL DIRECTORY '<Local directory>'
STORED AS AVRO SELECT * FROM some_hive_table;
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
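If you want the output in HDFS rather than on the local filesystem, the same syntax should work without the LOCAL keyword; a minimal sketch reusing the output path from the question:

INSERT OVERWRITE DIRECTORY '/hdfs/path/avrofileoutput/'
STORED AS AVRO
SELECT * FROM some_hive_table;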
You can also try the option below: insert into a second, Avro-backed table.

INSERT OVERWRITE TABLE some_hive_table_avro
SELECT * FROM some_hive_table_text;
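For this to produce Avro files, the target table has to exist and be stored as Avro. A minimal sketch using CTAS instead, assuming Hive 0.14.0 or later for STORED AS AVRO (older versions need the explicit AvroSerDe/InputFormat/OutputFormat clauses shown in the first question):

-- create the Avro-backed table and populate it in one step
CREATE TABLE some_hive_table_avro
STORED AS AVRO
AS SELECT * FROM some_hive_table_text;

The Avro data files then land under the table's warehouse directory, from which they can be copied out.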
Can someone point me to the docs for creating an external table on Qubole based on Avro files?
CREATE TABLE my_table_name
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my_avro_files/'
The following directory has a bunch of Avro files:
s3://my_avro_files/
s3://my_avro_files/file1.avro
s3://my_avro_files/file2.avro
s3://my_avro_files/file....avro
I believe you need to provide the schema as well. Please see here for details on how to extract it and specify it in the CREATE TABLE statement.
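For instance, a sketch of the same DDL with an inline schema via avro.schema.literal; the record and field names below are placeholders for whatever you extract from your files:

CREATE EXTERNAL TABLE my_table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my_avro_files/'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"MyRecord","fields":[{"name":"field1","type":"string"}]}');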
I want to load some of the files from an HDFS directory into a table.
The files in the HDFS directory are as below.
/data/log/user1log.csv
/data/log/user2log.csv
/data/log/user3log.csv
/data/log/user4log.csv
/data/log/user5log.csv
Now I want to load /data/log/user1log.csv and /data/log/user2log.csv files.
I have tried the following:
CREATE EXTERNAL TABLE log_data (username string,log_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
tblproperties ("skip.header.line.count"="1");
load data inpath '/data/log/user1log.csv' into table log_data;
load data inpath '/data/log/user2log.csv' into table log_data;
But after loading the data into the table, the files vanish from the HDFS location, and I need the files to stay in HDFS.
Please help me.
Thanks in advance.
I don't think that's possible: LOAD DATA INPATH moves the data rather than copying it.
However, you have an external table, so you can load data even without using LOAD DATA INPATH.
Here's how you can do it.
Specify the location for your Hive Table
CREATE EXTERNAL TABLE log_data (username string, log_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/log_data/table'
tblproperties ("skip.header.line.count"="1");
Copy Files to Location
hdfs dfs -cp /data/log/user1log.csv /data/log_data/table/
hdfs dfs -cp /data/log/user2log.csv /data/log_data/table/
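Because hdfs dfs -cp copies rather than moves, the originals stay under /data/log and the external table picks up the copies immediately. A quick illustrative check:

hdfs dfs -ls /data/log/

and, in Hive:

SELECT * FROM log_data LIMIT 5;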
I'm trying to dynamically (without listing column names and types in Hive DDL) create a Hive external table on Parquet data files. I have the Avro schema of the underlying Parquet file.
My attempt uses the DDL below:
CREATE EXTERNAL TABLE parquet_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath'
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');
My Hive table is successfully created with the right schema, but when I try to read the data:
SELECT * FROM parquet_test;
I get the following error:
java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable
Is there a way to successfully create and read Parquet files, without mentioning columns name and types list in DDL?
The queries below work:

CREATE TABLE avro_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

CREATE EXTERNAL TABLE parquet_test
LIKE avro_test
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath';
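With the table created via LIKE, the columns translated from the Avro schema live in the metastore and the files are read with the Parquet SerDe, so the AvroGenericRecordWritable error no longer applies. An illustrative check:

DESCRIBE parquet_test;
SELECT * FROM parquet_test LIMIT 10;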
We are looking for a way to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema.
In other words, how can we generate a Hive table from a Parquet/Avro schema?
thanks :)
Try the below, using the Avro schema:

CREATE TABLE avro_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

CREATE EXTERNAL TABLE parquet_test
LIKE avro_test
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath';
The same question is asked in Dynamically create Hive external table with Avro schema on Parquet Data.
I have multiple Avro files, and each file has a STRING in it. Each Avro file is a single row. How can I write a Hive table to consume all the Avro files located in a single directory?
Each file has a big number in it, so I do not have any JSON-like schema that I can relate to. I might be wrong when I say schemaless, but I cannot find a way for Hive to understand this data. This might be very simple, but I am lost, since I have tried numerous different ways without success. I have created tables pointing to a JSON schema as the Avro URI, but that is not the case here.
For more context, the files were written using the Crunch API:
final Path outcomesVersionPath = ...
pipeline.write(fruit.keys(), To.avroFile(outcomesVersionPath));
I tried the following query, which creates the table but does not read the data properly:
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
If your data set only has one STRING field then you should be able to read it from Hive with a single column called data (or whatever you would like) by changing your DDL to:
CREATE EXTERNAL TABLE test_table
(data STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
And then read the data with:
SELECT data FROM test_table;
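For reference, this assumes the files contain Avro records with a single string field, along the lines of the sketch below. The record name is hypothetical, and the Hive column name should match the Avro field name; the next answer shows how to check what Crunch actually wrote.

{"type": "record", "name": "StringWrapper",
 "fields": [{"name": "data", "type": "string"}]}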
Use the avro-tools jar to see the Avro schema for any given binary Avro file.
Then just reference the schema file when creating the table.
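A sketch of both steps (the jar version and paths are placeholders):

java -jar avro-tools-1.8.2.jar getschema file1.avro > myAvroSchema.avsc

and then point the table at the extracted schema:

TBLPROPERTIES ('avro.schema.url'='hdfs:///somePath/myAvroSchema.avsc')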