Apache hive create table for given structure - hive

My csv file contains data structure like:
99999,{k1:v1,k2:v2,k3:v3},9,1.5,http://www.asd.com
what is the create table query for this structure?
I don't have to do any processing on csv file before it is loaded into table.

You need to use Opencsv serde to read/write csv data to/from hive table. Download it here:https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
Add the serde to the library path of Hive. - Could be skipped, but do upload it to the hdfs cluster your hive server is running. We will use it later to query.
Create Table
CREATE TABLE my_table(a int, b string, c int, d double, url string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Notice that if you use openCSV serde, no matter what type you give, it will be taken as String by hive. But no worries as Hive is loosely type language. It will typecast string into int, json etc. at runtime.
Query
To query at the hive prompt first add the library if not added to the library path of hive
add jar hdfs:///user/hive/aux_jars/opencsv.jar;
Now you could query as:
select a, get_json_object(b, '$.k1') from my_table where get_json_object(b, '$.k2') > val;
Above is the example to access the JSON field from an Hive table.
References:
http://documentation.altiscale.com/using-csv-serde-with-hive
http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html
PS: Json Tuple is the faster way to access the json elements, but I found the syntax of get_json_object more appealing.

Related

Creating external hive table in databricks

I am using databricks community edition.
I am using a hive query to create an external table , the query is running without any error but the table is not getting populated with the specified file that has been specified in the hive query.
Any help would be appreciated .
from official docs ... make sure your s3/storage location path and schema (with respects to the file format [TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM]) are correct
DROP TABLE IF EXISTS <example-table> // deletes the metadata
dbutils.fs.rm("<your-s3-path>", true) // deletes the data
CREATE TABLE <example-table>
USING org.apache.spark.sql.parquet
OPTIONS (PATH "<your-s3-path>")
AS SELECT <your-sql-query-here>
// alternative
CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>"

Is it possible to load only selected columns from Avro file to Hive?

I have a requirement to load Avro file to hive. Using the following to create the table
create external table tblName stored as avro location 'hdfs://host/pathToData' tblproperties ('avro.schema.url'='/hdfsPathTo/schema.avsc');
I am getting an error FOUND NULL, EXPECTED STRING while doing a select on the table. Is it possible to load few columns and find which column data is causing this error?
Actually you need first to create an Hive External table pointing to the location of your AVRO files, and using the AvroSerDe format.
At this stage, nothing is loaded. The external table is just a mask on files.
Then you can create an internal HIVE table and load data (the expected columns) from the external one.
If you are already having AVRO file, then load the file to HDFS in a directory of your choice. Next create an external table on top of the directory.
CREATE EXTERNAL TABLE external_table_name(col1 string, col2 string, col3 string ) STORED AS AVRO LOCATION '<HDFS location>';
Next create an internal hive table on top of the external table to load the data
CREATE TABLE internal_table_name(col2 string, col3 string) AS SELECT col2, col3 FROM external_table_name
You can schedule the internal table load using a batch script in any scripting language or tools.
Hope this helps :)

ORC file format

I am new to Hive. Could you please let me know answer for below question?
Why do we need base table while loading the data in ORC?
Can't we directly create table as ORC and load data in it?
1. Why do we need base table while loading the data in ORC?
We need of the base table, because most of the time we get the data file in text file format, i.e. CSV, TXT, DAT or any other delimiter that we can open the file and see the content. But the file Format ORC maintain in a different way by using their algorithm to optimized the Row and Column.
Hence we need of a base table, so, Actually what happened in that case. We create a table with the textFile format and select the data over their and write it into ORC table.
2. Can't we directly create table as ORC and load data in it?
Yes, you can load the data into ORC file directly.
To understand more about ORC, you can refer to https://orc.apache.org/docs/
Usually if you don't define file format , for hive it is textfile by default.
Need of base table arises because when you create a hive table with orc format and then trying to load data using command:
load data in path '' ..
it simply moves data from one location to another.
hive orc table won't understand textfile. that's when serde comes into picture. you define serde while creating table.
so when a operation like :
1. select * (read)
2. insert into (write)
serde will serialize and desiarlize various format to orc and map data to hive columns.

How to query data from gz file of Amazon S3 using Qubole Hive query?

I need get specific data from gz.
how to write the sql?
can I just sql as table database?:
Select * from gz_File_Name where key = 'keyname' limit 10.
but it always turn back with an error.
You need to create Hive external table over this file location(folder) to be able to query using Hive. Hive will recognize gzip format. Like this:
create external table hive_schema.your_table (
col_one string,
col_two string
)
stored as textfile --specify your file type, or use serde
LOCATION
's3://your_s3_path_to_the_folder_where_the_file_is_located'
;
See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise s3 under the hood does not store folders, filename containing /s in s3 represented by different tools such as Hive like a folder structure. See here: https://stackoverflow.com/a/42877381/2700344

Dynamically create Hive external table with Avro schema on Parquet Data

I'm trying to dynamically (without listing column names and types in Hive DDL) create a Hive external table on parquet data files. I have the Avro schema of underlying parquet file.
My try is to use below DDL:
CREATE EXTERNAL TABLE parquet_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath'
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');
My Hive table is successfully created with the right schema, but when I try to read the data :
SELECT * FROM parquet_test;
I get the following error :
java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable
Is there a way to successfully create and read Parquet files, without mentioning columns name and types list in DDL?
Below query works:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';