How to create an Impala table with a complex data type and specify a delimiter for an array column - impala

I am trying to create an Impala table with an array column, and I need to use a custom delimiter for that column.
I tried the query below, but it throws an error.
Create table array_demo( arra_col ARRAY<string>) row format delimited fields terminated by ','
collection items terminated by '|' stored as parquet

You should omit the ROW FORMAT clause and the subclauses specifying the terminators, and include a STORED AS clause (Parquet is the only format Impala supports with complex data).
The data files loaded into the table must be in Parquet format too.
If you don't have the data file in Parquet format, you can create the table in Hive,
then create a copy using CREATE TABLE … AS SELECT (CTAS statement), with STORED AS PARQUET.
You then can query the table in Impala.
For example:
-- Create table in Hive
CREATE TABLE array_demo( arra_col ARRAY<STRING>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE;
-- Copy the table in Parquet format (in Hive, STORED AS goes before AS SELECT)
CREATE TABLE array_demo_impala
STORED AS PARQUET
AS
SELECT *
FROM array_demo;

Related

Skipping header in hive is removing first line of my data

I have the following query in hive:
CREATE EXTERNAL TABLE shop.id_store (
person_id INT,
shop_category STRING
)
row format delimited fields terminated by ',' stored as textfile
LOCATION "user/schema/table"
tblproperties('skip.header.line.count'='1', 'external.table.purge'='true');
LOAD DATA INPATH 'tmp/ids.csv' OVERWRITE INTO TABLE shop.id_store;
INSERT OVERWRITE TABLE shop.id_store
SELECT
*
FROM
shop.id_store
My CSV file, ids.csv, does contain a header row. However, I have noticed that the above code actually removes the first row of my actual data. What is going on?
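One plausible explanation, offered as an assumption rather than a confirmed diagnosis: skip.header.line.count is applied to every data file each time the table is read, and INSERT OVERWRITE rewrites the table's files without the header, so the next read skips a real row instead. The Python sketch below only simulates that behavior. The function read_table and the sample rows are hypothetical and are not part of Hive.

```python
def read_table(files, skip_header_lines=1):
    """Simulate a reader that drops the first N lines of EVERY file (assumed behavior)."""
    rows = []
    for lines in files:
        rows.extend(lines[skip_header_lines:])
    return rows


# After LOAD DATA: the table directory holds ids.csv, header included.
# The data rows here are made-up sample values.
files = [["person_id,shop_category", "1,food", "2,toys"]]
visible = read_table(files)

# INSERT OVERWRITE ... SELECT * rewrites the directory with only the rows
# that were visible, so the new file no longer starts with a header.
files_after = [visible]
visible_after = read_table(files_after)

if __name__ == "__main__":
    print(visible)        # rows seen before the overwrite
    print(visible_after)  # rows seen after the overwrite
```

If this is what is happening, dropping the INSERT OVERWRITE step (or removing the header-skipping property once the files no longer contain a header) should keep the first data row.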

How to create a Hive external table on &-delimited key-value pairs

I have a simple requirement: create a Hive external table on a text file whose data is in the format
colAAA=2&colDDD=1065985&colBBB=valueBB&colCCC=875
COL_NAME=VALUE&COL_NAME=VALUE&COL_NAME=VALUE
I cannot use the RegEx SerDe because the column names don't come in a defined order. Is there a way to create the external table without writing a new custom SerDe?
create external table if not exists custom_table_name(
colAAA int,
colBBB int,
colCCC string,
colDDD int)
row format delimited
fields terminated by '&'
????????????? How to make it read the Key-Value ??
I would like to avoid writing CustomSerde unless there is no open-source SERDE available ... Thanks.
First, create an external table with a single map column to parse your data:
create external table some_table
(map_col map<string, string>)
row format delimited
COLLECTION ITEMS TERMINATED BY '&'
MAP KEYS TERMINATED BY '='
stored as textfile
location <your_location>
Then select the map keys you are interested in:
create table another_table as
select map_col['colAAA'] as colAAA, ...etc
from some_table
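To make the two delimiters concrete, here is a minimal Python sketch of how a line is broken into a map: items are split on '&' and each item is split into key and value on '='. It is only an illustration of the idea, not Hive's actual SerDe code, and the helper name parse_map is hypothetical.

```python
def parse_map(line, item_sep="&", key_sep="="):
    """Split a line into key/value pairs, regardless of the order of the keys."""
    result = {}
    for item in line.split(item_sep):
        if not item:
            continue  # ignore empty items, e.g. from a trailing '&'
        key, _, value = item.partition(key_sep)
        result[key] = value
    return result


# Sample line taken from the question.
row = parse_map("colAAA=2&colDDD=1065985&colBBB=valueBB&colCCC=875")

if __name__ == "__main__":
    # Lookup by key works no matter where the pair sits in the line.
    print(row["colAAA"], row["colCCC"])
```

Because the lookup is by key, the order of the pairs in each line does not matter, which is why the map column sidesteps the problem with the RegEx SerDe. Note that all values come out as strings.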

Creating a Hive table without ROW FORMAT for comma-delimited columns

I have a .CSV comma delimited file
c1,c2,c3,c4
d1,d2,d3,d4
My requirement is to create an external Hive table with a single field named item that contains each row of my CSV file as a whole, regardless of the comma-delimited columns.
What Hive CREATE TABLE statement should I use?
Create the Hive table without specifying a ROW FORMAT clause; Hive then defaults to the Ctrl+A (^A) field delimiter.
Because your data is comma-delimited and contains no ^A characters, each whole line will be read into the single field.
Example:
create external table i(item string) location '<your_directory_path>';
Here the item field will contain the whole line.
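As a rough illustration of why this works, the Python sketch below splits a line on the default ^A ('\x01') separator. A comma-delimited line contains no ^A, so it stays in one piece. The helper split_fields is hypothetical and only approximates how a delimited text reader assigns fields to columns.

```python
FIELD_SEP = "\x01"  # Ctrl+A, Hive's default field delimiter for text tables


def split_fields(line, n_columns, sep=FIELD_SEP):
    """Split a line on sep, then pad with None or truncate to n_columns."""
    fields = line.split(sep)
    fields = fields[:n_columns] + [None] * (n_columns - len(fields))
    return fields


# Lines from the question: no ^A inside, so each line is a single field.
single = split_fields("c1,c2,c3,c4", n_columns=1)

if __name__ == "__main__":
    print(single)
```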

Is defining a delimiter in a hive ORC Table useless?

When you create an ORC table in Hive, you are changing the file type to ORC. This means you can't look at a specific file outside of the ORC table.
Here's an example orc create table statement
CREATE TABLE IF NOT EXISTS table_orc_v1
(
col1 int,
col2 int
)
PARTITIONED BY (odate date)
CLUSTERED BY (col1) INTO 10 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');
If I try to make this a csv table (like you do on a non-orc table) will it
1) not affect table performance
2) slow down performance as it converts things to a csv file that you can never read
3) give me some benefit that I'm not aware of
4) do something else
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
If you use a binary format (ORC, Avro, Parquet) to store your data, ROW FORMAT DELIMITED FIELDS TERMINATED BY is simply ignored. You can include it in the table definition and it might not raise an error, but the delimiters are not used.

Load into Hive table imported entire data into first column only

I am trying to copy Hive data from one server to another. To do this, I am exporting the Hive data to CSV on server1 and trying to import that CSV file into Hive on server2.
My table contains following datatypes:
bigint
string
array
Here are my commands:
export:
hive -e 'select * from sample' > /home/hadoop/sample.csv
import:
load data local inpath '/home/hadoop/sample.csv' into table sample;
After importing into the Hive table, the entire row is inserted into the first column only.
How can I overcome this, or else is there a better way to copy data from one server to another server?
When creating the table, add the line below at the end of the CREATE statement:
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Like Below:
hive>CREATE TABLE sample(id int,
name String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Then Load Data:
hive>load data local inpath '/home/hadoop/sample.csv' into table sample;
For Your Example
sample.csv
123,Raju,Hello|How Are You
154,Nishant,Hi|How Are You
In the sample data above, the first column is a BIGINT, the second is a STRING, and the third is an array whose items are separated by |
hive> CREATE TABLE sample(id BIGINT,
name STRING,
messages ARRAY<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|';
hive> LOAD DATA LOCAL INPATH '/home/hadoop/sample.csv' INTO TABLE sample;
Most important point:
Define a delimiter for collection items, and don't write the array with the
bracket structure you would use in normal programming.
Also, make the field delimiter different from the collection items
delimiter to avoid confusion and unexpected results.
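The advice about keeping the two delimiters different can be sketched in Python. This is a simplified model that assumes the reader splits the whole row on the field delimiter first, keeps only as many fields as there are columns, and then splits the array column on the collection delimiter. The helper parse_row is hypothetical and is not Hive code.

```python
def parse_row(line, field_sep=",", item_sep="|"):
    """Parse 'id,name,messages' where messages is an item_sep-separated array."""
    # Split on the field delimiter and keep only the three declared columns.
    id_text, name, raw_messages = line.split(field_sep)[:3]
    return int(id_text), name, raw_messages.split(item_sep)


# Distinct delimiters: the array survives intact.
good = parse_row("123,Raju,Hello|How Are You")

# Same delimiter for fields and items: the row is cut into four fields,
# so the array column only receives the first item.
bad = parse_row("123,Raju,Hello,How Are You", item_sep=",")

if __name__ == "__main__":
    print(good)
    print(bad)
```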
You really should not be using CSV as your data transfer format
DistCp copies data between Hadoop clusters as-is
Hive supports Export, Import
Circus Train allows Hive table replication
Why not use a Hadoop command to transfer the data from one cluster to another, such as
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
Then load the data into your new table:
load data inpath '/bar/foo/*' into table wyp;
Your problem may be caused by the delimiter.
The default field delimiter is '\001' if you haven't set one when creating the Hive table.
If you use hive -e 'select * from sample' > /home/hadoop/sample.csv, loading that file will put all columns into one column.