I was doing R&D on hadoop, hive, impala, and kudu. Installed HADOOP, HIVE, IMPALA, and KUDU servers.
I have configured --kudu_master_hosts=: in /etc/default -> impala file. i.e like below:
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}\
--kudu_master_hosts=<HOST_NAME>:<PORT>"
==============
Aftert that re-started the servers.
Then using Kudu JAVA client, i was able to create table in kudu and able to insert some records.
Then mapped the same table in impala by doing this:
CREATE EXTERNAL TABLE my_mapping_table
STORED AS KUDU
TBLPROPERTIES (
'kudu.table_name' = 'testT1'
);
successfully able to access the kudu table in impala and able to see all the records.
Now i was trying to create a table in KUDU using impala-shell.
[<HOST_NAME>:21000] > CREATE TABLE my_first_table
> (
> id BIGINT,
> name STRING,
> PRIMARY KEY(id)
> )
> STORED AS KUDU;
But this is giving a error i.e :
ERROR: ImpalaRuntimeException: Error creating Kudu table 'impala::default.my_first_table'
CAUSED BY: NonRecoverableException: Not enough live tablet servers to create a table with the requested replication factor 3. 1 tablet servers are alive.
Can any one explain me what is happening or what is the solution for this error.
Read through KUDU documentation but not getting any ideas.
Regards,
Akshay
This query will help to create table i.e
CREATE TABLE emp
(
uname STRING,
age INTEGER,
PRIMARY KEY(uname)
)
STORED AS KUDU
TBLPROPERTIES (
'kudu.num_tablet_replicas' = '1'
);
This query will work only if, --kudu_master_hosts=: is set in /etc/default impala file.
Else u have to give kudu_master_hosts in table properties. i.e
TBLPROPERTIES (
'kudu.num_tablet_replicas' = '1',
--kudu_master_hosts=<HOST_NAME>:<PORT>
);
What is the efficient way to export the data from hive/impala table with conditions into file(the data would be huge, close to 10 GB)? The format of the hive table is paraquet with snappy compressed and file is csv.
The table is partitioned daily and data needs to be extracted on daily basis, I would like to know if
1) Imapala approach
impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename '--output_delimiter=\001'
2) Hive approach
Insert overwrite directory '/path' select * from table where year_month_date=$$$$$$$$
would be efficient
Assuming table tbl as your hive parquet table and condition as your filter condition.
CTAS command:
CREATE TABLE tbl_text ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/data' AS select * from tbl where condition;
You will find your CSV text file (delimited by ',') at /tmp/data in HDFS.
You can get this file to your local file system if needed using:
hadoop fs -get /tmp/data
Please try to use Dynamic Partitioning for your Hive/Impala table to efficiently export the data conditionally.
Partition your table with the columns of your interest and based on your queries for best results
Step 1: Create a Temporary Hive Table TmpTable and load your raw data into it
Step 2: Set hive parameters to support Dynamic partition
SET hive.exec.dynamic.partition.mode=non-strict;
SET hive.exec.dynamic.partition=true;
Step 3: Create your Main Hive Table with partition columns, example :
CREATE TABLE employee (
emp_id int,
emp_name string
PARTITIONED BY (location string)
STORED AS PARQUET;
Step 4: Load data from Temporary table to your employee table (Main Table)
insert overwrite table employee partition(location)
select emp_id,emp_name, location from TmpTable;
Step 5: export the data from hive with a condition
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee WHERE location='CALIFORNIA';
Please refer this link:
Dynamic Partition Concept
Hope this is useful.
I have an existing S3 folder structure like this,
s3://mydata/{country}/{date}/
{country} could be any of 30 different countries
{date} could be any date since 20150101
How can I read this in Hive by treating {country} as partition and {date} as sub partition ?
You can use the Hive DDL statement ALTER TABLE ADD PARTITION
ALTER TABLE mydata
ADD PARTITION (country='south-africa', date='20191024')
LOCATION 's3://mydata/south-africa/20191024/';
You can script this using a shell script, and passing each statement to Hive like hive -e 'ALTER TABLE $TABLE ADD PARTITION $PARTITION_SPEC LOCATION $PARTITION_LOCATION'
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions
I have a table named "HISTORY" in HBase having column family "VDS" and the column names ROWKEY, ID, START_TIME, END_TIME, VALUE. I am using Cloudera Hadoop Distribution. I want to provide SQL interface to HBase table using Impala. In order to do this we have to create respective External Table in Hive? So how to create external hive table pointing to this HBase table?
Run the following code in Hive Query Editor:
CREATE EXTERNAL TABLE IF NOT EXISTS HISTORY
(
ROWKEY STRING,
ID STRING,
START_TIME STRING,
END_TIME STRING,
VALUE DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
(
"hbase.columns.mapping" = ":key,VDS:ID,VDS:START_TIME,VDS:END_TIME,VDS:VALUE"
)
TBLPROPERTIES("hbase.table.name" = "HISTORY");
Don't forget to Refresh Impala Metadata after External Table Creation with the following bash command:
echo "INVALIDATE METADATA" | impala-shell;
I am new to hive. I have successfully setup a single node hadoop cluster for development purpose and on top of it, I have installed hive and pig.
I created a dummy table in hive:
create table foo (id int, name string);
Now, I want to insert data into this table. Can I add data just like sql one record at a time? kindly help me with an analogous command to:
insert into foo (id, name) VALUES (12,"xyz);
Also, I have a csv file which contains data in the format:
1,name1
2,name2
..
..
..
1000,name1000
How can I load this data into the dummy table?
I think the best way is:
a) Copy data into HDFS (if it is not already there)
b) Create external table over your CSV like this
CREATE EXTERNAL TABLE TableName (id int, name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'place in HDFS';
c) You can start using TableName already by issuing queries to it.
d) if you want to insert data into other Hive table:
insert overwrite table finalTable select * from table name;
There's no direct way to insert 1 record at a time from the terminal, however, here's an easy straight forward workaround which I usually use when I want to test something:
Assuming that t is a table with at least 1 record. It doesn't matter what is the type or number of columns.
INSERT INTO TABLE foo
SELECT '12', 'xyz'
FROM t
LIMIT 1;
Hive apparently supports INSERT...VALUES starting in Hive 0.14.
Please see the section 'Inserting into tables from SQL' at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
What ever data you have inserted into one text file or log file that can put on one path in hdfs and then write a query as follows in hive
hive>load data inpath<<specify inputpath>> into table <<tablename>>;
EXAMPLE:
hive>create table foo (id int, name string)
row format delimited
fields terminated by '\t' or '|'or ','
stored as text file;
table created..
DATA INSERTION::
hive>load data inpath '/home/hive/foodata.log' into table foo;
to insert ad-hoc value like (12,"xyz), do this:
insert into table foo select * from (select 12,"xyz")a;
this is supported from version hive 0.14
INSERT INTO TABLE pd_temp(dept,make,cost,id,asmb_city,asmb_ct,retail) VALUES('production','thailand',10,99202,'northcarolina','usa',20)
It's a limitation of hive.
1.You cannot update data after it is inserted
2.There is no "insert into table values ... " statement
3.You can only load data using bulk load
4.There is not "delete from " command
5.You can only do bulk delete
But you still want to insert record from hive console than you can do select from statck. refer this
You may try this, I have developed a tool to generate hive scripts from a csv file. Following are few examples on how files are generated.
Tool -- https://sourceforge.net/projects/csvtohive/?source=directory
Select a CSV file using Browse and set hadoop root directory ex: /user/bigdataproject/
Tool Generates Hadoop script with all csv files and following is a sample of
generated Hadoop script to insert csv into Hadoop
#!/bin/bash -v
hadoop fs -put ./AllstarFull.csv /user/bigdataproject/AllstarFull.csv
hive -f ./AllstarFull.hive
hadoop fs -put ./Appearances.csv /user/bigdataproject/Appearances.csv
hive -f ./Appearances.hive
hadoop fs -put ./AwardsManagers.csv /user/bigdataproject/AwardsManagers.csv
hive -f ./AwardsManagers.hive
Sample of generated Hive scripts
CREATE DATABASE IF NOT EXISTS lahman;
USE lahman;
CREATE TABLE AllstarFull (playerID string,yearID string,gameNum string,gameID string,teamID string,lgID string,GP string,startingPos string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/bigdataproject/AllstarFull.csv' OVERWRITE INTO TABLE AllstarFull;
SELECT * FROM AllstarFull;
Thanks
Vijay
You can use following lines of code to insert values into an already existing table. Here the table is db_name.table_name having two columns, and I am inserting 'All','done' as a row in the table.
insert into table db_name.table_name
select 'ALL','Done';
Hope this was helpful.
Hadoop file system does not support appending data to the existing files. Although, you can load your CSV file into HDFS and tell Hive to treat it as an external table.
Use this -
create table dummy_table_name as select * from source_table_name;
This will create the new table with existing data available on source_table_name.
LOAD DATA [LOCAL] INPATH '' [OVERWRITE] INTO TABLE <table_name>;
use this command it will load the data at once just specify the file path
if file is in local fs then use LOCAL if file is in hdfs then no need to use local