Can anyone explain the difference between the create-hive-table and hive-import methods? Both create a Hive table, but what is the significance of each?
hive-import command:
The hive-import option automatically populates the metadata for the imported table in the Hive metastore. If the table does not yet exist in Hive, Sqoop
will simply create it based on the metadata fetched for your table or query. If the table already exists, Sqoop will import data into the existing table. If you're creating a new Hive table, Sqoop will convert the data types of each column from your source table to a type compatible with Hive.
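If Sqoop's default type conversion is not what you want for a particular column, you can override it with the --map-column-hive option. A minimal sketch, reusing the connection string from the example below (the column-to-type mapping shown is only an assumption for illustration):
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table employees --hive-import --map-column-hive empid=STRING -m 1;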
create-hive-table command:
Sqoop can generate a Hive table (using the create-hive-table tool) based on a table in an existing relational data source. If the option is set, the job will fail if the target Hive table already exists; by default this property is false.
Using the create-hive-table command involves three steps: importing data into HDFS, creating the Hive table, and then loading the HDFS data into Hive. This can be shortened to one step by using hive-import.
During a hive-import, Sqoop will first do a normal HDFS import to a temporary location. After a successful import, Sqoop generates two queries: one for creating the table and another for loading the data from the temporary location. You can specify the temporary location using either the --target-dir or --warehouse-dir parameter.
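For instance, to stage the import under a specific HDFS directory before it is loaded into Hive, something like the following should work (the staging path is an assumption):
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table employees --hive-import --target-dir /user/sqoop/staging/employees -m 1;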
A full example of each approach is given below.
Using create-hive-table command:
Involves three steps:
Importing data from RDBMS to HDFS
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table employees --split-by empid -m 1;
Creating hive table using create-hive-table command
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/hadoopexample --table employees --fields-terminated-by ',';
Loading data into Hive
hive> load data inpath "employees" into table employees;
Loading data to table default.employees
Table default.employees stats: [numFiles=1, totalSize=70]
OK
Time taken: 2.269 seconds
hive> select * from employees;
OK
1001 emp1 101
1002 emp2 102
1003 emp3 101
1004 emp4 101
1005 emp5 103
Time taken: 0.334 seconds, Fetched: 5 row(s)
Using hive-import command:
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table departments --split-by deptid -m 1 --hive-import;
The difference is that create-hive-table will create a table in Hive based on the source table in the database but will NOT transfer any data. The command "import --hive-import" will both create the table in Hive and import data from the source table.
Related
I needed to create a large amount of test data in a Hive table. I tried the following commands, but they only insert data for one partition at a time.
connect to beeline:
beeline --force=true -u 'jdbc:hive2://<host>:<port>/<hive database name>;ssl=true;user=<username>;password=<pw>'
Create a partitioned table:
CREATE TABLE p101(
Name string,
Age string)
PARTITIONED BY(fi string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
I created an ins.csv file with data and copied it to an HDFS location; its contents are as follows.
Name,Age
aaa,33
bbb,22
ccc,55
Then I tried to load the same file for multiple partition IDs with the following command:
LOAD DATA INPATH 'hdfs_path/ins.csv' INTO TABLE p101 PARTITION(fi=1,fi=2,fi=3,fi=4,fi=5);
but it loads records only for the partition fi=5.
You can only specify one partition per LOAD DATA statement.
What you can do in order to populate different partitions is add the partition column to your CSV file, like this:
Name,Age,fi
aaa,33,1
bbb,22,2
ccc,55,3
Hive will then automatically determine the partition from that column.
LOAD DATA INPATH 'hdfs_path/ins.csv' INTO TABLE tmp.p101;
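If your Hive version does not derive the partition value from the extra column during LOAD DATA, a common alternative is to load the file into a non-partitioned staging table and then run a dynamic-partition insert. A minimal sketch, assuming a hypothetical staging table p101_stage:
CREATE TABLE p101_stage (Name string, Age string, fi string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES("skip.header.line.count"="1");
LOAD DATA INPATH 'hdfs_path/ins.csv' INTO TABLE p101_stage;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- the partition column must come last in the SELECT list
INSERT INTO TABLE p101 PARTITION(fi) SELECT Name, Age, fi FROM p101_stage;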
I have text files with Snappy compression, partitioned by the field 'process_time' (the result of a Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script for creating the table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
Queries against this table always return 0 rows (but I know there is some data). Any help?
Thanks!
As you are creating an external table on top of an HDFS directory, you need to run one of the following commands to add the partitions to the Hive table.
If partitions are added directly to HDFS (instead of through insert queries), Hive does not know about the newly added partitions, so you need to run either msck repair or add partition to register them with the Hive table.
To add all partitions to the Hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to the Hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';
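After running either command, you can verify that the partitions were registered:
hive> show partitions <db_name>.<table_name>;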
I need to pull records from a MySQL table with n columns and store them in Hive with extra columns. Is there any way to do this in Sqoop?
Example:
The MySQL table has the following fields: id, name, place. And
the Hive table structure is id, name, place, and contact number (null).
So when performing the Sqoop import, I want to add an extra column, contact number, in Hive as (null).
You can do it by using the --query option in Sqoop and selecting the extra column with NULL AS.
sqoop import \
--query 'SELECT id, name, place, NULL AS contact_number FROM mysql_table WHERE $CONDITIONS' \
--connect jdbc:mysql://mysql.example.com/sqoop \
--Any other options
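Note that Sqoop requires the literal $CONDITIONS token in the WHERE clause of any free-form --query import, and a --target-dir must also be supplied. A fuller sketch that lands the result straight into Hive (the staging directory and Hive table name here are assumptions):
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--query 'SELECT id, name, place, NULL AS contact_number FROM mysql_table WHERE $CONDITIONS' \
--target-dir /tmp/sqoop_staging/mysql_table \
--hive-import --hive-table hive_table \
-m 1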
I have imported all tables using Sqoop into a Hive database "sqoop_import" and can see that all tables were imported successfully, as shown below:
hive> use sqoop_import;
OK
Time taken: 0.026 seconds
hive> show tables;
OK
categories
customers
departments
order_items
orders
products
Time taken: 0.025 seconds, Fetched: 6 row(s)
hive>
But when I try the same from impala-shell or Hue using the same user, it shows different results, as below:
[quickstart.cloudera:21000] > use sqoop_import;
Query: use sqoop_import
[quickstart.cloudera:21000] > show tables;
Query: show tables
+--------------+
| name |
+--------------+
| customers |
| customers_nk |
+--------------+
Fetched 2 row(s) in 0.01s
[quickstart.cloudera:21000] >
What am I doing wrong?
When you import a new table into Hive with Sqoop, you need to run INVALIDATE METADATA on that table before it becomes visible in impala-shell. From the command line, run: impala-shell -d DB_NAME -q "INVALIDATE METADATA table_name";
But if you append new data files to an existing table through Sqoop, you need to run REFRESH instead. From the command line, run:
impala-shell -d DB_NAME -q "REFRESH table_name";
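Alternatively, if you are already inside the impala-shell prompt (as in the session above), you can issue the statement directly, e.g. for one of the tables listed in Hive:
[quickstart.cloudera:21000] > INVALIDATE METADATA sqoop_import.departments;
or, to reload the metadata for all tables at once (more expensive):
[quickstart.cloudera:21000] > INVALIDATE METADATA;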
What is an efficient way to export data from a Hive/Impala table with conditions into a file (the data would be huge, close to 10 GB)? The Hive table format is Parquet with Snappy compression, and the output file should be CSV.
The table is partitioned daily and the data needs to be extracted on a daily basis. I would like to know which of the following would be more efficient:
1) Impala approach
impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename '--output_delimiter=\001'
2) Hive approach
Insert overwrite directory '/path' select * from table where year_month_date=$$$$$$$$
Assuming tbl is your Hive Parquet table and condition is your filter condition:
CTAS command:
CREATE TABLE tbl_text ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/data' AS select * from tbl where condition;
You will find your CSV text file (delimited by ',') at /tmp/data in HDFS.
You can get this file to your local file system if needed using:
hadoop fs -get /tmp/data
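If you need a single local CSV file rather than a directory of part files, hadoop fs -getmerge can concatenate them into one local file (the local destination path is just an example):
hadoop fs -getmerge /tmp/data /tmp/tbl_export.csv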
Try using dynamic partitioning for your Hive/Impala table to export the data conditionally and efficiently.
Partition your table on the columns of interest, based on your queries, for best results.
Step 1: Create a Temporary Hive Table TmpTable and load your raw data into it
Step 2: Set hive parameters to support Dynamic partition
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
Step 3: Create your Main Hive Table with partition columns, example :
CREATE TABLE employee (
emp_id int,
emp_name string
)
PARTITIONED BY (location string)
STORED AS PARQUET;
Step 4: Load data from Temporary table to your employee table (Main Table)
insert overwrite table employee partition(location)
select emp_id,emp_name, location from TmpTable;
Step 5: export the data from hive with a condition
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee WHERE location='CALIFORNIA';
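Note that by default INSERT OVERWRITE DIRECTORY writes Hive's default field delimiter (Ctrl-A). If you need a comma-separated file, Hive 0.11 and later let you specify the row format on the export, for example:
INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM employee WHERE location='CALIFORNIA';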
For more details, refer to the Hive documentation on the dynamic partition concept.
Hope this is useful.