issues with sqoop and hive - hive

We are facing the following issues; details are below. Please share your inputs.
1) Issue with --validate option in sqoop
If we run the sqoop command directly, without creating a job for it, --validate works. But if we first create a job with the --validate option, validation doesn't seem to run.
Works with:
sqoop import --connect "DB connection" --username $USER --password-file $File_Path --warehouse-dir $TGT_DIR --as-textfile --fields-terminated-by '|' --lines-terminated-by '\n' --table emp_table -m 1 --outdir $HOME/javafiles --validate
Does not work with:
sqoop job --create Job_import_emp -- import --connect "DB connection" --username $USER --password-file $File_Path --warehouse-dir $TGT_DIR --as-textfile --fields-terminated-by '|' --lines-terminated-by '\n' --table emp_table -m 1 --outdir $HOME/javafiles --validate
2) Issue with Hive import
If we are importing data into Hive for the first time, the Hive (internal) table has to be created, so we keep "--create-hive-table" in the sqoop command.
Even though we keep the "--create-hive-table" option, is there any way to skip the create-table step during the Hive import if the table already exists?
Thanks
Sheik

Sqoop allows the --validate option only for the sqoop import and sqoop export commands.
From the official Sqoop User Guide, validation has these limitations:
--all-tables option
free-form query option
data imported into a Hive or HBase table
import with the --where argument
No, the table check cannot be skipped if the --create-hive-table option is set; the job will fail if the target table already exists.
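If the Hive table already exists, one common workaround is simply to drop --create-hive-table and rely on --hive-import alone, which appends to the existing table (or replaces its contents if --hive-overwrite is added). A minimal sketch, reusing the placeholders from the question:
# sketch: Hive import without the create-table step ("DB connection", emp_table,
# $USER and $File_Path are the placeholders used above)
sqoop import --connect "DB connection" --username $USER --password-file $File_Path \
--table emp_table --fields-terminated-by '|' --lines-terminated-by '\n' -m 1 \
--hive-import --hive-table emp_table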

Related

How to create single file while using sqoop import with multiple mappers

I want to import data from MySQL using sqoop import. My requirement is to use 4 mappers, but it should create only one file in the HDFS target directory. Is there any way to do this?
No, there is no option in Sqoop to re-partition the output into one file.
I don't think this should be Sqoop's job anyway.
You can do it easily using the getmerge feature of Hadoop. Example:
hadoop fs -getmerge /sqoop/target-dir/ /desired/local/output/file.txt
Here
/sqoop/target-dir is the target-dir of your sqoop command (directory containing all the part files).
/desired/local/output/file.txt is the combined single file.
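If the single merged file is then needed back on HDFS rather than on the local filesystem, it can be copied up again; the destination path below is just an assumption:
# copy the merged local file back up to HDFS
hadoop fs -put /desired/local/output/file.txt /sqoop/merged/file.txt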
You can use the below sqoop command. Suppose the database name is prateekDB and the table name is Emp:
sqoop import --connect "jdbc:mysql://localhost:3306/prateekDB" --username=root \
--password=data --table Emp --target-dir /SqoopImport --split-by empno
Add this option to the sqoop command:
--num-mappers 1
The sqoop log then shows:
Job Counters
Launched map tasks=1
Other local map tasks=1
and finally, one file is created on HDFS.
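Putting that together with the earlier example, a sketch of the single-mapper variant (same prateekDB/Emp placeholders as above; note that a single mapper also means the import itself runs with no parallelism):
# single mapper: no --split-by needed, and exactly one part file is written
sqoop import --connect "jdbc:mysql://localhost:3306/prateekDB" --username=root \
--password=data --table Emp --target-dir /SqoopImport --num-mappers 1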

Creating text table from Impala partitioned parquet table

I have a parquet table formatted as follows:
.impala_insert_staging
yearmonth=2013-04
yearmonth=2013-05
yearmonth=2013-06
...
yearmonth=2016-04
Underneath each of these directories are my parquet files. I need to get them into another table which just has a
.impala_insert_staging
file.
Please help.
The best approach I found is to pull the data down locally and then sqoop it back up into a text table.
To pull the parquet table down I performed the following:
impala-shell -i <ip-addr> -B -q "use default; select * from <table>" -o filename '--output_delimiter=\x1A'
Unfortunately this adds the yearmonth value as another column in my table. So I either go into my 750GB file and sed/awk out that last column, or use mysqlimport (since I'm using MySQL as well) to import only the columns I'm interested in.
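For the sed/awk route, something along these lines could drop that trailing yearmonth column (a sketch, assuming GNU awk and the \x1A output delimiter from the impala-shell command above; filename is the exported file):
# strip the last field (the partition column) from every line
awk -F $'\x1a' -v OFS=$'\x1a' '{NF--; print}' filename > filename.trimmed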
Finally I'll sqoop up the data to a new text table.
sqoop import --connect jdbc:mysql://<mysqlip> --table <mysql_table> -uroot -p<pass> --hive-import --hive-table <new_db_text>

Sqoop import into Hive Sequence table

I am trying to load a Hive table using the Sqoop import command, but when I run it, it says that Sqoop doesn't support SEQUENCE FILE FORMAT while loading into Hive.
Is this correct? I thought Sqoop had matured to support all the formats present in Hive. Can anyone guide me on this, and on the standard procedure, if any, for loading Hive tables that use SEQUENCE FILE FORMAT with Sqoop?
Currently, importing sequence files directly into Hive is not supported. But you can import data --as-sequencefile into HDFS and then create an external table on top of that. As you say you are getting exceptions even with this approach, please paste your sample code and logs so that I can help you.
Please find the code below:
sqoop import --connect jdbc:mysql://xxxxx/Emp_Details --username xxxx --password xxxx --table EMP --as-sequencefile --hive-import --target-dir /user/cloudera/emp_2 --hive-overwrite
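For reference, the two-step route suggested above might look roughly like this; the target directory and the column list in the external-table DDL are made up for illustration, and Hive may also need the Sqoop-generated record class on its classpath to deserialize the values:
# step 1: land the data on HDFS as a sequence file (no --hive-import here)
sqoop import --connect jdbc:mysql://xxxxx/Emp_Details --username xxxx --password xxxx \
--table EMP --as-sequencefile --target-dir /user/cloudera/emp_seq
# step 2: expose that directory to Hive as an external table (hypothetical columns)
hive -e "CREATE EXTERNAL TABLE emp_seq (empno INT, ename STRING) STORED AS SEQUENCEFILE LOCATION '/user/cloudera/emp_seq'"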

Difference between 2 commands in Sqoop

Please tell me what is the difference between the 2 commands below
sqoop import --connect jdbc:mysql://localhost:3306/db1 \
--username root --password password \
--table tableName --hive-table tableName --create-hive-table --hive-import;
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/db1 \
--username root --password password;
What is the difference between using --create-hive-table in the first command and just create-hive-table in the second?
Consider the two commands:
1) When --create-hive-table is used together with --hive-import, the contents of the RDBMS table are first copied to the location given by --target-dir (an HDFS location). Sqoop then checks whether the target Hive table (sqoophive.emp in the example below) exists.
If the table doesn't exist in Hive, the data is moved from the HDFS location into the Hive table and everything goes well.
If the table (sqoophive.emp) already exists in Hive, an error is thrown: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table emp already exists)
Example:
sqoop import \
--connect jdbc:mysql://jclient.ambari.org/sqoop \
--username hdfs -P \
--table employee \
--target-dir /user/hive/sqoop/employee \
--delete-target-dir \
--hive-import \
--hive-table sqoophive.emp \
--create-hive-table \
--fields-terminated-by ',' \
--num-mappers 3
2) When create-hive-table is used without --hive-import:
The schema of the sqoop.employee table (in the RDBMS) is fetched, and a table is created from it under the default database in Hive (default.employee). No data is transferred.
Example (a modified form of one given in the book Hadoop: The Definitive Guide by Tom White):
sqoop create-hive-table \
--connect jdbc:mysql://jclient.ambari.org/sqoop \
--username hdfs -P \
--table employee \
--fields-terminated-by ','
Now the question is when to use which. The former is used when the data is present only in the RDBMS and we need to not only create but also populate the Hive table in one go.
The latter is used when the table has to be created in Hive but not populated, or when the data already exists in HDFS and is to be used to populate the Hive table.
sqoop-import --connect jdbc:mysql://localhost:3306/db1 \
--username root --password password \
--table tableName --hive-table tableName --create-hive-table --hive-import;
The above command will import data from the DB into Hive with Hive's default settings, and if the table is not already present it will create a table in Hive with the same schema as in the DB.
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/db1 \
--username root --password password;
The create-hive-table tool creates a table in the Hive metastore, with a definition based on a database table previously imported to HDFS, or one planned to be imported (it will pick this up from the sqoop job). This effectively performs the "--hive-import" step of sqoop-import without running the preceding import.
For example, consider that you have imported table1 from db1 into HDFS using sqoop. If you then execute create-hive-table, it will create a table in the Hive metastore with the schema of table1 from db1. That will be useful for loading data into this table later whenever needed.
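Concretely, once create-hive-table has created the empty table, data that is already sitting in HDFS can be loaded into it later along these lines (the HDFS path is a placeholder):
# load previously imported files from HDFS into the Hive table created above
hive -e "LOAD DATA INPATH '/user/hdfs/table1' INTO TABLE table1"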

How to import easily RDBMS data into HIVE partition tables

I have tables in my RDBMS, and I have chosen the 3rd column of a table as the partition column for my Hive table.
Now, how can I easily import that RDBMS table's data into the Hive table, taking the partition column into account?
Sqoop's --hive-partition-key / --hive-partition-value approach works only for static partitions.
Refer to the sqoop script below for details:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/prac" \
--username root \
--password cloudera \
--hive-import \
--query "select id,name,ts from student where city='Mumbai' and \$CONDITIONS" \
--hive-table prac.student \
--hive-partition-key city \
--hive-partition-value 'Mumbai' \
--target-dir /user/mangesh/sqoop_import/student_temp5 \
--split-by id
Importing from an RDBMS into Hive can be achieved using Sqoop.
Here is the relevant info for importing into partitioned tables:
http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_importing_data_into_hive
You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Please see the Hive documentation for more details on partitioning.
For dynamic partitions you can use something like:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/prac" \
--username root \
--password cloudera \
--table <mysql -tablename> \
--hcatalog-database <hive-databasename> \
--hcatalog-table <hive-table name> \