MySQL to Hive with incremental column

I am trying to learn the basics of Sqoop and want to load data from a MySQL table into Hive. The MySQL table receives new data every 5 minutes. I found out how to create a Sqoop job that connects and runs the query, but I cannot understand how Sqoop knows the last value of the primary key column so that it extracts only the newer data each time.
1) For example, in the Sqoop command below, do I have to supply the last value, or can Sqoop work it out on its own?
2) Must the column used for --check-column and --last-value be the primary key column?
sqoop job --create <JOBS NAME> \
-- import \
--connect "jdbc:<PATH>" \
--username <USERNAME> \
--password <PASSWORD> \
--target-dir <DIR> \
--table <MYSQL TABLE> \
--hive-import \
--hive-table <HIVE TABLE> \
--fields-terminated-by , \
--escaped-by \\ \
--split-by <COLUMN TO BE SPLITED IN MAPPERS> \
--num-mappers 5 \
--incremental append \
--check-column <CHECK COLUMN> \
--last-value <LAST VALUE>
I found this in the documentation, but I still don't understand it.
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
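For what it is worth, the linked guide describes a saved job as the preferred mechanism for a recurring incremental import: the job records the last imported value and updates it after each successful run, so --last-value is only the starting point. In append mode the check column does not have to be the primary key; it only needs values that increase with every new row. A minimal sketch under those assumptions follows. Every name in it (orders_incr, dbhost, shop, orders, id, etl_user, the password file path) is hypothetical, and the sqoop_cmd wrapper is only a dry-run helper that prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the sqoop command by default,
# runs the real sqoop binary only when DRY_RUN=0 is set.
sqoop_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s ' sqoop "$@"
    printf '\n'
  else
    sqoop "$@"
  fi
}

# Create the saved job once. All names and paths are placeholders.
create_incremental_job() {
  sqoop_cmd job --create "$1" \
    -- import \
    --connect "jdbc:mysql://dbhost:3306/shop" \
    --username etl_user \
    --password-file /user/etl/mysql.password \
    --table orders \
    --hive-import \
    --hive-table orders \
    --num-mappers 1 \
    --incremental append \
    --check-column id \
    --last-value 0
}

create_incremental_job orders_incr

# Each scheduled run (for example from cron every 5 minutes) would then be:
sqoop_cmd job --exec orders_incr
```

After a real run, sqoop job --show orders_incr should list the stored parameters, which is one way to check that the last value has moved forward.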

Related

I get an error when running my Sqoop job (trying to copy a table from Oracle to Hive)

I am trying to copy a table from Oracle to Hadoop (Hive) with a Sqoop script (the table does not already exist in Hive). From PuTTY, I launch a script called "my_script.sh"; a code sample is below. However, when I run it, it echoes my code back, followed by a "no such file or directory" error. Can someone please tell me if I am missing something in my code?
Yes, my source and target directories are correct (I triple-checked).
Thank you
#!/bin/bash
sqoop import \
-Dmapred.map.child.java.opts='-Doracle.net.tns_admin=. -Doracle.net.wallet_location=.' \
-files $WALLET_LOCATION/cwallet.sso,$WALLET_LOCATION/ewallet.p12,$TNS_ADMIN/sqlnet.ora,$TNS_ADMIN/tnsnames.ora \
--connect jdbc:oracle:thin:/@MY_ORACLE_DATABASE \
--table orignal_schema.orignal_table \
--hive-drop-import-delims \
--hive-import \
--hive-table new_schema.new_table \
--num-mappers 1 \
--hive-overwrite \
--mapreduce-job-name my_sqoop_job \
--delete-target-dir \
--target-dir /hdfs://myserver/apps/hive/warehouse/new_schema.db \
--create-hive-table

How to import-all-tables from MySQL to Hive using Sqoop for a particular database in Hive?

sqoop import-all-tables into Hive with the default database works fine, but sqoop import-all-tables into a specified Hive database is not working.
As --hive-database is deprecated, how do I specify the database name?
sqoop import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password XXX \
--hive-import \
--create-hive-table
The above command creates the tables in /user/hive/warehouse/, i.e. the default directory.
How do I import all tables into /user/hive/warehouse/retail.db/?
You can set the HDFS path of your database using the --warehouse-dir option.
The following example worked for me:
sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username user \
--password password \
--warehouse-dir /apps/hive/warehouse/lina_test.db \
--autoreset-to-one-mapper

Import Data from PostgreSQL to Hive

I am facing issues while importing a table from PostgreSQL to Hive. The command I am using is:
sqoop import \
--connect jdbc:postgresql://IP:5432/PROD_DB \
--username ABC_Read \
--password ABC#123 \
--table vw_abc_cust_aua \
-- --schema ABC_VIEW \
--target-dir /tmp/hive/raw/test_trade \
--fields-terminated-by "\001" \
--hive-import \
--hive-table vw_abc_cust_aua \
--m 1
The error I am getting:
ERROR tool.ImportTool: Error during import: No primary key could be found for table vw_abc_cust_aua. Please specify one with --split-by or perform a sequential import with '-m 1'.
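Please let me know what is wrong with my command.
The -- --schema ABC_VIEW part is not a typo, but it is in the wrong place. Sqoop passes everything after a bare -- to the connector as extra arguments, so it has to come last. In your command, --target-dir, --hive-import and --m 1 all follow it, so Sqoop never sees the mapper setting, attempts a parallel import, and then complains about the missing primary key.
The other issue is that the documented options for the number of mappers are -m and --num-mappers, not --m.
Solution
In your script, move -- --schema ABC_VIEW to the end of the command and change --m to -m or --num-mappers.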
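Sqoop's PostgreSQL connector takes --schema as an extra argument after a bare --, and extra arguments belong at the very end of the command line. A sketch of the command from the question reordered that way, with -m 1 for a sequential import, is shown here. The values are the ones from the question; the sqoop_cmd wrapper is a hypothetical dry-run helper that prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the sqoop command by default,
# runs the real sqoop binary only when DRY_RUN=0 is set.
sqoop_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s ' sqoop "$@"
    printf '\n'
  else
    sqoop "$@"
  fi
}

import_view() {
  # Regular Sqoop options first; connector-specific extras after the bare "--".
  sqoop_cmd import \
    --connect "jdbc:postgresql://IP:5432/PROD_DB" \
    --username ABC_Read \
    --password 'ABC#123' \
    --table vw_abc_cust_aua \
    --target-dir /tmp/hive/raw/test_trade \
    --fields-terminated-by '\001' \
    --hive-import \
    --hive-table vw_abc_cust_aua \
    -m 1 \
    -- --schema ABC_VIEW
}

import_view
```

Because the source is a view without a primary key, a parallel import would additionally need --split-by on a suitable column.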

Hive table outdated after Sqoop incremental import

I'm trying to do a Sqoop incremental import into a Hive table using "--incremental append".
I did an initial Sqoop import and then created a job for the incremental imports.
Both executed successfully and new files were added to the same original Hive table directory in HDFS, but when I check my Hive table, the newly imported rows are not there. The Hive table is the same as it was before the Sqoop incremental import.
How can I solve that?
I have about 45 Hive tables and would like to update them automatically every day after the Sqoop incremental import.
First Sqoop Import:
sqoop import \
--connect jdbc:db2://... \
--username root \
--password 9999999 \
--class-name db2fcs_cust_atu \
--query "SELECT * FROM db2fcs.cust_atu WHERE \$CONDITIONS" \
--split-by PTC_NR \
--fetch-size 10000 \
--delete-target-dir \
--target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
--hive-import \
--hive-table fcs.cust_atu \
-m 64;
Then I run the Sqoop incremental import:
sqoop job \
--create cli_atu \
-- import \
--connect jdbc:db2://... \
--username root \
--password 9999999 \
--table db2fcs.cust_atu \
--target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
--hive-table fcs.cust_atu \
--split-by PTC_NR \
--incremental append \
--check-column TS_CUST \
--last-value '2018-09-09'
It is difficult to answer your question without seeing your full commands, because the outcome also depends on your choice of arguments and directories. Would you mind sharing them?
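One visible difference between the two commands above is that the saved job has no --hive-import, so it only appends files under --target-dir. Whether Hive sees those rows then depends on the table's actual location and field delimiter. One thing to try is adding the flag to the job, as in the sketch below. This is an assumption about the cause, not a confirmed fix. The elided JDBC URL is kept as written in the question and can be supplied through the hypothetical DB2_JDBC_URL variable; sqoop_cmd is a hypothetical dry-run helper that prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the sqoop command by default,
# runs the real sqoop binary only when DRY_RUN=0 is set.
sqoop_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s ' sqoop "$@"
    printf '\n'
  else
    sqoop "$@"
  fi
}

create_cli_atu_job() {
  # Same arguments as the job in the question, plus --hive-import so each
  # incremental run is loaded into the Hive table instead of only landing in HDFS.
  sqoop_cmd job --create cli_atu \
    -- import \
    --connect "${DB2_JDBC_URL:-jdbc:db2://...}" \
    --username root \
    --password 9999999 \
    --table db2fcs.cust_atu \
    --target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
    --hive-import \
    --hive-table fcs.cust_atu \
    --split-by PTC_NR \
    --incremental append \
    --check-column TS_CUST \
    --last-value '2018-09-09'
}

create_cli_atu_job
```

An existing job with the same name would have to be removed first with sqoop job --delete cli_atu, and the daily refresh would then be sqoop job --exec cli_atu.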

How to stop nulls from Sqoop import (Oracle to Hive)

I am getting null rows in Hive after a Sqoop import from Oracle to Hive.
In the Sqoop --query, I specified that the primary key is not null.
Sqoop command:
sqoop import \
--connect "${SQOOP_CONN_STR}" \
--connection-manager "${SQOOP_CONNECTION_MANAGER}" \
--username ${SQOOP_USER} \
--password ${SQOOP_PASSWORD} \
--fields-terminated-by ${SQOOP_DELIM} \
--null-string '' \
--null-non-string '' \
--query \""${SQOOP_QUERY}"\" \
--target-dir "${SQOOP_OP_DIR}" \
--split-by ${SQOOP_SPLIT_BY} \
-m ${SQOOP_NUM_OF_MAPPERS} 1> ${SQOOP_TEMP_LOG}
This is caused by a mismatch in the field delimiter.
You are importing into HDFS with whatever delimiter ${SQOOP_DELIM} holds; if no field delimiter is specified, Sqoop uses a comma by default.
The Hive table you created might use Ctrl-A (\001, the Hive default) as its field delimiter.
Keep the two in sync and it should work.
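A sketch of keeping the delimiters in sync on the Sqoop side is shown here, assuming the Hive table was created with Hive's default Ctrl-A field delimiter. The connection string, query, column names and target directory are hypothetical stand-ins for the variables in the question, and sqoop_cmd is a hypothetical dry-run helper that prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the sqoop command by default,
# runs the real sqoop binary only when DRY_RUN=0 is set.
sqoop_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s ' sqoop "$@"
    printf '\n'
  else
    sqoop "$@"
  fi
}

# Octal form of Ctrl-A, which Sqoop accepts for --fields-terminated-by.
SQOOP_DELIM='\001'

import_with_hive_delimiter() {
  # Placeholder connection, query and paths; only the delimiter matters here.
  sqoop_cmd import \
    --connect "${SQOOP_CONN_STR:-jdbc:oracle:thin:@//dbhost:1521/ORCL}" \
    --username "${SQOOP_USER:-etl_user}" \
    --password-file /user/etl/oracle.password \
    --fields-terminated-by "${SQOOP_DELIM}" \
    --query 'SELECT id, name FROM customers WHERE id IS NOT NULL AND $CONDITIONS' \
    --target-dir /tmp/sqoop/customers \
    --split-by id \
    -m 4
}

import_with_hive_delimiter
```

The alternative is to leave the import as it is and declare the matching delimiter on the Hive table instead (ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' for Sqoop's default comma).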