I am trying to learn the basics of Sqoop and want to load data from a MySQL table into Hive. The MySQL table receives new data every 5 minutes. I have found how to create a Sqoop job that connects and runs the import, but I cannot understand how Sqoop knows the last value of the primary key column so that it extracts only the newer data each time.
1) For example, in the Sqoop command below, do I have to supply the last value, or can Sqoop work it out on its own?
2) Must the column used for check-column and last-value be the primary key column?
sqoop job --create <JOBS NAME> \
--import \
--connect "jdbc:<PATH>" \
--username <USERNAME> \
--password <PASSWORD> \
--target-dir <DIR> \
--table <MYSQL TABLE> \
--hive-import \
--hive-table <HIVE TABLE> \
--fields-terminated-by , \
--escaped-by \\ \
--split-by <COLUMN TO BE SPLITED IN MAPPERS> \
--num-mappers -5 \
--incremental append \
--check-column \
--last-value
I found this in the documentation, but I still don't understand it.
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
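For what it's worth, a saved job is the usual way to avoid tracking the value by hand: according to the Sqoop user guide, a saved incremental job records the new last value in its metastore after each successful run, so --last-value is only needed as a starting point. The check column does not have to be the primary key; in append mode it just needs to increase for newly added rows. The sketch below is only an illustration under assumptions: the variable names (JOB_NAME, JDBC_URL, DB_USER, DB_PASSWORD, SRC_TABLE, TARGET_DIR, CHECK_COLUMN) are hypothetical, the starting value of 0 is a guess for a numeric id column, and SQOOP_BIN is an invented override that exists only so the commands can be dry-run.

```shell
#!/bin/sh
# Sketch only. All variable names are hypothetical placeholders, and
# SQOOP_BIN is an invented override (defaults to the real sqoop binary).

# Create the saved job once. --last-value is just the starting point;
# Sqoop is expected to update the stored value after each successful run.
create_incremental_job() {
  "${SQOOP_BIN:-sqoop}" job --create "${JOB_NAME}" -- import \
    --connect "${JDBC_URL}" \
    --username "${DB_USER}" \
    --password "${DB_PASSWORD}" \
    --table "${SRC_TABLE}" \
    --target-dir "${TARGET_DIR}" \
    --incremental append \
    --check-column "${CHECK_COLUMN}" \
    --last-value 0
}

# Run the saved job; this is the only command the 5-minute schedule needs.
run_incremental_job() {
  "${SQOOP_BIN:-sqoop}" job --exec "${JOB_NAME}"
}
```

After creating the job once, each scheduled run only needs run_incremental_job. Running sqoop job --show with the job name should list the stored parameters, including the last value recorded so far.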
Related
I am trying to copy a table from Oracle to Hadoop (Hive) with a Sqoop script (the table does not already exist in Hive). From PuTTY, I launch a script called "my_script.sh" (code sample below). However, when I run it, it prints my code back followed by a "No such file or directory" error. Can someone please tell me if I am missing something in my code?
Yes, my source and target directories are correct (I triple-checked).
Thank you
#!/bin/bash
sqoop import \
-Dmapred.map.child.java.opts='-Doracle.net.tns_admin=. -Doracle.net.wallet_location=.' \
-files $WALLET_LOCATION/cwallet.sso,$WALLET_LOCATION/ewallet.p12,$TNS_ADMIN/sqlnet.ora,$TNS_ADMIN/tnsnames.ora \
--connect jdbc:oracle:thin:/#MY_ORACLE_DATABASE \
--table orignal_schema.orignal_table \
--hive-drop-import-delims \
--hive-import \
--hive-table new_schema.new_table \
--num-mappers 1 \
--hive-overwrite \
--mapreduce-job-name my_sqoop_job \
--delete-target-dir \
--target-dir /hdfs://myserver/apps/hive/warehouse/new_schema.db \
--create-hive-table
sqoop import-all-tables into Hive works fine with the default database, but importing into a specified Hive database is not working.
As --hive-database is deprecated, how do I specify the database name?
sqoop import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password XXX \
--hive-import \
--create-hive-table
The above command creates the tables in /user/hive/warehouse/, i.e. the default directory.
How can I import all tables into /user/hive/warehouse/retail.db/?
You can set the HDFS path of your database using the --warehouse-dir option.
The following example worked for me:
sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username user \
--password password \
--warehouse-dir /apps/hive/warehouse/lina_test.db \
--autoreset-to-one-mapper
I am facing issues while importing a table from PostgreSQL into Hive. The command I am using is:
sqoop import \
--connect jdbc:postgresql://IP:5432/PROD_DB \
--username ABC_Read \
--password ABC#123 \
--table vw_abc_cust_aua \
-- --schema ABC_VIEW \
--target-dir /tmp/hive/raw/test_trade \
--fields-terminated-by "\001" \
--hive-import \
--hive-table vw_abc_cust_aua \
--m 1
The error I am getting:
ERROR tool.ImportTool: Error during import: No primary key could be found for table vw_abc_cust_aua. Please specify one with --split-by or perform a sequential import with '-m 1'.
Please let me know what is wrong with my command.
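The -- --schema ABC_VIEW part is not a typo: with the PostgreSQL connector, --schema is a connector-specific argument and has to follow a bare --. The catch is that Sqoop hands everything after that bare -- to the connector, so it must be the last thing on the command line. In your command it sits in the middle, so --target-dir, --hive-import and the mapper count never reach Sqoop's own parser, which would explain why it still asks for --split-by or '-m 1'.
The other issue is that the documented options for the number of mappers are -m or --num-mappers, not --m.
Solution
In your script, move -- --schema ABC_VIEW to the end of the command and change --m to -m or --num-mappers.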
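A sketch of the import with the mapper option spelled -m and the connector-specific --schema argument placed last, after a bare --, might look like the following. The host, database, credentials and paths are the placeholders from the question; SQOOP_BIN is an invented override added here only so the command can be dry-run, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only. SQOOP_BIN is an invented override (defaults to the real
# sqoop binary); connection details are the question's placeholders.
# Sqoop's own options come first; the bare "--" and the PostgreSQL
# connector argument --schema come last on the command line.
run_import() {
  "${SQOOP_BIN:-sqoop}" import \
    --connect "jdbc:postgresql://IP:5432/PROD_DB" \
    --username ABC_Read \
    --password 'ABC#123' \
    --table vw_abc_cust_aua \
    --target-dir /tmp/hive/raw/test_trade \
    --fields-terminated-by '\001' \
    --hive-import \
    --hive-table vw_abc_cust_aua \
    -m 1 \
    -- --schema ABC_VIEW
}
```

Calling run_import runs the import with a single mapper, which is what the error message suggests for a view without a primary key.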
I'm trying to do a Sqoop incremental import to a Hive table using "--incremental append".
I did an initial Sqoop import and then created a job for the incremental imports.
Both executed successfully and new files have been added to the original Hive table directory in HDFS, but when I check my Hive table, the newly imported rows are not there. The Hive table looks the same as it did before the incremental import.
How can I solve that?
I have about 45 Hive tables and would like to update them daily automatically after the Sqoop incremental import.
First Sqoop Import:
sqoop import \
--connect jdbc:db2://... \
--username root \
-password 9999999 \
--class-name db2fcs_cust_atu \
--query "SELECT * FROM db2fcs.cust_atu WHERE \$CONDITIONS" \
--split-by PTC_NR \
--fetch-size 10000 \
--delete-target-dir \
--target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
--hive-import \
--hive-table fcs.cust_atu \
-m 64;
Then I run Sqoop incremental import:
sqoop job \
-create cli_atu \
--import \
--connect jdbc:db2://... \
--username root \
--password 9999999 \
--table db2fcs.cust_atu \
--target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
--hive-table fcs.cust_atu \
--split-by PTC_NR \
--incremental append \
--check-column TS_CUST \
--last-value '2018-09-09'
It is difficult to answer your question without seeing your full query, because the outcome also depends on your choice of arguments and directories. Could you share it?
I am getting null rows in Hive after a Sqoop import from Oracle to Hive.
In the Sqoop --query, I specified a WHERE clause requiring that the pk is not null.
Sqoop command:
sqoop import \
--connect "${SQOOP_CONN_STR}" \
--connection-manager "${SQOOP_CONNECTION_MANAGER}" \
--username ${SQOOP_USER} \
--password ${SQOOP_PASSWORD} \
--fields-terminated-by ${SQOOP_DELIM} \
--null-string '' \
--null-non-string '' \
--query \""${SQOOP_QUERY}"\" \
--target-dir "${SQOOP_OP_DIR}" \
--split-by ${SQOOP_SPLIT_BY} \
-m ${SQOOP_NUM_OF_MAPPERS} 1> ${SQOOP_TEMP_LOG}
This is most likely due to a mismatch in the field delimiter.
The import writes the files in HDFS with whatever delimiter Sqoop is given (a comma by default if none is specified).
The Hive table you created might use Ctrl-A (\001, the Hive default) as its field delimiter.
Make the two match and it should work.
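One way to line the two up, assuming the Hive table was created with Hive's default delimiter, is to have Sqoop write \001 explicitly. The sketch below reuses the variable names from the question; SQOOP_BIN is an invented override added only so the command can be dry-run, and the function name is hypothetical. If the Hive table was instead declared with a different delimiter, the same idea applies with that character.

```shell
#!/bin/sh
# Sketch only. Variable names come from the question; SQOOP_BIN is an
# invented override (defaults to the real sqoop binary).
# The import writes fields separated by \001 (Ctrl-A) so the files match
# a Hive table that uses Hive's default field delimiter.
import_with_hive_default_delim() {
  "${SQOOP_BIN:-sqoop}" import \
    --connect "${SQOOP_CONN_STR}" \
    --username "${SQOOP_USER}" \
    --password "${SQOOP_PASSWORD}" \
    --query "${SQOOP_QUERY}" \
    --fields-terminated-by '\001' \
    --target-dir "${SQOOP_OP_DIR}" \
    --split-by "${SQOOP_SPLIT_BY}" \
    -m "${SQOOP_NUM_OF_MAPPERS}"
}
```

The query passed in SQOOP_QUERY still needs to contain the $CONDITIONS token, as with any Sqoop free-form query import.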