Import data from sqoop to hive

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba --password=cloudera --table export1 --hive-import \
--hive-table export_3 --create-hive-table --fields-terminated-by "|" \
--lines-terminated-by "\n" --null-string nvl --null-non-string -2 --outdir java_files
If I use the above command it gives an error:
either use --split-by or -m 1 for sequential import
When I added --split-by, the job ran, but it ignored the NULL values and imported the other values into Hive.
Can you explain the reason?
Thanks
Varun

The NULL value issues you are getting are not related to split-by.
Sqoop imports NULL values as the string null by default. Hive, however, uses the string \N to denote NULL values, so predicates dealing with NULL (like IS NULL) will not work correctly. You should append the parameters --null-string and --null-non-string for an import job, or --input-null-string and --input-null-non-string for an export job, if you wish to properly preserve NULL values. Because Sqoop uses those parameters in generated code, you need to properly escape the value \N to \\N:
$ sqoop import ... --null-string '\\N' --null-non-string '\\N'
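Applied to the import from the question, it would look roughly like this (a sketch only; <primary_key_column> is a placeholder for whatever key column the table has, since --split-by needs one unless you fall back to -m 1):
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba --password=cloudera --table export1 \
--split-by <primary_key_column> \
--hive-import --create-hive-table --hive-table export_3 \
--fields-terminated-by "|" --lines-terminated-by "\n" \
--null-string '\\N' --null-non-string '\\N' --outdir java_files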

Related

sqoop import staging table issue

I am trying to import data from Teradata into an HDFS location.
I only have view access for that database, so I created a staging table in another database. But when I try to run the code it throws this error:
Error: Running Sqoop version: 1.4.6.2.6.5.0-292
18/12/23 21:49:41 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/12/23 21:49:41 ERROR tool.BaseSqoopTool: Error parsing arguments for import:
staging-table, t_hit_data_01_staging, –clear-staging-table, --query, select * from table1 where cast(date1 as Date) <= date '2017-09-02' and $CONDITIONS, --target-dir, <>, --split-by, date1, -m, 25
I have given the staging table details in the command and run it, but it throws an error (the import arguments fail to parse and the staging-table arguments are reported as unrecognized).
sqoop import \
--connect jdbc:teradata://<server_link>/Database=db01 \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username <UN> \
--password <PWD> \
--staging-table db02.table1_staging --clear-staging-table \
--query "select * from table1 where cast(date1 as Date) <= date '2017-09-02' and \$CONDITIONS " \
--target-dir '<hdfs location>' \
--split-by date1 -m 25
The data should be loaded into the HDFS location using the staging table in the other Teradata database. Then, after changing the WHERE clause, Sqoop should create another file under the same folder in the HDFS location, for example part-0000, then part-0001, and so on.
I don't think there is a staging option available for the import command; --staging-table and --clear-staging-table are export arguments.
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
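A sketch of the same import with the staging arguments removed (connection details and the query are copied from the question; whether TeradataConnManager accepts this exact combination is an assumption):
sqoop import \
--connect jdbc:teradata://<server_link>/Database=db01 \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username <UN> \
--password <PWD> \
--query "select * from table1 where cast(date1 as Date) <= date '2017-09-02' and \$CONDITIONS" \
--target-dir '<hdfs location>' \
--split-by date1 -m 25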

SQOOP --query with SCHEMA in SQL Server

I'm trying to use the --query option in Sqoop to import data from SQL Server. My question is: how can I declare which schema to use with --query in SQL Server?
My script:
sqoop \
--options-file sqoop/aw_mssql.cfg \
--query "select BusinessEntityId, LoginID, cast(OrganizationNode as string) from Employee where \$CONDITIONS" \
--hive-table employees \
--hive-database mssql \
-- --schema=HumanResources
This still produces an error:
Invalid object name 'Employee'
I also tried
--connect "jdbc:sqlserver://192.168.1.17;database=AdventureWorks;schema=HumanResources"
but that also failed.
You can try the code below:
sqoop import \
--connect "jdbc:sqlserver://192.168.1.17;database=AdventureWorks" \
--username "Your User" \
--password "Your Password" \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--verbose \
--query "select BusinessEntityId, LoginID, cast(OrganizationNode as string) from HumanResources.Employee where \$CONDITIONS" \
--split-by "EmpID" \
--where " EmpID='Employee ID' " \
-m 1 \
--target-dir /user/cloudera/ingest/raw/Employee \
--fields-terminated-by "," \
--hive-import \
--create-hive-table \
--hive-table mssql.employees
--hive-import – Import the table into Hive (uses Hive's default delimiters if none are set).
--create-hive-table – Creates a new Hive table. Note: the job will fail if a Hive table with that name already exists. It works in this case.
--hive-table – Specifies <db_name>.<table_name>.
The sqoop command you are using is missing a few things. First of all, you need to specify that this is a sqoop import job. Apart from that, your command needs a connection string. I don't know what arguments you are passing inside the options file, so it would have been easier if you had posted those details, and I am not sure about the -- --schema=HumanResources part, as I haven't seen it before. A correct working sqoop import looks like this:
sqoop import --connect <connection string> --username <username> --password <password> --query <query> --target-dir <target_dir> --hive-import --hive-table <table_name> -m <no_of_mappers>
Also keep in mind that when using --query you must not specify --table as well, otherwise it will throw an error.
--schema can work in conjunction with --table, but not with --query. Think about what that would mean: Sqoop would have to parse the text of the query and replace every unqualified table reference with a two-part name, while leaving references that are already two-part, three-part or four-part names alone, and it would have to match the syntax rules of the back end (SQL Server in this case) exactly. It's just not feasible.
Specify the schema explicitly in the query:
select BusinessEntityId, LoginID, cast(OrganizationNode as string)
from HumanResources.Employee
where ...
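Putting that together, a full import might look roughly like this (a sketch only; the credentials and target directory are placeholders, and whether your options file already supplies some of these arguments is an assumption):
sqoop import \
--connect "jdbc:sqlserver://192.168.1.17;database=AdventureWorks" \
--username <username> --password <password> \
--query "select BusinessEntityId, LoginID, cast(OrganizationNode as string) from HumanResources.Employee where \$CONDITIONS" \
--target-dir <hdfs_target_dir> \
--hive-import --create-hive-table --hive-table mssql.employees \
-m 1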

Issue with Date while sqooping data from Oracle

I am trying to sqoop import a Date column from an Oracle database into Avro format.
Below is a sample of how I am executing it. My options file content:
--target-dir
/.. /
--delete-target-dir
--as-avrodatafile
--query
Select chg_ts from abc
--map-column-java chg_ts=String
My sqoop import command is:
sqoop import -D oraoop.timestamp.string=false --options-file $1 --options-file $2 --fetch-size=0 -m 1 --mapreduce-job-name $job_name"_"$instance
After doing the above steps I am still getting the date as "type" : [ "null", "long" ] in the Avro file, and it ends up as BIGINT in Hive.
Please guide me if I am missing something here.

sqoop-export is failing when I have \N as data

I am getting the below error when I run my sqoop export command.
This is the content to be exported by the sqoop command:
00001|Content|1|Content-article|\N|2015-02-1815:16:04|2015-02-1815:16:04|1 |\N|\N|\N|\N|\N|\N|\N|\N|\N
00002|Content|1|Content-article|\N|2015-02-1815:16:04|2015-02-1815:16:04|1 |\N|\N|\N|\N|\N|\N|\N|\N|\N
The sqoop command:
sqoop export --connect jdbc:postgresql://10.11.12.13:1234/db --table table1 --username user1 --password pass1 --export-dir /hivetables/table/ --fields-terminated-by '|' --lines-terminated-by '\n' -- --schema schema
15/06/09 08:05:16 INFO mapreduce.Job: Task Id : attempt_1431442954745_1210_m_000001_0, Status : FAILED
Error: java.io.IOException: Can't export data, please check failed map task logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.RuntimeException: Can't parse input data: '\N'
at duser.__loadFromFields(duser.java:690)
at duser.parse(duser.java:558)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:202)
at duser.__loadFromFields(duser.java:627)
Can you help me resolve it?
Try adding these arguments to the export statement
--input-null-string "\\\\N" --input-null-non-string "\\\\N"
From the documentation:
If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns.
If you don't add those arguments, it won't be able to understand that the \N in your data is actually null.
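Applied to the command from the question, that gives roughly the following (a sketch; connection details are copied as-is from the question):
sqoop export --connect jdbc:postgresql://10.11.12.13:1234/db --table table1 \
--username user1 --password pass1 --export-dir /hivetables/table/ \
--fields-terminated-by '|' --lines-terminated-by '\n' \
--input-null-string "\\\\N" --input-null-non-string "\\\\N" \
-- --schema schema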
The problem seems to be the order in which the columns are being exported. Sqoop doesn't automatically understand the column mapping. Try using the --columns argument to specify the order in which the columns appear. Here's how to use it:
sqoop export --connect jdbc:postgresql://10.11.12.13:5432/reports ... --columns col1,col2,col3,...
See http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_4 for documentation on how to use --columns.

Sqoop Export with Missing Data

I am trying to use Sqoop to export data from HDFS into Postgresql. However, I receive an error partway through the export saying that it can't parse the input. I manually went into the file I was exporting and saw that this row had two columns missing. I have tried a bunch of different arguments with the Sqoop command, but cannot get it to work. Here is what I was running thus far:
sqoop export --connect jdbc:postgresql://localhost:5432/XX --username XX --password XX \
--table XX --input-fields-terminated-by "\t" --input-lines-terminated-by "\n" \
--input-null-string '\n' --input-null-non-string '\n' -m 1 --export-dir /user/dan/output
I have also tried it without the "--input-null-string" and "--input-null-non-string" args and got the same result. My table has 6 columns and the file I am reading has tab separated values that are inserted into the table if all 6 are there. Any help would be appreciated.
I solved the problem by changing my reduce function so that, when a row did not have the correct number of fields, it output a specific placeholder value; I was then able to pass that value to --input-null-non-string and the export worked.
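For illustration, a sketch of that approach, assuming the placeholder value written by the reducer is \N (the actual value used is not given above):
sqoop export --connect jdbc:postgresql://localhost:5432/XX --username XX --password XX \
--table XX --input-fields-terminated-by "\t" --input-lines-terminated-by "\n" \
--input-null-string '\\N' --input-null-non-string '\\N' \
-m 1 --export-dir /user/dan/output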