I am importing data from Teradata (RDBMS) to Hive using Apache Sqoop. The usual import delimiters, such as ",", "|", and "~", are present in the table data. Is there a way to use multiple characters as a delimiter in Apache Sqoop?
To work around this, I used the --escaped-by "\t" and --fields-terminated-by "," parameters in the sqoop import command. So, is there a way to 'unescape' the "\t" I used in the Sqoop import?
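On the multi-character delimiter part: Sqoop itself only accepts single-character delimiters, but on the Hive side a multi-character field delimiter can be read with MultiDelimitSerDe. A minimal sketch (the table and columns are made up for illustration, and the SerDe class lives in hive-contrib on older Hive releases, so the package name may differ on your version):
hive -e "CREATE TABLE my_import (col1 STRING, col2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='|~')
STORED AS TEXTFILE"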
I use the '\b' delimiter whenever I get challenging tables with large text fields that might contain TAB and CR/LF characters. '\b' is BACKSPACE, which is very difficult to insert into a character field in most databases.
Here is an example of the sqoop command I use:
sqoop import
--connect "jdbc:sqlserver://myserver;DatabaseName=MyDB;user=MyUser;password=MyPassword;port=1433"
--warehouse-dir=/user/MyUser/Import/MyDB
--fields-terminated-by '\b' --num-mappers 8
--table training_deficiency
--hive-table stage.training_deficiency
--hive-import --hive-overwrite
--hive-delims-replacement '<newline>'
--split-by Training_Deficiency_ID
--outdir /home/MyUser/sqoop/java
--where "batch_update_dt > '2016-12-09 23:06:44.69'"
I am trying to use --null-string while pulling data from SQL Server.
But for NULL values in SQL Server, I get '\\N' in HDFS when importing, and the Hive table is then also populated with '\\N'.
This is what I use:
sqoop import --connect ABCD --username A \
--password B#123 \
--escaped-by \\ \
--null-string '\\N' \
--target-dir /user/anu/dear
But in HDFS I am getting:
Anu,'\\N',Boledion,12638
Expected Results :
Anu,\N,Boledion,12638
NB: I tried using '\N', but Sqoop throws a Java error saying '\N' cannot be processed / is a wrong escape. I even tried four backslashes ('\\\\N'), as mentioned in a Cloudera Q&A.
Any help in getting this solved is highly appreciated.
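One interplay worth checking (a sketch, not a verified fix): with --escaped-by \\ active, Sqoop escapes backslashes in the output, including the backslash of the null token itself, which would turn \N into \\N. Dropping --escaped-by (or choosing an escape character other than the backslash) may produce the expected \N; --null-non-string is added here for non-string columns:
sqoop import --connect ABCD --username A \
  --password B#123 \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --target-dir /user/anu/dear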
I am using bcp to export and import data, for example:
bcp "exec db.dbo.procedure" queryout "C:\Users\I\Desktop\ps.dat" -c -r "*" -t; -S -U -P
bcp db.dbo.table in C:\Users\I\Desktop\ps.dat -e "C:\Users\I\Desktop\ps_error.dat" -c -r "*" -t; -m1000000 -S -U -P
If I execute these statements without -r, bcp uses the default CRLF as the end-of-row marker, and the import later fails with a right-truncation error.
After many attempts I found that CRLF is read as two bytes of data, which does not fit the table format. With the statements above it works perfectly.
Why is this happening? Is this a bcp bug, or is it the expected behaviour?
According to MS, this is the expected behaviour:
https://learn.microsoft.com/en-us/sql/tools/bcp-utility
This article explains all the parameters; for this case, these are the ones we are interested in:
-c
Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
-r row_term
Specifies the row terminator. The default is \n (newline character). Use this parameter to override the default row terminator. For more information, see Specify Field and Row Terminators (SQL Server).
So it seems that by removing -r, which sets the row terminator to \n (LF), -c takes over and sets the row terminator to \r\n (CRLF).
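If you want LF rather than CRLF while keeping -c, the terminator can also be pinned explicitly with -r. A sketch using the same file and elided server/credential switches as in the question:
bcp "exec db.dbo.procedure" queryout "C:\Users\I\Desktop\ps.dat" -c -r "\n" -t; -S -U -P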
1st command:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table departments \
--hive-home /user/hive/warehouse \
--hive-import \
--hive-overwrite \
--hive-table sqoop_import.departments \
--outdir java_files
2nd command:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table departments \
--target-dir=/user/hive/warehouse/department_test \
--append
In both commands we create the table in Hive without specifying field and line delimiters and import using Sqoop. Why, then, do we get NULLs in the second case but not in the first?
Hive's default delimiters:
Field: Ctrl+A (\001)
Line: \n
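A quick way to confirm which delimiter actually landed in HDFS (the path here is illustrative) is to dump the first line as octal; Hive's Ctrl+A shows up as 001:
hadoop fs -cat /user/hive/warehouse/departments/part-m-00000 | head -n 1 | od -c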
Case 1 : HIVE IMPORT
Imports tables into Hive (using Hive's default delimiters if none are set).
It also creates the table named in --hive-table (if it does not exist) with Hive's default delimiters.
Case 2 : HDFS IMPORT
In this case, data from the RDBMS is stored with , (comma) as the field delimiter and \n as the line delimiter (Sqoop's defaults), which are not Hive's default delimiters. That is why you are getting NULL entries in your data.
You can solve it in two ways:
Change your Hive table's field delimiter, or
use --fields-terminated-by in your import command (see the sketch below).
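As a sketch of the second approach, here is the question's second command with Sqoop told to emit Hive's Ctrl+A field delimiter (everything else unchanged):
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --table departments \
  --fields-terminated-by '\001' \
  --target-dir=/user/hive/warehouse/department_test \
  --append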
The dataset I am importing contains string columns with "," in them.
When I try to import, the string values get split into separate fields.
Here is my sqoop script:
sqoop import --connect 'jdbc:sqlserver://XXX.XX.XX.XX:51260;database=Common' -username=BIG_DATA -P --table Carriers --hive-import --hive-table common.Carriers --hive-drop-import-delims --optionally-enclosed-by '\"' --map-column-hive UpdatedDate=string,ResourceID=string --lines-terminated-by '\n' -- --schema Truck -m 10
The sqoop command works fine for integer-type columns, but it splits string columns because they contain "," (comma) inside the string. Is there a way to escape the "," when parsing strings that contain it?
Adding --fields-terminated-by '^' to the sqoop import solved a similar problem of mine.
This should work:
$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...
Below is my sqoop command in a shell script:
sqoop import --connect 'jdbc:sqlserver://190.148.155.91:1433;username=****;password=****;database=Testdb' --query 'Select DimFreqCellRelationID,OSSC_RC, MeContext, ENodeBFunction,EUtranCellFDD,EUtranFreqRelation, EUtranCellRelation FROM dbo.DimCellRelation WHERE DimFreqCellRelationID > $maxval and $CONDITIONS' --split-by OSS --target-dir /testval;
Before executing this command, I assigned a value to $maxval; when I execute the sqoop command, the value should be substituted in place of $maxval, but that is not happening. Is it possible to pass a parameter through Sqoop like this? Can you suggest how to achieve this?
I believe the problem you are seeing is incorrect quoting. Single quotes (') prevent bash from performing any substitutions; you need double quotes (") if you want to use variables inside the parameter. However, you also have to be careful not to substitute the $CONDITIONS placeholder. Try it without Sqoop:
jarcec@odie ~ % echo '$maxval and $CONDITIONS'
$maxval and $CONDITIONS
jarcec@odie ~ % echo "$maxval and $CONDITIONS"
 and
jarcec@odie ~ % export maxval=30
jarcec@odie ~ % echo "$maxval and $CONDITIONS"
30 and
jarcec@odie ~ % echo "$maxval and \$CONDITIONS"
30 and $CONDITIONS
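Applied to the original command, that means double quotes around the query, with the placeholder escaped so Sqoop still receives the literal $CONDITIONS (a sketch; connection details as in the question):
sqoop import --connect 'jdbc:sqlserver://190.148.155.91:1433;username=****;password=****;database=Testdb' \
  --query "Select DimFreqCellRelationID, OSSC_RC, MeContext, ENodeBFunction, EUtranCellFDD, EUtranFreqRelation, EUtranCellRelation FROM dbo.DimCellRelation WHERE DimFreqCellRelationID > $maxval and \$CONDITIONS" \
  --split-by OSS --target-dir /testval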