Sqoop Export with Missing Data - sql

I am trying to use Sqoop to export data from HDFS into Postgresql. However, I receive an error partially through the export that it can't parse the input. I manually went into the file I was exporting and saw that this row had two columns missing. I have tried a bunch of different arguments with the Sqoop command, but cannot get it to work. Here is what I was running thus far:
sqoop export --connect jdbc:postgresql://localhost:5432/XX -username
XX -password XX --table XX --input-fields-terminated-by
"\t" --input-lines-terminated-by "\n" --input-null-string '\n' --input-null
non-string '\n' -m 1 --export-dir /user/dan/output
I have also tried it without the "--input-null-string" and "--input-null-non-string" args and got the same result. My table has 6 columns and the file I am reading has tab separated values that are inserted into the table if all 6 are there. Any help would be appreciated.

I solved the problem by changing my reduce function so that if there were not the correct amount of fields to output a certain value and then I was able to use the --input-null-non-string with that value and it worked.

Related

Creating a Format File for Bulk Import

I am trying to create a Format File to bulk import a .csv file but i, am getting an error.
Query I used
"BCP -SMSSQLSERVER01.[Internal_Checks].[Jan_Flat] format out -fC:\Desktop\exported data\Jan_FlatFormat.fmt -c -T -Uasda -SMSSQLSERVER01 -PPASSWORD"
I am getting an error
"A valid table name is required for in, out, or format options."
This is the error. can anyone suggest what need to do.
According to the bcp Utility documentation the first parameter should be a [Database.]Schema.{Table | View | "query"}, so don't put -SMSSQLSERVER01 where you've got it. Also use format nul instead of format out.
Try using:
bcp.exe [Internal_Checks].[Jan_Flat] format nul "-fC:\Desktop\exported data\Jan_FlatFormat.fmt" -c -SMSSQLSERVER01 -T -Uasda -PPASSWORD
Note the quotes " around the -f switch because your path name contains space characters.
Also note that the -c switch causes single-byte characters (ASCII/OEM/codepage with SQLCHAR) to be written out. If your table contains nchar, nvarchar or ntext columns you should consider using the -w switch instead so as to write out UTF-16 encoded data (using SQLNCHAR).

How to clean bad data from huge csv file

So I have huge csv file (assume 5 GB) and I want to insert the data to the table but it return error that the length of the data is not the same
I found that some data has more columns than I want
For example the correct data I have has 8 columns but some data has 9 (it can be human/system error)
I want to take only 8 columns data, but because the data is so huge, I can not do it manually or using parsing in python
Any recommendation of a way to do it?
I am using linux, so any linux command also welcome
In sql I am using COPY ... FROM ... CSV HEADER; command to import the csv into table
You can use awk for this purpose. Assuming you field delimiter is comma (,) this code can do the work:
awk -F\, 'NF==8 {print}' input_file >output_file
A fast and dirty php solution as single command line:
php -r '$f=fopen("a.csv","rb"); $g=fopen("b.csv","wb"); while ( $r=fgetcsv($f) ) { $r = array_slice($r,0,8); fputcsv($g,$r); }'
It reads file a.csv and writes b.csv.

BCP update database table base on output from powershell

I have 4 files with the same csv header as following
Column1,Column2,Column3,Column4
But I only required data from Column2,Column3,Column4 for import the data into SQL database using BCP . I am using the PowerShell to select the columns that I want and import the required data using BCP but my powershell executed with no error and there are not data updated in my database table. May I know how to set the BCP to import the output from Powershell to database table. Here are my powershell script
$filePath = Get-ChildItem -Path 'D:\test\*' -Include $filename
$desiredColumn = 'Column2','Column3','Column4'
foreach($file in $filePath)
{
write-host $file
$test = import-csv $file | select $desiredColumn
write-host $test
$action = bcp <myDatabaseTableName> in $test -T -c -t";" -r"\n" -F2 -S <MyDatabase>
}
These are the output from the powershell script
D:\test\sample1.csv
#{column2=111;column3=222;column4=333} #{column2=444;column3=555;column4=666}
D:\test\sample2.csv
#{column2=777;column3=888;column4=999} #{column2=aaa;column3=bbb;column4=ccc}
First off, you can't update a table with bcp. It is used to bulk load data. That is, it will either insert new rows or export existing data into a flat file. Changing existing rows, usually called as updating, is out of scope for bcp. If that's what you need, you need to use another a tool. Sqlcmd works fine, and Powershell's got Invoke-Sqlcmd for running arbitary TSQL statements.
Anyway, the BCP utility has notoriously tricky syntax. As far as I know, one cannot bulk load data by passing the data as parameter to bcp, a source file must be used. Thus you need to save the filtered file and pass its name to bcp.
Exporting a filtered CSV is easy enough, just remember to use -NoTypeInformation switch, lest you'll get #TYPE Selected.System.Management.Automation.PSCustomObject as your first row of data. Assuming the bcp arguments are well and good (why -F2 though? And Unix newlines?).
Stripping double quotes requires another an edit to the file. Scrpting Guy has a solution.
foreach($file in $filePath){
write-host $file
$test = import-csv $file | select $desiredColumn
# Overwrite filtereddata.csv, should one exist, with filtered data
$test | export-csv -path .\filtereddata.csv -NoTypeInformation
# Remove doulbe quotes
(gc filtereddata.csv) | % {$_ -replace '"', ''} | out-file filtereddata.csv -Fo -En ascii
$action = bcp <myDatabaseTableName> in filtereddata.csv -T -c -t";" -r"\n" -F2 -S <MyDatabase>
}
Depending on your locale, column separator might be semicolon, colon or something else. Use -Delimiter '<character>' switch to pass whatever you need or change bcp's argument.
Erland's got a helpful page about bulk operations. Also, see Redgate's advice.
Without need to modify the file first, there is an answer here about how bcp can handle quoted data.
BCP in with quoted fields in source file
Essentially, you need to use the -f option and create/use a format file to tell SQL your custom field delimiter (in short, it is no longer a lone comma (,) but it is now (",")... comma with two double quotes. Need to escape the dblquotes and a small trick to handle the first doulbe quote on a line. But it works like a charm.
Also, need the format file to ignore column(s)... just set the destination column number to zero. All with no need to modify the file before load. Good luck!

Removing unnecessary header in Hive table export to csv

How do I remove the text below from an export of a Hive table to local as a csv.
This is the unnecessary line of text that I am getting right at the top, the headers for the columns are displaced to the second row of the csv:
4/4/2018 19:19 284 WARN [main] conf.HiveConf (HiveConf.java:initialize(3081)) - HiveConf of name hive.custom-extensions.root does not exist
This is the code I used to produce the csv.
hive -e 'set hive.cli.print.header=true;
select * from database1.my_table' | sed 's/[\t]/,/g' > /s/myusername/my_table.csv
You can deter the warning messages from being printed to console, and send them to DRFA (Daily Rolling File Appender):
export HADOOP_ROOT_LOGGER="WARN,DRFA"

How can I delete a specific line (e.g. line 102,206,973) from a 30gb csv file?

What method can I use to delete a specific line from a csv/txt file that is too big too load into memory and edit manually?
Background
My question is actually an indirect solution to a problem related with importing csv into sql databases.
I have a series of 10-30gb csv files I want to import and populate an sqlite table from within R (Since they are too large to import as data frames as a whole into R). I am using the 'RSQlite' package for this.
A couple fail because of an error related to one of the lines being badly formatted. The populating process is then cancelled. R returns the line number which caused the process to fail.
The error given is:
./csvfilename line 102206973 expected 9 columns of data but found 3)
So I know exactly the line which causes the error.
I see 2 potential 'indirect' solutions which I was hoping someone could help me with.
(i) Deleting the line causing the error in 20+gb files. e.g. line 102,206,973 in the example above.
I am not concerned with 'losing' the data in line 102,206,973 by just skipping or deleting it. However I have tried and failed to somehow access the csv file and to remove the line.
(ii) Using sqlite directly (or anything else?) to import an csv which does allow you to skip lines or an error.
Although not likely to be related directly to the solution, here is the R code used.
db <- dbConnect(SQLite(), dbname=name_of_table)
dbWriteTable(conn = db, name ="currentdata", value = csvfilename, row.names = FALSE, header = TRUE)
Thanks!
To delete a specific line you can use sed:
sed -e '102206973d' your_file
If you want the replacement to be done in-place, do
sed -i.bak -e '102206973d' your_file
This will create a backup names your_file.bak and your_file will have the specified line removed.
Example
$ cat a
1
2
3
4
5
$ sed -i.bak -e '3d' a
$ cat a
1
2
4
5
$ cat a.bak
1
2
3
4
5