Import CSV File Error : Column Value containing column delimiter - sql

I am trying to Import a Csv File into SQL SERVER using SSIS
Here's an example how data looks like
Student_Name,Student_DOB,Student_ID,Student_Notes,Student_Gender,Student_Mother_Name
Joseph Jade,2005-01-01,1,Good listener,Male,Amy
Amy Jade,2006-01-01,1,Good in science,Female,Amy
....
Csv Columns are not containing text qualifiers (quotations)
I Created a simple package using SSIS to import it into SQL but sometime the data in SQL looked like below
Student_Name Student_DOB Student_ID Student_Notes Student_Gender Student_Mother_Name
Ali Jade 2004-01-01 1 Good listener Bad in science Male,Lisa
The Reason was that somtimes [Student_Notes] column contains Comma (,) that is used as column delimiter so the Row are not imported Correctly
Any suggestions

A word of warning: I'm not a regular C# coder.
But anyway this code does the following:
It opens a file called C:\Input.TXT
It searches each line. If the line has more than 5 commas, it takes all the extra commas out of the third last field (notes)
It writes the result to C:\Output.TXT - that's the one you need to actually import
There are many improvements that could be made:
Get file paths from connection managers
Error handling
An experienced C# programmer could probably do this in hlaf the code
Keep in mind your package will need write access to the appropriate folder
public void Main()
{
// Search the file and remove extra commas from the third last field
// Extended from code at
// http://stackoverflow.com/questions/1915632/open-a-file-and-replace-strings-in-c-sharp
// Nick McDermaid
string sInputLine;
string sOutputLine;
string sDelimiter = ",";
String[] sData;
int iIndex;
// open the file for read
using (System.IO.FileStream inputStream = File.OpenRead("C:\\Input.txt"))
{
using (StreamReader inputReader = new StreamReader(inputStream))
{
// open the output file
using (StreamWriter outputWriter = File.AppendText("C:\\Output.txt"))
{
// Read each line
while (null != (sInputLine = inputReader.ReadLine()))
{
// Grab each field out
sData = sInputLine.Split(sDelimiter[0]);
if (sData.Length <= 6)
{
// 6 or less fields - just echo it out
sOutputLine = sInputLine;
}
else
{
// line has more than 6 pieces
// We assume all of the extra commas are in the notes field
// Put the first three fields together
sOutputLine =
sData[0] + sDelimiter +
sData[1] + sDelimiter +
sData[2] + sDelimiter;
// Put the middle notes fields together, excluding the delimiter
for (iIndex=3; iIndex <= sData.Length - 3; iIndex++)
{
sOutputLine = sOutputLine + sData[iIndex] + " ";
}
// Tack on the last two fields
sOutputLine = sOutputLine +
sDelimiter + sData[sData.Length - 2] +
sDelimiter + sData[sData.Length - 1];
}
// We've evaulted the correct line now write it out
outputWriter.WriteLine(sOutputLine);
}
}
}
}
Dts.TaskResult = (int)Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success;
}

In The Flat File Connection Manager. Make the File as only one column (DT_STR 8000)
Just add a script Component in the dataflowtask and Add Output Columns (Same as Example Shown)
in The script component split each row using the following Code:
\\Student_Name,Student_DOB,Student_ID,Student_Notes,Student_Gender,Student_Mother_Name
Dim strCells() as string = Row.Column0.Split(CChar(","))
Row.StudentName = strCells(0)
Row.StudentDOB = strCells(1)
Row.StudentID = strCells(2)
Row.StudentMother = strCells(strCells.Length - 1)
Row.StudentGender = strCells(strCells.Length - 2)
Dim strNotes as String = String.Empty
For int I = 3 To strCells.Length - 3
strNotes &= strCells(I)
Next
Row.StudentNotes = strNotes
it worked fine for me

If import CSV file is not a routine
Import CSV file in Excel
Search error rows with Excel rows filter and rewrite them
Save Excel file in TXT Tab delimited
Import TXT file with SSIS
Else make a script that search comma in the Student Notes column range

Related

How to deal with pipe as data in pipe delimited file

I am trying to solve a problem where my csv data looks like below:
A|B|C
"Jon"|"PR | RP"|"MN"
"Pam | Map"|"Ecom"|"unity"
"What"|"is"this" happening"|"?"
That is, it is pipe delimited and has quotes as text qualifier but it also has pipe and quotes with in the data values. I have already tried
Update based on the comments
I tried to select | as delimiter and " as Text Qualifier but when trying to import data to OLEDB Destination i receive the following error:
couldn't find column delimiter for column B
You have to change the Column Delimiter property to | (vertical bar) and the Text Qualifier property to " within the Flat File Connection Manager
If these is still not working then you have some bad rows in the Flat File Source which you must handle using the Error Output:
Flat File source Error Output connection in SSIS
SQL SERVER – SSIS Component Error Outputs
I actually ended up writing a c sharp script to remove the initial and last quote and set the column delimiter to quote pipe quote ("|") in SSIS. Code was as below:
public void Main()
{
String folderSource = "path";
String folderTarget = "path";
foreach (string file in System.IO.Directory.GetFiles(folderSource))
{
String targetfilepath = folderTarget + System.IO.Path.GetFileName(file);
System.IO.File.Delete(targetfilepath);
int icount = 1;
foreach (String row in System.IO.File.ReadAllLines(file))
{
if (icount == 1)
{
System.IO.File.AppendAllText(targetfilepath, row.Replace("|", "\"|\""));
}
else
{
System.IO.File.AppendAllText(targetfilepath, row.Substring(1, row.Length - 2));
}
icount = icount + 1;
System.IO.File.AppendAllText(targetfilepath, Environment.NewLine);
}
}
Dts.TaskResult = (int)ScriptResults.Success;
}

How to get NPOI Excel RichStringCellValue?

I am using DotNetCore.NPOI (1.2.1) in order to read an MS Excel file.
Some of the cells are of type text and contain formatted strings (e.g. some words in bold).
How do I get the formatted cell value? My final goal: Retrieve the cell text as HTML.
I tried
var cell = row.GetCell(1);
var richStringCellValue = cell.RichStringCellValue;
But this won't let me access the formatted string (just the plain string without formattings).
Does anybody have an idea or solution?
I think you'll have to take longer route in this case. First you'll have to maintain the formatting of cell value like date, currency etc and then extract the style from cell value and embed the cell value under that style.
best option is to write extenstion method to get format and style value.
To get the fomat Please see this link How to get the value of cell containing a date and keep the original formatting using NPOI
For styling first you'll have to check and find the exact style of running text and then return the value inside the html tag , below method will give you idea to extract styling from cell value. Code is untested , you may have to include missing library.
public void GetStyleOfCellValue()
{
XSSFWorkbook wb = new XSSFWorkbook("YourFile.xlsx");
ISheet sheet = wb.GetSheetAt(0);
ICell cell = sheet.GetRow(0).GetCell(0);
XSSFRichTextString richText = (XSSFRichTextString)cell.RichStringCellValue;
int formattingRuns = cell.RichStringCellValue.NumFormattingRuns;
for (int i = 0; i < formattingRuns; i++)
{
int startIdx = richText.GetIndexOfFormattingRun(i);
int length = richText.GetLengthOfFormattingRun(i);
Console.WriteLine("Text: " + richText.String.Substring(startIdx, startIdx + length));
if (i == 0)
{
short fontIndex = cell.CellStyle.FontIndex;
IFont font = wb.GetFontAt(fontIndex);
Console.WriteLine("Bold: " + (font.IsBold)); // return string <b>my string</b>.
Console.WriteLine("Italics: " + font.IsItalic + "\n"); // return string <i>my string</i>.
Console.WriteLine("UnderLine: " + font.Underline + "\n"); // return string <u>my string</u>.
}
else
{
IFont fontFormat = richText.GetFontOfFormattingRun(i);
Console.WriteLine("Bold: " + (fontFormat.IsBold)); // return string <b>my string</b>.
Console.WriteLine("Italics: " + fontFormat.IsItalic + "\n");// return string <i>my string</i>.
}
}
}
Font formatting in XLSX files are stored according to schema http://schemas.openxmlformats.org/spreadsheetml/2006/main which has no direct relationship to HTML tags. Therefore your task is not that much straight forward.
style = cell.getCellStyle();
font = style.getFont(); // or style.getFont(workBook);
// use Font object to query font attributes. E.g. font.IsItalic
Then you will have to build the HTML by appending relevant HTML tags.

SSIS import a Flat File to SQL with the first row as header and last row as a total

I receive Text File that I have to Import to a SQL Table, I have to come with a SSIS because I will received the Flat File every Day , with the First Row as the Customer_ID, then come the invoice details and then the Total of the invoice.
Example :
30303
0000109291700080190432737000005Name of the product
0000210291700080190432737000010Name of the product
0000309291700080190432737000000Name of the product
003 000145
So let me Explain:
First 30303 is the Customer #
Other Rows Invoice Details
00001-> ROWID 092917-> DATE 000801904327->PROD 370->Trans 00010 -> AMOUNT
Name of the product
Last Row
003==>Total rows 000145==>Total of Invoice
Any Clue ?
I would use a Script Component as a source in a Data Flow Task. You can then use C# or VB.net to read the file, e.g., by using System.IO.StreamReader, in any way you wish. You can read a line at a time, store values in variables to write to every row (e.g., the customer number), etc. It's extremely flexible for complex files.
Here is an example script (C#) based on your data:
public override void CreateNewOutputRows()
{
System.IO.StreamReader reader = null;
try
{
bool line1Read = false;
int customerNumber = 0;
reader = new System.IO.StreamReader(Variables.FilePath); // this refers to a package variable that contains the file path
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (!line1Read)
{
customerNumber = Convert.ToInt32(line);
line1Read = true;
}
else if (!reader.EndOfStream)
{
Output0Buffer.AddRow();
Output0Buffer.CustomerNumber = customerNumber;
Output0Buffer.RowID = Convert.ToInt32(line.Substring(0, 5));
Output0Buffer.Date = DateTime.ParseExact(line.Substring(5, 6), "MMddyy", System.Globalization.CultureInfo.CurrentCulture);
Output0Buffer.Prod = line.Substring(11, 12);
Output0Buffer.Trans = Convert.ToInt32(line.Substring(23, 3));
Output0Buffer.Amount = Convert.ToInt32(line.Substring(26, 5));
Output0Buffer.ProductName = line.Substring(31);
}
}
}
catch
{
if (reader != null)
{
reader.Close();
reader.Dispose();
}
throw;
}
}
The columns in 'Output 0' of the Script Component are configured as follows:
Name DataType Length
==== ======== ======
CustomerNumber four-byte signed integer [DT_I4]
RowID four-byte signed integer [DT_I4]
Date database date [DT_DBDATE]
Prod string [DT_STR] 12
Trans four-byte signed integer [DT_I4]
Amount four-byte signed integer [DT_I4]
ProductName string [DT_STR] 255
To implement this:
Create a string variable called 'FilePath' with your file path in it for the script to reference.
Create a Data Flow Task.
Add a Script Component to the Data Flow Task - you'll be asked what type it should be, select 'Source'.
Right-click the Script Component, click 'Edit'.
On the 'Script' pane, add the 'FilePath' variable to the 'ReadOnlyVariables' section.
On the 'Inputs and Outputs' pane, expand 'Output 0' and add columns to the 'Output Columns' section as per the above table.
On the 'Script' pane, click 'Edit Script', and then paste my code over the public override void CreateNewOutputRows() method (replacing it).
Your Script Component source is now configured, and you'll be able to use it like any other data source component. To write this data to a SQL Server table, add an OLEDB Destination to the Data Flow Task, and link the Script Component to that, configuring the columns appropriately.

How can I load in a pipe (|) delimited text file that has columns that sometimes contain line breaks?

I have built an SSIS package that loads in several delimited text files into a SQL database. One of the files often contains line spaces in it, which breaks the standard data flow task of setting a flat file source and mapping to an ado.net destination since it thinks it is on a new line when it reaches a line break. The vendor sending over the files does not want to sent the file without any edits and can't do XML at this time. Is there any way to fix this? I was thinking of writing a small vb.net program that would correct the files so they would work in the SSIS package, but not sure how to write that logic. The file has 5 columns, the first 2 are big integer and always contain some long integer ID, then there is a small text column that just contains one short word, then a date, and then a long comments field that is causing the problem. The comments field is sometimes blank (which is ok), the problem are the rows that have line breaks. I never know how many line breaks are in the comments, some have none, some can have several, even multiple line breaks in a row, so was wondering if this is even possible.
5787626|6547599|Approved|1/10/2017|Applicant request for fee waiver approved
5443221|7742812|Active|11/5/2013|
3430962|7643957|Re-Scheduled|5/25/2016|REVISED TERMS AND CONDITIONS REJECTED
Applicant has 30 DAYS To submit paperwork for extension.
34433624|7673715|Denied|1/24/2017|
34113575|7653748|Active|1/8/2014|New terms have been granted.
Sample File Format.
As long as there is logic that you can program/predict, it will be possible.
I would do it using a Script Component as a source, which means you don't need to rewrite the file before processing it. It also provides a lot of flexibility, e.g., you can store values in variables while iterating over multiple lines in the file, etc.
I posted another answer recently that gives a lot of detail on how to go about this: SSIS import a Flat File to SQL with the first row as header and last row as a total.
An example of holding the values in variables until the row is ready to be written:-
For this example I am writing three columns, ID1, ID2 and Comments. The file looks like this:
1|2|Comment1
Comment2
4|5|Comment3
Comment4
Comment5
6|7|Comment6
The Script Component contains the following method.
public override void CreateNewOutputRows()
{
System.IO.StreamReader reader = null;
try
{
bool readFirstLine = false;
int id1 = 0;
int id2 = 0;
string comments = null;
reader = new System.IO.StreamReader(Variables.FilePath); // this refers to a package variable that contains the file path
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (line.Contains("|"))
{
if (readFirstLine)
{
Output0Buffer.AddRow();
Output0Buffer.ID1 = id1;
Output0Buffer.ID2 = id2;
Output0Buffer.Comments = comments;
}
else
{
readFirstLine = true;
}
string[] fields = line.Split('|');
id1 = Convert.ToInt32(fields[0]);
id2 = Convert.ToInt32(fields[1]);
comments = fields[2];
}
else
{
comments += " " + line;
}
if (reader.EndOfStream)
{
Output0Buffer.AddRow();
Output0Buffer.ID1 = id1;
Output0Buffer.ID2 = id2;
Output0Buffer.Comments = comments;
}
}
}
catch
{
if (reader != null)
{
reader.Close();
reader.Dispose();
}
throw;
}
}
The result set is:
ID1 ID2 Comments
=== === ========
1 2 Comment1 Comment2
4 5 Comment3 Comment4 Comment5
6 7 Comment6

How to bulk insert from CSV when some fields have new line character?

I have a CSV dump from another DB that looks like this (id, name, notes):
1001,John Smith,15 Main Street
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"
1003,Bill Karr,2820 West Ave.
The last field may contain carriage returns and commas, in which case it is surrounded by double quotes. And I need to preserve those returns and commas.
I use this code to import CSV into my table:
BULK INSERT CSVTest
FROM 'c:\csvfile.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
SQL Server 2005 bulk insert cannot figure out that carriage returns inside quotes are not row terminators.
How to overcome?
UPDATE:
Looks like the only way to keep line breaks inside a field is to use different row separator. So, I want to mark all row separating line breaks by putting a pipe in front of them. How can I change my CSV to look like this?
1001,John Smith,15 Main Street|
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"|
1003,Bill Karr,2820 West Ave.|
Bulk operations on SQL Server do not specifically support CSV even though they can import them if the files are carefully formatted. My suggestion would be to enclose all field values in quotes. BULK INSERT might then allow the carriage returns within a field value. If it does not, then your next solution might be an Integration Services package.
See Preparing Data for Bulk Export or Import for more.
you can massage these line breaks into one line with a script, eg you can use GNU sed to remove line breaks. eg
$ more file
1001,John Smith,15 Main Street
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"
1003,Bill Karr,"2820
West Ave"
$ sed '/"/!s/$/|/;/.*\".*[^"]$/{ :a;N };/"$/ { s/$/|/ }' file
1001,John Smith,15 Main Street|
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"|
1003,Bill Karr,"2820
West Ave"|
then you can bulk insert.
Edit:
Save this :/"/!s/$/|/;/.*\".*[^"]$/{ :a;N };/"$/ { s/$/|/ } in a file , say myformat.sed. then do this on the command line
c:\test> sed.exe -f myformat.sed myfile
According to the source of all knowledge (Wikipedia), csv uses new lines to separate records. So what you have is not valid csv.
My suggestion is that you write a perl program to process your file and add each record to the db.
If you're not a perl person, then you could use a programming site or see if some kind SO person will write the parsing section of the program for you.
Added:
Possible Solution
Since the OP states that he can change the input file, I'd change all the new lines that do not follow a " to be a reserved char sequence, eg XXX
This can be an automated replacement in many editors. In Windows, UltraEdit includes regexp find/replace functionality
Then import into the dbms since you'll no longer have the embedded new lines.
Then use SQL Replace to change the XXX occurrences back into new lines.
If you have control over the contents of the CSV file, you could replace the in-field line breaks (CRLF) with a non-linebreak character (perhaps just CR or LF), then run a script after the import to replace them with CRLF again.
This is how MS Office products (Excel, Access) deal with this problem.
OK, here's a small Java program that I end up writing to solve the problem.
Comments, corrections and optimizations are welcome.
import java.io.*;
public class PreBulkInsert
{
public static void main(String[] args)
{
if (args.length < 3)
{
System.out.println ("Usage:");
System.out.println (" java PreBulkInsert input_file output_file separator_character");
System.exit(0);
}
try
{
boolean firstQuoteFound = false;
int fromIndex;
int lineCounter = 0;
String str;
BufferedReader in = new BufferedReader(new FileReader(args[0]));
BufferedWriter out = new BufferedWriter(new FileWriter(args[1]));
String newRowSeparator = args[2];
while ((str = in.readLine()) != null)
{
fromIndex = -1;
do
{
fromIndex = str.indexOf('"', fromIndex + 1);
if (fromIndex > -1)
firstQuoteFound = !firstQuoteFound;
} while (fromIndex > -1);
if (!firstQuoteFound)
out.write(str + newRowSeparator + "\r\n");
else
out.write(str + "\r\n");
lineCounter++;
}
out.close();
in.close();
System.out.println("Done! Total of " + lineCounter + " lines were processed.");
}
catch (IOException e)
{
System.out.println(e.getMessage());
System.exit(1);
}
}
}
You cannot import this unless the CSV is in valid format. So, you have to either fix the dump or manually using search & replace fix the unwanted new line characters.