Detecting partial rows in SSIS - error-handling

Situation: A tab-delimited row from a flat file source is missing columns at the end of the row. The rows are delimited by {CR}{LF} in the Flat File Connection Manager, the last column is delimited by {CR}{LF} as well, and all other columns are delimited by Tab {t}. SSIS imports the row anyway.
Example:
Column_1{t} Column_2{t} Column_3{t} Column_4{CR}{LF}
123{t} 123{t} 123{t} 123{CR}{LF}
123{t} 123{CR}{LF}
123{t} 123{t} 123{t} 123{CR}{LF}
123{t} 123{t} {t} {CR}{LF}
123{t} 123{t} 123{t} 123{CR}{LF}
Problem: A partial row whose remaining columns are not tab delimited (see row 2 above) causes SSIS to treat the following row as part of the current row, while a row whose tab-delimited columns are present but blank (see row 4 above) parses normally.
Desired Output: An error that signals a partial row.
What is the best method to check for partial rows in the middle of a file?

It appears that pre-Denali (pre-2012) SSIS fails to detect the missing columns when parsing. This is fixed in 2012 by always checking for the row delimiter.
See: http://blogs.msdn.com/b/mattm/archive/2011/07/17/flat-file-source-changes-in-denali.aspx
Workarounds for this issue in pre-2012 SSIS include writing your own parser (this is what we chose to do), converting the data before parsing it, or using the Flat File Source just to parse rows.
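If you go the custom-parser route, a minimal pre-validation sketch in Python (the file name and expected column count here are assumptions) is to count the columns on each physical line and fail on the first short row:

    import csv

    EXPECTED_COLUMNS = 4  # assumed; match your file layout

    with open("input.txt", newline="") as f:
        for line_number, row in enumerate(csv.reader(f, delimiter="\t"), start=1):
            if len(row) != EXPECTED_COLUMNS:
                raise ValueError(
                    f"Partial row at line {line_number}: "
                    f"expected {EXPECTED_COLUMNS} columns, got {len(row)}"
                )

Running a check like this before the Flat File Source means a partial row fails loudly instead of being silently merged into the row that follows it.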

Related

Split text file with multiple record types in SSIS

I have a text file that I need to import into a SQL database and split into columns. The file layout is as follows:
AL12345... (Header row)
12...
30...
70...
EL.XXXX (Trailer row, which also contains the number of records in the block)
AL23456... (Header row)
12...
30...
70...
EL.XXXX (Trailer row, which also contains the number of records in the block)
AL34567... (Header row)
12...
30...
EL.XXXX (Trailer row, which also contains the number of records in the block)
The number of blocks (from header to trailer) is one or more. When there is only one block I don't have problems importing and manipulating the data. The problem arises when there is more than one block.
What should I do? Split the file if it contains more than one block and then import each file separately? If yes, how would I split the file?
Thank you!
I would read the file using a script task and ignore rows that begin with AL or EL. If the first column is numeric, you should be good.
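A minimal sketch of that filter (in Python for illustration; inside SSIS the script task would typically be C# or VB.NET, and the file names here are made up):

    # Keep only the detail rows; drop block headers (AL...) and trailers (EL...).
    with open("blocks.txt") as src, open("details.txt", "w") as dst:
        for line in src:
            if line.startswith(("AL", "EL")):
                continue  # header or trailer row
            dst.write(line)

The surviving rows (those starting with a numeric record type such as 12, 30, or 70) can then be imported and split into columns as one uniform set, regardless of how many blocks the file contains.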

Python is taking multiple columns as index while importing a CSV file

I am trying to import a CSV file; the first column is dates, the second is empty, and the rest are product data. When I import the file, it takes my first three columns as the index. I just need the dates as the index.
When you load the CSV file specifying the column names, all the columns that you didn't pass in the names argument become part of the index. So you can either:
1) Pass the names of the second and third index columns in the names argument as well, which will leave only the dates as the index.
2) Prd_data.reset_index(level=[1,2],inplace=True)
This will drop the second and third index levels, turn them into columns, and keep only the date as the index.
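A sketch of both options (pandas; the file name, its assumed five columns, and the column names are made up for illustration):

    import pandas as pd

    # With fewer names than the file has columns, the surplus leading
    # columns become the index: here the first three (a 3-level MultiIndex).
    prd_data = pd.read_csv("data.csv", names=["prod_a", "prod_b"])

    # Option 2: push index levels 1 and 2 back out as columns, keeping
    # only the dates (level 0) as the index.
    prd_data.reset_index(level=[1, 2], inplace=True)

    # Option 1 (alternative): name the 2nd and 3rd columns up front, so
    # only the dates are left over to become the index.
    prd_data = pd.read_csv("data.csv", names=["blank", "extra", "prod_a", "prod_b"])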

How to combine two rows into one row with respective column values

I have a CSV file which contains new lines within a single row, i.e. one row's data comes in two lines, and I want to insert the second line's data into the respective columns. I've loaded the data into SQL, but now I want to fold the second row's data into the first row under the respective columns.
I wouldn't recommend fixing this in SQL because it is an issue with the CSV file: the file contains embedded new lines, which cause rows to split.
I strongly encourage fixing the CSV file if possible. It's going to be difficult to fix in SQL, given there are going to be more cases like this.
If you're doing the import with SSIS (or if you have the option of doing it with SSIS), the package can be configured to handle embedded carriage returns.
Define your file import connection manager with the columns you're expecting.
In the connection manager's Properties window, set the AlwaysCheckForRowDelimiters property to False. The default value is True.
By setting the property to False, SSIS will ignore mid-row carriage return/line feeds and will parse your data into the required number of columns.
Credit to Martin Smith for helping me out when I had a very similar problem some time ago.
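If fixing the file itself is an option, one rough pre-processing approach (a Python sketch; the delimiter, column count, and file names are assumptions) is to keep appending physical lines to a buffer until the logical row has all of its columns:

    EXPECTED_COLUMNS = 5  # assumed; use your table's column count

    with open("broken.csv") as src, open("fixed.csv", "w") as dst:
        buffer = ""
        for line in src:
            # Join this physical line onto the pending logical row,
            # turning the stray newline into a space.
            buffer = (buffer + " " if buffer else "") + line.rstrip("\r\n")
            # Naive completeness check; values containing quoted commas
            # would need real CSV parsing instead.
            if buffer.count(",") >= EXPECTED_COLUMNS - 1:
                dst.write(buffer + "\n")
                buffer = ""
        if buffer:  # flush any trailing partial row
            dst.write(buffer + "\n")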

How to import data into a table with 14 columns via BCP if my data file has fewer than 13 delimiters?

Sorry for the extremely awkward wording in that question. I'll explain.
I have a table with 14 columns into which I'm trying to import data via BCP. My data comes from a text file that is TAB delimited. Logically, there should be 13 delimiters for the 14 cells of data in a row. My data is inconsistent and omits delimiters when the values at the end are null, which means some rows of data have only 10 delimiters. This causes my data to "wrap around" when it is imported: the first cell of data in my text file is being put in the 10th column of the row prior to it, when it should be the first cell of its own new row.
The thing is, every single row in the text file ends with "CRLF", which is what BCP uses by default.
Is there a way to tell BCP to fill in all 14 columns before moving on to the next row? Or will I have to reformat my data file every time I import (not ideal)?
Here is my BCP command:
bcp testdb.dbo.MACARP in C:\Users\sysbrady\Desktop\MyData.txt /c /T /t "\t" /E -S WSTVDISTD023\SQLEXPRESS
"Is there a way to tell BCP to fill in all 14 columns before moving on to the next row?"
When you say "fill in", do you mean you want BCP to keep the null values present in your text file? The -k qualifier tells BCP to keep the nulls (make sure the column in your table allows nulls). See link below:
http://msdn.microsoft.com/en-us/library/ms187887.aspx
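For example, your original command with -k added would look something like this (a sketch, untested against your data):

    bcp testdb.dbo.MACARP in C:\Users\sysbrady\Desktop\MyData.txt /c /T /t "\t" /E -k -S WSTVDISTD023\SQLEXPRESS

Note, though, that -k only affects how empty fields are stored; it does not, by itself, make BCP tolerate rows that are missing delimiters entirely.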
"The thing is every single row in the text file ends in with "CRLF" which is used by default in BCP."
This is unclear - could you post an image? I'm unsure whether you have phrased this as a problem, or as a feature you want to retain.

Use google-refine on csv without headers and with various number of columns per record

I'm attempting to import into OpenRefine a CSV extracted from a NoSQL database (Cassandra), without headers and with a different number of columns per record.
For instance, fields are comma separated and could look like this:
1 - userid:100456, type:specific, status:read, feedback:valid
2 - userid:100456, status:notread, message:"some random stuff here but with quotation marks", language:french
There's a maximum number of columns, and no cleansing is required on their names.
How do I build a big Excel file I could mine using a pivot table?
If you can get JSON instead, Refine will ingest it directly.
If that's not a possibility, I'd probably do something along the lines of:
import as lines of text
split into two columns containing row ID and fields
split multi-valued cells on the fields column using comma as a separator
split the fields column into two columns using colon as a separator
use key/value on these two columns to unfold into columns
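If you'd rather script the same reshaping instead of doing it in Refine, a rough pandas sketch (the file and output names are assumptions, and the field split is naive) that builds one column per key:

    import pandas as pd

    rows = []
    with open("export.csv") as f:
        for line in f:
            record = {}
            # Naive split: a quoted value containing ", " would need
            # real CSV parsing instead.
            for field in line.strip().split(", "):
                key, _, value = field.partition(":")
                record[key.strip()] = value.strip('"')
            rows.append(record)

    # Records with missing keys simply get NaN, so a varying number
    # of columns per record is fine.
    df = pd.DataFrame(rows)
    df.to_excel("output.xlsx", index=False)  # requires openpyxl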