I am working with an external vendor who sends my company a CSV file that I need to import into various tables in our database. Our other vendors send similar files, and they always enclose every field in quotation marks (in case there are commas in the field). As such, I make the field terminator "," for those files, and they bulk insert just fine, since every field includes this terminator.
The problem I'm running into is that the new vendor is unable to enclose every field in quotation marks. They follow RFC 4180, which puts quotation marks around fields that contain commas but omits them when there are none. This leads to inconsistent field terminators when attempting to bulk insert, and I don't know of a way around it. If I make the field terminator a comma, it will split fields that contain commas; but I likewise can't make the field terminator ",", since it won't appear around every field.
Any advice is welcome. I am trying to get the vendor to send a consistent format, but, in case they can't, I'm also looking for a workaround.
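If the vendor can't change the format, one workaround is to normalize the file before the bulk insert. Here is a minimal sketch in Python (file names are hypothetical): the csv module parses RFC 4180 natively, so you can rewrite the file with every field quoted and keep your existing "," terminator. Separately, if you're on SQL Server 2017 or later, BULK INSERT's FORMAT = 'CSV' option is worth checking, as it understands RFC 4180 quoting directly.

import csv

# Hypothetical file names; the vendor file is assumed to be plain RFC 4180.
with open("vendor.csv", newline="", encoding="utf-8") as src, \
     open("vendor_quoted.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)  # handles quoted commas per RFC 4180
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)  # re-emit with every field quoted
    writer.writerows(reader)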
In my project I saw two Hive tables, and in their create table statements one table has ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0004' and the other has ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u001C'. I want to know what these '\u0004' and '\u001C' mean and when to use them. Kindly answer.
In many text formats, \u introduces a Unicode escape sequence. This is a way of storing or sending a character that can't be easily displayed or represented in the format you're using. The four characters after the \u are the Unicode "code point" in hexadecimal. A Unicode code point is a number denoting a specific Unicode character.
All characters have a code point, even the printable ones. For example, a is U+0061.
U+0004 and U+001C are both unprintable characters, meaning there's no standard character you can use to display them on the screen. That's why an escape sequence is used here.
If you use a simple, printable character like , as your field delimiter, it will make the stored data easier for a human to read. The field values will be stored with a , between each one. For example, you might see the values one, two and three stored as:
one,two,three
But if you expect your field values to actually contain a ,, it would be a poor choice of field delimiter (because then you'd need a special way to tell the difference between a single field with the value one,two and two different fields with the values one and two). The choice of delimiter depends both on whether you want to be able to read the data easily and on what characters you expect the fields to contain.
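As a quick illustration of why a control character like U+0004 can be a safer delimiter (the values below are made up), a short Python sketch:

fields = ["one,two", "three"]  # a value that itself contains a comma
line = "\u0004".join(fields)   # joined with the unprintable U+0004 delimiter
print(line.split("\u0004"))    # ['one,two', 'three'] -- the comma survives intact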
What is the purpose of adding a text qualifier to an SSIS flat text file output?
I'm pulling data out of a SQL database that has quotes, commas, pipes, and many other common delimiters in the data.
Extreme example of a data point in a column:
"Johnson"|Smith,Jones
I set up the export as comma delimited, with a double quote (") text qualifier. I assumed it would export the data like so, and it did:
,""Johnson"|Smith,Jones",
Now I'm testing re-importing the data, again as comma delimited with a double quote text qualifier. I got errors saying SSIS couldn't find the delimiter. I thought it would recognize the combination of comma and double quote as, essentially, a more complex delimiter.
If adding a text qualifier to the output doesn't help with the problem of those characters appearing in the actual data, what does it do?
Assuming the person receiving the data might use a tool like Excel to process it, which doesn't seem to be able to handle a complex multi-character delimiter like |", is the best way to handle this to remove the most common delimiter from my data and use that as the delimiter? Probably pipe in my case, instead of comma.
A text qualifier is used in the event that delimiters are contained within a cell of the row. Typically, the text qualifier is a double quote. If a cell contains a delimiter and a text qualifier is not used, the data after the delimiter will spill into the next column. From there, the data row can blow up, and none of the columns will line up afterwards. It can be a real mess.
Additionally, you will not see the text qualifier in applications, like Excel. However, if you open the file in Notepad++, then you will see the text qualifiers. There can be a lot of data (e.g., text qualifiers, new line characters, column delimiters, etc.) that is contained within a file but is not displayed in certain applications. This data typically is used to define the structure of the data as opposed to being the actual data.
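To make that spillover concrete, here is a small Python sketch (the address data is made up): a naive split on the delimiter breaks the qualified field apart, while a qualifier-aware parser keeps it whole.

import csv, io

line = 'Smith,"100 Main St, Apt 4",NY'
print(line.split(","))                      # ['Smith', '"100 Main St', ' Apt 4"', 'NY']
print(next(csv.reader(io.StringIO(line))))  # ['Smith', '100 Main St, Apt 4', 'NY']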
For your problem, you will need to remove the double quotes from the source data or use a different text qualifier. You could use a single quote, but what if you have data like Jones's? The idea here is that the text qualifier should be unique in defining the data structure, which, as I understand it, means that you cannot have a text qualifier that is actually a part of the data (see note from Microsoft below - emphasis mine).
Per Microsoft:
Specify a text qualifier character. Each column can be configured to recognize a text qualifier.
The use of a qualifier character to embed a qualifier character into a qualified string is supported by the Flat File Connection Manager. The double instance of a text qualifier is interpreted as a literal, single instance of that string. For example, if the text qualifier is a single quote and the input data is 'abc', 'def', 'g''hi', the output data is abc, def, g'hi. However, an instance of a qualifier embedded in a qualified string causes the Flat File Source to fail with the error DTS_E_PRIMEOUTPUTFAILED.
References
Flat File Connection Manager official documentation
I am populating data from our server into Google BigQuery. One of the attributes in the table is a string that runs to 150+ characters.
For example, "Had reseller test devices in a vehicle with known working device
Set to power cycle, never got green light Checked with cell provider and all SIMs were active all cases the modem appears to be dead,light in all but not green light".
The table in GBQ gets populated until it hits this specific attribute. When this attribute loads, it does not land in a single cell; it gets split across different cells and corrupts the table.
Is there any restriction on individual fields in GBQ? Any information regarding this would be appreciated.
My guess is that quote and comma characters in the CSV data are confusing the CSV parser. For example, if one of your fields is hello, world, this will look like two separate fields. The way around this is to quote the field, so you'd need "hello, world". This, of course, has problems if you have embedded quotes in the field. For instance, if you wanted a field that said She said, "Hello, world", you would need either to escape the quotes by doubling the internal ones, as in "She said, ""Hello, world""", or to use a different field separator (for instance, |) and drop the quote separator (using \0).
One final complication is if you have embedded newlines in your field. If you have Hello\nworld, you need to set the allow_quoted_newlines option on the load job configuration. The downside is that large files will be slower to import with this option, since they can't be processed in parallel.
These configuration options are all described here, and can be used via either the web UI or the bq command line shell.
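For example, via the BigQuery Python client library (the dataset, table, and bucket names below are hypothetical), the load might be configured like this:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    allow_quoted_newlines=True,  # accept newlines inside quoted fields
    # Or drop quoting entirely and use a delimiter that never appears in the data:
    # quote_character="",
    # field_delimiter="|",
)
job = client.load_table_from_uri(
    "gs://my-bucket/export.csv",
    "my_dataset.my_table",
    job_config=job_config,
)
job.result()  # block until the load finishes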
I'm not sure there is a limit imposed, and certainly I have seen string fields with over 8,000 characters.
Can you please clarify, 'When this attribute loads, it does not land in a single cell; it gets split across different cells and corrupts the table'? Does this happen every time? Could it be associated with certain punctuation?
I've been given some csv files that I want to turn into tables in a SQL database. However, the genius who created the files used comma delimiters, even though several data fields contain commas. So when I try to BCP the data into the database, I get a whole bunch of errors.
Is there a way that I can escape the commas that aren't field separators? At the moment I'm tempted to write a script to manually replace every comma in each file with a pipe, and then go through and manually change the affected rows back.
The only way to fix this is to write a script or program that fixes the data.
If the bad data is limited to a single field, the process should be trivial:
Consume the row from each side, counting off the good delimiters and replacing them with a new, unique delimiter; what remains in the middle is the column with the extra old delimiters, which you can just leave as is.
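A minimal sketch of that approach in Python (the column layout is assumed; here the third of five columns is the one containing stray commas):

def fix_row(line, n_cols, bad_index, old=",", new="|"):
    # Peel off the clean columns to the left of the bad field...
    parts = line.split(old, bad_index)
    head, rest = parts[:bad_index], parts[bad_index]
    # ...then the clean columns to its right; the middle keeps its commas.
    right = rest.rsplit(old, n_cols - bad_index - 1)
    return new.join(head + right)

print(fix_row("a,b,bad,field,with,commas,d,e", 5, 2))
# a|b|bad,field,with,commas|d|e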
If you have two bad fields straddling good fields, you would need some kind of more advanced logic. For instance, I once had XML data containing delimiters; I had to parse the XML until I found a terminating tag and then process the remaining delimiters as needed.
I have a text file whose fields are separated by commas.
Simple enough to do in SSIS, but I have the following row in my source flat file:
Desc,Curr,Desc,ID,Quantity
05969A105 ,CU,BANCORP INC, THE DEL COMMON ,1,2126
There is a comma in my Desc column, and I'm not sure how I can ignore that comma.
AFAIK, you can't do anything in SSIS (or any other app that I have ever used) to handle this, because it is simply bad data. If you need to persist with comma delimiters, then you will need to get the data provider to use text qualifiers, e.g. double quotes, to wrap the data. SSIS can be told what the text qualifier is and will strip those characters off the data automatically.
Of course, this may raise the issue of 'but the text may need to contain a double quote!', in which case you would be better off getting the delimiter changed to something else, such as a tab or pipe.
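For reference, a quick Python sketch of what the properly qualified version of the row above would look like, quoting only the field that needs it:

import csv, sys

row = ["05969A105 ", "CU", "BANCORP INC, THE DEL COMMON ", "1", "2126"]
csv.writer(sys.stdout).writerow(row)  # QUOTE_MINIMAL is the default
# 05969A105 ,CU,"BANCORP INC, THE DEL COMMON ",1,2126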