Importing File With Field Terminators In Data - sql

I've been given some csv files that I want to turn into tables in a SQL database. However, the genius who created the files used comma delimiters, even though several data fields contain commas. So when I try to BCP the data into the database, I get a whole bunch of errors.
Is there a way that I can escape the commas that aren't field separators? At the moment I'm tempted to write a script to manually replace every comma in each file with a pipe, and then go through and manually change the affected rows back.

The only way to fix this is to write a script or program that fixes the data.
If the bad data is limited to a single field the process should be trivial:
You consume the row from either side by the count of good delimiters, replacing each good delimiter with a new unique one; whatever remains in the middle is the column with the extra old delimiters, which you just leave as is.
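For the single-field case, a minimal sketch of that approach in Python (the column counts, the sample row, and the pipe as the new delimiter are just illustrative assumptions):

    def fix_row(line, cols_before, cols_after, old_delim=",", new_delim="|"):
        """Re-delimit a row where exactly one field may contain stray
        old_delim characters, with cols_before good columns to its left
        and cols_after good columns to its right."""
        parts = line.rstrip("\n").split(old_delim)
        left = parts[:cols_before]                   # good columns from the left
        right = parts[len(parts) - cols_after:]      # good columns from the right
        # Whatever sits in the middle is the bad field, stray commas and all.
        middle = old_delim.join(parts[cols_before:len(parts) - cols_after])
        return new_delim.join(left + [middle] + right)

    # A 5-column row whose 3rd field contains commas:
    print(fix_row("a,b,hello, world, again,d,e", cols_before=2, cols_after=2))
    # -> a|b|hello, world, again|d|e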
If you have two bad fields straddling good fields, you need more advanced logic. For instance, I once had XML data containing delimiters; I had to parse the XML until I found a terminating tag and then process the remaining delimiters as needed.

Related

Azure Storage Blob BULK INSERT with inconsistent field terminators

I am working with an external vendor who sends my company a csv file that I need to import into various tables in our database. We have other vendors who send similar files, and they always send csv files with quotation marks around every field (in case there are commas in the field). As such, I make the field terminator "," for the file, and it bulk inserts just fine since all fields include this terminator.
The problem I'm running into is that we have a new vendor that is unable to enclose every field in quotation marks. They are following RFC 4180, which puts quotation marks around fields that contain commas but leaves them off fields that don't. This leads to inconsistent field terminators when attempting to bulk insert, and I don't know of a way around it. If I make the field terminator a comma, it will split fields that have commas in them, but I similarly can't make the field terminator "," since that will not surround every field.
Any advice is welcome. I am trying to work with the vendor to send a consistent format, but, in case they can't, I'm trying to also find a workaround.
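One workaround, in case the vendor can't change: pre-process the file with a parser that already understands RFC 4180 quoting, and re-emit every field wrapped in quotes so the existing "," terminator works again. A minimal sketch using Python's csv module (the file names are made up):

    import csv

    with open("vendor_input.csv", newline="") as src, \
         open("normalized_output.csv", "w", newline="") as dst:
        reader = csv.reader(src)                         # handles RFC 4180 quoting
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)  # quote every field
        for row in reader:
            writer.writerow(row)

If you are on SQL Server 2017 or later, BULK INSERT's FORMAT = 'CSV' option is also worth a look, since it is meant to accept RFC 4180-style files directly.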

Formatting/indenting stored text data in sqlite3

I am trying to format text data stored in my sqlite file, so that when queried it returns formatted data, more specifically with indents.
I've read that this should be done in the front-end application, but it seems it would be taxing to format hundreds of different, nonuniform entries (these entries do share the same table column, though).
I have bullets in almost all of my entries but cannot indent them to nest properly.
Is there any way to execute a SQL statement against the column to add indent behavior, like so:
alksdjf ;lkasjdfl kajsl;dk fja;lsk djfal;ksjdfl;aksjd fl;kasjdfslkdfjals;kdjf;laksjdflkajs;dlkfja;lskjdfl;aksjdflkasjd;lfkjasl;kdfjl;
so that the wrapped text lines up past the bullet.
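If you do decide to bake the indentation into the stored text, one way is a string replacement over the column. A sketch using Python's sqlite3 module (the database, table, and column names are hypothetical, and it assumes lines inside an entry are separated by \n):

    import sqlite3

    conn = sqlite3.connect("notes.db")  # hypothetical database file
    # Insert four spaces after every newline so wrapped bullet text
    # lines up past the bullet in a fixed-width display.
    conn.execute(
        "UPDATE entries SET body = replace(body, char(10), char(10) || '    ')"
    )
    conn.commit()
    conn.close()

The advice you read still stands, though: this only looks right in a fixed-width view, which is why presentation is usually left to the front end.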

Writing to HDFS messed up the data

I was trying to save the output of a Hive query on HDFS but the data got changed. Any idea?
See below the correct data and the changed one:
[Correct]: i.stack.imgur.com/DLNTT.png
[Messed up]: i.stack.imgur.com/7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive, which is one of the available complex types. Internally, Hive separates the elements in an array with the ASCII character 002. If you consult an ASCII table, you can see that this is the non-printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive will separate every column in a row with ASCII 001, and it will separate map keys/values and struct fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in your CREATE TABLE statement. Be careful, though: if you switch the collection items terminator to something like a comma, then any commas in your input will look like collection terminators to Hive.
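To make that concrete, here is how a raw row of a text-format Hive table with one string column and one array column looks on disk, pulled apart by hand (a sketch with made-up values):

    # Columns are separated by \x01 and array elements by \x02,
    # the Hive defaults described above.
    raw = "alice\x01red\x02green\x02blue"

    columns = raw.split("\x01")         # ['alice', 'red\x02green\x02blue']
    name = columns[0]
    colors = columns[1].split("\x02")   # ['red', 'green', 'blue']
    print(name, colors)                 # alice ['red', 'green', 'blue']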
Unless you need to store the data in human-readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as is. If you need to read the HDFS files, you can always run hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display array items separated by tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware of what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in the first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
Now, this would explain why you would see ASCII 002 popping up if you were reading the file contents straight off HDFS, but it looks like you are seeing it from the Hive Command Line Interface, which should be aware of the collection terminators (and therefore use them to separate elements of the array instead of printing them). My best guess is that you have specified the schema wrong, and the results column of the table is a string where you meant to make it an array. This would explain why Hive went ahead and printed the ASCII 002s instead of using them as collection terminators.

Comma delimited flat file source

I have a text file that is split using commas.
Simple enough to do in SSIS, but I have the following row in my source flat file:
Desc,Curr,Desc,ID,Quantity
05969A105 ,CU,BANCORP INC, THE DEL COMMON ,1,2126
There is a comma in my Desc column, and I'm not sure how I can ignore that comma.
AFAIK, you can't do anything in SSIS (or any other app that I have ever used) to handle this, because it is simply bad data. If you need to persist with comma delimiters, then you will need to get the data provider to use text qualifiers, e.g. double-quotes, to wrap the data. SSIS can be told what the text qualifier is and will strip these characters off the data automatically.
Of course this may raise the issue of 'but the text may need to contain a double-quote!', in which case you would be better off getting the delimiter changed to something else, such as a tab or pipe.
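For what it's worth, the embedded double-quote case is exactly what RFC 4180-style quoting covers: a quote inside a quoted field is escaped by doubling it. A quick illustration of the convention with Python's csv module (just a sketch, not an SSIS feature):

    import csv, io

    buf = io.StringIO()
    writer = csv.writer(buf)  # default dialect quotes fields that need it
                              # and doubles any embedded quote characters
    writer.writerow(["05969A105", "CU", 'BANCORP INC, THE "DEL" COMMON', 1, 2126])
    print(buf.getvalue())
    # -> 05969A105,CU,"BANCORP INC, THE ""DEL"" COMMON",1,2126

Whether a given SSIS version copes well with embedded qualifiers is worth testing, though, so the tab-or-pipe suggestion remains the safer bet.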

Testing a CSV - how far should I go?

I'm generating a CSV which contains several rows and columns.
However, when I'm testing said CSV I feel like I am simply repeating, in the test, the code that builds the file, since I'm checking that each and every field is correct.
Question is, is this more sensible than it seems to me, or is there a better way?
A far simpler test is to just import the CSV into a spreadsheet or database and verify the data output is aligned to the proper fields. No extra columns or extra rows, data selected from the imported recordset is a perfect INTERSECT with the recordset from which the CSV was generated, etc.
More importantly, I recommend making sure your test data includes common CSV fail scenarios such as:
Field contains a comma (or whatever your separator character is)
Field contains multiple commas (You might think it's the same thing, but I've seen one fail where the other succeeded)
Field contains the new-row character(s)
Field contains characters not in the code page of the CSV file
...to make sure your code is handling them properly.
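A round-trip test in that spirit might look like the following sketch in Python (assuming your generator can be fed arbitrary row data; the sample values cover the scenarios above, apart from the code-page case, which needs a real import target):

    import csv, io

    # Rows deliberately containing the classic fail cases listed above:
    # single and multiple commas, embedded quotes, newlines, non-ASCII text.
    rows = [
        ["plain", "field, with comma", "one,two,three commas"],
        ['embedded "quote"', "new\nline", "caf\u00e9 \u2603"],
    ]

    # Write the CSV (stand-in for your real generator)...
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)

    # ...then read it back and assert a perfect round trip.
    buf.seek(0)
    assert list(csv.reader(buf)) == rows
    print("round trip OK")

Importing the same file into a real spreadsheet or database, as suggested above, still catches the code-page issues an in-process round trip can't.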