Testing a CSV - how far should I go?

I'm generating a CSV which contains several rows and columns.
However, when I test said CSV I feel like I am simply repeating the code that builds the file, since the test checks that each and every field is correct.
The question is: is this approach more sensible than it feels, or is there a better way?

A far simpler test is to import the CSV into a spreadsheet or database and verify that the data lines up with the proper fields: no extra columns or rows, the data selected from the imported recordset is a perfect INTERSECT with the recordset from which the CSV was generated, and so on.
More importantly, I recommend making sure your test data includes common CSV fail scenarios such as:
Field contains a comma (or whatever your separator character is)
Field contains multiple commas (You might think it's the same thing, but I've seen one fail where the other succeeded)
Field contains the new-row character(s)
Field contains characters not in the code page of the CSV file
...to make sure your code handles them properly; a small round-trip sketch follows.
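For example, a round-trip test along these lines avoids re-stating every field the production code writes: feed the writer a handful of awkward rows, re-read the file with an independent parser, and assert the parsed values match the originals. This is only a sketch; CSVWriter, writeRows:toPath:error: and CSVParser are hypothetical stand-ins for whatever builds and reads your file.
    // Minimal round-trip sketch (XCTest). CSVWriter/CSVParser are hypothetical
    // placeholders for your own writer and an independent reader.
    #import <XCTest/XCTest.h>
    #import "CSVWriter.h"   // hypothetical
    #import "CSVParser.h"   // hypothetical

    @interface CSVRoundTripTests : XCTestCase
    @end

    @implementation CSVRoundTripTests

    - (void)testAwkwardFieldsSurviveARoundTrip {
        NSArray<NSArray<NSString *> *> *rows = @[
            @[@"plain",            @"field, with a comma"],
            @[@"two,,commas",      @"line\nbreak inside"],
            @[@"quoted \"text\"",  @"café"]   // character outside the code page
        ];

        NSString *path = [NSTemporaryDirectory() stringByAppendingPathComponent:@"test.csv"];
        NSError *error = nil;
        XCTAssertTrue([CSVWriter writeRows:rows toPath:path error:&error], @"%@", error);

        // Re-read with a parser that is not the code under test, compare field by field.
        NSArray<NSArray<NSString *> *> *parsed = [CSVParser rowsFromFileAtPath:path];
        XCTAssertEqualObjects(parsed, rows);
    }

    @end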

Related

Merge multiple CSV files with different header titles in Objective-C

I have multiple CSV files with different header titles, and I want to merge all of them while keeping the combined set of header titles.
I think I could convert each CSV into an array, compare the header titles across all the files, and then merge the files. However, it seems that would take a huge amount of processing time because of all the loops involved. Could you help if you have any faster solution?
Example:
file1.csv
No,Series,Product,A,B,C,D,E
1, AAA, XX, a1,b1,c1,d1,e1
file2.csv
No,Series,Product,A,C,D,B,E,F,G
1, AAB, XX, a1,c1,d1,b1,e1,f1,g1
file3.csv
No,Series,Product,A,A1,A2,C,D,B1,B,E
1, AAC, XX, a1,a11,a21,c1,d1,b11,b1,e1
My expected merge file is:
merge.csv
No,Series,Product,A,A1,A2,B,B1,C,D,E,F,G
1, AAA, XX, a1,0,0,b1,0,c1,d1,e1,0,0
1, AAB, XX, a1,0,0,b1,0,c1,d1,e1,f1,g1
1, AAC, XX, a1,a11,a21,b1,b11,c1,d1,e1,0,0
"Data which not available in column will show as "0" or "NA",etc.
From your comment it seems you have no code yet but think your sketch will be slow; that sounds like premature optimisation. Code your algorithm, test it, and if it's slow use Instruments to see where the time is being spent, then look at optimising.
That said some suggestions:
You need to decide whether you are supporting general CSV files, where field values may contain commas, newlines or double quotes; or simple CSV files, where none of those characters appears in a field. See section 2 of Common Format and MIME Type for Comma-Separated Values (CSV) Files for what you need to parse to support general CSV files, and remember you need to output values using the same convention. If you stick with simple CSV files then NSString's componentsSeparatedByString: and NSArray's componentsJoinedByString: may be all you need to parse and output respectively.
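For the simple case, a minimal sketch of parsing and re-emitting a line with those two methods might look like this (it assumes no field contains a comma, newline or double quote, and trims the stray spaces that appear after the commas in your example files):
    #import <Foundation/Foundation.h>

    // Simple CSV only: assumes no field contains a comma, newline or double quote.
    static NSArray<NSString *> *FieldsFromSimpleCSVLine(NSString *line) {
        NSMutableArray<NSString *> *fields = [NSMutableArray array];
        for (NSString *field in [line componentsSeparatedByString:@","]) {
            [fields addObject:[field stringByTrimmingCharactersInSet:
                                  [NSCharacterSet whitespaceCharacterSet]]];
        }
        return fields;
    }

    static NSString *SimpleCSVLineFromFields(NSArray<NSString *> *fields) {
        return [fields componentsJoinedByString:@","];
    }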
Consider first iterating over your files reading just the header rows, parse those, and produce the merged list of headers. You will need to preserve the order of the headers so you can pair them up with the data rows, which makes arrays the container of choice here. You may choose to use sets in the merging process, but the final merged list of headers should also be an array, in the order you wish the columns to appear in the merged file. You can use these arrays of headers directly in the dictionary methods below.
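One way to build that merged header list is sketched below with an NSMutableOrderedSet, which keeps each header once while preserving the order of first appearance; reorder the result however you want the columns to appear in the merged file.
    // headerRowsPerFile: one parsed header array per input file, e.g. the result of
    // FieldsFromSimpleCSVLine() applied to the first line of each file.
    static NSArray<NSString *> *MergedHeaders(NSArray<NSArray<NSString *> *> *headerRowsPerFile) {
        NSMutableOrderedSet<NSString *> *merged = [NSMutableOrderedSet orderedSet];
        for (NSArray<NSString *> *headers in headerRowsPerFile) {
            [merged addObjectsFromArray:headers];   // duplicates ignored, order preserved
        }
        return merged.array;
    }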
Using a dictionary, as in your outline, is one approach. In this case look at NSDictionary's dictionaryWithObjects:forKeys: for building the dictionary from the parsed header and record. For outputting the dictionary, look at objectsForKeys:notFoundMarker: with the merged list of headers; this handles missing keys, and you supply the value to insert. For standard CSVs the missing value is empty (i.e. two adjacent commas in the text) but you can use NA or 0 as you suggest.
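A sketch of that pairing, using the two methods just mentioned and 0 as the not-found marker to match your example output:
    // headers/fields: the parsed header row and a parsed data row of the current file.
    // mergedHeaders: the combined, ordered header list built earlier.
    static NSString *MergedCSVLine(NSArray<NSString *> *headers,
                                   NSArray<NSString *> *fields,
                                   NSArray<NSString *> *mergedHeaders) {
        // Counts must match; pad or truncate the row first if your data is ragged.
        NSDictionary<NSString *, NSString *> *row =
            [NSDictionary dictionaryWithObjects:fields forKeys:headers];

        // Any merged header missing from this file gets the marker value.
        NSArray<NSString *> *values = [row objectsForKeys:mergedHeaders notFoundMarker:@"0"];
        return [values componentsJoinedByString:@","];
    }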
You can process each file in turn, a row at a time: read, parse, make a dictionary, get an array of values back from the dictionary with the missing-value marker in the appropriate places, combine, write. You never need to hold a complete file in memory at any time.
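Putting the pieces together, the overall loop might be sketched as below. For brevity this version reads each file as a single string and accumulates the output in memory; to keep memory flat, as described above, you would read and write line by line (NSFileHandle, or a stream-based reader) instead.
    static NSString *MergeCSVFiles(NSArray<NSString *> *paths,
                                   NSArray<NSString *> *mergedHeaders) {
        NSMutableString *out = [NSMutableString string];
        [out appendFormat:@"%@\n", [mergedHeaders componentsJoinedByString:@","]];

        for (NSString *path in paths) {
            NSString *contents = [NSString stringWithContentsOfFile:path
                                                           encoding:NSUTF8StringEncoding
                                                              error:NULL];
            NSArray<NSString *> *lines = [contents componentsSeparatedByString:@"\n"];
            NSArray<NSString *> *headers = FieldsFromSimpleCSVLine(lines.firstObject);

            for (NSString *line in [lines subarrayWithRange:NSMakeRange(1, lines.count - 1)]) {
                if (line.length == 0) continue;   // skip a trailing blank line
                NSArray<NSString *> *fields = FieldsFromSimpleCSVLine(line);
                [out appendFormat:@"%@\n", MergedCSVLine(headers, fields, mergedHeaders)];
            }
        }
        return out;   // write out with writeToFile:atomically:encoding:error:
    }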
If, after implementing your code using dictionaries to handle the missing columns easily, you find it is too slow, you can then look at optimising. Instead of breaking each input data row into fields and then recombining them with the missing columns added, you might consider doing direct string-replacement operations on the text of the data row, adding extra delimiters as needed; e.g. if column four is missing you can change the third comma into two commas to insert the missing column.
If after designing your algorithm and coding it you hit problems you can ask a new question, include your algorithm and code, a link back to this question so people can follow the history, and explain what your issue is. Someone will undoubtedly help you take the next step.
HTH

Column size of Google Big Query

I am populating data from a server into Google BigQuery. One of the attributes in the table is a string that is 150+ characters long.
For example, "Had reseller test devices in a vehicle with known working device
Set to power cycle, never got green light Checked with cell provider and all SIMs were active all cases the modem appears to be dead,light in all but not green light".
The table in GBQ gets populated until it hits this specific attribute. When this attribute is loaded, it does not end up in a single cell; it gets split across different cells and corrupts the table.
Is there any restriction on each field in GBQ? Any information regarding this would be appreciated.
My guess is that quote and comma characters in the CSV data are confusing the CSV parser. For example, if one of your fields is hello, world, this will look like two separate fields. The way around this is to quote the field, so you'd need "hello, world". This, of course, has problems if you have embedded quotes in the field. For instance, if you wanted a field that said She said, "Hello, world", you would either need to escape the quotes by doubling the internal ones, as in "She said, ""Hello, world""", or use a different field separator (for instance, |) and drop the quote separator (using \0).
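The quoting rule itself is mechanical, whatever language is producing the file. A minimal sketch of it (written in Objective-C only for consistency with the other examples on this page; the logic is the same anywhere):
    // RFC 4180-style quoting: wrap the field in double quotes if it contains the
    // separator, a quote or a line break, and double any embedded quotes.
    static NSString *QuotedCSVField(NSString *field) {
        BOOL needsQuoting = [field containsString:@","]  ||
                            [field containsString:@"\""] ||
                            [field containsString:@"\n"] ||
                            [field containsString:@"\r"];
        if (!needsQuoting) return field;

        NSString *escaped = [field stringByReplacingOccurrencesOfString:@"\""
                                                             withString:@"\"\""];
        return [NSString stringWithFormat:@"\"%@\"", escaped];
    }
Passing She said, "Hello, world" through this produces "She said, ""Hello, world""", which is exactly the doubled-quote form described above.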
One final complication is if you have embedded newlines in your field. If you have Hello\nworld, you need to set the allow_quoted_newlines option on the load job configuration. The downside is that large files will be slower to import with this option, since they can't be processed in parallel.
These configuration options are all described here, and can be used via either the web UI or the bq command line shell.
I'm not sure there is a limit imposed, and certainly I have seen string fields with over 8,000 characters.
Can you please clarify "when this attribute is loaded, it does not end up in a single cell; it gets split across different cells and corrupts the table"? Does this happen every time? Could it be associated with certain punctuation?

Importing File WIth Field Terminators In Data

I've been given some csv files that I want to turn into tables in a SQL database. However, the genius who created the files used comma delimiters, even though several data fields contain commas. So when I try to BCP the data into the database, I get a whole bunch of errors.
Is there a way that I can escape the commas that aren't field separators? At the moment I'm tempted to write a script to manually replace every comma in each file with a pipe, and then go through and manually change the affected rows back.
The only way to fix this is to write a script or program that fixes the data.
If the bad data is limited to a single field the process should be trivial:
Consume the row from either side, counting off the expected number of good delimiters and replacing them with a new unique delimiter; what remains in the middle is the column containing the extra old delimiters, which you simply leave as-is.
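For example, if the row should have a known number of columns and only one column can contain stray delimiters, the repair can be sketched like this (Objective-C purely to illustrate the idea; a few lines in any scripting language would do the same):
    // row should have expectedColumns columns; only the column at badIndex (0-based)
    // may contain stray commas. The delimiters on either side of it are replaced with
    // a new unique delimiter ('|') and the commas inside the bad column are left alone.
    static NSString *RepairRow(NSString *row, NSUInteger expectedColumns, NSUInteger badIndex) {
        NSArray<NSString *> *parts = [row componentsSeparatedByString:@","];
        NSUInteger extra = parts.count - expectedColumns;   // stray commas, all in the bad column

        NSArray<NSString *> *left  = [parts subarrayWithRange:NSMakeRange(0, badIndex)];
        NSArray<NSString *> *bad   = [parts subarrayWithRange:NSMakeRange(badIndex, extra + 1)];
        NSArray<NSString *> *right = [parts subarrayWithRange:
                                         NSMakeRange(badIndex + extra + 1,
                                                     parts.count - badIndex - extra - 1)];

        NSMutableArray<NSString *> *fixed = [NSMutableArray arrayWithArray:left];
        [fixed addObject:[bad componentsJoinedByString:@","]];   // keep the old commas
        [fixed addObjectsFromArray:right];
        return [fixed componentsJoinedByString:@"|"];            // new unique delimiter
    }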
If you have two bad fields straddling good fields, you will need more advanced logic. For instance, I once had XML data containing delimiters; I had to parse the XML until I found a terminating tag and then process the remaining delimiters as needed.

Writing on HDFS messed the data

I was trying to save the output of a Hive query on HDFS but the data got changed. Any idea?
See below the correct data and the changed data:
[Correct]: i.stack.imgur.com/DLNTT.png
[Messed up]: i.stack.imgur.com/7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive, which is one of the available complex types. Internally, Hive separates the elements of an array with the ASCII character 002. If you consult an ASCII table, you can see that this is the non-printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive separates every column in a row with ASCII 001, and it separates map keys/values and struct fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in your CREATE TABLE statement. Be careful though: if you switch the collection items terminator to something like a comma, then any commas in your input will look like collection terminators to Hive.
Unless you need to store the data in human-readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as they are. If you need to read the HDFS files directly you can always run hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display the array items separated by tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware of what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in the first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
Now, this would explain why you would see ASCII 002 popping up if you were reading the file contents straight off HDFS, but it looks like you are seeing it in the Hive command-line interface, which should be aware of the collection terminators (and therefore use them to separate the elements of the array instead of printing them). My best guess is that you have specified the schema wrong and the column in the results is a string where you meant to make it an array. That would explain why Hive printed the ASCII 002s instead of using them as collection terminators.

Converting Gridview into CSV

Is it possible to use a delimiter other than a comma when converting to a CSV file? In my scenario the GridView cells contain data with commas.
Well, the 'C' in CSV does stand for "Comma".
That said, depending on what the purpose/destination of your "CSV" output is, I can see two options:
If your program is the only recipient, use whatever you like. Heck, something like the built-in serialisers might be easier.
Otherwise, follow the CSV format and double quote your values.
There is a lot more useful information in this question, and this one.