Merge multiple CSV files with different header titles in Objective-C - objective-c

I have multiple CSV files with different header titles, and I want to merge all of them while keeping the combined set of header titles.
I think I could convert each CSV to an array, compare the header titles across all files, and then merge the files. However, that seems likely to take a long time because of all the loops involved. Is there a faster solution?
Example:
file1.csv
No,Series,Product,A,B,C,D,E
1, AAA, XX, a1,b1,c1,d1,e1
file2.csv
No,Series,Product,A,C,D,B,E,F,G
1, AAB, XX, a1,c1,d1,b1,e1,f1,g1
file3.csv
No,Series,Product,A,A1,A2,C,D,B1,B,E
1, AAC, XX, a1,a11,a21,c1,d1,b11,b1,e1
My expected merge file is:
merge.csv
No,Series,Product,A,A1,A2,B,B1,C,D,E,F,G
1, AAA, XX, a1,0,0,b1,0,c1,d1,e1,0,0
1, AAB, XX, a1,0,0,b1,0,c1,d1,e1,f1,g1
1, AAC, XX, a1,a11,a21,b1,b11,c1,d1,e1,0,0
Data that is not available in a column will show as "0" or "NA", etc.

From your comment it seems you have no code yet but think your sketch will be slow. That sounds like premature optimisation: code your algorithm and test it, and if it's slow, use Instruments to see where the time is being spent before looking at optimisation.
That said some suggestions:
You need to decide if you are supporting general CSV files, where field values may contain commas, newlines or double quotes; or simple CSV files where none of those characters is present in a field. See section 2 of Common Format and MIME Type for Comma-Separated Values (CSV) Files (RFC 4180) for what you need to parse to support general CSV files, and remember you need to output values using the same convention. If you stick with simple CSV files then NSString's componentsSeparatedByString: and NSArray's componentsJoinedByString: may be all you need to parse and output respectively.
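To illustrate the difference between the two cases, here is a minimal sketch in Python rather than Objective-C (chosen only for brevity). `str.split` plays the role of componentsSeparatedByString:, and the standard `csv` module stands in for a general CSV parser. The sample line is made up.

```python
import csv
import io

line = '1,"Smith, John","He said ""hi"""'

# Simple split: breaks apart fields that contain commas or quotes.
simple_fields = line.split(",")

# General parsing: honours quoted fields and doubled quotes.
general_fields = next(csv.reader(io.StringIO(line)))

# Writing back out must apply the same quoting convention.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerow(general_fields)
round_tripped = out.getvalue().rstrip("\n")
```

The simple split yields four pieces for what is really a three-field record, which is why the choice has to be made up front.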
Consider first iterating over your files reading just the header rows, parse those, and produce the merged list of headers. You will need to preserve the order of the headers, so you can pair them up with the data rows, so arrays are your container of choice here. You may choose to use sets in the merging process, but the final merged list of headers should also be an array in the order you wish them to appear in the merged file. You can use these arrays of headers directly in the dictionary methods below.
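A minimal sketch of the header-merging step, written in Python for brevity rather than Objective-C. It uses a set for fast membership tests and a list to preserve order, as suggested above. The function name `merge_headers` is hypothetical, and the final sorting step is only an assumption about the order wanted, based on the expected output in the question.

```python
def merge_headers(header_rows):
    """Return the union of all headers, in first-seen order."""
    seen = set()      # fast membership test during merging
    merged = []       # final ordered list of headers
    for header in header_rows:
        for name in header:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged

headers = [
    "No,Series,Product,A,B,C,D,E".split(","),
    "No,Series,Product,A,C,D,B,E,F,G".split(","),
    "No,Series,Product,A,A1,A2,C,D,B1,B,E".split(","),
]
merged = merge_headers(headers)

# Assumption: keep the three fixed columns first and sort the rest,
# which happens to match the header order shown in the question.
ordered = merged[:3] + sorted(merged[3:])
```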
Using a dictionary as in your outline is one approach. In this case look at NSDictionary's dictionaryWithObjects:forKeys: for building the dictionary from the parsed header and record. To output the dictionary, look at objectsForKeys:notFoundMarker:, passing the merged list of headers. This supports missing keys, and you supply the value to insert. In standard CSV files the missing value is empty (i.e. two adjacent commas in the text), but you can use NA or 0 as you suggest.
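The dictionary approach can be sketched as follows, in Python for brevity rather than Objective-C. `dict(zip(...))` is a rough analogue of dictionaryWithObjects:forKeys:, and `dict.get` with a default is a rough analogue of objectsForKeys:notFoundMarker:. The helper name `align_row` is hypothetical, and the sketch assumes simple CSV rows with no embedded commas.

```python
def align_row(header, record, merged_header, missing="0"):
    """Reorder one record to the merged header, filling gaps."""
    # Pair each header with its value (like dictionaryWithObjects:forKeys:).
    row = dict(zip(header, record))
    # Look up every merged header, substituting the missing marker
    # where this file has no such column.
    return [row.get(name, missing) for name in merged_header]

merged_header = "No,Series,Product,A,A1,A2,B,B1,C,D,E,F,G".split(",")
header = "No,Series,Product,A,C,D,B,E,F,G".split(",")
record = "1,AAB,XX,a1,c1,d1,b1,e1,f1,g1".split(",")

aligned = align_row(header, record, merged_header)
line = ",".join(aligned)
```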
You can process each file in turn, a row at a time: read, parse, make dictionary, get an array of values back from dictionary with the missing value in the appropriate places, combine, write. You never need to hold a complete file in memory at any time.
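The row-at-a-time pipeline might look like the sketch below, again in Python rather than Objective-C. It makes two passes: the first reads only the header rows, the second streams each data row through a dictionary. The function name `merge_csv_files` is hypothetical, and the sketch assumes every data row has exactly as many fields as its header.

```python
import csv

def merge_csv_files(paths, out_path, missing="0"):
    """Merge CSV files with differing headers, one row at a time."""
    # Pass 1: read only the header row of each file.
    merged = []
    for path in paths:
        with open(path, newline="") as f:
            for name in next(csv.reader(f, skipinitialspace=True), []):
                if name not in merged:
                    merged.append(name)

    # Pass 2: stream every data row; only one row is held in memory.
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=merged, restval=missing)
        writer.writeheader()
        for path in paths:
            with open(path, newline="") as f:
                for row in csv.DictReader(f, skipinitialspace=True):
                    writer.writerow(row)
    return merged
```

A call such as `merge_csv_files(["file1.csv", "file2.csv", "file3.csv"], "merge.csv")` would write the merged file with headers in first-seen order; reorder `merged` before the second pass if a different column order is wanted.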
If, after implementing your code using dictionaries to easily handle the missing columns, you find it is too slow, you can then look at optimising. Instead of breaking each input data row into fields and then recombining them with the missing columns added, you might do direct string replacement on the text of the data row and add extra delimiters as needed. For example, if column four is missing, you can change the third comma into two commas to insert the missing column.
If after designing your algorithm and coding it you hit problems you can ask a new question, include your algorithm and code, a link back to this question so people can follow the history, and explain what your issue is. Someone will undoubtedly help you take the next step.
HTH

Related

Alternative to case statements when changing a lot of numeric controls

I'm pretty new to LabVIEW, but I do have experience in other programming languages like Python and C++. The code I'm going to ask about works, but there was a lot of manual work involved when putting it together. Basically I read from a text file and change control values based on values in the text file; in this case it's 40 values.
I have set it up to pull from a text file and split the string by commas. Then I loop through all the values and set the indicator to read the corresponding value. I had to create 40 separate case statements to achieve this. I'm sure there is a better way of doing this. Does anyone have any suggestions?
The following improvements could be made (in addition to those suggested by sweber):
If the file contains just data, without a "label - value" format, you could read it as CSV (comma-separated values) and read only the first row.
Currently, you set values based on their order. In that case, you could create references to all the indicators, build them into an array in the proper order, and assign values to the indicators in a For Loop via the Value property node.
Overall, I agree with sweber that if the data is key-value data, it is better to use either JSON or the .ini file format, both of which support such a structure.
Let's start with some optimization:
It seems your data file contains nothing more than just 40 numbers. You can wire a 1D DBL array to the default input of the string-to-array VI, and you will get just a 1D array out. No need for a 2D array.
Second, there is no need to convert the FOR index value to a string, the CASE accepts integers, too.
Now, about your question: The simplest solution is to display the values as array, just as they come from the string-to-array VI.
But I guess each value has a special meaning, and you would like to display its name/description somehow. In this case, create a cluster with 40 values, edit their labels as you like, and make sure their order in the cluster is the same as the order of the values in the files.
Then, wire the 1D array of values to this cluster via an array-to-cluster VI.
If you plan to use the text file to store and load the values, converting the cluster data to JSON and vice versa might be something for you, as it transports the labels of the cluster into the file, too. (However, changing labels then becomes an issue.)
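Since LabVIEW diagrams cannot be shown as text, here is a rough textual analogue of the idea in Python, which the asker mentions knowing. Pairing an ordered list of labels with the parsed values stands in for the array-to-cluster step, and serialising the result to JSON carries the labels into the file. The label names and values are made up for illustration.

```python
import json

# Hypothetical labels, in the same order as the values in the file.
LABELS = ["temperature", "pressure", "flow_rate"]

def load_values(text):
    """Parse one comma-separated line into a list of floats."""
    return [float(x) for x in text.split(",")]

values = load_values("21.5,101.3,0.75")

# One loop-free assignment replaces a case statement per value.
named = dict(zip(LABELS, values))

# Storing as JSON keeps the labels alongside the values.
as_json = json.dumps(named)
restored = json.loads(as_json)
```

Because the labels are stored in the file, renaming one later means older files no longer match, which is the caveat raised above.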

Importing File With Field Terminators In Data

I've been given some csv files that I want to turn into tables in a SQL database. However, the genius who created the files used comma delimiters, even though several data fields contain commas. So when I try to BCP the data into the database, I get a whole bunch of errors.
Is there a way that I can escape the commas that aren't field separators? At the moment I'm tempted to write a script to manually replace every comma in each file with a pipe, and then go through and manually change the affected rows back.
The only way to fix this is to write a script or program that fixes the data.
If the bad data is limited to a single field the process should be trivial:
Consume the row from each side, counting off the known number of good delimiters and replacing each with a new, unique delimiter. What remains in the middle is the column containing the extra old delimiters, which you leave as is.
If you have two bad fields straddling good fields, you would need more advanced logic. For instance, I had XML data containing delimiters, and I had to parse the XML until I found a terminating tag and then process the other delimiters as needed.
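For the single-bad-field case, the "consume from either side" idea can be sketched in Python as below. The function name `fix_row` and the sample row are hypothetical, and the sketch assumes you know the total number of fields, which field is the bad one, and that the new delimiter never appears in the data.

```python
def fix_row(line, total_fields, bad_index, old=",", new="|"):
    """Re-delimit a row whose field at bad_index may contain `old`."""
    parts = line.split(old)
    if len(parts) < total_fields:
        raise ValueError("row has too few fields")
    before = bad_index                      # good fields on the left
    after = total_fields - bad_index - 1    # good fields on the right
    cut = len(parts) - after                # where the right side starts
    left = parts[:before]
    right = parts[cut:]
    # Everything in between belongs to the bad field; keep its commas.
    middle = old.join(parts[before:cut])
    return new.join(left + [middle] + right)

fixed = fix_row("42,Acme, Inc.,London,UK", total_fields=4, bad_index=1)
```

The result is pipe-delimited with the embedded comma preserved, so it can be loaded with the pipe as the field terminator.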

Quickly Convert Text To Numbers or Dates Excel VBA

Is there any way to QUICKLY convert numbers/dates stored as text (without knowing exactly which cells are affected) to their correct type using VBA?
I get data in an ugly text-delimited format, and I wrote a macro that basically does text-to-columns on it, but is more robust (regular text-to-columns will not work on my data, and I also don't want to waste time going through the wizard every time...). But, since I have to use arrays to process the data efficiently, everything gets stored as a String (and is thus transferred to the worksheet as text).
I don't want to have to cycle through every cell, as this takes a LONG time (these are huge data files - I need to use arrays to process them). Is there a simple command I can apply to the entire range to do this?
Thanks!
This has to do with the data type of the columns. Change the column from General to the correct data type, and text data placed there should be converted automatically. Here's an example where I pasted the text 012345 into different columns having different data types. Note how the displayed value differs between types but the value is retained (except for Number and General, which drop the leading 0).
However if you don't know what field is of what type... you're really out of luck.
There is a way. Multiply the data in the text column by 1; whether or not a cell is stored as text, the result is a number.
See the following link for more.
http://chandoo.org/wp/2014/09/02/convert-numbers-stored-as-text-tip/

Writing on HDFS messed the data

I was trying to save the output of a Hive query on HDFS but the data got changed. Any idea?
See the original data and the changed data below (remove the space before the file name in each link).
[[Correct]: i.stack.imgur.com/ DLNTT.png
[[Messed up]: i.stack.imgur.com/ 7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive which is one of the available complex types. Internally, Hive separates the elements in an array with the ASCII character 002. If you consult an ascii table, you can see that this is the non printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive will separate every column in a row with ASCII 001, and it will separate map keys/values and structure fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in your CREATE TABLE statement. Be careful though: if you switch the collection items terminator to something like "," then any commas in your input will look like collection terminators to Hive.
Unless you need to store the data in human readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as is. If you need to read the HDFS files you can always hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display array items as separated with tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in the first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
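If you do read the raw files yourself, splitting on the default terminators is straightforward. The following is a minimal Python sketch, assuming the default delimiters described above; the sample line and the function name `parse_line` are made up, and a one-element array would not be detected as a collection by this simple check.

```python
FIELD_SEP = "\x01"   # default separator between columns
ITEM_SEP = "\x02"    # default separator between collection items

def parse_line(line):
    """Split one raw line into columns; split collections further."""
    columns = []
    for col in line.rstrip("\n").split(FIELD_SEP):
        if ITEM_SEP in col:
            # Column holds an array: break it into its items.
            columns.append(col.split(ITEM_SEP))
        else:
            columns.append(col)
    return columns

raw = "42\x01red\x02green\x02blue\x01done\n"
parsed = parse_line(raw)
```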
Now this would explain why you would see ASCII 002 popping up if you were reading the file contents off of HDFS, but it looks like you are seeing it from the Hive Command Line Interface which should be aware of the collection terminators (and therefore use them to separate elements of the array instead of printing them). My best guess there is you have specified the schema wrong and the column of the table results is a string where you meant to make it an array. This would explain why it went ahead and printed the ASCII 002's instead of using them as collection terminators.

Testing a CSV - how far should I go?

I'm generating a CSV which contains several rows and columns.
However, when I'm testing said CSV I feel like I am simply repeating the code that builds the file in the test as I'm checking each and every field is correct.
Question is, is this more sensible than it seems to me, or is there a better way?
A far simpler test is to just import the CSV into a spreadsheet or database and verify the data output is aligned to the proper fields. No extra columns or extra rows, data selected from the imported recordset is a perfect INTERSECT with the recordset from which the CSV was generated, etc.
More importantly, I recommend making sure your test data includes common CSV fail scenarios such as:
Field contains a comma (or whatever your separator character)
Field contains multiple commas (You might think it's the same thing, but I've seen one fail where the other succeeded)
Field contains the new-row character(s)
Field contains characters not in the code page of the CSV file
...to make sure your code is handling them properly.
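One way to cover these scenarios without re-implementing the generator in the test is a round-trip check: write rows containing the awkward cases, read them back with an independent parser, and compare. The sketch below uses Python's standard `csv` module; `write_csv` is a hypothetical stand-in for the code under test, and UTF-8 is an assumed encoding.

```python
import csv
import io

TRICKY_ROWS = [
    ["plain", "has,comma"],
    ["a,b,c", 'has "quotes"'],          # multiple commas, embedded quotes
    ["line\nbreak", "caf\u00e9 \u2603"],  # newline and non-ASCII characters
]

def write_csv(rows):
    """Stand-in for the code under test (hypothetical)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def read_csv(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))

# Encode and decode to exercise the file's character encoding too.
data = write_csv(TRICKY_ROWS).encode("utf-8")
round_trip = read_csv(data.decode("utf-8"))
assert round_trip == TRICKY_ROWS
```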