Filehelpers: parse fixed length file with different length of lines in - filehelpers

I've to parse a fixed length file containing multiple line but with different length.
Actually each line represent a different kind of object that would be insert in the database.
The file can be like this:
A10200JohnSmithUSA
B10000ContractSignedWithJohnSmith10200
Line 1..represent information regarding John Smith and Line 2 represents informations regarding contract signed by JohnSmith... All this come in the same file.
Can filehelpers library do that?
Thanks for your help.
Regards

Yes FileHelpers can handle multiple record types within the same file. It depends whether there is any hierarchical relation between the records. If not, you can use the MultiRecordEngine.
Check out the MultiRecordEngine example
If your contract row needs to use information from the John Smith row, then you can use a MasterDetailEngine.
Check out the MasterDetailEngine example
Note however, the MasterDetailEngine only supports one level of nesting.

Related

Merge multiple csv files with difference header tile in Objective C

I have multiple csv file with difference header tile, and I want to merge all of them & keep combination Header title.
I think I can doing cover csv to array then compare header tile in all file then merge csv file. However, seem it get huge processing time because o a lot of loop there. Could you help if have any fast solution.
Example:
file1.csv
No,Series,Product,A,B,C,D,E
1, AAA, XX, a1,b1,c1,d1,e1
file2.csv
No,Series,Product,A,C,D,B,E,F,G
1, AAB, XX, a1,c1,d1,b1,e1,f1,g1
file3.csv
No,Series,Product,A,A1,A2,C,D,B1,B,E
1, AAC, XX, a1,a11,a21,c1,d1,b11,b1,e1
My expected merge file is:
merge.csv
No,Series,Product,A,A1,A2,B,B1,C,D,E,F,G
1, AAA, XX, a1,0,0,b1,0,c1,d1
1, AAB, XX, a1,0,0,b1,0,c1,d1,e1,f1,g1
1, AAC, XX, a1,a11,a21,b1,b11,c1,d1,e1
"Data which not available in column will show as "0" or "NA",etc.
From your comment it seems you have no code but you think your sketch will be slow, it sounds like you are optimising prematurely – code your algorithm, test it, if its slow use Instruments to see where the time is being spent and then look at optimisation.
That said some suggestions:
You need to decide if you are supporting general CSV files, where field values may contain commas, newlines or double quotes; or simple CSV files where none of those characters is present in a field. See section 2 of Common Format and MIME Type for Comma-Separated Values (CSV) Files what what you need to parse to support general CSV files, and remember you need to output values using the same convention. If you stick with simple CSV files then NSString's componentsSeparatedByString: and NSArray's componentsJoinedByString: may be all you need to parse and output respectively.
Consider first iterating over your files reading just the header rows, parse those, and produce the merged list of headers. You will need to preserve the order of the headers, so you can pair them up with the data rows, so arrays are your container of choice here. You may choose to use sets in the merging process, but the final merged list of headers should also be an array in the order you wish them to appear in the merged file. You can use these arrays of headers directly in the dictionary methods below.
Using a dictionary as in your outline is one approach. In this case look at NSDictionary's dictionaryWithObjects:forKeys: for building the dictionary from the parsed header and record. For outputting the dictionary look at objectsForKeys:notFoundMarker: and using the merged list of headers. This supports missing keys and you supply the value to insert. For standard CSV's the missing value is empty (i.e. two adjacent commas in the text) but you can use NA or 0 as you suggest.
You can process each file in turn, a row at a time: read, parse, make dictionary, get an array of values back from dictionary with the missing value in the appropriate places, combine, write. You never need to hold a complete file in memory at any time.
If after implementing your code using dictionaries to easily handle the missing columns you find it is too slow you can then look at optimising. You might want to consider instead of breaking each input data row into fields and the recombining adding in the missing columns that you just do direct string replacement operations on the text of the data row and just add in extra delimiters as needed – e.g. if column four is missing you can change the third comma for two commas to insert the missing column.
If after designing your algorithm and coding it you hit problems you can ask a new question, include your algorithm and code, a link back to this question so people can follow the history, and explain what your issue is. Someone will undoubtedly help you take the next step.
HTH

Flexible schema with Google Bigquery

I have around 1000 files that have seven columns. Some of these files have a few rows that have an eighth column (if there is data).
What is the best way to load this into BigQuery? Do I have to find and edit all these files to either
- add an empty eighth column in all files
- remove the eighth column from all files? I don't care about the value in this column.
Is there a way to specify eight columns in the schema and add a null value for the eighth column when there is no data available.
I am using BigQuery APIs to load data if that might help.
You can use the 'allowJaggedRows' argument, which will treat non-existent values at the end of a row as nulls. So your schema could have 8 columns, and all of the rows that don't have that value will be null.
This is documented here: https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowJaggedRows
I've filed a doc bug to make this easier to find.
If your logs are in JSON, you can define a nullable field, and if it does not appear in the record, it would remain null.
I am not sure how it works with CSV, but I think that you have to have all fields (even empty).
There is a possible solution here if you don't want to worry about having to change the CSV values (which would be my recommendation otherwise)
If the number of rows with an eight parameter is fairly small and you can afford to "sacrifice" those rows, then you can pass a maxBadRecords param with a reasonable number. In that case, all the "bad" rows (i.e. the ones not conforming to the schema) would be ignored and wouldn't be loaded.
If you are using bigquery for statistical information and you can afford to ignore those rows, it could solve your problem.
Found a workable "hack".
Ran a job for each file with the seven column schema and then ran another job on all files with eight columns schema. One of the job would complete successfully. Saving me time to edit each file individually and reupload 1000+ files.

Yahoo finance API field misalignment

I'm trying to use the Yahoo Finance API to create a custom csv but depending upon the stock there is field misalignment.
For instance, if I just want to download the "k3" field for yahoo which corresponds to last trade size, I would craft the url like so:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=k3
However, if you download that csv there are two columns of data.
Similarly, if I decide to get Float Shares , I want the url:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=f6
However that gives me 3 columns. Is there a way to get it in exactly one column? I want to automate this process but the different column orientations make it very difficult as different rows then have different column lengths and I am unable to easily match up the column name with the row.
Bonus: If someone can explain where the 3 float share numbers come from that would be great, I seem to only be able to match up the first to the site...
Thank you for your help!
In short, you are describing known bugs that Yahoo isn't going to fix as the feed is officially unsupported.
Specifically re. Float (f6): the number returned is the entire float. It is not 3 csv numbers. The commas are not delimiters; rather, they are 1,000s separators. (I suspect the same is the case with K3. As it is with a couple of other known numbers. (See link below.))
Two solutions:
(1) Write your own workaround using conditional statements (if or case) in your code.
(2) Download the buggy parameters separately from the clean ones.
See: Yahoo's official reply to your question.
The multiple columns is because that excel (or whatever csv viewer you are using) treats "thousand-seperator" as the the "comma-seperator". We used to have this problem in our school project, and found a hack which is good only if you are using this api for some hobby project and not concerning data usage.
The idea is instead of treating the results as a csv, pick a static column (column A) where you will know the value beforehand (e.g. column 's' stock symbol) or put this value as the first column. When constructing the query, use this column to surround those columns (float columns) with formatting problem. once you get the quotes.csv, manual seperate the results on the column A value.
for example using
http://download.finance.yahoo.com/d/quotes.csv?s=yhoo&f=sf6sa5sb6
will get you
"YHOO", 887,675,000,"YHOO",400,"YHOO",N/A
Then use ,"YHOO", to seperate the results (excluding first column).
Not an elegent way to solve the problem, but at least it gives you the correct result.

Writing on HDFS messed the data

I was trying to save the output of a Hive query on HDFS but the data got changed. Any idea?
See below the data and the changed one. remove the space before the file name :)
[[Correct]: i.stack.imgur.com/ DLNTT.png
[[Messed up]: i.stack.imgur.com/ 7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive which is one of the available complex types. Internally, Hive separates the elements in an array with the ASCII character 002. If you consult an ascii table, you can see that this is the non printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive will separate every column in a row with ASCII 001, and it will separate map keys/values and structure fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in you create table statement. Be careful though, if you switch the collection items terminator to something like , then any commas in your input will look like collection terminators to Hive.
Unless you need to store the data in human readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as is. If you need to read the HDFS files you can always hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display array items as separated with tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
Now this would explain why you would see ASCII 002 popping up if you were reading the file contents off of HDFS, but it looks like you are seeing it from the Hive Command Line Interface which should be aware of the collection terminators (and therefore use them to separate elements of the array instead of printing them). My best guess there is you have specified the schema wrong and the column of the table results is a string where you meant to make it an array. This would explain why it went ahead and printed the ASCII 002's instead of using them as collection terminators.

Fixed Length Text File to SQL Data Table

I have a text file (~100,000+ rows), where each column is a fixed length and I need to get it into a SQL Server database table. Each one of our clients are required to get this data, but each text file is slightly different so we have to manually go in and adjust the character spacing in a SQL stored procedure.
I was wondering if there is a way that we can use XML/XSD/XSLT instead. This way, I would not have to go in and manually edit the stored procedures.
What we do currently is this:
1.) SQL server stored procedure reads a text file from the disk
2.) Each record is split into an XML element and dumped into a temporary table
3.) Using SQL Server's string manipulation, each element is parsed
4.) Each column is dumped into
For clarification, here are a couple of examples...
One client's text file would have the following:
Name [12 Characters]
Employer [20 Characters]
Income [7 Characters]
Year-Qtr [5 Characters]
JIM JONES HOMERS HOUSE OF HOSE100000 20113
Another client's text file would have the following:
Year-Qtr [5 Characters]
Income [7 Characters]
Name [12 Characters]
Employer [20 Characters]
20113100000 JIM JONES HOMERS HOUSE OF HOSE
They basically all have the same fields, some may have a couple more are a couple less, just in different orders.
Using SQL Server xml processing functions to import a fixed length text file seems like a backwards way of doing things (no offense).
You don't need to build your own application, Microsoft has already built one for you. It's ingeniously called BCP Utility. If needed, you can create a format file that tells BCP Utility how to import your data. The best part is it's ridiculously fast and you can import the data to SQL Server from a remote machine (as in the file doesn't have to be located on the SQL Server box to import it)
To address the fact that you need to be able to change the column widths, I don't think editing the format file would be to bad.
Ideally you would be able to use a delimited format instead of an ever-changing fixed length format, that would make things much easier. It might be quick and easy for you to import the data into excel and save it in a delimited format and then go from there.
Excel, Access, all the flavors of VB and C# have easy-to-use drivers for treating text files as virtual database tables, usually with visual aids for mapping the columns. And reading and writing to SQL Server is of course cake. I'd start there.
100K rows should not be a problem unless maybe you're doing it hourly for several clients.
I'd come across File Helpers a while back when I was looking for a CSV parser. The example I've linked to shows you how you can use basic POCOs decorated with attributes to represent the file you are trying to parse. Therefore you'd need a Customer specific POCO in order to parse their files.
I haven't tried this myself, but it could be worth a look.