Broken rows when reading inlineStr cells with EPPlus C# - epplus

I have an xlsx file which opens successfully in Excel and which can be parsed with Excel libraries other than EPPlus. We are likely to continue using EPPlus, so it would be nice to get some advice on this issue.
The Excel data is plain text without formatting.
When parsing, say, a 3x3 sheet with EPPlus in C#, the parsed data is fragmented in memory in the following way (empty cells are added to every row, so the total size is 3x9 or something like that):
r1c1 r1c2 r1c3
r2c1 r2c2 r2c3
r3c1 r3c2 r3c3
instead of a 3x3 array
r1c1 r1c2 r1c3
r2c1 r2c2 r2c3
r3c1 r3c2 r3c3
When opening the xlsx archive with a zip viewer, it seems that xl\worksheets\sheet.xml contains the following data:
<x:row>
<x:c t="inlineStr">
<x:is>
<x:t>Data in cell</x:t>
</x:is>
</x:c>
..
</x:row>
So there are no row/column identifiers in the snippet above. Could that be the root cause of the problem?
Another thing to notice is that when I open the same file in Excel and save it without modifications, the file size increases and the sheet data seems to be moved from sheet.xml to sharedstrings.xml. After a successful save in Excel, only row/column indices are present in sheet1.xml, and the file can be parsed properly with EPPlus.

The problem in this case was that there were no row/column identifiers in the sheet data. The incorrect Excel file was created by a custom program using the Open XML SDK.
According to the Open XML SDK guidelines (https://msdn.microsoft.com/en-us/library/office/gg278309.aspx), missing row/column identifiers are an error against the Excel data format, so the problem was not in the EPPlus library.
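A quick way to confirm this kind of problem before blaming the parser is to count the cells that carry no reference. The following is a minimal Python sketch, not part of the original answer: it assumes the worksheet XML uses the main SpreadsheetML namespace, and the function name and sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by worksheet parts.
NS = {"x": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def count_cells_without_reference(sheet_xml):
    """Return how many <c> elements lack the r (cell reference) attribute."""
    root = ET.fromstring(sheet_xml)
    return sum(1 for c in root.iterfind(".//x:c", NS) if c.get("r") is None)


# Hypothetical sample: the first cell has no reference, the second one does.
SAMPLE = """<x:worksheet xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:sheetData>
    <x:row>
      <x:c t="inlineStr"><x:is><x:t>Data in cell</x:t></x:is></x:c>
      <x:c r="B1" t="inlineStr"><x:is><x:t>Other</x:t></x:is></x:c>
    </x:row>
  </x:sheetData>
</x:worksheet>"""

print(count_cells_without_reference(SAMPLE))
```

For a real file, the XML could be read from the archive with the standard zipfile module. The part name inside the archive (for example xl/worksheets/sheet1.xml) varies per workbook and is an assumption to check first.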

Related

grab and filter from more than 255 columns from a huge closed workbook

I have a huge workbook (0.6 million rows, 315 columns) whose column names I need to grab into an array. Because of its size, I don't want to open and close the workbook just to copy the first row of the range. Also, I only want to grab the columns from the first row whose names begin with the word "Global ".
Can anyone help with a short code example showing how to go about this? Please note that I have tried ADOX, ADO etc., but both run into the 255-column limit. I also don't want to open the workbook, only pull the required "Global " columns out of the 315 columns into an array.
Any help is most appreciated.
You can copy the first row of your target by opening a new workbook, and in A1 use this formula:
='C:\PATH_TO_TARGET\[TARGET_FILE_NAME.xlsx]WORKSHEET_NAME'!A1
Note that PATH+FILENAME+WORKSHEET is enclosed in single quotes, the FILENAME is enclosed in square brackets, and an exclamation mark separates it from the cell reference.
Then copy/paste or fill right to get the next 314 columns. Note: this formula returns zero for empty target cells.
Once you have the column headings, you can copy them and use Paste Special > Values if you want to break the links to the closed workbook.
Hope that helps
You could use the Python programming language.
While it does not natively work with XLSX files, you just have to install the external openpyxl module from here: https://pypi.python.org/pypi/openpyxl
(You will also have to install Python, of course. Just download it from www.python.org.)
It will make working with your data in an interactive Python session a piece of cake, and the time to open the workbook without having to load the Excel interface should be a fraction of what you are expecting. (I think the workbook will have to fit in memory, though.)
This is all I had to type in an interactive Python 2 session to open a workbook and retrieve the column names that start with "bl":
import openpyxl
a = openpyxl.load_workbook("bla.xlsx")
[cell.value for cell in a.worksheets[0].rows[0] if cell.value.startswith("bl")]
output:
Out[8]: [u'bla', u'ble', u'bli', u'blo', u'blu']
The last input line requires one to know Python to be understood, so here is a summary of what happens. Python is a language very fond of working with sequences, and the openpyxl library gives you your workbook as just that:
an object which is a sequence of worksheets, each worksheet having a rows attribute which is a sequence of all rows in the sheet, and each row being a sequence of cells. Each cell has a value attribute which is the text within it.
The inline for statement is the compact form, but it could be written as a multi-line statement:
In [10]: for cell in a.worksheets[0].rows[0]:
   ....:     if cell.value.startswith("bl"):
   ....:         print cell.value
   ....:
bla
ble
bli
blo
blu
Keep in mind that by exploring Python a bit deeper, you can manipulate your data programmatically, which will be easier than working interactively with a data set this size. You can even use Python itself to load selected contents into an SQL database (including its built-in, single-file database, sqlite), where sophisticated indexes and queries can make working with your data a breeze.
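The session above is from Python 2 and an older openpyxl. With current releases, rows is no longer indexable, and header cells that are empty or numeric would break the startswith call. A rough Python 3 sketch of the same idea follows. It is not the answerer's code: the helper names are made up, and it assumes a recent openpyxl for the read_only and values_only options.

```python
def headers_with_prefix(values, prefix):
    """Keep only string headers starting with prefix; skip empty or numeric cells."""
    return [v for v in values if isinstance(v, str) and v.startswith(prefix)]


def first_row_values(path):
    """Read only the first row of the first worksheet, in streaming mode."""
    import openpyxl  # third-party; imported here so the helper above works without it

    wb = openpyxl.load_workbook(path, read_only=True)
    try:
        ws = wb.worksheets[0]
        for row in ws.iter_rows(min_row=1, max_row=1, values_only=True):
            return list(row)
        return []  # the sheet has no rows at all
    finally:
        wb.close()


# Made-up header values; None stands for an empty cell.
print(headers_with_prefix(["Global Sales", None, "Region", 42, "Global Cost"], "Global "))
```

With a real file, the two helpers would be combined as headers_with_prefix(first_row_values("bla.xlsx"), "Global "). Read-only mode is meant to avoid loading all rows into memory, but how fast it is on a 0.6 million row workbook is something to measure rather than assume.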

Excel sheet Deletes the formulas present in the sheet when I open it. How to avoid this?

I'm uploading an Excel file containing several sheets to my server as base64. The server decodes it and processes it by adding data to sheet 5, as column1 and column2 with a certain number of rows. At the time of uploading, sheet 5 contains some specific formulas that make changes in other sheets. When I open the edited file that the server sends back as the response, this prompt appears:
"Excel found unreadable content in 'MyDownloadedExcelData.xlsx'. Do you want to recover the contents of this workbook? If you trust the source of this workbook, click Yes." with Yes and No buttons,
and when I click Yes and open the sheet, all the formulas are deleted.
I see something like
Excel was able to open the file by repairing or removing the unreadable content.
Removed Records :Formula from /xl/calcChain.xml Part
Repaired Records : Cell Information from /xl/worksheets/sheet1.xml part etc
So, how do I make sure the formulas in the sheet are retained?
Using VBA you could have an on close event that pastes values and an on open event that recreates the formulas. Your file would essentially save with static data, but then be used with functions intact.
If this solution is of interest I can help provide some coding framework.

vb.NET SaveAs not saving all Excel data

I have a very strange issue that I cannot seem to find an answer to online.
I have a VB.NET application that creates an Excel file of data (roughly 42,542 rows in total), then saves the file to a folder location and opens it on screen for the user.
Both the on-screen version and the saved version show only 16,372 rows of data, as if the file were being cut off.
When I step through in debug I can see all the rows being added, and if I save manually while debugging, all the rows are saved. Some data seems to get lost on the programmatic save.
I am taking data from 4 record sets and writing each set one after the other, with specific headers for each block on the Excel sheet.
My save line is:
xlWBook.SaveAs(Filename:=sFileName, FileFormat:=Excel.XlFileFormat.xlExcel7)
Would anyone have any ideas as to what this might be?
Older versions of Excel only support 16,384 rows per worksheet. You are saving as xlExcel7 (the Excel 95 format), which has this limitation.
See here for a summary of sizes per version:
https://superuser.com/questions/366468/what-is-the-maximum-allowed-rows-in-a-microsoft-excel-xls-or-xlsx
Change your code to use another format. See here for all the allowed formats: XlFileFormat Enumeration
However, the file format is an optional argument of the SaveAs method, so you could leave it off altogether: "For an existing file, the default format is the last file format specified; for a new file, the default is the format of the version of Excel being used."
Source: WorkBook.SaveAs Method
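To make the limit concrete, here is a small Python sketch of a pre-save check. It is an illustration only: the function name is made up, the xlExcel7 figure is the one quoted in the answer, and the other two entries are the commonly documented row limits for the .xls and .xlsx formats.

```python
# Worksheet row limits by save format (keys are XlFileFormat member names).
MAX_ROWS = {
    "xlExcel7": 16_384,              # Excel 95 format, as stated in the answer
    "xlExcel8": 65_536,              # Excel 97-2003 .xls
    "xlOpenXMLWorkbook": 1_048_576,  # .xlsx
}


def rows_that_do_not_fit(row_count, file_format):
    """Return how many rows exceed the worksheet limit of the given format."""
    return max(0, row_count - MAX_ROWS[file_format])


print(rows_that_do_not_fit(42_542, "xlExcel7"))
print(rows_that_do_not_fit(42_542, "xlOpenXMLWorkbook"))
```

A non-zero result for the chosen format means the export should switch to a newer format before calling SaveAs.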

excel macro to read text file and find matches in cells

I really could use some help.
I have two .txt/csv files that I need to read into my Excel file.
In my Excel file I have a whole column in which each cell contains a string of characters. I need to write a script that finds the matching row in the txt file and copies a neighbouring field from that row.
An example of a single row on my txt file is shown below:
"AB101AA","AB10 1AA","AB101A","AB10 1A","AB101","AB10 1","AB10","AB10","AB","10",394251,806376,,
"AB101AF","AB10 1AF","ABERDEEN","ABERDEENSHIRE",,"ABERDEEN, CITY OF"
My Excel file would have a cell containing, say, "AB101AF". For that cell I want to search through about a million rows of the txt file, find the match, then take the nth field of the matching row and return it to the spreadsheet, for example "ABERDEEN, CITY OF".
I know I haven't explained the issue very well, but any help would be appreciated.
Thank you
Depending upon the size of your text file you could import the file using the GetExternalData option in Excel. This would allow you to load your data into a different Sheet and then use a lookup to your data from the main Sheet. Using Match and/or vlookup should help here.
You could also add a workbook connection to the text file and search using the connection.
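If the lookup is done outside Excel, the matching step amounts to building a dictionary keyed on the first field. Below is a minimal Python sketch using the sample row from the question. It is not a macro: the function name is made up, and the column positions (key in field 0, result in field 5) are assumptions based on that sample.

```python
import csv
import io


def build_lookup(lines, key_col=0, value_col=5):
    """Map the key field of each CSV row to the chosen value field."""
    lookup = {}
    for fields in csv.reader(lines):
        # Skip rows too short to contain both columns.
        if len(fields) > max(key_col, value_col):
            lookup[fields[key_col]] = fields[value_col]
    return lookup


# Sample row from the question; a real run would pass an open file instead.
SAMPLE = io.StringIO('"AB101AF","AB10 1AF","ABERDEEN","ABERDEENSHIRE",,"ABERDEEN, CITY OF"\n')
lookup = build_lookup(SAMPLE)
print(lookup.get("AB101AF"))
```

The file is read once, so each lookup afterwards costs about the same no matter how many rows the text file has. The csv module also handles the quoted comma inside "ABERDEEN, CITY OF".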

Convert xls File to csv, but extra rows added?

So, I am trying to convert some xls files to csv, and everything works great except for one part. The SaveAs function in the Excel interop seems to export all of the rows, including blank ones. I can see these rows when I look at the file in Notepad: all of the rows I expect, then 15 rows with two single quotes, and the rest just blank. I then have a stored procedure that takes this csv and imports it into the desired table. This works on spreadsheets that have been converted to csv manually (e.g. open, File --> Save As, etc.).
Here is the line I am using for SaveAs in my code. I have tried xlCSV, xlCSVWindows, and xlCSVMSDOS as the file format, but they all do the same thing.
wb.SaveAs(aFiles(i).Replace(".xls", "B.csv"), Excel.XlFileFormat.xlCSVMSDOS, , , , False) 'saves a copy of the spreadsheet as a csv
So, is there some additional step or setting I need to keep the extraneous rows from showing up in the csv?
Note that if I open this newly created csv, click Save As, and choose csv, my procedure accepts it again.
When you create a CSV from a Workbook, the CSV is generated based upon your UsedRange. Since the UsedRange can be expanded simply by having formatting applied to a cell (without any contents) this is why you are getting blank rows. (You can also get blank columns due to this issue.)
When you open the generated CSV all of those no-content cells no longer contribute to the UsedRange due to having no content or formatting (since only values are saved in CSVs).
You can correct this issue by updating your used range before the save. Here's a brief sub I wrote in VBA that would do the trick. This code would make you lose all formatting, but I figured that wasn't important since you're saving to a CSV anyway. I'll leave the conversion to VB.Net up to you.
Sub CorrectUsedRange()
    Dim values
    Dim usedRangeAddress As String
    Dim r As Range
    'Get UsedRange Address prior to deleting Range
    usedRangeAddress = ActiveSheet.UsedRange.Address
    'Store values of cells to array.
    values = ActiveSheet.UsedRange
    'Delete all cells in the sheet
    ActiveSheet.Cells.Delete
    'Restore values to their initial locations
    Range(usedRangeAddress) = values
End Sub
I tested your code with VBA and Excel 2007, and it works fine.
However, I could partly replicate the problem by formatting an empty cell below my data cells as bold. Then I got empty single quotes in the csv. But this was also the case when I used Save As manually.
So my suggestion would be to clear all non-data cells, then save your file. That way you can at least rule out this source of error.
I'm afraid that may not be enough. There seems to be an Excel bug where even deleting the non-data cells is not enough to prevent them from being written out as empty cells when saving as csv.
http://answers.microsoft.com/en-us/office/forum/office_2010-excel/excel-bug-save-as-csv-saves-previously-deleted/2da9a8b4-50c2-49fd-a998-6b342694681e
Another way, without a script: hit Ctrl+End. If that lands in a row AFTER your real data, select the rows from the first one after your data down to at least the row it landed on, right-click, and choose "Clear Contents".
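If the extra rows cannot be prevented on the Excel side, the exported csv can be cleaned up before the import step. Here is a minimal Python sketch of that idea. It is an assumption-laden illustration, not something from the thread: the function name is made up, and it treats a row as blank when every field is empty or whitespace.

```python
import csv
import io


def strip_trailing_blank_rows(rows):
    """Drop rows at the end of the data whose fields are all empty."""
    rows = list(rows)
    while rows and all(field.strip() == "" for field in rows[-1]):
        rows.pop()
    return rows


# Made-up export: two real rows followed by two rows of empty fields.
raw = "a,b\n1,2\n,\n,\n"
cleaned = strip_trailing_blank_rows(csv.reader(io.StringIO(raw)))
print(cleaned)
```

Only trailing rows are removed, so blank rows in the middle of the data stay as they are. The cleaned rows could then be written back with csv.writer before the stored procedure runs.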