Read .xlsx computed cell values using openpyxl - openpyxl

I am trying to read the values (not the formulas) from a range of cells in a .xlsx file. If I pass data_only=True to load_workbook, I get this message:
opt/anaconda3/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py:296: UserWarning: Unknown extension is not supported and will be removed
If I leave out data_only, load_workbook works fine, but I get the formulas rather than the computed values. Is there a way for me to get the last computed values?
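For reference, a minimal sketch of reading the last stored values with openpyxl. The message quoted above is a UserWarning rather than an exception, so it can be filtered while loading. openpyxl does not evaluate formulas itself: with data_only=True a formula cell yields whatever value the last application stored when saving, or None if no value was ever stored. The function name read_cached_values and its parameters are illustrative, not part of openpyxl.

```python
import warnings

from openpyxl import load_workbook


def read_cached_values(path, sheet_name, cell_range):
    """Return the last stored values in cell_range as a list of rows.

    Hypothetical helper for illustration; cell_range is an A1-style
    range such as "A1:C10".
    """
    with warnings.catch_warnings():
        # "Unknown extension is not supported" is a UserWarning, not an error.
        warnings.simplefilter("ignore", UserWarning)
        workbook = load_workbook(path, data_only=True)
    sheet = workbook[sheet_name]
    # Formula cells give the value stored at the last save, or None if the
    # file was never calculated by Excel or a similar application.
    return [[cell.value for cell in row] for row in sheet[cell_range]]
```

If the values come back as None, opening the file in Excel and saving it once should store the computed results.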

Related

openpyxl destroys functions on save

I'm trying to save a pandas DataFrame into an existing spreadsheet. I found an excellent answer at Writing Pandas DataFrame to Excel: How to auto-adjust column widths, which is really a continuation of another question *)
The problem, though, is that when I use it and then try to load the spreadsheet, I get a "damaged content" error complaining about a drawing, even though I have none in the spreadsheet, and all functions are gone. Static data are still there.
The log is:
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <logFileName>error171360_05.xml</logFileName>
  <summary>Errors were detected in file 'test.xlsx'</summary>
  <repairedRecords summary="Following is a list of repairs:">
    <repairedRecord>Repaired Records: Drawing from /xl/drawings/drawing1.xml part (Drawing shape)</repairedRecord>
  </repairedRecords>
</recoveryLog>
Any ideas?
Edit: I'm now pretty sure it's not caused by pandas, since just opening the workbook, adding an empty sheet, and saving it removes all the formulas.
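from openpyxl import load_workbook

# file is the path to the existing .xlsx workbook
workbook = load_workbook(file)
try:
    sheet = workbook["Result"]
except KeyError:
    # The sheet does not exist yet, so create it
    sheet = workbook.create_sheet("Result")
# for r in dataframe_to_rows(result, index=False, header=True):
#     sheet.append(r)
workbook.save(file)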
It doesn't produce the error above though.
Edit2: There's a question from 2013 (Openpyxl: Formulas getting removed when saving file) which says openpyxl doesn't support it, with a feature requested to do so. But the link to the feature request doesn't work, so I have no idea whether it is supported now or not.
*) There is a small bug in the function in that answer: sheet_name is a parameter, but the function also tries to look it up in **kwargs, which of course fails, so it gets replaced by a default value even if passed into the function. I can't comment on the question, so maybe @maxU will read this and edit.

1004 copy method of range class failed, error not present on smaller data sets

I have a macro which "chops up" spreadsheets, allowing the user to select the columns they want to keep, as well as selection criteria to filter columns by values and date ranges.
The macro works fine; however, when I tried to process a 190MB file I got an
error: 1004 copy method of range class failed
The line it failed on was:
Selection.SpecialCells(xlCellTypeVisible).Copy Destination:=Export.Sheets("Sheet1").Range("A1")
I've tried splitting the code up and it still didn't work. (see below)
Selection.SpecialCells(xlCellTypeVisible).Copy
Export.Sheets("Sheet1").Range("A1").Select
Export.Sheets("Sheet1").Range("A1").Paste
The full file can be found here:
https://quickfileshare.org/9th/Big_Choppa_-_V4_JB1_Test.xlsm

Broken rows when reading inlineStr cells with EPPlus C#

I have an xlsx file which opens successfully in Excel and which can be parsed with Excel libraries other than EPPlus. We are likely to continue using EPPlus, so it would be nice to get some advice on this issue.
The data in the Excel file is plain text without formatting.
When parsing, say, a 3x3 sheet with EPPlus in C#, the parsed data is fragmented in memory in the following way (empty cells are added to every row, so the total size is 3x9 or something):
r1c1 r1c2 r1c3
r2c1 r2c2 r2c3
r3c1 r3c2 r3c3
instead of 3x3 array
r1c1 r1c2 r1c3
r2c1 r2c2 r2c3
r3c1 r3c2 r3c3
When opening the xlsx archive with a zip viewer, it seems that xl\worksheets\sheet.xml contains the following data.
<x:row>
  <x:c t="inlineStr">
    <x:is>
      <x:t>Data in cell</x:t>
    </x:is>
  </x:c>
  ..
</x:row>
So no row/column identifiers are present in the snippet above. Maybe that is the root cause of the problem?
Another thing to notice is that when I open and save the same file in Excel without modifications, the file size increases and the sheet data seems to be moved from sheet.xml to sharedstrings.xml. After a successful save in Excel, only row/column indices are present in sheet1.xml and the file can be parsed properly with EPPlus.
The problem in this case was that there were no row/column identifiers present in the sheet data. The incorrect Excel file was created by a custom program using the Open XML SDK.
According to the Open XML SDK guidelines (https://msdn.microsoft.com/en-us/library/office/gg278309.aspx), missing row/column identifiers are an error against the Excel data format, so the problem was not in the EPPlus library.
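To illustrate what the missing identifiers mean in practice, here is a rough sketch in Python (standard library only) that reads inlineStr cells from a sheet XML string and, where a row or cell has no r attribute, falls back to its position in the document. The function names are hypothetical, and the positional fallback is an assumption about how the producing program laid out its cells, not something the file format guarantees.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, bound to the "x" prefix used in the snippet.
NS = {"x": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def column_letters(index):
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def read_inline_strings(sheet_xml):
    """Map cell references to inline string text.

    Rows and cells without an "r" attribute get a reference inferred from
    their position (an assumption made for this sketch).
    """
    root = ET.fromstring(sheet_xml)
    cells = {}
    next_row = 1
    for row in root.iterfind(".//x:row", NS):
        row_number = int(row.get("r", next_row))
        next_row = row_number + 1
        for position, cell in enumerate(row.findall("x:c", NS), start=1):
            reference = cell.get("r") or f"{column_letters(position)}{row_number}"
            cells[reference] = cell.findtext("x:is/x:t", default="", namespaces=NS)
    return cells
```

A reader that relies on the r attributes has nothing to go on in such a file, which is consistent with the fragmented result described above.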

grab and filter from more than 255 columns from a huge closed workbook

I have a huge workbook (0.6 million rows and 315 columns) whose column names I need to grab into an array. Due to the huge size, I don't want to open and close the workbook to copy the 1st row of the range. Also, I only want to grab the columns from the 1st row whose names begin with the word "Global ".
Can anyone help with a short code example on how to go about doing this? Please note I have tried ADOX, ADO etc., but both show the 255-column limitation. I also don't want to open the workbook, but pull the required "Global " columns from the 315 columns into an array.
Any help is most appreciated.
You can copy the first row of your target by opening a new workbook, and in A1 use this formula:
='C:\PATH_TO_TARGET\[TARGET_FILE_NAME.xlsx]WORKSHEET_NAME'!A1
Note that PATH+FILENAME+WORKSHEET is enclosed in single quotes, the FILENAME is enclosed in square brackets, and an exclamation separates the cell reference.
Then copy/paste or fill right to get the next 314 columns. Note: this formula will return zero for empty target cells.
Once you have the column headings, you can copy and Paste Special > Values if you want to break the links to the closed workbook.
Hope that helps
You could use the Python programming language.
While it does not natively work with XLSX files, you just have to install the openpyxl external module from here: https://pypi.python.org/pypi/openpyxl
(You will also have to install Python, of course; just download it from www.python.org.)
It will make working with your data in an interactive Python session a piece of cake, and the time to open the workbook without having to load the Excel interface should be a fraction of what you are expecting. (I think it will have to fit in your memory, though.)
This is all I had to type in an interactive Python 2 session to open a workbook and retrieve the column names that start with "bl":
import openpyxl
a = openpyxl.load_workbook("bla.xlsx")
[cell.value for cell in a.worksheets[0].rows[0] if cell.value.startswith("bl")]
output:
Out[8]: [u'bla', u'ble', u'bli', u'blo', u'blu']
The last input line requires one to know Python to be understood, so here is a summary of what happens: Python is a language very fond of working with sequences, and the openpyxl library gives you your workbook as just that:
an object which is a sequence of worksheets, each worksheet having a rows attribute which is a sequence of all rows in the sheet, and each row being a sequence of cells. Each cell has a value attribute which is the text within it.
The inline for statement is the compact form, but it could be written as a multi-line statement:
In [10]: for cell in a.worksheets[0].rows[0]:
   ....:     if cell.value.startswith("bl"):
   ....:         print cell.value
   ....:
bla
ble
bli
blo
blu
Keep in mind that by exploring Python a bit deeper, you can manipulate your data programmatically, which will be easier than doing it interactively given a data set this size. You can even use Python itself to load selected contents into an SQL database (including its built-in, single-file database, sqlite), where sophisticated indexes and queries can make working with your data a breeze.
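With a current openpyxl on Python 3, the session above would look a little different, because the rows attribute can no longer be indexed like a list. The following is a sketch under that assumption. It uses read_only=True so that the rest of the sheet need not be held in memory, and it skips empty or non-text header cells. The function name headers_with_prefix and its parameters are made up for this example.

```python
from openpyxl import load_workbook


def headers_with_prefix(path, prefix, sheet_index=0):
    """Return the first-row values that are strings starting with prefix.

    Hypothetical helper: only the first row is requested from the sheet.
    """
    workbook = load_workbook(path, read_only=True)
    try:
        sheet = workbook.worksheets[sheet_index]
        # values_only=True yields plain values instead of cell objects.
        rows = sheet.iter_rows(min_row=1, max_row=1, values_only=True)
        first_row = next(rows, ())
        return [
            value
            for value in first_row
            if isinstance(value, str) and value.startswith(prefix)
        ]
    finally:
        # Read-only workbooks keep the file open until closed explicitly.
        workbook.close()
```

For the question as asked, the call would be something like headers_with_prefix("bla.xlsx", "Global ").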

Convert xls File to csv, but extra rows added?

So, I am trying to convert some xls files to csv, and everything works great except for one part. The SaveAs function in the Excel interop seems to export all of the rows, including blank ones. I can see these rows when I look at the file in Notepad (all of the rows I expect, 15 rows with two single quotes, then the rest are just blank). I then have a stored procedure that takes this csv and imports it into the desired table; this works on spreadsheets that have been converted to csv manually (e.g. open, File --> Save As, etc.).
Here is the line of code I am using for SaveAs. I have tried xlCSV, xlCSVWindows, and xlCSVMSDOS as my file format, but they all do the same thing.
wb.SaveAs(aFiles(i).Replace(".xls", "B.csv"), Excel.XlFileFormat.xlCSVMSDOS, , , , False) 'saves a copy of the spreadsheet as a csv
So, is there some additional step or setting I need to keep the extraneous rows from showing up in the csv?
Note that if I open this newly created csv, and then click Save As, and choose csv, my procedure likes it again.
When you create a CSV from a Workbook, the CSV is generated based upon your UsedRange. Since the UsedRange can be expanded simply by having formatting applied to a cell (without any contents) this is why you are getting blank rows. (You can also get blank columns due to this issue.)
When you open the generated CSV all of those no-content cells no longer contribute to the UsedRange due to having no content or formatting (since only values are saved in CSVs).
You can correct this issue by updating your used range before the save. Here's a brief sub I wrote in VBA that would do the trick. This code would make you lose all formatting, but I figured that wasn't important since you're saving to a CSV anyway. I'll leave the conversion to VB.Net up to you.
Sub CorrectUsedRange()
    Dim values
    Dim usedRangeAddress As String
    Dim r As Range
    'Get UsedRange Address prior to deleting Range
    usedRangeAddress = ActiveSheet.UsedRange.Address
    'Store values of cells to array.
    values = ActiveSheet.UsedRange
    'Delete all cells in the sheet
    ActiveSheet.Cells.Delete
    'Restore values to their initial locations
    Range(usedRangeAddress) = values
End Sub
Tested your code with VBA and Excel2007 - works nice.
However, I could replicate it somewhat, by formatting an empty cell below my data-cells to bold. Then I would get empty single quotes in the csv. BUT this was also the case, when I used SaveAs.
So, my suggestion would be to clear all non-data cells, then to save your file. This way you can at least exclude this point of error.
I'm afraid that may not be enough. It seems there's an Excel bug that makes even deleting the non-data cells insufficient to prevent them from being written out as empty cells when saving as csv.
http://answers.microsoft.com/en-us/office/forum/office_2010-excel/excel-bug-save-as-csv-saves-previously-deleted/2da9a8b4-50c2-49fd-a998-6b342694681e
Another way, without a script: hit Ctrl+End. If that lands in a row AFTER your real data, select the rows from the first row after your data down to at least the row it landed on, right click, and choose "Clear Contents".
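A different approach from the ones above is to leave the export alone and clean the CSV afterwards. The sketch below, in Python with only the standard library, drops trailing rows in which every field is empty. It does not touch the UsedRange, and treating "blank" as "all fields empty or whitespace" is an assumption; rows that contain literal quote characters would need their own rule. The function name is illustrative.

```python
import csv
import io


def strip_trailing_blank_rows(csv_text):
    """Return csv_text without trailing rows whose fields are all empty."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    # Pop from the end only, so blank rows inside the data are preserved.
    while rows and all(field.strip() == "" for field in rows[-1]):
        rows.pop()
    output = io.StringIO()
    csv.writer(output, lineterminator="\n").writerows(rows)
    return output.getvalue()
```

Only rows at the end of the file are removed, so intentional blank lines between data rows survive.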