OpenPyXL - find last row in existing excel file [duplicate] - openpyxl

My query is to do with a function that is part of a parsing script Im developing. I am trying to write a python function to find the column number corresponding to a matched value in excel. The excel has been created on the fly with openpyxl, and it has the first row (from 3rd column) headers that each span 4 columns merged into one. In my subsequent function, I am parsing some content to be added to the columns corresponding to the matching headers. (Additional info: The content I'm parsing is blast+ output. I'm trying to create a summary spreadsheet with the hit names in each column with subcolumns for hits, gaps, span and identity. The first two columns are query contigs and its length. )
I had initially written a similar function for xlrd and it worked. But when I try to rewrite it for openpyxl, I find that the max_row and max_col function wrongly returns a larger number of rows and columns than actually present. For instance, I have 20 rows for this pilot input, but it reports it as 82.
Note that I manually selected the empty rows & columns and right clicked and deleted them, as advised elsewhere in this forum. This didn't change the error.
def find_column_number(x):
col = 0
print "maxrow = ", hrsh.max_row
print "maxcol = ", hrsh.max_column
for rowz in range(hrsh.max_row):
print "now the row is ", rowz
if(rowz > 0):
pass
for colz in range(hrsh.max_column):
print "now the column is ", colz
name = (hrsh.cell(row=rowz,column=colz).value)
if(name == x):
col = colz
return col
The issue with max_row and max_col, has been discussed here https://bitbucket.org/openpyxl/openpyxl/issues/514/cell-max_row-reports-higher-than-actual I applied the suggestion here. But the max_row is still wrong.
for row in reversed(hrsh.rows):
values = [cell.value for cell in row]
if any(values):
print("last row with data is {0}".format(row[0].row))
maxrow = row[0].row
I then tried the suggestion at https://www.reddit.com/r/learnpython/comments/3prmun/openpyxl_loop_through_and_find_value_of_the/, and tried to get the column values. Once, again the script takes into account the empty columns and reports a higher number columns than actually present.
for currentRow in hrsh.rows:
for currentCell in currentRow:
print(currentCell.value)
Can you please help me resolve this error, or suggest another method to achieve my aim?

As noted in the bug report you linked to there's a difference between a sheet's reported dimensions and whether these include empty rows or columns. If max_row and max_column are not reporting what you want to see then you will need to write your own code to find the first completely empty. The most efficient way, of course, would be to start from max_row and work backwards but the following is probably sufficient:
for max_row, row in enumerate(ws, 1):
if all(c.value is None for c in row):
break

I confirm the bug found by the OP. I found newer posts reporting max_row being too large.
This bug cannot be fixed.
In my case, it appears when I set the value of all cells in a worksheet to None.
After this operation, the worksheet still reports the old dimensions.
A call to ws.calculate_dimensions() does not change anything.
Closing and restarting excel still has openpyxl report the same wrong dimensions.
This is a problem because ws.append() starts at ws.max_row, and there is no way to override this behaviour. You end up with a worksheet that is blank and then, somewhere down, the data you appended appears.
The only way I found out that remedies this bug is to delete entire rows by hand in excel. openpyxl then shows the correct max_row.
I found out that this is linked to the member ws._cells not being empty as it should after setting all cells to None. However, the user cannot delete this dictionary as it is a private member.

I have the same behaviour with the latest version 3.0.3 of openpyxl. I use an XLSX file as a template (created from a XLS file), open it, add some data then save it with a different name. I find out that max_row is set to 49 and I don't know why.
However after reading in the online documentation https://openpyxl.readthedocs.io/en/stable/api/openpyxl.worksheet.worksheet.html this line:
Do not create worksheets yourself, use
openpyxl.workbook.Workbook.create_sheet() instead
I created my XLSX template directly from openpyxl simply as follows:
wb = openpyxl.Workbook()
wb.save(filename="template.xslx")
It works fine now (max_row=1). Hope it helps.

When using openpyxl max_row function to get the maximum rows containing the data in the sheet, sometimes it even counts the empty rows, this is because the max_row function returns the maximum row index of the sheet, not the count of rows containing the data.
Example: Let's say an excel/google-sheet file is created with 10 rows of data and 5 rows of data are removed, the max_row function of openpyxl returns maximum rows as 10, as the maximum row index of file will be 10, as the file had contained 10 rows initially.
So to get the maximum rows containing the data in openpyxl
def get_maximum_rows(*, sheet_object):
rows = 0
for max_row, row in enumerate(sheet_object, 1):
if not all(col.value is None for col in row):
rows += 1
return rows
import openpyxl
workbook = openpyxl.load_workbook(<filepath>)
sheet_object = workbook.active
max_rows = get_maximum_rows(sheet_object=sheet_object)

Today I encountered the same. I edited the .xlsx file which I'm using in openpyxl. I deleted all values from the extreme right side column and found that max_column not giving exact max_column. Then I deleted the columns where the cell values were previously deleted (right-click on column 'ID' and delete). Now I find it is reporting correct value.

I used Dharman's approach and solved the problem.
I had an Excel file with more than 100k rows. I had deleted the duplications in this file.
At first, the max_row reported the total row number before the deletion.
I used workbook.save(filename='another_filename.xlsx") method to save the original Excel file to a new one.
Then I used the openpyxl to open the new file (another_filanem.xlsx). The max_row reports the correct number now.

in general max_row and max_col will make your script so slow to run, maybe it is better to detect a None and store the row or col in case.

Here is how I find the max column and max row by simply looping through the Excel sheet. By using this code, you can compare both the result from the Python and the loop.
from openpyxl import load_workbook
wb = lw("Test.xlsx")
sheet = wb["Sheet 1"]
print("Python defined max_column " + str(sheet.max_column))
print("Python defined max_row " + str(sheet.max_row))
def get_maximum_cols():
for i in range(1, 20000):
if sheet.cell(row=2, column= i).value == None:
max_col = i
break
return max_col
def get_maximum_rows():
for i in range(1, 20000):
if sheet.cell(row=i, column = 2).value == None:
max_row = i
break
return max_row
max_cols = get_maximum_cols()
max_rows = get_maximum_rows()
print('max column ' + str(max_cols))
print('max row ' + str(max_rows))
wb.save("Test.xlsx")

Related

grab and filter from more than 255 columns from a huge closed workbook

i have a huge workbook (0.6 million rows) and 315 columns whose column names i need to grab into an array. due to the huge size, i don't want to open and close the workbook to copy the 1st row of the range. Also, I want to only grab certain columns from the 1st row that begin with the word "Global ".
can anyone help with short code example on how to go about doing this? please note i have tried ADOX, ADO etc but both show the 255 column limitations. I also dont want to open the workbook, but pull the required "Global " columns from the 315 columns into an array.
any help is most appreciated.
You can copy the first row of your target by opening a new workbook, and in A1 use this formula:
='C:\PATH_TO_TARGET\[TARGET_FILE_NAME.xlsx]WORKSHEET_NAME'!A1
Note that PATH+FILENAME+WORKSHEET is enclosed in single quotes, the FILENAME is enclosed in square brackets, and an exclamation separates the cell reference.
Then copy/Paste or fill right to get the next 314 columns. Note: this formula will return zero for empty target cells.
Once you have the column heading you can copy/paste_special_values if you want to destroy the links to the closed workbook.
Hope that helps
You could use the Python programing language.
While it does not actively works with XLSX fiels, you just have to install the openpyxl external module from here: https://pypi.python.org/pypi/openpyxl -
(You will also have to install Python. of course - just download it from www.python.org)
It will make working with your data in an interactive Python session a piece of cake, and the time to open the workbook without having to load the Excel interface should be a fraction of what you are expecting. (I think it will have to fit in your memory, though).
But this is all I had to type, in an interactive Python2 session to open a workbook, and retreive the column names that start with "bl":
import openpyxl
a = openpyxl.load_workbook("bla.xlsx")
[cell.value for cell in a.worksheets[0].rows[0] if cell.value.startswith("bl")]
output:
Out[8]: [u'bla', u'ble', u'bli', u'blo', u'blu']
The last input line requires on to know Python to be understood, so, here is a summary of what happens: Python is a language very fond of working with sequences - and the openpyxl libray gives your workbook as just that:
an object which is a sequence of worksheets - each worksheet having a rows attribute which has a sequence of all rows in the sheet, and each row bein a sequence of cells. Each cell has a value attribute which is the text within it.
The inline for statement is the compact form, but it could be written as a multiple line statement as:
In [10]: for cell in a.worksheets[0].rows[0]:
....: if cell.value.startswith("bl"):
....: print cell.value
....:
bla
ble
bli
blo
blu
Keep in mind that by exploring Python a bit deeper, you can programatically manipulate your data in a way that will be easier than ininteractivelygiven a data-set this size - and you can even use Python itself to drop select contents to an SQL database, (including its bult-in, single-file database, sqlite), where sophisticated indexes and queries can make working with your data a breeze)

Concatenate entries from all incomplete rows with entries in complete rows above them

I converted a few files containing huge price lists from *.pdf to *.xls format using an online file format converter. However the conversion didn't give the desired result and more cleanup work is needed on the file. Have tried various different approaches using the macro recorder and tack overflow and have failed.
I need a macro that does the following cleanup work on my data.
Loops through rows in the selected data and Search for incomplete rows that are missing an entry in one or more cells.
Concatenate the text in these incomplete rows with the cell in the same column in the first complete row above it.
Example;
If row A contains all entries, but row B is missing entry in the Product code column then the entry in the needle description column in row B should concatenate with the needle description in row A
This file contain 10 line of data. Tab 1 shows the data that contains incomplete rows for the first 2 products. Tab 2 shows the form i want it to be in.
http://www.filetolink.com/5e39eaaf00
I'd be very grateful for any help on this for it will save me a lot of head wringing.
I didn't pull up the file because that site has extreme safety risk warnings, but the below code will do what you want as far as I can tell. If not, please clarify and I'll modify as necessary.
My Excel columns are in the following order:
ID, ProductCode, NeedleDescription
If your NeedleDescription isn't the first column to the right of your ProductCode this won't work. Just let me know what order your columns are in and I'll modify as necessary.
Copy the below into a module
Highlight the range you desire to clean up
Run the macro "Cleanse"
It will loop through all rows and fill all empty cells with the concatenation of:
CurrentRowNeedleDescription & PreviousRowNeedleDescription
Sub Cleanse()
'Fill an empty or blank cell in selection
'with formula specified
'First Highlight Affected Range, then run this macro
Dim cell As Range
On Error Resume Next
'Set formula to include in empty cells
'to be the description of existing row plus description of previous row
'Test for empty cell. If empty, fill cell with value given
For Each cell In Selection
If IsEmpty(cell) Then
cell.FormulaR1C1 = "=RC[1]&R[-1]C[1]"
End If
Next
End Sub
Hope that helps!

Lookup function in multiple sheets data

I have multiple sheets of data and I want to make it in one sheet (All of them are in the same workbook). Link to the excel file.
I tried to use Hlookup function in excel file, something like below:
=HLOOKUP("University",Sheet1!$A$1:$G$2, 2, FALSE).
But, since I have more than 100 sheets of data, I want to find a way to drag the function and auto generate the function below the 2nd row. I have tried to use indirect function by setting a reference column in front as below but cannot deal with it.
=HLOOKUP("University", 'INDIRECT(A3)'!$A$1:$G$2, 2, FALSE)
My next option is VB code. But, I am new to VB. Anybody can help on it?
Place your individual sheet names in column H of the Summary sheet and the row number in column I (as helper columns) and write this formula in cell A2 of the summary sheet.
=IFERROR(HLOOKUP(A$1,INDIRECT($H2&"!A1:G"&$I2),$I2,0),)
and drag to column F and down for as many sheet rows combos you have. I used 10 rows but you can obviously make it longer or shorter as neeed.
When you are done you can filter on 0 in column A and remove any lines with no data.
If your sheet names have spaces in them, you'll need to adjust the INDIRECT formula to this:
INDIRECT("'"&$H2&"'!A1:G"&$I2)
best way would be "defined names" + INDIRECT + HLOOKUP (or LOOKUP) like:
defined names
name: SList
formula: =MID(TRANSPOSE(GET.WORKBOOK(1))&T(NOW()),FIND("]",TRANSPOSE(GET.WORKBOOK(1))&T(NOW()))+1,255)
formula in cells: (this in A2 then simply autofill to G2 and thenn everything down) (you'll get a row with 0's between the sheets, which can be filtered out or deleted later (copy/paste values))
=IFERROR(HLOOKUP(A$1,INDIRECT("'"&INDEX(SList,COUNTIF($A$1:$A1,0)+2)&"'!$A:$G"),$H2,0),"")
Set H2 to 2 and for H3: (autofill down from H3)
=MAX(($H2+1)*($A2>0),2)
works perfectly for me LINK
No manual typing of sheetnames or something like that (only Column H:H as helper). Youll get rows's with 0's every time a new sheet is selected which can be filtered out. (or if you copy/paste values also can be deleted)
the +2 at ...st,COUNTIF($A$1:$A1,0)+2)&... simply tells to start with sheet 2 (if summary is the first). You may change it to +1 if you want to lookup starting with the first sheet.
Assuming you already have all 100+ sheet names typed out in column A, this will work whether or not you have spaces in the sheet names:
=HLOOKUP("University", OFFSET(INDIRECT(ADDRESS(1,1,1,1,A2)),0,0,2,7),2,FALSE)

VBA Hide Row in table if Specified Column is Empty

I am trying to add to a macro I have that will hide every row that has no text in a column named Authorization. Please see the code I have below, I thought this may be on the right track but it does not hide any rows.
Cells.EntireRow.Hidden = False
For Each cell In Range("Authorization").End(xlUp)
If cell = "" And cell.Offset(1, 0) = "" Then cell.EntireRow.Hidden = True
Next cell
Edited to add how to define a dynamic named range
It is the fact that you have set the whole column to the name "Authorisation" that I think makes your code freeze, because the whole column is 1 million rows (if you have 2007 or above), and the code will still check even blank rows, so its doing it 1 million times. 1 Option is to rather than set it to the whole column, you could use a "Dynamic Named Range" which will expand and grow as data is added. There are several different formulas to do this, but based on the fact your data may contain blanks, this version of the formula will expand down to the last populated row in the column. My example uses colum A as you havent specified what column you are using, so change A to suit your needs.
You need to open the Names manager, from the Formulas tab
From the dialog box, find your "Authorisation" name.
Select it and you should see its current formula at the bottom of the dialog box, replace that with the following formula:
=OFFSET(Sheet1!$A$3,0,0,MATCH("*",Sheet1!$A:$A,-1)-2,1)
In the above formula:
Sheet1 is my sheet, replace it with yours a needed
$A$3 is the starting row of the name, so based on your comments, have set this as column A row 3
0,0, Are defaults you should not need to change
$A$A$ is the column it is counting values, so change as required
-1 is a default, leave as is
-2 is subtracting 2 from the count because we are starting on row 3, so if you change the starting row, change this
the last 1, defined how many columns your named range covers, in your example it is just 1, so this should not need changing.
Once you have defined the name in this way, the code below should work a lot quicker as it will only loop through down to the last row of entered data. There is one possible issue I can see with this and that is if the very last cell in column A is blank, but the rest of the row isn't, this will miss out the last row. I could fix this by using a different column to count what constitutes the last row, but need to know whicj column would always have a value in it.
< Original answer and code>
not sure you code matches the description of what you want it to do, namely you seem to be trying to check the row beneath the current cell as well, is this what you really wanted? Anyhow your syntax is slightly wrong. I have written and tested this and it works, I have swapped your offset around so my code is checking the cell in the named range "Authorisation" and then also checking the cell to the right. Amend to suit your needs
Sub test()
Dim c As Range
For Each c In Range("Authorisation").Cells
If c.Value = "" And c.Offset(0, 1).Value = "" Then c.EntireRow.Hidden = True
Next c
End Sub

Column references in formulas

I am a little stuck at the moment. I am working on an array of data and need to find a way to input column numbers into formulas.
-I have used the match function to find the corresponding column number for a value.
ex. "XYZ" matched with Column 3, which is equivalent to C1:Cxxxxxx
-now for inputing the C1:Cxxxxxx into a formula to get data for that particular column, I would like to be able to directly reference the Column 3 part, because I plan on using this workbook in the future and the column needed to run the calculation may or may not be column 3 the next time I use it.
- is there any way to tell excel to use a formula to tell excel which column to use for an equation?
so a little more detail, I have the equation
=AND(Sheet3!$C$1:$C$250000=$A$4,Sheet3!$B$1:$B$250000=$B$4)
instead of specifying to use column C, is there a way to use a formula to tell it to use C?
EDIT: more additional info;
"i am basically running the equivalent of a SQL where statement where foo and bar are true, I want excel to spit out a concatenated list of all baz values where foo and bar are true. ideally i would like it to ONLY return baz values that are true, then I will concat them together separately. the way I got it now, the expression will test every row separately to see if true; if there is 18K rows, there will be 18K separate tests.. it works, but it's not too clean. the goal is to have as much automated as possible. *i do not want to have to go in and change the column references every time I add a new data arra*y"
Thanks
You can use INDEX, e.g. if you have 26 possible columns from A to Z then this formula will give you your column C range (which you can use in another formula)
=INDEX(Sheet3!$A$1:$Z$250000,0,3)
The 0 indicates that you want the whole column, the 3 indicates which column. If you want the 3 can be generated by another formula like a MATCH function
Note: be careful with AND in
=AND(Sheet3!$C$1:$C$250000=$A$4,Sheet3!$B$1:$B$250000=$B$4)
AND only returns a single result not an array, if you want an array you might need to use * like this
=(Sheet3!$C$1:$C$250000=$A$4)*(Sheet3!$B$1:$B$250000=$B$4)
You could use ADDRESS to generate the text, you then need to use INDIRECT as you are passing a string rather than a range to the fomula
=AND(INDIRECT(ADDRESS(1,3,,,"Sheet3") & ":" & ADDRESS(250000,3))=$A$4
,INDIRECT(ADDRESS(1,2,,,"Sheet3") & ":" & ADDRESS(250000,2))=$B$4)
Obviously replace the 3s and 2s in the ADDRESS formulae with your MATCH function you used to get the column number. The above assumes the column for $B$1:$B$25000 is also found using `MATCH', otherwise it is just:
=AND(INDIRECT(ADDRESS(1,3,,,"Sheet3") & ":" & ADDRESS(250000,3))=$A$4
,Sheet3!$B$1:$B$25000=$B$4)
Note a couple of things:
You only need to use "Sheet3" on the first part of the INDRECT
Conditions 3 and 4 in the ADDRESS formula are left as default, this
means they return absolute ($C$1) reference and are A1 style as
opposed to R1C1
EDIT
Given the additional info maybe using an advanced filter would get you near to what you want. Good tutorial here. Set it up according to the tutorial to familiarise yourself with it and then you can use some basic code to set it up automatically when you drop in a new dataset:
Paste in the dataset and then use VBA to get the range the dataset uses then apply the filter with something like:
Range("A6:F480").AdvancedFilter Action:=xlFilterInPlace, CriteriaRange:= _
Sheets("Sheet1").Range("A1:B3"), Unique:=False
You can also copy the results into a new table, though this has to be in the same sheet as the original data. My suggestion would be paste you data into hidden columns to the left and put space for your criteria in rows 1:5 of the visible columns and then have a button that gets the used range for your data, applies the filter and copies the data below the criteria:
Range("A6:F480").AdvancedFilter Action:=xlFilterCopy, CriteriaRange:=Sheets _
Range("H1:M3"), CopyToRange:=Range("H6"), Unique:=False
Button would need to clear the destination cells first etc, make sure you have enough hidden columns etc but it's all possible. Hope this helps.