read_excel: only read cells formatted as a table - pandas

This is how I currently import the data from an Excel file in which all rows that contain information are formatted as a table. Row 13 is the header.
df = pd.read_excel('path_to_file', skiprows=12, usecols="C:T", na_values='N/A')
My question: given that I skip the rows (skiprows) and columns (usecols) without information, does pandas only read down to the last cell that contains a value (currently 65,500 rows, but increasing every day), or does it always read a fixed number of rows (e.g. 1M)?
Is there any way to improve performance, or to read only the necessary rows (cells with values)?
Thank you!
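One way to check what pandas actually returns is to measure it on a small workbook. The following is only a minimal sketch, assuming openpyxl is installed as the Excel engine; the in-memory workbook, the column names and the narrower "C:D" range are hypothetical stand-ins for the real file. It reads with the same skiprows/usecols pattern, prints the shape of the result, and then uses the nrows parameter of read_excel to cap the number of data rows parsed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real workbook: the header sits on Excel row 13
# (startrow=12 is zero-based) and the data starts in column C (startcol=2).
buffer = io.BytesIO()
data = pd.DataFrame({"id": range(5), "value": ["a", "b", "N/A", "d", "e"]})
data.to_excel(buffer, index=False, startrow=12, startcol=2)

# Same pattern as in the question, with a narrower column range.
buffer.seek(0)
df = pd.read_excel(buffer, skiprows=12, usecols="C:D", na_values="N/A")
print(df.shape)  # rows and columns actually returned

# nrows caps how many data rows are parsed after the header row.
buffer.seek(0)
head = pd.read_excel(buffer, skiprows=12, usecols="C:D", na_values="N/A", nrows=3)
print(head.shape)
```

Comparing df.shape against the number of filled rows in the real file shows whether extra empty rows are coming back; if they are, passing nrows is one way to limit the read.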

Related

Is there a quick way to subset columns in PANDAS?

I am trying to set up a pandas project that I can use to compare Excel and CSV files over time and return the differences. Currently I load the Excel/CSV files into pandas and assign them a "version" column, because in my last step I want the program to create a file containing only what has changed in the "new" version file, so that I only have to update the data points that have changed rather than the entire database.
old = pd.read_excel('landdata20201122.xlsx')
new = pd.read_excel('landdata20210105.xlsx')
old['version'] = "old"
new['version'] = "new"
I merge the sheets into one and then drop duplicate rows based on all the columns in the original files. I have to subset the data because if the program looks at my added version column, the row will not be seen as a duplicate. The statement is listed below:
df2 = df1.drop_duplicates(subset=["UWI", "Current DOI Partners", "Encumbrances", "Lease Expiry Date", "Mineral Leases", "Operator", "Attached Land Rights", "Surface Leases"])
df2.shape
I am wondering if there is a quicker way to subset the data. The way I currently have it set up, I have to list each column title. Some of my sheets have 100+ columns, so it is a lot of work when I only want to exclude one column. Is there a way to populate all the column titles and remove the ones I do not want compared? Or is there a way to pass drop_duplicates the columns I DO NOT want compared, instead of entering every column except one?
If I can just list the columns I do not want to compare, I will be able to use the same script for much more of the data that I am working with as I will not have to edit the drop_duplicates statement each time I compare sheets.
Any help is appreciated, thank you in advance!
If I've understood correctly:
Store the headers in a list.
Remove the names you don't want by hand.
Inside the subset of drop_duplicates(), place the list.
If there are more columns to remove than to keep, add the wanted columns to the list by hand instead.
With a list, you won't need to write them every time.
How to iterate a list:
columns = ['first', 'second', 'third']  # avoid shadowing the built-in name "list"
for name in columns:
    print(name)
# Output (one item per line): first, second, third
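The list of headers does not have to be typed out by hand: it can be built from the DataFrame's own columns, minus the ones to ignore. The sketch below assumes two tiny made-up frames in place of the real sheets and combines them with pd.concat; keep=False (drop every row that has a match) is also an assumption about the desired result, not something stated in the question.

```python
import pandas as pd

# Hypothetical miniature versions of the two sheets.
old = pd.DataFrame({"UWI": ["A1", "B2"], "Operator": ["Acme", "Bolt"]})
new = pd.DataFrame({"UWI": ["A1", "B2", "C3"], "Operator": ["Acme", "Volt", "Core"]})
old["version"] = "old"
new["version"] = "new"

combined = pd.concat([old, new], ignore_index=True)

# List only the columns to leave out; every other column is compared.
exclude = ["version"]
compare_cols = [col for col in combined.columns if col not in exclude]

# keep=False drops all copies of a duplicated row, so only rows that
# differ between the two versions remain.
changed = combined.drop_duplicates(subset=compare_cols, keep=False)
new_only = changed[changed["version"] == "new"]
print(new_only)
```

Because only the exclude list changes from sheet to sheet, the same script can be reused on files with different or much longer column sets.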

Reading xlsx into R and creating new header

I'm a newbie to R.
I read in an Excel file for a survey, but I started reading observations from the 3rd row of the Excel file, because the survey download puts the question string in the first row (for all questions), followed by a second row of multiple-choice options (each option gets its own column, except the first option, which is listed in the second row in the same column as the question in the first row).
So now, my dataframe starts with Row 3.
But now I need to create custom variable names, i.e. a new variable name for each column, before I manipulate the data further. I'm looking for tips on how best to accomplish this.
What I am thinking:
Create an Excel file with the variable names, and then use this as the header. I'm not quite sure which code I would use to do this.
Code the names as an empty dataframe, and then somehow merge it so that the empty dataframe's column names become the column names of the file I imported.
I would appreciate some suggestions on how best to do this!

Printing a large dataframe across pages

I need to print a large table across multiple pages; the table contains both header rows and a "header" column. An example of what I would like to achieve is:
https://github.com/EricG-Personal/table_print/blob/master/table.png
I do not want the contents of any cell to be clipped, split between pages, or auto-scaled to be smaller. Each page should have the appropriate header rows and each page should have the appropriate header column (the ID column).
The only aspect not depicted is that some of the cells would contain image data.
Can I achieve this with pandas?
What possible solutions do I have when attempting to print a large dataframe?
Pandas has no such capabilities; it wasn't designed for that in the first place.
I'd suggest converting your DataFrame to an Excel sheet and printing that using MS Excel. To the best of my knowledge, it has everything you need.

How can I merge header cells in Excel writer in Pentaho?

I am trying to merge the header cells of several columns into one cell, but when I do that my data also ends up in one column. I want my output to look like the attached screenshot. Kindly help me with this.
Are your columns variable, or do you always have the same output schema?
If it's fixed, I would use a template where the headers are fixed and start populating from row 5.
Google Spreadsheet input
If you are using the Spreadsheet input, that is not possible in the step.
What I usually do in that kind of situation is create a row with my headers and hide it so the user doesn't get confused by two headers. Then the step will get the result correctly, using the column names provided in the first row. (You can use a formula like =b3 there so it changes with the real header. No problem.)
Excel input
If you are using the Excel input step, you can set the sheet to be read from row 2, column 0, and it should work fine. =)

Advanced data import from Excel sheet

Currently I'm trying to import data from one Excel sheet into another.
I'm going through the cells in 13 columns, and if the value in one of them exceeds X, I need to copy that row and the 719 following rows into another Excel sheet.
Does anyone have tips on how to do so?
I'm not really experienced with this, but I'm trying to simplify the work of my service engineers...
Thanks so much for your answers.
I found a solution:
Checking the maximum value in each row
With the function "delete row based on cell value", I delete all rows where the maximum value is smaller than desired
I transfer the next 720 rows to another sheet with simple addressing
Not as elegant as I'd imagined, but it works
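For comparison, the same idea (row maximum, threshold check, then a block of 720 rows) could be sketched in pandas rather than in Excel. This is only an illustration under assumptions: the sensor_1 to sensor_13 column names, the threshold value and the generated data are all made up, and the real sheet would first have to be loaded into a DataFrame.

```python
import pandas as pd

# Hypothetical data: 13 measurement columns, 1000 rows of zeros with one spike.
columns = [f"sensor_{i}" for i in range(1, 14)]
df = pd.DataFrame(0.0, index=range(1000), columns=columns)
df.loc[100, "sensor_4"] = 9.5

threshold = 5.0   # the "X" from the question (assumed value)
block_size = 720  # the triggering row plus the 719 rows after it

# Maximum across the 13 columns for each row, then the rows that exceed X.
row_max = df[columns].max(axis=1)
hits = row_max[row_max > threshold]

if hits.empty:
    block = df.iloc[0:0]  # nothing exceeded the threshold
else:
    start = df.index.get_loc(hits.index[0])  # position of the first hit
    block = df.iloc[start:start + block_size]

print(len(block))
```

The resulting block could then be written to a separate sheet, for example with DataFrame.to_excel, which avoids deleting rows from the source data.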