Is there a quick way to subset columns in pandas?

I am trying to set up a pandas project that I can use to compare Excel and CSV files and return the differences between them over time. Currently I load the Excel/CSV files into pandas and add a "version" column to each. I add this column because, in my last step, I want the program to create a file containing only what has changed in the "new" version of the file, so that I do not have to update the entire database, only the data points that have changed.
import pandas as pd
old = pd.read_excel('landdata20201122.xlsx')
new = pd.read_excel('landdata20210105.xlsx')
old['version'] = "old"
new['version'] = "new"
I merge the sheets into one and then drop duplicate rows based on all the columns in the original files. I have to pass a subset of columns because, if the program also looks at my added version column, no row will be seen as a duplicate. The statement is listed below:
df2 = df1.drop_duplicates(subset=["UWI", "Current DOI Partners", "Encumbrances", "Lease Expiry Date", "Mineral Leases", "Operator", "Attached Land Rights", "Surface Leases"])
df2.shape
I am wondering if there is a quicker way to specify the subset. The way I currently have it set up, I have to list every column title. Some of my sheets have 100+ columns, so it is a lot of work when I only want to exclude one column. Is there a way to populate all the column titles and remove the ones I do not want looked at? Or is there a way to give drop_duplicates the columns I do NOT want compared, instead of entering all the columns except one?
If I can just list the columns I do not want to compare, I will be able to use the same script for much more of the data I am working with, as I will not have to edit the drop_duplicates statement each time I compare sheets.
Any help is appreciated, thank you in advance!

If I've understood correctly:
Store the headers in a list.
Remove the names you don't want by hand.
Pass the list as the subset argument of drop_duplicates().
If there are more columns to remove than to keep, build the list by hand from the wanted columns instead.
With a list, you won't need to write the column names out every time.
How to iterate over a list:
columns = ['first', 'second', 'third']
for name in columns:
    print(name)
# Output (one value per line): first, second, third
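Applied to the DataFrame from the question, the list-based approach might look like the sketch below. It assumes the only column to ignore is "version"; the sample rows and the names exclude, compare_cols and changed are made up for illustration. Index.difference returns the column labels that are not in the given list (in sorted order, which does not matter for subset).

```python
import pandas as pd

# Hypothetical stand-ins for the two spreadsheets in the question.
old = pd.DataFrame({"UWI": ["A1", "B2"], "Operator": ["Acme", "Beta"]})
new = pd.DataFrame({"UWI": ["A1", "B2", "C3"], "Operator": ["Acme", "Gamma", "Delta"]})
old["version"] = "old"
new["version"] = "new"

# Stack the two versions into one frame, old rows first.
df1 = pd.concat([old, new], ignore_index=True)

# List only the columns to ignore; every other column is compared.
exclude = ["version"]
compare_cols = df1.columns.difference(exclude).tolist()

# The default keep="first" keeps the old copy of any row present in both files.
df2 = df1.drop_duplicates(subset=compare_cols)

# Rows still tagged "new" were added or changed in the new file.
changed = df2[df2["version"] == "new"]
print(changed)
```

Because only exclude has to be edited, the same script could be reused on sheets with different columns, as long as the added column keeps the same name.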

Related

Order of the columns in Apache Zeppelin is wrong when selecting data from a temporary table; how to put a specific column first?

Currently the Scala DataFrame output shows the id value first (although it is chronologically added to the DataFrame last). The other columns appear dynamically, based on the .pivot() function and the data.
When I query the data in the %sql interpreter, the order changes, so the CSV file that I download also has the id column last, which doesn't work for me. I can't simply write a select statement that puts the id column first manually, because I can't control the other columns due to the pivot. Is there any other way to make a specific column go first?
The Scala paragraph is:
resultMean.registerTempTable("mean")
The sql paragraph is:
%sql
select *
from mean
For anyone reading this in the future: the cause of this behavior was misuse of the DataFrames. In Scala, .show() was applied to one DataFrame, while the export to the temp table was applied to another. If you face the same problem, double-check that you apply your methods to the same object.

Reading xlsx into R and creating new header

I'm a newbie to R.
I read in an Excel file for a survey, but I started reading observations from the 3rd row of the Excel file, because the survey download creates two header rows: a first row with the question string (for all questions), followed by a second row with the multiple-choice options (each option gets its own column, except the first option, which is listed in the second row of the same column as its question in the first row).
So now, my dataframe starts with Row 3.
But now I need to create custom variable names, i.e. new variable names for each column, before I manipulate the data further. I'm looking for tips on how best to accomplish this.
What I am thinking:
Create an Excel file with the variable names, and then use this as the header. I'm not quite sure which code I would use to do this.
Code the names as an empty dataframe, and then somehow merge this so that the empty dataframe's column names become the column names for the file I imported.
I would appreciate some suggestions on how best to do this!

Is there any way to exclude columns from a source file/table in Pentaho using "like" or any other function?

I have a CSV file with more than 700 columns. I want just 175 of them to be inserted into an RDBMS table or a flat file using Pentaho (PDI). The source CSV file has variable columns, i.e. columns can be added or deleted, but some specific keywords in the names remain constant throughout. I have the list of keywords that appear in the names of the columns to be excluded, e.g. starts_with("avgbal_"), starts_with("emi_"), starts_with("delinq_prin_"), starts_with("total_utilization_"), starts_with("min_overdue_"), starts_with("payment_received_")
Any column whose name contains one of the above keywords has to be excluded and should not pass on to my RDBMS table or flat file. Is there any way to remove these columns by writing some SQL query in PDI? Selecting the specific 175 columns is not possible, as they are variable in nature.
I think your case is a good fit for metadata injection; you can refer to the example linked below:
https://help.pentaho.com/Documentation/7.1/0L0/0Y0/0K0/ETL_Metadata_Injection
There are two things you need to be careful about:
Maintain a list of the columns you need to push in.
Since your column names change, you may also run into issues with the valid columns that you want to import or work with. To avoid this, make sure you generate the metadata file every time, so that you are sure about the column names you want to push out from the flat file.

Create table schema and load data into a BigQuery table using Google Drive as the source

I am creating a table using Google Drive as the source and Google Sheet as the format.
I selected "Drive" as the value for "Create table from". For the file format, I selected Google Sheet.
I also selected schema auto-detection and input parameters.
It creates the table, but the first row of the sheet is also loaded as data instead of as the table's field names.
Kindly tell me what I need to do to get the first row of the sheet used as the table's column names, not as data.
It would have been helpful if you had included a screenshot of the top few rows of the file you're trying to upload, at least to see the data types you have in there. BigQuery, at least as of when this response was written, cannot differentiate between column names and data rows if both have similar data types when schema auto-detection is used. For instance, if your data looks like this:
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
BigQuery would not be able to detect the column names (at least automatically, using the UI options alone), since all the headers and row data are strings. The "Header rows to skip" option would not help with this.
Schema auto-detection should be able to detect and differentiate column names from data rows when the columns have different data types, though.
You have an option to skip the header row in Advanced options. Simply enter 1 as the number of rows to skip (your first row is where your header is). It will skip the first row and use its values for your header.

Build a Column by applying a formula to an existing column (like in excel)

I am new to the community. Hopefully this was not answered already.
I am trying to add a column to a DataFrame that is computed by a formula from the existing columns. For example, build a series of stock returns based on the stock's closing prices.
I know how to build a column by applying the same operation to every element of another column, but not how to use a column's elements in a formula to create another column.
Thanks for your help.
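A minimal sketch of one way this could be done in pandas, assuming the closing prices live in a column named "close" (a hypothetical name, as are the sample values). Series.shift(1) moves each value down one row, so an element-wise formula can refer to the previous row much like an Excel formula refers to the cell above; pct_change() is the built-in shortcut for simple period-over-period returns.

```python
import pandas as pd

# Hypothetical closing prices; the column name "close" is an assumption.
df = pd.DataFrame({"close": [100.0, 110.0, 99.0]})

# Formula written out: (today - yesterday) / yesterday.
previous = df["close"].shift(1)
df["return_manual"] = (df["close"] - previous) / previous

# Built-in equivalent for simple returns.
df["return"] = df["close"].pct_change()

# The first row has no previous value, so its return is NaN.
print(df)
```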