I want to average several .csv files which have the same dimensions, and save the result as a new file, something like:
df_new = average(df1, df2, df3, ...)
Is there any ready-made function for this?
Assuming all your DataFrame columns are numeric and you want to take the element-wise mean across multiple DataFrames, you can do something like this:
pd.DataFrame(columns=df1.columns, data=np.mean(np.array([df1.values, df2.values, df3.values]), axis=0))
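Since the goal is to read several CSV files and save the average as a new file, here is a rough end-to-end sketch. The filename pattern and output name are assumptions, and it presumes every file has the same shape with numeric columns only.

import glob
import pandas as pd

files = sorted(glob.glob("data_*.csv"))    # hypothetical input files
frames = [pd.read_csv(f) for f in files]

# Stack the frames and average element-wise by row position (level 0 of the index).
df_new = pd.concat(frames).groupby(level=0).mean()
df_new.to_csv("average.csv", index=False)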
I'm trying to extract 10 columns from a CSV file with pandas in Google Colab, so I'm using this line:
firstten = data.iloc[:, 0:10]  # first ten columns of the data frame, all rows
After that I used firstten.count() to count the number of values in each column. What I want to know is the highest of those counts across the ten columns: for example, if one of those columns has 80 different values and the others have fewer, I want to get 80. I used max but it didn't work, so some help with this please. I don't need to know which column, just the number. Also, the same count (e.g. 80) may be repeated in several columns, so keep that in mind for the solution. I also thought about using the sort function, but maybe there is another option.
Thanks for your help. This is the output that I would like to get:
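A minimal sketch of one way to get that number, assuming "highest number of counts" means the largest per-column count among the first ten columns (with an alternative in case distinct values per column are meant instead):

firstten = data.iloc[:, 0:10]

counts = firstten.count()          # non-null values per column
print(counts.max())                # the single highest count, e.g. 80

# If you mean distinct values per column instead:
print(firstten.nunique().max())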
I am trying to set up a pandas project that I can use to compare Excel and CSV files over time and return the differences. Currently I load the Excel/CSV files into pandas and assign each a "version" column. I do this because in my last step, I want the program to create a file containing only what has changed in the "new" version, so that I do not have to update the entire database, only the data points that have changed.
old = pd.read_excel('landdata20201122.xlsx')
new = pd.read_excel('landdata20210105.xlsx')
old['version'] = "old"
new['version'] = "new"
I merge the sheets into one and then drop duplicate rows based on all the columns in the original files. I have to subset the data because if the program looked at my added version column, no row would be seen as a duplicate. The statement is listed below.
df2 = df1.drop_duplicates(subset=["UWI", "Current DOI Partners", "Encumbrances", "Lease Expiry Date", "Mineral Leases", "Operator", "Attached Land Rights", "Surface Leases"])
df2.shape
I am wondering if there is a quicker way to subset the data. The way I currently have it set up, I have to list each column title. Some of my sheets have 100+ columns, so it is a lot of work when I only want it to ignore one column. Is there a way to take all the column titles and remove the ones I do not want compared? Or is there a way to enter the columns I DO NOT want compared in the drop_duplicates command instead of entering all the columns except one?
If I can just list the columns I do not want to compare, I will be able to use the same script for much more of the data that I am working with as I will not have to edit the drop_duplicates statement each time I compare sheets.
Any help is appreciated, thank you in advance!
If I've understood correctly:
Store the headers in a list.
Remove the names you don't want by hand.
Pass that list to the subset argument of drop_duplicates().
If the columns you want to remove outnumber those you want to keep, add the wanted columns to the list by hand instead.
With a list, you won't need to write them every time.
How to iterate a list:
names = ['first', 'second', 'third']
for name in names:
    print(name)
# Output: first, second, third
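For this particular case, a rough sketch of building the subset by exclusion (assuming only the added "version" column needs to be ignored; add any other unwanted names to exclude):

exclude = ["version"]                                    # columns NOT to compare
subset_cols = [c for c in df1.columns if c not in exclude]

df2 = df1.drop_duplicates(subset=subset_cols)
df2.shape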
I would really appreciate some help transforming a flat Excel table into a single-column series.
The Excel data reflects daily IoT sensor data (365 days/year as rows) collected hourly (24 columns of observations by hour). The current presentation from the Excel file is hourly readings (columns) by dates (rows). I'm new to Stack Overflow, so I cannot directly embed images yet.
Before: [image 1]
After: [image 2]
I have successfully imported the .xls file with pd.read_excel; the datetime type is set for the index column, and the file is imported with skiprows/skipfooter.
Problem #1: How to flatten/transpose the multi-column dataframe into a single series by hour/date.
Problem #2: How to create a multiindex that combines the date of the observation with the hour of the observation.
The following images show where the data is and where it needs to go.
I apologize in advance for any lapses in posting protocol. As I mentioned, I'm new and therefore limited in what I can post to make it easier for you to assist.
You can use df.stack() for this.
See the documentation for df.stack() for more information on usage.
Just using df.stack() will also automatically create a MultiIndex with the date and hour.
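A minimal sketch of what that looks like, assuming df has the dates as its index and the 24 hourly readings as integer-labelled columns (the column labels and the datetime combination step are assumptions):

import pandas as pd

stacked = df.stack()                     # Series with a (date, hour) MultiIndex
stacked.index.names = ["date", "hour"]

# Optionally collapse the two levels into a single datetime index:
stacked.index = [d + pd.Timedelta(hours=int(h)) for d, h in stacked.index]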
I would like to create dataframes using Excel spreadsheets as the source data. I need to transform the data series from the format used to store the data in the Excel spreadsheets into the dataframe that is the end product.
I would like to know if anyone has experience using various Python methods to accomplish the following:
-Data series transform: I have a series with one data value per month, but I would like to expand the table of values to one value per day using an index (or perhaps a column of date values). So if table1 has a month-based index and table2 has a daily index, how can I convert table1 values to the table2-based index?
-Dataframe sculpting: the data I am working with is not all the same length; some data sets are longer than others. How can I find the shortest series length among the columns of a multi-column dataframe?
Essentially, I would like to take individual tables from workbooks and combine them into a single dataframe that uses a single index as the basis for their presentation. My workbook tables may have daily, weekly, or monthly data point frequencies, and I would like to build a dataframe that uses a daily index as the basis while including an element for each day in the series that are weekly or monthly.
I am looking at the pandas library, but perhaps there are other libraries with additional functionality that I have overlooked.
Thanks for helping!
For your first question, try something like:
df1 = df1.resample('1d').first()
df2.merge(df1)
That will upsample your monthly or weekly dataframes and merge them with your daily dataframe. Take a look at the interpolate method to fill in missing values. To get the name of the shortest column, try this:
df.count().idxmin()
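Putting it together as a runnable sketch (the frames, column names, and dates below are hypothetical; it uses join rather than merge so that alignment happens on the datetime index):

import pandas as pd

# Hypothetical example frames: one monthly series and one daily series.
monthly = pd.DataFrame(
    {"rate": [1.0, 2.0, 3.0]},
    index=pd.date_range("2021-01-01", periods=3, freq="MS"),
)
daily = pd.DataFrame(
    {"temp": range(90)},
    index=pd.date_range("2021-01-01", periods=90, freq="D"),
)

# Upsample the monthly data to daily rows, then fill the gaps.
monthly_daily = monthly.resample("1d").first().interpolate()

# Align on the daily index.
combined = daily.join(monthly_daily, how="left")

# Name of the column with the fewest non-null values:
print(combined.count().idxmin())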
Hope that helps!
I need to print a large table across multiple pages which contains both header rows and a "header" column. Representative of what I would like to achieve is:
https://github.com/EricG-Personal/table_print/blob/master/table.png
I do not want the contents of any cell to be clipped, split between pages, or auto-scaled to be smaller. Each page should have the appropriate header rows and each page should have the appropriate header column (the ID column).
The only aspect not depicted is that some of the cells would contain image data.
Can I achieve this with pandas?
What possible solutions do I have when attempting to print a large dataframe?
Pandas has no such capabilities; it wasn't designed for that in the first place.
I'd suggest converting your DataFrame to an Excel sheet and printing that from MS Excel, which has, to the best of my knowledge, everything you need.
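A one-line sketch of the export step (the filename is hypothetical, and cells containing image data would need extra handling, e.g. via openpyxl, which isn't covered here). From Excel you can then set the header rows/column to repeat on every printed page:

df.to_excel("large_table.xlsx", index=False)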