I have a list composed of an unknown number of DataFrames.
The DataFrames all have the same dimensions, with the same column names and the same index values, in the same order:
df1=pd.DataFrame(data=np.transpose([[1,2,3,4],[2,4,6,8]]),index=['A','B','C','D'],columns=['x','y'])
df2=pd.DataFrame(data=np.transpose([[3,3,3,3],[4,4,4,4]]),index=['A','B','C','D'],columns=['x','y'])
I would like to combine the n DataFrames into a single new DataFrame whose values are the element-wise mean of the values of the n DataFrames.
The expected output:
result=pd.DataFrame(data=np.transpose([[2,2.5,3,3.5],[3,4,5,6]]),index=['A','B','C','D'],columns=['x','y'])
Use concat, then take the mean per index value:
print (pd.concat([df1, df2]).mean(level=0))
x y
A 2.0 3.0
B 2.5 4.0
C 3.0 5.0
D 3.5 6.0
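Note that mean(level=0) was deprecated in pandas 1.3 and removed in pandas 2.0; on recent versions the equivalent is a groupby on the index level. A minimal sketch, assuming a current pandas release:
# group by the first index level instead of passing level= to mean
pd.concat([df1, df2]).groupby(level=0).mean()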
First concatenate the dataframes, reset the index to use it as the groupby key, and then calculate the mean over all columns.
pd.concat([df1, df2]).reset_index().groupby('index').mean()
Output
x y
index
A 2.0 3.0
B 2.5 4.0
C 3.0 5.0
D 3.5 6.0
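Since the question actually has an unknown number of DataFrames in a list, the same pattern works unchanged on the whole list. A sketch, where list_of_dfs is a hypothetical name for that list:
# list_of_dfs is assumed to hold all n DataFrames with identical index and columns
pd.concat(list_of_dfs).reset_index().groupby('index').mean()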
Both tables that I merge have their cells formatted correctly as numbers, but when I make a left join, the numbers from one of the original tables get reformatted (you see e+ in those numbers). What should I do to see those numbers in full?
Problem: When merging, some SKU values that appear in df1 do not appear in df2. In order to represent unavailable values, pandas automatically uses NaN, which is a floating point value. Thus, the integer ISBNs are converted to float. Given the size of the ISBNs, pandas then formats these floating point values in scientific notation.
You could solve this by defining your own floating point value formatter (pd.options.display.float_format), but in your case it might be easier / more effective to convert the ISBNs to a string before merging.
Example:
>>> import pandas as pd
>>> df1 = pd.DataFrame({"SKU": list("abcde"), "ISBN": list(range(1, 6))})
>>> df2 = pd.DataFrame({"SKU": list("bcef"), "ISBN": list(range(4, 8))})
Your problem:
>>> pd.merge(df1, df2, on="SKU", how="left")
SKU ISBN_x ISBN_y
0 a 1 NaN
1 b 2 4.0
2 c 3 5.0
3 d 4 NaN
4 e 5 6.0
>>> _.dtypes
SKU object
ISBN_x int64
ISBN_y float64 # <<< Problematic
vs possible solution:
>>> pd.merge(df1.astype(str), df2.astype(str), on="SKU", how="left")
SKU ISBN_x ISBN_y
0 a 1 NaN
1 b 2 4
2 c 3 5
3 d 4 NaN
4 e 5 6
>>> _.dtypes
SKU object
ISBN_x object
ISBN_y object
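If you prefer to keep the ISBNs numeric, the display-formatter route mentioned above also works. A sketch, assuming you simply want fixed-point instead of scientific notation:
# change how floats are displayed; the stored values are untouched
pd.options.display.float_format = '{:.0f}'.format
pd.merge(df1, df2, on="SKU", how="left")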
I currently have a dataframe with an arbitrary number of numeric columns and three columns holding datetime and string values. I want to convert all the columns except those three to numeric values, but am not sure what the best method is. Below is a sample dataframe (simplified):
df2 = pd.DataFrame(np.array([[1, '5-4-2016', 10], [1, '5-5-2016', 5], [2, '5-4-2016', 10],
                             [2, '5-5-2016', 7], [5, '5-4-2016', 8]]),
                   columns=['ID', 'Date', 'Number'])
I tried using something like (below) but was unsuccessful.
exclude = ['Date']
df = df.drop(exclude, 1).apply(pd.to_numeric, errors='coerce').combine_first(df)
The expected output (essentially, the datatype of the 'ID' and 'Number' fields changes to float while 'Date' stays the same):
ID Date Number
0 1.0 5-4-2016 10.0
1 1.0 5-5-2016 5.0
2 2.0 5-4-2016 10.0
3 2.0 5-5-2016 7.0
4 5.0 5-4-2016 8.0
Have you tried Series.astype()?
df['ID'] = df['ID'].astype(float)
df['Number'] = df['Number'].astype(float)
or for all columns besides date:
for col in [x for x in df.columns if x != 'Date']:
    df[col] = df[col].astype(float)
or
cols = [x for x in df.columns if x != 'Date']
df[cols] = df[cols].transform(lambda x: x.astype(float))
You need to call to_numeric with downcast='float' if you want the result to be float; otherwise it will be int. You also need to join the result back to the non-converted columns of the original df2:
df2[exclude].join(df2.drop(exclude, 1).apply(pd.to_numeric, downcast='float', errors='coerce'))
Out[1815]:
Date ID Number
0 5-4-2016 1.0 10.0
1 5-5-2016 1.0 5.0
2 5-4-2016 2.0 10.0
3 5-5-2016 2.0 7.0
4 5-4-2016 5.0 8.0
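A variation on the same idea that avoids the join and keeps the original column order is to assign the converted columns back in place. A sketch, reusing the df2 and exclude defined above:
# convert every column except the excluded ones and write the result back in place
cols = df2.columns.difference(exclude)
df2[cols] = df2[cols].apply(pd.to_numeric, downcast='float', errors='coerce')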
I am using:
df.to_csv('file.csv', header=False, mode='a')
to write multiple pandas dataframes one by one to a CSV file.
I make sure that these dataframe have the same sets of column names.
However, it seems that the columns are written in a random order, so I end up with a jumbled CSV file.
How can I make sure that each new dataframe is written with the same column order as the previous data?
Many thanks
I think you can sort each DataFrame by its columns, if the column names are the same in each one:
df.sort_index(axis=1).to_csv('file.csv', header=None, mode='a')
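If you would rather pin one explicit column order than rely on alphabetical sorting, you can reindex every DataFrame against a fixed list before appending. A sketch, where col_order is a hypothetical list you define once up front:
col_order = ['A', 'B', 'C']   # hypothetical fixed order shared by all frames
df.reindex(columns=col_order).to_csv('file.csv', header=False, mode='a')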
If the column names can differ, you can create a helper variable c and append each new frame's columns to it while removing duplicates:
df1 = pd.DataFrame({'C':list('as'),
'B':[4,5],
'A':[7,8]})
df2 = pd.DataFrame({'D':list('as'),
'A':[4,5],
'C':[7,8]})
df3 = pd.DataFrame({'C':list('as'),
'B':[4,5],
'E':[7,8]})
c = df1.columns
#the first df is written to the file the same way as the other dfs, only without mode='a'
df1.to_csv('file.csv', header=None, index=False)
c = c.append(df2.columns).drop_duplicates()
df2.reindex(columns=c).to_csv('file.csv', header=None, mode='a', index=False)
c = c.append(df3.columns).drop_duplicates()
df3.reindex(columns=c).to_csv('file.csv', header=None, mode='a', index=False)
df = pd.read_csv('file.csv', header=None, names=c)
print (df)
C B A D E
0 a 4.0 7.0 NaN NaN
1 s 5.0 8.0 NaN NaN
2 7 NaN 4.0 a NaN
3 8 NaN 5.0 s NaN
4 a 4.0 NaN NaN 7.0
5 s 5.0 NaN NaN 8.0
In general terms, the problem I'm having is that I have numerical column names for a dataframe and I'm struggling to use them.
I have a dataframe (df1) like this:
3.2 5.4 1.1
1 1.6 2.8 4.0
2 3.5 4.2 3.2
I want to create another (df2) where each value is:
(the corresponding value in df1 minus the value to the left) /
(the column number in df1 minus the column number to the left)
This means that the first column of df2 is NaN and, for instance, the value in the second row, second column is: (4.2-3.5)/(5.4-3.2)
I think maybe this is problematic because the column names aren't of the appropriate type: I've searched elsewhere but haven't found anything on how to use the column names in the way required.
Any and all help appreciated, even if it involves a workaround!
v = np.diff(df1.values, axis=1) / np.diff(df1.columns.values.astype(float))
df2 = pd.DataFrame(v, df1.index, df1.columns[1:]).reindex_like(df1)
df2
3.2 5.4 1.1
1 NaN 0.545455 -0.279070
2 NaN 0.318182 0.232558
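A pandas-only variant of the same idea, as a sketch that assumes (like the solution above) the column labels of df1 can be cast to float:
# spacing between consecutive column labels, indexed by the right-hand label of each pair
spacing = pd.Series(np.diff(df1.columns.astype(float)), index=df1.columns[1:])
# row-wise value differences divided by the matching column spacing (first column stays NaN)
df2 = df1.diff(axis=1).div(spacing, axis=1)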
You can first transpose the DF, attach the (float) column labels as a helper column, and take the row-wise diff. Then divide each column by the diff of the helper column. Finally drop the helper and transpose the DF back.
df2 = df1.T.assign(c=lambda x: x.index.astype(float)).diff()
df2.apply(lambda x: x.div(df2.c)).drop('c', 1).T
Out[367]:
3.2 5.4 1.1
1 NaN 0.545455 -0.279070
2 NaN 0.318182 0.232558
Taking inspiration from this discussion here on SO (Merge Columns within a DataFrame that have the Same Name), I tried the method suggested and, while it works when using the built-in sum(), it doesn't when I use np.nansum:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100,4), columns=['a', 'a','b','b'], index=pd.date_range('2011-1-1', periods=100))
print(df.head(3))
sum() case:
print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))
a b
2011-01-01 1.328933 1.678469
2011-01-02 1.878389 1.343327
2011-01-03 0.964278 1.302857
np.nansum() case:
print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))
a [1.32893299939, 1.87838886222, 0.964278430632,...
b [1.67846885234, 1.34332662587, 1.30285727348, ...
dtype: object
any idea why?
The issue is that np.nansum converts its input to a numpy array, so it effectively loses the column information (sum doesn't do this). As a result, the groupby doesn't get back any column information when constructing the output, so the output is just a Series of numpy arrays.
Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks if the input is an array, and converts it to one if it's not.
All hope isn't lost though. You can easily replicate np.nansum with Pandas functions. Specifically use sum followed by fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
The sum should ignore NaNs and just sum the non-null values. The only case where you'll get back a NaN is if all the values being summed are NaN, which is why fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....
If you really want to use np.nansum, you can recast the result as a pd.Series. This will likely impact performance, as constructing a Series can be relatively expensive, and you'll be doing it multiple times:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
Example Computations
For some example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):
df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))
a a a b b
0 1.0 2.0 2.0 NaN 4.0
1 NaN NaN NaN 3.0 3.0
2 NaN NaN -1.0 2.0 NaN
Using sum without fillna:
df.groupby(df.columns, axis=1).sum()
a b
0 5.0 4.0
1 NaN 6.0
2 -1.0 2.0
Using sum and fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
a b
0 5.0 4.0
1 0.0 6.0
2 -1.0 2.0
Comparing to the fixed np.nansum method:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
a b
0 5.0 4.0
1 0.0 6.0
2 -1.0 2.0
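One version-dependent detail: the NaN shown in the plain sum above reflects older pandas behaviour. From pandas 0.22 on, a grouped sum of an all-NaN group returns 0 by default, and the min_count parameter controls this explicitly. A sketch, assuming pandas 0.22 or newer:
# min_count=1 keeps an all-NaN group as NaN; the default min_count=0 turns it into 0
df.groupby(df.columns, axis=1).sum(min_count=1)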