pandas: to_csv append mode with preserved column order

I am using:
df.to_csv('file.csv', header=False, mode='a')
to write multiple pandas dataframes one by one to a CSV file.
I make sure that these dataframes have the same set of column names.
However, the columns seem to be written in a random order, so I end up with a chaotic CSV file.
How can I make sure that each new dataframe is written with the same column order as the previous data?
Many thanks

If each DataFrame has the same column names, I think you can sort each one by its columns before appending:
df.sort_index(axis=1).to_csv('file.csv', header=None, mode='a')
If the DataFrames can have different column names, you can create a helper variable c and extend it with any new columns while dropping duplicates:
df1 = pd.DataFrame({'C': list('as'),
                    'B': [4, 5],
                    'A': [7, 8]})
df2 = pd.DataFrame({'D': list('as'),
                    'A': [4, 5],
                    'C': [7, 8]})
df3 = pd.DataFrame({'C': list('as'),
                    'B': [4, 5],
                    'E': [7, 8]})
c = df1.columns
# the first df should be written to the file the same way as the other dfs
df1.to_csv('file.csv', header=None, index=False)
c = c.append(df2.columns).drop_duplicates()
df2.reindex(columns=c).to_csv('file.csv', header=None, mode='a', index=False)
c = c.append(df3.columns).drop_duplicates()
df3.reindex(columns=c).to_csv('file.csv', header=None, mode='a', index=False)
df = pd.read_csv('file.csv', header=None, names=c)
print (df)
C B A D E
0 a 4.0 7.0 NaN NaN
1 s 5.0 8.0 NaN NaN
2 7 NaN 4.0 a NaN
3 8 NaN 5.0 s NaN
4 a 4.0 NaN NaN 7.0
5 s 5.0 NaN NaN 8.0
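For the simplest case, where every DataFrame shares exactly the same columns, here is a minimal self-contained sketch (the dataframes df_a and df_b are made up) that fixes one column order up front and reuses it for every append:

import pandas as pd

# hypothetical dataframes with the same columns in different orders
df_a = pd.DataFrame({'B': [1, 2], 'A': [3, 4]})
df_b = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

cols = list(df_a.columns)                   # fix one canonical column order
df_a[cols].to_csv('file.csv', index=False)  # first write includes the header
df_b[cols].to_csv('file.csv', header=False, mode='a', index=False)  # append in the same order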

Related

Pandas: Merging multiple dataframes efficiently

I have a situation where I need to merge multiple dataframes that I can do easily using the below code:
# Merge all the datasets together
df_prep1 = df_prep.merge(df1,on='e_id',how='left')
df_prep2 = df_prep1.merge(df2,on='e_id',how='left')
df_prep3 = df_prep2.merge(df3,on='e_id',how='left')
df_prep4 = df_prep3.merge(df_4,on='e_id',how='left')
df_prep5 = df_prep4.merge(df_5,on='e_id',how='left')
df_prep6 = df_prep5.merge(df_6,on='e_id',how='left')
But what I want to understand is whether there is a more efficient way to perform this merge, perhaps using a helper function. If so, how could I achieve that?
You can use reduce from functools module to merge multiple dataframes:
from functools import reduce
dfs = [df_1, df_2, df_3, df_4, df_5, df_6]
out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
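As a quick self-contained illustration (the small dataframes below are invented), the reduce call collapses the chain of merges into a single expression:

import pandas as pd
from functools import reduce

df_1 = pd.DataFrame({'e_id': [1, 2, 3], 'a': [10, 20, 30]})
df_2 = pd.DataFrame({'e_id': [1, 2], 'b': ['x', 'y']})
df_3 = pd.DataFrame({'e_id': [3], 'c': [0.5]})

dfs = [df_1, df_2, df_3]
out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
print(out)  # one row per e_id in df_1, with NaN where df_2/df_3 had no match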
You can put all your dfs into a list, or pass them from a function, a loop, etc., and then merge everything onto one main df.
You can start with an empty df and iterate through. In your case, since you are doing a left merge, it looks like your df_prep should already have all of the e_id values that you want. You'll need to figure out what to do with any additional columns, e.g., you can have pandas add _x and _y after conflicting column names that you don't merge on, or rename them, etc. (a sketch of controlling these suffixes follows the example below). See this toy example:
main_df = pd.DataFrame({'e_id': [0, 1, 2, 3, 4]})
for x in range(3):
    dfx = pd.DataFrame({'e_id': [x], 'another_col' + str(x): [x * 10]})
    main_df = main_df.merge(dfx, on='e_id', how='left')
to get:
e_id another_col0 another_col1 another_col2
0 0 0.0 NaN NaN
1 1 NaN 10.0 NaN
2 2 NaN NaN 20.0
3 3 NaN NaN NaN
4 4 NaN NaN NaN
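As mentioned above, pandas appends _x and _y to conflicting column names by default; here is a small sketch of controlling those suffixes explicitly (the column names are made up):

import pandas as pd

left = pd.DataFrame({'e_id': [1, 2], 'score': [0.1, 0.2]})
right = pd.DataFrame({'e_id': [1, 2], 'score': [0.9, 0.8]})

# the conflicting 'score' columns get explicit suffixes instead of the default _x/_y
merged = left.merge(right, on='e_id', how='left', suffixes=('_left', '_right'))
print(merged)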

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the NaNs with the averages of the columns they are in?
This question is very similar to this one: numpy array: replace nan values with average of columns, but unfortunately the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the NaNs directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
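As an illustration of passing a dict built from the column means, a minimal sketch (the toy dataframe is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 2.0, 4.0]})
means = df.mean().to_dict()   # {'A': 2.0, 'B': 3.0}
print(df.fillna(means))       # same result as df.fillna(df.mean())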
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply per column: fill each column's NaNs with the mean of that column.
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit when you deal with a DataFrame of ~100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or median) only where it is required, rather than applying fillna() all over the DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN value treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), where I ran it selectively, i.e. only on the variables which had NaN values:
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:  #---Applying only on variables with NaN values
    df[i].fillna(df[i].mean(), inplace=True)
#---df.isnull().any(axis=0) gives True/False flag (Boolean value series),
#---which when applied on df.columns[], helps identify variables with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
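For completeness, here is a sketch of a vectorized variant of the selective approach, which restricts a single fillna call to the columns that actually contain NaNs (the toy dataframe is made up; I have not benchmarked this against Code 2):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, 6.0],
                   'c': [np.nan, 8.0, 9.0]})

# operate only on the columns that actually contain NaNs
nan_cols = df.columns[df.isnull().any(axis=0)]
df[nan_cols] = df[nan_cols].fillna(df[nan_cols].mean())
print(df)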
Apologies for a long answer! Hope this helps!
If you want to impute missing values with mean and you want to go column by column, then this will only impute with the mean of that column. This might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd

# To read data from a csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values

# To calculate the mean, use the SimpleImputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
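If you want the imputed values back in a DataFrame rather than a NumPy array, here is a minimal self-contained sketch (the columns age and salary are invented):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'age': [25.0, np.nan, 40.0],
                   'salary': [50000.0, 60000.0, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df[['age', 'salary']] = imputer.fit_transform(df[['age', 'salary']])
print(df)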
Directly use df.fillna(df.mean()) to fill all the null values with the mean.
If you want to fill the null values of a single column with that column's mean (here Item_Weight is the column name), assign the filled column back to itself:
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
If you want to fill null values with some string instead (here Outlet_Size is the column name), use:
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean: if you have outliers, it is more advisable to use the median.
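For example, a minimal sketch of the median variant (the column name nr_items follows the example above; the data is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'nr_items': [1.0, np.nan, 100.0, 3.0]})
# the median (3.0) is far less affected by the outlier 100.0 than the mean (~34.7)
df['nr_item_ave'] = df['nr_items'].fillna(df['nr_items'].median())
print(df)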
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than previous responses for mean, but it could be shorter if you desire to replace nulls by some other column function.
Using the sklearn SimpleImputer class:
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])
Note: in recent versions the missing_values parameter changed from 'NaN' to np.nan, and SimpleImputer (which replaced the old preprocessing.Imputer) no longer takes an axis argument.
I use this method to fill missing values with the average of a column:
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to fill with the most frequent value, which works across different datatypes:
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.
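A small self-contained illustration of the value_counts approach on mixed dtypes (the toy dataframe is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1.0, np.nan, 1.0, 2.0],
                   'cat': ['a', 'b', None, 'b']})

# fill each column with its most frequent value
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
print(df)  # NaN in 'num' becomes 1.0, None in 'cat' becomes 'b'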

Pandas: DataFrame op DataFrame Results in NaNs

Why do simple DataFrame op DataFrame operations result in a unioned DataFrame? The pandas documentation mentions taking the union of labels because of alignment. I don't see any alignment issues with df1 and df2. Aren't alignment issues about different shapes, different dtypes, or different indexes?
df1 = pd.DataFrame([[1,2],[3,4]],columns=list('AB'))
df2 = pd.DataFrame([[5,6],[7,8]],columns=list('CD'))
>> df1*df2
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
Another source of alignment issues is non-matching column names: arithmetic between DataFrames aligns on identical column labels. Either make the column names the same or use .values. Using .values on just the right-hand DataFrame keeps the result a DataFrame.
>> df1*df2.values
A B
0 5 12
1 21 32
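Alternatively, if the two frames really describe the same quantities, a hedged sketch that relabels one frame so the multiplication aligns on column labels (the relabelling below simply reuses df1's column names):

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))

# give df2 the same column labels as df1 so the operation aligns label-by-label
df2_aligned = df2.copy()
df2_aligned.columns = df1.columns
print(df1 * df2_aligned)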

Update a Pandas MultiIndex DataFrame

The dataframe "data" has a MultiIndex.
data.head()
                     Close     High      Low     Open    Volume
Symbol Date
A      1999-11-18  28.6358  33.5207  27.3725  30.6572  59753154
       1999-11-19  27.2040  28.9727  26.8253  28.9323  16172993
       1999-11-22  29.3517  29.3517  26.9935  27.8357   5435127
       1999-11-23  27.1198  28.8885  27.1198  28.6358   5035889
       1999-11-24  27.6676  28.2571  26.9513  27.0389   5141708
The dictionary f has a key 'AAPL' which is a regular DataFrame.
f['AAPL'].head()
Open High Low Close Volume
Date
2018-06-11 191.350 191.970 190.21 191.23 18308460
2018-06-12 191.385 192.611 191.15 192.28 16911141
2018-06-13 192.420 192.880 190.44 190.70 21638393
2018-06-14 191.550 191.570 190.22 190.80 21610074
2018-06-15 190.030 190.160 188.26 188.84 61719160
I'd like to append to data['AAPL'] so that it has the data from f['AAPL']. This works, but is not inplace:
data.loc['AAPL'].append(f['AAPL'], verify_integrity=True).tail()
Close High Low Open Volume
Date
2018-07-30 189.91 192.20 189.0700 191.90 21029535
2018-07-31 190.29 192.14 189.3400 190.30 39373038
2018-08-01 201.50 201.76 197.3100 199.13 67935716
2018-08-02 207.39 208.38 200.3500 200.58 62404012
2018-08-03 207.99 208.74 205.4803 207.03 33447396
When I try to update data, I get all NaNs.
data.loc['AAPL'] = data.loc['AAPL'].append(f['AAPL'], verify_integrity=True).tail()
Close High Low Open Volume
Date
2018-06-04 NaN NaN NaN NaN NaN
2018-06-05 NaN NaN NaN NaN NaN
2018-06-06 NaN NaN NaN NaN NaN
2018-06-07 NaN NaN NaN NaN NaN
2018-06-08 NaN NaN NaN NaN NaN
Edit:
The "data" DataFrame was created with pandas data_reader:
import pandas_datareader.data as web
data = web.DataReader(['A','AAPL','F'], 'morningstar', start, end)
"f" was created the same way, but using 'iex' as the source instead of 'morningstar' (at the moment the morningstar source is returning 404s, so I switched to iex).
I still don't know why assigning to data.loc['AAPL'] doesn't work, but the following does:
# Converts dict with keys as tickers, DataFrame as values, to a DataFrame with a MultiIndex
new_data = pd.concat(f)
# Just append, and sort index to align the dates
data = data.append(new_data).sort_index()
Personal preference: I would first create a temporary df with the data to append, as a MultiIndex dataframe.
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
And then create a new dataframe by appending the data and new data.
newdata = data.append(toappend, verify_integrity=True)
or if you prefer to do it in one line:
newdata = data.append(pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol']), verify_integrity=True)
My full test code is:
import pandas as pd
import numpy as np
symbols = ['AAA', 'BBB', 'CCC']
dates = ['2018-06-11', '2018-06-12', '2018-06-13']
cols = ['Close', 'High', 'Low']
midx = pd.MultiIndex.from_product([symbols, dates], names=['Symbol', 'Date'])
data = pd.DataFrame(10, midx, cols)
aapldf = pd.DataFrame(15, dates, cols)
aapldf.index.name = 'Date'
f = {'AAPL': aapldf}
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
newdata = data.append(toappend, verify_integrity=True)
print(newdata)

How does df.interpolate(inplace=True) function?

I am having trouble understanding how this functions. With inplace=True, the function doesn't output anything and the original df remains unchanged. How does this work?
So sorry, I wrote 'filter' in my first post. That was a very stupid mistake.
As #Alex requested, the example is as follows:
df = pd.DataFrame(np.random.randn(4,3), columns=map(chr, range(65,68)))
df['B'] = np.nan
print df
print df.interpolate(axis=1)
print df
print df.interpolate(axis=1, inplace=True)
print df
The output is as follows:
A B C
0 -0.956273 NaN 0.919723
1 1.127298 NaN -0.585326
2 -0.045163 NaN -0.946355
3 -1.375863 NaN -1.279663
A B C
0 -0.956273 -0.018275 0.919723
1 1.127298 0.270986 -0.585326
2 -0.045163 -0.495759 -0.946355
3 -1.375863 -1.327763 -1.279663
A B C
0 -0.956273 NaN 0.919723
1 1.127298 NaN -0.585326
2 -0.045163 NaN -0.946355
3 -1.375863 NaN -1.279663
None
A B C
0 -0.956273 NaN 0.919723
1 1.127298 NaN -0.585326
2 -0.045163 NaN -0.946355
3 -1.375863 NaN -1.279663
As you can see, the first interpolation created a copy of the original dataframe. What I wanted is to interpolate and update the original dataframe, so I tried inplace, since the documentation states the following:
inplace : bool, default False
Update the NDFrame in place if possible.
The second interpolation did not return any value, and it did not update the original dataframe. So I'm confused.
And as #joris requested, my pandas version is '0.15.1'. Though this request is due to my mistake...