Case insensitive pandas.concat - pandas

How would I perform a case insensitive pandas.concat?
df1 = pd.DataFrame({"a":[1,2,3]},index=["a","b","c"])
df2 = pd.DataFrame({"b":[1,2,3]},index=["a","b","c"])
df1a = pd.DataFrame({"A":[1,2,3]},index=["A","B","C"])
pd.concat([df1, df2],axis=1)
   a  b
a  1  1
b  2  2
c  3  3
but this does not work:
pd.concat([df1, df1a],axis=1)
     a    A
A  NaN    1
B  NaN    2
C  NaN    3
a    1  NaN
b    2  NaN
c    3  NaN
Is there an easy way to do this?
I have the same question for concat on a Series.
This works for a DataFrame:
pd.DataFrame([11,21,31],index=pd.MultiIndex.from_tuples([("A",x) for x in ["a","B","c"]])).rename(str.lower)
but this does not work for a Series:
pd.Series([11,21,31],index=pd.MultiIndex.from_tuples([("A",x) for x in ["a","B","c"]])).rename(str.lower)
TypeError: descriptor 'lower' requires a 'str' object but received a 'tuple'
For renaming, DataFrames use:
def rename_axis(self, mapper, axis=1):
    index = self.axes[axis]
    if isinstance(index, MultiIndex):
        new_axis = MultiIndex.from_tuples([tuple(mapper(y) for y in x) for x in index], names=index.names)
    else:
        new_axis = Index([mapper(x) for x in index], name=index.name)
whereas when renaming Series:
result.index = Index([mapper_f(x) for x in self.index], name=self.index.name)
so my updated question is how to perform the rename/case insensitive concat with a Series?

You can do this via rename:
pd.concat([df1, df1a.rename(index=str.lower)], axis=1)
EDIT:
If you want to do this with a MultiIndexed Series, you'll need to set the index manually for now. There's a bug report open on the pandas GitHub repo (thanks @ViktorKerkez).
s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: tuple(map(str.lower, x))))
You can replace str.lower with whatever function you want to use to rename your index.
Note that you generally cannot use reindex here: it looks up the new labels in the existing index, so it returns NaN values unless the rename leaves the original index unchanged.
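As a rough sketch of the approach above, the rename can be wrapped in a small helper that handles both flat and MultiIndex labels, for a Series or a DataFrame. The name lowercase_index is made up for illustration and is not part of pandas; note that it also converts non-string labels to strings.

```python
import pandas as pd

def lowercase_index(obj):
    """Return a copy of a Series/DataFrame with lower-cased row labels.

    Hypothetical helper (not a pandas API). Labels are passed through str(),
    so non-string labels become strings.
    """
    idx = obj.index
    if isinstance(idx, pd.MultiIndex):
        # Lower-case every element of every tuple, keeping the level names.
        new_idx = pd.MultiIndex.from_tuples(
            [tuple(str(label).lower() for label in tup) for tup in idx],
            names=idx.names,
        )
    else:
        new_idx = pd.Index([str(label).lower() for label in idx], name=idx.name)
    out = obj.copy()
    out.index = new_idx
    return out

df1 = pd.DataFrame({"a": [1, 2, 3]}, index=["a", "b", "c"])
df1a = pd.DataFrame({"A": [1, 2, 3]}, index=["A", "B", "C"])
combined = pd.concat([lowercase_index(df1), lowercase_index(df1a)], axis=1)

s = pd.Series(
    [11, 21, 31],
    index=pd.MultiIndex.from_tuples([("A", x) for x in ["a", "B", "c"]]),
)
s_lower = lowercase_index(s)
```

Normalising both inputs before the concat means the rows align on the lower-cased labels instead of producing NaN-padded duplicates.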

For the MultiIndexed Series objects, if this is not a bug, you can do:
s.index = pd.MultiIndex.from_tuples(
    s.index.map(lambda x: tuple(map(str.lower, x)))
)

Related

Pandas: Merging multiple dataframes efficiently

I have a situation where I need to merge multiple dataframes that I can do easily using the below code:
# Merge all the datasets together
df_prep1 = df_prep.merge(df1,on='e_id',how='left')
df_prep2 = df_prep1.merge(df2,on='e_id',how='left')
df_prep3 = df_prep2.merge(df3,on='e_id',how='left')
df_prep4 = df_prep3.merge(df_4,on='e_id',how='left')
df_prep5 = df_prep4.merge(df_5,on='e_id',how='left')
df_prep6 = df_prep5.merge(df_6,on='e_id',how='left')
What I want to know is whether there is a more efficient way to perform this merge, maybe using a helper function. If so, how could I achieve that?
You can use reduce from functools module to merge multiple dataframes:
from functools import reduce
dfs = [df_prep, df1, df2, df3, df_4, df_5, df_6]  # base frame first, then the frames to merge onto it
out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
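For illustration, here is a minimal runnable sketch of the reduce approach with toy frames. The frame contents and the helper name merge_all are assumptions made up for this example; only the e_id key and the left merge come from the question.

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for the frames in the question (contents are made up).
df_prep = pd.DataFrame({"e_id": [1, 2, 3]})
df1 = pd.DataFrame({"e_id": [1, 2], "x": [10, 20]})
df2 = pd.DataFrame({"e_id": [2, 3], "y": [0.5, 0.7]})
df3 = pd.DataFrame({"e_id": [3], "z": ["c"]})

def merge_all(base, others, key="e_id", how="left"):
    """Merge every frame in `others` onto `base`, one after another."""
    # `base` is the initial value, so it is always the left-most frame.
    return reduce(lambda left, right: left.merge(right, on=key, how=how), others, base)

merged = merge_all(df_prep, [df1, df2, df3])
```

Passing the base frame as the initial value of reduce keeps it on the left of every merge, which matters for a left join.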
You can put all your dfs into a list, or pass them from a function, a loop, etc. and then have 1 main df that you merge everything onto.
You can start with an empty df and iterate through. In your case, since you are doing left merge, it looks like your df_prep should already have all of the e_id values that you want. You'll need to figure out what you want to do with any additional columns, e.g., you can have pandas add _x and _y after conflicting column names that you don't merge, or rename them, etc. See this toy example:
main_df = pd.DataFrame({'e_id': [0, 1, 2, 3, 4]})
for x in range(3):
    dfx = pd.DataFrame({'e_id': [x], 'another_col' + str(x): [x * 10]})
    main_df = main_df.merge(dfx, on='e_id', how='left')
to get:
e_id another_col0 another_col1 another_col2
0 0 0.0 NaN NaN
1 1 NaN 10.0 NaN
2 2 NaN NaN 20.0
3 3 NaN NaN NaN
4 4 NaN NaN NaN

How to apply a function to every two columns of a pandas data and return a new dataframe

For example I have a dataframe as follows:
 
          A         B         C         D
0  1.049380  0.512696  0.135421  1.396424
1 -0.367589 -0.741008 -1.543296  0.355291
2  1.244623 -0.295761  1.238826 -0.017174
3  0.378124  0.870361 -0.733288 -0.228948
I want to call stats.ttest_ind on every combination of two columns and get a new dataframe like the following (the values shown are just placeholders):
          A         B         C         D
A       nan  0.512696  0.135421  1.396424
B -0.367589       nan -1.543296  0.355291
C  1.244623 -0.295761       nan -0.017174
D  0.378124  0.870361 -0.733288       nan
You could use a list comprehension:
from scipy import stats
import numpy as np
ttest_lists = [[stats.ttest_ind(df[col_i], df[col_j]) if col_i != col_j else np.nan
                for col_i in df] for col_j in df]
To get a DataFrame rather than a list of lists, you can then use:
ttest_df = pd.DataFrame(ttest_lists, columns=df.columns, index=df.columns)
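The same idea can be sketched as a reusable helper that evaluates each unordered pair only once. The name pairwise_apply is hypothetical, and the helper assumes the function is symmetric and returns a scalar; a simple mean-difference function stands in for the t-test here so the example runs without SciPy. Something like lambda x, y: stats.ttest_ind(x, y).pvalue could be passed instead.

```python
import itertools
import numpy as np
import pandas as pd

def pairwise_apply(df, func):
    """Apply func(col_a, col_b) to every unordered pair of distinct columns.

    Hypothetical helper: returns a square DataFrame indexed by column name,
    with NaN on the diagonal. Assumes func is symmetric and returns a scalar.
    """
    out = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
    for a, b in itertools.combinations(df.columns, 2):
        value = func(df[a], df[b])
        out.loc[a, b] = value
        out.loc[b, a] = value  # mirror the result across the diagonal
    return out

df = pd.DataFrame(np.random.default_rng(0).normal(size=(4, 4)), columns=list("ABCD"))
mean_gap = pairwise_apply(df, lambda x, y: abs(x.mean() - y.mean()))
```

Using itertools.combinations halves the number of function calls compared with the nested comprehension, at the cost of requiring a symmetric function.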

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the nans with averages of columns where they are?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the nan's directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
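A small self-contained sketch of the same call on a frame that also has a non-numeric column. The column names and values are invented for illustration; numeric_only=True is used so that the text column is skipped when computing the means.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 6.0],
    "label": ["x", None, "z"],  # non-numeric column, left untouched
})

# Series of per-column means, indexed by column name (numeric columns only).
col_means = df.mean(numeric_only=True)

# fillna matches the Series index against the column names.
filled = df.fillna(col_means)
```

Columns that do not appear in the Series of means, such as label here, keep their missing values.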
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply per column, filling each column's NaNs with that column's mean:
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit once the DataFrame has 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() to the whole DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), which I ran selectively, i.e. only on the variables that had a NaN value:
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:  #---Applying only on variables with NaN values
    df[i] = df[i].fillna(df[i].mean())  # assign back rather than use inplace=True on a column selection
#---df.isnull().any(axis=0) gives a True/False flag (Boolean Series),
#---which, when used to index df.columns, identifies the variables with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
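The selective treatment can also be sketched without an explicit loop, by selecting the affected columns once and filling them together. This is only an illustration on a made-up three-column frame, and it makes no claim about how the timings above would change.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, 3.0],       # no NaN, so this column is never touched
    "c": [np.nan, 10.0, 20.0],
})

# Names of the columns that contain at least one NaN.
nan_cols = df.columns[df.isna().any()]

# Fill only those columns, each with its own mean, and assign the result back.
df[nan_cols] = df[nan_cols].fillna(df[nan_cols].mean())
```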
Apologies for the long answer! Hope this helps!
If you want to impute missing values with the mean and go column by column, the following fills a column using only that column's mean. This might be a little more readable:
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Read the data from a CSV file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values
# Use the SimpleImputer class to fill NaNs in columns 1 and 2 with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Use df.fillna(df.mean()) directly to fill every null value with the mean of its column.
To fill the nulls in a single column with that column's mean, assign the filled column back. Here Item_Weight is the column name:
df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))
To fill null values with a string instead, use the following, where Outlet_Size is the column name:
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean. If the column has outliers, the median is usually the better choice.
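A tiny made-up example shows how much a single outlier can shift the fill value. The nr_items column name comes from the answer above; the numbers are invented.

```python
import numpy as np
import pandas as pd

# One missing value and one extreme outlier (1000.0).
df = pd.DataFrame({"nr_items": [1.0, 2.0, 3.0, np.nan, 1000.0]})

mean_filled = df["nr_items"].fillna(df["nr_items"].mean())
median_filled = df["nr_items"].fillna(df["nr_items"].median())

print(mean_filled.iloc[3])    # 251.5, pulled up by the outlier
print(median_filled.iloc[3])  # 2.5
```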
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than previous responses for mean, but it could be shorter if you desire to replace nulls by some other column function.
Using scikit-learn's SimpleImputer class:
import numpy as np
from sklearn.impute import SimpleImputer
# SimpleImputer always imputes column-wise and takes no axis argument
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])
Note: in recent scikit-learn versions, the missing_values parameter takes np.nan rather than the string 'NaN'.
I use this method to fill missing values by average of a column.
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to get the most frequent values. This would work on different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.

How to turn Pandas' DataFrame.groupby() result into MultiIndex

Suppose I have a set of measurements that were obtained by varying two parameters, knob_1 and knob_2 (in practice there are a lot more):
data = np.empty((6,3), dtype=float)
data[:,0] = [3,4,5,3,4,5]
data[:,1] = [1,1,1,2,2,2]
data[:,2] = np.random.random(6)
df = pd.DataFrame(data, columns=['knob_1', 'knob_2', 'signal'])
i.e., df is
knob_1 knob_2 signal
0 3 1 0.076571
1 4 1 0.488965
2 5 1 0.506059
3 3 2 0.415414
4 4 2 0.771212
5 5 2 0.502188
Now, considering each parameter on its own, I want to find the minimum value that was measured for each setting of this parameter (ignoring the settings of all other parameters). The pedestrian way of doing this is:
new_index = []
new_data = []
for param in df.columns:
    if param == 'signal':
        continue
    group = df.groupby(param)['signal'].min()
    for (k,v) in group.items():
        new_index.append((param, k))
        new_data.append(v)
new_index = pd.MultiIndex.from_tuples(new_index,
                                      names=('parameter', 'value'))
df2 = pd.Series(index=new_index, data=new_data)
resulting df2 being:
parameter  value
knob_1     3        0.495674
           4        0.277030
           5        0.398806
knob_2     1        0.485933
           2        0.277030
dtype: float64
Is there a better way to do this, in particular to get rid of the inner loop?
It seems to me that the result of the df.groupby operation already has everything I need - if only there was a way to somehow create a MultiIndex from it without going through the list of tuples.
Use the keys argument of pd.concat():
pd.concat([df.groupby('knob_1')['signal'].min(),
           df.groupby('knob_2')['signal'].min()],
          keys=['knob_1', 'knob_2'],
          names=['parameter', 'value'])
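Since the question mentions many more parameters in practice, the same call can be sketched with the list of groupby results built in a comprehension. The signal values below are fixed, made-up numbers so the result is reproducible; params is just a local variable name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "knob_1": [3, 4, 5, 3, 4, 5],
    "knob_2": [1, 1, 1, 2, 2, 2],
    "signal": [0.6, 0.2, 0.5, 0.4, 0.7, 0.3],
})

# Every column except the measurement itself is treated as a parameter.
params = [c for c in df.columns if c != "signal"]

mins = pd.concat(
    [df.groupby(p)["signal"].min() for p in params],
    keys=params,                     # becomes the outer index level
    names=["parameter", "value"],
)
```

The result is a Series with a two-level MultiIndex, so mins.loc["knob_1"] returns the per-setting minima for that parameter alone.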

pd.dataframe.apply() create multiple new columns

I have a bunch of files where I want to open, read the first line, parse it into several expected pieces of information, and then put the filenames and those data as rows in a dataframe. My question concerns the recommended syntax to build the dataframe in a pandanic/pythonic way (the file-opening and parsing I already have figured out).
For a dumbed-down example, the following seems to be the recommended thing to do when you want to create one new column:
df = pd.DataFrame(files, columns=['filename'])
df['first_letter'] = df.apply(lambda x: x['filename'][:1], axis=1)
but I can't, say, do this:
df['first_letter'], df['second_letter'] = df.apply(lambda x: (x['filename'][:1], x['filename'][1:2]), axis=1)
as the apply function creates only one column with tuples in it.
Keep in mind that, in place of the lambda function I will place a function that will open the file and read and parse the first line.
You can put the two values in a Series, and then it will be returned as a dataframe from the apply (where each series is a row in that dataframe). With a dummy example:
In [29]: df = pd.DataFrame(['Aa', 'Bb', 'Cc'], columns=['filenames'])
In [30]: df
Out[30]:
filenames
0 Aa
1 Bb
2 Cc
In [31]: df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
Out[31]:
0 1
0 A a
1 B b
2 C c
This you can then assign to two new columns:
In [33]: df[['first', 'second']] = df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
In [34]: df
Out[34]:
filenames first second
0 Aa A a
1 Bb B b
2 Cc C c
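Another option, sketched here under the assumption that the parsing function returns a plain tuple, is to pass result_type="expand" to DataFrame.apply so that each element of the tuple becomes its own column. The parse_name function is a hypothetical stand-in for opening a file and parsing its first line.

```python
import pandas as pd

def parse_name(row):
    # Hypothetical parser: stands in for reading and parsing the file's first line.
    name = row["filename"]
    return name[:1], name[1:2]

df = pd.DataFrame({"filename": ["Aa", "Bb", "Cc"]})

# result_type="expand" turns each returned tuple into separate columns.
df[["first_letter", "second_letter"]] = df.apply(
    parse_name, axis=1, result_type="expand"
)
```

This avoids constructing a Series for every row, which may be cheaper on large frames, though that has not been measured here.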