pd.concat Plan shapes are not aligned error - pandas

I'm trying to concat two dataframes that have exactly the same column names.
df1:
    A        B     C
 1762  53RC982  0.22
 1763  56XY931  0.33
 1767  54AB171  0.47
 1771  38CD410  0.22

df2:
    A        B     C
 1810  53RC982  0.42
 1811  58XY821  0.63
 1812  47AB261  0.33
 1820  38CD410  0.81
where A is the unique column. I basically want to stack them, keeping df1 and df2 untouched, one after the other. This shouldn't be a problem, as I have a unique column: if df1 is 100 rows and df2 is 50 rows, the combined number of rows should be 150.
I've tried:
df_final = pd.concat([df1, df2], axis=0, sort=True)
However, this is giving me
ValueError: Plan shapes are not aligned
if I do :
df_final = pd.concat([df1, df2], axis=1)
This gives me the information side by side. I've also tried the
.append
method, with similar results.

The reason this was happening is that the dataframes didn't have the same data types. I used
df1['A'] = df1['A'].astype(str)
df2['A'] = df2['A'].astype(str)
then
df_final = pd.concat([df1, df2], axis=0, sort=True)
to merge them
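For reference, here is a minimal, self-contained sketch of that fix (the frame contents are made up to mirror the question; recent pandas versions may not raise the error on mixed dtypes at all, but normalizing the key column's dtype first is still the safer pattern):

```python
import pandas as pd

# Toy frames mirroring the question: identical column names, but 'A' holds
# ints in df1 and strings in df2.
df1 = pd.DataFrame({'A': [1762, 1763], 'B': ['53RC982', '56XY931'], 'C': [0.22, 0.33]})
df2 = pd.DataFrame({'A': ['1810', '1811'], 'B': ['53RC982', '58XY821'], 'C': [0.42, 0.63]})

# Normalize the dtype of the key column before stacking.
df1['A'] = df1['A'].astype(str)
df2['A'] = df2['A'].astype(str)

# Stack df1's rows followed by df2's rows; ignore_index gives a fresh 0..n-1 index.
df_final = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df_final)  # 4 rows, columns A, B, C
```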

Related

Fillna() depending on another column

I want to do next:
Fill DF1's NaN with values from DF2, depending on a column value in DF1.
Basically, DF1 has people with an "income_type" and some NaN in "total_income". DF2 has the "median income" for each "income_type". I want to fill the NaN in "total_income" of DF1 with the median values from DF2.
First, I would merge values from DF2 to DF1 by 'income_type'
DF3 = DF1.merge(DF2, how='left', on='income_type')
This way you have the values of median income and total income in the same dataframe.
After this, I would do a conditional assignment on the dataframe column:
DF3.loc[DF3['total_income'].isna(), 'total_income'] = DF3['median income']
That will replace the NaN values with the median values from the merge
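Put together as a runnable sketch (with made-up data, and assuming DF2's column is literally named 'median income' as in the snippet above):

```python
import pandas as pd

DF1 = pd.DataFrame({'income_type': ['a', 'b', 'a'],
                    'total_income': [100.0, None, None]})
DF2 = pd.DataFrame({'income_type': ['a', 'b'],
                    'median income': [150.0, 250.0]})

# Bring the per-type median onto every row, then fill only the NaN rows.
DF3 = DF1.merge(DF2, how='left', on='income_type')
DF3.loc[DF3['total_income'].isna(), 'total_income'] = DF3['median income']
print(DF3['total_income'].tolist())  # [100.0, 250.0, 150.0]
```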
You need to join the two dataframes and then replace the NaN values with the median. Here is a similar working example. Cheers mate.
import pandas as pd
#create the example dataframes
df1 = pd.DataFrame({'income_type':['a','b','c','a','a','b','b'], 'total_income':[200, 300, 500,None,400,None,None]})
df2 = pd.DataFrame({'income_type':['a','b','c'], 'median_income':[205, 305, 505]})
# inner join the df1 with df2 on the column 'income_type'
joined = df1.merge(df2, on='income_type')
# fill the nan values the value from the column 'median_income' and save it in a new column 'total_income_not_na'
joined['total_income_not_na'] = joined['total_income'].fillna(joined['median_income'])
print(joined)

Join on column in pandas

So this works as expected:
df1 = pd.DataFrame({'date':[123,456],'price1':[23,34]}).set_index('date')
df2 = pd.DataFrame({'date':[456,789],'price2':[22,32]}).set_index('date')
df1.join(df2, how='outer')
price1 price2
date
123 23.0 NaN
456 34.0 22.0
789 NaN 32.0
But if I don't set the index, it causes an error:
df1 = pd.DataFrame({'date':[123,456],'price1':[23,34]})
df2 = pd.DataFrame({'date':[456,789],'price2':[22,32]})
df1.join(df2, on='date', how='outer')
ValueError: columns overlap but no suffix specified: Index(['date'], dtype='object')
Why is this, and am I incorrect for supposing they should give the same result?
If you just want to put the two dataframes side by side rather than joining on a certain column, you need to add suffixes so as not to create columns with the same name, e.g.:
df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')
If you want to join on the column, you should use merge:
df1.merge(df2, how='outer')
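For the data in the question, that merge looks like this (a small sketch; with no on= argument, merge joins on the shared 'date' column automatically):

```python
import pandas as pd

df1 = pd.DataFrame({'date': [123, 456], 'price1': [23, 34]})
df2 = pd.DataFrame({'date': [456, 789], 'price2': [22, 32]})

# Outer merge on the common 'date' column; unmatched sides get NaN.
out = df1.merge(df2, how='outer')
print(out)
```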

Perplexing pandas index change after left merge

I have a data frame and I am interested in a particular row. When I run
questionnaire_events[questionnaire_events['event_id'].eq(6506308)]
I get the row, and its index is 7,816. I then merge questionnaire_events with another data frame
merged = questionnaire_events.merge(
ordinals,
how='left',
left_on='event_id',
right_on='id')
(It is worth noting that the ordinals data frame has no NaNs and no duplicated ids, but questionnaire_events does have some rows with NaN values for event_id.)
merged[merged['event_id'].eq(6506308)]
The resulting row has index 7,581. Why? What has happened in the merge, a left outer merge, to mean that my row has moved from 7,816 to 7,581? If there were multiple rows with the same id in the ordinals data frame then I can see how the merged data frame would have more rows than the left data frame in the merge, but that is not the case, so why has the row moved?
(N.B. Sorry I cannot give a crisp code sample. When I try to produce test data the row index change does not happen, it is only happening on my real data.)
pd.DataFrame.merge does not preserve the original dataframes' indexes.
df1 = pd.DataFrame({'key':[*'ABCDE'], 'val':[1,2,3,4,5]}, index=[100,200,300,400,500])
print('df1 dataframe:')
print(df1)
print('\n')
df2 = pd.DataFrame({'key':[*'AZCWE'], 'val':[10,20,30,40,50]}, index=[*'abcde'])
print('df2 dataframe:')
print(df2)
print('\n')
df_m = df1.merge(df2, on='key', how='left')
print('df_m dataframe:')
print(df_m)
Note that merge builds a brand-new default RangeIndex for its result. If your df1 also has the default range index, the two may happen to line up; but if you subset or filter df1 first, the indexing will no longer match.
Work Around:
df1 = df1.reset_index()
df_m2 = df1.merge(df2, on='key', how='left')
df_m2 = df_m2.set_index('index')
print('df_m2 work around dataframe:')
print(df_m2)
Output:
df_m2 work around dataframe:
key val_x val_y
index
100 A 1 10.0
200 B 2 NaN
300 C 3 30.0
400 D 4 NaN
500 E 5 50.0
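To see the renumbering itself, here is a tiny made-up example: the left frame has a gappy index (as if it had been filtered), and a left merge on columns still hands back a fresh 0-based RangeIndex:

```python
import pandas as pd

# Left frame with a gappy index, e.g. the survivors of an earlier filter.
left = pd.DataFrame({'event_id': [1, 2, 3]}, index=[10, 20, 30])
right = pd.DataFrame({'id': [2, 3], 'ordinal': ['a', 'b']})

merged = left.merge(right, how='left', left_on='event_id', right_on='id')
print(merged.index.tolist())  # [0, 1, 2] -- the original [10, 20, 30] is gone
```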

Assigning index column to empty pandas dataframe

I am creating an empty dataframe that I then want to add data to, one row at a time. I want to index on the first column, 'customer_ID'.
I have this:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'],index=['customer_ID'])
In[2]: df
Out[3]:
customer_ID a b c
customer_ID NaN NaN NaN NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?
The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'])
In[2]: df.set_index('customer_ID',inplace = True)
In[3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In[4]: id='x123'
In[5]: df.loc[id]=[id,4,5,6]
In[6]: df
Out[7]:
customer_ID a b c
x123 x123 4.0 5.0 6.0
yes... and you can dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df
That is because you didn't have any rows in your dataframe when you created it.
df = pd.DataFrame({'customer_ID': ['2'], 'a': ['1'], 'b': ['A'], 'c': ['1']})
df = df.set_index('customer_ID', drop=False)
df
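As an aside, if the rows arrive one at a time anyway, it is often cleaner (and faster for large inputs) to collect them as records first and build the indexed frame once at the end; a small sketch with made-up rows:

```python
import pandas as pd

# Accumulate plain dicts, then construct and index the frame in one go.
records = [
    {'customer_ID': 'x123', 'a': 4, 'b': 5, 'c': 6},
    {'customer_ID': 'x124', 'a': 7, 'b': 8, 'c': 9},
]
df = pd.DataFrame.from_records(records).set_index('customer_ID')
print(df)
```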

Vectorized method to sync two arrays

I have two Pandas TimeSeries, x and y, which I would like to sync "as of". I would like to find, for every element in x, the latest (by index) element in y that precedes it (by index value). For example, I would like to compute this new_x:
x new_x
---- -----
13:01 13:00
14:02 14:00
y
----
13:00
13:01
13:30
14:00
I am looking for a vectorized solution, not a Python loop. The time values are based on Numpy datetime64. The y array's length is in the order of millions, so O(n^2) solutions are probably not practical.
In some circles this operation is known as the "asof" join. Here is an implementation:
def diffCols(df1, df2):
    """Find the columns of df1 not present in df2.

    Returns df1.columns - df2.columns, maintaining the order in which the
    resulting columns appear in df1.

    Parameters
    ----------
    df1 : pandas dataframe object
    df2 : pandas dataframe object

    Pandas already offers df1.columns - df2.columns, but unfortunately
    the original order of the resulting columns is not maintained.
    """
    return [i for i in df1.columns if i not in df2.columns]

def aj(df1, df2, overwriteColumns=True, inplace=False):
    """KDB+-like asof join.

    Finds the prevailing values of df2 as of df1's index. The resulting
    dataframe will have the same number of rows as df1.

    Parameters
    ----------
    df1 : pandas dataframe
    df2 : pandas dataframe
    overwriteColumns : boolean, default True
        The columns of df2 will overwrite the columns of df1 if they have
        the same name, unless overwriteColumns is set to False. In that case,
        this function will only join the columns of df2 which are not present
        in df1.
    inplace : boolean, default False
        If True, adds the columns of df2 to df1. Otherwise, creates a new
        dataframe with the columns of both df1 and df2.

    Assumes both df1 and df2 have a datetime64 index.
    """
    joiner = lambda x: x.asof(df1.index)
    if not overwriteColumns:
        # Get the columns of df2 not present in df1
        cols = diffCols(df2, df1)
        if len(cols) > 0:
            df2 = df2.loc[:, cols]  # .ix is long removed; use .loc
    result = df2.apply(joiner)
    if inplace:
        for i in result.columns:
            df1[i] = result[i]
        return df1
    else:
        return result
Internally, this uses pandas.Series.asof().
What about using Series.searchsorted() to return the index of y at which you would insert x? You could then subtract one from that value and use it to index y.
In [1]: x
Out[1]:
0 1301
1 1402
In [2]: y
Out[2]:
0 1300
1 1301
2 1330
3 1400
In [3]: y[y.searchsorted(x)-1]
Out[3]:
0 1300
3 1400
note: the above example uses int64 Series
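In modern pandas (0.19+) this operation is built in as pd.merge_asof. A sketch of the question's example (the dates are made up to give the times full timestamps; allow_exact_matches=False gives the strictly-preceding match the question asks for):

```python
import pandas as pd

x = pd.DataFrame({'time': pd.to_datetime(['2024-01-01 13:01', '2024-01-01 14:02'])})
y = pd.DataFrame({'time': pd.to_datetime(['2024-01-01 13:00', '2024-01-01 13:01',
                                          '2024-01-01 13:30', '2024-01-01 14:00'])})
y['prev'] = y['time']  # carry the matched y timestamp into the result

# Backward (default) direction finds the latest y row before each x row;
# allow_exact_matches=False makes "before" strict. Both frames must be
# sorted on the key column.
out = pd.merge_asof(x, y, on='time', allow_exact_matches=False)
print(out['prev'].dt.strftime('%H:%M').tolist())  # ['13:00', '14:00']
```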