Perplexing pandas index change after left merge - pandas

I have a data frame and I am interested in a particular row. When I run
questionnaire_events[questionnaire_events['event_id'].eq(6506308)]
I get the row, and its index is 7,816. I then merge questionnaire_events with another data frame
merged = questionnaire_events.merge(
    ordinals,
    how='left',
    left_on='event_id',
    right_on='id')
(It is worth noting that the ordinals data frame has no NaNs and no duplicated ids, but questionnaire_events does have some rows with NaN values for event_id.)
When I then run
merged[merged['event_id'].eq(6506308)]
the resulting row has index 7,581. Why? What has happened in the merge, a left outer merge, to make my row move from 7,816 to 7,581? If there were multiple rows with the same id in the ordinals data frame, I could see how the merged data frame would end up with more rows than the left data frame, but that is not the case, so why has the row moved?
(N.B. Sorry I cannot give a crisp code sample. When I try to produce test data the row index change does not happen; it only happens on my real data.)

pd.DataFrame.merge does not preserve the original dataframe indexes.
import pandas as pd

df1 = pd.DataFrame({'key':[*'ABCDE'], 'val':[1,2,3,4,5]}, index=[100,200,300,400,500])
print('df1 dataframe:')
print(df1)
print('\n')
df2 = pd.DataFrame({'key':[*'AZCWE'], 'val':[10,20,30,40,50]}, index=[*'abcde'])
print('df2 dataframe:')
print(df2)
print('\n')
df_m = df1.merge(df2, on='key', how='left')
print('df_m dataframe:')
print(df_m)
Now, even if your df1 started out with the default RangeIndex, you can end up with a different index in the merged dataframe: if you subset or filter df1 before the merge, its index will have gaps, and the fresh RangeIndex that merge builds will no longer line up with it.
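Here is a minimal sketch of that filtered-frame case (the names and values are made-up stand-ins for questionnaire_events / ordinals):
import pandas as pd

# hypothetical data; the point is the gap in the left frame's index
events = pd.DataFrame({'event_id': [10, 20, 30, 40], 'val': list('wxyz')})
events = events[events['val'] != 'x']          # index now has a gap: 0, 2, 3
ordinals = pd.DataFrame({'id': [10, 20, 30, 40], 'ordinal': [1, 2, 3, 4]})

merged = events.merge(ordinals, how='left', left_on='event_id', right_on='id')
print(events.index.tolist())   # [0, 2, 3]
print(merged.index.tolist())   # [0, 1, 2]  <- merge built a fresh RangeIndex
The row that sat at label 3 in events now sits at label 2 in merged, which is exactly the kind of shift described in the question.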
Work Around:
df1 = df1.reset_index()
df_m2 = df1.merge(df2, on='key', how='left')
df_m2 = df_m2.set_index('index')
print('df_m2 work around dataframe:')
print(df_m2)
Output:
df_m2 work around dataframe:
      key  val_x  val_y
index
100     A      1   10.0
200     B      2    NaN
300     C      3   30.0
400     D      4    NaN
500     E      5   50.0
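If you would rather keep df1's original labels without the reset_index round trip, another option is DataFrame.join, which preserves the calling frame's index when you join one of its columns against the other frame's index. A sketch using df1 and df2 as originally defined above (before the reset_index); the rsuffix is only needed because both frames have a val column:
# join keeps df1's index (100..500); df2 is indexed by 'key' for the lookup
df_j = df1.join(df2.set_index('key'), on='key', rsuffix='_y')
print(df_j)   # index stays 100, 200, 300, 400, 500; val_y is NaN where there is no match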

Related

new_df = df1[df2['pin'].isin(df1['vpin'])] UserWarning: Boolean Series key will be reindexed to match DataFrame index

I'm getting the following warning while executing this line
new_df = df1[df2['pin'].isin(df1['vpin'])]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df1 and df2 have only one similar column and they do not have the same number of rows.
I want to filter df1 based on the column in df2. If df2.pin is in df1.vpin I want those rows.
There are multiple rows in df1 for same df2.pin and I want to retrieve them all.
df2:
pin  count
1    10
2    20
df1:
vpin  Column B
1     Cell 2
1     Cell 4
The command is working. I'm trying to overcome the warning.
It doesn't really make sense to use df2['pin'].isin(df1['vpin']) as a boolean mask to index df1, as this mask will have the indices of df2, hence the reindexing performed by pandas.
Use instead:
new_df = df1[df1['vpin'].isin(df2['pin'])]
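For instance, a small self-contained sketch with made-up frames shaped like the tables above (the column name B is a stand-in): the corrected mask is built from df1's own column, so it shares df1's index and no reindexing warning is raised.
import pandas as pd

df2 = pd.DataFrame({'pin': [1, 2], 'count': [10, 20]})
df1 = pd.DataFrame({'vpin': [1, 1], 'B': ['Cell 2', 'Cell 4']})

# mask built from df1's own column -> same index as df1, no warning
new_df = df1[df1['vpin'].isin(df2['pin'])]
print(new_df)
#    vpin       B
# 0     1  Cell 2
# 1     1  Cell 4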

Is there any method to merge multiple dataframes of different templates

There are a total of 4 dataframes (df1 / df2 / df3 / df4).
Each dataframe has a different template, but they all have some columns in common.
I want to merge the rows of each dataframe based on those shared columns, but what function should I use? A 'merge' or 'join' function doesn't seem to work, and deleting the rest of the columns after grouping them into a list seems too messy.
I want to produce the result shown in the attached image.
This is an option: you can concatenate the dataframes and then drop the unneeded columns from the combined dataframe.
df_total = pd.concat([df1, df2, df3, df4], axis=0)
df_total = df_total.drop(['Value2', 'Value3'], axis=1)
You can use reduce to get it done too.
from functools import reduce
reduce(lambda left,right: pd.merge(left, right, on=['ID','value1'], how='outer'), [df1,df2,df3,df4])[['ID','value1']]
  ID  value1
0  a       1
1  b       4
2  c       5
3  f       1
4  g       5
5  h       6
6  i       1
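A self-contained sketch of that reduce pattern with two made-up frames (hypothetical data, not the frames behind the output above): each frame keeps its own extra columns, the outer merge unions the frames on the shared key columns, and the final column selection throws the extras away.
from functools import reduce
import pandas as pd

# hypothetical frames: same key columns, different extra columns
df1 = pd.DataFrame({'ID': ['a', 'b'], 'value1': [1, 4], 'Value2': [9, 9]})
df2 = pd.DataFrame({'ID': ['c'], 'value1': [5], 'Value3': [7]})

out = reduce(lambda left, right: pd.merge(left, right, on=['ID', 'value1'], how='outer'),
             [df1, df2])[['ID', 'value1']]
print(out)
#   ID  value1
# 0  a       1
# 1  b       4
# 2  c       5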

Performance issue pandas 6 mil rows

I need some help.
I am trying to combine two data frames. The first has 58k rows, the other 100. I want to combine them so that each of the 58k rows is paired with all 100 rows from the other df, so 5.8 million rows in total.
Performance is very poor; it takes 1 hour to do 10 percent. Any suggestions for improvement?
Here is a code snippet.
import numpy as np
import pandas as pd
from IPython.display import clear_output

def myfunc(vendors3, cust_loc):
    cust_loc_vend = pd.DataFrame()
    cust_loc_vend.empty
    for i, row in cust_loc.iterrows():
        clear_output(wait=True)
        a = row.to_frame().T
        df = pd.concat([vendors3, a], axis=1, ignore_index=False)
        #cust_loc_vend = pd.concat([cust_loc_vend, df], axis=1, ignore_index=False)
        cust_loc_vend = cust_loc_vend.append(df)
        print('Current progress:', np.round(i/len(cust_loc)*100, 2), '%')
    return cust_loc_vend
For example, if the first DF has 5 rows and the second has 100 rows:
DF1 (sample, 2 columns)
I want a merged DF such that each row in DF2 is paired with all rows from DF1.
Well, all you are looking for is a join. But since there is no common column, what you can do is create a column that is the same in both dataframes and then drop it afterwards.
df['common'] = 1
df1['common'] = 1
df2 = pd.merge(df, df1, on=['common'], how='outer')
df2 = df2.drop('common', axis=1)
where df and df1 are the two dataframes and df2 is the merged result.
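As a side note, pandas 1.2+ has a built-in cross join, so the helper column is not needed, and a single merge avoids the row-by-row append loop from the question entirely. A minimal sketch with hypothetical frames:
import pandas as pd

vendors3 = pd.DataFrame({'vendor': ['v1', 'v2', 'v3']})
cust_loc = pd.DataFrame({'cust': ['c1', 'c2']})

# every cust_loc row paired with every vendors3 row: 2 * 3 = 6 rows
cust_loc_vend = cust_loc.merge(vendors3, how='cross')
print(len(cust_loc_vend))   # 6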

Concatenating two tables in pandas, giving preference to one for identical indices

I'm trying to combine two data sets df1 and df2. Rows with unique indices are always copied; rows with duplicate indices should always be picked from df1. Imagine two time series, where df2 has additional data but is of lesser quality than df1, so ideally data comes from df1, but I'm willing to backfill from df2.
df1:
date value v2
2020/01/01 df1-1 x
2020/01/03 df1-3 y
df2:
date value v2
2020/01/02 df2-2 a
2020/01/03 df2-3 b
2020/01/04 df2-4 c
are combined into
date value v2
2020/01/01 df1-1 x
2020/01/02 df2-2 a
2020/01/03 df1-3 y
2020/01/04 df2-4 c
The best I've got so far is
df = df1.merge(df2, how="outer", left_index=True, right_index=True, suffixes=('', '_y'))
df['value'] = df['value'].combine_first(df['value_y'])
df['v2'] = df['v2'].combine_first(df['v2_y'])
df = df[['value', 'v2']]
That gets the job done, but it seems unnecessarily clunky. Is there a more idiomatic way to achieve this?
You wrote rows with unique indices but you didn't show them,
so I assume the date column should be treated as that index.
Furthermore, I noticed that none of the values in your DataFrames are NaN.
If you can guarantee this, you can run:
df1.set_index('date').combine_first(df2.set_index('date'))\
.reset_index()
Steps:
combine_first - combine both DataFrames based on the values in their date columns.
reset_index - change the date column (for now the index) back into a "regular" column.
Another possible approach
If both your DataFrames have a "standard" index (consecutive numbers starting
from 0) and, for each index value, you want to keep only the first row that has it,
you can run:
df = pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index')\
.set_index('index')
df.index.name = None
But then the result is:
date value v2
0 2020-01-01 df1-1 x
1 2020-01-03 df1-3 y
2 2020-01-04 df2-4 c
so it is different from what you presented under "are combined into"
(which I assume is your expected result). This time you lose the row
with v2 == 'a'.
Yet another approach
Based also on the assumption that all values in your DataFrames are not NaN:
df1.combine_first(df2)
The result will be the same as the previous one.

Pandas: Joining information from multiple data frames, array

Suppose I have three data structures:
A data frame df1, with columns A, B, C of length 10000
A data frame df2, with columns A, some extra misc. columns... of length 8000
A Python list labels of length 8000, where the element at index i corresponds with row i in df2.
I'm trying to create a data frame from this information so that, for every element in df2.A, I grab the relevant row from df1 plus the corresponding label to pair up this information. It's possible that an entry in df2.A is NOT present in df1.A.
Currently, I'm doing this with a for i in xrange(len(df2)) loop, checking if df2.A.iloc[i] is present in df1.A, and if it is, storing df1.A, df1.B, df1.C, labels[i] in a dictionary with the first element as the key and the rest of the elements as a list.
Is there a more efficient way to do this and store the outputs df1.A, df1.B, df1.C, labels[i] in a 4-column dataframe? The for loop is really slow.
Sample data:
df1
A B C
'uid1' 'Bob' 'Rock'
'uid2' 'Jack' 'Pop'
'uid5' 'Cat' 'Country'
...
df2
A
'uid10'
'uid3'
'uid1'
...
labels
[label10, label3, label1, ...]
OK from what I understand the following should work:
# create a new column for your labels, this will align to your index
df2['labels'] = labels
# now merge the rows from df1 on column 'A'
df2 = df2.merge(df1, on='A', how='left')
Example:
import io
import pandas as pd

# setup my sample data
temp="""A B C
'uid1' 'Bob' 'Rock'
'uid2' 'Jack' 'Pop'
'uid5' 'Cat' 'Country'"""
temp1="""A
'uid10'
'uid3'
'uid1'"""
labels = ['label10', 'label3', 'label1']
df1 = pd.read_csv(io.StringIO(temp), sep='\s+')
df2 = pd.read_csv(io.StringIO(temp1))
In [97]:
# do the work
df2['labels'] = labels
df2 = df2.merge(df1, on='A', how='left')
df2
Out[97]:
A labels B C
0 'uid10' label10 NaN NaN
1 'uid3' label3 NaN NaN
2 'uid1' label1 'Bob' 'Rock'
This will be considerably faster than looping.