Concatenating two tables in pandas, giving preference to one for identical indices

I'm trying to combine two data sets, df1 and df2. Rows with unique indices are always copied; rows with duplicate indices should always be taken from df1. Imagine two time series where df2 has additional data but is of lesser quality than df1: ideally data comes from df1, but I'm willing to backfill from df2.
df1:
date value v2
2020/01/01 df1-1 x
2020/01/03 df1-3 y
df2:
date value v2
2020/01/02 df2-2 a
2020/01/03 df2-3 b
2020/01/04 df2-4 c
are combined into
date value v2
2020/01/01 df1-1 x
2020/01/02 df2-2 a
2020/01/03 df1-3 y
2020/01/04 df2-4 c
The best I've got so far is
df = df1.merge(df2, how="outer", left_index=True, right_index=True, suffixes=('', '_y'))
df['value'] = df['value'].combine_first(df['value_y'])
df['v2'] = df['v2'].combine_first(df['v2_y'])
df = df[['value', 'v2']]
That gets the job done, but it seems unnecessarily clunky. Is there a more idiomatic way to achieve this?

You wrote rows with unique indices but you didn't show them,
so I assume that the date column should be treated as the index.
Furthermore, I noticed that none of the values in your DataFrames are NaN.
If you can guarantee this, you can run:
df1.set_index('date').combine_first(df2.set_index('date'))\
.reset_index()
Steps:
combine_first - combines both DataFrames, aligning rows on their date
indices and preferring non-NaN values from df1.
reset_index - turns the date column (for now the index) back into
a "regular" column.
Another possible approach
If both your DataFrames have a "standard" index (consecutive numbers starting
from 0) and you want to keep only the first row for each of these indices
(so rows from df1 win), you can run:
df = pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index')\
.set_index('index')
df.index.name = None
But then the result is:
date value v2
0 2020-01-01 df1-1 x
1 2020-01-03 df1-3 y
2 2020-01-04 df2-4 c
so it is different from what you presented under are combined into
(which I assume is your expected result). This time you lose the row
with v2 == 'a'.
Yet another approach
Based also on the assumption that all values in your DataFrames are not NaN:
df1.combine_first(df2)
The result will be just as the previous one.

Related

new_df = df1[df2['pin'].isin(df1['vpin'])] UserWarning: Boolean Series key will be reindexed to match DataFrame index

I'm getting the following warning while executing this line
new_df = df1[df2['pin'].isin(df1['vpin'])]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df1 and df2 have only one similar column and they do not have the same number of rows.
I want to filter df1 based on the column in df2. If df2.pin is in df1.vpin I want those rows.
There are multiple rows in df1 for same df2.pin and I want to retrieve them all.
df2:
pin count
1   10
2   20
df1:
vpin Column B
1    Cell 2
1    Cell 4
The command is working. I'm trying to overcome the warning.
It doesn't really make sense to use df2['pin'].isin(df1['vpin']) as a boolean mask to index df1, as this mask will have the indices of df2, hence the reindexing performed by pandas.
Use instead:
new_df = df1[df1['vpin'].isin(df2['pin'])]
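A small runnable illustration, with hypothetical frames modeled on the fragments shown in the question:
import pandas as pd

df1 = pd.DataFrame({'vpin': [1, 1, 3], 'Column B': ['Cell 2', 'Cell 4', 'Cell 6']})
df2 = pd.DataFrame({'pin': [1, 2], 'count': [10, 20]})

# Build the mask from df1's own column so it shares df1's index
new_df = df1[df1['vpin'].isin(df2['pin'])]
print(new_df)
#    vpin Column B
# 0     1   Cell 2
# 1     1   Cell 4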

Perplexing pandas index change after left merge

I have a data frame and I am interested in a particular row. When I run
questionnaire_events[questionnaire_events['event_id'].eq(6506308)]
I get the row, and its index is 7,816. I then merge questionnaire_events with another data frame
merged = questionnaire_events.merge(
ordinals,
how='left',
left_on='event_id',
right_on='id')
(It is worth noting that the ordinals data frame has no NaNs and no duplicated ids, but questionnaire_events does have some rows with NaN values for event_id.)
merged[merged['event_id'].eq(6506308)]
The resulting row has index 7,581. Why? What has happened in the merge, a left outer merge, to mean that my row has moved from 7,816 to 7,581? If there were multiple rows with the same id in the ordinals data frame then I can see how the merged data frame would have more rows than the left data frame in the merge, but that is not the case, so why has the row moved?
(N.B. Sorry I cannot give a crisp code sample. When I try to produce test data the row index change does not happen, it is only happening on my real data.)
pd.DataFrame.merge does not preserve the original dataframes' indexes.
df1 = pd.DataFrame({'key':[*'ABCDE'], 'val':[1,2,3,4,5]}, index=[100,200,300,400,500])
print('df1 dataframe:')
print(df1)
print('\n')
df2 = pd.DataFrame({'key':[*'AZCWE'], 'val':[10,20,30,40,50]}, index=[*'abcde'])
print('df2 dataframe:')
print(df2)
print('\n')
df_m = df1.merge(df2, on='key', how='left')
print('df_m dataframe:')
print(df_m)
Now, even if your df1 has the default range index, it is possible to get a different index in your merged dataframe: if you subset or filter df1, its index is no longer consecutive, so the indexing will not match after the merge.
Work Around:
df1 = df1.reset_index()
df_m2 = df1.merge(df2, on='key', how='left')
df_m2 = df_m2.set_index('index')
print('df_m2 work around dataframe:')
print(df_m2)
Output:
df_m2 work around dataframe:
key val_x val_y
index
100 A 1 10.0
200 B 2 NaN
300 C 3 30.0
400 D 4 NaN
500 E 5 50.0
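If the left merge is one-to-one, so the row count of df1 is preserved, a shorter variant is to reassign the original index after the merge. A sketch under that assumption (the sample frames are repeated so the snippet runs on its own):
import pandas as pd

df1 = pd.DataFrame({'key': [*'ABCDE'], 'val': [1, 2, 3, 4, 5]},
                   index=[100, 200, 300, 400, 500])
df2 = pd.DataFrame({'key': [*'AZCWE'], 'val': [10, 20, 30, 40, 50]},
                   index=[*'abcde'])

# Only valid when the merge neither adds nor drops rows of df1
df_m3 = df1.merge(df2, on='key', how='left')
df_m3.index = df1.index
print(df_m3)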

Remove rows from multiple dataframe that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding the rows that contain "bad" values in a given dataframe is done by e.g.,
index1 = df1.index[df1.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
index2 = df2.index[df2.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
Now, dropping these bad rows in the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (n) dataframes columns to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date and then apply the cleanup function, so that dropping "bad" values is automatic since the entire row gets dropped? But what happens if a date is missing from one dataframe and not the other? [and they still happen to be the same length?]
Doing your replace
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then concatenate; here we use an inner join:
newdf=pd.concat([df1,df2],axis=1,keys=[1,2], join='inner').dropna()
And split it back into two dfs; here we use combine_first with dropna of the original df:
df1,df2=[s[1].loc[:,s[0]].combine_first(x.dropna()) for x,s in zip([df1,df2],newdf.groupby(level=0,axis=1))]
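If all the frames share the same date index, a more readable alternative is to build one combined mask of "good" rows and select it from every dataframe. A sketch, with hypothetical frames and column names invented for illustration:
import numpy as np
import pandas as pd

# Hypothetical frames sharing one date index
dates = pd.date_range('2009-10-07', periods=4)
df1 = pd.DataFrame({'a': [1.0, 2.0, np.inf, 4.0]}, index=dates)
df2 = pd.DataFrame({'b': [5.0, np.nan, 7.0, 8.0]}, index=dates)

dfs = [df1, df2]
# A row is "good" only if it is finite in every dataframe
good = np.logical_and.reduce(
    [df.replace([np.inf, -np.inf], np.nan).notna().all(axis=1) for df in dfs]
)
df1, df2 = (df.loc[good] for df in dfs)
print(df1.index.equals(df2.index))  # True: the same rows were kept everywhere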

show observation that got lost in merge

Let's say I want to merge two different dataframes on a key of two columns.
Dataframe One has 70000 obs of 10 variables.
Dataframe Two has 4500 obs of 5 variables.
Now I checked how many observations from my new dataframe are left by using this code.
So I realize that the rows from my dataframe Two are now only 4490 obs of 10 variables.
That's all right.
My question is:
Is there a way of getting back the 5 observations from my dataframe Two that I lost during the process? The names would be enough.
Thank you :)
I think you can use dplyr::anti_join for this. From its documentation:
return all rows from x where there are not matching values in y, keeping just columns from x.
You'd probably have to pass your data frame TWO as x.
EDIT: as mentioned in the comments, the syntax for its by argument is different.
Example:
df1 <- data.frame(Name=c("a", "b", "c"),
Date1=c(1,2,3),
stringsAsFactors=FALSE)
df2 <- data.frame(Name=c("a", "d"),
Date2=c(1,2),
stringsAsFactors=FALSE)
> dplyr::anti_join(df2, df1, by=c("Name"="Name", "Date2"="Date1"))
  Name Date2
1    d     2
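For pandas readers of this collection, a roughly equivalent anti-join can be sketched with merge(..., indicator=True); the frames below are hypothetical and mirror the R example:
import pandas as pd

df1 = pd.DataFrame({'Name': ['a', 'b', 'c'], 'Date1': [1, 2, 3]})
df2 = pd.DataFrame({'Name': ['a', 'd'], 'Date2': [1, 2]})

# Left merge with an indicator column, then keep rows that only appear in df2
merged = df2.merge(df1, how='left',
                   left_on=['Name', 'Date2'], right_on=['Name', 'Date1'],
                   indicator=True)
lost = merged.loc[merged['_merge'] == 'left_only', ['Name', 'Date2']]
print(lost)
#   Name  Date2
# 1    d      2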

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). An example is that for previous rows, the location is filled out for the row but it is blank in the next row. I know that the location has not changed but it is capturing it as a unique row because it is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced in the pandas documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
github/jreback: this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. Here's an easy way to do this.
URL: https://github.com/pandas-dev/pandas/issues/11296
According to jreback's answer, when you do a groupby, ffill() is not optimized, but cumsum() is. Try this:
df = df.sort_values('id')
df.ffill() * (1 - df.isnull().astype(int)).groupby('id').cumsum().applymap(lambda x: None if x == 0 else 1)
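For completeness, a small self-contained sketch (with hypothetical data) of the per-id forward fill that the earlier answers describe:
import numpy as np
import pandas as pd

# Hypothetical data: one missing dimension value inside each id
wfm = pd.DataFrame({
    'id':       ['xyz987', 'xyz987', 'xyz987', 'abc123', 'abc123'],
    'location': ['CA',     np.nan,   'CA',     'NY',     np.nan],
})

# Forward fill only within each id so values never leak across ids
dim_cols = ['location']
wfm[dim_cols] = wfm.groupby('id')[dim_cols].ffill()
print(wfm)
#        id location
# 0  xyz987       CA
# 1  xyz987       CA
# 2  xyz987       CA
# 3  abc123       NY
# 4  abc123       NY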