Merge two data frames based on common column values in Pandas - pandas

How can I merge two data frames so that the result contains only the rows whose value in a particular column appears in both data frames?
I have 5000 rows in df1, in this format:
   director_name   actor_1_name     actor_2_name      actor_3_name      movie_title
0  James Cameron   CCH Pounder      Joel David Moore  Wes Studi         Avatar
1  Gore Verbinski  Johnny Depp      Orlando Bloom     Jack Davenport    Pirates of the Caribbean: At World's End
2  Sam Mendes      Christoph Waltz  Rory Kinnear      Stephanie Sigman  Spectre
and 10000 rows in df2, in this format:
movieId  genres                                       movie_title
1        Adventure|Animation|Children|Comedy|Fantasy  Toy Story
2        Adventure|Children|Fantasy                   Jumanji
3        Comedy|Romance                               Grumpier Old Men
4        Comedy|Drama|Romance                         Waiting to Exhale
The common column 'movie_title' has values that appear in both data frames. Based on them, I want to keep all rows where 'movie_title' is the same and drop the other rows.
Any help/suggestion would be appreciated.
Note: I already tried
pd.merge(dfinal, df1, on='movie_title')
and the output is a single row:
director_name actor_1_name actor_2_name actor_3_name movie_title movieId title genres
I also tried how="outer", "left" and "right". In every case no rows were left after dropping NaN, although many common values do exist in the column.

You can use pd.merge:
import pandas as pd
pd.merge(df1, df2, on="movie_title")
Only rows whose key is found in both data frames are kept. If you want to keep all rows from the left data frame and only add values from df2 where a matching key is available, you can use how="left":
pd.merge(df1, df2, on="movie_title", how="left")
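As a rough illustration of the difference between the two calls above, here is a minimal sketch. The two small frames are toy stand-ins for df1 and df2 (the values are illustrative, not the asker's data).

```python
import pandas as pd

# Toy stand-ins for the two frames in the question (illustrative values only).
df1 = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes", "John Lasseter"],
    "movie_title": ["Avatar", "Spectre", "Toy Story"],
})
df2 = pd.DataFrame({
    "movieId": [1, 2],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
    "movie_title": ["Toy Story", "Jumanji"],
})

# Default how="inner": only titles present in both frames survive.
inner = pd.merge(df1, df2, on="movie_title")

# how="left": every row of df1 is kept; df2 columns are NaN where no title matches.
left = pd.merge(df1, df2, on="movie_title", how="left")

print(inner)
print(left)
```

With an inner merge there is no need to drop NaN afterwards, because unmatched rows are never produced.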

We can merge two data frames in several ways. The most common way in Python is the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')
To merge on columns that have different names in the two data frames, specify the left and right key columns explicitly, for example when the same column is called 'movie_title' in one frame and 'movie_name' in the other.
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
If you want to be even more specific, you may read the documentation of pandas merge operation.
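A small sketch of the left_on/right_on case follows. The frames and the 'movie_name' column are made up for illustration. Note that both key columns end up in the result, so one of them is usually dropped afterwards.

```python
import pandas as pd

# Hypothetical frames: the same key is named differently in each one.
df1 = pd.DataFrame({
    "movie_title": ["Toy Story", "Avatar"],
    "director_name": ["John Lasseter", "James Cameron"],
})
df2 = pd.DataFrame({
    "movie_name": ["Toy Story", "Jumanji"],
    "movieId": [1, 2],
})

dfinal = df1.merge(df2, how="inner", left_on="movie_title", right_on="movie_name")

# Both key columns are kept by the merge; drop the redundant one.
dfinal = dfinal.drop(columns="movie_name")
print(dfinal)
```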

If you want a merged DataFrame in which only the key values common to both data frames appear, do an inner merge.
import pandas as pd
# 'id' stands for the name of the common column
merged_Frame = pd.merge(df1, df2, on='id', how='inner')
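If an inner merge returns far fewer rows than expected, one possible cause (an assumption here, the question does not confirm it) is that the key strings differ in invisible ways, such as trailing whitespace. A minimal sketch of normalizing the key before merging, using made-up data:

```python
import pandas as pd

# Made-up data: the titles in df1 carry a trailing space, those in df2 do not.
df1 = pd.DataFrame({"movie_title": ["Toy Story ", "Jumanji ", "Avatar "]})
df2 = pd.DataFrame({"movie_title": ["Toy Story", "Jumanji"], "movieId": [1, 2]})

# The keys look equal when printed but do not compare equal.
before = pd.merge(df1, df2, on="movie_title", how="inner")

# Strip surrounding whitespace from the key in both frames, then merge again.
df1["movie_title"] = df1["movie_title"].str.strip()
df2["movie_title"] = df2["movie_title"].str.strip()
after = pd.merge(df1, df2, on="movie_title", how="inner")

print(len(before), len(after))
```

Comparing a few key values with repr() is a quick way to check whether this applies to your data.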

Related

How to duplicate a row in pandas based on column condition?

I have a pandas data frame and I would like to duplicate the rows that meet some column condition (i.e. having multiple elements in the CourseID column).
I tried iterating over the data frame to identify the rows which should be duplicated, but I don't know how to duplicate them.
Using Pandas version 0.25 it is quite easy:
The first step is to split df.CourseID (converting each element to a list)
and then to explode it (break each list into multiple rows,
repeating other columns in each row):
course = df.CourseID.str.split(',').explode()
The result is:
0 456
1 456
1 799
2 789
Name: CourseID, dtype: object
Then all that remains is to join df with course, but to avoid
repeating column names, you have to drop the original CourseID column first.
Fortunately, it can be expressed in a single instruction:
df.drop(columns=['CourseID']).join(course)
If you have some older version of Pandas this is a good reason to
upgrade it.
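The two steps can be tried end to end with a small frame. The question's data is not shown, so the frame below is a guess that reproduces the CourseID values printed above; the 'Student' column is hypothetical.

```python
import pandas as pd

# Hypothetical input; only the CourseID values are taken from the output above.
df = pd.DataFrame({
    "Student": ["A", "B", "C"],
    "CourseID": ["456", "456,799", "789"],
})

# Split each CourseID string into a list, then give every element its own row.
# The original index is repeated for rows that came from the same list.
course = df["CourseID"].str.split(",").explode()

# Join on the index: rows of df are repeated once per exploded element.
result = df.drop(columns=["CourseID"]).join(course)
print(result)
```

The result keeps the repeated index (0, 1, 1, 2); calling reset_index(drop=True) afterwards gives a fresh one if needed.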

Iterate two dataframes, compare and change a value in pandas or pyspark

I am trying to do an exercise in pandas.
I have two dataframes. I need to compare few columns between both dataframes and change the value of one column in the first dataframe if the comparison is successful.
Dataframe 1:
Article Country Colour Buy
Pants Germany Red 0
Pull Poland Blue 0
Initially all my articles have the flag 'Buy' set to zero.
I have dataframe 2 that looks as:
Article Origin Colour
Pull Poland Blue
Dress Italy Red
I want to check whether the article, country/origin and colour columns match (that is, whether each article from dataframe 1 can be found in dataframe 2) and, if so, set the flag 'Buy' to 1.
I tried iterating through both dataframes with pyspark, but pyspark dataframes are not iterable.
I thought about doing it in pandas, but apparently it is bad practice to change values during iteration.
Which code in pyspark or pandas would do what I need?
Thanks!
Merge with an indicator, then map the values. Make sure to drop_duplicates on the merge keys in the right frame so the merge result is always the same length as the original, and rename so we don't repeat the same information after the merge. There is no need for a pre-defined column of 0s.
df1 = df1.drop(columns='Buy')
df1 = df1.merge(df2.drop_duplicates().rename(columns={'Origin': 'Country'}),
indicator='Buy', how='left')
df1['Buy'] = df1['Buy'].map({'left_only': 0, 'both': 1}).astype(int)
Article Country Colour Buy
0 Pants Germany Red 0
1 Pull Poland Blue 1
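For reference, a self-contained sketch of the same idea built from the sample data in the question. It differs slightly from the snippet above: the merge keys are listed explicitly and the flag is derived by comparing the indicator with 'both'. The names keys, right and merged are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Article": ["Pants", "Pull"],
    "Country": ["Germany", "Poland"],
    "Colour": ["Red", "Blue"],
    "Buy": [0, 0],
})
df2 = pd.DataFrame({
    "Article": ["Pull", "Dress"],
    "Origin": ["Poland", "Italy"],
    "Colour": ["Blue", "Red"],
})

keys = ["Article", "Country", "Colour"]

# Align the column names and keep one row per key combination,
# so the left merge cannot add rows to df1.
right = df2.rename(columns={"Origin": "Country"})[keys].drop_duplicates()

# indicator="Buy" adds a column holding 'left_only' or 'both' for each row.
merged = df1.drop(columns="Buy").merge(right, on=keys, how="left", indicator="Buy")
merged["Buy"] = (merged["Buy"] == "both").astype(int)
print(merged)
```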

merging dataframes (python pandas)

I have two dataframes, df1 and df2.
I want to merge them using Python pandas without creating the Cartesian product. How should I do it?
Currently, I am using
df3=pd.merge(df1,df2,on='id',how='left')
but it gives me the cross product. The resulting dataframe df3 contains 14 records: 6 for id=1 and 8 for id=2.
Thanks,
You may need an additional helper key, created with cumcount:
df1['Helpkey']=df1.groupby('id').cumcount()
df2['Helpkey']=df2.groupby('id').cumcount()
df1.merge(df2,how='left').drop(columns='Helpkey')
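A sketch of why the helper key works, with made-up frames since the question's data was not included. cumcount numbers the rows within each id (0, 1, ...), so merging on both id and the helper key pairs the n-th row of an id in df1 with the n-th row of the same id in df2.

```python
import pandas as pd

# Made-up frames: each id appears twice on both sides.
df1 = pd.DataFrame({"id": [1, 1, 2, 2], "a": ["w", "x", "y", "z"]})
df2 = pd.DataFrame({"id": [1, 1, 2, 2], "b": [10, 20, 30, 40]})

# Merging on id alone pairs every row with every row of the same id (2 x 2 per id).
plain = df1.merge(df2, on="id", how="left")

# Number the rows within each id and merge on (id, Helpkey) instead.
df1["Helpkey"] = df1.groupby("id").cumcount()
df2["Helpkey"] = df2.groupby("id").cumcount()
paired = df1.merge(df2, on=["id", "Helpkey"], how="left").drop(columns="Helpkey")

print(len(plain), len(paired))
print(paired)
```

This pairing depends on row order within each id, so it only makes sense when that order is meaningful in both frames.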

show observation that got lost in merge

Let's say I want to merge two different dataframes using two columns as the key.
Dataframe One has 70000 obs of 10 variables.
Dataframe Two has 4500 obs of 5 variables.
Now I checked how many observations are left in my new dataframe by using this code.
So I realize that my dataframe Two is now only 4490 obs of 10 variables.
That's all right.
My question is:
Is there a way to get back the 5 observations from my dataframe Two that I lost during the process? The names would be enough.
Thank you :)
I think you can use dplyr::anti_join for this. From its documentation:
return all rows from x where there are not matching values in y, keeping just columns from x.
You'd probably have to pass your data frame TWO as x.
EDIT: as mentioned in the comments, the syntax for its by argument is different.
Example:
df1 <- data.frame(Name=c("a", "b", "c"),
Date1=c(1,2,3),
stringsAsFactors=FALSE)
df2 <- data.frame(Name=c("a", "d"),
Date2=c(1,2),
stringsAsFactors=FALSE)
> dplyr::anti_join(df2, df1, by=c("Name"="Name", "Date2"="Date1"))
  Name Date2
1 d 2

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). An example is that for previous rows, the location is filled out for the row but it is blank in the next row. I know that the location has not changed but it is capturing it as a unique row because it is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced in the documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
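In recent pandas versions the groupby object also has an ffill method of its own, which avoids the Python-level lambda. A minimal sketch on made-up rows shaped like the question's data (the 'abc123' id and the 'NY' value are invented to show that values do not leak between ids):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "abc123", "abc123"],
    "start_date": ["2016-03-11", "2016-04-03", "2016-03-01", "2016-03-15"],
    "location": ["CA", np.nan, np.nan, "NY"],
})

# Columns to repair; the grouping column itself is not included.
cols = ["location"]

# Forward-fill within each id only. The result keeps the original index,
# so it can be assigned straight back.
df[cols] = df.groupby("id")[cols].ffill()
print(df)
```

Listing the columns explicitly is a deliberate choice: forward-filling every column would also fill fields such as end_date, where NaN on the current row looks intentional. The rows should be sorted by date within each id before filling.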
jreback on GitHub (https://github.com/pandas-dev/pandas/issues/11296): "this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this."
According to jreback's answer, ffill() is not optimized when you do a groupby, but cumsum() is. Try this:
df = df.sort_values('id')
df.ffill() * (1 - df.isnull().astype(int)).groupby('id').cumsum().applymap(lambda x: None if x == 0 else 1)