Show observations that got lost in a merge - dataframe

Let's say I want to merge two different dataframes on a key made of two columns.
Dataframe One has 70000 obs of 10 variables.
Dataframe Two has 4500 obs of 5 variables.
I then checked how many observations were left in my new, merged dataframe.
I see that the observations from my dataframe Two are now down to 4490 obs of 10 variables.
That's all right.
My question is:
Is there a way to get back the observations from my dataframe Two that I lost during the merge? The names would be enough.
Thank you :)

I think you can use dplyr::anti_join for this. From its documentation:
return all rows from x where there are not matching values in y, keeping just columns from x.
You'd probably have to pass your data frame Two as x.
EDIT: as mentioned in the comments, when the key columns have different names in the two data frames, the by argument takes a named vector (as in the example below).
Example:
df1 <- data.frame(Name = c("a", "b", "c"),
                  Date1 = c(1, 2, 3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("a", "d"),
                  Date2 = c(1, 2),
                  stringsAsFactors = FALSE)
> dplyr::anti_join(df2, df1, by = c("Name" = "Name", "Date2" = "Date1"))
  Name Date2
1    d     2


Pandas: dividing a filtered column from df1 by a filtered column of df2 gives a warning and weird behavior

I have a data frame which is conditionally broken up into two separate dataframes as follows:
df = pd.read_csv(file, names=names)
df = df.loc[df['name1'] == common_val]
df1 = df.loc[df['name2'] == target1]
df2 = df.loc[df['name2'] == target2]
# each df has a 'name3' column I want to perform a division on after this filtering
The original df is filtered by a value shared by the two dataframes, and then each of the two new dataframes is further filtered on another shared column.
What I want to work:
df1['name3'] = df1['name3']/df2['name3']
However, as many questions have pointed out, this causes a SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I tried what was recommended in this question:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2.loc[:,'name3']
# also tried:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2['name3']
But in both cases I still get the weird behavior and the SettingWithCopyWarning.
I then tried what was recommended in this answer:
df.loc[df['name2']==target1, 'name3'] = df.loc[df['name2']==target1, 'name3']/df.loc[df['name2'] == target2, 'name3']
which still results in the same copy warning.
If possible I would like to avoid copying the data frame to get around this because of the size of these dataframes (and I'm already somewhat wastefully making two almost identical dfs from the original).
If copying is the best way to go with this problem I'm interested to hear why that works over all the options I explored above.
Edit: here is a simple data frame along the lines of what df would look like after the line df.loc[df['name1'] == common_val]
name1 other1 other2 name2 name3
    a      x      y     1     2
    a      x      y     1     4
    a      x      y     2     5
    a      x      y     2     3
So if target1 = 1 and target2 = 2,
I would like df1 to contain only the rows where name2 == 1 and df2 to contain only the rows where name2 == 2, then divide the resulting df1['name3'] by the resulting df2['name3'].
If there is a less convoluted way to do this (without splitting the original df) I'm open to that as well!
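One way around both issues (not from the original thread, just a sketch assuming the two filtered frames end up with the same number of rows): take explicit copies so the slices no longer alias the original frame, and divide using the underlying arrays so pandas does not try to align the two different indexes.
import pandas as pd

# toy data mirroring the example above
df = pd.DataFrame({
    "name1": ["a", "a", "a", "a"],
    "other1": ["x", "x", "x", "x"],
    "other2": ["y", "y", "y", "y"],
    "name2": [1, 1, 2, 2],
    "name3": [2, 4, 5, 3],
})

target1, target2 = 1, 2
df1 = df.loc[df["name2"] == target1].copy()   # .copy() avoids the SettingWithCopyWarning
df2 = df.loc[df["name2"] == target2].copy()

# divide by position, not by index label, to avoid NaNs from index misalignment
df1["name3"] = df1["name3"].to_numpy() / df2["name3"].to_numpy()
print(df1[["name2", "name3"]])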

Concatenating two tables in pandas, giving preference to one for identical indices

I'm trying to combine two data sets, df1 and df2. Rows with unique indices are always copied; rows with duplicate indices should always be picked from df1. Imagine two time series where df2 has additional data but is of lesser quality than df1, so ideally the data comes from df1, but I'm willing to backfill from df2.
df1:
date value v2
2020/01/01 df1-1 x
2020/01/03 df1-3 y
df2:
date value v2
2020/01/02 df2-2 a
2020/01/03 df2-3 b
2020/01/04 df2-4 c
are combined into
date value v2
2020/01/01 df1-1 x
2020/01/02 df2-2 a
2020/01/03 df1-3 y
2020/01/04 df2-4 c
The best I've got so far is
df = df1.merge(df2, how="outer", left_index=True, right_index=True, suffixes=('', '_y'))
df['value'] = df['value'].combine_first(df['value_y'])
df['v2'] = df['v2'].combine_first(df['v2_y'])
df = df[['value', 'v2']]
That gets the job done, but it seems unnecessarily clunky. Is there a more idiomatic way to achieve this?
You wrote "rows with unique indices" but you didn't show them,
so I assume that the date column should be treated as the index.
Furthermore, I noticed that none of the values in your DataFrames are NaN.
If you can guarantee this, you can run:
df1.set_index('date').combine_first(df2.set_index('date'))\
.reset_index()
Steps:
combine_first - combine both DataFrames aligned on their date columns (now the index), preferring values from df1 and filling the gaps from df2.
reset_index - change the date column (for now the index) back into a "regular" column.
Another possible approach
If both your DataFrames have a "standard" index (consecutive numbers starting
from 0) and you want to keep only the first row for each index value,
you can run:
df = pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index')\
.set_index('index')
df.index.name = None
But then the result is:
date value v2
0 2020-01-01 df1-1 x
1 2020-01-03 df1-3 y
2 2020-01-04 df2-4 c
so it differs from what you presented under "are combined into"
(which I assume is your expected result). This time you lose the row
with v2 == 'a'.
Yet another approach
This also relies on the assumption that none of the values in your DataFrames are NaN:
df1.combine_first(df2)
The result will be the same as in the previous approach.

Merge two data frames based on common column values in Pandas

How do I merge two data frames on a common column so that only the rows whose value in that column appears in both data frames end up in the merged data frame?
I have 5000 rows of df1 in this format:
   director_name   actor_1_name     actor_2_name      actor_3_name      movie_title
0  James Cameron   CCH Pounder      Joel David Moore  Wes Studi         Avatar
1  Gore Verbinski  Johnny Depp      Orlando Bloom     Jack Davenport    Pirates of the Caribbean: At World's End
2  Sam Mendes      Christoph Waltz  Rory Kinnear      Stephanie Sigman  Spectre
and 10000 rows of df2 as
movieId genres movie_title
1 Adventure|Animation|Children|Comedy|Fantasy Toy Story
2 Adventure|Children|Fantasy Jumanji
3 Comedy|Romance Grumpier Old Men
4 Comedy|Drama|Romance Waiting to Exhale
The common column 'movie_title' has values in common, and based on them I want to keep all rows where 'movie_title' is the same in both data frames. The other rows should be deleted.
Any help/suggestion would be appreciated.
Note: I already tried
pd.merge(dfinal, df1, on='movie_title')
and the output comes out as just one row:
director_name actor_1_name actor_2_name actor_3_name movie_title movieId title genres
With how="outer"/"left"/"right" I tried them all and didn't get any rows after dropping NaN, although many common values do exist.
You can use pd.merge:
import pandas as pd
pd.merge(df1, df2, on="movie_title")
Only rows are kept for which common keys are found in both data frames. In case you want to keep all rows from the left data frame and only add values from df2 where a matching key is available, you can use how="left":
pd.merge(df1, df2, on="movie_title", how="left")
We can merge two data frames in several ways. The most common way in Python is to use the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how='inner')
To merge on columns of different dataframes, you may specify the left and right column names explicitly, especially when the same column goes by two different names, let's say 'movie_title' in df1 is called 'movie_name' in df2:
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
If you want to be even more specific, you may read the documentation of pandas merge operation.
If you want to merge two DataFrames so that only the keys common to both data frames appear in the result, do an inner merge:
import pandas as pd
merged_Frame = pd.merge(df1, df2, on='movie_title', how='inner')

How do I preset the dimensions of my dataframe in pandas?

I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift down by one row (e.g. I started with 23 rows and it stays at 23 rows even though shifting down by one should give 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
##set index to very high number to accommodate shifting row down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis =1)
Thanks!
There might be a more efficient way to achieve what you are looking to do, but to directly answer your question, you could do something like this to initialize the DataFrame with certain dimensions:
pd.DataFrame(columns=range(300),index=range(500))
You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex. It mimics np.arange and range in syntax. You can also pass a name parameter to name it.
See the documentation for pd.DataFrame and pd.Index.
df = pd.DataFrame(
    index=pd.RangeIndex(500),
    columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)
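Coming back to the shifting problem in the question, a minimal sketch (assuming bolus_raw simply holds the activity column): extending the copy's index by one row before shifting keeps the last value instead of cutting it off, so 23 input rows become 24.
import pandas as pd

bolus_raw = pd.DataFrame({"Activity (mCi)": [1.0, 2.0, 3.0]})  # hypothetical stand-in data

bolusCI = bolus_raw[["Activity (mCi)"]].copy()
# add one extra (empty) row to the copy, then shift down by one;
# the last original value now lands in the new row instead of being dropped
activity_copy = bolusCI.reindex(range(len(bolusCI) + 1)).shift(1)
result = pd.concat([bolusCI, activity_copy], axis=1)
print(result)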

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in for one row but blank in the next row for the same id. I know that the location has not changed, but because the field is blank it is captured as a distinct row.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced here in the documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
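For example, on the sample rows from the question (a minimal sketch; the real frame has about 50 columns and 75,000 ids), forward filling within each id group fills the missing location from the previous row of the same id:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "xyz987"],
    "start_date": ["2016-03-11", "2016-04-03", "2016-04-22"],
    "location": ["CA", np.nan, "CA"],
})

# forward fill the location within each id group; the NaN in the middle row
# is filled from the previous row of the same id
df["location"] = df.groupby("id")["location"].ffill()
print(df)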
From jreback on GitHub (https://github.com/pandas-dev/pandas/issues/11296): "this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this."
According to jreback's answer, ffill() is not optimized when run on a groupby, but cumsum() is. Try this:
df = df.sort_values('id')
df.ffill() * (1 - df.isnull().astype(int)).groupby(df['id']).cumsum().applymap(lambda x: None if x == 0 else 1)