How to search a data frame and remove items that match another data frame - pandas

I have two dataframes:
df1 = names: Tom, Nick, Pat, Frank
df2 = names: Tom, Nick
I would like to make a df3 by searching df1 for the names in df2 and removing the matches, so I am left with a new dataframe:
df3 = names: Pat, Frank

You can do:
df3 = df1[~df1['names'].isin(df2['names'])]
isin checks each name in df1 to see whether it appears in df2, ~ negates the resulting boolean Series, and that boolean mask filters df1 down to the non-matching rows.
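For concreteness, here is a minimal runnable sketch using the example names above (the column name names is assumed, as in the snippet):

import pandas as pd

df1 = pd.DataFrame({'names': ['Tom', 'Nick', 'Pat', 'Frank']})
df2 = pd.DataFrame({'names': ['Tom', 'Nick']})

# keep only the rows of df1 whose name does not appear in df2
df3 = df1[~df1['names'].isin(df2['names'])].reset_index(drop=True)
print(df3)
#    names
# 0    Pat
# 1  Frank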

Related

Split html table to smaller Pandas DataFrames

I'm trying to parse HTML tables from the page ukwtv.de into pandas DataFrames.
The challenge is that 2 or even 3 tables are combined into one HTML table.
From that table I want:
TV program name and SID as df1,
Kanal, Standort, etc. as df2,
Technische Details as df3.
Here is what I have managed to achieve so far:
import pandas as pd

table_MN = pd.read_html('https://www.ukwtv.de/cms/deutschland-tv/schleswig-holstein-tv.html', thousands='.', decimal=',')
df1 = table_MN[1]
df1.columns = df1.columns.str.replace(" ", "_")
df1.columns = df1.columns.str.replace("\n", "_")
df1 = df1.iloc[:7, :]
for col in df1.columns:
    print(col)
    if '.' in col:
        df1.drop(col, axis=1, inplace=True)
df1.dropna(subset=["TV-_und_Radio-Programme_des_Bouquets"], axis=0, inplace=True)
df1.head(15)

df2 = table_MN[1]
df2.columns = df2.iloc[7]
df2 = df2.iloc[8:, :]
df2 = df2.reset_index(drop=True)
df2.head(20)
Issues I am still having problems solving:
row 7 is hardcoded; how do I recognize the blank line so the data can be split into two dataframes automatically?
the Technische Details column in df1 needs to be converted to a separate dataframe where Modulation, Guardintervall, ... are the Series names
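For the first issue, one common approach is to look for an all-NaN row, since pd.read_html usually turns a blank separator line into a row of NaNs. A sketch (the table index and the fallback split value are assumptions carried over from the snippet above):

import pandas as pd

tables = pd.read_html('https://www.ukwtv.de/cms/deutschland-tv/schleswig-holstein-tv.html', thousands='.', decimal=',')
raw = tables[1]

# a "blank line" in a scraped table usually arrives as an all-NaN row;
# find it instead of hardcoding row 7
blank_rows = raw.index[raw.isna().all(axis=1)].tolist()
split_at = blank_rows[0] if blank_rows else 7  # fall back to the hardcoded split

upper = raw.iloc[:split_at, :]                    # first sub-table (df1)
lower = raw.iloc[split_at:, :].dropna(how='all')  # second sub-table, separator row removed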

pandas - lookup a value in another DF without merging data from the other dataframe

I have 2 DataFrames. DF1 and DF2.
(Please note that DF2 has more entries than DF1)
What I want to do is add the nationality column to DF1 (The script should look up the name in DF1 and find the corresponding nationality in DF2).
I am currently using the below code
final_DF = df1.merge(df2[['PPS', 'Nationality']], on=['PPS'], how='left')
Although the nationality column is being added to DF1, the code is duplicating entries and also pulling in additional data from DF2 that I do not want.
Is there a method to get the nationality from DF2 while only keeping the DF1 data?
Thanks
(Tables for DF1, DF2, and the expected OUTPUT were shown as images in the original post.)
Two points you need to address.
First, check whether there are any duplicates in DF2: if the merge key is duplicated there, every match produces an extra row in the result.
Second, you can define 'how' in the merge statement, so it will look like:
final_DF = DF1.merge(DF2, on=['Name'], how='left')
Since you want to keep only the DF1 rows, 'left' should be the ideal option for you.
For more info, refer to the pandas merge documentation.
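A minimal sketch combining both points, reusing the PPS and Nationality columns from the question (the sample rows are made up for illustration):

import pandas as pd

DF1 = pd.DataFrame({'Name': ['Ann', 'Bob'], 'PPS': ['P1', 'P2']})
DF2 = pd.DataFrame({'Name': ['Ann', 'Ann', 'Bob', 'Cat'],
                    'PPS': ['P1', 'P1', 'P2', 'P3'],
                    'Nationality': ['IE', 'IE', 'UK', 'FR']})

# deduplicate DF2 on the merge key and keep only the columns to be added,
# then left-merge so every DF1 row is kept exactly once
lookup = DF2[['PPS', 'Nationality']].drop_duplicates(subset='PPS')
final_DF = DF1.merge(lookup, on='PPS', how='left')
print(final_DF)
#   Name PPS Nationality
# 0  Ann  P1          IE
# 1  Bob  P2          UK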

find a value from df1 in df2 and replace other values of the matching rows

I have the following code with 2 dataframes (df1 & df2)
import pandas as pd
data = {'Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5'],
        'Number': ['456', 'A977', '132a', '6783r', '868354']}
replace = {'NewName': ['NewName1', 'NewName3', 'NewName4', 'NewName5', 'NewName2'],
           'ID': ['I753', '25552', '6783r', '868354', 'A977']}
df1 = pd.DataFrame(data, columns=['Name', 'Number'])
df2 = pd.DataFrame(replace, columns=['NewName', 'ID'])
Now I would like to compare every item in the 'Number' column of df1 with the 'ID' column of df2. If there is a match, I would like to replace the 'Name' of df1 with the 'NewName' of df2, otherwise it should keep the 'Name' of df1.
First I tried the following code, but unfortunately it mixed up the names and numbers across rows.
df1.loc[df1['Number'].isin(df2['ID']), ['Name']] = df2.loc[df2['ID'].isin(df1['Number']),['NewName']].values
The next code that I tried worked a bit better, but it replaced the 'Name' in df1 with the 'Number' from df1 whenever there was no match.
df1['Name'] = df1['Number'].replace(df2.set_index('ID')['NewName'])
How can I stop this behavior in my last code or are there better ways in general to achieve what I would like to do?
You can use map instead of replace to substitute each value in the Number column of df1 with the corresponding value from the NewName column of df2, and then fill the NaN values (the values that could not be mapped) with the original values from the Name column of df1:
df1['Name'] = df1['Number'].map(df2.set_index('ID')['NewName']).fillna(df1['Name'])
>>> df1
       Name  Number
0     Name1     456
1  NewName2    A977
2     Name3    132a
3  NewName4   6783r
4  NewName5  868354
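For clarity on why replace misbehaved in the earlier attempt: replace passes unmatched values through unchanged (so the unmatched Numbers leaked into Name), while map turns them into NaN, which fillna can then backfill. A quick demonstration (on the original df1, before its Name column is overwritten):

lookup = df2.set_index('ID')['NewName']

df1['Number'].replace(lookup)                  # unmatched values stay as the Number, e.g. '456'
df1['Number'].map(lookup)                      # unmatched values become NaN instead
df1['Number'].map(lookup).fillna(df1['Name'])  # NaN backfilled with the original Name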

search and compare data between dataframes

I have an issue with merging dataframes.
I have two dataframes as follows:
df1:
   ID  name-group  status
   1   bob,david   good
   2   CC,robben   good
   3   jack        bad
df2:
   ID  leader  location
   2   robben  JAPAN
   3   jack    USA
   4   bob     UK
I want to get a result as follows:
dft:
   ID  name-group  Leader  location
   1   bob,david
   2   CC,robben   Robben  JAPAN
   3   jack        Jack    USA
The [Leader] and [location] will be merged in when
[leader] in df2 **IN** [name-group] of df1
&
[ID] of df2 **=** [ID] of df1
I have tried a for loop, but its time cost is very high.
Any ideas for this issue?
Thanks
See the end of the post for runnable code. The proposed solution is in the function using_tidy.
The main problem here is that having multiple names in name-group, separated
by commas, makes searching for membership difficult. If, instead, df1 had each
member of name-group in its own row, then testing for membership would be
easy. That is, suppose df1 looked like this:
   ID name-group status
0   1        bob   good
0   1      david   good
1   2         CC   good
1   2     robben   good
2   3       jack    bad
Then you could simply merge df1 and df2 on ID and test if leader
equals name-group... almost (see why "almost" below).
Putting df1 in tidy format is the main idea in the solution below. The reason it improves performance is that testing for equality between two columns is much, much faster than testing whether the strings in one column are substrings of, or members of a list of strings in, another column.
I said "almost" above because there is another difficulty -- after merging df1 and df2 on ID, some rows are leaderless, such as the bob,david row:
   ID  name-group  Leader  location
   1   bob,david
Since we simply want to keep these rows and we don't want to test if criteria #1 holds in this case, we need to treat these rows differently -- don't expand them.
We can handle this problem by separating the leaderless rows from those with potential leaders (see below).
The second criteria, that the IDs match is easy to enforce by merging df1 and df2 on ID:
dft = pd.merge(df1, df2, on='ID', how='left')
The first criteria is that dft['leader'] is in dft['name-group'].
This criteria could be expressed as
In [293]: dft.apply(lambda x: pd.isnull(x['leader']) or (x['leader'] in x['name-group'].split(',')), axis=1)
Out[293]:
0    True
1    True
2    True
dtype: bool
but using dft.apply(..., axis=1) calls the lambda function once for each
row. This can be very slow if there are many rows in dft.
If there are many rows in dft we can do better by first converting dft to tidy format -- placing each member in dft['name-group'] on its own row. But first, let's split dft into 2 sub-DataFrames: those which have a leader, and those which don't:
has_leader = pd.notnull(dft['leader'])
leaderless, leaders = dft.loc[~has_leader, :], dft.loc[has_leader, :]
Now put the leaders in tidy format (one member per row):
member = leaders['name-group'].str.split(',', expand=True)
member = member.stack()
member.index = member.index.droplevel(1)
member.name = 'member'
leaders = pd.concat([member, leaders], axis=1)
The pay off for all this work is that criteria #1 can now be expressed by a fast calculation:
# this enforces criteria #1 (leader of df2 is in name-group of df1)
mask = (leaders['leader'] == leaders['member'])
leaders = leaders.loc[mask, :]
leaders = leaders.drop('member', axis=1)
and the desired result is:
dft = pd.concat([leaderless, leaders], axis=0)
We had to do some work to get df1 into tidy format. We need to benchmark to
determine if the cost of doing that extra work pays off by being able to compute criteria #1 faster.
Here is a benchmark using largish dataframes of 1000 rows for df1 and df2:
In [356]: %timeit using_tidy(df1, df2)
100 loops, best of 3: 17.8 ms per loop
In [357]: %timeit using_apply(df1, df2)
10 loops, best of 3: 98.2 ms per loop
The speed advantage of using_tidy over using_apply increases as the number
of rows in pd.merge(df1, df2, on='ID', how='left') increases.
Here is the setup for the benchmark:
import string
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name-group': ['bob,david', 'CC,robben', 'jack'],
                    'status': ['good', 'good', 'bad'],
                    'ID': [1, 2, 3]})
df2 = pd.DataFrame({'leader': ['robben', 'jack', 'bob'],
                    'location': ['JAPAN', 'USA', 'UK'],
                    'ID': [2, 3, 4]})

def using_apply(df1, df2):
    dft = pd.merge(df1, df2, on='ID', how='left')
    mask = dft.apply(lambda x: pd.isnull(x['leader']) or (x['leader'] in x['name-group'].split(',')), axis=1)
    return dft.loc[mask, :]

def using_tidy(df1, df2):
    # this enforces criteria #2 (the IDs are the same)
    dft = pd.merge(df1, df2, on='ID', how='left')
    # split dft into 2 sub-DataFrames, based on rows which have a leader and those which do not
    has_leader = pd.notnull(dft['leader'])
    leaderless, leaders = dft.loc[~has_leader, :], dft.loc[has_leader, :]
    # expand leaders so each member in name-group has its own row
    member = leaders['name-group'].str.split(',', expand=True)
    member = member.stack()
    member.index = member.index.droplevel(1)
    member.name = 'member'
    leaders = pd.concat([member, leaders], axis=1)
    # this enforces criteria #1 (leader of df2 is in name-group of df1)
    mask = (leaders['leader'] == leaders['member'])
    leaders = leaders.loc[mask, :]
    leaders = leaders.drop('member', axis=1)
    dft = pd.concat([leaderless, leaders], axis=0)
    return dft

def make_random_str_array(letters=string.ascii_uppercase, strlen=10, size=100):
    return (np.random.choice(list(letters), size*strlen)
            .view('|U{}'.format(strlen)))

def make_dfs(N=1000):
    names = make_random_str_array(strlen=4, size=10)
    df1 = pd.DataFrame({
        'name-group': [','.join(np.random.choice(names, size=np.random.randint(1, 10), replace=False)) for i in range(N)],
        'status': np.random.choice(['good', 'bad'], size=N),
        'ID': np.random.randint(4, size=N)})
    df2 = pd.DataFrame({
        'leader': np.random.choice(names, size=N),
        'location': np.random.randint(10, size=N),
        'ID': np.random.randint(4, size=N)})
    return df1, df2

df1, df2 = make_dfs()
Why don't you use
dft = pd.merge(df1, df2, how='left', left_on=['ID'], right_on=['ID'])

Conditional on pandas DataFrames

Let df1, df2, and df3 be pandas DataFrames having the same structure but different numerical values. I want to perform:
res = if df1 > 1.0: (df2 - df3)/(df1 - 1) else df3
res should have the same structure as df1, df2, and df3.
numpy.where() generates the result as a plain NumPy array, losing the indices.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["parameter2"]. I want to access the new calculated DataFrame/Series res as res["instanceA"]["parameter1"]["parameter2"].
Actually numpy.where should work fine here. The output below is 4x2 (the same shape as df1, df2, and df3).
df1 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df2 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df3 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
res = df3.copy()
res[:] = np.where( df1 > 1, (df2-df3)/(df1-1), df3 )
          x         y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2  0.324390  1.222632
3 -0.138606  0.955993
Note that this should work on both series and dataframes. The [:] is slicing syntax that preserves the index and columns. Without that res will come out as an array rather than series or dataframe.
Alternatively, for a Series you could write it as @Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index,
                   columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames will be appreciated.
Say df is your initial dataframe, with df1, df2, and df3 as its columns, and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust the values where your condition holds. Note that chained indexing like df[df['df1'] > 1.0]['res'] = ... assigns to a copy and will not update df; use .loc instead:
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)
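A minimal runnable sketch of this approach, assuming (as this answer does) that df1, df2, and df3 live as columns of a single dataframe; the sample values are made up:

import pandas as pd

df = pd.DataFrame({'df1': [0.5, 2.0, 3.0],
                   'df2': [1.0, 4.0, 9.0],
                   'df3': [7.0, 8.0, 6.0]})

# start from df3, then overwrite only where the condition holds
df['res'] = df['df3']
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)
print(df)
#    df1  df2  df3  res
# 0  0.5  1.0  7.0  7.0
# 1  2.0  4.0  8.0 -4.0
# 2  3.0  9.0  6.0  1.5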