search and compare data between dataframes - pandas

I have an issue merging two data frames.
The data frames are as follows:
df1:
ID  name-group  status
1   bob,david   good
2   CC,robben   good
3   jack        bad
df2:
ID  leader  location
2   robben  JAPAN
3   jack    USA
4   bob     UK
I want to get the following result:
dft:
ID  name-group  leader  location
1   bob,david
2   CC,robben   robben  JAPAN
3   jack        jack    USA
The [leader] and [location] values should be merged in when
[leader] of df2 **is in** [name-group] of df1
&
[ID] of df2 **=** [ID] of df1
I have tried a for loop, but it is far too slow.
Any ideas for this issue?
Thanks

See the end of the post for runnable code. The proposed solution is in the function using_tidy.
The main problem here is that having multiple names in name-group, separated
by commas, makes searching for membership difficult. If, instead, df1 had each
member of name-group in its own row, then testing for membership would be
easy. That is, suppose df1 looked like this:
   ID name-group status
0   1        bob   good
0   1      david   good
1   2         CC   good
1   2     robben   good
2   3       jack    bad
Then you could simply merge df1 and df2 on ID and test if leader
equals name-group... almost (see why "almost" below).
Putting df1 in tidy format
is the main idea in the solution below. The reason it improves performance
is that testing for equality between two columns is much, much faster than
testing whether the strings in one column are substrings of the strings in another column,
or are members of a column containing lists of strings.
The reason I said "almost" above is that there is another difficulty --
after merging df1 and df2 on ID, some rows are leaderless, such as the bob,david row:
ID name-group Leader location
1 bob,david
Since we simply want to keep these rows and we don't want to test if criteria #1 holds in this case, we need to treat these rows differently -- don't expand them.
We can handle this problem by separating the leaderless rows from those with potential leaders (see below).
The second criteria, that the IDs match is easy to enforce by merging df1 and df2 on ID:
dft = pd.merge(df1, df2, on='ID', how='left')
The first criteria is that dft['leader'] is in dft['name-group'].
This criteria could be expressed as
In [293]: dft.apply(lambda x: pd.isnull(x['leader']) or (x['leader'] in x['name-group'].split(',')), axis=1)
Out[293]:
0 True
1 True
2 True
dtype: bool
but using dft.apply(..., axis=1) calls the lambda function once for each
row. This can be very slow if there are many rows in dft.
If there are many rows in dft we can do better by first converting dft to
tidy format -- placing each
member of dft['name-group'] on its own row. But first, let's split dft into 2
sub-DataFrames: those which have a leader, and those which don't:
has_leader = pd.notnull(dft['leader'])
leaderless, leaders = dft.loc[~has_leader, :], dft.loc[has_leader, :]
Now put the leaders in tidy format (one member per row):
member = leaders['name-group'].str.split(',', expand=True)
member = member.stack()
member.index = member.index.droplevel(1)
member.name = 'member'
leaders = pd.concat([member, leaders], axis=1)
The pay off for all this work is that criteria #1 can now be expressed by a fast calculation:
# this enforces criteria #1 (leader of df2 is in name-group of df1)
mask = (leaders['leader'] == leaders['member'])
leaders = leaders.loc[mask, :]
leaders = leaders.drop('member', axis=1)
and the desired result is:
dft = pd.concat([leaderless, leaders], axis=0)
We had to do some work to get df1 into tidy format. We need to benchmark to
determine if the cost of doing that extra work pays off by being able to compute criteria #1 faster.
Here is a benchmark using largish dataframes of 1000 rows for df1 and df2:
In [356]: %timeit using_tidy(df1, df2)
100 loops, best of 3: 17.8 ms per loop
In [357]: %timeit using_apply(df1, df2)
10 loops, best of 3: 98.2 ms per loop
The speed advantage of using_tidy over using_apply increases as the number
of rows in pd.merge(df1, df2, on='ID', how='left') increases.
Here is the setup for the benchmark:
import string
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name-group': ['bob,david', 'CC,robben', 'jack'],
                    'status': ['good', 'good', 'bad'],
                    'ID': [1, 2, 3]})

df2 = pd.DataFrame({'leader': ['robben', 'jack', 'bob'],
                    'location': ['JAPAN', 'USA', 'UK'],
                    'ID': [2, 3, 4]})

def using_apply(df1, df2):
    dft = pd.merge(df1, df2, on='ID', how='left')
    mask = dft.apply(lambda x: pd.isnull(x['leader'])
                     or (x['leader'] in x['name-group'].split(',')), axis=1)
    return dft.loc[mask, :]

def using_tidy(df1, df2):
    # this enforces criteria #2 (the IDs are the same)
    dft = pd.merge(df1, df2, on='ID', how='left')

    # split dft into 2 sub-DataFrames, based on rows which have a leader and those which do not
    has_leader = pd.notnull(dft['leader'])
    leaderless, leaders = dft.loc[~has_leader, :], dft.loc[has_leader, :]

    # expand leaders so each member in name-group has its own row
    member = leaders['name-group'].str.split(',', expand=True)
    member = member.stack()
    member.index = member.index.droplevel(1)
    member.name = 'member'
    leaders = pd.concat([member, leaders], axis=1)

    # this enforces criteria #1 (leader of df2 is in name-group of df1)
    mask = (leaders['leader'] == leaders['member'])
    leaders = leaders.loc[mask, :]
    leaders = leaders.drop('member', axis=1)

    dft = pd.concat([leaderless, leaders], axis=0)
    return dft

def make_random_str_array(letters=string.ascii_uppercase, strlen=10, size=100):
    return (np.random.choice(list(letters), size*strlen)
            .view('|U{}'.format(strlen)))

def make_dfs(N=1000):
    names = make_random_str_array(strlen=4, size=10)
    df1 = pd.DataFrame({
        'name-group': [','.join(np.random.choice(names, size=np.random.randint(1, 10), replace=False))
                       for i in range(N)],
        'status': np.random.choice(['good', 'bad'], size=N),
        'ID': np.random.randint(4, size=N)})
    df2 = pd.DataFrame({
        'leader': np.random.choice(names, size=N),
        'location': np.random.randint(10, size=N),
        'ID': np.random.randint(4, size=N)})
    return df1, df2

df1, df2 = make_dfs()

Why don't you use
dft = pd.merge(df1, df2, how='left', left_on=['ID'], right_on=['ID'])
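Note that this plain left merge only enforces the second criterion (matching IDs); you would still need a follow-up filter for whether the leader appears in name-group, along the lines of the using_apply approach above:
keep = [pd.isnull(l) or (l in g.split(','))
        for l, g in zip(dft['leader'], dft['name-group'])]
dft = dft.loc[keep]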


Merge pandas dataframe on matched substrings

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different numbers of spaces, and the absence/presence of diacritical marks, I would like to be able to merge as long as the values are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib's).
Say one DataFrame has the following data:
df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
number
one 1
two 2
three 3
four 4
five 5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
letter
one a
too b
three c
fours d
five e
Then I want to get the resulting DataFrame
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
In [23]: import difflib
In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>
In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
In [26]: df2
Out[26]:
letter
one a
two b
three c
four d
five e
In [31]: df1.join(df2)
Out[31]:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
If these were columns, in the same vein you could apply difflib to the column and then merge:
df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
Using fuzzywuzzy
Since there are no examples with the fuzzywuzzy package, here's a function I wrote which will return all matches based on a threshold you can set as a user:
Example dataframes
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
# df1
Key
0 Apple
1 Banana
2 Orange
3 Strawberry
# df2
Key
0 Aple
1 Mango
2 Orag
3 Straw
4 Bannanna
5 Berry
Function for fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
Using our function on the dataframes: #1
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Straw, Berry
Using our function on the dataframes: #2
df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
Col1 matches
0 Microsoft Mcrsoft
1 Google gogle
2 Amazon Amason
3 IBM
Installation:
Pip
pip install fuzzywuzzy
Anaconda
conda install -c conda-forge fuzzywuzzy
I have written a Python package which aims to solve this problem:
pip install fuzzymatcher
You can find the repo here and docs here.
Basic usage:
Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:
import fuzzymatcher
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Or if you just want to link on the closest match:
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package:
import jellyfish
import pandas

def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        current_score = jellyfish.jaro_winkler(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
For a general approach: fuzzy_merge
For a more general scenario in which we want to merge columns from two dataframes which contain slightly different strings, the following function uses difflib.get_close_matches along with merge in order to mimic the functionality of pandas' merge but with fuzzy matching:
import difflib

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other = df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None
Here are some use cases with two sample dataframes:
print(df1)
key number
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
print(df2)
key_close letter
0 three c
1 one a
2 too b
3 fours d
4 a very different string e
With the above example, we'd get:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
And we could do a left join with:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 NaN NaN
For a right join, all non-matching keys in the left dataframe are set to None:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')
key number key_close letter
0 one 1.0 one a
1 two 2.0 too b
2 three 3.0 three c
3 four 4.0 fours d
4 None NaN a very different string e
Also note that difflib.get_close_matches will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in df2 to say:
print(df2)
letter
one a
too b
three c
fours d
a very different string e
We'd get an index out of range error:
df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
IndexError: list index out of range
In order to solve this the above function get_closest_match will return the closest match by indexing the list returned by difflib.get_close_matches only if it actually contains any matches.
http://pandas.pydata.org/pandas-docs/dev/merging.html does not describe a hook function to do this on the fly. Would be nice though...
I would just do it as a separate step: use difflib's get_close_matches to create a new column in one of the 2 dataframes, and then merge/join on the fuzzy-matched column.
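For instance, a rough sketch of that two-step approach (the column names 'key' and 'key_close' here are just placeholders):
import difflib

# Step 1: add a fuzzy-matched key column to df2
df2['key_matched'] = df2['key_close'].map(
    lambda x: next(iter(difflib.get_close_matches(x, df1['key'])), None))

# Step 2: an ordinary merge/join on the matched column
merged = df1.merge(df2, left_on='key', right_on='key_matched', how='left')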
I used the fuzzymatcher package and it worked well for me. Visit this link for more details on it.
Use the command below to install it:
pip install fuzzymatcher
Below is sample code (already posted by RobinL above):
import fuzzymatcher
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Errors you may get
ZeroDivisionError: float division by zero ---> refer to this
link to resolve it
OperationalError: No Such Module: fts4 --> download the sqlite3.dll
from here and replace the DLL file in your Python or Anaconda
DLLs folder.
Pros:
Works faster. In my case, I compared one dataframe with 3,000 rows against another dataframe with 170,000 records. It also uses SQLite3 full-text search, so it is faster than many alternatives.
Can check across multiple columns and 2 dataframes. In my case, I was looking for the closest match based on address and company name. Sometimes the company name might be the same, but the address is a good thing to check too.
Gives you a score for all the closest matches for the same record; you choose the cutoff score.
Cons:
The original package installation is buggy
Requires C++ and Visual Studio to be installed too
Won't work for 64-bit Anaconda/Python
There is a package called fuzzy_pandas that can use levenshtein, jaro, metaphone and bilenko methods. With some great examples here
import pandas as pd
import fuzzy_pandas as fpd

df1 = pd.DataFrame({'Key': ['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key': ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

results = fpd.fuzzy_merge(df1, df2,
                          left_on='Key',
                          right_on='Key',
                          method='levenshtein',
                          threshold=0.6)
results.head()
Key Key
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
As a heads up, this basically works, except if no match is found, or if you have NaNs in either column. Instead of directly applying get_close_matches, I found it easier to apply the following function. The choice of NaN replacements will depend a lot on your dataset.
import difflib
import numpy as np
import pandas as pd

def fuzzy_match(a, b):
    left = '1' if pd.isnull(a) else a
    right = b.fillna('2')
    out = difflib.get_close_matches(left, right)
    return out[0] if out else np.NaN
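For example, building on the column-based frames from the earlier answer (the column names 'key' and 'key_close' are hypothetical):
# Map each key in df2 to its closest key in df1, tolerating NaNs on either side
df2['key_matched'] = df2['key_close'].apply(lambda x: fuzzy_match(x, df1['key']))
merged = df1.merge(df2, left_on='key', right_on='key_matched', how='left')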
You can use d6tjoin for that
import d6tjoin.top1

d6tjoin.top1.MergeTop1(df1.reset_index(), df2.reset_index(),
                       fuzzy_left_on=['index'], fuzzy_right_on=['index']).merge()['merged']
index number index_right letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 five e
It has a variety of additional features such as:
check join quality, pre and post join
customize similarity function, e.g. edit distance vs. Hamming distance
specify max distance
multi-core compute
For details see
MergeTop1 examples - Best match join examples notebook
PreJoin examples - Examples for diagnosing join problems
I have used fuzzywuzzy in a very minimal way whilst matching the existing behaviour and keywords of merge in pandas.
Just specify your accepted threshold for matching (between 0 and 100):
from fuzzywuzzy import process
def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):

    def fuzzy_apply(x, df, column, threshold=threshold):
        if type(x) != str:
            return None
        match, score, *_ = process.extract(x, df[column], limit=1)[0]
        if score >= threshold:
            return match
        else:
            return None

    if on is not None:
        left_on = on
        right_on = on

    # create temp column as the best fuzzy match (or None!)
    df2['tmp'] = df2[right_on].apply(
        fuzzy_apply,
        df=df,
        column=left_on,
        threshold=threshold
    )

    merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')
    del merged_df['tmp']

    return merged_df
Try it out using the example data:
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
fuzzy_merge(df1, df2, on='Key', threshold=80)
Using thefuzz
Using SeatGeek's great package thefuzz, which makes use of Levenshtein distance. This works with data held in columns. It adds matches as rows rather than columns, to preserve a tidy dataset, and allows additional columns to be easily pulled through to the output dataframe.
Sample data
df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})
col_a col_b
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})
col_a col_b
0 one a
1 too b
2 three c
3 fours d
4 five e
Function used to do the matching
def fuzzy_match(
    df_left, df_right, column_left, column_right, threshold=90, limit=1
):
    # Create a series with the id from df_left and column name _column_left_, with _limit_ matches per item
    series_matches = df_left[column_left].apply(
        lambda x: process.extract(x, df_right[column_right], limit=limit)
    )

    # Convert matches to a tidy dataframe
    df_matches = series_matches.to_frame()
    df_matches = df_matches.explode(column_left)  # Convert list of matches to rows
    df_matches[
        ['match_string', 'match_score', 'df_right_id']
    ] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index)  # Convert match tuple to columns
    df_matches.drop(column_left, axis=1, inplace=True)  # Drop column of match tuples

    # Reset index, as in creating a tidy dataframe we've introduced multiple rows per id,
    # so that no longer functions well as the index
    if df_matches.index.name:
        index_name = df_matches.index.name  # Stash index name
    else:
        index_name = 'index'  # Default used by pandas
    df_matches.reset_index(inplace=True)
    df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True)  # The previous index has now become a column: rename for ease of reference

    # Drop matches below threshold
    df_matches.drop(
        df_matches.loc[df_matches['match_score'] < threshold].index,
        inplace=True
    )

    return df_matches
Use function and merge data
import pandas as pd
from thefuzz import process
df_matches = fuzzy_match(
    df1,
    df2,
    'col_a',
    'col_a',
    threshold=60,
    limit=1
)

df_output = df1.merge(
    df_matches,
    how='left',
    left_index=True,
    right_on='df_left_id'
).merge(
    df2,
    how='left',
    left_on='df_right_id',
    right_index=True,
    suffixes=['_df1', '_df2']
)

df_output.set_index('df_left_id', inplace=True)  # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table

df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']]  # Drop columns used in the matching
df_output.index.name = 'id'
id col_a_df1 col_b_df1 col_b_df2
0 one 1 a
1 two 2 b
2 three 3 c
3 four 4 d
4 five 5 e
Tip: Fuzzy matching using thefuzz is much quicker if you optionally install the python-Levenshtein package too.
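For reference, that optional dependency is an ordinary PyPI package:
pip install python-Levenshtein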
For more complex use cases, matching rows with many columns, you can use the recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas dataframes, which also helps to deduplicate your data when merging. I have written a detailed article about the package here
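A rough sketch of how that usually looks (the column names 'name' and 'city' are hypothetical; see the recordlinkage docs for the full API):
import recordlinkage

# Build candidate pairs (full() compares every row of df1 with every row of df2;
# use block(...) on some column to cut this down for large frames)
indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(df1, df2)

# Score the candidate pairs on one or more columns
compare = recordlinkage.Compare()
compare.string('name', 'name', method='jarowinkler', threshold=0.85, label='name')
compare.exact('city', 'city', label='city')
features = compare.compute(pairs, df1, df2)

# Keep pairs that matched on both columns
matches = features[features.sum(axis=1) == 2]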
If the join axis is numeric, this could also be used to match indexes with a specified tolerance:
import numpy as np
import pandas as pd

def fuzzy_left_join(df1, df2, tol=None):
    index1 = df1.index.values
    index2 = df2.index.values

    diff = np.abs(index1.reshape((-1, 1)) - index2)
    mask_j = np.argmin(diff, axis=1)  # position of the nearest df2 label for each df1 label
    mask_i = np.arange(mask_j.shape[0])

    df1_ = df1.iloc[mask_i]
    df2_ = df2.iloc[mask_j]

    if tol is not None:
        mask = np.abs(df2_.index.values - df1_.index.values) <= tol
        df1_ = df1_.loc[mask]
        df2_ = df2_.loc[mask]

    df2_.index = df1_.index
    out = pd.concat([df1_, df2_], axis=1)
    return out
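For example, a quick sketch with made-up numeric indexes (with tol=2 the last df1 row has no df2 label within tolerance and is dropped):
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 20, 30])
df2 = pd.DataFrame({'b': ['x', 'y', 'z']}, index=[9, 21, 35])

fuzzy_left_join(df1, df2, tol=2)
#     a  b
# 10  1  x
# 20  2  y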
thefuzz is the new version of fuzzywuzzy.
In order to fuzzy-join string elements in two big tables you can do this:
Use apply to go row by row
Use swifter to parallelize, speed up and visualize the default apply function (with a colored progress bar)
Use OrderedDict from collections to get rid of duplicates in the output of merge and keep the initial order
Increase limit in thefuzz.process.extract to see more options for the merge (stored in a list of tuples with % of similarity)
* You can use thefuzz.process.extractOne instead of thefuzz.process.extract to return just one best-matched item (without specifying any limit). However, be aware that several results could have the same % of similarity and you will get only one of them.
** Somehow swifter takes a minute or two before starting the actual apply. If you need to process small tables you can skip this step and just use progress_apply instead.
from thefuzz import process
from collections import OrderedDict
import swifter

def match(x):
    matches = process.extract(x, df1, limit=6)
    matches = list(OrderedDict((x, True) for x in matches).keys())
    print(f'{x:20} : {matches}')
    return str(matches)

df1 = df['name'].values
df2['matches'] = df2['name'].swifter.apply(lambda x: match(x))

Performance issue pandas 6 mil rows

I need some help.
I am trying to combine two data frames. The first has 58k rows, the other 100. I want to combine them so that each of the 58k rows is paired with all 100 rows from the other df, i.e. 5.8 million rows in total.
Performance is very poor; it takes 1 hour to do 10 percent. Any suggestions for improvement?
Here is the code snippet.
def myfunc(vendors3, cust_loc):
    cust_loc_vend = pd.DataFrame()
    cust_loc_vend.empty
    for i, row in cust_loc.iterrows():
        clear_output(wait=True)
        a = row.to_frame().T
        df = pd.concat([vendors3, a], axis=1, ignore_index=False)
        #cust_loc_vend = pd.concat([cust_loc_vend, df], axis=1, ignore_index=False)
        cust_loc_vend = cust_loc_vend.append(df)
        print('Current progress:', np.round(i/len(cust_loc)*100, 2), '%')
    return cust_loc_vend
For example, if the first DF has 5 rows (sample of 2 columns) and the second has 100 rows, I want a merged DF such that each row in DF2 is combined with all rows from DF1.
Well, all you are looking for is a cross join. But since there is no common column, what you can do is create a column which is the same in both dataframes and then drop it afterwards.
df['common'] = 1
df1['common'] = 1
df2 = pd.merge(df, df1, on=['common'], how='outer')
df2 = df2.drop('common', axis=1)
where df and df1 are the input dataframes and df2 is the merged result.
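As a side note, on pandas 1.2 or later the helper column is not needed, because merge supports a cross join directly:
df2 = pd.merge(df, df1, how='cross')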

pandas merge multiple dataframes

For example: I have multiple dataframes. Each data frame has columns: variable_code, variable_description, year.
df1:
variable_code, variable_description
N1, Number of returns
N2, Number of Exemptions
df2:
variable_code, variable_description
N1, Number of returns
NUMDEP, # of dependent
I want to merge these two dataframes to get all variable_codes in both df1 and df2.
variable_code, variable_description
N1 Number of returns
N2 Number of Exemption
NUMDEP # of dependent
There is documentation for merge right here
Since the columns you want to merge on are both called "variable_code", you can use on='variable_code',
so the whole thing would be:
df1.merge(df2, on='variable_code')
You can specify how='outer' if you want blanks where you have data in only one of those tables. Use how='inner' if you want only data that is in both tables (no blanks).
To attain your requirement, try this:
import pandas as pd
from functools import reduce

# Create the first dataframe, through a dictionary - several other possibilities exist.
data1 = {'variable_code': ['N1', 'N2'],
         'variable_description': ['Number of returns', 'Number of Exemptions']}
df1 = pd.DataFrame(data=data1)

# Create the second dataframe
data2 = {'variable_code': ['N1', 'NUMDEP'],
         'variable_description': ['Number of returns', '# of dependent']}
df2 = pd.DataFrame(data=data2)

# Place the dataframes in a list.
dfs = [df1, df2]  # additional dfs can be added here

# You could loop over the list, merging the dfs. But here reduce and a lambda are used.
resultant_df = reduce(lambda left, right: pd.merge(left, right,
                                                   on=['variable_code', 'variable_description'],
                                                   how='outer'), dfs)
This gives:
>>> resultant_df
variable_code variable_description
0 N1 Number of returns
1 N2 Number of Exemptions
2 NUMDEP # of dependent
There are several options available for how, each catering for different needs. outer, used here, allows inclusion even of rows with empty data. See the docs for a detailed explanation of the other options.
First, concatenate df1 and df2 by using
final_df = pd.concat([df1, df2]).
Then convert the columns variable_code, variable_description into a dictionary, with variable_code as keys and variable_description as values, by using
d = dict(zip(final_df['variable_code'], final_df['variable_description'])).
Then convert d back into a dataframe:
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description']).
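Put together, a runnable sketch of those three steps (note that when a variable_code appears in both frames, the dictionary keeps the description seen last):
final_df = pd.concat([df1, df2])
d = dict(zip(final_df['variable_code'], final_df['variable_description']))
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description'])
print(d_df)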

Conditional on pandas DataFrames

Let df1, df2, and df3 be pandas.DataFrames having the same structure but different numerical values. I want to perform, element-wise:
res = (df2-df3)/(df1-1) if df1 > 1.0 else df3
res should have the same structure as df1, df2, and df3.
numpy.where() generates the result as a plain array.
Edit 1:
res should have the same indices as df1, df2, and df3.
For example, I can access df2 as df2["instanceA"]["parameter1"]["parameter2"]. I want to access the new calculated DataFrame/Series res as res["instanceA"]["parameter1"]["parameter2"].
Actually numpy.where should work fine there. Output here is 4x2 (same as df1, df2, df3).
df1 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df2 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df3 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
res = df3.copy()
res[:] = np.where( df1 > 1, (df2-df3)/(df1-1), df3 )
x y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2 0.324390 1.222632
3 -0.138606 0.955993
Note that this should work on both series and dataframes. The [:] is slicing syntax that preserves the index and columns. Without that res will come out as an array rather than series or dataframe.
Alternatively, for a series you could write as #Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where(df1 > 1, (df2-df3)/(df1-1), df3),
                   index=df1.index, columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames will be appreciated.
Say df is your initial dataframe and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust values where your condition holds (use .loc to avoid chained-assignment problems):
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3'])/(df['df1'] - 1)

Vectorized method to sync two arrays

I have two Pandas TimeSeries, x and y, which I would like to sync "as of". I would like to find, for every element in x, the latest (by index) element in y that precedes it (by index value). For example, I would like to compute this new_x:
x new_x
---- -----
13:01 13:00
14:02 14:00
y
----
13:00
13:01
13:30
14:00
I am looking for a vectorized solution, not a Python loop. The time values are based on Numpy datetime64. The y array's length is in the order of millions, so O(n^2) solutions are probably not practical.
In some circles this operation is known as the "asof" join. Here is an implementation:
def diffCols(df1, df2):
    """ Find columns in df1 not present in df2

    Return df1.columns - df2.columns maintaining the order in which the resulting
    columns appear in df1.

    Parameters
    ----------
    df1 : pandas dataframe object
    df2 : pandas dataframe object

    Pandas already offers df1.columns - df2.columns, but unfortunately
    the original order of the resulting columns is not maintained.
    """
    return [i for i in df1.columns if i not in df2.columns]

def aj(df1, df2, overwriteColumns=True, inplace=False):
    """ KDB+ like asof join.

    Finds prevailing values of df2 asof df1's index. The resulting dataframe
    will have the same number of rows as df1.

    Parameters
    ----------
    df1 : Pandas dataframe
    df2 : Pandas dataframe
    overwriteColumns : boolean, default True
        The columns of df2 will overwrite the columns of df1 if they have the same
        name unless overwriteColumns is set to False. In that case, this function
        will only join columns of df2 which are not present in df1.
    inplace : boolean, default False.
        If True, adds columns of df2 to df1. Otherwise, create a new dataframe with
        columns of both df1 and df2.

    *Assumes both df1 and df2 have a datetime64 index. """
    joiner = lambda x: x.asof(df1.index)
    if not overwriteColumns:
        # Get columns of df2 not present in df1
        cols = diffCols(df2, df1)
        if len(cols) > 0:
            df2 = df2.loc[:, cols]
    result = df2.apply(joiner)
    if inplace:
        for i in result.columns:
            df1[i] = result[i]
        return df1
    else:
        return result
Internally, this uses pandas.Series.asof().
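Note that newer pandas versions (0.19+) also ship a built-in asof join, pd.merge_asof, which covers the common case directly. A minimal sketch, assuming both frames carry a sorted datetime column named 'time' (a hypothetical name) and that "precedes" means strictly before:
import pandas as pd

x = pd.DataFrame({'time': pd.to_datetime(['13:01', '14:02'])})
y = pd.DataFrame({'time': pd.to_datetime(['13:00', '13:01', '13:30', '14:00'])})

# For each row of x, take the latest row of y whose 'time' is strictly earlier
result = pd.merge_asof(x, y.rename(columns={'time': 'new_x'}),
                       left_on='time', right_on='new_x',
                       direction='backward', allow_exact_matches=False)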
What about using Series.searchsorted() to return the index of y where you would insert x? You could then subtract one from that value and use it to index y.
In [1]: x
Out[1]:
0 1301
1 1402
In [2]: y
Out[2]:
0 1300
1 1301
2 1330
3 1400
In [3]: y[y.searchsorted(x)-1]
Out[3]:
0 1300
3 1400
Note: the above example uses int64 Series.
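To tie this back to the datetime64 case in the question, a small sketch of the same idea (the times here are made up for illustration; side='left' together with the -1 gives the latest element strictly before each x):
import numpy as np
import pandas as pd

x = pd.Series(pd.to_datetime(['13:01', '14:02']))
y = pd.Series(pd.to_datetime(['13:00', '13:01', '13:30', '14:00']))

# position of the latest y strictly before each x
pos = np.searchsorted(y.values, x.values, side='left') - 1
new_x = y.iloc[pos].reset_index(drop=True)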