pandas - fuzzywuzzy - speeding up the loop when doing fuzzy matching?

I am basically trying to join 2 dataframes using an approximate match. How I do this in general:
1. have the list of strings to be matched
2. define a function using fuzzywuzzy's process.extract
3. apply this function across all rows in the 1st dataframe to get a match
4. join the 1st DF with the 2nd DF on the matching key.
This is my code:
from fuzzywuzzy import fuzz, process

def closest_match(x):
    # only score candidates that share the first three characters with x
    matched = process.extract(x,
                              matchlist[matchlist.match_name.str.startswith(x[:3])].match_name,
                              limit=1, scorer=fuzz.token_sort_ratio)
    if matched:
        print(matched[0])
        return matched[0][0]
    else:
        return None

df1['key'] = df1.df1_name.apply(lambda x: closest_match(x))

# merge with the 2nd df
joined = df1.merge(df2, left_on='key', right_on='df2_name')
The problem here is speed. This code takes a very long time for 10,000 iterations, and I need it to run over 100K matches. How can I speed it up?
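One direction that may help (a sketch, not a drop-in answer): the startswith filter scans the whole matchlist for every row of df1. Bucketing matchlist once by the 3-character prefix and using process.extractOne keeps the same matching logic while doing the expensive scoring only against a small candidate list. All names (matchlist, df1, df2, the column names) are taken from the question's code.

from collections import defaultdict
from fuzzywuzzy import fuzz, process

# Build the prefix buckets once instead of filtering matchlist on every row.
buckets = defaultdict(list)
for name in matchlist.match_name:
    buckets[name[:3]].append(name)

def closest_match_fast(x):
    candidates = buckets.get(x[:3])
    if not candidates:
        return None
    best = process.extractOne(x, candidates, scorer=fuzz.token_sort_ratio)
    return best[0] if best else None

df1['key'] = df1.df1_name.apply(closest_match_fast)
joined = df1.merge(df2, left_on='key', right_on='df2_name')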

Related

How to apply function to each column and row of dataframe pandas

I have two dataframes.
df1 has an index made of strings like (row1, row2, ..., rown) and columns made of strings like (col1, col2, ..., colm), while df2 has k rows and 3 columns (char_1, char_2, value). char_1 contains the same kind of strings as df1's index and char_2 the same kind as df1's columns. I simply want to assign each df2 value to df1 at the right position. For example, if the first row of df2 reads ['row3', 'col1', 'value2'], I want to assign value2 to df1 at position [2, 0] (third row, first column).
I tried to use two functions to slide rows and columns of df1:
def func1(val):
    # first I convert the series to a dataframe
    val = val.to_frame()
    val = val.reset_index()
    val = val.set_index('index')  # I set the index so that it's the right column

    def func2(val2):
        try:  # maybe the combination doesn't exist
            idx1 = list(cou.index[df2[char_2] == (val2.name)])            # val2.name reads the column name of df1
            idx2 = list(cou.index[df2[char_1] == val2.index.values[0]])   # val2.index.values[0] reads the index name of df1
            idx = list(reduce(set.intersection, map(set, [idx1, idx2])))
            idx = int(idx[0])  # final index of df2 where I need to take the value to assign to df1
            check = 1
        except:
            check = 0
        if check == 1:              # if the index exists
            val2[0] = df2['value'][idx]  # assign the value to df1
        return val2

    val = val.apply(func2, axis=1)  # apply the function over the columns
    val = val.squeeze()             # convert back to a series
    return val

df1 = df1.apply(func1, axis=1)  # apply the function over the rows
I do the conversion inside func1 because without this step I wasn't able to work with a Series while keeping the index and column names, so I couldn't find the index idx in func2.
Well, the problem is that it takes forever. df1's size is (3'600 X 20'000) and df2's is (500 X 3), so it's not that much data. I really don't understand the problem. I ran the code for the first row and column to check the result: it's fine and takes about 1 second, but for the entire dataframe I've been waiting for hours and it still hasn't finished.
Is there a way to optimize it? As I wrote in the title, I only need to run a function that keeps column and index names and works across the entire dataframe. Thanks in advance!
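For what it's worth, a minimal vectorized sketch of this assignment, assuming the column names 'char_1', 'char_2' and 'value' from the description and that each (char_1, char_2) combination appears at most once in df2: pivot df2 into df1's shape and let pandas align on the row and column labels.

import pandas as pd

# Reshape df2 so char_1 values become the index and char_2 values the columns.
lookup = df2.pivot(index='char_1', columns='char_2', values='value')

# update() aligns on row/column labels and overwrites df1 wherever lookup has
# a non-NaN entry; combinations missing from df2 are left untouched.
df1.update(lookup)

This replaces the nested apply calls with a single label-aligned assignment.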

Faster returns comparisons in Pandas dataframes?

I have a DataFrame containing 600,000 pairs of IDs. Each ID has return data in a large monthly returns_df. For each of the 600K pairs I do the following (see the code below):
1. set the left and right DataFrames equal to their subsets of returns_df;
2. merge the DataFrames to get the months where both have data;
3. compute an absolute distance by comparing each month, summing the results, and running a sigmoid function.
This process is taking ~12 hours because my computer has to create subsets of returns_df each time to compare. Can I substantially speed this up through some sort of vectorized solution or faster filtering?
import math
import pandas as pd

def get_return_similarity(row):
    left = returns_df[returns_df['FundID'] == row.left_side_id]
    right = returns_df[returns_df['FundID'] == row.right_side_id]
    temp = pd.merge(left, right, how='inner', on=['Year', 'Month'])
    if temp.shape[0] < 12:  # return 0 if overlap < 12 months
        return 0
    temp['diff'] = abs(temp['Return_x'] - temp['Return_y'])
    return 1 / (math.exp(70 * temp['diff'].sum() / (temp['diff'].shape[0])))  # scaled sigmoid function

df['return_score'] = df[['left_side_id', 'right_side_id']].apply(get_return_similarity, axis=1)
Thanks in advance for your help! Trying to get better with Pandas
Edit: As suggested, the basic data format: returns_df has FundID, Year, Month, and Return columns; the df I am running the apply on has left_side_id and right_side_id columns.
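A hedged sketch of one way to avoid re-filtering returns_df for every pair: build a per-fund lookup once, then align the two return Series on their (Year, Month) index. Column names (FundID, Year, Month, Return, left_side_id, right_side_id) follow the question's code.

import math
import pandas as pd

# One-time pass: a Series of returns per fund, indexed by (Year, Month).
returns_by_fund = {
    fund_id: grp.set_index(['Year', 'Month'])['Return']
    for fund_id, grp in returns_df.groupby('FundID')
}

def get_return_similarity_fast(row):
    left = returns_by_fund.get(row.left_side_id)
    right = returns_by_fund.get(row.right_side_id)
    if left is None or right is None:
        return 0
    diff = (left - right).dropna().abs()   # subtraction aligns on (Year, Month)
    if len(diff) < 12:                     # require at least 12 overlapping months
        return 0
    return 1 / math.exp(70 * diff.sum() / len(diff))  # same scaled sigmoid

df['return_score'] = df[['left_side_id', 'right_side_id']].apply(
    get_return_similarity_fast, axis=1)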

pandas merge multiple dataframes

For example: I have multiple dataframes. Each data frame has columns: variable_code, variable_description, year.
df1:
variable_code, variable_description
N1, Number of returns
N2, Number of Exemptions
df2:
variable_code, variable_description
N1, Number of returns
NUMDEP, # of dependent
I want to merge these two dataframes to get all variable_codes in both df1 and df2.
variable_code, variable_description
N1 Number of returns
N2 Number of Exemptions
NUMDEP # of dependent
There is documentation for merge in the pandas docs.
Since the columns you want to merge on are both called "variable_code", you can use on='variable_code', so the whole thing would be:
df1.merge(df2, on='variable_code')
You can specify how='outer' if you want blanks where you have data in only one of the tables. Use how='inner' if you want only data that is in both tables (no blanks).
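For the exact output shown in the question (a single description column containing N1, N2 and NUMDEP), one option is to outer-merge on both columns so the description is not split into suffixed _x/_y columns:

# Keeps codes that appear in only one of the two frames, with one description column.
merged = df1.merge(df2, on=['variable_code', 'variable_description'], how='outer')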
To attain your requirement, try this:
import pandas as pd
from functools import reduce

# Create the first dataframe from a dictionary - several other possibilities exist.
data1 = {'variable_code': ['N1', 'N2'], 'variable_description': ['Number of returns', 'Number of Exemptions']}
df1 = pd.DataFrame(data=data1)

# Create the second dataframe.
data2 = {'variable_code': ['N1', 'NUMDEP'], 'variable_description': ['Number of returns', '# of dependent']}
df2 = pd.DataFrame(data=data2)

# Place the dataframes in a list.
dfs = [df1, df2]  # additional dfs can be added here

# You could loop over the list merging the dfs, but here reduce and a lambda are used.
resultant_df = reduce(lambda left, right: pd.merge(left, right, on=['variable_code', 'variable_description'], how='outer'), dfs)
This gives:
>>> resultant_df
variable_code variable_description
0 N1 Number of returns
1 N2 Number of Exemptions
2 NUMDEP # of dependent
There are several options available for how, each catering to different needs. outer, used here, also includes rows that exist in only one of the frames. See the docs for a detailed explanation of the other options.
First, concatenate df1 and df2 using
final_df = pd.concat([df1, df2])
Then convert the columns variable_code and variable_description into a dictionary, with variable_code as keys and variable_description as values (building the dict drops the duplicate N1 row):
d = dict(zip(final_df['variable_code'], final_df['variable_description']))
Then convert d back into a dataframe:
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description'])
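A shorter equivalent, assuming no variable_code maps to two different descriptions, is to concatenate and drop the duplicate rows directly:

import pandas as pd

d_df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)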

More efficient Pandas code

I am trying to learn Python and pandas. Coming from VBA, I am still caught in the habit of looping through every single cell, and I am looking for ways to operate on entire rows at a time.
Below is part of my code. I have about 3000 stocks in the columns and about 40 or so data points in the rows, saved in a dataframe called df.
I do the same kind of loop as shown below to test multiple criteria based on row values for the stocks in each column. As you can see, my code uses .ix to loop through the 'cells' in the dataframe.
I have looked for ways to operate on entire rows at a time, but every attempt has failed.
This takes about 7 minutes for the 3000 stocks (but only about 1 minute or so for 2000 stocks??). Surely this must be able to run much faster?
def piotrosky():
    df_temp = pd.DataFrame(np.nan, index=range(10), columns=df.columns)
    # use a dictionary to rename the rows so it doesn't have to be done one row at a time
    dic = {0: 'positiveNetIncome', 1: 'positiveOperatingCF', 2: 'increasingROA',
           3: 'QualityOfEarnings', 4: 'longTermDebtToAssets', 5: 'currentRatio',
           6: 'sharesOutVsSharesLast', 7: 'increasingGrossM',
           8: 'IncreasingAssetTurnOver', 9: 'total'}
    df_temp.rename(dic, inplace=True)
    r = 1
    # df is a vector with stocks in the columns and data points in the rows,
    # so I always need to loop across the columns
    for i in range(df.shape[1] - 1):
        # positive net income
        if df.ix[2, r] > 0:
            df_temp.ix[0, r] = 1
        else:
            df_temp.ix[0, r] = 0
        # positive operating CF
        if df.ix[3, r] > 0:
            df_temp.ix[1, r] = 1
        else:
            df_temp.ix[1, r] = 0
        # Continue with several similar blocks
        # total
        df_temp.ix[9, r] = df_temp.ix[0, r] + df_temp.ix[1, r] + df_temp.ix[2, r] + df_temp.ix[3, r] + \
            df_temp.ix[4, r] + df_temp.ix[5, r] + df_temp.ix[6, r] + df_temp.ix[7, r] + df_temp.ix[8, r]
        r = r + 1
Edit:
All of the below is done on a dataframe that is the transpose of the one you describe in your post. df.T should produce properly formatted input.
Method:
For conditionals on pandas dataframes, you can use the numpy function np.where:
criteria = {}
# np.where(condition, value_if_true, value_if_false)
criteria['positive_net_income'] = np.where(df[2] > 0, 1, 0)
After you get these numpy arrays, you can construct a dataframe from them,
pd.DataFrame(criteria)
and sum across it
pd.DataFrame(criteria).sum(axis=1)
to get a Series you can add as a column to your initial DataFrame
def piotrosky(df):
    criteria = {}
    criteria['positive_net_income'] = np.where(df[2] > 0, 1, 0)
    criteria['positive_operating_cf'] = np.where(df[3] > 0, 1, 0)
    ...
    return pd.DataFrame(criteria).sum(axis=1)

df['piotrosky_score'] = piotrosky(df)

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns, and more rows are constantly appended to it on a weekly basis.
What I'm trying to do is find the arithmetic mean of the last column for the last 1000 rows, every week. (So when new rows are added weekly, it just takes the average of the most recent 1000 rows.)
How should I construct the pandas or numpy code to achieve this?
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
# How should I write the next line of code to get the average of the most recent 1000 rows?
I'm on a different machine than the one my pandas is installed on, so I'm going from memory, but I think what you'll want to do is...
df = pd.read_csv("fds.csv", index_col=False, header=0)
# Let's pretend your 5th column has a name (header) of `Stuff`
last_thousand = df.tail(1000)
np.mean(last_thousand['Stuff'])
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
results will contain the mean of each column within the last 1000 rows. If you want more statistics, you can also use describe():
results = df.tail(1000).describe().unstack()
So basically I needed to use the pandas tail function. My code below works.
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
numpy.average(df_1.tail(1000))
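Equivalently, pandas' built-in mean() can be chained on the Results column directly, without the extra numpy call:

# Same result using pandas' own mean() on just the Results column.
avg = df['Results'].tail(1000).mean()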