Multiprocessing the Fuzzy match in pandas

I have two data frames:
DF_Address, which has 347k distinct addresses, and DF_Project, which has 24k records containing
Project_Id, Project_Start_Date and Project_Address.
I want to check whether there is a fuzzy match of my Project_Address in Df_Address. If there is a match, I want to extract the corresponding Project_ID and Project_Start_Date. Below is the code I am trying:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
# address = list(Df_Project["Project_Address"])

def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(
        x, choices=choices, score_cutoff=cutoff
    )

Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(
        Df_Project["Project_Address"],
        80
    )
)
This code does produce output in the form of a tuple
('matched_string', score)
but it also returns strings that are only similar, not exact matches. I also need to extract
Project_Id and Project_Start_Date
for the matched records. Can someone help me achieve this using parallel processing, as the data is huge?

You can convert the tuples into a dataframe and then join it to your base data frame.
import pandas as pd

Df_Address = pd.DataFrame({'address': ['abc', 'cdf'], 'random_stuff': [100, 200]})
Matched = (('abc', 10), ('cdf', 20))

dist = pd.DataFrame(Matched)
dist.columns = ['address', 'distance']

final = Df_Address.merge(dist, how='left', on='address')
print(final)
Output:
  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20
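For the parallel-processing part of the question, one option is to fan the per-address lookups out over a multiprocessing.Pool and then merge the matches back onto Df_Project to recover Project_Id and Project_Start_Date. This is a rough sketch only; the pool size and chunksize are assumptions, and the column names are taken from the question as written:
import pandas as pd
from multiprocessing import Pool
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
choices = Df_Project["Project_Address"].tolist()

def fuzzy_match(x):
    # returns ('matched_string', score) or None if nothing clears the cutoff
    return process.extractOne(x, choices=choices, score_cutoff=80)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(fuzzy_match, Df_Address["Address"], chunksize=1000)

    Df_Address["matched_address"] = [r[0] if r else None for r in results]
    # join back to pick up Project_Id and Project_Start_Date for each matched address
    final = Df_Address.merge(
        Df_Project[["Project_Address", "Project_Id", "Project_Start_Date"]],
        how="left",
        left_on="matched_address",
        right_on="Project_Address",
    )
    print(final.head())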

Related

How to check the normality of data in a column grouped by an index

I'm working on a dataset which represents the completion time of some activities performed in some processes. There are just 6 types of activities that repeat themselves throughout the dataset, each described by a numerical value. The example dataset is as follows:
name duration
1 10
2 12
3 34
4 89
5 44
6 23
1 15
2 12
3 39
4 67
5 47
6 13
I'm trying to check if the duration of the activity is normally distributed with the following code:
import numpy as np
import pylab
import scipy.stats as stats
import seaborn as sns
from scipy.stats import normaltest
measurements = df['duration']
stats.probplot(measurements, dist='norm', plot=pylab)
pylab.show()
ax = sns.distplot(measurements)
stat,p = normaltest(measurements)
print('stat=%.3f, p=%.3f\n' % (stat, p))
if p > 0.05:
    print('probably gaussian')
else:
    print('probably non gaussian')
But I want to do it for each type of activity, which means applying stats.probplot(), sns.distplot() and normaltest() to each group of activities (e.g. checking whether all the activities named 1 have a normally distributed duration).
Any idea how I can tell these functions to return different plots for each group of activities?
Assuming you have at least 8 samples per activity (normaltest will throw an error if you don't), you can loop through your data based on the unique activity values. You'll have to call pylab.show() after each graph so that they are not drawn on top of each other:
import numpy as np
import pandas as pd
import pylab
import scipy.stats as stats
import seaborn as sns
from scipy.stats import normaltest
import random    # Only needed by me to create a mock dataframe
import warnings  # "distplot" is deprecated. Look into using "displot"... in the meantime
warnings.filterwarnings('ignore')  # I got sick of seeing the warning so I muted it

name = [1, 2, 3, 4, 5, 6] * 8
duration = [random.choice(range(0, 100)) for _ in range(8 * 6)]
df = pd.DataFrame({"name": name, "duration": duration})

for name in df.name.unique():
    nameDF = df[df.name.eq(name)]
    measurements = nameDF['duration']
    stats.probplot(measurements, dist='norm', plot=pylab)
    pylab.show()
    ax = sns.distplot(measurements)
    ax.set_title(f'Name: {name}')
    pylab.show()
    stat, p = normaltest(measurements)
    print('stat=%.3f, p=%.3f\n' % (stat, p))
    if p > 0.05:
        print('probably gaussian')
    else:
        print('probably non gaussian')
.
.
.
etc.
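If you prefer, the same loop can also be written with groupby, which hands you each activity's sub-frame directly. This is just an equivalent sketch of the loop above:
for name, nameDF in df.groupby('name'):
    measurements = nameDF['duration']
    stats.probplot(measurements, dist='norm', plot=pylab)
    pylab.show()
    stat, p = normaltest(measurements)
    print('name=%s: stat=%.3f, p=%.3f' % (name, stat, p))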

Distance Matrix - How to find the closest person in a dataframe based on coordinates?

I have a pandas dataframe with three columns: Name, Latitude and Longitude.
For every person in the dataframe I want to 1) determine the person that is closest to him/her and 2) calculate the linear distance to that person. My code is like the example below:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from haversine import haversine
df = pd.read_csv('..data/file_name.csv')
df.set_index('Name', inplace=True)
dm = cdist(df, df, metric=haversine)
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)
df['closest person'] = df.index[closest]
df['distance'] = distances
I know the issue here is that the argmin and min functions I am using simply match every person to him/herself, which is not what I want. I'm trying to modify the code to find the closest distinct individual. For example, the closest person to John Doe is Bob Smith and the distance is xx. I've tried indexing and looking for a way to sort the matrix, but it's not really working.
Is there a good way of doing this?
Edit: example input data
You can just overwrite the 0 values (each person's distance to themselves) in this way:
# your code
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from haversine import haversine

df = pd.read_csv('..data/file_name.csv')
df.set_index('Name', inplace=True)
dm = cdist(df, df, metric=haversine)

# my code: replace each 0 (a person's distance to themselves) with that row's maximum
dm[dm == 0] = np.max(dm, axis=1)

# your code
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)
df['closest person'] = df.index[closest]
df['distance'] = distances
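An alternative, assuming each person appears only once and you simply want to exclude self-distances, is to mask the diagonal directly rather than every zero:
import numpy as np

np.fill_diagonal(dm, np.inf)  # a person can never be their own nearest neighbour
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)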

Get PARTITION_ID in Dask for Data Frame

Is it possible to get the partition_id in Dask after splitting a pandas DataFrame?
For example:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(np.random.randn(10, 2), columns=["A", "B"])
df_parts = dd.from_pandas(df, npartitions=2)
part1 = df_parts.get_partition(0)
Of the 2 parts, part1 is the first partition. So is it possible to do something like the following:
part1.get_partition_id() => which will return 0 or 1
Or is it possible to get the partition ID by iterating through df_parts?
Not sure about built-in functions, but you can achieve what you want with enumerate(df_parts.to_delayed()).
to_delayed will produce a list of delayed objects, one per partition, so you can iterate over them, keeping track of the sequential number with enumerate.
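A minimal sketch of that idea, continuing from the example above:
for partition_id, part in enumerate(df_parts.to_delayed()):
    pdf = part.compute()           # each delayed object evaluates to one pandas partition
    print(partition_id, len(pdf))  # partition_id is 0 for the first partition, 1 for the second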

Lambda functions on multiple columns

I am trying to extract only the numbers from multiple columns in my pandas DataFrame.
I am able to do so one column at a time; however, I would like to perform this operation on multiple columns simultaneously.
My reproduced example:
import pandas as pd
import re
import numpy as np
import seaborn as sns
df = sns.load_dataset('diamonds')
# Create the column once again
df['clarity2'] = df['clarity']
df.head()
df[['clarity', 'clarity2']].apply(lambda x: x.str.extract(r'(\d+)'))
If you want a tuple:
cols = ['clarity', 'clarity2']
tuple(df[col].str.extract(r'(\d+)') for col in cols)
If you want a list:
cols = ['clarity', 'clarity2']
[df[col].str.extract(r'(\d+)') for col in cols]
Adding them to the original data:
df['digit1'], df['digit2'] = [df[col].str.extract(r'(\d+)') for col in cols]
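Note that with a single capture group, str.extract returns a one-column DataFrame by default; if plain Series are what you want in the new columns (my assumption), passing expand=False avoids that:
cols = ['clarity', 'clarity2']
for i, col in enumerate(cols, start=1):
    # expand=False makes extract return a Series instead of a one-column DataFrame
    df[f'digit{i}'] = df[col].str.extract(r'(\d+)', expand=False)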

Extracting columns from an array in Python

I am a beginner in Python and I am stuck with data which is an array of 32763 numbers, separated by commas. Please find the data here.
I want to convert this into two columns: column 1 from (0:16382) and column 2 from (2:32763). In the end I want to plot column 1 as the x axis and column 2 as the y axis. I tried the following code but I am not able to extract the columns:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = np.genfromtxt('oscilloscope.txt', delimiter=',')
df = pd.DataFrame(data.flatten())
print(df)
and then I want to write the data to some file, let us say data1, in the format shown in the attached pic.
It is hard to answer without seeing the format of your data, but you can try
data = np.genfromtxt('oscilloscope.txt',delimiter=',')
print(data.shape) # here we check we got something useful
# this should split data into x,y at position 16381
x = data[:16381]
y = data[16381:]
# now you can create a dataframe and print to file
df = pd.DataFrame({'x':x, 'y':y})
df.to_csv('data1.csv', index=False)
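Since the end goal is plotting column 1 against column 2, a minimal sketch (assuming matplotlib.pyplot, with x and y as built above) would be:
import matplotlib.pyplot as plt

plt.plot(x, y)          # x and y are the two halves created above
plt.xlabel('column 1')
plt.ylabel('column 2')
plt.show()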
Try this:
# Input: a dataframe df and a chunk_size; output: a list of chunks. Set whatever chunk size you want.
def split_dataframe(df, chunk_size=16382):
    chunks = list()
    num_chunks = (len(df) + chunk_size - 1) // chunk_size  # ceiling division, avoids an empty trailing chunk
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
Or use np.array_split.
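For example, on the raw array from the question (np.array_split is also commonly applied to DataFrames themselves; the split count of 2 here is an assumption):
import numpy as np

data = np.genfromtxt('oscilloscope.txt', delimiter=',')
first_half, second_half = np.array_split(data, 2)  # two roughly equal pieces of the raw array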