Distance Matrix - How to find the closest person in a dataframe based on coordinates?

I have a pandas dataframe with three columns: Name, Latitude and Longitude.
For every person in the dataframe I want to 1) determine the person who is closest to him/her and 2) calculate the linear distance to that person. My code looks like the example below:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from haversine import haversine
df = pd.read_csv('../data/file_name.csv')
df.set_index('Name', inplace=True)
dm = cdist(df, df, metric=haversine)
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)
df['closest person'] = df.index[closest]
df['distance'] = distances
I know the issue here is that the argmin and min functions I'm using simply match every person to him/herself, which is not what I want. I'm trying to modify the code to find the distinct individual who is closest; for example, the closest person to John Doe is Bob Smith and the distance is xx. I've tried indexing and sorting the matrix, but it's not really working.
Is there a good way of doing this?
Edit: example input data (screenshot omitted)

You can just modify the 0 values in this way:
#your code
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from haversine import haversine
df = pd.read_csv('..data/file_name.csv')
df.set_index('Name', inplace=True)
dm = cdist(df, df, metric=haversine)
# my code: mask the zero self-distances so argmin/min skip them.
# This works because the diagonal zeros appear exactly once per row,
# so the row maxima broadcast into the right places; it breaks if two
# people share identical coordinates (see the more robust sketch below)
dm[dm == 0] = np.max(dm, axis=1)
# your code
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)
df['closest person'] = df.index[closest]
df['distance'] = distances
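If two people can share identical coordinates, masking all zeros would hide real matches (and the assignment above would fail because the shapes no longer line up). A more robust sketch, assuming the dm from the code above, blanks out only the diagonal:
import numpy as np
# only self-distances sit on the diagonal; set them to infinity so
# argmin/min can never pick a person as their own nearest neighbour
np.fill_diagonal(dm, np.inf)
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)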

Related

Plotting a dataframe with NAs using linearly joined points

I have a dataframe where each column has many missing values. How can I make a plot where the datapoints in each column are joined with lines, i.e. NAs are ignored, instead of having a choppy plot?
import numpy as np
import pandas as pd
pd.options.plotting.backend = "plotly"
d = pd.DataFrame(data = np.random.choice([np.nan] + list(range(7)), size=(10,3)))
d.plot(markers=True)
One way is to use this for each column:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, name="linear",  # x, y: one column's non-NaN points
                         line_shape='linear'))
Are there any better ways to accomplish this?
You can use pandas interpolate. The demonstration below uses plotly express and chains the calls so the underlying data is not changed.
Following comments on the post, the answer was amended so that markers are not shown for the interpolated points.
import numpy as np
import pandas as pd
import plotly.express as px
d = pd.DataFrame(data=np.random.choice([np.nan] + list(range(7)), size=(10, 3)))
px.line(d).update_traces(mode="lines+markers").add_traces(
px.line(d.interpolate(limit_direction="both")).update_traces(showlegend=False).data
)
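Alternatively (a sketch not from the original answer, assuming a reasonably recent plotly), traces have a connectgaps option that joins the line straight across the NaN gaps without touching the data; since no interpolated values are created, no markers appear at the missing points:
import numpy as np
import pandas as pd
import plotly.express as px
d = pd.DataFrame(data=np.random.choice([np.nan] + list(range(7)), size=(10, 3)))
# connectgaps bridges missing points instead of breaking the line
px.line(d).update_traces(mode="lines+markers", connectgaps=True).show()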

Calculating time-series as a percentage of total

I'm looking at county-level procurement data (millions of bills) and plotting time-series with matplotlib and pandas using groupby:
dataframe_slice.groupby(pd.Grouper(freq='1M')).bill_amount.sum().plot()
where bill_amount is a column of floats that shows how much was billed. How can I change the graph to show the dataframe_slice as a percentage of total dataframe bill_amount?
I am not aware of an out-of-the-box pandas function for that (but I am hoping to be proven wrong). Imho, you have to calculate the total_sum first and then divide each group aggregation by it to get the percentage. Stand-alone code:
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation
import numpy as np
np.random.seed(123)
n = 200
start = pd.to_datetime("2017-07-17")
end = pd.to_datetime("2018-04-03")
ndays = (end - start).days + 1
date_range = pd.to_timedelta(np.random.rand(n) * ndays, unit="D") + start
df = pd.DataFrame({"ind": date_range,
"bill_amount": np.random.randint(10, 30, n),
"cat": np.random.choice(["X", "Y", "Z"], n)})
df.set_index("ind", inplace=True)
#df.sort_index(inplace=True)
#this assumes that your dataframe has a datetime index
#here starts the actual calculation
total_sum = df.bill_amount.sum()
dataframe_slice = df.groupby(pd.Grouper(freq='1M')).bill_amount.sum().div(total_sum)*100
dataframe_slice.plot()
#and we beautify the plot
plt.xlabel("Month of expenditure")
plt.ylabel("Percentage of expenditure")
plt.tight_layout()
plt.show()
Sample output: (line plot of monthly percentages omitted)
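A compact variant of the same calculation (a sketch, assuming the df built above) chains the division with Series.pipe so no separate total_sum variable is needed. Note this shortcut only works when you group the whole frame, because the monthly sums then add up to the grand total; for a dataframe_slice, divide by the full frame's total as in the code above.
(df.groupby(pd.Grouper(freq='1M')).bill_amount.sum()
   .pipe(lambda s: s / s.sum() * 100)  # monthly sums add up to the grand total
   .plot())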

Multiprocessing the Fuzzy match in pandas

I have two data frames: Df_Address, which has 347k distinct addresses, and Df_Project, which has 24k records with Project_Id, Project_Start_Date and Project_Address.
I want to check if there is a fuzzy match of my Project_Address in Df_Address. If there is a match, I want to extract the Project_Id and Project_Start_Date for it. Below is the code of what I am trying:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
#address = list(Df_Project["Project_Address"])
def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(x, choices=choices, score_cutoff=cutoff)

Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(Df_Project["Project_Address"], 80),
)
This code does produce output in the form of a tuple ('matched_string', score), but it also returns strings that are merely similar. I also need to extract the Project_Id and Project_Start_Date for each match. Can someone help me achieve this using parallel processing, as the data is huge?
You can convert the tuples into a dataframe and then join it back to your base data frame.
import pandas as pd
Df_Address = pd.DataFrame({'address': ['abc','cdf'],'random_stuff':[100,200]})
Matched = (('abc',10),('cdf',20))
dist = pd.DataFrame(Matched, columns=['address', 'distance'])
final = Df_Address.merge(dist,how='left',on='address')
print(final)
Output:
  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20
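The answer above covers the join but not the parallelism the question asks about. A minimal sketch using the standard library (assuming the Df_Address and Df_Project frames from the question; the chunksize is illustrative):
from functools import partial
from multiprocessing import Pool
from fuzzywuzzy import process

def fuzzy_match(x, choices, cutoff):
    # returns ('matched_string', score) or None below the cutoff
    return process.extractOne(x, choices=choices, score_cutoff=cutoff)

if __name__ == "__main__":
    match = partial(fuzzy_match,
                    choices=Df_Project["Project_Address"].tolist(),
                    cutoff=80)
    with Pool() as pool:  # one worker per CPU core by default
        results = pool.map(match, Df_Address["Address"].tolist(), chunksize=500)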

How can I get an interpolated value from a Pandas data frame?

I have a simple Pandas data frame with two columns, 'Angle' and 'rff'. I want to get an interpolated 'rff' value based on entering an Angle that falls between two Angle values (i.e. between two index values) in the data frame. For example, I'd like to enter 3.4 for the Angle and then get an interpolated 'rff'. What would be the best way to accomplish that?
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
s = s.set_index('Angle')  # set 'Angle' as index
print(s)
result = s.at[3.0, "rff"]  # exact index lookups work; 3.4 would raise a KeyError
print(result)
You may use numpy:
import numpy as np
np.interp(3.4, s.index, s.rff)
#59.6
You could use numpy for this:
import numpy as np
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
print(np.interp(3.4, s.Angle, s.rff))
# 59.6
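A pandas-only alternative (a sketch, assuming the Angle/rff frame from the question): insert the query angle into the index and let interpolate(method='index') fill it in from its neighbours.
angle = 3.4
si = s.set_index('Angle')                 # index by Angle as in the question
si = si.reindex(si.index.union([angle]))  # adds a NaN row at 3.4
print(si.interpolate(method='index').at[angle, 'rff'])
# 59.6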

How to plot correlation heatmap when using pyspark+databricks

I am studying pyspark in databricks. I want to generate a correlation heatmap. Let's say this is my data:
myGraph=spark.createDataFrame([(1.3,2.1,3.0),
(2.5,4.6,3.1),
(6.5,7.2,10.0)],
['col1','col2','col3'])
And this is my code:
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from ggplot import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.mllib.stat import Statistics
myGraph=spark.createDataFrame([(1.3,2.1,3.0),
(2.5,4.6,3.1),
(6.5,7.2,10.0)],
['col1','col2','col3'])
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=['col1','col2','col3'],
outputCol=vector_col)
myGraph_vector = assembler.transform(myGraph).select(vector_col)
matrix = Correlation.corr(myGraph_vector, vector_col)
matrix.collect()[0]["pearson({})".format(vector_col)].values
Up to here, I can get the correlation matrix. The result looks like this (screenshot omitted).
Now my problems are:
How to convert the matrix to a data frame? I have tried the methods from "How to convert DenseMatrix to spark DataFrame in pyspark?" and "How to get correlation matrix values pyspark", but they do not work for me.
How to generate a correlation heatmap like the example shown (image omitted)?
Since I have only just started studying pyspark and databricks, ggplot or matplotlib are both OK for my problem.
I think the point where you get confused is:
matrix.collect()[0]["pearson({})".format(vector_col)].values
Calling .values on a DenseMatrix gives you a flat list of all values, but what you are actually looking for is a list of lists representing the correlation matrix.
import matplotlib.pyplot as plt
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
columns = ['col1','col2','col3']
myGraph=spark.createDataFrame([(1.3,2.1,3.0),
(2.5,4.6,3.1),
(6.5,7.2,10.0)],
columns)
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=['col1','col2','col3'],
outputCol=vector_col)
myGraph_vector = assembler.transform(myGraph).select(vector_col)
matrix = Correlation.corr(myGraph_vector, vector_col)
Up to this point it is basically your code. Instead of calling .values, use .toArray().tolist() to get a list of lists representing the correlation matrix:
matrix = Correlation.corr(myGraph_vector, vector_col).collect()[0][0]
corrmatrix = matrix.toArray().tolist()
print(corrmatrix)
Output:
[[1.0, 0.9582184104641529, 0.9780872729407004], [0.9582184104641529, 1.0, 0.8776695567739841], [0.9780872729407004, 0.8776695567739841, 1.0]]
The advantage of this approach is that you can turn a list of lists easily into a dataframe:
df = spark.createDataFrame(corrmatrix,columns)
df.show()
Output:
+------------------+------------------+------------------+
| col1| col2| col3|
+------------------+------------------+------------------+
| 1.0|0.9582184104641529|0.9780872729407004|
|0.9582184104641529| 1.0|0.8776695567739841|
|0.9780872729407004|0.8776695567739841| 1.0|
+------------------+------------------+------------------+
To answer your second question: below is just one of many ways to plot a heatmap (seaborn offers even nicer ones; see the sketch after the code).
def plot_corr_matrix(correlations, attr, fig_no):
    fig = plt.figure(fig_no)
    ax = fig.add_subplot(111)
    # draw first, then label: matshow resets the tick locators,
    # so labels set beforehand would be discarded
    cax = ax.matshow(correlations, vmax=1, vmin=-1)
    fig.colorbar(cax)
    ax.set_title("Correlation Matrix for Specified Attributes")
    ax.set_xticklabels([''] + attr)
    ax.set_yticklabels([''] + attr)
    plt.show()

plot_corr_matrix(corrmatrix, columns, 234)
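A hedged alternative using seaborn (assuming the corrmatrix and columns from above): seaborn.heatmap takes care of tick labels and the colorbar itself.
import matplotlib.pyplot as plt
import seaborn as sns
# annot=True prints the correlation values inside the cells
sns.heatmap(corrmatrix, xticklabels=columns, yticklabels=columns,
            vmin=-1, vmax=1, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix for Specified Attributes")
plt.show()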