How can I compute, in the Spark dataframe below, the distance between Location A and Location B, and between Location A and Location C?
from pyspark.sql import SparkSession

spark = SparkSession(sc)
df = spark.createDataFrame(
    [('A', 40.202750, 29.168350, 'B', 40.689247, -74.044502),
     ('A', 40.202750, 29.168350, 'C', 25.197197, 55.274376)],
    ['Location1', 'Lat1', 'Long1', 'Location2', 'Lat2', 'Lon2'])
So the dataset looks like this:
+---------+--------+--------+---------+---------+----------+
|Location1| Lat1| Long1|Location2| Lat2| Lon2|
+---------+--------+--------+---------+---------+----------+
| A|40.20275|29.16835| B|40.689247|-74.044502|
| A|40.20275|29.16835| C|25.197197| 55.274376|
+---------+--------+--------+---------+---------+----------+
Thank you
You could use the Haversine formula, which goes something like:
2*6378*asin(sqrt(pow(sin((lat2-lat1)/2),2) + cos(lat1)*cos(lat2)*pow(sin((lon2-lon1)/2),2)))
Furthermore, you can create a UDF for it:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from math import sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    return 2*6378*asin(sqrt(sin((lat2-lat1)/2)**2 + cos(lat1)*cos(lat2)*sin((lon2-lon1)/2)**2))

dist_udf = F.udf(haversine, FloatType())
Make sure your latitude and longitude values are in radians. You could add that conversion inside the haversine function, before the calculation.
Once you have the UDF, you can straightaway do:
df.withColumn('distance', dist_udf(F.col('Lat1'), F.col('Long1'), F.col('Lat2'), F.col('Lon2')))
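For completeness, here is a minimal end-to-end sketch (the names haversine_deg and distance_km are my own, and it assumes the example dataframe above) that folds the degrees-to-radians conversion into the UDF:

import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from math import radians, sin, cos, asin, sqrt

def haversine_deg(lat1, lon1, lat2, lon2):
    # Convert degrees to radians before applying the Haversine formula
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    return 2 * 6378 * asin(sqrt(sin((lat2 - lat1) / 2)**2
                                + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2))

dist_udf = F.udf(haversine_deg, FloatType())
df.withColumn('distance_km', dist_udf('Lat1', 'Long1', 'Lat2', 'Lon2')).show()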
I have a pandas dataframe with three columns: Name, Latitude and Longitude.
For every person in the dataframe I want to 1) determine the person that is closest to him/her and 2) calculate the linear distance to that person. My code is like the example below:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from haversine import haversine
df = pd.read_csv('..data/file_name.csv')
df.set_index('Name', inplace=True)
dm = cdist(df, df, metric=haversine)
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)
df['closest person'] = df.index[closest]
df['distance'] = distances
I know that the issue here is that the argmin and min functions I am using simply match every person to him/herself, which is not what I want. I'm trying to modify the code to determine the distinct individual who is closest. For example, the closest person to John Doe is Bob Smith and the distance is xx. I've tried indexing and seeing if there is a way to sort the matrix, but it's not really working.
Is there a good way of doing this?
Edit: example input data
You can just modify the 0 values in this way:
# your code
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from haversine import haversine

df = pd.read_csv('..data/file_name.csv')
df.set_index('Name', inplace=True)
dm = cdist(df, df, metric=haversine)

# my code: replace the zero self-distances with each row's maximum,
# so argmin no longer picks the person him/herself
dm[dm == 0] = np.max(dm, axis=1)

# your code
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)
df['closest person'] = df.index[closest]
df['distance'] = distances
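If you prefer to be explicit about excluding self-matches, a small variation (my own sketch, not part of the answer above) is to mask the diagonal directly, which also keeps distinct people who share identical coordinates in play:

import numpy as np

# Mask the self-distances on the diagonal so argmin never picks the person
# him/herself, then pick the nearest remaining person as before.
np.fill_diagonal(dm, np.inf)
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)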
In the DataFrame, I want to compute the arccos, arcsin, and arctan of each angle A.
For example:
+-------+--------+--------+--------+
|Angle A|arccos A|arcsin A|arctan A|
+-------+--------+--------+--------+
|     30|        |        |        |
|     15|        |        |        |
|     45|        |        |        |
|     60|        |        |        |
+-------+--------+--------+--------+
I want to compute these for each value of angle A.
Just import pyspark.sql.functions, where you will find acos(), asin(), and atan().
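A minimal sketch of what that looks like (assuming a SparkSession named spark; the column name 'A' and the sample values are placeholders, and note that acos() and asin() only accept inputs in [-1, 1], so the column must already be scaled accordingly):

import pyspark.sql.functions as F

df = spark.createDataFrame([(0.50,), (0.25,), (0.75,)], ['A'])
df = (df.withColumn('arccos_A', F.acos('A'))
        .withColumn('arcsin_A', F.asin('A'))
        .withColumn('arctan_A', F.atan('A')))
df.show()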
I am currently looping through GPS coordinates in a dataframe. I am using this loop to look into another dataframe with GPS coordinates of specific locations and update the original dataframe with the closest location. This works fine but it is VERY slow. Is there a faster way?
Here is sample data:
Imports:
from shapely.geometry import Point
import pandas as pd
from geopy import distance
Create sample df1
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points
Create sample df2
locations = {'location':['San Diego', 'Austin', 'Washington DC'],
'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]}
df_locations = pd.DataFrame(locations)
Two loops and update:
lst = []  # create empty list to populate new df column
for index, row in df_gps.iterrows():  # iterate over first dataframe rows
    point = row['points']  # pull out GPS point
    closest_distance = 999999  # create container for distance
    closest_location = None  # create container for closest location
    for index1, row1 in df_locations.iterrows():  # iterate over second dataframe
        name = row1['location']  # assign name of location
        point2 = row1['gps']  # assign coordinates of location
        distances = distance.distance((point.x, point.y), (point2.x, point2.y)).miles  # calculate distance
        if distances < closest_distance:  # check to see if distance is closer
            closest_distance = distances  # if distance is closer, assign it
            closest_location = name  # if distance is closer, assign the name
    lst.append(closest_location)  # append closest city
df_gps['closest_city'] = lst  # add new column with closest cities
I'd really like to do this in the fastest way possible. I have read about vectorization in pandas and thought about creating a function and using apply, as mentioned in "How to iterate over rows in a DataFrame in Pandas"; however, I need two loops and a conditional in my code, so that pattern breaks down. Thank you for the help.
You can use KDTree from Scipy:
from scipy.spatial import KDTree
# Extract lat/lon from your dataframes
points = df_gps['points'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
distances, indices = KDTree(cities).query(points)
df_gps['closest_city'] = df_locations.iloc[indices]['location'].values
df_gps['distance'] = distances
You can use np.where to filter out distances that are too far away.
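For example, a quick sketch (the 1.0 cutoff is arbitrary and, with this KDTree, is in the same Euclidean lat/lon units as the query distances):

import numpy as np

# Keep the matched city only when it is within the threshold, otherwise None
df_gps['closest_city'] = np.where(df_gps['distance'] <= 1.0, df_gps['closest_city'], None)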
For performance, check my answer for a similar problem with 25k rows for df_gps and 200k for df_locations.
Based on the insight of Corralien, the final answer in code:

import numpy as np
from sklearn.neighbors import BallTree, DistanceMetric

# The haversine metric expects (lat, lon) in radians
points = df_gps['points'].apply(lambda p: np.radians((p.x, p.y))).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: np.radians((p.x, p.y))).apply(pd.Series)

dist = DistanceMetric.get_metric('haversine')
tree = BallTree(cities, metric=dist)
dists, indices = tree.query(points)

df_gps['dist'] = dists.flatten() * 3956  # Earth radius in miles -> distances in miles
df_gps['closest_city'] = df_locations.iloc[indices.flatten()]['location'].values
I'm interested in performing Linear interpolation using the scipy.interpolate library. The dataset looks somewhat like this:
DATAFRAME for interpolation between X, Y for different RUNs
I'd like to use this interpolated function to find the missing Y from this dataset:
DATAFRAME to use the interpolation function
The number of runs given here is just 3, but the real dataset will run into thousands of runs, so I'd appreciate advice on how to iterate the interpolation over all of them.
from scipy.interpolate import interp1d

for RUNNumber in range(TotalRuns):
    InterpolatedFunction[RUNNumber] = interp1d(X, Y)
As I understand it, you want a separate interpolation function defined for each run. Then you want to apply these functions to a second dataframe. I defined a dataframe df with columns ['X', 'Y', 'RUN'], and a second dataframe, new_df with columns ['X', 'Y_interpolation', 'RUN'].
interpolating_functions = dict()
for run_number in range(1, max_runs):
    run_data = df[df['RUN'] == run_number][['X', 'Y']]
    interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])
Now that we have interpolating functions for each run, we can use them to fill in the 'Y_interpolation' column in a new dataframe. This can be done using the apply function, which takes a function and applies it to each row in a dataframe. So let's define an interpolate function that will take a row of this new df and use the X value and the run number to calculate an interpolated Y value.
def interpolate(row):
    int_func = interpolating_functions[row['RUN']]
    interp_y = int_func._call_linear([row['X']])  # the _call_linear method
                                                  # expects and returns an array
    return interp_y[0]
Now we just use apply and our defined interpolate function.
new_df['Y_interpolation'] = new_df.apply(interpolate, axis=1)
I'm using pandas version 0.20.3, and this gives me a new_df that looks like this:
I am trying to get kurtosis using pandas. By doing some exploration, I have
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
however, the output is:
-0.006755982906479385
But I think the kurtosis (https://en.wikipedia.org/wiki/Kurtosis) should be close to (maybe normalize over N-1 instead of N, but this does not matter here)
(test_series - test_series.mean()).pow(4).mean()/np.power(test_series.std(),4)
which is
2.9908543104146026
The pandas documentation says the following:
Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0)
This is probably the excess kurtosis, defined as kurtosis - 3.
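A quick sanity check of that reading (a sketch with an arbitrary seed): adding 3 back to pandas' result lands close to the plain moment-based kurtosis computed in the question.

import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, just for reproducibility
s = pd.Series(np.random.randn(5000))

plain = (s - s.mean()).pow(4).mean() / s.std(ddof=0)**4  # plain (non-excess) kurtosis
print(plain)             # close to 3 for normal data
print(s.kurtosis() + 3)  # pandas' excess kurtosis shifted back; close to `plain`,
                         # though not identical, since pandas uses an unbiased estimator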
Pandas is calculating the unbiased estimator of the excess kurtosis. Kurtosis is the normalized 4th central moment. To find the unbiased estimators of the cumulants you need the k-statistics.
So the unbiased estimator of excess kurtosis is k4/k2**2.
To illustrate this:
import pandas as pd
import numpy as np
np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
#-0.0411811269445872
Now we can calculate this explicitly using the k-statistics:
n = len(test_series)
S1 = test_series.pow(1).sum()
S2 = test_series.pow(2).sum()
S3 = test_series.pow(3).sum()
S4 = test_series.pow(4).sum()
# Eq (7) and (5) from the k-statistics link
k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
k2 = (n*S2-S1**2)/(n*(n-1))
# k2 is the same as the N-1 variance: test_series.std(ddof=1)**2
k4/k2**2
#-0.04118112694458816
If you want better agreement to more decimal places, you'll need to be careful with the sums as they get rather large. But they're identical to 12 places.
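If precision becomes an issue, one option (my own suggestion, not part of the answer above) is to center the series before forming the power sums; cumulants of order two and above are shift-invariant, so k2 and k4 come out the same, but the sums stay much smaller:

from math import fsum

# Center the data first; the k2/k4 formulas above give the same result,
# but the power sums are far smaller, reducing floating-point rounding error.
c = test_series - test_series.mean()
S1, S2, S3, S4 = (fsum(c.pow(p)) for p in (1, 2, 3, 4))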