Linestring end coordinates in a .csv file along with source and target id - pandas

I have a Digital Road Map dataset containing node numbers and the coordinates of nodes interconnected through a road network.
[image: dataset in xlsx]
The dataset has three columns: column 1 is source, column 2 is target and column 3 is geometry. geometry is a linestring of a road, with a start point coordinate, an end point coordinate and a few intermediate point coordinates. The source and target columns hold the node numbers of the starting node and end node of each road.
I want to extract only the coordinates of the starting node and end node from each row, then arrange the filtered dataset so that each source and each target has its respective coordinates beside it.
The sample output file is:
[image: desired sample output]
I am looking for code in shapely, but most of the information I find deals with a single linestring. Since my data has more than a million rows, I have not been able to find relevant code that iterates through the entire dataset.

Your sample data is unusable because it is an image, so I have simulated some.
pick the first and last point from each LINESTRING
structure them as columns (in df)
reshape df into df2 to get your desired structure
import io
import shapely.geometry, shapely.wkt
import pandas as pd
import numpy as np
# sample data...
df = pd.read_csv(
io.StringIO(
'''source,target,geometry
0,100,"LINESTRING (5.897759230176348 49.44266714130711, 6.242751092156993 49.90222565367873, 5.674051954784829 49.5294835475575)"
1,101,"LINESTRING (13.59594567226444 48.87717194273715, 12.51844038254671 54.470370591848, 6.658229607783568 49.20195831969157)"
2,102,"LINESTRING (16.71947594571444 50.21574656839354, 23.42650841644439 50.30850576435745, 22.77641889821263 49.02739533140962, 14.60709842291953 51.74518809671997)"
3,103,"LINESTRING (18.62085859546164 54.68260569927078, 23.79919884613338 52.69109935160657, 20.89224450041863 54.31252492941253)"
4,104,"LINESTRING (5.606975945670001 51.03729848896978, 6.589396599970826 51.85202912048339, 3.31501148496416 51.34577662473805, 5.988658074577813 51.85161570902505)"
5,105,"LINESTRING (4.799221632515724 49.98537303323637, 6.043073357781111 50.12805166279423, 3.31501148496416 51.34577662473805, 6.15665815595878 50.80372101501058, 3.314971144228537 51.34575511331991)"
6,106,"LINESTRING (3.31501148496416 51.34577662473805, 3.830288527043137 51.62054454203195, 6.905139601274129 53.48216217713065, 4.705997348661185 53.09179840759776)"
7,107,"LINESTRING (7.092053256873896 53.14404328064489, 3.830288527043137 51.62054454203195, 6.842869500362383 52.22844025329755, 3.31501148496416 51.34577662473805)"
8,108,"LINESTRING (6.589396599970826 51.85202912048339, 6.905139601274129 53.48216217713065, 3.314971144228537 51.34575511331991, 5.988658074577813 51.85161570902505)"
9,109,"LINESTRING (5.606975945670001 51.03729848896978, 4.286022983425084 49.90749664977255)"'''
)
)
# pick first and last point from each linestring as columns
df = df.join(
df["geometry"]
.apply(lambda ls: np.array(shapely.wkt.loads(ls).coords)[[0, -1]])
.apply(
lambda x: {
f"{c}_point": shapely.geometry.Point(x[i])
for i, c in enumerate(df.columns)
if c != "geometry"
}
)
.apply(pd.Series)
)
# reshape to row wise
df2 = pd.melt(
df,
id_vars=["source", "target"],
value_vars=["source_point", "target_point"],
value_name="point",
)
df2["node_number"] = np.where(
df2["variable"] == "source_point", df2["source"], df2["target"]
)
df2 = df2.drop(columns=["source", "target", "variable"])
output
point                                        node_number
POINT (5.897759230176348 49.44266714130711)  0
POINT (13.59594567226444 48.87717194273715)  1
POINT (16.71947594571444 50.21574656839354)  2
POINT (18.62085859546164 54.68260569927078)  3
POINT (5.606975945670001 51.03729848896978)  4
POINT (4.799221632515724 49.98537303323637)  5
POINT (3.31501148496416 51.34577662473805)   6
POINT (7.092053256873896 53.14404328064489)  7
POINT (6.589396599970826 51.85202912048339)  8
POINT (5.606975945670001 51.03729848896978)  9
POINT (5.674051954784829 49.5294835475575)   100
POINT (6.658229607783568 49.20195831969157)  101
POINT (14.60709842291953 51.74518809671997)  102
POINT (20.89224450041863 54.31252492941253)  103
POINT (5.988658074577813 51.85161570902505)  104
POINT (3.314971144228537 51.34575511331991)  105
POINT (4.705997348661185 53.09179840759776)  106
POINT (3.31501148496416 51.34577662473805)   107
POINT (5.988658074577813 51.85161570902505)  108
POINT (4.286022983425084 49.90749664977255)  109

Do you mean:
df[['Start', 'end']] = df['geometry'].str.split(',', expand=True)
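Splitting on every comma gives one piece per vertex, and the number of vertices varies from row to row. A minimal sketch of a string-only variant that keeps just the first and last vertex is below. It assumes 2D WKT text of the form LINESTRING (x y, x y, ...) as in the simulated data above; the helper name linestring_endpoints is made up for this illustration and is not part of shapely or pandas.

```python
def linestring_endpoints(wkt):
    """Return ((x0, y0), (x1, y1)) for a 2D WKT LINESTRING string."""
    # Keep only the text between the outer parentheses.
    body = wkt[wkt.index("(") + 1 : wkt.rindex(")")]
    # First vertex: text before the first comma; last vertex: text after the last comma.
    first = body.split(",", 1)[0]
    last = body.rsplit(",", 1)[-1]

    def to_xy(part):
        # "5.89 49.44" -> (5.89, 49.44)
        return tuple(float(v) for v in part.split())

    return to_xy(first), to_xy(last)


start, end = linestring_endpoints(
    "LINESTRING (5.897759230176348 49.44266714130711, "
    "6.242751092156993 49.90222565367873, "
    "5.674051954784829 49.5294835475575)"
)
print(start, end)
```

On a DataFrame, the same function could be applied with df['geometry'].map(linestring_endpoints), which avoids building a geometry object for every row.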

Related

Plotting polygons with Folium and Geopandas doesn't work

I have tried to plot polygons on a map with Geopandas and Folium, using the official Geopandas tutorial and this dataset. I tried to follow the tutorial as literally as I could, but Folium still doesn't draw the polygons. The Matplotlib map works and I can create the Folium map too. Code:
import pandas as pd
import geopandas as gdp
import folium
import matplotlib.pyplot as plt
df = pd.read_csv('https://geo.stat.fi/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=postialue:pno_tilasto&outputFormat=csv')
df.to_csv('coordinates.csv')
#limit to Helsinki and drop unnecessary columns
df['population_2019'] = df['he_vakiy']
df['zipcode'] = df['postinumeroalue'].astype(int)
df['population_2019'] = df['population_2019'].astype(int)
df = df[df['zipcode'] < 1000]
df = df[['zipcode', 'nimi', 'geom', 'population_2019']]
df.to_csv('coordinates_hki.csv')
df.head()
#this is from there: https://gis.stackexchange.com/questions/387225/set-geometry-in-#geodataframe-to-another-column-fails-typeerror-input-must-be
from shapely.wkt import loads
df = gdp.read_file('coordinates_hki.csv')
df.geometry = df['geom'].apply(loads)
df.plot(figsize=(6, 6))
plt.show()
df = df.set_crs(epsg=4326)
print(df.crs)
df.plot(figsize=(6, 6))
plt.show()
m = folium.Map(location=[60.1674881,24.9427473], zoom_start=10, tiles='CartoDB positron')
m
for _, r in df.iterrows():
    # Without simplifying the representation of each borough,
    # the map might not be displayed
    sim_geo = gdp.GeoSeries(r['geometry']).simplify(tolerance=0.00001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['nimi']).add_to(geo_j)
    geo_j.add_to(folium.Popup(r['nimi']))
m
The trick here is to realize that your data is not in units of degrees. You can determine this by looking at the centroid of your polygons:
>>> print(df.geometry.centroid)
0 POINT (381147.564 6673464.230)
1 POINT (381878.124 6676471.194)
2 POINT (381245.290 6677483.758)
3 POINT (381050.952 6678206.603)
4 POINT (382129.741 6677505.464)
...
79 POINT (397465.125 6676003.926)
80 POINT (393716.203 6675794.166)
81 POINT (393436.954 6679515.888)
82 POINT (395196.736 6677776.331)
83 POINT (398338.591 6675428.040)
Length: 84, dtype: geometry
These values are far outside the valid range for coordinates expressed in degrees, which is -180 to 180 for longitude and -90 to 90 for latitude. The next step is to figure out what CRS the data is actually in. If you take your dataset URL and strip off the &outputFormat=csv part, you get this URL:
https://geo.stat.fi/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=postialue:pno_tilasto
Search for CRS in that document, and you'll find this:
<gml:Envelope srsName="urn:ogc:def:crs:EPSG::3067" srsDimension="2">
So, it turns out your data is in EPSG:3067, a standard for representing Finnish coordinates.
You need to tell geopandas about this, and convert into WGS84 (the most common coordinate system) to make it compatible with folium.
df.geometry = df['geom'].apply(loads)
df = df.set_crs('EPSG:3067')
df = df.to_crs('WGS84')
The function set_crs() changes the coordinate system that GeoPandas expects the data to be in, but does not change any of the coordinates. The function to_crs() takes the points in the dataset and re-projects them into a new coordinate system. The effect of these two calls is to convert from EPSG:3067 to WGS84.
By adding these two lines, I get the following result:
[image: result]
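The range check described in this answer can also be written down as a small helper. The sketch below is plain Python and only tests whether (x, y) pairs fit inside longitude/latitude bounds; the function name looks_like_degrees is made up for this illustration, and passing the check does not prove that data is in WGS84, it only rules out obviously projected coordinates.

```python
def looks_like_degrees(points):
    """True if every (x, y) pair fits in the longitude/latitude ranges."""
    return all(-180 <= x <= 180 and -90 <= y <= 90 for x, y in points)


# Two of the centroids printed above (projected coordinates, in metres).
centroids = [(381147.564, 6673464.230), (381878.124, 6676471.194)]
# The Folium map centre from the question, written as (longitude, latitude).
map_centre = [(24.9427473, 60.1674881)]

print(looks_like_degrees(centroids))   # False
print(looks_like_degrees(map_centre))  # True
```

With a GeoDataFrame, the pairs could come from something like zip(df.geometry.centroid.x, df.geometry.centroid.y).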

How to plot coordinates from single pandas series

I have a pandas series called df1['geometry.coordinates'] of coordinate values in the following format:
geometry.coordinates
0 [150.792711, -34.210868]
1 [151.551228, -33.023339]
2 [148.92149870748742, -34.767207772932835]
3 [151.033742, -33.919998]
4 [150.953963043732, -32.3935017885229]
... ...
432 [114.8927165, -28.902492300000002]
433 [115.34601918477634, -30.041742290803096]
434 [115.4632611, -30.8581035]
435 [121.42151909999998, -30.7804027]
436 [115.69424934340425, -30.680970908597665]
I want to plot each point on a graph, probably through using a scatter plot.
I tried: df1['geometry.coordinates'].plot.scatter() but it gets confused because it only reads it as one list value rather than two and therefore I always get the following error:
TypeError: scatter() missing 2 required positional arguments: 'x' and 'y'
Anyone know how I can solve this?
You need to separate the column containing the list so that you can specify x and y in the plot call.
You can split a column containing a list by constructing a data frame from a list.
pd.DataFrame(df1["geometry.coordinates"].to_list(), columns=['x', 'y']).plot.scatter(x="x", y="y")
Step 1: Split array into multiple columns
df1[['x','y']] = pd.DataFrame(df1['geometry.coordinates'].tolist(), index= df1.index)
Step 2: Plot
df1.plot.scatter(x = 'x', y = 'y', s = 30) #s is size of dots
You are not passing the x and y parameters to scatter(), so the error is expected. Once the lists are split into two columns named x and y, something along the lines of df1.plot.scatter(x='x', y='y') should work.
Also, as each row holds one [x, y] pair, you need to transpose the data to get one sequence of x values and one sequence of y values before passing them to a plotting function such as plt.scatter.
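The transposition mentioned here can be sketched in plain Python with zip, which turns a list of [x, y] pairs into one tuple of x values and one tuple of y values. The sample values are taken from the question; the variable names are only for illustration.

```python
coords = [
    [150.792711, -34.210868],
    [151.551228, -33.023339],
    [148.92149870748742, -34.767207772932835],
]

# zip(*coords) regroups the pairs column by column.
xs, ys = zip(*coords)
print(xs)
print(ys)
```

The two tuples can then be handed to a plotting call such as plt.scatter(xs, ys), and the same unpacking should work on the values of df1['geometry.coordinates'].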
I did it this way.
import pandas as pd
import matplotlib.pyplot as plt
geometry = pd.Series([
[150.792711, -34.210868],
[151.551228, -33.023339],
[148.92149870748742, -34.767207772932835],
[151.033742, -33.919998],
[150.953963043732, -32.3935017885229]])
df = pd.DataFrame(geometry.to_list(), columns = ['x','y'])
plt.scatter(x = df['x'], y = df['y'],
edgecolor ='black')
plt.grid(alpha=.15)
you can try
import pandas as pd
geometry_coordinates=[[150.792711, -34.210868],
[151.551228, -33.023339],
[148.92149870748742, -34.767207772932835],
[151.033742, -33.919998],
[150.953963043732, -32.3935017885229],
[114.8927165, -28.902492300000002],
[115.34601918477634, -30.041742290803096],
[115.4632611, -30.8581035],
[121.42151909999998, -30.7804027],
[115.69424934340425, -30.680970908597665]]
# each pair is [longitude, latitude]: values such as 150.79 cannot be latitudes
geometry_coordinates=pd.DataFrame(geometry_coordinates,columns=['long','lat'])
geometry_coordinates.plot.scatter(x='long',y='lat')

Flightradar24 pandas groupby and vectorize. A no looping solution

I am looking to perform a fast operation on flightradar data to check whether the speed implied by the distance travelled matches the reported speed. I have multiple flights and was told not to run double loops on pandas dataframes. Here is a sample dataframe:
import pandas as pd
from datetime import datetime
from shapely.geometry import Point
from geopy.distance import distance
dates = ['2020-12-26 15:13:01', '2020-12-26 15:13:07','2020-12-26 15:13:19','2020-12-26 15:13:32','2020-12-26 15:13:38']
datetimes = [datetime.fromisoformat(date) for date in dates]
data = {'UTC': datetimes,
'Callsign': ["1", "1","2","2","2"],
'Position':[Point(30.542175,-91.13999200000001), Point(30.546204,-91.14020499999999),Point(30.551443,-91.14417299999999),Point(30.553909,-91.15136699999999),Point(30.554489,-91.155075)]
}
df = pd.DataFrame(data)
What I want to do is add a new column called "dist". This column will be 0 if it is the first element of a new callsign but if not it will be the distance between a point and the previous point.
The resulting df should look like this:
df1 = df
dist = [0,0.27783309075379214,0,0.46131362750613436,0.22464461718704595]
df1['dist'] = dist
What I have tried is to first assign a group index:
df['group_index'] = df.groupby('Callsign').cumcount()
Then groupby
Then try and apply the function:
df['dist'] = df.groupby('Callsign').apply(lambda g: 0 if g.group_index == 0 else distance((g.Position.x , g.Position.y),
(g.Position.shift().x , g.Position.shift().y)).miles)
I was hoping this would give me the 0 for the first index of each group and then run the distance function on all others and return a value in miles. However it does not work.
The code errors out for at least one reason which is because the .x and .y attributes of the shapely object are being called on the series rather than the object.
Any ideas on how to fix this would be much appreciated.
Sort df by callsign then timestamp
Compute distances between adjacent rows using a temporary column of shifted points
For the first row of each new callsign, set distance to 0
Drop temporary column
df = df.sort_values(by=['Callsign', 'UTC'])
df['Position_prev'] = df['Position'].shift().bfill()
def get_dist(row):
    return distance((row['Position'].x, row['Position'].y),
                    (row['Position_prev'].x, row['Position_prev'].y)).miles
df['dist'] = df.apply(get_dist, axis=1)
# Flag row if callsign is different from previous row callsign
new_callsign_rows = df['Callsign'] != df['Callsign'].shift()
# Zero out the first distance of each callsign group
df.loc[new_callsign_rows, 'dist'] = 0.0
# Drop shifted column
df = df.drop(columns='Position_prev')
print(df)
UTC Callsign Position dist
0 2020-12-26 15:13:01 1 POINT (30.542175 -91.13999200000001) 0.000000
1 2020-12-26 15:13:07 1 POINT (30.546204 -91.14020499999999) 0.277833
2 2020-12-26 15:13:19 2 POINT (30.551443 -91.14417299999999) 0.000000
3 2020-12-26 15:13:32 2 POINT (30.553909 -91.15136699999999) 0.461314
4 2020-12-26 15:13:38 2 POINT (30.554489 -91.155075) 0.224645
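The row-wise apply above still calls the distance function once per row. A fully vectorized variant is sketched below with NumPy only. It swaps geopy's geodesic distance for the haversine formula on a sphere, so the results are close to but not identical with the values above. It assumes the rows are already sorted by callsign and then by time; the function name stepwise_miles and the Earth radius constant are choices made for this sketch.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, spherical approximation


def stepwise_miles(lat, lon, group):
    """Haversine distance to the previous row, 0 on the first row of each group."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    group = np.asarray(group)
    dist = np.zeros(lat.shape[0])
    if lat.shape[0] < 2:
        return dist
    # Differences between each row and the row before it.
    dlat = lat[1:] - lat[:-1]
    dlon = lon[1:] - lon[:-1]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlon / 2) ** 2)
    step = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))
    # Keep the step only where the callsign matches the previous row.
    same_group = group[1:] == group[:-1]
    dist[1:] = np.where(same_group, step, 0.0)
    return dist


# Sample data from the question; Point(x, y) there holds (latitude, longitude).
lat = [30.542175, 30.546204, 30.551443, 30.553909, 30.554489]
lon = [-91.13999200000001, -91.14020499999999, -91.14417299999999,
       -91.15136699999999, -91.155075]
callsign = ["1", "1", "2", "2", "2"]
dist = stepwise_miles(lat, lon, callsign)
print(dist)
```

With the DataFrame from the question, the inputs could be built from df['Position'] (x as latitude, y as longitude) and df['Callsign'] after sorting.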

Add a line at z=0 to ggplot2 heatmap

I have plotted a heatmap in ggplot2. I want to add a curved line to the plot to show where z=0 (i.e. where the value of the data used for the fill is zero), how can I do this?
Thanks
Since no example data or code is provided, I'll illustrate with the volcano dataset, representing heights of a volcano in a matrix. Since the data doesn't contain a zero point, we'll draw the line at the arbitrarily chosen 125 mark.
library(ggplot2)
# Convert matrix to data.frame
df <- data.frame(
row = as.vector(row(volcano)),
col = as.vector(col(volcano)),
value = as.vector(volcano)
)
# Set contour breaks at desired level
ggplot(df, aes(col, row, fill = value)) +
geom_raster() +
geom_contour(aes(z = value),
breaks = 125, col = 'red')
Created on 2020-04-06 by the reprex package (v0.3.0)
If this isn't a good approximation of your problem, I'd suggest to include example data and code in your question.

Difficulty with numpy broadcasting

I have two 2D point clouds (oldPts and newPts) which I wish to combine. They are m×2 and n×2 NumPy integer arrays with m and n of order 2000. newPts contains many duplicates or near duplicates of oldPts and I need to remove these before combining.
So far I have used the histogram2d function to produce a 2D representation of oldPts (H). I then compare each newPt to an NxN area of H and, if it is empty, I accept the point. This last part I am currently doing with a Python loop, which I would like to remove. Can anybody show me how to do this with broadcasting, or perhaps suggest a completely different way of going about the problem? The working code is below.
npzfile = np.load(path+datasetNo+'\\temp.npz')
arrs = npzfile.files
oldPts = npzfile[arrs[0]]
newPts = npzfile[arrs[1]]
# remove all the negative values
oldPts = oldPts[oldPts.min(axis=1)>=0,:]
newPts = newPts[newPts.min(axis=1)>=0,:]
# round to integers
oldPts = np.around(oldPts).astype(int)
newPts = newPts.astype(int)
# put the oldPts into 2d array
H, xedg,yedg= np.histogram2d(oldPts[:,0],oldPts[:,1],
bins = [xMax,yMax],
range = [[0, xMax], [0, yMax]])
finalNewList = []
N = 5
for pt in newPts:
    if not H[max(0,pt[0]-N):min(xMax,pt[0]+N),
             max(0,pt[1]- N):min(yMax,pt[1]+N)].any():
        finalNewList.append(pt)
finalNew = np.array(finalNewList)
The right way to do this is to use linear algebra to compute the distance between each pair of 2-long vectors, and then accept only the new points that are "different enough" from each old point: using scipy.spatial.distance.cdist:
import numpy as np
oldPts = np.random.randn(1000,2)
newPts = np.random.randn(2000,2)
from scipy.spatial.distance import cdist
dist = cdist(oldPts, newPts)
print(dist.shape) # (1000, 2000)
okIndex = np.min(dist, axis=0) > 5 # True where the nearest old point is farther than 5
print(np.sum(okIndex)) # number of accepted points
finalNew = newPts[okIndex,:]
print(finalNew.shape) # (number of accepted points, 2)
Above I use the Euclidean distance of 5 as the threshold for "too close": any point in newPts that's farther than 5 from all points in oldPts is accepted into finalNew, which is why the minimum over the old points is compared with the threshold. You will have to look at the range of values in dist to find a good threshold, but your histogram can guide you in picking the best one.
(One good way to visualize dist is to use matplotlib.pyplot.imshow(dist).)
This is a more refined version of what you were doing with the histogram. In fact, you ought to be able to get essentially the same answer as the histogram by passing the metric='chebyshev' keyword argument to cdist, since the NxN window test bounds the x and y offsets separately, assuming your histogram bin widths are the same in both dimensions, and using 5 again as the threshold.
(PS. If you're interested in another useful function in scipy.spatial.distance, check out my answer that uses pdist to find unique rows/columns in an array.)
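Since the question asks about broadcasting specifically, here is a minimal sketch of the window test written that way. It compares every new point with every old point at once and rejects a new point when some old point lies within n of it along both axes. The function name far_from_all and the sample arrays are made up for this illustration, and the window edges are not guaranteed to match the histogram slicing in the question exactly.

```python
import numpy as np


def far_from_all(old_pts, new_pts, n):
    """Keep the new points with no old point closer than n on both axes."""
    # Shape (n_new, n_old, 2): per-axis offsets between every pair of points.
    diff = np.abs(new_pts[:, None, :] - old_pts[None, :, :])
    # True where an old point falls inside the square window around a new point.
    near = (diff < n).all(axis=2)
    return new_pts[~near.any(axis=1)]


old_pts = np.array([[10, 10], [50, 50]])
new_pts = np.array([[12, 9], [30, 30], [50, 56], [54, 46]])
kept = far_from_all(old_pts, new_pts, 5)
print(kept)
```

For m and n of order 2000 the intermediate array holds about 8 million integers, which should be manageable; for much larger clouds a spatial index would likely be a better fit.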