Plotting polygons with Folium and Geopandas doesn't work - pandas

I have tried to plot polygons on a map with Geopandas and Folium, using the official Geopandas tutorial and this dataset. I followed the tutorial as literally as I could, but Folium still doesn't draw the polygons. The Matplotlib map works, and I can create the Folium map too. Code:
import pandas as pd
import geopandas as gdp
import folium
import matplotlib.pyplot as plt
df = pd.read_csv('https://geo.stat.fi/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=postialue:pno_tilasto&outputFormat=csv')
df.to_csv('coordinates.csv')
#limit to Helsinki and drop unnecessary columns
df['population_2019'] = df['he_vakiy']
df['zipcode'] = df['postinumeroalue'].astype(int)
df['population_2019'] = df['population_2019'].astype(int)
df = df[df['zipcode'] < 1000]
df = df[['zipcode', 'nimi', 'geom', 'population_2019']]
df.to_csv('coordinates_hki.csv')
df.head()
# this is from here: https://gis.stackexchange.com/questions/387225/set-geometry-in-geodataframe-to-another-column-fails-typeerror-input-must-be
from shapely.wkt import loads
df = gdp.read_file('coordinates_hki.csv')
df.geometry = df['geom'].apply(loads)
df.plot(figsize=(6, 6))
plt.show()
df = df.set_crs(epsg=4326)
print(df.crs)
df.plot(figsize=(6, 6))
plt.show()
m = folium.Map(location=[60.1674881,24.9427473], zoom_start=10, tiles='CartoDB positron')
m
for _, r in df.iterrows():
    # Without simplifying the representation of each borough,
    # the map might not be displayed
    sim_geo = gdp.GeoSeries(r['geometry']).simplify(tolerance=0.00001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['nimi']).add_to(geo_j)
    # add the layer to the map (not to a new Popup)
    geo_j.add_to(m)
m

The trick here is to realize that your data is not in units of degrees. You can determine this by looking at the centroid of your polygons:
>>> print(df.geometry.centroid)
0 POINT (381147.564 6673464.230)
1 POINT (381878.124 6676471.194)
2 POINT (381245.290 6677483.758)
3 POINT (381050.952 6678206.603)
4 POINT (382129.741 6677505.464)
...
79 POINT (397465.125 6676003.926)
80 POINT (393716.203 6675794.166)
81 POINT (393436.954 6679515.888)
82 POINT (395196.736 6677776.331)
83 POINT (398338.591 6675428.040)
Length: 84, dtype: geometry
These values are far outside the valid range for coordinates in degrees, which is -180 to 180 for longitude and -90 to 90 for latitude. The next step is to figure out which CRS the data is actually in. If you take your dataset URL and strip off the &outputFormat=csv part, you get this URL:
https://geo.stat.fi/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=postialue:pno_tilasto
Search for CRS in that document, and you'll find this:
<gml:Envelope srsName="urn:ogc:def:crs:EPSG::3067" srsDimension="2">
So, it turns out your data is in EPSG:3067 (ETRS89 / TM35FIN), a standard projected coordinate system for Finnish coordinates, with units of metres.
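The range check described above can be written as a tiny helper. This is only a sketch: the function name looks_like_degrees is made up for illustration and is not part of geopandas.

```python
def looks_like_degrees(x, y):
    """Return True if (x, y) could be a longitude/latitude pair in degrees."""
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0


# First centroid from the output above: far outside the degree range
print(looks_like_degrees(381147.564, 6673464.230))

# The Folium map centre from the question, given as (lon, lat)
print(looks_like_degrees(24.9427473, 60.1674881))
```

A check like this only says that the values cannot be degrees; it does not tell you which projected CRS they are in, which is why the metadata lookup is still needed.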
You need to tell geopandas about this, and convert into WGS84 (the most common coordinate system) to make it compatible with folium.
df.geometry = df['geom'].apply(loads)
df = df.set_crs('EPSG:3067')
df = df.to_crs('WGS84')
The function set_crs() changes the coordinate system that GeoPandas expects the data to be in, but does not change any of the coordinates. The function to_crs() takes the points in the dataset and re-projects them into a new coordinate system. Together, these two calls convert the data from EPSG:3067 to WGS84.
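To make the difference between the two calls concrete, here is a minimal sketch with a single point. The coordinates are an assumed example value in EPSG:3067 metres (roughly central Helsinki), not taken from the dataset.

```python
import geopandas as gpd
from shapely.geometry import Point

# One illustrative point in EPSG:3067 metres (assumed value, not from the dataset)
gdf = gpd.GeoDataFrame({"name": ["example"]}, geometry=[Point(385000.0, 6672000.0)])

# set_crs only attaches the CRS label; the numbers stay the same
labelled = gdf.set_crs("EPSG:3067")

# to_crs re-projects the coordinates into longitude/latitude degrees
reprojected = labelled.to_crs("EPSG:4326")

print(labelled.geometry.iloc[0])
print(reprojected.geometry.iloc[0])
```

The first print still shows metre values, while the second shows a longitude/latitude pair that Folium can place on the map.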
By adding these two lines, I get the following result:

Related

How can I plot a map of a specific country using plotly

Using geopandas and matplotlib, I have plotted a map of India showing the Air Quality Index.
The link to my data is:
https://drive.google.com/file/d/1-xihM-LCB6dNfONbK28CJWOP_PVgXA8C/view?usp=share_link
I want to plot an interactive map with the names of the cities and the borders of the regions of India using plotly. How can I do that?
from matplotlib import cm, colors
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
#restricted to India.
ax = world[world.name == 'India'].plot(color='grey', edgecolor='white')
city_day_gdf.plot(column='AQI_Bucket', ax=ax, cmap='PuBuGn', markersize=city_day_gdf['AQI'])
norm = colors.Normalize(city_day_gdf.AQI.min(), city_day_gdf.AQI.max())
plt.colorbar(cm.ScalarMappable(norm=norm,cmap='PuBuGn'), ax=ax)
plt.title("A Map showing the descriptions of Air Quality Index in terms of AQI magnitude across India between 2015 and 2020")
plt.show()
You can scatter your data using px.scatter_mapbox(). I have reduced the data to just the newest reading per city, since you have not stated how you want to treat the time series.
To plot the regions of India, you need some geometry. I have used GeoJSON from a repo on GitHub for this, and simplified the geometry because it is very detailed and would otherwise slow plotly down significantly.
Finally, pull it all together by layering the scatter on top of the choropleth.
import pandas as pd
import geopandas as gpd
import shapely.wkt
import plotly.express as px
import requests
url = "https://drive.google.com/file/d/1-xihM-LCB6dNfONbK28CJWOP_PVgXA8C/view?usp=share_link"
df = pd.read_csv("https://drive.google.com/uc?id=" + url.split("/")[-2], index_col=0)
df["Date"] = pd.to_datetime(df["Date"])
# just reduce data to last date for each city. Nothing in question indicates
# how to deal with time series
df = df.sort_values(["City", "Date"]).groupby("City", as_index=False).last()
# get some geometry for regions of india
gdf_region = gpd.read_file(
    "https://raw.githubusercontent.com/Subhash9325/GeoJson-Data-of-Indian-States/master/Indian_States"
)
# simplify in a projected (metre-based) CRS, then convert back
gdf_region["geometry"] = (
    gdf_region.to_crs(gdf_region.estimate_utm_crs())
    .simplify(5000)
    .to_crs(gdf_region.crs)
)
fig = px.choropleth_mapbox(
    gdf_region,
    geojson=gdf_region.__geo_interface__,
    locations=gdf_region.index,
    color="NAME_1",
).update_traces(showlegend=False)
# layer the city scatter on top of the region choropleth
fig.add_traces(
    px.scatter_mapbox(
        df, lat="Lat", lon="Lon", color="AQI_Bucket", hover_data=["City"]
    ).data
)
fig.update_layout(
    mapbox=dict(
        style="carto-positron",
        zoom=3,
        center=dict(lat=df["Lat"].mean(), lon=df["Lon"].mean()),
    )
)
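To illustrate why the geometry is simplified in a metre-based CRS, here is a small sketch using shapely only. The circle is synthetic stand-in data for a detailed region boundary, and the tolerance of 500 is an arbitrary assumed value.

```python
from shapely.geometry import Point

# Synthetic "region": a circle of radius 10 km, in metre units (assumed data)
region = Point(0, 0).buffer(10_000)

# Tolerance is in the units of the coordinates, so 500 means 500 metres here
simplified = region.simplify(500)

n_before = len(region.exterior.coords)
n_after = len(simplified.exterior.coords)
print(n_before, n_after)
```

Because the tolerance is expressed in coordinate units, the same number would mean 500 degrees if the data were still in longitude/latitude, which is why the code above converts to a UTM CRS before calling simplify.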

How to convert pandas dataframe to geopandas?

I'm trying to load an Excel file and convert it to a GeoDataFrame:
import pandas as pd
import geopandas as gpd
df = pd.read_excel('Centroids.xlsx')
df.head()
   servicename                        servicecentroid
0  Mönchengladbach, Kreisfreie Stadt  POINT (4070115.425463234 3123463.773862813)
1  Mettmann, Kreis                    POINT (4109488.971501033 3131686.7549837814)
2  Düsseldorf, Kreisfreie Stadt       POINT (4098292.026333667 3129901.416880203)
Then I try to convert it to a GeoDataFrame, but the following error occurs:
gdf = gpd.GeoDataFrame(df, geometry='servicecentroid')
TypeError: Input must be valid geometry objects: POINT (4070115.425463234 3123463.773862813)
What is wrong with my data?
Thank you.
Are your servicecentroid values actual Points? If you want to create a GeoDataFrame, you have to make sure you have a 'geometry' column with actual Point objects. For example:
from shapely.geometry import Point
df = pd.DataFrame({
    'servicename': ['Mönchengladbach, Kreisfreie Stadt', 'Mettmann, Kreis', 'Düsseldorf, Kreisfreie Stadt'],
    'geometry': [Point(4070115.425463234, 3123463.773862813), Point(4109488.971501033, 3131686.7549837814), Point(4098292.026333667, 3129901.416880203)],
})
gdf = gpd.GeoDataFrame(df)
print(gdf.dtypes)
This will output (notice the geometry dtype):
servicename object
geometry geometry
dtype: object
Note that there is a comma separating the Point values, so:
Point(4070115.425463234, 3123463.773862813)
... instead of:
Point(4070115.425463234 3123463.773862813)
Edit:
To make your life even easier, you can run the following code to turn the strings in your original dataframe into actual Point objects. It takes the original values, splits them, and rebuilds them as Points.
import re
from shapely.geometry import Point
def my_func(x):
    # extract the text between the parentheses and split it into x and y
    l = re.search(r'\((.*?)\)', x).group(1).split(' ')
    return Point(float(l[0]), float(l[1]))
df['geometry'] = df['servicecentroid'].transform(my_func)
It appears that servicecentroid is a column of WKT strings.
The geometry argument of GeoDataFrame() needs actual geometry objects (either a list/array/series of them, or the name of a column that already holds them), not strings.
Hence it becomes simple: convert the series of WKT strings to a series of geometry objects using shapely.
import pandas as pd
import io
import shapely.wkt
import geopandas as gpd
# columns are separated by two or more spaces; the first field becomes the index
df = pd.read_csv(
    io.StringIO(
        """servicename  servicecentroid
0  Mönchengladbach, Kreisfreie Stadt  POINT (4070115.425463234 3123463.773862813)
1  Mettmann, Kreis  POINT (4109488.971501033 3131686.7549837814)
2  Düsseldorf, Kreisfreie Stadt  POINT (4098292.026333667 3129901.416880203)"""
    ),
    sep=r"\s\s+",
    engine="python",
)
# NB CRS is missing, looks like it is a UTM CRS....
gpd.GeoDataFrame(df, geometry=df["servicecentroid"].apply(shapely.wkt.loads))
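As a possible alternative, recent geopandas versions provide GeoSeries.from_wkt, which parses a whole column of WKT strings in one call. Once the column holds geometry objects, its name can be passed as the geometry argument. The two-row frame below is made-up sample data, so treat this as a sketch rather than a drop-in solution.

```python
import pandas as pd
import geopandas as gpd

# Made-up sample data with WKT strings, standing in for the Excel sheet
df = pd.DataFrame(
    {
        "servicename": ["A", "B"],
        "servicecentroid": ["POINT (1 2)", "POINT (3 4)"],
    }
)

# Parse the WKT strings into geometry objects
df["servicecentroid"] = gpd.GeoSeries.from_wkt(df["servicecentroid"])

# Now the column name is accepted, because the column contains real geometries
gdf = gpd.GeoDataFrame(df, geometry="servicecentroid")
print(gdf.dtypes)
```

No CRS is set here, because the original question does not say which one the coordinates are in.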

Cluster groups continuously instead of discrete - python

I'm trying to cluster a group of points in a probabilistic manner. Using below, I have a single set of xy points, which are recorded in X and Y. I want to cluster into groups using a reference point, which is displayed in X2 and Y2.
With the help of an answer, the current approach is to measure the distance from the reference point and group using k-means. Although it provides a method to cluster using the reference point, the hard cutoff and adherence to k clusters make it somewhat unsuitable when dealing with numerous datasets. For instance, the number of clusters needed for this example is probably 3, but a separate example may differ. I'd have to manually go through and alter k every time.
Given the non-probabilistic nature of k-means a separate option could be GMM. Is it possible to account for the reference point when modelling? If I attach the output below the underlying model isn't clustering as I'm hoping for.
If I look at the probability each point is within a group it's not clustered as I'd hoped. With this I run into the same problem with manually altering the amount of components. Because the points are distributed randomly, using “AIC” or “BIC” to select the appropriate number of clusters doesn't work. There is no optimal number.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
})
k-means:
import numpy as np
# distance of each point (X, Y) from its reference point (X2, Y2)
df['distance'] = np.sqrt((df['X'] - df['X2'])**2 + (df['Y'] - df['Y2'])**2)
model = KMeans(n_clusters = 2)
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T)
df['group'] = model.labels_
plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)
GMM:
from sklearn import mixture
Y_sklearn = df[['X','Y']].values
gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis');
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', edgecolor = 'k', s = 5, zorder = 10)
proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)
In my opinion, if you want to define clusters as "regions where points are close to each other", you should use DBSCAN.
This clustering algorithm finds clusters by looking at regions where points are close to each other (i.e. dense regions), and are separated from other clusters by regions where points are less dense.
This algorithm can categorize points as noise (outliers). Outliers are labelled -1.
They are points that do not belong to any cluster.
Here is some code to perform DBSCAN clustering, and to insert the cluster labels as a new categorical column in the original Y_sklearn DataFrame. It also prints how many clusters and how many outliers are found.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
Y_sklearn = df.loc[:, ["X", "Y"]].copy()
n_points = Y_sklearn.shape[0]
dbs = DBSCAN()
labels_clusters = dbs.fit_predict(Y_sklearn)
#Number of found clusters (outliers are not considered a cluster).
n_clusters = labels_clusters.max() + 1
print(f"DBSCAN found {n_clusters} clusters in dataset with {n_points} points.")
#Number of found outliers (possibly no outliers found).
n_outliers = np.count_nonzero((labels_clusters == -1))
if n_outliers:
    print(f"{n_outliers} outliers were found.\n")
else:
    print("No outliers were found.\n")
#Add cluster labels as a new column to original DataFrame.
Y_sklearn["cluster"] = labels_clusters
#Setting `cluster` column to Categorical dtype makes seaborn function properly treat
#cluster labels as categorical, and not numerical.
Y_sklearn["cluster"] = Y_sklearn["cluster"].astype("category")
If you want to plot the results, I suggest you use Seaborn. Here is some code to plot the points of Y_sklearn DataFrame, and color them by the cluster they belong to. I also define a new color palette, which is just the default Seaborn color palette, but where outliers (with label -1) will be in black.
import matplotlib.pyplot as plt
import seaborn as sns
name_palette = "tab10"
palette = sns.color_palette(name_palette)
if n_outliers:
    # put black first so that label -1 (outliers) is drawn in black
    color_outliers = "black"
    palette.insert(0, color_outliers)
sns.set_palette(palette)
fig, ax = plt.subplots()
sns.scatterplot(data=Y_sklearn,
                x="X",
                y="Y",
                hue="cluster",
                ax=ax,
                )
Using the default hyperparameters, the DBSCAN algorithm finds no clusters in the data you provided: all points are considered outliers, because there is no region where points are significantly denser. Is that your whole dataset, or just a sample? If it is a sample, the whole dataset will have many more points, and DBSCAN is more likely to find some high-density regions.
Or you can try tweaking the hyperparameters, min_samples and eps in particular. If you want to "force" the algorithm to find more clusters, you can decrease min_samples (default is 5) or increase eps (default is 0.5). Of course, the optimal hyperparameter values depend on the specific dataset, and eps in particular is measured in the units of your data. If the algorithm still considers all points in your dataset to be outliers over a reasonable range of values, it suggests that there are no "natural" clusters.
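As a rough sketch of that tuning step, the snippet below runs DBSCAN on the X and Y values from the question with the defaults and with one looser setting. The helper count_clusters and the looser values (eps=3.0, min_samples=3) are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# X and Y values copied from the question
X = [-1.0, -1.0, 0.5, 0.0, 0.0, 2.0, 3.0, 5.0, 0.0, -2.5, 2.0, 8.0, -10.5, 15.0,
     -20.0, -32.0, -20.0, -20.0, -10.0, 20.5, 0.0, 20.0, -30.0, -15.0, 20.0, -15.0, -10.0]
Y = [0.0, 1.0, -0.5, 0.5, -0.5, 0.0, 1.0, 4.0, 5.0, -3.5, -2.0, -8.0, -0.5, -10.5,
     -20.5, 0.0, 16.0, -15.0, 5.0, 13.5, 20.0, -20.0, 2.0, -17.5, -15, 19.0, 20.0]
points = np.column_stack([X, Y])


def count_clusters(eps, min_samples):
    """Return (number of clusters, number of outliers) for one DBSCAN setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    n_clusters = len(set(labels) - {-1})
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise


default_result = count_clusters(eps=0.5, min_samples=5)
looser_result = count_clusters(eps=3.0, min_samples=3)
print("defaults:", default_result)
print("looser:  ", looser_result)
```

Comparing the two tuples shows how strongly the result depends on eps relative to the scale of the coordinates.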
Do you mean density estimation? You can model your data as a Gaussian mixture and then get the probability that a point belongs to each component. You can use sklearn.mixture.GaussianMixture for that. By changing the number of components, you control how many clusters you get. The metric to cluster on is the Euclidean distance from the reference point, so the GMM will predict which cluster each data point should be assigned to.
Since your metric is 1-D, you will get a set of 1-D Gaussian distributions, i.e. a set of means and variances. So you can easily calculate the probability of any point being in a certain cluster: compute how far it is from the reference point, plug that value into the normal pdf of each component, weight by the component weights, and normalize.
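Here is a minimal sketch of that calculation, checked against predict_proba. The distances are synthetic (two made-up groups), so only the mechanics carry over to your data.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D distances from a reference point: a near group and a far group
dist = np.concatenate(
    [rng.normal(2.0, 0.5, 50), rng.normal(20.0, 2.0, 50)]
).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(dist)

# One mean and one standard deviation per component
means = gmm.means_.ravel()
stds = np.sqrt(gmm.covariances_.ravel())

# Weighted normal pdf per component, then normalise each row to sum to 1
weighted = gmm.weights_ * norm.pdf(dist, loc=means, scale=stds)
manual = weighted / weighted.sum(axis=1, keepdims=True)

print(np.allclose(manual, gmm.predict_proba(dist), atol=1e-6))
```

Each row of manual holds the membership probabilities of one point, which is the "soft" assignment that k-means cannot give you.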
To make the image clearer, I'm changing the reference point to (-5, 5) and setting the number of clusters to 4. To choose the number of clusters, use some metric that minimizes the total variance while penalizing growth in the number of mixture components, for example argmin(model.covariances_.sum()*num_clusters).
import pandas as pd
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
})
ref_X, ref_Y = -5, 5
dist = np.sqrt((df.X-ref_X)**2 + (df.Y-ref_Y)**2)
n_mix = 4
gmm = GaussianMixture(n_mix)
model = gmm.fit(dist.values.reshape(-1,1))
x = np.linspace(-35., 35.)
y = np.linspace(-30., 30.)
X, Y = np.meshgrid(x, y)
XX = np.sqrt((X.ravel() - ref_X)**2 + (Y.ravel() - ref_Y)**2)
Z = model.score_samples(XX.reshape(-1,1))
Z = Z.reshape(X.shape)
# plot grid points probabilities
plt.set_cmap('plasma')
plt.contourf(X, Y, Z, 40)
plt.scatter(df.X, df.Y, c=model.predict(dist.values.reshape(-1,1)), edgecolor='black')
You can read more here and here
P.S. score_samples() returns log likelihoods, use exp() to convert to probability
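The selection heuristic mentioned above (total variance multiplied by the number of components) could be sketched as a simple loop. This is an ad hoc score taken from the text, not a standard criterion, and the candidate range 1 to 5 is an arbitrary assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X and Y values from the question, reference point as in the code above
X = np.array([-1.0, -1.0, 0.5, 0.0, 0.0, 2.0, 3.0, 5.0, 0.0, -2.5, 2.0, 8.0, -10.5, 15.0,
              -20.0, -32.0, -20.0, -20.0, -10.0, 20.5, 0.0, 20.0, -30.0, -15.0, 20.0, -15.0, -10.0])
Y = np.array([0.0, 1.0, -0.5, 0.5, -0.5, 0.0, 1.0, 4.0, 5.0, -3.5, -2.0, -8.0, -0.5, -10.5,
              -20.5, 0.0, 16.0, -15.0, 5.0, 13.5, 20.0, -20.0, 2.0, -17.5, -15, 19.0, 20.0])
ref_x, ref_y = -5, 5
dist = np.sqrt((X - ref_x) ** 2 + (Y - ref_y) ** 2).reshape(-1, 1)

# Score each candidate number of components: total variance times k
scores = {}
for k in range(1, 6):
    model = GaussianMixture(n_components=k, random_state=0).fit(dist)
    scores[k] = model.covariances_.sum() * k

best_k = min(scores, key=scores.get)
print(scores)
print("selected number of components:", best_k)
```

With only 27 points the score can be noisy, so it is worth looking at the whole scores dictionary rather than only at best_k.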
Taking your centre point of 0,0 we can calculate the Euclidean distance from this point to all points in your df.
df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)
If you have a centre point other than zero it would be:
df['distance'] = np.sqrt((centre_point_x - df['X'])**2 + (centre_point_y - df['Y'])**2)
Using your data and chart as before, we can plot this and see the distance metric increasing as we move away from the centre.
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['distance'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
K-means
We can now use this distance data to calculate K-means clusters as you did before, but this time passing the distance data together with an array of zeros (zeros because KMeans requires a 2-D array, but we only want to split the 1-D array of distance data, so the zeros act as 'filler').
model = KMeans(n_clusters = 2) #choose how many clusters
# create this 2d array for the KMeans model
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T) # transposed array because the above code produces
# data with 27 columns and 2 rows but we want it the other way round
df['group'] = model.labels_ # put the labels into the dataframe
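As a side note, the zero 'filler' column is not strictly needed: KMeans only requires a 2-D array, and a single column satisfies that. A minimal sketch with made-up distances:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up distances: three points close to the centre, three far away
distance = np.array([0.5, 1.0, 1.5, 20.0, 21.0, 22.0])

# reshape(-1, 1) turns the 1-D array into a single-column 2-D array
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(distance.reshape(-1, 1))
labels = model.labels_
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the near points share one label and the far points share the other.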
Then we can plot the results
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['group'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
With three clusters we get the following result:
Other clustering methods
Check out SKlearn's clustering page for more options. I experimented with DBSCAN with some good results but it depends on what you are trying to achieve exactly. Check out the table underneath their example charts to see how they each compare.

Multiple different kinds of plots on a single figure and save it to a video

I am trying to plot multiple different plots on a single matplotlib figure within a for loop. At the moment it all works in MATLAB, as shown in the picture below, and I am then able to save the figure as a video frame. Here is a link to a sample video generated in MATLAB for 10 frames.
In Python, I tried it as below:
import matplotlib.pyplot as plt
import numpy as np
for frame in range(FrameStart, FrameEnd):  # loop1
    # data generation code within a for loop for n frames from source video
    array1 = np.zeros((200, 3800))
    array2 = np.zeros((19, 2))
    array3 = np.zeros((60, 60))
    for i in range(len(array2)):  # loop2
        # generate data for arrays 1 to 3 from the frame data
        pass
    # end loop2
    plt.subplot(6, 1, 1)
    plt.imshow(DataArray, cmap='gray')
    plt.subplot(6, 1, 2)
    plt.bar(data2D[:, 0], data2D[:, 1])
    plt.subplot(2, 2, 3)
    plt.contourf(mapData)
    # for the fourth plot, use array2[3] and array2[5], plot it as shown and keep this
    # plot without erasing for the next frame
I am not sure how to do the 4th axes with line plots. It needs to persist (done using hold on for this axis in MATLAB) for the entire sequence of frames processed in the for loop, while the other 3 axes need to be erased and updated with new data for each frame of the movie. The contour plot needs to be square all the time, with a color bar on the side. At the end of each frame, once all the axes are updated, the figure needs to be saved as a frame of a movie. Again, this is easily done in MATLAB, but I am not sure how to do it in Python.
Any suggestions?
Thanks.
I guess you need something like this layout.
I have used comments (#) in the code to answer your queries. Please check the snippet:
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(6,6))
ax1=fig.add_subplot(311) #3rows 1 column 1st plot
ax2=fig.add_subplot(312) #3rows 1 column 2nd plot
ax3=fig.add_subplot(325) #3rows 2 column 5th plot
ax4=fig.add_subplot(326) #3rows 2 column 6th plot
plt.show()
To turn off ticks you can use plt.axis('off'). I don't know how to interpolate your format, so I left the first axes blank. You can adjust figsize based on your requirements.
import numpy as np
from numpy import random
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(6,6)) #First is width Second is height
ax1=fig.add_subplot(311)
ax2=fig.add_subplot(312)
ax3=fig.add_subplot(325)
ax4=fig.add_subplot(326)
#Bar Plot
langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23,17,35,29,12]
ax2.bar(langs,students)
#Contour Plot
xlist = np.linspace(-3.0, 3.0, 100)
ylist = np.linspace(-3.0, 3.0, 100)
X, Y = np.meshgrid(xlist, ylist)
Z = np.sqrt(X**2 + Y**2)
cp = ax3.contourf(X, Y, Z)
fig.colorbar(cp,ax=ax3) #Add a colorbar to a plot
#Multiple line plot
x = np.linspace(-1, 1, 50)
y1 = 2*x + 1
y2 = 2**x + 1
ax4.plot(x, y2)
ax4.plot(x, y1, color='red',linewidth=1.0)
plt.tight_layout() #Make sure plots don't overlap
plt.show()
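The snippet above only lays out static axes. The per-frame part of the question (clear three axes, keep the fourth, save each frame) could be sketched roughly as follows. Everything here is an assumption for illustration: the random data, the axis names, the output path, and the choice of PillowWriter, which writes a GIF so that no external encoder is needed.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import PillowWriter

rng = np.random.default_rng(0)
fig = plt.figure(figsize=(6, 8))
ax_img = fig.add_subplot(3, 1, 1)
ax_bar = fig.add_subplot(3, 1, 2)
ax_map = fig.add_subplot(3, 2, 5)
ax_line = fig.add_subplot(3, 2, 6)  # this axes is never cleared

n_frames = 4
ax_line.set_xlim(0, n_frames - 1)
ax_line.set_ylim(0, 1)
track_x, track_y = [], []

out_path = os.path.join(tempfile.gettempdir(), "frames_demo.gif")
writer = PillowWriter(fps=5)
with writer.saving(fig, out_path, dpi=80):
    for frame in range(n_frames):
        # erase only the three axes that are redrawn every frame
        for ax in (ax_img, ax_bar, ax_map):
            ax.clear()
        ax_img.imshow(rng.random((20, 380)), cmap="gray", aspect="auto")
        ax_bar.bar(np.arange(19), rng.random(19))
        ax_map.contourf(rng.random((60, 60)))
        ax_map.set_aspect("equal")  # keep the contour plot square
        # extend the persistent line plot by one segment
        track_x.append(frame)
        track_y.append(rng.random())
        ax_line.plot(track_x[-2:], track_y[-2:], color="tab:blue")
        writer.grab_frame()  # store the current figure as one movie frame

print(os.path.getsize(out_path))
```

For an MP4 file, matplotlib.animation.FFMpegWriter can be used in the same way, provided ffmpeg is installed. A colorbar is left out of the sketch; if one is needed, it is probably easiest to create a dedicated axes once and pass it to fig.colorbar through cax on every frame, so that a new colorbar axes is not added each time.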

Multiple plots in a single window

I need to draw many such rows (for a0 .. a128) in a single window. I've looked at FacetGrid, PairGrid and elsewhere but couldn't find a way. Only regplot has a similar ax argument, but it doesn't plot histograms. My data is 128 real-valued features with a label column [0, 1]. I need the graphs to be shown from my Python code as a separate application on Linux.
Also, is there a way to scale this histogram to show relative values on Y such that the right curve is not skewed?
g = sns.FacetGrid(df, col="Result")
g.map(plt.hist, "a0", bins=20)
plt.show()
Just a simple example using matplotlib. The code is not optimized (ugly, but simple plot-indexing):
import numpy as np
import matplotlib.pyplot as plt
N = 5
data = np.random.normal(size=(N*N, 1000))
f, axarr = plt.subplots(N, N) # maybe you want sharex=True, sharey=True
pi = [0,0]
for i in range(data.shape[0]):
    if pi[1] == N:
        pi[0] += 1  # next row
        pi[1] = 0   # first column again
    # density=True normalizes each histogram; the older normed=True
    # was removed in newer Matplotlib releases
    axarr[pi[0], pi[1]].hist(data[i], density=True)
    pi[1] += 1
plt.show()
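The manual row/column bookkeeping can be avoided by iterating over the flattened axes array. This is a sketch with random stand-in data for the features; the 4 x 4 grid and the a0, a1, ... titles are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove it to open a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 4, 4
data = rng.normal(size=(n_rows * n_cols, 1000))  # one row per feature

fig, axes = plt.subplots(n_rows, n_cols, sharex=True, sharey=True, figsize=(8, 8))
for idx, (ax, values) in enumerate(zip(axes.flat, data)):
    # density=True scales each histogram so that its area is 1
    ax.hist(values, bins=20, density=True)
    ax.set_title(f"a{idx}", fontsize=8)
fig.tight_layout()
```

With the backend line removed, calling plt.show() at the end opens the grid in a regular window. Because every panel is normalized on its own, a class with fewer samples is drawn on the same vertical scale as a larger one.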