Drawing map with geopandas library for a continent, but the data points contains whole world


I have a data set of meteorite finds with latitude and longitude information: almost 30,000 data points from all around the world. I would like to plot a map of only one continent, for example South America, using the geopandas library.
I am using geopandas' default 'naturalearth_lowres' map. From that world map, I filtered South America. My data, called mod_data_geo, has a geometry column of type Point(longitude, latitude).
Data Set looks like that:
My code:
mod_data_geo = gpd.GeoDataFrame(mod_data, geometry = gpd.points_from_xy(mod_data['long'], mod_data['lat']))
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
countries = world[world['continent'] == "South America"]
axis = countries.plot(color = 'Lightblue', edgecolor = 'black', figsize=(15,15))
mod_data_geo.plot(ax=axis, markersize = 1, color = 'purple' )
Map that I plotted:
How can I filter the meteorites in the mod_data_geo dataframe, with geopandas or any other tool, so that only the meteorites found in South America are shown?
Thank you in advance!

Three ways - crop the image, filter the points with a bounding box, or filter the points by checking whether they're inside the country shapes.
Crop the image
If you'd like the points to extend to the edge of the image and only want to limit the image extent, you can set the x and y limits on the matplotlib axis:
axis = countries.plot(color = 'Lightblue', edgecolor = 'black', figsize=(15,15))
# remember the limits matplotlib chose for South America alone
orig_xlim, orig_ylim = axis.get_xlim(), axis.get_ylim()
mod_data_geo.plot(ax=axis, markersize = 1, color = 'purple' )
# restore them after the world-wide points have stretched the axes
axis.set_xlim(orig_xlim)
axis.set_ylim(orig_ylim)
Filter to a bounding box
The first approach is nice in that it retains all the data that can fit within your plot. But it's not super efficient, as matplotlib has to filter the data for you based on whether it will appear in the image. A faster approach could filter the data first; note that the data will no longer go to the edge of the image.
First find a bounding box around the countries:
In [6]: bounds = countries.bounds.agg({'minx': 'min', 'miny': 'min', 'maxx': 'max', 'maxy': 'max'})
In [7]: bounds
Out[7]:
minx -81.410943
miny -55.611830
maxx -34.729993
maxy 12.437303
dtype: float64
Then you can filter the data based on these bounds:
In [8]: mod_data_filtered = mod_data_geo[(
...: (mod_data_geo.lat >= bounds.miny)
...: & (mod_data_geo.lat <= bounds.maxy)
...: & (mod_data_geo.long >= bounds.minx)
...: & (mod_data_geo.long <= bounds.maxx)
...: )]
Now you can plot with mod_data_filtered.
Note that you could set the extent of the plot to the bounding box, though this will get a bit tight.
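As a rough sketch of the bounding-box filter with a little breathing room around the edges, the snippet below uses plain pandas. The sample points, the helper name filter_to_bounds and the pad parameter are made up for illustration; only the bounds values come from the output above.

```python
import pandas as pd

# Made-up sample points (long/lat in degrees); the real data has ~30,000 rows.
points = pd.DataFrame({
    "name": ["inside_1", "africa", "asia", "inside_2"],
    "long": [-61.7, 17.9, 61.4, -68.8],
    "lat": [-27.5, -19.6, 54.9, -24.2],
})

# Bounding box of South America, copied from the output above.
bounds = pd.Series({"minx": -81.410943, "miny": -55.611830,
                    "maxx": -34.729993, "maxy": 12.437303})


def filter_to_bounds(df, bounds, pad=0.0):
    """Return the rows whose long/lat fall inside the (optionally padded) box."""
    mask = (
        df["long"].between(bounds["minx"] - pad, bounds["maxx"] + pad)
        & df["lat"].between(bounds["miny"] - pad, bounds["maxy"] + pad)
    )
    return df[mask]


# pad is in degrees; 2.0 is an arbitrary choice for this sketch.
south_american = filter_to_bounds(points, bounds, pad=2.0)
print(south_american["name"].tolist())
```

The same padded limits could also be passed to set_xlim and set_ylim so the plot is not cropped tightly against the coastline.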
Filter with country shapes
If you'd like to filter the data to being within one of the countries rather than just cropping the data to the bounding box, you could use geopandas.GeoSeries.contains.
First, dissolve the data to get a single shape for South America:
In [8]: south_america = countries.dissolve()
In [9]: south_america
Out[9]:
geometry pop_est continent name iso_a3 gdp_md_est
0 MULTIPOLYGON (((-57.75000 -51.55000, -58.05000... 44293293 South America Argentina ARG 879400.0
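Then, filter the points to those within the shape. When given another GeoSeries, contains compares element-wise after aligning on the index, so test every point against the single dissolved geometry instead:
In [10]: mod_data_filtered = mod_data_geo[mod_data_geo.within(south_america.geometry.iloc[0])]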
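To illustrate the point-in-shape filter without needing the meteorite data, here is a minimal sketch on made-up geometry. The square region and the four points are assumptions standing in for the dissolved South America shape and mod_data_geo.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# A made-up square "continent" standing in for the dissolved South America shape.
region = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])

# Made-up points: two inside the square, two outside it.
points = gpd.GeoDataFrame(
    {"name": ["a", "b", "c", "d"]},
    geometry=[Point(1, 1), Point(5, 9), Point(11, 5), Point(-3, -3)],
    crs="EPSG:4326",
)

# within() with a single shapely geometry tests every row against that one shape.
inside = points[points.within(region)]
print(inside["name"].tolist())
```

For tens of thousands of points it may be worth applying the bounding-box filter first, so the more expensive polygon test only runs on nearby candidates.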

Related

Altering size of points on map in R (geom_sf) to reflect categorical data

I am creating a map to depict density of datapoints at different locations. At some locations, there is a high density of data available, and at others, there is a low density of data available. I would like to present the map with each data point shown but with each point a certain size to represent the density.
In my data table I have the location, and each location is assigned 'A', 'B', or 'C' to depict 'Low', 'Medium', and 'High' density. When plotting using geom_sf, I am able to get the points on the map, but I would like each category to be represented by a different size circle. I.e. 'Low density' locations with a small circle, and 'High density' locations with a larger circle.
I have been approaching the aesthetics of this map in the same way I would approach it as if it were a normal ggplot situation, but have not had any luck. I feel like I must be missing something obvious related to the fact that I am using geom_sf(), so any advice would be appreciated!
Using a very simple code:
ggplot() +
geom_sf(data = stc_land, color = "grey40", fill = "grey80") +
geom_sf(data = stcdens, aes(shape = Density)) +
theme_classic()
I know that the aes() call should go in with the 'stcdens' data, and I got close with the 'shape = Density', but I am not sure how to move forward with assigning what shapes I want to each category.

You probably want to swap shape = Density for size = Density; then the plot should behave itself (and yes, it is a standard ggplot behavior, nothing sf specific :)
As your code is not exactly reproducible allow me to use my favorite example of 3 cities in NC:
library(sf)
library(ggplot2)
library(dplyr) # provides %>% and mutate()
shape <- st_read(system.file("shape/nc.shp", package="sf")) # included with sf package
cities <- data.frame(name = c("Raleigh", "Greensboro", "Wilmington"),
x = c(-78.633333, -79.819444, -77.912222),
y = c(35.766667, 36.08, 34.223333),
population = c("high", "medium","low")) %>%
st_as_sf(coords = c("x", "y"), crs = 4326) %>%
dplyr::mutate(population = ordered(population,
levels = c("low", "medium", "high")))
ggplot() +
geom_sf(data = shape, fill = NA) +
geom_sf(data = cities, aes(size = population))
Note that I turned the population from a character variable to ordered factor, where high > medium > low (so that the circles follow the expected order).

Plotting polygons with Folium and Geopandas don't work

I have tried to plot polygons on a map with Geopandas and Folium, following the official Geopandas tutorial and this dataset. I followed the tutorial as literally as I could, but Folium still doesn't draw the polygons. The Matplotlib map works and I can create a Folium map too. Code:
import pandas as pd
import geopandas as gdp
import folium
import matplotlib.pyplot as plt
df = pd.read_csv('https://geo.stat.fi/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=postialue:pno_tilasto&outputFormat=csv')
df.to_csv('coordinates.csv')
#limit to Helsinki and drop unnecessary columns
df['population_2019'] = df['he_vakiy']
df['zipcode'] = df['postinumeroalue'].astype(int)
df['population_2019'] = df['population_2019'].astype(int)
df = df[df['zipcode'] < 1000]
df = df[['zipcode', 'nimi', 'geom', 'population_2019']]
df.to_csv('coordinates_hki.csv')
df.head()
#this is from there: https://gis.stackexchange.com/questions/387225/set-geometry-in-geodataframe-to-another-column-fails-typeerror-input-must-be
from shapely.wkt import loads
df = gdp.read_file('coordinates_hki.csv')
df.geometry = df['geom'].apply(loads)
df.plot(figsize=(6, 6))
plt.show()
df = df.set_crs(epsg=4326)
print(df.crs)
df.plot(figsize=(6, 6))
plt.show()
m = folium.Map(location=[60.1674881,24.9427473], zoom_start=10, tiles='CartoDB positron')
m
for _, r in df.iterrows():
    # Without simplifying the representation of each borough,
    # the map might not be displayed
    sim_geo = gdp.GeoSeries(r['geometry']).simplify(tolerance=0.00001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['nimi']).add_to(geo_j)
    geo_j.add_to(m)  # attach the polygon layer to the map
m

The trick here is to realize that your data is not in units of degrees. You can determine this by looking at the centroid of your polygons:
>>> print(df.geometry.centroid)
0 POINT (381147.564 6673464.230)
1 POINT (381878.124 6676471.194)
2 POINT (381245.290 6677483.758)
3 POINT (381050.952 6678206.603)
4 POINT (382129.741 6677505.464)
...
79 POINT (397465.125 6676003.926)
80 POINT (393716.203 6675794.166)
81 POINT (393436.954 6679515.888)
82 POINT (395196.736 6677776.331)
83 POINT (398338.591 6675428.040)
Length: 84, dtype: geometry
These values are far outside the valid range for coordinates in degrees, which is -180 to 180 for longitude and -90 to 90 for latitude. The next step is to figure out what CRS the data is actually in. If you take your dataset URL and strip off the &outputFormat=csv part, you get this URL:
https://geo.stat.fi/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=postialue:pno_tilasto
Search for CRS in that document, and you'll find this:
<gml:Envelope srsName="urn:ogc:def:crs:EPSG::3067" srsDimension="2">
So it turns out your data is in EPSG:3067, a projected coordinate system used for Finland, with coordinates in metres.
You need to tell geopandas about this and convert to WGS84 (latitude/longitude in degrees, which is what folium expects).
df.geometry = df['geom'].apply(loads)
df = df.set_crs('EPSG:3067')
df = df.to_crs('WGS84')
The function set_crs() declares which coordinate system GeoPandas should assume the data is in, but does not change any of the coordinates. The function to_crs() takes the points in the dataset and re-projects them into a new coordinate system. The effect of these two calls is to convert from EPSG:3067 to WGS84.
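The difference between the two calls can be seen on a single point. This is a small sketch, assuming a made-up EPSG:3067 coordinate of the same magnitude as the centroids printed above; it is not taken from the dataset itself.

```python
import geopandas as gpd
from shapely.geometry import Point

# A made-up point in EPSG:3067 metres, similar in magnitude to the centroids above.
pts = gpd.GeoSeries([Point(385000, 6672000)])

# set_crs only attaches metadata: the numbers stay exactly the same.
labelled = pts.set_crs("EPSG:3067")
same_x = labelled.iloc[0].x

# to_crs re-projects: the numbers become longitude/latitude in degrees.
reprojected = labelled.to_crs("EPSG:4326")
lon, lat = reprojected.iloc[0].x, reprojected.iloc[0].y
print(same_x, round(lon, 2), round(lat, 2))
```

After re-projection, the values should land in the longitude/latitude range of southern Finland, which is what the folium map centred on Helsinki needs.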
By adding these two lines, I get the following result:

Cluster groups continuously instead of discrete - python

I'm trying to cluster a group of points in a probabilistic manner. Using below, I have a single set of xy points, which are recorded in X and Y. I want to cluster into groups using a reference point, which is displayed in X2 and Y2.
With the help of an answer, the current approach is to measure the distance from the reference point and group using k-means. Although it provides a way to cluster using the reference point, the hard cutoff and the fixed number of k clusters make it somewhat unsuitable when dealing with numerous datasets. For instance, the number of clusters needed for this example is probably 3, but a separate example may differ, and I'd have to manually go through and alter k every time.
Given the non-probabilistic nature of k-means a separate option could be GMM. Is it possible to account for the reference point when modelling? If I attach the output below the underlying model isn't clustering as I'm hoping for.
If I look at the probability each point is within a group it's not clustered as I'd hoped. With this I run into the same problem with manually altering the amount of components. Because the points are distributed randomly, using “AIC” or “BIC” to select the appropriate number of clusters doesn't work. There is no optimal number.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.cluster import KMeans
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
})
k-means:
# distance of each point from the reference point (X2, Y2)
df['distance'] = np.sqrt((df['X'] - df['X2'])**2 + (df['Y'] - df['Y2'])**2)
model = KMeans(n_clusters = 2)
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T)
df['group'] = model.labels_
plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)
GMM:
Y_sklearn = df[['X','Y']].values
gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis');
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', edgecolor = 'k', s = 5, zorder = 10)
proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)

In my opinion, if you want to define clusters as "regions where points are close to each other", you should use DBSCAN.
This clustering algorithm finds clusters by looking at regions where points are close to each other (i.e. dense regions), and are separated from other clusters by regions where points are less dense.
This algorithm can categorize points as noise (outliers). Outliers are labelled -1.
They are points that do not belong to any cluster.
Here is some code to perform DBSCAN clustering, and to insert the cluster labels as a new categorical column in the original Y_sklearn DataFrame. It also prints how many clusters and how many outliers are found.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
Y_sklearn = df.loc[:, ["X", "Y"]].copy()
n_points = Y_sklearn.shape[0]
dbs = DBSCAN()
labels_clusters = dbs.fit_predict(Y_sklearn)
#Number of found clusters (outliers are not considered a cluster).
n_clusters = labels_clusters.max() + 1
print(f"DBSCAN found {n_clusters} clusters in dataset with {n_points} points.")
#Number of found outliers (possibly no outliers found).
n_outliers = np.count_nonzero((labels_clusters == -1))
if n_outliers:
    print(f"{n_outliers} outliers were found.\n")
else:
    print(f"No outliers were found.\n")
#Add cluster labels as a new column to original DataFrame.
Y_sklearn["cluster"] = labels_clusters
#Setting `cluster` column to Categorical dtype makes seaborn function properly treat
#cluster labels as categorical, and not numerical.
Y_sklearn["cluster"] = Y_sklearn["cluster"].astype("category")
If you want to plot the results, I suggest you use Seaborn. Here is some code to plot the points of Y_sklearn DataFrame, and color them by the cluster they belong to. I also define a new color palette, which is just the default Seaborn color palette, but where outliers (with label -1) will be in black.
import matplotlib.pyplot as plt
import seaborn as sns
name_palette = "tab10"
palette = sns.color_palette(name_palette)
if n_outliers:
    color_outliers = "black"
    palette.insert(0, color_outliers)
else:
    pass
sns.set_palette(palette)
fig, ax = plt.subplots()
sns.scatterplot(data=Y_sklearn,
x="X",
y="Y",
hue="cluster",
ax=ax,
)
Using default hyperparameters, the DBSCAN algorithm finds no cluster in the data you provided: all points are considered outliers, because no point has enough neighbours within the default radius. Is that your whole dataset, or is it just a sample? If it is a sample, the whole dataset will have many more points, and DBSCAN is more likely to find some high-density regions.
Or you can try tweaking the hyperparameters, min_samples and eps in particular. If you want to "force" the algorithm to find more clusters, you can decrease min_samples (default is 5) or increase eps (default is 0.5). The optimal hyperparameter values depend on the specific dataset, and eps in particular is measured in the units of your data, so it usually needs tuning to the data's scale. If the algorithm still considers all points to be outliers over a reasonable range of values, that suggests there are no "natural" clusters.
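As a sketch of what tweaking eps and min_samples looks like on the question's points, the snippet below runs DBSCAN with the defaults and with two looser settings. The helper name summarize and the alternative parameter values are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The X/Y points from the question.
X = np.array([-1.0, -1.0, 0.5, 0.0, 0.0, 2.0, 3.0, 5.0, 0.0, -2.5, 2.0, 8.0, -10.5, 15.0,
              -20.0, -32.0, -20.0, -20.0, -10.0, 20.5, 0.0, 20.0, -30.0, -15.0, 20.0, -15.0, -10.0])
Y = np.array([0.0, 1.0, -0.5, 0.5, -0.5, 0.0, 1.0, 4.0, 5.0, -3.5, -2.0, -8.0, -0.5, -10.5,
              -20.5, 0.0, 16.0, -15.0, 5.0, 13.5, 20.0, -20.0, 2.0, -17.5, -15.0, 19.0, 20.0])
pts = np.column_stack([X, Y])


def summarize(eps, min_samples):
    """Run DBSCAN and return (number of clusters, number of outliers)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    n_clusters = int(labels.max()) + 1      # labels run 0..k-1, outliers are -1
    n_outliers = int(np.sum(labels == -1))
    return n_clusters, n_outliers


# First entry is the default setting; the other two are arbitrary looser settings.
for eps, min_samples in [(0.5, 5), (3.0, 3), (8.0, 3)]:
    n_clusters, n_outliers = summarize(eps, min_samples)
    print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters, {n_outliers} outliers")
```

Because eps is a distance in data units, a sensible range to try depends on how far apart neighbouring points typically are.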

Do you mean density estimation? You can model your data as a Gaussian mixture and then get the probability that a point belongs to each component. You can use sklearn.mixture.GaussianMixture for that. By changing the number of components you control how many clusters you get. The metric to cluster on is the Euclidean distance from the reference point, so the GMM will predict which cluster each data point should be assigned to.
Since your metric is 1-D, you will get a set of Gaussian distributions, i.e. a set of means and variances. So you can easily calculate the probability of any point being in a certain cluster: compute how far it is from the reference point and plug that value into the normal distribution pdf formula.
To make the image clearer, I'm changing the reference point to (-5, 5) and selecting 4 clusters. To pick the number of clusters, use some metric that minimizes total variance and penalizes growth in the number of components, for example argmin(model.covariances_.sum()*num_clusters).
import pandas as pd
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
})
ref_X, ref_Y = -5, 5
dist = np.sqrt((df.X-ref_X)**2 + (df.Y-ref_Y)**2)
n_mix = 4
gmm = GaussianMixture(n_mix)
model = gmm.fit(dist.values.reshape(-1,1))
x = np.linspace(-35., 35.)
y = np.linspace(-30., 30.)
X, Y = np.meshgrid(x, y)
XX = np.sqrt((X.ravel() - ref_X)**2 + (Y.ravel() - ref_Y)**2)
Z = model.score_samples(XX.reshape(-1,1))
Z = Z.reshape(X.shape)
# plot grid points probabilities
plt.set_cmap('plasma')
plt.contourf(X, Y, Z, 40)
plt.scatter(df.X, df.Y, c=model.predict(dist.values.reshape(-1,1)), edgecolor='black')
You can read more here and here
P.S. score_samples() returns log-likelihoods; use exp() to convert them to probability densities.
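The per-cluster probabilities mentioned above can also be read directly from the fitted model. This is a minimal sketch using the same data and reference point; random_state=0 is an assumption added only to make the run repeatable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# The X/Y points from the question.
X = np.array([-1.0, -1.0, 0.5, 0.0, 0.0, 2.0, 3.0, 5.0, 0.0, -2.5, 2.0, 8.0, -10.5, 15.0,
              -20.0, -32.0, -20.0, -20.0, -10.0, 20.5, 0.0, 20.0, -30.0, -15.0, 20.0, -15.0, -10.0])
Y = np.array([0.0, 1.0, -0.5, 0.5, -0.5, 0.0, 1.0, 4.0, 5.0, -3.5, -2.0, -8.0, -0.5, -10.5,
              -20.5, 0.0, 16.0, -15.0, 5.0, 13.5, 20.0, -20.0, 2.0, -17.5, -15.0, 19.0, 20.0])

ref_x, ref_y = -5.0, 5.0
# One feature per sample: the distance from the reference point.
dist = np.sqrt((X - ref_x) ** 2 + (Y - ref_y) ** 2).reshape(-1, 1)

gmm = GaussianMixture(n_components=4, random_state=0).fit(dist)

# Soft membership: one row per point, one column per component; each row sums to 1.
membership = gmm.predict_proba(dist)

# score_samples gives log-densities; exponentiate to get the mixture density.
density = np.exp(gmm.score_samples(dist))

print(membership.round(2)[:5])
```

Each row of membership gives a continuous degree of belonging to every group, rather than a single hard label.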

Taking your centre point of 0,0 we can calculate the Euclidean distance from this point to all points in your df.
df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)
If you have a centre point other than zero it would be:
df['distance'] = np.sqrt((centre_point_x - df['X'])**2 + (centre_point_y - df['Y'])**2)
Using your data and chart as before, we can plot this and see the distance metric increasing as we move away from the centre.
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['distance'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
K-means
We can now use this distance data to calculate k-means clusters as you did before, but this time using the distance data and an array of zeros (zeros because KMeans requires a 2-D array, but we only want to split the 1-D array of distance data, so the zeros act as 'filler').
model = KMeans(n_clusters = 2) #choose how many clusters
# create this 2d array for the KMeans model
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T) # transformed array because the above code produces
# data with 27 columns and 2 rows but we want it the other way round
df['group'] = model.labels_ # put the labels into the dataframe
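An alternative to the zero 'filler' column, sketched below, is to reshape the distances into a single-feature column. Since a constant column adds nothing to the Euclidean distances, the grouping should be equivalent up to label numbering. The n_init and random_state values are assumptions added for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans

# The X/Y points from the question, with the reference point at (0, 0).
X = np.array([-1.0, -1.0, 0.5, 0.0, 0.0, 2.0, 3.0, 5.0, 0.0, -2.5, 2.0, 8.0, -10.5, 15.0,
              -20.0, -32.0, -20.0, -20.0, -10.0, 20.5, 0.0, 20.0, -30.0, -15.0, 20.0, -15.0, -10.0])
Y = np.array([0.0, 1.0, -0.5, 0.5, -0.5, 0.0, 1.0, 4.0, 5.0, -3.5, -2.0, -8.0, -0.5, -10.5,
              -20.5, 0.0, 16.0, -15.0, 5.0, 13.5, 20.0, -20.0, 2.0, -17.5, -15.0, 19.0, 20.0])
distance = np.sqrt(X ** 2 + Y ** 2)

# reshape(-1, 1) turns the 1-D distances into a (27, 1) array: 27 samples, 1 feature.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(distance.reshape(-1, 1))
labels = model.labels_
print(labels)
```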
Then we can plot the results
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['group'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
With three clusters we get the following result:
Other clustering methods
Check out SKlearn's clustering page for more options. I experimented with DBSCAN with some good results but it depends on what you are trying to achieve exactly. Check out the table underneath their example charts to see how they each compare.

matplotlib: histogram is not displaying correctly

I have extracted the data I need to analyze from a csv file and made it into a DataFrame. I then group the rows by the type of region they belong to, "Reg".
datafileR = datafile = pd.read_csv("pixel_data.csv")
datafileR = pd.DataFrame(datafileR)
### Counting the number of each rows based on the "Reg":
datafileR["Reg"].value_counts()
This is the result I received:
Make a group called region based on the Reg column from dataframe: datafileR:
region = datafileR.groupby(["Reg"])
Now plot them in histogram:
sns.set_theme()
plt.hist(datafileR["Reg"].value_counts(), bins=[70,100,130,160,190],color=["grey"],
histtype='bar', align='mid', orientation='vertical', rwidth=0.85)
This is the image I received, but there should be five categories (Middle East and North Africa, Africa (excl MENA), Asia and Pacific, Europe and Eurasia, and Cross-regional) on the x-axis. I am not sure what went wrong. Also, how do I change the values on the y-axis so that they display the actual counts?

You are trying to draw a bar plot, not a histogram. Please refer to https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html?highlight=bar#matplotlib.pyplot.bar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
datafileR = pd.DataFrame({'reg': np.random.choice(['Asia','Africa','Europe'], size=1000)})
df = datafileR['reg'].value_counts()
plt.bar(x=df.index, height=df.values)
You can also use pandas' plotting functions:
df.plot.bar()
plt.tight_layout()
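To see why plt.hist produced an unexpected picture, it may help to look at what a histogram computes from value_counts(). The sketch below uses made-up counts for five placeholder regions, since the real numbers were only shown in an image.

```python
import numpy as np
import pandas as pd

# Made-up per-region counts; the category names are placeholders.
counts = pd.Series({"Region A": 180, "Region B": 150, "Region C": 120,
                    "Region D": 95, "Region E": 75})

# A histogram bins the count values themselves: the result says how many
# regions have a count inside each interval, and the category names are lost.
hist, edges = np.histogram(counts, bins=[70, 100, 130, 160, 190])
print(hist)

# A bar chart keeps one bar per category, with the count as its height.
labels, heights = counts.index.tolist(), counts.to_numpy()
print(list(zip(labels, heights)))
```

With a bar chart, the y-axis shows the actual number of rows per region, which is what the question asks for.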

Annotation box does not appear in matplotlib

The planned annotation box does not appear on my plot, however, I've tried a wide range of values for its coordinates.
What's wrong with that?!
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
def f(s,t):
    a = 0.7
    b = 0.8
    Iext= 0.5
    tau = 12.5
    v = s[0]
    w = s[1]
    dndt = v - np.power(v,3)/3 - w + Iext
    dwdt = (v + a - b * w)/tau
    return [dndt, dwdt]
t = np.linspace(0,200)
s0=[1,1]
s = odeint(f,s0,t)
plt.plot(t,s[:,0],'b-', linewidth=1.0)
plt.xlabel(r"$t(sec.)$")
plt.ylabel(r"$V (volt)$")
plt.legend([r"$V$"])
annotation_string = r"$I_{ext}=0.5$"
plt.text(15, 60, annotation_string, bbox=dict(facecolor='red', alpha=0.5))
plt.show()

The coordinates to plt.text are data coordinates by default. This means in order to be present in the plot they should not exceed the data limits of your plot (here, ~0..200 in x direction, ~-2..2 in y direction).
Something like plt.text(10, 1.8, annotation_string) should work.
The problem with that is that once the data limits change (because you plot something different or add another plot) the text item will be at a different position inside the canvas.
If this is undesired, you can specify the text in axes coordinates (ranging from 0 to 1 in both directions). To always place the text in the top left corner of the axes, independent of what you plot there, you can use e.g.
plt.text(0.03,0.97, annotation_string, bbox=dict(facecolor='red', alpha=0.5),
transform=plt.gca().transAxes, va = "top", ha="left")
Here the transform keyword tells the text to use Axes coordinates, and va = "top", ha="left" means, that the top left corner of the text should be the anchor point.
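A quick way to check whether a data-coordinate position will be visible is to compare it with the current axis limits. This is a small sketch with a stand-in curve of roughly the same range; the helper name in_view is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
t = np.linspace(0, 200, 50)
ax.plot(t, 2 * np.sin(t / 10))  # stand-in curve with roughly the same y range


def in_view(ax, x, y):
    """True if the data-coordinate point (x, y) lies inside the current axis limits."""
    (x0, x1), (y0, y1) = ax.get_xlim(), ax.get_ylim()
    return x0 <= x <= x1 and y0 <= y <= y1


print(in_view(ax, 15, 60))   # the position used in the question
print(in_view(ax, 5, 1.5))   # a position inside the plotted range

# Axes coordinates do not depend on the data limits at all.
ax.text(0.03, 0.97, r"$I_{ext}=0.5$", transform=ax.transAxes, va="top", ha="left")
```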

The annotation is appearing far above your plot because you have given a 'y' coordinate of 60, whereas your plot ends at '2' (upwards).
Change the second argument here:
plt.text(15, 60, annotation_string, bbox=dict(facecolor='red', alpha=0.5))
It needs to be <=2 to show up on the plot itself. You may also want to change the x coordinate (from 15 to something less), so that it doesn't obscure your lines.
e.g.
plt.text(5, 1.5, annotation_string, bbox=dict(facecolor='red', alpha=0.5))
Don't be alarmed by my (5, 1.5) suggestion; I would then add the following line near the top of your script (beneath your imports):
plt.rcParams['legend.loc'] = 'best'
This will choose a 'best fit' for your legend; in this case, top left (just above your annotation). Both look quite neat then, your choice though :)
