Using geopandas and matplotlib I have plotted a map of India showing the Air Quality Index.
The link to my data is:
https://drive.google.com/file/d/1-xihM-LCB6dNfONbK28CJWOP_PVgXA8C/view?usp=share_link
How can I plot an interactive map with the names of the cities and the borders of India's regions using plotly?
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib import cm, colors
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# restrict the base map to India
ax = world[world.name == 'India'].plot(color='grey', edgecolor='white')
city_day_gdf.plot(column='AQI_Bucket', ax=ax, cmap='PuBuGn', markersize=city_day_gdf['AQI'])
norm = colors.Normalize(city_day_gdf.AQI.min(), city_day_gdf.AQI.max())
plt.colorbar(cm.ScalarMappable(norm=norm,cmap='PuBuGn'), ax=ax)
plt.title("A Map showing the descriptions of Air Quality Index in terms of AQI magnitude across India between 2015 and 2020")
plt.show()
You can scatter your data using px.scatter_mapbox(). I have reduced the data to just the newest reading per city, since you have not stated how you want to treat the time series.
To plot the regions of India you need some geometry. I have used GeoJSON from a repo on GitHub for this, and simplified the geometry because it is very detailed and would otherwise slow plotly down significantly.
Finally, pull it all together by layering the scatter on top of the choropleth.
import pandas as pd
import geopandas as gpd
import shapely.wkt
import plotly.express as px
import requests
url = "https://drive.google.com/file/d/1-xihM-LCB6dNfONbK28CJWOP_PVgXA8C/view?usp=share_link"
df = pd.read_csv("https://drive.google.com/uc?id=" + url.split("/")[-2], index_col=0)
df["Date"] = pd.to_datetime(df["Date"])
# just reduce the data to the last date for each city. Nothing in the question
# indicates how to deal with the time series
df = df.sort_values(["City", "Date"]).groupby("City", as_index=False).last()
# get some geometry for regions of india
gdf_region = gpd.read_file(
"https://raw.githubusercontent.com/Subhash9325/GeoJson-Data-of-Indian-States/master/Indian_States"
)
gdf_region["geometry"] = (
gdf_region.to_crs(gdf_region.estimate_utm_crs())
.simplify(5000)
.to_crs(gdf_region.crs)
)
fig = px.choropleth_mapbox(
gdf_region,
geojson=gdf_region.__geo_interface__,
locations=gdf_region.index,
color="NAME_1",
).update_traces(showlegend=False)
fig.add_traces(
px.scatter_mapbox(
df, lat="Lat", lon="Lon", color="AQI_Bucket", hover_data=["City"]
).data
)
fig.update_layout(
mapbox=dict(
style="carto-positron",
zoom=3,
center=dict(lat=df["Lat"].mean(), lon=df["Lon"].mean()),
)
)
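A small offline sketch of the two data-preparation steps used above: turning the Google Drive share link into a direct-download URL, and keeping only the newest row per city. The toy DataFrame and its values are made up for illustration; only the column names City, Date and AQI follow the code above.

```python
import pandas as pd

# Share link from the question; the file ID is the second-to-last path segment
share_url = "https://drive.google.com/file/d/1-xihM-LCB6dNfONbK28CJWOP_PVgXA8C/view?usp=share_link"
file_id = share_url.split("/")[-2]
direct_url = "https://drive.google.com/uc?id=" + file_id

# Toy readings (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame(
    {
        "City": ["Delhi", "Mumbai", "Delhi"],
        "Date": pd.to_datetime(["2020-06-01", "2020-03-01", "2020-01-01"]),
        "AQI": [150, 90, 300],
    }
)

# Sort so the newest date comes last within each city, then keep that last row
latest = toy.sort_values(["City", "Date"]).groupby("City", as_index=False).last()
print(direct_url)
print(latest)
```

If the full time series matters, the same frame could instead be passed to plotly with an animation or facet column; that choice is left open here because the question does not specify it.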
I have tried to plot polygons on a map with Geopandas and Folium using the official Geopandas tutorial and this dataset. I tried to follow the tutorial as literally as I could, but Folium still doesn't draw the polygons. The Matplotlib map works and I can create the Folium map too. Code:
import pandas as pd
import geopandas as gdp
import folium
import matplotlib.pyplot as plt
df = pd.read_csv('https://geo.stat.fi/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=postialue:pno_tilasto&outputFormat=csv')
df.to_csv('coordinates.csv')
#limit to Helsinki and drop unnecessary columns
df['population_2019'] = df['he_vakiy']
df['zipcode'] = df['postinumeroalue'].astype(int)
df['population_2019'] = df['population_2019'].astype(int)
df = df[df['zipcode'] < 1000]
df = df[['zipcode', 'nimi', 'geom', 'population_2019']]
df.to_csv('coordinates_hki.csv')
df.head()
# this is from: https://gis.stackexchange.com/questions/387225/set-geometry-in-geodataframe-to-another-column-fails-typeerror-input-must-be
from shapely.wkt import loads
df = gdp.read_file('coordinates_hki.csv')
df.geometry = df['geom'].apply(loads)
df.plot(figsize=(6, 6))
plt.show()
df = df.set_crs(epsg=4326)
print(df.crs)
df.plot(figsize=(6, 6))
plt.show()
m = folium.Map(location=[60.1674881,24.9427473], zoom_start=10, tiles='CartoDB positron')
m
for _, r in df.iterrows():
    # Without simplifying the representation of each borough,
    # the map might not be displayed
    sim_geo = gdp.GeoSeries(r['geometry']).simplify(tolerance=0.00001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['nimi']).add_to(geo_j)
    geo_j.add_to(folium.Popup(r['nimi']))
m
The trick here is to realize that your data is not in units of degrees. You can determine this by looking at the centroid of your polygons:
>>> print(df.geometry.centroid)
0 POINT (381147.564 6673464.230)
1 POINT (381878.124 6676471.194)
2 POINT (381245.290 6677483.758)
3 POINT (381050.952 6678206.603)
4 POINT (382129.741 6677505.464)
...
79 POINT (397465.125 6676003.926)
80 POINT (393716.203 6675794.166)
81 POINT (393436.954 6679515.888)
82 POINT (395196.736 6677776.331)
83 POINT (398338.591 6675428.040)
Length: 84, dtype: geometry
These values are far outside the normal range for coordinates in degrees, which is -180 to 180 for longitude and -90 to 90 for latitude. The next step is to figure out which CRS the data is actually in. If you take your dataset URL and strip off the &outputFormat=csv part, you get this URL:
https://geo.stat.fi/geoserver/wfs?service=WFS&version=2.0.0&request=GetFeature&typeName=postialue:pno_tilasto
Search for CRS in that document, and you'll find this:
<gml:Envelope srsName="urn:ogc:def:crs:EPSG::3067" srsDimension="2">
So, it turns out your data is in EPSG:3067, a standard for representing Finnish coordinates.
You need to tell geopandas about this, and convert into WGS84 (the most common coordinate system) to make it compatible with folium.
df.geometry = df['geom'].apply(loads)
df = df.set_crs('EPSG:3067')
df = df.to_crs('WGS84')
The function set_crs() changes the coordinate system that GeoPandas expects the data to be in, but does not change any of the coordinates. The function to_crs() takes the points in the dataset and re-projects them into a new coordinate system. The effect of these two calls is to convert from EPSG:3067 to WGS84.
By adding these two lines, I get the following result:
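The range check described at the start of this answer can be sketched in plain Python. The helper name looks_like_degrees is hypothetical, and the sample coordinates are the first centroid printed above plus the map centre from the question. A True result only means the values are plausible degrees, not that the declared CRS is correct.

```python
def looks_like_degrees(x, y):
    """Return True if (x, y) could be (longitude, latitude) in degrees."""
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0


# First centroid printed above: far outside the degree range
projected_ok = looks_like_degrees(381147.564, 6673464.230)

# Helsinki map centre from the question, as (lon, lat)
degrees_ok = looks_like_degrees(24.9427473, 60.1674881)

print(projected_ok, degrees_ok)
```

Running a check like this on df.geometry.centroid before calling set_crs() is one way to catch a mislabelled CRS early.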
In Pandas, I am trying to generate a ridgeline plot in which the density values are shown (either as the Y axis or as a color ramp). I am using joypy's joyplot, but any alternative is fine.
So, first I created the Ridge plot to show the different distribution plot for each condition (you can reproduce it using this code):
import numpy as np
import pandas as pd
import joypy
import matplotlib
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'Category1':np.random.choice(['C1','C2','C3'],1000),'Category2':np.random.choice(['B1','B2','B3','B4','B5'],1000),
'year':np.arange(start=1900, stop=2900, step=1),
'Data':np.random.uniform(0,1,1000),"Period":np.random.choice(['AA','CC','BB','DD'],1000)})
data_pivot=df1.pivot_table('Data', ['Category1', 'Category2','year'], 'Period')
fig, axes = joypy.joyplot(data_pivot, column=['AA', 'BB', 'CC', 'DD'], by="Category1", ylim='own', figsize=(14,10), legend=True, alpha=0.4)
So it generates the figure, but without my desired Y axis. Based on this post, I could add a color ramp, but it neither makes sense nor shows the differences between the distribution plots of the different categories on each line :) ...
ar=df1['Data'].plot.kde().get_lines()[0].get_ydata() ## a workaround to get the probability values to set the colorramp max and min
norm = plt.Normalize(ar.min(), ar.max())
original_cmap = plt.cm.viridis
cmap = matplotlib.colors.ListedColormap(original_cmap(norm(ar)))
sm = matplotlib.cm.ScalarMappable(cmap=original_cmap, norm=norm)
sm.set_array([])
# plotting ....
fig, axes = joypy.joyplot(data_pivot,colormap = cmap , column=['AA', 'BB', 'CC', 'DD'], by="Category1", ylim='own', figsize=(14,10), legend=True, alpha=0.4)
fig.colorbar(sm, ax=axes, label="density")
But what I want is something like either of these figures (preferably with a color ramp):
I am working on the Spotify dataset from Kaggle. I plotted a barplot showing the top artists with most songs in the dataframe.
But the X-axis is showing numbers and I want to show names of the Artists.
names = list(df1['artist'][0:19])
plt.figure(figsize=(8,4))
plt.xlabel("Artists")
sns.barplot(x=np.arange(1,20),
y=df1['song_title'][0:19]);
I tried both a list and a Series, but both give an error.
How do I replace the numbers in the xticks with names?
Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data
Data from Spotify - All Time Top 2000s Mega Dataset
df = pd.read_csv('Spotify-2000.csv')
titles = pd.DataFrame(df.groupby(['Artist'])['Title'].count()).reset_index().sort_values(['Title'], ascending=False).reset_index(drop=True)
titles.rename(columns={'Title': 'Title Count'}, inplace=True)
# titles.head()
Artist              Title Count
Queen               37
The Beatles         36
Coldplay            27
U2                  26
The Rolling Stones  24
Plot
plt.figure(figsize=(8, 4))
chart = sns.barplot(x=titles.Artist[0:19], y=titles['Title Count'][0:19])
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
plt.show()
OK, so I didn't know this, although in hindsight it seems stupid not to have done so!
Pass the names (or string labels) as the argument for the X-axis.
Use plt.xticks(rotation=90) so the labels don't overlap.
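A minimal matplotlib-only sketch of the same idea, using the five artist counts listed in the answer above as data. It rotates the labels with ax.tick_params(labelrotation=90) on the Axes, which should give a result comparable to plt.xticks(rotation=90). The Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Counts taken from the table in the answer above
names = ["Queen", "The Beatles", "Coldplay", "U2", "The Rolling Stones"]
counts = [37, 36, 27, 26, 24]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(names, counts)                       # string labels go straight on the x-axis
ax.tick_params(axis="x", labelrotation=90)  # rotate so long names don't overlap
ax.set_xlabel("Artists")
ax.set_ylabel("Title Count")
fig.tight_layout()
fig.canvas.draw()                           # force the tick labels to be computed

labels = [t.get_text() for t in ax.get_xticklabels()]
print(labels)
```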
I am using Matplotlib to read data from an Excel file and create a scatter plot.
Here is a minimal sample table
In my scatter plot, I have two sets of XY values. In both sets, my X values are country population. I have Renewable Energy Consumed as my Y value in one set and Non-Renewable Energy Consumed in the other set.
For each Country, I would like to have a line from the renewable point to the non-renewable point.
My example code is as follows
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'example_graphs.xlsx'
datasheet = pd.read_excel(excel_file, sheet_name=0, index_col=0)
ax = datasheet.plot.scatter("Xcol","Y1col",c="b",label="set_one")
datasheet.plot.scatter("Xcol","Y2col",c="r",label="set_two", ax=ax)
plt.show()
And it produces the following plot
I would love to be able to draw a line between the two sets of points, preferably a line I can change the thickness and color of.
As commented, you could simply loop over the dataframe and plot a line for each row.
import pandas as pd
import matplotlib.pyplot as plt
datasheet = pd.DataFrame({"Xcol" : [1,2,3],
"Y1col" : [25,50,75],
"Y2col" : [75,50,25]})
ax = datasheet.plot.scatter("Xcol","Y1col",c="b",label="set_one")
datasheet.plot.scatter("Xcol","Y2col",c="r",label="set_two", ax=ax)
for n,row in datasheet.iterrows():
    ax.plot([row["Xcol"]]*2,row[["Y1col", "Y2col"]], color="limegreen", lw=3, zorder=0)
plt.show()
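If the loop becomes slow for many rows, one possible alternative is Axes.vlines, which draws all the vertical connectors in one call. This is a sketch on the same toy data as above and is not part of the original answer; the colors and linewidths arguments control the appearance of the lines.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

datasheet = pd.DataFrame({"Xcol": [1, 2, 3],
                          "Y1col": [25, 50, 75],
                          "Y2col": [75, 50, 25]})

ax = datasheet.plot.scatter("Xcol", "Y1col", c="b", label="set_one")
datasheet.plot.scatter("Xcol", "Y2col", c="r", label="set_two", ax=ax)

# One vertical segment per row, from the Y1col value to the Y2col value
connectors = ax.vlines(datasheet["Xcol"], datasheet["Y1col"], datasheet["Y2col"],
                       colors="limegreen", linewidths=3, zorder=0)
print(len(connectors.get_segments()))
```

This only works as written because both point sets share the same X value per row; connectors between points with different X values would still need ax.plot or a LineCollection.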
I'm doing a k-means clustering of activities on some open source projects on GitHub and am trying to plot the results together with the cluster centroids using Seaborn Scatterplot Matrix.
I can successfully plot the results of the clustering analysis (example tsv output below)
user_id issue_comments issues_created pull_request_review_comments pull_requests category
1 0.14936519790888722 2.0100502512562812 0.0 0.60790273556231 Group 0
1882 0.11202389843166542 0.5025125628140703 0.0 0.0 Group 1
2 2.315160567587752 20.603015075376884 0.13297872340425532 1.21580547112462 Group 2
1789 36.8185212845407 82.91457286432161 75.66489361702128 74.46808510638297 Group 3
The problem I'm having is that I'd like to be able to also plot the centroids of the clusters on the matrix plot. Currently my plotting script looks like this:
import seaborn as sns
import pandas as pd
from pylab import savefig
sns.set()
# Read the TSV and use the first column (user_id) as the index
# so that it is not plotted as a variable
data = pd.read_csv('summary_clusters.tsv', sep='\t', index_col=0)
grid = sns.pairplot(data, hue="category", diag_kind="kde")
savefig('normalised_clusters.png', dpi = 150)
This produces the expected output:
I'd like to be able to mark on each of these plots the centroids of the clusters. I can think of two ways to do this:
1. Create a new 'CENTROID' category and just plot this together with the other points.
2. Manually add extra points to the plots after calling sns.pairplot(data, hue="category", diag_kind="kde").
If (1) is the solution then I'd like to be able to customise the marker (perhaps a star?) to make it more prominent.
If (2) I'm all ears. I'm pretty new to Seaborn and Matplotlib so any assistance would be very welcome :-)
pairplot isn't going to be all that well suited to this sort of thing, but it's possible to make it work with a few tricks. Here's what I would do.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
sns.set_color_codes()
# Make some random iid data
cov = np.eye(3)
ds = np.vstack([np.random.multivariate_normal([0, 0, 0], cov, 50),
np.random.multivariate_normal([1, 1, 1], cov, 50)])
ds = pd.DataFrame(ds, columns=["x", "y", "z"])
# Fit the k means model and label the observations
km = KMeans(2).fit(ds)
ds["label"] = km.labels_.astype(str)
Now comes the non-obvious part: you need to create a dataframe with the centroid locations and then combine it with the dataframe of observations while identifying the centroids as appropriate using the label column:
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
centroids["label"] = ["0 centroid", "1 centroid"]
full_ds = pd.concat([ds, centroids], ignore_index=True)
Then you just need to use PairGrid, which is a bit more flexible than pairplot and will allow you to map other plot attributes by the hue variable along with the color (at the expense of not being able to draw histograms on the diagonals):
g = sns.PairGrid(full_ds, hue="label",
hue_order=["0", "1", "0 centroid", "1 centroid"],
palette=["b", "r", "b", "r"],
hue_kws={"s": [20, 20, 500, 500],
"marker": ["o", "o", "*", "*"]})
g.map(plt.scatter, linewidth=1, edgecolor="w")
g.add_legend()
An alternate solution would be to plot the observations as normal then change the data attributes on the PairGrid object and add a new layer. I'd call this a hack, but in some ways it's more straightforward.
# Plot the data
g = sns.pairplot(ds, hue="label", vars=["x", "y", "z"], palette=["b", "r"])
# Change the PairGrid dataset and add a new layer
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
g.data = centroids
g.hue_vals = [0, 1]
g.map_offdiag(plt.scatter, s=500, marker="*")
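The combine step from the first approach can be inspected without seaborn or scikit-learn. In this sketch the labels are hard-coded and the centroids are computed as per-label means, which stands in for km.cluster_centers_ (an assumption made purely for illustration); the toy values are invented.

```python
import pandas as pd

# Toy observations with pre-assigned cluster labels (hypothetical values)
ds = pd.DataFrame({"x": [0.0, 2.0, 10.0, 12.0],
                   "y": [1.0, 3.0, 11.0, 13.0],
                   "label": ["0", "0", "1", "1"]})

# Stand-in for km.cluster_centers_: the mean of each labelled group
centroids = ds.groupby("label", as_index=False)[["x", "y"]].mean()
centroids["label"] = centroids["label"] + " centroid"

# Observations and centroids share one frame; the label column tells them apart
full_ds = pd.concat([ds, centroids], ignore_index=True)
print(full_ds)
```

Because the centroid rows carry their own label values, a hue-based plot can give them a different marker and size from the observations.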
I know I'm a bit late to the party, but here is a generalized version of mwaskom's code that works with n clusters. It might save someone a few minutes.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
def cluster_scatter_matrix(data_norm, cluster_number):
    sns.set_color_codes()
    # Remember the feature columns before the label column is added,
    # so the centroid frame gets exactly one column per feature
    features = list(data_norm.columns)
    km = KMeans(cluster_number).fit(data_norm)
    data_norm["label"] = km.labels_.astype(str)
    centroids = pd.DataFrame(km.cluster_centers_, columns=features)
    centroids["label"] = [str(n) + " centroid" for n in range(cluster_number)]
    full_ds = pd.concat([data_norm, centroids], ignore_index=True)
    g = sns.PairGrid(full_ds, hue="label",
                     hue_order=[str(n) for n in range(cluster_number)]
                               + [str(n) + " centroid" for n in range(cluster_number)],
                     # palette=["b", "r", "b", "r"],
                     hue_kws={"s": [20] * cluster_number + [500] * cluster_number,
                              "marker": ["o"] * cluster_number + ["*"] * cluster_number})
    g.map(plt.scatter, linewidth=1, edgecolor="w")
    g.add_legend()
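To see what the generated hue_order and hue_kws lists look like for a given number of clusters, here is a plain-Python sketch. The helper name build_hue_settings is hypothetical and not part of seaborn; it only mirrors the list construction used in the function above.

```python
def build_hue_settings(cluster_number):
    """Build hue_order and hue_kws lists: observations first, centroids second."""
    obs = [str(n) for n in range(cluster_number)]
    cen = [str(n) + " centroid" for n in range(cluster_number)]
    hue_order = obs + cen
    hue_kws = {
        "s": [20] * cluster_number + [500] * cluster_number,        # marker sizes
        "marker": ["o"] * cluster_number + ["*"] * cluster_number,  # marker shapes
    }
    return hue_order, hue_kws


hue_order, hue_kws = build_hue_settings(3)
print(hue_order)
print(hue_kws["marker"])
```

Each list has 2 * cluster_number entries, one per hue level, and the entries must be in the same order as hue_order for the sizes and markers to line up with the right groups.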