matplotlib: histogram is not displaying correctly - dataframe

I have extracted certain data from a CSV file that contains the information I need to analyze and put it into a DataFrame. I then group the rows by the type of region they are in ("Reg").
datafileR = datafile = pd.read_csv("pixel_data.csv")
datafileR = pd.DataFrame(datafileR)
### Counting the number of rows for each "Reg" value:
datafileR["Reg"].value_counts()
This is the result I received:
Make a group called region based on the Reg column from dataframe: datafileR:
region = datafileR.groupby(["Reg"])
Now plot them in histogram:
sns.set_theme()
plt.hist(datafileR["Reg"].value_counts(), bins=[70,100,130,160,190],color=["grey"],
histtype='bar', align='mid', orientation='vertical', rwidth=0.85)
This is the image I received, but there should be five categories (Middle East and North Africa, Africa (excl MENA), Asia and Pacific, Europe and Eurasia, and Cross-regional) on the x-axis. I am not sure what went wrong. Also, how can I change the values on the y-axis so that it displays the actual numbers?

You are trying to draw a bar plot, not a histogram. Please refer to https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html?highlight=bar#matplotlib.pyplot.bar
datafileR = pd.DataFrame({'reg': np.random.choice(['Asia','Africa','Europe'], size=1000)})
df = datafileR['reg'].value_counts()
plt.bar(x=df.index, height=df.values)
You can also use pandas' plotting functions:
df.plot.bar()
plt.tight_layout()
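To see why plt.hist produced the wrong picture, here is a minimal sketch with made-up region labels and counts (the three labels and their sizes are assumptions, not the asker's data). It uses np.histogram, which performs the same kind of binning that plt.hist applies, to show that a histogram counts how many regions fall into each range of counts, while a bar chart keeps one bar per region whose height is the actual number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "Reg" column; labels and sizes are made up.
reg = pd.Series(["Asia"] * 120 + ["Africa"] * 95 + ["Europe"] * 80, name="Reg")
counts = reg.value_counts()  # one number per region, largest first

# Histogram binning: how many regions have a count inside each bin.
# The region names are lost at this step.
per_bin, edges = np.histogram(counts.to_numpy(), bins=[70, 100, 130, 160, 190])

# Bar chart inputs: region names on x, actual row counts as heights.
bar_x = list(counts.index)
bar_height = counts.to_numpy()

print(per_bin)
print(bar_x, bar_height)
```

Passing bar_x and bar_height to plt.bar gives one labelled bar per region, with the real counts on the y-axis.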

Related

Drawing map with geopandas library for a continent, but the data points contains whole world

I have a data set of meteorite finds with latitude and longitude information: almost 30,000 data points from all around the world. I would like to plot the map of only one continent, for example "South America", using the geopandas library.
I am using geopandas' default 'naturalearth_lowres' map. From that world map, I filtered South America. My data, called mod_data_geo, contains geometry-type data, Point(longitude, latitude).
The data set looks like this:
My code:
mod_data_geo = gpd.GeoDataFrame(mod_data, geometry = gpd.points_from_xy(mod_data['long'], mod_data['lat']))
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
countries = world[world['continent'] == "South America"]
axis = countries.plot(color = 'Lightblue', edgecolor = 'black', figsize=(15,15))
mod_data_geo.plot(ax=axis, markersize = 1, color = 'purple' )
Map that I plotted:
How can I filter the meteorite data in the mod_data_geo dataframe, with geopandas or any other tool, so that I see only the meteorites found on the South American continent?
Thank you in advance!
Three ways - crop the image, filter the points with a bounding box, or filter the points by checking whether they're inside the country shapes.
Crop the image
If you'd like the points to extend to the edge of the image and only want to limit the image extent, you can set the x and y limits on the matplotlib axis:
axis = countries.plot(color = 'Lightblue', edgecolor = 'black', figsize=(15,15))
# remember the limits of the continent-only plot
orig_xlim, orig_ylim = axis.get_xlim(), axis.get_ylim()
mod_data_geo.plot(ax=axis, markersize = 1, color = 'purple' )
# restore them after the points have stretched the view
axis.set_xlim(orig_xlim)
axis.set_ylim(orig_ylim)
Filter to a bounding box
The first approach is nice in that it retains all the data that can fit within your plot. But it's not super efficient, as matplotlib has to filter the data for you based on whether it will appear in the image. A faster approach could filter the data first; note that the data will no longer go to the edge of the image.
First find a bounding box around the countries:
In [6]: bounds = countries.bounds.agg({'minx': 'min', 'miny': 'min', 'maxx': 'max', 'maxy': 'max'})
In [7]: bounds
Out[7]:
minx -81.410943
miny -55.611830
maxx -34.729993
maxy 12.437303
dtype: float64
Then you can filter the data based on these bounds:
In [8]: mod_data_filtered = mod_data_geo[(
   ...:     (mod_data_geo.lat >= bounds.miny)
   ...:     & (mod_data_geo.lat <= bounds.maxy)
   ...:     & (mod_data_geo.long >= bounds.minx)
   ...:     & (mod_data_geo.long <= bounds.maxx)
   ...: )]
Now you can plot with mod_data_filtered.
Note that you could set the extent of the plot to the bounding box, though this will get a bit tight.
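The same bounding-box filter can also be written with Series.between, which is inclusive on both ends by default. This is a small sketch, assuming a plain DataFrame with lat and long columns; the four points are made up and the bounds are rounded from the output above.

```python
import pandas as pd

# Hypothetical meteorite coordinates (not the asker's data).
mod_data = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "lat": [-15.0, 48.9, -33.9, 5.0],
    "long": [-60.0, 2.3, 18.4, -75.0],
})

# Bounding box of South America, rounded from the bounds shown earlier.
bounds = pd.Series({"minx": -81.41, "miny": -55.61, "maxx": -34.73, "maxy": 12.44})

# A point is kept only if both its latitude and its longitude are in range.
inside = (
    mod_data["lat"].between(bounds["miny"], bounds["maxy"])
    & mod_data["long"].between(bounds["minx"], bounds["maxx"])
)
mod_data_filtered = mod_data[inside]

print(mod_data_filtered)
```

Point c shows the limit of a bounding box on one axis only: its latitude fits, but its longitude does not, so it is dropped.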
Filter with country shapes
If you'd like to keep only the points that lie within one of the countries, rather than just cropping the data to the bounding box, you could use geopandas.GeoSeries.contains.
First, dissolve the data to get a single shape for South America:
In [8]: south_america = countries.dissolve()
In [9]: south_america
Out[9]:
geometry pop_est continent name iso_a3 gdp_md_est
0 MULTIPOLYGON (((-57.75000 -51.55000, -58.05000... 44293293 South America Argentina ARG 879400.0
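Then, filter the points to those within the shape. Here within is the inverse of contains, and testing every point against the single dissolved geometry avoids the index alignment that happens when two GeoDataFrames are compared directly:
In [10]: mod_data_filtered = mod_data_geo[mod_data_geo.within(south_america.geometry.iloc[0])]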

Plotly base values are in percentage

I have a table in which one of my base values is given as a percentage:
ID TYPE PERCENTAGE
1 gold 15%
2 silver 71.4%
3 platinum 20%
4 copper 88.88%
But plotly doesn't like that.
Do you know how I could tell it "hey, these data are percentages, please show me a percentage graph"?
I think plotly is the required answer, so I created it in Plotly. I have converted the percentages in the existing data frame to decimal format. Finally, I set the Y axis display to '%'.
import plotly.express as px
df['PERCENTAGE'] = df['PERCENTAGE'].apply(lambda x:float(str(x).strip('%')) / 100)
fig = px.bar(df, x='TYPE', y='PERCENTAGE')
fig.update_layout(yaxis_tickformat='%')
fig.show()
Does this work for you:
df.PERCENTAGE = df.PERCENTAGE.str.replace('%', '') #remove % sign
df.PERCENTAGE = pd.to_numeric(df.PERCENTAGE) #convert to numeric
plt.bar(df.TYPE, df.PERCENTAGE) #plot
plt.ylabel('Percentage')
plt.show()
Output:
Note you can always check the type of your data with df.dtypes
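Both answers rely on the same conversion step, so here is a self-contained sketch of it using the table from the question, rebuilt by hand. The column names PERCENT_VALUE, FRACTION and LABEL are made up for illustration.

```python
import pandas as pd

# The table from the question; the percentages arrive as strings.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "TYPE": ["gold", "silver", "platinum", "copper"],
    "PERCENTAGE": ["15%", "71.4%", "20%", "88.88%"],
})

# Strip the trailing "%" and parse what is left as a number (15.0, 71.4, ...).
percent = pd.to_numeric(df["PERCENTAGE"].str.rstrip("%"))

# A percent tick format multiplies by 100, so it needs fractions (0.15);
# a plain numeric axis labelled "Percentage" can use the numbers as they are.
df["PERCENT_VALUE"] = percent
df["FRACTION"] = percent / 100

# Text labels for the bars, e.g. "71.4%".
df["LABEL"] = [f"{value:g}%" for value in percent]

print(df)
```

Either numeric column can then be passed as the y value of a bar chart, depending on whether the axis is formatted as a percentage or as a plain number.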

How to plot lineplot with several lines as countries from dataset

I encountered a problem while plotting from a GDP dataset:
When I try to plot, I cannot figure out how to include more than one year:
plt.figure(figsize=(14,6))
gdp = sns.lineplot(x=df_gdp['Country Name'], y=df_gdp['1995'], marker='o', color='mediumvioletred', sort=False)
for item in gdp.get_xticklabels():
    item.set_rotation(45)
plt.xticks(ha='right',fontweight='light',fontsize='large')
output:
How can I plot all years on X, the amount on Y, and one line per country?
How can I modify the Y ticks to show whole numbers, not only 1-2-3-4-5-6 and 1e11?
You need to transform your dataframe to "long form" format, then pass the relevant column names to lineplot:
df2 = df_gdp.melt(id_vars=['Country Name'], var_name='year', value_name='GDP')
sns.lineplot(x='year', y='GDP', hue='Country Name', data=df2)
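To make the reshaping step concrete, here is a minimal sketch of melt on a tiny wide table. The two countries and their GDP figures are made up, and the real dataset may have extra columns that would need to go into id_vars or be dropped first.

```python
import pandas as pd

# Hypothetical wide-format data: one row per country, one column per year.
df_gdp = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1995": [1.0e11, 2.0e11],
    "1996": [1.1e11, 2.2e11],
})

# Long format: one row per (country, year) pair.
df_long = df_gdp.melt(id_vars=["Country Name"], var_name="year", value_name="GDP")

# The year labels come from column names, so they are strings; convert them
# to integers so that the x-axis is numeric.
df_long["year"] = df_long["year"].astype(int)

print(df_long)
```

Each country now contributes one row per year, which is the shape lineplot needs to draw one line per value of the hue column.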

How to visualize 'suicides_no' w.r.t 'gdp_per_capita ($)' for a given country over the years, in the following data frame

The DataFrame can be viewed here: Global Suicide Dataset
I have made a pivot table with country and year as indices using the following code:
df1 = pd.pivot_table(df, index = ['country', 'year'],
values=['suicides_no','gdp_per_capita ($)', 'population', 'suicides/100k pop'],
aggfunc = {"suicides_no" : np.sum
,"gdp_per_capita ($)" : np.mean
,"population" : np.mean
,"suicides/100k pop" : np.mean})
Output:
Now, for my project, I want to visualize how suicides_no varies with gdp_per_capita for a country over the years, but I am unable to plot it. Can somebody please help me out?
First, let's convert the indexes to columns using df1.reset_index(inplace=True).
Now, you can draw this in a scatter plot where the main features are - Year (preferably on x-axis) and suicides_no (on y-axis). The gdp_per_capita will go as size of the dots.
In this case you have two options:
1. Draw a separate plot for each country (GDP will be shown as hue):
sns.catplot(x='year', y='suicides_no', row='country', hue='gdp_per_capita ($)', data=df1)
2. Draw everything in a single plot: a scatter plot with GDP as dot size and country as color (hue):
sns.scatterplot(x='year', y='suicides_no', hue='country', size='gdp_per_capita ($)', data=df1)
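The reset_index step matters because the plotting calls look up country and year as ordinary columns. A minimal sketch with made-up rows (the values are assumptions, and only two of the four value columns are kept for brevity):

```python
import pandas as pd

# Hypothetical slice of the dataset: several rows per country and year.
df = pd.DataFrame({
    "country": ["X", "X", "X", "Y"],
    "year": [2000, 2000, 2001, 2000],
    "suicides_no": [10, 5, 8, 3],
    "gdp_per_capita ($)": [1000, 1000, 1100, 2000],
})

# Same idea as the pivot table in the question: sum the counts, average the GDP.
df1 = pd.pivot_table(
    df,
    index=["country", "year"],
    values=["suicides_no", "gdp_per_capita ($)"],
    aggfunc={"suicides_no": "sum", "gdp_per_capita ($)": "mean"},
)

# country and year live in the index here; move them back into columns.
flat = df1.reset_index()

print(flat)
```

After this, flat has one row per country and year, with country and year available by name for the x, hue, or row arguments.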

How to plot a stacked bar using the groupby data from the dataframe in python?

I am reading a huge CSV file using the pandas module.
filename = pd.read_csv(filepath)
Then I converted it to a DataFrame:
df = pd.DataFrame(filename, index=None)
From the CSV file, I am concerned with three columns named country, year, and value.
I have grouped by the country names, summed the values, and plotted the result as a bar graph with the following code:
df.groupby('country').value.sum().plot(kind='bar')
where the x axis is country and the y axis is value.
Now I want to turn this bar graph into a stacked bar, using the third column, year, with differently colored bars representing each year. I am looking for an easy way to do this.
Note that, year column contains years from 2000 to 2019.
Thanks.
From what I understand, you should try something like this (the column is called year in the question, so the name must match exactly):
df.groupby(['country', 'year']).value.sum().unstack().plot(kind='bar', stacked=True)
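To show what the unstack step hands to the plotting call, here is a small sketch with made-up rows, assuming the columns are named country, year and value as described in the question.

```python
import pandas as pd

# Hypothetical data with repeated (country, year) pairs.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [2000, 2000, 2001, 2000, 2001],
    "value": [1, 2, 3, 4, 5],
})

# Sum per (country, year), then pivot the years into columns.
# fill_value=0 covers countries that have no rows for some year.
table = df.groupby(["country", "year"])["value"].sum().unstack(fill_value=0)

print(table)
```

Each row of table becomes one bar and each year column becomes one colored segment when it is plotted with table.plot(kind='bar', stacked=True).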