Holoviews Polygons inputs - dataframe

I have been able to make a choropleth map in Bokeh using multiple lists (latitudes, longitudes, county names, value to display, color to display). I wanted to use Holoviews with Bokeh to get their color legend as I prefer it over Bokeh's disjoint grouping one.
In general, I have been unable to find good documentation on how to structure a dataframe so that Holoviews can pull data from it. I found mentions of it in the GeoViews documentation and tried to replicate the Choropleths example given there, but I cannot get it to work. How do dataframes need to be formatted for Holoviews?

If you want to render polygons from dataframes in HoloViews/GeoViews, you have two options:
1) Use geopandas dataframes, which work out of the box. Just pass your geopandas dataframe to the Polygons element and it will display itself.
2) Pass in a list of dataframes, one for each polygon. In the following example we create a list of dataframes by creating Box elements and calling dframe on them. This list of dataframes can then be passed to the Polygons element:
list_of_dfs = [hv.Box(0, 0, i/10.).dframe() for i in range(10, 1, -1)]
hv.Polygons(list_of_dfs)
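For data that does not come from HoloViews elements, the same list can be built by hand. The following is a minimal sketch, not the library's prescribed layout: the county names, coordinates, and the "name" and "value" columns are made up for illustration, and it assumes that Polygons accepts per-polygon constant columns as value dimensions.

```python
import pandas as pd

# Hypothetical sample data: two square "counties", each with one value to colour by.
counties = {
    "A": {"x": [0, 1, 1, 0], "y": [0, 0, 1, 1], "value": 3.5},
    "B": {"x": [1, 2, 2, 1], "y": [0, 0, 1, 1], "value": 7.0},
}

# One dataframe per polygon: the vertex coordinates go in x/y, and the
# per-polygon attributes are repeated on every row of that polygon.
list_of_dfs = [
    pd.DataFrame({"x": c["x"], "y": c["y"], "name": name, "value": c["value"]})
    for name, c in counties.items()
]


def make_polygons(frames):
    """Wrap the frames in a Polygons element (assumed kdims/vdims layout)."""
    import holoviews as hv  # imported lazily so the data prep runs without it

    return hv.Polygons(frames, kdims=["x", "y"], vdims=["name", "value"])
```

Calling make_polygons(list_of_dfs) in a notebook with the bokeh extension loaded should then display the two squares.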

Related

Cannot plot a histogram from a Pandas dataframe

I've used pandas.read_csv to generate a 1000-row dataframe with 32 columns. I'm looking to plot a histogram or bar chart (depending on data type) of each column. For columns of type 'int64', I've tried doing matplotlib.pyplot.hist(df['column']) and df.hist(column='column'), as well as calling matplotlib.pyplot.hist on df['column'].values and df['column'].to_numpy(). Weirdly, they all take a really long time (>30s), and when I've allowed them to complete, I get unit-height bars in multiple colors, as if there's some sort of implicit grouping and they're all being separated into different groups. Any ideas about what I can do to get a normal histogram? Unfortunately I closed the charts so I can't show you an example right now.
Edit - this seems to be a much bigger problem with Int columns, and casting them to float fixes the problem.
Follow these two steps:
import the pyplot module from the Matplotlib library
call its hist function, which accepts a dataframe column as its argument
import matplotlib.pyplot as plt
plt.hist(df['column'], color='blue', edgecolor='black', bins=int(45/1))
Here's the source.
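As a sketch of the workaround mentioned in the question's edit (casting the integer column to float before plotting), the following uses randomly generated stand-in data, since the original CSV is not available. The column name "column" and the bin count are taken from the snippets above; everything else is assumed.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the CSV data: 1000 integer values in one column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"column": rng.integers(0, 100, size=1000)})

# Cast to float and hand matplotlib a plain 1-D array, so the whole
# column is treated as a single dataset.
values = df["column"].astype(float).to_numpy()

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=45, color="blue", edgecolor="black")
```

If the bars still come out with unit height, checking values.ndim and values.dtype is a reasonable first step, because hist treats each column of a 2-D input as a separate dataset.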

How to make a Scatter Plot for a Dataset with 4 Attributes and 5th attribute being the Cluster

I have a dataset which looks like this,
It has four attributes and the fifth column (which I added by myself) is the cluster of each row to which the row belongs.
I want to build something like a Scatter Plot for this dataset, but I am unable to do so. I have tried searching it up and the best I could find was this following question on Stackoverflow,
How to make a 4d plot with matplotlib using arbitrary data
Using this, I was able to make a Scatter Plot, but it only works for three attributes, with the fourth attribute being the cluster of each row.
Can anyone help me figure out how would it be possible to do the same to make a Scatter Plot for a dataset similar to mine?
I would recommend something like seaborn's pairplot:
import seaborn as sns
sns.pairplot(df, hue="cluster")
See the images in the link for what it looks like.
This creates several pairwise scatterplots instead of trying to make a 3D plot and arbitrarily flatten one of the dimensions.
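If seaborn is not available, pandas ships a similar grid of pairwise scatterplots. The sketch below swaps in pandas.plotting.scatter_matrix for pairplot; the attribute names a1 to a4 and the random data are placeholders, because the question's dataset is not shown.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import numpy as np
import pandas as pd

# Placeholder data: four numeric attributes plus an integer cluster label.
rng = np.random.default_rng(0)
attrs = ["a1", "a2", "a3", "a4"]
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=attrs)
df["cluster"] = rng.integers(0, 3, size=60)

# One scatterplot per attribute pair, coloured by cluster; the diagonal
# shows a histogram of each attribute.
axes = pd.plotting.scatter_matrix(
    df[attrs], c=df["cluster"], figsize=(8, 8), diagonal="hist"
)
```

Unlike pairplot, this does not add a legend for the cluster colours, so that would have to be built by hand.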

Pandas dataframe rendered with bokeh shows no marks

I am attempting to create a simple hbar() chart on two columns [project, bug_count]. Sample dataframe follows:
df = pd.DataFrame({'project': ['project1', 'project2', 'project3', 'project4'],
'bug_count': [43683, 31647, 27494, 24845]})
When attempting to render any chart: scatter, circle, vbar etc... I get a blank chart.
This very simple code snippet shows an empty viz. This example shows a f.circle() just for demonstration, I'm actually trying to implement a f.hbar().
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
f = figure()
f.circle(df['project'], df['bug_count'],size = 10)
show(f)
The values of df['project'] are strings, i.e. categorical values, not numbers. Categorical ranges must be provided explicitly, since you are the only person who knows what order the arbitrary factors should appear in on the axis. Something like
p = figure(x_range=sorted(set(df['project'])))
There is an entire chapter in the User's Guide devoted to Handling Categorical Data, with many complete examples (including many bar charts) that you can refer to.
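For the hbar() chart the question was actually after, the categories go on the y axis, so the factors are passed as y_range instead. This is a sketch using the sample dataframe from the question; the ordering of the factors and the bar height of 0.8 are arbitrary choices.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

df = pd.DataFrame({'project': ['project1', 'project2', 'project3', 'project4'],
                   'bug_count': [43683, 31647, 27494, 24845]})
source = ColumnDataSource(df)

# Factors are drawn bottom-to-top, so reverse the list to put the first
# project at the top of the chart.
factors = list(df['project'])[::-1]

p = figure(y_range=factors, title="Bugs per project")
# y picks the category for each bar, right is the bar length.
p.hbar(y='project', right='bug_count', height=0.8, source=source)
```

Passing p to show() from bokeh.io, as in the question, should then render the bars.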

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want to have each of your cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (plus you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you can with pandas directly, by using the positions= parameter to boxplot (see doc). The catch is that you have to generate the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
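To see what that positions list contains, here is the same comprehension evaluated on the sample dataframe from the question, next to the (gene_id, cohort) group keys in sorted order. It assumes boxplot draws the groups in that sorted order, so that all cohorts of one gene share a position.

```python
import pandas as pd

df = pd.DataFrame({"gene_length": [100, 100, 100, 200, 200, 200, 300, 300, 300],
                   "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
                   "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ", "USA", "EUR", "FIJ"]})

n_genes = df["gene_id"].nunique()
n_cohorts = df["cohort"].nunique()

# One x position per gene, repeated once for every cohort.
pos = [i for i in range(n_genes) for _ in range(n_cohorts)]

# Group keys in sorted order, gene_id first and cohort second.
group_keys = df.groupby(["gene_id", "cohort"]).size().index.tolist()

for key, position in zip(group_keys, pos):
    print(key, "->", position)
```

Each gene_id ends up with three boxes drawn at the same x position, which is why the tick labels and legend need manual fixing afterwards.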
An alternative would be to use seaborn.swarmplot instead of boxplot. A swarmplot plots every point rather than the summary representation of a boxplot, and you can pass dodge=False (this parameter was called split in seaborn releases before 0.8) to get the points colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', dodge=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.

seaborn how do i create a box plot of only particular attributes in a dataframe

I would like to create two boxplots to visualize different attributes within my data, splitting the attributes up based on their scale. I currently have this:
# box plots to show the distributions of attributes
sns.boxplot(data=df)
(image: box plot with all attributes included)
I would like it to look like the image below, with the attributes in different box plots based on their scale, but with the attribute labels below each boxplot (not the current integers).
# box plots to show the distributions of attributes
sns.boxplot(data=[df['mi'],df['steps'],df['Standing time'],df['lying time']])
(image: box plot by scale 1)
You can subset a pandas DataFrame by indexing with a list of column names. The result is still a DataFrame, so the column names are kept and used as the labels:
sns.boxplot(data=df[['mi', 'steps', 'Standing time', 'lying time']])
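The subsetting step can be checked on its own before plotting. In this sketch the four column names come from the question, while the numeric values and the extra "calories" column are invented to stand in for the attributes on a different scale.

```python
import pandas as pd

# Invented values; "calories" is a hypothetical column on another scale.
df = pd.DataFrame({
    "mi": [0.5, 1.2, 0.8],
    "steps": [4000, 7500, 6100],
    "Standing time": [3.1, 2.4, 4.0],
    "lying time": [8.2, 7.9, 9.1],
    "calories": [2100, 2500, 2300],
})

first_group = ["mi", "steps", "Standing time", "lying time"]

# Indexing with a list of names returns a DataFrame that keeps those names.
subset = df[first_group]

# Everything that is not in the first group goes into the second plot.
rest = df.drop(columns=first_group)

print(list(subset.columns))
print(list(rest.columns))
```

Each of subset and rest can then be passed as data= to its own sns.boxplot call.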